Blog Post 18 JUN 2021

Continuous Integration Level Up, or How We Tamed Our CI Dragon

Lab Zero was hired to help improve the speed and cohesiveness of a major online retail company’s React-based web storefront. Our main goal was to reduce the size of the JavaScript and CSS bundle downloaded by each visitor. This level of optimization might seem excessive, but this client had reached a scale where even small performance gains can have a surprisingly large impact. The duration of these downloads was shown to cause revenue fluctuations on the order of hundreds of millions of dollars annually, because users, especially those on slower connections, would leave the site before it finished loading.

Jumping onto the old system

Our client’s website was managed by a series of “app teams”, each of whom controlled the development of a different page on the site; for example, there was the Homepage team, the Item Page team, the Search team, et cetera. All teams’ codebases imported a library whose job was to provide common components and styles to be shared across the site.

Initially, I adopted the role of maintainer of this shared component library. My job was to review app teams’ pull requests (PRs) to the library while also improving the quality of the library’s existing code. For example, I replaced the use of the Lodash utility library with a series of smaller, more opinionated utilities, which reduced bundle sizes by several dozen kilobytes on some pages. While this might sound small, one needs to multiply this amount by the millions of visits the site gets every day to realize the potential savings.
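
To give a sense of what “smaller, more opinionated utilities” can look like, here is a minimal sketch. The function names and signatures (pick, uniqueBy) are hypothetical and chosen for illustration; they are not the utilities actually written for the client, and they deliberately cover fewer cases than their Lodash counterparts.

```typescript
// Hypothetical stand-ins for two Lodash helpers. Each one handles only the
// narrow case it is needed for, which keeps the shipped code small.

/** Returns a shallow copy of `obj` containing only the listed keys. */
function pick<T extends object, K extends keyof T>(
  obj: T,
  keys: readonly K[],
): Pick<T, K> {
  const result = {} as Pick<T, K>;
  for (const key of keys) {
    if (key in obj) {
      result[key] = obj[key];
    }
  }
  return result;
}

/** Keeps the first item seen for each key, preserving the original order. */
function uniqueBy<T, K>(items: readonly T[], keyOf: (item: T) => K): T[] {
  const seen = new Set<K>();
  const result: T[] = [];
  for (const item of items) {
    const key = keyOf(item);
    if (!seen.has(key)) {
      seen.add(key);
      result.push(item);
    }
  }
  return result;
}
```

The trade-off is that helpers like these skip the edge cases a general-purpose library has to support (deep paths, array-like objects, and so on), so they only make sense when the calling code is known and reviewed.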

One pain point with maintaining this separate yet shared library was versioning. Major library version updates with breaking changes were few and far between, because the effort to get every app team to change their own code to adopt these updates was prohibitive from a resourcing and timing standpoint. I would usually take on the task of submitting PRs to these app teams’ codebases myself, just so I didn’t have to secure a dedicated developer from each team who would have to shift their priorities to make the changes.

Starting fresh

About a year into my engagement, there was a new effort in the company to release a revamped website. The Platform team responsible for this new project decided to use a monorepo structure, which meant that developers from all app teams would contribute to the same codebase instead of deploying separately.

The Platform team decided to use the Next.js framework, along with Nx for monorepo management. This resulted in a system that allowed teams to work on their respective libraries in a shared, modern, and performant React development environment.

My role on this new project remained essentially the same as the old one. The new Shared Components team was created to either develop new components, or accept submissions of existing components, which would be used across multiple app teams’ library code. The majority of my time was spent reviewing other teams’ PRs which involved changes to our shared code.

Managing submissions and deploying new changes became significantly easier because we no longer had to worry about deploying our codebase separately, or versioning it in such a way that it would remain backwards-compatible. Instead of shared library code taking weeks to update and disseminate across every library before reaching production, it could now take a matter of hours.

However, this advantage turned out to be more theoretical than practical at the beginning.

The big CI headache

The primary issue was with our Continuous Integration (CI) setup. Every library within the monorepo was expected to maintain 100% code coverage via unit or integration testing, complemented by additional end-to-end tests. While it was possible, primarily thanks to Nx, to quickly run only the tests related to code changes locally, our CI environment was unreliable for a variety of reasons: a lack of computing resources, an inability to run end-to-end tests for only the affected code, a lack of education around writing robust tests that make use of asynchronous paradigms, and frequent unexpected downtime of necessary resources.
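
The idea of running only the tests related to a change can be sketched as a walk over the project dependency graph: start from the projects that were touched and collect everything that depends on them, directly or indirectly. The following is a simplified illustration of that concept, not Nx’s actual implementation; the graph shape, the project names, and the affectedProjects function are all assumptions made for the example.

```typescript
// Maps each project to the projects it imports from (hypothetical shape).
type DependencyGraph = Record<string, string[]>;

/** Returns the changed projects plus every project that depends on them. */
function affectedProjects(graph: DependencyGraph, changed: string[]): string[] {
  // Invert the graph so we can look up "who depends on this project?".
  const dependents = new Map<string, string[]>();
  for (const project of Object.keys(graph)) {
    for (const dep of graph[project]) {
      const list = dependents.get(dep) ?? [];
      list.push(project);
      dependents.set(dep, list);
    }
  }

  // Breadth-first walk from the changed projects through their dependents.
  const affected = new Set<string>(changed);
  const queue = changed.slice();
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const dependent of dependents.get(current) ?? []) {
      if (!affected.has(dependent)) {
        affected.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return Array.from(affected).sort();
}

// Illustrative project names only.
const exampleGraph: DependencyGraph = {
  "shared-components": [],
  "homepage": ["shared-components"],
  "item-page": ["shared-components"],
  "search": ["shared-components", "item-page"],
};

// A change to "item-page" only requires testing "item-page" and "search".
console.log(affectedProjects(exampleGraph, ["item-page"]));
```

Note what happens when the changed project is the shared one: every project in the graph becomes affected, which is why changes to shared code were the most expensive to get through CI.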

This meant that developers across the project soon accepted the reality that their changes, however minor, could possibly take weeks to actually be merged into the main branch, due to the low likelihood that all tests would pass in CI.

This reality changed how developers worked across the project. First, developers would no longer be focused on one or two tickets at a time, but would often have several tickets in play, stuck on “in review” status. This had a cascading effect of more developers being blocked by others’ work, or even their own.

Second, developers had to spend a significant chunk of their day restarting the CI process on their open PRs. Initially, this meant having to start from scratch each time and pray that the ever-growing number of tests across the monorepo would somehow all pass.

Third, even when CI did miraculously pass for a PR (which, again, became exceedingly rare over time), it was common for a PR’s changes to become outdated due to merge conflicts, requiring the developer to start the CI process again from scratch.

Mitigating the worst offenders

Fortunately, the monorepo’s Platform team made many changes along the way that reduced the CI process’s pain points, including:

They introduced a process by which developers only had to re-run failing tests.
They allowed for optimistic merging, where a PR passing CI would not have to undergo a second CI run immediately prior to being merged. (This would sometimes result in failing tests making their way into main, so it was seen as a stopgap until overall CI performance improved.)
They worked with the company’s ops team to separate the monorepo’s CI and run it on an instance separate from other company projects.
They spun up Slack bots that would regularly notify the entire web team of specific tests that tended to fail more often than others.
They required individual app teams to prioritize fixing any of these more “flaky” tests to unblock work across the organization.
They eventually understood that test flakiness is often unavoidable, and that it can be helpful to run these tests multiple times until they pass. While this could theoretically result in false positives, in practice it simply streamlined the work that developers were already doing in re-running failing tests.
They investigated moving the monorepo’s entire CI job to a third-party service to reduce under-resourcing and downtime.
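
The “run it again until it passes” approach from the list above can be sketched as a small wrapper that retries a test a bounded number of times and records whether it only passed after a failure. This is a hedged illustration, not the Platform team’s actual tooling: runWithRetries, RetryOutcome, and the default of three attempts are assumptions made for the example.

```typescript
interface RetryOutcome {
  passed: boolean;
  attempts: number;
  // True when the test passed, but only after at least one failed attempt.
  flaky: boolean;
}

/** Runs an async test up to `maxAttempts` times, stopping at the first pass. */
async function runWithRetries(
  test: () => Promise<void>,
  maxAttempts = 3,
): Promise<RetryOutcome> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await test();
      return { passed: true, attempts: attempt, flaky: attempt > 1 };
    } catch {
      // Ignore this failure and fall through to the next attempt, if any.
    }
  }
  return { passed: false, attempts: maxAttempts, flaky: false };
}
```

Recording the flaky flag matters: a test that passes on its second or third attempt still unblocks the PR, but it can also be reported (for example, by a notification bot) so that the owning team knows to fix it rather than letting retries hide the problem.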

All of these changes eventually resulted in a process where a PR could again be submitted, approved, and merged within a few hours, instead of taking weeks to merge.

As for my own Shared Components team, whose PRs would regularly affect the scopes of multiple teams, this moved our wide-ranging changes from “impossible” to merely “difficult”, which was good enough for me.

The efforts of our client’s Platform team helped mitigate the significant issues that all developers would encounter in trying to get their changes merged. Despite the speed bumps, the drive to adopt a monorepo paradigm resulted in a much more streamlined, manageable, and faster development process, where developers became quicker to help others solve their problems, and the resulting website was leaner, faster, and more cohesive than its predecessor.

Lab Zero is a San Francisco-based product team helping startups and Fortune 100 companies build flexible, modern, and secure solutions.

Jeffrey Carl Faden

Software Engineer