Releasing reliable software requires deployments that strike a delicate balance between speed and reliability.
At Layer Systems, we prioritise releasing new features, fast feedback loops and responsiveness to customer issues when they do occur. We have a team of software engineers who are all striving to be as productive as they can be, so sticking to these values while growing as a company has meant iterating on our process along the way.
Early on, we invested heavily in the tooling required to support our growing team of software developers. Although much of that tooling is still in use today, we’ve had to make multiple changes to our systems and team structure so that it scales with our business.
How deployments work today
Every pull request at Layer Systems requires a code review, where work is reviewed by peers.
Once code is reviewed, it can be merged into our development or master branch. Although we run a multi-node environment, code merged to our master branch is, in most cases, deployed only overnight to avoid interruptions during the business day. Builds from master are deployed during the business day only when a defect is detected or an issue is directly affecting customers. We merge to master multiple times a day. For each build, one of our senior developers is designated as the person in charge of pushing the code through to production.
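The deploy-window rule above can be sketched as a small policy check. This is a minimal illustration, not our actual tooling: the function name `may_deploy_now` and the 09:00 to 18:00 business hours are assumptions, and weekends are ignored for brevity.

```python
from datetime import time

# Assumed business hours; the real window is not stated in this post.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def may_deploy_now(now: time,
                   is_defect_fix: bool = False,
                   customer_impacting: bool = False) -> bool:
    """Return True if a master build may go to production at `now`."""
    in_business_hours = BUSINESS_START <= now < BUSINESS_END
    if not in_business_hours:
        # Overnight window: routine deploys are allowed.
        return True
    # During the business day, only urgent fixes go out.
    return is_defect_fix or customer_impacting
```

For example, `may_deploy_now(time(11, 0))` is refused, while the same call with `is_defect_fix=True` is allowed.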
On each successful merge, the build is automatically deployed to a “Next” endpoint, meaning our QA and test teams have sight of new features and bug fixes prior to deployment to production. The production roll-out is a multi-step process that ensures builds are rolled out slowly so that we can detect errors before they affect everyone. These builds can be rolled back if there is a spike in errors and easily hot-fixed if we detect a problem after release.
When it comes to ongoing projects, we generally keep separate branches for these, which means we can deploy any feature branch into development and make it available for testing at the earliest opportunity.
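What counts as a “spike in errors” depends on the monitoring in place. One simple way to express it, as a sketch only, is to compare the current error rate against a baseline. The function name `is_error_spike` and the default thresholds are assumptions chosen for illustration.

```python
def is_error_spike(baseline_rate: float,
                   current_rate: float,
                   factor: float = 2.0,
                   floor: float = 0.01) -> bool:
    """Return True when the current error rate looks like a spike.

    A spike must exceed both an absolute floor (to ignore noise at very
    low traffic or error volumes) and `factor` times the baseline rate.
    Rates are fractions of failed requests, e.g. 0.02 for 2%.
    """
    return current_rate > floor and current_rate > baseline_rate * factor
```

A rollback would be triggered when this returns True for the newly deployed build.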
Upgraded Release Process in 2019
1. Creation of release branches
Each feature release (e.g. a collection of new features to be released to production) should start with a release branch, a point in our Git history that allows us to tag the release and a place where we can cherry-pick in hot-fixes for issues discovered during a roll-out to production.
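As a rough sketch, the Git steps for cutting a release branch and later cherry-picking a hot-fix into it could be generated like this. The `release/<version>` and `v<version>` naming, the `origin` remote and both function names are assumptions for illustration; the functions only build the command lists and do not run them.

```python
def release_branch_commands(version: str, base: str = "master") -> list[list[str]]:
    """Commands to cut, tag and publish a release branch from `base`."""
    branch = f"release/{version}"  # assumed naming convention
    tag = f"v{version}"            # assumed tag format
    return [
        ["git", "checkout", "-b", branch, base],
        ["git", "tag", "-a", tag, "-m", f"Release {version}"],
        ["git", "push", "origin", branch, tag],
    ]


def hotfix_commands(version: str, commit_sha: str) -> list[list[str]]:
    """Commands to cherry-pick a fix into an existing release branch."""
    branch = f"release/{version}"
    return [
        ["git", "checkout", branch],
        # -x records the original commit hash in the new commit message.
        ["git", "cherry-pick", "-x", commit_sha],
        ["git", "push", "origin", branch],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` from a checkout of the repository.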
2. Deploy to Next
The next step is to stage the release branch to “Next” and pause production releases. “Next” is a production environment that mimics live but shouldn’t be accessed by external users. We perform additional manual testing in Next because it gives us a higher degree of confidence that the change will work correctly than testing in our beta environment alone. As we’re very active Layer users ourselves, this helps us catch issues internally and early. The roll-out to production starts when our automated deployments are enabled. Once we are confident that core functionality is unchanged, the build is deployed to Bamboo.
3. Node based roll-out to production
If the metrics in our APM (application performance monitoring) systems look healthy, we continue to roll out to our production servers, one at a time. By doing this, we slowly expose production traffic to the new build while giving ourselves time to investigate any spikes or anomalies.
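The node-by-node roll-out can be sketched as a loop that deploys to one server, checks its health, and stops as soon as something looks wrong. This is an illustrative outline under assumptions, not our deployment code: `deploy` and `healthy` are hypothetical callbacks standing in for the real deploy tooling and the APM check, and reverting the nodes already updated is one possible reaction to a failure.

```python
from typing import Callable, Sequence


def rolling_deploy(nodes: Sequence[str],
                   build: str,
                   previous_build: str,
                   deploy: Callable[[str, str], None],
                   healthy: Callable[[str], bool]) -> bool:
    """Deploy `build` one node at a time; revert and stop on failure.

    Returns True if every node was updated and passed its health check.
    """
    updated = []
    for node in nodes:
        deploy(node, build)
        updated.append(node)
        if not healthy(node):
            # Put every node touched so far back on the previous build,
            # most recently updated first.
            for done in reversed(updated):
                deploy(done, previous_build)
            return False
    return True
```

Nodes that were never reached keep serving the previous build, so only a fraction of production traffic ever sees a bad release.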
What if something goes wrong during a deploy?
Modifying code always presents risk, but our deployment systems always allow us to revert builds or quickly publish code fixes back out to production. In the event that something does go wrong, we aim to catch it as early as possible. We investigate the issue, identify the PR that is causing problems, revert it, cherry-pick that revert into the release branch, and make a new build. However, sometimes we don’t catch a problem before it reaches production. In this scenario, it’s critical to restore service, so we immediately roll back to a previous working build before starting our investigation.
The workflow described above may seem obvious, but our deployment systems went through many iterations over the years in order to get there, including support for a multi-node environment.
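The difference between the two cases is mainly one of ordering, which a short sketch makes explicit. The function name `incident_plan` and the step wording are illustrative, and the assumption that the same fix steps follow a production rollback is ours rather than something stated above.

```python
def incident_plan(reached_production: bool) -> list[str]:
    """Return the ordered response steps for a bad build."""
    fix_steps = [
        "investigate the issue",
        "identify the PR causing the problem",
        "revert the PR",
        "cherry-pick the revert into the release branch",
        "make a new build",
    ]
    if reached_production:
        # Restore service first, then investigate.
        return ["roll back to the previous working build"] + fix_steps
    return fix_steps
```

Calling `incident_plan(True)` puts the rollback ahead of any investigation, while `incident_plan(False)` starts with the investigation itself.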
Focusing on Reliability
There used to be just one tier before production: staging. Builds would be made, verified on staging, and then go straight to all production servers. This system was simple to understand and allowed any engineer to deploy their own code at any time.
As we grew our internal development team, we hit a point where deploying as fast as possible was hurting the stability of The Layer.
We had a very capable deploy system that was being heavily invested in, but the process around deployments had to evolve. The need for safer deploys led us to consider adding new steps to our deploy system, which resulted in the multi-node deploy system described earlier. Under the hood, we continue to use fast deploys and atomic deploys, but we changed the way we carry out deploys. This system has the ability to roll out changes in tiers, which, coupled with much better monitoring and tooling, grants us the ability to catch and mitigate bugs before they have the chance to affect all users.
But we’re not finished yet: we’re continually improving this system through better tooling and automation.