Deploying Our Monolith Application
This is the first in a two-part series about how we deploy code at Hudl. Part 1 describes our deployment scheme for our primary application, affectionately named “The Monolith”. Part 2 talks about how we used the lessons learned here to improve deployments in our multiple-application architecture that we call “The Multiverse”.
As a company, we understand that one of our key competitive edges is moving quickly. We develop and ship new features continuously. Before we started moving toward the Multiverse, we were deploying our Monolith application ten times a day.
From practically the beginning, we knew it was important to be able to ship new code reliably. Deployment should be routine, not special. Having this mindset naturally forces the process to become more reliable and therefore less risky. Consider how safe commercial air travel is today as a result of becoming widespread and routine. From every mishap, lessons are learned and improvements are made. If deployments are rare or prone to errors in your organization, consider for a moment: How good would it feel to be able to just click a button and know with confidence that your application will be updated without any more action on your part? It’s possible and worth the effort to get there.
In addition to deploying reliably, we also knew that we wanted to do our deployments during the day. Our motivation was primarily to avoid losing a night of sleep for each deployment, but we also understood that during the day everyone else is awake and available to respond if there are problems.
The process we designed allows us to deploy to production reliably with no downtime so we can get our (much-needed?) beauty rest. We use a two-stage deployment process to accomplish that: we divide our servers in half and update one set at a time. The update to each set takes about 15 minutes, so the full deployment wraps up in about 30 minutes. And it all happens automatically at the click of a button.
We use this same deployment process to roll back bugs that make it to production. Because it is proven multiple times daily, we know we can trust it to deploy the earlier build reliably, too.
Our Keys for Deployment
Here are the major points from above in a nice, succinct list.
- Do it safely: Deployments are zero-downtime.
- Do it reliably: Failure is rare.
- Do it quickly: You hired engineers, not babysitters.
- Do it during the day: When problems arise, your responders are awake and ready to react.
We built our own deployment system to meet our requirements. At the heart of our system is Alyx, named for the character in the Half-Life 2 video game series. Alyx is a web application that our product team uses to deploy to our test, stage, and production environments.
In addition to being the user interface, Alyx is also the deployment coordinator: it talks to all of the services involved at the right times. Before a deployment can begin, Alyx has to know that a new build of the Monolith is available. Alyx monitors TeamCity, our build server, for these new builds, which GitHub triggers whenever the product team commits. When a build completes, Alyx updates its UI to display a “Deploy” button, and clicking that button starts the deployment process.
The first order of business is for Alyx to take half of the web servers offline. To do that, it contacts a component called Overwatch (yep, Half-Life 2 reference), which is in charge of controlling load on the web servers. Overwatch returns a list of all of the web servers, divided into two groups, set 1 and set 2. Alyx tells Overwatch to redirect all traffic to set 2, which will effectively take set 1 offline.
When it receives this instruction from Alyx, Overwatch updates our service registry with metadata to take the set 1 servers offline. Our routing/load balancing layer reads this information from the registry and shifts traffic accordingly. It takes about 90 seconds for the changes to fully propagate, at which point Overwatch returns control to Alyx.
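In rough terms, Overwatch's two jobs — splitting the fleet and flipping rotation metadata — might look like the sketch below. This is a hedged illustration: the dict-based registry and the `in_rotation` key are stand-ins for our real service registry, not its actual schema.

```python
# Sketch of Overwatch's role: divide the fleet into two sets and mark one
# set out of rotation in a service registry. The dict-based registry and
# the "in_rotation" key are illustrative stand-ins, not the real schema.

def split_servers(servers: list[str]) -> tuple[list[str], list[str]]:
    """Divide the fleet into set 1 and set 2."""
    half = len(servers) // 2
    return servers[:half], servers[half:]

def set_rotation(registry: dict, servers: list[str], in_rotation: bool) -> None:
    """Update registry metadata; the routing layer reads this to shift traffic."""
    for server in servers:
        registry[server] = {"in_rotation": in_rotation}
```

The important design point is that Overwatch never touches the load balancers directly — it only writes metadata, and the routing layer converges on it (which is why propagation takes that ~90 seconds).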
Next, Alyx connects to the servers in set 1 via a service called Outpost, which runs on all of our Monolith application servers. Alyx’s deployment request includes the download location of the new build. After downloading the build, Outpost shuts down the existing Monolith application, removes the old files, and extracts the new ones. It also warms up the newly deployed application so that our users don’t notice a performance hit when it starts receiving traffic again.
After every server in set 1 has finished updating, Alyx contacts Overwatch to swap the sets. This is the point where the new build goes live. While the deployment to the second set of servers proceeds, Alyx kicks off a suite of about 3,000 (and growing) automated regression tests that verify core behavior hasn’t been broken in the new build. Test failures are uncommon, so when one occurs, the person deploying immediately investigates the cause. In the rare case that we deploy a change that breaks functionality, we can easily choose the earlier build in Alyx and deploy it the exact same way.
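Putting the pieces together, the whole flow Alyx coordinates can be sketched as below. The `route_to` and `update` callbacks are hypothetical stand-ins for Overwatch and Outpost, respectively — this shows the ordering, not the real interfaces:

```python
# End-to-end sketch of the two-stage deployment described above.
# route_to(servers) stands in for telling Overwatch where to send traffic;
# update(server) stands in for an Outpost deployment on one server.

def two_stage_deploy(servers: list[str], route_to, update) -> None:
    half = len(servers) // 2
    set1, set2 = servers[:half], servers[half:]

    route_to(set2)            # take set 1 offline
    for server in set1:
        update(server)        # deploy the new build to set 1
    route_to(set1)            # swap sets: the new build is now live
    for server in set2:
        update(server)        # deploy to set 2 while regression tests run
    route_to(set1 + set2)     # everything back in rotation on the new build
```

Note that at every point in the sequence, one full set is serving traffic — that's where the zero-downtime property comes from.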
Room for Improvement
We’ve continued to improve the reliability of our Monolith deployment system, and it has served our product team well for years. It’s not a perfect system, though. Here are a few areas we know could be improved.
As our company has grown, the number of deployments has grown along with it. At peak, we were deploying our Monolith ten times a day. (The number of Monolith deploys is trending down now with the introduction of our Multiverse architecture.) At half an hour per production deployment, Alyx was busy for more than half of the day. It isn’t hard to see that there’s a limitation on the number of deployments we can have in a single day, and we were on pace to hit it soon.
Our deployment system requires that we have at least twice the number of servers necessary to handle the load because we take half of them offline to update them. Ultimately, we don’t mind this because we want to be able to lose an availability zone and still handle traffic like a champ. But it does impose a capacity requirement that not everyone may want.
Alyx fully automates the deployment process once the button is clicked. However, as the person performing the deployment, it’s up to you to ensure that everything looks as you expect it to. If the build you’re pushing causes errors or slows performance, you’ve got to react. That means your primary focus is the deployment, sucking half an hour out of your day. Moreover, if you make the call to roll back, it’s another thirty minutes. Meanwhile, the build you want to roll back is still out there, potentially causing problems. Reducing the deployment time would be a win for people deploying and a win for our users.
We’ve worked hard to make deployment failures very rare, but occasionally something happens which we did not anticipate or which is outside of our control. When it does, it’s always a time sink. As you could infer from above, Alyx is stateful in our design, having to track the servers it has and has not deployed to. If the deployment fails, we have to reset the state and begin again. The person deploying and anyone else waiting in line are set back that much time.
An improvement to this system might be to capture state and resume deploys, but the complications combined with the rarity of failures made us drop the idea. If we were to do it again, we would instead design Alyx to be stateless so that crashes and network failures would only delay, and not fail, a deployment.
Overall, we’re pleased with Alyx and the deployment system we’ve designed and built, but we know there are valuable improvements to be made. Naturally, we hope you found this post helpful or inspiring. Be sure to check out part 2 where we apply our lessons learned from this architecture to make a faster and more robust system!