Deploying at Eaze
Late January 2020 was a bummer. In the wake of some pretty dire cuts, ceejbot and I found ourselves the sole custodians of Eaze's infrastructure. I remember looking at the scope of the problem ahead of us and asking myself "what am I giving up doing now because I am taking on this work?"
Our first task: deploy a newly written identity verification service for trust & safety. One small step.
Flash forward to the end of 2020. We built a new infrastructure team. Deploys went from low-confidence, hourlong ordeals involving careful coordination of shared resources to blink-and-you'll-miss-it, safe operations against on-demand development clusters. As a result of our work, our cloud provider bill dropped by 66%. The work to get the system to move again was deliberate and intense, but like the compounded effects of an ion engine's paper-weight pressure against a starship, we've achieved some incredible velocity over time.
Oh, and we put Rust in production!
So, what did we do? What did we learn?
starting at zero
For a long time, bright lines of responsibility existed between teams, ones that could not be crossed: a backend engineer could not expect much access to the AWS console, say. Likewise, it was frowned upon for frontend engineers to cross into the backend. Predictably, this resulted in a system that mistrusted itself. The code developed a Velcro-like consistency at the seams where teams met, born out of defensiveness and mutual mistrust. These codebases represented silos of knowledge. When you can't cross team boundaries, you can't socialize knowledge about your corner of the system! Over time, maintainers of the system went on to new opportunities, unintentionally leaving huge knowledge gaps in their wake.
The technical debt was holding us back and the company realized this. Thus, over the summer and fall of 2019 I was tasked with working across teams to build a next-generation infrastructure for Eaze. However, the company had not yet come to grips with the organizational debt it had accrued, so this project was doomed: because of the mistrust between teams, it was built in a corner without interacting with the other systems. This insulated the decision-making of the next generation project at the expense of making sure those decisions were ultimately noncommittal; they could not affect the working system or the existing team structure maintaining it.
The next generation project was to be built using Kubernetes and Golang, the former as a means to placate management and the latter as a means of resolving a long-standing conflict between C# and Node developers in the company. Within a week of landing at the company in the summer of 2019 I bounced off of Go. This is not to say it is a bad language, but it and I do not share the same values, and I was hesitant to hand it to the team.
In particular, I was spoiled by my experience with the Rust compiler; I was disappointed to find that Go's compiler was not nearly as helpful or respectful of my time. Thus began my campaign to get Rust into production at Eaze. I nerdsniped ceej into writing small Rust CLI tools as glue for the nextgen project with me. She had prior experience with Rust's 2015 edition; ergonomic improvements in the 2018 edition made the sale easy. Ultimately, Rust was the only lasting contribution of this failed nextgen project.
The important point here is that we were working to tame a feral system, one whose maintainers had long since left, and about which documentation was sparse or non-existent. There was a small mountain of existing infrastructure managed by a variety of homegrown tools, all of the knowledge of which had just been lost from the company. On hand, we had pieces of a next-gen, Docker-and-Kubernetes approach, written to avoid ruffling feathers. We looked at what we had and said: "no."
shoe-string service mesh
To be frank, Kubernetes and Docker were overkill for our problems and team scale.
We needed something fast, correct, and ready yesterday. ceej and I had previous, shared context for such a system: we wanted something similar to the deploy system at NPM. This was ceej's second time implementing such a system, as she was the primary author of NPM's deploy system; I previously put together something similar for my personal infrastructure having found great joy using that system at NPM.
Our goals and values were:
- Developers should feel safe to deploy at any time of day, any day of the week, limited only by their sense of the impact of their changes & their commitment to work/life balance.
- We should start with a simple working system for a single service, then add features as needed to support subsequent services.
- What we build should be able to be operated by any engineer: we should not hide important infrastructure tooling from them. (We want people to walk away with experience with Terraform, not "eaze's bin/apply-changes-somehow.sh script.")
- When in doubt about a decision, consider that we want our work to generalize to multiple clusters.
- We intend to use this new infrastructure to tame our existing feral services, to make them malleable so we can take them back under control & begin making necessary changes to our system.
The gist of where we landed is: services run directly on EC2 instances. Services run 4 processes, load-balanced by a single colocated reverse proxy. When a deploy starts, we take each service process out of rotation, update the code & configuration, roll the process, wait for success from a ping endpoint, then add it back into rotation. Rinse and repeat until all service processes are back up. Report progress into a special slack channel.
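The per-process loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not the deploy tooling itself: every function name, the `max_pings` parameter, and the choice to halt at the first failed ping are assumptions made for the example. The side effects (proxy changes, restarts, Slack messages) are passed in as callables so the control flow can be read on its own.

```python
from typing import Callable, Iterable, List


def rolling_deploy(
    processes: Iterable[str],
    remove_from_rotation: Callable[[str], None],
    update: Callable[[str], None],
    restart: Callable[[str], None],
    ping_ok: Callable[[str], bool],
    add_to_rotation: Callable[[str], None],
    report: Callable[[str], None],
    max_pings: int = 30,
) -> List[str]:
    """Roll each process in turn; return the ones that came back healthy.

    Hypothetical policy: stop at the first process whose ping never
    succeeds, leaving the not-yet-touched processes serving traffic.
    """
    rolled: List[str] = []
    for proc in processes:
        remove_from_rotation(proc)
        update(proc)   # new code and configuration
        restart(proc)
        # Poll the ping endpoint a bounded number of times. A real
        # version would sleep between attempts.
        if not any(ping_ok(proc) for _ in range(max_pings)):
            report(f"{proc}: ping never succeeded, halting deploy")
            break
        add_to_rotation(proc)
        rolled.append(proc)
        report(f"{proc}: back in rotation")
    return rolled
```

Because only one process is out of rotation at a time, a bad build takes out at most a quarter of a four-process box's capacity before the loop stops.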
These EC2 instances are managed by an autoscaling group, and (at least initially) each service had an application load balancer proxying to the nginx on each box. (We later moved to single ALBs with multiple listener rules, thanks to maciej.)
Configuration and deploy event information comes from a watchable key-value store. Service configuration combines keys of the form <servicename>/ENV_VAR_NAME with a global/ENV_VAR_NAME namespace to produce a single environment file suitable for feeding to systemd. Since the key-value store is watchable, there's a special deploys/<servicename> key containing a JSON blob with the S3 URL of a tarball containing the service executable, & other deploy metadata. Whenever this changes, a deploy is triggered. Configuration changes, on the other hand, aren't watched & can be triggered
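As an illustration of the key layout, here is a small Python sketch that merges the two namespaces into environment-file text. The function name and the sample keys are hypothetical, and the precedence rule (a service-specific key overrides the global one of the same name) is an assumption; the text above does not say which side wins. Values are written unquoted, so anything needing escaping is out of scope for this sketch.

```python
from typing import Dict


def build_env_file(kv: Dict[str, str], service: str) -> str:
    """Merge global/* and <service>/* keys into KEY=value lines."""
    merged: Dict[str, str] = {}
    # Apply global first so that service-specific keys win
    # (assumed precedence, not confirmed by the source).
    for prefix in ("global/", f"{service}/"):
        for key, value in kv.items():
            if key.startswith(prefix):
                merged[key[len(prefix):]] = value
    # Sort for a stable file, so unchanged config produces no diff.
    return "".join(f"{name}={merged[name]}\n" for name in sorted(merged))


store = {
    "global/LOG_LEVEL": "info",
    "global/REGION": "us-west-2",
    "identity/LOG_LEVEL": "debug",
    "identity/PORT": "8080",
    "deploys/identity": "{}",  # deploy keys are ignored here
}
print(build_env_file(store, "identity"), end="")
```

The output is one `NAME=value` pair per line, the shape systemd's `EnvironmentFile=` directive reads.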
This service deploy key is managed by a lambda listening for S3 bucket event changes. Whenever an object lands in the bucket, the S3 metadata from that object is turned into JSON along with the object's location in the bucket. That JSON is written to the appropriate consul deploy key.
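The transformation the lambda performs might look something like the sketch below. The record shape (`s3.bucket.name`, `s3.object.key`) follows the S3 event notification format, where object keys arrive URL-encoded. Everything else is an assumption for illustration: the function name, the `url` field, and especially the convention that the first path segment of the object key names the service. Fetching the object's metadata and writing to the key-value store are left out.

```python
import json
from typing import Dict, Tuple
from urllib.parse import unquote_plus


def deploy_key_update(record: Dict, metadata: Dict[str, str]) -> Tuple[str, str]:
    """Turn one S3 event record plus object metadata into (key, JSON value)."""
    bucket = record["s3"]["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded.
    object_key = unquote_plus(record["s3"]["object"]["key"])
    # Assumption: tarballs are stored as "<servicename>/<file>".
    service = object_key.split("/", 1)[0]
    payload = dict(metadata)
    payload["url"] = f"s3://{bucket}/{object_key}"
    return f"deploys/{service}", json.dumps(payload, sort_keys=True)
```

Keeping this step a pure function of the event and metadata makes it easy to test without touching AWS or the key-value store.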
The S3 object is provided by a GitHub action, which builds & zips up the service on a push to a deploy/$cluster branch and drops the result into S3. (We eventually wrote a GitHub workflow to synchronize service GitHub deploy workflows, but that's a story for another time.)
We tracked Honeycomb trace IDs in our build metadata, so we could show the fruits of our effort in a concrete manner.
More important than the particulars of what we built are how we built it and what we learned along the way.
thesis, antithesis, and synthesis
With the exception of a few months in 2018 and 2019, Ceej and I have worked together for the last five years. We have a lot of shared context and values -- whether that's as a result of working together so long or why we've worked well together, it's unclear; probably a little of column A and a little of column B.
We approached rewriting Eaze's infrastructure stack with a lot of opinions about how we should work together. Most usefully: having gleefully worked each other into burnout in the past, we understood which behavioral patterns were most likely to burn us out.
As a result, the rule was that we only worked on the infrastructure together in pairing sessions, and never let our enthusiasm for the work spill out into evenings or weekends. Any time there was some bit of code we felt particularly passionate about writing, we'd confer and make sure we both got a chance to work on it during work hours. In order to get some quiet programming time apart, we would take the time to define some self-contained tasks: writing a CLI tool with well-known inputs and outputs to be plugged in to a bash script written separately, for example.
We pair programmed on all of this work over Zoom. The day to day work took on a morning/afternoon cadence -- a few hours before lunch, followed by a longer pairing session in the afternoon. At the end of the work day, one or both of us would summarize the work in a public Slack channel, both to communicate progress out and to help us get started the next day. We both monitored each other's stress levels and would regularly make sure to trade off driving duties at least every other day. We'd switch off more often for particularly tricky tasks. (In mid-spring I took to working outside to get some sunshine. Little things helped a lot with my morale, since this was around the outset of the pandemic.)
More important than the act of driving or navigating was the decision-making cadence. When we needed to make a tactical decision with strategic ramifications, we'd compare the outcomes against our values going in. Often multiple solutions would present themselves. We'd then converse about the solutions, often adopting a thesis/antithesis pattern: "If we do X, Y gets easier", "ah, but if Z is important, we should reconsider because X makes it more difficult." If we got too far into a rabbit hole, we'd back out and re-examine the larger problem.
We continue these practices even today!
This might not seem especially relevant in a technical context, but in my opinion, this is the meat and potatoes of programming in a team context. There always must be a balance of interests, a drive to share knowledge, and respect for boundaries. Successful engineering has a lot more to do with working well in a team than it does with being technically "correct" individually.
the most valuable single line fix
We visited each service in our system working in this cadence. By our third or fourth node service, the pattern for service resources was clear, and we started to modularize our Terraform code. We had a fairly workable pattern established: packer, powered by ansible, to provide base Amazon Machine Images for use in autoscaling groups provisioned by Terraform. Terraform additionally handled the brackish residue of per-service instance state via cloudinit templates -- configuration service location, service name, etc.
The system was coming together. We had one last hurdle: Eaze's largest codebase, a monolithic C# application written against Microsoft IIS on a fairly ancient version of .NET. All, of course, running on Windows. This still ran on a dangerous, error-prone deploy system written on top of Chef and AWS CodeDeploy.
There were the expected, mundane difficulties: the friction of having to switch operating systems, learn PowerShell, & figure out the difference between chocolatey and nuget. The sting of finding out that only a particular combination of python version, ansible version, and ansible plugin version worked to connect ansible runs. Trying in vain to remember that the directory separator is \\. But we prevailed, and -- we thought -- we had a workable solution we could deploy darkly, taking no production traffic.
We were wrong.
When we spun up our newly configured Windows boxes in production, the performance of the live application absolutely tanked. We quickly decommissioned our boxes and went to investigate.
It turned out that we had a frankly ridiculous number of Redis connections. Looking through the infrastructure repositories for context around this, we found that shortly after April 20th, 2017, an infra engineer had scaled the application to 36 processes per box in reaction to unexpected failures. Similarly, the Redis box had been scaled up massively. While our load was not insubstantial, ceej and I had some context here too -- we had both seen much smaller Redis instances meet the needs of much larger traffic loads at NPM. This smelled like a misconfiguration issue.
We looked over the application setup and eventually saw the culprit: an inversion-of-control container was responsible for registering a Redis provider. Whenever this Redis provider was instantiated, it would attempt to create a pool of 20 active connections to Redis. However, the provider was configured to be instantiated on every request to any endpoint, healthchecks included. Our production Windows box starved the live application of connections via the healthcheck endpoint.
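To make the failure mode concrete, here is a toy Python model of the two registration lifetimes. It is not the C# application's code: the container, the fake provider, and the idea that the fix amounted to a shared (singleton-style) registration are all assumptions for illustration. The only number taken from the text is the pool size of 20. Connections are never released in this toy, so it shows an upper bound on how fast the count can grow, not real-world behavior.

```python
from typing import Callable, Dict


class FakeRedisProvider:
    """Stand-in provider: constructing one 'opens' a pool of connections."""

    POOL_SIZE = 20
    open_connections = 0  # class-wide counter, for illustration only

    def __init__(self) -> None:
        FakeRedisProvider.open_connections += self.POOL_SIZE


class Container:
    """Tiny inversion-of-control container with two lifetimes."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}
        self._shared: Dict[str, bool] = {}
        self._instances: Dict[str, object] = {}

    def register(self, name: str, factory: Callable[[], object], singleton: bool) -> None:
        self._factories[name] = factory
        self._shared[name] = singleton

    def resolve(self, name: str) -> object:
        if not self._shared[name]:
            # Per-request lifetime: a brand new pool on every resolve.
            return self._factories[name]()
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]


def connections_after(requests: int, singleton: bool) -> int:
    """Count connections opened while serving `requests` requests."""
    FakeRedisProvider.open_connections = 0
    container = Container()
    container.register("redis", FakeRedisProvider, singleton)
    for _ in range(requests):
        container.resolve("redis")  # every endpoint resolves the provider
    return FakeRedisProvider.open_connections
```

In this model, `connections_after(100, singleton=False)` opens 2000 connections while `connections_after(100, singleton=True)` opens 20, which is why a steady drip of healthcheck requests alone could exhaust the connection budget.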
After consulting with our comrades, we changed the configuration and redeployed. Success!
We were immediately able to downscale the managed redis instance we were using. Having cleared that hurdle, we were able to proceed with replacing the old deploy system in its entirety, including the Windows boxes.
It's important to call out that this is, again, the bread and butter of successful software engineering in legacy systems: it is an anthropological study of the situation that caused the system to arise, informed by second-hand artifacts from the people involved in creating it. Lines of code are not a valuable metric here; the value was in understanding the circumstances in which the mistake was made, and determining that it was safe to fix the mistake. In this case, the root problem was that there was such a divide between the prior infrastructure team and backend teams that neither could find the problem by themselves.
growing the team and moving on to new problems
Throughout this process, we had also been pairing with Joe and Chris. They had expressed interest in getting into infrastructure early in the year, though they had historically only worked on Eaze's website prior to that. They're both deep experts in that, but once trusted to work on the infrastructure they did some amazing work in further reducing our AWS spend, moving services to the new deploy flow, and grappling with our CDN configuration.
This informed the strategy for growing functioning teams: start by pairing on a hard problem. Gradually, as the system begins to take working form, fold other engineers in. The goal is to end up with agreement on values & team direction. (This cuts both ways: not only are you onboarding the new teammate to your values, but you've also got to be willing to take on some of their values!) Gall's law applies to human systems.
Ceej and I stopped focusing on infrastructure around mid-to-late May. As our final task, we added affordances for provisioning and decommissioning clusters. We onboarded a familiar former coworker -- Maciej -- before moving on to our next task.
There are still features I want to see in our infrastructure stack (automatic nightly recreation of clusters!), but since Joe, Chris, and Maciej have taken over the work, it's only gotten better in their hands. The deploy system has been a huge success: it started the long thaw of the technical debt freezing our system in place. Others have done incredible work accelerating that thaw.
2020 has been a tough year, but this little corner of my life was hugely fulfilling. I'm proud of the team we built and the work we did.