Let me tell you a story.
Alice is an experienced software engineer who designed, developed, and deployed a payment service. The service must be up and running at all times; otherwise, her company's customers can't do the one thing they came to do.
One night, she got a call informing her of an ongoing production incident: the database central processing unit (CPU) is at 100% utilization.
What should she do?
She knows the service, where to look for monitoring data, and what logs are available. Within seconds she can tell whether this is a blip or a critical incident requiring a scale-out or failover of the service.
But what do we do when other members of the team are involved, like Bob, a junior engineer who joined a month ago and provides support while Alice is on vacation, or Carol, a reliability engineer providing 24x7 support?
Here are some insights we gathered from such a scenario:
Insight #1: The people operating the service are not the ones who built it; plan accordingly.
In theory, every potential failure mode needs an automated response. In practice, creating (and testing) automated responses is very expensive, yet even the most unlikely scenarios need a response plan. You should therefore have runbooks: compilations of standardized written procedures for completing repetitive information technology (IT) processes and operations within a company.
So, let's say Bob has access to the runbooks. But there are many of them, each with different information, and none states the expected outcome of running it. This leaves Bob in a pickle: he doesn't want to do anything that could harm the system. This leads us to our second insight.
Insight #2: Runbooks should answer not just the “how”, but also the “why” and “when”.
In this business, there is an old saying: “if you think you have backups, but you’ve never tried to restore from one, you don’t have backups.” The same goes for runbooks – if you haven’t tried to run them, they are just, well, books. From this we learn:
Insight #3: Induce failures (in production) to test runbooks and provide the team with practice.
The story continues with Bob’s monitoring dashboard showing the database CPU at 100% utilization. But Bob isn't sure if this is a problem.
The reason he is not sure is that no one has called him screaming. So maybe there is no problem? Obviously, 100% CPU is not ideal, but it could be worse, right? If only there were a way to help Bob understand what is happening without relying on people calling to tell him so. This brings us to the next insight:
Insight #4: Failures are rarely binary; measure actual customer impact, as operational metrics are not enough.
In the end, someone did call Bob: Dave, who called last week too. Dave is an engineer on the analytics team, and he has a weird use case that sometimes causes failures. Alice was working with Dave to figure it out just before she left on vacation.
Bob doesn’t know what’s going on yet, but he has a bad feeling about it, and the CPU utilization continues to hold steady at 100%. Should he wake up Carol? Or maybe even call Alice? She did mention taking the laptop with her on vacation.
Clearly, Bob needs help. Who should he call, and when? This leads us to our next insight.
Insight #5: Have clear escalation guidelines for incidents and assign roles to manage them efficiently.
Bob gets Alice, Carol, and Dave on a call together, and they figure it out. Dave had recently made a change to his code. He started with:
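The article doesn't show Dave's actual snippet, so here is a hypothetical Python sketch of the starting point: a single, unguarded attempt, where any transient failure kills the whole job. (`query_analytics` and `execute_query` are stand-in names, not from the original.)

```python
# Hypothetical reconstruction of Dave's original code.
# `execute_query` stands in for whatever call was failing intermittently.
def query_analytics(execute_query):
    # One attempt only: a transient error bubbles up and fails the job,
    # often at the most unfortunate time.
    return execute_query("SELECT * FROM events")
```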
And as he grew tired of it failing at the most unfortunate times, he changed it to:
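Again, the article omits the snippet; a plausible sketch of the "fix" is a tight retry loop, assuming the same stand-in names as above. It is easy to write, and it is exactly the pattern the story warns about: every failure is instantly followed by another request, with no delay and no limit.

```python
# Hypothetical reconstruction of the "fix": retry immediately, forever.
def query_analytics(execute_query):
    while True:
        try:
            return execute_query("SELECT * FROM events")
        except Exception:
            continue  # no delay, no retry limit, no backoff
```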
There are two things to note from Dave’s fix. First, it is really easy to implement such a fix, which means it is a very common way of “fixing” intermittent errors. Second, it is terribly wrong, in a non-obvious way.
The reason it is not obvious is that one must consider the larger context in which one’s code runs, and how it impacts other services it interacts with. What if failing once makes it more likely to fail again? What if the code keeps failing indefinitely? If before you had an intermittent error, now you have an intermittent error and a self-inflicted denial of service attack. This brings us to the next insight.
Insight #6: Implement retry with an exponential backoff algorithm across your clients and services.
So Dave rolls back his “fix,” and once it’s deployed, Bob finally sees the CPU utilization drop to single digits. Success at last.
When Alice is back from vacation, she looks at the root cause analysis (RCA) document Bob has been working on and realizes a critical part is missing. Sure, Dave’s “fix” created a stampede effect, but she’s been around long enough to know that it’s the service’s own responsibility to protect itself against any potential abuse or misuse.
Also, Dave wasn’t the only one affected. Erin and Frank had their workloads fail too; it just took them a couple more hours to figure out why.
The ability to identify, single out, and stop a customer who is wreaking havoc on the system, without affecting others, is an important best practice for designing multi-tenant services, and an insight in and of itself.
Insight #7: Expect the unexpected from your customers and create clear boundaries for what your system can handle, ensuring one bad actor can’t impact others.
RCA document in hand, Alice joins the Weekly Operational Review, a company-wide meeting where engineering leaders review incidents from across the organization to learn important lessons so they are not repeated.
Such meetings are crucial to building an Engineering Culture, where tough questions are encouraged, and blameless post-mortem analysis can take place; where individuals feel safe to share what happened because mistakes are seen as an opportunity to make the system more resilient. This leads us to our final insight.
Insight #8: Incidents will happen; build a culture where they are seen as an opportunity to learn and improve.
Written by Gleb Keselman, Director of R&D at Intuit