Learn how you can add fire drills to your software development lifecycle to increase your production resilience and better understand how your systems handle failure.

What is a fire drill?

There are many names for similar processes: “fire drills”, “game days”, “chaos testing”. In this post I’ll mainly be describing what has worked well for my teams, but this is a process with a lot of room for interpretation that can easily be adjusted to fit your team’s needs. It’s the principles that matter.

At a high level, a fire drill is a process for executing a set of failure scenarios against your application/system and confirming that what you expect to happen actually happens. It’s deceptive in its simplicity, but an incredibly powerful tool for building resilient systems.

Why? What problem do fire drills solve?

Fire drills force you to question your assumptions. What happens if your database connection is unavailable? What error codes will your application return? Do your PagerDuty alarms fire? Have you documented how to handle that alarm? Does a failure in component A affect component B?

How to prepare for a fire drill

Identifying dependencies

Walk through your application/system and identify any dependencies you have. Do you make calls to another API (even one you own)? Do you connect to a database? Do you consume or publish to any queues? This is a good time to document all of these things with some simple architectural diagrams if you haven’t already.

Alert/Alarm inventory

List out every alert/alarm that you have for this application/system (you do have alarms, right?). Gather links to all of your metrics/dashboards. Find (or create) a runbook to track all of your alerts and what should be done if they fire.

Prepare your scenarios

Now that you have listed out all of your dependencies and alerts, you can start thinking about the scenarios that you want to test during the fire drill. A good starting point is to have a scenario for each dependency failing, and one that exercises each alarm.

Some example scenarios:

- Your database is unreachable
- A downstream API you depend on starts returning errors
- A queue you consume from stops delivering messages
- A dependency is up, but responding with very high latency

Flesh each of these out as much as you can. You’re going to have to make a trade-off of completeness vs time investment. Ideally you would cover multiple cases for each dependency, since your system might handle a total outage more gracefully than degraded performance or higher latencies. Make a judgement call here on how deep you want to go.
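One lightweight way to generate a starting scenario list is to cross every dependency with a few failure modes and fill in the details from there. A small sketch (the dependency and failure-mode names here are hypothetical placeholders, not from any real system):

```python
from itertools import product

# Hypothetical inventories — replace with your own dependencies and failure modes.
DEPENDENCIES = ["orders-api", "postgres", "billing-queue"]
FAILURE_MODES = ["total outage", "high latency", "degraded throughput"]

def build_scenarios(deps, modes):
    """Cross every dependency with every failure mode to get a
    starting scenario list; prune or deepen from here."""
    return [
        {"scenario": f"{dep}: {mode}", "how": "TBD", "expectation": "TBD"}
        for dep, mode in product(deps, modes)
    ]

scenarios = build_scenarios(DEPENDENCIES, FAILURE_MODES)
for i, s in enumerate(scenarios, start=1):
    print(f"{i}. {s['scenario']}")
```

Most of the generated combinations won’t survive review, and that’s fine; it’s much easier to delete a scenario than to notice one you never wrote down.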

Prepare execution steps

For each scenario, you need to think through and document how you will inject that failure into your system (and un-inject it!). Write down the specific steps so that anyone on your team would be able to recreate that failure.

This can be hard; you might have to get creative to cover all of your scenarios. For some of the simpler ones, you can tweak your application config to simulate the failure (swap out endpoints, use the wrong credentials, wrong ports, etc.). Depending on your tech stack, there may be tools you can use to inject some of the trickier failures like network congestion or random packet loss (e.g. tylertreat/comcast). We’ll walk through some specific example scenarios a little further down.
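If an off-the-shelf tool doesn’t fit, one DIY option is a local “black hole” listener: it accepts TCP connections but never responds, which simulates a dependency that hangs rather than failing fast. A minimal sketch (the helper name and port handling are my own, not from any particular library):

```python
import socket
import threading

def start_blackhole(port=0):
    """Accept TCP connections but never read or reply — simulates a
    dependency that hangs instead of failing fast. Returns the server
    socket and the port it bound to (port=0 picks a free one)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(5)
    held = []  # keep accepted connections alive (and silent)

    def loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # server socket was closed; drill is over
            held.append(conn)

    threading.Thread(target=loop, daemon=True).start()
    return srv, srv.getsockname()[1]
```

Point your application’s endpoint config at 127.0.0.1:&lt;port&gt; and watch whether your client timeouts, retries, and alarms behave the way you expect when the dependency simply never answers.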

An aside: Dev/Staging vs Production

Ideally, you would run your fire drill scenarios in a production environment. This gives you the most accurate results and the best chance of finding issues that may only pop up in production. For example, staging environments often have fewer resources allocated to them, less traffic flowing through them, and less alerting or SLO tracking. You should evaluate your own situation; maybe your team has perfect parity between production and pre-production environments (kudos if so). But, more likely than not, there are differences that could bury issues that even a fire drill will not expose.

One common thing I’ve seen is that alerts in pre-production environments do not get routed to a real PagerDuty (or similar) instance, making it hard to verify that pages actually fire correctly.

If your application/system can be deployed in multiple regions or data centers, that can be a good way to run your scenarios against production without it affecting your customers.

Start with the assumption that you will run the scenarios in production, and if that isn’t possible, have a strong argument ready for why not.

Document assumptions & expectations

For each scenario you prepared above, document what you expect to happen when you execute it.

Be as specific as possible. If you don’t know, dig into the code to make an educated guess. It’s okay to leave this blank where you’re genuinely unsure, but not knowing how you expect your system to respond to failure can itself be a red flag.

Execute the fire drill

Now for the fun part 🎉.

Executing the fire drill is best done as a team (or at least with one partner), since there can be a lot to manage and monitor. Schedule some time with your team and block off at least an hour (I’ve had larger systems with a dozen or more scenarios take upwards of half a day). Grab some donuts/burritos and get crackin'.

For each scenario, in order:

1. Note the start time.
2. Inject the failure.
3. Observe your metrics, logs, and alerts.
4. Record what actually happened next to what you expected.
5. Revert the failure and confirm the system recovers.

You can split this work up; have one person documenting, one person injecting the failures, another making requests, etc. Whatever works best for your team.

Examine results

Good thing you took such detailed notes while executing each scenario. Go through them and compare what actually happened against your documented expectations, flagging every mismatch.

Identify remediations

Now that you’ve examined the results of the fire drill, it’s time to come up with some action items, or remediations.

Your stated expectations almost certainly didn’t line up with reality for every scenario; I bet there were some surprises.

As a team, think about what you could do to fix your system to respond better to these failures. Maybe you just need to tune some alert timings/thresholds. Maybe you need to introduce or tune timeout settings. Maybe you need to take a big step back and think about larger architectural changes to get more resiliency.

Prioritize and commit

Now this exercise wouldn’t be very useful if we didn’t do anything with the results.

Stack rank your remediations, and commit to tackling some of the easier ones ASAP. Bigger items you may need to schedule into your road map later, but at least now you know the cases where your system doesn’t respond as expected.

If you haven’t gone to production yet¹, you should be able to clearly identify which remediations should be fixed before letting customers in.

When to (re)run a fire drill

Fire drills shouldn’t be a one-time event. Yes, they are super important before releasing a new system, but they should also be run periodically to identify any drift in expectations. This matters most for projects under active development, since you are more likely to make changes that affect the scenarios.

Come up with a cadence that works best for your team; maybe you fire drill your top 3 most important systems every 6 months.

A good rule of thumb is that you should execute a fire drill every time you add a new dependency to the system, since that is a very clear addition of logic where you need to handle failures. Note that you may not need to run the entire fire drill again, maybe just one or two new scenarios.

An Example

Let’s walk through an example fire drill for an imagined system, Message Saver 9000.

```mermaid
graph LR
  A{{AWS SQS}} -.message.-> B[Message Saver 9000]
  B --> C[(AWS DynamoDB)]
```

This system consumes messages from an AWS SQS queue and writes to a DynamoDB table. To keep things simple for this example, there is no way to read the messages back out.
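To make the drill concrete, here is a hypothetical sketch of the service’s core loop (this is my own illustration, not the real Message Saver 9000 code). The clients are passed in rather than constructed inside, so failures can be simulated with stubs; a real implementation would pass boto3 SQS and DynamoDB clients:

```python
def process_one_batch(sqs, dynamo, queue_url, table):
    """Read a batch from SQS, write each message to DynamoDB, then
    delete it from the queue. Clients are injected so failures can
    be simulated with stubs during a drill or a test."""
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        dynamo.put_item(TableName=table, Item={"body": {"S": msg["Body"]}})
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

With the clients injected, “DynamoDB is down” is as simple as handing the loop a stub whose put_item always raises, and watching what the loop does with the in-flight message.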

Here is our simple fire drill scenario template:

| # | Scenario | How to Simulate | Expectation | Actual |
|---|----------|-----------------|-------------|--------|
| TBD | TBD | TBD | TBD | TBD |

Setup

Okay, let’s think about the failure cases for Message Saver 9000.

Let’s write these up, assuming we have some basic monitoring and alarms in place.

| # | Scenario | How to Simulate | Expectation | Actual |
|---|----------|-----------------|-------------|--------|
| 1 | SQS is down | Remove the ReceiveMessage permission from the service’s IAM role | Can’t read messages. SQSFailure alarm should fire. NoMessagesIngested alarm should fire after 10 minutes. | TBD |
| 2 | DynamoDB is down | Change the DYNAMODB_ENDPOINT env var to a bogus value | Can’t save items. DynamoFailure alarm should fire. SQS lag should start to build up. SQSFallingBehind alarm should fire after 5 minutes. | TBD |
| 3 | Service is overloaded | Publish many messages to the SQS queue | SQS lag will be immediate. SQSFallingBehind alarm should fire immediately. DynamoDBThrottled alarm might fire. | TBD |

Execute

Okay, let’s say we executed these scenarios and documented the following findings.

| # | Scenario | How to Simulate | Expectation | Actual Result |
|---|----------|-----------------|-------------|---------------|
| 1 | SQS is down | Remove the ReceiveMessage permission from the service’s IAM role | Can’t read messages. SQSFailure alarm should fire. NoMessagesIngested alarm should fire after 10 minutes. | SQSFailure alarm fired after 30s. NoMessagesIngested fired after 10m. All alarms resolved within 5m after restoring the permission. |
| 2 | DynamoDB is down | Change the DYNAMODB_ENDPOINT env var to a bogus value | Can’t save items. DynamoFailure alarm should fire. SQS lag should start to build up. SQSFallingBehind alarm should fire after 5 minutes. | DynamoFailure did not fire. SQSFallingBehind fired after 10m. Metrics did not show any DynamoDB failures. Consumer caught up within 2m of restoring connection. |
| 3 | Service is overloaded | Publish many messages to the SQS queue | SQS lag will be immediate. SQSFallingBehind alarm should fire immediately. DynamoDBThrottled alarm might fire. | SQSFallingBehind took 10m to fire. DynamoDBThrottled fired within 3m. Metrics did not show any Dynamo throttling. Not all messages made it to the DB! |

Remediate

Oops, looks like 2 out of our 3 scenarios did not go as planned!

Let’s come up with some remediations for the issues we saw.

Now we document the remediations, get them into our issue tracker, and assign a priority.

| Scenario # | Issue | Priority | Ticket # |
|------------|-------|----------|----------|
| 2 | DynamoFailure alarm didn’t fire properly, and we didn’t see any DynamoDB failures in our metrics dashboard. | Medium | MSAVER-1234 |
| 3 | SQSFallingBehind alarm took too long to fire when we overloaded the queue; we expected it to fire more immediately. | Low | MSAVER-1235 |
| 3 | When DynamoDB was throttling us, not all messages made it to the DB — we lost data! | High | MSAVER-1236 |

Looks like we found a pretty serious data durability issue we need to remediate before we release!
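For MSAVER-1236, the data-loss fix usually boils down to ordering: only delete the SQS message after the DynamoDB write has succeeded, retrying throttled writes with backoff. A hedged sketch (the function name, retry counts, and broad exception handling are illustrative; real code would catch the specific throttling exception from botocore):

```python
import time

def save_then_ack(sqs, dynamo, queue_url, table, msg, attempts=5):
    """Write a message to DynamoDB with retries, and only delete it
    from SQS once the write has succeeded. If every attempt fails,
    the message stays on the queue and will be redelivered."""
    for attempt in range(attempts):
        try:
            dynamo.put_item(TableName=table, Item={"body": {"S": msg["Body"]}})
            break
        except Exception:  # e.g. a provisioned-throughput-exceeded error
            if attempt == attempts - 1:
                return False  # leave the message on SQS for redelivery
            time.sleep(2 ** attempt * 0.1)  # exponential backoff
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return True
```

Because the message is only acknowledged after a durable write, a throttled burst now causes redelivery and lag (which SQSFallingBehind should catch) instead of silent data loss.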


  1. Right before you deploy a new system is a great time to fire drill! ↩︎