Die another day: the post-disaster post-mortem and how to perform one

The post-mortem is vital for establishing a culture of transparency and continuous improvement

Your team has been fighting a major incident for hours, hitting one dead end after another. Finally, you manage to isolate the problem. When the systems return to normal, you let out a collective sigh of relief, shut down the response call, and go back to bed, never to think of it again.

Or so you hope.

There's actually one more thing your team needs to do before moving on: a post-mortem. Post-mortems help establish a culture of transparency and continuous improvement, which is what will set your company apart from others.

Without a post-mortem, your team misses an opportunity to learn what they're doing right, what needs to be improved and how to avoid making the same mistakes in the future. A well-designed, blameless post-mortem improves both your systems and your incident response process.

A post-mortem can be basic or complex, but a few core elements make it a valuable exercise that helps future planning.

Logistics

Experience suggests that a post-mortem should occur within three business days of a severity-one incident and seven business days of a severity-two incident. Typically, the incident commander will be present, along with relevant service owners, responders, and the engineering and product managers for impacted systems.
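As a rough illustration, the timing guideline above can be turned into a small date calculation. This is a minimal sketch rather than a prescribed tool: the names REVIEW_WINDOW_DAYS and postmortem_due are hypothetical, counting is assumed to start on the day after the incident is resolved, and public holidays are ignored.

```python
from datetime import date, timedelta

# Business-day windows from the guideline above: severity -> days allowed.
REVIEW_WINDOW_DAYS = {1: 3, 2: 7}


def postmortem_due(resolved_on: date, severity: int) -> date:
    """Return the last business day on which the post-mortem should be held.

    Assumptions (not from the article): counting starts the day after
    resolution, weekends are skipped, public holidays are not.
    """
    remaining = REVIEW_WINDOW_DAYS[severity]
    day = resolved_on
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return day


if __name__ == "__main__":
    # An incident resolved on a Friday.
    print(postmortem_due(date(2024, 3, 1), severity=1))
    print(postmortem_due(date(2024, 3, 1), severity=2))
```

A team with different severity levels or a local holiday calendar would need to adjust the mapping and the weekday check accordingly.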

Together, they'll capture what happened as items on a timeline - starting at a point prior to the incident and working forward. Quantify each item or attribute it to a third-party source to stay rooted in fact rather than opinion. Take a similar approach to capturing business and customer impacts.
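One way to keep a timeline rooted in fact is to make the source a required field of every entry. The sketch below assumes a simple in-memory structure; TimelineItem and build_timeline are hypothetical names, and the "source" is just free text such as a dashboard, log or ticket reference.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineItem:
    at: datetime   # when it happened
    what: str      # what happened, stated as fact
    source: str    # where the fact comes from (dashboard, log, ticket...)

    def __post_init__(self):
        # Reject entries that cannot be traced back to evidence.
        if not self.source.strip():
            raise ValueError("every timeline item needs a source")


def build_timeline(items):
    """Return the items ordered from earliest to latest."""
    return sorted(items, key=lambda item: item.at)
```

Entries added out of order during the discussion are sorted when the timeline is built, so the group can work forward from a point before the incident.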

Look for contributing factors

There is rarely a single root cause of a major incident in a complex system. More often, a combination of factors is in play. If you only find one, you probably haven't looked hard enough.

Monitoring systems are a rich source of information. Look for irregularities, like sudden spikes, before and at the time of the incident. Reproduce the incident in a non-production environment. If you modify or remove variables, does the incident still occur? Either way, what does that tell you?
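When combing through exported monitoring data, a simple statistical check can help flag candidate spikes for a human to review. The following is a minimal sketch under stated assumptions: find_spikes is a hypothetical helper, the samples are a plain list of evenly spaced readings, and the window size and threshold are arbitrary starting points rather than recommended values.

```python
from statistics import mean, stdev


def find_spikes(samples, baseline_size=5, threshold=3.0):
    """Return indices of samples that sit more than `threshold` standard
    deviations above the mean of the `baseline_size` samples before them."""
    spikes = []
    for i in range(baseline_size, len(samples)):
        window = samples[i - baseline_size:i]
        mu, sigma = mean(window), stdev(window)
        # A perfectly flat window has no spread to compare against; skip it.
        if sigma > 0 and (samples[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Illustrative request latencies in milliseconds (made-up numbers).
    latencies = [100, 102, 98, 101, 99, 100, 450, 103]
    print(find_spikes(latencies))
```

A check like this only points to where to look; whether a flagged reading was a contributing factor is still a question for the post-mortem discussion.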

Many contributing factors will be difficult to uncover, arising as they do from the complex relationships between people, processes and technologies. The following prompts - inspired by Gary Klein's debriefing questions in Sidney Dekker's The Field Guide to Understanding Human Error - should help:

Prior experience

How did past experience shape (or how could it have shaped) the way you expected the incident to pan out, and what was the impact of the decisions taken? Was this an anticipated class of problem? Had there been similar incidents in the past?

The strategic environment

How did the strategic environment in which you operate impact the incident or the response to it? For example, when faced with finite time and resources, did you de-prioritise work that could have prevented or mitigated the incident?

Assessment

How did you view the health of the service prior to the incident? Has the incident given you cause to revise that view? Will the problem get worse as use of the service scales?

Planning for action

How did you decide on the best course of action? What operational or organisational factors influenced that decision?

Seeking help

If you asked for help, what triggered the decision to do so and was it readily available to you?

Throughout the post-mortem, ask what went well and what could have been improved - and don't forget to analyse the response as well as the incident.

It's very important to avoid attributing blame. If an individual made a mistake, assume any team member could have done the same. You might want to anonymise the process for that reason. Doing so will encourage more people to speak up instead of staying silent because they fear being blamed.

Insights gleaned from post-mortems will help identify actions that reduce the likelihood of similar incidents in future and lessen their severity when they do happen.

Share the outcome of the post-mortem. It's not about airing dirty laundry, and nobody's suggesting you give away proprietary information. Instead, acknowledging what happened and sharing what is being done to address it is important for rebuilding customer trust. That way, everybody gets a good night's sleep.

Steve Barrett is head of EMEA at PagerDuty