Google describes its 'blameless post-mortem culture'

Philip Beevers, site reliability manager at Google, tells Computing how the internet giant diagnoses faults

A 'blameless culture' is necessary to correctly identify and fix the root causes of faults, according to Philip Beevers, site reliability manager at Google.

Speaking at the launch of the new UK region of Google's cloud platform, where Telegraph CTO Toby Wright described the benefits of the improved speed it would bring, Beevers explained that this philosophy is at the core of his site reliability engineering (SRE) team.

"We have a blameless post-mortem culture," said Beevers. "So after any kind of failure or outage we try to understand the root cause of the problem to stop it happening again. The idea is it's blameless, I can't stress that enough. We genuinely believe that it's processes not people that fail, and we assume all engineers act with the best of intentions.

"It means there's no fear of career consequences with investigating issues," he continued. "We're not just looking for prevention, but also the ability to detect similar problems in future. So we design to mitigate outages more quickly, and we look for ways to mitigate the impact [of outages] more rapidly than in the incident we already had."

The goal of the SRE function is to make Google's services more reliable for its customers. But Google's approach is a little different to that of most firms, Beevers said.

"Our SRE function is what you get when you ask software engineers to design an operations function. This is very different to a traditional ops function. These engineers have the same skills as the teams that build our services, but with a different domain of application: reliability and scalability. So there's a parity of skills between product developers and the site reliability engineers, and that changes the relationship. It encourages people to transfer between the two groups, and means that there's a free exchange of ideas and principles," said Beevers.

Another difference, he added, is the way Google measures and calibrates its reliability. Instead of purely trying to keep faults to a minimum, Google works out how many issues it can have before its customers start to feel the pain.

"That gives us an error budget," said Beevers. "It's the amount of errors you can have before you inflict undue pain on your customers.

"So if you have some error budget left then it's okay to keep launching stuff, but if not, you have to stop to avoid causing pain. We use data to take the emotion out of the decision, so there's no longer a confrontation between people wanting to launch new products and people wanting to improve reliability," he argued.
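
Beevers did not go into how the budget is calculated, but the arithmetic behind the idea can be sketched in a few lines. The Python below is an illustration only, not Google's tooling: the 99.9 percent reliability target, the request counts and the function names `error_budget_remaining` and `can_launch` are all assumptions made for the example.

```python
def error_budget_remaining(target, total_requests, failed_requests):
    """Return the unspent share of the error budget (negative if overspent).

    target is a hypothetical reliability goal, e.g. 0.999 for 99.9 percent.
    """
    # The budget is the number of failures the target tolerates.
    allowed_failures = (1.0 - target) * total_requests
    if allowed_failures == 0:
        # A 100 percent target (or no traffic) leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def can_launch(target, total_requests, failed_requests):
    """Launches continue only while some error budget is left."""
    return error_budget_remaining(target, total_requests, failed_requests) > 0


# Assumed figures: a 99.9 percent target over one million requests
# tolerates roughly 1,000 failures.
print(can_launch(0.999, 1_000_000, 400))    # budget left, keep launching
print(can_launch(0.999, 1_000_000, 1_200))  # budget spent, stop launching
```

With these assumed numbers, 400 failures use up about 40 percent of the budget, so launches carry on, while 1,200 failures overspend it and launches pause. The decision comes from the data rather than from a debate between teams, which is the point Beevers makes.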
