Cake buttons, order monkeys and SRE at Trustpilot

The review website's SRE manager Morten Reinholdt Boelskifte on innovation at the company

No-one could accuse Trustpilot of lacking imagination in its technology choices. The Copenhagen-based consumer review website is almost Google-like in its willingness to experiment with different ways of working. Indeed, one Google innovation, the site reliability engineer (SRE) role, is central to its thinking.

At Computing's DevOps Live event last week, SRE manager Morten Reinholdt Boelskifte told delegates about some of the tools and methodologies that the firm deploys to stay on the cutting edge.

First, there's the hierarchy of platforms. When a new project is rolled out, the default is to make it SaaS (on AWS or GCP). If it's a relatively simple app then serverless is the first choice, followed by containers, and so on, with high-availability instances last in line.

"We ask ‘are you really sure you want to do this?’ Because we really don't want to maintain instances," said Boelskifte.

Indeed, Trustpilot actively seeks to minimise the amount of infrastructure it needs to look after.

"When we started this journey three years ago, we looked at the tasks we needed to do and decided everything would be IaC [infrastructure as code]," he explained.
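
The platform hierarchy Boelskifte described can be read as a simple preference order. The following is a minimal sketch of that idea, not Trustpilot's actual tooling: the names `PLATFORM_PREFERENCE` and `choose_platform` are hypothetical, and only the tiers named in the talk are listed, since the ones between containers and instances were not spelled out.

```python
from typing import Iterable, Optional

# Preference order as described in the talk, most preferred first.
# Assumption: tiers between containers and instances are omitted here
# because the talk did not enumerate them.
PLATFORM_PREFERENCE = ("saas", "serverless", "containers", "ha_instances")


def choose_platform(feasible: Iterable[str]) -> Optional[str]:
    """Return the most preferred platform that is feasible for a project."""
    feasible_set = set(feasible)
    for platform in PLATFORM_PREFERENCE:
        if platform in feasible_set:
            return platform
    # None of the known tiers fits; the decision needs a human.
    return None
```

A project that could run on either containers or instances would land on containers under this ordering, which matches the stated reluctance to maintain instances.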

Then there's the hierarchy of tools. Declarative models are chosen over procedural ones where possible, and open source is picked over proprietary. For IaC the company uses HashiCorp's Terraform.

"Why Terraform? It's open source, it has the declarative approach, it supports modularisation so you can break up services into small modules, and it supports custom providers."

Terraform allows Trustpilot to run on both AWS and GCP and also plugs into GitHub, GitLab, monitoring tools including PagerDuty, and various APIs, he said.

To keep the system in check, the various teams (data science, business experience, data platform, and so on) each own one or more ‘contexts’, with each service living in only one context.

"This is very beneficial for SRE because we can tell exactly who knows what. When an alert happens we can load up the context and service and send it straight to the right team."
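
The ownership model behind that routing can be sketched in a few lines. This is an illustration under stated assumptions rather than Trustpilot's implementation: the service names, context names and the `route_alert` helper are all hypothetical. Mapping each service to exactly one context in a dictionary mirrors the rule that a service lives in only one context.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Context:
    name: str
    owner_team: str


# Hypothetical contexts; a team may own more than one.
CONTEXTS: Dict[str, Context] = {
    "data-platform": Context("data-platform", "data platform"),
    "data-science": Context("data-science", "data science"),
}

# Hypothetical services; each key maps to exactly one context.
SERVICE_CONTEXT: Dict[str, str] = {
    "ingest-api": "data-platform",
    "model-trainer": "data-science",
}


def route_alert(service: str) -> str:
    """Return the team that should receive an alert for the given service."""
    context = CONTEXTS[SERVICE_CONTEXT[service]]
    return context.owner_team
```

Because the lookup is unambiguous, an alert for a known service always resolves to a single owning team.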

These alerts are pushed through the usual channels, including Slack, but each team also has a desk lamp that glows red when there's a new alert from the SRE team. More welcome, no doubt, there's also an Amazon button for cake alerts - when someone has a birthday treat to share.

Each engineer is allowed to spend up to 20 per cent of his or her time working on personal projects "so that innovations can come from anywhere", and everyone has bi-weekly one-to-one meetings with their line manager. In the spirit of knowledge sharing there's an SRE team book club. Rather than everyone reading the same book all the way through, each team member reads one chapter and then presents it to the rest.

Unsurprisingly, the firm is automated to the max. Boelskifte spoke about Trustpilot's 11 monkeys. These build on the Chaos Monkey idea pioneered by Netflix, in which services are automatically turned on and off at random to see what falls over.

"Ours are more like order monkeys," he said. "They help us keep up a valid infrastructure. A team can spin up anything they want and then our monkeys will come by and say, oh you'll need this alerting, let me put that in for you".

Among the 11 monkeys is Terminator, which kills any instance that has been running more than 24 hours. Then there's Firewall Monkey, Tuning Monkey, Performance Monkey, Open Source Monkey and also one named Snitch.

"Since the teams can do whatever they want on the infrastructure, Snitch keeps an eye on what they do and then notifies us in the SRE channel. That helps us keep on top of what all the teams are doing."
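
Of the monkeys mentioned, Terminator has the most concrete rule: anything running for more than 24 hours goes. A minimal sketch of that selection logic follows, assuming a hypothetical `Instance` record with a launch timestamp. It only picks out which instances are over the limit and makes no cloud API calls, since the talk did not describe how the real monkey is built.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

# The 24-hour limit comes from the talk; everything else is illustrative.
MAX_AGE = timedelta(hours=24)


@dataclass(frozen=True)
class Instance:
    instance_id: str
    launched_at: datetime  # timezone-aware launch time


def expired_instances(instances: Iterable[Instance], now: datetime) -> List[str]:
    """Return the IDs of instances that have run longer than MAX_AGE."""
    return [i.instance_id for i in instances if now - i.launched_at > MAX_AGE]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    fleet = [
        Instance("old", now - timedelta(hours=25)),
        Instance("fresh", now - timedelta(hours=1)),
    ]
    print(expired_instances(fleet, now))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the rule easy to test.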