Interview: the FT's Sarah Wells on DevOps, automation and balancing experimentation and stability

Technical director for operations on the challenges and rewards of innovation

Sarah Wells is technical director for operations and reliability at publisher Financial Times. It's a role that provides the foundations for new and agile ways of publishing, supporting delivery teams, developing monitoring systems and improving the reliability of the publishing platforms, always with an eye to future developments in both technology and publishing.

Wells started her career in scientific publishing, leaving in her late twenties to take a master's in Computer Science, before joining the FT in 2011 as a Java developer and tech lead. Promoted to principal engineer in 2013, she moved to her current role at the start of this year.

Computing: How many people are there in your team and what are their roles?

Sarah Wells: My team is made up of our first-line Operations team and a Reliability Engineering team. Operations has 13 people split between Manila and London, and Reliability Engineering has six people plus up to six people from delivery teams joining us on secondment for three months at a time. The idea is to get the best ideas from those teams and also to make sure that the things we build meet people's needs.

What does DevOps mean at the FT? Can you paint a picture of before and after?

DevOps means that a delivery team's responsibilities don't end when code gets shipped. It's about building systems and running them, with a focus on resilience. The team considers monitoring, deployment, security and quality.

When I first joined the FT, code was packaged up by development teams and deployed manually by a separate technical operations team following steps laid out in an Excel spreadsheet. It was time consuming, error-prone, and meant releases were done out of hours, once a month. At that time, it also took months to buy and set up a new server.

We went from 12 releases a year to more than 3,000, and from one monolith to over 300 microservices

The FT invested in automation, particularly of provisioning. Once we could provision a server in minutes - initially in our own private cloud, but now on AWS or Heroku - we could move to a microservices architecture. Once we automated deployment and focused on redundancy, we could release small changes in minutes. For publishing, content delivery and the website, which is just one part of the technology the FT produces, we went from 12 releases a year to more than 3,000, and from one monolith to over 300 microservices.

Is it a challenge finding the right skills and experience?

We don't recruit for DevOps and we don't have a separate DevOps team. We look for 'T-shaped' engineers with a wide set of skills combined with depth in a particular area. They might be most comfortable writing deployment scripts in Ansible and setting up CDN rules, writing microservice applications, or automating testing and monitoring.

Developers at the FT spend less time coding now than they used to and more time on what used to be considered sysadmin or ops tasks - setting up deployment pipelines, adding monitoring and alerting, evaluating new technologies (often SaaS or PaaS), and supporting live systems.

Most teams have people acting as 'ops cops' during normal business hours, which means keeping an eye on production systems and fixing any problems. The key thing is adaptability and a willingness to learn.

Is there anything you'd have done differently in your migration to DevOps and/or containers, with the benefit of hindsight?

We've gained a lot from empowering our teams to make their own decisions, but it means they spend time solving the same problems. We have multiple different deployment architectures, ticketing systems and monitoring solutions, and operating many different things is complicated and expensive.

Operating many different things is complicated and expensive

It's not so much that we would have done it differently - but I think we need to consolidate some of it now, and that's one of the things my team hopes to do.

'Cloud Native' - is it a helpful term or another distracting buzzword?

I think it does describe an approach that some organisations have: on-demand provisioning; widespread automation, including continuous delivery pipelines; a culture of experimentation and an acceptance of failure and change; single-responsibility services, whether via microservices with containerisation and orchestration or via event-driven serverless; and a preference for SaaS, PaaS and IaaS.

Developers love new tech but how do you balance innovation and experimentation with standards and stability?

I think developers who have to run a system will try out new and innovative things, but if they aren't stable and easy to operate, they will get removed - if you're the one that gets woken up at 2am, you make different decisions.

At the FT we tend to distinguish between systems that are critical to our business and our brand and ones that are less critical. Our criteria for choosing technology can be different between the two - for critical systems we need cross-region replication, fast failover, and other resilience mechanisms.

In fact, we find that the majority of our incidents aren't associated with innovative or experimental technology. Our legacy systems, built before we developed our DevOps mindset, can cause issues. And the boundaries between teams can be an area where things go wrong, and it can take time to work out who can fix those issues.

You recently spoke at KubeCon. What is the role of Kubernetes - or other platforms - in achieving this balance between experimentation and stability?

We built our own cluster management and orchestration stack when we first went live with containers in 2015, because the tools out there weren't yet mature or complete. Kubernetes offers us a more stable platform, and one where we can learn from other people. It's the emerging standard in this area.

We are happy using leading edge tools and technologies and building things ourselves, but once people start building products or managed solutions, we'd rather move to those

We are happy using leading edge tools and technologies and building things ourselves, but once people start building products or managed solutions, we'd rather move to those. EKS or Fargate are interesting to us for that reason. At the moment, we have our own Kubernetes stack but I expect to be on managed Kubernetes or on containers as a service within a few years.

Only one part of the FT is using Kubernetes. We have other teams also using containers, but on ECS. We have teams deploying non-containerised apps to AWS using Ansible and Puppet. We also deploy apps to Heroku, and use serverless approaches heavily: both make the servers someone else's problem.

What are your team's plans over the next 18 months?

We formed the team at the beginning of 2018 and our initial focus is to make information visible to teams. We think showing information in an accessible format will encourage teams to improve things. An example would be showing teams the number of times their monitoring has indicated a failure over the last 24 hours. Over-sensitive systems can report a lot of false positives.

We're looking to automate things. We still create incident tickets by hand, for example, and we want to automate that. Automation is quicker and less error-prone.

We're looking to incentivise the behaviour we want

Then we're looking to incentivise the behaviour we want. For example, we want to create monitoring dashboards for teams automatically if the information they have in our business operations database is correct.
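A minimal sketch of what that automation might look like, assuming a hypothetical service record from a business-operations store and pushing a generated dashboard to Grafana (which comes up below) via its HTTP dashboard API. The field names, PromQL expression and URLs are illustrative rather than a description of the FT's actual setup:

```python
import requests

# Hypothetical record for one service, as it might appear in a
# business-operations database. The field names are illustrative.
service = {
    "code": "content-api",
    "team": "content-platform",
}

def build_dashboard(svc):
    """Build a minimal Grafana dashboard definition for one service."""
    # Assumes a per-service metric called health_check_status is being
    # scraped; the PromQL below counts failing checks over 24 hours.
    expr = (
        'sum(increase(health_check_status{service="%s",ok="false"}[24h]))'
        % svc["code"]
    )
    return {
        "dashboard": {
            "title": "%s - service overview" % svc["code"],
            "tags": [svc["team"], "auto-generated"],
            "panels": [
                {
                    "type": "graph",
                    "title": "Healthcheck failures (24h)",
                    "targets": [{"expr": expr}],
                }
            ],
        },
        "overwrite": True,  # replace the dashboard on each regeneration
    }

# Push the generated dashboard via Grafana's HTTP API.
resp = requests.post(
    "https://grafana.example.com/api/dashboards/db",
    json=build_dashboard(service),
    headers={"Authorization": "Bearer GRAFANA_API_TOKEN"},
    timeout=10,
)
resp.raise_for_status()
```

The incentive is that the dashboard only gets generated when the service's record in the business-operations store is accurate, which nudges teams to keep that data up to date.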

We're currently building a central store of business information, stored in a graph because that lets us ask interesting questions like 'show me all the systems where the person who knows about the system has left the FT' or 'show me what systems this system relies on, and whether they are at the same level of availability and resilience'.
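As an illustration of the kind of question a graph store makes easy to ask, here is a hedged sketch using Cypher via the Neo4j Python driver. The choice of Neo4j, the node labels, relationship names and properties are assumptions for the example, not the FT's actual schema:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://bizops.example.com:7687", auth=("reader", "secret")
)

# "Systems whose technical owner has left the company" - assumes
# (:System)-[:OWNED_BY]->(:Person {active: bool}) in the graph.
ORPHANED_SYSTEMS = """
MATCH (s:System)-[:OWNED_BY]->(p:Person)
WHERE p.active = false
RETURN s.code AS system, p.name AS former_owner
"""

# "What does this system depend on, and do those dependencies match its
# availability tier?" - assumes a DEPENDS_ON relationship and a
# serviceTier property on each System node.
DEPENDENCY_TIERS = """
MATCH (s:System {code: $code})-[:DEPENDS_ON]->(d:System)
RETURN d.code AS dependency,
       d.serviceTier AS dependency_tier,
       s.serviceTier AS system_tier
"""

with driver.session() as session:
    for record in session.run(ORPHANED_SYSTEMS):
        print(record["system"], "is owned by departed staff:", record["former_owner"])

    for record in session.run(DEPENDENCY_TIERS, code="content-api"):
        print(record["dependency"], record["dependency_tier"], record["system_tier"])

driver.close()
```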

We're also replacing our monitoring aggregation and visualisation with Prometheus and Grafana - we already have a large number of different monitoring tools. Prometheus is a really good way to aggregate all that information together, as there are many exporters from other metrics formats into Prometheus format. Grafana is an easy way to create dashboards based on those metrics, although we will likely build some dashboard applications as well.
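The exporter pattern mentioned here can be sketched in a few lines with the official prometheus_client library: a small process that polls a hypothetical legacy healthcheck endpoint and re-exposes the result as Prometheus metrics for Grafana to chart. The endpoint URL and metric names are assumptions for the example:

```python
import time

import requests
from prometheus_client import Gauge, start_http_server

# Gauge exposed in Prometheus format; the metric name is illustrative.
HEALTHCHECK_OK = Gauge(
    "legacy_healthcheck_ok",
    "1 if the legacy system's healthcheck passed, 0 otherwise",
    ["system"],
)

def poll_legacy_healthcheck():
    """Poll a hypothetical legacy JSON healthcheck and translate it."""
    resp = requests.get("https://legacy.example.com/__health", timeout=5)
    for check in resp.json().get("checks", []):
        HEALTHCHECK_OK.labels(system=check["name"]).set(1 if check["ok"] else 0)

if __name__ == "__main__":
    # Serve /metrics on port 8000 so Prometheus can scrape this exporter.
    start_http_server(8000)
    while True:
        poll_legacy_healthcheck()
        time.sleep(30)
```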

We're going to automate incident ticket creation to make sure we capture every event and can use that to reduce noise in our monitoring.
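One plausible shape for that automation - sketched here with Flask and an Alertmanager-style webhook payload, where the ticketing endpoint and field names are hypothetical - is a small receiver that turns every alert into a ticket:

```python
from flask import Flask, request
import requests

app = Flask(__name__)

# Hypothetical REST endpoint for a ticketing system; purely illustrative.
TICKET_API = "https://tickets.example.com/api/incidents"

@app.route("/alert-webhook", methods=["POST"])
def create_ticket_from_alert():
    """Turn an alerting webhook payload into incident tickets.

    The payload shape (a list of alerts with labels and annotations)
    follows the Prometheus Alertmanager webhook format, but the same
    pattern applies to any monitoring tool that can call a webhook.
    """
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        ticket = {
            "title": alert["labels"].get("alertname", "Unknown alert"),
            "system": alert["labels"].get("service", "unknown"),
            "description": alert["annotations"].get("description", ""),
            "source": "automated-monitoring",
        }
        # Every alert becomes a ticket, so nothing is lost and noisy
        # checks show up clearly in the ticket volume.
        requests.post(TICKET_API, json=ticket, timeout=5)
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```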

And finally, we want to continue to develop the role of our first-line operations team, to run chaos engineering experiments with teams, to review our services every year - a kind of MOT - and generally to partner with delivery teams.