Going cloud native at the FT. An interview with tech director Sarah Wells

John Leonard
8 min read

'Don't try to do everything at once. Start with infrastructure as code and automation. Don't go for microservices and containers too early...'

Few sectors have had to change more rapidly or more frequently than news media, where the move to online, the plethora of free competitive content available on the web, and the precipitous fall in advertising revenues have created an extremely challenging working environment.

In most cases, the companies that have adapted best are those that have used technology to their advantage. With media and technology so closely intertwined, clearly IT leaders in the sector need to keep their eye firmly on the horizon: by the time the future arrives it could be too late. Approaches need to be chosen for their adaptability: it's hard to pivot while carrying the dead weight of inflexible and unresponsive infrastructure on your back.

The FT has managed to thrive in this fast-changing environment, maintaining its reputation for insightful and reliable journalism and attracting a whole new audience of subscribers to its online services, while retaining the distinctive pink-coloured newspaper that is its best-known trademark.

On the IT side, the ongoing transition to digital has been enabled by a focus on automation and a lot of hard work behind the scenes, alongside a pretty big change in department culture, says tech director for operations and reliability, Sarah Wells.

Mindful of the need for agility, the FT was an early adopter of cloud native technologies, even though at the time that meant individual teams having to build an in-house cluster management and orchestration stack before the arrival of Kubernetes made such tasks a great deal easier. Wells remains an advocate for this area of technology because of the increase in productivity, reliability and flexibility it can bring.

Computing: How would you define 'cloud native'? What are its distinctive features?

SW: Cloud native to me is about building things that benefit from the cloud rather than just running on it. You miss out on a lot if you only use the cloud to provide flexible provisioning of VMs on demand.

The distinctive features on top of that are automation, continuous delivery, microservices and software as a service. You should be using containers or serverless, and having someone else run your databases or message queues.

How long have you been using cloud native tools?

We built our first systems based on microservices seven years ago, so very early. Different teams took slightly different approaches, and we are comfortable with that. While the website team opted for heavy use of Heroku to do a lot of the work of deploying and running applications, the content team that I ran started looking at containers in 2014, and we built our own cluster orchestration platform.

We were early adopters of all of this tech, and it's benefited us, even where we've gone back afterwards to take advantage of things becoming available - for example, my team replaced our own cluster orchestration with Kubernetes once that was stable, and moved to EKS on AWS when that matured.

Would you say cloud native is the way of the future for most enterprise apps?

When you're building new things, you should be aiming to have someone else doing as much as possible of the work. We can build a new system from a collection of lambdas and an Amazon-hosted queue and database quickly and cheaply, and then we can focus on the business functionality. Why wouldn't enterprises opt for this too?
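The "collection of lambdas and a hosted queue" style Wells describes can be sketched in a few lines. This is a hypothetical Python handler for an SQS-triggered AWS Lambda, not the FT's code; the payload shape and function names are illustrative:

```python
import json

def handler(event, context):
    """Hypothetical entry point for an SQS-triggered AWS Lambda.

    AWS delivers a batch of queue messages in event["Records"];
    provisioning, scaling and retries are the platform's job,
    so only the business logic lives here.
    """
    results = []
    for record in event["Records"]:           # one entry per queued message
        payload = json.loads(record["body"])  # message body is a JSON document
        results.append(process_order(payload))
    return {"processed": len(results)}

def process_order(order):
    # Placeholder business logic - the point is that this is the only
    # part of the system the team has to write and operate themselves.
    return {"id": order["id"], "status": "done"}
```

With the queue, database and runtime all managed services, the team's code is reduced to the handler body, which is what makes this style quick and cheap to stand up.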

So, no downsides?

There is definitely a cost to going fully cloud native, because it involves some big changes in the way that teams work - for example, you need to move to an approach where teams that build the systems also run them, and operating microservices can be a lot of work. But you can benefit with each step you take along the way.

What are the main benefits you aim to derive from cloud native?

The key benefit is being able to deliver new features and new products more quickly. It's about the business benefit.

If you do it right, you will release hundreds of times more often, with a much lower failure rate. Features can be released in days not months.

Alongside that, the way that costs work allows you to calculate a much more accurate total cost of ownership - you can tell how much you are spending on a particular system, and how much a 10 per cent increase in customers would impact your costs.


What are the main challenges to using cloud native technologies in your experience?

Adopting cloud native involves some big changes, in particular around culture. You need to empower your teams to make a lot of their own decisions - otherwise they won't be able to move quickly to deliver value. That means also that you need those teams to run the things they build.

Microservices are a lot more complicated to operate because they tend to run in a state of grey failure, where the resilience you've built in means your business functionality is working fine, but you see quite a bit of noise as things stop and restart.

Kubernetes and service meshes are complex. You are building a platform. Now that cloud providers can provide managed Kubernetes, it's good to take advantage of that.

If I was building something from scratch now, I would look for things that abstract away the complexity. Serverless is good for event-based flows. We use Heroku a lot at the FT, because if you want to run a node app, it just works.

Which choices have you made for the following?

Kubernetes? We're now using EKS on AWS.

Service mesh? We had already built lots of what a service mesh provides before they became available. We may be using them in places at the FT.

Monitoring? We have lots of different monitoring tools at the FT, and we aggregate all the different outputs into a Prometheus instance. That gives us a single place to look at our monitoring - a system we call Heimdall as it has a view over everything.

Observability, logging and tracing? We aggregate all our logs, and we like those logs to be structured. Being able to link all the logs for an event together because they are tagged with a unique transaction ID is essential in being able to support systems where a request might go through 20 different services. I'm interested in high cardinality observability tools like Honeycomb or Lightstep, but we haven't installed any specific observability tooling. We also use Splunk for log aggregation and Grafana for metrics.
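The transaction-ID tagging Wells describes can be sketched as follows - a minimal Python example, assuming a propagated request header (the header name, field names and service name are illustrative, not the FT's):

```python
import json
import logging
import uuid

def structured_logger(transaction_id):
    """Return a log function that emits JSON lines tagged with one
    transaction ID, so the aggregator can join up every service's
    logs for a single request that crosses many services."""
    def log(message, **fields):
        record = {"transaction_id": transaction_id,
                  "message": message, **fields}
        logging.getLogger("app").info(json.dumps(record))
        return record  # returned to make the structure easy to inspect
    return log

# A service reuses the caller's ID if one arrived, or mints a new one,
# e.g. incoming_id = request.headers.get("X-Request-Id")
incoming_id = None
txid = incoming_id or str(uuid.uuid4())
log = structured_logger(txid)
entry = log("content published", service="publish-api", status=200)
```

Because every line is structured JSON carrying the same `transaction_id`, a single search in the log aggregator reconstructs the whole request path across all twenty services.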

Security and compliance? We do a lot of automated scanning. We have some automated compliance checks - for example, to make sure that S3 buckets for systems that contain PII data aren't public.
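A check like the S3 one Wells mentions boils down to inspecting each bucket's access grants. A minimal sketch of the check logic, written as pure functions over the `Grants` structure that S3's GetBucketAcl API returns (in practice the grants would be fetched with a client such as boto3; the bucket inventory format here is an assumption for illustration):

```python
# AWS-defined grantee URIs that open a bucket ACL to everyone
PUBLIC_GRANTEES = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def is_publicly_readable(acl_grants):
    """Given the 'Grants' list from an S3 GetBucketAcl response,
    report whether any grant is made to an all-users group."""
    return any(
        grant.get("Grantee", {}).get("URI") in PUBLIC_GRANTEES
        for grant in acl_grants
    )

def flag_pii_buckets(buckets):
    """buckets maps name -> {"pii": bool, "grants": [...]} (illustrative
    inventory format); return the names that hold PII and are public."""
    return [
        name for name, info in buckets.items()
        if info["pii"] and is_publicly_readable(info["grants"])
    ]
```

Run on a schedule against the account's bucket inventory, a function like this turns the compliance rule "PII buckets must not be public" into an automated alert rather than a manual audit.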

Cloud native databases / storage layer? We use a lot of databases run by someone else, either via AWS or as Heroku plugins.

Remote procedure calls? A lot of our microservices communicate using JSON over HTTP, which is good when jumping from one language to another and for manual testing and debugging.

Infrastructure as code platforms? We prefer infrastructure to be held as code, and we use tools like Ansible and Terraform.
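Held as code, a piece of infrastructure becomes a reviewable text file rather than a console click. A minimal Terraform sketch, not FT configuration (the bucket name and tags are illustrative), showing how the earlier "no public PII buckets" rule can be enforced at provisioning time:

```terraform
resource "aws_s3_bucket" "content_store" {
  bucket = "example-content-store" # illustrative name

  tags = {
    team       = "content"
    managed_by = "terraform"
  }
}

# Declaring the public-access block alongside the bucket means the
# compliance rule travels with the infrastructure definition itself.
resource "aws_s3_bucket_public_access_block" "content_store" {
  bucket                  = aws_s3_bucket.content_store.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Every change then goes through a pull request and `terraform plan`, so the diff is visible before anything in the account changes.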

How do you decide which cloud native tools to use?

There is a fair amount of leeway for teams to choose, but if you bring in a new supplier, there is some diligence via our procurement process. Some things are supported by central teams - the kinds of tools that every development team needs to interact with, such as log aggregation and monitoring. Many tools find an audience when one team uses it and recommends it to others.

What advice would you give to someone considering cloud native for a new business application or refactoring an old one?

Don't try to do everything at once. Start with infrastructure as code and automation. Don't go for microservices and containers too early - if you think about things carefully you can structure a monolith so that it's easy to extract services if you need to for scaling purposes or because you have too many people trying to change a single system.

For refactoring you can use the strangler pattern, where you gradually extract discrete pieces of functionality from the application into new services. Sam Newman has a book, Monolith to Microservices, that talks about this kind of thing. I have to say, I don't have any experience of that myself - we built an entire replacement content publishing flow!
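The strangler pattern hinges on a routing layer in front of the monolith that diverts extracted functionality to new services. A minimal Python sketch of that routing decision (the path prefixes and service names are illustrative, not the FT's):

```python
def make_router(extracted_prefixes):
    """Strangler-pattern routing sketch: requests whose path falls under
    an already-extracted feature go to the new service; everything else
    still reaches the legacy monolith. As more features are extracted,
    prefixes are added here until the monolith can be retired."""
    def route(path):
        if any(path.startswith(prefix) for prefix in extracted_prefixes):
            return "new-service"
        return "legacy-monolith"
    return route

# Two features extracted so far; all other traffic stays on the monolith
route = make_router(["/content/publish", "/content/annotations"])
```

In production this decision usually lives in a reverse proxy or API gateway rather than application code, but the shape is the same: the monolith shrinks one route at a time, with a rollback that is just removing a prefix.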

Is cloud native really only suitable for large applications, or do you use it as a general purpose deployment platform for smaller apps too?

Cloud native in general is absolutely suitable for smaller apps. I do think you benefit from having a central team that has a set of tooling that means any delivery team can do some minimal configuration to get their code running in production.

I really wouldn't incur the overhead of learning how to use something as complicated as Kubernetes unless I was going to run a lot of services on it. But I've seen companies that have a central platform team that runs Kubernetes clusters that other teams can use, and that can work very well.

Cloud providers do tend to offer managed Kubernetes now - but I think the idea that you choose to run on managed Kubernetes to make it easier to switch to a new cloud provider is part of a generally too high level of concern with vendor lock-in. You'd likely be better off using vendors' own platforms and tooling if you haven't already learnt Kubernetes!

If you aren't prepared to use the things your cloud provider has that are unique to it, you are paying a constant tax. On AWS, I would use AWS RDS and Kinesis or SNS. It ties me in, but it's so quick to get going.
