Zero downtime - how Bloomberg builds extreme reliability into its applications

Kubernetes, multi-cloud and open source technologies are all key ingredients, says head of compute infrastructure Andrey Rybka

Andrey Rybka, head of compute infrastructure at Bloomberg, explains the measures the financial data company and publisher takes to ensure its services never go down, including the role of Kubernetes, multi-cloud, and a strategy of always selecting open source.

If you want to see the direction tech is moving, you could do worse than to consider the extreme cases. Many of Bloomberg's services certainly fall into this category. Availability is an existential issue when you're looking after Bloomberg Terminal, for example, the real-time financial data system on which traders and governments depend.

"Sometimes central banks cannot issue credit without the Bloomberg Terminal," said Andrey Rybka, head of compute infrastructure at financial data company Bloomberg. "The service is very critical, it cannot just stop, and we cannot take an outage."

Indeed, the highly automated, timezone-spanning nature of global finance means that at every second of every day someone, somewhere will be making a trade (or more like millions of trades) and so the many supporting services provided by Bloomberg must always be available.

Bloomberg may be an extreme case, but it's not alone in its requirement for always-on services. Government departments, media outlets, online stores - few public-facing organisations can afford downtime. But the complex infrastructure on which services are based means outages do occur. In the past year all the major cloud providers have gone down for one reason or another, sometimes due to a seemingly innocuous configuration error or a supporting service failure. If availability is a must-have, the question becomes how to mitigate the inevitable.

As part of Bloomberg's pursuit of zero downtime and low latency, Rybka closely monitors the reliability of cloud providers over time, particularly with respect to Kubernetes, which is the platform of choice for critical applications. He feeds these metrics into a formula he devised to estimate likely application uptime and performance. "Some cloud vendors are significantly more reliable than others," he said, declining to name names.

Other inputs include the number of availability zones and cloud regions across which applications are hosted - as well as the number of cloud providers themselves. To minimise the chance of downtime Bloomberg has adopted a multi-cloud strategy, using the platform-agnosticism of Kubernetes to ensure applications can be switched between providers, or between public and private clouds, as required.
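Rybka hasn't published his formula, but the reliability arithmetic such a calculation typically rests on is straightforward: independent replicas multiply their failure probabilities, so each extra zone or provider sharply reduces the odds that everything is down at once. A minimal sketch, with illustrative numbers rather than Bloomberg's actual metrics:

```python
def composite_availability(unit_availability: float, units: int) -> float:
    """Probability that at least one of `units` independent replicas
    (zones, regions or providers) is up, given each unit's individual
    availability, e.g. 0.999 for 'three nines'."""
    return 1 - (1 - unit_availability) ** units

# A single zone at three nines...
single = composite_availability(0.999, 1)

# ...versus the same workload spread across three independent zones:
tripled = composite_availability(0.999, 3)

print(f"1 zone:  {single:.6f}")   # 0.999000
print(f"3 zones: {tripled:.9f}")  # 0.999999999
```

The model assumes failures are independent, which real-world correlated outages (a bad config pushed everywhere, a shared dependency failing) can violate; that caveat is precisely why spreading across providers, not just zones, matters.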

If you think of like what's uniformly offered by the different cloud providers, it's Kubernetes

"If you think of like what's uniformly offered by the different cloud providers, it's literally Kubernetes," Rybka said. "I call it a multi-cloud fabric that allows us to run across AWS, Azure, Google and other cloud providers that offer a managed Kubernetes offering. Sure, there are some differences in some components, but it's more or less similar, and when I use Helm to install my application, it works the same as on-prem, so you can standardise."

Managing state across platforms also requires care, since different cloud platforms generally favour their own data stores. The Bloomberg team gets around this by insisting that any databases deployed are open source and work well with Kubernetes, generally picking options from the Cloud Native Computing Foundation (CNCF) stable.

"What's good about the CNCF landscape is there are a lot of choices that are open-source based," Rybka said.

"So, we decided to standardise on Kubernetes as a compute runtime, and we tell our teams any service that they need should be based on open source technology that is uniformly available across all cloud providers. So, if you need caching use something like Redis, if you need a distributed data bus you would use something like Kafka. And if you need databases, pick Postgres or MySQL, they're all databases that are well managed."

He continued: "Even if the cloud provider has a proprietary service, if it's based on open source technologies we can use it. For example, AWS Aurora Postgres to our application is just Postgres."
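That wire compatibility shows up at the connection-string level: a Postgres-compatible managed service is addressed exactly like self-hosted Postgres, so the application code doesn't change. A small sketch (the hostnames are hypothetical, not Bloomberg's):

```python
# Both targets speak the Postgres wire protocol, so the same DSN
# template - and the same driver, e.g. psycopg - works for either.
def postgres_dsn(host: str, db: str, user: str, port: int = 5432) -> str:
    return f"postgresql://{user}@{host}:{port}/{db}"

# Hypothetical hosts: an on-prem instance and a managed Aurora cluster.
on_prem = postgres_dsn("pg.internal.example.com", "trades", "app")
aurora = postgres_dsn("mydb.cluster-abc.us-east-1.rds.amazonaws.com",
                      "trades", "app")

print(on_prem)
print(aurora)
```

Switching between the two is then a configuration change, not a code change, which is what makes the "just Postgres" stance portable across clouds.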

Another edict is to select managed services: "We don't tell teams to stand up their own database on AWS or whatever, we say use a managed offering but with open source technology."

You can't say this is a giant cost centre and everything is free. This is not communism

Running multiple cloud instances can quickly get expensive. Rybka's team has come up with a set of cost management best practices for multi-cloud environments.

First, all resources are tagged and assigned to a particular team, with each group responsible for managing its own costs.

"You can't say this is all a giant cost centre and everything is free. This is not communism," he said. "So, we have labelling, targeting and budgeting for each account."
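The labelling-and-budgeting approach amounts to a chargeback roll-up: tag every resource with an owning team, aggregate the billing export by tag, and compare against each team's budget. A simplified sketch with invented figures (the teams, resources and budgets are illustrative only):

```python
from collections import defaultdict

# Hypothetical billing records, as a cloud cost export might supply them:
# (resource, owning-team label, monthly cost in dollars)
billing = [
    ("k8s-node-pool-a", "trading-data", 1200.0),
    ("postgres-primary", "trading-data", 450.0),
    ("kafka-cluster", "messaging", 900.0),
    ("redis-cache", "messaging", 150.0),
]

budgets = {"trading-data": 2000.0, "messaging": 1000.0}

# Roll costs up by team label and flag any team over budget.
spend = defaultdict(float)
for resource, team, cost in billing:
    spend[team] += cost

for team, total in sorted(spend.items()):
    status = "OVER BUDGET" if total > budgets[team] else "ok"
    print(f"{team}: ${total:.2f} ({status})")
```

In practice the tags would come from cloud-provider cost allocation labels and Kubernetes namespace labels, but the accountability principle is the same: no untagged resource, no anonymous spend.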

Cost management is made easier by the inherent scalability of Kubernetes, he went on, which allows capacity to be added only when required.

"Let's say at 9am Eastern Time Wall Street is about to open, so we need to scale out to many instances. Kubernetes has primitives to effectively create many more replicas of what you need to prepare for the flood of additional requests. Then at 4pm Wall Street closes and you can scale down your service, and Kubernetes will manage the load and cost implicitly."

Kubernetes is a key tool in the quest for zero downtime: its API can mitigate service failures, such as a DNS outage at one cloud provider, by failing over to another. That failover requires data and state synchronisation, however, to avoid adversely affecting the end user (a shopping cart mysteriously emptying when the switch happens, for example). Again, open source databases available under the CNCF umbrella can mitigate this risk, working across all the major cloud providers and allowing global geo-replication of data and state; one example is Apache Cassandra.

"It can even do replication across noisy networks that aren't super reliable," said Rybka. "I'm not saying it's the only way, but it's one way you can solve the multi-cloud story. And Kubernetes allows you to have small clusters on standby that you can scale up when you need more."

This is still a work in progress - and it's a lot of work

Multi-cloud is a journey, one that involves a great deal of trial and error and stress testing, including chaos engineering to unearth hidden vulnerabilities. There are multiple things that can go wrong in such a complex, interconnected environment, but Rybka sees the ability to move from provider to provider as core to the goal of always-on services.

"I don't want to oversell this, to say it's a solved problem. This is still a work in progress - and it's a lot of work," said Rybka.

That said, while there have been occasional problems, so far these have been nipped in the bud before affecting customers.

"There have been some minor reliability incidents, but nothing at wide scale. And as we started adopting Kubernetes, specifically the services that run on Kubernetes have been extremely reliable."

With tolerance for downtime - any downtime - dropping steadily in several sectors, we expect many organisations to follow Bloomberg's journey with interest.