Kubernetes: seven lessons learned from deploying it at Adobe Advertising Cloud

SREs from Adobe step through the challenges they encountered deploying containers on AWS and OpenStack and how they overcame them

In a presentation at the Open Infrastructure Summit in Denver, site reliability engineers Mike Tougeron and Tony Gosselin of Adobe Advertising Cloud described seven lessons they'd learned from deploying Kubernetes in seven data centres. The following is a summary of their presentation.

Lesson 1: Communication, teamwork and training

"You can't do good training unless you have good teamwork, and you can't have good teamwork unless there's good communication," said Tougeron. "This is the most important lesson we learned on our journey."

Everyone will have a preferred method of communication, be that Slack, email, video-conferencing, face-to-face or SMS. It's important to use all these channels to repeatedly "shout from the mountain top" to ensure that everyone who needs to be aware of changes in a cluster has got the message, he said. The responsibility can then fall on the operators of the clusters to ensure their end users know what's going on.

Tougeron advised operations leaders to meet all the key stakeholders individually, then deploy all means of communication available to ensure the message of any changes gets through. That way the engineers can act as a single team rather than as dispersed, siloed interest groups.

On the training side, he advised a 1-to-10 ratio of experts to users. Experts can create tools to abstract away some of the complexities of Kubernetes and then pair-program with teams for a quick uptake.

For most users "kubectl should not be the primary entry point to the cluster," Tougeron said.

Lesson 2: Code-to-production pipelines

It's important to ensure the pipeline runs efficiently, otherwise developers will find workarounds to get their code into production, with a consequent loss of efficiencies of scale. In part, that means choosing the right tools. Tougeron's team found that Helm works well for GitOps workloads, but the application team did not need its full set of functionality, and a simpler tool like Kustomize was sufficient and easier to pick up.

At Adobe, they moved from using multiple repos on GitHub to a monorepo, with code for multiple projects stored in one place. But, said Tougeron, while this has brought benefits, the process was painful and raised the barrier to using Kubernetes while it was ongoing.

"Try to ensure your shifts from one way of doing code to another don't overwhelm your engineers," he said.

Lesson 3: The ABCs of production

There are a number of potential problems when running Kubernetes tools, Tougeron said. For example, Adobe makes good use of the Kubernetes Horizontal Pod Autoscaler (HPA), as well as plugins for custom scaling operations, but using both together can have unexpected consequences.

"One of the gotchas we had with this was setting the number of replicas inside our deployments to say we want three replicas as a minimum, but then HPA would say we want between three and a dozen. But if it scaled up to 10 nodes the next time we deployed to three, say to introduce new code, HPA would scale back down to three and then up to 10. So if you are using HPA, don't set your deployment replicas, do it inside of HPA."
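The interaction Tougeron describes can be illustrated with a toy model. The sketch below is not Kubernetes code: the function names `apply_deployment` and `hpa_reconcile` and all the numbers are assumptions made for illustration. It shows why a manifest that pins the replica count resets the live count on every deploy, while a manifest that leaves it unset lets the autoscaler's value stand.

```python
from typing import Optional


def apply_deployment(live_replicas: int, manifest_replicas: Optional[int]) -> int:
    """Toy model of a deploy: a replica count pinned in the manifest
    overwrites the live count; an unset count leaves it untouched."""
    if manifest_replicas is None:
        return live_replicas
    return manifest_replicas


def hpa_reconcile(desired_by_load: int, min_r: int, max_r: int) -> int:
    """Toy model of the autoscaler: clamp the load-driven target to [min, max]."""
    return max(min_r, min(max_r, desired_by_load))


# Hypothetical scenario: current load needs 10 replicas, HPA allows 3 to 12.
live = 10

# Deploy with replicas pinned to 3: capacity drops until the autoscaler catches up.
after_pinned_deploy = apply_deployment(live, manifest_replicas=3)
recovered = hpa_reconcile(desired_by_load=10, min_r=3, max_r=12)

# Deploy with replicas left unset: the live count is preserved.
after_unpinned_deploy = apply_deployment(live, manifest_replicas=None)

print(after_pinned_deploy, recovered, after_unpinned_deploy)
```

In this model the pinned deploy briefly runs on three replicas under a load that needs ten, which is the dip and recovery described in the quote.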

Other issues can arise when manually deleting pods. Kubernetes pod disruption budgets are designed to stop pods being removed if that might adversely affect a service, but manual deletion can override this safety measure.
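The difference between a budget-aware removal and a manual deletion can be sketched as follows. This is a simplified model, not the Kubernetes API: the `Budget` class and both functions are hypothetical names, and the budget is reduced to a single "minimum available" number.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    min_available: int  # hypothetical budget: pods that must stay healthy


def evict(healthy: int, budget: Budget) -> int:
    """Eviction-style removal: refuse if it would breach the budget."""
    if healthy - 1 < budget.min_available:
        raise RuntimeError("eviction refused: would violate disruption budget")
    return healthy - 1


def force_delete(healthy: int) -> int:
    """Direct deletion: the budget is never consulted."""
    return max(0, healthy - 1)


budget = Budget(min_available=2)
healthy_pods = 2

try:
    evict(healthy_pods, budget)
    eviction_allowed = True
except RuntimeError:
    eviction_allowed = False

remaining_after_delete = force_delete(healthy_pods)
print(eviction_allowed, remaining_after_delete)
```

With two healthy pods and a minimum of two, the eviction is refused but the direct deletion goes ahead and leaves the service below its budget.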

Blue/green testing and using canaries with Kubernetes clusters can be very challenging, said Tougeron. Achieving visibility and observability into clusters is difficult and may require using several metrics as proxies.

At Adobe, deployment of Kubernetes coincided with moving to DevOps. The team found that when Dev and Ops are brought together, important areas of knowledge and expertise can sometimes get lost in the reshuffle, and everyone can now, in principle, do everything. It is therefore important to build in guard rails to avoid over- or under-provisioning and risky practices.

Lesson 4: Multi-cloud challenges

Adobe Advertising Cloud is split between AWS, where machine learning, associated services and persistent data live, and OpenStack, which houses the bidding ad server processing and more ephemeral workloads in a private cloud spanning six global regions.

To take advantage of the flexibility afforded by containers, the firm first deployed three Kubernetes clusters in AWS and followed that with a deployment to OpenStack, but the experience was not directly transferable.

"We ran into a number of problems because there was no persistent storage," Gosselin said.

The team had not deployed autoscaling in the OpenStack cloud either, and there was the issue of compute and rack anti-affinity to consider.

To ensure the system would work on every platform, the team created a single code base for all clouds, using modules to accommodate differences in cloud environments.

"The advantage of doing it this way is we build a consistent experience for our users, our engineers," Gosselin said. "When we apply fixes they go across all clusters. When we apply new features, they go across all clusters that are applicable, and this keeps us honest, making sure we're designing for both."

In order to achieve this, the team set aside a cluster as a lab for testing changes across cloud environments. They also developed a tool called OSSIA to solve the anti-affinity issue in OpenStack.
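The presentation summary does not say how OSSIA works, so the following is only a generic illustration of what rack anti-affinity means, not a description of Adobe's tool. The function name, host names and rack labels are all hypothetical. The idea is simply that no two replicas of a service should land on hosts in the same rack.

```python
from typing import Dict, List


def place_with_rack_anti_affinity(replicas: int, host_racks: Dict[str, str]) -> List[str]:
    """Pick at most one host per rack so that no two replicas share a rack.
    Raises ValueError if there are fewer racks than requested replicas."""
    chosen: List[str] = []
    used_racks = set()
    for host, rack in sorted(host_racks.items()):
        if len(chosen) == replicas:
            break
        if rack not in used_racks:
            chosen.append(host)
            used_racks.add(rack)
    if len(chosen) < replicas:
        raise ValueError("not enough distinct racks for the requested replicas")
    return chosen


# Hypothetical inventory: two hosts share rack-1.
hosts = {
    "host-a": "rack-1",
    "host-b": "rack-1",
    "host-c": "rack-2",
    "host-d": "rack-3",
}

placement = place_with_rack_anti_affinity(3, hosts)
print(placement)
```

Asking for four replicas against this inventory raises an error rather than doubling up on a rack, which is the strict form of the constraint; a softer variant would fall back to the least-loaded rack instead.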

Lesson 5: Knowing your applications

"Kubernetes operates in a different way than we're used to in terms of bare metal and virtualised infrastructure," said Gosselin. "Don't just assume you can plop your application in a container on Kubernetes and it will work."

Among the factors to consider are service discovery, persistent and shared storage, scheduling and restarting, and network ingress and egress.

The team learned this particular lesson when deploying Elasticsearch on Kubernetes - fortunately in a development environment. The deployment worked fine initially, but after an update the system crashed with a loss of data.

"We didn't think about how our application would work with scheduling," said Gosselin.

The problem was fixed by adding pod disruption budgets and pre- and post-operations so that nodes explicitly signal when they are going down or starting up.

Many third-party applications now come with their own Helm charts, Gosselin added, which makes the job of configuring Kubernetes to support them a lot easier.

Lesson 6: Metrics-based monitoring

In Kubernetes everything is transient. Just targeting a pod for monitoring as you would in a VM won't work.

"You need to think about it at the application level," Gosselin explained. "Looking at trends and deltas rather than specific events."
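A minimal sketch of what monitoring at the application level might look like, assuming each scrape yields (application, pod, value) samples. The sample data, pod names and function names are hypothetical. Individual pods come and go between scrapes, but the per-application total and its change over time remain meaningful.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str, int]  # (application, pod, value)


def app_totals(samples: List[Sample]) -> Dict[str, int]:
    """Sum a metric across all pods of each application, ignoring pod identity."""
    totals: Dict[str, int] = defaultdict(int)
    for app, _pod, value in samples:
        totals[app] += value
    return dict(totals)


def deltas(series: List[int]) -> List[int]:
    """Change between consecutive observations."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical scrapes of in-flight requests; pod names change between scrapes.
scrapes = [
    [("web", "web-abc", 40), ("web", "web-def", 60)],
    [("web", "web-xyz", 70), ("web", "web-def", 65)],
    [("web", "web-xyz", 60), ("web", "web-def", 60), ("web", "web-ghi", 60)],
]

web_series = [app_totals(scrape)["web"] for scrape in scrapes]
web_deltas = deltas(web_series)
print(web_series, web_deltas)
```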

The team aggregates the results of metrics-based monitoring in one place so they can study correlations to assess things like resource requests versus actual usage, and pinpoint likely culprits when performance takes a hit.
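Comparing requests with actual usage can be as simple as a ratio once the metrics are in one place. The sketch below is an illustration only: the application names, CPU figures and the 30% and 90% thresholds are assumptions, not values from the presentation.

```python
from statistics import mean
from typing import Dict, List, Tuple


def utilisation(requested_cpu: float, usage_samples: List[float]) -> float:
    """Average observed usage as a fraction of the requested amount."""
    return mean(usage_samples) / requested_cpu


def classify(ratio: float, low: float = 0.3, high: float = 0.9) -> str:
    """Flag workloads whose usage is far below or close to their request."""
    if ratio < low:
        return "over-provisioned"
    if ratio > high:
        return "under-provisioned"
    return "ok"


# Hypothetical data: (requested CPU cores, observed usage samples in cores).
apps: Dict[str, Tuple[float, List[float]]] = {
    "app-a": (2.0, [0.2, 0.3, 0.4]),
    "app-b": (1.0, [0.95, 1.0, 1.05]),
    "app-c": (1.0, [0.5, 0.6, 0.7]),
}

report = {name: classify(utilisation(req, used)) for name, (req, used) in apps.items()}
print(report)
```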

"You need to collect metrics for everything and apply monitoring for everything, even the simplest application," said Gosselin.

Lesson 7: Autoscaling benefits and challenges

Autoscaling should lead to cost savings by only deploying the resources needed. However, it does make for more frequent rescheduling of applications.

"We ran into a problem with a stats app that took two minutes to come back online," said Tougeron. "In AWS sometimes it takes time for a volume to move from one host to another. A pod comes up after a couple of seconds, but you have to think about the impact of autoscaling on the application performance, not just Kubernetes performance."

Rounding off, Tougeron said learning these lessons had helped the Adobe engineers to work as one team in deploying and configuring Kubernetes.

"We're the same team now, so we all learn and fix things together," he said, adding: "Everyone loves working with Kubernetes".

The development-to-production lifecycle is now very fast, he went on, having moved from days to hours, and new systems go live within hours, not weeks.

"I don't know anyone who's not super happy with it," Tougeron concluded.
