Crash test: How Auto Trader is improving resilience as it moves to Google Cloud

Before, during and after the cloud migration, testing has been essential to Auto Trader's infrastructure

System monitoring is an important part of IT in many modern businesses - and when your service is available 24 hours a day, 365 days a year, it becomes critical.

Car sales site Auto Trader has been online since 1996. Its website attracts more than 10 million unique visitors and 800 million pageviews a month, so any downtime means lower exposure and could cost sales.

"We're a test-driven infrastructure," said operations engineering lead Dave Whyte, who has worked at Auto Trader for almost 14 years. "Anytime anyone writes code we put out a test, write a test - we're evaluating that line of code."

Auto Trader has been working with QA company Eggplant since 2002, when the firm saw the risks associated with slowdowns and outages in its digital environment.

Although the partnership began with only 10 page tests and three user journey tests, it has grown over the last 16 years: Auto Trader now runs 53,000 tests every day. Last year it used Eggplant's services to conduct 19.5 million tests, and it now has almost 200 monitors in place to track webpages, APIs, email services and more.

APIs have presented a new challenge. Parts of the Auto Trader website rely on API connections to a third party, and in the past a problem at that partner could disrupt the site's services. Tracing the source of a fault could be difficult, as the third party wouldn't always admit that something had gone wrong.

"They'd be saying it was a problem with our systems, and we'd find it really hard to prove," said Whyte. "Now a lot of our third parties are actually tested directly via the Eggplant monitoring, so it can't be, ‘Oh, it's Auto Trader', because of the testing we've got set up.

"We're testing those directly every five or 10 minutes or so, so you can easily see, if there's a problem from an independent third party source, that it's in their server and it's not touching ours...then there's pretty much definitely an issue there that they're going to investigate.

"It's key to us that we have the right levels of monitoring, so that we pick up on issues 24/7; then my team will get an alert and troubleshoot whatever that issue is."

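The attribution Whyte describes can be sketched roughly as follows. This is a minimal illustration in Python, not Auto Trader's or Eggplant's actual tooling: the names `http_fetch`, `probe` and `attribute_fault` are hypothetical, and it assumes a real monitor would run both probes from an independent location on a fixed schedule, such as every five minutes.

```python
import time
from dataclasses import dataclass
from typing import Callable
from urllib.request import urlopen


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_ms: float


def http_fetch(url: str, timeout_s: float = 10.0) -> int:
    """Send a GET request and return the HTTP status code."""
    with urlopen(url, timeout=timeout_s) as response:
        return response.status


def probe(name: str, fetch: Callable[[], int]) -> ProbeResult:
    """Run one check; any exception or non-2xx status counts as a failure."""
    start = time.monotonic()
    try:
        ok = 200 <= fetch() < 300
    except Exception:
        ok = False
    latency_ms = (time.monotonic() - start) * 1000.0
    return ProbeResult(name, ok, latency_ms)


def attribute_fault(own: ProbeResult, third_party: ProbeResult) -> str:
    """Say which side looks unhealthy, based on two independent probes."""
    if own.ok and third_party.ok:
        return "healthy"
    if own.ok:
        return "third-party"
    if third_party.ok:
        return "own"
    return "both"
```

If the probe of the partner's endpoint fails while the probe of the site's own endpoint succeeds, the result points to the partner, which is the kind of independent evidence Whyte refers to. A scheduler would call `probe("partner", lambda: http_fetch(url))` at each interval, where `url` is whichever endpoint is being monitored.
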
Troubleshooting and redundancy are important - Auto Trader has an entire data centre that simply mirrors its primary site - and the firm performs regular disaster recovery tests across all aspects of its systems.

Whyte described Auto Trader's new approach to service availability, which involves migrating away from data centres to the cloud and Kubernetes. "We're going to essentially build all the apps from the ground up with redundancy," he said.

The company is mostly utilising the Google Cloud, "because that's where we feel Kubernetes is more mature." It isn't ignoring other providers, though, and is moving services from the data centres to "wherever we think it'll be best placed," including AWS. It aims to be finished with the migration within the next 18 months.

While the work is underway, Auto Trader is routing traffic from its twin data centres through a large BGP connection, with links to the Google Cloud for even more redundancy. Eggplant provides web support monitoring, and Whyte and his team have found a way to improve the performance of their tests using the cloud.

"In the past we'd probably get some alerts telling us about a mis-shoot from the Eggplant testing; it might be a component in one of our tests not quite working, but it's still quite hard to go down and troubleshoot: what was the issue, what layer was the issue, what's the application for that one test?

"With Kubernetes we can build it so that in any test that comes from Eggplant, we can see every single application that hits end-to-end, and we can see throughput and a whole lot more data that we didn't have access to before.

"This is all literally through the building from the ground up. So...we're building open source tracing and visibility on throughput and having much more - and more secure - control about what can come into the endpoints, and tighten things down a lot more and add more visibility about aspects we probably didn't have before."

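The end-to-end visibility Whyte describes is the idea behind request tracing: each test request carries an identifier, and every application that handles it records a timed entry against that identifier. The sketch below is a minimal in-memory illustration in Python under that assumption. The article does not name the tracing tools Auto Trader uses, and `Span`, `TraceStore` and the service names are hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str       # identifier shared by every hop of one request
    service: str        # application that handled this hop
    duration_ms: float  # time spent in that application


class TraceStore:
    """Collects spans and groups them by trace ID."""

    def __init__(self) -> None:
        self._spans: Dict[str, List[Span]] = {}

    def record(self, span: Span) -> None:
        self._spans.setdefault(span.trace_id, []).append(span)

    def services_hit(self, trace_id: str) -> List[str]:
        """Applications a request passed through, in recorded order."""
        return [s.service for s in self._spans.get(trace_id, [])]

    def slowest(self, trace_id: str) -> Optional[Span]:
        """The span with the longest duration, or None for an unknown trace."""
        spans = self._spans.get(trace_id, [])
        return max(spans, key=lambda s: s.duration_ms) if spans else None


def new_trace_id() -> str:
    """Generate a fresh identifier to attach to an incoming test request."""
    return uuid.uuid4().hex
```

With spans recorded for a single monitoring test, `services_hit` answers "what applications did this test touch?" and `slowest` points to the layer worth investigating first, which are the questions Whyte says were hard to answer before.
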
The Auto Trader/Eggplant partnership is set to continue as Auto Trader moves its infrastructure from the physical world to the virtual.