2019-07-17

AO.com, the online white goods retailer, was famously founded on the back of a £1 bet. Founder John Roberts was outspoken about the buying journey that customers had to embark on when buying appliances, prompting his friend (and later company co-founder) Alan Latchford to ask him, "Well, instead of moaning about it, John, why don't you do something about it?" Latchford bet Roberts £1 that he wouldn't quit his job to take on the challenge - and the rest is history.

Founded as Appliances Online in 2000, the company rebranded to AO.com in 2013 and went public a year later. Like many firms, it has followed the USA in making Black Friday part of the UK retail calendar, and this peak period has been instrumental in promoting digital change.

Carl Phillips is AO.com's group director of IT, and has been with the firm since 2007. He says that Black Friday 2016, although perfectly stable in terms of system performance, was "the straw that broke the camel's back" when it came to the on-prem versus cloud debate.

"We'd been talking about the virtues of doing a cloud migration for some time before that, but it was very much in the aftermath of Black Friday 2016 where [we made the final decision]. We'd...done very well across Black Friday and the systems had stood up well, but we'd worked far too hard to get to that place, and there was a big opportunity cost to that: we could have been spending that time doing other projects, which would have made the web journey smoother for our customers or given them more choice.

"I think the aftermath of Black Friday 2016 is what gave us the courage to actually go and do a cloud migration. We'd flirted with small microservices in the cloud prior to that, but this would be the first project where we would do something of a much larger scale."

It took the IT team three months to prepare for Black Friday that year, an experience that Phillips was in no rush to repeat. In the run-up to the big day, the team was running load tests against the live production environment in the early hours of the morning.

"Prior to migrating into AWS, our test environments - because it was cost-prohibitive for them to be the same size as the production environment - were logically the same as a live environment, but [with] far reduced capacity. So our load testing and performance testing for Black Friday used to be conducted out of hours on the live site…

"What we would do...was have a team wake up in the middle of the night, perform a set of load tests, recover the site...and then go back to bed. Obviously that was quite painful for them; it impacted their working week, because they'd only get in at lunchtime the next day and then they were still tired. It wasn't ideal."

Never again

Come early 2017, the company had decided to transition its bespoke e-commerce platform to the cloud - and the work had to be finished before Black Friday that same year. "We had a very limited timescale," Phillips recalls.

The company opted to perform a simple lift-and-shift to Amazon Web Services, so that the foundations would be in place as early as possible. It partnered with MSP Claranet to help with the project, which took six months.

After the migration was complete, the IT department began looking at larger-scale architectural changes, such as moving its test environments to AWS. This has enabled AO.com to produce scalable environments "identical to live in its capacity, in its size [and] in how it's architected."

Another significant change is in the way that caching is handled, and the impact that it has had on the customer experience.

"When we first went into the cloud, our on-premise solution used to have an in-memory caching tier. We had a webfarm of approximately...10 web servers, and each server had a complete copy of the cache in its memory at all times.

"The problem was that if we deployed to a server, or if we brought a new server online, that cache could be really cold, whilst the rest of the servers could have a really warm cache. For our customers, it would be a bit of a lottery which one they would hit, and it could have a fairly dramatic impact on the performance of that page.

"One of the things that we were very quickly able to do, once we'd gone into AWS, was to roll out a centralised caching tier and do away with that in-memory cache. The performance benefits for our customers were quite amazing, quite frankly; it made...doing things like deployments and operating the environment much more straightforward."
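The difference Phillips describes can be illustrated with a small sketch. It is not AO.com's implementation: the class and function names are hypothetical, a plain Python dictionary stands in for the centralised caching tier (which in practice would be a networked cache service), and `slow_lookup` stands in for whatever expensive work builds a page. It simply counts how many cache misses a newly added web server suffers under each design.

```python
from typing import Callable, Dict, List


def slow_lookup(key: str) -> str:
    """Hypothetical stand-in for an expensive backend call."""
    return f"page-for-{key}"


class WebServer:
    """A web server that reads through a cache it has been handed."""

    def __init__(self, name: str, cache: Dict[str, str],
                 loader: Callable[[str], str]) -> None:
        self.name = name
        self.cache = cache      # private dict, or one shared by the whole farm
        self.loader = loader
        self.misses = 0

    def get(self, key: str) -> str:
        if key not in self.cache:
            # Cold path: do the expensive work and remember the result.
            self.misses += 1
            self.cache[key] = self.loader(key)
        return self.cache[key]


def misses_on_new_server(shared: bool) -> int:
    """Warm a small farm, add one server, and count the new server's misses."""
    central: Dict[str, str] = {}
    keys: List[str] = ["fridge", "washer", "oven"]

    def make(name: str) -> WebServer:
        # Shared design: every server uses the same cache object.
        # Per-server design: every server starts with its own empty cache.
        return WebServer(name, central if shared else {}, slow_lookup)

    farm = [make(f"web-{i}") for i in range(3)]
    for server in farm:
        for key in keys:
            server.get(key)

    newcomer = make("web-new")  # e.g. a fresh deployment or a scale-out
    for key in keys:
        newcomer.get(key)
    return newcomer.misses
```

With per-server caches, `misses_on_new_server(False)` counts one miss per key on the new server, which is the "lottery" customers faced; with the shared cache, `misses_on_new_server(True)` finds every key already warm.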

Horizon scanning

Like many companies, AO.com has grown more complex as it has expanded, and Phillips says that his team is on a journey to see when applications have become too monolithic. To counter this, the team is examining microservices and serverless - although, he says, "You have to tread carefully with those terms, because there's a lot of hype around both of them."

The appeal of microservices lies mainly in the time they can save: "If you can validate an idea in a day or two, as opposed to spending a few weeks...I think there's a lot of value in that," says Phillips. Serverless, on the other hand, largely comes down to cost, in that you only pay for what you consume.

In the meantime, AO.com's growth has driven a need to scale up the IT function along what Phillips calls "seams of demand."

"One of our strategies for scaling the department has been to find 'seams of demand' that change on different axes, and to engineer the systems along those seams, such that they can be owned by highly specialised, highly autonomous teams. Within those teams...we do a lot of proof of concepts, so we really believe in not investing too much in what is actually just a bet or a hypothesis. We'll try and prove the idea without investing lots of time in engineering, if you like."

For any company facing a similar challenge of preparing their IT estate for a peak period, Phillips advises approaching the challenge systematically:

"I would say you have to do it scientifically. In any system, there's usually one or a very few performance-constraining factors; bottlenecks, if you like. [You need] good tests for understanding how your site will get used - so being able to predict what the usage is, because it's a lot more subtle than just how many users you're going to get across the day or across the hour…

"You need to not just hope for the best… However good you think your platform is, however well you think it will stand up to what's coming, I think it's always more comforting to have data, because it can just be one check-in by one developer that sinks the whole thing."