Escaping technical debt: moving to the cloud at the Royal Society of Chemistry

'We needed a much more flexible and cost-efficient operation to keep up with the big players in the field', says head of DevOps Chris Callaghan

There comes a time when you just can't continue along the same track any longer. For the Royal Society of Chemistry, the 175-year-old learned society and publisher of scientific research, this point was reached in 2017.

"We were based in an on-site VMware data centre near our office in Cambridge," said head of DevOps Chris Callaghan. "We were starting to run into problems with availability and scalability and at the time a lot of the hardware became end-of-life. So, we were facing a huge refresh bill for an organisation of our size -around £2 million - and we'd also ended up losing some skills required to manage and maintain the physical infrastructure.

"Trying to stay on top of modernising, trying to keep up with the volume of technical debt we had, meant that we were just kind of treading water."

It was at that point that the RSC decided to move to the public cloud.

"We needed to make it a much more flexible, much more cost-efficient operation in order to keep up with the big players in the field," said Callaghan.

This four-year cloud migration is now nearing completion, with almost all infrastructure moved to AWS.

The first step was to start changing the structure of the internal IT team, introducing site reliability engineers (SREs) and DevOps, and retraining the operations staff to think in terms of code.

"The whole team should be working with everything as code. Software developers do that already, of course, but we're embedding that practice with the operations people to think in terms of the software development lifecycle. Thinking about how you move towards continuous processes for everything from analysis, design, development, testing, cloud engineering, where everything's done in code, everything's fully automated, everything's push-on-green, everything has full test coverage."

The 30-strong team, augmented with external developers, now comprises analysts, an agile coach, SREs, UX designers and back-end engineers.

"We've pretty much got a software development house inside the organisation," Callaghan said.

The migration itself has been a stepwise process, moving first to a hosted VMware environment as an interim stage. Even this required some significant groundwork.

"We spent about 10 months putting in a whole new MPLS connection to get the bandwidth we needed," said Callaghan. "Previously we just had one 100 megabit connection in and out of the office."

Connectivity sorted, the next 12 months were spent moving all the public-facing web environments into the managed data centre as insurance against failure.

"We were basically securing our revenue stream, which meant we could then start doing a proper safe cloud migration."

The last couple of years have seen the corporate staff-facing environment and systems migrating to AWS. RSC's setup is based on Windows servers on VMware, which is not a typical DevOps environment, Callaghan admits, but there are no plans to move away from that at the moment. The team looked first at VMware's own VMware Cloud on AWS (VMC), but decided against it because VMware's machines are configured for average workloads, rather than the storage-heavy requirements of a publisher that needs to keep research dating back to 1841 available.

"To get what we needed would have been a fair bit more expensive, so we took the decision to go for a more native AWS approach," Callaghan said.

The RSC has serious ambitions for its new platform, which aims to provide a seamless process by which scientific authors can submit their papers for peer review and publication in the RSC's journals. In a sector in which a great deal rides on reputation, it's important to attract the best researchers and not put barriers in their way.

"What I really want to achieve with our team is to have the best publishing platform in the world for chemical journals, the best possible experience for our authors and our researchers and our readers," said Callaghan.

With competition for top scientists and researchers coming from publishing giants like Elsevier and Wiley, as well as 'co-opetition' from similar organisations such as the American Chemical Society, it's important the platform can flex with changes in the industry, such as the increasing popularity of open access publishing, which is already seeing an increased throughput of articles.

Providing the best user experience for fickle authors means the team - and indeed RSC's back office staff - needs to be able to see exactly how the platform is being used at all times. As a customer of New Relic for a number of years, Callaghan's team used the observability software initially to baseline the systems prior to migration. More recently it's been used to gain visibility into the whole stack, to monitor not just how applications are performing, but also how they are being used.

"We're using New Relic to get a much stronger and much more detailed view into the application stack than we had before. We've found the remote telemetry data we've been able to produce has been really useful for the cloud migration, and also as a business tool in working with our product managers because we're rebuilding the sites and we want to find out what the most valuable things are," Callaghan explained.

So far the move to cloud has been a demonstrable success in terms of improved flexibility and continuity, particularly on the administration side.

"When you've got basically Windows DFS servers and attached disks, everything is an admin overhead when you want to keep extending those disks. S3 makes that so much easier. You just define a structure, a schema, and throw it up there, and you get 11 nines of availability."

The plan over the next two or three years is to further decompose sites into a standard microservices architecture, and to work on automating as much as possible, using AI to cut down on the operational noise.

"Automation is one of the key tenets of DevOps, but we're trying to think beyond automation about how do you really make something automatic, and I think the concept of applied intelligence is really interesting here."