How Paddy Power Betfair used cloud, software-defined networking and DevOps to transform the business
Stephen Lowe, director of technology at Paddy Power Betfair, explains that the technology changed things for the better, but that some staff were inevitably left behind
Paddy Power Betfair found itself with a problem. Having grown quickly, partly by simple customer acquisition, and partly by merger, it had an IT estate that wasn't scalable or flexible enough for its needs, and worse, its development and production environments didn't match, making it hard for changes to be tested properly.
So the gaming company decided to put some of its applications up in the cloud as part of a broader IT refresh. The betting industry is bound by strict rules, but when the firm went to the regulators to check what they could and couldn't do, the answer was effectively ‘What's a cloud?'.
Eventually, the company was told that websites can be hosted wherever, but important functions such as random number generation have to be performed on a box hosted in the country in which the service is being consumed.
"So we went hybrid," explained Stephen Lowe, director of technology at the firm, addressing Computing's recent Cloud and Infrastructure Summit. "But we had big problems with dependencies, as there were so many projects on the go."
Paddy Power Betfair has in the region of 1,000 people in its technology team, of whom around 800 work in development and delivery teams.
"They're pushing out lots of change," said Lowe. "That's 500 to 600 changes a week to the production estate. The poor old infrastructure guys get left until the last minute, so a last minute network change comes in and the whole process gets stuck."
He explained that many development changes have a dependency on work that needs to be performed by the network team, but that team has its own backlog, making it hard to squeeze new changes in.
"Say a change needs security, that takes time. So you complete that and do all the handovers, then you've lost three to four days' work out of your two week sprint, which slows us down. The network team was huge bottleneck for us," said Lowe.
Another problem, he explained, was the classic issue that a change worked fine in dev, but then fell over once it was moved to production. The reason behind that, he said, was the speed of the firm's growth.
"Over time we grew rapidly. We had 20 per cent growth of the customer base every year for 20 years. The dev estate didn't keep up with the production estate, so what you test in dev is nothing like what the production estate looks like. We needed one estate which was mirrored across both environments."
The final problem was capacity, especially with the firm's loads proving extremely spiky. Paddy Power Betfair's data centres largely run at around 20-30 per cent capacity until Saturday, when the football and horse racing schedule more than doubles that usage.
"For one hour on Saturday afternoons everything's really hot," said Lowe. "And then Grand National day is a point of pride, staying up on that day is one of the big challenges. It's three to four times as big as anything else we do."
But scaling all of the firm's infrastructure for those points in time is inefficient, so the decision was made to adopt a hybrid strategy, renting capacity in the public cloud rather than have a large data centre consuming power and needing cooling all year round.
Lowe and his teams came up with four principal reasons to build a cloud environment: more stability, better testing, faster delivery and better scalability.
He also wanted to give the dev team more control over infrastructure, he explained.
"If the dev team could control the bits they need, they can make the changes they require without bothering the network team. So we decided to make everything code. You can check in a firewall change as a piece of code, then use continuous delivery processes."
But in order for that to be viable, you can't have differing test and production environments. Standardisation is needed, so that was the next job on the schedule of works.
"Now we can run almost everything on x86, there is still some custom hardware, but if everything is x86 it's easier to scale up," said Lower. "You don't need to worry about specialist vendors, or weird firmware upgrades, and you can manage it all as code."
With the environments standardised and capacity issues resolved, the next challenge was adopting a DevOps working model.
"The ops team's mantra used to be ‘We keep the lights on', and the dev team's was ‘We change stuff', and never the two shall meet. Then the big DevOps revolution came. Everyone told me different definitions of DevOps, but we created a team anyway, even without really knowing what it was," he explained.
One of the first attempts was to simply stick some devs and ops staff in a single team and just see what happened.
"It was good but didn't solve the problem," said Lowe. "It just moved the bottlenecks to the DevOps team. We didn't get anything through the pipeline faster, but it did improve the communication."
So they decided to train the developers in operations: how code runs in production, what it interfaces with, how load balancing works and how traffic moves around the network.
"So the devs were all trained with ops skills, and were now much more independent," added Lowe. "That was good. But ops still think that their job is to keep the lights on. And the developers say if we're now doing all this infrastructure work, we're responsible for it."
The issue of who was responsible for what was resolved by placing all the IT teams on call, round the clock, all year. So if any hardware or software had an issue, the last person to work on it would be called in.
"What drives the culture more than anything is we put all the IT team on call. All the devs, all QAs [quality assurance], me, everyone.
"So if I release code on a Friday and it breaks on Saturday, I get a call to come and fix it. So suddenly the quality of software releases went up. And people didn't try to push through code on a Friday evening, instead they made it wait till Monday morning, so there's lots of time to test. That worked better and had more impact than all the other rules we tried to implement," said Lowe.
Which was all well and good, but the ops teams were still out in the cold. They couldn't change anything and didn't have access to the abstraction layer open to the devs.
"So we moved DevOps into ops, and said you guys need to understand development, and made the infrastructure accessible to the devs, alongside the appropriate checks and balances," said Lowe.
"So we said your job as ops now is not to keep the lights on for the entire estate, just the hardware. But it's also your job to give devs tools that are safe for them to use. Don't let them break stuff because they don't understand it, and definitely make sure they don't break the website!" he explained.
"So now everyone can do a bit of dev, everyone now also does testing. And now problems due to dev errors have gone down."
But not everyone was happy with the cultural changes, Lowe admitted, citing an engineer who had spent six years becoming Cisco certified only to be told he now needed to write code instead.
While Lowe described the result of the changes as "brilliant", he admitted that things still sometimes go wrong, and it's not always immediately clear at which link in the chain the fault lies.
"So we stole a concept from Google called Site Reliability Engineering, which involves our geekiest, nerdiest people," he said. "Their job is when a problem is too tricky, they tell us where to look. They know where every single micron is on the network, so they troubleshoot to tell us where to go when it goes wrong."
The end result is that the firm now has infrastructure teams instead of ops, and delivery teams who also perform testing and support instead of devs.
Borrowing a line from Spiderman, Lowe concluded by saying that "with great power comes great responsibility".
"We have given everyone a lot of control and that creates nervousness. Executives get scared when you say a thousand people can now change your network. That's terrifying," he admitted.
The solution has been to remove certain permissions in order to improve security and reliability, and there's now an automated pipeline to move new changes into production, which includes a full testing cycle.
"But if all hell breaks loose you can just get a password and push a change through more quickly, and then there'll be a big investigation to ensure your change didn't break anything."
One of the benefits is that with everything now being written in code, environments are very easy to replicate.
"So now we can performance tests more easily. OpenStack has integration with AWS, so I can spin things up quickly for performance testing."
One of the hardest parts of the process was the cultural change, and the need to hire replacements for the people inevitably left behind.
"But the biggest pain points for us were the ops guys. It was a huge cultural shift for them. Trying to hire new people was hard too. They'd say they've got 16 years working on infrastructure, and we'd say we don't do that here, and they'd tell us we're crazy. Someone actually said that to me in an interview, and surprisingly he didn't get the job," said Lowe.
As a final anecdote, Lowe mentioned a problem with the recent integration of the Paddy Power and Betfair systems which would have been cataclysmic in the old days, but was far simpler to fix in the new software-defined world.
"As part of the migration we merged companies and used the same IP address ranges. We connected the data centres together and found that all IP addresses overlapped. Disaster. But it only took a day to re-IP the entire Betfair estate. That's software-defined networking at its best.
"Things that would've been massively challenging six months ago were suddenly quite simple."