bet365's Journey to DevOps

Steven Briggs, Head of DevOps, Hillside Technology, bet365's technology business, discusses his organisation's path towards implementing the DevOps culture

In many respects, we were successfully deploying DevOps-like practices in the business long before the DevOps function was officially formed. Nevertheless, as part of the never-ending need to challenge ourselves to meet the demands of the bet365 offering, taking on DevOps practices more explicitly now feels like a natural part of our evolution.

The complex technology challenges that we have always had to address as a company have meant that the stability and security of the live system has always been a shared responsibility across Development, Infrastructure and Operations. The approaches and methodologies taken across all teams have therefore, by necessity, been closely aligned. The goal of ensuring site availability without reducing speed of change has always been a shared one.

As the complexity of the platform has increased, the critical question has become how to keep pace with code releases while ensuring the platform remains stable and secure.

It's in answering this question that a more formal adoption of DevOps practices and principles has become valuable, and why the creation of a DevOps function provides us with a real opportunity. Where each part of the system has traditionally been treated in isolation, the new DevOps function will enable us to build a holistic view of how all parts of the system work together, and to advance engineering principles that make monitoring, reporting and self-diagnostics/healing more achievable.

To achieve these aims, we've identified three key areas of focus:

1. Automation - The introduction of software engineering approaches into the classic infrastructure environment to drive automation.

2. Monitoring - The improvement of the breadth, depth and sophistication of monitoring, and of the level of insight into the health of the live estate that can be gained via dashboards and analytics tooling.

3. Self-healing - The delivery of an intelligent platform that can self-diagnose and neutralise application issues before they affect the customer.

Automation

The Infrastructure teams have already started work to increase the levels of automation. Bringing software engineering principles into this operational sphere will help significantly in progressing this, while ensuring we maintain the right levels of governance and don't lose any of the benefits of having a human complete a task.

In simple terms, we are looking at activities that are typically manual and labour intensive and converting them into code. The more we can automate, the freer the team will be for higher-value activities, such as optimising the infrastructure needed to support the intensity of our consistent code releases.

However, automation brings greater benefits than just increases in speed and reduction of effort. It will also encourage us to challenge our existing processes, to simplify our estate, to break out of silos and streamline activities end-to-end.

Monitoring

Coding principles will also have a key impact on monitoring. The classic mindset around infrastructure is one of box watching. You're focused on the CPU, processes and capacity. If all the lights are green, you're happy. How the customer experience has deteriorated, i.e. which systems, products and functionality are impacted, is harder to spot.

There is a need to understand the health of the overall customer offering and the way that interrelated systems are performing across the estate. If there is a failure of a specific asset in the data centre, we need to understand what the impact of that is on all dependent systems. This will involve drawing data from the many monitoring applications that are used across Technology. We'll have to correlate events in order to avoid drowning in noise, and we're currently assessing suitable technology to help us achieve that.
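As a rough illustration of the two ideas in this section, impact analysis and event correlation, the sketch below walks a dependency map to find everything downstream of a failed asset, then groups a set of alerts under the alert that sits furthest upstream. It is only a minimal sketch: the asset names, the DEPENDENTS map and both function names are hypothetical, and it does not represent the technology bet365 is assessing.

```python
from collections import deque

# Hypothetical dependency map: each key is an asset or service, and each
# value lists the systems that depend on it directly.
DEPENDENTS = {
    "db-rack-7": ["accounts-db"],
    "accounts-db": ["login-service", "payments-service"],
    "login-service": ["website"],
    "payments-service": ["website"],
}


def impacted_by(asset, dependents=DEPENDENTS):
    """Return every system that depends, directly or indirectly, on asset."""
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def correlate(alerts, dependents=DEPENDENTS):
    """Group alerting systems under their most upstream alerting system.

    An alert is treated as a symptom when another alerting system sits
    upstream of it; otherwise it is treated as a probable root cause.
    """
    alerting = set(alerts)
    downstream = {name: impacted_by(name, dependents) for name in alerting}
    groups = {}
    for name in sorted(alerting):
        is_symptom = any(
            name in downstream[other] for other in alerting if other != name
        )
        if not is_symptom:
            # Attach every alerting system downstream of this root cause.
            groups[name] = sorted(alerting & downstream[name])
    return groups
```

With this map, three separate alerts for the website, the login service and the accounts database collapse into a single group headed by the accounts database, which is the kind of noise reduction that correlation is meant to deliver.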

It's only by truly understanding how applications are performing in the live estate that we can continue to introduce change, at pace, with minimal risk. Our demanding release schedule and complex, globally distributed systems make this a challenging objective. We currently serve 45 million customers across the globe in 20 languages and, in each instance, our presence is tailored to the specific needs of the individual region. We therefore have to deal with a lot of variation and complexity.

Self-remediation

The Holy Grail of Infrastructure Operations is clearly for issues to resolve themselves. By combining automation capability and monitoring insight, that can become a reality. Even at these early stages there are opportunities to kick off simple automated responses when issues occur. Over time we'll be able to build on this capability, and the benefits of doing so are clear: vastly improved resolution times and improved overall health of the system.
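A simple automated response of this kind might look like the sketch below: a small playbook maps known issue types to a remediation action, and anything unknown, or anything that keeps recurring, is escalated to a person. Everything here is an assumption made for illustration, including the issue kinds, the placeholder actions and the max_attempts limit; a real action would call whatever orchestration tooling is in use.

```python
def restart_service(issue):
    # Placeholder action: a real one would call an orchestration tool.
    return f"restarted {issue['service']}"


def clear_temp_files(issue):
    # Placeholder action, as above.
    return f"cleared temp files on {issue['host']}"


# Hypothetical playbook mapping an issue kind to its automated response.
PLAYBOOK = {
    "service_unresponsive": restart_service,
    "disk_nearly_full": clear_temp_files,
}


def respond(issue, playbook=PLAYBOOK, max_attempts=2):
    """Run the automated response for an issue, or hand it to a human.

    Issues with no playbook entry, or that have already been retried
    max_attempts times, are escalated rather than retried indefinitely.
    """
    action = playbook.get(issue["kind"])
    if action is None or issue.get("attempts", 0) >= max_attempts:
        return ("escalate", f"page on-call for {issue['kind']}")
    return ("automated", action(issue))
```

Keeping the escalation path explicit is one way to retain the governance and human judgement mentioned earlier while still automating the routine cases.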

Collaboration across departments

We don't strive for perfect code. We don't have that luxury. We are driven by the sporting calendar, which means our product development deadlines are set in stone. In this environment, we want good, tested, proven, deployable code. But we must accept that it comes with issues, and that how we construct and deploy our code can have implications for how we then monitor it.

Platform changes will inevitably be part of the solution, but right now for DevOps the primary focus will be to advance our monitoring and automation capabilities and drive greater collaboration through shared insight.

Collaboration between departments is key. We already have a shared goal across the whole of Technology to introduce change at pace while ensuring site stability, but we don't always have a common language and shared insight as to how we do so. There are ways you can engineer code that make monitoring, reporting and self-diagnostics/healing more achievable. For us, DevOps is about sharing coding principles and practices across all departments to ensure we can achieve our goals.

It's early days. We are at the embryonic stage, feeling our way and mapping things out. This is not about ripping everything out and starting again. It's about increasing collaboration between our existing team structures and evolving traditional operational practices. DevOps will give us the framework to introduce these changes.

The first step has been to bring together several operational teams who will collaborate to drive operational automation, increase monitoring sophistication and enhance the deployment pipeline. These teams include Software Release, IT Operations, Problem and Incident Management and Service Delivery. We are also in the process of building an SRE team.

Long-term, the goal is to spot problems with the code and heal them before any customer impact is reported to us. Until we've achieved that, we still have more work to do.

It's a journey that we are excited to take and one where we are seeing a positive response from people across the business.
