Black Friday, Cyber Monday and the digital paradox: time for hypercare

Greater online sales with increased reliance on a fallible infrastructure

This year's Black Friday and Cyber Monday will test not just the public's appetite for online shopping outside of lockdown - they'll test businesses' digital preparedness, too.

Covid-19 lockdowns saw online shopping hit a high water mark in 2021 - accounting for 36 per cent of total retail sales in the UK compared with 20 per cent pre-pandemic.

Digital is the new norm for many retailers. But the foundations of digital business are not always as solid as we'd like. Even the most seemingly robust systems can still suffer outages.

With £12.5 billion in sales at stake in the retail calendar's most important event, businesses will want to minimise the impact of unplanned events, both to ensure a seamless experience for customers and to protect their operations.

That takes a model of elevated support for planned events such as Black Friday or Cyber Monday. Enter hypercare, a model founded on a set of procedures and tools - and on a culture - to ensure IT responders can move quickly on unplanned outages.

On your best practices

Rapid response is at the heart of hypercare, and this means having a plan so that IT responders are on call, are familiar with the incident management process, and know which individuals and teams to work with. This plan is built using best practices in crisis management and daily operations.

Best practices are a staple of the data-centre world, which is itself considered a model of reliability. According to the Uptime Institute: "Having and maintaining appropriate procedures is essential to achieving performance and service availability."

Running chaos engineering exercises is a useful way to build team experience and develop responses; these responses should then be captured as best-practice procedures. When defining best practice, incidents should also be rated by severity, so that alerts reach the right team members for a timely and targeted response and avoid inefficient "all-hands" fire drills. Best practices should also establish channels of communication between responders.
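As a rough illustration of severity-based routing, the sketch below maps an incident's severity to the roles that should be alerted. The severity levels, role names and routing table are all hypothetical, chosen for the example rather than taken from any particular product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale: a lower number means a more severe incident."""
    SEV1 = 1  # customer-facing outage, e.g. checkout is down
    SEV2 = 2  # degraded service
    SEV3 = 3  # minor issue with no customer impact


@dataclass(frozen=True)
class Incident:
    service: str
    severity: Severity
    summary: str


# Hypothetical routing table: which roles are notified at each severity.
ROUTING = {
    Severity.SEV1: ["on-call engineer", "incident commander", "customer support lead"],
    Severity.SEV2: ["on-call engineer"],
    Severity.SEV3: [],  # recorded for later review; nobody is paged
}


def responders_for(incident: Incident) -> list:
    """Return the roles to notify for an incident, based only on its severity."""
    return list(ROUTING[incident.severity])


if __name__ == "__main__":
    incident = Incident("checkout", Severity.SEV1, "Payment API returning errors")
    print(responders_for(incident))
```

Keeping the routing in a single table means the procedure can be reviewed and rehearsed before the event, rather than decided in the middle of an outage.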

Automation is an important way to ensure best practices are executed. The most efficient way to achieve real-time recovery is with automatically triggered response phases that minimise disruption, rather than relying on manual implementation.
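One way to picture automatically triggered response phases is as an ordered list of actions that runs until the system reports healthy again. The following is a minimal sketch under that assumption; the function name `run_response`, the phase names and the health check are hypothetical placeholders, not a description of any specific tool.

```python
from typing import Callable, List, Tuple

# A phase is a (name, action) pair; both are hypothetical placeholders here.
Phase = Tuple[str, Callable[[], None]]


def run_response(phases: List[Phase], is_healthy: Callable[[], bool]) -> List[str]:
    """Run phases in order, stopping as soon as the health check passes.

    Returns the names of the phases that were actually executed.
    """
    executed = []
    for name, action in phases:
        if is_healthy():
            break  # recovered, so later (more disruptive) phases are skipped
        action()
        executed.append(name)
    return executed


if __name__ == "__main__":
    state = {"healthy": False}

    def restart_service() -> None:
        pass  # in this demo, a restart does not fix the problem

    def fail_over() -> None:
        state["healthy"] = True  # in this demo, failing over restores service

    def page_human() -> None:
        pass  # last resort: escalate to a person

    phases = [
        ("restart service", restart_service),
        ("fail over to standby", fail_over),
        ("page on-call responder", page_human),
    ]
    print(run_response(phases, lambda: state["healthy"]))
```

Ordering the phases from least to most disruptive means a person is only pulled in when the automated steps have not restored service.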

Give us the tools

Digital infrastructure is a complex, dynamic, multi-provider environment. Managing it requires tools that let IT delve into systems to observe their health and to pre-empt and fix problems; that means tools for real-time monitoring, logging and tracing that report into a centralised system.

There's no shortage of system data and alerts in digital operations, and without robust management it can be difficult for IT responders to fathom what's happening during an event. This plethora of data can also breed analysis paralysis, with a frenzy of alerts, notifications and false positives. As a result, incidents can go undetected or be ignored, and responders who are repeatedly pulled into firefights can burn out.

IT responders therefore need tools to capture the relevant system data and alert the "right" people. The former means analytics that make sense of the data. The latter entails intelligent alerting - screening out alert noise and notifying only those on call, for a targeted and speedy response.
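To make the idea of screening out alert noise concrete, here is a simplified sketch that suppresses repeat alerts for the same check within a time window and notifies only the person on call for the affected service. The class name `AlertFilter`, the five-minute window and the on-call mapping are assumptions made for the example; real alerting products use considerably richer rules.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp: float  # seconds since some reference point


class AlertFilter:
    """Hypothetical filter: de-duplicates alerts and routes them to whoever is on call."""

    def __init__(self, on_call: Dict[str, str], window_seconds: float = 300.0):
        self.on_call = on_call  # service name -> on-call responder
        self.window_seconds = window_seconds
        self._last_notified: Dict[Tuple[str, str], float] = {}

    def route(self, alert: Alert) -> Optional[str]:
        """Return the responder to notify, or None if the alert is suppressed."""
        key = (alert.service, alert.check)
        last = self._last_notified.get(key)
        if last is not None and alert.timestamp - last < self.window_seconds:
            return None  # same check fired recently: treat as noise
        responder = self.on_call.get(alert.service)
        if responder is None:
            return None  # no owner configured; a real system would escalate instead
        self._last_notified[key] = alert.timestamp
        return responder


if __name__ == "__main__":
    alerts = AlertFilter({"checkout": "alice"})
    print(alerts.route(Alert("checkout", "latency", 0.0)))
    print(alerts.route(Alert("checkout", "latency", 60.0)))
```

In the usage example, the second alert arrives 60 seconds after the first for the same check, so it falls inside the window and is suppressed rather than paging the responder again.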

Don't blame me - or anybody

It's easy to want to blame somebody for an outage, but a culture of blame won't fix problems, or prevent them happening again. Rather, people will hide their mistakes, denying IT responders the insight to diagnose and fix things. Incidents can also be harbingers of bigger problems that may remain hidden and unfixed.

It's important to instil a culture where people can learn from mistakes and proactively prevent further issues. That means creating an environment where employees can raise issues without fear of retribution, make mistakes without fear of punishment, and investigate new ways of working to fix things.

Conclusion

2021 uncovered the growing paradox of digital businesses: greater online sales with increased reliance on a fallible infrastructure. The risks can't be eliminated but they can be managed by wrapping systems in an additional layer of tried and tested response - that's hypercare.

Jill Brennan is VP EMEA at PagerDuty