AWS outage exposes critical cloud vulnerabilities and raises sovereignty concerns

Questions are being asked about why a third of the global internet has a single point of failure

A DNS failure in AWS’s oldest region caused widespread disruption across global internet services, highlighting the risks of over-reliance on centralised cloud infrastructure and the urgent need for multi-region resilience and sovereign cloud strategies.

AWS fixed yesterday’s DNS issue in its US-EAST-1 Region in Northern Virginia relatively quickly. The first report of user issues went up on the AWS Health dashboard at 12:11 AM PDT, and engineers identified the problem in less than two hours.

The problem related to DNS resolution of the DynamoDB API endpoint. That underlying issue was fixed by 2:22 AM PDT and “significant signs of recovery” were observed by around 3:00 AM PDT.
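For readers who want a feel for what this failure mode looks like from a client’s point of view, here is a minimal sketch (Python, standard library only) that resolves the regional DynamoDB endpoint the same way an SDK must before it can make any API call. The endpoint hostname is the real one; everything else is illustrative, and during the outage a lookup like this would have errored rather than returned addresses.

```python
import socket

# The public DynamoDB API endpoint for US-EAST-1; clients must resolve
# this name before any API call can be made.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # getaddrinfo performs the same DNS lookup an AWS SDK relies on.
    for family, _, _, _, sockaddr in socket.getaddrinfo(ENDPOINT, 443):
        print(family.name, sockaddr[0])
except socket.gaierror as err:
    # During the outage, this branch is roughly what callers would have
    # seen: the endpoint name simply failed to resolve.
    print(f"DNS resolution failed for {ENDPOINT}: {err}")
```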

However, affected websites and applications seemed to take far longer to recover, and as the US woke up yesterday morning, a whole new raft of problems was reported. AWS posted a fresh update at 3:35 AM PDT saying that users were experiencing issues launching new EC2 instances, and these problems took much longer to resolve than the initial fault: approximately 12 hours in total.

Cloud is supposed to be resilient

With such extensive disruption, one obvious question comes to mind. Aren’t cloud services supposed to be resilient? Amazon, Microsoft and Google (and let’s be honest, most cloud infrastructure belongs to one of those) can provide redundancy that would be impossible to replicate on a smaller scale. That’s one of the main selling points.

How could a problem in one region bring down around a third of the global internet?

Speaking to The Register yesterday, Omdia Chief Analyst Roy Illsley explained why the US-EAST-1 Region is an Achilles heel for AWS.

"US East is the home of the common control plane for all of AWS locations except the federal government and European Sovereign Cloud. There was an issue some years ago when the problem was related to management of S3 policies that was felt globally."

US-EAST-1 is also the oldest AWS region, and certain global AWS services and features depend on infrastructure running in US-EAST-1, irrespective of where the customer is. This includes DynamoDB Global Tables, which were the initial source of yesterday’s problems.
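DynamoDB Global Tables is not the only example of this pinning. One well-documented case: certificates for CloudFront, a global service, must come from ACM in US-EAST-1, no matter where the rest of a customer’s stack runs. The sketch below (using boto3; the domain name is illustrative) shows how even a purely European deployment ends up talking to that one region:

```python
import boto3

# CloudFront only accepts ACM certificates issued in US-EAST-1, so even
# a stack deployed entirely in eu-west-2 must call this region's endpoint.
acm = boto3.client("acm", region_name="us-east-1")

response = acm.request_certificate(
    DomainName="www.example.com",   # illustrative domain
    ValidationMethod="DNS",
)
print(response["CertificateArn"])
```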

Many of the AWS customers affected by yesterday’s outage will be running their applications and workloads in regions closer to home, but these hidden dependencies on US-EAST-1 caused them problems regardless.
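One mitigation the incident points towards is failing over to a replica in another region when a regional endpoint stops responding. As a rough sketch of the pattern, not a production failover strategy: assuming a DynamoDB Global Table named "orders" replicated across the two regions listed (both names are illustrative), a read could fall back like this:

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Illustrative names: assumes a Global Table called "orders" replicated
# across both of these regions.
REGIONS = ["us-east-1", "eu-west-1"]

def get_item_with_failover(key):
    """Try each region's replica in turn, falling back on failure."""
    last_error = None
    for region in REGIONS:
        table = boto3.resource("dynamodb", region_name=region).Table("orders")
        try:
            return table.get_item(Key=key).get("Item")
        except (EndpointConnectionError, ClientError) as err:
            # Endpoint/DNS failures (or service errors) in one region
            # fall through to the next replica.
            last_error = err
    raise last_error

# item = get_item_with_failover({"order_id": "12345"})
```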

What’s the damage?

The cost of yesterday’s outage is likely to be vast. Estimates of the cost of last year’s CrowdStrike outage to the global economy are around the $5 billion mark, and Delta Air Lines alone is still trying to recover $500 million from CrowdStrike.

Banks including Lloyds, Bank of Scotland and Halifax were hit yesterday, and payments failed for countless individuals and companies. Unravelling these failures is going to take time, but you can bet that millions of pounds in compensation claims will be lodged by consumers, banks, and online commerce and retail platforms.

Many organisations will be keenly examining their AWS SLAs this morning, but at the same time will no doubt be treating this event as yet another reminder of the importance of operational resilience.

Ismael Wrixen, CEO of digital commerce platform Thrivecart, commented:

“Today’s outage isn’t just an ‘east coast AWS’ problem; it’s a reminder that 100% uptime is a myth for everyone. The internet runs on shared infrastructure. The real story isn’t just that AWS had a critical issue, but how many businesses discovered their platform partner had no plan for it, especially outside of US hours. This is a harsh wake-up call about the critical need for multi-regional redundancy and intelligent architecture.

“Every minute this occurs, entrepreneurs are learning the most painful lesson in e-commerce: your perfectly optimized ad funnel means nothing if the ‘buy’ button is dead.”

Mark Boost, CEO of UK cloud provider Civo and a consistent advocate for sovereign cloud, focuses on the dangers of having critical UK infrastructure controlled by another country.

“Why are so many critical UK institutions, from HMRC to major banks, dependent on a data centre on the east coast of the US?” he asks. “Sovereignty means having control when incidents like this happen - but too much of ours is currently outsourced to foreign cloud providers.

“The AWS outage is yet another reminder that when you put all your eggs in one basket, you're gambling with critical infrastructure. When a single point of failure can take down HMRC, it becomes clear that our reliance on a handful of US tech giants has left core public services dangerously exposed.”

Some MPs share Boost’s concerns. The Treasury Committee wrote to the economic secretary to the Treasury yesterday demanding to know why Amazon has not been designated a critical third party (CTP) and whether the Treasury is concerned about key parts of UK infrastructure being hosted abroad.

Plenty of organisations will today be asking themselves the same question.