Amazon blames automated capacity scaling program for major outage last week

An automated programme to scale capacity caused outage, Amazon says

The company expects to release a new version of Service Health Dashboard next year to resolve the problem

Amazon Web Services (AWS) said on Friday that an automated process in its cloud computing unit caused the massive outage that brought down several websites, apps and streaming platforms worldwide last week.

In a statement, the company explained that the problem began on 7 December, when "an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behaviour from a large number of clients inside the internal network."

That, in turn, created a huge spike in connection activity, overwhelming the networking devices between the main AWS network and the internal network and delaying communication between the two. As a result, latency and errors increased for services communicating across the networks, which led to even more connection attempts and retries.
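The feedback loop described here, in which failures trigger retries that add yet more load, can be illustrated with a toy model. The sketch below is not AWS's system or code: the function name, the load figures and the `retry_factor` parameter are all assumptions chosen for illustration. It treats the link between the two networks as having a fixed capacity per tick, and assumes every failed request comes back as `retry_factor` new attempts on the next tick.

```python
def simulate_retry_storm(base_load, capacity, spike, retry_factor, ticks):
    """Toy model of retry amplification on a capacity-limited link.

    All parameters are hypothetical and exist only for illustration:
      base_load    - steady requests per tick
      capacity     - requests the link can carry per tick
      spike        - one-off burst of extra requests at tick 0
      retry_factor - new attempts generated by each failed request
      ticks        - number of ticks to simulate
    Returns the offered load (requests attempted) at each tick.
    """
    offered_history = []
    backlog = spike  # extra attempts waiting to be sent on this tick
    for _ in range(ticks):
        offered = base_load + backlog
        failed = max(0, offered - capacity)  # requests the link cannot carry
        offered_history.append(offered)
        backlog = failed * retry_factor      # failures return as retries
    return offered_history


if __name__ == "__main__":
    # One retry per failure: the spike drains and load returns to normal.
    print(simulate_retry_storm(80, 100, 50, 1, 5))  # [130, 110, 90, 80, 80]
    # Two retries per failure: load keeps growing after the spike has passed.
    print(simulate_retry_storm(80, 100, 50, 2, 5))  # [130, 140, 160, 200, 280]
```

In this simplified model the same initial spike either fades or snowballs depending only on how aggressively clients retry, which is the kind of self-reinforcing congestion the statement describes.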

The issue even impacted the company's ability to see what exactly was going wrong with the system.

According to Amazon, engineers in the operations team were prevented from using the real-time monitoring system and internal controls that they typically rely on.

Amazon says it expects to release a new version of Service Health Dashboard in early 2022 that will make it easier for the company to understand service impact. The firm also plans to release a new support system architecture that will actively run across multiple AWS regions, enabling AWS to communicate with its customers without delay.

The problems on Tuesday lasted for several hours before AWS managed to fix the issue.

Thousands of users complained on social media platforms that their smart home devices and other internet-connected services had ceased to work.

AWS later confirmed that it was experiencing problems in the 'US-East-1 Region'.

The outage took down numerous high-profile sites and online services, including Netflix, Roku, Disney+, Ticketmaster and Ring.

Amazon's e-commerce website Amazon.com and its Prime Video service also went down for thousands of users, as did Slack, Coinbase, and stock trading app Robinhood.

The issue affected internal tools at the company as well, including the Flex and AtoZ apps that are used by Amazon's warehouse and delivery workers, making it impossible for them to scan packages or access delivery routes.

Amazon sellers said they were unable to access Seller Central, an internal website used to manage customer orders, as a result of the outage.

This was not the first outage to affect Amazon services.

In July, Amazon experienced problems in its online stores service, following a service disruption from content distribution network Akamai. The disruption lasted for about two hours and affected more than 38,000 Amazon users.

AWS was also hit with a massive outage in November last year, which took down thousands of websites and services, including those belonging to Adobe, Flickr, Roku, Twilio and Autodesk.

The issue was triggered by the addition of new servers to the Amazon Kinesis real-time data processing service. Amazon said at the time that it would apply lessons learned to improve the reliability of its services.

In 2017, AWS suffered a major incident when an employee accidentally turned off more servers than intended during repairs of a billing system.