Cloudflare outage knocks out websites across globe

Services that went offline as a result of the outage included Amazon, Discord, GitLab, Twitch, Steam, Coinbase, Telegram and DoorDash

The content delivery network and DDoS mitigation firm Cloudflare had an outage on Tuesday, causing a number of sites to go offline, including Shopify and Discord.

The problem started at 0627 GMT, and it was 0742 GMT, 75 minutes later, before the firm had brought all of its data centres back online and verified that they were operating as intended.

During that period, visitors trying to access websites served through Cloudflare were shown a "500 internal server error" message, suggesting that the web server was having issues.

Cloudflare referred to the situation as a "critical P0" incident, a broad term for an urgent, top-priority issue.

The company explained in a blog post that the outage was caused by a change made as part of an ongoing project to improve resilience in "our busiest locations".

The firm said that the outage disrupted traffic in 19 of its data centres, which handle a significant share of the company's global traffic. User reports indicated that the websites and services that went offline as a result included Amazon, Discord, GitLab, Twitch, Steam, Coinbase, Telegram, DoorDash and more.

After receiving complaints from users across the globe, Cloudflare launched an investigation at about 0634 GMT.

"A change to the network configuration in those locations caused an outage which started at 0627 UTC," it said.

"At 0658 UTC the first data centre was brought back online and by 0742 UTC all data centres were online and working correctly."

Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, San Jose, Singapore, Sydney and Tokyo are among the data centres that were affected in Tuesday's incident.

"We appreciate everyone's patience during this incident, and we apologise for any disruption that may have occurred," Cloudflare said, adding that the outage was its own error and not the result of an attack or malicious activity.

The company says it has been working for the last 18 months to transition all of its busiest locations to an architecture that is more flexible and robust.

"In this time, we've converted 19 of our data centres to this architecture, internally called Multi-Colo PoP (MCP)."

The company notes the new architecture has provided it with significant reliability improvements, and also allowed it to run maintenance in these locations without disrupting customer traffic.

However, since these facilities handle a large share of Cloudflare's traffic, any failure there can have a very broad effect, and "unfortunately, that's what happened today," according to the firm.

Cloudflare is a leading player in web infrastructure, providing services that much of the internet depends on, including routing internet traffic, protecting websites from cyberattacks, video streaming and domain registration.

The company has millions of clients all around the globe, including large enterprise firms.

This is not the first time that Cloudflare has suffered a self-inflicted service interruption. In 2019, the firm said that a large service outage was triggered not by a cyberattack, but rather by a flaw in the company's firewall software.

In July 2020, another outage rendered a significant number of websites temporarily unreachable.