Cloudflare blames global outage on botched firewall rule change


Deployment of a single misconfigured firewall rule caused a server CPU spike across Cloudflare's infrastructure

Cloudflare has revealed the reasons for its global outage on Tuesday, which made as much as 10 per cent of the global internet inaccessible.

In a blog post, the company claimed that an injudicious update to its firewall rules caused a CPU spike across the Cloudflare infrastructure. That tied up the company's servers and prevented Cloudflare from connecting internet users to websites, resulting in a rash of '502 Bad Gateway' errors across the world.

The outage started at precisely 2.42pm, the company admitted, and affected its entire network. "The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules," the company explained.

It continued: "The intent of these new rules was to improve the blocking of inline JavaScript that is used in attacks. These rules were being deployed in a simulated mode where issues are identified and logged by the new rule but no customer traffic is actually blocked so that we can measure false positive rates and ensure that the new rules do not cause problems when they are deployed into full production.

"Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100 per cent on our machines worldwide. This 100 per cent CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82 per cent."

The blog post claimed that the company had never seen such CPU 'exhaustion' before.
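
The statements quoted here do not reproduce the offending expression. As a rough illustration of how one regular expression can tie up a CPU, the Python sketch below times a hypothetical pattern with nested quantifiers, (a+)+$, against inputs that can never match. The pattern, the function name time_match and the input sizes are assumptions chosen for illustration, not the rule Cloudflare deployed.

```python
import re
import time
from typing import Tuple

# Hypothetical pattern (not Cloudflare's rule): nested quantifiers make a
# backtracking regex engine try many ways of splitting a run of "a" characters.
PATTERN = re.compile(r"(a+)+$")


def time_match(text: str) -> Tuple[bool, float]:
    """Return whether PATTERN matches text and how long the attempt took."""
    start = time.perf_counter()
    matched = PATTERN.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    for n in (10, 14, 18):
        # The trailing "b" guarantees failure, so the engine keeps backtracking
        # through the alternatives before giving up.
        matched, seconds = time_match("a" * n + "b")
        print(f"n={n:2d} matched={matched} seconds={seconds:.4f}")
```

With Python's backtracking engine, the time taken tends to grow very quickly as the input gets longer, even though each input is only a few characters larger than the last. Exact timings depend on the machine and the Python version.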

"We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go and caused today's outage."

The company worked out the cause of the problem at 3.02pm. "We understood what was happening and decided to issue a 'global kill' on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic," it said. Normal service was restored by 3.09pm.

A fixed firewall ruleset was redeployed just before 4pm with no issues.
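
The company's statement refers to a procedure for deploying progressively, which was bypassed when the WAF rules went out globally in one go. The Python sketch below shows the general idea under stated assumptions: the stage names, the 90 per cent CPU limit and the functions progressive_rollout and simulate are all hypothetical, and nothing here describes Cloudflare's actual tooling.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy stage by stage; undo everything if any stage looks unhealthy."""
    deployed: List[str] = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage)
        if not healthy(stage):
            # Stop before later (larger) stages are touched.
            for name in reversed(deployed):
                rollback(name)
            return False
    return True


def simulate(
    cpu_after_deploy: Dict[str, float], cpu_limit: float = 90.0
) -> Tuple[bool, List[str]]:
    """Run a rollout against made-up CPU readings and record what happened."""
    log: List[str] = []
    ok = progressive_rollout(
        stages=list(cpu_after_deploy),
        deploy=lambda s: log.append(f"deploy {s}"),
        healthy=lambda s: cpu_after_deploy[s] < cpu_limit,
        rollback=lambda s: log.append(f"rollback {s}"),
    )
    return ok, log


if __name__ == "__main__":
    # Made-up readings: the bad rule pins CPU at the first, smallest stage.
    print(simulate({"canary": 100.0, "region": 100.0, "global": 100.0}))
    # prints (False, ['deploy canary', 'rollback canary'])
```

In this simulation the faulty change is caught at the smallest stage and rolled back, so the later stages are never deployed.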

"We recognize that an incident like this is very painful for our customers. Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future."

 

2 July 2019: The Cloudflare content delivery network should be getting back to normal this afternoon following what the company described as a "network performance issue". 

The networking problem effectively took down websites across the world, particularly in the UK, but also much of Europe and both the east and west coasts of America, according to DownDetector, which was also affected.

As much as 10 per cent of the internet was affected, according to reports. 

Initially, the company had warned in an update on its website: "Cloudflare is observing network performance issues. Customers may be experiencing 502 errors while accessing sites on Cloudflare. We are working to mitigate impact to Internet users in this region."

Within the last few minutes, though, the company has updated its System Status, claiming to have rolled out a fix for the issue, and is now "monitoring the results".

It stated: "Cloudflare has implemented a fix for this issue and is currently monitoring the results. We will update the status once the issue is resolved."

Cloudflare was founded in 2009. Today, it claims the highest number of connections to internet exchange points of any network across the world. Cloudflare caches content to its edge locations, enabling organisations to deliver content faster and with less stress on their own networks.

In addition to content delivery, it also provides DDoS mitigation services and internet security services. In 2014, it claimed to have mitigated the world's biggest-ever (up until then) distributed denial of service attack, going on to provide some detail about the attack.

It has also, though, faced legal action from a porn baron for providing the same anti-DDoS services to piracy websites. 
