Cloudflare blames global outage on botched firewall rule change

Cloudflare says 'sorry' and explains the reasons for Tuesday's global internet outage

Deployment of a single misconfigured firewall rule caused a server CPU spike across Cloudflare's infrastructure

Cloudflare has revealed the reasons for its global outage on Tuesday, which made as much as 10 per cent of the global internet inaccessible.

In a blog post, the company said that an injudicious update to its firewall rules caused a CPU spike across the Cloudflare infrastructure. That tied up the company's servers, preventing Cloudflare from connecting internet users to websites and resulting in a rash of '502 Bad Gateway' errors across the world.

The outage started at precisely 2.42pm, the company admitted, and affected its network worldwide. "The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules," the company explained.

It continued: "The intent of these new rules was to improve the blocking of inline JavaScript that is used in attacks. These rules were being deployed in a simulated mode where issues are identified and logged by the new rule but no customer traffic is actually blocked so that we can measure false positive rates and ensure that the new rules do not cause problems when they are deployed into full production.

"Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100 per cent on our machines worldwide. This 100 per cent CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82 per cent."
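The blog post as quoted here does not give the regular expression involved, so the following is only a minimal sketch of how a regular expression can exhaust a CPU at all. It assumes a textbook pathological pattern, `(a+)+$`, and a hypothetical helper, `count_attempts`, that mimics a naive backtracking engine for that one pattern and counts how many ways it tries to split the input before giving up. None of this is Cloudflare's code or rule.

```python
def count_attempts(text: str) -> int:
    """Count the match attempts a naive backtracking engine makes for (a+)+$."""
    attempts = 0

    def match_groups(pos: int) -> bool:
        nonlocal attempts
        # Find how far the run of 'a' characters extends from pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try every possible length for the next a+ group, longest first.
        for stop in range(end, pos, -1):
            attempts += 1
            if stop == len(text):
                return True  # '$' matches at the end of the input
            if match_groups(stop):
                return True
        return False  # every split failed, so the caller must backtrack

    match_groups(0)
    return attempts


if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        ok = count_attempts("a" * n)          # input that matches
        bad = count_attempts("a" * n + "b")   # input that almost matches
        print(f"n={n}: matching input {ok} attempt(s), near-miss input {bad} attempts")
```

In this sketch an input that matches is accepted on the first attempt, while an input that almost matches forces the engine to try every way of splitting the run of `a` characters, so each extra character roughly doubles the work. A rule with that property, evaluated against live traffic, could plausibly keep a CPU fully busy in the way the blog post describes.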

The blog post claimed that the company had never seen such CPU 'exhaustion' before.

"We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go and caused today's outage."
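The "deploying progressively" procedure mentioned in the quote can be pictured roughly as follows. This is a sketch under stated assumptions, not Cloudflare's tooling: the stage fractions, the 90 per cent CPU threshold, and the names `progressive_rollout` and `FakeFleet` are all hypothetical.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> float:
    """Deploy stage by stage; return the fraction of the fleet left on the change."""
    for fraction in stages:
        deploy(fraction)
        if not is_healthy():
            rollback()  # stop before the change reaches the next, larger stage
            return 0.0
    return stages[-1] if stages else 0.0


class FakeFleet:
    """Toy stand-in for a server fleet (hypothetical, for illustration only)."""

    def __init__(self, rule_is_bad: bool) -> None:
        self.rule_is_bad = rule_is_bad
        self.exposed = 0.0
        self.history: List[float] = []

    def deploy(self, fraction: float) -> None:
        self.exposed = fraction
        self.history.append(fraction)

    def cpu_percent(self) -> float:
        # Assumption: a bad rule pegs the CPU on every machine that receives it.
        return 100.0 if self.rule_is_bad and self.exposed > 0 else 30.0

    def rollback(self) -> None:
        self.exposed = 0.0


if __name__ == "__main__":
    fleet = FakeFleet(rule_is_bad=True)
    final = progressive_rollout(
        stages=[0.01, 0.10, 1.0],
        deploy=fleet.deploy,
        is_healthy=lambda: fleet.cpu_percent() < 90.0,
        rollback=fleet.rollback,
    )
    print(f"stages reached: {fleet.history}, fraction left deployed: {final}")
```

In the toy run the bad rule is caught at the first stage and never reaches the rest of the fleet, which is the protection that deploying "globally in one go" bypasses.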

At 3.02pm, the company worked out the cause of the problem. "We understood what was happening and decided to issue a 'global kill' on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic," it said. Normal service was restored by 3.09pm.
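To illustrate the two mechanisms the blog post mentions, a "simulated mode" in which rules log but do not block, and a "global kill" that disables the managed rules altogether, here is a minimal sketch. The class names `Rule` and `ManagedRuleset`, the `enabled` flag and the sample pattern are hypothetical and are not taken from Cloudflare's WAF.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    name: str
    pattern: "re.Pattern[str]"
    simulate: bool = True  # simulated mode: log matches, never block


@dataclass
class ManagedRuleset:
    rules: List[Rule]
    enabled: bool = True  # a hypothetical "global kill" sets this to False
    log: List[str] = field(default_factory=list)

    def should_block(self, body: str) -> bool:
        if not self.enabled:
            return False  # kill switch: no rule is evaluated at all
        blocked = False
        for rule in self.rules:
            if rule.pattern.search(body):
                self.log.append(rule.name)  # always record the match
                if not rule.simulate:
                    blocked = True
        return blocked


if __name__ == "__main__":
    ruleset = ManagedRuleset([Rule("inline-js", re.compile(r"<script"))])
    print(ruleset.should_block("<script>alert(1)</script>"), ruleset.log)
    ruleset.enabled = False  # "global kill"
    print(ruleset.should_block("<script>alert(1)</script>"), ruleset.log)
```

The sketch also shows why simulated mode did not prevent the outage described above: a rule that only logs still has to be evaluated against every request, so an expensive pattern costs CPU even though nothing is blocked. Only the kill switch, which skips evaluation entirely, removes that cost.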

A corrected firewall ruleset was redeployed just before 4pm without incident.

"We recognize that an incident like this is very painful for our customers. Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future."

 

2 July 2019: The Cloudflare content delivery network should be getting back to normal this afternoon following what the company described as a "network performance issue". 

The networking problem effectively took down websites across the world, particularly in the UK, but also in much of Europe and on both the east and west coasts of America, according to DownDetector, which was also affected.

As much as 10 per cent of the internet was affected, according to reports. 

Initially, the company had warned in an update on its website: "Cloudflare is observing network performance issues. Customers may be experiencing 502 errors while accessing sites on Cloudflare. We are working to mitigate impact to Internet users in this region."

Within the last few minutes, though, the company has updated its System Status, claiming to have rolled out a fix for the issue, and is now "monitoring the results".

It stated: "Cloudflare has implemented a fix for this issue and is currently monitoring the results. We will update the status once the issue is resolved."

Cloudflare was founded in 2009. Today, it claims the highest number of connections to internet exchange points of any network across the world. Cloudflare caches content to its edge locations, enabling organisations to deliver content faster and with less stress on their own networks.

In addition to content delivery, it also provides DDoS mitigation services and internet security services. In 2014, it claimed to have mitigated what was then the world's biggest-ever distributed denial of service attack, going on to provide some detail about the attack.

It has also, though, faced legal action from a porn baron for providing the same anti-DDoS services to piracy websites. 
