Deployment of a single misconfigured firewall rule caused a CPU spike on servers across Cloudflare's infrastructure
Cloudflare has revealed the reasons for its global outage on Tuesday, which made as much as 10 per cent of the global internet inaccessible.
In a blog post, the company said that an injudicious update to its firewall rules caused a CPU spike across the Cloudflare infrastructure. That tied up the company's servers, preventing Cloudflare from connecting internet users to websites and resulting in a rash of '502 Bad Gateway' errors across the world.
The outage began at precisely 2.42pm, the company admitted, and affected its network worldwide. "The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules," the company explained.
"Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100 per cent on our machines worldwide. This 100 per cent CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82 per cent."
The blog post claimed that the company had never seen such CPU 'exhaustion' before.
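The article does not say which regular expression was at fault or why it was so expensive. One well-known way a regular expression can exhaust a CPU is catastrophic backtracking, and the sketch below illustrates that failure mode only. The pattern and the helper `timed_match` are hypothetical and are not taken from Cloudflare's ruleset.

```python
import re
import time

# Hypothetical pattern, chosen because it is a textbook case of
# catastrophic backtracking. It is NOT the rule Cloudflare deployed.
PATTERN = re.compile(r"(a+)+$")


def timed_match(text):
    """Return (matched, seconds) for one match attempt against PATTERN."""
    start = time.perf_counter()
    matched = PATTERN.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    for n in (8, 12, 16, 20):
        # The trailing "b" makes the match fail, so a backtracking engine
        # tries many ways of splitting the run of "a" characters between
        # the inner and outer "+" before giving up.
        matched, seconds = timed_match("a" * n + "b")
        print(f"n={n:2d} matched={matched} seconds={seconds:.4f}")
```

On a backtracking engine such as Python's `re`, the time for the failing inputs grows very quickly as `n` increases, even though each input is only a few characters longer than the last.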
"We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go and caused today's outage."
At 3.02pm, the company worked out the cause of the problem. "We understood what was happening and decided to issue a 'global kill' on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic," it said. Normality was restored by 3.09pm.
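Cloudflare has not described how its 'global kill' works internally. As a rough illustration of the general idea of a kill switch, the sketch below shows a rule engine that skips all rule evaluation once a flag is cleared. The class `RuleEngine` and its methods are hypothetical names invented for this example.

```python
import re


class RuleEngine:
    """Toy rule engine with a kill switch (hypothetical, for illustration)."""

    def __init__(self, patterns):
        # Compile each rule once, up front.
        self.rules = [re.compile(p) for p in patterns]
        self.enabled = True

    def kill(self):
        """Disable every rule at once; requests then pass unchecked."""
        self.enabled = False

    def is_blocked(self, request_text):
        # With the switch thrown, no regular expression is evaluated at all,
        # so an expensive rule can no longer consume CPU.
        if not self.enabled:
            return False
        return any(rule.search(request_text) for rule in self.rules)


if __name__ == "__main__":
    engine = RuleEngine([r"DROP\s+TABLE"])
    print(engine.is_blocked("q=DROP TABLE users"))  # rules active
    engine.kill()
    print(engine.is_blocked("q=DROP TABLE users"))  # rules bypassed
```

The trade-off in a design like this is that traffic flows again but is unprotected by the disabled rules until a corrected ruleset is deployed.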
A corrected firewall ruleset was redeployed just before 4pm with no issues.
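The progressive deployment procedure the company refers to is not detailed in the blog post as quoted here. A minimal sketch of the general technique, assuming a health check between stages, might look like the following. The stage names, the CPU threshold and the function names are all assumptions made for illustration.

```python
def progressive_rollout(stages, deploy, is_healthy):
    """Deploy stage by stage and stop after the first unhealthy stage.

    Returns the list of stages that received the change, in order.
    """
    deployed = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage)
        if not is_healthy(stage):
            break  # halt before the change reaches any later stage
    return deployed


# Hypothetical stage names and CPU limit, not Cloudflare's.
STAGES = ["canary", "one-region", "global"]
CPU_LIMIT = 90.0


def simulate(cpu_after_deploy):
    """Run a rollout against made-up CPU readings, keyed by stage name."""
    log = []
    return progressive_rollout(
        STAGES,
        deploy=log.append,
        is_healthy=lambda stage: cpu_after_deploy[stage] < CPU_LIMIT,
    )


if __name__ == "__main__":
    # A change that pins CPU at 100 per cent is caught at the first stage.
    print(simulate({"canary": 100.0, "one-region": 100.0, "global": 100.0}))
    # A well-behaved change reaches every stage.
    print(simulate({"canary": 35.0, "one-region": 40.0, "global": 42.0}))
```

Under these assumptions, a rule that drives CPU to 100 per cent would stop at the first stage instead of reaching every machine in one go.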
"We recognize that an incident like this is very painful for our customers. Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future."
2 July 2019: The Cloudflare content delivery network should be getting back to normal this afternoon following what the company described as a "network performance issue".
The networking problem effectively took down websites across the world, particularly in the UK, but also in much of Europe and on both the east and west coasts of America, according to DownDetector, which was also affected.
As much as 10 per cent of the internet was affected, according to reports.
Initially, the company had warned in an update on its website: "Cloudflare is observing network performance issues. Customers may be experiencing 502 errors while accessing sites on Cloudflare. We are working to mitigate impact to Internet users in this region."
Within the last few minutes, though, the company has updated its System Status, claiming to have rolled out a fix for the issue, and is now "monitoring the results".
It stated: "Cloudflare has implemented a fix for this issue and is currently monitoring the results. We will update the status once the issue is resolved."
Cloudflare was founded in 2009. Today, it claims the highest number of connections to internet exchange points of any network across the world. Cloudflare caches content to its edge locations, enabling organisations to deliver content faster and with less stress on their own networks.
In addition to content delivery, it also provides DDoS mitigation and internet security services. In 2014, it claimed to have mitigated what was then the world's biggest-ever distributed denial of service attack, and went on to provide some detail about the attack.
It has also, though, faced legal action from a porn baron for providing the same anti-DDoS services to piracy websites.