Microsoft's Windows Azure cloud service suffered an outage on 26 July because of a "safety valve" configuration error, the company has said.
The outage, which affected parts of Western Europe for more than two hours, left customers without external connectivity to the internet.
In a blog post, Windows Azure general manager Mike Neil explained that the public cloud application platform's network infrastructure uses a "safety valve mechanism" to stop potential networking failures from escalating. It does this by limiting the number of connections that its datacentre network hardware devices can accept.
He said that before the outage, Microsoft added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match the new capacity.
"Because of a rapid increase in usage in this cluster, the threshold was exceeded, resulting in a sizeable amount of network management messages. The increased management traffic in turn triggered bugs in some of the cluster's hardware devices, causing them to reach 100 per cent central processing unit utilisation – impacting data traffic," Neil explained.
Microsoft resolved the issue by increasing the limit settings in the affected cluster and across all Windows Azure datacentres. It has also improved validation across its datacentres and upgraded its network monitoring systems to detect and mitigate connectivity issues before they affect running services, Neil said.
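
The failure mode Neil describes, a connection cap left unchanged while capacity grows, can be illustrated with a rough sketch. Microsoft has not published its implementation, so everything here is an assumption made for illustration: the `Device` class, the `validate_limits` helper, the device names and all the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical network device with a safety-valve connection cap."""
    name: str
    connection_limit: int        # maximum connections the device will accept
    active_connections: int = 0

    def accept(self) -> bool:
        """Accept a connection unless the cap has already been reached."""
        if self.active_connections >= self.connection_limit:
            return False
        self.active_connections += 1
        return True


def validate_limits(devices, expected_connections):
    """Return the names of devices whose cap is below the expected load."""
    return [d.name for d in devices if d.connection_limit < expected_connections]


# Illustrative numbers only: none of these figures come from Microsoft.
devices = [
    Device("dev-a", connection_limit=1000),  # cap not raised after expansion
    Device("dev-b", connection_limit=1500),  # cap raised to match new capacity
]
stale = validate_limits(devices, expected_connections=1500)
print(stale)
```

A check along these lines, run whenever capacity is added, would flag the device whose cap still reflects the old capacity before live traffic reaches it.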
Neil said that Microsoft is currently applying fixes for the identified bugs to the device software. Microsoft did not disclose how many customers were affected.
Microsoft issued an apology for the outage on the Windows Azure dashboard on 26 July and later claimed that the outages were resolved by early afternoon that day.
"We apologise for any inconvenience this outage may have caused our customers. The duration of the service interruption was approximately 2.5 hours and was resolved at 6:33 AM PDT [Pacific daylight time]. Customers who have questions regarding this incident are encouraged to contact customer service and support," it said.
The Azure outage was the latest of several crashes affecting public cloud services, highlighting the fragility of cloud computing. It followed two major crashes affecting Amazon's cloud in the space of a month, and recent downtime at Salesforce.com.