Azure outage knocks government's CloudStore offline

Outage is blamed on leap year but exposes risks of a pure cloud strategy

The government's newly launched CloudStore, a catalogue of approved cloud services and providers for public sector organisations, was offline for much of yesterday afternoon due to problems with its hosting provider, Microsoft Azure.

A Microsoft spokesperson said at the time that customers in the US and Europe were affected.

"On 28 February 2012 at 5:45 PM [Pacific Standard Time], Microsoft became aware of an issue impacting Windows Azure service management in a number of regions," the spokesperson said.

"Windows Azure engineering teams developed, validated and deployed a fix that resolved the issue for the majority of our customers. Some customers in three sub-regions - North Central US, South Central and North Europe - remain affected."

While services for the majority of affected customers were restored yesterday, Microsoft said there are continuing issues for some customers.

Bill Laing, corporate vice president of server and cloud at Microsoft, attributed the issue to a software error in handling the leap year.

"The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year," he wrote on a Microsoft blog.
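Microsoft had not published the faulty calculation at the time of writing, so the following is only a generic illustration of how a leap-year date bug can arise, not a reconstruction of the Azure code. It is a minimal Python sketch in which both function names are hypothetical: a naive "one year later" calculation that keeps the month and day unchanged fails on 29 February, while a guarded version falls back to 28 February.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical buggy helper: bumps the year and reuses month and day as-is.
    # For 29 February this asks for a date that does not exist in the
    # following (non-leap) year, so date() raises ValueError.
    return date(d.year + 1, d.month, d.day)


def safe_one_year_later(d: date) -> date:
    # Hypothetical guarded helper: same calculation, but handles 29 February.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # 29 February has no counterpart next year; fall back to 28 February.
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(safe_one_year_later(leap_day))
    try:
        naive_one_year_later(leap_day)
    except ValueError as exc:
        print("naive calculation failed:", exc)
```

Whether to fall back to 28 February or roll forward to 1 March is a design choice that depends on what the date is used for; the point of the sketch is only that the edge case has to be handled explicitly.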

Laing added that his team is still working on fixing the issue, but could give no definitive resolution time.

"Some sub-regions and customers are still experiencing issues and as a result of these issues they may be experiencing a loss of application functionality. We are working to address these remaining issues," he said.

This is not the first time that cloud services have been hit by outages.

In August last year, Amazon's EC2 cloud service and Microsoft's BPOS online desktop application suite both suffered outages that lasted several days. Both were caused by lightning storms.

Vineet Jain, CEO of cloud storage provider Egnyte, said that such outages make a case against a pure cloud strategy. Instead, he recommended a hybrid cloud/on-premises architecture for resilience.

"Downtime can be significantly mitigated if organisations were to adopt a hybrid cloud strategy," he said.

"By maintaining a behind-the-firewall presence and syncing that to the public cloud, companies are creating an insurance policy just for these situations.

"At the same time, they can keep downtime to a minimum and ensure their employees are as productive as possible in an emergency situation like this. Hybrid cloud is the smartest path to a productive workforce for today's enterprise."