Computing research: the many faces of IT disaster

By John Leonard
24 May 2012

Hardware failure

This is the most common source of IT angst.

“A disk in a RAID5 array failed on a NAS hosting several key VMs. However, rather than working as advertised (one disk failure should not bring down the array), RAID5 decided it had had enough and the whole array dropped out. Joy.”
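
The "as advertised" behaviour the respondent expected comes from parity: each stripe stores one extra block that is the XOR of the data blocks, so any single missing block can be recomputed from the rest. The following is a minimal sketch of that idea only, assuming a single stripe and hypothetical four-byte blocks; the function name and data are illustrative, not taken from any real controller or NAS firmware, and real arrays rotate parity across disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks in one stripe, plus their parity block.
data = [b"VM-A", b"VM-B", b"VM-C"]
parity = xor_blocks(data)

# Simulate losing the second disk, then rebuild its block from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Parity covers exactly one missing block per stripe, so a second disk failure or a fault during the rebuild can still take the whole array down.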

“The emergency generator was installed by a cowboy, leading to the site receiving 480 volts. This caused exploding light bulbs, flaming printers & sparking PCs all over the building!”

Human error

Human error is often a contributing factor in events that can spell disaster for the unwitting enterprise – just as it often represents a challenge to organisations’ data security.

One senior IT strategist said: “The computer controlling the pincode door access to the data centre was housed inside the DC. When the DC lost power no one could get in. The hardened door needed to be cut off by a team of welders. Unfortunately no one thought to move the access computer after the incident, so it happened again a couple of months later. The lesson was then learned.”

Another complained: “A visiting engineer accidentally hit the ‘emergency power off’ button in the comms room.”

And another told us: “An operator turned up drunk for the night shift and threw up over the operating console keyboard. We were down for two days.”

Elsewhere, two instances were reported of cleaners unplugging a vital server in order to use a vacuum cleaner, while much ire was directed towards clumsy utilities workers drilling through vital cables – a surprisingly common cause of data centre downtime.

Chain reactions

Chain reactions can seem unpredictable at the time, but in reality most can be anticipated with good scenario planning and a measure of common sense: in other words, by asking what the consequences of a specific disruptive event would be.

Regular maintenance and installing a backup cooling system will minimise a common cause of outages: air conditioning failure. The overheating of UPSs resulting from such failure was blamed for more than a few catastrophic incidents, while damage to servers and power supplies from condensation and leaking air con pipes was also reported.

One respondent said: “Server room air conditioning failure caused the server room door to expand and jam, so not only did the server overheat and stop working, but we needed to disassemble the server room lock before we could get access to the room. We have now installed secondary aircon with external alerts.”
