The recent Delta Air Lines incident sent a shockwave through the data centre industry. According to the airline, ‘a critical power control module at our technology Command centre malfunctioned, causing a surge to the transformer and a loss of power. The universal power was stabilized and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.’
Other reports blamed switchgear failure or a generator fire for the outage. Later reports suggested that critical services were housed on single-corded servers, or that both cords of dual-corded servers were plugged into the same feed, which would explain why backup power failed to keep some critical services online.
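The single-feed weakness those reports describe is the kind of condition a simple inventory audit can flag. Below is a minimal sketch in Python, assuming a hypothetical inventory that records which power feed each server cord is plugged into. The server names, feed labels and the `find_single_feed_servers` helper are illustrative only and are not taken from Delta's environment.

```python
def find_single_feed_servers(inventory):
    """Return the names of servers whose cords all end on one power feed.

    `inventory` maps a server name to the list of feeds its cords are
    plugged into (hypothetical data model, for illustration only).
    """
    at_risk = []
    for server, feeds in inventory.items():
        # A single-corded server, or a dual-corded server with both cords
        # on the same feed, has fewer than two distinct feeds.
        if len(set(feeds)) < 2:
            at_risk.append(server)
    return sorted(at_risk)


# Illustrative inventory, not real data.
inventory = {
    "booking-db": ["feed-A", "feed-B"],   # dual-corded, separate feeds
    "checkin-app": ["feed-A", "feed-A"],  # dual-corded, same feed
    "crew-sched": ["feed-B"],             # single-corded
}

print(find_single_feed_servers(inventory))
```

In this made-up inventory, only the server whose cords reach two separate feeds passes the check; the other two would lose power if their one feed failed.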
Self-proclaimed professionals (or rather armchair critics) who are not part of the airline's team were very active online, explaining the exact technical nature of what had happened and offering Delta free technical advice on what should have been done to prevent the incident.
Some articles forced the reader to spend a good ten minutes wading through a large number of unexplained assumptions. After a couple of days of being overwhelmed with information, one could conclude that all of the articles missed the crucial point: none addressed the fundamental root cause of the incident, a lack of risk management practices.
Had Delta performed regular risk assessments on its infrastructure, it would have had a more than decent chance of identifying the problem in time and taking appropriate action. With business-critical facilities and equipment involved, risk logs and a risk treatment plan based on business priorities would have been enough to give senior management the justification to correct the issues before they turned into a problem of this magnitude.
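To make the idea of a risk log feeding a treatment plan concrete, here is a minimal sketch in Python. It assumes the common likelihood-times-impact scoring convention on 1 to 5 scales; the `Risk` class, the `treatment_order` helper, the threshold of 10 and the sample entries are all illustrative assumptions, not a prescribed method or real assessment data.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: int  # 1 (rare) to 5 (almost certain), assumed scale
    impact: int      # 1 (negligible) to 5 (severe), assumed scale

    @property
    def score(self) -> int:
        # Simple likelihood x impact score (an assumed convention).
        return self.likelihood * self.impact


def treatment_order(risk_log, threshold=10):
    """Return risks at or above the threshold, highest score first."""
    flagged = [r for r in risk_log if r.score >= threshold]
    return sorted(flagged, key=lambda r: r.score, reverse=True)


# Illustrative entries only.
risk_log = [
    Risk("Critical servers fed from a single power feed", 3, 5),
    Risk("Backup switchover never tested under load", 2, 5),
    Risk("Spare parts stock out of date", 2, 2),
]

for risk in treatment_order(risk_log):
    print(risk.score, risk.description)
```

A list ordered this way gives senior management a short, prioritised view of which issues to fund first, rather than a flat inventory of concerns.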
Why this was never undertaken, or not undertaken at the appropriate levels, remains a guess. The fact is that many organisations struggle to perform risk assessments on their data centres, believing that there is no need for them. Organisations often rely on facility and equipment redundancy that may not be present or that simply does not operate as expected. Compounded by the lack of risk management practices, incidents such as the one at Delta are bound to happen.
EPI can lend organisations a helping hand: on October 5-6 we are organising another web-based CDRP (Certified Data Centre Risk Professional) event (10 am Singapore time). This course will teach students how to build a risk management programme covering all the steps and phases required to keep risk at bay. More dates can be found here, or contact your preferred EPI training partner.
Contributed by Jan Willem Mooren, EPI
Questions? Email us!