Verne Global


1 July 2017

Share Data or Suffer Again

Written by Peter Judge (Guest)

Peter Judge is the Global Editor at Datacenter Dynamics. His main interests are networking, security, mobility and cloud. (All blogs by Guest bloggers are their own)

In May, British Airways suffered a catastrophic data center failure. It caused chaos for BA, as none of its planes could fly for the weekend. The cost has been estimated at $80 million, just in terms of the lost business and compensation - to say nothing of the damage to BA’s brand.

BA’s CEO has promised an inquiry to find out what went wrong, but leaks have suggested that a contractor was somehow able to switch off the power to racks running the live IT services. It appears this happened without triggering any backup power. It also seems that, although BA has two data centers, the system did not fail over to the second one. And finally, another unconfirmed report tells how the power was switched back on in an uncontrolled way, causing damage to the IT hardware.

This is so far removed from normal practice in data centers that I believe the inquiry is more about finding a good story to tell, and getting some breathing space before an embarrassing admission to the public. However bizarre the cause, I think the failure will not be a new one. Data centers are complex human-technological systems. There are lots of moving parts, and lots of interdependencies. But they are pretty well understood, and there are only so many ways a data center can go wrong.

There have been many failures in the airline industry which are superficially similar. In 2016, JetBlue had a two-hour outage because of botched maintenance work at a Verizon data center. This differs in many substantial ways from BA’s troubles, and service was resumed more quickly. In January 2017, Delta had a two-hour outage before service was resumed. Just one week earlier, United had a two-hour failure.

The data center industry has its own fault investigators, who will have been called in to find and explain what went wrong. They will sign a non-disclosure agreement (NDA), obviously, before BA lets them see the site. Once on site, they will check the evidence, turn to their BA client, and say, wearily: “Yes, we thought so. It was a combination of X and Y.”

The BA staff will be shocked that the investigation was that easy, but the investigators will sigh and remind them of that NDA. When they probed other failures, they had to sign similar documents. They’ve seen exactly this problem, but they haven’t been able to publicise what they know. Those damn NDAs. They protect each victim from sharing their shame and loss. But they also prevent the industry from learning the lessons, until the knowledge gradually seeps out.

It’s worth mentioning in passing that shared service facilities and cloud providers will probably get more access to that seeping of knowledge than an in-house site like BA - just because they are bigger, and more IT and facilities people pass through them. But there’s an irony here. If one of BA’s planes fell out of the sky, we’d know all about it. Incidents involving airplanes must be reported and analysed by law.

A lost weekend for thousands of customers doesn’t compare to the tragedy of a plane crash, so maybe it makes sense that data centers don’t get that kind of scrutiny. So far, lives don’t generally depend on data centers. But that will change. Ed Ansett of i3 Consulting believes that data center operators will have to share the data about any failures. He’s done enough failure investigations to know the repeating patterns. He’s also seen the increasing reliance we place on data centers, and he believes that regulation will come. If facilities don’t share their information for the good of the industry, they will eventually be forced to share it by law.

There’s a move to get this sharing going - and maybe, by voluntary action, head off heavy-handed regulations. Ed and some colleagues are launching a charity called the Data Center Incident Reporting Network (DCIRN), which will be a neutral site to share the information about any serious incidents or near misses. Some of the details are still being settled, but information will be anonymised, and it will be shared fairly. If DCIRN takes off, it could be a useful aid to learning about data center failures.

However the issue is handled, it will become more prominent. Data centers will become more and more central to our lives, in airlines, hospitals and other essential infrastructure. Dealing with disasters, handling customers well, sharing information and learning should be second nature to any data center operator.

Note: You can read more of Peter's blogs at Datacenter Dynamics.

