Are Data Center "Tiers" Still Relevant?
miller60 writes "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices? That question is at the heart of an ongoing industry debate about the merits of the tier system, a four-level classification of data center reliability developed by The Uptime Institute. Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators. Uptime says that many industries continue to require mission-critical data centers with high levels of redundancy, which are needed to perform maintenance without taking a data center offline. Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar."
It's not just in IT. I work for an organization that uses a LOT of refrigeration in the form of walk-in refrigerators and freezers. Each one can hold product worth up to $1M, and all of it can be lost in a temperature excursion. So we started designing in redundancy: 2 separate refrigeration systems per box, a backup controller, and redundant power feeds from different transfer switches over diverse routing (Browns Ferry lessons learned). Oh, and each facility had twice as many boxes as needed for the inventory.
After installation, we began getting calls and complaints about how our "wonder boxes" were pieces of crap, that they were failing left and right, etc. We freaked out and did some analysis. Turns out that, in almost every instance, a trivial component had failed in 1 compressor, the box had failed over to the other system, that system had run for weeks to months, and then it failed too. When we asked why they never fixed the first failure, they said "What failure?" When we asked about the alarm the controller gave due to mechanical failure, we were told that it had gone off repeatedly but was ignored because the temperature readings were still good and that's all Operations cared about. In some instances the wires to the buzzer were cut, and in one instance, a "massive controller failure" was really a crash due to the system memory being filled by the alarm log.
Yes, we did some design changes, but we also added another base principle to our design criteria: "You can't engineer away stupid."
"As God is my witness, I thought turkeys could fly." A. Carlson