Slashdot Mirror


Are Data Center "Tiers" Still Relevant?

miller60 writes "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices? That question is at the heart of an ongoing industry debate about the merits of the tier system, a four-level classification of data center reliability developed by The Uptime Institute. Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators. Uptime says that many industries continue to require mission-critical data centers with high levels of redundancy, which are needed to perform maintenance without taking a data center offline. Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar."

19 of 98 comments

  1. It depends by afidel · · Score: 4, Interesting

    If you are large enough to survive one or more site outages then sure you can go for a cheaper $/sq ft design without redundant power and cooling. If on the other hand you are like most small to medium shops then you probably can't afford the downtime because you haven't reached the scale where you can geographically diversify your operations. In that case downtime is probably still much more costly than even the most expensive of hosting facilities. I know when we looked for a site to host our DR site we were only looking at tier-IV datacenters because the assumption is that if our primary facility is gone we will have to timeshare the significantly reduced performance facilities we have at DR and so downtime wouldn't really be acceptable. By going that route we saved ~$500k on equipment to make DR equivalent to production at a cost of a few thousand a month for a top tier datacenter, those numbers are easy to work.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  2. Infrastructure is very important. by CherniyVolk · · Score: 4, Interesting

    Infrastructure is more important than "best practices". Infrastructure is more of a physical, concrete aspect. Practices really aren't that important once the critical, physical disasters begin. As an example, good hardware will continue to run for years. Most of the downtime with good hardware will most likely be due to misconfiguration, human error, that sort of thing. A Sys Admin banks on some wrong assumption, messes up a script or hits the wrong command, but nonetheless the hardware is still physically able and therefore the infrastructure has not been jeopardized. If the situation is reversed, top notch paper plans and procedures... with crappy hardware. Well... the realities of physical discrepancies are harder to argue than our personal, nebulous, intangible, inconsequential philosophies of "good/better/best" management procedures/practices.

    So to me the question "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices?" is best translated as "To belittle the concept of uptime and its association with reliability, are data centers relying too much on the raw realities of the universe and the physical laws that govern it and not enough on some random guy's philosophies regarding problems that only manifest within our imaginations?"

    Or, as a medical analogy... "In their efforts in curing cancer, are doctors relying too much on science and not enough on voodoo/religion?"

  3. Tiers and Data Center Redundancy by japhering · · Score: 3, Insightful

    Data center redundancy is a needed thing. However, most data center designs forget to address the two largest causes of downtime ... people and software. People are people and will always make mistakes; even so, there are still things that can be done to reduce the impact of human error.

    Software is very rarely designed for use in redundant systems. More likely, the design is for use in a hot-cold or hot-warm recovery scenario. Very rarely is it designed for multiple hot across multiple data centers.

    Remember, good disaster avoidance is always cheaper than disaster recovery when done right.

    1. Re:Tiers and Data Center Redundancy by japhering · · Score: 2, Insightful

      And if you had two identical data centers, where each in and of itself was redundant with software designed to function seamlessly across the two in a hot-hot configuration .. there would have been NO downtime.. the university would have been up the entire time with little to no data loss.

      So say I'm Amazon and my data center burns down.. 48 hours with ZERO sales for a disaster recovery scenario vs normal operations for the time it takes to rebuild/move the burned data center..

      I think I'll take disaster avoidance and keep selling things :-)

    2. Re:Tiers and Data Center Redundancy by aaarrrgggh · · Score: 2, Insightful

      Unless you were doing maintenance in the second facility when a problem hit the first. That is what real risk management is about; when you assume hot-hot will cover everything, you have to make sure that is really the case. Far too often there are a few things that will either cause data loss or significant recovery time even in a hot-hot system when there is a failure.

      Even with hot-hot systems, all facilities should be reasonably redundant and reasonably maintainable. Fully redundant and fully maintainable can be a pipe-dream.

    3. Re:Tiers and Data Center Redundancy by japhering · · Score: 3, Interesting

      Precisely, I've spent the last 12 years (prior to being laid off) working in a hot-hot-hot solution. Each center was fully redundant and ran at no more than 50% dedicated utilization. Each data center got 1 week's worth of planned maintenance every quarter for hardware and software updates, during which that data center was completely offline, leaving a hot-hot solution.. if something else happened we still had a "live" data center while scrambling to recover the other two.

      We ran completely without change windows as we would simply deadvertize an entire data center, do the work and readvertize, then move on to the next data center. In the event of high importance, say a CERT advisory requiring an immediate update, we would follow the same procedures just as soon as all the requisite mgmt paperwork was complete.

      And yes, we were running some of the most visible and highest traffic websites on the internet.

  4. But it's never the software... by Sarten-X · · Score: 2, Insightful

    "A stick of RAM costs how much? $50?"

    I don't remember the source of that quote, but it was in relation to a company spending money (far more than $50) to reduce the memory use of their program. Sure, there's a lot of talk in computer science curricula about using efficient algorithms, but from what I've seen and heard, companies almost always respond to performance problems by buying bigger and better hardware. If software weren't grossly inefficient, how would that affect data centers? Less power consumption, cheaper hardware, and more "bang for your buck", so to speak.

    Eventually, this whole debate becomes moot, as data centers can get more income from the hardware, thus still provide the uptime, redundancy, and features, without the need to cut costs. Once those basic needs are out of the way, there's room for expansion into other less-than-critical offerings, and finally, innovation in areas other than uptime.

    --
    You do not have a moral or legal right to do absolutely anything you want.
    1. Re:But it's never the software... by Maximum+Prophet · · Score: 2, Insightful

      That works if you have one program that you have to run every so often to produce a report. If your datacenter is more like Google, where you have 100,000+ servers, a 10% increase in efficiency could eliminate 10,000 servers. Figure $1,000 per server and it would make sense to offer a $1,000,000 prize to a programmer that can increase the efficiency of the Linux kernel by > 10%.
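      The arithmetic above can be checked with a rough sketch. The figures come from the comment; the variable names are illustrative, and it only counts hardware purchase cost (power, cooling, and staffing are ignored, which is an assumption):

```python
# Figures from the comment; names are illustrative only.
servers = 100_000          # fleet size
efficiency_gain = 0.10     # 10% more work per server
cost_per_server = 1_000    # dollars, hardware only (assumption: no power/cooling costs)
prize = 1_000_000          # dollars offered to the programmer

# Servers that are no longer needed after the efficiency gain.
servers_saved = round(servers * efficiency_gain)

# Hardware spend avoided, and what is left after paying the prize.
hardware_savings = servers_saved * cost_per_server
net_benefit = hardware_savings - prize

print(f"servers saved:    {servers_saved}")
print(f"hardware savings: ${hardware_savings:,}")
print(f"net of the prize: ${net_benefit:,}")
```

      On these numbers the avoided hardware spend is ten times the prize, so the prize would pay for itself even if the per-server cost were considerably lower.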

      B.t.w Adding one stick of RAM might increase the efficiency of a machine, but in the case above, the machines are probably maxed out w.r.t. RAM. Adding more might not be an option without an expensive retrofit.

      --
      All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
    2. Re:But it's never the software... by Maximum+Prophet · · Score: 4, Insightful

      Code scales, hardware doesn't. If you have one machine, yes, it's cheaper to get a bigger, better machine, or to wait for one to be released.

      If you have 20,000 machines, even a 10% increase in efficiency is important.

      --
      All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
    3. Re:But it's never the software... by Mr.+DOS · · Score: 2, Informative

      Perhaps this TDWTF article is what you were thinking of?

            --- Mr. DOS

  5. Perfect illustration by jeffmeden · · Score: 4, Insightful

    Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar.

    Repeat after me: There is no replacement for redundancy. There is no replacement for redundancy. Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*. Redundancy is irreplaceable. If you rely on your servers (the servers housed in one place) you had better have redundancy for EVERY. SINGLE. OTHER. ASPECT. If not, you can expect downtime, and you can expect it to happen at the worst possible moment.

    1. Re:Perfect illustration by Timothy+Brownawell · · Score: 2, Insightful

      Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*.

      No, I've also heard about cases where both redundant systems failed at the same time (due to poor maintenance) and where the fire department won't allow the generators to be started. Everything within the datacenter can be redundant, but the datacenter itself still is a single physical location.

      Redundancy is irreplaceable.

      Distributed fault-tolerant systems are "better", but they're also harder to build. Likewise redundancy is more expensive than lack of redundancy, and if you have to choose between $300k/year for a redundant location with redundant people vs. a million-dollar outage every few years, well, the redundancy might not make sense.
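      The trade-off above can be put as a break-even calculation. This is a minimal sketch: the $300k/year and $1M figures are from the comment, but the outage interval is left open ("every few years"), so it is a parameter here, and the sketch assumes the redundant site would fully prevent the outage. The function names are hypothetical.

```python
def annual_outage_cost(outage_cost: float, years_between_outages: float) -> float:
    """Average yearly cost of an outage that recurs every N years."""
    return outage_cost / years_between_outages


def redundancy_pays_off(redundancy_per_year: float,
                        outage_cost: float,
                        years_between_outages: float) -> bool:
    """True if redundancy costs less per year than the outages it is assumed to prevent."""
    return redundancy_per_year < annual_outage_cost(outage_cost, years_between_outages)


REDUNDANCY_PER_YEAR = 300_000   # from the comment
OUTAGE_COST = 1_000_000         # from the comment

# Outage interval at which both options cost the same per year.
break_even_years = OUTAGE_COST / REDUNDANCY_PER_YEAR

print(f"break-even interval: {break_even_years:.2f} years")
for years in (2, 5):
    verdict = redundancy_pays_off(REDUNDANCY_PER_YEAR, OUTAGE_COST, years)
    print(f"outage every {years} years -> redundancy pays off: {verdict}")
```

      With these figures the break-even point sits at roughly one outage every three and a third years: outages more frequent than that favor the redundant site, less frequent ones do not.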

  6. pointless marketing by vlm · · Score: 5, Informative

    Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators

    I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up. And yes I've been involved in numerous power failure incidents (dozens) at numerous companies, and only experienced two incidents of successful backup of commercial power loss.

    Transfer switches that don't switch. Generators that don't start below 50 degrees. Generators with empty fuel tanks staffed by smirking employees with diesel vehicles. When you're adding capacity to battery string A, and the contractor shorts out the mislabeled B bus while pulling cable for the "A" bus.

    Experience shows that if a company's core competency is not running power plants, they would be better off not trying to build and maintain a small electrical power plant. Microsoft has conditioned users to expect failure and unreliability, use that conditioning to your advantage... the users don't particularly care if it's down because of an OS patch or a loss of -48VDC...

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    1. Re:pointless marketing by Ephemeriis · · Score: 2, Insightful

      I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up.

      A lot of folks don't really contemplate what a loss of power means to their business.

      Some IT journal or salesperson or someone tells them that they need backup power for their servers, so they throw in a pile of batteries or generators or whatever... And when the power goes out they're left in dark cubicles with dead workstations. Or their manufacturing equipment doesn't run, so it doesn't really matter if the computers are up. Or all their internal network equipment is happy, but there's no electricity between them and the ISP - so their Internet is down anyway.

      I'll stand behind a few batteries for servers... Enough to keep them running until they can shut down properly... But actually staying up and running while the power is out? From what I've seen that's basically impossible.

      --
      "Work is the curse of the drinking classes." -Oscar Wilde
    2. Re:pointless marketing by R2.0 · · Score: 5, Interesting

      It's not just in IT. I work for an organization that uses a LOT of refrigeration in the form of walk-in refrigerators and freezers. Each one can hold product worth up to $1M and all can be lost in a temperature excursion. So we started designing in redundancy: 2 separate refrigeration systems per box, backup controller, redundant power feeds from different transfer switches over diverse routing (Browns Ferry lessons learned). Oh, and each facility had twice as many boxes as needed for the inventory.

      After installation, we began getting calls and complaints about how our "wonder boxes" were pieces of crap, that they were failing left and right, etc. We freak out and do some analysis. Turns out that, in almost every instance, a trivial component had failed in 1 compressor and the system had failed over to the other system, run for weeks to months, and then that failed too. When we asked why they never fixed the first failure, they said "What failure?" When we asked about the alarm the controller gave due to mechanical failure, we were told that it had gone off repeatedly but was ignored because the temperature readings were still good and that's all Operations cared about. In some instances the wires to the buzzer were cut, and in one instance, a "massive controller failure" was really a crash due to the system memory being filled by the alarm log.

      Yes, we did some design changes, but we also added another base principle to our design criteria: "You can't engineer away stupid."

      --
      "As God is my witness, I thought turkeys could fly." A. Carlson
  7. RAID by QuantumRiff · · Score: 4, Interesting

    Why go with a huge, multiple 9's datacenter, when you can go the way of google, and have a RAID:
    Redundant Array of Inexpensive Datacenters..

    Is it really better to have 1000 machines in a 5-9's location, or 500 systems each in a 4-9's, with extra cash in hand?

    --

    What are we going to do tonight Brain?
    1. Re:RAID by jeffmeden · · Score: 2, Informative

      Why go with a huge, multiple 9's datacenter, when you can go the way of google, and have a RAID: Redundant Array of Inexpensive Datacenters.. Is it really better to have 1000 machines in a 5-9's location, or 500 systems each in a 4-9's, with extra cash in hand?

      That all depends. A 5 9s datacenter is a full ten times more reliable than a 4 9s datacenter (mathematically speaking). So, all things being equal (again, mathematically), you would need ten 4-9 centers to be as reliable as your one 5-9 center. However geographic dispersion, outage recovery lead time, bandwidth costs, maintenance, etc. can all factor in to sway the equation either way. It really comes down to itemizing your outage threats, pairing that with the cost of redundancy for each threatened component, and then looking at the cost of downtime as part of the business process. It's rarely as simple as "why not just build two at twice the price".
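      How many 4-9 sites it takes to match one 5-9 site depends heavily on the failure model. A minimal sketch under one assumed model: site failures are independent, and any single surviving site can carry the full load. Neither assumption comes from the thread, and the factors listed above (shared threats, recovery lead time, bandwidth) are exactly what can break them. The function names are hypothetical.

```python
import math


def unavailability(nines: int) -> float:
    """Fraction of time a site with the given number of 9's is down."""
    return 10.0 ** -nines


def combined_unavailability(nines: int, sites: int) -> float:
    """Fraction of time ALL sites are down at once.

    Assumes independent failures and that one surviving site
    can serve the whole load (both are assumptions).
    """
    return unavailability(nines) ** sites


# A single 4-9 site is down ten times as often as a single 5-9 site.
ratio = unavailability(4) / unavailability(5)

# Two independent 4-9 sites are both down far less often than one 5-9 site.
two_sites = combined_unavailability(4, 2)
equivalent_nines = -math.log10(two_sites)

print(f"4-9 vs 5-9 downtime ratio: {ratio:.1f}x")
print(f"two 4-9 sites, both down:  {two_sites:.1e} of the time")
print(f"equivalent number of 9's:  {equivalent_nines:.1f}")
```

      If the failures are correlated (same power grid, same software bug, same operator error), the combined figure moves back toward that of a single site, which is why the itemized threat list matters more than the headline number of 9's.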

  8. uptime matters by Spazmania · · Score: 2, Insightful

    Designing nontrivial systems without single points of failure is difficult and expensive. Worse, it has to be built in from the ground up. Which it rarely is: by the time a system is valuable enough to merit the cost of a failover system, the design choices which limit certain components to single devices have long since been made.

    Which means uptime matters. 1% downtime is more than 3 days a year. Unacceptable.

    The TIA-942 data center tiers are a formulaic way of achieving satisfactory uptime. They've been carefully studied and statistically tier-3 data centers achieve three 9's uptime (99.9%) while tier-4 data centers achieve four 9's. Tiers 1 and 2 only achieve two 9's.
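    For reference, the uptime percentages above convert to yearly downtime as follows. This is a small sketch assuming a 365-day year; the helper name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours of downtime per year implied by an uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


for label, pct in [("two 9's", 99.0), ("three 9's", 99.9), ("four 9's", 99.99)]:
    hours = downtime_hours_per_year(pct)
    print(f"{label} ({pct}%): {hours:.2f} hours/year, or {hours / 24:.2f} days")
```

    At 99% this works out to 87.6 hours, or 3.65 days, which is the "more than 3 days a year" figure; each additional 9 divides that by ten.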

    Are there other ways of achieving the same or better uptime? Of course. But they haven't been as carefully studied, which means you can't assign as high a confidence to your uptime estimate.

    Is it possible to build a tier-4 data center that doesn't achieve four 9's? Of course. All you have to do is put your eggs in one basket (like buying all the same brand of UPS) and then have yourself a cascade failure. But with a competent system architect, a tier-4 data center will tend to achieve at least 99.99% annual uptime.

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
  9. Re:No by Forge · · Score: 2, Informative

    Sometimes people do irrational things in a data center. E.g., where I live/work the electricity company is notoriously unreliable. We had a 5 minute outage this morning for no apparent reason, and we had 3 last week of varied durations. This is in the heart of the business district, where power is most reliable.

    Because of this our data center has redundant UPSes and redundant generators. All but the least critical servers have dual power supplies, plugged into independent circuits.

    We have multiple ACs but they are not strictly set up to be redundant. When one breaks down we have to haul standing fans to the area to keep the machines cool enough while the AC is repaired.

    The stupid thing though is that most of the smaller switches have a single power supply and most machines are plugged into a single switch. So our last UPS failure resulted in two whole racks of servers being inaccessible for 15 minutes, while I ran over there, figured out what the problem was and plugged the switch into a neighboring rack.

    --
    --= Isn't it surprising how badly I spell ?