Slashdot Mirror


Can Maintenance Make Data Centers Less Reliable?

miller60 writes "Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts.'The most common threat to reliability is excessive maintenance,' said Steve Fairfax of 'science risk' consultant MTechnology. 'We get the perception that lots of testing improves component reliability. It does not.' In some cases, poorly documented maintenance can lead to conflicts with automated systems, he warned. Other speakers at the recent 7x24 Exchange conference urged data center operators to focus on understanding their own facilities, and then evaluating which maintenance programs are essential, including offerings from equipment vendors."

4 of 185 comments (clear)

  1. In between maybe? by anarcat · · Score: 5, Insightful

    Maybe there's a sweet spot between "no testing at all" and "replacing everything every three months"? In my experience, there is a lot of work to do in most places to make sure that proper testing is done, or at least that emergency procedures are known and people are well trained in them. Very often documentation is lacking and the onsite support staff have no clue where that circuit breaker is. That is the most common scenario in my experience, not overzealous maintenance.

    --
    Semantics is the gravity of abstraction
  2. Can faulty logic make data centers less reliable? by DragonHawk · · Score: 5, Insightful

    From TFS:

    "... poorly documented maintenance can lead to conflicts with automated systems ..."

    That doesn't mean maintenance makes datacenters less reliable. It means cluelessness makes datacenters less reliable.

    Sheesh.

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
  3. Maintenance-induced failure. by Animats · · Score: 5, Insightful

    There's something to be said for this. Back when Tandem was the gold standard of uptime (they ran 10 years between crashes, and had a plan to get to 50), they reported that about half of failures were maintenance-induced. That's also military experience.

    The future of data centers may be "no user serviceable parts inside". The unit of replacement may be the shipping container. When 10% or so of units have failed, the entire container is replaced. Inktomi ran that way at one time.

    You need the ability to cut power off of units remotely, very good inlet air filters to prevent dust buildup, and power supplies which meet all UL requirements for not catching fire when they fail. Once you have that, why should a homogeneous cluster ever need to be entered during its life?

    1. Re:Maintenance-induced failure. by DarthBart · · Score: 5, Insightful

      There's also been a shift in the mentality of how well computers operate. It went from not tolerating any kind of downtime to the Windows mentality of crashing and "That's just how computers are".