Can Maintenance Make Data Centers Less Reliable?

Posted by samzenpus on Sunday November 27, 2011 @06:16AM from the if-it-isn't-broken dept.

miller60 writes "Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts. 'The most common threat to reliability is excessive maintenance,' said Steve Fairfax of 'science risk' consultant MTechnology. 'We get the perception that lots of testing improves component reliability. It does not.' In some cases, poorly documented maintenance can lead to conflicts with automated systems, he warned. Other speakers at the recent 7x24 Exchange conference urged data center operators to focus on understanding their own facilities, and then evaluating which maintenance programs are essential, including offerings from equipment vendors."

In between maybe? by anarcat · 2011-11-27 06:20 · Score: 5, Insightful

Maybe there's a sweet spot between "no testing at all" and "replacing everything every three months"? In my experience, there is a lot of work to do in most places to make sure that proper testing is done, or at least that emergency procedures are known and people are well trained in them. Very often documentation is lacking and the onsite support staff have no clue where that circuit breaker is. That is the most common scenario in my experience, not overzealous maintenance.

--
Semantics is the gravity of abstraction
1. Re:In between maybe? by Elbereth · 2011-11-27 06:40 · Score: 5, Interesting
  
  I suppose that I'd agree. Back in the early 90s, I inherited from a friend a fear of rebooting, turning off, or performing maintenance on a computer. Half the time he opened the case, the computer would become unbootable or never turn back on. Luckily, as a talented engineer, he could usually fix whatever the problem was, but it was a huge pain in the ass. Of course, back then, commodity computer hardware was hugely unreliable, with vast gaps in quality between price ranges, and we were working with pretty cheap stuff. Still, to this day, I dread the thought of turning off a computer that has been working reliably. You never know when some piece of crap component is nearing the end of its life, and the stress of a power cycle could be what pushes it over the edge into oblivion (or highly unreliable behavior). I used to be fond of constantly messing with everything, fixing it until it broke, but his influence moderated that impulse in me, to the point where I usually freak out when anyone suggests unnecessarily rebooting a computer. Surely, there's something to be said for preventative maintenance, and I'd rather be caught with an unbootable PC during regularly scheduled maintenance than suddenly experience catastrophic failure at random, but there's also something to be said for just leaving the shit alone and not messing with it. Every time you touch that computer, there's a slight chance that you'll accidentally delete a critical file directory, pull out a cable, or knock loose a power connector. The fewer times you come into contact with the thing, the better. If I could build a force field around every PC, I probably would.
2. Re:In between maybe? by mehrotra.akash · 2011-11-27 07:02 · Score: 5, Funny
  
  fixing it until it broke
  That's the spirit!!
3. Re:In between maybe? by greenfruitsalad · 2011-11-27 12:25 · Score: 5, Interesting
  
  i can't agree. i used to but now i cannot afford to.
  we recently experienced 2 catastrophes (datacentre-wide downtimes, you know, things that NEVER happen) and the results were unbelievable. GRUBs failed to load OSes, machines were without a bootloader (due to emergency disk hotswaps), some machines simply didn't turn on, services didn't autostart, a few virtual servers autostarted on multiple hosts (instead of just one), fsck on some of our volumes took hours to finish, 30% of supermicro IPMI cards were unresponsive, etc. it revealed that almost nobody had followed procedures properly.
  after that, every single service we have is built in a clustered manner with nodes spread across multiple datacentres. I now restart machines and pull cables at regular intervals to test bgp/ospf, clustering, recoveries, to check filesystems, etc. i am now also ABLE TO SLEEP.
Security updates by bjb_admin · 2011-11-27 06:26 · Score: 5, Informative

Sometimes I get the feeling that security updates can in most cases cause more problems than the issues they are meant to fix.

I can think of many occasions when a security update has broken a server/router/etc. Obviously the lack of a security update can lead to a bigger headache in the future. But the typical user doesn't understand and has the attitude "IT broke the server again".

If a virus or hacker causes an issue the attitude is "I hope they fix that soon. I hate viruses/hackers" (obviously this is a huge generalization).
Can faulty logic make data centers less reliable? by DragonHawk · 2011-11-27 06:44 · Score: 5, Insightful

From TFS:

"... poorly documented maintenance can lead to conflicts with automated systems ..."
That doesn't mean maintenance makes datacenters less reliable. It means cluelessness makes datacenters less reliable.
Sheesh.

--

dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
Maintenance-induced failure. by Animats · 2011-11-27 06:46 · Score: 5, Insightful

There's something to be said for this. Back when Tandem was the gold standard of uptime (they ran 10 years between crashes, and had a plan to get to 50), they reported that about half of failures were maintenance-induced. That's also military experience.
The future of data centers may be "no user serviceable parts inside". The unit of replacement may be the shipping container. When 10% or so of units have failed, the entire container is replaced. Inktomi ran that way at one time.
You need the ability to cut power to units remotely, very good inlet air filters to prevent dust buildup, and power supplies which meet all UL requirements for not catching fire when they fail. Once you have that, why should a homogeneous cluster ever need to be entered during its life?
1. Re:Maintenance-induced failure. by DarthBart · 2011-11-27 07:13 · Score: 5, Insightful
  
  There's also been a shift in the mentality of how well computers operate. It went from not tolerating any kind of downtime to the Windows mentality of crashing and "That's just how computers are".
Re:Maintenance took down Chernobyl by crankyspice · 2011-11-27 06:51 · Score: 5, Informative

That being said, it was because their procedures were shit, not because they were doing maintenance.
Actually, no, the Chernobyl disaster was sparked by a 'live' test of a new, untested mechanism for powering reactor cooling systems in the event of a disaster that brought down the power grid. http://en.wikipedia.org/wiki/Chernobyl_disaster#The_attempted_experiment (And even that test was delayed several hours, into a shift of workers who weren't properly prepared to conduct the test.)

--
geek. lawyer.
This is well known from Formula One by igb · 2011-11-27 07:04 · Score: 5, Interesting

Some years ago, the F1 rules were changed so that cars are in parc ferme conditions, with strict limits on what can be done to them, from the start of qualifying on Saturday lunchtime until the race finishes on Sunday afternoon.
The purpose was partly to stop qualifying being its own arms race, with cars in a completely different specification from the one used for the race, and partly to reduce costs and the number of travelling staff. At the same time, "T Cars" --- a third car, available as a spare --- were banned, so that if a driver destroys a car in practice the team either have to rebuild it or not race. They're allowed to travel with a spare monocoque, but it cannot be built up and it does not get pit space.
There were endless howlings from the teams, claiming that without a complete strip-down after qualifying, with a large crew working overnight to check everything on the car, reliability would go through the floor and races would finish with only a handful of stragglers fighting a durability battle (our US viewers may find this ironic in light of a certain US Grand Prix, of course).
The same argument was advanced, mutatis mutandis, over limitations on engines and gearboxes, limitations on the number of gear clusters available, limitations on certain forms of telemetry and a wide variety of "the cars can't just be left to run themselves, you know" interventions.
In fact, reliability is now far greater than ten years ago. It's not uncommon for there to be no mechanical retirements, certainly not from the longer-standing teams, and the days of engines imploding on the track are long gone. A front-running driver will probably only have one, if even that, mechanical DNF per season. The teams deliver a functioning car when the pit lane opens at 1pm Saturday, and that car then runs twenty or thirty laps in qualifying and sixty or seventy in the race, a total of perhaps 250 miles, without much maintenance work beyond tyres, fluids and batteries (section 34.1 on page 18 of the sporting regulations).
So again, we see that "preventative maintenance" turns out to really be "provocative maintenance", and leaving working machines alone is the best medicine for them.