Can Maintenance Make Data Centers Less Reliable?
miller60 writes "Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts.'The most common threat to reliability is excessive maintenance,' said Steve Fairfax of 'science risk' consultant MTechnology. 'We get the perception that lots of testing improves component reliability. It does not.' In some cases, poorly documented maintenance can lead to conflicts with automated systems, he warned. Other speakers at the recent 7x24 Exchange conference urged data center operators to focus on understanding their own facilities, and then evaluating which maintenance programs are essential, including offerings from equipment vendors."
Maybe there's a sweet spot between "no testing at all" and "replacing everything every three months"? In my experience, there is a lot of work to do in most places to make sure that proper testing is done, or at least that emergency procedures are known and people are well trained in them. Very often documentation is lacking and the onsite support staff have no clue where that circuit breaker is. That is the most common scenario in my experience, not overzealous maintenance.
Semantics is the gravity of abstraction
I believe the article is referring to major hardware replacements, stress testing, etc. But there is other preventative or even detective work that needs to be done in data centers large and small that have nothing to do with equipment. You can't just blithely assume that things are always going to work as they are supposed to work. One time, we discovered that the camera server for one of our clients had stopped recording for no good reason, and upon closer inspection discovered that the hard drive failed and we had no alert system in place since it wasn't a "real" server but just a heavy duty XP machine. After that blunder, I was asked to check on all the cameras servers once a week and make sure I could actually open up and view recordings from days past. This is a preventative action, but not really a maintenance one.
Occasionally living proof of the Ballmer peak.
Sometimes I get the feeling that security updates can in most cases cause more problems than the issues themselves.
I can think of many occasions that a security update has broken a server/router/etc. Obviously the lack of a security update can lead to a bigger headache in the future. But the typical user doesn't understand and has the attitude "IT broke the server again".
If a virus or hacker causes an issue the attitude is "I hope they fix that soon. I hate viruses/hackers" (obviously this is a huge generalization).
===
"Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts.'
===
It isn't just human error: the very act of performing intrusive tasks under the theory of "preventative maintenance" can greatly reduce reliability of systems built of reasonably reliable components. This was studied extensively by the US airlines, US FAA, and later the USAF in the 1970s when the concept of reliability centered maintenance was developed for turbine engines and eventually full airliners. Look up the classic report by Nowland & Heap. Very much counter-intuitive if one has been trained to believe in the classics of "preventative teardowns" and fully known failure probability distribution functions, but matches up well to what experience field mechanics have been saying since the days of the pyramid construction.
sPh
Of course, today there is a huge "RCM" consulting industry, 7-step programs, etc that bears little resemblance to the original research and theories; don't confuse that with the core work.
That being said, it was because their procedures were shit, not because they were doing maintenance.
The guy at the garage always recommends I do an $80 transmission flush.
Seriously...I sometimes think the average IQ is dropping on a daily basis (and, yes, I get the irony)...Both with what I read, and my own experiences working in IT, I become more and more convinced that society will eventually collapse under the weight of bad advice from consultants (and, no, I don't own a fallout shelter)...and I spend more and more time thinking about ways that I can profit off of the stupidity of leadership.
From TFS:
"... poorly documented maintenance can lead to conflicts with automated systems ..."
That doesn't mean maintenance makes datacenters less reliable. It means cluelessness makes datacenters less reliable.
Sheesh.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
There's something to be said for this. Back when Tandem was the gold standard of uptime (they ran 10 years between crashes, and had a plan to get to 50), they reported that about half of failures were maintenance-induced. That's also military experience.
The future of data centers may be "no user serviceable parts inside". The unit of replacement may be the shipping container. When 10% or so of units have failed, the entire container is replaced. Inktomi ran that way at one time.
You need the ability to cut power off of units remotely, very good inlet air filters to prevent dust buildup, and power supplies which meet all UL requirements for not catching fire when they fail. Once you have that, why should a homogeneous cluster ever need to be entered during its life?
Check your transfer switch ratings. I guarantee it will be spec'd much lower than you think. The electricians think it'll only be switched a couple times in its life. The diesel service provider thinks you're running it twice a week. Whoops. If you run it once a week, it'll only survive a couple years, then you'll get a facility wide multi-hour outage. I've personally seen it over and over again over the past two decades. The best part is "we have a procedure" so it'll only be run during maint hours and the desk jockeys 200 miles away will run it rain or shine, so its guaranteed that the xfer switch destroys itself at 2 am during a blizzard and it'll take half a day to repair.
Very few xfer switches are more reliable than commercial utility power. Installing a UPS actually lowers reliability in almost all professional situations.
My favorite power outage was caused by a gas leak a couple blocks away, where the utility co shut down the AC and then threatened to take an axe to the gen/UPS if not also shut off. This was not in the official written report, just word of mouth.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Planned obsolescence has been promoted in all aspects of life since post WW2 and now it is hard to imagine the world without it. That line of thinking has been creeping into everything even in areas where it doesn't seem to apply.
Does this play a factor on the perception of preventative maintenance or its frequent application? I think it probably does in at least a couple ways, don't you?
Democracy Now! - uncensored, anti-establishment news
I read through the entire article, and saw zero data to support his assertion. I'm sure he has the data, but the article didn't reference a single piece of it. Without any data to support the theory all we have is a fluff opinion piece. Shame on Data Center Knowledge for writing an article about a scientific investigation, and not presenting a single piece of scientific evidence.
AccountKiller
Which means for every online server you need an offline test machine -- and a way to simulate the operating environment in order to test. Not many companies have the skill of cash to do that.
The purpose was partly to stop qualifying being its own arms race, with cars in completely different specification than for the race, and partly to reduce costs and the number of travelling staff. At the same time, "T Cars" --- a third car, available as a spare --- were banned, so that if a driver destroys a car in practice the team either have to rebuild it or not race. They're allowed to travel with a spare monocoque, but it cannot be built-up and it does not get pit space.
There were endless howlings from the teams, claiming that without a complete strip-down after qualifying, with a large crew working overnight to check everything on the car, reliability would go through the floor and races would finish with only a handful of stragglers fighting a durability battle (our US viewers may find this ironic in light of a certain US Grand Prix, of course).
The same argument was advanced, mutatis mutandis, over limitations on engines and gearboxes, limitations on the number of gear clusters available, limitations on certain forms of telemetry and a wide variety of "the cars can't just be left to run themselves, you know" interventions.
In fact, reliability is now far greater than ten years ago. It's not uncommon for there to be no mechanical retirements, certainly not from the longer-standing teams, and the days of engines imploding on the track are long gone. A front-running driver will probably only have one, if even that, mechanical DNF per season. The teams deliver a functioning car when the pit lane opens at 1pm Saturday, and that car then runs twenty or thirty laps in qualifying and sixty or seventy in the race, a total of perhaps 250 miles, without much maintenance work beyond tyres, fluids and batteries (section 34.1 on page 18 of the sporting regulations).
So again, we see that "preventative maintenance" turns out to really be "provocative maintenance", and leaving working machines alone is the best medicine for them.
...is the one farthest from the nearest engineer."
Consider The Pioneer and Voyager spacecraft and the Mars landers.
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
vigorous maintenance
excessive maintenance
poorly documented maintenance
Those are all qualified as out of the ordinary. Anything in excess (on either side of the scale, whether it is too much or not enough) is a problem. Of course maintenance must be performed, but I guess some data centers have a strange idea of best practices, or they do not follow them.
Twinstiq, game news
Precisely my thought.
Maintenance, like anything else you do in a datacenter or wherever you work, must be done correctly. If maintenance reduces the reliability of the maintained entity, then per definition, it was not correctly performed.
Doing something correctly requires knowledge, planning and training. Just like everything else.
Although everyone makes mistakes, some people make hundreds of times more errors than others. Whether that's due to inherent lack of ability, poor training, lacking oversight, laziness, time pressures or just a slapdash attitude varies with each person. One place I was involved with (as an external consultant) made over 12,000 changes to their production systems every year. It turned out that well over half of those were backing out earlier changes, correcting mistakes/bugs from earlier "fixes" or other activities (a lot that resulted in downtime, and far too much of it unscheduled or emergency downtime) that should not have happened and could have been prevented.
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
You must not deal with any Oracle database servers. They leak like a sieve.
[John]
Shit better not happen!
http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/100000/20000/5000/600/125621/125621.strip.zoom.gif
This post contains no rudeness or derision of any kind. All arguments are friendly. Terms and exclusions may apply.