Can Maintenance Make Data Centers Less Reliable?

In between maybe? by anarcat · 2011-11-27 06:20 · Score: 5, Insightful

Maybe there's a sweet spot between "no testing at all" and "replacing everything every three months"? In my experience, there is a lot of work to do in most places to make sure that proper testing is done, or at least that emergency procedures are known and people are well trained in them. Very often documentation is lacking and the onsite support staff have no clue where that circuit breaker is. That is the most common scenario in my experience, not overzealous maintenance.

--
Semantics is the gravity of abstraction

Re:In between maybe? by Elbereth · 2011-11-27 06:40 · Score: 5, Interesting

I suppose that I'd agree. Back in the early 90s, I inherited from a friend a fear of rebooting, turning off, or performing maintenance on a computer. Half the time he opened the case, the computer would become unbootable or never turn back on. Luckily, as a talented engineer, he could usually fix whatever the problem was, but it was a huge pain in the ass. Of course, back then, commodity computer hardware was hugely unreliable, with vast gaps in quality between price ranges, and we were working with pretty cheap stuff. Still, to this day, I dread the thought of turning off a computer that has been working reliably. You never know when some piece of crap component is nearing the end of its life, and the stress of a power cycle could be what pushes it over the edge into oblivion (or highly unreliable behavior). I used to be fond of constantly messing with everything, fixing it until it broke, but his influence moderated that impulse in me, to the point where I usually freak out when anyone suggests unnecessarily rebooting a computer. Surely, there's something to say for preventative maintenance, and I'd rather be caught with an unbootable PC during regularly scheduled maintenance than suddenly experience catastrophic failure randomly, but there's something to be said for just leaving the shit alone and not messing with it. Every time you touch that computer, there's a slight chance that you'll accidentally delete a critical file directory, pull out a cable, or knock loose a power connector. The fewer the times you come into contact with the thing, the better. If I could build a force field around every PC, I probably would.
Re:In between maybe? by mehrotra.akash · 2011-11-27 07:02 · Score: 5, Funny

fixing it until it broke
That's the spirit!!
Re:In between maybe? by sphealey · 2011-11-27 07:02 · Score: 3, Informative

===
Back in the early 90s, I inherited from a friend a fear of rebooting, turning off, or performing maintenance on a computer. Half the time he opened the case, the computer would become unbootable or never turn back on.
===
Neither you nor your friend is alone in thinking that:
AD-A066579, RELIABILITY-CENTERED MAINTENANCE, Nowlan & Heap, (DEC 1978) [this used to be available for download from the US Dept of Commerce web site; now appears to be behind a US government paywall (!)]
A more recent summary:
http://reliabilityweb.com/index.php/articles/maintenance_management_a_new_paradigm/
sPh
Re:In between maybe? by 9jack9 · 2011-11-27 07:15 · Score: 1

Are you sure you're on the right web site?
Re:In between maybe? by AliasMarlowe · 2011-11-27 07:18 · Score: 4, Informative

It lives on also among the DoD's general specifications, and can be downloaded from this page.

--
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
Re:In between maybe? by mspohr · 2011-11-27 07:57 · Score: 4, Interesting

Do you know why satellites last so long in a hostile environment?... because nobody touches them.
"If it's not broken, don't fix it."

--
I don't read your sig. Why are you reading mine?
Re:In between maybe? by dave562 · 2011-11-27 08:11 · Score: 2

I am still that way with firmware upgrades. I think it probably has something to do with our generation. In the 90s, computer hardware was touchy and was expensive to replace. If you're like me, you probably grew up blowing into Nintendo game cartridges when they did not work. But back to firmware, I only upgrade it when necessary. Over the last fifteen years I have seen too many firmware upgrades bork hardware that was working just fine. With security patches I do them monthly, but not firmware. And never Cisco IOS. Once the config is good, leave it be!
Re:In between maybe? by CyprusBlue113 · 2011-11-27 08:26 · Score: 4, Insightful

Do you know why satellites last so long in a hostile environment?... because nobody touches them.
"If it's not broken, don't fix it."
Actually I'm pretty sure it's the millions that are spent engineering each individual one so that it specifically can survive many years in said hostile environment.
If we spent anywhere near what is spent on proper engineering in time and money, everyday crap would be pretty damn reliable too, just not nearly as cost effective.

--
a handful of selfish greedy people are no match for millions of selfish, greedy people -u4ya
Re:In between maybe? by Libertarian001 · 2011-11-27 08:43 · Score: 1

Actually, the components would be pretty close to as cost effective. The executives just wouldn't get their uber-fat rewards.
Re:In between maybe? by jaymz666 · 2011-11-27 09:13 · Score: 1

how many are hooked into the internet running an os that gets attacked?
Re:In between maybe? by Lorens · 2011-11-27 09:55 · Score: 1

I've seen for myself that hard disks that run for a long time (years) have problems starting up again after a power off. I've long supposed that it had to do with some bearings wearing out or oil getting used up. RAID is of course the correct answer to that, but even if I have to offline a service for some reason, I've gotten into the habit of not powering off the second side of a HA pair until the first one is safely back up.
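The habit described above (never power off the second side of an HA pair until the first is safely back up) can be sketched in Python. This is only an illustration under stated assumptions: `rolling_restart`, `restart`, `is_healthy`, and `max_checks` are hypothetical names, and the two callables stand in for whatever power-control and health-check tooling a site actually has.

```python
from typing import Callable, List, Sequence


def rolling_restart(
    nodes: Sequence[str],
    restart: Callable[[str], object],
    is_healthy: Callable[[str], bool],
    max_checks: int = 5,
) -> List[str]:
    """Restart nodes one at a time, stopping as soon as one fails to return.

    `restart` and `is_healthy` are placeholders (assumptions) for real
    site-specific tooling. A real version would also sleep between checks.
    """
    restarted = []
    for node in nodes:
        restart(node)
        # Poll the health check a bounded number of times; any() stops
        # at the first successful check.
        if not any(is_healthy(node) for _ in range(max_checks)):
            # Leave the remaining nodes untouched so service stays up.
            raise RuntimeError(f"{node} did not come back; aborting")
        restarted.append(node)
    return restarted
```

The point of the design is that a failed restart raises before the next node is touched, so the surviving half of the pair keeps serving.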
Re:In between maybe? by mabhatter654 · 2011-11-27 10:21 · Score: 4, Insightful

if that's the case, you don't have CONTROL over your equipment.
That was acceptable for Windows 95 but not even for desktop PCs anymore, let alone server equipment. My opinion is that your equipment isn't stable UNTIL you can turn it off and on again reliably. And yes... that is an ENORMOUS amount of work.
If you can't reliably replace individual pieces then you don't have control for maintenance... sure you can stick your head in the sand and just not touch anything... but that's just piling up all the things you didn't take time to figure out until some critical time later.
Re:In between maybe? by datavirtue · 2011-11-27 10:22 · Score: 1

Buy good stuff, document, have on-line test systems, and keep replacement hardware on-hand. No maintenance required; why mess with stuff if it is working? If you break a critical system in the midst of maintenance you have to either lie about it, or fess up and explain to management that you were dinking with it.

--
I object to power without constructive purpose. --Spock
Re:In between maybe? by sphealey · 2011-11-27 10:27 · Score: 1

Thanks!
sPh
Re:In between maybe? by Anonymous Coward · 2011-11-27 10:29 · Score: 2, Interesting

If the new or refurbished electronics you're buying are THAT unreliable, why the !%!@#$!@%! are you using them?
If a router fails to come up because a cap is ready to blow, what happens when it blows WHILE IT'S RUNNING?
I had that happen with 2 Cisco ASA firewalls. One was 5 years old, the other was a few months. They were using HSRP and decided fighting amongst each other for control was a great idea because one of the ports was going out. We took the old one offline; wouldn't turn on anymore. The new one? Worked fine.
Over a long enough time-line the failure rate for equipment is 100%. Equipment is usually rated with an MTBF; there's LOTS of documentation on when you replace. You replace laptops every 2 years, desktops and servers every 3, networking equipment every 4, appliances per the manufacturer's specs, and the LAN copper & fiber either when you're doing a major rebuild or when the kit is being replaced.
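To put a rough number on "over a long enough time-line the failure rate is 100%": if one assumes a constant failure rate (the simple exponential model, which ignores infant mortality and wear-out and is an assumption here, not something the comment states), the chance that a unit has failed by time t is 1 - exp(-t / MTBF). A minimal Python sketch, with a hypothetical function name:

```python
import math


def failure_probability(years_in_service: float, mtbf_years: float) -> float:
    """Probability that a unit has failed by `years_in_service`.

    Assumes a constant failure rate (exponential model). Real hardware
    often follows a "bathtub" curve instead, so treat this as a rough
    planning aid only.
    """
    if mtbf_years <= 0:
        raise ValueError("mtbf_years must be positive")
    return 1.0 - math.exp(-years_in_service / mtbf_years)
```

Under this model the probability climbs toward 1 but never reaches it, and a unit kept in service for exactly one MTBF has already failed with probability of about 63%. That is one way to see why fixed replacement intervals are set far shorter than the rated MTBF.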
If management is too incompetent to tell what the TCO for a mission critical project is and budget the cash for replacements, why are you working for them?
Rebooting servers is something that needs to happen, depending on the OS, monthly, quarterly and for high-end enterprise systems, biannually. What happens if you don't reboot and purge errors on a schedule? E.g., for a Windows file server, you reboot monthly, run chkdsk, export settings via config files (or run it in a VM) at the BARE minimum and run backups. When you build a database you need to build a routine to purge bad data every once in a while. For a web server, a nightly reboot is commonplace.
I worked at a warehouse a few years back; 500k+ sq feet, 500+ employees. They didn't invest in their tech and when their Oracle DB went corrupt, they didn't even have backups. Someone at corporate devised a way to use the corporate records to rebuild their records; 2 weeks later they were back up and running but not before losing 2 vendors. The cost of three 9's for them was right around 80k for the install and ~20k/year thereafter. The cost of the failure was nearly 2 million; the vendors that did stay required they provide expedited shipping to their customers. Did I mention it went down during the Christmas shipping season?
Who paid for that?
If you're running in an environment that badly maintained, you're the managerially-acceptable fall-guy to justify their bonuses; if the equipment is in such a bad state that you're afraid of it, you should be looking for work at a company that does things right.
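The warehouse figures in the comment above lend themselves to a quick back-of-the-envelope check. The Python sketch below uses only the numbers quoted there (80k install, ~20k/year, a failure costing nearly 2 million); the function names are hypothetical, and reading "thereafter" as a flat yearly cost on top of the install is an assumption.

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


def protection_cost(years: int, install: int = 80_000, yearly: int = 20_000) -> int:
    """Cumulative cost of the install plus `years` of the yearly fee."""
    return install + yearly * years
```

"Three 9's" (99.9%) allows about 8.76 hours of downtime a year, against the two weeks the warehouse actually lost. On the quoted prices, five years of protection comes to 180k, and it would take 96 years of yearly fees for the total to reach the 2 million that the single failure cost.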
Re:In between maybe? by Sadsfae · 2011-11-27 11:03 · Score: 1

Reboot a webserver nightly?
You must be doing something wrong if that's required.

--
Have a squat over at the hobo house.
Re:In between maybe? by Anonymous Coward · 2011-11-27 11:07 · Score: 0

There is some truth to that. But what that allows you to do is be prepared for and schedule your failures.
Plan your maintenance on the machine at a time when you have spares and its availability isn't critical.
Because if you do not, that time the power goes out for 3 hours at your office and your UPSes are dead and the system shuts off, is not the time to find out the machine hasn't been able to POST in 6 months. Now you are at a critical situation, likely during business hours, and have to scramble even more than if you had planned a reboot of the system.
Re:In between maybe? by Anonymous Coward · 2011-11-27 12:14 · Score: 0

Simple solution: Monitoring
As much as datacenter admins may hate "playing whack-a-mole" sometimes it is the best way to handle the situation. This is one of those times. A good, fully automated monitoring and diagnostics system, combined with human intervention ONLY when a problem arises seems to be the perfect balance. Design your disaster recovery systems, test them once, and if they work, quit fucking with the working system. Just sit on your hands, watch your system monitor, and don't mess with something that's working. When a problem arises, close it, patch that one hole (and don't do a full audit, just patch that one hole) and go back to watching your monitors.
As much as you may not like this, attentive eyeballs are as effective as fast fingers, and are incapable of causing the human errors the latter are prone to.
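The "automated monitoring, human intervention only when a problem arises" approach described above might look something like this minimal Python sketch. The name `run_checks` and the check names in the usage note are hypothetical; real checks would probe disks, services, and so on without changing anything on the system.

```python
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every named check and return the names of the ones that failed.

    A check that raises an exception is counted as a failure rather than
    crashing the monitor itself. Checks are assumed to be read-only
    probes; nothing here modifies the monitored system.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

A caller would page a human only when the returned list is non-empty, for example `run_checks({"disk": disk_ok, "web": web_ok})`, where both callables are site-specific.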
Re:In between maybe? by greenfruitsalad · 2011-11-27 12:25 · Score: 5, Interesting

i can't agree. i used to but now i cannot afford to.
we recently experienced 2 catastrophes (datacentre-wide downtimes, you know things that NEVER happen) and the results were unbelievable. GRUBs failed to load OSes, machines were without a bootloader (due to emergency disk hotswaps), some machines simply didn't turn on, services didn't autostart, a few virtual servers autostarted on multiple hosts (instead of just one), fsck on some of our volumes took hours to finish, 30% of supermicro IPMI cards were unresponsive, etc. it revealed that almost nobody had followed procedures properly.
after that, every single service we have is built in a clustered manner with nodes spread across multiple datacentres. I now restart machines and pull cables at regular intervals to test bgp/ospf, clustering, recoveries, to check filesystems, etc. i am now also ABLE TO SLEEP.
Re:In between maybe? by Anonymous Coward · 2011-11-27 12:40 · Score: 0

Your friend sounds like a "strange attractor", they can be luddites or geeks though geeks can at least fix the aftermath of the effect. Check out "Requiem for a Ruler of Worlds" by Brian Daley for what that means. :)
Ah the 1990s. That sounds like "Magitronic" motherboards. We had issues with those and I nursed them along with a version of " red Cramolin" that's no longer available and never powering them off or removing cards. My boss decides he's going to 'improve' something. He removes a card and is shocked to find the card connector still attached. He calls me in a panic that it needs 'fixed right now'. It was a shame I was a couple hundred miles away and "my car is broken" and "I have no money". I ended up having him send me money that was not a loan, yes he's sleazy, and got in my miraculously cured car and drove 10 minutes from home. I had to use another board I'd never told him about and got the system back up. I made him order two of the latest non-crap-motherboards so I could finish fixing it and have a backup. For once he didn't give me any crap. I quit working there not long after that incident and it went bankrupt not long after.
If kept clean and barring capacitor plague I've had almost no reason to take a server down. The worst are Windows database servers and one IBM antique server running in emulation due to age. They both get slow and need the hard drives defragged and their own maintenance routines run. IBM's offal must be run weekly or more.
I manage three larger servers running several Linux instances and those I just don't mess with other than validating backups. We had one backup solution lie and say everything was OK but what it stored was garbage. The vendor was not helpful. A drive recovery place saved my ass. One of those cases with multiple drives in a RAID failing.
Re:In between maybe? by Kumiorava · 2011-11-27 14:10 · Score: 1

I have seen the same, my long running hard drives wouldn't boot up again. When I opened one of these stopped hard drives the head was glued to the boot sector. I assumed that since boot sector is never used during the 3-5 years of operation it collects some trash that will attach to the reading head when HD is stopped and cools down.
Sometimes moving the case is enough to make bad connections appear (or the opposite: when the computer is brought in for maintenance it starts to work), but pressing down all the cards and chips that are using sockets solves those problems. A vacuum cleaner is also a good tool when talking about an old computer with a mouse nest or something similar in it. Anyway, computers can run well when stopped frequently, as long as you maintain a similar schedule during the lifecycle of the computer.
Re:In between maybe? by Anonymous Coward · 2011-11-27 20:55 · Score: 0

My first thought as well:
why fix it if it ain't broken?
(security risks aside :P )
Re:In between maybe? by Billly+Gates · 2011-11-27 23:41 · Score: 1

In case you haven't noticed, the MBA bean counters caused a severe recession and almost a depression. I wrote an entry last week about how some companies refuse to pay for tape backups! I was dumbfounded!
$1500 was too much for a company making $1,000,000 a year in revenue and the cost accountants refused to implement it. This is normal now and where are you going to go? The company down the street has the same morons who want their bonus. If you read reviews on hard drives on Egghead you will see people with $500,000 mission critical servers rating $79 green crappy WD Caviars at only 5400 rpm for their RAIDs because management doesn't want to pay $300 for each drive that is enterprise grade. OMG the HORRORS.
Many are leaving I.T. and becoming bean counters. It is the wave of the future just like outsourcing. In many cases it makes sense to use the cloud anyway as these guys only look at profitability ratios and think everything is always scarce even if the company makes tons of money. There is nothing you can do and yes, being dollar dumb but penny wise is the standard way to run a business by everyone.

--
http://saveie6.com/
Re:In between maybe? by Anonymous Coward · 2011-11-27 23:44 · Score: 0

Or it's just because they have things that NEVER (NEVER) change like bootloaders. Then, they can issue a command to the bootloader to reboot the system if it needs it. This, and things like watchdog timers, make sure systems stay up and running.
Re:In between maybe? by Anonymous Coward · 2011-11-28 00:27 · Score: 0

"Every time you touch that computer, there's a slight chance that you'll accidentally delete a critical file directory"
That Sir, is because you are an unqualified retard
Re:In between maybe? by Anonymous Coward · 2011-11-28 01:42 · Score: 0

My opinion is that your equipment isn't stable UNTIL you can turn it off and on again reliably.

Perhaps for some cases. But once you get to a certain level of complexity it might not be a good idea :).
Take your heart or brain for instance. You cannot turn them completely off (heart stopped, "brain dead") and on again reliably. There's always a risk they won't start back up properly.
And yet they are typically quite stable for 60+ years without having ever been completely shut off.
Re:In between maybe? by gx5000 · 2011-11-28 03:01 · Score: 1

Cut costs by spreading nonsense.... I lost track of the racks I cried over when on a call in the late 90's...

--
End of Line.
Re:In between maybe? by Anonymous Coward · 2011-11-28 04:12 · Score: 1

Not Fixing It Until It Breaks!
My car mechanic no longer performs 'Routine Maintenance'. He says wait until it breaks or the performance has degraded before repairing. He feels this is important on newer cars and doing things such as changing transmission fluid actually causes problems. He says if it runs fine, then run along. Come back when it is obviously failing. He does do oil changes as that is the only thing that he feels is mileage dependent.
Re:In between maybe? by operagost · 2011-11-28 06:32 · Score: 1

Compliance with most security standards pretty much requires rebooting every Windows server at least once a month now after the security updates come out. Hope you never have to deal with that.

--

Gamingmuseum.com: Give your 3D accelerator a rest.
Re:In between maybe? by Anonymous Coward · 2011-11-28 08:59 · Score: 0

Um, no. Satellite components are hideously non-cost-effective. Just what they're made of ("Silicon-on-sapphire") should give you an idea. A rad-hard FPGA from Actel with the same functionality as a $2.50 chip for a satellite starts at $5000.00 - And goes up from there fast.
AC
Re:In between maybe? by nine-times · 2011-11-28 09:04 · Score: 1

Yes, obviously there is a happy medium and where that compromise lies should be evaluated for the individual situation. I've been very much a fan of the concept that "if it ain't broke, don't fix it". On the other hand, if a situation is so fragile that I literally can't touch the thing without fear of everything collapsing, then something is wrong. Like if I can't reboot a server and expect it to come back up, then I should fix the computer to make it more reliable. If I can't have a server be down for any amount of time, then I need to work to increase the redundancy. If a system is so custom-configured that I don't have any hope of rebuilding it if it dies, then I need to back up the configuration and work towards a more standard setup that a trained monkey could rebuild with proper documentation.
Solutions should be robust. At the same time, I don't think it makes sense to go tearing things apart all the time without reason.
Re:In between maybe? by Anonymous Coward · 2011-11-28 09:49 · Score: 0

You should wake up. I'd bet your system has gotten much more complex and therefore much more prone to failure, and the same people or caliber of people who didn't follow the procedures before have more to follow.

Maintenance and prevention are not always the same by sandytaru · 2011-11-27 06:23 · Score: 3, Interesting

I believe the article is referring to major hardware replacements, stress testing, etc. But there is other preventative or even detective work that needs to be done in data centers large and small that has nothing to do with equipment. You can't just blithely assume that things are always going to work as they are supposed to work. One time, we discovered that the camera server for one of our clients had stopped recording for no good reason, and upon closer inspection discovered that the hard drive had failed and we had no alert system in place since it wasn't a "real" server but just a heavy duty XP machine. After that blunder, I was asked to check on all the camera servers once a week and make sure I could actually open up and view recordings from days past. This is a preventative action, but not really a maintenance one.
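The weekly manual check described above could be partly automated. The Python sketch below only tests whether the newest file in a recordings directory is recent; it does not open or play a recording, so it complements rather than replaces the manual check. The function names, the one-hour default threshold, and the idea that recordings land as files in a single directory are all assumptions for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def newest_recording_age(directory: str, now: Optional[float] = None) -> Optional[float]:
    """Age in seconds of the newest file in `directory`, or None if empty."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(directory).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def is_recording_stale(
    directory: str, max_age_seconds: float = 3600, now: Optional[float] = None
) -> bool:
    """True if there are no recordings or the newest one is too old."""
    age = newest_recording_age(directory, now)
    return age is None or age > max_age_seconds
```

Run from a scheduler, a True result would be the cue to raise an alert, which is the piece that was missing when the camera server's drive failed.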

--
Occasionally living proof of the Ballmer peak.

Security updates by bjb_admin · 2011-11-27 06:26 · Score: 5, Informative

Sometimes I get the feeling that security updates can in most cases cause more problems than the issues themselves.

I can think of many occasions that a security update has broken a server/router/etc. Obviously the lack of a security update can lead to a bigger headache in the future. But the typical user doesn't understand and has the attitude "IT broke the server again".

If a virus or hacker causes an issue the attitude is "I hope they fix that soon. I hate viruses/hackers" (obviously this is a huge generalization).

Re:Security updates by Anonymous Coward · 2011-11-27 07:12 · Score: 0

Incompetent admins.
Re:Security updates by kasperd · 2011-11-27 08:55 · Score: 1

Sometimes I get the feeling that security updates can in most cases cause more problems than the issues themselves.
Some vendors will push security updates and other updates through the same channel, sometimes even without the user knowing if a particular update is fixing a security problem. Occasionally the update will even be installed without the user's consent.

If it were just a matter of only installing updates to fix known security problems and no other changes were made to the software, I think cases of an update causing a problem would be much rarer.

--

Do you care about the security of your wireless mouse?

Reliability Centered Maintenance by sphealey · 2011-11-27 06:26 · Score: 4, Interesting

===
"Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts."
===

It isn't just human error: the very act of performing intrusive tasks under the theory of "preventative maintenance" can greatly reduce reliability of systems built of reasonably reliable components. This was studied extensively by the US airlines, US FAA, and later the USAF in the 1970s when the concept of reliability centered maintenance was developed for turbine engines and eventually full airliners. Look up the classic report by Nowlan & Heap. Very much counter-intuitive if one has been trained to believe in the classics of "preventative teardowns" and fully known failure probability distribution functions, but matches up well to what experienced field mechanics have been saying since the days of the pyramid construction.

sPh

Of course, today there is a huge "RCM" consulting industry, 7-step programs, etc that bears little resemblance to the original research and theories; don't confuse that with the core work.

Re:Reliability Centered Maintenance by sphealey · 2011-11-27 06:28 · Score: 1

===
in the 1970s when
===
Sorry; that should be "1960s" not 70s.
Re:Reliability Centered Maintenance by Anonymous Coward · 2011-11-27 07:40 · Score: 0

FYI Slashdot has quote tags.

Maintenance took down Chernobyl by ExtremeSupreme · 2011-11-27 06:27 · Score: 3, Informative

That being said, it was because their procedures were shit, not because they were doing maintenance.

Re:Maintenance took down Chernobyl by crankyspice · 2011-11-27 06:51 · Score: 5, Informative

That being said, it was because their procedures were shit, not because they were doing maintenance.
Actually, no, the Chernobyl disaster was sparked with a 'live' test of a new, untested mechanism for powering reactor cooling systems in the event of a disaster that brought down the power grid. http://en.wikipedia.org/wiki/Chernobyl_disaster#The_attempted_experiment (And even that test was delayed several hours, into a shift of workers that weren't properly prepared to conduct the test.)

--
geek. lawyer.
Re:Maintenance took down Chernobyl by scamper_22 · 2011-11-27 08:16 · Score: 1

Unfortunately, that is where real world data and experience comes into play.
GIVEN that their procedures were shit, maintenance actually made things worse and thus caused Chernobyl.
Now theoretically, you just need better procedures to make maintenance be a net positive. However, that doesn't change the advice that you shouldn't do such maintenance... GIVEN that your procedures are bad.
Given that humans are error prone and the IT procedures are shit, it is probably good advice to not do the maintenance.
Re:Maintenance took down Chernobyl by Anonymous Coward · 2011-11-27 09:27 · Score: 0

That is precisely the point. Maintenance procedures by their very nature are performed much less often than normal operations so the likelihood of failure either due to human error or erroneous procedures is much higher.
Re:Maintenance took down Chernobyl by vlm · 2011-11-27 09:38 · Score: 2

GIVEN that their procedures were shit, maintenance actually made things worse and thus caused Chernobyl.
I'm guessing you were going for the sarcasm points, but for those who don't know about nuke eng as much as myself and presumably scamper, they had perfectly good procedures for experiment engineering evaluation that they mysteriously chose not to follow, and there was no maintenance involvement at all. It's the opposite of what he was claiming.
The quickie one liner of what happened is an RBMK has an extremely sensitive control loop by the very nature of what it means to be an RBMK, and the engineers who know exactly what happens when you suddenly slam the gain of a control loop like that up to 11 were intentionally cut out of the loop; no one officially knows why; the negative oscillations to zero were not terribly impressive, but everyone noticed the final positive swing to 40 GW or so.
The ironic part is they were trying to improve safety by figuring out an ultra-short-term blackstart capability for the safety systems. It would actually have worked pretty well on a PWR design, which is probably what gave them that peculiar idea. One of the dead guys probably successfully did that "all the time" on his old PWR...

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Re:Maintenance took down Chernobyl by vlm · 2011-11-27 09:45 · Score: 1

I thought about it a bit, and it would have worked on a PWR, but it would have worked REALLY WELL on a BWR, if you can survive the pressure fluctuations, which it probably could have. Of course if they're not running the reactivity control loop past engineering, they're not going to run the hydrodynamic control loop past engineering, so they might have industriously found a way to blow themselves up that way too.
PWRs are dead stable, not terribly sexy, quite heavy and bulky, and have more moving parts. If the coolant is boiling in the core, you're doin' it wrong in a PWR.
BWRs are way less stable (but seem rock solid compared to an RBMK), used to be the new sexiness before pebble bed reactors and all that, smaller and lighter, and have fewer moving parts. If the coolant is boiling in the core, you're doin' it right, in a BWR.
The point is if you survive the surge, the BWR is happy, actually thrilled, to have some core boiling, whereas the PWR might get royally pissed off if the pressurizer can't pressurize.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Re:Maintenance took down Chernobyl by jbengt · 2011-11-27 11:02 · Score: 1

Actually, no, the Chernobyl disaster was sparked with a 'live' test of a new, untested mechanism for powering reactor cooling systems in the event of a disaster that brought down the power grid.
Like the GP said, it was because their procedures were shit, not because they were doing maintenance.
Re:Maintenance took down Chernobyl by Waccoon · 2011-11-27 13:52 · Score: 1

Actually, it was pretty much everything. Flawed design, untested technology, substandard construction, cost cutting, politics galore, arrogant and overconfident leaders, nuclear cowboys with something to prove, unqualified staff, etc.
The disaster itself was huge, but I'm surprised that nuclear power has been as safe as it is given how bad humans are at coping with such enormous responsibilities. Looking back at the Communist years, having a culture driven by financial profit doesn't seem as bad as a culture driven purely by political power.
Re:Maintenance took down Chernobyl by Anonymous Coward · 2011-11-27 15:27 · Score: 0

Looking [back] at the Communist years, having a culture driven by financial profit doesn't seem as bad as a culture driven purely by political power.
We are indeed blessed for being ruled by people after both power and money. I for one couldn't think of anyone more suited to rule than someone who cares only about themselves and how much control they have over their subjects.
Re:Maintenance took down Chernobyl by Anonymous Coward · 2011-11-28 04:40 · Score: 0

BS detector is off the charts. There are books published by the people who were in the control room when things went south and what you wrote above has no relation to what happened. That is the "official version" that USSR apparatchiks put up to shift the blame from the highly connected designers to the operators who did everything by the book.

Re:Maintenance and prevention are not always the s by belrick · 2011-11-27 06:29 · Score: 2

After that blunder, I was asked to check on all the camera servers once a week and make sure I could actually open up and view recordings from days past. This is a preventative action, but not really a maintenance one.

No, it's not preventative. It does nothing to prevent the problem. It detects the problem earlier (before, say, a business user does). That's monitoring. It's proactive, not reactive - perhaps that's what you mean?

where's the car analogy? by Anonymous Coward · 2011-11-27 06:33 · Score: 4, Funny

The guy at the garage always recommends I do an $80 transmission flush.

Re:where's the car analogy? by xs650 · 2011-11-27 15:41 · Score: 2

What he was really recommending was an $80 wallet flush.
Re:where's the car analogy? by NerveGas · 2011-11-28 04:01 · Score: 1

Never do a flush, unless you can afford a new transmission. The correct way is to drop the pan, replace the filter, and replace the fluid.

--
Oh, you're not stuck, you're just unable to let go of the onion rings.
Re:where's the car analogy? by Anonymous Coward · 2011-11-28 09:14 · Score: 0

That one's worth it. every 30k miles. If you don't think so, stop changing your oil too.

Re:Maintenance and prevention are not always the s by sphealey · 2011-11-27 06:36 · Score: 2

===
No, it's not preventative. It does nothing to prevent the problem. It detects the problem earlier (before, say, a business user does). That's monitoring. It's proactive, not reactive - perhaps that's what you mean?
===

It is deeply unclear whether what is traditionally termed "preventative maintenance" (intrusive work involving disassembling, eyeballing, software probing, etc) actually improves reliability over condition monitoring tests followed by break-fix work as described by the parent post. More PM, more procedures, more teardowns, and so forth are the standard prescription for improving reliability but there are metric tons of evidence the universe just doesn't work that way.

sPh

Well that's an answer to yesterday's question... by wilfie · 2011-11-27 06:37 · Score: 1

asking why everyone hates the IT department

More MBA Consultant BS... by Anonymous Coward · 2011-11-27 06:37 · Score: 3, Interesting

Seriously...I sometimes think the average IQ is dropping on a daily basis (and, yes, I get the irony)...Both with what I read, and my own experiences working in IT, I become more and more convinced that society will eventually collapse under the weight of bad advice from consultants (and, no, I don't own a fallout shelter)...and I spend more and more time thinking about ways that I can profit off of the stupidity of leadership.

Re:More MBA Consultant BS... by ColdWetDog · 2011-11-27 08:29 · Score: 1

Seriously...I sometimes think the average IQ is dropping on a daily basis (and, yes, I get the irony)...Both with what I read, and my own experiences working in IT, I become more and more convinced that society will eventually collapse under the weight of bad advice from consultants (and, no, I don't own a fallout shelter)...and I spend more and more time thinking about ways that I can profit off of the stupidity of leadership.
Read it and grab your tin foil and duct tape. You're gonna need lots.

--
Faster! Faster! Faster would be better!

Another lesson relearned by jimbrooking · 2011-11-27 06:43 · Score: 2

In days of old, running "big iron" from Control Data and Cray, the worst days of system instability were those following "preventive maintenance". Plus ça change....

Can faulty logic make data centers less reliable? by DragonHawk · 2011-11-27 06:44 · Score: 5, Insightful

From TFS:

"... poorly documented maintenance can lead to conflicts with automated systems ..."

That doesn't mean maintenance makes datacenters less reliable. It means cluelessness makes datacenters less reliable.

Sheesh.

--

dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.

Maintenance-induced failure. by Animats · 2011-11-27 06:46 · Score: 5, Insightful

There's something to be said for this. Back when Tandem was the gold standard of uptime (they ran 10 years between crashes, and had a plan to get to 50), they reported that about half of failures were maintenance-induced. That's also military experience.

The future of data centers may be "no user serviceable parts inside". The unit of replacement may be the shipping container. When 10% or so of units have failed, the entire container is replaced. Inktomi ran that way at one time.

You need the ability to cut power off of units remotely, very good inlet air filters to prevent dust buildup, and power supplies which meet all UL requirements for not catching fire when they fail. Once you have that, why should a homogeneous cluster ever need to be entered during its life?

Re:Maintenance-induced failure. by DarthBart · 2011-11-27 07:13 · Score: 5, Insightful

There's also been a shift in the mentality of how well computers operate. It went from not tolerating any kind of downtime to the Windows mentality of crashing and "That's just how computers are".
Re:Maintenance-induced failure. by FaxeTheCat · 2011-11-27 07:32 · Score: 1

The future of data centers may be "no user serviceable parts inside". The unit of replacement may be the shipping container. When 10% or so of units have failed, the entire container is replaced. Inktomi ran that way at one time.
I saw a while back (probably a year or two) that this is the way Microsoft will run (runs?) their Azure systems.

By the time 10% of the units are not working, it may be time to upgrade to the latest technology anyway. If you exclude disks, then I am certain you could run such a container for more than 10 years that way.
Re:Maintenance-induced failure. by brusk · 2011-11-27 07:34 · Score: 2

I think that predates Windows. Crashes of various kinds were frequent on Apple IIs, Commodores, etc. You just got used to various reboot/retry routines.

--
.sig withheld by request
Re:Maintenance-induced failure. by jklovanc · 2011-11-27 09:05 · Score: 2

Possibly because ten out of 100 units have failed because a $200 hard drive has failed in each one? Does that mean that the whole $100,000 cluster needs to be replaced? Spending $100,000 instead of $2000 is not a great decision.
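A quick sketch of that arithmetic in Python, using only the figures from the comment above (ten failed units, a $200 drive each, a $100,000 cluster). The variable names are illustrative, and labor and downtime costs are deliberately left out.

```python
# Figures come from the comment above; names are illustrative only.
failed_units = 10
drive_cost = 200          # dollars per replacement hard drive
cluster_cost = 100_000    # dollars to replace the whole cluster

repair_cost = failed_units * drive_cost
ratio = cluster_cost / repair_cost

print(f"Replace failed drives only: ${repair_cost:,}")
print(f"Replace the whole cluster:  ${cluster_cost:,}")
print(f"Whole-cluster replacement costs {ratio:.0f}x as much")
```

The comparison only holds if a drive really is the failed part in every unit and swapping it does not itself break something, which is the very point the rest of the thread argues about.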
Re:Maintenance-induced failure. by Anonymous Coward · 2011-11-27 10:04 · Score: 0

I think that predates Windows.
Not true. I used many computers and programmable calculators before and during that time and they were all very reliable by comparison to Windows. Unlike Microsoft the other vendors actually cared about the quality of their products and giving value in exchange for your cash. Microsoft was always in it just for the money. Their first products (Microsoft Z80/CPM Fortran and Assembler, Basic) were okay but it was swiftly downhill after that and they didn't care whether transactions were win-win or win-lose.
Crashes of various kinds were frequent on Apple IIs, Commodores, etc.
Not even close. There were orders of magnitude more crashes under Windows and this was true for decades afterwards.
Microsoft and Windows are almost entirely responsible for raising a generation of people who erroneously think computers crashing daily are somehow "normal" and that continues to this day.
You just got used to various reboot/retry routines.
No, it was all much the same, just had an okay to reboot or enter debugger popup.
Re:Maintenance-induced failure. by Lennie · 2011-11-27 10:43 · Score: 1

My guess is, the logic is here:
If after a couple of years running a bunch of systems and 10% has already failed (in a certain short amount of time), the others will probably follow soon.

--
New things are always on the horizon
Re:Maintenance-induced failure. by marcosdumay · 2011-11-27 12:21 · Score: 1

If you read the Unix Haters Handbook, you'll see that they note that "Unix boots fast" many times.
I wasn't using computers (even less "real" computers) by that time, so I don't know how reliable their description is. But they describe exactly that shift in mentality, and way before Windows time. Then free unixes came, and everything changed...

--
Rethinking email
Re:Maintenance-induced failure. by marcosdumay · 2011-11-27 12:24 · Score: 1

That logic applies to mechanical parts and electronic circuits that operate way above their current limit. It is not valid for electronic circuits operating with reasonable currents.
Try "If after a couple of years 10% has already failed, we still have a couple of years until the next 9% (10% of 90%) fail".
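To make the "10% of 90%" reasoning concrete, here is a small Python sketch of a constant failure rate, where each period removes the same fraction of whatever is still running. The 10% rate comes from the thread. The assumption that the rate stays constant is the whole argument, and it would not hold for parts that wear out.

```python
def surviving_fraction(periods: int, fail_rate_per_period: float) -> float:
    """Fraction of the original fleet still working after `periods`,
    assuming every period kills the same fraction of the survivors."""
    return (1.0 - fail_rate_per_period) ** periods


rate = 0.10  # 10% of the running units fail per period (figure from the thread)
previous = 1.0
for period in range(1, 4):
    alive = surviving_fraction(period, rate)
    lost = previous - alive
    print(f"period {period}: {lost:.1%} of the original fleet fails, {alive:.1%} still running")
    previous = alive
```

Under that assumption the second period takes out about 9% of the original fleet rather than a sudden avalanche. If the hardware is wearing out, the rate rises with age and the earlier "the others will probably follow soon" logic is the better model.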

--
Rethinking email
Re:Maintenance-induced failure. by Anonymous Coward · 2011-11-27 17:27 · Score: 0

If you are a server administrator and your system is running Windows Server and it crashes on you, then you aren't doing it right or you have a buggy server app with memory leaks!! Windows clients (Professional, Home, etc.) tend to crash because of either a bad driver, a bad combination of apps, user error, or viruses. None of these apply to Windows server boxes because you shouldn't be installing drivers, multiple apps, installing patches without testing first, and should be running AV software. The point is that Windows Server is stable as long as you have an experienced Windows engineer.
When I was doing Windows administration, I had boxes that never crashed and only needed to be rebooted for patches. I've since moved on to network engineering, where it isn't rare to see that a switch or router has been up for 2 years or more.
Re:Maintenance-induced failure. by aaarrrgggh · 2011-11-27 18:01 · Score: 1

For things like hard drives, you can have a periodic process to replace them; in OP's scenario you should even get rid of the air filters and have it sealed and nitrogen inerted, with a robot inside to change drives. Things that fail regularly are provided with supplemental redundancy initially, if economically justifiable.
When the unit is hauled off, it is simply refreshed and redeployed elsewhere.
The challenge that keeps me from making millions off of it is how you scale it down. If you can get 70 PB in a container, how do you sell/finance/license 10PB increments?
Re:Maintenance-induced failure. by toddestan · 2011-11-28 15:27 · Score: 1

You have no clue what you're talking about. Back then, computers had no protected memory and generally any time a program misbehaved it would bring the system down. The old Macs and Amigas crashed hard all the time, just like their Windows counterparts. Granted, without the constant updates and stuff changing all the time, if you got a specific workflow that worked and you repeated it over and over and over it would generally work every time. But don't pretend like they were more reliable because they weren't (I'll give you the calculators though).

In other news ... by xs650 · 2011-11-27 06:47 · Score: 1

Any maintenance done wrong or in excess can be more damaging to a system than no maintenance.

Rather simplistic, but... by Konster · 2011-11-27 06:48 · Score: 1

If it isn't broken, don't fix it.

I agree... by Anonymous Coward · 2011-11-27 06:49 · Score: 1

I have done data center hardware maintenance, IMAC, troubleshooting, repairs, etc for the past 15 years. From my experience, the biggest problem has been with Sys Admins who do not know their hardware, do not follow procedure, are quick to point fingers at hardware as the root problem, and hold an overall belief that firmware upgrades are the "Holy Grail" to prevent and fix all problems.

With most systems redundant in power, storage, etc, everything does not always need to be power cycled. That causes fans and other components to fail when coming back up....yes, they were probably failing but if a repair can be done on the fly, why power cycle all the time?

It really gets fun when someone does not have their system alerts set up properly and someone takes down a server with updates to the OS/App side and then discover the problems and they all complain about how they are losing thousands of dollars in down time. If the system is that critical, why not design a fail over system, it would be cheaper in the long run....

Also, most hardware service outfits have several spares available for 24/7 repairs, but when a Sys Admin or business unit decides to down several systems in a weekend that have been running for months/years and upon power up there are several drive failures, fan failures, PPM failures, power supply failures, cache battery failures, etc, most vendors including OEMs cannot have all parts available at a moment's notice since we spare according to failure rates and part pricing. A proactive heads-up with a list of hardware that is going to be worked on would be nice. We could stand by with parts available for repairs and even on site "eyes and hands" if the systems crap out and need on site assistance.

There are several flaws in data center management from the single file and print server in a small office to large data centers with huge support groups like banks and government centers. The same problems exist on all levels...some are just better prepared for the "worst" where others think we are magicians/mind readers....

As "Bones" would have said if he were in IT..."Dammit Jim, I am a technician....not a magician!"

The key to achieving high uptime ... by zensonic · 2011-11-27 06:50 · Score: 2

... is actually quite simple: You keep your hands off the systems. Period.

In detail, you plan, install and _test_ your setup before it enters production. You make sure that you can survive whatever you throw at it wrt. errors and incidents. You then figure out how much downtime you are allowed to have according to SLA. You then divide this number into equal sized maintenance windows together with the customer. And then you adhere to these windows! No manager should ever be allowed to demand downtime out of band. Period. In between you basically minimize your involvement with the systems and plan your activities for the next scheduled closing window.

And you of course only deploy stable, tried and tested versions of software and operating systems. And even though your OS supports online capacity expansion on the fly, you really shouldn't use the capability unless you absolutely have to. Instead you plan ahead in your capacity management procedure and add capacity in the closing windows. And you do not test and rehearse failures! It only introduces risks ... besides that you have already tested and documented them. And as you haven't changed the configuration, there is no need to test again.

So in essence: common sense will easily yield 99.9%. Careful planning and execution will yield 99.99%. The really hard part is 99.999%... /zensonic
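For a feel of what those percentages allow, here is a short Python sketch that turns an availability target into a yearly downtime budget and splits it into equal maintenance windows. The function names and the choice of four windows per year are assumptions for illustration, and a real SLA may measure availability over a different period or exclude planned downtime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


def window_length_minutes(availability_percent: float, windows_per_year: int) -> float:
    """Length of each window if the whole budget is split into equal windows."""
    return allowed_downtime_minutes(availability_percent) / windows_per_year


for target in (99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    per_window = window_length_minutes(target, windows_per_year=4)
    print(f"{target}%: {budget:.1f} min/year, {per_window:.1f} min per quarterly window")
```

The budget shrinks from roughly 8.8 hours a year at 99.9% to about 5 minutes at 99.999%, which leaves essentially no room for a scheduled window at all.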

--
Thomas S. Iversen

Re:The key to achieving high uptime ... by Smallpond · 2011-11-27 07:03 · Score: 4, Insightful

Which means for every online server you need an offline test machine -- and a way to simulate the operating environment in order to test. Not many companies have the skill or cash to do that.
Re:The key to achieving high uptime ... by darth+dickinson · 2011-11-27 08:05 · Score: 2

And I would have my own personal unicorn that craps Skittles on demand. Also, I could eat candy and poop diamonds.
Meanwhile, here in the real world... systems experience unexpected failures that will require them to be patched/rebooted/etc at the most inconvenient of times.
Re:The key to achieving high uptime ... by afidel · 2011-11-27 15:15 · Score: 1

Yes, you need to have dev, test, and/or qa servers for every system in your environment. This is IT 101, and now that we have VMs it's cheaper than ever to achieve. Between those nonprod systems and our DR failover systems we probably average 3-3.5 servers for every production machine except for our Citrix servers which constitute a large farm of identical machines.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.

Missing the point... by nitehawk214 · 2011-11-27 06:51 · Score: 1

Obviously not doing maintenance is much worse than the risk incurred by doing maintenance. However in the 2 years of using the datacenter my company relies on, I can say the only 3 major outages have been due to non-routine maintenance. Once was during a power upgrade, the datacenter supposedly has redundant power company connections, but the plug was somehow pulled during the upgrade anyhow. Another was during network maintenance, our dual redundant internet connections turned out not to be so redundant when half the system went offline. And finally during AC upgrades that caused our entire server cluster to overheat and shut down.

So, the story here isn't "maintenance causes unplanned downtime therefore doing less maintenance causes less downtime". It's more like "unusual maintenance is more likely to get messed up". One would think you would be more careful when performing unusual procedures, but that is still when things get screwed up most often.

--
I'm a good cook. I'm a fantastic eater. - Steven Brust

Re:Missing the point... by dave562 · 2011-11-27 08:19 · Score: 1

It is time to switch data centers. We're with AT&T (and AT&T is far from the best) and they do quarterly power tests without a single problem. They've done core infrastructure router upgrades with zero down time. All in all, I'm very happy with the service. Any competent co-lo provider should be able to handle the issues you've had without any hiccups.

Transfer switch ratings by vlm · 2011-11-27 06:52 · Score: 4, Interesting

Check your transfer switch ratings. I guarantee it will be spec'd much lower than you think. The electricians think it'll only be switched a couple times in its life. The diesel service provider thinks you're running it twice a week. Whoops. If you run it once a week, it'll only survive a couple years, then you'll get a facility wide multi-hour outage. I've personally seen it over and over again over the past two decades. The best part is "we have a procedure" so it'll only be run during maint hours and the desk jockeys 200 miles away will run it rain or shine, so it's guaranteed that the xfer switch destroys itself at 2 am during a blizzard and it'll take half a day to repair.

Very few xfer switches are more reliable than commercial utility power. Installing a UPS actually lowers reliability in almost all professional situations.

My favorite power outage was caused by a gas leak a couple blocks away, where the utility co shut down the AC and then threatened to take an axe to the gen/UPS if not also shut off. This was not in the official written report, just word of mouth.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger

Re:Transfer switch ratings by Anonymous Coward · 2011-11-27 10:45 · Score: 0

Sorry but if you are looking at ratings on an ATS, you won't find any indications as to cycle rating. Any reputable ATS/Genset manufacturer designs the switch to withstand min. 10k cycles. The Genset should be problem-free for at least 2000hrs notwithstanding oil/filter changes. Most issues experienced tend to be piss-poor installs of UPS systems, poor choices with Genset/ATS/UPS combinations. Not pointing fingers however this area too falls under the poor maintenance procedures and engineers not understanding the effects of over/undersizing these systems.
I can rant for hours on this topic as power quality is a huge issue when dealing with electronics.... word of advice leave the bloody photocopiers and laser printers off the damn UPS....
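The disagreement with the parent comes down to a cycle budget, which is easy to sketch in Python. The 10,000-cycle figure is the one claimed in this comment, not a datasheet value, and the function name is made up for the example. Whether one test counts as one transfer or two (to generator and back) is also an assumption to check against the actual switch.

```python
WEEKS_PER_YEAR = 52


def years_until_rated_cycles(rated_cycles: int, transfers_per_week: float) -> float:
    """Years of scheduled testing before the rated cycle count is used up."""
    return rated_cycles / (transfers_per_week * WEEKS_PER_YEAR)


for per_week in (1, 2, 4):
    years = years_until_rated_cycles(10_000, per_week)
    print(f"{per_week} transfer(s)/week: about {years:.0f} years to reach 10,000 cycles")
```

At 10,000 cycles a weekly test would take close to two centuries to use up the rating, so a switch dying after a couple of years of weekly runs would point to a far lower rating or to a different failure mode.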
Re:Transfer switch ratings by Anonymous Coward · 2011-11-27 14:47 · Score: 0

Depends. We've got one that's survived California redneck power. It's transferred at least 13 times in the last 4 years. ConEd may be better, but PG&E didn't handle summers well. The site close to us sometimes got paid by PG&E, not just with cash, but pollution permits, to run their huge generator farm to take some load off the remains of their grid. But yes, do the preventive maintenance on your transfer switch. Ours overheated and died from an ant invasion. I shit you not.
Also, the UPS will increase your equipment reliability by minimizing the brownouts and wicked harmonics from your neighbor's weird-ass shit, like huge switching power supplies and the like.

Re:Maintenance and prevention are not always the s by bussdriver · 2011-11-27 06:57 · Score: 3, Interesting

Planned obsolescence has been promoted in all aspects of life since post WW2 and now it is hard to imagine the world without it. That line of thinking has been creeping into everything even in areas where it doesn't seem to apply.

Does this play a factor on the perception of preventative maintenance or its frequent application? I think it probably does in at least a couple ways, don't you?

--
Democracy Now! - uncensored, anti-establishment news

Re:Maintenance and prevention are not always the s by Smallpond · 2011-11-27 06:59 · Score: 1

Internal monitoring of components is a lot better now than it used to be. We used to go around and check all of the supply fans once a month because it was the highest failure rate component on the desktops we were using and there was no indication until the machine started crashing from overheating.

Useless article with no data. by Vellmont · 2011-11-27 07:00 · Score: 4, Interesting

I read through the entire article, and saw zero data to support his assertion. I'm sure he has the data, but the article didn't reference a single piece of it. Without any data to support the theory all we have is a fluff opinion piece. Shame on Data Center Knowledge for writing an article about a scientific investigation, and not presenting a single piece of scientific evidence.

--
AccountKiller

Re:Useless article with no data. by Anonymous Coward · 2011-11-28 05:31 · Score: 0

Well that's nice to know... I'm glad my boss doesn't read /. or he'd use it as ammo for why we shouldn't waste time maintaining our environment.
Re:Useless article with no data. by Anonymous Coward · 2011-11-28 10:34 · Score: 0

Maxim 5
OLD
Comprehensive data about failure rates must be available before it is possible to develop a really successful maintenance program
NEW
Decisions about the management of equipment failures will nearly always have to be made with inadequate hard data about failure rates

Far from obvious by sphealey · 2011-11-27 07:04 · Score: 1

===
Obviously not doing maintenance is much worse than the risk incurred by doing maintenance.
===

That's far from obvious, actually, and is demonstrably wrong for many types of systems and installations.

sPh

This is well known from Formula One by igb · 2011-11-27 07:04 · Score: 5, Interesting

Some years ago, the F1 rules were changed so that cars were in parc ferme conditions, with strict limits on what can be done to them, from the start of qualifying on Saturday lunchtime until the race finishes on Sunday afternoon.

The purpose was partly to stop qualifying being its own arms race, with cars in completely different specification than for the race, and partly to reduce costs and the number of travelling staff. At the same time, "T Cars" --- a third car, available as a spare --- were banned, so that if a driver destroys a car in practice the team either have to rebuild it or not race. They're allowed to travel with a spare monocoque, but it cannot be built-up and it does not get pit space.

There were endless howlings from the teams, claiming that without a complete strip-down after qualifying, with a large crew working overnight to check everything on the car, reliability would go through the floor and races would finish with only a handful of stragglers fighting a durability battle (our US viewers may find this ironic in light of a certain US Grand Prix, of course).

The same argument was advanced, mutatis mutandis, over limitations on engines and gearboxes, limitations on the number of gear clusters available, limitations on certain forms of telemetry and a wide variety of "the cars can't just be left to run themselves, you know" interventions.

In fact, reliability is now far greater than ten years ago. It's not uncommon for there to be no mechanical retirements, certainly not from the longer-standing teams, and the days of engines imploding on the track are long gone. A front-running driver will probably only have one, if even that, mechanical DNF per season. The teams deliver a functioning car when the pit lane opens at 1pm Saturday, and that car then runs twenty or thirty laps in qualifying and sixty or seventy in the race, a total of perhaps 250 miles, without much maintenance work beyond tyres, fluids and batteries (section 34.1 on page 18 of the sporting regulations).

So again, we see that "preventative maintenance" turns out to really be "provocative maintenance", and leaving working machines alone is the best medicine for them.

Re:This is well known from Formula One by scattol · 2011-11-27 08:23 · Score: 4, Insightful

Those cars, to be competitive, were engineered to fall apart on the other side of the finish line. Without maintenance they would have failed. They are now engineered to last a few races instead of just one. Odds are they are slightly slower in one form or the other but it being a level playing field, it doesn't matter.
Re:This is well known from Formula One by Anonymous Coward · 2011-11-27 09:10 · Score: 0

Doesn't hold for the consumer market however. In F1, you have lots of R&D being pumped into performance and reliability. In the general consumer market, items are sold mainly on price. So in a general sense, you get what you pay for. For example, server hardware is more reliable than your $300 desktop special.
Re:This is well known from Formula One by jklovanc · 2011-11-27 09:11 · Score: 1

So they removed the maintenance between qualifying and race. That does not mean that the race team does not do "maintenance" between races. Yes there is "less maintenance" but not "no maintenance".
Re:This is well known from Formula One by vlm · 2011-11-27 09:14 · Score: 1

Also don't forget driving style is modified. "F it, you guys are going to tear the thing down anyway so I'm gonna lean it out to sneak in a couple more laps per tank" is replaced with babying the car.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Re:This is well known from Formula One by Waccoon · 2011-11-27 14:04 · Score: 1

Another way of looking at it, is that it doesn't matter what the rules are, so long as everyone follows the same rules.
Shorter races allow people to run closer to the edge, longer races require conservative builds. In the end, all the teams have the same problems to overcome, so in theory, nobody has the competitive advantage.
This is a different situation than, say, international trade, where one country is forced to meet safety regulations and another is not. Now that's when you end up with "wrecks and broken parts" all over the "track".

If it's working by tmpsantos · 2011-11-27 07:06 · Score: 1

Sounds like the "If it's working don't touch" maintenance policy.

"The most reliable machine... by John+Hasler · 2011-11-27 07:06 · Score: 3, Funny

...is the one farthest from the nearest engineer."

Consider The Pioneer and Voyager spacecraft and the Mars landers.

--
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.

Re:Can faulty logic make data centers less reliabl by HalAtWork · 2011-11-27 07:10 · Score: 3, Insightful

Exactly.

vigorous maintenance
excessive maintenance
poorly documented maintenance

Those are all qualified as out of the ordinary. Anything in excess (on either side of the scale, whether it is too much or not enough) is a problem. Of course maintenance must be performed, but I guess some data centers have a strange idea of best practices, or they do not follow them.

--
Twinstiq, game news

battery maintenance / changing out battery is nee by Joe_Dragon · 2011-11-27 07:20 · Score: 1

Battery maintenance / changing out old batteries is needed, as a dying battery can fail to work when needed, or at worst can explode.

Lesson learnt in Aviation by Anonymous Coward · 2011-11-27 07:21 · Score: 1

Commercial Aviation learnt this lesson a long time ago, after analysis showed that maintenance induced failures exceeded the potential failures avoided.

Now most parts for airframes and engines are on IRAN schedules - Inspect and Repair/Replace as needed. Only a few filters and things are replaced on a fixed schedule.

soft vs hard reboot by Joe_Dragon · 2011-11-27 07:24 · Score: 1

Sometimes software / the OS needs at least a soft reboot from time to time to clean up stuck software and remove memory leaks.

Now some stuff like firmware updates may need a hard reboot.

As for power cycling, sometimes you need to do it to get back into a crashed system.

Re:soft vs hard reboot by Pharmboy · 2011-11-27 07:38 · Score: 1

Sometimes software / the OS needs at least a soft reboot from time to time to clean up stuck software and remove memory leaks.
What operating system are you using, Windows 98? The worst case I've had with Linux (CentOS 5.4) is NFS locking up and taking one of the CPUs for a ride to jump the load to around 8 to 9. Even then, a little patience killed the process after a while. There have been times when it was FASTER to hard boot (but the risks suck), but most modern applications and operating systems ON THE SERVER don't usually leak memory in this day and age. I'm guessing most people restart processes regularly via CRON or the MS equiv. anyway.
Obviously you reboot for many firmware updates or kernel updates, but almost never do modestly maintained servers just "crash" without there being a hardware failure.

--
Tequila: It's not just for breakfast anymore!
Re:soft vs hard reboot by Bigbutt · 2011-11-27 08:21 · Score: 3, Insightful

You must not deal with any Oracle database servers. They leak like a sieve.
[John]

--
Shit better not happen!
Re:soft vs hard reboot by hedwards · 2011-11-27 08:45 · Score: 2

NFS locking up is ultimately a part of the spec. It was originally a stateless filesystem that operated over UDP. Unless you're using a more recent revision of the protocol and have it configured as such, you're going to have issues with it locking up regularly.
Re:soft vs hard reboot by Anonymous Coward · 2011-11-27 09:06 · Score: 0

You've never had a Linux system with a bunch of zombies stuck in IOWait? Not even SIGKILL will cause them to die.
Re:soft vs hard reboot by PCM2 · 2011-11-27 09:45 · Score: 2

Desktop PCs and servers seem to have largely overcome the need to reboot regularly, but other segments of the industry seem to be moving backwards. My Android handset actually says in the manual that you should power cycle it regularly. With a firmware upgrade, it even started giving me a warning from time to time, telling me I had not power cycled the phone in X amount of time and that I should do that now or risk instability. (Am I crazy for assuming that a phone OS is a markedly less complex environment than a Linux server? And here I thought Android applications ran in a fully memory-managed, garbage-collecting environment.)

--
Breakfast served all day!
Re:soft vs hard reboot by Pharmboy · 2011-11-27 10:32 · Score: 1

Zombies on a SERVER, no. (the topic of the article) I'm more conservative what I run on a server, however.

--
Tequila: It's not just for breakfast anymore!
Re:soft vs hard reboot by afidel · 2011-11-27 14:01 · Score: 2

Or JAVA, we run all the big enterprise application servers and they all run considerably better if they are rebooted on a regular basis.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:soft vs hard reboot by arth1 · 2011-11-27 18:10 · Score: 1

You must not deal with any Oracle database servers. They leak like a sieve.
Before the 8 day power outage here a few weeks ago, I had main servers with 1000+ day uptimes. Databases and Java included. Memory leaks? Sure, but what does reboot have to do with that? If you don't know how to restart processes without rebooting, nor setting sensible limits for memory and CPU usage, you have no business running servers.
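As a minimal sketch of restarting a process instead of the machine, the Python below stops a long-running child and starts a fresh one while the host stays up. The worker here is just a sleeping Python process standing in for a real service, and the function names and the five-second grace period are assumptions. A real setup would normally use the init system or service manager rather than hand-rolled code.

```python
import subprocess
import sys


def start_worker() -> subprocess.Popen:
    # Stand-in for a long-running service: a child Python process that sleeps.
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def restart_worker(old: subprocess.Popen, grace_seconds: float = 5.0) -> subprocess.Popen:
    """Stop the old process, then start a fresh one. The host is never rebooted."""
    old.terminate()                      # polite stop request (SIGTERM on Unix)
    try:
        old.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        old.kill()                       # force it if it ignores the request
        old.wait()
    return start_worker()


worker = start_worker()
replacement = restart_worker(worker)
print("old pid", worker.pid, "exit status", worker.returncode)
print("new pid", replacement.pid, "running:", replacement.poll() is None)

# Clean up the demo process.
replacement.terminate()
replacement.wait()
```

Whatever memory the old process leaked goes back to the OS when it exits, which is all a reboot would have achieved for a userspace leak.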
Re:soft vs hard reboot by Anonymous Coward · 2011-11-27 18:11 · Score: 1

Technically, a zombie can't be stuck in an I/O wait status.
Re:soft vs hard reboot by arth1 · 2011-11-27 18:22 · Score: 1

Or JAVA, we run all the big enterprise application servers and they all run considerably better if they are rebooted on a regular basis.
No, they don't. They run in userspace, which means you can simply stop and restart the processes with the same result.
If anything, rebooting will cause your apps to run ever so slightly slower than if you had restarted the processes. Yes, really. We are not living in Windows 98 times anymore, where reboots would speed up the machine.
One reason being how modern OSes handle memory paging. It takes time before the OS has dropped all the pages that never are used, so it can use that memory for caching and other things.
Another reason being daemons which have less and less work to do over time (like ntp, which gains accuracy with uptime, or entropy daemons which finally get their buffers full).
Re:soft vs hard reboot by arth1 · 2011-11-27 18:35 · Score: 1

NFS locking up is ultimately a part of the spec. It was originally a stateless filesystem that operated over UDP. Unless you're using a more recent revision of the protocol and have it configured as such, you're going to have issues with it locking up regularly.
Nah. UDP has the advantage of being faster, because of less overhead. If anything, I find nfs4 to be an ugly beast compared to nfs3, with far more problems (including data corruption).
Part of the perceived problem with nfs 2/3 is that the mount options default to what's needed for a "diskless" mount, with "hard,nointr". Yes, you want the system to hang indefinitely if it can't mount its required drives, so that makes sense for that situation. But that's not how most people run nfs - almost everyone can set "intr" here, and depending on the mount type, quite often "soft" too.
Re:soft vs hard reboot by afidel · 2011-11-27 19:48 · Score: 1

I meant restarted, but frankly for our VM's restarting the service and rebooting the machine take about the same amount of time and since the load balancers can just take the node out a rolling reboot isn't a big deal.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:soft vs hard reboot by Bigbutt · 2011-11-27 23:40 · Score: 1

Well, the real problem is the DBAs. Certainly I can stop and start the Oracle processes without much trouble but the DBAs insist on a reboot and since they "own" the systems, they have most of the say in the matter. I'm glad I was able to prevent it from being a monthly "reboot all the Oracle servers" as the DBAs insisted it was healthy to reboot them periodically. It was a fight to keep it from happening though. They weren't listening to my technical answers so my biggest leverage was that we're short staffed and don't have the time to work on projects _and_ reboot 30 or 40 Oracle servers a month.
[John]

--
Shit better not happen!
Re:soft vs hard reboot by Kjella · 2011-11-28 01:00 · Score: 1

And here I thought Android applications ran in a fully memory-managed, garbage-collecting environment.)
GC only works on things that have no references. There are lots of ways to run out of memory by filling up lists and maps and such with objects; even if they're never used again, the garbage collector doesn't know that.

--
Live today, because you never know what tomorrow brings
Re:soft vs hard reboot by TheLink · 2011-11-28 01:36 · Score: 1

Can't you just write a script or cron job to reboot 30-40 servers?
--
- Too many replies beneath your current threshold
Re:soft vs hard reboot by bungo · 2011-11-28 02:44 · Score: 1

I agree. I'm an Oracle DBA, and someone who wants to have the server rebooted every month is taking the easy way out and probably doesn't understand all of the processes that their system runs.
Most versions of the Oracle Server software have had problems that eat up memory or have processes that will eventually hang. The correct way to deal with this is to test and patch, although it isn't always possible to reproduce issues in a non-production system, as the production load and usage can be very difficult to reproduce.
I have also been in a position where I've been "over-ruled" and, instead of investigating the issue, management has decided to just reboot every week to resolve the issues, which does actually work, even though it is a short-sighted solution.
On the other hand, rebooting servers should be a straightforward task, and should be automated - so I gather that your excuse of being short staffed was just a lie to get your way.

--
"The best part? I became an ordained minister while not wearing pants." -- CleverNickName
Re:soft vs hard reboot by greed · 2011-11-28 03:56 · Score: 1

And as soon as you even suggest "soft" mounts, the people who have been to Sun Brainwash Camp start freaking out. "No," they scream, "soft mounts cause data corruption!"
Apparently, in their world, no program ever checks the return code from a system call. And, also in their world, NFS servers reboot often and without notice--hard mounts don't protect data that the client hasn't sent, after all, they only protect against server reboots (but not failures).
If we have an NFS server go down, I want stuff to crash. We have to validate all the files that were in-flight at the time of the crash anyway. I've seen enough files grow runs of all-zero blocks after an NFS server crash with hard mounts to know that isn't guaranteed safety.
(Oh, BTW, the Sun people love "sync" mounts, too: you can't even disable it in the Sun server daemons. But it has a massively negative effect on performance. There are people that just insist on optimizing for the rare events. And yet, it's nearly impossible to fault out to a new server with NFS hard mounts.)
Re:soft vs hard reboot by mcgrew · 2011-11-28 05:04 · Score: 1

And it's not just smartphones, either. It was so long since I booted a PC for anything except saving electricity and upgrading a kernel that when my phone stopped getting on the internet, my daughter had to remind me to try rebooting. And the reboot worked! Sometimes when I'm bluetoothing pictures from my phone to my computer, the phone gets confused and has to be rebooted before it will send pictures. And it runs neither Windows, Android, or iOS.

--
Free Martian Whores!
Re:soft vs hard reboot by kriston · 2011-11-28 05:08 · Score: 1

Some of the ideas of rebooting database servers regularly come from the fact that many databases use shared memory, i.e. shmget(2), and not only do they not release the segments when they crash, the segments are often locked in place due to open file handles held by zombie processes. Of course things are much better now in the Linux domain but it's something worth considering when you stop foaming over long uptimes.
Uptimes.org is gone for a reason, folks.

--
Kriston
Re:soft vs hard reboot by pnutjam · 2011-11-28 07:42 · Score: 1

Citrix always needs reboots....

--
Cheap storage VM.
Re:soft vs hard reboot by hedwards · 2011-11-28 16:45 · Score: 1

TBH it's largely academic as these days I'd personally be more comfortable using SFTP or SCP for that use. But, the main reason people have that opinion of NFS is that most of us weren't using those options. I haven't spent much time using NFS and have found it to be more of a headache than it's generally worth. But then again I started using it after there were alternatives.
Re:soft vs hard reboot by Bigbutt · 2011-11-30 04:05 · Score: 1

On the other hand, rebooting servers should be a straightforward task, and should be automated - so I gather that your excuse of being short staffed was just a lie to get your way.
While it can be automated, it still takes resources to verify the systems came back up correctly and work to resolve any issues that crop up. Someone has to babysit when it happens. The systems have to be staggered because they're high availability so we can't be one sided too long. Which just means we'll have someone watching reboots once or twice a week. Add in the difficulty of getting approvals to do this and it really is a resource issue.
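As a rough illustration of why "just script it" still needs supervision, here is a minimal sketch of a staggered reboot loop that verifies each host before touching the next one. Everything in it is hypothetical: `rolling_reboot`, and the `reboot` and `is_healthy` callables, are placeholders for whatever a site actually uses, and nothing here reflects the real tooling behind the comment above.

```python
import time


def rolling_reboot(hosts, reboot, is_healthy, retries=5, wait_s=0.0):
    """Reboot hosts one at a time; halt at the first that fails verification.

    reboot(host) and is_healthy(host) are caller-supplied (hypothetical) hooks.
    Returns the list of hosts that were rebooted and verified.
    """
    done = []
    for host in hosts:
        reboot(host)
        for _ in range(retries):
            if is_healthy(host):
                break
            time.sleep(wait_s)
        else:
            # Never continue past a host that did not come back: in an HA
            # pair that would take the second side down as well.
            raise RuntimeError(
                f"{host} failed verification; completed so far: {done}")
        done.append(host)
    return done


if __name__ == "__main__":
    print(rolling_reboot(["db1", "db2"], lambda h: None, lambda h: True))
```

The loop stops at the first failure, which is exactly the point where a person has to step in, so the automation reduces the typing but not the babysitting.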
[John]

--
Shit better not happen!
Re:soft vs hard reboot by Bigbutt · 2011-11-30 04:12 · Score: 1

Just to respond to the 'foaming over long uptimes' portion of your comment. I and my team are not obsessed with uptimes. I am averse to just rebooting systems to "fix" problems without at least investigating. Sure, a reboot will restore the server to a good state, but ultimately it doesn't actually "fix" the problem. And I do regularly check systems for zombie or defunct processes. We have issues with OpenView agents on the Linux systems having defunct processes, which was fixed recently with an update. And one system has a perl script written by someone at Oracle for reports that gets about 100 defunct processes a month. For a majority of the systems though, the defunct listing is zero. However, should investigation determine that it's as you say, I don't have a problem with rebooting to clean up memory.
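For readers who have not done that kind of check, a minimal sketch of one way to list defunct processes on Linux follows. It parses the text produced by `ps -eo pid,ppid,stat,comm` and keeps rows whose state starts with `Z`. The function name `find_defunct` and the sample output are invented for the example; this is not the script used on the systems described above.

```python
import subprocess


def find_defunct(ps_output):
    """Return (pid, ppid, command) for every zombie row in ps output.

    Expects the columns pid, ppid, stat, comm with one header line.
    """
    rows = []
    for line in ps_output.splitlines()[1:]:   # skip the header
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[2].startswith("Z"):
            pid, ppid, _stat, comm = parts
            rows.append((int(pid), int(ppid), comm))
    return rows


if __name__ == "__main__":
    out = subprocess.run(["ps", "-eo", "pid,ppid,stat,comm"],
                         capture_output=True, text=True).stdout
    for pid, ppid, comm in find_defunct(out):
        print(f"defunct pid={pid} parent={ppid} {comm}")
```

The parent PID is the useful column: a zombie only goes away when its parent reaps it or exits, so the parent is what needs investigating.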
[John]

--
Shit better not happen!

Re:Can faulty logic make data centers less reliabl by FaxeTheCat · 2011-11-27 07:27 · Score: 4, Insightful

Precisely my thought.

Maintenance, like anything else you do in a datacenter or wherever you work, must be done correctly. If maintenance reduces the reliability of the maintained entity, then per definition, it was not correctly performed.

Doing something correctly requires knowledge, planning and training. Just like everything else.

Here's the car analogy ... by Anonymous Coward · 2011-11-27 07:33 · Score: 1

One of the Asian automotive manufacturers, Toyota, Honda, or Nissan, sent around a crew in their assembly plants to retorque any loose nut/bolt/screw they found. They saw a dramatic reduction in plant down-time. Ben Franklin 'a stitch in time' comes to mind. Computers will fail eventually. If the system is so unstable that 'maintenance' tips it over the edge then you were about to have a severe incident anyway (and probably not the team available to fix it, like the maintenance crew that are still right there). Make prudent backups before any major maintenance, plan for some random glitch getting things back on line, and you're covered. It's hard to record the 'things gone right' from maintenance .. but easy to record the number of problems after maintenance.

Classic article on this problem: ValuJet 592 by rbrander · 2011-11-27 07:33 · Score: 1

http://www.theatlantic.com/magazine/archive/1998/03/the-lessons-of-valujet-592/6534/

William Langewiesche, Atlantic Monthly, "The Lessons of ValuJet 592". It was basically done in because it was transporting safety equipment itself, which was vulnerable to a hard-to-predict failure. The more complex we make air travel, with its multiple checks and layers of protection, the more opportunities for failure. Adding another check to avoid 592, as they did, creates yet another opportunity.

It is, as they say, a Hard Problem. Yet, still: the US recently celebrated 10 whole years without a major airliner loss, despite a phenomenal amount of air travel. Things are getting better. Hard != Insoluble.

Laziness makes the datacenter unreliable by Anonymous Coward · 2011-11-27 07:34 · Score: 0

Let's face it... if 'we' administrators REALLY did our job and REALLY knew our stuff, we wouldn't have to patch nearly as often as we do, because our systems would be hardened and properly protected, leaving many patches unneeded unless the services our systems were providing required them.

And in those cases, our documentation would allow us to REALLY test exactly what we needed to and script the installs specifically for the systems that needed the patching.

Unfortunately, we DON'T REALLY know our systems the way we need to in order to PROPERLY secure them and maintain them. I see all too frequently the practice of 'getting it working' versus 'proper implementation' because we don't have the time or resources.

THIS unfortunately is what leads us to the regular patch cycles that we have, because when we don't really know and understand our systems and defend them independently by proper hardening, we must rely on patching everything, because in the end, everything is exposed. (Mind you, that's Windows, Linux, mobile, etc. etc.)

BUT, I don't want to leave it all to us administrators. Let's face it, VENDORS and MANUFACTURERS have a hand in this as well. Nine times out of ten they don't even know their products well enough to tell us administrators what to do, or how to properly implement their solutions. They are so quick to market, their products stink. A proper and secure implementation with a product that is not written properly might as well be a waste of time. They are LAZY too. Their products don't work as advertised and they are so focused on sales, they don't even care that they aren't installed properly. When I have to work with top level support to get a security product to function as advertised because it 'isn't normally installed that way', and yet THAT IS the way it SHOULD BE installed to be properly secured, then there is a problem with the product, PERIOD.

So, to my brethren in IT, let's make a statement to really and truly understand our environments and NOT let our product manufacturers or in-house developers BE LAZY and make our systems unreliable.

Its All In the Process ... by __aajwxe560 · 2011-11-27 07:35 · Score: 2

Having been involved in Technical Ops of both large and small companies for many years, I have seen DR exercises and designs that have run the gamut. The key thing I have found to the success of any organization, exercise, or philosophy is the underlying process that drives execution. The larger the team/org, the more change points, which in turn leads to more variables between tests. This creates complexity, as a test that ran fine a few months ago may not run the same today. However, making sure change does not outrun the process, so that each change is understood and applied to the greater design, is key to ensuring each test improves upon the last, until such time as this is a finite process.

For example, when working for one of the big 401k's, the first DR exercise evaluated the data center completely being leveled and re-locating both technical services as well as the ~300 on site employees to another location. Long story short, the first exercise of this was scheduled for 2 days, and while it worked, we identified dozens of issues. We scheduled the next test 6 months later and addressed what we believed were all of the issues; on the next test, we ran into perhaps ~10 issues. The next test we scheduled 3 months ahead and ran into ~2 issues. All the while, things continue to change and innovation is occurring, and change process control is ensuring that new things are being factored into the continual DR process/exercise. For a small telecom I worked for, the same type of testing was accomplished with a ~2-3 week turnaround time (smaller team, fewer change points, more dynamic response), but with the same underlying principles.

Documentation of such things is critical, and employee turnover is often one of the greatest risk points. Having a diversified staff with overlapping knowledge should minimize the latter risk to some degree, and if implemented fully, risk should be diminished.

So how does all this tie back into maint? Well, it is anticipated that if any system runs long enough, there will be opportunity for failure. By preparing for when such failure occurs, one can balance the capability of providing a measured window of downtime (if any) and provide some degree of predictability (i.e. I test once a quarter). The counter to this can certainly be overzealous maint, so certainly there is a point to being reasonable. For example, what many of us go through with our cars - the dealer wants us to come in every 3k miles for an oil change, whereas realistically most mfr's and my own experience dictate that ~5k (if not longer depending on circumstance) is much more cost effective. Either way, this is providing some degree of confidence that this should prolong engine life.

Re:Its All In the Process ... by vlm · 2011-11-27 09:22 · Score: 1

For example, what many of us go through with our cars - the dealer wants us to come in every 3k miles for an oil change, whereas realistically most mfr's and my own experience dictate that ~5k (if not longer depending on circumstance) is much more cost effective.
LOL old Saturn cars were famous for a valve issue where the engine suddenly starts to burn about a quart of oil per 1k mile after about 125K miles of service ... engine capacity is 4 quarts oil... "most" owners don't even know what a dip stick is, much less how to read it... lots of saturn engines dead with an empty oil pan...
You'd be surprised how often the manufacturers actually know what they're doing with stuff like that.
Something I never understood about that whole mentality.. pay the bank $40000 in payments to buy a $30000 car then destroy the car by trying to save $30. I can see never doing maintenance on a $500 beater, but...

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger

it's a zoo! by Anonymous Coward · 2011-11-27 07:42 · Score: 0

tell that to my spiders. they are trying to "maintain" their webs in my servers for some reasons (hint:very bad)!

well some windows updates still need reboots less by Joe_Dragon · 2011-11-27 07:46 · Score: 1

Well, some Windows updates still need reboots. It's less than it was in the past but still more than with Linux.

Also a lot of NON OS software updates / installers at least say they need a reboot.

It depends by Glendale2x · 2011-11-27 07:47 · Score: 1

If "maintenance" means doing a forklift upgrade of all the computer and networking equipment every year or two then of course your reliability is going to suck, especially the human error factor with all of that new, unfamiliar equipment.

On the other side of things if someone thinks that never changing the oil in the generator is going to make it more reliable then they're in for a surprise. When I think about datacenter "maintenance" I think: changing the CRAC air filters, cleaning any outdoor coils, changing the oil on the generator, loading the generator, replacing old lead acid batteries, checking building integrity, making sure birds aren't nesting anywhere stupid, and so forth. Physical plant won't last forever.

--
this is my sig

Illuminating web page by golodh · 2011-11-27 08:01 · Score: 1

The Nowlan & Heap report is a bit heavy to read, but there is an illuminating web-page here: http://www.mutualconsultants.co.uk/rcm.html that conveys the essence.

See especially the sections "How equipment fails" and "Operating Context and Functions"

Re:Illuminating web page by sphealey · 2011-11-27 10:37 · Score: 1

Add some really heavy-duty math to that with "Mathematical Aspects of Reliability-Centered Maintenance" by H. L. Resnikoff ! Way over my head. But the basic idea is simple. In a reasonably well-designed system with reasonably reliable components, you have the least information about that which interests you the most: failure rates. Making standard probability-distribution failure analysis virtually impossible (even if one discards the questionable "everything has a bathtub failure distribution" assumption).
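A back-of-the-envelope way to see the "least information" point, offered as an illustration and not as anything taken from Resnikoff's paper: if a component has been exercised n times with zero failures, the most that can be said is an upper bound on its failure probability. The sketch below computes the exact one-sided bound under the assumption of independent, identical trials; the function name is made up. For 95% confidence the result is close to the well-known "rule of three", 3/n.

```python
def upper_bound_failure_probability(trials, confidence=0.95):
    """Upper bound on per-trial failure probability after zero failures.

    Assumes independent, identically distributed trials. Solves
    (1 - p) ** trials = 1 - confidence for p.
    """
    if trials <= 0:
        raise ValueError("need at least one trial")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        bound = upper_bound_failure_probability(n)
        print(f"{n} clean trials -> p <= {bound:.4f} (rule of three: {3 / n:.4f})")
```

The bound shrinks only in proportion to 1/n, so a reliable component needs a very long clean history before its failure rate is pinned down to any useful precision.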
sPh

The quality of the people matters a lot by petes_PoV · 2011-11-27 08:08 · Score: 3, Insightful

Although everyone makes mistakes, some people make hundreds of times more errors than others. Whether that's due to inherent lack of ability, poor training, lacking oversight, laziness, time pressures or just a slapdash attitude varies with each person. One place I was involved with (as an external consultant) made over 12,000 changes to their production systems every year. It turned out that well over half of those were backing out earlier changes, correcting mistakes/bugs from earlier "fixes" or other activities (a lot that resulted in downtime, and far too much of it unscheduled or emergency downtime) that should not have happened and could have been prevented.

--
politicians are like babies' nappies: they should both be changed regularly and for the same reasons

Re:The quality of the people matters a lot by gweihir · 2011-11-27 08:26 · Score: 1

Was just writing my posting while you did yours. I could not agree more. Additional aspects are
- Engineers and managers that try to justify their existence by performing a lot of maintenance
- Incompetence due to bad training, arrogance and inexperience
Example: I recently pulled an Ethernet cable with a broken connector out of a mission critical server (was not in production, we were reviewing cabling correctness). Turns out that some brain-dead person did the cabling with old used cables. One minute of downtime there likely costs more than the whole set of about 200 cables. Sometimes you really start to doubt that humans have intelligence.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:The quality of the people matters a lot by jklovanc · 2011-11-27 09:17 · Score: 1

This sounds very much like a phenomenon I called "al dente programming"; throw code at a problem until something sticks. There is very little thought to the consequences of actions and assumptions that any issues will be fixed later. If people would just slow down a bit, do some research and think about the action maybe they would be starting fewer fires that need to be put out. It is circular logic; I don't have time to think about something because it is a fire but not thinking about something creates fires in the future so I am back in the same place.

Busy work by Anonymous Coward · 2011-11-27 08:11 · Score: 0

Talk like that will drive the economy into a (deeper?) recession.

Depends on the people by gweihir · 2011-11-27 08:14 · Score: 1

If you have rushed, underqualified people do the maintenance, then sure, it decreases reliability. If you have careful, non-rushed and competent people doing it, I doubt very much that the same is true. These people tend to be a bit more expensive, but cutting cost in the wrong places is a traditional occupation of managers in IT.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Re:Depends on the people by sphealey · 2011-11-27 10:39 · Score: 1

===
If you have rushed, underqualified people do the maintenance, then sure, it decreases reliability. If you have careful, non-rushed and competent people doing it, I doubt very much that the same is true.
===
Go read some of the original references on Reliability Centered Maintenance, particularly the Nowlan & Heap report referenced upthread by multiple posters. Your basic assumption has been shown to be very often incorrect in practice.
sPh
Re:Depends on the people by gweihir · 2011-11-27 12:17 · Score: 1

I think what is more likely is that competence is overstated in practice. Nobody will admit they use low-competence people for difficult jobs. Also, competent people will take a mostly hands-off approach to maintenance. Of course if maintenance always means "change something", the assumption is wrong. But that approach to maintenance is the wrong one in the first place.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

If it ain't broke, don't fix it. by Anonymous Coward · 2011-11-27 08:16 · Score: 0

Age old proverb, true today as it was a century ago.

All I want... by ibsteve2u · 2011-11-27 08:27 · Score: 1

All I want are systems with interchangeable fan/air inlet filters on the outside of the case that do not require a tool to remove and replace - let alone a power cycle. Is that so much to ask?

It's funny...I have cases where that sort of filter exists for a bottom-mounted power supply, but the case's own fans? Have to take 'em apart to properly clean the filters. And please don't say "Just lug a vacuum cleaner around." - they rarely do a good job if they actually are "luggable"; they (in a rare phrase) don't suck hard enough.

It is my experience that dirt and heat are the single greatest enemies of any electronic device, and will be as long as superconductivity without cooling is infeasible.

--
Orwell: "In a Time of Universal Deceit, telling the Truth is a Revolutionary Act"

When I was a computer tech... by Archtech · 2011-11-27 08:28 · Score: 1

... a very long time ago, we had a saying about this.

"It's called preventive maintenance because it prevents the computer from working".

--
I am sure that there are many other solipsists out there.

The Real Problem by Anonymous Coward · 2011-11-27 08:28 · Score: 0

Is that Documentation requires time, and thus money - something Management just can't be bothered to allocate resources for.
This article misses the Real Problem by miles.
FAIL.

Re:Maintenance and prevention are not always the s by Anonymous Coward · 2011-11-27 08:46 · Score: 1

It's planned to survive for however long it's estimated to be used by the consumer. That's because the one determining factor is technological advancement. Progress is a moving target. It's reflected in the market place. Once we hit a brick wall in progress, the focus will return to reliability and planned long-life. Similar to how my old 1950s Sunbeam toaster will be passed on from generation to generation like a family heirloom.

Can too much FOO make BAR unreliable? by davidwr · 2011-11-27 08:49 · Score: 1

There, fixed the headline for you.

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.

The threat today is automated updates by syousef · 2011-11-27 09:02 · Score: 1

One bad automated update can leave your system hosed or cause obscure reliability problems, perhaps not showing up for a while, with the worst ones leaving you with little option but to rebuild the system.

So I turn off auto update on everything I can, and manually update periodically. I consider the security risk smaller this way. I get it stable, and let it run that way for a few months at least. Then update security fixes etc.

--
These posts express my own personal views, not those of my employer

Re:The threat today is automated updates by flyingfsck · 2011-11-27 11:43 · Score: 1

Absolutely correct. My servers run for many years, till the hardware eventually fails, with zero updates and zero restarts. Don't fix it if it ain't broke.

--
Excuse me, but please get off my Pennisetum Clandestinum, eh!
Re:The threat today is automated updates by Billly+Gates · 2011-11-27 23:30 · Score: 1

Rootkit makers and malware spammers who inject SEO ads on your webservers love you both!

--
http://saveie6.com/

In the Army we had "PMCS". by khasim · 2011-11-27 09:34 · Score: 1

https://en.wikipedia.org/wiki/Preventive_Maintenance_Checks_and_Services

Or to use a more common example, think about changing the oil in your car every (time interval) or (distance interval). Will it stop failures? Maybe. Maybe not.

On the other hand, every time you "work" on a system you introduce entropy.

As long as you remove more entropy than you introduce, you should have a more reliable system (than if you hadn't worked on it at all). But that gets into the training/knowledge of the person performing the PM.

That's where planning comes in. by khasim · 2011-11-27 10:00 · Score: 1

You only need a server for each item that is different. So if you standardize on hardware / OS then you only need 1 server to test hardware drivers and OS updates and so forth.

Beyond that, you really should have a test database system and a test app system. You never want to deploy updates into a production environment without going through a test system first (which is NOT the same as a development environment).

Virtual systems can help a lot with the server requirements. But you still need to understand the hardware / virtual / OS / app differences and plan accordingly.

Then there is incremental Dis-synergy by Anonymous Coward · 2011-11-27 10:01 · Score: 0

I work for a global pharma company with several key data centers (DC) around the world, and they are interconnected with each other and 100's of client sites. The problem is that this interconnection infrastructure and the internal infrastructure at each site are in a constant state of flux as new requirements are met (technical, business, regulatory, etc), and old ones go away. Thus the apps and infrastructure and their interdependencies are always several steps ahead of keeping documentation up to date despite rigorous requirements (not only per good IT practice, but also per medicinal regulatory requirements).

There have been "unintended consequences" every time there has been such maintenance, from upgrading a few switches to full DC power recycles to check UPS batteries/generators, and other parts and pieces - "oh, the US network to the rest of the globe goes through THOSE switches now? ... They use WHICH LDAP server?" The complexity is beyond even the most ambitious and determined efforts of a lot of very smart people working their tails off trying to make it all go right using all the IT buzzword-compliant techniques - ISO 9000, ITIL, LEAN, Six Sigma, etc. That is exacerbated by personnel turnover (or just plain too much laying off per the bottom-liners) throwing new support people into unfamiliar circumstances as someone mentioned above.

We have to do this stuff, but we have not figured out how to keep up with all the moving parts as they keep changing...

YMMV

Re:Can too much FU make BAR unreliable? by catmistake · 2011-11-27 10:11 · Score: 1

FTFY

--
The Admin and the Engineer

Sounds like my desktops by coldsalmon · 2011-11-27 10:28 · Score: 1

My super-cool Linux box at home, which I work on all the time, is much less reliable than my work desktop running XP, which I never touch except to do my job. The most reliable, of course, is the headless Linux server that sits under my desk at home and never gets touched. In fact, I have this separate server precisely because I know that I will mess up my desktop by trying to fix/maintain it all the time, and I intend never to touch the server unless something goes wrong.

Re:Sounds like my desktops by swalve · 2011-11-27 16:55 · Score: 1

The solution is not to touch things less often, but to learn how not to make mistakes. What's the Linux cardinal rule? Never use root for anything. That is ridiculous- learn how to check your typing before hitting enter and you'll never have any problems.
Re:Sounds like my desktops by coldsalmon · 2011-11-28 07:54 · Score: 1

Ah, but the problem itself is that it is impossible to avoid making mistakes. If your solution to this problem is to never make a mistake, then we are operating on different premises. I am very skeptical of your premise that it is possible to avoid making mistakes. Checking one's typing and avoiding root will prevent some mistakes, but there are always other mistakes to be made. I'm sure you can think of some mistakes that you yourself have made which involved neither typos nor unnecessary root access.
Re:Sounds like my desktops by swalve · 2011-11-28 12:49 · Score: 1

I make mistakes all the time. But not on mission critical gear. Measure twice, cut once.

oblig Dilbert by arielCo · 2011-11-27 10:42 · Score: 3, Funny

http://dilbert.com/dyn/str_strip/000000000/00000000/0000000/100000/20000/5000/600/125621/125621.strip.zoom.gif

--
This post contains no rudeness or derision of any kind. All arguments are friendly. Terms and exclusions may apply.

Re:oblig Dilbert by marcosdumay · 2011-11-27 12:40 · Score: 1

That's maybe the only opportunity one has to see Alice afraid :)

--
Rethinking email

nothing new by mjwalshe · 2011-11-27 10:58 · Score: 1

I remember they found that reducing the bank cleaning frequency increased the reliability of Strowger exchanges (old school mechanical phone exchanges)

Short term gain, long term pain by Nefarious+Wheel · 2011-11-27 11:36 · Score: 1

Whenever I hear this meme touted (and I've heard it a *lot* over the last 40 years) I immediately think -- someone wants to shave a few maintenance dollars, trading short-term gain for long-term pain.

Your money, your choice.

--
Do not mock my vision of impractical footwear

Set it and forget it by 1310nm · 2011-11-27 12:26 · Score: 1

Having been in telecom for over a decade, I can attest to the fact that network reliability is closely related to the amount of change activity taking place in the network.

I mostly agree. by marcosdumay · 2011-11-27 12:30 · Score: 1

Just notice that there is some value in forcing your hardware to fail at a time when your downtime will be cheaper. Also, if you are smart, you'll induce redundant components to fail at different times, so downtime will cost just the maintenance price.

Of course, places that fail to see that they are anticipating failures by doing maintenance won't ever plan for that.

--
Rethinking email

Slashdot turned into InfoWorld? by guruevi · 2011-11-27 12:51 · Score: 1

according to some industry experts. 'The most common threat to reliability is excessive maintenance,' said Steve Fairfax of 'science risk' consultant MTechnology. 'We get the perception that lots of testing improves component reliability. It does not.' In some cases, poorly documented maintenance can lead to conflicts with automated systems, he warned. Other speakers at the recent 7x24 Exchange conference urged data center operators to focus on understanding their own facilities, and then evaluating which maintenance programs are essential, including offerings from equipment vendors.

Well, yeah, now that you identified the weasel words and marketing speak, it sounds a lot less worse. In other news: government says it needs to expand itself, banks say you should put your money with them and Coca-Cola says they deliver a better product than Pepsi.

I clicked through to the website. I get such magazines for 'free' every month; they praise vendor after vendor for proprietary products that help manage some problem or other that any decent sysadmin can simply fix. The whole article is just fluff about common sense. Yes, if you PM a component it's not going to be available in a doubled setup; that's why, if you really need availability, the mantra is that a system is not truly redundant unless you have 3 independent systems. This has long been known by those who build high availability clusters (as in really high availability), but due to cost "savings" by upper levels it's often down to two systems, or one real and one virtual system.

--
Custom electronics and digital signage for your business: www.evcircuits.com

[kicks] and giggles by Anonymous Coward · 2011-11-27 16:00 · Score: 0

There is also that update that fails and stays stuck at an arbitrary percentage regardless of everything you throw at it, short of lobotomizing the registry. With these kinds of experiences it's no wonder anyone would feel like they are sticking a firecracker in their baby for [kicks] and giggles.

Jeffery! by Anonymous Coward · 2011-11-27 16:37 · Score: 0

Stop "That"!

Don't touch "It"!

You will go blind.

Re:well some windows updates still need reboots le by tlhIngan · 2011-11-27 16:41 · Score: 1

well some windows updates still need reboots it's less than it was in the past but still more than with linux.
Also a lot of NON OS software updates / installers at least say they need a reboot.

It's because Microsoft doesn't support replacing in-use DLL's. The primary reason for that is DLL's don't implement binary-compatible interfaces - an in-use DLL may have a different ABI than that of the new one.

The "reboot" requirement comes from this - to allow replacing of in-use DLL's, the system has a special registry entry that allows listing files in use to be moved by the kernel on reboot.

It affects Linux less as the main libraries are typically not only binary compatible, but binary compatible with previous versions that may have incompatible interfaces. Plus IPC tends to happen across sockets with well-defined interfaces, unlike that for random DLL's that interact through things like COM.

Anyhow, it's always good practice, after doing an update, to reboot the system twice. The first reboot applies all the updates and makes sure things work. The second is to ensure things come up *again* in a normal reboot, and not one that did stuff like update DLLs and such.

Re:Can faulty logic make data centers less reliabl by aaarrrgggh · 2011-11-27 17:47 · Score: 1

The power systems comparison is that in the old days (through about '92), we would frequently do building power-down maintenance at data centers. It started out at 3-year intervals, and was stretched to 5. Hundreds of electricians and several hundred IT staff would participate. Thermal shock killed 5% of the equipment on power-up, but everybody was standing by with spares. Everything would be cleaned, torqued, repaired, and tested. The DR systems would actually be put into action.

It took about 6 months of preparation, and cost roughly $4MM per site (200,000 square feet of raised floor).

Every few years there was a 'gotcha' moment: one year someone was killed, another the facility dropped load after restart due to human error. This all happened despite proper procedures and preparation. Today, with better UPSs, conical washers, and 2N designs, we eliminate the issues and get by with token IR scans.

Re:battery maintenance / changing out battery is n by aaarrrgggh · 2011-11-27 18:07 · Score: 1

Flooded batteries require maintenance. VRLA batteries require replacement. In both cases, a battery monitoring system measuring cell impedance goes a long way, but periodic discharge will tell more.

Fear, Uncertainty, Doubt by Anonymous Coward · 2011-11-27 23:07 · Score: 0

What is the definition of "failure"? Steve Fairfax said, “if you buy a 2N data center, you’ll have twice as many component failures as a 1N data center. But you’ll be more reliable.”

More redundancy = higher reliability.
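
To put rough numbers on that quote, here is a minimal sketch. It assumes each power path is unavailable with the same probability and that paths fail independently, which real facilities do not guarantee (common-mode failures exist). The 5% figure and the function names are made up for illustration and do not come from the article.

```python
def expected_component_failures(paths: int, p_fail: float) -> float:
    """Expected number of failed paths; grows linearly with the path count."""
    return paths * p_fail


def outage_probability(paths: int, p_fail: float) -> float:
    """Chance that every path is down at once, assuming independent failures."""
    return p_fail ** paths


if __name__ == "__main__":
    p = 0.05  # hypothetical per-path failure probability, not a measured value
    for name, paths in (("1N", 1), ("2N", 2)):
        failures = expected_component_failures(paths, p)
        outage = outage_probability(paths, p)
        print(f"{name}: expected component failures={failures:.3f}, "
              f"outage probability={outage:.4f}")
```

Under these assumptions the 2N site sees twice the component failures, but its chance of losing the load entirely drops from p to p squared.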

Eh? by Anonymous Coward · 2011-11-27 23:52 · Score: 0

Rebooting does not require power cycling.

Time by Anonymous Coward · 2011-11-28 00:02 · Score: 0

The longer a device runs, the higher the likelihood that some counter will overflow/wrap and cause odd/fatal behavior - the Windows 49 day bug is one example.
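
The 49-day figure is commonly explained as a 32-bit counter of milliseconds since boot running out of room. A minimal sketch of that arithmetic follows, plus one common way to compute elapsed time that survives the wrap. The function names are hypothetical and this is not how any particular OS implements its timers.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_until_wrap(bits: int = 32, tick_ms: int = 1) -> float:
    """Days before an unsigned counter of the given width wraps to zero."""
    return (2 ** bits) * tick_ms / MS_PER_DAY


def elapsed_ms(start: int, now: int, bits: int = 32) -> int:
    """Elapsed ticks between two counter readings, tolerating one wrap."""
    return (now - start) % (2 ** bits)


if __name__ == "__main__":
    print(f"32-bit millisecond counter wraps after {days_until_wrap():.2f} days")
    start = 2 ** 32 - 100  # reading taken 100 ms before the wrap
    now = 50               # reading taken 50 ms after the wrap
    print("naive difference:", now - start)               # negative, nonsense
    print("wrap-safe difference:", elapsed_ms(start, now))
```

Code that compares raw readings directly will see time jump backwards at the wrap; taking the difference modulo the counter width avoids that, as long as the real interval is shorter than one full wrap.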

Capex vs OPEX by Anonymous Coward · 2011-11-28 01:25 · Score: 0

Projects tend to be late and over budget.

Call it maintenance to bypass oversight & embarrassment.

At what point does 'maintenance' become: 'keep the paint - gut everything else'?

http://developergeeks.com/article/60/software-reliability-engineering

VMs have been a godsend for this. by zerofoo · 2011-11-28 02:35 · Score: 1

Until we started building our servers as VMs, I always thought - leave well enough alone. Patch when necessary, but don't mess with success.

Decoupling servers from hardware has been a huge help to our testing, backup, patching, and recovery processes.

Server borked? Restore the most recent snapshot on another machine - or better still have the hypervisor do it for you.

We even have our servers snapshotted before and after patching - just in case.

-ted

Funny Story by Cameron+Fwoosh · 2011-11-28 05:45 · Score: 1

I used to work for a company that managed systems remotely: network devices, servers, and so on. The sites all managed their own workstations and internal network devices, but we took care of the enterprise.

So, one day, while looking at my logs, I notice an alert pop up for one site indicating that the network devices all went down. This site in particular is a small one that does not generally have people needing outside connectivity, but they pay for uptime and are therefore just as important as any large site. Because of how our alert system polls, there can be a 5-minute window of error on the alert: the failure could have happened any time between the previous poll, 5 minutes earlier, and the poll that just went out. I investigate by attempting to log into the device, and am able to log in without any problems. I check the logs and see, sure enough, a power failure. So I call the site admin to make sure everything is all right, and the admin tells me there was no power failure at the site. He checks the network room and finds nothing out of the ordinary.

Well, this warrants further investigation, just to see if there is any issue with the equipment, and because I am bored enough to check on anything at this point. After looking at logs for the last two months, I find a very interesting fact: power failure tickets auto-generated at the same time each Monday and Wednesday for the whole two months.

Long story short, we find that the cleaning lady doesn't have a free power outlet to plug in the vacuum, so she unplugs the equipment, does her work in the small area, and then plugs the equipment back in. Since the cleaning lady has room clearance, and everyone in the building knows her, there simply was never a question when she went into the small room to clean.

Now, I know it's a different kind of maintenance than the OP is discussing, but I thought I would share a funny little story for your Monday. Hope you enjoy.

Bullshit. by Anonymous Coward · 2011-11-28 09:00 · Score: 0

You must not deal with any Oracle database servers. They leak like a sieve.

I deal with four of them that run pretty much 24x7x365. Running on Windows 2003 Server 32-bit, no less. Only one of them undergoes weekly reboots to deal with a problem that, ironically, is a Microsoft leak and has nothing to do with Oracle. The other 3 boxes run until something forces a reboot, generally about once a year.

These Oracle servers run critical roles at a 911 public safety emergency dispatch center too.

I also run Oracle on Linux and Oracle on AIX, and those are even more stable than the Oracle/Windows boxen. The Oracle/AIX server (running a legacy utility billing system for a municipal govt) once ran nonstop for almost three years before it was shut down because a new UPS/generator system was being wired into the data center.

If you cannot run an Oracle database server stably, you're doing something either very wrong... or very unusual.

I've been an Oracle admin for over 15 years and although Oracle is sophisticated and complex, it's not insurmountable.

not just data centers by Anonymous Coward · 2011-11-28 15:47 · Score: 0

heck, i discovered a while ago that every hour my car spent at the mechanic's needed an additional half hour on my part to fix whatever the mechanic screwed up in his work: rerouting the wires which had been left dangling on the exhaust manifold, replacing missing fasteners, etc.

NSFW maintenance hints by Anonymous Coward · 2011-11-29 01:09 · Score: 0

I can't remember where I heard this, but the more I think about it, the truer it becomes.

Hardware is like an erect penis - it will stay up unless you fuck with it.
Corollary 1) Of course, paying it a bit of attention every once in a while will be necessary.
Corollary 2) Nothing lasts forever, no matter how good it is at the start.

Resnikoff by golodh · 2011-11-29 19:11 · Score: 1

For some reason or another this author has included a mini course in standard graduate-level probability theory in his report.

Perhaps he needed to reach a minimum pagecount?

185 comments