Slashdot Mirror


Can Maintenance Make Data Centers Less Reliable?

miller60 writes "Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts. 'The most common threat to reliability is excessive maintenance,' said Steve Fairfax of 'science risk' consultant MTechnology. 'We get the perception that lots of testing improves component reliability. It does not.' In some cases, poorly documented maintenance can lead to conflicts with automated systems, he warned. Other speakers at the recent 7x24 Exchange conference urged data center operators to focus on understanding their own facilities, and then evaluating which maintenance programs are essential, including offerings from equipment vendors."

46 of 185 comments

  1. In between maybe? by anarcat · · Score: 5, Insightful

    Maybe there's a sweet spot between "no testing at all" and "replacing everything every three months"? In my experience, there is a lot of work to do in most places to make sure that proper testing is done, or at least that emergency procedures are known and people are well trained in them. Very often documentation is lacking and the onsite support staff have no clue where that circuit breaker is. That is the most common scenario in my experience, not overzealous maintenance.

    --
    Semantics is the gravity of abstraction
    1. Re:In between maybe? by Elbereth · · Score: 5, Interesting

      I suppose that I'd agree. Back in the early 90s, I inherited from a friend a fear of rebooting, turning off, or performing maintenance on a computer. Half the time he opened the case, the computer would become unbootable or never turn back on. Luckily, as a talented engineer, he could usually fix whatever the problem was, but it was a huge pain in the ass. Of course, back then, commodity computer hardware was hugely unreliable, with vast gaps in quality between price ranges, and we were working with pretty cheap stuff. Still, to this day, I dread the thought of turning off a computer that has been working reliably. You never know when some piece of crap component is nearing the end of its life, and the stress of a power cycle could be what pushes it over the edge into oblivion (or highly unreliable behavior). I used to be fond of constantly messing with everything, fixing it until it broke, but his influence moderated that impulse in me, to the point where I usually freak out when anyone suggests unnecessarily rebooting a computer. Surely, there's something to say for preventative maintenance, and I'd rather be caught with an unbootable PC during regularly scheduled maintenance than suddenly experience catastrophic failure at random, but there's something to be said for just leaving the shit alone and not messing with it. Every time you touch that computer, there's a slight chance that you'll accidentally delete a critical file directory, pull out a cable, or knock loose a power connector. The fewer the times you come into contact with the thing, the better. If I could build a force field around every PC, I probably would.

    2. Re:In between maybe? by mehrotra.akash · · Score: 5, Funny

      fixing it until it broke

      That's the spirit!!

    3. Re:In between maybe? by sphealey · · Score: 3, Informative

      ===
      Back in the early 90s, I inherited from a friend a fear of rebooting, turning off, or performing maintenance on a computer. Half the time he opened the case, the computer would become unbootable or never turn back on.
      ===

      Neither you nor your friend is alone in thinking that:

      AD-A066579, RELIABILITY-CENTERED MAINTENANCE, Nowlan & Heap, (DEC 1978) [this used to be available for download from the US Dept of Commerce web site; now appears to be behind a US government paywall (!)]

      A more recent summary:

      http://reliabilityweb.com/index.php/articles/maintenance_management_a_new_paradigm/

      sPh

    4. Re:In between maybe? by AliasMarlowe · · Score: 4, Informative

      It lives on also among the DoD's general specifications, and can be downloaded from this page.

      --
      Those who can make you believe absurdities can make you commit atrocities. - Voltaire
    5. Re:In between maybe? by mspohr · · Score: 4, Interesting
      Do you know why satellites last so long in a hostile environment?... because nobody touches them.

      "If it's not broken, don't fix it."

      --
      I don't read your sig. Why are you reading mine?
    6. Re:In between maybe? by dave562 · · Score: 2

      I am still that way with firmware upgrades. I think it probably has something to do with our generation. In the 90s, computer hardware was touchy and was expensive to replace. If you're like me, you probably grew up blowing into Nintendo game cartridges when they did not work. But back to firmware, I only upgrade it when necessary. Over the last fifteen years I have seen too many firmware upgrades bork hardware that was working just fine. With security patches I do them monthly, but not firmware. And never Cisco IOS. Once the config is good, leave it be!

    7. Re:In between maybe? by CyprusBlue113 · · Score: 4, Insightful

      Do you know why satellites last so long in a hostile environment?... because nobody touches them.

      "If it's not broken, don't fix it."

      Actually I'm pretty sure it's the millions that are spent engineering each individual one so that it specifically can survive many years in said hostile environment.

      If we spent anywhere near what is spent on proper engineering in time and money, everyday crap would be pretty damn reliable too, just not nearly as cost effective

      --
      a handful of selfish greedy people are no match for millions of selfish, greedy people -u4ya
    8. Re:In between maybe? by mabhatter654 · · Score: 4, Insightful

      if that's the case, you don't have CONTROL over your equipment.

      That was acceptable for Windows 95 but not even for desktop PCs anymore, let alone server equipment. My opinion is that your equipment isn't stable UNTIL you can turn it off and on again reliably. And yes... that is an ENORMOUS amount of work.

      If you can't reliably replace individual pieces then you don't have control for maintenance... sure you can stick your head in the sand and just not touch anything... but that's just piling up all the things you didn't take time to figure out until some critical time later.

    9. Re:In between maybe? by Anonymous Coward · · Score: 2, Interesting

      If the new or refurbished electronics you're buying are THAT unreliable, why the !%!@#$!@%! are you using them?

      If a router fails to come up because a cap is ready to blow, what happens when it blows WHILE IT'S RUNNING?

      I had that happen with 2 Cisco ASA firewalls. One was 5 years old, the other was a few months. They were using HSRP and decided fighting amongst each other for control was a great idea because one of the ports was going out. We took the old one offline; it wouldn't turn on anymore. The new one? Worked fine.

      Over a long enough time-line the failure rate for equipment is 100%. Equipment is usually rated with an MTBF; there's LOTS of documentation on when you replace. You replace laptops every 2 years, desktops and servers every 3, networking equipment every 4, appliances per the manufacturer's specs, and the LAN copper & fiber either when you're doing a major rebuild or when the kit is being replaced.

      If management is too incompetent to tell what the TCO for a mission critical project is and budget the cash for replacements, why are you working for them?

      Rebooting servers is something that needs to happen, depending on the OS, monthly, quarterly and for high-end enterprise systems, biannually. What happens if you don't reboot and purge errors on a schedule? E.G. For a Windows Fileserver; you reboot monthly, run chkdsk, export settings via config files (or run it in a VM) at the BARE minimum and run backups. When you build a database you need to build a routine to purge bad data every once in awhile. For a web server, a nightly reboot is commonplace.

      I worked at a warehouse a few years back; 500k+ sq feet, 500+ employees. They didn't invest in their tech and when their Oracle DB went corrupt, they didn't even have backups. Someone at corporate devised a way to use the corporate records to rebuild their records; 2 weeks later they were back up and running but not before losing 2 vendors. The cost of three 9's for them was right around 80k for the install and ~20k/year thereafter. The cost of the failure was nearly 2 million; the vendors that did stay required they provide expedited shipping to their customers. Did I mention it went down during the Christmas shipping season?
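To put the figures in that anecdote side by side, here is a rough back-of-the-envelope sketch in Python. It rounds "nearly 2 million" to 2,000,000, ignores inflation and discounting, and assumes a single outage of that size; the function names are made up for illustration.

```python
# Figures taken from the comment above; the failure cost is rounded up.
INSTALL_COST = 80_000      # one-off cost of the three-nines setup
ANNUAL_COST = 20_000       # yearly running cost thereafter
FAILURE_COST = 2_000_000   # "nearly 2 million" for the one outage


def cumulative_cost(years):
    """Total spent on the protected setup after a number of years."""
    return INSTALL_COST + ANNUAL_COST * years


def break_even_years():
    """Years of running costs before spending matches one such outage."""
    return (FAILURE_COST - INSTALL_COST) / ANNUAL_COST


if __name__ == "__main__":
    print(cumulative_cost(10))   # spend over a decade
    print(break_even_years())    # years until spend equals the outage cost
```

Under those assumptions, a decade of protection costs a fraction of the single outage, which is the point the comment is making.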

      Who paid for that?

      If you're running in an environment that badly maintained, you're the managerially-acceptable fall-guy to justify their bonuses; if the equipment is in such a bad state that you're afraid of it, you should be looking for work at a company that does things right.

    10. Re:In between maybe? by greenfruitsalad · · Score: 5, Interesting

      i can't agree. i used to but now i cannot afford to.

      we recently experienced 2 catastrophes (datacentre-wide downtimes, you know things that NEVER happen) and the results were unbelievable. GRUBs failed to load OSes, machines were without a bootloader (due to emergency disk hotswaps), some machines simply didn't turn on, services didn't autostart, a few virtual servers autostarted on multiple hosts (instead of just one), fsck on some of our volumes took hours to finish, 30% of supermicro IPMI cards were unresponsive, etc. it revealed that almost nobody had followed procedures properly.

      after that, every single service we have is built in a clustered manner with nodes spread across multiple datacentres. I now restart machines and pull cables at regular intervals to test bgp/ospf, clustering, recoveries, to check filesystems, etc. i am now also ABLE TO SLEEP.

  2. Maintenance and prevention are not always the same by sandytaru · · Score: 3, Interesting

    I believe the article is referring to major hardware replacements, stress testing, etc. But there is other preventative or even detective work that needs to be done in data centers large and small that has nothing to do with equipment. You can't just blithely assume that things are always going to work as they are supposed to work. One time, we discovered that the camera server for one of our clients had stopped recording for no good reason, and upon closer inspection discovered that the hard drive had failed and we had no alert system in place since it wasn't a "real" server but just a heavy duty XP machine. After that blunder, I was asked to check on all the camera servers once a week and make sure I could actually open up and view recordings from days past. This is a preventative action, but not really a maintenance one.
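The weekly manual check described above could be partly automated. The following is only a minimal sketch, assuming the camera server writes its recordings as ordinary files under one directory; the function names and the 24-hour threshold are hypothetical. It checks that something was written recently and does not prove a recording is actually viewable.

```python
import time
from pathlib import Path


def newest_recording_age_hours(recordings_dir, now=None):
    """Age in hours of the most recently modified file, or None if no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime
              for p in Path(recordings_dir).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def recording_is_stale(recordings_dir, max_age_hours=24.0, now=None):
    """True if there are no recordings or the newest one is too old."""
    age = newest_recording_age_hours(recordings_dir, now)
    return age is None or age > max_age_hours
```

A scheduled job could call `recording_is_stale()` against each server's recording directory and send an alert when it returns `True`.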

    --
    Occasionally living proof of the Ballmer peak.
  3. Security updates by bjb_admin · · Score: 5, Informative

    Sometimes I get the feeling that security updates can in most cases cause more problems than the issues themselves.

    I can think of many occasions that a security update has broken a server/router/etc. Obviously the lack of a security update can lead to a bigger headache in the future. But the typical user doesn't understand and has the attitude "IT broke the server again".

    If a virus or hacker causes an issue the attitude is "I hope they fix that soon. I hate viruses/hackers" (obviously this is a huge generalization).

  4. Reliability Centered Maintenance by sphealey · · Score: 4, Interesting

    ===
    "Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts.'
    ===

    It isn't just human error: the very act of performing intrusive tasks under the theory of "preventative maintenance" can greatly reduce reliability of systems built of reasonably reliable components. This was studied extensively by the US airlines, US FAA, and later the USAF in the 1970s when the concept of reliability centered maintenance was developed for turbine engines and eventually full airliners. Look up the classic report by Nowlan & Heap. Very much counter-intuitive if one has been trained to believe in the classics of "preventative teardowns" and fully known failure probability distribution functions, but matches up well to what experienced field mechanics have been saying since the days of the pyramid construction.

    sPh

    Of course, today there is a huge "RCM" consulting industry, 7-step programs, etc that bears little resemblance to the original research and theories; don't confuse that with the core work.

  5. Maintenance took down Chernobyl by ExtremeSupreme · · Score: 3, Informative

    That being said, it was because their procedures were shit, not because they were doing maintenance.

    1. Re:Maintenance took down Chernobyl by crankyspice · · Score: 5, Informative

      That being said, it was because their procedures were shit, not because they were doing maintenance.

      Actually, no, the Chernobyl disaster was sparked with a 'live' test of a new, untested mechanism for powering reactor cooling systems in the event of a disaster that brought down the power grid. http://en.wikipedia.org/wiki/Chernobyl_disaster#The_attempted_experiment (And even that test was delayed several hours, into a shift of workers that weren't properly prepared to conduct the test.)

      --
      geek. lawyer.
    2. Re:Maintenance took down Chernobyl by vlm · · Score: 2

      GIVEN that their procedures were shit, maintenance actually made things worse and thus caused Chernobyl.

      I'm guessing you were going for the sarcasm points, but for those who don't know about nuke eng as much as myself and presumably scamper, they had perfectly good procedures for experiment engineering evaluation that they mysteriously chose not to follow, and there was no maintenance involvement at all. It's the opposite of what he was claiming.

      The quickie one liner of what happened is a RBMK has an extremely sensitive control loop by the very nature of what it means to be a RBMK, and the engineers who know exactly what happens when you suddenly slam the gain of a control loop like that up to 11 were intentionally cut out of the loop; no one officially knows why; the negative oscillations to zero were not terribly impressive, but everyone noticed the final positive swing to 40 GW or so.

      The ironic part is they were trying to improve safety by figuring out a ultra short term blackstart capability for the safety systems. It would actually have worked pretty well on a PWR design, which is probably what gave them that peculiar idea. One of the dead guys probably successfully did that "all the time" on his old PWR...

      --
      "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
  6. Re:Maintenance and prevention are not always the s by belrick · · Score: 2

    After that blunder, I was asked to check on all the camera servers once a week and make sure I could actually open up and view recordings from days past. This is a preventative action, but not really a maintenance one.

    No, it's not preventative. It does nothing to prevent the problem. It detects the problem earlier (before, say, a business user does). That's monitoring. It's proactive, not reactive - perhaps that's what you mean?

  7. where's the car analogy? by Anonymous Coward · · Score: 4, Funny

    The guy at the garage always recommends I do an $80 transmission flush.

    1. Re:where's the car analogy? by xs650 · · Score: 2

      What he was really recommending was an $80 wallet flush.

  8. Re:Maintenance and prevention are not always the s by sphealey · · Score: 2

    ===
    No, it's not preventative. It does nothing to prevent the problem. It detects the problem earlier (before, say, a business user does). That's monitoring. It's proactive, not reactive - perhaps that's what you mean?
    ===

    It is deeply unclear whether what is traditionally termed "preventative maintenance" (intrusive work involving disassembling, eyeballing, software probing, etc) actually improves reliability over condition monitoring tests followed by break-fix work as described by the parent post. More PM, more procedures, more teardowns, and so forth are the standard prescription for improving reliability but there are metric tons of evidence the universe just doesn't work that way.

    sPh

  9. More MBA Consultant BS... by Anonymous Coward · · Score: 3, Interesting

    Seriously...I sometimes think the average IQ is dropping on a daily basis (and, yes, I get the irony)...Both with what I read, and my own experiences working in IT, I become more and more convinced that society will eventually collapse under the weight of bad advice from consultants (and, no, I don't own a fallout shelter)...and I spend more and more time thinking about ways that I can profit off of the stupidity of leadership.

  10. Another lesson relearned by jimbrooking · · Score: 2

    In days of old, running "big iron" from Control Data and Cray, the worst days of system instability were those following "preventive maintenance". Plus ca change....

  11. Can faulty logic make data centers less reliable? by DragonHawk · · Score: 5, Insightful

    From TFS:

    "... poorly documented maintenance can lead to conflicts with automated systems ..."

    That doesn't mean maintenance makes datacenters less reliable. It means cluelessness makes datacenters less reliable.

    Sheesh.

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
  12. Maintenance-induced failure. by Animats · · Score: 5, Insightful

    There's something to be said for this. Back when Tandem was the gold standard of uptime (they ran 10 years between crashes, and had a plan to get to 50), they reported that about half of failures were maintenance-induced. That's also military experience.

    The future of data centers may be "no user serviceable parts inside". The unit of replacement may be the shipping container. When 10% or so of units have failed, the entire container is replaced. Inktomi ran that way at one time.

    You need the ability to cut power off of units remotely, very good inlet air filters to prevent dust buildup, and power supplies which meet all UL requirements for not catching fire when they fail. Once you have that, why should a homogeneous cluster ever need to be entered during its life?

    1. Re:Maintenance-induced failure. by DarthBart · · Score: 5, Insightful

      There's also been a shift in the mentality of how well computers operate. It went from not tolerating any kind of downtime to the Windows mentality of crashing and "That's just how computers are".

    2. Re:Maintenance-induced failure. by brusk · · Score: 2

      I think that predates Windows. Crashes of various kinds were frequent on Apple IIs, Commodores, etc. You just got used to various reboot/retry routines.

      --
      .sig withheld by request
    3. Re:Maintenance-induced failure. by jklovanc · · Score: 2

      Possibly because ten out of 100 units have failed because a $200 hard drive has failed in each one? Does that mean that the whole $100,000 cluster needs to be replaced? Spending $100,000 instead of $2000 is not a great decision.
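The arithmetic in that objection can be written out as a short Python sketch. The numbers are the ones in the comment (100 units, 10 failed, a $200 drive each, a $100,000 cluster); the function name is invented for illustration, and labor and the risk of maintenance-induced failures, which the parent comment is worried about, are left out.

```python
def repair_cost(failed_units, part_cost):
    """Parts-only cost of fixing each failed unit individually."""
    return failed_units * part_cost


CLUSTER_COST = 100_000   # replacing the whole container
FAILED_UNITS = 10        # 10% of 100 units
DRIVE_COST = 200         # one failed hard drive per unit

parts = repair_cost(FAILED_UNITS, DRIVE_COST)
ratio = CLUSTER_COST / parts

if __name__ == "__main__":
    print(parts)   # parts-only repair bill
    print(ratio)   # how many times more the full replacement costs
```

Whether the cheaper repair wins in practice depends on what a technician visit costs and how often opening the container breaks something else, neither of which is given in the thread.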

  13. The key to achieving high uptime ... by zensonic · · Score: 2

    ... is actually quite simple: You keep your hands off the systems. Period.

    In detail, you plan, install and _test_ your setup before it enters production. You make sure that you can survive whatever you throw at it wrt. errors and incidents. You then figure out how much downtime you are allowed to have according to SLA. You then divide this number into equal sized maintenance windows together with the customer. And then you adhere to these windows! No manager should ever be allowed to demand downtime out of band. Period. In between you basically minimize your involvement with the systems and plan your activities for the next scheduled closing window.

    And you of course only deploy stable, tried and tested versions of software and operating systems. And even though your OS supports online capacity expansion on the fly, you really shouldn't use the capability unless you absolutely have to. Instead you plan ahead in your capacity management procedure and add capacity in the closing windows. And you do not test and rehearse failures! It only introduces risks ... besides that you have already tested and documented them. And as you haven't changed the configuration, there is no need to test again.

    So in essence. Common sense will easily yield 99.9%. Careful planning and execution will yield 99.99%. The really hard part is 99.999%... /zensonic
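For a sense of scale, the downtime budget behind those percentages and the "divide it into equal windows" step can be sketched in Python. This assumes a 365-day year and that the SLA counts every minute of downtime, planned or not; the function names and the choice of four windows per year are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_pct):
    """Minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def window_length_minutes(availability_pct, windows_per_year):
    """Length of each window if the budget is split into equal parts."""
    return downtime_budget_minutes(availability_pct) / windows_per_year


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(pct)
        # Roughly 525.6, 52.56 and 5.256 minutes per year respectively.
        print(f"{pct}%: {budget:.2f} min/year, "
              f"{window_length_minutes(pct, 4):.2f} min per quarterly window")
```

At five nines the whole yearly budget is a little over five minutes, which leaves almost nothing to divide into windows.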

    --
    Thomas S. Iversen
    1. Re:The key to achieving high uptime ... by Smallpond · · Score: 4, Insightful

      Which means for every online server you need an offline test machine -- and a way to simulate the operating environment in order to test. Not many companies have the skill or cash to do that.

    2. Re:The key to achieving high uptime ... by darth+dickinson · · Score: 2

      And I would have my own personal unicorn that craps Skittles on demand. Also, I could eat candy and poop diamonds.

      Meanwhile, here in the real world... systems experience unexpected failures that will require them to be patched/rebooted/etc at the most inconvenient of times.

  14. Transfer switch ratings by vlm · · Score: 4, Interesting

    Check your transfer switch ratings. I guarantee it will be spec'd much lower than you think. The electricians think it'll only be switched a couple times in its life. The diesel service provider thinks you're running it twice a week. Whoops. If you run it once a week, it'll only survive a couple years, then you'll get a facility wide multi-hour outage. I've personally seen it over and over again over the past two decades. The best part is "we have a procedure" so it'll only be run during maint hours and the desk jockeys 200 miles away will run it rain or shine, so it's guaranteed that the xfer switch destroys itself at 2 am during a blizzard and it'll take half a day to repair.
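The check suggested above comes down to dividing the switch's rated number of operations by how often the test schedule exercises it. A minimal Python sketch follows; the rating of 200 operations is purely hypothetical, picked only because it reproduces the "couple of years" figure in the comment, and the assumption of two operations per test (over to the generator and back) is also mine. Use the number from your own switch's datasheet.

```python
WEEKS_PER_YEAR = 52


def years_until_rated_life(rated_operations, tests_per_week,
                           operations_per_test=2):
    """Years of scheduled testing before the rated operation count is used up.

    operations_per_test defaults to 2 on the assumption that each test
    transfers to the generator and then back to utility power.
    """
    per_year = tests_per_week * operations_per_test * WEEKS_PER_YEAR
    return rated_operations / per_year


if __name__ == "__main__":
    HYPOTHETICAL_RATING = 200  # not a real datasheet value
    print(round(years_until_rated_life(HYPOTHETICAL_RATING, 1), 2))  # weekly
    print(round(years_until_rated_life(HYPOTHETICAL_RATING, 2), 2))  # twice weekly
```

Real outages and retransfers also count against the rating, so the scheduled tests are only part of the total.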

    Very few xfer switches are more reliable than commercial utility power. Installing a UPS actually lowers reliability in almost all professional situations.

    My favorite power outage was caused by a gas leak a couple blocks away, where the utility co shut down the AC and then threatened to take an axe to the gen/UPS if not also shut off. This was not in the official written report, just word of mouth.

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
  15. Re:Maintenance and prevention are not always the s by bussdriver · · Score: 3, Interesting

    Planned obsolescence has been promoted in all aspects of life since post WW2 and now it is hard to imagine the world without it. That line of thinking has been creeping into everything even in areas where it doesn't seem to apply.

    Does this play a factor on the perception of preventative maintenance or its frequent application? I think it probably does in at least a couple ways, don't you?

  16. Useless article with no data. by Vellmont · · Score: 4, Interesting

    I read through the entire article, and saw zero data to support his assertion. I'm sure he has the data, but the article didn't reference a single piece of it. Without any data to support the theory all we have is a fluff opinion piece. Shame on Data Center Knowledge for writing an article about a scientific investigation, and not presenting a single piece of scientific evidence.

    --
    AccountKiller
  17. This is well known from Formula One by igb · · Score: 5, Interesting
    Some years ago, the F1 rules were changed so that cars were in parc ferme conditions, with strict limits on what can be done to them, from the start of qualifying on Saturday lunchtime until the race finishes on Sunday afternoon.

    The purpose was partly to stop qualifying being its own arms race, with cars in completely different specification than for the race, and partly to reduce costs and the number of travelling staff. At the same time, "T Cars" --- a third car, available as a spare --- were banned, so that if a driver destroys a car in practice the team either have to rebuild it or not race. They're allowed to travel with a spare monocoque, but it cannot be built-up and it does not get pit space.

    There were endless howlings from the teams, claiming that without a complete strip-down after qualifying, with a large crew working overnight to check everything on the car, reliability would go through the floor and races would finish with only a handful of stragglers fighting a durability battle (our US viewers may find this ironic in light of a certain US Grand Prix, of course).

    The same argument was advanced, mutatis mutandis, over limitations on engines and gearboxes, limitations on the number of gear clusters available, limitations on certain forms of telemetry and a wide variety of "the cars can't just be left to run themselves, you know" interventions.

    In fact, reliability is now far greater than ten years ago. It's not uncommon for there to be no mechanical retirements, certainly not from the longer-standing teams, and the days of engines imploding on the track are long gone. A front-running driver will probably only have one, if even that, mechanical DNF per season. The teams deliver a functioning car when the pit lane opens at 1pm Saturday, and that car then runs twenty or thirty laps in qualifying and sixty or seventy in the race, a total of perhaps 250 miles, without much maintenance work beyond tyres, fluids and batteries (section 34.1 on page 18 of the sporting regulations).

    So again, we see that "preventative maintenance" turns out to really be "provocative maintenance", and leaving working machines alone is the best medicine for them.

    1. Re:This is well known from Formula One by scattol · · Score: 4, Insightful

      Those cars, to be competitive, were engineered to fall apart on the other side of the finish line. Without maintenance they would have failed. They are now engineered to last a few races instead of just one. Odds are they are slightly slower in one form or the other but it being a level playing field, it doesn't matter.

  18. "The most reliable machine... by John+Hasler · · Score: 3, Funny

    ...is the one farthest from the nearest engineer."

    Consider The Pioneer and Voyager spacecraft and the Mars landers.

    --
    Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
  19. Re:Can faulty logic make data centers less reliabl by HalAtWork · · Score: 3, Insightful
    Exactly.

    vigorous maintenance
    excessive maintenance
    poorly documented maintenance

    Those are all qualified as out of the ordinary. Anything in excess (on either side of the scale, whether it is too much or not enough) is a problem. Of course maintenance must be performed, but I guess some data centers have a strange idea of best practices, or they do not follow them.

  20. Re:Can faulty logic make data centers less reliabl by FaxeTheCat · · Score: 4, Insightful

    Precisely my thought.

    Maintenance, like anything else you do in a datacenter or wherever you work, must be done correctly. If maintenance reduces the reliability of the maintained entity, then per definition, it was not correctly performed.

    Doing something correctly requires knowledge, planning and training. Just like everything else.

  21. Its All In the Process ... by __aajwxe560 · · Score: 2

    Having been involved in Technical Ops of both large and small companies for many years, I have seen DR exercises and designs that have run the gamut. The key thing I have found to the success of any organization, exercise, or philosophy is the underlying process that drives execution. The larger the team/org, the more change points, which in turn leads to more variables between tests. This creates complexity, as a test that ran fine a few months ago may not run the same today. However, ensuring change does not overrun process in understanding and applying the change into the greater design is a key to ensuring each test improves upon the last, until such time this is a finite process.

    For example, when working for one of the big 401k's, the first DR exercise evaluated the data center completely being leveled and re-locating both technical services as well as the ~300 on site employees to another location. Long story short, the first exercise of this was scheduled for 2 days, and while it worked, we identified dozens of issues. We scheduled the next test 6 months later and addressed what we believed were all of the issues; on the next test, we ran into perhaps ~10 issues. The next test we scheduled 3 months ahead and ran into ~2 issues. All the while, things continue to change and innovation is occurring, and change process control is ensuring that new things are being factored into the continual DR process/exercise. For a small telecom I worked for, the same type of testing was accomplished with ~2-3 week turnaround time (smaller team, fewer change points, more dynamic response), but with the same underlying principles.

    Documentation of such things is critical, and employee turnover is often one of the greatest risk points. Having a diversified staff with overlapping knowledge should minimize the latter risk to some degree, and if implemented fully, risk should be diminished.

    So how does all this tie back into maint? Well, it is anticipated that if any system runs long enough, there will be opportunity for failure. With preparation for when such failure occurs, one can balance the capability of providing a measured window of downtime (if any) and provide some degree of predictability (i.e. I test once a quarter). The counter to this can certainly be overzealous maint, so certainly there is a point to being reasonable. For example, what many of us go through with our cars - the dealer wants us to come in every 3k miles for an oil change, whereas realistically most mfr's and my own experience dictate that ~5k (if not longer depending on circumstance) is much more cost effective. Either way, this is providing some degree of confidence that this should prolong engine life.

  22. The quality of the people matters a lot by petes_PoV · · Score: 3, Insightful

    Although everyone makes mistakes, some people make hundreds of times more errors than others. Whether that's due to inherent lack of ability, poor training, lacking oversight, laziness, time pressures or just a slapdash attitude varies with each person. One place I was involved with (as an external consultant) made over 12,000 changes to their production systems every year. It turned out that well over half of those were backing out earlier changes, correcting mistakes/bugs from earlier "fixes" or other activities (a lot that resulted in downtime, and far too much of it unscheduled or emergency downtime) that should not have happened and could have been prevented.

    --
    politicians are like babies' nappies: they should both be changed regularly and for the same reasons
  23. Re:soft vs hard reboot by Bigbutt · · Score: 3, Insightful

    You must not deal with any Oracle database servers. They leak like a sieve.

    [John]

    --
    Shit better not happen!
  24. Re:soft vs hard reboot by hedwards · · Score: 2

    NFS locking up is ultimately a part of the spec. It was originally a stateless filesystem that operated over UDP. Unless you're using a more recent revision of the protocol and have it configured as such, you're going to have issues with it locking up regularly.

  25. Re:soft vs hard reboot by PCM2 · · Score: 2

    Desktop PCs and servers seem to have largely overcome the need to reboot regularly, but other segments of the industry seem to be moving backwards. My Android handset actually says in the manual that you should power cycle it regularly. With a firmware upgrade, it even started giving me a warning from time to time, telling me I had not power cycled the phone in X amount of times and that I should do that now or risk instability. (Am I crazy for assuming that a phone OS is a markedly less complex environment than a Linux server? And here I thought Android applications ran in a fully memory-managed, garbage-collecting environment.)

    --
    Breakfast served all day!
  26. oblig Dilbert by arielCo · · Score: 3, Funny
    --
    This post contains no rudeness or derision of any kind. All arguments are friendly. Terms and exclusions may apply.
  27. Re:soft vs hard reboot by afidel · · Score: 2

    Or JAVA, we run all the big enterprise application servers and they all run considerably better if they are rebooted on a regular basis.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.