Slashdot Mirror


Why You Shouldn't Reboot Unix Servers

GMGruman writes "It's a persistent myth: reboot your Unix box when something goes wrong or to clean it out. Paul Venezia explains why you should almost never reboot a Unix server, unlike say Windows."

75 of 705 comments

  1. Uptime by cdoggyd · · Score: 5, Funny

    Because you won't be able to brag about your uptime numbers.

    1. Re:Uptime by Anrego · · Score: 5, Funny

      I once had to move my router (486 running slackware and with a multi-year uptime) across the room it was in. It was connected to a UPS, however the cable going from the UPS to the computer was wrapped through the leg of the table it was sitting on.

      I actually _removed the table leg_ so I could haul the 486 still plugged into the UPS across the room and quickly plug it in before it powered down!

      and then we had the first real substantial power failure in years like a few months later.. and the thing had to go down :(

      But yeah.. now I reboot frequently to verify that everything still comes up properly.

    2. Re:Uptime by idontgno · · Score: 2

      and then we had the first real substantial power failure in years like a few months later.. and the thing had to go down :(

      Perhaps caused by minor hard drive damage caused by relocating the system while under power?

      A rotary-media hard drive is fairly robust, if static. If spinning, it's more fragile than a Slashdotter's ego.

      I mean, it's your server, and it's an ancient 486 and all, so respect the hardware to the limit and extent you want to, but for me, if it's mine and uses hard drives, it doesn't move 2 inches or tip 5 degrees while it's powered.

      --
      Welcome to the Panopticon. Used to be a prison, now it's your home.
    3. Re:Uptime by Anrego · · Score: 4, Funny

      I meant mains power.. due to a hurricane actually (hurricane Juan).

      The machine came out fine (and actually still runs.. though I don't use it as a router any more). Those old drives are surprisingly robust ..

      But yeah.. I was actually surprised.. and I did it more for the sake of the doing (the only reason I even left the machine going was because of the uptime). I'd never pull a stunt like that with a real machine :D

    4. Re:Uptime by kaiser423 · · Score: 2

      But yeah.. now I reboot frequently to verify that everything still comes up properly.

      Yea, I do that too. Too many times I've been out of the house, and a power failure happens and not all of the boxes boot up correctly, and then I can't access my stuff, or the wife complains about the picture gallery being down, etc, etc. To me, it's a good admin practice. If you aren't 100% sure that your servers boot up properly, how exactly are you prepared for a failure? Take it offline off hours, give it a reboot and make sure that it comes up, services start, and it re-joins its place in the network properly. Should be standard machine admin practice....
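
      The "make sure that it comes up, services start" step can be scripted. What follows is only a minimal sketch in Python, assuming each service you care about listens on a TCP port; the service names, hosts and ports are hypothetical placeholders, not anything taken from the thread.

```python
import socket


def service_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as "down".
        return False


def check_services(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: service_is_up(host, port)
            for name, (host, port) in services.items()}


if __name__ == "__main__":
    # Hypothetical list; replace with whatever the box is supposed to run.
    expected = {"ssh": ("127.0.0.1", 22), "gallery": ("127.0.0.1", 80)}
    for name, ok in check_services(expected).items():
        print(f"{name}: {'up' if ok else 'DOWN'}")
```

      Run from another machine after the planned reboot, a check like this only confirms that the ports answer; it says nothing about whether the service behind a port is healthy.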

    5. Re:Uptime by 19thNervousBreakdown · · Score: 3, Informative

      They're made of considerably smaller platters, so there's much less gyroscopic force (or whatever the fuck it's called), they spin down within minutes of being idle on most laptops, and every laptop these days comes with an accelerometer-based parking utility that stops the drive no matter what it's doing if there's too much force--they're almost certainly configured to be over-conservative from the factory, but generally it's difficult to even carefully pick a laptop up without it parking the drive.

      --
      <xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
    6. Re:Uptime by mallyone · · Score: 3, Funny

      I bet any slashdotter that he still has a 3 legged table! :).

    7. Re:Uptime by David Gerard · · Score: 2

      There's a reason my MP3 player and netbook both use flash. The MP3 skidded across the pavement as I was getting home this evening, and I have a 3yo daughter who delights in knocking things over.

      --
      http://rocknerd.co.uk
  2. Persistent myth? by 6031769 · · Score: 5, Interesting

    This is not a myth I had heard before. In fact, none of the *nix sysadmins I know would dream of rebooting the box to clear a problem except as a last resort. Where has this come from?

    --
    Burns: We're building a casino!
    McAllister: Arrr. Give me 5 minutes.
    1. Re:Persistent myth? by SCHecklerX · · Score: 4, Informative

      Windoze admins who are now in charge of linux boxen. I'm now cleaning up after a bunch of them at my new job, *sigh*

      - root logins everywhere
      - passwords stored in the clear in ldap (WTF??)
      - require https over http to devices, yet still have telnet access enabled.
      - set up sudo ... to allow everyone to do everything
      - iptables rulesets that allow all outbound from all systems. Allow ICMP everywhere, etc.

    2. Re:Persistent myth? by afabbro · · Score: 5, Insightful

      This is not a myth I had heard before.

      +1. This article should be held up as a perfect example of building a strawman.

      "It's a persistent myth that some natural phenomena travel faster than the speed of light, but at least one physicist says it's impossible..."

      "It's a persistent myth that calling free() after malloc() is unnecessary, but some software engineers disagree..."

      "It's a persistent myth that only the beating of tom-toms restores the sun after an eclipse. But is that really true?"

      --
      Advice: on VPS providers
    3. Re:Persistent myth? by arth1 · · Score: 4, Insightful

      Don't forget 777 and 666 permissions all over the place, and SELinux and iptables disabled.

      As for "ALL(ALL) ALL" entries in sudoers, Ubuntu, I hate you for ruining an entire generation of linux users by aping Windows privilege escalation by abusing sudo. Learn to use groups, setfattr and setuid/setgid properly, leave admin commands to administrators, and you won't need sudo.

      find /home/* -user 0 -print

      If this returns ANY files, you've almost certainly abused sudo and run root commands in the context of a user - a serious security blunder in itself.
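
      For anyone who would rather script the same check, here is a rough Python equivalent of that find command. It is a sketch, not a drop-in replacement: the function name is made up for illustration, and it walks everything under the given root rather than expanding /home/* in the shell.

```python
import os
from typing import Iterator


def files_owned_by(root: str, uid: int = 0) -> Iterator[str]:
    """Yield paths under root whose owner is uid, without following symlinks."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_uid == uid:
                    yield path
            except OSError:
                # Entry vanished or is unreadable; skip it.
                continue


if __name__ == "__main__":
    # uid 0 is root, matching "-user 0" in the find command.
    for path in files_owned_by("/home"):
        print(path)
```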

    4. Re:Persistent myth? by arth1 · · Score: 2

      Unfortunately, the GUI-befuddled people cause problems even on distro levels. Perfectly serviceable text configuration files give way to humongous xml files, or even databases without a plain text front end.
      This makes administration a real pain, and adds nothing except catering to the point-and-drool generation.

    5. Re:Persistent myth? by Dracos · · Score: 3, Funny

      I'm not familiar with Unix itself enough to comment, but with both Linux and *BSD...

      I'm not sure how to respond to that.

    6. Re:Persistent myth? by arth1 · · Score: 2

      I've noticed a few articles lately about how 'real men' login as root at all times

      No, they don't. They only do that when they need it, and have configured their systems so they rarely need it.

    7. Re:Persistent myth? by ByOhTek · · Score: 2

      That sounds like horrible software.

      Thinking of the Windows servers I admin and used to be an assistant admin for - we usually used reboots only after a large number of other diagnostics were tried. For our desktop users, yes, we said reboot first - but anything on the server should be stable enough as to not need a reboot.

      Actually, I am to blame for the one Windows server restart at my last job that wasn't due to a patch that required it. Long day, logged onto the backup domain controller and accidentally restarted it instead of logging out (yes, it has that extra popup. it was a LONG day).

      But yeah, I think the rebooting first with a server is in general just bad administration, regardless of OS.

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    8. Re:Persistent myth? by mini me · · Score: 2, Interesting

      He is quite correct in his assertion that Linux and BSD are not Unix. Without experience with real Unix systems, it would be impossible for him to verify that they exhibit the same behaviour. However, Mac OS X is Unix. I find it hard to believe that someone posting on Slashdot has not at least spent some time evaluating OS X, even if they ultimately decided it was not for them.

    9. Re:Persistent myth? by RocketRabbit · · Score: 2

      The BSD variants are descended from the original Berkeley Unix codebase, which is simply an enhanced form of original AT&T Unix. BSD is Unix. However, I think that of the BSD variants in use today, only Apple has had theirs certified by the Open Group, which makes it not just Unix but Unix(tm).

    10. Re:Persistent myth? by Waffle Iron · · Score: 3, Informative

      And yes, either one works, but '\\' is not necessary and it's a POS pattern that too many people follow because they don't or can't read the docs.)

      Here's a snippet from Microsoft's own current MSDN example on the PathMatchSpec() API call:

      ...
      void main(void)
      {
      // String path name 1.
      char buffer_1[ ] = "C:\\Test\\File.txt";
      char *lpStr1;
      lpStr1 = buffer_1;
      ...

      Gee, I wonder where these people get their path separator ideas? Maybe it's because they *did* read the docs.

    11. Re:Persistent myth? by element-o.p. · · Score: 4, Interesting

      As for "ALL(ALL) ALL" entries in sudoers, Ubuntu, I hate you for ruining an entire generation of linux users by aping Windows privilege escalation by abusing sudo.

      Yeah, I agree with you in principle, although to be fair, there really isn't a way that Ubuntu could know what user account you are going to set up before you actually set it up, and therefore, there isn't really a way for Ubuntu to create an appropriate sudoers entry to give admin privileges to the server admin.

      Learn to use groups, setfattr...properly...

      Okay, agreed...

      Learn to use...setuid/setgid properly...

      Ugh...setuid and setgid, IMHO, should be used as little as possible. If there's a security hole in your app, then having it setuid/setgid allows a sufficiently skilled user the ability to gain elevated privileges. I'd much prefer to use sudoers to give access to specific apps to people I trust than give any user access to an app I "trust" through setuid/setgid.

      ...leave admin commands to administrators, and you won't need sudo.

      Maybe I'm just missing something, but that sounds really stupid to me. While I'm a reasonably skilled Linux admin, I don't pretend to know everything, and maybe you can teach me something I've missed in my experience so far. If so, cool. But from my perspective, sudo is an ideal tool for granting appropriate permissions as required to trusted individuals. Sudo logs the user name and command in the log files, so if someone is abusing sudo, you know. Sudo can e-mail failures to admin staff, so if someone is habitually trying to exceed their permissions, you know. Sudo allows pretty fine-grained access to users based upon group or user name, so you can easily allocate permissions as required (well, relatively easily, anyway) -- much more fine-grained than Unix User/Group/Other permissions would allow. For example, with sudo you could allow senior admins (group: admin) and web developers (group: www-dev) read/write permissions to CGI script directories, junior admins (group: jadmin) read-only permissions and all other users (group: users) no access. Uh-oh...we've got four groups here: admins, jadmins, www-dev and users, so doing that with standard Unix permissions is going to be kind of difficult (admins could be members of the www-dev group I suppose, but I can imagine cases where group A might need permissions to a subset of files that group B owns, but shouldn't have access to another subset, which would really complicate things). Sudo is a powerful tool, and just like all the other tools you mentioned, should be used appropriately as a component of overall system security.

      find /home/* -user 0 -print

      If this returns ANY files, you've almost certainly abused sudo and run root commands in the context of a user - a serious security blunder in itself.

      Maybe. I see what you are saying, but as a counter-example, I sometimes run tcpdump from within my home directory when troubleshooting problems. tcpdump has to run as superuser, and I have a lot more faith in giving myself and other admins permission to run "sudo tcpdump" than running tcpdump setuid 0. Again, maybe I'm just missing something, but I really don't have a huge problem with tcpdump (or other admin tools) writing UID 0 data to an admin user's home directory.

      --
      MCSE? No, sir...I don't do Windows. Yes, I am an idealist. What's your point?
    12. Re:Persistent myth? by Creepy · · Score: 2

      Linux and many BSD flavors are not UNIX - UNIX is a trademark of the Open Group, and anyone that doesn't pay for certification and licensing of the trademark from the Open Group cannot call themselves UNIX. Apple has paid for certification and trademark usage, so it is UNIX; Linux and many flavors of BSD have not.

    13. Re:Persistent myth? by element-o.p. · · Score: 2

      I've noticed a few articles lately about how 'real men' login as root at all times, but I've worked in Unix/Linux since the 90's, and this seems to be a recent phenomenon.

      Yeah, I've seen that, too. I cut my sys admin teeth in a shop where we used sudo extensively. After four years, I did not have the root password to any of the *Nix servers we had (nor did I want them), but I did have "sudo all" permissions. After I left that job, I came to my present environment where the senior admin didn't want to bother setting up sudoers (to be fair, there were only two of us in the sys admin role, so if he didn't run a command as root, he knew who did...), and the fact that I sign in as root on our servers *still* makes me cringe.

      IMHO, and perhaps veering slightly off-topic, "real men" are secure enough in their own virility that they don't have to resort to acts of reckless bravado to prove how "manly" they are <shrug>

      --
      MCSE? No, sir...I don't do Windows. Yes, I am an idealist. What's your point?
    14. Re:Persistent myth? by Thyrsus · · Score: 2

      Unix is a trademark owned by The Open Group, and you may use that trademark to describe your system if you pay money to have them run their tests to verify compliance with the Single Unix Specification. I believe Red Hat has done that in the past, and that particular version of Linux was thus bona fide Unix(R), but it seems Red Hat has not chosen to continue certifying their systems. Someone please correct me if I'm wrong.

      I believe Red Hat sent back upstream all the changes they needed to make to pass the test; I presume many others also worked on conformance to the standard. Sometimes those behaviors aren't there unless the POSIXLY_CORRECT environment variable is set.

      Thus, while not "legally" Unix, Linux normally does realize all the concepts and behaviors of real Unix.

    15. Re:Persistent myth? by zoips · · Score: 2

      FWIW, I've never seen a garbage collector that actually worked 100%. We'd be better off writing good clean code instead of relying on them.

      Garbage collectors for languages like C/C++ are conservative because they can't always tell what's a pointer to an object or not, so these will not work 100% of the time. In languages where that isn't an issue (such as Java, Python, Ruby, whatever), the only time a garbage collector will fail is if it struggles with cyclical structures. Java's garbage collector works 100% of the time: if an object is not reachable, it will be reclaimed. Don't confuse this, however, with not having memory leaks; shitty developers who forget they've stashed a live reference to an object have only themselves to blame when the garbage collector does not rightly collect their object.
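
      Both halves of that claim can be seen in a few lines. This is a small sketch in Python rather than Java, assuming CPython and its cycle-detecting gc module; the Node class and the cache list are invented for the demonstration.

```python
import gc
import weakref


class Node:
    """A plain object that can point at another Node."""

    def __init__(self):
        self.other = None


def make_cycle():
    """Build two Nodes that reference each other; return a weak ref to one."""
    a, b = Node(), Node()
    a.other, b.other = b, a
    return weakref.ref(a)


# Case 1: an unreachable cycle. Nothing outside the cycle points at it,
# so the collector is free to reclaim both objects.
cycle_ref = make_cycle()
gc.collect()
print("cycle reclaimed:", cycle_ref() is None)

# Case 2: a "stashed" live reference. The object is still reachable
# through the list, so no collector is allowed to reclaim it.
cache = []
leaked = Node()
cache.append(leaked)
leak_ref = weakref.ref(leaked)
del leaked
gc.collect()
print("stashed object reclaimed:", leak_ref() is None)
```

      The weak references are only there to observe the objects without keeping them alive; the second case is the "memory leak" the comment blames on the developer rather than on the collector.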

    16. Re:Persistent myth? by John Hasler · · Score: 2, Informative

      ...only Apple has had theirs certified by the Open Group, which makes it not just Unix but Unix(tm).

      No. That makes it Unix(tm) but not Unix. With a hacked Mach kernel, a modified BSD userland, and a totally custom GUI it is considerably less like Unix than is Linux. BSD, on the other hand, is a direct descendant of Seventh Edition Unix. The fact that Open Group was willing to sell Apple a trademark license shows just how worthless that trademark is.

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
    17. Re:Persistent myth? by SWPadnos · · Score: 2

      Um. This has nothing to do with the path separator and everything to do with the C language.

      In C, the character '\' is the escape character. That's how you can print newlines ('\n'), tabs ('\t'), and other things. Since the backslash has a special meaning sometimes, you have to escape it with a backslash if you want one in your string.

      To get the string literal "C:\command.com" in your program, you have to declare it as "C:\\command.com" in the C source.
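
      The same rule is easy to check interactively. The sketch below uses Python instead of C, on the assumption that its ordinary string literals treat the backslash as an escape character in the same way for this case; the raw-string form is a Python feature with no direct equivalent in the C snippet quoted earlier.

```python
escaped = "C:\\command.com"   # two backslashes in the source, one in the string
raw = r"C:\command.com"       # raw literal: backslash is not an escape here

print(escaped)                # the doubled backslash is printed once
print(len(escaped))           # counts a single backslash character
print(escaped == raw)         # both spellings produce the same string
print(escaped.count("\\"))    # number of backslashes actually stored
```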

      --
      - The Sigless Wonder
    18. Re:Persistent myth? by TheQuantumShift · · Score: 2

      So I took 2 minutes and actually read the article. The point was that Unix is not Windows and reboots are not a fix-all. No straw man, just common sense advice for those MCSE's out there. It's also good advice for Windows, but after a few attempts to discover root cause only to find out that the MF'n Event Log is "corrupt and cannot be read", I don't blame people for just rebooting/reinstalling. Hell, it's what MS says to do; which just goes to show they don't even know how their black box works...

      --

      Shift happens. Fire it up.
    19. Re:Persistent myth? by TheHedonismBot · · Score: 3, Informative

      Maybe. I see what you are saying, but as a counter-example, I sometimes run tcpdump from within my home directory when troubleshooting problems. tcpdump has to run as superuser, and I have a lot more faith in giving myself and other admins permission to run "sudo tcpdump" than running tcpdump setuid 0. Again, maybe I'm just missing something, but I really don't have a huge problem with tcpdump (or other admin tools) writing UID 0 data to an admin user's home directory.

      You don't have to be root to use tcpdump. On ubuntu, do this:

      sudo aptitude install libcap2-bin
      sudo setcap cap_net_raw,cap_net_admin=eip `which tcpdump`

      If you run: getcap `which tcpdump` and it shows: /usr/sbin/tcpdump = cap_net_admin,cap_net_raw+eip then you're good to go. Now try running tcpdump as a regular user.

    20. Re:Persistent myth? by roman_mir · · Score: 2

      I find it hard to believe that someone posting on Slashdot has not at least spent some time evaluating OS X, even if they ultimately decided it was not for them.

      - hmmm. I've worked with computers since about 91 and professionally since 95 and I only really touched the old apple machines a few times. So yeah, there are people here who didn't evaluate OSX (and not intending to)

    21. Re:Persistent myth? by theCoder · · Score: 2

      At work I have a Windows box that, I kid you not, can only run about 25000 processes between reboots. It doesn't seem to be able to reuse process IDs, and once it gets to about PID 100000 or so (Windows PIDs are always multiples of 4), it just can't reliably spawn new processes. Including the process to shutdown and reboot the computer (equivalent of `shutdown'). Windows seems to generate PIDs somewhat randomly, so sometimes creating a process is able to find a good PID and it works, but other times it can't find a PID so it fails.

      Now, a normal person running Word and Excel probably wouldn't notice this. But I wanted to use this computer to build software using make. Well, make creates a lot of processes, especially with lots of subdirectories and sub-makes. Suffice it to say, it doesn't even make it through 'make clean' before running out of PIDs.

      I never did figure out what caused this issue. Probably, a bad combination of kernel level drivers. One post online thought a similar problem might be related to the video card driver, but I'm inclined to think that the anti virus (Norton) is at least partly responsible. I was able to duplicate this on other similar machines, but not every machine.

      So, anyway, maybe your software spawns a lot of processes and is running out of PIDs after a day or so.
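
      The two figures in this comment are consistent with each other, which a quick calculation shows. This sketch takes the comment's own numbers at face value (PIDs handed out in steps of 4, trouble starting around PID 100000, no PID ever reused); none of these are documented limits.

```python
PID_STEP = 4            # per the comment: PIDs are always multiples of 4
PID_CEILING = 100_000   # approximate PID at which spawning became unreliable


def pids_available(ceiling: int = PID_CEILING, step: int = PID_STEP) -> int:
    """Count the distinct PIDs below ceiling if none are ever reused."""
    return ceiling // step


print(pids_available())   # roughly the "about 25000 processes" observed
```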

      --
      "Save the whales, feed the hungry, free the mallocs" -- author unknown
    22. Re:Persistent myth? by Xtifr · · Score: 2

      I believe Red Hat has done that in the past

      A company called Lasermoon once got their flavor certified. I don't believe that Red Hat ever did so.

      At the time, the biggest issue was a feature called STREAMS (all-caps), which Linus refused to include in the kernel, arguing that it was unnecessary for a system that came with source. Caldera (now SCO) acquired Lasermoon and included STREAMS in some of their versions of Linux, and was lobbying to have it included as a standard feature despite Linus's objections, but I don't believe that any flavor of STREAMS ever appeared in RH.

      According to Wikipedia, STREAMS are now an optional feature in the latest Single Unix Spec (SUS), so a system like Linux (or BSD) that lacks STREAMS can now be certified as Unix(tm), but that was not always true, so for a long time, Linux did not "realize all the concepts and behaviors of real Unix," and the only reason it can now is because the definition was deliberately changed to allow it to be included.

      As an ironic side note, the requirement for STREAMS seems to have been dropped at just about the same time (circa 2003) that Caldera morphed into a mad dog and began attacking the rest of the Linux community.

    23. Re:Persistent myth? by arth1 · · Score: 2

      Point taken. Next question: do you want a regular user to run tcpdump?

      chmod o-x /usr/sbin/tcpdump
      chgrp adm /usr/sbin/tcpdump
      chattr +i /usr/sbin/tcpdump

      Now only members of the adm group can run tcpdump, and no one can make a hardlink to it either.

      Or, you can allow individual users:
      setfacl -m u:someone:rx /usr/sbin/tcpdump

    24. Re:Persistent myth? by TheHedonismBot · · Score: 2

      Point taken. Next question: do you want a regular user to run tcpdump?

      Create user "tcpdumper" in group "tcpdumper".
      chgrp tcpdumper `which tcpdump`
      chmod 754 `which tcpdump`

      Ensure tcpdump works only for root and a member of the 'tcpdumper' group.
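
      To see what mode 754 grants to each class of user, the octal value can be decoded with Python's standard stat module. This is just an illustrative sketch; the helper names are invented, and it inspects the number itself rather than any file on disk.

```python
import stat


def describe_mode(mode: int) -> str:
    """Render the permission bits of a regular file the way `ls -l` would."""
    return stat.filemode(stat.S_IFREG | mode)


def group_can_execute(mode: int) -> bool:
    """True if the group execute bit is set."""
    return bool(mode & stat.S_IXGRP)


def others_can_execute(mode: int) -> bool:
    """True if the execute bit for everyone else is set."""
    return bool(mode & stat.S_IXOTH)


print(describe_mode(0o754))
print("group may execute:", group_can_execute(0o754))
print("others may execute:", others_can_execute(0o754))
```

      With the group set to tcpdumper, that leaves execute permission with the owner and the group only, while everyone else can read the binary but not run it.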

    25. Re:Persistent myth? by BlueBlade · · Score: 3, Insightful

      - iptables rulesets that allow all outbound from all systems. Allow ICMP everywhere, etc.

      As a network admin, I have violent fantasies of driving hot nails through the privates of the "Let's block all ICMP by default" admins whenever I come up at a new client's site to troubleshoot some complex networking issues. If you block ICMP echo, you better have an extremely good reason for it. If it's from a public WAN link facing the internet, then *maybe* you might have a case (but most often not). If it's on a web server or other public-facing services, you PROBABLY DON'T HAVE A VALID REASON. If you block traceroutes from anywhere except edge firewalls, you are a clueless idiot. And even then, requests coming from inside interfaces should be let through. THIS IS ESPECIALLY TRUE OVER MPLS AND Site-to-Site VPN LINKS!

      Whew, that felt good. Seriously, blocking icmp doesn't do *anything* for security. If you are getting flooded by icmp packets, just configure a flood threshold. These days, any icmp DoS flood that is bad enough to actually interrupt services very likely doesn't need the extra "reply" traffic to work. And your clever "security" of not replying to pings on anything that has ports open is stupid, as a simple port scan will reveal the host.

      Please, for the sake of every network admin's sanity, leave ICMP alone. Thank you.

      --
      Religion is the best example of mass psychosis
  3. Uh.. no by Anrego · · Score: 5, Informative

    I for one believe in frequent-ish reboots.

    I agree it shouldn't be relied upon as a troubleshooting step (you need to know what broke, why, and why it won't happen again). That said, if you go years without rebooting a machine... there is a good chance that if you ever do (to replace hardware for instance) it won't come back up without issue. Verifying that the system still boots correctly is imo a good idea.

    Also, all that fancy high availability failover stuff... it's good to verify that it's still working as well.

    The "my server's been up 3 years" e-pene days are gone folks.

    1. Re:Uh.. no by Anonymous Coward · · Score: 2

      Disagree.

      Rebooting is bad. It booted the first time, Why would it not boot the second?

      If you don't have proper controls then you should not have anyone touching the box.

    2. Re:Uh.. no by JonySuede · · Score: 2

      . That said, if you go years without rebooting a machine... there is a good chance that if you ever do (to replace hardware for instance) it won't come back up without issue.

      we reboot our unix server once a month exactly for this reason, we have been bitten once so we learned this the hard way.

      --
      Jehovah be praised, Oracle was not selected
    3. Re:Uh.. no by DaMattster · · Score: 2

      I for one believe in frequent-ish reboots.

      I agree it shouldn't be relied upon as a troubleshooting step (you need to know what broke, why, and why it won't happen again). That said, if you go years without rebooting a machine... there is a good chance that if you ever do (to replace hardware for instance) it won't come back up without issue. Verifying that the system still boots correctly is imo a good idea.

      Also, all that fancy high availability failover stuff... it's good to verify that it's still working as well.

      The "my server's been up 3 years" e-pene days are gone folks.

      Well, you make a point but, shouldn't a server be replaced when it gets old enough anyway? Wouldn't it be nice to have a server up for 3 years of reliability? At this point, who really cares if a reboot would cause a failure? You have backups, plan to replace the aging hardware. It doesn't pay to be miserly with server hardware, especially because its quality has gone on a downward trend as demand for cheaper pricing goes up. And how does verifying a system boot really ensure the server is working correctly? Too often, I have seen a server boot without problem but other latent problems arise - i.e. failing network cards and failing cooling fans.

    4. Re:Uh.. no by Gaygirlie · · Score: 2

      I do actually recommend that you RTFA. He quite clearly says you shouldn't need to reboot the whole system unless you're patching the kernel itself; more-or-less everything else can be just restarted or reloaded, including kernel modules, and he even backs up his argument against rash reboots with some valid logic. (Though it's something any system administrator worth anything should already know without a random person on teh internets telling him! Really, shame on you if you just reboot every time you see a problem.) He doesn't say to never reboot, either, even though the submission does make it sound like it.

    5. Re:Uh.. no by Anrego · · Score: 4, Insightful

      Maybe true if the box is set up then never touched. If anything new has been installed on it.. or updated.. I think it's a good idea to verify that it still boots while the change is still fresh in your head. Yes you have changelogs (or should), but all the time spent reading various documentation and experimenting on your proto box (if you have one) is long gone. There's lots of stuff you can install and start using, but could easily not come up properly on boot.

      And why are reboots bad? If downtime is that big a deal, you should have a redundant setup. If you have a redundant setup, rebooting should be no issue. I've seen a very common trend where people get some "out of the box" redundancy solution running... then check off "redundancy" on the "list of shit we need" and forget about it. Actually verifying from time to time that your system can handle the loss of a box without issue is important (in my view).

    6. Re:Uh.. no by jcoy42 · · Score: 2

      Well, that's your opinion.

      The boot up process starts a lot of extra electrical noise in the box by spinning up all the fans, HDs, probing things, etc. That's usually when something breaks. What I have seen is that boxes which get rebooted frequently tend to burn out faster. I have had 2 otherwise equivalent machines, purchased at the same time, one used for dev and one for production, and the dev machine burned out 2 years before we retired the production machine (burned out means too many fan/disk/CPU failures to bother with). The biggest difference? The dev machine was updated and rebooted far more frequently. The production machine we took care to only muck with when we had to, and when possible, we fixed it without a reboot.

      Now it could be that the frequent updates on the dev machine is what caused it to burn out faster (more random use), and sure, it could have been a fluke, but look at it this way- when does a light bulb burn out? When you turn it on or when it's left on?

      --
      Never trust an atom. They make up everything.
    7. Re:Uh.. no by GreyLurk · · Score: 2

      Why reboot? Why not just kill off the process, clear the temp files, and restart the process?

    8. Re:Uh.. no by OzPeter · · Score: 3, Insightful

      (wishing that /. would allow edits)

      To add to my previous comment. The general consensus of disaster recovery best practice is that you do not test a backup strategy, you test a restore strategy. Rebooting a server is testing a system restore process.

      --
      I am Slashdot. Are you Slashdot as well?
    9. Re:Uh.. no by Kjella · · Score: 2

      Well, you make a point but, shouldn't a server be replaced when it gets old enough anyway? Wouldn't it be nice to have a server up for 3 years of reliability? At this point, who really cares if a reboot would cause a failure? You have backups, plan to replace the aging hardware.

      You care because it's 2:30 in the morning, your manager is yelling at you because the all important end-of-quarter stuff is due in the morning, the server is full of one day's production data that isn't backed up yet and even though you have money in the budget you don't have a hot server with the exact same software/patch level/configuration ready to dump your backups into?

      Very few systems are so critical they can't have some planned downtime. Unplanned downtime on the other hand can be extremely costly, and the only thing that matters is fixing it ASAP, save no expense. You can afford new hardware, what you can't afford is the time to install/setup that hardware.

      --
      Live today, because you never know what tomorrow brings
  4. slashdot: *world link farmers by Anonymous Coward · · Score: 5, Insightful

    i'm really tired of this semi-technical stuff on slashdot that seems aimed at semi-competent manager-types.

  5. Counter point -- pre-emptive reboot by Syncerus · · Score: 5, Insightful

    One minor point of disagreement. I'm a fan of the pre-emptive reboot at specific intervals, whether the interval be 30 days, 60 days, or 90 days is up to you. In the past, I've found the pre-emptive reboot will trigger hidden system problems, but at a time when you're actually ready for them, rather than at a time when they happen spontaneously ( 2:30 in the morning ).

    --
    "Man is nothing without the works of man" -- Helvetius
    1. Re:Counter point -- pre-emptive reboot by Wovel · · Score: 2

      Interestingly, all his arguments against rebooting would bolster your argument for periodic planned reboots. One of his points was that someone may have screwed up the system; it would be better to find that in a controlled environment.

      I will stay away from periodic reboots and remain firmly entrenched in the land of if it ain't broke, don't fix it.

  6. Of course you reboot, in controlled settings by pipatron · · Score: 4, Insightful

    FTFA:

    Some argued that other risks arise if you don't reboot, such as the possibility certain critical services aren't set to start at boot, which can cause problems. This is true, but it shouldn't be an issue if you're a good admin. Forgetting to set service startup parameters is a rookie mistake.

    This is retarded. A good admin will test so that everything works, before it will get a chance to actually break. Anyone can fuck up, forget something, whatever. Doesn't matter how experienced you are. Murphy's law. The only way to test if it will come up correctly during a non-planned downtime is to actually reboot while you have everything fresh in memory and while you're still around and can fix it. Rebooting in that case is not a bad thing, it's a responsible thing to do.
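
    Short of the reboot itself, a cheap first pass is to compare what is running now against what is configured to start at boot. A minimal Python sketch, assuming a systemd host (the thread does not name an init system) and assuming the two text inputs were captured from `systemctl list-units --type=service --state=running --no-legend` and `systemctl list-unit-files --type=service --no-legend`; the function name is made up for this example:

```python
def running_but_disabled(running_units_text, unit_files_text):
    """Return sorted names of services that are running now but disabled at boot."""
    # The first column of `systemctl list-units` output is the unit name.
    running = set()
    for line in running_units_text.splitlines():
        fields = line.split()
        if fields:
            running.add(fields[0])

    # `systemctl list-unit-files` prints: UNIT-FILE STATE [PRESET]
    disabled = set()
    for line in unit_files_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "disabled":
            disabled.add(fields[0])

    # Anything in both sets was started by hand and will not return after a boot.
    return sorted(running & disabled)
```

    This only catches the forgotten-enable case. Ordering and dependency problems still only show up during a real boot, which is the point of the planned reboot.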

    --
    c++; /* this makes c bigger but returns the old value */
  7. What a load of BS by kju · · Score: 4, Insightful

    I RTFA (shame on me) and it is in my opinion absolutely stupid.

    There is actually only one real reason given and that is that if you reboot after some services ceased working, you might end up with an unbootable machine.

    In my opinion this outcome is absolutely great. Ok, maybe not great, but it is important and rightful. It forces you to fix the problem properly instead of ignoring the known problems and missing yet unknown problems which might bite you in the .... shortly after.

    Also: When services start being flaky on my system, I usually want to run an fsck. In 16 years of Linux/Unix administration I have quite a few times found that the FS was corrupted without an apparent reason and without anyone having noticed. So an fsck is usually a good thing to run when strange things happen, and to be able to run it, I nearly always need to reboot.

    I can't grasp what kind of thinking it must be to continue running a server where some services fail or behave strangely. You could end up with more damage than is caused by an outage when the reboot does not go through. You just might want to do the reboot at off-peak hours.

    1. Re:What a load of BS by ColdWetDog · · Score: 3, Funny

      You just might want to do the reboot at off-peak hours.

      As someone who tends to work during 'off-peak' hours, I have a special room in Hell just expressly reserved for admins like you (and my admins who apparently are your soul mates). Just thought I'd mention this. You've been warned.

      --
      Faster! Faster! Faster would be better!
  8. *NIX 101 by Zero1za · · Score: 2

    This is like *NIX 101.

    But then, try changing the locale on a running system...

  9. Ummm, that's a crap article by Sycraft-fu · · Score: 4, Insightful

    More or less it is "You shouldn't reboot UNIX servers because UNIX admins are tough guys, and we'd rather spend days looking for a solution than ruin our precious uptime!"

    That is NOT a reason not to reboot a UNIX server. In fact it sounds like if you've a properly designed environment with redundant servers for things, a reboot might be just the thing. Who cares about uptime? You don't win awards for having big uptime numbers, it is all about your systems working well and providing what they need and not blowing up in a crisis.

    Now, there well may be technical reasons why a reboot is a bad idea, but this article doesn't present any. If you want to claim "You shouldn't reboot," then you need to present technical reasons why not. Just having more uptime or being somehow "better" than Windows admins is not a reason, it is silly posturing.

    1. Re:Ummm, that's a crap article by pz · · Score: 2

      Please point out exactly where in the article the issue of uptime is raised. I fail to see it. Many others have also suggested that long uptimes ("e-pene" as one poster put it) are the reason for avoiding reboots. There has been no such suggestion that I could find. I authored a post to the previous thread about the origins of the Unix attitude against reboots that was highly rated and nowhere in that post, or in the follow-on replies, was uptime ever considered an issue.

      The issue -- the only issue -- is interrupting service to many users. Modern machines that serve tens to thousands of users cannot be brought down willy-nilly without incurring the wrath of those users, and rightfully so. Bringing down a system because the sysadmin was too lazy to understand what the problem was is inexcusable. The sysadmin's job is to keep the service running. When there's one user, such as in QA, or a single-user desktop, reboots can happen at will. When there are many many users, such as in a production box, file server, or similar, reboots should never be used as a problem-solving tool.

      So let go of the old, dead horse about uptime bragging rights. A correct, properly maintained Unix system does not need to be rebooted except under highly unusual circumstances. The reason that Windows boxes are treated differently is because Windows is a comparatively new OS that started out life as a one-seat system whereas, paraphrasing what I wrote in an earlier post, Unix and its intellectual antecedents had been running multi-seat systems for nigh on three decades before Windows started doing that. It's fact, not being better or worse, and the Unix and Windows cultures have grown around those two views.

      --

      Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
    2. Re:Ummm, that's a crap article by Jose · · Score: 2

      Now, there well may be technical reasons why a reboot is a bad idea, but this article doesn't present any.

      hrm, the article states: ...If you shrug and reboot the box after looking around for a few minutes, you may have missed the fact that a junior admin inadvertently deleted /boot and some portions of /etc and /usr/lib64 due to a runaway script they were writing. That's what was causing the segfaults and the wonky behavior. But since you rebooted the server without digging into the problem, you've made it much worse, and you'll soon boot a rescue image -- with all kinds of ponderous work awaiting you -- while a production server is down.

      and:
      In many cases, it's extremely important not to reboot, because the key to fixing the problem is present on the system before the reboot, but will not be immediately available after. The problem will recur, and if the only known solution is to reboot, then the problem will never be fixed unless or until someone decides not to reboot and instead tries to find the root of the problem.

      and while I disagree with this one slightly, as the problem may still be present after a reboot, I definitely agree with what the author is saying: find the actual root of the problem, and fix it. Don't just cross your fingers and hope a reboot will fix the problem.

      Also the author never mentions preserving uptime of the server as a goal..he does mention a few times patching in place..which will mean killing services, effectively making that particular server unavailable.

      --
      The basic sleazeware produced in a drunken fury by a bunch of UCBerkeley grad students was still the core of BIND. --PV
  10. HP-UX says... by RedK · · Score: 2

    You lie.

    Seriously. I don't know what HP is doing, but NFS hangs/stuck processes that you can't kill -9 your way out of is just wrong.

    --
    "Not to mention all the idiots who use words like boxen."
    Anonymous Coward on Monday August 04, @06:49PM
    1. Re:HP-UX says... by inflex · · Score: 2

      NFS is designed to be like that, block/hang until connection is restored... though I'm not sure about the resilience to sig-9. You do now have the option on some NFS systems to have a soft-block.

    2. Re:HP-UX says... by RedK · · Score: 2

      I've had systems with HP-UX that could rpcinfo/showmount on the NFS server and yet still had hung filesystems. Soft, hard, whatever mount option, it's random. Then when you try to shut down the NFS subsystem, the rpc processes get stuck, you try to kill -9 and they simply don't die. umount -f doesn't work. Nothing works.

      You really have to have experience on HP-UX to understand the pain... And if only I was talking about the old 11iv1 instead of the brand spanking new 11iv3 with ONCplus up to date.

      --
      "Not to mention all the idiots who use words like boxen."
      Anonymous Coward on Monday August 04, @06:49PM
    3. Re:HP-UX says... by sribe · · Score: 4, Informative

      Seriously. I don't know what HP is doing, but NFS hangs/stuck processes that you can't kill -9 your way out of is just wrong.

      Kind of a well-known, if very old, problem. From Use of NFS Considered Harmful:

      k. Unkillable Processes

      When an NFS server is unavailable, the client will typically not return an error to the process attempting to use it. Rather the client will retry the operation. At some point, it will eventually give up and return an error to the process.
      In Unix there are two kinds of devices, slow and fast. The semantics of I/O operations vary depending on the type of device. For example, a read on a fast device will always fill a buffer, whereas a read on a slow device will return any data ready, even if the buffer is not filled. Disks (even floppy disks or CD-ROM's) are considered fast devices.

      The Unix kernel typically does not allow fast I/O operations to be interrupted. The idea is to avoid the overhead of putting a process into a suspended state until data is available, because the data is always either available or not. For disk reads, this is not a problem, because a delay of even hundreds of milliseconds waiting for I/O to be interrupted is not often harmful to system operation.

      NFS mounts, since they are intended to mimic disks, are also considered fast devices. However, in the event of a server failure, an NFS disk can take minutes to eventually return success or failure to the application. A program using data on an NFS mount, however, can remain in an uninterruptible state until a final timeout occurs.

      Workaround: Don't panic when a process will not terminate from repeated kill -9 commands. If ps reports the process is in state D, there is a good chance that it is waiting on an NFS mount. Wait 10 minutes, and if the process has still not terminated, then panic.
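
      The state check in that workaround is easy to script. A minimal Python sketch, assuming a procps-style `ps` and input text captured from `ps -eo pid=,stat=,comm=` (the `=` suffixes suppress the header row); the function name is hypothetical:

```python
def uninterruptible_processes(ps_text):
    """Return (pid, command) pairs for processes whose state starts with D."""
    stuck = []
    for line in ps_text.splitlines():
        # Split into at most three fields so command names containing
        # spaces stay in one piece.
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, stat, comm = fields
        # State D is uninterruptible sleep; modifiers such as + may follow it.
        if stat.startswith("D"):
            stuck.append((int(pid), comm.strip()))
    return stuck
```

      A process that shows up here on two samples taken minutes apart is a candidate for the "waiting on an NFS mount" case; a single sample proves little, since ordinary disk I/O passes through state D briefly too.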

    4. Re:HP-UX says... by Svartalf · · Score: 2

      That's why I'm all for coming up with something OTHER than NFS for server framework. Seriously. And using it in a HP/HA cluster is...verging on insane... It's an old crufty design that was designed for use in a simpler time with simpler conditions, and it wasn't all that great then.

      --
      I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
  11. Virtualization to the rescue by Anonymous+Showered · · Score: 4, Interesting

    I run web servers for a few dozen clients, and rebooting a remote machine was always scary. There was the possibility that something might not boot up during startup (e.g. SSHd) and I would be locked out. I would then have to travel to my data center downtown (about 30 minutes away) and troubleshoot the problem. Since I don't have 24/7 access to the DC (I don't have enough business with the DC to warrant an owned security pass...) I have to wait until they open to the general clientèle in the morning.

    With ESXi, however, I'm not that scared anymore. If something does go wrong, I have a console to the VM through vCenter client (the application that manages virtual machines on the server). It's happened once where a significant upgrade of FreeBSD 7.2 to 8.1 was problematic. Coincidentally, it was because I didn't upgrade the VMware tools (open-vmware-tools port). Nonetheless, I managed to fix the problem through vCenter.

    This is why I love virtualization in general. It's making managing servers easier for me.

  12. I read TFA by pak9rabid · · Score: 2, Interesting

    What a load of horse shit.

  13. Library upgrades are the tricky part by Anonymous Coward · · Score: 2, Informative

    Often system upgrades (e.g. security fixes) include new versions of libraries and such. It's impossible for the package manager to know which processes are using those libraries so it can't automatically restart everything. Consider if you have custom processes running, the package manager wouldn't even know about them.

    Therefore you have to do it manually, but then you have the same problem. It's damn hard to know which processes are using the libraries that were upgraded. Really, really hard if it's a big server running hundreds or thousands of processes. Often it's easier just to reboot so you make sure everything is running the current version of all the libraries. If you don't then you can't be sure that all the security fixes are actually running on the system since it will be using the old cached versions of the libraries in RAM.
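
    On Linux, one rough way to narrow this down by hand: a process that still maps a library file which has since been replaced on disk shows that mapping with a ` (deleted)` suffix in `/proc/<pid>/maps`. A minimal Python sketch of the check, assuming Linux and taking the contents of one maps file as text; the function name is invented for this example and matching on `.so` is only a heuristic:

```python
def stale_libraries(maps_text):
    """Return the set of deleted shared-library paths still mapped by a process."""
    suffix = " (deleted)"
    stale = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        if path.endswith(suffix) and ".so" in path:
            stale.add(path[: -len(suffix)])
    return stale
```

    Run over every readable `/proc/<pid>/maps`, this gives a list of processes still holding old library code. Whether restarting them one by one beats a reboot on a box with thousands of processes is the judgment call described above.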

  14. Not better than the others by cpct0 · · Score: 2, Interesting

    Quotes from stupid people:
    You should never reboot a Mac, it's not like Windows.
    You should never reboot Unix/Linux, it's not like Windows.

    Well, you shouldn't reboot Windows either. You reboot it when it goes sour. Our Windows servers seldom go sour, so we don't reboot them. Same for Mac or *nix.

    Problem is when it starts to cause problems. Like our /var/spool partition deciding it has better things to do than exist... or the ever so important NFS or iSCSI mount that decides to Go West, and gives us the ??? ls we all dread ... with umounting impossible, so remounting impossible, and all these stale files and stuff. You either tweak these things for hours cleaning up all processes, or you reboot.

    In fact, being a good sysadmin, all my servers are MEANT to be rebooted if something goes sour. One SVN project goes sour? check if it's not the repository itself that got problems, or if the system needs to save something to safely exist ... and if not, reboot the server. Everything magically restarts itself, does its little sanity check, and a quick look at a remote syslog to make certain everything is all right. 2 minutes lost for everyone, not 3 hours of trying to clean up mess left by some stray process somewhere or trying to kill the rogue 100 compression and rsync jobs that got started eating up all RAM, CPU and network.

    Since all our servers are single processes and are either VMs or single machines, it's a breeze to do this. iSCSI will diligently wait until the machine is back up before trying to reconnect. NFS will keep its locked files up, and will reconnect to them. No, seriously, everything simply reconnects!

    Of course, the idea is to minimize these occurrences, so we learn from it, and we try to repair what could've caused this problem in the first place. And there's a place to do this in a server crash postmortem. But no need to make users wait while we try to figure out wth.

  15. Oh, this fool again by Enry · · Score: 2

    While it's true servers don't need to be restarted as often as Windows counterparts, there are valid reasons for restarting a server:

    - new kernel, new features
    - new kernel, new security patches (yes, these are distinct reasons)
    - ensure all services restart in the event of a real failure
    - we have cases where memory fills and the system starts thrashing. It may cure itself eventually, but you can't get in via SSH or console (and no, the OOM killer doesn't kick in).

    I think item #3 is important. If you have a crusty system that's been in place for a while and it reboots for some reason, you now have to spend time to make sure everything started, figure out what didn't start, and why. This doesn't mean you need to restart once a week, but every 6-12 months is certainly reasonable.

    1. Re:Oh, this fool again by sjames · · Score: 2

      Did you RTFA? He did NOT say NEVER boot, just that it is not a valid troubleshooting step (it MAY be part of a valid solution to a diagnosed problem). He explicitly named your first 2 points as good reasons to reboot. The 3rd is a bit of a rookie mistake, but as long as sshd and basic networking starts the rest can be resolved in the unlikely case that a reboot does happen. Arguably, that's a test procedure and not a troubleshooting solution.

      The thrashing case is one he didn't mention. It is sometimes the only way to get control of a machine. I would point out though that it isn't a troubleshooting step, it's an unfortunately necessary solution to the immediate problem which is preventing you from properly investigating the real problem (what made it thrash).

      The real point though is that the Windows admin seems to think reboot is the first thing to try and if it comes back up it will be left at that (until tomorrow when the same problem comes up again. Lather, rinse, repeat.) while the Unix admin considers it to be a last resort and then preferably done based on a completed diagnosis as a considered resolution to the problem.

  16. This is a myth? by pclminion · · Score: 4, Interesting

    I've heard a lot of myths. I've never heard a myth stating "You need to reboot a UNIX system to fix problems." If anything I've heard the opposite myth. Who promulgates this shit?

    I do remember ONE time a UNIX system needed a reboot. We (developer team) were managing our own cluster of build machines. The head System God was out of town for two weeks. We were having problems with a build host, and tried everything. Day after day. Finally, on the last day before System God was due to return, it occurred to me that the one thing we hadn't tried was to reboot the machine. The reboot fixed the problem, whatever it was.

    I felt stupid. One, for not figuring out the problem in a way that could avoid a reboot. Two, for not recording enough information to determine root cause in a post-mortem analysis. Three, for configuring a system in such a way that a reboot might be required in order to fix a problem.

    To this day I believe that reboot was unnecessary, although at the time it was the fastest way to resolving the immediate blocking issue.

  17. Sometimes ... by PPH · · Score: 4, Funny

    ... the crap I read on Slashdot is so unbelievable, I have to reboot my laptop in the hopes that it will go away.

    --
    Have gnu, will travel.
  18. Not just *nix by Spad · · Score: 2

    The same argument can be applied to Windows servers; sometimes rebooting will only make things worse, or at least not make things any better. Unfortunately, these days the trusty reboot is often the first option instead of last resort; at the very least some basic troubleshooting needs to be done to identify potential causes before you likely erase half the evidence.

    I suffer from a desktop variant of this issue at work, whereby re-imaging has become the "troubleshooting" tool of choice, to the point that all thought has now left the support process so that I've witnessed an engineer re-image a PC 3 times (at 30+ minutes each time) before someone else identified that the issue was being caused by a BIOS setting and that re-imaging was a complete waste of time.

    Let's face it, if your admin/support staff are lazy and/or stupid, then it doesn't matter which approach they take because they're not going to fix the problem anyway.

  19. New rule for Slashdot by aztektum · · Score: 5, Insightful

    /. editors: I propose a new rule. Submissions with links to PCWorld, InfoWorld, PCMagazine, Computerworld, CNet, or any other technology periodical you'd see in the check out line of a Walgreens be immediately deleted with prejudice.

    They're the Oprah Magazine of the tech world. They exist to sell ads by writing articles with grabby headlines and little substance.

    --
    :: aztek ::
    No sig for you!!
  20. Eh? by ledow · · Score: 2

    - Design system
    - Build system (involves inevitable reboots)
    - Test system (involves inevitable reboots)
    - Move system into production.

    Once the services you need start up the way you want, don't play with it. Put it into service and have backups of the original image, any changes you make and a working replacement (Yes, have a working replacement - there is *nothing* better than having another machine sitting next to your server that can take over its job with the flick of a switch while you repair it - it also lets you test changes safely, and whenever you're sure the system is how you want it, you push the same image to your "copy of" server).

    If you do it properly, that machine will then stay up until hardware failure. Sometimes that *can* be years away. If you do it properly, you shouldn't ever, ever, ever be rebooting a server that's in production - you're just masking the real problem. Yeah, it'll work most of the time but it's just a way of papering over the cracks. The server hung, the service died, the settings got out of sync, or whatever, for a reason. Just rebooting is ignoring that reason for the sake of service continuance - if the service is that vital, you should have high enough availability to cover such incidents or that same problem will come back to bite you later.

    Nobody cares about enormous uptimes, but having a server that you haven't NEEDED to touch in months is a good thing. It means that it has a well-defined function and has been performing correctly - that's your "stable" version and should be treated as such. Every time you make a change to a server, it then becomes a "current/experimental" version that you should be wary of.

    At worst, when a problem appears, you turn ON a replacement server and fix the one that is showing problems. If its role is well-specified, you don't get "feature creep" where it's running a million things that it never used to and they're not in your startup properly because it's never rebooted enough for you to test them.

    On Windows, or Unix, you shouldn't have to reboot. If you do, it's to test something or correctly reinitialise after fixing a problem (a post-solution reboot just to make sure it works as required isn't a bad thing but certainly not "required"). The worry of hardware failure on boot shouldn't stop you rebooting, and similarly you shouldn't reboot just to "spot" problems. Both suggest inattention and lack of suitable backups/replacements/high availability solutions.

    Systems can easily go 3-4 years in operation without requiring a reboot. If your hardware is good quality, you're monitoring the server as you should be, you have adequate backups/replacements and the role it performs isn't changed, there's no need to ever reboot it past initial testing. I have internal school servers that only get rebooted in the summer (i.e. once per annum) and that's only because the power goes off to upgrade the electrics each year.

    If it wasn't for that, I'd just leave them running. They don't need kernel 2.6.192830921830 and they have been doing that same job reliably for a LONG time. I'm not going to kick them into a reboot "just because". Similarly even the tiniest memory leak in their processes would cause me problems that I would spot immediately.

    As it is, 450 happy users all day long for years. The last one I installed actually took a whack from a collapsed networking cabinet coming off the wall (full of fully-populated Gigabit switches) and dropping six feet onto it. Apart from a small dent it carried on just fine, and the disks were idle, and SMART / data integrity show no problems. I rebuilt the entire network cabling around it because switching it off wasn't necessary. If it did reboot and it didn't come up in the expected state? There's a copy of it on another machine on the other side of the room - its predecessor that also didn't reboot for years but wasn't fast enough to run the amount of PHP / MySQL we needed it to among its other functions. Having the replacement machine

  21. An Appropriate Hacker Koan by idontgno · · Score: 4, Funny

    courtesy of Appendix A of the Jargon File.

    Tom Knight and the Lisp Machine

    A novice was trying to fix a broken Lisp machine by turning the power off and on.

    Knight, seeing what the student was doing, spoke sternly: "You cannot fix a machine by just power-cycling it with no understanding of what is going wrong."

    Knight turned the machine off and on.

    The machine worked.

    --
    Welcome to the Panopticon. Used to be a prison, now it's your home.
  22. So what. 3650 days of uptime, who cares? by Anonymous Coward · · Score: 3, Interesting

    It makes a nice figure. Ten years. HP-UX running a few more or less referential databases. 3650 days. Was it patched properly? Did anyone *really* look after it? The only thing that can be said, is that it apparently was quite a stable machine room in terms of 10 full years of electrical & other provisions, more or less intact.

    Then it was shut down for good.

    I'd rather see regular maintenance breaks and maintenance windows (pun not entirely intended), than collect numbers in the uptime command's output. But the story is true, after I left that company not a single soul ever rebooted it. Ten years later they sent me an email, with an attachment of a putty session. Ten years, :)

  23. You SHOULD reboot UNIX boxes too by Max+Romantschuk · · Score: 2

    After making configuration changes it makes _a lot_ of sense to reboot if possible. That way you can determine that your changes indeed load properly after a reboot. You don't want that kind of a surprise when you have long since forgotten all the little tweaks in place.

    --
    .: Max Romantschuk :: http://max.romantschuk.fi/
  24. Re:Another Linux admin with a superiority complex. by The+Moof · · Score: 3, Informative

    Why should I bother disabling it?

    Generally, good administrators tend to disable services that aren't wanted or needed in their systems. Who's to say that there's not going to be a vulnerability for the service discovered down the road (*coughSolariscough*) that would make you vulnerable?

  25. Re:Another Linux admin with a superiority complex. by djp928 · · Score: 2

    Windoze admins...

    The very first word in your "+5 Informative" diatribe is a derogatory term blanketing all administrators of Windows systems. Anything else you have to say should now be taken as extremely biased, if not plain ignorant. I've been an administrator of Unix systems for over 20 years, and an administrator of Linux and Windows servers since their early days. Being a Windows admin does not mean that one is uninformed or technically inept, any more than being a *nix admin makes one smarter.

    Stereotypes exist for a reason. If it wasn't true for at least a statistically relevant number of samples, then the stereotype would not exist.

    Yeah, those damn lazy black people, always raping our precious white women.