Why You Shouldn't Reboot Unix Servers
GMGruman writes "It's a persistent myth: reboot your Unix box when something goes wrong or to clean it out. Paul Venezia explains why you should almost never reboot a Unix server, unlike say Windows."
Because you won't be able to brag about your uptime numbers.
This is not a myth I had heard before. In fact, none of the *nix sysadmins I know would dream of rebooting the box to clear a problem except as a last resort. Where has this come from?
Burns: We're building a casino!
McAllister: Arrr. Give me 5 minutes.
I for one believe in frequent-ish reboots.
I agree it shouldn't be relied upon as a troubleshooting step (you need to know what broke, why, and why it won't happen again). That said, if you go years without rebooting a machine... there is a good chance that if you ever do (to replace hardware for instance) it won't come back up without issue. Verifying that the system still boots correctly is imo a good idea.
Also, all that fancy high availability failover stuff... it's good to verify that it's still working as well.
The "my server's been up 3 years" e-pene days are gone, folks.
i'm really tired of this semi-technical stuff on slashdot that seems aimed at semi-competent manager-types.
One minor point of disagreement. I'm a fan of the pre-emptive reboot at specific intervals, whether the interval be 30 days, 60 days, or 90 days is up to you. In the past, I've found the pre-emptive reboot will trigger hidden system problems, but at a time when you're actually ready for them, rather than at a time when they happen spontaneously ( 2:30 in the morning ).
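For what it's worth, the interval check itself is easy to script. Below is a minimal sketch, assuming a Linux box that exposes /proc/uptime; the function names and the 90-day default are illustrative picks, not anything from the article.

```python
import os


def uptime_days(uptime_text):
    """Parse /proc/uptime contents ("<seconds up> <seconds idle>") into days."""
    seconds_up = float(uptime_text.split()[0])
    return seconds_up / 86400.0


def reboot_due(uptime_text, max_days=90):
    """True once the machine has been up for at least max_days (assumed policy)."""
    return uptime_days(uptime_text) >= max_days


# Only runs on systems that actually provide /proc/uptime (Linux).
if __name__ == "__main__" and os.path.exists("/proc/uptime"):
    with open("/proc/uptime") as handle:
        text = handle.read()
    if reboot_due(text, max_days=90):
        print("Up %.1f days: schedule a reboot window" % uptime_days(text))
    else:
        print("Up %.1f days: nothing to do" % uptime_days(text))
```

Run it from cron and have it nag you, so the reboot happens in a window you picked rather than at 2:30 in the morning.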
"Man is nothing without the works of man" -- Helvetius
FTFA:
Some argued that other risks arise if you don't reboot, such as the possibility certain critical services aren't set to start at boot, which can cause problems. This is true, but it shouldn't be an issue if you're a good admin. Forgetting to set service startup parameters is a rookie mistake.
This is retarded. A good admin will test that everything works before it gets a chance to actually break. Anyone can fuck up, forget something, whatever. Doesn't matter how experienced you are. Murphy's law. The only way to test whether it will come up correctly after unplanned downtime is to actually reboot while you have everything fresh in memory and while you're still around and can fix it. Rebooting in that case is not a bad thing, it's a responsible thing to do.
c++;
I RTFA (shame on me) and it is in my opinion absolutely stupid.
There is actually only one real reason given, and that is that if you reboot after some services have stopped working, you might end up with an unbootable machine.
In my opinion this outcome is absolutely great. OK, maybe not great, but it is important and right. It forces you to fix the problem properly instead of ignoring the known problems and missing as-yet-unknown problems which might bite you in the .... shortly after.
Also: when services start being flaky on my system, I usually want to run an fsck. In 16 years of Linux/Unix administration I have quite often found that the FS was corrupted for no apparent reason and without anyone noticing beforehand. So an fsck is usually a good thing to run when strange things happen, and to be able to run it I nearly always need to reboot.
I can't grasp what kind of thinking leads someone to keep running a server where some services fail or behave strangely. You could end up with more damage than would be caused by an outage if the reboot does not go through. You just might want to do the reboot at off-peak hours.
This is like *NIX 101.
But then, try changing the locale on a running system...
More or less it is "You shouldn't reboot UNIX servers because UNIX admins are tough guys, and we'd rather spend days looking for a solution than ruin our precious uptime!"
That is NOT a reason not to reboot a UNIX server. In fact it sounds like if you've a properly designed environment with redundant servers for things, a reboot might be just the thing. Who cares about uptime? You don't win awards for having big uptime numbers, it is all about your systems working well and providing what they need and not blowing up in a crisis.
Now, there may well be technical reasons why a reboot is a bad idea, but this article doesn't present any. If you want to claim "You shouldn't reboot," then you need to present technical reasons why not. Just having more uptime or being somehow "better" than Windows admins is not a reason, it is silly posturing.
You lie.
Seriously. I don't know what HP is doing, but NFS hangs and stuck processes that you can't kill -9 your way out of are just wrong.
"Not to mention all the idiots who use words like boxen."
Anonymous Coward on Monday August 04, @06:49PM
I run web servers for a few dozen clients, and rebooting a remote machine was always scary. There was the possibility that something might not boot up during startup (e.g. SSHd) and I would be locked out. I would then have to travel to my data center downtown (about 30 minutes away) and troubleshoot the problem. Since I don't have 24/7 access to the DC (I don't have enough business with the DC to warrant an owned security pass...) I have to wait until they open to the general clientèle in the morning.
With ESXi, however, I'm not that scared anymore. If something does go wrong, I have a console to the VM through vCenter client (the application that manages virtual machines on the server). It's happened once where a significant upgrade of FreeBSD 7.2 to 8.1 was problematic. Coincidentally, it was because I didn't upgrade the VMware tools (open-vmware-tools port). Nonetheless, I managed to fix the problem through vCenter.
This is why I love virtualization in general. It's making managing servers easier for me.
What a load of horse shit.
Often system upgrades (eg. security fixes) include new versions of libraries and such. It's impossible for the package manager to know which processes are using those libraries so it can't automatically restart everything. Consider if you have custom processes running, the package manager wouldn't even know about them.
Therefore you have to do it manually, but then you have the same problem. It's damn hard to know which processes are using the libraries that were upgraded. Really, really hard if it's a big server running hundreds or thousands of processes. Often it's easier just to reboot so you make sure everything is running the current version of all the libraries. If you don't then you can't be sure that all the security fixes are actually running on the system since it will be using the old cached versions of the libraries in RAM.
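On Linux there is at least a partial shortcut: when a library file is replaced on disk, processes that still map the old copy usually show it as "(deleted)" in /proc/<pid>/maps. A rough sketch of scanning for that follows, assuming Linux and enough privileges to read other processes' maps; the function names are made up for illustration, and paths containing spaces are not handled.

```python
import os
import re

# Shared-object path that the kernel has marked as deleted in a maps line,
# e.g. "... /usr/lib/libssl.so.1.0.0 (deleted)".
_DELETED_LIB = re.compile(r"(/\S+\.so\S*) \(deleted\)$")


def deleted_libs_in_maps(maps_text):
    """Return the sorted deleted shared-library paths found in maps text."""
    found = set()
    for line in maps_text.splitlines():
        match = _DELETED_LIB.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


def processes_using_deleted_libs(proc_root="/proc"):
    """Map pid -> deleted libraries that the process still has mapped."""
    result = {}
    if not os.path.isdir(proc_root):
        return result
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "maps")) as handle:
                libs = deleted_libs_in_maps(handle.read())
        except OSError:
            # Process exited meanwhile, or we may not read its maps.
            continue
        if libs:
            result[int(entry)] = libs
    return result


if __name__ == "__main__":
    for pid, libs in sorted(processes_using_deleted_libs().items()):
        print(pid, ", ".join(libs))
```

Anything it lists is a candidate for a restart after the upgrade. It won't catch everything (statically linked binaries, for one), so on a big box a reboot may still be the simpler way to be sure.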
Quotes from stupid people:
You should never reboot a Mac, it's not like Windows.
You should never reboot Unix/Linux, it's not like Windows.
Well, you shouldn't reboot Windows either. You reboot it when it goes sour. Our Windows servers seldom go sour, so we don't reboot them. Same for Mac or *nix.
Problem is when it starts to cause problems. Like our /var/spool partition deciding it has better things to do than exist... or the ever so important NFS or iSCSI mount that decides to Go West, and gives us the ??? ls we all dread ... with umounting impossible, so remounting impossible, and all these stale files and stuff. You either tweak these things for hours cleaning up all processes, or you reboot.
In fact, being a good sysadmin, all my servers are MEANT to be rebooted if something goes sour. One SVN project goes sour? check if it's not the repository itself that got problems, or if the system needs to save something to safely exist ... and if not, reboot the server. Everything magically restarts itself, does its little sanity check, and a quick look at a remote syslog to make certain everything is all right. 2 minutes lost for everyone, not 3 hours of trying to clean up mess left by some stray process somewhere or trying to kill the rogue 100 compression and rsync jobs that got started eating up all RAM, CPU and network.
Since all our servers are single processes and are either VMs or single machines, it's a breeze to do this. iSCSI will diligently wait before the machine is back up before trying to reconnect. NFS will keep its locked files up, and will reconnect to them. No, seriously, everything simply reconnects!
Of course, the idea is to minimize these occurrences, so we learn from it, and we try to repair what could've caused this problem in the first place. And there's a place to do this in a server crash postmortem. But no need to make users wait while we try to figure out wth.
While it's true servers don't need to be restarted as often as Windows counterparts, there are valid reasons for restarting a server:
- new kernel, new features
- new kernel, new security patches (yes, these are distinct reasons)
- ensure all services restart in the event of a real failure
- we have cases where memory fills and the system starts thrashing. It may cure itself eventually, but you can't get in via SSH or console (and no, the OOM killer doesn't kick in).
I think item #3 is important. If you have a crusty system that's been in place for a while and it reboots for some reason, you now have to spend time to make sure everything started, figure out what didn't start, and why. This doesn't mean you need to restart once a week, but every 6-12 months is certainly reasonable.
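The "did everything come back" check after a test reboot can be automated. Here is a small sketch, assuming a `ps` that accepts the POSIX `-e -o comm=` options; the EXPECTED list is purely hypothetical, so substitute whatever your box is supposed to run.

```python
import os
import subprocess

# Hypothetical list of daemons this server is supposed to run after boot.
EXPECTED = ["sshd", "cron"]


def missing_services(expected, running):
    """Return the expected command names that are not in the running set."""
    return sorted(set(expected) - set(running))


def running_commands():
    """Return the set of command names currently running, as reported by ps."""
    output = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    # Some systems print a full path for comm, so keep only the base name.
    return {os.path.basename(line.strip())
            for line in output.splitlines() if line.strip()}


if __name__ == "__main__":
    try:
        missing = missing_services(EXPECTED, running_commands())
    except (OSError, subprocess.CalledProcessError) as error:
        print("could not list processes:", error)
    else:
        print("missing after boot:", ", ".join(missing) or "none")
```

Matching on process names is crude (a daemon can be running and still be broken), but it turns "spend time making sure everything started" into a quick first pass.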
I've heard a lot of myths. I've never heard a myth stating "You need to reboot a UNIX system to fix problems." If anything I've heard the opposite myth. Who promulgates this shit?
I do remember ONE time a UNIX system needed a reboot. We (developer team) were managing our own cluster of build machines. The head System God was out of town for two weeks. We were having problems with a build host, and tried everything. Day after day. Finally, on the last day before System God was due to return, it occurred to me that the one thing we hadn't tried was to reboot the machine. The reboot fixed the problem, whatever it was.
I felt stupid. One, for not figuring out the problem in a way that could avoid a reboot. Two, for not recording enough information to determine root cause in a post-mortem analysis. Three, for configuring a system in such a way that a reboot might be required in order to fix a problem.
To this day I believe that reboot was unnecessary, although at the time it was the fastest way to resolve the immediate blocking issue.
... the crap I read on Slashdot is so unbelievable, I have to reboot my laptop in the hopes that it will go away.
Have gnu, will travel.
The same argument can be applied to Windows servers; sometimes rebooting will only make things worse, or at least not make things any better. Unfortunately, these days the trusty reboot is often the first option instead of the last resort; at the very least some basic troubleshooting needs to be done to identify potential causes before you likely erase half the evidence.
I suffer from a desktop variant of this issue at work, whereby re-imaging has become the "troubleshooting" tool of choice, to the point that all thought has now left the support process so that I've witnessed an engineer re-image a PC 3 times (at 30+ minutes each time) before someone else identified that the issue was being caused by a BIOS setting and that re-imaging was a complete waste of time.
Let's face it, if your admin/support staff are lazy and/or stupid, then it doesn't matter which approach they take because they're not going to fix the problem anyway.
/. editors: I propose a new rule. Submissions with links to PCWorld, InfoWorld, PCMagazine, Computerworld, CNet, or any other technology periodical you'd see in the check out line of a Walgreens be immediately deleted with prejudice.
They're the Oprah Magazine of the tech world. They exist to sell ads by writing articles with grabby headlines and little substance.
No sig for you!!
- Design system
- Build system (involves inevitable reboots)
- Test system (involves inevitable reboots)
- Move system into production.
Once the services you need start up the way you want, don't play with it. Put it into service and have backups of the original image, any changes you make and a working replacement (Yes, have a working replacement - there is *nothing* better than having another machine sitting next to your server that can take over its job with the flick of a switch while you repair it - it also lets you test changes safely, and whenever you're sure the system is how you want it, you push the same image to your "copy of" server).
If you do it properly, that machine will then stay up until hardware failure. Sometimes that *can* be years away. If you do it properly, you shouldn't ever, ever, ever be rebooting a server that's in production - you're just masking the real problem. Yeah, it'll work most of the time but it's just a way of papering over the cracks. The server hung, the service died, the settings got out of sync, or whatever, for a reason. Just rebooting is ignoring that reason for sake of service continuance - if the service is that vital, you should have high enough availability to cover such incidences or that same problem will come back to bite you later.
Nobody cares about enormous uptimes, but having a server that you haven't NEEDED to touch in months is a good thing. It means that it has a well-defined function and has been performing correctly - that's your "stable" version and should be treated as such. Every time you make a change to a server, it then becomes a "current/experimental" version that you should be wary of.
At worst, when a problem appears, you turn ON a replacement server and fix the one that is showing problems. If its role is well-specified, you don't get "feature creep" where it's running a million things that it never used to and they're not in your startup properly because it's never rebooted enough for you to test them.
On Windows, or Unix, you shouldn't have to reboot. If you do, it's to test something or correctly reinitialise after fixing a problem (a post-solution reboot just to make sure it works as required isn't a bad thing but certainly not "required"). The worry of hardware failure on boot shouldn't stop you rebooting, and similarly you shouldn't reboot just to "spot" problems. Both suggest inattention and lack of suitable backups/replacements/high availability solutions.
Systems can easily go 3-4 years in operation without requiring a reboot. If your hardware is good quality, you're monitoring the server as you should be, you have adequate backups/replacements and the role it performs isn't changed, there's no need to ever reboot it past initial testing. I have internal school servers that only get rebooted in the summer (i.e. once per annum) and that's only because the power goes off to upgrade the electrics each year.
If it wasn't for that, I'd just leave them running. They don't need kernel 2.6.192830921830 and they have been doing that same job reliably for a LONG time. I'm not going to kick them into a reboot "just because". Similarly even the tiniest memory leak in their processes would cause me problems that I would spot immediately.
As it is, 450 happy users all day long for years. The last one I installed actually took a whack from a collapsed networking cabinet coming off the wall (full of fully-populated Gigabit switches) and dropping six feet onto it. Apart from a small dent it carried on just fine, and the disks were idle, and SMART / data integrity show no problems. I rebuilt the entire network cabling around it because switching it off wasn't necessary. If it did reboot and it didn't come up in the expected state? There's a copy of it on another machine on the other side of the room - its predecessor that also didn't reboot for years but wasn't fast enough to run the amount of PHP / MySQL we needed it to among its other functions. Having the replacement machine
courtesy of Appendix A of the Jargon File.
Welcome to the Panopticon. Used to be a prison, now it's your home.
It makes a nice figure. Ten years. HP-UX running a few more or less referential databases. 3650 days. Was it patched properly? Did anyone *really* look after it? The only thing that can be said, is that it apparently was quite a stable machine room in terms of 10 full years of electrical & other provisions, more or less intact.
Then it was shut down for good.
I'd rather see regular maintenance breaks and maintenance windows (pun not entirely intended) than collect numbers in the uptime command's output. But the story is true: after I left that company, not a single soul ever rebooted it. Ten years later they sent me an email with an attachment of a PuTTY session. Ten years. :)
After making configuration changes it makes _a lot_ of sense to reboot if possible. That way you can determine that your changes indeed load properly after a reboot. You don't want that kind of a surprise when you have long since forgotten all the little tweaks in place.
.: Max Romantschuk
Why should I bother disabling it?
Generally, good administrators tend to disable services that aren't wanted or needed in their systems. Who's to say that there's not going to be a vulnerability for the service discovered down the road (*coughSolariscough*) that would make you vulnerable?
Windoze admins...
The very first word in your "+5 Informative" diatribe is a derogatory term blanketing all administrators of Windows systems. Anything else you have to say should now be taken as extremely biased, if not plain ignorant. I've been an administrator of Unix systems for over 20 years, and an administrator of Linux and Windows servers since their early days. Being a Windows admin does not mean that one is uninformed or technically inept, any more than being a *nix admin makes one smarter.
Stereotypes exist for a reason. If it wasn't true for at least a statistically relevant number of samples, then the stereotype would not exist.
Yeah, those damn lazy black people, always raping our precious white women.
Making fun of dumb people since 2009