Why You Shouldn't Reboot Unix Servers
GMGruman writes "It's a persistent myth: reboot your Unix box when something goes wrong or to clean it out. Paul Venezia explains why you should almost never reboot a Unix server, unlike say Windows."
Because you won't be able to brag about your uptime numbers.
This is not a myth I had heard before. In fact, none of the *nix sysadmins I know would dream of rebooting the box to clear a problem except as a last resort. Where has this come from?
Burns: We're building a casino!
McAllister: Arrr. Give me 5 minutes.
I for one believe in frequent-ish reboots.
I agree it shouldn't be relied upon as a troubleshooting step (you need to know what broke, why, and why it won't happen again). That said, if you go years without rebooting a machine... there is a good chance that if you ever do (to replace hardware for instance) it won't come back up without issue. Verifying that the system still boots correctly is imo a good idea.
Also, all that fancy high availability failover stuff... it's good to verify that it's still working as well.
The "my servers been up 3 years" e-pene days are gone folks.
i'm really tired of this semi-technical stuff on slashdot that seems aimed at semi-competent manager-types.
One minor point of disagreement. I'm a fan of the pre-emptive reboot at specific intervals, whether the interval be 30 days, 60 days, or 90 days is up to you. In the past, I've found the pre-emptive reboot will trigger hidden system problems, but at a time when you're actually ready for them, rather than at a time when they happen spontaneously ( 2:30 in the morning ).
"Man is nothing without the works of man" -- Helvetius
FTFA:
Some argued that other risks arise if you don't reboot, such as the possibility certain critical services aren't set to start at boot, which can cause problems. This is true, but it shouldn't be an issue if you're a good admin. Forgetting to set service startup parameters is a rookie mistake.
This is retarded. A good admin will test that everything works before it gets a chance to actually break. Anyone can fuck up, forget something, whatever. Doesn't matter how experienced you are. Murphy's law. The only way to test whether it will come up correctly after unplanned downtime is to actually reboot while you have everything fresh in memory and while you're still around and can fix it. Rebooting in that case is not a bad thing, it's a responsible thing to do.
c++;
I RTFA (shame on me) and it is in my opinion absolutely stupid.
There is actually only one real reason given and that is that if you reboot after some services ceased working, you might end up with an unbootable machine.
In my opinion this outcome is absolutely great. OK, maybe not great, but it is important and rightful. It forces you to fix the problem properly instead of ignoring the known problems and missing as-yet-unknown problems which might bite you in the .... shortly after.
Also: when services start being flaky on my system, I usually want to run an fsck. In 16 years of Linux/Unix administration I have quite often found that the FS was corrupted for no apparent reason and without anyone noticing. So an fsck is usually a good thing to run when strange things happen, and to be able to run it, I nearly always need to reboot.
I can't grasp what kind of thinking it takes to continue running a server where some services fail or behave strangely. You could end up with more damage than would be caused by an outage when the reboot does not go through. You just might want to do the reboot at off-peak hours.
This is like *NIX 101.
But then, try changing the locale on a running system...
By and large there is really no need to reboot a UNIX machine unless you are making a change to the kernel, i.e. an upgrade or a recompile with an added feature. Other than that, the author is correct. I have machines with uptimes of two years. It would have been more had I not had to power the machine down for a physical move.
More or less it is "You shouldn't reboot UNIX servers because UNIX admins are tough guys, and we'd rather spend days looking for a solution than ruin our precious uptime!"
That is NOT a reason not to reboot a UNIX server. In fact it sounds like if you've a properly designed environment with redundant servers for things, a reboot might be just the thing. Who cares about uptime? You don't win awards for having big uptime numbers, it is all about your systems working well and providing what they need and not blowing up in a crisis.
Now, there well may be technical reasons why a reboot is a bad idea, but this article doesn't present any. If you want to claim "You shouldn't reboot," then you need to present technical reasons why not. Just having more uptime or being somehow "better" than Windows admins is not a reason, it is silly posturing.
You lie.
Seriously. I don't know what HP is doing, but NFS hangs/stuck processes that you can't kill -9 your way out of is just wrong.
"Not to mention all the idiots who use words like boxen."
Anonymous Coward on Monday August 04, @06:49PM
I run web servers for a few dozen clients, and rebooting a remote machine was always scary. There was the possibility that something might not boot up during startup (e.g. SSHd) and I would be locked out. I would then have to travel to my data center downtown (about 30 minutes away) and troubleshoot the problem. Since I don't have 24/7 access to the DC (I don't have enough business with the DC to warrant an owned security pass...) I have to wait until they open to the general clientèle in the morning.
With ESXi, however, I'm not that scared anymore. If something does go wrong, I have a console to the VM through vCenter client (the application that manages virtual machines on the server). It's happened once where a significant upgrade of FreeBSD 7.2 to 8.1 was problematic. Coincidentally, it was because I didn't upgrade the VMware tools (open-vmware-tools port). Nonetheless, I managed to fix the problem through vCenter.
This is why I love virtualization in general. It's making managing servers easier for me.
What a load of horse shit.
Often system upgrades (eg. security fixes) include new versions of libraries and such. It's impossible for the package manager to know which processes are using those libraries so it can't automatically restart everything. Consider if you have custom processes running, the package manager wouldn't even know about them.
Therefore you have to do it manually, but then you have the same problem. It's damn hard to know which processes are using the libraries that were upgraded. Really, really hard if it's a big server running hundreds or thousands of processes. Often it's easier just to reboot so you make sure everything is running the current version of all the libraries. If you don't then you can't be sure that all the security fixes are actually running on the system since it will be using the old cached versions of the libraries in RAM.
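For what it's worth, on Linux the "which processes still use the old libraries" question can be approximated by looking for mappings the kernel marks `(deleted)` in `/proc/<pid>/maps`. A minimal Python sketch, assuming a Linux-style `/proc`; the function names `deleted_libs` and `scan_proc` are made up for illustration and are not part of any package manager:

```python
import os
import re

# Matches the pathname column of a /proc/<pid>/maps line when the kernel has
# marked the backing shared object as deleted (e.g. replaced by an upgrade).
# Assumption: library paths contain no spaces.
_DELETED_LIB = re.compile(r"(/\S*\.so\S*) \(deleted\)$")


def deleted_libs(maps_text):
    """Return the sorted, de-duplicated deleted library paths in maps_text."""
    found = set()
    for line in maps_text.splitlines():
        match = _DELETED_LIB.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


def scan_proc(proc_root="/proc"):
    """Map pid -> deleted libraries still mapped by that process."""
    stale = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return stale  # no /proc here (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "maps")) as handle:
                libs = deleted_libs(handle.read())
        except OSError:
            continue  # process exited, or we may not read its maps
        if libs:
            stale[int(entry)] = libs
    return stale
```

Run as root, `scan_proc()` gives a list of candidates to restart one by one. It only sees files that were replaced on disk, so treat an empty result as a hint rather than proof that everything is current.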
Quotes from stupid people:
You should never reboot a Mac, it's not like Windows.
You should never reboot Unix/Linux, it's not like Windows.
Well, you shouldn't reboot Windows either. You reboot it when it goes sour. Our Windows servers seldom go sour, so we don't reboot them. Same for Mac or *nix.
Problem is when it starts to cause problems. Like our /var/spool partition deciding it has better things to do than exist... or the ever so important NFS or iSCSI mount that decides to Go West, and gives us the ??? ls we all dread ... with umounting impossible, so remounting impossible, and all these stale files and stuff. You either tweak these things for hours cleaning up all processes, or you reboot.
In fact, being a good sysadmin, all my servers are MEANT to be rebooted if something goes sour. One SVN project goes sour? check if it's not the repository itself that got problems, or if the system needs to save something to safely exist ... and if not, reboot the server. Everything magically restarts itself, does its little sanity check, and a quick look at a remote syslog to make certain everything is all right. 2 minutes lost for everyone, not 3 hours of trying to clean up mess left by some stray process somewhere or trying to kill the rogue 100 compression and rsync jobs that got started eating up all RAM, CPU and network.
Since all our servers are single processes and are either VMs or single machines, it's a breeze to do this. iSCSI will diligently wait before the machine is back up before trying to reconnect. NFS will keep its locked files up, and will reconnect to them. No, seriously, everything simply reconnects!
Of course, the idea is to minimize these occurrences, so we learn from them and try to repair whatever could have caused the problem in the first place. And there's a place to do this: the server crash postmortem. But there's no need to make users wait while we try to figure out wth.
While it's true servers don't need to be restarted as often as Windows counterparts, there are valid reasons for restarting a server:
- new kernel, new features
- new kernel, new security patches (yes, these are distinct reasons)
- ensure all services restart in the event of a real failure
- we have cases where memory fills and the system starts thrashing. It may cure itself eventually, but you can't get in via SSH or console (and no, the OOM killer doesn't kick in).
I think item #3 is important. If you have a crusty system that's been in place for a while and it reboots for some reason, you now have to spend time to make sure everything started, figure out what didn't start, and why. This doesn't mean you need to restart once a week, but every 6-12 months is certainly reasonable.
I've heard a lot of myths. I've never heard a myth stating "You need to reboot a UNIX system to fix problems." If anything I've heard the opposite myth. Who promulgates this shit?
I do remember ONE time a UNIX system needed a reboot. We (developer team) were managing our own cluster of build machines. The head System God was out of town for two weeks. We were having problems with a build host, and tried everything. Day after day. Finally, on the last day before System God was due to return, it occurred to me that the one thing we hadn't tried was to reboot the machine. The reboot fixed the problem, whatever it was.
I felt stupid. One, for not figuring out the problem in a way that could avoid a reboot. Two, for not recording enough information to determine root cause in a post-mortem analysis. Three, for configuring a system in such a way that a reboot might be required in order to fix a problem.
To this day I believe that reboot was unnecessary, although at the time it was the fastest way to resolving the immediate blocking issue.
I just don't get why it was necessary to make reference to Windows here. Most of the legit reasons he listed for not rebooting a Unix box apply to Windows and its analogous subsystems too.
... the crap I read on Slashdot is so unbelievable, I have to reboot my laptop in the hopes that it will go away.
Have gnu, will travel.
The same argument can be applied to Windows servers; sometimes rebooting will only make things worse, or at least not make things any better. Unfortunately, these days the trusty reboot is often the first option instead of the last resort; at the very least some basic troubleshooting needs to be done to identify potential causes before you likely erase half the evidence.
I suffer from a desktop variant of this issue at work, whereby re-imaging has become the "troubleshooting" tool of choice, to the point that all thought has now left the support process so that I've witnessed an engineer re-image a PC 3 times (at 30+ minutes each time) before someone else identified that the issue was being caused by a BIOS setting and that re-imaging was a complete waste of time.
Let's face it, if your admin/support staff are lazy and/or stupid, then it doesn't matter which approach they take because they're not going to fix the problem anyway.
Paul Venezia is an uneducated piece of garbage.
Is it just me, or is the logic behind this article broken?
The purpose of existence is to make money.
/. editors: I propose a new rule. Submissions with links to PCWorld, InfoWorld, PCMagazine, Computerworld, CNet, or any other technology periodical you'd see in the check out line of a Walgreens be immediately deleted with prejudice.
They're the Oprah Magazine of the tech world. They exist to sell ads by writing articles with grabby headlines and little substance.
No sig for you!!
Did anyone else notice the reek of the True Scotsman fallacy? If you agree with him, he brags about it. If you don't, he cites the reason to be because you aren't a TRUE pro-unix admin.
Sorta grates on my nerves a bit.
while(1) attack(People.Sandy);
The new crop of sysadmins are sort of funny. I wasn't aware that there was a myth, among the Unix ranks, that rebooting a server fixed anything. Of course that doesn't fix anything.
Are the people spreading this myth the same folks that log in as root because hey - they're the sysadmin, and access controls are for wimps?
I never reboot unless the system hangs up completely. In recent years I had to reboot once, when the air conditioning failed and a server had a bad memory alarm.
By keeping reboot as an extreme measure, I know when something truly bad happened. If I reboot without reason, I lose that information.
Actually I find I take the same approach with Windows as well. Most of the time if you reboot a Windows box the same problem is just going to repeat itself time and time again. So rebooting isn't actually a solution here either.
Currently working as a software developer in a company, I got so pissed off with the servers rebooting all the time. When the resident IT guy was on holiday it fell to me (the next person who had any experience of doing this sort of thing). By the time he came back the network was running perfectly fine. Managers asked what I had done. I just said I fixed a few things properly, about 4-5 major things, instead of just rebooting it ... Long story short: the IT guy isn't there any more. I took over his role on top of my own and the network has been running fine ever since ...
Oh I got a chunk of the other guys pay for it too :)
...for wasting company time on non-solutions instead of doing a reboot that took 1 minute.
It's the second time in a week that a barely interesting post from this guy's blog makes it to the main page. What's wrong with you, /. ?
If anything would break on a reboot of a Unix system, your sysadmins aren't doing their fucking jobs and need to be crucified on the shattered remains of a cabinet.
If you can't reboot a specific, single server without production impact, your architects aren't doing their fucking jobs and need to be crucified on a whiteboard easel.
If all you care about is some asinine 'uptime' number, turn in your fucking credentials now - you have no business being anywhere near a command line.
I know this is Slashdot, and fanbois abound, but this is coming it a bit high. Yes, Suzy, even Unix systems need to be rebooted now and then.
- Design system
- Build system (involves inevitable reboots)
- Test system (involves inevitable reboots)
- Move system into production.
Once the services you need start up the way you want, don't play with it. Put it into service and have backups of the original image, any changes you make and a working replacement (Yes, have a working replacement - there is *nothing* better than having another machine sitting next to your server that can take over its job with the flick of a switch while you repair it - it also lets you test changes safely, and whenever you're sure the system is how you want it, you push the same image to your "copy of" server).
If you do it properly, that machine will then stay up until hardware failure. Sometimes that *can* be years away. If you do it properly, you shouldn't ever, ever, ever be rebooting a server that's in production - you're just masking the real problem. Yeah, it'll work most of the time but it's just a way of papering over the cracks. The server hung, the service died, the settings got out of sync, or whatever, for a reason. Just rebooting is ignoring that reason for sake of service continuance - if the service is that vital, you should have high enough availability to cover such incidences or that same problem will come back to bite you later.
Nobody cares about enormous uptimes, but having a server that you haven't NEEDED to touch in months is a good thing. It means that it has a well-defined function and has been performing correctly - that's your "stable" version and should be treated as such. Every time you make a change to a server, it then becomes a "current/experimental" version that you should be wary of.
At worst, when a problem appears, you turn ON a replacement server and fix the one that is showing problems. If its role is well-specified, you don't get "feature creep" where it's running a million things that it never used to and they're not in your startup properly because it's never rebooted enough for you to test them.
On Windows, or Unix, you shouldn't have to reboot. If you do, it's to test something or correctly reinitialise after fixing a problem (a post-solution reboot just to make sure it works as required isn't a bad thing but certainly not "required"). The worry of hardware failure on boot shouldn't stop you rebooting, and similarly you shouldn't reboot just to "spot" problems. Both suggest inattention and lack of suitable backups/replacements/high availability solutions.
Systems can easily go 3-4 years in operation without requiring a reboot. If your hardware is good quality, you're monitoring the server as you should be, you have adequate backups/replacements and the role it performs isn't changed, there's no need to ever reboot it past initial testing. I have internal school servers that only get rebooted in the summer (i.e. once per annum) and that's only because the power goes off to upgrade the electrics each year.
If it wasn't for that, I'd just leave them running. They don't need kernel 2.6.192830921830 and they have been doing that same job reliably for a LONG time. I'm not going to kick them into a reboot "just because". Similarly even the tiniest memory leak in their processes would cause me problems that I would spot immediately.
As it is, 450 happy users all day long for years. The last one I installed actually took a whack from a collapsed networking cabinet coming off the wall (full of fully-populated Gigabit switches) and dropping six feet onto it. Apart from a small dent it carried on just fine, and the disks were idle, and SMART / data integrity show no problems. I rebuilt the entire network cabling around it because switching it off wasn't necessary. If it did reboot and it didn't come up in the expected state? There's a copy of it on another machine on the other side of the room - its predecessor, which also didn't reboot for years but wasn't fast enough to run the amount of PHP / MySQL we needed it to among its other functions. Having the replacement machine
"If you shrug and reboot the box after looking around for a few minutes, you may have missed the fact that a junior admin inadvertently deleted /boot and some portions of /etc and /usr/lib64 due to a runaway script they were writing. That's what was causing the segfaults and the wonky behavior. But since you rebooted the server without digging into the problem, you've made it much worse, and you'll soon boot a rescue image -- with all kinds of ponderous work awaiting you -- while a production server is down."
That argument is somehow pro-Unix?
I mean, yeah, a Windows person can screw with boot files, too. However if a Windows person were to read that paragraph it certainly wouldn't do a thing to encourage them on the solidity of *nix. It basically translates to "if you're having a problem, don't restart, you may not be able to boot again because your other admins may be incapable of writing proper scripts since every *nix system is different on its boot structure ... so ALWAYS do a full check of the existence of your system binaries before rebooting."
Do Windows folks reboot too easily before examining logs, restarting services, etc? Sure. But this article extrapolates this point beyond the deep end.
It is more productive to voice thoughtful opinions (reply) than to judge (moderate) others.
Actually it isn't. There's virtually always a reason why something screws up, regardless of if you're in Windows or Unix, and you won't need to reboot. The only exception is for patches, where Windows requires it a bit too often for comfort.
I've worked for a few companies where rebooting a Windows Server for anything except patches/maintenance would require a full root cause analysis, and it pretty much never happened. We virtually always were able to find what was going wrong and fix it without rebooting. This isn't 1998 anymore: Windows Server absolutely can stay up for long periods of time, and there's always ways to prevent reboots.
If you need to increase the number of hugepages on a server, and memory is already seriously fragmented, doing that without a reboot is asking for a world of pain.
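To see how far short a live system falls, one option is to compare the number of hugepages you asked for with what `/proc/meminfo` reports afterwards. A rough Python sketch, assuming the usual Linux field names (`HugePages_Total`, `HugePages_Free`, `Hugepagesize`); the helper names are illustrative only:

```python
def hugepage_stats(meminfo_text):
    """Extract hugepage counters from the text of /proc/meminfo."""
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    stats = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # The first token is the number; Hugepagesize also carries "kB".
            stats[key] = int(rest.split()[0])
    return stats


def shortfall(requested, meminfo_text):
    """How many of the requested hugepages are not currently reserved."""
    reserved = hugepage_stats(meminfo_text).get("HugePages_Total", 0)
    return max(0, requested - reserved)
```

Feed it `open("/proc/meminfo").read()` after raising the hugepage count. A non-zero shortfall on a long-running, fragmented box is the "world of pain" case, where a reboot (or reserving the pages at boot) is likely the easier route.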
courtesy of Appendix A of the Jargon File.
Welcome to the Panopticon. Used to be a prison, now it's your home.
Are the people complaining about windows crashing running consumer hardware without ECC memory and crappy lowest bidder 1U non-redundant PSUs going crowbar? Reliability thresholds of major general purpose OSs is all noise compared to physical hardware attributes nowadays.
It is NEVER a good idea to reboot ANY system of any kind without first understanding what the hell is wrong with it. Various services, especially database services, may take hours (as in OFFLINE) to recover to a consistent state if arbitrarily rebooted. I've seen it happen too many times. It is not a lot of fun telling retarded sysadmins they have no choice but to sit on their hands and wait hours for a system to come back online because they got up and pressed the big red button.
It makes a nice figure. Ten years. HP-UX running a few more or less referential databases. 3650 days. Was it patched properly? Did anyone *really* look after it? The only thing that can be said, is that it apparently was quite a stable machine room in terms of 10 full years of electrical & other provisions, more or less intact.
Then it was shut down for good.
I'd rather see regular maintenance breaks and maintenance windows (pun not entirely intended) than collect numbers in the uptime command's output. But the story is true: after I left that company not a single soul ever rebooted it. Ten years later they sent me an email, with an attachment of a putty session. Ten years, :)
I know of one case where a reboot is certainly appropriate. Suppose that you've found and fixed a problem but the fix involved changes to processes that occur at boot time. In that situation, it's a smart idea to reboot the server at some convenient time (this doesn't have to be done immediately) to make sure that the machine will reboot correctly and the problem will stay fixed. If you don't do this, you might be surprised the next time your machine comes back up after (e.g.) a power outage.
So.... It's so easy to mess up a Unix server to get it to the point where it won't boot properly, that they recommend never rebooting it?
I'd rather have a server that I know will consistently boot properly and immediately start working, despite needing to reboot every few months than have a server that always works, but am scared to death to touch it because someone can so easily corrupt something. No matter how much planning and redundancy you have, you will eventually have a failure. When that happens, your users aren't going to be standing around praising the joys of a UNIX/Linux server while you are trying to figure out how to get the thing back online.
On Windows boxes, the admins are not allowed to fix problems, so problems persist. To temporarily solve the problem until MS fixes the code, reboot the computer. It doesn't matter, because the computer is going to have to be rebooted anyway when MS issues an update.
To be fair, the main reason I don't reboot my *nix partitions is that I am never sure they will come back up. Say what you will, the nice thing about Windows is that no matter how it is damaged and how bad the situation is, it will attempt to come back up in something resembling a working state. Probably not a known state. Probably not a secure state. But usually a workable state.
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
I work in an environment of literally hundreds of linux & Solaris systems, and we reboot pretty much every single one every 6 months on a regular scheduled patching cycle. We have systems broken down into groups of test/dev, staging, and production. When it's time for a new patching run we obtain all the vendor patches up to that point and apply them to the test/dev systems. After giving people a week to test & verify then we apply them to staging, and then a week or two after that during defined maintenance windows we apply them to production. If we encounter any problems along the way we address them. If we had a major problem arise along the way we'd push out the production patching until we had everything resolved on the staging systems. This method has worked for close to 10 years at this organization and we have no intention of changing it. We control access to the production systems, and all configurations are backed up & managed via a combination of homebrew code and cfengine, along with nightly tape backups of all the production systems. If any significant problems occur on production systems then it's not all that difficult to rebuild a machine from bare metal using the saved configurations and restoring any other data from the backups.
It isn't a valid strategy for Windows Servers either. Rebooting is a lazy fix for any platform. It *may* make the problem go away, but you have no clue why. Often you can find the crappy program that's causing the problem and kill it or fix it and no reboot is necessary. It takes longer, but only once. Restarting a service is way faster than rebooting and you don't have this mystery problem hanging over the box.
I have more than 100 Windows Severs under management and other than patches we don't reboot them. We have some poorly written software that we script to shut down every night, but we're restarting that bad app, not the box.
As linux starts to pick up steam, the same crappy admins who make windows look worse than it is are bringing their poor skills into the *nix realm.
Not a myth I've heard uttered by people with real unix(-a-like) experience. If a service is not functioning correctly you might restart that service, and maybe its dependencies, but not the whole machine.
The only time a server should be rebooted is after a kernel update or after configuration changes that you "know" are right but need to verify stay right after a reboot. I do sometimes reboot machines at other times just to make sure all is well so I can be reasonably assured that everything will come back up after, say, a power outage. None of these times happen when there is a known problem to be investigated - the reboot happens at a planned time (well, within a planned window) outside of working/demand hours. Rebooting a machine to fix a problem is no better than "close all your windows and see if it happens again" guesswork.
Some suggest rebooting to force an fsck occasionally, to ensure the filesystems are in consistent order, but this can usually be done without a full reboot (unless you suspect the root filesystem may need checking) - just stop all the relevant services and umount the filesystem.
Even if you don't have power accidents or hung hardware, it is probably a good idea to reboot Unix and other Linux-like boxen every so often. To clear out memory fragmentation and (horrors!) kernel memory leaks and stale kmallocs().
How often is a good question. Perhaps yearly on a lightly loaded box (I like doing it in Dec so `ps` shows me the year). Maybe monthly on a loaded box. I have noticed a speed-up.
He cites as one reason that someone might have deleted stuff in /boot or /etc or /usr/lib64, so the machine might not come back up quite right. I would sure as hell rather be confronted by such surprises during scheduled downtime than after a power outage when the UPS failed and I already have enough problems. Problems like what he's talking about, while they kind of presume you have already failed as an admin (why are clueless people deleting things in /boot?), are a reason you should reboot. The sooner I know about it, the better, especially if it's at a less panicky time like a weekend when nobody cares if I have to spend an extra hour with a rescue CD.
Also, sometimes it's just plain convenient. If the reason you're working on the box is that you just moved your /var to a new device (maybe you were changing the kind of filesystem you used, or wanted to expand it but are using a filesystem that isn't easy to resize (though I can't think of one right now)), good fucking luck unmounting it so that you can remount it on its new device. Why go through the hassle when you can just reboot and have things magically work?
And then there's kernel updates. You're not going to upgrade from 2.6.n to 2.6.(n+1) by reloading modules. And yes, I have heard of some hacks for people updating kernels without rebooting, but if you think about how these things work, they're a hell of a lot scarier than rebooting is.
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
I must reboot this unix server now.
My several years experience with numerous Server 2008 (yes that's the same kernel as Vista) and, more recently, 2008 R2 based boxes suggests otherwise. Perhaps your servers are badly configured? Windows Server is a rock solid OS.
I reboot all my systems once a week (Sunday). Better to find out about messed up init scripts and such in a timely fashion than while everyone is screaming about unplanned downtime. And I have found and fixed such errors from time to time, so it's not a bad idea to get the issues resolved before they become a problem.
I do not fail; I succeed at finding out what does not work.
In 20+ years of working in UNIX environments, I've never met anyone who would reboot a UNIX box to 'fix' a problem or 'clean up' anything.
Maybe next time you should post an article about something a monkey typed on a computer. Would probably be worth reading more.
Some setups do not clean up the /tmp directory - mine seems to just fill up forever until I reboot. maybe it does something on its own but I've never checked or ran into a situation where i needed to find out; an update or paranoid reboot deals with it.
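If you would rather find out what is piling up than wait for the next reboot, a dry-run listing of old files is an easy first step. A small Python sketch, assuming that "not modified in N days" is a good enough definition of stale; `stale_files` is a hypothetical helper and it deliberately deletes nothing:

```python
import os
import time


def stale_files(root, max_age_days, now=None):
    """Yield non-directory entries under root not modified in max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so a symlink is judged by its own age, not its target's.
                info = os.lstat(path)
            except OSError:
                continue  # vanished between listing and stat
            if info.st_mtime < cutoff:
                yield path
```

Something like `for p in stale_files("/tmp", 30): print(p)` shows what a cleaner would remove. Many distributions already ship a purpose-built tool for this (tmpwatch, tmpreaper, or systemd-tmpfiles, depending on the system), which is probably a better fit than a homegrown script for the actual deletion.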
Memory fragmentation? people care about that with their 10Gig of RAM still? Well, to be fair performance has been coming back with smart phones and battery life on laptops ...now if we could just get people back to knowing what a pointer is...
Democracy Now! - uncensored, anti-establishment news
Rebooting isn't a crime--and is often necessary after applying security patches.
Think about that the next time someone tells you it'd been 700 days since a reboot; 700 days of exploits you can choose from to assail that machine with.
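A quick way to turn that into a check is to read `/proc/uptime` and flag boxes that have gone longer than your patch cycle. A minimal Python sketch, assuming Linux's `/proc/uptime` format (seconds since boot as the first field); the 90-day threshold is an arbitrary illustration, not a recommendation from the article:

```python
def uptime_days(proc_uptime_text):
    """Convert the contents of /proc/uptime to whole days of uptime."""
    seconds = float(proc_uptime_text.split()[0])
    return int(seconds // 86400)


def overdue_for_reboot(proc_uptime_text, max_days=90):
    """True when uptime exceeds max_days (a policy number you pick yourself)."""
    return uptime_days(proc_uptime_text) > max_days
```

For example, `overdue_for_reboot(open("/proc/uptime").read())`. Long uptime does not prove a machine is unpatched (userland fixes may have been applied and services restarted), but it does mean no kernel update that needs a reboot has taken effect in that time.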
"The "my servers been up 3 years" e-pene days are gone folks."
Slashdot Stats
uptime: 1021 days, 23:23
In what kind of instances might a test reboot be advisable after a change? Obviously our overlords have not felt the need to follow this particular advice..
MilkMiruku
If a server going down is a big deal then perhaps you aren't as good as you think. Just a thought..
Seriously, if your setup isn't virtual save the preaching because you're stuck in the past.
Tiger Blooded Bi-Winning Machine
Mr. Venezia must know a lot of piss-poor Windows admins. Or be one himself.
What is up with folks tagging this as a "troll?" Strawman? Really? There isn't a debate here, guys. It is not rhetoric. There is no strawman.
This article is sound advice. He's basically saying don't do a reboot without doing an RCA first. If you need to reboot as part of your root cause analysis, fine, but for God's sake don't just shut down everything until you know why, that you need to, and whether it's going to come back up.
This is good advice for Windows sysadmins too. Period. But Windows was not its focus. *nix was. This is because *nix is a bit more compliant to letting root completely hose the file system, on the fly (his "runaway script" example), and still be able to run.
IMO, it was written as a caution to new *nix admins, possibly migrating from a Windows environment, period. RTFA, and take it at face value. You'll learn something if you're a novice to *nix.
--
Toro
You might argue the "myth" of rebooting windows servers but I've never heard anyone with sufficient *nix experience saying that rebooting is a solution to anything except certain cases.
Selah.ca. Pause, and calmly think on that.
Tools that run periodic performance tests and continuous surface crawls will make a fool out of SMART every single time.
(I have a great deal of experience on this front, and three boxes sitting behind me full of about 300 drives in various states of failure. do NOT rely on SMART)
I work for the Department of Redundancy Department.
Windoze admins...
The very first word in your "+5 Informative" diatribe is a derogatory term blanketing all administrators of Windows systems. Anything else you have to say should now be taken as extremely biased, if not plain ignorant. I've been an administrator of Unix systems for over 20 years, and an administrator of Linux and Windows servers since their early days. Being a Windows admin does not mean that one is uninformed or technically inept, any more than being a *nix admin makes one smarter.
- require https over http to devices, yet still have telnet access enabled.
I'm sure I have several devices on my network with telnet enabled. Why should I bother disabling it? I don't use it, so its vulnerability to password sniffing is irrelevant.
And what do any of your gripes have to do with whether or not Unix servers should be rebooted?
Ok, if an article like this is making the front page of slashdot, then the reader base must have shifted a lot more than I had previously realized. Duh, reboots are for upgrades -- this has been established a very long time ago. I mean this isn't even a holy war issue which you expect to have flare up every now and then. If the article were about guidance on upgrades that would be different. The whole "we do a monthly reboot" regardless is silly; I just don't see it, wouldn't you be better served figuring out how to make the delicate parts of the root partition read only?
I mean really, if you are unsure that your system will reboot after you make some change, then why in the world are you making that change on a production server?
Uptime numbers are just another stupid dick-measuring contest. Resources consume ram and some resources leak memory. You can't reclaim that leaked ram without a reboot.
Kernel updates and some system patches take reboots too. If you're not rebooting, you're not current.
You cannot fsck a mounted slash. fsck always finds something wrong on the filesystems when I *do* reboot after a year of uptime.
This business of rebooting a server every time there is an issue is a sign of a larger problem. There are more options available for fixing problems on *nix systems, that's just the way it is. You don't always *need* a reboot to fix things, but sometimes you do (stuck tape drives, zombie processes, crappy iscsi software, etc).
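Before blaming zombie processes, it helps to confirm they exist and find the parent that isn't reaping them. A rough Python sketch, assuming a Linux-style /proc where each `/proc/<pid>/stat` line reads `pid (comm) state ppid ...`. The helper names `parse_stat` and `zombies` are invented for this example.

```python
import os


def parse_stat(stat_line):
    """Split a /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # comm may itself contain spaces or parentheses, so cut at the LAST ')'
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    fields = tail.split()
    return int(pid), comm, fields[0], int(fields[1])


def zombies(proc_root="/proc"):
    """Return (pid, comm, ppid) for every process currently in state Z."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state, ppid = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were looking
        if state == "Z":
            found.append((pid, comm, ppid))
    return found
```

A zombie goes away once its parent reaps it or the parent itself exits, so the `ppid` column is usually the thing to chase.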
boycott slashdot February 10th - 17th check out: altSlashdot.org
I mostly agree with the sentiment, analyze a failure in place before rebooting. In fact, for a *problem*, I agree that it is a particularly bad time to blindly reboot.
For updates, in practice, I disagree. In theory, he is right that very very few updates *demand* a kernel replacement and all other updates *should* be possible through restarting the 'right' things. In practice, there are significant problems.
First, some kernel modules are written.. sub optimally and cause unrecoverable issues if unloaded or will not work right on reload. Some drivers are written and only tested with a PCI reset and POST cycle between driver changes. Sometimes reloading a driver if the firmware was updated during runtime will confuse a system. Downing some modules will remove functionality that is required for the system to run well enough to load the module again. Many of these are issues that if you knew in advance would induce you to pick a different vendor, but they are usually only apparent after it is too late.
Secondly, depending on your Unix/Linux choice, updates may not be available of 'just' the modules you need or even clarify whether the updates are in the kernel or modules. A linux distro generally dumps a single update of kernel *and* modules. Even if you *knew* the kernel didn't have critical fixes, modprobe may refuse to load new ones against other kernels. 'Real' Unix *tends* to be potentially better on the latter point, and you *might* be able to comb over the updates with a fine tooth comb, hand-patch the source for the kernel tree that matches your running kernel, build and load. However, this is a ton more work and way riskier than 'just' rebooting.
Finally, *particularly* with shared library vulnerabilities, there is a slim chance in practice you'll understand all the processes currently executing on your system enough to be 100% confident you'll hit all in-memory instances of buggy/vulnerable code. Figuring out the 'right' processes to restart or end is generally a larger logistical challenge.
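To put a rough number on that logistical challenge: on Linux, a process still running code from a library that was replaced on disk shows the old file in `/proc/<pid>/maps` with a ` (deleted)` suffix. A Python sketch that scans for those follows. The function names are made up here, it assumes the package manager replaced the library by unlinking the old file, and it can only read the processes you have permission to inspect.

```python
import os

DELETED = " (deleted)"


def deleted_libraries(maps_text):
    """Return the shared objects in /proc/<pid>/maps text that are marked deleted."""
    found = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [path]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no path column
        path = fields[5].strip()
        if path.endswith(DELETED) and ".so" in path:
            found.add(path[: -len(DELETED)])
    return found


def processes_needing_restart(proc_root="/proc"):
    """Map pid -> set of deleted libraries that process still has mapped."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "maps")) as handle:
                libs = deleted_libraries(handle.read())
        except OSError:
            continue  # process exited, or we may not read its maps
        if libs:
            result[int(entry)] = libs
    return result
```

This narrows down which daemons to restart, but it does not replace understanding what they do. If the list turns out to be half the box, that is a decent argument for the reboot.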
In general, reboots should be feared in terms of disrupting the identification of root cause, but updates and periodic sanity testing, to make sure your system actually works on startup and matches how you expect it to be configured, are good reasons to reboot. If the service is critical enough to make people worry about reboot induced outage, then the service is not properly configured to run in a manner conducive to mission critical reliability (another server should be able to transparently take up load).
XML is like violence. If it doesn't solve the problem, use more.
I don't suspect the author was aware of Ksplice. You can actually perform kernel upgrades without a reboot.
Ever been asked to reboot after installing a *GAME* on that other platform? This makes me ask the question: "What did you game publishers just do to my system? Are there any secret root-kits or other forms of malware that came bundled along with this game?"
Why do so many UNIX admins try to keep servers up as long as possible? There is a valid reason to reboot your UNIX servers on a regular basis, and it's not about solving problems but rather about avoiding (potential) future problems.
While many Windows folks tend to reboot servers when there is a problem (and it seems to work for them sometimes), Linux/UNIX servers are quite the contrary in this regard - these things often seem to work fine until a reboot. And after a reboot one often realizes that some things manually started or reconfigured while the server was running do not work properly, or that the server requires some additional (manual) work to make it work. There may be bugs in startup scripts (or missing links in rc?.d directories, or screwed dependencies), or one can make some tweaks to a service without stopping it and then forget to put the changes into configuration files, etc. Of course, there are habits and tools to deal with it, but you cannot be sure that everything works after a reboot until you try it.
I advise rebooting servers at convenient moments on a regular basis or after major reconfigurations, just to make sure that every change is properly synced with configuration files and startup scripts.
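One of the "habits and tools" mentioned above can be sketched in Python: compare what is mounted right now against what the boot configuration would mount. It assumes the usual whitespace-separated layout of /etc/fstab and /proc/mounts (device, mount point, type, ...). The names `unpersisted_mounts` and `PSEUDO_TYPES`, and the list of filesystem types to ignore, are illustrative rather than complete.

```python
# Kernel-provided filesystems that normally never appear in fstab (illustrative list)
PSEUDO_TYPES = frozenset(
    {"proc", "sysfs", "devtmpfs", "devpts", "tmpfs", "cgroup", "cgroup2", "securityfs"}
)


def mount_points(table_text):
    """Parse fstab- or /proc/mounts-style text into {mount_point: fstype}."""
    points = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if len(fields) >= 3:
            points[fields[1]] = fields[2]
    return points


def unpersisted_mounts(fstab_text, mounts_text, ignore_types=PSEUDO_TYPES):
    """Return mount points that are live now but that fstab would not recreate."""
    persistent = mount_points(fstab_text)
    live = mount_points(mounts_text)
    return sorted(
        point
        for point, fstype in live.items()
        if point not in persistent and fstype not in ignore_types
    )
```

Feed it the contents of `/etc/fstab` and `/proc/mounts`. Anything it reports was mounted by hand (or by something other than fstab) and may not survive the next boot. It catches drift, not typos in mount options, which only an actual mount attempt will show.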
Especially in a large company. ... except that someone, somewhere depends on it ... or not, who knows? You just know you got an alarm, or a blinky light, or a dead drive, or a bad network interface, yadda yadda.
One that is perhaps grown via a katamari damacy style brute force accumulation of unrelated crap.
Unrelated crap where more likely than not, there is not a single person left in the company who knows how it works, or specifically what it does
If you reboot, you don't know for sure that it's going to come back on ... and it hasn't been booted in 20 years.
Thankfully with Unix type OS's, if you know what you're doing, you probably don't HAVE to reboot.
So you scrub in, and you do some surgery, and you hope for the best.
nobody wants to be up all night dealing with that shit, and trying to find someone who even knows what the machine does.
nobody.
and that is why ...
So the power failed. I was happy to tell people that a system that hasn't been rebooted in over a year is a system that is BADLY in need of an upgrade. I don't want to run software that old. I'm glad the power failed and gave me an excuse.
The general consensus of disaster recovery best practice is that you do not test a backup strategy, you test a restore strategy. Rebooting a server is testing a system restore process.
Yes exactly. When push comes to shove, you are not a deity and you cannot guarantee that your boxes will never be powered off. As such, you need to be confident that they will also power on. Like a backup/restore, that confidence isn't based in, "this should work," but in, "this does work."
The author's argument that you shouldn't need to reboot because faulty boot scripts are the mark of a poor sysadmin is bull. Everyone makes mistakes. EVERYONE. It is the height of arrogance to presume that you are somehow special and don't need to check your work. In the case of a startup script for a server, the only real complete test that it is valid is to reboot. Not testing your setup is not only poor IT skills, it's unscientific. If there is any remaining shred of computer science in the IT world, it should at least include testing of your hypotheses (assumptions).
Dunno, got one situation that a reboot fixes and I have looked.... and used Google, etc. So I present it as a question to Slashdot. Prove it can be solved without a reboot.
Servers run a RHEL3 clone. Workstations run a RHEL5 clone. My laptop runs Fedora 12. We have a couple of Tcl/Tk scripts on the servers that we display remotely to the workstations and my laptop via ssh X forwarding. Everything is happy, happy, joy, joy... except when it isn't.
Suddenly some remote X stops working, in particular the Tk ones stop. Basic xterms and even Firefox will start over the remote link perfectly. Even better, they almost always display on my laptop, even when everyone else in the building is sticking their head in the door complaining they can't get it to work, which makes it even more fun for me to troubleshoot. I have poked around. I have twiddled server tuning knobs, even run strace on the apps. They seem to hang on the pipe to X but I can't see why. Rebooting will always fix it. One of the servers will usually recover in a few minutes without a reboot if we can afford to wait it out; the other doesn't do it as often, but when it happens a reboot is the only way back... and the Tk app that fails on it is our timeclock app, so waiting isn't a good option. But the machine with the timeclock also serves out home directories to every staff workstation and hosts the virtual machines that run our library automation system, and it bites us hard to have to shut that all down. Thankfully that machine only exhibits the problem a couple of times a year.
Democrat delenda est
Stupid guy said something stupid, and now tries desperately to justify his theology instead of admitting he overstated something.
... otherwise the Windows Updates won't get properly applied. :P
Do you reboot your refrigerator once a month too?
Many essay/article writers feel compelled to conclude with an upbeat, smiley-faced "and here's how things will be better" paragraph at the end, and often it's a little puff of fluffy made-up nonsense that doesn't connect to the rest of the story. I've done it. The New Yorker does this a lot. And Paul Venezia does it here.
Last paragraph: "The next time you're looking at a problem and someone says, 'Hey, let's just reboot the thing,' make sure you've exhausted every other possibility before you send it to init 6. The time and pain you save will definitely be your own."
Really? No, I think that's the whole tragedy of the commons that makes rebooting more common than debugging. It's very possible that by punting the problem downfield (get it running quickly while masking root problems) someone else will in fact wind up taking care of it. Maybe the IT staffer on another shift, maybe a future employee after you've left, maybe a different helpdesk guy after you've got the client off your personal phone line. The economics of having multiple, rotating/replaceable IT personnel make it inevitable that the time/pain will not in fact be your own, and makes actual debugging non-incented.
Send it the magic packet?
Twinstiq, game news
After making configuration changes it makes _a lot_ of sense to reboot if possible. That way you can determine that your changes indeed load properly after a reboot. You don't want that kind of a surprise when you have long since forgotten all the little tweaks in place.
.: Max Romantschuk
Not only does he make a non-point (I haven't seen any admin willing and happy about reboots),
but he also tries to sell the idea of going through hoops to avoid restarting as being a good admin. If the most cautious way is to replace and restart, then you should do it. If you want to avoid the surprise (?) of a deleted /boot before rebooting, then create a checklist and keep a failover server, but stop creating more complicated and in the long run more error prone stuff just to show off.
Availability comes not only from knowing how to keep the server online but also from strategies to keep the service online! If all you've got is "no reboot" as your strategy then you are screwed anyway.
You are not going to impress girls with that kind of uptime.
I rebooted a Dell office linux fileserver once, just to ensure my new fstab entry mounted ok on startup. It didn't mount okay because I made an error in the fstab mount options. Although that turned out to be the more minor issue with the reboot, the much larger problem was that after rebooting, the server would no longer recognize any USB keyboard (I tried 5 different ones) - on any USB port - or via a plugin PCI card, and I couldn't work out any other way to give keyboard input without first having a normal working keyboard. I tried resetting the CMOS but that just resulted in the machine asking for an F-key to be pressed on startup.
After talking to Dell technical support for some time it was apparent they had no way to work around the issue either (short of replacing the motherboard) and we ended up throwing out the whole server and moving the disks to a new one.
That was a late night, and one server i really regretted rebooting!
PS We had a leaky roof and an umbrella over the server for some time, the server had got wet previously and still had drip marks all over it, so I wouldn't really *blame* dell. ..
If you don't reboot your unix servers you are in for a world of hurt when you need to. Most likely a server that has been up for a few years will not reboot without a disk or multiple disk hardware issue. Some issues might be recoverable some might not. You should have a reboot schedule to minimize server hardware issues on unscheduled reboots.
What horrifies me about this article is that he had to write it. He had to actually say "if you just reboot at trouble, you haven't fixed the problem." This is actually a story worth him writing and worth someone putting on Slashdot.
Mind you, I administer Solaris for a living and rebooting is something we do rather more often than we want to. It works way too often, too. We don't tell the NT admins this because they tend to snicker.
http://rocknerd.co.uk
I think we all know why you shouldn't reboot: It cuts power to the velociraptor cages.
Rebooting is important for finding hardware that is about to fail, bad fans, etc. It's also important in identifying one-time configuration that wasn't set up properly to persist across reboots. It's also good to ensure that a server will come back up, especially if it's a server no one typically monitors. And if rebooting renders a critical service unavailable, then the service needs to be redesigned so that it doesn't depend on a single machine.
Obviously, the "no reboot" dude operates in a generously staffed environment, allowing him plenty of time to dick around with stale NFS mountpoints and memory leaks. Most of us don't have the luxury of time.
...we schedule our Linux boxen to reboot weekly. Without fail.
I work in an investment bank. There, they don't have time for this uptime-dick-size-contest shit. The longer a box is up, the greater the likelihood that some hardware failure is going to fuck it up and it won't boot again. A reboot is a great way to tease weird issues out into the open so that they don't screw you over at the critical moment. Of course we have redundant servers, but that's not the point.
Ultimately, the 3 year uptime on that web server you brag about is a disaster waiting to happen. I can give a pretty much cast iron guarantee that when it does go down, whether by choice or unexpectedly, it won't come back up again smoothly. And then you're fucked.
The first thing I learned about unix was 'you never reboot it'.
The second major thing I learned about big UNIX hardware is that the shit is so finicky that rebooting may be its death. I learned that one day after watching a sysadmin argue with a Sun tech for literally an hour or more that it shouldn't be rebooted ... sometimes the hardware doesn't work on the next boot, as was the case for this multimillion dollar Sun server, which decided it no longer had anything on boot flash and so was unable to boot ... because the Sun tech insisted he reboot for a kernel patch that had no relation to the problem at all.
I don't know any UNIX admin who thinks they should reboot their server, ever. I know unix admins that will patch everything on the system and restart daemons even if they have to apply patches one file at a time (patch cluster failing to install on its own for instance), I've heard stories about admins patching binary code in memory JUST to keep a system up and running. UNIX admins take their work more seriously than anyone I know, including just about every Doctor I met.
I know plenty of people who run Linux who reboot their machines for fun. I know plenty of Windows admins who reboot their servers at the first sign of failure.
UNIX admins on the other hand login and find the problem, fix the problem, and let the machine live.
For those saying 'OMG I DON'T WANT TO RUN SOFTWARE THAT OLD!!!!' ... well then you aren't an admin either. You learn rather quickly that when it ain't broke, don't fuck with it. Kernel flaws that can't be mitigated in application daemons or firewalls are pretty freaking rare, so there's almost no weight whatsoever behind that. Pretty much every exploit that exists is a userland exploit, not a kernel one, so unless you need a new feature in the kernel you rarely HAVE to screw with it nowadays, with loadable kernel modules and such. Hell, I haven't even rebooted recent FreeBSD builds due to kernel changes; I just rebuild and reload the module in most cases. I'm certain FreeBSD isn't unique in this ability.
You know what you call someone that thinks you reboot a server to solve problems? You call them a fucking Windows USER. Not an admin, and most certainly not a UNIX admin.
If you think rebooting to fix an unknown problem is a good idea, it's because you don't know what you're doing. I don't mean that to be malicious, it's just the reality of it. An intelligent admin doesn't do shit until he/she knows what's going on.
And with great sadness it brings me to my final duty as a Geek with a conscience ...
CmdrTaco your geek card is hereby revoked. You have exceeded your 'retarded idiots posted to the front page' quota by at least 3 in the last month. This one just sends it over the top. You are no longer considered a geek. All privileges associated with your geek card are also hereby revoked.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
When a system runs all GPL'd open source software that is tested by tens of thousands of users and distro maintainers, then why would you need to reboot? It only comes into question when your system becomes littered with binary kernel modules and binary programs like Flash, and god knows what other binary code gets put on without rigorous peer review. That's when you run into trouble and your system becomes unstable. Not to mention getting your repositories messed up. As more and more of the Microsofties embed themselves in Linux/Unix it will only get worse.
He kind of makes a roundabout commentary that linux is prone to issues that will cause a reboot to fail vs windows. It's an impractical stance: you don't always have the expertise to figure everything out, or your expert is not there. If a linux system goes nutty on you, it can be hard to figure out why. I've had nfs and lvs issues where a reboot was needed; I found out later there were known issues that caused the faults.
You can't spend your day debugging everything. It should be safe to reboot, and the reboot may be the fastest way back up, which counts more than anything with some servers. Those long uptime servers scare me: will they come back up? I tend to force a reboot after any kind of change to make sure that they will come up.
That's a myth?
Maybe among Fox News commentators, or windows admins, or some other kind of inferior life form. I've never heard a Unix admin say that, and I've been working with them or as one of them for over 15 years.
You don't reboot a Unix system except for:
* kernel updates
* hardware replacement
* you really, really, really have tried absolutely everything else and you need a minute of time to think while you look as if you were doing something else than staring blankly at the wall and you know it fools the windows-only boss-idiot.
I know my Unix systems measure their uptimes in months and sometimes years, and I still remember that one time when I spent an hour looking for traces of what I thought was an unexpected reboot, until I finally found out that the Linux uptime counter rolls over after some 460 or so days.
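The rollover can be sanity-checked with a little arithmetic. This Python sketch assumes the uptime figure came from a 32-bit tick counter incrementing 100 times per second. Both numbers are assumptions about the kernel in question, since the comment doesn't give them.

```python
def wrap_days(counter_bits=32, ticks_per_second=100):
    """Days until an unsigned tick counter of the given width wraps to zero."""
    seconds = 2 ** counter_bits / ticks_per_second
    return seconds / 86400


print(round(wrap_days(), 1))                       # 32-bit counter at 100 ticks/s
print(round(wrap_days(ticks_per_second=1000), 1))  # same counter at 1000 ticks/s
```

Under those assumptions the counter wraps after roughly 497 days, the same ballpark as the figure recalled above. At 1000 ticks per second it would wrap in under two months.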
You don't reboot Unix servers. Whoever thinks the premises of the article with its myth is remotely true - please don't let people like that within 10 feet of your servers.
Assorted stuff I do sometimes: Lemuria.org
We learned from experience that it is easier to patch running binaries and manually restart services than it is to fix UNIX boot issues, especially on a remote server at 3am locked in a cage. The reliability of your booting on UNIX/Linux depends on the quality of your system maintenance.
“Common sense is not so common.” — Voltaire
At some point the physical server will need to be rebooted. It is easier to recall a script that was changed 2 weeks (or 2 months ago) that failed during reboot than it is to recall all the changes performed over the last year.
You don't need to reboot to "clean things up" very often on servers. That is very true.
If you are using virtualization - and you should be - then you can migrate a running VM to another VM server (KVM, OpenVZ, ESX, etc) and not take it down, but the running kernel will eventually need to be rebooted, even in a VM.
Most of my servers have uptimes that reflect when the last kernel update was made. Usually, those updates happen every 2-4 months. I'd post my uptimes now, but there was a kernel update 15 days ago so they aren't "impressive" if you are impressed by that. OTOH, ... here's an ESXi uptime:
# uptime
22:28:01 up 474 days, 5:16, load average: 0.00, 0.00, 0.00
You shouldn't be impressed. It means I'm at least 1 patch level behind, perhaps 2.
Rebooting isn't a crime--and is often necessary after applying security patches. Think about that the next time someone tells you it'd been 700 days since a reboot; 700 days of exploits you can choose from to assail that machine with.
Wow, a kernel exploit in KVM which isn't installed on the system. A DoS kernel exploit in a network module that's not installed. That about sums up the usual 700 days of exploits.
I thought the Linux community was working on upgrading kernels without rebooting. Just store the new kernel file(s) in the filesystem, and run some code to cut over from the old kernel to the new one. What happened to that?
--
make install -not war
ONLY JAVA. I wouldn't be surprised if I heard a graduate say "whats a pointer?" or "is that like a reference?"
Democracy Now! - uncensored, anti-establishment news
Any HW that's so "finicky" that it won't necessarily run properly after a reboot cannot be relied upon, because reboots are sometimes either unpredicted, necessary or both. That machine is not worth "multimillion dollars", unless it's the subject of some kind of major R&D project - which I expect it wasn't. Fire the sysadmin and Sun tech, and replace them with a team that will make a machine that isn't a "high bit" away from failing the entire business.
And I say that as someone whose standards are to reboot Unix/Linux machines only on HW upgrades requiring a power cycle, or a kernel upgrade requiring restarting init.
--
make install -not war
>"It's a persistent myth: reboot your Unix box when something goes wrong or to clean it out."
I have been using and admin'ing Unix (and Linux) systems for over 22 years. I have *never* heard of such a rumor. I hear it all the time for MS-Windows boxes and MS-Windows servers. And many appliance-like boxes based on MS-Windows (like our horrible security camera system) have rebooting itself BUILT-IN. But for Unix/Linux???? I think not.
Lots of 1, 2, 3 ratings in this thread. Well the reasons to reboot a Unix server are many, but it's contextual. Some have mentioned the obvious answers, i.e., to test a redundancy or restore operation, to verify hardware integrity, to verify that software patches will stick after a reboot, etc.
Here's another good reason to reboot often from the HPC world, after every job that runs on compute nodes. HPC code bases are notorious for not exiting cleanly and heaven knows what residual processes or memory clogs are left on a node after a job runs. It's almost always good to run an epilogue script to clean up and reboot nodes after a job terminates. Hell, in some cases we completely re-image a node after a job to make sure things are clean.
So, yeah, this "never reboot a Unix box" attitude comes from people who build boxes that don't change (stupid practice in the modern vulnerability-a-week environment), or are just plain ignorant altogether. If you are building production Unix servers that are single machines with no redundancy and aren't rebooted often (once or twice a year at least) you're not going to be working for much longer. The first major failure will find you on the unemployment line.
The ability to identify system problems is an important skill; learn how to do it, don't just reboot hoping the problem will go away. Learn how to do system traces/tcpdump/etc. You should even learn how to do system traces on user processes so that you can tell them that the reason their application is failing is because it is trying to read a file that doesn't exist. Hand holding isn't in our job description, but it's better than rebooting for nothing. What do you put in your system downtime report? "Rebooted system, problem gone"? Steps taken to prevent problem from happening again, "Schedule Daily Reboots". No joke, there are "users" who think weekly reboots are a good idea. How do you manage over a thousand servers with weekly reboots, especially if there are dependencies? And why would anyone think that the admins are the only ones who are involved in having applications coming up properly? We don't manage the databases, if the DBAs screw up a script that causes a database to not come up properly, the UNIX admins can't control that. But when some user wants a server rebooted because they don't know what they are doing and the database doesn't come back up, we still have to hang around at 03:00 in the morning while someone pages the DBAs to fix it.
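The "file that doesn't exist" case above is easy to pull out of a system trace. A small Python sketch that scans saved strace output for calls failing with ENOENT follows. It assumes strace's usual one-line format, `call(args) = -1 ENOENT (...)`. The function name `missing_files` and the sample paths in the usage note are made up.

```python
import re

# A syscall line that failed with "No such file or directory"
CALL = re.compile(r'(\w+)\((.*)\)\s+=\s+-1 ENOENT')
# The first double-quoted argument, allowing backslash escapes
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def missing_files(strace_text):
    """Return (syscall, path) pairs for calls that failed with ENOENT."""
    missing = []
    for line in strace_text.splitlines():
        call = CALL.search(line)
        if not call:
            continue
        path = QUOTED.search(call.group(2))
        if path:
            missing.append((call.group(1), path.group(1)))
    return missing
```

Run something like `strace -f -o trace.txt yourapp`, then pass the contents of `trace.txt` to `missing_files`. Plenty of ENOENT results are harmless (library search paths, optional config files), so the list is a place to start reading, not a verdict.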
Enterprise Class systems are supposed to have high uptimes. You have redundant and hot pluggable adapters, disk drives, you can even dynamically rearrange processors and memory. You have multiple fabrics leading to your SAN storage, and your SAN systems have hot swapable adapters/disk drives/control units/power supplies and RAID storage. If you feel that it's okay to have frequent reboots, then you might as well be running inexpensive x86 servers without any redundancy. Personally, I prefer more sophisticated systems that rarely go down because of hardware, and if the OS has a bug that forces a reboot, I want the OS vendor to fix it!
If an application team that is lacking in "problem determination" skills asks you to reboot a Linux server, and their problem is still happening, and the Linux server is actually running under VMWare, don't be surprised if they ask you to reboot the entire VMWare server, and now you have just bounced all of the guest servers running on that VMWare server. Meanwhile the original problem was caused by a change made by the application team, and a reboot should not have been needed.
Can you imagine Air Traffic Controllers rebooting systems on a whim? There are plenty of important systems using computers that we would like to never have to reboot. We should be making that a goal instead of thinking reboots will always be needed at random times.
Never been in control of a unix server but my linux desktop needs rebooting all the time to fix issues it gets.
Troll is not a replacement for I disagree.
So rebooting is not good because it might not fix the underlying problem, well of course but that is the same with windows.
But it seems to me a good timesaver to at least reboot the first time you see an issue, and if it comes back then you know it is a recurring issue and that it might be worth spending hours fixing it.
And I really do not understand how you can even get a system to stay online for as long as unix veterans do. I have never used a system (windows or linux) that did not get more and more unstable over time (even routers). Any device I wanted to stay up all the time I would have automatically reboot occasionally.
Troll is not a replacement for I disagree.
The answer is it depends on the business. In my experience, it is not IT departments (and thus companies) which reboot machines once a week that have troubles, but companies which have certain machines that never are allowed to go down that have troubles.
In small shops where nobody with root can dodge responsibility you don't need sudo.
It's really just a way to track who does what AFTER THE FACT and not a security measure. In fact at times it's a security hole if you are not really careful.
If you are really careful and have a nice short list of what people can run as root it can work - but often with just a little more time you could tweak the permissions of the files so the right people can use them without going anywhere near root.
It's no security measure, just a "who the fuck did that yesterday?" sort of tracking tool useful if a lot of people need root access to a machine. Nobody with much *nix experience spends much time logged in as root unless they have to anyway.
Instead, have you considered that it's about people out of their depth in a different environment to what they are used to? Of course they make mistakes, and it pisses off people like the above poster who would prefer people to be competent at the job they are actually doing instead of a different one.
What system are you running that doesn't need a kernel security update?
I'd really like to know, I'd love to run such a system myself. All the systems that I run need kernel updates a few times a year, and thus need to be rebooted.
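On Debian and Ubuntu systems, package updates that need a reboot (kernel updates among them) typically leave a flag file at /var/run/reboot-required. A minimal sketch of checking for it, assuming that convention; other distributions do not use this file, and the function names here are only illustrative.

```python
import os

# Debian/Ubuntu convention (an assumption; other distributions differ).
FLAG = "/var/run/reboot-required"


def reboot_required(flag_path=FLAG):
    """Return True if the reboot flag file exists."""
    return os.path.exists(flag_path)


def packages_requesting_reboot(pkgs_path=FLAG + ".pkgs"):
    """Return the package names listed next to the flag, or [] if absent."""
    try:
        with open(pkgs_path) as fh:
            # One package name per line; drop blanks and duplicates.
            return sorted({line.strip() for line in fh if line.strip()})
    except OSError:
        return []
```

A cron job could call reboot_required() and mail the package list, so the reboot gets scheduled for a maintenance window instead of being forgotten.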
Well, since I'm running my servers in HA environments anyhow, reboot confirms that no non-kernel updates messed with boot configuration w/o any loss for availability. So, it's not so bad after all.
Joachim
People don't write Manifestos any more -- what's going on in this world? [Frank Zappa]
The title says it all: rebooting loses info that may tell you what went wrong. First diagnose the problem and find a solution. Then reboot.
The guys that are trying to run stuff while the system is down.
This silly argument is really just the inexperienced arguing with the experienced. The experienced will say "what happens if I can't get the thing back up after a reboot today" because they will have seen enough hardware problems to have experienced that. The inexperienced will not care or even have a clue they should.
For that reason I tend to wait until the end of the day for some systems. There's no reason for people to be sitting about waiting all afternoon for a server to come up just because I've rebooted it at lunch time and the file systems need four hours to sort themselves out.
Because it might not start up again? Is that a good enough reason for you? My workplace lost electricity for a couple of days a month ago and it took four solid days to properly replace a machine that didn't come up again. If you have enough machines something typically dies each year and it often manifests as a machine not turning back on after it has lost power or after a reboot.
That's why the experienced often only reboot when there's a time window to do so or something else to fall back on or if the machine isn't all that important anyway. The inexperienced do shit like having only a single MS Windows domain controller and rebooting that during peak working hours when the disk was listing a lot of hardware errors.
So yes I reboot stuff but if a dozen people need it within five minutes I'm not going to do so unless it really needs it.
Just about all the stuff I run gets shut down every six months anyway. That's when I'm ready to find problems and have time to deal with them properly.
The cult of gratuitous uptime has always bewildered me.
Eventually, you are going to run into a problem that you do not know how to solve on a live system. Even if you are arrogant enough to believe in your own omnipotence, eventually you are going to run into a problem that *cannot* be solved on a live system. And even if that never happens, a hardware failure is going to bring your system to a screaming halt one day.
Sometimes the problem you are trying to solve will take some time to understand, and you need a solid understanding to plan how to properly fix an ongoing problem. Sometimes you have to hack a quick fix to see you through whilst properly planning your long-term fix.
Sometimes, some clown screws up and takes something out that the live system needed. Sometimes, that clown is you.
If you have been concentrating on keeping your amazing uptime, you have probably been neglecting any verification that your server is even *capable* of correctly booting back up, if you are ever forced into a corner and need to bring the system down. Discovering this, and attempting to fix it mid-crisis, is negligence at its worst. And leaving it to chance based on the belief you are a *nix-God and couldn't possibly be wrong?
Being reasonably confident that you can shut down and bring up a system in a crisis is important. If you're not taking time to do this verification because you think that your excessive uptime demonstrates your hardcore *nix-guru-ness, you're just being negligent.
Yahoo!: the Filing Test
Reboot each server once a year to prove that won't cause a problem.
Would you care to post a link to a good guide to sysadmin for the professional developers yet amateur administrators here?
I'm not a lawyer, but I play one on the Internet.
My machine is unfortunately using Via chips. I've noticed that sometimes the network card will lock up. The only way of regaining network is actually powering down. A warm restart will not kick the chip back to life. Only a complete powerdown will do the trick.
That single reason is why some geeks are fond of Google as opposed to Microsoft.
ping microsoft.com
100% packet loss.
Google (and Yahoo), on the other hand, have their servers configured correctly. They "get" it.
I never heard of such a myth... Maybe the opposite!
While I agree that reboots are not a full solution, I also think that you cannot currently do live FS and memory checking, things which are really nice to do periodically.
We reboot servers on our (very small) installation.
The most obvious case is when I change the configuration of the server in a nontrivial manner and need to make sure the whole server is consistent. A reboot is the quickest way to find out if the changes stick and are consistent with the rest of the server software. It is better to test then than to wait for a reboot due to an unrelated problem (hardware failure, power cut, trip on the cable, etc.) and then have to figure out why the server is not working as expected.
Some of our servers running commercial software also have to be rebooted periodically, due to bugs in the software that clog the server to a useless state if left alone. We could shut down all offending processes, clean the server state, and restart them, but a supervised reboot will reset the state of the server in a reproducible manner.
Other than that, we reboot only if there is a security patch for the kernel that we have to apply or a critical firmware update.
These are probably not best practices, but work in our environment.
Paul Venezia explains why you should almost never reboot a Unix server,
Of course, that is not quite what the actual article really says. It's really just about breaking the habit that Windows "admins" have of using reboots as a means to clear up mysterious trouble. IMHO that habit marks one as an "operator" at heart rather than a sysadmin, but I digress...
In addition, even what Venezia does say goes a bit too far, stating that anything short of a security problem in the kernel proper shouldn't be an excuse for a reboot. In the real world, many cases of rebooting after non-kernel updates may be technically unnecessary, but as a practical matter the reboot is the safest approach. It may be theoretically possible to track down each changed shared library, kernel module, or config file, determine all of the running processes that depend on them, and manually shut down and restart each one in proper order, but attempting that is far more likely to cause extended dysfunction than a straight reboot after making major updates. There are also performance tuning issues that can force a reboot just to change a kernel parameter. Finally, on some platforms (notably Solaris) it is possible for pathological software (notably Oracle) to micro-manage memory in ways that over time cause the kernel's memory map to be an unmanageable jumble of tiny free, allocated, and locked blocks around a few big locked islands, such that the performance of an exec or a context switch is degraded significantly. The only ways out are a de facto reboot (take everything using memory down and then restart it all manually) or a real reboot. With a real reboot you can be sure to avoid human error and you get a POST cycle as well.
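The "track down each changed shared library" step can be partly automated on Linux. A minimal sketch, assuming a Linux /proc filesystem where a library file replaced on disk still shows up in /proc/<pid>/maps with a " (deleted)" suffix. The function names are illustrative, and it covers only shared libraries, not kernel modules or config files.

```python
import os
import re

# Matches a shared-object path such as libc.so or libssl.so.1.1
_SO_RE = re.compile(r"\.so(\.|$)")
_DELETED = " (deleted)"


def deleted_libraries(maps_text):
    """Return sorted shared-library paths that a maps listing marks as deleted."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # anonymous mapping, no pathname
        path = parts[5].strip()
        if path.endswith(_DELETED):
            path = path[: -len(_DELETED)]
            if _SO_RE.search(path):
                found.add(path)
    return sorted(found)


def scan_proc(proc_root="/proc"):
    """Map pid -> deleted libraries for every process we are allowed to read."""
    stale = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "maps")) as fh:
                libs = deleted_libraries(fh.read())
        except OSError:
            continue  # process exited or permission denied
        if libs:
            stale[int(entry)] = libs
    return stale
```

Run as root, scan_proc() gives a list of processes still holding old library code. Restart order and dependencies remain a manual judgment, which is the poster's point about a straight reboot often being the safer choice.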
There are two main reasons to reboot a server:
(1) because you don't know what's wrong, or
(2) because you do know what's wrong.
A good admin may not be told enough about a Windows system.
A poor admin may not be able to understand enough about a unix system.
OK then - it's actually even a hell of a lot worse than that. If you've really got something that isn't a complete lie where you have DOZENS of faulty changes to init scripts you are blaming one fairly innocent bit of process with a catastrophic failure in DOZENS of others.
Idiots that do not check their work and (your words) make DOZENS of mistakes really have nothing at all to do with whether it's a good idea to reboot or not.
If you really had a fucking clue what you were talking about instead of trying to build an unlikely argument out of bullshit you would know that init scripts can be tested without a reboot and you would consider those that don't do it on important machines idiots.
What sort of idiot changes an init script on anything of any importance without seeing if it works or not? If it does require finding a window to reboot then do it and none of this fucking made up bullshit of DOZENS of unchecked init scripts that contain mistakes. So what's that to get DOZENS then - over one hundred changed init scripts on one machine and a quarter of them with errors in them?
So yes - that scale indicates either bullshit or complaining about what is really a massive QA failure.
Since you've missed the point, let's look at the steaming disaster of vast incompetence (suddenly discover there are dozens of processes which were running which were never put into init/rc files properly) you are using as an example and then examine what you are pretending it proves.
Your example is people messing with scripts that determine what a system does on startup. It is universally agreed that it is a good idea to test such things properly and often with a test restart in all but extreme circumstances.
For some reason you appear to be saying that this example applies in situations where typically nobody thinks it is a good idea to test such things with a test restart.
You've also gone on and described a situation where the systems are so critical that they cannot be restarted yet not important enough for any attempt at getting the startup scripts right - what can you possibly expect other than cries of BULLSHIT!
The waffling about size is just a handwaving distraction in an attempt to look impressive. If you've got large numbers of machines you typically have both good practices in place (instead of the bullshit above about dozens of unexpected problems per outage) AND the leeway to properly get a change window on each affected machine to make sure the things are going to actually come up. You've probably also got a pile of machines identical apart from the hostname and network address and could afford to lose one for a few minutes to see if the init scripts work.
You've got an example where the things actually do need to be rebooted and just about everyone agrees as a bait and switch for some stupid argument to "reboot machines once a week".
With your IT shop from hell example rebooting everything once a week just in case is not going to do anything to save it - it's nothing but a faulty strawman argument that really proves a point about QA instead of what you are pretending it proves.
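The failure mode argued over above (processes that are running but were never set to start at boot) can be looked for without a reboot. A minimal sketch, assuming a systemd-based Linux host rather than the init/rc scripts discussed in the thread. The helper names are illustrative, and the result is a list of candidates to review, not a verdict, since static and template units (for example getty@tty1.service) will also appear.

```python
import subprocess


def unit_names(listing):
    """Extract service unit names (first column) from systemctl's plain output."""
    names = set()
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0].endswith(".service"):
            names.add(fields[0])
    return names


def running_but_not_enabled(running_listing, enabled_listing):
    """Services that are up now but are not marked to start at boot."""
    return sorted(unit_names(running_listing) - unit_names(enabled_listing))


def systemctl(*args):
    """Run systemctl with headerless plain output and return stdout as text."""
    result = subprocess.run(
        ["systemctl", *args, "--no-legend", "--plain"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def live_report():
    """Compare currently running services against those enabled at boot."""
    running = systemctl("list-units", "--type=service", "--state=running")
    enabled = systemctl("list-unit-files", "--type=service", "--state=enabled")
    return running_but_not_enabled(running, enabled)
```

Calling live_report() on a host shows which services would be missing after the next boot. It does not replace a test restart in a change window, which is still the only way to confirm the machine actually comes up.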
Regarding "a situation where the systems are so critical that they cannot be restarted yet not important enough for any attempt at getting the startup scripts right" - as I said in my original post, it can be that Monday to Friday are to some extent critical, and Saturday is not.
With regards to test reboots being done - sometimes they can be, but sometimes a new application can be put on a machine which has been around for a year, with many other developers using the machine, many dependencies and so forth. Thus in some situations, people would have to wait until the weekend for the machine to reboot.
Anyhow, this is how it is in some places. People have different opinions about the wisdom of such policies, such as yourself. I've said about as much as there is to say on the topic, anything more and I would be repeating myself.
Too often the problem is that the reboot is seen as a way to get the system "working" like it was before. But in reality the system isn't working right at all. Spending a day with a broken box to diagnose a recurring problem is usually the way to go. But too often the reboot hides any true root cause.
I've worked with a few clients that got mad when I fixed a misconfiguration. They didn't say anything but I could tell they had egg on their face since they had been blaming the vendor for the last five years. I was tired of rebooting.
Old but true: windows re-boot, unix be root ...
I just fired a client who insisted on monthly reboots of their Linux boxes (and mangled their udev, and insisted on using a VCD to boot single user because GRUB was 'dangerous' ...).
I won't stop anyone from being stupid, but I am damned sure not going to do it for them.
"No good deed goes unpunished"
Sometimes it's not hardware.
I just terminated a contract where the client insisted on mounting their SAN volumes in a way that made booting fail about half the time.
They had no idea what they were doing, but insisted on doing it in a bad/broken way and were more interested in posturing than solving problems.