Why You Shouldn't Reboot Unix Servers

Uptime by cdoggyd · 2011-02-21 05:46 · Score: 5, Funny

Because you won't be able to brag about your uptime numbers.

Re:Uptime by Anrego · 2011-02-21 05:55 · Score: 5, Funny

I once had to move my router (486 running slackware and with a multi-year uptime) across the room it was in. It was connected to a UPS, however the cable going from the UPS to the computer was wrapped through the leg of the table it was sitting on.
I actually _removed the table leg_ so I could haul the 486, still plugged into the UPS, across the room and quickly plug it in before it powered down!
And then we had the first really substantial power failure in years a few months later.. and the thing had to go down :(
But yeah.. now I reboot frequently to verify that everything still comes up properly.

Re:Uptime by Anrego · 2011-02-21 06:18 · Score: 4, Funny

I meant mains power.. due to a hurricane actually (hurricane Juan).
The machine came out fine (and actually still runs.. though I don't use it as a router any more). Those old drives are surprisingly robust ..
But yeah.. I was actually surprised.. and I did it more for the sake of the doing (the only reason I even left the machine going was because of the uptime). I'd never pull a stunt like that with a real machine :D

Persistent myth? by 6031769 · 2011-02-21 05:48 · Score: 5, Interesting

This is not a myth I had heard before. In fact, none of the *nix sysadmins I know would dream of rebooting the box to clear a problem except as a last resort. Where has this come from?

--
Burns: We're building a casino!
McAllister: Arrr. Give me 5 minutes.

Re:Persistent myth? by SCHecklerX · 2011-02-21 05:52 · Score: 4, Informative

Windoze admins who are now in charge of linux boxen. I'm now cleaning up after a bunch of them at my new job, *sigh*
- root logins everywhere
- passwords stored in the clear in ldap (WTF??)
- require https over http to devices, yet still have telnet access enabled.
- set up sudo ... to allow everyone to do everything
- iptables rulesets that allow all outbound from all systems. Allow ICMP everywhere, etc.

Re:Persistent myth? by afabbro · 2011-02-21 05:54 · Score: 5, Insightful

This is not a myth I had heard before.
+1. This article should be held up as a perfect example of building a strawman.
"It's a persistent myth that some natural phenomena travel faster than the speed of light, but at least one physicist says it's impossible..."
"It's a persistent myth that calling free() after malloc() is unnecessary, but some software engineers disagree..."
"It's a persistent myth that only the beating of tom-toms restores the sun after an eclipse. But is that really true?"

--
Advice: on VPS providers

Re:Persistent myth? by arth1 · 2011-02-21 06:04 · Score: 4, Insightful

Don't forget 777 and 666 permissions all over the place, and SELinux and iptables disabled.
As for "ALL=(ALL) ALL" entries in sudoers: Ubuntu, I hate you for ruining an entire generation of Linux users by aping Windows privilege escalation by abusing sudo. Learn to use groups, setfattr and setuid/setgid properly, leave admin commands to administrators, and you won't need sudo.
find /home/* -user 0 -print
If this returns ANY files, you've almost certainly abused sudo and run root commands in the context of a user - a serious security blunder in itself.
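For anyone who wants the same check outside the shell, here is a minimal Python sketch of roughly what the find command above does. The name files_owned_by and its uid parameter are made up for illustration. Unlike the /home/* glob, it also looks at dot-entries directly under the starting directory.

```python
import os


def files_owned_by(root_dir, uid=0):
    """Yield paths under root_dir whose owner has the given numeric uid.

    uid=0 is root. root_dir itself is not tested, only what is below it.
    """
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat looks at a symlink itself rather than its target,
                # which is also what find does by default.
                if os.lstat(path).st_uid == uid:
                    yield path
            except OSError:
                # The entry vanished or cannot be examined; skip it.
                continue
```

Something like list(files_owned_by("/home")) would give the list. Directories the caller cannot read are silently skipped, so run it with enough privileges if you want the full picture.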

Re:Persistent myth? by element-o.p. · 2011-02-21 07:30 · Score: 4, Interesting

As for "ALL=(ALL) ALL" entries in sudoers: Ubuntu, I hate you for ruining an entire generation of Linux users by aping Windows privilege escalation by abusing sudo.
Yeah, I agree with you in principle, although to be fair, there really isn't a way that Ubuntu could know what user account you are going to set up before you actually set it up, and therefore, there isn't really a way for Ubuntu to create an appropriate sudoers entry to give admin privileges to the server admin.

Learn to use groups, setfattr...properly...
Okay, agreed...

Learn to use...setuid/setgid properly...
Ugh...setuid and setgid, IMHO, should be used as little as possible. If there's a security hole in your app, then having it setuid/setgid lets a sufficiently skilled user gain elevated privileges. I'd much prefer to use sudoers to give access to specific apps to people I trust than give any user access to an app I "trust" through setuid/setgid.

...leave admin commands to administrators, and you won't need sudo.
Maybe I'm just missing something, but that sounds really stupid to me. While I'm a reasonably skilled Linux admin, I don't pretend to know everything, and maybe you can teach me something I've missed in my experience so far. If so, cool.
But from my perspective, sudo is an ideal tool for granting appropriate permissions as required to trusted individuals. Sudo logs the user name and command in the log files, so if someone is abusing sudo, you know. Sudo can e-mail failures to admin staff, so if someone is habitually trying to exceed their permissions, you know. Sudo allows pretty fine-grained access to users based upon group or user name, so you can easily allocate permissions as required (well, relatively easily, anyway) -- much more fine-grained than Unix User/Group/Other permissions would allow.
For example, with sudo you could allow senior admins (group: admin) and web developers (group: www-dev) read/write permissions to CGI script directories, junior admins (group: jadmin) read-only permissions and all other users (group: users) no access. Uh-oh...we've got four groups here: admins, jadmins, www-dev and users, so doing that with standard Unix permissions is going to be kind of difficult (admins could be members of the www-dev group I suppose, but I can imagine cases where group A might need permissions to a subset of files that group B owns, but shouldn't have access to another subset, which would really complicate things).
Sudo is a powerful tool, and just like all the other tools you mentioned, should be used appropriately as a component of overall system security.

find /home/* -user 0 -print
If this returns ANY files, you've almost certainly abused sudo and run root commands in the context of a user - a serious security blunder in itself.
Maybe. I see what you are saying, but as a counter-example, I sometimes run tcpdump from within my home directory when troubleshooting problems. tcpdump has to run as superuser, and I have a lot more faith in giving myself and other admins permission to run "sudo tcpdump" than running tcpdump setuid 0. Again, maybe I'm just missing something, but I really don't have a huge problem with tcpdump (or other admin tools) writing UID 0 data to an admin user's home directory.

--
MCSE? No, sir...I don't do Windows. Yes, I am an idealist. What's your point?

Uh.. no by Anrego · 2011-02-21 05:48 · Score: 5, Informative

I for one believe in frequent-ish reboots.

I agree it shouldn't be relied upon as a troubleshooting step (you need to know what broke, why, and why it won't happen again). That said, if you go years without rebooting a machine... there is a good chance that if you ever do (to replace hardware for instance) it won't come back up without issue. Verifying that the system still boots correctly is imo a good idea.

Also, all that fancy high availability failover stuff... it's good to verify that it's still working as well.

The "my server's been up 3 years" e-pene days are gone, folks.

Re:Uh.. no by Anrego · 2011-02-21 06:03 · Score: 4, Insightful

Maybe true if the box is set up then never touched. If anything new has been installed on it.. or updated.. I think it's a good idea to verify that it still boots while the change is still fresh in your head. Yes you have changelogs (or should), but all the time spent reading various documentation and experimenting on your proto box (if you have one) is long gone. There's lots of stuff you can install and start using that could easily not come up properly on boot.
And why are reboots bad? If downtime is that big a deal, you should have a redundant setup. If you have a redundant setup, rebooting should be no issue. I've seen a very common trend where people get some "out of the box" redundancy solution running... then check off "redundancy" on the "list of shit we need" and forget about it. Actually verifying from time to time that your system can handle the loss of a box without issue is important (in my view).

slashdot: *world link farmers by Anonymous Coward · 2011-02-21 05:48 · Score: 5, Insightful

i'm really tired of this semi-technical stuff on slashdot that seems aimed at semi-competent manager-types.

Counter point -- pre-emptive reboot by Syncerus · 2011-02-21 05:49 · Score: 5, Insightful

One minor point of disagreement. I'm a fan of the pre-emptive reboot at specific intervals, whether the interval be 30 days, 60 days, or 90 days is up to you. In the past, I've found the pre-emptive reboot will trigger hidden system problems, but at a time when you're actually ready for them, rather than at a time when they happen spontaneously ( 2:30 in the morning ).
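A rough sketch of how the interval check could be automated, assuming a Linux box where /proc/uptime is available. The function names and the max_days parameter are invented for the example; the threshold is whatever interval you settle on.

```python
SECONDS_PER_DAY = 86400


def parse_uptime_seconds(text):
    """Return uptime in seconds from the contents of Linux's /proc/uptime."""
    # The file holds two numbers; the first is seconds since boot.
    return float(text.split()[0])


def reboot_due(uptime_seconds, max_days):
    """True once the machine has been up for at least max_days days."""
    return uptime_seconds >= max_days * SECONDS_PER_DAY
```

On a live system you would feed parse_uptime_seconds the text read from /proc/uptime and let a cron job nag you when reboot_due comes back true, so the reboot still happens at a time you pick.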

--
"Man is nothing without the works of man" -- Helvetius

Of course you reboot, in controlled settings by pipatron · 2011-02-21 05:51 · Score: 4, Insightful

FTFA:

Some argued that other risks arise if you don't reboot, such as the possibility certain critical services aren't set to start at boot, which can cause problems. This is true, but it shouldn't be an issue if you're a good admin. Forgetting to set service startup parameters is a rookie mistake.

This is retarded. A good admin will test that everything works before it gets a chance to actually break. Anyone can fuck up, forget something, whatever. Doesn't matter how experienced you are. Murphy's law. The only way to test if it will come up correctly during a non-planned downtime is to actually reboot while you have everything fresh in memory and while you're still around and can fix it. Rebooting in that case is not a bad thing, it's a responsible thing to do.
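A reboot is the real test, but a cheap pre-check is to compare what is running now against what is set to start at boot. The sketch below assumes you already have two text listings with one service per line and the service name in the first column; the function names are hypothetical. On a systemd host such listings might come from systemctl list-units --type=service --state=running and systemctl list-unit-files --type=service --state=enabled, but that is an assumption about your setup, not something the article covers.

```python
def names_from_lines(text):
    """Return the first whitespace-separated field of each non-empty line."""
    return [line.split()[0] for line in text.splitlines() if line.strip()]


def not_enabled_at_boot(running, enabled):
    """Return, sorted, the services that are running but not set to start at boot."""
    return sorted(set(running) - set(enabled))
```

A hit is only a prompt to go look: a service can legitimately be running because something else pulled it in, so this narrows down where to check rather than replacing the reboot.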

--
c++; /* this makes c bigger but returns the old value */

What a load of BS by kju · 2011-02-21 05:52 · Score: 4, Insightful

I RTFA (shame on me) and it is in my opinion absolutely stupid.

There is actually only one real reason given and that is that if you reboot after some services ceased working, you might end up with an unbootable machine.

In my opinion this outcome is absolutely great. Ok, maybe not great, but it is important and right. It forces you to fix the problem properly instead of ignoring the known problems and missing yet unknown problems which might bite you in the .... shortly after.

Also: when services start being flaky on my system, I usually want to run an fsck. In 16 years of Linux/Unix administration I have quite a few times found that the FS was corrupted without an apparent reason and without anyone having noticed. So an fsck is usually a good thing to run when strange things happen, and to be able to run it, I nearly always need to reboot.

I can't grasp what kind of thinking it takes to continue running a server where some services fail or behave strangely. You could end up with more damage than would be caused by an outage when the reboot does not go through. You just might want to do the reboot at off-peak hours.

Ummm, that's a crap article by Sycraft-fu · 2011-02-21 05:53 · Score: 4, Insightful

More or less it is "You shouldn't reboot UNIX servers because UNIX admins are tough guys, and we'd rather spend days looking for a solution than ruin our precious uptime!"

That is NOT a reason not to reboot a UNIX server. In fact it sounds like if you've a properly designed environment with redundant servers for things, a reboot might be just the thing. Who cares about uptime? You don't win awards for having big uptime numbers, it is all about your systems working well and providing what they need and not blowing up in a crisis.

Now, there may well be technical reasons why a reboot is a bad idea, but this article doesn't present any. If you want to claim "You shouldn't reboot," then you need to present technical reasons why not. Just having more uptime or being somehow "better" than Windows admins is not a reason, it is silly posturing.

Virtualization to the rescue by Anonymous Showered · 2011-02-21 05:53 · Score: 4, Interesting

I run web servers for a few dozen clients, and rebooting a remote machine was always scary. There was the possibility that something might not boot up during startup (e.g. SSHd) and I would be locked out. I would then have to travel to my data center downtown (about 30 minutes away) and troubleshoot the problem. Since I don't have 24/7 access to the DC (I don't have enough business with the DC to warrant an owned security pass...) I have to wait until they open to the general clientèle in the morning.

With ESXi, however, I'm not that scared anymore. If something does go wrong, I have a console to the VM through vCenter client (the application that manages virtual machines on the server). It's happened once where a significant upgrade of FreeBSD 7.2 to 8.1 was problematic. Coincidentally, it was because I didn't upgrade the VMware tools (open-vmware-tools port). Nonetheless, I managed to fix the problem through vCenter.

This is why I love virtualization in general. It's making managing servers easier for me.

This is a myth? by pclminion · 2011-02-21 05:57 · Score: 4, Interesting

I've heard a lot of myths. I've never heard a myth stating "You need to reboot a UNIX system to fix problems." If anything I've heard the opposite myth. Who promulgates this shit?

I do remember ONE time a UNIX system needed a reboot. We (developer team) were managing our own cluster of build machines. The head System God was out of town for two weeks. We were having problems with a build host, and tried everything. Day after day. Finally, on the last day before System God was due to return, it occurred to me that the one thing we hadn't tried was to reboot the machine. The reboot fixed the problem, whatever it was.

I felt stupid. One, for not figuring out the problem in a way that could avoid a reboot. Two, for not recording enough information to determine root cause in a post-mortem analysis. Three, for configuring a system in such a way that a reboot might be required in order to fix a problem.

To this day I believe that reboot was unnecessary, although at the time it was the fastest way to resolve the immediate blocking issue.

Sometimes ... by PPH · 2011-02-21 05:57 · Score: 4, Funny

... the crap I read on Slashdot is so unbelievable, I have to reboot my laptop in the hopes that it will go away.

--
Have gnu, will travel.

New rule for Slashdot by aztektum · 2011-02-21 06:04 · Score: 5, Insightful

/. editors: I propose a new rule. Submissions with links to PCWorld, InfoWorld, PCMagazine, Computerworld, CNet, or any other technology periodical you'd see in the checkout line of a Walgreens should be immediately deleted with prejudice.

They're the Oprah Magazine of the tech world. They exist to sell ads by writing articles with grabby headlines and little substance.

--
:: aztek ::
No sig for you!!

Re:HP-UX says... by sribe · 2011-02-21 06:12 · Score: 4, Informative

Seriously. I don't know what HP is doing, but NFS hangs/stuck processes that you can't kill -9 your way out of is just wrong.

Kind of a well-known, if very old, problem. From Use of NFS Considered Harmful:

k. Unkillable Processes

When an NFS server is unavailable, the client will typically not return an error to the process attempting to use it. Rather the client will retry the operation. At some point, it will eventually give up and return an error to the process.
In Unix there are two kinds of devices, slow and fast. The semantics of I/O operations vary depending on the type of device. For example, a read on a fast device will always fill a buffer, whereas a read on a slow device will return any data ready, even if the buffer is not filled. Disks (even floppy disks or CD-ROM's) are considered fast devices.

The Unix kernel typically does not allow fast I/O operations to be interrupted. The idea is to avoid the overhead of putting a process into a suspended state until data is available, because the data is always either available or not. For disk reads, this is not a problem, because a delay of even hundreds of milliseconds waiting for I/O to be interrupted is not often harmful to system operation.

NFS mounts, since they are intended to mimic disks, are also considered fast devices. However, in the event of a server failure, an NFS disk can take minutes to eventually return success or failure to the application. A program using data on an NFS mount, however, can remain in an uninterruptable state until a final timeout occurs.

Workaround: Don't panic when a process will not terminate from repeated kill -9 commands. If ps reports the process is in state D, there is a good chance that it is waiting on an NFS mount. Wait 10 minutes, and if the process has still not terminated, then panic.
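To see which processes are actually stuck in state D before deciding whether to wait or panic, something like the following sketch may help. It assumes a ps that accepts -eo pid,stat,comm (procps on Linux does; other Unixes may want different options), and the function names are made up for the example.

```python
import subprocess


def uninterruptible(ps_output):
    """Return (pid, command) pairs for rows whose state column starts with D.

    Expects three columns (PID, STAT, COMMAND) with a header row first.
    """
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1].startswith("D"):
            hits.append((int(fields[0]), fields[2].strip()))
    return hits


def list_uninterruptible():
    """Run ps and return the processes currently in uninterruptible sleep."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return uninterruptible(result.stdout)
```

A process that shows up here once is not necessarily hung, since ordinary disk I/O passes through state D briefly. One that is still listed minutes later is the kind the workaround above is talking about.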

An Appropriate Hacker Koan by idontgno · 2011-02-21 06:25 · Score: 4, Funny

courtesy of Appendix A of the Jargon File.

Tom Knight and the Lisp Machine
A novice was trying to fix a broken Lisp machine by turning the power off and on.
Knight, seeing what the student was doing, spoke sternly: "You cannot fix a machine by just power-cycling it with no understanding of what is going wrong."
Knight turned the machine off and on.
The machine worked.

--
Welcome to the Panopticon. Used to be a prison, now it's your home.

Slashdot Mirror

21 of 705 comments