Solaris Machine Shut Down After 3737 Days of Uptime

So what did it do all that time? by cod3r_ · 2013-03-14 08:36 · Score: 5, Funny

A *nix machine being idle for 3737 days is not all that interesting.

Re:So what did it do all that time? by Anonymous Coward · 2013-03-14 08:45 · Score: 5, Insightful

Somewhere at my last job, there was a Solaris 8 machine with over 4000 days uptime, that everybody hated to do anything with, but one person loved it and refused to migrate the last service that was still on it to something more modern.

Uptime is irrelevant for an individual server, anyway. If there's fail over (and there should be if uptime is important), take it down and update the kernel for security reasons, who cares?
Re:So what did it do all that time? by crutchy · 2013-03-14 08:54 · Score: 5, Funny

If there's fail over (and there should be if uptime is important)
i agree... if you're responsible for a single server performing a mission critical function with no fail over, you may as well just fire yourself
Re:So what did it do all that time? by h4rr4r · 2013-03-14 09:04 · Score: 5, Insightful

Just get it in writing.
Been there, done that. When it has to come down for hardware failure or something like that, you can show you tried to get a backup machine and tried to do things right.
Re:So what did it do all that time? by Anonymous Coward · 2013-03-14 09:13 · Score: 4, Informative

No, it was idle "only" since day 3509 (served as a hot backup if we had to restore the service from the new machines).
Re:So what did it do all that time? by Anarchduke · 2013-03-14 10:11 · Score: 4, Insightful

an even more important part of your job than ensuring failover. that is, covering your ass.

--
who prays for Satan? Who in 18 centuries has had the humanity to pray for the 1 sinner that needed it most? ~Mark Twain
Re:So what did it do all that time? by kasperd · 2013-03-14 10:27 · Score: 5, Insightful

mission critical function with no fail over
It is surprisingly hard to guarantee data integrity when doing a fail over.

If you want to guarantee a system keeps operating and maintains data integrity when a single computer fails, you need at least another three computers that are still running with no failures. There is a mathematical proof for this.

If you want to go lower than four computers, you have to make assumptions about how the failures behave. And if just one computer fails in a way that does not match your assumptions, the system will fail.

If you do decide to go with the four computers required to handle a single failure, the protocols to ensure they agree on the current state of your data are quite complicated. The protocols have to be non-deterministic. That's another proven fact. No matter how many machines you throw at the problem, a deterministic protocol cannot handle even a single failure.

You can get around the non-deterministic requirement if you make assumptions about the timing of communication. But you'd slow down the system unnecessarily because you'd have to wait for the maximum time you assumed packet delivery could take on every operation, and if the network was slower than you assumed, the system would fail.

Knowing how difficult fail over can be, it is no surprise that sometimes it is decided to not bother with it and instead hire an operator, who you assume can make everything be ok as long as you have backups plus spare hardware ready to put in production.

--

Do you care about the security of your wireless mouse?
Re:So what did it do all that time? by rickb928 · 2013-03-14 10:54 · Score: 2

Yup. And running a full marathon is pointless and irrelevant - anyone could run 26.2 miles in a couple of months, half a mile at a time.

--
deleting the extra space after periods so i can stay relevant, yeah.
Re:So what did it do all that time? by dkf · 2013-03-14 12:02 · Score: 2

[There go the mod points]

Uptime is irrelevant for an individual server, anyway. If there's fail over (and there should be if uptime is important), take it down and update the kernel for security reasons, who cares?
Not all critical services are necessarily internet facing. I know of someone who had an application that ran continually for over 10 years, highly business-critical (master video stream controller for a TV network) and with very fancy hardware attached that it was tricky to replicate. The hardware was gradually updated over that decade, as was the code of the application (dlclose() FTW!)

--
"Little does he know, but there is no 'I' in 'Idiot'!"
Re:So what did it do all that time? by kasperd · 2013-03-14 12:20 · Score: 2

Excuse me? It's *expensive* yes, hard? Not really.
Most problems get easier to solve if you have a lot of money to work with. This one is no exception. But you still need software that can correctly execute a non-trivial protocol. A single software glitch could still take down the system when the same bug triggers on replicas simultaneously. Redundant systems have blown up due to replicas suffering from the same software glitch.

That means either you need the software to be bug free, or you let four different teams develop software according to the same spec and hope three of the teams get it right. And even if you have enough money, how do you make sure the people you are hiring are the right people for the task?

--

Do you care about the security of your wireless mouse?
Re:So what did it do all that time? by crontabminusell · 2013-03-14 12:25 · Score: 4, Funny

Somewhere at my last job, there was a Solaris 8 machine with over 4000 days uptime, that everybody hated to do anything with, but one person loved it and refused to migrate the last service that was still on it to something more modern.
Uptime is irrelevant for an individual server, anyway. If there's fail over (and there should be if uptime is important), take it down and update the kernel for security reasons, who cares?
It's like Cory Doctorow said in When Sysadmins Ruled the Earth:

“Greedo will rise again,” Felix said. “I’ve got a 486 downstairs with over five years of uptime. It’s going to break my heart to reboot it.”

“What the everlasting shit do you use a 486 for?”

“Nothing. But who shuts down a machine with five years uptime? That’s like euthanizing your grandmother.”
Re:So what did it do all that time? by AK+Marc · 2013-03-14 14:21 · Score: 2

Doesn't help. When my boss lied to my boss's boss (the owner), I had documented proof I was right. My choices were to ignore it, and take punishment for things I didn't do, or prove my boss wrong, defending my position and probably costing me my job. I stayed quiet, found a new job, and left a place where my boss would lie to sell me out.

Proof you are right doesn't help. It has to be shared and spread long before there's an issue, or you can still end up in an unwinnable situation.

--
Learn to love Alaska
Re:So what did it do all that time? by TheLink · 2013-03-14 16:02 · Score: 2

The problem is ksplice hasn't been around for more than 3737 days ;).

If you run everything on a "cluster" layer (your apps are not dependent on, or maybe not even aware of, the noncluster layer) then you won't have such problems - you can reboot a node with minimal impact. In the old days the ones famous for uptimes were Tandem and VMS.
--
- Too many replies beneath your current threshold
Re:So what did it do all that time? by Almost-Retired · 2013-03-14 16:11 · Score: 4, Interesting

I'd differ with that. I was fresh on the job, just 2 or 3 months, long enough to get the feeling I would be the scapegoat. The owner came in, and a deal the GM had made in a bar 2 weeks back hadn't worked out, and as the 3 of us were walking to the back of the garage to look at what we had, the GM tried to say it was all my idea.
Wrong, I skipped out in front, spun around and said this stops right here and now, I was just following orders. The owner looked at the GM, looked at me, gave a barely perceptible nod, and started walking again. I didn't get pushed to take the blame again, but I did get pushed in every other way it seemed.
Owners didn't get to be owners without a sense of who's right and who's wrong in boss/employee differences. Tell the truth even if you lose, because if you lose, that job was looking for somebody to do it when you walked in. I'd a hell of a lot prefer to stand my ground if I'm right, and admit it if I'm wrong, and I've done quite a bit of both in my 78 years. Honesty has paid off handsomely several times.
About 2 years later another situation came to a boil, and I was the first one called to the owner's office when he arrived. He wanted to know what it would take to fix it. I said 2 things: the gear these people are using is just plain worn out, it's been on the road non-stop for at least 5 years, and I can't get parts because the parts bills aren't being paid. I need 10 grand in parts, and I can't get a P.O. for more than $200 a month, COD. Hell of a way to run a train. Besides that, the technology has moved on. It's time to upgrade.
His next question floored me: he wanted to know if he needed a new GM. I had to say it looked like the GM was, at the end of the day, the biggest roadblock to making things run smoothly. Then he had another dept head paged, 3 all told in the next 30 minutes. Years later he said they all agreed with me, so we had a new GM by the next morning. That and $150,000 in new gear put out the fire. That GM didn't work so well either after a couple years, but that's another story I am not directly involved in. The 3rd one is a pussy cat and we sometimes get into very noisy arguments even now, just to entertain the troops. He's a decent man, a motivated manager, but in a war of wits with me on technical stuff, he is unarmed and knows it very very well.
Bottom line to this story is that I had already proved my worth from the 1st day on the job because they had about half the gear packed up to go back to the factory shop, expected 2 to 3 grand each for repairs with a 2 week turnaround time. I canceled that, unpacked them and handed in parts orders at about 10% of that per machine. All were back in service inside of 10 days, half that waiting on FEDEX or UPS.
So it was a question of who was worth more to the person who owns the place. I stayed there 18+ years, have now been retired for 11 years, and the owner and I are still friends.
Cheers, Gene
Re:So what did it do all that time? by crutchy · 2013-03-14 18:48 · Score: 2, Insightful

best way to justify the job you do is to create work for yourself
in IT you can covertly install a virus, which will have half your users begging to get things back up and running and the other half berating you for not doing your job
the last thing you want to do is increase your efficiency to the point where management thinks you are no longer required or that your role can be filled by a machine or some kid fresh out of school
or if you're a department of defense big brass knob, you need to justify spending billions of tax payer money, so you blow up 2 skyscrapers and scare the crap outta the public so they give you more money to go off and fight the world :)
Re:So what did it do all that time? by kasperd · 2013-03-14 20:13 · Score: 3, Informative

I would like to see this proof. I am not doubting you, I am simply curious about the mathematics involved.
I don't remember which paper the result was in, but I do remember the overall idea of the proof.

The general proof says that to handle t failures there must be at least 3t+1 nodes in total.

It is a proof by contradiction, so initially we assume the nodes can be split into three groups with each node being in exactly one of those three groups. And we assume that any two out of those three groups can reach a consensus without involving the third group. Now we'll prove that under those assumptions, the system breaks down.

So we imagine two completely functional groups out of those three, the network within each group is stable, but the network between them is slow. All the nodes in the third group suffer from a byzantine failure, which causes them to send corrupted messages. Imagine that the third group of failing nodes is still communicating with each of the functional groups, but it sends different information to those two groups. Under those circumstances the failing group along with one group of functional nodes can reach consensus, because we assumed two groups can reach consensus without the third. But at the same time the failing group can reach consensus on a different result with the other group of functional nodes.

In the above partitioning into three groups, we could have t nodes in each group, in which case it is proven that with t failures among 3t nodes we cannot reach consensus. Additionally there exist solutions that will reach consensus with t failures among 3t+1 nodes. They are randomized, which means runtime is theoretically unbounded, but the probability that the protocol will take forever is zero. On average it completes quickly. For example the Asynchronous Binary Byzantine Agreement protocol operates in rounds and has a 50% probability of finishing in a given round. If it fails to complete it will run another round and have a 50% chance of finishing there. The idea in that protocol is that if there are two candidate results to agree on with roughly the same number of nodes supporting each result, they flip a coin, and try to agree on using the result of the coin flip. Trying to agree on the coin flip can only fail if the coin suggested a result that was behind in the number of nodes supporting it. Hence there is at least a 50% chance the coin will land on a side that leads to agreement.
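To put numbers on the above, here is a minimal Python sketch of just the arithmetic, not of any agreement protocol. The function names are made up for illustration, and it assumes each round succeeds independently with probability exactly 1/2, whereas the argument above only gives "at least 50%".

```python
from fractions import Fraction


def min_nodes(t: int) -> int:
    """Smallest cluster size that can tolerate t byzantine failures (3t+1)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 3 * t + 1


def max_faults(n: int) -> int:
    """Largest number of byzantine failures an n-node cluster can tolerate."""
    if n < 1:
        raise ValueError("n must be positive")
    return (n - 1) // 3


def prob_done_within(rounds: int, p: Fraction = Fraction(1, 2)) -> Fraction:
    """Chance of finishing within `rounds` rounds, assuming every round
    succeeds independently with probability p (an assumption, see above)."""
    return 1 - (1 - p) ** rounds


if __name__ == "__main__":
    for t in range(1, 4):
        print(f"t={t}: need at least {min_nodes(t)} nodes")
    for r in (1, 2, 5, 10):
        print(f"done within {r} rounds: {float(prob_done_within(r)):.4f}")
    # Mean of a geometric distribution with success probability p is 1/p.
    print("expected rounds at p=1/2:", 1 / Fraction(1, 2))
```

Under those assumptions the chance of still running after k rounds is (1/2)^k, which is why the runtime is unbounded in theory but short in practice.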

The byzantine failure model is a bit extreme, but that means protocols designed to work in that model are resilient to extreme failures. The stop dead model on the other hand is a bit unrealistic. Which means protocols designed to work in that model are only proven correct under unrealistic assumptions. They may work in practice most of the time. But the proof of correctness isn't valid in the real world. I don't know if anybody has managed to come up with a sensible model that lies somewhere between those two.

--

Do you care about the security of your wireless mouse?

Oracle sucks. by RocketRabbit · 2013-03-14 08:37 · Score: 5, Insightful

I'd just like to leave this here. Yeah, I know Linux is great and everyfink, but Solaris is excellent and better in some ways. Oracle really ground my gears when they stopped supporting OpenSolaris and OpenIndiana is going nowhere fast.

RIP Sun.

Re:Oracle sucks. by Drinking+Bleach · 2013-03-14 08:42 · Score: 2

Oracle never supported OpenIndiana, it's a distribution of illumos (the OpenSolaris fork).
Re:Oracle sucks. by tnk1 · 2013-03-14 08:52 · Score: 5, Informative

I don't think his comment suggested anything else. You should probably parse it like this:
(Oracle really ground my gears when they stopped supporting OpenSolaris) && (OpenIndiana is going nowhere fast)
Oracle support only applies to the Left Side of the statement. The point of the statement was to suggest that with support gone, and the only alternative to the supported version going nowhere, the Solaris world is completely Shit Out of Luck.
Re:Oracle sucks. by unixisc · 2013-03-14 09:12 · Score: 2

I've never really understood why Oracle had to steal RHEL's distro and rebrand it as its own, when they had a perfectly good OS in Solaris which existed not just on SPARCs, but on x86s as well. As for OpenIndiana, I don't get the point of that project since it doesn't support SPARC, and there is a plethora of OSs for x86.
Re:Oracle sucks. by ebno-10db · 2013-03-14 10:05 · Score: 5, Insightful

VMS isn't a Unix
So I've heard, but I believe it is a "computer operating system". Hence I thought it was a more appropriate comparison than to a bicycle.

and I don't believe you can get ahold of VMS any more
Then into the memory hole!

The IBM mainframes are too expensive
For whom? To operations like banks, for whom downtime is incredibly expensive, they're still worth it. For me, an UltraSPARC like the 280R breaks the piggy bank. I get my x86 hardware from other people's castoffs.

and not open source
As you pointed out, OSS Solaris is toast.

What's your point exactly?
Umm, that some other OS's are/were at least as reliable as Solaris. Was I being that obtuse?
Re:Oracle sucks. by jgarry · 2013-03-14 10:13 · Score: 2

VMS isn't a Unix, and I don't believe you can get ahold of VMS any more. The IBM mainframes are too expensive and not open source, so there's no point in comparing them to Solaris.
What's your point exactly? My point is that Solaris is useful, even in its somewhat dodgy state (thanks Oracle for the paid update program you fucks).
You can still get a hobbyist license: http://www.vmshobbyist.org/faq.php?cat_id=3
Back in the '90s, a VMS magazine pointed out that the POSIX implementation was good enough to say "VMS is better unix than unix."

--
Oracle and unix guy.
Re:Oracle sucks. by rogueippacket · 2013-03-14 10:27 · Score: 2

I believe this originated from two places; first, the constant stream of FUD from Microsoft that only a commercial OS could provide the uptime required for important applications, and second, the time-honoured tradition of rebooting your Windows boxes as the first step in troubleshooting.
Re:Oracle sucks. by Reschekle · 2013-03-14 10:45 · Score: 4, Insightful

OK, first off, it is not stolen. You cannot steal open source software. Oracle is following the GPL.
Second, Oracle was doing OEL before they acquired Sun.
Solaris is a technically good and high quality OS but its hardware support was limited. If you bought the Sun-branded boxes and Sun-branded cards, you were OK. However, if you were white-boxing a server, you had to be careful to select chipsets that were on their compatibility list. Even then, support got murky at that point.
I really, really love Solaris, but let's face the facts. Outside of the SPARC platform, there is no reason for Solaris. Linux does everything as well or nearly as well. Linux is weaker in some areas, but not weak enough to justify the cost and lock-in of Solaris.
Solaris exists for Oracle to milk legacy customers on support contracts who aren't ready or willing to migrate to Linux and commodity x86 hardware. There isn't much if any new development going on, and Oracle is only pushing Solaris to new customers as part of their big data warehouse solutions (where customers have $$$$$ and want to spend it with one vendor) where they want to get people locked in to one vendor.
Re:Oracle sucks. by mlts · 2013-03-14 11:41 · Score: 3, Interesting

I will say that AIX is pretty good as well. In general, unless there is a show-stopper patch, or one installs a driver like EMC PowerPath that requires a reboot due to the hooks in the kernel, one can keep AIX up for a long while, only really bothering to update and reboot when the latest tech level is released, and if there are no security specific issues, even that can be ignored, although it is wise to keep up on new firmware stuff just in case.
Re:Oracle sucks. by tompaulco · 2013-03-14 12:56 · Score: 2

time-honoured tradition of rebooting your Windows boxes as the first step in troubleshooting.
Laws!, how I hate this debugging technique. There are some people that I have worked with who would observe an issue with a program, completely skip reading any of the logging information, and jump straight to rebooting the machine. Fortunately, I try to write my applications to recover gracefully, so when the machine comes back up, the services start up and before long, the application is right back to where it was before, working on the same piece of data and complaining in the log about it.

--
If you are not allowed to question your government then the government has answered your question.
Re:Oracle sucks. by aix+tom · 2013-03-14 17:14 · Score: 2

I always thought they were implemented as "dude" and "man" in English. Something Like:
Oracle really ground my gears when they stopped supporting OpenSolaris, dude, OpenIndiana is going nowhere fast, man.
Re:Oracle sucks. by Reschekle · 2013-03-14 19:14 · Score: 2

You sound like you've never had to mediate a 3-way vendor argument between the hardware vendor, the linux support vendor and the HBA vendor, all claiming that it's someone else's fault that the hardware/OS/HBA combo doesn't work, even though they assured you that it did before you bought it. Oh, and you're a top customer, so they want to keep you really really happy. But not "it actually works" kind of happy.
I was talking about Solaris on Intel. Not sure where you got this from. In fact, you kind of reinforce my point: running Solaris on other people's hardware is just asking for trouble like this, which WAS MY POINT!
Anyway, if you really want to play this game, I'd be more than happy to pull up some of the various support cases I've opened with Sun and Oracle about Solaris and get you to try to explain each one to me, since you're such a bigot for Solaris.

Spoken like someone who has never run them side by side in a large estate
I have and do at my current gig, so shut your pipe. I have also done software development for both. So I know the ins and outs of both OSes pretty damned well. I even admitted to you that Solaris is the better OS, yet you're still not happy with my response. Spoken like a true fanboy you have.
There is one advantage to Solaris and it is the vendor lockin, namely the fact that you have limited hardware combinations, whereas with Intel you have literally hundreds of combinations of chipsets and configurations to support and test against. Again, this is one reason why I do *NOT* recommend Solaris on Intel unless you're buying one of Oracle's own Intel servers where they have actually tested the OS and the hardware together.

Or who want to run large databases with more than a few GB of RAM. Small databases seem fine on Linux, but Oracle or DB2 or Sybase don't seem as stable on Linux as on Solaris when you get up to 32+GB RAM.
I'm running rather large databases, much larger than that, with OEL on Dell branded hardware. It just works for us.

Yes, and it's a crying shame. Oracle are terrible. I know of whole enterprises who jumped ship when Oracle took over. But Linux still isn't there, even though most places I'm working at these days have more Linux than Solaris, HP-UX or AIX, and usually more than all other Unix combined. But there's still too much of the amateur around Linux.
Your only indictment against Linux is that you have some specific hardware issue, that's not really convincing and pretending that Solaris doesn't have those issues is just plain laughable and absurd.
Re:Oracle sucks. by Tom · 2013-03-14 23:33 · Score: 2

Actually, last I checked Linux can not show you an uptime of 3737 days.
No, that's not a dig on Linux being unstable. The real reason is both more boring and more interesting at the same time. A Linux system with that kind of uptime would have to be running a kernel from a time where the uptime counter overflows after around 400 days.
And yes, I've seen that happen. :-)
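For the curious, a quick Python sketch of where a limit like that can come from. It assumes the uptime is derived from an unsigned 32-bit tick counter advancing at 100 ticks per second; both figures are assumptions here, not something stated above, and the helper name is made up. With those numbers the wrap lands at about 497 days, in the same ballpark as the "around 400 days" above.

```python
SECONDS_PER_DAY = 86_400


def wrap_days(counter_bits: int, ticks_per_second: int) -> float:
    """Days until an unsigned tick counter of the given width wraps to zero."""
    return (2 ** counter_bits) / ticks_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed values: 32-bit counter, 100 ticks per second.
    days = wrap_days(32, 100)
    print(f"32-bit counter at 100 Hz wraps after {days:.1f} days")
    print(f"3737 days would span {3737 / days:.1f} wraps")
```

A faster tick rate shrinks the window proportionally: at an assumed 1000 ticks per second the same counter would wrap in under 50 days.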

--
Assorted stuff I do sometimes: Lemuria.org

T'ain't nothin... by Anonymous Coward · 2013-03-14 08:37 · Score: 2, Funny

Last place I worked at still used token ring. Packet-Packet-Give baby!

Last message in system log was . . . by StefanJ · 2013-03-14 08:39 · Score: 5, Funny

. . . Mar 12 11:57:03 hedvig kernel:WILL I DREAM?

Re:Last message in system log was . . . by cbiltcliffe · 2013-03-14 10:06 · Score: 3, Funny

I was thinking:
Mar 12 11:57:03 hedvig kernel: So long, and thanks for all the bits.......

--
"City hall" in German is "Rathaus" Kinda explains a few things......

taken down early as a precaution by Anonymous Coward · 2013-03-14 08:42 · Score: 4, Funny

In another 57 years the uptime command might've had rollover issues.
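One guess at where the 57 comes from, and it is only a guess: if uptime were kept as seconds in a signed 32-bit integer, it would overflow after 2^31 seconds, roughly 68 years. A minimal Python sketch of that arithmetic, with a made-up helper name and the counter width as an assumption:

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25


def years_until_overflow(uptime_days: float, value_bits: int = 31) -> float:
    """Years left before a seconds counter with `value_bits` usable bits
    (31 for a signed 32-bit integer, an assumption) overflows."""
    limit_days = (2 ** value_bits) / SECONDS_PER_DAY
    return (limit_days - uptime_days) / DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"{years_until_overflow(3737):.1f} years left at 3737 days of uptime")
```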

This is news? by Fished · 2013-03-14 08:44 · Score: 5, Interesting

I work at a Very Large Company (who must remain nameless.) We've got Solaris boxes that were last rebooted in the 90's. Yes. Really. Running Solaris 2.6, even.

--
"He who would learn astronomy, and other recondite arts, let him go elsewhere. " -- John Calvin, commenting on Genesis 1

Re:This is news? by bobbied · 2013-03-14 09:21 · Score: 5, Insightful

I work at a Very Large Company (who must remain nameless.) We've got Solaris boxes that were last rebooted in the 90's. Yes. Really. Running Solaris 2.6, even.
I am not surprised. I've seen Sparc/Solaris boxes run for very long times and even when not properly cared for have run times measured in months and years. I've had to shut down boxes to move them that had been running for 5 years. We were scared to death the disk drives would not spin back up after 2 days in the truck, but when we plugged them back in, they powered right back up. Sun built some SOLID hardware and produced a SOLID operating system.

--
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
Re:This is news? by amicusNYCL · 2013-03-14 10:20 · Score: 4, Funny

I work at a Very Large Company (who must remain nameless.) We've got Solaris boxes that were last rebooted in the 90's. Yes. Really. Running Solaris 2.6, even.
I'm willing to hazard a guess who you work for. Let's see.. you're running servers that have an OS that was released in 1997, and apparently you haven't rebooted them since. Almost like your company is stuck in the mid- to late-90s. You're the only Slashdotter I've seen with an AOL instant messenger screen name in their profile. That can't be a coincidence. You work for AOL. They have you designing the latest Free CD labels.

--
"Our two-party system is like a bowl of shit looking at itself in a mirror." - Lewis Black
Re:This is news? by shafty · 2013-03-14 10:40 · Score: 3, Funny

Interesting, I left a Very Large Company in the late 90's after having set up a few Solaris 2.x machines for our R&D projects. I had a Quake server running on one of them. There was a lot of incentive to keep that server up.
Re:This is news? by Grog6 · 2013-03-14 10:42 · Score: 4, Interesting

Amazingly enough, in my experience, two days in a truck is not nearly as bad as a few weeks in an extremely temperature-controlled, vibration free room.
The heads will weld to the platter if there's no vibration or movement after "spinning themselves flat" over many years' time.
Apparently, all the micro-projections on the surface of the heads and disks get worn off over time, making the disk and heads Extremely flat; they stick like glue when the air barrier between them escapes over time.
Thermal changes and ambient vibration are apparently enough to keep things 'fluid', and not as likely to stick.
YMMV.

--
Truth isn't Truth - Guliani
Re:This is news? by dbIII · 2013-03-14 12:37 · Score: 2

It's well known with bearings and is actually due to stuff from one polished surface diffusing into the other. The smoother the surface the greater the chance of it happening.

Uptime fetish by DNS-and-BIND · 2013-03-14 08:44 · Score: 2, Insightful

I will never for the life of me understand the "uptime fetish" that uneducated sysadmins have. Who the hell cares? The only people who give a crap about this sort of thing are linux fanbois. The only thing this tells me is that this machine has had an uninterrupted power supply, which is mildly impressive. Otherwise it's a Solaris box which is missing A SHITLOAD OF PATCHES. WTF, sysadmins? What kind of pro sysadmin worships at the altar of individual machine uptime? Much less a Solaris sysadmin?

--
Shutting down free speech with violence isn't fighting fascism. It IS fascism!

Re:Uptime fetish by FileNotFound · 2013-03-14 08:47 · Score: 3, Insightful

Funny because you're right - "Impressive UPS" is all I thought.

--
In Soviet Russia, the television watches YOU!
Re:Uptime fetish by tepples · 2013-03-14 08:47 · Score: 3, Insightful

Otherwise it's a Solaris box which is missing A SHITLOAD OF PATCHES.
Apply a patch to a service and restart the service, not the whole computer. Or what am I missing?
Re:Uptime fetish by Anonymous Coward · 2013-03-14 08:49 · Score: 3, Funny

Boy, you must be fun at parties.
Re:Uptime fetish by Richard_at_work · 2013-03-14 08:54 · Score: 3, Insightful

Impressive if you can do that on the kernel and still be confident of stability.
Re:Uptime fetish by guruevi · 2013-03-14 09:00 · Score: 5, Informative

You can get patches, even kernel patches, without having to restart the system. That was one of its selling points back in the day; some systems even allowed you to hot-swap or hot-upgrade CPUs and memory.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Re:Uptime fetish by Anonymous Coward · 2013-03-14 09:09 · Score: 5, Insightful

If you don't care, you don't understand history. And sadly, looking at your attitude and phrasing, I got a feeling you're older than I and should know it better.
That you understand it's not worthy of worship is a mark in your favor -- but not as big as you're hoping.
It's not fanboyism. It's from the old cult of service. From taking your limited resources on a system that costs more than your pension, and absolutely positively guaranteeing they were available to your userbase.
We didn't all have roundrobin DNS, sharding, clouds in the early 2000's.
Some of us had Sun's, BSD's, Vaxen, and other systems that might be missing security fixes, but that by and large were secure as long as you made sure nobody that didn't belong on it had an account.
Kernel and driver patches? It might be a performance boost, it might be a security patch. It might be a driver problem that could cause data loss, but only if you were running a certain service. A great admin can choose which are needed. A good admin knows they should apply them all.
There's something to be said about rebooting machines -- just to make sure they'll still boot. But the best sysadmins didn't need to check -- they knew.
Uptime differentiated us from our little brothers running Windows, who couldn't even change network settings without a reboot. Who had to restart every 28 days or crash horribly. Who could be brought to a grinding halt with a single large ICMP request.
In short, uptime was an additional proxy variable for admin competence (given the presence of an unrooted box).
Yeah, any idiot could leave a system plugged into a UPS in a closet and have it come out OK. But if you didn't get cracked and filled with porn, you were doing something right.
Given elastic clouds, round robin DNS, volume licensing, SAS... it's very nearly cheaper to spin up a new image and run the install scripts than reboot these days.
I'm not convinced this makes modern sysadmin practices better -- just more resilient to single-host failure.
Just the other week we had a million dollar NAS go down for nearly 12 hours (during the week) while applying a kernel update to the cluster.
If you did that in 99 on a Unix system, you'd have probably been shot after the execs showed you out the door.
Somehow, the cult of service availability has been replaced with the cult of 'good enough'.
Re:Uptime fetish by Bacon+Bits · 2013-03-14 09:10 · Score: 4, Insightful

You have no idea if the system can start from a cold boot. And if it fails to start from a cold boot, you have no idea which of the hundreds of patches you've applied in the last 10 years is the one that is causing the boot process to fail, or if it's hardware that's randomly gone sketchy. The last known-good cold state is 10 years ago.
Power systems fail. Backup power is limited. Buildings get damaged and remodeled. For these reasons it is unwise to assume you will never need to power a system off. Even with the super hotswapping of the VAX you would occasionally need to move the system to a different building with new server rooms. If you never demonstrate that a server can safely power back on to a running state, you have no idea what state the system will be in when you do it.
Consider the system in this article for a moment. The last service was removed last year. Why was it left powered on? It was literally doing nothing but counting the seconds until it was shut down today. That's a disgusting waste of power.

--
The road to tyranny has always been paved with claims of necessity.
Re:Uptime fetish by arth1 · 2013-03-14 09:12 · Score: 3, Interesting

The old adage holds true: Iffen ain't broke, don't fix it.
If the machine is in an area where security is important, certain security patches might be needed. But that's no certainty. Other patches - well, with an uptime of 10+ years, adding a stability patch which causes downtime seems rather counter-productive.
Then, experienced sysadmins, which you clearly are not, know that like the most dangerous time for an airplane is during takeoff and landing, the most dangerous time for a server is during shutdown and start. Stiction on old drives, minor internal power surges during boot that don't affect a running system, and much else can cause problems.
Oh, and there are also services that you may want to provide 24/7 with no downtime at all, so help you cod. You even mention one such in your nickname. But I have strong doubts whether you truly have kept that service up and running 24/7, even with failovers, if you install patches and reboot just to install patches and reboot.
Re:Uptime fetish by Anonymous Coward · 2013-03-14 09:34 · Score: 5, Informative

The summary is misleading. It was acting as a backup server for its own replacement.
Re:Uptime fetish by bobbied · 2013-03-14 09:40 · Score: 3, Interesting

It's not like Sun has issued very many Solaris 2.6 patches in the last few years...
Besides... Many Solaris patches simply didn't require a full reboot. In fact, unless you are changing the kernel, there was no reason to because it just takes longer. Then there is the mission critical system on an isolated network, where you take an "If it ain't broke, don't fix it" approach. Who cares what patches are on or not? The system just needs to work, day in and out, sans patches.
Windows users amaze me with all the "got to reboot the box" they put up with. Install software? Reboot! Install new drivers? Reboot! Things start to slow down for unknown reasons? Reboot! I simply don't believe that it should be necessary to reboot a box very often. Reboots should not be required unless you are changing hardware and have to actually power it off or need to change parts of the memory resident portions of the operating system (i.e. the booted kernel image). Windows is getting better about this, but you still need to reboot it way too often for all the "recommended" patches to get installed.

--
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
Re:Uptime fetish by DougOtto · 2013-03-14 10:38 · Score: 2

Not in the same sense. Solaris was a dynamic kernel. Most "kernel patching" was done at the module level. Modules could be unloaded, patched and reloaded without taking the box down. Most of the time, that worked.

--
Solving Unix problems since 1989...
Re:Uptime fetish by DerekLyons · 2013-03-14 11:00 · Score: 3, Interesting

Then, experienced sysadmins, which you clearly are not, know that like the most dangerous time for an airplane is during takeoff and landing, the most dangerous time for a server is during shutdown and start. Stiction on old drives, minor internal power surges during boot that don't affect a running system, and much else can cause problems.
On the other hand, I worked on a system for the US Navy that controlled Trident-I missiles... we rebooted both of our main computers every six hours to ensure that we could reboot them when needed - and the first one after midnight included an extensive hard drive self test to make sure it was working to spec. The gentleman down thread has it right, the answer to 100% uptime is redundancy and failover or switchover, not relying on nothing ever going wrong.

In addition, you seem to be unclear on the difference between a reboot and power cycling... In the latter case, if you're worried about stiction and power surges, that's an indication that you should have been thinking about replacing the machine for quite a while rather than hoping nothing ever goes wrong. Because eventually, something will - and when that happens, now you've potentially got two problems... the one that brought the machine to its knees, *and* the undiscovered ones because you've never rebooted or cycled power.
Re:Uptime fetish by arth1 · 2013-03-14 11:30 · Score: 2

if you're worried about stiction and power surges, that's an indication that you should have been thinking about replacing the machine for quite a while rather than hoping nothing ever goes wrong
More likely, someone should have thought of that long before the hardware became legacy. When a new sysadmin comes aboard, the best that can be done for legacy systems is often to keep spares and backups, and try not to trigger any faults. The software might not be supported, and the cost of porting can run to millions.
You're lucky if you've never had to support legacy systems. And a company that has them is lucky if they don't get a new sysadmin who first thing causes downtime by well-meaning patching that isn't needed.

Here's the real question... by jayhawk88 · 2013-03-14 08:58 · Score: 5, Interesting

Did they power it back up again after shutting it off? Just to see?

Surprised no one posted this yet by yakatz · 2013-03-14 08:58 · Score: 2, Funny

http://xkcd.com/686/

Re:Surprised no one posted this yet by jupiterssj4 · 2013-03-14 09:03 · Score: 3, Funny

Or this uptime-related one http://xkcd.com/705/

Netware 3.12 by slaker · 2013-03-14 08:59 · Score: 4, Interesting

One of my clients had a Netware 3.12 machine on site that operated continuously for about 16 years. It was retired unceremoniously when they moved to a new location, but that machine did not in all its life have a hardware fault or abend.

--
-- I wanna decide who lives and who dies - Crow T. Robot, MST3K

Re:in other news ... by ebno-10db · 2013-03-14 09:06 · Score: 5, Insightful

a slab of concrete has been found with an uptime of 3737 years

You exaggerate. The oldest concrete structure I know of is the dome of the Pantheon, and that's only been around for 1887 years. Time will tell if it was well built.

a terrible disturbance the /src by Thud457 · 2013-03-14 09:10 · Score: 4, Funny

Netcraft confirms Bill Joy just felt a chill like someone walked on his grave.

hey, that's three jokes there, take your pick.

--

the preceding comment is my own and in no way reflects the opinion of the Joint Chiefs of Staff

Not a good thing!!!! by onyxruby · 2013-03-14 09:11 · Score: 5, Insightful

Last place I was at that had server admins that bragged about /years/ of uptime quickly turned into a discovery that we had thousands of servers that had not been patched in years. Only a few systems can patch the kernel without rebooting and those are the exception, not the rule. It turned into a six month project but in the end we were patching systems that were vulnerable to 5 year old exploits (mix of *nix and Windows).

I had to make the argument that server uptime meant jack, and to make it I put forward the argument that the only thing that mattered was /service/ uptime. Frankly it is the service that needs to be always available, not the server. This is why you have maintenance windows, for the explicit purpose of allowing a given system to be patched and rebooted at a predictable time without interrupting services.

If your server is really that important it will have a fail over server for redundancy (SQL cluster, whatever). If your server isn't important enough to have a failover server for service redundancy then it isn't so important that you can't have a maintenance window. Think service, not server!

The only thing that matters is service availability.
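
A rough sketch of the "service, not server" point, in Python. It assumes a hypothetical two-node failover pair and made-up maintenance windows, given as (start, end) minutes from some arbitrary starting point; none of the names or numbers come from this thread. The service only counts as down when both nodes are down at the same moment.

```python
def overlap_minutes(a, b):
    """Minutes during which both (start, end) windows are in effect."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start)


def service_downtime(primary_windows, standby_windows):
    """Total minutes with both nodes down at once.

    Assumes each node's own windows do not overlap one another.
    """
    return sum(
        overlap_minutes(p, s)
        for p in primary_windows
        for s in standby_windows
    )


# Hypothetical schedule: each node is rebooted for an hour, at different times.
staggered = service_downtime([(0, 60)], [(120, 180)])
# Hypothetical bad schedule: the two windows overlap by half an hour.
overlapping = service_downtime([(0, 60)], [(30, 90)])
print(staggered, overlapping)
```

With staggered windows each server's uptime counter resets, yet the service never goes away, which is the whole argument for patching on a schedule.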

Re:in other news ... by bobbied · 2013-03-14 09:13 · Score: 2

maybe the sysadmins liked them but as a developer i hated solaris boxen. the libraries were always years old, nothing modern would compile, the cli tools were slightly incompatible with linux scripts, ...

They may be a pain to write and deploy programs on but they will run forever once you do...

Fully characterized platforms take a LOT of testing effort, and testing at this level takes lots of time. The Sparc/Solaris platform was behind the state of the art, but it was stable, stable, stable. Solaris on X86 wasn't bad, if your hardware was supported and you didn't really need the GUI to be local, but it wasn't as stable (mainly due to the hardware).

Sun did their stuff right for the most part, but got seriously hurt by Linux (Red Hat in particular) and in the long run couldn't make reliability pay well enough. Who wanted to buy new when the old stuff was still humming without a reboot 5 years later? Not me.

Got to love that sun blue...

--
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101

And they shut it down? by prisoner-of-enigma · 2013-03-14 09:48 · Score: 3

Kevin Flynn was trapped in there!

--
In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky

What better relational query language? by tepples · 2013-03-14 09:56 · Score: 2

MySQL can die for all I care. SQL likewise. Horrible language.

What language would you prefer to query a relational database?

Re:Truly Impressed by iggymanz · 2013-03-14 10:30 · Score: 2

HP calls it OpenVMS now, their big Itanium boxes can run it, and Alpha version still supported till 2016:

http://h71000.www7.hp.com/openvms/openvms_supportchart.html

Re: RIP Sun by DougOtto · 2013-03-14 10:34 · Score: 2

I'm not. Our Dice overlords installed a "log in with FB" link for creating accounts. Had I known it was going to stick that stupid icon on everything I'd have spent the extra 30 seconds typing stuff in.

--
Solving Unix problems since 1989...

Re:in other news ... by cusco · 2013-03-14 10:51 · Score: 2

I used to work at a place that had AIX, OS2, and NT4 servers, and one frelling Solaris Sparc server. Filthy Sun box froze and needed rebooting (sometimes by pulling the power cord because it was so frozen) more than all the other servers combined. We celebrated its retirement by throwing it off the roof into the swamp.

--
"Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin

Re:in other news ... by Score+Whore · 2013-03-14 11:15 · Score: 2

If you had a problem with POSIX compatibility on Solaris, it's because you don't know Solaris. There are specific paths you should specify for the various POSIX standards, /usr/xpg4, /usr/xpg6, etc. You might try "man -s 5 POSIX" for a start.
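
A minimal sketch of what "specify the paths" could look like from a script, written in Python for illustration. The `bin` subdirectories, the ordering, and the helper name `posix_first_path` are assumptions, not something stated in the comment; the helper just moves whichever of those directories exist to the front of a PATH string.

```python
import os


def posix_first_path(current_path, candidates, exists=os.path.isdir):
    """Return current_path with any existing candidate dirs moved to the front."""
    parts = [p for p in current_path.split(":") if p]
    front = [d for d in candidates if exists(d)]
    rest = [p for p in parts if p not in front]
    return ":".join(front + rest)


# Assumed locations of the standards-conforming utilities; order is a guess.
XPG_DIRS = ["/usr/xpg6/bin", "/usr/xpg4/bin"]

# On a box without these directories the PATH comes back unchanged.
print(posix_first_path(os.environ.get("PATH", ""), XPG_DIRS))
```

This only changes which utilities a PATH lookup finds first; it does nothing about what `/bin/sh` itself is.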

Re:in other news ... by KiloByte · 2013-03-14 12:54 · Score: 4, Informative

/usr/xpg4/something is not /bin/sh, the latter being what POSIX requires.

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.

My /. user number? I'm honored! by Shag · 2013-03-14 13:25 · Score: 5, Funny

I know, I know, it's just a coincidence...

--
Village idiot in some extremely smart villages.

Re:3737 days in years by hcs_$reboot · 2013-03-14 13:41 · Score: 3, Insightful

Thanks, not so many people know that there are 3650 days in 10 years, especially the geeks here on /.

--
Slashdot, fix the reply notifications... You won't get away with it...
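
For anyone who wants to check the arithmetic, a quick Python sketch. The shutdown date is an assumption (the day the story was posted), so the derived boot date is only approximate; the 365.25 divisor is the usual average-year shortcut.

```python
from datetime import date, timedelta

UPTIME_DAYS = 3737
# Assumption: the machine was shut down on the day the story was posted.
ASSUMED_SHUTDOWN = date(2013, 3, 14)

# Uptime in years, using an average year length of 365.25 days.
years = UPTIME_DAYS / 365.25
# Days beyond the naive "10 years = 3650 days" figure.
extra_days = UPTIME_DAYS - 3650
# Calendar date that many days before the assumed shutdown.
approx_boot = ASSUMED_SHUTDOWN - timedelta(days=UPTIME_DAYS)

print(f"{years:.2f} years")
print(extra_days)
print(approx_boot.isoformat())
```

The date subtraction handles leap days on its own, which the flat 3650 figure ignores.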

AT&T 3B20D's by stox · 2013-03-14 14:30 · Score: 2

I don't know this for sure, but I suspect there is one out there with 30 years of uptime now, or damn close to that, running Unix-RTR as part of a 5ESS switch.

--
"To those who are overly cautious, everything is impossible. "

Re:"boxen" by bobaferret · 2013-03-14 14:46 · Score: 2

I never could make up my mind on the whole "boxen" thing. Some days it was irritating enough to kill over. Other days it would just slip out, like "pop" instead of "coke" from the lips of a southerner forced to live in Chicago for too long. At a minimum it does seem to show one's age though...

Re:in other news ... by AthanasiusKircher · 2013-03-14 16:07 · Score: 2

a slab of concrete has been found with an uptime of 3737 years

You exaggerate. The oldest concrete structure I know of is the dome of the Pantheon, and that's only been around for 1887 years. Time will tell if it was well built.

Umm, who cares about what "you know of"? What matters is historical fact. The Colosseum, for example, contains large amounts of concrete and was finished a half-century before the Pantheon. Lots of concrete was used in rebuilding after the great fire in Rome in the mid first century as well. But, of course, Roman concrete was around for centuries before that.

And yet, all of this is irrelevant, since concrete was used in Egypt, Syria, China, and other places thousands of years earlier. There are in fact concrete columns in Egypt that are still standing and have been dated to roughly 3600 years old. There are examples of floors and other smaller structures that have been discovered elsewhere that are much older. Romans perfected the materials and used them on a huge scale, but the basic idea of concrete is much, much older.

Re:*nix does not need to reboot for more updates u by serialband · 2013-03-14 23:41 · Score: 3, Informative

Kernel updates generally required reboots even in the unix/linux world. In Windows, you could also avoid a reboot if you stopped the services that were being patched and restarted them after the patch was applied.
