HA Metrics on Non-clustered Systems?

Posted by Cliff on Friday September 21, 2001 @11:28AM from the you-do-the-math dept.

javester asks: "Has anybody given any thought to compiling metrics for high-availability for the different OSs on a high-availability non-clustered system? Is 'High-availability Windows' an oxymoron? Can you even get close to the 5 9s (99.999%, which is about 5 minutes of downtime a year) on a typical stand-alone Windows 2000 Server running IIS 5 with the typical patch-it-up, three-finger salute routine? If you are just serving up a web-site on a plain-vanilla Windows box, how highly-available can it get? By my calculations, with the typical reboot cycle being 3 minutes, and with a security patch requiring a reboot being released on a weekly basis - a stand-alone high-availability Windows box loses about 156 minutes a year just applying patches! So it can never get past the third 9! In *nix environments, reboots are not required as often (except for kernel changes - how APT!), since you can recycle the appropriate daemon without restarting. But really, has anybody made a formal study?" Now wouldn't this make an interesting college project?
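
The arithmetic in the question can be checked with a short sketch. It assumes a 365.25-day year (the question does not say which year length it uses), and the function names are made up for illustration:

```python
# Sketch of the submitter's patch-reboot arithmetic.
# Assumption: a year is 365.25 days; names here are illustrative only.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def availability(downtime_minutes_per_year):
    """Fraction of the year the box is up, given yearly downtime in minutes."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


def downtime_budget_minutes(target_availability):
    """Minutes of downtime per year allowed by a target availability."""
    return (1 - target_availability) * MINUTES_PER_YEAR


# One 3-minute reboot per week for security patches.
patch_downtime_minutes = 3 * 52

print(f"patch downtime:    {patch_downtime_minutes} min/year")
print(f"availability:      {availability(patch_downtime_minutes):.4%}")
print(f"five-nines budget: {downtime_budget_minutes(0.99999):.2f} min/year")
```

Weekly 3-minute reboots alone cost roughly thirty times the five-nines budget, which is why the result stalls at three nines.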

High Availability PC? by Detritus · 2001-09-21 12:29 · Score: 2

Regardless of the operating system, how can you make a high availability system out of a PC? PCs are designed to be cheap, not reliable or maintainable.
When I think of high availability, I think of systems with redundancy and hot backups that can switch to alternate hardware in a few seconds if a failure is detected.

--
Mea navis aericumbens anguillis abundat
Not to mention Windows 95 by danpbrowning · 2001-09-21 12:32 · Score: 2

Not to mention that Windows 95 had a bug where it would blue screen 47 days after installation, after doing absolutely nothing.

But yes, it would be a neat project if someone put in the resources to do it right.

--
Daniel
promised uptime on windows... by bcarlson · 2001-09-21 12:54 · Score: 2, Funny

I met with [insert name of huge telecom company here] this week. My boss and I discussed uptime with the regional salesperson. He discussed "five 9" uptime on Windows like it was nothing. Now, jump forward to yesterday afternoon when one of their engineers called me asking about our proposed configuration... I told him, and asked him what the approximate guaranteed uptime would be... he said about "89-92%". heh... gotta love engineers being honest.

sorry... just thought it was funny...

--

"...I'll need guns" --Chow Yun-Fat in 'Replacement Killers'
Where are the numbers for HA systems? by embobo · 2001-09-21 19:23 · Score: 3, Interesting

Where are the _real_ (not marketdroid) numbers for HA systems? My impression is that companies never want to give out uptime numbers. If you'd like to do any sort of study I would suspect that you can't just walk up to IBM and say: "Hi! I'm a college student. Can I have your HA stats?" It will take much negotiation and then years to collect the data. I.e.: not a college project unless your advisor is Ghod and can bend companies to his will and you start the study as a freshman.

"...reboots are not required as often, since you can recycle the appropriate daemon without restarting." Umm, you can? What gave you the impression that this was the case in general? Certainly for simple apps like sendmail or apache one can do this but any more complicated app is much more involved. For example, take the recent update of /.: can you update a Slash-based site merely by cycling (HUPing) a daemon? No, there was a huge database conversion involved.

Finally, it is going to be very hard to get statistically significant results with such a high degree of precision. For example, mon will track downtime. It says that my mailserver has been available 99.95% of the time over the past two years. I know that is wrong (and mon notes this as well). The server hasn't gone down at all but the method of measurement (snmp) has introduced error because its availability isn't 100%.
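
A rough sketch of that measurement problem follows. It assumes probe failures are independent of server failures and that every failed probe gets logged as downtime; neither assumption comes from mon itself, and the function names are hypothetical:

```python
# Sketch: how an imperfect probe distorts a measured availability figure.
# Assumptions (not from mon): probe and server fail independently, and
# any failed probe is recorded as server downtime.
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours


def observed_availability(server_avail, probe_avail):
    """Availability the monitor reports for a given true availability."""
    return server_avail * probe_avail


def phantom_downtime_hours(observed, years):
    """Hours of 'downtime' the monitor logged over the given period."""
    return (1 - observed) * HOURS_PER_YEAR * years


# A server that never went down, watched by a probe that works 99.95% of the time.
reported = observed_availability(1.0, 0.9995)
print(f"reported availability: {reported:.2%}")
print(f"phantom downtime over 2 years: {phantom_downtime_hours(reported, 2):.2f} hours")
```

Under these assumptions the probe's own error is already hundreds of times larger than a five-nines downtime budget, so the measurement cannot resolve the quantity being studied.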
1. Re:Where are the numbers for HA systems? by Tower · 2001-09-22 09:44 · Score: 2
  
  Well, these don't have the true numbers, but they do speak to how the numbers are calculated...
  IBM eserver x-series (x86) HA whitepaper
  i-series (AS/400)
  
  I haven't found great HA stuff for the z-series (mainframes) that I'm allowed to post :)
  
  --
  "It's tough to be bilingual when you get hit in the head."
Never Reboot? by pete-classic · 2001-09-22 06:14 · Score: 3, Informative

In *nix environments, reboots are not required as often (except for kernel changes[...])

I've never used it, but AFAIK you can even get around this with the HURD.

-Peter
Single System != High Availability by Wanker · 2001-09-22 13:18 · Score: 2

Discounting special environments such as Tandem (which require a LOT of rewriting applications to work) there is no way to get "high availability" out of any single system, for just the reasons you describe. Patches, hardware/software upgrades, OS panics, sysadmin stupidity, leaky roofs, twitchy circuit breakers, fiber-seeking backhoes, and a myriad of other minor and major catastrophes can't simply be put on hold. Hence, any single system is likely to have availability in the 80-95% range in a typical environment.

If you beef up things with advanced monitoring, very reliable hardware, well trained staff, good vendor support, and are generally very careful you can get a single system up to about the 99% range. This is about 87 hours of downtime per year-- enough for quarterly patches, bi-annual hardware work, and the occasional problem. Most of this is scheduled, and we're likely to only see about 12 hours of "surprises" a year, even if it's not a good year.

Yes, I'm pulling most of these numbers out of my ass, but my ass has been around the block a couple times. Feel free to fudge the numbers if your experience varies. Changing the 12 hours around won't make that much difference in the later calculations. (below)

Assuming 365.25 days per year, that 12 hours means we have a "no surprises" uptime of roughly 99.8 percent. Not hard if you're running things in a stable environment.

The only way to get those extra two nines is to start running things with spares. If you have a spare system and are careful not to build an environment where both can fail simultaneously (e.g. on the same power grid), the chance of avoiding a simultaneous surprise becomes 1 - (0.002 * 0.002) or .999996 (99.9996%). Voila! Five nines. (You don't multiply the original 99% since I'm assuming that you're smart enough not to schedule patches for the same day on both primary and standby systems.) In order to be able to multiply the probabilities together they do need to be truly independent of one another-- again, not on the same power grid, in the same building, etc.
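
The spare-system calculation can be sketched as below. It takes the post's own rounded figure of 0.2% unplanned unavailability per system and its independence assumption at face value; the function names are illustrative only:

```python
# Sketch of the redundancy arithmetic from the comment above.
# Assumptions: failures on the systems are fully independent, and planned
# maintenance is never scheduled on both systems at once (as the post says).
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours


def unplanned_unavailability(surprise_hours_per_year):
    """Fraction of the year lost to unscheduled outages on one system."""
    return surprise_hours_per_year / HOURS_PER_YEAR


def redundant_availability(per_system_unavailability, n_systems=2):
    """Probability that at least one of n independent systems is up."""
    return 1 - per_system_unavailability ** n_systems


exact = unplanned_unavailability(12)  # about 0.14%, which the post rounds to 0.2%
pair = redundant_availability(0.002)  # the post's rounded figure

print(f"single system, 12 surprise hours: {1 - exact:.3%}")
print(f"two independent systems:          {pair:.4%}")
print(f"meets five nines: {pair >= 0.99999}")
```

The result depends entirely on independence: any shared failure cause, such as a common power feed, puts a floor under the combined unavailability that squaring cannot remove.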

In short-- there's no way to get five nines from a single system. You need clusters of independent systems.