HA Metrics on Non-clustered Systems?
javester asks: "Has anybody given any thought to compiling metrics for high-availability for the different OSs on a high-availability non-clustered system? Is 'High-availability Windows' an oxymoron? Can you even get close to the 5 9s (99.999%, which is about 5 minutes of downtime a year) on a typical stand-alone Windows 2000 Server running IIS 5 with the typical patch-it-up, three-finger salute routine? If you are just serving up a web-site on a plain-vanilla Windows box, how highly-available can it get? By my calculations, with the typical reboot cycle being 3 minutes, and with a security patch requiring a reboot being released on a weekly basis - a stand-alone high-availability Windows box loses about 156 minutes a year just applying patches! So it can never get past the third 9! In *nix environments, reboots are not required as often (except for kernel changes - how APT!), since you can recycle the appropriate daemon without restarting. But really, has anybody made a formal study?"
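For anyone who wants to check the arithmetic in the question, here is a rough Python sketch. The function names are made up for illustration, and the inputs (one 3-minute reboot per week, a 365-day year) are just the submitter's assumptions, not measured data.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(downtime_minutes_per_year):
    """Fraction of the year the box is up, given yearly downtime in minutes."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


def downtime_budget(nines):
    """Yearly downtime (minutes) allowed at a given number of nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


def nines_reached(downtime_minutes_per_year):
    """Whole number of nines achieved for a given yearly downtime."""
    return math.floor(math.log10(MINUTES_PER_YEAR / downtime_minutes_per_year))


# Assumption from the question: one patch reboot per week, 3 minutes each.
patch_downtime = 52 * 3

print(f"patch downtime: {patch_downtime} min/year")
print(f"availability:   {availability(patch_downtime):.4%}")
print(f"nines reached:  {nines_reached(patch_downtime)}")
for n in (3, 4, 5):
    print(f"budget for {n} nines: {downtime_budget(n):.2f} min/year")
```

Under those assumptions the patch reboots alone use up 156 minutes, which fits inside the three-nines budget (about 526 minutes) but is well over the four-nines budget (about 53 minutes), so the "never past the third 9" claim holds.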
Now wouldn't this make an interesting college project?
Where are the _real_ (not marketdroid) numbers for HA systems? My impression is that companies never want to give out uptime numbers. If you'd like to do any sort of study, I suspect that you can't just walk up to IBM and say: "Hi! I'm a college student. Can I have your HA stats?" It will take much negotiation and then years to collect the data. I.e., this is not a college project unless your advisor is Ghod and can bend companies to his will and you start the study as a freshman.
"...reboots are not required as often, since you can recycle the appropriate daemon without restarting." Umm, you can? What gave you the impression that this was the case in general? Certainly for simple apps like sendmail or apache one can do this, but any more complicated app is much more involved. For example, take the recent update of /.: can you update a Slash-based site merely by cycling (HUPing) a daemon? No, there was a huge database conversion involved.
Finally, it is going to be very hard to get statistically significant results with such a high degree of precision. For example, mon will track downtime. It says that my mailserver has been available 99.95% of the time over the past two years. I know that is wrong (and mon notes this as well). The server hasn't gone down at all, but the method of measurement (SNMP) has introduced error because its own availability isn't 100%.
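To put a number on that measurement problem, here is a small Python sketch. It assumes, purely for illustration, that probe failures are independent of server failures, so the figure the monitor reports is the product of the two availabilities. The function names are hypothetical and have nothing to do with how mon actually works internally.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def measured_availability(server_availability, probe_availability):
    """What the monitor reports if probe and server failures are independent
    (an assumption made for this sketch)."""
    return server_availability * probe_availability


def apparent_downtime_minutes(measured, years):
    """Downtime the monitor attributes to the server over the given period."""
    return (1 - measured) * years * MINUTES_PER_YEAR


# The server never went down, but the monitor reports 99.95% over two years,
# so under the independence assumption the probe path itself was 99.95%.
reported = measured_availability(1.0, 0.9995)
phantom = apparent_downtime_minutes(reported, years=2)

# Five nines over the same two years allows only this much real downtime.
five_nines_budget = 1e-5 * 2 * MINUTES_PER_YEAR

print(f"reported availability: {reported:.2%}")
print(f"phantom downtime:      {phantom:.1f} minutes")
print(f"five-nines budget:     {five_nines_budget:.1f} minutes")
```

With these numbers the monitor's own error (roughly 526 minutes) is about fifty times larger than the whole five-nines budget for the same period (roughly 10.5 minutes), so a probe like this cannot tell a five-nines server from a three-nines one.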