Uptime Realities in the Internet World
schnurble writes: "My former boss has written an interesting article on the realities of uptime in the Internet World. It poses the idea that four and five nines of reliability are too expensive to be realistic, especially in the post dot-bomb economy. It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."
Wouldn't you know it, an article about uptime...and slashdotted. Looks like he needs a mirror.
Uptime Realities in the Slashdot-linked World
There is no longer anything that can be done with computers that is nontrivial and clearly legal. -- Paul Phillips
Five-nine reliability in the airline industry would mean that we'd see a major commercial jetliner crash about every other day.
No, thanks.
How many engineers out there have heard the marketing / sales 'it has to be always available' and priced out an infrastructure accordingly?
Even recently I'm working with a customer who wants a compromise between price and availability - but it still needs five nines.
Availability is infrastructure plus process. You need to have the supporting process to go along with the hardware - maintenance schedules, change management (well FCAPS in general), etc. It's not just a big box.
said if i can get this mentioned on slashdot, i'll get the raise after all...
"My former boss"
;-)
Nice, and you go after your ex-boss by getting his article slashdotted!
The boss didn't do for, though. :(
Like the Telco... voice grade telco. Better than the power company.
Our web server does about 4 9's, which is a downtime of about 8 hours a year, I think. I really suck at math though. I mean it.. I'm so bad at math I have no idea if that's right. I said "well there's 8760 hours in a year, so 8 divided by that is 0.0009, so that's about 4 9s." I think. 8 hours of downtime isn't that bad. I think the next step up from 8 hours of downtime is essentially those megacorps that have redundant systems, and sirens go off and people die when their server goes down for under a second. In fact, I think if their server actually went down for more than a second, some sort of structural damage to the building hosting it is the only likely scenario. Course, that's closer to 7 9s. I can't figure out how long any of the other 9s are cause I only knew what our average downtime is, and could do the math that way only. Wow, it's really hot in here.
Could someone with an 8th grade math education please post the amounts of downtime 1 through 9 9s are, please?!
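For anyone who wants to check the figures themselves, here is a minimal sketch of the arithmetic, assuming a 365-day year and that "N nines" means an availability of 1 - 10^-N. The function names are made up for illustration. By this arithmetic three nines comes to about 8.76 hours a year, so roughly eight hours of downtime is closer to three nines than four.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds; leap years ignored


def downtime_seconds_per_year(nines: int) -> float:
    """Yearly downtime budget for an availability written with `nines` nines."""
    unavailability = 10.0 ** (-nines)  # 3 nines -> 0.001, 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


def availability_label(nines: int) -> str:
    """Render the availability as a percentage, e.g. 4 -> '99.99%'."""
    if nines == 1:
        return "90%"
    decimals = "." + "9" * (nines - 2) if nines > 2 else ""
    return "99" + decimals + "%"


def humanize(seconds: float) -> str:
    """Pick a readable unit for a duration given in seconds."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.3f} seconds"


if __name__ == "__main__":
    # One row per number of nines, 1 through 9.
    for n in range(1, 10):
        budget = humanize(downtime_seconds_per_year(n))
        print(f"{n} nines  {availability_label(n):>12}  {budget} per year")
```

Run as a script, it prints one line per number of nines with the yearly downtime budget in the largest unit that fits.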
slashdot: where everyone yells sarcastic metaphors to themselves to understand the issue
We should just give up on decent service and professionalism. I don't think so.
... It's not unrealistic ... don't expect people to live with downtime just because a good portion of those systems need to be rebooted on a regular basis (Win machines), and general retardness of sysadmins around the world allow things like Nimda and Code Red to get out of hand. This is an excuse to let companies too cheap to have decent customer support off the hook. Maybe if they were educating their tech staff instead of finding more ways to rip us off, they'd have decent service.
My ISP (Ameritech) seems to think so, considering my DSL connection and their promptness to "Get ahold of me within 24 hours..."
Bleh
Everyone with competent sysadmins on rock solid *nix systems raise your hands...
Yeah, I'm glad the local nuclear power plant decided to save money and only go with 3 nines of reliability.
Beta sux! Join the Slashcott! http://hardware.slashdot.org/comments.pl?sid=4760465&cid=46173047
One of my clients (a government agency) runs a system that attaches to the federal LEIN database. They use it to pull arrest records, whether a person has a concealed weapons permit, etc. when someone is pulled over. This system is considered essential and requires 100% uptime (achieved through multiple failovers) since officers' lives are on the line when the system is down.
Go away, or I will replace you with a very small shell script.
I think we just knocked his server down to two nines by slashdotting it.
What else would motivate someone to post an ex-boss' e-mail address on the front page of slashdot?
I guess you don't have a pacemaker.
Some things ARE that important, most things aren't.
Let's see...five nines would be just over five minutes of downtime in a year (315 seconds). For business and other non-life-threatening situations, that would be way better than necessary. Lots of folks are probably going to harp on the "If 1 out of 10,000 airplanes crashed, there'd be X crashes" line of argument. There's a problem with that...one mistake doesn't crash an airplane. Every system on an airliner is redundant, and virtually any "pilot error" has time to be fixed before there's a problem. Listen in on the Air Traffic Control to Cockpit transmissions sometime...just about every flight encounters some minor error at some point, whether it is a pilot needing to reask for a clearance or someone needing to climb or descend a bit to clear a potential collision. Errors are unavoidable. The key is to ensure recovery from those errors is possible. So sure, your computer may be down for 5 minutes a year. Make sure you have a backup system that is able to take up the slack instantly, and (assuming the two fail independently) your downtime is down to about 3/1000 of a second a year. Redundancy is the key.
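The redundancy arithmetic can be sketched in a few lines. This assumes the primary and the backup fail independently and that failover is instant, which real systems rarely manage; the function name is hypothetical. Under those assumptions two five-nines boxes in parallel are both down for roughly three thousandths of a second a year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # leap years ignored


def combined_unavailability(availabilities):
    """Fraction of time ALL systems are down at once.

    Assumes independent failures and instant failover, so the service
    is only down when every redundant system is down simultaneously.
    """
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability
    return all_down


if __name__ == "__main__":
    single = combined_unavailability([0.99999])
    pair = combined_unavailability([0.99999, 0.99999])
    print(f"one server:  {single * SECONDS_PER_YEAR:.2f} s of downtime per year")
    print(f"two servers: {pair * SECONDS_PER_YEAR:.4f} s of downtime per year")
```

Correlated failures (shared power, shared network, the same bad patch on both boxes) break the independence assumption, so treat the result as a best case.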
It all depends on what is on the server. If it's stuff your own people use constantly on their job, through your own network, you need five nines, otherwise you will take the blame for critical jobs getting done late.
But when people are going to the server through the internet, they get used to interruptions - there are so many links between, some of which periodically become overwhelmed with traffic, that no one could tell the difference between two nines and five nines on your server itself. So sales & product information sites don't need more reliability than you can readily afford. They do need high capacity.
And if it's your blogs concerning your navel lint - no one's looking at your uptime but you...
The results from uptime.netcraft.com seem a bit hokey (sp?) at times since it does not take load balanced web servers, network outages, etc. into consideration. In my case, I had a server down for just under a day to be rebuilt and brought it back up... checked the NetCraft results a couple of days later, and it didn't show that the site had gone down.
I know there are some projects/sites that will allow people to submit uptimes sent from cron jobs or agents to a server, which then stores the uptime data there. Of course, that doesn't mean that you can't just generate junk data (e.g. a 999 day uptime with 2934 users).
Five nines uptime is cheap and easy. It all boils down to where you put the decimal point.
Obliteracy: Words with explosions
Five-nine reliability in the airline industry would mean that we'd see a major commercial jetliner crash about every other day.
No, thanks.
Point taken. Somehow, IMO reasonable reliability in the software and hardware industry is ridiculously expensive. I guess it wouldn't be too bad if one were willing to trade off performance (speed) for reliability rather than requiring speed and reliability.
I'd be happy to get consistent two nine or better reliability from my ISP!
That's sad.
I hope that LEIN database is on a network that has a better record for delivering the data within a few seconds than the public internet. It doesn't matter how good your server is if your IP traffic got crowded off the net by 3,000 nerds downloading pr0n...
My company (a large-ish, surviving Internet Retailer) has internally announced a Six Sigma Initiative. I'm wondering if we'll need to maintain 5 9s uptime...
i'll settle for seven of nine.
It is the percent of the time that the server is up, i.e. 99.99% is 4 nines of reliability and up 99.99% of the time (I am assuming that x 9s refers to the total number of 9s in the percentage, not just those to the right of the decimal point).
100% uptime is virtually impossible, so the holy grail is as close as possible--99.999%
If you want to learn about uptime, don't bother going to codesta.com. Their servers have already melted from a brutal slashdotting. According to Netcraft, codesta.com runs Linux and has 74 days of uptime... until today!
cpeterso
Heh... Switzerland....
Some factors that preceded the recent collision between the Tu-154 and the DHL Boeing 757 were:
- Traffic warning system was in its scheduled 10-minute maintenance - dispatcher got no warnings
- Busy phone lines to dispatch - German dispatch was not able to get through to Swiss dispatch to tell them about the dangerous situation...
This is an example that cost a lot of lives...
(another tragic circumstance was that the pilots of the Tu-154 gave priority to dispatch commands over the commands of the collision avoidance system...)
The "five-nines" of reliability has nothing to do with an individual server being available, but with an individual application. This means you can have 2-3 servers running the same load-balanced application. This way, you can take 1 down every hour if you want, as long as the other one or two are still working, and the application is still working. If you're REALLLLLLLLY lucky, you will meet the "five-nines" and if you're EXTREEEEEMELY lucky, you'll get 100% on that application.
THAT is the goal. It's called redundancy. You will *not* meet any reliability milestones on a single server or network link. It's an obtainable goal, but it does cost money depending on your architecture.
Not true. Five 9s in the airlines means that you'd see an airliner late or in some other way unavailable - possibly due to a crash, but not likely - every other day. Reliability is the availability to do what you need, when you need it. If a server is up 100% of the time, but is not able to be accessed because the network is down, the system is not reliable for you.
-- Two men say they're Jesus. One of them must be wrong. - Dire Straits
but their server is down.
Hey freaks: now you're ju
with M$, it is theoretically impossible as well to achieve their advertised up-time; ( i think back when they ran some ad (still running?) about how windows can achieve three or four 9s of uptime).
Total bullshit... let's see -- windows machine *requires* reboot every time you apply a patch; a reboot on a large machine is... i dunno, 10 minutes if you got a lot of crap. security update turns up about twice a week or so... that puts up to be ~99.8% MAXIMUM;
even if you don't buy my numbers, three 9s uptime means every week you only get ~10 minutes of downtime.
yeah... sure... not if you want to patch up that internet explorer / IIS so your system does not die from DoS, hackers, or worms!
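The patching arithmetic is easy to check. A minimal sketch, taking the figures above as assumptions (two patch reboots a week at about ten minutes each) and using hypothetical function names: twenty minutes a week works out to roughly 99.8% uptime, while a 99.9% target leaves about 605 seconds a week and 99.999% leaves about 6 seconds.

```python
MINUTES_PER_WEEK = 7 * 24 * 60        # 10,080
SECONDS_PER_WEEK = MINUTES_PER_WEEK * 60


def uptime_percent(downtime_minutes_per_week: float) -> float:
    """Best-case uptime if the only downtime is planned reboots."""
    return 100.0 * (1.0 - downtime_minutes_per_week / MINUTES_PER_WEEK)


def weekly_budget_seconds(nines: int) -> float:
    """Downtime allowed per week for an availability with `nines` nines."""
    return SECONDS_PER_WEEK * 10.0 ** (-nines)


if __name__ == "__main__":
    reboots_per_week = 2      # assumed patch frequency from the post
    minutes_per_reboot = 10   # assumed reboot time from the post
    planned = reboots_per_week * minutes_per_reboot
    print(f"uptime with {planned} min/week of reboots: {uptime_percent(planned):.2f}%")
    for n in (3, 4, 5):
        print(f"{n} nines allows {weekly_budget_seconds(n):.1f} s of downtime per week")
```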
My life in the land of the rising sun.
Maybe your phone call to 9-1-1 should be the one that happens during the five minutes of downtime?
Too Bad that a lot of the servers on the top 50 uptime list still have the default page that apache provides.
I'm sure it isn't too difficult to keep them running - just make sure the power is on and the network cable is plugged in.
Yeah, I'm glad the local nuclear power plant decided to save money and only go with 3 nines of reliability.
I wonder if the internet will glow in the dark after the nuclear power plant's webserver melts down? Hey, maybe that is where Trolls come from.
Hate standing in the meat locker (server room)? Hate rushing to work past midnight to cycle a server?
The problem I used to have is I'm not a morning person so being available as an admin before 7am is tough, but now I can admin my network while trapped in rush hour traffic. =] Reboot servers, telnet into devices, stop/start services, add users, manage DNS... the list goes on and on.
Uptime can be maintained without even having to leave the comfort of your easy chair. If you're an admin you should check this product out.
SonicAdmin by sonicmobility
(http://www.sonicmobility)
WURD!!
Let me give you a hypothetical case. One of our clients does about $50k/month on their web site. When the site was built, they were only expecting $10000-$15000/month. At the time, NN4 compatibility wasn't important, because the extra cost ($10k) wasn't going to be worth it. With NN4 sitting between 5% and 10% each month, they have decided that NN4 compatibility is important in the next version.
When we launched, 3 days of downtime a month was considered okay. It was considered a better choice than spending an extra $5k on hardware for redundancy. Well, when the site broke $40k/month, we immediately decided that that was no good and invested in the redundancy.
The site has had a few 15 minute outages over the past 6 months, and a 1 day outage over a holiday weekend (not a big deal). However, if the site doubles in revenue again, downtime is becoming less acceptable, and we'll drop $10k to avoid it.
If your site sucks and no one visits, downtime doesn't matter. If you are making lots of money, downtime does matter. $10k on hardware is worth it if the downtime would cost you $25k?
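The trade-off can be put into rough numbers. This is only a sketch under a big simplifying assumption: revenue is spread evenly over a 30-day month, so lost revenue is proportional to downtime. The function names are hypothetical. With the figures above, 3 days of downtime on a $40k/month site costs about $4,000 a month, so $5k of redundant hardware pays for itself in a little over a month.

```python
def monthly_downtime_cost(monthly_revenue: float,
                          downtime_days: float,
                          days_in_month: int = 30) -> float:
    """Revenue lost per month, assuming sales are spread evenly over time."""
    return monthly_revenue * downtime_days / days_in_month


def payback_months(hardware_cost: float,
                   monthly_revenue: float,
                   downtime_days_avoided: float) -> float:
    """Months until the avoided downtime has paid for the extra hardware."""
    saved = monthly_downtime_cost(monthly_revenue, downtime_days_avoided)
    return hardware_cost / saved


if __name__ == "__main__":
    for revenue in (10_000, 40_000, 50_000):
        lost = monthly_downtime_cost(revenue, downtime_days=3)
        months = payback_months(5_000, revenue, downtime_days_avoided=3)
        print(f"${revenue}/month: about ${lost:.0f} lost, "
              f"$5k of hardware pays back in {months:.1f} months")
```

Real sales are not evenly spread (an outage over a holiday weekend may cost little), so the proportional model is an upper-level estimate at best.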
Alex
Simply put, 4 9's of reliability would mean 99.99% uptime (only down for .01% of the time).
"Perl 6 gives you the big knob" -- Larry Wall
Actually, even this is silly. True five nines availability on a widely distributed network would mean that an application was available at all times on all segments of the network. Which would mean that your uptime depends not only on your redundancy on one side of a pipe, but on your overall redundancy as well, so that when a pipe goes down you're still accessible. Since when a pipe goes down in your host you probably lose other resources as well (such as power or alternate pipelines), this means multiple datahouses owned by multiple vendors. Each of these has to have a perfect backup of all data and be running the same versions of all software. Really, the only true redundancy would be so heavily distributed that each local network would basically have to have its own server. This isn't so crazy -- technically, DNS and email do this. However, we all know that for an end user even DNS and email can have perceived outages.
And this is why 5 9s is foolish. Sure, you're redundant behind the pipe, but if you lose the pipe you can't blame your datacenter when you charged a customer for uninterrupted service. Technically, if their modem disconnects them for a few hours you've broken contract.
Besides, who needs it? If yahoo is unreachable from my desk, I wait and reconnect. It doesn't matter if the downtime was my fault or theirs...the effect on my user experience was the same. Any services I might have used, or products purchased, I will use or purchase at a later time. After all, I don't refrain from buying shoes just because the mall is closed!
3 9s = 99.9% uptime = 8.75 hrs/Yr = 525 min/Yr.
4 9s = 99.99% uptime = .875 hrs/Yr = 52.5 min/Yr.
5 9s = 99.999% uptime = .0875 hrs/Yr = 5.25 min/Yr.
9 9s = 99.9999999% uptime = .03 seconds per year downtime.
I call bullsh*t on anything that claims to have 9 9s reliability. 3 seconds every HUNDRED years.
Nathan Brazil?
That we live in a society that is more willing to send people into space with only a 99.9% chance of success, yet we freak out when a search engine on the Internet drops below 99.999% reliability? Great. Remind me never to work for NASA.
He who has no
Looks like codesta.com just used up all its downtime by getting its servers slashdotted.
Outdoor digital photography, mostly in New Engl
I believe there's more to this than meets the eye.
What better way to get back at your former boss than slashdotting him or his company server back to the medieval ages..
Follow that up with multiple queries on google about boss's info, credit cards, ssn etc..
To cut things short, by the end of the week :
Boss's boss realizes the server crashes were due to Boss, fires his ass on the spot.
Wife realizes that the new unexplained charges on Credit card from "Suzy's Parlor" were not exactly the next door cafe. Gives him the boot as well.
You evil man..you!
Rapid Nirvana
...that this article is hosted on a server which is now being brutally Slashdotted?
To make a pun demonstrates the highest understanding of a language
We did it on a really low budget:
Heartbeat/Mon/Fake/Coda/Linux/IPVS for the High Availability, failover from DS1->DS2, each on different backbone nodes.
Mirrored systems in different geographic locations:
Firewall
IPVS Gateway
Apache->Weblogic bridge (Apache vhosts with ssl)
Apache->Zope bridge (Apache vhosts with ssl)
Zope->Zeo setup for content management.
SAN drive array for Oracle, running on two E4500s
This system isn't really that expensive, just the costs of hardware and my salary for setting them up.
My $0.02 will always be worth more than your €0.02, so
"got crowded off the net by 3,000 nerds downloading pr0n..."
How could it get that crowded? Jpeg files are very small!
"Derp de derp."
Let's say the flight is 2 hours long. 99.99999% uptime on the software that connects the yoke to the control surface means it would be down for 0.08 of a second on each 2 hour flight, hardly anything that would make me think things were unsafe.
.. whether its a bug in your ASP software, the operating system, the ORB, the tcp/ip stack, or 1 minute adding a cron job to restart the server once every week at 3am when nobody is using the system .. thats when you'd probably want to start earning money from somebody who can manage it a little better.
Now, over a year, bunch it up over every day, and you get 29 seconds. Now thats scary, but if you meet 99.99999% uptime, you're probably not going to bunch all your downtime together in one incident.
Although, I'd say that the article looks like it wasn't written for _really_ critical stuff like this.
But it's scary when you have to argue with your boss about whether you should spend 2 weeks figuring out why your server crashes after 2 weeks of uptime
"Old man yells at systemd"
I work for a small ISP in central NY. A couple years ago, I can't remember which provider it was anymore, but they unplugged us because their paperwork was all screwed up and they didn't think anybody was on the circuit. Then they plugged somebody else into it. It not only took us several hours to find out what the problem was, it took 3 whole days for them to resolve the problem. They wouldn't simply undo what they did, they had to assign us a new circuit and basically refused to escalate the work order. We eventually came back up but lost quite a few customers, understandably.
No, it means that a jetliner has to be operating for a year with just 3 mins in the hangar. But about half its lifetime a jetliner is in maintenance, giving it an uptime of about 50%.
It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues.
You really want to see someone go berserk over downtime, try running a MUD...
See the Netcraft FAQ at http://uptime.netcraft.com/up/accuracy.html#cycle
That site deserves to be slashdotted. They have this little paper divided into about ten little sections, which multiplies their load by 10x or so. Then, it's a .jsp page (why?), which means more server-side interpreter overhead. If they hadn't crudded up the basic job of serving a readable document, they'd have one or two orders of magnitude more capacity.
I've worked on systems where a failure results in a hearing before an investigative board.
Mea navis aericumbens anguillis abundat
"It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."
;)
Please. Let's not talk so badly about eBay. Do you know how many people have been crushed under their CIO's foot?
For instance, 4 nines says your system is up 99.99% of the time. That is, out of 365 days x 24 hours x 60 minutes = 525,600 minutes a year, it can be down for only .01%, or 52 minutes a year. Five nines (99.999%) allows only 5 minutes a year downtime. This may actually be averaged over many servers and several years (that is, if you had 10 servers running for 3 years and just 1 died requiring one day to replace, you could figure your downtime as 100 * 1 day/(10*3*365) = .009%, so you've still got 4 nines).
There are questions about what gets counted when figuring reliability. For one thing, almost no one would count a slashdotting or a DOS attack against their uptime, but nevertheless from the user's viewpoint the server is down. Also, how do you count "scheduled downtime" such as rebooting NT servers after installing security patches, or unplugging the boxes to move them around when it's time to expand the system? A news server with a worldwide audience has no "penalty free" time slots. So either you settle for a lower uptime goal, or you need redundant servers configured so that even major upgrades can be put in by unplugging just one at a time while the others keep running. OTOH for the company database server, downtime during working hours is far more serious than downtime for the web server, so if it's a big company you do need redundant servers with automatic switchover. But in most cases there are times late at night or on weekends when no one cares if you shut them _all_ down at once - which certainly makes the upgrades easier.
So anyway, one person's "5 nines" may look like a lot less to someone else. E.g. a server vendor may claim that because only one in a million of their servers is broken at any given time their reliability is 6 nines. Your single server may never break at all - but once a week you take it off-line for ten minutes to load the newest security patches, so to anyone who wanted to keep working for those ten minutes you are only at 3 nines.
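The two ways of counting can be compared directly. A minimal sketch, with hypothetical function names, that turns a downtime fraction into a whole number of nines and applies it to both examples: one server-day lost across 10 servers over 3 years, and a ten-minute patch window once a week on a single box.

```python
import math


def count_nines(unavailability: float) -> int:
    """Whole number of nines for a given downtime fraction (0 < u < 1)."""
    # Rounding first keeps exact cases like 0.001 from landing at 2.9999...
    return math.floor(round(-math.log10(unavailability), 9))


def fleet_unavailability(servers: int, years: float, outage_days: float) -> float:
    """Downtime fraction averaged over every server-day in the fleet."""
    return outage_days / (servers * years * 365)


def weekly_window_unavailability(minutes_offline_per_week: float) -> float:
    """Downtime fraction for one server taken offline a fixed time each week."""
    return minutes_offline_per_week / (7 * 24 * 60)


if __name__ == "__main__":
    fleet = fleet_unavailability(servers=10, years=3, outage_days=1)
    patch = weekly_window_unavailability(10)
    print(f"fleet average: {fleet:.4%} down, {count_nines(fleet)} nines")
    print(f"weekly patch:  {patch:.4%} down, {count_nines(patch)} nines")
```

The same hardware can report four nines or three depending on whether the denominator is the whole fleet's server-days or one user's view of one server.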
Entirely. Having worked extensively on the flight deck systems for the Boeing 767-400ER, I can tell you first hand that the redundancy is rather amazing. There are two major computer systems that drive the displays in the cockpit, the DPCs which do a lot of digital signal manipulation and the DCCs which do a lot of the analog to digital signal manipulation and control. Two DCC boxes drive three DPC boxes and the two DCC boxes are cross-connected to each of the DPC boxes. The three DPC boxes each talk to each other (I'm not sure if the DCC boxes talked to each other - that was further down the chain than I was working on) and actually vote on the data points that are being sent to the displays to determine if one of the DPCs is malfunctioning or processing bad data. The way this all works together is amazingly complicated, especially when you consider that it all runs on embedded boards where the "executable" is typically less than 1-2MBs in size.
... especially the way it's actually implemented in the embedded system. Debugging all this, of course, was non-trivial. For that matter, coding it is non-trivial as it's all in Ada83.
... those were the days :)
My particular area of development was the actual display software which was provided data from the DPC systems. Each of the six displays (2-pilot, 2-copilot, 2-EICAS in the console) received multi-cast data from each of the DPCs and then fed data back to the DPCs on the display's status. The DPCs would then automagically evaluate if the displays were functioning properly and switch primary functions away from a malfunctioning display to a functioning display if error conditions were detected.
The PFD (primary flight display) is the pilot's most important display as it displays airspeed, artificial horizon, TCAS warnings, altitude and a few other things. The ND (navigation display) is the inner screen on both the pilot/co-pilot sides and if the PFD experiences error conditions, the DPCs switch the PFD to the ND and the ND to one of the EICAS (engine indicators, etc.) displays.
All very interesting stuff
Ahh
Remember that downtime is related not only to reliability of each piece of equipment but the number of pieces of equipment. 99.99% uptime sounds good, less than an hour of downtime a year, right? Scale that to a 500-server farm and it's an hour and ten minutes or so of downtime a day, every single day of the year including weekends and holidays (OK, we'll give you one day off in leap years). This concept has boggled a few salescritters who don't grasp the concept of scale.
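The scaling is simple enough to sketch. The function name is hypothetical, and note what is being counted: total server-minutes of downtime somewhere in the farm, not minutes during which the whole farm is dark. At 99.99% per server, 500 servers add up to about 72 server-minutes of downtime per day.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def farm_downtime_minutes_per_day(servers: int, unavailability: float) -> float:
    """Total server-minutes of downtime per day, summed across the farm."""
    per_server_per_year = MINUTES_PER_YEAR * unavailability
    return servers * per_server_per_year / 365


if __name__ == "__main__":
    four_nines = 1e-4  # 99.99% uptime means 0.01% downtime
    for count in (1, 50, 500):
        minutes = farm_downtime_minutes_per_day(count, four_nines)
        print(f"{count:>3} servers: {minutes:6.2f} server-minutes down per day")
```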
> Too Bad that a lot of the servers on the top 50 uptime list still have the default page that apache provides. I'm sure it isn't too difficult to keep them running - just make sure the power is on and the network cable is plugged in.
Historically, some very popular and widely sold operating systems couldn't even do that much.
Sheesh, evil *and* a jerk. -- Jade
I know this is meant as a joke but there is some truth to the way you spend your money for reliability. You have to choose carefully which systems get the money spent to design and validate that they will perform with 99.9999xxxx reliability. If you waste your money, you won't have enough for those systems that really need the quality.
And nothing contributes to reliability more than hiring, training and retaining high quality operators and engineers.
Post hoc ergo propter hoc? You are making the most common mistake in statistics.
Netcraft saying that the boxes with the longest times are BSD only implies there is most likely some kind of relationship between BSD and long uptimes; it does not imply that BSD is responsible for those uptimes.
It could be that the class of administrators who like BSD happen to have administrative practices that preclude rebooting often.
It could be that for some reason BSD is only used in very static configurations where the kinds of activities that would cause you to want to reboot are simply not done.
It could be anything.
I can't read the paper but, for his sake, I hope that he really meant that reliability isn't that important to him.
His server is toast!
Those who remember the awesome but now defunct uptimes.net will be pleased to know that a new server is now up and running. It uses the old uptimes protocols and clients.
The URL is http://uptimes.wonko.com/
A GNU/Linux box was number one the last time I looked, with a NetBSD box coming in second.
So much for "two nines". Nothing, I repeat nothing, can withstand the /. hordes ...
Sorry this should read 50% availability.
Yes I know there may be arguments that scheduled maintenance won't count. But even without counting this jetliner availability never reaches 99.999%.
About two years ago I had to fly a longer distance. I sat in the plane, but the plane didn't leave the gate for about an hour. Then the pilot spoke to us asking our patience for another hour; there was a problem with the oil pressure and the mechanics were looking at it. After this hour he told us that the oil filter was defective and had to be changed. And after another hour he asked us to leave the plane: there wasn't any new oil filter available at the airport and they had to get one from another airport. After five hours we finally got clearance and started.
That's five hours of unavailability. If this was the only unplanned outage for the plane at all, and it was on average available 99.999% of the time, this means a lifetime of 500000 hours or about 60 years for this plane without any further problem (including external conditions like weather, grounding due to Sep 11 et al.)
So planes on average are much more often unavailable than 0.001% of their operation time. The average delay at Frankfurt Airport (FRA) is currently 15 mins; if we assume that every plane lands at FRA about once per day, this would be an average outage of 1%, 1000 times that of 5-9.
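Both estimates can be reproduced in a few lines. This is a sketch using only the figures in the post (a single five-hour outage, a 15-minute average delay, one landing per day); the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24   # 8,760
MINUTES_PER_DAY = 24 * 60   # 1,440
FIVE_NINES_DOWN = 1e-5      # 99.999% available means 0.001% unavailable


def required_lifetime_hours(outage_hours: float, unavailability: float) -> float:
    """Total operating time needed for one outage to fit the downtime budget."""
    return outage_hours / unavailability


def daily_delay_unavailability(delay_minutes: float) -> float:
    """Downtime fraction if a plane loses `delay_minutes` every day."""
    return delay_minutes / MINUTES_PER_DAY


if __name__ == "__main__":
    hours = required_lifetime_hours(5, FIVE_NINES_DOWN)
    print(f"lifetime needed: {hours:,.0f} hours, about {hours / HOURS_PER_YEAR:.0f} years")

    delay = daily_delay_unavailability(15)
    print(f"15 min/day is {delay:.2%} down, "
          f"about {delay / FIVE_NINES_DOWN:,.0f}x the five-nines budget")
```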
On W2K - service packs, about 10% of hot fixes, and anything to do with IIS require a reboot. Take your head out of the sand.
Damn! This should have been modded up!!! And it would hardly be "settling".
No one ever had to evacuate a city because the solar panels broke!
To get closer to your analogy, I would treat a server like a jet engine - the plane is designed to fly even if one fails.
That'd be why planes (*gasp*) have redundant systems! One might crash for an hour, but the other two will take care of ya until landing.
And frankly I'd rather not be in a plane that lost control for five minutes once a year.
As long as it's parked on the ground during those five minutes, it's no problem.
You are in a twisty maze of processor lines, all alike.
There is a lot of hype here.
"Are ve up?"
"Nien."
"Are ve up yet?"
"Nien."
"How about NOW?"
"Nien."
"Vill ve be comink up soon?"
"Nien."
"Vill ve be up next veek?"
"Nien."
God is real unless declared integer
It's instructive to read about the United Flight 232 incident a few years back. The #2 engine of a DC-10 exploded in flight (at around 30,000 feet) and severed ALL the hydraulic systems and their backups. Without rudder, ailerons, elevator, spoilers, flaps, or one of the three engines, the pilots set the plane up for a forced landing. And about 200 of the 300 passengers on board survived.
Of course, certain bugs can be really bad. I was down at Boeing Field once last year when somebody attempted to take off in a light plane that had just been serviced. Unfortunately the mechanic hooked up the ailerons backwards, so that when the pilot attempted to correct for a crosswind on takeoff, he promptly rolled and landed on top of another plane in the parking area. (Sounds like inadequate preflight action by the pilot on that one, since he appears to have missed the "control surfaces free and correct" item on his pre-takeoff checklist, but no injuries to the best of my knowledge.)
Note that I'm hardly going to argue that flight-control software shouldn't be damn good. But... it's overstating your case to assume that downtime or error necessarily means a plane is going to fall out of the sky.
"Biped! Good cranial development. Evidently considerable human ancestry."
Having the 5 9's of reliability is NOT foolish. It is a reality of life. My particular organization services 40 million web customers, so we can not afford to be down at any time of the day because of the type of service we provide. In fact, last year we made our goal of having the 5-9's, and we did it without needing our disaster recovery (DR) site.
Having a DR plan and being reliable go hand in hand for the most part, however under normal day-to-day business conditions, servers need to be upgraded and things unplugged. You don't switch your entire infrastructure over to a DR site to upgrade your apache web server!! It is for this reason you have redundancy on the network and server level leading out to the Internet (or wherever your customer base resides).
Disasters, on the other hand, do not happen everyday. They happen once a year, maybe.... sometimes once every 2 years. If you live in an area more prone to disasters (like southern California), you may need an alternate site located on the east coast.... but, that is the cost of doing business.
Also, having 5-9's on uptime does NOT mean being accessible to everyone in the world at any time no matter what. Having 5-9's of uptime means that your organization has successfully kept its applications and services available to the Internet. How is it my company's fault if you don't plug your modem into the wall? It's not, so to say that our "reliability" decreases because of an end user being a moron is a stupid statement.
Gosh...I can't wait for Microsoft to start writing aircraft programs in C#.
Shutting down free speech with violence isn't fighting fascism. It IS fascism!
Not to sound like a suit, but it's really about total cost of ownership. For example, software RAID comes with most modern operating systems, but you still need to power down the server to remove and replace a failed drive. However, if you make the upfront investment in a hardware RAID controller with hot-swap capability, you save time and reduce tech support calls, saving money in the long run. If you're offering commercial services (as an ISP or whatever), you start to develop a reputation for reliability that will earn you more customers over time.
The stats on the server are interesting: either it stopped being "up" or it stopped being monitored before June.
Or did I read the graph wrong?
.
Have you read the moderator guidelines? Well, have you, PUNK? (and I want a Karma: Gnarly option)
Five-nine reliability in the airline industry would mean that we'd see a major commercial jetliner crash about every other day.
At first I didn't believe you.
According to this page, there were 10 fatal accidents in 18 million flights in 1998. That is a little better than six nines. Five nines would be 180 fatal accidents a year, or almost exactly one every other day.
I'm really glad I checked before spouting off. :-) Did you know that stat or did you pull it out of the air?
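For anyone who wants to redo the arithmetic, here is a minimal sketch. It only uses the figures quoted above (10 fatal accidents in 18 million flights); the function and variable names are made up for illustration.

```python
# Figures quoted in the comment: 10 fatal accidents in 18 million flights (1998).
FLIGHTS_PER_YEAR = 18_000_000
FATAL_ACCIDENTS = 10


def failures_at(nines: int, trials: int) -> float:
    """Expected failures if each trial fails with probability 10**-nines."""
    return trials / 10**nines


observed_failure_rate = FATAL_ACCIDENTS / FLIGHTS_PER_YEAR
five_nines_crashes = failures_at(5, FLIGHTS_PER_YEAR)
days_between_crashes = 365 / five_nines_crashes

print(f"observed failures per million flights: {observed_failure_rate * 1e6:.2f}")
print(f"fatal accidents per year at six nines: {failures_at(6, FLIGHTS_PER_YEAR):.0f}")
print(f"fatal accidents per year at five nines: {five_nines_crashes:.0f}")
print(f"days between accidents at five nines: {days_between_crashes:.2f}")
```

Six nines would allow 18 fatal accidents on that many flights and five nines would allow 180, which is where the "every other day" figure comes from.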
See High Availability for more information.
:-) (And yes, I have Win2000 on my machine and even occasionally am forced to boot into it so I speak from experience).
Coda is the best present option for fs-dependent data storage on mostly open-source platforms. We are using Coda for our MySQL table files, ZODB files and logs.
Coda may still be beta software, but if Open Source software like Coda is considered beta code then Windows 2000 + sp2 must have been alpha code.
My $0.02 will always be worth more than your €0.02, so
... and what is cost of keeping this up?
The maintenance on the 4500s (if they have multi procs and lots of ram) is prolly 20-30k each annually just by itself.
What about the renew costs on the weblogic support?
How much was that oracle?
Even a basic system as quoted above is 400+k.
--- I do not moderate.
This reminds me of when I was working at a .com called rulespace - there was a construction outfit building a parking lot downstairs - one day they decided to move the big uswest/qwest plywood board from one pillar to another. Alarms never went off because they couldn't call the pagers because they had effectively disconnected all the T1's (including the 2 backups), all the DSL circuits/analogue lines and the T1 going to the telephone switch for the entire building. All the redundancy in the world wouldn't save that mess. As I recall they forked out more money for colocation space at Inflow and moved the more critical systems out there.
The Scenario
Pagers going off. Phones ringing. People shouting fragments of conversations over the tops of cubicles. Groups of people huddled around monitors. Others dashing up and down the hallways, sticking their heads into office doors for just a moment, then scampering along to the next doorway. You are frantically talking on your cell phone, silencing your pager, and yelling into the speakerphone on your desk while typing on two different keyboards attached to three different monitors.
Sound familiar? It's a classic case of the dreaded 'downtime' disease, a terrible ailment where none of your systems work and for reasons you can't always understand. Of course, it typically strikes at the most inopportune moments - the launch of a major product upgrade, or right after announcing your partnerships with 5 of the Fortune 100.
Nobody wants downtime. It's a terrible thing that always involves blood, sweat, tears, and inevitably, a loss of money. This is why when you talk to the upper management of any company with a strategic online initiative you'll be told that the IT group has the highest goals, and that downtime is considered to be an anathema to be stamped out vigorously.
Unfortunately, when you talk to the company's IT manager you commonly hear a different story; the resources to back up the company's lofty online goals are hard to come by. In fact, with the downswing of the last couple of years, combined with the fact that IT isn't, at least directly, a revenue-generating entity, IT budgets are being reduced while uptime performance levels are expected to stay the same. This can only lead to a death march of extremely overworked IT personnel, and longer, more numerous occurrences of system downtime. These goals need to be re-evaluated.
Genesis of the 'Five Nines'
We've all heard the mantra of 'five nines', or 99.999% reliability. Somewhere in the depths of the Internet's 'big bang', when systems were slow and cranky, reliability became a major selling point of why one company's system was 'better' than the competition.
First, people talked about being 'two nines' or 99% reliable. Then someone else would top that, and make their product seem better, claiming 'three nines' (99.9%). Not long after that came 'four nines' (99.99%) and then, near the peak of the dot com era, came 'five nines'.
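To put numbers on each step of that escalation, here is a minimal sketch that converts a count of nines into the downtime it allows per year. It assumes a 365-day year and counts every second of the year as scheduled service time; the function name is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of 1 - 10**-nines."""
    return SECONDS_PER_YEAR / 10**nines


for n in range(2, 6):
    secs = allowed_downtime_seconds(n)
    print(f"{n} nines: {secs / 3600:7.2f} hours/year ({secs / 60:8.1f} minutes)")
```

Each extra nine cuts the budget by a factor of ten: roughly 87.6 hours a year at two nines, down to a little over five minutes a year at five nines.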
The herd mentality left no room in which to pitch for investment without the 'five nines' claim. "After all," it was thought, "if everyone else is saying they can provide 'five nines', I'd have to pretend I didn't know what I was doing if I didn't say I could match everyone else's claim."
'Five nines' isn't impossible. It's merely impractical and unnecessary in the world of the Internet. A shocking statement, perhaps, but a truism nonetheless.
We're not talking about launching people into space (which, by the way, is unfortunately done under 'three nines'), or working with nuclear power plants. We're working within the reference of online systems providing services to users both on and off the Internet - nobody dies from a system failure.
The Greasy Steel Bar
Think of uptime as a chin-up bar coated in grease. The higher the reliability desired, the greater the coating of grease. It's clearly tougher to hang on to a higher standard of reliability.
What's not so obvious, but very important, is that the higher the uptime target, the worse one does if not prepared. An IT department capable of three nines faced with a bar that's five nines slippery won't even manage the three nines they are capable of doing.
In my experience the most important machines are not accessible from the internet. Our mainframe has high availability, but it sure is not running a webserver. Not to mention it is firewalled off from the outside world.
Finkployd
The same thing happened to me once, a little puddlejumper from Dallas or Houston to Austin, I think it was.
Anywho, the pilot revs up the engine, then throttles it back down. Fine, brakes and throttle work. Throttles back up, trips the brakes, and off we go screaming down the runway.
Then the plane slows down, and stops.
Pilot comes on the intercom and says 'Um, folks, you may have noticed, we didn't take off. A warning light has come on in the cockpit, and we don't know why. Until we do, we're going to stay right here.'
Now, that's not the bad part. The heat and humidity, and a plane full of sweaty smelly passengers isn't the bad part, either.
No, the bad part was the pair of off duty pilots in the seats next to me who started, in loving detail, discussing every thing that could possibly be wrong.
Vintage computer games and RPG books available. Email me if you're interested.
Yeah, and when some systems fail, it doesn't actually matter.
One thing I saw again and again during the .com boom was pissant little companies demanding 100% uptime, spending a fortune on Oracle and redundant data centres and shit, when they didn't need that reliability. Their business plan didn't call for it, their demographic didn't call for it, nothing called for it. They were engineering their shithouse little business's systems like they were for the A&E department of a hospital.
And that's the point the guy seems to be making: people are spending millions of dollars where they only need to spend a tenth that, to build systems you could run a trading floor with.
Example: the system I support is mission-critical. If it crashes, it makes the front page of the newspaper the next day. Hundreds of thousands of folks may get delayed by ten or fifteen minutes or a half-hour. The system is finally up to 99.995% availability. And how? By turning off all the backup systems and disentangling all the horrendous software kludges that were put in for the backup system. While the organization was trying to support hot-backup availability, it was crashing every other day. Outside consultants blamed this on flaky hardware and said the system had a life of, at most, a year and a half. Here we are now, four years later, and reliability is better than ever :-).
I like to think that some of the work, and (even better) some of my attitudes, have helped get us where we are.
Nein!
for 4 quarters running now. IT IS horribly expensive, both on the hardware and support side but given federal requirements and customer demand we have no choice.....
errr....umm...*whooosh* *whoosh* Is this thing on ?
the Feds seem to think that balancing a bank everyday at 15:00 IS, and they aren't JOKING nor do they have a sense of humor...
its called 6 sigma for us
errr....umm...*whooosh* *whoosh* Is this thing on ?
He puts a seemingly valid mailto: link on a heavily trafficked website. If it wasn't his "former boss" before, it damn well will be now.
USDCO has been featured in other /. articles; not only is their colocation facility located underground, with a high degree of redundancy in their connections, but it's not very expensive, either...
'Course, an on-site solution won't be anywhere near as cheap, but if you can colo, this is the place.
One company I worked for once upon a time, ConXioN Corp, has a very real statement on their opening page from a major customer:
"ConXioN has not been down in 5 years." And that was in 2001, they still haven't had a hit.
This is simply a matter of consideration and design. No $19.95/month mom&pop ISP is going to put the effort needed into ensuring such uptimes, things like that take redundancy and forward thinking, and that costs money.
While I was at NASA, the network and servers there also had better than 5-9's availability, because the people who ran those servers and that network took the time to care. For us it wasn't a matter of profit, it was a matter of pride.
So while I agree with those who poo-poo that "nothing is so important" that it needs to be up 100% of the time, and I also agree with the reality that there will be downtime of any system at some point, really impressive uptimes are not just possible, they can and do happen anywhere that uptime is a priority.
Long Uptimes are simply a matter of design.
Bob-
The Ludwig von Mises Institute. The reasoning individual's economics
In Myanmar "General Ne Win helped speed his own downfall ... by suddenly declaring much of the Burmese currency worthless and replacing it with bank notes in denominations divisible by his lucky number, nine. Riots followed."
___
"with their freedom lost all virtue lose" - Milton
It doesn't matter if the downtime was my fault or theirs...the effect on my user experience was the same.
Try convincing the people who call Tech Support of this simple concept.
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
That's the reason why servers go down and planes do not (well.. most of the time): people expect that the server they get for $3000 will run a corporate mission-critical system for years without a crash. Planes cost millions and have their hardware tested every time they're used; servers do not. Do you test your server's hardware every week? (or day?).
Never underestimate the relief of true separation of Religion and State.
You are not making the distinction between "server uptime" and "service uptime". When people talk about 99.something% uptime, they are usually referring to "service uptime". With proper hardware (redundancy etc ..) you can reboot servers, change disks, memory and even routers and it won't cost you even 1 second of "service downtime".
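A rough way to see why service uptime can beat server uptime: if the service only needs one of several redundant servers to be up, its availability is much higher than any single box. The sketch below assumes server failures are independent and failover is instantaneous, which real setups only approximate; the 99% figure and the function name are illustrative, not taken from the comment.

```python
def service_availability(server_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent servers is up."""
    return 1 - (1 - server_availability) ** replicas


# Assumed example: each server is up 99% of the time on its own.
for replicas in (1, 2, 3):
    a = service_availability(0.99, replicas)
    print(f"{replicas} server(s) at 99% each -> service availability {a:.4%}")
```

Under those assumptions, two "two nines" servers give a "four nines" service. Correlated failures (shared power, shared network, the same bad config pushed everywhere) are what break this in practice.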
echo '[q]sa[ln0=aln80~Psnlbx]16isb572CCB9AE9DB03273snlbxq' |dc
Statistically, for something to have a guaranteed (95%) uptime of 1 month it must have no downtime in 95 months!
5+ nines are great but you can still go down 1 pico second an hour, that's a hell of a lot of outage.
thank God the internet isn't a human right.
It's a joke. By the time the call got to me, I got the person on the phone, and got a description of the problem...BUZZZZ!!! Time's up!!! Of course, I was asked to support a system that I had no formal training on, that I didn't design, install, or ever see in person....support was....difficult. My dot com layoff was, in sooooo many ways, the best thing that could have happened to me!!
Just wait until MS joins the industry associations and starts putting its executives in the committees. It'll change real fast.
Shutting down free speech with violence isn't fighting fascism. It IS fascism!