Uptime Realities in the Internet World
schnurble writes: "My former boss has written an interesting article on the realities of uptime in the Internet World. It poses the idea that four and five nines of reliability are too expensive to be realistic, especially in the post dot-bomb economy. It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."
Introduction
The Scenario
Pagers going off. Phones ringing. People shouting fragments of conversations over the tops of cubicles. Groups of people huddled around monitors. Others dashing up and down the hallways, sticking their heads into office doors for just a moment, then scampering along to the next doorway. You are frantically talking on your cell phone, silencing your pager, and yelling into the speakerphone on your desk while typing on two different keyboards attached to three different monitors.
Sound familiar? It's a classic case of the dreaded 'downtime' disease, a terrible ailment where none of your systems work, for reasons you can't always understand. Of course, it typically strikes at the most inopportune moments - the launch of a major product upgrade, or right after announcing your partnerships with 5 of the Fortune 100.
Nobody wants downtime. It's a terrible thing that always involves blood, sweat, tears, and inevitably, a loss of money. This is why when you talk to the upper management of any company with a strategic online initiative you'll be told that the IT group has the highest goals, and that downtime is considered to be an anathema to be stamped out vigorously.
Unfortunately, when you talk to the company's IT manager you commonly hear a different story: the resources to back up the company's lofty online goals are hard to come by. In fact, with the downswing of the last couple of years, combined with the fact that IT isn't, at least directly, a revenue-generating entity, IT budgets are being reduced while uptime performance levels are expected to stay the same. This can only lead to a death march of extremely overworked IT personnel, and longer, more numerous occurrences of system downtime. These goals need to be re-evaluated.
Genesis of the 'Five Nines'
We've all heard the mantra of 'five nines', or 99.999% reliability. Somewhere in the depths of the Internet's 'big bang', when systems were slow and cranky, reliability became a major selling point of why one company's system was 'better' than the competition.
First, people talked about being 'two nines' or 99% reliable. Then someone else would top that, and make their product seem better, claiming 'three nines' (99.9%). Not long after that came 'four nines' (99.99%) and then, near the peak of the dot com era, came 'five nines'.
The herd mentality left no room in which to pitch for investment without the 'five nines' claim. "After all," it was thought, "if everyone else is saying they can provide 'five nines', I'd have to pretend I didn't know what I was doing if I didn't say I could match everyone else's claim."
'Five nines' isn't impossible. It's merely impractical and unnecessary in the world of the Internet. A shocking statement, perhaps, but a truism none-the-less.
We're not talking about launching people into space (which, by the way, is unfortunately done under 'three nines'), or working with nuclear power plants. We're working within the reference of online systems providing services to users both on and off the Internet - nobody dies from a system failure.
The Greasy Steel Bar
Think of uptime as a chin-up bar coated in grease. The higher the reliability desired, the greater the coating of grease. It's clearly tougher to hang on to a higher standard of reliability.
What's not so obvious, but very important, is that the higher the uptime target, the worse one does if not prepared. An IT department capable of three nines faced with a bar that's five nines slippery won't even manage the three nines they are capable of doing.
1 nine: 90% availability, or 37 days of downtime per year (Qwest!)
2 nines: 99% availability, or 88 hours of downtime per year
3 nines: 99.9% availability, or 9 hours of downtime per year
4 nines: 99.99% availability, or 53 minutes of downtime per year
5 nines: 99.999% availability, or 315 seconds of downtime per year
6 nines: 99.9999% availability, or 32 seconds of downtime per year
7 nines: 99.99999% availability, or 3 seconds of downtime per year
Beyond that, it doesn't much matter.
TI-89 > all education .3 seconds
9's   time
1     876 hours
2     87 hours
3     8 hours
4     52 minutes
5     5 minutes
6     31 seconds
7     3 seconds
8
9     you get the idea
The "five-nines" of reliability has nothing to do with an individual server being available, but with an individual application. This means you can have 2-3 servers running the same load-balanced application. That way, you can take 1 down every hour if you want, as long as the other one or two are still working, and the application is still working. If you're REALLLLLLLLY lucky, you will meet the "five-nines" and if you're EXTREEEEEMELY lucky, you'll get 100% on that application.
THAT is the goal. It's called redundancy. You will *not* meet any reliability milestones on a single server or network link. It's an obtainable goal, but it does cost money depending on your architecture.
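To put a rough number on the redundancy argument, here is a minimal sketch. It assumes the replicas fail independently and that the service counts as up whenever at least one replica is up; it ignores the load balancer and any shared failure (power, network, a bad deploy), so treat the result as an upper bound. The function name is made up for illustration.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a service that stays up while at least one replica is up.

    Assumption: replicas fail independently of each other.
    `single` is the availability of one server as a fraction (0.99 = 99%).
    """
    all_down = (1 - single) ** replicas  # probability every replica is down at once
    return 1 - all_down


for n in (1, 2, 3):
    print(f"{n} server(s) at 99% each: {combined_availability(0.99, n) * 100:.4f}%")
```

Under those assumptions, two "two nines" boxes behind a balancer already look like four nines on paper, which is why the numbers are usually quoted per application rather than per server.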
Simply put, 4 9's of reliability would mean 99.99% uptime (only down for .01% of the time).
"Perl 6 gives you the big knob" -- Larry Wall
Good luck applying Six Sigma to processes that aren't directly related to manufacturing something ... ;)
:)
I really mean that - Good Luck.
One word to clients... "Outsource"
Maintaining backend infrastructure with a 5 9's service level agreement really is prohibitively expensive for all but the largest businesses. Especially if they are not a tech company.
The level of engineering that goes into providing true 5 9's service is extraordinary. Also, some military contracts actually require 6 9's!! (Let alone completely separate networks for classified data).
I'm actually in the design phase of a data center which requires 5 9's (so we can take on those who decide to outsource). Redundant generators, redundant UPS, redundant routers, redundant HVAC, two separate cable runs from different sides of the building, two connections to the power grid, etc., etc....
And that's just the physical infrastructure! Now you need to develop or integrate the software to completely cover every aspect of your operations. Anything from cable tagging, to ticketing systems, to emergency procedures. After you build all the infrastructure, take that price and double it... that's how much you will be spending to develop all of those operating procedures. At that point, go get ISO certified, since you've already gone above all the requirements.
If I had to take a guess at a physical cost, $250-300 a square foot seems pretty close (around here anyway). And that only gets cheaper if you are looking at a facility greater than about 10000 sq. ft.
Unless of course, only marketing has those 5 9's!
It appears Ockham lost his razor and grew a beard.
I finally got a tcp connection and page 2 finally loaded, so here it is...
The Uptime Rules
First, as an introduction to the rules, let's review our terms and terminology.
Definitions
Uptime is the amount of time the entire system is available. By entire system we are saying that an entire transaction can be completed. Just having your web servers running when the needed application server isn't running cannot be defined as uptime.
Downtime is everything else.
Scheduled maintenance downtimes or windows are the periods of time (for example, from 1:00am to 3:00am Monday morning) when an IT team has the option, if they need, to bring down various components in a fashion that causes the system to be incapable of complete functionality.
Reliability is defined as uptime, except that scheduled maintenance downtime is not counted against it. For example, if in a 24 hour period there was an hour of scheduled downtime, but the system was otherwise fully operational for the remaining 23 hours, then the system was 100% reliable.
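The difference between the two definitions can be sketched in a few lines. This is one reading of the definitions above, not the author's own formula: it assumes scheduled maintenance hours are dropped from the measurement period when computing reliability, and the function names are invented for the example.

```python
def uptime_pct(total_hours: float, scheduled_hours: float, unscheduled_hours: float) -> float:
    """Uptime: every hour of downtime counts, scheduled or not."""
    up = total_hours - scheduled_hours - unscheduled_hours
    return 100 * up / total_hours


def reliability_pct(total_hours: float, scheduled_hours: float, unscheduled_hours: float) -> float:
    """Reliability: scheduled maintenance is excluded from the measured period (assumption)."""
    measured = total_hours - scheduled_hours
    return 100 * (measured - unscheduled_hours) / measured


# The article's example: a 24 hour day with one hour of scheduled maintenance.
print(f"uptime:      {uptime_pct(24, 1, 0):.2f}%")
print(f"reliability: {reliability_pct(24, 1, 0):.2f}%")
```

The same day scores 100% on reliability and noticeably less on uptime, so it matters which of the two numbers a contract is actually quoting.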
So how do you translate the 'nines' into acceptable downtime? This chart provides the answer:
'Nines'   Uptime %   Minutes Per Year   Minutes Per Month
Two       99%        5256               438.0
Three     99.9%      526                43.8
Four      99.99%     53                 4.4
Five      99.999%    5                  0.4
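The chart is plain arithmetic, and a short sketch reproduces it. It assumes a 365-day year and a month of exactly one twelfth of a year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines: int) -> tuple[float, float]:
    """Return (minutes per year, minutes per month) of downtime allowed at N nines."""
    unavailable_fraction = 10 ** -nines  # two nines -> 0.01, five nines -> 0.00001
    per_year = MINUTES_PER_YEAR * unavailable_fraction
    return per_year, per_year / 12


for n in range(2, 6):
    per_year, per_month = allowed_downtime_minutes(n)
    print(f"{n} nines: {per_year:8.1f} min/year  {per_month:6.2f} min/month")
```

Each extra nine divides the allowance by ten, which is why the jump from three to four nines turns hours per year into minutes per year.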
Rule #1: A great system run poorly is a poor system.
This is the most crucial rule to understand when managing any system. It doesn't matter how much you spent on the hardware, how well designed your database tables are, or if you installed the latest and greatest operating system on the market. If it cannot be managed well, problems ensue.
Users don't see, or care that problems come from your database servers, or your application servers, or your static data caching. What they perceive is one of two states: working or not working. They want to make their reservation, or pay their bill, or just get the weather in Bali, and they want to do it NOW!
Managing with a given level of reliability in mind is about people, hardware, operating and escalation plans, and ultimately, it is about the money to put it all together and keep it running. The cost of reliability is very hard to quantify. Even assuming it is a linear relationship (and few things in life are), it's a staggering relationship in financial terms. In my experience each 'nine' is close to an order of magnitude increase in cost!
The bottom line is this: you need to do an honest assessment of available resources versus intended goals. It is the first step in making sure your great system runs at least as well as you intended.
Rule #2: Five nines is a goal reachable only through both fully automated system management, and rigorously controlled and tested applications.
Scared by four and five nines? Unless you've worked in a true, hardcore, spare no expense data center, you should be!
Let's think about five nines for a moment. 5 minutes a year. That rules out any form of human involvement in fixing problems. After all, even the best humans are known to be distracted for a minute or two into conversation with a co-worker, or a phone ringing.
As an example, let's time a perfectly common scenario, where you have two people monitoring systems. Time the following emulation in your office space:
1. Assume the system is working happily.
2. Walk over to your kitchen area and grab a soft drink. Then walk back.
3. Wait 15 seconds while you pretend to have the other NOC (Network Operations Center) engineer say "Hey, look at this!"
4. Sprint over to your desk and sit down.
5. Log into your desktop machine.
6. Log into a remote machine.
7. Run one or two basic remote commands ('ps' or 'top' for example)
Now stop the clock. I'm willing to bet your five minutes are up!
Even without a distraction, it's simply not possible for a problem in a system of any complexity to be confirmed, cross-checked, and resolved by a person within five minutes. Oh, and don't forget about the 60 to 90 seconds that you've already lost in detecting the issue - unless you want alarms going off continuously, you have to set an error threshold that typically consumes 60 seconds or so.
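The stopwatch exercise can be written down as a time budget. In the sketch below only the 15 second distraction and the roughly 60 second alarm threshold come from the text; every other duration is a made-up guess, so substitute your own timings.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
BUDGET_SECONDS = SECONDS_PER_YEAR * 1e-5  # five nines: about 315 seconds per year

# (step, seconds) - durations other than the first and third are hypothetical
steps = [
    ("alarm threshold before anyone is paged", 60),
    ("walk to the kitchen and back", 90),
    ("distracted by the other NOC engineer", 15),
    ("sprint to desk and sit down", 20),
    ("log into desktop machine", 30),
    ("log into remote machine", 30),
    ("run 'ps' or 'top'", 30),
]

elapsed = sum(seconds for _, seconds in steps)
remaining = BUDGET_SECONDS - elapsed

print(f"yearly budget: {BUDGET_SECONDS:.0f} s")
print(f"spent before any diagnosis: {elapsed} s")
print(f"left to actually fix the problem: {remaining:.0f} s")
```

Even with these fairly generous guesses, most of the year's allowance is gone before anyone has looked at a log file, and that is for a single incident.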
"Okay," you say, "well, five nines is a lot. How about aiming at four nines?" But are four nines really much different than five? Certainly, it gives you more latitude and time to fix a problem, but not much more. You can afford a single downtime that takes a few minutes to debug, but that's all.
The truth is, unless you have an application that doesn't fail, the odds are that your hardware failures will still occur three to four times a year, which pushes the limit of human intervention. A good rule of thumb is that things never happen when you are watching them - figure that any issue takes at least ten minutes to resolve, even if it is as simple as a human inadvertently powering both sets of redundant systems down, and now they are powering back up.
Rule #3: Even three nines is hard in the Internet World.
The "Internet World" is not a magazine, but rather, a truism of application state, where functionality and features are continuously enhanced. Compare this to a billing or call center, which has a minimum of features, and where great amounts of time are spent in testing before new applications are released to production.
The great thing about developing in the Internet world is that lots of new features can be brought to end users in a very short amount of time. The standard for development is weeks to a few months rather than years. Not only does this provide a level of instant gratification, but it also allows applications and services to be highly responsive to what users actually want and need, and in the end, provides a vastly more desirable system.
The tradeoff, of course, is that the applications themselves aren't nearly as reliable. Thus, the three nines goal. Why three nines? Because it's the highest possible reliability for a system which relies on human intervention, and there's simply no way that a dynamic, "Internet World" application can be reduced to few enough parameters that it can be managed in an automated fashion. Failure modes grow exponentially with functionality, and the task of automating monitoring and management of such dynamic and flexible systems is an entropic one - that is, it quickly becomes a task bigger than the application itself.
But even three nines doesn't come cheaply. It requires a complete staff to be available at all times. There's no time to call and page people - to wait for them to get home from the supermarket where they were grabbing a quart of milk for the baby.
How much staff does one need? Well, that's a good question, and the answers are dependent upon the nature of the particular application. But my experience in today's world shows that most systems are three-tier applications, with significant networking components. Therefore, at any given time, you need the following people on hand:
* NOC / Monitoring staff
* System administrators
* Network Engineers
* Application Engineers
* Database Administrators
* Crisis Management
* Customer Management
Now, admittedly, there can be some overlap in tasks, and the simpler the application, the easier it is to get overlap, but already, we're talking about quite a few people. Of course, these people need some backup to call in, for fresh ideas, if things aren't going well.
Don't underestimate the value of having a technical person, who understands the system, acting in the Crisis Manager role. This person is actually very critical to making sure that key issues aren't being overlooked, and to providing the detached viewpoint that is key to problem solving.
In addition, having a customer relationship person available to talk to the upset customers, at least when the service is provided to businesses rather than consumers, is vital. This isn't to help solve the problems of a given downtime event, but for the ongoing relationship with the customer.
Rule #4: 99.7% is very cost effective.
That's right, less than three nines. 99.7% gets effectiveness from the fact that it allows for two hours of downtime a month - basically, a total of one day per year.
While it sounds like a lot, it's typical for a failure pattern to consist of several small events of 10-20 minutes duration, and on rare occasions, a failure that takes three to four hours to resolve. That's the core timing that you get with 99.7% -- the ability to have a four hour failure once a year.
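As a rough check of that failure pattern against the 99.7% allowance, here is a small sketch. It assumes a 365-day year and, purely for illustration, a typical small event of 15 minutes; the names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def downtime_budget_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at the given availability."""
    return period_hours * (1 - availability_pct / 100)


yearly = downtime_budget_hours(99.7, HOURS_PER_YEAR)  # a bit over one day
monthly = yearly / 12                                 # a bit over two hours

# Spend one four-hour failure, then see how many 15 minute events still fit.
after_big_failure = yearly - 4
small_events = int(after_big_failure * 60 // 15)

print(f"yearly budget:  {yearly:.2f} h")
print(f"monthly budget: {monthly:.2f} h")
print(f"15 minute events left after one 4 h failure: {small_events}")
```

In other words, the yearly allowance absorbs one long outage and still leaves room for more than one short event a week.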
That means that you don't have to build nearly as much hardware redundancy - instead of having 1:1 "hot" standby units, you can have a 1:N relationship with a cold standby unit that can be configured and put into place in the span of a couple of hours. The larger N is, the greater the cost savings. If it is network components we're referring to, the less complex the routing environment, the fewer people with network-specific skills are needed. Get it simple enough, and you get more overlap of skills, meaning more bang for your salary buck.
Complex systems also require complex understandings. The number of dependencies within systems again grows exponentially, and leaves far more room for human error.
Remember rule #1, a great system run poorly is a poor system.
Remember that downtime is related not only to reliability of each piece of equipment but the number of pieces of equipment. 99.99% uptime sounds good, less than an hour of downtime a year, right? Scale that to a 500-server farm and it's an hour and ten minutes or so of downtime a day, every single day of the year including weekends and holidays (OK, we'll give you one day off in leap years). This concept has boggled a few salescritters who don't grasp the concept of scale.
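The scaling arithmetic is easy to check. The sketch below assumes every server has the same availability and that outages never overlap, so it counts total server-downtime across the farm, not downtime of the service as a whole. The function name is made up.

```python
MINUTES_PER_DAY = 24 * 60


def fleet_downtime_minutes_per_day(servers: int, availability_pct: float) -> float:
    """Expected total minutes per day with some server down, summed across the farm.

    Assumptions: identical availability per server, outages never overlap.
    """
    per_server = MINUTES_PER_DAY * (1 - availability_pct / 100)
    return servers * per_server


total = fleet_downtime_minutes_per_day(500, 99.99)
print(f"500 servers at 99.99%: about {total:.0f} minutes of server downtime per day")
```

That works out to roughly an hour and twelve minutes a day, in line with the "hour and ten minutes or so" quoted above.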
Those who remember the awesome but now defunct uptimes.net will be pleased to know that a new server is now up and running. It uses the old uptimes protocols and clients.
The URL is http://uptimes.wonko.com/
A GNU/Linux box was number one the last time I looked, with a NetBSD box coming in second.
Sorry this should read 50% availability.
Yes I know there may be arguments that scheduled maintenance won't count. But even without counting this jetliner availability never reaches 99.999%.
About two years ago I had to fly a longer distance. I sat in the plane, but the plane didn't leave the gate for about an hour. Then the pilot spoke to us, asking for our patience for another hour: there was a problem with the oil pressure and the mechanics were looking at it. After this hour he told us that the oil filter was defective and had to be changed. And after another hour he asked us to leave the plane; there weren't any new oil filters available at the airport and they had to get one from another airport. After five hours we finally got clearance and took off.
That's five hours of unavailability. If this was the only unplanned outage for the plane at all, and it was on average available 99.999% of the time, this means a lifetime of 500000 hours, or about 60 years, for this plane without any further problem (including external conditions like weather, grounding due to Sep 11, et al.)
So planes on average are much more often unavailable than 0.001% of their operation time. The average delay for the Frankfurt Airport (FRA) is currently 15mins, if we assume that every plane lands on FRA about once per day this would be an average outage of 1%, 1000 times that of 5-9.
Five-nine reliability in the airline industry would mean that we'd see a major commercial jetliner crash about every other day.
At first I didn't believe you.
According to this page, there were 10 fatal accidents in 18 million flights in 1998. That is a little better than six nines. Five nines would be 180 fatal accidents in those 18 million flights, or almost exactly one every other day.
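For anyone who wants to redo the check, here is the arithmetic as a short sketch. The only inputs are the two figures quoted above (10 fatal accidents, 18 million flights in one year); everything else is derived, and the variable names are made up.

```python
import math

fatal_accidents = 10
flights = 18_000_000

failure_rate = fatal_accidents / flights   # fraction of flights ending in a fatal accident
nines = -math.log10(failure_rate)          # number of nines that rate corresponds to

accidents_at_five_nines = flights * 1e-5   # one failure per 100,000 flights
days_between = 365 / accidents_at_five_nines

print(f"actual rate: about {nines:.2f} nines")
print(f"at five nines: {accidents_at_five_nines:.0f} fatal accidents a year, "
      f"one every {days_between:.1f} days")
```

The actual record comes out at a little over six nines, and dropping to five nines gives one fatal accident roughly every two days.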
I'm really glad I checked before spouting off. :-) Did you know that stat or did you pull it out of the air?
Nein!
I'd also say impractical. 5 nines is 99.999% availability, i.e. can be down for 1 second every 100000 seconds, or 27.77 hours. That gives approximately 6 seconds of downtime per week.
Even if all that week's downtime came at once, six seconds is little enough that most users would just hit refresh and never even notice. Besides which, most web servers are taken down for maintenance tasks, upgrading software or disk, etc... Chances are even restarting the web server would take up more time than your maximum weekly downtime.
Given that over the course of a month (which is the billing period on most ISP lines), you only have 24 seconds of possible downtime, it's very unlikely that the ISP will be able to meet that target. Pretty much *any* fault would take longer than that to fix, so any company offering a refund if the SLA isn't met is just asking for trouble.
You are not making the distinction between "server uptime" and "service uptime". When people talk about 99.something% uptime, they are usually referring to "service uptime". With proper hardware (redundancy etc ..) you can reboot servers, change disks, memory and even routers and it won't cost you even 1 second of "service downtime".
echo '[q]sa[ln0=aln80~Psnlbx]16isb572CCB9AE9DB03273snlbxq' |dc