Slashdot Mirror


Uptime Realities in the Internet World

schnurble writes: "My former boss has written an interesting article on the realities of uptime in the Internet World. It poses the idea that four and five nines of reliability are too expensive to be realistic, especially in the post dot-bomb economy. It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."

11 of 353 comments

  1. Re:9 9s by Anonymous Coward · · Score: 5, Informative

    1 nine: 90% availability, or 37 days of downtime per year (Qwest!)
    2 nines: 99% availability, or 88 hours of downtime per year
    3 nines: 99.9% availability, or 9 hours of downtime per year
    4 nines: 99.99% availability, or 53 minutes of downtime per year
    5 nines: 99.999% availability, or 315 seconds of downtime per year
    6 nines: 99.9999% availability, or 32 seconds of downtime per year
    7 nines: 99.99999% availability, or 3 seconds of downtime per year

    Beyond that, it doesn't much matter.
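For anyone who wants to check the table above, the arithmetic is just the unavailable fraction times the length of a year. A minimal sketch, assuming a 365-day year; the function name `downtime_seconds_per_year` is made up for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day (non-leap) year


def downtime_seconds_per_year(nines: int) -> float:
    """Return the yearly downtime budget for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # e.g. 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


# Print the budget in seconds and hours for one through seven nines.
for n in range(1, 8):
    secs = downtime_seconds_per_year(n)
    print(f"{n} nines: {secs:>12,.1f} s = {secs / 3600:>8,.2f} h")
```

The rounded figures quoted in the comments differ slightly from each other only because of rounding up versus truncating.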

  2. Re:9 9s by Wrexen · · Score: 5, Informative

    TI-89 > all education
    9's ---- time
    1 876 hours
    2 87 hours
    3 8 hours
    4 52 minutes
    5 5 minutes
    6 31 seconds
    7 3 seconds
    8 .3 seconds
    9 you get the idea

  3. Oi! You act like a manager! by isa-kuruption · · Score: 3, Informative

    The "five-nines" of reliability has nothing to do with an individual server being available, but with an individual application. This means you can have 2-3 servers running the same load-balanced application. This way, you can take 1 down every hour if you want, as long as the other one or two are still working. This way, the application is still working. If you're REALLLLLLLLY lucky, you will meet the "five-nines" and if you're EXTREEEEEMELY lucky, you'll get 100% on that application.

    THAT is the goal. It's called redundancy. You will *not* meet any reliability milestones on a single server or network link. It's an obtainable goal, but it does cost money depending on your architecture.
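The redundancy argument can be put into numbers. A rough sketch, assuming server failures are independent and that the application is up whenever at least one replica is up (it ignores the load balancer, shared power, and other correlated failures, so real results will be worse); `combined_availability` is a hypothetical helper name:

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent servers when any one suffices.

    Assumption: failures are independent, so the chance that all replicas
    are down at once is (1 - single) ** replicas.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - single) ** replicas


# Two-nines servers behind a (hypothetically perfect) load balancer.
for n in (1, 2, 3):
    print(f"{n} replica(s): {combined_availability(0.99, n):.6%}")
```

Under those assumptions each extra replica multiplies the unavailability by the same factor, which is why redundancy buys nines faster than making any single box more reliable.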

  4. Re:Management want it, but does it understand it by Jobe_br · · Score: 3, Informative

    Good luck applying Six Sigma to processes that aren't directly related to manufacturing something ... ;)

    I really mean that - Good Luck. :)

  5. Re:Customers want it, but don't understand it by ipsuid · · Score: 5, Informative

    One word to clients... "Outsource"

    Maintaining backend infrastructure with a 5 9's service level agreement really is prohibitively expensive for all but the largest businesses. Especially if they are not a tech company.

    The level of engineering that goes into providing true 5 9's service is extraordinary. Also, some military contracts actually require 6 9's!! (Let alone completely separate networks for classified data).

    I'm actually in the design phase of a data center which requires 5 9's (so we can take on those who decide to outsource). Redundant generators, redundant UPS, redundant routers, redundant HVAC, two separate cable runs from different sides of the building, two connections to the power grid, etc., etc....

    And that's just the physical infrastructure! Now you need to develop, or integrate, the software to completely cover every aspect of your operations. Anything from cable tagging, to ticketing systems, to emergency procedures. After you build all the infrastructure, take that price and double it... that's how much you will be spending to develop all of those operating procedures. At that point, go get ISO certified, since you've already gone above all the requirements.

    If I had to take a guess at a physical cost, $250-300 a square foot seems pretty close (around here anyway). And that only gets cheaper if you are looking at a facility greater than about 10000 sq. ft.

    Unless of course, only marketing has those 5 9's!

    --
    It appears Ockham lost his razor and grew a beard.

  6. Rephrase the question by Todd Knarr · · Score: 3, Informative

    Remember that downtime is related not only to reliability of each piece of equipment but the number of pieces of equipment. 99.99% uptime sounds good, less than an hour of downtime a year, right? Scale that to a 500-server farm and it's an hour and ten minutes or so of downtime a day, every single day of the year including weekends and holidays (OK, we'll give you one day off in leap years). This concept has boggled a few salescritters who don't grasp the concept of scale.
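The scaling claim is easy to verify. A small sketch, assuming every server independently uses up its full downtime allowance and that outages are simply summed across the farm (server-minutes of downtime, not minutes during which the whole farm is dark); the function name is illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def farm_downtime_minutes_per_day(availability: float, servers: int) -> float:
    """Total server-minutes of downtime per day across a farm.

    Each server is assumed to be down for (1 - availability) of the year,
    and the per-server downtime is summed over all servers.
    """
    per_server_per_year = MINUTES_PER_YEAR * (1.0 - availability)
    return per_server_per_year * servers / 365


print(f"{farm_downtime_minutes_per_day(0.9999, 1) * 365:.2f} min/year for one server")
print(f"{farm_downtime_minutes_per_day(0.9999, 500):.1f} min/day for 500 servers")
```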

  7. Re:Nothing is THAT Important by tzanger · · Score: 3, Informative

    Five-nine reliability in the airline industry would mean that we'd see a major commercial jetliner crash about every other day.

    At first I didn't believe you.

    According to this page, there were 10 fatal accidents in 18 million flights in 1998. That is a little better than six nines. Five nines would be 180 fatal accidents a year, or almost exactly one every other day.

    I'm really glad I checked before spouting off. :-) Did you know that stat or did you pull it out of the air?
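The check is a couple of lines of arithmetic. A sketch using only the two figures quoted above (10 fatal accidents, 18 million flights in 1998); the helper names are made up, and treating every flight as an independent trial is an assumption:

```python
import math

FLIGHTS_PER_YEAR = 18_000_000  # 1998 figure quoted in the comment
FATAL_ACCIDENTS = 10           # 1998 figure quoted in the comment


def nines(failure_rate: float) -> float:
    """Express a failure rate as a (fractional) number of nines."""
    return -math.log10(failure_rate)


def expected_failures(events: int, n_nines: int) -> float:
    """Failures expected out of `events` trials at `n_nines` nines of reliability."""
    return events * 10.0 ** -n_nines


observed_rate = FATAL_ACCIDENTS / FLIGHTS_PER_YEAR
at_five_nines = expected_failures(FLIGHTS_PER_YEAR, 5)

print(f"observed reliability: {nines(observed_rate):.2f} nines")
print(f"accidents per year at five nines: {at_five_nines:.0f}")
print(f"days between accidents at five nines: {365 / at_five_nines:.2f}")
```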

  8. Full Text - Page 1 by Kallahar · · Score: 5, Informative

    The Scenario

    Pagers going off. Phones ringing. People shouting fragments of conversations over the tops of cubicles. Groups of people huddled around monitors. Others dashing up and down the hallways, sticking their heads into office doors for just a moment, then scampering along to the next doorway. You are frantically talking on your cell phone, silencing your pager, and yelling into the speakerphone on your desk while typing on two different keyboards attached to three different monitors.

    Sound familiar? It's a classic case of the dreaded 'downtime' disease, a terrible ailment where none of your systems work and for reasons you can't always understand. Of course, it typically strikes at the most inopportune moments - the launch of a major product upgrade, or right after announcing your partnerships with 5 of the Fortune 100.

    Nobody wants downtime. It's a terrible thing that always involves blood, sweat, tears, and inevitably, a loss of money. This is why when you talk to the upper management of any company with a strategic online initiative you'll be told that the IT group has the highest goals, and that downtime is considered anathema, to be stamped out vigorously.

    Unfortunately, when you talk to the company's IT manager you commonly hear a different story; the resources to back up the company's lofty online goals are hard to come by. In fact, with the downswing of the last couple of years, combined with the fact that IT isn't, at least directly, a revenue generating entity, IT budgets are being reduced while uptime performance levels are expected to be the same. This can just lead to a death march of extremely over-worked IT personnel, and longer, more numerous, occurrences of system downtime. These goals need to be re-evaluated.

    Genesis of the 'Five Nines'

    We've all heard the mantra of 'five nines', or 99.999% reliability. Somewhere in the depths of the Internet's 'big bang', when systems were slow and cranky, reliability became a major selling point of why one company's system was 'better' than the competition.

    First, people talked about being 'two nines' or 99% reliable. Then someone else would top that, and make their product seem better, claiming 'three nines' (99.9%). Not long after that came 'four nines' (99.99%) and then, near the peak of the dot com era, came 'five nines'.

    The herd mentality left no room in which to pitch for investment without the 'five nines' claim. "After all," it was thought, "if everyone else is saying they can provide 'five nines', I'd have to pretend I didn't know what I was doing if I didn't say I could match everyone else's claim."

    'Five nines' isn't impossible. It's merely impractical and unnecessary in the world of the Internet. A shocking statement, perhaps, but a truism nonetheless.

    We're not talking about launching people into space (which, by the way, is unfortunately done under 'three nines'), or working with nuclear power plants. We're working within the reference of online systems providing services to users both on and off the Internet - nobody dies from a system failure.

    The Greasy Steel Bar

    Think of uptime as a chin-up bar coated in grease. The higher the reliability desired, the greater the coating of grease. It's clearly tougher to hang on to a higher standard of reliability.

    What's not so obvious, but very important, is that the higher the uptime target, the worse one does if not prepared. An IT department capable of three nines faced with a bar that's five nines slippery won't even manage the three nines they are capable of doing.

  9. Nien? by alienmole · · Score: 3, Informative

    Nein!

  10. Re:Uptime by ranulf · · Score: 4, Informative

    the article is about how uptime is too expensive

    I'd also say impractical. 5 nines is 99.999% availability, i.e. the service can be down for 1 second in every 100,000 seconds (about 27.78 hours). That gives approximately 6 seconds of downtime per week.

    Even if all that week's downtime came at once, six seconds is little enough that most users would just hit refresh and never even notice. Besides which, most web servers are taken down for maintenance tasks, upgrading software or disk, etc... Chances are even restarting the web server would take up more time than your maximum weekly downtime.

    Given that over the course of a month (which is the billing period on most ISP lines), you only have about 26 seconds of possible downtime, it's very unlikely that the ISP will be able to meet that target. Pretty much *any* fault would take longer than that to fix, so any company offering a refund if the SLA isn't met is just asking for trouble.
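To see how tight that budget is, here is a rough sketch that computes the allowance for a billing period and checks a list of outages against it. The 30-day month and the names `downtime_budget_seconds` and `meets_sla` are assumptions for illustration, not anything an ISP actually publishes:

```python
WEEK_SECONDS = 7 * 24 * 3600
MONTH_SECONDS = 30 * 24 * 3600  # assumption: a 30-day billing month


def downtime_budget_seconds(availability: float, period_seconds: int) -> float:
    """Seconds of downtime allowed in a period at the given availability."""
    return period_seconds * (1.0 - availability)


def meets_sla(outages_seconds: list[float], availability: float,
              period_seconds: int) -> bool:
    """True if the summed outages fit inside the period's downtime budget."""
    budget = downtime_budget_seconds(availability, period_seconds)
    return sum(outages_seconds) <= budget


print(f"weekly budget:  {downtime_budget_seconds(0.99999, WEEK_SECONDS):.2f} s")
print(f"monthly budget: {downtime_budget_seconds(0.99999, MONTH_SECONDS):.2f} s")
# A single 90-second restart already blows a five-nines month.
print(meets_sla([90.0], 0.99999, MONTH_SECONDS))
```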

  11. Server vs Service by AftanGustur · · Score: 3, Informative

    Even if all that week's downtime came at once, six seconds is little enough that most users would just hit refresh and never even notice. Besides which, most web servers are taken down for maintenance tasks, upgrading software or disk, etc... Chances are even restarting the web server would take up more time than your maximum weekly downtime.

    You are not making the distinction between "server uptime" and "service uptime". When people talk about 99.something% uptime, they are usually referring to "service uptime". With proper hardware (redundancy, etc.) you can reboot servers, change disks, memory and even routers and it won't cost you even 1 second of "service downtime".

    --
    echo '[q]sa[ln0=aln80~Psnlbx]16isb572CCB9AE9DB03273snlbxq' |dc