Slashdot Mirror


Uptime Realities in the Internet World

schnurble writes: "My former boss has written an interesting article on the realities of uptime in the Internet World. It poses the idea that four and five nines of reliability are too expensive to be realistic, especially in the post dot-bomb economy. It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."

29 of 353 comments

  1. Uptime by dattaway · · Score: 5, Funny

    Wouldn't you know it, an article about uptime...and slashdotted. Looks like he needs a mirror.

    1. Re:Uptime by ranulf · · Score: 4, Informative
      the article is about how uptime is too expensive

      I'd also say impractical. 5 nines is 99.999% availability, i.e. it can be down for 1 second in every 100,000 seconds (about 27.78 hours). That gives approximately 6 seconds of downtime per week.

      Even if all that week's downtime came at once, six seconds is little enough that most users would just hit refresh and never even notice. Besides which, most web servers are taken down for maintenance tasks, upgrading software or disk, etc. Chances are even restarting the web server would take up more time than your maximum weekly downtime.

      Given that over the course of a month (which is the billing period on most ISP lines), you only have about 26 seconds of possible downtime, it's very unlikely that the ISP will be able to meet that target. Pretty much *any* fault would take longer than that to fix, so any company offering a refund if the SLA isn't met is just asking for trouble.
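
A minimal sketch of the budget arithmetic in the comment above, assuming a 7-day week and a 30-day billing month. The function names and the sample outage lengths are illustrative and not taken from any real SLA.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(availability: float, period_days: float) -> float:
    """Maximum downtime, in seconds, allowed over a period at a given availability."""
    return (1.0 - availability) * period_days * SECONDS_PER_DAY


def meets_sla(outages_seconds, availability: float, period_days: float) -> bool:
    """True if the summed outage time fits inside the downtime budget."""
    return sum(outages_seconds) <= downtime_budget_seconds(availability, period_days)


if __name__ == "__main__":
    five_nines = 0.99999
    # Weekly and 30-day budgets at five nines.
    print(round(downtime_budget_seconds(five_nines, 7), 2))
    print(round(downtime_budget_seconds(five_nines, 30), 2))
    # A single hypothetical 45-second restart already exceeds the monthly budget.
    print(meets_sla([45], five_nines, 30))
```

Under these assumptions the budget works out to roughly 6 seconds per week and under half a minute per 30-day month, which is why a single routine restart can be enough to miss the target.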

  2. Customers want it, but don't understand it by derekb · · Score: 5, Insightful


    How many engineers out there have heard the marketing / sales 'it has to be always available' and priced out an infrastructure accordingly?

    Even recently I'm working with a customer who wants a compromise between price and availability - but it still needs five nines.

    Availability is infrastructure plus process. You need to have the supporting process to go along with the hardware - maintenance schedules, change management (well FCAPS in general), etc. It's not just a big box.

    1. Re:Customers want it, but don't understand it by Subcarrier · · Score: 5, Funny

      Even recently I'm working with a customer who wants a compromise between price and availability - but it still needs five nines.

      $999.99

      Problem solved. ;-)

      --
      "I have opinions of my own, strong opinions, but I don't always agree with them." -- George H. W. Bush
    2. Re:Customers want it, but don't understand it by rob_from_ca · · Score: 4, Insightful

      This is the most intelligent thing I've ever heard on slashdot before. If you don't understand this comment, read it again and again until you do. :-)

      If you're a business, your money is far better spent improving the user experience rather than working on buying redundant-everything, building the support infrastructure, and incurring the extra overhead of the tedious and careful processes needed to obtain 5 nines (and 4, and even to a degree 3 nines).

      If your site sucks and no one visits, it doesn't really matter if it's down...work on building something reasonably reliable that is very compelling to your users; that's money much better spent...

    3. Re:Customers want it, but don't understand it by ipsuid · · Score: 5, Informative

      One word to clients... "Outsource"

      Maintaining backend infrastructure with a 5 9's service level agreement really is prohibitively expensive for all but the largest businesses. Especially if they are not a tech company.

      The level of engineering that goes into providing true 5 9's service is extraordinary. Also, some military contracts actually require 6 9's!! (Let alone completely separate networks for classified data).

      I'm actually in the design phase of a data center which requires 5 9's (so we can take on those who decide to outsource). Redundant generators, redundant UPS, redundant routers, redundant HVAC, two separate cable runs from different sides of the building, two connections to the power grid, etc., etc....

      And that's just the physical infrastructure! Now you need to develop, or integrate, the software to completely cover every aspect of your operations. Anything from cable tagging, to ticketing systems, to emergency procedures. After you build all the infrastructure, take that price and double it... that's how much you will be spending to develop all of those operating procedures. At that point, go get ISO certified, since you've already gone above all the requirements.

      If I had to take a guess at a physical cost, $250-300 a square foot seems pretty close (around here anyway). And that only gets cheaper if you are looking at a facility greater than about 10000 sq. ft.

      Unless of course, only marketing has those 5 9's!

      --
      It appears Ockham lost his razor and grew a beard.
  3. my boss.... by Patrick13 · · Score: 5, Funny

    said if i can get this mentioned on slashdot, i'll get the raise after all...

    --
    ::.. check out some Cell Phone Reviews
  4. In my dept... by ALecs · · Score: 4, Funny
    After a major firewall downtime last year, I wanted to have some T-shirts printed up advertising

    Tovaris Systems Support:
    Proudly providing nine-fives reliability.

    The boss didn't go for it, though. :(

  5. 9 9s by digitalsushi · · Score: 5, Funny

    Like the Telco... voice grade telco. Better than the power company.

    Our web server does about 4 9's, which is a downtime of about 8 hours a year, I think. I really suck at math though. I mean it.. I'm so bad at math I have no idea if that's right. I said "well, there's 8760 hours in a year, so 8 divided by that is 0.0009, so that's about 4 9s." I think. 8 hours of downtime isn't that bad. I think the next step up from 8 hours of downtime is essentially those megacorps that have redundant systems, and sirens go off and people die when their server goes down for under a second. In fact, I think if their server actually went down for more than a second, some sort of structural damage to the building hosting it is the only likely scenario. Course, that's closer to 7 9s. I can't figure out how long any of the other 9s are 'cause I only knew what our average downtime is, and could do the math that way only. Wow, it's really hot in here.

    Could someone with an 8th grade math education please post the amounts of downtime 1 through 9 9s are, please?!

    --
    slashdot: where everyone yells sarcastic metaphors to themselves to understand the issue
    1. Re:9 9s by Anonymous Coward · · Score: 5, Informative

      1 nine: 90% availability, or 37 days of downtime per year (Qwest!)
      2 nines: 99% availability, or 88 hours of downtime per year
      3 nines: 99.9% availability, or 9 hours of downtime per year
      4 nines: 99.99% availability, or 53 minutes of downtime per year
      5 nines: 99.999% availability, or 315 seconds of downtime per year
      6 nines: 99.9999% availability, or 32 seconds of downtime per year
      7 nines: 99.99999% availability, or 3 seconds of downtime per year

      Beyond that, it doesn't much matter.
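
For anyone who wants to regenerate the list above, here is a small sketch that assumes a 365-day year (no leap days). The helper names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years


def yearly_downtime_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def humanize(minutes: float) -> str:
    """Render a duration given in minutes using a unit that is easy to read."""
    seconds = minutes * 60
    if seconds < 1:
        return f"{seconds * 1000:.1f} ms"
    if seconds < 120:
        return f"{seconds:.1f} s"
    if minutes < 120:
        return f"{minutes:.1f} min"
    hours = minutes / 60
    if hours < 48:
        return f"{hours:.1f} h"
    return f"{hours / 24:.1f} days"


if __name__ == "__main__":
    for n in range(1, 10):
        availability_pct = 100.0 * (1.0 - 10.0 ** (-n))
        print(f"{n} nines  {availability_pct:.7f}%  {humanize(yearly_downtime_minutes(n))}")
```

Each extra nine divides the yearly allowance by ten, so the table is just 525600 minutes shifted one decimal place per row.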

    2. Re:9 9s by Wrexen · · Score: 5, Informative

      TI-89 > all education
      9's ---- time
      1 876 hours
      2 87 hours
      3 8 hours
      4 52 minutes
      5 5 minutes
      6 31 seconds
      7 3 seconds
      8 .3 seconds
      9 you get the idea

    3. Re:9 9s by Asprin · · Score: 4, Funny

      Could someone with an 8th grade math education please post the amounts of downtime 1 through 9 9s are, please?!

      365 days * 24 hours/day * 60 minutes/hour = 525600 minutes/year.

      uptime       downtime     minutes down/year   fuzzy description of downtime
      .9           .1           52560               ~= 36 days down/yr
      .99          .01          5256                ~= 3.65 days down/yr
      .999         .001         525.6               ~= 9 hours down/yr
      .9999        .0001        52.56               ~= 1 hour down/yr
      .99999       .00001       5.256               ~= 5 minutes down/yr
      .999999      .000001      .5256               ~= 32 seconds down/yr
      .9999999     .0000001     .05256              ~= 3.2 seconds down/yr
      .99999999    .00000001    .005256             ~= 0.3 seconds down/yr (A THIRD OF A SECOND/YEAR!!!!)
      .999999999   .000000001   .0005256            ~= How long it takes for one of these locally hosted sites to get /.'ed


      --
      "Lawyers are for sucks."
      - Doug McKenzie
    4. Re:9 9s by 4of12 · · Score: 4, Funny

      Hmmm...

      Enough nines of reliability and you can probably easily claim that network latency is responsible for the slow response a client is experiencing:)

      The server can go down, be rebooted before the client thinks something is really wrong!

      --
      "Provided by the management for your protection."
  6. Unfortunatly.... by Lord_Slepnir · · Score: 5, Funny

    I think we just knocked his server down to two nines by slashdotting it.

  7. Must hate his ex-boss by palmech13 · · Score: 5, Funny

    What else would motivate someone to post an ex-boss' e-mail address on the front page of slashdot?

  8. 99.999% perfection by Gorm+the+DBA · · Score: 4, Insightful

    Let's see...five nines would be just over five minutes of downtime in a year (315 seconds). For business and other non-life-threatening situations, that would be way better than necessary. Lots of folks are probably going to harp on the "If 1 out of 100,000 airplanes crashed, there'd be X crashes" line of argument. There's a problem with that...one mistake doesn't crash an airplane. Every system on an airliner is redundant, and virtually any "pilot error" has time to be fixed before there's a problem. Listen in on the Air Traffic Control to Cockpit transmissions sometime...just about every flight encounters some minor error at some point, whether it is a pilot needing to reask for a clearance or someone needing to climb or descend a bit to clear a potential collision.

    Errors are unavoidable. The key is to ensure recovery from those errors is possible. So sure, your computer may be down for 5 minutes a year. Make sure you have a backup system that is able to take up the slack instantly and that fails independently of the first, and your downtime is, in principle, down to a few milliseconds a year. Redundancy is the key.
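
A rough sketch of the redundancy argument, under two strong assumptions that real systems rarely satisfy completely: the redundant systems fail independently, and failover is instantaneous. The function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def parallel_unavailability(availabilities) -> float:
    """Fraction of time a redundant group is down, i.e. every member is down at once.

    Assumes independent failures and instantaneous failover.
    """
    unavailability = 1.0
    for a in availabilities:
        unavailability *= (1.0 - a)
    return unavailability


def yearly_downtime_seconds(unavailability: float) -> float:
    """Convert an unavailability fraction into seconds of downtime per year."""
    return unavailability * SECONDS_PER_YEAR


if __name__ == "__main__":
    single = parallel_unavailability([0.99999])
    pair = parallel_unavailability([0.99999, 0.99999])
    print(yearly_downtime_seconds(single))  # one five-nines system
    print(yearly_downtime_seconds(pair))    # two of them in parallel
```

With those assumptions, two five-nines systems in parallel are both down only about 1e-10 of the time, which is on the order of 3 milliseconds a year. Shared failure causes (same power feed, same bad deploy) would make the real figure worse.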

  9. Simple by American+AC+in+Paris · · Score: 5, Funny


    Five nines uptime is cheap and easy. It all boils down to where you put the decimal point.

    --

    Obliteracy: Words with explosions

  10. Re:Nothing is THAT Important by medcalf · · Score: 4, Insightful

    Not true. Five 9s in the airlines means that you'd see an airliner late or in some other way unavailable - possibly due to a crash, but not likely - every other day. Reliability is the availability to do what you need, when you need it. If a server is up 100% of the time, but is not able to be accessed because the network is down, the system is not reliable for you.

    --
    -- Two men say they're Jesus. One of them must be wrong. - Dire Straits
  11. Close, but it depends by alexhmit01 · · Score: 5, Interesting

    Let me give you a hypothetical case. One of our clients does about $50k/month on their web site. When the site was built, they were only expecting $10000-$15000/month. At the time, NN4 compatibility wasn't important, because the extra cost ($10k) wasn't going to be worth it. With NN4 sitting between 5% and 10% each month, they have decided that NN4 compatibility is important in the next version.

    When we launched, 3 days of downtime a month was considered okay. It was considered a better choice than spending an extra $5k on hardware for redundancy. Well, when the site broke $40k/month, we immediately decided that that was no good and invested in the redundancy.

    The site has had a few 15 minute outages over the past 6 months, and a 1 day outage over a holiday weekend (not a big deal). However, if the site doubles in revenue again, downtime is becoming less acceptable, and we'll drop $10k to avoid it.

    If your site sucks and no one visits, downtime doesn't matter. If you are making lots of money, downtime does matter. $10k on hardware is worth it if the downtime would cost you $25k?

    Alex
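
The trade-off described above can be sketched as a simple payback calculation. This is only an illustration: it assumes revenue arrives evenly over a 30-day month, that sales lost during downtime are never recovered later, and that the extra hardware removes the downtime entirely. The function names are hypothetical.

```python
def expected_monthly_loss(monthly_revenue: float, downtime_days: float,
                          days_in_month: float = 30.0) -> float:
    """Revenue lost per month if sales stop completely while the site is down."""
    return monthly_revenue * downtime_days / days_in_month


def payback_months(hardware_cost: float, monthly_loss_avoided: float) -> float:
    """Months until avoided losses cover the hardware cost."""
    if monthly_loss_avoided <= 0:
        return float("inf")
    return hardware_cost / monthly_loss_avoided


if __name__ == "__main__":
    # Figures taken from the comment: 3 days of tolerated downtime, $5k of hardware.
    for revenue in (12_500, 40_000):
        loss = expected_monthly_loss(revenue, downtime_days=3)
        print(revenue, loss, payback_months(5_000, loss))
```

At around $12.5k/month the $5k of redundancy takes about four months to pay for itself; at $40k/month it takes a little over one month, which matches the decision described in the comment.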

    1. Re:Close, but it depends by Slak · · Score: 5, Funny

      If nobody visits a site that's down, is it really down? ;)

  12. Re:Oi! You act like a manager! by dasmegabyte · · Score: 4, Insightful

    Actually, even this is silly. True five nines availability on a widely distributed network would mean that an application was available at all times on all segments of the network. Which would mean that your uptime depends not only on your redundancy on one side of a pipe, but on your overall redundancy as well, so that when a pipe goes down you're still accessible. Since when a pipe goes down in your host you probably lose other resources as well (such as power or alternate pipelines), this means multiple datahouses owned by multiple vendors. Each of these has to have a perfect backup of all data and be running the same versions of all software. Really, the only true redundancy would be so heavily distributed that each local network would basically have to have its own server. This isn't so crazy -- technically, DNS and email do this. However, we all know that for an end user even DNS and email can have perceived outages.

    And this is why 5 9s is foolish. Sure, you're redundant behind the pipe, but if you lose the pipe you can't blame your datacenter when you charged a customer for uninterrupted service. Technically, if their modem disconnects them for a few hours you've broken contract.
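
The point about the pipe can be put in numbers with a short sketch. It assumes the end-to-end path is a chain of components that must all be up at once and that they fail independently. The three availability figures are invented for illustration.

```python
from math import prod


def series_availability(availabilities) -> float:
    """Availability of a chain in which every component must be up at the same time.

    Assumes independent failures.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Hypothetical figures: a five-nines datacenter, a three-nines pipe,
    # and a two-nines connection on the user's end.
    chain = [0.99999, 0.999, 0.99]
    print(series_availability(chain))
```

The result is always at or below the weakest link, so what the end user sees here is roughly 98.9% no matter how many nines sit behind the pipe.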

    Besides, who needs it? If yahoo is unreachable from my desk, I wait and reconnect. It doesn't matter if the downtime was my fault or theirs...the effect on my user experience was the same. Any services I might have used, or products purchased, I will use or purchase at a later time. After all, I don't refrain from buying shoes just because the mall is closed!

    --
    Hey freaks: now you're ju
  13. Boss Slashdotted! by cOdEgUru · · Score: 4, Funny

    I believe there's more to this than meets the eye.

    What better way to get back at your former boss than slashdotting his company server back to the medieval ages..

    Follow that up with multiple queries on google about boss's info, credit cards, ssn etc..

    To cut things short, by the end of the week :

    Boss's boss realizes the server crashes were due to Boss, fires his ass on the spot.

    Wife realizes that the new unexplained charges on Credit card from "Suzy's Parlor" were not exactly the next door cafe. Gives him the boot as well.

    You evil man..you!

  14. Re:Nothing is THAT Important by Sique · · Score: 4, Insightful

    No, it means that a jetliner has to be operating for a year with just 5 mins in the hangar. But about half its lifetime a jetliner is in maintenance, giving it an uptime of about 50%.

    --
    .sig: Sique *sigh*
  15. Re:If the ailerons are not available by Jobe_br · · Score: 4, Interesting

    Entirely. Having worked extensively on the flight deck systems for the Boeing 767-400ER, I can tell you first hand that the redundancy is rather amazing. There are two major computer systems that drive the displays in the cockpit, the DPCs which do a lot of digital signal manipulation and the DCCs which do a lot of the analog to digital signal manipulation and control. Two DCC boxes drive three DPC boxes and the two DCC boxes are cross-connected to each of the DPC boxes. The three DPC boxes each talk to each other (I'm not sure if the DCC boxes talked to each other - that was further down the chain than I was working on) and actually vote on the data points that are being sent to the displays to determine if one of the DPCs is malfunctioning or processing bad data. The way this all works together is amazingly complicated, especially when you consider that it all runs on embedded boards where the "executable" is typically less than 1-2MBs in size.
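
To give a flavor of what voting between three redundant channels can look like, here is a generic mid-value-select sketch. It is a textbook triple-redundancy technique and is not the actual 767-400ER logic, which the comment does not describe in detail. The function name, the tolerance and the sample readings are all assumptions.

```python
from statistics import median


def vote(readings, tolerance: float):
    """Mid-value select over three redundant readings of the same data point.

    Returns the voted value (the median) and the indices of any channels
    whose reading differs from it by more than the tolerance.
    """
    if len(readings) != 3:
        raise ValueError("expected exactly three redundant readings")
    voted = median(readings)
    suspects = [i for i, r in enumerate(readings) if abs(r - voted) > tolerance]
    return voted, suspects


if __name__ == "__main__":
    # Hypothetical airspeed readings (knots) from three channels; the third disagrees.
    print(vote([250.0, 250.5, 310.0], tolerance=2.0))
```

Taking the median means a single wild channel can never become the displayed value, and the list of suspects gives the rest of the system something to act on, such as switching away from the faulty unit.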

    My particular area of development was the actual display software which was provided data from the DPC systems. Each of the six displays (2-pilot, 2-copilot, 2-EICAS in the console) received multi-cast data from each of the DPCs and then fed data back to the DPCs on the display's status. The DPCs would then automagically evaluate if the displays were functioning properly and switch primary functions away from a malfunctioning display to a functioning display if error conditions were detected.

    The PFD (primary flight display) is the pilot's most important display as it displays airspeed, artificial horizon, TCAS warnings, altitude and a few other things. The ND (navigation display) is the inner screen on both the pilot/co-pilot sides and if the PFD experiences error conditions, the DPCs switch the PFD to the ND and the ND to one of the EICAS (engine indicators, etc.) displays.

    All very interesting stuff ... especially the way it's actually implemented in the embedded system. Debugging all this, of course, was non-trivial. For that matter, coding it is non-trivial as it's all in Ada83.

    Ahh ... those were the days :)

  16. Re:I don't really agree here... by Pfhreakaz0id · · Score: 5, Funny

    I'm sorry, I just had a type mismatch error. I saw "oracle" and "isn't really that expensive" in the same post.

  17. Re:Then the *system* is not five nines! by ColaMan · · Score: 4, Funny

    And frankly I'd rather not be in a plane that lost control for five minutes once a year.

    As long as it's parked on the ground during those five minutes, it's no problem.

    --

    You are in a twisty maze of processor lines, all alike.
    There is a lot of hype here.
  18. In Germany, 5 nines is bad. by Carmody · · Score: 5, Funny

    "Are ve up?"
    "Nien."
    "Are ve up yet?"
    "Nien."
    "How about NOW?"
    "Nien."
    "Vill ve be comink up soon?"
    "Nien."
    "Vill ve be up next veek?"
    "Nien."

    --
    God is real unless declared integer
  19. Full Text - Page 1 by Kallahar · · Score: 5, Informative

    The Scenario

    Pagers going off. Phones ringing. People shouting fragments of conversations over the tops of cubicles. Groups of people huddled around monitors. Others dashing up and down the hallways, sticking their heads into office doors for just a moment, then scampering along to the next doorway. You are frantically talking on your cell phone, silencing your pager, and yelling into the speakerphone on your desk while typing on two different keyboards attached to three different monitors.

    Sound familiar? It's a classic case of the dreaded 'downtime' disease, a terrible ailment where none of your systems work and for reasons you can't always understand. Of course, it typically strikes at the most inopportune moments - the launch of a major product upgrade, or right after announcing your partnerships with 5 of the Fortune 100.

    Nobody wants downtime. It's a terrible thing that always involves blood, sweat, tears, and inevitably, a loss of money. This is why when you talk to the upper management of any company with a strategic online initiative you'll be told that the IT group has the highest goals, and that downtime is considered anathema, to be stamped out vigorously.

    Unfortunately, when you talk to the company's IT manager you commonly hear a different story: the resources to back up the company's lofty online goals are hard to come by. In fact, with the downswing of the last couple of years, combined with the fact that IT isn't, at least directly, a revenue-generating entity, IT budgets are being reduced while uptime performance levels are expected to stay the same. This can only lead to a death march of extremely overworked IT personnel, and longer, more numerous occurrences of system downtime. These goals need to be re-evaluated.

    Genesis of the 'Five Nines'

    We've all heard the mantra of 'five nines', or 99.999% reliability. Somewhere in the depths of the Internet's 'big bang', when systems were slow and cranky, reliability became a major selling point of why one company's system was 'better' than the competition.

    First, people talked about being 'two nines' or 99% reliable. Then someone else would top that, and make their product seem better, claiming 'three nines' (99.9%). Not long after that came 'four nines' (99.99%) and then, near the peak of the dot com era, came 'five nines'.

    The herd mentality left no room in which to pitch for investment without the 'five nines' claim. "After all," it was thought, "if everyone else is saying they can provide 'five nines', I'd have to pretend I didn't know what I was doing if I didn't say I could match everyone else's claim."

    'Five nines' isn't impossible. It's merely impractical and unnecessary in the world of the Internet. A shocking statement, perhaps, but a truism nonetheless.

    We're not talking about launching people into space (which, by the way, is unfortunately done under 'three nines'), or working with nuclear power plants. We're working within the reference of online systems providing services to users both on and off the Internet - nobody dies from a system failure.

    The Greasy Steel Bar

    Think of uptime as a chin-up bar coated in grease. The higher the reliability desired, the greater the coating of grease. It's clearly tougher to hang on to a higher standard of reliability.

    What's not so obvious, but very important, is that the higher the uptime target, the worse one does if not prepared. An IT department capable of three nines faced with a bar that's five nines slippery won't even manage the three nines they are capable of doing.

  20. Re:Nothing is THAT Important by SuiteSisterMary · · Score: 5, Funny

    The same thing happened to me once, a little puddlejumper from Dallas or Houston to Austin, I think it was.

    Anywho, the pilot revs up the engine, then throttles it back down. Fine, brakes and throttle work. Throttles back up, trips the brakes, and off we go screaming down the runway.

    Then the plane slows down, and stops.

    Pilot comes on the intercom and says 'Um, folks, you may have noticed, we didn't take off. A warning light has come on in the cockpit, and we don't know why. Until we do, we're going to stay right here.'

    Now, that's not the bad part. The heat and humidity, and a plane full of sweaty smelly passengers isn't the bad part, either.

    No, the bad part was the pair of off duty pilots in the seats next to me who started, in loving detail, discussing every thing that could possibly be wrong.

    --
    Vintage computer games and RPG books available. Email me if you're interested.