Why Is Less Than 99.9% Uptime Acceptable?

Posted by Zonk on Sunday March 2, 2008 @08:21AM from the gets-boring-to-stand-there dept.

Ian Lamont writes "Telcos, ISPs, mobile phone companies and other communication service providers are known for their complex pricing plans and creative attempts to give less for more. But Larry Borsato asks why we as customers are willing to put up with anything less than 99.999% uptime. That's the gold standard, and one that we are used to thanks to regulated telephone service. When it comes to mobile phone service, cable TV, and Internet access, service interruptions are the norm, and everyone seems willing to grin and bear it: 'We're so used to cable and satellite television reception problems that we don't even notice them anymore. We know that many of our emails never reach their destination. Mobile phone companies compare who has the fewest dropped calls (after decades of mobile phones, why do we even still have dropped calls?). And the ubiquitous BlackBerry, which is a mission-critical device for millions, has experienced mass outages several times this month. All of these services are unregulated, which means there are no demands on reliability, other than what the marketplace demands.' So here's the question for you: Why does the marketplace demand so little when it comes to these services?"

10 of 528 comments

Re:The cost by HairyCanary · 2008-03-02 08:29 · Score: 4, Informative

Exactly what I was thinking. I work for a CLEC, and I have a rough idea how much things cost -- compare what a Lucent 5E costs with what a top of the line Cisco router costs, and you have the answer why voice service achieves five-nines while data service typically does not.

Re:because its ridiculous by X0563511 · 2008-03-02 08:33 · Score: 3, Informative

I just did the math. 99.999% uptime is "just over 5 minutes of downtime per year" or "just over half a minute per year", depending on whether I stuck an extra digit in there...

Clearly, a ridiculous number.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...

Re:Costs increase geometrically by (H)elix1 · 2008-03-02 08:33 · Score: 4, Informative

Because every nine will cause a geometric increase in costs.

This

Uptime (%)   Downtime per year
90%          876 hours (36.5 days)
95%          438 hours (18.25 days)
99%          87.6 hours (3.65 days)
99.9%        8.76 hours
99.99%       52.56 minutes
99.999%      5.256 minutes
99.9999%     31.536 seconds
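
The figures above can be reproduced with a few lines of arithmetic. This is only a sketch, assuming a 365-day year (8,760 hours); the function names are made up for illustration and do not come from the post.

```python
# Downtime allowed per year at a given uptime percentage.
# Assumption: a 365-day year, i.e. 8,760 hours.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the hours of downtime per year allowed at this uptime."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


def format_downtime(hours: float) -> str:
    """Render a downtime figure in hours, minutes, or seconds."""
    if hours >= 1:
        return f"{hours:.2f} hours"
    minutes = hours * 60
    if minutes >= 1:
        return f"{minutes:.3f} minutes"
    return f"{minutes * 60:.3f} seconds"


if __name__ == "__main__":
    for uptime in (90, 95, 99, 99.9, 99.99, 99.999, 99.9999):
        print(f"{uptime}%", format_downtime(downtime_hours_per_year(uptime)))
```

Each extra nine divides the allowed downtime by ten, which is why the budget drops from hours to minutes to seconds so quickly.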

I work for a software shop where we can do high availability, but more often than not, folks choose to lower the uptime expectation rather than pony up for the stupid money it takes to have the hardware / software / infrastructure to get there. Most companies know the customer will not pay the extra cash for the uptime, thus... you get what you pay for.

--
+++ UGUCAUCGUAUUUCU

Because it's not necessary? by Srass · 2008-03-02 08:39 · Score: 3, Informative

Well, my guess would be that many (but not all) people understand that being able to call an ambulance because Aunt Betty has fainted is a necessity, but being able to chat with Aunt Betty for an hour from your car isn't. Missing a rerun of Laverne and Shirley isn't critical, and neither is having to wait to post those vacation pictures to Flickr. Your coworkers will, in all probability, somehow muddle through if you can't send them email from your BlackBerry.

The telephone as we know it was the first genuinely instantaneous, worldwide communications medium that anyone could use. It was seen as a necessary component of national security during the Cold War, and was built out as such. We've had over a century to perfect it, and vast amounts of money were spent doing so. Despite its origins at DARPA, the Internet as we know it today, although more useful, is by and large less of a basic need and far more complex, and large portions of it are still built on top of the telephone infrastructure besides.

I can't help but think that most people understand this sort of thing, and understand that bringing such modern conveniences up to five nines of reliability is difficult and expensive, and people have evidently decided that a certain tradeoff to make such things affordable isn't out of line.

The shorter, more pessimistic version of this is probably, "It's cheaper to suck."

Just one point ... by tomhudson · 2008-03-02 09:28 · Score: 3, Informative

In a properly designed cell phone system, if the tower you were going to be handed off to can't take the connection, either the tower you're with will keep the connection, or another (though still sub-optimal) tower will take it.
Of course, when you don't have transmitters with overlapping coverage, this doesn't work.

Introducing the EULA by Mr+Pippin · 2008-03-02 10:57 · Score: 4, Informative

Also, because the EULA came into existence, product warranties effectively vanished, along with the actions a consumer could take in court via product liability claims.

After all, liability plays a large part in defining QA policies. If software companies were held to the same liability standards most product manufacturers face, I'd bet software development would be more of the engineering practice it should be.

To quote part of Microsoft's EULA for Windows XP:

http://www.microsoft.com/windowsxp/home/eula.mspx
ALSO, THERE IS NO WARRANTY OR CONDITION OF TITLE, QUIET ENJOYMENT, QUIET POSSESSION, CORRESPONDENCE TO DESCRIPTION OR NON-INFRINGEMENT WITH REGARD TO THE SOFTWARE.

Re:because they've been conditioned by NevermindPhreak · 2008-03-02 11:22 · Score: 5, Informative

I believe you are correct. The market isn't "conditioned" into thinking that anything less than five 9s is acceptable. They just don't want to pay the cost associated with it. The price/reliability ratio right now is the one that will satisfy the most customers. 99.999% reliability is harder to sell than 99.9% reliability at half the cost.

I work for a cable company, by the way. I design a lot of the building-out of our system, so I know the actual costs associated with creating that kind of reliability. Whenever someone needs that kind of reliability, I actually recommend getting a second ISP as a low-speed backup solution. It is the only smart way to go to get complete reliability, as pretty much any company advertising 99.999% reliability in this area is outright lying to the customer. (I know this from experience. I have switched customers over to our ISP after week-long (or longer) outages at every other ISP here, and there are quite a few.) Besides, a good router will split bandwidth between the ISPs so you're not paying for something you're not using. (This is called "bonding".)

I still get amazed when people yell at me for being offline for a few hours after maybe 3, 4, 5 years of uptime. They say that they are losing thousands of dollars per day they are offline. Yet, they don't want to pay for a $40 roll-over backup. THESE are the vast majority of customers who complain so much about 99.999% uptime.

On another note, I think any claim of 99.999% on POTS is anecdotal. Growing up, I had my power cut out at least twice a year, and the phone system was hardly 99.999%. Trees fall on lines, and people cut buried lines for all sorts of accidental reasons. Just as you insure anything valuable enough, and just as you back up data in multiple locations, you need a fallback plan for when your ISP goes out, if it means that much to you.

Re:because they've been conditioned by Jurily · 2008-03-02 11:34 · Score: 3, Informative

Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: "A black swan is an outlier, an event that lies beyond the realm of normal expectations." Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They're the kind of things that happen so rarely it doesn't even make sense to use normal statistical methods like "mean time between failure." What's the "mean time between catastrophic floods in New Orleans?"

http://www.joelonsoftware.com/items/2008/01/22.html

On redundancy by Animats · 2008-03-02 17:15 · Score: 3, Informative

In the entire history of electromechanical switching in the Bell System, no central office was ever out of action for more than thirty minutes for any reason other than a natural disaster. On the other hand, step-by-step (Strowger) switches failed to connect about 1% of calls correctly, and crossbar reduced that to about 0.1%. With electronic switching, the failure rate is higher but the error rate is much lower.
This reflects the fact that, in the electromechanical era, the hardware reliability was low enough that the system had to be designed to have a higher reliability than any of its individual units. In the computer era, the component reliability is so high that good error rates can be achieved without redundancy. This is why computer-based networks tend to have common mode failures.
If you're involved in designing highly reliable systems, it's worth understanding how Number 5 Crossbar worked. Here's an oversimplified version.
The biggest components of Number 5 Crossbar were the crossbar switches themselves. Think of them as 10x10 matrices of contacts which could be X/Y addressed and set or cleared. Failure of one crossbar switch could take down only a few lines, and they usually failed one row or column at a time, taking down at most one line.
The crossbars had no smarts of their own; they were told what to do by "markers", the smart part of the central office. Each marker could set up or tear down a call in about 100ms. Markers were duplicated, with half of the marker checking the other half. If the halves disagreed, the transaction aborted. Each central office had multiple markers (not that many, maybe ten in an office with 10,000 lines), and markers were assigned randomly to process calls.
When a phone went off hook, a marker was notified, and set up a "call" to some free "originating register", the unit that understood dial pulses and provided dial tone. The marker was then released, while the user dialed. The originating register received the input dial info, and when its logic detected a complete number, it requested a random marker, and sent the number. The marker set up the call, set and locked in the correct contacts in the crossbars, and was released to do other work.
If the marker failed to set up the call successfully (there was a timeout around 500ms), the originating register got back a fail, and retried, once. One retry is a huge win; if there's a 1% fail rate on the first try and the failures are independent, there's a 0.01% fail rate with two tries. This little trick alone made crossbar systems appear very reliable. There's much to be said for doing one retry on anything which might fail transiently. If the retry also fails, unit-level retry as a strategy probably isn't working and the problem needs to be kicked up a level.
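
For software, the single-retry idea might look like the following. This is a rough sketch, not anything from the Bell System itself: `TransientFailure`, `call_with_one_retry`, and the `make_flaky` test helper are hypothetical names, and the failure-rate arithmetic assumes the two tries fail independently.

```python
class TransientFailure(Exception):
    """Hypothetical error type for a failure that may not repeat."""


def call_with_one_retry(operation):
    """Run operation(); if it fails transiently, try exactly once more.

    A second failure is not caught here: it propagates so that the
    next level up can decide what to do.
    """
    try:
        return operation()
    except TransientFailure:
        return operation()


def failure_rate_after_one_retry(p_fail: float) -> float:
    """Chance that both tries fail, assuming independent failures."""
    return p_fail * p_fail


def make_flaky(failures_before_success: int):
    """Test helper: a callable that fails a fixed number of times first."""
    remaining = [failures_before_success]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFailure("simulated transient fault")
        return "connected"

    return operation
```

With a 1% first-try failure rate, `failure_rate_after_one_retry(0.01)` gives 0.0001, i.e. the 0.01% figure quoted above. Deliberately not looping matters here: a second failure is treated as a signal for the caller rather than something to hammer on.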
The pattern of requesting resources from a pool at random was continued throughout the system. Trunks (to other central offices), senders (for sending call data to the next switch), translators (for converting phone numbers into routes), billing punches (for logging call data), and trouble punches (for logging faults) were all assigned on a random, or in some cases a cyclic rotation basis. Units that were busy, faulted, or physically removed for maintenance were just skipped.
That's how the Bell System achieved such good reliability with devices that had moving parts.
Note that this isn't a "switch to backup" strategy. The distribution of work amongst units is part of normal operation, constantly being exercised. So handling a failure doesn't involve special cases. Failures cost you some system capacity, but don't take the whole system down.
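
The pool pattern described above could be sketched roughly as follows. The `Unit` and `Pool` classes and their fields are invented for illustration; the only behavior taken from the description is that work goes to a randomly chosen unit and that busy, faulted, or removed units are simply skipped.

```python
import random
from dataclasses import dataclass


@dataclass
class Unit:
    """One interchangeable resource, e.g. a marker or a trunk (hypothetical)."""
    name: str
    busy: bool = False
    faulted: bool = False
    removed: bool = False  # physically pulled for maintenance

    @property
    def available(self) -> bool:
        return not (self.busy or self.faulted or self.removed)


class Pool:
    """Hands out units at random; unavailable units are just skipped."""

    def __init__(self, units, rng=None):
        self.units = list(units)
        self.rng = rng if rng is not None else random.Random()

    def acquire(self):
        """Return a random available unit marked busy, or None if none is free."""
        candidates = [u for u in self.units if u.available]
        if not candidates:
            return None
        unit = self.rng.choice(candidates)
        unit.busy = True
        return unit

    def release(self, unit):
        """Return a unit to the pool once its work is done."""
        unit.busy = False
```

There is no separate failover path in this sketch: a faulted unit shrinks the candidate list, so a failure costs capacity but the same code path keeps serving requests.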
We need more of that in the Internet. Some (not all) load balancers for web sites work like this. Some (but not all) packet switches work like this. Think about how you can use that pattern in your own work. It worked for more than half a century for the Bell System.

Re:because they've been conditioned by tronbradia · 2008-03-02 20:23 · Score: 5, Informative

Actually our health system has completely ballooning costs relative to other countries and is really more of an example of the opposite phenomenon, where insurance must pay for all possible treatment or be sued. Our system without a doubt provides the most care of any system in the world, even though it's pretty obvious that returns diminish dramatically after about 10% of GDP (we are at 15% of GDP; the runner-up is Switzerland at 11 or 12%). Returns diminish because, essentially, more care doesn't actually make people healthier past a certain point. 99% of people just need a GP (cheap), immunizations (dirt cheap), antibiotics when they get a bacterial infection (dirt cheap), and surgeons to sew them up when they get in a car crash (expensive-ish but hopefully uncommon and only rarely protracted). The problem is that whenever anybody gets anything terminal, there's the potential for basically infinite spending, and the more successful treatment is, the more money goes in because treatment is prolonged. In this case our system is not "barely good enough"; it's more like way too good, or at least, way too generous.