Why Is Less Than 99.9% Uptime Acceptable?
Ian Lamont writes "Telcos, ISPs, mobile phone companies and other communication service providers are known for their complex pricing plans and creative attempts to give less for more. But Larry Borsato asks why we as customers are willing to put up with anything less than 99.999% uptime? That's the gold standard, and one that we are used to thanks to regulated telephone service. When it comes to mobile phone service, cable TV, and Internet access, service interruptions are the norm, and everyone seems willing to grin and bear it: 'We're so used to cable and satellite television reception problems that we don't even notice them anymore. We know that many of our emails never reach their destination. Mobile phone companies compare who has the fewest dropped calls (after decades of mobile phones, why do we even still have dropped calls?) And the ubiquitous BlackBerry, which is a mission-critical device for millions, has experienced mass outages several times this month. All of these services are unregulated, which means there are no demands on reliability, other than what the marketplace demands.' So here's the question for you: Why does the marketplace demand so little when it comes to these services?"
The marketplace has been duped into believing that this is the best technology can provide. People don't have time to know, understand, or research history and find that technology really can be reliable.
I'll get modded troll, but I lay much of this at Microsoft's feet. I laughed them off when I first heard of them and their goal of taking over the industry. After all, I'd been working on systems that ran 24x7 with five-9 reliability for years, and DOS/Windows couldn't touch that.
One time I had an opportunity to visit Microsoft and have lunch with a friend there. I figured while there I'd take the opportunity. I asked them in hushed tones, "Just how do you configure Windows so that you don't have to reboot it all of the time?" They looked at me like I was crazy.
Technology can provide reliability. The general public is no longer even aware that it's possible.
Oh Zonk, I'm marking your story as "flamebait". :(
Complacent consumerism. "Hey, it's always been this way so they [service providers] must not be able to have 99.9% uptime. If they had the capability, they sure would provide it to us, their customers."
We figured out a long time ago that it's easier to elect seven judges than to elect 132 legislators.
Probably because of the cost. I do network design for a fairly large telco, and let me tell you the cost goes up exponentially with the number of "9"s that the business asks for.
Basically, we don't rely so much on any single system, so a brief outage can be tolerated because there are alternatives to choose from.
This is also the basis of Clayton Christensen's theories on disruptive innovation - that a consumer of something (technology, etc.) is willing to trade off some of these aspects, like reliability, for cost or performance benefits (however you wish to define those benefits...).
The last time I wrote code, it was Morse
"... after decades of mobile phones, why do we even still have dropped calls?"
It's a little thing called physics. When you're traveling while using your phone, you may transit into dead zones. We could solve this by cutting down all the trees and flattening the landscape, but that might make some people angry...
512 MB RAM, 20 GB disk, 200 GB transfer, five datacenters. $19.95/month.
You can have one or the other.
We're not talking about software, we're talking about hardware and man-hours. Those will never be free.
Gone!
'Five nines' of uptime is a ridiculous and exaggerated expectation for pretty much anything technological that is not life-threatening.
Whenever people talk about 99.999 uptime for a service delivered over the internet I laugh in their faces.
In the free world the media isn't government run; the government is media run.
As consumers, we're made to feel helpless. The worst we can do (without litigation) to a company is complain or refuse to use their services, but what harm can that do to a giant conglomerate? And in situations in which one company has a monopoly in a certain area of the country, for example, consumers may not have the ability to switch or do without.
As a personal example, Comcast owes me a refund check for Internet services I canceled six months ago. If I, as a consumer, had allowed my debt to go unpaid for that long, my account would have been sent to collections long ago. But the problem is that most of the power--with the economics of the situation, with politicians, and so on--lies on one side of the table, and that power ain't with the consumer.
Are these kinds of outages really so common? Mobile phones I absolutely agree with. On the other hand, I literally cannot remember the last time I lost cable or my internet. I've literally lost power more frequently than either of them (maybe 4 times in the past year) and lost water once. Emails not making it to their destination--again, does this really happen? In the decade plus I've been using internet email, I can't off the top of my head ever think of any "lost" email unless it was sent to a wrong address or something.
If offered cell plans that cost $50/month with rare outages or $150 a month with extremely rare outages, which would most people take?
99.999% (5 nines) of reliability is achievable, but it's very expensive and hard to do. Everything has to be redundant, with no single point of failure, everything has to support fail-over seamlessly, the software has to be tested with extreme rigor, and upgrade procedures need to function nearly instantly and support rollback without loss of service.
Because every nine will cause a geometric increase in costs.
This
Uptime (%)    Downtime per year
90%           876 hours (36.5 days)
95%           438 hours (18.25 days)
99%           87.6 hours (3.65 days)
99.9%         8.76 hours
99.99%        52.56 minutes
99.999%       5.256 minutes
99.9999%      31.536 seconds
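For anyone who wants to check figures like these, here is a minimal sketch of the arithmetic. It assumes a 365-day year (leap years ignored), and the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored


def downtime_seconds_per_year(uptime_percent: float) -> float:
    """Allowed downtime per year, in seconds, for a given uptime percentage."""
    return SECONDS_PER_YEAR * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for pct in (90, 95, 99, 99.9, 99.99, 99.999, 99.9999):
        secs = downtime_seconds_per_year(pct)
        print(f"{pct}%: {secs / 3600:.4f} hours ({secs:.3f} seconds)")
```

Each extra nine divides the yearly budget by ten, which is why five nines comes out at a little over five minutes and six nines at roughly half a minute.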
I work for a software shop where we can do high availability, but more often than not, folks choose to lower the uptime expectation rather than pony up for the stupid money it takes to have the hardware / software / infrastructure to get there. Most companies know the customer will not pay the extra cash for the uptime, thus... you get what you pay for.
+++ UGUCAUCGUAUUUCU
To put it simply, it's the money, stupid. It requires a lot more equipment and manpower to offer a high availability service. This extra cost results in higher prices. It can cost 1000% more a month for less than 1% more reliability. Think of a $400 a month T1 with an SLA versus a $40/month cable line. Being sheep has nothing to do with it.
You don't make the poor richer by making the rich poorer. - Winston Churchill
Because 90% of stuff labeled 'mission critical' actually isn't. Think about it - for most of us, being able to receive or send cellphone calls or emails at any time seems super important, but the number of hours in any given month where it really *was* super important (the grant application was due in two hours; your mother was sick; your partner was about to go into labor; whatever) is generally pretty low - our real tolerance for occasional downtime is therefore quite high.
Well, my guess would be that many (but not all) people understand that being able to call an ambulance because Aunt Betty has fainted is a necessity, but being able to chat with Aunt Betty for an hour from your car isn't. Missing a rerun of Laverne and Shirley isn't critical, and neither is having to wait to post those vacation pictures to Flickr. Your coworkers will, in all probability, somehow muddle through if you can't send them email from your blackberry.
The telephone as we know it was the first genuinely instantaneous, worldwide communications medium that anyone could use. It was seen as a necessary component for national security during the cold war, and was built out as such. We've had over a century to perfect it, and vast amounts of money were spent doing so. Despite its origins at DARPA, the Internet as we know it today, although more useful, is by and large less of a basic need, is far more complex, and large portions of it are still built on top of the telephone infrastructure, besides.
I can't help but think that most people understand this sort of thing, and understand that bringing such modern conveniences up to five nines of reliability is difficult and expensive, and people have evidently decided that a certain tradeoff to make such things affordable isn't out of line.
The shorter, more pessimistic version of this is probably, "It's cheaper to suck."
"The marketplace has been duped into believing that this is the best technology can provide. People don't have time to know, understand, or research history and find that technology really can be reliable."
No. They believe it is the best the technology can provide at a given price. Why do people "put up" with cars that only give them X amount of protection in a car crash even though there is technology out there that would make them safer? Because they aren't willing to pay the marginal cost for the extra protection. Arguing about what is possible with technology is pointless. What matters is what a piece of technology can do at a given price.
Everything is a trade-off. The sooner Slashdot learns this the less we will have these stupid "Why don't consumers use the latest, greatest, most expensive technology? We need to force them somehow!" articles.
Creative Demolition
When Comtrash Internet dropped my speed from 6 Mbps to 1 Mbps but kept the rate at 6 times DSL, I dropped Comtrash and went with the 1.5 Mbps DSL from my local telco. I got 50% more than Comtrash was delivering at 1/6th the cost. No problem.
When Microsoft decided that I didn't own the rights to my own media and stopped me from being able to copy my own DVDs, I decided to drop them for my media development system and I switched to Linux and Apple. Microsoft doesn't want my business so I went with the people who do. No problem.
When my Long Distance company decided to charge over $1.00 per minute for International calls, I switched to AT&T and their 17 cents a minute program. No problem.
When Frigidaire washers charged extra for the warm water cycle but only gave you 5 seconds of hot water and thus, never any, it was no problem to return the unit and buy a different brand. Sure, the salesman wasn't happy but, that is now his problem and not mine. I bought a different brand that did give me what they advertised and promised. No problem.
The list is endless and across all businesses and domains.
The point is that there are alternatives, but many (or most) people are either too lazy to do anything about it or, like this article, too apathetic to do anything about it.
The choice is up to the consumer and, if the consumer would take action, the industry would have to adapt because the market demands it. So far, the market is willing to accept this and thus, the industry sees no reason to change. The less the consumer will accept for their dollar the less they will receive. That is the problem.
Banjo - The more I know about Windoze, the more I love *nix
It's all about cost vs. the cost of downtime. You'll find in business lines such as the financial sector, customers are willing to pay for extremely high availability because time is indeed money. Business lines that have lower costs for downtime have to weigh availability vs. ROI.
Be careful to pick a provider that advertises "seven nines of reliability" instead of the more common "nine sevens of reliability".
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Engineering has always been about compromise. Any idiot can design a structure that is X feet tall but it would prove more useful if it wasn't a giant block of concrete -- if it had room for offices and the materials used to build it had minimal cost without sacrificing structural integrity.
The same applies to computer engineering. We could easily build a cell phone network that had so many redundancies that it would virtually never go down and would support thousands of times the expected average load, but we would pay for it in terms of cost. Customers demand reliability. Customers demand affordable cost. What the customer is "willing to accept" is a balance between the two.
I'm still waiting for people to scream about the rising gas prices and the record oil company profits. Seems like this would have a greater impact on the general populace than reliable cell phone service.
I was born in 1964. I have no recollection of POTS telephone service ever being unavailable.
Electricity was expected to drop out a few times every summer, and until someone figures out how to tell lightning where to go, I expect it will continue to happen. In my part of Canada, however, power is continuously available from October to April no matter what. Even if you don't pay your bill. The only winter power outage of note I can think of offhand was the great Ice Storm of 1998, one of the most spectacular cases of force majeure I've witnessed in my life.
In my part of the world, at least, power and telephone were life-and-death services and legislation mandated their reliability.
Crumb's Corollary: Never bring a knife to a bun fight.
You totally misunderstand the 5 9s concept.
It doesn't mean that each and every individual phone will be up 99.999 percent of the time, it means that the system as a whole will be up 99.999% of the time.
It's quite possible for an entire town to be down for an entire year and still meet this criterion.
Yet modern cell operators STILL can not come close.
Sig Battery depleted. Reverting to safe mode.
Of course, when you don't have transmitters with overlapping coverage, this doesn't work.
My concept of 5 9s is much easier: 9.9999%. Or for Vista servers, .99999%.
Sam ty sig.
This has everything to do with cost and nothing to do with Microsoft. Consider VoIP... people are deliberately choosing telephony services that are less reliable and lower quality than POTS, because VoIP is cheaper. If you want 99.999% uptime, that's fine -- but you're going to pay for it. High availability services require better equipment, redundant equipment that doesn't come cheap and more, higher quality staff to operate it. So it costs more.
I've been in the technology services business for a long time, and with few exceptions, 80%+ of customers want their services delivered as cheaply as possible. Most hospital systems don't even have a 99.999% availability requirement. The 20% that want varying levels of higher than normal availability usually have a government regulation, SLA or other mandate requiring that they do so.
Conformity is the jailer of freedom and enemy of growth. -JFK
My electricity isn't 99.999% uptime (that would allow only about 5 minutes of outage a year), which would require me to get a UPS.
My consumer grade equipment isn't 99.999% uptime (with luck, maybe I guess but there's no ECC, redundant power etc).
My software isn't 99.999% uptime (ok, so the kernel is stable. When X crashes, so does everything of importance on a desktop)
If there's something urgent, you CALL me anyway.
I'd rather take a line with 99.5% uptime (that's two days without internet per year) that's 10x faster and costs 10x less. Which doesn't include that I have Internet at work, or via my cellphone, or via a webcafe or any number of other easily available sources. The only real killer I can think of is if you only telecommute and can't go to work, but even then I figure the nearest Starbucks will let you occupy a corner with some purchases.
Live today, because you never know what tomorrow brings
Partly correct. What they did was to mass introduce the GUI. 1.0 was a joke as far as usability went. At the same time the 386 was out and the talk of multiprocessing was promising new and exciting computing in the near future.
I don't think they measured squat. Just did their best. Only thing was that there was nobody who could properly design an O/S and complexity, instead of simplicity, ruled the day.
What we are seeing is the very best they as a group are able to produce.
They have never been great at marketing either. But they were really the first to push the GUI with success. Don't forget Apple became a very closed platform. They did not attract masses the way the open IBM PC did.
Right there history shows how important open standards are to success. Apple was considered this fantastic success story but in reality they cut it short and did not win over the masses the way the Johnny-come-lately IBM PC did. But we are slow when it comes to learning from history.
What they have been good at is market lock-in, vendor lock-in and many other types of lock-in. (The problem really is that they had never heard about duty and were only interested in money.) We all thought they would get it right sooner or later and deliver a good platform that would allow happy computing. The fact that they specialized in adopting good standards and then corrupting them so that you got locked in was a very calculated development.
At one point Gates himself said that Unix was the way to go. Then he decided to do it better but clearly never understood what made Unix so good (simplicity). Torvalds on the other hand was ONLY looking for simplicity. Which is why it fit so well into the general Unix design.
Look at Windows: it is filled with arbitrary complexities and is horribly inefficient. Never mind when upper management throws fits and yells at staff; I've never found that conducive to good programming, or business.
Gates cheated his way into O/S design, used people from VAX whose memory management problems were dragged over to Windows. Built a kernel in BASIC! Haha! And got away with it for years!
Someone who knew more about systems picked the Unix design and rewrote history based on technology, and was not motivated by money. Interesting to see how much we like to be able to just do what we need. Imagine if IBM had released Linux. With all the corporate support for let's say $100. Then opened it up with a GPL license.
Microsoft would not be sitting pretty at all. The OS/2 collaboration would not have happened and Gates would not have learned his lessons from that. For all their success I've never considered them much of a success where it really matters. Integrity in product and care for customers. I have people send me Brandy, fine wines and other tokens of their appreciation after sales. Because I believe in treating other people the way I like to be treated, and I really care about my clients.
Maybe that's the question the cable company would like to ask, but the one concerned consumers should be asking is, "how do you get someone to expect _more_ for the same price (or less) when they think that what they currently get is good enough?" Reading your piece of the discussion, I think this question could also follow, and it happens to be the original question...
Would I be willing to pay more for cell service that had fewer dead zones, dropped calls and "busy networks" than my current one has? No way. It's not as good as landline, but it's good enough for me. If, ten years from now, it worked the same as it does now, I would expect their competition to have passed them by and I'd switch. In the US we're in a free market system.
If I was tired of my cable internet dying on me occasionally, which competitor would I turn to? DSL, satellite and local wireless all have problems too. I settle for less than 5 nines because I have no choice, if I want service that is anywhere near the cost it is right now.
The N-nines model is a fast and easy way to compare order-of-magnitude differences between existing networks, but it says almost nothing meaningful about actual usage or the perception of uptime from a user's perspective.
.. nobody cares if an incoming piece of email got delayed by 30 seconds at the MTA, but they do get testy if they can't load their webpages. But web surfing only uses 1-2 seconds of bandwidth per minute anyway.
.. who cares about a half hour of downtime from 0300 to 0330 when no one in your company is actually in the building and using the network?
Let's look at the numbers: 99.9% uptime translates to about 9 hours of unscheduled downtime (USD) a year. That can be one 9-hour block once a year, about 1.4 minutes per day, 3.6 seconds per hour, 0.06 seconds per minute, or one dropped packet per thousand. Sure, it's easy to spot a 9-hour blackout, but as the slices of downtime get thinner, they get harder to notice at all, or to identify as USD specifically.
99.999% uptime translates to about 5 minutes of USD per year, and is of questionable value. You can't identify a network outage, call in a complaint, and get the issue resolved in the given timeframe. 99.9999%? It is to laugh. You can't even look up the tech support phone number without blowing your downtime budget for the year. Get hit by a rolling blackout for an hour? Kiss your downtime budget goodbye for the next 120 years.
Getting back to 99.9% uptime, let's move on to standard utilization patterns. USD really only becomes an issue if people notice it.
If we have 2 seconds of usage and 2 seconds of downtime per minute, the odds against a collision are around 15:1, with an average overlap of 1 second when a collision does happen. Simply interleaving usage and downtime that way increases the perceived uptime by an order of magnitude since 90% of the outages happen when no one is actually using the network. And larger blocks of downtime get lost in larger blocks of non-utilization exactly the same way.
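A rough way to sanity-check that 2-seconds-of-usage versus 2-seconds-of-downtime figure is a small simulation. This is only a sketch under assumptions that are mine, not the poster's: one usage window and one downtime window per minute, each starting at a uniformly random moment, with the minute treated as repeating (so windows wrap around). The function names are made up.

```python
import random


def overlap_on_circle(start_a, len_a, start_b, len_b, period):
    """Seconds of overlap between two windows on a repeating period."""
    total = 0.0
    for shift in (-period, 0.0, period):  # account for wrap-around
        b = start_b + shift
        total += max(0.0, min(start_a + len_a, b + len_b) - max(start_a, b))
    return total


def simulate(usage_s=2.0, down_s=2.0, period_s=60.0, trials=200_000, seed=0):
    """Return (collision probability, mean overlap given a collision)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    collisions = 0
    overlap_sum = 0.0
    for _ in range(trials):
        ov = overlap_on_circle(rng.uniform(0, period_s), usage_s,
                               rng.uniform(0, period_s), down_s, period_s)
        if ov > 0:
            collisions += 1
            overlap_sum += ov
    return collisions / trials, overlap_sum / collisions


if __name__ == "__main__":
    p, mean_overlap = simulate()
    print(f"collision probability ~ {p:.4f} (closed form {(2 + 2) / 60:.4f})")
    print(f"mean overlap when they collide ~ {mean_overlap:.2f} s")
```

Under those assumptions the closed form is (2 + 2) / 60, or one chance in 15, which matches the figure quoted above; real traffic is burstier than this, so treat it as an illustration only.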
Granted, if you have higher utilization you'll have a better chance of hitting a chunk of downtime, but you'll also have higher chances of queuing latency within your own use patterns. If you're already using 99% of your bandwidth, you can't just plunk in one more job and expect it to run immediately. It has to wait for that 1% of space no one else is currently using. And when you get to that point, it's really time to consider buying a bigger pipe anyway.
And that brings us to the main point: People don't buy network connectivity in absolute terms. They buy capacity, and the capacity they buy is scaled to what they think of as acceptable peak usage. "Acceptable peak usage" is a subjective thing, and nobody makes subjective judgements with 99.999% precision.
[citation needed] I call bullshit on that one.
And I call BS on your BS. Clearly you're not familiar with the state-of-the-art as far as email goes. You've certainly not had to set up and run a private email server.
Here's one good reference. It mostly mirrors my experience, except that it's been going on longer than the writer has observed.
The basic problem is that Yahoo, Hotmail, ATT and other large email providers, or ISPs, simply refuse to honor the standards which have been published (DKIM, et al.). Google is great. But it's gotten so bad with the others that I simply don't bother communicating with anyone who has a Hotmail, Yahoo, or ATT account. If they are someone important, I'll tell them once (via a different band) of the situation. And let them know that unless they change their email provider, I won't be responding to any future email from them.
Usually I just refer them to gmail, because google seems to be the only large email provider with a technical clue.
The other interesting thing is that all of these large companies will treat unsigned email from an Exchange server as more verified than a DKIM email, but I digress.
Supposedly the excuse is that it's due to spam. I'm certain that is part of the problem. But the other part is that there's definite incentive for the big boys to eliminate the small independent websites and drive all of the business into their arms.
So, yes, the OP's statement about many email messages not reaching their destination is quite true. Most? No. But anything that doesn't use the technology offered by the big commercial joints (including Microsoft server technology) is shut off from communicating with a large part of the internet.
Blackberry is not a mission critical service. The people who use it as such are naive.
Heh. Well, many PHBs would disagree, but your point is valid.
For your amusement, the Blackberry email servers are provided by a company called Mirapoint (mirapoint.com), and they are Linux based. From what I've heard, they cut over about 2 years ago from BSD to Linux, for various reasons. I'm also told that the CEO is a complete airhead who has difficulty managing a secretary, let alone a company. But that the mid-level managers and engineers in the U.S. are first rate. I imagine that they could indeed improve the uptime of the email servers, but those servers are quite good already.
The best way to predict the future is to create it. - Peter Drucker.
Also, because the EULA came into existence, product warranties effectively vanished, as well as actions the consumer could take via product liability claims, in court.
After all, liability plays a large part in defining QA policies. If software companies were held to the same liability standards most product manufacturers face, I'd bet software development would be more of the engineering practice it should be.
To quote part of Microsoft's EULA for Windows XP.
http://www.microsoft.com/windowsxp/home/eula.mspx
ALSO, THERE IS NO WARRANTY OR CONDITION OF TITLE, QUIET ENJOYMENT, QUIET POSSESSION, CORRESPONDENCE TO DESCRIPTION OR NON-INFRINGEMENT WITH REGARD TO THE SOFTWARE.
This reminds me of why Bruce Schneier's dream of legislating liability for software defects is misguided. Sure, statutory liability would make software more reliable, but it would mean that the many who don't need the additional reliability (and currently aren't willing to pay for it) would be forced to subsidize the handful who do. It would also likely claim volunteer-developed software as a casualty.
http://outcampaign.org/
If we wanted better uptime we could have it. We would just have to pay more for, and look at, a whole lot of redundant systems. Personally, I'm happier to keep paying less and only have one power line coming into my house, with the nearest plant many miles away. The same goes for cable and telephone service. And my cellular service does work about 99.9% of the time.
When my employer's BlackBerries failed earlier this month they fell back to laptops with a Bluetooth-tethered phone and Outlook/Exchange. Redundancy is built into the mindset. No messages were lost.
I have been a user for about 10 years. This ends Feb 2014. The site's been ruined. I'm off. Dice, FU
I'm sure a lot of the attraction of Internet service is that you pay a single flat fee, no matter how "important" the packet is. Who wants to pay a 2.99 surcharge per call if the caller is a job recruiter (presumably because he is offering you a job)? How about a 5 dollar surcharge if the call comes from your doctor? Your vet? The Internet caught on so far with the "a packet is a packet" mantra. Now all the internet suppliers compete on price (because people want cheaper internet) and want to charge extra for things people hadn't considered when they signed up, so they can charge more. This is what this is about, period. I imagine similar efforts are underway, paid for by different cable companies, etc. Anything to not have 5mbps to the internet, unfettered, 24hrs per day, 7 days a week, always-on, for a flat fee.
Unfortunately for them, I'd be willing to downgrade to 1mbps, but not on the always on, nor the unfettered, and if they do downgrade, I will be readjusting my idea of how much it should cost.
I'm at home (and awake) 20% of the time.
My landline is up 99.999% meaning my phone is available to me when I need it 19.9998% of the time.
I'm out and about (and coherent) 40% of the time.
My cell phone works 90% of the time meaning it is available to me when I need it 36% of the time.
Clear winner, cell phone.
Sometimes we lose sight of reality while studying statistics.
As Joel Spolsky pointed out on his blog JoelOnSoftware, 99.999% is pretty much fictional.
99.999% over a year allows about 5 minutes 15 seconds of downtime (315.36 seconds).
No matter how good your staff, no matter how many people you have on site, no matter how robust your systems, no matter how many failsafes you have standing by, ready to be plugged in...
IF something does go down, even the fastest tech on earth is unlikely to identify, pull out, replace and have fired back up whatever the faulty item is in about five minutes.
99.999% uptime is essentially fictional. It's simply an impressive sounding number that says, "We'll do everything realistically possible to keep you up 100% of the time. In a typical year, you won't see anything bring you down. You can now tell your investors/clients this and make them feel warm and fuzzy."
It ignores the second part, "But, honestly, if it does go down, we won't have it back within five minutes, 100% of the time. Sorry, but welcome to reality. But, for what it's worth, our board's happy to pay you outage fees because it's a small enough risk and the amounts are capped enough, that we're happy to take the risk and costs in exchange for advertising a service we know no one can deliver."
Let's look at regulated phone service, the example in the original post. Can anyone point to a major carrier that hasn't had a major outage at some point? Be it an idiot in a switch room, a power outage affecting a whole side of the country, an anchor ripping up an undersea cable? And how many of them have actually been back within the roughly five minutes that five nines allows?
It doesn't happen. That two-hour outage is going to take about 23 years of absolutely no more faults to earn back at roughly 5 minutes/year. With luck, it only hit one in 25 customers, so you can pretend you're within your 99.999% uptime, but that 1 in 25 isn't really going to agree they got 99.999% after they were down for nearly two hours more than their contract said they would be.
So, no, 99.999% doesn't exist. It's just a really cool story we tell ourselves whilst being willing to pay whatever the penalties are for missing it, on rare occasions, in exchange for great advertising.
You can't take the sky from me...
Thanks for actually listing out the figures. It really puts things in perspective, and it made me realize something. My internet service probably gets somewhere between 99.9% and 99.99% uptime. My cell phone is probably in a similar range. My cable is better than 99.999% (maybe even 99.9999%).
But only with redundant systems. What happens is when something goes down, techs aren't getting it back up in 30 seconds, rather it is instantaneously failing over to another system. You have enough redundancy, you can keep operating even in the face of multiple simultaneous failures.
The problem is, of course, going for that can be really expensive. Not only does the system itself have to have a bunch of redundancy, but so does everything supporting it. For example in the case of a web server you'd not only have to have multiple boxes running that, but multiple power connections, generators, network connections, ISPs, etc.
Doing something like that, you can offer essentially 100% uptime, barring a catastrophic event (and face it, any amount of uptime can be ruined by a sufficiently large event). However it is extremely costly, and of course everything has to be well designed because, as you noted, you fuck up anywhere, you have almost no time to fix it.
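To put rough numbers on the redundancy argument, here is a back-of-the-envelope sketch. It leans on two big assumptions: failures are independent and failover is instantaneous. Neither is guaranteed in practice, and the function names and the 99.9% figures are just illustrative.

```python
from math import prod


def parallel(availability: float, copies: int) -> float:
    """Availability of N redundant copies: down only if every copy is down."""
    return 1.0 - (1.0 - availability) ** copies


def series(*availabilities: float) -> float:
    """Availability of a chain: up only if every link in the chain is up."""
    return prod(availabilities)


if __name__ == "__main__":
    web = parallel(0.999, 2)     # two 99.9% web servers side by side
    site = series(web, 0.999)    # ...behind a single 99.9% power feed
    print(f"redundant web tier: {web:.6%}")
    print(f"with one non-redundant power feed: {site:.6%}")
```

The second number is the point of the paragraph above: doubling up the servers buys very little if anything they depend on (power, network, ISP) is still a single point of failure.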
Or you can just do what the voice guys like to do: Change the rules. For them, the system is "up" so long as there is at least one phone line that can place a call to at least one other phone line. By that standard, the voice switch on campus has never been down. Of course that isn't a particularly useful standard, if you asked me.
In the entire history of electromechanical switching in the Bell System no central office was ever out of action for more than thirty minutes for any reason other than a natural disaster. On the other hand, step-by-step (Strowger) switches failed to connect about 1% of calls correctly, and crossbar reduced that to about 0.1%. With electronic switching, the failure rate is higher but the error rate is much lower.
This reflects the fact that, in the electromechanical era, the hardware reliability was low enough that the system had to be designed to have a higher reliability than any of its individual units. In the computer era, the component reliability is so high that good error rates can be achieved without redundancy. This is why computer-based networks tend to have common mode failures.
If you're involved in designing highly reliable systems, it's worth understanding how Number 5 Crossbar worked. Here's an oversimplified version.
The biggest components of Number 5 Crossbar were the crossbar switches themselves. Think of them as 10x10 matrices of contacts which could be X/Y addressed and set or cleared. Failure of one crossbar switch could take down only a few lines, and they usually failed one row or column at a time, taking down at most one line.
The crossbars had no smarts of their own; they were told what to do by "markers", the smart part of the central office. Each marker could set up or tear down a call in about 100ms. Markers were duplicated, with half of the marker checking the other half. If the halves disagreed, the transaction aborted. Each central office had multiple markers (not that many, maybe ten in an office with 10,000 lines), and markers were assigned randomly to process calls.
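The "two halves checking each other" idea translates to software roughly like this. It is a loose analogy, not how the relay logic was actually built; `route` and the exception name are hypothetical stand-ins.

```python
class HalvesDisagree(Exception):
    """Raised when the two halves of a duplicated unit give different answers."""


def checked_call(half_a, half_b, request):
    """Run both halves on the same request; abort unless they agree."""
    result_a = half_a(request)
    result_b = half_b(request)
    if result_a != result_b:
        raise HalvesDisagree(f"{result_a!r} != {result_b!r}")
    return result_a


def route(number):
    """Hypothetical stand-in for the marker's real work: pick a route."""
    return "trunk-" + number[:3]


print(checked_call(route, route, "5551234"))
```

The point is that a disagreement aborts the transaction instead of connecting a call wrongly, and the caller can then ask a different marker.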
When a phone went off hook, a marker was notified, and set up a "call" to some free "originating register", the unit that understood dial pulses and provided dial tone. The marker was then released, while the user dialed. The originating register received the input dial info, and when its logic detected a complete number, it requested a random marker, and sent the number. The marker set up the call, set and locked in the correct contacts in the crossbars, and was released to do other work.
If the marker failed to set up the call successfully (there was a timeout around 500ms), the originating register got back a fail, and retried, once. One retry is a huge win; if there's a 1% fail rate on the first try, there's a 0.01% fail rate with two tries, assuming the failures are independent. This little trick alone made crossbar systems appear very reliable. There's much to be said for doing one retry on anything which might fail transiently. If the second try fails, unit level retry as a strategy probably isn't working and the problem needs to be kicked up a level.
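The retry-once rule is easy to sketch. The exception names and helper functions here are invented for illustration, and the arithmetic assumes the two tries fail independently.

```python
class TransientFailure(Exception):
    """A failure that might go away if the operation is tried again."""


class Escalate(Exception):
    """Both tries failed; the problem belongs to the next level up."""


def with_one_retry(operation):
    """Try the operation, retry exactly once, then give up and escalate."""
    for _attempt in range(2):
        try:
            return operation()
        except TransientFailure:
            continue
    raise Escalate("failed twice; unit-level retry is not working")


def failure_rate_after_tries(p_fail, tries):
    """Chance that every one of `tries` independent attempts fails."""
    return p_fail ** tries


print(f"{failure_rate_after_tries(0.01, 1):.2%} of calls fail with one try")
print(f"{failure_rate_after_tries(0.01, 2):.2%} of calls fail with one retry")
```

Stopping at one retry is deliberate: it covers transient faults without letting a genuinely broken unit soak up work in a retry loop.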
The pattern of requesting resources from a pool at random was continued throughout the system. Trunks (to other central offices), senders (for sending call data to the next switch), translators (for converting phone numbers into routes), billing punches (for logging call data), and trouble punches (for logging faults) were all assigned on a random, or in some cases a cyclic rotation basis. Units that were busy, faulted, or physically removed for maintenance were just skipped.
That's how the Bell System achieved such good reliability with devices that had moving parts.
Note that this isn't a "switch to backup" strategy. The distribution of work amongst units is part of normal operation, constantly being exercised. So handling a failure doesn't involve special cases. Failures cost you some system capacity, but don't take the whole system down.
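Here is a minimal sketch of that pattern in software terms: work goes to a randomly chosen unit from a pool, and units that are busy or faulted are simply skipped. The class and method names are made up, and a real pool would also need locking and a way to mark units faulted automatically.

```python
import random


class Unit:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.faulted = False

    def available(self):
        return not (self.busy or self.faulted)


class Pool:
    def __init__(self, units, rng=None):
        self.units = list(units)
        self.rng = rng or random.Random()

    def seize(self):
        """Pick a random available unit, or None if the pool is exhausted."""
        candidates = [u for u in self.units if u.available()]
        if not candidates:
            return None
        unit = self.rng.choice(candidates)
        unit.busy = True
        return unit

    def release(self, unit):
        unit.busy = False


markers = Pool(Unit(f"marker-{i}") for i in range(3))
markers.units[1].faulted = True  # a failed unit just shrinks the pool

first = markers.seize()
second = markers.seize()
print(first.name, second.name, markers.seize())  # third request finds no free unit
```

There is no failover code path here. The faulted marker is skipped by the same selection logic that runs on every request, so the failure handling gets exercised constantly.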
We need more of that in the Internet. Some (not all) load balancers for web sites work like this. Some (but not all) packet switches work like this. Think about how you can use that pattern in your own work. It worked for more than half a century for the Bell System.
The subject who is truly loyal to the Chief Magistrate will neither advise nor submit to arbitrary measures (Junius)
Thank you for bringing some sanity into this argument. Before you showed up it was dominated by idiotic hippies ranting about our mindless consumer-driven existence, the destruction of the environment, Microsoft, and just about everything else that has nothing to do with the issue at hand.
99.999% uptime is orders of magnitude more expensive than 99.99%, which in turn is orders of magnitude more expensive than 99.9% uptime, and so on.
The added cost is simply not worth it, in any sense of the word, to the general public.
I, for one, would prefer to deal with a day's worth of power loss in a major storm than pay 10x as much for my electricity in order to make it bulletproof.
The savings would be better spent elsewhere.
Note that this is not an argument against proper planning and preventative maintenance to REDUCE downtime as much as possible, just an argument against designing everything in the world to survive a nuclear bomb when that level of reliability is simply not worth the cost.
"after decades of mobile phones, why do we even still have dropped calls?" This is just stupid. A dropped call is not the network, it's your phone losing the network. There is absolutely no way to avoid them, none. RF only travels so far and through so many things. I completely understand the article and its merit, but this is just the author being ignorant of their subject or scoring sensationalist points with uninformed readers. Someone explain to me how a company could possibly cover the entire US, and I mean Wyoming and Montana too (if you want zero dropped calls). Then there the fact that Americans will take a $0 junk heap of a phone with a contract and hope that it will perform well.