Uptime Realities in the Internet World
schnurble writes: "My former boss has written an interesting article on the realities of uptime in the Internet World. It poses the idea that four and five nines of reliability are too expensive to be realistic, especially in the post dot-bomb economy. It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."
One of my clients (a government agency) runs a system that attaches to the federal LEIN database. They use it to pull arrest records, whether a person has a concealed weapons permit, etc. when someone is pulled over. This system is considered essential and requires 100% uptime (achieved through multiple failovers), since officers' lives are on the line when the system is down.
Go away, or I will replace you with a very small shell script.
The results from uptime.netcraft.com seem a bit hokey (sp?) at times, since they do not take load-balanced web servers, network outages, etc. into consideration. In my case, I had a server down for just under a day to be rebuilt and brought it back up... checked the NetCraft results a couple of days later, and they didn't show that the site had gone down.
I know there are some projects/sites that will allow people to submit uptimes sent from cron jobs or agents to a server, which then stores the uptime data there. Of course, that doesn't mean that you can just generate junk data (i.e., a 999-day uptime with 2934 users).
Our current ISP (Group Telecom) guarantees 5 9's of reliability and it's pretty much a joke. We've already burned through several years' worth of downtime (granted, only a couple of hours a month) and who knows what will happen to our "guaranteed service" if/when they finish their slide into bankruptcy.
Let me give you a hypothetical case. One of our clients does about $50k/month on their web site. When the site was built, they were only expecting $10000-$15000/month. At the time, NN4 compatibility wasn't important, because the extra cost ($10k) wasn't going to be worth it. With NN4 sitting between 5% and 10% each month, they have decided that NN4 compatibility is important in the next version.
When we launched, 3 days of downtime a month was considered okay. It was considered a better choice than spending an extra $5k on hardware for redundancy. Well, when the site broke $40k/month, we immediately decided that that was no good and invested in the redundancy.
The site has had a few 15 minute outages over the past 6 months, and a 1 day outage over a holiday weekend (not a big deal). However, if the site doubles in revenue again, downtime is becoming less acceptable, and we'll drop $10k to avoid it.
If your site sucks and no one visits, downtime doesn't matter. If you are making lots of money, downtime does matter. $10k on hardware is worth it if the downtime would cost you $25k?
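The trade-off in this post can be put into rough numbers. The sketch below is only illustrative: it assumes sales are lost evenly across a 30-day month while the site is down, which the post does not claim, and the function names are made up for the example.

```python
def monthly_downtime_cost(monthly_revenue, downtime_days, days_in_month=30):
    """Revenue lost per month, assuming sales are spread evenly over the month."""
    return monthly_revenue * downtime_days / days_in_month


def months_to_pay_back(hardware_cost, monthly_revenue, downtime_days_avoided):
    """Months until avoided losses cover a one-time hardware purchase."""
    saved = monthly_downtime_cost(monthly_revenue, downtime_days_avoided)
    return hardware_cost / saved


# Figures from the post: 3 days of downtime a month, $5k of redundant hardware.
for revenue in (10_000, 40_000):
    loss = monthly_downtime_cost(revenue, 3)
    payback = months_to_pay_back(5_000, revenue, 3)
    print(f"${revenue}/month: ~${loss:,.0f} lost, payback in {payback:.2f} months")
```

Under these assumptions the same $5k purchase takes five months to pay for itself at $10k/month but only a little over one month at $40k/month, which is the shift in priorities the post describes.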
Alex
The database server handling the message areas for Voyeurweb, RedClouds, and feedback areas for the same has answered 28,442,099 questions in the last 13 days. That's when we finalized changes to it.. Before that, it had been running for 2 years.
:)
I wish we only had 5mil hits/day.. One web server takes 18mil req/day.. We have bunches of 'em out there.
http://voy37.voyeurweb.com/1.stats.html.
Did I mention we're a Linux shop?
Serious? Seriousness is well above my pay grade.
We did it on a really low budget:
Heartbeat/Mon/Fake/Coda/Linux/IPVS for the High Availability, failover from DS1->DS2, each on different backbone nodes.
Mirrored systems in different geographic locations:
Firewall
IPVS Gateway
Apache->Weblogic bridge (Apache vhosts with ssl)
Apache->Zope bridge (Apache vhosts with ssl)
Zope->Zeo setup for content management.
SAN drive array for Oracle, running on two E4500s
This system isn't really that expensive, just the costs of hardware and my salary for setting them up.
My $0.02 will always be worth more than your €0.02, so
Entirely. Having worked extensively on the flight deck systems for the Boeing 767-400ER, I can tell you first hand that the redundancy is rather amazing. There are two major computer systems that drive the displays in the cockpit, the DPCs which do a lot of digital signal manipulation and the DCCs which do a lot of the analog to digital signal manipulation and control. Two DCC boxes drive three DPC boxes and the two DCC boxes are cross-connected to each of the DPC boxes. The three DPC boxes each talk to each other (I'm not sure if the DCC boxes talked to each other - that was further down the chain than I was working on) and actually vote on the data points that are being sent to the displays to determine if one of the DPCs is malfunctioning or processing bad data. The way this all works together is amazingly complicated, especially when you consider that it all runs on embedded boards where the "executable" is typically less than 1-2MBs in size.
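The voting idea described here, three units comparing results so that one bad unit can be outvoted, can be sketched in a few lines. This is a generic majority vote, not the actual 767-400ER logic: the unit names and values are hypothetical, and it compares values for exact equality where a real system would presumably allow a tolerance.

```python
from collections import Counter


def vote(readings):
    """Return (agreed_value, suspects) for readings keyed by unit name.

    agreed_value is the value reported by a strict majority of units, or
    None when no majority exists. suspects lists the units that disagreed
    (all of them when there is no majority).
    """
    counts = Counter(readings.values())
    value, hits = counts.most_common(1)[0]
    if hits * 2 <= len(readings):
        return None, sorted(readings)
    suspects = sorted(unit for unit, v in readings.items() if v != value)
    return value, suspects


# Hypothetical airspeed values from three units; "dpc2" disagrees.
print(vote({"dpc1": 250, "dpc2": 317, "dpc3": 250}))  # (250, ['dpc2'])
```

With three voters a single faulty unit is always outvoted. If all three disagree, the function reports no agreed value rather than guessing.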
... especially the way it's actually implemented in the embedded system. Debugging all this, of course, was non-trivial. For that matter, coding it is non-trivial as it's all in Ada83.
... those were the days :)
My particular area of development was the actual display software which was provided data from the DPC systems. Each of the six displays (2-pilot, 2-copilot, 2-EICAS in the console) received multi-cast data from each of the DPCs and then fed data back to the DPCs on the display's status. The DPCs would then automagically evaluate if the displays were functioning properly and switch primary functions away from a malfunctioning display to a functioning display if error conditions were detected.
The PFD (primary flight display) is the pilot's most important display as it displays airspeed, artificial horizon, TCAS warnings, altitude and a few other things. The ND (navigation display) is the inner screen on both the pilot/co-pilot sides and if the PFD experiences error conditions, the DPCs switch the PFD to the ND and the ND to one of the EICAS (engine indicators, etc.) displays.
All very interesting stuff
Ahh
IIRC, and it's been a number of years, the overall goal was about 50 minutes of outage per line per year (a little better than four nines). Different failure modes were allocated different parts of that total. Components like the wires, that only took a single line out of service, were allocated the lion's share. Switch components were allocated smaller amounts, depending on how many lines would be out of service. Total system failure on a switch was allocated about 4.5 minutes per year (five nines).
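For reference, the arithmetic behind "nines" is short enough to check directly. The sketch below assumes a 365-day year and uses made-up function names. It converts a number of nines into a yearly downtime budget and converts outage minutes back into an availability percentage.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_budget_minutes(nines):
    """Allowed downtime per year for an availability of N nines (3 -> 99.9%)."""
    return MINUTES_PER_YEAR * 10 ** -nines


def availability_percent(outage_minutes_per_year):
    """Availability implied by a given number of outage minutes per year."""
    return 100 * (1 - outage_minutes_per_year / MINUTES_PER_YEAR)


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")

# The two outage figures quoted in the post.
for minutes in (50, 4.5):
    print(f"{minutes} minutes/year: {availability_percent(minutes):.4f}%")
```

Four nines allows about 52.6 minutes a year and five nines about 5.3, so a 50-minute target sits just inside four nines and a 4.5-minute target just inside five.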
No switching system ever actually made that grade. Probably the ones that came closest were the old electromechanical "steppers". Many small steppers in small towns ran completely unattended, and maintenance consisted of someone driving out once a month to make sure the building was still there and to polish some relay contacts.
All of the computer-controlled switches had dual synchronized processors (i.e., each one executing the same op codes at the same time) and duplex memory, with a bunch of extra hardware to detect faults. The single most common cause of total system failure was when a fault had occurred, and the system was running "simplex", and a tech pulled a card from the active rather than the failed processor.
Availability is infrastructure plus process. You need to have the supporting process to go along with the hardware - maintenance schedules, change management (well FCAPS in general), etc. It's not just a big box.
Hmm... let's take it a step further and assign approximate value to infrastructure and process. At a company where I used to work, I smoked more cigarettes than I have ever smoked in my life, and this was directly due to failures of hardware and software. There was never, ever a shortage of queer (strange, not take-it-in-the-bum) little men running around telling us that they were working on expunging the demon/rebooting the server machines/whatever... but boy, was there ever a lack of infrastructure.
So I would say that without sufficiently redundant hardware and code well-enough written to not explode upon severe slashdotting (you know, just as an example), you can have all the process you want, and it will just result in tech staff telling the end-users to go out and have a smoke, 'cause the computers are down and will be back ASAP.
What a horrible thing to do to one's ex-boss. (/redundant)
My company does lots of things, but almost no manufacturing (our local office provides engineering services to the government and military). We also got hit with the Six Sigma marketing buzz, and our stupid (now departed) CEO decided that they needed to initiate the garbage company wide. I've managed to avoid it so far, but I've passed by the conference room occasionally while sessions have been going on, and I would have to say that it would score real close to 10 on the Wank-o-meter. All of the engineers who have been subjected to it have said it's nothing more than good engineering practice that they should have learned in school. But maybe it's good for the administrative/marketing types.
Today's Sesame Street was brought to you by the number e.
It's instructive to read about the United Flight 232 incident a few years back. The #2 engine of a DC-10 exploded in flight (at around 30,000 feet) and severed ALL the hydraulic systems and their backups. Without rudder, ailerons, elevator, spoilers, flaps, or one of the three engines, the pilots set the plane up for a forced landing. And about 200 of the 300 passengers on board survived.
Of course, certain bugs can be really bad. I was down at Boeing Field once last year when somebody attempted to take off in a light plane that had just been serviced. Unfortunately the mechanic hooked up the ailerons backwards, so that when the pilot attempted to correct for a crosswind on takeoff, he promptly rolled and landed on top of another plane in the parking area. (Sounds like inadequate preflight action by the pilot on that one, since he appears to have missed the "control surfaces free and correct" item on his pre-takeoff checklist, but no injuries to the best of my knowledge.)
Note that I'm hardly going to argue that flight-control software shouldn't be damn good. But... it's overstating your case to assume that downtime or error necessarily means a plane is going to fall out of the sky.
"Biped! Good cranial development. Evidently considerable human ancestry."
Having the 5 9's of reliability is NOT foolish. It is a reality of life. My particular organization services 40 million web customers, so we cannot afford to be down at any time of the day because of the type of service we provide. In fact, last year we made our goal of having the 5-9's, and we did it without needing our disaster recovery (DR) site.
Having a DR plan and being reliable go hand in hand for the most part; however, under normal day-to-day business conditions, servers need to be upgraded and things unplugged. You don't switch your entire infrastructure over to a DR site to upgrade your Apache web server!! It is for this reason you have redundancy on the network and server level leading out to the Internet (or wherever your customer base resides).
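As a rough illustration of why server-level redundancy matters, here is the standard formula for units running in parallel where any one of them is enough to keep the service up. It assumes failures are independent, which shared power, shared network gear, or a common software bug can easily break, and the 99% figure is only an example, not a number from this thread.

```python
def combined_availability(single, copies):
    """Availability of `copies` redundant units, any one of which is enough.

    Assumes failures are independent of each other.
    """
    return 1 - (1 - single) ** copies


# Example only: each server is up 99% of the time on its own.
for copies in (1, 2, 3):
    print(copies, f"{combined_availability(0.99, copies):.6f}")
```

The same reasoning covers planned work: with a second unit in place, one server can be taken out for an upgrade without the service going down.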
Disasters, on the other hand, do not happen every day. They happen once a year, maybe.... sometimes once every 2 years. If you live in an area more prone to disasters (like southern California), you may need an alternate site located on the east coast.... but, that is the cost of doing business.
Also, having 5-9's on uptime does NOT mean being accessible to everyone in the world at any time no matter what. Having 5-9's of uptime means that your organization has successfully kept its applications and services available to the Internet. How is it my company's fault if you don't plug your modem into the wall? It's not, so to say that our "reliability" decreases because of an end user being a moron is a stupid statement.
Not to sound like a suit, but it's really about total cost of ownership. For example, software RAID comes with most modern operating systems, but you still need to power down the server to remove and replace a failed drive. However, if you make the upfront investment in a hardware RAID controller with hot-swap capability, you save time and reduce tech support calls, saving money in the long run. If you're offering commercial services (as an ISP or whatever), you start to develop a reputation for reliability that will earn you more customers over time.
>Beyond that, it doesn't much matter.
Well, beyond "7 nines" you would start talking about 100% reliability. So you start with contingency plans for a terrorist attack on one data center at the same moment as a quake under another data center. Now you're in the realm of needing your own redundant power plants, and probably network infrastructure that does not really exist yet.
So in reality, your guarantee of "9 nines" or, effectively, ZERO downtime for the life of the product, really would be specified in terms of compensation and not technology. In other words, you'd be stating what the client will receive when (not if) the uptime guarantee is not met.
-fb Everything not expressly forbidden is now mandatory.