Uptime Realities in the Internet World

Uptime by dattaway · 2002-07-09 08:15 · Score: 5, Funny

Wouldn't you know it, an article about uptime...and slashdotted. Looks like he needs a mirror.

Re:Uptime by program21 · 2002-07-09 08:22 · Score: 1

Ironic, isn't it.

--
This has been a test. Had this been a real emergency, we would have fled in terror and you would not have been informed.
Re:Uptime by spencerogden · 2002-07-09 08:33 · Score: 3, Funny

Of course the article is about how uptime is too expensive, I guess this proves the point...

--
Spencer Ogden
Re:Uptime by suss · 2002-07-09 08:34 · Score: 3, Funny

Wouldn't you know it, an article about uptime...and slashdotted. Looks like he needs a mirror.

You could always try the Google Mirror
Re:Uptime by Anonymous Coward · 2002-07-09 08:39 · Score: 0

That's great! :-)
Re:Uptime by Anonymous Coward · 2002-07-09 09:01 · Score: 0

Mod parent up!!

you just cracked up 7 of my co-workers.
Re:Uptime by n9hmg · 2002-07-09 09:06 · Score: 2, Funny
1. Cool site. Nice idea.
2. Unfortunately, neither elgooGoogle nor archive.org had a chance to cache it before we killed it. The "former boss" was probably an update, made after he slashdotted the poor guy.
Re:Uptime by dattaway · 2002-07-09 09:14 · Score: 2

The article has been posted in this thread by an anonymous donor. The parent of its four pages has not been moderated yet.
Re:Uptime by ranulf · 2002-07-09 14:00 · Score: 4, Informative

the article is about how uptime is too expensive
I'd also say impractical. 5 nines is 99.999% availability, i.e. can be down for 1 second every 100000 seconds, or 27.77 hours. That gives approximately 6 seconds of downtime per week.
Even if all that week's downtime came at once, six seconds is little enough that most users would just hit refresh and never even notice. Besides which, most web servers are taken down for maintenance tasks, upgrading software or disks, etc. Chances are even restarting the web server would take up more time than your maximum weekly downtime.
Given that over the course of a month (which is the billing period on most ISP lines) you only have about 26 seconds of possible downtime, it's very unlikely that the ISP will be able to meet that target. Pretty much *any* fault would take longer than that to fix, so any company offering a refund if the SLA isn't met is just asking for trouble.
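
The downtime budgets in this comment can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and the choice of a 30-day month are assumptions made for illustration, not anything from the article. Four 6-second weeks gives the 24 seconds quoted; a full 30-day month works out to a little under 26.

```python
def downtime_budget_seconds(nines: int, period_seconds: float) -> float:
    """Allowed downtime over a period for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # 5 nines -> 0.00001
    return period_seconds * unavailability


WEEK = 7 * 24 * 3600
MONTH_30_DAYS = 30 * 24 * 3600   # assumption: a 30-day billing month
YEAR = 365 * 24 * 3600           # non-leap year

for label, period in (("week", WEEK), ("30-day month", MONTH_30_DAYS), ("year", YEAR)):
    budget = downtime_budget_seconds(5, period)
    print(f"five nines, per {label}: {budget:.1f} s")
```

Changing the first argument to 3 or 4 shows how quickly the budget grows with each nine dropped.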
Re:Uptime by qurob · 2002-07-10 05:51 · Score: 1

That gives approximately 6 seconds of downtime per week. ...

Chances are even restarting the web server would take up more time than your maximum weekly downtime.

Your computer reboots FAST!

Nothing is THAT Important by Master+Bait · 2002-07-09 08:15 · Score: 1

...to be worth 4 and 5 nines of reliability.

--
"Only in their dreams can men truly be free 'twas always thus, and always thus will be."
--Tom Schulman

Re:Nothing is THAT Important by Anonymous Coward · 2002-07-09 08:17 · Score: 2, Insightful

Five-nine reliability in the airline industry would mean that we'd see a major commercial jetliner crash about every other day.

No, thanks.
Re:Nothing is THAT Important by Telecommando · 2002-07-09 08:21 · Score: 2

Yeah, I'm glad the local nuclear power plant decided to save money and only go with 3 nines of reliability.

--
Beta sux! Join the Slashcott! http://hardware.slashdot.org/comments.pl?sid=4760465&cid=46173047
Re:Nothing is THAT Important by SirTwitchALot · 2002-07-09 08:21 · Score: 2, Interesting

One of my clients (a government agency) runs a system that attaches to the federal LEIN database; they use it to pull arrest records, whether a person has a concealed weapons permit, etc. when someone is pulled over. This system is considered essential and requires 100% uptime (achieved through multiple failovers) since officers' lives are on the line when the system is down.

--
Go away, or I will replace you with a very small shell script.
Re:Nothing is THAT Important by delta407 · 2002-07-09 08:23 · Score: 1

What about life support equipment? Air traffic control? Nuclear plant monitoring?
Re:Nothing is THAT Important by nuggz · 2002-07-09 08:25 · Score: 3, Insightful

I guess you don't have a pacemaker.

Some things ARE that important, most things aren't.
Re:Nothing is THAT Important by Sims+Youth · 2002-07-09 08:25 · Score: 0

HAHAHAHA!
Thank you for inadvertently pointing out that you know jackshit about flying planes.
Re:Nothing is THAT Important by Jeff+DeMaagd · 2002-07-09 08:27 · Score: 2

Five-nine reliability in the airline industry would mean that we'd see a major commercial jetliner crash about every other day.

No, thanks.

Point taken. Somehow, IMO reasonable reliability in the software and hardware industry is ridiculously expensive. I guess it wouldn't be too bad if one were willing to trade off performance (speed) for reliability rather than requiring speed and reliability.

I'd be happy to get consistent two nine or better reliability from my ISP!

That's sad.
Re:Nothing is THAT Important by markmoss · 2002-07-09 08:30 · Score: 3, Funny

I hope that LEIN database is on a network that has a better record for delivering the data within a few seconds than the public internet. It doesn't matter how good your server is if your IP traffic got crowded off the net by 3,000 nerds downloading pr0n...
Re:Nothing is THAT Important by limber · 2002-07-09 08:31 · Score: 2, Funny

i'll settle for seven of nine.
Re:Nothing is THAT Important by WetCat · 2002-07-09 08:33 · Score: 3, Insightful

Heh... Switzerland....
Some factors that preceded the recent collision between the Tu-154 and the DHL Boeing 757 were:
- Traffic warning system in its scheduled 10-minute maintenance - dispatcher got no warnings
- Busy phone lines to dispatch - German dispatch was not able to get through to Swiss dispatch to tell them about the dangerous situation...

This is an example that cost a lot of lives...
(other tragic circumstance was that pilots of Tu-154 gave priority to dispatch commands instead of commands of collision avoidance system...)
Re:Nothing is THAT Important by medcalf · 2002-07-09 08:35 · Score: 4, Insightful

Not true. Five 9s in the airlines means that you'd see an airliner late or in some other way unavailable - possibly due to a crash, but not likely - every other day. Reliability is the availability to do what you need, when you need it. If a server is up 100% of the time, but is not able to be accessed because the network is down, the system is not reliable for you.

--
-- Two men say they're Jesus. One of them must be wrong. - Dire Straits
Re:Nothing is THAT Important by great_flaming_foo · 2002-07-09 08:42 · Score: 3, Funny

Yeah, I'm glad the local nuclear power plant decided to save money and only go with 3 nines of reliability.

I wonder if the internet will glow in the dark after the nuclear power plant's webserver melts down? Hey, maybe that is where Trolls come from.
Re:Nothing is THAT Important by SirTwitchALot · 2002-07-09 08:49 · Score: 1

It's completely private, and EXTREMELY secure... I get calls when I breathe funny around it... ok maybe not... but if I'm on the surveillance camera near the system I can expect a call to check what I was doing there.

--
Go away, or I will replace you with a very small shell script.
Re:Nothing is THAT Important by NanoGator · 2002-07-09 08:51 · Score: 2

"got crowded off the net by 3,000 nerds downloading pr0n..."

How could it get that crowded? Jpeg files are very small!

--
"Derp de derp."
Re:Nothing is THAT Important by Anonymous Coward · 2002-07-09 08:57 · Score: 0

That assumes the error occurred in the air, and was severe enough to cause a crash. Most errors would occur on the ground, and would result in the plane not being allowed to fly, or in delays while it is being fixed.

If I had a 1/100000 chance of a flight being delayed due to mechanical or other error, I would be a VERY happy flyer.
Re:Nothing is THAT Important by Sique · 2002-07-09 08:57 · Score: 4, Insightful

No, it means that a jetliner has to be operating for a year with just 3 mins in the hangar. But about half its lifetime a jetliner is in maintenance, giving it an uptime of about 50%.

--
.sig: Sique *sigh*
Re:Nothing is THAT Important by Detritus · 2002-07-09 09:05 · Score: 2

If some systems fail, people die.
I've worked on systems where a failure results in a hearing before an investigative board.

--
Mea navis aericumbens anguillis abundat
Re:Nothing is THAT Important by Anonymous Coward · 2002-07-09 09:09 · Score: 0

I disagree. If my math is correct, five nines of reliability is about 5 minutes and 15 seconds of downtime per year (non-leap). If the device in question is related to life support or radioactive materials I think 5 minutes is an eternity...
Re:Nothing is THAT Important by leucadiadude · 2002-07-09 09:14 · Score: 2

I know this is meant as a joke but there is some truth to the way you spend your money for reliability. You have to choose carefully which systems get the money spent to design and validate that they will perform with 99.9999xxxx reliability. If you waste your money, you won't have enough for those systems that really need the quality.

And nothing contributes to reliability more than hiring, training and retaining high quality operators and engineers.
Re:Nothing is THAT Important by Sique · 2002-07-09 09:17 · Score: 2, Informative

No, it means that a jetliner has to be operating for a year with just 3 mins in the hangar. But about half its lifetime a jetliner is in maintenance, giving it an uptime of about 50%.
Sorry this should read 50% availability.
Yes I know there may be arguments that scheduled maintenance won't count. But even without counting this jetliner availability never reaches 99.999%.
About two years ago I had to fly a longer distance. I sat in the plane, but the plane didn't leave the gate for about an hour. Then the pilot asked for our patience for another hour: there was a problem with the oil pressure and the mechanics were looking at it. After that hour he told us that the oil filter was defective and had to be changed. And after another hour he asked us to leave the plane: there was no new oil filter available at the airport and they had to get one from another airport. After five hours we finally got clearance and took off.
That's five hours of unavailability. If this were the only unplanned outage the plane ever had, and it was on average available 99,999% of the time, that would mean a lifetime of 500000 hours, or about 60 years, without any further problem with the plane (including external conditions like weather, grounding due to Sep 11, et al.).
So planes on average are much more often unavailable than 0.001% of their operation time. The average delay for the Frankfurt Airport (FRA) is currently 15mins, if we assume that every plane lands on FRA about once per day this would be an average outage of 1%, 1000 times that of 5-9.
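
As a rough check of the figures in this comment, the sketch below redoes both calculations: the service life over which a single five-hour outage would still count as five nines (500,000 hours, a bit under 60 years), and the availability implied by a 15-minute delay once per day. The function names are made up for illustration, and the once-a-day landing is the comment's own assumption.

```python
def hours_needed(downtime_hours: float, nines: int) -> float:
    """Total hours of service over which `downtime_hours` still meets N nines."""
    return downtime_hours * 10 ** nines


def availability(downtime_hours: float, total_hours: float) -> float:
    """Fraction of the period during which the system was available."""
    return 1.0 - downtime_hours / total_hours


lifetime = hours_needed(5.0, 5)              # one 5-hour outage at five nines
print(f"{lifetime:,.0f} hours, about {lifetime / 8760:.0f} years")

fra = availability(15 / 60, 24)              # 15 minutes lost per 24 hours
print(f"15 min/day: {fra:.2%} available")
print(f"outage is {(1 - fra) / 1e-5:.0f}x the five-nines allowance")
```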

--
.sig: Sique *sigh*
Re: Nothing is THAT Important by Black+Parrot · 2002-07-09 09:18 · Score: 1

> ...to be worth 4 and 5 nines of reliability.

It's understandable that 100% reliability would seem like engineering overkill to you, Master Baiter, but for the rest of us it's essential.

--
Sheesh, evil *and* a jerk. -- Jade
Re:Nothing is THAT Important by istvandragosani · 2002-07-09 09:22 · Score: 1

Yeah, like phone systems running emergency (911) numbers?

--
Go not to the Elves for counsel, for they will say both no and yes
Re:Nothing is THAT Important by SwedishChef · 2002-07-09 09:25 · Score: 2

Damn! This should have been modded up!!! And it would hardly be "settling".

--
No one ever had to evacuate a city because the solar panels broke!
Re:Nothing is THAT Important by charon.de · 2002-07-09 09:53 · Score: 1

(other tragic circumstance was that pilots of Tu-154 gave priority to dispatch commands instead of commands of collision avoidance system...)

According to this article (German)
http://www.spiegel.de/panorama/0,1518,204356,00.html

The flight data recorders of both aircraft were recovered. About a minute before the crash, the TCAS warned "Traffic, Traffic" in both aircraft. About 15 seconds before it was all over, the TCAS in the Boeing 757 said "descent, descent" and in the Tupolev 154 "climb, climb", but ground control told the Russian pilot "descent, descent", which he sadly did. Many international airlines have rules that say to follow the automatic system no matter what ground control says! The Russians don't have a rule like this; the pilot does what he thinks is right.
Re:Nothing is THAT Important by undercanopy · 2002-07-09 10:06 · Score: 1

That depends on what kind of five-nines you're talking about. 5-nines of software uptime is much better than 5-nines of flights landing safely. 99.999% of flights landing safely gives 1 crash every 2 days.

--
-- D-23994, Muff#2613
Re:Nothing is THAT Important by PacoTaco · 2002-07-09 10:19 · Score: 3, Interesting

Point taken. Somehow, IMO reasonable reliability in the software and hardware industry is ridiculously expensive.
Not to sound like a suit, but it's really about total cost of ownership. For example, software RAID comes with most modern operating systems, but you still need to power down the server to remove and replace a failed drive. However, if you make the upfront investment in a hardware RAID controller with hot-swap capability, you save time and reduce tech support calls, saving money in the long run. If you're offering commercial services (as an ISP or whatever), you start to develop a reputation for reliability that will earn you more customers over time.
Re:Nothing is THAT Important by anjrober · 2002-07-09 10:27 · Score: 1

after years at Fidelity investments, I can tell you during trading hours, trading machines need 4-5 9's, that's for damn sure.
Re:Nothing is THAT Important by tzanger · 2002-07-09 10:33 · Score: 3, Informative

Five-nine reliability in the airline industry would mean that we'd see a major commercial jetliner crash about every other day.

At first I didn't believe you.

According to this page, there were 10 fatal accidents in 18 million flights in 1998. That is a little better than six nines. Five nines would be 180 fatal accidents a year, or almost exactly one every other day.

I'm really glad I checked before spouting off. :-) Did you know that stat or did you pull it out of the air?
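
The 1998 figures quoted above (10 fatal accidents in 18 million flights) can be turned into a "number of nines" with a logarithm. This is a sketch of that arithmetic only; the function name is made up, and treating every flight as an independent trial is a simplifying assumption. By this measure the 1998 record comes out slightly above six nines, and a flat five-nines rate would imply about 180 accidents a year.

```python
import math


def nines(failures: int, trials: int) -> float:
    """Number of nines of reliability: -log10 of the failure rate."""
    return -math.log10(failures / trials)


flights_1998 = 18_000_000
fatal_accidents_1998 = 10

print(f"1998 record: {nines(fatal_accidents_1998, flights_1998):.2f} nines")

# Accidents per year that a flat five-nines success rate would imply.
implied = flights_1998 * 1e-5
print(f"five nines: {implied:.0f} accidents/year, one every {365 / implied:.1f} days")
```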
Re:Nothing is THAT Important by roybadami · 2002-07-09 10:39 · Score: 1

Five nines reliablility is what the telco industry has traditionally aimed for. That means the phone system is out of action on average for 5 minutes each year.

The problem is that the Internet industry has felt it needs to do likewise.

Three nines (ie 9 hours downtime per year) would seem more than enough for most e-commerce sites.
Re:Nothing is THAT Important by Anonymous Coward · 2002-07-09 10:49 · Score: 0

It was a guesstimate based on several thousand flights per day. I posted it knowing it was an oversimplification that would stimulate debate. I was, however, disappointed when I failed to achieve +5 as an AC. :(

Mods, it's not too late....
Re: Nothing is THAT Important by Anonymous Coward · 2002-07-09 10:51 · Score: 0

No it's not. You just think it is. You've put yourself in a position where you are relying on something just way too much. People only NEED a few things to live. Computers aren't one of them.
Re:Nothing is THAT Important by Anonymous Coward · 2002-07-09 11:39 · Score: 0

Three numbers: 9-1-1.

If that's not operating at 4 or 5 nines, I'd be very disappointed.
Re:Nothing is THAT Important by Anonymous Coward · 2002-07-09 11:55 · Score: 0

3000 MPEGs?
Re:Nothing is THAT Important by SuiteSisterMary · 2002-07-09 12:02 · Score: 5, Funny

The same thing happened to me once, a little puddlejumper from Dallas or Houston to Austin, I think it was.

Anywho, the pilot revs up the engine, then throttles it back down. Fine, brakes and throttle work. Throttles back up, trips the brakes, and off we go screaming down the runway.

Then the plane slows down, and stops.

Pilot comes on the intercom and says 'Um, folks, you may have noticed, we didn't take off. A warning light has come on in the cockpit, and we don't know why. Until we do, we're going to stay right here.'

Now, that's not the bad part. The heat and humidity, and a plane full of sweaty smelly passengers isn't the bad part, either.

No, the bad part was the pair of off duty pilots in the seats next to me who started, in loving detail, discussing every thing that could possibly be wrong.

--
Vintage computer games and RPG books available. Email me if you're interested.
Re:Nothing is THAT Important by rodgerd · 2002-07-09 12:27 · Score: 3, Insightful

Yeah, and when some systems fail, it doesn't actually matter.

One thing I saw again and again during the .com boom was pissant little companies demanding 100% uptime, spending a fortune on Oracle and redundant data centres and shit, when they didn't need that reliability. Their business plan didn't call for it, their demographic didn't call for it, nothing called for it. They were engineering their shithouse little business' systems like they were for the A&E department of a hospital.

And that's the point the guy seems to be making: people are spending millions of dollars where they only need to spend a tenth that, to build systems you could run a trading floor with.
Re:Nothing is THAT Important by Archfeld · 2002-07-09 12:58 · Score: 2

the Feds seem to think that balancing a bank everyday at 15:00 IS, and they aren't JOKING nor do they have a sense of humor...

its called 6 sigma for us

--
errr....umm...*whooosh* *whoosh* Is this thing on ?
Re:Nothing is THAT Important by Anonymous Coward · 2002-07-09 16:56 · Score: 0

if you could choose, would you take a power plant with 99,9% availability and ,1% blackouts or one with 99,9999% avail and ,0001% core meltdown?
Re:Nothing is THAT Important by markmoss · 2002-07-09 23:53 · Score: 1

Secure and reliable are two different things. (Someday I'm going to make and post a picture of how to secure a Windows system - it involves wire cutters and the power cord.)

The issue with a completely private country-wide network is that it's unlikely to have all the redundant links needed to ensure everyone stays connected. OTOH, the internet has redundancy (except where the phone companies run too many trunk lines through the same building), but there's an unacceptably high risk that critical info (e.g., his routine traffic stop is a mad-dog killer) won't make it through the congestion in time. What is really needed when real-time performance is critical is a way to buy priority through the internet routers. (And I do mean a pay system, priced so that no one can afford to use it to keep the MPEG streaming in on time, but cop shops can afford to prioritize their much shorter messages.)

The horror... by Anonymous Coward · 2002-07-09 08:16 · Score: 0

We'll see how good www.codesta.com's uptime is after the slashdot'ing.

Should be retitled: by tshak · 2002-07-09 08:16 · Score: 2, Redundant

Uptime Realities in the Slashdot-linked World

--

There is no longer anything that can be done with computers that is nontrivial and clearly legal. -- Paul Phillips

Netcraft have the final word on this by matthew.thompson · 2002-07-09 08:16 · Score: 1, Redundant

uptime.netcraft.com Is THE best place to see what works for uptime. Last time I checked BSD machines were the best for uptime.

M@t :o)

--
Matt Thompson - Actuality - Insert product here.

Re:Netcraft have the final word on this by questionlp · 2002-07-09 08:27 · Score: 3, Interesting

The results from uptime.netcraft.com seem a bit hokey at times since they do not take load-balanced web servers, network outages, etc. into consideration. In my case, I had a server down for just under a day to be rebuilt and brought it back up... checked the NetCraft results a couple of days later, and they didn't show that the site had gone down.

I know there are some projects/sites that will allow people to submit uptimes sent from cron jobs or agents to a server, which then stores the uptime data there. Of course, that doesn't mean that you can just generate junk data (ie: 999 day uptime with 2934 users).
Re:Netcraft have the final word on this by mgpeter · 2002-07-09 08:40 · Score: 3, Funny

Too Bad that a lot of the servers on the top 50 uptime list still have the default page that apache provides.

I'm sure it isn't too difficult to keep them running - just make sure the power is on and the network cable is plugged in.
Re:Netcraft have the final word on this by Anonymous Coward · 2002-07-09 08:46 · Score: 0

You mean Apache Worm food? Uptime dicksize wars are incredibly stupid unless you're using load-balanced servers where you can take one out of service without having an outage. In this case, anyone running Apache that didn't have an outage in the last couple of weeks can get rooted. What's more important, your 300 day uptime for your website or root access to your machine? Yes, I know you don't need to reboot the machine to restart Apache but the service will have an outage while you're upgrading the daemon.
Re: Netcraft have the final word on this by Black+Parrot · 2002-07-09 09:10 · Score: 3, Insightful

> Too Bad that a lot of the servers on the top 50 uptime list still have the default page that apache provides. I'm sure it isn't too difficult to keep them running - just make sure the power is on and the network cable is plugged in.

Historically, some very popular and widely sold operating systems couldn't even do that much.

--
Sheesh, evil *and* a jerk. -- Jade
Re:Netcraft have the final word on this by mindstrm · 2002-07-09 09:14 · Score: 2

Post hoc ergo propter hoc? You are making the most common mistake in statistics.

Netcraft saying that the boxes with the longest times are BSD only implies there is most likely some kind of relationship between BSD and long uptimes; it does not imply that BSD is responsible for those uptimes.

It could be that the class of administrators who like BSD happen to have administrative practices that preclude rebooting often.

It could be that for some reason BSD is only used in very static configurations where the kinds of activities that would cause you to want to reboot are simply not done.

It could be anything.
Re:Netcraft have the final word on this by ethereal · 2002-07-09 09:20 · Score: 1

With an appropriate server farm setup, you can upgrade the machines on a rolling basis and still provide a given level of uptime for the service as a whole. Just use some of your excess capacity for a day as you do the rolling upgrade.
Which is probably why netcraft doesn't try to distinguish between server clusters, etc. - the whole point is to get uptime for the web service, and the best way to do so is usually by using multiple machines to do it.

--
Your right to not believe: Americans United for Separation of Church and
Re:Netcraft have the final word on this by Anonymous Coward · 2002-07-09 09:21 · Score: 0

That or you don't have to have "scheduled reboots" to fix "memory swap file problems" like we do on our citrix boxen..

one or the other, eh?
Re:Netcraft have the final word on this by finkployd · 2002-07-09 11:46 · Score: 2

In my experience the most important machines are not accessible from the internet. Our mainframe has high availability, but it sure is not running a webserver. Not to mention it is firewalled off from the outside world.

Finkployd
Re:Netcraft have the final word on this by Anonymous Coward · 2002-07-09 16:21 · Score: 0

Uptime is complicated. After all, my watch has a more sophisticated computer than the Apollo lander and I'd say its uptime is pretty good, I'd say 100% for the last 10 years or so (note, no 9's, actual 100% over 10 years). Even if you figure that there are probably some of that model that fail, I'd still bet that you could get a lot of 9's :)

It's all about balancing complexity/capability with uptime. If all you want a box to do is something simple like serving static web pages, or routing packets, etc then you can expect massive uptime. If it's doing something more complex then you have to expect more downtime.

Follow-up by GothChip · 2002-07-09 08:17 · Score: 1, Redundant

And for the follow up article he discusses how hard it is for a site to remain up after it's been slashdotted.

Lucky that it's your former boss by Sims+Youth · 2002-07-09 08:17 · Score: 0

Looks like he's going to be seeing even fewer nines after this slashdotting.

Customers want it, but don't understand it by derekb · 2002-07-09 08:17 · Score: 5, Insightful

How many engineers out there have heard the marketing / sales 'it has to be always available' and priced out an infrastructure accordingly?

Even recently I'm working with a customer who wants a compromise between price and availability - but it still needs five nine's

Availability is infrastructure plus process. You need to have the supporting process to go along with the hardware - maintenance schedules, change management (well FCAPS in general), etc. It's not just a big box.

Re:Customers want it, but don't understand it by Subcarrier · 2002-07-09 08:23 · Score: 5, Funny

Even recently I'm working with a customer who wants a compromise between price and availability - but it still needs five nine's

$999.99

Problem solved. ;-)

--
"I have opinions of my own, strong opinions, but I don't always agree with them." -- George H. W. Bush
Re:Customers want it, but don't understand it by gmack · 2002-07-09 08:31 · Score: 3, Interesting

Our current ISP (Group Telecom) guarantees 5 9's of reliability and it's pretty much a joke. We've already burned through several years' worth of downtime (granted, only a couple of hours a month), and who knows what will happen to our "guaranteed service" if/when they finish their slide into bankruptcy.
Re:Customers want it, but don't understand it by rob_from_ca · 2002-07-09 08:32 · Score: 4, Insightful

This is the most intelligent thing I've ever heard on slashdot before. If you don't understand this comment, read it again and again until you do. :-)

If you're a business, your money is far better spent improving the user experience rather than working on buying redundant-everything, building the support infrastructure, and incurring the extra overhead of the tedious and careful processes needed to obtain 5 nines (and 4, and even to a degree 3 nines).

If your site sucks and no one visits, it doesn't really matter if it's down...work on building something reasonably reliable that is very compelling to your users; that's money much better spent...
Re:Customers want it, but don't understand it by Anonymous Coward · 2002-07-09 08:35 · Score: 1, Funny

No, if pricing is an issue then promise 9.9999% uptime. 5 nines.
Re:Customers want it, but don't understand it by ipsuid · 2002-07-09 08:48 · Score: 5, Informative

One word to clients... "Outsource"

Maintaining backend infrastructure with a 5 9's service level agreement really is prohibitively expensive for all but the largest businesses. Especially if they are not a tech company.

The level of engineering that goes into providing true 5 9's service is extraordinary. Also, some military contracts actually require 6 9's!! (Let alone completely separate networks for classified data).

I'm actually in the design phase of a data center which requires 5 9's (so we can take on those who decide to outsource). Redundant generators, redundant UPS, redundant routers, redundant HVAC, two separate cable runs from different sides of the building, two connections to the power grid, etc., etc....

And that's just the physical infrastructure! Now you need to develop, or integrate, the software to completely cover every aspect of your operations. Anything from cable tagging, to ticketing systems, to emergency procedures. After you build all the infrastructure, take that price and double it... that's how much you will be spending to develop all of those operating procedures. At that point, go get ISO certified, since you've already gone above all the requirements.

If I had to take a guess at a physical cost, $250-300 a square foot seems pretty close (around here anyway). And that only gets cheaper if you are looking at a facility greater than about 10000 sq. ft.

Unless of course, only marketing has those 5 9's!

--
It appears Ockham lost his razor and grew a beard.
Re:Customers want it, but don't understand it by Anonymous Coward · 2002-07-09 08:51 · Score: 0

It's funny. Laugh.
Re:Customers want it, but don't understand it by Homebrewed · 2002-07-09 09:01 · Score: 1

Hey, I've got a couple of Netware servers used by 200 users that come down for a 10-minute reboot once every 3 months. I've also got Veritas backup which allows open file backup. This is on Dell Poweredge servers. Five nines is what, 2 hours downtime every three months?
Re:Customers want it, but don't understand it by Ctrl-Z · 2002-07-09 09:27 · Score: 1

No, 5 nines is about 5 minutes of downtime a year.

60 min/h * 24 h/d * 365.25 d/a = 525960 min/a
525960 * 99.999% = 525954.7404 min/a uptime
525960 - 525954.7404 = 5.2596 min/a downtime

So you're looking at just over a minute every three months.

--
www.timcoleman.com is a total waste of your time. Never go there.
Re:Customers want it, but don't understand it by trapvector · 2002-07-09 09:46 · Score: 2, Interesting

Availability is infrastructure plus process. You need to have the supporting process to go along with the hardware - maintenance schedules, change management (well FCAPS in general), etc. It's not just a big box.

Hmm... let's take it a step further and assign approximate value to infrastructure and process. At a company where I used to work, I smoked more cigarettes than I have ever smoked in my life, and this was directly due to failures of hardware and software. There was never, ever a shortage of queer (strange, not take-it-in-the-bum) little men running around telling us that they were working on expunging the demon/rebooting the server machines/whatever... but boy, was there ever a lack of infrastructure.

So I would say that without sufficiently redundant hardware and code well-enough written to not explode upon severe slashdotting (you know, just as an example), you can have all the process you want, and it will just result in tech staff telling the end-users to go out and have a smoke, 'cause the computers are down and will be back ASAP.

What a horrible thing to do to one's ex-boss. (/redundant)
Re:Customers want it, but don't understand it by DNS-and-BIND · 2002-07-09 10:19 · Score: 2

Redundant infrastructure...yeah right, been there done that. Redundant infrastructure means two wires in the same conduit...

--
Shutting down free speech with violence isn't fighting fascism. It IS fascism!
Re:Customers want it, but don't understand it by Anonymous Coward · 2002-07-09 10:34 · Score: 0

Sorry, neat quip but not even close. I recently talked to a vendor who was asking $2000 per processor for an upgrade for some failover software that we're using. That doesn't include the hardware (dual processor machines), that doesn't include any other software, that doesn't include the admin time, and it definitely doesn't include our markup to our customer.
To be truly redundant takes a lot of money -- and there is always some single point of failure somewhere in the system.
Re:Customers want it, but don't understand it by jenns · 2002-07-09 10:38 · Score: 1

Gee, and I was thinking $99,999

--
Whatever women do they must do twice as well as men to be thought half as good. Luckily this is not difficult. -Whitton
Re:Customers want it, but don't understand it by mph_sd · 2002-07-09 10:59 · Score: 1

Offer a compromise. Tell him he can have six eights.
Re:Customers want it, but don't understand it by Anonymous Coward · 2002-07-09 11:04 · Score: 0

Please explain to me and others how you can possibly calculate "5 9's" availability? You can make it the best you can, put in as many redundant systems as you want, but how does that fit into some formula that figures out 99.999% availability?

Either something's gonna happen within a set time frame, or it's not. A meteor could strike your NOC and it doesn't matter how many UPSs or redundant network connections you have. So much for a calculated "5 9's".

It's all marketing statistics, nothing more.
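
One conventional way to put a number on availability, offered here as a sketch rather than as what any vendor in this thread actually does, is the steady-state formula availability = MTBF / (MTBF + MTTR). The MTBF and MTTR figures below are hypothetical. The result is a long-run average, so it says nothing about whether a particular year will include a meteor strike.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: long-run fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only for illustration.
mtbf = 50_000.0  # mean time between failures, in hours
mttr = 0.5       # mean time to repair, in hours

a = availability(mtbf, mttr)
print(f"availability = {a:.6%}")
```

Under this model, either a longer time between failures or a shorter repair time raises the figure.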
Re:Customers want it, but don't understand it by xmedar · 2002-07-09 11:20 · Score: 1

One word to clients... "Outsource"

<real world>
PHB: Our Anderson Consultants have recommended we should outsource to WorldCom...
</real world>

--
Any sufficiently advanced man is indistinguishable from God
Re:Customers want it, but don't understand it by xsbellx · 2002-07-09 11:21 · Score: 1

Actually, isn't six eights better than five nines? 6x8=48 and 5x9=45 ;)

--
If VISTA is the answer, you didn't understand the question
Re:Customers want it, but don't understand it by bellings · 2002-07-09 15:21 · Score: 1

A couple hours of downtime a month? Two hours is a quarter century of downtime at 5 9's.

If you needed five nine's, and you were willing to pay for five nine's, then how can two nine's suddenly become acceptable to your business plan?

--
Slashdot is jumping the shark. I'm just driving the boat.
Re:Customers want it, but don't understand it by Anonymous Coward · 2002-07-09 17:30 · Score: 0

Been there done that... and now I'm laid off, because you're on target for the cost, but Exodus and the telcos drove the selling price down to $35/sqft.

I've gone through a number of these fads and they do sometimes actually sell -- for the longest time, no one would buy a VPN system, but everyone wanted it designed and priced. Same thing is going on now with multi-site disaster-tolerant deployments. Try making that work with COTS products and five thousand miles between points A and B.
Re:Customers want it, but don't understand it by Anonymous Coward · 2002-07-09 18:41 · Score: 0

People who design data centers for a living will tell you that you cannot achieve 5 9's from your building, which means you cannot achieve it from your equipment.
If you want a shot at 5 9's you better be designing much better than redundant, which is N+1, with N representing need, which means if you need 1 power substation, you also have a redundant backup. What you should be designing for is fault-tolerant, which will get you 4 9's and change over a 5 year period. Fault tolerant is expressed as either 2(N+1) or S+S, which translates to System + System. Check out this page if you want to understand what it means to really be redundant, or fault tolerant; it will also outline the cost for true fault tolerant systems.
http://www.upsite.com/TUIpages/whitepapers/tuitiers.html
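
A rough sketch of why N+1 and 2(N+1) differ on paper. This assumes every unit fails independently with the same availability, which real buildings do not honor (shared conduits, human error, common-mode faults), so it overstates what you would get in practice. The unit availability and N used here are hypothetical.

```python
from math import comb


def k_of_n_availability(k: int, n: int, unit_avail: float) -> float:
    """Probability that at least k of n independent, identical units are up."""
    return sum(
        comb(n, i) * unit_avail**i * (1.0 - unit_avail) ** (n - i)
        for i in range(k, n + 1)
    )


def parallel(*avails: float) -> float:
    """Availability of independent systems where any single one suffices."""
    all_down = 1.0
    for a in avails:
        all_down *= 1.0 - a
    return 1.0 - all_down


unit = 0.999  # hypothetical availability of one unit (e.g. one power feed)
need = 1      # N: units required to carry the load

n_only = k_of_n_availability(need, need, unit)          # N
n_plus_1 = k_of_n_availability(need, need + 1, unit)    # N+1
two_n_plus_1 = parallel(n_plus_1, n_plus_1)             # 2(N+1), i.e. S+S

for label, value in (("N", n_only), ("N+1", n_plus_1), ("2(N+1)", two_n_plus_1)):
    print(f"{label:7s} unavailability = {1.0 - value:.3e}")
```

The gap between these idealized numbers and the "4 9's and change" quoted above is a reasonable measure of how much the independence assumption flatters the design.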
I'm half way through building a Tier 4 data center, and with 85,000 sq/ft of raised floor space it's going to cost almost $100,000,000 just for the building, and the environmental, electrical, and mechanical systems. There will be additional cost for full commissioning testing which will be almost 2% of the building cost.
If you are really designing a 5 9's data center you should contact http://www.upsite.com, because you need a lot of help. There is no way that you can expect to pay $250 to $300 sq/ft and get a 5 9's data center, unless you are using a builder who builds office buildings, and thinks you are just "over engineering" your building. Which means you'll be lucky if you get a tier 1 data center.
Good Luck on your career as a data center designer, because you're gonna need it.
The clients who outsourced to you are the ones who are going to need most of the luck.
I can see your ad now:
"We take reliability to the next level, instead of 5 9's we'll give you 6 8's"
Re:Customers want it, but don't understand it by Tony-A · 2002-07-09 20:14 · Score: 1

Think of it as one hour every 12 years. That's *after* dealing with all the one-in-a-million things that *do* happen. That's including any and all "planned maintenance".
Re:Customers want it, but don't understand it by Anonymous Coward · 2002-07-09 23:53 · Score: 0

I apologize for the semantic confusion... when I say "redundant" I actually mean 2N. And considering our N is based on future looking estimates of need based on our current local market, you could say that we are designing 2(N+1) if we take the definition of N to follow current need, in place of total expected need as we are doing. (The reasons we look at need as total expected need rather than just current need have more to do with business model issues than with any "real world" engineering requirements).

As for the cost. I did mention the caveat, "at least around here", which does indeed describe the situation perfectly. If I were building this in say, Seattle, then I would entirely accept the Uptime numbers. In fact, since most tech companies are still located in the western part of the US, if the Uptime numbers are averages across all facilities, the numbers will be skewed in such a way that they are representative of those facilities, rather than real costs in more reasonably priced real estate areas.

BTW, thanks for the personal brow beating, I always enjoy a bit of destructive criticism to let me know that I'm alive ;-)

Good luck on your facility, thanks for the useful links, and the chance for a rebuttal!
Re:Customers want it, but don't understand it by JWSmythe · 2002-07-10 08:45 · Score: 1

We've been knocked out with hurricanes in Florida and snow storms in New York (14 blocks from a couple big buildings falling too.).. One NOC isn't enough.. Diversify your physical locations, or expect something to be down sometime.

--
Serious? Seriousness is well above my pay grade.
Re:Customers want it, but don't understand it by Beliskner · 2002-07-11 08:10 · Score: 1

Warning: This is offtopic. Ah heck mod me down, excessive karma simply causes Slashdot-wide karma inflation.
Hello, it's been a while, I just got your email, I love holidays. Nowwwww, RDF and XML are both lightweight so that doesn't make your system that different. RDF will only be effective inside big companies, companies rarely cooperate together on their own standards, as you can see from Enron and half a dozen other gigantic companies if they can't even add up (accounting flaws) then what hope is there for homogenous resource-sharing? If it's in their interests they'll do it, like Internet Protocol - having more than one Internet is pointless even if you're in China, firewalls can compartmentalise, no company uses a non-IP and non-HTTP compatible system for global Internet traffic. How the heck did you write 1 mil LOC? Hopefully you just hit the Enter key too many times. How the heck could the RDF people write 50 pages on something so simple? Why don't they just write "it encapsulates part of an XML document for resource-mapping"? Why does this appear 10 pages down? Short abstracts are good but Bajesus, even IETF RFCs are better.
Yeah anyway I can't see how RDF could possibly add further information for natural language parsing than XML can. XML can compartmentalise and categorise by language components if you make your DTD right. A lexically parsed document won't benefit by being declared as a resource to be shared, I think you're using a screwdriver instead of a wrench, like writing a Perl language compiler in Ada. But then if you write this in your conclusion right, you'll get major marks.

--
A caveman dreams of being us, the incalculable power and riches. We dream of being Q, then what?

/.'d? by Anonymous Coward · 2002-07-09 08:18 · Score: 0

What, did everybody click the link but not reply?

I can't access the site...

Sense is not too expensive by infonography · 2002-07-09 08:18 · Score: 1

or maybe they just really like using micro$oft products

--
Sorry about the writing. Robot fingers, you know? Cliff Steele in DOOM PATROL #23

my boss.... by Patrick13 · 2002-07-09 08:18 · Score: 5, Funny

said if i can get this mentioned on slashdot, i'll get the raise after all...

--
::.. check out some Cell Phone Reviews

Re:my boss.... by schussat · 2002-07-09 08:31 · Score: 2

said if i can get this mentioned on slashdot, i'll get the raise after all...
But now that his email is posted on the front page of slashdot, maybe they'll just split the difference between being fired and getting a raise.
-schussat

--
The hour of noon has passed. Let us go and get some Kentucky Fried Chicken.

yes... by iONiUM · 2002-07-09 08:18 · Score: 1

It poses the idea that four and five nines of reliability are too expensive to be realistic

I know it costs a lot per letter of text... so why not just print maybe one or two nines instead? Or maybe a one with two zeros... I tend to just round off after a certain point

No Grudge? by fiftyLou · 2002-07-09 08:18 · Score: 3, Funny

"My former boss"

Nice, and you go after your ex-boss by getting his article slashdotted! ;-)

Re:No Grudge? by WhiteKnight07 · 2002-07-09 09:12 · Score: 1

Naa, I think he's getting back at him by putting a direct non-obfuscated mailto: link to his email on the front page of one of the net's most heavily visited sites. Can we say revenge of the spam bots? I think we can.

--

We're going to make information free Mr. Anderson, whether you like it, or not.
Re:No Grudge? by jsse · 2002-07-09 16:40 · Score: 1

yeah, I'm starting to submit stories from my ex-company's website. Wish me luck. :P

School by Anonymous Coward · 2002-07-09 08:19 · Score: 0

Now, if only School had high uptime... (suffered 2 outages this morning ^^ )

In my dept... by ALecs · 2002-07-09 08:19 · Score: 4, Funny

After a major firewall downtime last year, I wanted to have some T-shirts printed up advertising

Tovaris Systems Support:
Proudly providing nine-fives reliability.

The boss didn't go for it, though. :(

Re:In my dept... by Anonymous Coward · 2002-07-09 08:56 · Score: 0

Firewalls really aren't the most difficult thing to manage, chief. Perhaps you'd be better suited for a career in the housekeeping or food services industries.
Re:In my dept... by ALecs · 2002-07-09 09:53 · Score: 1

Firewalls really aren't the most difficult thing to manage, chief.

No...and neither is dead hardware. But equipment vendors....they're difficult. :)
Re:In my dept... by Anonymous Coward · 2002-07-09 11:37 · Score: 0

I've quit better jobs than this.

9 9s by digitalsushi · 2002-07-09 08:20 · Score: 5, Funny

Like the Telco... voice grade telco. Better than the power company.

Our web server does about 4 9's, which is a downtime of about 8 hours a year, I think. I really suck at math though. I mean it.. I'm so bad at math I have no idea if that's right. I said "well there's 8544 hours in a year, so 8 divided by that is 0.0009, so that's about 4 9s." I think. 8 hours of downtime isn't that bad. I think the next step up from 8 hours of downtime is essentially those megacorps that have redundant systems, and sirens go off and people die when their server goes down for under a second. In fact, I think if their server actually went down for more than a second, some sort of structural damage to the building hosting it is the only likely scenario. Course, that's closer to 7 9s. I can't figure out how long any of the other 9s are cause I only knew what our average downtime is, and could do the math that way only. Wow, it's really hot in here.

Could someone with an 8th grade math education please post the amounts of downtime 1 through 9 9s are, please?!

--
slashdot: where everyone yells sarcastic metaphors to themselves to understand the issue

Re:9 9s by thespacegeek · 2002-07-09 08:28 · Score: 1

.999086757
Only three nines.
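
A small sketch of the check above: turn hours of downtime per year into an availability fraction, then count the leading nines as floor(-log10(1 - availability)). The function names are made up for illustration, and the 8760-hour year is the non-leap figure.

```python
import math


def availability_from_downtime(down_hours: float, hours_per_year: float = 8760.0) -> float:
    """Fraction of the year the service was up."""
    return 1.0 - down_hours / hours_per_year


def count_nines(avail: float) -> int:
    """Number of leading nines in an availability fraction, e.g. 0.9990867 -> 3."""
    if not 0.0 <= avail < 1.0:
        raise ValueError("availability must be in [0, 1)")
    # The small epsilon guards against floating-point error at exact
    # boundaries such as 0.999, where 1 - avail is not exactly 0.001.
    return math.floor(-math.log10(1.0 - avail) + 1e-9)


a = availability_from_downtime(8.0)
print(f"{a:.9f} -> {count_nines(a)} nines")
```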
Re:9 9s by Anonymous Coward · 2002-07-09 08:29 · Score: 5, Informative

1 nine: 90% availability, or 37 days of downtime per year (Qwest!)
2 nines: 99% availability, or 88 hours of downtime per year
3 nines: 99.9% availability, or 9 hours of downtime per year
4 nines: 99.99% availability, or 53 minutes of downtime per year
5 nines: 99.999% availability, or 315 seconds of downtime per year
6 nines: 99.9999% availability, or 32 seconds of downtime per year
7 nines: 99.99999% availability, or 3 seconds of downtime per year

Beyond that, it doesn't much matter.
Re:9 9s by Wrexen · 2002-07-09 08:29 · Score: 5, Informative

TI-89 > all education
9's ---- time
1 876 hours
2 87 hours
3 8 hours
4 52 minutes
5 5 minutes
6 31 seconds
7 3 seconds
8 .3 seconds
9 you get the idea
Re:9 9s by Anonymous Coward · 2002-07-09 08:33 · Score: 0

2 "9's" = 99% = 1 in 100 = 1 hour downtime in about 4 days.
3 "9's" = 99.9% = 1 in 1000 = 1 hour downtime in about six weeks.
4 "9's" = 99.99% = 1 in 10000 = 1 hour downtime in just under 14 months.
5 "9's" = 99.999% = 1 in 100000 = 1 hour downtime in 11 years. I imagine very few web servers have gotten this far.
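
The "one hour in how long" view is just the downtime divided by the unavailable fraction. A minimal sketch, with a hypothetical helper name and an 8766-hour (365.25-day) year assumed for the conversion:

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766


def period_hours(outage_hours: float, nines: int) -> float:
    """Hours of operation over which the downtime allowance adds up to outage_hours."""
    # The unavailable fraction is 10**-nines, so the period is outage / fraction.
    return outage_hours * 10.0**nines


for n in range(2, 6):
    hours = period_hours(1.0, n)
    print(
        f"{n} nines: 1 hour down per {hours:,.0f} hours "
        f"= {hours / 24:,.1f} days = {hours / HOURS_PER_YEAR:.2f} years"
    )
```

The same function answers the question for longer outages: a two-hour outage at five nines uses up the allowance for 200,000 hours of operation.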
Re:9 9s by Anonymous Coward · 2002-07-09 08:35 · Score: 0

I thought that nine 9s of reliability on a 2 GHz machine meant one failure every 0.5 sec?
Re:9 9s by Zathrus · 2002-07-09 08:36 · Score: 1, Redundant
Wow, you do suck at math :)

First off, there's 8766 hrs/yr (assuming 365.25 days/year), not 8544.

As for percentages:
- 90% uptime: Down 876.6 hours, or 36.525 days. 3 days per month.
- 99% uptime: 87.66 hours, or 3.65 days.
- 99.9% uptime: 8.77 hours
- 99.99% uptime: 0.877 hours, or 52.6 minutes
- 99.999% (5 nine's!) uptime: 0.0877 hours, or 5.26 minutes
- 99.9999% uptime: 0.0088 hours, or 0.526 min, or 31.56 seconds
- 99.99999% uptime: 0.0526 min, or ~3 seconds
- 99.999999% uptime: ~1/3 second outage per year
I suspect that most people could figure out the mystical, magical mathematic relations after the 90% to 99% jump....
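
For anyone who would rather let the machine do it, a minimal sketch of the same table. It assumes the 8766-hour year used above; the function names are invented for this example.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766, as above


def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines."""
    return HOURS_PER_YEAR * 10.0**-nines


def describe(hours: float) -> str:
    """Render a duration in the largest unit that keeps it at or above 1."""
    seconds = hours * 3600
    if seconds >= 86400:
        return f"{seconds / 86400:.1f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.2f} seconds"


for n in range(1, 10):
    uptime_pct = 100.0 * (1.0 - 10.0**-n)
    print(f"{n} nines ({uptime_pct:.{max(n - 2, 0)}f}%): {describe(downtime_hours_per_year(n))}")
```

Each extra nine divides the allowance by ten, which is the "mystical, magical" relation mentioned above.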
Re:9 9s by thespacegeek · 2002-07-09 08:36 · Score: 1

0.90 is 876 hours
0.99 is 87.6 hours
0.999 is 8.76 hours
0.9999 is 52 minutes
0.99999 is 5 minutes and 15 seconds
0.999999 is 31.5 seconds
0.9999999 is 3.2 seconds
0.99999999 is 0.32 seconds
0.999999999 is 0.032 seconds
Re:9 9s by Anonymous Coward · 2002-07-09 08:37 · Score: 0

For each percentage, I give the # of hours of downtime allowed, per year.

99% = 87.6hrs
99.9% = 8.76hrs
99.99% = 0.876hrs
99.999% = 0.0876hrs (about 5 minutes per year).

For other amounts of 9's, or for Metric Time, I leave those conversions as an exercise for the reader.
Re:9 9s by Asprin · 2002-07-09 08:40 · Score: 4, Funny

Could someone with an 8th grade math education please post the amounts of downtime 1 through 9 9s are, please?!

365 days * 24 hours/day * 60 minutes/hour = 525600 minutes/year.

%uptime %downtime Fuzzy description of downtime
.9 .1 52560 minutes down/year ~= 36 days down/yr
.99 .01 5256 minutes down/year ~= 3.5 days down/yr
.999 .001 525.6 minutes down/year ~= 9 hours down/yr
.9999 .0001 52.56 minutes down/year ~= 1 hour down/yr
.99999 .00001 5.256 minutes down/year ~= 5 minutes down/yr
.999999 .000001 .5256 minutes down/year ~= 32 seconds down/yr
.9999999 .0000001 .05256 minutes down/year ~= 3.2 seconds down/yr
.99999999 .00000001 .005256 minutes down/year ~= (HALF A MILLISECOND/YEAR!!!!)
.999999999 .000000001 .0005256 minutes down/year ~= How long it takes for one of these locally hosted sites to get /.'ed

--
"Lawyers are for sucks."
- Doug McKenzie
Re:9 9s by Anonymous Coward · 2002-07-09 08:41 · Score: 0

According to my napkin, 4 9's is about 53 minutes of downtime in a year, and 5 9's is about 5 minutes.
Re:9 9s by d3jp_ · 2002-07-09 08:42 · Score: 1, Redundant

365 days * 24 hours * 60 minutes * 60 seconds = 31536000 seconds/year

Now, per year, that means that...
90.00000% uptime - Downtime: 36 days, 12 hours
99.00000% uptime - Downtime: 3 days, 15 hours, 36 minutes
99.90000% uptime - Downtime: 8 hours, 45 minutes, 36 seconds
99.99000% uptime - Downtime: 52mins, 33 secs
99.99900% uptime - Downtime: 5 mins, 15 secs
99.99990% uptime - Downtime: 32 secs
99.99999% uptime - Downtime: 3 secs
Re:9 9s by Telecommando · 2002-07-09 08:47 · Score: 1

Well, maybe you were trying to be funny and maybe you were looking for answers.

There are actually 8760 hours in a year so
1 nine (90%) is 876 hours of downtime per year,
2 nines (99%) is 87.6 hours of downtime per year,
3 nines (99.9%) is 8.76 hours of downtime per year,
4 nines (99.99%) is .876 hours of downtime or 52.56 minutes per year,
5 nines (99.999%) is 5.256 minutes of downtime per year, (or .864 seconds per day)
6 nines (99.9999%) is 31.536 seconds of downtime per year,
7 nines (99.99999%) is 3.1536 seconds of downtime per year.

Of course these numbers are for a regular year, not a leap year. ;-)

--
Beta sux! Join the Slashcott! http://hardware.slashdot.org/comments.pl?sid=4760465&cid=46173047
Re:9 9s by OwnedByTwoCats · 2002-07-09 08:50 · Score: 1

90% availability: down for 5 weeks a year
99% availability: down for three and a half days a year
99.9% availability: down for eight hours a year
99.99% availability: down for 52 minutes a year
99.999% availability: down for 5 minutes a year
99.9999% availability (six nines):
down for 31 seconds per year
One part per million
99.99999% availability: (seven nines):
down for 3 seconds a year
99.999999% availability: (eight nines):
down for 300 milliseconds a year
"like winning the lottery".
Re:9 9s by Anonymous Coward · 2002-07-09 08:51 · Score: 0

just going by hrs down/year (not really an accurate way to describe it)

0 8766hrs (1year)
1 876.6hrs (~1month)
2 87.66hrs (~1/2 a week)
3 8.766hrs (~1 bus day)
4 .8766hrs (~1 hour)
5 .0877hrs (~5 mins)
6 .0088hrs (~30 secs)
7 .0009hrs (~3 secs)
8 .0001hrs (~316ms)
9 .0000hrs (~31.6ms or about a decent ping time)
Re:9 9s by Anonymous Coward · 2002-07-09 09:03 · Score: 0

> Like the Telco... voice grade telco. Better than
> the power company.

You must be kiddin'.

Just try to compute how long it would take AT&T
to recover to 9 9s from this disaster:

http://www.dmine.com/phworld/history/attcrash.htm

Kind regards,

Toon Moene
Re:9 9s by ealar+dlanvuli · 2002-07-09 09:08 · Score: 1

We call the Qwest exec board if a page doesn't get through to our techs in under a minute on server outage. I got to sit in on one call; it was, to put it lightly, mildly amusing.

--
I live in a giant bucket.
Re:9 9s by Anonymous Coward · 2002-07-09 09:11 · Score: 0

I congratulate you, you have created the single most highly moderated gibberish thread ever

read those replies, 999999999
99999
9
9
99
9
9999
9

9

ahhh my head!
Re:9 9s by michael_cain · 2002-07-09 09:14 · Score: 2, Interesting

Ah, local telco reliability.
IIRC, and it's been a number of years, the overall goal was about 50 minutes of outage per line per year (roughly four nines). Different failure modes were allocated different parts of that total. Components like the wires, that only took a single line out of service, were allocated the lion's share. Switch components were allocated smaller amounts, depending on how many lines would be out of service. Total system failure on a switch was allocated about 4.5 minutes per year (five nines).
No switching system ever actually made that grade. Probably the ones that came closest were the old electromechanical "steppers". Many small steppers in small towns ran completely unattended, and maintenance consisted of someone driving out once a month to make sure the building was still there and to polish some relay contacts.
All of the computer-controlled switches had dual synchronized processors (ie, each one executing the same op codes at the same time) and duplex memory, with a bunch of extra hardware to detect faults. The single most common cause of total system failure was when a fault had occurred, and the system was running "simplex", and a tech pulled a card from the active rather than the failed processor.
Re:9 9s by 4of12 · 2002-07-09 09:23 · Score: 4, Funny

Hmmm...

Enough nines of reliability and you can probably easily claim that network latency is responsible for the slow response a client is experiencing:)

The server can go down, be rebooted before the client thinks something is really wrong!

--
"Provided by the management for your protection."
Re:9 9s by Anonymous Coward · 2002-07-09 09:35 · Score: 0

1 nine: 90% availability, or 37 days of downtime per year (Qwest!)

Actually, if you're talking Qwest DSL, I'd make that:

1 nine: 9% availability, or 332 days of downtime per year :o)
Re:9 9s by PMM · 2002-07-09 09:54 · Score: 0

I'm going to bookmark this comment

in moments of confusion it'll be handy knowing that there's someone out there more so.
Re:9 9s by Hack+Shoeboy · 2002-07-09 10:45 · Score: 0

Wow. Congrats on the successful troll. Let's see if I can come up with a math question to pose to a bunch of insecure geeks with an eighth grade education and a chip on their shoulder....
I'm having trouble factoring integers. Can someone with an 8th grade education in math explain the Chinese Remainder Theorem please? If not, then please just post a list of the first 99,999 prime numbers....

--

IN TEH FUCHAR, LITERSY WLIL EB OPSHANAL!!!!!111
Re:9 9s by Anonymous Coward · 2002-07-09 11:00 · Score: 0

% math
Mathematica 4.1 for Linux
Copyright 1988-2000 Wolfram Research, Inc.
 -- Motif graphics initialized --

In[1]:= Table[Prime[n],{n,1,99999}]

Out[1]:= Whoa, Nellie.
Re:9 9s by fishbowl · 2002-07-09 11:17 · Score: 3, Interesting

>Beyond that, it doesn't much matter.

Well, beyond "7 nines" you would start talking about 100% reliability. So you start with contingency plans for a terrorist attack on one data center at the same moment of a quake under another data center. Now you're in the realm of needing your own redundant power plants, and probably network infrastructure that does not really exist yet.

So in reality, your guarantee of "9 nines" or, effectively ZERO downtime for the life of the product, really would be specified in terms of compensation and not technology. In other words, you'd be stating what the client will receive when (not if) the uptime guarantee is not met.

--
-fb Everything not expressly forbidden is now mandatory.
Re:9 9s by chief-dot · 2002-07-10 00:36 · Score: 1

.005256 minutes down/year ~= (HALF A MILLISECOND/YEAR!!!!)

Half a milliminute ...I'm so disgusted in myself, I hate pedantic pricks that correct people on simple and obvious mistakes.
Re:9 9s by Asprin · 2002-07-11 09:05 · Score: 2

.005256 minutes down/year ~= (HALF A MILLISECOND/YEAR!!!!)

Half a milliminute ...I'm so disgusted in myself, I hate pedantic pricks that correct people on simple and obvious mistakes.

I'm so disgusted in myself. I hate when I piss off pedantic pricks with simple and obvious mistakes. Oh well, back to building that Space Shuttle booster rocket! Anyway, thanks for the correction - I completely stepped in 'mea culpa'.

--
"Lawyers are for sucks."
- Doug McKenzie
Re:9 9s by chief-dot · 2002-07-11 10:15 · Score: 1

Why did you have to reply!

Since yesterday I'd forgotten all about how disgusting I am and now you have to just drag up the past!

I think I'm going to turn off email alerts for when people reply to my posts...

I hope you have it turned on though - otherwise you'll never read this...because what sort of lamer replies to a 3 day old post? If you don't have it turned on then this post was a waste of time - in which case... ...I think I'm going to have a lay down.

I wasn't able to read the article because of /.ing by Anonymous Coward · 2002-07-09 08:20 · Score: 0

But basically, I disagree. Four or five nines of reliability may seem unrealistic and expensive now, but if we apply Moore's Law we should be seeing as many as eight to ten nines shortly. In fact, I wouldn't be surprised if we could top fourteen nines by the end of the decade, what with the strides we are currently making with miniaturization, polymers, and circuit-etching processes.

This isn't the first time something that once seemed ludicrous became commonplace; remember Gates and his 640K of RAM statement.

In other words... by reaper20 · 2002-07-09 08:20 · Score: 2, Insightful

We should just give up on decent service and professionalism. I don't think so.

My ISP (Ameritech) seems to think so, considering my DSL connection and their promptness to "Get ahold of me within 24 hours..."

Bleh ... It's not unrealistic ... don't expect people to live with downtime just because a good portion of those systems need to be rebooted on a regular basis (Win machines), and general retardness of sysadmins around the world allow things like Nimda and Codered to get out of hand. This is an excuse to let companies too cheap to have decent customer support off the hook. Maybe if they were educating their tech staff instead of finding more ways to rip us off, they'd have decent service.

Everyone with competent sysadmins on rock solid *nix systems raise your hands...

Re:In other words... by GeckoX · 2002-07-09 08:34 · Score: 2

My last DB server, which was the back end for a moderately high traffic site (~.5Million hits a day, ~1million db hits a day), running about 80% capacity for the last year straight, was up for 11 mths before we replaced it last week.

Win2k my friend.

And who supported it that whole time?
Me, the web application developer.

Sure, we _could_ have paid for a 'rock solid *nix system' and a couple of admins to go with, but my raises over the past couple of years sure would have looked dismal.

It's called TCO. Sometimes, in some cases, nix isn't necessarily better, or at least there's nothing wrong with Win IF you rtfm.
Guess you never did! You should try it sometime before slamming WinServer users.

Oh, never got nailed by Nimda or the red or any others either.

--
No Comment.
Re:In other words... by JWSmythe · 2002-07-09 08:50 · Score: 2, Interesting

The database server handling the message areas for Voyeurweb, RedClouds, and feedback areas for the same has answered 28,442,099 questions in the last 13 days. That's when we finalized changes to it.. Before that, it had been running for 2 years.

I wish we only had 5mil hits/day.. One web server takes 18mil req/day.. We have bunches of 'em out there. :)
http://voy37.voyeurweb.com/1.stats.html.

Did I mention we're a Linux shop?

--
Serious? Seriousness is well above my pay grade.
Re:In other words... by Mr+Teddy+Bear · 2002-07-09 11:13 · Score: 2, Funny

You get a cookie!

Feel better now?

Slashdot 20 second rules sucks... so I am typing this to burn time.
Re:In other words... by WNight · 2002-07-10 02:46 · Score: 2

Not bloody likely.

I've seen those uptimes in Win2k, for small print servers, and the like, but in actual use? Doubtful.

Explorer.exe tends to crash around the 3-week mark on average, and you need to reboot every time you apply critical updates which are needed for a public server.

Thats why by Anonymous Coward · 2002-07-09 08:22 · Score: 0

Goatse.cx runs on Microsoft IIS! So it can reliablly bring you that anus! None of your open sores crap!

Proof that Open Sores is unreliable! Don't click if you run an open sores OS!

Re:Thats why by Anonymous Coward · 2002-07-09 08:28 · Score: 0

Open sores is about the only way to make the site any more disgusting.

4 or 5 nines? by IndependentVik · 2002-07-09 08:22 · Score: 0

Ok, I'll admit my ignorance. Anyone care to explain what X nines of reliability means?

--
I'd suggest you don't use Slashdot as your only news source, or you will suffer permanent brain damage.

Unfortunately.... by Lord_Slepnir · 2002-07-09 08:24 · Score: 5, Funny

I think we just knocked his server down to two nines by slashdotting it.

Must hate his ex-boss by palmech13 · 2002-07-09 08:24 · Score: 5, Funny

What else would motivate someone to post an ex-boss' e-mail address on the front page of slashdot?

Tell that to the telecom operators by Anonymous Coward · 2002-07-09 08:24 · Score: 0

You don't know what you are talking about. All our customers have it as a mandatory requirement for *all* platforms we sell. I think the original target was like 3 minutes complete downtime every 10 years. That's why no Windows platform will ever make it in the telecom infrastructure world. It used to be that 5-9's was accomplished by proprietary hardware and software (look at Nortel, Lucent, and any other infrastructure provider's equipment). Now all the datacomm companies think that you can make 5-9's stuff out of commercial off the shelf 3rd party crap and it's damn near impossible.

-working for the elves...

anonymous karma whore by Anonymous Coward · 2002-07-09 08:25 · Score: 1, Informative

Introduction

The Scenario

Pagers going off. Phones ringing. People shouting fragments of conversations over the tops of cubicles. Groups of people huddled around monitors. Others dashing up and down the hallways, sticking their heads into office doors for just a moment, then scampering along to the next doorway. You are frantically talking on your cell phone, silencing your pager, and yelling into the speakerphone on your desk while typing on two different keyboards attached to three different monitors.

Sound Familiar? It's a classic case of the dreaded 'downtime' disease, a terrible ailment where none of your systems work and for reasons you can't always understand. Of course, it typically strikes at the most inopportune moments - the launch of a major product upgrade, or right after announcing your partnerships with 5 of the Fortune 100.

Nobody wants downtime. It's a terrible thing that always involves blood, sweat, tears, and inevitably, a loss of money. This is why when you talk to the upper management of any company with a strategic online initiative you'll be told that the IT group has the highest goals, and that downtime is considered to be an anathema to be stamped out vigorously.

Unfortunately, when you talk to the company's IT manager you commonly hear a different story; the resources to back-up the company's lofty online goals are hard to come by. In fact, with the down swing of the last couple years, combined with the fact that IT isn't, at least directly, a revenue generating entity, IT budgets are being reduced while uptime performance levels are expected to be the same. This can just lead to a death march of extremely over-worked IT personnel, and longer, more numerous, occurrences of system downtime. These goals need to be re-evaluated.

Genesis of the 'Five Nines'

We've all heard the mantra of 'five nines', or 99.999% reliability. Somewhere in the depths of the Internet's 'big bang', when systems were slow and cranky, reliability became a major selling point of why one company's system was 'better' than the competition.

First, people talked about being 'two nines' or 99% reliable. Then someone else would top that, and make their product seem better, claiming 'three nines' (99.9%). Not long after that came 'four nines' (99.99%) and then, near the peak of the dot com era, came 'five nines'.

The herd mentality left no room in which to pitch for investment without the 'five nines' claim. "After all," it was thought, "if everyone else is saying they can provide 'five nines', I'd have to pretend I didn't know what I was doing if I didn't say I could match everyone else's claim."

'Five nines' isn't impossible. It's merely impractical and unnecessary in the world of the Internet. A shocking statement, perhaps, but a truism none-the-less.

We're not talking about launching people into space (which, by the way, is unfortunately done under 'three nines'), or working with nuclear power plants. We're working within the reference of online systems providing services to users both on and off the Internet ... nobody dies from a system failure.

The Greasy Steel Bar

Think of uptime as a chin-up bar coated in grease. The higher the reliability desired, the greater the coating of grease. It's clearly tougher to hang on to a higher standard of reliability.

What's not so obvious, but very important, is that the higher the uptime target, the worse one does if not prepared. An IT department capable of three nines faced with a bar that's five nines slippery won't even manage the three nines they are capable of doing.

55.5555555 by Anonymous Coward · 2002-07-09 08:25 · Score: 0

At my place of employment, we don't bother to go for 5 nines, we're quite happy with 9 fives :)

Voyeurweb by JWSmythe · 2002-07-09 08:25 · Score: 1

My bosses allow for 0 downtime with Voyeurweb.. It only takes a bit of magic, a lot of available bandwidth, and redundant servers in multiple physical locations. Hell, we're slashdot proof. :)

--
Serious? Seriousness is well above my pay grade.

Re:Voyeurweb by Anonymous Coward · 2002-07-09 09:25 · Score: 0

Thanks for your comments on the upskirt issue.
Ehm, ...
uptime?
Re:Voyeurweb by Anonymous Coward · 2002-07-09 20:15 · Score: 0

So, you're pitting your ability to administer remote systems against the collective sex drive of not only tens of millions of teenagers, but also every sex-deprived Slashdot geek?

Post the link on Slashdot's main page. $$$ says you get /.'ed! :)

(apologies to Scott Adams for borrowing and modifying the dialogue from a Dilbert comic)

ISPs and Backbones by TibbonZero · 2002-07-09 08:25 · Score: 1

Wouldn't ISPs be that important? What about company VPNs? Hospitals? Google? Slashdot????

Web sites only work if people can view them, and when you have hundreds of thousands of hits per day, you could be losing a lot by being down.

--
Tibbon
tibbon.com

uptime by SuperMacNinja · 2002-07-09 08:25 · Score: 1

So is his uptime screwed now that the site has been slashdotted?

99.999% perfection by Gorm+the+DBA · 2002-07-09 08:26 · Score: 4, Insightful

Let's see...five nines would be just over five minutes of downtime in a year (315 seconds). For business and other non-life-threatening situations, that would be way better than necessary. Lots of folks are probably going to harp on the "If 1 out of 10,000 airplanes crashed, there'd be X crashes" line of argument. There's a problem with that...one mistake doesn't crash an airplane. Every system on an airliner is redundant, and virtually any "pilot error" has time to be fixed before there's a problem. Listen in on the Air Traffic Control to Cockpit transmissions sometime...just about every flight encounters some minor error at some point, whether it is a pilot needing to reask for a clearance or someone needing to climb or descend a bit to clear a potential collision. Errors are unavoidable. The key is to ensure recovery from those errors is possible. So sure, your computer may be down for 5 minutes a year. Make sure you have a backup system that is able to take up the slack instantly, and your downtime is down to 3/10 of a second a year. Redundancy is the key.
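
For anyone who wants to check the "just over five minutes a year" figure, here is a rough Python sketch of the downtime budget for N nines. It assumes a 365-day year, and the function name `downtime_budget_seconds` is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a non-leap year


def downtime_budget_seconds(nines: int, period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Allowed downtime for an availability of N nines (2 -> 99%, 5 -> 99.999%)."""
    unavailability = 10.0 ** (-nines)
    return period_seconds * unavailability


if __name__ == "__main__":
    # Print the yearly budget for two through five nines.
    for n in range(2, 6):
        budget = downtime_budget_seconds(n)
        print(f"{n} nines: {budget:10.1f} s/year ({budget / 60:7.1f} min)")
```

Five nines comes out at about 315 seconds a year, and two nines at roughly 87.6 hours.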

Re:99.999% perfection by Anonymous Coward · 2002-07-09 08:49 · Score: 0

Well, the big problem is that the airline
industry uses humans as backups for computers,
as the Swiss Air Traffic Control recently
proved.

If the DHL jumbo and the Russian passenger Tupolev
pilots had just followed their collision avoidance
*systems*, over 70 people would have been alive
today that aren't.

Toon Moene.
Re:99.999% perfection by afidel · 2002-07-09 09:11 · Score: 1

Well, for my office we are at about 5.5 9's over the last 2 years (the UPS freaked during a power outage, shutting off rather than correctly tripping to batteries; this caused a 2 minute outage, which was our only outage in 2 years). A quick off-the-cuff calculation I just did puts the cost of an outage here at $36/minute just in salary costs, not including data loss or lost opportunity costs. We have dual routers, dual core switches and redundant links between all floors; we also have all PCs on UPSes and a backup generator that can run our building for about 48 hours without refueling (any disaster that leads to us not being able to get fuel for 48 hours probably wasn't survivable anyway). For busy e-commerce sites I would imagine downtime costs are probably in the range of thousands to millions of dollars per minute. 5 9's is an attainable goal with a little planning.
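
The "about 5.5 9's" estimate can be reproduced from the two numbers given in the post (one 2 minute outage in 2 years, $36/minute). A small sketch, assuming non-leap years; the helper names are hypothetical:

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes non-leap years


def achieved_nines(outage_minutes: float, years: float) -> float:
    """Number of nines implied by total outage time over an observation window."""
    unavailability = outage_minutes / (years * MINUTES_PER_YEAR)
    return -math.log10(unavailability)


def outage_cost(outage_minutes: float, cost_per_minute: float) -> float:
    """Direct cost of an outage at a flat per-minute rate."""
    return outage_minutes * cost_per_minute


if __name__ == "__main__":
    print(f"nines achieved: {achieved_nines(2, 2):.2f}")
    print(f"salary cost of the outage: ${outage_cost(2, 36):.0f}")
```

With those inputs the result lands a little above five and a half nines, and the salary-only cost of the single outage is $72.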

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:99.999% perfection by MerlynEmrys67 · 2002-07-09 10:05 · Score: 1

So for two years your staff made 36 dollars ???
I am betting that you listed 100K equipment costs over the two years, plus salaries in the 3.2 million dollar range over 2 years (staff of 16, 100K loaded cost per employee). If for 1/2 that cost you could have gotten away with 10 minutes of downtime instead of the 2 minutes you have gotten, your per minute downtime is about 200K/minute. That is a LOT of commerce (about 105 billion a year in revenue) assuming that you completely lost that business rather than having your customers wait 10 minutes and place their orders again...

--
I have mod points and I am not afraid to use them
Re:99.999% perfection by afidel · 2002-07-09 10:31 · Score: 1

salaries, umm none. We are a satellite office that takes literally about 20 minutes total time from the admin teams per month. I'm already onsite to keep all the desktops running so there is no additional cost for my time (well, they had to pay me overtime to wait for the UPS vendor to certify that the UPS wouldn't fall over on the switch back to grid power, but otherwise none). The equipment cost is probably close; I didn't see the PO for the switches, but even with internal discounts the 10 cat6500's we use probably cost a pretty penny. The thing is, we would have had more than 10 minutes of downtime guaranteed if we hadn't designed for this level of redundancy, because simply getting a replacement management blade will take 4 hours even with the best service plan available. Oh yeah, and getting back to salary costs, having the equipment be so reliable is what allows them to spend virtually no time and no dedicated resources on IT in this 200 person office. That alone probably made up for whatever hardware costs there were in designing and implementing the systems.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:99.999% perfection by chthon · 2002-07-09 22:59 · Score: 1

Most businesses not only have to account for daytime interactive processing, but also for processing at night and even in the weekends.
Re:99.999% perfection by virve · 2002-07-10 02:06 · Score: 1

If the DHL jumbo and the Russian passenger Tupolev
pilots had just followed their collision avoidance
*systems*, over 70 people would have been alive
today that aren't.

Can anybody confirm what I heard: that had the pilots just let the TCAS handle the situation by itself, the autopilot would have adjusted the altitude correctly? If this is true the TCAS is pretty amazing. I hope it can do better than 99.999%.

--
virve

slashdotted, so I'll blather anyhow... by markmoss · 2002-07-09 08:26 · Score: 2

It all depends on what is on the server. If it's stuff your own people use constantly on their job, through your own network, you need five nines, otherwise you will take the blame for critical jobs getting done late.

But when people are going to the server through the internet, they get used to interruptions - there are so many links between, some of which periodically become overwhelmed with traffic, that no one could tell the difference between two nines and five nines on your server itself. So sales & product information sites don't need more reliability than you can readily afford. They do need high capacity.

And if it's your blogs concerning your navel lint - no one's looking at your uptime but you...

Re:slashdotted, so I'll blather anyhow... by Anonymous Coward · 2002-07-09 09:15 · Score: 0

If it's stuff your own people use constantly on their job, through your own network, you need five nines
Honestly, in most cases, I'd think it being "stuff for your own people" would significantly reduce the need for "five nines". The reason being that your own people are most likely all in the same time zone, almost certainly all on the same continent, and that means there will be a half dozen or so hours out of every day that nobody actually needs the system. You could drop to 90%, maybe even 80% up and no one would ever miss a beat as long as the downtime falls in the right hours.
Re:slashdotted, so I'll blather anyhow... by Anonymous Coward · 2002-07-09 10:35 · Score: 0

You don't need five nines on most internal systems. That's about 5 minutes of downtime a year or about one reboot. It depends on the service, but most of those can be rebooted without issues after hours or on weekends. That kind of availability costs money, and a lot of services aren't worth it.
Re:slashdotted, so I'll blather anyhow... by markmoss · 2002-07-09 23:39 · Score: 1

The question is, do those people claiming 5 nines count "scheduled downtime" as downtime or not?

The slashdot test by itsmarcos · 2002-07-09 08:27 · Score: 1

Ironically, the site hosting the article is /.ed! How's that for reliability? ;)

Anyone cached the article?

--
Marcos

Simple by American+AC+in+Paris · 2002-07-09 08:27 · Score: 5, Funny

Five nines uptime is cheap and easy. It all boils down to where you put the decimal point.

--

Obliteracy: Words with explosions

Re:Simple by Indras · 2002-07-09 09:13 · Score: 2

Heh, my old Commodore gets about five nines. If I turn it on for just 52 minutes a year, it is getting .0099999% uptime!

--
The speed of time is one second per second.
Re:Simple by kcbrown · 2002-07-09 10:09 · Score: 2

Five nines uptime is cheap and easy. It all boils down to where you put the decimal point.

Yeah, well I'd like to see you try to get 999.99% uptime!
Though I'm sure there are some players out there (*cough*Qwest*cough*) that are getting 999.99% downtime...
:-)

--
Use 'slashdot stuff' in the subject line in any email you send me if you want to get past the spam filter.
Re:Simple by 4of12 · 2002-07-10 04:52 · Score: 2

Though I'm sure there are some players out there (*cough*Qwest*cough*) that are getting 999.99% downtime...

Wait a minute!

You can't fool this astute reader! Your number is above one hundred percent!

There's no possible way for Qwest to get that kind of downtime .... uhmm .... usually .... unless.. they bring down more than their own machines...hmmm - OK, I'm wrong. You're right.

--
"Provided by the management for your protection."

Nope! by Anonymous Coward · 2002-07-09 08:28 · Score: 0

8 hours a year? You must be a Windows man!

Seriously, 8 hours of downtime for a cellular operator during peak hours can mean big bucks.

Whats four and five nines? by TibbonZero · 2002-07-09 08:29 · Score: 1

Duh... what's four and five nines?

--
Tibbon
tibbon.com

Re:Whats four and five nines? by Sadiq · 2002-07-09 08:34 · Score: 1

I assume it's 99.99% and 99.999% uptime.

--
SysWear - Geek T-shirts (UK/Europe)
Re:Whats four and five nines? by headjack · 2002-07-09 22:17 · Score: 1

Actually it's 36 and 45. Stay in school.

geographically distributed failover by adam_megacz · 2002-07-09 08:30 · Score: 1

The XWT Cluster has achieved some very high availability on the cheap by using machines at several mom-and-pop data centers across the country. The machines are clustered into a peered (no master) failover configuration with the open source dnsfailover package. If any machine fails, the others will remove it from the DNS records; when it comes back on line, it gets added back in.

By spreading our risk across several data centers in different cities, with no single point of failure in the cluster, we don't have to worry about incompetent network administrators, power failures, a/c failures, backhoes, or nukes. Being able to skip out on all those expensive options saves a ton of money.
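
To make the "remove it from the DNS records, add it back when it returns" idea concrete, here is a toy Python sketch of the record-selection step. It is NOT the dnsfailover package's code or API; the function name, the health-check map, and the fall-back-to-everyone rule are all assumptions for illustration, and the addresses are documentation-only examples.

```python
from typing import Dict, List


def live_records(health: Dict[str, bool]) -> List[str]:
    """Return the addresses that should stay in the DNS record set.

    `health` maps each peer's address to the result of its latest health check.
    If every peer looks dead, keep the full set rather than publishing nothing,
    since an empty record set would guarantee an outage (an assumed policy).
    """
    alive = sorted(addr for addr, ok in health.items() if ok)
    return alive if alive else sorted(health)


if __name__ == "__main__":
    checks = {"192.0.2.1": True, "192.0.2.2": False, "192.0.2.3": True}
    print(live_records(checks))
```

In a peered setup each machine would run the same selection over its own view of the cluster, so no single box is in charge of the records. How the peers actually probe each other and push DNS updates is left out here.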

Re:geographically distributed failover by Anonymous Coward · 2002-07-09 08:56 · Score: 0

shameless XWT plug? what does this have to do with anything?

By the way - change your webpage background to some other color than black so people can read it.
Re:geographically distributed failover by adam_megacz · 2002-07-09 09:06 · Score: 1

shameless XWT plug? what does this have to do with anything?

The article is about availability. My post lets people know about an open source tool called dnsfailover that can help them improve their availability. It also mentions the XWT Cluster as a real-life example of said software in action.
Re:geographically distributed failover by Anonymous Coward · 2002-07-09 09:17 · Score: 0

A link to your dnsfailover tool would have been sufficient. How many hits has XWT.org received in its lifetime, by the way?

Management want it, but does it understand it by Slak · 2002-07-09 08:30 · Score: 2

My company (a large-ish, surviving Internet Retailer) has internally announced a Six Sigma Initiative. I'm wondering if we'll need to maintain 5 9s uptime...

Re:Management want it, but does it understand it by Jobe_br · 2002-07-09 08:46 · Score: 3, Informative

Good luck applying Six Sigma to processes that aren't directly related to manufacturing something ... ;)

I really mean that - Good Luck. :)
Re:Management want it, but does it understand it by arnie_apesacrappin · 2002-07-09 09:00 · Score: 2

Actually, you'll need to add two sixes to it.
Six Sigma is a maximum of 3.4 defects per million. So converting to uptime would be:
Uptime percent = 100*(1 - 3.4*10^-6) = 99.99966%
After we take off the literal filter, I'd have to say that was a pretty funny comment. Just hoping to add a little connection to the Six Sigma to Five Nines relationship.
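
The same conversion as a few lines of Python, for anyone who wants to plug in other defect rates. This treats "defects per million" as "unavailable parts per million of time", which is only one possible reading of Six Sigma for IT; the function names are invented for the example.

```python
import math


def defects_to_availability_percent(defects_per_million: float) -> float:
    """Availability (in percent) if the defect rate is read as the unavailable fraction."""
    return 100.0 * (1.0 - defects_per_million / 1_000_000)


def defects_to_nines(defects_per_million: float) -> float:
    """Equivalent number of nines for a given defects-per-million rate."""
    return -math.log10(defects_per_million / 1_000_000)


if __name__ == "__main__":
    print(f"{defects_to_availability_percent(3.4):.5f}%")
    print(f"{defects_to_nines(3.4):.2f} nines")
```

At 3.4 defects per million this gives 99.99966%, or a bit under five and a half nines.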

--
Still, with a plan, you only get the best you can imagine. I'd always hoped for something better than that. -CP
Re:Management want it, but does it understand it by afidel · 2002-07-09 09:38 · Score: 1

I know, I work for a division of GE related to IT and we never bother with most of six-sigma, because IT is not a place where you could ever achieve six standard deviations or less of error. For one thing you rely too much on others' work; for another, computers are more complex than any assembly or manufacturing process in existence. About the only thing we take from six-sigma are the greenbelt projects to drive process improvements.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Management want it, but does it understand it by Zordak · 2002-07-09 09:50 · Score: 3, Interesting

My company does lots of things, but almost no manufacturing (our local office provides engineering services to the government and military). We also got hit with the Six Sigma marketing buzz, and our stupid (now departed) CEO decided that they needed to initiate the garbage company wide. I've managed to avoid it so far, but I've passed by the conference room occasionally while sessions have been going on, and I would have to say that it would score real close to 10 on the Wank-o-meter. All of the engineers who have been subjected to it have said it's nothing more than good engineering practice that they should have learned in school. But maybe it's good for the administrative/marketing types.

--

Today's Sesame Street was brought to you by the number e.
Re:Management want it, but does it understand it by davebooth · 2002-07-09 09:56 · Score: 2

Six Sigma is a maximum of 3.4 defects per million. So converting to uptime would be...
Don't forget you're talking about defects here. Where I work, planned outages for things like preventive maintenance or the deployment of an upgrade to the core apps are not considered defects, whereas unplanned outages are. I have several servers here that have only a few dozen days of uptime, but the last time the system or the primary app they serve crashed, resulting in an unplanned outage or a six-sigma-style 'defect', was over a year ago.

--
I had a .sig once. It got boring.
Re:Management want it, but does it understand it by Grax · 2002-07-09 10:15 · Score: 1

Six Sigma as applied to web serving should mean that out of 1 million requests, 999,996.6 would be served successfully.

--
Coding Blog
Re:Management want it, but does it understand it by Anonymous Coward · 2002-07-09 16:24 · Score: 0

That would mean no 404 errors :)

Re:4 or 5 nines? by dcgaber · 2002-07-09 08:32 · Score: 2

it is the percent of the time that the server is up. i.e. 99.99% is 4 nines of reliability and up 99.99% of the time (I am assuming that x 9s refers to the total number of 9s in the percentage, not just those to the right of the decimal point).

100% uptime is virtually impossible, so the holy grail is as close as possible--99.999%

codesta.com uptime by cpeterso · 2002-07-09 08:33 · Score: 3, Funny

If you want to learn about uptime, don't bother going to codesta.com. Their servers have already melted from a brutal slashdotting. According to Netcraft, codesta.com runs Linux and has 74 days of uptime... until today!

--
cpeterso

Oi! You act like a manager! by isa-kuruption · 2002-07-09 08:35 · Score: 3, Informative

The "five-nines" of reliability has nothing to do with an individual server being available, but with an individual application. This means you can have 2-3 servers running the same load-balanced application. This way, you can take 1 down every hour if you want, as long as the other one or two are still working, and the application keeps working. If you're REALLLLLLLLY lucky, you will meet the "five-nines" and if you're EXTREEEEEMELY lucky, you'll get 100% on that application.

THAT is the goal. It's called redundancy. You will *not* meet any reliability milestones on a single server or network link. It's an obtainable goal, but it does cost money depending on your architecture.
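
A back-of-the-envelope sketch of why redundancy gets you there: if the application is up whenever at least one replica is up, and failures are independent (a big assumption: shared power, a shared load balancer, or the same bug on every box all break it), the availabilities combine like this. The function name is made up for the example.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of an application that stays up while any one replica is up.

    Assumes replica failures are statistically independent.
    """
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for single, n in [(0.99, 2), (0.99, 3), (0.999, 2)]:
        print(f"{n} x {single:.3%} -> {combined_availability(single, n):.6%}")
```

Under that assumption, three mediocre two-nines servers already give six nines on paper. Real numbers will be worse, since the load balancer and network links are part of the application too.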

I'd love to read it... by dasmegabyte · 2002-07-09 08:36 · Score: 2

but their server is down.

--
Hey freaks: now you're ju

not economically possible? by lingqi · 2002-07-09 08:37 · Score: 2, Insightful

with M$, it is theoretically impossible as well to achieve their advertised up-time; ( i think back when they ran some ad (still running?) about how windows can achieve three or four 9s of uptime).

Total bullshit... let's see -- windows machine *requires* reboot every time you apply a patch; a reboot on a large machine is... i dunno, 10 minutes if you got a lot of crap. security update turns up about twice a week or so... that puts up to be ~99.8% MAXIMUM;

even if you don't buy my numbers, three 9s uptime means every week you only get ~10 minutes of downtime.

yeah... sure... not if you want to patch up that internet explorer / IIS so your system does not die from DoS, hackers, or worms!
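
Plugging the guesses from this post into a quick script (two reboots a week at 10 minutes each are the poster's rough figures, not measurements; the function names are invented here):

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080


def max_availability_percent(reboots_per_week: float, minutes_per_reboot: float) -> float:
    """Best-case availability if the only downtime is patch reboots."""
    downtime = reboots_per_week * minutes_per_reboot
    return 100.0 * (1.0 - downtime / MINUTES_PER_WEEK)


def weekly_budget_minutes(nines: int) -> float:
    """Downtime allowed per week at N nines."""
    return MINUTES_PER_WEEK * 10.0 ** (-nines)


if __name__ == "__main__":
    print(f"ceiling with patch reboots: {max_availability_percent(2, 10):.2f}%")
    for n in (3, 4, 5):
        print(f"{n} nines allows {weekly_budget_minutes(n):.3f} min/week")
```

Twenty minutes of reboots a week caps a single machine at about 99.8%, while three nines allows roughly 10 minutes a week and five nines about 6 seconds.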

--

My life in the land of the rising sun.

Re:not economically possible? by Anonymous Coward · 2002-07-09 08:53 · Score: 0

Ever heard of redundant servers?
Re:not economically possible? by lingqi · 2002-07-09 09:01 · Score: 1

ever heard of lower overall cost? (something M$ was advertising side-by-side, btw)

--
My life in the land of the rising sun.
Re:not economically possible? by Pfhreakaz0id · 2002-07-09 09:12 · Score: 2

I apply patches all the time that don't require a reboot (this is 2000, NT and 9x require you to reboot for damn near anything).

--
DO NOT DISTURB THE SE
Re:not economically possible? by amemily · 2002-07-09 09:44 · Score: 1

Well here goes my karma...

Ever heard of securing an IIS server properly in the first place so you don't have to install all those damned patches (hint - remove all mappings in IIS EXCEPT the ones you need and you won't get hit by the majority of the exploits out there).

Answer this, would you stick a default install of linux live on the net without securing it first? Or any operating system for that matter? Just as with linux, you have to secure Windows before putting it live. Unfortunately, quite a few winadmins are poorly trained and do not realize this.

That and Microsoft is now making reboot-less patches for Windows 2000 Server, and for those rare ones that require a reboot, you can qchain them together and reboot once.

About the hackers (I assume you mean crackers), ever heard of a firewall, I hear they are good for defeating most of the script kiddies out there. DoSing, in reality, no server can withstand a major one. Go read up on them, Wired has an article here, and an admin at UW has some more articles here.

Oh yea, what the hell are you doing surfing the net on a server in the first place, that's what workstations are for.

Go work in an IT department with Windows 2000 before you go shooting your mouth off. Adminning your home linux box does not make you an expert in Windows 2000 administration.
Re:not economically possible? by ymgve · 2002-07-09 10:02 · Score: 2

I apply patches all the time that don't require a reboot (this is 2000, NT and 9x require you to reboot for damn near anything).

Well, you still need to reboot for those patches to take effect. ;)
Re:not economically possible? by SuiteSisterMary · 2002-07-09 10:34 · Score: 2

Just as a further comment to the above poster:
Q: What was the last major IIS exploit?
A: Code Red.

Q: When was that, again?
A: Damn, around a year ago?

Q: Was it completely prevented by the most basic of post-install locking down?
A: Yes.

Q: How long did all of these servers go unpatched before Code Red hit the wild?
A: A month. The patch was available for a month before Code Red EVER hit the scene.

Q: What was the worst part of it for those of us who were immune?
A: Ignoring all those damn entries in the logs.

--
Vintage computer games and RPG books available. Email me if you're interested.
Re:not economically possible? by Anonymous Coward · 2002-07-09 10:52 · Score: 0

openBSD is secure by default in its installation. and oh, adminning a point and drool for idiots win2k server is a helluva lot easier than setting up a linux box.
Re:not economically possible? by Verizon+Guy · 2002-07-09 11:40 · Score: 1

Shh! You might be persecuted for telling the truth!

--
Aw, fuck it. Let's go bowling. - The Big Lebowski
Re:not economically possible? by Anonymous Coward · 2002-07-22 08:03 · Score: 0

I dunno... I usually get 6-10 week uptime with my windows 2K.

I never had a Linux machine up that long because I got sick of EVERYthing breaking and not working.

Out for a pleasant evening troll? by r_j_prahad · 2002-07-09 08:39 · Score: 2

Maybe your phone call to 9-1-1 should be the one that happens during the five minutes of downtime?

Re:Out for a pleasant evening troll? by Anonymous Coward · 2002-07-09 09:49 · Score: 0

If you believe that most 911 centers have 99.999% availability you've probably never called 911. I don't consider hearing, "Please hold. Your call will be handled by the next available operator. Do not hang up. Calls are answered in the order in which they are received" as available. There probably isn't a 911 center in the US that doesn't have HOURS of downtime a year.
Re:Out for a pleasant evening troll? by Anonymous Coward · 2002-07-09 09:57 · Score: 0

You know I saw a 911 system about a month ago that was running DOS 6, and the box was held together (literally) with scotch tape.
Re:Out for a pleasant evening troll? by Anonymous Coward · 2002-07-09 10:47 · Score: 0

I haven't dialed 911 in my entire life. If something doesn't work, guess what? You find a way AROUND IT. People have begun to use technology as such a damn crutch it's unbelievable. I'm sure if calculators stopped working, you'd have a hard time finding people who remember their multiplication tables.

Maybe we need a few more disasters to wake people up.
Re:Out for a pleasant evening troll? by Anonymous Coward · 2002-07-09 13:29 · Score: 0

Somebody please mod the parent up.

hmm... by Zancarius · 2002-07-09 08:40 · Score: 1

So what kind of uptime are YOU going for?

--
He who has no .plan has small finger. ~ Confucius on UNIX

If the ailerons are not available by pommiekiwifruit · 2002-07-09 08:42 · Score: 1

your plane is more than late. You are late. In the Six Feet Under sense.

Remember that the control surfaces on modern jets are not connected mechanically to the yoke, you are completely at the mercy of software. You don't want it to halt.

That is why planes use redundant systems - the requirement for reliability is for the system as a whole, not necessarily for an individual processor. The control surfaces need to be accessible by the pilots (or auto-pilot) at all times.

Re:If the ailerons are not available by Anonymous Coward · 2002-07-09 08:45 · Score: 0

You have a few seconds until you hit the ground. If the system can be restored within that time frame then you would be okay.
Re:If the ailerons are not available by SirSlud · 2002-07-09 08:55 · Score: 2

Lets say the flight is 2 hours long. 99.99999% uptime on the software that connects the yoke to the control surface means it would be down for 0.08 of a second on each 2 hour flight, hardly anything that would make me think things were unsafe.

Now, over a year, bunch it up over every day, and you get 29 seconds. Now thats scary, but if you meet 99.99999% uptime, you're probably not going to bunch all your downtime together in one incident.

Although, I'd say that the article looks like it wasn't written for _really_ critical stuff like this.

But it's scary when you have to argue with your boss about whether you should spend 2 weeks figuring out why your server crashes after 2 weeks of uptime .. whether it's a bug in your ASP software, the operating system, the ORB, the tcp/ip stack, or 1 minute adding a cron job to restart the server once every week at 3am when nobody is using the system .. that's when you'd probably want to start earning money from somebody who can manage it a little better.

--
"Old man yells at systemd"
Re:If the ailerons are not available by Jobe_br · 2002-07-09 09:08 · Score: 4, Interesting

Entirely. Having worked extensively on the flight deck systems for the Boeing 767-400ER, I can tell you first hand that the redundancy is rather amazing. There are two major computer systems that drive the displays in the cockpit, the DPCs which do a lot of digital signal manipulation and the DCCs which do a lot of the analog to digital signal manipulation and control. Two DCC boxes drive three DPC boxes and the two DCC boxes are cross-connected to each of the DPC boxes. The three DPC boxes each talk to each other (I'm not sure if the DCC boxes talked to each other - that was further down the chain than I was working on) and actually vote on the data points that are being sent to the displays to determine if one of the DPCs is malfunctioning or processing bad data. The way this all works together is amazingly complicated, especially when you consider that it all runs on embedded boards where the "executable" is typically less than 1-2MBs in size.

My particular area of development was the actual display software which was provided data from the DPC systems. Each of the six displays (2-pilot, 2-copilot, 2-EICAS in the console) received multi-cast data from each of the DPCs and then fed data back to the DPCs on the display's status. The DPCs would then automagically evaluate if the displays were functioning properly and switch primary functions away from a malfunctioning display to a functioning display if error conditions were detected.

The PFD (primary flight display) is the pilot's most important display as it displays airspeed, artificial horizon, TCAS warnings, altitude and a few other things. The ND (navigation display) is the inner screen on both the pilot/co-pilot sides and if the PFD experiences error conditions, the DPCs switch the PFD to the ND and the ND to one of the EICAS (engine indicators, etc.) displays.

All very interesting stuff ... especially the way it's actually implemented in the embedded system. Debugging all this, of course, was non-trivial. For that matter, coding it is non-trivial as it's all in Ada83.

Ahh ... those were the days :)
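
For readers who haven't seen a voting scheme before, here is a toy Python sketch of three redundant channels voting on one data point. It uses mid-value selection, one common approach; it is NOT how the 767-400ER boxes actually do it (that isn't described in the post), and the function name and tolerance parameter are invented for illustration.

```python
from statistics import median
from typing import List, Tuple


def vote(readings: List[float], tolerance: float) -> Tuple[float, List[int]]:
    """Mid-value select across three redundant channels.

    Returns the median reading, plus the indexes of any channels that disagree
    with it by more than `tolerance` (candidates for being flagged as faulty).
    """
    if len(readings) != 3:
        raise ValueError("expected exactly three channels")
    voted = median(readings)
    suspects = [i for i, r in enumerate(readings) if abs(r - voted) > tolerance]
    return voted, suspects


if __name__ == "__main__":
    # Channel 2 reports a wildly different airspeed and gets outvoted.
    print(vote([250.0, 250.5, 312.0], tolerance=2.0))
```

The point of taking the middle value is that one bad channel, however wrong, cannot drag the output with it; the other two bracket the answer.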
Re:If the ailerons are not available by ColdGrits · 2002-07-09 09:12 · Score: 1

"Lets say the flight is 2 hours long. 99.99999% uptime on the software that connects the yoke to the control surface means it would be down for 0.08 of a second on each 2 hour flight, hardly anything that would make me thing things were unsafe."

Uh-hu.

you do realise you've described *SEVEN* nines uptime, don't you?

For five nines, your example means 8s without control. For four nines, that's 80 seconds without control. Are you SURE that wouldn't make you feel unsafe, if the pilot had no control for 80 seconds? Especially if they occurred during final descent and landing!

Of course, those times scale up. On a trans-Atlantic flight, that's at least 48s / 8mins respectively...
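
A quick check of the per-flight arithmetic, taking "N nines" to mean an unavailable fraction of 10^-N spread evenly over the flight (a simplification, since real outages come in lumps). The function name is made up; the 2 hour and 8 hour durations are just example flight lengths.

```python
def downtime_seconds(nines: int, duration_hours: float) -> float:
    """Expected downtime over a period of `duration_hours` at N nines."""
    return duration_hours * 3600.0 * 10.0 ** (-nines)


if __name__ == "__main__":
    for hours in (2, 8):
        for n in (4, 5, 7):
            print(f"{hours} h flight, {n} nines: {downtime_seconds(n, hours):.5f} s")
```

By this reckoning the figures quoted in this subthread look high: a 2 hour flight gets about 0.72 s at four nines, 0.072 s at five nines, and under a millisecond at seven nines.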

--
People should not be afraid of their governments - Governments should be afraid of their people.
Re:If the ailerons are not available by ceejayoz · 2002-07-09 09:30 · Score: 2

That'd be why planes (*gasp*) have redundant systems! One might crash for an hour, but the other two will take care of ya until landing.
Re:If the ailerons are not available by Preposterous+Coward · 2002-07-09 09:50 · Score: 3, Interesting

Not really. In most phases of flight it wouldn't be an issue if ailerons were unavailable momentarily. The plane is intrinsically stable, and in any case you can level wings and induce slow turns with rudder, though it's inefficient. Also, once you are at altitude, you have lots of time to correct if things go wrong. In multi-engine aircraft there's also the option of using differential thrust.
It's instructive to read about the United Flight 232 incident a few years back. The #2 engine of a DC-10 exploded in flight (at around 30,000 feet) and severed ALL the hydraulic systems and their backups. Without rudder, ailerons, elevator, spoilers, flaps, or one of the three engines, the pilots set the plane up for a forced landing. And about 200 of the 300 passengers on board survived.
Of course, certain bugs can be really bad. I was down at Boeing Field once last year when somebody attempted to take off in a light plane that had just been serviced. Unfortunately the mechanic hooked up the ailerons backwards, so that when the pilot attempted to correct for a crosswind on takeoff, he promptly rolled and landed on top of another plane in the parking area. (Sounds like inadequate preflight action by the pilot on that one, since he appears to have missed the "control surfaces free and correct" item on his pre-takeoff checklist, but no injuries to the best of my knowledge.)
Note that I'm hardly going to argue that flight-control software shouldn't be damn good. But... it's overstating your case to assume that downtime or error necessarily means a plane is going to fall out of the sky.

--

"Biped! Good cranial development. Evidently considerable human ancestry."
Re:If the ailerons are not available by DNS-and-BIND · 2002-07-09 10:15 · Score: 2

Gosh...I can't wait for Microsoft to start writing aircraft programs in C#.

--
Shutting down free speech with violence isn't fighting fascism. It IS fascism!
Re:If the ailerons are not available by Jobe_br · 2002-07-10 09:26 · Score: 1

Don't count on it. C/C++ isn't flight-level A certified (it might be level B certified, but probably not; most likely it's level C, which is restricted to non-critical systems and systems that do NOT interact in ANY way with critical systems). C# is certainly not going to be certified for critical systems ... so far, only Ada83 is (far as I know, Ada95 is NOT).
Re:If the ailerons are not available by DNS-and-BIND · 2002-07-10 09:46 · Score: 2

Just wait until MS joins the industry associations and starts putting its executives in the committees. It'll change real fast.

--
Shutting down free speech with violence isn't fighting fascism. It IS fascism!
Re:If the ailerons are not available by Jobe_br · 2002-07-11 02:39 · Score: 1

Don't count on it. Until you've been in the industry and lived in it to tell how serious they are about these things, you can't assume that things will change there as they do in other industries. The FAA is very, very, very strict. So is the FDA - yet the various pharmaceutical companies and medical equipment companies STILL haven't made the FDA regulations any less draconian. MS, as a newbie in the industry (if they go into it at all) would certainly not have any more sway than say Lockheed Martin, Rockwell Collins, Honeywell, Boeing, etc. Not to mention that anything that will be approved for INTERNATIONAL air travel would need to be approved by the EU equivalent of the FAA, otherwise it's worthless.
Don't you think folks have been trying to get languages other than Ada83 flight-level A certified? Do you think people enjoy only being able to use ONE certified, validated compiler for Ada, that happens to ONLY run on ancient VAX systems such that compiling an embedded application that is about 700K in size (finished binary) takes in excess of 24hrs! The avionics world is very, very different from any other industry ... even other embedded industries. There are very few other technology applications that hold the lives of hundreds of people in the balance. Nuclear energy might be such an industry ... not even military applications have as stringent of guidelines as the commercial avionics industry does.
Btw, the info about the Ada compiler is specific to my work at one of the leaders in the avionics industry (no names, sorry).

Never go to work again! by novakane007 · 2002-07-09 08:42 · Score: 2, Insightful

Hate standing in the meat locker (server room)? Hate rushing to work past midnight to cycle a server?
The problem I used to have is I'm not a morning person so being available as an admin before 7am is tough, but now I can admin my network while trapped in rush hour traffic. =] Reboot servers, telnet into devices, stop/start services, add users, manage DNS... the list goes on and on.
Uptime can be maintained without even having to leave the comfort of your easy chair. If you're an admin you should check this product out.
SonicAdmin by sonicmobility
(http://www.sonicmobility)

--

WURD!!

Close, but it depends by alexhmit01 · 2002-07-09 08:43 · Score: 5, Interesting

Let me give you a hypothetical case. One of our clients does about $50k/month on their web site. When the site was built, they were only expecting $10000-$15000/month. At the time, NN4 compatibility wasn't important, because the extra cost ($10k) wasn't going to be worth it. With NN4 sitting between 5% and 10% each month, they have decided that NN4 compatibility is important in the next version.

When we launched, 3 days of downtime a month was considered okay. It was considered a better choice than spending an extra $5k on hardware for redundancy. Well, when the site broke $40k/month, we immediately decided that that was no good and invested in the redundancy.

The site has had a few 15 minute outages over the past 6 months, and a 1 day outage over a holiday weekend (not a big deal). However, if the site doubles in revenue again, downtime is becoming less acceptable, and we'll drop $10k to avoid it.

If your site sucks and no one visits, downtime doesn't matter. If you are making lots of money, downtime does matter. $10k on hardware is worth it if the downtime would cost you $25k?

Alex

Re:Close, but it depends by Slak · 2002-07-09 08:53 · Score: 5, Funny

If nobody visits a site that's down, is it really down? ;)
Re:Close, but it depends by Anonymous Coward · 2002-07-09 08:56 · Score: 0

The site has had a few 15 minute outages over the past 6 months
Well, that gives you about four-nines uptime. Even with the 1 day holiday, it's still 99.5%. Next thing to consider is whether those outages are administrator initiated, so they can be coordinated with low-traffic periods, and whether a 15 minute outage actually loses the sale (or whether the customer checks back in a half hour). Heck, even at $50k/month, you're only averaging $1.15/minute. How long does it take the $10k redundant system to pay for itself if the "old" system gets you four nines?
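To put rough numbers on that payback question, here is a minimal Python sketch. It assumes a 30-day month, revenue spread evenly around the clock, every minute of downtime losing exactly one minute of sales, and (optimistically) that the redundant hardware removes all downtime. None of those assumptions come from the original post, and the function names are made up for illustration.

```python
# Assumptions (not from the thread): 30-day month, evenly spread revenue,
# and each minute of downtime loses exactly one minute of sales.
MINUTES_PER_MONTH = 30 * 24 * 60
MINUTES_PER_YEAR = 365 * 24 * 60


def revenue_per_minute(monthly_revenue):
    """Average revenue earned per minute of the month."""
    return monthly_revenue / MINUTES_PER_MONTH


def payback_years(hardware_cost, monthly_revenue, availability):
    """Years until the avoided downtime losses equal the hardware cost.

    Optimistic: assumes the new hardware eliminates all downtime.
    """
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    yearly_loss = downtime_minutes * revenue_per_minute(monthly_revenue)
    return hardware_cost / yearly_loss


if __name__ == "__main__":
    print(f"${revenue_per_minute(50_000):.2f} per minute")
    print(f"{payback_years(10_000, 50_000, 0.9999):.0f} years at four nines")
    print(f"{payback_years(10_000, 50_000, 0.995):.1f} years at 99.5%")
```

Under these assumptions the $10k takes well over a century to pay back against a four-nines baseline, but only a few years against 99.5%, so the answer depends almost entirely on which outages the redundancy would actually have prevented.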
Re:Close, but it depends by Anonymous Coward · 2002-07-09 11:46 · Score: 0

$10,000 insurance one time cost, say for two years (life of hardware)....while you're pulling in $50,000 a month every month.

that's cheap.
Re:Close, but it depends by Anonymous Coward · 2002-07-09 14:32 · Score: 0

I'm not sure I buy that whole must be up all the time or you lose money junk. If I can't get to amazon to buy something, I'll try again in an hour or so. Not "OMG AMAZON.COM DIDN'T LOAD I'M GOING TO SWITCH TO BUY.COM NOW!!!"
Re:Close, but it depends by BurritoWarrior · 2002-07-10 01:44 · Score: 2

The site isn't making any money. All the profits are going to you guys and the hardware vendors. :-D
Re:Close, but it depends by Biggles_the_pilot · 2002-07-10 04:31 · Score: 0

If no one ever reads this, did I ever really write it? (ahh, me in silly mood, hehee)

--
I have no sig

Mine by Chacham · 2002-07-09 08:43 · Score: 1

I'll show you mine if you show me yours.

# uptime

16:42:54 up 121 days, 2:29, 3 users, load average: 0.23, 0.28, 0.27

--
Have you read my journal today?

Re:Mine by Anonymous Coward · 2002-07-09 11:24 · Score: 0

Server up time: 170:07:45:03
Re:Mine by AWrinkler · 2002-07-09 14:21 · Score: 1

FreeBSD 5.0-CURRENT machine on our network:
12:10PM up 459 days, 50 mins, 2 users, load averages: 1.00, 1.02, 1.00

ho hum...
e-easy
Re:Mine by Anonymous Coward · 2002-07-09 17:24 · Score: 0

$ ud -d
Now : 46 day(s), 07:39:41 running Linux 2.4.18-k6
One : 70 day(s), 19:44:35 running Linux 2.4.14-k6, ended Fri Feb 1 20:38:04 2002
Two : 32 day(s), 03:16:44 running Linux 2.4.18-k6, ended Thu Apr 25 19:20:31 2002
Three: 31 day(s), 16:31:10 running Linux 2.4.17-k6, ended Tue Mar 5 13:08:32 2002
Re:Mine by Anonymous Coward · 2002-07-10 04:19 · Score: 0

Wow! FreeBSD-5.0-CURRENT. You've figured out how to install and run a new kernel without rebooting the machine! Why don't you explain how you did that, you lying sack of shit.
Re:Mine by Anonymous Coward · 2002-07-10 06:54 · Score: 0

Geesh, nobody seems to be posting any REAL OS numbers running on REAL hardware:

$ show sys
VAX/VMS V5.5-2H4 on node BOXEN 10-JUL-2002 12:53:54.23 Uptime 344 10:22:19

Re:4 or 5 nines? by 920 · 2002-07-09 08:44 · Score: 2, Informative

Simply put, 4 9's of reliability would mean 99.99% uptime (only down for .01% of the time).

--
"Perl 6 gives you the big knob" -- Larry Wall

five nines? by Anonymous Coward · 2002-07-09 08:44 · Score: 0

like in 9.9999% ? ;-)

You think that FAA systems have 5 9s? by glrotate · 2002-07-09 08:46 · Score: 1

Ha ha.

Re:You think that FAA systems have 5 9s? by Anonymous Coward · 2002-07-09 11:11 · Score: 0

You'd better believe the FAA requires five 9s or better. That's why airplanes are always flying ten-year-old computer technology. And the s/w is wrung out like you wouldn't believe if you didn't work in the industry.

The downside of all this is that the hotshots don't want to work on flight-control s/w because it's ancient technology. Or maybe that's the upside!

Re:Oi! You act like a manager! by dasmegabyte · 2002-07-09 08:46 · Score: 4, Insightful

Actually, even this is silly. True five nines availability on a widely distributed network would mean that an application was available at all times on all segments of the network. Which would mean that your uptime depends not only on your redundancy on one side of a pipe, but on your overall redundancy as well, so that when a pipe goes down you're still accessible. Since when a pipe goes down in your host you probably lose other resources as well (such as power or alternate pipelines), this means multiple datahouses owned by multiple vendors. Each of these has to have a perfect backup of all data and be running the same versions of all software. Really, the only true redundancy would be so heavily distributed that each local network would basically have to have its own server. This isn't so crazy -- technically, DNS and email do this. However, we all know that for an end user even DNS and email can have perceived outages.

And this is why 5 9s is foolish. Sure, you're redundant behind the pipe, but if you lose the pipe you can't blame your datacenter when you charged a customer for uninterrupted service. Technically, if their modem disconnects them for a few hours you've broken contract.

Besides, who needs it? If yahoo is unreachable from my desk, I wait and reconnect. It doesn't matter if the downtime was my fault or theirs...the effect on my user experience was the same. Any services I might have used, or products purchased, I will use or purchase at a later time. After all, I don't refrain from buying shoes just because the mall is closed!

--
Hey freaks: now you're ju

For those bad at math: by Faldgan · 2002-07-09 08:46 · Score: 2

3 9s = 99.9% uptime = 8.76 hrs/Yr = 525.6 min/Yr.
4 9s = 99.99% uptime = .876 hrs/Yr = 52.6 min/Yr.
5 9s = 99.999% uptime = .0876 hrs/Yr = 5.26 min/Yr.
9 9s = 99.9999999% uptime = .03 seconds per year downtime.

I call bullsh*t on anything that claims to have 9 9s reliability. 3 seconds every HUNDRED years.

--
Nathan Brazil?

Re: For those bad at math: by Black+Parrot · 2002-07-09 09:08 · Score: 1

> I call bullsh*t on anything that claims to have 9 9s reliability. 3 seconds every HUNDRED years.

Heh. Last night when Jay Leno did his newspaper clippings, one was an ad for a "500 Year Clock" - that came with a one year warranty.

Maybe we should start a shop selling 9 9s at top dollar, but only giving a one year warranty?

--
Sheesh, evil *and* a jerk. -- Jade

So the upshot is ... by Zancarius · 2002-07-09 08:46 · Score: 2, Funny

That we live in a society that is more willing to send people into space with only a 99.9% chance of success, yet we freak out when a search engine on the Internet drops below 99.999% reliability? Great. Remind me never to work for NASA.

--
He who has no .plan has small finger. ~ Confucius on UNIX

Speaking of uptimes... by MongooseCN · 2002-07-09 08:46 · Score: 2

Looks like codesta.com just used up all its downtime by getting its servers slashdotted.

--

Outdoor digital photography, mostly in New Engl

Boss Slashdotted! by cOdEgUru · 2002-07-09 08:48 · Score: 4, Funny

I believe there's more to this than meets the eye.

What better way to get back at your former boss than slashdotting him or his company server back to the medieval ages..

Follow that up with multiple queries on google about boss's info, credit cards, ssn etc..

To cut things short, by the end of the week :

Boss's boss realizes the server crashes were due to Boss, fires his ass on the spot.

Wife realizes that the new unexplained charges on Credit card from "Suzy's Parlor" were not exactly the next door cafe. Gives him the boot as well.

You evil man..you!

--
Rapid Nirvana

Does anyone else find it ironic... by Grip3n · 2002-07-09 08:49 · Score: 2, Funny

...that this article is hosted on a server which is now being brutally Slashdotted?

--
To make a pun demonstrates the highest understanding of a language

Re:Does anyone else find it ironic... by Anonymous Coward · 2002-07-09 08:55 · Score: 0

You need to be modded up my friend! I love irony like this

+1 Redundant by Subcarrier · 2002-07-09 08:49 · Score: 1

The above is a high availability response if I ever saw one!

--
"I have opinions of my own, strong opinions, but I don't always agree with them." -- George H. W. Bush

I don't really agree here... by SkyLeach · 2002-07-09 08:50 · Score: 3, Interesting

We did it on a really low budget:

Heartbeat/Mon/Fake/Coda/Linux/IPVS for the High Availability, failover from DS1->DS2, each on different backbone nodes.
Mirrored systems in different geographic locations:
Firewall
IPVS Gateway
Apache->Weblogic bridge (Apache vhosts with ssl)
Apache->Zope bridge (Apache vhosts with ssl)
Zope->Zeo setup for content management.
SAN drive array for Oracle, running on two E4500s

This system isn't really that expensive, just the costs of hardware and my salary for setting them up.

--
My $0.02 will always be worth more than your €0.02, so :-p

Re:I don't really agree here... by Pfhreakaz0id · 2002-07-09 09:09 · Score: 5, Funny

I'm sorry, I just had a type mismatch error. I saw "oracle" and "isn't really that expensive" in the same post.

--
DO NOT DISTURB THE SE
Re:I don't really agree here... by Doodhwala · 2002-07-09 09:55 · Score: 1

So just how does Coda support High Availability? While yes, those are its features, and it does support server replication, disconnected operation, low bandwidth connections, etc, it is technically STILL in development and can thus have crashes and buggy behavior in many instances. I know.... I have worked with Coda, developed software (CodaVis) for it and am at Carnegie Mellon right now.

Now that said, Coda is GREAT! It supports a number of features that no other Open FS does and it works pretty well for the research purposes I need it for (look up Internet Suspend/Resume here).
Re:I don't really agree here... by Anonymous Coward · 2002-07-09 18:03 · Score: 0

Not to mention the two E4500s... Unless they got them for like 1.25 on eBay from some shitter-bound dot-bomb.

Then the *system* is not five nines! by pommiekiwifruit · 2002-07-09 08:51 · Score: 1

And frankly I'd rather not be in a plane that lost control for five minutes once a year.

Re:Then the *system* is not five nines! by ColaMan · 2002-07-09 09:34 · Score: 4, Funny

And frankly I'd rather not be in a plane that lost control for five minutes once a year.

As long as it's parked on the ground during those five minutes, it's no problem.

--

You are in a twisty maze of processor lines, all alike.
There is a lot of hype here.

Its all about MY convenience by Just+Jeff · 2002-07-09 08:51 · Score: 1

I'm a one-man-band at a small organization. I have a lot of machines set up over a three city block area. Some of these machines are important, some are not. They are all running for a reason. If they stop running, I have to interrupt whatever other useful thing I'm doing, and fix the problem. Sometimes I'm doing something important, sometimes not, but I'm always doing something. My to-do list never seems to get any shorter.

So reliability, in my case, is not a commercial transactions lost per minute scenario. None of my machines are in a life support position where failure would endanger anyone's life. Reliability, in my case, means that my phone doesn't ring and other projects are interrupted less often. Some of my machines have been running for years with no unexpected down time. Others, uhhhh, less.

It's all about my convenience. I like high-reliability systems.

Re:Its all about MY convenience by Anonymous Coward · 2002-07-09 10:02 · Score: 0

> Some of my machines have been running for years with no unexpected down time. Others, uhhhh, less.

Care to describe which machines are which?

Page 2 by Anonymous Coward · 2002-07-09 08:53 · Score: 1, Informative

I finally got a tcp connection and page 2 finally loaded, so here it is...
The Uptime Rules

First, as an introduction to the rules, let's review our terms and terminology.

Definitions

Uptime is the amount of time the entire system is available. By entire system we are saying that an entire transaction can be completed. Just having your web servers running when the needed application server isn't running cannot be defined as uptime.

Downtime is everything else.

Scheduled maintenance downtimes or windows are the periods of time (for example, from 1:00am to 3:00am Monday morning) when an IT team has the option, if they need, to bring down various components in a fashion that causes the system to be incapable of complete functionality.

Reliability is defined as uptime but where scheduled maintenance downtime is not counted against it. For example, if in a 24 hour period there was an hour of scheduled downtime, but the system was otherwise fully operational for the remaining 23 hours, then the system was 100% reliable.
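The difference between the two definitions can be sketched in a few lines of Python. One detail is an assumption here: the article does not say whether the scheduled window is also removed from the denominator, and this sketch assumes it is. The function names are illustrative only.

```python
def uptime_pct(total_hours, scheduled_down, unscheduled_down):
    """Uptime: all downtime, scheduled or not, counts against the system."""
    up = total_hours - scheduled_down - unscheduled_down
    return 100 * up / total_hours


def reliability_pct(total_hours, scheduled_down, unscheduled_down):
    """Reliability: scheduled maintenance is not counted against the system.

    Assumption: the scheduled window is excluded from the measured period.
    """
    measured = total_hours - scheduled_down
    return 100 * (measured - unscheduled_down) / measured


if __name__ == "__main__":
    # The article's example: 24 hours, 1 hour scheduled, nothing unscheduled.
    print(f"uptime      {uptime_pct(24, 1, 0):.2f}%")
    print(f"reliability {reliability_pct(24, 1, 0):.2f}%")
```

For the 24-hour example the same day scores 100% on reliability but under 96% on uptime, which is why it matters which of the two numbers a contract quotes.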

So how do you translate the 'nines' into acceptable downtime? This chart provides the answer:
'Nines'   Uptime %   Minutes Per Year   Minutes Per Month
Two       99%        5256               438.0
Three     99.9%      526                43.8
Four      99.99%     53                 4.4
Five      99.999%    5                  0.4
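The chart can be reproduced with a short Python sketch. It assumes a 365-day year and a month of exactly one twelfth of a year, which appears to match the rounding in the chart; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def allowed_downtime(nines):
    """Return (uptime %, minutes per year, minutes per month) for N nines."""
    unavailability = 10 ** -nines          # e.g. 3 nines -> 0.001
    per_year = unavailability * MINUTES_PER_YEAR
    return 100 * (1 - unavailability), per_year, per_year / 12


if __name__ == "__main__":
    for nines in range(2, 6):
        pct, per_year, per_month = allowed_downtime(nines)
        print(f"{nines} nines  {pct:.3f}%  {per_year:7.0f}  {per_month:6.1f}")
```

Changing `range(2, 6)` extends the chart to more nines; the budget shrinks by a factor of ten with each added nine.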

Rule #1: A great system run poorly is a poor system.

This is the most crucial rule to understand when managing any system. It doesn't matter how much you spent on the hardware, how well designed your database tables are, or if you installed the latest and greatest operating system on the market. If it cannot be managed well, problems ensue.

Users don't see, or care that problems come from your database servers, or your application servers, or your static data caching. What they perceive is one of two states: working or not working. They want to make their reservation, or pay their bill, or just get the weather in Bali, and they want to do it NOW!

Managing with a given level of reliability in mind is about people, hardware, operating and escalation plans, and ultimately, it is about the money to put it all together and keep it running. The cost of reliability is very hard to quantify. Even assuming it is a linear relationship (and few things in life are) it's a staggering relationship in financial terms. In my experience each 'nine' is close to an order of magnitude increase in cost!

The bottom line is this: you need to do an honest assessment of available resources versus intended goals; it is the first step in making sure your great system runs at least as well as you intended.

Rule #2: Five nines is a goal reachable only through both fully automated system management, and rigorously controlled and tested applications.

Scared by four and five nines? Unless you've worked in a true, hardcore, spare no expense data center, you should be!

Let's think about five nines for a moment. 5 minutes a year. That rules out any form of human involvement in fixing problems. After all, even the best humans are known to be distracted for a minute or two into conversation with a co-worker, or a phone ringing.

As an example, let's time a perfectly common scenario, where you have two people monitoring systems. Time the following emulation in your office space:

1. Assume the system is working happily.
2. Walk over to your kitchen area and grab a soft drink. Then walk back.
3. Wait 15 seconds while you pretend to have the other NOC (Network Operations Center) engineer say "Hey, look at this!"
4. Sprint over to your desk and sit down.
5. Log into your desktop machine.
6. Log into a remote machine.
7. Run one or two basic remote commands ('ps' or 'top' for example)

Now stop the clock. I'm willing to bet your five minutes are up!

Even without a distraction, it's simply not possible for a system of any complexity, to have a problem confirmed, cross checked, and resolved, by a person, within five minutes. Oh, and don't forget about the minute to 90 seconds that you've already lost in monitoring the issue - unless you want alarms going off continuously, you have to set an error threshold that typically consumes 60 seconds or so.

"Okay," you say, "well, five nines is a lot. How about aiming at four nines?" But are four nines really much different than five? Certainly, it gives you more latitude and time to fix a problem, but not much more. You can afford a single downtime that takes a few minutes to debug, but that's all.

The truth is, unless you have an application that doesn't fail, the odds are that your hardware failures will still occur three to four times a year, which pushes the limit of human intervention. A good rule of thumb is that things never happen when you are watching them - figure that any issue takes at least ten minutes to resolve, even if it is as simple as a human inadvertently powering both sets of redundant systems down, and now they are powering back up.
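That rule of thumb can be checked against the downtime budgets with a small Python sketch. It takes the figures above (three to four failures a year, ten minutes each) at face value and assumes a 365-day year; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_pct(failures_per_year, minutes_per_failure):
    """Availability implied by a yearly failure count and fix time."""
    downtime = failures_per_year * minutes_per_failure
    return 100 * (1 - downtime / MINUTES_PER_YEAR)


def budget_minutes(nines):
    """Yearly downtime allowed by a given number of nines."""
    return 10 ** -nines * MINUTES_PER_YEAR


if __name__ == "__main__":
    for failures in (1, 3, 4):
        print(f"{failures} x 10 min -> {availability_pct(failures, 10):.4f}%")
    print(f"four nines budget: {budget_minutes(4):.1f} min/year")
    print(f"five nines budget: {budget_minutes(5):.1f} min/year")
```

Three or four ten-minute incidents still fit inside the four-nines budget with little to spare, while a single ten-minute incident already exceeds the five-nines budget for the whole year.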

My favorite quote about "Reliable systems" by Bug-Y2K · 2002-07-09 08:55 · Score: 1

It is in my random .sig file forever:

Murphy's revenge: The more reliable you make a system, the longer it will take you to figure out what's wrong when it breaks. -- Sean Donelan on NANOG, Mon, 26 Nov 2001 06:28:22 -0500 (EST)

I use variations of it verbally in meetings when the marketing/sales pinheads are demanding absurd uptimes for brochureware websites. It makes a great starting point for those "be careful what you wish for, because you will have to pay the bill" talks that I use like Jedi mind tricks on pinhead marketing and sales weasels.

Personal Experience by scott1853 · 2002-07-09 08:56 · Score: 2

I work for a small ISP in central NY. A couple years ago, I can't remember which provider it was anymore, but they unplugged us because their paperwork was all screwed up and they didn't think anybody was on the circuit. Then they plugged somebody else into it. It not only took us several hours to find out what the problem was, it took 3 whole days for them to resolve the problem. They wouldn't simply undo what they did, they had to assign us a new circuit and basically refused to escalate the work order. We eventually came back up but lost quite a few customers, understandably.

How about Stratus Technologies??? by Mysticalfruit · 2002-07-09 08:56 · Score: 1

They've been building machines that provide 99.999% uptime for something like 20 years.

I've got a lab full of those bastards. Everything is redundant. CPU/memory/powersupplies/ups/disk/network/backplane, you name it.

I've had a chance to open the thing up and look inside and its amazing!

My only gripes have ever been that they're a bit esoteric at times and they're generally behind the technology curve a bit, but I think they do it purposely so they know that they're putting out a tested product. Nobody wants a machine running the stock market to be on anything but thoroughly tested hardware. Sort of like how all the computer systems on the ISS are only 386 level...

--
Yes Francis, the world has gone crazy.

Re:How about Stratus Technologies??? by Anonymous Coward · 2002-07-10 01:14 · Score: 0

We used to have one of those beasts. I swear you could take a tommy gun to it and fill it full of holes and it would probably keep working.

Re:4 or 5 nines? by superkri · 2002-07-09 08:57 · Score: 1

well, that depends on what time frame you are counting with. For example, the last second my uptime was exactly 100%. ;-)

hmm by nomadic · 2002-07-09 08:59 · Score: 2

It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues.

You really want to see someone go berserk over downtime, try running a MUD...

Re:hmm by schnurble · 2002-07-09 09:20 · Score: 2

Actually, the specific 800lb gorilla I was thinking of was a client when Steve and I were working together.

eBay.

--
"To err is human, to forgive is simply not my policy." --root

Visions of the Future by jimberini · 2002-07-09 08:59 · Score: 1

As we continue to depend more and more on networks for day-to-day operations, reliability becomes a must. Is it going to be expensive? YES. That doesn't mean that it won't happen or that it doesn't need to happen. How will these networks transform from being a pr0n conduit to carrying traffic such as VoIP (as a business, not the lame implementations we have seen so far), without having some sort of (good) reliability?

This reliability WILL come about one way or another.

Linux uptime stats wraparound at 497 days by dananderson · 2002-07-09 09:01 · Score: 2

You can't say BSD systems have the best uptime based on Netcraft surveys. Linux systems have a limit of 497 days for reported uptime. After that, the uptime wraps around. There are several Linux systems with such uptimes. That being said, BSD systems are highly reliable and efficient and may have a higher uptime than Linux.

See the Netcraft FAQ at http://uptime.netcraft.com/up/accuracy.html#cycle
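The 497-day figure is what you get from a 32-bit tick counter incremented 100 times a second. A minimal Python sketch of that arithmetic, assuming those two values (the usual explanation for Linux kernels of this era; the function names are made up):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wrap_days(ticks_per_second=100, counter_bits=32):
    """Days until an unsigned tick counter of this width overflows."""
    return 2 ** counter_bits / ticks_per_second / SECONDS_PER_DAY


def reported_uptime_days(true_uptime_days):
    """Uptime as it would be reported once the counter has wrapped."""
    return true_uptime_days % wrap_days()


if __name__ == "__main__":
    print(f"counter wraps after {wrap_days():.1f} days")
    print(f"a box up 600 days reports {reported_uptime_days(600):.1f} days")
```

So a long-lived Linux box can show a smaller uptime than a younger one, which is why the survey numbers cannot be compared directly.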

The slashdotted site by Animats · 2002-07-09 09:02 · Score: 2

That site deserves to be slashdotted. They have this little paper divided into about ten little sections, which multiplies their load by 10x or so. Then, it's a .jsp page (why?), which means more server-side interpreter overhead. If they hadn't crudded up the basic job of serving a readable document, they'd have one or two orders of magnitude more capacity.

Page 3 by Anonymous Coward · 2002-07-09 09:02 · Score: 1, Informative


Rule #3: Even three nines is hard in the Internet World.

The "Internet World" is not a magazine, but rather, a truism of application state, where functionality and features are continuously enhanced. Compare this to a billing or call center, which has a minimum of features, and where great amounts of time are spent in testing before new applications are released to production.

The great thing about developing in the Internet world is that lots of new features can be brought to end users in a very short amount of time. The standard for development is weeks to a few months rather than years. Not only does this provide a level of instant gratification, but it also allows applications and services to be highly responsive to what users actually want and need, and in the end, provides a vastly more desirable system.

The tradeoff, of course, is that the applications themselves aren't nearly as reliable. Thus, the three nines goal. Why three nines? Because it's the highest possible reliability for a system which utilizes human intervention, and there's simply no way that a dynamic, "Internet World" application can be reduced to few enough parameters that it can be managed in an automated fashion. Failure modes grow exponentially with functionality, and the task of automating monitoring and management of such dynamic and flexible systems is an entropic one - that is, it quickly becomes a task bigger than the application itself.

But even three nines doesn't come cheaply. It requires a complete staff to be available at all times. There's no time to call and page people - to wait for them to get home from the supermarket where they were grabbing a quart of milk for the baby.

How much staff does one need? Well, that's a good question, and the answers are dependent upon the nature of the particular application. But, my experience in today's world shows that most systems are three-tier applications, with significant networking components. Therefore, at any given time, you need the following people on hand:

* NOC / Monitoring staff
* System administrators
* Network Engineers
* Application Engineers
* Database Administrators
* Crisis Management
* Customer Management

Now, admittedly, there can be some overlap in tasks, and the simpler the application, the easier it is to get overlap, but already, we're talking about quite a few people. Of course, these people need some backup to call in, for fresh ideas, if things aren't going well.

Don't underestimate the value of having a technical person, who understands the system, acting in the Crisis Manager role. This person is actually very critical to making sure that key issues aren't being overlooked, and to providing the detached viewpoint that is key to problem solving.

In addition, having a customer relationship person available to talk to the upset customers, at least when the service is provided to businesses rather than consumers, is vital. This isn't to help solve the problems of a given downtime event, but for the ongoing relationship with the customer.

Rule #4: 99.7% is very cost effective.

That's right, less than three nines. 99.7% gets effectiveness from the fact that it allows for two hours of downtime a month - basically, a total of one day per year.

While it sounds like a lot, it's typical for a failure pattern to consist of several small events of 10-20 minutes duration, and on rare occasions, a failure that takes three to four hours to resolve. That's the core timing that you get with 99.7% -- the ability to have a four hour failure once a year.

That means that you don't have to build nearly the hardware redundancy - instead of having 1:1 "hot" standby units, you can have a 1:N relationship with a cold standby unit that can be configured and put into place in the span of a couple of hours. The larger N is, the greater the cost savings. If they are network components we're referring to, the less complex the routing environment, the fewer people with network-specific skills are needed. Get it simple enough, and you get more overlap of skills, meaning more bang for your salary buck.

Complex systems also require complex understandings. The number of dependencies within systems again grows exponentially, and leaves far more room for human error.

Remember rule #1, a great system run poorly is a poor system.

Uptimes by JWSmythe · 2002-07-09 09:04 · Score: 1

Here's a few machines. It's this low because of hardware upgrades last September.. We took one or two down at a time, which left 10 or so serving the site, therefore creating no downtime. hehe
Most of these are web servers that frequently do between 20Mb/s and 80Mb/s, depending on their task. voy03 handles voting, which gives it a slightly higher load.. It only counts a few million votes daily (read: a few million CGI hits)..

voy01 # uptime
4:55pm up 292 days, 24 min, 1 user, load average: 0.71, 0.44, 0.36

voy02 # uptime
4:56pm up 307 days, 3:04, 1 user, load average: 0.15, 0.17, 0.17

voy03 # uptime
4:56pm up 306 days, 8:25, 1 user, load average: 13.70, 12.11, 10.17

voy04 # uptime
4:56pm up 306 days, 19:40, 1 user, load average: 0.45, 0.38, 0.32

voy05 # uptime
4:56pm up 307 days, 3:16, 1 user, load average: 0.25, 0.35, 0.39

voy60 # uptime
4:57pm up 262 days, 23:57, 1 user, load average: 0.33, 0.37, 0.35

--
Serious? Seriousness is well above my pay grade.

Re:Uptimes by VB · 2002-07-09 10:14 · Score: 1

voy03 # uptime
4:56pm up 306 days, 8:25, 1 user, load average: 13.70, 12.11, 10.17

That load average is a killer!

My primary just went down after 417... Bad memory stick / or mobo... not sure yet, but this little 486 is still pluggin' along:
vanboers@yuma:~$ uptime
3:10pm up 448 days, 18:55, 1 user, load average: 2.14, 1.45, 1.26

(and, for the curious):
vanboers@yuma:~$ top -b -n 1 | head -n 10 | tail -n 3

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
11967 vanboers 16 6 15076 10M 1364 R N 62.8 61.8 2832h setiathome

(what else you gonna do with a 486?)

--
www.dedserius.com
VB != VisualBasic
Re:Uptimes by Anonymous Coward · 2002-07-09 11:05 · Score: 0

FreeBSD 5-CURRENT box on our network:
8:46AM up 458 days, 21:22, 0 users, load averages: 1.08, 1.06, 1.02

ho hum..
e-easy.com.au
Re:Uptimes by Chacham · 2002-07-09 11:58 · Score: 1

Ugh, I am unworthy.

Please call me after your next "hardware upgrade" so I can gloat. :-)

--
Have you read my journal today?

800lb gorilla of eBay by AtariDatacenter · 2002-07-09 09:06 · Score: 2

"It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."

Please. Let's not talk so badly about eBay. Do you know how many people have been crushed under their CIO's foot? ;)

Re:4 or 5 nines? by markmoss · 2002-07-09 09:06 · Score: 2

For instance, 4 nines says your system is up 99.99% of the time. That is, out of 365 days x 24 hours x 60 minutes = 525,600 minutes a year, it can be down for only .01%, or 52 minutes a year. Five nines (99.999%) allows only 5 minutes a year downtime. This may actually be averaged over many servers and several years (that is, if you had 10 servers running for 3 years and just 1 died requiring one day to replace, you could figure your downtime as 100 * 1 day/(10*3*365) = .009%, so you've still got 4 nines).

There are questions about what gets counted when figuring reliability. For one thing, almost no one would count a slashdotting or a DOS attack against their uptime, but nevertheless from the user's viewpoint the server is down. Also, how do you count "scheduled downtime" such as rebooting NT servers after installing security patches, or unplugging the boxes to move them around when it's time to expand the system? A news server with a worldwide audience has no "penalty free" time slots. So either you settle for a lower uptime goal, or you need redundant servers configured so that even major upgrades can be put in by unplugging just one at a time while the others keep running. OTOH, for the company database server, downtime during working hours is far more serious than downtime for the web server, so if it's a big company you do need redundant servers with automatic switchover. But in most cases there are times late at night or on weekends when no one cares if you shut them _all_ down at once - which certainly makes the upgrades easier.

So anyway, one person's "5 nines" may look like a lot less to someone else. E.g. a server vendor may claim that because only one in a million of their servers is broken at any given time their reliability is 6 nines. Your single server may never break at all - but once a week you take it off-line for ten minutes to load the newest security patches, so to anyone who wanted to keep working for those ten minutes you are only at 3 nines.
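The arithmetic in this comment can be checked with a short sketch. The function names below are made up for illustration; the inputs (525,600 minutes a year, 10 servers over 3 years with one day lost, ten minutes of patching per week) come from the comment itself.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, as in the comment above


def allowed_downtime_minutes(nines):
    """Yearly downtime budget in minutes for a given number of nines."""
    return MINUTES_PER_YEAR / 10 ** nines


def pooled_downtime_percent(servers, years, days_lost):
    """Downtime percentage when averaged over many servers and years."""
    return 100 * days_lost / (servers * years * 365)


def weekly_outage_percent(minutes_per_week):
    """Downtime percentage for a fixed outage repeated every week."""
    return 100 * minutes_per_week / (7 * 24 * 60)


print(allowed_downtime_minutes(4))        # budget for four nines
print(allowed_downtime_minutes(5))        # budget for five nines
print(pooled_downtime_percent(10, 3, 1))  # the 10-server, 3-year example
print(weekly_outage_percent(10))          # ten minutes of patching a week
```

Ten minutes a week works out to just under 0.1% downtime, which is why the weekly patch window lands at roughly three nines no matter how reliable the hardware is.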

Re:4 or 5 nines? by mevets · 2002-07-09 09:08 · Score: 1

It isn't reliability; it is availability.

The availability of a system is the fraction of its intended duty cycle which it is functional for. It is frequently expressed as a percentage, as in 99.999%.

Reliability is the rate of failure, and thus is expressed in units of time or usage. Mean Time To Failure and Mean Time Between Failures are expressions of reliability.
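The two measures are commonly tied together by the steady-state formula availability = MTBF / (MTBF + MTTR). A minimal sketch follows; mean time to repair (MTTR) is not mentioned in the comment and is introduced here as an assumption, as are the function name and the example figures.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Fraction of time a system is up, given its mean time between
    failures and mean time to repair (both in hours)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical box: fails about once a year, takes a day to replace.
availability = steady_state_availability(365 * 24, 24)
print(f"{100 * availability:.3f}%")
```

The same reliability (one failure a year) gives very different availability depending on how fast the repair is, which is why the two terms should not be used interchangeably.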

Rephrase the question by Todd+Knarr · 2002-07-09 09:09 · Score: 3, Informative

Remember that downtime is related not only to reliability of each piece of equipment but the number of pieces of equipment. 99.99% uptime sounds good, less than an hour of downtime a year, right? Scale that to a 500-server farm and it's an hour and ten minutes or so of downtime a day, every single day of the year including weekends and holidays (OK, we'll give you one day off in leap years). This concept has boggled a few salescritters who don't grasp the concept of scale.
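A rough check of the scaling claim, assuming every server independently meets 99.99% and that outages never overlap (so machine-downtime simply adds up). The function name is hypothetical.

```python
def fleet_downtime_minutes_per_day(servers, availability):
    """Total machine-downtime per day across a fleet, in minutes.

    Assumes each server is down (1 - availability) of the time and
    that outages on different servers do not overlap.
    """
    minutes_per_day = 24 * 60
    return servers * minutes_per_day * (1 - availability)


# 500 servers at four nines each
print(fleet_downtime_minutes_per_day(500, 0.9999))
```

For 500 servers this comes to about 72 minutes a day, in line with the "hour and ten minutes or so" quoted above.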

Re:Rephrase the question by Anonymous Coward · 2002-07-09 09:28 · Score: 1, Informative

If your 500 server farm has redundancy and knows how to do load balancing, you still get 99.99% (give or take) of the capacity for doing its job.

Looks like you only need to look after the farm every few days to still have good capacity. Not too bad.

The downtime includes the time the machine is taken down for upgrade or repair, plus the time to restart.

Re:Rephrase the question by Todd+Knarr · 2002-07-09 17:40 · Score: 1

That depends. Your whole service doesn't go down, but every single day for an hour or so someone's in the server room fixing something. If you let it go, Murphy will take over and the backup for whatever was down will go down too, taking down the whole service.

And even if it doesn't, there's still a cost in maintaining the systems that break. At my last job NCR figured their machines were good for a year before any hardware broke, even if the boxes were completely ignored and not maintained. They couldn't understand why we were shipping 4-5 boxes a week back with hardware failures and had one full-time person who did nothing but spin up and ship replacement boxes. Well, with 250+ boxes that's about 1.5 days per hardware failure, what did they expect?
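The NCR numbers can be reproduced with a back-of-the-envelope sketch, assuming each box fails on average once per year and failures are spread evenly across the fleet. The function names are made up for illustration.

```python
def days_between_failures(boxes, mtbf_days):
    """Average gap between failures somewhere in the fleet."""
    return mtbf_days / boxes


def failures_per_week(boxes, mtbf_days):
    """Average number of boxes failing per week across the fleet."""
    return boxes * 7 / mtbf_days


print(days_between_failures(250, 365))  # roughly 1.5 days
print(failures_per_week(250, 365))      # in the 4-5 per week range
```

A one-year lifetime per box sounds generous until it is divided by 250.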

Most customers won't complain.... by hoya · 2002-07-09 09:11 · Score: 1

It's been my experience working on the ISP side of the world that just about every ISP / colo provider claims at least 99.9% uptime with every SLA (service level agreement).

It's pretty safe to say that almost none actually back that up with their performance. However, from my experience, very few customers will try to get the company to honor its SLA because you need to provide pretty good documentation. We had a few situations where the company was down for a few hours, dropping its uptime below its guarantee, and still wouldn't credit us because they claimed the downtime to be much shorter. No matter how many traceroutes from major network nodes we showed, they kept arguing against it until we gave up.

Also, unless there is a catastrophic downtime that pulls service out for more than a few minutes at a time, and uptime falls below 95% for the month or lower, most users don't want to be bothered fighting for a credit. If you've ever dealt with any big telco / network provider you know what I'm talking about.

So, the bottom line is that in a lot of service industries (internet especially) it is very easy to claim 5 9's reliability, come close, and not really pay a huge price for failing. In fact, most people won't even notice. Now, for the 911 network, air traffic control, etc., it's a different story.

How Much spam do you get? by TheDick · 2002-07-09 09:12 · Score: 0, Offtopic

When someone posts a non-spammed proof email address in a mailto link on the frontpage of /. ??? I guess I know who to ask now....

I get quite a bit just from the comments I post.


Hillarious by FreeLinux · 2002-07-09 09:14 · Score: 2

I can't read the paper but, for his sake, I hope that he really meant that reliability isn't that important to him.

His server is toast!

New Uptime Server by Aknaton · 2002-07-09 09:16 · Score: 2, Informative

Those who remember the awesome but now defunct uptimes.net will be pleased to know that a new server is now up and running. It uses the old uptimes protocols and clients.

The URL is http://uptimes.wonko.com/

A GNU/Linux box was number one the last time I looked, with a NetBSD box coming in second.

Re:New Uptime Server by mbrix · 2002-07-09 21:03 · Score: 1

TuxTime has been up for a long, long time, it even has more participating computers than the "New Uptimes Project".

They only have 1 "nine" now ... by gruntvald · 2002-07-09 09:17 · Score: 2

So much for "two nines". Nothing, I repeat nothing, can withstand the /. hordes ...

Downtime != impacted users by linuxwrangler · 2002-07-09 09:20 · Score: 1

The time of day (week/month) a system is down is as important as how much of the time.

Our hosting company blew out a (supposedly) fully redundant Cisco and took us down for 1.5 hours in the middle of what was at the time our peak day in history. This impacted hundreds of thousands of visitors.

The same downtime at 1:00 am would have had about 1% of the impact on users even though the availability statistics would show both to be identical.

Naturally systems tend to fail more often under high load when the impact on your user base will be the greatest.

Impact on your user base is a better measure of the impact of downtime.

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis

No you don't by gruntvald · 2002-07-09 09:23 · Score: 2

On W2K - service packs, about 10% of hot fixes, and anything to do with IIS require a reboot. Take your head out of the sand.

But most websites have redundant servers by Ars-Fartsica · 2002-07-09 09:27 · Score: 2

Now if an entire site went down, that would be one thing, but you will often see large scale sites (Google, Yahoo) operating with 5% of their servers down or out of rotation for some reason.

To get closer to your analogy, I would treat a server like a jet engine - the plane is designed to fly even if one fails.

RE sig..... by isotope23 · 2002-07-09 09:29 · Score: 1

"The world is run by idiots because they're more efficient than hamsters. "

That's just what the HAMSTERS want you to think!

--
Service guarantees Citizenship! Questions Guarantee GITMO.... Amerika Uber Alles!

SQL Transform by knco · 2002-07-09 09:37 · Score: 1

So, my wife wants to know why I think advertising on the Internet would be a good idea. But first, I should probably explain the family dynamic: she's an Ivy League grad and attorney while I have a UC degree and CPA certificate. Usually, she's the one who is right, so I figure I should probably listen (as if I have a choice).

I figure it makes pretty good economic sense, since many different sites with low CPM rates still get over a million page views per day. Problem is, she replies, there's probably only around 150-200k unique visitors at any of these respective locations, each of whom is triggering around 5-7 page views per person per day.

And besides, she continued, using the Jungian Archetype model to illustrate her point, the target audience is devoted to reason, not emotion. This, I concede, defeats one of my central tenets: applying a test to determine whether a person is Apollonian or Dionysian, left-brain or right-brain, etc. in order to assess the likelihood of downloading Centiare, my cool little cash management/forecasting program for individuals and small businesses.

Centiare quickly and automatically calendarizes projected deposits, payments and running cash balances over any time period selected - the output looks like a spreadsheet. But since transactions are stored in a database, the way it works is through a series of SQL pivot/transformation functions. The results are stored within multiple counter arrays to keep track of time periods, monthly totals, and grand totals. Once the recordset is complete, voilà, the whole thing is formatted and printed - the flash report looks really good.

And besides, it's free to try, and only $20 to buy!

Centiare

uptime for windows? by Anonymous Coward · 2002-07-09 09:37 · Score: 0

Although the client installation instructions would probably be: run setup.exe and reboot.

Re:uptime for windows? by Aknaton · 2002-07-09 10:04 · Score: 1

There are clients for Windows, Unix (BSD, MacOS X, and GNU/Linux) and BeOS.

Cost Of An Outage by stan_freedom · 2002-07-09 09:39 · Score: 1

When designing new systems, I ask the customer to break down the costs of an outage based on scope and time. For example, if the entire system goes down for a minute, what is the cost? An hour? A day? What if only part of the system goes down? I ask the customer to consider all impacts of the outage, beyond simple lack of access. I ask them to put all of this information into a spreadsheet so they can easily play with the numbers and do what-ifs. Most of the time, the customer doesn't have a clue what an outage costs them until they perform this exercise.

Once the customer truly understands the actual costs of an outage, they are generally much more realistic when it comes to designing a system. I also encourage the customer to consider the odds of an outage happening. Yes, total and extended system outages occur. I have more than enough first-hand experiences. But what is the cost and what are the odds over time? Is it worth paying an extra $100k to avoid an outage that may only happen once a year and result in a $10k worst-case scenario?
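The what-if described here boils down to comparing the cost of prevention with the expected loss. A minimal sketch, using the $100k and $10k figures from the comment; the function names and the once-a-year outage rate as a parameter are illustrative assumptions.

```python
def expected_annual_loss(outages_per_year, cost_per_outage):
    """Expected yearly cost of outages if nothing is done."""
    return outages_per_year * cost_per_outage


def years_to_break_even(mitigation_cost, outages_per_year, cost_per_outage):
    """Years before money spent avoiding outages is recovered."""
    loss = expected_annual_loss(outages_per_year, cost_per_outage)
    if loss == 0:
        return float("inf")  # no outages means the spend never pays back
    return mitigation_cost / loss


# $100k to avoid a once-a-year outage with a $10k worst case
print(years_to_break_even(100_000, 1, 10_000))
```

With those numbers the extra spend takes ten years to pay for itself, which is the kind of result that makes customers more realistic about their uptime targets.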

As for several of the comments I observed about planes requiring five-nines uptime, I don't think that is realistic. Planes frequently have system failures resulting in partial outages. That's why they have two engines, multiple wheels, back-up control systems, etc. Also, most of us have experienced flight delays due to mechanical repairs. That's an outage as far as I'm concerned. When a plane can remain in service constantly for all but a few seconds a year, then it will have achieved five-nines. I don't know of any planes that perform to that level.

In Germany, 5 nines is bad. by Carmody · 2002-07-09 09:40 · Score: 5, Funny

"Are ve up?"
"Nien."
"Are ve up yet?"
"Nien."
"How about NOW?"
"Nien."
"Vill ve be comink up soon?"
"Nien."
"Vill ve be up next veek?"
"Nien."

--
God is real unless declared integer

Re:In Germany, 5 nines is bad. by Anonymous Coward · 2002-07-09 12:54 · Score: 0

Nein ...get it right. /ScumBag

Environments hosted by multiple companies. by t0ph3rus · 2002-07-09 09:44 · Score: 1

A lot of sites are actually hosted by more than one company. Some big consulting companies sell solutions that are hosted in their own hosting environment but maintained by another company. I am currently on a project in which our company was contracted not only to deliver the software but also to maintain the servers. The big consulting company, which has a multimillion-dollar monitoring system, always let us down. We ended up writing several scripts to monitor our processes and servers (we could not install any other software, for security reasons). Obviously, we would rather have the monitoring software that we use on our internally hosted sites. However, our scripts do a superb job and haven't let us down. I don't think it is hard to achieve 5 9's if your site is set up properly. All of the downtime on the different environments that we host was actually due to the large consulting company doing things like shutting down firewall ports, a security team kicking the wire to a database, or turning off the power to the hosting facility. (Yes, they have UPSes. However, they are configured wrong, so when there is a scheduled power-down, the UPSes do not work.) Unfortunately, I can't say that this is unique in the industry. Another major hardware/consulting/software company that we are partnered with is just as bad.

How many 9s? by Matheus · 2002-07-09 09:46 · Score: 1

So how many 9s do we suppose Google has? (given the recent Interview)

99.99999999999999999???

Re:Oi! You act like a manager! by isa-kuruption · 2002-07-09 09:53 · Score: 3, Interesting

Having the 5 9's of reliability is NOT foolish. It is a reality of life. My particular organization services 40 million web customers, so we can not afford to be down at any time of the day because of the type of service we provide. In fact, last year we made our goal of having the 5-9's, and we did it without needing our disaster recovery (DR) site.

Having a DR plan and being reliable go hand in hand for the most part, however under normal day-to-day business conditions, servers need to be upgraded and things unplugged. You don't switch your entire infrastructure over to a DR site to upgrade your apache web server!! It is for this reason you have redundancy on the network and server level leading out to the Internet (or wherever your customer base resides).

Disasters, on the other hand, do not happen everyday. They happen once a year, maybe.... sometimes once every 2 years. If you live in an area more prone to disasters (like southern California), you may need an alternate site located on the east coast.... but, that is the cost of doing business.

Also, having 5-9's on uptime does NOT mean being accessible to everyone in the world at any time no matter what. Having 5-9's of uptime means that your organization has successfully kept its applications and services available to the Internet. How is it my company's fault if you don't plug your modem into the wall? It's not, so to say that our "reliability" decreases because of an end user being a moron is a stupid statement.

Netcraft survey by A_Non_Moose · 2002-07-09 10:21 · Score: 2

Stats on the server are interesting in that either it stopped being "up" or stopped being monitored before June.

Or did I read the graph wrong?

.

--
Have you read the moderator guidelines? Well, have you, PUNK? (and I want a Karma: Gnarly option)

I Would have responded sooner but... by pukeAndCry · 2002-07-09 10:29 · Score: 1

our firewall was down and we lost our Internet connection.

rats...

Why does everbody think... by MoreDruid · 2002-07-09 10:35 · Score: 1

that if the site is not available, the server is down as well? If they built their network with a little sense, the webserver is dedicated... so even if it went down, the business processes are not affected. Duh...

--
The best weapon of a dictatorship is secrecy, but the best weapon of a democracy should be the weapon of openness.

Good article on assessing the uptime of a building by aaarrrgggh · 2002-07-09 10:45 · Score: 1

Check out this article for how you can evaluate the reliability of a building. Simple little calculator for looking at all the different systems involved.

Makes you think about how all the parts relate...

Re:Coda? by SkyLeach · 2002-07-09 11:07 · Score: 2

See High Availability for more information.

Coda is the best present option for fs dependent data storage on mostly open-source platforms. We are using Coda for our MySQL table files, ZODB files and logs.

Coda may still be beta software, but if Open Source software like Coda is considered beta code then Windows 2000 + sp2 must have been alpha code. :-) (And yes, I have Win2000 on my machine and even occasionally am forced to boot into it so I speak from experience).

--
My $0.02 will always be worth more than your €0.02, so :-p

Re:old Commodore by Anonymous Coward · 2002-07-09 11:08 · Score: 0

Last time I saw one of those was in the main HVAC room in the State Office Building in Shreveport, LA. It ran all the big compressors and other HVAC systems in the building. This little computer was on all the time...(ca. 1980's)

Uptime Realities by inode_buddha · 2002-07-09 11:13 · Score: 1

er, your former boss's mail @ interceptor.com has a hell of a typo in the first paragraph. The (presumably) actual site at codesta.com is still not available at 19:12 EST. Anyplace else I should look? The Google mirror was funny, but once was enough. Thx

--
C|N>K

4500s, datalines, etc... by juuri · 2002-07-09 11:20 · Score: 2

... and what is cost of keeping this up?

The maintenance on the 4500s (if they have multi procs and lots of ram) is prolly 20-30k each annually just by itself.

What about the renew costs on the weblogic support?

How much was that oracle?

Even a basic system as quoted above is 400+k.

--
--- I do not moderate.

Re:4500s, datalines, etc... by SkyLeach · 2002-07-09 12:10 · Score: 2

Consider it correctly. We are part of a large(ish) corporation. Corporate IS has a budget for the ERP systems, and they purchased the Sun machines and most of the Oracle licenses. We only had to purchase a couple licenses ourselves. If we weren't part of the corporation I probably would have gone with a MySQL solution, which would have been fine with me.

We still use MySQL for some applications in the ebiz group (like logging and mailing lists). Our budget is way below 400k.

If it weren't for the corporate IS and their heavy ERP system needs we'd drop all that and just use one of the many possible multi-server MySQL combinations.

--
My $0.02 will always be worth more than your €0.02, so :-p

Re:Oi! You act like a manager! by Skuld-Chan · 2002-07-09 11:31 · Score: 2

This reminds me of when I was working at a .com called rulespace - there was a construction outfit building a parking lot downstairs - one day they decided to move the big uswest/qwest plywood board from one pillar to another. Alarms never went off because they couldn't call the pagers because they had effectively disconnected all the T1's (including the 2 backups), all the dsl circuits/analogue lines and the T1 going to the telephone switch for the entire building. All the redundancy in the world wouldn't save that mess. As I recall they forked out more money for colocation space at Inflow and moved the more critical systems out there.

Full Text - Page 1 by Kallahar · 2002-07-09 11:32 · Score: 5, Informative

The Scenario

Pagers going off. Phones ringing. People shouting fragments of conversations over the tops of cubicles. Groups of people huddled around monitors. Others dashing up and down the hallways, sticking their heads into office doors for just a moment, then scampering along to the next doorway. You are frantically talking on your cell phone, silencing your pager, and yelling into the speakerphone on your desk while typing on two different keyboards attached to three different monitors.

Sound familiar? It's a classic case of the dreaded 'downtime' disease, a terrible ailment where none of your systems work and for reasons you can't always understand. Of course, it typically strikes at the most inopportune moments - the launch of a major product upgrade, or right after announcing your partnerships with 5 of the Fortune 100.

Nobody wants downtime. It's a terrible thing that always involves blood, sweat, tears, and inevitably, a loss of money. This is why when you talk to the upper management of any company with a strategic online initiative you'll be told that the IT group has the highest goals, and that downtime is considered to be an anathema to be stamped out vigorously.

Unfortunately, when you talk to the company's IT manager you commonly hear a different story; the resources to back-up the company's lofty online goals are hard to come by. In fact, with the down swing of the last couple years, combined with the fact that IT isn't, at least directly, a revenue generating entity, IT budgets are being reduced while uptime performance levels are expected to be the same. This can just lead to a death march of extremely over-worked IT personnel, and longer, more numerous, occurrences of system downtime. These goals need to be re-evaluated.

Genesis of the 'Five Nines'

We've all heard the mantra of 'five nines', or 99.999% reliability. Somewhere in the depths of the Internet's 'big bang', when systems were slow and cranky, reliability became a major selling point of why one company's system was 'better' than the competition.

First, people talked about being 'two nines' or 99% reliable. Then someone else would top that, and make their product seem better, claiming 'three nines' (99.9%). Not long after that came 'four nines' (99.99%) and then, near the peak of the dot com era, came 'five nines'.

The herd mentality left no room in which to pitch for investment without the 'five nines' claim. "After all," it was thought, "if everyone else is saying they can provide 'five nines', I'd have to pretend I didn't know what I was doing if I didn't say I could match everyone else's claim."

'Five nines' isn't impossible. It's merely impractical and unnecessary in the world of the Internet. A shocking statement, perhaps, but a truism nonetheless.

We're not talking about launching people into space (which, by the way, is unfortunately done under 'three nines'), or working with nuclear power plants. We're working within the reference of online systems providing services to users both on and off the Internet - nobody dies from a system failure.

The Greasy Steel Bar

Think of uptime as a chin-up bar coated in grease. The higher the reliability desired, the greater the coating of grease. It's clearly tougher to hang on to a higher standard of reliability.

What's not so obvious, but very important, is that the higher the uptime target, the worse one does if not prepared. An IT department capable of three nines faced with a bar that's five nines slippery won't even manage the three nines they are capable of doing.

99999 and Microsoft by aoeu · 2002-07-09 11:52 · Score: 1

Let me see. 99.999% uptime means 5 minutes downtime per year. Computers boot in about two minutes. Can you imagine a Microsoft product that only needs three security patches in a year. HAH!

--
All your database are belong to U.S.

Re:99999 and Microsoft by jpmorgan · 2002-07-10 06:03 · Score: 1

Actually, you can buy Microsoft servers (based on Windows 2000 Datacenter) that are guaranteed to be 99.999% reliable. HPaq, Dell, Motorola, Unisys and Stratus sell 'em.
Take a look

Tasteless: Uptime in a terrorist-infested world by PsiPsiStar · 2002-07-09 12:33 · Score: 0, Flamebait

Even buildings don't have 100% 'uptime' anymore. Considering how it crashed, the WTC must have been running windows. It had them all over the place.

--

___
It's the end of my comment as I know it and I feel fine.

Simplicity, not over-engineering, gets you 5 9's. by shoppa · 2002-07-09 12:43 · Score: 2

Too many systems try to be reliable by adding complexity. Putting on a backup system, or two backup systems, or adding a lot of layers of machines, or a lot of layers of software, often gets you in trouble. Especially when you don't know what you're doing.

Example: the system I support is mission-critical. If it crashes, it makes the front page of the newspaper the next day. Hundreds of thousands of folks may get delayed by ten or fifteen minutes or a half-hour. The system is finally up to 99.995% availability. And how? By turning off all the backup systems and disentangling all the horrendous software kludges that were put in for the backup system. While the organization was trying to support hot-backup availability, it was crashing every other day. Outside consultants blamed this on flaky hardware and said the system had a life of, at most, a year and a half. Here we are now, four years later, and reliability is better than ever :-). I like to think that some of the work, and (even better) some of my attitudes have helped get us where we are.

Nien? by alienmole · 2002-07-09 12:45 · Score: 3, Informative

Nein!

We have hit 6 sigma by Archfeld · 2002-07-09 12:55 · Score: 2

for 4 quarters running now. IT IS horribly expensive, both on the hardware and support side but given federal requirements and customer demand we have no choice.....

--
errr....umm...*whooosh* *whoosh* Is this thing on ?

Re:We have hit 6 sigma by mOdQuArK! · 2002-07-10 05:12 · Score: 1

Gee, if you want to put it that way - I've been running 200 sigma for the last ten minutes on my laptop. Of course, once I shut it off, then my stats won't be so good.

Re:We have hit 6 sigma by Archfeld · 2002-07-10 12:14 · Score: 2

LOL, you have a future in corporate management I think :)

--
errr....umm...*whooosh* *whoosh* Is this thing on ?

Former(?) boss by cant_get_a_good_nick · 2002-07-09 13:22 · Score: 2

He puts a seemingly valid mailto: link on a heavily trafficked website. If it wasn't his "former boss" before, it damn well will be now.

Re:Former(?) boss by OzJimbob · 2002-07-09 16:10 · Score: 1

I was thinking the exact same thing...well more along the lines that it was revenge directed at the former boss. As if you post SOMEONE ELSE'S email address to the front page of /. for 'Bob's' sake!

--
-"I still believe in revolution; I just don't capitalize it anymore." - srini!

Don't forget Tandem: 100% uptime (no nines). by PongStroid · 2002-07-09 13:31 · Score: 1

Subject is more of a marketing line than anything, but Tandem systems come much closer to 100% availability than anything else that I'm familiar with.

Check this for more info, and this.

Never mind that they were bought by Compaq, and now HP - the architecture still stands. It is one of the great - relatively unknown Silicon Valley companies.

There's some really interesting stuff architecture-wise. Linux-heads would do well to check it out.

Here ya go. by TheSHAD0W · 2002-07-09 13:39 · Score: 2

USDCO has been featured in other /. articles; not only is their colocation facility located underground, with a high degree of redundancy in their connections, but it's not very expensive, either...

'Course, an on-site solution won't be anywhere near as cheap, but if you can colo, this is the place.

Anything's Possible... by Anonymous Coward · 2002-07-09 14:16 · Score: 0

This is not made up...
The uptime counter wrapped at 480ish days, but the ethernet counter is correct.. This box is going on 2 years at 100% uptime.

-- interface e0 (704 days, 9 hours, 15 minutes, 21 seconds) --

8:15pm up 206 days, 14:21 2726120747 NFS ops, 0 CIFS ops, 0 HTTP ops

Go NetAPP!

"Long Uptimes" are simply a matter of design by Bob_Robertson · 2002-07-09 14:28 · Score: 2

I have a Linux server that has been running without reboot for 679 days. Yes, I do update content, and I admit that there has been network maintenance that has made it unreachable twice during that time. However, I spend almost no money on the network, so that's what I get.

One company I worked for once upon a time, ConXioN Corp, has a very real statement on their opening page from a major customer:

"ConXioN has not been down in 5 years." And that was in 2001, they still haven't had a hit.

This is simply a matter of consideration and design. No $19.95/month mom&pop ISP is going to put the effort needed into ensuring such uptimes, things like that take redundancy and forward thinking, and that costs money.

While I was at NASA, the network and servers there also had better than 5-9's availability, because the people who ran those servers and that network took the time to care. For us it wasn't a matter of profit, it was a matter of pride.

So while I agree with those who poo-poo that "nothing is so important" that it needs to be up 100% of the time, and I also agree with the reality that there will be downtime of any system at some point, really impressive uptimes are not just possible, they can and do happen anywhere that uptime is a priority.

Long Uptimes are simply a matter of design.

Bob-

--
The Ludwig von Mises Institute. The reasoning individuals economics

#9 Myanmar style by wytcld · 2002-07-09 15:03 · Score: 2

Beware the 9s.

In Myanmar "General Ne Win helped speed his own downfall ... by suddenly declaring much of the Burmese currency worthless and replacing it with bank notes in denominations divisible by his lucky number, nine. Riots followed."
___

--
"with their freedom lost all virtue lose" - Milton

Tell that to the Swiss Re:99.999% perfection by Anonymous Coward · 2002-07-09 15:06 · Score: 0

> ..one mistake doesn't crash an airplane.

You apparently trained the Swiss air traffic controllers in Zurich.

Re:Tell that to the Swiss Re:99.999% perfection by 21mhz · 2002-07-10 03:18 · Score: 1

There apparently was a series of grave mistakes which unsurprisingly have added up, i.e. an operator working with certain systems switched off, the single emergency phone line that got blocked. The ill-fated Russian pilots probably believed that negligence like this is possible at home, but not in the land of famous banks and watches...

--
My exception safety is -fno-exceptions.

Full Text - Page 2 by jazzbotley · 2002-07-09 15:26 · Score: 1

The Uptime Rules

First, as an introduction to the rules, let's review our terms and terminology.

Definitions

Uptime is the amount of time the entire system is available. By entire system we are saying that an entire transaction can be completed. Just having your web servers running when the needed application server isn't running cannot be defined as uptime.

Downtime is everything else.

Scheduled maintenance downtimes or windows are the periods of time (for example, from 1:00am to 3:00am Monday morning) when an IT team has the option, if they need, to bring down various components in a fashion that causes the system to be incapable of complete functionality.

Reliability is defined as uptime but where scheduled maintenance downtime is not counted against it. For example, if in a 24 hour period there was an hour of scheduled downtime, but the system was otherwise fully operational for the remaining 23 hours, then the system was 100% reliable.
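The difference between the two definitions can be made concrete with a small sketch. It assumes one reading of the definition above, namely that scheduled maintenance hours are dropped from the period being measured; the function names are hypothetical.

```python
def uptime_percent(total_hours, scheduled_down, unscheduled_down):
    """Uptime: every hour of downtime counts, scheduled or not."""
    up = total_hours - scheduled_down - unscheduled_down
    return 100 * up / total_hours


def reliability_percent(total_hours, scheduled_down, unscheduled_down):
    """Reliability: scheduled maintenance is not counted against it.

    Assumption: the maintenance window is excluded from the measured period.
    """
    measured = total_hours - scheduled_down
    return 100 * (measured - unscheduled_down) / measured


# The 24-hour example: one hour of scheduled downtime, nothing else
print(uptime_percent(24, 1, 0))       # below 100
print(reliability_percent(24, 1, 0))  # 100.0
```

The same day scores 100% on reliability but under 96% on uptime, so it matters which of the two a 'nines' claim refers to.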

So how do you translate the 'nines' into acceptable downtime? This chart provides the answer:

'Nines'   Uptime %   Minutes Per Year   Minutes Per Month
Two       99%        5256               438.0
Three     99.9%      526                43.8
Four      99.99%     53                 4.4
Five      99.999%    5                  0.4
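The figures in the chart follow directly from the number of minutes in a year. A short sketch that regenerates them, assuming a 365-day year and a month taken as one twelfth of a year; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_row(nines):
    """Return (minutes per year, minutes per month) of allowed downtime,
    rounded the same way as the chart."""
    per_year = MINUTES_PER_YEAR / 10 ** nines
    return round(per_year), round(per_year / 12, 1)


for name, nines in [("Two", 2), ("Three", 3), ("Four", 4), ("Five", 5)]:
    per_year, per_month = downtime_row(nines)
    print(f"{name:<6} {per_year:>5} {per_month:>6}")
```

Each extra nine divides the budget by ten, which is why the jump from four to five nines leaves less than half a minute a month.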

Rule #1: A great system run poorly is a poor system.

This is the most crucial rule to understand when managing any system. It doesn't matter how much you spent on the hardware, how well designed your database tables are, or if you installed the latest and greatest operating system on the market. If it cannot be managed well, problems ensue.

Users don't see, or care that problems come from your database servers, or your application servers, or your static data caching. What they perceive is one of two states: working or not working. They want to make their reservation, or pay their bill, or just get the weather in Bali, and they want to do it NOW!

Managing with a given level of reliability in mind is about people, hardware, operating and escalation plans, and ultimately, it is about the money to put it all together and keep it running. The cost of reliability is very hard to quantify. Even assuming it is a linear relationship (and few things in life are) it's a staggering relationship in financial terms. In my experience each 'nine' is close to an order of magnitude increase in cost!

The bottom line is this: you need to do an honest assessment of available resources versus intended goals; it is the first step in making sure your great system runs at least as well as you intended.

Rule #2: Five nines is a goal reachable only through both fully automated system management, and rigorously controlled and tested applications.

Scared by four and five nines? Unless you've worked in a true, hardcore, spare no expense data center, you should be!

Let's think about five nines for a moment. 5 minutes a year. That rules out any form of human involvement in fixing problems. After all, even the best humans are known to be distracted for a minute or two into conversation with a co-worker, or a phone ringing.

As an example, let's time a perfectly common scenario, where you have two people monitoring systems. Time the following emulation in your office space:

Assume the system is working happily.
Walk over to your kitchen area and grab a soft drink. Then walk back.
Wait 15 seconds while you pretend to have the other NOC (Network Operations Center) engineer say "Hey, look at this!"
Sprint over to your desk and sit down.
Log into your desktop machine.
Log into a remote machine.
Run one or two basic remote commands ('ps' or 'top' for example)

Now stop the clock. I'm willing to bet your five minutes are up!

Even without a distraction, it's simply not possible, for a system of any complexity, to have a problem confirmed, cross-checked, and resolved by a person within five minutes. Oh, and don't forget about the minute to 90 seconds that you've already lost in monitoring the issue - unless you want alarms going off continuously, you have to set an error threshold that typically consumes 60 seconds or so.
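
The error threshold mentioned above can be sketched as "only alarm after several consecutive failed checks". The check interval, the failure count, and the function name below are illustrative assumptions, not settings from any particular monitoring tool.

```python
def seconds_until_alarm(check_results, interval_s, failures_needed):
    """Return the elapsed seconds at which the alarm fires, or None.

    check_results: iterable of booleans, True meaning the check passed.
    The alarm fires after `failures_needed` consecutive failed checks,
    so a single blip does not page anyone.
    """
    streak = 0
    for i, ok in enumerate(check_results, start=1):
        streak = 0 if ok else streak + 1
        if streak >= failures_needed:
            return i * interval_s
    return None


# One healthy check, then a real outage: checks every 20 s, 3 failures needed.
print(seconds_until_alarm([True, False, False, False], 20, 3))
```

With a 20-second interval and three required failures, about a minute passes between the first failed check and the alarm, which is roughly a fifth of the yearly five-nines budget gone before anyone is notified.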

"Okay," you say, "well, five nines is a lot. How about aiming at four nines?" But are four nines really much different than five? Certainly, it gives you more latitude and time to fix a problem, but not much more. You can afford a single downtime that takes a few minutes to debug, but that's all.

The truth is, unless you have an application that doesn't fail, the odds are that your hardware failures will still occur three to four times a year, which pushes the limit of human intervention. A good rule of thumb is that things never happen when you are watching them - figure that any issue takes at least ten minutes to resolve, even if it is as simple as a human inadvertently powering both sets of redundant systems down, and now they are powering back up.
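
Putting the rule of thumb into numbers: three to four incidents a year at ten minutes each, compared with the yearly budgets. This is a sketch assuming a 365-day year; the helper names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # leap years ignored


def yearly_downtime(incidents, minutes_each):
    """Total minutes of downtime for a year of incidents."""
    return incidents * minutes_each


def budget_minutes(nines):
    """Downtime allowed per year for a given number of nines."""
    return MINUTES_PER_YEAR * 10 ** -nines


for incidents in (3, 4):
    down = yearly_downtime(incidents, 10)
    print(incidents, "incidents:", down, "min down;",
          "fits four nines:", down <= budget_minutes(4),
          "fits five nines:", down <= budget_minutes(5))
```

Thirty to forty minutes a year squeezes inside the four-nines budget of about 53 minutes with little room to spare, and is far outside the five-nines budget of about 5.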

150% uptime by Anonymous Coward · 2002-07-09 16:40 · Score: 0

I have seen several routers that have been up for over 460 days.

On a side note, an extreme amount of uptime can be achieved even without redundant machines: simply install Linux, and you never have to reboot to "finish the install". Then plug it into a good network and a UPS: cheap and effective. ( 1gb ethernet preferred :P )

I have yet to see an NT system with 99.0% uptime :)

I am not trolling, or trying to start a fire, this is just what I see.

It gets weirder... by Anonymous Coward · 2002-07-09 16:59 · Score: 0

When you read Netcraft's full report, it says that the site is running IIS 5 on LINUX! Haw haw haw! Funny!

Criteria is "unavailable", not "catastrophic fail" by Anonymous Coward · 2002-07-09 17:07 · Score: 0

Right. The author defines "downtime" as any time that the system cannot complete user transactions.

I've taken about 100 airline flights in my life (very roughly). One of them sat on the ground for three hours before take-off due to mechanical trouble, for a 50-minute flight. That was an unavailable system! I could have driven that far in three hours!

So my experience with airlines is just TWO nines of reliability.
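
The "two nines" figure follows directly from 1 failure in 100 attempts. A small sketch, with a hypothetical helper name:

```python
import math


def nines(successes, total):
    """Number of nines implied by a success count (infinite if no failures)."""
    failures = total - successes
    if failures == 0:
        return math.inf
    return math.log10(total / failures)


print(nines(99, 100))  # 99 good flights out of 100
```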

The company I work for needs it...... by Anonymous Coward · 2002-07-09 17:18 · Score: 1, Insightful

Police system needs to be able to access the criminal databank 24/7/365. Unless you want a shotgun in the face when you pull over a driver.

Crime doesn't take holidays.

Re:The company I work for needs it...... by myz24 · 2002-07-10 01:21 · Score: 1

Even this schedule allows for maintenance every 4 years ;-)

Not SBC by Anonymous Coward · 2002-07-09 17:30 · Score: 0

First off pick a reliable service provider. Somebody that is not SBC. I might have 2 9s on my dsl and 1 9 on my email.

Re:Oi! You act like a manager! by Phroggy · 2002-07-09 18:17 · Score: 2

It doesn't matter if the downtime was my fault or theirs...the effect on my user experience was the same.

Try convincing the people who call Tech Support of this simple concept.

--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;

The server is way cheaper by Otis_INF · 2002-07-09 19:50 · Score: 2

That's the reason why servers go down and planes do not (well.. most of the time): people expect that the server they get for $3000 will run a corporate mission critical system for years without a crash. Planes cost millions and have their hardware tested every time they're used; servers do not. Do you test your server's hardware every week? (or day?).

--
Never underestimate the relief of true separation of Religion and State.

This all depends on context. by ben_ · 2002-07-09 21:10 · Score: 1

Much as I dislike MS, we have an NT4 Exchange server that has been running continuously, no reboots needed, since January. Its uptimes match that of the Linux server that is the firewall. Since the NT server is completely firewalled from the Internet (SMTP mail is routed in and out through the Linux box) and runs no Net-addressable services, it needs no security patches for IIS.

The point is that MS can validly claim those uptimes in certain circumstances. Don't ever let your dislike of something blind you to the facts about it. That is bad engineering.

--
ben_ the technologist and platform agnostic

Unscheduled??? by winchester · 2002-07-09 22:51 · Score: 1

Okay, is there anyone here who understands that 99.999% reliability means that you have 0.001% total UNSCHEDULED downtime, so downtime due to crashes and what not? 99.999% never talks about the downtime due to maintenance, which is, after all, scheduled downtime.

Server vs Service by AftanGustur · 2002-07-09 22:52 · Score: 3, Informative

Even if all that weeks downtime came at once, six seconds is little enough that most users would just hit refresh and never even notice. Besides which, most web servers are taken down for maintenance tasks, upgrading software or disk, etc...Chances are even restarting the web server would take up more time than your maximum weekly downtime.

You are not making the distinction between "server uptime" and "service uptime". When people talk about 99.something% uptime, they are usually referring to "service uptime". With proper hardware (redundancy etc ..) you can reboot servers, change disks, memory and even routers and it won't cost you even 1 second of "service downtime".
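
A quick sketch of why redundancy separates the two numbers. It assumes replicas fail independently and that failover is instantaneous, which real deployments only approximate; the function name is illustrative.

```python
def service_availability(server_availability, replicas):
    """Availability of a service that stays up while at least one replica
    is up, assuming independent failures and instant failover."""
    return 1 - (1 - server_availability) ** replicas


for n in (1, 2, 3):
    print(n, "replica(s):", f"{100 * service_availability(0.99, n):.4f}%")
```

Under those assumptions, two servers that are each only two-nines machines already give a four-nines service, and a third takes it to six.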

--
echo '[q]sa[ln0=aln80~Psnlbx]16isb572CCB9AE9DB03273snlbxq' |dc

Re:Server vs Service by ranulf · 2002-07-09 23:52 · Score: 2

the distinction between "server uptime" and "service uptime"
With proper hardware (redundancy etc ..)
The article doesn't even touch on these issues. In fact, from the situations described in the article, it's clear that they are considering a single box and discussing how unsuitable 99.999% availability is. If you did have redundancy, then yes, I agree you are far less likely to have two catastrophic failures at once (especially if you have machines at different locations and DNS assisted failover, etc).
However, the article paints the picture of a mad scramble to bring the machine up within a matter of minutes of failure. If you do have proper redundancy, let's say 1 of your 3 machines in different locations fails, then you don't have a mad scramble to get things working. Sure, it helps if you have fixed things quickly in case another machine fails and it will ease the load on the remaining servers, but it's not amazingly pressing.
Now, I've worked in places where everything's done on one server, and it is a big deal when the machine goes down. It is a big deal when it needs a reboot. It's even a big deal just restarting the web server, as in many cases clients lose their session context.
Ah, yes. I'd almost forgotten session context. This is not the kind of thing that is trivially easy to share between redundant servers, unless it is considered from the start. Typically, this is possible for sites with static data, but anything that attempts to keep track of the user's data and/or preferences will start having big issues trying to keep things up-to-date.

Not quite the 95% rule by oliverthered · 2002-07-09 23:08 · Score: 2

Statistically, for something to have a guaranteed (95%) uptime of 1 month it must have no downtime in 95 months!

5+ nines are great but you can still go down 1 pico second an hour, that's a hell of a lot of outage.

--
thank God the internet isn't a human right.

I've had to support a 5 niner... by billmaly · 2002-07-09 23:54 · Score: 2

It's a joke. By the time the call got to me, I got the person on the phone, and got a description of the problem...BUZZZZ!!! Time's up!!! Of course, I was asked to support a system that I had no formal training on, that I didn't design, install, or ever see in person....support was....difficult. My dot com layoff was, in sooooo many ways, the best thing that could have happened to me!!

Re:Simplicity, not over-engineering, gets you 5 9' by chthon · 2002-07-10 00:06 · Score: 1

I have seen this in reality. On my previous job the systems were migrated to HP systems, which claimed 99.999%. One of the first things which broke was the redundant Fibre Channel controller. It took two days to fix it.

The cost of 9's by mustprotectdata · 2002-07-10 01:22 · Score: 1

As a rule of thumb, for each extra nine, add an extra zero...

NT Reliability (wasRe:This all depends on context) by DSL+Pimp · 2002-07-10 02:28 · Score: 1

Yes but NT is probably the only mature server OS Microsoft has, and they only took several years to get there...

I think 5.

--
"If I were important, I would have a sig file..."

A matter of perspective... by belphegor · 2002-07-10 03:08 · Score: 1

The biggest thing I have run into when discussing system reliability is that people don't seem to get the distinction between a particular system being up and the service being available.

In many cases, the end goal should be for the users to experience however-many-9's you want -- but that doesn't mean that your administrators are only going to have to deal with that much down time on individual systems. In fact, they'll have quite a lot of downtime to cope with -- but you have to be sufficiently redundant that you can still provide the service in the presence of individual system/link failures.

Besides that, a good system will be designed to degrade gracefully in the event of component failures. Your users will be a lot happier if you can tell them 'this service is temporarily unavailable' than if the service just disappears.
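
Graceful degradation can be as simple as catching the backend failure and answering with an explicit notice. A minimal sketch, assuming an HTTP-style status code and a hypothetical `fetch` callable standing in for whatever component might fail:

```python
def handle_request(fetch, fallback="This service is temporarily unavailable."):
    """Return (status, body); degrade to a clear notice if the backend fails.

    `fetch` is a hypothetical zero-argument callable that talks to the backend.
    """
    try:
        return 200, fetch()
    except Exception:
        # The user gets an explicit answer instead of a hung or vanished service.
        return 503, fallback


def broken_backend():
    raise ConnectionError("database unreachable")


print(handle_request(lambda: "weather in Bali: sunny"))
print(handle_request(broken_backend))
```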

The only thing that is exceptionally difficult is having a redundant database synchronized in near-real time across multiple sites. That's where you need to spend the big money on clustering. Front-end servers can be as redundant as you want, and mid-level app servers can be clustered or independent, but the db has got to be there for the rest of it to work.

But if you can't convince management that the goal is that the users' experience of downtime be minimized (and you can measure it), then you're going to have a hard time asking for more money when the apparent amount of time spent fixing broken systems goes up rather than down with each upgrade. Sure, you want to minimize the individual system downtime, but look at services like Google where they have tens of thousands of individual systems with dozens or more down at any given time -- they've just designed their stuff well enough that the service can keep chugging along in the presence of failures.

Going way OT by xmedar · 2002-07-11 15:31 · Score: 1

RDF will only be effective inside big companies

Unless a "God" starts defining ontologies, and of course borrowing from others that already exist, plus it's a much smaller step from RDF to DAML than from XML to RDF...think of the possibilities... and there's some compression and encryption involved to get to the wireline data, I guess it's good my ontologies start with the most important forms of data... MP3 tags :) I don't think perhaps you realise the significance of RDF: XML is about tags, RDF is about semantic content, and one up from there, DAML enables logic deductions, as TBL named it "The Semantic Web", far more useful than the typographical web we have at the moment. As for >1MLOC, well you need other services like authentication, gateways to translate the content of other systems into RDF etc, you get there very quickly. As for China, and other countries, we can just batter them with the WTO rules as they will be hurting our business, they might even fall foul of other laws regarding network and computer security if they interfere, anyhow nice to know you got my email and that it inspired a response.

--
Any sufficiently advanced man is indistinguishable from God

Re:Going way OT by Beliskner · 2002-07-11 22:26 · Score: 1

Warning: This is offtopic. Ah heck mod me down, excessive karma simply causes Slashdot-wide karma inflation.
A standardised way for describing metadata just looks nice. Unless you're planning on operating corporate-wide or intercorporate-wide, XML, RDF and DAML are just buzzwords. What's most important is a working system, and then document your application API and communications. In other words for a developer, documented communications = XML+DTD. Remember many apps can read Microsoft Word .doc format despite it being bespoke. Supporting XML is just icing on the cake, if you tell a developer a document's gonna be in XML he'll say "cool" whereas a secret proprietary format would be a PITA but can still be done. You want to take the pain out of the ass, I'm not going to complain about that.
With standards compliance, you would benefit from being able to plug in "out of the box" software like Microsoft Access (for whatever reason). But then if you're on the cutting edge, your DTD can be so weird that any "out of the box" software will just be able to put your data into DOM format and then uhhhhh it won't know what to do.
However it looks good if you have RDF standards compliance, same as Microsoft Windows looks good if it's compatible with a HP Deskjet printer. Stick it on the front of your product, heck even if you stick a dog turd on the front of your product people will think it's special and buy it anyway (like Eminem).
Now with DAML and RDF you... One sec, must eat breakfast and check email.

--
A caveman dreams of being us, the incalculable power and riches. We dream of being Q, then what?

Re:Going way OT by xmedar · 2002-07-12 07:11 · Score: 1

A standardised way for describing metadata just looks nice. Unless you're planning on operating corporate-wide or intercorporate-wide, XML, RDF and DAML are just buzzwords.

True, major benefits accrue from being a cross-boundary (team/dept/organisation/nation) system, then you can use the same technology implementation both inside and outside, I think that's what marketroids refer to as "a big win". The point is that with straight XML the apps have to make sense of the tags, RDF is a semantic description and so relationships are explicitly defined allowing for a different level of program logic to be applied without the app having to understand a particular schema other than RDF. There are other ways the same thing can be done, everything could be described semantically in Lisp and then just give everyone Lisp interpreters, RDF was chosen because it is easy to understand, you don't have all the old associations of expert systems and other such failures, and being serialised as XML means interpreters are easy to build, and so are query stores. XML is useful as a tool and so is RDF, but there are differences in where the line between syntax and semantics lies, that was my point, I hope I've explained it better now.

--
Any sufficiently advanced man is indistinguishable from God

353 comments