Uptime Realities in the Internet World
schnurble writes: "My former boss has written an interesting article on the realities of uptime in the Internet World. It poses the idea that four and five nines of reliability are too expensive to be realistic, especially in the post dot-bomb economy. It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."
"Under the iron bridge, we fist" - The Smiths, Still Ill
Wouldn't you know it, an article about uptime...and slashdotted. Looks like he needs a mirror.
Damn, missed it! Maybe next time...
...to be worth 4 and 5 nines of reliability.
"Only in their dreams can men truly be free 'twas always thus, and always thus will be."
--Tom Schulman
We'll see how good www.codesta.com's uptime is after the slashdot'ing.
Uptime Realities in the Slashdot-linked World
There is no longer anything that can be done with computers that is nontrivial and clearly legal. -- Paul Phillips
uptime.netcraft.com is THE best place to see what works for uptime. Last time I checked, BSD machines were the best for uptime.
:o)
M@t
Matt Thompson - Actuality - Insert product here.
And for the follow up article he discusses how hard it is for a site to remain up after it's been slashdotted.
Damn unreliable modems! And broadband is STILL NOT AVAILABLE IN MY AREA
To the lameness filter, According to google, clippy is porn!
Looks like he's going to be seeing even fewer nines after this slashdotting.
How many engineers out there have heard the marketing / sales 'it has to be always available' and priced out an infrastructure accordingly?
Even recently I've been working with a customer who wants a compromise between price and availability - but it still needs five nines.
Availability is infrastructure plus process. You need to have the supporting process to go along with the hardware - maintenance schedules, change management (well FCAPS in general), etc. It's not just a big box.
What, did everybody click the link but not reply?
I can't access the site...
or maybe they just really like using micro$oft products
Sorry about the writing. Robot fingers, you know? Cliff Steele in DOOM PATROL #23
said if i can get this mentioned on slashdot, i'll get the raise after all...
::.. check out some Cell Phone Reviews
It poses the idea that four and five nines of reliability are too expensive to be realistic
I know it costs a lot per letter of text... so why not just print maybe one or two nines instead? Or maybe a one with two zeros... I tend to just round off after a certain point
"My former boss"
;-)
Nice, and you go after your ex-boss by getting his article slashdotted!
Now, if only School had high uptime... (suffered 2 outages this morning ^^ )
The boss didn't do for, though. :(
Like the Telco... voice grade telco. Better than the power company.
Our web server does about 3 9's, which is a downtime of about 8 hours a year, I think. I really suck at math though. I mean it.. I'm so bad at math I have no idea if that's right. I said "well there's 8760 hours in a year, so 8 divided by that is 0.0009, so that's about 3 9s." I think. 8 hours of downtime isn't that bad. I think the next step up from 8 hours of downtime is essentially those megacorps that have redundant systems, and sirens go off and people die when their server goes down for under a second. In fact, I think if their server actually went down for more than a second, some sort of structural damage to the building hosting it is the only likely scenario. Course, that's closer to 7 9s. I can't figure out how long any of the other 9s are cause I only knew what our average downtime is, and could do the math that way only. Wow, it's really hot in here.
Could someone with an 8th grade math education please post the amounts of downtime 1 through 9 9s are, please?!
slashdot: where everyone yells sarcastic metaphors to themselves to understand the issue
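For anyone who wants the table asked for above, here is a minimal sketch that converts a number of nines into allowed downtime per year. It assumes a 365-day year (leap years ignored), and the function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a non-leap year


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability with `nines` nines."""
    unavailability = 10.0 ** -nines  # e.g. 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.4f} seconds"


if __name__ == "__main__":
    for n in range(1, 10):
        percent = 100 * (1 - 10.0 ** -n)
        # Show just enough decimals to display every nine.
        label = f"{percent:.{max(n - 2, 0)}f}%"
        print(f"{n} nines ({label}): {humanize(downtime_seconds_per_year(n))} per year")
```

By this arithmetic three nines is roughly 8.76 hours a year, five nines a little over five minutes, and nine nines about 0.03 seconds.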
I got it, I got it!
This isn't the first time something that once seemed ludicrous became commonplace; remember Gates and his 640K of RAM statement.
We should just give up on decent service and professionalism. I don't think so.
... It's not unrealistic ... don't expect people to live with downtime just because a good portion of those systems need to be rebooted on a regular basis (Win machines), and general retardness of sysadmins around the world allows things like Nimda and Code Red to get out of hand. This is an excuse to let companies too cheap to have decent customer support off the hook. Maybe if they were educating their tech staff instead of finding more ways to rip us off, they'd have decent service.
My ISP (Ameritech) seems to think so, considering my DSL connection and their promptness to "Get ahold of me within 24 hours..."
Bleh
Everyone with competent sysadmins on rock solid *nix systems raise your hands...
Had to make sure I didn't miss this one!
There's a Mercedes gap too. I want one and can't afford one, but it's not government's job to do anything about it.
Ok, I'll admit my ignorance. Anyone care to explain what X nines of reliability means?
I'd suggest you don't use Slashdot as your only news source, or you will suffer permanent brain damage.
I think we just knocked his server down to two nines by slashdotting it.
What else would motivate someone to post an ex-boss' e-mail address on the front page of slashdot?
You don't know what you are talking about. All our customers have it as a mandatory requirement for *all* platforms we sell. I think the original target was like 3 minutes complete downtime every 10 years. That's why no Windows platform will ever make it in the telecom infrastructure world. It used to be that 5-9's was accomplished by proprietary hardware and software (look at Nortel, Lucent, and any other infrastructure provider's equipment). Now all the datacomm companies think that you can make 5-9's stuff out of commercial off the shelf 3rd party crap and it's damn near impossible.
-working for the elves...
If i am stilled banned....
Introduction
The Scenario
Pagers going off. Phones ringing. People shouting fragments of conversations over the tops of cubicles. Groups of people huddled around monitors. Others dashing up and down the hallways, sticking their heads into office doors for just a moment, then scampering along to the next doorway. You are frantically talking on your cell phone, silencing your pager, and yelling into the speakerphone on your desk while typing on two different keyboards attached to three different monitors.
Sound familiar? It's a classic case of the dreaded 'downtime' disease, a terrible ailment where none of your systems work, for reasons you can't always understand. Of course, it typically strikes at the most inopportune moments - the launch of a major product upgrade, or right after announcing your partnerships with 5 of the Fortune 100.
Nobody wants downtime. It's a terrible thing that always involves blood, sweat, tears, and inevitably, a loss of money. This is why when you talk to the upper management of any company with a strategic online initiative you'll be told that the IT group has the highest goals, and that downtime is considered anathema, to be stamped out vigorously.
Unfortunately, when you talk to the company's IT manager you commonly hear a different story; the resources to back up the company's lofty online goals are hard to come by. In fact, with the downswing of the last couple of years, combined with the fact that IT isn't, at least directly, a revenue generating entity, IT budgets are being reduced while uptime performance levels are expected to stay the same. This can only lead to a death march of extremely over-worked IT personnel, and longer, more numerous occurrences of system downtime. These goals need to be re-evaluated.
Genesis of the 'Five Nines'
We've all heard the mantra of 'five nines', or 99.999% reliability. Somewhere in the depths of the Internet's 'big bang', when systems were slow and cranky, reliability became a major selling point of why one company's system was 'better' than the competition.
First, people talked about being 'two nines' or 99% reliable. Then someone else would top that, and make their product seem better, claiming 'three nines' (99.9%). Not long after that came 'four nines' (99.99%) and then, near the peak of the dot com era, came 'five nines'.
The herd mentality left no room in which to pitch for investment without the 'five nines' claim. "After all," it was thought, "if everyone else is saying they can provide 'five nines', I'd have to pretend I didn't know what I was doing if I didn't say I could match everyone else's claim."
'Five nines' isn't impossible. It's merely impractical and unnecessary in the world of the Internet. A shocking statement, perhaps, but a truism nonetheless.
We're not talking about launching people into space (which, by the way, is unfortunately done under 'three nines'), or working with nuclear power plants. We're working within the frame of reference of online systems providing services to users both on and off the Internet.
The Greasy Steel Bar
Think of uptime as a chin-up bar coated in grease. The higher the reliability desired, the greater the coating of grease. It's clearly tougher to hang on to a higher standard of reliability.
What's not so obvious, but very important, is that the higher the uptime target, the worse one does if not prepared. An IT department capable of three nines faced with a bar that's five nines slippery won't even manage the three nines they are capable of doing.
At my place of employment, we don't bother to go for 5 nines, we're quite happy with 9 fives :)
My bosses allow for 0 downtime with Voyeurweb.. It only takes a bit of magic, a lot of available bandwidth, and redundant servers in multiple physical locations. Hell, we're slashdot proof. :)
Serious? Seriousness is well above my pay grade.
Wouldn't ISPs be that important? What about company VPNs? Hospitals? Google? Slashdot????
Web sites only work if people can view them, and when you have hundreds of thousands of hits per day, you could be losing a lot by being down
Tibbon
tibbon.com
So is his uptime screwed now that the site has been slashdotted?
Let's see...five nines would be just over five minutes of downtime in a year (315 seconds). For business and other non-life-threatening situations, that would be way better than necessary. Lots of folks are probably going to harp on the "If 1 out of 10,000 airplanes crashed, there'd be X crashes" line of argument. There's a problem with that...one mistake doesn't crash an airplane. Every system on an airliner is redundant, and virtually any "pilot error" has time to be fixed before there's a problem. Listen in on the Air Traffic Control to Cockpit transmissions sometime...just about every flight encounters some minor error at some point, whether it is a pilot needing to reask for a clearance or someone needing to climb or descend a bit to clear a potential collision. Errors are unavoidable. The key is to ensure recovery from those errors is possible. So sure, your computer may be down for 5 minutes a year. Make sure you have a backup system that is able to take up the slack instantly, and your downtime is down to about 3/1000 of a second a year (assuming the two systems fail independently). Redundancy is the key.
It all depends on what is on the server. If it's stuff your own people use constantly on their job, through your own network, you need five nines, otherwise you will take the blame for critical jobs getting done late.
But when people are going to the server through the internet, they get used to interruptions - there are so many links between, some of which periodically become overwhelmed with traffic, that no one could tell the difference between two nines and five nines on your server itself. So sales & product information sites don't need more reliability than you can readily afford. They do need high capacity.
And if it's your blogs concerning your navel lint - no one's looking at your uptime but you...
Anyone cached the article?
Marcos
Five nines uptime is cheap and easy. It all boils down to where you put the decimal point.
Obliteracy: Words with explosions
8 hours a year? You must be a Windows man!
Seriously, 8 hours of downtime for a cellular operator during peak hours can mean big bucks.
Duh... what's four and five nines?
Tibbon
tibbon.com
The XWT Cluster has achieved some very high availability on the cheap by using machines at several mom-and-pop data centers across the country. The machines are clustered into a peered (no master) failover configuration with the open source dnsfailover package. If any machine fails, the others will remove it from the DNS records; when it comes back on line, it gets added back in.
By spreading our risk across several data centers in different cities, with no single point of failure in the cluster, we don't have to worry about incompetent network administrators, power failures, a/c failures, backhoes, or nukes. Being able to skip out on all those expensive options saves a ton of money.
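The peered failover idea described above can be sketched roughly as follows. This is not the dnsfailover package's actual code or API; the function names, the peer names, the addresses (documentation-range IPs) and the TCP port check are all assumptions made for illustration.

```python
import socket
from typing import Callable, Dict, List


def tcp_is_up(ip: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Hypothetical health check: can we open a TCP connection to the peer?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def records_to_publish(peers: Dict[str, str],
                       is_up: Callable[[str], bool] = tcp_is_up) -> List[str]:
    """Return the addresses that should stay in the DNS record set.

    Every peer runs the same check against every other peer, so there is
    no master: a machine that fails the check is dropped from the records,
    and it is added back automatically once it passes again.
    """
    return sorted(ip for ip in peers.values() if is_up(ip))


if __name__ == "__main__":
    # Made-up peers in three different data centers.
    peers = {"east": "192.0.2.10", "central": "192.0.2.20", "west": "192.0.2.30"}
    down = {"192.0.2.20"}  # simulate one data center failing
    print(records_to_publish(peers, is_up=lambda ip: ip not in down))
```

In practice the DNS record TTL would also need to be short, since clients keep using a cached address until it expires.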
My company (a large-ish, surviving Internet Retailer) has internally announced a Six Sigma Initiative. I'm wondering if we'll need to maintain 5 9s uptime...
It is the percentage of time that the server is up, i.e. 99.99% is 4 nines of reliability and up 99.99% of the time (I am assuming that x 9 refers to the total number of 9s in the percentage, not just those to the right of the decimal point).
100% uptime is virtually impossible, so the holy grail is as close as possible--99.999%
If you want to learn about uptime, don't bother going to codesta.com. Their servers have already melted from a brutal slashdotting. According to Netcraft, codesta.com runs Linux and has 74 days of uptime... until today!
cpeterso
The "five-nines" of reliability has nothing to do with an individual server being available, but with an individual application. This means you can have 2-3 servers running the same load-balanced application. This way, you can take 1 down every hour if you want, as long as the other one or two are still working, and the application keeps working. If you're REALLLLLLLLY lucky, you will meet the "five-nines" and if you're EXTREEEEEMELY lucky, you'll get 100% on that application.
THAT is the goal. It's called redundancy. You will *not* meet any reliability milestones on a single server or network link. It's an obtainable goal, but it does cost money depending on your architecture.
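A rough sketch of why the application can beat any single server: if each load-balanced server is up a given fraction of the time, the application is only down when all of them are down at once. This assumes the servers fail independently and that the load balancer itself never fails, which real setups only approximate; the function names are invented for the example.

```python
def combined_availability(per_server: float, servers: int) -> float:
    """Availability of an application served by `servers` redundant machines.

    Assumes independent failures: the app is down only when every server
    is down at the same moment.
    """
    return 1.0 - (1.0 - per_server) ** servers


def servers_needed(per_server: float, target: float, limit: int = 20) -> int:
    """Smallest number of redundant servers that reaches `target`."""
    for n in range(1, limit + 1):
        if combined_availability(per_server, n) >= target:
            return n
    raise ValueError("target not reachable within the server limit")


if __name__ == "__main__":
    # Two servers at three nines each.
    print(f"{combined_availability(0.999, 2):.6%}")
    # How many two-nines servers does five nines take under these assumptions?
    print(servers_needed(0.99, 0.99999))
```

Correlated failures (shared power, shared network link, the same bad patch on every box) break the independence assumption, which is why the real number is always worse than this estimate.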
but their server is down.
Hey freaks: now you're ju
with M$, it is theoretically impossible as well to achieve their advertised up-time (I think back when they ran some ad (still running?) about how windows can achieve three or four 9s of uptime).
Total bullshit... let's see -- a windows machine *requires* a reboot every time you apply a patch; a reboot on a large machine is... i dunno, 10 minutes if you got a lot of crap. a security update turns up about twice a week or so... that puts it at ~99.8% MAXIMUM;
even if you don't buy my numbers, three 9s uptime means every week you only get ~10 minutes of downtime.
yeah... sure... not if you want to patch up that internet explorer / IIS so your system does not die from DoS, hackers, or worms!
My life in the land of the rising sun.
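The patching arithmetic in the comment above can be checked with a short sketch. The reboot count and duration are the commenter's guesses, not measured figures, and the function name is invented for the example.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080


def availability_ceiling(reboots_per_week: float, minutes_per_reboot: float) -> float:
    """Best possible availability if planned reboots are the only downtime."""
    planned_downtime = reboots_per_week * minutes_per_reboot
    return 1.0 - planned_downtime / MINUTES_PER_WEEK


if __name__ == "__main__":
    # Two patches a week, ten minutes per reboot (the commenter's guesses).
    print(f"{availability_ceiling(2, 10):.3%}")
```

With those inputs the ceiling comes out at about 99.8%, before counting any unplanned outage.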
Maybe your phone call to 9-1-1 should be the one that happens during the five minutes of downtime?
So what kind of uptime are YOU going for?
He who has no
Remember that the control surfaces on modern jets are not connected mechanically to the yoke, you are completely at the mercy of software. You don't want it to halt.
That is why planes use redundant systems - the requirement for reliability is for the system as a whole, not necessarily for an individual processor. The control surfaces need to be accessible by the pilots (or auto-pilot) at all times.
Hate standing in the meat locker (server room)? Hate rushing to work past midnight to cycle a server?
The problem I used to have is I'm not a morning person so being available as an admin before 7am is tough, but now I can admin my network while trapped in rush hour traffic. =] Reboot servers, telnet into devices, stop/start services, add users, manage DNS... the list goes on and on.
Uptime can be maintained without even having to leave the comfort of your easy chair. If you're an admin you should check this product out.
SonicAdmin by sonicmobility
(http://www.sonicmobility)
WURD!!
Let me give you a hypothetical case. One of our clients does about $50k/month on their web site. When the site was built, they were only expecting $10000-$15000/month. At the time, NN4 compatibility wasn't important, because the extra cost ($10k) wasn't going to be worth it. With NN4 sitting between 5% and 10% each month, they have decided that NN4 compatibility is important in the next version.
When we launched, 3 days of downtime a month was considered okay. It was considered a better choice than spending an extra $5k on hardware for redundancy. Well, when the site broke $40k/month, we immediately decided that that was no good and invested in the redundancy.
The site has had a few 15 minute outages over the past 6 months, and a 1 day outage over a holiday weekend (not a big deal). However, if the site doubles in revenue again, downtime becomes even less acceptable, and we'll drop $10k to avoid it.
If your site sucks and no one visits, downtime doesn't matter. If you are making lots of money, downtime does matter. $10k on hardware is worth it if the downtime would cost you $25k?
Alex
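The trade-off described above can be put into a back-of-the-envelope sketch. It assumes revenue is spread evenly over a 30-day month and that sales lost during an outage are never recovered later, both simplifications; the function names are invented for the example, and the dollar figures are the ones quoted in the comment.

```python
def monthly_downtime_cost(monthly_revenue: float, downtime_days: float,
                          days_in_month: int = 30) -> float:
    """Revenue lost per month, assuming sales are spread evenly over the month."""
    return monthly_revenue * downtime_days / days_in_month


def payback_months(hardware_cost: float, monthly_revenue: float,
                   downtime_days_avoided: float) -> float:
    """Months until redundancy hardware pays for itself in avoided losses."""
    return hardware_cost / monthly_downtime_cost(monthly_revenue, downtime_days_avoided)


if __name__ == "__main__":
    for revenue in (15_000, 40_000):
        loss = monthly_downtime_cost(revenue, downtime_days=3)
        months = payback_months(5_000, revenue, downtime_days_avoided=3)
        print(f"${revenue}/month: 3 days down costs ${loss:,.0f}/month, "
              f"$5k of hardware pays back in {months:.2f} months")
```

Under these assumptions the same $5k of hardware takes over three months to pay back at $15k/month but little more than one month at $40k/month, which matches the decision described.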
I'll show you mine if you show me yours.
# uptime
16:42:54 up 121 days, 2:29, 3 users, load average: 0.23, 0.28, 0.27
Have you read my journal today?
Simply put, 4 9's of reliability would mean 99.99% uptime (only down for .01% of the time).
"Perl 6 gives you the big knob" -- Larry Wall
like in 9.9999% ? ;-)
Ha ha.
Actually, even this is silly. True five nines availability on a widely distributed network would mean that an application was available at all times on all segments of the network. Which would mean that your uptime depends not only on your redundancy on one side of a pipe, but on your overall redundancy as well, so that when a pipe goes down you're still accessible. Since when a pipe goes down in your host you probably lose other resources as well (such as power or alternate pipelines), this means multiple datahouses owned by multiple vendors. Each of these has to have a perfect backup of all data and be running the same versions of all software. Really, the only true redundancy would be so heavily distributed that each local network would basically have to have its own server. This isn't so crazy -- technically, DNS and email do this. However, we all know that for an end user even DNS and email can have perceived outages.
And this is why 5 9s is foolish. Sure, you're redundant behind the pipe, but if you lose the pipe you can't blame your datacenter when you charged a customer for uninterrupted service. Technically, if their modem disconnects them for a few hours you've broken contract.
Besides, who needs it? If yahoo is unreachable from my desk, I wait and reconnect. It doesn't matter if the downtime was my fault or theirs...the effect on my user experience was the same. Any services I might have used, or products purchased, I will use or purchase at a later time. After all, I don't refrain from buying shoes just because the mall is closed!
Hey freaks: now you're ju
HE'S DISSING LUNIX!!!
3 9s = 99.9% uptime = 8.75 hrs/Yr = 525 min/Yr.
4 9s = 99.99% uptime = .875 hrs/Yr = 52.5 min/Yr.
5 9s = 99.999% uptime = .0875 hrs/Yr = 5.25 min/Yr.
9 9s = 99.9999999% uptime = .03 seconds per year downtime.
I call bullsh*t on anything that claims to have 9 9s reliability. 3 seconds every HUNDRED years.
Nathan Brazil?
That we live in a society that is more willing to send people into space with only a 99.9% chance of success, yet we freak out when a search engine on the Internet drops below 99.999% reliability? Great. Remind me never to work for NASA.
He who has no
Looks like codesta.com just used up all its downtime by getting its servers slashdotted.
Outdoor digital photography, mostly in New Engl
I believe there's more to this than meets the eye.
What better way to get back at your former boss than slashdotting him or his company server back to the Middle Ages..
Follow that up with multiple queries on google about boss's info, credit cards, ssn etc..
To cut things short, by the end of the week :
Boss's boss realizes the server crashes were due to Boss, fires his ass on the spot.
Wife realizes that the new unexplained charges on Credit card from "Suzy's Parlor" were not exactly the next door cafe. Gives him the boot as well.
You evil man..you!
Rapid Nirvana
...that this article is hosted on a server which is now being brutally Slashdotted?
To make a pun demonstrates the highest understanding of a language
The above is a high availability response if I ever saw one!
"I have opinions of my own, strong opinions, but I don't always agree with them." -- George H. W. Bush
We did it on a really low budget:
Heartbeat/Mon/Fake/Coda/Linux/IPVS for the High Availability, failover from DS1->DS2, each on different backbone nodes.
Mirrored systems in different geographic locations:
Firewall
IPVS Gateway
Apache->Weblogic bridge (Apache vhosts with ssl)
Apache->Zope bridge (Apache vhosts with ssl)
Zope->Zeo setup for content management.
SAN drive array for Oracle, running on two E4500s
This system isn't really that expensive, just the costs of hardware and my salary for setting them up.
My $0.02 will always be worth more than your €0.02, so
And frankly I'd rather not be in a plane that lost control for five minutes once a year.
I'm a one-man-band at a small organization. I have a lot of machines set up over a three city block area. Some of these machines are important, some are not. They are all running for a reason. If they stop running, I have to interrupt whatever other useful thing I'm doing, and fix the problem. Sometimes I'm doing something important, sometimes not, but I'm always doing something. My to-do list never seems to get any shorter.
So reliability, in my case, is not a commercial transactions lost per minute scenario. None of my machines are in a life support position where failure would endanger anyone's life. Reliability, in my case, means that my phone doesn't ring and other projects are interrupted less often. Some of my machines have been running for years with no unexpected down time. Others, uhhhh, less.
It's all about my convenience. I like high-reliability systems.
I finally got a tcp connection and page 2 finally loaded, so here it is...
The Uptime Rules
First, as an introduction to the rules, let's review our terms and terminology.
Definitions
Uptime is the amount of time the entire system is available. By entire system we are saying that an entire transaction can be completed. Just having your web servers running when the needed application server isn't running cannot be defined as uptime.
Downtime is everything else.
Scheduled maintenance downtimes or windows are the periods of time (for example, from 1:00am to 3:00am Monday morning) when an IT team has the option, if they need, to bring down various components in a fashion that causes the system to be incapable of complete functionality.
Reliability is defined as uptime but where scheduled maintenance downtime is not counted against it. For example, if in a 24 hour period there was an hour of scheduled downtime, but the system was otherwise fully operational for the remaining 23 hours, then the system was 100% reliable.
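To make the difference between the two definitions concrete, here is a minimal Python sketch. The function names are made up for illustration, and it assumes one reading of the definition above: reliability is measured only against the time outside the scheduled maintenance window.

```python
def uptime_pct(total_minutes, unscheduled_down, scheduled_down):
    """Uptime: every minute the system is unavailable counts against it."""
    up = total_minutes - unscheduled_down - scheduled_down
    return 100.0 * up / total_minutes


def reliability_pct(total_minutes, unscheduled_down, scheduled_down):
    """Reliability: scheduled maintenance is excluded from the measurement.

    Assumption: the maintenance window is removed from the denominator too,
    so only time outside the window is judged.
    """
    window = total_minutes - scheduled_down
    return 100.0 * (window - unscheduled_down) / window


# The article's example: one day, one hour of scheduled downtime, no other outage.
day = 24 * 60
print(uptime_pct(day, 0, 60))       # a bit under 96
print(reliability_pct(day, 0, 60))  # 100.0
```

The same day scores 100% on reliability but noticeably less on uptime, which is why it matters which of the two numbers a contract is quoting.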
So how do you translate the 'nines' into acceptable downtime? This chart provides the answer:
'Nines'   Uptime %   Minutes Per Year   Minutes Per Month
Two       99%        5256               438.0
Three     99.9%      526                43.8
Four      99.99%     53                 4.4
Five      99.999%    5                  0.4
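The chart can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes a 365-day year and twelve equal months, and the function name downtime_budget is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget(nines):
    """Return (minutes per year, minutes per month) of allowed downtime."""
    unavailable_fraction = 10 ** -nines  # two nines -> 0.01, three -> 0.001, ...
    per_year = MINUTES_PER_YEAR * unavailable_fraction
    return per_year, per_year / 12


for n in (2, 3, 4, 5):
    per_year, per_month = downtime_budget(n)
    print(f"{n} nines: {per_year:8.1f} min/year  {per_month:6.1f} min/month")
```

Rounding the yearly figures to whole minutes gives the 5256, 526, 53 and 5 shown in the chart.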
Rule #1: A great system run poorly is a poor system.
This is the most crucial rule to understand when managing any system. It doesn't matter how much you spent on the hardware, how well designed your database tables are, or if you installed the latest and greatest operating system on the market. If it cannot be managed well, problems ensue.
Users don't see, or care, that problems come from your database servers, or your application servers, or your static data caching. What they perceive is one of two states: working or not working. They want to make their reservation, or pay their bill, or just get the weather in Bali, and they want to do it NOW!
Managing with a given level of reliability in mind is about people, hardware, operating and escalation plans, and ultimately, it is about the money to put it all together and keep it running. The cost of reliability is very hard to quantify. Even assuming it is a linear relationship (and few things in life are) it's a staggering relationship in financial terms. In my experience each 'nine' is close to an order of magnitude increase in cost!
The bottom line is this: you need to do an honest assessment of available resources versus intended goals; it is the first step in making sure your great system runs at least as well as you intended.
Rule #2: Five nines is a goal reachable only through both fully automated system management, and rigorously controlled and tested applications.
Scared by four and five nines? Unless you've worked in a true, hardcore, spare no expense data center, you should be!
Let's think about five nines for a moment. 5 minutes a year. That rules out any form of human involvement in fixing problems. After all, even the best humans are known to be distracted for a minute or two into conversation with a co-worker, or a phone ringing.
As an example, let's time a perfectly common scenario, where you have two people monitoring systems. Time the following emulation in your office space:
1. Assume the system is working happily.
2. Walk over to your kitchen area and grab a soft drink. Then walk back.
3. Wait 15 seconds while you pretend to have the other NOC (Network Operations Center) engineer say "Hey, look at this!"
4. Sprint over to your desk and sit down.
5. Log into your desktop machine.
6. Log into a remote machine.
7. Run one or two basic remote commands ('ps' or 'top' for example)
Now stop the clock. I'm willing to bet your five minutes are up!
Even without a distraction, it's simply not possible for a problem in a system of any complexity to be confirmed, cross-checked, and resolved by a person within five minutes. Oh, and don't forget about the minute to 90 seconds that you've already lost in monitoring the issue - unless you want alarms going off continuously, you have to set an error threshold that typically consumes 60 seconds or so.
"Okay," you say, "well, five nines is a lot. How about aiming at four nines?" But are four nines really much different than five? Certainly, it gives you more latitude and time to fix a problem, but not much more. You can afford a single downtime that takes a few minutes to debug, but that's all.
The truth is, unless you have an application that doesn't fail, the odds are that your hardware failures will still occur three to four times a year, which pushes the limit of human intervention. A good rule of thumb is that things never happen when you are watching them - figure that any issue takes at least ten minutes to resolve, even if it is as simple as a human inadvertently powering both sets of redundant systems down, and now they are powering back up.
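The rule of thumb above can be checked against the yearly budget with a short Python sketch. The function name and the sample numbers (three to four incidents a year, ten minutes each) are taken from the paragraph or made up for illustration; a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def fits_budget(availability_pct, incidents_per_year, minutes_per_incident):
    """Return (within_budget, minutes_left) for a simple outage pattern."""
    budget = MINUTES_PER_YEAR * (100.0 - availability_pct) / 100.0
    spent = incidents_per_year * minutes_per_incident
    return spent <= budget, budget - spent


# Four ten-minute incidents against four nines: about 12 minutes to spare.
print(fits_budget(99.99, 4, 10))
# A single ten-minute incident against five nines: already over budget.
print(fits_budget(99.999, 1, 10))
```

Under these assumptions four nines survives a typical year of human-handled incidents with very little slack, and five nines does not survive even one.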
It is in my random
Murphy's revenge: The more reliable you make a system, the longer it will take you to figure out what's wrong when it breaks. -- Sean Donelan on NANOG, Mon, 26 Nov 2001 06:28:22 -0500 (EST)
I use variations of it verbally in meetings when the marketing/sales pinheads are demanding absurd uptimes for brochureware websites. It makes a great starting point for those "be careful what you wish for, because you will have to pay the bill" talks that I use like Jedi mind tricks on pinhead marketing and sales weasels.
I work for a small ISP in central NY. A couple years ago, I can't remember which provider it was anymore, but they unplugged us because their paperwork was all screwed up and they didn't think anybody was on the circuit. Then they plugged somebody else into it. It not only took us several hours to find out what the problem was, it took 3 whole days for them to resolve the problem. They wouldn't simply undo what they did, they had to assign us a new circuit and basically refused to escalate the work order. We eventually came back up but lost quite a few customers, understandably.
They've been building machines that provide 99.999% uptime for something like 20 years.
I've got a lab full of those bastards. Everything is redundant. CPU/memory/power supplies/UPS/disk/network/backplane, you name it.
I've had a chance to open the thing up and look inside and it's amazing!
My only gripes have ever been that they're a bit esoteric at times and they're generally a bit behind the technology curve, but I think they do it on purpose so they know that they're putting out a tested product. Nobody wants a machine running the stock market to be on anything but thoroughly tested hardware. Sort of like how all the computer systems on the ISS are only 386 level...
Yes Francis, the world has gone crazy.
well, that depends on what time frame you are counting with. For example, the last second my uptime was exactly 100%. ;-)
It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues.
You really want to see someone go berserk over downtime, try running a MUD...
As we continue to depend more and more on networks for day-to-day operations, reliability becomes a must. Is it going to be expensive? YES. That doesn't mean that it won't happen or that it doesn't need to happen. How will these networks transform from being a pr0n conduit to carrying traffic such as VoIP (as a business, not the lame implementations we have seen so far), without having some sort of (good) reliability?
This reliability WILL come about one way or another.
See the Netcraft FAQ at http://uptime.netcraft.com/up/accuracy.html#cycle
That site deserves to be slashdotted. They have this little paper divided into about ten little sections, which multiplies their load by 10x or so. Then, it's a .jsp page (why?), which means more server-side interpreter overhead. If they hadn't crudded up the basic job of serving a readable document, they'd have one or two orders of magnitude more capacity.
Rule #3: Even three nines is hard in the Internet World.
The "Internet World" is not a magazine, but rather, a truism of application state, where functionality and features are continuously enhanced. Compare this to a billing or call center, which has a minimum of features, and where great amounts of time are spent in testing before new applications are released to production.
The great thing about developing in the Internet world is that lots of new features can be brought to end users in a very short amount of time. The standard for development is weeks to a few months rather than years. Not only does this provide a level of instant gratification, but it also allows applications and services to be highly responsive to what users actually want and need, and in the end, provides a vastly more desirable system.
The tradeoff, of course, is that the applications themselves aren't nearly as reliable. Thus, the three nines goal. Why three nines? Because it's the highest possible reliability for a system which utilizes human intervention, and there's simply no way that a dynamic, "Internet World" application can be reduced to few enough parameters that it can be managed in an automated fashion. Failure modes grow exponentially with functionality, and the task of automating monitoring and management of such dynamic and flexible systems is an entropic one - that is, it quickly becomes a task bigger than the application itself.
But even three nines doesn't come cheaply. It requires a complete staff to be available at all times. There's no time to call and page people - to wait for them to get home from the supermarket where they were grabbing a quart of milk for the baby.
How much staff does one need? Well, that's a good question, and the answers are dependent upon the nature of the particular application. But, my experience in today's world shows that most systems are three-tier applications, with significant networking components. Therefore, at any given time, you need the following people on hand:
* NOC / Monitoring staff
* System administrators
* Network Engineers
* Application Engineers
* Database Administrators
* Crisis Management
* Customer Management
Now, admittedly, there can be some overlap in tasks, and the simpler the application, the easier it is to get overlap, but already, we're talking about quite a few people. Of course, these people need some backup to call in, for fresh ideas, if things aren't going well.
Don't underestimate the value of having a technical person, who understands the system, acting in the Crisis Manager role. This person is actually very critical to making sure that key issues aren't being overlooked, and to providing the detached viewpoint that is key to problem solving.
In addition, having a customer relationship person available to talk to the upset customers, at least when the service is provided to businesses rather than consumers, is vital. This isn't to help solve the problems of a given downtime event, but for the ongoing relationship with the customer.
Rule #4: 99.7% is very cost effective.
That's right, less than three nines. 99.7% gets effectiveness from the fact that it allows for two hours of downtime a month - basically, a total of one day per year.
While it sounds like a lot, it's typical for a failure pattern to consist of several small events of 10-20 minutes duration, and on rare occasions, a failure that takes three to four hours to resolve. That's the core timing that you get with 99.7% -- the ability to have a four hour failure once a year.
That means that you don't have to build nearly the hardware redundancy - instead of having 1:1 "hot" standby units, you can have a 1:N relationship with a cold standby unit that can be configured and put into place in the span of a couple of hours. The larger N is, the greater the cost savings. If they are network components we're referring to, the less complex the routing environment, the fewer people with network-specific skills are needed. Get it simple enough, and you get more overlap of skills, meaning more bang for your salary buck.
Complex systems also require complex understandings. The number of dependencies within systems again grows exponentially, and leaves far more room for human error.
Remember rule #1, a great system run poorly is a poor system.
Here's a few machines. It's this low because of hardware upgrades last September.. We took one or two down at a time, which left 10 or so serving the site, therefore creating no downtime. hehe
Most of these are web servers that frequently do between 20Mb/s and 80Mb/s, depending on their task. voy03 handles voting, which gives it a slightly higher load.. It only counts a few million votes daily (read: a few million CGI hits)..
voy01 # uptime
4:55pm up 292 days, 24 min, 1 user, load average: 0.71, 0.44, 0.36
voy02 # uptime
4:56pm up 307 days, 3:04, 1 user, load average: 0.15, 0.17, 0.17
voy03 # uptime
4:56pm up 306 days, 8:25, 1 user, load average: 13.70, 12.11, 10.17
voy04 # uptime
4:56pm up 306 days, 19:40, 1 user, load average: 0.45, 0.38, 0.32
voy05 # uptime
4:56pm up 307 days, 3:16, 1 user, load average: 0.25, 0.35, 0.39
voy60 # uptime
4:57pm up 262 days, 23:57, 1 user, load average: 0.33, 0.37, 0.35
Serious? Seriousness is well above my pay grade.
"It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."
;)
Please. Let's not talk so badly about eBay. Do you know how many people have been crushed under their CIO's foot?
For instance, 4 nines says your system is up 99.99% of the time. That is, out of 365 days x 24 hours x 60 minutes = 525,600 minutes a year, it can be down for only .01%, or 52 minutes a year. Five nines (99.999%) allows only 5 minutes a year downtime. This may actually be averaged over many servers and several years (that is, if you had 10 servers running for 3 years and just 1 died requiring one day to replace, you could figure your downtime as 100 * 1 day/(10*3*365) = .009%, so you've still got 4 nines).
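The pooled calculation in that last sentence can be written out as a small Python sketch. The function name is made up for illustration; the numbers are the ones from the comment.

```python
def pooled_downtime_pct(outage_days, servers, years):
    """Downtime as a percentage of total server-days across a fleet."""
    server_days = servers * years * 365
    return 100.0 * outage_days / server_days


# 10 servers over 3 years, with one server out for one day.
pct = pooled_downtime_pct(outage_days=1, servers=10, years=3)
print(f"pooled downtime: {pct:.4f}%  (four nines allows up to 0.01%)")
```

Note that averaging this way hides the fact that one of the ten servers was completely down for a full day.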
There are questions about what gets counted when figuring reliability. For one thing, almost no one would count a slashdotting or a DOS attack against their uptime, but nevertheless from the user's viewpoint the server is down. Also, how do you count "scheduled downtime" such as rebooting NT servers after installing security patches, or unplugging the boxes to move them around when it's time to expand the system? A news server with a worldwide audience has no "penalty free" time slots. So either you settle for a lower uptime goal, or you need redundant servers configured so that even major upgrades can be put in by unplugging just one at a time while the others keep running. OTOH, for the company database server, downtime during working hours is far more serious than downtime for the web server, so if it's a big company you do need redundant servers with automatic switchover. But in most cases there are times late at night or on weekends when no one cares if you shut them _all_ down at once - which certainly makes the upgrades easier.
So anyway, one person's "5 nines" may look like a lot less to someone else. E.g. a server vendor may claim that because only one in a million of their servers is broken at any given time their reliability is 6 nines. Your single server may never break at all - but once a week you take it off-line for ten minutes to load the newest security patches, so to anyone who wanted to keep working for those ten minutes you are only at 3 nines.
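Here is a quick way to turn a downtime fraction into a count of nines, as a sketch. The `nines` helper is hypothetical, and it counts only whole nines.

```python
import math


def nines(downtime_fraction):
    """Whole nines of availability for a given fraction of time spent down."""
    if downtime_fraction <= 0:
        return math.inf
    # round() guards against floating-point noise at exact powers of ten.
    return math.floor(round(-math.log10(downtime_fraction), 9))


vendor_claim = 1e-6                  # one in a million servers broken at any time
weekly_patch = 10 / (7 * 24 * 60)    # ten minutes of patching out of every week

print(nines(vendor_claim), nines(weekly_patch))
```

The two inputs measure different things (fleet-wide breakage versus one box's scheduled window), which is exactly why the two "nines" figures can't be compared directly.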
It isn't reliability; it is availability.
The availability of a system is the fraction of its intended duty cycle which it is functional for. It is frequently expressed as a percentage, as in 99.999%.
Reliability is the rate of failure, thus is expressed in units of time or usage. Mean Time To Failure or Mean Time Between Failure are expressions of reliability.
Remember that downtime is related not only to reliability of each piece of equipment but the number of pieces of equipment. 99.99% uptime sounds good, less than an hour of downtime a year, right? Scale that to a 500-server farm and it's an hour and ten minutes or so of downtime a day, every single day of the year including weekends and holidays (OK, we'll give you one day off in leap years). This concept has boggled a few salescritters who don't grasp the concept of scale.
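The scaling argument is easy to check with a few lines of Python. This is only a sketch: it adds up server-minutes of downtime somewhere in the farm, assuming every server independently meets the same availability figure, and the function name is made up.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def fleet_downtime_minutes_per_day(uptime_pct, servers):
    """Total server-minutes of downtime per day across the whole farm."""
    per_server_per_year = MINUTES_PER_YEAR * (100.0 - uptime_pct) / 100.0
    return per_server_per_year * servers / 365


print(round(fleet_downtime_minutes_per_day(99.99, 500), 1), "minutes per day")
```

This is time that some box is down, not time the whole service is down; whether users notice depends on how the farm is set up.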
It's pretty safe to say that almost none actually back that up with their performance. However, from my experience, very few customers will try to get the company to honor its SLA because you need to provide pretty good documentation. We had a few situations where the company was down for a few hours, dropping its uptime below its guarantee, and still wouldn't credit us because they claimed the downtime to be much shorter. No matter how many traceroutes from major network nodes we showed, they kept arguing against it until we gave up.
Also, unless there is a catastrophic downtime that pulls service out for more than a few minutes at a time, and uptime falls below 95% for the month or lower, most users don't want to be bothered fighting for a credit. If you've ever dealt with any big telco / network provider you know what I'm talking about.
So, the bottom line is that in a lot of service industries (internet especially) it is very easy to claim 5 9's reliability, come close, and not really pay a huge price for failing. In fact, most people won't even notice. Now, for the 911 network, air traffic control, etc., it's a different story.
When someone posts a non-spammed proof email address in a mailto link on the frontpage of /. ??? I guess I know who to ask now....
I get quite a bit just from the comments I post.
I can't read the paper but, for his sake, I hope that he really meant that reliability isn't that important to him.
His server is toast!
Those who remember the awesome but now defunct uptimes.net will be pleased to know that a new server is now up and running. It uses the old uptimes protocols and clients.
The URL is http://uptimes.wonko.com/
A GNU/Linux box was number one the last time I looked, with a NetBSD box coming in second.
So much for "two nines". Nothing, I repeat nothing, can withstand the /. hordes ...
The time of day (week/month) a system is down is as important as how much of the time.
Our hosting company blew out a (supposedly) fully redundant Cisco and took us down for 1.5 hours in the middle of what was at the time our peak day in history. This impacted hundreds of thousands of visitors.
The same downtime at 1:00 am would have had about 1% of the impact on users even though the availability statistics would show both to be identical.
Naturally systems tend to fail more often under high load when the impact on your user base will be the greatest.
Impact on your user base is a better measure of the impact of downtime.
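One way to put a number on "impact on your user base" is to weight the outage by the traffic it hit. The sketch below uses a made-up hourly visitor profile (not real data from any site) and assumes traffic is uniform within each hour; the names are hypothetical.

```python
# Hypothetical visitors per hour, indexed by hour of day (0-23). Not real data.
HOURLY_VISITORS = [1_000] * 6 + [20_000] * 4 + [100_000] * 8 + [20_000] * 6


def affected_visitors(start_hour, duration_hours, profile=HOURLY_VISITORS):
    """Visitors hit by an outage, assuming uniform traffic within each hour."""
    affected = 0.0
    t = start_hour
    end = start_hour + duration_hours
    while t < end:
        hour = int(t) % 24
        step = min(end, int(t) + 1) - t   # portion of this hour inside the outage
        affected += profile[hour] * step
        t += step
    return affected


night = affected_visitors(1, 1.5)    # 1.5 hours starting at 1:00 am
peak = affected_visitors(13, 1.5)    # same outage starting at 1:00 pm
print(night, peak, night / peak)
```

Both outages count as the same 1.5 hours in the availability statistics, but with this profile the night-time one touches about 1% as many visitors.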
~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
On W2K - service packs, about 10% of hot fixes, and anything to do with IIS require a reboot. Take your head out of the sand.
To get closer to your analogy, I would treat a server like a jet engine - the plane is designed to fly even if one fails.
"The world is run by idiots because they're more efficient than hamsters. "
That's just what the HAMSTERS want you to think!
Service guarantees Citizenship! Questions Guarantee GITMO.... Amerika Uber Alles!
I figure it makes pretty good economic sense, since many different sites with low CPM rates still get over a million page views per day. Problem is, she replies, there's probably only around 150-200k unique visitors at any of these respective locations, each of whom is triggering around 5-7 page views per person per day.
And besides, she continued, using the Jungian Archetype model to illustrate her point, the target audience is devoted to reason, not emotion. This, I concede, defeats one of my central tenets: applying a test to determine whether a person is Apollonian or Dionysian, left-brain or right-brain, etc. in order to assess the likelihood of downloading Centiare, my cool little cash management/forecasting program for individuals and small businesses.
Centiare quickly and automatically calendarizes projected deposits, payments and running cash balances over any time period selected - the output looks like a spreadsheet. But since transactions are stored in a database, the way it works is through a series of SQL pivot/transformation functions. The results are stored within multiple counter arrays to keep track of time periods, monthly totals, and grand totals. Once the recordset is complete, voilà, the whole thing is formatted and printed - the flash report looks really good.
And besides, it's free to try, and only $20 to buy!
Centiare
Although the client installation instructions would probably be: run setup.exe and reboot.
When designing new systems, I ask the customer to break down the costs of an outage based on scope and time. For example, if the entire system goes down for a minute, what is the cost? An hour? A day? What if only part of the system goes down? I ask the customer to consider all impacts of the outage, beyond simple lack of access. I ask them to put all of this information into a spreadsheet so they can easily play with the numbers and do what-ifs. Most of the time, the customer doesn't have a clue what an outage costs them until they perform this exercise.
Once the customer truly understands the actual costs of an outage, they are generally much more realistic when it comes to designing a system. I also encourage the customer to consider the odds of an outage happening. Yes, total and extended system outages occur. I have more than enough first-hand experiences. But what is the cost and what are the odds over time? Is it worth paying an extra $100k to avoid an outage that may only happen once a year and result in a $10k worst-case scenario?
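The what-if from that last question can be sketched in Python rather than a spreadsheet. Two assumptions here that are not in the comment: the $100k is treated as a yearly cost, and the helper names are made up.

```python
def expected_annual_loss(cost_per_outage, outages_per_year):
    """Expected yearly loss: cost of one outage times how often it is expected."""
    return cost_per_outage * outages_per_year


def mitigation_pays_off(mitigation_cost_per_year, cost_per_outage, outages_per_year):
    """True when the yearly mitigation cost is below the loss it would avoid."""
    return mitigation_cost_per_year < expected_annual_loss(
        cost_per_outage, outages_per_year
    )


# Figures from the comment: $100k of extra spend vs. a $10k outage once a year.
print(mitigation_pays_off(100_000, 10_000, 1))
```

A real exercise would break the outage cost down by scope and duration, as described above; this only shows the shape of the comparison.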
As for several of the comments I observed about planes requiring five-nines uptime, I don't think that is realistic. Planes frequently have system failures resulting in partial outages. That's why they have two engines, multiple wheels, back-up control systems, etc. Also, most of us have experienced flight delays due to mechanical repairs. That's an outage as far as I'm concerned. When a plane can remain in service constantly for all but a few seconds a year, then it will have achieved five-nines. I don't know of any planes that perform to that level.
"Are ve up?"
"Nien."
"Are ve up yet?"
"Nien."
"How about NOW?"
"Nien."
"Vill ve be comink up soon?"
"Nien."
"Vill ve be up next veek?"
"Nien."
God is real unless declared integer
A lot of sites are actually hosted by more than one company. Some big consulting companies sell solutions that are hosted in their hosting environment but maintained by another company. I am currently on a project in which our company was contracted not only to deliver the software but also to maintain the servers. The big consulting company, which has a multimillion dollar monitoring system, always let us down. We ended up writing several scripts to monitor our processes and servers (as we could not install any other software for security reasons). Obviously, we would rather have the monitoring software that we use on our internally hosted sites. However, our scripts do a superb job and haven't let us down. I don't think it is hard to achieve 5 9's if your site is set up properly. All of our downtime on the different environments that we host was actually due to the large consulting company doing things like shutting down firewall ports, a security team kicking the wire to a database, and turning off the power to the hosting facility. (Yes, they have UPSes. However, they are configured wrong, so when they have a scheduled power down, the UPSes do not work.) Unfortunately, I can't say that this is unique in the industry. Another major hardware/consulting/software company that we are partnered with is just as bad.
So how many 9s do we suppose Google has? (given the recent Interview)
99.99999999999999999???
Having the 5 9's of reliability is NOT foolish. It is a reality of life. My particular organization services 40 million web customers, so we can not afford to be down at any time of the day because of the type of service we provide. In fact, last year we made our goal of having the 5-9's, and we did it without needing our disaster recovery (DR) site.
Having a DR plan and being reliable go hand in hand for the most part, however under normal day-to-day business conditions, servers need to be upgraded and things unplugged. You don't switch your entire infrastructure over to a DR site to upgrade your apache web server!! It is for this reason you have redundancy on the network and server level leading out to the Internet (or wherever your customer base resides).
Disasters, on the other hand, do not happen everyday. They happen once a year, maybe.... sometimes once every 2 years. If you live in an area more prone to disasters (like southern California), you may need an alternate site located on the east coast.... but, that is the cost of doing business.
Also, having 5-9's on uptime does NOT mean being accessible to everyone in the world at any time no matter what. Having 5-9's of uptime means that your organization has successfully kept its applications and services available to the Internet. How is it my company's fault if you don't plug your modem into the wall? It's not, so to say that our "reliability" decreases because of an end user being a moron is a stupid statement.
The stats on the server are interesting: either it stopped being "up" or it stopped being monitored before June.
Or did I read the graph wrong?
.
Have you read the moderator guidelines? Well, have you, PUNK? (and I want a Karma: Gnarly option)
our firewall was down and we lost our Internet connection.
rats...
that if the site is not available, the server is down as well? If they built their network with a little sense, the webserver is dedicated... so even if it went down, the business processes are not affected. Duh...
The best weapon of a dictatorship is secrecy, but the best weapon of a democracy should be the weapon of openness.
Check out this article for how you can evaluate the reliability of a building. Simple little calculator for looking at all the different systems involved.
Makes you think about how all the parts relate...
See High Availability for more information.
:-) (And yes, I have Win2000 on my machine and even occasionally am forced to boot into it so I speak from experience).
Coda is the best present option for fs dependent data storage on mostly open-source platforms. We are using Coda for our MySQL table files, ZODB files and logs.
Coda may still be beta software, but if Open Source software like Coda is considered beta code then Windows 2000 + sp2 must have been alpha code.
My $0.02 will always be worth more than your €0.02, so
Last time I saw one of those was in the main HVAC room in the State Office Building in Shreveport, LA. It ran all the big compressors and other HVAC systems in the building. This little computer was on all the time...(ca. 1980's)
er, your former boss's mail @ interceptor.com has a hell of a typo in the first paragraph. The (presumably) actual site at codesta.com is still not available at 19:12 EST. Anyplace else I should look? The Google mirror was funny, but once was enough. Thx
C|N>K
... and what is cost of keeping this up?
The maintenance on the 4500s (if they have multi procs and lots of ram) is prolly 20-30k each annually just by itself.
What about the renew costs on the weblogic support?
How much was that oracle?
Even a basic system as quoted above is 400+k.
--- I do not moderate.
This reminds me of when I was working at a .com called rulespace - there was a construction outfit building a parking lot downstairs - one day they decided to move the big uswest/qwest plywood board from one pillar to another. Alarms never went off because they couldn't call the pagers because they had effectively disconnected all the T1's (including the 2 backups), all the dsl circuits/analogue lines and the T1 going to the telephone switch for the entire building. All the redundancy in the world wouldn't save that mess. As I recall they forked out more money for colocation space at Inflow and moved the more critical systems out there.
The Scenario
Pagers going off. Phones ringing. People shouting fragments of conversations over the tops of cubicles. Groups of people huddled around monitors. Others dashing up and down the hallways, sticking their heads into office doors for just a moment, then scampering along to the next doorway. You are frantically talking on your cell phone, silencing your pager, and yelling into the speakerphone on your desk while typing on two different keyboards attached to three different monitors.
Sound familiar? It's a classic case of the dreaded 'downtime' disease, a terrible ailment where none of your systems work and for reasons you can't always understand. Of course, it typically strikes at the most inopportune moments - the launch of a major product upgrade, or right after announcing your partnerships with 5 of the Fortune 100.
Nobody wants downtime. It's a terrible thing that always involves blood, sweat, tears, and inevitably, a loss of money. This is why when you talk to the upper management of any company with a strategic online initiative you'll be told that the IT group has the highest goals, and that downtime is considered to be an anathema to be stamped out vigorously.
Unfortunately, when you talk to the company's IT manager you commonly hear a different story; the resources to back-up the company's lofty online goals are hard to come by. In fact, with the down swing of the last couple years, combined with the fact that IT isn't, at least directly, a revenue generating entity, IT budgets are being reduced while uptime performance levels are expected to be the same. This can just lead to a death march of extremely over-worked IT personnel, and longer, more numerous, occurrences of system downtime. These goals need to be re-evaluated.
Genesis of the 'Five Nines'
We've all heard the mantra of 'five nines', or 99.999% reliability. Somewhere in the depths of the Internet's 'big bang', when systems were slow and cranky, reliability became a major selling point of why one company's system was 'better' than the competition.
First, people talked about being 'two nines' or 99% reliable. Then someone else would top that, and make their product seem better, claiming 'three nines' (99.9%). Not long after that came 'four nines' (99.99%) and then, near the peak of the dot com era, came 'five nines'.
The herd mentality left no room in which to pitch for investment without the 'five nines' claim. "After all," it was thought, "if everyone else is saying they can provide 'five nines', I'd have to pretend I didn't know what I was doing if I didn't say I could match everyone else's claim."
'Five nines' isn't impossible. It's merely impractical and unnecessary in the world of the Internet. A shocking statement, perhaps, but a truism none-the-less.
We're not talking about launching people into space (which, by the way, is unfortunately done under 'three nines'), or working with nuclear power plants. We're working within the reference of online systems providing services to users both on and off the Internet - nobody dies from a system failure.
The Greasy Steel Bar
Think of uptime as a chin-up bar coated in grease. The higher the reliability desired, the greater the coating of grease. It's clearly tougher to hang on to a higher standard of reliability.
What's not so obvious, but very important, is that the higher the uptime target, the worse one does if not prepared. An IT department capable of three nines faced with a bar that's five nines slippery won't even manage the three nines they are capable of doing.
Let me see. 99.999% uptime means 5 minutes downtime per year. Computers boot in about two minutes. Can you imagine a Microsoft product that only needs three security patches in a year. HAH!
All your database are belong to U.S.
Even buildings don't have 100% 'uptime' anymore. Considering how it crashed, the WTC must have been running windows. It had them all over the place.
___
It's the end of my comment as I know it and I feel fine.
Example: the system I support is mission-critical. If it crashes, it makes the front page of the newspaper the next day. Hundreds of thousands of folks may get delayed by ten or fifteen minutes or a half-hour. The system is finally up to 99.995% availability. And how? By turning off all the backup systems and disentangling all the horrendous software kludges that were put in for the backup system. While the organization was trying to support hot-backup availability, it was crashing every other day. Outside consultants blamed this on flaky hardware and said the system had a life of, at most, a year and a half. Here we are now, four years later, and reliability is better than ever :-).
I like to think that some of the work, and (even better) some of my attitudes have helped get us where we are.
Nein!
for 4 quarters running now. IT IS horribly expensive, both on the hardware and support side but given federal requirements and customer demand we have no choice.....
errr....umm...*whooosh* *whoosh* Is this thing on ?
He puts a seemingly valid mailto: link on a heavily trafficked website. If it wasn't his "former boss" before, it damn well will be now.
Subject is more of a marketing line than anything, but Tandem systems come much closer to 100% availability than anything else that I'm familiar with.
Check this for more info, and this.
Never mind that they were bought by Compaq, and now HP - the architecture still stands. It is one of the great - relatively unknown Silicon Valley companies.
There's some really interesting stuff architecture-wise. Linux-heads would do well to check it out.
USDCO has been featured in other /. articles; not only is their colocation facility located underground, with a high degree of redundancy in their connections, but it's not very expensive, either...
'Course, an on-site solution won't be anywhere near as cheap, but if you can colo, this is the place.
This is not made up...
The uptime counter wrapped at 480ish days, but the ethernet counter is correct.. This box is going on 2 years at 100% uptime.
-- interface e0 (704 days, 9 hours, 15 minutes, 21 seconds) --
8:15pm up 206 days, 14:21 2726120747 NFS ops, 0 CIFS ops, 0 HTTP ops
Go NetAPP!
One company I worked for once upon a time, ConXioN Corp, has a very real statement on their opening page from a major customer:
"ConXioN has not been down in 5 years." And that was in 2001, they still haven't had a hit.
This is simply a matter of consideration and design. No $19.95/month mom&pop ISP is going to put the effort needed into ensuring such uptimes, things like that take redundancy and forward thinking, and that costs money.
While I was at NASA, the network and servers there also had better than 5-9's availability, because the people who ran those servers and that network took the time to care. For us it wasn't a matter of profit, it was a matter of pride.
So while I agree with those who poo-poo that "nothing is so important" that it needs to be up 100% of the time, and I also agree with the reality that there will be downtime of any system at some point, really impressive uptimes are not just possible, they can and do happen anywhere that uptime is a priority.
Long Uptimes are simply a matter of design.
Bob-
The Ludwig von Mises Institute. The reasoning individuals economics
In Myanmar "General Ne Win helped speed his own downfall ... by suddenly declaring much of the Burmese currency worthless and replacing it with bank notes in denominations divisible by his lucky number, nine. Riots followed."
___
"with their freedom lost all virtue lose" - Milton
>
You apparently trained the Swiss air traffic controllers in Zurich.
The Uptime Rules
First, as an introduction to the rules, let's review our terms and terminology.
Definitions
Uptime is the amount of time the entire system is available. By entire system we are saying that an entire transaction can be completed. Just having your web servers running when the needed application server isn't running cannot be defined as uptime.
Downtime is everything else.
Scheduled maintenance downtimes or windows are the periods of time (for example, from 1:00am to 3:00am Monday morning) when an IT team has the option, if they need, to bring down various components in a fashion that causes the system to be incapable of complete functionality.
Reliability is defined as uptime, except that scheduled maintenance downtime is not counted against it. For example, if in a 24 hour period there was an hour of scheduled downtime, but the system was otherwise fully operational for the remaining 23 hours, then the system was 100% reliable.
So how do you translate the 'nines' into acceptable downtime? This chart provides the answer:

'Nines'   Uptime %   Minutes Per Year   Minutes Per Month
Two       99%        5256               438.0
Three     99.9%      526                43.8
Four      99.99%     53                 4.4
Five      99.999%    5                  0.4
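The chart's figures can be reproduced with a short Python sketch, assuming a 365-day year and a month taken as one twelfth of a year. The function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime(uptime_pct):
    """Return (minutes per year, minutes per month) of downtime allowed."""
    per_year = MINUTES_PER_YEAR * (100.0 - uptime_pct) / 100.0
    return per_year, per_year / 12


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year, per_month = allowed_downtime(pct)
    print(f"{pct:>7}%  {per_year:8.1f}  {per_month:6.1f}")
```

The chart rounds the yearly figures to whole minutes, so the output will differ from it in the decimals.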
Rule #1: A great system run poorly is a poor system.
This is the most crucial rule to understand when managing any system. It doesn't matter how much you spent on the hardware, how well designed your database tables are, or if you installed the latest and greatest operating system on the market. If it cannot be managed well, problems ensue.
Users don't see, or care that problems come from your database servers, or your application servers, or your static data caching. What they perceive is one of two states: working or not working. They want to make their reservation, or pay their bill, or just get the weather in Bali, and they want to do it NOW!
Managing with a given level of reliability in mind is about people, hardware, operating and escalation plans, and ultimately, it is about the money to put it all together and keep it running. The cost of reliability is very hard to quantify. Even assuming it is a linear relationship (and few things in life are) it's a staggering relationship in financial terms. In my experience each 'nine' is close to an order of magnitude increase in cost!
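Taken literally, the order-of-magnitude rule of thumb looks like this in Python. This is only an illustration of the rule as stated; the choice of two nines as the baseline and the function name are assumptions, and real costs will not follow a clean power of ten.

```python
def relative_cost(nines, baseline_nines=2):
    """Cost multiplier versus the baseline if each extra nine costs 10x more."""
    return 10 ** (nines - baseline_nines)


for n in (2, 3, 4, 5):
    print(f"{n} nines: about {relative_cost(n)}x the two-nines cost")
```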
The bottom line is this: you need to do an honest assessment of available resources versus intended goals; it is the first step in making sure your great system runs at least as well as you intended.
Rule #2: Five nines is a goal reachable only through both fully automated system management, and rigorously controlled and tested applications.
Scared by four and five nines? Unless you've worked in a true, hardcore, spare no expense data center, you should be!
Let's think about five nines for a moment. 5 minutes a year. That rules out any form of human involvement in fixing problems. After all, even the best humans are known to be distracted for a minute or two into conversation with a co-worker, or a phone ringing.
As an example, let's time a perfectly common scenario, where you have two people monitoring systems. Time the following emulation in your office space:
Now stop the clock. I'm willing to bet your five minutes are up!
Even without a distraction, it's simply not possible for a system of any complexity, to have a problem confirmed, cross checked, and resolved, by a person, within five minutes. Oh, and don't forget about the minute to 90 seconds that you've already lost in monitoring the issue - unless you want alarms going off continuously, you have to set an error threshold that typically consumes 60 seconds or so.
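To see how little is left once the alarm threshold has fired, here is a rough Python sketch. It assumes the whole yearly five nines allowance is spent on a single incident, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def remaining_fix_seconds(uptime_pct, detection_seconds):
    """Seconds left to confirm and fix one outage after the alarm fires,
    if the whole yearly allowance goes to that single incident."""
    budget = SECONDS_PER_YEAR * (100.0 - uptime_pct) / 100.0
    return budget - detection_seconds


for delay in (60, 90):
    left = remaining_fix_seconds(99.999, delay)
    print(f"{delay}s to detect leaves about {left:.0f}s to fix")
```

A second incident in the same year would have nothing left to draw on.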
"Okay," you say, "well, five nines is a lot. How about aiming at four nines?" But are four nines really much different than five? Certainly, it gives you more latitude and time to fix a problem, but not much more. You can afford a single downtime that takes a few minutes to debug, but that's all.
The truth is, unless you have an application that doesn't fail, the odds are that your hardware failures will still occur three to four times a year, which pushes the limit of human intervention. A good rule of thumb is that things never happen when you are watching them - figure that any issue takes at least ten minutes to resolve, even if it is as simple as a human inadvertently powering both sets of redundant systems down, and now they are powering back up.
I have seen several routers that have been up for over 460 days.
:P )
:)
On a side note, an extreme amount of uptime can be achieved even without redundant machines: simply install Linux, and you never have to reboot to "finish the install". Then plug it into a good network and a UPS, cheap, and effective. ( 1gb ethernet preferred
I have yet to see an NT system with 99.0% uptime
I am not trolling, or trying to start a fire, this is just what I see.
When you read Netcraft's full report, it says that the site is running IIS 5 on LINUX! Haw haw haw! Funny!
Right. The author defines "downtime" as any time that the system cannot complete user transactions.
I've taken about 100 airline flights in my life (very roughly). One of them sat on the ground for three hours before take-off due to mechanical trouble, for a 50-minute flight. That was an unavailable system! I could have driven that far in three hours!
So my experience with airlines is just TWO nines of reliability.
Police system needs to be able to access the criminal databank 24/7/365. Unless you want a shotgun in the face when you pull over a driver.
Crime doesn't take holidays.
First off pick a reliable service provider. Somebody that is not SBC. I might have 2 9s on my dsl and 1 9 on my email.
It doesn't matter if the downtime was my fault or theirs...the effect on my user experience was the same.
Try convincing the people who call Tech Support of this simple concept.
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
That's the reason why servers go down and planes do not (well.. most of the time): people expect that the server they get for $3000 will run a corporate mission critical system for years without a crash. Planes cost millions and are tested on hardware every time they're used; servers are not. Do you test your server's hardware every week? (or day?).
Never underestimate the relief of true separation of Religion and State.
Much as I dislike MS, we have an NT4 Exchange server that has been running continuously, no reboots needed, since January. Its uptimes match that of the Linux server that is the firewall. Since the NT server is completely firewalled from the Internet (SMTP mail is routed in and out through the Linux box) and runs no Net-addressable services, it needs no security patches for IIS.
The point is that MS can validly claim those uptimes in certain circumstances. Don't ever let your dislike of something blind you to the facts about it. That is bad engineering.
ben_ the technologist and platform agnostic
Okay, is there anyone here who understands that 99.999% reliability means that you have 0.001% total UNSCHEDULED downtime, so downtime due to crashes and what not? 99.999% never talks about the downtime due to maintenance, which is, after all, scheduled downtime.
You are not making the distinction between "server uptime" and "service uptime". When people talk about 99.something% uptime, they are usually referring to "service uptime". With proper hardware (redundancy etc ..) you can reboot servers, change disks, memory and even routers and it won't cost you even 1 second of "service downtime".
echo '[q]sa[ln0=aln80~Psnlbx]16isb572CCB9AE9DB03273snlbxq' |dc
Statistically, for something to have a guaranteed (95%) uptime of 1 month it must have no downtime in 95 months!
5+ nines are great but you can still go down 1 pico second an hour, that's a hell of a lot of outage.
thank God the internet isn't a human right.
It's a joke. By the time the call got to me, I got the person on the phone, and got a description of the problem...BUZZZZ!!! Times up!!! Of course, I was asked to support a system that I had no formal training on, that I didn't design, install, or ever see in person....support was....difficult. My dot com layoff was, in sooooo many ways, the best thing that could have happened to me!!
I have seen this in reality. On my previous job the systems were migrated to HP systems, which claimed 99.999%. One of the first things which broke was the redundant Fibre Channel controller. It took two days to fix it.
As a rule of thumb, for each extra nine, add an extra zero...
Yes but NT is probably the only mature server OS Microsoft has, and they only took several years to get there...
I think 5.
"If I were important, I would have a sig file..."
In many cases, the end goal should be for the users to experience however-many-9's you want -- but that doesn't mean that your administrators are only going to have to deal with that much downtime on individual systems. In fact, they'll have quite a lot of downtime to cope with -- but you have to be sufficiently redundant that you can still provide the service in the presence of individual system/link failures.
Besides that, a good system will be designed to degrade gracefully in the event of component failures. Your users will be a lot happier if you can tell them 'this service is temporarily unavailable' than if the service just disappears.
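A minimal sketch of what "degrade gracefully" can look like in code: wrap the call to the backend and hand the user a clear message when it fails, rather than letting the request die. `handle_request` and the backend callable are hypothetical names, not any particular framework's API.

```python
# Graceful degradation: answer with a clear notice when the backend fails.
# `handle_request` and `backend` are hypothetical names for illustration.
from typing import Callable

UNAVAILABLE_MESSAGE = "This service is temporarily unavailable."


def handle_request(backend: Callable[[], str]) -> str:
    """Return the backend's response, or a friendly notice if it fails."""
    try:
        return backend()
    except Exception:
        # A real system would also log the failure and alert someone.
        return UNAVAILABLE_MESSAGE


def healthy_backend() -> str:
    return "search results"


def broken_backend() -> str:
    raise ConnectionError("database unreachable")


if __name__ == "__main__":
    print(handle_request(healthy_backend))
    print(handle_request(broken_backend))
```

Catching every exception is deliberately blunt here. In practice you would catch only the failures you expect and let real bugs surface.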
The only thing that is exceptionally difficult is having a redundant database synchronized in near-real time across multiple sites. That's where you need to spend the big money on clustering. Front-end servers can be as redundant as you want, and mid-level app servers can be clustered or independent, but the db has got to be there for the rest of it to work.
But if you can't convince management that the goal is that the users' experience of downtime be minimized (and you can measure it), then you're going to have a hard time asking for more money when the apparent amount of time spent fixing broken systems goes up rather than down with each upgrade. Sure, you want to minimize the individual system downtime, but look at services like Google where they have tens of thousands of individual systems with dozens or more down at any given time -- they've just designed their stuff well enough that the service can keep chugging along in the presence of failures.
RDF will only be effective inside big companies
:) I don't think you realise the significance of RDF. XML is about tags; RDF is about semantic content, and one step up from there, DAML enables logical deductions. As TBL named it, "The Semantic Web" is far more useful than the typographical web we have at the moment. As for >1MLOC, well, you need other services like authentication, gateways to translate the content of other systems into RDF, etc., so you get there very quickly. As for China and other countries, we can just batter them with the WTO rules, as they will be hurting our business; they might even fall foul of other laws regarding network and computer security if they interfere. Anyhow, nice to know you got my email and that it inspired a response.
Unless a "God" starts defining ontologies, and of course borrowing from others that already exist, plus it's a much smaller step from RDF to DAML than from XML to RDF...think of the possibilities... and there's some compression and encryption involved to get to the wireline data, I guess it's good my ontologies start with the most important forms of data... MP3 tags
Any sufficiently advanced man is indistinguishable from God