Data Center Power Failures Mount
1sockchuck writes "It was a bad week to be a piece of electrical equipment inside a major data center. There have been five major incidents in the past week in which generator or UPS failures have caused data center power outages that left customers offline. Generators were apparently the culprit in a Rackspace outage in Dallas and a fire at Fisher Plaza in Seattle (which disrupted e-commerce Friday), while UPS units were cited in brief outages at Equinix data centers in Sydney and Paris on Thursday and a fire at 151 Front Street in Toronto early Sunday. Google App Engine also had a lengthy outage Thursday, but it was attributed to a data store failure."
I'm guessing that the majority of these were caused by leaks or spilled drinks. If only you guys had listened to me and gotten Zorbeez(tm)[SOAKS UP 10x ITS OWN WEIGHT!].
-B. Mays
Because out of all of the data centers in the world, there were problems at five? Riiiiight. Good reporting, Slashdot.
Can I sign up for broken water main notices here, too, or do I need to go to another website?
Yes, that's clearly Twitter territory.
Quiet troll. Slashdot has broken several stories over the years and most of them started as these little coincidences. Go back to reading CNN if you want your news filtered.
6th Street Radio @ddombrowsky
Installing a generator or UPS causes an accident sooner than you'd experience without having a generator or UPS. Safety measures cause accidents. See, not(correlation does not imply causation). We slashdotters have always known this to be true. Heretics beware!
"A blown transformer appears to be the culprit"
I'd heard the new movie was crude, but I didn't realize how crude it actually was!
We had an outage today. Our servers are hosted with Genesis Hosting, which suffered an outage from their ISP, XO Communications in Chicago. Anyone know what happened?
Outages happen more than that. We have been in several data centers; ThePlanet and The Fortress have both had major outages in the last two years, which have affected business.
30% off web hosting. Coupon code "SLASHDOT".
Anyone seriously concerned about their web applications will have redundant sites and a way to share the load. Few people pay attention to the fact that DNS requires geographically disparate DNS servers*, such that even in the event of a datacenter fire (or nuclear attack), there will still be an answer for your zone. Couple this with a few smaller server farms in separate places, and there won't be any problems. I went to look it up on Wikipedia, but didn't find where it is required for authoritative DNS servers to be in separate geographic regions. Where did I read this, DNS and BIND?
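One rough way to sanity-check the point about separated DNS servers is to see whether a zone's nameserver addresses at least sit in different networks. This is a minimal sketch, assuming the addresses have already been looked up. The function names and the /24 grouping are illustrative assumptions, and different prefixes are only a weak proxy for real geographic separation.

```python
import ipaddress

def distinct_networks(nameserver_ips, prefix_len=24):
    """Return the set of IPv4 networks (at prefix_len) that contain the given addresses."""
    networks = set()
    for ip in nameserver_ips:
        # strict=False accepts a host address and yields its enclosing network
        networks.add(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))
    return networks

def looks_diverse(nameserver_ips, prefix_len=24):
    """True if the nameservers fall into at least two different networks."""
    return len(distinct_networks(nameserver_ips, prefix_len)) >= 2

# Documentation-range addresses, used purely as placeholders
same_rack = ["192.0.2.1", "192.0.2.53"]
spread_out = ["192.0.2.1", "198.51.100.7"]
print(looks_diverse(same_rack))   # both in 192.0.2.0/24
print(looks_diverse(spread_out))  # two separate /24s
```

Passing this check does not prove the servers are in different cities, but failing it is a strong hint that one datacenter fire takes out the whole zone.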
Zhrodague.net - I do projects and stuff too.
Indeed, 18465. And we shall get off your lawn as well.
Not if the water company runs Linux!
My wild guess is they are deferring preventative maintenance on these data centers, so we are seeing these major outages now. Fire suppression, UPS, transfer switches, generators, distribution panels, transformers, network gear, servers, storage devices and other gear will fail if you don't maintain them properly. As loads increase, the equipment will fail earlier, and my guess is that people have pushed this equipment beyond its lifespan or load rating.
Are they overdrawing power from the units? Poor batteries that blow up? Not having the right wire gauge? Not cooling the power buses and switches?
Surprise surprise...there's a downside to consolidation. Hey morons, the internet was invented as a means to ensure redundant communications paths given nuclear warfare. The old central switch (physical switching) was seen as too cumbersome and vulnerable. Now that we have wonderfully redundant communications, and have done away with most of the downsides of physically distributed systems, morons are building logically centralized systems.
NEWSFLASH - Redundant communications and physical virtualization do very little for you if you build a logical mainframe.
Truly distributed systems must be physically AND logically DISTRIBUTED with redundant comms paths in order to gain the full benefits of decentralization. (e.g. Distributed isn't distributed if all your authentication is done at one site or all your traffic must pass through .)
Sure, I heard that Genesis Hosting suffered an outage from their ISP, XO Communications in Chicago.
... saying that it's time to reconsider cost cutting measures. In 15 years in the field I never saw a well designed and well maintained critical power system drop its load. I saw many poorly designed and/or poorly maintained systems drop loads, even catching fire in the process. One such fire in a poorly designed and poorly maintained system took the entire building with it, data center and all. The fire suppression system in that one was never upgraded to meet the needs of the "repurposed space" which was originally a light industrial/office space.
Warning: This signature may offend some viewers.
See story of Qld Health datacentre disaster on ZDnet recently:
http://www.zdnet.com.au/news/hardware/soa/Horror-story-Qld-Health-datacentre-disaster/0,130061702,339297206,00.htm
I'm one of the guys that services the security system in Fisher Plaza. The damn sprinklers killed half my panels near the scene. Turns out they use gas suppression methods in the data centers, not so much in the utility closets. And the city of Seattle REQUIRES sprinklers throughout the building, even right over the precious, precious servers. In defense of the staff there, however, they do not keep them all charged 24/7. Other than that, I have no more info, as they're pretty locked down.
Because out of all of the data centers in the world, there were problems at five? Riiiiight. Good reporting, Slashdot.
Can I sign up for broken water main notices here, too, or do I need to go to another website?
100+ million people daily are "serviced" by these 5 data centers.
Companies such as authorize.net were COMPLETELY unavailable for payments to hundreds of thousands of webmasters' sites (ya know, the people who make money).
If you don't think this is serious news then you are still living at home.
Ya that's what I thought.
I work for a company that makes high-end datacenter power systems. This should be good for business once the trade rags the CxOs read report on the millions and millions in lost business.
Or at least it will keep the sales staff busy writing up quotes that will be rejected for being too expensive (although much less than the cost of a prolonged outage...)
Any insufficiently advanced magic is indistinguishable from technology.
...what is the normal (historical) rate of data center power failures, and how does the recent spate compare? Five in a week sounds severe, but what's the normal worldwide average? I can imagine that with thousands of data centers around the globe, there's likely a serious failure occurring somewhere in the world once every couple of days.
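A back-of-the-envelope way to put a number on that question is sketched below. It assumes failures are independent and follow a Poisson distribution, and it takes the "once every couple of days" guess above at face value (3.5 per week). Both are assumptions, not measured data.

```python
import math

def prob_at_least(k, rate):
    """P(X >= k) for X ~ Poisson(rate)."""
    below = sum(math.exp(-rate) * rate**i / math.factorial(i) for i in range(k))
    return 1.0 - below

weekly_rate = 7 / 2  # assumed: one serious failure somewhere every two days
p = prob_at_least(5, weekly_rate)
print(f"P(5 or more in a week) = {p:.2f}")
```

Under those assumptions a five-failure week comes up roughly one week in four, so it would not be remarkable by itself. With a lower true rate the same week would look much more unusual, which is why the historical baseline matters.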
"Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased
This is why you should look into companies with geographical diversity such as Ubiquity (http://www.ubiquityservers.com) or various other companies in the data center market.
"Major" data center or not, the one your company employing you at the time is using is the important one.
In my experience, data center backups fail about a third of the time power is interrupted somewhere.
Servers in an Oakland California center were the victim of the loss of one of three power phases, while the monitoring that would have switched over to the diesel generators was looking at the power level of other phases. UPS systems ran out of power. An extra level of redundancy in the form of rack mount UPSes allowed servers to shut down properly despite the data center's loss of routing.
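The failure described here, where monitoring watched only some of the phases, suggests checking every phase before deciding whether to switch over. A minimal sketch of that check follows. The function names, the 120 V nominal value, and the 10% tolerance are all illustrative assumptions, not details of the Oakland facility.

```python
def failed_phases(phase_voltages, nominal=120.0, tolerance=0.10):
    """Return the names of phases whose voltage is outside nominal +/- tolerance."""
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    return [name for name, volts in phase_voltages.items()
            if not (low <= volts <= high)]

def should_start_generator(phase_voltages, nominal=120.0, tolerance=0.10):
    """Transfer to generator if ANY phase is out of range, not just a sampled one."""
    return bool(failed_phases(phase_voltages, nominal, tolerance))

readings = {"A": 120.3, "B": 119.1, "C": 0.0}  # hypothetical: phase C lost
print(failed_phases(readings))
print(should_start_generator(readings))
```

The point of the design is simply that the decision is made over all three readings, so the loss of a single phase cannot hide behind two healthy ones.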
Data center #2 was the victim of a simple power outage and immediate failure of the main data center UPS system. According to a security guard I talked to, "it exploded". The diesel backup never had a chance to start.
Then the doubly-sourced Power Distribution Unit supplying a rack at a third ISP failed in a way that turned off both sources supplying the servers.
Hint: Add an extra level of UPS redundancy and safe shutdown software daemons, at least. Multiple data centers if you need more nines.
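The core decision inside a safe-shutdown daemon can be sketched in a few lines. This assumes something else supplies the UPS status (whether it is on battery and the estimated runtime left). The function name, parameters, and default timings are hypothetical and would need tuning for real hardware.

```python
def next_action(on_battery, runtime_left_s, shutdown_time_s=120, margin_s=60):
    """Decide what a server should do given the current UPS status.

    on_battery      -- True if the UPS reports that mains power is gone
    runtime_left_s  -- UPS estimate of remaining battery runtime, in seconds
    shutdown_time_s -- assumed time the server needs for a clean shutdown
    margin_s        -- assumed safety margin on top of the shutdown time
    """
    if not on_battery:
        return "run"
    if runtime_left_s <= shutdown_time_s + margin_s:
        return "shutdown"
    return "wait"

print(next_action(on_battery=False, runtime_left_s=900))
print(next_action(on_battery=True, runtime_left_s=900))
print(next_action(on_battery=True, runtime_left_s=150))
```

A real daemon would poll the UPS in a loop and call the operating system's shutdown command when this returns "shutdown". The margin exists because runtime estimates from aging batteries tend to be optimistic.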
authorize.net are apparently complete idiots. If they are that large and all their equipment is in one datacenter, then that's bordering on insane. Heck, my little company of under 1k employees has two facilities. Anyone who should be running a site with 100k+ customers knows better.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Frankly, if data centers are going to proclaim their redundancy, they should test by power failing the entire data center once every two weeks at a minimum. A data center that goes down twice in a month would get ahead of any issue pretty fast. Lessons learned from the staff and the management are very valuable.
The marketing messaging:
"We power fail our data center every two weeks to ensure our backups work..."
Sound scary? Just think about the data center that has never been through this process. At that point, the wet paper bag you tried to market your way out of dried rather quickly and you are now faced with the prospect of slapping around inside of a zip-lock.
Lindsay Blanton
RadioReference.com
We're a Rackspace customer in their DFW datacenter. This is the third power-related outage they've had in the last two years at that supposedly world-class facility.
The first wasn't really their fault: truck driver with health condition runs into their transformers. Generators kick in, but chillers don't re-start quickly enough. Temps skyrocket in minutes, emergency shutdowns. Maybe the transformers should have had some $50 concrete pylons surrounding them?
The second outage was the result of a botched generator upgrade.
This latest outage was the result of a botched UPS maintenance.
None of the outages was long enough to trigger our failover policy to our DR site, but our customers definitely noticed.
While their messaging has been very open and honest about the problems, and the SLA credits have been immediate, we pay them nearly $20K per month. Needless to say, we are shopping, and looking into a "multiple cheap colos" architecture instead of "Tier-1 managed hosting". Nothing beats geographic redundancy.
My PC. It is 3400 MHz and all the data center host sites of my interest. For the first time in 21020 hours of perfect runtime the UPS saved it in a series of northeast thunderstorms coinciding with the outages around the globe. The permeating physics were too much for the world, reverberating to an unintentional master of perfect float, hardly busy, waiting to send a message. Believe it or not.
The Fisher Plaza story is big. I happened to be walking by right after it happened, noticed the generators running and went, 'Hm-m-m'. We've toured their facility in the past, and wanted to use them, but they didn't have capacity at the time. They seemed first rate. If a first tier provider can have this happen...
I was taught to respect my elders. The trouble is, it's getting harder and harder to find some.
cheap, fast, reliable
pick one
It was a terrible movie, not the kind I like to watch anyway, but for some reason I felt compelled to view the damned thing twice in two days.
The Big Bad Threat in the film was all about something called a Fire Sale, ("It All Has To Go"), where the population's fear level is spiked up into a panic by a group of bad guys deliberately crashing the national infrastructure by way of hacking all the most important computer systems. --All to create a giant distraction so that the stock market could be plundered by thievesssssss! The story is weirdly in keeping with the theme of this Slashdot article.
Consider: Everything is a metaphor in this big old world of ours where matter and energy are based on nothing more than space and the vague notion that there is something which exists. With no matter to speak of, the whole of reality is little more than a hologram, and that being the case, the power of thought and awareness holds about the same amount of substance, if not more. --The subconscious is connected quite well to the whole affair, and events of some magnitude like today's server outages will tend to send ripples through reality so that poor shmucks like me find themselves watching in fascination stupid movies they hate without knowing why.
All I know for sure is that Bruce Willis was a lot more fun to watch when he was playing opposite Cybill Shepherd.
-FL
by right after it happened, noticed the generators running and went, 'Hm-m-m
You're some kind of witch aren't you? You broke my internet!
BURN THE WITCH!!!
It seems you had some bad experiences lately.
Under normal circumstances you can always have 2 out of these 3, regardless of which topic we are speaking of (datacenters, code quality, cars*, ... - just name it)
*) and not even a bad analogy :)
where is badanalogyguy when you need him?
All these data centers failed at roughly the same time as the sunspots returned, but that's just a coincidence, right?
Craig Milo Rogers
pay down the road. Transformers going out? Guess where it was built? Until ppl quit buying inferior made products, we will see more and more of these issues come up.
I'm almost thinking of taking UPS out of the loop here. They cause nearly all the downtime we have. It would be better to just let the machines power off rather than allowing the UPSs to CAUSE the machines to be taken offline. At least if the UPS isn't in circuit, the machines power back up again when the power comes back, but if there's a fault with the UPS or its batteries, then the machines stay offline until the batteries have been replaced.
Why the hell the idiots that design UPSs seem to think it's a good idea to prevent them turning on if they sense a problem with the batteries is beyond me. Why not let the machines power back up but just make a loud beeping noise until the batteries are fixed. Don't they realise that most of the time the UPS will only properly test the batteries when there's an actual power cut? On APC units (and most others) the periodic self test function uses your SERVERS as the test load! So if the batteries can't deliver the current, your servers get turned off just due to a routine TEST! Why can't they fit an internal dummy load like a small ceramic heater or something - it's only on about 5 seconds so it won't even get hot.
Yes, APC, I'm talking to you. I've even switched suppliers thinking it must only affect APC units, but it seems all others I've tried have the same issues.
Safety measures drastically reduce the chance of accidents, while being unprepared, especially if it's just a brief period of unpreparedness, greatly increases the chance of an accident. This makes you wonder if the safety measures were really worth it, but at least you won't have any accidents as long as you remain prepared.
"When information is power, privacy is freedom" - Jah-Wren Ryel
If by "broken several stories" you mean "posted links to stories that were broken elsewhere" then you may have a very good point.
Browse at -1 to keep an eye out for abuses.
My sweet, sweet inter-tubes. You are a cruel and fickle Mistress