Data Center Power Failures Mount
1sockchuck writes "It was a bad week to be a piece of electrical equipment inside a major data center. There have been five major incidents in the past week in which generator or UPS failures have caused data center power outages that left customers offline. Generators were apparently the culprit in a Rackspace outage in Dallas and a fire at Fisher Plaza in Seattle (which disrupted e-commerce Friday), while UPS units were cited in brief outages at Equinix data centers in Sydney and Paris on Thursday and a fire at 151 Front Street in Toronto early Sunday. Google App Engine also had a lengthy outage Thursday, but it was attributed to a data store failure."
I'm guessing that the majority of these were caused by leaks or spilled drinks. If only you guys had listened to me and gotten Zorbeez(tm)[SOAKS UP 10x ITS OWN WEIGHT!].
-B. Mays
Yes, that's clearly Twitter territory.
"A blown transformer appears to be the culprit"
I'd heard the new movie was crude, but I didn't realize how crude it actually was!
Outages happen more than that. We have been in several data centers, ThePlanet and The Fortress both have had major outages in the last two years which has affected business.
30% off web hosting. Coupon code "SLASHDOT".
Anyone seriously oncerned about their web applications, will have redundant sites, and a way to share the load. Few people pay attention to the fact that DNS requires geographically disparate DNS servers *, such that even in the event of a datacenter fire (or nuclear attack), there will still be an answer for your zone. Couple this with a few smaller server farms in separate places, and there won't be any problems. I went to look it up on wikipedia, but didn't find out where it is required for authoritative DNS servers to be in separate geographic regions. Where did I read this, DNS and BIND?
Zhrodague.net - I do projects and stuff too.
Indeed, 18465. And we shall get off your lawn as well.
My wild guess is they are deferring preventative maintenance on these data centers so we are seeing these major outages now. Fire suppression, UPS, transfer switches, generators, distribution panels, transformers, network gear, server, storage devices and other gear will fail if you don't maintain them properly. As loads increase, the equipment will fail earlier and my guess the people have pushed the limit of this equipment beyond they the lifespan of load rating.
Surprise surprise...there's a downside to consolidation. Hey morons, the internet was invented as a means to ensure redundant communications paths given nuclear warfare. The old central switch (physical switching) was seen as too cumbersome and vulnerable. Now that we have wonderfully redundant communications, and have done away with most of the downsides of physically distributed systems, morons are building logically centralized systems.
NEWSFLASH - Redundant communications and physical virtualization do very little for you if you build a logical mainframe.
Truly distributed systems must be physically AND logically DISTRIBUTED with redundant comms paths in order to gain the full benefits of decentralization. (e.g. Distributed isn't distributed if all your authentication is done at one site or all your traffic must pass through .)
Sure, I heard that Genesis Hosting suffered an outage from their ISP; XO Communications in Chicago.
... saying that it's time to reconsider cost cutting measures. In 15 years in the field I never saw a well designed and well maintained critical power system drop its load. I saw many poorly designed and/or poorly maintained systems drop loads, even catching fire in the process. One such fire in a poorly designed and poorly maintained system took the entire building with it, data center and all. The fire suppression system in that one was never upgraded to meet the needs of the "repurposed space" which was originally a light industrial/office space.
Warning: This signature may offend some viewers.
I'm one of the guys that services the security system in Fisher Plaza. The damn sprinklers killed half my panels near the scene. Turns out they use gas suppression methods in the data centers, not so much in the utility closets. And the city of Seattle REQUIRES sprinklers throughout the building, even right over the precious, precious servers. In defense of the staff there however, they do not keep them all charged 24/7. Other then that, I have no more info, as they're pretty locked down.
Because out of all of the data centers in the world, there were problems at five? Riiiiight. Good reporting, Slashdot.
Can I sign up for broken water main notices here, too, or do I need to go to another website?
100+ million people daily are "serviced" by these 5 data centers.
Company's such as authorize.net where COMPLETELY unavailable for payments to hundred of thousands of webmasters sites (ya know the people who make money)
If you don't think this is serious news then you are still living at home.
Ya that's what I thought.
...what is the normal (historical) rate of data center power failures, and how does the recent spate compare? Five in a week sounds severe, but what's the normal worldwide average? I can imagine that with thousands of data centers around the globe, there's likely a serious failure occurring somewhere in the world once every couple of days.
"Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased
"Major" data center or not, the one your company employing you at the time is using is the important one.
In my experiences, data center backups fail about a third the time power is interupted somewhere.
Servers in an Oakland California center were the victim of the loss of one of three power phases, while the monitoring that would have switched over to the diesel generators was looking at the power level of other phases. UPS systems ran out of power. An extra level of redundancy in the form of rack mount UPSes allowed servers to shut down properly despite the data center's loss of routing.
Data center #2 was the victim of a simple power outage and immediate failure of the main data center UPS system. According to a security guard I talked to, "it exploded". The diesel backup never had a chance to start.
Then the doubly-sourced Power Distribution Unit supplying a rack at a third ISP failed in a way that turned off both sources supplying the servers.
Hint: Add an extra level of UPS redundancy and safe shutdown software daemons, at least. Multiple data centers if you need more nines.
authorize.net are apparently complete idiots, if they are that large and all their equipment is in one datacenter then that's bordering on insane. Heck, my little company of under 1k employees has two facilities. Anyone who's should be running a site with 100k+ customers knows better.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Frankly, if data centers are going to proclaim their redundancy, they should test by power failing the entire data center once every two weeks at a minimum. A data center that goes down twice in a month would get ahead of any issue pretty fast. Lessons learned from the staff and the management are very valuable.
The marketing messaging:
"We power fail our data center every two weeks to ensure our backups work..."
Sound scary? Just think about the data center that has never been through this process. at that point, the wet paper bag you tried to market your way out of dried rather quickly and you are now faced with the prospect of slapping around inside of a zip-lock.
Lindsay Blanton
RadioReference.com
We're a Rackspace customer in their DFW datacenter. This is the third power-related outage they've had in the last two years at that supposedly world-class facility.
The first wasn't really their fault: truck driver with health condition runs into their transformers. Generators kick in, but chillers don't re-start quickly enough. Temps skyrocket in minutes, emergency shutdowns. Maybe the transformes should have had some $50 concrete pylons surrounding them?
The second outage was the result of a botched generator upgrade.
This latest outage was the result of a botched UPS maintenance.
None of the outages was long enough to trigger our failover policy to our DR site, but our customers definitely noticed.
While their messaging has been very open and honest about the problems, and the SLA credits have been immediate, we pay them nearly $20K per month. Nedless to say, we are shopping, and looking into a "multiple cheap colos" architecture instead of "Tier-1 managed hosting". Nothing beats geographic redundancy.
All these data centers failed at roughly the same time as the sunspots returned, but that's just a coincidence, right?
Craig Milo Rogers