When the Power Goes Out At Google
1sockchuck writes "What happens when the power goes out in one of Google's mighty data centers? The company has issued an incident report on a Feb. 24 outage for Google App Engine, which went offline when an entire data center lost power. The post-mortem outlines what went wrong and why, lessons learned and steps taken, which include additional training and documentation for staff and new datastore configurations for App Engine. Google is earning strong reviews for its openness, which is being hailed as an excellent model for industry outage reports. At the other end of the spectrum is Australian host Datacom, where executives are denying that a Melbourne data center experienced water damage during weekend flooding, forcing tech media to document the outage via photos, user stories and emails from the NOC."
Obviously if the power goes out, and the service goes offline, then it WASN'T a cloud. If it's a cloud, it can't go down. If it goes down, it wasn't a cloud.
What's there to get?
Glenn Beck, is that you!?
...but it was stored on Google Docs.
Whoosh.
You are thinking too small-scale. Of course there are people on-site. Google has data centers all over the world -- how are they going to drive there?
http://en.wikipedia.org/wiki/DUKW
'nuff said.
Yeah, and when the guys at the Jesus Christ of Datacenters that you describe have to do something like, say, switch from generator to utility power manually, and the document that details that process is 18 months old and refers to electrical panels that don't exist anymore, you get what you had here. A failure of fail-over procedures. If the lowliest help desk / operator can't at least understand the documentation you've written, then you've failed.
The only equipment failure listed is a "power failure." Granted, that can be as simple as "car hits a telephone pole and knocks out a chunk of the grid, leaving your office in the dark", which should be an easily survivable event. But how do you handle a failure like "50 kVA inline UPS shits the bed leaving nothing but a smoking chassis that no one wants to go anywhere near?" or "HVAC unit fails on Christmas Eve when only a skeleton staff is on duty and fills the raised floor with 8 inches of water, shorting everything within an inch of its life and making it impossible to bring any hosted services back online?"
There's nothing like a little bit of "we had no idea these three or four unrelated circumstances could happen simultaneously" disaster porn to make you realize that A. Outage / DR / fail-over planning is more than just throwing money at stuff (UPSes, generators, redundant lines, etc.) and B. No matter how good your plan is, it will never be 100% effective.
There are some people that if they don't know, you can't tell 'em.
Don't have all your shit in one data center, maybe? I'd have thought that one would be pretty fundamental. Of course, knowing Google they're going to decide that what they really need is power generation right on site, then they'll just pop off and invent nuclear fusion before lunch.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?