Multiple Sites Down In SF Power Outage
corewtfux writes with word of a major outage apparently centered on 365 Main, a datacenter on the edge of San Francisco's Financial District. Valleywag initially claimed that a drunken person had gotten in and damaged 40 racks, but an update from Technorati's Dave Sifry says the problem is a widespread power outage. Sites affected include Technorati, Netflix (these display nice "We're Dead" pages), Typepad, LiveJournal, Sun.com, and Craigslist (these just time out).
Don't these large sites have failover-capable, redundant servers in multiple physical locations? Why should a failure in one rack, one room, or heck, even one state for the giant sites, affect them?
I don't respond to AC's.
I've been told there was no fuel left at the time.
Now, the only remaining question is: How did the drunk guy get in there?
Any data center that advertises high availability should be testing that sort of thing on a regular basis. It's possible that they could fail switchover even if they are being regularly tested, but it is unlikely.
If the "power outage" theory is correct and the "drunken employee" theory is incorrect, as a customer I'd be pissed that the data center I pay tons of money to can't keep my site up in the event of a power outage, which is one of the main perks of hosting at a data center in the first place.
Where's the +1 "100% fucking right" mod option?
Whaddya bet some poor mid-level admin gets blamed and tossed for this? And the upper-management guy who ignored the recommendations for testing or redundancy still gets his bonus for good fiscal performance.
If I knew the wedgies I gave you back in 6th grade would have resulted in this . . . I might have taken a moment's pause.
For me it would be the other way around. A technology failure I could understand. Letting a drunk employee near my server rack, I could not.
If you want news from today, you have to come back tomorrow.
Wait, you think it's OK to advertise five nines reliability, UPS backup, and generator backup, only to find out that the systems were not being properly tested to meet the advertised capability?
What is "high availability"? Over a year, 99% uptime is about 3.65 days down. 99.9% is about 9 hours down. 99.99% is nearly an hour down. Certainly these sites can still be considered 3 nines high availability.
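For anyone who wants to check the nines arithmetic, here is a rough sketch. It assumes a 365-day year and treats availability as a plain percentage of wall-clock time; the function name is made up for illustration and is not from any SLA tool.

```python
# Assumption: a non-leap year of 365 days, i.e. 8760 hours.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year allowed at the given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Two, three, four, and five nines.
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime -> {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

By this measure a single multi-hour outage fits inside a three nines budget for the year, but uses up many years' worth of a five nines budget.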
Much of Europe uses 220V/50Hz.
The drunk thing is way outside the control of the administrators. Testing the failover is something they can do, and if something doesn't work, they can fix it.
"...I would think these large sites are going to pitch a bitch..."
I would think these large sites would understand the concept of not putting all your eggs (servers) in one basket. There is a reason why smart companies use replication and clustering, and datacenters spread across the country.
Now, now... LiveJournal is back up.
Have you ever been in a data center? The cabinets are all locked. To get the key, you have to sign it out from security. Ditto for the cages. It wouldn't just require a drunken/disgruntled employee, it would require a conspiracy of them: security staff to hand over the keys and the disgruntled employees to do the misdeeds.
Well, there is one way around that: you walk over to the EPO button and give it a whack. It'll take down the whole floor. Rinse, lather, repeat on other floors. How many do you think you can do before someone stops you?
Anyway, my employer has a lot of stuff in 365 Main. We're not one of the companies mentioned in TFA, but we're certainly one of the ones affected. Within a couple minutes of the outage, we knew we'd lost everything we had there and several of our sysadmins grabbed their gear and headed for the city to go join that line outside of 365. By the time they left the building we had confirmation that it was a power outage.
Power was already back on when they got inside and they immediately brought up anything that wasn't already up and tested it all to make sure it was OK. To say the least, this is inconsistent with (tall) tales of somebody going apeshit on 40 racks.
If the heater is really that important, it should be reporting back at regular intervals that it's on, and when the signal isn't being received anymore there should be a process so that somebody calls and asks what's up. If somebody wanted to turn it off and couldn't, they'd just unplug it.
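The "report back at regular intervals" idea is basically a heartbeat with a timeout. A minimal sketch of the receiving side, assuming the device can send some kind of periodic "I'm on" signal; the class name, the 30-second timeout, and the injectable clock are all hypothetical choices for illustration, not anything a real heater or data center product is known to use.

```python
import time


class HeartbeatMonitor:
    """Tracks the last time a device checked in and flags it when overdue."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_seen = clock()

    def beat(self) -> None:
        """Call this whenever the device reports that it is on."""
        self._last_seen = self._clock()

    def is_overdue(self) -> bool:
        """True once no report has arrived within the timeout."""
        return self._clock() - self._last_seen > self.timeout_seconds


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=30)
    monitor.beat()
    if monitor.is_overdue():
        # In practice this is where somebody gets paged to call and ask what's up.
        print("No heartbeat received; escalate to a human.")
    else:
        print("Device checked in recently.")
```

The point of putting the timeout on the receiving side is that a device that has been unplugged can't report its own failure; silence is the alarm.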