Multiple Sites Down In SF Power Outage
corewtfux writes with word of a major outage apparently centered on 365 Main, a datacenter on the edge of San Francisco's Financial District. Valleywag initially claimed that a drunken person had gotten in and damaged 40 racks, but an update from Technorati's Dave Sifry says the problem is a widespread power outage. Sites affected include Technorati, Netflix (these display nice "We're Dead" pages), Typepad, LiveJournal, Sun.com, and Craigslist (these just time out).
I try this and I get "nothing for you to see here"... guess it's affecting slashdot too? ;-)
I can verify that it affected much of the Financial District here in SF. We had the power go out 3 times. Seems to be back now. Haven't heard any explanation yet.
im in ur datacentr
trashin ur racks
Don't these large sites have failover capable, redundant servers in multiple physical locations? Why should a failure in one rack, one room, or heck, even one state for the giant sites, affect them?
I don't respond to AC's.
Gamefaqs/Gamespot is also down. I wonder if it's related.
"There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy."
They probably just didn't kick in. Had the same problem at Internap in Seattle a few years ago. Power was cut to the building and the UPSs failed to switch over.
Does this mean backup generators have failed or is the fault somewhere outside the datacenter? Time to start shopping.
Quack, quack.
It's interesting that so many major sites would go down in a local power outage? Are they all sharing one data center in SF? If so, why don't they have co-locations in other cities?
~ I am logged on, therefore I am.
um, like i said.
They say the first thing to go is your penis. Well, it's either that or your brain. I forget which...
(That's the fantasy sports site that works like a stock market, if you didn't know.)
What sound do people on rollercoasters make? Hint: it's not Xbox 360.
I can hear it now, the sound of a million emos all finally committing suicide.
At least 20,000 without power in downtown S.F.
Marisa Lagos and Demian Bulwa, Chronicle Staff Writers
Tuesday, July 24, 2007
(07-24) 15:12 PDT SAN FRANCISCO -- At least 20,000 customers of Pacific Gas and Electric Co. in downtown San Francisco lost power this afternoon, the utility said.
Brian Swanson, a spokesman for the utility, said outages have been reported throughout downtown and along the Embarcadero, including at PG&E's office on Beale Street near the Ferry Building. It was unclear initially how many customers who lost power remained without it for a sustained period.
Power outages were also reported in the South of Market neighborhood, the Outer Mission and down the 3rd Street corridor south of Mission Bay. PG&E officials said they did not know why power had gone out, but most customers appeared to be back online by 3 p.m.
The outage has prompted Muni to run shuttles in the place of cable cars, a spokeswoman said. The T-Third Metro line was unable to cross the 4th Street Bridge for a short time, but power was restored to the drawbridge by 3 p.m. Muni bus lines 14, 49, 30, 41 and 45 were without power for about 30 minutes following the outage, but are now working, spokeswoman Maggie Lynch said. Parking Control officers were deployed to the Outer Mission, 3rd Street and Monterey Avenue for traffic control, she added.
Power first went offline around 1:50 p.m. and came back at least three times in the downtown area before shutting off again. The same problems were reported in South of Market all the way to AT&T Park and the Caltrain station at Fourth and King streets, and traffic lights were out as far south as Monterey Boulevard.
At the Westfield Center at Market and Fifth streets, only one of six Nordstrom elevators was working while the shopping mall ran on a backup generator. Shoppers milled around as the lights flickered on and off.
BART is still running trains but the lights at its downtown stations have flickered on and off several times, said spokesman Linton Johnson. The transit agency also has concerns about the ventilation system, which is on the same grid as the lights, he said, but will keep its downtown stations open so long as the lights and ventilation continue to work. Workers at several downtown and South of Market offices were reportedly sent home for the day following the outage. Additionally, the datacenter 365 Main -- which hosts Web sites including Craigslist and Yelp -- lost power.
bloody terrorists!!!
Yep, it took down most of CNET, which GameFAQs is under. Main sight is back up as of now, though forums are still down.
If forums teach us anything, it is that logic and critical thinking should be required courses in the public schools.
Well, that explains why I haven't been able to access LiveJournal for the past hours. Good thing I read Slashdot...
Yes, you're right. Thousands and thousands of people are making it up. Craigslist is down now, and has been down for the past hour or so. So was Gamespot. It's not "FUD".
I don't respond to AC's.
Well, you test and test and test, and when something finally happens, nothing. Stuff happens.
Brownouts sometimes fail to trigger generators, even though they should. If only one phase goes down, depending on the design, it may not trip (and would cause a somewhat random outage, like some drunk shutting down racks).
If the generator runs on diesel, they usually only plan for a few hours of backup. If they didn't recalculate the generator runtime as they added equipment, the load may have caused the fuel consumption to go up higher than anticipated. Is it hot in SF today? Air handlers may be straining to keep the place cool, or maybe the generator got running too hot.
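A rough sketch of the runtime point above, in Python. It assumes fuel burn scales linearly with load, which is only an approximation for real diesel sets, and the tank size and burn rate are made-up numbers, not anything known about 365 Main.

```python
def runtime_hours(tank_litres: float, full_load_lph: float, load_fraction: float) -> float:
    """Estimated generator runtime in hours.

    Assumption: fuel burn is proportional to load (a simplification).
    tank_litres and full_load_lph are hypothetical inputs.
    """
    if not 0 < load_fraction <= 1:
        raise ValueError("load_fraction must be in (0, 1]")
    return tank_litres / (full_load_lph * load_fraction)


# Hypothetical site: 2000 L tank, 500 L/h at full load.
for load in (0.5, 0.8, 1.0):
    print(f"load {load:.0%}: about {runtime_hours(2000, 500, load):.1f} h of fuel")
```

The takeaway is just that a runtime figure calculated at half load shrinks noticeably once added equipment pushes the load up.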
Often times, as equipment is added, the load gets out of balance between phases. It is usually a good idea to keep the load as even as possible, but in a high traffic data center, I would imagine there would be a lot of stuff moving in and out, expanding and contracting, and it may become hard to keep track of the loads across phases. A good facilities manager should be able to tell you the current load off the top of his head, but too often these details get left out.
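One way to keep an eye on the balance described above is to track the worst deviation from the average phase load. A minimal Python sketch; the "max deviation from mean" formula is a common rule of thumb, and the amp readings are invented for illustration.

```python
def phase_imbalance_percent(a: float, b: float, c: float) -> float:
    """Largest deviation from the mean phase load, as a percentage of the mean."""
    loads = (a, b, c)
    mean = sum(loads) / len(loads)
    if mean <= 0:
        raise ValueError("total load must be positive")
    return max(abs(x - mean) for x in loads) / mean * 100


# Hypothetical per-phase readings in amps.
print(phase_imbalance_percent(100, 100, 100))  # perfectly balanced
print(phase_imbalance_percent(120, 100, 80))   # one phase heavy, one light
```

What percentage counts as "too far out" depends on the equipment; that threshold is not something the sketch tries to answer.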
This is just stuff I've seen in cable TV headends over the years. Granted, this facility should have a power manager/engineer on staff, but so often the power is one of the first things to get cut from the budget.
"Well, good luck finding a judge that doesn't run a bestiality site."
How many people will lose their jobs for failing to plan for this or failing to keep the generators fueled up?
How many will "merely" see their careers stalled or be "encouraged" to look elsewhere for employment?
When will this be used as an example of how to plan - and how not to plan - for disaster in an academic paper?
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Any data center that advertises high availability should be testing that sort of thing on a regular basis. It's possible that they could fail switchover even if they are being regularly tested, but it is unlikely.
If the "power outage" theory is correct and the "drunken employee" theory is incorrect, as a customer I'd be pissed that the data center I pay tons of money to can't keep my site up in the event of a power outage, which is one of the main perks of hosting at a data center in the first place.
There's a report here that "Flesh-eating zombies are prowling the streets"
That name resolves to an IP address in San Jose. Maybe they have redundant servers for their webpage, you know, wouldn't want to make potential customers think their sites would go down during a power outage..
Most people don't have that level of backup power...It's too expensive unless you're making obscene money. A hefty UPS to get you through an outage of a few hours or less, and that's about all you've got.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
was listening to SomaFM via Treo, got a call, and when I came back, no music :(
12:50 - press return.
We are working with our co-location facility managers to assess why it is back-up power generators failed to provide the necessary back-up power to prevent our site going down. We apologize for any inconvenience caused by our site being unavailable this afternoon.
I think that's admin speak for:
I warned these idiots eight months ago during my review that the datacenter had outgrown its generator capacity. But did they listen? Fuck no, they just kept counting money and worrying about the bottom line. The beancounters looked at me like I'd asked them for a blowjob from their grandmothers when I submitted the workup for additional generator capacity. And now that the shit's hit the fan, whose ass are they screaming for? Screw this, I'm applying at Taco Bell.
There are some people that if they don't know, you can't tell 'em.
http://www.sun.com/
:))
Front page says (in the ad) POWER UP AND GO.
This is a DATA CENTER, its whole purpose in life is to be available when things like this happen. It had better have generators and plenty of fuel on hand at all times. The data center I work at has the capability to run at full power with nothing coming in from the outside world for 36 hours. I don't know what the standard is for other data centers, but it seems like they should be capable of getting at least 12 hours of operation without incoming power from the grid.
Someone came in shitfaced drunk, got angry, went berserk, and fucked up a lot of stuff. There's an outage on 40 or so racks at minimum.
Libel lawsuit in 3...2...
Please help metamoderate.
Hosts NOT to use: 365main
On a side note: 365main.com is up. Good to know where their priorities lie.
I don't respond to AC's.
I is not in ur datacenter, 2 power ur servers.
Just called a friend at One Market, the big office tower downtown at the end of Market Street, and she says the power has been going on and off there for hours. Building alarms were sounding, but nothing serious was happening other than power loss.
Press Release on Red Envelope having 2 years of uptime at 365 Main - San Francisco from today: http://365main.com/press_releases/pr_7_24_07_red_envelope.html
So I edited my hosts file so technorati points at my localhost.
Can't say that's degraded my blog-reading experience in the least.
You could then get all the geeks to crank the handles and keep the web running!
Engineering is the art of compromise.
As someone who lives and works in San Francisco, I can attest that "a crazy homeless dude did it" is a fairly sensible first guess for most problems.
A widespread power outage and a gay wino vandalizing a datacenter on the same day?
Write your own Choose Your Own Adventure. http://www.freegameengines.org/gamebook-engine/
They're not the speediest at letting you in at 365... this was taken about an hour ago from across the street: http://tastic.brillig.org/~jwb/dorks.jpg
This has got to be some type of joke: RedEnvelope Reports Two Years of Continuous Uptime at 365 Main's San Francisco's Datacenter.
It was released today....
They do have backup power systems in place. Ten 2.1 MW "Continuous Power Systems" according to this document. I wonder how close they were to guaranteeing 99.99 percent uptime this year...
There are two types of people in the world: those who divide people into two types and those who don't.
It's been a long time since I went on a tour of several data centers to locate a new facility for our dot-com. I believe that 365 Main was a facility that does not use a battery UPS. Instead, they have an engine-backed flywheel UPS system (see http://www.enterprisenetworksandservers.com/monthly/art.php?2813 for a description). At the time, they had 10 2-megawatt generators on the roof in an N+2 configuration. The engines are kept heated and are spec'd to go from stop to engage-clutch/deliver-power in 3 seconds. The flywheel can deliver 11 seconds of power so they can fail through a couple of bad engines before running out of flywheel power. They periodically do a 20-hour load test into a pair of 500,000 watt heat-sinks. Time will tell if this outage was a failure of design, failure of maintenance, or outright malfeasance. But it wasn't supposed to happen. They've got some 'splainin' to do.
As to diesel storage, use of diesel is widespread for emergency use everywhere from hospitals to emergency services. Those systems are run regularly - typically weekly. The use of biocides, stabilizers, mobile fuel-scrubbing services, and extra filtration systems can maintain the fuel quality. Our colo currently maintains a 1-week fuel-supply and has multiple quick-refuel contracts in place. I can't imagine any colo having less than 24-48 hours in-the-tank with quick-refill on-call.
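The timing figures above can be turned into a quick back-of-the-envelope check. This Python sketch assumes start attempts happen one after another with no switching overhead, and reads "N+2" as "two units can be lost and the load is still carried"; both are my reading of the description, not 365 Main's documented design.

```python
def start_attempts(ride_through_s: float, engine_start_s: float) -> int:
    """How many full engine-start attempts fit inside the flywheel ride-through."""
    return int(ride_through_s // engine_start_s)


def capacity_after_failures_mw(units: int, unit_mw: float, failed: int) -> float:
    """Generating capacity left once `failed` units are out of service."""
    return (units - failed) * unit_mw


attempts = start_attempts(11, 3)   # 11 s of flywheel, 3 s per start
print("start attempts:", attempts)
print("failed starts tolerated:", attempts - 1)
print("capacity with two units down (MW):", capacity_after_failures_mw(10, 2.0, 2))
```

Three attempts in the window means two failed starts can be absorbed, which lines up with "a couple of bad engines".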
But one thing that is missing is cooling. Our colo has a typical contract that says something like blah-blah won't exceed 80F for more than 4 hours blah blah. OK, but a rack full of blade servers can crank out 15-20kW of heat load and a data center can heat up real quick without AC. By contract, 150F for 3.5 hours would be in-spec.
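To see why that clause is so weak, here is a small Python sketch of the check as worded above: only the length of time over 80F matters, not how far over. The readings are invented, and the function names are mine, not anything from a real contract or monitoring tool.

```python
def longest_run_above(segments, limit_f):
    """Longest continuous stretch (hours) spent above limit_f.

    segments: list of (duration_hours, temperature_f) in time order.
    """
    longest = current = 0.0
    for hours, temp_f in segments:
        if temp_f > limit_f:
            current += hours
            longest = max(longest, current)
        else:
            current = 0.0
    return longest


def in_spec(segments, limit_f=80.0, max_hours=4.0):
    """True if the room never stayed above limit_f for more than max_hours."""
    return longest_run_above(segments, limit_f) <= max_hours


print(in_spec([(3.5, 150.0)]))            # 150F for 3.5 hours
print(in_spec([(2.0, 85.0), (2.5, 90.0)]))  # mildly warm, but for 4.5 hours
```

A contract that also capped the peak temperature would need a second condition; the wording quoted above does not have one.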
~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
For me it would be other way around. A technology failure I could understand. Letting a drunk employee near my server rack, I could not.
If you want news from today, you have to come back tomorrow.
No kidding. Years ago in my former job on traffic systems we had a great UPS with a generator on site and the ability to keep it fueled up indefinitely. A security contractor came in on the weekend to install something and tried to wire up a new circuit hot. He slipped with a screwdriver and shorted the white phase to the chassis of the breaker panel. I don't think the tip of the driver actually touched ground, but the burn mark is still there to show how close he got.
The resulting current spike blew the 100A fuses (heavy metal strips) both going in to and out of the UPS. With the UPS effectively broken the generator set failed to start and the system gracefully shut down 40 minutes after the incident. That's not bad. The batteries were only specified to work long enough for the genny to settle at 50Hz.
In the process of blowing the fuses a spike got back into the power supply of one of our DEC Alphas and took out the power supply. The system was redundant at the software level so I didn't notice immediately.
The UPS guy came out and didn't have enough fuses to replace the blown one, but we found that with a bit of brute force and filing attacks some others could be made to fit.
Please type the word in this image: problems
http://michaelsmith.id.au
Data sheet for 365 Main:
The company's San Francisco facility includes two complete back-up systems for electrical power to protect against a power loss. In the unlikely event of a cut to a primary power feed, the state-of-the-art electrical system instantly switches to live back-up generators, avoiding costly downtime for tenants and keeping the data center continuously running.
They use a Hitec Continuous Power System, which is a motor, generator, flywheel, clutch, and Diesel engine all on the same shaft. They don't use batteries.
With this type of equipment, if for some reason you lose power and the generator doesn't start before the flywheel runs down, you're dead. There's no way to start the thing without external power. Unless you buy the optional Black Start feature, which has an extra battery pack for starting the Diesel. "Usually the black start facility will not be often needed but it won't hurt to consider installing one. Just imagine if you were unable to start up your UPS system because the mains supply is not available.". Did 365 Main buy that option?
Pinging openbsd.org [199.185.137.3] with 32 bytes of data:
Reply from 199.185.137.3: bytes=32 time=239ms TTL=236
Pinging freebsd.org [69.147.83.40] with 32 bytes of data:
Reply from 69.147.83.40: bytes=32 time=191ms TTL=47
Pinging netbsd.org [204.152.190.12] with 32 bytes of data:
Reply from 204.152.190.12: bytes=32 time=213ms TTL=241
Lost irony.
Wanted to look at 365 main in google maps' street view but the button isn't available.
Doesn't seem to be showing airborne/satellite images either.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
I just tried to look at my blog on livejournal, and got a 403 error, not 404. Intermittent errors are quite common on lj, so I thought I'd try again later.
So then I checked my Netflix queue, and couldn't get to it (got a 404 error there, though, not a nice "we're dead" message). Two sites in a row indicate the problem might be local.
Good thing slashdot was my next stop, not one of the many others. I had no idea all those sites were run out of the same location in SF.
San Francisco has always seemed to me to be a strange place to run a server farm. Aside from the crazy drunk homeless people, you also have occasional earthquakes, and some of the most expensive real estate on earth. An acre in Arizona can cost the same as a square foot in SF, so how come all these places are in SF and not the middle of the desert? Or Alaska, if you want to save on air conditioning...
If the masses can keep you down, you're not the Ubermensch.
How coincidental that I was actually trying to reach a Sun page before and couldn't get to it. I don't even remember what it was anymore, I really need to make my Firefox closed tabs list longer than 5.
Reviewing just the first hour of video games.
Exactly! I worked in a small Telecom in Kansas and we had UPS and Generator backup and tested running full load 4 times a year.... It's fun doing that, throwing the switch to turn off utility power then hearing the KA-THUNK as the switchgear switched from utility to generator. I would think these large sites are going to pitch a bitch.....
Wait, you think it's OK to advertise five nines reliability, UPS backup, and generator backup, only to find out that the systems were not being properly tested to meet the advertised capability?
Also google has datacenters in several cities. They could probably deal with an outage in San Francisco by just dropping it from the roundrobins.
What is "high availability". 99% uptime is 3.5 days down. 99.9% is 9 hours down. 88.88% is nearly an hour down. Certainly these sites can still be considered 3 nines high availability.
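For anyone who wants to check the downtime arithmetic, a short Python sketch. It assumes a 365-day year and that the third figure quoted was meant to be 99.99%; that second point is a guess on my part.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year allowed at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime: {hours:.2f} hours/year ({hours / 24:.2f} days)")
```

Run as written, this gives roughly 3.65 days for two nines, just under 9 hours for three nines, and a bit under an hour for four nines.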
Funny thing though, the same sort of story on Yahoo! News reports
that Netflix's downtime is NOT related to this incident:
http://news.yahoo.com/s/ap/20070724/ap_on_hi_te/n
"The online hub of Netflix's rental system went down Monday evening and remained unavailable until Tuesday afternoon, locking out subscribers for more than 18 hours. Spokesman Steve Swasey attributed the outage to an unanticipated problem that he declined to describe.
The breakdown didn't appear to be related to San Francisco power outages that were blamed for temporarily knocking out several popular Web sites, including Craigslist, Technorati, Typepad and Livejournal.
Service to Netflix's site was finally restored around 3 p.m. PDT after Netflix's engineers had missed several earlier estimated times for fixing the trouble."
So, is it just the Business Writer trying to put a biased spin on this story, or is there more to it than that?
You are absolutely right. The co-lo we use is just down the street from there and the last few times they "tested" their generators we had outages. Five times in the last 1.5 years. Those were the "unsuccessful" tests anyways. Needless to say we are moving to another co-lo.
All points of time and space are connected.
According to sfgate.com: "The source of the power failure appears to be an explosion in a transformer vault under a manhole in a plaza at 560 Mission St. in San Francisco... Witnesses said they heard an explosion at about 1:50 p.m., then saw flames coming from the manhole."
I biked past the place twice a day for years- they rehabbed and prepped the building up as a datacenter just in time for the dot.com to crater. It was left cold for a few years, but then there was a spate of articles in the local press, talking about the cheap hosting deals being offered, and of the incredible redundancy built into the place in case of disaster. They've promised a lot, over the years, and whatever the cause may be, it really looks like they failed to deliver.
I really feel for all the folks who have to deal with this outage; it's no fun at all!
A client of mine had a number of servers in a Sterling, Virginia data center managed by Verio/NTT. It's a good data center and seems to be well-run.
Last September, the data center experienced two complete power failures in the span of three days. To their immense credit, data center management was straight with customers about what had happened. For those who might be interested, their statements about the problem appear here.
My point? Make sure you know how to bring your systems back up from a completely cold start, and that you find a way to test this periodically. While we work to ensure that this sort of situation occurs rarely, the fact remains that these sorts of failures DO occur, and they're not as uncommon as the sales and marketing folks would like you to believe.
Phil
I tried to mod the article "-1 Not Redundant" but it wasn't an option. And I didn't have mod points. At least my inability to function only warrants a comment, rather than a slashdot article.
Velociraptor = Distiraptor / Timeraptor
...until the commercial power fails and doesn't come back for days.
:o(
The only places I've actually seen the insane levels of backup that some would like is in some telco central offices. The one I was associated with the longest had eight-hour-plus battery backup and 8 days of fuel for the diesels. Some of our really remote microwave sites had 24 hour battery and 30 day diesel.
Of course one of those sites failed high up in a mountain range in a mid-winter storm (Tieton, 1978) when the commercial power failed, and the starter battery for the diesel froze. When one of the techs finally got there (after burying his Sno-Cat and walking the last couple miles), he had to chip ice off the steel door to get inside, where he was able to get the diesel started with a little "rewire" of one of the backup battery sets. Oh, his two-way radio also failed during his hike, since it was outside his snowsuit, and the lack of communication caused the company to start two more Sno-Cats and a helicopter in that direction.
The site was out for nearly six hours, IIRC.
Even the BEST designs are subject to failure.
--
Tomas
ahhh, that explains why steam is kicking everyone repeatedly.
...don't host your data at the same location as livejournal! They say lightning doesn't strike twice but...
For a minute there, I thought this story was a dupe....
Somebody forgot to knock on wood. I bet they'll think twice about releasing a press release like that again!
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
Well, that's not the only scenario — but all the other ones I can think of call for even higher levels of stupidity. Maybe they had enough backup, but somebody forgot to buy diesel. Maybe the widget that's supposed to make backup come on automatically hadn't been properly maintained.
I myself had the misfortune to be working the help desk at a colo provider when some clueless tech working in the battery room disconnected the wrong cable and powered down the whole building. The really unpleasant part was answering the question every caller asked: "DON'T YOU IDIOTS HAVE BACKUP POWER?"
When you buy rack space, you naturally expect to get backup power. All providers claim to have it, but over and over you hear reports of outages where backup didn't kick in. What's needed is some independent authority to certify that the provider not only has adequate backup, but also has all the maintenance and testing procedures in place that guarantee that the bloody thing works.
On the frontpage of 365 Main, the top item in "In the news" is:
RedEnvelope Reports Two Years of Continuous Uptime at 365 Main's San Francisco Data Center. Online Retailer Also Cuts Energy Costs by 33 Percent.
It is from his terrible spelling we can tell he is a GameFAQs forum poster.
Even Jesus hates listening to Creed.
Funny enough, there was a press release put out today talking about how the 365 Main facility had given 100% uptime over the past 2 years. Yes, 100% uptime for a facility is very possible. All it needs is to stay online and keep providing power and cooling.
He didn't say he'd think it was OK, only that it was understandable...I think I agree. I would roll heads in either case, but probably be more outraged by the drunk access.
Max.
88.88% uptime causes less outage than 99.9%? I don't follow your math. Did you do it with an Intel chip, by any chance?
(This space intentionally blank.)
It would have been nice if someone had linked to a reliable source, like SF Gate instead of a gossip rag's wet dream.
You missed the second part of the headline: "Online Retailer Also Cuts Energy Costs by 33 Percent" And today, they've cut their energy costs by 100%!
I called comcast earlier as my friend can access the site but I cannot, and he lives a mile north of me.
I believe there is a lot more going on here than what is mentioned, comcast said that it was an AT&T link to the backbone that was refusing connections. I don't know a ton about networking but it seems to be back and functional now, and earlier when I would tracert the IP down was 12.116.17.7, if that helps you guys to peek at what it is.
Heh, that was July 30, 2006, I remember it well. Seattle City Power was taken out by nearby construction. The UPSes came online, but one of the generators failed to switch on, so the batteries drained in ~15 minutes. The entire DC didn't lose power, but a good portion of it.
Looks like their site is up. This is probably FUD to generate blog hits.
Hanlon's Razor leads to an easier explanation -- the Slashdot editors took several hours to promote this from the inbox to the main site.
... Wasn't supported on that version of Firefox. But worked on an older Netscape.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Well, according to their self-congratulatory press release, issued earlier today, they were allegedly at 100% uptime for the past two years.
;-)
The irony of issuing a press release like that, and then to be hit with a power outage and apparent simultaneous failure of all backup systems later that day, is beyond measure.
I don't know about God, but it's enough to make me believe in karma.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Forgive my ignorance, but how would using a SAN have helped in this situation? Are you proposing that a single SAN storage net span multiple (remote) physical locations? And with SAN, can't a disk only be used by one computer at a time anyway?
Sure, you could use RAID 1(+0) and put the mirrored halves at different locations, but I can't imagine that being acceptable from either a performance or a reliability point of view.
Wouldn't master-slave database replication be more appropriate for this kind of work?
Yeah I don't really get it. I'm sure that SF Bay is a nice place to work and all, probably a nice view, good selection of late-night delivery food ... but why the heck would you site a datacenter there? I get that it's a big Internet peering point, but still.
It's not like you need to walk down there and eyeball your server every day. Does it give the suits the warm fuzzies to be able to see their DC from their office window or something?
It's not *that* hard to get multiple backhauls from different backbone providers in other parts of the country, ones which aren't close to oceans or tectonic fault lines and which have cheap power. As far back as the mid 90s I remember that there were some fairly serious datacenters in Texas -- I think EDS set up the first really big ones.
Even the big East-Coast peering point (Reston, VA?) seems like it would be a better choice. Still uncomfortably close to an ocean and a major metro area, though.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Sites affected include Technorati, Netflix (these display nice "We're Dead" pages), Typepad, LiveJournal, Sun.com, and Craigslist (these just time out).
And Ironport!
I get to rebuild some slave databases. Thanks 365! Your generators are top notch.
Oh, wait, SF = San Francisco.
Whew.
Much of Europe uses 220V/50Hz.
The drunk thing is way outside the control of the administrators. Testing the failover is something they can do, and if something doesn't work, they can fix it.
It was the Terrists (I wish I could type it the way Bush pronounces it) ;)
It's only paranoia if your wrong...
> The drunk thing is way outside the control of the administrators.
Eh? You're kidding, right?
Max.
Nope. They're not talking some random drunk off of the street, they're talking about a disgruntled employee.
oh, ok. Even so, there's still things that can be done. I'd still be more pissed about that, I think. Like the other guy said, I can understand power failures - they're unacceptable, but still understandable. Allowing a drunk in there, whether an employee or not, I cannot understand. Is there only one person there or something? Security should have just sent the person home.
On the other hand, there's drunk, and there's drunk....
Max.
Is there anyone reading this thread that actually knows something about power outages? The information about colos from those that use them and have visited this one is great, but am I alone in thinking the power outage itself was kinda weird (and perhaps even suspicious)?
I can't remember ever being in a power outage where the power went off for a few minutes, came back on, and continued going on and off for three hours. A typical power outage is a component failure that leads to a single outage or to an overload that then leads to an outage. Sometimes successive areas fall like dominoes as the overload travels around the grid, but isn't it unusual for power to go on and off like this?
I for one would appreciate it if anyone with actual knowledge of these things would post. I know this is more of a blue collar specialty than a tech one, but someone must know the answer. Seems to me that if the power was indeed going "on and off" for three hours that this is a very good reason why the colo might have failed. It's simply too unusual an occurrence to plan for.
An interesting possible reason for 365's outage debacle was posted by someone on an O'Reilly Radar blog (emphasis added by me):
ajblardone [07.24.07 06:22 PM] I was there when the power went out. The generators kicked in right away. Some colos were fine others weren't. Mine went black for a while after the outage. 365 main had been working on electrical upgrades all week and this outage might have been bad timing for them... At 4pm 365 main sent out a notice saying the building was 100% operational and still running on the generators until PG&E confirms that utility power is stable.
"...I would think these large sites are going to pitch a bitch..."
I would think these large sites would understand the concept of not putting all your eggs (servers) in one basket. There is a reason why smart companies use replication and clustering, and datacenters spread across the country.
The guy who rebooted that VAX is probably under a floor tile in that datacenter.
Veteran, Bermuda Triangle Expeditionary Force, 1992-1951
The press release "RedEnvelope Reports Two Years of Continuous Uptime at 365 Main's San Francisco Data Center", which was on the 365 Main web site earlier today, has disappeared from there.
But they sent the press release to PR Newswire, and you can still read it there.
Yeah here in Australia it is 240V 50Hz, but often closer to 260 in Western Australia for historical reasons. Most people design for 250 which is what a volt meter will read in most places.
http://michaelsmith.id.au
They probably just didn't kick in. Had the same problem at Internap in Seattle a few years ago.
Many datacenters didn't expect the growth they experienced. As a result, many UPS and generator sets are undersized, or not all of the load is on them. In some cases, the critical servers are up to post the "we are down" page, but the HVAC system and main floor are down. What good is having a datacenter up if the building AC is down? Sometimes you are forced to shut down simply because the support AC is down and not on critical power. You can ride out a 20 minute outage without AC, but after an hour, it's at critical temperatures.
The truth shall set you free!
365 Main gets to royally fuck up one day every 4 years. Maybe the companies should have hired 366 Main.
How the fuck is it understandable? If the equipment is not being tested and maintained properly, how is it any more understandable than letting drunks through the main doors?
Properly maintained and tested backup equipment does not fail.
I swear, people are going through some major mental gymnastics here to excuse away sheer incompetence.
Time to upgrade the cardswipe system to also require a breathalyzer.
*swipe*
*bip* *beep* *beep* *boop* *bleep*
[deep breath]
*whoosh*
Alcohol Level: 0.15
*beeeeeeeep*
Damnit!!
There's no place like
Not a bad idea, I think...
Max.
I'd prefer to think he was just trying to balance out all the people who have started using "site" when they mean "sight" since this whole intarweb thing came about.
PG&E is a good example of why "deregulation" does not work for utilities. We got about one thousandth of an inch of rain (barely measurable). This was just enough to knock out the power to a sizable chunk of the East Bay. Why? Because in their quest for profits, PG&E is too cheap to properly wash down their equipment, and dust builds up. A drizzle turns the dust into mud and causes stuff to short out. That's not to say that PG&E is good in the dry weather. Where I live in the Bay Area (decidedly not the sticks), power goes out for 3+ hours at least twice a year.
The revolution will be mocked
Not so. The phone company's commitment to dial-tone reliability predates the existence of 911 service (which was first mandated in 1967 but not universally deployed until much later) by decades.
-Tom Duff
Essentially, that's not what I said.
If you want news from today, you have to come back tomorrow.
You know, kind of like "P2P" and "bitorrent" and all that.
But of course, usenet lacks choke-points to insert advertising -- oh wait, I mean it lacks spam resistance, that's it -- so it is of course doomed to obscurity.
Drunk employee?
Sounds like someone has been installing several instances of windows.
Like anyone would do that sober.. maybe their cable management was really bad and he just staggered into wrong corridor?
There are no atheists when recovering from tape backup.
Under the wide and heavy VAX
Dig my grave and let me relax
Long have I lived, and many my hacks
And I lay me down with a will.
These be the words that tell the way:
"Here he lies who piped 64K,
Brought down the machine for nearly a day,
And Rogue playing to an awful standstill.
240v/50hz in NZ too.
Heh a couple of years back at home one night the power went out, then came on a few seconds later, but the lights seemed really dim... and the fridge was making a really weird noise, and other appliances were either not working or doing weird stuff.
I poked a multimeter into a wall socket, and we were only getting 90-110v. Surprisingly my computer and the tv were working fine.
Digging back a few months, I found another gem...
365 Main Recognized by PG&E...for taking proactive steps to reduce power usage.
This is what technical folk refer to as an "understatement".
The global (yes, global) LB took over but we lost an estimated 100,000 credit card transactions in the 10 seconds it took for it to kick in. They literally had to restrain the manager.
Another one I had was at a telemarketing company. We had just run new wires for some T1s and my boss had these periods of "neatfreakness". He proceeded to cut the wires he THOUGHT were the old ones...
So as the call center manager is bragging to a potential client about our uptime all the TSRs are yelling into their headsets "Hello?". My boss comes out of the network closet with wirecutters in one hand and a bundle of cable in the other right as the manager and COULD HAVE BEEN client were walking up the aisle. Priceless....
"Chinese Amazons, power armor, laser swords.... things just meant to be." - Shampoo, A Very Scary Bet
Supposedly the power has been back on for some time but craigslist is still down.
http://sfgate.com/cgi-bin/article.cgi?f=/c/a/2007/07/24/BAG9NR67253.DTL
Some of the sites, including Craigslist, remained down even after power was restored, as administrators ensured that data in the server hadn't been damaged, among other checks.
from the article... "Some of the sites, including Craigslist, remained down even after power was restored, as administrators ensured that data in the server hadn't been damaged, among other checks."
It's well after 10pm and craigslist is still only intermittently working. I wonder why they're having such trouble?
-- QED
Now, now... LiveJournal is back up.
CFO: "Four nines are 80% as good, right?"
Dewey, what part of this looks like authorities should be involved?
Have you ever been in a data center? Cabinets that are all locked. To get the key, you have to sign it out from security. Ditto for the cages. It wouldn't just require a drunken/disgruntled employee, it would require a conspiracy of them: security staff to hand over the keys and the disgruntled employees to do the misdeeds.
Well, there is one way around that: you walk over to the EPO button and give it a whack. It'll take down the whole floor. Rinse, lather, repeat on other floors. How many do you think you can do before someone stops you?
Anyway, my employer has a lot of stuff in 365 Main. We're not one of the companies mentioned in TFA, but we're certainly one of the ones affected. Within a couple minutes of the outage, we knew we'd lost everything we had there and several of our sysadmins grabbed their gear and headed for the city to go join that line outside of 365. By the time they left the building we had confirmation that it was a power outage.
Power was already back on when they got inside and they immediately brought up anything that wasn't already up and tested it all to make sure it was OK. To say the least, this is inconsistent with (tall) tales of somebody going apeshit on 40 racks.
Who 'drunk tests' a data center?
I just read Slashdot for the articles.
Even better is the page is now 404'd =)
Oh, telecom for telemarketers. I don't even put telecom We must have been Very Bad in a past life to have done that. I was escalating Nortel service issues to VPs in Texas (from the Bay Area) at mine; they eventually reassigned the guy who slept in his truck, never bathed or brushed (teeth or hair) and was usually drunk. Musta been one hell of a union. That doesn't even touch the people I worked for. Ugh.
Oooh, I bet in about 3 more years that manager torches the shoulder surfer's car in the middle of the night. Just for that the LART is moving from the server room to my office. It's only a 3 wood, but the very first time I met our new CFO I told him he would eventually let me expense a taser. We're getting closer.
Veteran, Bermuda Triangle Expeditionary Force, 1992-1951
How does Internap keep doing this? The major Seattle problem, yeah, but I can recall several outages (of LJ mainly) where they say "our provider lost power due to whatever and their generators didn't work/were overloaded/worked, but then stopped". I've been in their Boston facility, and it was packed to the gills, and there were large generators outside. I'd have to assume they work.
I like music
Where I used to work, we had a commodity K62-300 box running Solaris x86 go for over three years on an unfirewalled global IP, acting as a DNS server for an ISP. In the end, it was brought down by the power supply fan seizing. It was so tight I couldn't even turn it by sticking a screwdriver in the blades.
:)
Clearly, I'm hung like an eHorse
You should already be steamed up for that!
--- I am known for the ones who want to find me on the net. Is that a privacy risk or a privilege? One might wonder..
Shouldn't the drunk be taken into account in a failover plan? Granted, it is unlikely, but the whole point of a failover is making sure that if one server is not available (IE motherboard failed, died in a fire, or pissed on by a drunken soon to be ex-employee) that you FAIL OVER to the other server or server farm.
P.S.,
This is what part of the alphabet would look like if Q and R were eliminated.
Sounds like it's time for a new technical term...
Max.
Yes, they had generators, but they run on alcohol and someone drank it.
365 Main Customer,
At 1:49 p.m. on Tuesday, July 24, 365 Main's San Francisco data center was effected by a power surge caused when a PG&E transformer failed in a manhole under 560 Mission St.
An initial investigation has revealed that certain 365 Main back-up generators did not start when the initial power surge hit the building. On-site facility engineers responded and manually started effected generators allowing stable power to be restored at approximately 2:34 p.m. across the entire facility.
As a result of the incident, continuous power was interrupted for up to 45 mins for certain customers. We're certain colo rooms 1, 3 and 4 were directly affected, though other colocation rooms are still being investigated. We are currently working with Hitec, Valley Power Systems, Cupertino Electric and PG&E to further investigate the incident and determine the root cause.
All generators will continue to operate on diesel until the root cause of the event has been identified and corrected. Generators are currently fueled with over 4 days of fuel and additional fuel has already been ordered.
We understand the seriousness of this issue and will provide full details once they come available. We sincerely apologize for the impact this has had on your operations.
Regards,
Vice President, Security
365 Main
"The World's Finest Data Centers"
Just send me a big fat check and all is forgiven.
You do know most multimeters are not rated for the amps and peak voltages of mains power? The same thing happened to me, also in New Zealand, late at night about one and a half years ago. The power cut out for a second, the computer died then started back up, but never started booting. The lights were real dim, and my alarm clock started showing weird digits. The washing machine (which was off) managed to get itself halfway into a wash cycle and started beeping. I never checked the TV, but the computer screen was. You wouldn't happen to live near Springston?
I poked a multimeter into a wall socket, and we were only getting 90-110v. Surprisingly my computer and the tv were working fine.
TVs and computers use switched mode power supplies these days, which are quite happy running on a fairly wide range of voltages (although the current draw will of course be much higher at lower voltages). This is the reason why PSUs no longer have 110/220v switches on them.
http://blog.nexusuk.org
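To put rough numbers on the point about switched mode supplies drawing more current at lower voltage: a supply feeding a fixed load behaves approximately as a constant-power device, so input current scales as I = P / (V x efficiency). The sketch below is only an illustration. The 300 W load and 80% efficiency are assumed figures, not measurements from anyone in this thread, and power factor is ignored.

```python
def input_current(load_w: float, mains_v: float, efficiency: float = 0.8) -> float:
    """Approximate input current (amps) for a constant-power load.

    efficiency is an assumed, fixed conversion efficiency; real supplies
    vary with load and input voltage.
    """
    if mains_v <= 0 or not 0 < efficiency <= 1:
        raise ValueError("voltage must be positive and efficiency in (0, 1]")
    return load_w / (mains_v * efficiency)


# Hypothetical 300 W PC at nominal mains and at the sagging voltages
# reported above (90-110 V on a nominally 240 V supply).
for volts in (240, 110, 90):
    print(f"{volts:>3} V -> {input_current(300, volts):.2f} A")
```

The same load pulls well over twice the current at 90 V as at 240 V, which is why a brownout is harder on the supply than the unchanged output would suggest.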
"See this pub?" asks John, "I built it, but they don't call me Pubbuilder John? I'm the local doctor, I saved Barman Jim's life once when he choked on a peanut, but they don't call me Lifesaver John. Every year, I supply a huge Christmas tree for the village green, but the don't call me Christmas Tree John.
"But you shag one lousy sheep..." (Note; since that Austin Powers film came out, I assume that you Yanks know what "shagging" is now).
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).
Taser. Hell, where I am we've got shipyard justice. Plasma cutters do wonders....
"Chinese Amazons, power armor, laser swords.... things just meant to be." - Shampoo, A Very Scary Bet
Trust me, it's not that bad. The guy made one spelling mistake in a post that was otherwise correctly spelled, punctuated, capitalised, and generally better-written than a lot of the crap that's out there on the Net.
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).
> Main sight is back up as of now, though forums are still down.
I see.
Looks like they couldn't stop the story before it hit the wires... I wonder if they'll issue a retraction? :P
...unfortunately no one can be told what The Mat^H^H^HGoatse is...they must experience it for themselves...
In the event of a significant regional disaster, how good are those quick-refill contracts, anyway? I just keep thinking of fuel diversion by emergency services or simply the inability to deliver fuel due to transit woes, employee shortages, etc. In the event of a significant situation, fuel is a top priority for lots of people, many of them with guns and/or legal authority to seize it.
It always struck me as somewhat more resilient to have N+1 generators capable of being run on natural gas and LP. Sure, some regions might lose natural gas delivery (earthquake, etc), but it seems more likely to me that natural gas would keep running in spite of problems that might prevent or badly slow diesel delivery. And being capable of switching to LP means that even if you lose natural gas, you can keep running on on-site fuel.
The downside is that you probably are more limited in LP storage facilities in dense urban areas (diesel seems more fire marshal friendly) and diesel is more fuel efficient, but overall it seems that the odds favor longer term survivability of natural gas + LP vs. diesel.
Working for a telephone company in Florida, I have a hard time believing anyone running a data center could be so ill-prepared. We have our own issues with DR - there's going to be some issues when a bomb goes off under a switch site; BUT we have had multiple switch sites keep running simultaneously on generators and inverters during and after hurricanes. Our NOC and switch techs go above and beyond to keep power and connectivity up. They may get a bug out notice prior to a major hurricane, but if so, everything is cut over to generator power with at least 48 hours of fuel and they're back on site just as soon as the roads are drivable. The last time South Florida got smashed, all of my data systems stayed online even though it was close to a week before commercial power was back on.
A couple of 30-somethings embark on the ultimate roadtrip
Friend that was at the colo says the diesel did not start and the wheel spun down.
If a CFO with at least an MBA cannot make proper use of grammar in an apology letter sent to his paying clients what hope is there. Seriously that is inexcusable next time maybe try proofreading or perhaps have a secretary review it...
Didn't 365 ever plan for their own disaster? I'm sure other major companies have enough redundancy in their infrastructure to support their business in case of a power outage. Found this article on disaster planning: http://www.smartbrief.com/news/aaaa/industryBW-detail.jsp?id=B3A11DDD-AD9B-4399-9682-6E54C82E6757
More companies need to prepare for when it's their turn instead of relying on someone else. What about data recovery? What if the drives got damaged? They'd be spending a whole lotta money on data recovery.
I work for Hyperic, which is a systems management company here at 2nd & Mission in SF. Our website is run out of that colo on 365 Main... and it was up all day yesterday, even despite the manhole cover blasting off which resulted in the mad power outage... which was witnessed by a Fruit of the Loom commercial and an apparent dead guy that had been on a gurney in the street with a security guard for a couple hours. (For more on this with pictures, check out Javier's blog from yesterday: http://www.hyperic.com/blog/hyperic/2007/07/24/hyperic-is-where-the-action-is/)
Knowing a thing or two about running data centers here, that data center definitely has serious backup and disaster recovery - they are professionals otherwise we wouldn't have picked them and neither would other serious businesses like Yelp and Technorati. I don't know for sure - but the drunken idiot theory makes a whole lot more sense given how many other sites that I know are run out of there that were unaffected, ours included.
-Stacey Schneider
Hyperic
http://www.hyperic.com/
Waiting in line for checkin at 365 Main:
http://tastic.brillig.org/~jwb/dorks.jpg
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
Netflix is hosted elsewhere. Their outage was not related to the power problem.
http://cbs5.com/topstories/topstories_story_206063640.html
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
Sounds like it's time to update the Hazard Vulnerability Analysis . . .
Well, they DO test this regularly, at least generator fail over in the event of power loss. Unfortunately, it appears that a significant power SURGE occurred from a transformer back feed. This resulted in the flywheels in their generators spinning down before power could be switched over and likely some system that detects power loss probably got fried in the surge and never notified the generator controller of the loss. 1) they're lucky they have a REALLY good ground fault interrupter as this likely would have cooked every server in every rack otherwise, or at least every surge stopgap between the line feed and the racks, which still could have caused days of downtime to replace, 2) how does one test for a several megawatt power surge? 3) Only some of their racks went down so at least some battery or generator power came online, just not all of them, or not ones that powered certain rooms.
That said, the fact that they're running exclusively on generator until they identify and fix this fault, and that the power company and the generator operators are jumping in means they're more than willing to blow several thousand in fuel costs to make sure this does not happen again, and I would expect they'll bill the generator manufacturer for this failure and all related costs (which that company will likely bill to an insurance provider) and possibly find another generator company or add a few more redundant systems.
The fact that the clients are not insisting on installing UPS systems with at least 30 minute run times IN the racks with their servers means either the clients are cheap, or no one considered that a fuse, breaker, or PDU in a rack could blow and take out half a rack or more if it wasn't on internal UPS power, regardless of whether power was on or not... This is flawed redundancy thinking.
Business Continuity should be 25% of total IT spending (labor, hardware & software, backup, everything combined). This does not include redundant co-lo for users, only servers. If you want redundancy for everyone, users included, take your IT budget now (without that redundancy) and add 125% to it (it costs MORE than double).
There is no contest in life for which the unprepared have the advantage.
I like "FALL-over" planning myself. Ya know, for when an admin shows up for work stumbling drunk.
He's making fun of you. Your melodramatic whining is well-suited to the types of posts made on LiveJournal every day.
It's not whining, it's a demand for accountability. If I were as bad at my job as these people are at theirs, I would be fired, and rightly so. Those of you who like to bend over and take it are welcome to keep doing so. I won't.
"I swear, people are going through some major mental gymnastics here to excuse away sheer incompetence."
They have sympathy because they know they're incompetent, too. About 95% of the population would agree.
That said, the cause of the outage could be another case of "nothing works and nobody cares." In other words, they did test everything, but it still failed when push came to shove.
I'm trying to install an eSATA for a client. We've gotten three eSATA controllers, two sets of port multipliers, and different cables and it still doesn't work right. The last set was actually tested by the company before sending it to us and it still doesn't work right, although it came a lot closer than the first two. Now the company admits the port multipliers are a new design by their supplier and they can't be certain it isn't a quality control problem with their supplier.
Nothing works in IT. I just have to engrave that on my hand like the punishment in the latest Harry Potter movie.
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
UPS's themselves may be cheap, but they eat rack space. So when you sign a contract with a provider, you take into account all of these costs and perceived benefits.
Am I the only one to think of the possible consequences of a major earthquake in the bay area?
Hosting a site at only one location could, in the worst case, lead to the site being lost, if it was not backed up at another physical location.
There are no surprises here. San Francisco's Mission St. Substation feeds half a dozen significant datacenters (365main, Level3, Coloserve, 400 Mission, and 650 Townsend) and has suffered 3 serious outages in the past 7 years. California itself had 2 straight summers of rolling blackouts, which only subsided thanks to the dot-com crash. California is running out of duct-tape.

365main usually runs a good operation, and is one of the best datacenters in California. However, it's also the most expensive datacenter in California, and should have a better track record than its lower-cost competitors like 200 Paul and Coloserve. In May, 2007 we moved our infrastructure out of 365, off of California's cancerous power grid, and into a more reliable, greener, and cheaper grid. Yeah, we moved to Seattle. This was the best decision we ever made. Most of our experience with 365 was extremely positive, however pricing and power density problems forced us to move.

I can't list all of the good things 365main did, but here's a list of 365's power problems as we experienced them:

In April, 2005 365main had an outage that affected all customers for 50 minutes due to a failed EPO valve. 365 handled that outage spectacularly, calling all of their customers within 15 minutes of the outage.
In February, 2006 365main experienced a partial outage for 3 seconds that only affected some customers, but caused problems in their Telco spine, affecting connectivity.
In October, 2006 365main had a backup generator fail. Supposedly no customers were directly affected, but customers were not allowed to enter the building between 3:29 PM and 4:40 PM.
You don't need to be a large site to spread yourself out. Even if you're just big enough to be able to afford $100-200 USD per month in hosting costs, you can do at least reasonably effective redundancy...
Round-robin between the two servers means that in the event of an outage, only 50% of requests are denied, and you can change the DNS records (with a low TTL, I'd hope) and be switched over entirely to the surviving server within minutes. And that's just with two cheap budget dedicated servers in commodity datacenters...
I mean, throw one box up at ThePlanet in Dallas, another one up at iWeb in Montreal (Hah! Multi-country redundancy) and you've got yourself a pretty darned good chance of surviving ANY disaster. I mean, Quebec (Montreal) and Texas (Dallas) both have their own interconnects too, so one of those giant power outages that took out the eastern US/Canada a few years back (Except Quebec) wouldn't even affect both locations.
But I know very little about this sort of thing. So maybe my idea of zero-budget failover with DNS is stupid, somebody fill me in.
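For what it's worth, the 50% figure in the two-server idea above can be checked with a toy simulation. This is only a sketch under a simplifying assumption: clients are handed the published records in strict rotation and never retry. Real resolvers cache answers and many clients will try the next address on failure, so actual numbers will differ. The names "dallas" and "montreal" are just hypothetical labels for the two boxes.

```python
from itertools import cycle


def failed_fraction(records, alive, requests=1000):
    """Fraction of requests sent to a dead server under strict round-robin.

    records: the server labels currently published in DNS.
    alive:   the set of labels that are actually answering.
    """
    rotation = cycle(records)
    failures = sum(1 for _ in range(requests) if next(rotation) not in alive)
    return failures / requests


# Hypothetical labels for the two budget servers.
RECORDS = ["dallas", "montreal"]

# Dallas loses power while both records are still published.
during_outage = failed_fraction(RECORDS, alive={"montreal"})

# The dead record has been pulled and the low TTL has expired.
after_dns_change = failed_fraction(["montreal"], alive={"montreal"})

print(f"both records published, Dallas down: {during_outage:.0%} of requests fail")
print(f"Dallas record removed:               {after_dns_change:.0%} of requests fail")
```

How long the first state lasts depends on how quickly the outage is noticed plus the record TTL, which is the reason for keeping the TTL low.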
Yeah I figured that. I was surprised about the TV tho...
Nope this was in Wellington.
My multimeter, while it's a cheap one, can measure up to 500V AC. And a voltmeter has basically infinite resistance - as close to zero amps as they can make it will flow through it. This is why you wire them in parallel in a circuit. An ammeter has as little resistance as possible, and is wired in series. If you plug a multimeter into a wall socket when it is switched to amps then you are going to need a new ammeter. It's basically the same as pushing a bent-up paperclip into the phase/earth sockets.
UPS are typically a 2U or 4U solution and can power up to half a rack. Data centers almost always charge less for U's associated with devices that don't add to power or heating costs (KVM, terminal display, tape jukeboxes, etc). Rack space itself is cheap if the costs of cooling and power can be eliminated. A company we partner with (we back up their racks, not use their space) charges a base price for the 1st U per server, plus a lower price per additional U (for the same server). Buying a 1/4 rack, half rack, or full rack is always at a discount compared to U-by-U purchase. Any reasonable configuration is going to be at least 3 individual servers anyway, so an extra 2U for the UPS divided out to a few servers should not be worth considering when looking at the cost of power loss. Some data centers even include this additional UPS space (and the UPS itself) as part of the regular fees.
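The voltmeter vs. amps-range difference is just Ohm's law, I = V / R. A quick sketch with assumed values: 10 megohms is a common input impedance for a digital multimeter on the volts range, and 0.1 ohm is a guess at the shunt resistance on the amps range. Neither figure comes from any particular meter.

```python
def meter_current(mains_v: float, meter_ohms: float) -> float:
    """Current (amps) through a meter placed directly across the supply."""
    if meter_ohms <= 0:
        raise ValueError("resistance must be positive")
    return mains_v / meter_ohms


MAINS_V = 240.0          # nominal NZ mains voltage
VOLTMETER_OHMS = 10e6    # assumed input impedance on the volts range
AMMETER_OHMS = 0.1       # assumed shunt resistance on the amps range

volts_range = meter_current(MAINS_V, VOLTMETER_OHMS)
amps_range = meter_current(MAINS_V, AMMETER_OHMS)

print(f"volts range: {volts_range * 1e6:.0f} microamps")
print(f"amps range:  {amps_range:.0f} amps (ideal; the fuse or breaker goes first)")
```

On the volts range the meter passes microamps. On the amps range the ideal figure is thousands of amps, limited in practice only by wiring impedance and whatever fuse blows first, hence the paperclip comparison.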
There is no contest in life for which the unprepared have the advantage.
365 Main has placed a statement about their Hitec UPS failures on their web site. Highlights:
So they apparently had startup failures in four out of ten Hitec units. Their basic architecture is that they have eight main UPS systems (each is a motor/generator/flywheel/Diesel combo), each driving a separate section of the colo, and two spares, Backup 1 and Backup 2, which can be switched to drive any section. No big battery banks; it's all flywheels and Diesels. At least eight systems must be running to keep the full data center up.
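The arithmetic behind that layout can be sketched as follows. Taking the description above at face value (eight sections, each with its own main unit, plus two spares that can be switched onto any section), the number of sections left dark is the number of failed mains minus the number of working spares. The statement does not say which four units failed, so the loop below just walks through every possible split between mains and spares. It also assumes a spare can pick up a section instantly, which is an idealization.

```python
SECTIONS = 8   # main UPS units, one per section of the colo
SPARES = 2     # Backup 1 and Backup 2


def dark_sections(failed_mains: int, failed_spares: int) -> int:
    """Sections left without power, assuming each working spare covers one failed main."""
    if not 0 <= failed_mains <= SECTIONS or not 0 <= failed_spares <= SPARES:
        raise ValueError("failure counts out of range")
    working_spares = SPARES - failed_spares
    return max(0, failed_mains - working_spares)


# Four of ten units failed to start; try every way that could split.
TOTAL_FAILED = 4
for failed_mains in range(TOTAL_FAILED - SPARES, TOTAL_FAILED + 1):
    failed_spares = TOTAL_FAILED - failed_mains
    print(f"{failed_mains} mains + {failed_spares} spares failed -> "
          f"{dark_sections(failed_mains, failed_spares)} sections dark")
```

Whichever way the four failures split, only six units are running against a minimum of eight, so two sections go dark even in the ideal case.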
365 Main has had Hitec experts flown in from Holland, where the UPS was made. Today, Hitec top management arrived: "A longstanding member of the Hitec Board of Directors is arriving later tonight and will be onsite tomorrow (Sunday) to participate in all investigation activities."