Multiple Sites Down In SF Power Outage
corewtfux writes with word of a major outage apparently centered on 365 Main, a datacenter on the edge of San Francisco's Financial District. Valleywag initially claimed that a drunken person had gotten in and damaged 40 racks, but an update from Technorati's Dave Sifry says the problem is a widespread power outage. Sites affected include Technorati, Netflix (these display nice "We're Dead" pages), Typepad, LiveJournal, Sun.com, and Craigslist (these just time out).
I can verify that it affected much of the Financial District here in SF. We had the power go out 3 times. Seems to be back now. Haven't heard any explanation yet.
im in ur datacentr
trashin ur racks
Don't these large sites have failover-capable, redundant servers in multiple physical locations? Why should a failure in one rack, one room, or heck, even one state for the giant sites, affect them?
I don't respond to AC's.
It's interesting that so many major sites would go down in a local power outage. Are they all sharing one data center in SF? If so, why don't they have co-locations in other cities?
~ I am logged on, therefore I am.
At least 20,000 without power in downtown S.F. Marisa Lagos and Demian Bulwa, Chronicle Staff Writers Tuesday, July 24, 2007 (07-24) 15:12 PDT SAN FRANCISCO -- At least 20,000 customers of Pacific Gas and Electric Co. in downtown San Francisco lost power this afternoon, the utility said. Brian Swanson, a spokesman for the utility, said outages have been reported throughout downtown and along the Embarcadero, including at PG&E's office on Beale Street near the Ferry Building. It was unclear initially how many customers who lost power remained without it for a sustained period. Power outages were also reported in the South of Market neighborhood, the Outer Mission and down the 3rd Street corridor south of Mission Bay. PG&E officials said they did not know why power had gone out, but most customers appeared to be back online by 3 p.m. The outage has prompted Muni to run shuttles in the place of cable cars, a spokeswoman said. The T-Third Metro line was unable to cross the 4th Street Bridge for a short time, but power was restored to the drawbridge by 3 p.m. Muni bus lines 14, 49, 30, 41 and 45 were without power for about 30 minutes following the outage, but are now working, spokeswoman Maggie Lynch said. Parking Control officers were deployed to the Outer Mission, 3rd Street and Monterey Avenue for traffic control, she added. Power first went offline around 1:50 p.m. and came back at least three times in the downtown area before shutting off again. The same problems were reported in South of Market all the way to AT&T Park and the Caltrain station at Fourth and King streets, and traffic lights were out as far south as Monterey Boulevard. At the Westfield Center at Market and Fifth streets, only one of six Nordstrom elevators was working while the shopping mall ran on a backup generator. Shoppers milled around as the lights flickered on and off. 
BART is still running trains but the lights at its downtown stations have flickered on and off several times, said spokesman Linton Johnson. The transit agency also has concerns about the ventilation system, which is on the same grid as the lights, he said, but will keep its downtown stations open so long as the lights and ventilation continue to work. Workers at several downtown and South of Market offices were reportedly sent home for the day following the outage. Additionally, the datacenter 365 Main -- which hosts Web sites including Craigslist and Yelp -- lost power.
Yep, it took down most of CNET, which GameFAQs is under. Main site is back up as of now, though forums are still down.
If forums teach us anything, it is that logic and critical thinking should be required courses in the public schools.
Well, you test and test and test, and when something finally happens, nothing. Stuff happens.
Brownouts sometimes fail to trigger generators, even though they should. If only one phase goes down, depending on the design, it may not trip (and would cause a somewhat random outage, like some drunk shutting down racks).
If the generator runs on diesel, they usually only plan for a few hours of backup. If they didn't recalculate the generator runtime as they added equipment, the load may have caused the fuel consumption to go up higher than anticipated. Is it hot in SF today? Air handlers may be straining to keep the place cool, or maybe the generator got running too hot.
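A back-of-the-envelope sketch of the runtime point above. The burn rate of 0.07 gallons per kWh is only an assumed ballpark for a diesel genset, and the tank size and loads are made-up numbers, not anything known about 365 Main.

```python
def runtime_hours(fuel_gallons: float, load_kw: float,
                  gal_per_kwh: float = 0.07) -> float:
    """Hours a generator can carry load_kw on fuel_gallons of diesel.

    gal_per_kwh is an ASSUMED rough burn rate; real engines vary with
    load, temperature, and age, so check the manufacturer's curve.
    """
    if load_kw <= 0:
        raise ValueError("load_kw must be positive")
    return fuel_gallons / (load_kw * gal_per_kwh)


# Hypothetical figures: the tank was sized when the floor drew 500 kW,
# but the equipment has since grown to 1000 kW.
planned = runtime_hours(fuel_gallons=1000, load_kw=500)
actual = runtime_hours(fuel_gallons=1000, load_kw=1000)
print(f"planned: {planned:.1f} h, actual: {actual:.1f} h")
```

Under this simple linear model, doubling the load halves the runtime, which is why the number needs recalculating every time equipment is added.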
Oftentimes, as equipment is added, the load gets out of balance between phases. It is usually a good idea to keep the load as even as possible, but in a high traffic data center, I would imagine there would be a lot of stuff moving in and out, expanding and contracting, and it may become hard to keep track of the loads across phases. A good facilities manager should be able to tell you the current load off the top of his head, but too often these details get left out.
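One way to keep an eye on this is to compute an imbalance figure from the per-phase current readings. The sketch below uses a common rule-of-thumb measure (largest deviation from the mean, as a percentage of the mean); the function name and the sample readings are hypothetical.

```python
def phase_imbalance_pct(amps_a: float, amps_b: float, amps_c: float) -> float:
    """Largest deviation from the mean phase current, as a percent of the mean.

    This is one common rule-of-thumb definition, used here as an
    assumption; a facility may use a different formula or threshold.
    """
    loads = (amps_a, amps_b, amps_c)
    mean = sum(loads) / len(loads)
    if mean == 0:
        return 0.0  # nothing is drawing current, so nothing is unbalanced
    return max(abs(x - mean) for x in loads) / mean * 100


# Hypothetical readings after a few racks were added to phase A only.
print(f"{phase_imbalance_pct(120, 90, 90):.1f}% imbalance")
```

Logging this number whenever racks move in or out would make drift visible long before it matters.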
This is just stuff I've seen in cable TV headends over the years. Granted, this facility should have a power manager/engineer on staff, but so often the power is one of the first things to get cut from the budget.
"Well, good luck finding a judge that doesn't run a bestiality site."
Any data center that advertises high availability should be testing that sort of thing on a regular basis. It's possible that they could fail switchover even if they are being regularly tested, but it is unlikely.
If the "power outage" theory is correct and the "drunken employee" theory is incorrect, as a customer I'd be pissed that the data center I pay tons of money to can't keep my site up in the event of a power outage, which is one of the main perks of hosting at a data center in the first place.
There's a report here that "Flesh-eating zombies are prowling the streets"
You mean that all 3 x 20,000 gallon tanks were empty? I find that hard to believe.
There are two types of people in the world: those who divide people into two types and those who don't.
We are working with our co-location facility managers to assess why its back-up power generators failed to provide the necessary back-up power to prevent our site going down. We apologize for any inconvenience caused by our site being unavailable this afternoon.
I think that's admin speak for:
I warned these idiots eight months ago during my review that the datacenter had outgrown its generator capacity. But did they listen? Fuck no, they just kept counting money and worrying about the bottom line. The beancounters looked at me like I'd asked them for a blowjob from their grandmothers when I submitted the workup for additional generator capacity. And now that the shit's hit the fan, whose ass are they screaming for? Screw this, I'm applying at Taco Bell.
There are some people that if they don't know, you can't tell 'em.
Impossible, they would never commit suicide without posting a note in the form of bad angst-filled poetry to their blog first. There is no chance any of them will actually kill themselves until the site is back online.
This is a DATA CENTER, its whole purpose in life is to be available when things like this happen. It had better have generators and plenty of fuel on hand at all times. The data center I work at has the capability to run at full power with nothing coming in from the outside world for 36 hours. I don't know what the standard is for other data centers, but it seems like they should be capable of getting at least 12 hours of operation without incoming power from the grid.
Press Release on Red Envelope having 2 years of uptime at 365 Main - San Francisco from today: http://365main.com/press_releases/pr_7_24_07_red_envelope.html
It takes Diesel a few years to go bad. That site has fuel polishing systems to prevent that. Because of earthquake risk, they contractually are obliged to have 24-48 hours of backup fuel with many of their clients.
They have the HiTec rotary UPSs in all their facilities, which link a generator to a flywheel UPS. It's stupid to not have backup fuel for that type of system; you can only run for 13 seconds before the load crashes.
It is possible that they got a number of small hits and the generators failed to re-start after a few. Good procedures are to stay on generator until utility stabilizes if you have more than one "hit."
Be interesting to find out what happened.
This has got to be some type of joke: RedEnvelope Reports Two Years of Continuous Uptime at 365 Main's San Francisco's Datacenter.
It was released today....
It's been a long time since I went on a tour of several data centers to locate a new facility for our dot-com. I believe that 365 Main was a facility that does not use a battery UPS. Instead, they have an engine-backed flywheel UPS system (see http://www.enterprisenetworksandservers.com/monthly/art.php?2813 for a description). At the time, they had 10 2-megawatt generators on the roof in an N+2 configuration. The engines are kept heated and are spec'd to go from stop to engage-clutch/deliver-power in 3 seconds. The flywheel can deliver 11 seconds of power so they can fail through a couple of bad engines before running out of flywheel power. They periodically do a 20-hour load test into a pair of 500,000 watt heat-sinks. Time will tell if this outage was a failure of design, failure of maintenance, or outright malfeasance. But it wasn't supposed to happen. They've got some 'splainin' to do.
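A minimal sketch of the flywheel arithmetic, assuming start attempts run strictly one after another and each takes the full spec'd 3 seconds (both are assumptions; the real control sequence isn't described here).

```python
def start_attempts(ride_through_s: float, start_time_s: float) -> int:
    """Number of full engine-start attempts that fit in the flywheel window.

    ASSUMES attempts are sequential and each takes start_time_s.
    """
    if start_time_s <= 0:
        raise ValueError("start_time_s must be positive")
    return int(ride_through_s // start_time_s)


attempts = start_attempts(ride_through_s=11, start_time_s=3)
# The last attempt has to succeed, so one fewer failure can be tolerated.
tolerable_failures = attempts - 1
print(f"{attempts} attempts, so up to {tolerable_failures} failed starts")
```

With 11 seconds of flywheel and 3-second starts, that gives three attempts, which squares with "a couple of bad engines". It also shows how little slack there is if a start runs slow.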
As to diesel storage, use of diesel is widespread for emergency use everywhere from hospitals to emergency services. Those systems are run regularly - typically weekly. The use of biocides, stabilizers, mobile fuel-scrubbing services, and extra filtration systems can maintain the fuel quality. Our colo currently maintains a 1-week fuel-supply and has multiple quick-refuel contracts in place. I can't imagine any colo having less than 24-48 hours in-the-tank with quick-refill on-call.
But one thing that is missing is cooling. Our colo has a typical contract that says something like blah-blah won't exceed 80F for more than 4 hours blah blah. OK, but a rack full of blade servers can crank out 15-20kW of heat load and a data center can heat up real quick without AC. By contract, 150F for 3.5 hours would be in-spec.
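To put a rough number on "real quick": the sketch below estimates how fast the room air alone warms up with the cooling off. It ignores the thermal mass of the equipment, floor, and walls, so treat it as a pessimistic upper bound. The rack count and room volume are invented for illustration.

```python
AIR_DENSITY = 1.2   # kg/m^3, approximate for air near room temperature
AIR_CP = 1005.0     # J/(kg*K), approximate specific heat of air


def air_temp_rise_c_per_min(heat_kw: float, room_volume_m3: float) -> float:
    """Degrees C per minute the room air warms with no cooling at all.

    ASSUMES all heat goes into the air and nothing else absorbs it,
    which overstates the real rate.
    """
    if room_volume_m3 <= 0:
        raise ValueError("room_volume_m3 must be positive")
    air_mass_kg = AIR_DENSITY * room_volume_m3
    joules_per_min = heat_kw * 1000 * 60
    return joules_per_min / (air_mass_kg * AIR_CP)


# Hypothetical room: 50 racks at 20 kW each in 3000 cubic meters.
rate = air_temp_rise_c_per_min(heat_kw=50 * 20, room_volume_m3=3000)
print(f"about {rate:.1f} C per minute, air only")
```

Even allowing a large margin for everything this model leaves out, a dense room has minutes rather than hours before it leaves any sensible temperature spec.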
~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
For me it would be the other way around. A technology failure I could understand. Letting a drunk employee near my server rack, I could not.
If you want news from today, you have to come back tomorrow.
No kidding. Years ago in my former job on traffic systems we had a great UPS with a generator on site and the ability to keep it fueled up indefinitely. A security contractor came in on the weekend to install something and tried to wire up a new circuit hot. He slipped with a screwdriver and shorted the white phase to the chassis of the breaker panel. I don't think the tip of the driver actually touched ground, but the burn mark is still there to show how close he got.
The resulting current spike blew the 100A fuses (heavy metal strips) both going into and out of the UPS. With the UPS effectively broken the generator set failed to start and the system gracefully shut down 40 minutes after the incident. That's not bad. The batteries were only specified to work long enough for the genny to settle at 50Hz.
In the process of blowing the fuses a spike got back into the power supply of one of our DEC Alphas and took out the power supply. The system was redundant at the software level so I didn't notice immediately.
The UPS guy came out and didn't have enough fuses to replace the blown one, but we found that with a bit of brute force and filing attacks some others could be made to fit.
Please type the word in this image: problems
http://michaelsmith.id.au
Data sheet for 365 Main:
The company's San Francisco facility includes two complete back-up systems for electrical power to protect against a power loss. In the unlikely event of a cut to a primary power feed, the state-of-the-art electrical system instantly switches to live back-up generators, avoiding costly downtime for tenants and keeping the data center continuously running.
They use a Hitec Continuous Power System, which is a motor, generator, flywheel, clutch, and Diesel engine all on the same shaft. They don't use batteries.
With this type of equipment, if for some reason you lose power and the generator doesn't start before the flywheel runs down, you're dead. There's no way to start the thing without external power, unless you buy the optional Black Start feature, which has an extra battery pack for starting the Diesel. "Usually the black start facility will not be often needed but it won't hurt to consider installing one. Just imagine if you were unable to start up your UPS system because the mains supply is not available." Did 365 Main buy that option?
I tried to mod the article "-1 Not Redundant" but it wasn't an option. And I didn't have mod points. At least my inability to function only warrants a comment, rather than a slashdot article.
Velociraptor = Distiraptor / Timeraptor
...until the commercial power fails and doesn't come back for days.
:o(
The only places I've actually seen the insane levels of backup that some would like are some telco central offices. The one I was associated with the longest had eight-hour-plus battery backup and 8 days of fuel for the diesels. Some of our really remote microwave sites had 24 hour battery and 30 day diesel.
Of course one of those sites failed high up in a mountain range in a mid-winter storm (Tieton, 1978) when the commercial power failed, and the starter battery for the diesel froze. When one of the techs finally got there (after burying his Sno-Cat and walking the last couple miles), he had to chip ice off the steel door to get inside, where he was able to get the diesel started with a little "rewire" of one of the backup battery sets. Oh, his two-way radio also failed during his hike, since it was outside his snowsuit, and the lack of communication caused the company to start two more Sno-Cats and a helicopter in that direction.
The site was out for nearly six hours, IIRC.
Even the BEST designs are subject to failure.
--
Tomas
88.88% uptime causes less outage than 99.9%? I don't follow your math. Did you do it with an Intel chip, by any chance?
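For reference, a quick sketch of the arithmetic, assuming a 365-day year (8,760 hours). The function name is just illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime_pct: float) -> float:
    """Expected hours of outage per year at a given uptime percentage."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return (100 - uptime_pct) / 100 * HOURS_PER_YEAR


for pct in (99.9, 88.88):
    print(f"{pct}% uptime: {downtime_hours_per_year(pct):.2f} hours down per year")
```

That works out to under nine hours a year at 99.9% against more than forty days at 88.88%.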
I work 3 blocks from 365 Main.
There were 5 individual power failures, each no longer than 5 minutes, over a roughly 30 minute period. A couple of them were in quick succession.
Well, according to their self-congratulatory press release, issued earlier today, they were allegedly at 100% uptime for the past two years.
;-)
The irony of issuing a press release like that, and then to be hit with a power outage and apparent simultaneous failure of all backup systems later that day, is beyond measure.
I don't know about God, but it's enough to make me believe in karma.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
HP has a nice overview of building systems which can failover between widely distributed nodes called Designing Disaster Tolerant High Availability Clusters. It's a bit old, and is focused on ServiceGuard, but is still interesting.
"...I would think these large sites are going to pitch a bitch..."
I would think these large sites would understand the concept of not putting all your eggs (servers) in one basket. There is a reason why smart companies use replication and clustering, and datacenters spread across the country.
You forgot one very important component, the car battery used to START THE GENERATOR. I've been to many sites where the battery that would start the generator is a) dead or b) missing.
Time to upgrade the cardswipe system to also require a breathalyzer.
*swipe*
*bip* *beep* *beep* *boop* *bleep*
[deep breath]
*whoosh*
Alcohol Level: 0.15
*beeeeeeeep*
Damnit!!
There's no place like
365 Main Customer,
At 1:49 p.m. on Tuesday, July 24, 365 Main's San Francisco data center was affected by a power surge caused when a PG&E transformer failed in a manhole under 560 Mission St.
An initial investigation has revealed that certain 365 Main back-up generators did not start when the initial power surge hit the building. On-site facility engineers responded and manually started affected generators, allowing stable power to be restored at approximately 2:34 p.m. across the entire facility.
As a result of the incident, continuous power was interrupted for up to 45 mins for certain customers. We're certain colo rooms 1, 3 and 4 were directly affected, though other colocation rooms are still being investigated. We are currently working with Hitec, Valley Power Systems, Cupertino Electric and PG&E to further investigate the incident and determine the root cause.
All generators will continue to operate on diesel until the root cause of the event has been identified and corrected. Generators are currently fueled with over 4 days of fuel and additional fuel has already been ordered.
We understand the seriousness of this issue and will provide full details once they become available. We sincerely apologize for the impact this has had on your operations.
Regards,
Vice President, Security
365 Main
"The World's Finest Data Centers"
Just send me a big fat check and all is forgiven.