Multiple Sites Down In SF Power Outage
corewtfux writes with word of a major outage apparently centered on 365 Main, a datacenter on the edge of San Francisco's Financial District. Valleywag initially claimed that a drunken person had gotten in and damaged 40 racks, but an update from Technorati's Dave Sifry says the problem is a widespread power outage. Sites affected include Technorati, Netflix (these display nice "We're Dead" pages), Typepad, LiveJournal, Sun.com, and Craigslist (these just time out).
At least 20,000 without power in downtown S.F.
Marisa Lagos and Demian Bulwa, Chronicle Staff Writers
Tuesday, July 24, 2007
(07-24) 15:12 PDT SAN FRANCISCO -- At least 20,000 customers of Pacific Gas and Electric Co. in downtown San Francisco lost power this afternoon, the utility said.
Brian Swanson, a spokesman for the utility, said outages have been reported throughout downtown and along the Embarcadero, including at PG&E's office on Beale Street near the Ferry Building. It was unclear initially how many customers who lost power remained without it for a sustained period.
Power outages were also reported in the South of Market neighborhood, the Outer Mission and down the 3rd Street corridor south of Mission Bay. PG&E officials said they did not know why power had gone out, but most customers appeared to be back online by 3 p.m.
The outage has prompted Muni to run shuttles in the place of cable cars, a spokeswoman said. The T-Third Metro line was unable to cross the 4th Street Bridge for a short time, but power was restored to the drawbridge by 3 p.m. Muni bus lines 14, 49, 30, 41 and 45 were without power for about 30 minutes following the outage, but are now working, spokeswoman Maggie Lynch said. Parking Control officers were deployed to the Outer Mission, 3rd Street and Monterey Avenue for traffic control, she added.
Power first went offline around 1:50 p.m. and came back at least three times in the downtown area before shutting off again. The same problems were reported in South of Market all the way to AT&T Park and the Caltrain station at Fourth and King streets, and traffic lights were out as far south as Monterey Boulevard.
At the Westfield Center at Market and Fifth streets, only one of six Nordstrom elevators was working while the shopping mall ran on a backup generator. Shoppers milled around as the lights flickered on and off.
BART is still running trains but the lights at its downtown stations have flickered on and off several times, said spokesman Linton Johnson. The transit agency also has concerns about the ventilation system, which is on the same grid as the lights, he said, but will keep its downtown stations open so long as the lights and ventilation continue to work. Workers at several downtown and South of Market offices were reportedly sent home for the day following the outage. Additionally, the datacenter 365 Main -- which hosts Web sites including Craigslist and Yelp -- lost power.
Yep, it took down most of CNET, which GameFAQs is under. Main site is back up as of now, though forums are still down.
If forums teach us anything, it is that logic and critical thinking should be required courses in the public schools.
Well, you test and test and test, and when something finally happens, nothing. Stuff happens.
Brownouts sometimes fail to trigger generators, even though they should. If only one phase goes down, depending on the design, it may not trip (and would cause a somewhat random outage, like some drunk shutting down racks).
If the generator runs on diesel, they usually only plan for a few hours of backup. If they didn't recalculate the generator runtime as they added equipment, the load may have caused the fuel consumption to go up higher than anticipated. Is it hot in SF today? Air handlers may be straining to keep the place cool, or maybe the generator got running too hot.
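To put rough numbers on the runtime point, here is a minimal sketch. The 0.07 gallons per kWh burn rate is only a rule-of-thumb assumption for a diesel generator, and the 2,000 gallon tank and the load figures are hypothetical, not anything known about 365 Main.

```python
def runtime_hours(tank_gallons, load_kw, gallons_per_kwh=0.07):
    """Estimated generator runtime in hours at a constant electrical load.

    gallons_per_kwh is an assumed rule-of-thumb burn rate, not a measured value.
    """
    if load_kw <= 0:
        raise ValueError("load_kw must be positive")
    return tank_gallons / (load_kw * gallons_per_kwh)


# Hypothetical site: same tank, but the load grows as equipment is added.
for load_kw in (200, 400, 800):
    print(f"{load_kw} kW -> {runtime_hours(2000, load_kw):.1f} h")
```

Under this simple model, doubling the load halves the runtime, which is why the estimate needs redoing whenever racks are added.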
Oftentimes, as equipment is added, the load gets out of balance between phases. It is usually a good idea to keep the load as even as possible, but in a high-traffic data center, I would imagine there would be a lot of stuff moving in and out, expanding and contracting, and it may become hard to keep track of the loads across phases. A good facilities manager should be able to tell you the current load off the top of his head, but too often these details get left out.
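One way to keep an eye on this is to track a single imbalance figure per panel. The sketch below uses a common definition (largest deviation from the average phase load, as a percentage of the average); the function name and the sample readings are made up for illustration.

```python
def phase_imbalance_percent(loads):
    """Largest deviation from the mean phase load, as a percentage of the mean."""
    mean = sum(loads) / len(loads)
    if mean == 0:
        return 0.0
    return max(abs(x - mean) for x in loads) / mean * 100.0


# Hypothetical per-phase loads in amps for a three-phase panel.
balanced = phase_imbalance_percent([100.0, 100.0, 100.0])
skewed = phase_imbalance_percent([120.0, 100.0, 80.0])
print(balanced, skewed)
```

What threshold counts as "too far out" depends on the equipment, so that part is left to whoever owns the power budget.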
This is just stuff I've seen in cable TV headends over the years. Granted, this facility should have a power manager/engineer on staff, but so often the power is one of the first things to get cut from the budget.
"Well, good luck finding a judge that doesn't run a bestiality site."
You mean that all 3 x 20,000 gallon tanks were empty? I find that hard to believe.
There are two types of people in the world: those who divide people into two types and those who don't.
This is a DATA CENTER, its whole purpose in life is to be available when things like this happen. It had better have generators and plenty of fuel on hand at all times. The data center I work at has the capability to run at full power with nothing coming in from the outside world for 36 hours. I don't know what the standard is for other data centers, but it seems like they should be capable of getting at least 12 hours of operation without incoming power from the grid.
It's been a long time since I went on a tour of several data centers to locate a new facility for our dot-com. I believe that 365 Main was a facility that does not use a battery UPS. Instead, they have an engine-backed flywheel UPS system (see http://www.enterprisenetworksandservers.com/monthly/art.php?2813 for a description). At the time, they had 10 2-megawatt generators on the roof in an N+2 configuration. The engines are kept heated and are spec'd to go from stop to engage-clutch/deliver-power in 3 seconds. The flywheel can deliver 11 seconds of power, so they can fail through a couple of bad engines before running out of flywheel power. They periodically do a 20-hour load test into a pair of 500,000 watt heat-sinks. Time will tell if this outage was a failure of design, failure of maintenance, or outright malfeasance. But it wasn't supposed to happen. They've got some 'splainin' to do.
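The arithmetic behind "fail through a couple of bad engines" can be sketched as below. It assumes start attempts happen one after another with no switchover delay, which is my simplification and not something from the facility's documentation; the function names are made up.

```python
def start_attempts(flywheel_seconds, engine_start_seconds):
    """Complete engine start attempts that fit inside the flywheel ride-through window.

    Assumes attempts are strictly sequential with no switchover delay.
    """
    return int(flywheel_seconds // engine_start_seconds)


def capacity_with_spares_mw(generators, spares, mw_each):
    """Capacity in MW that is still available when all spare units are out."""
    return (generators - spares) * mw_each


attempts = start_attempts(11, 3)                 # 11 s of flywheel, 3 s per engine start
tolerated_failures = attempts - 1                # the last attempt has to succeed
design_load_mw = capacity_with_spares_mw(10, 2, 2)  # 10 x 2 MW in N+2
print(attempts, tolerated_failures, design_load_mw)
```

With the figures quoted above, three attempts fit in the window, so two bad starts in a row are survivable and a third is not.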
As to diesel storage, use of diesel is widespread for emergency use everywhere from hospitals to emergency services. Those systems are run regularly, typically weekly. Biocides, stabilizers, mobile fuel-scrubbing services, and extra filtration systems can maintain the fuel quality. Our colo currently maintains a 1-week fuel supply and has multiple quick-refuel contracts in place. I can't imagine any colo having less than 24-48 hours in-the-tank with quick-refill on-call.
But one thing that is missing is cooling. Our colo has a typical contract that says something like blah-blah won't exceed 80F for more than 4 hours blah blah. OK, but a rack full of blade servers can crank out 15-20kW of heat load and a data center can heat up real quick without AC. By contract, 150F for 3.5 hours would be in-spec.
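To make the loophole concrete, here is a small sketch of a checker that reads the clause literally. It assumes the clause means one continuous stretch above the limit (the contract wording is only paraphrased above) and that temperatures are logged at a fixed interval; all names are hypothetical.

```python
def longest_excursion_hours(temps_f, sample_minutes, limit_f=80.0):
    """Longest continuous run of samples above limit_f, converted to hours."""
    longest = current = 0
    for t in temps_f:
        current = current + 1 if t > limit_f else 0
        longest = max(longest, current)
    return longest * sample_minutes / 60.0


def in_spec(temps_f, sample_minutes, limit_f=80.0, max_hours=4.0):
    """True if no continuous excursion above limit_f lasts longer than max_hours."""
    return longest_excursion_hours(temps_f, sample_minutes, limit_f) <= max_hours


# 3.5 hours at 150F, logged every 30 minutes, with normal readings on either side.
readings = [72.0] + [150.0] * 7 + [72.0]
print(in_spec(readings, sample_minutes=30))
```

The check only looks at duration, never at how far over the limit the room went, which is exactly the problem with a clause worded this way.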
~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
They do, but one of the dirty little secrets of most data centers is that they don't have enough generator capacity for all the cooling. They'll woo you with the generator, the 2,000 gallons of diesel, and N+1 array of UPSes, but when utility power dies, it gets hot very quickly. And some racks must go down.
365 Main has a long and ignominious history of frequent and prolonged power outages, yet it remains fully booked. Some people just can't learn a lesson.
For what it's worth, the datacenter which is adjacent to 365 Main, called 360 Spear, did not suffer from this outage.
I work 3 blocks from 365 Main.
There were 5 individual power failures, each no longer than 5 minutes, over a roughly 30 minute period. A couple of them were in quick succession.
365 Main Customer,
At 1:49 p.m. on Tuesday, July 24, 365 Main's San Francisco data center was affected by a power surge caused when a PG&E transformer failed in a manhole under 560 Mission St.
An initial investigation has revealed that certain 365 Main back-up generators did not start when the initial power surge hit the building. On-site facility engineers responded and manually started affected generators, allowing stable power to be restored at approximately 2:34 p.m. across the entire facility.
As a result of the incident, continuous power was interrupted for up to 45 mins for certain customers. We're certain colo rooms 1, 3 and 4 were directly affected, though other colocation rooms are still being investigated. We are currently working with Hitec, Valley Power Systems, Cupertino Electric and PG&E to further investigate the incident and determine the root cause.
All generators will continue to operate on diesel until the root cause of the event has been identified and corrected. Generators are currently fueled with over 4 days of fuel and additional fuel has already been ordered.
We understand the seriousness of this issue and will provide full details once they become available. We sincerely apologize for the impact this has had on your operations.
Regards,
Vice President, Security
365 Main
"The World's Finest Data Centers"
Just send me a big fat check and all is forgiven.
I was a hardware engineer about 10 years back on the battery backup systems. We were developing new technology to try and stretch the life of the batteries. We worked together with some of the top minds in battery technology in the US.
The battery systems installed in the "Bomb cages," as we called them because the larger ones were often underground and looked like a 3-person bomb shelter, were quite impressive. Typically, they were two full banks of twenty-four 2-volt, 375-amp batteries, each of them physically twice the size of a truck battery. They were most often lead-acid mammoths at the time, since lead-acid was reliable for a measurable period of time and inexpensive compared with the lithium-ion variety at the same capacity.
The batteries were always rated at 10 years' life by the manufacturer, but the telephone companies had tested in real-world environments and would rotate the cells out at 4-year intervals instead, since downtime on the network to replace power systems was far more expensive than being prepared. After all, each one of these cabinets would typically handle as many as 15,000 telephone lines and would often contain fibre repeaters for higher-speed lines connecting the boxes together and then to the central.
The biggest problem with these installations was that if a single battery in a shipment showed signs of early fatigue, most typically visible as bubbling in the plastic walls, it was policy to replace the entire batch of cells immediately, not just the single battery displaying fatigue. This was because it was clear that if a single battery in the group showed fatigue, then all the cells in the bank would probably be susceptible to the same issue. It could be something as simple as a manufacturing screw-up, or it could be due to a cooling system problem in the box, or any of a lot of other environmental issues.
It's really quite impressive the cost and effort the telephone company would go through just to maintain the UPS system and prevent issues with it, a system which, thankfully, rarely gets exercised in places where people are intelligent enough not to live on fault lines or high-risk hurricane paths.
The greatest flaw in the design of the battery systems was that they were always trickle-charged. The chargers were unintelligent and simply kept the batteries topped off. This caused "memory issues" as we're all familiar with, especially thanks to notebook batteries.
What we learned about the cells where I was engineering was that, if a cell could physically survive as long as 7 years without environmentally related damage (bubbles), then it should be possible to detect early stages of design related fatigue within a single cell.
We also found that if a weekly or monthly power cycle of a bank of cells were performed, the batteries would last substantially longer than the 4-year expectancy. So, in the case of Bomb Cages where at least two full banks of cells were available (that's pretty much a minimum configuration), on a proper schedule, using a huge-ass resistor bank, we would fully drain a bank of cells until we could detect nearly 0 current across the resistor. Then we would perform a full charge on the cells again, monitoring each cell more than 10 times per second. Batteries that failed to charge in sync with the other cells were typically early replacement candidates.
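As a rough illustration of the "charge in sync" check, here is a minimal sketch that flags cells whose voltage strays from the bank median during a charge. This is not the system we built; the 0.05 V tolerance, the data layout, and the sample voltages are all made up for the example.

```python
from statistics import median


def out_of_sync_cells(samples, tolerance_v=0.05):
    """Return sorted indices of cells that stray from the bank median.

    samples: one list of per-cell voltages for each monitoring timestep.
    tolerance_v: assumed allowed deviation in volts (illustrative value only).
    """
    flagged = set()
    for voltages in samples:
        mid = median(voltages)
        for i, v in enumerate(voltages):
            if abs(v - mid) > tolerance_v:
                flagged.add(i)
    return sorted(flagged)


# Hypothetical 4-cell bank over three timesteps; cell 2 lags during the charge.
samples = [
    [2.00, 2.01, 2.00, 1.99],
    [2.10, 2.11, 2.02, 2.10],
    [2.20, 2.21, 2.05, 2.19],
]
print(out_of_sync_cells(samples))
```

Comparing against the median rather than the mean keeps one badly lagging cell from dragging the reference point along with it.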
Well, all that being said, one thing I'm 100% confident of is that data centers lack the experience and the interest to budget this kind of research for their systems. The telephone companies are amazingly well prepared in comparison.
On a side note, just last week, I installed my first 48V DC powered RAID rack. I designed a high-efficiency hard drive case that contained no fans. Each case was 1U and shallow enough to install two back-to-back in a rack. We installed 96 units in a single rack with 4 drives each and no air conditioning in the room. The design was extremely simple.
1) Use Telco