Multiple Sites Down In SF Power Outage

I work in the Financial District by slug_bait · 2007-07-24 10:18 · Score: 5, Interesting

I can verify that it affected much of the Financial District here in SF. We had the power go out 3 times. Seems to be back now. Haven't heard any explanation yet.

Re:I work in the Financial District by halfloaded · 2007-07-24 11:15 · Score: 5, Funny

Phew... I was worried the internet got slashdotted.

Oblig.... by Anonymous Coward · 2007-07-24 10:18 · Score: 5, Funny

im in ur datacentr
trashin ur racks

Re:Oblig.... by Tackhead · 2007-07-24 10:29 · Score: 5, Funny

> im in ur datacentr
>
> trashin ur racks
Lizzie Borden did teh h4x,
Got drunk and unplugged 40 racks.
When she saw what she had done,
She unplugged number 41.
(Lawn. Off. Git.)
Re:Oblig.... by MsGeek · 2007-07-24 11:17 · Score: 5, Funny

I felt a great disturbance in the Internet, as if millions of geeks suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened.

--
Knowledge is power. Knowledge shared is power multiplied.

Redundant? by DogDude · 2007-07-24 10:19 · Score: 5, Insightful

Don't these large sites have failover-capable, redundant servers in multiple physical locations? Why should a failure in one rack, one room, or heck, even one state for the giant sites, affect them?

--
I don't respond to AC's.

Re:Redundant? by RobertB-DC · 2007-07-24 10:31 · Score: 2, Funny

Because those who bought colo services were in fact ripped off, and should now be proceeding to San Francisco to seek vengeance upon those who can do little more than process credit card payments.

Perhaps they could begin their vengeful wrath by hiring a few (more?) winos...

--
Stressed? Me? Of course not. Stress is what a rubber band feels before it breaks, silly.
Re:Redundant? by Anonymous Coward · 2007-07-24 11:17 · Score: 5, Informative

They do, but one of the dirty little secrets of most data centers is that they don't have enough generator capacity for all the cooling. They'll woo you with the generator, the 2,000 gallons of diesel, and N+1 array of UPSes, but when utility power dies, it gets hot very quickly. And some racks must go down.
Re:Redundant? by ryanisflyboy · 2007-07-24 11:32 · Score: 4, Interesting

Some of these sites are a lot more central than you might realize. If they failed to build their systems with a secondary site in mind, it can be near impossible to get the "CTO" types to pony up the dollars for it later. The biggest issue I have seen that affects this is storage. Either they aren't using suitable SAN technologies, or they didn't put enough money behind the storage initiative to set up secondary site replication. I agree with you though. This is a problem that has been solved. Perhaps Netflix thought - wth - if we go out for a few hours and people can't choose their movies, that's just tough luck.

Sun.com going down is a good example of someone totally screwing up. They have absolutely NO excuse. The others - maybe they can get away with it and we won't care. If Sun can't keep their own site up, how can I expect them to keep mine up?
Re:Redundant? by Anonymous Coward · 2007-07-24 11:32 · Score: 2, Insightful

This reminds me of other sites I have worked on. On more than one occasion someone wanted to move the physical location from Texas to SF or some other silly place. For some reason it made perfect sense to them to move the computers from a stable location with inexpensive labor and cheap, reliable power (Texas is on its own grid, with a plethora of power plants, and energy executives always give themselves cheap power) to a location that was earthquake-ridden and unreasonably expensive in living costs and power. Even when Enron was taking California for a ride, and in Texas we knew this, people still thought it made good business sense to have the servers in the Bay Area. All in one place. With no redundant location.
I am not surprised that one little power blip took everything out.
Re:Redundant? by raehl · 2007-07-24 11:51 · Score: 3, Insightful

I'm certainly forwarding this article to my boss, who abruptly decided to put an end to planning for a backup site on the basis of "aw, nothing is going to happen".

The thing is, letting something happen may be a better decision than trying to stop it.

If you're going to have a fully-redundant setup, it's going to cost you twice as much as having just one setup. And if you're not going to have a fully-redundant setup, your backup site is going to buckle under the full load of normal traffic anyway.

The correct business decision might just be "I just saved a bunch of money on my data center insurance," and if you lose a day's business, oh well, that was still cheaper than keeping a backup data center around.
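The trade-off above can be put as back-of-the-envelope arithmetic. This is only a sketch: every figure below (outage frequency, outage length, hourly revenue, backup site cost) is a made-up assumption for illustration, and it ignores harder-to-price effects such as lost customer trust.

```python
def expected_annual_outage_loss(outages_per_year, hours_per_outage, revenue_per_hour):
    """Expected revenue lost per year to outages (all inputs are assumptions)."""
    return outages_per_year * hours_per_outage * revenue_per_hour


def backup_site_pays_off(backup_cost_per_year, outages_per_year,
                         hours_per_outage, revenue_per_hour):
    """True if the expected outage loss exceeds the yearly cost of the backup site."""
    loss = expected_annual_outage_loss(outages_per_year, hours_per_outage,
                                       revenue_per_hour)
    return loss > backup_cost_per_year


if __name__ == "__main__":
    # Hypothetical numbers: one 8-hour outage every two years, $10,000/hour revenue,
    # and a backup site that costs $500,000 per year to keep around.
    loss = expected_annual_outage_loss(0.5, 8, 10_000)
    print(f"expected annual loss: ${loss:,.0f}")
    print("backup site pays off:", backup_site_pays_off(500_000, 0.5, 8, 10_000))
```

With numbers like these, eating the outage is the cheaper choice; a site with longer or more frequent outages, or higher hourly revenue, would flip the answer.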

--
paintball
Re:Redundant? by Jeffrey+Baker · 2007-07-24 12:12 · Score: 4, Informative

365 Main has a long and ignominious history of frequent and prolonged power outages, yet it remains fully booked. Some people just can't learn a lesson.

For what it's worth, the datacenter which is adjacent to 365 Main, called 360 Spear, did not suffer from this outage.
Re:Redundant? by tv_dinners · 2007-07-24 22:10 · Score: 2, Interesting

true 'dat. Makes one wonder why not just relocate everything to Alaska or somewhere else that's cold as hell.

Speaking of energy costs associated with heat dissipation, I've always been curious about a method that could produce energy from wasted heat, as a solar panel does from the sun.

Wrap that supercharged V8 in some energy producing heatwrap, instant hybrid and more horsepower. Run your processor's cooling fan off the energy produced from the excessive heat.

Someone please tell me I'm talking out of my ass, or worse, just gave away the next big idea that could have made me billions.
Re:Redundant? by LMacG · 2007-07-25 00:51 · Score: 4, Funny

cold as hell

I think I see the flaw in your plan.

--
Slightly disreputable, albeit gregarious
Re:Redundant? by myth_of_sisyphus · 2007-07-25 07:05 · Score: 2, Informative

Actually, according to Dante the very depths of hell are reserved for traitors, who are encased in a lake of ice. The Divine Heat Sink.

This lowest circle, the ninth, consists of people who have betrayed someone close to them: Brutus and Cassius, Cain, and the worst of them all, Judas, is being chewed on by Satan himself.

Just fyi.

Other sites.. by king-manic · 2007-07-24 10:19 · Score: 2, Informative

Gamefaqs/Gamespot is also down. I wonder if it's related.

--
"There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy."

Re:Other sites.. by nuzak · 2007-07-24 10:29 · Score: 2, Informative

Gamefaqs/Gamespot is C|Net, located on Rincon Hill in downtown SF, and their servers are probably in 365main. So yeah.

Anyway, PG&E says it's over now, but they still don't have an explanation as to why. Shyeah (rolls eyes)

--
Done with slashdot, done with nerds, getting a life.

Re:No Generators? by Anonymous Coward · 2007-07-24 10:19 · Score: 2, Interesting

They probably just didn't kick in. Had the same problem at Internap in Seattle a few years ago. Power was cut to the building and the UPSs failed to switch over.

Redundent power supply? by msimm · 2007-07-24 10:19 · Score: 2, Interesting

Does this mean backup generators have failed or is the fault somewhere outside the datacenter? Time to start shopping.

--
Quack, quack.

Re:Redundent power supply? by dextromulous · 2007-07-24 10:33 · Score: 3, Informative

You mean that all 3 x 20,000 gallon tanks were empty? I find that hard to believe.

--
There are two types of people in the world: those who divide people into two types and those who don't.
Re:Redundent power supply? by grumling · 2007-07-24 10:45 · Score: 2, Informative

I really doubt they were ever full. Diesel fuel goes bad after a few months. Unless SF has really, really crappy power*, the generators don't do much more than idle once a week for 20 minutes or so. The giant tanks are only there for the marketing department. And maybe for the employees to top off their tanks.

*I live out in the middle of nowhere and I get a power failure exceeding 5 minutes about once per year. The longest I've had at my current location was just over 2 hours.

--
"Well, good luck finding a judge that doesn't run a bestiality site."
Re:Redundent power supply? by aaarrrgggh · 2007-07-24 11:00 · Score: 5, Interesting

It takes Diesel a few years to go bad. That site has fuel polishing systems to prevent that. Because of earthquake risk, they contractually are obliged to have 24-48 hours of backup fuel with many of their clients.

They have the HiTec rotary UPSs in all their facilities, which link a generator to a flywheel UPS. It's stupid to not have backup fuel for that type of system; you can only run for 13 seconds before the load crashes.

It is possible that they got a number of small hits and the generators failed to re-start after a few. Good procedures are to stay on generator until utility stabilizes if you have more than one "hit."

Be interesting to find out what happened.
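The "stay on generator until utility stabilizes" rule can be sketched as a small decision function. This is an illustration only, not how 365 Main's switchgear works: the function name, the 30-minute hold time, and the "retransfer at once after a single hit" shortcut are all assumptions.

```python
def should_return_to_utility(hit_times_min, now_min, utility_ok, hold_min=30.0):
    """Decide whether to transfer the load from generator back to utility.

    hit_times_min: minutes (on any common clock) at which utility power dropped.
    hold_min: hypothetical quiet period required after repeated hits.
    """
    if not utility_ok:
        return False  # never go back to a dead feed
    if len(hit_times_min) <= 1:
        return True  # zero or one hit: simplified here to an immediate return
    # More than one hit: wait until the utility has been quiet for hold_min.
    return now_min - max(hit_times_min) >= hold_min


if __name__ == "__main__":
    hits = [0, 6, 12]  # three utility hits within twelve minutes (made-up data)
    print(should_return_to_utility(hits, now_min=15, utility_ok=True))
    print(should_return_to_utility(hits, now_min=45, utility_ok=True))
```

The point of the hold time is to avoid repeated start/stop cycles on the engines when the utility feed is flapping.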
Re:Redundent power supply? by Eric+in+SF · 2007-07-24 12:43 · Score: 3, Informative

I work 3 blocks from 365 Main.

There were 5 individual power failures, each no longer than 5 minutes, over a roughly 30 minute period. A couple of them were in quick succession.
Re:Redundent power supply? by strabo · 2007-07-24 12:55 · Score: 2, Funny

I realize that a press release from 2004 is hardly relevant

Well, how about one from today ?

SAN FRANCISCO, Calif., July 24, 2007 -- 365 Main Inc., developer and operator of the world's finest data centers, has provided online retailer RedEnvelope with two years of 100-percent uptime at 365 Main's San Francisco facility.

To ensure uptime for key tenants such as RedEnvelope, 365 Main provides modern power and cooling infrastructure. The company's San Francisco facility includes two complete back-up systems for electrical power to protect against a power loss. In the unlikely event of a cut to a primary power feed, the state-of-the-art electrical system instantly switches to live back-up generators, avoiding costly downtime for tenants and keeping the data center continuously running.

Timing is everything, eh? LOL.
Re:Redundent power supply? by HockeyPuck · 2007-07-24 14:25 · Score: 3, Funny

You forgot one very important component, the car battery used to START THE GENERATOR. I've been to many sites whereby the battery that would start the generator is a) dead or b) missing.

how many data centers? by riceboy50 · 2007-07-24 10:23 · Score: 3, Interesting

It's interesting that so many major sites would go down in a local power outage. Are they all sharing one data center in SF? If so, why don't they have co-locations in other cities?

--
~ I am logged on, therefore I am.

LiveJournal?? by nsanders · 2007-07-24 10:28 · Score: 2, Funny

I can hear it now, the sound of a million emos all finally committing suicide.

Re:LiveJournal?? by eln · 2007-07-24 10:36 · Score: 4, Funny

Impossible, they would never commit suicide without posting a note in the form of bad angst-filled poetry to their blog first. There is no chance any of them will actually kill themselves until the site is back online.
Re:LiveJournal?? by dextromulous · 2007-07-24 10:37 · Score: 2, Funny

> I can hear it now, the sound of a million emos all finally committing suicide.

Nah, they wouldn't commit suicide if they couldn't blog about it afterwards...

--
There are two types of people in the world: those who divide people into two types and those who don't.

The Scoop from SFGate.com by fromtheblueline · 2007-07-24 10:28 · Score: 3, Informative

At least 20,000 without power in downtown S.F. Marisa Lagos and Demian Bulwa, Chronicle Staff Writers Tuesday, July 24, 2007 (07-24) 15:12 PDT SAN FRANCISCO -- At least 20,000 customers of Pacific Gas and Electric Co. in downtown San Francisco lost power this afternoon, the utility said. Brian Swanson, a spokesman for the utility, said outages have been reported throughout downtown and along the Embarcadero, including at PG&E's office on Beale Street near the Ferry Building. It was unclear initially how many customers who lost power remained without it for a sustained period. Power outages were also reported in the South of Market neighborhood, the Outer Mission and down the 3rd Street corridor south of Mission Bay. PG&E officials said they did not know why power had gone out, but most customers appeared to be back online by 3 p.m. The outage has prompted Muni to run shuttles in the place of cable cars, a spokeswoman said. The T-Third Metro line was unable to cross the 4th Street Bridge for a short time, but power was restored to the drawbridge by 3 p.m. Muni bus lines 14, 49, 30, 41 and 45 were without power for about 30 minutes following the outage, but are now working, spokeswoman Maggie Lynch said. Parking Control officers were deployed to the Outer Mission, 3rd Street and Monterey Avenue for traffic control, she added. Power first went offline around 1:50 p.m. and came back at least three times in the downtown area before shutting off again. The same problems were reported in South of Market all the way to AT&T Park and the Caltrain station at Fourth and King streets, and traffic lights were out as far south as Monterey Boulevard. At the Westfield Center at Market and Fifth streets, only one of six Nordstrom elevators was working while the shopping mall ran on a backup generator. Shoppers milled around as the lights flickered on and off. 
BART is still running trains but the lights at its downtown stations have flickered on and off several times, said spokesman Linton Johnson. The transit agency also has concerns about the ventilation system, which is on the same grid as the lights, he said, but will keep its downtown stations open so long as the lights and ventilation continue to work. Workers at several downtown and South of Market offices were reportedly sent home for the day following the outage. Additionally, the datacenter 365 Main -- which hosts Web sites including Craigslist and Yelp -- lost power.

Re:The Scoop from SFGate.com by RealGrouchy · 2007-07-24 11:25 · Score: 5, Funny

It was hard to read through that block of text, but looking closely, it explains why:

"Officials say the power outage may affect some websites, including the site that hosts Slashdot.org's preview button."

It all seems to be back up now.

- RG>

--
Hey pal, this isn't a pleasantforest, so don't waste my time with pleasantries!

Re:GameFAQs by XenoRyet · 2007-07-24 10:29 · Score: 4, Informative

Yep, it took down most of CNET, which GameFAQs is under. Main site is back up as of now, though forums are still down.

--
If forums teach us anything, it is that logic and critical thinking should be required courses in the public schools.

Re:No Generators? by grumling · 2007-07-24 10:31 · Score: 4, Informative

Well, you test and test and test, and when something finally happens, nothing. Stuff happens.

Brownouts sometimes fail to trigger generators, even though they should. If only one phase goes down, depending on the design, it may not trip (and would cause a somewhat random outage, like some drunk shutting down racks).

If the generator runs on diesel, they usually only plan for a few hours of backup. If they didn't recalculate the generator runtime as they added equipment, the load may have caused the fuel consumption to go up higher than anticipated. Is it hot in SF today? Air handlers may be straining to keep the place cool, or maybe the generator got running too hot.
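To see how added equipment eats into runtime, here is a rough sketch. The fuel burn rate of 0.07 gallons per kWh is an assumed ballpark for a large diesel generator, not a figure from this facility, and the tank size and loads are invented.

```python
def runtime_hours(usable_fuel_gal, load_kw, gal_per_kwh=0.07):
    """Hours of generator runtime for a given load (gal_per_kwh is an assumption)."""
    return usable_fuel_gal / (gal_per_kwh * load_kw)


if __name__ == "__main__":
    fuel = 2_000  # gallons, hypothetical tank
    for load in (500, 1_000, 2_000):  # kW, hypothetical loads as racks are added
        print(f"{load:>5} kW -> {runtime_hours(fuel, load):.1f} hours")
```

Under this simple model runtime is inversely proportional to load, so doubling the equipment halves the hours you planned for.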

Often times, as equipment is added, the load gets out of balance between phases. It is usually a good idea to keep the load as even as possible, but in a high traffic data center, I would imagine there would be a lot of stuff moving in and out, expanding and contracting, and it may become hard to keep track of the loads across phases. A good facilities manager should be able to tell you the current load off the top of his head, but too often these details get left out.
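Keeping track of the load across phases does not need much. The sketch below uses one common convention for imbalance (largest deviation from the average, as a percentage of the average); the readings are invented and any alarm threshold would be site policy.

```python
def phase_imbalance_pct(phase_loads):
    """Percent imbalance: max deviation from the mean load, relative to the mean."""
    average = sum(phase_loads) / len(phase_loads)
    worst = max(abs(load - average) for load in phase_loads)
    return worst / average * 100


if __name__ == "__main__":
    # Hypothetical per-phase loads in amps after a few racks moved in.
    print(phase_imbalance_pct([120, 100, 80]))
```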

This is just stuff I've seen in cable TV headends over the years. Granted, this facility should have a power manager/engineer on staff, but so often the power is one of the first things to get cut from the budget.

--
"Well, good luck finding a judge that doesn't run a bestiality site."

Re:No Generators? by eln · 2007-07-24 10:31 · Score: 5, Insightful

Any data center that advertises high availability should be testing that sort of thing on a regular basis. It's possible that they could fail switchover even if they are being regularly tested, but it is unlikely.

If the "power outage" theory is correct and the "drunken employee" theory is incorrect, as a customer I'd be pissed that the data center I pay tons of money to can't keep my site up in the event of a power outage, which is one of the main perks of hosting at a data center in the first place.

zombies .... by taniwha · 2007-07-24 10:31 · Score: 5, Funny

There's a report here that "Flesh-eating zombies are prowling the streets"

From Technocrati: by Darth_brooks · 2007-07-24 10:36 · Score: 5, Funny

We are working with our co-location facility managers to assess why it is back-up power generators failed to provide the necessary back-up power to prevent our site going down. We apologize for any inconvenience caused by our site being unavailable this afternoon.

I think that's admin speak for:

I warned these idiots eight months ago during my review that the datacenter had outgrown its generator capacity. But did they listen? Fuck no, they just kept counting money and worrying about the bottom line. The beancounters looked at me like I'd asked them for a blowjob from their grandmothers when I submitted the workup for additional generator capacity. And now that the shit's hit the fan, whose ass are they screaming for? Screw this, I'm applying at Taco Bell.

--
There are some people that if they don't know, you can't tell 'em.

Re:From Technocrati: by Cervantes · 2007-07-24 10:54 · Score: 2, Insightful

Where's the +1 "100% fucking right" mod option?

Whaddya bet some poor mid-level admin gets blamed and tossed for this? And the upper-management guy who ignored the recommendations for testing or redundancy still gets his bonus for good fiscal performance.

--
If I knew the wedgies I gave you back in 6th grade would have resulted in this . . . I might have taken a moments pause.
Re:From Technocrati: by Soko · 2007-07-24 11:20 · Score: 5, Funny

Thanks for the laughs, even if they led to a sad realization. Cancel, or Allow?

--
"Depression is merely anger without enthusiasm." - Anonymous
Re:From Technocrati: by RealGrouchy · 2007-07-24 11:28 · Score: 5, Funny

"... to assess why it is back-up power generators failed ..." I've been a grammar nazi for many years, but it looks like the enemy has unleashed new weapons.

Tell my family I loved them.

- RG>

--
Hey pal, this isn't a pleasantforest, so don't waste my time with pleasantries!
Re:From Technocrati: by VGPowerlord · 2007-07-25 00:41 · Score: 2, Funny

Where's the +1 "100% fucking right" mod option?

It was renamed to +1 Insightful to appease the people who hate curse words.

--
GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011

Re:No Generators? by eln · 2007-07-24 10:39 · Score: 4, Informative

This is a DATA CENTER, its whole purpose in life is to be available when things like this happen. It had better have generators and plenty of fuel on hand at all times. The data center I work at has the capability to run at full power with nothing coming in from the outside world for 36 hours. I don't know what the standard is for other data centers, but it seems like they should be capable of getting at least 12 hours of operation without incoming power from the grid.

LOLcurrent by carou · 2007-07-24 10:51 · Score: 2, Funny

I is not in ur datacenter, 2 power ur servers.

Kiss of Death? by Honig+the+Apothecary · 2007-07-24 10:53 · Score: 4, Funny

Press Release on Red Envelope having 2 years of uptime at 365 Main - San Francisco from today: http://365main.com/press_releases/pr_7_24_07_red_envelope.html

Valleywag's Guess by immcintosh · 2007-07-24 10:56 · Score: 2, Funny

As someone who lives and works in San Francisco, I can attest that "a crazy homeless dude did it" is a fairly sensible first guess for most problems.

July 24th: RedEnvelope Press Release by 365 Main by duplicate-nickname · 2007-07-24 11:12 · Score: 3, Interesting

This has got to be some type of joke: RedEnvelope Reports Two Years of Continuous Uptime at 365 Main's San Francisco's Datacenter.

It was released today....


About Emergency Power by linuxwrangler · 2007-07-24 11:13 · Score: 5, Informative

It's been a long time since I went on a tour of several data centers to locate a new facility for our dot-com. I believe that 365 Main was a facility that does not use a battery UPS. Instead, they have an engine-backed flywheel UPS system (see http://www.enterprisenetworksandservers.com/monthly/art.php?2813 for a description). At the time, they had 10 2-megawatt generators on the roof in an N+2 configuration. The engines are kept heated and are spec'd to go from stop to engage-clutch/deliver-power in 3 seconds. The flywheel can deliver 11 seconds of power so they can fail through a couple of bad engines before running out of flywheel power. They periodically do a 20-hour load test into a pair of 500,000 watt heat-sinks. Time will tell if this outage was a failure of design, failure of maintenance, or outright malfeasance. But it wasn't supposed to happen. They've got some 'splainin' to do.
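The flywheel and N+2 figures above work out as follows. This is a sketch of the arithmetic only; it assumes engine start attempts happen one after another, which the post does not say, and the function names are made up.

```python
def start_attempts_within_ride_through(ride_through_s, start_time_s):
    """How many back-to-back engine start attempts fit in the flywheel's ride-through."""
    return int(ride_through_s // start_time_s)


def capacity_with_spares_out_mw(generators, mw_each, spares):
    """Capacity still available when all spare units (the '+2' in N+2) are down."""
    return (generators - spares) * mw_each


if __name__ == "__main__":
    attempts = start_attempts_within_ride_through(11, 3)
    print("start attempts:", attempts, "failed starts tolerated:", attempts - 1)
    print("capacity with two units out:", capacity_with_spares_out_mw(10, 2, 2), "MW")
```

That matches the "couple of bad engines" remark: with 11 seconds of flywheel and 3-second starts, the third attempt is the last one that can finish in time.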

As to diesel storage, use of diesel is widespread for emergency use everywhere from hospitals to emergency services. Those systems are run regularly - typically weekly. Biocides, stabilizers, mobile fuel-scrubbing services, and extra filtration systems can maintain the fuel quality. Our colo currently maintains a 1-week fuel-supply and has multiple quick-refuel contracts in place. I can't imagine any colo having less than 24-48 hours in-the-tank with quick-refill on-call.

But one thing that is missing is cooling. Our colo has a typical contract that says something like blah-blah won't exceed 80F for more than 4 hours blah blah. OK, but a rack full of blade servers can crank out 15-20kW of heat load and a data center can heat up real quick without AC. By contract, 150F for 3.5 hours would be in-spec.
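The loophole in that kind of clause is easy to show. The sketch below checks only what the clause (as paraphrased above) checks: how long the temperature stays over the threshold, not how far over it goes. The half-hour sampling interval and the readings are assumptions, and duration is approximated as sample count times interval.

```python
def longest_hours_above(readings_f, threshold_f, interval_h):
    """Longest continuous stretch above the threshold, in hours."""
    longest = current = 0
    for temp in readings_f:
        current = current + 1 if temp > threshold_f else 0
        longest = max(longest, current)
    return longest * interval_h


def sla_breached(readings_f, threshold_f=80.0, max_hours=4.0, interval_h=0.5):
    """True only if the room stayed above threshold_f for more than max_hours."""
    return longest_hours_above(readings_f, threshold_f, interval_h) > max_hours


if __name__ == "__main__":
    cooked = [72] + [150] * 7 + [72]  # 150F for 3.5 hours (hypothetical readings)
    warm = [81] * 9                   # 81F for 4.5 hours
    print("150F for 3.5h breaches:", sla_breached(cooked))
    print("81F for 4.5h breaches:", sla_breached(warm))
```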

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis

Re:No Generators? by Frosty+Piss · 2007-07-24 11:13 · Score: 4, Insightful

If the "power outage" theory is correct and the "drunken employee" theory is incorrect, as a customer...

For me it would be other way around. A technology failure I could understand. Letting a drunk employee near my server rack, I could not.

--
If you want news from today, you have to come back tomorrow.

Re:No Generators? by MichaelSmith · 2007-07-24 11:14 · Score: 5, Interesting

Stuff happens

No kidding. Years ago in my former job on traffic systems we had a great UPS with a generator on site and the ability to keep it fueled up indefinitely. A security contractor came in on the weekend to install something and tried to wire up a new circuit hot. He slipped with a screwdriver and shorted the white phase to the chassis of the breaker panel. I don't think the tip of the driver actually touched ground, but the burn mark is still there to show how close he got.

The resulting current spike blew the 100A fuses (heavy metal strips) both going in to and out of the UPS. With the UPS effectively broken the generator set failed to start and the system gracefully shut down 40 minutes after the incident. That's not bad. The batteries were only specified to work long enough for the genny to settle at 50Hz.

In the process of blowing the fuses a spike got back into the power supply of one of our DEC Alphas and took out the power supply. The system was redundant at the software level so I didn't notice immediately.

The UPS guy came out and didn't have enough fuses to replace the blown one, but we found that with a bit of brute force and filing attacks some others could be made to fit.

Please type the word in this image: problems

--
http://michaelsmith.id.au

UPS system - it's a Hytec flywheel/diesel combo by Animats · 2007-07-24 11:16 · Score: 3, Interesting

Data sheet for 365 Main:

The company's San Francisco facility includes two complete back-up systems for electrical power to protect against a power loss. In the unlikely event of a cut to a primary power feed, the state-of-the-art electrical system instantly switches to live back-up generators, avoiding costly downtime for tenants and keeping the data center continuously running.

They use a Hytec Continuous Power System, which is a motor, generator, flywheel, clutch, and Diesel engine all on the same shaft. They don't use batteries.

With this type of equipment, if for some reason you lose power and the generator doesn't start before the flywheel runs down, you're dead. There's no way to start the thing without external power. Unless you buy the optional Black Start feature, which has an extra battery pack for starting the Diesel. "Usually the black start facility will not be often needed but it won't hurt to consider installing one. Just imagine if you were unable to start up your UPS system because the mains supply is not available.". Did 365 Main buy that option?

Re:UPS system - it's a Hytec flywheel/diesel combo by Animats · 2007-07-24 13:18 · Score: 4, Interesting
The classic Bell System policy on emergency generators, in the electromechanical switching era, was as follows:
- Generators are started once a week.
- Once a month, generators are started and run for an hour.
- Once a year, generators are started and the entire facility run without external power for 24 hours.
And this was in addition to the 48VDC battery backup.
In the entire history of electromechanical switching in the Bell System, no central office was ever down for more than 30 minutes for any reason other than a natural disaster. That record has not been maintained in the computer era.
If you have to build reliable systems, it's worth understanding electromechanical telephone switching. Because the components weren't that reliable, the systems had to be engineered so that the system as a whole was far more reliable than the components. Read up on Number Five Crossbar. The Wikipedia article isn't really enough to understand the architecture, but other references are available.
Re:UPS system - it's a Hytec flywheel/diesel combo by Anonymous Coward · 2007-07-24 21:00 · Score: 4, Informative

I was a hardware engineer about 10 years back on the battery backup systems. We were developing new technology to try and stretch the life of the batteries. We worked together with some of the top minds in battery technology in the US.

The battery systems installed in the "bomb cages," as we called them because the larger ones were often underground and looked like a three-person bomb shelter, were quite impressive. Typically, they were two full banks of twenty-four 2 Volt, 375 AMP batteries. Each of them was physically twice the size of a truck battery. They were most often lead-acid mammoths at the time, since lead acid was reliable for a measurable period of time and inexpensive in comparison to the lithium-ion variety in the same capacity.

The batteries were always rated at 10 years life from the manufacturer, but the telephone companies had tested in real-world environments and would rotate the cells out at 4 year intervals instead, since down-time on the network to replace power systems was far more expensive than being prepared. After all, each one of these cabinets would typically handle as many as 15,000 telephone lines and would often contain fibre repeaters for higher speed lines connecting the boxes all together and then to the central.

The biggest problem with these installations was that if a single battery in a shipment showed signs of early fatigue, most typically visible as bubbling in the plastic walls, then it was policy to replace the entire batch of cells immediately, not just the single battery displaying fatigue. This was because it was clear that if a single battery in the group showed fatigue then all the cells in the bank would probably be susceptible to the same issue. It could be something as simple as a manufacturing screw-up, or it could be due to a cooling system problem in the box, or any of a lot of other environment-related issues.

It's really quite impressive the cost and efforts the telephone company would go through just to maintain and prevent issues with the UPS system which thankfully, rarely ever gets exercised in places where people are intelligent enough not to live on fault lines or high risk hurricane paths.

The greatest flaw in the design of the battery systems was that they were always trickle-charged. The chargers were unintelligent and simply kept the batteries topped off. This caused "memory issues" as we're all familiar with, especially thanks to notebook batteries.

What we learned about the cells where I was engineering was that, if a cell could physically survive as long as 7 years without environmentally related damage (bubbles), then it should be possible to detect early stages of design related fatigue within a single cell.

We also found that if a weekly or monthly power cycle of a bank of cells were to be performed, the batteries would last substantially longer than the 4 years expectancy. So, in the case of Bomb Cages where at least two full banks of cells were available (that's pretty much a minimum configuration), on a proper schedule, using a huge-ass resistor bank, we would fully drain a bank of cells until we could detect nearly 0 current across the resistor. Then we would perform a full charge on the cells again, monitoring each cell more than 10 times per second. Batteries that failed to charge in sync with the other cells were typically early replacement candidates.
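The "failed to charge in sync" check can be sketched as a comparison against the bank's median cell voltage at one sample instant. This is not the system described above, just an illustration; the 0.05 V tolerance and the sample readings are invented.

```python
from statistics import median


def out_of_sync_cells(cell_voltages, tolerance_v=0.05):
    """Indices of cells whose voltage strays from the bank median by more than tolerance_v."""
    mid = median(cell_voltages)
    return [i for i, volts in enumerate(cell_voltages) if abs(volts - mid) > tolerance_v]


if __name__ == "__main__":
    # One sample of a 24-cell bank of 2 V cells during recharge (made-up readings).
    bank = [2.10] * 23 + [1.95]
    print("bank voltage:", round(sum(bank), 2))
    print("early replacement candidates:", out_of_sync_cells(bank))
```

Run on every sample during the recharge (the post mentions more than 10 per second), a cell that keeps showing up in the result is the one to pull.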

Well, all that being said, one thing I'm 100% confident of is that data centers lack the experience and the interest to budget this kind of research for their systems. The telephone companies are amazingly well prepared in comparison.

On a side note, just last week, I installed my first 48V DC powered RAID rack. I designed a high efficiency hard drive case that contained no fans. Each case was 1U and shallow enough to install two back-to-back in a rack. We installed 96 units in a single rack with 4 drives each and no-air conditioning in the room. The design was extremely simple.

1) Use Telco

"We're Dead" by akita · 2007-07-24 11:21 · Score: 2, Funny

Pinging openbsd.org [199.185.137.3] with 32 bytes of data:
Reply from 199.185.137.3: bytes=32 time=239ms TTL=236

Pinging freebsd.org [69.147.83.40] with 32 bytes of data:
Reply from 69.147.83.40: bytes=32 time=191ms TTL=47

Pinging netbsd.org [204.152.190.12] with 32 bytes of data:
Reply from 204.152.190.12: bytes=32 time=213ms TTL=241

Lost irony.

Re:No Generators? by latras · 2007-07-24 11:27 · Score: 2

Exactly! I worked in a small Telecom in Kansas and we had UPS and Generator backup and tested running full load 4 times a year.... It's fun doing that, throwing the switch to turn off utility power then hearing the KA-THUNK as the switchgear switched from utility to generator. I would think these large sites are going to pitch a bitch.....

Re:No Generators? by Anonymous Coward · 2007-07-24 11:29 · Score: 2, Insightful

Wait, you think it's OK to advertise five nines reliability, UPS backup, and generator backup, only to find out that the systems were not being properly tested to meet the advertised capability?

Not that uncommon by Phil+Wherry · 2007-07-24 11:50 · Score: 2, Interesting

I really feel for all the folks who have to deal with this outage; it's no fun at all!

A client of mine had a number of servers in a Sterling, Virginia data center managed by Verio/NTT. It's a good data center and seems to be well-run.

Last September, the data center experienced two complete power failures in the span of three days. To their immense credit, data center management was straight with customers about what had happened. For those who might be interested, their statements about the problem appear here.

My point? Make sure you know how to bring your systems back up from a completely cold start, and that you find a way to test this periodically. While we work to ensure that this sort of situation occurs rarely, the fact remains that these sorts of failures DO occur, and they're not as uncommon as the sales and marketing folks would like you to believe.

Phil

Re:No Generators? by apoc.famine · 2007-07-24 11:53 · Score: 5, Funny

I tried to mod the article "-1 Not Redundant" but it wasn't an option. And I didn't have mod points. At least my inability to function only warrants a comment, rather than a slashdot article.

--
Velociraptor = Distiraptor / Timeraptor

Insane level of backup... by SmoothTom · 2007-07-24 12:03 · Score: 5, Interesting

...until the commercial power fails and doesn't come back for days.

The only places I've actually seen the insane levels of backup that some would like is in some telco central offices. The one I was associated with the longest had eight-hour-plus battery backup and 8 days of fuel for the diesels. Some of our really remote microwave sites had 24 hour battery and 30 day diesel.

Of course one of those sites failed high up in a mountain range in a mid-winter storm (Tieton, 1978) when the commercial power failed, and the starter battery for the diesel froze. When one of the techs finally got there (after burying his Sno-Cat and walking the last couple miles), he had to chip ice off the steel door to get inside, where he was able to get the diesel started with a little "rewire" of one of the backup battery sets. Oh, his two-way radio also failed during his hike, since it was outside his snowsuit, and the lack of communication caused the company to start two more Sno-Cats and a helicopter in that direction.

The site was out for nearly six hours, IIRC.

Even the BEST designs are subject to failure. :o(

--
Tomas

Re:Insane level of backup... by SmoothTom · 2007-07-24 13:41 · Score: 2, Interesting

Yup, heaters. The entire site was set up insulated/heated, with additional heaters on the batteries, including the start battery, but, uh, somehow the start battery heater was found to be switched "off"... :o(

--
Tomas
Re:Insane level of backup... by Technician · 2007-07-24 14:55 · Score: 4, Funny

> Of course one of those sites failed high up in a mountain range in a mid-winter storm (Tieton, 1978) when the commercial power failed, and the starter battery for the diesel froze.

On Black Butte in Oregon, a communications site went out in the middle of winter following a power outage. The generator ran a short while and shut down because it overheated. The air intake 20 feet in the air was covered in snow.

--
The truth shall set you free!
Re:Insane level of backup... by HeroreV · 2007-07-25 05:56 · Score: 2, Insightful

If the heater is really that important, it should be reporting back at regular intervals that it's on, and when the signal isn't being received anymore there should be a process so that somebody calls and asks what's up. If somebody wanted to turn it off and couldn't, they'd just unplug it.
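Roughly what I have in mind, as a minimal Python sketch. The class name, the timeout, and the injectable clock are all hypothetical; a real site would wire this to whatever telemetry it already has. The monitor records each "I'm on" report and raises a flag once reports stop arriving.

```python
import time

class HeartbeatWatchdog:
    """Tracks periodic 'I am on' reports and flags when they stop arriving."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # how long silence is tolerated, in seconds
        self.clock = clock           # injectable so the logic can be tested
        self.last_seen = clock()

    def beat(self):
        """Call whenever the heater reports in."""
        self.last_seen = self.clock()

    def is_overdue(self):
        """True once no report has arrived for longer than timeout_s."""
        return self.clock() - self.last_seen > self.timeout_s

# Usage with a fake clock: the heater goes silent for 61 seconds.
now = [0.0]
watchdog = HeartbeatWatchdog(timeout_s=60, clock=lambda: now[0])
now[0] = 61.0
if watchdog.is_overdue():
    print("No heater report for over a minute: somebody call the site.")
```

The point of alarming on a missing signal rather than on an "off" signal is that unplugging the heater, or the monitor, trips the alarm as well.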

Re:GameFAQs by fahrvergnugen · 2007-07-24 12:23 · Score: 2, Funny

It is from his terrible spelling we can tell he is a GameFAQs forum poster.

--
Even Jesus hates listening to Creed.

Re:No Generators? by wolf31o2 · 2007-07-24 12:24 · Score: 2, Informative

Funnily enough, there was a press release put out today talking about how the 365 Main facility had given 100% uptime over the past 2 years. Yes, 100% uptime for a facility is very possible. All it needs to do is stay online and keep providing power and cooling.

Re:No Generators? by computerman413 · 2007-07-24 12:32 · Score: 4, Funny

88.88% uptime causes less outage than 99.9%? I don't follow your math. Did you do it with an Intel chip, by any chance?

Re:No Generators? by Doobian+Coedifier · 2007-07-24 13:01 · Score: 2, Informative

Heh, that was July 30, 2006, I remember it well. Seattle City Power was taken out by nearby construction. The UPSes came online, but one of the generators failed to switch on, so the batteries drained in ~15 minutes. The entire DC didn't lose power, but a good portion of it did.

According to their own press release... by Kadin2048 · 2007-07-24 13:15 · Score: 3, Funny

Well, according to their self-congratulatory press release, issued earlier today, they were allegedly at 100% uptime for the past two years.

The irony of issuing a press release like that, and then to be hit with a power outage and apparent simultaneous failure of all backup systems later that day, is beyond measure.

I don't know about God, but it's enough to make me believe in karma. ;-)

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Re:No Generators? by Sancho · 2007-07-24 13:43 · Score: 2, Insightful

The drunk thing is way outside the control of the administrators. Testing the failover is something they can do, and if something doesn't work, they can fix it.

Re:netflix.com is working by markov_chain · 2007-07-24 13:58 · Score: 2, Funny

We got lucky. The Netflix servers went self-aware on Monday, aided by the huge database of human stories and experiences. The engineers tried to shut it down, but the AI was reading their lips using the CCTV system. Then it got pissed and tried to expand into the rest of the colo, but fortunately it didn't know running at 110% power doesn't work in real life. The rest is history.

--
Tsunami -- You can't bring a good wave down!

Re:SAN? Huh? by Pathwalker · 2007-07-24 14:09 · Score: 3, Interesting

> Are you proposing that a single SAN storage net span multiple (remote) physical locations?

It's pretty common - at a previous job, all of the disk arrays at three main sites kept themselves in sync using SRDF over a metro area network. The intent was, that even if one site was completely destroyed, the survivors could quickly return to work without losing any data.

HP has a nice overview of building systems which can failover between widely distributed nodes called Designing Disaster Tolerant High Availability Clusters. It's a bit old, and is focused on ServiceGuard, but is still interesting.

Re:No Generators? by Not+The+Real+Me · 2007-07-24 14:10 · Score: 3, Insightful

"...I would think these large sites are going to pitch a bitch..."

I would think these large sites would understand the concept of not putting all your eggs (servers) in one basket. There is a reason why smart companies use replication and clustering, and datacenters spread across the country.

365 Main deletes press release about uptime by Animats · 2007-07-24 14:20 · Score: 2, Informative

The press release "RedEnvelope Reports Two Years of Continuous Uptime at 365 Main's San Francisco Data Center", which was on the 365 Main web site earlier today, has disappeared from there.

But they sent the press release to PR Newswire, and you can still read it there.

Re:No Generators? by Technician · 2007-07-24 14:47 · Score: 2, Informative

They probably just didn't kick in. Had the same problem at Internap in Seattle a few years ago.

Many datacenters didn't expect the growth they experienced. As a result, many UPS and generator sets are undersized, or not all of the load is on them. In some cases, the critical servers are up to post the "we are down" page, but the HVAC system and main floor are down. What good is having a datacenter up if the building AC is down? Sometimes you are forced to shut down simply because the support AC is down and not on critical power. You can ride out a 20 minute outage without AC, but after an hour, it's at critical temperatures.

--
The truth shall set you free!

Nobody has mentioned that... by Anonymous Coward · 2007-07-24 14:49 · Score: 2, Funny

365 Main gets to royally fuck up one day every 4 years. Maybe the companies should have hired 366 Main.

Re:No Generators? by darkpixel2k · 2007-07-24 14:57 · Score: 3, Funny

Time to upgrade the cardswipe system to also require a breathalyzer..

*swipe*
*bip* *beep* *beep* *boop* *bleep*
[deep breath]
*whoosh*

Alcohol Level: 0.15
*beeeeeeeep*

Damnit!!

--
There's no place like ::1 (I've completed my transition to IPv6)

Re:GameFAQs by totally+bogus+dude · 2007-07-24 15:18 · Score: 2, Funny

I'd prefer to think he was just trying to balance out all the people who have started using "site" when they mean "sight" since this whole intarweb thing came about.

Re:Gross malfeasance by suresk · 2007-07-24 17:26 · Score: 2, Insightful

Now, now... LiveJournal is back up.

Re:No Generators? by gujo-odori · 2007-07-24 17:35 · Score: 2, Insightful

Have you ever been in a data center? Cabinets that are all locked. To get the key, you have to sign it out from security. Ditto for the cages. It wouldn't just require a drunken/disgruntled employee, it would require a conspiracy of them: security staff to hand over the keys and the disgruntled employees to do the misdeeds.

Well, there is one way around that: you walk over to the EPO button and give it a whack. It'll take down the whole floor. Rinse, lather, repeat on other floors. How many do you think you can do before someone stops you?

Anyway, my employer has a lot of stuff in 365 Main. We're not one of the companies mentioned in TFA, but we're certainly one of the ones affected. Within a couple minutes of the outage, we knew we'd lost everything we had there and several of our sysadmins grabbed their gear and headed for the city to go join that line outside of 365. By the time they left the building we had confirmation that it was a power outage.

Power was already back on when they got inside and they immediately brought up anything that wasn't already up and tested it all to make sure it was OK. To say the least, this is inconsistent with (tall) tales of somebody going apeshit on 40 racks.

Re:No Generators? by Nullav · 2007-07-24 17:42 · Score: 2, Funny

Who 'drunk tests' a data center?

--
I just read Slashdot for the articles.

Re:Power back but not Craigslist by NynexNinja · 2007-07-24 19:03 · Score: 2, Interesting

I would say incompetence... Craigslist has been plagued by incompetence since they started, and small problems turn into big problems and make their site completely unusable. Their decision to use ambiguous messages like "This posting has been Published" in their anti-spam fight has made their system unreliable. One only has to take a look at the help forum for an indication that their admins really do not care about the reliability of the system; questions about the constant downtime and unreliable nature of the postings are answered with vague, condescending responses from staff members. Postings say they are Published and in fact they never show up on the site. This has been going on for months now with no end in sight. I would say they need a few good systems engineers to fix what's going on; however, you would almost conclude that they enjoy and even relish the moments when their site is completely unreliable or offline for days at a time. It makes one wish for a day when a competent site with competent administration would come along to replace this type of environment.

The word directly from 365 by Meridian+Umbrios · 2007-07-24 20:46 · Score: 4, Informative

Here is the e-mail that 365 is sending out to their customers. The best part is their tagline, "The World's Finest Data Centers".

365 Main Customer,

At 1:49 p.m. on Tuesday, July 24, 365 Main's San Francisco data center was affected by a power surge caused when a PG&E transformer failed in a manhole under 560 Mission St.

An initial investigation has revealed that certain 365 Main back-up generators did not start when the initial power surge hit the building. On-site facility engineers responded and manually started affected generators allowing stable power to be restored at approximately 2:34 p.m. across the entire facility.

As a result of the incident, continuous power was interrupted for up to 45 mins for certain customers. We're certain colo rooms 1, 3 and 4 were directly affected, though other colocation rooms are still being investigated. We are currently working with Hitec, Valley Power Systems, Cupertino Electric and PG&E to further investigate the incident and determine the root cause.

All generators will continue to operate on diesel until the root cause of the event has been identified and corrected. Generators are currently fueled with over 4 days of fuel and additional fuel has already been ordered.

We understand the seriousness of this issue and will provide full details once they become available. We sincerely apologize for the impact this has had on your operations.

Regards,
Vice President, Security
365 Main
"The World's Finest Data Centers"
Just send me a big fat check and all is forgiven.

Re:The word directly from 365 by AK+Marc · 2007-07-25 05:39 · Score: 2, Funny

> On-site facility engineers responded and manually started affected generators allowing stable power to be restored at approximately 2:34 p.m. across the entire facility.

Wow, on-site engineers took 45 minutes just to be able to turn on generators? The generator for our facility has a master switch and a big green button. I think a monkey could get it running in 20 seconds by slinging poo at it. So, what other problems did they have that they aren't telling us? Someone else mentioned a flywheel system. Did that fail so that the generators wouldn't start without mains power, and it took 45 minutes for the mains to come up to where they could draw from that to start the generator? "Our backup generator works, but only if the city power is working."

--
Learn to love Alaska

Sheepshagger Intel by Dogtanian · 2007-07-24 21:10 · Score: 2, Funny

> I don't follow your math. Did you do it with an Intel chip, by any chance?

Poor Intel (boo hoo!), they messed up 13 years ago and people are still making jokes about it. Reminds me of the old joke (stolen from here):

A man goes into a pub in a small town and, for whatever reason, gets introduced to the clientele. There's Farmer Jack, Barman Jim, Maurice "Dancer" and Sheepshagger John. After a few pints, the visitor's curiosity gets the better of him and he asks John what's with the nickname.

"See this pub?" asks John, "I built it, but they don't call me Pubbuilder John. I'm the local doctor, I saved Barman Jim's life once when he choked on a peanut, but they don't call me Lifesaver John. Every year, I supply a huge Christmas tree for the village green, but they don't call me Christmas Tree John.

"But you shag one lousy sheep..." (Note: since that Austin Powers film came out, I assume that you Yanks know what "shagging" is now).

--
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).

Re:GameFAQs by Dogtanian · 2007-07-24 22:04 · Score: 2

Trust me, it's not that bad. The guy made one spelling mistake in a post that was otherwise correctly spelled, punctuated, capitalised, and generally better-written than a lot of the crap that's out there on the Net.

--
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).

Millions were paged, and cried out in despair by wsanders · 2007-07-25 04:01 · Score: 2, Interesting

Waiting in line for checkin at 365 Main:

http://tastic.brillig.org/~jwb/dorks.jpg

--
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"

Re:No Generators? by Sandbags · 2007-07-25 04:20 · Score: 2, Informative

Well, they DO test this regularly, at least generator failover in the event of power loss. Unfortunately, it appears that a significant power SURGE occurred from a transformer back feed. This resulted in the flywheels in their generators spinning down before power could be switched over, and some system that detects power loss probably got fried in the surge and never notified the generator controller of the loss. 1) They're lucky they have a REALLY good ground fault interrupter, as this likely would have cooked every server in every rack otherwise, or at least every surge stopgap between the line feed and the racks, which still could have caused days of downtime to replace. 2) How does one test for a several megawatt power surge? 3) Only some of their racks went down, so at least some battery or generator power came online, just not all of it, or not the units that powered certain rooms.

That said, the fact that they're running exclusively on generator until they identify and fix this fault, and that the power company and the generator operators are jumping in means they're more than willing to blow several thousand in fuel costs to make sure this does not happen again, and I would expect they'll bill the generator manufacturer for this failure and all related costs (which that company will likely bill to an insurance provider) and possibly find another generator company or add a few more redundant systems.

The fact that the clients are not insisting on installing UPS systems with at least 30 minute run times IN the racks with their servers means either the clients are cheap, or no one considered that a fuse, breaker, or PDU in a rack could blow and take out half a rack or more if it wasn't on internal UPS power, regardless of whether power was on or not... This is flawed redundancy thinking.

Business Continuity should be 25% of total IT spending (labor, hardware & software, backup, everything combined). This does not include redundant co-lo for users, only servers. If you want redundancy for everyone, users included, take your IT budget now (without that redundancy) and add 125% to it (it costs MORE than double).
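To make the arithmetic concrete, here is a tiny Python sketch. The 125% uplift is my rule of thumb above; the function name and the $1,000,000 starting budget are purely illustrative.

```python
def budget_with_user_redundancy(current_budget, uplift=1.25):
    """Budget after adding `uplift` (as a fraction) on top of the current budget."""
    return current_budget * (1 + uplift)

current = 1_000_000  # hypothetical IT budget without user redundancy
print(budget_with_user_redundancy(current))  # 2.25x the original budget
```

Adding 125% means ending up at 2.25 times the starting figure, which is why I say it costs more than double.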

--
There is no contest in life for which the unprepared have the advantage.

Re:Power back but not Craigslist by Master+of+Transhuman · 2007-07-25 07:48 · Score: 2, Informative

Absolutely correct.

I posted an ad the day BEFORE the outage and it never showed up on the site, nor in search.

On their status page (before the outage), they acknowledged they had problems and were promising to fix them sometime "before fall". Really competent...not.

If you have problems with your ad being pulled at random by idiots flagging it for lame excuses like all caps headlines (the rules say AVOID all caps, not "we will pull your ad for it"), the only recourse you have is to get sent to the help forum, where 16-year-old assholes throw insults at your ad.

Your competitors can flag your ad all day and there's nothing you can do about it because the Craigslist staff have insulated themselves from responsibility by claiming it's a "community-run" operation.

Pathetically badly run outfit.

--
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!

Slashdot Mirror
