Slashdot Mirror


Multiple Sites Down In SF Power Outage

corewtfux writes with word of a major outage apparently centered on 365 Main, a datacenter on the edge of San Francisco's Financial District. Valleywag initially claimed that a drunken person had gotten in and damaged 40 racks, but an update from Technorati's Dave Sifry says the problem is a widespread power outage. Sites affected include Technorati, Netflix (these display nice "We're Dead" pages), Typepad, LiveJournal, Sun.com, and Craigslist (these just time out).

18 of 423 comments

  1. I work in the Financial District by slug_bait · · Score: 5, Interesting

    I can verify that it affected much of the Financial District here in SF. We had the power go out 3 times. Seems to be back now. Haven't heard any explanation yet.

  2. Re:No Generators? by Anonymous Coward · · Score: 2, Interesting

    They probably just didn't kick in. Had the same problem at Internap in Seattle a few years ago. Power was cut to the building and the UPSs failed to switch over.

  3. Redundant power supply? by msimm · · Score: 2, Interesting

    Does this mean backup generators have failed or is the fault somewhere outside the datacenter? Time to start shopping.

    --
    Quack, quack.
    1. Re:Redundant power supply? by aaarrrgggh · · Score: 5, Interesting

      It takes Diesel a few years to go bad, and that site has fuel polishing systems to prevent that. Because of earthquake risk, they are contractually obliged with many of their clients to keep 24-48 hours of backup fuel on hand.

      They have the HiTec rotary UPSs in all their facilities, which link a generator to a flywheel UPS. It's stupid to not have backup fuel for that type of system; you can only run for 13 seconds before the load crashes.

      It is possible that they took a number of small hits and the generators failed to restart after a few of them. Good procedure is to stay on generator until the utility stabilizes if you have more than one "hit."
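
      A rough sketch of that stay-on-generator rule, for anyone who wants to see it spelled out. Everything here is an assumption for illustration: the class name, the 30-minute default window, and the choice to retransfer immediately after a single hit. It is not how any particular vendor's transfer controller works.

```python
from dataclasses import dataclass


@dataclass
class TransferController:
    """Toy retransfer logic. All names and defaults are hypothetical."""

    hold_off_s: float = 1800.0    # assumed stabilization window
    on_generator: bool = False
    hits: int = 0                 # utility losses since the last stable window
    stable_s: float = 0.0         # how long utility has been continuously good
    utility_was_ok: bool = True

    def step(self, utility_ok: bool, dt_s: float) -> str:
        """Advance by dt_s seconds and return the source carrying the load."""
        if not utility_ok:
            if self.utility_was_ok:
                self.hits += 1            # count each new loss of utility
            self.on_generator = True
            self.stable_s = 0.0
        else:
            self.stable_s += dt_s
            if self.stable_s >= self.hold_off_s:
                self.hits = 0             # utility is considered stable again
                self.on_generator = False
            elif self.hits <= 1:
                self.on_generator = False  # single hit: go straight back
        self.utility_was_ok = utility_ok
        return "generator" if self.on_generator else "utility"
```

      With a 60-second window, a first hit retransfers as soon as utility returns, but a second hit keeps the load on generator until utility has been clean for the full window.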

      Be interesting to find out what happened.

    2. Re:Redundant power supply? by Anonymous Coward · · Score: 1, Interesting

      In the case of common batteries, it is not the voltage that you
      have to monitor (it always looks fine), it is the deliverable
      current over a period of time. And that is something that can
      be accurately tested only by loading. That is why a battery
      discharge/recharge cycle is common in advanced UPS solutions.

      But you are correct about the monthly/weekly tests, which
      should tell you something.
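
      To make the "test by loading" point concrete, here is a minimal sketch that integrates measured current over a discharge test until the voltage falls below a cutoff. The function names, the sample format, and the 80% pass threshold are all assumptions for illustration, not taken from any UPS product.

```python
def delivered_amp_hours(samples, cutoff_volts):
    """Amp-hours delivered under load before voltage drops below cutoff.

    samples: list of (seconds_elapsed, volts, amps) readings, in time order.
    Uses trapezoidal integration of current between consecutive readings.
    """
    total_amp_seconds = 0.0
    for (t0, _v0, a0), (t1, v1, a1) in zip(samples, samples[1:]):
        if v1 < cutoff_volts:
            break  # battery can no longer hold the load above cutoff
        total_amp_seconds += 0.5 * (a0 + a1) * (t1 - t0)
    return total_amp_seconds / 3600.0


def battery_healthy(samples, cutoff_volts, rated_amp_hours, min_fraction=0.8):
    """True if delivered capacity is at least min_fraction of the rating.

    min_fraction=0.8 is an assumed threshold; use whatever your policy says.
    """
    delivered = delivered_amp_hours(samples, cutoff_volts)
    return delivered >= min_fraction * rated_amp_hours
```

      A battery that reads a healthy voltage at rest can still fail this check, which is the whole reason for the discharge cycle.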

  4. how many data centers? by riceboy50 · · Score: 3, Interesting

    It's interesting that so many major sites would go down in a local power outage. Are they all sharing one data center in SF? If so, why don't they have co-locations in other cities?

    --
    ~ I am logged on, therefore I am.
  5. July 24th: RedEnvelope Press Release by 365 Main by duplicate-nickname · · Score: 3, Interesting

    This has got to be some type of joke: RedEnvelope Reports Two Years of Continuous Uptime at 365 Main's San Francisco's Datacenter.

    It was released today....

  6. Re:No Generators? by MichaelSmith · · Score: 5, Interesting

    Stuff happens

    No kidding. Years ago in my former job on traffic systems we had a great UPS with a generator on site and the ability to keep it fueled up indefinitely. A security contractor came in on the weekend to install something and tried to wire up a new circuit hot. He slipped with a screwdriver and shorted the white phase to the chassis of the breaker panel. I don't think the tip of the driver actually touched ground, but the burn mark is still there to show how close he got.

    The resulting current spike blew the 100A fuses (heavy metal strips) both going in to and out of the UPS. With the UPS effectively broken the generator set failed to start and the system gracefully shut down 40 minutes after the incident. That's not bad. The batteries were only specified to work long enough for the genny to settle at 50Hz.

    In the process of blowing the fuses a spike got back into the power supply of one of our DEC Alphas and took out the power supply. The system was redundant at the software level so I didn't notice immediately.

    The UPS guy came out and didn't have enough fuses to replace the blown one, but we found that with a bit of brute force and filing attacks some others could be made to fit.

    Please type the word in this image: problems

  7. UPS system - it's a Hytec flywheel/diesel combo by Animats · · Score: 3, Interesting

    Data sheet for 365 Main:

    The company's San Francisco facility includes two complete back-up systems for electrical power to protect against a power loss. In the unlikely event of a cut to a primary power feed, the state-of-the-art electrical system instantly switches to live back-up generators, avoiding costly downtime for tenants and keeping the data center continuously running.

    They use a Hytec Continuous Power System, which is a motor, generator, flywheel, clutch, and Diesel engine all on the same shaft. They don't use batteries.

    With this type of equipment, if for some reason you lose power and the generator doesn't start before the flywheel runs down, you're dead. There's no way to start the thing without external power. Unless you buy the optional Black Start feature, which has an extra battery pack for starting the Diesel. "Usually the black start facility will not be often needed but it won't hurt to consider installing one. Just imagine if you were unable to start up your UPS system because the mains supply is not available." Did 365 Main buy that option?
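
    The failure mode is easy to state as a decision table. This is only a sketch under my own assumptions: the function and parameter names are made up, and the ride-through figure you plug in (someone elsewhere in this thread quotes about 13 seconds) depends on the actual installation.

```python
from typing import Optional


def outage_outcome(outage_s: float, ride_through_s: float,
                   engine_start_s: Optional[float], black_start: bool) -> str:
    """Classify what happens to the load on a flywheel/diesel system.

    outage_s:        how long mains is gone
    ride_through_s:  how long the flywheel alone can carry the load
    engine_start_s:  seconds until the diesel is carrying load,
                     or None if it never starts
    black_start:     whether a starter battery pack is installed
    """
    if outage_s <= ride_through_s:
        return "load up: flywheel rode through"
    if engine_start_s is not None and engine_start_s <= ride_through_s:
        return "load up: diesel took over"
    if black_start:
        return "load dropped: restart from black-start battery"
    return "load dropped: wait for mains"
```

    The last two branches are the ones the question above is about: once the flywheel has run down, only the black-start option lets you recover without external power.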

    1. Re:UPS system - it's a Hytec flywheel/diesel combo by Animats · · Score: 4, Interesting

      The classic Bell System policy on emergency generators, in the electromechanical switching era, was as follows:
      • Generators are started once a week.
      • Once a month, generators are started and run for an hour.
      • Once a year, generators are started and the entire facility run without external power for 24 hours.

      And this was in addition to the 48VDC battery backup.
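
      If you wanted to put that schedule into a maintenance calendar, a minimal sketch might look like this. The function name is hypothetical, and treating "monthly" as every 4th week and "yearly" as every 52nd week is my simplification, not part of the Bell policy.

```python
def generator_test_for_week(week: int) -> str:
    """Return the most demanding generator test due in a given week.

    week counts from 1. Assumes 'monthly' means every 4th week and
    'yearly' means every 52nd week, which is a simplification.
    """
    if week % 52 == 0:
        return "run entire facility on generator for 24 hours"
    if week % 4 == 0:
        return "start and run for 1 hour"
    return "start only"
```

      Over 52 weeks this yields 39 plain starts, 12 one-hour runs, and one full 24-hour run.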

      In the entire history of electromechanical switching in the Bell System, no central office was ever down for more than 30 minutes for any reason other than a natural disaster. That record has not been maintained in the computer era.

      If you have to build reliable systems, it's worth understanding electromechanical telephone switching. Because the components weren't that reliable, the systems had to be engineered so that the system as a whole was far more reliable than the components. Read up on Number Five Crossbar. The Wikipedia article isn't really enough to understand the architecture, but other references are available.

  8. Re:Redundant? by ryanisflyboy · · Score: 4, Interesting

    Some of these sites are a lot more centralized than you might realize. If they failed to build their systems with a secondary site in mind, it can be near impossible to get the "CTO" types to pony up the dollars for it later. The biggest issue I have seen that affects this is storage. Either they aren't using suitable SAN technologies, or they didn't put enough money behind the storage initiative to set up secondary site replication. I agree with you though. This is a problem that has been solved. Perhaps Netflix thought - wth - if we go out for a few hours and people can't choose their movies, that's just tough luck.

    Sun.com going down is a good example of someone totally screwing up. They have absolutely NO excuse. The others - maybe they can get away with it and we won't care. If Sun can't keep their own site up, how can I expect them to keep mine up?

  9. Not that uncommon by Phil+Wherry · · Score: 2, Interesting

    I really feel for all the folks who have to deal with this outage; it's no fun at all!

    A client of mine had a number of servers in a Sterling, Virginia data center managed by Verio/NTT. It's a good data center and seems to be well-run.

    Last September, the data center experienced two complete power failures in the span of three days. To their immense credit, data center management was straight with customers about what had happened. For those who might be interested, their statements about the problem appear here.

    My point? Make sure you know how to bring your systems back up from a completely cold start, and that you find a way to test this periodically. While we work to ensure that this sort of situation occurs rarely, the fact remains that these sorts of failures DO occur, and they're not as uncommon as the sales and marketing folks would like you to believe.

    Phil

  10. Insane level of backup... by SmoothTom · · Score: 5, Interesting

    ...until the commercial power fails and doesn't come back for days.

    The only places I've actually seen the insane levels of backup that some would like are some telco central offices. The one I was associated with the longest had eight-hour-plus battery backup and 8 days of fuel for the diesels. Some of our really remote microwave sites had 24-hour battery and 30-day diesel.

    Of course one of those sites failed high up in a mountain range in a mid-winter storm (Tieton, 1978) when the commercial power failed, and the starter battery for the diesel froze. When one of the techs finally got there (after burying his Sno-Cat and walking the last couple miles), he had to chip ice off the steel door to get inside, where he was able to get the diesel started with a little "rewire" of one of the backup battery sets. Oh, his two-way radio also failed during his hike, since it was outside his snowsuit, and the lack of communication caused the company to start two more Sno-Cats and a helicopter in that direction.

    The site was out for nearly six hours, IIRC.

    Even the BEST designs are subject to failure. :o(

    --
    Tomas

    1. Re:Insane level of backup... by SmoothTom · · Score: 2, Interesting

      Yup, heaters. The entire site was set up insulated/heated, with additional heaters on the batteries, including the start battery, but, uh, somehow the start battery heater was found to be switched "off"... :o(

      --
      Tomas

  11. Re:SAN? Huh? by Pathwalker · · Score: 3, Interesting

    Are you proposing that a single SAN storage net span multiple (remote) physical locations?

    It's pretty common - at a previous job, all of the disk arrays at three main sites kept themselves in sync using SRDF over a metro area network. The intent was that even if one site was completely destroyed, the survivors could quickly return to work without losing any data.

    HP has a nice overview of building systems that can fail over between widely distributed nodes, called Designing Disaster Tolerant High Availability Clusters. It's a bit old, and is focused on ServiceGuard, but is still interesting.

  12. Re:Power back but not Craigslist by NynexNinja · · Score: 2, Interesting

    I would say incompetence... Craigslist has been plagued by incompetence since they started, and small problems turn into big problems and make their site completely unusable. Their decision to use ambiguous messages like "This posting has been Published" in their anti-spam fight has made their system unreliable. One only has to take a look at the help forum for an indication that their admins really do not care about the reliability of the system; questions about the constant downtime and the unreliable nature of the postings are answered with vague, condescending responses from staff members. Postings say they are Published when in fact they never show up on the site. This has been going on for months now with no end in sight. I would say they need a few good systems engineers to fix what's going on; however, you would almost conclude that they enjoy and even relish the moments when their site is completely unreliable or offline for days at a time. It makes one wish for a day when a competent site with competent administration would come along to replace this type of environment.

  13. Re:Redundant? by tv_dinners · · Score: 2, Interesting

    True 'dat. Makes one wonder why not just relocate everything to Alaska or somewhere else that's cold as hell.

    Speaking of energy costs associated with heat dissipation, I've always been curious about a method that could produce energy from waste heat, as a solar panel does from the sun.

    Wrap that supercharged V8 in some energy producing heatwrap, instant hybrid and more horsepower. Run your processor's cooling fan off the energy produced from the excessive heat.

    Someone please tell me I'm talking out of my ass, or worse, just gave away the next big idea that could have made me billions.

  14. Millions were paged, and cried out in despair by wsanders · · Score: 2, Interesting

    Waiting in line for checkin at 365 Main:

    http://tastic.brillig.org/~jwb/dorks.jpg

    --
    Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"