Slashdot Mirror


LiveJournal Blackout Analysis Online

Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday. Apparently someone "accidentally" pushed the emergency power off button (which is supposed to cut all power, even UPS power), reset it, and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not fully rebooting for analysis reasons, and a few other problems."

6 of 333 comments

  1. faulty mobo's by Lifthrasir · · Score: 5, Interesting

    so, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?

    --
    No beer, no TV make Lifthrasir something something
  2. Re:And here by tmhsiao · · Score: 2, Interesting

    Aside from allowing an unaccompanied client access to the Big Red Button, perhaps?

    --
    "My God...It's full of ads!" -Fry, about the Internet, Futurama
  3. Re:No! by juuri · · Score: 2, Interesting

    Who in their right mind goes with the on-board NIC in a server environment?

    Are you kidding?

    How about everyone? Regardless of PC, Sun, Alpha or whatever hardware.

    --
    --- I do not moderate.
  4. Re:History Eraser Button by scribblej · · Score: 4, Interesting

    I'll go right ahead then. I was consulting for State Farm installing machines that were supposed to help with the Y2K problem. Hell if I know, I just got the box, went to the site, installed it and made sure it was working. Easy. I had five to do a week, and would be done by Tuesday morning and helping out other contractors on similar projects.

    I'll never forget my visit to the State Farm DSO in Detroit, MI. I'd just physically installed the new machine, at the bottom of a rack, and stood up.

    Stood up, putting my shoulder right into the unprotected "History Eraser Button" on the wall. The screams of the employees working in the datacenter could be heard all the way back home in Chicago, I've no doubt.

    Then it turns out the fuses which will reset the systems in the datacenter are in a locked cabinet.

    Then it turns out no one on site has a key.

    Fortunately, I found that the cabinet will pop open if you kick it hard enough. Hey, I was panicking, okay?

    And get this. After it was all over and I realized I probably wouldn't get killed by anyone... they told me "It's okay, this happens all the time. The guy installing the A/C unit last week did it too."

    Maybe they should have put a cover over the damn button then. Morons.

  5. Accidents happen by Migraineman · · Score: 2, Interesting

    About a decade ago, we had a series of "incidents" with the EPO button in the software lab. Shortly after a serious lab upgrade (due to constantly blowing breakers), someone decided to test the EPO switch (it was a bit of a novelty at the time). *click* "Cool, it works. Hey, how do you reset this thing?" Turns out you needed to have a key to reset it. It took about 4 hours to find someone who had the key. That one got replaced with the Mark II resettable switch ...

    About a month later, one of the managers was giving a prospective new-hire a tour. He got to the software lab, and started blathering about "don't ever push the red switch" as he put his finger on the switch ... *click*

    So some einstein decided that the Big Red Switch was "dangerous" and put a plexi cover over it - the same kind that goes over the thermostat control, and the same kind that has a key lock. Yep, about six months later we had a gen-you-ine emergency. One of the HP 9000/300 monitors went crispy, and was snorting smoke and sparks. One of the software folks went to hit the Big Red Button, but was somewhat nonplussed to find a locking cover over it. She took the co-located fire bottle, sheared the cover off, pressed the button, then got to use said fire bottle on the monitor.

    So the cover gets replaced again, though this time with a non-locking cover. At some point, the software server stack needed to be relocated into the corner with the Big Red Button. Another einstein discovered that it was inconvenient to slink behind the equipment rack - the cover kept bashing him in the neck or shoulder. So he removed it, thinking that accidental presses wouldn't happen because the button was obstructed by the server stack. (yep, inaccessible = useless.) Some time later, the equipment was being jockeyed for an upgrade, and one of the big SCSI cables snagged the Big Red Button and *click* ...

    All these shenanigans happened in the space of one year, and I got tired of the thrash. I measured the space between the back of the switch and the faceplate - just over 3/4 inch. I cut a horseshoe shape out of 3/4 inch plywood, and hung it on the switch shaft. In an emergency, it's really easy (and obvious) to remove it. Gravity keeps it there otherwise. No problems since ...

  6. Re:Wait a second! by psykocrime · · Score: 2, Interesting

    Isn't that circumventing the purpose of the EPO? If there's a smokey fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?

    It's not so much that the firefighters spraying water are worried about getting electrocuted via current conducting through the water itself... it's more about worrying about stumbling into a live wire that's hanging down from the ceiling, or cutting into a live wire with a vent saw, or getting caught up in one with a pike pole or something.

    Having been a firefighter for somewhere around 15 years, I'd say that I for one would not be particularly concerned about the small UPS's. That's not to say that they *couldn't* pose a danger... just that relatively speaking, they'd be a minor concern.

    --
    // TODO: Insert Cool Sig