Slashdot Mirror


LiveJournal Blackout Analysis Online

Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday. Apparently someone 'accidentally' pushed the emergency power-off button (which is supposed to cut all power, even from the UPS), reset it and ran off. They had trouble coming back up quickly because of '9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly', machines not fully rebooting for analysis reasons, and a few other issues."

12 of 333 comments

  1. Fascinating read by Saint Aardvark · · Score: 4, Insightful
    It's amazing how much you can learn from things going horribly wrong. :-)

    Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.

  2. machine failure by br00tus · · Score: 3, Insightful
    "They had trouble coming back up quickly because of '9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly', machines not fully rebooting for analysis reasons, and a few other issues."

    I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines that would not come back up for one reason or another, so we dealt with it then, on Sunday morning, instead of during the week, when a failed reboot of a critical machine would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures: their power failed, and then their lack of planning allowed so much to go wrong as a result.

    1. Re:machine failure by rjstanford · · Score: 4, Insightful

      One of the last steps of our standard deployment was a full hard shutdown and restore from backup. This was scheduled to happen approximately a week before bringing the machines live - after a lot of data setup had been done.

      Many customers - and internal staff - really, really got scared at that point. The thing is, if you don't trust your backups, what good are they? It's amazing what things got taken care of and found during double-checks the week before the backup/restoration test.

      Oh, and we always went with scheduled reboots as well, for very much the same reason as you mentioned. An hour a month of scheduled downtime is almost always available - usually we booted every week and had an optional downtime window on a monthly basis. And if your (talking to readers here, not parent) organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?

      --
      You're special forces then? That's great! I just love your olympics!
    2. Re:machine failure by gkuz · · Score: 2, Insightful
      Every Saturday evening, we rebooted all of our servers

      Yeah, we had servers like that once, too. Ba-da-bing! Thanks, I'll be here all week.

      On a serious note, am I the only one here who thinks a world in which no one questions a policy like that is insane? We've had critical, and I mean critical, servers that have uptimes measured in years. But then again they run NetWare, or OS/400, or MVS, or.... ABW.

      Scheduled reboots are a part of good systems administration

      Yeah, scheduled, as part of a disaster recovery test once a year, maybe. Weekly scheduled reboots are a sign of shitty systems. How often do you reboot your Cisco routers?

    3. Re:machine failure by TeraCo · · Score: 2, Insightful

      You, sir, sound like a man who needs a load-balanced cluster. If you're relying on individual boxes staying up to meet your SLAs, your career is a ticking time bomb.

      --
      Not Meta-modding due to apathy.
  3. OOB console access is the answer. by Mordant · · Score: 2, Insightful

    They ought to have out-of-band (OOB) serial-console access to their servers via a terminal server for any number of reasons, including this one; if they'd implemented OOB console access, they could've sshed into the terminal server, gotten onto the consoles of the servers in question, and used ifconfig to fix the duplex issue.

    Why they don't seem to grasp this is beyond me . . . anyone running a public-facing, high-volume service should have OOB access to all servers, routers, switches, firewalls, etc. . . . it's just common sense.

  4. Re:Auto-negotiation by jjgm · · Score: 4, Insightful

    Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.

    They're cheeky enough to document this now. It's a feature, not a bug! Honest!
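    The mismatch described above can be sketched as a small model. This is only an illustration, not the behaviour of any particular Cisco switch or NIC: it assumes the usual fallback in which a port that auto-negotiates against a forced (non-negotiating) partner can sense the link speed but not the duplex, and so settles on half duplex. The names `Port`, `resolve` and `duplex_mismatch` are made up for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Port:
    """One end of an Ethernet link (hypothetical model, not a real API)."""
    autoneg: bool
    speed: int = 100          # forced speed in Mb/s; unused when autoneg is True
    full_duplex: bool = True  # forced duplex; unused when autoneg is True


def resolve(local: Port, peer: Port) -> Tuple[int, bool]:
    """Return the (speed, full_duplex) setting that `local` ends up using."""
    if not local.autoneg:
        # A forced port uses exactly what it was configured with.
        return local.speed, local.full_duplex
    if peer.autoneg:
        # Assumption: both sides advertise 100/full as their best common mode.
        return 100, True
    # Peer sends no negotiation signal: speed is sensed from the link,
    # duplex cannot be, so the negotiating side falls back to half duplex.
    return peer.speed, False


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True when the two ends of the link disagree about duplex."""
    return resolve(a, b)[1] != resolve(b, a)[1]


nic_forced = Port(autoneg=False, speed=100, full_duplex=True)
switch_auto = Port(autoneg=True)

print(resolve(switch_auto, nic_forced))          # switch side of the link
print(duplex_mismatch(nic_forced, switch_auto))  # forced vs. auto
print(duplex_mismatch(Port(True), Port(True)))   # auto vs. auto
```

    Under these assumptions, the link only agrees on duplex when both ends negotiate or both ends are forced to the same setting; mixing the two is what produces the 100/half versus 100/full split.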

  5. No! by Saeed al-Sahaf · · Score: 2, Insightful
    embedded NICs...

    Who in their right mind goes with the on-board NIC in a server environment?

    --
    "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
  6. Photo of the button by teneighty · · Score: 2, Insightful

    Apparently this photo is an example of the button that was "accidentally" pressed.

    I'd love to hear the explanation for this "accident".

  7. Re:Lesser OS... by ghjm · · Score: 1, Insightful

    I'm about to leave work and go home. When I do, I plan to hit the so-called "power button." When I do this, code will execute on the box that flushes cache to disk and then commands the power supply to interrupt most (but not all) of its DC output. At that time, my computer will be in a state commonly referred to as "off."

    By your logic, I can claim that my computer is down due to a power failure.

    Perhaps you would complain: But power was getting to the computer.

    So what about the situation where I accidentally hit the (again, so-called) "off" button on my UPS. In this case the computer will go down due to a lack of power getting to it. However, the power is still on at the wall socket - I have just chosen (unintentionally) to interrupt the supply to the computer. Is this a power failure?

    I don't think you can call it a power failure if power is abundantly available, and you just don't choose to make use of it.

    -Graham

  8. Re:History Eraser Button by Anonymous Coward · · Score: 1, Insightful

    It's from the Ren & Stimpy episode "Space Madness".

    [Button room] REN: Now, listen, Cadet. I've got a JOB for you. See this button? (Stimpy reaches for the button) DON'T TOUCH IT! It's the HISTORY ERASER button, you FOOL!
    STIMPY: So what'll happen?
    REN: That's just IT! We don't KNOW! Maayyybeee something bad?...Mayyybeee something good! I guess we'll never know! 'Cause you're going to guard it! You won't TOUCH it, will you?
    (Stimpy salutes. Ren leaves.)
    REN: Hehhh...hehhhh...hehhhh...hehhhh...
    (Stimpy marches back and forth, staring at the button.)
    ANNOUNCER: Oh, how long can trusty Cadet Stimpy hold out? How can he possibly resist the diabolical urge to push the button that could erase his very existence? Will his tortured mind give in to its uncontrollable desires?
    (Announcer grabs Stimpy, forces him closer to the button.) Can he resist the temptation to push the button that, even now, beckons him even closer? Will he succumb to the maddening urge to eradicate history? At the MERE...PUSH...of a SINGLE...BUTTON! The beeyootiful SHINY button! The jolly CANDY-LIKE button! Will he hold out, folks? CAN he hold out?
    STIMPY: NO I CAN'T!!! EEEEEYAAAHHHH! (pushes button)
    (Alarms go off. Ren, Stimpy, and Announcer stand around table with button.)
    ANNOUNCER: Tune in next week, as...
    (Flash, explosion as they all disappear.)
    We see the Ren and Stimpy logo, Ren and Stimpy also flash and disappear.

  9. Re:Lesser OS... by Anonymous Coward · · Score: 2, Insightful

    Nice use of intentional confusion of the issue to make an argument there.

    You say you 'choose (unintentionally)'. I'd say that if you accidentally hit your UPS or computer on/off switch, you are unintentionally causing a power failure.

    You're putting a break in the circuit. By your logic, if I hit a tree in my car, knocking it over into a powerline, killing power to my entire neighborhood, that's not a 'power failure' because there's power available to the break in the lines, and I 'chose (unintentionally)' for my entire neighborhood to not make use of the power. As far as the people with space in that colo are concerned, the supply of power failed (to their rack, their room, whatever) - in other words, a power failure.

    If I have machines in a colo, and power to those machines drops in an unscheduled manner, that's a power failure from my perspective, root cause be damned.