Slashdot Mirror


LiveJournal Blackout Analysis Online

Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday. Apparently someone "accidentally" pushed the emergency power-off button (which should cut all power, even the UPS), reset it, and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not fully rebooting for analysis reasons, and a few others."

10 of 333 comments

  1. Fascinating read by Saint Aardvark · · Score: 4, Insightful
    It's amazing how much you can learn from things going horribly wrong. :-)

    Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.

  2. machine failure by br00tus · · Score: 3, Insightful
    "They had problems to come back up fast, because of '9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", Machines not fully rebooting for analysis reasons and few others.'"

    I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines that would not come back up for one reason or another - so we dealt with it then, on Sunday morning, instead of during the week, when a failed reboot of a critical machine would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures - their power failed, and then their lack of planning allowed so much to go wrong as a result of that.

    1. Re:machine failure by rjstanford · · Score: 4, Insightful

      One of the last steps of our standard deployment was a full hard shutdown and restore from backup. This was scheduled to happen approximately a week before bringing the machines live - after a lot of data setup had been done.

      Many customers - and internal staff - really, really got scared at that point. The thing is, if you don't trust your backups, what good are they? It's amazing what things got taken care of and found during double-checks the week before the backup/restoration test.

      Oh, and we always went with scheduled reboots as well, for very much the same reason as you mentioned. An hour a month of scheduled downtime is almost always available - usually we booted every week and had an optional downtime window on a monthly basis. And if your (talking to readers here, not parent) organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?

      --
      You're special forces then? That's great! I just love your olympics!
    2. Re:machine failure by gkuz · · Score: 2, Insightful
      Every Saturday evening, we rebooted all of our servers

      Yeah, we had servers like that once, too. Ba-da-bing! Thanks, I'll be here all week.

      On a serious note, am I the only one here who thinks a world in which no one questions a policy like that is insane? We've had critical, and I mean critical, servers that have uptimes measured in years. But then again they run NetWare, or OS/400, or MVS, or.... ABW.

      Scheduled reboots are a part of good systems administration

      Yeah, scheduled, as part of a disaster recovery test once a year, maybe. Weekly scheduled reboots are a sign of shitty systems. How often do you reboot your Cisco routers?

    3. Re:machine failure by TeraCo · · Score: 2, Insightful

      You, sir, sound like a man who needs a load-balanced cluster. If you're relying on individual boxes staying up to meet your SLAs, your career is a ticking time bomb.

      --
      Not Meta-modding due to apathy.
  3. OOB console access is the answer. by Mordant · · Score: 2, Insightful

    They ought to have out-of-band (OOB) serial-console access to their servers via a terminal server for any number of reasons, including this one; if they'd implemented OOB console access, they could've sshed into the terminal server, gotten onto the consoles of the servers in question, and used ifconfig to fix the duplex issue.

    Why they don't seem to grasp this is beyond me . . . anyone running a public-facing, high-volume service should have OOB access to all servers, routers, switches, firewalls, etc. . . . it's just common sense.
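A rough sketch of the console-side fix described above, assuming a Linux host where the `ethtool` utility is installed (the comment mentions ifconfig; which tool applies depends on the OS). The interface name `eth0` and both helper function names are hypothetical. The helpers only build the argument list, so nothing is executed until it is handed to something like `subprocess.run`.

```python
from typing import List


def ethtool_force_cmd(iface: str, speed: int = 100, duplex: str = "full") -> List[str]:
    """Build an ethtool command that disables auto-negotiation and pins speed/duplex.

    `iface` is an assumed interface name; adjust it for the real machine.
    """
    if duplex not in ("half", "full"):
        raise ValueError("duplex must be 'half' or 'full'")
    return ["ethtool", "-s", iface,
            "speed", str(speed),
            "duplex", duplex,
            "autoneg", "off"]


def ethtool_autoneg_cmd(iface: str) -> List[str]:
    """Build an ethtool command that turns auto-negotiation back on."""
    return ["ethtool", "-s", iface, "autoneg", "on"]


if __name__ == "__main__":
    # Print the command rather than run it; running it needs root on the target box.
    print(" ".join(ethtool_force_cmd("eth0")))
```

From a serial console session the printed command could be typed directly, or the list passed to `subprocess.run(cmd, check=True)` as root. The point of the OOB console is that this still works when the NIC itself is the thing that is misconfigured.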

  4. Re:Auto-negotiation by jjgm · · Score: 4, Insightful

    Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.

    They're cheeky enough to document this now. It's a feature, not a bug! Honest!
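The mismatch described above can be modelled in a few lines. This is a simplified sketch, not a model of any particular switch or NIC: it assumes both ends support 100/full, and that an auto-negotiating port facing a forced (non-negotiating) peer can sense the link speed but not the duplex, so it falls back to half duplex. The `Port`, `resolve`, and `link_state` names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Port:
    autoneg: bool
    speed: int = 100      # Mb/s; only meaningful when autoneg is off
    duplex: str = "full"  # only meaningful when autoneg is off


def resolve(local: Port, peer: Port) -> Tuple[int, str]:
    """Return the (speed, duplex) that `local` ends up using on this link."""
    if not local.autoneg:
        # A forced port uses exactly what it was configured with.
        return local.speed, local.duplex
    if peer.autoneg:
        # Both ends negotiate; this sketch assumes both support 100/full.
        return 100, "full"
    # Peer is forced and sends no negotiation signal: speed can be sensed,
    # duplex cannot, so the negotiating side falls back to half duplex.
    return peer.speed, "half"


def link_state(a: Port, b: Port):
    """Return each side's settings and whether their duplex modes disagree."""
    side_a, side_b = resolve(a, b), resolve(b, a)
    return side_a, side_b, side_a[1] != side_b[1]


if __name__ == "__main__":
    switch = Port(autoneg=True)
    nic = Port(autoneg=False, speed=100, duplex="full")
    print(link_state(switch, nic))
```

Under these assumptions, the only safe combinations are "both ends negotiate" or "both ends forced to the same settings"; mixing the two is what produces a link that comes up but performs badly.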

  5. No! by Saeed al-Sahaf · · Score: 2, Insightful
    embedded NICs...

    Who in their right mind goes with the on-board NIC in a server environment?

    --
    "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
  6. Photo of the button by teneighty · · Score: 2, Insightful

    Apparently this photo is an example of the button that was "accidentally" pressed.

    I'd love to hear the explanation for this "accident".

  7. Re:Lesser OS... by Anonymous Coward · · Score: 2, Insightful

    Nice use of intentional confusion of the issue to make an argument there.

    You say you 'choose (unintentionally)'. I'd say that if you accidentally hit your UPS or computer on/off switch, you are unintentionally causing a power failure.

    You're putting a break in the circuit. By your logic, if I hit a tree in my car, knocking it over into a powerline, killing power to my entire neighborhood, that's not a 'power failure' because there's power available to the break in the lines, and I 'chose (unintentionally)' for my entire neighborhood to not make use of the power. As far as the people with space in that colo are concerned, the supply of power failed (to their rack, their room, whatever) - in other words, a power failure.

    If I have machines in a colo, and power to those machines drops in an unscheduled manner, that's a power failure from my perspective, root cause be damned.