Slashdot Mirror


LiveJournal Blackout Analysis Online

Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday. Apparently someone "accidentally" pushed the emergency power off (which is supposed to cut all power, even UPS), reset it and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not fully rebooting for analysis reasons, and a few others. "

22 of 333 comments

  1. Lesser OS... by Anonymous Coward · · Score: 5, Funny


    They should be using OpenBSD. It can run right through power failures

  2. The less we've learned... by geoffspear · · Score: 4, Funny

    Don't let your clients near the Big Red Button without an escort. Preferably an armed one.

    --
    Don't blame me; I'm never given mod points.
  3. faulty mobo's by Lifthrasir · · Score: 5, Interesting

    so, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?

    --
    No beer, no TV make Lifthrasir something something
  4. Oppsie by darkstar949 · · Score: 5, Funny

    "I'll just set my coffee down here, and..."
    ...
    "Oppsie, I hope that button wasn't anything important."

  5. History Eraser Button by bsd4me · · Score: 4, Funny

    Ah, the famous History Eraser Button rears its ugly head. I think that everyone who has worked in a large datacenter or lab environment with one of these has a story to tell...

    --

    (S(SKK)(SKK))(S(SKK)(SKK))

    1. Re:History Eraser Button by scribblej · · Score: 4, Interesting

      I'll go right ahead then. I was consulting for State Farm installing machines that were supposed to help with the Y2K problem. Hell if I know, I just got the box, went to the site, installed it and made sure it was working. Easy. I had five to do a week, and would be done by Tuesday morning and helping out other contractors on similar projects.

      I'll never forget my visit to the State Farm DSO in Detroit, MI. I'd just physically installed the new machine, at the bottom of a rack, and stood up.

      Stood up putting my shoulder right into the unprotected "History Eraser Button" on the wall. The screams of the employees working in the datacenter could be heard all the way back home in Chicago, I've no doubt.

      Then it turns out the fuses which will reset the systems in the datacenter are in a locked cabinet.

      Then it turns out no one on site has a key.

      Fortunately, I found that the cabinet will pop open if you kick it hard enough. Hey, I was panicking, okay?

      And get this. After it was all over and I realized I probably wouldn't get killed by anyone... they told me "It's okay, this happens all the time. The guy installing the A/C unit last week did it too."

      Maybe they should have put a cover over the damn button then. Morons.

  6. Fascinating read by Saint+Aardvark · · Score: 4, Insightful
    It's amazing how much you can learn from things going horribly wrong. :-)

    Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.

  7. Missing opportunities by Rosco+P.+Coltrane · · Score: 3, Funny

    Apparently someone "accidentally" pushed the emergency power off

    They had to power back on when they realized deadjournal.com was already taken...

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
  8. LJDotting: LJ user base vs Slashdot user base. by TrevorB · · Score: 4, Funny

    If Mr. "I Pushed The Big Red Button"'s personal information ever gets published....

    LJ's active user base is easily 10x that of Slashdot's. We'd have to come up with a new term for the internet event that pales any slashdotting that ever came before.

  9. Auto-negotiation by stilwebm · · Score: 3, Informative

    When I first moved company servers into a new colo four years ago, their engineers advised me that I should turn auto-negotiation off on every port, including our switches and host NICs. I asked why they recommended this and they replied, "trust us, auto-negotiation causes problems when you least expect it." I went ahead and fixed the port speeds everywhere. Now I understand why.

    1. Re:Auto-negotiation by jjgm · · Score: 4, Insightful

      Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.

      They're cheeky enough to document this now. It's a feature, not a bug! Honest!

  10. ...and ran off? by stratjakt · · Score: 5, Funny

    What do you mean, ran off?

    Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?

    Or do you really mean, slunk off, like my dog does when I walk in and find her curled up on top of the remains of the remotes for the TV, TiVo, DVD player and stereo?

    My dog likes remote controls more than snausages.

    OT: Anyone know where (brick and mortar) to get a replacement (original) TiVo remote?

    --
    I don't need no instructions to know how to rock!!!!
  11. Credit by XorNand · · Score: 4, Informative

    Anyone who's a paid member of LJ can get a 2-week credit here.

    --
    Entrepreneur : (noun), French for "unemployed"
  12. Ahhhh silence is GOOOOLDEN by ShatteredDream · · Score: 3, Funny

    *crickets chirping* That's the sound of millions of teenage girls not using up bandwidth and disk space talking about boys, jcrew and high school/college drama.

    1. Re:Ahhhh silence is GOOOOLDEN by metalhed77 · · Score: 3, Funny

      So says the author of yet another political weblog whose startling impartiality and sense will pave the way for a brave new world?

      --
      Photos.
  13. machine failure by br00tus · · Score: 3, Insightful
    "They had trouble coming back up quickly because of '9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly', machines not fully rebooting for analysis reasons, and a few others."

    I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines which would not come back up for one reason or another - so we dealt with it then, on Sunday morning, instead of during the week when a reboot of a critical machine that did not work would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures - their power failed, and then their lack of planning allowed so much to go wrong as a result of that.

    1. Re:machine failure by rjstanford · · Score: 4, Insightful

      One of the last steps of our standard deployment was a full hard shutdown and restore from backup. This was scheduled to happen approximately a week before bringing the machines live - after a lot of data setup had been done.

      Many customers - and internal staff - really, really got scared at that point. The thing is, if you don't trust your backups, what good are they? It's amazing what things got taken care of and found during double-checks the week before the backup/restoration test.

      Oh, and we always went with scheduled reboots as well, for very much the same reason as you mentioned. An hour a month of scheduled downtime is almost always available - usually we booted every week and had an optional downtime window on a monthly basis. And if your (talking to readers here, not parent) organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?

      --
      You're special forces then? That's great! I just love your olympics!
  14. LOL! Kindof like when... by GillBates0 · · Score: 5, Funny
    ...when I was on AOL and I hit the X and I couldn't talk to my AOL Buddies anymore.

    And I was like OMG I shut off the internets and stuff!!1!!

    And i called the AOL helpdesk and they helped turn it back on.

    --
    An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
  15. 13 yo? :P by Spy+der+Mann · · Score: 3, Funny

    Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?

    By any chance, was his name "Zero Cool"?

  16. 2 accounts of the powerloss by Spazholio · · Score: 4, Funny

    The one they tell you about and the real one.

  17. Not millions of paying accounts. by EvilStein · · Score: 4, Informative

    Actually, most of the accounts don't pay. They're just freeloading whiners.

    This is a paste from the Livejournal stats:

    * Free Account: 5713743 (98.3%)
    * Early Adopter: 14220 (0.2%)
    * Paid Account: 94857 (1.6%)
    * Permanent Account: 1632 (0.0%)
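
As a rough check on the numbers pasted above, here is a minimal Python sketch that recomputes each account type's share from the raw counts. The dictionary name and the choice to use the sum of the four listed counts as the total are assumptions; the stats page may have used a different denominator, so the recomputed percentages need not match the pasted ones exactly.

```python
# Account counts copied from the list above.
account_counts = {
    "Free Account": 5713743,
    "Early Adopter": 14220,
    "Paid Account": 94857,
    "Permanent Account": 1632,
}

# Assumption: the total is simply the sum of the four listed categories.
total = sum(account_counts.values())

# Share of each category as a percentage of that total.
shares = {name: 100.0 * count / total for name, count in account_counts.items()}

if __name__ == "__main__":
    for name, pct in shares.items():
        print(f"{name}: {account_counts[name]} ({pct:.1f}%)")
```

Whatever denominator is used, free accounts make up the overwhelming majority, which is the point of the comment.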

  18. You, sir, are an idiot. by Anonymous Coward · · Score: 5, Informative

    Go ahead and read up on how auto-negotiation works. I'll wait...

    No, really. Go read up on it...

    Okay, since you don't bother reading up on it, and since you claim that someone's cheeky because they *document* what happens when you misconfigure a connection, I must conclude that you, sir, are indeed an idiot.

    (To summarize for those of you who won't bother to look it up, a NIC can sense the carrier for 100, so it can differentiate 10/100. Full and half are actively negotiated by the two sides of the connection. If side 'A' is hard set to 100/full, it won't negotiate with the other side. Hearing no negotiation, side 'B' will assume the NIC doesn't support full duplex connections and fall back to half duplex. This is the proper, standardized, documented behavior. Anything else would require the psychic interface spec that *still* hasn't been finalized.)
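
The behavior summarized above can be sketched as a toy model in Python. This is only an illustration of the described rule, not a driver or switch API: the `Port` class, `resolve`, and `link` are hypothetical names, and the model assumes that two auto-negotiating ports both support 100/full.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Port:
    """One end of a 10/100 link (hypothetical model for illustration)."""
    autoneg: bool
    forced_speed: Optional[int] = None    # only used when autoneg is False
    forced_duplex: Optional[str] = None   # "full" or "half"


def resolve(local: Port, peer: Port) -> Tuple[int, str]:
    """Return the (speed, duplex) that the local port ends up using."""
    if not local.autoneg:
        # A hard-set port uses its configured values and does not negotiate.
        return (local.forced_speed, local.forced_duplex)
    if peer.autoneg:
        # Both sides negotiate; assume 100/full is the best common mode.
        return (100, "full")
    # Peer is silent: speed is sensed from the carrier, duplex falls back to half.
    return (peer.forced_speed, "half")


def link(a: Port, b: Port) -> Tuple[Tuple[int, str], Tuple[int, str]]:
    """Resolve both ends of a link."""
    return resolve(a, b), resolve(b, a)


if __name__ == "__main__":
    side_a = Port(autoneg=False, forced_speed=100, forced_duplex="full")
    side_b = Port(autoneg=True)
    print(link(side_a, side_b))  # the two ends disagree on duplex
```

In this model a mismatch appears only when exactly one side is hard set to full duplex; hard setting both sides identically, or leaving both on auto, gives matching settings.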