Slashdot Mirror


LiveJournal Blackout Analysis Online

Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday. Apparently someone "accidentally" pushed the emergency power off (which should cut all power, even the UPS), reset it and ran off. They had problems to come back up fast, because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not fully rebooting for analysis reasons, and a few others."

51 of 333 comments

  1. Lesser OS... by Anonymous Coward · · Score: 5, Funny


    They should be using OpenBSD. It can run right through power failures

    1. Re:Lesser OS... by ergo98 · · Score: 2, Informative

      Power failed to get to the computers. It was a power failure - whether it was the electric grid, the UPS blowing up, or all the wires in the wall, or in this case the EPO button, it's a bloody power failure.

    2. Re:Lesser OS... by Anonymous Coward · · Score: 2, Insightful

      Nice use of intentional confusion of the issue to make an argument there.

      You say you 'choose (unintentionally)'. I'd say that if you accidentally hit your UPS or computer on/off switch, you are unintentionally causing a power failure.

      You're putting a break in the circuit. By your logic, if I hit a tree in my car, knocking it over into a powerline, killing power to my entire neighborhood, that's not a 'power failure' because there's power available to the break in the lines, and I 'chose (unintentionally)' for my entire neighborhood to not make use of the power. As far as the people with space in that colo are concerned, the supply of power failed (to their rack, their room, whatever) - in other words, a power failure.

      If I have machines in a colo, and power to those machines drops in an unscheduled manner, that's a power failure from my perspective, root cause be damned.

  2. The less we've learned... by geoffspear · · Score: 4, Funny

    Don't let your clients near the Big Red Button without an escort. Preferably an armed one.

    --
    Don't blame me; I'm never given mod points.
    1. Re:The less we've learned... by Chris+Mattern · · Score: 2, Funny

      "The beautiful shiny button! The jolly, candy-like button!"

      Chris Mattern

    2. Re:The less we've learned... by geminidomino · · Score: 2, Funny

      Evil overlord list item #9: I will not include a self-destruct mechanism unless absolutely necessary. If it is necessary, it will not be a large red button labelled "Danger: Do Not Push". The big red button marked "Do Not Push" will instead trigger a spray of bullets on anyone stupid enough to disregard it. Similarly, the ON/OFF switch will not clearly be labelled as such.

  3. faulty mobo's by Lifthrasir · · Score: 5, Interesting

    so, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?

    --
    No beer, no TV make Lifthrasir something something
  4. Oppsie by darkstar949 · · Score: 5, Funny

    "I'll just set my coffee down here, and..."
    ...
    "Oppsie, I hope that button wasn't anything important."

  5. History Eraser Button by bsd4me · · Score: 4, Funny

    Ah, the famous History Eraser Button rears its ugly head. I think that everyone who has worked in a large datacenter or lab environment with one of these has a story to tell...

    --

    (S(SKK)(SKK))(S(SKK)(SKK))

    1. Re:History Eraser Button by scribblej · · Score: 4, Interesting

      I'll go right ahead then. I was consulting for State Farm installing machines that were supposed to help with the Y2K problem. Hell if I know, I just got the box, went to the site, installed it and made sure it was working. Easy. I had five to do a week, and would be done by Tuesday morning and helping out other contractors on similar projects.

      I'll never forget my visit to the State Farm DSO in Detroit, MI. I'd just physically installed the new machine, at the bottom of a rack, and stood up.

      Stood up putting my shoulder right into the unprotected "History Eraser Button" on the wall. The screams of the employees working in the datacenter could be heard all the way back home in Chicago, I've no doubt.

      Then it turns out the fuses which will reset the systems in the datacenter are in a locked cabinet.

      Then it turns out no one on site has a key.

      Fortunately, I found that the cabinet will pop open if you kick it hard enough. Hey, I was panicking, okay?

      And get this. After it was all over and I realized I probably wouldn't get killed by anyone... they told me "It's okay, this happens all the time. The guy installing the A/C unit last week did it too."

      Maybe they should have put a cover over the damn button then. Morons.

    2. Re:History Eraser Button by Local+ID10T · · Score: 2, Funny
      I was consulting for State Farm installing machines that were supposed to help with the Y2K problem.


      Hey! I worked that project too... it was fun, but mind-numbing. They actually sent me to New Orleans for an install on Fat Tuesday.

      Mardi Gras on an expense account :)

      --
      "You want to know how to help your kids? Leave them the fuck alone." -George Carlin
    3. Re:History Eraser Button by Aaden42 · · Score: 2, Funny

      Nobody remembers!

    4. Re:History Eraser Button by cgenman · · Score: 2, Funny

      If I ever catch anyone putting a cover over a critical piece of safety equipment, like an Emergency Power Cutoff switch, I'll put their head on a pole in front of the data centre as a warning to others.

      You of all people should realize that putting someone's head on a pole in front of a data centre is dangerous. For one, it tends to become a disease vector, as for some mysterious reason everyone feels the need to touch it. Rats are usually attracted to the smell, and you know how rats wreak havoc on ethernet cables, especially the rats of the dead. Furthermore, putting the dead on a spike on your front lawn tends to attract ghosts, which are no problem if you're running a secure OS, but everyone knows what havoc ghosts can wreak on a Windows Server 2000 installation.

      On the other hand, how would putting a clear, hinged plastic cover over an emergency power kill switch be likely to kill someone? I know people panic in desperate situations, but if someone can't get a plastic hinged cover off of a button quickly during an emergency they shouldn't be trusted with electricity.

      There are many ways you could safely "fuck with" the safety equipment while making it less likely to take down your entire network. You could make it a handle that had to be pulled down, like most fire alarms are. It could be "Break flimsy plastic and press button to kill power." Heck, it could just be recessed, like many good last-resort buttons are.

  6. Fascinating read by Saint+Aardvark · · Score: 4, Insightful
    It's amazing how much you can learn from things going horribly wrong. :-)

    Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.

  7. Re:Where was the switch? by grub · · Score: 2, Informative


    They usually are in a server room. They're for emergencies. Ours have red cages around them and a BIG RED SIGN, you have to basically punch them.

    --
    Trolling is a art,
  8. Missing opportunities by Rosco+P.+Coltrane · · Score: 3, Funny

    Apparently someone "accidentally" pushed the emergency power off

    They had to power back on when they realized deadjournal.com was already taken...

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
  9. LJDotting: LJ user base vs Slashdot user base. by TrevorB · · Score: 4, Funny

    If Mr. "I Pushed The Big Red Button"'s personal information ever gets published....

    LJ's active user base is easily 10x that of Slashdot's. We'd have to come up with a new term for the internet event that pales any slashdotting that ever came before.

  10. Auto-negotiation by stilwebm · · Score: 3, Informative

    When I first moved company servers into a new colo four years ago, their engineers advised me that I should turn auto-negotiation off on every port, including our switches and host NICs. I asked why they recommended this and they replied, "trust us, auto-negotiation causes problems when you least expect it." I went ahead and fixed the port speeds everywhere. Now I understand why.

    1. Re:Auto-negotiation by jjgm · · Score: 4, Insightful

      Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.

      They're cheeky enough to document this now. It's a feature, not a bug! Honest!

    2. Re:Auto-negotiation by Undertaker43017 · · Score: 2, Funny

      The part I like is they are claiming that everyone else is wrong, and they are right. ;)

      I don't buy Cisco anymore for this very reason, it's not just their switches, it's on everything they make that has a NIC.

      I deployed some CSS's, right after Cisco bought ArrowPoint, and they did auto correctly. Another client deployed some a couple of months ago, and auto was broken. Cisco is the Borg! ;)

  11. ...and ran off? by stratjakt · · Score: 5, Funny

    What do you mean, ran off?

    Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?

    Or do you really mean, slunk off, like my dog does when I walk in and find her curled up on top of the remains of the remotes for the TV, TiVo, DVD player and stereo?

    My dog likes remote controls more than snausages.

    OT: Anyone know where (brick and mortar) to get a replacement (original) TiVo remote?

    --
    I don't need no instructions to know how to rock!!!!
  12. Credit by XorNand · · Score: 4, Informative

    Anyone who's a paid member of LJ can get a 2-week credit here.

    --
    Entrepreneur : (noun), French for "unemployed"
  13. Ahhhh silence is GOOOOLDEN by ShatteredDream · · Score: 3, Funny

    *crickets chirping* That's the sound of millions of teenage girls not using up bandwidth and disk space talking about boys, J.Crew and high school/college drama.

    1. Re:Ahhhh silence is GOOOOLDEN by metalhed77 · · Score: 3, Funny

      So says the author of yet another political weblog whose startling impartiality and sense will pave the way for a brave new world?

      --
      Photos.
  14. machine failure by br00tus · · Score: 3, Insightful
    "They had problems to come back up fast, because of '9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly', Machines not fully rebooting for analysis reasons and few others."

    I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines which would not come back up for one reason or another - so we dealt with it then, on Sunday morning, instead of during the week when a reboot of a critical machine that did not work would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures - their power failed, and then their lack of planning allowed for so much to go wrong as a result of that.

    1. Re:machine failure by rjstanford · · Score: 4, Insightful

      One of the last steps of our standard deployment was a full hard shutdown and restore from backup. This was scheduled to happen approximately a week before bringing the machines live - after a lot of data setup had been done.

      Many customers - and internal staff - really, really got scared at that point. The thing is, if you don't trust your backups, what good are they? It's amazing what things got taken care of and found during double-checks the week before the backup/restoration test.

      Oh, and we always went with scheduled reboots as well, for very much the same reason as you mentioned. An hour a month of scheduled downtime is almost always available - usually we booted every week and had an optional downtime window on a monthly basis. And if your (talking to readers here, not parent) organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?

      --
      You're special forces then? That's great! I just love your olympics!
    2. Re:machine failure by gkuz · · Score: 2, Insightful
      Every Saturday evening, we rebooted all of our servers

      Yeah, we had servers like that once, too. Ba-da-bing! Thanks, I'll be here all week.

      On a serious note, am I the only one here who thinks a world in which no one questions a policy like that is insane? We've had critical, and I mean critical, servers that have uptimes measured in years. But then again they run NetWare, or OS/400, or MVS, or.... ABW.

      Scheduled reboots are a part of good systems administration

      Yeah, scheduled, as part of a disaster recovery test once a year, maybe. Weekly scheduled reboots are a sign of shitty systems. How often do you reboot your Cisco routers?

    3. Re:machine failure by TeraCo · · Score: 2, Insightful

      You, sir, sound like a man who needs a load balanced cluster. If you're relying on individual boxes staying up to meet your SLAs, your career is a ticking timebomb.

      --
      Not Meta-modding due to apathy.
  15. LOL! Kindof like when... by GillBates0 · · Score: 5, Funny
    ...when I was on AOL and I hit the X and I couldn't talk to my AOL Buddies anymore.

    And I was like OMG I shut off the internets and stuff!!1!!

    And i called the AOL helpdesk and they helped turn it back on.

    --
    An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
  16. Re:I want to name this file..... by Cocoronixx · · Score: 2, Funny

    uhhh 0? Well I guess 1 since I can count you now.

    --
    "Obscenity is the crutch of the inarticulate motherfucker." - cloak42
  17. The reason why some NICs don't auto-neg by phaetonic · · Score: 2, Informative

    I have run across this issue in data centers numerous times. This still occurs with the latest hardware, no matter what vendor or OS. I have this problem on SunFire280Rs and Compaq DL360s. What it comes down to is the switch being used in the data center and the settings in the OS. Typically, data centers set their switch to forced 100-full (unless of course they are using fibre or Gb). The OS must be set to force its NICs into the same mode, or they will drop a lot of packets. Sounds like a disconnect in communications between the NOC and the customer.

    1. Re:The reason why some NICs don't auto-neg by caluml · · Score: 2, Informative

      That's what Compaq Lights-Out cards are for. Lovely things. Very handy.

  18. Re:How do you do that by *accident*???? by FudgePackinJesus · · Score: 2, Funny

    Stimpy couldn't resist "The Red, Shiny, CANDY-LIKE Button!!"

  19. Re:And here by tmhsiao · · Score: 2, Interesting

    Aside from allowing an unaccompanied client access to the Big Red Button, perhaps?

    --
    "My God...It's full of ads!" -Fry, about the Internet, Futurama
  20. 13 yo? :P by Spy+der+Mann · · Score: 3, Funny

    Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?

    By any chance, was his name "Zero Cool"?

  21. OOB console access is the answer. by Mordant · · Score: 2, Insightful

    They ought to have out-of-band (OOB) serial-console access to their servers via a terminal server for any number of reasons, including this one; if they'd implemented OOB console access, they could've sshed into the terminal server, gotten onto the consoles of the servers in question, and used ifconfig to fix the duplex issue.

    Why they don't seem to grasp this is beyond me . . . anyone running a public-facing, high-volume service should have OOB access to all servers, routers, switches, firewalls, etc. . . . it's just common sense.

  22. Re:Wait a second! by rah1420 · · Score: 2, Informative

    Technically, yes. I'm hoping that if LJ decides to implement such a scheme (let's call it "LEPO" for "Leisurely Emergency Power Off") that they run it past the fire marshal or the code inspectors first, who may have another opinion about how smart this idea is.

    "If it's stupid and it works, it's not stupid."

    --
    Mit der Dummheit kämpfen Götter selbst vergebens.
  23. 2 accounts of the powerloss by Spazholio · · Score: 4, Funny

    The one they tell you about and the real one.

  24. No! by Saeed+al-Sahaf · · Score: 2, Insightful
    embedded NICs...

    Who in their right mind goes with the on-board NIC in a server environment?

    --
    "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
    1. Re:No! by juuri · · Score: 2, Interesting

      Who in their right mind goes with the on-board NIC in a server environment?

      Are you kidding?

      How about everyone? Regardless of PC, Sun, Alpha or whatever hardware.

      --
      --- I do not moderate.
    2. Re:No! by SenorChuck · · Score: 2, Informative

      On all of the (actual) servers I've worked with, the onboard NICs are exactly the same hardware that you get with the server-grade PCI NICs.

      --
      A wise person makes his own decisions, a weak one obeys public opinion. -- Chinese proverb
  25. Not millions of paying accounts. by EvilStein · · Score: 4, Informative

    Actually, most of the accounts don't pay. They're just freeloading whiners.

    This is a paste from the Livejournal stats:

    * Free Account: 5713743 (98.3%)
    * Early Adopter: 14220 (0.2%)
    * Paid Account: 94857 (1.6%)
    * Permanent Account: 1632 (0.0%)
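As a rough cross-check of the pasted breakdown, here is a minimal Python sketch that recomputes each share from the four counts alone. The `shares` helper is a made-up name for illustration, and the denominator is an assumption: it is just the sum of the four categories listed. On that basis Free Account works out to roughly 98.1% rather than the pasted 98.3%, so the original percentages were presumably computed against a different total.

```python
# Counts copied from the pasted LiveJournal stats above.
counts = {
    "Free Account": 5713743,
    "Early Adopter": 14220,
    "Paid Account": 94857,
    "Permanent Account": 1632,
}


def shares(counts):
    """Return each category's share of the total, in percent, to one decimal.

    Assumption: the total is simply the sum of the listed categories.
    """
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in shares(counts).items():
        print(f"{name}: {counts[name]} ({pct}%)")
```

Either way the point of the comment stands: paying accounts are under 2% of the total.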

  26. Re:Also, by Scott+Laird · · Score: 2, Informative

    "Why do we even have that button?" Because it's basically required by law. Covering them with a plastic cover doesn't seem to help either--Internap did that the *last* time someone hit the EPO button in this datacenter.

  27. Accidents happen by Migraineman · · Score: 2, Interesting

    About a decade ago, we had a series of "incidents" with the EPO button in the software lab. Shortly after a serious lab upgrade (due to constantly blowing breakers,) someone decided to test the EPO switch (it was a bit of a novelty at the time.) *click* "Cool, it works. Hey, how do you reset this thing?" Turns out you needed to have a key to reset it. It took about 4 hours to find someone who had the key. That one got replaced with the Mark II resettable switch ...

    About a month later, one of the managers was giving a prospective new-hire a tour. He got to the software lab, and started blathering about "don't ever push the red switch" as he put his finger on the switch ... *click*

    So some einstein decided that the Big Red Switch was "dangerous" and put a plexi cover over it - the same kind that goes over the thermostat control, and the same kind that has a key lock. Yep, about six months later we had a gen-you-ine emergency. One of the HP 9000/300 monitors went crispy, and was snorting smoke and sparks. One of the software folks went to hit the Big Red Button, but was somewhat nonplussed to find a locking cover over it. She took the co-located fire bottle, sheared the cover off, pressed the button, then got to use said fire bottle on the monitor.

    So the cover gets replaced again, though this time with a non-locking cover. At some point, the software server stack needed to be relocated into the corner with the Big Red Button. Another einstein discovered that it was inconvenient to slink behind the equipment rack - the cover kept bashing him in the neck or shoulder. So he removed it, thinking that accidental presses wouldn't happen because the button was obstructed by the server stack. (yep, inaccessible = useless.) Some time later, the equipment was being jockeyed for an upgrade, and one of the big SCSI cables snagged the Big Red Button and *click* ...

    All these shenanigans happened in the space of one year, and I got tired of the thrash. I measured the space between the back of the switch and the faceplate - just over 3/4 inch. I cut a horseshoe shape out of 3/4 plywood, and hung it on the switch shaft. In an emergency, it's really easy (and obvious) to remove it. Gravity keeps it there otherwise. No problems since ...

  28. You, sir, are an idiot. by Anonymous Coward · · Score: 5, Informative

    Go ahead and read up on how auto-negotiation works. I'll wait...

    No, really. Go read up on it...

    Okay, since you didn't bother reading up on it, and since you claim that someone's cheeky because they *document* what happens when you misconfigure a connection, I must conclude that you, sir, are indeed an idiot.

    (To summarize for those of you who won't bother to look it up, a NIC can sense the carrier for 100, so it can differentiate 10/100. Full and half are actively negotiated by the two sides of the connection. If side 'A' is hard set to 100/full, it won't negotiate with the other side. Hearing no negotiation, side 'B' will assume the NIC doesn't support full duplex connections and fall back to half duplex. This is the proper, standardized, documented behavior. Anything else would require the psychic interface spec that *still* hasn't been finalized.)
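The behavior summarized above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: `Port`, `resolve` and `duplex_mismatch` are hypothetical names rather than any real driver or switch API, and two negotiating ends are assumed to both support 100/full.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Port:
    """One end of a 10/100 Ethernet link (hypothetical model, not a real API)."""
    autoneg: bool
    forced_speed: Optional[int] = None   # 10 or 100; used only when autoneg is False
    forced_duplex: Optional[str] = None  # "full" or "half"; used only when autoneg is False


def resolve(local: Port, peer: Port) -> tuple:
    """Return the (speed, duplex) that `local` ends up using against `peer`."""
    if not local.autoneg:
        # A hard-set port never negotiates; it uses exactly what it was told.
        return (local.forced_speed, local.forced_duplex)
    if peer.autoneg:
        # Both ends negotiate. Assumption: both support 100/full.
        return (100, "full")
    # The peer is silent: speed can still be sensed from the signal,
    # but duplex cannot, so the negotiating side falls back to half.
    return (peer.forced_speed, "half")


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True when the two ends of the link settle on different duplex modes."""
    return resolve(a, b)[1] != resolve(b, a)[1]


if __name__ == "__main__":
    auto = Port(autoneg=True)
    forced_full = Port(autoneg=False, forced_speed=100, forced_duplex="full")
    print(resolve(auto, forced_full), resolve(forced_full, auto))
    print(duplex_mismatch(auto, forced_full))
```

In this model the mismatch shows up only when exactly one end is hard set to full duplex, which is why the advice elsewhere in the thread is to configure both ends the same way.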

  29. Re:How do you do that by *accident*???? by AndroidCat · · Score: 2, Funny
    Another customer in the facility accidentally pressed the EPO button, then depressed it

    I'm trying to figure out how depressing a button reverses a press. (Since the button is depressed by pressing it.) Unpressed it?

    --
    One line blog. I hear that they're called Twitters now.
  30. Photo of the button by teneighty · · Score: 2, Insightful

    Apparently this photo is an example of the button that was "accidentally" pressed.

    I'd love to hear the explanation for this "accident".

  31. Re:No UPSes before? by Nonesuch · · Score: 2, Informative
    I'm surprised that they didn't have their own little UPSes to bring the system down cleanly before. Sure, the facility is supposed to provide power at all times, even if there's a power grid interruption, but that doesn't get tested very often and isn't under your control. Furthermore, in the event that the facility's power is actually going to go out, there isn't any way for the machines to find this out and shut down cleanly.
    Unfortunately, this would defeat the purpose of the "Big Red Button", which is there to quickly and definitively cut off all power to all line-powered devices in the data center.

    When you've got an analyst smoking and twitching next to one of the racks as 110VAC courses through her veins, you don't want to have to go hunting to figure out which UPS is supplying the juice.

  32. Nothing wrong with onboard NICs in "real" servers. by Nonesuch · · Score: 2, Informative
    Does not mean it's a good idea! Not a single machine where I work uses the on-board NIC, from servers down to desktops. And all of our machines have a two year lifecycle, tops. We generally plug in a 3Com card of some type.
    The smallest of the Sun 1U rackmount Sparc servers do not even have a PCI slot to take a NIC -- no expansion at all, but two on-board 100M interfaces are plenty for most data center deployments of these small boxes.
  33. Re:Wait a second! by psykocrime · · Score: 2, Interesting

    Isn't that circumventing the purpose of the EPO? If there's a smoky fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?

    It's not so much that the firefighters spraying water are worried about getting electrocuted via current conducting through the water itself... it's more about worrying about stumbling into a live wire that's hanging down from the ceiling, or cutting into a live wire with a vent saw, or getting caught up in one with a pike pole or something.

    Having been a firefighter for somewhere around 15 years, I'd say that I for one would not be particularly concerned about the small UPSes. That's not to say that they *couldn't* pose a danger... just that relatively speaking, they'd be a minor concern.

    --
    // TODO: Insert Cool Sig
  34. Re:Its a Small World... by radish · · Score: 2, Funny

    LiveJournal got hit the hardest, they had some IDE drives on their servers, doh!

    I was unaware that SCSI drives had the ability to run without power - thanks for the info!

    --

    ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"