LiveJournal Blackout Analysis Online
Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday.
Apparently someone "accidentally" pushed the emergency power-off button (which is meant to cut all power, even from the UPS), reset it and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not fully rebooting (for analysis reasons), and a few other problems."
Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.
Carousel is a lie!
I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines which would not come back up for one reason or another - so we dealt with it then, on Sunday morning, instead of during the week when a reboot of a critical machine that did not work would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures - their power failed, and then their lack of planning allowed for so much to go wrong as a result of that.
They ought to have out-of-band (OOB) serial-console access to their servers via a terminal server for any number of reasons, including this one; if they'd implemented OOB console access, they could've sshed into the terminal server, gotten onto the consoles of the servers in question, and used ifconfig to fix the duplex issue.
Why they don't seem to grasp this is beyond me . . . anyone running a public-facing, high-volume service should have OOB access to all servers, routers, switches, firewalls, etc. . . . it's just common sense.
Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.
They're cheeky enough to document this now. It's a feature, not a bug! Honest!
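The mismatch described above can be sketched in a few lines of Python. This is only a toy model, not anything from the LJ write-up: the `Port` class and the `resolve` and `link_duplex` functions are made-up names, and the one rule it encodes is the standard fallback where an auto-negotiating port facing a forced peer can detect the speed but has to assume half duplex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical model of one end of a 100 Mbit Ethernet link."""
    autoneg: bool          # True if the port auto-negotiates
    duplex: str = "full"   # forced duplex, or best duplex it can advertise


def resolve(local: Port, peer: Port) -> str:
    """Return the duplex setting that `local` ends up using."""
    if not local.autoneg:
        # A forced port uses whatever it was locked to.
        return local.duplex
    if peer.autoneg:
        # Both sides negotiate: full duplex only if both can do it.
        both_full = local.duplex == "full" and peer.duplex == "full"
        return "full" if both_full else "half"
    # Peer is forced and sends no negotiation signalling, so the
    # negotiating side can sense the speed but falls back to half duplex.
    return "half"


def link_duplex(a: Port, b: Port) -> tuple:
    """Duplex used by each end of the link, as (a_side, b_side)."""
    return resolve(a, b), resolve(b, a)


if __name__ == "__main__":
    switch = Port(autoneg=True)
    nic_locked_full = Port(autoneg=False, duplex="full")
    print(link_duplex(switch, nic_locked_full))
```

With the switch negotiating and the NIC locked to 100/full, the two ends disagree, which is the duplex mismatch; locking both ends, or letting both negotiate, gives matching settings in this model.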
Who in their right mind goes with the on-board NIC in a server environment?
"Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
Apparently this photo is an example of the button that was "accidentally" pressed.
I'd love to hear the explanation for this "accident".
I'm about to leave work and go home. When I do, I plan to hit the so-called "power button." When I do this, code will execute on the box that flushes cache to disk and then commands the power supply to interrupt most (but not all) of its DC output. At that time, my computer will be in a state commonly referred to as "off."
By your logic, I can claim that my computer is down due to a power failure.
Perhaps you would complain: But power was getting to the computer.
So what about the situation where I accidentally hit the (again, so-called) "off" button on my UPS. In this case the computer will go down due to a lack of power getting to it. However, the power is still on at the wall socket - I have just chosen (unintentionally) to interrupt the supply to the computer. Is this a power failure?
I don't think you can call it a power failure if power is abundantly available, and you just don't choose to make use of it.
-Graham
It's from the Ren & Stimpy episode "Space Madness".
[Button room] REN: Now, listen, Cadet. I've got a JOB for you. See this button? (Stimpy reaches for the button) DON'T TOUCH IT! It's the HISTORY ERASER button, you FOOL!
STIMPY: So what'll happen?
REN: That's just IT! We don't KNOW! Maayyybeee something bad?...Mayyybeee something good! I guess we'll never know! 'Cause you're going to guard it! You won't TOUCH it, will you?
(Stimpy salutes. Ren leaves.)
REN: Hehhh...hehhhh...hehhhh...hehhhh...
(Stimpy marches back and forth, staring at the button.)
ANNOUNCER: Oh, how long can trusty Cadet Stimpy hold out? How can he possibly resist the diabolical urge to push the button that could erase his very existence? Will his tortured mind give in to its uncontrollable desires?
(Announcer grabs Stimpy, forces him closer to the button.) Can he resist the temptation to push the button that, even now, beckons him even closer? Will he succumb to the maddening urge to eradicate history? At the MERE...PUSH...of a SINGLE...BUTTON! The beeyootiful SHINY button! The jolly CANDY-LIKE button! Will he hold out, folks? CAN he hold out?
STIMPY: NO I CAN'T!!! EEEEEYAAAHHHH! (pushes button)
(Alarms go off. Ren, Stimpy, and Announcer stand around table with button.)
ANNOUNCER: Tune in next week, as...
(Flash, explosion as they all disappear.)
We see the Ren and Stimpy logo, Ren and Stimpy also flash and disappear.
Nice use of intentional confusion of the issue to make an argument there.
You say you 'choose (unintentionally)'. I'd say that if you accidentally hit your UPS or computer on/off switch, you are unintentionally causing a power failure.
You're putting a break in the circuit. By your logic, if I hit a tree in my car, knocking it over into a powerline, killing power to my entire neighborhood, that's not a 'power failure' because there's power available to the break in the lines, and I 'chose (unintentionally)' for my entire neighborhood to not make use of the power. As far as the people with space in that colo are concerned, the supply of power failed (to their rack, their room, whatever) - in other words, a power failure.
If I have machines in a colo, and power to those machines drops in an unscheduled manner, that's a power failure from my perspective, root cause be damned.