LiveJournal Blackout Analysis Online
Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday.
Apparently someone "accidentally" pushed the emergency power-off button (which cuts all power, even the UPS), reset it, and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not fully rebooting for analysis reasons, and a few other problems. "
They should be using OpenBSD. It can run right through power failures
Don't let your clients near the Big Red Button without an escort. Preferably an armed one.
Don't blame me; I'm never given mod points.
So, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?
No beer, no TV make Lifthrasir something something
"I'll just set my coffee down here, and..."
...
"Oppsie, I hope that button wasn't anything important."
Ah, the famous History Eraser Button rears its ugly head. I think that everyone who has worked in a large datacenter or lab environment with one of these has a story to tell...
(S(SKK)(SKK))(S(SKK)(SKK))
Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.
Carousel is a lie!
If Mr. "I Pushed The Big Red Button"'s personal information ever gets published....
LJ's active user base is easily 10x that of Slashdot's. We'd have to come up with a new term for the internet event that pales any slashdotting that ever came before.
What do you mean, ran off?
Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?
Or do you really mean, slunk off, like my dog does when I walk in and find her curled up on top of the remains of the remotes for the TV, TiVo, DVD player and stereo?
My dog likes remote controls more than snausages.
OT: Anyone know where (brick and mortar) to get a replacement (original) TiVo remote?
I don't need no instructions to know how to rock!!!!
Anyone who's a paid member of LJ can get a 2-week credit here.
Entrepreneur : (noun), French for "unemployed"
And I was like OMG I shut off the internets and stuff!!1!!
And i called the AOL helpdesk and they helped turn it back on.
An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
One of the last steps of our standard deployment was a full hard shutdown and restore from backup. This was scheduled to happen approximately a week before bringing the machines live - after a lot of data setup had been done.
Many customers - and internal staff - really, really got scared at that point. The thing is, if you don't trust your backups, what good are they? It's amazing what things got taken care of and found during double-checks the week before the backup/restoration test.
Oh, and we always went with scheduled reboots as well, for very much the same reason as you mentioned. An hour a month of scheduled downtime is almost always available - usually we booted every week and had an optional downtime window on a monthly basis. And if your (talking to readers here, not parent) organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?
You're special forces then? That's great! I just love your olympics!
The one they tell you about and the real one.
Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.
They're cheeky enough to document this now. It's a feature, not a bug! Honest!
Actually, most of the accounts don't pay. They're just freeloading whiners.
This is a paste from the Livejournal stats:
* Free Account: 5713743 (98.3%)
* Early Adopter: 14220 (0.2%)
* Paid Account: 94857 (1.6%)
* Permanent Account: 1632 (0.0%)
Go ahead and read up on how auto-negotiation works. I'll wait...
No, really. Go read up on it...
Okay, since you don't bother reading up on it, and since you claim that someone's cheeky because they *document* what happens when you misconfigure a connection, I must conclude that you, sir, are indeed an idiot.
(To summarize for those of you who won't bother to look it up, a NIC can sense the carrier for 100, so it can differentiate 10/100. Full and half are actively negotiated by the two sides of the connection. If side 'A' is hard set to 100/full, it won't negotiate with the other side. Hearing no negotiation, side 'B' will assume the NIC doesn't support full duplex connections and fall back to half duplex. This is the proper, standardized, documented behavior. Anything else would require the psychic interface spec that *still* hasn't been finalized.)
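For anyone who would rather see the rule above as code, here is a rough Python sketch of it. This is a toy model of the fallback logic only, not any vendor's firmware or driver code. The names `Port`, `resolve_duplex`, and `link_state` are made up for illustration, and it assumes both ends are already running at 100 Mbit and that a negotiating port advertises full duplex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical model of one end of a 100 Mbit link."""
    name: str
    autoneg: bool                 # True if the port takes part in negotiation
    forced_duplex: str = "full"   # only used when autoneg is False


def resolve_duplex(port: Port, peer: Port) -> str:
    """Return the duplex mode `port` ends up using when wired to `peer`."""
    if not port.autoneg:
        # A hard-set port never negotiates; it uses whatever it was told.
        return port.forced_duplex
    if peer.autoneg:
        # Both sides negotiate (assumed here to both advertise full duplex).
        return "full"
    # The peer is silent, so this side falls back to half duplex.
    return "half"


def link_state(a: Port, b: Port):
    """Return (duplex of a, duplex of b, whether the two disagree)."""
    duplex_a = resolve_duplex(a, b)
    duplex_b = resolve_duplex(b, a)
    return duplex_a, duplex_b, duplex_a != duplex_b


if __name__ == "__main__":
    side_a = Port("A", autoneg=False, forced_duplex="full")
    side_b = Port("B", autoneg=True)
    print(link_state(side_a, side_b))
```

With side 'A' locked to full and side 'B' negotiating, the model reports a mismatch, which is the misconfiguration described above. Setting both sides to negotiate, or hard-setting both to the same mode, makes the mismatch go away.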