LiveJournal Blackout Analysis Online
Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday.
Apparently someone "accidentally" pushed the emergency power off (which should keep all power off, even UPS), reset it and ran off. They had problems to come back up fast, because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", Machines not fully rebooting for analysis reasons and few others. "
They usually are in a server room. They're for emergencies. Ours have red cages around them and a BIG RED SIGN, you have to basically punch them.
Trolling is a art,
When I first moved company servers in to a new colo four years ago, their engineers advised me that I should turn auto-negotiation off on every port, including our switches and host NICs. I asked why they recommended this and they replied, "trust us, auto-negotiation causes problems when you least expect it." I went ahead and fixed the port speeds everywhere. Now I understand why.
Anyone who's a paid member of LJ can get a 2-week credit here.
Entrepreneur : (noun), French for "unemployed"
I have run across this issue in data centers numerous times. This still occurs with the latest hardware, no matter what vendor or OS. I have this problem on SunFire280Rs and Compaq DL360s. What it comes down to is the switch being used in the data center and the settings in the OS. Typically, data centers set their switch to forced 100-full (unless of course they are using fibre or Gb). The OS must be set to force its NICs in the same mode, or they will either drop alot of packets. Sounds like a disconnect in communications between the NOC and the customer.
Technically, yes. I'm hoping that if LJ decides to implement such a scheme (let's call it "LEPO" for "Leisurely Emergency Power Off") that they run it past the fire marshal or the code inspectors first, who may have another opinion about how smart this idea is.
"If it's stupid and it works, it's not stupid."
Mit der Dummheit kämpfen Götter selbst vergebens.
Actually, most of the accounts don't pay. They're just freeloading whiners.
This is a paste from the Livejournal stats:
* Free Account: 5713743 (98.3%)
* Early Adopter: 14220 (0.2%)
* Paid Account: 94857 (1.6%)
* Permanent Account: 1632 (0.0%)
"Why do we even have that button?" Because it's basically required by law. Covering them with a plastic cover doesn't seem to help either--Internap did that the *last* time someone hit the EPO button in this datacenter.
Power failed to get to the computers. It was a power failure - whether it was the electric grid, the UPS blowing up, or all the wires in the wall, or in this case the EPO button, it's a bloody power failure.
Go ahead and read up on how auto-negotiation works. I'll wait...
No, really. Go read up on it...
Okay, since you don't bother reading up on it, and since you claim that someone's cheeky because they *document* what happens when you misconfigure a connection, I must conclude that you, sir, are indeed an idiot.
(To summarize for those of you who won't bother to look it up, a NIC can sense the carrier for 100, so it can differentiate 10/100. Full and half are actively negotiated by the two sides of the connection. If side 'A' is hard set to 100/full, it won't negotiate with the other side. Hearing no negotiation, side 'B' will assume the NIC doesn't support full duplex connections and failover to half duplex. This is the proper, standardized, documented behavior. Anything else would require the psychic interface spec that *still* hasn't been finalized.)
On all of the (actual) servers I've worked with, the onboard NICs are exactly the same hardware that you get with the server-grade PCI NICs.
A wise person makes his own decisions, a weak one obeys public opinion. -- Chinese proverb
When you've got an analyst smoking and twitching next to one of the racks as 110VAC courses through her veins, you don't want to have to go hunting to figure out which UPS is supplying the juice.
I do not deploy Linux. Ever.
I do not deploy Linux. Ever.