LiveJournal Blackout Analysis Online
Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday.
Apparently someone "accidentally" pushed the emergency power off (which should keep all power off, even UPS), reset it and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not fully rebooting for analysis reasons, and a few other problems."
They should be using OpenBSD. It can run right through power failures
What they do makes me happy when I think how simple my setup is by comparison.
Don't let your clients near the Big Red Button without an escort. Preferably an armed one.
Don't blame me; I'm never given mod points.
so, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?
No beer, no TV make Lifthrasir something something
Now, if Slashdot could fix their servers, so we wouldn't get those annoying 503 pages...
I haven't seen them that much lately, but then I haven't been online that much either...
"I'll just set my coffee down here, and..."
...
"Oopsie, I hope that button wasn't anything important."
Ah, the famous History Eraser Button rears its ugly head. I think that everyone who has worked in a large datacenter or lab environment with one of these has a story to tell...
(S(SKK)(SKK))(S(SKK)(SKK))
Did they put it right next to the light switches? Shouldn't something like that be locked away in a server room or at least in a place where it can be under supervision?
What doesn't kill you only delays the inevitable
/.s current poll now?
Monstar L
Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.
Carousel is a lie!
Apparently someone "accidentally" pushed the emergency power off
They had to power back on when they realized deadjournal.com was already taken...
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
If Mr. "I Pushed The Big Red Button"'s personal information ever gets published....
LJ's active user base is easily 10x that of Slashdot's. We'd have to come up with a new term for the internet event that pales any slashdotting that ever came before.
When I first moved company servers in to a new colo four years ago, their engineers advised me that I should turn auto-negotiation off on every port, including our switches and host NICs. I asked why they recommended this and they replied, "trust us, auto-negotiation causes problems when you least expect it." I went ahead and fixed the port speeds everywhere. Now I understand why.
What do you mean, ran off?
Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?
Or do you really mean, slunk off, like my dog does when I walk in and find her curled up on top of the remains of the remotes for the TV, TiVo, DVD player and stereo?
My dog likes remote controls more than snausages.
OT: Anyone know where (brick and mortar) to get a replacement (original) TiVo remote?
I don't need no instructions to know how to rock!!!!
Speaking of stupid things to do, how many people know someone who has named a file on a Unix server * and then, at some later point, decided they no longer needed that file and ran rm *?
News Reporters Make Tasty Polar Bear Treats!
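For anyone who has not been bitten by this: an unescaped * is a wildcard that matches every (non-hidden) file in the directory, so it does not single out the one file actually named *. A minimal sketch in Python, assuming a throwaway temp directory and hypothetical file names (notes.txt, data.db); it uses the standard glob module rather than a real shell, but the matching rule is the same idea.

```python
import glob
import os
import tempfile


def demo_star_file():
    """Create a file literally named '*' next to two others and compare matches."""
    with tempfile.TemporaryDirectory() as d:
        # Hypothetical file names, chosen only for illustration.
        for name in ("*", "notes.txt", "data.db"):
            open(os.path.join(d, name), "w").close()

        base = glob.escape(d)
        # Unescaped star: behaves like the wildcard in `rm *`, everything matches.
        unescaped = sorted(
            os.path.basename(p) for p in glob.glob(os.path.join(base, "*"))
        )
        # Escaped star: matches only the file whose name is the single character '*'.
        escaped = sorted(
            os.path.basename(p)
            for p in glob.glob(os.path.join(base, glob.escape("*")))
        )
        return unescaped, escaped


if __name__ == "__main__":
    everything, just_the_star = demo_star_file()
    print("unescaped pattern matches:", everything)
    print("escaped pattern matches:  ", just_the_star)
```

In a shell the equivalent of the escaped form is quoting the name, so the wildcard is never expanded.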
Anyone who's a paid member of LJ can get a 2-week credit here.
Entrepreneur : (noun), French for "unemployed"
I must compliment LJ for at least being honest with their system... many would lie and say "it was the datacenter's fault".
They at least admit their own systems weren't perfect... and clearly explained each fault they observed.
Good info.
I always wanted to push that button... Now I don't have to.
*crickets chirping* That's the sound of millions of teenage girls not using up bandwidth and disk space talking about boys, jcrew and high school/college drama.
Click here or a puppy gets stomped!
I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines which would not come back up for one reason or another - so we dealt with it then, on Sunday morning, instead of during the week when a failed reboot of a critical machine would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures - their power failed, and then their lack of planning allowed for so much to go wrong as a result of that.
And I was like OMG I shut off the internets and stuff!!1!!
And I called the AOL helpdesk and they helped turn it back on.
An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
Is it me, or are some of those LJ users' expressions of thanks just a bit OTT?
The way the comments go, you'd think this was a life support system or something!
I mean, well done for getting the site back up after like 24 hours or something, but hey I'm not creaming my shorts over it!
#include <sig.h>
everybody was blaming Internap for screwing up and running a shoddy Datacenter, when actually Internap did everything they were supposed to correctly.
Your hair look like poop, Bob! - Wanker.
Apparently someone "accidentally" pushed the emergency power off (which should keep all power off, even UPS)
This also raised the all-important "Why do we even have that button?" question.
Maybe they should use the Button of Doom (USB) to lock the pcs down too...
"EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center."
"...all our DBs have redundant power supplies. we'll be plugging one side into Internap's, and the other side into our own UPS, which itself is plugged into Internap's other power grid. that way if EPO is pressed, we'll have 1-4 minutes to do a clean shutdown. (but if we do the rest of the stuff right, this step isn't even required, including having UPSes... in theory... but the UPSes would be comforting)"
Isn't that circumventing the purpose of the EPO? If there's a smoky fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?
I'm not trying to be a smartass here, since I'm not an expert in datacenters or the purposes behind EPOs - I'm asking. . .
The only acceptable defense of scientific results is to say that they were the product of the Scientific Method.
I have run across this issue in data centers numerous times. This still occurs with the latest hardware, no matter what vendor or OS. I have this problem on SunFire280Rs and Compaq DL360s. What it comes down to is the switch being used in the data center and the settings in the OS. Typically, data centers set their switch to forced 100-full (unless of course they are using fibre or Gb). The OS must be set to force its NICs into the same mode, or they will drop a lot of packets. Sounds like a disconnect in communications between the NOC and the customer.
Stimpy couldn't resist "The Red, Shiney, CANDY-LIKE Button!!"
Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?
By any chance, was his name "Zero Cool"?
They ought to have out-of-band (OOB) serial-console access to their servers via a terminal server for any number of reasons, including this one; if they'd implemented OOB console access, they could've sshed into the terminal server, gotten onto the consoles of the servers in question, and used ifconfig to fix the duplex issue.
Why they don't seem to grasp this is beyond me . . . anyone running a public-facing, high-volume service should have OOB access to all servers, routers, switches, firewalls, etc. . . . it's just common sense.
I told you so.
Looks like my "Newbie Operator" found hisself a new job.
Mit der Dummheit kämpfen Götter selbst vergebens.
The one they tell you about and the real one.
Who in their right mind goes with the on-board NIC in a server environment?
"Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
Actually, most of the accounts don't pay. They're just freeloading whiners.
This is a paste from the Livejournal stats:
* Free Account: 5713743 (98.3%)
* Early Adopter: 14220 (0.2%)
* Paid Account: 94857 (1.6%)
* Permanent Account: 1632 (0.0%)
The article cites disk caches as a source of data-loss.
They claim that their battery-backed RAID caches were safe, but that the actual drives themselves were performing unsafe write caching. It strikes me that this is the kind of thing that's quite easy to *suggest*, but far more difficult to *prove*.
I don't have any first-hand knowledge of disk corruption due to write-caching. Is this a real problem or just some kind of legend? Can someone who has RTFA'ed and knows about disk caches please comment?
This is somewhat irrelevant, but I've messed with some non-battery-backed RAID setups in the past. In these situations, it always made sense to me that the controller would set the individual drives' cache policy to match its own.
It seems that my company and LiveJournal host at the same datacenter here in Seattle. Looks like they got hit pretty hard when the datacenter with multiple redundant battery backups and generators had a massive cascade emergency power off, and every server in the building got shut down at once. LiveJournal got hit the hardest; they had some IDE drives on their servers, doh! Looks like even datacenters with multiple redundant battery backups and power generators are still vulnerable to dumbass electricians who don't know what they are doing. The datacenter has been under construction for the past few months too, so you KNOW that had something to do with it. Looks like we'll have to put a UPS in our own cabinet there, seeing as all that backup protection doesn't mean diddly squat.
Meet new people, and kill them.
Most of the time it is Stimpy's fault. The rest of the time it is Fry's fault. I think there may be a connection...
(S(SKK)(SKK))(S(SKK)(SKK))
I'm surprised that they didn't have their own little UPSes to bring the system down cleanly before. Sure, the facility is supposed to provide power at all times, even if there's a power grid interruption, but that doesn't get tested very often and isn't under your control. Furthermore, in the event that the facility's power is actually going to go out, there isn't any way for the machines to find this out and shut down cleanly.
Aren't these sorts of switches usually behind little glass things that say "BREAK IN CASE OF EMERGENCY"?
I mean I'm sure it's a big red button of some sort like the one we've got in our server room, but man, that's the sorta thing that needs a video camera aimed at it.
Of course, if it was a malicious inside job, then there's not too much to do about it.
I understand the REASON for an easily accessible switch like this, but would it be possible just to wire it into the fire system or something and not have a switch that just screams "touch me for a thrill"?
About a decade ago, we had a series of "incidents" with the EPO button in the software lab. Shortly after a serious lab upgrade (due to constantly blowing breakers), someone decided to test the EPO switch (it was a bit of a novelty at the time). *click* "Cool, it works. Hey, how do you reset this thing?" Turns out you needed to have a key to reset it. It took about 4 hours to find someone who had the key. That one got replaced with the Mark II resettable switch ...
... *click*
...
...
About a month later, one of the managers was giving a prospective new-hire a tour. He got to the software lab, and started blathering about "don't ever push the red switch" as he put his finger on the switch
So some einstein decided that the Big Red Switch was "dangerous" and put a plexi cover over it - the same kind that goes over the thermostat control, and the same kind that has a key lock. Yep, about six months later we had a gen-you-ine emergency. One of the HP 9000/300 monitors went crispy, and was snorting smoke and sparks. One of the software folks went to hit the Big Red Button, but was somewhat nonplussed to find a locking cover over it. She took the co-located fire bottle, sheared the cover off, pressed the button, then got to use said fire bottle on the monitor.
So the cover gets replaced again, though this time with a non-locking cover. At some point, the software server stack needed to be relocated into the corner with the Big Red Button. Another einstein discovered that it was inconvenient to slink behind the equipment rack - the cover kept bashing him in the neck or shoulder. So he removed it, thinking that accidental presses wouldn't happen because the button was obstructed by the server stack. (yep, inaccessible = useless.) Some time later, the equipment was being jockeyed for an upgrade, and one of the big SCSI cables snagged the Big Red Button and *click*
All these shenanigans happened in the space of one year, and I got tired of the thrash. I measured the space between the back of the switch and the faceplate - just over 3/4 inch. I cut a horseshoe shape out of 3/4 plywood, and hung it on the switch shaft. In an emergency, it's really easy (and obvious) to remove it. Gravity keeps it there otherwise. No problems since.
Maybe people will see this and realise the LJ staff are geeks, unlike most of their fanbase, so while you may be mocking their minions, they can still bring down a server looking at a single article along with the rest of us slashdotters.
I like muppets.
Way back when, I was working at an IBM site (STF) that had a boatload of mainframes and equipment on a raised floor area that was badge-access only. Every summer we'd get interns to learn the finer points of computer science by doing things like bursting printouts from the lineprinter and delivering them. Seems that the intern introductory tour had gotten a bit lax... One day a cleaning person knocked at the door to the raised floor to get let in to empty the wastebaskets. Nobody else was around, so one of the interns decided to let them in. Of course they pushed The Big Red Switch that was right next to the door. Oops. Whole floor went down...hard, about 10% of the stuff didn't come back up when the power was restored. Not fun...
They revised the introductory tour a bit, and added a label to the EPO switch.
(And no, it wasn't me who hit the button...)
Needless to say, we now have a cover over our Button. Funny thing is, the electrician who installed the original button is also the guy who leaned his ladder against it.
Go ahead and read up on how auto-negotiation works. I'll wait...
No, really. Go read up on it...
Okay, since you don't bother reading up on it, and since you claim that someone's cheeky because they *document* what happens when you misconfigure a connection, I must conclude that you, sir, are indeed an idiot.
(To summarize for those of you who won't bother to look it up, a NIC can sense the carrier for 100, so it can differentiate 10/100. Full and half are actively negotiated by the two sides of the connection. If side 'A' is hard set to 100/full, it won't negotiate with the other side. Hearing no negotiation, side 'B' will assume the other NIC doesn't support full duplex connections and fall back to half duplex, leaving one end at full and the other at half. This is the proper, standardized, documented behavior. Anything else would require the psychic interface spec that *still* hasn't been finalized.)
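The fallback rule described above can be sketched as a tiny model. This is an illustration only, not a driver or a faithful implementation of the standard: the Port class, resolve() and duplex_mismatch() are hypothetical names, and the model assumes that two auto-negotiating ports both support 100/full.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Port:
    """One end of a link. speed/duplex only matter when autoneg is off."""
    autoneg: bool
    speed: int = 100      # Mbit/s, forced value
    duplex: str = "full"  # forced value: "full" or "half"


def resolve(local: Port, peer: Port) -> Tuple[int, str]:
    """Return the (speed, duplex) that `local` ends up using against `peer`."""
    if not local.autoneg:
        # A hard-set port never negotiates; it just uses its own settings.
        return (local.speed, local.duplex)
    if peer.autoneg:
        # Both sides negotiate. Simplifying assumption: both support 100/full.
        return (100, "full")
    # Peer is hard-set and silent: speed is sensed from the carrier,
    # but duplex cannot be sensed, so the negotiating side assumes half.
    return (peer.speed, "half")


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True when the two ends settle on different duplex modes."""
    return resolve(a, b)[1] != resolve(b, a)[1]


if __name__ == "__main__":
    forced = Port(autoneg=False, speed=100, duplex="full")
    auto = Port(autoneg=True)
    print("forced switch vs auto NIC:", resolve(auto, forced), duplex_mismatch(forced, auto))
    print("both forced 100/full:     ", duplex_mismatch(forced, forced))
    print("both auto:                ", duplex_mismatch(auto, auto))
```

The only combination that produces a mismatch in this model is one side forced to full and the other left on auto, which is why the advice is to configure both ends the same way.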
This reminds me of the time when I had a server that would not reboot because there wasn't a keyboard plugged in, and I did not change the setting in the BIOS.
Brian.
So make a little black button and know where it is, but also make an big red one that turns off the lights. That way you get to yell at little kids without much harm to your system.
This post written under Gentoo-linux with an SCO IP license.
Plain and simple. People notice a "historical post" and they want to have their LJ face right up there in it.
Total kissasses. I wonder how many of them are paid members vs free accounts.
Remember, the overwhelming majority of Livejournal users are *NOT* paying customers...
Account Types
What type of account do people have?
* Free Account: 5713743 (98.3%)
* Early Adopter: 14220 (0.2%)
* Paid Account: 94857 (1.6%)
* Permanent Account: 1632 (0.0%)
They're required by law to have it. It's a building code thing. Every data center I've ever been in has one.
Also... "EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center."
...when you buy crappy kit. Next time do it right.
Knowledge is power. Knowledge shared is power multiplied.
A couple of years ago, when our server room was being 'certified', one of the specific checks was "No big red button, check". One of the guys in the group came up with a story about how someone's kid at the end of a 'tour' thought that the 'big red button' was meant to be pushed.
The force that blew the Big Bang continues to accelerate.
I most often see autoneg problems with faulty cabling (split pairs from crimps). 98% of newbies cannot get it right, and they aren't to blame because the standards are counter-intuitive unless you've worked for Ma Bell for 40+ years. I beware of all field crimps.
OTOH, I saw one example of a Crisco Crapalyst router not wanting to play with some devices. Of course they blamed the device, but I never had any problem with interconnects or using cheap @$$ switches, so I wonder why the expensive @$$ switch gets huffy.
Nonsense. I had my server up for 360 days without rebooting, with kernel 2.4. It had 360 days on the uptime counter. I only shut it down because it was too slow for the newer stuff I wanted to run.
(a) Manager that pushed the "off" button gets promoted.
(b) Engineers that spent their weekends getting the system back up: off to India with your jobs!
"Awww, I don't know why we even have a jug!"
This space intentionally left (almost) blank.
I'm waiting for the day that machines come built such that when the power dies, an emergency battery kicks in just long enough to dump the RAM state to a nonvolatile cache, and then when power resumes, restore the system from there. Like VirtualPC.
Heck, having that be a user-accessible feature supported by the OS ("Save and Shutdown") would make a lot of sense too.
-Forrest Cameranesi, Geek of all Trades
"I am Sam. Sam I am. I do not like trolls, flames, or spam."
Lazy writes allow for faster system operation and have only one detrimental downside: in a poweroff or unexpected reset the data waiting to be written won't be. As bad as that sounds, the performance gains during normal system operations usually overcome fears of this data loss potential.
It boils down to this: if every bit of data is crucial, disable write cache. If performance is paramount and some tolerance exists for infrequent data loss due to catastrophic failures, enable it. LiveJournal evidently wanted your normal experience to be pleasantly quick rather than painfully accurate.
-- @rjamestaylor on Ello
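The trade-off in the comment above can be shown with a toy model. This is a sketch under loud assumptions: the Disk class and its methods are hypothetical, and real drives, controllers and filesystems are far more involved. It only demonstrates that writes acknowledged into a volatile cache disappear if power is cut before a flush.

```python
class Disk:
    """Toy disk: `platter` survives power loss, `cache` does not."""

    def __init__(self, write_cache: bool):
        self.write_cache = write_cache
        self.platter = []  # durable storage
        self.cache = []    # volatile write buffer

    def write(self, block):
        if self.write_cache:
            # Lazy write: acknowledged immediately, written to the platter later.
            self.cache.append(block)
        else:
            # Cache disabled: the write goes straight to durable storage.
            self.platter.append(block)

    def flush(self):
        """Move everything pending in the cache onto the platter."""
        self.platter.extend(self.cache)
        self.cache.clear()

    def power_off(self):
        """Simulate an EPO: return the blocks that were lost."""
        lost = list(self.cache)
        self.cache.clear()
        return lost


if __name__ == "__main__":
    fast = Disk(write_cache=True)
    fast.write("post-1")
    fast.flush()
    fast.write("post-2")  # still sitting in the cache
    print("lost with cache on: ", fast.power_off())

    safe = Disk(write_cache=False)
    safe.write("post-1")
    safe.write("post-2")
    print("lost with cache off:", safe.power_off())
```

The model leaves out the speed benefit entirely; it only shows the failure mode that the cached configuration accepts in exchange for it.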
I'm trying to figure out how depressing a button reverses a press. (Since the button is depressed by pressing it.) Unpressed it?
One line blog. I hear that they're called Twitters now.
I assume that they will have the responsible luser pay for the down time plus the 2 weeks credit plus the extra hours for the staff to bring the system up.
And what the hell was a visitor doing playing with the Big Red Button anyways?
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
If it looks like this, don't push it!
Become a FSF associate member before the low #s are used
Apparently this photo is an example of the button that was "accidentally" pressed.
I'd love to hear the explanation for this "accident".
I do not deploy Linux. Ever.
This happened to us last year in our datacenter.
The Facilities manager had some guys in to install shelving to store toner, cables, etc.
Our datacenter is divided into two sections, inner and outer. All CPUs, UPSs, HVAC, etc are in the inner room. The outer room is shelving, desks, CCTV (security), etc.
The EPOs are near every door, as they should be, including the outer doors. Some guy, while installing the shelves, decided to take a little break and lean against the wall, leaning on the EPO in the process.
It took us about 10 minutes to figure out what the hell happened, because even the generator didn't fire as it should. Meanwhile, the shelving guys were just merrily installing shelves. When asked, the guy just said he didn't realize anything was wrong and just thought it was nice that everything "got so quiet" all of a sudden.
Like LiveJournal, we promptly installed cages over the EPO buttons.
Ahh, I always wondered how one trolled. I feel better now.
You sure hit the nail on the head son. I am glad that you recognize that I am without bias, opinion or a tendency to propagandize for my side. My reporting is beyond reproach and I cannot even fathom how someone could insinuate that things might be to the contrary...
Click here or a puppy gets stomped!
When will people stop using this POS for production environments? Do you drive to work in your kid's toy car just because it's cheaper? No. You get the best car you can afford. Do you use FAT32 for your production servers? No. You use reiser or ffs+softupdate.
So - if they'd spent the extra 10 minutes it takes to learn how to program a real database, they'd have come right back up with maybe 5 min of transactions needing to be replayed.
Sitting Walrus Blog
Is why do you have an emergency shut down for a bunch of journals? Dear God Jim! The hax0rz have gotten the journals! Shut them down, now!
> OTOH, I saw one example of a Crisco Crapalyst
We always had problems with auto negotiation and the Crapalyst. It wasn't wiring or the workstation either. Whenever there was a performance problem it was almost always in the switch.
The new guy's first day was also his last.
Particularly one shaped like this Big Red Button
Patriotism is a virtue of the vicious
Simple solution to this one. At work we don't have a kill button. We have a kill key. It takes a little bit more work to "insert key" and "turn", but it's better than having incidents like this wherein somebody hits the big red button.
Plus, you can give the key only to people that aren't idiots. With the big red button, you'll inevitably get somebody who thinks "hmm, wonder what would happen if I pushed this big red button duhhhhhh."
Anyhow, I have seen EPO activations ranging from the malicious to a simple slam of the door, and never once has it saved a life. So what if a monitor smokes?
Until then: Place the redundant part of your system in a separate room, building, or country.
Bigger PSU capacitors = a machine less likely to crash or shut down during a brownout. I mean, after all, their job is to buffer power fluctuations. I doubt it had much to do with the OS.
Or maybe I've just been reading too many episodes of BOFH lately.
The second was a guy who was on his first day of work with us. A Big Boss came towards the machine room, so - feeling helpful - the new guy opens the door for him... or so he thought.
My favourite story (though I wasn't there) is about some old DEC machines, which apparently had the power switch about 6" from the floor. Nobody knew why they kept crashing at night, until someone spotted a cleaner ramming a vacuum cleaner right up to the servers.
That beats the one we had, when I used to do a lot of soak-testing of machines in a lab - I'd kick off a test on a Friday night, come back in on Monday to find the machine had rebooted. Nothing in the logs, just looked like the power had died, and returned again half an hour later. Other machines on the same power supply were fine.
It turned out that the cleaners were unplugging the servers, so they could plug in the vacuum cleaner!
Author, Shell Scripting : Expert Re
If they used PostgreSQL they wouldn't have had to deal with rebuilding indexes, etc.
There are real-world reasons to use an ACID compliant database!
Just because it CAN be done, doesn't mean it should!
It is a commonly known fact that cisco autoneg sucks ass.
Somebody's been hiring Stooges to guard that button. Bunch of lousy idiots.
yes, that's how it works. I used to have a computer which I could turn off for a quarter of a second without causing it to reboot. As you might suspect, I discovered this behavior by accident.
On a related note, a brownout isn't desirable and can cause a situation which is commonly called a loss of power. I really don't understand why some people here don't see the difference between powering off and an unintentional drop in voltage.
Since it's not exceptional to have brownouts (some elevators cause them btw) there are standards for PSUs on how much they can take before they can't supply anymore. Good computer magazines simulate brownouts when they test PSUs and the cheap brands usually fail miserably.
That's why GP's link is so funny after all - even the best OS in the world will fail if the motherboard, CPU or other peripherals don't have any power.
I don't read replies by ACs.
They need a Molly Guard
"Everything is adjustable, provided you have the right tools"
I noticed one thing conspicuously absent from their list of "Things we're doing to avoid this crap in the future..." That item is:
"Put a big sign next to the EPO button saying 'Do NOT Press This Button, it cuts off power to the entire building, it is not a light switch nor a door switch. Push this button only if your life is in danger. If your life is not in danger and you push this button, your life WILL be in danger.'"
"Why can't the EPO button perform in the same manner as a door release for an emergency exit..."
Emergency Power Off (EPO) switches are primarily a safety feature. If some person is being electrocuted, you hit the switch and the power dies so the person doesn't. You don't have time to wait in a situation like that. A person's life is considered more valuable than LiveJournal, which despite the name, isn't actually alive. (Insert comment about angst-ridden teen-age girls here.)
See also:
http://catb.org/~esr/jargon/html/S/scram-switch.h
http://catb.org/~esr/jargon/html/B/Big-Red-Switch
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
So, I double-clicked the button as fast as I could. No problem! Everything stayed up.
I have seen that a few times since then, where the good-quality computers have survived momentary power outages and the crummy ones haven't. Just another reason to buy quality hardware...
Linux IT Consulting and Domino Development in Michigan
"Please do not press this button again!"
Vista:XPSP2::ME:98SE
The BSD box just had bigger caps for its PSU size, or just better-quality capacitors.
I had the same thing happen when my PC rebooted but my old PowerMac rode through a brownout.
The Mac's PSU was physically 2x the size of the ATX one in the PC, i.e. far bigger caps and heat sinks.
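The "bigger caps ride through longer" claim follows from the energy stored in a capacitor, E = 1/2 * C * V^2: the usable energy between the starting voltage and the lowest voltage the supply can still regulate from, divided by the load, gives a rough ride-through time. A back-of-the-envelope sketch, where holdup_seconds is a hypothetical helper and every number (470 uF, 325 V, 250 V, 200 W) is made up for illustration; it ignores converter efficiency and everything else a real PSU design involves.

```python
def holdup_seconds(capacitance_f: float, v_start: float, v_min: float, load_w: float) -> float:
    """Rough ride-through time of a bulk capacitor feeding a constant load.

    Usable energy is 0.5 * C * (v_start**2 - v_min**2); dividing by the load
    power gives seconds. Idealized: no losses, constant power draw.
    """
    if v_min >= v_start:
        return 0.0
    usable_joules = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return usable_joules / load_w


if __name__ == "__main__":
    # All values below are illustrative assumptions, not measurements.
    small = holdup_seconds(470e-6, 325.0, 250.0, 200.0)
    large = holdup_seconds(940e-6, 325.0, 250.0, 200.0)
    print(f"470 uF: {small * 1000:.1f} ms")
    print(f"940 uF: {large * 1000:.1f} ms")
```

Under these assumptions the ride-through time scales linearly with capacitance, so doubling the caps doubles the time, and it shrinks as the load grows.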
Dear Mr. Rotund. How does one "roack"? Is it a sound? An action? Is it a new dance? Please elaborate on this stupidity. kthnx
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
Now we're getting somewhere Mr. Rotund (implication that you are a fat lazy slob). I see that you must be using Windows 3.1 to operate your brain. That would explain the latency in your response. Six days. Not bad Mr. Rotund. That 16-bit single tasking brain of yours can work a little, even if it's wayyyy late. LOL!!!!111!!!! OMFG!!!11111!!!!!! I made a funny.
Bleh.
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
No. I think not. I *HAVE* a life in "meatspace" as you call it (I'm no geek. I'm an artist who happens to use computers). If I didn't have one, I'd be trolling like you all the time. You definitely don't have a life Rotund Bastard.
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
Whheeee! Fun with trolling the trolls. Your ignorance is quite entertaining Mr. Penis Pudgepack.
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o