Slashdot Mirror


LiveJournal Blackout Analysis Online

Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday. Apparently someone "accidentally" pushed the emergency power off (which should keep all power off, even UPS), reset it and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines not fully rebooting for analysis reasons, and a few other problems."

241 of 333 comments

  1. Lesser OS... by Anonymous Coward · · Score: 5, Funny


    They should be using OpenBSD. It can run right through power failures

    1. Re:Lesser OS... by ergo98 · · Score: 2, Informative

      Power failed to get to the computers. It was a power failure - whether it was the electric grid, the UPS blowing up, or all the wires in the wall, or in this case the EPO button, it's a bloody power failure.

    2. Re:Lesser OS... by ghjm · · Score: 1, Insightful

      I'm about to leave work and go home. When I do, I plan to hit the so-called "power button." When I do this, code will execute on the box that flushes cache to disk and then commands the power supply to interrupt most (but not all) of its DC output. At that time, my computer will be in a state commonly referred to as "off."

      By your logic, I can claim that my computer is down due to a power failure.

      Perhaps you would complain: But power was getting to the computer.

      So what about the situation where I accidentally hit the (again, so-called) "off" button on my UPS. In this case the computer will go down due to a lack of power getting to it. However, the power is still on at the wall socket - I have just chosen (unintentionally) to interrupt the supply to the computer. Is this a power failure?

      I don't think you can call it a power failure if power is abundantly available, and you just don't choose to make use of it.

      -Graham

    3. Re:Lesser OS... by Anonymous Coward · · Score: 2, Insightful

      Nice use of intentional confusion of the issue to make an argument there.

      You say you 'choose (unintentionally)'. I'd say that if you accidentally hit your UPS or computer on/off switch, you are unintentionally causing a power failure.

      You're putting a break in the circuit. By your logic, if I hit a tree in my car, knocking it over into a powerline, killing power to my entire neighborhood, that's not a 'power failure' because there's power available to the break in the lines, and I 'chose (unintentionally)' for my entire neighborhood to not make use of the power. As far as the people with space in that colo are concerned, the supply of power failed (to their rack, their room, whatever) - in other words, a power failure.

      If I have machines in a colo, and power to those machines drops in an unscheduled manner, that's a power failure from my perspective, root cause be damned.

    4. Re:Lesser OS... by ergo98 · · Score: 1

      Right, but we're talking from the perspective of the people whose computers suddenly had no power - the infrastructure suddenly stopped providing power, so from their perspective it is a power failure. The NorthEast blackout of 2003 was a power failure, even though the protection circuits were doing exactly what they were supposed to do by shutting off the grid.

    5. Re:Lesser OS... by Aethel · · Score: 1

      as always, everything is relative with respect to your point of view, spooky at-a-distance action be damned!

    6. Re:Lesser OS... by Nykon · · Score: 1

      power yes..
      failure no..

      failure would indicate that it did not work the way it was supposed to.

      Someone hit the emergency kill button. The power went off as it should when that happens. Hardly a failure. A power outage is not the same as someone turning the power off.

      --
      "It's better to be a pirate then join the Navy"
    7. Re:Lesser OS... by ghjm · · Score: 1

      So, if you hit the power button on the UPS with your elbow, would it be okay to tell your boss "the systems went down because of a power failure?"

    8. Re:Lesser OS... by haruchai · · Score: 1

      Yes. Do you really want to tell him the truth?

      --
      Pain is merely failure leaving the body
  2. Good job! by Uriel · · Score: 1

    What they do makes me happy when I think how simple my setup is by comparison.

  3. The less we've learned... by geoffspear · · Score: 4, Funny

    Don't let your clients near the Big Red Button without an escort. Preferably an armed one.

    --
    Don't blame me; I'm never given mod points.
    1. Re:The less we've learned... by stupidfoo · · Score: 1

      And don't have it red. Have it black. People, especially kids, love pushing that damn red button, no matter how many warning signs you put around it.

    2. Re:The less we've learned... by Macrolord · · Score: 1

      One time where I work, the data center took a complete power outage during the annual fire suppression system test. As it turned out, the bypass switches that would have prevented the fire system from automatically cutting the power weren't flipped.

      Turns out, due to cost cutting, we had laid off the guys who knew how this stuff worked. Nothing like bringing a very large Teradata, hundreds of servers, and mainframes all back online. Glad I wasn't at the data center that day!

    3. Re:The less we've learned... by Lispy · · Score: 1

      Offtopic but hey:
      The red button will eternally be linked in my brain to the one in the pool of Maniac Mansion that reads "Do not push" and which everyone I know pushed anyway. ,-)

    4. Re:The less we've learned... by cdrudge · · Score: 1

      Code may dictate that it needs to be red.

    5. Re:The less we've learned... by Chris+Mattern · · Score: 2, Funny

      "The beautiful shiny button! The jolly, candy-like button!"

      Chris Mattern

    6. Re:The less we've learned... by puhuri · · Score: 1

      Some years ago we finally got a UPS for our laboratory; it was installed, the technician tested the setup, and we were satisfied. Some half a year later came the first blackout; we went to the laboratory expecting to see all systems running... all was black and silent! We found out that the UPS had been bypassed all that time; the technician had not switched it back after testing.

    7. Re:The less we've learned... by DrWho520 · · Score: 1

      "Who puts a 'Destroy the Engines' button on a ship, anyway!?!" - Kree, The Kids Next Door

      --
      The cancel button is your friend. Do not hesitate to use it.
    8. Re:The less we've learned... by geminidomino · · Score: 2, Funny

      Evil overlord list item #9: I will not include a self-destruct mechanism unless absolutely necessary. If it is necessary, it will not be a large red button labelled "Danger: Do Not Push". The big red button marked "Do Not Push" will instead trigger a spray of bullets on anyone stupid enough to disregard it. Similarly, the ON/OFF switch will not clearly be labelled as such.

    9. Re:The less we've learned... by Trejkaz · · Score: 1

      An armed escort... man, that's hot.

      --
      Karma: It's all a bunch of tree-huggin' hippy crap!
    10. Re:The less we've learned... by Zen · · Score: 1

      Your comment about making sure everyone has an armed escort struck me as pretty funny. Here's our situation:

      We recently had this same problem at my employer's state of the art datacenter. I work for a large (multi-state, a name the vast majority in the US knows) health care provider. One of our security guards was teaching a new security guard the ropes, and showed him the emergency button. Now, if we had any other type of power failure that mysteriously killed both our A and B power feeds, our emergency generators would immediately kick in and not even the lights would flicker. But the emergency button obviously has to cut everything. So he actually said "Now, whatever you do, don't do this" while pointing to the button, and hit it by mistake. However, it seems we fared much better than LiveJournal in that it only took about 10 hours to get everything back up and fully tested. A couple parts failed that we had full onsite support contracts for, but nothing major (including multiple mainframes that went down hard!) The good news is that now we know that all the disaster recovery drills we've done in the past 5 years actually work. It did make the newspaper though, and marketing had to call all of our large clients individually and apologize.

    11. Re:The less we've learned... by Bi()hazard · · Score: 1

      Excellent advice. In my underground volcano lair, the real self-destruct button was camouflaged at the bottom of a murky pool full of angry crocodiles. They were angry because the henchmen were instructed to throw things at them on a regular basis. Of course, I moved it when we upgraded to cyborg crocodiles with lasers and fire breath, which operate better on land (water puts out the fire) and are naturally pissed off without having to be annoyed manually.

      And yes, instead of on/off switches on all my engines of destruction and vehicles, there's fucking keys. So you can't just jump into any parked tank and go on a rampage.

    12. Re:The less we've learned... by geminidomino · · Score: 1

      Keys! Brilliant!!

  4. faulty mobo's by Lifthrasir · · Score: 5, Interesting

    so, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?

    --
    No beer, no TV make Lifthrasir something something
    1. Re:faulty mobo's by wankledot · · Score: 1
      The solution is even funnier...
      To get them back up they need somebody at the NOC to plug them into a compatible switch, let them autonego, then switch them to their real switch.
      This is how a company with Millions of paying accounts runs its data center, and they even knew about the problem!
      --
      My sig is blank, I typed this by hand.
    2. Re:faulty mobo's by BridgeBum · · Score: 1

      Maybe faulty, maybe not. There are a lot of incompatibilities and general "flakiness" with some network auto-negotiation interactions. It's a fairly standard precaution in large network environments that servers should not rely on auto-negotiation and instead should have their speed and duplex settings hard-coded.

      In reality, the only places where auto-negotiation is important are mobile devices (laptops) which may connect to a variety of network connection types or for the home user "plug-and-play" market. Major datacenter infrastructure isn't the place for auto-negotiating low level network settings any more than it is appropriate to have dynamic IP addressing via DHCP.

      --
      My UID is the product of 2 primes.
    3. Re:faulty mobo's by Lifthrasir · · Score: 1
      Yes, i know that, but these NIC's couldn't even be set to the proper speed/duplex.

      From TFA:

      We have 9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly. They only work with certain switches, so they reboot fine, but then their gigabit network comes up at 100 half duplex or something that doesn't work. To get them back up they need somebody at the NOC to plug them into a compatible switch, let them autonego, then switch them to their real switch. Setting the speed/duplex settings on both the host and/or switch themselves doesn't work....
      --
      No beer, no TV make Lifthrasir something something
    4. Re:faulty mobo's by tchuladdiass · · Score: 1

      Also, you should never rely on autonegotiation -- there are no standards. That's what ethtool or mii-tool is for, or at a minimum specify speed/duplex in your /etc/modules.conf file.
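
A minimal sketch of what hard-coding the link with ethtool could look like, written in Python only so the command can be assembled and inspected without touching a real interface. The function name and the interface name "eth0" are illustrative assumptions, not anything from the LiveJournal write-up. The sketch refuses gigabit on purpose, since gigabit over copper is generally expected to negotiate.

```python
def ethtool_force_args(interface, speed_mbps, duplex):
    """Build the argv for forcing speed/duplex with ethtool (hypothetical helper).

    Only 10 and 100 Mbit/s are accepted here; forcing gigabit copper with
    autonegotiation off is deliberately left out of this sketch.
    """
    if speed_mbps not in (10, 100):
        raise ValueError("this sketch only forces 10 or 100 Mbit/s")
    if duplex not in ("half", "full"):
        raise ValueError("duplex must be 'half' or 'full'")
    return [
        "ethtool", "-s", interface,
        "speed", str(speed_mbps),
        "duplex", duplex,
        "autoneg", "off",
    ]


if __name__ == "__main__":
    # Print the command instead of running it; running it needs root
    # and a real interface.
    print(" ".join(ethtool_force_args("eth0", 100, "full")))
```

Whatever is forced on the host has to be forced identically on the switch port, otherwise the two ends can disagree about duplex.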

    5. Re:faulty mobo's by ignorant_newbie · · Score: 1

      >We have 9 machines with faulty motherboards
      >with embedded NIC

      so basically, they're using shite hardware because they're too cheap. bet they've noticed by now that it costs less to use good hardware than to try to fix it later when something goes wrong

    6. Re:faulty mobo's by Surye · · Score: 1

      The same team still works for SixApart now.

    7. Re:faulty mobo's by dbIII · · Score: 1
      so, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?
      If it's something that you need all the time, and only has problems on boot you get to it when you can organise a shutdown - I've had to leave dead disks in machines for months before I can bring the thing down and pull it apart. To put things in perspective major bits of plant - like power station units, typically run for three years between shutdowns, and relatively major faults may persist for a couple of years before they are dealt with.
    8. Re:faulty mobo's by Lifthrasir · · Score: 1
      well in this case it was a 9 computer cluster that was supposed to be redundant, automatic failover and what not.

      they could have taken one machine down, added a NIC, turned it back on. it would have taken 30 minutes (being really generous here).

      If they did one machine at a time, they wouldn't have noticed any downtime, but would have prevented this from happening.

      And in regards to computers in plant situations with faults for years (granted, i don't know the specifics of your situation, and i'm not trying to flame) - i'd much rather have some planned downtime to fix it than be called out in the middle of the night to fix it.

      --
      No beer, no TV make Lifthrasir something something
    9. Re:faulty mobo's by Jeff+Mahoney · · Score: 1

      The motherboards were faulty with respect to network connectivity on boot, not stability.

      The end result of replacing them before a major outage? Having another outage.

    10. Re:faulty mobo's by dbIII · · Score: 1
      plant situations with faults for years
      They are called planned shutdowns, and usually happen every three years. If something breaks and lets the steam out, you have an unplanned shutdown, and a whole lot of queued tasks get done during the duration of fixing the main fault. This situation is common in production environments - for instance an oil heater in a refinery may lose all of its temperature monitoring gear in the first couple of months, so it's just run conservatively for the next three years with visual checks each day to see how red the pipes are (optical pyrometer as well as just looking at them) and work out if it can handle the heat.

      There are a lot of computer systems that need to run like industrial plant now - turn them off and production completely stops.

      well in this case it was a 9 computer cluster that was supposed to be redundant, automatic failover and what not.
      These things happen every now and again. I know of one backup generator made from a fighter jet engine that starts up and runs for a couple of minutes as a test every Sunday without faults. Nearly every time it has been needed to actually be a backup generator a different fault each time has prevented it starting. It's likely that some condition outside of the usual test clobbered the cluster as well.
    11. Re:faulty mobo's by Lifthrasir · · Score: 1

      But this is a cluster, so they could have just taken one node down at a time and replaced the motherboards or disabled the on board NIC and added a PCI NIC.
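
The one-node-at-a-time idea can be sketched as a loop that never lets the cluster drop below a chosen capacity. This is only an illustration under assumptions: the function, its parameters, and the callbacks `fix` and `is_healthy` are hypothetical, and nothing here reflects how LiveJournal's cluster was actually managed.

```python
def rolling_maintenance(nodes, fix, is_healthy, min_in_service):
    """Take nodes out one at a time, fix each, and stop at the first failure.

    nodes          -- ordered list of node names
    fix            -- callable(node) that performs the repair (e.g. swap the NIC)
    is_healthy     -- callable(node) -> bool, checked before the node rejoins
    min_in_service -- smallest number of nodes allowed to be serving at once
    """
    in_service = set(nodes)
    for node in nodes:
        # Refuse to start if pulling this node would leave too few serving.
        if len(in_service) - 1 < min_in_service:
            raise RuntimeError("not enough spare capacity to take %s down" % node)
        in_service.discard(node)
        fix(node)
        if not is_healthy(node):
            # Leave the bad node out and stop, rather than pulling another one.
            raise RuntimeError("%s did not come back healthy; stopping" % node)
        in_service.add(node)
    return in_service


if __name__ == "__main__":
    cluster = ["node%d" % i for i in range(1, 10)]  # 9 machines, as in the article
    done = []
    rolling_maintenance(cluster, done.append, lambda n: True, min_in_service=8)
    print("fixed in order:", done)
```

Stopping at the first node that fails its health check is the design choice that matters: it caps the damage at one machine instead of working through the whole cluster.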

      --
      No beer, no TV make Lifthrasir something something
    12. Re:faulty mobo's by dbIII · · Score: 1
      And in regards to computers in plant situations with faults for years
      To clarify things, I meant bits of plant with major faults for years. Computers can often be taken off line while being replaced temporarily by another system, but bits of pipe in a flame that isn't going out for three years is another story. If the pipe is getting thin or losing strength due to heat damage, you can run at reduced capacity if necessary or possible until the next scheduled shutdown.
  5. 503 pages by Folmer · · Score: 1

    Now, if Slashdot could fix their servers so we wouldn't get those annoying 503 sites...
    I haven't seen them that much lately, but then I haven't been online that much either...

    1. Re:503 pages by Rosco+P.+Coltrane · · Score: 1

      Now, if Slashdot could fix their servers so we wouldn't get those annoying 503 sites...

      You get 503 sites? I only reach one at slashdot.org

      Then again, you're a subscriber. Who knows what goodies you lucky few get here...

      --
      "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    2. Re:503 pages by vagabond_gr · · Score: 1

      So you're complaining about the 503's that you don't see, basically because you're rarely online?

  6. Oppsie by darkstar949 · · Score: 5, Funny

    "I'll just set my coffee down here, and..."
    ...
    "Oppsie, I hope that button wasn't anything important."

    1. Re:Oppsie by Gary+Destruction · · Score: 1

      You mean that big red button wasn't the coffee maker? Oops.

    2. Re:Oppsie by superpulpsicle · · Score: 1

      You mean that Staples commercial with the big EASY button is not a real product? I was waiting for it to go on sale.

  7. History Eraser Button by bsd4me · · Score: 4, Funny

    Ah, the famous History Eraser Button rears its ugly head. I think that everyone who has worked in a large datacenter or lab environment with one of these has a story to tell...

    --

    (S(SKK)(SKK))(S(SKK)(SKK))

    1. Re:History Eraser Button by stratjakt · · Score: 1

      Are you saying this was Stimpy's fault?

      You idiot! My god man, do you know what you're saying?

      --
      I don't need no instructions to know how to rock!!!!
    2. Re:History Eraser Button by scribblej · · Score: 4, Interesting

      I'll go right ahead then. I was consulting for State Farm installing machines that were supposed to help with the Y2K problem. Hell if I know, I just got the box, went to the site, installed it and made sure it was working. Easy. I had five to do a week, and would be done by Tuesday morning and helping out other contractors on similar projects.

      I'll never forget my visit to the State Farm DSO in Detroit, MI. I'd just physically installed the new machine, at the bottom of a rack, and stood up.

      Stood up putting my shoulder right into the unprotected "History Eraser Button" on the wall. The screams of the employees working in the datacenter could be heard all the way back home in Chicago, I've no doubt.

      Then it turns out the fuses which will reset the systems in the datacenter are in a locked cabinet.

      Then it turns out no one on site has a key.

      Fortunately, I found that the cabinet will pop open if you kick it hard enough. Hey, I was panicking, okay?

      And get this. After it was all over and I realized I probably wouldn't get killed by anyone... they told me "It's okay, this happens all the time. The guy installing the A/C unit last week did it too."

      Maybe they should have put a cover over the damn button then. Morons.

    3. Re:History Eraser Button by Local+ID10T · · Score: 2, Funny
      I was consulting for State Farm installing machines that were supposed to help with the Y2K problem.


      Hey! I worked that project too... it was fun, but mind-numbing. They actually sent me to New Orleans for an install on Fat Tuesday.

      Mardi Gras on an expense account :)

      --
      "You want to know how to help your kids? Leave them the fuck alone." -George Carlin
    4. Re:History Eraser Button by Aaden42 · · Score: 2, Funny

      Nobody remembers!

    5. Re:History Eraser Button by Anonymous Coward · · Score: 1, Insightful

      It's from an episode of Ren & Stimpy, "Space Madness".

      [Button room] REN: Now, listen, Cadet. I've got a JOB for you. See this button? (Stimpy reaches for the button) DON'T TOUCH IT! It's the HISTORY ERASER button, you FOOL!
      STIMPY: So what'll happen?
      REN: That's just IT! We don't KNOW! Maayyybeee something bad?...Mayyybeee something good! I guess we'll never know! 'Cause you're going to guard it! You won't TOUCH it, will you?
      (Stimpy salutes. Ren leaves.)
      REN: Hehhh...hehhhh...hehhhh...hehhhh...
      (Stimpy marches back and forth, staring at the button.)
      ANNOUNCER: Oh, how long can trusty Cadet Stimpy hold out? How can he possibly resist the diabolical urge to push the button that could erase his very existence? Will his tortured mind give in to its uncontrollable desires?
      (Announcer grabs Stimpy, forces him closer to the button.) Can he resist the temptation to push the button that, even now, beckons him even closer? Will he succumb to the maddening urge to eradicate history? At the MERE...PUSH...of a SINGLE...BUTTON! The beeyootiful SHINY button! The jolly CANDY-LIKE button! Will he hold out, folks? CAN he hold out?
      STIMPY: NO I CAN'T!!!EEEEEYAAAHHHH! (pushes button)
      (Alarms go off. Ren, Stimpy, and Announcer stand around table with button.)
      ANNOUNCER: Tune in next week, as...
      (Flash, explosion as they all disappear.)
      We see the Ren and Stimpy logo, Ren and Stimpy also flash and disappear.

    6. Re:History Eraser Button by anticypher · · Score: 1

      Maybe they should have put a cover over the damn button then

      If I ever catch anyone putting a cover over a critical piece of safety equipment, like an Emergency Power Cutoff switch, I'll put their head on a pole in front of the data centre as a warning to others.

      Never fuck with safety equipment. It would be better to not have kit directly next to the big red button, leaving it a nice clear space so in an emergency someone could reach it and maybe save your life.

      the AC

      Yeah, I got a 208 volt jolt at RedBus today. Fucking check your hot and neutral orientation, shitheads!

      --
      Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
    7. Re:History Eraser Button by Trifthen · · Score: 1

      And what's wrong with just a plastic cover on a hinge that keeps someone from just pressing the button on accident?

      An Emergency! Oh N0es!

      1.) Lift cover.

      2.) Press Button.

      3.) Profit?

      --
      Read: Rabbit Rue - Free serial nove
    8. Re:History Eraser Button by trolman · · Score: 1
      Oh Yea? I was in Charlottesville and it was cold.

      What was your user name? BOFH
      http://members.iinet.net.au/~bofh/

    9. Re:History Eraser Button by martinX · · Score: 1

      What we need is a Big Red Button, uncovered. If you push it, a Big Blue Button pops out with a sign above it that says "Are you sure you want to activate the Big Red Button? Push the Big Blue Button for OK."

      --
      When they came for the communists, I said "He's next door. Take him away. Goddam commies."
    10. Re:History Eraser Button by jonwil · · Score: 1

      Put a cover on it like a fire alarm button has.
      They can be pressed very fast when you need to but are very hard to just bump accidentally.

    11. Re:History Eraser Button by cgenman · · Score: 2, Funny

      If I ever catch anyone putting a cover over a critical piece of safety equipment, like an Emergency Power Cutoff switch, I'll put their head on a pole in front of the data centre as a warning to others.

      You of all people should realize that putting someone's head on a pole in front of a data centre is dangerous. For one, it tends to become a disease vector, as for some mysterious reason everyone feels the need to touch it. Rats are usually attracted to the smell, and you know how rats wreak havoc on ethernet cables, especially the rats of the dead. Furthermore, putting the dead on a spike on your front lawn tends to attract ghosts, which are no problem if you're running a secure OS, but everyone knows what havoc ghosts can wreak on a Windows Server 2000 installation.

      On the other hand, how would putting a clear, hinged plastic cover over an emergency power kill switch be likely to kill someone? I know people panic in desperate situations, but if someone can't get a plastic hinged cover off of a button quickly during an emergency they shouldn't be trusted with electricity.

      There are many ways you could safely "fuck with" the safety equipment while making it less likely to take down your entire network. You could make it a handle that had to be pulled down, like most fire alarms are. It could be "Break flimsy plastic and press button to kill power." Heck, it could just be recessed, like many good last-resort buttons are.

  8. Where was the switch? by SoupGuru · · Score: 1

    Did they put it right next to the light switches? Shouldn't something like that be locked away in a server room or at least in a place where it can be under supervision?

    --
    What doesn't kill you only delays the inevitable
    1. Re:Where was the switch? by grub · · Score: 2, Informative


      They usually are in a server room. They're for emergencies. Ours have red cages around them and a BIG RED SIGN, you have to basically punch them.

      --
      Trolling is a art,
    2. Re:Where was the switch? by crimoid · · Score: 1

      Typically these types of devices are just inside the door to the rooms that they cut off. This way Fire & Emergency personnel can get to them quickly and easily.

      Generally the buttons themselves are behind plexiglass lids that easily flop up or behind breakable glass.

    3. Re:Where was the switch? by Shkuey · · Score: 1

      Locking up an emergency button defeats the purpose. They'll typically have a plastic cover you need to lift or some other mechanism to make sure it can't be done by accident. If the person who did it won't own up, they should have it fingerprinted. I mean... how many other people have pressed it? Should be fairly easy.

    4. Re:Where was the switch? by grub · · Score: 1


      Locking up an emergency button defeats the purpose.

      So I shouldn't have my fire extinguishers under lock and key? Whoops... ;)

      --
      Trolling is a art,
    5. Re:Where was the switch? by vasqzr · · Score: 1


      Locking up an emergency button defeats the purpose. They'll typically have a plastic cover you need to lift or some other mechanism to make sure it can't be done by accident. If the person who did it won't own up, they should have it fingerprinted. I mean... how many other people have pressed it? Should be fairly easy.

      Video surveillance camera, anyone?

    6. Re:Where was the switch? by bsd4me · · Score: 1

      These switches are generally big round buttons about 2" in diameter, and almost always made out of bright red plastic. On top of that, the button takes some force to depress and many facilities place a hinged, clear plexiglass box over them to prevent accidental use. It is pretty hard to mistake one for a normal light switch.

      --

      (S(SKK)(SKK))(S(SKK)(SKK))

    7. Re:Where was the switch? by irc.goatse.cx+troll · · Score: 1

      Or your gun, unless you want to ask that kind man with a knife to wait while you dig out the key.

      --
      Pain lasts, kid. Its how you know you're alive. Sometimes I think this growing up thing is just pain management-TheMaxx
    8. Re:Where was the switch? by cypher_6502 · · Score: 1

      In my old company, the 'router' guy accidentally mistook that 'big red' power reset button by the door for the light switch. He thought he would do the equipment a favor by turning off the lights, so the room would run cooler. Within five minutes of him leaving the room, HP OpenView started to barrage everyone on the network staff with a list of 'down servers'. Since then, the 'big red' button has been enclosed inside a plastic box. As for the router guy, he was pretty hostile to everyone and wasn't a team player. You'd think he would have been fired or reprimanded. Instead, he lucked out: we had a corporate consolidation on the regional scale, and he was promoted to manage the new regional WAN group.

    9. Re:Where was the switch? by buckeyeguy · · Score: 1
      Sounds like the Dilbert principle there... promote the guy to a position where he won't be near the Big Red Button.

      Seems like a LOT of people have these stories; I've had mine for awhile; after moving our organization's computer room (could hardly call it a data center at the time), and thankfully still during the buildout phase, the phone guy (one of Ameritech's geniuses, fwiw) pressed the button, thinking it was the handicap-open-the-door button. We put a transparent plastic cover over it after that.

      --
      I'd have a personalized plate on my car, but "toxic bachelor" won't fit into 7 letters.
    10. Re:Where was the switch? by galaxy300 · · Score: 1

      You should have just used scotch tape. Nobody ever pushes the button with scotch tape on it.

  9. Perhaps they should answer by antifoidulus · · Score: 1

    /.s current poll now?

    1. Re:Perhaps they should answer by zeylisse · · Score: 1

      several unbootable machines -- few thousands $$$
      thousands man-hours of repair -- several thousands $$$
      zillions teenage-girls-unable-to-blog-crying-hours -- priceless.

    2. Re:Perhaps they should answer by game+kid · · Score: 1

      Yup. Looks like another Over $20k, but no one knew it was me response.

      Maybe I'll answer for them--they might be too busy preventing the next wreck. They ought to be with all their users.

      --
      You can hold down the "B" button for continuous firing.
  10. Fascinating read by Saint+Aardvark · · Score: 4, Insightful
    It's amazing how much you can learn from things going horribly wrong. :-)

    Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.

    1. Re:Fascinating read by caluml · · Score: 1

      Agreed. I always appreciate when people explain how large scale outages happened, how they were able to happen, how they fixed them, and what they're doing to prevent them happening again. It's useful (and good for your employment status) to learn from other people's mistakes rather than your own.
      So Slashdot - what are all the 500 errors about then? :)

  11. Missing opportunities by Rosco+P.+Coltrane · · Score: 3, Funny

    Apparently someone "accidentally" pushed the emergency power off

    They had to power back on when they realized deadjournal.com was already taken...

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
  12. LJDotting: LJ user base vs Slashdot user base. by TrevorB · · Score: 4, Funny

    If Mr. "I Pushed The Big Red Button"'s personal information ever gets published....

    LJ's active user base is easily 10x that of Slashdot's. We'd have to come up with a new term for the internet event that pales any slashdotting that ever came before.

    1. Re:LJDotting: LJ user base vs Slashdot user base. by game+kid · · Score: 1

      How about the (somewhat) phonetic form of the complete URI:

      an http-colon-slash-slash-slash-dot-dot-orging?

      The people who got that domain name are some lucky geniuses.

      --
      You can hold down the "B" button for continuous firing.
  13. Auto-negotiation by stilwebm · · Score: 3, Informative

    When I first moved company servers in to a new colo four years ago, their engineers advised me that I should turn auto-negotiation off on every port, including our switches and host NICs. I asked why they recommended this and they replied, "trust us, auto-negotiation causes problems when you least expect it." I went ahead and fixed the port speeds everywhere. Now I understand why.

    1. Re:Auto-negotiation by Malk-a-mite · · Score: 1

      If you know what speed port you are plugging in to why would you need to autoneg?

      It's a convenience that isn't always needed.

    2. Re:Auto-negotiation by jjgm · · Score: 4, Insightful

      Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.

      They're cheeky enough to document this now. It's a feature, not a bug! Honest!
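The mismatch described above can be sketched in a few lines. This is a simplified model, not any vendor's implementation: it assumes 10/100 Ethernet, assumes both ends support 100/full when they negotiate, and relies on the usual fallback where an auto-negotiating port facing a forced peer can sense the speed but not the duplex and so drops to half duplex. `PortConfig`, `resolve_port` and `link_report` are made-up names for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PortConfig:
    """Hypothetical description of one end of a link."""
    autoneg: bool
    speed: int = 100          # Mbit/s, only meaningful when autoneg is False
    full_duplex: bool = True  # only meaningful when autoneg is False


def resolve_port(local: PortConfig, peer: PortConfig) -> Tuple[int, bool]:
    """Return the (speed, full_duplex) that `local` ends up using."""
    if not local.autoneg:
        # A forced port simply uses what it was told to use.
        return local.speed, local.full_duplex
    if peer.autoneg:
        # Both ends negotiate; assume 100/full is the best common mode.
        return 100, True
    # The peer is forced and sends no negotiation pulses: the speed can
    # be detected from the signal, but the duplex cannot, so fall back
    # to half duplex.
    return peer.speed, False


def link_report(a: PortConfig, b: PortConfig) -> str:
    """Summarise what a link between ports a and b would look like."""
    a_speed, a_full = resolve_port(a, b)
    b_speed, b_full = resolve_port(b, a)
    if a_speed != b_speed:
        return "no link (speed mismatch)"
    if a_full != b_full:
        return "duplex mismatch"
    return f"{a_speed}/{'full' if a_full else 'half'}"


if __name__ == "__main__":
    auto = PortConfig(autoneg=True)
    forced_full = PortConfig(autoneg=False, speed=100, full_duplex=True)
    print(link_report(auto, auto))                # both ends auto
    print(link_report(auto, forced_full))         # one end locked to 100/full
    print(link_report(forced_full, forced_full))  # both ends locked
```

Under this model the only bad combination is the mixed one, which is why "lock both ends or negotiate on both ends" tends to be the practical rule.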

    3. Re:Auto-negotiation by Undertaker43017 · · Score: 2, Funny

      The part I like is they are claiming that everyone else is wrong, and they are right. ;)

      I don't buy Cisco anymore for this very reason, it's not just their switches, it's on everything they make that has a NIC.

      I deployed some CSS's, right after Cisco bought ArrowPoint, and they did auto correctly. Another client deployed some a couple of months ago, and auto was broken. Cisco is the Borg! ;)

    4. Re:Auto-negotiation by bjz · · Score: 1

      Actually, nowadays even Cisco recommends trying auto negotiation first, and only hard coding port/speed settings for problem NICs or for other switches, routers, and important servers. Also, with gigabit ethernet, the port speed and other settings like flow control have to be auto negotiated ( http://www.cisco.com/en/US/products/hw/switches/ps663/products_tech_note09186a0080094713.shtml#auto_neg/).

      Apparently, when auto negotiation was first being standardized, it was crap and most network admins just learned to shut it off and never changed practices as auto negotiation became more stable. Instead, the "turn it off" wisdom was passed down, normally with vague hand waving about "problems". Today Cisco and Sun (the only companies I researched) recommend auto negotiation. I'll bet those 9 machines failing to auto negotiate is more because of crap components being used than any fault of auto negotiation; this was apparently a known problem, and auto negotiation should have been turned off for those specific machines.

    5. Re:Auto-negotiation by ignorant_newbie · · Score: 1

      >and never changed practices as auto negotiation
      > became more stable

      So you believe the manufacturer's press release? OK. Setting that aside for a minute, given that most installations purchase hardware as it's needed, that means that most people have some old stuff and some new stuff. Do you think it makes more sense to have a different policy for each piece of hardware you're plugged into, or to have one policy that always works no matter what you're attached to?

      now if someone in the gnu/linux world would just fix ifconfig so that it actually knew how to configure all the settings on a given interface so that I wouldn't have to read the damn kernel documentation for every new nic i purchase...

    6. Re:Auto-negotiation by archen · · Score: 1

      I hope they at least asked you what kind of switches and NICs you were using. I found out the hard way one time that Nortel switches (at least the ones we use) default to 100Tx with NO duplexing when you turn negotiation off (and you can't force it to duplex either). Man did networking take a shit on some servers that day...

    7. Re:Auto-negotiation by Moofie · · Score: 1

      So I don't have to jack with it on every computer I install.

      It's a convenience that saves me time.

      --
      Why yes, I AM a rocket scientist!
    8. Re:Auto-negotiation by stilwebm · · Score: 1

      Yes, they did, and as others suggested, they were Cisco switches (this was in 2001) and they were Cisco certified engineers.

    9. Re:Auto-negotiation by mink · · Score: 1

      I have a Nortel switch that has auto-neg issues with hardware connected to it.

      --
      Well I've wrestled with reality for thirty five years doctor, and I'm happy to say I finally won out over it.
  14. ...and ran off? by stratjakt · · Score: 5, Funny

    What do you mean, ran off?

    Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?

    Or do you really mean, slunk off, like my dog does when I walk in and find her curled up on top of the remains of the remotes for the TV, TiVo, DVD player and stereo?

    My dog likes remote controls more than snausages.

    OT: Anyone know where (brick and mortar) to get a replacement (original) TiVo remote?

    --
    I don't need no instructions to know how to rock!!!!
    1. Re:...and ran off? by stratjakt · · Score: 1

      Ya well, shit happens, and I hardly think they're going to call in the cast of CSI to investigate this.

      I mean, for the most part, it's a free service. It's not like those users with free accounts can sue to get their money back.

      --
      I don't need no instructions to know how to rock!!!!
    2. Re:...and ran off? by stratjakt · · Score: 1

      a) I don't want to wait 6-12 months for delivery, which I've been told, is about the average turnaround ordering stuff from TiVo. I kind of wanted to watch TV today.

      b) That, and I've given TiVo enough of my money directly. 35 bucks for a single-function remote is ridiculous. They don't even give you free shipping. Some people deserve to go bankrupt.

      Don't any retailers carry replacements? Or even a third party remote that has the right buttons, in the right places? The philips universals control TiVo, but it makes finding the "TiVo central" and "live tv" buttons a chore.

      --
      I don't need no instructions to know how to rock!!!!
    3. Re:...and ran off? by DrHogie · · Score: 1

      9thtee.com and weaknees.com should both sell replacement remotes for TiVo. After one too many drops on our living room's tile floor, it's about time we get a new one ourselves . . .

      http://www.weaknees.com/tivo_remotes.php

      --
      --DrH, the Sandwich with the Ph.D.
    4. Re:...and ran off? by stratjakt · · Score: 1

      But

      I

      Want

      One

      NOWWWWWWWWW

      I can't watch Sunday's Simpsons until I get a remote.

      There should be a law against selling remote operated products that don't have the equivalent buttons on the device itself. Eg, TiVo, and the downstairs TV in which the only way to put it in rear A/V in mode is via the remote.

      --
      I don't need no instructions to know how to rock!!!!
    5. Re:...and ran off? by Rolan · · Score: 1

      Ran off like "god I hope nobody has a gun back there" I would imagine.

      --
      - AMW
    6. Re:...and ran off? by UWC · · Score: 1

      I share your pain in that regard, especially the input selecting. While it seems that most (or at least many) VCRs let you set the A/V inputs as standard channels in the normal tuning sequence (though setting that up still requires the remote), both of the TVs I have with A/V inputs require a remote for access to those. Which is frustrating when you have the DVD player remote in hand, with its audio running to external speakers and all you have to do is press a single button once (maybe twice, depending on which input is used) on the TV remote which is nowhere to be found.

    7. Re:...and ran off? by SmittyTheBold · · Score: 1

      Get one of the One-For-All universal remotes, they're cheap and the slightly-more-expensive ones ($15 or so) can be flash-upgraded with new device codes and completely customized keymaps. To do this you'll have to be willing to geek out quite a bit to learn how to program the remote, but they're very powerful when you get down to it.

      A good suggestion is the OFA URC-8811 which can be purchased at your friendly neighborhood Wal-Mart for cheap and used right away, then soft-upgraded later when you want to get the absolute most out of it.

      Learn more about all this here.

      --
      ± 29 dB
    8. Re:...and ran off? by stratjakt · · Score: 1

      I know they do I just don't care too much for the button layout.

      But then again, they control everything I own (XBox, PS2 and TiVo and even that cheap-ass Sears branded TV from 1902)

      --
      I don't need no instructions to know how to rock!!!!
  15. I want to name this file..... by Evil+W1zard · · Score: 1

    Speaking of stupid things to do how many people know someone that has named a file on a Unix server * and then at some point later in time decided they no longer needed that file and decided to rm *?

    --
    News Reporters Make Tasty Polar Bear Treats!
    1. Re:I want to name this file..... by Cocoronixx · · Score: 2, Funny

      uhhh 0? Well I guess 1 since I can count you now.

      --
      "Obscenity is the crutch of the inarticulate motherfucker." - cloak42
    2. Re:I want to name this file..... by shuz · · Score: 1

      That is why I try to always include "" around everything I do in any unix environment and leave off trailing /'s whenever possible.

      --
      There is or can be built a machine that can simulate any physical object. -Church-Turing principle
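For anyone who has not been bitten by this yet, here is a minimal sketch of the difference between "the file called *" and "everything * matches". It assumes a POSIX filesystem (a file literally named `*` is not allowed on Windows), and `remove_literal` and `demo` are hypothetical helper names. In a shell the same effect comes from quoting, e.g. `rm -- '*'`.

```python
import glob
import os
import tempfile


def remove_literal(directory: str, name: str) -> None:
    """Delete exactly one file called `name`; no wildcard expansion happens."""
    os.remove(os.path.join(directory, name))


def demo():
    """Create a file named '*' next to two others, then remove only it."""
    with tempfile.TemporaryDirectory() as d:
        for name in ("*", "notes.txt", "data.db"):
            open(os.path.join(d, name), "w").close()

        # What an unquoted wildcard would have matched: every file here.
        wildcard_matches = len(glob.glob(os.path.join(glob.escape(d), "*")))

        # What we actually want: only the file whose name is the star.
        remove_literal(d, "*")
        return wildcard_matches, sorted(os.listdir(d))


if __name__ == "__main__":
    matched, survivors = demo()
    print(f"wildcard would have hit {matched} files; survivors: {survivors}")
```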
  16. Credit by XorNand · · Score: 4, Informative

    Anyone who's a paid member of LJ can get a 2-week credit here.

    --
    Entrepreneur : (noun), French for "unemployed"
  17. A great article by digitalgimpus · · Score: 1

    I must compliment LJ for at least being honest with their system... many would lie and say "it was the datacenter's fault".

    They at least admit their own systems weren't perfect... and clearly explained each fault they observed.

    Good info.

  18. I'm in that datacenter once a month or so... by marked23 · · Score: 1

    I always wanted to push that button... Now I don't have to.

  19. Ahhhh silence is GOOOOLDEN by ShatteredDream · · Score: 3, Funny

    *crickets chirping* That's the sound of millions of teenage girls not using up bandwidth and disk space talking about boys, jcrew and high school/college drama.

    1. Re:Ahhhh silence is GOOOOLDEN by eln · · Score: 1

      Yah, but now we have nerds talking about girls talking about boys, jcrew, and high school/college drama. I shudder to think what would happen if Slashdot had an outage like that right now.

    2. Re:Ahhhh silence is GOOOOLDEN by metalhed77 · · Score: 3, Funny

      So says the author of yet another political weblog whose startling impartiality and sense will pave the way for a brave new world?

      --
      Photos.
    3. Re:Ahhhh silence is GOOOOLDEN by AndroidCat · · Score: 1

      For emergency backup, they could always switch back to paper diaries, except that their kid brothers could steal them and read them. Can't have that! (Let the pest browse it like everyone else.)

      --
      One line blog. I hear that they're called Twitters now.
    4. Re:Ahhhh silence is GOOOOLDEN by mattwarden · · Score: 1

      (Score:3, Funny)

      Actually, I believe that is the sound of another ridiculously redundant comment being moderated by slashdot mods who didn't read the comments of the last 2 stories about this incident.

  20. machine failure by br00tus · · Score: 3, Insightful
    "They had problems to come back up fast, because of '9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly', Machines not fully rebooting for analysis reasons and few others."

    I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines which would not come back up for one reason or another - so we dealt with it then, on Sunday morning, instead of during the week when a reboot of a critical machine that did not work would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures - their power failed, and then their lack of planning allowed for so much to go wrong as a result of that.

    1. Re:machine failure by rjstanford · · Score: 4, Insightful

      One of the last steps of our standard deployment was a full hard shutdown and restore from backup. This was scheduled to happen approximately a week before bringing the machines live - after a lot of data setup had been done.

      Many customers - and internal staff - really, really got scared at that point. The thing is, if you don't trust your backups, what good are they? It's amazing what things got taken care of and found during double-checks the week before the backup/restoration test.

      Oh, and we always went with scheduled reboots as well, for very much the same reason as you mentioned. An hour a month of scheduled downtime is almost always available - usually we booted every week and had an optional downtime window on a monthly basis. And if your (talking to readers here, not parent) organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?

      --
      You're special forces then? That's great! I just love your olympics!
    2. Re:machine failure by gkuz · · Score: 2, Insightful
      Every Saturday evening, we rebooted all of our servers

      Yeah, we had servers like that once, too. Ba-da-bing! Thanks, I'll be here all week.

      On a serious note, am I the only one here who thinks a world in which no one questions a policy like that is insane? We've had critical, and I mean critical, servers that have uptimes measured in years. But then again they run NetWare, or OS/400, or MVS, or.... ABW.

      Scheduled reboots are a part of good systems administration

      Yeah, scheduled, as part of a disaster recovery test once a year, maybe. Weekly scheduled reboots are a sign of shitty systems. How often do you reboot your Cisco routers?

    3. Re:machine failure by Saeed+al-Sahaf · · Score: 1
      Scheduled reboots are a part of good systems administration

      He's talking about Windows, where regular reboots are a good thing when they are planned, so you don't have regular reboots when they are NOT planned!

      --
      "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
    4. Re:machine failure by prshaw · · Score: 1

      And do those OS's test the hardware to make sure it will restart after a shutdown?

      It's more than just will the OS keep running, it is also will the hardware live through a power cycle.

    5. Re:machine failure by gkuz · · Score: 1
      it is also will the hardware live through a power cycle

      Why should it have to? If it's a critical server, your infrastructure should be such that it never power cycles. Our computer room has "power cycled" once since the facility was built in 1984. And that incident led to spending $65k in consulting engineering services alone, to determine why it happened and develop a plan to prevent it happening again. I'm not even sure what the expenditure in hardware or electrical contracting related to that was. I guess we define "critical" differently.

    6. Re:machine failure by TeraCo · · Score: 2, Insightful

      You sir, sound like a man who needs a load balanced cluster. If you're relying on individual boxes staying up to meet your SLA's, your career is a ticking timebomb.

      --
      Not Meta-modding due to apathy.
    7. Re:machine failure by Local+ID10T · · Score: 1

      It's called contingency planning.

      Asking the "What if..." questions, and coming up with an answer. Even if the odds are one in a billion or more, a good admin will have an answer. A better admin will have written the answer down for someone else in case they aren't around.

      The right answer is not to simply say that it will never happen.

      --
      "You want to know how to help your kids? Leave them the fuck alone." -George Carlin
    8. Re:machine failure by radish · · Score: 1

      On a serious note, am I the only one here who thinks a world in which no one questions a policy like that is insane?

      You're kidding right?

      We've had critical, and I mean critical, servers that have uptimes measured in years

      Well good for you. But when (not if) one of those boxes gets a hardware fault, or a power problem, what do you do? Do you have ANY confidence that it will come up properly? If you rebooted that thing every x (day/week/month whatever) then that answer would be YES, you know that if you have to bring it up it will come up.

      Number one rule of high availability systems: NO SINGLE POINTS OF FAILURE. You need a hot backup for EVERYTHING. Provided you have that, then regular reboots are not a problem, as each box cycles the others take up the slack. If you don't have that, then you don't have a reliable system, you have a timebomb.

      Hell we even do regular cable pull tests. Someone will walk through one of the server rooms and yank a cable or three, could be power, could be network, whatever. If your system is properly put together nothing (or no-one) except your monitoring systems should notice.

      --

      ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"
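A paper version of the cable-pull test can be run before anyone walks into the server room. The sketch below is only an illustration under stated assumptions: the service map and host names (`web1`, `db1`, and so on) are invented, and `single_points_of_failure` is a hypothetical helper that flags any host whose loss alone would take a whole service down.

```python
from typing import Dict, List, Set


def single_points_of_failure(services: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each host to the services that go dark if that host alone fails.

    `services` maps a service name to the set of hosts able to serve it.
    Hosts that are backed up by at least one peer do not appear in the result.
    """
    all_hosts: Set[str] = set().union(*services.values())
    findings: Dict[str, List[str]] = {}
    for host in sorted(all_hosts):
        # A service is lost when this host is the only one backing it.
        lost = sorted(name for name, backing in services.items() if backing == {host})
        if lost:
            findings[host] = lost
    return findings


if __name__ == "__main__":
    # Invented inventory, purely for illustration.
    inventory = {
        "web": {"web1", "web2"},
        "memcache": {"mc1", "mc2"},
        "db": {"db1"},  # no hot backup: pulling db1's cable is an outage
    }
    for host, lost in single_points_of_failure(inventory).items():
        print(f"{host} is a single point of failure for: {', '.join(lost)}")
```

It says nothing about whether failover actually works, which is what the real cable pull is for; it only catches the cases where there is nothing to fail over to.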

    9. Re:machine failure by Chirs · · Score: 1

      An hour a month is a lot of downtime for many companies. (Think online stores, telcos, etc.)

      This is why you have hot-standby redundant hardware (or at least warm-standby with data syncing).

      Every week you switch to the standby and reboot the previously-active from backup. After testing that it's okay, you reboot it again and bring it back into sync with the active.
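A rough sketch of that weekly cycle, assuming you already have some way to reboot a node and to ask whether it is healthy. Both are passed in as callables here because the real mechanisms are site-specific; `rolling_reboot` and the node names are hypothetical.

```python
from typing import Callable, List


def rolling_reboot(
    nodes: List[str],
    reboot: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Reboot nodes one at a time and return the ones that came back.

    Raises RuntimeError instead of continuing when a node has no healthy
    peer to cover for it, or when a rebooted node fails its health check.
    """
    done: List[str] = []
    for node in nodes:
        peers = [n for n in nodes if n != node]
        if not any(is_healthy(p) for p in peers):
            raise RuntimeError(f"refusing to reboot {node}: no healthy peer to take over")
        reboot(node)
        if not is_healthy(node):
            raise RuntimeError(f"{node} did not come back; fix it before touching the rest")
        done.append(node)
    return done


if __name__ == "__main__":
    # Stand-ins for real tooling, for illustration only.
    health = {"standby": True, "active": True}
    rebooted = rolling_reboot(
        ["standby", "active"],
        reboot=lambda node: print(f"rebooting {node}"),
        is_healthy=lambda node: health[node],
    )
    print("rebooted cleanly:", rebooted)
```

Stopping at the first failure is the point of the design: the box that did not come back gets fixed while its partner is still carrying the load.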

    10. Re:machine failure by drew · · Score: 1

      in defence of lj on this point, i don't think any of the issues they didn't already know about (mobo's that won't auto negotiate, db's that don't restart automatically) would have been uncovered by scheduled reboots. most of their problems were results of the hard shutdown. so unless you're just pulling the plug on your servers when you do a scheduled shutdown, this isn't really comparable.

      the issues that they already knew about, on the other hand, were all issues that never seemed like a big deal to them before because they were thinking in terms of one computer going down at a time, not all of them at once....

      there are a few other issues in there that i would criticize them on, but not doing scheduled reboots isn't one of them. in this case, however, i'll pass on criticism, and instead thank them for being as candid as they have been in explaining what happened and how they are going about ensuring it doesn't happen again.

      --
      If I don't put anything here, will anyone recognize me anymore?
    11. Re:machine failure by rjstanford · · Score: 1

      And that's always an option too. I didn't say anything about bringing all provided services down - just bringing a machine down. Some operations have dead time - some don't. Either way, by doing it formally all the time you're in much better shape when you have to do it for whatever reason.

      I've been in shops before where a machine has been running for a couple of years and needs an upgrade, and everyone's really fscking scared to touch it because they have no confidence that it will come back up. Doing a bounce on a regular basis at least lets you make sure that - if something's happened to the boot sequence - it was recent, and can be fixed.

      --
      You're special forces then? That's great! I just love your olympics!
    12. Re:machine failure by gkuz · · Score: 1
      You're kidding right?

      Uh, no. The part where I said "On a serious note" should have given that away. Show me where I argued against contingency planning. I'll bet I know as much about business continuity planning/disaster recovery planning as most of the people who are misreading my arguments here, and have written/rewritten my share of such plans.

      The grandparent poster said he rebooted every server every Saturday, and came in every Sunday to fix the ones that didn't come up. My argument is that that does not add one bit to system (in the large sense) reliability, it is almost always done to compensate for cheap hardware or shitty *cough*Microsoft*cough* OS'es. Note his words, that he did this every weekend "instead of during the week when a reboot of a critical machine that did not work would be much worse." This does not describe a robust, high-availability system, it describes excuses for crap. You're absolutely right, that in a properly designed system with redundant equipment, hot spares, well-designed and -tested failover mechanisms and good management, you should be able to knock out any piece of equipment or any data path at any time without it causing a crisis. But that wasn't what the grandparent was describing. He was describing a set of systems where you spend every weekend rebooting everything because you'll shit your pants if you have a problem on a Wednesday. Well some of us don't have the luxury of that much downtime. So plan and test away. But every week is just wrong.

    13. Re:machine failure by gkuz · · Score: 1
      The right answer is not to simply say that it will never happen.

      Read my other post. I never argued that, otherwise I wouldn't have a diesel generator as backup to my UPS, with two different failover mechanisms. Or 24x7 security guards with two different phone systems and printed (on paper, in a binder) emergency instructions in both of the buildings on the property, with a list of contacts ordered by distance from the building and skill set. Or.... you get the picture. The guy I was responding to wasn't talking about contingency planning, he was talking about spending every weekend compensating for crappy servers. That's not "mission-critical", that's a bunch of toys.

    14. Re:machine failure by br00tus · · Score: 1
      I have read through the responses and will explain more.

      A lot of people have dwelled on the word critical. I could have expanded this to mean both critical and important machines. Our critical machines were highly available, with hot standby redundant hardware (as were their RAID arrays and such). But we rebooted the running systems and then the standbys every week to make sure the failover would work. I do not know why people presume that a scheduled reboot means we have no failover. We have to know the failover would work!

      Some people suggested that Saturday night scheduled applications might prevent a reboot. This was true, we had a few machines that processed data from Monday morning all the way into Sunday afternoon, after which they were rebooted.

      Someone else said "Yeah, scheduled, as part of a disaster recovery test once a year, maybe. Weekly scheduled reboots are a sign of shitty systems. How often do you reboot your Cisco routers?" As I said, every single week we had servers that did not come up. Something to be expected in an environment with thousands of servers. If virtually every week there is a problem with some servers, then once a week is often enough to reboot. We would have probably done it more often except too many machines were running all day during weekdays, as well as machines which absolutely had to be working by 9AM. Rebooting on Saturday evening gave us two days to fix problems and escalate problems. As far as shitty systems - there were some things I was unhappy with, but a lot of things were done right. Some of the people in systems engineering were smarter than you and me put together I'm sure, F100 companies can afford these people. As far as how often we rebooted Cisco routers - every week. We had redundant routers and switches where needed. I worked in systems so know only a little IOS or about the network administration maintenance there, so I don't know what exceptions they made, or what happened to routing tables in memory or such.

      "Note his words, that he did this every weekend 'instead of during the week when a reboot of a critical machine that did not work would be much worse.' This does not describe a robust, high-availability system, it describes excuses for crap." I definitely disagree with this. We had Sun Enterprise 6500s on VCS with redundant RAID arrays that ran from 9AM to 5PM where one machine alone would process *billions* of dollars worth of transactions. After this, they would spend 5PM to 11PM or so processing (or offloading) this data. If we rebooted these machines at midnight, and they did not come up, they would absolutely have to be up by 9AM. This is not an excuse for crap, it would be insane to do such a reboot. And as far as crap, we had trouble from everyone - Microsoft, Sun, EMC, whoever - all of these people produced machines or software with defects, sometimes which we discovered - I don't know what your solution is to avoiding vendors who never introduce such errors, if you know of any vendors who have perfect products, I'd love to know. You do not sound like someone who has worked in an environment where critical machines need to be working by 9AM so as to do billions of dollars worth of transactions, your suggestion that if we can't reboot on midnight during a weekday our system is crap is insane.

      "If it's a critical server, your infrastructure should be such that it never power cycles." - well we are located in New York City and we had a blackout in 2003, as did much of the northeast. It started during a workday, on a Thursday, and on Friday morning electricity was still not functioning. So you are running all systems on UPS backup for 24 hours. Systems processing billions of dollars in transactions. Our systems did not power cycle, and ran on battery power for 24 hours, but your assertion that "your infrastructure should be such that it never power cycles" is ridiculous. In such a situation, I would be much happier knowing my machines had all rebooted fine days ago, instead of knowing the

    15. Re:machine failure by dbIII · · Score: 1
      Every Saturday evening, we rebooted all of our servers
      Obviously an MS windows shop getting around memory leaks.

      I feel like an amateur because I can turn all but two of my machines off on Christmas eve and not have to turn them on again until four days later. A lot of places really do have to do 365/24/7 with a lot of machines. Some computing tasks still take well over a week on reasonably serious hardware, and even if they are checkpointed every day you do not want to lose power.

      Scheduled reboots are a part of good systems administration
      Perhaps in desktop pc land, but some of us have to go months between any possible shutdown windows. You do have to do it often enough to know that your current configuration is going to come up - and you do have to know the machines backwards before you bring them down, and certainly need to know what sequence to bring them up and ensure as much as possible that each machine can come up alone.
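Knowing the bring-up sequence is the kind of thing worth writing down as data rather than keeping in someone's head. A minimal sketch, assuming Python 3.9+ for the standard `graphlib` module; the machine names and dependencies are invented, and `boot_order` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def boot_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return one valid power-on order.

    `depends_on` maps each machine to the machines that must already be
    up before it is started. Raises graphlib.CycleError if two machines
    each wait for the other, i.e. neither can come up alone.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    # Invented dependency map, for illustration only.
    dependencies = {
        "storage": set(),
        "memcache": set(),
        "db": {"storage"},
        "web": {"db", "memcache"},
    }
    print(" -> ".join(boot_order(dependencies)))

    try:
        boot_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("a and b depend on each other: neither can boot first")
```

The cycle check is arguably the more useful half: a circular dependency is exactly the case where machines that were fine one at a time cannot recover from everything going down at once.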
    16. Re:machine failure by gkuz · · Score: 1
      Our systems did not power cycle, and ran on battery power for 24 hours, but your assertion that "your infrastructure should be such that it never power cycles" is ridiculous

      Not at all ridiculous, that's what generators are for. As I said, our data center has lost power once in 20 years.

    17. Re:machine failure by snero3 · · Score: 1

      organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?

      hmm, if you need that app 24/7/365 how are you going to get time to reboot the machine? Of course if you cluster it you could but not all machines need rebooting just to function. Also DR sites are great if you have total hardware failure/power failure. I have worked many places where a shutdown is just not considered/necessary (banks, road side assistance, trading houses) and they buy their hardware to suit.

      --
      It said "windows 98 or better" so I installed Linux
  21. LOL! Kindof like when... by GillBates0 · · Score: 5, Funny
    ...when I was on AOL and I hit the X and I couldn't talk to my AOL Buddies anymore.

    And I was like OMG I shut off the internets and stuff!!1!!

    And i called the AOL helpdesk and they helped turn it back on.

    --
    An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
    1. Re:LOL! Kindof like when... by game+kid · · Score: 1

      Among the many reasons that viruses spread across Windows PCs.

      The internets...now that's a classic.

      --
      You can hold down the "B" button for continuous firing.
    2. Re:LOL! Kindof like when... by Saeed+al-Sahaf · · Score: 1

      When you tried to turn it back on, did it go, like, "beep, beep, beep"?

      --
      "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
  22. Way too thankful? by BestNicksRTaken · · Score: 1

    Is it me, or are some of those LJ users' expressions of thanks just a bit OTT?

    The way the comments go, you'd think this was a life support system or something!

    I mean, well done for getting the site back up after like 24 hours or something, but hey I'm not creaming my shorts over it!

    --
    #include <sig.h>
  23. And here by OverlordQ · · Score: 1

    everybody was blaming Internap for screwing up and running a shoddy Datacenter, when actually Internap did everything they were supposed to correctly.

    --
    Your hair look like poop, Bob! - Wanker.
    1. Re:And here by tmhsiao · · Score: 2, Interesting

      Aside from allowing an unaccompanied client access to the Big Red Button, perhaps?

      --
      "My God...It's full of ads!" -Fry, about the Internet, Futurama
  24. Also, by revery · · Score: 1

    Apparently someone "accidentally" pushed the emergency power off (which should keep all power off, even UPS)

    This also raised the all-important "Why do we even have that button?" question.

    1. Re:Also, by Scott+Laird · · Score: 2, Informative

      "Why do we even have that button?" Because it's basically required by law. Covering them with a plastic cover doesn't seem to help either--Internap did that the *last* time someone hit the EPO button in this datacenter.

    2. Re:Also, by merlin_jim · · Score: 1

      This also raised the all-important "Why do we even have that button?" question.

      Those buttons are generally maintenance devices; it's usually less of a button and more of a keyswitch though. So the guy comes in to service something, he needs to know that no power is anywhere in there, so he removes the key and keeps it in his pocket. Now he knows he's safe.

      --
      I am disrespectful to dirt! Can you see that I am serious?!
    3. Re:Also, by Peridriga · · Score: 1
      It's the law. It's also in the article.

      EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center
    4. Re:Also, by revery · · Score: 1

      I keep forgetting that this is slashdot. I should have put in my disclaimer:

      Please, do not be alarmed or reply with an explanation. This is a joke. I am joking. You have been joked with.

      Sigh...

    5. Re:Also, by RollingThunder · · Score: 1

      So that the firefighters don't start dumping water onto live power mains?

      It also helps people there stop electrical fires from massively spreading. Yes, there's already a fire, but the spread of it won't cause more shorts which can keep the fire going and/or burn out in separate areas where the lines are overheating.

    6. Re:Also, by jacksonj04 · · Score: 1

      I think you're thinking of something different. Keyswitch isolators are more commonly used in localised sites, such as labs, where occasional maintenance is performed. They usually don't cut power to hundreds of critical servers, even bypassing the usual UPS systems.

      The EPO, on the other hand, is designed to cut all the power to everything. This comes in useful in things like fires, where you really don't want to be fumbling around for keys to the isolator.

      --
      How many people can read hex if only you and dead people can read hex?
  25. Button of Doom by clinko · · Score: 1

    Maybe they should use the Button of Doom (USB) to lock the pcs down too...

  26. Wait a second! by Sialagogue · · Score: 1

    "EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center."

    "...all our DBs have redundant power supplies. we'll be plugging one side into Internap's, and the other side into our own UPS, which itself is plugged into Internap's other power grid. that way if EPO is pressed, we'll have 1-4 minutes to do a clean shutdown. (but if we do the rest of the stuff right, this step isn't even required, including having UPSes... in theory... but the UPSes would be comforting)

    Isn't that circumventing the purpose of the EPO? If there's a smoky fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?

    I'm not trying to be a smartass here, since I'm not an expert in datacenters or the purposes behind EPOs - I'm asking. . .

    --
    The only acceptable defense of scientific results is to say that they were the product of the Scientific Method.
    1. Re:Wait a second! by rah1420 · · Score: 2, Informative

      Technically, yes. I'm hoping that if LJ decides to implement such a scheme (let's call it "LEPO" for "Leisurely Emergency Power Off") that they run it past the fire marshal or the code inspectors first, who may have another opinion about how smart this idea is.

      "If it's stupid and it works, it's not stupid."

      --
      Mit der Dummheit kämpfen Götter selbst vergebens.
    2. Re:Wait a second! by Malk-a-mite · · Score: 1

      APC has a white paper on EPO available online at:
      [PDF warning]
      ftp://www.apcmedia.com/salestools/ASTE-5 T3TTT_R1_E N.pdf

      "Executive Summary
      Emergency Power Off (EPO) is the capability to power down a piece of electronic equipment
      or an entire installation from a single point by activating a push button. EPO is employed in
      many applications such as industrial, telecommunications, information technology (IT), etc.
      This white paper describes the use of EPO for protecting data centers and small IT
      equipment rooms containing UPS systems. Various applicable standards that require EPO
      are discussed. Recommended practices are suggested for the use of EPO with UPS
      systems.
      "

    3. Re:Wait a second! by reed · · Score: 1

      Isn't that circumventing the purpose of the EPO? If there's a smoky fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?



      No, you have two, one for each power system, separated by enough space that it's hard to hit them both by accident, but easy for both to be hit in an emergency.
    4. Re:Wait a second! by Sialagogue · · Score: 1

      Sorry, but you confused me.

      It seemed as though they were talking in the article about putting a separate, independent UPS system in place for their machines, one that is independent of the EPO system. It sounds to me like that would keep their machines on for four minutes even after one or both of the facility's EPO systems have been triggered, creating an electrocution danger.

      Are you suggesting that their UPS would have a separate EPO just for it? I don't think that's the case, because they specifically mentioned wanting to have a 4 minute window if the main EPO was hit. But if they did that would put them right back where they started, because although they'd have four minutes in case the main EPO got triggered, they'd still have their own brand new EPO button hanging out there just waiting to be triggered accidentally.

      Could you clarify?

      --
      The only acceptable defense of scientific results is to say that they were the product of the Scientific Method.
    5. Re:Wait a second! by psykocrime · · Score: 2, Interesting

      Isn't that circumventing the purpose of the EPO? If there's a smoky fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?

      It's not so much that the firefighters spraying water are worried about getting electrocuted via current conducting through the water itself... it's more about worrying about stumbling into a live wire that's hanging down from the ceiling, or cutting into a live wire with a vent saw, or getting caught up in one with a pike pole or something.

      Having been a firefighter for somewhere around 15 years, I'd say that I for one would not be particularly concerned about the small UPS's. That's not to say that they *couldn't* pose a danger... just that relatively speaking, they'd be a minor concern.

      --
      // TODO: Insert Cool Sig
  27. The reason why some NICs don't auto-neg by phaetonic · · Score: 2, Informative

    I have run across this issue in data centers numerous times. This still occurs with the latest hardware, no matter what vendor or OS. I have this problem on SunFire280Rs and Compaq DL360s. What it comes down to is the switch being used in the data center and the settings in the OS. Typically, data centers set their switch to forced 100-full (unless of course they are using fibre or Gb). The OS must be set to force its NICs into the same mode, or they will drop a lot of packets. Sounds like a disconnect in communications between the NOC and the customer.

    1. Re:The reason why some NICs don't auto-neg by caluml · · Score: 2, Informative

      That's what Compaq Lights-Out cards are for. Lovely things. Very handy.

  28. Re:How do you do that by *accident*???? by FudgePackinJesus · · Score: 2, Funny

    Stimpy couldn't resist "The Red, Shiny, CANDY-LIKE Button!!"

  29. 13 yo? :P by Spy+der+Mann · · Score: 3, Funny

    Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?

    By any chance, was his name "Zero Cool"?

  30. OOB console access is the answer. by Mordant · · Score: 2, Insightful

    They ought to have out-of-band (OOB) serial-console access to their servers via a terminal server for any number of reasons, including this one; if they'd implemented OOB console access, they could've sshed into the terminal server, gotten onto the consoles of the servers in question, and used ifconfig to fix the duplex issue.

    Why they don't seem to grasp this is beyond me . . . anyone running a public-facing, high-volume service should have OOB access to all servers, routers, switches, firewalls, etc. . . . it's just common sense.

  31. HAH! by rah1420 · · Score: 1

    I told you so.

    Looks like my "Newbie Operator" found hisself a new job.

    --
    Mit der Dummheit kämpfen Götter selbst vergebens.
  32. 2 accounts of the powerloss by Spazholio · · Score: 4, Funny

    The one they tell you about and the real one.

  33. No! by Saeed+al-Sahaf · · Score: 2, Insightful
    embedded NICs...

    Who in their right mind goes with the on-board NIC in a server environment?

    --
    "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
    1. Re:No! by juuri · · Score: 2, Interesting

      Who in their right mind goes with the on-board NIC in a server environment?

      Are you kidding?

      How about everyone? Regardless of PC, Sun, Alpha or whatever hardware.

      --
      --- I do not moderate.
    2. Re:No! by Saeed+al-Sahaf · · Score: 1

      Does not mean it's a good idea! Not a single machine where I work uses the on-board NIC, from servers down to desktops. And all of our machines have a two year lifecycle, tops. We generally plug in a 3Com card of some type.

      --
      "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
    3. Re:No! by SenorChuck · · Score: 2, Informative

      On all of the (actual) servers I've worked with, the onboard NICs are exactly the same hardware that you get with the server-grade PCI NICs.

      --
      A wise person makes his own decisions, a weak one obeys public opinion. -- Chinese proverb
    4. Re:No! by ihaddsl · · Score: 1

      Please explain, why not use the onboard NICs? After all, for a respectable server we're not talking about your el cheapo embedded NIC as found on many desktop motherboards, but Intel e1000s, Broadcoms and others.

      Nothing wrong with using the embedded NIC's at all.

    5. Re:No! by grommit · · Score: 1

      That's all fine and well if you've got money to burn on unnecessary things but quite a few organizations have a budget that they need to adhere to. Sure, in some horribly sadistic way, I guess I can see some glimmer of a benefit to every machine having the same type of network card but the added time/expense/hassle of cracking open each and every case to put in a network card is just unimaginable to me. You can't possibly deal with many 1U chassis very often. I don't even think most blade servers have the ability to have an extra NIC installed on them.

      I guess what I'm trying to say is that what you're talking about doing is complete nonsense IMO.

    6. Re:No! by Saeed+al-Sahaf · · Score: 1
      Life in the rackmount data center is much different than your five computer home network.

      Thanks for the insult. But I'm not talking about home.

      --
      "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
    7. Re:No! by caluml · · Score: 1

      DL360s have 2 onboard eepro100s in them. They have never failed on me.

    8. Re:No! by gl4ss · · Score: 1

      if the network chips are the same, and the onboard nics are made to be used, why not?

      i have hard time seeing you slapping pci cards into 1u servers anyways. or perhaps you slap 'em with usb2 nics....

      --
      world was created 5 seconds before this post as it is.
    9. Re:No! by darkwhite · · Score: 1

      Oh, I dunno, perhaps every single space-conscious datacenter user?

      Anything thinner than 4u either won't have space for an off-board nic or won't need it if it has a riser and is not part of a fiber network. For 99% of server uses, the benefits of an off-board nic are dubious when a halfway modern mobo is installed.

    10. Re:No! by mink · · Score: 1

      On-board PC-net chipsets have exactly 1 driver ever written for SCO Openserver (not my fault, I just have to support it) and it has issues with random TCP/IP lockups.
      My solution, since SCO and IBM were playing the blame game, was to disable it and put in a good, well supported 3-COM card.
      What sucks more is I'm just a 3rd party support guy and I had to pay for this out of my pocket or nothing would ever get done.

      --
      Well I've wrestled with reality for thirty five years doctor, and I'm happy to say I finally won out over it.
  34. Not millions of paying accounts. by EvilStein · · Score: 4, Informative

    Actually, most of the accounts don't pay. They're just freeloading whiners.

    This is a paste from the Livejournal stats:

    * Free Account: 5713743 (98.3%)
    * Early Adopter: 14220 (0.2%)
    * Paid Account: 94857 (1.6%)
    * Permanent Account: 1632 (0.0%)

    1. Re:Not millions of paying accounts. by thephotoman · · Score: 1

      However, I am among the paying whiners. Oh well...for one day without LJ entertainment (during which I was out of the house for the most part anyway, and therefore nowhere near a computer), I got two weeks more paid time for free.

      Pretty good deal in my book, I'd say.

      --
      Haec merda tauri est. Ceterum censeo Carthaginem esse delendam.
    2. Re:Not millions of paying accounts. by JoeNotCharles · · Score: 1

      How many of those free accounts are active, though?

    3. Re:Not millions of paying accounts. by EvilStein · · Score: 1

      http://www.livejournal.com/stats.bml

      It's offline right now, though. Big shock that is. heh.

    4. Re:Not millions of paying accounts. by metamatic · · Score: 1

      Where do those stats categorize people like me, who paid for accounts but had them deleted by abusive admins?

      --
      GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
  35. Calling all disk cache experts. by turm · · Score: 1

    The article cites disk caches as a source of data-loss.

    They claim that their battery-backed RAID caches were safe, but that the actual drives themselves were performing unsafe write caching. It strikes me that this is the kind of thing that's quite easy to *suggest*, but far more difficult to *prove*.

    I don't have any first-hand knowledge of disk corruption due to write-caching. Is this a real problem or just some kind of legend? Can someone who has RTFA'ed and knows about disk caches please comment?

    This is somewhat irrelevant, but I've messed with some non-battery-backed RAID setups in the past. In these situations, it always made sense to me that the controller would set the individual drives' cache policy to match its own.

  36. Its a Small World... by eieken · · Score: 1

    It seems that my company and LiveJournal host at the same datacenter here in Seattle. Looks like they got hit pretty hard when the datacenter with multiple redundant battery backups and generators had a massive cascade emergency power off, and every server in the building got shut down at once. LiveJournal got hit the hardest, they had some IDE drives on their servers, doh! Looks like even multiple redundant battery backup with power generator datacenters are still vulnerable to dumbass electricians who don't know what they are doing. The datacenter has been under construction for the past few months too, so you KNOW that had something to do with it. Looks like we'll have to put a UPS in our cabinet at the multiple redundant battery back up and power generator datacenter housing, seeing as all that backup protection doesn't mean diddly squat.

    --
    Meet new people, and kill them.
    1. Re:Its a Small World... by radish · · Score: 2, Funny

      LiveJournal got hit the hardest, they had some IDE drives on their servers, doh!

      I was unaware that SCSI drives had the ability to run without power - thanks for the info!

      --

      ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"

    2. Re:Its a Small World... by CounterZer0 · · Score: 1

      Well duh, what did you think the 'battery backup' on the RAID card was for??

  37. Blame by bsd4me · · Score: 1

    Most of the time it is Stimpy's fault. The rest of the time it is Fry's fault. I think there may be a connection...

    --

    (S(SKK)(SKK))(S(SKK)(SKK))

    1. Re:Blame by GoatPigSheep · · Score: 1

      Yup, billy west did the voices for both of them

      --
      GoatPigSheep, the 3 most important food groups
  38. No UPSes before? by iabervon · · Score: 1

    I'm surprised that they didn't have their own little UPSes to bring the system down cleanly before. Sure, the facility is supposed to provide power at all times, even if there's a power grid interruption, but that doesn't get tested very often and isn't under your control. Furthermore, in the event that the facility's power is actually going to go out, there isn't any way for the machines to find this out and shut down cleanly.

    1. Re:No UPSes before? by Nonesuch · · Score: 2, Informative
      I'm surprised that they didn't have their own little UPSes to bring the system down cleanly before. Sure, the facility is supposed to provide power at all times, even if there's a power grid interruption, but that doesn't get tested very often and isn't under your control. Furthermore, in the event that the facility's power is actually going to go out, there isn't any way for the machines to find this out and shut down cleanly.
      Unfortunately, this would defeat the purpose of the "Big Red Button", which is there to quickly and definitively cut off all power to all line-powered devices in the data center.

      When you've got an analyst smoking and twitching next to one of the racks as 110VAC courses through her veins, you don't want to have to go hunting to figure out which UPS is supplying the juice.

    2. Re:No UPSes before? by moggie_xev · · Score: 1
      all our DBs have redundant power supplies. we'll be plugging one side into Internap's, and the other side into our own UPS, which itself is plugged into Internap's other power grid. that way if EPO is pressed, we'll have 1-4 minutes to do a clean shutdown.

      I find it worrying that they are actually going to do that. If the place needs a big red button it HAS to work. If it doesn't need it don't have it.

    3. Re:No UPSes before? by gkuz · · Score: 1

      "Worrying"? This sounds like a code violation. An electrician can lose his license.

    4. Re:No UPSes before? by iabervon · · Score: 1

      The article says they're planning to have UPSes, and it's unlikely that nobody from their hosting facility reads livejournal. So what they're doing is probably okay (or at least, they'll be stopped if it's not). I suspect that the button has to kill everything that can supply a lot of power, but that single-computer 5-minute UPSes aren't a big deal. The way you fry analysts is when the batteries and generators that can run a whole data center short through the person, not with hardware you can pick up at the computer store.

  39. Switch location by cyberfunk2 · · Score: 1

    Aren't these sorts of switches usually behind little glass things that say "BREAK IN CASE OF EMERGENCY"?

    I mean I'm sure it's a big red button of some sort like the one we've got in our server room, but man, that's the sorta thing that needs a video camera aimed at it.

    Of course, if it was a malicious inside job, then there's not too much to do about it.

    I understand the REASON for an easily accessible switch like this, but would it be possible just to wire it into the fire system or something and not have a switch that just screams touch me for a thrill?

  40. Accidents happen by Migraineman · · Score: 2, Interesting

    About a decade ago, we had a series of "incidents" with the EPO button in the software lab. Shortly after a serious lab upgrade (due to constantly blowing breakers,) someone decided to test the EPO switch (it was a bit of a novelty at the time.) *click* "Cool, it works. Hey, how do you reset this thing?" Turns out you needed to have a key to reset it. It took about 4 hours to find someone who had the key. That one got replaced with the Mark II resettable switch ...

    About a month later, one of the managers was giving a prospective new-hire a tour. He got to the software lab, and started blathering about "don't ever push the red switch" as he put his finger on the switch ... *click*

    So some einstein decided that the Big Red Switch was "dangerous" and put a plexi cover over it - the same kind that goes over the thermostat control, and the same kind that has a key lock. Yep, about six months later we had a gen-you-ine emergency. One of the HP 9000/300 monitors went crispy, and was snorting smoke and sparks. One of the software folks went to hit the Big Red Button, but was somewhat nonplussed to find a locking cover over it. She took the co-located fire bottle, sheared the cover off, pressed the button, then got to use said fire bottle on the monitor.

    So the cover gets replaced again, though this time with a non-locking cover. At some point, the software server stack needed to be relocated into the corner with the Big Red Button. Another einstein discovered that it was inconvenient to slink behind the equipment rack - the cover kept bashing him in the neck or shoulder. So he removed it, thinking that accidental presses wouldn't happen because the button was obstructed by the server stack. (yep, inaccessible = useless.) Some time later, the equipment was being jockeyed for an upgrade, and one of the big SCSI cables snagged the Big Red Button and *click* ...

    All these shenanigans happened in the space of one year, and I got tired of the thrash. I measured the space between the back of the switch and the faceplate - just over 3/4 inch. I cut a horseshoe shape out of 3/4 plywood, and hung it on the switch shaft. In an emergency, it's really easy (and obvious) to remove it. Gravity keeps it there otherwise. No problems since ...

  41. LJ IS TEH LITTLE GIRL HOLE! by Turn-X+Alphonse · · Score: 1

    Maybe people will see this and realise the LJ staff are geeks, unlike most of their fanbase, so while you may be mocking their minions, they can still bring down a server looking at a single article with the rest of us slashdotters.

    --
    I like muppets.
  42. Seen that happen before... by patmandu · · Score: 1

    Way back when, I was working at an IBM site (STF) that had a boatload of mainframes and equipment on a raised floor area that was badge-access only. Every summer we'd get interns to learn the finer points of computer science by doing things like bursting printouts from the lineprinter and delivering them. Seems that the intern introductory tour had gotten a bit lax... One day a cleaning person knocked at the door to the raised floor to get let in to empty the wastebaskets. Nobody else was around, so one of the interns decided to let them in. Of course they pushed The Big Red Switch that was right next to the door. Oops. Whole floor went down...hard, about 10% of the stuff didn't come back up when the power was restored. Not fun...

    They revised the introductory tour a bit, and added a label to the EPO switch.

    (And no, it wasn't me who hit the button...)

  43. happened to us by bwindle2 · · Score: 1
    We have one of those Big Red Buttons in our datacenter (about 7 feet up on the wall, so no one could accidentally bump into it). About a year after it was installed, an electrician showed up to do something in the ceiling, and accidentally leaned his ladder up against our exposed Big Red Button.

    Needless to say, we now have a cover over our Button. Funny thing is, the electrician who installed the original button is also the guy who leaned his ladder against it.

    1. Re:happened to us by TomHandy · · Score: 1

      Wait a minute............. that's not funny at all!

  44. You, sir, are an idiot. by Anonymous Coward · · Score: 5, Informative

    Go ahead and read up on how auto-negotiation works. I'll wait...

    No, really. Go read up on it...

    Okay, since you didn't bother reading up on it, and since you claim that someone's cheeky because they *document* what happens when you misconfigure a connection, I must conclude that you, sir, are indeed an idiot.

    (To summarize for those of you who won't bother to look it up, a NIC can sense the carrier for 100, so it can differentiate 10/100. Full and half are actively negotiated by the two sides of the connection. If side 'A' is hard set to 100/full, it won't negotiate with the other side. Hearing no negotiation, side 'B' will assume the NIC doesn't support full duplex connections and fall back to half duplex. This is the proper, standardized, documented behavior. Anything else would require the psychic interface spec that *still* hasn't been finalized.)
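    For anyone who would rather see that rule as code, here is a minimal sketch of it in Python. It is a toy model only: the `Port` class, `resolve` and `duplex_mismatch` are made-up names for illustration, not any vendor's API, and it assumes a plain 10/100 link where the negotiating side can sense the speed but has to guess the duplex.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Mode = Tuple[int, bool]  # (speed in Mb/s, full duplex?)


@dataclass(frozen=True)
class Port:
    """Hypothetical port model for illustration; not a real driver API."""
    autoneg: bool             # True: negotiates; False: hard-set and silent
    speed: int = 100          # forced speed, or fastest speed supported
    full_duplex: bool = True  # forced duplex, or whether full duplex is offered


def resolve(a: Port, b: Port) -> Optional[Tuple[Mode, Mode]]:
    """Return the mode each side ends up in, or None if no link comes up."""
    if a.autoneg and b.autoneg:
        # Both sides negotiate: settle on the best mode both support.
        mode = (min(a.speed, b.speed), a.full_duplex and b.full_duplex)
        return mode, mode
    if not a.autoneg and not b.autoneg:
        # Both hard-set: each keeps its own setting; speeds must match.
        if a.speed != b.speed:
            return None
        return (a.speed, a.full_duplex), (b.speed, b.full_duplex)
    # Exactly one side is hard-set. The negotiating side hears no
    # negotiation, senses the speed from the carrier, and assumes half duplex.
    forced, auto = (b, a) if a.autoneg else (a, b)
    if forced.speed > auto.speed:
        return None  # negotiating side cannot run at the forced speed
    forced_mode = (forced.speed, forced.full_duplex)
    auto_mode = (forced.speed, False)
    return (auto_mode, forced_mode) if a.autoneg else (forced_mode, auto_mode)


def duplex_mismatch(a: Port, b: Port) -> bool:
    """True when the link comes up but the two ends disagree on duplex."""
    result = resolve(a, b)
    return result is not None and result[0][1] != result[1][1]
```

    With a switch port hard-set to 100/full and a NIC left on auto, `resolve` puts the switch at 100/full and the NIC at 100/half, which is the mismatch described above. Hard-setting both ends, or leaving both on auto, gives matching modes in this model.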

    1. Re:You, sir, are an idiot. by Undertaker43017 · · Score: 1

      OK, assuming you are correct, then why does every other NIC/switch vendor on the planet seem to have no problem with auto-neg and Cisco does?

      I have never seen this problem with Foundry switches, only Cisco.

    2. Re:You, sir, are an idiot. by ghjm · · Score: 1

      Because with most other NIC/switch brands, you can never really, truly disable autonegotiation. When you hard-code the speed and duplex, it's more of a suggestion. They still run the negotiation to find out what the other side is doing. With Cisco gear, if you hard code the settings, then it just ignores any autonegotiation, just like in the dark ages.

      I'm not defending Cisco - I think their approach is wrong. But it's at least understandable that they are taking the side of the MCSITW (most conservative sysadmin in the world), who would want to hard-code settings on all sides and ALSO find a way to ensure that autonegotiation could NEVER POSSIBLY happen, even if random electrical noise somehow convinced one device that maybe the other one was trying to negotiate with it.

      -Graham

    3. Re:You, sir, are an idiot. by sparkz · · Score: 1
      See Sunsolve. The IEEE specs are open to various interpretations; this can lead to Gb interfaces going to 100/hdx or other dodgy configs. See also Cisco's website for their take. (Also see here.)

      Cisco seem to recommend autonegotiation; Sun recommend forcing the speed/duplex.

      We've had problems in the past with Sun's "ce" fibre cards and Cisco Catalyst switches. It's not that either implementation is "wrong", the specs simply are not specific enough.
      Sorry, can't find the detail in the spec which causes the problem

      --
      Author, Shell Scripting : Expert Re
  45. BIOS Config by art3d · · Score: 1

    This reminds me of the time when I had a server that would not reboot because there wasn't a keyboard plugged in, and I did not change the setting in the BIOS.

    Brian.

  46. Big red buttons by cbr2702 · · Score: 1

    So make a little black button and know where it is, but also make an big red one that turns off the lights. That way you get to yell at little kids without much harm to your system.

    --


    This post written under Gentoo-linux with an SCO IP license.
  47. They're attention whoring by EvilStein · · Score: 1

    Plain and simple. People notice a "historical post" and they want to have their LJ face right up there in it.

    Total kissasses. I wonder how many of them are paid members vs free accounts.
    Remember, the overwhelming majority of Livejournal users are *NOT* paying customers...

    Account Types

    What type of account do people have?

    * Free Account: 5713743 (98.3%)
    * Early Adopter: 14220 (0.2%)
    * Paid Account: 94857 (1.6%)
    * Permanent Account: 1632 (0.0%)

    1. Re:They're attention whoring by mdwh2 · · Score: 1

      Plain and simple. People notice a "historical post" and they want to have their LJ face right up there in it. Total kissasses. I wonder how many of them are paid members vs free accounts.
      Remember, the overwhelming majority of Livejournal users are *NOT* paying customers...


      But how does that support your argument? If anything, I'd say it's the other way round - people are showing their thanks, and someone who uses a service for free has more reason to be grateful for the work being done for them. A paid user would expect it to have worked in the first place.

  48. No, it did not by EvilStein · · Score: 1

    They're required by law to have it. It's a building code thing. Every data center I've ever been in has one.

    Also.. "EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center."

  49. This is what happens... by MsGeek · · Score: 1

    ...when you buy crappy kit. Next time do it right.

    --
    Knowledge is power. Knowledge shared is power multiplied.
  50. You mean "The Big Red Button" by rednip · · Score: 1

    A couple of years ago, when our server room was being 'certified', one of the specific checks was "No, big red button, check". One of the guys in the group came up with a story about how someone's kid at the end of a 'tour' thought that the 'big red button' was meant to be pushed.

    --
    The force that blew the Big Bang continues to accelerate.
  51. Cabling? by redelm · · Score: 1
    OK, this _shouldn't_ apply to a good, reputable datacenter that has structured wiring to TIA/EIA-568 running gigabit.

    I most often see autoneg problems with faulty cabling (split pairs from crimps). 98% of newbies cannot get it right, and they aren't to blame because the standards are counter-intuitive unless you've worked for Ma Bell for 40+ years. I beware of all field crimps.

    OTOH, I saw one example of a Crisco Crapalyst router not wanting to play with some devices. Of course they blamed the device, but I never had any problem with interconnects or using cheap @$$ switches, so I wonder why the expensive @$$ switch gets huffy.

  52. Re:Don't forget... by vadim_t · · Score: 1

    Nonsense. I had my server up for 360 days without rebooting, with kernel 2.4. It had 360 days on the uptime counter. I only shut it down because it was too slow for the newer stuff I wanted to run.

  53. The result. by Pathetic+Coward · · Score: 1

    (a) Manager that pushed the "off" button gets promoted.
    (b) Engineers that spent their weekends getting the system back up: off to India with your jobs!

    1. Re:The result. by 6Yankee · · Score: 1

      (a) Manager that pushed the "off" button gets promoted.

      About a year back, we had a power failure at our place one morning - maybe 20 seconds' worth. All the servers fell over horribly and refused to come back up.

      When we came back in the following morning, we found everything working - and an email from our MD to the entire company, thanking the IT people for the heroic job they'd done in getting everything back up. Apparently they'd been there till 2am, poor little lambs. So that was nice.

      We later (a lot later!) found out that the reason everything had fallen over so horribly was that nobody had thought to test the UPS battery for six months, and it had quietly died. But IT were heroes.

      The MD who sent that email is now in charge of IT for our parent company's parent company.

      True story.

  54. Oblig. Homer Simpson by The_REAL_DZA · · Score: 1

    "Awww, I don't know why we even have a jug!"

    --


    This space intentionally left (almost) blank.
  55. Power-smart PCs by Pfhorrest · · Score: 1

    I'm waiting for the day that machines come built such that when the power dies, an emergency battery kicks in just long enough to dump the RAM state to a nonvolatile cache, and then when power resumes, restore the system from there. Like VirtualPC.

    Heck, having that be a user-accessible feature supported by the OS ("Save and Shutdown") would make a lot of sense too.

    --
    -Forrest Cameranesi, Geek of all Trades
    "I am Sam. Sam I am. I do not like trolls, flames, or spam."
    1. Re:Power-smart PCs by Nonesuch · · Score: 1
      I'm waiting for the day that machines come built such that when the power dies, an emergency battery kicks in just long enough to dump the RAM state to a nonvolatile cache, and then when power resumes, restore the system from there. Like VirtualPC.

      Heck, having that be a user-accessible feature supported by the OS ("Save and Shutdown") would make a lot of sense too.

      Way back in the days of Windows 3.0, there were actually ISA cards available which could provide exactly this feature.

      Some of the "mini" versions of multi-user systems from the 1980s had similar features, so when you accidentally kicked the power cord out from the wall, you didn't abend sessions for the whole department.

  56. Nah, just Google "disk write cache" by rjamestaylor · · Score: 1
    check out this article on write cache.

    Lazy writes allow for faster system operation and have only one downside: after a power-off or unexpected reset, the data still waiting to be written is lost. As bad as that sounds, the performance gains during normal system operation usually outweigh fears of this potential data loss.

    It boils down to this: if every bit of data is crucial, disable write cache. If performance is paramount and some tolerance exists for infrequent data loss due to catastrophic failures, enable it. LiveJournal evidently wanted your normal experience to be pleasantly quick rather than painfully accurate.
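
    To make the trade-off concrete, here is a minimal sketch of the two write styles at the application level. The function names and file name are made up for illustration, and this says nothing about how LiveJournal's own software writes data. Note also that `os.fsync` only asks the OS to flush its cache; a drive with its own volatile write cache may still be holding the data on some hardware.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes to path and ask the OS to push them to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer into the OS
        os.fsync(f.fileno())   # ask the OS to flush its cache for this file
    return len(data)


def lazy_write(path, data):
    """Write bytes and return at once; the OS flushes whenever it likes."""
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


if __name__ == "__main__":
    # Hypothetical file name, written into a throwaway temp directory.
    demo_dir = tempfile.mkdtemp()
    print(durable_write(os.path.join(demo_dir, "journal.dat"), b"entry"))
```

    The durable version is slower because every call waits on the disk, which is exactly the cost that lazy writes avoid.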

    --
    -- @rjamestaylor on Ello
  57. Re:How do you do that by *accident*???? by AndroidCat · · Score: 2, Funny
    Another customer in the facility accidentally pressed the EPO button, then depressed it

    I'm trying to figure out how depressing a button reverses a press. (Since the button is depressed by pressing it.) Unpressed it?

    --
    One line blog. I hear that they're called Twitters now.
  58. Make the Luser pay by sconeu · · Score: 1

    I assume that they will have the responsible luser pay for the down time plus the 2 weeks credit plus the extra hours for the staff to bring the system up.

    And what the hell was a visitor doing playing with the Big Red Button anyways?

    --
    General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
  59. Photo of the button by teneighty · · Score: 2, Insightful

    Apparently this photo is an example of the button that was "accidentally" pressed.

    I'd love to hear the explanation for this "accident".

  60. Nothing wrong with onboard NICs in "real" servers. by Nonesuch · · Score: 2, Informative
    Does not mean it's a good idea! Not a single machine where I work uses the on-board NIC, from servers down to desktops. And all of our machines have a two year lifecycle, tops. We generally plug in a 3Com card of some type.
    The smallest of the Sun 1U rackmount Sparc servers do not even have a PCI slot to take a NIC -- no expansion at all, but two on-board 100M interfaces are plenty for most data center deployments of these small boxes.
  61. It happens by boodaman · · Score: 1

    This happened to us last year in our datacenter.

    The Facilities manager had some guys in to install shelving to store toner, cables, etc.

    Our datacenter is divided into two sections, inner and outer. All CPUs, UPSs, HVAC, etc are in the inner room. The outer room is shelving, desks, CCTV (security), etc.

    The EPOs are near every door, as they should be, including the outer doors. Some guy, while installing the shelves, decided to take a little break and lean against the wall, leaning on the EPO in the process.

    It took us about 10 minutes to figure out what the hell happened, because even the generator didn't fire as it should. Meanwhile, the shelving guys were just merrily installing shelves. When asked, the guy just said he didn't realize anything was wrong and just thought it was nice that everything "got so quiet" all of a sudden.

    Like LiveJournal, we promptly installed cages over the EPO buttons.

  62. Re:Don't forget... by orderb13 · · Score: 1

    Ahh, I always wondered how one trolled. I feel better now.

  63. Damn skippy by ShatteredDream · · Score: 1

    You sure hit the nail on the head son. I am glad that you recognize that I am without bias, opinion or a tendency to propagandize for my side. My reporting is beyond reproach and I cannot even fathom how someone could insinuate that things might be to the contrary...

    1. Re:Damn skippy by metalhed77 · · Score: 1

      I hadn't actually read your weblog (I didn't find anything political, despite your sig, on the first few entries). So I figured you were one of those 'meet in the middle means I'm guaranteed to be right' kind of people. Apologies for that.

      Still, LJ doesn't deserve to be bashed. It doesn't even pollute google like the political blogs do!

      --
      Photos.
  64. heh - most of the delay was because of mysql by ignorant_newbie · · Score: 1

    MyISAM vs InnoDB: When you lose power to a MySQL db w/ MyISAM tables, the indexes are generally messed up and you need to rebuild them.

    When will people stop using this POS for production environments? Do you drive to work in your kid's toy car just because it's cheaper? No. You get the best car you can afford. Do you use FAT32 for your production servers? No. You use reiser or ffs+softupdate.


    So - if they'd spent the extra 10 minutes it takes to learn how to program a real database, they'd have come right back up with maybe 5 min of transactions needing to be replayed.
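
    For anyone stuck on MyISAM, a rough sketch of how a switch to a transactional engine might be scripted. The helper name and table names are made up for illustration. `ALTER TABLE ... ENGINE=InnoDB` is ordinary MySQL syntax, but whether any given schema converts cleanly is a separate question that this does not answer.

```python
def innodb_conversion_statements(table_names):
    """Build one ALTER TABLE ... ENGINE=InnoDB statement per table name."""
    statements = []
    for name in table_names:
        # Quote the identifier with backticks, doubling any embedded backtick.
        quoted = "`" + name.replace("`", "``") + "`"
        statements.append("ALTER TABLE " + quoted + " ENGINE=InnoDB;")
    return statements


if __name__ == "__main__":
    # Hypothetical table names, not taken from any real schema.
    for stmt in innodb_conversion_statements(["user", "log"]):
        print(stmt)
```

    This only prints the statements; running them against a live database is left to whoever has tested backups.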

    1. Re:heh - most of the delay was because of mysql by smitty45 · · Score: 1

      like Yahoo, when they use the "POS" database ?

  65. I think the real question by Kaisum · · Score: 1

    Is why do you have an emergency shut down for a bunch of journals? Dear God Jim! The hax0rz have gotten the journals! Shut them down, now!

  66. bad auto negotiation happens a lot by oogoody · · Score: 1

    > OTOH, I saw one example of a Crisco Crapalyst

    We always had problems with auto negotiation and the Crapalyst. It wasn't wiring or the workstation either. Whenever there was a performance problem it was almost always in the switch.

  67. Not just a computer issue by lrucker · · Score: 1
    This happened to a friend of mine in a manufacturing plant. They had machines that made plastic cups, and every so often the machine would jam, the operator would hit the little switch that was right next to him, clear the jam, and go on. So they hired a new operator, my friend explained the procedure, left the guy alone. Shortly thereafter, the machine jammed, new guy panicked, ran across the room to the Big Red Switch, hit it, and cut power to every machine in the plant. It took the rest of the day to get the machines all running again.

    The new guy's first day was also his last.

    1. Re:Not just a computer issue by pe1chl · · Score: 1

      The new guy's first day was also his last.

      Of course it should have been your friend's last day...

  68. Re:How do you do that by *accident*???? by hachete · · Score: 1

    Particularly one shaped like this Big Red Button

    --
    Patriotism is a virtue of the vicious
  69. Button Vs key by phorm · · Score: 1

    Simple solution to this one. At work we don't have a kill button. We have a kill key. It takes a little bit more work to "insert key" and "turn", but it's better than having incidents like this wherein somebody hits the big red button.

    Plus, you can give the key only to people that aren't idiots. With the big red button, you'll inevitably get somebody who thinks "hmm, wonder what would happen if I pushed this big red button duhhhhhh."

    1. Re:Button Vs key by dismayed · · Score: 1
      National Electrical Code requirements mandate that you have an EPO for any data center or similar facility. The "big red button" is also an informal standard so that fire fighters can quickly identify and disconnect electricity if need be.

      See this document, Understanding EPO, for more information: Understanding Emergency Power Off

    2. Re:Button Vs key by AndrewRUK · · Score: 1

      And by having it need a key, you are missing the point of it being an Emergency Power Off. A hypothetical scenario for you: Which is least bad, "Oh shit, who's got the key for the power off, Phorm's getting electrocuted! Oh, too late..." or *thunk* *whirr* "You ok?"

      If you need to cut the power in an emergency, you want to do it now, not in five minutes, when someone's found the key.

    3. Re:Button Vs key by phorm · · Score: 1

      Must be some hefty power going on then. I suppose in higher-class server rooms you might actually have a mains or something greater than what we have here. The key in our case is to shutdown for maintenance ... so I guess I missed the point of this one. We do have a mains room but I don't remember a button, it's probably a lever or breaker.

      You could still use something similar to the key by having a rotating power switch (that is, one that turns clockwise). Nothing to be inserted, it's still fast, but it's harder to accidentally bump off. Nicer than having a glass/lexan cover too, since there's nothing blocking it.

      I've seen some buildings which have a 90-degree-turning lever which works along the same lines... still seems better than a button to me but not as good as a rotating switch (most are big enough that somebody could accidentally push down on the lever).

  70. EPO by trolman · · Score: 1
    I have worked inside hundreds of data centers and have designed a few EPO systems, and I have never ever had a case of an EPO button press being used to save a life. NFPA really needs to look at this again and decide how easily a facility's power can be turned off in an emergency. Only a lot of negative feedback will make that happen. Geez, even power distribution panels have key locks to prevent tampering.

    Anyhow; I have seen EPO activations ranging from the malicious to a simple slam of the door and never once has it saved a life. So what? if a monitor smokes.

    Until then: Place the redundant part of your system in a separate room, building, or country.

  71. The BSD box PSU probably had bigger capacitors.. by EMIce · · Score: 1

    Bigger PSU capacitors = a machine less likely to crash or shut down during a brownout. I mean, after all, their job is to buffer power fluctuations. I doubt it had much to do with the OS.

  72. I was right! by Megane · · Score: 1
    Ha ha haaaa!
    All right, who did it? Who pressed the shiiiny, candy-like history eras... I mean emergency stop button?

    Or maybe I've just been reading too many episodes of BOFH lately.

    --
    #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
  73. Big Red Button stories by sparkz · · Score: 1
    I'm pleased to say that I've never (AFAIK) been the culprit, but I've been around for a few goodies - two being the classic "but I thought that was the door-release". One of these just hit the Big Red Button as someone happened to be entering, so the door opened, and the culprit wandered out of the machine room without noticing that it all went dark and quiet behind him.

    The second was a guy who was on his first day of work with us. A Big Boss came towards the machine room, so - feeling helpful - the new guy opens the door for him... or so he thought.

    My favourite story (though I wasn't there) is about some old DEC machines, which apparently had the power switch about 6" from the floor. Nobody knew why they kept crashing at night, until someone spotted a cleaner ramming a vacuum cleaner right up to the servers.

    That beats the one we had, when I used to do a lot of soak-testing of machines in a lab - I'd kick off a test on a Friday night, come back in on Monday to find the machine had rebooted. Nothing in the logs, just looked like the power had died, and returned again half an hour later. Other machines on the same power supply were fine.
    It turned out that the cleaners were unplugging the servers, so they could plug in the vacuum cleaner!

    --
    Author, Shell Scripting : Expert Re
    1. Re:Big Red Button stories by Hognoxious · · Score: 1
      It turned out that the cleaners were unplugging the servers, so they could plug in the vacuum cleaner!
      There's an urban myth about the 'bed of death' in a hospital intensive care ward - they even had a priest exorcise it - it was actually the cleaner disconnecting the respirator.
      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    2. Re:Big Red Button stories by mink · · Score: 1

      The important detail of that story is the outlet the cleaner was using was labeled VAX.

      --
      Well I've wrestled with reality for thirty five years doctor, and I'm happy to say I finally won out over it.
  74. Database corruption courtesy of MySQL by Frank+T.+Lofaro+Jr. · · Score: 1

    If they used PostgreSQL they wouldn't have had to deal with rebuilding indexes, etc.

    There are real-world reasons to use an ACID compliant database!

    --
    Just because it CAN be done, doesn't mean it should!
  75. No, You sir are the idiot. by bano · · Score: 1

    It is a commonly known fact that cisco autoneg sucks ass.

  76. Well... by Daimando · · Score: 1

    Somebody's been hiring Stooges to guard that button. Bunch of lousy idiots.

  77. Re:The BSD box PSU probably had bigger capacitors. by nbert · · Score: 1

    yes, that's how it works. I used to have a computer which I could turn off for a quarter of a second without causing it to reboot. As you might suspect, I discovered this behavior by accident.

    On a related note, a brownout isn't desirable and can cause a situation commonly called a loss of power. I really don't understand why some people here don't see the difference between powering off and an unintentional drop in voltage.
    Since it's not exceptional to have brownouts (some elevators cause them btw) there are standards for PSUs on how much they can take before they can't supply anymore. Good computer magazines simulate brownouts when they test PSUs and the cheap brands usually fail miserably.
    That's why GP's link is so funny after all - even the best OS in the world will fail if the motherboard, CPU or other peripherals don't have any power.

  78. Molly Guard by xixax · · Score: 1
    Maybe they should have put a cover over the damn button then. Morons.

    They need a Molly Guard
    --
    "Everything is adjustable, provided you have the right tools"
  79. "Things we're doing to avoid this crap..." by sakusha · · Score: 1

    I noticed one thing conspicuously absent from their list of "Things we're doing to avoid this crap in the future..." That item is:

    "Put a big sign next to the EPO button saying 'Do NOT Press This Button, it cuts off power to the entire building, it is not a light switch nor a door switch. Push this button only if your life is in danger. If your life is not in danger and you push this button, your life WILL be in danger.'"

  80. Floods are OK - but beware of clueless accountants by dbIII · · Score: 1
    Don't let your clients near the Big Red Button without an escort
    Had a new building, and 160 litres of water made it into the server room - nothing but wet carpet, but a lot of spectators turned up. The worst were an idiot chief accountant, who leaned back on the front of the server rack and against a few power switches while talking to someone, and her idiot boyfriend, employed as a "handyman", who was smoking in there until I turned up. The front door went back on the rack, and I kept a very close eye on all of the tourists - too many people had keys to that room.
  81. Emergency Power Off by DragonHawk · · Score: 1

    "Why can't the EPO button perform in the same manner as a door release for an emergency exit..."

    Emergency Power Off (EPO) switches are primarily a safety feature. If some person is being electrocuted, you hit the switch and the power dies so the person doesn't. You don't have time to wait in a situation like that. A person's life is considered more valuable than LiveJournal, which despite the name, isn't actually alive. (Insert comment about angst-ridden teen-age girls here.)

    See also:

    http://catb.org/~esr/jargon/html/S/scram-switch.html

    http://catb.org/~esr/jargon/html/B/Big-Red-Switch.html

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
  82. Re:The BSD box PSU probably had bigger capacitors. by tmasssey · · Score: 1
    I was working on a PS/2 Model 95. This was one *heck* of a server back in the day. I had my finger pressed down on the button, when I realized it was not completely shut down: it had gotten stuck and I needed to kill a process to get it to finish. But I had my finger on the button!

    So, I double-clicked the button as fast as I could. No problem! Everything stayed up.

    I have seen that a few times since then, where the good-quality computers have survived momentary power outages and the crummy ones haven't. Just another reason to buy quality hardware...

  83. Heart of Gold? by NaDrew · · Score: 1

    "Please do not press this button again!"

    --
    Vista:XPSP2::ME:98SE
  84. Re:The BSD box PSU probably had bigger capacitors. by Criton · · Score: 1

    The BSD box just had bigger caps for its PSU size, or just better-quality capacitors.
    I had the same thing happen when my PC rebooted but my old PowerMac rode through a brownout.
    The Mac's PSU was physically 2x the size of the ATX one in the PC, i.e. far bigger caps and heat sinks.

  85. Re:Don't forget... by eno2001 · · Score: 1

    Dear Mr. Rotund. How does one "roack"? Is it a sound? An action? Is it a new dance? Please elaborate on this stupidity. kthnx

    --
    -"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
  86. Re:Don't forget... by eno2001 · · Score: 1

    Now we're getting somewhere Mr. Rotund (implication that you are a fat lazy slob). I see that you must be using Windows 3.1 to operate your brain. That would explain the latency in your response. Six days. Not bad Mr. Rotund. That 16-bit single tasking brain of yours can work a little, even if it's wayyyy late. LOL!!!!111!!!! OMFG!!!11111!!!!!! I made a funny.

    Bleh.

    --
    -"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
  87. Re:Don't forget... by eno2001 · · Score: 1

    No. I think not. I *HAVE* a life in "meatspace" as you call it (I'm no geek. I'm an artist who happens to use computers). If I didn't have one, I'd be trolling like you all the time. You definitely don't have a life Rotund Bastard.

    --
    -"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
  88. Re:Don't forget... by eno2001 · · Score: 1

    Whheeee! Fun with trolling the trolls. Your ignorance is quite entertaining Mr. Penis Pudgepack.

    --
    -"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o