Slashdot Mirror


LiveJournal Blackout Analysis Online

Hakubi_Washu writes "LiveJournal has posted their official analysis of what happened last Friday. Apparently someone "accidentally" pushed the emergency power off button (which is supposed to cut all power, even UPS), reset it and ran off. They had trouble coming back up quickly because of "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", machines that were not fully rebooted for analysis reasons, and a few other issues."

333 comments

  1. Lesser OS... by Anonymous Coward · · Score: 5, Funny


    They should be using OpenBSD. It can run right through power failures

    1. Re:Lesser OS... by One+of+the+abnormals · · Score: 0

      RTFA. It wasn't a power failure.

      --

      2b || !2b =?
    2. Re:Lesser OS... by Anonymous Coward · · Score: 0

      RTFA, a dummy hit the EPO button and the place died.

    3. Re:Lesser OS... by Anonymous Coward · · Score: 0

      That's because OpenBSD is an undead operating system.

    4. Re:Lesser OS... by Neil+Blender · · Score: 0

      RTFA, a dummy hit the EPO button and the place died.

      So, uh, when you turn off a light, do you think of it as a power failure? It was most certainly not a power failure. Nothing actually failed. In fact, everything worked perfectly.

    5. Re:Lesser OS... by ergo98 · · Score: 2, Informative

      Power failed to get to the computers. It was a power failure - whether it was the electric grid, the UPS blowing up, or all the wires in the wall, or in this case the EPO button, it's a bloody power failure.

    6. Re:Lesser OS... by ghjm · · Score: 1, Insightful

      I'm about to leave work and go home. When I do, I plan to hit the so-called "power button." When I do this, code will execute on the box that flushes cache to disk and then commands the power supply to interrupt most (but not all) of its DC output. At that time, my computer will be in a state commonly referred to as "off."

      By your logic, I can claim that my computer is down due to a power failure.

      Perhaps you would complain: But power was getting to the computer.

      So what about the situation where I accidentally hit the (again, so-called) "off" button on my UPS. In this case the computer will go down due to a lack of power getting to it. However, the power is still on at the wall socket - I have just chosen (unintentionally) to interrupt the supply to the computer. Is this a power failure?

      I don't think you can call it a power failure if power is abundantly available, and you just don't choose to make use of it.

      -Graham

    7. Re:Lesser OS... by p5 · · Score: 0

      They should be using OpenBSD.

      This may be half the solution, they should be considering Mark Beihoffer a.k.a dragonfly_blue as their new Senior Systems Engineer.

    8. Re:Lesser OS... by Anonymous Coward · · Score: 2, Insightful

      Nice use of intentional confusion of the issue to make an argument there.

      You say you 'choose (unintentionally)'. I'd say that if you accidentally hit your UPS or computer on/off switch, you are unintentionally causing a power failure.

      You're putting a break in the circuit. By your logic, if I hit a tree in my car, knocking it over into a powerline, killing power to my entire neighborhood, that's not a 'power failure' because there's power available to the break in the lines, and I 'chose (unintentionally)' for my entire neighborhood to not make use of the power. As far as the people with space in that colo are concerned, the supply of power failed (to their rack, their room, whatever) - in other words, a power failure.

      If I have machines in a colo, and power to those machines drops in an unscheduled manner, that's a power failure from my perspective, root cause be damned.

    9. Re:Lesser OS... by jacksonj04 · · Score: 0

      I have to disagree. A failure by definition is something working how it is not intended - the whole point of an EPO is to take out the power regardless of anything else.

      --
      How many people can read hex if only you and dead people can read hex?
    10. Re:Lesser OS... by ergo98 · · Score: 1

      Right, but we're talking from the perspective of the people whose computers suddenly had no power - the infrastructure suddenly stopped providing power, so from their perspective it is a power failure. The NorthEast blackout of 2003 was a power failure, even though the protection circuits were doing exactly what they were supposed to do by shutting off the grid.

    11. Re:Lesser OS... by Aethel · · Score: 1

      as always, everything is relative with respect to your point of view, spooky at-a-distance action be damned!

    12. Re:Lesser OS... by Anonymous Coward · · Score: 0

      Yikes...DON'T GO UP A DIRECTORY. have to undergo intensive therapy now.

    13. Re:Lesser OS... by Nykon · · Score: 1

      power yes..
      failure no..

      failure would indicate that it did not work the way it was supposed to.

      Someone hit the emergency kill button. The power went off as it should when that happens. Hardly a failure. A power outage is not the same as someone turning the power off.

      --
      "It's better to be a pirate then join the Navy"
    14. Re:Lesser OS... by ghjm · · Score: 1

      So, if you hit the power button on the UPS with your elbow, would it be okay to tell your boss "the systems went down because of a power failure?"

    15. Re:Lesser OS... by haruchai · · Score: 1

      Yes. Do you really want to tell him the truth?

      --
      Pain is merely failure leaving the body
  2. Good job! by Uriel · · Score: 1

    What they do makes me happy when I think how simple my setup is by comparison.

  3. The less we've learned... by geoffspear · · Score: 4, Funny

    Don't let your clients near the Big Red Button without an escort. Preferably an armed one.

    --
    Don't blame me; I'm never given mod points.
    1. Re:The less we've learned... by stupidfoo · · Score: 1

      And don't have it red. Have it black. People, especially kids, love pushing that damn red button, no matter how many warning signs you put around it.

    2. Re:The less we've learned... by Macrolord · · Score: 1

      One time where I work, the data center took a complete power outage during the annual fire suppression system test. As it turns out, the bypass switches that would have prevented the fire system from automatically cutting the power weren't flipped...

      Turns out, due to cost cutting, we had laid off the guys who knew how this stuff worked. Nothing like bringing a very large Teradata, hundreds of servers, and mainframes all back online. Glad I wasn't at the data center that day!

    3. Re:The less we've learned... by Lispy · · Score: 1

      Offtopic but hey:
      The red button will eternally be linked in my brain to the one in the pool of ManiacMansion that reads "Do not push" and which everyone I know pushed anyway. ,-)

    4. Re:The less we've learned... by cdrudge · · Score: 1

      Code may dictate that it needs to be red.

    5. Re:The less we've learned... by Chris+Mattern · · Score: 2, Funny

      "The beautiful shiny button! The jolly, candy-like button!"

      Chris Mattern

    6. Re:The less we've learned... by puhuri · · Score: 1

      Some years ago we finally got a UPS for our laboratory; it was installed, the technician tested the setup and we were satisfied. Some half a year later came the first blackout; we then went to the laboratory to see all systems running... all was black and silent! We found out that the UPS had been bypassed all the time; the technician had not turned it back on after testing.

    7. Re:The less we've learned... by Anonymous Coward · · Score: 0

      Some years ago we finally got a UPS for our laboratory; it was installed, the technician tested the setup and we were satisfied. Some half a year later came the first blackout; we then went to the laboratory to see all systems running... all was black and silent! We found out that the UPS had been bypassed all the time; the technician had not turned it back on after testing.

      Obviously it wasn't tested properly. One place I worked installed a UPS and wanted to stop after doing a similar test. After convincing the CEO that a proper test involved killing the power at the mains, we killed the power at the mains and all the critical systems kept running but they were useless...A critical router had not been connected to a UPS-protected circuit.

    8. Re:The less we've learned... by DrWho520 · · Score: 1

      "Who puts a 'Destroy the Engines' button on a ship, anyway!?!" - Kree, The Kids Next Door

      --
      The cancel button is your friend. Do not hesitate to use it.
    9. Re:The less we've learned... by geminidomino · · Score: 2, Funny

      Evil overlord list item #9: I will not include a self-destruct mechanism unless absolutely necessary. If it is necessary, it will not be a large red button labelled "Danger: Do Not Push". The big red button marked "Do Not Push" will instead trigger a spray of bullets on anyone stupid enough to disregard it. Similarly, the ON/OFF switch will not clearly be labelled as such.

    10. Re:The less we've learned... by Trejkaz · · Score: 1

      An armed escort... man, that's hot.

      --
      Karma: It's all a bunch of tree-huggin' hippy crap!
    11. Re:The less we've learned... by Zen · · Score: 1

      Your comment about making sure everyone has an armed escort struck me as pretty funny. Here's our situation:

      We recently had this same problem at my employer's state of the art datacenter. I work for a large (multi-state, a name the vast majority in the US knows) health care provider. One of our security guards was teaching a new security guard the ropes, and showed him the emergency button. Now, if we had any other type of power failure that mysteriously killed both our A and B power feeds, our emergency generators would immediately kick in and not even the lights would flicker. But the emergency button obviously has to cut everything. So he actually said "Now, whatever you do, don't do this" while pointing to the button, and hit it by mistake. However, it seems we fared much better than LiveJournal in that it only took about 10 hours to get everything back up and fully tested. A couple parts failed that we had full onsite support contracts for, but nothing major (including multiple mainframes that went down hard!) The good news is that now we know that all the disaster recovery drills we've done in the past 5 years actually work. It did make the newspaper though, and marketing had to call all of our large clients individually and apologize.

    12. Re:The less we've learned... by Anonymous Coward · · Score: 0

      Can you put green polarized glass casing over it, to make it appear black until you lift the casing?

    13. Re:The less we've learned... by Bi()hazard · · Score: 1

      Excellent advice. In my underground volcano lair, the real self-destruct button was camouflaged at the bottom of a murky pool full of angry crocodiles. They were angry because the henchmen were instructed to throw things at them on a regular basis. Of course, I moved it when we upgraded to cyborg crocodiles with lasers and fire breath, which operate better on land (water puts out the fire) and are naturally pissed off without having to be annoyed manually.

      And yes, instead of on/off switches on all my engines of destruction and vehicles, there's fucking keys. So you can't just jump into any parked tank and go on a rampage.

    14. Re:The less we've learned... by geminidomino · · Score: 1

      Keys! Brilliant!!

  4. Great by Anonymous Coward · · Score: 0

    Now we've got blogging about blogging. Yay for the rise of meta-blogging.

  5. faulty mobo's by Lifthrasir · · Score: 5, Interesting

    so, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?

    --
    No beer, no TV make Lifthrasir something something
    1. Re:faulty mobo's by wankledot · · Score: 1
      The solution is even funnier...
      To get them back up they need somebody at the NOC to plug them into a compatible switch, let them autonego, then switch them to their real switch.
      This is how a company with Millions of paying accounts runs its data center, and they even knew about the problem!
      --
      My sig is blank, I typed this by hand.
    2. Re:faulty mobo's by BridgeBum · · Score: 1

      Maybe faulty, maybe not. There are a lot of incompatibilities and general "flakiness" with some network auto-negotiation interactions. It's a fairly standard precaution in large network environments that servers should not rely on auto-negotiate and instead should have their speed and duplex settings hard-coded.

      In reality, the only places where auto-negotiation is important are mobile devices (laptops) which may connect to a variety of network connection types or for the home user "plug-and-play" market. Major datacenter infrastructure isn't the place for auto-negotiating low level network settings any more than it is appropriate to have dynamic IP addressing via DHCP.

      --
      My UID is the product of 2 primes.
    3. Re:faulty mobo's by Lifthrasir · · Score: 1
      Yes, I know that, but these NICs couldn't even be set to the proper speed/duplex.

      From TFA:

      We have 9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly. They only work with certain switches, so they reboot fine, but then their gigabit network comes up at 100 half duplex or something that doesn't work. To get them back up they need somebody at the NOC to plug them into a compatible switch, let them autonego, then switch them to their real switch. Setting the speed/duplex settings on both the host and/or switch themselves doesn't work....
      --
      No beer, no TV make Lifthrasir something something
    4. Re:faulty mobo's by tchuladdiass · · Score: 1

      Also, you should never rely on autonegotiation -- there are no standards. That's what ethtool or mii-tool is for, or at a minimum specify speed/duplex in your /etc/modules.conf file.
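To make the ethtool suggestion above concrete, here is a minimal sketch in Python that only builds the command line for forcing speed and duplex with autonegotiation off. The helper name `build_ethtool_args` and the interface name `eth0` are assumptions for illustration; the `-s`, `speed`, `duplex` and `autoneg` keywords are standard ethtool options. Note that TFA, as quoted elsewhere in this thread, says forcing these settings did not help on LiveJournal's nine problem boards.

```python
from typing import List

VALID_DUPLEX = {"half", "full"}


def build_ethtool_args(interface: str, speed_mbps: int, duplex: str) -> List[str]:
    """Return the argv for an ethtool call that pins speed/duplex.

    Hypothetical helper: it only assembles the arguments and does not
    run anything, so it is safe to call on any machine.
    """
    if duplex not in VALID_DUPLEX:
        raise ValueError(f"duplex must be one of {sorted(VALID_DUPLEX)}")
    if speed_mbps <= 0:
        raise ValueError("speed_mbps must be positive")
    return [
        "ethtool", "-s", interface,
        "speed", str(speed_mbps),
        "duplex", duplex,
        "autoneg", "off",
    ]


if __name__ == "__main__":
    # Print the command instead of executing it.
    print(" ".join(build_ethtool_args("eth0", 100, "full")))
```

To actually apply the setting you would hand the list to `subprocess.run(..., check=True)` as root, and the switch port normally has to be pinned to the same values or you end up with a duplex mismatch.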

    5. Re:faulty mobo's by Richard_at_work · · Score: 0, Flamebait

      No, the PREVIOUS LIVEJOURNAL OWNERS knew about the faulty motherboards, and didnt do anything to fix them. I doubt the new owners had been told about it until the outage.

    6. Re:faulty mobo's by ignorant_newbie · · Score: 1

      >We have 9 machines with faulty motherboards
      >with embedded NIC

      so basically, they're using shite hardware because they're too cheap. bet they've noticed by now that it costs less to use good hardware than to try to fix it later when something goes wrong

    7. Re:faulty mobo's by Anonymous Coward · · Score: 0

      You also should always read the article before posting. It says, "Setting the speed/duplex settings on both the host and/or switch themselves doesn't work.... most annoying."

    8. Re:faulty mobo's by Surye · · Score: 1

      The same team still works for SixApart now.

    9. Re:faulty mobo's by dbIII · · Score: 1
      so, they had faulty motherboards, knew about it, and didn't do anything to fix it before they had a major outage?
      If it's something that you need all the time, and only has problems on boot you get to it when you can organise a shutdown - I've had to leave dead disks in machines for months before I can bring the thing down and pull it apart. To put things in perspective major bits of plant - like power station units, typically run for three years between shutdowns, and relatively major faults may persist for a couple of years before they are dealt with.
    10. Re:faulty mobo's by Lifthrasir · · Score: 1
      well in this case it was a 9 computer cluster that was supposed to be redundant, automatic failover and what not.

      they could have taken one machine down, added a NIC, turned it back on. it would have taken 30 minutes (being really generous here).

      If they did one machine at a time, they wouldn't have noticed any downtime, but would have prevented this from happening.
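The one-machine-at-a-time idea can be sketched as a small loop in Python. Everything here is hypothetical: the `take_down`, `repair` and `bring_up` callbacks stand in for whatever the real cluster tooling would do, and the sketch assumes the remaining nodes can carry the load while one is out.

```python
from typing import Callable, Iterable, List


def rolling_maintenance(
    nodes: Iterable[str],
    take_down: Callable[[str], None],
    repair: Callable[[str], None],
    bring_up: Callable[[str], bool],
) -> List[str]:
    """Service nodes strictly one at a time.

    Stops at the first node that does not come back, so at most one
    node is ever out of the cluster because of this procedure.
    """
    done: List[str] = []
    for node in nodes:
        take_down(node)         # e.g. drain traffic and power off
        repair(node)            # e.g. add a PCI NIC
        if not bring_up(node):  # must rejoin before touching the next one
            raise RuntimeError(f"{node} did not come back; halting rollout")
        done.append(node)
    return done
```

The point of halting on the first failure is that a botched repair costs you one node, not nine.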

      And in regards to computers in plant situations with faults for years (granted, i don't know the specifics of your situation, and i'm not trying to flame) - i'd much rather have some planned downtime to fix it than be called out in the middle of the night to fix it.

      --
      No beer, no TV make Lifthrasir something something
    11. Re:faulty mobo's by Jeff+Mahoney · · Score: 1

      The motherboards were faulty with respect to network connectivity on boot, not stability.

      The end result of replacing them before a major outage? Having another outage.

    12. Re:faulty mobo's by dbIII · · Score: 1
      plant situations with faults for years
      They are called planned shutdowns, and usually happen every three years. If something breaks and lets the steam out, you have an unplanned shutdown, and a whole lot of queued tasks get done during the duration of fixing the main fault. This situation is common in production environments - for instance an oil heater in a refinery may lose all of its temperature monitoring gear in the first couple of months, so it's just run conservatively for the next three years with visual checks each day to see how red the pipes are (optical pyrometer as well as just looking at them) and work out if it can handle the heat.

      There are a lot of computer systems that need to run like industrial plant now - turn them off and production completely stops.

      well in this case it was a 9 computer cluster that was supposed to be redundant, automatic failover and what not.
      These things happen every now and again. I know of one backup generator made from a fighter jet engine that starts up and runs for a couple of minutes as a test every Sunday without faults. Nearly every time it has been needed to actually be a backup generator, a different fault each time has prevented it starting. It's likely that some condition outside of the usual test clobbered the cluster as well.
    13. Re:faulty mobo's by Lifthrasir · · Score: 1

      But this is a cluster, so they could have just taken one node down at a time and replaced the motherboards or disabled the on board NIC and added a PCI NIC.

      --
      No beer, no TV make Lifthrasir something something
    14. Re:faulty mobo's by dbIII · · Score: 1
      And in regards to computers in plant situations with faults for years
      To clarify things, I meant bits of plant with major faults for years. Computers can often be taken off line while being replaced temporarily by another system, but bits of pipe in a flame that isn't going out for three years is another story. If the pipe is getting thin or losing strength due to heat damage, you can run at reduced capacity if necessary or possible until the next scheduled shutdown.
    15. Re:faulty mobo's by zonker · · Score: 0

      ehhh, probably not if they are anything like companies i've worked for. you know, small time operations like uhhh, banks...

  6. 503 pages by Folmer · · Score: 1

    Now, if slashdot could fix their servers, so we wouldn't get those annoying 503 sites..
    I haven't seen them that much lately, but then I haven't been online that much either...

    1. Re:503 pages by Rosco+P.+Coltrane · · Score: 1

      Now, if slashdot could fix their servers, so we wouldn't get those annoying 503 sites..

      You get 503 sites? I only reach one at slashdot.org

      Then again, you're a subscriber. Who knows what goodies you lucky few get here...

      --
      "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
    2. Re:503 pages by vagabond_gr · · Score: 1

      So you're complaining about the 503's that you don't see, basically because you're rarely online?

  7. Oppsie by darkstar949 · · Score: 5, Funny

    "I'll just set my coffee down here, and..."
    ...
    "Oppsie, I hope that button wasn't anything important."

    1. Re:Oppsie by Gary+Destruction · · Score: 1

      You mean that big red button wasn't the coffee maker? Oops.

    2. Re:Oppsie by superpulpsicle · · Score: 1

      You mean that Staples commercial with the big EASY button is not a real product? I was waiting for it to go on sale.

  8. History Eraser Button by bsd4me · · Score: 4, Funny

    Ah, the famous History Eraser Button rears its ugly head. I think that everyone who has worked in a large datacenter or lab environment with one of these has a story to tell...

    --

    (S(SKK)(SKK))(S(SKK)(SKK))

    1. Re:History Eraser Button by stratjakt · · Score: 1

      Are you saying this was Stimpys fault?

      You idiot! My god man, do you know what you're saying?

      --
      I don't need no instructions to know how to rock!!!!
    2. Re:History Eraser Button by scribblej · · Score: 4, Interesting

      I'll go right ahead then. I was consulting for State Farm installing machines that were supposed to help with the Y2K problem. Hell if I know, I just got the box, went to the site, installed it and made sure it was working. Easy. I had five to do a week, and would be done by Tuesday morning and helping out other contractors on similar projects.

      I'll never forget my visit to the State Farm DSO in Detroit, MI. I'd just physically installed the new machine, at the bottom of a rack, and stood up.

      Stood up putting my shoulder right into the unprotected "History Eraser Button" on the wall. The screams of the employees working in the datacenter could be heard all the way back home in Chicago, I've no doubt.

      Then it turns out the fuses which will reset the systems in the datacenter are in a locked cabinet.

      Then it turns out no one on site has a key.

      Fortunately, I found that the cabinet will pop open if you kick it hard enough. Hey, I was panicking, okay?

      And get this. After it was all over and I realized I probably wouldn't get killed by anyone... they told me "It's okay, this happens all the time. The guy installing the A/C unit last week did it too."

      Maybe they should have put a cover over the damn button then. Morons.

    3. Re:History Eraser Button by Anonymous Coward · · Score: 0

      Can you please enlighten me? What does the history erase button do, exactly?

    4. Re:History Eraser Button by Local+ID10T · · Score: 2, Funny
      I was consulting for State Farm installing machines that were supposed to help with the Y2K problem.


      Hey! I worked that project too... it was fun, but mind-numbing. They actually sent me to New Orleans for an install on Fat Tuesday.

      Mardi Gras on an expense account :)

      --
      "You want to know how to help your kids? Leave them the fuck alone." -George Carlin
    5. Re:History Eraser Button by Aaden42 · · Score: 2, Funny

      Nobody remembers!

    6. Re:History Eraser Button by Anonymous Coward · · Score: 0

      Remembers what?

    7. Re:History Eraser Button by Anonymous Coward · · Score: 1, Insightful

      It's from an episode of Ren & Stimpy, "Space Madness".

      [Button room] REN: Now, listen, Cadet. I've got a JOB for you. See this button? (Stimpy reaches for the button) DON'T TOUCH IT! It's the HISTORY ERASER button, you FOOL!
      STIMPY: So what'll happen?
      REN: That's just IT! We don't KNOW! Maayyybeee something bad?...Mayyybeee something good! I guess we'll never know! 'Cause you're going to guard it! You won't TOUCH it, will you?
      (Stimpy salutes. Ren leaves.)
      REN: Hehhh...hehhhh...hehhhh...hehhhh...
      (Stimpy marches back and forth, staring at the button.)
      ANNOUNCER: Oh, how long can trusty Cadet Stimpy hold out? How can he possibly resist the diabolical urge to push the button that could erase his very existence? Will his tortured mind give in to its uncontrollable desires?
      (Announcer grabs Stimpy, forces him closer to the button.) Can he resist the temptation to push the button that, even now, beckons him even closer? Will he succumb to the maddening urge to eradicate history? At the MERE...PUSH...of a SINGLE...BUTTON! The beeyootiful SHINY button! The jolly CANDY-LIKE button! Will he hold out, folks? CAN he hold out?
      STIMPY: NO I CAN'T!!!EEEEEYAAAHHHH! (pushes button)
      (Alarms go off. Ren, Stimpy, and Announcer stand around table with button.)
      ANNOUNCER: Tune in next week, as...
      (Flash, explosion as they all disappear.)
      We see the Ren and Stimpy logo, Ren and Stimpy also flash and disappear.

    8. Re:History Eraser Button by anticypher · · Score: 1

      Maybe they should have put a cover over the damn button then

      If I ever catch anyone putting a cover over a critical piece of safety equipment, like an Emergency Power Cutoff switch, I'll put their head on a pole in front of the data centre as a warning to others.

      Never fuck with safety equipment. It would be better to not have kit directly next to the big red button, leaving it a nice clear space so in an emergency someone could reach it and maybe save your life.

      the AC

      Yeah, I got a 208 volt jolt at RedBus today. Fucking check your hot and neutral orientation, shitheads!

      --
      Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
    9. Re:History Eraser Button by Trifthen · · Score: 1

      And what's wrong with just a plastic cover on a hinge that keeps someone from just pressing the button on accident?

      An Emergency! Oh N0es!

      1.) Lift cover.

      2.) Press Button.

      3.) Profit?

      --
      Read: Rabbit Rue - Free serial nove
    10. Re:History Eraser Button by Anonymous Coward · · Score: 0

      Clearly this appears to have been a malicious act at LJ: opening the safety cage, hitting the button, pressing reset and leaving. Can't really be an 'accident'.

      The clear plastic should be sealed so a broken seal reveals it's been activated.

      Why can't the EPO button perform in the same manner as a door release for an emergency exit in a public building: press to sound alarm and wait fifteen seconds for door to release. A lot of kill commands can be issued and processes can die nicely.

    11. Re:History Eraser Button by trolman · · Score: 1
      Oh Yea? I was in Charlottesville and it was cold.

      What was your user name? BOFH
      http://members.iinet.net.au/~bofh/

    12. Re:History Eraser Button by martinX · · Score: 1

      What we need is a Big Red Button, uncovered. If you push it, a Big Blue Button pops out with a sign above it that says "Are you sure you want to activate the Big Red Button? Push the Big Blue Button for OK."

      --
      When they came for the communists, I said "He's next door. Take him away. Goddam commies."
    13. Re:History Eraser Button by Anonymous Coward · · Score: 0

      The clear plastic should be sealed so a broken seal reveals its been activated.

      They already know it's been activated.

      Why can't the EPO button perform in the same manner as a door release for an emergency exit in a public building: press to sound alarm and wait fifteen seconds for door to release. Alot of kill commands can be issued and processes can die nicely.

      Because you don't want to wait 15 seconds while everything catches fire and explodes before the power shuts off. The only real solution is to use failure-resistant hardware that can deal gracefully with emergency shutoffs.

    14. Re:History Eraser Button by Anonymous Coward · · Score: 0

      Now that's a good neighbor! ;)

    15. Re:History Eraser Button by jonwil · · Score: 1

      Put a cover on it like a fire alarm button has.
      They can be pressed very fast when you need to but are very hard to just bump accidentally.

    16. Re:History Eraser Button by cgenman · · Score: 2, Funny

      If I ever catch anyone putting a cover over a critical piece of safety equipment, like an Emergency Power Cutoff switch, I'll put their head on a pole in front of the data centre as a warning to others.

      You of all people should realize that putting someone's head on a pole in front of a data centre is dangerous. For one, it tends to become a disease vector, as for some mysterious reason everyone feels the need to touch it. Rats are usually attracted to the smell, and you know how rats wreak havoc on ethernet cables, especially the rats of the dead. Furthermore, putting the dead on a spike on your front lawn tends to attract ghosts, which are no problem if you're running a secure OS but everyone knows what havoc ghosts can wreak on a Windows Server 2000 installation.

      On the other hand, how would putting a clear, hinged plastic cover over an emergency power kill switch be likely to kill someone? I know people panic in desperate situations, but if someone can't get a plastic hinged cover off of a button quickly during an emergency they shouldn't be trusted with electricity.

      There are many ways you could safely "fuck with" the safety equipment while making it less likely to take down your entire network. You could make it a handle that had to be pulled down, like most fire alarms are. It could be "Break flimsy plastic and press button to kill power." Heck, it could just be recessed, like many good last-resort buttons are.

  9. Where was the switch? by SoupGuru · · Score: 1

    Did they put it right next to the light switches? Shouldn't something like that be locked away in a server room or at least in a place where it can be under supervision?

    --
    What doesn't kill you only delays the inevitable
    1. Re:Where was the switch? by grub · · Score: 2, Informative


      They usually are in a server room. They're for emergencies. Ours have red cages around them and a BIG RED SIGN, you have to basically punch them.

      --
      Trolling is a art,
    2. Re:Where was the switch? by crimoid · · Score: 1

      Typically these types of devices are just inside the door to the rooms that they cut off. This way Fire & Emergency personnel can get to them quickly and easily.

      Generally the buttons themselves are behind plexiglass lids that easily flop up or behind breakable glass.

    3. Re:Where was the switch? by Shkuey · · Score: 1

      Locking up an emergency button defeats the purpose. They'll typically have a plastic cover you need to lift or some other mechanism to make sure it can't be done by accident. If the person who did it won't own up, they should have it fingerprinted. I mean... how many other people have pressed it? Should be fairly easy.

    4. Re:Where was the switch? by grub · · Score: 1


      Locking up an emergency button defeats the purpose.

      So I shouldn't have my fire extinguishers under lock and key? Whoops... ;)

      --
      Trolling is a art,
    5. Re:Where was the switch? by vasqzr · · Score: 1


      Locking up an emergency button defeats the purpose. They'll typically have a plastic cover you need to lift or some other mechanism to make sure it can't be done by accident. If the person who did it won't own up, they should have it fingerprinted. I mean... how many other people have pressed it? Should be fairly easy.

      Video surveillance camera, anyone?

    6. Re:Where was the switch? by Anonymous Coward · · Score: 0

      Our company has servers at Internap. These switches are inside the third of four doors you must pass through with a security card. They have a plexiglass guard over them; you have to reach under to hit the button. Internap has said that a tenant accidentally hit the button (I'd post all the emails from Internap, but the junk filter rejects them.)

      My theory is this: When you exit the first door heading out, you push a button under a similar plexiglass guard to open the door. At all other doors you swipe your card. The next door on your way out is where the EPO button is. I think some spaced out individual got to the third door on his way out and hit the EPO switch thinking it opened the door.

    7. Re:Where was the switch? by Anonymous Coward · · Score: 0

      We had one of those next to the exit switch. It was an old RF shield room, so a motor extended and retracted the locking bolts; with no power, the bolts don't move. The 3rd shift security guard hit the power kill button and suddenly found himself locked in a dark shielded room: radio didn't work, cell phone didn't work, couldn't see anything. An engineer working late came to his rescue and talked him through the manual escape process.

    8. Re:Where was the switch? by bsd4me · · Score: 1

      These switches are generally big round buttons about 2" in diameter, and almost always made out of bright red plastic. On top of that, the button takes some force to depress and many facilities place a hinged, clear plexiglass box over them to prevent accidental use. It is pretty hard to mistake one for a normal light switch.

      --

      (S(SKK)(SKK))(S(SKK)(SKK))

    9. Re:Where was the switch? by irc.goatse.cx+troll · · Score: 1

      Or your gun, unless you want to ask that kind man with a knife to wait while you dig out the key.

      --
      Pain lasts, kid. Its how you know you're alive. Sometimes I think this growing up thing is just pain management-TheMaxx
    10. Re:Where was the switch? by Anonymous Coward · · Score: 0

      It was connected to the circuit with the rest of the computers and was a new-fangled one that stored its video in RAM. :D

    11. Re:Where was the switch? by cypher_6502 · · Score: 1

      In my old company, the 'router' guy mistook the 'big red' power reset button by the door for the light switch. He thought he would do the equipment a favor by turning off the lights, so the room would run cooler. Within five minutes of him leaving the room, HP OpenView started to barrage everyone on the network staff with a list of 'down servers'. Since then, the 'big red' button has been enclosed inside a plastic box. As for the router guy, he was pretty hostile to everyone and wasn't a team player. You'd think he would have been fired or reprimanded. Instead, he lucked out as we had a corporate consolidation on the regional scale, and he was promoted to manage the new regional WAN group.

    12. Re:Where was the switch? by Anonymous Coward · · Score: 0

      At a high school I've worked at, the Big Red Button, which was indeed red and large, was located immediately inside the doorway to a computer lab on the far wing of the building where there were fewer adults to supervise the teenagers when the lab wasn't in use.

      The button had no shield, and it was mounted about a foot below eye-level for a freshman. Do conditions get more perfect?

      Eventually, the administration got so fed up they had an electrician disconnect the killswitch. The Big Red Button is still there but does nothing.

    13. Re:Where was the switch? by buckeyeguy · · Score: 1
      Sounds like the Dilbert principle there... promote the guy to a position where he won't be near the Big Red Button.

      Seems like a LOT of people have these stories; I've had mine for awhile; after moving our organization's computer room (could hardly call it a data center at the time), and thankfully still during the buildout phase, the phone guy (one of Ameritech's geniuses, fwiw) pressed the button, thinking it was the handicap-open-the-door button. We put a transparent plastic cover over it after that.

      --
      I'd have a personalized plate on my car, but "toxic bachelor" won't fit into 7 letters.
    14. Re:Where was the switch? by galaxy300 · · Score: 1

      You should have just used scotch tape. Nobody ever pushes the button with scotch tape on it.

    15. Re:Where was the switch? by Anonymous Coward · · Score: 0

      Mod Parent Up. Best explanation so far.

  10. Perhaps they should answer by antifoidulus · · Score: 1

    /.s current poll now?

    1. Re:Perhaps they should answer by zeylisse · · Score: 1

      several unbootable machines -- few thousands $$$
      thousands man-hours of repair -- several thousands $$$
      zillions teenage-girls-unable-to-blog-crying-hours -- priceless.

    2. Re:Perhaps they should answer by game+kid · · Score: 1

      Yup. Looks like another Over $20k, but no one knew it was me response.

      Maybe I'll answer for them--they might be too busy preventing the next wreck. They ought to be with all their users.

      --
      You can hold down the "B" button for continuous firing.
  11. Fascinating read by Saint+Aardvark · · Score: 4, Insightful
    It's amazing how much you can learn from things going horribly wrong. :-)

    Congrats to the LJ folks for getting things working, taking the time to do it right, and giving an admin's-eye-view into what actually happened.

    1. Re:Fascinating read by caluml · · Score: 1

      Agreed. I always appreciate when people explain how large scale outages happened, were able to happen, how they fix it, and what they do to prevent it happening again. It's useful (and good for your employment status) to learn from other people's mistakes rather than your own.
      So Slashdot - what are all the 500 errors about then? :)

  12. I bet I know who flipped the switch by Anonymous Coward · · Score: 0

    The linksys rep.

    I mean for like $20x9, they could have avoided the problem by adding a few NICs.

  13. Missing opportunities by Rosco+P.+Coltrane · · Score: 3, Funny

    Apparently someone "accidentally" pushed the emergency power off

    They had to power back on when they realized deadjournal.com was already taken...

    --
    "A door is what a dog is perpetually on the wrong side of" - Ogden Nash
  14. LJDotting: LJ user base vs Slashdot user base. by TrevorB · · Score: 4, Funny

    If Mr. "I Pushed The Big Red Button"'s personal information ever gets published....

    LJ's active user base is easily 10x that of Slashdot's. We'd have to come up with a new term for the internet event that pales any slashdotting that ever came before.

    1. Re:LJDotting: LJ user base vs Slashdot user base. by game+kid · · Score: 1

      How about the (somewhat) phonetic form of the complete URI:

      an http-colon-slash-slash-slash-dot-dot-orging?

      The people who got that domain name are some lucky geniuses.

      --
      You can hold down the "B" button for continuous firing.
    2. Re:LJDotting: LJ user base vs Slashdot user base. by Anonymous Coward · · Score: 0

      Blogzami?
      Blognami?

      Something along those lines. (Screw being PC)

  15. How do you do that by *accident*???? by Anonymous Coward · · Score: 0

    Those particular buttons are shielded by plastic covers. You have to deliberately lift the cover to get to the button. You can't just "bump into it".

    What, somebody was attracted to the pretty red button and just *had* to push it???

    1. Re:How do you do that by *accident*???? by FudgePackinJesus · · Score: 2, Funny

      Stimpy couldn't resist "The Red, Shiney, CANDY-LIKE Button!!"

    2. Re:How do you do that by *accident*???? by AndroidCat · · Score: 2, Funny
      Another customer in the facility accidentally pressed the EPO button, then depressed it

      I'm trying to figure out how depressing a button reverses a press. (Since the button is depressed by pressing it.) Unpressed it?

      --
      One line blog. I hear that they're called Twitters now.
    3. Re:How do you do that by *accident*???? by hachete · · Score: 1

      Particularly one shaped like this Big Red Button

      --
      Patriotism is a virtue of the vicious
  16. Auto-negotiation by stilwebm · · Score: 3, Informative

    When I first moved company servers in to a new colo four years ago, their engineers advised me that I should turn auto-negotiation off on every port, including our switches and host NICs. I asked why they recommended this and they replied, "trust us, auto-negotiation causes problems when you least expect it." I went ahead and fixed the port speeds everywhere. Now I understand why.

    1. Re:Auto-negotiation by Malk-a-mite · · Score: 1

      If you know what speed port you are plugging in to why would you need to autoneg?

      It's a convenience that isn't always needed.

    2. Re:Auto-negotiation by jjgm · · Score: 4, Insightful

      Sounds like a classic Cisco problem. I don't know what switches LJ were plugged into, but for years most Cisco switches would autonegotiate 100/half-duplex if the NIC was locked to 100/full; conversely, sometimes, NICs would autonegotiate 100/half if the Cisco was locked to 100/full.

      They're cheeky enough to document this now. It's a feature, not a bug! Honest!
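A minimal sketch of why that combination ends in a duplex mismatch, assuming the usual 802.3 behaviour as I understand it: a port left on auto that hears no negotiation signalling from a forced peer can still detect the speed, but it cannot learn the duplex and falls back to half. The function names are hypothetical, and both ends are assumed to be 100 Mbit and capable of full duplex.

```python
def local_duplex(local, peer):
    """Duplex one end settles on. Settings are 'auto', 'full' or 'half'."""
    if local != "auto":
        return local   # forced setting: used as configured
    if peer == "auto":
        return "full"  # both ends negotiate and pick the best common mode
    return "half"      # peer is forced and silent: fall back to half duplex


def link_state(a, b):
    """Return (duplex of a, duplex of b, mismatch flag) for one link."""
    a_result = local_duplex(a, b)
    b_result = local_duplex(b, a)
    return a_result, b_result, a_result != b_result


for pair in [("auto", "auto"), ("full", "full"), ("auto", "full")]:
    print(pair, "->", link_state(*pair))
```

Under these assumptions the only safe choices are auto on both ends or the same forced setting on both ends; mixing forced-full with auto is the case that produces the mismatch.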

    3. Re:Auto-negotiation by Anonymous Coward · · Score: 0

      The engineers didn't work for Cisco, by chance, did they? The guy who ran my Cisco class said the same thing.

      "9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly"

      When I read this, the first thing that popped into my mind was "hey, I didn't know Cisco made PC motherboards!"

      I've worked with pretty much every major brand of router and switch (including ones that don't exist anymore), and the *only* company that has problems with autonegotiation is Cisco.

    4. Re:Auto-negotiation by Undertaker43017 · · Score: 2, Funny

      The part I like is they are claiming that everyone else is wrong, and they are right. ;)

      I don't buy Cisco anymore for this very reason, it's not just their switches, it's on everything they make that has a NIC.

      I deployed some CSS's, right after Cisco bought ArrowPoint, and they did auto correctly. Another client deployed some a couple of months ago, and auto was broken. Cisco is the Borg! ;)

    5. Re:Auto-negotiation by bjz · · Score: 1

      Actually, nowadays even Cisco recommends trying auto negotiation first, and only hard coding port/speed settings for problem NICs or for other switches, routers, and important servers. Also, with gigabit ethernet, the port speed and other settings like flow control have to be auto negotiated ( http://www.cisco.com/en/US/products/hw/switches/ps663/products_tech_note09186a0080094713.shtml#auto_neg/).

      Apparently, when auto negotiation was first being standardized, it was crap and most network admins just learned to shut it off and never changed practices as auto negotiation became more stable. Instead, the "turn it off" wisdom was passed down, normally with vague hand waving about "problems". Today Cisco and Sun (the only companies I researched) recommend auto negotiation. I'll bet those 9 machines failing to auto negotiate is more because of crap components being used than any fault of auto negotiation; this was apparently a known problem, and auto negotiation should have been turned off for those specific machines.

    6. Re:Auto-negotiation by ignorant_newbie · · Score: 1

      >and never changed practices as auto negotiation
      > became more stable

      So you believe the manufacturer's press release? OK, setting that aside for a minute, given that most installations purchase hardware as it's needed, that means that most people have some old stuff and some new stuff. Do you think it makes more sense to have a different policy for each piece of hardware you're plugged into, or to have one policy that always works no matter what you're attached to?

      now if someone in the gnu/linux world would just fix ifconfig so that it actually knew how to configure all the settings on a given interface so that I wouldn't have to read the damn kernel documentation for every new nic i purchase...

    7. Re:Auto-negotiation by archen · · Score: 1

      I hope they at least asked you what kind of switches and NICs you were using. I found out the hard way one time that Nortel switches (at least the ones we use) default to 100Tx with NO duplexing when you turn negotiation off (and you can't force it to duplex either). Man did networking take a shit on some servers that day...

    8. Re:Auto-negotiation by Moofie · · Score: 1

      So I don't have to jack with it on every computer I install.

      It's a convenience that saves me time.

      --
      Why yes, I AM a rocket scientist!
    9. Re:Auto-negotiation by Anonymous Coward · · Score: 0

      Nortel switches default to 100Tx with NO duplexing

      Sorry, but *WHAT*!??!?!

      How the hell do you get "no" duplexing?

      Duplex is either half (one side can talk at a time) or full (both sides can talk at the same time.)

      "No" duplexing would mean that neither side can talk at any time.

    10. Re:Auto-negotiation by stilwebm · · Score: 1

      Yes, they did, and as others suggested, they were Cisco switches (this was in 2001) and they were Cisco certified engineers.

    11. Re:Auto-negotiation by Anonymous Coward · · Score: 0

      If you know what speed port you are plugging in to why would you need to autoneg?

      That is why I didn't argue with the experts about turning it off, even though you do have to actively turn off autonegotiation on most devices. Other than a few seconds of extra configuration time, there was no reason to ignore their advice.

    12. Re:Auto-negotiation by mink · · Score: 1

      I have a Nortel switch that has auto-neg issues with hardware connected to it.

      --
      Well I've wrestled with reality for thirty five years doctor, and I'm happy to say I finally won out over it.
  17. ...and ran off? by stratjakt · · Score: 5, Funny

    What do you mean, ran off?

    Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?

    Or do you really mean, slunk off, like my dog does when I walk in and find her curled up on top of the remains of the remotes for the TV, TiVo, DVD player and stereo?

    My dog likes remote controls more than snausages.

    OT: Anyone know where (brick and mortar) to get a replacement (original) TiVo remote?

    --
    I don't need no instructions to know how to rock!!!!
    1. Re:...and ran off? by Anonymous Coward · · Score: 0

      They'd have to go out past the guard/reception desk at Fisher Plaza. That would put them on a security camera tape.

      Plus the fact that the biometric ID card system Internap has lets them know *exactly* who was in the room at the time...

    2. Re:...and ran off? by stratjakt · · Score: 1

      Ya well, shit happens, and I hardly think they're going to call in the cast of CSI to investigate this.

      I mean, for the most part, it's a free service. It's not like those users with free accounts can sue to get their money back.

      --
      I don't need no instructions to know how to rock!!!!
    3. Re:...and ran off? by Anonymous Coward · · Score: 0

      You can order several types of remotes (colors) from TiVo themselves at: https://store.tivo.com/main.aspx?cid=102

    4. Re:...and ran off? by stratjakt · · Score: 1

      a) I don't want to wait 6-12 months for delivery, which I've been told, is about the average turnaround ordering stuff from TiVo. I kind of wanted to watch TV today.

      b) That, and I've given TiVo enough of my money directly. 35 bucks for a single-function remote is ridiculous. They don't even give you free shipping. Some people deserve to go bankrupt.

      Don't any retailers carry replacements? Or even a third party remote that has the right buttons, in the right places? The philips universals control TiVo, but it makes finding the "TiVo central" and "live tv" buttons a chore.

      --
      I don't need no instructions to know how to rock!!!!
    5. Re:...and ran off? by Anonymous Coward · · Score: 0


      Internap is a huge facility. There are many big companies that colocate there. The Seattle facility houses 8,000+ servers.

    6. Re:...and ran off? by Anonymous Coward · · Score: 0

      Who cares about LJ - it's the rest of us who had to deal with cleaning up that mess. There are a *lot* of companies in that particular space...

    7. Re:...and ran off? by DrHogie · · Score: 1

      9thtee.com and weaknees.com should both sell replacement remotes for TiVo. After one too many drops on our living room's tile floor, it's about time we get a new one ourselves . . .

      http://www.weaknees.com/tivo_remotes.php

      --
      --DrH, the Sandwich with the Ph.D.
    8. Re:...and ran off? by stratjakt · · Score: 1

      But

      I

      Want

      One

      NOWWWWWWWWW

      I can't watch Sunday's Simpsons until I get a remote.

      There should be a law against selling remote operated products that don't have the equivalent buttons on the device itself. E.g., TiVo, and the downstairs TV in which the only way to put it in rear A/V in mode is via the remote.

      --
      I don't need no instructions to know how to rock!!!!
    9. Re:...and ran off? by Anonymous Coward · · Score: 0

      No, but when you get your new one don't use it with peanut butter covered fingers. YOU HEAR THAT NICHOLAS!!!!!

    10. Re:...and ran off? by Rolan · · Score: 1

      Ran off like "god I hope nobody has a gun back there" I would imagine.

      --
      - AMW
    11. Re:...and ran off? by UWC · · Score: 1

      I share your pain in that regard, especially the input selecting. While it seems that most (or at least many) VCRs let you set the A/V inputs as standard channels in the normal tuning sequence (though setting that up still requires the remote), both of the TVs I have with A/V inputs require a remote for access to those. Which is frustrating when you have the DVD player remote in hand, with its audio running to external speakers and all you have to do is press a single button once (maybe twice, depending on which input is used) on the TV remote which is nowhere to be found.

    12. Re:...and ran off? by Anonymous Coward · · Score: 0

      Philips universal remotes generally work to control TiVos just fine. I'm using one on my TiVo; the remote broke and like you I wanted it now. So I went to Best Buy and bought one.

      Check the package before buying it. It will have Tivo listed as a brand under VCRs. If it does, then it will work. No thumbs but essentially every other button works.

    13. Re:...and ran off? by SmittyTheBold · · Score: 1

      Get one of the One-For-All universal remotes, they're cheap and the slightly-more-expensive ones ($15 or so) can be flash-upgraded with new device codes and completely customized keymaps. To do this you'll have to be willing to geek out quite a bit to learn how to program the remote, but they're very powerful when you get down to it.

      A good suggestion is the OFA URC-8811 which can be purchased at your friendly neighborhood Wal-Mart for cheap and used right away, then soft-upgraded later when you want to get the absolute most out of it.

      Learn more about all this here.

      --
      ± 29 dB
    14. Re:...and ran off? by stratjakt · · Score: 1

      I know they do I just don't care too much for the button layout.

      But then again, they control everything I own (XBox, PS2 and TiVo and even that cheap-ass Sears branded TV from 1902)

      --
      I don't need no instructions to know how to rock!!!!
  18. I want to name this file..... by Evil+W1zard · · Score: 1

    Speaking of stupid things to do, how many people know someone who has named a file on a Unix server * and then, at some point later, decided they no longer needed that file and decided to rm *?

    --
    News Reporters Make Tasty Polar Bear Treats!
    1. Re:I want to name this file..... by Anonymous Coward · · Score: 0

      Worse..At one point I was trying to clear the directory of . files..
      yup. you guessed it.
      rm -rf .*

      All I can say is...ouch.
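A rough illustration of why `.*` bites, using Python's `fnmatch` as a stand-in for shell globbing. This is only a sketch: real shells differ in whether they hand `.` and `..` to the command, and many current `rm` implementations refuse to remove those two entries. The directory listing below is made up.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing, including the two special entries.
entries = [".", "..", ".bashrc", ".profile", "notes.txt"]


def expand(pattern, names):
    """Return the names a glob pattern would match (fnmatch rules, not a real shell)."""
    return [name for name in names if fnmatchcase(name, pattern)]


dangerous = expand(".*", entries)    # also matches "." and ".." (the parent directory)
safer = expand(".[!.]*", entries)    # a dot followed by a non-dot character

print("'.*'     ->", dangerous)
print("'.[!.]*' ->", safer)
```

The second pattern skips `.` and `..`, at the cost of also skipping names that start with two dots, so it is a trade-off rather than a complete fix.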

    2. Re:I want to name this file..... by Cocoronixx · · Score: 2, Funny

      uhhh 0? Well I guess 1 since I can count you now.

      --
      "Obscenity is the crutch of the inarticulate motherfucker." - cloak42
    3. Re:I want to name this file..... by Anonymous Coward · · Score: 0

      Yeah, like that one time I copied "format.exe" to "dir.exe" within the cwd of my DOS box and then wanted to see the contents of the root directory.

    4. Re:I want to name this file..... by Anonymous Coward · · Score: 0

      Yes I realize I'm biting..

      IIRC, this wouldn't work because DOS first looks in its internal commands to match what you've typed, then looks in the current directory, then the PATH environment variable. You would have to explicitly type the extension to avoid built-in dir from running.

      Furthermore, without a switch, format.COM will prompt.
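The lookup order described above can be sketched roughly like this. It is a toy model, not COMMAND.COM's actual logic: the set of internal commands is a small illustrative subset, the function name is made up, and it only handles a bare command name typed without an extension.

```python
# Toy model of the lookup order described above; not real DOS code.
INTERNAL_COMMANDS = {"DIR", "COPY", "DEL", "TYPE", "CD"}  # illustrative subset
EXTENSIONS = (".COM", ".EXE", ".BAT")  # order tried for a bare command name


def resolve(command, cwd_files, path_dirs):
    """Describe what a bare command name would run.

    cwd_files: set of file names in the current directory
    path_dirs: ordered list of (directory, set of file names) taken from PATH
    """
    name = command.upper()
    if name in INTERNAL_COMMANDS:
        return "internal " + name  # built-ins win before any file is considered
    for directory, files in [(".", cwd_files)] + list(path_dirs):
        upper = {f.upper() for f in files}
        for ext in EXTENSIONS:
            if name + ext in upper:
                return directory + "\\" + name + ext
    return "Bad command or file name"


# A renamed copy of format sitting in the current directory is never reached:
print(resolve("dir", {"DIR.EXE", "FORMAT.COM"}, []))
```

In this model the prank only works for names that are not built-ins.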

    5. Re:I want to name this file..... by shuz · · Score: 1

      That is why I try to always include "" around everything I do in any unix environment and leave off trailing /'s when ever possible.

      --
      There is or can be built a machine that can simulate any physical object. -Church-Turing principle
  19. *GASP* by Anonymous Coward · · Score: 0
    (Warning: it's late and I'm tired/rambly, so this post might be incoherent...

    So what you're saying is that your post will mirror 99% of all LJ posts?

  20. Credit by XorNand · · Score: 4, Informative

    Anyone who's a paid member of LJ can get a 2-week credit here.

    --
    Entrepreneur : (noun), French for "unemployed"
  21. Microsoft responds to Live Journal incident by Gary+Destruction · · Score: 0, Flamebait

    "Must be uh, must be why we're not shipping Longhorn yet."

  22. I guess that eliminates anyone on the IT staff by Anonymous Coward · · Score: 0

    With that 'running off' part. If you had said 'wobbled off' or 'jiggled off' you might be able to make a case.

  23. A great article by digitalgimpus · · Score: 1

    I must compliment LJ for at least being honest with their system... many would lie and say "it was the datacenter's fault".

    They at least admit their own systems weren't perfect... and clearly explained each fault they observed.

    Good info.

  24. I'm in that datacenter once a month or so... by marked23 · · Score: 1

    I always wanted to push that button... Now I don't have to.

  25. Ahhhh silence is GOOOOLDEN by ShatteredDream · · Score: 3, Funny

    *crickets chirping* That's the sound of millions of teenage girls not using up bandwidth and disk space talking about boys, jcrew and high school/college drama.

    1. Re:Ahhhh silence is GOOOOLDEN by eln · · Score: 1

      Yah, but now we have nerds talking about girls talking about boys, jcrew, and high school/college drama. I shudder to think what would happen if Slashdot had an outage like that right now.

    2. Re:Ahhhh silence is GOOOOLDEN by metalhed77 · · Score: 3, Funny

      So says the author of yet another political weblog whose startling impartiality and sense will pave the way for a brave new world?

      --
      Photos.
    3. Re:Ahhhh silence is GOOOOLDEN by AndroidCat · · Score: 1

      For emergency backup, they could always switch back to paper diaries, except that their kid brothers could steal them and read them. Can't have that! (Let the pest browse it like everyone else.)

      --
      One line blog. I hear that they're called Twitters now.
    4. Re:Ahhhh silence is GOOOOLDEN by Anonymous Coward · · Score: 0

      Christ, find some new material. Do we have to read this same unfunny drivel every single time a story even remotely mentions Livejournal? It's the fucking "Where's the Beef?" of the Internet age.

    5. Re:Ahhhh silence is GOOOOLDEN by mattwarden · · Score: 1

      (Score:3, Funny)

      Actually, I believe that is the sound of another ridiculously redundant comment being moderated by slashdot mods who didn't read the comments of the last 2 stories about this incident.

  26. machine failure by br00tus · · Score: 3, Insightful
    "They had problems to come back up fast, because of '9 machines with faulty motherboards with embedded NICs that don't do auto-negotiation properly", Machines not fully rebooting for analysis reasons and few others.'"

    I was a sysadmin at a Fortune 100 company with thousands of servers. Every Saturday evening, we rebooted all of our servers. We almost always had several machines which would not come back up for one reason or another - so we dealt with it then, on Sunday morning, instead of during the week when a reboot of a critical machine that did not work would be much worse. Scheduled reboots are a part of good systems administration. If once a week is too often, then once every two weeks, or once a month. With this much failure, I'm almost certain they never did scheduled reboots. They had two failures - their power failed, and then their lack of planning allowed for so much to go wrong as a result of that.

    1. Re:machine failure by rjstanford · · Score: 4, Insightful

      One of the last steps of our standard deployment was a full hard shutdown and restore from backup. This was scheduled to happen approximately a week before bringing the machines live - after a lot of data setup had been done.

      Many customers - and internal staff - really, really got scared at that point. The thing is, if you don't trust your backups, what good are they? It's amazing what things got taken care of and found during double-checks the week before the backup/restoration test.

      Oh, and we always went with scheduled reboots as well, for very much the same reason as you mentioned. An hour a month of scheduled downtime is almost always available - usually we booted every week and had an optional downtime window on a monthly basis. And if your (talking to readers here, not parent) organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?

      --
      You're special forces then? That's great! I just love your olympics!
    2. Re:machine failure by Jpunkroman · · Score: 0, Offtopic

      Let me get this straight? You spent Saturday night and Sunday morning at work? That's some serious dedication, I barely want to spend Tuesday mornings at work. And Saturday night? Wow.

    3. Re:machine failure by gkuz · · Score: 2, Insightful
      Every Saturday evening, we rebooted all of our servers

      Yeah, we had servers like that once, too. Ba-da-bing! Thanks, I'll be here all week.

      On a serious note, am I the only one here who thinks a world in which no one questions a policy like that is insane? We've had critical, and I mean critical, servers that have uptimes measured in years. But then again they run NetWare, or OS/400, or MVS, or.... ABW.

      Scheduled reboots are a part of good systems administration

      Yeah, scheduled, as part of a disaster recovery test once a year, maybe. Weekly scheduled reboots are a sign of shitty systems. How often do you reboot your Cisco routers?

    4. Re:machine failure by Saeed+al-Sahaf · · Score: 1
      Scheduled reboots are a part of good systems administration

      He's talking about Windows, where regular reboots are a good thing when they are planned, so you don't have regular reboots when they are NOT planned!

      --
      "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
    5. Re:machine failure by prshaw · · Score: 1

      And do those OS's test the hardware to make sure it will restart after a shutdown?

      It's more than just whether the OS will keep running; it's also whether the hardware will live through a power cycle.

    6. Re:machine failure by gkuz · · Score: 1
      it is also will the hardware live through a power cycle

      Why should it have to? If it's a critical server, your infrastructure should be such that it never power cycles. Our computer room has "power cycled" once since the facility was built in 1984. And that incident led to spending $65k in consulting engineering services alone, to determine why it happened and develop a plan to prevent it happening again. I'm not even sure what the expenditure in hardware or electrical contracting related to that was. I guess we define "critical" differently.

    7. Re:machine failure by Anonymous Coward · · Score: 0

      I'm a sysadmin for the world's largest intranet (hint, it's for the military). If I even think about asking for a recurring, scheduled reboot I'd be shot. I don't know what kind of SLA you have but if you reboot once a week there's no way you can make three 9s, and I'm held to 5 (held to does not necessarily mean achieve..)
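For a sense of scale, here is the arithmetic behind "three 9s" and "five 9s" for a single box, as a back-of-the-envelope sketch. It assumes a 365-day year and that reboots are the only downtime, which is optimistic; the function names are made up.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)
REBOOTS_PER_YEAR = 52             # one scheduled reboot per week


def downtime_budget_minutes(availability):
    """Minutes of downtime per year allowed by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def budget_per_reboot_minutes(availability):
    """Minutes each weekly reboot may take if reboots are the only downtime."""
    return downtime_budget_minutes(availability) / REBOOTS_PER_YEAR


for label, target in [("three 9s", 0.999), ("five 9s", 0.99999)]:
    print(f"{label}: {downtime_budget_minutes(target):.1f} min/year, "
          f"{budget_per_reboot_minutes(target):.2f} min per weekly reboot")
```

Three 9s leaves roughly ten minutes per weekly reboot with nothing to spare for real failures; five 9s leaves about six seconds, which a single machine cannot meet with any reboot schedule.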

    8. Re:machine failure by TeraCo · · Score: 2, Insightful

      You sir, sound like a man who needs a load balanced cluster. If you're relying on individual boxes staying up to meet your SLA's, your career is a ticking timebomb.

      --
      Not Meta-modding due to apathy.
    9. Re:machine failure by Local+ID10T · · Score: 1

      It's called contingency planning.

      Asking the "What if..." questions, and coming up with an answer. Even if the odds are one in a billion or more, a good admin will have an answer. A better admin will have written the answer down for someone else in case they aren't around.

      The right answer is not to simply say that it will never happen.

      --
      "You want to know how to help your kids? Leave them the fuck alone." -George Carlin
    10. Re:machine failure by radish · · Score: 1

      On a serious note, am I the only one here who thinks a world in which no one questions a policy like that is insane?

      You're kidding right?

      We've had critical, and I mean critical, servers that have uptimes measured in years

      Well good for you. But when (not if) one of those boxes gets a hardware fault, or a power problem, what do you do? Do you have ANY confidence that it will come up properly? If you rebooted that thing every x (day/week/month whatever) then that answer would be YES, you know that if you have to bring it up it will come up.

      Number one rule of high availability systems: NO SINGLE POINTS OF FAILURE. You need a hot backup for EVERYTHING. Provided you have that, then regular reboots are not a problem, as each box cycles the others take up the slack. If you don't have that, then you don't have a reliable system, you have a timebomb.

      Hell we even do regular cable pull tests. Someone will walk through one of the server rooms and yank a cable or three, could be power, could be network, whatever. If your system is properly put together nothing (or no-one) except your monitoring systems should notice.

      --

      ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"

    11. Re:machine failure by Chirs · · Score: 1

      An hour a month is a lot of downtime for many companies. (Think online stores, telcos, etc.)

      This is why you have hot-standby redundant hardware (or at least warm-standby with data syncing).

      Every week you switch to the standby and reboot the previously-active from backup. After testing that it's okay, you reboot it again and bring it back into sync with the active.
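A minimal sketch of that kind of rotation, generalized to a small pool: take one machine out at a time, reboot it, and only move on if enough machines are still serving. Everything here is hypothetical (the `Node` class, the `reboot` callback, the capacity threshold); a real setup would drive a load balancer and real health checks.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True      # whether it would come back cleanly from a reboot
    in_service: bool = True   # whether it is currently taking traffic


def rolling_reboot(pool, min_in_service, reboot):
    """Reboot nodes one at a time without dropping below min_in_service.

    reboot(node) must return True if the node came back up. Returns the names
    of nodes that failed; those stay out of service until someone fixes them.
    """
    failed = []
    for node in pool:
        serving = sum(n.in_service for n in pool)
        if serving - 1 < min_in_service:
            break  # pulling another node would breach capacity: stop here
        node.in_service = False
        if reboot(node):
            node.in_service = True
        else:
            failed.append(node.name)
    return failed


pool = [Node("a"), Node("b", healthy=False), Node("c")]
failed = rolling_reboot(pool, min_in_service=2, reboot=lambda n: n.healthy)
print("failed:", failed, "still serving:", [n.name for n in pool if n.in_service])
```

The point of the threshold is that a box which fails to come back is found during the maintenance window while the rest of the pool keeps serving, and the rotation stops rather than taking out another machine.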

    12. Re:machine failure by drew · · Score: 1

      in defence of lj on this point, i don't think any of the issues they didn't already know about would have been uncovered by scheduled reboots (mobo's that won't auto negotiate and db's that don't restart automatically aside). most of their problems were results of the hard shutdown. so unless you're just pulling the plug on your servers when you do a scheduled shutdown, this isn't really comparable.

      the issues that they already knew about, on the other hand, were all issues that never seemed like a big deal to them before because they were thinking in terms of one computer going down at a time, not all of them at once....

      there are a few other issues in there that i would criticize them on, but not doing scheduled reboots isn't one of them. in this case, however, i'll pass on criticism, and instead thank them for being as candid as they have been in explaining what happened and how they are going about ensuring it doesn't happen again.

      --
      If I don't put anything here, will anyone recognize me anymore?
    13. Re:machine failure by rjstanford · · Score: 1

      And that's always an option too. I didn't say anything about bringing all provided services down - just bringing a machine down. Some operations have dead time - some don't. Either way, by doing it formally all the time you're in much better shape when you have to do it for whatever reason.

      I've been in shops before where a machine has been running for a couple of years and needs an upgrade, and everyone's really fscking scared to touch it because they have no confidence that it will come back up. Doing a bounce on a regular basis at least lets you make sure that - if something's happened to the boot sequence - it was recent, and can be fixed.

      --
      You're special forces then? That's great! I just love your olympics!
    14. Re:machine failure by gkuz · · Score: 1
      You're kidding right?

      Uh, no. The part where I said "On a serious note" should have given that away. Show me where I argued against contingency planning. I'll bet I know as much about business continuity planning/disaster recovery planning as most of the people who are misreading my arguments here, and have written/rewritten my share of such plans.

      The grandparent poster said he rebooted every server every Saturday, and came in every Sunday to fix the ones that didn't come up. My argument is that that does not add one bit to system (in the large sense) reliability, it is almost always done to compensate for cheap hardware or shitty *cough*Microsoft*cough* OS'es. Note his words, that he did this every weekend "instead of during the week when a reboot of a critical machine that did not work would be much worse." This does not describe a robust, high-availability system, it describes excuses for crap. You're absolutely right, that in a properly designed system with redundant equipment, hot spares, well-designed and -tested failover mechanisms and good management, you should be able to knock out any piece of equipment or any data path at any time without it causing a crisis. But that wasn't what the grandparent was describing. He was describing a set of systems where you spend every weekend rebooting everything because you'll shit your pants if you have a problem on a Wednesday. Well some of us don't have the luxury of that much downtime. So plan and test away. But every week is just wrong.

    15. Re:machine failure by gkuz · · Score: 1
      The right answer is not to simply say that it will never happen.

      Read my other post. I never argued that, otherwise I wouldn't have a diesel generator as backup to my UPS, with two different failover mechanisms. Or 24x7 security guards with two different phone systems and printed (on paper, in a binder) emergency instructions in both of the buildings on the property, with a list of contacts ordered by distance from the building and skill set. Or.... you get the picture. The guy I was responding to wasn't talking about contingency planning, he was talking about spending every weekend compensating for crappy servers. That's not "mission-critical", that's a bunch of toys.

    16. Re:machine failure by Anonymous Coward · · Score: 0

      Scheduled reboots are a part of good systems administration.....

      If running Windows systems they are..

    17. Re:machine failure by br00tus · · Score: 1
      I have read through the responses and will explain more.

      A lot of people have dwelled on the word critical. I could have expanded this to mean both critical and important machines. Our critical machines were highly available, with hot standby redundant hardware (as were their RAID arrays and such). But we rebooted the running systems and then the standbys every week to make sure the failover would work. I do not know why people presume that a scheduled reboot means we have no failover. We have to know the failover would work!

      Some people suggested that Saturday night scheduled applications might prevent a reboot. This was true: we had a few machines that processed data from Monday morning all the way into Sunday afternoon, after which they were rebooted.

      Someone else said "Yeah, scheduled, as part of a disaster recovery test once a year, maybe. Weekly scheduled reboots are a sign of shitty systems. How often do you reboot your Cisco routers?" As I said, we had servers that did not come up every single week. Something to be expected in an environment with thousands of servers. If virtually every week there is a problem with some servers, then once a week is often enough to reboot. We would have probably done it more often except too many machines were running all day during weekdays, as well as machines which absolutely had to be working by 9AM. Rebooting on Saturday evening gave us two days to fix and escalate problems. As far as shitty systems - there were some things I was unhappy with, but a lot of things were done right. Some of the people in systems engineering were smarter than you and me put together I'm sure, F100 companies can afford these people. As far as how often we rebooted Cisco routers - every week. We had redundant routers and switches where needed. I worked in systems, so I know only a little about IOS or the network administration there, and I don't know what exceptions they made, or what happened to routing tables in memory or such.

      "Note his words, that he did this every weekend 'instead of during the week when a reboot of a critical machine that did not work would be much worse.' This does not describe a robust, high-availability system, it describes excuses for crap." I definitely disagree with this. We had Sun Enterprise 6500s on VCS with redundant RAID arrays that ran from 9AM to 5PM where one machine alone would process *billions* of dollars worth of transactions. After this, they would spend 5PM to 11PM or so processing (or offloading) this data. If we rebooted these machines at midnight, and they did not come up, they would absolutely have to be up by 9AM. This is not an excuse for crap, it would be insane to do such a reboot. And as far as crap, we had trouble from everyone - Microsoft, Sun, EMC, whoever - all of these people produced machines or software with defects, sometimes which we discovered - I don't know what your solution is to avoiding vendors who never introduce such errors, if you know of any vendors who have perfect products, I'd love to know. You do not sound like someone who has worked in an environment where critical machines need to be working by 9AM so as to do billions of dollars worth of transactions, your suggestion that if we can't reboot on midnight during a weekday our system is crap is insane.

      "If it's a critical server, your infrastructure should be such that it never power cycles." - well we are located in New York City and we had a blackout in 2003, as did much of the northeast. It started during a workday, on a Thursday, and on Friday morning electricity was still not functioning. So you are running all systems on UPS backup for 24 hours. Systems processing billions of dollars in transactions. Our systems did not power cycle, and ran on battery power for 24 hours, but your assertion that "your infrastructure should be such that it never power cycles" is ridiculous. In such a situation, I would be much happier knowing my machines had all rebooted fine days ago, instead of knowing the

    18. Re:machine failure by dbIII · · Score: 1
      Every Saturday evening, we rebooted all of our servers
      Obviously an MS Windows shop getting around memory leaks.

      I feel like an amateur because I can turn all but two of my machines off on Christmas eve and not have to turn them on again until four days later. A lot of places really do have to do 365/24/7 with a lot of machines. Some computing tasks still take well over a week on reasonably serious hardware, and even if they are checkpointed every day you do not want to lose power.

      Scheduled reboots are a part of good systems administration
      Perhaps in desktop pc land, but some of us have to go months between any possible shutdown windows. You do have to do it often enough to know that your current configuration is going to come up - and you do have to know the machines backwards before you bring them down, and certainly need to know what sequence to bring them up and ensure as much as possible that each machine can come up alone.
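
      The "what sequence to bring them up" part can at least be written down and checked. A rough sketch, assuming you can list which hosts each machine needs running first (the host names and dependencies here are invented); Python's standard `graphlib` then yields an order in which every prerequisite comes first, and raises `CycleError` if two machines each wait on the other.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each host lists the hosts it needs up first.
depends_on = {
    "dns": set(),
    "storage": set(),
    "db": {"storage", "dns"},
    "app": {"db"},
    "web": {"app", "dns"},
}

# static_order() yields every host after all of its prerequisites.
boot_order = list(TopologicalSorter(depends_on).static_order())
print(boot_order)
```

      Hosts with an empty set are the ones that can come up alone; the fewer entries on the right-hand side, the closer you are to that goal.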
    19. Re:machine failure by gkuz · · Score: 1
      Our systems did not power cycle, and ran on battery power for 24 hours, but your assertion that "your infrastructure should be such that it never power cycles" is ridiculous

      Not at all ridiculous, that's what generators are for. As I said, our data center has lost power once in 20 years.

    20. Re:machine failure by snero3 · · Score: 1

      organization can't afford to be without a single machine for a 2-3 hour block once a month, WTF is your plan to handle a hardware failure? Prayer?

      hmm, if you need that app 24/7/365, how are you going to get time to reboot the machine? Of course if you cluster it you could, but not all machines need rebooting just to function. Also DR sites are great if you have total hardware failure/power failure. I have worked many places where a shutdown is just not considered/necessary (banks, road side assistance, trading houses) and they buy their hardware to suit.

      --
      It said "windows 98 or better" so I installed Linux
  27. LOL! Kindof like when... by GillBates0 · · Score: 5, Funny
    ...when I was on AOL and I hit the X and I couldn't talk to my AOL Buddies anymore.

    And I was like OMG I shut off the internets and stuff!!1!!

    And i called the AOL helpdesk and they helped turn it back on.

    --
    An Indian-American Hindu committed to non-violent thought/speech/action alarmed by the global explosion of radical Islam
    1. Re:LOL! Kindof like when... by game+kid · · Score: 1

      Among the many reasons that viruses spread across Windows PCs.

      The internets...now that's a classic.

      --
      You can hold down the "B" button for continuous firing.
    2. Re:LOL! Kindof like when... by Saeed+al-Sahaf · · Score: 1

      When you tried to turn it back on, did it go, like, "beep, beep, beep"?

      --
      "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
  28. Way too thankful? by BestNicksRTaken · · Score: 1

    Is it me, or are some of those LJ users' expressions of thanks just a bit OTT?

    The way the comments go, you'd think this was a life support system or something!

    I mean, well done for getting the site back up after like 24 hours or something, but hey I'm not creaming my shorts over it!

    --
    #include <sig.h>
    1. Re:Way too thankful? by Anonymous Coward · · Score: 0

      Imagine if IRC shut off one day. Your favorite large server group. Certainly there'd be some consternation out there.

  29. And here by OverlordQ · · Score: 1

    everybody was blaming Internap for screwing up and running a shoddy Datacenter, when actually Internap did everything they were supposed to correctly.

    --
    Your hair look like poop, Bob! - Wanker.
    1. Re:And here by tmhsiao · · Score: 2, Interesting

      Aside from allowing an unaccompanied client access to the Big Red Button, perhaps?

      --
      "My God...It's full of ads!" -Fry, about the Internet, Futurama
    2. Re:And here by Anonymous Coward · · Score: 0

      You're a shining exemplar of intellectual sloth. If I were you, I'd choke on shit and asphyxiate.

  30. Also, by revery · · Score: 1

    Apparently someone "accidentally" pushed the emergency power off (which should keep all power off, even UPS)

    This also raised the all-important "Why do we even have that button?" question.

    1. Re:Also, by Anonymous Coward · · Score: 0

      This also raised the all-important "Why do we even have that button?" question.

      From TFA:

      "EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center."

    2. Re:Also, by Scott+Laird · · Score: 2, Informative

      "Why do we even have that button?" Because it's basically required by law. Covering them with a plastic cover doesn't seem to help either--Internap did that the *last* time someone hit the EPO button in this datacenter.

    3. Re:Also, by merlin_jim · · Score: 1

      This also raised the all-important "Why do we even have that button?" question.

      Those buttons are generally maintenance devices; it's usually less of a button and more of a keyswitch though. So the guy comes in to service something, he needs to know that no power is anywhere in there, so he removes the key and keeps it in his pocket. Now he knows he's safe.

      --
      I am disrespectful to dirt! Can you see that I am serious?!
    4. Re:Also, by Peridriga · · Score: 1
      It's the law. It's also in the article.

      EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center
    5. Re:Also, by Anonymous Coward · · Score: 0

      Let me guess...

      In case you need to turn off all power in an emergency?
      Like when the firefighters come into the room and get ready to hose a big fire down after the argon didn't get released for some reason?
      Or more likely, when you see a clumsy electrician getting reminded he shouldn't be working on live circuits?

    6. Re:Also, by revery · · Score: 1

      I keep forgetting that this is slashdot. I should have put in my disclaimer:

      Please, do not be alarmed or reply with an explanation. This is a joke. I am joking. You have been joked with.

      Sigh...

    7. Re:Also, by RollingThunder · · Score: 1

      So that the firefighters don't start dumping water onto live power mains?

      It also helps people there stop electrical fires from massively spreading. Yes, there's already a fire, but the spread of it won't cause more shorts which can keep the fire going and/or burn out in separate areas where the lines are overheating.

    8. Re:Also, by jacksonj04 · · Score: 1

      I think you're thinking of something different. Keyswitch isolators are more commonly used in localised sites, such as labs, where occasional maintenance is performed. They usually don't cut power to hundreds of critical servers, even bypassing the usual UPS systems.

      The EPO, on the other hand, is designed to cut all the power to everything. This comes in useful in things like fires, where you really don't want to be fumbling around for keys to the isolator.

      --
      How many people can read hex if only you and dead people can read hex?
  31. Button of Doom by clinko · · Score: 1

    Maybe they should use the Button of Doom (USB) to lock the pcs down too...

    1. Re:Button of Doom by Anonymous Coward · · Score: 0

      Oh man, that thing is awesome. I'm going to buy one and use it as my doorbell, then maybe those damn Jehovah's Witnesses would stop bothering me.

  32. Wait a second! by Sialagogue · · Score: 1

    "EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center."

    "...all our DBs have redundant power supplies. we'll be plugging one side into Internap's, and the other side into our own UPS, which itself is plugged into Internap's other power grid. that way if EPO is pressed, we'll have 1-4 minutes to do a clean shutdown. (but if we do the rest of the stuff right, this step isn't even required, including having UPSes... in theory... but the UPSes would be comforting)

    Isn't that circumventing the purpose of the EPO? If there's a smoky fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?

    I'm not trying to be a smartass here, since I'm not an expert in datacenters or the purposes behind EPOs - I'm asking. . .

    --
    The only acceptable defense of scientific results is to say that they were the product of the Scientific Method.
    1. Re:Wait a second! by rah1420 · · Score: 2, Informative

      Technically, yes. I'm hoping that if LJ decides to implement such a scheme (let's call it "LEPO" for "Leisurely Emergency Power Off") that they run it past the fire marshal or the code inspectors first, who may have another opinion about how smart this idea is.

      "If it's stupid and it works, it's not stupid."

      --
      Mit der Dummheit kämpfen Götter selbst vergebens.
    2. Re:Wait a second! by Malk-a-mite · · Score: 1

      APC has a white paper on EPO available online at:
      [PDF warning]
      ftp://www.apcmedia.com/salestools/ASTE-5T3TTT_R1_EN.pdf

      "Executive Summary
      Emergency Power Off (EPO) is the capability to power down a piece of electronic equipment
      or an entire installation from a single point by activating a push button. EPO is employed in
      many applications such as industrial, telecommunications, information technology (IT), etc.
      This white paper describes the use of EPO for protecting data centers and small IT
      equipment rooms containing UPS systems. Various applicable standards that require EPO
      are discussed. Recommended practices are suggested for the use of EPO with UPS
      systems.
      "

    3. Re:Wait a second! by reed · · Score: 1

      Isn't that circumventing the purpose of the EPO? If there's a smoky fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?



      No, you have two, one for each power system, separated by enough space that it's hard to hit them both by accident, but easy for both to be hit in an emergency.
    4. Re:Wait a second! by Sialagogue · · Score: 1

      Sorry, but you confused me.

      It seemed as though they were talking in the article about putting a separate UPS system in place for their machines, one that is independent of the EPO system. It sounds to me like that would keep their machines on for four minutes even after one or both of the facility's EPO systems have been triggered, creating an electrocution danger.

      Are you suggesting that their UPS would have a separate EPO just for it? I don't think that's the case, because they specifically mentioned wanting to have a 4 minute window if the main EPO was hit. But if they did that would put them right back where they started, because although they'd have four minutes in case the main EPO got triggered, they'd still have their own brand new EPO button hanging out there just waiting to be triggered accidentally.

      Could you clarify?

      --
      The only acceptable defense of scientific results is to say that they were the product of the Scientific Method.
    5. Re:Wait a second! by psykocrime · · Score: 2, Interesting

      Isn't that circumventing the purpose of the EPO? If there's a smoky fire in there and the firefighters have to enter the room and start spraying water around, won't a few machines glowing for four minutes after the EPO was pressed put them in danger of electrocution? Or force them to wait four minutes before they can enter?

      It's not so much that the firefighters spraying water are worried about getting electrocuted via current conducting through the water itself... it's more about worrying about stumbling into a live wire that's hanging down from the ceiling, or cutting into a live wire with a vent saw, or getting caught up in one with a pike pole or something.

      Having been a firefighter for somewhere around 15 years, I'd say that I for one would not be particularly concerned about the small UPS's. That's not to say that they *couldn't* pose a danger... just that relatively speaking, they'd be a minor concern.

      --
      // TODO: Insert Cool Sig
  33. The reason why some NICs don't auto-neg by phaetonic · · Score: 2, Informative

    I have run across this issue in data centers numerous times. This still occurs with the latest hardware, no matter what vendor or OS. I have this problem on SunFire 280Rs and Compaq DL360s. What it comes down to is the switch being used in the data center and the settings in the OS. Typically, data centers set their switch to forced 100-full (unless of course they are using fibre or Gb). The OS must be set to force its NICs into the same mode, or they will drop a lot of packets. Sounds like a disconnect in communications between the NOC and the customer.

    1. Re:The reason why some NICs don't auto-neg by Anonymous Coward · · Score: 0

      Absolutely true. You'd think "autonegotiate" would mean just that, but if the switch side is hard-set and not autonegotiating on its end, the side trying to autonegotiate gets no answer about duplex and the standard behaviour is to fall back to half duplex.
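
      To make the failure mode concrete, here is a toy model, not anything a real driver does. It assumes the autonegotiating side can still sense the link speed from the signal but has no way to learn the duplex setting of a forced peer, so it assumes half. The `resolve` function and the `Link` tuple are illustrative names only.

```python
from typing import Optional, Tuple

Link = Tuple[int, str]  # (speed in Mb/s, "full" or "half")


def resolve(local_forced: Optional[Link],
            peer_forced: Optional[Link],
            best_common: Link = (100, "full")) -> Link:
    """Return the mode the LOCAL side ends up in.

    None means that side autonegotiates. Simplified assumption: against a
    forced peer the speed is detected, but duplex falls back to half.
    """
    if local_forced is not None:
        return local_forced          # forced side uses its setting as-is
    if peer_forced is None:
        return best_common           # both negotiate: best shared mode
    return (peer_forced[0], "half")  # speed sensed, duplex unknown


# Switch port forced to 100/full, server NIC left on autonegotiate.
nic = resolve(None, (100, "full"))
switch = resolve((100, "full"), None)
duplex_mismatch = nic[1] != switch[1]
print(nic, switch, duplex_mismatch)
```

      Both ends have link and agree on speed, which is why the problem shows up as dropped packets and collisions rather than as a dead port.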

    2. Re:The reason why some NICs don't auto-neg by caluml · · Score: 2, Informative

      That's what Compaq Lights-Out cards are for. Lovely things. Very handy.

  34. 13 yo? :P by Spy+der+Mann · · Score: 3, Funny

    Ran off skipping and giggling, like a 13 year old who just put toothpaste on the toilet seat?

    By any chance, was his name "Zero Cool"?

  35. OOB console access is the answer. by Mordant · · Score: 2, Insightful

    They ought to have out-of-band (OOB) serial-console access to their servers via a terminal server for any number of reasons, including this one; if they'd implemented OOB console access, they could've sshed into the terminal server, gotten onto the consoles of the servers in question, and used ifconfig to fix the duplex issue.

    Why they don't seem to grasp this is beyond me . . . anyone running a public-facing, high-volume service should have OOB access to all servers, routers, switches, firewalls, etc. . . . it's just common sense.

    1. Re:OOB console access is the answer. by Anonymous Coward · · Score: 0

      This is addressed in the document where it says forcing to Full/100 doesn't work. Instead they need to move it from one switch to another, negotiate Full/100 then move it back.
      Even better, they're replacing the damn NICs.

  36. HAH! by rah1420 · · Score: 1

    I told you so.

    Looks like my "Newbie Operator" found hisself a new job.

    --
    Mit der Dummheit kämpfen Götter selbst vergebens.
  37. 2 accounts of the powerloss by Spazholio · · Score: 4, Funny

    The one they tell you about and the real one.

  38. No! by Saeed+al-Sahaf · · Score: 2, Insightful
    embedded NICs...

    Who in their right mind goes with the on-board NIC in a server environment?

    --
    "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
    1. Re:No! by juuri · · Score: 2, Interesting

      Who in their right mind goes with the on-board NIC in a server environment?

      Are you kidding?

      How about everyone? Regardless of PC, Sun, Alpha or whatever hardware.

      --
      --- I do not moderate.
    2. Re:No! by Saeed+al-Sahaf · · Score: 1

      Does not mean it's a good idea! Not a single machine where I work uses the on-board NIC, from servers down to desktops. And all of our machines have a two year lifecycle, tops. We generally plug in a 3Com card of some type.

      --
      "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
    3. Re:No! by Anonymous Coward · · Score: 0

      Life in the rackmount data center is much different than your five computer home network. Space is a critical factor, as well as cost (but not so much).

    4. Re:No! by SenorChuck · · Score: 2, Informative

      On all of the (actual) servers I've worked with, the onboard NICs are exactly the same hardware that you get with the server-grade PCI NICs.

      --
      A wise person makes his own decisions, a weak one obeys public opinion. -- Chinese proverb
    5. Re:No! by ihaddsl · · Score: 1

      Please explain, why not use the onboard NICs? After all, for a respectable server we're not talking about your el cheapo embedded NIC as found on many desktop motherboards, but Intel e1000s, Broadcoms and others.

      Nothing wrong with using the embedded NIC's at all.

    6. Re:No! by grommit · · Score: 1

      That's all fine and well if you've got money to burn on unnecessary things but quite a few organizations have a budget that they need to adhere to. Sure, in some horribly sadistic way, I guess I can see some glimmer of a benefit to every machine having the same type of network card but the added time/expense/hassle of cracking open each and every case to put in a network card is just unimaginable to me. You can't possibly deal with many 1U chassis very often. I don't even think most blade servers have the ability to have an extra NIC installed on them.

      I guess what I'm trying to say is that what you're talking about doing is complete nonsense IMO.

    7. Re:No! by Saeed+al-Sahaf · · Score: 1
      Life in the rackmount data center is much different than your five computer home network.

      Thanks for the insult. But I'm not talking about home.

      --
      "Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
    8. Re:No! by caluml · · Score: 1

      DL360s have 2 onboard eepro100s in them. They have never failed on me.

    9. Re:No! by gl4ss · · Score: 1

      if the network chips are the same, and the onboard nics are made to be used, why not?

      i have hard time seeing you slapping pci cards into 1u servers anyways. or perhaps you slap 'em with usb2 nics....

      --
      world was created 5 seconds before this post as it is.
    10. Re:No! by darkwhite · · Score: 1

      Oh, I dunno, perhaps every single space-conscious datacenter user?

      Anything thinner than 4u either won't have space for an off-board nic or won't need it if it has a riser and is not part of a fiber network. For 99% of server uses, the benefits of an off-board nic are dubious when a halfway modern mobo is installed.

    11. Re:No! by mink · · Score: 1

      On-board PC-net chipsets have exactly 1 driver ever written for SCO Openserver (not my fault, I just have to support it) and it has issues with random TCP/IP lockups.
      My solution, since SCO and IBM were playing the blame game, was to disable it and put in a good, well supported 3-COM card.
      What sucks more is I'm just a 3rd party support guy and I had to pay for this out of my pocket or nothing would ever get done.

      --
      Well I've wrestled with reality for thirty five years doctor, and I'm happy to say I finally won out over it.
  39. Not millions of paying accounts. by EvilStein · · Score: 4, Informative

    Actually, most of the accounts don't pay. They're just freeloading whiners.

    This is a paste from the Livejournal stats:

    * Free Account: 5713743 (98.3%)
    * Early Adopter: 14220 (0.2%)
    * Paid Account: 94857 (1.6%)
    * Permanent Account: 1632 (0.0%)
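
    For anyone who wants to check the shares, a quick recomputation from the pasted counts. This assumes these four categories make up the whole account population, which the stats page may not; under that assumption the paid share still rounds to 1.6%, while the free share comes out nearer 98.1% than the 98.3% shown, so the page presumably uses a slightly different denominator.

```python
# Account counts as pasted from the LiveJournal stats above.
counts = {
    "Free Account": 5713743,
    "Early Adopter": 14220,
    "Paid Account": 94857,
    "Permanent Account": 1632,
}

# Assumption: these four categories are the entire population.
total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} ({pct:.1f}%)")
```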

    1. Re:Not millions of paying accounts. by thephotoman · · Score: 1

      However, I am among the paying whiners. Oh well...for one day without LJ entertainment (during which I was out of the house for the most part anyway, and therefore nowhere near a computer), I got two weeks more paid time for free.

      Pretty good deal in my book, I'd say.

      --
      Haec merda tauri est. Ceterum censeo Carthaginem esse delendam.
    2. Re:Not millions of paying accounts. by JoeNotCharles · · Score: 1

      How many of those free accounts are active, though?

    3. Re:Not millions of paying accounts. by EvilStein · · Score: 1

      http://www.livejournal.com/stats.bml

      It's offline right now, though. Big shock that is. heh.

    4. Re:Not millions of paying accounts. by metamatic · · Score: 1

      Where do those stats categorize people like me, who paid for accounts but had them deleted by abusive admins?

      --
      GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
    5. Re:Not millions of paying accounts. by zonker · · Score: 0

      yeah, i think their pages-about-peoples-cats service, errr... livejournal, should go down more often. it seems you and most of the human race will benefit from it. a win-win situation!

  40. Calling all disk cache experts. by turm · · Score: 1

    The article cites disk caches as a source of data-loss.

    They claim that their battery-backed RAID caches were safe, but that the actual drives themselves were performing unsafe write caching. It strikes me that this is the kind of thing that's quite easy to *suggest*, but far more difficult to *prove*.

    I don't have any first-hand knowledge of disk corruption due to write-caching. Is this a real problem or just some kind of legend? Can someone who has RTFA'ed and knows about disk caches please comment?

    This is somewhat irrelevant, but I've messed with some non-battery-backed RAID setups in the past. In these situations, it always made sense to me that the controller would set the individual drives' cache policy to match its own.

    1. Re:Calling all disk cache experts. by Anonymous Coward · · Score: 0

      I don't actively write kernel code, but I remember reading somewhere (kernel trap or something) that the linux kernel disables the disk write cache, since they're doing their own caching and work in the OS, and some drives don't follow intelligent practices regarding that cache. I'm sure a more knowledgeable person can comment.

  41. Its a Small World... by eieken · · Score: 1

    It seems that my company and LiveJournal host at the same datacenter here in Seattle. Looks like they got hit pretty hard when the datacenter with multiple redundant battery backups and generators had a massive cascade emergency power off, and every server in the building got shutdown at once. LiveJournal got hit the hardest, they had some IDE drives on their servers, doh! Looks like even multiple redundant battery backup with power generator datacenters are still vulnerable to dumbass electricians who don't know what they are doing. The datacenter has been under construction for the past few months too, so you KNOW that had something to do with it. Looks like we'll have to put a UPS in our cabinet at the multiple redundant battery back up and power generator datacenter housing, seeing as all that backup protection doesn't mean diddly squat.

    --
    Meet new people, and kill them.
    1. Re:Its a Small World... by radish · · Score: 2, Funny

      LiveJournal got hit the hardest, they had some IDE drives on their servers, doh!

      I was unaware that SCSI drives had the ability to run without power - thanks for the info!

      --

      ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"

    2. Re:Its a Small World... by CounterZer0 · · Score: 1

      Well duh, what did you think the 'battery backup' on the RAID card was for??

  42. Blame by bsd4me · · Score: 1

    Most of the time it is Stimpy's fault. The rest of the time it is Fry's fault. I think there may be a connection...

    --

    (S(SKK)(SKK))(S(SKK)(SKK))

    1. Re:Blame by GoatPigSheep · · Score: 1

      Yup, Billy West did the voices for both of them

      --
      GoatPigSheep, the 3 most important food groups
  43. No UPSes before? by iabervon · · Score: 1

    I'm surprised that they didn't have their own little UPSes to bring the system down cleanly before. Sure, the facility is supposed to provide power at all times, even if there's a power grid interruption, but that doesn't get tested very often and isn't under your control. Furthermore, in the event that the facility's power is actually going to go out, there isn't any way for the machines to find this out and shut down cleanly.

    1. Re:No UPSes before? by Nonesuch · · Score: 2, Informative
      I'm surprised that they didn't have their own little UPSes to bring the system down cleanly before. Sure, the facility is supposed to provide power at all times, even if there's a power grid interruption, but that doesn't get tested very often and isn't under your control. Furthermore, in the event that the facility's power is actually going to go out, there isn't any way for the machines to find this out and shut down cleanly.
      Unfortunately, this would defeat the purpose of the "Big Red Button", which is there to quickly and definitively cut off all power to all line-powered devices in the data center.

      When you've got an analyst smoking and twitching next to one of the racks as 110VAC courses through her veins, you don't want to have to go hunting to figure out which UPS is supplying the juice.

    2. Re:No UPSes before? by moggie_xev · · Score: 1
      all our DBs have redundant power supplies. we'll be plugging one side into Internap's, and the other side into our own UPS, which itself is plugged into Internap's other power grid. that way if EPO is pressed, we'll have 1-4 minutes to do a clean shutdown.

      I find it worrying that they are actually going to do that. If the place needs a big red button, it HAS to work. If it doesn't need it, don't have it.

    3. Re:No UPSes before? by gkuz · · Score: 1

      "Worrying"? This sounds like a code violation. An electrician can lose his license.

    4. Re:No UPSes before? by iabervon · · Score: 1

      The article says they're planning to have UPSes, and it's unlikely that nobody from their hosting facility reads livejournal. So what they're doing is probably okay (or at least, they'll be stopped if it's not). I suspect that the button has to kill everything that can supply a lot of power, but that single-computer 5-minute UPSes aren't a big deal. The way you fry analysts is when the batteries and generators that can run a whole data center short through the person, not with hardware you can pick up at the computer store.

  44. Switch location by cyberfunk2 · · Score: 1

    Aren't these sorts of switches usually behind little glass things that say "BREAK IN CASE OF EMERGENCY"?

    I mean I'm sure it's a big red button of some sort like the one we've got in our server room, but man, that's the sorta thing that needs a video camera aimed at it.

    Of course, if it was a malicious inside job, then there's not too much to do about it.

    I understand the REASON for an easily accessible switch like this, but would it be possible just to wire it into the fire system or something and not have a switch that just screams "touch me for a thrill"?

  45. Accidents happen by Migraineman · · Score: 2, Interesting

    About a decade ago, we had a series of "incidents" with the EPO button in the software lab. Shortly after a serious lab upgrade (due to constantly blowing breakers), someone decided to test the EPO switch (it was a bit of a novelty at the time). *click* "Cool, it works. Hey, how do you reset this thing?" Turns out you needed to have a key to reset it. It took about 4 hours to find someone who had the key. That one got replaced with the Mark II resettable switch ...

    About a month later, one of the managers was giving a prospective new-hire a tour. He got to the software lab, and started blathering about "don't ever push the red switch" as he put his finger on the switch ... *click*

    So some einstein decided that the Big Red Switch was "dangerous" and put a plexi cover over it - the same kind that goes over the thermostat control, and the same kind that has a key lock. Yep, about six months later we had a gen-you-ine emergency. One of the HP 9000/300 monitors went crispy, and was snorting smoke and sparks. One of the software folks went to hit the Big Red Button, but was somewhat nonplussed to find a locking cover over it. She took the co-located fire bottle, sheared the cover off, pressed the button, then got to use said fire bottle on the monitor.

    So the cover gets replaced again, though this time with a non-locking cover. At some point, the software server stack needed to be relocated into the corner with the Big Red Button. Another einstein discovered that it was inconvenient to slink behind the equipment rack - the cover kept bashing him in the neck or shoulder. So he removed it, thinking that accidental presses wouldn't happen because the button was obstructed by the server stack. (yep, inaccessible = useless.) Some time later, the equipment was being jockeyed for an upgrade, and one of the big SCSI cables snagged the Big Red Button and *click* ...

    All these shenanigans happened in the space of one year, and I got tired of the thrash. I measured the space between the back of the switch and the faceplate - just over 3/4 inch. I cut a horseshoe shape out of 3/4 plywood, and hung it on the switch shaft. In an emergency, it's really easy (and obvious) to remove it. Gravity keeps it there otherwise. No problems since ...

  46. LJ IS TEH LITTLE GIRL HOLE! by Turn-X+Alphonse · · Score: 1

    Maybe people will see this and realise the LJ staff are geeks, unlike most of their fanbase, so while you may be mocking their minions, they can still bring down a server looking at a single article with the rest of us slashdotters.

    --
    I like muppets.
  47. Seen that happen before... by patmandu · · Score: 1

    Way back when, I was working at an IBM site (STF) that had a boatload of mainframes and equipment on a raised floor area that was badge-access only. Every summer we'd get interns to learn the finer points of computer science by doing things like bursting printouts from the lineprinter and delivering them. Seems that the intern introductory tour had gotten a bit lax... One day a cleaning person knocked at the door to the raised floor to get let in to empty the wastebaskets. Nobody else was around, so one of the interns decided to let them in. Of course they pushed The Big Red Switch that was right next to the door. Oops. Whole floor went down...hard, about 10% of the stuff didn't come back up when the power was restored. Not fun...

    They revised the introductory tour a bit, and added a label to the EPO switch.

    (And no, it wasn't me who hit the button...)

  48. happened to us by bwindle2 · · Score: 1
    We have one of those Big Red Buttons in our datacenter (about 7 feet up on the wall, so no one could accidentally bump into it). About a year after it was installed, an electrician showed up to do something in the ceiling, and accidentally leaned his ladder up against our exposed Big Red Button.

    Needless to say, we now have a cover over our Button. Funny thing is, the electrician who installed the original button is also the guy who leaned his ladder against it.

    1. Re:happened to us by TomHandy · · Score: 1

      Wait a minute............. that's not funny at all!

  49. You, sir, are an idiot. by Anonymous Coward · · Score: 5, Informative

    Go ahead and read up on how auto-negotiation works. I'll wait...

    No, really. Go read up on it...

    Okay, since you don't bother reading up on it, and since you claim that someone's cheeky because they *document* what happens when you misconfigure a connection, I must conclude that you, sir, are indeed an idiot.

    (To summarize for those of you who won't bother to look it up, a NIC can sense the carrier for 100, so it can differentiate 10/100. Full and half are actively negotiated by the two sides of the connection. If side 'A' is hard set to 100/full, it won't negotiate with the other side. Hearing no negotiation, side 'B' will assume the NIC doesn't support full duplex connections and fall back to half duplex. This is the proper, standardized, documented behavior. Anything else would require the psychic interface spec that *still* hasn't been finalized.)
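
The fallback rule described above can be modelled in a few lines. This is only a sketch of the logic as the comment states it, not a real driver or switch API: the `Port` class, its fields, and `resolve` are made-up names, and it assumes both ends are capable of 100/full when they do negotiate.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Port:
    """One end of a link (illustrative only, not a real NIC interface)."""
    autoneg: bool              # does this port take part in negotiation?
    speed: int = 100           # forced speed in Mb/s, used when autoneg is off
    full_duplex: bool = True   # forced duplex, used when autoneg is off


def resolve(local: Port, peer: Port) -> Tuple[int, bool]:
    """Return (speed, full_duplex) as the local port ends up configured."""
    if not local.autoneg:
        # Hard-set port: it uses its forced settings and ignores the peer.
        return local.speed, local.full_duplex
    if peer.autoneg:
        # Both sides negotiate; assumed here to agree on 100/full.
        return 100, True
    # Peer stays silent: speed can be sensed from the carrier,
    # duplex cannot, so the local side falls back to half duplex.
    return peer.speed, False


forced = Port(autoneg=False, speed=100, full_duplex=True)
auto = Port(autoneg=True)

print(resolve(forced, auto))  # (100, True)
print(resolve(auto, forced))  # (100, False)
```

The two printed results are the duplex mismatch: one end runs full duplex, the other half, and neither has done anything the rule does not allow.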

    1. Re:You, sir, are an idiot. by Undertaker43017 · · Score: 1

      OK, assuming you are correct, then why does every other NIC/switch vendor on the planet seem to have no problem with auto-neg and Cisco does?

      I have never seen this problem with Foundry switches, only Cisco.

    2. Re:You, sir, are an idiot. by Anonymous Coward · · Score: 0

      Anything else would require the psychic interface spec that *still* hasn't been finalized.
      This has been Crossing Over, with John Edward... sponsored by the new psychic interface NIC to allow your computer to talk to the dead, and operate at full duplex without auto negotiation....

    3. Re:You, sir, are an idiot. by ghjm · · Score: 1

      Because with most other NIC/switch brands, you can never really, truly disable autonegotiation. When you hard-code the speed and duplex, it's more of a suggestion. They still run the negotiation to find out what the other side is doing. With Cisco gear, if you hard code the settings, then it just ignores any autonegotiation, just like in the dark ages.

      I'm not defending Cisco - I think their approach is wrong. But it's at least understandable that they are taking the side of the MCSITW (most conservative sysadmin in the world), who would want to hard-code settings on all sides and ALSO find a way to ensure that autonegotiation could NEVER POSSIBLY happen, even if random electrical noise somehow convinced one device that maybe the other one was trying to negotiate with it.

      -Graham

    4. Re:You, sir, are an idiot. by sparkz · · Score: 1
      See Sunsolve. The IEEE specs are open to various interpretations; this can lead to Gb interfaces going to 100/hdx or other dodgy configs. See also Cisco's website for their take. (Also see here.)

      Cisco seem to recommend autonegotiation; Sun recommend forcing the speed/duplex.

      We've had problems in the past with Sun's "ce" fibre cards and Cisco Catalyst switches. It's not that either implementation is "wrong", the specs simply are not specific enough.
      Sorry, can't find the detail in the spec which causes the problem.

      --
      Author, Shell Scripting : Expert Re
  50. BIOS Config by art3d · · Score: 1

    This reminds me of the time when I had a server that would not reboot because there wasn't a keyboard plugged in, and I did not change the setting in the BIOS.

    Brian.

  51. Big red buttons by cbr2702 · · Score: 1

    So make a little black button and know where it is, but also make a big red one that turns off the lights. That way you get to yell at little kids without much harm to your system.

    --


    This post written under Gentoo-linux with an SCO IP license.
  52. They're attention whoring by EvilStein · · Score: 1

    Plain and simple. People notice a "historical post" and they want to have their LJ face right up there in it.

    Total kissasses. I wonder how many of them are paid members vs free accounts.
    Remember, the overwhelming majority of Livejournal users are *NOT* paying customers...

    Account Types

    What type of account do people have?

    * Free Account: 5713743 (98.3%)
    * Early Adopter: 14220 (0.2%)
    * Paid Account: 94857 (1.6%)
    * Permanent Account: 1632 (0.0%)

    1. Re:They're attention whoring by mdwh2 · · Score: 1

      Plain and simple. People notice a "historical post" and they want to have their LJ face right up there in it. Total kissasses. I wonder how many of them are paid members vs free accounts.
      Remember, the overwhelming majority of Livejournal users are *NOT* paying customers...


      But how does that support your argument? If anything, I'd say it's the other way round - people are showing their thanks, and someone who uses a service for free has more reason to be grateful for the work being done for them. A paid user would expect it to have worked in the first place.

  53. No, it did not by EvilStein · · Score: 1

    They're required by law to have it. It's a building code thing. Every data center I've ever been in has one.

    Also... "EPO, by the way, stands for Emergency Power Off and it's a national fire/electrical requirement for firefighters to be able to press these big red buttons near all exits that turn off all power in the entire data center."

  54. This is what happens... by MsGeek · · Score: 1

    ...when you buy crappy kit. Next time do it right.

    --
    Knowledge is power. Knowledge shared is power multiplied.
    1. Re:This is what happens... by Anonymous Coward · · Score: 0

      Asus sucks... I've had trouble with three out of three Asus motherboards, and one out of one Asus graphics card, which Asus had the audacity to tell me it wouldn't replace because I had to send it back to the reseller first... anyone who calls Asus a tier 1 vendor anymore has been into the wacky weed... sure, some of their products are okay, but their QA and RMA departments all got fired to 'cut costs', so bottom line, they aren't a tier 1 vendor.

  55. You mean "The Big Red Button" by rednip · · Score: 1

    A couple of years ago, when our server room was being 'certified', one of the specific checks was "No big red button, check". One of the guys in the group came up with a story about how someone's kid at the end of a 'tour' thought that the 'big red button' was meant to be pushed.

    --
    The force that blew the Big Bang continues to accelerate.
  56. Cabling? by redelm · · Score: 1
    OK, this _shouldn't_ apply to a good, reputable datacenter that has structured wiring to TIA/EIA-568 running gigabit.

    I most often see autoneg problems with faulty cabling (split pairs from crimps). 98% of newbies cannot get it right, and they aren't to blame because the standards are counter-intuitive unless you've worked for Ma Bell for 40+ years. I beware of all field crimps.

    OTOH, I saw one example of a Crisco Crapalyst router not wanting to play with some devices. Of course they blamed the device, but I never had any problem with interconnects or using cheap @$$ switches, so I wonder why the expensive @$$ switch gets huffy.

  57. Re:Don't forget... by vadim_t · · Score: 1

    Nonsense. I had my server up for 360 days without rebooting, with kernel 2.4. It had 360 days on the uptime counter. I only shut it down because it was too slow for the newer stuff I wanted to run.

  58. Re:Don't forget... by Anonymous Coward · · Score: 0
    You fucking asscock. Goo gobbling knob handler! There are a number of things wrong with your poast and I shall berate you every step of the way dickwad:


    There is a nasty bug in Linux that makes the computer reboot every 49.7 days. The worst part is that this bug has been around for almost 10 years...


    WTF???!!! Only if you're a cretinized asshandler like yourself. My last big uptime on Linux was 131 days and that ended because of a hard drive failure. You lie like a fucking $2 whore for SCO because there is NO bug that makes computers reboot every 49.7 days. Take my advice, back away from the computer and go back to sucking on Darl's limp cocktail wiener.
    What good is a million eyes looking at the code if they are attached to half a million idiots?


    Those million eyes have more intelligence than an infinite number of copies of you. If we took billions of copies of you and stuck them in a room with typewriters, we might wind up with a collective "ungh". Stupid pudthumping moron! Quit jacking off over pictures of your mom and realize that you are worthless and a complete and utter failure as a troll.
    I guess most people don't realize this because they need to recompile their kernel every other week, or they use Linux only to boot into illegal copies of Windows.


    N. You're wrong again you chocopipe loving jackhole. Most people running Linux are too busy having a real life (ie. screwing our wives/girlfriends, hanging out at our favorite bars/clubs. or even watching a game on TV) instead of patching Windows worms like all SCO lovers are apt to be doing. Because in the end, you don't hate Linux, you don't love SCO, you love the cocks of Bill Gates and Steve Ballmer. Congrats because you've just set the bar even lower than it's ever been before. Good job on making the world that much more retarded with your stupid troll. Good work you assmunching turd goblin.
    Well, you get what you pay for. Is this SCO Linux that you talk about any better for $699?

    Proof that you are the most unoriginal troll to ever hit the pages of Slashdot. Hey kid, why don't you spend some time learning about being a fucking adult and get some manners. Then maybe, just maybe, you can learn how to troll from the masters. Get real with yourself fuckass. You don't know what the hell you're talking about. You are a complete failure. You are ugly and unloved and a miserable little nutsack. Fuck you. Go to hell. Have a horrible day. Screw off jackass. Go do me a favor and jam a hot soldering iron up your diseased ass. Bye bitch.
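
For what it's worth, the 49.7-day figure in the quoted post is the time an unsigned 32-bit counter of milliseconds takes to wrap around. A quick check of the arithmetic follows; it says nothing about whether any particular kernel actually reboots at that point, and the function name and parameters are just illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_wrap(bits: int = 32, ticks_per_second: int = 1000) -> float:
    """Days until an unsigned counter of the given width overflows."""
    seconds = (2 ** bits) / ticks_per_second
    return seconds / SECONDS_PER_DAY


print(round(days_until_wrap(), 1))                      # 49.7
print(round(days_until_wrap(ticks_per_second=100), 1))  # 497.1
```

A counter ticking 100 times per second lasts ten times longer before wrapping, which is why the tick rate matters as much as the counter width.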

  59. The result. by Pathetic+Coward · · Score: 1

    (a) Manager that pushed the "off" button gets promoted.
    (b) Engineers that spent their weekends getting the system back up: off to India with your jobs!

    1. Re:The result. by 6Yankee · · Score: 1

      (a) Manager that pushed the "off" button gets promoted.

      About a year back, we had a power failure at our place one morning - maybe 20 seconds' worth. All the servers fell over horribly and refused to come back up.

      When we came back in the following morning, we found everything working - and an email from our MD to the entire company, thanking the IT people for the heroic job they'd done in getting everything back up. Apparently they'd been there till 2am, poor little lambs. So that was nice.

      We later (a lot later!) found out that the reason everything had fallen over so horribly was that nobody had thought to test the UPS battery for six months, and it had quietly died. But IT were heroes.

      The MD who sent that email is now in charge of IT for our parent company's parent company.

      True story.

    2. Re:The result. by Anonymous Coward · · Score: 0

      The MD who sent that email is now in charge of IT for our parent company's parent company.

      Barclay's?

  60. Oblig. Homer Simpson by The_REAL_DZA · · Score: 1

    "Awww, I don't know why we even have a jug!"

    --


    This space intentionally left (almost) blank.
  61. Power-smart PCs by Pfhorrest · · Score: 1

    I'm waiting for the day that machines come built such that when the power dies, an emergency battery kicks in just long enough to dump the RAM state to a nonvolatile cache, and then when power resumes, restore the system from there. Like VirtualPC.

    Heck, having that be a user-accessible feature supported by the OS ("Save and Shutdown") would make a lot of sense too.

    --
    -Forrest Cameranesi, Geek of all Trades
    "I am Sam. Sam I am. I do not like trolls, flames, or spam."
    1. Re:Power-smart PCs by Nonesuch · · Score: 1
      I'm waiting for the day that machines come built such that when the power dies, an emergency battery kicks in just long enough to dump the RAM state to a nonvolatile cache, and then when power resumes, restore the system from there. Like VirtualPC.

      Heck, having that be a user-accessible feature supported by the OS ("Save and Shutdown") would make a lot of sense too.

      Way back in the days of Windows 3.0, there were actually ISA cards available which could provide exactly this feature.

      Some of the "mini" versions of multi-user systems from the 1980s had similar features, so when you accidentally kicked the power cord out from the wall, you didn't abend sessions for the whole department.

  62. Nah, just Google "disk write cache" by rjamestaylor · · Score: 1
    check out this article on write cache.

    Lazy writes allow for faster system operation and have only one detrimental downside: in a poweroff or unexpected reset the data waiting to be written won't be. As bad as that sounds, the performance gains during normal system operations usually overcome fears of this data loss potential.

    It boils down to this: if every bit of data is crucial, disable write cache. If performance is paramount and some tolerance exists for infrequent data loss due to catastrophic failures, enable it. LiveJournal evidently wanted your normal experience to be pleasantly quick rather than painfully accurate.
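
The same speed-versus-safety trade-off exists one level up, in the operating system's own write buffering. A minimal sketch of forcing a write out before carrying on, using Python's standard `os.fsync`; the `durable_write` helper and the file name are made up for illustration, and a drive with its own write cache enabled may still be holding the data after the call returns, which is the case described above.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to the storage device before returning."""
    with open(path, "wb") as f:
        f.write(data)          # sits in the userspace buffer
        f.flush()              # handed to the OS cache
        os.fsync(f.fileno())   # OS asked to flush it to the device


# Hypothetical usage: a scratch file standing in for a journal entry.
path = os.path.join(tempfile.mkdtemp(), "journal_entry.txt")
durable_write(path, b"post body\n")
```

Calling `fsync` on every write is the "painfully accurate" end of the scale; skipping it is the "pleasantly quick" end.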

    --
    -- @rjamestaylor on Ello
  63. Make the Luser pay by sconeu · · Score: 1

    I assume that they will have the responsible luser pay for the down time plus the 2 weeks credit plus the extra hours for the staff to bring the system up.

    And what the hell was a visitor doing playing with the Big Red Button anyways?

    --
    General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
  64. Re:Don't forget... by Anonymous Coward · · Score: 0

    Huh? I think the post you replied to was doing the trolling. It was much more effective because it was much longer and more biting than the parent post. If anyone was trolled it was you because you responded. I'm guessing you don't understand how trolling works. Based on your user ID number, I'm guessing you're new here. I've been here since the 5000s. The key to master trolling is to make your post incredibly long, use a lot of profanity and ad hominems, and appear to be extremely upset. Then watch as the losers come in and try to get you back for what you said. I've been trolling on Slashdot myself for the past few years just to relieve the monotony. Trust me. You are the loser and the post you replied to is the success.

  65. Photo of the button by teneighty · · Score: 2, Insightful

    Apparently this photo is an example of the button that was "accidentally" pressed.

    I'd love to hear the explanation for this "accident".

  66. They are stupid by Anonymous Coward · · Score: 0

    MySQL: Rebuilding indexes that automatically corrupt on power interruption
    Lying disk controllers: corrupted data on power interruption

    Try using decent hardware and a decent database (PostgreSQL)

  67. Actually by Anonymous Coward · · Score: 0

    Some LJer showing her boobs accidentally hit the EPO switch. "Hey, look at me!! OOOPS"

  68. big picture by Anonymous Coward · · Score: 0

    there's a big picture livejournal seems to be missing - it's great they built all this redundancy into their systems INSIDE THE SAME DATACENTER.
    they should also be investigating setting up either a disaster recovery site w/ fast failover, or another facility for an active/active configuration. throw in some 3DNSes for WIPs, etc etc. that might help guard against a facility experiencing the stupidity of one of their customers, act of God, etc.

  69. Nothing wrong with onboard NICs in "real" servers. by Nonesuch · · Score: 2, Informative
    Does not mean it's a good idea! Not a single machine where I work uses the on-board NIC, from servers down to desktops. And all of our machines have a two year lifecycle, tops. We generally plug in a 3Com card of some type.
    The smallest of the Sun 1U rackmount Sparc servers do not even have a PCI slot to take a NIC -- no expansion at all, but two on-board 100M interfaces are plenty for most data center deployments of these small boxes.
  70. Re:Don't forget... by Anonymous Coward · · Score: 0

    I found your picture on your first excursion out of the house, you sack of semen-laced shit:

    http://www.critical.ch/src/linux_nylug_booth.jpg

    You sure have a nasty mouth, did you learn to speak like that when your dad was fucking your ass as a kid?

    Linux is shit and unstable and you know it, you disgusting scat-eater. But you run it because you love getting that nice monitor tan while endlessly patching and reading useless discussions of how Linux is going to make it one day.

    Come out of that shit hole you live in, poopy-dicked fag-bagger.

  71. yeah by Anonymous Coward · · Score: 0

    yeah. i had a linux server up for 72 hours once, i know 'cause i stayed up for three days watching xclock. i only brought the server down because i wanted to move my mouse

  72. Re:Don't forget... by Anonymous Coward · · Score: 0

    Why don't you say that to my face you limp dicked bitch? I'll tell you why. Because you know if you did, your teeth and glasses would wind up in your asshole.

    I don't have to "read useless discussions" about Linux making it one day. It's already made it for me for the past ten years you stumphumping steaming pile. It's a good thing that we're both posting AC or I would hunt you down since I'm pretty sure you're a pencil-necked geek or some fat Winblows luser in your mother's basement. Come back when your voice stops cracking and you grow a set.

  73. It happens by boodaman · · Score: 1

    This happened to us last year in our datacenter.

    The Facilities manager had some guys in to install shelving to store toner, cables, etc.

    Our datacenter is divided into two sections, inner and outer. All CPUs, UPSs, HVAC, etc are in the inner room. The outer room is shelving, desks, CCTV (security), etc.

    The EPOs are near every door, as they should be, including the outer doors. Some guy, while installing the shelves, decided to take a little break and lean against the wall, leaning on the EPO in the process.

    It took us about 10 minutes to figure out what the hell happened, because even the generator didn't fire as it should. Meanwhile, the shelving guys were just merrily installing shelves. When asked, the guy just said he didn't realize anything was wrong and just thought it was nice that everything "got so quiet" all of a sudden.

    Like LiveJournal, we promptly installed cages over the EPO buttons.

  74. Re:Don't forget... by orderb13 · · Score: 1

    Ahh, I always wondered how one trolled. I feel better now.

  75. Damn skippy by ShatteredDream · · Score: 1

    You sure hit the nail on the head son. I am glad that you recognize that I am without bias, opinion or a tendency to propagandize for my side. My reporting is beyond reproach and I cannot even fathom how someone could insinuate that things might be to the contrary...

    1. Re:Damn skippy by metalhed77 · · Score: 1

      I hadn't actually read your weblog (I didn't find anything political, despite your sig, on the first few entries). So I figured you were one of those 'meet in the middle means I'm guaranteed to be right' kind of people. Apologies for that.

      Still, LJ doesn't deserve to be bashed. It doesn't even pollute google like the political blogs do!

      --
      Photos.
    2. Re:Damn skippy by Anonymous Coward · · Score: 0

      He means you suck, and your weblog wasn't interesting enough to warrant more than a two-second glance, because it is both generic and at least as tired, old and hackneyed, if not more so, than anything you could rag on Livejournal about, and therefore your post is a paragon of clueless hypocrisy.

      Sure, he prettied up the language a bit, but the essential point is the same.

  76. heh - most of the delay was because of mysql by ignorant_newbie · · Score: 1

    MyISAM vs InnoDB: When you lose power to a MySQL db w/ MyISAM tables, the indexes are generally messed and you need to rebuild.

    When will people stop using this POS for production environments? Do you drive to work in your kid's toy car just because it's cheaper? No. You get the best car you can afford. Do you use FAT32 for your production servers? No. You use reiser or ffs+softupdate.


    So - if they'd spent the extra 10 minutes it takes to learn how to program a real database, they'd have come right back up with maybe 5 min of transactions needing to be replayed.

    1. Re:heh - most of the delay was because of mysql by smitty45 · · Score: 1

      like Yahoo, when they use the "POS" database ?

    2. Re:heh - most of the delay was because of mysql by Anonymous Coward · · Score: 0

      you get the best car you can afford.

      Bad example.

      In the case of the car, it depends whether you are trying to impress someone with it or drive it. I've always bought the cheapest, most reliable car I can find. If I can maintain it myself, so much the better.

      I guess databases are sort of the same way. There's always someone out there trying to sell you some ridiculous, bloated SUV of a database with oodles of useless features instead of something simple and reliable that uses half the resources and a quarter of the space to carry the same load.

  77. I think the real question by Kaisum · · Score: 1

    Is why do you have an emergency shut down for a bunch of journals? Dear God Jim! The hax0rz have gotten the journals! Shut them down, now!

  78. bad auto negotiation happens a lot by oogoody · · Score: 1

    > OTOH, I saw one example of a Crisco Crapalyst

    We always had problems with auto negotiation and the Crapalyst. It wasn't wiring or the workstation either. Whenever there was a performance problem it was almost always in the switch.

  79. Not just a computer issue by lrucker · · Score: 1
    This happened to a friend of mine in a manufacturing plant. They had machines that made plastic cups, and every so often the machine would jam, the operator would hit the little switch that was right next to him, clear the jam, and go on. So they hired a new operator, my friend explained the procedure, left the guy alone. Shortly thereafter, the machine jammed, new guy panicked, ran across the room to the Big Red Switch, hit it, and cut power to every machine in the plant. It took the rest of the day to get the machines all running again.

    The new guy's first day was also his last.

    1. Re:Not just a computer issue by pe1chl · · Score: 1

      The new guy's first day was also his last.

      Of course it should have been your friend's last day...

  80. Button Vs key by phorm · · Score: 1

    Simple solution to this one. At work we don't have a kill button. We have a kill key. It takes a little bit more work to "insert key" and "turn", but it's better than having incidents like this wherein somebody hits the big red button.

    Plus, you can give the key only to people that aren't idiots. With the big red button, you'll inevitably get somebody who thinks "hmm, wonder what would happen if I pushed this big red button duhhhhhh."

    1. Re:Button Vs key by dismayed · · Score: 1
      National Electrical Code requirements mandate that you have EPO for any data center or similar facility. The "big red button" is also an informal standard so that fire fighters can quickly identify and disconnect electricity if need be.

      See this document, Understanding Emergency Power Off, for more information.

    2. Re:Button Vs key by AndrewRUK · · Score: 1

      And by having it need a key, you are missing the point of it being an Emergency Power Off. A hypothetical scenario for you: Which is least bad, "Oh shit, who's got the key for the power off, Phorm's getting electrocuted! Oh, too late..." or *thunk* *whirr* "You ok?"

      If you need to cut the power in an emergency, you want to do it now, not in five minutes, when someone's found the key.

    3. Re:Button Vs key by phorm · · Score: 1

      Must be some hefty power going on then. I suppose in higher-class server rooms you might actually have a mains or something greater than what we have here. The key in our case is to shutdown for maintenance ... so I guess I missed the point of this one. We do have a mains room but I don't remember a button, it's probably a lever or breaker.

      You could still use something similar to the key by having a rotating power switch (that is, one that turns clockwise). Nothing to be inserted, it's still fast, but it's harder to accidentally bump off. Nicer than having a glass/lexan cover too, since there's nothing blocking it.

      I've seen some buildings which have a 90-degree-turning lever which works along the same lines... still seems better than a button to me but not as good as a rotating switch (most are big enough that somebody could accidentally push down on the lever).

  81. EPO by trolman · · Score: 1
    I have worked inside hundreds of data centers and have designed a few EPO systems and have never ever had a case of an EPO button press being used to save a life. NFPA really needs to look at this again and decide how easily a facility's power can be turned off in an emergency. Only a lot of negative feedback will make that happen. Geez, even power distribution panels have key locks to prevent tampering.

    Anyhow; I have seen EPO activations ranging from the malicious to a simple slam of the door and never once has it saved a life. So what? if a monitor smokes.

    Until then: Place the redundant part of your system in a separate room, building, or country.

  82. The BSD box PSU probably had bigger capacitors.. by EMIce · · Score: 1

    Bigger PSU capacitors = a machine less likely to crash or shut down during a brownout. I mean, after all, their job is to buffer power fluctuations. I doubt it had much to do with the OS.
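
    To put rough numbers on that: a capacitor that sags from one voltage to a lower one gives up E = C(V1^2 - V2^2)/2 of energy, so the ride-through time at a constant load is about E / P. A minimal sketch follows. The capacitance, voltages and load are made-up illustration values, not measurements from any box in this thread, and it ignores converter losses.

```python
def holdup_seconds(capacitance_f, v_start, v_min, load_w):
    """Approximate time a bulk capacitor can carry a constant load.

    capacitance_f: bulk capacitance in farads
    v_start:       capacitor voltage when input power disappears
    v_min:         lowest voltage at which the supply still regulates
    load_w:        power drawn from the capacitor in watts
    """
    if v_min >= v_start:
        raise ValueError("v_min must be below v_start")
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    # Usable energy is the difference between the stored energy at the
    # two voltages: E = 1/2 * C * (V1^2 - V2^2).
    usable_joules = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return usable_joules / load_w


if __name__ == "__main__":
    # Hypothetical values: 470 uF vs 940 uF of bulk capacitance,
    # sagging from 320 V to 250 V under a 200 W load.
    for cap in (470e-6, 940e-6):
        ms = holdup_seconds(cap, 320.0, 250.0, 200.0) * 1000
        print(f"{cap * 1e6:.0f} uF -> {ms:.1f} ms of ride-through")
```

    Under these assumptions, doubling the capacitance doubles the ride-through time, which fits the idea that the hardware rather than the OS decided which box stayed up.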

  83. I was right! by Megane · · Score: 1
    Ha ha haaaa!
    All right, who did it? Who pressed the shiiiny, candy-like history eras... I mean emergency stop button?

    Or maybe I've just been reading too many episodes of BOFH lately.

    --
    #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
  84. Big Red Button stories by sparkz · · Score: 1
    I'm pleased to say that I've never (AFAIK) been the culprit, but I've been around for a few goodies - two being the classic "but I thought that was the door-release". One of these just hit the Big Red Button as someone happened to be entering, so the door opened, and the culprit wandered out of the machine room without noticing that it all went dark and quiet behind him.

    The second was a guy who was on his first day of work with us. A Big Boss came towards the machine room, so - feeling helpful - the new guy opens the door for him... or so he thought.

    My favourite story (though I wasn't there) is about some old DEC machines, which apparently had the power switch about 6" from the floor. Nobody knew why they kept crashing at night, until someone spotted a cleaner ramming a vacuum cleaner right up to the servers.

    That beats the one we had, when I used to do a lot of soak-testing of machines in a lab - I'd kick off a test on a Friday night, come back in on Monday to find the machine had rebooted. Nothing in the logs, just looked like the power had died, and returned again half an hour later. Other machines on the same power supply were fine.
    It turned out that the cleaners were unplugging the servers, so they could plug in the vacuum cleaner!

    --
    Author, Shell Scripting : Expert Re
    1. Re:Big Red Button stories by Hognoxious · · Score: 1
      It turned out that the cleaners were unplugging the servers, so they could plug in the vacuum cleaner!
      There's an urban myth about the 'bed of death' in a hospital intensive care ward - they even had a priest exorcise it - it was actually the cleaner disconnecting the respirator.
      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    2. Re:Big Red Button stories by mink · · Score: 1

      The important detail of that story is the outlet the cleaner was using was labeled VAX.

      --
      Well I've wrestled with reality for thirty five years doctor, and I'm happy to say I finally won out over it.
  85. This happened to us once! by Anonymous Coward · · Score: 0

    Power went out in the server room in the middle of the day. No one could figure out why for like 30 minutes. The breakers were fine. Finally the electrician (!!!) had traced the outage to the red emergency switch located kind of out-of-the-way. The switch has never been used so no one suspected it.

    One of the guys in the IT dept. is so incompetent his mere presence warps the space-time continuum. That's a topic for some other time. Anyway, while we were all discussing the incident, he kept referring to the Red Switch(tm) as the halon fire extinguisher switch. The electrician had intervened like 4 times to correct him and explain to him that the switch was the electrical kill switch. But he still kept saying "halon". That was weird, I thought.

    Later that day, the same guy made the boss go out and buy him a fire extinguisher for his desk because he thought we had a fire earlier and he wanted to be prepared. He made it an HR issue. Which was also weird.

    There was no fire.

    No one had ever been able to prove it. But from the looks of it, the incompetent dude had thought that there was a fire, and had hit the emergency electrical kill switch because he thought it controlled the halon system. And then he either did not make a connection between his actions and the power going out, or he decided to cover it up and not tell anyone that he hit the switch. He never confessed.

    He still works here...

  86. Re:Don't forget... by Anonymous Coward · · Score: 0

    As another troll (who prefers to go light on the profanity in order to garner mod ups and unleash my crap onto the +1 threshold) the goal in trolling is to create chaos while having fun. No matter how long and stupid the argument gets, as long as you're getting off, you're winning. This is why most trolls are perverts. Post pix pls.

  87. Database corruption courtesy of MySQL by Frank+T.+Lofaro+Jr. · · Score: 1

    If they used PostgreSQL they wouldn't have had to deal with rebuilding indexes, etc.

    There are real-world reasons to use an ACID compliant database!

    --
    Just because it CAN be done, doesn't mean it should!
  88. No, You sir are the idiot. by bano · · Score: 1

    It is a commonly known fact that cisco autoneg sucks ass.

  89. Well... by Daimando · · Score: 1

    Somebody's been hiring Stooges to guard that button. Bunch of lousy idiots.

  90. Re:The BSD box PSU probably had bigger capacitors. by nbert · · Score: 1

    yes, that's how it works. I used to have a computer which I could turn off for a quarter of a second without causing it to reboot. As you might suspect, I discovered this behavior by accident.

    On a related note, a brownout isn't desirable and can cause a situation which is commonly called a loss of power. I really don't understand why some people here don't see the difference between powering off and an unintentional drop in voltage.
    Since it's not exceptional to have brownouts (some elevators cause them btw) there are standards for PSUs on how much they can take before they can't supply anymore. Good computer magazines simulate brownouts when they test PSUs and the cheap brands usually fail miserably.
    That's why GP's link is so funny after all - even the best OS in the world will fail if the motherboard, CPU or other peripherals don't have any power.

  91. Molly Guard by xixax · · Score: 1
    Maybe they should have put a cover over the damn button then. Morons.

    They need a Molly Guard
    --
    "Everything is adjustable, provided you have the right tools"
  92. Who cares about livejournal? by Anonymous Coward · · Score: 0

    It's just a bunch of stupid furries that whine to eachother how nobody understands them, how sexy balto is, and pictures of them having anal sex with eachother in their lame assed fur suits. eff em.

  93. "Things we're doing to avoid this crap..." by sakusha · · Score: 1

    I noticed one thing conspicuously absent from their list of "Things we're doing to avoid this crap in the future..." That item is:

    "Put a big sign next to the EPO button saying 'Do NOT Press This Button, it cuts off power to the entire building, it is not a light switch nor a door switch. Push this button only if your life is in danger. If your life is not in danger and you push this button, your life WILL be in danger.'"

  94. It's their own fault by Anonymous Coward · · Score: 0

    Sorry folks, but their IT folks are stupid. Just about every major vendor and halfway decent IT/OS/network idiot out there should know.

    NEVER LET YOUR SERVERS AUTO-NEGOTIATE !!!!!!

    Set the switch ports to a specific speed and force your server NICs to that same speed.

    AUTO-NEGOTIATE SHOULD ALWAYS BE OFF !!!!!!!!!!!

    Not to be an a##hole, but come on... ?
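
    For anyone who wants to script this, here is a rough sketch of the server side, assuming Linux boxes with the stock ethtool utility. The interface name, the speed and the duplex setting are placeholders (duplex isn't mentioned above, but ethtool lets you pin it at the same time), and the switch port still has to be set to match by hand.

```python
import shlex


def ethtool_force_link_cmd(iface, speed_mbps, duplex="full"):
    """Build an ethtool command that pins link speed and duplex on iface.

    Only 10 and 100 Mb/s are accepted here; this sketch makes no
    attempt to handle gigabit links.
    """
    if speed_mbps not in (10, 100):
        raise ValueError("speed_mbps must be 10 or 100")
    if duplex not in ("half", "full"):
        raise ValueError("duplex must be 'half' or 'full'")
    # 'ethtool -s' changes settings; 'autoneg off' disables auto-negotiation.
    return [
        "ethtool", "-s", iface,
        "speed", str(speed_mbps),
        "duplex", duplex,
        "autoneg", "off",
    ]


if __name__ == "__main__":
    # Hypothetical interface name; print the command instead of running it,
    # since changing link settings needs root and can drop the connection.
    print(shlex.join(ethtool_force_link_cmd("eth0", 100)))
```

    It only builds the command line so you can review it first; feeding the list to subprocess.run is the obvious next step once both ends agree.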

  95. Floods are OK - but beware of clueless accountants by dbIII · · Score: 1
    Don't let your clients near the Big Red Button without an escort
    Had a new building, and 160 litres of water made it into the server room - nothing but wet carpet but a lot of spectators turned up. The worst were an idiot chief accountant, who leaned back against the front of the server rack and a few power switches while talking to someone, and her idiot boyfriend, employed as a "handyman", who was smoking in there until I turned up. The front door went back on the rack, and I kept a very close eye on all of the tourists - too many people had keys to that room.
  96. Emergency Power Off by DragonHawk · · Score: 1

    "Why can't the EPO button perform in the same manner as a door release for an emergency exit..."

    Emergency Power Off (EPO) switches are primarily a safety feature. If some person is being electrocuted, you hit the switch and the power dies so the person doesn't. You don't have time to wait in a situation like that. A person's life is considered more valuable than LiveJournal, which despite the name, isn't actually alive. (Insert comment about angst-ridden teen-age girls here.)

    See also:

    http://catb.org/~esr/jargon/html/S/scram-switch.html

    http://catb.org/~esr/jargon/html/B/Big-Red-Switch.html

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
  97. Saltzer's Law by Anonymous Coward · · Score: 0

    "If you haven't tested your emergency plan
    recently, then it doesn't work anymore."

    Jerry Saltzer oversaw MIT's first campus-wide
    network. Saltzer's Law is the voice of experience.

  98. let me guess... by Anonymous Coward · · Score: 0

    You ran a Windows data center, right?

    You see, there are some other operating systems that don't need to be rebooted every day, or every week.

    Although, on the other hand, hard drives can die - but keep running until the computer is shut off. When my company moved our data center to a new location, we had at least 3 or 4 servers (out of maybe 50 or 100 total) that had been running for well over a year. They were transported carefully, but the hard drives never spun up again.

    Not that this was a really big problem - all we had to do is restore the machine from nightly backups onto a spare hard drive. Still had plenty of time left in the downtime window when the servers got booted back up.

    So yeah - servers should be rebooted at some point (probably just when services are added to verify they start on boot) - but no way should they be rebooted every single weekend.

  99. Re:The BSD box PSU probably had bigger capacitors. by Anonymous Coward · · Score: 0

    duh. You didn't read both emails in the link as I and most others did I suspect.

  100. Re:The BSD box PSU probably had bigger capacitors. by tmasssey · · Score: 1
    I was working on a PS/2 Model 95. This was one *heck* of a server back in the day. I had my finger pressed down on the button, when I realized it was not completely shut down: it had gotten stuck and I needed to kill a process to get it to finish. But I had my finger on the button!

    So, I double-clicked the button as fast as I could. No problem! Everything stayed up.

    I have seen that a few times since then, where the good-quality computers have survived momentary power outages and the crummy ones haven't. Just another reason to buy quality hardware...

  101. Heart of Gold? by NaDrew · · Score: 1

    "Please do not press this button again!"

    --
    Vista:XPSP2::ME:98SE
  102. Flamebait?? Funny is more like it by Anonymous Coward · · Score: 0

    Mod parent up.

  103. Re:The BSD box PSU probably had bigger capacitors. by Criton · · Score: 1

    The BSD box just had bigger caps for its PSU size or just better quality capacitors.
    I had the same thing happen when my PC rebooted but my old Power Mac rode through a brownout.
    The Mac's PSU was physically 2x the size of the ATX one in the PC, i.e. far bigger caps and heat sinks.

  104. Re:Don't forget... by eno2001 · · Score: 1

    Dear Mr. Rotund. How does one "roack"? Is it a sound? An action? Is it a new dance? Please elaborate on this stupidity. kthnx

    --
    -"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
  105. Re:Don't forget... by eno2001 · · Score: 1

    Now we're getting somewhere Mr. Rotund (implication that you are a fat lazy slob). I see that you must be using Windows 3.1 to operate your brain. That would explain the latency in your response. Six days. Not bad Mr. Rotund. That 16-bit single tasking brain of yours can work a little, even if it's wayyyy late. LOL!!!!111!!!! OMFG!!!11111!!!!!! I made a funny.

    Bleh.

    --
    -"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
  106. Re:Don't forget... by Anonymous Coward · · Score: 0

    I don't know about "Rotund Prickpull", how about "Rotund Pillock"?

  107. Re:Don't forget... by eno2001 · · Score: 1

    No. I think not. I *HAVE* a life in "meatspace" as you call it (I'm no geek. I'm an artist who happens to use computers). If I didn't have one, I'd be trolling like you all the time. You definitely don't have a life Rotund Bastard.

    --
    -"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
  108. Re:Don't forget... by Anonymous Coward · · Score: 0

    How about "too unimaginative to invent a name", anonymous cretin?

  109. Re:Don't forget... by Anonymous Coward · · Score: 0

    michael? is that you?

  110. Re:Don't forget... by eno2001 · · Score: 1

    Whheeee! Fun with trolling the trolls. Your ignorance is quite entertaining Mr. Penis Pudgepack.

    --
    -"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o