Slashdot Mirror


Explosion At ThePlanet Datacenter Drops 9,000 Servers

An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.

96 of 431 comments

  1. 9 Volts of Love by Anonymous Coward · · Score: 5, Funny

    Electricity is a fickle mistress, one moment she's gently caressing your genitals through gingerly applied electrodes the next she's blowing up your data centers.

    1. Re:9 Volts of Love by milsoRgen · · Score: 2, Funny

      I got your 9 Volts of Love right here

      --
      I'm sick of following my dreams. I'm just going to ask where they're goin' and hook up with 'em later.
  2. Kudo to their support team by QuietLagoon · · Score: 5, Insightful

    ... for posting frequent updates to the status of the outage.

    1. Re:Kudo to their support team by imipak · · Score: 3, Interesting

      Little-known fact: The Planet were the first ever retail ISP offering Internet access to the general public - from 1989. Hmmm, so they're the longest-established ISP in the world, and not only are they working hard to get that DC back online, they're posting pretty open summaries of the state of play... coincidence? I don't think so.

    2. Re:Kudo to their support team by larien · · Score: 4, Insightful

      It's probably less effort to spend a few minutes updating a forum than it would be to man the phones against irate customers demanding their servers be brought back online.

    3. Re:Kudo to their support team by QuietLagoon · · Score: 3, Insightful
      man the phones against irate customers

      It does not sound like the type of company that thinks of its customers as an enemy, as your message implies.

    4. Re:Kudo to their support team by SSpade · · Score: 5, Informative

      It's little known mostly because it's not actually true. I think you're confusing theplanet with the world, aka world.std.com.

    5. Re:Kudo to their support team by Anonymous Coward · · Score: 4, Funny

      Not sure I want to go to a std.com domain, might get infected...

    6. Re:Kudo to their support team by c_forq · · Score: 2, Funny

      Or more likely it sounds like someone who has worked tech support (this is slashdot).

      --
      Computers allow humans to make mistakes at the fastest speeds known, with the possible exception of tequila and handguns
    7. Re:Kudo to their support team by Fred_A · · Score: 2, Funny

      Just answer the phone with a recording that has a background of screams, fires, stuff falling down and cracking, electrical buzzing and a few sirens...

      "Hello, this is the Planet, our servers are down for the moment but we're working on it, thank you for your comprehension... Oh no, Smith is on fire ! Someone get him !!! *click*"

      --

      May contain traces of nut.
      Made from the freshest electrons.
  3. explosion? by Anonymous Coward · · Score: 5, Funny

    Lesson learned: don't store dynamite in the power room.

    1. Re:explosion? by Gazzonyx · · Score: 4, Funny

      Lesson learned: don't store dynamite in the power room.

      But they told me to take it out of the room with the fuel for the generators, the management offices, and HR department...
      --

      If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

    2. Re:Explosion? by Gazzonyx · · Score: 4, Informative

      Actually, modern batteries should be sealed valve or Absorbed Glass Mat (AGM) that don't vent (too much) hydrogen. During a thermal runaway, they vent a tiny bit before killing themselves, but hydrogen doesn't become explosive until the concentration in an enclosed environment is ~4%. 4% of a data center is a fairly large volume. I've heard of this happening in one data center where the primary and failover (IIRC) HVAC units failed and no one had been on site for well over a month. IOW, every battery in the place started venting and it took over a month without any air circulation for it to get to 4%.
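
      The arithmetic behind that "over a month" figure can be sketched roughly as below. Only the ~4% limit comes from the post; the room volume and venting rate are made-up illustrative numbers, and the model assumes a sealed, perfectly mixed room.

```python
# Rough estimate of how long batteries must vent into an unventilated room
# before hydrogen reaches its lower explosive limit.
# Only the ~4% limit comes from the post; every other figure is hypothetical.

H2_LOWER_EXPLOSIVE_LIMIT = 0.04  # volume fraction of hydrogen in air


def hydrogen_volume_at_limit(room_volume_m3, limit=H2_LOWER_EXPLOSIVE_LIMIT):
    """Volume of hydrogen (m^3) that brings a well-mixed room to `limit`."""
    # fraction = h2 / (air + h2)  =>  h2 = fraction * air / (1 - fraction)
    return limit * room_volume_m3 / (1.0 - limit)


def hours_to_limit(room_volume_m3, vent_rate_m3_per_hour,
                   limit=H2_LOWER_EXPLOSIVE_LIMIT):
    """Hours of steady venting needed to reach `limit` with no air exchange."""
    if vent_rate_m3_per_hour <= 0:
        raise ValueError("vent rate must be positive")
    return hydrogen_volume_at_limit(room_volume_m3, limit) / vent_rate_m3_per_hour


if __name__ == "__main__":
    room = 1000.0   # hypothetical room volume in cubic metres
    rate = 0.05     # hypothetical combined venting rate in m^3 per hour
    hours = hours_to_limit(room, rate)
    print(f"{hours:.0f} hours (about {hours / 24:.0f} days)")
```

      With those invented inputs the estimate lands at about 35 days, which is at least the same order of magnitude as the story above. Any real air circulation would stretch that out considerably.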

      --

      If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

    3. Re:Explosion? by RGRistroph · · Score: 3, Insightful

      Haven't you ever seen one of those gray garbage can sized transformers on a pole explode ? I used to live in a neighborhood that was right across the tracks from some sort of electrical switching station or something, they had rows of those things in a lot covered with white gravel. Explosions that were violent enough to feel like a grenade going off a hundred yards away were not uncommon. I think most of them were simply the arcing of high voltage vaporizing everything and producing a shock wave, but sometimes the can-type transformers that are filled with cooling oil exploded and the burning oil sprayed everywhere.

      At one place I worked, every lightning storm my boss would rush to move his shitty old truck to underneath the can on the power pole, hoping the thing would blow and burn it so he could get insurance to replace it.

    4. Re:explosion? by guruevi · · Score: 3, Funny

      As always, you should've left it with support, they usually know what to do with it and that's where all the junk ends up anyway.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    5. Re:Explosion? by womenwantmefishfearm · · Score: 5, Interesting
    6. Re:explosion? by $0.02 · · Score: 2, Funny

      They did not store dynamite. They used Sony batteries.

      --
      If enithin kan gow rong it whil. (Murfey)
  4. trying to read it by z_gringo · · Score: 5, Funny

    At this writing almost 2,400 people are trying to read it.

    Posting it on slashdot should help speed it up.

    --
    -- -- Warning. Do not stare directly at the sun.
    1. Re:trying to read it by Lorcas · · Score: 5, Funny

      Here's a new update from Urvish Vashi: To keep you up-to-date, some idiot posted this forum page on slashdot. Expect some slowdowns and interruptions trying to access this page. ps: **** you slashdot.

    2. Re:trying to read it by Flamora · · Score: 2, Funny

      That's not from Urvish, that's from the guys having to maintain the servers we run our forums off of.

  5. Recovery costs by Scuzzm0nkey · · Score: 5, Funny

    I wonder what the dollar value of the repairs will run? I'm sure insurance covers this kind of thing, but I'd love to see hard figures like in one of those mastercard commercials:

    Structural damage: $15000
    Melted hardware: $70000
    Halon refill: $however much halon costs
    Real-Life Slashdot effect: Priceless

    --
    People are like slinkies; useless but fun to watch when you push them down the stairs
    1. Re:Recovery costs by macx666 · · Score: 4, Insightful

      Not to mention the cost of pulling all those consultants in, overnight, on a weekend...

      Also, only the electrical equipment (and structural stuff) was damaged - networking and customer servers are intact (but without power, obviously). I read that they pulled in vendors. Those types would be more than happy to show up at the drop of a hat for some un-negotiated products that insurance will pay for anyway, and they'll even throw in their time for "free" so long as you don't dent their commission.
    2. Re:Recovery costs by Yetihehe · · Score: 2, Interesting

      So maybe it would make more sense to just skip their insurance?

      --
      Extreme Programming - Redundant Array of Inexpensive Developers
    3. Re:Recovery costs by Geak · · Score: 3, Funny

      Maybe they'll just haul the servers to another datacenter:

      Dollys - $500, Truck rentals - $5000, Labour - $10000, Sending internets on trucks - Priceless

    4. Re:Recovery costs by zippthorne · · Score: 2, Funny

      We are doing something about that. Now sick days and personal days are pooled into one unit. So your vacations have to compete with your potentially contagious illnesses. Everybody wins!

      --
      Can you be Even More Awesome?!
  6. This is BAD KARMA!! by Izabael_DaJinn · · Score: 5, Funny

    Clearly this is bad karma resulting from all their years of human rights violations....especially Tiananmen Square...oh wait--

    --
    Careful What You Wish For....
  7. What does a server room by iminplaya · · Score: 2, Funny

    have that can explode like this? All I can think of are all those cheap electrolytic caps. They really do put on quite a show, don't they? Put the transformer up on the roof, ok?

    --
    What?
    1. Re:What does a server room by Hijacked+Public · · Score: 3, Insightful

      Probably less traditional explosion and more Arc Flash.

      --
      "Sacrifice for the good of The State" - The State
    2. Re:What does a server room by CptNerd · · Score: 2, Interesting

      From what they were saying (I'm a customer, with both servers in that datacenter) it was a high-voltage transformer, so it might very well have been one that size. They did say it was much larger than the kind on power poles, but gave no indication of exactly how much it was handling. This is probably one of those times when architecture and esthetics took primary status over safety when the building was built. I would have thought a transformer as large as what blew up would be outside the building proper. At any rate, it's a major fustercluck that's going to take time to fix.

      Maybe in the post-mortem, someone will figure out it's time to start looking at ways to use less power, maybe switching to servers that use the lower-power CPUs that are coming out, so that the very high power infrastructure isn't as necessary. I have a feeling there'll be a "fire sale" on server subscriptions once a lot of customers leave (I'm not one of them, but I will likely swap one of mine for another at another location, much much later).

      --
      By the taping of my glasses, something geeky this way passes
  8. Kevin Hazard? by Pyrex5000 · · Score: 3, Funny

    I blame Kevin Hazard.

  9. Helpful Slashdot! by quonsar · · Score: 5, Funny

    At this writing almost 2,400 people are trying to read it

    and as of this posting, make that 152,476.

  10. Photos or information on building? by PPH · · Score: 3, Insightful

    Being in the power systems engineering biz, I'd be interested in some more information on the type of building (age, original occupancy type, etc.) involved.

    To date, I've seen a number of data center power problems, from fires to isolated, dual source systems that turned out not to be. It raises the question of how well the engineering was done for the original facility, or the refit of an existing one. Or whether proper maintenance was carried out.

    From TFA:

    electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding their electrical equipment room.

    A properly designed system should never allow a fault to become uncontained in this manner.
    --
    Have gnu, will travel.
    1. Re:Photos or information on building? by p0tat03 · · Score: 4, Informative

      I'm a mechanical/electrical engineer by training, and what you're saying makes no sense to us. Mistakes are made in the laboratory, where things are allowed to blow up and start fires. Once you hit the real world the considerations are *very different*. While it's possible that this fire could be caused by something entirely unforeseeable (unlikely given our experience in this field), it's also possible that this was due to improperly designed systems.

      I don't suppose you'd be singing the same tune if this was a bridge collapse that killed hundreds. There's a reason why engineering costs a lot, and that's directly correlated to how little failure we can tolerate.

    2. Re:Photos or information on building? by xaxa · · Score: 2, Interesting

      I was very impressed that a new bridge that was being extended over a busy railway line didn't cause any damage when they dropped it (they were lucky no trains were going under the bridge at the time, it's a very busy railway line -- about 40 trains in the next hour on a Sunday night, so you can imagine what it's like on a weekday. It did cause massive disruption, as they closed the line. And I don't know why they didn't have backup jacks if the failure of one left it unsupported.)

      I know it's not really relevant, but I didn't realise I was so interested in construction/engineering before reading about the past year's worth of posts on that blog (well, the construction ones. Not the "I was first on the new train!" ones. Though I admire the guy's dedication, to be awake at 4.00 to get the first ever train from the new Heathrow Airport station or whatever).

    3. Re:Photos or information on building? by aaarrrgggh · · Score: 2, Insightful

      This isn't that uncommon with a 200kAIC board with air-power breakers, if there is a bolted fault. Instantaneous delays. Newer insulated-case style breakers all have an instantaneous override which will limit fault energy.

      The other possibility was that a tie was closed and the breakers over-dutied and could not clear the fault.

      Odd that nobody was hurt though; spontaneous shorts are very rare-- most involve either switching or work in live boards, either of which would kill someone.

  11. Re:Server/customer ratio? by ChowRiit · · Score: 2, Insightful

    Only a few people need to have a lot of servers for there to be 18 servers for every 15 customers. To be honest, I'm surprised the ratio is so low, I would have guessed most hosting in a similar environment would be by people who'd want at least 2 servers for redundancy/backup/speed reasons...
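
    A quick check of that ratio, using the 9,000 servers and 7,500 customers quoted in the story. The "10 servers each" figure for heavy customers is purely hypothetical, chosen only to show how few multi-server customers the average requires.

```python
from fractions import Fraction

SERVERS_OFFLINE = 9000      # figure from the story summary
CUSTOMERS_AFFECTED = 7500   # figure from the story summary


def servers_per_customer(servers, customers):
    """Average number of servers per customer as an exact fraction."""
    return Fraction(servers, customers)


def heavy_customers_needed(servers, customers, servers_per_heavy):
    """Customers with `servers_per_heavy` servers each that are needed to
    reach the total, assuming every other customer has exactly one server."""
    # heavy * servers_per_heavy + (customers - heavy) * 1 == servers
    return Fraction(servers - customers, servers_per_heavy - 1)


if __name__ == "__main__":
    ratio = servers_per_customer(SERVERS_OFFLINE, CUSTOMERS_AFFECTED)
    print(f"{ratio} servers per customer ({float(ratio):.2f})")
    heavy = heavy_customers_needed(SERVERS_OFFLINE, CUSTOMERS_AFFECTED, 10)
    print(f"about {float(heavy):.0f} customers with 10 servers each would do it")
```

    The exact fraction reduces to 6/5, which is the same thing as 18 servers for every 15 customers. Under the one-server-for-everyone-else assumption, fewer than 170 of the 7,500 customers would need to run 10 boxes each.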

  12. Explosion? by mrcdeckard · · Score: 3, Insightful


    The only thing that I can imagine that could've caused an explosion in a datacenter is a battery bank (the data centers I've been in didn't have any large A/C transformers inside). And even then, I thought that the NEC had some fairly strict codes about firewalls, explosion-proof vaults and the like.

    I just find it curious, since it's not unthinkable that rechargeable batteries might explode.

    mr c

    --
    "Physics is like sex. Sure, it may give some practical results, but that's not why we do it." - R. Feynman
  13. Coral cached LOFI status page by martyb · · Score: 4, Informative

    Kudos to them for their timely updates as to system status. Having their status page listed on /. doesn't help them much, but I was encouraged to see a Coral Cache link to their status page. In that light, here's a link to the Coral Cache lo-fi version of their status page:

    • http://forums.theplanet.com.nyud.net:8080/lofiversion/index.php/t90185.html
  14. Lithium Batteries in their UPS setup?? by Zymergy · · Score: 2, Interesting

    I am wondering what UPS/Generator Hardware was in use?
    Where would the "failure" (Short/Electrical Explosion) have to be to cause everything to go dark?
    Sounds like the power distribution circuits downstream of the UPS/Generator were damaged.

    Whatever vendor provided the now vaporized components is likely praying that the specifics are not mentioned here.

    I recall something about Lithium Batteries exploding in Telecom DSLAMs... I wonder if their UPS system used Lithium Ion cells?
    http://www.lightreading.com/document.asp?doc_id=109923
    http://tech.slashdot.org/article.pl?sid=07/08/25/1145216
    http://hardware.slashdot.org/article.pl?sid=07/09/06/0431237

    1. Re:Lithium Batteries in their UPS setup?? by Anonymous Coward · · Score: 2, Informative

      If you'd read the linked status report, you'd see that there was a short in a high voltage line. They are dark because the fire department told them not to power up their back-up generators.

  15. kaboom by rarel · · Score: 2, Funny

    Clearly these Sony batteries had to be replaced one way or another...

  16. Re:More planning could have prevented this by Hijacked+Public · · Score: 5, Informative

    It is often the case that transformers are kept apart from all other components, and that appears to have been the case here. Had you read the article, or even the unusually accurate headline, you would know that the 9,000 servers were 'dropped' rather than 'blown apart'. They are still physically with us, they are just dropped from service because they don't have any power because the power supply blew up.

    Further, the 9,000 servers were physically, geographically, isolated enough from the power supply (which is what exploded) to be protected. We know this to be the case because we read the article and headline and understood them and they indicate that the 9,000 servers were not blown up.

    To put it another way, only the power supply was damaged by the explosion, the servers were not. Probably there was no way to isolate the power from its own explosion. The servers, however, were protected.

    So, in summary, the 9,000 servers were not blown up. Only the power.

    The power is off due to the explosion but the servers themselves are A-OK.
    --
    "Sacrifice for the good of The State" - The State
  17. Re:Server/customer ratio? by 42forty-two42 · · Score: 5, Insightful

    Wouldn't people who want such redundancy consider putting the other server in another DC?

  18. Re:Blank Label Comics, Schlock Mercenary by strredwolf · · Score: 2, Informative
    --
    # Canmephians for a better Linux Kernel
    $Stalag99{"URL"}="http://stalag99.net";
  19. 5 servers, 5 cities, 5 providers by Anonymous Coward · · Score: 2, Insightful

    I have 5 servers. Each of them is in a different city, on a different provider. I had a server at The Planet in 2005.

    I feel bad for their techs, but I have no sympathy for someone who's single-sourced; they should have propagated to their offsite secondary.

    Which they'll be buying tomorrow, I'm sure.

    1. Re:5 servers, 5 cities, 5 providers by aronschatz · · Score: 4, Insightful

      Yeah, because everyone can afford redundancy like you can.

      Most people own a single server that they make backups of in case of it crashing OR have two servers in the same datacenter in case one fails.

      I don't know how you can easily do offsite switch over without a huge infrastructure to support it which most people don't have the time and money to do.

      Get off your high horse.

  20. Re:Server/customer ratio? by p0tat03 · · Score: 4, Insightful

    ThePlanet is a popular host for hosting resellers. Many of the no-name shared hosting providers out there host at ThePlanet, amongst other places. So... Many of these customers would be individuals (or very small companies), who in turn dole out space/bandwidth to their own clients. The total number of customers affected can be 10-20x the number reported because of this.

  21. Re:Server/customer ratio? by bipbop · · Score: 3, Informative

    At my last job, BCP guidelines required both: a minimum of four servers for anything, two of which must be at a physically distant datacenter.

  22. Re:More planning could have prevented this by Gazzonyx · · Score: 2, Informative

    No, the power was off because the fire department told them to shut it off (during an investigation, I assume). The explosion was in a high power conduit - I'm sure it severed all the lines inside the conduit itself. This is one of those things that couldn't easily be avoided at a single site. But, if your server is of any importance, you do have a colo, right?

    --

    If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

  23. Re:Server/customer ratio? by wirelessbuzzers · · Score: 2, Insightful

    I'm guessing that most of the customers are virtual-hosted, and therefore have only a fraction of a server, but some customers have many servers.

    --
    I hereby place the above post in the public domain.
  24. More details on the outage by 1sockchuck · · Score: 2, Informative

    Data Center Knowledge has a story on the downtime at The Planet, summarizing the information from the now Slashdotted forums. Only one of the company's six data centers was affected. The Planet has more than 50,000 servers in its network, meaning that roughly one in five customers are offline.

    1. Re:More details on the outage by filmotheklown · · Score: 2, Informative

      Not Totally True.

      Many customers also use their DNS service, (the EV1 DNS), so while there are 9000 servers physically 'off' there are many more effectively 'black' as the canonical names no longer resolve.

      I'm one of those customers. We're a very small business as are many of the other customers of The Planet (formerly Everyones Internet -- EV1.net)

      I can still access our server via the IP address, but not via the canonical name.

      While we host our site on a private server, many of the other customers' servers belong to resellers, and with the DNS service down, I could easily see how tens of thousands of actual sites are down beyond the 9000 physical servers.

      --
      Filmo The Klown
    2. Re:More details on the outage by gnuman99 · · Score: 2, Informative

      Shouldn't they provide, you know, primary AND secondary DNS? And in that case, wouldn't the primary AND secondary be hosted in *different* data centers?

      DNS is *THE* *MOST* critical part of infrastructure. If the HTTP server fails, ok. If mail fails, ok. If data center explodes, you still have DNS so anyone sending email will just be stuck for a few days. But if DNS is offline, then email is offline. You are off the internet.

      I've had a server motherboard die and it took a few days to get a new one installed and running. But my DNS was running because backups were on different IPs and places.

      I have to say, this is a BIG no-no for them not to provide proper DNS services.
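
      One rough way to sanity-check that kind of separation is to see whether all of a zone's nameservers sit in the same network block. The sketch below makes no DNS queries; the hostnames and addresses are placeholders from the documentation ranges, and "same /24" is only a crude stand-in for "same data center", not a reliable test.

```python
import ipaddress
from collections import defaultdict


def group_by_network(nameservers, prefix_len=24):
    """Group nameserver hostnames by the IPv4 network their address falls in.

    `nameservers` maps hostname -> IPv4 address string. Sharing a /24 is
    used here only as a rough proxy for sharing a facility.
    """
    groups = defaultdict(list)
    for host, addr in nameservers.items():
        network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[network].append(host)
    return dict(groups)


def shares_single_network(nameservers, prefix_len=24):
    """True when every listed nameserver lands in the same network block."""
    return len(group_by_network(nameservers, prefix_len)) <= 1


if __name__ == "__main__":
    # Placeholder names and documentation-range addresses, not real servers.
    same_site = {"ns1.example.com": "192.0.2.10", "ns2.example.com": "192.0.2.20"}
    split_site = {"ns1.example.com": "192.0.2.10", "ns2.example.com": "198.51.100.20"}
    for label, servers in (("same_site", same_site), ("split_site", split_site)):
        verdict = "single point of failure" if shares_single_network(servers) else "spread out"
        print(f"{label}: {verdict}")
```

      A provider can of course put two different address blocks in one building, so a "spread out" result here proves nothing by itself; it only catches the obvious case.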

  25. a bit wrong by unity100 · · Score: 2, Insightful

    its not the 'no name' hosting resellers who host at the planet. no name resellers do not employ an entire server, they just use whm reseller panel that is being handed out by a company which hosts servers there.

    1. Re:a bit wrong by billcopc · · Score: 2, Funny

      Hey hey! I'm a no-name reseller, but I run my own servers, none of this turnkey reseller bullshit. I am root, and I'm goddamned proud of it :)

      --
      -Billco, Fnarg.com
  26. Correction by Gazzonyx · · Score: 4, Informative

    Sorry for replying to myself, I don't think I made my post clear; the backup power is not on (the mains was blown to bits), because the fire department told them to shut it off.

    --

    If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

  27. No servers were damaged by cptnapalm · · Score: 4, Funny

    They need to build the building out of whatever they build the servers out of.

  28. It must have been HACKERS by Eudial · · Score: 4, Funny
    --
    GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
  29. Re:More planning could have prevented this by ottawanker · · Score: 5, Insightful

    so you're agreeing with me. The servers getting blown up was a huge mistake, one that certainly could have been avoided with a little proper planning.

    you are a fucking moron

  30. Re:Knocking down three walls... by ajlitt · · Score: 2, Funny

    Hopefully an explosion would jostle out the clog that makes their Rails pipes run slowly.

  31. Monty Python by Sentry21 · · Score: 4, Funny

    electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding their electrical equipment room.

    But the fourth wall stayed up! And that's what you're getting, son - the strongest data centre in all of Texas!
  32. First ISP by Anonymous Coward · · Score: 2, Informative

    You're thinking of The World. See http://www.theworld.com/about/internet.shtml.

  33. Re:Server/customer ratio? by billcopc · · Score: 2, Insightful

    What ? I run 4 servers myself. The small firm I work for, we run maybe 70-80 boxes in our cage.

    In fact I find it odd that this facility has so many individual customers. Seems like a lot of administrative overhead... If I were running that DC, I'd much rather lease out full or half racks, than individual units, then you let those people sublet to the small fry.

    That's how most of the big hosting companies operate. They don't own their own datacenters, they just lease a cage or two, cram it full of gear and sell you that godawful oversold web space you love to hate. That's also why colocating a single server can be so goddamned expensive - datacenters set per-unit pricing high to scare away the Joe Blows, and the resellers make a lot more money selling crap hosting than subletting their precious space. This is especially true in the USA/Canada.

    --
    -Billco, Fnarg.com
  34. Re:More planning could have prevented this by cecil_turtle · · Score: 4, Informative

    ThePlanet has 5 or more datacenters. The cost and complexity of doing a full blown physically separated 2N power system at every datacenter is far more expensive than taking the chance of having to issue a credit against an SLA. Not to mention that when a fire is involved, the fire department has full authority and may instruct you to cut all power anyway - they are coming in to an unknown situation and won't risk their own people just because you say the other power system is isolated.

    Another issue is the complexity of a full blown 2N power system is likely to cause more outages due to human error during routine maintenance over an N+1 system. Complete 2N power systems from grid and backup sources all the way to the servers with no single point of failure (transformers, wiring, switching, PDUs, UPSs, etc.) are enormously complex and expensive, so it's not "the only thing that makes sense". I assure you issuing a one-day pro-rated credit to all your customers is cheaper.

  35. Ignorant firemen = single point-of-failure by JoeShmoe · · Score: 4, Interesting


    Everyone loves firemen, right? Not me. While the guys you see in the movies running into burning buildings might be heroes, the real world firemen (or more specifically fire chiefs) are capricious, arbitrary, ignorant little rulers of their own personal fiefdom. Did you know that if you are getting an inspection from your local fire chief and he commands something, there is no appeal? His word is law, no matter how STUPID or IGNORANT. I'll give you some examples later.

    I'm one of the affected customers. I have about 100 domains down right now because both my nameservers were hosted at the facility, as is the control panel that I would use to change the nameserver IPs. Whoops. So I learned why I need to obviously have NS3 and NS4 and spread them around because even though the servers are spread everywhere, without my nameservers none of them currently resolve.

    It sounds like the facility was ordered to cut ALL power because of some fire chief's misguided fear that power flows backwards from a low-voltage source to a high-voltage one. I admit I don't know much about the engineering of this data center, but I'm pretty sure the "Y" junction where AC and generator power come together is going to be as close to the rack power as possible to avoid lossy transformation. It makes no sense why they would have 220 or 400 VAC generators running through the same high-voltage transformer when it would be far more efficient to have 120 VAC or even 12 VDC (if only servers would accept that). But I admit I could be wrong, and if it is a legit safety issue...then it's apparently a single point of failure for every data center out there because ThePlanet charged enough that they don't need to cut corners.

    Here's a couple of times that I've had my hackles raised by some fireman with no knowledge of technology. The first was when we switched alarm companies and required a fire inspector to come and sign off on the newly installed system. The inspector said we needed to shut down power for 24 hours to verify that the fire alarm would still work after that period of time (a code requirement). No problem, we said, reaching for the breaker for that circuit.

    No no, he said. ALL POWER. That meant the entire office complex, some 20-30 businesses, would need to be without power for an entire day so that this fing idiot could be sure that we weren't cheating by sneaking supplementary power from another source.

    WHAT THE FRACK

    We ended up having to rent generators and park them outside to keep our racks and critical systems running, and then renting a conference room to relocate employees. We went all the way to the county commissioners pointing out how absolutely stupid this was (not to mention, who the HELL is still going to be in a burning building 24 hours after the alarm's gone off) but we were told that there was no override possible.

    The second time was at a different place when we installed a CO alarm as required for commercial property. Well, the inspector came and said we need to test it. OK, we said, pressing the test button. No no, he said, we need to spray it with carbon monoxide.

    Where the HELL can you buy a toxic substance like carbon monoxide, we asked. Not his problem but he wouldn't sign off until we did. After finding out that it was illegal to ship the stuff, and that there was no local supplier, we finally called the manufacturer of the device who pointed out that the device was void the second it was exposed to CO because the sensor was not reusable. In other words, when the sensor was tripped, it was time to buy a new monitor. You can see the recursive loop that would have developed if we actually had tested the device and then promptly had to replace it and get the new one retested by this idiot.

    So finally we got a letter from the manufacturer that pointed out the device was UL certified and that pressing the test button WAS the way you tested the device. It took four weeks of arguing before he finally found an excuse that let him save face and

    --
    -- I wonder which will go down in history as the bigger failure: the War on Drugs or the War on Filesharing
  36. Re:Server/customer ratio? by cowscows · · Score: 2, Insightful

    I think it depends on just how mission critical things are. If your business completely ceases to function if your website goes down, then remote redundancy certainly makes a lot of sense. If you can deal with a couple of days with no website, then maybe it's not worth the extra trouble. I'd imagine that a hardware failure confined to a single server is more common than explosions bringing entire data-centers offline, so maybe a backup server sitting right next to it isn't such a useless idea.

    --

    One time I threw a brick at a duck.

  37. Printer ignition source by kmahan · · Score: 4, Funny

    Last message on the linux console before the explosion:

            lp0 printer on fire!

    --
    Invalid Checksum. Retrying.
    1. Re:Printer ignition source by jd · · Score: 4, Interesting

      *wonders how many remember the live incident at the BBC, many years ago, when the Grandstand teleprinter stopped displaying match results and started printing updates on a fire running through the building.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    2. Re:Printer ignition source by moosesocks · · Score: 2, Informative

      For those of you who don't get the joke, there's actually an entire Wikipedia article devoted to it.

      In short, most unix printing systems understand a very small number of printer status codes, usually consisting of "READY, ONLINE, OFFLINE, and PRINTER ON FIRE"

      The latter status message was actually semi-serious, and was thrown whenever the printer was encountering a serious error, but for some reason was continuing to print anyway. In the case of a high-speed mainframe printer, if the printer jammed but continued attempting to print, a fire could easily start due to the amount of friction created by the high-speed motors.

      --
      -- If you try to fail and succeed, which have you done? - Uli's moose
  38. Re:More planning could have prevented this by Zebra_X · · Score: 3, Interesting

    "The power is off due to the explosion but the servers themselves are A-OK."

    Physically OK maybe... let's see how many of them come back up when the power is restored ^ ^

  39. "short in a high-volume wire conduit."? by Animats · · Score: 2, Informative

    They supposedly had a "short in a high-volume wire conduit." That leads to questions as to whether they exceeded the NEC limits on how much wire and how much current you can put through a conduit of a given size. Wires dissipate heat, and the basic rule is that conduits must be no more than 40% filled with wire. The rest of the space is needed for air cooling. The NEC rules are conservative, and if followed, overheating should not be a problem.

    This data center is in a hot climate, and a data center is often a continuous maximum load on the wiring, so if they do exceed the packing limits for conduit, a wiring failure through overheat is a very real possibility.

    Some fire inspector will pull charred wires out of damaged conduit and compare them against the NEC rules. We should know in a few days.
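
    As a rough illustration of the 40% fill rule mentioned above, here is a minimal sketch that compares the combined cross-section of the conductors with the cross-section of the conduit. It assumes round conductors in a round conduit; the function names and the example diameters are hypothetical and chosen only for illustration. A real inspection would use the conduit and conductor areas published in the NEC tables.

```python
import math

# The 40% limit quoted in the comment (it applies when a conduit holds
# more than two conductors).
MAX_FILL_FRACTION = 0.40


def conduit_fill_fraction(conduit_inner_diameter, conductor_outer_diameters):
    """Return the fraction of the conduit cross-section occupied by wire.

    All diameters must use the same unit. Hypothetical helper, not an NEC API.
    """
    conduit_area = math.pi * (conduit_inner_diameter / 2) ** 2
    wire_area = sum(math.pi * (d / 2) ** 2 for d in conductor_outer_diameters)
    return wire_area / conduit_area


def within_fill_limit(conduit_inner_diameter, conductor_outer_diameters,
                      limit=MAX_FILL_FRACTION):
    """True if the conductors occupy no more than `limit` of the conduit."""
    fill = conduit_fill_fraction(conduit_inner_diameter, conductor_outer_diameters)
    return fill <= limit


if __name__ == "__main__":
    # Illustrative numbers only: a 2-inch conduit with 0.5-inch conductors.
    for count in (4, 8):
        wires = [0.5] * count
        fill = conduit_fill_fraction(2.0, wires)
        print(f"{count} conductors: {fill:.0%} fill, ok={within_fill_limit(2.0, wires)}")
```

    The check covers geometry only. It says nothing about ambient temperature or continuous load, which the comment also raises as factors.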

  40. Re:More planning could have prevented this by NewbieProgrammerMan · · Score: 5, Funny

    I wish I had mod points...I think this is the first time I ever wanted to mod those 5 words up.

    --
    [b.belong('us') for b in bases if b.owner() == 'you']
  41. Is this why YouTube is down? by Animats · · Score: 2, Interesting

    YouTube's home page is returning "Service unavailable". Is this related? (Google Video is up.)

  42. _The_ Power Room? by John+Hasler · · Score: 2, Insightful

    > ...they claim redundant power...

    How the hell could they claim redundant power with only one power room?

    --
    Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
    1. Re:_The_ Power Room? by sciencewhiz · · Score: 2, Insightful

      They are not running backup power because the fire department told them not to, not because it doesn't exist.

    2. Re:_The_ Power Room? by CFD339 · · Score: 2, Insightful

      Redundant power they have. Redundant power distribution grids they do not. This is common. The level of certification in redundancy on power for fully redundant grids is (I think) called 2N where they only claim N+1 -- which I understand means failover power. It's more than enough 99.9% of the time. To have FULLY redundant power plus distribution from the main grid all the way into the building through the walls and to every rack is ridiculously more expensive. At that point, it is more sensible to buy another server at another facility for failover than to spend what it would cost to host a server with that kind of power redundancy -- on top of which, the server itself could still blow up and then where are you?

      --
      The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
  43. Re:More planning could have prevented this by njcoder · · Score: 2, Insightful

    I assure you issuing a one-day pro-rated credit to all your customers is cheaper. But not cheaper than losing 7500 accounts to another DC that can handle this type of event gracefully. The fact that it's complex doesn't mean you shouldn't expect it in a data center that claims to be "World Class"

    In related news, I was wondering why I wasn't getting much spam today and my sites didn't have strange spiders hitting them.
  44. Re:More planning could have prevented this by jacquesm · · Score: 2, Informative

    I'm one of their customers, and it takes more than a single instance in 5 years of hosting to make me switch. That said we'll see how long it takes to get things back up. Unfortunately *both* my dns servers are in that DC, I thought they were in physically distant locations... so much for ass-um-ing things...

  45. Sadists! by STFS · · Score: 2, Funny

    as if they haven't been through enough with the explosion and fire and all... you just had to rub it in and slashdot their forum as well... kudos!

    --
    You don't think enough... therefore you better not be!
  46. Re:Who's hosted on ThePlanet? by aronschatz · · Score: 2, Insightful

    ThePlanet dropped the ball on redundant DNS. They had all the EV1 nameservers at that DC which is completely ridiculous...
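
    One cheap sanity check for the situation described above is to see whether all of a domain's nameserver addresses fall in the same network block. The sketch below is a heuristic only: sharing a /24 suggests, but does not prove, that servers sit in the same facility, and servers in different blocks can still share a building. The function name is hypothetical and the addresses are reserved documentation ranges, not real nameservers.

```python
import ipaddress
from collections import defaultdict


def shared_networks(nameserver_ips, prefix_len=24):
    """Group nameservers by network block and return blocks holding more than one.

    nameserver_ips maps a nameserver name to its IPv4 address as a string.
    Hypothetical helper for illustration; it performs no DNS lookups.
    """
    groups = defaultdict(list)
    for name, ip in nameserver_ips.items():
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network)].append(name)
    return {net: sorted(names) for net, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    same_block = {"ns1.example.com": "192.0.2.10", "ns2.example.com": "192.0.2.20"}
    split_block = {"ns1.example.com": "192.0.2.10", "ns2.example.com": "198.51.100.20"}
    print(shared_networks(same_block))   # both names listed under one block
    print(shared_networks(split_block))  # empty: no block is shared
```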

  47. I'm a customer in that DC, and I'm a firefighter by CFD339 · · Score: 4, Insightful

    My servers dropped off the net yesterday afternoon, and if all goes well they'll be up and running late tonight. At 1700PST they're supposed to do a power test, then start bringing up the environmentals, the switching gear, and blocks of servers.

    My thoughts as a customer of theirs:

    1. Good updates. Not as frequent or clear as I'd like, but mostly they didn't have much to add.

    2. Anyone bitching about the thousands of dollars per hour they're losing has no credibility with me. If your junk is that important, your hot standby server should be in another data center.

    3. This is a very rare event, and I will not be pulling out of what has been an excellent relationship so far with them.

    4. I am adding a fail over server in another data center (their Dallas facility). I'd planned this already but got caught being too slow this time.

    5. Because of the incident, I will probably make the new Dallas server the primary and the existing Houston one the backup. This is because I think there will be long term stability issues in this Houston data center for months to come. I know what concrete, drywall, and fire extinguisher dust does to servers. I also know they'll have a lot of work in reconstruction ahead, and that can lead to other issues.

    For now, I'll wait it out. I've heard of this cool place called "outside". Maybe I'll check it out.

    --
    The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
  48. "Murphy's Law" != "Shit Happens" by fm6 · · Score: 2, Insightful

    > This goes to show that no matter how much planning you do, Murphy's Law still applies.

    I am so tired of hearing that copout. Does the submitter know for a fact that ThePlanet did everything it could to keep its power system from exploding? I don't have any evidence one way or the other, but if they're anything like other independent data center operators, it's pretty unlikely.

    The lesson you should be taking from Murphy's Law is not "Shit Happens". The lesson you should be taking is that you can't assume that an unlikely problem (or one you can con yourself into thinking unlikely) is one you can ignore. It's only after you've prepared for every reasonable contingency that you're allowed to say "Shit Happens".
  49. Re:More planning could have prevented this by njcoder · · Score: 3, Interesting

    > Part of my point that you apparently missed was that even a full 2N power system end-to-end doesn't guarantee uptime. There are very few - and I'd even go so far as to say "if any" - datacenters in the world that could handle an explosion / fire without going down.

    The dc didn't explode, just the power room. It seems there was just one power room. I've been to data centers around here, even small ones that have 2 power rooms.

    While it may be the fire dept that is erroneously preventing them from bringing up their back-up power, it's part of a poor disaster recovery plan to not engage with the fire dept, electric co, etc. before a disaster happens, so that everyone is on-board with your disaster recovery plans and that you have the ability to implement that plan.

    The explosion was isolated to the power room. The servers are fine, the backup generators and batteries are fine. The servers should have been back online if they had a good disaster recovery plan. The whole point of disaster recovery is being able to handle a disaster. You can't say "oh there was a disaster, you can't help that". This is exactly what their plan should have been able to handle. The power room goes offline. It shouldn't matter if it was because of an explosion, a fire, equipment failure or being beamed into outer space.

    It also shouldn't matter who is telling them to keep the power off. Part of the disaster recovery plan should have been making sure local authorities allowed them to carry it out. Fine, they have to shut off all power when firemen are in there with hoses. I understand that. But once the fire is out your plan should allow you to bring up backup power. It didn't. So I don't see how they can call themselves a "World Class Data Center". Part of what they sell and what customers expect is disaster recovery. And there are data centers that can provide this.

    ThePlanet is pretty cheap compared to datacenters like NAC that have more redundancy and security. But ThePlanet wants to advertise that they are just as good. Now they were caught with their pants down when there was actually a disaster and their disaster recovery plan failed.
  50. UMM.. USE STATIC PAGE?? by kyoorius · · Score: 5, Insightful

    There's no reason to use the forum software when they've locked the thread and are only using it to disseminate information. A Pentium one running lighttpd serving a static html page would be sufficient to handle the flood of requests.

  51. Re:Service Sucked for those affected by clare-ents · · Score: 2, Insightful

    SLA is not a substitute for business insurance.

    If your business loses $1000/minute while it's offline, get a quote for insurance that pays out $1000/minute while you're offline. Alternatively if you're happy self insuring take the loss when it happens.

    It's almost as if people believe that SLAs are a form of service guarantee instead of a free very bad insurance deal.

    --
    Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
  52. Re:More planning could have prevented this by cecil_turtle · · Score: 3, Informative

    You may also be interested in a pretty positive write-up from SANS about ThePlanet's response and handling of the situation thus far.

  53. Re:More planning could have prevented this by Viflux · · Score: 3, Informative

    From the status update thread... "Today at approximately 5:45 p.m., a transformer in our H1 data center in Houston caught fire, thus requiring us to take down all generators as instructed by the fire department. All servers are down." I read this as the fire department ordering them to kill *all* the power for safety reasons, rather than the explosion knocking the whole thing out.

  54. 1700 test not necessarily a failure by CFD339 · · Score: 2, Insightful

    First, that time was an estimate -- a target. Second, even if the initial power test passes, it will take hours to bring up the a/c systems, the switches, and the routers.

    The initial draw from each new bank of gear to be given power will be very high so it will need to go slow.

    The battery systems (be they on each rack or in large banks serving whole blocks) will try to charge all at once. If they're not careful, that'll heat those new power lines up like the filaments in a toaster. Remember, the battery plan they have was built with the idea that they'd be used very briefly during transition to generator power -- not drained down all at once.

    Only once all the switches and routing gear are back up can they start updating the network paths (do they use BGP for this? That's not my area of expertise) so that peering data starts flowing.

    Only once the network is all up and stable (no small task on a site with dozens of high end peering points) can they even start doing banks of servers.

    It's also probable that each bank of servers will need its own new power lines (and eventually replaced conduit) in the distribution center that was destroyed.

    Bank by bank they'll have to bring up all these servers, each of which will draw its maximum load during boot as disks are scanned and checked.
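
    The bank-by-bank constraint above can be sketched as a simple planning loop: each wave may only be as large as the headroom left after the already-running servers, because booting servers draw more than running ones. This is a minimal sketch under stated assumptions, not how The Planet actually sequenced its restart. The function name, the capacity figure, and the per-server wattages are all hypothetical.

```python
def plan_boot_waves(total_servers, capacity_watts, steady_watts, boot_watts):
    """Return a list of wave sizes that never exceeds capacity_watts.

    Assumes every server draws boot_watts while starting and steady_watts
    afterwards, and that a wave finishes booting before the next begins.
    """
    if boot_watts <= 0 or steady_watts < 0:
        raise ValueError("wattages must be positive")
    waves = []
    running = 0
    while running < total_servers:
        headroom = capacity_watts - running * steady_watts
        wave = min(total_servers - running, int(headroom // boot_watts))
        if wave <= 0:
            raise ValueError("not enough capacity to boot the remaining servers")
        waves.append(wave)
        running += wave
    return waves


if __name__ == "__main__":
    # Illustrative numbers only: 10 servers, 3600 W available,
    # 300 W each when running, 600 W each while booting.
    print(plan_boot_waves(10, 3600, 300, 600))  # [6, 3, 1]
```

    Note how the waves shrink as more servers come online. The same shape applies at data center scale, where battery recharging and air conditioning also eat into the headroom.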

    Most of these servers probably haven't been shut down in months or years. Some drives may not spin up: tired motors that run fine once spinning may no longer manage a cold start. Other servers may have boot configuration problems undiscovered since the machines have been running without reboot for a long time -- Linux ones anyway :-)

    This isn't something out of Young Frankenstein where they'll yell across the room "throw za main svitch!" and watch the lights dim briefly while 9000 servers boot up with the deafening sound of system beeps. If they did try such a thing -- as if such a thing were possible -- it would immediately blow at least another transformer if not more.

    Think about it. 9000 servers @ an average of what, 300 watts, plus the networking gear, plus the air conditioning, plus charging all those batteries....you're talking megawatts.
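
    The back-of-the-envelope figure above works out as follows. The 9000 servers and 300 watts are the numbers from the comment; the function name and the overhead parameter are assumptions added for illustration, and the source gives no overhead value.

```python
def total_load_megawatts(server_count, watts_per_server, overhead_factor=1.0):
    """Estimate total electrical load in megawatts.

    overhead_factor is a hypothetical multiplier for networking gear, air
    conditioning and battery charging; 1.0 means servers only.
    """
    if server_count < 0 or watts_per_server < 0 or overhead_factor < 1.0:
        raise ValueError("counts and watts must be non-negative, overhead >= 1.0")
    return server_count * watts_per_server * overhead_factor / 1_000_000


if __name__ == "__main__":
    servers_only = total_load_megawatts(9000, 300)
    print(f"Servers alone: {servers_only:.1f} MW")  # Servers alone: 2.7 MW
```

    So the servers by themselves come to 2.7 MW before any networking, cooling, or battery charging is counted.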

    Without a Mr. Fusion or Harry Mudd stumbling in with some chicks wearing dilithium crystal jewelry this is going to take a while.

    --
    The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
  55. Re:Who's hosted on ThePlanet? by LostCluster · · Score: 2, Interesting

    RackShack was also the company with the "screwdriver incident" where a tech working in the power room dropped the tool into a UPS and shorted out the facility. No customer data was lost, but the power outage caused them to be offline for more than a day.

  56. I'm a firefighter AND a geek. You, not so much. by CFD339 · · Score: 4, Insightful

    Look, when I go into a building in gear and carrying an axe and an extinguisher, breathing bottled air, wading through toxic smoke I couldn't give crap number one about your 100 sites being down.

    I have a crew to protect. In this case, I'm going into an extremely hazardous environment. There has already been one explosion. I don't know what I'm going to see when I get there, but I do know that this place is wall to wall danger. Wires everywhere to get tangled in when it's dark and I'm crawling through the smoke. Huge amounts of current. Toxic batteries everywhere that may or may not be stable. Wiring that may or may not be exposed.

    If it's me in charge, and it's my crew making entry, the power is going off. It's getting a lock-out tag on it. If you won't turn it off, I will. If I do it, you won't be turning it on so easily. If need be, I will have the police haul you away in cuffs if you try to stop me.

    My job, as a firefighter -- as a fire officer -- is to ensure the safety of the general public, of my crew, and then if possible of the property.

    NOW -- As a network guy and software developer -- I can say that if you're too short sighted or cheap to spring for a secondary DNS server at another facility, or if your servers are so critical to your livelihood that losing them for a couple of days will kill you but you haven't bothered to go with hot spares at another data center then you, sir, are an idiot.

    At any data center - anywhere - anything can happen at any time. The f'ing ground could open up and swallow your data center. Terrorists could target it because the guy in the rack next to yours is posting cartoon photos of their most sacred religious icons. Monkeys could fly out of the site admin's [nose] and shut down all the servers. Whatever. If it's critical, you have off site failover. If not, you're not very good at what you do.

    End of rant.

    --
    The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
  57. Re:I'm a firefighter AND a geek. You, not so much. by CFD339 · · Score: 2, Insightful

    You, sir, don't know what you're talking about. Reaching for ridiculous examples of someone doing their job wrong doesn't change that.

    Our S.O.G. (standard operating guidelines) are actually very specific about risk.

    We will risk our lives to save a human life.
    We will take reasonable risk to save the lives of pets and livestock.
    We will take minimal risks to save property.

    Sorry, but your building isn't worth the risk of my crew. That's reality.

    Don't you DARE tell me what is and isn't bravery or cowardly until you put 50 pounds of gear on and crawl into a pitch black house that's burning over your head.

    Don't you DARE tell me that you think you understand the difference between saving the blonde girl and saving your computer server.

    This isn't TV World. This is the real world. Fire on TV doesn't look like real fire. You know why? Because a real house on fire doesn't look like anything but pitch black and that makes for lousy TV.

    Get over yourself and go volunteer at your local fire department. 86% of the men and women in this country who will risk their lives for yours are volunteers. We could use your help if you have the guts for it. We'll teach you what you need to know -- and we'll keep you as safe as we can so you can go home to your family when it's done.

    Your examples are stupid and insulting to the 800,000 brave men and women who volunteer to risk death in the most painful way possible to save your sorry butt.

    --
    The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
  58. Update 11:14 PM CST by Solokron · · Score: 3, Informative

    As previously committed, I would like to provide an update on where we stand following yesterday's explosion in our H1 data center. First, I would like to extend my sincere thanks for your patience during the past 28 hours. We are acutely aware that uptime is critical to your business, and you have my personal commitment that The Planet team will continue to work around the clock to restore your service.

    As you have read, we have begun receiving some of the equipment required to start repairs. While no customer servers have been damaged or lost, we have new information that damage to our H1 data center is worse than initially expected. Three walls of the electrical equipment room on the first floor blew several feet from their original position, and the underground cabling that powers the first floor of H1 was destroyed.

    There is some good news, however. We have found a way to get power to Phase 2 (upstairs, second floor) of the data center and to restore network connectivity. We will be powering up the air conditioning system and other necessary equipment within the next few hours. Once these systems are tested, we will begin bringing the 6,000 servers online. It will take four to five hours to get them all running. We have brought in additional support from Dallas to have more hands and eyes on site to help with any servers that may experience problems. The call center has also brought in double staff to handle the increase in tickets we're expecting. Hopefully by sunrise tomorrow Phase 2 will be well on its way to full production.

    Let me next address Phase 1 (first floor) of the data center and the affected 3,000 servers. The news is not as good, and we were not as lucky. The damage there was far more extensive, and we have a bigger challenge that will require a two-step process. For the first step, we have designed a temporary method that we believe will bring power back to those servers sometime tomorrow evening, but the solution will be temporary. We will use a generator to supply power through next weekend when the necessary gear will be delivered to permanently restore normal utility power and our battery backup system. During the upcoming week, we will be working with those customers to resolve issues. We know this may not be a satisfactory solution for you and your business but at this time, it is the best we can do.

    We understand that you will be due service credits based on our Service Level Agreement. We will proactively begin providing those following the restoration of service, which is our number one priority, so please bear with us until this has been completed.

    I recognize that this is not all good news. I can only assure you we will continue to utilize every means possible to fully restore service. I plan to have an audio update tomorrow evening.

    Until then,
    Douglas J. Erwin
    Chairman & Chief Executive Officer

    --
    30% off web hosting. Coupon code "SLASHDOT".
  59. Re:Server/customer ratio? by gnuman99 · · Score: 2, Informative

    How about catching on fire and burning down??

    http://lists.debian.org/debian-devel/2002/11/msg01926.html

  60. Not so simple. by CFD339 · · Score: 4, Informative

    While it sounds like a reasonable approach at first, it makes assumptions that I can't make as an officer on scene.

    1. It assumes that the only problem is with the original transformer. When I arrive on scene I don't know what the problem was -- even if you tell me you do know, I can't believe it. I also don't know what the secondary problems are.

    2. Feeding power into a building that has been physically damaged is very very dangerous. We're not talking about a transformer "failing to work" we're talking about something that blew the walls off the room it was in.

    3. We already know that things didn't go the way they were supposed to. Something failed. Some safety plan didn't work. We have to assume that we're dealing with chaos until proved otherwise.

    So, as a fire officer I arrive on scene and have a smoke filled building with reports of an explosion and MAYBE a report that everyone is out. I need to go in and find out what happened, if anything is still burning or in immediate danger, and if anyone is still inside. To do that safely, the first thing I want to do is secure the power to the building (shut it off) as well as any other utility feeds (oil, steam, liquefied petroleum or natural gas).

    The gear I carry -- even the radio -- is designed to never create even the tiniest spark in its operation. We call it "intrinsically safe". It's one of a great many precautions we take.

    We go in to a place like this not knowing the equipment, not knowing its condition.

    My final proof point --

    If in fact The Planet had powered up their generators, they'd have fried a lot more stuff and caused more fire. They may have destroyed their chances of salvaging the grid within 48 hours at all. Why? It turns out (we now know) that the force of the initial explosion moved three walls in the power distribution center more than a foot (I heard 3 feet I think) off their base. This tore out electrical connections, cables, conduits and power switches. Just now, after 28 hours, they've figured out how to get power to the servers on the second floor, but for the first floor servers they're having to rig up a line from the generators to that floor and it will take until tomorrow to do that. Why? Because the electrical connections from that distribution room to the first floor servers are destroyed. They're going to be running 3000 servers on the first floor off those generators for a week while they get the equipment to rebuild the connectivity to the main distribution room.

    What does this prove?

    1. It proves the fire marshal was right in not allowing them to feed power in there.

    2. It proves that that big dumb fireman you see (who may be a volunteer who's also a network guy and software developer with an IQ above 95% of the world) may in fact have a good reason for the way he does things on scene.

    Look, as a firefighter I don't set out to ruin someone's day. I set out to keep them safe. If that sounds paternalistic, well, it is paternalistic. It very much feels that way. In my small town, it's how I feel. I wonder every time I walk into a building, how I would protect MY PEOPLE in this building if a fire broke out or a hazmat incident started or whatever. You can't help it, it's what you're trained to do.

    --
    The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
  61. No, not necessarily by Sycraft-fu · · Score: 4, Informative

    You are probably thinking of auto insurance. Yes, it usually goes up when used. The reason is because when you use it, it is usually because you did something that changed your risk level. If you get in an accident, that makes you a higher risk. Continue to get in accidents, you are a higher risk still. Thus the companies want more money. It's all based on risk calculation. That's also why they want more money when you are under 25. Statistically speaking, young people are a much higher risk of accidents.

    Well with building insurance, that's not the case. You aren't really a significant risk factor. Risk is instead calculated off of things like what kind of structure it is, how far it is from the fire department, what it's used for, what it contains (that determines what they are on the hook for) etc. So when something happens, unless it was because of a previously unknown risk factor, your rates don't necessarily change. Nothing changed with regards to risk.

    Insurance is really all just risk based. They take the probability of having to make a payout and the amount of said payout vs time and come up with a rate. If something changes the risk, the rate will change as well, but if not then it doesn't change. It isn't as though your one single payout is of any significance to their overall operation.

    Also, the idea of "Just pay for it yourself," is extremely silly. It smacks of someone who's never owned something of any significant value. The reason behind insurance is that you CAN'T just pay for it yourself. For example I have insurance on my house. The reason is that if I lost it, I can't afford to replace it. I don't have a couple hundred grand just lying around in the bank. That's the point of insurance. You are insuring that if something happens that you can't afford, someone will pay for it. The insurance company is betting, of course, that it isn't likely to happen and they get to keep the money.

  62. Yep by Sycraft-fu · · Score: 3, Insightful

    For example someone like Newegg.com probably has a redundant data centre. Reason being that if their site is down, their income drops to 0. Even if they had the phone techs to do the orders nobody knows their phone number and since the site is down, you can't look it up. However someone like Rotel.com probably doesn't. If their site is down it's inconvenient, and might possibly cost them some sales from people who can't research their products online, but ultimately it isn't a big deal even if it's gone for a couple of days. Thus it isn't so likely they'd spend the money on being in different data centres.

    You are also right on in terms of type of failure. I've been at the whole computer support business for quite a while now, and I have a lot of friends who do the same thing. I don't know that I could count the number of servers that I've seen die. I wouldn't call it a common occurrence, but it happens often enough that it is a real concern and thus important servers tend to have backups. However I've never heard of a data centre being taken out (I mean from someone I know personally, I've seen it on the news). Even when a UPS blew up in the university's main data centre, it didn't end up having to go down.

    I'm willing to bet that if you were able to get statistics on the whole of the US, you'd find my little sample is quite true. There'd be a lot of cases of servers dying, but very, very few of whole data centres going down, and then usually only because of things like hurricanes or the 9/11 attacks. Thus, a backup server makes sense, however unless it is really important a backup data centre may not.