Slashdot Mirror


Explosion At ThePlanet Datacenter Drops 9,000 Servers

An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.

9 of 431 comments

  1. Coral cached LOFI status page by martyb · · Score: 4, Informative

    Kudos to them for their timely updates as to system status. Having their status page listed on /. doesn't help them much, but I was encouraged to see a Coral Cache link to their status page. In that light, here's a link to the Coral Cache lofiversion of their status page:

    • http://forums.theplanet.com.nyud.net:8080/lofiversion/index.php/t90185.html
  2. Re:More planning could have prevented this by Hijacked+Public · · Score: 5, Informative

    It is often the case that transformers are kept apart from all other components, and that appears to have been the case here. Had you read the article, or even the unusually accurate headline, you would know that the 9,000 servers were 'dropped' rather than 'blown apart'. They are still physically with us; they are just dropped from service because they don't have any power because the power supply blew up.

    Further, the 9,000 servers were physically, geographically, isolated enough from the power supply (which is what exploded) to be protected. We know this to be the case because we read the article and headline and understood them and they indicate that the 9,000 servers were not blown up.

    To put it another way, only the power supply was damaged by the explosion, the servers were not. Probably there was no way to isolate the power from its own explosion. The servers, however, were protected.

    So, in summary, the 9,000 servers were not blown up. Only the power.

    The power is off due to the explosion but the servers themselves are A-OK.
    --
    "Sacrifice for the good of The State" - The State
  3. Re:Photos or information on building? by p0tat03 · · Score: 4, Informative

    I'm a mechanical/electrical engineer by training, and what you're saying makes no sense to us. Mistakes are made in the laboratory, where things are allowed to blow up and start fires. Once you hit the real world the considerations are *very different*. While it's possible that this fire could be caused by something entirely unforeseeable (unlikely given our experience in this field), it's also possible that this was due to improperly designed systems.

    I don't suppose you'd be singing the same tune if this was a bridge collapse that killed hundreds. There's a reason why engineering costs a lot, and that's directly correlated to how little failure we can tolerate.

  4. Re:Explosion? by Gazzonyx · · Score: 4, Informative

    Actually, modern batteries should be sealed valve or Absorbed Glass Mat (AGM) that don't vent (too much) hydrogen. During a thermal runaway, they vent a tiny bit before killing themselves, but hydrogen doesn't become explosive until the concentration in an enclosed environment is ~4%. 4% of a data center is a fairly large volume. I've heard of this happening in one data center where the primary and failover (IIRC) HVAC units failed and no one had been on site for well over a month. IOW, every battery in the place started venting and it took over a month without any air circulation for it to get to 4%.
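
    To put a rough number on the ~4% figure, here is a minimal sketch. The room dimensions are made up for illustration, and it assumes the hydrogen mixes evenly and displaces air at constant pressure, so the concentration is simply hydrogen volume divided by room volume.

```python
# Approximate lower explosive limit of hydrogen in air, by volume (the ~4% above).
H2_LEL_FRACTION = 0.04


def hydrogen_volume_to_reach_lel(room_volume_m3: float,
                                 lel_fraction: float = H2_LEL_FRACTION) -> float:
    """Return the hydrogen volume (m^3) needed to reach lel_fraction.

    Assumes perfect mixing and that hydrogen displaces air at constant
    pressure, so concentration = hydrogen volume / room volume.
    """
    if room_volume_m3 <= 0:
        raise ValueError("room volume must be positive")
    return lel_fraction * room_volume_m3


if __name__ == "__main__":
    # Hypothetical battery/equipment room: 30 m x 20 m x 4 m (not from the post).
    room = 30.0 * 20.0 * 4.0
    needed = hydrogen_volume_to_reach_lel(room)
    print(f"Room volume: {room:.0f} m^3, hydrogen needed: about {needed:.0f} m^3")
```

    Even for a modest room that is a lot of gas, which fits the point that it takes a long time without air circulation to get there.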

    --

    If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

  5. Correction by Gazzonyx · · Score: 4, Informative

    Sorry for replying to myself, I don't think I made my post clear; the backup power is not on (the mains was blown to bits), because the fire department told them to shut it off.

    --

    If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

  6. Re:Kudo to their support team by SSpade · · Score: 5, Informative

    It's little known mostly because it's not actually true. I think you're confusing theplanet with the world, aka world.std.com.

  7. Re:More planning could have prevented this by cecil_turtle · · Score: 4, Informative

    ThePlanet has 5 or more datacenters. The cost and complexity of doing a full-blown physically separated 2N power system at every datacenter is far more expensive than taking the chance of having to issue a credit against an SLA. Not to mention that when a fire is involved, the fire department has full authority and may instruct you to cut all power anyway - they are coming into an unknown situation and won't risk their own people just because you say the other power system is isolated.

    Another issue is that the complexity of a full-blown 2N power system is likely to cause more outages due to human error during routine maintenance than an N+1 system. Complete 2N power systems from grid and backup sources all the way to the servers with no single point of failure (transformers, wiring, switching, PDUs, UPSs, etc.) are enormously complex and expensive, so it's not "the only thing that makes sense". I assure you issuing a one-day pro-rated credit to all your customers is cheaper.
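
    The credit side of that comparison can be sketched as below. Only the 7,500 customer count comes from the story; the monthly fee and the yearly outage probability are placeholder assumptions, not ThePlanet's actual figures, and the function names are invented for this example.

```python
def prorated_credit(customers: int, monthly_fee: float,
                    outage_days: float, days_in_month: int = 30) -> float:
    """Total credit if every customer is refunded pro rata for the outage."""
    return customers * monthly_fee * outage_days / days_in_month


def expected_annual_credit(outage_probability_per_year: float,
                           credit_per_outage: float) -> float:
    """Expected yearly cost of credits for an outage of the given likelihood."""
    return outage_probability_per_year * credit_per_outage


if __name__ == "__main__":
    # 7,500 customers is from the story; the 150.0 monthly fee is hypothetical.
    one_day = prorated_credit(customers=7500, monthly_fee=150.0, outage_days=1.0)
    # Hypothetical: one such outage every 20 years on average.
    yearly = expected_annual_credit(0.05, one_day)
    print(f"One-day credit: {one_day:.2f}, expected per year: {yearly:.2f}")
```

    Whether that beats the annualized cost of 2N power at every datacenter depends entirely on the real figures, which aren't given here.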

  8. Not so simple. by CFD339 · · Score: 4, Informative

    While it sounds like a reasonable approach at first, it makes assumptions that I can't make as an officer on scene.

    1. It assumes that the only problem is with the original transformer. When I arrive on scene I don't know what the problem was -- even if you tell me you do know, I can't believe it. I also don't know what the secondary problems are.

    2. Feeding power into a building that has been physically damaged is very very dangerous. We're not talking about a transformer "failing to work" we're talking about something that blew the walls off the room it was in.

    3. We already know that things didn't go the way they were supposed to. Something failed. Some safety plan didn't work. We have to assume that we're dealing with chaos until proved otherwise.

    So, as a fire officer I arrive on scene and have a smoke-filled building with reports of an explosion and MAYBE a report that everyone is out. I need to go in and find out what happened, if anything is still burning or in immediate danger, and if anyone is still inside. To do that safely, the first thing I want to do is secure the power to the building (shut it off) as well as any other utility feeds (oil, steam, liquefied petroleum or natural gas).

    The gear I carry -- even the radio -- is designed to never create even the tiniest spark in its operation. We call it "intrinsically safe". It's one of a great many precautions we take.

    We go in to a place like this not knowing the equipment, not knowing its condition.

    My final proof point --

    If in fact The Planet had powered up their generators, they'd have fried a lot more stuff and caused more fire. They may have destroyed their chances of salvaging the grid within 48 hours at all. Why? It turns out (we now know) that the force of the initial explosion moved three walls in the power distribution center more than a foot (I heard 3 feet I think) off their base. This tore out electrical connections, cables, conduits and power switches. Just now, after 28 hours, they've figured out how to get power to the servers on the second floor, but for the first floor servers they're having to rig up a line from the generators to that floor and it will take until tomorrow to do that. Why? Because the electrical connections from that distribution room to the first floor servers are destroyed. They're going to be running 3000 servers on the first floor off those generators for a week while they get the equipment to rebuild the connectivity to the main distribution room.

    What does this prove?

    1. It proves the fire marshal was right in not allowing them to feed power in there.

    2. It proves that the big dumb fireman you see (who may be a volunteer who's also a network guy and software developer with an IQ above 95% of the world) may in fact have a good reason for the way he does things on scene.

    Look, as a firefighter I don't set out to ruin someone's day. I set out to keep them safe. If that sounds paternalistic, well, it is paternalistic. It very much feels that way. In my small town, it's how I feel. I wonder every time I walk into a building how I would protect MY PEOPLE in this building if a fire broke out or a hazmat incident started or whatever. You can't help it, it's what you're trained to do.

    --
    The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
  9. No, not necessarily by Sycraft-fu · · Score: 4, Informative

    You are probably thinking of auto insurance. Yes, it usually goes up when used. The reason is because when you use it, it is usually because you did something that changed your risk level. If you get in an accident, that makes you a higher risk. Continue to get in accidents, you are a higher risk still. Thus the companies want more money. It's all based on risk calculation. That's also why they want more money when you are under 25. Statistically speaking, young people are a much higher risk of accidents.

    Well, with building insurance, that's not the case. You aren't really a significant risk factor. Risk is instead calculated off of things like what kind of structure it is, how far it is from the fire department, what it's used for, what it contains (that determines what they are on the hook for) etc. So when something happens, unless it was because of a previously unknown risk factor, your rates don't necessarily change. Nothing changed with regards to risk.

    Insurance is really all just risk based. They take the probability of having to make a payout and the amount of said payout vs time and come up with a rate. If something changes the risk, the rate will change as well, but if not then it doesn't change. It isn't as though your one single payout is of any significance to their overall operation.
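
    As a rough illustration of that calculation, here is a minimal sketch of an expected-loss rate. The probability, payout, and loading values are invented for the example, and real insurers use far more elaborate models.

```python
def annual_premium(claim_probability: float, expected_payout: float,
                   loading: float = 0.0) -> float:
    """Expected yearly loss, marked up by a loading for overhead and profit.

    claim_probability: chance of a payout in a given year (0..1)
    expected_payout:   average size of that payout
    loading:           fractional markup, e.g. 0.25 for 25% (hypothetical)
    """
    if not 0.0 <= claim_probability <= 1.0:
        raise ValueError("claim_probability must be between 0 and 1")
    return claim_probability * expected_payout * (1.0 + loading)


if __name__ == "__main__":
    # Hypothetical building: 1% yearly chance of a 200,000 loss, 25% loading.
    before_claim = annual_premium(0.01, 200_000.0, loading=0.25)
    # A payout that reveals no new risk factor leaves the inputs unchanged,
    # so the rate stays the same.
    after_claim = annual_premium(0.01, 200_000.0, loading=0.25)
    print(before_claim == after_claim)
    # If the risk estimate doubles, so does the rate.
    print(annual_premium(0.02, 200_000.0, loading=0.25) / before_claim)
```

    The rate only moves when one of the inputs moves, which is the whole argument: a single payout by itself isn't an input.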

    Also, the idea of "Just pay for it yourself," is extremely silly. It smacks of someone who's never owned something of any significant value. The reason behind insurance is that you CAN'T just pay for it yourself. For example I have insurance on my house. The reason is that if I lost it, I can't afford to replace it. I don't have a couple hundred grand just lying around in the bank. That's the point of insurance. You are insuring that if something happens that you can't afford, someone will pay for it. The insurance company is then, of course, betting that it isn't likely to happen and they get to keep the money.