Explosion At ThePlanet Datacenter Drops 9,000 Servers
An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.
... for posting frequent updates to the status of the outage.
Being in the power systems engineering biz, I'd be interested in some more information on the type of building (age, original occupancy type, etc.) involved.
To date, I've seen a number of data center power problems, from fires to isolated, dual-source systems that turned out not to be. It raises the question of how well the engineering was done for the original facility, or for the refit of an existing one, or whether proper maintenance was carried out.
From TFA:
electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding their electrical equipment room. Properly designed systems should never allow a fault to become uncontained in this manner.
Have gnu, will travel.
The only thing I can imagine causing an explosion in a datacenter is a battery bank (the data centers I've been in didn't have any large AC transformers inside). And even then, I thought the NEC had some fairly strict codes about firewalls, explosion-proof vaults and the like.
I just find it curious, since it's not unthinkable that rechargeable batteries might explode.
mr c
"Physics is like sex. Sure, it may give some practical results, but that's not why we do it." - R. Feynman
Also, only the electrical equipment (and structural stuff) was damaged - networking and customer servers are intact (but without power, obviously). I read that they pulled in vendors. Those types would be more than happy to show up at the drop of a hat for some un-negotiated products that insurance will pay for anyway, and they'll even throw in their time for "free" so long as you don't dent their commission.
Probably less traditional explosion and more Arc Flash.
"Sacrifice for the good of The State" - The State
Wouldn't people who want such redundancy consider putting the other server in another DC?
ThePlanet is a popular host for hosting resellers. Many of the no-name shared hosting providers out there host at ThePlanet, amongst other places. So... Many of these customers would be individuals (or very small companies), who in turn dole out space/bandwidth to their own clients. The total number of customers affected can be 10-20x the number reported because of this.
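To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The 7,500 figure is from the summary and the 10-20x multiplier is the estimate in the comment above, not a measured value; the function name is made up for illustration.

```python
# Figures from the story summary and the comment above; the 10-20x
# multiplier is a rough guess about resellers, not a measured value.
reported_customers = 7_500
multiplier_low, multiplier_high = 10, 20


def affected_range(direct_customers, low, high):
    """Return (min, max) affected parties if the real total is low..high times the reported count."""
    return direct_customers * low, direct_customers * high


low_estimate, high_estimate = affected_range(
    reported_customers, multiplier_low, multiplier_high
)
print(f"{low_estimate:,} to {high_estimate:,} affected customers")
```

So if the multiplier is anywhere near right, the outage reaches somewhere between 75,000 and 150,000 people rather than 7,500.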
Haven't you ever seen one of those gray, garbage-can-sized transformers on a pole explode? I used to live in a neighborhood that was right across the tracks from some sort of electrical switching station; they had rows of those things in a lot covered with white gravel. Explosions violent enough to feel like a grenade going off a hundred yards away were not uncommon. I think most of them were simply the arcing of high voltage vaporizing everything and producing a shock wave, but sometimes the can-type transformers that are filled with cooling oil exploded and sprayed burning oil everywhere.
At one place I worked, every lightning storm my boss would rush to move his shitty old truck underneath the can on the power pole, hoping the thing would blow and burn it so he could get insurance to replace it.
My servers dropped off the net yesterday afternoon, and if all goes well they'll be up and running late tonight. At 1700PST they're supposed to do a power test, then start bringing up the environmentals, the switching gear, and blocks of servers.
My thoughts as a customer of theirs:
1. Good updates. Not as frequent or clear as I'd like, but mostly they didn't have much to add.
2. Anyone bitching about the thousands of dollars per hour they're losing has no credibility with me. If your junk is that important, your hot standby server should be in another data center.
3. This is a very rare event, and I will not be pulling out of what has been an excellent relationship so far with them.
4. I am adding a failover server in another data center (their Dallas facility). I'd planned this already but got caught being too slow this time.
5. Because of the incident, I will probably make the new Dallas server the primary and the existing Houston one the backup. This is because I think there will be long term stability issues in this Houston data center for months to come. I know what concrete, drywall, and fire extinguisher dust does to servers. I also know they'll have a lot of work in reconstruction ahead, and that can lead to other issues.
For now, I'll wait it out. I've heard of this cool place called "outside". Maybe I'll check it out.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
There's no reason to use the forum software when they've locked the thread and are only using it to disseminate information. A Pentium one running lighttpd serving a static HTML page would be sufficient to handle the flood of requests.
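A minimal sketch of the idea, using Python's standard-library `http.server` in place of lighttpd (so this is a swapped-in stand-in, not what ThePlanet runs): render the updates once into a byte string, then answer every request with those same bytes. The function names, page layout and update text are all assumptions for illustration.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_status_page(updates):
    """Build one static HTML page (as bytes) from a list of plain-text updates."""
    items = "\n".join(f"<li>{html.escape(u)}</li>" for u in updates)
    page = (
        "<!doctype html>\n"
        "<html><head><title>Outage status</title></head>\n"
        f"<body><h1>Outage status</h1>\n<ul>\n{items}\n</ul></body></html>\n"
    )
    return page.encode("utf-8")


def make_handler(page_bytes):
    """Return a handler class that answers every GET with the same pre-rendered bytes."""

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # No database, no sessions, no templates: just the cached page.
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(page_bytes)))
            self.end_headers()
            self.wfile.write(page_bytes)

        def log_message(self, format, *args):
            pass  # silence per-request logging in this sketch

    return StatusHandler


def serve(page_bytes, host="127.0.0.1", port=8080):
    """Serve the static page until interrupted (host and port are placeholders)."""
    HTTPServer((host, port), make_handler(page_bytes)).serve_forever()
```

Calling `serve(render_status_page(["placeholder update text"]))` would publish the page locally. The point is only that a locked, read-only thread needs none of the per-request work that forum software does.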
Yeah, because everyone can afford redundancy like you can.
Most people own a single server that they make backups of in case of it crashing OR have two servers in the same datacenter in case one fails.
I don't know how you can easily do offsite switchover without a huge infrastructure to support it, which most people don't have the time or money to build.
Get off your high horse.
Look, when I go into a building in gear and carrying an axe and an extinguisher, breathing bottled air, wading through toxic smoke, I couldn't give crap number one about your 100 sites being down.
I have a crew to protect. In this case, I'm going into an extremely hazardous environment. There has already been one explosion. I don't know what I'm going to see when I get there, but I do know that this place is wall to wall danger. Wires everywhere to get tangled in when it's dark and I'm crawling through the smoke. Huge amounts of current. Toxic batteries everywhere that may or may not be stable. Wiring that may or may not be exposed.
If it's me in charge, and it's my crew making entry, the power is going off. It's getting a lock-out tag on it. If you won't turn it off, I will. If I do it, you won't be turning it on so easily. If need be, I will have the police haul you away in cuffs if you try to stop me.
My job, as a firefighter -- as a fire officer -- is to ensure the safety of the general public, of my crew, and then if possible of the property.
NOW -- As a network guy and software developer -- I can say that if you're too short-sighted or cheap to spring for a secondary DNS server at another facility, or if your servers are so critical to your livelihood that losing them for a couple of days will kill you but you haven't bothered to go with hot spares at another data center, then you, sir, are an idiot.
At any data center - anywhere - anything can happen at any time. The f'ing ground could open up and swallow your data center. Terrorists could target it because the guy in the rack next to yours is posting cartoon photos of their most sacred religious icons. Monkeys could fly out of the site admin's [nose] and shut down all the servers. Whatever. If it's critical, you have off-site failover. If not, you're not very good at what you do.
End of rant.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
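For anyone wondering what the simplest version of off-site failover logic looks like, here is a hedged sketch: probe each site in priority order and use the first one that answers. The host names are hypothetical (in the reserved `.example` domain), the TCP-connect check is only one possible health test, and actually repointing DNS or a load balancer at the winner is left out.

```python
import socket


def tcp_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False


def pick_active(sites, probe=tcp_alive):
    """Return the name of the first reachable site in priority order, or None if all are down."""
    for name, host, port in sites:
        if probe(host, port):
            return name
    return None


# Hypothetical sites in two different data centers; replace with real hosts.
SITES = [
    ("primary", "www1.dc-a.example", 443),
    ("standby", "www2.dc-b.example", 443),
]
```

`pick_active(SITES)` returns `"primary"` while the first data center answers and falls through to `"standby"` when it goes dark. The `probe` argument exists so the decision logic can be exercised without touching the network.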
For example someone like Newegg.com probably has a redundant data centre. Reason being that if their site is down, their income drops to 0. Even if they had the phone techs to do the orders nobody knows their phone number and since the site is down, you can't look it up. However someone like Rotel.com probably doesn't. If their site is down it's inconvenient, and might possibly cost them some sales from people who can't research their products online, but ultimately it isn't a big deal even if it's gone for a couple of days. Thus it isn't so likely they'd spend the money on being in different data centres.
You are also right on in terms of type of failure. I've been at the whole computer support business for quite a while now, and I have a lot of friends who do the same thing. I don't know that I could count the number of servers that I've seen die. I wouldn't call it a common occurrence, but it happens often enough that it is a real concern and thus important servers tend to have backups. However I've never heard of a data centre being taken out (I mean from someone I know personally, I've seen it on the news). Even when a UPS blew up in the university's main data centre, it didn't end up having to go down.
I'm willing to bet that if you were able to get statistics on the whole of the US, you'd find my little sample is quite true. There'd be a lot of cases of servers dying, but very, very few of whole data centres going down, and then usually only because of things like hurricanes or the 9/11 attacks. Thus, a backup server makes sense, however unless it is really important a backup data centre may not.
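The trade-off above can be written down as a small expected-cost comparison. This is only a sketch: every number below is invented for illustration (none comes from ThePlanet, Newegg or Rotel), and the function names are made up.

```python
def expected_annual_loss(years_between_outages, hours_per_outage, loss_per_hour):
    """Average yearly loss from whole-facility outages, spread over the years between them."""
    return hours_per_outage * loss_per_hour / years_between_outages


def worth_second_site(years_between_outages, hours_per_outage,
                      loss_per_hour, second_site_cost_per_year):
    """True if the expected yearly outage loss exceeds the yearly cost of a second data center."""
    loss = expected_annual_loss(years_between_outages, hours_per_outage, loss_per_hour)
    return loss > second_site_cost_per_year


# Hypothetical inputs: one facility-wide outage every 20 years, lasting 36 hours,
# and a second site costing 6,000 per year.
busy_store = worth_second_site(20, 36, loss_per_hour=5_000, second_site_cost_per_year=6_000)
brochure_site = worth_second_site(20, 36, loss_per_hour=20, second_site_cost_per_year=6_000)
print(busy_store, brochure_site)
```

With these made-up inputs the busy store comes out ahead with a second site and the brochure site does not, which is the same conclusion as above: rare events justify the spend only when the hourly loss is large.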