Explosion At ThePlanet Datacenter Drops 9,000 Servers
An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.
9000::7500?
So I guess a "customer" in this case is a company or business, not an individual? Unless many of the individuals have several servers each.
Electricity is a fickle mistress, one moment she's gently caressing your genitals through gingerly applied electrodes the next she's blowing up your data centers.
... for posting frequent updates to the status of the outage.
Lesson learned: don't store dynamite in the power room.
At this writing almost 2,400 people are trying to read it. Posting it on slashdot should help speed it up.
-- -- Warning. Do not stare directly at the sun.
I wonder what the dollar value of the repairs will run? I'm sure insurance covers this kind of thing, but I'd love to see hard figures like in one of those mastercard commercials: Structural damage: $15000 Melted hardware: $70000 Halon refill: $however much halon costs Real-Life Slashdot effect: Priceless
People are like slinkies; useless but fun to watch when you push them down the stairs
And then they put it on the front page of Slashdot.
It was Sunday, June 1, 2008. Xeon, my children, just don't belong in some places.
(About the only thing missing from this real-world version of the story is a YouTube video of a halon fire suppression system going off. Damn ozone-protection regs :)
Clearly this is bad karma resulting from all their years of human rights violations....especially Tiananmen Square...oh wait--
Careful What You Wish For....
have that can explode like this? All I can think of are all those cheap electrolytic caps. They really do put on quite a show, don't they? Put the transformer up on the roof, ok?
What?
I blame Kevin Hazard.
At this writing almost 2,400 people are trying to read it
and as of this posting, make that 152,476.
Sacred cows make the best burgers.
So who's missing from Al Gore's Internet? Who do we know who's hosted on ThePlanet?
Being in the power systems engineering biz, I'd be interested in some more information on the type of building (age, original occupancy type, etc.) involved.
To date, I've seen a number of data center power problems, from fires to isolated, dual source systems that turned out not to be. It raises the question of how well the engineering was done for the original facility, or the refit of an existing one. Or whether proper maintenance was carried out.
From TFA:
electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding their electrical equipment room.
A properly designed system should never let a fault become uncontained in this manner.
Have gnu, will travel.
Schlock Mercenary, the popular webcomic, as well as most of the Blank Label Comics collective is down. Schlockmercenary.com now points to a holder site, and Sunday's comic is on the Livejournal community at http://schlocktroups.livejournal.com./
--
# Canmephians for a better Linux Kernel
$Stalag99{"URL"}="http://stalag99.net";
The only thing that I can imagine that could've caused an explosion in a datacenter is a battery bank (the data centers I've been in didn't have any large A/C transformers inside). And even then, I thought that the NEC had some fairly strict codes about firewalls, explosion-proof vaults and the like.
I just find it curious, since it's not unthinkable that rechargeable batteries might explode.
mr c
"Physics is like sex. Sure, it may give some practical results, but that's not why we do it." - R. Feynman
Kudos to them for their timely updates as to system status. Having their status page listed on /. doesn't help them much, but I was encouraged to see a Coral Cache link to their status page. In that light, here's a link to the Coral Cache lofiversion of their status page:
I am wondering what UPS/Generator Hardware was in use?
Where would the "failure" (Short/Electrical Explosion) have to be to cause everything to go dark?
Sounds like the power distribution circuits downstream of the UPS/Generator were damaged.
Whatever vendor provided the now vaporized components are likely praying that the specifics are not mentioned here.
I recall something about Lithium Batteries exploding in Telecom DSLAMs... I wonder if their UPS system used Lithium Ion cells?
http://www.lightreading.com/document.asp?doc_id=109923
http://tech.slashdot.org/article.pl?sid=07/08/25/1145216
http://hardware.slashdot.org/article.pl?sid=07/09/06/0431237
Clearly these Sony batteries had to be replaced one way or another...
They should have also had two separate demarcation points for power as well, with a throw switch on both sides of the backup to have physical separation from the farm and the grid, only to be connected up when something like this happens. When you have that many servers, it's the only thing that makes sense.
Further, the 9,000 servers were physically, geographically, isolated enough from the power supply (which is what exploded) to be protected. We know this to be the case because we read the article and headline and understood them and they indicate that the 9,000 servers were not blown up.
To put it another way, only the power supply was damaged by the explosion, the servers were not. Probably there was no way to isolate the power from its own explosion. The servers, however, were protected.
So, in summary, the 9,000 servers were not blown up. Only the power.
The power is off due to the explosion but the servers themselves are A-OK.
"Sacrifice for the good of The State" - The State
I have 5 servers. Each of them is in a different city, on a different provider. I had a server at The Planet in 2005.
I feel bad for their techs, but I have no sympathy for someone who's single-sourced, they should have propagated to their offsite secondary.
Which they'll be buying tomorrow, I'm sure.
Never thought I'd see that headline.
I live right across the street from one of The Planet's Dallas data centers, so when I saw this article, I was like "So why wasn't I woken by an exploding generator?" Makes sense now.
Of course, I still have to go to work on Wednesday now, too. Bah.
No, the power was off because the fire department told them to shut it off (during an investigation, I assume). The explosion was in a high power conduit - I'm sure it severed all the lines inside the conduit itself. This is one of those things that couldn't easily be avoided at a single site. But, if your server is of any importance, you do have a colo, right?
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
My main server is located there and it's killing me waiting for it to come back up again. The abuse some Planet customers are getting, though, for not having 'switched to their backup data centres' is amazing. Some of us are small fry, we can't afford to run multiple hosting infrastructures.
Data Center Knowledge has a story on the downtime at The Planet, summarizing the information from the now Slashdotted forums. Only one of the company's six data centers was affected. The Planet has more than 50,000 servers in its network, meaning that nearly one in five of its servers is offline.
Fortunately I'm hosted at the Dallas facility, and this event was at the Houston one.
Hunt your preferred prey at Aliens vs Predator MUD. Join the war at avpmud.com port 4000
Really? What about a little known thing called colocation?
At least with colocation, if the building gets blown up by terrorists, the servers are still running somewhere else.
Fight Spammers!
it's not the 'no name' hosting resellers who host at the planet. no name resellers do not rent an entire server, they just use a whm reseller panel that is handed out by a company which hosts servers there.
Read radical news here
If so, that could explain the cause of the explosion...
The view was horrible and the smell was even worse; Julie severely regretted becoming a proctologist.
Sorry for replying to myself, I don't think I made my post clear; the backup power is not on (the mains was blown to bits), because the fire department told them to shut it off.
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Everyone please let me know when iyfwrestling.com is back up and running!
You never expect irony, do you?
Want to be a professional wrestler? Visit www.iyfwrestling.com
@iyfwrestling
If this were DreamHost there would be a few flippant words in the official statement but pages and pages of photos...
-1 not first post
Andy
They need to build the building out of what ever they build the servers out of.
Because Hackers can turn your computer into a BOMB!
GAAH! MY PRINTER IS ON FIRE!!! PUT IT OUT! PUT IT OUT!
This morning I was wondering what has happened with Darklyrics.com
Turns out they were hosted on ThePlanet!
http://toolbar.netcraft.com/site_report?url=http://www.darklyrics.com
From what I remember from Uni, the coils of wire in a transformer want to be straight. When a transformer has power flowing through it the coils can exert some fairly serious pressure. Big transformers tend to be encased in concrete for this reason. Maybe there was a short, a big current flowed through the secondary coil and the force was enough to overcome some weak restraints.
Can anyone give a less arm wavey description of this? Or have I misunderstood?
The power to the server was never lost, and I didn't even find out about it till a couple of days into it.
-- these are only opinions and they might not be mine.
I once caught a production exchange server on fire due to a faulty wire connected to one of the DLT drives.
The motherboard was scorched so bad that when you tapped it burnt flakes that used to be the transistors fell off.
I knew something was up as soon as I smelled burning. The smoke pretty much gave it away, though.
I don't work there any more.
Have a squat over at the hobo house.
And more to the point, the rest of the backup power systems were taken offline at the request of the fire brigade.
It's a common feature to have power shut off in the event of a fire. The Fire Service don't want to be hosing down live cabling after all. It's also why you shouldn't use lifts. Everyone thinks it's "in case the fire reaches the lift". It's actually 'cos the power is likely to be cut off at any moment (the office I work in cuts the power after 3 minutes)
Turn up the power
This is the hour
From every tower
A million watts of love
There comes a time when
You need a good friend
But all that you have
Is that glowing screen
You know you could fly
Your hate a run high
But you've been squeezed in
To that same old scene
You know what I mean
Turn up the power
This is the hour
From every tower
Shout it from above
Turn up the power
This is the hour
From every tower
A million watts of love
By turning that switch
You're finding your niche
And you could tell them
Where to put the advice
You should get back in
It's time to jack in
We'll help you hack in
To that glowing life
You won't have to think twice
Turn up the power
This is the hour
From every tower
Shout it from above
Turn up the power
This is the hour
From every tower
A million watts of love
.
Circumcision is child abuse.
You're thinking of The World. See http://www.theworld.com/about/internet.shtml.
ThePlanet has 5 or more datacenters. The cost and complexity of doing a full blown physically separated 2N power system at every datacenter is far more expensive than taking the chance of having to issue a credit against an SLA. Not to mention that when a fire is involved, the fire department has full authority and may instruct you to cut all power anyway - they are coming in to an unknown situation and won't risk their own people just because you say the other power system is isolated.
Another issue is the complexity of a full blown 2N power system is likely to cause more outages due to human error during routine maintenance over an N+1 system. Complete 2N power systems from grid and backup sources all the way to the servers with no single point of failure (transformers, wiring, switching, PDUs, UPSs, etc.) are enormously complex and expensive, so it's not "the only thing that makes sense". I assure you issuing a one-day pro-rated credit to all your customers is cheaper.
Everyone loves firemen, right? Not me. While the guys you see in the movies running into burning buildings might be heroes, the real world firemen (or more specifically fire chiefs) are capricious, arbitrary, ignorant little rulers of their own personal fiefdom. Did you know that if you are getting an inspection from your local fire chief and he commands something, there is no appeal? His word is law, no matter how STUPID or IGNORANT. I'll give you some examples later.
I'm one of the affected customers. I have about 100 domains down right now because both my nameservers were hosted at the facility, as is the control panel that I would use to change the nameserver IPs. Whoops. So I learned why I obviously need to have NS3 and NS4 and spread them around, because even though the servers are spread everywhere, without my nameservers none of them currently resolve.
It sounds like the facility was ordered to cut ALL power because of some fire chief's misguided fear that power flows backwards from a low-voltage source to a high-voltage one. I admit I don't know much about the engineering of this data center, but I'm pretty sure the "Y" junction where AC and generator power come together is going to be as close to the rack power as possible to avoid lossy transformation. It makes no sense why they would have 220 or 400 VAC generators running through the same high-voltage transformer when it would be far more efficient to have 120 or even 12 VDC (if only servers would accept that). But I admit I could be wrong, and if it is a legit safety issue...then it's apparently a single point of failure for every data center out there because ThePlanet charged enough that they don't need to cut corners.
Here's a couple of times that I've had my hackles raised by some fireman with no knowledge of technology. The first was when we switched alarm companies and required a fire inspector to come and sign off on the newly installed system. The inspector said we needed to shut down power for 24 hours to verify that the fire alarm would still work after that period of time (a code requirement). No problem, we said, reaching for the breaker for that circuit.
No no, he said. ALL POWER. That meant the entire office complex, some 20-30 businesses, would need to be without power for an entire day so that this fing idiot could be sure that we weren't cheating by sneaking supplementary power from another source.
WHAT THE FRACK
We ended up having to rent generators and park them outside to keep our racks and critical systems running, and then rent a conference room to relocate employees. We went all the way to the county commissioners pointing out how absolutely stupid this was (not to mention, who the HELL is still going to be in a burning building 24 hours after the alarm's gone off) but we were told that there was no override possible.
The second time was at a different place when we installed a CO alarm as required for commercial property. Well, the inspector came and said we need to test it. OK, we said, pressing the test button. No no, he said, we need to spray it with carbon monoxide.
Where the HELL can you buy a toxic substance like carbon monoxide, we asked. Not his problem but he wouldn't sign off until we did. After finding out that it was illegal to ship the stuff, and that there was no local supplier, we finally called the manufacturer of the device who pointed out that the device was void the second it was exposed to CO because the sensor was not reusable. In other words, when the sensor was tripped, it was time to buy a new monitor. You can see the recursive loop that would have developed if we actually had tested the device and then promptly had to replace it and get the new one retested by this idiot.
So finally we got a letter from the manufacturer that pointed out the device was UL certified and that pressing the test button WAS the way you tested the device. It took four weeks of arguing before he finally found an excuse that let him save face and
-- I wonder which will go down in history as the bigger failure: the War on Drugs or the War on Filesharing
Last message on the linux console before the explosion:
lp0 printer on fire!
Invalid Checksum. Retrying.
"The power is off due to the explosion but the servers themselves are A-OK."
Physically OK maybe... let's see how many of them come back up when the power is restored ^ ^
They supposedly had a "short in a high-volume wire conduit." That leads to questions as to whether they exceeded the NEC limits on how much wire and how much current you can put through a conduit of a given size. Wires dissipate heat, and the basic rule is that conduits must be no more than 40% filled with wire. The rest of the space is needed for air cooling. The NEC rules are conservative, and if followed, overheating should not be a problem.
This data center is in a hot climate, and a data center is often a continuous maximum load on the wiring, so if they do exceed the packing limits for conduit, a wiring failure through overheat is a very real possibility.
Some fire inspector will pull charred wires out of damaged conduit and compare them against the NEC rules. We should know in a few days.
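For anyone who wants to see what the 40% figure in the comment above means in practice, here is a minimal sketch of the geometric check only. The conduit and wire diameters are made-up numbers, the function names are hypothetical, and a real NEC calculation uses the code's own tables plus ampacity derating, none of which is modeled here.

```python
import math

# Assumption: the 40% limit quoted in the comment is the one that applies
# when a conduit carries three or more conductors.
MAX_FILL = 0.40


def circle_area(diameter_mm: float) -> float:
    """Cross-sectional area of a circle with the given diameter."""
    return math.pi * (diameter_mm / 2) ** 2


def conduit_fill(conduit_inner_diameter_mm: float, wire_outer_diameters_mm) -> float:
    """Fraction of the conduit cross-section occupied by wire (insulation included)."""
    wire_area = sum(circle_area(d) for d in wire_outer_diameters_mm)
    return wire_area / circle_area(conduit_inner_diameter_mm)


def within_fill_limit(conduit_inner_diameter_mm: float, wire_outer_diameters_mm,
                      limit: float = MAX_FILL) -> bool:
    """True if the wires occupy no more than `limit` of the conduit area."""
    return conduit_fill(conduit_inner_diameter_mm, wire_outer_diameters_mm) <= limit


# Hypothetical example: a 50 mm conduit carrying 10 mm (outer diameter) wires.
eight_wires = [10.0] * 8
twelve_wires = [10.0] * 12
print(f"8 wires:  {conduit_fill(50.0, eight_wires):.0%} full, ok={within_fill_limit(50.0, eight_wires)}")
print(f"12 wires: {conduit_fill(50.0, twelve_wires):.0%} full, ok={within_fill_limit(50.0, twelve_wires)}")
```

The ratio only says whether the packing rule is met; it says nothing about whether the current through those wires is within their rating, which is the other half of the question raised above.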
I wish I had mod points...I think this is the first time I ever wanted to mod those 5 words up.
[b.belong('us') for b in bases if b.owner() == 'you']
YouTube's home page is returning "Service unavailable". Is this related? (Google Video is up.)
> ...they claim redundant power...
How the hell could they claim redundant power with only one power room?
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
I remember seeing a series of photos on a website a few years ago showing the remains of a transformer outside a commercial office building that housed another datacenter. Unfortunately I forget which company it was (I want to say Hurricane Electric but I'm not 100% sure). Those photos were pretty impressive. After the fire department put the fire out there wasn't much of anything left on the concrete slab where the transformer once was...
In related news, I was wondering why I wasn't getting much spam today and my sites didn't have strange spiders hitting them.
Open Source Java DAO Generator
could be nice to remove the link.. so we (clients) can be up to date.
Blank Label Comics is hosted there, and this catastrophe could be the very first thing in over eight years straight to cause Howard Tayler to not update his daily comic Schlock Mercenary.
Maybe we deserve this world ?
Time it takes to get a new transformer (assuming a supply house has one laying around in their warehouse and are willing to do business on a Sunday): 1 day
Time it takes to get a new switchboard: a few weeks usually (although possibly 2 or 3 days at the earliest, assuming GE/SquareD/CutlerHammer have one laying around in their warehouse and the building doesn't have any special needs)
Time it takes to order all new copper wire and pull it in and terminate it (assuming no problems are encountered with the way the pipe is run, etc): probably about 3 to 5 days at the earliest (remember, their main board exploded, the wire ends are more-than-likely all burnt up and the wire is now worthless)
Time it takes to rehang/fireseal three walls: dunno, could take a while. sheetrockers are funny like that, LOL.
But in all seriousness, it sounds like ThePlanet are downplaying this. They are estimating stuff coming online today, when it looks like sometime next weekend (at the earliest (and assuming they're not trying to rig stuff up)).
I've been in a DC when the power's dropped. It's surprising how physical it is when it gets quiet. I imagine the initial bang and then the silence broken by the alarms was quite an experience.
Count me as one affected. They've been great about notifying us and posting about it though. Thankfully my site is just one I run for fun, and not a business site. Sheesh.
I'm one of their customers, and it takes more than a single instance in 5 years of hosting to make me switch. That said we'll see how long it takes to get things back up. Unfortunately *both* my dns servers are in that DC, I thought they were in physically distant locations... so much for ass-um-ing things...
MP3 Search Engine
The fact is in the real world there is no such thing as 100% guarantee, datacenters no matter how well designed can, do and will go down, and it doesn't mean there is a design flaw or that another datacenter is superior.
YouTube is back up, after a few minutes of outage, so it was probably a different problem.
as if they haven't been through enough with the explosion and fire and all... you just had to rub it in and slashdot their forum as well... kudos!
You don't think enough... therefore you better not be!
My servers dropped off the net yesterday afternoon, and if all goes well they'll be up and running late tonight. At 1700PST they're supposed to do a power test, then start bringing up the environmentals, the switching gear, and blocks of servers.
My thoughts as a customer of theirs:
1. Good updates. Not as frequent or clear as I'd like, but mostly they didn't have much to add.
2. Anyone bitching about the thousands of dollars per hour they're losing has no credibility with me. If your junk is that important, your hot standby server should be in another data center.
3. This is a very rare event, and I will not be pulling out of what has been an excellent relationship so far with them.
4. I am adding a fail over server in another data center (their Dallas facility). I'd planned this already but got caught being too slow this time.
5. Because of the incident, I will probably make the new Dallas server the primary and the existing Houston one the backup. This is because I think there will be long term stability issues in this Houston data center for months to come. I know what concrete, drywall, and fire extinguisher dust does to servers. I also know they'll have a lot of work in reconstruction ahead, and that can lead to other issues.
For now, I'll wait it out. I've heard of this cool place called "outside". maybe I'll check it out.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
The lesson you should be taking from Murphy's Law is not "Shit Happens". The lesson you should be taking is that you can't assume that an unlikely problem (or one you can con yourself into thinking unlikely) is one you can ignore. It's only after you've prepared for every reasonable contingency that you're allowed to say "Shit Happens".
While it may be the fire dept that is erroneously preventing them from bringing up their back-up power, it's part of a poor disaster recovery plan to not engage with the fire dept, electric co, etc. before a disaster happens, so that everyone is on-board with your disaster recovery plans and that you have the ability to implement that plan.
The explosion was isolated to the power room. The servers are fine, the backup generators and batteries are fine. The servers should have been back online if they had a good disaster recovery plan. The whole point of disaster recovery is being able to handle a disaster. You can't say "oh there was a disaster, you can't help that". This is exactly what their plan should have been able to handle. The power room goes offline. It shouldn't matter if it was because of an explosion, a fire, equipment failure or being beamed into outer space.
It also shouldn't matter who is telling them to keep the power off. Part of the disaster recovery plan should have been making sure local authorities allowed them to carry it out. Fine, they have to shut off all power when firemen are in there with hoses. I understand that. But once the fire is out your plan should allow you to bring up backup power. It didn't. So I don't see how they can call themselves a "World Class Data Center". Part of what they sell and what customers expect is disaster recovery. And there are data centers that can provide this.
ThePlanet is pretty cheap compared to datacenters like NAC that have more redundancy and security. But ThePlanet wants to advertise that they are just as good. Now they were caught with their pants down when there was actually a disaster and their disaster recovery plan failed.
Open Source Java DAO Generator
There's no reason to use the forum software when they've locked the thread and are only using it to disseminate information. A Pentium one running lighttpd serving a static html page would be sufficient to handle the flood of requests.
SLA is not a substitute for business insurance.
If your business loses $1000/minute while it's offline, get a quote for insurance that pays out $1000/minute while you're offline. Alternatively if you're happy self insuring take the loss when it happens.
It's almost as if people believe that SLAs are a form of service guarantee instead of a free very bad insurance deal.
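To put rough numbers on the SLA-versus-insurance point above, here is a small sketch. The $1000/minute loss comes from the comment; the $200 monthly hosting fee and the idea that the SLA credit is simply the pro-rated fee for the time offline are assumptions for illustration, not ThePlanet's actual terms.

```python
def downtime_loss(loss_per_minute: float, minutes_down: int) -> float:
    """Revenue lost while the service is offline."""
    return loss_per_minute * minutes_down


def prorated_credit(monthly_fee: float, minutes_down: int, days_in_month: int = 30) -> float:
    """Assumed SLA credit: the share of the monthly fee covering the outage."""
    minutes_in_month = days_in_month * 24 * 60
    return monthly_fee * minutes_down / minutes_in_month


minutes_down = 24 * 60                        # a one-day outage
loss = downtime_loss(1000, minutes_down)      # figure taken from the comment above
credit = prorated_credit(200, minutes_down)   # hypothetical $200/month server

print(f"lost revenue: ${loss:,.0f}")
print(f"SLA credit:   ${credit:,.2f}")
```

Under these assumptions the credit covers a tiny fraction of the loss, which is the gap that business insurance (or a second site) is supposed to fill.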
Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. (Einstein)
You may also be interested in a pretty positive write-up from SANS about ThePlanet's response and handling of the situation thus far.
They still have backup power; the fire department just told them not to use it. They will probably patch things up until the backup power is safe to use again and then use that while repairing the main system.
USE HOT GRITS WITH STATUE OF NATALIE PORTMAN (NAKED AND PETRIFIED)
We moved three racks of servers out of ThePlanet last Thursday. Timing is everything.
How are you going to handle the failover to Houston? Round-robin DNS? Or just very low TTL on the root nameservers so if something goes wrong you can update the nameservers and have them point elsewhere?
Oh and that reminds me, make sure your nameservers are spaced out as well. Learned that one the hard way.
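A quick way to catch the "both nameservers in one building" mistake before an outage does it for you is to check whether any of them share a network block. This is only a crude sketch: the hostnames and addresses are placeholders from the documentation ranges, the function names are hypothetical, and sharing a /24 is merely a proxy, since two different prefixes can still sit in the same data center.

```python
import ipaddress
from collections import defaultdict


def group_by_network(nameservers: dict, prefix_len: int = 24) -> dict:
    """Group nameserver names by the network (of the given prefix length) they fall in."""
    groups = defaultdict(list)
    for name, ip in nameservers.items():
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[network].append(name)
    return dict(groups)


def shared_networks(nameservers: dict, prefix_len: int = 24) -> dict:
    """Return only the networks that hold more than one nameserver."""
    return {net: names
            for net, names in group_by_network(nameservers, prefix_len).items()
            if len(names) > 1}


# Hypothetical nameservers; addresses are documentation-only ranges.
nameservers = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "192.0.2.53",
    "ns3.example.com": "198.51.100.7",
}

for network, names in shared_networks(nameservers).items():
    print(f"{', '.join(names)} all sit in {network}: probably one facility")
```

In the example, ns1 and ns2 are flagged because they share a block, while ns3 is not.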
-- taking over the world, we are.
I feel like crying. A whole weekend without it.
I have no social life. It's true.
Camping on quad since 1996.
I have about 100 sites hosted there right now that are offline, but none worth so much that a day or two of downtime will affect me so much. The worst part is getting phone calls from people I host and trying to explain in as many ways as possible that No, I can not personally go to texas and fix it, and that there is absolutely no alternate way for them to get their email right now.
Other data centers I've been to have been smaller and primarily designed to host mainframes for large corporations. Multiple power coming in to two different rooms, multiple backup generators, adequate ups, etc.
I'm no expert in data centers, but like I previously mentioned, if someone is making claims of 100% uptime I would expect them to have some reasonable way of backing that up. They didn't. Their power room caught on fire. They didn't have a second power room that could be used and they couldn't bring backup power online.
Electrical systems fail, and sometimes catastrophically. There was a transformer that blew up on a utility pole directly across the street from me. The whole house shook. If that was in an enclosed room I can picture walls being blown down. There was also an underground fire in the wiring at one point. In both cases power was brought back online relatively fast.
I don't care what anyone says. This is poor performance compared to their marketing claims. It looks like they couldn't bring up their back-up systems because they didn't work with the local authorities when they came up with that plan.
I'm not saying ThePlanet sucks. But I wouldn't call it "world Class" and I doubt anyone's "100% uptime" claims. There is simply no amount of security / redundancy that can be done at a single location that will provide 100% uptime. Someone should tell ThePlanet to stop marketing something that is so impossible then.
Open Source Java DAO Generator
Anyone know if this is why www.schlockmercenary.com is down?
From the status update thread... "Today at approximately 5:45 p.m., a transformer in our H1 data center in Houston caught fire, thus requiring us to take down all generators as instructed by the fire department. All servers are down." I read this as the fire department ordering them to kill *all* the power for safety reasons, rather than the explosion knocking the whole thing out.
What I do isn't just a web site, it's also a PBX and some other stuff.
The client software that does the automation is easy. I wrote it to handle the need for a failover server so it will just try the other if the first one fails.
The PBX failover is easy, the DID provider will route to both, only one will pickup sooner. If the second does catch a call, it will have a flag on it that lets it know if the primary is still up and in that case will try to transfer over the call.
The Web Site is actually the least important part of the process for me, and I'll likely handle that with a low ttl on that one particular address.
The inbound mail is easy because I use Postini and it is good at failover.
The data is stored in a database that knows how to sync in near real time between the servers, and on disk as files. I use unison to keep the file directories in sync within just a few minutes.
Overall, I think it should work just fine. There are more elegant solutions -- and more expensive. This one will work for me.
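The "just try the other if the first one fails" client logic described above can be sketched in a few lines. This is not the poster's code, only a minimal illustration: the function name, the `connect` parameter (there so the logic can be exercised without a network), and the hostnames in the usage comment are all assumptions.

```python
import socket


def connect_with_failover(endpoints, timeout=3.0, connect=socket.create_connection):
    """Try each (host, port) in order and return the first connection that succeeds.

    Raises ConnectionError if every endpoint fails.
    """
    last_error = None
    for host, port in endpoints:
        try:
            return connect((host, port), timeout)
        except OSError as err:
            # Primary unreachable or refused: remember why and try the next one.
            last_error = err
    raise ConnectionError(f"all endpoints failed: {endpoints}") from last_error


# Usage (hypothetical hosts, primary first, backup in another data center):
#   conn = connect_with_failover([("houston.example.com", 443),
#                                 ("dallas.example.com", 443)])
```

A short timeout matters here: without one, the client can hang on a dead primary long enough that the failover is not much use.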
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
Well my server has been offline and even though I use it for personal use. I am considering ordering another server from a different hosting company and restoring from my remote backups and repointing my DNS and canceling my service with The Planet. I'm glad I don't host my DNS with The Planet for this and other reasons.
Also, has this happened before at this data center? Wasn't this data center owned by Rackshack? Because that's who I signed up with originally before they were bought by The Planet. Here is a link from 2003 about a transformer exploding at Rackshack's Houston data center:
http://www.carrierhotels.com/wiredspace/archives/000010.html
Is this the same data center?
First, that time was an estimate -- a target. Second, even if the initial power test passes, it will take hours to bring up the a/c systems, the switches, and the routers.
:-)
The initial draw from each new bank of gear to be given power will be very high so it will need to go slow.
The battery systems (be they on each rack or in large banks serving whole blocks) will try to charge all at once. If they're not careful, that'll heat those new power lines up like the filaments in a toaster. Remember, the battery plan they have was built with the idea that they'd be used very briefly during transition to generator power -- not drained down all at once.
Only once all the switches and routing gear are back up can they start updating the network paths (do they use BGP for this? that's not my area of expertise) so that peering data starts flowing.
Only once the network is all up and stable (no small task on a site with dozens of high end peering points) can they even start doing banks of servers.
It's also probable that each bank of servers will need its own new power lines (and eventually replaced conduit) in the distribution center that was destroyed.
Bank by bank they'll have to bring up all these servers, each of which will draw its maximum load during boot as disks are scanned and checked.
Most of these servers probably haven't been shut down in months or years. Some drives may not spin up: tired motors can keep running fine, but spinning up from cold is just too much for them now. Other servers may have boot configuration problems that have gone undiscovered because the machines have been running without a reboot for a long time -- Linux ones anyway.
This isn't something out of Young Frankenstein where they'll yell across the room "throw za main svitch!" and watch the lights dim briefly while 9000 servers boot up with the deafening sound of system beeps. If they did try such a thing -- as if such a thing were possible -- it would immediately blow at least another transformer if not more.
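To make the bank-by-bank idea concrete, here is a rough Python sketch of grouping banks into sequential power-up waves so the combined startup draw never exceeds some headroom. The inrush figures and the 500 kW budget are made-up numbers for illustration, and the simple greedy grouping is only an assumption about how one might plan it, not a description of what The Planet actually does.

```python
from typing import List


def plan_waves(bank_inrush_kw: List[float], budget_kw: float) -> List[List[int]]:
    """Group bank indices, in order, into waves whose combined inrush fits the budget."""
    waves: List[List[int]] = []
    current: List[int] = []
    current_kw = 0.0
    for index, inrush in enumerate(bank_inrush_kw):
        if inrush > budget_kw:
            # A single bank that is too big can never be powered within this budget.
            raise ValueError(f"bank {index} alone exceeds the budget")
        if current and current_kw + inrush > budget_kw:
            # Close the current wave and start a new one with this bank.
            waves.append(current)
            current, current_kw = [], 0.0
        current.append(index)
        current_kw += inrush
    if current:
        waves.append(current)
    return waves


# Hypothetical example: five banks and 500 kW of headroom per wave.
print(plan_waves([200, 250, 300, 150, 400], budget_kw=500))
```

Each wave would be allowed to settle (boot finished, batteries past their initial charge surge) before the next one is switched on.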
Think about it. 9000 servers @ an average of what, 300 watts, plus the networking gear, plus the air conditioning, plus charging all those batteries....you're talking megawatts.
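For anyone who wants the arithmetic spelled out, here is a back-of-envelope sketch in Python. Only the 9000 servers and the 300 watt average come from the estimate above; the 2x overhead factor for cooling, networking and battery charging is purely an assumption for illustration.

```python
def facility_load_mw(servers: int, watts_per_server: float,
                     overhead_factor: float = 1.0) -> float:
    """Return total electrical load in megawatts."""
    return servers * watts_per_server * overhead_factor / 1_000_000


# Servers alone, using the figures from the post.
servers_only = facility_load_mw(9000, 300)

# Hypothetical 2x multiplier for a/c, network gear and battery charging.
with_overhead = facility_load_mw(9000, 300, overhead_factor=2.0)

print(f"Servers alone: {servers_only:.1f} MW")
print(f"With assumed 2x overhead: {with_overhead:.1f} MW")
```

The servers by themselves come to 2.7 MW before any of the supporting gear is counted.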
Without a Mr. Fusion or Harry Mudd stumbling in with some chicks wearing dilithium crystal jewelry this is going to take a while.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
The explosion may have tried hard to stop it, but the Schlock site is already back up (albeit a minimal version of the site). So he still hasn't missed a daily update.
2 servers down. Burn.
my first impression would be something battery related. If a short trips something that shuts off your incoming AC, it kicks you over to batteries and generators. If something is then reset and brings you back online a little bit later, your hardware switches back to AC and all the batteries start charging. If the electrical fault wasn't really FIXED, (think sparks spraying from a nearby electrical box) but merely tripped something that you reset, then it can set off a hydrogen explosion from the H and O the batteries are dumping out while being recharged. THAT would require you to take things totally offline to fix since it's the point where your redundant power sources converge.
The support forum posts were not as heavy on detail as I would like to have seen, but better than about any I have ever seen under such circumstances. (something is always better than nothing) Looks like a transformer went out and did some structural damage. Probably not so much of an explosion. If you've ever seen a substation transformer go, that's probably about what happened here.
Their main concern besides getting power restored seems to be to repair networking equipment. Hard to say how that was damaged, it may have power spiked their routers and switches. (could have been other related causes - physical damage or got soaked with transformer coolant when it vented) At any rate, hazmat and firemen in general don't like working on live wires so they basically told them we don't care if you can turn some of it back on, you're going to leave it all off until we're done. Looks like they made good use of that time to gather replacement hardware and build an action plan. At this point it appears that they've been given the OK to get in there and start replacing hardware and fixing power.
There is a good video of a substation problem on youtube. This isn't necessarily what happened here, but you get the idea. Not really an explosion so much as a fire.
I work for the Department of Redundancy Department.
Not to mention that no number of redundant power systems helps you at all when the fire department orders the power off.
Next, after an explosion due to an electrical fault blows 3 walls down, you have a lot of checking to do before just powering the redundant system up unless you'd like more smoke and fire.
Look, when I go into a building in gear and carrying an axe and an extinguisher, breathing bottled air, wading through toxic smoke I couldn't give crap number one about your 100 sites being down.
I have a crew to protect. In this case, I'm going into an extremely hazardous environment. There has already been one explosion. I don't know what I'm going to see when I get there, but I do know that this place is wall to wall danger. Wires everywhere to get tangled in when it's dark and I'm crawling through the smoke. Huge amounts of current. Toxic batteries everywhere that may or may not be stable. Wiring that may or may not be exposed.
If it's me in charge, and it's my crew making entry, the power is going off. It's getting a lock-out tag on it. If you won't turn it off, I will. If I do it, you won't be turning it on so easily. If need be, I will have the police haul you away in cuffs if you try to stop me.
My job, as a firefighter -- as a fire officer -- is to ensure the safety of the general public, of my crew, and then if possible of the property.
NOW -- As a network guy and software developer -- I can say that if you're too short sighted or cheap to spring for a secondary DNS server at another facility, or if your servers are so critical to your livelihood that losing them for a couple of days will kill you but you haven't bothered to go with hot spares at another data center then you sir, are an idiot.
At any data center - anywhere - anything can happen at any time. The f'ing ground could open up and swallow your data center. Terrorists could target it because the guy in the rack next to yours is posting cartoon photos of their most sacred religious icons. Monkeys could fly out of the site admin's [nose] and shut down all the servers. Whatever. If it's critical, you have off site failover. If not, you're not very good at what you do.
End of rant.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
In that case, it cannot be advertised at all. Go ahead and install your 1024N power system and nuclear powered UPS and put it in a nuclear bunker. Fill the room with argon to make fires impossible. But if you can't promise that an asteroid will never ever strike the bunker, you better not advertise 100% uptime.
Of course, if you read the fine print, they probably have something in there about not being responsible for actions of civil authorities...such as the fire dept. ordering them to shut down all power.
I'm all for stricter requirements for truth in advertising, but honestly, it's not like this is a common problem they're having. Talk is dirt cheap. It's easy to CLAIM you could put together a system that could sail right through this sort of problem, but another matter to actually DO it. You'll never be 100% sure it'll work until something like this actually happens.
ants are causing a lot of damage to electrical devices in Texas: http://urbanentomology.tamu.edu/ants/exotic_tx.cfm
new letter/phrase: hex-u means "www"
They don't let you use halon anymore these days. Back in '06 when my company was upgrading its datacenter, we had a similar fire issue. We had just gotten all of the servers moved into the new racks, and everything was running nicely for a few weeks.
Well, what do you know? The UPS blew up. Due to improper assembly of the UPS, one of the main cables was stretched and chafed along the chassis. Imagine, let's say, an array of 12 car batteries in series in a parallel arrangement of 12 (144 total of course). The cable finally wore through late at night and arc gouged several inches of 12-gauge sheet metal chassis before it finally shorted and destroyed the batteries (from what I heard).
It was a very nasty mess, and half of the UPS was this blob of burnt-out circuitry. No fun. My company opted to spend several thousand dollars (over 10k IIRC) for a small FM-200 system.
My guess is that something like this happened to ThePlanet, but the suppression system was most likely a chemical like FM-200 that is less environmentally harmful than halon, though not as cool.
Anyway, the incident at my company was completely avoidable, and I would wager that the one at ThePlanet was completely avoidable as well. I am just glad that ThePlanet lost my business about a year ago for price gouging a loyal customer after three years of service.
One power room doesn't seem to qualify as "fully redundant power system" or a "complete redundant power management system" that "assures 100% uptime".
With only one power room, you have to wonder how thoroughly and how often they perform maintenance on that equipment.
While a transformer explosion might be rare, it is not unheard of. I don't think it's too much to expect a data center that talks about "100% uptime" and "fully redundant power systems" without "a single point of failure" not to go offline because of a fire in the power room.
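To put a number on the "100% uptime" claim, here is a quick availability calculation in Python. The 24 hours is the minimum repair ETA from the story summary; the 99.999% target is just a commonly quoted figure used here as an example, not something from The Planet's SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def availability_pct(downtime_hours: float) -> float:
    """Percentage of a year the service was up, given total downtime in hours."""
    return 100.0 * (1.0 - downtime_hours / HOURS_PER_YEAR)


def allowed_downtime_minutes(target_pct: float) -> float:
    """Minutes of downtime per year that a given availability target permits."""
    return (1.0 - target_pct / 100.0) * HOURS_PER_YEAR * 60


print(f"24 hours down in a year: {availability_pct(24):.3f}% available")
print(f"99.999% allows {allowed_downtime_minutes(99.999):.2f} minutes down per year")
```

A single 24 hour outage already puts the year at roughly 99.73%, so anything advertised above that is gone for the year.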
Open Source Java DAO Generator
"Did you catch the article about Google's datacenters the other day? Clearly they recognize that fact and design around it."
I wonder (http://www.informationweek.com/news/storage/showArticle.jhtml?articleID=202400961): how does a 'data center' in a box go down when its 'power room' explodes?
Complicated electrical devices, especially ones where varying current can cause undesirable operation, are the kinds of devices that make a big bang when they go up in smoke. The conventional data center can put these parts far away from the servers, so even with 3 walls going down no servers were harmed... but if it's all tightly integrated into a 'box', what happens to all the servers and the data?
I suppose if the thing is as big as a semi trailer, it could have a blast barrier between the servers and the power unit... otherwise, a data center in a box is potentially a less safe way of implementing a data center than the conventional approach.
https://www.gnu.org/philosophy/free-sw.html
You sir, don't know what you're talking about. Reaching for ridiculous examples of someone doing their job wrong doesn't change that.
Our S.O.G. (standard operating guidelines) are actually very specific about risk.
We will risk our lives to save a human life.
We will take reasonable risk to save the lives of pets and livestock.
We will take minimal risks to save property.
Sorry, but your building isn't worth the risk of my crew. That's reality.
Don't you DARE tell me what is and isn't bravery or cowardly until you put 50 pounds of gear on and crawl into a pitch black house that's burning over your head.
Don't you DARE tell me that you think you understand the difference between saving the blonde girl and saving your computer server.
This isn't TV World. This is the real world. Fire on TV doesn't look like real fire. You know why? Because a real house on fire doesn't look like anything but pitch black and that makes for lousy TV.
Get over yourself and go volunteer at your local fire department. 86% of the men and women in this country who will risk their lives for yours are volunteers. We could use your help if you have the guts for it. We'll teach you what you need to know -- and we'll keep you as safe as we can so you can go home to your family when it's done.
Your examples are stupid and insulting to the 800,000 brave men and women who volunteer to risk death in the most painful way possible to save your sorry butt.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
The Planet's video tour of why this wouldn't happen is up and working. Click on "Take the tour", which has many data center pictures. I like the "100% uptime" part.
It turns out they didn't have all the redundancy they said they had. Their central server management system and the DNS servers for those hosts were all in that data center. So customers couldn't get in and switch the DNS to another location for hours.
They now claim to have the server management system back up.
I applaud your response, CFD. It's the responsibility of the data center management to ensure redundancy in the event of fire, flood, earthquake, tornado or curious squirrel. In this case, there was an explosion and fire. I've seen fire up close when we invited the local volunteer fire department to use an old house as training. Once they set the fire, the entire thing was engulfed in 26 seconds. They intended to enter and practice saving occupants but it was 60 year old, dry wood and just went poof, so they just waited a minute and turned the hoses on. It smoldered for about 24 hours and was a pile of ash.
In this datacenter, there are all kinds of things that could smolder and cause secondary fires if the generators were turned on and something unknown happened (i.e. short from inside?). Plus, don't firefighters need to check for hot spots? Isn't that easier if power's off? Don't get me started on structural stability, either... 3 of 4 walls collapsed. That has to count as added risk.
So yes, if ThePlanet was willing to take the risk that their building was destroyed by earthquake, they can accept 24 hours downtime at the insistence of the fire chief. Redundant data centers for critical operations; acceptable tactical losses for whatever doesn't have redundancy. Murphy's law happens, and nothing is truly redundant. If their 9,000 customers expected full redundancy, those customers will need to re-evaluate what exact kind of redundancy they're getting. Not everyone needs multi-datacenter stability which is horrifically expensive. After reading this story, I'll be getting a second server for my 19 domains, on a different provider in a different city. Just in case.
Since it was posted AC the second time, it may have been someone else pretending to be him. You never know...
Otherwise, I'm curious as to the probability of whether he meant that (and completely missed Hijacked Public's point), or is just trolling.
Why OpalCalc is the best Windows calc
As previously committed, I would like to provide an update on where we stand following yesterday's explosion in our H1 data center. First, I would like to extend my sincere thanks for your patience during the past 28 hours. We are acutely aware that uptime is critical to your business, and you have my personal commitment that The Planet team will continue to work around the clock to restore your service.
As you have read, we have begun receiving some of the equipment required to start repairs. While no customer servers have been damaged or lost, we have new information that damage to our H1 data center is worse than initially expected. Three walls of the electrical equipment room on the first floor blew several feet from their original position, and the underground cabling that powers the first floor of H1 was destroyed.
There is some good news, however. We have found a way to get power to Phase 2 (upstairs, second floor) of the data center and to restore network connectivity. We will be powering up the air conditioning system and other necessary equipment within the next few hours. Once these systems are tested, we will begin bringing the 6,000 servers online. It will take four to five hours to get them all running. We have brought in additional support from Dallas to have more hands and eyes on site to help with any servers that may experience problems. The call center has also brought in double staff to handle the increase in tickets we're expecting. Hopefully by sunrise tomorrow Phase 2 will be well on its way to full production.
Let me next address Phase 1 (first floor) of the data center and the affected 3,000 servers. The news is not as good, and we were not as lucky. The damage there was far more extensive, and we have a bigger challenge that will require a two-step process. For the first step, we have designed a temporary method that we believe will bring power back to those servers sometime tomorrow evening, but the solution will be temporary. We will use a generator to supply power through next weekend when the necessary gear will be delivered to permanently restore normal utility power and our battery backup system. During the upcoming week, we will be working with those customers to resolve issues. We know this may not be a satisfactory solution for you and your business but at this time, it is the best we can do.
We understand that you will be due service credits based on our Service Level Agreement. We will proactively begin providing those following the restoration of service, which is our number one priority, so please bear with us until this has been completed.
I recognize that this is not all good news. I can only assure you we will continue to utilize every means possible to fully restore service. I plan to have an audio update tomorrow evening.
Until then,
Douglas J. Erwin
Chairman & Chief Executive Officer
As much as I agree with minimizing safety risk, his main point (in my opinion) was:
"So cut the feed from the power company, case closed. Shutting down the redundant power generators that are DOWNSTREAM from the problem?"
In other words, if the backup power *to* the servers is kept on, and just the primary power is turned off (i.e. from the power company), then surely that's 99.999% safe? After all, the room for the main power is separate from the backup power's room.
Maybe I'm missing something, and the power can be leaked from the backup power to the servers, and then finally to the broken main power setup. I doubt that, but at least they should ask the server's technicians if that's even theoretically possible.
Mind you, the fact that ThePlanet owns 5+ datacenters lends credence to what you're saying.
Put your servers in two geographically-isolated datacenters, and you'll be considerably more protected against virtually any sort of calamity that could occur to your servers.
There are so many things that a datacenter simply cannot be 100% prepared for. Would you really blame the provider if a plane crashed into their building?
It's far cheaper to simply colocate in 2+ locations than it is to prepare for every single event that can possibly occur, no matter how remotely unlikely it may be.
-- If you try to fail and succeed, which have you done? - Uli's moose
At first I thought you must know me to use my name, then I realized it was just a cheap trick of looking at my email address or profile.
You know how I knew? Because nobody who knows me would make the mistake of calling me an obey-authority anything. You've got the wrong fool on that one.
The arrogant insult dripping through the troll's post to which I responded deserved all the ridicule and self righteousness it contained.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
While it sounds like a reasonable approach at first, it makes assumptions that I can't make as an officer on scene.
1. It assumes that the only problem is with the original transformer. When I arrive on scene I don't know what the problem was -- even if you tell me you do know, I can't believe it. I also don't know what the secondary problems are.
2. Feeding power into a building that has been physically damaged is very very dangerous. We're not talking about a transformer "failing to work" we're talking about something that blew the walls off the room it was in.
3. We already know that things didn't go the way they were supposed to. Something failed. Some safety plan didn't work. We have to assume that we're dealing with chaos until proved otherwise.
So, as a fire officer I arrive on scene and have a smoke filled building with reports of an explosion and MAYBE a report that everyone is out. I need to go in and find out what happened, if anything is still burning or in immediate danger, and if anyone is still inside. To do that safely, the first thing I want to do is secure the power to the building (shut it off) as well as any other utility feeds (oil, steam, liquefied petroleum or natural gas).
The gear I carry -- even the radio -- is designed to never create even the tiniest spark in its operation. We call it "intrinsically safe". It's one of a great many precautions we take.
We go in to a place like this not knowing the equipment, not knowing its condition.
My final proof point --
If in fact The Planet had powered up their generators, they'd have fried a lot more stuff and caused more fire. They may have destroyed their chances of salvaging the grid within 48 hours at all. Why? It turns out (we now know) that the force of the initial explosion moved three walls in the power distribution center more than a foot (I heard 3 feet I think) off their base. This tore out electrical connections, cables, conduits and power switches. Just now, after 28 hours, they've figured out how to get power to the servers on the second floor, but for the first floor servers they're having to rig up a line from the generators to that floor and it will take until tomorrow to do that. Why? Because the electrical connections from that distribution room to the first floor servers are destroyed. They're going to be running 3000 servers on the first floor off those generators for a week while they get the equipment to rebuild the connectivity to the main distribution room.
What does this prove?
1. It proves the fire marshal was right in not allowing them to feed power in there.
2. It proves that the big dumb fireman you see (who may be a volunteer who's also a network guy and software developer with an IQ above 95% of the world) may in fact have a good reason for the way they do things on scene.
Look, as a firefighter I don't set out to ruin someone's day. I set out to keep them safe. If that sounds paternalistic, well, it is paternalistic. It very much feels that way. In my small town, it's how I feel. I wonder every time I walk into a building how I would protect MY PEOPLE in this building if a fire broke out or a hazmat incident started or whatever. You can't help it, it's what you're trained to do.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
You are probably thinking of auto insurance. Yes, it usually goes up when used. The reason is because when you use it, it is usually because you did something that changed your risk level. If you get in an accident, that makes you a higher risk. Continue to get in accidents, you are a higher risk still. Thus the companies want more money. It's all based on risk calculation. That's also why they want more money when you are under 25. Statistically speaking, young people are a much higher risk of accidents.
Well, with building insurance, that's not the case. You aren't really a significant risk factor. Risk is instead calculated off of things like what kind of structure it is, how far it is from the fire department, what it's used for, what it contains (that determines what they are on the hook for) etc. So when something happens, unless it was because of a previously unknown risk factor, your rates don't necessarily change. Nothing changed with regards to risk.
Insurance is really all just risk based. They take the probability of having to make a payout and the amount of said payout vs time and come up with a rate. If something changes the risk, the rate will change as well, but if not then it doesn't change. It isn't as though your one single payout is of any significance to their overall operation.
Also, the idea of "Just pay for it yourself," is extremely silly. It smacks of someone who's never owned something of any significant value. The reason behind insurance is that you CAN'T just pay for it yourself. For example I have insurance on my house. The reason is that if I lost it, I can't afford to replace it. I don't have a couple hundred grand just lying around in the bank. That's the point of insurance. You are insuring that if something happens that you can't afford, someone will pay for it. The insurance company is then betting, of course, that it isn't likely to happen and they get to keep the money.
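The "probability of a payout times the amount of the payout" idea can be sketched in a few lines of Python. All the numbers here are invented for illustration (a 0.2% yearly chance of a total loss on a $200,000 house, plus a 25% loading for the insurer's overhead and profit); real actuarial pricing is far more involved.

```python
def annual_premium(p_claim: float, payout: float, loading: float = 0.0) -> float:
    """Expected yearly loss plus a proportional loading for overhead and profit."""
    if not 0.0 <= p_claim <= 1.0:
        raise ValueError("p_claim must be a probability between 0 and 1")
    return p_claim * payout * (1.0 + loading)


# Hypothetical house: $200,000 to replace, 0.2% chance of total loss per year.
expected_loss = annual_premium(0.002, 200_000)
with_loading = annual_premium(0.002, 200_000, loading=0.25)

print(f"Expected yearly loss: ${expected_loss:,.2f}")
print(f"Premium with assumed 25% loading: ${with_loading:,.2f}")
```

Note that one past claim does not appear anywhere in the formula. The rate only moves if the claim changes the estimated probability or the payout.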
someone mod this up please.
:(
I'm in DC1 at The Planet, I'm down and I'm not pointing any fingers. I'm pretty sure they did a reasonable job of setting up their systems in such a way that the chances of this happening were small to begin with, and when it did happen they seem to have things under control as much as possible.
There are a lot of 'armchair' specialists and complainers around here and all I would like to say to them is we'll see how it goes when *you* operate 5 large datacenters for years. Accidents do happen (and by their nature are caused by the unforeseeable), how you deal with them is what matters.
And fire fighters lives are more precious than *any* amount of hardware.
The only person I blame for not verifying that what I thought was redundant DNS in two locations actually was redundant is me, and I really thought I had it set up that way.
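A crude way to sanity-check that kind of setup is to look at whether all your nameserver addresses sit in the same network block. The Python sketch below assumes you have already looked up the addresses yourself; the hostnames and IPs are reserved documentation values, not real servers. Different networks do not prove different buildings, so treat a pass as a hint only, but a failure is a strong sign that everything lives in one place.

```python
import ipaddress
from typing import Dict


def nameservers_share_network(ns_ips: Dict[str, str], prefix_len: int = 24) -> bool:
    """Return True if every nameserver address falls in the same IPv4 /prefix_len block."""
    networks = {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in ns_ips.values()
    }
    return len(networks) <= 1


# Hypothetical setup where both nameservers turn out to be in one block.
same_place = {"ns1.example.com": "192.0.2.10", "ns2.example.com": "192.0.2.20"}

# Hypothetical setup with the secondary in a different block.
spread_out = {"ns1.example.com": "192.0.2.10", "ns2.example.net": "198.51.100.10"}

print(nameservers_share_network(same_place))
print(nameservers_share_network(spread_out))
```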
MP3 Search Engine
For example someone like Newegg.com probably has a redundant data centre. Reason being that if their site is down, their income drops to 0. Even if they had the phone techs to do the orders nobody knows their phone number and since the site is down, you can't look it up. However someone like Rotel.com probably doesn't. If their site is down it's inconvenient, and might possibly cost them some sales from people who can't research their products online, but ultimately it isn't a big deal even if it's gone for a couple of days. Thus it isn't so likely they'd spend the money on being in different data centres.
You are also right on in terms of type of failure. I've been at the whole computer support business for quite a while now, and I have a lot of friends who do the same thing. I don't know that I could count the number of servers that I've seen die. I wouldn't call it a common occurrence, but it happens often enough that it is a real concern and thus important servers tend to have backups. However I've never heard of a data centre being taken out (I mean from someone I know personally, I've seen it on the news). Even when a UPS blew up in the university's main data centre, it didn't end up having to go down.
I'm willing to bet that if you were able to get statistics on the whole of the US, you'd find my little sample is quite true. There'd be a lot of cases of servers dying, but very, very few of whole data centres going down, and then usually only because of things like hurricanes or the 9/11 attacks. Thus, a backup server makes sense, however unless it is really important a backup data centre may not.
After an explosion you simply cannot assume that the original wiring diagram still matches reality. Any discrepancy between the two translates into a serious elevation of the risk...
In other words, what you think is a 'dead' wire could easily be a live one because one or more cables that used to be insulated are now connected.
Until you have inspected the situation and seen that things are good, you're better off not risking powering up.
In fact, now that some of the dust has cleared up it seems that the damage was in fact much more serious than was assumed initially, and powering up the servers using emergency generators would not have mattered one little bit (whether it would have worked or not, or even made matters worse is another matter).
never before was 'anonymous coward' more appropriate.
if you're sick of these 'obey authority' fools telling you what you are allowed to criticize and not criticize I suggest you set up your own commercial firefighting service under your new and enlightened guidelines.
If you can get so much as 1 single person working for you under those guidelines I'll be very amazed.
Attacking the messenger is perfectly acceptable if the messenger tells you how to do your job for you and what risks you should take to save their property. No amount of property is worth the life of a firefighting crew.
Btw, we lost three firefighters in a flashover nearby recently, they were in fact 'just' trying to save some property.
MP3 Search Engine
Remember the Computer on Fire function in the BeOS kernel?
Any datacenter is subject to some form of catastrophe that will take it completely offline. They had an explosion in the power room - that's pretty wild and pretty nasty. Similarly, a good fire could take the whole place down, or a plane could crash into the building. If you're making enough money to miss the revenue, perhaps you should have a second server at another DC (preferably with a different provider).
I've had a server in that building for years now, and this is the first major outage we've had. If I were losing enough money to bitch about, I'd be kicking myself for not making sure that things failed over to a completely different DC.
Except for the fact that The Planet DOES have a redundant power supply that WAS ready to be switched over to - did you RTFA and see the part where the Houston fire department informed them that they were not allowed to switch over?
This is in no way an intentional deceit.
How will it be affected by the extended fireball and what are the ramifications of a positive or negative response.
or if it still looks like this.
I wonder if that's why World of Warcraft is down?
I click the link and it DOES bring up the page. Unfortunately, since it is a cached copy of the page, it is sometimes out of date. I.e., there have been updates to the actual page that are not reflected in what the Coral Cache copy displays. :/
As of this writing (Monday morning, 06/02/08), it appears that ThePlanet Datacenter folks have created a NEW STATUS PAGE to lessen the load on their servers:
Insurance companies around here have generally looked at two things when I was applying for home insurance:
a) Have I been a previous customer of them or their affiliates (discount points)
b) Have I been a customer of other companies *without a claim* (discount points based on claimless time)
You can get further points by things such as having security bars on windows (anti-theft), a home alarm system, fire extinguishers, properly placed fire alarms, etc etc.
Your overall discount is then based on your final number of points.
So, while your "base rate" doesn't change much based on this, the final price can vary quite a bit depending on which discounts you're eligible for, with claims-free time being one of the factors in this.
Hack the Planet!
Showing my age
Did they find Earth? Tell them to turn around. We'll likely blow them outta the sky before they had a chance to talk.
Given that they shut down under orders from the fire department and that an explosion that knocked 3 walls down didn't damage the batteries, I'd guess they have more than one power room. They have a fair portion of machines powered up now. Since it would take longer than that to clean up after an exploded transformer and install a new one in its place, that room must not have been a single point of failure.
Honestly, it doesn't matter how well isolated your power systems are from each other, if there is a fire, particularly one with explosions, the fire department WILL order all power shut down. If you don't do it, they will (perhaps not so nicely). You will not be permitted to restore power at all until they are satisfied that it is perfectly safe. They will not be the least bit concerned with your uptime. Their priorities will be 1. nobody gets hurt and 2. no more fires.
Believe me, I am well past sick and tired of advertising that is synonymous with a big fat lie, but I don't think this is an example of it.
(With all due respect to the brave men and women of the fire department, couldn't resist)
Ray: "Everything would have been fine if dickless here hadn't shut off the main power grid!"
Walter Peck: "These men caused an explosion!"
Mayor: "Is this true?"
Peter Venkman: "Yes, it's true... this man has no dick."
--Coming up with something clever... please wait...
Open Source Java DAO Generator
Ever tried to do that? Unless you are a life critical operation like an intensive care unit they aren't likely to be all that interested and anything they might agree to is non-binding.
As for running on backup generators, as long as they are willing to sustain that long enough to replace the transformer, what's the problem? They have demonstrated the ability and willingness to provide power in the event that the transformer and power room are blown to bits.
Once a quarter, when they claim they test their backup systems, they should have a fire engine standing by and an inspector or chief there too, and walk them through what's happening. That way, if there's a real disaster, the fire dept either paid attention and knows your backup system is isolated from the main power system, or, if they didn't pay attention, you've shown them enough competence during your drills that they believe you.
As for running on backup generators, as long as they are willing to sustain that long enough to replace the transformer, what's the problem? They have demonstrated the ability and willingness to provide power in the event that the transformer and power room are blown to bits.
The problem is they didn't have a fully redundant power system without a single point of failure like they claim. It took them more than 40 hours to start powering systems up. This is not what most people would expect from a "world class data center" that's capable of reliably handling your IT infrastructure.
Open Source Java DAO Generator
Xconomy was one of the sites hosted at "H1," as The Planet calls it. After waiting all day Sunday to see whether we'd be back up on Monday, we decided to move the site back over to Media Temple, our previous hosting provider, at least temporarily. (Ironically, one of the reasons we left Media Temple in the first place was that they couldn't handle the traffic when our flying car stories got slashdotted.) We published a post about our experience with the outage this morning.
I'm not going to fellate you because you are a firefighter, sorry. There is no such thing as a sacred elephant and there are pros and cons to everything.
As a previous poster accurately points out, we have been forbidden from taking our own actions when we consider the risks and costs to be worth it. I would have no problem if I had the choice to hire a firefighting service, understanding that the more the risk, the more it will cost me. What I have a problem with is there being a monopoly of only ONE legally allowed provider, and then having that provider refuse service.
I can't stand the way if you criticize troops or firemen, you are considered flamebait. It's a volunteer service. I know plenty of macho asshats who sign up specifically because they know they are basically going to be beyond reproach if they make it in.
People risk their lives every day. I already pointed out that the average electrician (who makes a journeyman's wages of about $25K) faces more danger and risk in his daily job than a firefighter may face in a week. Hell, someone manning a counter at a convenience store in a bad neighborhood has a chance at getting shot.
You aren't special. You aren't magical. You aren't beyond reproach. You don't have a monopoly on heroism. I can point to countless tales of ordinary people who risked their lives or even died trying to pull someone from a canal or even a burning building.
In fact, I have more respect for the average joe who does it than the person who collects a paycheck to do it.
And if you don't like it, then I expect you to support the privatization of social services so that if some chief decides that a 0.000001% chance of death is an unacceptable risk for saving property, I can pick up the phone and call someone who will.
PS: for SOGs that are "very specific" about risk, you have some pretty vague and unspecific terms like "reasonable" and "minimal".
-JoeShmoe
.
-- I wonder which will go down in history as the bigger failure: the War on Drugs or the War on Filesharing
If you're making/losing that much money due to your web presence, then it's your own fault for not having a redundant server set up. If you're making $1000/day in sales, you'd better consider shelling out another $80 or so for a second box for exactly this kind of situation.
There's only so much planning that can be done, because every so often a meteor's gonna come down and put a hole right through the middle of your server, and it's not up to your host to have 6" titanium reinforced roofing or anything. If your hosting is that important, BUY SOME REDUNDANCY.
As well, I've found the service and support has become significantly better since The Planet took over, but maybe it's just because I have reasonable expectations. Most of the people complaining seem to be the "OMG I'M LOSING TEN THOUSAND DOLLARS A DAY ON MY $80 HOSTING PLAN! YOU GUYS NEED TO MAKE IT WORK! NOW!" types.
ND
This statement is forty-five characters long.
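The redundancy argument in the comment above can be put into rough numbers. This is only a back-of-the-envelope sketch: the $1000/day figure comes from the comment, while treating the "$80 or so" as a monthly cost and the function names themselves are assumptions made for illustration.

```python
def outage_loss(daily_revenue, outage_hours):
    """Revenue lost during an outage, assuming sales are spread evenly over the day."""
    return daily_revenue * outage_hours / 24.0


def months_of_redundancy_covered(daily_revenue, outage_hours, monthly_cost):
    """How many months of a second server one outage's losses would have paid for.

    monthly_cost is an assumption: the comment only says "another $80 or so".
    """
    return outage_loss(daily_revenue, outage_hours) / monthly_cost


if __name__ == "__main__":
    # A full day offline at $1000/day, against a hypothetical $80/month second box.
    print(months_of_redundancy_covered(1000, 24, 80))
```

Under those assumptions a single day offline costs about as much as a year of the second box, which is the point the commenter is making.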
OK so it sounds like an easy way to kill everyone at a major metropolitan hospital is to set off a smoke bomb in the main electrical room because then they will unquestioningly shut down ALL ELECTRICITY including the backup generators that are keeping people on life support alive.
What, that's stupid, you say? Well so is your blanket assertion.
And you are only confirming what I said... firefighter != electrician which is why I used the term IGNORANT. Regardless of the facts and 20/20 hindsight in this particular scenario, the question remains...is it considered standard operating procedure at ANY hosting facility to shut down all power, INCLUDING backup power, regardless of the size or nature of the threat?
Or was this an accurate and analyzed response and not just a knee-jerk reaction? Because otherwise I hope the terrorists don't learn we are ten smoke bombs away from having our entire telecommunications infrastructure turned off to "avoid risking firefighters".
-JoeShmoe
.
-- I wonder which will go down in history as the bigger failure: the War on Drugs or the War on Filesharing
Because you probably wouldn't accept what I said, here is an article with some interesting stats:
http://money.cnn.com/2006/08/16/pf/2005_most_dangerous_jobs/index.htm
I don't see firefighters on that list. But I do see electrical workers.
Who's waving the flags for them?
-JoeShmoe
.
-- I wonder which will go down in history as the bigger failure: the War on Drugs or the War on Filesharing
Actually the fact that they have to run off generators shows they did not have N+1 redundancy in their power feed. Reading between the lines, it looks like they had a single power feed into the building, then generators/UPS after that. I've designed and built "world class" data centers and data centers used by hosting providers; don't get the two confused. Hosting is a cost-centric business in general and insurance is cheaper than gear. Now the funny bit about world class data centers is they are still expected to fail. Services running out of them are hosted at 2 locations if at all possible, with a primary and backup set of gear per location. It's all rather expensive but that's how you get to 5 nines or better. Nothing will help you if you have a single DC and the fire trucks roll in and start cutting power.
No sir I dont like it.
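For a sense of what "5 nines" means in the comment above, and why hosting at two locations helps, here is a small sketch. It assumes site failures are independent, which a shared power feed or a single shutdown order would violate; the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per (non-leap) year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(site_availabilities):
    """Availability of a service that stays up as long as at least one site is up.

    Assumes sites fail independently of one another.
    """
    p_all_down = 1.0
    for a in site_availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


if __name__ == "__main__":
    # Five nines allows only a few minutes of downtime per year.
    print(downtime_minutes_per_year(0.99999))
    # Two independent "three nines" sites, combined.
    print(combined_availability([0.999, 0.999]))
```

Five nines works out to a little over five minutes a year, so a single 40-hour outage uses up that budget for centuries. That is why the second location matters more than any amount of gear in one building.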
It's really sad and funny.
By the way, last year I earned less than $2000 as a firefighter. We're volunteers (or in the case of most, the term is sort of a misnomer, we're paid a minimal amount of money to keep some legal requirements met by our towns).
Here are some numbers for you - I believe these are a year or two out of date, but I'm not going to look for newer:
Of US firefighters, ~300,000 are full time career firefighters, while ~800,000 are your neighbors who have regular jobs and respond when called.
Of US fire departments, 86% are all call-responders (volunteers) while 92% have at least half call-responders. Last I knew, FDNY had at least one call-responder station out on Long Island, but that may be out of date.
I did not and do not ask to be held up as a shining knight of irreproachable perfection. It wouldn't fit well anyway. I did ask that you not ridicule and insult an honorable vocation and the people who, like me, spend hundreds of hours a year training to be aware of how to deal with emergency situations ranging from a toaster oven fire to a train derailment with toxic chemicals or a data center fire with massive hazards.
So no, I'm no Bruce Willis. I'm a network guy, a business owner, and a software developer -- and a volunteer firefighter who spends as much time training for that field as in computers and technology. You may be surprised, but both are extremely technical fields.
Your statements do not accurately reflect the real danger of the situation in general, nor did they reflect a solid understanding of this incident in particular. You seem to think this was a transformer outside the building and that generator power could be applied through the generator at minimal risk to the firefighters and the workers in the building. That just isn't the case here, and usually isn't.
Finally, you are prevented from doing things which can cause you harm in cases where -I- am obligated to save you, in cases where you endanger other people (including me), or in cases where you risk damage to other people's property.
At this particular scene, the walls of the room containing this transformer were blown several feet from where they'd been. Virtually all the power conduits and lines in that room were totally mangled. It took 28 hours for the electricians to repair enough of that wiring and infrastructure for the second floor power to begin to be restored -- with generators. I'm told a few hours of that was spent waiting for the return of the fire marshal to inspect and certify the work as safe. It has taken almost 20 more hours to create a temporary new power distribution infrastructure to manage the power to the first floor. I'm hearing reports that people's 1st floor servers are starting to come online -- several hours ahead of expectations as laid out last night. The second floor will be on regular power soon if it's not now, but the first floor will have to run on those backup generators for a week, while the entire power grid for that floor is re-built from scratch before it can be connected permanently.
So, it seems the fire marshal was right. Also, even if he had not required that it be kept offline, it seems very unlikely that TP would have brought the generators up without first inspecting the damage, and as soon as they had, they'd have known it would be impossible.
Your frustration at having your servers offline has led you to declare that they are worth more to you than the lives of human beings. That, I find disgusting.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
SOGs are guidelines and not procedures or rules for reasons like what we're arguing about.
Take any specific incident and we can pick apart its details -- especially in the light of day with more facts. A firefighting crew arrives on scene knowing only that there's something really wrong. You may have heard "explosion" or "fire" but often what you hear en route is dead wrong. You don't know what caused the explosion. You don't know what exploded. You don't know if people are trapped or injured. You don't know if the ripped and stripped wiring is carrying enough voltage to kill you when you touch a chair leaning against a wire you didn't see. You don't know if the explosion was from a gas leak and there's more gas leaking now. You don't know what toxic chemicals are in whatever blew up. These days, we're also trained to think explosions may be bombs. In that case you may have a secondary bomb -- people who set bombs like to make a small first one that draws police and fire, then a large second one that kills them.
You go into a dark building carrying every kind of tool you can carry to deal with whatever variety of broken thing you may find. One group is doing a search for people. One group is doing a search for fire - or other hazards. Fire may look out but be in the walls, or overhead in the drop ceiling. You can't tell.
In Prince William County, a very well trained crew entered a house where fire was visible on the outside back wall. It was before 7am, there were cars in the garage and nobody on the front lawn to say if anyone got out yet. They made the second floor and found temperatures not over 90 degrees and a light haze in the air. We call that a tenable environment so we search. They got down the hallway when the fire dropped down on them from the attic space. One of the two upstairs got out, the other didn't. The reports say the temperature at face height in that upstairs hallway went from 90 degrees to over 700 degrees in a few seconds.
I do my work in a small town hundreds of miles away. Still, we studied that incident like we study any other where men are killed. In most cases, they're killed because something got a lot more dangerous than it looked -- even to trained firefighters -- very very fast.
This is why I ask you not to insult firefighters by pretending to know what is and isn't dangerous without the years of training and practice that go with it.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
If it's me doing the poking, I probably have a good idea what I'm looking at -- and actually that would make it harder to do my job as a firefighter because I'd be too interested in the stuff.
What they're looking for is the basics of good electrical line management. No blocked vents, no exposed wiring. No extension cords through walls (common), no extension cords used as permanent wiring (common), no extension cords coiled up and flowing power (heat builds up and they catch fire), no chains of power strips -- and so on.
Also, are the fire doors operational and not blocked open? Are the sprinkler or other fire suppression systems in order? Are exit signs lit and accurate? Are emergency battery lighting units charged and ready?
They don't care if your stuff doesn't work. They care if you get trapped because the fire code stuff isn't right.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
There was more disruption than to the servers in H1. NewsBlaze.com is in another datacenter, but it seems both NS1 and NS2 are in the same place, in H1. So even though our webserver equipment wasn't involved, we were down too. Depending on DNS cache times, traffic slowly dropped off to just a trickle and then eventually nothing. And it took them a long time to get that fixed, a lot longer than it should have. They never appeared to listen to what I told them when I called - and I tried hard to get the support staff to pass the message along. This is one of the problems with big companies - they don't listen to their customers at the times they most need to. We couldn't have been the only ones in that situation. It would be interesting to know how many servers outside H1 were affected. Thank you to the few slashdotters who visited the NewsBlaze story before the server became inaccessible. I'll be writing more about this. The Planet Houston Data Center Goes Up in a Puff of Smoke
Daily News http://newsblaze.com
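The NS1/NS2 problem described above is something you can roughly check for yourself. The sketch below groups nameserver addresses by network prefix. Sharing a /24 is only a heuristic for "same place", and the addresses shown are reserved documentation ranges, not The Planet's real nameservers; the function names are hypothetical.

```python
import ipaddress
from collections import defaultdict


def group_by_network(addresses, prefix_len=24):
    """Group IPv4 addresses by the network they fall into at the given prefix length."""
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False lets us pass a host address and get its enclosing network.
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[str(net)].append(addr)
    return dict(groups)


def shares_single_network(addresses, prefix_len=24):
    """True if every address lands in the same network, a hint of a single point of failure."""
    return len(group_by_network(addresses, prefix_len)) == 1


if __name__ == "__main__":
    # Hypothetical nameserver addresses from documentation-only ranges.
    same_site = ["192.0.2.10", "192.0.2.20"]
    split_sites = ["192.0.2.10", "198.51.100.20"]
    print(shares_single_network(same_site))
    print(shares_single_network(split_sites))
```

Different prefixes don't prove the servers sit in different buildings, but identical ones are a good reason to ask your provider where the secondary actually lives.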
What single point failed and took everything down? They had TWO failures, one was the explosion and the other was a shutdown order from the FD.
Considering that there was an electrical explosion, the latter is only to be expected. Had they somehow not done that, someone (quite probably you) would be griping about how they risked sending 9000 servers and a dozen employees up in flames just to help their stats.
Funny thing about explosions, things move around as a result (cables, conduits, walls and parts thereof). There WAS after that a non-zero chance that a new hazard developed and that it could have threatened human life. No amount of redundancy would have avoided that. This isn't a spacecraft, it's a data center. It's not like the threat to life from a shutdown was as great or greater than the threat of not shutting down.
Feed into the building, no they did not, but so what? Into the servers, yes they did, as advertised.
HEY! Thanks for that! I just recently stopped making an explicit link in posts here because they seemed to automagically turn into a link anyway. I just figured it was some /. feature.
No idea how long it'd take for me to discover it was my linkification Firefox addon unless you had pointed it out! Thanks again!
The point is redundant power feeds are a fairly cheap and common practice. They did not seem to have that in their world class data centers. They put themselves up against IBM datacenters, and the three of those I've worked at all have n+2 power feeds into the building from multiple substations.
No sir I dont like it.
Life support gear is, as I understand it, built with battery powered redundancy and regularly tested. I don't work with that gear, but I believe it is the case.
I also believe that there are some circuits in hospitals which are specially labeled and are not shut off unless absolutely proved critical or already damaged. In these cases, a lot of money is spent building safety conduits for their cabling and other precautions so that they can handle major damage to the building without becoming a hazard.
Hospital emergencies are their own unique events and there are pre-plan documents and procedures in place for dealing with exactly the issues you describe.
Finally, I would point out that many firefighters are in fact electricians. You see, even career firefighters are not paid well, and most have a second job. Of those, a majority are in the construction and/or contracting trades. It is a good fit for them.
People misunderstand the role of a firefighter thinking we just show up and put water on things that are hot. Surely that's the fun part of the job.
In reality, we also have to be many many other things. We have to be truck drivers (you ever drive fast in a 40,000 pound truck carrying a thousand gallons of moving liquid?). We have to be experts in building construction. We have to understand electrical work. We have to be certified in hazmat operations. More recently, we also have to be certified in NIMS (National Incident Management System) which allows us to interoperate using the same language and procedures. We have to be experts at high and low angle rope rescue, confined space rescue, below ground rescue, first responder medical support, mechanical rescue (man vs. meat grinder), traffic control, flood control, crowd control, tree removal, bees, wasps, & snakes, vehicle rescue (we don't take patients from cars, we remove cars -- in chunks -- from around patients) and anything else risky or scary you might want help with.
Even plumbing and water supply -- Just imagine showing up on a scene without a fire hydrant for miles, and being able to organize a tanker shuttle, dump tanks, pumpers, and lines to supply more than 1000 gallons of water a minute within 5 minutes of arriving on scene. That's enough water to fill your pool faster than you can fill your bathtub.
A firefighter crew is a small group of men very much like MacGyver (not as smart maybe but better equipped) with every kind of tool imaginable that they can carry around with them (especially on a heavy rescue unit). You put these guys into ANY situation and within seconds they'll organize around a safe plan for getting to the best possible resolution with the least risk to life and property.
I think the only ignorance I see here, is that which you are demonstrating in your examples.
I'll give you credit for one thing, however. When you state that you are not going to fellate me, you are 100% correct. No matter how nicely you ask.
The problem with quotes on the internet, is that nobody bothers to check their veracity. -- Abraham Lincoln
I understand that, but I don't see where it matters one whit so long as they have some means to provide power with the main transformer blown up (in this case diesel).
What that amounts to is they decided that the extra diesel would cost them less than the second grid tie. Perhaps they were right, perhaps not, but their duty to the customer is to somehow (barring orders from civil authorities) be able to provide power to customers when the primary is down. They are doing that now.
Did you even bother to read the comment you just replied to? It wasn't by the same person as the author of the comment you originally ranted at; on the contrary, he agreed with you.
Your first post was quite reasonable. This one made me think you're an asshat who thinks he's entitled to decide everything that might remotely affect him, and *not* just in your job.