Cooling Challenges an Issue In Rackspace Outage
miller60 writes "If your data center's cooling system fails, how long do you have before your servers overheat? The shrinking window for recovery from a grid power outage appears to have been an issue in Monday night's downtime for some customers of Rackspace, which has historically been among the most reliable hosting providers. The company's Dallas data center lost power when a traffic accident damaged a nearby power transformer. There were difficulties getting the chillers fully back online (it's not clear if this was equipment issues or subsequent power bumps) and temperatures rose in the data center, forcing Rackspace to take customer servers offline to protect the equipment. A recent study found that a data center running at 5 kilowatts per server cabinet may experience a thermal shutdown in as little as three minutes during a power outage. The short recovery window from cooling outages has been a hot topic in discussions of data center energy efficiency. One strategy being actively debated is raising the temperature set point in the data center, which trims power bills but may create a less forgiving environment in a cooling outage."
Other publications have noted it was number 3, too.
DT
Is this thing on? Hello?
If you want 100% uptime, it's important to have back up power for the cooling as well as the server systems themselves.
Is this really news?
Well, back to rejecting software patent applications.
Actually this brings up an interesting point of discussion for me at least. Our office is doing a remodel and I'm specifying a small server room (finally!) and the contractors are asking what AC unit(s) we need. Is there a general rule for figuring out how many BTUs of cooling you need for a given wattage of power supplies?
I'm out of my mind right now, but feel free to leave a message.....
http://www.doc.ic.ac.uk/~matti/ise2grp/energystorage_report/node5.html
I wish someone would come up with a failsafe design for liquid cooled systems that wouldn't leak if it came off, from fittings that can be yanked off without draining the system, to pipes which can be installed for the long haul and keep their flexibility over years, not decades even in UV prone environments, to some type of monitoring and automatic shutoff of the core and system in case of a detected leakage.
After that, some standard way of hooking up machines/blades on a rack so they all can be cooled via a central coolant system.
Voila. Problem solved. It would be trivial to have redundant cooling loops so if one failed, the rest of the data center would still be at an operational temperature.
Someone needs to chuck some R&D money at liquid cooling, and get it out of the stone age. As of now, all it takes is one small crack in a hose, and the whole machine would be killed. Due to this, liquid cooled PCs pretty much never have a useful life past 2-3 years before the cooling system has some type of critical (and messy) failure. If it's not a coolant leak, it's algae getting in the coolant, or corrosion on fittings.
Man, I wish I was making that up.
Liquid nitrogen is the cooling answer, for sure. Then you're not dependent upon power of any kind at all. The nitrogen dissipates as it warms, just like how a pool stays cool on a hot day by 'sweating' through evaporation, and you just top up the tanks when you run low. It's cheap and it's simple. That's why critical cold storage applications like those in the biomedical industry don't use 'chillers' or refrigerators or anything like that. If you really want to put something on ice and keep it cold, you use liquid nitrogen.
A-Bomb
I've never understood why data centre designers haven't used a different cooling strategy to re-circulated cooled air. After all, for much of the temperate latitudes for much of the year the external ambient temperature is at or below that needed for the data centre so why not use conditioned external air to cool the equipment and then exhaust it (possibly with a heat exchanger to recover the heat for other uses such as geothermal storage and use in winter)? (Oh, and have the air-flow fans on the UPS.)
The advantage of this is that even in the worst case scenario where the chillers fail totally during mid-summer there is no runaway, closed loop, self-reinforcing heat cycle. The data centre temperature will rise, but it would do so more slowly and the maximum equilibrium temperature will be far lower (and dependent upon the external ambient temperature).
In fact, as part of the design for the cluster room in our new building I've specified such a system, though due to the maximum size of the ducting space available we can only use this for half the heat load.
Agrajag: "Oh no, not again!"
After reading the articles linked from previous posts, it looks like the third outage was related to their cooling units not coming back online from the power outage linked to the Semi vs. Transformer battle. I know the units in our data center aren't hooked up to the UPS, but instead are wired directly to the generator in case of outage. I believe this is due to the massive number of additional cells that would be needed to keep up with the wattage requirements. The theory is that if the power goes out, you can live without cooling for the couple minutes while the generator pumps out the first giant plumes of black diesel and revs up to max capacity. We had a similar unplanned test when the local grid had a brownout. Luckily, our units functioned as designed. I wonder if their issues before did more damage to the units than they would have expected...
They should ban that stuff. (dhmo.org)
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
Ah, the dangers of context-sensitive advertising.
Ad on the main page when this article was at the top of the list.
Does "50% off setup" mean you'll only be set up halfway before they run out of A/C?
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
The first occasion was over a weekend (no-one present) in a server room full of VAXes. On the Monday when it was discovered, we just opened a window and everything carried on as usual.
The next time was when an ECL model Amdahl was replaced by a CMOS IBM. No-one downgraded the cooling and it froze up - solid. This time the whole shebang was down for a day while the heat-exchangers thawed out. It was quite interesting watching the temperature monitors; it took a couple of hours until the temperature rose above the "danger" threshold.
So the answer is either, until you arrive at work (2 days or more), or sometimes a bit more heat is a good thing.
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
Having a lot of ice on hand would be a good way to bridge the gap between when the power goes out and when your backup system gets running. Ice is relatively cheap to store once it's created. A company called Ice Bear used to make an air conditioner based on this principle.
http://www.news.com/Ice-powered-air-conditioner-could-cut-costs/2100-1008_3-6101045.html
Just make sure your equipment doesn't get wet.
I guess they do not have *FANATICAL* cooling systems ...
A few weeks ago the A/C dropped out in one of our computer rooms. I like the resulting graph: http://leebert.org/tmp/SCADA_S100_10-3-07.JPG
For those of you who either didn't take Physics, or slept through it, Watts and BTU's/hr are both measurements of POWER. Add up all the (input) wattages, and use something like http://www.onlineconversion.com/power.htm/ to convert. This site also has a conversion to 'tons of refrigeration' on that same page.
Also note - Don't EVER use the rated wattage of a power supply because that's what it SUPPLIES, not uses. Instead use the current draw multiplied by the voltage (US - 110 for single phase, 208 for dual phase in most commercial bldgs, 220 only in homes or where you know that's the case). This is the 'VA' [Volt-Amps] unit. Use this number for 'watts' in the conversion to refrigeration needs.
Just FYI - a watt is defined as "the power developed in a circuit by a current of one ampere flowing through a potential difference of one volt" (see http://www.siliconvalleypower.com/info/?doc=glossary/), i.e. 1W = 1VA. The dirty little secret about power calculations is that there is another factor thrown in, typically about 0.65, called the 'power factor' that UPS and power supply manufacturers use to lower the overall wattage. That's why you always use VA (rather than the reported wattage) because in a pinch you can always measure both voltage and amperage (under load).
Basically do this - take all the amperage draws for all the devices in your rack/room/data center, multiply them by the applied voltage for that device (110 or 208) and add all the products together. Then convert that number to tons of refrigeration. This is your minimum required cooling for a lights out room. If you have people in the room, count 1100 BTU's/hr for each person and add that to the requirements (after conversion to whatever unit you're working with). Some HVAC contractors want specifications in BTU's/hr and others want it in tons. Don't forget lighting either if it's not a 'lights out' operation. A 40W fluorescent bulb means it's going to dissipate 40W (as in heat). You can use these numbers directly as they are a measure of the actual heat thrown, not of the power used to light the bulb.
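For anyone who would rather script the recipe above than punch it into a converter, here is a rough Python sketch. The conversion constants (about 3.412 BTU/hr per watt, 12,000 BTU/hr per ton of refrigeration) are the standard figures rather than anything from the post, the 1100 BTU/hr per person is the number quoted above, and the function names and the sample room are made up for illustration.

```python
# Conversion constants (standard values, not taken from the comment above).
BTU_PER_HR_PER_WATT = 3.412      # 1 W is about 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12_000      # 1 ton of refrigeration = 12,000 BTU/hr
BTU_PER_HR_PER_PERSON = 1_100    # per-person figure quoted in the comment


def cooling_load_btu_per_hr(devices, people=0, lighting_watts=0.0):
    """Return the estimated heat load in BTU/hr.

    devices: iterable of (amps, volts) pairs measured under load.
    """
    # Sum of amps * volts per device, treated as watts of heat.
    volt_amps = sum(amps * volts for amps, volts in devices)
    equipment_and_lights = (volt_amps + lighting_watts) * BTU_PER_HR_PER_WATT
    return equipment_and_lights + people * BTU_PER_HR_PER_PERSON


def btu_per_hr_to_tons(btu_per_hr):
    """Convert BTU/hr to tons of refrigeration."""
    return btu_per_hr / BTU_PER_HR_PER_TON


if __name__ == "__main__":
    # Hypothetical room: eight servers at 2.5 A on 110 V, one switch at
    # 1.0 A on 110 V, one storage shelf at 4.0 A on 208 V.
    devices = [(2.5, 110)] * 8 + [(1.0, 110), (4.0, 208)]
    load = cooling_load_btu_per_hr(devices, people=1, lighting_watts=80)
    print(f"{load:.0f} BTU/hr, {btu_per_hr_to_tons(load):.2f} tons")
```

This gives a minimum figure only; it says nothing about headroom for growth or heat coming through walls and windows, which an HVAC contractor would add on top.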
Make sense?
Dennis Dumont
What happens when the primary, secondary, and tertiary air conditioners all shut down.
http://worsethanfailure.com/Articles/Im-Sure-You-Can-Deal.aspx
steveha
lf(1): it's like ls(1) but sorts filenames by extension, tersely
In a previous /. article (Ancient fridge) we learned that a Stirling engine can run off excess heat, so why not power the cooling system with a Stirling engine?
The hot air from the cabinets could be pumped to the Stirling engine by the engine itself; the work done will lower the air temperature, and the air can then be pumped back to the rack.
Now I realize that a Stirling engine might not be able to extract enough energy to cool a rack in an ongoing way. During normal operation it could run in a supplementary capacity alongside conventional air conditioning, but in a power outage it could well buy the extra time needed to either get the chillers running or shut down the servers.
In the not too distant future, next Sunday A.D.
One thing to consider is if the heat measured outside a box is high, the heat on the surface of the processor is much higher. Even with little fans or heatsinks on them, it doesn't do much, remember, fans and heatsinks don't change temperature they just displace heat - and the heat is attempting to be displaced in an environment of a lot of other boxes trying to displace heat.
In our current data center, run by a respected name, I have measured external temperatures in excess of 100 degrees Fahrenheit on some machines. Machines that run 24/7/365. We have small non-production rooms which have cheap fans that fill up with condensation, and a building staff which is supposed to empty the water when it fills up, but often doesn't.
Sometimes it gets kind of insane - I worked for a Fortune 100 financial company that had tons of money, and had a data center with Sun Enterprise 4000 series servers all over the place - yet the server room was above room temperature, and even more so in certain areas. We had disk and processor/memory board failures all the time, but they never really cared about the room temperature - they spent more time making sure the insides of the fibre optic cables were clean.
I have always brought up my concerns, but management has never really taken them seriously, and then I become overloaded with other work and forget about it as well. The ideal temperature for servers is a few degrees above 0 Celsius, or even below 0 depending on the equipment. Meanwhile, if you find a server room where the temperature is below 20 degrees Celsius, you're lucky. It's just one of those things where it is cheaper and easier for them to just waste my time than to fix the problem.
Most large refrigeration compressors have "short-cycling protection". The compressor motor is overloaded during startup, and needs time to cool. So there's a timer that limits the time between two compressor starts. 4 minutes is a typical delay for a large unit. If you don't have this delay, compressor motors burn out.
Some fancy short-cycling protection timers have backup power, so the "start to start" time is measured even through power failures. But that's rare. Here's a typical short-cycling timer. For the ones that don't, like that one, a power failure restarts the timer, so you have to wait out the timer after a power glitch.
The timers with backup power, or even the old style ones with a motor and cam-operated switch, allow a quick restart after a power failure if the compressor was already running. Once. If there's a second power failure, the compressor has to wait out the time delay.
So it's important to ensure that a data center's chillers have time delay units that measure true start-to-start time, or you take a cooling outage of several minutes on any short power drop. And, after a power failure and transfer to emergency generators, don't go back to commercial power until enough time has elapsed for the short-cycling protection timers to time out. This last appears to be where Rackspace failed.
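To make the difference between the two timer styles concrete, here is a small Python sketch. It is a simplified model, not the logic of any real controller: the 240-second delay is the "typical" 4 minutes mentioned above, and the function names and timestamps are invented for illustration.

```python
ANTI_SHORT_CYCLE_DELAY_S = 240  # the "typical" 4 minutes from the comment


def restart_wait_resetting_timer(delay_s=ANTI_SHORT_CYCLE_DELAY_S):
    """Timer without backup power.

    It loses its state in the outage, so the full delay always runs again
    from the moment power returns.
    """
    return delay_s


def restart_wait_start_to_start(last_start_s, power_back_s,
                                delay_s=ANTI_SHORT_CYCLE_DELAY_S):
    """Timer with backup power that measures true start-to-start time.

    Only the unexpired part of the delay since the previous compressor
    start has to be waited out.
    """
    elapsed = power_back_s - last_start_s
    return max(0, delay_s - elapsed)


if __name__ == "__main__":
    # Compressor started at t=0 and ran for hours; a 5 s glitch ends at 36005.
    print(restart_wait_resetting_timer())            # full delay again
    print(restart_wait_start_to_start(0, 36005))     # immediate restart
    # It restarts at 36005; a second glitch ends 60 s later.
    print(restart_wait_start_to_start(36005, 36065)) # must wait the remainder
```

The second call is the "quick restart, once" case: the first glitch costs nothing, but a second one shortly after still has to wait out what is left of the delay.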
Dealing with sequential power failures is tough. That's what took down that big data center in SF a few months ago.
You could do it, it's just probably more expensive than forced-air cooling.
What you probably would want to do is have a closed system that's actually inside the computer. Fill it with some sort of nonconductive/noncorrosive coolant that won't destroy the machine if it leaks (e.g. 3M Fluorinert), then have a cooling block on the back, away from the electronics, where you plug in the chilled water lines. If you don't daisy-chain, and instead end-run the water intake/exhaust lines from every machine to a central pump, and more importantly than that, you have it driven by suction on the return side rather than positive pressure on the supply side, you could easily attach and detach machines without leaks. (Since in a datacenter a leak is probably more disastrous than a LoC to one server, suction is preferable to positive pressure.)
You'd disconnect the supply from a machine using a quick-release valve; then wait a second for the suction on the return side to pull the water out of the machine's cooling block and start sucking air. Then you'd disconnect the return side. This obviously means that you'd need a way of separating the air out of the return side before it hits the pump, but that's not exactly a unique engineering problem.
It's all doable, but the problems are the expense and the standardization. There's a major chicken-and-egg problem with equipment: you don't want to build a datacenter that can't use commodity equipment, but hardware manufacturers don't want to build gear that can't go into a standard air-cooled rack. So even though datacenters may be the biggest purchasers of racked servers (I'm not sure of that but I suspect they are, at least of some types), and datacenters might be better served by some sort of cooling besides forced-air, everybody gets the lowest common denominator.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
It seems crazy that the data centres seem to run in hot states. Surely Alaska would be better? C'mon Alaska, get the tax-breaks right.
Engineering is the art of compromise.
I fail to see where this could be news to anyone who works with data centers. If you want your datacenter to operate during a power outage, you need a generator with enough capacity for your servers/network and your cooling. If a fancy hosting site with SLAs making up-time guarantees doesn't understand this, I think their customers should start looking elsewhere.
Keep passing the open windows...
Cooling is the Achilles' heel of many data centers.
It takes so much power to run the air conditioners that many data centers I've been into don't even put them on their backup generators at all. There is no way the air conditioners are on battery backups either, so when the power does go out, they are off for at least the time it takes to start the generator and get it warmed up. (a minute or two at least)
All it takes is a couple minutes for the temperature of an entire data center to rise to a point where it takes hours to get it back down to normal levels. If the power cycles even a couple times you need to start thinking about which servers to turn off. Sometimes companies will put a "minimal" number of air conditioners on the generator, but they often fail to account for the increasing number of servers, so when the power does finally go out, they can't keep up anyways.
When I worked at one of the top tier hosting providers we had industrial fans stored in a closet and when the power went out (a few times per year at least) we had alarms that would go off and the entire support/NOC departments sprang into action like a well oiled machine to dig out the fans, set up extension cords and start taking the front/back doors off every cabinet to improve cooling and keep the servers from cooking themselves. They usually did anyways; it wasn't uncommon for staff to burn themselves on the cases during periods like this. I was always amazed at the temperatures that the servers did continue to run at though. I can also recall times where the entire office heated up to 90+ degrees as that was the only place the fans could blow the heat when you're in an office tower.
A better idea would be converting servers from AC to DC. The power supply probably generates 25-50% of the heat from a device/server. Wouldn't it make more sense for OEMs to start making devices that used DC directly and then place one large transformer outside the datacenter and then run DC circuits to the racks? It might not eliminate the cooling requirements entirely, but even a reduction of 25% might go a long way.
Ok, what do I know about cooling, right? For ages, Intel processors have had a facility to protect against overheating should the CPU fan fall off or whatever. When the temperature gets too high, the CPU is made to sleep for periods of time necessary to keep it cool enough. The key point is that the system keeps running, just more slowly. Now, why can't data centers employ something like that? Servers that sleep to keep cool and then just ventilation systems that circulate air with the outside. Server response is slower, but nothing goes down.
The real solution here is not really a multihour UPS for cooling and power; it would be an emergency generator. Have the generator auto kick on after 5 minutes and have 10 minutes of UPS time on the equipment. Generally I have found there is no need to UPS the cooling unit.
Just a random thought -
obviously any cooling system will separate out large amounts of heat. Would it be an idea to use multistory underground stacks of data centres, and have stores on top, that would be heated by the runoff energy?
I was about to say housing, but given the outrage on electromagnetic sensitivity and whatever, it might be more palatable if people don't actually sleep above the datacentres. Or, you could just wrap them in tin foil.
While many here are discussing UPSes, chillers, set-points, etc the most serious flaw is being glossed over ... the lack of redundancy outside the data center, such as multiple, diverse power lines coming in...
From the articles, it appears that the Rackspace datacenter doesn't have multiple power lines coming in and/or many come in via one feed point.
How else is it that a car crash quite some distance from the datacenter can cause such disruption? Does anyone even plan for such events - I get the feeling most planners don't, since I've seen first-hand many power failures occur in places where one would expect more redundancy from dumb things like a vehicle hitting a utility pole, etc.
Ron
We've summoned a small demon to let in cool air particles and shunt out hot ones. Sure the weekly sacrifice gets to be a pain after a while, but there's always a pool of willing interns right?
Ask not what you can do for your country. Ask what your country did to you
Wiring the air handlers from the backup/generator power (any serious data center has backup power) gives one a downtime of a minute or two until the generator(s) spool up-which should be well doable temperature-wise.
Beyond that-the land where Mr. Murphy plays his games, like having one, or your only, backup generator not starting-additional precautions should be made available, such as multiple industrial fans in open doors configured in a way to allow them to suck fresh cool air in and expel hot air out (same principles of cooling your computer case-you remember that fun right?) for however long is required. This tactic alone has saved my (international critical) data centers from being shut down on a couple of occasions by keeping just cool enough air circulating long enough so backup power issues were solved and air handlers were back up and running again.
Your answer to cooling is a little bit of cool air and a whole lot of air flow. Pipe hot air away from servers and get it out of the room. If you replace the air with 70 degree air that is good enough. Be sure to replace the cubic quantity of air in the room once every 2 hours for one server and subtract 10 minutes for every additional server. Cap it at 10 minutes. (120 mins by 20 servers is replacing the entire room air every 10 minutes) 11 servers is 110 minutes subtracted from 120 minutes. All the rest of the servers will be fine. To replace the air and get that much air flow you should look into massive blowers; nearly an entire wall will be a "warm" air return (you want that as close to the servers as possible), and you pump cold air through a false floor (commonly used in server rooms and data centers) up through the racks.
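The rule of thumb above is a little ambiguous, so here is one possible reading of it as a Python sketch: 120 minutes per full air change for the first server, 10 minutes less for each additional server, never going below 10 minutes. That interpretation, the function names, and the cubic-feet-per-minute conversion are assumptions for illustration, not an established sizing method.

```python
def air_change_interval_min(servers):
    """Minutes allowed per full room air change under the rule of thumb.

    Assumed reading: 120 min for one server, minus 10 min per additional
    server, with a floor of 10 min.
    """
    if servers < 1:
        raise ValueError("need at least one server")
    return max(10, 120 - 10 * (servers - 1))


def required_airflow_cfm(room_volume_cubic_ft, servers):
    """Airflow (cubic feet per minute) needed to swap the room's air
    once per interval."""
    return room_volume_cubic_ft / air_change_interval_min(servers)


if __name__ == "__main__":
    # Hypothetical 10 ft x 12 ft room with a 10 ft ceiling = 1200 cubic ft.
    for n in (1, 6, 12, 40):
        print(n, air_change_interval_min(n), required_airflow_cfm(1200, n))
```

Under this reading the floor is reached at 12 servers, after which the interval stays at 10 minutes regardless of how many more are added.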
I have been in data centers in Chicago where cardboard was not allowed because it gets sucked against the wall. It is very noisy and I felt like I was in a wind tunnel. But there were several hundred servers, network devices, and blinky light things that I have never seen before in cages that I wasn't allowed in. Air flow is the key to rack cooling. Maybe not to that extreme though.
Every single watt consumed by a computer is turned into heat, and generally released out the back of the case. Computers behave the same as the coil of nichrome wire as is used in a laundromat clothes dryer. (I guess a few milliwatts gets out of your cold room via ethernet cables and photons on fiber)
5 kilowatts is a heck of a lot to have on a single rack - assuming you're actually utilizing that. I recently interviewed a half dozen data centers to plan a 20-odd server deployment, and we ended up using 2 cabinets in order to ensure our heat dissipation was sufficient. Since data centers are usually supplying 20 amp, 110 or 120v power, you get 2200-2400 watts available per drop; although it's considered a bad idea to draw more than 15 amps per circuit. We have redundant power supplies in everything, so we keep ourselves at 37.5% of capacity on the drops, and each device is fed from a 20amp drop coming from a distinct data center pdu. That way even if one of the data center pdus implodes, we're still up and at 75%- capacity.
Almost no data center we spoke to would commit to cooling more than 4800 watts of power at an absolute maximum per rack, and those were facilities with hot/cool row setups to maximize airflow. But that meant they didn't want to drop more than 2x20amp power drops, plus 2x20 for backup, if you agreed to maintain 50% utilization across all 4 drops. But since you'd really want to maintain 75%- even in the case of failure, you'd only be using 3600watts. (In the facility we ended up in, we have a total of 6 20 amp drops, and we only actually utilize ~4700 watts.)
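The arithmetic behind those 4800 and 3600 watt figures can be checked with a few lines of Python. This is only a sketch of the numbers in the comment above, assuming 20 amp drops at 120 V and dual-supply devices that split their load evenly across two feeds; the function names are made up.

```python
def drop_capacity_watts(amps=20, volts=120):
    """Nameplate capacity of one power drop."""
    return amps * volts


def usable_watts(drops, utilization, amps=20, volts=120):
    """Total draw across all drops at a given per-drop utilization."""
    return drops * drop_capacity_watts(amps, volts) * utilization


def utilization_after_failover(utilization):
    """Per-drop utilization once one of a pair of feeds has failed.

    Assumes each device splits its load evenly over two feeds, so the
    surviving drop carries both halves.
    """
    return 2 * utilization


if __name__ == "__main__":
    print(usable_watts(4, 0.5))               # the data center's 50% offer
    print(usable_watts(4, 0.375))             # the poster's 37.5% target
    print(utilization_after_failover(0.375))  # surviving drop after a failure
```

At 50% per drop a feed failure would push the surviving drops to 100%, which is why the poster aims for 37.5% and a 75% worst case instead.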
Ultimately, though, the important thing is that cooling systems should be on generator/battery backup power. Otherwise, as this notes, your battery backup won't be useful.
Concerning the energy efficiency questions in the post, I wonder how the cooling systems are implemented out there in the US of A. Here in the Netherlands, it's common to build a datacenter near highways. Vast amounts of piping, through which the coolant flows, are laid out under the highway and can easily lose their heat via the roads. As a bonus, you don't have to counter any slipping dangers in winter because the road is kept at a nice 15 centigrade all through the winter. Of course, in summertime it's useless, but then again, we don't have summer here for more than two weeks per year ;)
The one, seemingly obvious, question I have is, why aren't the cooling needs on generator/ups backup?
I have toured data-centers where even the cooling was on battery backup. The idea is that the battery banks hold everything as it is until the generators come fully online (usually within 30 seconds). The batteries/UPS transformers were able to hold the entire system for approx 30 minutes on battery alone irrespective of generator status. This also reduced the issues from quick brown-outs...no need to fire up the generator for a quick 2 second outage.
Why aren't data-centers like this built to be completely self-sufficient or autonomous?
Sig Return: 204 No Content
1. Due to facilities issues (we are a State agency in a County controlled facility), our "data center" is a 6'x11' closet, so 66 sq ft.
2. The sum of our equipment wattage in the "data center" is min=6510W max=16833W. We estimate the average running wattage to be around 11000W.
3. Assuming 1 W = 3.414 BTU/h, our "data center" generates 22225 BTU/h, 37554 BTU/h, and 57468 BTU/h (min, avg, max respectively).
4. Due to the same facilities issues in #1, we keep the "data center" at 71 degrees F, and cannot keep it any cooler.
5. Due to the same facilities issues in #1, we have lots of cooling outages, and therefore much experience that qualifies me to accurately answer the question.
If cooling goes out in our "data center," the servers overheat in 15-20 minutes (when the closet reaches about 115-120 degrees F). To increase this time to about 45 minutes, we have installed a portable cooler that kicks in when the main HVAC system fails.
Back when I worked for the Secretary of State's office it was decided that we should move the Central Voter Registration System to a new location, along with all the other mail, web, and database servers.
I was responsible for infrastructure planning. We ended up using an APC Symmetra backed up by a 125kW natural gas fired generator. Transfer to generator took approximately ten seconds, but the Symmetra could keep the whole thing running for 45 minutes, allowing us graceful shutdown, etc. if the generator didn't spool up. We even extended power to the MDF so that Cox could plug their UPS into the generator line and be powered up while everything else was down.
The cooling was accomplished by redundant systems. There were duplicate two-ton air conditioning units in the room. If one failed the other could pick up the slack but in normal mode they both ran.
This represented a big improvement over what we'd had before. The servers had been housed in a minimally air conditioned closet in the sub-basement of the State House.
The fly in the ointment, so to speak, was that we depended on another state agency for DNS service. One day there was a massive power outage that affected a good chunk of Providence, RI. The Sec State's office systems were up and running except nobody could get to them, and they couldn't get out.
And it's still that way today.
Disclaimer: I work with SGI, so I can shed some light on their customers' perspective (NASA, gov't, research labs, etc.) and solution to this problem.
The increasing density of servers is exacerbating the problem of power and cooling in every data center. This week is the SuperComputing trade show, where the new Top 500 supercomputers edition was released with "Big Turnover Among the Top 10 Systems," and where you can see the first examples to address these issues.
SGI's new ICE blade system was launched a few months ago, it was designed to address the power consumption, real estate density, and cooling issues everyone will probably experience on their next server cycle. ICE has shipped and one installation is now #3 on the Top 500. It's a welcome sign that SGI is back from bankruptcy. I'm sorry if this seems like an advert, so I'm not going to link to SGI -- you can go find out more easily if you want.
My opinions are my own, but you may share them!
When I worked at (and helped build) a datacenter back in 2000, they planned for just about everything. Then they played the what-if game for a few more scenarios, and came up with the fact that they needed to double the main capacity, and add a second layer. By that I mean that if the main transfer switch on the genset failed, there was a second one that'd kick in. If one or 2 (or all 3) of the A/C units failed for some reason, there was #4 and #5 which could carry the load with acceptable tolerances.
:)
We also had the genset in parallel to the UPS. So when the UPS got down to around 60%, the genset would kick in, and power them back up to 85% and shut down again. There were also 2 more gensets on site for just A/C and other essential systems. (We also learned the hard way how A/C compressors really dislike genset power, and had to get a very expensive line smoother from the power company to make the bigger ones play nice.)
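That start-at-60%, stop-at-85% behaviour is a plain hysteresis loop. A minimal Python sketch of the idea follows; the thresholds are the approximate ones from the comment above, while the function names and the sample charge trace are invented, and a real transfer controller would of course do far more than this.

```python
def genset_should_run(charge_pct, running, start_at=60.0, stop_at=85.0):
    """Hysteresis rule: start at or below start_at, stop at or above
    stop_at, otherwise keep doing whatever the genset was already doing."""
    if charge_pct <= start_at:
        return True
    if charge_pct >= stop_at:
        return False
    return running


def simulate(charge_trace):
    """Return the genset on/off state for each UPS charge reading."""
    running = False
    states = []
    for charge in charge_trace:
        running = genset_should_run(charge, running)
        states.append(running)
    return states


if __name__ == "__main__":
    trace = [100, 80, 61, 60, 70, 84, 85, 80]
    for charge, state in zip(trace, simulate(trace)):
        print(charge, "ON" if state else "off")
```

The gap between the two thresholds is what keeps the genset from flapping on and off around a single set point.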
Someone did the math on this system once. The datacenter could sustain itself without outside power for 51 days(!!) running at 70% capacity.
Not to mention, we ran the datacenter once a month for 6 hours on backup systems just to test. Then again, my boss at the time was the type of guy who would pull a card out of a router to see if said router (at that point in time, a Cisco 7513) was hot swap... Live router, had our SONET stuff running through it... Luckily it is a hot-swap router.
As computers get warmer their power consumption goes up.
http://www.silentcomputing.com/tech/market2.gif
Using Intel mother boards we found for each 10 Deg C temperature increase there was a 2% power increase.
I spent three years trying to get a company off the ground after I solved how to fix the heat problem.
I realized, they don't want to fix it, unless it's going to make them money.
I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso
Last night while spinning the channels on tv I saw a discovery channel/PBS show regarding a library in AZ using ice as a way to timeshift their power demands away from peak power load times. This system used a grid of refrigeration coils immersed in a tank of water. At night when power demands were lower, the A/C compressors would circulate refrigerant through the water tanks, freezing the water into ice; in the day they would dump the heat into the ice, letting it melt. It seems a similar strategy would work for data centers for short term banking of lower powered cooling needs.
Ike
And stop running the SETI client, install the power scaling software of choice for your architecture, and stop wasting so damn much electricity to keep your precious snowflake busy 24/7.
I have a server in the DFW1 Datacenter that was knocked out during the first two outages but survived the third. I was down for about two and a half hours during the first outage and fifteen minutes during the second. My Rackers answered every question I had about it with honesty and humility. They admitted that this was their problem and kept me informed as to what was going on.
I'm wondering how many people posting on this thread are actual Rackspace customers. I can't say that I have ever once experienced anything less than top quality service from them. They are friendly, knowledgeable, hard working, and I respect them very much, and that's coming from a sysadmin/one man IT department.
Here's a company that is taking responsibility for its actions right off the bat, apologizing for their shortcomings, and honestly trying to make things right for those who have been affected. Think about this in contrast to the corporate scandals and craptacular customer service that has been plaguing the US lately.
Why not use the expanding nitrogen gas to provide power as well? Yet another use would be putting out fires.
I had an ISP that hosted my MySQL-based app. The AC broke in their server room and the hard drives got fried. They restored from the most recent backup, but it appeared they loaded it from MySQL dumps.
For some reason, the auto-incrementing columns in MySQL got re-assigned numbers based on the order of the rows in the dump, closing the gaps left by deleted records. Cross-references via the ID number were all screwed up. It was a pasta nightmare. I'm not sure how to prevent this in the future, other than manually programming my own number assignments or having the app ignore records marked as deleted instead of actually deleting them.
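The renumbering described here can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs anywhere), with a made-up `items` table. It restores the same data twice: once without the ID column, which closes the gap, and once with explicit IDs, which keeps it. Whether the ISP's restore actually dropped the IDs this way is not known.

```python
import sqlite3


def make_table(conn):
    # Hypothetical table with an auto-incrementing primary key.
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
    )


def ids(conn):
    return [row[0] for row in conn.execute("SELECT id FROM items ORDER BY id")]


# Original database: three rows, then the middle one is deleted, leaving a gap.
src = sqlite3.connect(":memory:")
make_table(src)
src.executemany("INSERT INTO items (name) VALUES (?)", [("a",), ("b",), ("c",)])
src.execute("DELETE FROM items WHERE name = 'b'")
dump = src.execute("SELECT id, name FROM items ORDER BY id").fetchall()

# Restore 1: the id column is left out, so ids are handed out again in order.
renumbered = sqlite3.connect(":memory:")
make_table(renumbered)
renumbered.executemany(
    "INSERT INTO items (name) VALUES (?)", [(name,) for _, name in dump]
)

# Restore 2: the ids are inserted explicitly, so the gap survives.
preserved = sqlite3.connect(":memory:")
make_table(preserved)
preserved.executemany("INSERT INTO items (id, name) VALUES (?, ?)", dump)

print(ids(src), ids(renumbered), ids(preserved))  # [1, 3] [1, 2] [1, 3]
```

So one thing worth checking on any backup is that the dump carries the ID values along with the rest of each row; soft deletes, as suggested above, avoid gaps in the first place.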
Table-ized A.I.
With newer advances such as SpeedStep (or whatever they call it now), you can communicate to the servers that they are in a power outage mode and have them flip into a low power mode. Sure, they will run slower, but they won't go DOWN, and the kW of power no longer being dissipated in the room will help keep things cooler, longer. Of course, in a colo, this means certain software that the server owner has to install. It would be better for this to be a hardware integration, with some type of network between the servers, the power management system and the cooling system. I know something like this would be fairly easy to implement, provided you had the right device in each server. I can see performance adjustments that downclock certain processors (laptops have had this for years), turn off hard drives, and blank all the monitors automatically (with an override of course).
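As a rough sketch of the decision logic such a system might run on each server, assuming it receives a "cooling OK" signal from the facility and can read an inlet temperature: everything here (the `PowerMode` names, the `choose_mode` function, the 30 °C threshold) is hypothetical, and the actual downclocking would be left to whatever power management the platform provides.

```python
from enum import Enum


class PowerMode(Enum):
    NORMAL = "normal"
    LOW_POWER = "low_power"


def choose_mode(
    cooling_ok: bool,
    inlet_temp_c: float,
    throttle_at_c: float = 30.0,  # hypothetical threshold, not a vendor figure
    override: bool = False,
) -> PowerMode:
    """Pick a power mode from the facility signal and the local temperature."""
    if override:
        # The server owner has opted out of automatic throttling.
        return PowerMode.NORMAL
    if not cooling_ok or inlet_temp_c >= throttle_at_c:
        # Cooling outage reported, or the room is already getting warm.
        return PowerMode.LOW_POWER
    return PowerMode.NORMAL


if __name__ == "__main__":
    print(choose_mode(cooling_ok=False, inlet_temp_c=24.0))
```

Keeping the temperature check alongside the facility signal means a server still throttles if the outage message never arrives.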
Cool! Amazing Toys.
I worked for a mid-sized business ISP a few years ago and we lost our HVACs. All 4 of them at the same time. The maintenance guy that came to do some routine crap was new and somehow completely fubar'd the entire system. We had 2 HVACs for cooling our two data centers and 2 on backup in case anything happened. He somehow managed to shut all 4 down and they absolutely refused to come back online.
It took about an hour before we had to begin shutting the least critical firewalls/servers/routers etc off. We let some pieces of equipment get to the point of melting before we unplugged them, because they were so critical and could be replaced for cheaper at a later time.
We all thought the company was done right then and there. Everyone stayed around the clock trying to think up plans. We had a data center in Minneapolis, 5 hours away that we could run all the big client equipment to, but other than that we were SOL.
Then someone made a few phone calls and managed to procure 8 portable air conditioners. The network engineers cut holes in the wall and ran ducts from the AC units to outside. We all took 2 hour shifts changing the water pans and keeping accurate temperature records throughout the weekend while the HVACs were being replaced.
What a nightmare that was... but we came out of it alright in the end. Didn't lose any major clients and didn't have to shuttle any equipment to Minnesota. I bet the guy that got fired for screwing it all up will never work in the HVAC industry again though. I still feel kinda bad for him.
You're nothing; like me.
It is ridiculous that a server room might overheat in 3 minutes, or even 30 minutes, without air conditioning. It is a sign that the whole setup is pushed too close to the margin for practical use.
Why would an ordinary, money-making business, not a military installation or research supercomputer, have that density of computers? Especially in Texas. It isn't as if there aren't empty Albertsons and K-Marts, etc., all over the place. (Of course you have to have bandwidth and power available -- the building itself is probably the least of your costs, and it may be better to build your own roof and walls in the right place.) I can see computers packed together in a submarine or aircraft carrier, or in some other extreme circumstance.
However, a facility such as Rackspace's should look something like this -- a large industrial or semi-industrial warehouse type facility, such as an empty grocery store or a large office building with all the interior walls removed. The ceilings should be as high as the structure allows, no false ceilings. The aisles should be wide enough to allow one of those big dollies, a pallet with wheels basically, to easily pass down the rows. Every other rack should be an empty space, or possibly every third one, so you can step between the rows and use that space partly for cable management as well.
If you are using quad-core Xeon chips and all that other watt-burning stuff, only every 3rd or 4th slot in the racks should have a 1U computer in it. Air can blow through everything. The ceilings are probably 15 ft high, with the racks only 8 ft high.
This will not allow you to escape the need for massive cooling by any means -- for Rackspace we are talking about ACRES of computers here, and even if the ceiling and walls are not insulated, that much heat just can't get out of the building, and eventually the temperature will get to where computers start shutting down. However, it should take on the order of hours, not minutes. You should have plenty of time to send people off to buy fans, send them off again to buy gasoline generators, start notifying customers, etc etc.
I can kind of understand a colo in a place such as New York or San Francisco being packed in like sardines, if the square footage is that expensive, and it sometimes is. Maybe some people who live in NY or SF would pay for that, if it meant they could visit their colo when necessary. In my opinion the weirdos were all inefficiently concentrated in SF and NYC just so that they would have to subsidize the good old rednecks via the power they have to buy from the Grand Coulee dam and HydroQuebec, but I digress. But . . . Rackspace doesn't let you visit your server, and their customers are all over the world!
Acreage in Texas is cheap. They let you throw up a two or three story steel-frame building on a concrete slab pretty much anywhere (that the acreage is cheap) with no zoning questions asked. Heck, promise some chamber of commerce a few jobs and they will put it up for you. There is just no excuse for not having enough air in that server room to absorb several hours' worth of heat.
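For anyone who wants to plug in their own numbers, here is a rough sketch of how long a given volume of air takes to warm up under a given heat load. It counts only the air, using approximate textbook values for density (1.2 kg/m³) and specific heat (1005 J/kg·K). It ignores the thermal mass of the equipment, floor and walls, and any heat that escapes the building, so a real room should warm more slowly than this. The function name and the example figures are illustrative assumptions.

```python
# Back-of-envelope: time for the air alone to warm by a given amount.
AIR_DENSITY_KG_PER_M3 = 1.2            # approximate, room temperature, sea level
AIR_SPECIFIC_HEAT_J_PER_KG_K = 1005.0  # approximate, constant pressure


def minutes_to_rise(
    air_volume_m3: float, heat_load_kw: float, allowed_rise_c: float
) -> float:
    """Minutes until the air has absorbed enough heat to warm by allowed_rise_c."""
    air_mass_kg = air_volume_m3 * AIR_DENSITY_KG_PER_M3
    energy_j = air_mass_kg * AIR_SPECIFIC_HEAT_J_PER_KG_K * allowed_rise_c
    return energy_j / (heat_load_kw * 1000.0) / 60.0


if __name__ == "__main__":
    # Hypothetical example: 50 cubic metres of air per 5 kW cabinet, 15 °C of headroom.
    print(f"{minutes_to_rise(50, 5, 15):.1f} minutes")
```

The time scales linearly with the air volume per kilowatt, so spreading the same load through ten times the air buys ten times the margin under these assumptions.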
The law is not an ass. No really.