Cooling Challenges an Issue In Rackspace Outage
miller60 writes "If your data center's cooling system fails, how long do you have before your servers overheat? The shrinking window for recovery from a grid power outage appears to have been an issue in Monday night's downtime for some customers of Rackspace, which has historically been among the most reliable hosting providers. The company's Dallas data center lost power when a traffic accident damaged a nearby power transformer. There were difficulties getting the chillers fully back online (it's not clear if this was equipment issues or subsequent power bumps) and temperatures rose in the data center, forcing Rackspace to take customer servers offline to protect the equipment. A recent study found that a data center running at 5 kilowatts per server cabinet may experience a thermal shutdown in as little as three minutes during a power outage. The short recovery window from cooling outages has been a hot topic in discussions of data center energy efficiency. One strategy being actively debated is raising the temperature set point in the data center, which trims power bills but may create a less forgiving environment in a cooling outage."
Other publications have noted it was number 3, too.
DT
Is this thing on? Hello?
If you want 100% uptime, it's important to have back up power for the cooling as well as the server systems themselves.
Is this really news?
Well, back to rejecting software patent applications.
Actually this brings up an interesting point of discussion for me at least. Our office is doing a remodel and I'm specifying a small server room (finally!) and the contractors are asking what AC unit(s) we need. Is there a general rule for figuring out how many BTUs of cooling you need for a given wattage of power supplies?
I'm out of my mind right now, but feel free to leave a message.....
Liquid nitrogen is the cooling answer, for sure. Then you're not dependent upon power of any kind at all. The nitrogen dissipates as it warms, just like how a pool stays cool on a hot day by 'sweating' through evaporation, and you just top up the tanks when you run low. It's cheap and it's simple. That's why critical cold storage applications like those in the biomedical industry don't use 'chillers' or refrigerators or anything like that. If you really want to put something on ice and keep it cold, you use liquid nitrogen.
A-Bomb
I've never understood why data centre designers haven't used a different cooling strategy to re-circulated cooled air. After all, for much of the temperate latitudes for much of the year the external ambient temperature is at or below that needed for the data centre so why not use conditioned external air to cool the equipment and then exhaust it (possibly with a heat exchanger to recover the heat for other uses such as geothermal storage and use in winter)? (Oh, and have the air-flow fans on the UPS.)
The advantage of this is that even in the worst-case scenario, where the chillers fail totally during mid-summer, there is no runaway, closed-loop, self-reinforcing heat cycle. The data centre temperature will rise, but it will do so more slowly, and the maximum equilibrium temperature will be far lower (and dependent upon the external ambient temperature).
In fact, as part of the design for the cluster room in our new building I've specified such a system, though due to the maximum size of the ducting space available we can only use this for half the heat load.
Agrajag: "Oh no, not again!"
They should ban that stuff. (dhmo.org)
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
Ah, the dangers of context-sensitive advertising.
Ad on the main page when this article was at the top of the list.
Does "50% off setup" mean you'll only be set up halfway before they run out of A/C?
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
A few weeks ago the A/C dropped out in one of our computer rooms. I like the resulting graph: http://leebert.org/tmp/SCADA_S100_10-3-07.JPG
For those of you who either didn't take Physics, or slept through it, Watts and BTU's/hr are both measurements of POWER. Add up all the (input) wattages, and use something like http://www.onlineconversion.com/power.htm/ to convert. This site also has a conversion to 'tons of refrigeration' on that same page.
Also note - Don't EVER use the rated wattage of a power supply because that's what it SUPPLIES, not uses. Instead use the current draw multiplied by the voltage (US - 110 for single phase, 208 for dual phase in most commercial bldgs, 220 only in homes or where you know that's the case). This is the 'VA' [Volt-Amps] unit. Use this number for 'watts' in the conversion to refrigeration needs.
Just FYI - a watt is defined as "the power developed in a circuit by a current of one ampere flowing through a potential difference of one volt." See http://www.siliconvalleypower.com/info/?doc=glossary/, i.e. 1W = 1VA. The dirty little secret about power calculations is that there is another factor thrown in, typically about 0.65, called the 'power factor' that UPS and power supply manufacturers use to lower the overall wattage. That's why you always use VA (rather than the reported wattage) because in a pinch you can always measure both voltage and amperage (under load).
Basically do this - take all the amperage draws for all the devices in your rack/room/data center, multiply them by the applied voltage for that device (110 or 208) and add all the products together. Then convert that number to tons of refrigeration. This is your minimum required cooling for a lights out room. If you have people in the room, count 1100 BTU's/hr for each person and add that to the requirements (after conversion to whatever unit you're working with). Some HVAC contractors want specifications in BTU's/hr and others want it in tons. Don't forget lighting either if it's not a 'lights out' operation. A 40W fluorescent bulb means it's going to dissipate 40W (as in heat). You can use these numbers directly as they are a measure of the actual heat thrown, not of the power used to light the bulb.
Make sense?
Dennis Dumont
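The recipe in the comment above (sum amps times volts, add lighting, convert, then add people) can be sketched in a few lines of Python. This is only an illustration: the device list is hypothetical, the conversion factors are the standard ones (1 W is about 3.412 BTU/hr, 1 ton of refrigeration is 12,000 BTU/hr), and the 1100 BTU/hr per person figure is taken from the comment itself.

```python
WATTS_TO_BTU_PER_HR = 3.412    # 1 W is about 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12_000    # 1 ton of refrigeration = 12,000 BTU/hr
BTU_PER_HR_PER_PERSON = 1_100  # per-person figure quoted in the comment above


def cooling_load(devices, people=0, lighting_watts=0.0):
    """Return (btu_per_hr, tons) for a room.

    devices: iterable of (amps, volts) pairs, measured under load.
    """
    # Volt-amps per device, summed; treated as watts of heat (conservative).
    volt_amps = sum(amps * volts for amps, volts in devices)
    btu_per_hr = (volt_amps + lighting_watts) * WATTS_TO_BTU_PER_HR
    btu_per_hr += people * BTU_PER_HR_PER_PERSON
    return btu_per_hr, btu_per_hr / BTU_PER_HR_PER_TON


# Hypothetical small server room: two 3 A devices at 110 V, one 5 A device
# at 208 V, two 40 W fluorescent tubes, and one person working inside.
example_devices = [(3.0, 110), (3.0, 110), (5.0, 208)]
btu, tons = cooling_load(example_devices, people=1, lighting_watts=80)
print(f"{btu:.0f} BTU/hr, {tons:.2f} tons of refrigeration")
```

Treat the result as a floor, not a spec; an HVAC contractor will normally add margin for growth and for heat gain through walls and windows.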
Most large refrigeration compressors have "short-cycling protection". The compressor motor is overloaded during startup, and needs time to cool. So there's a timer that limits the time between two compressor starts. 4 minutes is a typical delay for a large unit. If you don't have this delay, compressor motors burn out.
Some fancy short-cycling protection timers have backup power, so the "start to start" time is measured even through power failures. But that's rare. Here's a typical short-cycling timer. For the ones that don't, like that one, a power failure restarts the timer, so you have to wait out the timer after a power glitch.
The timers with backup power, or even the old style ones with a motor and cam-operated switch, allow a quick restart after a power failure if the compressor was already running. Once. If there's a second power failure, the compressor has to wait out the time delay.
So it's important to ensure that a data center's chillers have time delay units that measure true start-to-start time, or you take a cooling outage of several minutes on any short power drop. And, after a power failure and transfer to emergency generators, don't go back to commercial power until enough time has elapsed for the short-cycling protection timers to time out. This last appears to be where Rackspace failed.
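To make the difference between the two timer types concrete, here is a toy Python model. It is a sketch under stated assumptions, not how any particular controller works: the class and function names are made up, the 240-second delay is the "4 minutes is typical" figure from above, and the glitch times are invented.

```python
class ShortCycleTimer:
    """Toy model of a compressor short-cycling protection timer."""

    def __init__(self, delay_s=240.0, survives_power_loss=False):
        self.delay_s = delay_s
        self.survives_power_loss = survives_power_loss
        self.ready_at = 0.0  # earliest time the compressor may start again

    def compressor_started(self, now):
        self.ready_at = now + self.delay_s

    def power_restored(self, now):
        # A timer without backup power loses its state and counts from zero.
        if not self.survives_power_loss:
            self.ready_at = now + self.delay_s

    def wait_needed(self, now):
        return max(0.0, self.ready_at - now)


def waits_after_glitches(timer, first_start, restore_times):
    """Return the cooling gap (seconds) seen after each power restoration."""
    timer.compressor_started(first_start)
    waits = []
    for i, t in enumerate(restore_times):
        timer.power_restored(t)
        wait = timer.wait_needed(t)
        waits.append(wait)
        restart = t + wait
        next_glitch = restore_times[i + 1] if i + 1 < len(restore_times) else float("inf")
        # The compressor only actually restarts if no new glitch arrives first.
        if restart < next_glitch:
            timer.compressor_started(restart)
    return waits


# Compressor has run for an hour; power bumps at t=3602 s and again at t=3662 s.
glitches = [3602.0, 3662.0]
simple = waits_after_glitches(ShortCycleTimer(), 0.0, glitches)
backed = waits_after_glitches(ShortCycleTimer(survives_power_loss=True), 0.0, glitches)
print("resets on power loss:", simple)
print("true start-to-start: ", backed)
```

In this model the simple timer costs the full delay on every bump, while the start-to-start timer restarts immediately after the first bump and only waits out the remainder of the delay after the second.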
Dealing with sequential power failures is tough. That's what took down that big data center in SF a few months ago.
While many here are discussing UPSes, chillers, set-points, etc., the most serious flaw is being glossed over ... the lack of redundancy outside the data center, such as multiple, diverse power lines coming in...
From the articles, it appears that the Rackspace datacenter doesn't have multiple power lines coming in and/or many come in via one feed point.
How else is it that a car crash quite some distance from the datacenter can cause such disruption? Does anyone even plan for such events? I get the feeling most planners don't, since I've seen first-hand many power failures, caused by dumb things like a vehicle hitting a utility pole, occur in places where one would expect more redundancy.
Ron
We've summoned a small demon to let in cool air particles and shunt out hot ones. Sure the weekly sacrifice gets to be a pain after a while, but there's always a pool of willing interns right?
Ask not what you can do for your country. Ask what your country did to you
Every single watt consumed by a computer is turned into heat, and generally released out the back of the case. Computers behave the same as the coil of nichrome wire used in a laundromat clothes dryer. (I guess a few milliwatts get out of your cold room via ethernet cables and photons on fiber.)
(Disregarding your blatant karma whoring by replying to the top post while changing the subject)
There's several good reasons why the servers are located where they are, and not, say, in Alaska.
The main one is light speed through fiber, and a cable from Houston to Fairbanks would induce a best case of around 28 ms latency, each way. Multiply by several billion packets.
This is why hosting near the customer is considered a Good Thing, and why companies like Akamai have made it their business of transparently re-routing clients to the closest server.
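The latency figure quoted above can be sanity-checked with a quick Python sketch. Everything here is an assumption for illustration: the city coordinates are approximate, the cable is pretended to follow the great-circle path (real routes are longer), and 1.468 is used as a typical refractive index for single-mode fiber.

```python
import math

EARTH_RADIUS_KM = 6371.0
C_VACUUM_KM_S = 299_792.458
FIBER_INDEX = 1.468  # assumed typical refractive index of single-mode fiber


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def one_way_latency_ms(distance_km, index=FIBER_INDEX):
    """Propagation delay only; ignores routing, switching and queueing."""
    return distance_km / (C_VACUUM_KM_S / index) * 1000.0


# Approximate coordinates for Houston, TX and Fairbanks, AK.
distance = great_circle_km(29.76, -95.37, 64.84, -147.72)
latency = one_way_latency_ms(distance)
print(f"{distance:.0f} km, about {latency:.1f} ms one way")
```

The straight-line result comes out a little under the best-case figure quoted in the comment, which is what you would expect given that real fiber does not follow the great circle.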
Back to cooling. A few years ago, I worked for a telephone company, and the local data centre there had a 15 degree C ambient baseline temperature. We had to wear sweaters if working for any length of time in the server hall, but had a secure normal temperature room outside the server hall, with console switches and a couple of ttys for configuration.
The main reason why the temperature was kept so low was to be on the safe side -- even if a fan should burn out in one of the cabinets, opening the cabinet doors would provide adequate (albeit not good) cooling until it could be repaired, without (and this is the important part) taking anything down.
A secondary reason was that the backup power generators were, for security reasons, inside the server hall themselves, and during a power outage these would add substantial heat to the equation.
5 kilowatts is a heck of a lot to have on a single rack - assuming you're actually utilizing that. I recently interviewed a half dozen data centers to plan a 20-odd server deployment, and we ended up using 2 cabinets in order to ensure our heat dissipation was sufficient. Since data centers are usually supplying 20 amp, 110 or 120v power, you get 2200-2400 watts available per drop; although it's considered a bad idea to draw more than 15 amps per circuit. We have redundant power supplies in everything, so we keep ourselves at 37.5% of capacity on the drops, and each device is fed from a 20 amp drop coming from a distinct data center pdu. That way even if one of the data center pdus implodes, we're still up and at 75% capacity or less.
Almost no data center we spoke to would commit to cooling more than 4800 watts of power at an absolute maximum per rack, and those were facilities with hot/cool row setups to maximize airflow. But that meant they didn't want to drop more than 2x20 amp power drops, plus 2x20 for backup, if you agreed to maintain 50% utilization across all 4 drops. But since you'd really want to stay at or below 75% even in the case of failure, you'd only be using 3600 watts. (In the facility we ended up in, we have a total of 6 20 amp drops, and we only actually utilize ~4700 watts.)
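The arithmetic behind those numbers is easy to check. The Python sketch below is only an illustration of the figures quoted above; the function names are made up, and it assumes 20 amp, 120 V drops and dual power supplies that share load evenly until one feed fails.

```python
def drop_capacity_watts(amps=20, volts=120):
    """Nominal capacity of one power drop."""
    return amps * volts


def usable_watts(num_drops, utilization, amps=20, volts=120):
    """Total planned draw when every drop is held at the given utilization."""
    return num_drops * drop_capacity_watts(amps, volts) * utilization


def utilization_after_feed_failure(utilization):
    """With two feeds per device, losing one feed doubles the load on the other."""
    return utilization * 2


print(usable_watts(4, 0.50))                  # 4 drops at 50%
print(usable_watts(4, 0.375))                 # 4 drops at 37.5%
print(usable_watts(6, 0.375))                 # 6 drops at 37.5%
print(utilization_after_feed_failure(0.375))  # surviving drop after a PDU failure
```

Four drops at 50% gives the 4800 watt ceiling the data centers quoted, and four drops at 37.5% gives the 3600 watts mentioned above. Six drops at 37.5% would allow 5400 watts, so the ~4700 watts actually used sits under that budget.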
Ultimately, though, the important thing is that cooling systems should be on generator/battery backup power. Otherwise, as this notes, your battery backup won't be useful.
While thinking outside the box is all well and fine, it's even better when combined with Common Knowledge. Like knowing that caves and mines (a) tend to be rather warm when deep enough, and (b) have a fixed amount of air.
:-)
As for the power efficiency of pumping air from several hundred meters away compared to pumping it through the grille of an AC unit, well, there's a reason why skyscrapers these days have multiple central air facilities instead of just one: Economics.
I'd like to see you pump air for any long distance with your exercise bike
A large data center should not have one big massive UPS anyway. It should all be divided out into various load sections, each with its own UPS+battery system. Once you do that, then you can have cooling on its own UPS without any risk of the cooling system impacting the UPS feeding the computers ... if you really want cooling on UPS (it can be done, but generally is not the best way). Surely you would have the cooling on its own three phase circuits.
Perhaps a better approach is a smart cooling system that rotates the starting of compressors on various units so you always have some number of units running and some number not running, at the ratio needed for the current thermal demands. Then when there is an outage that has to go to generators, only a limited number of units will have been recently started just before the outage and need to be thermally protected. The controller skips those and starts the idle units (unless you are already maxed out, in which case you'd have no idle units). But you will need to have the cooling on the generators.
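The selection step of such a controller could look roughly like the Python sketch below. It is a hypothetical illustration, not a description of any real product: the unit names, the 240-second lockout, and the "longest-rested first" ordering are all assumptions.

```python
def units_to_start(last_start, now, needed, lockout_s=240.0):
    """Pick up to `needed` cooling units whose compressors are safe to start.

    last_start maps unit name -> time of its last compressor start,
    or None if the unit has not started recently enough to matter.
    """
    eligible = [
        unit for unit, started in last_start.items()
        if started is None or now - started >= lockout_s
    ]
    # Longest-rested units first, so starts rotate across the fleet.
    eligible.sort(
        key=lambda unit: float("-inf") if last_start[unit] is None else last_start[unit]
    )
    return eligible[:needed]


# Hypothetical state at generator transfer (t = 1000 s): units B and C were
# started shortly before the outage and are still inside their lockout.
fleet = {"A": 0.0, "B": 900.0, "C": 990.0, "D": None}
chosen = units_to_start(fleet, now=1000.0, needed=2)
print(chosen)
```

If fewer eligible units come back than were requested, that shortfall is the "already maxed out" case from the comment, and the room has to ride through the lockout.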
If you are going to have a backup distribution circuit from the utility, it should be physically separate from the primary circuit so that it is not necessary to shut down both to deal with things like a traffic accident.
now we need to go OSS in diesel cars
Disclaimer: I work with SGI, so I can shed some light on their customers' perspective (NASA, gov't, research labs, etc.) and solution to this problem.
The increasing density of servers is exacerbating the problem of power and cooling in every data center. This week is the SuperComputing trade show, where the new edition of the Top 500 supercomputers list was released with "Big Turnover Among the Top 10 Systems," and where you can see the first examples to address these issues.
SGI's new ICE blade system was launched a few months ago; it was designed to address the power consumption, real estate density, and cooling issues everyone will probably experience on their next server cycle. ICE has shipped and one installation is now #3 on the Top 500. It's a welcome sign that SGI is back from bankruptcy. I'm sorry if this seems like an advert, so I'm not going to link to SGI -- you can go find out more easily if you want.
My opinions are my own, but you may share them!
I agree with almost all of your post with the only exception being the cooling systems on UPS. There is absolutely no reason to put cooling systems on UPS power. Large, inductive loads are a UPS's enemy. A big inrush current of a chiller starting up would beat the crap out of your battery string(s).
Having said that, you are exactly right on having both your UPS system(s) and your cooling system(s) diversified. I tend to get into this argument with people regarding what constitutes a "data center" and one of the most significant parts of determining what actually constitutes a "data center" is redundancy. This means not just redundant utility power feeds, but redundant UPS systems/modules, redundant generators, redundant chillers/CRACs, redundant PDU's, etc etc.
For our cooling systems, we have 4 chillers (we only need 2) and 20 CRACs (we only need 10). Any problems with any system can be mitigated by rolling to the redundant system.
the local data centre there had a 15 degree C ambient baseline
Well that's just incompetent. For one thing, commercial electronics experience increased failure as you move away from an ambient 70 degrees F regardless of which direction you move. Running them at 59 degrees F (15 C) is just as likely to induce intermittent failures as running it at 80 degrees F.
For another, you're supposed to design your cooling system to accommodate all of the planned heat load in the environment. If your generators will be adding heat then the A/C needs to have sufficient capacity to take that heat back out.
And anyway, your generators shouldn't be adding heat. They should be walled off from the data center with exterior air exchange. Otherwise an error in the exhaust ducting risks killing your operators with CO poisoning.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Additionally, you appear to be conflating the air temperature in the data centre (15C) with the temperature of the components. Since having a heat flux requires having a thermal gradient, then the components will be warmer than your heat sink.
In this town, we can tell the nationality of the boss of any office instantly on walking in - European bosses keep the HVAC (heating ventilation air-conditioning, or climate control) set to about 20C; American bosses have it re-set to 25C (until over-ruled for wasting money). There's an Indian HVAC company (in Abu Dhabi), and an instrumentation engineer (last heard of in Houston, America) who need to be taught this lesson. Again. If you meet them, please apply the clue-bat before agreeing to take the equipment they design out to the Empty Quarter to rig it up.
Your carbon dioxide flood for fire suppression would be as effectively lethal. Operators would need to be kept out of the controlled zone while enclosed generators are running; the fire suppression system should be overridden while operators are in the controlled zone, or you need to be rigged up with cascade air supplies and work-pack SCBA while working in the control zone. This isn't rocket science - there are plenty of corpses that point the way to proper management of work in potentially lethal atmospheres. (Of course, there are plenty of work places that like to cut corners and put their workers at risk. Don't work there and do report them to the relevant authorities.)
Birds are not dinosaur descendants; birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"