Cooling Challenges an Issue In Rackspace Outage
miller60 writes "If your data center's cooling system fails, how long do you have before your servers overheat? The shrinking window for recovery from a grid power outage appears to have been an issue in Monday night's downtime for some customers of Rackspace, which has historically been among the most reliable hosting providers. The company's Dallas data center lost power when a traffic accident damaged a nearby power transformer. There were difficulties getting the chillers fully back online (it's not clear if this was equipment issues or subsequent power bumps) and temperatures rose in the data center, forcing Rackspace to take customer servers offline to protect the equipment. A recent study found that a data center running at 5 kilowatts per server cabinet may experience a thermal shutdown in as little as three minutes during a power outage. The short recovery window from cooling outages has been a hot topic in discussions of data center energy efficiency. One strategy being actively debated is raising the temperature set point in the data center, which trims power bills but may create a less forgiving environment in a cooling outage."
Other publications have noted it was number 3, too.
DT
Is this thing on? Hello?
Well, back to rejecting software patent applications.
If you want 100% uptime (which is impossible, but you can put enough 9s in your reliability to be close enough), you need to have your data distributed across multiple data centers, geographically separate, and overprovisioned enough that the loss of one data center won't cause the others to be overloaded. It's important to keep your geographical separation large because you never know when the entire eastern (or western) seaboard will experience complete power failure or when a major backhaul router will go down or have a line cut. Preferably each data center should get power from multiple sources if it can, and multiple POPs on the internet from each center are almost mandatory.
I read the internet for the articles.
I actually use a vent duct to suck in cold air from outside during the winter to help cool a server in my house. Originally I was more concerned with random objects/bugs/leaves, so I made it a closed system (like water cooling) to help protect the actual system. It works nicely, but only for about 1/3 or less of the year, when the temperature is cold enough to make a difference. I've always wondered about a larger-scale version of something like this, such as the parent's suggestion of servers in a colder/arctic region.
Believe it or not, in one of those "life coincidences", pi is a close approximation (the exact factor is about 3.41 BTU/hr per watt). Take the number of watts your equipment, lighting, etc. use, multiply by pi, and that's roughly the number of BTU/hr of cooling you need. Don't forget to include 100 watts per person for body heat.
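To put a number on how close the pi shortcut is, here is a minimal sketch comparing it with the standard conversion of 1 W = 3.412142 BTU/hr. The 5 kW cabinet and the two occupants are hypothetical values picked for illustration; only the 100 W per person figure comes from the comment above.

```python
import math

BTU_PER_HR_PER_WATT = 3.412142  # standard conversion: 1 W = 3.412142 BTU/hr


def cooling_btu_per_hr(watts, factor=BTU_PER_HR_PER_WATT):
    """Heat load in BTU/hr for a given electrical load in watts."""
    return watts * factor


# Hypothetical load: one 5 kW cabinet plus two people at 100 W each.
load_watts = 5000 + 2 * 100

exact = cooling_btu_per_hr(load_watts)
rule_of_thumb = cooling_btu_per_hr(load_watts, factor=math.pi)
shortfall_pct = 100 * (exact - rule_of_thumb) / exact

print(f"exact: {exact:.0f} BTU/hr")
print(f"pi rule: {rule_of_thumb:.0f} BTU/hr ({shortfall_pct:.1f}% low)")
```

The pi rule comes out roughly 8% under the exact figure, so it is fine for a quick estimate but worth padding before sizing real equipment.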
It'll be 90F degrees outside, and you'll be a cool 66F.
Kevin Smith on Prince
I've never understood why data centre designers haven't used a different cooling strategy to re-circulated cooled air. After all, for much of the temperate latitudes for much of the year the external ambient temperature is at or below that needed for the data centre so why not use conditioned external air to cool the equipment and then exhaust it (possibly with a heat exchanger to recover the heat for other uses such as geothermal storage and use in winter)? (Oh, and have the air-flow fans on the UPS.)
The advantage of this is that even in the worst-case scenario, where the chillers fail totally during mid-summer, there is no runaway, closed-loop, self-reinforcing heat cycle. The data centre temperature will rise, but it will do so more slowly, and the maximum equilibrium temperature will be far lower (and dependent upon the external ambient temperature).
In fact, as part of the design for the cluster room in our new building I've specified such a system, though due to the maximum size of the ducting space available we can only use this for half the heat load.
Agrajag: "Oh no, not again!"
I think the problem is availability of power. When you are talking about facilities that consume so much power that, when built, their proximity to a power station is taken into account, you can't just slap one down at the poles and call it good. I would imagine that lack of bandwidth is a MAJOR issue as well... One field where I think storing servers at the poles would be amazing is supercomputing. Supercomputers don't require the massive amounts of bandwidth that webservers etc. do. You send a cluster a chunk of data for processing, it processes it, and it gets sent back. For really REALLY large datasets (government stuff)... just fill a jet with hard disks and have it at the server center in a few hours.
Newslily: Social News. No lolcats allowed.
For example, Chicago's primary datacenter facility is at 350 E. Cermak (right next to McCormick Place), and the primary interconnect facility in that building is Equinix (which has the 5th and now 6th floors). A year or so ago there was a major outage there (that mucked up a good amount of the internet in the midwest) when a power substation caught fire and the Chicago Fire Department had to shut off power to the entire neighborhood. So the backup system started like it should, with the huge battery rooms powering everything (including the chillers) for a bit while the engineers started up the generators. Only thing is, the circuitry that controls the generators shorted out, so while the generators themselves were working, the UPS was working, and the chillers were working, this one circuit board blew at the WRONG moment. And this wasn't the first time this circuit had been used; they test the generators every few weeks.
Long story short, once the UPSes started running out of power the chillers started going, lights flickered, and for a VERY SHORT period of time the chillers went out before all of the servers did. Within a minute or two it got well over 100 degrees in that datacenter. Thank god the power cut out as quick as it did.
So yes, Equinix in that case did everything by the book. They had everything set up as you would set it up. It was no big deal. But something went wrong at the worst time for it to go wrong and all hell broke loose.
It could be worse, your datacenter could be hit by a tornado
-nick
A few weeks ago the A/C dropped out in one of our computer rooms. I like the resulting graph: http://leebert.org/tmp/SCADA_S100_10-3-07.JPG
For those of you who either didn't take Physics, or slept through it, Watts and BTU's/hr are both measurements of POWER. Add up all the (input) wattages, and use something like http://www.onlineconversion.com/power.htm/ to convert. This site also has a conversion to 'tons of refrigeration' on that same page.
Also note - Don't EVER use the rated wattage of a power supply, because that's what it SUPPLIES, not what it uses. Instead use the current draw multiplied by the voltage (US - 110 for single phase, 208 for dual phase in most commercial bldgs, 220 only in homes or where you know that's the case). This is the 'VA' [Volt-Amps] unit. Use this number for 'watts' in the conversion to refrigeration needs.
Just FYI - a watt is defined as 'the power developed in a circuit by a current of one ampere flowing through a potential difference of one volt' (see http://www.siliconvalleypower.com/info/?doc=glossary/), i.e. 1W = 1VA. The dirty little secret about power calculations is that there is another factor thrown in, typically about 0.65, called the 'power factor' that UPS and power supply manufacturers use to lower the overall wattage. That's why you always use VA (rather than the reported wattage), because in a pinch you can always measure both voltage and amperage (under load).
Basically do this - take all the amperage draws for all the devices in your rack/room/data center, multiply them by the applied voltage for that device (110 or 208) and add all the products together. Then convert that number to tons of refrigeration. This is your minimum required cooling for a lights out room. If you have people in the room, count 1100 BTU's/hr for each person and add that to the requirements (after conversion to whatever unit you're working with). Some HVAC contractors want specifications in BTU's/hr and others want it in tons. Don't forget lighting either if it's not a 'lights out' operation. A 40W fluorescent bulb means it's going to dissipate 40W (as in heat). You can use these numbers directly as they are a measure of the actual heat thrown, not of the power used to light the bulb.
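The recipe above can be sketched in a few lines of Python. The conversion constants are standard (1 W = 3.412142 BTU/hr, 1 ton of refrigeration = 12,000 BTU/hr), and the 1100 BTU/hr per person figure is the one quoted above. The device list, headcount, and lighting are made-up values for illustration, and `cooling_load` is a hypothetical helper name.

```python
BTU_PER_HR_PER_WATT = 3.412142   # 1 W = 3.412142 BTU/hr
BTU_PER_HR_PER_TON = 12000.0     # 1 ton of refrigeration = 12,000 BTU/hr
BTU_PER_HR_PER_PERSON = 1100.0   # per-person figure quoted in the comment


def cooling_load(devices, people=0, lighting_watts=0.0):
    """Estimate the minimum cooling requirement for a room.

    devices: iterable of (amps, volts) pairs, measured under load.
    Returns a (btu_per_hr, tons) tuple.
    """
    # Treat volt-amps as watts of heat, as the comment recommends.
    equipment_va = sum(amps * volts for amps, volts in devices)
    btu = (equipment_va + lighting_watts) * BTU_PER_HR_PER_WATT
    btu += people * BTU_PER_HR_PER_PERSON
    return btu, btu / BTU_PER_HR_PER_TON


# Hypothetical rack: two 110 V devices at 4 A, one 208 V device at 6.5 A,
# one person in the room, four 40 W fluorescent tubes.
devices = [(4.0, 110), (4.0, 110), (6.5, 208)]
btu, tons = cooling_load(devices, people=1, lighting_watts=4 * 40)
print(f"{btu:.0f} BTU/hr = {tons:.2f} tons of refrigeration")
```

Whichever unit your HVAC contractor asks for, both values come out of the same sum, so there is no need to convert twice.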
Make sense?
Dennis Dumont
Most large refrigeration compressors have "short-cycling protection". The compressor motor is overloaded during startup, and needs time to cool. So there's a timer that limits the time between two compressor starts. 4 minutes is a typical delay for a large unit. If you don't have this delay, compressor motors burn out.
Some fancy short-cycling protection timers have backup power, so the "start to start" time is measured even through power failures. But that's rare. Here's a typical short-cycling timer. For the ones that don't, like that one, a power failure restarts the timer, so you have to wait out the timer after a power glitch.
The timers with backup power, or even the old style ones with a motor and cam-operated switch, allow a quick restart after a power failure if the compressor was already running. Once. If there's a second power failure, the compressor has to wait out the time delay.
So it's important to ensure that a data center's chillers have time delay units that measure true start-to-start time, or you take a cooling outage of several minutes on any short power drop. And, after a power failure and transfer to emergency generators, don't go back to commercial power until enough time has elapsed for the short-cycling protection timers to time out. This last appears to be where Rackspace failed.
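A toy model makes the difference between the two timer types concrete. This is only an illustrative sketch: the `ShortCycleTimer` class, the `cooling_gap` helper, and the event times are all hypothetical, and the 240-second lockout is the "typical" 4-minute delay mentioned above, not a spec for any real unit.

```python
class ShortCycleTimer:
    """Anti-short-cycle lockout for a compressor (illustrative model only)."""

    def __init__(self, min_start_to_start=240.0, survives_power_loss=False):
        self.min_start_to_start = min_start_to_start  # seconds
        self.survives_power_loss = survives_power_loss
        self.reference_time = None  # moment the lockout interval began

    def compressor_started(self, now):
        self.reference_time = now

    def power_restored(self, now):
        # A timer without backup power forgets the last start and
        # begins counting again from the moment power comes back.
        if not self.survives_power_loss:
            self.reference_time = now

    def earliest_restart(self):
        if self.reference_time is None:
            return 0.0
        return self.reference_time + self.min_start_to_start


def cooling_gap(timer, restored_at):
    """Seconds without cooling after power returns at `restored_at`."""
    timer.power_restored(restored_at)
    restart = max(restored_at, timer.earliest_restart())
    timer.compressor_started(restart)
    return restart - restored_at


simple = ShortCycleTimer(survives_power_loss=False)
true_s2s = ShortCycleTimer(survives_power_loss=True)
for timer in (simple, true_s2s):
    timer.compressor_started(0.0)  # compressor has been running a long time

# Two power bumps in quick succession; power returns at t=1010 s and t=1070 s.
simple_gaps = [cooling_gap(simple, t) for t in (1010.0, 1070.0)]
true_gaps = [cooling_gap(true_s2s, t) for t in (1010.0, 1070.0)]
print("simple timer gaps:", simple_gaps)
print("true start-to-start gaps:", true_gaps)
```

In this model the simple timer costs a full lockout on every bump, while the start-to-start timer restarts immediately after the first bump and only waits out the remainder of the interval after the second.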
Dealing with sequential power failures is tough. That's what took down that big data center in SF a few months ago.
While many here are discussing UPSes, chillers, set-points, etc., the most serious flaw is being glossed over ... the lack of redundancy outside the data center, such as multiple, diverse power lines coming in...
From the articles, it appears that the Rackspace datacenter doesn't have multiple power lines coming in and/or many come in via one feed point.
How else is it that a car crash quite some distance from the datacenter can cause such disruption? Does anyone even plan for such events? I get the feeling most planners don't, since I've seen first-hand many power failures caused by dumb things like a vehicle hitting a utility pole, in places where one would expect more redundancy.
Ron
(Disregarding your blatant karma whoring by replying to the top post while changing the subject)
There are several good reasons why the servers are located where they are, and not, say, in Alaska.
The main one is light speed through fiber, and a cable from Houston to Fairbanks would induce a best case of around 28 ms latency, each way. Multiply by several billion packets.
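The latency figure can be sanity-checked with a back-of-the-envelope calculation. This sketch assumes a straight great-circle fiber run, a refractive index of about 1.47 for the glass, and approximate city coordinates; all three are assumptions for illustration, and real cable routes are longer than the great-circle path.

```python
import math

C_VACUUM_KM_S = 299_792.458  # speed of light in vacuum, km/s
FIBER_INDEX = 1.47           # assumed refractive index of the fiber


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def one_way_ms(distance_km, index=FIBER_INDEX):
    """One-way propagation delay through fiber, in milliseconds."""
    return distance_km / (C_VACUUM_KM_S / index) * 1000.0


# Approximate coordinates (assumed): Houston, TX and Fairbanks, AK.
houston = (29.76, -95.37)
fairbanks = (64.84, -147.72)

dist = great_circle_km(*houston, *fairbanks)
latency = one_way_ms(dist)
print(f"{dist:.0f} km, about {latency:.1f} ms one way")
```

Under these assumptions the one-way delay lands in the mid-20s of milliseconds, so a best case of around 28 ms over a real, less direct cable route is plausible.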
This is why hosting near the customer is considered a Good Thing, and why companies like Akamai have made it their business of transparently re-routing clients to the closest server.
Back to cooling. A few years ago, I worked for a telephone company, and the local data centre there had a 15 degree C ambient baseline temperature. We had to wear sweaters if working for any length of time in the server hall, but had a secure normal temperature room outside the server hall, with console switches and a couple of ttys for configuration.
The main reason why the temperature was kept so low was to be on the safe side -- even if a fan should burn out in one of the cabinets, opening the cabinet doors would provide adequate (albeit not good) cooling until it could be repaired, without (and this is the important part) taking anything down.
A secondary reason was that the backup power generators were, for security reasons, inside the server hall itself, and during a power outage these would add substantial heat to the equation.