The Risks and Rewards of Warmer Data Centers
1sockchuck writes "The risks and rewards of raising the temperature in the data center were debated last week in several new studies based on real-world testing in Silicon Valley facilities. The verdict: companies can indeed save big money on power costs by running warmer. Cisco Systems expects to save $2 million a year by raising the temperature in its San Jose research labs. But nudge the thermostat too high, and the energy savings can evaporate in a flurry of server fan activity. The new studies added some practical guidance on a trend that has become a hot topic as companies focus on rising power bills in the data center."
Locate the server farm in Antarctica!
http://www.geoffreylandis.com
1. Get a thermostat you can control with a computer
2. Give the computer inputs of temperature and energy use, and output of heating/cooling
3. Write a program to minimize energy use (genetic algorithm?)
4. Profit!!
Possible problem: do we need to factor in some increased wear & tear on the machines for higher temperatures? That would complicate things.
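The four-step plan above can be sketched as a toy optimization. Everything numeric here is an assumption made up for illustration (the linear chiller curve, the fan "knee" at 27 C, the coefficients), and wear and tear is not modeled at all. With a single variable, a brute-force scan over setpoints does the job, so no genetic algorithm is needed.

```python
# Toy model: every coefficient below is a made-up assumption, not a measurement.

def cooling_power_kw(setpoint_c):
    """Chiller power, assumed to fall linearly as the room setpoint rises."""
    return max(0.0, 100.0 - 3.0 * (setpoint_c - 18.0))


def fan_power_kw(setpoint_c, knee_c=27.0):
    """Server fan power: flat below an assumed knee, quadratic ramp above it."""
    base_kw = 5.0
    excess_c = max(0.0, setpoint_c - knee_c)
    return base_kw + 1.5 * excess_c ** 2


def total_power_kw(setpoint_c):
    """Total power the optimizer tries to minimize."""
    return cooling_power_kw(setpoint_c) + fan_power_kw(setpoint_c)


def best_setpoint(candidates):
    """Brute-force search over candidate setpoints (degrees C)."""
    return min(candidates, key=total_power_kw)


if __name__ == "__main__":
    candidates = [18.0 + 0.5 * i for i in range(35)]  # 18.0 C .. 35.0 C
    best = best_setpoint(candidates)
    print(f"best setpoint: {best:.1f} C, total: {total_power_kw(best):.1f} kW")
```

In a real room the two curves would come from measured power and fan data rather than from formulas, but the shape of the problem is the same: savings grow until the fans start ramping, then shrink.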
Sure, the fans kick in and you aren't saving as much, but are you still saving? I suspect you still are; there is a reason you are told to run ceiling fans in your house even with the AC on.
The thermal modeling for all this isn't that difficult. You can get power consumption, fan speeds, temperatures, etc. and feed them into a pretty accurate plant model that should be able to adjust the temperature on the fly for optimal efficiency. Or I guess we can hire a company to form a bunch of committees to do a bunch of studies and come up with a bunch of papers that state the obvious.
No they didn't - what they did do is figure out that increased temperature is not correlated to higher failure rates - the failure rates don't magically decrease as it gets hotter.
Here's the link for your review: http://hardware.slashdot.org/story/07/02/18/0420247/Google-Releases-Paper-on-Disk-Reliability
[Citation Needed]
Here's a few links to the contrary:
http://www.tomshardware.com/news/google-hard-drives,4347.html
http://tech.blorge.com/Structure:%20/2007/02/20/googles-hard-disk-study-shows-temperature-is-not-as-important-as-once-thought/
Until what point? You can't consistently say "increase the temperature to decrease the MTBF".
You'll end up with molten slag.
Check out my sysadmin blog!
I thought the internet was free (or so people keep telling me). You mean it actually costs these companies money to maintain the connections??? Wow. I guess my $15/month bill actually serves a purpose after all.
"I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
Until what point? You can't consistently say "increase the temperature to decrease the MTBF".
You'll end up with molten slag.
Yes, you can. MTBF = mean time before/between failure. To decrease, reduce, lower, however you want to say it, it is going to fail SOONER meaning it is getting LESS reliable. That was the point, hotter temps = less reliability. Same goes for just about any physical/chemical process (fans, batteries, hard drive motors, etc.)
80 whats? Obviously they mean 80F (running a data center at 80K, 80C or 80R would be insane), but you should always specify units (especially if you're using some backwards units like Fahrenheit!).
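For anyone who wants to see how far apart those readings of "80" really are, here is a small conversion sketch. It assumes "R" means Rankine (it could also be read as Réaumur, which is not handled here), and the function name is just for illustration.

```python
def to_celsius(value, unit):
    """Convert a temperature to degrees Celsius.

    unit is one of "F" (Fahrenheit), "K" (Kelvin), "C" (Celsius),
    or "R" (assumed here to mean Rankine).
    """
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    if unit == "R":
        return value * 5.0 / 9.0 - 273.15
    raise ValueError(f"unknown unit: {unit!r}")


if __name__ == "__main__":
    for unit in ("F", "K", "C", "R"):
        print(f"80 {unit} = {to_celsius(80.0, unit):.1f} C")
```

Only the Fahrenheit reading (about 26.7 C) lands anywhere near a plausible room temperature.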
IranAir Flight 655 never forget!
I realise that this is not something that could be done quickly; it would require co-operation from all major vendors, and then only if it would actually end up being more efficient overall. There would be lots of hurdles to overcome too... Efficient ducting (no jagged edges or corners like in domestic HVAC ductwork), no leaks, easy interconnects, space requirements, rerouting away from inactive equipment, etc. You would still need some AC in the room, as there is bound to be heat leakage from the ductwork as well as heat given off by less critical components, but the level of cooling required would be much less if the bulk of the heat was ducted straight outside.
So I know the implementation of something like this would be monumental, requiring redesigning of servers, racks, cabinets and general DC layout. It would probably require standards to be laid out so that any server will work in any cab etc (like current rackmount equipment is fairly universally compatible), but after this conversion, could it be more efficient and pay off in the long run?
Just thinking out loud.
Tom...
Well, if you have a large cluster, you can load balance based on CPU temp to maintain a uniform junction temp across the cluster. Then all you need to do is maintain just enough A/C to keep the CPU cooling fans running slow (so there is excess cooling capacity to handle a load spike since the A/C can only change the temp of the room so quickly)
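A minimal sketch of the temperature-based load balancing idea, assuming each node reports a CPU temperature and that every job placed on a node warms it by a fixed amount. The node names, the `heat_per_job_c` figure and both function names are hypothetical, and a real scheduler would re-read sensors instead of guessing the bump.

```python
def pick_coolest(cpu_temps_c):
    """Return the name of the node with the lowest reported CPU temperature."""
    return min(cpu_temps_c, key=cpu_temps_c.get)


def dispatch(jobs, cpu_temps_c, heat_per_job_c=2.0):
    """Greedily send each job to whichever node is coolest right now.

    heat_per_job_c is an assumed, crude temperature rise per job.
    Returns (placement, projected_temps) and leaves the input dict untouched.
    """
    temps = dict(cpu_temps_c)
    placement = {}
    for job in jobs:
        node = pick_coolest(temps)
        placement[job] = node
        temps[node] += heat_per_job_c
    return placement, temps


if __name__ == "__main__":
    readings = {"node-a": 50.0, "node-b": 56.0, "node-c": 53.0}
    placement, projected = dispatch(["j1", "j2", "j3", "j4"], readings)
    print(placement)
    print(projected)
```

The greedy rule pulls the nodes toward a common temperature, which is what lets the room A/C be sized for the average rather than the hottest box.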
Or, you can just bury your data center in the antarctic ice and melt some polar ice cap directly.
We have replaced Tom's Decaf with DOUBLE ESPRESSO this morning, let's see if he's noticed the difference..
Except the google study didn't display any evidence of this happening - there was no correlation between higher temperatures and higher failure rates on mechanical drives.
http://hardware.slashdot.org/article.pl?sid=07/02/18/0420247
http://www.engadget.com/2007/02/18/massive-google-hard-drive-survey-turns-up-very-interesting-thing/
http://labs.google.com/papers/disk_failures.pdf
Even if it were, it'd be easy to remedy - boot all your servers off SSD and keep them in a "hot" room. Keep your SANs-full-o'-spinning-rust in a "cold" room. You've just saved a fortune in air con despite being unable to convince your CTO that heat isn't as big a killer as many people claim it to be.
We had a power failure at one of our data centres, due to a combination of a stupid JCB driver and IBM's ineptitude (not keeping the diesel tanks full). Power for the servers was restored about six hours before the air-con was back up and running, and most of our equipment got cooked (ambient temp ranging from 35 to 40 degrees depending which part of the data centre you stood in) - we demanded IBM guarantee us a 3hr turnaround on any parts that died for the next 6 months due to heat failure. 18 months later and our hard drive failure graph is the same as it ever was.
Shoddy components on hardware is another matter I guess, but we've never had any hardware die due to a single faulty component apart from the occasional RAID card. Expect, as "hot" DC's become more common, that the heat thresholds on lots of enterprise equipment will increase... for a price, of course ;)
Moderation Total: -1 Troll, +3 Goat
I used to have a Pentium 4 Prescott, and the truth is processors can run significantly above spec (hell, the thing would go above the "max temp" just opening Notepad). It's already been shown that higher temps don't break HDDs, so are the downsides of running the processor a few degrees hotter significant, or can they be ignored?
IranAir Flight 655 never forget!
I know it was meant as a joke, but moving to colder climates may not be such a bad idea. Moving to a northern country such as Canada or Norway, you would benefit from the colder outside temperature, in the winter, to keep the servers cool and then any heat produced could be funnelled to keeping nearby buildings warm. The real challenge will be keeping any humidity out, but considering how dry the air during the winters can get there it may not be any issue.
All this said and done, trying to work out the sweet spot between not cooling a room to save energy and not having the server fans turn on is important. I would be curious to know whether any solutions exist that link the system temperature monitors into a central system, which is in turn linked to the room's climate control.
Jumpstart the tartan drive.
If you save energy by having warmer data centers but shorten the MTBF, is it really that big of a deal?
Let's say the hardware is rated for five years. Let's say that running it hotter than the recommended specifications shortens that to three years.
But in three years, new and more efficient hardware will probably replace it anyway because it will require, let's say, 150 watts instead of 200 watts, so the old hardware would get replaced anyway because the new hardware will cost less to run in those lost two years.
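To put rough numbers on that trade-off, here is a back-of-the-envelope sketch using the 200 W and 150 W figures above over the two lost years. The electricity price and the cooling overhead multiplier are assumptions you would replace with your own.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def energy_saved_kwh(old_watts, new_watts, years):
    """Energy saved by the lower-power hardware, running 24/7, in kWh."""
    return (old_watts - new_watts) * HOURS_PER_YEAR * years / 1000.0


def cost_saved(old_watts, new_watts, years, price_per_kwh, overhead=1.0):
    """Money saved; overhead scales IT power for cooling (1.0 means none)."""
    return energy_saved_kwh(old_watts, new_watts, years) * price_per_kwh * overhead


if __name__ == "__main__":
    kwh = energy_saved_kwh(200, 150, 2)
    print(f"energy saved per server: {kwh:.0f} kWh")
    # Assumed price of 0.10 per kWh and an assumed 2x overhead for cooling.
    print(f"cost saved per server: {cost_saved(200, 150, 2, 0.10, overhead=2.0):.2f}")
```

The 50 W difference works out to 876 kWh per server over two years, so whether the power bill alone justifies early replacement depends on the local price per kWh and the price of the new hardware.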
The studies were not long enough to constitute a very in-depth analysis. It would take a multi-month study, or up to a year, to analyze all the effects of raising temperatures.
For example, little was considered with:
1) Mechanical Part wear (increased fan wear, component wear due to heat)
2) Employee discomfort (80 degree server room?)
3) Part failure*
*If existing cooling solutions had issues, it would be a shorter time between the issue and additional problems since you have cut your window by ~15 degrees.
It's all fun and games till someone divides by 0. Then it's hilarious.
I gave that paper a closer look (the best link is this one, btw, since the Engadget link is dead: http://research.google.com/archive/disk_failures.pdf ). The failure rate went up with cold AND hot temperatures. How they got the disk temps that cold is beyond me, but their hot end seemed a little optimistic, since I have seen desktops in comfortably air conditioned rooms running disk temps of 50C or more, and I have a strong set of anecdotal evidence that these are the disks that fail most often.
Yes, but if you have the room at the tipping point what does this do to your ability to recover from a fault? I know one reason many datacenters have experienced outages even with redundant systems is that the AC equipment is almost never on UPS and so it takes some time for them to recover after switching to generators. If you are running 10F hotter doesn't that mean you have that much less time for the AC to recover before you start experiencing problems? For a large company with redundant datacenters or in Cisco's case where they are mostly development labs it probably is worth the risk, but for your average small to midsized corporate datacenter it's probably smarter to stay with the tried and true.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
The use of SSDs in data centers can dramatically impact power usage and temperature management costs:
"The power savings for the SSD-based systems is about 50 percent, and the overall cooling savings are 80 percent, according to the white paper. These savings are significant for a datacenter that spends 40 percent of its budget on power and cooling, and they're bound to make other datacenter operators sit up and take notice." http://arstechnica.com/business/news/2009/10/latest-migrations-show-ssd-is-ready-for-some-datacenters.ars
While MTBF and unit cost are still concerns, the potential savings will likely see more centers moving in this direction.
I'm less concerned with the fine-tuning of the environment for servers than I am with getting the basics right. How many bad server room implementations have you seen?
I'm sitting in one. We used to have a half-dozen built-for-the-purpose Liebert units scattered around the periphery of the room. The space was properly designed and the hardware maintained whatever temp and humidity we chose to set. They were expensive to run and maintain but they did their job and did it right.
About seven years ago, the bean-counting powers-that-be pronounced them "too expensive" and had them ripped out. The replacement central system pumps cold air under the raised floor from one central point. Theoretically, it could work. In practice, it was too humid in here the first day.
And the first week, month, and year. We complained. We did simple things to demonstrate to upper management and building management that it was too humid in here, things like storing a box of envelopes in the middle of the room for a week and showing management that they had sealed themselves due to excessive humidity.
We were, in every case, rebuffed.
A few weeks ago, a contractor working on phone lines under the floor complained about the mold. *HE* got listened to. Preliminary studies show both Penicillium (relatively harmless) and black (not so harmless) mold in high concentrations. Lift a floor tile near the air input and there's a nice thick coat of fluffy, fuzzy mold on everything. There's mold behind the sheetrock that sometimes bleeds through when the walls sweat. They brought in dehumidifiers that are pulling more than 30 gallons of water out of the air every day. The incoming air, depending on who's doing the measuring, is at 75% to 90% humidity. According to the first independent tester who came in, "Essentially, it's raining" under our floor at the intake.
And the areas where condensation is *supposed* to happen and drain away? Those areas are bone dry.
IOW, our whole system was designed and installed without our input and over our objections by idiots who had no idea what they were doing.
So, my fellow server room denizens, please keep this in mind - When people (especially management types) show up with studies that support the view that the way the environment is controlled in your server room can be altered to save money, be afraid. Be very afraid. It doesn't matter how good the basic research is or how artfully it could be employed to save money without causing problems, by the time the PHBs get ahold of it, it'll be perverted into an excuse to totally screw things up.
I was at a Google presentation on this last night. If I remember correctly, I believe they found the 'ideal' temperature for running server hardware without decreasing lifespan to be about 45 C.
In the beginning, there was null.
The Google survey appears to show that drives are happiest around 35-40 Celsius, with failure rates increasing on both sides of that band.
Of course there are a couple of issues with that data.
Firstly, the data comes from the drives' built-in sensors, so if a particular brand of drive has both an abnormal failure rate and an abnormal reported running temperature (either due to producing a different amount of heat or due to a bad sensor) it would skew the results.
The second problem is they simply don't have much data outside the range 25-45 Celsius.
Still I don't think it's a huge issue as long as you don't try to run your datacenter insanely hot.
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
If there is a failure of AC ... that is, either Air Conditioning OR Alternating Current, you can see a rapid rise in temperature. With all the systems powered off, the latent heat inside the equipment, which is much higher than the room temperature, emerges and raises the room temperature rapidly. And if the equipment is still powered (via UPS when the power fails), the rise is much faster.
In a large data center I once worked at, with 8 mainframes and 1800 servers, power to the entire building failed after several ups and downs in the first minute. The power company was able to tell us within 20 minutes that it looked like a "several hours" outage. We didn't have the UPS capacity for that long, so we started a massive shutdown. Fortunately it was all automated and the last servers finished their current jobs and powered off in another 20 minutes. In that 40 minutes, the server room, normally kept around 17C, was up to a whopping 33C. And even with everything powered off, it peaked at 38C after another 20 minutes. If it weren't so dark in there I think some people would have been starting a sauna.
We had about 40 hard drive failures and 12 power supply failures coming back up that evening. And one of the mainframes had some issues.
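The numbers in that story give a rough feel for how much ride-through time a warmer setpoint costs. The sketch below assumes the room heats linearly at the rate observed above (17C to 33C in 40 minutes, while load was being shed, so it is only a ballpark), and the 35C limit is an arbitrary assumption, as are the function names.

```python
def rise_rate_c_per_min(start_c, end_c, minutes):
    """Average heating rate of the room over an observed interval."""
    return (end_c - start_c) / minutes


def minutes_until(limit_c, start_c, rate_c_per_min):
    """Minutes before the room hits limit_c, assuming a constant rise rate."""
    if start_c >= limit_c:
        return 0.0
    return (limit_c - start_c) / rate_c_per_min


if __name__ == "__main__":
    rate = rise_rate_c_per_min(17.0, 33.0, 40.0)  # 0.4 C per minute
    limit_c = 35.0                                # assumed trouble threshold
    cold_start = 17.0
    warm_start = cold_start + 10.0 * 5.0 / 9.0    # same room run 10F warmer
    print(f"rate: {rate:.2f} C/min")
    print(f"from {cold_start:.1f} C: {minutes_until(limit_c, cold_start, rate):.1f} min")
    print(f"from {warm_start:.1f} C: {minutes_until(limit_c, warm_start, rate):.1f} min")
```

Under those assumptions, running 10F warmer cuts the window from about 45 minutes to about 31, which is the recovery-margin concern raised earlier in the thread.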
now we need to go OSS in diesel cars
I'm pretty sure if they did a check between vibration and failure rates, they'd get correlation.
I'll also put out there a hypothesis that a rack with no moving parts will have a lower failure rate than a rack that has moving parts.
Wouldn't it be fun to be a head engineer at one of the bigger companies and be able to test it out :)
UPS batteries are sealed lead-acid and they definitely benefit from being kept cooler. It's also good to keep them in a separate room, usually close to your main power switching. As far as servers are concerned, I've always been happy with an ambient room temp of about 22 or 23, provided air-flow is good so you don't get hot-spots, and it makes for a more pleasant working environment (although with remote management I generally don't need to actually work in them for long periods of time).
describes temperatures using the Fahrenheit scale.
now we need to go OSS in diesel cars
I am a little skeptical since most hard drive failures I've had have been right after an air conditioning outage. The Google paper uses temperature obtained from SMART, which is usually 10 to 15C higher than the ambient temperature in the room, and the tail of their sample falls off rapidly over 40C. What would the SMART temperature be if the ambient temperature was 40 or so? Probably 60 or above. Their graphs don't go that high.
But we're talking raising the temperature of a data center only 2 or 3 deg. Meat lockers are not helpful. Moral of the story? Maybe spend your cooling bucks on your storage, then let the rest of your systems eat their exhaust. I have some new Juniper routers, no moving parts inside except fans - the yellow alarm doesn't kick off until 70C and the machine doesn't shut down until 85C.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
"Wouldn't it be fun to be a head engineer at one of the bigger companies and be able to test it out :)"
Oh really?
Let's see your proposal, your test criteria, your plan.
Let's see your budget... cut it in half
Now for risk analysis, what if you're right and the servers all fail sooner than expected (i.e. sooner than budgeted)?
Spend 3 weeks filling out red tape
Spend 2 weeks waiting.
OK, you can run your study. Set up two racks in a closet and take measurements every day for a year.
Now write up the review.
Alright, thanks for your study, but our lawyers have advised us that it wasn't peer reviewed and published in a respected compsci journal and therefore we can't do anything with it, or the insurance wouldn't cover us and we'd be liable for deaths resulting from servers or something.
File in circular file or far back of filing cabinet never to be seen again until you're clearing out your office because they had to let you go because server replacement costs were too high to keep you on the payroll.
-1 disagree is not a modifier for a reason. -1 troll, flamebait, redundant, overrated are NOT acceptable substitutes.
This is slashdot, OF COURSE you should use Nagios!
And to increase your /. Kung-Fu, buy an EM01;
http://www.nagios.org/products/environmental
Learn Nagios the FAN way;
http://fannagioscd.sourceforge.net/drupal/
or play with GroundWork, they're awesome;
http://www.groundworkopensource.com/community/community-edition.html
(Yes, I actually run this in a real data center, we eat our own dog food.)
we're damn tired of seeing that lose/loose error, in particular
Just spell it "luse", and everybody wins.
http://www.geoffreylandis.com
Yes, he got promoted to a higher level of management. At least he's now one step further removed from the actual facilities he manages and can no longer screw things up quite as directly as he did in the past.