Google Releases Paper on Disk Reliability

Great by true_hacker · 2007-02-17 16:26 · Score: 5, Funny

Excellent, i have been looking forward to thi *%)%*# DISK FAILURE

Re:Great by Compholio · 2007-02-17 17:28 · Score: 2, Funny

Excellent, i have been looking forward to thi *%)%*# DISK FAILURE
That's what you get for logging into slashdot from Antarctica...
Re:Great by mabhatter654 · 2007-02-18 16:04 · Score: 2, Interesting

Personally, I think this is more geared as a "shot" at drive makers and big enterprise users, not so much the general desktop user crowd. After all, how many companies even HAVE 100,000 hard drives to test? Google is unique in their use of LOTS of hardware... in generally better controlled environments than most have. Google has issued other papers and industry "suggestions" about performance of motherboards, power supplies, and other OTS hardware... as big of a CUSTOMER as Google is, they can push the industry to perform better in ways normal people would just "deal with".
What the report really shows is that SMART doesn't accurately indicate the life of the drive... if anything Google drives their hardware harder than normal users, so it should be a good testbed for predictive tools.... Google would be directly interested and probably pay a lot of money to somebody that implemented the changes this engineer said... chasing around 20k+ hard drives is an EXPENSIVE task... I'd bet Google pays a MILLION dollars a year in salary just to have somebody available to run out and replace unscheduled drive failures. That's a big process improvement that they would like to see hard drive manufacturers answer.

Hmm by chanrobi · 2007-02-17 16:29 · Score: 2, Interesting

So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?

Re:Hmm by Anonymous Coward · 2007-02-17 16:35 · Score: 5, Funny

Didn't read the article? (Check)
Didn't read the summary? (Check)

Congratulations, you're not officially a slashdot regular!
Re:Hmm by Anonymous Coward · 2007-02-17 17:03 · Score: 4, Informative

There are several SMART signals which are highly correlated with drive errors, but the authors note that 56% of the failed drives had no occurrences of these highly correlated errors. Even considering all SMART signals, 36% of failed drives still had no SMART signals reported.

So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.
Re:Hmm by jemenake · 2007-02-18 02:44 · Score: 3, Interesting

So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?
Well, I was a little disappointed by the article. They looked at a lot of different SMART categories and they looked at the different ages of the drives, but they didn't delve into the different types of failures. I get about 1 "I think my drive crashed and I was hoping you could recover it" call per month and I see a variety of failure types. Probably the most common ones I see now are ones where something has gone wrong with the control circuits/mechanism and not the media itself. For example, something can go wrong with the motor that spins the platters, or you can seize the bearings for the head traversal, etc. I've even seen some where a chip on the controller board literally popped when it got too hot. These aren't going to be detected by SMART... I don't know what would predict failures like that.

The article states that, in about half of the failures, there were no SMART warnings at all. Okay, but what was the breakdown in the kinds of failures of these unpredicted ones? If they were all spindle motor and head traversal failures, then you can't blame SMART for that. If it turns out that SMART gave warnings for 95% of all failures that were media-degradation related (like bad sectors, etc... where the drive still talks to your machine properly, and just can't get the data you want), then I'd say SMART is pretty darn useful.

But, alas, I didn't see any breakdown for failure type....
Re:Hmm by pugugly · 2007-02-18 06:30 · Score: 3, Funny

Didn't check the typing? (Check)

Congratulations, you're now officially a slashdot regular! - Pug

--
An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media
Re:Hmm by norton_I · 2007-02-18 06:52 · Score: 2, Interesting

So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.

It isn't even that good. Many of the failure flags indicate between 70% and 90% survivability to 8 months. This is much worse than the ~2%/year baseline failure rate, but not as strong of a predictor as you might like. It would be nice to see data on this out to 2 or 3 years, so you could calculate the integrated chance of failure over the service lifetime, but by eye it looks like the trends were leveling off by 8 months.

So, if you want to avoid replacing too many good drives, you probably have to move to a multiple error model, which probably reduces your detection likelihood well below the already low 44% reported.
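
To put those flag numbers next to the baseline, here is a rough Python sketch that converts an 8-month survival fraction into an annualized failure rate. It assumes a constant hazard over the whole year, which is a simplification on my part (the curves appear to level off by 8 months, so this likely overstates the risk), and the function name is made up for illustration.

```python
import math

def annualized_failure_rate(survival, months):
    """Failure probability over 12 months implied by a survival
    fraction observed over `months`, assuming a constant hazard."""
    hazard_per_month = -math.log(survival) / months
    return 1.0 - math.exp(-12.0 * hazard_per_month)

BASELINE_AFR = 0.02  # the ~2%/year baseline quoted above

for survival in (0.90, 0.70):
    afr = annualized_failure_rate(survival, 8)
    print(f"{survival:.0%} survival at 8 months -> "
          f"{afr:.1%}/year, about {afr / BASELINE_AFR:.0f}x baseline")
```

Even under this pessimistic extrapolation, most flagged drives would still be alive a year later, which is the point about replacing too many good drives.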
Re:Hmm by Trogre · 2007-02-18 09:51 · Score: 2, Funny

Congratulations, you're not officially a slashdot regular!

Didn't hit the 'Preview' button first? (Check)

Congratulations, you are too!

--
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife

Did they ever name the brands? by SuperKendall · 2007-02-17 16:32 · Score: 4, Insightful

They stated at one point in the document that some brands did have higher failure rates than others - yet I somehow missed any mention or ranking of brands. Did anyone else find that data?

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley

Re:Did they ever name the brands? by iminplaya · 2007-02-17 16:45 · Score: 5, Interesting

FTA: However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.

But, of course.

--
What?
Re:Did they ever name the brands? by Xross_Ied · 2007-02-17 16:47 · Score: 2, Insightful

They didn't include any data at all about brands.

They should have done brand analysis (without naming the brand) and also rpm analysis.

From the article..
3.2 Manufacturers, Models, and Vintages
Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18]. Our results do not contradict this fact. For example, Figure 2 changes significantly when we normalize failure rates per each drive model. Most age-related results are impacted by drive vintages. However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.

--
This sig space tolet, reasonable rate.
Re:Did they ever name the brands? by drmerope · 2007-02-17 16:47 · Score: 2, Informative

No. They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information. The question that really needs to be studied is what distinguishes good drives from bad. This would probably involve disassembling drives of various 'vintages, models, manufacturers' and trying to pin down the relevant details. That way when new hard-drives get released, reviewers can pull them apart and judge them on something other than read/write performance, heat, and acoustics...
Re:Did they ever name the brands? by repvik · 2007-02-17 16:51 · Score: 2, Insightful

"However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data." (From TFA)
Re:Did they ever name the brands? by Prof.Phreak · 2007-02-17 16:53 · Score: 3, Insightful

At the very least, they could've named brands X, Y, Z, etc., and provided the numbers for those. Would be interesting if the differences are more than marginal.

--
"If anything can go wrong, it will." - Murphy
Re:Did they ever name the brands? by ryturner · 2007-02-17 17:03 · Score: 3, Insightful

It would be useful to you and me. But it is not useful to google to release that information.
Re:Did they ever name the brands? by Anonymous Coward · 2007-02-17 17:32 · Score: 5, Funny

They would have released that data, but it was saved on a Maxtor.
Re:Did they ever name the brands? by LunarCrisis · 2007-02-17 18:17 · Score: 2, Interesting

If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product.

Or maybe the manufacturer just realized that 5 years down the road, a replacement for your then 5 year old HD will cost them peanuts. According to the graph at http://en.wikipedia.org/wiki/Hard_drives#Capacity, HD capacity seems to be increasing by roughly ten times every five years.

It's like the CD-R manufacturers stamping all the packaging with 100-year guarantees. They don't really have any good way of telling that they will actually last that long, but the replacement costs nearly nothing, and thus is paid for by the marketing benefits.
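
A back-of-the-envelope Python sketch of that argument: if capacity grows roughly tenfold every five years, and if the price of a fixed amount of storage falls at the same rate (that second part is my assumption, not something the graph shows), then the cost of replacing an old drive with one of equal capacity shrinks quickly. The function name is hypothetical.

```python
def relative_replacement_cost(years, growth_factor=10.0, period_years=5.0):
    """Cost of an equal-capacity replacement after `years`, as a fraction
    of the original price, assuming price per GB falls as fast as
    capacity grows (growth_factor every period_years)."""
    return growth_factor ** (-years / period_years)

for years in (1, 3, 5):
    fraction = relative_replacement_cost(years)
    print(f"after {years} year(s): about {fraction:.0%} of the original price")
```

Under those assumptions a warranty claim in year five costs the manufacturer about a tenth of what the drive originally sold for.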

--
Mr. Period: Nine is the one that's right by ten!
Nine: One day I will kill him. Then, I will be Ten.
Re:Did they ever name the brands? by Schraegstrichpunkt · 2007-02-17 18:41 · Score: 2, Interesting

They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information.
What? So the part about which variables are correlated with drive failures (which is what the report was about) wasn't interesting to you? Too bad.

--
http://outcampaign.org/
Re:Did they ever name the brands? by HUADPE · 2007-02-17 19:15 · Score: 3, Insightful

There are several good reasons to not release the brand names. First, while the sample size is huge, the sample size for a particular model of a particular brand might not be. If they only happened to have 10 of one particular model, and one failed within a month, then 10% fail within a month, but it could just be a fluke. Second, liability. This wasn't a controlled test, it was done live within the Google servers (presumably). Whoever is on the bottom of the list could very well sue Google for libel. Without merit? Probably, but they might eke a few million in a settlement out of them. Google can't appear to be doing evil after all.
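
To make the small-sample point concrete, here is a minimal Python sketch using the Wilson score interval for a binomial proportion. The interval method is my choice for illustration (nothing in the paper prescribes it), and the 1000-of-10000 comparison is a made-up example.

```python
import math

def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for a failure proportion."""
    p = failures / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

# 1 failure out of 10 drives versus the same 10% rate over 10,000 drives
for failures, n in ((1, 10), (1000, 10000)):
    low, high = wilson_interval(failures, n)
    print(f"{failures}/{n}: plausible failure rate {low:.1%} to {high:.1%}")
```

With ten drives the plausible range runs from roughly 2% to 40%, so a "10% fail within a month" headline for that model would say almost nothing.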

--
This sig has not been evaluated by the FDA. It is not designed to diagnose, treat, prevent, or cure any disease.
Re:Did they ever name the brands? by lukas84 · 2007-02-17 20:04 · Score: 2, Interesting

How can stating facts be libel?
Re:Did they ever name the brands? by iminplaya · 2007-02-17 20:28 · Score: 5, Funny

Well, obviously you're not a lawyer :-) Otherwise you would know the answer.

--
What?
Re:Did they ever name the brands? by Simon+Brooke · 2007-02-18 01:28 · Score: 2, Insightful

You forgot one metric of comparison: the warranty. As far as I'm concerned, this number alone is the most important in determining the reliability of the hard drive. If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product. When buying hard drives, I actively seek out drives with at least a 3 (preferably 5) year warranty (some Hitachis and Seagates IIRC) and explicitly avoid those with only a 1 year warranty period (I'm looking at you WD).

You know, I don't give a monkey's. What you lose when a disk goes down (if you haven't done your backups properly) is typically far more valuable than the disk mechanism itself. Any manufacturer can put a five-year warranty on a disk mechanism as a gimmick. Most users won't remember the warranty when the disk goes down, and, even if they have to replace 10% of the units 'free', it doesn't take much on the retail price to cover that.

20 years ago we had a spate of failures on Western Digital drives on machines which were out with customers. That really hurt - giving our customers free drives would not have cheered them up. 10 years ago we had a spate of failures of Samsung drives in a server farm. That was more under control, but it was still a bloody nuisance. I don't want a drive which fails, but when it fails I get a new one free. I want a drive that doesn't fail. The warranty has absolutely nothing to do with it.

--
I'm old enough to remember when discussions on Slashdot were well informed.
Re:Did they ever name the brands? by Fred_A · 2007-02-18 01:31 · Score: 2, Insightful

No. They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information.
The pertinence of the SMART data (pretty much always pertinent) and how often it popped up (about half the time) before a failure was a very interesting bit of information.
The question that really needs to be studied is what distinguishes good drives from bad.
A good drive is one that lasts a long time without developing too many bad blocks. A bad drive is one that fails within a couple years. In both cases you only know it after the fact or because a whole series happens to be poorly designed (like it happens to every manufacturer every now and then). Unless that model is already widely deployed and known to be bad, or already widely deployed and likely no longer sold, there's no way to tell.

And thus on the third day the FSM created backups and saw it was good.

--

May contain traces of nut.
Made from the freshest electrons.

Google had this paper ready a year ago by Anonymous Coward · 2007-02-17 16:36 · Score: 3, Funny

But the disk it was on failed.

Conclusion by llZENll · 2007-02-17 16:39 · Score: 3, Informative

This is awesome, but the conclusion of such an interesting study leaves a lot to be desired. FTA...

"In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.

One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."

That would be corporate dynamite by Traf-O-Data-Hater · 2007-02-17 16:41 · Score: 5, Insightful

I noticed this too. If a Google-sanctioned report had charts of which brands were more reliable, this would do serious damage to the brands that didn't perform so well. No wonder they sidestepped the whole issue!

Re:That would be corporate dynamite by MrZaius · 2007-02-17 16:44 · Score: 3, Interesting

It's no wonder that Google sidestepped the issue, but, if you assume they purchase primarily from the manufacturers that are more reliable, perhaps those manufacturers will begin to gloat and publish numbers about their Google contracts, if this study gains traction.
Re:That would be corporate dynamite by EonBlueTooL · 2007-02-17 17:04 · Score: 4, Insightful

Google:Organizing all the world's information and making it universally accessible and useful(unless it could be troublesome)
Re:That would be corporate dynamite by Antique+Geekmeister · 2007-02-17 17:26 · Score: 4, Insightful

I'm confident that Google is fairly drive agnostic: you just can't run distributed networks that large and stay locked into a single vendor. And given that even reliable vendors have disasters like the IBM Deskstar drives some years ago, and given the remarkable growth of drive sizes over time, there's just not much point for them in buying the extremely stable but vastly more expensive hardware. They've doubtless learned that hardware flexibility provides valuable software flexibility.
Re:That would be corporate dynamite by devilspgd · 2007-02-17 17:40 · Score: 2, Insightful

Organizing and making accessible information which is already available is one thing, producing information is completely different.

--
Give a man a fish, he'll eat for a day, but teach a man to phish...
Re:That would be corporate dynamite by Jah-Wren+Ryel · 2007-02-17 19:20 · Score: 5, Insightful

Google:Organizing all the world's information and making it universally accessible and useful(unless it could be troublesome)

Old Google Motto: Don't do anything evil.
New Google Motto: Don't get into trouble.

--
When information is power, privacy is freedom.
Re:That would be corporate dynamite by jamesh · 2007-02-18 00:01 · Score: 2, Funny

Not that far removed from the motto of several other large companies:

"Don't get caught doing anything evil."
Re:That would be corporate dynamite by gbjbaanb · 2007-02-18 05:10 · Score: 3, Informative

When a friend broke down, she asked the breakdown man who came what were the most reliable cars. He said he wasn't allowed to comment but that "he carried no honda parts". I guess the same thing applies here - Google won't say, they'd get sued.

On the other hand, hard drives change so much that this year's model will be a totally different design and mechanics than next year's, so blaming (say) IBM for its crappy Deskstar range should not be reason to blame their (ok, Hitachi's) current line.

If you do want to know more about which drives are best - check out storagereview and enter details of your drives to their reliability database.

Similar paper by reset_button · 2007-02-17 16:42 · Score: 4, Informative

I was at the talk, and it was very interesting. CMU also had a paper (PDF) about disk failures in the same conference (in fact, they presented one after the other).

and in the meanwhile... by pedantic+bore · 2007-02-17 16:50 · Score: 3, Informative

... at the same conference, Bianca Schroeder presented a paper on disk reliability that developed sophisticated statistical models for disk failures, building on earlier work by Qin Xin and a dozen papers by John Elerath...

C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...

--
Am I part of the core demographic for Swedish Fish?

Re:and in the meanwhile... by oGMo · 2007-02-17 17:26 · Score: 3, Insightful

While at a glance, it may seem like this is simply "the latest thing google did," and... let's be honest, given the editor in question... this was most likely the reason it made the front page. But while Bianca Schroeder's report, for instance, uses statistics from various unnamed sources and for various unnamed uses, the Google report is interesting because we know exactly where it's coming from and what it's being used for.

Of course, a truly insightful story would have taken this opportunity to compare Google's findings with the others and report on that.

--
Don't think of it as a flame---it's more like an argument that does 3d6 fire damage

Translation by jd · 2007-02-17 17:05 · Score: 3, Funny

"We don't want to be sued to within an inch of our lives by certain very wealthy brands, due to US law allowing manufacturers to prohibit unfavourable reviews."

Ideally, they would have formatted the text to spell out the names of the brands if you take the first letter of every Nth word, or some specific column of text. (Or maybe they have...)

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:Translation by David+Price · 2007-02-17 17:13 · Score: 5, Insightful

More likely: "We buy millions of dollars worth of drives each year, and our buying decisions are driven in part by the reliability data that we collect. If we told everyone what kind of drives work best, more people would buy those drives, driving up the price that we pay."
Re:Translation by the_womble · 2007-02-17 17:21 · Score: 4, Insightful

Another translation: Our competitors buy millions of dollars worth of drives as well. We are not going to help them avoid the duff ones.
Re:Translation by bendodge · 2007-02-17 18:12 · Score: 3, Funny

How did that get modded insightful? When there is more demand the price goes down, not up!

--
The government can't save you.
Re:Translation by spisska · 2007-02-17 19:31 · Score: 5, Insightful

Another translation:

We're not so bloody stupid to believe that our competitors are standing in the aisle of Circuit City and scratching their head over whether to buy a Seagate or WD drive.

We know that our competitors all have their own metrics and their own relationships with manufacturers and frankly, we don't care. We know our competitors also measure these things, and we're not telling them anything they don't already know.

We aren't particularly worried about saying that some drives fail, because everyone who cares already knows that some drives fail. Everyone whose job it is to know which drives fail first already knows that as well.

But we're not going to tell you which brand fails at a higher rate than normal because we don't need a lawsuit that would cost us a lot of money but in the end would only confirm what the people who need to know these things already know.

We will, on the other hand, describe the tests we ran, our methodology, our results, and our analyses. We do this just for kicks and we hope you can learn something from the results.

And we hope you have a nice day.
Re:Translation by osu-neko · 2007-02-17 20:33 · Score: 2

Demand and price in a free market are reversely proportional.

One way to spot someone who doesn't really understand economics is how quickly they make statements like that. You would need to know a lot more about the thing in question before being able to make a generalization like that. Sometimes, they're directly proportional, sometimes, they're reversely proportional, and sometimes they're neither. It depends on a lot of other things which relationship holds true, if any.

--
"Convictions are more dangerous enemies of truth than lies."
Re:Translation by Eivind · 2007-02-17 21:06 · Score: 2, Insightful

It's not that surprising. The only mildly interesting thing I see is that high load seems to *not* increase failure-rates much, other than in the first few months. They hypothesize that this may be because some drives don't handle high load -- and die early -- however those drives that survive the first ~6 months with high load are the more robust ones, and those hold up well.
Makes sense. Killing the weaker infants makes the adult population healthier.
Re:Translation by bazorg · 2007-02-17 22:37 · Score: 2, Funny

We made this interesting and useful HDD test which we made public however it lacks some details as it's still in Beta.
Re:Translation by Eivind · 2007-02-18 06:33 · Score: 2, Interesting

So, you're saying there is no better choice, and can be no better choice, than simply selecting a disc randomly ? It's possible it is like you say -- that each year the stats change enough that no consistent trends are recognizable. It is however also possible (I'd say likely even) that different manufacturers are different statistically over time.
You need backups anyway, that's not the point. But it makes a difference for your maintenance-costs if you experience 1% of your disc-drives dying in an average year or 5%.

Temperature conclusion by phasm42 · 2007-02-17 17:08 · Score: 4, Interesting

Their statistics on temperature seem very unusual. I'm surprised they didn't explore this more. For example, is the high failure rate associated with low temperatures because the drives were more likely to be inactive due to failure?

--
"No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner

Re:Temperature conclusion by Chalex · 2007-02-17 17:51 · Score: 2, Insightful

The chart implies that the "optimal" operating drive temperature is 35-45 Celsius. Running drives below room temperature (below 22 Celsius) is probably not a scenario that drive manufacturers optimise for.
Re:Temperature conclusion by gnu-sucks · 2007-02-17 17:54 · Score: 3, Interesting

My guess is this graph on temperature distribution is more or less a graph of temperature sensor accuracy. I can't imagine that drives at 50C had the lowest failure rate.

While this would require a more laboratory-like environment, a dozen drives of each type and manufacture could have been sampled at known temperatures, and a data curve could have been established to calibrate the temperature sensors.

There are lots of studies out there where drives were intentionally heated, and higher degrees of failure were indeed reported (this is mentioned in the google report too). So the correlation is probably still valid, just not well-proven.
Re:Temperature conclusion by bouis · 2007-02-17 20:45 · Score: 2, Insightful

If hard drives are anything like car engines [especially those made with iron and aluminum], the designers have taken the standard operating temperature into account in the design. The parts of varying composition fit together best at the right temperature, and temperatures higher or lower result in damage or accelerated wear.

This is why, if you want your engine to last, you should let your car warm up before driving it hard.

Lower temp == higher failure rates by flyingfsck · 2007-02-17 17:15 · Score: 4, Interesting

To my mind the most significant piece of info: "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."

--
Excuse me, but please get off my Pennisetum Clandestinum, eh!

Re:Lower temp == higher failure rates by Anonymous Coward · 2007-02-17 19:28 · Score: 2, Insightful

perhaps there is some correlation between lower temperature and higher forces, ie. a drive that starts and stops frequently may have a lower temperature, but would undergo more acceleration and stress
Re:Lower temp == higher failure rates by Mostly+a+lurker · 2007-02-17 21:01 · Score: 2, Insightful

Yes, the low temperature finding is most interesting. I have an hypothesis as to what might be going on. I suspect that absolute temperatures, within certain limits, are not important to drive reliability, but that temperature variation is. Drives that, because of their location and pattern of use, tend to fluctuate in temperature between, say, 20 and 35 degrees centigrade are being stressed more than those at a steady 40 degrees.

Proprietary makes sense here by Mammothrept · 2007-02-17 17:25 · Score: 4, Insightful

"...we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."

Litigation avoidance may be a consideration here but why not take Google at their word? Google is a search company that buys lots of hard drives. Based on their own internal research, they have developed information about which hard disk models and/or manufacturers are shite.

Yahoo is also a search company that buys lots of hard drives. Why should Google give that hard drive reliability information to you, me and Yahoo for free? Let Yahoo/Excite/MSN and the competitors figure it out for themselves.

Yeah, sure I'd like to have access to Google's data the next time I'm in the market for a hard drive but I won't hold a grudge against them if they don't do my consumer research for me. On the other hand, whereinafuck is the data from Tom's Hardware Guide, Anandtech, Consumer Reports and all the other reviewer and consumer sites? If someone doesn't have a handy link to their results, I'll see if I can google something up:

http://www.google.com/search?hl=en&safe=off&client=firefox-a&rls=com.ubuntu%3Aen-US%3Aofficial&hs=tqy&q=hard+drive+reliability+research+brands++manufacturers+models&btnG=Search

This speaks volumes. by greenguy · 2007-02-17 17:25 · Score: 4, Funny

Google releases a paper on disk reliability.

--
What if I do the same thing, and I do get different results?

Re:So by mightyQuin · 2007-02-17 17:37 · Score: 2, Informative

From my experience, Western Digitals are (relatively) reliable. They unfortunately do not have the same power connector orientation as any other consumer drive on the planet, so if you want to use IDE RAID you have to get the type that either (1) fits any consumer IDE drive or (2) fits a Western Digital drive. (grr)

Had some good experiences with Maxtor. A couple of years ago (OK - maybe 6 or 8) we had batches of super reliable Maxtors - 10GB.

Some Samsungs are good, some are evil - the SP0411N was a particularly reliable model - the SP0802N sucked - out of a batch of 20, 15 of them died within a year: all reallocated sector errors beyond the threshold.

Seagates are a mixed bag too - been having a nice experience with the SATA models 160GB and 120GB - can't remember their model #'s off the top of my head. - The older Seagates, though, I spent a fair amount of time replacing.

IBM DeskStars, as far as I know, have been quite good - for some reason we didn't use too many.

--
Now, if you'll excuse me, I've got some idea balls to remove from a manatee tank.

Re:So by nevesis · 2007-02-17 18:31 · Score: 2, Informative

Interesting.. but I disagree with your analysis.

The DeskStars were nicknamed DeathStars due to their high failure rate.

Maxtor has a terrible reputation in the channel.

Seagate has a fantastic reputation in the channel.

And as far as the WD power connectors.. I have 4 Western Digitals, a Samsung, a Maxtor, and a Seagate on my desk right now.. and they all have the same layout (left to right: 40 pin, jumpers, molex).

Re:OS X SMART tool? by kimvette · 2007-02-17 18:48 · Score: 3, Informative

http://sourceforge.net/projects/smartmontools

Not exactly point & click but it'll do.

--
The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50

Re:I'm obviously behind the times, but... by DragonTHC · 2007-02-17 18:55 · Score: 2, Informative

that sounds like a great idea, however, flash memory has a habit of failing with no warning whatsoever as well.

--
They're using their grammar skills there.

Re:Proprietary reporting by spisska · 2007-02-17 19:02 · Score: 5, Insightful

ps.. all their farm is ata/ide?

You really didn't read the article, did you? On page 3 (Section 2.2 Deployment Details), the authors state: "More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units were put into production in or after 2001. [...] The data used for this study were collected between December 2005 and August 2006."

What are you waiting for Google to tell you? Are you really accusing them of being evil because they did a study, described their methodology, detailed their results, presented their analyses, and published it all for anyone who is interested?

You describe their conclusions as:

Uselsess

But there is no contradiction at all if you are smart enough to understand. They are telling you that if SMART identifies a problem with a drive then it is very likely that drive will fail within 60 days. But in a sample of 100,000 drives, many drives will also fail that have not returned errors on SMART scans. Thus SMART is a reliable indicator of impending failure but is not a silver bullet that can recognize and predict all failures before they happen.

Next time you have access to 100,000 hard drives, can analyze patterns of failure among them, can use those failures as a benchmark against which to measure analysis tools, and can come up with better recommendations for predicting failure than this study, then by all means let us know. But if you're looking for Microsoft or Western Digital or Seagate or Yahoo to perform and publish this kind of study for free, I think you may be waiting a good long while.

Re:Proprietary reporting by Toba82 · 2007-02-17 19:03 · Score: 2, Informative

It is well known that google uses commodity hardware. SCSI is not commodity, although I'm sure at least some of their servers are high end.

--
I pretend to know more than I really do by mooching off google and wikipedia.

The GDRIVE by Shohat · 2007-02-17 19:43 · Score: 2, Interesting

About a year and a half ago, a presentation by Google concerning a massive online storage service called GDrive was leaked. It was pretty much confirmed that it is operational on some level. The study might have something to do with it, maybe even as some kind of clever PR. Just my 2c.

--
My Starcraft 2 Blog

How many drives really by hankwang · 2007-02-17 21:36 · Score: 5, Insightful

The paper claims "more than 100 thousand drives". But the nice thing is that you can derive the actual number from the error bars, for example those in figure 4. The data should be governed by Poisson statistics, which means that the standard deviation in the counts is equal to the square root of the count. However, their error bars seem to be about a factor 2 larger than the standard deviation, because normally around 68% of the data points should lie within one standard deviation from the "smooth curve". Let's assume the error bars are 95% confidence intervals, i.e. 2 standard deviations.

Look at the data for 20 to 21 C. It tells you that it represents a fraction 0.0135 of their total drive population, with an average failure rate of 7 +- 0.5 %. Following the reasoning above, this 7% should represent 784+-28 drives. Since these represent 7% of 1.35% of the total number of drives, we can derive that the total number of drives is 784/0.07/0.0135 = 830,000 drives. Trying the same thing for 30 to 31 C gives 826,000 drives, which seems fairly consistent.

So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?
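
For anyone who wants to redo the arithmetic for other temperature bins, here is a minimal Python sketch of the estimate above. It assumes Poisson counting noise and that the plotted error bars are 2 standard deviations, exactly as guessed above; the function name and parameters are made up for illustration, and the inputs are values read off figure 4 by eye.

```python
def estimate_population(rate_pct, bar_half_width_pct, population_fraction, sigmas_per_bar=2.0):
    """Estimate the total drive count from one temperature bin.

    Assumes Poisson statistics (std dev of a count = sqrt(count)) and that
    the error bar spans `sigmas_per_bar` standard deviations on each side.
    """
    sigma_pct = bar_half_width_pct / sigmas_per_bar   # one standard deviation, in percent
    relative_sigma = sigma_pct / rate_pct             # equals 1/sqrt(N) for a Poisson count N
    failures = 1.0 / relative_sigma ** 2              # N: failed drives in this bin
    drives_in_bin = failures / (rate_pct / 100.0)     # all drives in this bin
    total_drives = drives_in_bin / population_fraction
    return failures, total_drives


# 20 to 21 C bin: 7 +- 0.5 % failure rate, 0.0135 of the population
failures, total = estimate_population(7.0, 0.5, 0.0135)
print(f"failures in bin: {failures:.0f}, estimated total drives: {total:,.0f}")
```

If the error bars are really 1 standard deviation rather than 2, pass sigmas_per_bar=1.0 and the estimate drops by a factor of 4, so the result is only as good as that guess.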

--
Avantslash: low-bandwidth mobile slashdot.

Re:How many drives really by CRC'99 · 2007-02-18 02:02 · Score: 2, Insightful

So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?

Do you really think that they don't store every cookie and search pattern of everyone who uses their search engine? Cross-reference all this data, alter their ranks, follow your interests, use those to make money and target you with ads?

There is a ton of money in this information, given enough stored data and the facility to mine it, filter it, and sort it down to location level for various advertising categories.

Google has been very smart in the way they do business - they make money off studying your habits and selling the result (in the form of stats and/or ads).

--
Sendmail is like emacs: A nice operating system, but missing an editor and a MTA.

Re:Thanks, missed that... by mabinogi · 2007-02-17 21:38 · Score: 2, Insightful

Being able to choose freely to not say something is freedom of speech.
The right to stay silent on something is just as important a freedom as the right to have your say.

Censorship has nothing whatsoever to do with it.

--
Advanced users are users too!

Bell Labs by gustgr · 2007-02-17 21:46 · Score: 2, Insightful

Google Labs, still in its youth, certainly reminds me of the golden years of Bell Labs.

They do say that "vintage" matters by Joce640k · 2007-02-17 22:04 · Score: 4, Interesting

The report does say that "vintage" matters, ie. that "Past performance is not a reliable indicator of future development".

Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.

Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic I'm sure there's some useful information in there (eg. a company to avoid like the plague).

--
No sig today...

Re:OS X SMART tool? by am+2k · 2007-02-17 22:07 · Score: 2, Informative

So what tool on Mac OS X will provide all the SMART data?

I had a disk reporting a SMART failure once. The result was that the disk was red in the list in Disk Utility, but there were no other warnings. So you might want to check Disk Utility once in a while.

You can get IDE/SATA drives FAILURE RATES Here by Augur · 2007-02-18 01:00 · Score: 5, Informative

One of the largest retailers in Russia (and maybe in Europe: more than 300 terminals for in-person orders in a former factory building, busy 24/7), "Pro Sunrise", released information on failure rates of the major PC components (CPUs, video cards, motherboards, IDE/SATA drives, etc.) it sold in Q1-Q2 of 2005.

http://pro.sunrise.ru/articletext.asp?reg=30&id=283 - the article (in Russian, but the diagrams are self-explanatory).

http://pro.sunrise.ru/docs/30/image001.gif - IDE/SATA (3.5" formfactor)

http://pro.sunrise.ru/docs/30/image002.gif - HDD (2.5" notebook formfactor)

In short, the most returns are for the Maxtor brand, the fewest for IBM/Hitachi.

Toshiba is worst in 2.5", and Seagate is best.

The chance of a drive failing ranges from 1/20 (Maxtor) to 1/70 (Hitachi).

So SMART is specific, but not sensitive. by spineboy · 2007-02-18 04:00 · Score: 3, Insightful

To me it's useful - if I get a SMART warning, then I'm definitely backing up my drive and will replace it before it croaks.

Sensitivity/specificity always presents a balancing act of testing, and they are usually in a push/pull relationship. If you make a test too sensitive, then you get too many false positives, and wind up over treating something (i.e. the test says it might fail so you replace the drive even though it's not going to - a false alert)

If you make the test too specific, then you usually wind up decreasing its sensitivity, or ability to detect something. Now you get false negatives: when the test is positive you can be sure that it's accurate, but it doesn't always detect the problem.

What you want to know is the Positive Predictive Value (PPV), which is determined by the formula PPV = TP/(TP+FP), where TP = true positives and FP = false positives.
Also useful is the Negative Predictive Value (NPV), given by the formula NPV = TN/(TN+FN), where TN = true negatives and FN = false negatives.

What these tell you is this. With a high PPV, a positive test (e.g. the drive temperature is >80 C) accurately predicts that the drive will fail. With a high NPV, a negative test (e.g. drive temp 40 C) accurately predicts that the drive is ok.
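
To make the formulas concrete, here is a minimal Python sketch. The counts are invented purely for illustration (they are not from the Google paper), and the function name is hypothetical.

```python
def predictive_values(tp, fp, tn, fn):
    """Return PPV, NPV, sensitivity and specificity from confusion-matrix counts."""
    return {
        "ppv": tp / (tp + fp),           # of the drives flagged, how many really failed
        "npv": tn / (tn + fn),           # of the drives not flagged, how many stayed healthy
        "sensitivity": tp / (tp + fn),   # of the failed drives, how many were flagged
        "specificity": tn / (tn + fp),   # of the healthy drives, how many were not flagged
    }


# Hypothetical fleet of 1000 drives: 80 fail, the test flags 100.
stats = predictive_values(tp=30, fp=70, tn=850, fn=50)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these made-up numbers the NPV looks high mostly because failures are rare, which is why PPV and NPV should always be read alongside the base failure rate.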

--
..........FULL STOP.

Re:So SMART is specific, but not sensitive. by vakuona · 2007-02-18 07:57 · Score: 2, Informative

What Google is saying is that you cannot rely on SMART to warn you of all or even most hard drive failures. So whilst you do reduce the possibility of losing data, they are saying you are still very likely to lose data anyway.

Re:So SMART is specific, but not sensitive. by RedWizzard · 2007-02-18 08:57 · Score: 2, Informative

To me it's useful - if I get a SMART warning, then I'm definitely backing up my drive and will replace it before it croaks. ... What information these give are as such. If a test is positive (i.e. the drive temperature is >80 C), then it accurately will predict that the drive will fail. If the test is negative (drive temp 40 C) then it accurately predicts that the drive is ok.

But according to the paper none of the SMART parameters was very useful in this regard. Over 50% of drive failures were not predicted by SMART errors, so the "negative test" can't give much confidence that the drive is ok. Conversely, while some types of SMART error (e.g. scan errors) indicated a much higher probability of impending failure, they still weren't all that indicative. 70% of drives that reported a scan error were still functioning normally after 8 months. So the "positive test" isn't all that convincing either. This is why the paper came to the conclusion that SMART was not useful in building a predictive model for drive failure.

Re:So SMART is specific, but not sensitive. by chriso11 · 2007-02-18 11:35 · Score: 2, Informative

No, actually around 36% of drive failures did not have any SMART indications. Around 49% were predicted based on 4 or so of the key parameters.
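
For what it's worth, here is a rough Python sketch that turns the percentages quoted in this sub-thread into the sensitivity/PPV terms used further up. The inputs are the figures as posted here, not re-checked against the paper, and the variable names are only illustrative.

```python
# Figures as quoted by posters in this thread (assumptions, not verified).
survived_after_scan_error = 0.70       # drives with a scan error still working after 8 months
failures_without_any_smart = 0.36      # failed drives with no SMART indication at all
failures_caught_by_key_params = 0.49   # failed drives flagged by the ~4 key parameters

# A scan error as a "positive test": the share of flagged drives that went on to fail.
ppv_scan_error = 1.0 - survived_after_scan_error

# Upper bound on sensitivity if ANY SMART signal counts as a positive.
max_sensitivity_any_signal = 1.0 - failures_without_any_smart

# Failures missed when only the key parameters are watched.
missed_by_key_params = 1.0 - failures_caught_by_key_params

print(f"PPV of a scan error (8 months): {ppv_scan_error:.0%}")
print(f"Best-case sensitivity, any SMART signal: {max_sensitivity_any_signal:.0%}")
print(f"Failures missed by key parameters: {missed_by_key_params:.0%}")
```

So on these numbers a warning is worth acting on, but a clean SMART report still leaves a large share of failures unannounced.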

--
No, I don't trust in god. He'll have to pay up front, like everybody else.

Re:Samsung! by mollymoo · 2007-02-18 05:15 · Score: 3, Insightful

In summary: Your statistical analysis on a sample size of one showed a 100% failure rate, so Samsung are crap. You found some other people also had failed Samsung drives, so Samsung are crap.

Search the net and you will find people ranting about Seagate drive failures, Western Digital drive failures, IBM drive failures, Maxtor drive failures and failures of drives made by companies neither of us has even heard of. You won't find many, if any, reports of recent failures with 8" floppy drives though, so I suggest you use one of those. They must be more reliable, right?

--
Chernobyl 'not a wildlife haven' - BBC News

Actually this is a profoundly important conclusion by justthinkit · 2007-02-18 06:08 · Score: 2, Informative

after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors

This is easily the most important thing a sysadmin needs to know about hard drives. Much as I love Spinrite, when drives start to fail they continue to fail.

This story reminds me of the run around I got from Dell [India] when my one-and-only-Dell I'm-not-stupid-enough-to buy-their-crap-again started to have seek errors.

--
I come here for the love

Translation: Run hot, have high failure rate. by Futurepower(R) · 2007-02-18 07:34 · Score: 2, Insightful

The research results are VERY poorly communicated, as research results often are.

This seems to be the most relevant sentence: "What stands out are the 3 and 4- year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced." (Page 5, Section 3.4, 4th paragraph)

Often poor communication in research papers is intended to hide the fact that the results are not very useful. The above sentence can be translated to: "If you run hard drives hot, after 3 or 4 years you will have a high failure rate."

All of our drives have their own vibration-isolated fans. Google, I recommend you do that too, based on your research results.

--
Is U.S. government violence a good in the world, or does violence just cause more violence?

Re:Proprietary reporting by T-Ranger · 2007-02-18 07:56 · Score: 3, Interesting

They are hardly trade secrets. Google isn't in the hardware business. There are only so many patterns of disk usage one can have, and knowing what pattern Google has would hardly be useful for figuring out how they do anything that they do. At least, not to any level of detail useful enough to copy.

The amount of positive press they get from these types of releases easily justifies the effort to polish internal reports up to a publication standard. By releasing these types of papers, others may change their buying habits, which in turn will change the products sold. Google may believe that these types of papers will shame not individual manufacturers but the industry as a whole, and thus cause better products to be produced.

Google being stupid: 2 approximately equal #'s... by Futurepower(R) · 2007-02-18 08:01 · Score: 2, Insightful

Here's a quote from the Google paper: "Power-on hours -- Although we do not dispute that power-on hours might have an effect on drive lifetime, it happens that in our deployment the age of the drive is an excellent approximation for that parameter, given that our drives remain powered on for most of their life time." (Page 10, 4th paragraph)

Translation: The number of hours the drives are powered is the same as the age of the drives, since the drives are always powered.

When two numbers are close to equal, they are approximations for each other. LOL. Is there a social breakdown at Google? Are the people who don't like to think taking power at Google?

What he/she/it is looking for by Alien54 · 2007-02-18 08:45 · Score: 2, Interesting

... is not only a breakdown by age, but by other parameters, such as size, model, series, etc. I am sure that the IBM DeathStars would have greatly biased the statistics, for example, and it would be useful to have breakouts not only for such well known disasters, but also for the sample excluding the Deathstars, etc.

It is also interesting to note the magnificent jump in failure rates once the drives get outside the three year warranty period. No coincidence there.

--
"It is a greater offense to steal men's labor, than their clothes"

Temperatures by Trogre · 2007-02-18 09:40 · Score: 2, Interesting

An interesting document, and I found the data on temperatures particularly interesting.

I have been previously led to believe that it's not so much the average temperature of a hard drive that causes failure, but temperature fluctuations. This makes sense, since repeated expansion and contraction of the disk platters is likely to cause warpage before too long. This, I guess, is where glass platters like what IBM toyed with would come in useful. In the meantime I guess we still need our HVAC units to keep a constant temperature, just not too low anymore.

This also has implications for data centers that spend a considerable amount of energy pumping heat out of the server room. If we can raise the industry-accepted temperature ceiling from 22C to say 30C then a lot of energy can be saved over time. Perhaps not quite enough to dip below 1% of US-wide power use but every bit helps.

--
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife

Re:Temperatures by whitis · 2007-02-18 14:11 · Score: 2, Informative

I think you are partly right in this assumption, but for the wrong reasons. Some failure modes are a function of temperature and other failure modes are a function of temperature variation. A long time ago platter expansion and contraction was a major cause of problems when drives used stepper motor positioning; since they switched to servo positioning, the drive automatically tracks the expansion and contraction of the platters and that is pretty much a non-issue as long as the coating on the platters is not affected.
This report reads like it was done by statisticians, not engineers. Handling of temperature, in particular, reveals this. As someone who has designed electronic circuits, been involved in reliability analysis, and repaired broken computers and other equipment at both the board level and chip level, I get the impression that the writers have not done any of those things.
Also, the conditions in Google's RAID arrays are likely very different from those encountered in many other areas such as office and desktop PCs. In the RAID arrays, drives are not powered down daily and you would also expect better cooling design.
The higher failure of lower average temperature drives is a definite eyebrow raiser. Not because it disproves the common wisdom (which still applies in the expected range) but because it is probably the clue that some important data was overlooked. If you actually extrapolate the right side of the graph, you see that failure does increase dramatically with temperature over the range of temperatures that would be experienced in normal cooling situations and particularly cooling failures.
Google has drives that are running at room temperature? This could point to some serious temperature fluctuation, measurement error, or to extremely aggressive local cooling (chilled water or freon A/C) or a server room that is chilled like a walk-in freezer. In which case, those drive failures are probably caused by moisture. At normal operating temperatures, a drive will drive off moisture. At the cooler temperatures, there may be condensation issues on the drive itself or on cooling components near the drive.
The reason that we don't see high-temperature failures is that the sample of temperatures is abnormally low. The most common temperature related failures would be when you have a cooling failure or poor cooling. Good cooling does improve the lifetime of the drive. That does not mean, however, that cooling to extremes is a good idea. In a typical PC, the drive is going to run at somewhere around 40 degrees C. The drive on this computer, right now, which is mounted in a typical mid tower case in a slightly chilly room (it is winter here) that would be a lot more chilly without three computers heating it, is running at 39 degrees. That temperature corresponds to the crest of the failure vs. temperature curve on Google's graphs. What temperature do you think drive manufacturers would optimize their designs for? A typical commercial grade chip is rated 0 to 70 degrees C so the thresholds would be expected to be optimized for 35 degrees C. Drive manufacturers would expect the normal operating temperature to be around 40 degrees C. The paper says they use consumer grade drives. The datasheet for a WD 250GB hard drive says the operating range (ambient, not drive temperature) is 5 degrees C (41F) to 55 degrees C (131F). I noticed in doing a Google search that some drives specified a minimum storage temperature of -13C.
Also, if the average temperature is low, that may be an indication that the drives in that particular population are drives that are spun down or even powered down much of the time, perhaps because the particular datasets they are serving are infrequently used or because the data is entirely cached in RAM.
Also, they talked about average temperature over the life

One of TWO best papers at FAST by Ristretto · 2007-02-18 11:07 · Score: 2, Informative

This Google paper just appeared at the 5th USENIX Conference on File and Storage Technologies (a.k.a. FAST), the premier conference on file systems and storage. It won one of the best paper awards.

You might be interested in the other best paper award winner (in the shameless self-promotion department): TFS: A Transparent File System for Contributory Storage, by Jim Cipar, Mark Corner, and Emery Berger (Dept. of Computer Science, University of Massachusetts Amherst). Briefly, it describes how you can make all the empty space on your disk available for others to use, without affecting your own use of the disk (no performance impact, and you can still use the space if you need it).

Enjoy!

--
Emery Berger
Dept. of Computer Science
University of Massachusetts Amherst

SpinRite Disk Error Problem Detection by northerner · 2007-02-18 11:16 · Score: 2, Interesting

Does anyone have any comments pro/con on SpinRite from Gibson Research (http://www.grc.com/sr/spinrite.htm)? It claims to detect and repair disk errors with a low level scan before they become a problem. I bought it and used it on a server drive that had errors during DOS file copies. It fixed the problem and no data was lost, but I don't have any other experience with it.

The program sounds pretty amazing from their web site.

Are many companies using it for preventative maintenance to avoid data loss on their servers?
