Google Releases Paper on Disk Reliability
oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
It's no wonder that Google sidestepped the issue, but, if you assume they purchase primarily from the manufacturers that are more reliable, perhaps those manufacturers will begin to gloat and publish numbers about their Google contracts, if this study gains traction.
FTA:However, in this paper, we do not show a
breakdown of drives per manufacturer, model, or vintage
due to the proprietary nature of these data.
But, of course.
What?
Their statistics on temperature seem very unusual. I'm surprised they didn't explore this more. For example, is the high failure rate associated with low temperatures because the drives were more likely to be inactive due to failure?
"No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
To my mind the most significant piece of info: "The gure shows that fail- ures do not increase when the average temperature in- creases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."
Excuse me, but please get off my Pennisetum Clandestinum, eh!
The report does say that "vintage" matters, ie. that "Past performance is not a reliable indicator of future development".
Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.
Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic I'm sure there's some useful information in there (eg. a company to avoid like the plague).
No sig today...
The article states that, in about half of the failures, there were no SMART warnings at all. Okay, but what was the breakdown in the kinds of failures of these unpredicted ones? If they were all spindle motor and head traversal failures, then you can't blame SMART for that. If it turns out that SMART gave warnings for 95% of all failures that were media-degradation related (like bad sectors, etc... where the drive still talks to your machine properly, and just can't get the data you want), then I'd say SMART is pretty darn useful.
But, alas, I didn't see any breakdown for failure type....
They are hardly trade secrets. Google isn't in the hardware business. There are only so many patterns of disk usage on can have, and knowing what pattern Google has would hardly be useful to figure out how they did anything that they do. At least, to any level of detail useful enough to copy.
The amount of positive press they get from these types of releases easily justifies the effort to polish internal reports up to a publication standard. By releasing these types of papers, others may change their buying habits, which in turn will change the products sold. Google may believe that these types of papers would cause shame, not from individual manufacturers, but the industry in a whole, and thus cause better products to be produced.