Slashdot Mirror


Google Releases Paper on Disk Reliability

oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"

8 of 267 comments

  1. Re:That would be corporate dynamite by MrZaius · · Score: 3, Interesting

    It's no wonder that Google sidestepped the issue, but, if you assume they purchase primarily from the manufacturers that are more reliable, perhaps those manufacturers will begin to gloat and publish numbers about their Google contracts, if this study gains traction.

  2. Re:Did they ever name the brands? by iminplaya · · Score: 5, Interesting

    FTA: "However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."


    But, of course.

    --
    What?
  3. Temperature conclusion by phasm42 · · Score: 4, Interesting

    Their statistics on temperature seem very unusual. I'm surprised they didn't explore this more. For example, is the high failure rate associated with low temperatures because the drives were more likely to be inactive due to failure?

    --
    "No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
    1. Re:Temperature conclusion by gnu-sucks · · Score: 3, Interesting

      My guess is this graph on temperature distribution is more or less a graph of temperature sensor accuracy. I can't imagine that drives at 50C had the lowest failure rate.

      While this would require a more laboratory-like environment, a dozen drives of each type and manufacturer could have been sampled at known temperatures, and a data curve could have been established to calibrate the temperature sensors.

      There are lots of studies out there where drives were intentionally heated, and higher failure rates were indeed reported (this is mentioned in the Google report too). So the correlation is probably still valid, just not well-proven.

  4. Lower temp == higher failure rates by flyingfsck · · Score: 4, Interesting

    To my mind the most significant piece of info: "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."

    --
    Excuse me, but please get off my Pennisetum Clandestinum, eh!
  5. They do say that "vintage" matters by Joce640k · · Score: 4, Interesting

    The report does say that "vintage" matters, i.e. that "Past performance is not a reliable indicator of future development".

    Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.

    Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic, I'm sure there's some useful information in there (e.g. a company to avoid like the plague).

    --
    No sig today...
  6. Re:Hmm by jemenake · · Score: 3, Interesting

    So if the article summary is correct, does it even matter if the consumer desktop PC has SMART enabled or not?
    Well, I was a little disappointed by the article. They looked at a lot of different SMART categories and they looked at the different ages of the drives, but they didn't delve into the different types of failures. I get about 1 "I think my drive crashed and I was hoping you could recover it" call per month and I see a variety of failure types. Probably the most common ones I see now are ones where something has gone wrong with the control circuits/mechanism and not the media itself. For example, something can go wrong with the motor that spins the platters, or you can seize the bearings for the head traversal, etc. I've even seen some where a chip on the controller board literally popped when it got too hot. These aren't going to be detected by SMART... I don't know what would predict failures like that.

    The article states that, in about half of the failures, there were no SMART warnings at all. Okay, but what was the breakdown in the kinds of failures of these unpredicted ones? If they were all spindle motor and head traversal failures, then you can't blame SMART for that. If it turns out that SMART gave warnings for 95% of all failures that were media-degradation related (like bad sectors, etc... where the drive still talks to your machine properly, and just can't get the data you want), then I'd say SMART is pretty darn useful.

    But, alas, I didn't see any breakdown for failure type....
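
    The breakdown asked for above can be sketched in a few lines. This is only an illustration: the failure records, the failure-type labels, and the helper name warning_rate_by_type are all hypothetical, since the paper publishes no per-type data. The numbers are chosen so that SMART warns in half of all failures overall, yet in 95% of media failures.

```python
from collections import defaultdict

# Hypothetical failure records: (failure_type, smart_warned_before_failure).
# All counts are invented for illustration; the paper gives no such breakdown.
failures = (
    [("media", True)] * 19 + [("media", False)] * 1
    + [("spindle_motor", True)] * 1 + [("spindle_motor", False)] * 11
    + [("head_actuator", False)] * 8
)


def warning_rate_by_type(records):
    """Return the fraction of failures preceded by a SMART warning, per type."""
    counts = defaultdict(lambda: [0, 0])  # failure type -> [warned, total]
    for failure_type, warned in records:
        counts[failure_type][0] += int(warned)
        counts[failure_type][1] += 1
    return {t: warned / total for t, (warned, total) in counts.items()}


# Overall rate: what a "half of failures had no warning" summary reports.
overall = sum(warned for _, warned in failures) / len(failures)
by_type = warning_rate_by_type(failures)

print(f"overall: {overall:.2f}")
for failure_type, rate in sorted(by_type.items()):
    print(f"{failure_type}: {rate:.2f}")
```

    With made-up data like this, the same 50% overall figure is compatible with SMART being nearly useless for mechanical failures and very useful for media degradation, which is why the missing per-type breakdown matters.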
  7. Re:Proprietary reporting by T-Ranger · · Score: 3, Interesting

    They are hardly trade secrets. Google isn't in the hardware business. There are only so many patterns of disk usage one can have, and knowing what pattern Google has would hardly be useful to figure out how they did anything that they do. At least, not to any level of detail useful enough to copy.

    The amount of positive press they get from these types of releases easily justifies the effort to polish internal reports up to a publication standard. By releasing these types of papers, others may change their buying habits, which in turn will change the products sold. Google may believe that these types of papers would shame, not individual manufacturers, but the industry as a whole, and thus cause better products to be produced.