Slashdot Mirror


Google Releases Paper on Disk Reliability

oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"

20 of 267 comments

  1. Hmm by chanrobi · · Score: 2, Interesting

    So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?

    1. Re:Hmm by jemenake · · Score: 3, Interesting

      So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?
      Well, I was a little disappointed by the article. They looked at a lot of different SMART categories and they looked at the different ages of the drives, but they didn't delve into the different types of failures. I get about 1 "I think my drive crashed and I was hoping you could recover it" call per month and I see a variety of failure types. Probably the most common ones I see now are ones where something has gone wrong with the control circuits/mechanism and not the media itself. For example, something can go wrong with the motor that spins the platters, or you can seize the bearings for the head traversal, etc. I've even seen some where a chip on the controller board literally popped when it got too hot. These aren't going to be detected by SMART... I don't know what would predict failures like that.

      The article states that, in about half of the failures, there were no SMART warnings at all. Okay, but what was the breakdown in the kinds of failures of these unpredicted ones? If they were all spindle motor and head traversal failures, then you can't blame SMART for that. If it turns out that SMART gave warnings for 95% of all failures that were media-degradation related (like bad sectors, etc... where the drive still talks to your machine properly, and just can't get the data you want), then I'd say SMART is pretty darn useful.

      But, alas, I didn't see any breakdown for failure type....
    2. Re:Hmm by norton_I · · Score: 2, Interesting

      So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.


      It isn't even that good. Many of the failure flags indicate between 70% and 90% survivability to 8 months. This is much worse than the ~2%/year baseline failure rate, but not as strong a predictor as you might like. It would be nice to see data on this out to 2 or 3 years, so you could calculate the integrated chance of failure over the service lifetime, but by eye it looks like the trends were leveling off by 8 months.

      So, if you want to avoid replacing too many good drives, you probably have to move to a multiple error model, which probably reduces your detection likelihood well below the already low 44% reported.
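
      To put rough numbers on that comparison, here is a minimal sketch. It assumes the ~2%/year baseline is a constant annual rate (my simplification, not something the paper states) and takes "70% to 90% survivability" to mean a 10% to 30% chance of failing within 8 months. The function name is made up for illustration.

```python
def failure_prob(annual_rate, months):
    """Chance of failing within `months`, assuming a constant annual failure rate."""
    return 1.0 - (1.0 - annual_rate) ** (months / 12.0)

# Baseline drive: ~2% per year, scaled down to an 8-month window.
baseline = failure_prob(0.02, 8)

# Flagged drive: 90% and 70% survival to 8 months, i.e. 10% and 30% failure.
flagged = [1.0 - survival for survival in (0.90, 0.70)]

# How many times more likely a flagged drive is to fail in the same window.
ratios = [f / baseline for f in flagged]

print(f"baseline 8-month failure chance: {baseline:.2%}")
for f, r in zip(flagged, ratios):
    print(f"flagged failure chance {f:.0%} is about {r:.1f}x baseline")
```

      So a flag raises the risk a lot in relative terms, yet most flagged drives would still be alive at 8 months, which is the point about replacing too many good drives.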
  2. Re:That would be corporate dynamite by MrZaius · · Score: 3, Interesting

    It's no wonder that Google sidestepped the issue, but, if you assume they purchase primarily from the manufacturers that are more reliable, perhaps those manufacturers will begin to gloat and publish numbers about their Google contracts, if this study gains traction.

  3. Re:Did they ever name the brands? by iminplaya · · Score: 5, Interesting

    FTA: However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.


    But, of course.

    --
    What?
  4. Temperature conclusion by phasm42 · · Score: 4, Interesting

    Their statistics on temperature seem very unusual. I'm surprised they didn't explore this more. For example, is the high failure rate associated with low temperatures because the drives were more likely to be inactive due to failure?

    --
    "No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
    1. Re:Temperature conclusion by gnu-sucks · · Score: 3, Interesting

      My guess is this graph on temperature distribution is more or less a graph of temperature sensor accuracy. I can't imagine that drives at 50C had the lowest failure rate.

      While this would require a more laboratory-like environment, a dozen drives of each type and manufacture could have been sampled at known temperatures, and a data curve could have been established to calibrate the temperature sensors.

      There are lots of studies out there where drives were intentionally heated, and higher degrees of failure were indeed reported (this is mentioned in the Google report too). So the correlation is probably still valid, just not well-proven.

  5. Lower temp == higher failure rates by flyingfsck · · Score: 4, Interesting

    To my mind the most significant piece of info: "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."

    --
    Excuse me, but please get off my Pennisetum Clandestinum, eh!
  6. Re:Proprietary makes sense here by Anonymous Coward · · Score: 0, Interesting

    Why not, "Here's what works best for us, maybe this additional data will help improve reliability and help the entire computing field in general."? And maybe everyone in the world (betterment of humanity, that sort of thing?) could benefit from it? Like by avoiding a product line that is demonstrably inferior (No worries about lagging sales, I'm sure Dell would buy them for their discount line of PCs).

    I forget: It's always "fuck people", and "fuck trying to make this world a better place", and "Where's my goddamn profit I'm entitled to?!", and "Get back to work slaves..."

    Yeah it makes sense to lock everything up as proprietary. Nothing to spur progress and prevent waste like having multiple efforts duplicated and hiding the results so nobody is sure what is the best way, and taxing and profiting any way how. I can't wait until they figure out a way to charge us to breathe. Can I get my verichip tracking device embedded in my skull please? Open Source is treason. Sieg Heil Herr Bush & Blair and Halliburton and Google.

  7. Re:Did they ever name the brands? by LunarCrisis · · Score: 2, Interesting

    If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product.

    Or maybe the manufacturer just realized that 5 years down the road, a replacement for your then 5-year-old HD will cost them peanuts. According to the graph at http://en.wikipedia.org/wiki/Hard_drives#Capacity, HD capacity seems to be increasing by roughly ten times every five years.

    It's like the CD-R manufacturers stamping all the packaging with 100-year guarantees. They don't really have any good way of telling that they will actually last that long, but the replacement costs nearly nothing, and thus is paid for by the marketing benefits.
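
    The "peanuts" arithmetic can be sketched like this. It assumes the price of a typical drive stays roughly flat, so the cost per unit of capacity falls by the same factor that capacity grows; that flat-price assumption is mine, and the function name is made up for illustration.

```python
def relative_replacement_cost(years, growth_factor=10.0, period_years=5.0):
    """Cost of replacing a fixed capacity after `years`, as a fraction of
    today's cost, assuming capacity per dollar grows by `growth_factor`
    every `period_years`."""
    return growth_factor ** (-years / period_years)

for years in (1, 3, 5):
    fraction = relative_replacement_cost(years)
    print(f"after {years} year(s): about {fraction:.0%} of the original cost")
```

    Under those assumptions, honoring a claim at the very end of a 5-year warranty costs about a tenth of what the drive cost to supply on day one.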

    --
    Mr. Period: Nine is the one that's right by ten!
    Nine: One day I will kill him. Then, I will be Ten.
  8. Re:Did they ever name the brands? by Schraegstrichpunkt · · Score: 2, Interesting

    They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information.

    What? So the part about which variables are correlated with drive failures (which is what the report was about) wasn't interesting to you? Too bad.

  9. The GDRIVE by Shohat · · Score: 2, Interesting

    About a year and a half ago, a presentation by Google concerning a massive online storage service called GDrive was leaked. It was pretty much confirmed that it is on some level operational. The study might have something to do with it, maybe even some kind of clever PR. Just my 2c.

  10. Re:Did they ever name the brands? by lukas84 · · Score: 2, Interesting

    How can stating facts be libel?

  11. They do say that "vintage" matters by Joce640k · · Score: 4, Interesting

    The report does say that "vintage" matters, i.e. that "Past performance is not a reliable indicator of future development".

    Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.

    Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic I'm sure there's some useful information in there (e.g. a company to avoid like the plague).

    --
    No sig today...
  12. Re:Translation by Eivind · · Score: 2, Interesting
    So, you're saying there is no better choice, and can be no better choice, than simply selecting a disc randomly? It's possible it is like you say, that each year the stats change enough that no consistent trends are recognizable. It is however also possible (I'd say likely even) that different manufacturers are different statistically over time.

    You need backups anyway, that's not the point. But it makes a difference for your maintenance costs if you experience 1% of your disc drives dying in an average year or 5%.
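
    A back-of-the-envelope sketch of that difference. The fleet size matches the study's 100,000 drives, but the cost per replacement is a made-up placeholder, as are the function names.

```python
def expected_annual_failures(fleet_size, annual_failure_rate):
    """Expected number of drives that die in an average year."""
    return fleet_size * annual_failure_rate

def annual_replacement_cost(fleet_size, annual_failure_rate, cost_per_replacement):
    """Expected yearly spend on replacements (hardware plus labour per swap)."""
    return expected_annual_failures(fleet_size, annual_failure_rate) * cost_per_replacement

FLEET_SIZE = 100_000          # drives, same order as the study
COST_PER_REPLACEMENT = 150.0  # hypothetical cost per swap, not from the paper

for rate in (0.01, 0.05):
    failures = expected_annual_failures(FLEET_SIZE, rate)
    cost = annual_replacement_cost(FLEET_SIZE, rate, COST_PER_REPLACEMENT)
    print(f"{rate:.0%} per year: about {failures:.0f} failures, about {cost:,.0f} in replacements")
```

    Whatever the real cost per swap is, the 5% fleet costs five times as much to keep running as the 1% fleet.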

  13. Re:Proprietary reporting by T-Ranger · · Score: 3, Interesting

    They are hardly trade secrets. Google isn't in the hardware business. There are only so many patterns of disk usage one can have, and knowing what pattern Google has would hardly be useful to figure out how they did anything that they do. At least, not to any level of detail useful enough to copy.

    The amount of positive press they get from these types of releases easily justifies the effort to polish internal reports up to a publication standard. By releasing these types of papers, others may change their buying habits, which in turn will change the products sold. Google may believe that these types of papers would shame not individual manufacturers but the industry as a whole, and thus cause better products to be produced.

  14. What he/she/it is looking for by Alien54 · · Score: 2, Interesting

    ... is not only a breakdown by age, but by other parameters, such as size, model, series, etc. I am sure that the IBM DeathStars would have greatly biased the statistics, for example, and it would be useful to have breakouts not only for such well known disasters, but also for the sample excluding the Deathstars, etc.

    It is also interesting to note the magnificent jump in failure rates once the drives get outside the three-year warranty period. No coincidence there.

    --
    "It is a greater offense to steal men's labor, than their clothes"
  15. Temperatures by Trogre · · Score: 2, Interesting

    An interesting document, and I found the data on temperatures particularly interesting.

    I have been previously led to believe that it's not so much the average temperature of a hard drive that causes failure, but temperature fluctuations. This makes sense, since repeated expansion and contraction of the disk platters is likely to cause warpage before too long. This, I guess, is where glass platters like what IBM toyed with would come in useful. In the meantime I guess we still need our HVAC units to keep a constant temperature, just not too low anymore.

    This also has implications for data centers that spend a considerable amount of energy pumping heat out of the server room. If we can raise the industry-accepted temperature ceiling from 22C to say 30C then a lot of energy can be saved over time. Perhaps not quite enough to dip below 1% of US-wide power use but every bit helps.

    --
    "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
  16. SpinRite Disk Error Problem Detection by northerner · · Score: 2, Interesting
    Does anyone have any comments pro/con on SpinRite from Gibson Research (http://www.grc.com/sr/spinrite.htm)? It claims to detect and repair disk errors with a low-level scan before they become a problem. I bought it and used it on a server drive that had disk errors during DOS file copies. It fixed the problem and no data was lost, but I don't have any other experience with it.

    The program sounds pretty amazing from their web site.

    Are many companies using it for preventative maintenance to avoid data loss on their servers?

  17. Re:Great by mabhatter654 · · Score: 2, Interesting
    Personally, I think this is more geared as a "shot" to drive makers and big enterprise users, not so much the general desktop user crowd. After all, how many companies even HAVE 100,000 hard drives to test? Google is unique in their use of LOTS of hardware... in generally better controlled environments than most have. Google has issued other papers and industry "suggestions" about performance of motherboards, power supplies, and other OTS hardware... as big of a CUSTOMER as Google is, they can push the industry to perform better in ways normal people would just "deal with".

    What the report really shows is that SMART doesn't accurately indicate the life of the drive... if anything Google drives their hardware harder than normal users, so it should be a good testbed for predictive tools.... Google would be directly interested and probably pay a lot of money to somebody that implemented the changes this engineer said... chasing around 20k+ hard drives is an EXPENSIVE task... I'd bet Google pays a MILLION dollars a year in salary just to have somebody available to run out and replace unscheduled drive failures. That's a big process improvement that they would like to see hard drive manufacturers answer.