Google Releases Paper on Disk Reliability
oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?
It's no wonder that Google sidestepped the issue, but, if you assume they purchase primarily from the manufacturers that are more reliable, perhaps those manufacturers will begin to gloat and publish numbers about their Google contracts, if this study gains traction.
FTA: "However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."
But, of course.
What?
Their statistics on temperature seem very unusual. I'm surprised they didn't explore this more. For example, is the high failure rate associated with low temperatures because the drives were more likely to be inactive due to failure?
"No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
To my mind the most significant piece of info: "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."
Excuse me, but please get off my Pennisetum Clandestinum, eh!
Why not, "Here's what works best for us, maybe this additional data will help improve reliability and help the entire computing field in general."? And maybe everyone in the world (betterment of humanity, that sort of thing?) could benefit from it? Like by avoiding a product line that is demonstrably inferior (No worries about lagging sales, I'm sure Dell would buy them for their discount line of PCs).
I forget: It's always "fuck people", and "fuck trying to make this world a better place", and "Where's my goddamn profit I'm entitled to?!", and "Get back to work slaves..."
Yeah, it makes sense to lock everything up as proprietary. Nothing to spur progress and prevent waste like having multiple efforts duplicated and hiding the results so nobody is sure what is the best way, and taxing and profiting anyhow. I can't wait until they figure out a way to charge us to breathe. Can I get my VeriChip tracking device embedded in my skull please? Open Source is treason. Sieg Heil, Herr Bush & Blair and Halliburton and Google.
Or maybe the manufacturer just realized that 5 years down the road, a replacement for your then 5-year-old HD will cost them peanuts. According to the graph at http://en.wikipedia.org/wiki/Hard_drives#Capacity, HD capacity seems to be increasing by roughly ten times every five years.
It's like the CD-R manufacturers stamping all the packaging with 100-year guarantees. They don't really have any good way of telling that the discs will actually last that long, but the replacement costs nearly nothing, and thus is paid for by the marketing benefits.
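For what it's worth, the "ten times every five years" figure above can be turned into a rough back-of-the-envelope sketch. The function names are made up for illustration, and the flat-drive-price assumption is mine, not the graph's:

```python
def annual_growth_factor(total_factor: float = 10.0, years: float = 5.0) -> float:
    """Per-year capacity multiplier implied by `total_factor` growth over `years`."""
    return total_factor ** (1.0 / years)


def relative_cost_per_gb(age_years: float, total_factor: float = 10.0,
                         years: float = 5.0) -> float:
    """Cost per GB after `age_years`, relative to today (1.0).

    Assumption (not from the thread): the price of a drive stays flat while
    capacity grows, so cost per GB falls inversely with capacity.
    """
    return 1.0 / total_factor ** (age_years / years)


if __name__ == "__main__":
    print(f"capacity grows about {annual_growth_factor():.2f}x per year")
    print(f"after 5 years, same capacity costs {relative_cost_per_gb(5.0):.0%} of today's price")
```

Under those assumptions, replacing a five-year-old drive with equivalent capacity costs roughly a tenth of what it did at purchase, which is the "peanuts" point.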
Mr. Period: Nine is the one that's right by ten!
Nine: One day I will kill him. Then, I will be Ten.
What? So the part about which variables are correlated with drive failures (which is what the report was about) wasn't interesting to you? Too bad.
http://outcampaign.org/
About a year and a half ago, a presentation by Google concerning a massive online storage service called GDrive was leaked. It was pretty much confirmed that it is on some level operational. The study might have something to do with it, maybe even some kind of clever PR. Just my 2c.
My Starcraft 2 Blog
How can stating facts be libel?
The report does say that "vintage" matters, i.e. that "Past performance is not a reliable indicator of future development".
Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.
Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic I'm sure there's some useful information in there (eg. a company to avoid like the plague).
No sig today...
You need backups anyway, that's not the point. But it makes a difference for your maintenance costs whether 1% or 5% of your disk drives die in an average year.
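To put numbers on the 1% vs. 5% comparison, here is a minimal sketch. The fleet size is the 100,000 drives from the story summary; the per-replacement cost is a made-up placeholder, not a figure from the paper:

```python
def expected_annual_failures(fleet_size: int, annual_failure_rate: float) -> float:
    """Expected number of drives that die in a year at a given failure rate."""
    return fleet_size * annual_failure_rate


def annual_replacement_cost(fleet_size: int, annual_failure_rate: float,
                            cost_per_replacement: float) -> float:
    """Expected yearly spend on replacements (hardware plus labor, lumped together)."""
    return expected_annual_failures(fleet_size, annual_failure_rate) * cost_per_replacement


if __name__ == "__main__":
    FLEET = 100_000        # drives, as in the story summary
    COST = 100.0           # hypothetical cost per replacement, placeholder only
    for rate in (0.01, 0.05):
        failures = expected_annual_failures(FLEET, rate)
        cost = annual_replacement_cost(FLEET, rate, COST)
        print(f"{rate:.0%}: about {failures:.0f} failures, about {cost:,.0f} per year")
```

Whatever the real cost per swap is, the bill scales linearly with the failure rate, so going from 1% to 5% means five times as many replacements.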
They are hardly trade secrets. Google isn't in the hardware business. There are only so many patterns of disk usage one can have, and knowing what pattern Google has would hardly be useful to figure out how they did anything that they do. At least, not to any level of detail useful enough to copy.
The amount of positive press they get from these types of releases easily justifies the effort to polish internal reports up to a publication standard. By releasing these types of papers, others may change their buying habits, which in turn will change the products sold. Google may believe that these types of papers would shame not individual manufacturers but the industry as a whole, and thus cause better products to be produced.
... is not only a breakdown by age, but by other parameters, such as size, model, series, etc. I am sure that the IBM DeathStars would have greatly biased the statistics, for example, and it would be useful to have breakouts not only for such well known disasters, but also for the sample excluding the Deathstars, etc.
It is also interesting to note the magnificent jump in failure rates once the drives get outside the three-year warranty period. No coincidence there.
"It is a greater offense to steal men's labor, than their clothes"
An interesting document, and I found the data on temperatures particularly interesting.
I have been previously led to believe that it's not so much the average temperature of a hard drive that causes failure, but temperature fluctuations. This makes sense, since repeated expansion and contraction of the disk platters is likely to cause warpage before too long. This, I guess, is where glass platters like what IBM toyed with would come in useful. In the meantime I guess we still need our HVAC units to keep a constant temperature, just not too low anymore.
This also has implications for data centers that spend a considerable amount of energy pumping heat out of the server room. If we can raise the industry-accepted temperature ceiling from 22C to say 30C then a lot of energy can be saved over time. Perhaps not quite enough to dip below 1% of US-wide power use but every bit helps.
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
The program sounds pretty amazing from their web site.
Are many companies using it for preventative maintenance to avoid data loss on their servers?
What the report really shows is that SMART doesn't accurately indicate the life of the drive. If anything, Google drives their hardware harder than normal users, so it should be a good testbed for predictive tools. Google would be directly interested and probably pay a lot of money to somebody that implemented the changes this engineer said. Chasing around 20k+ hard drives is an EXPENSIVE task. I'd bet Google pays a MILLION dollars a year in salary just to have somebody available to run out and replace unscheduled drive failures. That's a big process improvement that they would like to see hard drive manufacturers answer.