Google Releases Paper on Disk Reliability
oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
Excellent, i have been looking forward to thi *%)%*# DISK FAILURE
So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?
They stated at one point in the document that some brands did have higher failure rates than others - yet I somehow missed any mention or ranking of brands. Did anyone else find that data?
"There is more worth loving than we have strength to love." - Brian Jay Stanley
But the disk it was on failed.
This is awesome, but the conclusion of such an interesting study leaves a lot to be desired. FTA...
"In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.
One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.
Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."
I noticed this too. If a Google-sanctioned report had charts of which brands were more reliable, this would do serious damage to the brands that didn't perform so well. No wonder they sidestepped the whole issue!
I was at the talk, and it was very interesting. CMU also had a paper (PDF) about disk failures in the same conference (in fact, they presented one after the other).
C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...
Am I part of the core demographic for Swedish Fish?
you'd think they could afford statisticians. Survival analysis anyone? http://cran.r-project.org/
Why don't you just set your browser to 'download PDF files to disk' instead of 'opening PDF files in browser window'? That way, you can always abort the download, or better still, continue browsing while the PDF downloads.
I find SMART to work at detecting failures. A couple months ago I turned on my laptop and it gave me a SMART error saying my hard drive was going to die soon. But gave me the choice of continuing bootup which I did from a livecd to make one final backup. I never lost one bit of data thanks to dd'ing to another computer on my network.
What browser? Don't you have a kpdf or xpdf ready to read it???? DUH
Why should we have our screens cluttered up with "PDF ALERT! PDF ALERT!" because you can't figure out how to configure your system properly?
If you're too lazy or lacking in knowledge, buy a Mac - PDFs load instantly in OS X right out of the box.
This space intentionally left blank.
Ideally, they would have formatted the text to spell out the names of the brands if you take the first letter of every Nth word, or some specific column of text. (Or maybe they have...)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
That would have been the perfect way to divulge this data without causing direct harm to any maker - I would really have liked to see if there was a large variance between brands, which might even lead me to purchase brand Y more, even if it's not at the top of the reliability chart - just so long as it was cheaper.
Their statistics on temperature seem very unusual. I'm surprised they didn't explore this more. For example, is the high failure rate associated with low temperatures because the drives were more likely to be inactive due to failure?
"No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
You take the Google paper and the twenty others on disk failure, take the third page of each, sort them by their papers' Google rankings and take the middle letter of every 42nd word, whilst standing in the middle of a pentagram under the second full moon of the month.
Just kill acrord32.exe. Firefox recovers and gives you a blank page with control back to you.
To my mind the most significant piece of info: "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."
Excuse me, but please get off my Pennisetum Clandestinum, eh!
"...we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."
Litigation avoidance may be a consideration here but why not take Google at their word? Google is a search company that buys lots of hard drives. Based on their own internal research, they have developed information about which hard disk models and/or manufacturers are shite.
Yahoo is also a search company that buys lots of hard drives. Why should Google give that hard drive reliability information to you, me and Yahoo for free? Let Yahoo/Excite/MSN and the competitors figure it out for themselves.
Yeah, sure I'd like to have access to Google's data the next time I'm in the market for a hard drive but I won't hold a grudge against them if they don't do my consumer research for me. On the other hand, whereinafuck is the data from Tom's Hardware Guide, Anandtech, Consumer Reports and all the other reviewer and consumer sites? If someone doesn't have a handy link to their results, I'll see if I can google something up:
http://www.google.com/search?hl=en&safe=off&client=firefox-a&rls=com.ubuntu%3Aen-US%3Aofficial&hs=tqy&q=hard+drive+reliability+research+brands++manufacturers+models&btnG=Search
Google releases a paper on disk reliability.
What if I do the same thing, and I do get different results?
I've personally had much better luck with manufacturers offering 5 year warranties on their media. This does not include either of the manufacturers you mentioned...
The download crapped out. And I couldn't close the tab. It's just an unpleasant surprise when everything locks up for a while. It's just two little words. A simple courtesy, no? For those of us who don't always remember to check the status bar. I just right clicked and saved the link after reloading.
What?
From my experience, Western Digitals are (relatively) reliable. They unfortunately do not have the same power connector orientation as any other consumer drive on the planet, so if you want to use IDE RAID you have to get the type that either (1) fits any consumer ide drive or (2) fits a Western Digital Drive. (grr)
Had some good experiences with Maxtor. A couple of years ago (OK - maybe 6 or 8) we had batches of super reliable Maxtors - 10GB.
Some Samsungs are good, some are evil - the SP0411N was a particularly reliable model - the SP0802N sucked - out of a batch of 20, 15 of them died within a year: all reallocated sector errors beyond the threshold.
Seagates are a mixed bag too - been having a nice experience with the SATA models 160GB and 120GB - can't remember their model #'s off the top of my head. - The older Seagates, though, I spent a fair amount of time replacing.
IBM DeskStars, as far as I know, have been quite good - for some reason didn't use too many.
Now, if you'll excuse me, I've got some idea balls to remove from a manatee tank.
In my Firefox browser I can see a nice little PDF icon warning me of a PDF file.
Also, no need to buy a mac, PDFs work instantaneously outta box on my Ubuntu Linux...
has quite a few grammatical errors. Is this a result of disk failure?
It appears that sentence was right after the part I read about how some makers had better results than others. So of course I scan the whole document looking for said data immediately after reading the first part, but did not return to that exact point thinking I had read it already...
if you use Firefox, get the TargetAlert extension. it adds a small image after links that are pdfs, Word docs, etc. so you'll have some forewarning.
And I agree with your implication.
Heh. The OLD Deskstars were great. I had a 1GB and 20GB over the years, and they were fantastic. For some newer lines, 20GB on up, there was a downright scandal about extremely high failure rates. It sounds like 1 plant producing them was turning out duds with a near 100% failure rate. IBM sold off the storage division to Hitachi, who now sells Hitachi Deskstars. I can only assume they closed the bad plant, or made sure the clean room was actually clean 8-).
This is completely anecdotal, unscientific... Since building out two servers a couple years ago, each with approximately 800G of drive space, I've had to replace drives on average of one every 8 weeks. In my lab there are about twenty drives across 8 machines, so that number is not too bad. Or so I thought. After replacing all my power supplies my drive failures have gone way down. The only drive I've lost recently is one in an older machine with an ancient 300W power supply.
Would it have killed them to vary the colors in the charts -- Figures 8 and 11-13 are pretty much unreadable even when zoomed above 100%.
Interesting.. but I disagree with your analysis.
The DeskStars were nicknamed DeathStars due to their high failure rate.
Maxtor has a terrible reputation in the channel.
Seagate has a fantastic reputation in the channel.
And as far as the WD power connectors.. I have 4 Western Digitals, a Samsung, a Maxtor, and a Seagate on my desk right now.. and they all have the same layout (left to right: 40 pin, jumpers, molex).
So what tool on Mac OS X will provide all the SMART data?
Well, the article's conclusion looks pretty clear to me. Watch for scan errors in smartd reports. When they start happening, migrate your data off that disk and replace it.
--
Mad science! Robots! Underwear! Cute girls! Full comic online! http://www.girlgeniusonline.com/
What is SMART monitoring really good for? Not one drive I've had it enabled on has given me a "warning, this drive is about to fail" alert. Instead, it would be a random clunking sound, or the system would freeze up entirely (hard to tell what's at fault at first, if you use Windows. ;).
Now I'm no engineer, but what strikes me as a better alternative is to toss on a 1Gb flash storage chip, and keep a redundant index/record of live/recoverable sectors. As the HD first starts to fail, the BIOS can pop up a warning window on reboot, advising the user to put in a replacement. After which, the HD clones itself automatically, based on the index files. Files that are damaged could be recovered as well, or discarded in the ol' bitbucket. 45 minutes for an OS repair install, and you're done. No scrambling to download everything all over again.
Another alternative is a hybrid solid state HD (I think there was a /. article about this a while ago). If the HD BIOS detects impending doom, it can just dump the most critical data (e.g. user files, OS, irreplaceable stuff) to flash, then copy it over to a fresh drive.
But anyhoo, that's my 2 cents.
Just because you can mod me down, doesn't mean you're right. Shoes for industry!
that it changes more from year to year and model to model than from one manufacturer to another?
Now I don't feel bad about turning off SMART reporting all those years ago. I never did trust that crap... On a side note, it would be interesting to see who Google signs their next contract for disk drives with...
You really didn't read the article, did you? On page 3 (Section 2.2 Deployment Details), the authors state: "More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units were put into production in or after 2001. [...] The data used for this study were collected between December 2005 and August 2006."
What are you waiting for Google to tell you? Are you really accusing them of being evil because they did a study, described their methodology, detailed their results, presented their analyses, and published it all for anyone who is interested?
You describe their conclusions as:
But there is no contradiction at all if you are smart enough to understand. They are telling you that if SMART identifies a problem with a drive then it is very likely that drive will fail within 60 days. But in a sample of 100,000 drives, many drives will also fail that have not returned errors on SMART scans. Thus SMART is a reliable indicator of impending failure but is not a silver bullet that can recognize and predict all failures before they happen.
Next time you have access to 100,000 hard drives, can analyze patterns of failure among them, can use those failures as a benchmark against which to measure analysis tools, and can come up with better recommendations for predicting failure than this study, then by all means let us know. But if you're looking for Microsoft or Western Digital or Seagate or Yahoo to perform and publish this kind of study for free, I think you may be waiting a good long while.
so, what do they do with the failed drives? where does that data go? what is their procedure for tossing these drives?
for a minute there, i lost myself...
It is well known that google uses commodity hardware. SCSI is not commodity, although I'm sure at least some of their servers are high end.
I pretend to know more than I really do by mooching off google and wikipedia.
I had a bad run with Western Digital drives a while back and switched to Maxtors, which I found to be very reliable when they were first putting out 250GB drives. Had a bad experience with a Seagate dropping dead within the first week after purchase; fortunately I got most of my data off of it.
Seagate also does NOT offer advance drive replacement in Canada, which means I'll never buy another of their products until this policy changes.
Had good luck with more recent Western Digital drives. Put 5 x 500GB in a RAID-5 server, and they're running great!
N.
"Nothing strengthens authority so much as silence." - Charles de Gaulle
It's a very slight difference in the positioning of the WD power connector within the physical position on the drive. It's still a standard 4-pin power connector, but you cannot slide it into the housing of an AccuSYS IDE RAID drive bay. You have to order a different AccuSYS model that is specifically for WD parallel IDE drives.
Out of curiosity, what model of Seagate has the fantastic rep?
About a year and a half ago, a presentation by Google concerning a massive online storage service called GDrive was leaked. It was pretty much confirmed that it is on some level operational. The study might have something to do with it, maybe even some kind of clever PR. Just my 2c.
My Starcraft 2 Blog
As a reliability gauge, replacement policy is important. But I've found that in reality, if a drive fails I don't want another one of the same to replace it.
Cheers, fellow Canadian.
And that will give me a predictive model of disk failure? Your ideas intrigue me, sir, and I would like to subscribe to your newsletter.
Had some good experiences with Maxtor. A couple of years ago (OK - maybe 6 or 8) we had batches of super reliable Maxtors - 10GB.
Oh yeah, I got a 12GB Quantum Fireball CX from Egghead Software the other day, and I highly recommend it!!!!
Or did that happen about 10 years ago? It's so hard to remember sometimes...
Their conclusion (and a glance at their results) indicates that drives fail because of product defects. However, home-use parameters such as brown power (low voltage on the line) are probably not taken into account in their server environment.
It's interesting, and I tend to trust their results, but these conclusions may not be relevant to single-drive situations. That is, if two customers purchase 1 drive each, and neither drive is defective, then this study doesn't explain why one drive would fail before the other. It also doesn't take into account the 1-year warranty foisted on the majority of PC-system purchasers these days.
Trust me, it can't be any worse than any of the other predictions in the IT industry.
There are apparently several SMART parameters that are correlated to eventual disk failure. If a disk starts throwing SMART errors in these categories then your best bet is to replace the disk ASAP. While it may be true that most disks fail without warning that doesn't mean it isn't a good idea to look for early warning signs of failure.
Had good luck with more recent Western Digital drives. Put 5 x 500GB in a RAID-5 server, and they're running great!
A 2.5TB RAID5? Brave man...
What's the rebuild time on that baby?
No, if anything the price would decrease. Do you see why?
If you read more than the first sentence of the first paragraph in section 3.2, you would see where they said they didn't include this data due to its "proprietary" nature.
The paper claims "more than 100 thousand drives". But the nice thing is that you can derive the actual number from the error bars, for example those in figure 4. The data should be governed by Poisson statistics, which means that the standard deviation in the counts is equal to the square root of the count. However, their error bars seem to be about a factor of 2 larger than the standard deviation, because normally around 68% of the data points should lie within one standard deviation from the "smooth curve". Let's assume the error bars are 95% confidence intervals, i.e. 2 standard deviations.
Look at the data for 20 to 21 C. It tells you that it represents a fraction 0.0135 of their total drive population, with an average failure rate of 7 +- 0.5 %. Following the reasoning above, this 7% should represent 784+-28 drives. Since these represent 7% of 1.35% of the total number of drives, we can derive that the total number of drives is 784/0.07/0.0135 = 830,000 drives. Trying the same thing for 30 to 31 C gives 826,000 drives, which seems fairly consistent.
So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?
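For anyone who wants to redo the arithmetic with other bins, here is a minimal Python sketch of the reasoning above. It is only as good as the assumptions in the post: the counts are Poisson, the error bars span 2 standard deviations, and the charted failure rate is treated as plain failures divided by drives. The function name and parameters are made up for illustration, and the inputs are values read off the chart by eye.

```python
import math

def estimate_total_drives(bin_fraction, afr, error_bar, sigmas_per_bar=2.0):
    """Back out the total population from one histogram bin, assuming Poisson counts.

    bin_fraction   -- fraction of all drives that fall in this temperature bin
    afr            -- failure rate read off the chart for that bin (0.07 for 7%)
    error_bar      -- half-width of the error bar, in the same units as afr
    sigmas_per_bar -- standard deviations per error bar (2 ~ 95% CI, an assumption)
    """
    sigma = error_bar / sigmas_per_bar
    # Poisson: std dev of a count N is sqrt(N), so the relative error is
    # 1/sqrt(N), which gives N = (afr / sigma)^2 failed drives in the bin.
    failures = (afr / sigma) ** 2
    drives_in_bin = failures / afr
    total = drives_in_bin / bin_fraction
    return failures, total

# The 20 to 21 C bin from the post: 1.35% of drives, 7 +- 0.5 % failure rate
failures, total = estimate_total_drives(bin_fraction=0.0135, afr=0.07, error_bar=0.005)
print(f"failed drives in bin: {failures:.0f} +- {math.sqrt(failures):.0f}")
print(f"estimated total drives: {total:,.0f}")
```

If the error bars are really 1 standard deviation instead of 2, pass sigmas_per_bar=1.0 and the estimate drops by a factor of 4, so the result is very sensitive to that guess.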
Avantslash: low-bandwidth mobile slashdot.
5x500 in RAID 5 is not 2.5 TB. It's 1.5 TB assuming he's running with a hot spare. (Everyone has a hot spare, right?)
Advanced users are users too!
Google Labs, yet in its youth, certainly resembles me of the golden yers of the Bell Labs.
Did you run that through Google Translate?
The report does say that "vintage" matters, ie. that "Past performance is not a reliable indicator of future development".
Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.
Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic I'm sure there's some useful information in there (eg. a company to avoid like the plague).
No sig today...
This is how some guys I used to know in the storage division told the story - hearsay, but probably a reasonable approximation to what happened:
At the time, IBM had two disk fabrication plants. Certain lines of deskstars were being migrated to a new kind of platter technology (glass composition? something like that), which necessitated completely rebuilding the production lines.
One of those rebuilds was screwed. All the disks it produced were DOA, but not quite DOA enough to get the problem caught by their standard QA procedures. In the end they had to tear the whole thing down and rebuild it again.
In the end, about half the drives shipped in the affected product lines were defective. Because of how stock allocation from the two plants works, if the store you got your drive from gave you a defective one, most likely every single other drive in their storeroom was from the same plant and therefore also defective, so taking it back there for a warranty replacement was a joke. The deskstars got a bad reputation more from this than from anything else. IBM knew what was going on, but could do little to stop it, because until they got that plant rebuilt they just didn't *have* any replacement drives to hand out. A classic example of how a failure in the QA process can leave a company completely screwed.
By the time Hitachi bought the storage division, the bad production line was long gone and the QA procedures fixed.
I don't know why they didn't just throw in the towel and issue a product recall. Must have been a management decision. There's a lawsuit pending that might find out.
I think this is some sort of pre-existing speech in which he merely replaced the significant target with Google.
That's the only way this pile of junk's existence can be accepted.
Appears in the Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST'07), February 2007
Failure Trends in a Large Disk Drive Population
Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso
Google Inc.
1600 Amphitheatre Pkwy
Mountain View, CA 94043
{edpin,wolf,luiz}@google.com
Abstract
It is estimated that over 90% of all new information produced
in the world is being stored on magnetic media, most of it on
hard disk drives. Despite their importance, there is relatively
little published work on the failure patterns of disk drives, and
the key factors that affect their lifetime. Most available data
are either based on extrapolation from accelerated aging experiments
or from relatively modest sized field studies. Moreover,
larger population studies rarely have the infrastructure in place
to collect health signals from components in operation, which
is critical information for detailed failure analysis.
We present data collected from detailed observations of a
large disk drive population in a production Internet services deployment.
The population observed is many times larger than
that of previous studies. In addition to presenting failure statistics,
we analyze the correlation between failures and several
parameters generally believed to impact longevity.
Our analysis identifies several parameters from the drive's
self monitoring facility (SMART) that correlate highly with
failures. Despite this high correlation, we conclude that models
based on SMART parameters alone are unlikely to be useful
for predicting individual drive failures. Surprisingly, we found
that temperature and activity levels were much less correlated
with drive failures than previously reported.
1 Introduction
The tremendous advances in low-cost, high-capacity
magnetic disk drives have been among the key factors
helping establish a modern society that is deeply reliant
on information technology. High-volume, consumer-grade
disk drives have become such a successful product
that their deployments range from home computers
and appliances to large-scale server farms. In 2002, for
example, it was estimated that over 90% of all new information
produced was stored on magnetic media, most
of it being hard disk drives [12]. It is therefore critical
to improve our understanding of how robust these components
are and what main factors are associated with
failures. Such understanding can be particularly useful
for guiding the design of storage systems as well as devising
deployment and maintenance strategies.
Despite the importance of the subject, there are very
few published studies on failure characteristics of disk
drives. Most of the available information comes from
the disk manufacturers themselves [2]. Their data are
typically based on extrapolation from accelerated life
test data of small populations or from returned unit
databases. Accelerated life tests, although useful in providing
insight into how some environmental factors can
affect disk drive lifetime, have been known to be poor
predictors of actual failure rates as seen by customers
in the field [7]. Statistics from returned units are typically
based on much larger populations, but since there
is little or no visibility into the deployment characteristics,
the analysis lacks valuable insight into what actually
happened to the drive during operation. In addition,
since units are typically returned during the warranty period
(often three years or less), manufacturers' databases
may not be as helpful for the study of long-term effects.
A few recent studies have shed some light on field
failure behavior of disk drives [6, 7, 9, 16, 17, 19, 20].
However, these studies have either reported on relatively
modest populations or did not monitor the disks closely
enough during deployment to provide insights into the
factors that might be associated with failures.
Disk dr
One of the largest retailers in Russia (and maybe in Europe - more than 300 terminals for orders in person at an ex-factory building, busy 24/7), "Pro Sunrise", released information on failure rates of major components (CPUs, video cards, motherboards, IDE/SATA drives, etc.) of the PCs they sold for Q1-Q2 of 2005.
http://pro.sunrise.ru/articletext.asp?reg=30&id=283 - the article (in Russian, but the diagrams are self-explanatory).
http://pro.sunrise.ru/docs/30/image001.gif - IDE/SATA (3.5" formfactor)
http://pro.sunrise.ru/docs/30/image002.gif - HDD (2.5" notebook formfactor)
In short, most returns are for Maxtor brand. Lowest - IBM/Hitachi.
Toshiba is worst in 2.5", and Seagate is best.
The chance of a drive dying ranges from 1/20 (Maxtor) to 1/70 (Hitachi).
5x500 in RAID 5 is not 2.5 TB.
It is, it's just not 2.5TB _usable_. The *array size* is 2.5TB (well, it might be 2TB with a hot spare, but I sincerely doubt it).
It's 1.5 TB assuming he's running with a hot spare. (Everyone has a hot spare, right?)
Not a good assumption. The wording of the post, plus the "odd" number of drives suggests a DIY job. Most DIYers are after maximum space/$, not reliability or performance. Heck, even most "professionals" don't use a HS.
In today's world of big SATA disk arrays and RAID6/RAID-DP/RAIDZ2/$DUAL_PARITY_RAID_SCHEME, anyone still using RAID5 (with or without a HS) is borderline negligent, IMHO.
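Since half this subthread is arguing about raw vs. usable space, here is a rough Python sketch of the arithmetic. It assumes a single parity group of identical drives, counts hot spares as idle disks outside the array, and ignores filesystem overhead and GB vs. GiB differences. The function name and parameters are hypothetical, not taken from any RAID tool.

```python
def raid_capacity_gb(n_drives, drive_gb, parity_drives=1, hot_spares=0):
    """Return (raw_gb, usable_gb) for one parity-RAID group of identical drives.

    parity_drives -- 1 for RAID5, 2 for RAID6-style dual parity
    hot_spares    -- drives kept idle outside the array
    """
    active = n_drives - hot_spares
    # RAID5 needs at least 3 active drives, dual parity at least 4
    if active < parity_drives + 2:
        raise ValueError("not enough active drives for this parity level")
    raw = n_drives * drive_gb                      # everything you paid for
    usable = (active - parity_drives) * drive_gb   # what is left for data
    return raw, usable

print(raid_capacity_gb(5, 500))                    # plain RAID5, no spare
print(raid_capacity_gb(5, 500, hot_spares=1))      # RAID5 plus one hot spare
print(raid_capacity_gb(5, 500, parity_drives=2))   # dual parity, no spare
```

With these assumptions, 5 x 500GB gives 2000GB usable as plain RAID5 and 1500GB either with a hot spare or with dual parity, which is why the dual-parity option costs no extra space compared with RAID5 plus a spare.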
..gets slashdotted. Here is a link to the Google cache. http://64.233.167.104/search?q=cache:7IGly_-xAMIJ:labs.google.com/papers/disk_failures.html+Failure+Trends+in+a+Large+Disk+Drive+Population&hl=en&ct=clnk&cd=1&gl=us&client=opera
That raises a good question actually - what are people's experiences with advance drive replacements?
I've had several Maxtor drives fail, and I'm now wary of storing anything important on them*, but their advance drive replacement service has been fabulously good in my experience. My understanding is that the particular model I have has been prone to failures (the 80GB Maxtor DiamondMax Plus 9 drives). I wouldn't be averse to purchasing other model drives from them again in the future, based on their drive replacement service.
My experience with WD and Seagate drives is that, in the case of the particular drives I've owned, reliability has been good. I also found Seagate's 3-year warranty to be a good reason for buying these drives. However, when one of my Seagate drives failed (Seagate Barracuda 7200.8, 250GB drive, model ST3250823A), they refused me an advance drive replacement in Sweden, which I'm very unhappy about. They were also both rude and unhelpful when I called to ask if it was available. In addition to that, it took them three weeks to ship me a new drive. I'm very, very unlikely to ever buy another Seagate product, despite finding that their hardware has performed well and been decently reliable.
As for WD drives, I've never had one fail during their warranty period. Last I was looking for replacement drives they were still just offering 1 year support, so I didn't buy any.
I now live in London, and hope it isn't too far out of the way for decent customer service. What manufacturers have offered good drive replacement service where you live, and which have not?
*I do have backups of course, but it's still a pain to have the drive fail.
[Science] is one of the very few things that raises human life a little above farce and gives it the grace of tragedy.
They could have shown the analysis without naming actual brands.
That way we could see for ourselves how much difference there is between best/worst manufacturers.
PDFs will also load quickly in Windows if you use Foxit reader instead of the bloated crap from Adobe!
So much for ever buying a Samsung drive again, as I noticed after a short search on the net that I'm not the only one experiencing such problems.
You fared much better than I. I bought 18 IBM Deskstar drives in about 2001-2003 and stopped when my failure rate skyrocketed. Hitachi bought them from IBM. It went like this:
All lasted over a year; all that died did so in 2-4 years of service. I only use the remaining two on non-production systems. They are non-critical.
I started buying them as I had great results using the 13GB lot. In fact, a couple were abused and kept working, so I bought the 40GB, 60GB and later 80GB. The only ones left are 40GB.
Random paper generator, perhaps?
To me it's useful - if I get a SMART warning, then I'm definitely backing up my drive and will replace it before it croaks.
Sensitivity/specificity always presents a balancing act of testing, and they are usually in a push/pull relationship. If you make a test too sensitive, then you get too many false positives, and wind up over treating something (i.e. the test says it might fail so you replace the drive even though it's not going to - a false alert)
If you make the test too specific, then usually you wind up decreasing its sensitivity, or ability to detect something. Now you get false negatives, so when the test fires, you can be sure that it's accurate, but it doesn't always detect the problem.
What you want to know is the Positive Predictive Value PPV, which is determined by the formula PPV=TP/(TP+FP). TP = true positives, FP = false positives
Also useful is the Negative Predictive Value NPV, or this formula NPV=TN/(FN+TN) where TN = true negative, FN = false negative.
What these tell you is the following. With a high PPV, if a test is positive (e.g. the drive temperature is >80 C), then it accurately predicts that the drive will fail. With a high NPV, if the test is negative (drive temp 40 C), then it accurately predicts that the drive is OK.
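To make the two formulas concrete, here is a minimal Python sketch. The function name and the counts are made up purely for illustration; they are not from the Google paper.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts.

    PPV = TP / (TP + FP), NPV = TN / (FN + TN), as in the formulas above.
    Returns NaN for a value whose denominator is zero.
    """
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (fn + tn) if (fn + tn) else float("nan")
    return ppv, npv


# Hypothetical counts for 1000 drives where the "test" is a SMART warning.
# These numbers are invented for the example, not measured data.
ppv, npv = predictive_values(tp=30, fp=20, tn=900, fn=50)
print(f"PPV = {ppv:.2f}")  # 30 / (30 + 20)
print(f"NPV = {npv:.2f}")  # 900 / (50 + 900)
```

With counts like these, a warning would be right more often than not, but 50 of the 80 failing drives would still die without ever raising one, which is the kind of gap the paper's abstract seems to be pointing at.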
..........FULL STOP.
I'm glad to see this kind of information come out of Google. But I don't expect it to last.
These charitable acts aren't a big deal while Google is the darling of Wall Street. Once the honeymoon's over their shareholders are going to punish them for peeing away company trade secrets.
after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors
This is easily the most important thing a sysadmin needs to know about hard drives. Much as I love Spinrite, when drives start to fail they continue to fail.
This story reminds me of the runaround I got from Dell [India] when my one-and-only-Dell I'm-not-stupid-enough-to-buy-their-crap-again started to have seek errors.
I come here for the love
I'm sure there are a few smart cookies out there that could easily derive which vendor corresponded to which letter. It wouldn't take that much if you work for one of the drive companies (and therefore can easily derive one of them, then just solve for the rest).
It's also very proprietary on their part. With the volumes they're buying, showing that drive X had a drastically better failure rate would probably mean that the price on X would go up, therefore costing Google money.
The research results are VERY poorly communicated, as research results often are.
This seems to be the most relevant sentence: "What stands out are the 3 and 4- year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced." (Page 5, Section 3.4, 4th paragraph)
Often poor communication in research papers is intended to hide the fact that the results are not very useful. The above sentence can be translated to: "If you run hard drives hot, after 3 or 4 years you will have a high failure rate."
All of our drives have their own vibration-isolated fans. Google, I recommend you do that too, based on your research results.
--
Is U.S. government violence a good in the world, or does violence just cause more violence?
They are hardly trade secrets. Google isn't in the hardware business. There are only so many patterns of disk usage one can have, and knowing what pattern Google has would hardly be useful to figure out how they did anything that they do. At least, not to any level of detail useful enough to copy.
The amount of positive press they get from these types of releases easily justifies the effort to polish internal reports up to a publication standard. By releasing these types of papers, others may change their buying habits, which in turn will change the products sold. Google may believe that these types of papers would shame not individual manufacturers but the industry as a whole, and thus cause better products to be produced.
Here's a quote from the Google paper: "Power-on hours -- Although we do not dispute that power-on hours might have an effect on drive lifetime, it happens that in our deployment the age of the drive is an excellent approximation for that parameter, given that our drives remain powered on for most of their life time." (Page 10, 4th paragraph)
Translation: The number of hours the drives are powered is the same as the age of the drives, since the drives are always powered.
When two numbers are close to equal, they are approximations for each other. LOL. Is there a social breakdown at Google? Are the people who don't like to think taking power at Google?
So they've not had the balls to fess up and name names!
Come on, who is the worst drive manufacturer, my money's on Maxtor - based on at least six drives dying within 6 months.
For that matter, does anyone have any opinion on whether Maxtor has improved since being bought by Seagate?
Has IBM improved since merging with Hitachi, or have they just renamed the Deathstar?
Does anyone even use Samsung drives? Whatever happened to Fujitsu and Conner - they really were bad, i.e. sometimes didn't even work when new!
Are Western Digital the best for SATA and Seagate the best for IDE, as is my opinion (got about a dozen of these and only one failure)
All Google told us is that temperature doesn't make a difference, and power-cycling may but they can't really tell as they don't do it often!
I have a friend who has a theory that BitTorrent is really bad for drives, as it's constantly reading and writing little bits.
#include <sig.h>
... is not only a breakdown by age, but by other parameters, such as size, model, series, etc. I am sure that the IBM DeathStars would have greatly biased the statistics, for example, and it would be useful to have breakouts not only for such well known disasters, but also for the sample excluding the Deathstars, etc.
It is also interesting to note the magnificent jump in failure rates once the drives get outside the three year warranty period. No coincidence there.
"It is a greater offense to steal men's labor, than their clothes"
1) I read the article.
2) I can tell what's there and what's missing.
3) Because there is no table or graph presenting correlation between manufacturer, model, and failure rate, the article is of little use to me. It does have lots of words and pretty graphs, though.
An interesting document, and I found the data on temperatures particularly interesting.
I have been previously led to believe that it's not so much the average temperature of a hard drive that causes failure, but temperature fluctuations. This makes sense, since repeated expansion and contraction of the disk platters is likely to cause warpage before too long. This, I guess, is where glass platters like what IBM toyed with would come in useful. In the meantime I guess we still need our HVAC units to keep a constant temperature, just not too low anymore.
This also has implications for data centers that spend a considerable amount of energy pumping heat out of the server room. If we can raise the industry-accepted temperature ceiling from 22C to say 30C then a lot of energy can be saved over time. Perhaps not quite enough to dip below 1% of US-wide power use but every bit helps.
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
The relationship between low temperature and failure suggests that the higher viscosity of colder lube is causing increased wear.
You see the same thing happen with autos that are started frequently in cold weather. Thick, cold oil just doesn't work well.
They said "most of their life time", not "all of their life time".
Female Prison Rape in NY
You might be interested in the other best paper award winner (in the shameless self-promotion department): TFS: A Transparent File System for Contributory Storage, by Jim Cipar, Mark Corner, and Emery Berger (Dept. of Computer Science, University of Massachusetts Amherst). Briefly, it describes how you can make all the empty space on your disk available for others to use, without affecting your own use of the disk (no performance impact, and you can still use the space if you need it).
Enjoy!
--
Emery Berger
Dept. of Computer Science
University of Massachusetts Amherst
(Offtopic)Same with Linux. The only OS I've ever used that's useless out of the box is Windows. You can't even get a decent office suite on that OS without buying or downloading. They'd be keeping me as a customer if they'd finally figured out that usability means usability instead of annoying me with security "features."
(Ontopic) A number of people have talked about IBM's deskstars. I'd like to comment that I had a 40GB deskstar as my main system drive for 4 years without any failure. Those glass platters are really nice if you're trying to build a quiet system. I wish I could get a quiet high-end video card and a quiet PSU since I no longer have the deskstar, but sadly the wallet isn't what it used to be.
SRSLY.
The program sounds pretty amazing from their web site.
Are many companies using it for preventative maintenance to avoid data loss on their servers?
Sometimes even the task manager doesn't come up very quickly to kill the process. Oh, how I'm pining for a Mac, where Option-Command-Esc in such crises provides such instant gratification. It's a real basic complaint of mine with computers in general. No matter what kind of loop it gets stuck in, there should always be some resources devoted to user input. And don't even get me started on file management. Such basic things should have been addressed back in the 70s.
What?
No, they're the people that don't equate "most of their life time" with "always powered". I think you should apply for a job at the MPAA/RIAA with that logic "most of the time, people on P2P nets are infringing on copyright" == "all P2P users are pirates". For what it's worth though, it also says something about where the study applies (disks normally running 24/7) and where it doesn't (desktop uses etc.) Also the number of power cycles is interesting...
Live today, because you never know what tomorrow brings
I read the first sentence, and immediately went into scan mode through the whole paper looking for said brand data... when I didn't find it I returned to about, but not exactly, where I left off - because I thought I had read it before. Pretty sad, I know, not to notice text that was right after, but I thought the admission by such a big entity as Google that brands mattered was interesting enough to warrant a departure from linear narrative.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
I can understand leaving mfg. names out of the paper, but I would be interested to see how big a factor drive speed was in failure rates. Did any of the other papers presented address this?
Does anybody know the author? Just take a look at what hard drives he's buying for his private computer and we'll know the answer. If I knew a Google HD technician and saw that his home server had 100% of its drives from company Z, then I'd favour those too.
Who needs a hot spare in a home machine anyway? It's not like I need 5 9's reliability on my porn. Cold spares are perfectly fine.
Also, RAID6 has even worse performance overhead than RAID5, and for home users it's blatant overkill, especially since you're backing up the data nightly anyway.
I read the internet for the articles.
I seem to remember reading that Google prefers RAM drives to Hard disks? Personally, I can't wait to see the end of the hard disk - it is the weakest link in the hardware chain ...
Chris
Just contributing my own experience, but for me Maxtor has been completely crap (I used them for the first and last time about 6 years ago) while Seagate has been really good. I bought a used 120GB Seagate cheapo from eBay about 2 years ago and it is still spinning lovely. I have also just replaced my notebook's from-factory Samsung drive (because of a SMART warning) with another Seagate.
Ubuntu is an African word meaning 'I can't configure Debian'
In the PDF they mentioned stress testing the harddrives before deploying them, but don't specify their methods.
:( It has been running continuously for ~6 months now so maybe I finally dare to put some actual data on it.
How would you stress test a harddrive and how long?
Last time I bought a harddrive I divided it into 4 partitions and then ran 4 continuous Bonnie++ benchmarks on the partitions. I did this for a couple of weeks and I'm glad to say it didn't break down, although it did become a bit louder.
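For anyone without Bonnie++ handy, a very rough write-and-verify burn-in can be sketched in Python. This is not what Google or Bonnie++ does; it is just one simple way to keep a disk busy and check that what comes back matches what was written. The names `write_and_verify` and `burn_in` are made up for this example, and the read-back may be served from the OS cache rather than the platters unless the file is much larger than RAM.

```python
import hashlib
import os
import tempfile


def write_and_verify(path, total_bytes, block_size=1 << 20):
    """Write random blocks to path, read them back, return count of bad blocks."""
    digests = []
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(block_size, remaining))
            f.write(block)
            digests.append(hashlib.sha256(block).digest())
            remaining -= len(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device

    bad = 0
    with open(path, "rb") as f:
        for expected in digests:
            block = f.read(block_size)
            if hashlib.sha256(block).digest() != expected:
                bad += 1
    return bad


def burn_in(directory, passes=3, total_bytes=8 << 20):
    """Run several write/verify passes in directory; return total bad blocks."""
    bad = 0
    for _ in range(passes):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            bad += write_and_verify(path, total_bytes)
        finally:
            os.remove(path)  # clean up the scratch file after each pass
    return bad


if __name__ == "__main__":
    # Demo on a throwaway temp directory; point it at a mount on the new drive instead.
    with tempfile.TemporaryDirectory() as d:
        print("mismatched blocks:", burn_in(d))
```

For a real burn-in you would raise `passes` and `total_bytes` a lot and leave it running for days, the same way the Bonnie++ runs above went on for a couple of weeks.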
Begging for modpoints since '03
for people who want the bottom line and not a 13 page paper. Check out Google's Disk Failure Experience.