Slashdot Mirror


Google Releases Paper on Disk Reliability

oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"

267 comments

  1. Great by true_hacker · · Score: 5, Funny

    Excellent, i have been looking forward to thi *%)%*# DISK FAILURE

    1. Re:Great by Compholio · · Score: 2, Funny

      Excellent, i have been looking forward to thi *%)%*# DISK FAILURE
      That's what you get for logging into slashdot from Antarctica...
    2. Re:Great by clayne · · Score: 1

      Yet more useless data from the "we want to break ground on everything courtesy of Google" department.

      1. No controlled set. They keep changing their drive configurations around.
      2. Only 8 months of data (they state this study was conducted between Dec 2005 - Aug 2006).
      3. It's like an average of 5% overall w/ the only real piece of info from this being that SMART doesn't really correlate
      4. There is no correlation to number of drive platters vs "afr" here.
      5. Utilization rate as a measure of bandwidth, yet no account of fragmentation which would drive a correlation to drive head excessive moment and sweeping vs controlled movement.
      6. No mention of number of drives in the set. Which is needed to calculate the granularity of statistics.

      7. This must be why their stock is at 469. By providing ground breaking studies of entropy and its effects on random data.

    3. Re:Great by mabhatter654 · · Score: 2, Interesting
      Personally, I think this is more geared as a "shot" to drive makers and big enterprise users not so much the general desktop user crowd. After all, how many companies even HAVE 100,000 hard drives to test? Google is unique in their use of LOTS of hardware... in generally better controlled environments than most have. Google has issued other papers and industry "suggestions" about performance of mother boards, power supplies, and other OTS hardware... as big of a CUSTOMER as Google is, they can push the industry to perform better in ways normal people would just "deal with".

      What the report really shows is that SMART doesn't accurately indicate the life of the drive... if anything Google drives their hardware harder than normal users, so it should be a good testbed for predictive tools.... Google would be directly interested and probably pay a lot of money to somebody that implemented the changes this engineer said... chasing around 20k+ hard drives is an EXPENSIVE task... I'd bet Google pays a MILLION dollars a year in salary just to have somebody available to run out and replace unscheduled drive failures. That's a big process improvement that they would like to see hard drive manufacturers answer.

  2. Hmm by chanrobi · · Score: 2, Interesting

    So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?

    1. Re:Hmm by Anonymous Coward · · Score: 5, Funny

      Didn't read the article? (Check)
      Didn't read the summary? (Check)

      Congratulations, you're not officially a slashdot regular!

    2. Re:Hmm by Anonymous Coward · · Score: 4, Informative

      There are several SMART signals which are highly correlated with drive errors, but the authors note that 56% of the failed drives had no occurrences of these highly correlated errors. Even considering all SMART signals, 36% of failed drives still had no SMART signals reported.

      So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.
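The coverage figures in the comment above can be turned into detection rates with a couple of subtractions. This is only a sketch of that arithmetic: the two percentages are the ones quoted in the comment, while the function and variable names are made up for illustration.

```python
# Figures as quoted in the comment above; names here are illustrative only.
FAILED_WITHOUT_STRONG_SIGNAL = 0.56  # failed drives with none of the highly correlated signals
FAILED_WITHOUT_ANY_SIGNAL = 0.36     # failed drives with no SMART signal at all


def detection_rate(missed_fraction: float) -> float:
    """Fraction of failed drives that showed a warning before failing."""
    return 1.0 - missed_fraction


strong_signal_rate = detection_rate(FAILED_WITHOUT_STRONG_SIGNAL)
any_signal_rate = detection_rate(FAILED_WITHOUT_ANY_SIGNAL)

print(f"Caught by the strongly correlated signals: {strong_signal_rate:.0%}")
print(f"Caught by any SMART signal at all:         {any_signal_rate:.0%}")
```

In other words, even the most generous reading of SMART still leaves roughly a third of failures unannounced, which is why a clean SMART report says little about an individual drive.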

    3. Re:Hmm by TattleTale1975 · · Score: 0

      No, actually, this report indicates that drive manufacturers should issue RMAs for a drive with a single confirmed bad sector, or any of a number of indicators reported by SMART, as they have as much as a 60% chance of failing within the next several months.

      If you see any SMART errors (note: errors), take that drive out of active service and just run it till it dies.

    4. Re:Hmm by Anonymous Coward · · Score: 0

      Congratulations, you're now officially a slashdot regular!

      Which proves that although dvorak beats qwerty by several boat lengths, it's not perfect.

    5. Re:Hmm by Anonymous Coward · · Score: 0

      How could a beating by a scalar, abstract quantity be less than perfect? You know nothing of Platonic ideals, sir.

    6. Re:Hmm by jemenake · · Score: 3, Interesting

      So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?
      Well, I was a little disappointed by the article. They looked at a lot of different SMART categories and they looked at the different ages of the drives, but they didn't delve into the different types of failures. I get about 1 "I think my drive crashed and I was hoping you could recover it" call per month and I see a variety of failure types. Probably the most common ones I see now are ones where something has gone wrong with the control circuits/mechanism and not the media itself. For example, something can go wrong with the motor that spins the platters, or you can seize the bearings for the head traversal, etc. I've even seen some where a chip on the controller board literally popped when it got too hot. These aren't going to be detected by SMART... I don't know what would predict failures like that.

      The article states that, in about half of the failures, there were no SMART warnings at all. Okay, but what was the breakdown in the kinds of failures of these unpredicted ones? If they were all spindle motor and head traversal failures, then you can't blame SMART for that. If it turns out that SMART gave warnings for 95% of all failures that were media-degradation related (like bad sectors, etc... where the drive still talks to your machine properly, and just can't get the data you want), then I'd say SMART is pretty darn useful.

      But, alas, I didn't see any breakdown for failure type....
    7. Re:Hmm by pugugly · · Score: 3, Funny

      Didn't check the typing? (Check)

      Congratulations, you're now officially a slashdot regular! - Pug

      --
      An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media
    8. Re:Hmm by norton_I · · Score: 2, Interesting

      So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.


      It isn't even that good. Many of the failure flags indicate between 70% and 90% survivability to 8 months. This is much worse than the ~2%/year baseline failure rate, but not as strong of a predictor as you might like. It would be nice to see data on this out to 2 or 3 years, so you could calculate the integrated chance of failure over the service lifetime, but by eye it looks like the trends were leveling off by 8 months.

      So, if you want to avoid replacing too many good drives, you probably have to move to a multiple error model, which probably reduces your detection likelihood well below the already low 44% reported.
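To put the comment's numbers side by side, here is a rough sketch comparing a flagged drive's 8-month failure chance with the baseline. It uses the commenter's figures (70% to 90% survival for flagged drives, about 2% per year baseline) and assumes, purely for illustration, that the baseline rate is constant over the year so it can be scaled to 8 months.

```python
# Commenter's figures; the constant-rate scaling is an assumption of this sketch.
BASELINE_ANNUAL_FAILURE = 0.02
WINDOW_MONTHS = 8


def failure_within(annual_rate: float, months: float) -> float:
    """Chance of failing within `months`, assuming a constant failure rate."""
    return 1.0 - (1.0 - annual_rate) ** (months / 12.0)


baseline_8mo = failure_within(BASELINE_ANNUAL_FAILURE, WINDOW_MONTHS)

ratios = []
for survival in (0.90, 0.70):
    flagged_failure = 1.0 - survival
    ratio = flagged_failure / baseline_8mo
    ratios.append(ratio)
    print(f"survival {survival:.0%}: flagged drives fail about {ratio:.0f}x as often as baseline")
```

So a flag raises the risk by roughly an order of magnitude, yet most flagged drives still survive the window, which is the tension the comment describes.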
    9. Re: Hmm by Anonymous Coward · · Score: 0

      I'm confused. Why is this comment rated Informative? He's parroting what has already been said then saying essentially 'if you flip a coin, you have a 50% chance it'll land heads'.

      No, duh, Sherlock. And you moderators. If I wasn't so lazy I'd punch you in the stomach.

    10. Re:Hmm by Trogre · · Score: 2, Funny

      Congratulations, you're not officially a slashdot regular!

      Didn't hit the 'Preview' button first? (Check)

      Congratulations, you are too!

      --
      "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
    11. Re:Hmm by Part`A · · Score: 1

      I believe they don't give you the failure type because it's unlikely that they even know. They're known for not removing dead servers from a rack until there have been enough dead ones on that rack to justify removing them..

      How many places do you know that would knowingly keep a faulty HDD in production for 8 months as it didn't matter when it actually died? At best they will collect the faulty HDDs for a replacement under warranty.

    12. Re:Hmm by complete+loony · · Score: 1

      My guess is, they don't know. This would require manual collation of failure data. What they have collected and reported on is only what can be discovered automatically by the OS the disk is attached to.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
  3. Did they ever name the brands? by SuperKendall · · Score: 4, Insightful

    They stated at one point in the document that some brands did have higher failure rates than others - yet I somehow missed any mention or ranking of brands. Did anyone else find that data?

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
    1. Re:Did they ever name the brands? by iminplaya · · Score: 5, Interesting

      FTA: However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.


      But, of course.

      --
      What?
    2. Re:Did they ever name the brands? by Anonymous Coward · · Score: 1, Insightful

      Google's studies are like their search engine: you get a bunch of results, but you have to sift through them yourself to get anything specific, and you'll probably end up reading the section closest related to boobies.

    3. Re:Did they ever name the brands? by Xross_Ied · · Score: 2, Insightful
      They didn't include any data at all about brands.

      They should have done brand analysis (without naming the brand) and also rpm analysis.

      From the article..

      3.2 Manufacturers, Models, and Vintages
      Failure rates are known to be highly correlated with drive
      models, manufacturers and vintages [18]. Our results do
      not contradict this fact. For example, Figure 2 changes
      significantly when we normalize failure rates per each
      drive model. Most age-related results are impacted by
      drive vintages. However, in this paper, we do not show a
      breakdown of drives per manufacturer, model, or vintage
      due to the proprietary nature of these data.

      --
      This sig space tolet, reasonable rate.
    4. Re:Did they ever name the brands? by drmerope · · Score: 2, Informative

      No. They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information. The question that really needs to be studied is what distinguishes good drives from bad. This would probably involve disassembling drives of various 'vintages, models, manufacturers' and trying to pin down the relevant details. That way when new hard-drives get released, reviewers can pull them apart and judge them on something other than read/write performance, heat, and acoustics...

    5. Re:Did they ever name the brands? by AmigaBen · · Score: 1
      Yes, Google seems to have the /. disease that prevents one from naming responsible parties when it would be useful to do so.

      Tsk tsk

      --
      +5 Insightful, really!
    6. Re:Did they ever name the brands? by repvik · · Score: 2, Insightful

      "However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data." (From TFA)

    7. Re:Did they ever name the brands? by Prof.Phreak · · Score: 3, Insightful

      At the very least, they could've named brands X, Y, Z, etc., and provided the numbers for those. Would be interesting if the differences are more than marginal.

      --

      "If anything can go wrong, it will." - Murphy

    8. Re:Did they ever name the brands? by ryturner · · Score: 3, Insightful

      It would be useful to you and me. But it is not useful to google to release that information.

    9. Re:Did they ever name the brands? by AmigaBen · · Score: 1, Insightful
      How was it useful to Google to publish the report at all?

      I don't see the point in pretending to provide information while obfuscating the most meaningful bits of it, unless it's a sales attempt to garner attention for a paid-for version of the report. Obviously, Google has concerns in the process different than what our concerns are, but again, I don't really see the point in the report without the brands.

      --
      +5 Insightful, really!
    10. Re:Did they ever name the brands? by mattmacf · · Score: 1

      That way when new hard-drives get released, reviewers can pull them apart and judge them on something other than read/write performance, heat, and acoustics...
      You forgot one metric of comparison: the warranty. As far as I'm concerned, this number alone is the most important in determining the reliability of the hard drive. If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product. When buying hard drives, I actively seek out drives with at least a 3 (preferably 5) year warranty (some Hitachis and Seagates IIRC) and explicitly avoid those with only a 1 year warranty period (I'm looking at you WD).
      --
      I only mod funny =D
    11. Re:Did they ever name the brands? by Anonymous Coward · · Score: 5, Funny

      They would have released that data, but it was saved on a Maxtor.

    12. Re:Did they ever name the brands? by Bill+Dog · · Score: 1

      It's for prestige. They are big enough to have gone through enough hard drives to do a study on them, and have smart enough people to do the study.

      --
      Attention zealots and haters: 00100 00100
    13. Re:Did they ever name the brands? by MadMorf · · Score: 1

      They specifically stated they would not be revealing the brands or models.

      I think that's understandable given the litigious nature of business today...

      Makes it a little less useful from a practical standpoint though...

    14. Re:Did they ever name the brands? by LunarCrisis · · Score: 2, Interesting

      If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product.

      Or maybe the manufacturer just realized that 5 years down the road, a replacement for your then 5 year old HD will cost them peanuts. According to the graph at http://en.wikipedia.org/wiki/Hard_drives#Capacity, HD capacity seems to be increasing by roughly ten times every five years.

      It's like the CD-R manufacturers stamping all the packaging with 100-year guarantees. They don't really have any good way of telling that they will actually last that long, but the replacement costs nearly nothing, and thus is paid for by the marketing benefits.

      --
      Mr. Period: Nine is the one that's right by ten!
      Nine: One day I will kill him. Then, I will be Ten.
    15. Re:Did they ever name the brands? by Schraegstrichpunkt · · Score: 2, Interesting

      They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information.

      What? So the part about which variables are correlated with drive failures (which is what the report was about) wasn't interesting to you? Too bad.

    16. Re:Did they ever name the brands? by Nogami_Saeko · · Score: 1

      I was disappointed that they didn't offer this information in the report - but not really surprised.

      --
      "Nothing strengthens authority so much as silence." - Charles de Gaulle
    17. Re:Did they ever name the brands? by HUADPE · · Score: 3, Insightful

      There are several good reasons to not release the brand names. First, while the sample size is huge, the sample size for a particular model of a particular brand might not be. If they only happened to have 10 of one particular model, and one failed within a month, then 10% fail within a month, but it could just be a fluke. Second, liability. This wasn't a controlled test, it was done live within the Google servers (presumably). Whoever is on the bottom of the list could very well sue Google for libel. Without merit? Probably, but they might eke a few million in a settlement out of them. Google can't appear to be doing evil after all.

      --
      This sig has not been evaluated by the FDA. It is not designed to diagnose, treat, prevent, or cure any disease.
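The small-sample point in the comment above can be made concrete with a confidence interval. The sketch below uses the Wilson score interval (one common choice, not something the paper or the commenter specifies) on the comment's example of 1 failure among 10 drives, and contrasts it with a hypothetical 1,000 failures among 10,000 drives at the same observed rate.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure proportion (z=1.96 is about 95%)."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# The comment's example: 10 drives of one model, 1 failure.
low, high = wilson_interval(1, 10)
# Hypothetical large sample with the same observed 10% rate.
low_big, high_big = wilson_interval(1000, 10000)

print(f"1 of 10 failed:          {low:.1%} to {high:.1%}")
print(f"1000 of 10000 failed:    {low_big:.1%} to {high_big:.1%}")
```

With only ten drives the plausible failure rate spans from a couple of percent to around forty percent, so a per-model ranking built on such counts would mostly be noise.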
    18. Re:Did they ever name the brands? by nolife · · Score: 1

      You did not miss anything. The report states:

      However, in this paper, we do not show a
      breakdown of drives per manufacturer, model, or vintage
      due to the proprietary nature of these data.


      and then adds to it with:

      Interestingly, this does not change our conclusions. In contrast to age-related results, we note that all results shown in the rest of the paper are not affected significantly by the population mix.


      Proprietary? Wrong use of the word there. What they really mean is we do not want to make specific companies look bad or maybe they do not want people to make incorrect conclusions based on the scope of their specific testing. In reality, I think the specific models and companies would be interesting though.
      For hard drives in general, this is very interesting information. For what specific drives to avoid, this report is not useful.

      --
      Bad boys rape our young girls but Violet gives willingly.
    19. Re:Did they ever name the brands? by Ruvim · · Score: 1

      All we have to do now is watch whom Google gets its drives from next...

    20. Re:Did they ever name the brands? by lukas84 · · Score: 2, Interesting

      How can stating facts be libel?

    21. Re:Did they ever name the brands? by iminplaya · · Score: 5, Funny

      Well, obviously you're not a lawyer :-) Otherwise you would know the answer.

      --
      What?
    22. Re:Did they ever name the brands? by Eivind · · Score: 1
      We also do not mention in any way just how large our population is anyway.

      Ok, so they didn't even mention the fact that they failed to mention that.

      Still, in any statistical analysis that's just about the most important parameter. I realize Google doesn't want to say just how many disk-drives they have, but it'd still be useful to have an order-of-magnitude number for the size of the study.

      I know the answer is "many". But not if it's 1000 drives, 10.000 drives or 100.000 drives. (nowhere do they say what fraction of Google's drives are included in the study either -- nor when the study was done or what vintage the discs were -- also useful data)

    23. Re:Did they ever name the brands? by TheThiefMaster · · Score: 1

      I have a five year old 20GB Maxtor that's still going strong. My newer 250s and 300s are also running well.

      I can't say much for the 80s and 160s I've had though. Especially as there's two different sizes of Maxtor 160s, a 152GiB and a 149GiB, just to confuse matters (160,000,000 KiB and 160,000,000,000 Bytes).

      As only the 20GB's an IDE, I suspect that it's just Maxtor's early sata drives that weren't reliable.
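The two "160GB" sizes in the comment above follow from two different byte counts being divided by the same binary gigabyte. A quick check of that arithmetic, using only the byte counts the comment gives (the helper name is illustrative):

```python
KIB = 2 ** 10  # bytes in one KiB
GIB = 2 ** 30  # bytes in one GiB


def to_gib(n_bytes: int) -> float:
    """Convert a raw byte count to binary gigabytes (GiB)."""
    return n_bytes / GIB


larger_drive = to_gib(160_000_000 * KIB)   # "160,000,000 KiB"
smaller_drive = to_gib(160_000_000_000)    # "160,000,000,000 Bytes"

print(f"160,000,000 KiB       = {larger_drive:.2f} GiB")
print(f"160,000,000,000 bytes = {smaller_drive:.2f} GiB")
```

The gap of about 3.5 GiB is enough to stop the smaller drive from joining an array built on the larger ones.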

    24. Re:Did they ever name the brands? by Simon+Brooke · · Score: 1

      nowhere do they say ... when the study was done or what vintage the discs were -- also useful data

      What part of:

      All units in this study were put into production in or after 2001. The data used for this study were collected between December 2005 and August 2006.

      did you not understand?

      --
      I'm old enough to remember when discussions on Slashdot were well informed.
    25. Re:Did they ever name the brands? by Simon+Brooke · · Score: 2, Insightful

      You forgot one metric of comparison: the warranty. As far as I'm concerned, this number alone is the most important in determining the reliability of the hard drive. If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product. When buying hard drives, I actively seek out drives with at least a 3 (preferably 5) year warranty (some Hitachis and Seagates IIRC) and explicitly avoid those with only a 1 year warranty period (I'm looking at you WD).

      You know, I don't give a monkey's. What you lose when a disk goes down (if you haven't done your backups properly) is typically far more valuable than the disk mechanism itself. Any manufacturer can put a five-year warranty on a disk mechanism as a gimmick. Most users won't remember the warranty when the disk goes down, and, even if they have to replace 10% of the units 'free', it doesn't take much on the retail price to cover that.

      20 years ago we had a spate of failures on Western Digital drives on machines which were out with customers. That really hurt - giving our customers free drives would not have cheered them up. 10 years ago we had a spate of failures of Samsung drives in a server farm. That was more under control, but it was still a bloody nuisance. I don't want a drive which fails, but when it fails I get a new one free. I want a drive that doesn't fail. The warranty has absolutely nothing to do with it.

      --
      I'm old enough to remember when discussions on Slashdot were well informed.
    26. Re:Did they ever name the brands? by Fred_A · · Score: 2, Insightful

      No. They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information.
      The pertinence of the SMART data (pretty much always pertinent) and how often it popped up (about half the time) before a failure was a very interesting bit of information.

      The question that really needs to be studied is what distinguishes good drives from bad.
      A good drive is one that lasts a long time without developing too many bad blocks. A bad drive is one that fails within a couple years. In both cases you only know it after the fact or because a whole series happens to be poorly designed (like it happens to every manufacturer every now and then). Unless that model is already widely deployed and known to be bad, or already widely deployed and likely no longer sold, there's no way to tell.

      And thus on the third day the FSM created backups and saw it was good.
      --

      May contain traces of nut.
      Made from the freshest electrons.
    27. Re:Did they ever name the brands? by Jasin+Natael · · Score: 1

      They could have satisfied me by doing some kind of less-intensive unbranded analysis, like how long the technology for the platter and the read head had been deployed, what the price point was compared to the release price of similar drives, etc.

      It would be very useful to know that buying a drive as soon as its price has fallen to 70% of similar drives' cost when the technology was first introduced, is most likely to result in a good / bad drive. I, personally, would think this should be highly correlated. Catch them when the manufacturing process is mature and the workers are skilled with it, but before they start slashing margins to reach "the consumer".

      I could also handle some research on the internals, but I think that the usefulness of such data would quickly expire. That is, unless the reviewer is going to get a torque wrench, a laser leveler, and some vibration sensors to try and figure out exactly how tight the screw fittings are, how flatly the motor is mounted, and what kind of movement the read head induces in the casing when it jumps around.

      --
      True science means that when you re-evaluate the evidence, you re-evaluate your faith.
    28. Re:Did they ever name the brands? by mackyrae · · Score: 1

      Moore's law: double every 18 months

      --
      look! it's a bird, it's a plane, it's....a girl? yes, a girl browsing Slashdot on Linux
    29. Re:Did they ever name the brands? by OAB_X · · Score: 1

      I want a drive that never breaks down too, and I bet so do the manufacturers. Imagine the marketing "guaranteed to never fail, ever!"

      However, HDDs have moving parts. All devices with moving parts break. Hard drives break.

      Of course, flash memory has no moving parts, and is therefore less likely to break, however, writes are slow, it's hideously expensive (for hdd size capacity), and there is an eventual write limit on how many times it can be written to. Yet, when flash memory becomes (more) affordable for real-life capacity, failure rates SHOULD go down, because then the motor can't wear out, only the actual chips.

    30. Re:Did they ever name the brands? by Glonoinha · · Score: 1

      Moore's law says that transistor density (which loosely translates to performance / speed) doubles every 18 months, is highly related to CPUs and memory and other chip based hardware, but has nothing to do with hard drive capacity.

      Personally my observation matches the GP's post, that drive density goes up by a factor of ten every 5 years, with drives today being approximately 1000x larger than drives 15 years ago.

      Regarding the Google study, there is one statistic that was left out (primarily because it doesn't apply in rack-mount server environments) : rate and frequency of yaw. The number one thing that influences hard drive lifespans, from what I have seen, is whether you move the machine around while the drive is spinning. People with laptops that use them as desktop replacements (ie, set it down on a hard surface and leave it there, turn it on, use an external keyboard / mouse, turn it off before moving it) have (in my observations) had their drives last a LOT longer than people that use their laptops as mobile computers (ie sit them on their laps, pick them up and carry them around under their arms while running, etc.) I'm talking an order of magnitude difference in failure rates.

      --
      Glonoinha the MebiByte Slayer
    31. Re:Did they ever name the brands? by Glonoinha · · Score: 1

      Especially as there's two different sizes of Maxtor 160s, a 152GiB and a 149GiB, just to confuse matters (160,000,000 KiB ...

      Do I need to participate in this conversation?

      --
      Glonoinha the MebiByte Slayer
    32. Re:Did they ever name the brands? by Anonymous Coward · · Score: 0

      If you'd bother to read the freaking article, you would have read that it was over 100,000 drives in the analysis. But go ahead and criticize it without reading it. Idiot.

    33. Re:Did they ever name the brands? by TheThiefMaster · · Score: 1

      I was just trying to be clear. Personally I hate the GiB MiB and so on abbreviations, but until everyone uses the same definition of "MB" and "GB" it pays to be clear. Especially as the discussion was about three different definitions of GB, making the two 160 Maxtor GB come out as 152GB and 149GB using the real definition.

      It's still stupid that they have two differently sized "160GB" hard-disks. It caused me more than a little trouble when I tried to enlarge my raid-array, as it was the newer drives which were smaller...

    34. Re:Did they ever name the brands? by pugugly · · Score: 1

      There was a section related to boobies?!?!

      Oh, yeah, they *did* mention that there were high utilization drives . . .

      --
      An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media
    35. Re:Did they ever name the brands? by Glonoinha · · Score: 1

      You aren't the only one I have heard having 'issues' with Maxtor drives, although this business with having different sized 160G drives is a new twist.
      Personally, lately I have been sticking with drives that use the FDB (fluid dynamic bearing) technology, specifically I have been using the Hitachi Deathstar (although I did pick up a few Seagate Barracuda drives on sale recently; they also have the fluid bearings and have 5 year warranty.)

      --
      Glonoinha the MebiByte Slayer
    36. Re:Did they ever name the brands? by TheThiefMaster · · Score: 1

      Just to be completely clear, I've only had trouble with the 80GB and 160GB maxtors, the earlier 20GB and later 250GB and 300GB are all fine.

    37. Re:Did they ever name the brands? by Dr.+Spork · · Score: 1
      Ugh, do we really have to read the article for you? Well, here you go:

      Most age-related results are impacted by drive vintages. However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.
    38. Re:Did they ever name the brands? by Anonymous Coward · · Score: 0

      I had a pair of 250GB Maxtors go splat on me within months of each other. They were purchased within months of each other, too. Lasted about 2 years apiece, just past the warranty and then started failing over large swaths of the disks. Thankfully I didn't lose much of anything, and the local fileserver now has a 500GB Seagate and a recent (post-Deathstar generation) 500GB Deskstar for 1 TB of space, and a much longer warranty on each than the Maxtors.
        Both the Maxtors, I should probably note, ran considerably hotter than the replacement drives. One of them was almost hot enough to burn me when I pulled it out, though I hadn't noticed it being that hot a few months before when I opened the case for some odd reason. If you've got Maxtors, keep an eye on their temperature.

    39. Re:Did they ever name the brands? by bitbucketeer · · Score: 1

      No shit! I bought four 250G OEM Maxtor SATA II drives and made a RAID0+1 array and when one died, I bought another 250G OEM drive from the same vendor and got the "ever so much smaller" version. I finally found out that the vendor had unboxed retail drives and sold them as OEM (could tell from the firmware version). The smaller drive I bought later was a batch the vendor had bought from Dell (again, could tell from the firmware version). The Dell drives were also 10x slower!! The "retail" drive took one hour to low level format, while the "Dell" drive took 10 hours to low level format (both on the same computer booted with the latest MaxBlast CD). Oh, and the vendor was Newegg.com. Let 'em come sue me for relating what was told to me by a Maxtor tier 2 support engineer who clued me to what the firmware version numbers actually mean.

    40. Re:Did they ever name the brands? by BillX · · Score: 1

      Nope, intentionally unmentioned. Besides ticking off whoever's at the bottom and inviting a lawyerly pissing competition, I think this makes sense since the study (though fairly controlled) is representative of only a very narrow range of usage patterns (essentially constant-on, temperature controlled, no users...)... What has the best reliability in a server farm may well have disadvantages in a home environment (susceptibilities to frequent power cycles, moderate recurring shock such as users swinging their legs into the PC under a desk, or cats jumping up on it, etc.).

      --
      Caveat Emptor is not a business model.
    41. Re:Did they ever name the brands? by Peter+Mork · · Score: 1

      Your observation (x10 density in 5 years) actually matches Moore's law almost perfectly. In 5 years there are 3 1/3 18-month spans. According to my calculator, 2^(10/3) is roughly 10.08.
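The calculator check in the comment above is easy to reproduce, and it can also be run backwards to see what doubling time a tenfold increase in five years implies. A small sketch, with function names chosen here for illustration:

```python
import math


def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Total growth after `years` if capacity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


def implied_doubling_months(factor: float, years: float) -> float:
    """Doubling time (in months) implied by growing `factor`-fold over `years`."""
    return years * 12 * math.log(2) / math.log(factor)


five_year_growth = growth_factor(5)               # 2 ** (10/3)
doubling_for_10x = implied_doubling_months(10, 5)

print(f"Doubling every 18 months gives {five_year_growth:.2f}x in 5 years")
print(f"10x in 5 years implies doubling every {doubling_for_10x:.2f} months")
```

Both directions agree to within a fraction of a percent, so "ten times every five years" and "doubling every 18 months" describe essentially the same curve.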

    42. Re:Did they ever name the brands? by Glonoinha · · Score: 1

      Ahhh yes, the old 'let's break a guy's theory by injecting a little reality' trick ... (good catch.)

      In that case, as I was saying ... Moore's law, which says that transistor density (which loosely translates to performance / speed) doubles every 18 months, is highly related to CPUs and memory and other chip-based hardware, but has nothing to do with hard drive capacity. Moore made that observation in 1965 (long before the Winchester 30-30 hard drive made hard drives an even remotely viable consumer technology,) and to hear him recall it he asserts that the time frame is 'every 24 months.' In an amazing coincidence, however, hard drive capacity has conformed precisely to Moore's law for at least the last 15, possibly 25+ years. According to Wikipedia, pixel density (for a given cost) also follows a similar growth rate.

      --
      Glonoinha the MebiByte Slayer
  4. Google had this paper ready a year ago by Anonymous Coward · · Score: 3, Funny

    But the disk it was on failed.

  5. Conclusion by llZENll · · Score: 3, Informative

    This is awesome, but the conclusion of such an interesting study leaves a lot to be desired. FTA...

    "In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.

    One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

    Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."
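
    A rough sketch of why both statements in that last paragraph can hold at once: a signal can carry a large relative risk and still miss most failures. All counts below are invented for illustration and are not taken from the paper; the function names are hypothetical too.

```python
def relative_risk(fail_exposed: int, n_exposed: int,
                  fail_unexposed: int, n_unexposed: int) -> float:
    """Failure rate among drives showing a signal divided by the rate among those without it."""
    return (fail_exposed / n_exposed) / (fail_unexposed / n_unexposed)


def recall_ceiling(failed_total: int, failed_with_no_signal: int) -> float:
    """Best-case share of failures a signal-only predictor could ever flag."""
    return (failed_total - failed_with_no_signal) / failed_total


# Invented example: 1,000 drives with a scan error, 30 of which fail;
# 5,000 drives without one, 5 of which fail.
print(relative_risk(30, 1000, 5, 5000))

# Invented example: of 1,000 failed drives, 400 never raised any signal.
print(recall_ceiling(1000, 400))
```

    However strong the relative risk is, the predictor cannot flag drives that never raised a signal, which caps how many individual failures it can catch.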

  6. That would be corporate dynamite by Traf-O-Data-Hater · · Score: 5, Insightful

    I noticed this too. If a Google-sanctioned report had charts of which brands were more reliable, this would do serious damage to the brands that didn't perform so well. No wonder they sidestepped the whole issue!

    1. Re:That would be corporate dynamite by MrZaius · · Score: 3, Interesting

      It's no wonder that Google sidestepped the issue, but, if you assume they purchase primarily from the manufacturers that are more reliable, perhaps those manufacturers will begin to gloat and publish numbers about their Google contracts, if this study gains traction.

    2. Re:That would be corporate dynamite by EonBlueTooL · · Score: 4, Insightful

      Google:Organizing all the world's information and making it universally accessible and useful(unless it could be troublesome)

    3. Re:That would be corporate dynamite by Antique+Geekmeister · · Score: 4, Insightful

      I'm confident that Google is fairly drive agnostic: you just can't run distributed networks that large and stay locked into a single vendor. And given that even reliable vendors have disasters like the IBM Deskstar drives some years ago, and given the remarkable growth of drive sizes over time, there's just not much point for them in buying the extremely stable but vastly more expensive hardware. They've doubtless learned that hardware flexibility provides valuable software flexibility.

    4. Re:That would be corporate dynamite by devilspgd · · Score: 2, Insightful

      Organizing and making accessible information which is already available is one thing, producing information is completely different.

      --
      Give a man a fish, he'll eat for a day, but teach a man to phish...
    5. Re:That would be corporate dynamite by Anonymous Coward · · Score: 0

      No one's talking about a single vendor (GP said "manufacturers"), much less being "locked in". The hard drive market != the desktop OS market. But when you're unnaturally obsessed with one, as most are here, you tend to erroneously see and think about everything in those same terms, and it reveals itself in patterns of speech.

    6. Re:That would be corporate dynamite by Anonymous Coward · · Score: 0

      So Google believes it's impossible to believe that drives exist or not?

      No? Perhaps they are neutral rather than agnostic, then.

    7. Re:That would be corporate dynamite by fred911 · · Score: 1

      Didn't Seagate have a disaster with stiction on their RLL drives? ... Seems I remember taking apart some 10 mb RLL drives and cleaning them with windex. Worked every time.

      ps... they cost about $300 then.

        And you call yourself an antique:-)

      --
      09 F9 11 02 9D 74 E3 5B - D8 41 56 C5 63 56 88 C0 45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
    8. Re:That would be corporate dynamite by Jah-Wren+Ryel · · Score: 5, Insightful

      Google:Organizing all the world's information and making it universally accessible and useful(unless it could be troublesome)

      Old Google Motto: Don't do anything evil.
      New Google Motto: Don't get into trouble.

      --
      When information is power, privacy is freedom.
    9. Re:That would be corporate dynamite by vtcodger · · Score: 1
      If I were writing this paper, I wouldn't specify brands and neither would most people. Technologies change so fast that the manufacturers that were not too great a year or three ago may be superior this year and the makers that were great in the past may well be dogs this year. Why ask for trouble?

      Exception: If one specific manufacturer or model were clearly an utter disaster, I might exclude the data for that manufacturer-model in order to keep from contaminating the results. In that case, I probably would name names. But apparently nothing the authors dealt with was that bad.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    10. Re:That would be corporate dynamite by jamesh · · Score: 2, Funny

      Not that far removed from the motto of several other large companies:

      "Don't get caught doing anything evil."

    11. Re:That would be corporate dynamite by that+this+is+not+und · · Score: 1

      Did anybody ever make an RLL encoded 10MB drive? I seem to remember that the first 5 and 10MB drives were all MFM encoded. My first drive was a 5MB MFM Drive. It was a Shugart and Associates drive, from before Shugart left the company he founded and went to Seagate. It may have been a model ST-506.

      Shugart 5MB hard drives were a LOT more expensive than $300. Think in the four figures, and not the lower ones. I purchased mine used at a surplus shop, of course.

    12. Re:That would be corporate dynamite by gbjbaanb · · Score: 3, Informative

      When a friend broke down, she asked the breakdown man who came what were the most reliable cars. He said he wasn't allowed to comment but that "he carried no honda parts". I guess the same thing applies here - Google won't say, they'd get sued.

      On the other hand, hard drives change so much that this year's model will be a totally different design and mechanics than next year's, so blaming (say) IBM for its crappy Deskstar range should not be reason to blame their (ok, Hitachi's) current line.

      If you do want to know more about which drives are best - check out StorageReview and enter details of your drives to their reliability database.

    13. Re:That would be corporate dynamite by toddestan · · Score: 1

      They could have still broken it down as "Brand A", "Brand B", "Brand C", etc. and that could have at least told us whether the brand really matters (which is what I have seen from my admittedly small sample size), or if all the brands are pretty much the same.

    14. Re:That would be corporate dynamite by FuturePastNow · · Score: 1

      It says right in the paper that Google buys whatever drives it can get the best bulk deals on, and that this changes from one manufacturer to another over time.

      They were careful to repeat that no single manufacturer had a statistically significant problem, but it still sucks that they didn't break down their results by make and model.

      --
      Give a man fire, and you warm him for the night. Set a man on fire, and you warm him for the rest of his life.
    15. Re:That would be corporate dynamite by hicksw · · Score: 1

      Fixed it for you: "Don't get caught doing anything evil that loses money"

    16. Re:That would be corporate dynamite by Anonymous Coward · · Score: 0

      If I were Google, I would stick with the manufacturer that has the longest warranty. Period.

      If you have 200,000 drives, you want some return on investment and data integrity of one particular drive doesn't matter as long as data overall is not corrupted. Google is definitely running mirroring schemes, so HD failure doesn't mean data failure.

  7. Similar paper by reset_button · · Score: 4, Informative

    I was at the talk, and it was very interesting. CMU also had a paper (PDF) about disk failures in the same conference (in fact, they presented one after the other).

    1. Re:Similar paper by Driador · · Score: 1

      I was also there this year; both papers presented were very interesting. I have the feeling I will be chewing over the printed Proceedings book here for a while.

  8. and in the meanwhile... by pedantic+bore · · Score: 3, Informative
    ... at the same conference, Bianca Schroeder presented a paper on disk reliability that developed sophisticated statistical models for disk failures, building on earlier work by Qin Xin and a dozen papers by John Elerath...

    C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...

    --
    Am I part of the core demographic for Swedish Fish?
    1. Re:and in the meanwhile... by oGMo · · Score: 3, Insightful

      While at a glance, it may seem like this is simply "the latest thing google did," and... let's be honest, given the editor in question... this was most likely the reason it made the front page. But while Bianca Schroeder's report, for instance, uses statistics from various unnamed sources and for various unnamed uses, the Google report is interesting because we know exactly where it's coming from and what it's being used for.

      Of course, a truly insightful story would have taken this opportunity to compare Google's findings with the others and report on that.

      --

      Don't think of it as a flame---it's more like an argument that does 3d6 fire damage

    2. Re:and in the meanwhile... by RedWizzard · · Score: 1

      While at a glance, it may seem like this is simply "the latest thing google did," and... let's be honest, given the editor in question... this was most likely the reason it made the front page.

      I can't speak for the editor in question, but I suspect it made the front page because it's likely to be interesting to a lot of people. We all have hard drives and most of them are ATA drives. None of us want to lose our data.

      Of course, a truly insightful story would have taken this opportunity to compare Google's findings with the others and report on that.

      The Google paper does that itself.
  9. With all of Google's cash... by Anonymous Coward · · Score: 0, Offtopic

    you'd think they could afford statisticians. Survival analysis anyone? http://cran.r-project.org/
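
    For anyone wondering what survival analysis would look like here, below is a minimal Kaplan-Meier sketch. It is plain Python rather than one of the R packages behind the link above, and the drive lifetimes are made up. A drive still running when observation ends counts as censored rather than failed.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: how long each drive was followed (any consistent time unit).
    observed:  True if the drive failed at that time, False if it was
               still working when observation stopped (censored).
    Returns a list of (time, survival_probability) at each failure time.
    """
    failures = Counter(t for t, failed in zip(durations, observed) if failed)
    exits = Counter(durations)  # every drive leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # drop failed and censored drives after time t
    return curve


# Hypothetical data: years in service, and whether the drive failed
years = [1, 2, 2, 3, 4]
failed = [True, False, True, True, False]
for t, s in kaplan_meier(years, failed):
    print(t, round(s, 3))
```

    Unlike a plain failure percentage, this handles drives that entered the population late or were retired early, which matters for a fleet that keeps changing.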

  10. Re:You guuuyyys... by Anonymous Coward · · Score: 0

    Why don't you just set your browser to 'download PDF files to disk' instead of 'opening PDF files in browser window'. That way, you can always abort the download, or better still, continue browsing while the PDF downloads?

  11. SMART works for me. by shadowofdarkness · · Score: 0

    I find SMART to work at detecting failures. A couple months ago I turned on my laptop and it gave me a SMART error saying my hard drive was going to die soon. But gave me the choice of continuing bootup which I did from a livecd to make one final backup. I never lost one bit of data thanks to dd'ing to another computer on my network.

  12. Re:You guuuyyys... by westyvw · · Score: 0, Offtopic

    What browser? Don't you have a kpdf or xpdf ready to read it???? DUH

  13. Re:You guuuyyys... by avalys · · Score: 0, Flamebait

    Why should we have our screens cluttered up with "PDF ALERT! PDF ALERT!" because you can't figure out how to configure your system properly?

    If you're too lazy or lacking in knowledge, buy a Mac - PDFs load instantly in OS X right out of the box.

    --
    This space intentionally left blank.
  14. Translation by jd · · Score: 3, Funny
    "We don't want to be sued to within an inch of our lives by certain very wealthy brands, due to US law allowing manufacturers to prohibit unfavourable reviews."

    Ideally, they would have formatted the text to spell out the names of the brands if you take the first letter of every Nth word, or some specific column of text. (Or maybe they have...)

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    1. Re:Translation by David+Price · · Score: 5, Insightful

      More likely: "We buy millions of dollars worth of drives each year, and our buying decisions are driven in part by the reliability data that we collect. If we told everyone what kind of drives work best, more people would buy those drives, driving up the price that we pay."

    2. Re:Translation by the_womble · · Score: 4, Insightful

      Another translation: Our competitors buy millions of dollars worth of drives as well. We are not going to help them avoid the duff ones.

    3. Re:Translation by bendodge · · Score: 3, Funny

      How did that get modded insightful? When there is more demand the price goes down, not up!

      --
      The government can't save you.
    4. Re:Translation by Schraegstrichpunkt · · Score: 1

      How did that get modded informative? That's not informative. This is informative.

    5. Re:Translation by Alien+Being · · Score: 1

      "When there is more demand the price goes down, not up!"

      That would make for some very strange auctions.

    6. Re:Translation by jlarocco · · Score: 1

      How did that get modded insightful? When there is more demand the price goes down, not up!

      Sigh. That's the most misinformed post I've ever seen on Slashdot. Demand, by itself, says absolutely nothing about the price of something.

    7. Re:Translation by Jahz · · Score: 1

      More likely: "We buy millions of dollars worth of drives each year, and our buying decisions are driven in part by the reliability data that we collect. If we told everyone what kind of drives work best, more people would buy those drives, driving up the price that we pay."

      You tard. Demand and price in a free market are reversely proportional. Go back to high school economics! Not only that, but the great drive company mentioned would probably get more press and money, leading to more R&D and even better drives.

      I wish Google released the data they found because it would force the crappy drive companies to improve their products.
      --
      There are 10 types of people in the world. Those who understand binary and those who do not.
    8. Re:Translation by spisska · · Score: 5, Insightful

      Another translation:

      We're not so bloody stupid to believe that our competitors are standing in the aisle of Circuit City and scratching their head over whether to buy a Seagate or WD drive.

      We know that our competitors all have their own metrics and their own relationships with manufacturers and frankly, we don't care. We know our competitors also measure these things, and we're not telling them anything they don't already know.

      We aren't particularly worried about saying that some drives fail, because everyone who cares already knows that some drives fail. Everyone whose job it is to know which drives fail first already knows that as well.

      But we're not going to tell you which brand fails at a higher rate than normal because we don't need a lawsuit that would cost us a lot of money but in the end would only confirm what the people who need to know these things already know.

      We will, on the other hand, describe the tests we ran, our methodology, our results, and our analyses. We do this just for kicks and we hope you can learn something from the results.

      And we hope you have a nice day.

    9. Re:Translation by the_womble · · Score: 1
      RTFA. It says: However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.

      It is clearly not proprietary to the drive manufacturers, because it came from Google's study. This means they regard it as proprietary to themselves.

      How do you know that their competitors have done equally good studies? Given the large population (100,000) and the fact that people are surprised even by some of the published conclusions, it is very likely that there are things in there that their competitors do not know.

      But we're not going to tell you which brand fails at a higher rate than normal because we don't need a lawsuit that would cost us a lot of money but in the end would only confirm what the people who need to know these things already know.

      Unless they have signed some sort of NDA, agreeing not to release test results, what exactly can they be sued for? Unless you can either find evidence that buyers of hard drives sign NDAs, or specify some other grounds on which they could be sued, this sounds plain wrong to me.

    10. Re:Translation by osu-neko · · Score: 2

      Demand and price in a free market are reversely proportional.

      One way to spot someone who doesn't really understand economics is how quickly they make statements like that. You would need to know a lot more about the thing in question before being able to make a generalization like that. Sometimes, they're directly proportional, sometimes, they're reversely proportional, and sometimes they're neither. It depends on a lot of other things which relationship holds true, if any.

      --
      "Convictions are more dangerous enemies of truth than lies."
    11. Re:Translation by Anonymous Coward · · Score: 0

      Another Translation:

      @%)%*# Disk Failure.

    12. Re:Translation by Eivind · · Score: 2, Insightful
      It's not that surprising. The only mildly interesting thing I see is that high load seems to *not* increase failure-rates much, other than the first few months. They hypothesize that this may be because some drives don't handle high load -- and die early -- however those drives that survive the first ~6 months with high load are the more robust ones, and those hold up well.

      Makes sense. Killing the weaker infants makes the adult population healthier.

    13. Re:Translation by encoderer · · Score: 1

      1. You can get sued for just about anything. Doesn't mean you'd necessarily lose, but it's still a pain in the ass.
      2. Of course they said it's proprietary. This is akin to an executive leaving the company for "personal reasons" after he was just fired. Do you think they'd actually say "We've left this out because our lawyers said so"
      3. Do you think that Google's major competitors and other internet operations (like Amazon) are run by bumpkins? The thought didn't occur to them to find a way to buy the best hard drives possible for their hundreds of thousands of servers?

    14. Re:Translation by Anonymous Coward · · Score: 0

      I used to work at Amazon - they really are run by bumpkins!

    15. Re:Translation by bazorg · · Score: 2, Funny

      We made this interesting and useful HDD test which we made public however it lacks some details as it's still in Beta.

    16. Re:Translation by Anonymous Coward · · Score: 1, Insightful



      Demand and price in a free market are reversely proportional.

      One way to spot someone who doesn't really understand economics is how quickly they make statements like that. You would need to know a lot more about the thing in question before being able to make a generalization like that. Sometimes, they're directly proportional, sometimes, they're reversely proportional, and sometimes they're neither. It depends on a lot of other things which relationship holds true, if any.

      It could probably have been better stated, "demand and price in a free market are reversely proportional, in the long term assuming that there are no barriers to entry" and bearing in mind it'd cost a few bn dollars to set up an enterprise hard disk company then it doesn't really apply here.

      And realistically even if Google did say "Hitachi disks are 10 times more reliable than everyone else's", who apart from a few thousand geeks would even know to be able to make buying decisions based upon it?

      Alex

    17. Re:Translation by sabinm · · Score: 1

      The most likely translation:

      Our researchers worked diligently and professionally to produce the best analysis they could. They gave the relevant data, including the brand names associated with the data, to our PR AND MARKETING department. After careful review, management at the company found that revealing the names of the companies would bring unintended consequences to Google and its investors, hence the report was returned to PR AND MARKETING where the brand names were redacted. Marketing informed the researchers of the changes and then prepared the paper to be released to the public.

      --
      http://cincyboys.blogspot.com/ Everything Cincinnati. Including the word 'Finnih'
    18. Re:Translation by evilbessie · · Score: 1

      Or possibly brand x was good for 2001, brand y for 2002, brand x again in 2003, but they were bottom of the heap in 2004. It's all pointless data, as you don't want to go and buy a 5 year old disk. Drives have moved on, and releasing the data could very well be counterproductive, as the worst performing drives could now be the best, although they probably won't stay that way for long. In much the same way, Intel and AMD, nVidia and ATI keep trying to outdo each other. And anyway, buying disks, you pays your money and you takes your choice: if it fails it fails, and if you don't have redundancy and backups for sensitive data it is your own damn fault and I will laugh at you.

    19. Re:Translation by maxume · · Score: 1

      One of their conclusions was that vintage mattered more than brand; reliability is more or less related to specific models. There isn't a 'free market' in specific models: there is generally going to be a certain number of a specific model available, and a percentage of a manufacturer's production devoted to that model, with relatively slow (in comparison to the market) ability to increase production of that drive. So given a fairly limited supply (it's relatively inelastic...), increased demand will indeed drive the price up, in some fashion or another.

      --
      Nerd rage is the funniest rage.
    20. Re:Translation by Eivind · · Score: 2, Interesting
      So, you're saying there is no better choice, and can be no better choice, than simply selecting a disc randomly ? It's possible it is like you say -- that each year the stats change enough that no consistent trends are recognizable. It is however also possible (I'd say likely even) that different manufacturers are different statistically over time.

      You need backups anyway, that's not the point. But it makes a difference for your maintenance-costs if you experience 1% of your disc-drives dying in an average year or 5%.
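
      To put rough numbers on that difference, here is a small sketch. The fleet size and per-replacement cost are made-up assumptions; only the 1% and 5% rates come from the comment above.

```python
def expected_annual_failures(fleet_size: int, annual_failure_rate: float) -> float:
    """Expected number of drives that die in a year at a given failure rate."""
    return fleet_size * annual_failure_rate


def expected_annual_cost(fleet_size: int, annual_failure_rate: float,
                         cost_per_replacement: float) -> float:
    """Expected yearly spend on replacements (drive plus labour, lumped together)."""
    return expected_annual_failures(fleet_size, annual_failure_rate) * cost_per_replacement


FLEET = 1000              # hypothetical number of drives
COST = 150.0              # hypothetical cost of one replacement

for rate in (0.01, 0.05):
    failures = expected_annual_failures(FLEET, rate)
    cost = expected_annual_cost(FLEET, rate, COST)
    print(f"AFR {rate:.0%}: ~{failures:.0f} dead drives, ~{cost:.0f} per year")
```

      The cost scales linearly with the failure rate, so going from 1% to 5% means five times as many swaps for the same fleet.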

    21. Re:Translation by toddestan · · Score: 1

      I would have to say that the temperature thing is interesting too. Conventional wisdom is to keep the drives as cool as possible. However, they seem to have found that overzealous cooling of the drives is almost as bad as letting them get excessively hot.

  15. Very true! by SuperKendall · · Score: 1

    That would have been the perfect way to divulge this data without causing direct harm to any maker - I would really have liked to see if there was a large variance between brands, which might even lead me to purchase brand Y more, even if it's not at the top of the reliability chart - just so long as it was cheaper.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  16. Temperature conclusion by phasm42 · · Score: 4, Interesting

    Their statistics on temperature seem very unusual. I'm surprised they didn't explore this more. For example, is the high failure rate associated with low temperatures because the drives were more likely to be inactive due to failure?

    --
    "No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
    1. Re:Temperature conclusion by Chalex · · Score: 2, Insightful

      The chart implies that the "optimal" operating drive temperature is 35-45 Celsius. Drive temperatures below room temperature (below 22 Celsius) are probably not a scenario that drive manufacturers optimise for.

    2. Re:Temperature conclusion by gnu-sucks · · Score: 3, Interesting

      My guess is this graph on temperature distribution is more or less a graph of temperature sensor accuracy. I can't imagine that drives at 50C had the lowest failure rate.

      While this would require a more laboratory-like environment, a dozen drives of each type and manufacture could have been sampled at known temperatures, and a data curve could have been established to calibrate the temperature sensors.

      There are lots of studies out there where drives were intentionally heated, and higher degrees of failure were indeed reported (this is mentioned in the google report too). So the correlation is probably still valid, just not well-proven.

    3. Re:Temperature conclusion by bouis · · Score: 2, Insightful

      If hard drives are anything like car engines [especially those made with iron and aluminum], the designers have taken the standard operating temperature into account in the design. The parts of varying composition fit together best at the right temperature, and temperatures higher or lower result in damage or accelerated wear.

      This is why, if you want your engine to last, you should let your car warm up before driving it hard.

    4. Re:Temperature conclusion by KZigurs · · Score: 1

      I am actually unsure if this could be the case - just as well it might be that HDDs are manufactured with the average desktop in mind and are tuned to tolerate a bit higher temperature range overall. This would also explain increased failures when getting too hot, thus leaving the optimal range again.

    5. Re:Temperature conclusion by Skater · · Score: 1

      Last week, my house was without power for two days. When the power finally came back on, it was 44 degrees F (7 C) inside my house. When I turned on the computer at that temperature, the hard drive was screaming - sort of a high-pitched wail. The computer didn't even boot the first time - I don't have a monitor hooked to it (I use it as a server), but I think it never made it past the BIOS screen, and my guess is that the hard drive was the culprit. After 20 minutes or so of trying to figure out if it was a network problem or a system problem, I turned it off and restarted it. That time the computer was warmer and booted completely normally. I definitely will think twice before starting a computer that's at that temperature again.

    6. Re:Temperature conclusion by Anonymous Coward · · Score: 0

      The correlation between temperature and data loss is pretty well proven and it's called the superparamagnetic effect.

      It happens when a grain of ferromagnetic material on the platter looses it's magnetization due to temperature under the Curie temperature of said material. The probability of it happening is directly proportional to temperature (how it's close to the Curie point) and inversely proportional to size of the grain and (magnetic) hardness of the material.

      Hitachi's "vertical bit" technology allowed the use of magnetically harder materials for smaller grains to achieve greater data density.

      I guess this paper shows this isn't the main cause of drive failures because it is well proven and understood.

    7. Re:Temperature conclusion by Bazer · · Score: 1

      The correlation between temperature and data loss is pretty well proven and it's called the superparamagnetic effect.

      It happens when a grain of ferromagnetic material on the platter looses it's magnetization due to temperature under the Curie temperature of said material. The probability of it happening is directly proportional to temperature (how it's close to the Curie point) and inversely proportional to size of the grain and (magnetic) hardness of the material.

      Hitachi's "vertical bit" technology allowed the use of magnetically harder materials for smaller grains to achieve greater data density.

      I guess this paper shows this isn't the main cause of drive failures because it is well proven and understood.

    8. Re:Temperature conclusion by MobyDisk · · Score: 1

      I can confirm this. I worked on computers in a warehouse with little to no heat at night, and the hard drives needed to warm-up in the morning before the computers would start.

      Maybe someone can make a motherboard that dumps waste heat from the CPU into the hard drive if it is below a certain temperature. :-)

    9. Re: Temperature conclusion by Anonymous Coward · · Score: 0

      It happens when a grain of ferromagnetic material on the platter looses it's magnetization...

      Shouldn't that be "releases" or "loosens"?

      Of course if you are (apparently) contradicting a study, you really should have some references to back up your assertion.

    10. Re:Temperature conclusion by Doppler00 · · Score: 1

      They wouldn't be collecting SMART data if the drives were inactive. I always get a chuckle when I see hard drive cooler products. It's amazing people think that just because cooling a CPU is a good thing that you have to cool your hard drive. It's why studies like this from Google are important. There is too much lore and unscientific wishy-washy in the computer mod/maintenance world. Google's HUGE data set (vs. your limited life experience with maybe 10 hard drives) really proves the temperature conclusion. Keep that hard drive warm and running (although not 24/365 if you want it to last longer).

      Also, the data shows higher temperatures in older drives having high failure rates. Probably due to the hard drive itself contributing to the higher temperature (more friction, more current draw in motor, etc...)

    11. Re:Temperature conclusion by Reziac · · Score: 1

      Once upon a time my test rig lived in an unheated space. It had a whopping 20MB (not GB) HD, a very early IDE model. When it was very cold (under about 50 degrees), the HD would spin up but wouldn't read. After it ran for a few minutes and started to warm up, it would read files, but still wouldn't boot. And after another few minutes, it reached operating temperature and would finally boot up normally.

      BTW I still have the drive (a W.D. dated 1991), and it's still 100% perfect.

      --
      ~REZ~ #43301. Who'd fake being me anyway?
    12. Re: Temperature conclusion by Bazer · · Score: 1

      Doh. I ended up being too terse and misspelled the word "lose".
      The grain loses its original magnetic moment. Its direction becomes reversed, causing data corruption.

      Here's the paper.
      http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=809134

  17. Re:So by jd · · Score: 1

    You take the Google paper and the twenty others on disk failure, take the third page of each, sort them by their papers' Google rankings and take the middle letter of every 42nd word, whilst standing in the middle of a pentagram under the second full moon of the month.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  18. Re:You guuuyyys... by Anonymous Coward · · Score: 0

    Just kill acrord32.exe. Firefox recovers and gives you a blank page with control back to you.

  19. Lower temp == higher failure rates by flyingfsck · · Score: 4, Interesting

    To my mind the most significant piece of info: "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."

    --
    Excuse me, but please get off my Pennisetum Clandestinum, eh!
    1. Re:Lower temp == higher failure rates by beavis88 · · Score: 1

      But did the lower temperature actually cause the failures? Such a counterintuitive conclusion seems like it'd be worth some further examination...I can turn off some fans in my cases and get the drives back up into the 40-45C range pretty quickly if need be!

    2. Re:Lower temp == higher failure rates by Anonymous Coward · · Score: 2, Insightful

      perhaps there is some correlation between lower temperature and higher forces, ie. a drive that starts and stops frequently may have a lower temperature, but would undergo more acceleration and stress

    3. Re:Lower temp == higher failure rates by Mostly+a+lurker · · Score: 2, Insightful

      Yes, the low temperature finding is most interesting. I have an hypothesis as to what might be going on. I suspect that absolute temperatures, within certain limits, are not important to drive reliability, but that temperature variation is. Drives that, because of their location and pattern of use, tend to fluctuate in temperature between, say, 20 and 35 degrees centigrade are being stressed more than those at a steady 40 degrees.

    4. Re:Lower temp == higher failure rates by Simon+Brooke · · Score: 1

      But did the lower temperature actually cause the failures? Such a counterintuitive conclusion seems like it'd be worth some further examination...

      Looks like it from the data, and TBH I don't find it at all counter-intuitive. Many materials contract when cold, expand when warm - but different materials do so at different rates. Mechanical tolerances in a hard disk drive are very close, so running at the wrong temperature will inevitably cause more failures. Drives must be designed to run fairly warm, because 99% of all drives live in computer enclosures with other drives, processors etc pumping out heat. The ambient temperature in a machine in a server rack is way above normal room temperature.

      --
      I'm old enough to remember when discussions on Slashdot were well informed.
    5. Re:Lower temp == higher failure rates by WuphonsReach · · Score: 1

      I suspect that absolute temperatures, within certain limits, are not important to drive reliability, but that temperature variation is.

      That's my suspicion as well. Minimizing the amount of thermal cycling is probably important. This means adequate cooling to keep drives at a steady temperature state, no matter if they are spinning idle or operating under a heavy seek load. It may also mean that spinning drives down to save on power may cause premature failure due to the thermal variation.

      I guess it would depend on how many times per hour you spin drives down or cycle them from 25C to 45C.

      --
      Wolde you bothe eate your cake, and have your cake?
  20. Proprietary makes sense here by Mammothrept · · Score: 4, Insightful

    "...we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."

    Litigation avoidance may be a consideration here but why not take Google at their word? Google is a search company that buys lots of hard drives. Based on their own internal research, they have developed information about which hard disk models and/or manufacturers are shite.

    Yahoo is also a search company that buys lots of hard drives. Why should Google give that hard drive reliability information to you, me and Yahoo for free? Let Yahoo/Excite/MSN and the competitors figure it out for themselves.

    Yeah, sure I'd like to have access to Google's data the next time I'm in the market for a hard drive but I won't hold a grudge against them if they don't do my consumer research for me. On the other hand, whereinafuck is the data from Tom's Hardware Guide, Anandtech, Consumer Reports and all the other reviewer and consumer sites? If someone doesn't have a handy link to their results, I'll see if I can google something up:

    http://www.google.com/search?hl=en&safe=off&client=firefox-a&rls=com.ubuntu%3Aen-US%3Aofficial&hs=tqy&q=hard+drive+reliability+research+brands++manufacturers+models&btnG=Search

    1. Re:Proprietary makes sense here by Anonymous Coward · · Score: 0, Interesting

      Why not, "Here's what works best for us, maybe this additional data will help improve reliability and help the entire computing field in general."? And maybe everyone in the world (betterment of humanity, that sort of thing?) could benefit from it? Like by avoiding a product line that is demonstrably inferior (No worries about lagging sales, I'm sure Dell would buy them for their discount line of PCs).

      I forget: It's always "fuck people", and "fuck trying to make this world a better place", and "Where's my goddamn profit I'm entitled to?!", and "Get back to work slaves..."

      Yeah it makes sense to lock everything up as proprietary. Nothing to spur progress and prevent waste like having multiple efforts duplicated and hiding the results so nobody is sure what is the best way, and taxing and profiting any way how. I can't wait until they figure out a way to charge us to breathe. Can I get my verichip tracking device embedded in my skull please? Open Source is treason. Zeig Heil her Bush & Blair and Halliburton and Google.

    2. Re:Proprietary makes sense here by Anonymous Coward · · Score: 0

      Apparently Google's secret other motto is "Do no good."

    3. Re:Proprietary makes sense here by StarfishOne · · Score: 1

      Another site that might be useful for the average consumer is http://www.storagereview.com/ :)

    4. Re: Proprietary makes sense here by Anonymous Coward · · Score: 0

      ...shite....

      Why do so many Slashdot posters have fucking trouble spelling? They can't all have eaten lead paint chips, can they?

      Now that would be a study, 100000 Slashdot spelling disabled posters, looking for a common factor, or at least a predictor of failure. Not that SMART would be the acronym....

    5. Re:Proprietary makes sense here by OfNoAccount · · Score: 1

      Reliability data for desktop drives can be found here: StorageReview

      Unfortunately you need to register to see it, but that's what bugmenot was invented for ;)

      I'm afraid I disagree with you though - I think Google and everyone else should make their data public. It would save everyone a lot of pain, and make the manufacturers of unreliable drives actually improve their game.

      Places like tomshardware only review one drive, and most likely for a couple of days - so I don't think that's really their responsibility.

  21. This speaks volumes. by greenguy · · Score: 4, Funny

    Google releases a paper on disk reliability.

    --
    What if I do the same thing, and I do get different results?
  22. Re:So by triffid_98 · · Score: 1

    I've personally had much better luck with manufacturers offering 5 year warranties on their media. This does not include either of the manufacturers you mentioned...

  23. Re:You guuuyyys... by iminplaya · · Score: 1

    The download crapped out. And I couldn't close the tab. It's just an unpleasant surprise when everything locks up for a while. It's just two little words. A simple courtesy, no? For those of us who don't always remember to check the status bar. I just right clicked and saved the link after reloading.

    --
    What?
  24. Re:So by mightyQuin · · Score: 2, Informative

    From my experience, Western Digitals are (relatively) reliable. They unfortunately do not have the same power connector orientation as any other consumer drive on the planet, so if you want to use IDE RAID you have to get the type that either (1) fits any consumer ide drive or (2) fits a Western Digital Drive. (grr)

    Had some good experiences with Maxtor. A couple of years ago (OK - maybe 6 or 8) we had batches of super reliable Maxtors - 10GB.

    Some Samsungs are good, some are evil - the SP0411N was a particularly reliable model - the SP0802N sucked - out of a batch of 20, 15 of them died within a year: all reallocated sector errors beyond the threshold.

    Seagates are a mixed bag too - been having a nice experience with the SATA models 160GB and 120GB - can't remember their model #'s off the top of my head. - The older Seagates, though, I spent a fair amount of time replacing.

    IBM DeskStars, as far as I know, have been quite good - for some reason didn't use too many.

    --
    Now, if you'll excuse me, I've got some idea balls to remove from a manatee tank.
  25. Re:You guuuyyys... n0 sk1llz by Anonymous Coward · · Score: 0

    In my Firefox browser I can see a nice little PDF icon warning me of a PDF file.

    Also, no need to buy a mac, PDFs work instantaneously outta box on my Ubuntu Linux...

  26. So this article.. by shiningdays · · Score: 0, Offtopic

    has quite a few grammatical errors. Is this a result of disk failure?

  27. Thanks, missed that... by SuperKendall · · Score: 1

    It appears that sentence was right after the part I read about how some makers had better results than others. So of course I scan the whole document looking for said data immediately after reading the first part, but did not return to that exact point thinking I had read it already...

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
    1. Re:Thanks, missed that... by iminplaya · · Score: 1

      Though I don't like them for not giving the breakdown, they did mention the fact, kind of like when a user in China tries to access censored sites, and they say it's not allowed. Censorship is everywhere. China uses the government, we use "proprietary". Both achieve the same desired result. I suppose it's a good way to leave the way open for alternative search engines, etc. Google has been eaten by sharks.

      --
      What?
    2. Re:Thanks, missed that... by mabinogi · · Score: 2, Insightful

      Being able to choose freely to not say something is freedom of speech.
      The right to stay silent on something is just as important a freedom as the right to have your say.

      Censorship has nothing whatsoever to do with it.

      --
      Advanced users are users too!
    3. Re:Thanks, missed that... by iminplaya · · Score: 1

      No doubt. I guess we'll never know if they are keeping silent voluntarily.

      --
      What?
  28. TargetAlert for FireFox by goldragon · · Score: 1

    if you use Firefox, get the TargetAlert extension. it adds a small image after links that are pdfs, Word docs, etc. so you'll have some forewarning.

  29. Funniest response of the whole story by SuperKendall · · Score: 0, Redundant

    And I agree with your implication.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  30. Re:So by Anonymous Coward · · Score: 0

    Heh. The OLD Deskstars were great. I had a 1GB and 20GB over the years, and they were fantastic. Some newer 20GB on up, there was a downright scandal about extremely high failure rates on certain lines. It sounds like 1 plant producing them was turning out duds with a near 100% failure rate. IBM sold off the storage division to Hitachi, who now sells Hitachi Deskstars. I can only assume they closed the bad plant, or made sure the clean room was actually clean 8-).

  31. power supplies by digitalhermit · · Score: 1

    This is completely anecdotal, unscientific... Since building out two servers a couple years ago, each with approximately 800G of drive space, I've had to replace drives on average of one every 8 weeks. In my lab there are about twenty drives across 8 machines, so that number is not too bad. Or so I thought. After replacing all my power supplies my drive failures have gone way down. The only drive I've lost recently is one in an older machine with an ancient 300W power supply.

    1. Re:power supplies by Anonymous Coward · · Score: 0

      Why is it anecdotal and unscientific? Nothing they've done strikes me as such. They conducted an elaborate experiment and wrote it up, and made that public.
      Just because they did not consider the power supplies, does not make it unscientific.
      Just because *you* had some mild correlation of hard drive failures and power supplies doesn't make *your* 'study' scientific.

    2. Re:power supplies by okar · · Score: 0, Redundant

      Of course he referred to his own story being anecdotal and unscientific.

      --
      Move. Sig.
    3. Re:power supplies by maestroX · · Score: 1
      Thank you.

      Most drives will fail when the PSU delivers unstable output (ref: http://www.dansdata.com/), though some drives are less sensitive to power fluctuation.

      It's pretty difficult to determine which drives are ok, since the manufacturers update these things every month.

      I would like to hear the *CLUNK* sounds of failing drives at google though ;-)

  32. How about a little color? by Anonymous Coward · · Score: 0

    Would it have killed them to vary the colors in the charts -- Figures 8 and 11-13 are pretty much unreadable even when zoomed above 100%.

  33. Re:So by nevesis · · Score: 2, Informative

    Interesting.. but I disagree with your analysis.

    The DeskStars were nicknamed DeathStars due to their high failure rate.

    Maxtor has a terrible reputation in the channel.

    Seagate has a fantastic reputation in the channel.

    And as far as the WD power connectors.. I have 4 Western Digitals, a Samsung, a Maxtor, and a Seagate on my desk right now.. and they all have the same layout (left to right: 40 pin, jumpers, molex).

  34. OS X SMART tool? by Anonymous Coward · · Score: 0

    So what tool on Mac OS X will provide all the SMART data?

    1. Re:OS X SMART tool? by kimvette · · Score: 3, Informative

      http://sourceforge.net/projects/smartmontools

      Not exactly point & click but it'll do.

      --
      The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
    2. Re:OS X SMART tool? by am+2k · · Score: 2, Informative

      So what tool on Mac OS X will provide all the SMART data?

      I had a disk reporting a SMART failure once. The result was that the disk was red in the list in Disk Utility, but there were no other warnings. So you might want to check Disk Utility once in a while.

  35. Run smartd and look for scan errors by SysKoll · · Score: 1

    Well, the article's conclusion looks pretty clear to me. Watch for scan errors in smartd reports. When they start happening, migrate your data off that disk and replace it.
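
    For anyone who wants to script that check, here is a minimal sketch in Python. It parses the attribute table that `smartctl -A` prints and reports watched counters with a nonzero raw value. The choice of watched attribute names is my assumption about which smartmontools counters best match the paper's scan-error and reallocation signals, and `SAMPLE` is made-up output, not data from a real drive.

```python
# Attribute names as printed by `smartctl -A`. Treating these as stand-ins for
# the paper's scan-error / reallocation signals is an assumption.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def nonzero_watched(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    hits = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


# Made-up sample in the usual `smartctl -A` column layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   038   045   000    Old_age   Always       -       38
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

print(nonzero_watched(SAMPLE))
```

    In practice you would feed it the text captured from `smartctl -A /dev/sdX` and start migrating data as soon as the result is non-empty.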

    --
    Mad science! Robots! Underwear! Cute girls! Full comic online! http://www.girlgeniusonline.com/

  36. I'm obviously behind the times, but... by NeuroManson · · Score: 0

    What is SMART monitoring really good for? Not one drive I've had it enabled has given me a "warning, this drive is about to fail" alert. Instead, it would be a random clunking sound, or the system would freeze up entirely (hard to tell what's at fault at first, if you use Windows.;).

    Now I'm no engineer, but what strikes me as a better alternative is to toss on a 1Gb flash storage chip, and keep a redundant index/record of live/recoverable sectors. As the HD first starts to fail, the BIOS can pop up a warning window on reboot, advising the user to put in a replacement. After which, the HD clones itself automatically, based on the index files. Files that are damaged could be recovered as well, or discarded in the ol' bitbucket. 45 minutes for an OS repair install, and you're done. No scrambling to download everything all over again.

    Another alternative is a hybrid solid state HD (I think there was a /. article about this a while ago). If the HD BIOS detects impending doom, it can just dump the most critical (eg; user files, OS, irreplaceable stuff) to flash, then copy it over to a fresh drive.

    But anyhoo, that's my 2 cents.

    --
    Just because you can mod me down, doesn't mean you're right. Shoes for industry!
    1. Re:I'm obviously behind the times, but... by DragonTHC · · Score: 2, Informative

      that sounds like a great idea, however, flash memory has a habit of failing with no warning whatsoever as well.

      --
      They're using their grammar skills there.
    2. Re:I'm obviously behind the times, but... by sporkmonger · · Score: 1

      Or, you could always try using ZFS instead. But then you'd have to either run Solaris or wait for one of the ongoing ports of ZFS to finally be finished. But yeah... the solution, IMHO, isn't more/better hardware, but rather better software.

    3. Re:I'm obviously behind the times, but... by kasperd · · Score: 1

      Sounds to me like what you are asking for is a RAID controller. But I can't really see the point in building that into the disk. Having them separate gives you a lot more flexibility to choose what kind of RAID you want.

      --

      Do you care about the security of your wireless mouse?
    4. Re:I'm obviously behind the times, but... by Joce640k · · Score: 1

      >"I'm no engineer, but what strikes me as a better alternative is to toss
      > on a 1Gb flash storage chip, and keep a redundant index/record of live/recoverable sectors"

      That's mostly what drives do, except they keep the table on disk instead of flash.

      As sectors fail they get "relocated" somewhere else on the disk.

      This is one of the reasons hard drives are so cheap - making a 100% perfect disk surface is very difficult. Being able to make 99% perfect disks then work around the bad bits keeps the price down.

      > "What is SMART monitoring really good for?"

      It's supposed to tell you about the above process...

      Problem is, the meaning of the numbers is "proprietary" - that means it's not much use to you and me.

      --
      No sig today...
    5. Re:I'm obviously behind the times, but... by EvilIdler · · Score: 1

      >What is SMART monitoring really good for? Not one drive I've had it enabled has given me a "warning, this drive is about to fail" alert.

      SMART itself keeps failures transparent to you, by locking out definitely broken sectors on the drive.
      This is at a level outside the OS, so in theory a write to a bad block should just be a little slower,
      but not freeze the system. I guess the monitoring tools are just useful to tell you when a drive
      is starting to fail. Once you have a few bad blocks, disaster is never far away, in my experience.

    6. Re:I'm obviously behind the times, but... by Anonymous Coward · · Score: 0

      What is SMART monitoring really good for? Not one drive I've had it enabled has given me a "warning, this drive is about to fail" alert.

      You aren't very specific about what software you are using to monitor the errors. If you just flip the "don't boot if the disk is dying" flag on in the BIOS, it's very crude compared to running smartmontools or similar. Even then there are no guarantees.

  37. What do you want to bet by Beryllium+Sphere(tm) · · Score: 1

    that it changes more from year to year and model to model than from one manufacturer to another?

  38. Woohoo! by memnoch37 · · Score: 1

    Now I don't feel bad about turning off SMART reporting all those years ago. I never did trust that crap... On a side note, it would be interesting to see who Google signs their next contract for disk drives with...

    1. Re:Woohoo! by RegularFry · · Score: 1

      Unfortunately, that's exactly the wrong reading of TFA. What they're saying is that absence of SMART failures is no indication of absence of failure, but certain types of SMART failure are very good indicators of impending doom.

      --
      Reality is the ultimate Rorschach.
  39. Re:Proprietary reporting by spisska · · Score: 5, Insightful

    ps.. all their farm is ata/ide?

    You really didn't read the article, did you? On page 3 (Section 2.2 Deployment Details), the authors state: "More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units were put into production in or after 2001. [...] The data used for this study were collected between December 2005 and August 2006."

    What are you waiting for Google to tell you? Are you really accusing them of being evil because they did a study, described their methodology, detailed their results, presented their analyses, and published it all for anyone who is interested?

    You describe their conclusions as:

    Useless

    But there is no contradiction at all if you are smart enough to understand. They are telling you that if SMART identifies a problem with a drive then it is very likely that drive will fail within 60 days. But in a sample of 100,000 drives, many drives will also fail that have not returned errors on SMART scans. Thus SMART is a reliable indicator of impending failure but is not a silver bullet that can recognize and predict all failures before they happen.
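
    To put that in numbers, here is a rough sketch of the difference between "a SMART flag usually means failure" (precision) and "SMART catches most failures" (recall). The counts are hypothetical values I picked for illustration; they are not taken from the paper.

```python
def precision_recall(flagged_failed, flagged_ok, unflagged_failed):
    """Precision and recall of a warning signal.

    flagged_failed   -- drives that raised a SMART warning and then failed
    flagged_ok       -- drives that raised a SMART warning but kept working
    unflagged_failed -- drives that failed with no SMART warning at all
    """
    precision = flagged_failed / (flagged_failed + flagged_ok)
    recall = flagged_failed / (flagged_failed + unflagged_failed)
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(flagged_failed=400, flagged_ok=100, unflagged_failed=600)
print(f"precision={p:.2f} recall={r:.2f}")
```

    With these made-up counts a warning is right most of the time, yet most failures still arrive unannounced. That is the shape of the result the paper describes, whatever the real figures are.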

    Next time you have access to 100,000 hard drives, can analyze patterns of failure among them, can use those failures as a benchmark against which to measure analysis tools, and can come up with better recommendations for predicting failure than this study, then by all means let us know. But if you're looking for Microsoft or Western Digital or Seagate or Yahoo to perform and publish this kind of study for free, I think you may be waiting a good long while.

  40. drive failure != lost data by sxtxixtxcxh · · Score: 0

    so, what do they do with the failed drives? where does that data go? what is their procedure for tossing these drives?

    --
    for a minute there, i lost myself...
  41. Re:Proprietary reporting by Toba82 · · Score: 2, Informative

    It is well known that google uses commodity hardware. SCSI is not commodity, although I'm sure at least some of their servers are high end.

    --
    I pretend to know more than I really do by mooching off google and wikipedia.
  42. Re:So by Nogami_Saeko · · Score: 1

    I had a bad run with Western Digital drives a while back and switched to Maxtors, which I found to be very reliable when they were first putting out 250GB drives. Had a bad experience with a Seagate dropping dead within the first week after purchase, fortunately I got most of my data off of it.

    Seagate also does NOT offer advance drive replacement in Canada, which means I'll never buy another of their products until this policy changes.

    Had good luck with more recent Western Digital drives. Put 5 x 500GB in a RAID-5 server, and they're running great!

    N.

    --
    "Nothing strengthens authority so much as silence." - Charles de Gaulle
  43. Re:So by mightyQuin · · Score: 1

    It's a very slight difference in the positioning of the WD power connector within the physical position on the drive. It's still a 40 pin standard power connector, but you cannot slide it into the housing of an AccuSYS IDE RAID drive bay. You have to order a different AccuSYS model that is specifically for WD parallel IDE drives.

    Out of curiosity, what model of Seagate has the fantastic rep?

    --
    Now, if you'll excuse me, I've got some idea balls to remove from a manatee tank.
  44. The GDRIVE by Shohat · · Score: 2, Interesting

    About a year and a half ago, a presentation by Google concerning a massive online storage service called GDrive was leaked. It was pretty much confirmed that it is operational on some level. The study might have something to do with it, maybe even some kind of clever PR. Just my 2c.

  45. Re:So by mightyQuin · · Score: 1

    As a reliability gauge, replacement policy is important. But I've found that in reality, if a drive fails I don't want another one of the same to replace it.

    Cheers, fellow Canadian.

    --
    Now, if you'll excuse me, I've got some idea balls to remove from a manatee tank.
  46. Re:So by Anonymous Coward · · Score: 0

    You take the Google paper and the twenty others on disk failure, take the third page of each, sort them by their papers' Google rankings and take the middle letter of every 42nd word, whilst standing in the middle of a pentagram under the second full moon of the month.

    And that will give me a predictive model of disk failure? Your ideas intrigue me, sir, and I would like to subscribe to your newsletter.

  47. Re:So by Anonymous Coward · · Score: 0

    Had some good experiences with Maxtor. A couple of years ago (OK - maybe 6 or 8) we had batches of super reliable Maxtors - 10GB.

    Oh yeah, I got a 12GB Quantum Fireball CX from Egghead Software the other day, and I highly recommend it!!!!

    Or did that happen about 10 years ago? It's so hard to remember sometimes...

  48. I read the abstract and the conclusion by mshurpik · · Score: 1

    Their conclusion (and a glance at their results) indicates that drives fail because of product defects. However, home-use parameters such as brown power (low voltage on the line) are probably not taken into account in their server environment.

    It's interesting, and I tend to trust their results, but these conclusions may not be relevant to single-drive situations. That is, if two customers purchase 1 drive each, and neither drive is defective, then this study doesn't explain why one drive would fail before the other. It also doesn't take into account the 1-year warranty foisted on the majority of PC-system purchasers these days.

  49. Re:So by jd · · Score: 1

    Trust me, it can't be any worse than any of the other predictions in the IT industry.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  50. Yes it does by GunFodder · · Score: 1

    There are apparently several SMART parameters that are correlated to eventual disk failure. If a disk starts throwing SMART errors in these categories then your best bet is to replace the disk ASAP. While it may be true that most disks fail without warning that doesn't mean it isn't a good idea to look for early warning signs of failure.

  51. Re:So by drsmithy · · Score: 1

    Had good luck with more recent Western Digital drives. Put 5 x 500GB in a RAID-5 server, and they're running great!

    A 2.5TB RAID5 ? Brave man...

    What's the rebuild time on that baby ?

  52. Economies of scale by Anonymous Coward · · Score: 0

    No, if anything the price would decrease. Do you see why?

    1. Re:Economies of scale by chis101 · · Score: 1

      Unless it becomes a monopoly?

  53. It's in the same paragraph by jgoemat · · Score: 1

    If you read more than the first sentence of the first paragraph in section 3.2, you would see where they said they didn't include this data due to its "proprietary" nature.

  54. How many drives really by hankwang · · Score: 5, Insightful

    The paper claims "more than 100 thousand drives". But the nice thing is that you can derive the actual number from the error bars, for example those in figure 4. The data should be governed by Poisson statistics, which means that the standard deviation in the counts is equal to the square root of the count. However, their error bars seem to be about a factor 2 larger than the standard deviation, because normally around 68% of the data points should lie within one standard deviation from the "smooth curve". Let's assume the error bars are 95% confidence intervals, i.e. 2 standard deviations.

    Look at the data for 20 to 21 C. It tells you that it represents a fraction 0.0135 of their total drive population, with an average failure rate of 7 +- 0.5 %. Following the reasoning above, this 7% should represent 784+-28 drives. Since these represent 7% of 1.35% of the total number of drives, we can derive that the total number of drives is 784/0.07/0.0135 = 830,000 drives. Trying the same thing for 30 to 31 C gives 826,000 drives, which seems fairly consistent.
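
    The same arithmetic as a small Python sketch, in case anyone wants to try other temperature bins. It rests on the assumptions stated above: the error bars are 2 standard deviations wide and the failure counts are Poisson distributed. The function and parameter names are mine.

```python
def estimate_total_drives(afr, half_width, bin_fraction, sigmas=2.0):
    """Estimate the total population from one bin's failure rate and error bar.

    afr          -- failure rate in the bin (0.07 for 7%)
    half_width   -- half-width of the error bar, in the same units (0.005)
    bin_fraction -- fraction of all drives that fall in this bin (0.0135)
    sigmas       -- how many standard deviations the error bar spans (assumed 2)
    """
    sigma = half_width / sigmas
    # Poisson counts: sigma_N / N = 1 / sqrt(N), so N = (rate / sigma_rate) ** 2.
    count = (afr / sigma) ** 2
    # Following the post, treat that count as afr times the total times the bin fraction.
    total = count / afr / bin_fraction
    return count, total


count, total = estimate_total_drives(afr=0.07, half_width=0.005, bin_fraction=0.0135)
print(round(count), round(total))
```

    If the error bars are really 1 standard deviation instead of 2, pass `sigmas=1.0` and the estimate drops by a factor of 4.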

    So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?

    1. Re:How many drives really by Hurricane78 · · Score: 1

      Well, there was an article about that somewhere, some time ago. They're adding 100,000 servers A MONTH.
      At that time they had somewhere around 4 PETAbyte of RAM. With 8 GB per system.
      Roughly calculated, this means that they had

          ~4,194,304 / 8 = ~524,288

      servers at that time.

      Beowulf cluster anyone?
      Running doom 3 anyone? (At the usual one black pixel per system (*g*) this would be a hell of a (black) graphics quality. :)

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    2. Re:How many drives really by CRC'99 · · Score: 2, Insightful

      So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?


      Do you really think that they don't store every cookie and search pattern of everyone who uses their search engine? Cross-reference all this data, alter their ranks, follow your interests, use those to make money and target you with ads?

      There is a ton of money in this information, given enough stored data and the facility to mine it, filter it, and sort it down to location level for various advertising categories.

      Google has been very smart in the way they do business - they make money off studying your habits and selling the result (in the form of stats and/or ads).
      --
      Sendmail is like emacs: A nice operating system, but missing an editor and a MTA.
    3. Re:How many drives really by kingtut · · Score: 1

      "Poisson statistics, which means that the standard deviation in the counts is equal to the square root of the count."

      whatever buddy, that means fish statistics and you know it.

    4. Re:How many drives really by oski4410 · · Score: 1

      Well, actually, Bianca's paper at the same conference showed that even though everyone assumes a Poisson distribution that's not what disk drives actually do...see Figure 5. I have no idea whether Google's population is similar (the data in Bianca's paper is partly from SCSI drives) but what both papers say is that the vintage effect is very pronounced, which suggests that the distribution doesn't look fishy, if you pardon the pun.

    5. Re:How many drives really by hankwang · · Score: 1

      ...even though everyone assumes a Poisson distribution that's not what disk drives actually do... what both papers say is that the vintage effect is very pronounced,

      That does not matter, since I did not look at the vintage data but rather at the correlation with temperature. The vintage effect would only be an issue in my statistical consideration if the temperature range 20 to 21 C were mostly of one specific vintage, while the ranges 19-20 C and 21-22 C were of a completely different vintage. That is not likely. And even then, it would only make the estimation of the standard deviation harder from the data alone, but the Google researchers did provide the error bars themselves and error bars are commonly 68% (1 stdev) or 95% (2 stdev) confidence intervals. Within one temperature interval, failures are still a Poisson process, unless you think the hard drive manufacturers deliberately put a fail-early fault in exactly every 20th drive they sell.

    6. Re:How many drives really by oski4410 · · Score: 1

      Did you read Bianca's paper? I'm not sure how you can ignore the evidence of non-Poisson distributions to maintain your claim that "failures are still a Poisson process"... Also, it's *very* likely that vintage is correlated with temperature. Servers built in 2001 are very likely to have different thermal characteristics (e.g., different airflow, layout, and thus disk drive temperatures). On top of that, the power consumption of disk drives has increased meaningfully over that time period, so that a 2001 5400rpm drive very likely runs cooler than a new 7200rpm drive.

    7. Re:How many drives really by hankwang · · Score: 1

      I did have a quick look at the paper. As I understand it, certain batches of drives purchased at the same time tend to fail after roughly the same time. That leads to an overall non-Poisson distribution of failure events. However, if you have a population of 1 million drives, each of which has a 5% chance of failing at some point, and you take a few random samples of 100 pieces each, you will still get on average 5 faults per sample, with a standard deviation of sqrt(5), according to Poisson statistics.

      The question is whether taking samples according to operating temperature at 1 C intervals is sufficiently random. If each batch of drives (with a non-Poisson failure pattern) falls in exactly one temperature bin, then you are right. However, I find it highly unlikely that their operating temperatures are so close to each other. Their operating temperature is likely to depend on whether they are sitting at the bottom or at the top of a server rack. I would expect that to lead to at least a 5 C temperature spread, which means that each temperature bin represents a mixture of many different vintages, which will add a lot of randomization to the sampling.
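
      A quick simulation can illustrate the sampling claim above. This is a sketch under stated assumptions, not anything from the paper: each drive is modelled as an independent 5% coin flip (a reasonable stand-in for drawing 100 drives out of a million), and the function name, seed, and number of samples are arbitrary choices made here.

```python
import math
import random
import statistics

def simulate_sample_failures(num_samples=5000, sample_size=100, p_fail=0.05, seed=1):
    """Count failures in many random samples of drives (independent 5% failures assumed)."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p_fail for _ in range(sample_size))
        for _ in range(num_samples)
    ]

counts = simulate_sample_failures()
mean_failures = statistics.mean(counts)
stdev_failures = statistics.pstdev(counts)
print(f"mean failures per sample: {mean_failures:.2f}")
print(f"observed stdev: {stdev_failures:.2f}, Poisson prediction sqrt(5): {math.sqrt(5):.2f}")
```

      Strictly speaking the count in each sample is binomial, with standard deviation sqrt(100 * 0.05 * 0.95), about 2.18, which is slightly below the Poisson value sqrt(5), about 2.24. For small failure probabilities the two are close enough for this kind of estimate.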

    8. Re:How many drives really by Alomex · · Score: 1

      Do you really think that they don't store every cookie and search pattern of everyone who uses their search engine?

      You are correct. Google openly acknowledges that they store every search made together with its cookie userid.

  55. Re:So by mabinogi · · Score: 1

    5x500 in RAID 5 is not 2.5 TB. It's 1.5 TB assuming he's running with a hot spare. (Everyone has a hot spare, right?)

    --
    Advanced users are users too!
  56. Bell Labs by gustgr · · Score: 2, Insightful

    Google Labs, still in its youth, certainly reminds me of the golden years of Bell Labs.

  57. Re:I interned at google - don't trust them by conchur · · Score: 1

    Did you run that through Google Translate?

  58. They do say that "vintage" matters by Joce640k · · Score: 4, Interesting

    The report does say that "vintage" matters, ie. that "Past performance is not a reliable indicator of future development".

    Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.

    Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic I'm sure there's some useful information in there (eg. a company to avoid like the plague).

    --
    No sig today...
    1. Re:They do say that "vintage" matters by Anonymous Coward · · Score: 0

      I don't understand your post.

      First you state that past performance is not a reliable indicator of future performance, and so statements about individual companies would be useless.

      Then you say that it could tell you about "a company to avoid like the plague".

      These statements are contradictory. Did I miss something?

    2. Re:They do say that "vintage" matters by Anonymous Coward · · Score: 0

      Well, Maxtor has already been damned.

  59. Re:So by asuffield · · Score: 1

    Some newer 20GB on up, there was a downright scandal about extremely high failure rates on certain lines. It sounds like 1 plant producing them was turning out duds with a near 100% failure rate. IBM sold off the storage division to Hitachi, who now sells Hitachi Deskstars. I can only assume they closed the bad plant, or made sure the clean room was actually clean 8-).


    This is how some guys I used to know in the storage division told the story - hearsay, but probably a reasonable approximation to what happened:

    At the time, IBM had two disk fabrication plants. Certain lines of deskstars were being migrated to a new kind of platter technology (glass composition? something like that), which necessitated completely rebuilding the production lines.

    One of those rebuilds was screwed. All the disks it produced were DOA, but not quite DOA enough to get the problem caught by their standard QA procedures. In the end they had to tear the whole thing down and rebuild it again.

    In the end, about half the drives shipped in the affected product lines were defective. Because of how stock allocation from the two plants works, if the store you got your drive from gave you a defective one, most likely every single other drive in their storeroom was from the same plant and therefore also defective, so taking it back there for a warranty replacement was a joke. The deskstars got a bad reputation more from this than from anything else. IBM knew what was going on, but could do little to stop it, because until they got that plant rebuilt they just didn't *have* any replacement drives to hand out. A classic example of how a failure in the QA process can leave a company completely screwed.

    By the time Hitachi bought the storage division, the bad production line was long gone and the QA procedures fixed.

    I don't know why they didn't just throw in the towel and issue a product recall. Must have been a management decision. There's a lawsuit pending that might find out.
  60. Re:I interned at google - don't trust them by Anonymous Coward · · Score: 0

    I think this is some sort of pre-existing speech in which he merely replaced the significant target with Google.

    That's the only way this pile of junk's existence can be accepted.

  61. Article Text by Anonymous Coward · · Score: 0

    Appears in the Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST'07), February 2007
    Failure Trends in a Large Disk Drive Population
    Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso
    Google Inc.
    1600 Amphitheatre Pkwy
    Mountain View, CA 94043
    {edpin,wolf,luiz}@google.com
    Abstract
    It is estimated that over 90% of all new information produced
    in the world is being stored on magnetic media, most of it on
    hard disk drives. Despite their importance, there is relatively
    little published work on the failure patterns of disk drives, and
    the key factors that affect their lifetime. Most available data
    are either based on extrapolation from accelerated aging experiments
    or from relatively modest sized field studies. Moreover,
    larger population studies rarely have the infrastructure in place
    to collect health signals from components in operation, which
    is critical information for detailed failure analysis.
    We present data collected from detailed observations of a
    large disk drive population in a production Internet services deployment.
    The population observed is many times larger than
    that of previous studies. In addition to presenting failure statistics,
    we analyze the correlation between failures and several
    parameters generally believed to impact longevity.
    Our analysis identifies several parameters from the drive's
    self monitoring facility (SMART) that correlate highly with
    failures. Despite this high correlation, we conclude that models
    based on SMART parameters alone are unlikely to be useful
    for predicting individual drive failures. Surprisingly, we found
    that temperature and activity levels were much less correlated
    with drive failures than previously reported.
    1 Introduction
    The tremendous advances in low-cost, high-capacity
    magnetic disk drives have been among the key factors
    helping establish a modern society that is deeply reliant
    on information technology. High-volume, consumer-grade
    disk drives have become such a successful product
    that their deployments range from home computers
    and appliances to large-scale server farms. In 2002, for
    example, it was estimated that over 90% of all new information
    produced was stored on magnetic media, most
    of it being hard disk drives [12]. It is therefore critical
    to improve our understanding of how robust these components
    are and what main factors are associated with
    failures. Such understanding can be particularly useful
    for guiding the design of storage systems as well as devising
    deployment and maintenance strategies.
    Despite the importance of the subject, there are very
    few published studies on failure characteristics of disk
    drives. Most of the available information comes from
    the disk manufacturers themselves [2]. Their data are
    typically based on extrapolation from accelerated life
    test data of small populations or from returned unit
    databases. Accelerated life tests, although useful in providing
    insight into how some environmental factors can
    affect disk drive lifetime, have been known to be poor
    predictors of actual failure rates as seen by customers
    in the field [7]. Statistics from returned units are typically
    based on much larger populations, but since there
    is little or no visibility into the deployment characteristics,
    the analysis lacks valuable insight into what actually
    happened to the drive during operation. In addition,
    since units are typically returned during the warranty period
    (often three years or less), manufacturers' databases
    may not be as helpful for the study of long-term effects.
    A few recent studies have shed some light on field
    failure behavior of disk drives [6, 7, 9, 16, 17, 19, 20].
    However, these studies have either reported on relatively
    modest populations or did not monitor the disks closely
    enough during deployment to provide insights into the
    factors that might be associated with failures.
    Disk dr

  62. You can get IDE/SATA drives FAILURE RATES Here by Augur · · Score: 5, Informative

    One of the largest retailers in Russia (and maybe in Europe - more than 300 terminals for orders in person at an ex-factory building, busy 24/7), "Pro Sunrise", released information on failure rates of major PC components (CPUs, video cards, motherboards, IDE/SATA drives, etc.) they sold in Q1-Q2 of 2005.

    http://pro.sunrise.ru/articletext.asp?reg=30&id=283 - the article (in Russian, but the diagrams are self-explanatory).

    http://pro.sunrise.ru/docs/30/image001.gif - IDE/SATA (3.5" formfactor)

    http://pro.sunrise.ru/docs/30/image002.gif - HDD (2.5" notebook formfactor)

    In short, most returns are for Maxtor brand. Lowest - IBM/Hitachi.

    Toshiba is worst in 2.5", and Seagate is best.

    The chance of a drive failing ranges from 1/20 (Maxtor) to 1/70 (Hitachi).

    1. Re:You can get IDE/SATA drives FAILURE RATES Here by mackyrae · · Score: 1

      I remember my boss saying something the other day to a Toshiba laptop owner with a dead drive....that we replace those often. He's right. I think most of the dead drives I've seen come in have been Toshibas.

      --
      look! it's a bird, it's a plane, it's....a girl? yes, a girl browsing Slashdot on Linux
    2. Re:You can get IDE/SATA drives FAILURE RATES Here by that+this+is+not+und · · Score: 1

      Does the company you work for use anything else but Toshiba laptops?

    3. Re:You can get IDE/SATA drives FAILURE RATES Here by mackyrae · · Score: 1

      I work in a computer repair store. These are computers customers bring in to be serviced, and we're not an official Toshiba (or anything else for that matter) warranty service center (before you ask that one).

      --
      look! it's a bird, it's a plane, it's....a girl? yes, a girl browsing Slashdot on Linux
    4. Re:You can get IDE/SATA drives FAILURE RATES Here by UpnAtom · · Score: 1

      Surprises me. Every single IBM/Hitachi drive I've had has failed within 3 years. I've got through about 5 and only one was the Deathstar.

      100% failure rate for me versus 1.4% for everyone else.

  63. Re:So by drsmithy · · Score: 1

    5x500 in RAID 5 is not 2.5 TB.

    It is, it's just not 2.5TB _usable_. The *array size* is 2.5TB (well, it might be 2TB with a hot spare, but I sincerely doubt it).

    It's 1.5 TB assuming he's running with a hot spare. (Everyone has a hot spare, right?)

    Not a good assumption. The wording of the post, plus the "odd" number of drives suggests a DIY job. Most DIYers are after maximum space/$, not reliability or performance. Heck, even most "professionals" don't use a HS.

    In today's world of big SATA disk arrays and RAID6/RAID-DP/RAIDZ2/$DUAL_PARITY_RAID_SCHEME, anyone still using RAID5 (with or without a HS) is borderline negligent, IMHO.
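
    The raw-versus-usable distinction argued over above can be sketched as a small calculation. The function name and its parameters are hypothetical, chosen here for illustration; it assumes a single RAID 5 array of equal-sized drives, with any hot spares kept outside the array.

```python
def raid5_capacity_gb(num_drives, drive_size_gb, hot_spares=0):
    """Return (raw array size, usable size) in GB for a single RAID 5 array."""
    active = num_drives - hot_spares          # hot spares sit idle outside the array
    if active < 3:
        raise ValueError("RAID 5 needs at least three active drives")
    raw = active * drive_size_gb
    usable = (active - 1) * drive_size_gb     # one drive's worth of space holds parity
    return raw, usable

print(raid5_capacity_gb(5, 500))                 # no hot spare
print(raid5_capacity_gb(5, 500, hot_spares=1))   # one hot spare
```

    With five 500 GB drives this gives 2.5 TB raw and 2 TB usable without a spare, and 2 TB raw and 1.5 TB usable with one hot spare, which matches the figures both posters quote.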

  64. Just in case the original.. by NevarMore · · Score: 1
    1. Re:Just in case the original.. by ezh · · Score: 1

      slashdotting google... the funniest joke i've heard all week!

    2. Re:Just in case the original.. by Anonymous Coward · · Score: 0

      Hm. I would suspect that google are probably one of the few sites in the world who don't give a monkey's **** about the /. effect - we know they've got the bandwidth to take it and the servers to process the demand...

  65. Re:So by rafa · · Score: 1

    That raises a good question actually - what are people's experiences with advance drive replacements?

    I've had several Maxtor drives fail, and I'm now wary of storing anything important on them*, but their advance drive replacement service has been fabulously good in my experience. My understanding is that the particular model I have has been prone to failures (The 80Gb Maxtor DiamondMax Plus 9 drives). I wouldn't be averse to purchasing other model drives from them again in the future, based on their drive replacement service.

    My experience with WD and Seagate drives is that in the case of the particular drives I've owned, reliability has been good. I also found Seagate's 3-year warranty to be a good reason for buying these drives. However, when one of my Seagate drives failed (Seagate Barracuda 7200.8, 250Gb drive, model ST3250823A), they refused me an advance drive replacement in Sweden, which I'm very unhappy about. They were also both rude and unhelpful when I called to ask if it was available. In addition to that, it took them three weeks to ship me a new drive. I'm very, very unlikely to ever buy another Seagate product, despite finding their hardware has performed well, and decently reliably.

    As for WD drives, I've never had one fail during their warranty period. Last I was looking for replacement drives they were still just offering 1 year support, so I didn't buy any.

    I now live in London, and hope it isn't too far out of the way for decent customer service. What manufacturers have offered good drive replacement service where you live, and which have not?

    *I do have backups of course, but it's still a pain to have the drive fail.

    --
    [Science] is one of the very few things that raises human life a little above farce and gives it the grace of tragedy.
  66. Yep, they're a bunch of cowards... by Joce640k · · Score: 1

    They could have shown the analysis without naming actual brands.

    That way we could see for ourselves how much difference there is between best/worst manufacturers.

    --
    No sig today...
  67. Re:You guuuyyys... n0 sk1llz by Anonymous Coward · · Score: 0

    PDFs will also load quickly in Windows if you use Foxit reader instead of the bloated crap from Adobe!

  68. Samsung! by Angstroem · · Score: 0, Offtopic
    I had such a drive. Started making ticking noises while getting dog slow. SMART reported zilch... No reallocation count going up, no seek times going up, nothing... And just before I could copy it to another drive, it died.

    So much for ever buying a Samsung drive again, as I noticed after a short search on the net that I'm not the only one experiencing such problems.

    1. Re:Samsung! by mollymoo · · Score: 3, Insightful

      In summary: Your statistical analysis on a sample size of one showed a 100% failure rate, so Samsung are crap. You found some other people also had failed Samsung drives, so Samsung are crap.

      Search the net and you will find people ranting about Seagate drive failures, Western Digital drive failures, IBM drive failures, Maxtor drive failures and failures of drives made by companies neither of us has even heard of. You won't find many, if any, reports of recent failures with 8" floppy drives though, so I suggest you use one of those. They must be more reliable, right?

      --
      Chernobyl 'not a wildlife haven' - BBC News
    2. Re:Samsung! by Angstroem · · Score: 1

      In summary: Your statistical analysis on a sample size of one showed a 100% failure rate, so Samsung are crap. You found some other people also had failed Samsung drives, so Samsung are crap.
      Well, I had a failure. Thinking about handing it in for the extended warranty which Samsung is nice enough to give (and which made me buy the thing, because a company wouldn't give a 5-year warranty if they weren't convinced about their stuff), I made a quick search and discovered that indeed this very model *was* flawed. Just like the ZIP drive was flawed in a way that quite a number of people experienced the "click of death".

      I'm not sure how big a statistical sample must be for you that it counts as valid and not just being some ranting of a single person...
    3. Re:Samsung! by rtb61 · · Score: 1

      You obviously need your thinking adjusted, in modern corporate speak, a product is only flawed when you can't sell it at a profit. One of these days when some piece of crap fails and it has a return to manufacturer on the other side of the planet requirement, I am personally going to take it back and return it 'through' a window, preferably the board room when it is occupied ;).

      --
      Chaos - everything, everywhere, everywhen
  69. Re:So by canuck57 · · Score: 1

    IBM DeskStar's, as far as I know, have been quite good - for some reason didn't use too many.

    You fared much better than I. I bought 18 IBM Deskstar drives in about 2001-2003 and stopped when my failure rate skyrocketed. Hitachi bought them from IBM. It went like this:

    • 18 acquired
    • 8 retired normal attrition
    • 7 died in service causing interruptions
    • 2 still in service
    • 1 on shelf as spare/mirror

    All lasted over a year; all that died did so within 2-4 years of service. I only use the remaining two on non-production systems. They are non-critical.

    I started buying them as I had great results using the 13GB lot. In fact, a couple were abused and kept working, so I bought the 40GB, 60GB and later 80GB. The only ones left are 40GB.

  70. Re:I interned at google - don't trust them by Anonymous Coward · · Score: 0

    Random paper generator, perhaps?

  71. So SMART is specific, but not sensitive. by spineboy · · Score: 3, Insightful

    To me it's useful - if I get a SMART warning, then I'm definitely backing up my drive and will replace it before it croaks.

    Sensitivity/specificity always presents a balancing act of testing, and they are usually in a push/pull relationship. If you make a test too sensitive, then you get too many false positives, and wind up over treating something (i.e. the test says it might fail so you replace the drive even though it's not going to - a false alert)

    If you make the test too specific, then usually you wind up decreasing its sensitivity, or ability to detect something. Now you get false negatives, so when the test fires, you can be sure that it's accurate, but it doesn't always detect the problem.

    What you want to know is the Positive Predictive Value PPV, which is determined by the formula PPV=TP/(TP+FP), where TP = true positives, FP = false positives.
    Also useful is the Negative Predictive Value NPV, given by the formula NPV=TN/(FN+TN), where TN = true negatives, FN = false negatives.

    What information these give is as follows. If a test is positive (i.e. the drive temperature is >80 C), then it accurately will predict that the drive will fail. If the test is negative (drive temp 40 C), then it accurately predicts that the drive is ok.
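
    As a minimal sketch, the two formulas above translate directly into code. The function names are invented for this example, and the counts are made up purely to show the arithmetic; they are not taken from the Google paper.

```python
def positive_predictive_value(true_pos, false_pos):
    """PPV = TP / (TP + FP): fraction of warnings that were real failures."""
    return true_pos / (true_pos + false_pos)

def negative_predictive_value(true_neg, false_neg):
    """NPV = TN / (FN + TN): fraction of all-clear results that were really fine."""
    return true_neg / (false_neg + true_neg)

# Hypothetical counts for illustration only, not real drive data.
tp, fp, tn, fn = 30, 70, 850, 50
ppv = positive_predictive_value(tp, fp)
npv = negative_predictive_value(tn, fn)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

    A low PPV means most warnings are false alarms; a low NPV means a clean report says little about whether the drive is healthy.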

    --
    ..........FULL STOP.
    1. Re:So SMART is specific, but not sensitive. by vakuona · · Score: 2, Informative

      What Google is saying is that you cannot rely on SMART to warn you of all or even most hard drive failures. So whilst you do reduce the possibility of losing data, they are saying you are still very likely to lose data anyway.

    2. Re:So SMART is specific, but not sensitive. by RedWizzard · · Score: 2, Informative

      To me it's useful - if I get a SMART warning, then I'm definitely backing up my drive and will replace it before it croaks. ... What information these give are as such. If a test is positive (i.e. the drive temperature is >80 C), then it accurately will predict that the drive will fail. If the test is negative (drive temp 40 C) then it accurately predicts that the drive is ok.

      But according to the paper none of the SMART parameters was very useful in this regard. Over 50% of drive failures were not predicted by SMART errors, so the "negative test" can't give much confidence that the drive is ok. Conversely, while some types of SMART error (e.g. scan errors) indicated a much higher probability of impending failure, they still weren't all that indicative. 70% of drives that reported a scan error were still functioning normally after 8 months. So the "positive test" isn't all that convincing either. This is why the paper came to the conclusion that SMART was not useful in building a predictive model for drive failure.
    3. Re:So SMART is specific, but not sensitive. by chriso11 · · Score: 2, Informative

      No, actually around 36% of drive failures did not have any SMART indications. Around 49% were predicted based on 4 or so of the key parameters.

      --
      No, I don't trust in god. He'll have to pay up front, like everybody else.
    4. Re:So SMART is specific, but not sensitive. by Archtech · · Score: 1

      "To me it's useful - if I get a SMART warning, then I'm definitely backing up my drive and will replace it before it croaks."

      Trouble is, you MAY lose a hard drive at any time - there is no way to be sure it won't happen. To be safe, you should:

      1. Back up ALL your vital, indispensable data EVERY DAY. (Or less often, according to how much vital, indispensable data you are willing to lose).
      2. Back up everything that would be inconvenient to lose, as often as you can reasonably manage. I suggest weekly at least.
      3. Ideally, take complete image backups of your hard drives, so that in case of total drive failure you can just roll everything right back and carry on where you left off. People always grossly underestimate how long it will take (and how much trouble and perhaps cost they will incur) to rebuild a system from scratch. Even if you can find your %#&@ Microsoft CD/DVD case with the serial number without which you can't reinstall the software you paid for.

      Once you reconcile yourself to the thought that backup is not a luxury but a basic essential, the cost and trouble are not too great. I prefer to back up my data daily to a large-capacity USB flash stick, and periodically do full backups to an external hard drive. But, if you have two or more drives and use less than half of each (quite common nowadays), why not maintain a copy of your first hard drive contents on the second drive, and vice versa? That way if any one spindle dies, you still have everything. If you have a network, you can do incremental backups to remote computers or a dedicated server.

      --
      I am sure that there are many other solipsists out there.
    5. Re:So SMART is specific, but not sensitive. by RedWizzard · · Score: 1

      Right, misspoke there.

  72. Re:Proprietary reporting by defile · · Score: 1

    Next time you have access to 100,000 hard drives, can analyze patterns of failure among them, can use those failures as a benchmark against which to measure analysis tools, and can come up with better recommendations for predicting failure than this study, then by all means let us know. But if you're looking for Microsoft or Western Digital or Seagate or Yahoo to perform and publish this kind of study for free, I think you may be waiting a good long while.

    I'm glad to see this kind of information come out of Google. But I don't expect it to last.

    These charitable acts aren't a big deal while Google is the darling of Wall Street. Once the honeymoon's over their shareholders are going to punish them for peeing away company trade secrets.

  73. Actually this is a profoundly important conclusion by justthinkit · · Score: 2, Informative

    after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors

    This is easily the most important thing a sysadmin needs to know about hard drives. Much as I love Spinrite, when drives start to fail they continue to fail.

    This story reminds me of the run around I got from Dell [India] when my one-and-only-Dell I'm-not-stupid-enough-to buy-their-crap-again started to have seek errors.

    --
    I come here for the love
  74. probably a bad idea on their part by seibed · · Score: 1

    I'm sure there are a few smart cookies out there that could easily derive which vendor corresponded to which letter. it wouldn't take that much if you work for one of the drive companies (and therefore can easily derive one of them, then just solve for the rest)

    It's also very proprietary on their part, with the volumes they're buying, showing that drive X had a drastically better failure rate would probably mean that the price on X would go up, therefore costing google money.

  75. Translation: Run hot, have high failure rate. by Futurepower(R) · · Score: 2, Insightful

    The research results are VERY poorly communicated, as research results often are.

    This seems to be the most relevant sentence: "What stands out are the 3 and 4- year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced." (Page 5, Section 3.4, 4th paragraph)

    Often poor communication in research papers is intended to hide the fact that the results are not very useful. The above sentence can be translated to: "If you run hard drives hot, after 3 or 4 years you will have a high failure rate."

    All of our drives have their own vibration-isolated fans. Google, I recommend you do that too, based on your research results.

    --
    Is U.S. government violence a good in the world, or does violence just cause more violence?

    1. Re:Translation: Run hot, have high failure rate. by jo42 · · Score: 1

      Well, what did you expect from a bunch of PhD's?

  76. Re:Proprietary reporting by T-Ranger · · Score: 3, Interesting

    They are hardly trade secrets. Google isn't in the hardware business. There are only so many patterns of disk usage one can have, and knowing what pattern Google has would hardly be useful to figure out how they did anything that they do. At least, not to any level of detail useful enough to copy.

    The amount of positive press they get from these types of releases easily justifies the effort to polish internal reports up to a publication standard. By releasing these types of papers, others may change their buying habits, which in turn will change the products sold. Google may believe that these types of papers would shame not individual manufacturers but the industry as a whole, and thus cause better products to be produced.

  77. Google being stupid: 2 approximately equal #'s... by Futurepower(R) · · Score: 2, Insightful

    Here's a quote from the Google paper: "Power-on hours -- Although we do not dispute that power-on hours might have an effect on drive lifetime, it happens that in our deployment the age of the drive is an excellent approximation for that parameter, given that our drives remain powered on for most of their life time." (Page 10, 4th paragraph)

    Translation: The number of hours the drives are powered is the same as the age of the drives, since the drives are always powered.

    When two numbers are close to equal, they are approximations for each other. LOL. Is there a social breakdown at Google? Are the people who don't like to think taking power at Google?

  78. Doesn't really tell us anything useful by BestNicksRTaken · · Score: 1

    So they've not had the balls to fess up and name names!

    Come on, who is the worst drive manufacturer, my money's on Maxtor - based on at least six drives dying within 6 months.

    For that matter, does anyone have any opinion on whether Maxtor has improved since being bought by Seagate?

    Has IBM improved since merging with Hitachi, or have they just renamed the Deathstar?

    Does anyone even use Samsung drives? Whatever happened to Fujitsu and Conner - they really were bad, i.e. sometimes didn't even work when new!

    Are Western Digital the best for SATA and Seagate the best for IDE, as is my opinion (got about a dozen of these and only one failure)

    All Google told us is that temperature doesn't make a difference, and power-cycling may but they can't really tell as they don't do it often!

    I have a friend who has a theory that BitTorrent is really bad for drives, as it's constant read/write of little bits.

    --
    #include <sig.h>
    1. Re:Doesn't really tell us anything useful by toddestan · · Score: 1

      Come on, who is the worst drive manufacturer, my money's on Maxtor - based on at least six drives dying within 6 months.

      Could be, though I have had good luck with Maxtor. Could very well be Western Digital too.

      Has IBM improved since merging with Hitachi, or have they just renamed the Deathstar?

      The Hitachi drives seem to have average reliability. Really, there was just a couple of bad years there for IBM with the Deathstars. The ones before and after those drives don't seem any worse than average.

      Does anyone even use Samsung drives? Whatever happened to Fujitsu and Conner - they really were bad, i.e. sometimes didn't even work when new!

      My experience with Samsung drives is that they are quiet, low heat, and reliable. I recommend them. Conner got bought out by Quantum, which got bought by Seagate (IIRC). Fujitsu is still around, they make 2.5" drives mostly. Reliability seems average as 2.5" drives go.

      Are Western Digital the best for SATA and Seagate the best for IDE, as is my opinion (got about a dozen of these and only one failure)

      I have not had any SATA failures (yet). Seagate IDE drives do seem reliable, Western Digitals are terrible.

      All Google told us is that temperature doesn't make a difference, and power-cycling may but they can't really tell as they don't do it often!

      Actually, Google tells us that very low and very high temperatures are bad. The temperatures that most drives seem to operate at in most computers is the best, going outside of that is trouble.

      I have a friend who has a theory that BitTorrent is really bad for drives, as it's constant read/write of little bits.

      I'm going to guess that a lot of Google's usage patterns are a lot like BitTorrent (as in lots of small, random accesses and writes as opposed to large continuous reads and writes). Google's data seems to show us that a lot of usage like this is able to weed out the early failures quickly, but after that it doesn't matter until the drive gets old.

    2. Re:Doesn't really tell us anything useful by JasonBee · · Score: 1

      I find that interesting...I've got three Maxtors and a Seagate in my PowerMac G4 (mirrored drive bay 1ghz) since 2002.

      None have given me issues and run much of the time (60%). The machine goes to sleep at night now, and stays above 15C (it's -20-30 these days here).

      I've not yet had one drive failure. No SMART errors etc.

      My co-workers in platform services (the server guys) have said across the board that Windows machines are far harder on their hard drives than the UNIX machines. That would be nice to suss out of the Google data - but we already know what their preference is ;)

      JB

  79. What he/she/it is looking for by Alien54 · · Score: 2, Interesting

    ... is not only a breakdown by age, but by other parameters, such as size, model, series, etc. I am sure that the IBM DeathStars would have greatly biased the statistics, for example, and it would be useful to have breakouts not only for such well known disasters, but also for the sample excluding the Deathstars, etc.

    It is also interesting to note the magnificent jump in failure rates once the drives get outside the three year warranty period. No coincidence there.

    --
    "It is a greater offense to steal men's labor, than their clothes"
    1. Re:What he/she/it is looking for by StikyPad · · Score: 1

      It is also interesting to note the magnificent jump in failure rates once the drives get outside the three year warranty period. No coincidence there.

      Would you prefer a more random distribution of failure? Personally, I would be rather nonplussed if there was a significant amount of failures inside of the warranty period, and I have no expectation that a device with a stated lifetime would function reliably beyond that period (although it would be a bonus if it did).

      A warranty is a guarantee that the product will work for the intended purpose for a specific period -- in this case, 3 years of reliable data storage and retrieval. When viewed from that perspective, it seems like they're doing a good job of living up to their guarantee. At any rate, A) most drives are "too small" after that point and B) if you expect that you'll still be using them (that they won't be "too small" for your purposes at that point), then you can buy drives with longer warranties. WD's server series (YS) drives, for example, carry a 5 year warranty. I just purchased three of those because I expect that 1.5TB of storage will still be useful to me in 3-5 years. [Insert Vista/Cairo joke here.]

  80. Re:Proprietary reporting by Anonymous Coward · · Score: 0

    1) I read the article.
    2) I can tell what's there and what's missing.
    3) Because there is no table or graph presenting correlation between manufacturer, model, and failure rate, the article is of little use to me. It does have lots of words and pretty graphs, though.

  81. Temperatures by Trogre · · Score: 2, Interesting

    An interesting document, and I found the data on temperatures particularly interesting.

    I have been previously led to believe that it's not so much the average temperature of a hard drive that causes failure, but temperature fluctuations. This makes sense, since repeated expansion and contraction of the disk platters is likely to cause warpage before too long. This, I guess, is where glass platters like what IBM toyed with would come in useful. In the meantime I guess we still need our HVAC units to keep a constant temperature, just not too low anymore.

    This also has implications for data centers that spend a considerable amount of energy pumping heat out of the server room. If we can raise the industry-accepted temperature ceiling from 22C to say 30C then a lot of energy can be saved over time. Perhaps not quite enough to dip below 1% of US-wide power use but every bit helps.

    --
    "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
    1. Re:Temperatures by whitis · · Score: 2, Informative

      I think you are partly right in this assumption, but for the wrong reasons. Some failure modes are a function of temperature and other failure modes are a function of temperature variation. A long time ago platter expansion and contraction was a major cause of problems when drives used stepper motor positioning; since they switched to servo positioning, the drive automatically tracks the expansion and contraction of the platters and that is pretty much a non-issue as long as the coating on the platters is not affected.

      This report reads like it was done by statisticians, not engineers. Handling of temperature, in particular, reveals this. As someone who has designed electronic circuits, been involved in reliability analysis, and repaired broken computers and other equipment at both the board level and chip level, I get the impression that the writers have not done any of those things.

      Also, the conditions in google RAID arrays are likely very different than may be encountered in many other areas such as office and desktop PCs. In the raid arrays drives are not powered down daily and you also expect better cooling design.

      The higher failure of lower average temperature drives is a definite eyebrow raiser. Not because it disproves the common wisdom (which still applies in the expected range) but because it is probably the clue that some important data was overlooked. If you actually extrapolate the right side of the graph, you see that failure does increase dramatically with temperature over the range of temperatures that would be experienced in normal cooling situations and particularly cooling failures.

      Google has drives that are running at room temperature? This could point to some serious temperature fluctuation, measurement error, or to extremely aggressive local cooling (chilled water or freon A/C) or a server room that is chilled like a walk-in freezer. In which case, those drive failures are probably caused by moisture. At normal operating temperatures, a drive will drive off moisture. At the cooler temperatures, there may be condensation issues on the drive itself or on cooling components near the drive.

      The reason that we don't see high temperature rate failures is that the sample of temperatures is abnormally low. The most common temperature related failures would be when you have a cooling failure or poor cooling. Good cooling does improve the lifetime of the drive. That does not mean, however, that cooling to extremes is a good idea. In a typical PC, the drive is going to run at somewhere around 40 degrees C. The drive on this computer, right now, which is mounted in a typical mid tower case in a slightly chilly room (it is winter here) that would be a lot more chilly without three computers heating it, is running at 39 degrees C. That temperature corresponds to the crest of the failure vs. temperature curve on Google's graphs. What temperature do you think drive manufacturers would optimize their designs for? A typical commercial grade chip is rated 0 to 70 degrees C so the thresholds would be expected to be optimized for 35 degrees C. Drive manufacturers would expect the normal operating temperature to be around 40 degrees C. The paper says they use consumer grade drives. The datasheet for a WD 250GB hard drive says the minimum operating (ambient, not drive temperature) is 5 degrees C (41F) to 55 degrees C (131F). I noticed in doing a Google search that some drives specified a minimum storage temperature of -13C.
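
      The arithmetic in the paragraph above (the midpoint of a 0 to 70 degrees C rating, and the Fahrenheit equivalents of the 5 to 55 degrees C datasheet range) can be checked with a few lines. This is only a sketch; the helper names c_to_f and midpoint are made up for illustration and do not come from the paper or any datasheet.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def midpoint(low: float, high: float) -> float:
    """Return the centre of a rated temperature range."""
    return (low + high) / 2.0


# Commercial-grade chip rating quoted in the comment: 0 to 70 degrees C.
chip_mid_c = midpoint(0.0, 70.0)

# Ambient operating range quoted from the WD 250GB datasheet: 5 to 55 degrees C.
drive_low_f = c_to_f(5.0)
drive_high_f = c_to_f(55.0)

if __name__ == "__main__":
    print(f"chip range midpoint: {chip_mid_c} C")
    print(f"drive ambient range: {drive_low_f} F to {drive_high_f} F")
```

      The midpoint comes out at 35 degrees C and the range at 41F to 131F, matching the figures quoted above.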

      Also, if the average temperature is low, that may be an indication that the drives in that particular population are drives that are spun down or even powered down much of the time, perhaps because the particular datasets they are serving are infrequently used or because the data is entirely cached in RAM.

      Also, they talked about average temperature over the life

  82. Better lubes needed? by mc6809e · · Score: 1


    The relationship between low temperature and failure suggests that the higher viscosity of colder lube is causing increased wear.

    You see the same thing happen with autos that are started frequently in cold weather. Thick, cold oil just doesn't work well.

  83. Re:Google being stupid: 2 approximately equal #'s. by greenrd · · Score: 1

    They said "most of their life time", not "all of their life time".

  84. One of TWO best papers at FAST by Ristretto · · Score: 2, Informative
    This Google paper just appeared at the 5th USENIX Conference on File and Storage Technologies (a.k.a. FAST), the premier conference on file systems and storage. It won one of the best paper awards.

    You might be interested in the other best paper award winner (in the shameless self-promotion department): TFS: A Transparent File System for Contributory Storage , by Jim Cipar, Mark Corner, and Emery Berger (Dept. of Computer Science, University of Massachusetts Amherst). Briefly, it describes how you can make all the empty space on your disk available for others to use, without affecting your own use of the disk (no performance impact, and you can still use the space if you need it).

    Enjoy!

    --
    Emery Berger
    Dept. of Computer Science
    University of Massachusetts Amherst

  85. Re:You guuuyyys... by DeadChobi · · Score: 1

    (Offtopic)Same with Linux. The only OS I've ever used that's useless out of the box is Windows. You can't even get a decent office suite on that OS without buying or downloading. They'd be keeping me as a customer if they'd finally figured out that usability means usability instead of annoying me with security "features."

    (Ontopic) A number of people have talked about IBM's deskstars. I'd like to comment that I had a 40GB deskstar as my main system drive for 4 years without any failure. Those glass platters are really nice if you're trying to build a quiet system. I wish I could get a quiet high-end video card and a quiet PSU since I no longer have the deskstar, but sadly the wallet isn't what it used to be.

    --
    SRSLY.
  86. SpinRite Disk Error Problem Detection by northerner · · Score: 2, Interesting
    Does anyone have any comments pro/con on SpinRite from Gibson Research (http://www.grc.com/sr/spinrite.htm)? It claims to detect and repair disk errors with a low-level scan before they become a problem. I bought it and used it on a server drive that had disk errors during DOS file copies. It fixed the problem and no data was lost, but I don't have any other experience with it.

    The program sounds pretty amazing from their web site.

    Are many companies using it for preventative maintenance to avoid data loss on their servers?

    1. Re:SpinRite Disk Error Problem Detection by SuiteSisterMary · · Score: 1

      Dissenting opinion. The whole site is an interesting read.

      --
      Vintage computer games and RPG books available. Email me if you're interested.
  87. Re:You guuuyyys... by iminplaya · · Score: 1

    Sometimes even the task manager doesn't come up very quickly to kill the process. Oh, how I'm pining for a Mac, where Option-Command-Esc in such crises provides such instant gratification. It's a real basic complaint of mine with computers in general. No matter what kind of loop it gets stuck in, there should always be some resources devoted to user input. And don't even get me started on file management. Such basic things that should have been addressed since the 70s.

    --
    What?
  88. Re:Google being stupid: 2 approximately equal #'s. by Kjella · · Score: 1

    No, they're the people that don't equate "most of their life time" with "always powered". I think you should apply for a job at the MPAA/RIAA with that logic "most of the time, people on P2P nets are infringing on copyright" == "all P2P users are pirates". For what it's worth though, it also says something about where the study applies (disks normally running 24/7) and where it doesn't (desktop uses etc.) Also the number of power cycles is interesting...

    --
    Live today, because you never know what tomorrow brings
  89. Read my other response by SuperKendall · · Score: 1

    I read the first sentence, and immediately went into scan mode through the whole paper looking for said brand data... when I didn't find it I returned to about, but not exactly, where I left off - because I thought I had read it before. Pretty sad, I know, not to notice text that was right after, but I thought the admission by such a big entity as Google that brands mattered was interesting enough to warrant a departure from linear narrative.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  90. Drive speed by stapedium · · Score: 1

    I can understand leaving mfg. names out of the paper, but I would be interested to see how big a factor drive speed was in failure rates. Did any of the other papers presented address this?

  91. the answer by deadlocked · · Score: 1

    Does anybody know the author? Just take a look at what hard drives he's buying for his private computer and we'll know the answer. If I knew a google HD technician and saw that his home server had 100% of his drives from company Z, then I'd favour those too

  92. Re:So by jandrese · · Score: 1

    Who needs a hot spare in a home machine anyway? It's not like I need 5 9's reliability on my porn. Cold spares are perfectly fine.

    Also, RAID6 has even worse performance overhead than RAID5, and for home users it's blatant overkill, especially since you're backing up the data nightly anyway.

    --

    I read the internet for the articles.
  93. RAM drives? by cd-w · · Score: 1

    I seem to remember reading that Google prefers RAM drives to Hard disks? Personally, I can't wait to see the end of the hard disk - it is the weakest link in the hardware chain ...

    Chris

  94. Re:So by xtracto · · Score: 1

    Just contributing my own experience, but for me Maxtor has been completely crap (I used them for the first and last time about 6 years ago) while Seagate has been really good. I bought a used 120GB Seagate cheapo from eBay about 2 years ago and it is still spinning lovely. I have also just replaced my notebook's from-factory Samsung drive (because of a SMART warning) with another Seagate.

    --
    Ubuntu is an African word meaning 'I can't configure Debian'
  95. Stress testing harddrives by Miksa · · Score: 0

    In the PDF they mentioned stress testing the harddrives before deploying them, but don't specify their methods.

    How would you stress test a harddrive and how long?

    Last time I bought a harddrive I divided it into 4 partitions and then ran 4 continuous Bonnie++ benchmarks on the partitions. I did this for a couple of weeks and I'm glad to say it didn't break down, although it did become a bit louder. :( It has been running continuously for ~6 months now so maybe I finally dare to put some actual data on it.
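
    For anyone who wants something scriptable, here is a rough sketch of a write-then-verify burn-in loop. It is not Bonnie++ and it is not whatever Google does (the paper doesn't say); the function names, scratch file name, and default sizes are all made up for illustration.

```python
import hashlib
import os
import tempfile


def burn_in_pass(directory: str, size_bytes: int = 1 << 20, chunk: int = 1 << 16) -> bool:
    """Write random data to a scratch file, read it back, and compare SHA-256 digests."""
    path = os.path.join(directory, "burnin.tmp")  # hypothetical scratch file name
    written = hashlib.sha256()
    try:
        with open(path, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                block = os.urandom(min(chunk, remaining))
                f.write(block)
                written.update(block)
                remaining -= len(block)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                read_back.update(block)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        # Always clean up the scratch file, even if a read or write raised.
        if os.path.exists(path):
            os.remove(path)


def burn_in(directory: str, passes: int = 3, size_bytes: int = 1 << 20) -> int:
    """Run several passes and return how many failed verification."""
    return sum(0 if burn_in_pass(directory, size_bytes) else 1 for _ in range(passes))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print("failed passes:", burn_in(scratch, passes=2))
```

    Point the directory at a filesystem on the drive under test and raise size_bytes well past the amount of RAM, otherwise the read-back is likely to be served from the OS page cache and never touch the platters.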

    --

    Begging for modpoints since '03
  96. StorageMojo summarized the paper by modapi · · Score: 1

    for people who want the bottom line and not a 13 page paper. Check out Google's Disk Failure Experience.