Slashdot Mirror


Everything You Know About Disks Is Wrong

modapi writes "Google's wasn't the best storage paper at FAST '07. Another, more provocative paper looking at real-world results from 100,000 disk drives got the 'Best Paper' award. Bianca Schroeder, of CMU's Parallel Data Lab, submitted 'Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?' The paper crushes a number of (what we now know to be) myths about disks such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive reliability (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good summary of the paper's key points."

23 of 330 comments

  1. MTBF by seanadams.com · · Score: 5, Interesting

    MT[TB]F has become a completely BS metric because it is so poorly understood. It only works if your failure rate is linear with respect to time. Even if you test for a stupendously huge period of time, it is still misleading because of the bathtub curve effect. You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.
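
    A minimal sketch of how one mean can hide a two-humped distribution. It assumes, purely for illustration, that early failures all happen at exactly one month and late failures at exactly five years, then solves for the share of early failures that produces a two-year mean. The function name is made up for this example.

```python
def early_fraction_for_mean(mean_years, early_years, late_years):
    """Share of units that must fail early for the overall mean to equal mean_years.

    Solves p * early_years + (1 - p) * late_years = mean_years for p.
    """
    return (late_years - mean_years) / (late_years - early_years)


# One month expressed in years, and the five-year center of the late hump.
p = early_fraction_for_mean(2.0, 1.0 / 12.0, 5.0)
mean = p * (1.0 / 12.0) + (1.0 - p) * 5.0

print(f"early failures: {p:.0%}, overall mean: {mean:.1f} years")
# early failures: 61%, overall mean: 2.0 years
```

    Under these assumptions a "two year MTBF" would mean roughly six drives in ten die in the first month, which the single number does not reveal.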

    Suppose a tire manufacturer drove their tires around the block, and then observed that not one of the four tires had gone bald. Could they then claim an enormous MTBF? Of course not, but that is no less absurd than the testing being reported by hard drive manufacturers.

    1. Re:MTBF by Wilson_6500 · · Score: 5, Informative

      Um, but doesn't the summary of the paper say that there is no infant mortality effect, and that failure rates increase with time, and thus the bathtub curve doesn't actually apply?

    2. Re:MTBF by vtcodger · · Score: 5, Insightful
      ***Um, but doesn't the summary of the paper say that there is no infant mortality effect,***

      It does. But it also says -- repeatedly -- that the data is disk replacement data, NOT disk failure data, i.e., it's data on the number of problems that the user tech thought might be fixed by replacing the disk, not on the number of disks that actually failed. One might wonder if, for example, the response to a system failing while it was being set up or early in its lifetime might not be to put the whole damn thing into a box and ship it back to the vendor rather than dink around trying to figure out what is wrong. That won't be recorded as a disk failure.

      The study is fine -- really it is. But, table 3 ought to give pause. It's quite clear that different data sets show quite different diagnostic patterns. We've got one set of data that says that power supplies, for example, are hardly ever replaced and a second set that says that they are the most frequently replaced item. There MAY be good reasons for this. But it could also be an indication that the technicians are incompetent, that the record keeping is erratic, or (and I'd seriously consider this one) that only certain kinds of failures are being recorded.

      Finally, I think someone really ought to mention that there is no way that a disk manufacturer is actually going to measure MTBFs of 100,000 hours prior to printing up the data sheets. The problem is that there are only around 750 hours in a month. And you need a reasonable number of failures (many quality guys would say at least 4) in order to get a reasonably valid MTBF. In order to actually measure a six-digit MTBF, the manufacturer would have to run maybe 500 units for a month. My guess is that isn't going to happen. If they have the production line producing 500 units, they are going to ship them. Manufacturer MTBF data are surely based on data from a handful of engineering and preproduction units plus a bunch of wild guesses.
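
      The arithmetic behind that estimate can be sketched as follows. This assumes a constant failure rate, so that expected failures equal total unit-hours divided by MTBF, and it uses 730 hours per average month (8760 / 12) rather than the rounder 750 above. The function name is invented for the example.

```python
import math

HOURS_PER_MONTH = 8760 / 12  # 730 hours in an average month


def units_needed(mtbf_hours, min_failures, test_hours):
    """Units to run so that min_failures are expected within test_hours.

    Assumes a constant failure rate: expected failures = unit-hours / MTBF.
    """
    unit_hours = mtbf_hours * min_failures
    return math.ceil(unit_hours / test_hours)


n_100k = units_needed(100_000, 4, HOURS_PER_MONTH)
n_1m = units_needed(1_000_000, 4, HOURS_PER_MONTH)

print(n_100k)  # 548
print(n_1m)    # 5480
```

      So a one-month test for a 100,000-hour MTBF needs on the order of 500 drives, and a claimed 1,000,000-hour figure would need about ten times as many.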

      My guess, and it is just a guess, is that manufacturer MTBFs for disks are probably pretty much the MTBF goal in the drive specifications established before the design actually started.

      Incidentally, based on some experience with other sorts of high tech gadgetry, if the engineering/preproduction units do fail during test, a failure analysis will be done, and steps will be taken to fix the problem. Problem's fixed. OK, we shouldn't count those failures since they won't happen any more. That's called "censoring failure data". Begin to get an idea why disk MTBFs might be pretty much pure fiction?

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
  2. moving parts by DogDude · · Score: 5, Funny

    Every single mechanism with moving parts will fail. It's just a matter of when. In a few years, when everybody is using solid state drives, people will look back and shake their heads, wondering why we were using spinning magnetic platters to hold all of our critical data for such a long time.

    --
    I don't respond to AC's.
    1. Re:moving parts by theReal-Hp_Sauce · · Score: 5, Funny

      Forget Solid State Drives, soon we'll have Isolinear Chips. It won't matter if they fail or not because as long as the story line supports it Geordi can re-route the power through some other subsystem, Data can move the chips around really quickly, Picard can "make it so", and after it's all over with Wesley can wear a horrible sweater and deliver a really cheesy line.

      -C

    2. Re:moving parts by NMerriam · · Score: 4, Informative

      I thought flash memory had a lower read/write cycle expectancy before crapping out?


      They do have a limited write/erase lifetime for each sector, BUT the controllers automatically distribute data over the least-used sectors (since there's no performance penalty to non-linear storage), and you wind up getting the maximum possible lifetime from well-built solid-state drives (assuming no other failures).

      So in practice, the lifetime of modern solid state will be better than spinning disks as long as you aren't rewriting every sector of the disk on a daily basis.
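
      A toy sketch of the "write to the least-used sector" idea, using a min-heap keyed on erase count. This is an illustration only, not how any particular controller works: real firmware also has to handle logical-to-physical mapping, static data, and bad blocks. The class and method names are made up.

```python
import heapq


class WearLeveler:
    """Toy allocator that always hands out the least-worn block."""

    def __init__(self, num_blocks):
        # Heap of (erase_count, block_id); ties go to the lower block id.
        self.heap = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self.heap)

    def write(self):
        """Pick the least-worn block, charge it one erase cycle, return its id."""
        count, block = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (count + 1, block))
        return block

    def wear(self):
        """Sorted erase counts across all blocks."""
        return sorted(count for count, _ in self.heap)


leveler = WearLeveler(4)
first_four = [leveler.write() for _ in range(4)]
for _ in range(6):
    leveler.write()

print(first_four)      # [0, 1, 2, 3]
print(leveler.wear())  # [2, 2, 3, 3]
```

      Ten writes land evenly on four blocks, so no block is ever more than one cycle ahead of the others.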
      --
      Recursive: Adj. See Recursive.
    3. Re:moving parts by wik · · Score: 4, Informative

      Not true. Transistors at really small dimensions (e.g., 32nm and 22nm processes) will experience soft breakdown during (what used to be) normal operational lifetimes. This will be a big problem in microprocessors because of gate oxide breakdown, NBTI, electromigration, and other processes. Even "solid-state" parts have to tolerate current, electric fields, and high thermal conditions and gradually break down, just like mechanical parts. Don't go believing that your storage will be much safer, either.

      --
      / \
      \ / ASCII ribbon campaign for peace
      x
      / \
  3. Re:Dr. Schroeder is pretty hot, too! by Anonymous Coward · · Score: 5, Funny

    Except she requires a MTBF of more than 3 seconds. Sorry dude.

  4. infant mortality by Anonymous Coward · · Score: 5, Insightful

    I suspect that the 'infant mortality' syndrome really has to do with the drives being abused before they are installed in the machines (getting dropped during shipping, for example).

    The large shops that these studies are looking at get their drives in bulk directly from the manufacturer; the rest of us, who have to go through several middlemen before we get our drives, have more of a chance that something happened to them before we received them.

    David Lang

  5. Re:MTBF? RTFA. by Vellmont · · Score: 4, Informative

    You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.


    Well, the article actually says that drives don't have a spike of failures at the beginning. It also says failure rates increase with time. So you're right that MTBF shouldn't be taken for a single drive, since the failure rate at 5 years is going to be much higher than at one.

    The other thing that the article claims is that the stated MTBF is simply wrong. It mentioned a stated MTBF of 1,000,000 hours, and an observed MTBF of 300,000 hours. That's pretty bad. It's also quite interesting that the "enterprise" level drives aren't any better than the consumer level drives.
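
    To put those two MTBF figures in more familiar terms, here is a rough conversion to an annual failure rate. It assumes a constant failure rate and uses the small-rate approximation of hours per year divided by MTBF; the function name is invented for the example.

```python
HOURS_PER_YEAR = 8760


def annual_failure_rate(mtbf_hours):
    """Approximate fraction of drives failing per year for a given MTBF.

    Assumes a constant failure rate; valid while the result is small.
    """
    return HOURS_PER_YEAR / mtbf_hours


stated = annual_failure_rate(1_000_000)
observed = annual_failure_rate(300_000)

print(f"stated:   {stated:.2%}")    # stated:   0.88%
print(f"observed: {observed:.2%}")  # observed: 2.92%
```

    On that approximation, the gap is between losing fewer than one drive in a hundred per year and losing about three.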

    --
    AccountKiller
  6. This paper and the Google paper are complementary by Thagg · · Score: 4, Informative

    What's interesting about both of these papers is that previously-believed myths are shown to be, in fact, myths.

    The Google paper shows that relatively high temperatures and high usage rates don't affect disk life.
    The current paper shows that interface (SCSI, FC vs ATA) had no effect either. The Google paper shows
    a significant infant mortality that the CMU paper didn't, and the Google paper shows some years of flat
    reliability where the current paper shows decreasing reliability from year one.

    They both show that the failure rate is far higher than the manufacturers specify, which shouldn't come
    as a surprise to anybody with a few hundred disks.

    I'm particularly pleased to see a stake driven through the heart of "SCSI disks are more reliable."
    Manufacturers have been pushing that principle for years, saying that "oh, we bin-out the SCSI disks
    after testing" or some other horseshit, but it's not true and it's never been true. The disks are
    sometimes faster, but they're not "better".

    Thad

    --
    I love Mondays. On a Monday, anything is possible.
  7. Human MTBF by EmbeddedJanitor · · Score: 4, Funny
    MTBF of a human until gross catastrophic failure (i.e. death) is approx 50 years, which is approx 440,000 hours.

    Of course if we count relatively minor failures (like forgetting to take out the trash or pick up dirty underwear), then MTBF is approx 27 minutes!

    --
    Engineering is the art of compromise.
  8. Re:Infant Mortality and stuff by Wilson_6500 · · Score: 5, Insightful

    That may be the new 'theory' but we all know about theory vs reality.

    Uh, but wasn't this data accumulated via testing actual drives? That's... kinda how science works--by replacing anecdotal evidence with scientifically-gathered data. That's basically condemning science in favor of anecdotes--and the medical fields can tell you how well _that_ works.

  9. Re:Desktop vs Server usage. by Lumpy · · Score: 4, Interesting

    Or she forgot to put in the part that Enterprise drives are replaced on a schedule BEFORE they fail. At Comcast I used to have 30-some servers with 25-50 drives each scattered about the state. Every hard drive was replaced every 3 years to avoid failures. These servers (TV ad insertion servers) made us between $4,500 and $13,000 a minute while they were in operation, in spurts of 15 minutes down, then 3-5 minutes inserting ads. Downtime was not acceptable, so we replaced them on a regular basis.

    Most enterprise level operations that rely on their data replace drives before they fail. In fact, the replacement rate was increased to every 2 years, not for failure prevention but for capacity increases.

    --
    Do not look at laser with remaining good eye.
  10. How much does handling matter? by RebornData · · Score: 5, Interesting

    What's interesting to me is that neither of these papers mentions the issue of pre-installation handling. The good folks over at Storage Review seem to be of the opinion that the shocks and bumps that happen to a drive between the factory and the final installation are the most significant factor in drive reliability (much more than brand, for example).

    The Google paper talks a bit about certain drive "vintages" being problematic, but I wonder if they buy drives in large lots, and perhaps some lots might have been handled roughly during shipping. If they could trace back each hard drive to the original order, perhaps they could look to see if there's a correlation between failure and shipping lot.

    -R

  11. Re:"Everything You Know About Disks Is Wrong" by egr · · Score: 4, Funny

    I've read the article, then the title, damn!

  12. Re:Infant Mortality and stuff by TheLink · · Score: 4, Insightful

    quote: "Sorta. Again, real world vs theory. Try banging the hell out of an off the shelf consumer drive 24/7/365 and see how long it holds up"

    Uh the paper is based on _real_world_ stats (which part of "empirical evidence" + "she looked at 100,000 drives" don't you understand?).

    Your assumptions = theory. Paper = real world.

    And that's why the paper was voted "Best Paper", because it seems lots of people had similar assumptions and this paper is very useful to at least get some people to revisit those assumptions.

    It might still be proven wrong by a bigger/better study, or it could turn out that it was flawed in some way. But I'll give them the benefit of the doubt - more than I'll trust the MTTF/MTBF figures from drive manufacturers.

    --
  13. Re:Desktop vs Server usage. by MadMorf · · Score: 5, Informative

    Most enterprise level operations that rely on their data replace drives before they fail.

    You worked at an unusual place!

    I'm a Tech Support Engineer for a large storage system manufacturer and I can tell you that NONE of our customers replace disks before they fail unless our OS detects a "predictive failure" for the disk. Our customers are some of the biggest names in business from all over the planet.

  14. Re:Infant Mortality and stuff by Anonymous Coward · · Score: 5, Insightful

    Use two drives that are not in a RAID setup. Use one as the data holder and rsync or tar.gz the data to the other one at your comfort level (hourly/daily/weekly/monthly or whatever time frame you would like). Much cheaper than RAID, easier to get going, no gotchas involved with different HD controllers or different drives and, most importantly, the second drive is not "live" and not in normal operation, which constitutes a backup (remember, RAID is not and never was a backup solution; it is only for uptime and maybe speed).

    RAID controllers come in two flavors. Ones that are very well supported, where you will always find a similar or compatible one if that controller fails; the downside of this type is that it is very expensive. The other type is the cheap ones, you know, the ones for under $100 which may not exist in 2 years when yours fails, leaving your RAID array useless, and the on-board SATA RAID chipsets that change at least yearly as well. Good luck with those. They do work but I'd bet you will have more problems with the RAID setup itself than with the actual drives the data is on.

    I know, KISS is not in typical /. speak but it definitely applies here. 300GB HDs are about $80 without rebates; using one to hold a copy of the other using rsync or robocopy is about the cheapest backup you can get, and since it is not a live file system, all the other things that happen to data that are not the fault of the actual HD (virus, mouse slip, kids messing around, accidents, overwriting) will be recoverable.
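
    For anyone without rsync or robocopy handy, the same "copy to a second, non-live drive" idea can be sketched with Python's standard library. This is a stand-in for the tools named above, not a replacement for them: it copies everything each time and never deletes anything from the backup. The directory names are placeholders created in a temporary folder so the example can run anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def mirror(source_dir, backup_dir):
    """Copy source_dir into backup_dir, overwriting files that already exist.

    Files deleted from the source are left alone in the backup.
    """
    shutil.copytree(source_dir, backup_dir, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"      # stands in for the live drive
    backup = Path(tmp) / "backup"  # stands in for the second drive
    data.mkdir()
    (data / "notes.txt").write_text("important")

    mirror(data, backup)

    # Simulate a mouse slip on the live copy; the backup is untouched.
    (data / "notes.txt").unlink()
    recovered = (backup / "notes.txt").read_text()

print(recovered)  # important
```

    Run from a scheduler at whatever interval suits, this gives the offline copy described above; it needs Python 3.8 or later for `dirs_exist_ok`.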

  15. and Google contradicts. by bill_mcgonigle · · Score: 4, Interesting

    Well, the article actually says that drives don't have a spike of failures at the beginning.

    Hmm, the Google paper says they do, from 3-6 months (Figure 2).

    Which leaves us with confirmation that 50% of all studies are wrong.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  16. OSS Software RAID, too. by Kadin2048 · · Score: 4, Insightful

    On the other hand, you could get a cheap drive controller, and do software RAID, using OSS tools; the setup might be more complex than hardware RAID, but there shouldn't be any issues with recovering your data later due to the format it's written in.

    I agree though, that for most people, some sort of "userland RAID" where the disks are just mounted as regular volumes to the filesystem, and then you just write the data twice, is probably the best bet. There are no format problems, and you'll always be able to pull a drive out, stick it in another machine, and get at your data.
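
    A minimal sketch of the "write the data twice" approach, assuming the two drives are mounted as ordinary directories. The helper name is invented, and temporary folders stand in for the two mount points. It reads each copy back and hashes it, so a bad write shows up immediately.

```python
import hashlib
import tempfile
from pathlib import Path


def write_twice(name, payload, volumes):
    """Write payload as `name` on every volume; return SHA-256 of each copy read back."""
    digests = []
    for volume in volumes:
        target = Path(volume) / name
        target.write_bytes(payload)
        digests.append(hashlib.sha256(target.read_bytes()).hexdigest())
    return digests


payload = b"family photos"
with tempfile.TemporaryDirectory() as disk_a, tempfile.TemporaryDirectory() as disk_b:
    digests = write_twice("photos.bin", payload, [disk_a, disk_b])

copies_match = digests[0] == digests[1] == hashlib.sha256(payload).hexdigest()
print(copies_match)  # True
```

    Each copy is a plain file on a plain filesystem, which is what makes it readable from any other machine.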

    --
    "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
  17. Re:Infant Mortality and stuff by duffbeer703 · · Score: 4, Informative

    That may be the new 'theory' but we all know about theory vs reality. Here in reality if you put a couple of dozen new drives into service you have one or two spare hard drives to replace the ones that WILL fail in the first week. Especially with consumer grade drives typical in workstation deployment. If you only have one dud out of twenty it was a good rollout.

    This study looks pretty realistic to me; in fact it's better data than the Google paper's because they are looking at different usage scenarios. The study also jibes with vendors' warranty periods -- right around the 3 year mark (end of warranty) failures start going up.

    I take issue with your "real world vs. theory" argument about workstation disks versus server disks as well, only because I have my own numbers. Based on numbers that my company gathers for its 50,000 workstations, the disk failure rate is around 1.9% annually. (Still a lot of disks.) There are exceptions -- those numbers are driven upward by one deployment of workstations from a vendor that had a 22% failure rate. (The PCs were replaced by the vendor.) Server disks are in the same ballpark - slightly less than 2%.
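
    A quick sketch of what those numbers imply. It assumes one disk per workstation (the comment does not say) and a constant failure rate for the MTBF conversion; the function names are made up.

```python
HOURS_PER_YEAR = 8760


def expected_annual_failures(disk_count, annual_failure_rate):
    """Expected number of disk failures per year across a fleet."""
    return disk_count * annual_failure_rate


def implied_mtbf_hours(annual_failure_rate):
    """MTBF that would produce this annual failure rate at a constant rate."""
    return HOURS_PER_YEAR / annual_failure_rate


failures = expected_annual_failures(50_000, 0.019)
mtbf = implied_mtbf_hours(0.019)

print(f"{failures:.0f} disks per year")  # 950 disks per year
print(f"{mtbf:,.0f} hours")              # 461,053 hours
```

    So 1.9% a year is roughly 950 dead disks annually, and an effective MTBF of well under half a million hours.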

    Vendors provide more evidence of that fact. Many servers are being shipped with SATA disks, often the same as what you'll find in workstations. If SATA was less reliable, that would increase the vendor's support costs and they wouldn't ship them.

    You're totally right about RAID-5... it can be a dangerous thing for an inept admin. Bad disks often come in batches, and bad controllers can ruin your day. A redundant array of bad data isn't very helpful ;)

    --
    Conformity is the jailer of freedom and enemy of growth. -JFK
  18. Actually, mostly it DOESN'T contradict by Moraelin · · Score: 5, Insightful

    The two don't really contradict each other that much. Google's spike is relatively small and it's really a spike in the first 1-3 months. By the 6th month it's basically settled. In this paper half the time they graph in whole year increments, so that kind of a spike would be averaged into the first year. So, no, they don't contradict each other as such. And in at least one of the graphs by month in this paper (HPC1), there is something that looks like a spike in the first month.
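
    The averaging effect is easy to see with made-up numbers. The monthly rates below are hypothetical, not taken from either paper: a three-month early spike, a low baseline for the rest of year one, and a higher baseline in year two.

```python
def yearly_average(monthly_rates):
    """Average consecutive 12-month windows of monthly failure rates."""
    return [
        sum(monthly_rates[i:i + 12]) / 12
        for i in range(0, len(monthly_rates), 12)
    ]


# Hypothetical percent-per-month failure rates over two years.
monthly = [0.6, 0.5, 0.4] + [0.2] * 9 + [0.3] * 12
yearly = yearly_average(monthly)

print([round(rate, 3) for rate in yearly])  # [0.275, 0.3]
```

    Month one runs at three times the year-one baseline, yet the yearly view shows only a steady rise from year one to year two.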

    More importantly, they don't contradict each other in respect to the rest of the curve. With or without that spike, the curve just doesn't look like the bathtub fairy tale that drive makers try to bullshit us with. You're led into a false sense of security that, basically, if a drive didn't fail within the first couple of months, then it'll be at a (nearly) constant and very small probability to fail for the whole next 5 years, and only then it starts rising again. Basically that if you upgrade your drives every 4 years, whatever didn't fail within 2-3 months, heck, it's very unlikely to fail. And the curve just doesn't look that way. The probability to fail rises continuously, and (again whether that spike actually exists or not) after as little as 1 year you're above the starting height of the "bathtub" already.
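
    One common textbook way to describe these curve shapes is the Weibull hazard function, where a single shape parameter decides whether the failure rate falls, stays flat, or rises with age. The sketch below is only an illustration of those three regimes; the shape and scale values are arbitrary and not fitted to either paper's data.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


ages = [1, 3, 5]  # years in service
for shape, label in [(0.5, "infant mortality"), (1.0, "constant"), (2.0, "wear-out")]:
    rates = [round(weibull_hazard(t, shape, 5.0), 3) for t in ages]
    print(f"{label:>16}: {rates}")
```

    A bathtub curve is the first regime followed by the second and then the third; the point made above is that the drive data look like the third regime almost from the start.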

    In retrospect, I don't even know when and why the "bathtub" myth even started. The bathtub distribution was originally for stuff like electronic components, without moving parts. For something with mechanical wear and tear like a hard drive, who the heck came up with the idea that the same curve must apply? Shouldn't it have been common sense all along that it linearly gets more wear and tear?

    Both papers also tell us that the manufacturers' MTBF numbers are, basically, pure bullshit. They're some impressive number put there for the benefit of the marketing department, not because someone at Seagate/Maxtor/whatever actually believes that number.

    In retrospect, again, we should have had an alarm signal when the manufacturers lowered their warranty from 3 to 1 year. If indeed there was (1) the MTBF they claim, and more importantly (2) the bathtub curve they claim, the reduction wouldn't have even made too much of a difference. I mean, most drives would have failed within a couple of months, followed by barely a trickle of defective drives for the next 5 years straight. Why bother doing the bad-for-marketing thing of lowering the warranty in that scenario? Or did they already know that they lie?

    And finally, a very important point is that (again, bullshit marketing claims be damned) there is no difference in reliability between cheap SATA and expensive SCSI and FC. There is this assumption permeating the whole society that if something is expensive, it _must_ automatically be better and more durable than the cheap stuff. That if you buy a big plasma TV, it's automatically better and will last longer than an el-cheapo CRT. (Yeah, right. Plasma is actually known for its decay over time.) A whole edifice of consumerism, conspicuous consumption, and SFV (Stupid Fashion Victim) syndrome is based on that bullshit excuse to spend more than you need to spend. "Yeah, but it'll be better and last longer!" Yeah, right.

    I've actually met people who wouldn't even _consider_ putting an ATA drive in any kind of server. "What, you're going to put your enterprise data on ATA drives???" (Said with a perplexed look, as if I had proposed flushing it to /dev/null or something.) Well, now we know they're not actually any worse. If you don't actually need the extra bandwidth or lower latency or a 15,000 RPM drive, then you can just as well drop a SATA drive in that machine. Even for 10,000 RPM, 4.5ms, there are the WD Raptor drives with SATA interface, and they're cheaper than a SCSI or FC drive. For a lot of stuff you don't even need those, a 7200 RPM will do perfectly fine.

    --
    A polar bear is a cartesian bear after a coordinate transform.