Slashdot Mirror


Analyzing Long-Term SSD Failure Rates

wintertargeter writes "It looks like Tom's Hardware has posted the first long-term study of SSD failure rates. The chart on the last page is interesting — based on numbers, it seems SSDs aren't more reliable than hard drives. "

25 of 149 comments

  1. Uh, yes they are by TheRaven64 · · Score: 3, Informative

    Did the poster even look at the chart he linked to? Those big lines that shoot up to the top after 1-3 years? They're the failure rates for hard disks. The ones near the bottom? They're the failure rates for SSDs. Now, some of the SSD figures are projected and look quite optimistic, but the number of hard disks failing after three years looks higher than the number of SSDs failing after three years in all of the studies. For most workloads, the SSDs fail less often, and the SSD failures only exceed HD failures very early on in their lifetimes.

    --
    I am TheRaven on Soylent News
    1. Re:Uh, yes they are by msauve · · Score: 2

      Well, it depends on the application. Assuming the chart is accurate, disks are more reliable for the first year. So, if you have a short term application/need, or replace your hardware every year, then disks are more reliable.

      --
      "National Security is the chief cause of national insecurity." - Celine's First Law
    2. Re:Uh, yes they are by Geoffrey.landis · · Score: 5, Insightful

      Did the poster even look at the chart he linked to?

      Did you? Apparently not.

      Ignore the dashed lines-- those curves are not data, they are "projection." The chart has no data on SSD failures late in the lifetime. So, when you say "...SSD failures only exceed HD failures very early on in their lifetimes," that is equivalent to saying "SSD failures only exceed HD failures in the region of the graph for which there is data."

      --
      http://www.geoffreylandis.com
    3. Re:Uh, yes they are by Baloroth · · Score: 3, Insightful

      Look closer. At any points where they have actual data, the failure rate for SSDs is higher than that of HDDs, except for the Google study, which I bet puts the drives under massive load or something else funky (given its massive difference from all the other HDD charts). Only in the projections for the SSDs do the HDDs begin to curve upwards, throwing off the graph. And from what I know of flash memory, especially MLC (which most SSDs are), I'd bet that SSDs will curve upwards too. Sure, wear leveling will help, but if a cell fails with data in it, which can still happen, then that data is lost. So yeah, for any section where they have actual data, SSDs do have a higher failure rate than hard drives. Incidentally, that's a really terrible and deceptive chart.

      --
      "None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
    4. Re:Uh, yes they are by Anonymous Coward · · Score: 2, Insightful

      You do know that HDDs also require wear leveling right? (Well, not really, but defective blocks were pretty much part of life when HDDs were in the 10-100MB size.)
      So yes, both SSDs and HDDs are likely to wear out after time. What wear leveling does is make sure that the entire disk is pretty much worn out when you start encountering bad blocks.
      With SSDs there is, however, one slight improvement. Since flash memory has been used for so long without wear leveling, and in applications where it's damn important to get a good estimate of the product life (advanced fire alarms, Mars probes and such), it is actually possible to get good information on when an SSD is likely to fail.
      I assume that HDD manufacturers have at least some clue of how many writes their disks will take before they are worn out (otherwise they would have to allocate unused blocks for the wear leveling on a hunch), but good luck getting that information from them.

      So yes, both SSDs and HDDs are likely to fail sometime. The big difference is that if you are designing a system where it actually matters, you can select an SSD with the correct specification. If it really matters, you are probably going to get SLC anyway. If you don't want to pay for it, it is likely that you don't need reliability.

    5. Re:Uh, yes they are by Rockoon · · Score: 2

      What I don't get is that companies like Dell have been shipping SSD's for much more than 5 years now. Surely Dell has some good statistics about failure rates, since their customers want refunds and shit when things die quickly. Is it that Dell won't release the data? Has anyone even asked?

      I understand that the latest crop of SSD's from companies like OCZ has been a real nightmare. I suspect the OCZ issue has to do with powering down the device: the capacitor responsible for ensuring this happens correctly isn't supplying enough power for long enough to let all the buffers write out correctly. Most of the failure posts you see on newegg begin "I put the machine to sleep...", in other words, several gigs were written out right before the device lost its primary supply of power. So it could easily be that the final book-keeping is failing to complete correctly, leaving the flash in a "corrupted" state (the controller can't make sense of its own "block system").

      --
      "His name was James Damore."
  2. Huh? by adamjcoon · · Score: 5, Insightful

    I didn't read TFA but the chart doesn't tell me that "SSDs aren't more reliable than hard drives".. the SSDs were generally 6% or under (assuming the linear progression) whereas regular HDD approached 14%+ after five years. And "Long-term" in the title? The SSD data in the chart only goes for 1 year. Not exactly long term when the chart goes from 1-5 years of use. The actual data for the SSDs is only 20% of the time span.

  3. Re:Not more reliable, by alphatel · · Score: 2

    The author reviews several data sets that show SSDs are probably less likely to fail, and then describes several reasons why that information cannot be taken at face value. Not all of the data presented by the author is classified as reliable or even useful. The final chart is either not well-documented or would take a seminar to explain because it does not seem directly related to the rest of TFA.

    Either way, the SSD market is, oddly enough, as good as spindrives, but like anything else, the data released by vendors should be taken with a grain of salt.

    --
    When the foot seeks the place of the head, the line is crossed. Know your place. Keep your place. Be a shoe.
  4. Re:Who said they were? by Enleth · · Score: 2

    They're more durable - you can bang one against the desk, throw it around the room all day, then plug it in and it should still work (or, at worst, require fixing a broken solder joint or two, SMD capacitors sometimes fall off the PCB after a strong enough jolt), while no HDD in the world is going to survive that. Maybe people got that confused, the word "reliable" means many different things in layman's speech.

    --
    This is Slashdot. Common sense is futile. You will be modded down.
  5. Re:Whaddayamean "long term"? by hairyfeet · · Score: 4, Interesting

    Well considering that the failure rates are bad enough Atwood at Coding Horror says SSDs should be judged on a hot/crazy scale I'd say that is a pretty bad sign. Note that he still buys them even though they keep failing, but this is a guy that spends $400 on a pair of headphones.

    My problem with SSDs, and why I won't recommend them to anyone but a few edge use cases (those who do a lot of traveling with their laptop, servers where IOPS is the #1 goal), is that when they DO fail, in my experience there is no warning at all, and that is simply unacceptable. I have a couple of "Must rule teh benchmarkz!" gamer customers and both went SSD. These guys ain't cheap and bought the baddest SSDs they could find, price be damned. With both guys, both drives failed with NO warning, not even SMART. They just turned on their machines one day and poof! Bye bye SSD. With one I was able to get a small amount of the data back; the other couldn't even be detected in BIOS. Sure they both had warranties, but so what? It isn't like the warranties covered downtime or the HDDs they had to buy to replace it while they waited on the RMA. Both ended up selling their SSDs and going with a pair of Raptors in RAID 0.

    So until they fix this major flaw I will simply tell my customers to avoid them. With HDDs I don't think I can remember a time I've had a HDD fail without ample warning. Windows delayed write failures, SMART, noise and temp of the drive, in all cases you were given ample time to get your data off the failing drive. Not so with SSD, when it goes it just goes poof! Having that risk hanging like the sword of Damocles over your head just isn't worth the speed IMHO.

    --
    ACs don't waste your time replying, your posts are never seen by me.
  6. Re:Who said they were? by Attila+Dimedici · · Score: 3, Insightful

    I remember there being lots of claims that SSDs would be more reliable because they had no moving parts.

    --
    The truth is that all men having power ought to be mistrusted. James Madison
  7. Worst. Ever. by DarthVain · · Score: 4, Insightful

    Let me summarize:

    A) Chart is worthless. I have never seen a more ambiguous, meaningless chart in my life. They might as well not bother to label things.
    B) Let's do a reliability study on SSD's for which they don't have any long term data past 2 years, yet compare them to HDDs that typically have at least a 3 year warranty. By that I only mean, I'll go out on a limb and guess that the average time to failure of an HDD is > 3 years, if only for economic self preservation.
    C) Results in either case depend highly on specific device model and configuration.

    1. Re:Worst. Ever. by GodfatherofSoul · · Score: 2

      But, you can make projections from limited data. Disclaimer, all I looked at was the chart and I think you can assume linear failure rates for SSDs and exponential for HDDs (probably because of more components and different failure points). The chart is pretty clear if I'm interpreting it correctly.

      Just like sampling a population in statistics, you're working with limited data but you can hypothesize based on a small sample. What you can't tell is if there's some failure bomb (unlikely) outside the data range.
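
How far a projection can be trusted depends heavily on which failure curve is assumed. A minimal sketch, not taken from the article: it extrapolates one year of data to five years under three different assumptions. The 1% first-year figure and the Weibull shape of 2 are purely hypothetical values picked for illustration, and the function names are made up for this example.

```python
import math


def linear(p1, years):
    """Cumulative failures if the same fraction of the original fleet fails each year."""
    return min(1.0, p1 * years)


def constant_hazard(p1, years):
    """Cumulative failures if every surviving drive has the same yearly risk."""
    return 1.0 - (1.0 - p1) ** years


def weibull(p1, years, shape):
    """Cumulative failures on a Weibull curve calibrated to equal p1 at year 1.

    shape > 1 models wear-out (risk grows with age); shape == 1 is constant hazard.
    """
    rate_term = -math.log(1.0 - p1)  # equals (1 / scale) ** shape, fixed by the year-1 point
    return 1.0 - math.exp(-rate_term * years ** shape)


if __name__ == "__main__":
    p1 = 0.01  # hypothetical: 1% of drives failed in the first year
    for label, value in [
        ("linear", linear(p1, 5)),
        ("constant hazard", constant_hazard(p1, 5)),
        ("Weibull, shape 2 (assumed wear-out)", weibull(p1, 5, 2.0)),
    ]:
        print(f"{label}: {value * 100:.1f}% failed after 5 years")
```

All three curves pass through the same first-year point, so one year of data cannot tell them apart; they only separate in the region where there is nothing to check them against.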

      --
      I swear to God...I swear to God! That is NOT how you treat your human!
  8. Based on numbers... by tanderson92 · · Score: 2

    Based on numbers, the study shows SSDs to be more reliable than HDDs. The best data I have seen in that article is the following:

    SSDs: 1.28--2.19% over 2 years

    HDDs: >=5% over 2 years

    The HDD data comes from: http://media.bestofmicro.com/2/N/289103/original/google_afrtemputilization_475.png The SSD data comes from the table on Page #6.

    I don't think any of this data is particularly surprising, HDDs are mechanical so the curves for failure would not be linear. The most interesting part of the article for consideration with SSDs is that SMART is going to be near useless for them. Since most failures are random occurrences in electronics which SMART isn't good at detecting, we may need better technology for detecting SSD failures.
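
To put the two-year figures above on a per-year footing, one rough approach is to convert each cumulative rate into an equivalent constant yearly rate. A minimal sketch: the constant-rate assumption is mine, not the article's, and it is least realistic for mechanical drives, whose failure curves are not flat. The helper name is hypothetical.

```python
def annualized_rate(cumulative, years):
    """Convert a cumulative failure fraction over `years` into the constant
    yearly failure fraction that would produce it (assumes a flat yearly rate)."""
    return 1.0 - (1.0 - cumulative) ** (1.0 / years)


if __name__ == "__main__":
    # Figures quoted in the comment above, as fractions over 2 years.
    figures = [
        ("SSD, low end", 0.0128),
        ("SSD, high end", 0.0219),
        ("HDD, lower bound", 0.05),  # ">= 5%", so the true rate may be higher
    ]
    for label, cumulative in figures:
        print(f"{label}: {annualized_rate(cumulative, 2) * 100:.2f}% per year")
```

Because the HDD figure is only a lower bound, the gap between the two technologies could be wider than this conversion suggests.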

    1. Re:Based on numbers... by rayd75 · · Score: 2, Interesting

      The most interesting part of the article for consideration with SSDs is that SMART is going to be near useless for them. Since most failures are random occurrences in electronics which SMART isn't good at detecting, we may need better technology for detecting SSD failures.

      Have you ever seen SMART perform in a useful way on a mechanical disk? At work and at home, I've gone through a crap-ton of hard disks in the last decade or so that SMART's been prevalent and never have I seen SMART flag a drive as problematic before I already knew I had a serious problem. More often than not, I've had systems slow to a crawl due to massive numbers of read errors and sector reallocations while the drive firmware actively lied to me about the drive's condition. Only looking at the raw SMART stats and watching the counters increase wildly reveals the truth.
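
Watching the raw counters can be scripted. A minimal sketch, assuming two saved text snapshots in the attribute-table layout that `smartctl -A` prints for ATA drives; the snapshot contents below are invented for illustration, the function names are hypothetical, and which attributes are worth watching varies by drive vendor.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_raw_values(report):
    """Map attribute name -> integer RAW_VALUE for each attribute row in the text."""
    values = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit():
            try:
                values[parts[1]] = int(parts[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return values


def growing_counters(before, after, watched=WATCHED):
    """Return {name: (old, new)} for watched raw counters that increased."""
    old, new = parse_raw_values(before), parse_raw_values(after)
    return {
        name: (old[name], new[name])
        for name in watched
        if name in old and name in new and new[name] > old[name]
    }


# Hypothetical snapshots: the normalized VALUE column still reads 100 in both,
# while the raw counters in the last column have climbed.
SNAPSHOT_MONDAY = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

SNAPSHOT_FRIDAY = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       120
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       16
"""

if __name__ == "__main__":
    for name, (old, new) in growing_counters(SNAPSHOT_MONDAY, SNAPSHOT_FRIDAY).items():
        print(f"{name}: {old} -> {new}")
```

Comparing snapshots rather than trusting the overall health flag is the point here: a drive can keep reporting itself as healthy while these counters grow.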

  9. Re:Whaddayamean "long term"? by h4rr4r · · Score: 2, Informative

    The fix for this was released a long time ago; it is called proper backups. Instead of avoiding a superior product, try using them with proper backups.

  10. Re:Who said they were? by gbjbaanb · · Score: 2

    sigh, someone else who didn't RTFA. If you look on page 8 you'll see this image where Intel's 'reliability study at IDF 2011' says HDDs are pants, SSDs are great.

    of course, this is part of Intel's marketing for SSDs, so you'd expect them to say this kind of thing. Of course, that means someone has said this - specifically as some sort of selling point.

  11. and those that do fail by mikey177 · · Score: 2

    it is also a lot easier to retrieve data from a disk than from an SSD, which most of the time goes without warning

  12. Re:Whaddayamean "long term"? by gbjbaanb · · Score: 2

    always remember: RAID is not backup.

    One day, with a traditional HDD based setup, you'll come into the office to find the place a mess, everyone standing around and when you ask "what's happened", you'll get the reply "we were burgled, your PC is right now being sold on ebay".

    So who cares whether SSDs fail immediately or with a huge flashy light show whilst beeping out La Marseillaise, it won't help you none.

    You'll find other stories of HDD RAID arrays that failed simultaneously (which is more common than you think; drives go bad in batches, or I think, die at the same time just out of stubbornness), either due to power surges or a RAID failure that led to data corruption.

    So the only solution is to have adequate backup. With the number of continuous backup solutions out there, there's no excuse not to run it.

    PS. you replaced your SSDs with a pair of HDDs in RAID 0 format. Beggars belief.

  13. Re:Whaddayamean "long term"? by TheLink · · Score: 3, Informative

    The other failure mode is the "time warp" failure.

    http://www.dslreports.com/forum/r25491097-Dell-Laptop-and-SSD-Time-warp-issue

    Also updated windows fully, customized everything to my liking... in short, a good 2-3h of work.

    This morning, I open up the laptop and surprise... EVERYTHING's back to the pre-format. I have no idea how this is even remotely possible.

    The big problem with this failure mode would be if the user doesn't notice anything wrong till too late.

    A 100% dead drive sucks, but if you do regular backups you lose 1 day of data.

    A "time warp" failure that you don't notice could result in you sending out of date info in an important email. Or overwriting something important with invalid data and not noticing. The resulting damage could be far far worse than a dead drive.

    In my experience "spinning rust" rarely fails 100% without warning (or abuse - e.g. you drop the drive ;) ). You can often salvage some stuff out (just hope it's the stuff you want ;) ). I've managed to use knoppix to salvage data from people's failed spinning disk drives.

    In contrast these SSDs just go totally dead. Or really weird shit happens.

    In both cases the manufacturer might get an RMA. But they're not the same. If OCZ drives are getting RMA'ed at higher rates than spinning drives, and their failure modes are 100% dead or "time warp" they are far worse than the stats show: http://news.softpedia.com/news/French-Website-Publishes-HDD-SSD-and-Motherboard-RMA-Statistics-196538.shtml

  14. Re:Apparently by oGMo · · Score: 2

    So you're quoting that SSDs are not 10x more reliable than HDDs. That doesn't exactly prove a point that HDDs are more reliable.

    The original poster said "it seems SSDs aren't more reliable than hard drives." Do not create a straw man. The article indicates that while marketing and simpletons may point out select statistics as "more reliable," there's a lot more to the story, and it's difficult to impossible to get meaningful data at this point. That is, based on their analysis, SSDs are not provably more reliable at this time.

    --

    Don't think of it as a flame---it's more like an argument that does 3d6 fire damage

  15. Re:Whaddayamean "long term"? by TheLink · · Score: 5, Interesting

    If you're unlucky backups won't save you from this:
    http://www.dslreports.com/forum/r25491097-Dell-Laptop-and-SSD-Time-warp-issue

    yesterday I spent over an hour formatting, re-installing windows and everything else I needed.

    Also updated windows fully, customized everything to my liking... in short, a good 2-3h of work.

    This morning, I open up the laptop and surprise... EVERYTHING's back to the pre-format. I have no idea how this is even remotely possible.

    OCZ is calling this the time warp issue, and is related to the sandforce controller...

    http://forum.notebookreview.com/alienware-m17x/552728-fresh-os-install-ocz-ssd-r3.html

    any firmware before 1.29 can result in you experiencing what OCZ refers to as "Time Warp" (you lose all info stored on drive since last boot - happens at random). 1.29 decreases likelihood of this happening, but does not eliminate the possibility.

    The big problem with this failure mode is that the drive still appears to work. So if you are unlucky enough not to notice that the pricelist/tender document you are about to send or commit to is no longer showing the corrected figures/information, things could get way more painful than if your drive just didn't work (in which case work would just be delayed while you restore from backups, or if you have no backups you would just have to deal with the data loss).

  16. Re:Read the paper, not the graph by pympdaddyc · · Score: 2

    Somewhere, Ed Tufte just puked and has no idea why. Poor guy.

  17. Re:Whaddayamean "long term"? by Nethemas+the+Great · · Score: 2

    For a personal computer SSDs are probably best used for OS/Application storage, not data (documents, images, music, etc.). The cost per GB is too bloody much to justify otherwise and the less noticeable failure symptoms bolster that notion. Besides that, application load time is where these toys have their niche.

    --
    Two of my imaginary friends reproduced once ... with negative results.
  18. Re:Whaddayamean "long term"? by TheLink · · Score: 2

    So how long would it take for you to notice you had the "time warp" problem to actually start restoring from backups?

    Given you don't appear to have read what I posted, you might not be one of those who would notice in time.

    --