Slashdot Mirror


Data Center Study Reveals Top 5 SMART Stats That Correlate To Drive Failures

Lucas123 writes Backblaze, which has taken to publishing data on hard drive failure rates in its data center, has just released data from a new study of nearly 40,000 spindles revealing what it said are the top 5 SMART (Self-Monitoring, Analysis and Reporting Technology) values that correlate most closely with impending drive failures. The study also revealed that many SMART values that one would intuitively consider related to drive failures actually don't relate to them at all. Gleb Budman, CEO of Backblaze, said the problem is that the industry has created vendor-specific values, so that a stat related to one drive and manufacturer may not relate to another. "SMART 1 might seem correlated to drive failure rates, but actually it's more of an indication that different drive vendors are using it themselves for different things," Budman said. "Seagate wants to track something, but only they know what that is. Western Digital uses SMART for something else; neither will tell you what it is."

6 of 142 comments

  1. Skip the blogspam, here's the real link by Anonymous Coward · · Score: 5, Informative

    https://www.backblaze.com/blog/hard-drive-smart-stats/

    Goes into a lot more detail too.

  2. The measurements in question: by Immerman · · Score: 4, Informative

    for those who are only passingly curious and don't want to read the article.
            SMART 5 - Reallocated_Sector_Count.
            SMART 187 - Reported_Uncorrectable_Errors.
            SMART 188 - Command_Timeout.
            SMART 197 - Current_Pending_Sector_Count.
            SMART 198 - Offline_Uncorrectable
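
For anyone who wants to check these five on their own drives, here is a minimal Python sketch. It assumes the text comes from `smartctl -A` and uses the usual ten-column attribute table with the raw value in the last column; the function name, the sample text, and the "any raw value above zero is worth a look" rule are assumptions for illustration, not something the study summary above spells out.

```python
# IDs of the five attributes listed above.
WATCHED_IDS = {5, 187, 188, 197, 198}


def flag_watched_attributes(smartctl_text):
    """Return {attribute_id: raw_value} for watched attributes whose raw value is nonzero."""
    flagged = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header row and any surrounding text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED_IDS:
            continue
        try:
            raw_value = int(fields[9])  # first token of the RAW_VALUE column
        except ValueError:
            continue  # assumption: skip raw values that are not plain integers
        if raw_value > 0:
            flagged[attr_id] = raw_value
    return flagged


# Hypothetical sample in the table layout assumed above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       43210
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

print(flag_watched_attributes(SAMPLE))  # {5: 8, 197: 2}
```

Attribute names differ between vendors, so the sketch matches on the numeric ID rather than on the name.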

    --
    --- Most topics have many sides worth arguing, allow me to take one opposite you.
    1. Re:The measurements in question: by omnichad · · Score: 4, Informative

      And I can confirm. Reallocated Sector Count rarely goes above zero when the drive is fine. It's possible to have a few sectors go bad and get reallocated, but it's usually part of a bigger problem when it happens (this number is reset to zero at the factory, after all initially bad sectors have been remapped). If the Current Pending Sector Count is non-zero, it's likely over.

      I always clone a drive immediately with ddrescue when it gets to this point, while the drive is still working.
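
A rough sketch of what such a clone can look like with GNU ddrescue, written as a small Python helper that only builds the command lines and does not run anything. The two-pass recipe (a quick pass with `-n` that skips scraping, then a retry pass with `-d -r3`) is a commonly used pattern and an assumption here, not necessarily what the poster runs; the device names and mapfile path are hypothetical placeholders.

```python
import shlex


def ddrescue_passes(source, target, mapfile):
    """Build argv lists for a two-pass ddrescue clone (nothing is executed)."""
    # Pass 1: copy everything that reads easily, skip the slow scraping phase (-n).
    # -f is needed because the output is a block device that gets overwritten.
    first = ["ddrescue", "-f", "-n", source, target, mapfile]
    # Pass 2: go back over the bad areas with direct disc access (-d)
    # and up to 3 retry passes (-r3), resuming from the same mapfile.
    second = ["ddrescue", "-d", "-f", "-r3", source, target, mapfile]
    return [first, second]


# Hypothetical devices: /dev/sdX is the failing drive, /dev/sdY the replacement.
for argv in ddrescue_passes("/dev/sdX", "/dev/sdY", "rescue.map"):
    print(shlex.join(argv))
```

Double-check which device is the source before running anything like this, since `-f` will overwrite the target without asking.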

    2. Re:The measurements in question: by swillden · · Score: 4, Insightful

      Yes. This article isn't exactly news as it pretty much confirms what the global peanut gallery has already said about this stuff.

      Still, data is better than emergent collective perceptions from distributed anecdotes.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  3. Re:Uncorrected reads by ls671 · · Score: 4, Interesting

    I have had drives fail. I took them offline and wrote 0s and 1s to them with dd until Reallocated_Sector_Ct stopped rising and Current_Pending_Sector went to zero, then ran e2fsck -c -c on them 2 or 3 times, then put them back online!!!

    Most people would say this is crazy, but in my opinion the surface of a drive often has bad spots while the rest is perfectly OK. Some of those drives are still online without reporting any new errors after more than 5 years, some almost 10 years. Those are server drives with very low Start_Stop_Count, Power_Cycle_Count and Power-Off_Retract_Count. All lower than 250 after 10 years. Those drives are spinning all the time.

    Newer drives will relocate bad sectors to free reserved space they keep for that purpose. As long as you don't run out of free spare space, IMHO, it is worth a try.
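
The stopping condition described above (Reallocated_Sector_Ct no longer rising, Current_Pending_Sector at zero) can be sketched as a small Python check. The function name, the readings, and the three-reading window are hypothetical choices for illustration; the post does not say how many write passes it takes before the counters count as settled.

```python
def looks_stable(samples, window=3):
    """samples: chronological list of (reallocated, pending) raw values,
    one reading taken after each full write pass over the drive."""
    if len(samples) < window:
        return False  # not enough readings to judge
    recent = samples[-window:]
    # Reallocated count must be identical across the recent readings...
    realloc_flat = len({realloc for realloc, _ in recent}) == 1
    # ...and the pending count must be zero in every one of them.
    pending_zero = all(pending == 0 for _, pending in recent)
    return realloc_flat and pending_zero


# Hypothetical readings: the counters settle after a couple of passes.
settled = [(8, 3), (12, 1), (12, 0), (12, 0), (12, 0)]
# Hypothetical readings: the reallocated count is still climbing.
still_growing = [(8, 3), (12, 1), (13, 0)]

print(looks_stable(settled))        # True
print(looks_stable(still_growing))  # False
```

A flat count only says the drive has stopped finding new bad spots for now, so keep a backup either way.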
     

    --
    Everything I write is lies, read between the lines.
  4. Re: Seagate OEM? by brianwski · · Score: 4, Insightful

    > TL;DR: Buy whatever is cheapest, the odds are always the same.

    Disclaimer: I work at Backblaze. I'm going to completely agree with you wholeheartedly, and say in addition you must have a backup. You don't have to use us, I'm just saying if a drive has a 1 percent chance or a 30 percent chance of failing, the actionable item is the same - keep a backup and buy the cheaper drive and restore from backup when it happens.

    > over the past 10 years, I've never had a hard drive die in any of my computers while in use.

    Professionally we lose something like 10 (?) drives every single day at Backblaze. *PERSONALLY* I had a LOT of luck for a number of years, but about 3 years ago I finally lost one drive. I'm more backed up than most people, so it was a completely relaxed event. Not a bit of stress. Replace the drive, re-install the OS, and restore the data. Yet something like 95 percent of people never back up their data. IT professionals back up their family computers, but once you are out there in "normal computer user" land, it's a horror show.