Data Center Study Reveals Top 5 SMART Stats That Correlate To Drive Failures

Posted by samzenpus on Wednesday November 12, 2014 @09:34AM from the about-to-go dept.

Lucas123 writes: Backblaze, which has taken to publishing data on hard drive failure rates in its data center, has just released data from a new study of nearly 40,000 spindles revealing what it said are the top 5 SMART (Self-Monitoring, Analysis and Reporting Technology) values that correlate most closely with impending drive failures. The study also revealed that many SMART values that one would innately consider related to drive failures actually don't relate to them at all. Gleb Budman, CEO of Backblaze, said the problem is that the industry has created vendor-specific values, so that a stat related to one drive and manufacturer may not relate to another. "SMART 1 might seem correlated to drive failure rates, but actually it's more of an indication that different drive vendors are using it themselves for different things," Budman said. "Seagate wants to track something, but only they know what that is. Western Digital uses SMART for something else; neither will tell you what it is."

25 of 142 comments

Skip the blogspam, here's the real link by Anonymous Coward · 2014-11-12 09:39 · Score: 5, Informative

https://www.backblaze.com/blog/hard-drive-smart-stats/
Goes into a lot more detail too.
Uncorrected reads by russotto · 2014-11-12 09:49 · Score: 2

Uncorrected reads do not indicate a drive will fail. They indicate the drive has _already_ failed.
The number one predictor is probably power-on time; they go into that in an earlier post.
1. Re:Uncorrected reads by ls671 · 2014-11-12 11:54 · Score: 4, Interesting
  
  I have had drives fail. I took them offline and wrote 0s and 1s to them with dd until Reallocated_Sector_Ct stopped rising and Current_Pending_Sector went to zero, then ran e2fsck -c -c on them 2 or 3 times. Then I put them back online!!!
  Most people would say this is crazy, but in my opinion the surface of a drive often has bad spots while the rest is perfectly OK. Some of those drives are still online without reporting any new errors after more than 5 years, some almost 10 years. Those are server drives with very low Start_Stop_Count, Power_Cycle_Count and Power-Off_Retract_Count, all lower than 250 after 10 years. Those drives are spinning all the time.
  Newer drives will relocate bad sectors to free reserved space they keep for that purpose. As long as you don't run out of free spare space, IMHO, it is worth a try.
  
  --
  Everything I write is lies, read between the lines.
Re:Seagate OEM? by AaronLS · 2014-11-12 09:50 · Score: 2

I've had drives fail in the ~3 years range from a few different manufacturers. I think with a sample size of 3 drives you can't really draw any conclusions.
The measurements in question: by Immerman · 2014-11-12 09:53 · Score: 4, Informative

for those who are only passingly curious and don't want to read the article.
SMART 5 - Reallocated_Sector_Count.
SMART 187 - Reported_Uncorrectable_Errors.
SMART 188 - Command_Timeout.
SMART 197 - Current_Pending_Sector_Count.
SMART 198 - Offline_Uncorrectable

--
--- Most topics have many sides worth arguing, allow me to take one opposite you.
1. Re:The measurements in question: by SpaceManFlip · 2014-11-12 10:05 · Score: 3, Insightful
  
  I read the article to find those "5 Top SMART Stats" they refer to, but I'm replying here because it's the relevant place.
  Those 5 SMART stats match up exactly with what I habitually look at on the job monitoring lots of RAID arrays' drives. Those are the stats that tell you if the drive is going bad most often in my experience.
2. Re:The measurements in question: by AmiMoJo · 2014-11-12 10:18 · Score: 2
  
  I tend to think a drive has failed once it has any uncorrectable errors... I lost some data, it couldn't be read back. Drive gets returned to the manufacturer under warranty. Don't wait around for it to fail further.
  I agree with the reallocated sector count though. The moment that starts to rise I usually make sure the data is fully backed up and then do a full surface scan. The full scan almost always causes the drive to find more failed sectors and die, so it gets sent back under warranty too.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
3. Re:The measurements in question: by rduke15 · 2014-11-12 10:23 · Score: 2
  
  And to list these for your own drive:
  $ sudo smartctl -A /dev/sda | egrep '^\s*(ID|5|1[89][78])'
  ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
  188 Command_Timeout         0x0032   100   253   000    Old_age   Always       -       0
  197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
  198 Offline_Uncorrectable   0x0010   100   100   000    etc.
  (Incomplete last line to "use fewer 'junk' characters." as requested by that silly filter)
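For anyone who would rather script this check than eyeball it, here is a minimal Python sketch that parses `smartctl -A` text output and reports which of the five attributes from this thread have a nonzero raw value. The function names and the sample text are made up for illustration, and it assumes the ten-column attribute table layout shown in the output above, with the raw value in the last column. Vendors format raw values differently, so treat the parsing as a rough guess rather than a reliable decoder.

```python
import re

# Attribute IDs named in this thread: 5, 187, 188, 197, 198.
WATCHED_IDS = (5, 187, 188, 197, 198)

# Hypothetical sample text in the usual `smartctl -A` table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""


def parse_smart_attributes(text):
    """Return {id: (name, raw_value)} for each attribute row in the table."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # Assumption: the raw value begins with a decimal integer.
        match = re.match(r"\d+", fields[9])
        if match is None:
            continue
        rows[int(fields[0])] = (fields[1], int(match.group()))
    return rows


def nonzero_watched(text):
    """Return the watched attributes whose raw value is above zero."""
    attrs = parse_smart_attributes(text)
    return {i: attrs[i] for i in WATCHED_IDS if i in attrs and attrs[i][1] > 0}


if __name__ == "__main__":
    for attr_id, (name, raw) in nonzero_watched(SAMPLE).items():
        print(attr_id, name, raw)
```

To run it against a real drive, the text could come from something like `subprocess.run(["smartctl", "-A", "/dev/sda"], capture_output=True, text=True).stdout`, which normally needs root.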
4. Re:The measurements in question: by omnichad · 2014-11-12 10:23 · Score: 4, Informative
  
  And I can confirm. Reallocated Sector Count rarely goes above zero when the drive is fine. It's possible to have a few sectors go bad and get reallocated, but it's usually part of a bigger problem when it happens (this number is reset to zero at the factory, after all initially bad sectors have been remapped). If the Current Pending Sector Count is non-zero, it's likely over.
  I always clone a drive immediately with ddrescue when it gets to this point, while the drive is still working.
5. Re:The measurements in question: by jedidiah · 2014-11-12 10:25 · Score: 2
  
  Yes. This article isn't exactly news as it pretty much confirms what the global peanut gallery has already said about this stuff.
  
  --
  A Pirate and a Puritan look the same on a balance sheet.
6. Re:The measurements in question: by koinu · 2014-11-12 10:30 · Score: 3, Informative
  
  Reallocated_Sector_Count: sectors that the drive successfully replaced
  Reported_Uncorrectable_Errors: errors that could not be recovered by ECC
  Command_Timeout: controller hung and had to be reset
  Current_Pending_Sector_Count: sectors to be replaced on the next write access
  Offline_Uncorrectable: sectors that the drive tried to repair, but failed (try offline test, maybe it is not dead yet)
7. Re:The measurements in question: by afidel · 2014-11-12 10:59 · Score: 2
  
  I never worry about going home, my array has plenty of spare capacity to handle rebuilds, we schedule the technician when it's convenient to us, not when it's convenient for them or the array. When you have guard space for at least 4 disk failures (out of a few hundred) you deal with replacements in a less urgent manner than a traditional small RAID5 array in a standalone server. Within ~30 minutes of a failure or a predictive failure my arrays are back to 100% resiliency with slightly less guard space. It's one of many reasons why I only buy wide striped arrays.
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
8. Re:The measurements in question: by omnichad · 2014-11-12 11:26 · Score: 3, Informative
  
  > Also, generally you don't need to panic over this attribute. You should panic when it increases steadily.
  True, I've had a few drives hold steady at 1 sector reallocated. But if Current Pending Sector count remains non-zero for very long, it's a headache at the very least and probably a failure. Generally, it seems like as soon as you crest zero, it's over. I've had the next symptom be a totally unresponsive drive. But doing the backup when you hit 1 (admittedly overly cautious) will force the drive to read off all the sectors and you'll at least get your backup while you verify the rest of the drive still reads OK.
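The "holding steady versus increasing steadily" distinction can be sketched as a small check over logged readings. This is only an illustration: the function name is hypothetical, the readings are assumed to be raw Reallocated_Sector_Ct values sampled oldest first, and the cutoff of two or more increases is an arbitrary assumption, not something taken from the Backblaze data.

```python
def classify_reallocated_trend(samples, min_increases=2):
    """Classify a series of raw reallocated-sector counts, oldest first.

    Returns "clean" if the latest reading is zero (or there are no readings),
    "rising" if the count went up at least `min_increases` times,
    and "steady" otherwise.
    """
    if not samples or samples[-1] == 0:
        return "clean"
    # Count how many consecutive pairs of readings show an increase.
    increases = sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)
    if increases >= min_increases:
        return "rising"
    return "steady"


if __name__ == "__main__":
    print(classify_reallocated_trend([0, 1, 1, 1]))
    print(classify_reallocated_trend([0, 1, 3, 8]))
```

A drive that jumped to 1 once and stayed there comes back as "steady"; one that keeps climbing comes back as "rising", which is the case where cloning it right away makes sense.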
9. Re:The measurements in question: by swillden · 2014-11-12 13:53 · Score: 4, Insightful
  
  > Yes. This article isn't exactly news as it pretty much confirms what the global peanut gallery has already said about this stuff.
  Still, data is better than emergent collective perceptions from distributed anecdotes.
  
  --
  Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Re:Correlation != causation by Immerman · 2014-11-12 09:55 · Score: 3, Insightful

Nope. When looking for warning signs you don't care about causation, it's enough to know that the presence of A indicates an increased probability of imminent B.

--
--- Most topics have many sides worth arguing, allow me to take one opposite you.
Re:Cool data but... by Rashkae · 2014-11-12 10:16 · Score: 2

If the PC has less than optimal cooling, it's possible, even likely, that the drive temperature will exceed operating specs at some point. Even if there is no ill effect or any long-term problem, the BIOS will forevermore report "Imminent Drive Failure" on every boot if BIOS SMART is enabled.
Put the SMART stats to the test by DidgetMaster · 2014-11-12 10:41 · Score: 2

Take all the drives that have signs of failure, put them in a testing environment where you can read and write them all day but don't care about any of the data on them and see how long it takes for them to really fail. That will give you an indication of how reliable the SMART stats are at predicting real disk failure.
1. Re:Put the SMART stats to the test by brianwski · 2014-11-12 14:56 · Score: 3, Informative
  
  Disclaimer: I work at Backblaze. Essentially this is what we did. We don't care at all if one drive dies, so we left it in an environment where we can read and write them all day (the storage pods with live customer data) and when they failed we calmly replaced them with zero customer data loss and produced this blog post. :-)
Re: Seagate OEM? by corychristison · 2014-11-12 11:30 · Score: 3, Informative

I buy whatever is cheapest.
I know it's a toss-up no matter what or when you buy hard drives, so the only things I have left to gauge are price, capacity, and speed (RPM), depending on the intended use.
About a year ago I took a gamble on an SSD for my primary workstation. I bought an ADATA SX900 64GB drive. I had never heard of the brand before. It was ~$120 at the time, and the cheapest for that capacity. I've been looking at getting a 128GB (or so) SSD for my laptop. Prices right now look like I will be getting another ADATA... but I am holding out for Black Friday/Cyber Monday deals to decide.
Oddly enough, over the past 10 years, I've never had a hard drive die in any of my computers while in use. I have a stack of 4 or 5 drives, ranging in capacity from 100GB to 500GB, 3 different brands, that I'm not using right now. A while back, I plugged one in just to see if it still worked and it didn't. I recently found out it was the hotswap bay that quit working, so as far as I know it still works.
Conversely, I have some servers in a datacenter. Had a drive fail on reboot after a kernel upgrade the other night. Sent a ticket to the DC and they plugged a new one in. Good to go again. In case you're wondering, it has 4x600GB SAS drives in RAID-10.
TL;DR: Buy whatever is cheapest, the odds are always the same.
Re:Thanks, Backblaze! by organgtool · 2014-11-12 11:42 · Score: 2

Perhaps they are obvious to a System Administrator but to someone who is not an admin, everything in SMART probably looks like an error. In addition to that, the article describes common errors that sound indicative of a drive failure but are actually relatively benign. So there is definitely value in this information.
Re: Seagate OEM? by brianwski · 2014-11-12 14:39 · Score: 4, Insightful

> TL;DR: Buy whatever is cheapest, the odds are always the same.

Disclaimer: I work at Backblaze. I'm going to completely agree with you wholeheartedly, and say in addition you must have a backup. You don't have to use us, I'm just saying if a drive has a 1 percent chance or a 30 percent chance of failing, the actionable item is the same - keep a backup and buy the cheaper drive and restore from backup when it happens.
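A rough worked example of the 1 percent versus 30 percent point, as a sketch only: it assumes each drive fails independently with the same probability over whatever period you care about, which real drives from the same batch may not do. The function name and the drive counts are made up for illustration; only the two rates come from the comment above.

```python
def p_at_least_one_failure(p_single, n_drives):
    """Chance that one or more of n independent drives fail,
    given each fails with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_drives


if __name__ == "__main__":
    # Rates from the comment; drive counts are hypothetical.
    for rate in (0.01, 0.30):
        for n in (1, 4):
            chance = p_at_least_one_failure(rate, n)
            print(f"rate={rate:.0%} drives={n}: at least one failure {chance:.1%}")
```

Neither rate gets the chance of a failure to zero, so the backup is needed in both cases.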

> over the past 10 years, I've never had a hard drive die in any of my computers while in use.

Professionally we lose something like 10 (?) drives every single day at Backblaze, but *PERSONALLY* I had a LOT of luck for a number of years, until about 3 years ago when I finally lost one drive. I'm more backed up than most people, so it was a completely relaxed event. Not a bit of stress. Replace the drive, re-install the OS, and restore the data. Yet something like 95 percent of people never back up their data. IT professionals back up their family computers, but once you are out there in "normal computer user" land, it's a horror show.
Re:Thanks, Backblaze! by brianwski · 2014-11-12 14:50 · Score: 2

Disclaimer: I work at Backblaze.

> SMART values they expected to be an indication on drive wear showed no correlation with failure

Exactly. Also, some people care about more than "approximately correlates" and want to see the actual data on exactly how correlated it is.
Re:My useless(?) WD anecdotes by brianwski · 2014-11-12 14:54 · Score: 2

> power-cycling the drive can have an effect on its lifetime and/or reliability

Yes, exactly, why are you calling this stupid? It is interesting because it might affect your behavior - if you power cycle the drives every day, maybe you should consider leaving them powered up, if electricity is cheaper than replacing the drive. It's just an observation, leaving it out seems.... irresponsible? Disclaimer: I work at Backblaze.
Re:RUBBISH by FatdogHaiku · 2014-11-12 16:38 · Score: 2

Also grabbing a copy of smartmontools might be a good idea...
http://smartmontools.sourceforge.net

--
You have the right to remain sentient. If you give up the right to remain sentient, you will be elected to public office
Re:RUBBISH by profplump · 2014-11-12 20:05 · Score: 2

He hasn't given up, he's just acknowledged the reality that the variance among drives of any particular model is large enough that he can't statistically pick a winner even given reliable statistics about the past performance of similar drives (which are definitely not available) and assuming the drives never change over their manufacturing life (which is definitely not true).
If you're buying 1000 hard drives their average reliability is meaningful to you (though even then it's only *a* factor, not *the* factor). But if you're only buying a handful of drives and prioritizing reliability you're much better off with diversity than any single model because the average reliability means almost nothing in your small application and diversity at least lets you avoid duplicating systematic faults.
Whatever strategy you think you've devised to beat the statistics is just you hoping to pick the right stock/horse/number and lying to yourself about the odds -- even if you have good data and choose the statistically best option there's still a very good chance it won't turn out to be the best one available and a moderate chance it will be one of the worst.