Data Center Study Reveals Top 5 SMART Stats That Correlate To Drive Failures
Lucas123 writes Backblaze, which has taken to publishing data on hard drive failure rates in its data center, has just released data from a new study of nearly 40,000 spindles revealing what it said are the top 5 SMART (Self-Monitoring, Analysis and Reporting Technology) values that correlate most closely with impending drive failures. The study also revealed that many SMART values that one would innately consider related to drive failures actually don't relate to them at all. Gleb Budman, CEO of Backblaze, said the problem is that the industry has created vendor-specific values, so that a stat related to one drive and manufacturer may not relate to another. "SMART 1 might seem correlated to drive failure rates, but actually it's more of an indication that different drive vendors are using it themselves for different things," Budman said. "Seagate wants to track something, but only they know what that is. Western Digital uses SMART for something else; neither will tell you what it is."
https://www.backblaze.com/blog/hard-drive-smart-stats/
Goes into a lot more detail too.
Do the words "Seagate OEM" on the drive count when you are buying it as a stand-alone unit from a store? I built the computer I am using now about 5 years ago. 3 drives (apparently new). All were replaced within 2.5 years. Bad sectors, bad clusters, "Cannot read from drive", etc. I only got warranty on two of them (after running Seatools, confirming the drive manufacture date, serial number, giving the Seatools error codes, and then sending it in the foam padded box).
Uncorrected reads do not indicate a drive will fail. They indicate the drive has _already_ failed.
The number one predictor is probably power-on time, they go into that in an earlier post.
Am I doing it right? Can I be in the cool kids' club now?
for those who are only passingly curious and don't want to read the article.
SMART 5 - Reallocated_Sector_Count.
SMART 187 - Reported_Uncorrectable_Errors.
SMART 188 - Command_Timeout.
SMART 197 - Current_Pending_Sector_Count.
SMART 198 - Offline_Uncorrectable.
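For anyone who wants to act on the list above, here is a minimal sketch of a check against those five attributes. It assumes you already have each drive's raw values in a dict keyed by attribute ID (for example, parsed from `smartctl -A` output). The function name `flag_watched_attrs` and the "any non-zero raw value is suspicious" rule are illustrative assumptions, not something the summary states.

```python
# Attribute IDs and names exactly as listed in the summary above.
WATCHED = {
    5: "Reallocated_Sector_Count",
    187: "Reported_Uncorrectable_Errors",
    188: "Command_Timeout",
    197: "Current_Pending_Sector_Count",
    198: "Offline_Uncorrectable",
}


def flag_watched_attrs(raw_values):
    """Return the names of watched attributes whose raw value is non-zero.

    raw_values: dict mapping SMART attribute ID -> raw integer value.
    Attributes the drive does not report are skipped.
    Treating any non-zero raw value as suspicious is an assumption
    made for illustration.
    """
    return [
        name
        for attr_id, name in WATCHED.items()
        if raw_values.get(attr_id, 0) > 0
    ]


if __name__ == "__main__":
    # Hypothetical drive: attribute 9 (power-on hours) is ignored.
    sample = {5: 0, 9: 12000, 187: 3, 197: 1}
    print(flag_watched_attrs(sample))
```

Not every drive reports all five attributes, which is why missing IDs are treated as zero rather than as an error.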
--- Most topics have many sides worth arguing, allow me to take one opposite you.
Correlation justifies effort to find the common cause of two phenomena.
What about a low Hardware ECC Recovered number? My Seagate has 33 or 34. I think the raw rolls over now and then. Lowest it has been is 25. No idea if it's logarithmic or not.
Where's Natalie Portman?
Ever find it odd that most PC manufacturers (at least the variety I've seen over the years) disable S.M.A.R.T. in BIOS by default? Never understood the reasoning behind that...
{} ------ When I think of a good sig, I'll put it here
As someone who is suspicious of a couple of hard drives, this data will help me to determine just how concerned I should be. I don't know what Backblaze gets out of making this information public (except publicity), but it is refreshing to see a company release information such as this rather than guard it as a trade secret or sell it.
I've used Crystal Disk Info and while it reports SMART info, I can't make much out of the info.
Many values for Samsung spinning rust just have values of Current and Worst of 100 and either a raw value of 0 or some insanely huge number.
I never take a look at SMART values or do disk benchmarks. They just make me more stressed and paranoid. If it should occur, I'll let the drive die a mighty death and restore the latest backup to a new disk.
I buy WD drives. The attributes-to-watch are 198 and 5, and maybe 197, and 12 (for greens). I know, that's a top four instead of a top five, and really I just mean it as a top two! But that's how my life has gone.
198 offline uncorrectable is where I always see trouble first. 5 can be an indirect indicator of the same thing.
12 Power Cycle Count is relevant on the EZRXs (greens); that keeps increasing unless you do certain things to prevent it, and I think (this is murky) I saw a weak correlation between this going way up and the drives failing sooner. The EFRXs (reds) work out-of-the-box, without it automatically increasing all the time.
The biggest sign that correlates to drive failure is: it's a brick and all your data is gone.
Let's be real here. You almost never get advanced warning from SMART. Maybe one in twenty. Almost without fail you'll go from a drive running properly to a drive that won't rotate the spindle or the heads smash against the casing or you've suddenly got so many bad sectors that it's effectively unusable. Failure prediction is almost (but not quite) valueless compared to the reality of how drives fail.
"Oh no... he found the
Take all the drives that have signs of failure, put them in a testing environment where you can read and write them all day but don't care about any of the data on them and see how long it takes for them to really fail. That will give you an indication of how reliable the SMART stats are at predicting real disk failure.
By analyzing ten thousand hard drive failures they figured out that the SMART stats that show errors actually show errors. What a surprise.
I hope I'm not the only person who instantly thought of a study on the exact same topic by Google. It would be rather more interesting news to see a similar study done with SSDs, or perhaps news about MRAM or PCRAM being made in sufficient quantity to take over from flash for SSDs.
Let's be real here. You almost never get advanced warning from SMART. Maybe one in twenty. Almost without fail you'll go from a drive running properly to a drive that won't rotate the spindle or the heads smash against the casing or you've suddenly got so many bad sectors that it's effectively unusable. Failure prediction is almost (but not quite) valueless compared to the reality of how drives fail.
Yeah, I did mention smartd in an earlier post, and I said it "can be handy" but I suppose I must agree with you based on my own life as it's been lived until now. We never put a server into service without at least software RAID, usually with just two disks with some exceptions. A lot of our equipment is tiny Supermicro 1Us that can only hold two. But after many years we have yet to have two go at once (knock on wood) so the warning of a RAID out of sync has saved us.
You've given up, let go and let it all hang out.
Better advice. Look at the reviews - percentage of bad reviews, nature of the problem. Do not buy a brand new model if you can avoid it. Chances are it will be cheaper on clearance and if there is an issue there will be more data out there about it.
ALWAYS BACK UP to multiple drives at multiple locations if you can't replace that data. Do NOT rely on RAID. Do not store all your backups in one physical location where a single fire, rat chewing them or other event might compromise them. At least one copy should be completely offline once the backup is taken so it can't be hit by a virus or bug. Relatives that you trust - a parent, child or sibling if they live far enough to avoid one disaster hitting you both but close enough to do semi-regular backups make excellent choices for storage. Cloud storage is not reliable.
If you go by Google's definition of failing (the raw value of any of Reallocated_Sector_Ct, Current_Pending_Sector, or Offline_Uncorrectable goes non-zero) rather than the SMART definition of failing (any scaled value goes below the "failure threshold" value defined in the drive's firmware), about 40% of drive failures can be predicted with an acceptably low false-positive rate. You're correct, though, that the "SMART health assessment" is useless as a predictor of failure.
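The two definitions of "failing" in the comment above can be put side by side in a short sketch. This is an illustration under stated assumptions, not code from either study: the `Attr` record and both function names are hypothetical, and treating a scaled value equal to the threshold as failing (with a threshold of 0 meaning "no threshold") is an assumption about how the firmware threshold is applied.

```python
from dataclasses import dataclass


@dataclass
class Attr:
    """One SMART attribute row (hypothetical record for this sketch)."""
    name: str
    value: int      # scaled (normalized) value reported by the drive
    threshold: int  # failure threshold defined in the drive's firmware
    raw: int        # raw counter


# The three attributes named in the comment above.
RAW_WATCH = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def failing_by_raw(attrs):
    """Raw-value definition: any watched raw counter is non-zero."""
    return any(a.raw != 0 for a in attrs if a.name in RAW_WATCH)


def failing_by_threshold(attrs):
    """SMART definition: a scaled value has dropped to its threshold.

    Assumption: a threshold of 0 means the attribute has no failure
    threshold, and value == threshold already counts as failing.
    """
    return any(a.threshold > 0 and a.value <= a.threshold for a in attrs)


if __name__ == "__main__":
    # Hypothetical drive: 8 reallocated sectors, scaled value still 100.
    drive = [
        Attr("Reallocated_Sector_Ct", value=100, threshold=36, raw=8),
        Attr("Spin_Up_Time", value=97, threshold=0, raw=0),
    ]
    print(failing_by_raw(drive), failing_by_threshold(drive))
```

The sample drive shows the gap the comment describes: a handful of reallocated sectors trips the raw-value check long before the scaled value gets anywhere near the firmware threshold.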
They did a study on this a few years back. It comes to about the same conclusions that Backblaze's study does, but with more numbers (and a larger data set).
"They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
I'd disagree. As an MSP we see occasional SMART errors and they're logged and tickets created.
So far we've cloned / backed up / moved everything of note off all 27 of them, but the three we left in and just spinning have all died within a month or so.
Sure, it's not scientifically representative, but I'll not take that chance with clients' data...
"We know what happens to people who stay in the middle of the road. They get run over." - Aneurin Bevan
Am I the only one seeing that judging the longevity of commercial drives under enterprise workload tells very little about how these drives will perform under workstation or home NAS use?
I'd disagree. As an MSP we see occasional SMART errors and they're logged and tickets created. So far we've cloned / backed up / moved everything of note off all 27 of them, but the three we left in and just spinning have all died within a month or so.
Sure, it's not scientifically representative, but I'll not take that chance with clients' data...
Yeah, I won't dispute your experience because it happened. On the other hand, the only SMART warnings I've seen in our fleet of... four-digits worth of spindles... have ended up false-positives. As in, I contact DELL / IBM / HP / Lenovo and report the issue, they instruct me to flash some controller firmwares, reboot, and go away. If those drives ever fail, it's years later, well beyond any correlation with the SMART events.
"Oh no... he found the
As an MSP, false-positives are not always a negative. There, I said it... and most MSPs will agree begrudgingly when off the record.
That said, our support prices alter when the device is no longer under warranty, so the device usually gets moved to a location covered under a different support structure like only 8x5 or have a longer response time to compensate.
"We know what happens to people who stay in the middle of the road. They get run over." - Aneurin Bevan
From the data collected so far, the leading cause of drive failure appears to be Congressional subpoena.
IBM says about 50% of the time, and that matches my experience. Of course, if you are only watching for SMART failure and not suspicious attribute changes and trends, then you are doing it wrong.
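Watching for changes rather than the overall pass/fail flag could look something like the sketch below. It assumes two snapshots of raw values taken at different times, each a dict keyed by attribute ID. The function name `attribute_changes` and the default list of watched IDs (borrowed from the five attributes in the story) are assumptions for illustration.

```python
def attribute_changes(previous, current, watched=(5, 187, 188, 197, 198)):
    """Return {attr_id: delta} for watched raw values that increased.

    previous, current: dicts mapping SMART attribute ID -> raw value,
    taken at two points in time. Attributes missing from either
    snapshot are skipped. Decreases are ignored, since some raw
    counters roll over or reset.
    """
    changes = {}
    for attr_id in watched:
        if attr_id in previous and attr_id in current:
            delta = current[attr_id] - previous[attr_id]
            if delta > 0:
                changes[attr_id] = delta
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots taken a week apart.
    last_week = {5: 0, 12: 100, 197: 2}
    today = {5: 8, 12: 140, 197: 2}
    print(attribute_changes(last_week, today))
```

A counter that sits at a small non-zero value for years is a different signal from one that climbs between snapshots, and that difference is what this kind of comparison is meant to surface.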
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.