Data Center Study Reveals Top 5 SMART Stats That Correlate To Drive Failures
Lucas123 writes Backblaze, which has taken to publishing data on hard drive failure rates in its data center, has just released data from a new study of nearly 40,000 spindles revealing what it said are the top 5 SMART (Self-Monitoring, Analysis and Reporting Technology) values that correlate most closely with impending drive failures. The study also revealed that many SMART values that one would innately consider related to drive failures actually don't relate to them at all. Gleb Budman, CEO of Backblaze, said the problem is that the industry has created vendor-specific values, so that a stat related to one drive and manufacturer may not relate to another. "SMART 1 might seem correlated to drive failure rates, but actually it's more of an indication that different drive vendors are using it themselves for different things," Budman said. "Seagate wants to track something, but only they know what that is. Western Digital uses SMART for something else; neither will tell you what it is."
https://www.backblaze.com/blog/hard-drive-smart-stats/
Goes into a lot more detail too.
Uncorrected reads do not indicate a drive will fail. They indicate the drive has _already_ failed.
The number one predictor is probably power-on time; they go into that in an earlier post.
I've had drives fail in the ~3 years range from a few different manufacturers. I think with a sample size of 3 drives you can't really draw any conclusions.
For those who are only passingly curious and don't want to read the article:
SMART 5 - Reallocated_Sector_Count.
SMART 187 - Reported_Uncorrectable_Errors.
SMART 188 - Command_Timeout.
SMART 197 - Current_Pending_Sector_Count.
SMART 198 - Offline_Uncorrectable
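If you want to check your own drives against that list, here's a rough sketch in Python. Big assumptions: the `raw_values` dict is a made-up input format (you'd have to fill it from whatever tool you use to read SMART data), the function name is mine, and treating "raw value above zero" as the warning sign is my own simplification, not something the summary spells out.

```python
# SMART attribute IDs and names taken from the list above.
WATCHED = {
    5: "Reallocated_Sector_Count",
    187: "Reported_Uncorrectable_Errors",
    188: "Command_Timeout",
    197: "Current_Pending_Sector_Count",
    198: "Offline_Uncorrectable",
}


def flagged_attributes(raw_values):
    """Return the names of watched attributes whose raw value is above zero.

    raw_values is a hypothetical dict mapping SMART attribute ID -> raw value.
    Attributes the drive does not report are simply skipped.
    """
    return [
        name
        for attr_id, name in WATCHED.items()
        if raw_values.get(attr_id, 0) > 0
    ]


# Invented sample readings, for illustration only.
healthy = {5: 0, 187: 0, 188: 0, 197: 0, 198: 0}
suspect = {5: 12, 187: 0, 197: 3}

print(flagged_attributes(healthy))
print(flagged_attributes(suspect))
```

An empty list means none of the five counters has moved; anything else is a reason to check your backups.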
--- Most topics have many sides worth arguing, allow me to take one opposite you.
Correlation justifies effort to find the common cause of two phenomena.
Nope. When looking for warning signs you don't care about causation, it's enough to know that the presence of A indicates an increased probability of imminent B.
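To put numbers on "increased probability", here's a small Python sketch. The counts are invented purely for illustration and the function name is mine; the point is only that you compare the failure rate among drives showing sign A with the failure rate across the whole fleet.

```python
def failure_lift(n_drives, n_failed, n_with_sign, n_failed_with_sign):
    """Compare P(fail) with P(fail | sign present).

    Returns (base_rate, rate_given_sign, lift), where lift is how many
    times more likely a drive showing the sign is to fail than an
    average drive in the fleet.
    """
    base_rate = n_failed / n_drives
    rate_given_sign = n_failed_with_sign / n_with_sign
    # Computed from the integer counts to avoid stacking rounding errors.
    lift = (n_failed_with_sign * n_drives) / (n_with_sign * n_failed)
    return base_rate, rate_given_sign, lift


# Hypothetical fleet: 1000 drives, 50 failed; 100 showed the sign, 30 of those failed.
base, given_sign, lift = failure_lift(1000, 50, 100, 30)
print(f"base rate {base:.0%}, with sign {given_sign:.0%}, lift {lift:.1f}x")
```

No causal story is needed for that ratio to be useful as a warning.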
Ever find it odd that most PC manufacturers (at least the variety I've seen over the years) disable S.M.A.R.T. in BIOS by default? Never understood the reasoning behind that...
{} ------ When I think of a good sig, I'll put it here
As someone who is suspicious of a couple of hard drives, this data will help me to determine just how concerned I should be. I don't know what Backblaze gets out of making this information public (except publicity), but it is refreshing to see a company release information such as this rather than guard it as a trade secret or sell it.
I've used Crystal Disk Info and while it reports SMART info, I can't make much out of the info.
Many values for Samsung spinning rust just have values of Current and Worst of 100 and either a raw value of 0 or some insanely huge number.
I never take a look at SMART values or do disk benchmarks. They just make me more stressed and paranoid. If it should occur, I'll let the drive die a mighty death and restore the latest backup to a new disk.
The biggest sign that correlates to drive failure is: it's a brick and all your data is gone.
Let's be real here. You almost never get advanced warning from SMART. Maybe one in twenty. Almost without fail you'll go from a drive running properly to a drive that won't rotate the spindle or the heads smash against the casing or you've suddenly got so many bad sectors that it's effectively unusable. Failure prediction is almost (but not quite) valueless compared to the reality of how drives fail.
"Oh no... he found the
Take all the drives that have signs of failure, put them in a testing environment where you can read and write them all day but don't care about any of the data on them and see how long it takes for them to really fail. That will give you an indication of how reliable the SMART stats are at predicting real disk failure.
By analyzing ten thousand hard drive failures, they figured out that the SMART stats that show errors actually show errors. What a surprise.
I've not done anything special with the two that I have in a media server at home. This stat is at 5 on the older drive and 4 on the newer drive. By comparison, a Seagate Barracuda LP in the same box is at 128 (it's quite a bit older than the WD drives), and the boot drive, a Seagate Barracuda 7200.11 I grabbed out of the unused-drive box when whatever drive it replaced failed, has 365 spinups logged.
(Looking at the stats for all of my drives, the outlook for that 7200.11 isn't so good. :-P )
20 January 2017: the End of an Error.
Hard disks recover sectors with ECC all the time. There is nothing special about it.
I buy whatever is cheapest.
I know it's a toss-up no matter what or when you buy hard drives, so the only things I have left to gauge are price, capacity, and speed (RPM), depending on the intended use.
About a year ago I took a gamble on an SSD for my primary workstation. I bought an ADATA SX900 64GB drive. I had never heard of the brand before. It was ~$120 at the time, and the cheapest for that capacity. I've been looking at getting a 128GB (or so) SSD for my laptop. Prices right now look like I will be getting another ADATA... but I am holding out for Black Friday/Cyber Monday deals to decide.
Oddly enough, over the past 10 years, I've never had a hard drive die in any of my computers while in use. I have a stack of 4 or 5 drives, ranging in capacity from 100GB to 500GB, 3 different brands, that I'm not using right now. A while back, I plugged one in just to see if it still worked and it didn't. I recently found out it was the hotswap bay that quit working, so as far as I know it still works.
Conversely, I have some servers in a datacenter. Had a drive fail on reboot after a kernel upgrade the other night. Sent a ticket to the DC and they plugged a new one in. Good to go again. In case you're wondering, it has 4x600GB SAS drives in RAID-10.
TL;DR: Buy whatever is cheapest, the odds are always the same.
> Let's be real here. You almost never get advanced warning from SMART. Maybe one in twenty. Almost without fail you'll go from a drive running properly to a drive that won't rotate the spindle or the heads smash against the casing or you've suddenly got so many bad sectors that it's effectively unusable. Failure prediction is almost (but not quite) valueless compared to the reality of how drives fail.
Yeah, I did mention smartd in an earlier post, and I said it "can be handy", but I suppose I must agree with you based on my own life as it's been lived until now. We never put a server into service without at least software RAID, usually with just two disks, with some exceptions. A lot of our equipment is tiny Supermicro 1Us that can only hold two. But after many years we have yet to have two go at once (knock on wood), so the warning of a RAID out of sync has saved us.
You've given up, let go and let it all hang out.
Better advice. Look at the reviews - percentage of bad reviews, nature of the problem. Do not buy a brand new model if you can avoid it. Chances are it will be cheaper on clearance and if there is an issue there will be more data out there about it.
ALWAYS BACK UP to multiple drives at multiple locations if you can't replace that data. Do NOT rely on RAID. Do not store all your backups in one physical location where a single fire, rat chewing them or other event might compromise them. At least one copy should be completely offline once the backup is taken so it can't be hit by a virus or bug. Relatives that you trust - a parent, child or sibling if they live far enough to avoid one disaster hitting you both but close enough to do semi-regular backups make excellent choices for storage. Cloud storage is not reliable.
You got lucky. I had 8 out of 10 ADATA 64GB mSATA drives fail at my workplace over the last year. ADATA is crap.
SSDs are a whole different ballgame. Comparing their quirks to rotating hard drives is akin to comparing a car to a train. They do not work the same, nor fail the same.
SSDs are by no means all created equal, and you must do research before buying them. I like Samsung, Intel and Crucial personally, based on experience. Be sure to keep up with firmware updates as well!
> TL;DR: Buy whatever is cheapest, the odds are always the same.
Disclaimer: I work at Backblaze. I'm going to completely agree with you wholeheartedly, and say in addition you must have a backup. You don't have to use us, I'm just saying if a drive has a 1 percent chance or a 30 percent chance of failing, the actionable item is the same - keep a backup and buy the cheaper drive and restore from backup when it happens.
> over the past 10 years, I've never had a hard drive die in any of my computers while in use.
Professionally we lose something like 10 (?) drives every single day at Backblaze, but *PERSONALLY* I had a LOT of luck for a number of years, until about 3 years ago, when I finally lost one drive. I'm more backed up than most people, so it was a completely relaxed event. Not a bit of stress. Replace the drive, re-install the OS, and restore the data. Yet something like 95 percent of people never back up their data. IT professionals back up their family computers, but once you are out there in "normal computer user" land, it's a horror show.
> power-cycling the drive can have an effect on its lifetime and/or reliability
Yes, exactly, why are you calling this stupid? It is interesting because it might affect your behavior - if you power cycle the drives every day, maybe you should consider leaving them powered up, if electricity is cheaper than replacing the drive. It's just an observation, leaving it out seems.... irresponsible? Disclaimer: I work at Backblaze.
If you go by Google's definition of failing (the raw value of any of Reallocated_Sector_Ct, Current_Pending_Sector, or Offline_Uncorrectable goes non-zero) rather than the SMART definition of failing (any scaled value goes below the "failure threshold" value defined in the drive's firmware), about 40% of drive failures can be predicted with an acceptably low false-positive rate. You're correct, though, that the "SMART health assessment" is useless as a predictor of failure.
They did a study on this a few years back. It comes to about the same conclusions that Backblaze's study does, but with more numbers (and a larger data set).
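The two definitions are easy to mix up, so here's a minimal Python sketch of both side by side. Assumptions: the `Attribute` record and both function names are made up for illustration, the sample numbers are invented, and the threshold test follows the wording above ("goes below") rather than any particular tool's behavior.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str
    value: int      # normalized ("scaled") value reported by the drive
    threshold: int  # failure threshold defined in the drive's firmware
    raw: int        # raw counter


# The three attributes named in the raw-value definition above.
RAW_WATCHLIST = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def failing_by_raw(attrs):
    """Raw-value definition: any watched raw counter has gone non-zero."""
    return any(a.name in RAW_WATCHLIST and a.raw != 0 for a in attrs)


def failing_by_threshold(attrs):
    """SMART definition as worded above: a scaled value below its threshold."""
    return any(a.value < a.threshold for a in attrs)


# Invented example: a few reallocated sectors, scaled value still far above threshold.
drive = [
    Attribute("Reallocated_Sector_Ct", value=100, threshold=36, raw=8),
    Attribute("Current_Pending_Sector", value=100, threshold=0, raw=0),
    Attribute("Power_Cycle_Count", value=100, threshold=20, raw=365),
]

print(failing_by_raw(drive))
print(failing_by_threshold(drive))
```

With these made-up numbers the raw-value check trips while the firmware's own health assessment still says everything is fine, which is the gap being described.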
"They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
Perhaps it has been too long - you've clearly forgotten the even longer history of the deadpan response to spam.
That said, I can't actually think of many cases of such spam in response to articles/discussions that never mentioned causation at all - but maybe that's just because causally irrelevant mentions of correlation are relatively rare. Or because the spamming was so bland that it just disappeared into the background.
Also: why oh why would you want to try to resurrect such an old and worthless meme? It's like the script-kiddie version of trolling: no creativity, no vitriol, no emotional provocation of any kind. Just a slight waste of time for everyone involved, not even enough to be annoying.
IIRC, the greens are the "energy efficient" drives, and I think they power themselves down when idle, and up when they come back into use, so the numbers can grow even if the machine hasn't been rebooted since the drive was first installed.
What are the odds I would have one of the employees from the article comment on my little ol' post?
I keep local backups. I've been browsing online, looking for an online backup service that I like, so far not a whole lot of luck. I exclusively run Funtoo Linux on all of my personal and office computers (workstation at home, workstation at office, and laptop). From what I understand, you don't support Linux (yet).
My basic requirements are:
- support Linux (one of ssh/scp, rsync, webdav)
- preferably data located in Canada
As it stands, I'm better off firing up a VM on one of my servers and backing up to it... but that comes with all the other associated headaches like securing, configuring, maintaining the server.
I'd disagree. As an MSP we see occasional SMART errors and they're logged and tickets created.
So far we've cloned / backed up / moved everything of note off all 27 of them, but the three we left in and just spinning have all died within a month or so.
Sure, it's not scientifically representative, but I'll not take that chance with clients' data...
"We know what happens to people who stay in the middle of the road. They get run over." - Aneurin Bevan
Please mod up. Seagate drives fail much sooner than all other brands.
> I'd disagree. As an MSP we see occasional SMART errors and they're logged and tickets created. So far we've cloned / backed up / moved everything of note off all 27 of them, but the three we left in and just spinning have all died within a month or so.
> Sure, it's not scientifically representative, but I'll not take that chance with clients' data...
Yeah, I won't dispute your experience because it happened. On the other hand, the only SMART warnings I've seen in our fleet of... four-digits worth of spindles... have ended up false-positives. As in, I contact DELL / IBM / HP / Lenovo and report the issue, they instruct me to flash some controller firmwares, reboot, and go away. If those drives ever fail, it's years later, well beyond any correlation with the SMART events.
"Oh no... he found the
As an MSP, false positives are not always a negative. There, I said it... and most MSPs will agree begrudgingly when off the record.
That said, our support prices alter when the device is no longer under warranty, so the device usually gets moved to a location covered under a different support structure like only 8x5 or have a longer response time to compensate.
"We know what happens to people who stay in the middle of the road. They get run over." - Aneurin Bevan
Enclosure heat. They're "passively cooled" through layers of plastic. If you're not allowing good ventilation around the enclosure, and you're leaving it spinning all the time, it'll bake.
Watch for Penguins, they eat Apples and throw rocks at Windows.
I'm calling it stupid because if you don't know anything about the time between the power cycles, you can at best assume that the power cycle count is a low-quality proxy for powered hours.
For any claim that the number of power cycles itself is a predictor of failure, you'd need to, you know, power cycle a bunch of drives at various rates until they die, and see if merely power cycling it more often makes it fail faster. Only in such conditions would the power cycle mean anything. Otherwise it's stupid and let's just stop with the stupidity, okay?
A successful API design takes a mixture of software design and pedagogy.
Load Cycle Count and Power Cycle Count aren't the same thing.
Pray tell, what has a firmware bug got to do with the meaning of a power cycle counter, other than that in this particular case you can't rely on a faulty counter? Let's not deflect attention to strawmen.
I pay for a business account with an online retailer. Said business account provides me with a 2 year exchange on all hard drives (and a bunch of other benefits).
So if the drive fails within 2 years, I send it back to them and they replace it with a similar model, and they pay for the shipping.
If it happens out of the two year scope, I'm better off just buying a new drive than dealing with the hassle of sending it to the manufacturer.
I don't own a shop, nor do I provide IT services. I used to buy A LOT of stuff from them, and decided to start paying the yearly fee.
We learned in statistics class that unless you have special circumstances a sample of size 100 is what you need at the very least.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
As this is about indicators, it is correlation all right and it is meaningful. Of course, SMART attributes do not _cause_ drive failures.
Don't worry about it. For reads the disks try to start reading very early after positioning, so the heads may not be perfectly aligned yet. This leads to some retries and some ECC recovered.
IBM says about 50% of the time, and that matches my experience. Of course, if you are only watching for SMART failure and not suspicious attribute changes and trends, then you are doing it wrong.
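For what "watching changes and trends" could look like in practice, a minimal Python sketch. Everything here is an assumption for illustration: the `history` layout, the function name, the `min_increase` parameter, and the sample readings are all made up, and a real setup would need to decide which counters are allowed to grow (power cycles, hours) and which are not.

```python
def rising_attributes(history, min_increase=1):
    """Return {attribute name: increase} for raw counters that have grown.

    history is a hypothetical dict mapping attribute name -> list of raw
    readings, oldest first. Attributes with fewer than two readings are
    skipped because there is no trend to speak of yet.
    """
    rising = {}
    for name, readings in history.items():
        if len(readings) < 2:
            continue
        delta = readings[-1] - readings[0]
        if delta >= min_increase:
            rising[name] = delta
    return rising


# Invented weekly readings for one drive.
history = {
    "Reallocated_Sector_Ct": [0, 0, 2, 8],
    "Current_Pending_Sector": [0, 0, 0, 0],
    "Offline_Uncorrectable": [0],
}

print(rising_attributes(history))
```

The drive in this example would never trip the firmware's pass/fail flag, but a reallocated sector count that keeps climbing is exactly the kind of trend worth acting on.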
The Google thing was pretty badly done by people that barely understand the subject matter or scientific process.