Data Center Study Reveals Top 5 SMART Stats That Correlate To Drive Failures

Skip the blogspam, here's the real link by Anonymous Coward · 2014-11-12 09:39 · Score: 5, Informative

https://www.backblaze.com/blog/hard-drive-smart-stats/

Goes into a lot more detail too.

Uncorrected reads by russotto · 2014-11-12 09:49 · Score: 2

Uncorrected reads do not indicate a drive will fail. They indicate the drive has _already_ failed.

The number one predictor is probably power-on time, they go into that in an earlier post.

Re:Uncorrected reads by ls671 · 2014-11-12 11:54 · Score: 4, Interesting

I have had drives fail. I took them off line and wrote 0 and 1 to them with dd until Reallocated_Sector_Ct stopped rising and Current_Pending_Sector went to zero, then ran e2fsck -c -c on them 2 or 3 times, then I put them back on line!!!
Most people would say this is crazy but in my opinion, the surface of a drive often has bad spots while the rest is perfectly OK. Some of those drives are still on line without reporting any new errors after more than 5 years, some almost 10 years. Those are server drives with very low Start_Stop_Count, Power_Cycle_Count and Power-Off_Retract_Count. All lower than 250 after 10 years. Those drives are spinning all the time.
Newer drives will relocate bad sectors to free reserved space they keep for that purpose. As long as you don't run out of free spare space, IMHO, it is worth a try.

--
Everything I write is lies, read between the lines.
Re:Uncorrected reads by RabidReindeer · 2014-11-12 14:45 · Score: 1

Newer drives will relocate bad sectors to free reserved space they keep for that purpose.
IBM Mainframe drives did that back in the 1960s.
From what I've seen of hard drives, they're a lot like silicon wafers. Rarely perfect, but as long as they're "good enough", the controller maps around the bad spots that they came with as well as a certain number of ones that form over the operating life.
Re:Uncorrected reads by DigiShaman · 2014-11-12 16:20 · Score: 1

Newer drives will relocate bad sectors to free reserved space they keep for that purpose. As long as you don't run out of free spare space, IMHO, it is worth a try.
HDDs don't rely on user addressable free space to remap LBAs; they now have their own non-user accessible spare space that gets allocated for the remapping purpose automatically. Effectively, it happens on-the-fly at the hardware layer. It's why you rarely, if ever, will have bad clusters at the file system level; it's oblivious to what's really going on.

--
Life is not for the lazy.
Re:Uncorrected reads by AmiMoJo · 2014-11-12 20:45 · Score: 1

The problem is you have no idea how many free reallocated sectors are available. It isn't even consistent between drives, as some will have been used at the factory before the count was reset to zero.
Your strategy is reasonable if the drives are part of a redundant array or just used for backed up data, but for most people once the reallocated sector count starts to rise it's best to just return the drive as a SMART failure and get it replaced.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:Uncorrected reads by ls671 · 2014-11-13 00:41 · Score: 1

I know that. I run e2fsck -c -c (write+read test) to generate random pattern writes on the drives then read the data to make sure it is the same. If I put the drive back on line, e2fsck -c -c will always report 0 bad blocks and no timeouts will have occurred. I also check for timeouts in the logs.
Failed reads on a drive that is part of a RAID array will usually cause the drive to be kicked out of the array after a timeout, slowing down the machine. The strategy I suggested allows the drive hardware to indeed relocate the bad blocks and to bring Current_Pending_Sector to zero.
So, more or less:
1) Drive gets kicked out of the array.
2) Look for read timeouts in the logs
3) Use what I described until read timeouts vanish.
4) Keep an eye on SMART data and further read timeouts to ensure the drive has stabilized. I actually use cron scripts for that.
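
The monitoring in step 4 could be scripted roughly like this. It is only a sketch, not ls671's actual cron script: it assumes you already have two snapshots of raw SMART values as {attribute_id: raw_value} dicts (how you collect them is up to you), and the function names and sample numbers are made up for illustration.

```python
def rising_attributes(previous, current):
    """Return {id: (old, new)} for raw counters that increased between two snapshots."""
    rising = {}
    for attr_id, new_value in current.items():
        old_value = previous.get(attr_id)
        # Only compare attributes that were present in both snapshots.
        if old_value is not None and new_value > old_value:
            rising[attr_id] = (old_value, new_value)
    return rising


def looks_stable(previous, current):
    """Stable here means: nothing pending (197 is zero) and no counter went up."""
    pending_cleared = current.get(197, 0) == 0
    return pending_cleared and not rising_attributes(previous, current)


# Hypothetical snapshots: 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector.
before = {5: 12, 197: 3}
after_rewrite = {5: 12, 197: 0}
a_week_later = {5: 15, 197: 1}

print(looks_stable(before, after_rewrite))             # True
print(rising_attributes(after_rewrite, a_week_later))  # {5: (12, 15), 197: (0, 1)}
```

Comparing snapshots rather than looking at absolute values matches the advice elsewhere in the thread that a steady count matters less than one that keeps climbing.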

Re:Uncorrected reads by DigiShaman · 2014-11-13 04:45 · Score: 1

Error recovery control. Also known as TLER, ERC, or CCTL.
You shouldn't have to script any of this if you're using drives that support error recovery. Western Digital desktop drives do not have TLER. As such, the slightest hesitation can kick a drive out of a RAID array. Sucks balls, but don't use generic desktop drives (or any drive for that matter) that don't support this in hardware.

Re:Uncorrected reads by goarilla · 2014-11-13 05:34 · Score: 1

So you overwrite your drive with 0 (/dev/zero) and 1's (/dev/one???) but still you were able to e2fsck it afterwards ?
Re:Uncorrected reads by goarilla · 2014-11-13 05:36 · Score: 1

Just keep your RAID arrays small in number of drives if you use desktop drives, and/or spring for the extra money to buy WD Reds, which do have TLER IIRC.
Re:Uncorrected reads by ls671 · 2014-11-13 05:42 · Score: 1

I suggest you do a little more research. If a sector was successfully written to and then 2 months later the drive hardware can't read from it, there is no way for the drive hardware to automagically correct the error and recover the data. The drive hardware then just increments the Current_Pending_Sector count. You could start by reading your own link but then again, you seem to have problems reading my own posts so your mileage may vary ;-)

Re: Uncorrected reads by DigiShaman · 2014-11-13 06:24 · Score: 1

Damn, you're full of yourself!
If you're running RAID 1, 5, 6, 10, etc, it's a moot point as data will be rebuilt from remaining parity information. Secondly, if a drive drops out of an array from an extended error recovery timeout, chances are you can't trust the reliability of the drive anyways. That's regardless if it trips SMART or not.
My point to you is this: why do you go through convoluted motions to micromanage your hardware when this is a solved problem? Solutions exist! Run the cost/risk assessment and apply accordingly. Unless your time (and sanity) is worth so little compared to the hardware you administer?!

Re:Uncorrected reads by ls671 · 2014-11-13 06:32 · Score: 1

Good one! mke2fs -c -c
Thanks for pointing this out!

Re: Uncorrected reads by ls671 · 2014-11-13 07:02 · Score: 1

Run the cost/risk assessment and apply accordingly.

Exactly, use ZFS that does just that if you want to afford the extra memory. Use a fancy hardware raid controller that does that if you wish. I just use cheap drives and Linux MD. Do your research before commenting on setup you don't seem to know about. You don't have to brag about your hardware here and try to convince others to do as you do.
Didn't I mention in my first post: "Most people would say this is crazy but in my opinion,..."?
I do not see what was your point in replying to my posts anyway other than brag about using more expensive solutions and treat others that don't do just like you like idiots.
Oh, and while at it, RAID 1 doesn't have parity information!

If you're running RAID 1, 5, 6, 10, etc, it's a moot point as data will be rebuilt from remaining parity information.

I did not learn a single thing from your replies.
Take care nevertheless!

Re: Uncorrected reads by DigiShaman · 2014-11-13 07:23 · Score: 1

I don't work for myself, I work for others. That is to say, when I'm having to administer over 100+TB of data on 50+ servers, I won't be rolling my own software-based solution. I'm not saying it can't be done, but there are just too many variables and permutations to deal with; more so when an update rolls around and potentially throws a wrinkle in the mix. And to be perfectly honest, going with Dell or HP provides next-day warranty replacement of drives. That, and the level of R&D put into a hardware based solution is backed up by a solid reputation of the aforementioned companies. And just so you know, if one of these fails, I've got my ass covered. I refuse to go before the CEO or owner and say "Yeah, in theory my cobbled solution should work...in theory. And in theory you shouldn't HAVE LOST YOUR DATA!!" Yeah, no problem.
Life's too short. Enjoy being the brilliant hero that you are. I know nothing; you're on top of things. You're just too fucking smart, man!
Wow, just wow. Some people....

Re: Uncorrected reads by ls671 · 2014-11-13 07:40 · Score: 1

Who says I don't ALSO work for others and I don't know about more expensive solutions? I just don't brag about it mister Shaman ;-)
I know enough to know about people covering their arses, it is pretty common you know...
Yet, I never lost any data on the cheaper setup I run on the side.
Take care man!

Re: Uncorrected reads by ls671 · 2014-11-13 08:08 · Score: 1

more so when an update rolls around and potentially throws a wrinkle in the mix.

You are right about this. Once, a Linux kernel update (or was it mdtools?) was screwed. You would add a new partition to a Linux MD RAID array and it wouldn't sync the partition before putting it online ;-) This is where a good backup strategy comes into play.
Anyways, toying around with linux MD and cheap solutions makes you more creative in the long run IMHO.
Just keep your mind open please. There are plenty of approaches and trade-offs available and just as you said:

Run the cost/risk assessment and apply accordingly.

Furthermore, it depends on SLAs and such and having the best cost effective solution. As long as you know what you are doing and document it, you don't have to worry about covering your arse so much...

Re:Uncorrected reads by gweihir · 2014-11-14 20:08 · Score: 1

Wrong. Uncorrected reads indicate surface defects. The rest of the surface may be entirely fine. All disks have surface defects and not all are obvious on manufacturer testing.
They also indicate faulty drive care. Usually, data goes bad over a longer time. If you run your long SMART selftests every 7-14 days, you are very unlikely to be hit by this and will get reallocated sectors with no data loss instead. Not doing these tests is like never pumping your bicycle tires and complaining when they eventually get flat that they are "broken".

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Uncorrected reads by gweihir · 2014-11-14 20:09 · Score: 1

Still a valid approach today for surface defects. And if you had run regular full surface scans, you would probably not have had to do anything yourself.

Re: Uncorrected reads by sribe · 2014-11-23 13:29 · Score: 1

Oh, and while at it, RAID 1 doesn't have parity information!
No, but every sector on the drive does. When a read fails from a sector on 1 drive, RAID-1 can read it from the other drive...
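
To make the mirror-versus-parity distinction concrete, here is a toy sketch. It is not how Linux MD or any real controller is implemented; the function names are hypothetical, blocks are plain Python bytes, and a failed read is represented by None. RAID 1 simply falls back to the other copy, while a RAID 5 style stripe rebuilds the missing block by XORing the surviving blocks with the parity block.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid1_read(copies):
    """Mirror: return the first copy that could be read (None marks a failed read)."""
    for block in copies:
        if block is not None:
            return block
    raise IOError("all mirrors failed")


def raid5_rebuild(surviving_blocks):
    """Parity: the missing block is the XOR of all surviving data blocks and parity."""
    return xor_blocks(surviving_blocks)


d0 = b"\x01\x02"
d1 = b"\x0f\xf0"
parity = xor_blocks([d0, d1])

print(raid1_read([None, d0]) == d0)       # True: read served by the second mirror
print(raid5_rebuild([d0, parity]) == d1)  # True: d1 recovered without reading it
```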

Re:Seagate OEM? by AaronLS · 2014-11-12 09:50 · Score: 2

I've had drives fail in the ~3 years range from a few different manufacturers. I think with a sample size of 3 drives you can't really draw any conclusions.

The measurements in question: by Immerman · 2014-11-12 09:53 · Score: 4, Informative

for those who are only passingly curious and don't want to read the article.
SMART 5 - Reallocated_Sector_Count.
SMART 187 - Reported_Uncorrectable_Errors.
SMART 188 - Command_Timeout.
SMART 197 - Current_Pending_Sector_Count.
SMART 198 - Offline_Uncorrectable

--
--- Most topics have many sides worth arguing, allow me to take one opposite you.

Re:The measurements in question: by SpaceManFlip · 2014-11-12 10:05 · Score: 3, Insightful

I read the article to find those "5 Top SMART Stats" they refer to, but I'm replying here because it's the relevant place.
Those 5 SMART stats match up exactly with what I habitually look at on the job monitoring lots of RAID arrays' drives. Those are the stats that tell you if the drive is going bad most often in my experience.
Re:The measurements in question: by AmiMoJo · 2014-11-12 10:18 · Score: 2

I tend to think a drive has failed once it has any uncorrectable errors... I lost some data, it couldn't be read back. Drive gets returned to the manufacturer under warranty. Don't wait around for it to fail further.
I agree with the reallocated sector count though. The moment that starts to rise I usually make sure the data is fully backed up and then do a full surface scan. The full scan almost always causes the drive to find more failed sectors and die, so it gets sent back under warranty too.

Re:The measurements in question: by rduke15 · 2014-11-12 10:23 · Score: 2

And to list these for your own drive:
$ sudo smartctl -A /dev/sda | egrep '^\s*(ID|5|1[89][78])'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    etc.
(Incomplete last line to "use fewer 'junk' characters." as requested by that silly filter)
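
For anyone who would rather pull the same five rows out in a script than with egrep, here is a minimal Python sketch. It assumes the usual ten-column `smartctl -A` attribute table; the function name is hypothetical, and the Power_On_Hours row in the sample is invented just to show that unwatched attributes are skipped.

```python
# The five attribute IDs highlighted in the thread.
WATCHED_IDS = {5, 187, 188, 197, 198}


def parse_smart_attributes(text, watched=WATCHED_IDS):
    """Return {id: (name, raw_value_string)} for watched rows of `smartctl -A` text."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in watched:
            # RAW_VALUE is the tenth column; it is kept as a string because
            # some drives append extra text after the number.
            found[attr_id] = (fields[1], fields[9])
    return found


# Rows 5, 187, 188 and 197 are taken from the post; row 9 is invented.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1234
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

for attr_id, (name, raw) in sorted(parse_smart_attributes(sample).items()):
    print(attr_id, name, raw)
```

To use it on a live drive you would feed it the text printed by `smartctl -A` for that device; attribute names and the set of reported IDs vary by manufacturer, so missing keys should be expected.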
Re:The measurements in question: by omnichad · 2014-11-12 10:23 · Score: 4, Informative

And I can confirm. Reallocated Sector Count rarely goes above zero when the drive is fine. It's possible to have a few sectors go bad and get reallocated, but it's usually part of a bigger problem when it happens (this number is reset to zero at the factory, after all initially bad sectors have been remapped). If the Current Pending Sector Count is non-zero, it's likely over.
I always clone a drive immediately with ddrescue when it gets to this point, while the drive is still working.
Re:The measurements in question: by jedidiah · 2014-11-12 10:25 · Score: 2

Yes. This article isn't exactly news as it pretty much confirms what the global peanut gallery has already said about this stuff.

--
A Pirate and a Puritan look the same on a balance sheet.
Re:The measurements in question: by afidel · 2014-11-12 10:26 · Score: 1

Those 5 SMART stats match up exactly with what I habitually look at on the job monitoring lots of RAID arrays' drives
Really? At my job I get notified that the array is ejecting a drive based on whatever parameters the OEM uses, it's already started the rebuild to spare space on the remaining drives, and a ticket has been dispatched to have a technician bring a replacement drive. If it's a predictive fail it generally doesn't notify until the rebuild has completed as it can generally use the "failing" drive as the source of the rebuild. Are you doing operations for a web scale company or something?

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:The measurements in question: by koinu · 2014-11-12 10:30 · Score: 3, Informative

Reallocated_Sector_Count: sectors that the drive successfully replaced
Reported_Uncorrectable_Errors: errors that could not be recovered by ECC
Command_Timeout: controller hung and had to be reset
Current_Pending_Sector_Count: sectors to be replaced on the next write access
Offline_Uncorrectable: sectors that the drive tried to repair, but failed (try an offline test, maybe it is not dead yet)
Re:The measurements in question: by koinu · 2014-11-12 10:41 · Score: 1

I once had 2 drives having 2047 reallocated sector count (buggy firmware, but drive ok).
Also, generally you don't need to panic over this attribute. You should panic when it increases steadily.
The best indicator for failures is not SMART but a reasonable filesystem like ZFS, optionally protected by raidz (if you want to recover from failures, and usually you do). zpool status shows errors very reliably. SMART can sometimes lie to you or have bugs.
Re:The measurements in question: by afidel · 2014-11-12 10:59 · Score: 2

I never worry about going home, my array has plenty of spare capacity to handle rebuilds, we schedule the technician when it's convenient to us, not when it's convenient for them or the array. When you have guard space for at least 4 disk failures (out of a few hundred) you deal with replacements in a less urgent manner than a traditional small RAID5 array in a standalone server. Within ~30 minutes of a failure or a predictive failure my arrays are back to 100% resiliency with slightly less guard space. It's one of many reasons why I only buy wide striped arrays.

Re:The measurements in question: by 0123456 · 2014-11-12 11:14 · Score: 1

We just look at the flashing lights every once in a while. Though we've got drives the RAID controller has been telling us are failing for the best part of a year now, and haven't got around to replacing them.
Re:The measurements in question: by __aardcx5948 · 2014-11-12 11:20 · Score: 1

In other words: nothing new and people have been tracking these values for decades anyway.
Re:The measurements in question: by omnichad · 2014-11-12 11:26 · Score: 3, Informative

Also, generally you don't need to panic over this attribute. You should panic when it increases steadily.
True, I've had a few drives hold steady at 1 sector reallocated. But if Current Pending Sector count remains non-zero for very long, it's a headache at the very least and probably a failure. Generally, it seems like as soon as you crest zero, it's over. I've had the next symptom be a totally unresponsive drive. But doing the backup when you hit 1 (admittedly overly cautious) will force the drive to read off all the sectors and you'll at least get your backup while you verify the rest of the drive still reads OK.
Re:The measurements in question: by ericloewe · 2014-11-12 11:57 · Score: 1

They needed a study to arrive at that conclusion?
Reallocated, uncorrectable and pending sectors are all obvious indicators of impending drive failure.
Command Timeouts, depending on definition, could be timeouts after failing a read, so nothing unusual there.
Re:The measurements in question: by ericloewe · 2014-11-12 11:59 · Score: 1

If it comes to you having to clone the drive, it's too late. That's going to bite you in the ass sooner or later.
Re:The measurements in question: by ls671 · 2014-11-12 12:03 · Score: 1

just take the drive off-line and try this:
http://slashdot.org/comments.p...
Current_Pending_Sector will go back to zero if the drive is still usable.

Re:The measurements in question: by omnichad · 2014-11-12 12:26 · Score: 1

I realize that. But I always make a clone first, because it's a lot of wear and runtime if the drive is actually failing.
Re:The measurements in question: by omnichad · 2014-11-12 12:26 · Score: 1

At the first sign of trouble? How much earlier should I do it? I'm not saying in place of a backup. Just as a quicker way to get a new drive up and running.
Re:The measurements in question: by mcrbids · 2014-11-12 12:39 · Score: 1

Your later comments about ignoring RAID controller warnings for a *year* strike me as callous. But we all have our standards, and standards vary greatly from place to place as the needs that drive the standards also vary greatly. (financial institutions care much more about transactional correctness than reddit)
After months of testing, our organization has wholeheartedly adopted ZFS and has been finding that not only is it technically far superior to other storage technologies, it's significantly faster in many contexts, it's actually more stable than even EXT4 under continuous heavy read/write loads, and brings capabilities to the table that even expensive, hardware RAID controllers have a tough time matching. Best of all, since it actually runs off JBOD, the cost is somewhere between insignificant and irrelevant.
I was wondering if you had investigated ZFS at all, and if so, why you aren't using it?

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:The measurements in question: by swillden · 2014-11-12 13:53 · Score: 4, Insightful

Yes. This article isn't exactly news as it pretty much confirms what the global peanut gallery has already said about this stuff.
Still, data is better than emergent collective perceptions from distributed anecdotes.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Re:The measurements in question: by RabidReindeer · 2014-11-12 14:48 · Score: 1

Reallocated_Sector_Count: sectors that the drive successfully replaced
Reported_Uncorrectable_Errors: errors that could not be recovered by ECC
Command_Timeout: controller hung and had to be reset
Current_Pending_Sector_Count: sectors to be replaced on the next write access
Offline_Uncorrectable: sectors that the drive tried to repair, but failed (try an offline test, maybe it is not dead yet)
Did some idiot mod you DOWN?
This is information that bears frequent repetition.
Re:The measurements in question: by Lehk228 · 2014-11-12 16:40 · Score: 1

the technician time to fuck around with a failing drive is just not worth it

bad sectors = pull it, clone it, bin it

--
Snowden and Manning are heroes.
Re:The measurements in question: by dargaud · 2014-11-12 18:28 · Score: 1

Is there a tool that will parse a smartctl output and tell you 'good' or 'no good' ?
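
As a rough illustration of what such a check could look like, here is a sketch that turns the five raw values into a verdict. The rules are only a rule of thumb distilled from the comments in this thread (any uncorrectable or pending sectors are treated as bad news, while reallocations and command timeouts only earn a "watch"); they are not Backblaze's thresholds, and the function name and the extra 'watch' state are assumptions.

```python
def verdict(raw):
    """raw: {attribute_id: int raw value}. Returns 'good', 'watch' or 'no good'."""
    # 187 Reported_Uncorrect, 197 Current_Pending_Sector, 198 Offline_Uncorrectable:
    # any non-zero value means data could not be read at some point.
    if any(raw.get(attr_id, 0) > 0 for attr_id in (187, 197, 198)):
        return "no good"
    # 5 Reallocated_Sector_Ct can hold steady at a small number, and
    # 188 Command_Timeout can also point at cables or the controller.
    if raw.get(5, 0) > 0 or raw.get(188, 0) > 0:
        return "watch"
    return "good"


print(verdict({5: 0, 187: 0, 188: 0, 197: 0, 198: 0}))  # good
print(verdict({5: 1, 187: 0, 188: 0, 197: 0, 198: 0}))  # watch
print(verdict({5: 1, 187: 0, 188: 0, 197: 3, 198: 0}))  # no good
```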

--
Non-Linux Penguins ?
Re:The measurements in question: by profplump · 2014-11-12 20:27 · Score: 1

You can't compare real filesystems to EXT. EXT4 is a backport of some of what is possible in modern filesystems to a brand name that makes people comfortable. Like most filesystems it's sufficient for many uses, but it's not particularly good at anything and it's really bad at a whole slew of fairly common uses. It's not even a good compromise for backwards compatibility, like EXT3 was, as volumes formatted as EXT4 can't be mounted as EXT2/3.
I'm not saying EXT4 is bad, just that it isn't a terribly useful baseline for comparison. By the time you get to systematically evaluating a filesystem on functionality and performance EXT shouldn't even be on the list, except maybe to help people who don't understand the problem see why you want to make a change in the first place.
Re:The measurements in question: by SLi · 2014-11-13 02:23 · Score: 1

Current_Pending_Sector > 0 means you most likely already have unrecoverable errors on your disk, because otherwise the sectors would already have been remapped (and thus not pending). So if your CPS = 3, expect there to be at least three sectors which will return an uncorrectable error when read. Writing to these sectors will allow them to be remapped, which will decrease your CPS.
Re:The measurements in question: by omnichad · 2014-11-13 02:38 · Score: 1

That's true. Writing to the sector will remap it. But if you get a bad sector, it's very rare for it to remain an isolated incident. And it may not be the sector, but rather the head that's actually failing. I usually consider the drive a likely loss by this point. After doing a full backup, I'll run the drive manufacturer's utility to scan the disk and remap sectors and then write zeroes to the drive for good measure. If all is OK after that, I can always clone my backup back onto the drive.
Re:The measurements in question: by goarilla · 2014-11-13 05:39 · Score: 1

Huh, what about 196 Reallocated_Event_Count? Nothing we didn't know already, but is there any data out for SSDs? That's what I would like to know.
Re:The measurements in question: by Immerman · 2014-11-13 14:35 · Score: 1

I don't know. I believe though that, unlike hard drives, SSDs are designed on the presumption that cells will gradually fail as part of normal operations, and hence any such statistics would mean something very different than they would for a hard drive.

Re:The measurements in question: by goarilla · 2014-11-13 20:07 · Score: 1

Exactly, but what I would like to know is what are the critical SMART values to watch for in SSDs? Any results on that yet? Should we expect them, or will they vary even more by manufacturer/model, making a "top 5" list impossible?
Re:The measurements in question: by gweihir · 2014-11-14 20:12 · Score: 1

Well, these are exactly the ones every knowledgeable person was watching anyways. 188 can also be controller or cable problems though.

Re:The measurements in question: by sribe · 2014-11-23 13:12 · Score: 1

Yep, I learned a long time ago that when I start seeing sectors reallocated or uncorrectable errors, it's time to replace...
Re:The measurements in question: by sribe · 2014-11-23 13:20 · Score: 1

After months of testing, our organization has wholeheartedly adopted ZFS and has been finding that not only is it technically far superior to other storage technologies, it's significantly faster in many contexts, it's actually more stable than even EXT4 under continuous heavy read/write loads, and brings capabilities to the table that even expensive, hardware RAID controllers have a tough time matching. Best of all, since it actually runs off JBOD, the cost is somewhere between insignificant and irrelevant.
A few months ago I consolidated all my pile of home devices onto raidz2 in a fairly inexpensive eSATA box.
A few weeks ago I moved my major client's production data onto raidz2 in a horrible cheap USB box. (Better hardware was ordered. Better hardware was defective when received. When better hardware is straightened, the drives will be moved, and the little USB crap-box stuffed in a cabinet as a spare.)
I'm smaller scale than most ZFS users, but larger scale than most smallish departmental RAID-5/10 setups. Stuck in between, and finding that these days one can get ZFS up and running for less cost than any of the hardware RAID solutions at our scale--almost none of which are trustworthy enough to use.
Re:The measurements in question: by sribe · 2014-11-23 13:23 · Score: 1

I tend to think a drive has failed once it has any uncorrectable errors... I lost some data, it couldn't be read back. Drive gets returned to the manufacturer under warranty. Don't wait around for it to fail further.
Yep, I do 3-passes of full write/verify on new drives before they go into service, and any error count > 0 gets the drive returned right then.

Common cause by tepples · 2014-11-12 09:55 · Score: 1

Correlation justifies effort to find the common cause of two phenomena.

Re:Correlation != causation by Immerman · 2014-11-12 09:55 · Score: 3, Insightful

Nope. When looking for warning signs you don't care about causation, it's enough to know that the presence of A indicates an increased probability of imminent B.


Cool data but... by trazom28 · 2014-11-12 10:04 · Score: 1

Ever find it odd that most PC manufacturers (at least the variety I've seen over the years) disable S.M.A.R.T. in BIOS by default? Never understood the reasoning behind that...

--
{} ------ When I think of a good sig, I'll put it here

Re:Cool data but... by fnj · 2014-11-12 10:13 · Score: 1

I could never imagine why it is even POSSIBLE to disable it. If you don't want to read it, just freakin don't read it.
Re:Cool data but... by Rashkae · 2014-11-12 10:16 · Score: 2

If the PC has less than optimal cooling, it's possible, even likely, the drive temperature will exceed operating specs at some point. Even if there is no ill effect or any long term problem, the BIOS will forever more report "Imminent Drive Failure" on every boot if BIOS SMART is enabled.
Re:Cool data but... by bobjr94 · 2014-11-12 10:18 · Score: 1

I've seen that as well. But I have had drives that report a SMART error at boot for years and still never failed (nothing important on that drive, that's why I didn't care). Maybe they would just rather the end user surprisingly loses all their data one day, rather than be troubled by a message at boot up when a problem is suspected.

I would like to see SMART tools built into Windows and other OSes (maybe there are some I don't know about). Especially since some of my computers are up for 6 months or more at a time, a drive could be fine 4 or 5 months ago when it was last booted, but I won't get a SMART message until next reboot, maybe a month or two from now, after it's too late.
Re:Cool data but... by omnichad · 2014-11-12 10:21 · Score: 1

Less warranty replacement.
Re:Cool data but... by SgtAaron · 2014-11-12 10:59 · Score: 1

I would like to see SMART tools built into Windows and other OSes (maybe there are some I don't know about). Especially since some of my computers are up for 6 months or more at a time, a drive could be fine 4 or 5 months ago when it was last booted, but I won't get a SMART message until next reboot, maybe a month or two from now, after it's too late.
Linux smartmontools package has smartd, the "SMART Disk Monitoring Daemon", which will monitor SMART-capable drives and will log problems and send email alerts. Can be handy. Don't know about Windows.
Re:Cool data but... by jargonburn · 2014-11-12 11:03 · Score: 1

The meaning of that BIOS option may vary by system.
I have used utilities to view the SMART info on drives where this BIOS option is disabled, can't recall any systems where it flat-out didn't work. I won't say that this information couldn't be blocked in some cases, but I believe that this option is for whether the BIOS checks SMART status during POST. It has made the difference between a system merrily proceeding to boot with a SMART failure versus reporting that the drive's SMART indicates failure and waiting for keyboard input to continue.
I don't know if it affects whether the OS (Windows) can/does "see" and report a SMART failure.
The bigger issue, in my opinion, is that "warnings" (such as on the important metrics as decided by BackBlaze) are rarely if ever reported.
I've used the same metrics that BackBlaze reports using as an indicator to recommend drive replacement to my clients for a long time. With the exception of "Command Timeout", which I truthfully don't remember looking at.
Anyone else have some information, experience, anecdotes about SMART in BIOS?
Re:Cool data but... by RabidReindeer · 2014-11-12 14:49 · Score: 1

I could never imagine why it is even POSSIBLE to disable it. If you don't want to read it, just freakin don't read it.
I think there's some routine testing going on that adds overhead unless you disable it.
Re:Cool data but... by DigiShaman · 2014-11-12 16:28 · Score: 1

Correct. The SMART status in BIOS is for whether or not the HDD SMART status get reported at POST. For example on Dell systems, it will warn the user with an option to press the space bar to continue booting into the OS (assuming the drive is still functional). With it turned off in BIOS, you can still poll SMART status with any number of HDD utilities available to whatever OS you're running.

Re:Cool data but... by Stolpskott · 2014-11-13 00:02 · Score: 1

I generally use HD Tune (www.hdtune.com) which is free unless you want to buy the Pro version with a bunch of features that are irrelevant if all you want is SMART reporting. If I was going to spend actual money on a checker though, I would tend toward the LSoft Hard Disk Monitor (www.lsoft.net).
Re:Cool data but... by gweihir · 2014-11-14 20:14 · Score: 1

Gives them a few more weeks or months before they have a service case. With luck, the machine is out of warranty by then. This is utterly unethical though.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Thanks, Backblaze! by organgtool · 2014-11-12 10:10 · Score: 1

As someone who is suspicious of a couple of hard drives, this data will help me to determine just how concerned I should be. I don't know what Backblaze gets out of making this information public (except publicity), but it is refreshing to see a company release information such as this rather than guard it as a trade secret or sell it.

Re:Thanks, Backblaze! by fnj · 2014-11-12 10:16 · Score: 1

The list of parameters that are closely correlated with failure is pretty bloody obvious.
Re:Thanks, Backblaze! by organgtool · 2014-11-12 11:42 · Score: 2

Perhaps they are obvious to a System Administrator but to someone who is not an admin, everything in SMART probably looks like an error. In addition to that, the article describes common errors that sound indicative of a drive failure but are actually relatively benign. So there is definitely value in this information.
Re:Thanks, Backblaze! by thegarbz · 2014-11-12 12:25 · Score: 1

And yet they aren't, even by Backblaze's admission. SMART values they expected to be an indication of drive wear showed no correlation with failure.
Re:Thanks, Backblaze! by brianwski · 2014-11-12 14:50 · Score: 2

Disclaimer: I work at Backblaze.

> SMART values they expected to be an indication on drive wear showed no correlation with failure

Exactly. Also, some people care about more than "approximately correlates" and want to see the actual data on exactly how correlated it is.

Windows app that displays these meaningfully? by swb · 2014-11-12 10:11 · Score: 1

I've used Crystal Disk Info and while it reports SMART info, I can't make much out of the info.

Many values for Samsung spinning rust just have values of Current and Worst of 100 and either a raw value of 0 or some insanely huge number.

Re:Windows app that displays these meaningfully? by omnichad · 2014-11-12 10:25 · Score: 1

A few of them aren't accounted for very well (and some of Samsung's stats are not cumulative stats). Crystal Disk Info makes it idiot-proof. If the square is blue, the drive is fine, yellow and the drive is probably failing soon, and red is a definite failure.
Raw value of zero is good. If Current Pending Sector Count or Reallocated Sector Count go above zero, you're likely dealing with a failing drive.
Most of the numbers are not important.
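For anyone who would rather script the "is the raw value above zero" check than eyeball a grid, here is a minimal sketch. It assumes the text comes from something like `smartctl -A` (smartmontools) with its usual ten-column attribute table; the sample table is made up for illustration, and real drives may name or format attributes differently.

```python
import re

# The two attributes singled out above; names follow smartctl's attribute table.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_raw_values(smartctl_output):
    """Return {attribute_name: raw_value} from `smartctl -A` style text."""
    raw = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            match = re.match(r"\d+", fields[9])
            if match:
                raw[fields[1]] = int(match.group())
    return raw


def suspicious_attributes(smartctl_output):
    """Return the watched attributes whose raw value is above zero."""
    raw = parse_raw_values(smartctl_output)
    return {name: raw[name] for name in WATCHED if raw.get(name, 0) > 0}


# Made-up sample output, not taken from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21947
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""

if __name__ == "__main__":
    print(suspicious_attributes(SAMPLE))
```

An empty result means neither counter has moved; anything else is the drive's cue for a backup and probably a replacement.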
Re:Windows app that displays these meaningfully? by ElderKorean · 2014-11-12 17:43 · Score: 1

I work at a school and see plenty of failing laptop drives - mostly from kids not sleeping their laptops while walking around.
We use (currently) PartedMagic Linux distribution on a boot USB. The "Disk Health" tool happily reports on failing drives and gives reasons.
Added bonus is that Linux is better than Windows at allowing data to be copied from a failing drive (and doesn't care about the NTFS file permissions).
Re:Windows app that displays these meaningfully? by omnichad · 2014-11-13 01:29 · Score: 1

On Linux, I just use smartmontools. Gives the same grid of data (mostly) as Crystal Disk Info. But when copying a failing drive, always use ddrescue. It will allow you to unplug the drive (to do some mysterious temporary fix like putting it back in the freezer) and plug it back in and restart from where you left off. Unless you only need a small amount of data (I prefer to just clone the entire system to a new drive to boot from).

Just my personal take by jones_supa · 2014-11-12 10:18 · Score: 1

I never take a look at SMART values or do disk benchmarks. They just make me more stressed and paranoid. If it should occur, I'll let the drive die a mighty death and restore the latest backup to a new disk.

Top #1 Indicator That Correlates To Drive Failure by PsychoSlashDot · 2014-11-12 10:30 · Score: 1

The biggest sign that correlates to drive failure is: it's a brick and all your data is gone.

Let's be real here. You almost never get advance warning from SMART. Maybe one in twenty. Almost without fail you'll go from a drive running properly to a drive that won't rotate the spindle or the heads smash against the casing or you've suddenly got so many bad sectors that it's effectively unusable. Failure prediction is almost (but not quite) valueless compared to the reality of how drives fail.

--
"Oh no... he found the .sig setting."

Put the SMART stats to the test by DidgetMaster · 2014-11-12 10:41 · Score: 2

Take all the drives that have signs of failure, put them in a testing environment where you can read and write them all day but don't care about any of the data on them and see how long it takes for them to really fail. That will give you an indication of how reliable the SMART stats are at predicting real disk failure.

Re:Put the SMART stats to the test by brianwski · 2014-11-12 14:56 · Score: 3, Informative

Disclaimer: I work at Backblaze. Essentially this is what we did. We don't care at all if one drive dies, so we left it in an environment where we can read and write them all day (the storage pods with live customer data) and when they failed we calmly replaced them with zero customer data loss and produced this blog post. :-)
Re:Put the SMART stats to the test by Carnildo · 2014-11-12 15:29 · Score: 1

Google did this about seven years ago. Of the stats, a drive with a non-zero scan error count has a 70% chance of surviving eight months, one with a non-zero reallocated sector count has an 85% chance of survival, and one with a non-zero pending sector count has a 75% chance of survival. For comparison, a drive with no error indications has a better than 99% chance of surviving eight months.
Overall, 44% of failures can be predicted with a low false-positive rate, while 64% can be predicted with an unacceptably high false-positive rate. 36% of drive failures occur with no SMART failure indications at all.

--
"They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
Re:Put the SMART stats to the test by gweihir · 2014-11-14 20:21 · Score: 1

And that is how you do it if you know how HDDs work. Most people have not the least clue about the mechanics, physics and electronics involved and hence are posting a lot of techno-mythical nonsense here.
Personally, I had suspicious disks in RAID6 and checked them once a day. (Data was also backed up elsewhere.) Except for one freak disk that suddenly had 150 reallocations, but then continued to work for 3 years, they all died pretty soon, but I never needed those backups.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Put the SMART stats to the test by gweihir · 2014-11-14 20:23 · Score: 1

Unfortunately, the Google paper is mostly unusable as it has severe methodological errors and basically only shows its authors do not have a clue. What of it is usable confirms things, however. (Not the only really badly researched and written paper to come out of Google either...)

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Errors actually means errors by vargad · 2014-11-12 10:42 · Score: 1

By analyzing tens of thousands of hard drive failures they figured out that the SMART stats that show errors actually show errors. What a surprise.

Re:Errors actually means errors by gweihir · 2014-11-14 20:23 · Score: 1

Fail. That is not what they say.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Re:My useless(?) WD anecdotes by ncc74656 · 2014-11-12 10:59 · Score: 1

12 Power Cycle Count is relevant on the EZRXs (greens); that keeps increasing unless you do certain things to prevent it, and I think (this is murky) I saw a weak correlation between this going way up, and the drives failing sooner.

I've not done anything special with the two that I have in a media server at home. This stat is at 5 on the older drive and 4 on the newer drive. By comparison, a Seagate Barracuda LP in the same box is at 128 (it's quite a bit older than the WD drives), and the boot drive, a Seagate Barracuda 7200.11 I grabbed out of the unused-drive box when whatever drive it replaced failed, has 365 spinups logged.

(Looking at the stats for all of my drives, the outlook for that 7200.11 isn't so good. :-P )

--
20 January 2017: the End of an Error.

Re:Hardware ECC Recovered by koinu · 2014-11-12 10:59 · Score: 1

Hard disks recover sectors with ECC all the time. There is nothing special about it.

Re: Seagate OEM? by corychristison · 2014-11-12 11:30 · Score: 3, Informative

I buy whatever is cheapest.

I know it's a toss-up no matter what or when you buy hard drives, so the only thing I have left to gauge is price, capacity, and speed (RPM) depending on the intended use.

About a year ago I took a gamble on an SSD for my primary workstation. I bought an ADATA SX900 64GB drive. I had never heard of the brand before. It was ~$120 at the time, and the cheapest for that capacity. I've been looking at getting a 128GB (or so) SSD for my laptop. Prices right now look like I will be getting another ADATA... but I am holding out for Black Friday/Cyber Monday deals to decide.

Oddly enough, over the past 10 years, I've never had a hard drive die in any of my computers while in use. I have a stack of 4 or 5 drives, ranging in capacity from 100GB to 500GB, 3 different brands, that I'm not using right now. A while back, I plugged one in just to see if it still worked and it didn't. I recently found out it was the hotswap bay that quit working, so as far as I know it still works.

Conversely, I have some servers in a datacenter. Had a drive fail on reboot after a kernel upgrade the other night. Sent a ticket to the DC and they plugged a new one in. Good to go again. In case you're wondering, it has 4x600GB SAS drives in RAID-10.

TL;DR: Buy whatever is cheapest, the odds are always the same.

Re:Top #1 Indicator That Correlates To Drive Failu by SgtAaron · 2014-11-12 12:40 · Score: 1

Let's be real here. You almost never get advance warning from SMART. Maybe one in twenty. Almost without fail you'll go from a drive running properly to a drive that won't rotate the spindle or the heads smash against the casing or you've suddenly got so many bad sectors that it's effectively unusable. Failure prediction is almost (but not quite) valueless compared to the reality of how drives fail.

Yeah, I did mention smartd in an earlier post, and I said it "can be handy" but I suppose I must agree with you based on my own life as it's been lived until now. We never put a server into service without at least software RAID, usually with just two disks with some exceptions. A lot of our equipment is tiny Supermicro 1Us that can only hold two. But after many years we have yet to have two go at once (knock on wood) so the warning of a RAID out of sync has saved us.

RUBBISH by Anonymous Coward · 2014-11-12 13:09 · Score: 1

You've given up, let go and let it all hang out.

Better advice. Look at the reviews - percentage of bad reviews, nature of the problem. Do not buy a brand new model if you can avoid it. Chances are it will be cheaper on clearance and if there is an issue there will be more data out there about it.

ALWAYS BACK UP to multiple drives at multiple locations if you can't replace that data. Do NOT rely on RAID. Do not store all your backups in one physical location where a single fire, rat chewing them or other event might compromise them. At least one copy should be completely offline once the backup is taken so it can't be hit by a virus or bug. Relatives that you trust - a parent, child or sibling if they live far enough to avoid one disaster hitting you both but close enough to do semi-regular backups make excellent choices for storage. Cloud storage is not reliable.

Re:RUBBISH by FatdogHaiku · 2014-11-12 16:38 · Score: 2

Also grabbing a copy of smartmontools might be a good idea...
http://smartmontools.sourceforge.net

--
You have the right to remain sentient. If you give up the right to remain sentient, you will be elected to public office
Re:RUBBISH by profplump · 2014-11-12 20:05 · Score: 2

He hasn't given up, he's just acknowledged the reality that the variance among drives of any particular model is large enough that he can't statistically pick a winner even given reliable statistics about the past performance of similar drives (which is definitely not available) and assuming the drives never change over their manufacturing life (which is definitely not true).
If you're buying 1000 hard drives their average reliability is meaningful to you (though even then it's only *a* factor, not *the* factor). But if you're only buying a handful of drives and prioritizing reliability you're much better off with diversity than any single model because the average reliability means almost nothing in your small application and diversity at least lets you avoid duplicating systematic faults.
Whatever strategy you think you've devised to beat the statistics is just you hoping to pick the right stock/horse/number and lying to yourself about the odds -- even if you have good data and choose the statistically best option there's still a very good chance it won't turn out to be the best one available and a moderate chance it will be one of the worst.
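A toy calculation of the diversity point above, under loud assumptions: each model independently has some chance `q` of carrying a systematic fault (a bad batch, a firmware bug) that takes out every drive of that model. Both the function name and the value of `q` are made up for illustration; nothing here comes from real failure data.

```python
def p_systematic_wipeout(q, n_models):
    """Probability that every model in the set carries a systematic fault.

    q        -- assumed chance that any one model has such a fault (hypothetical)
    n_models -- number of distinct models the data is spread across
    Assumes faults in different models are independent of each other.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability")
    if n_models < 1:
        raise ValueError("need at least one model")
    return q ** n_models


if __name__ == "__main__":
    q = 0.05  # made-up number, purely for illustration
    for n in (1, 2, 3):
        print(f"{n} model(s): {p_systematic_wipeout(q, n):.6f}")
```

With a single model the whole set shares one fate, so the risk is just `q`; spreading across two or three models multiplies the small probabilities together. It says nothing about ordinary random failures, which is exactly why average reliability numbers help so little at this scale.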
Re:RUBBISH by AaronLS · 2014-11-13 05:11 · Score: 1

Yes, even with lots of data you'd probably have a hard time showing that some manufacturers are significantly more reliable, due to lots of factors that will create a large deviation within each manufacturer.
I heard a story of someone seeing a shipping container full of hard drives get dropped accidentally, and they just hooked it back up, put it on the ship, and sent them on their way. That probably generated a lot of the "I bought 3 SuchBrand drives, and they all failed in the first month".

Re: Seagate OEM? by n3r0.m4dski11z · 2014-11-12 14:36 · Score: 1

"Prices right now look like I will be getting another ADATA..." ... "TL;DR: Buy whatever is cheapest, the odds are always the same."

You got lucky. I had 8 out of 10 ADATA 64gb msata drives fail at my workplace over the last year. Adata is crap.

SSDs are a whole different ballgame. Comparing their quirks to rotating hard drives is akin to comparing a car to a train. They do not work the same, nor fail the same.

SSDs are by no means all created equal and you must do research before buying them. I like Samsung, Intel and Crucial personally, based on experience. Be sure to keep up with firmware updates as well!

--
-

Re: Seagate OEM? by brianwski · 2014-11-12 14:39 · Score: 4, Insightful

> TL;DR: Buy whatever is cheapest, the odds are always the same.

Disclaimer: I work at Backblaze. I'm going to completely agree with you wholeheartedly, and say in addition you must have a backup. You don't have to use us, I'm just saying if a drive has a 1 percent chance or a 30 percent chance of failing, the actionable item is the same - keep a backup and buy the cheaper drive and restore from backup when it happens.

> over the past 10 years, I've never had a hard drive die in any of my computers while in use.

Professionally we lose something like 10 (?) drives every single day at Backblaze, but *PERSONALLY* I had a LOT of luck for a number of years, but about 3 years ago I finally lost one drive. I'm more backed up than most people, so it was a completely relaxed event. Not a bit of stress. Replace the drive, re-install the OS, and restore the data. Yet something like 95 percent of people never back up their data. IT professionals back up their family computers, but once you are out there in "normal computer user" land, it's a horror show.

Re:My useless(?) WD anecdotes by brianwski · 2014-11-12 14:54 · Score: 2

> power-cycling the drive can have an effect on its lifetime and/or reliability

Yes, exactly, why are you calling this stupid? It is interesting because it might affect your behavior - if you power cycle the drives every day, maybe you should consider leaving them powered up, if electricity is cheaper than replacing the drive. It's just an observation, leaving it out seems.... irresponsible? Disclaimer: I work at Backblaze.

Re:Top #1 Indicator That Correlates To Drive Failu by Carnildo · 2014-11-12 15:18 · Score: 1

If you go by Google's definition of failing (the raw value of any of Reallocated_Sector_Ct, Current_Pending_Sector, or Offline_Uncorrectable goes non-zero) rather than the SMART definition of failing (any scaled value goes below the "failure threshold" value defined in the drive's firmware), about 40% of drive failures can be predicted with an acceptably low false-positive rate. You're correct, though, that the "SMART health assessment" is useless as a predictor of failure.

They did a study on this a few years back. It comes to about the same conclusions that Backblaze's study does, but with more numbers (and a larger data set).
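The two definitions of "failing" above are easy to mix up, so here is a small sketch that puts them side by side. The `Attribute` class and both function names are hypothetical, and the sample drive is invented. The at-or-below comparison in the threshold check is an assumption based on how smartctl describes a failed attribute.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str
    value: int      # normalized ("scaled") value reported by the drive
    threshold: int  # failure threshold defined in the drive's firmware
    raw: int        # raw counter


# The three raw counters named in the post above.
RAW_WATCHLIST = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def failing_by_raw_counts(attrs):
    """Raw-count definition: any watched raw counter is non-zero."""
    return any(a.name in RAW_WATCHLIST and a.raw > 0 for a in attrs)


def failing_by_threshold(attrs):
    """Threshold definition: some scaled value has reached its threshold."""
    return any(a.value <= a.threshold for a in attrs)


# Invented example: a few reallocated sectors, scaled value still far
# above the firmware threshold.
EXAMPLE_DRIVE = [
    Attribute("Reallocated_Sector_Ct", value=100, threshold=10, raw=24),
    Attribute("Current_Pending_Sector", value=100, threshold=0, raw=0),
    Attribute("Power_On_Hours", value=95, threshold=0, raw=21947),
]

if __name__ == "__main__":
    print("raw-count definition:", failing_by_raw_counts(EXAMPLE_DRIVE))
    print("threshold definition:", failing_by_threshold(EXAMPLE_DRIVE))
```

The invented drive is flagged by the raw-count check but still passes the threshold check, which is the gap being described: the firmware thresholds tend to trip long after the raw counters have started moving.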

--
"They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.

Re:Correlation != causation by Immerman · 2014-11-12 16:28 · Score: 1

Perhaps it has been too long - you've clearly forgotten the even longer history of the deadpan response to spam.

That said, I can't actually think of many cases of such spam in response to articles/discussions that never mentioned causation at all - but maybe that's just because causally irrelevant mentions of correlation are relatively rare. Or because the spamming was so bland that it just disappeared into the background.

Also: why oh why would you want to try to resurrect such an old and worthless meme? It's like the script-kiddie version of trolling: no creativity, no vitriol, no emotional provocation of any kind. Just a slight waste of time for everyone involved, not even enough to be annoying.

--
--- Most topics have many sides worth arguing, allow me to take one opposite you.

Re:My useless(?) WD anecdotes by geminidomino · 2014-11-12 17:01 · Score: 1

IIRC, the greens are the "energy efficient" drives, and I think they power themselves down when idle, and up when they come back into use, so the numbers can grow even if the machine hasn't been rebooted since the drive was first installed.

Re: Seagate OEM? by corychristison · 2014-11-12 17:13 · Score: 1

What are the odds I would have one of the employees from the article comment on my little ol' post?

I keep local backups. I've been browsing online, looking for an online backup service that I like, so far not a whole lot of luck. I exclusively run Funtoo Linux on all of my personal and office computers (workstation at home, workstation at office, and laptop). From what I understand, you don't support Linux (yet).

My basic requirements are:
- support Linux (one of ssh/scp, rsync, webdav)
- preferably data located in Canada

As it stands, I'm better off firing up a VM on one of my servers and backing up to it... but that comes with all the other associated headaches like securing, configuring, maintaining the server.

Re:Top #1 Indicator That Correlates To Drive Failu by fostware · 2014-11-12 18:50 · Score: 1

I'd disagree. As an MSP we see occasional SMART errors and they're logged and tickets created.
So far we've cloned / backed up / moved everything of note off all 27 of them, but the three we left in and just spinning have all died within a month or so.

Sure, it's not scientifically representative, but I'll not take that chance with clients' data...

--
"We know what happens to people who stay in the middle of the road. They get run over." - Aneurin Bevan

Re:Seagate OEM? by aaaaaaargh! · 2014-11-12 21:33 · Score: 1

Please mod up. Seagate drives fail much sooner than all other brands.

Re:Top #1 Indicator That Correlates To Drive Failu by PsychoSlashDot · 2014-11-13 00:10 · Score: 1

I'd disagree. As an MSP we see occasional SMART errors and they're logged and tickets created. So far we've cloned / backed up / moved everything of note off all 27 of them, but the three we left in and just spinning have all died within a month or so.

Sure, it's not scientifically representative, but I'll not take that chance with clients' data...

Yeah, I won't dispute your experience because it happened. On the other hand, the only SMART warnings I've seen in our fleet of... four-digits worth of spindles... have ended up false-positives. As in, I contact DELL / IBM / HP / Lenovo and report the issue, they instruct me to flash some controller firmwares, reboot, and go away. If those drives ever fail, it's years later, well beyond any correlation with the SMART events.

--
"Oh no... he found the .sig setting."

Re:Top #1 Indicator That Correlates To Drive Failu by fostware · 2014-11-13 00:39 · Score: 1

As an MSP, false-positives are not always a negative. There, I said it... and most MSPs will agree begrudgingly when off the record.

That said, our support prices alter when the device is no longer under warranty, so the device usually gets moved to a location covered under a different support structure like only 8x5 or have a longer response time to compensate.

--
"We know what happens to people who stay in the middle of the road. They get run over." - Aneurin Bevan

Re: Seagate OEM? by halltk1983 · 2014-11-13 02:13 · Score: 1

Enclosure heat. They're "passively cooled" through layers of plastic. If you're not allowing good ventilation around the enclosure, and you're leaving it spinning all the time, it'll bake.

--
Watch for Penguins, they eat Apples and throw rocks at Windows.

Re:My useless(?) WD anecdotes by tibit · 2014-11-13 02:44 · Score: 1

I'm calling it stupid because if you don't know anything about the time between the power cycles, you can at best assume that the power cycle count is a low-quality proxy for powered hours.

For any claim that the number of power cycles itself is a predictor of failure, you'd need to, you know, power cycle a bunch of drives at various rates until they die, and see if merely power cycling it more often makes it fail faster. Only in such conditions would the power cycle mean anything. Otherwise it's stupid and let's just stop with the stupidity, okay?

--
A successful API design takes a mixture of software design and pedagogy.

Re:My useless(?) WD anecdotes by tibit · 2014-11-13 02:44 · Score: 1

Load Cycle Count and Power Cycle Count aren't the same thing.

--
A successful API design takes a mixture of software design and pedagogy.

Re:My useless(?) WD anecdotes by tibit · 2014-11-13 02:45 · Score: 1

Pray tell, what has a firmware bug got to do with the meaning of a power cycle counter, other than that in this particular case you can't rely on a faulty counter? Let's not deflect attention to strawmen.

--
A successful API design takes a mixture of software design and pedagogy.

Re: Seagate OEM? by corychristison · 2014-11-13 07:50 · Score: 1

I pay for a business account with an online retailer. Said business account provides me with a 2 year exchange on all hard drives (and a bunch of other benefits).

So if the drive fails within 2 years, I send it back to them and they replace it with a similar model, and they pay for the shipping.

If it happens out of the two year scope, I'm better off just buying a new drive than dealing with the hassle of sending it to the manufacturer.

I don't own a shop, nor do I provide IT services. I used to buy A LOT of stuff from them, and decided to start paying the yearly fee.

Re:Seagate OEM? by gweihir · 2014-11-14 20:05 · Score: 1

We learned in statistics class that unless you have special circumstances a sample of size 100 is what you need at the very least.
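The post doesn't say where the figure of 100 comes from, but one common way to see why small samples tell you little is the margin of error on an observed failure rate. A rough sketch, assuming the textbook normal approximation for a proportion (itself shaky for tiny samples); the function name and the sample sizes printed are illustrative only.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for an observed proportion p
    from n independent samples (normal approximation)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Worst case p = 0.5 gives the widest interval for a given n.
    for n in (5, 10, 100, 1000):
        print(f"n={n:5d}  +/- {margin_of_error(0.5, n):.3f}")
```

At 100 samples the worst-case margin is roughly ten percentage points either way, and it grows quickly below that, which is why "I bought three and they all died" says very little about a brand.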