Ask Slashdot: Do You Test Your New Hard Drives?

Posted by timothy on Sunday December 23, 2012 @05:22AM from the just-bite-the-corner-a-little dept.

An anonymous reader writes "Any Slashdot thread about drive failure is loaded with good advice about EOL — but what about the beginning? Do you normally test your new purchases as thoroughly as you test old, suspect drives? Has your testing followed the proverbial 'bathtub' curve of a lot of early failures, but with those that survive the first month surviving for years? And have you had any return problems with new failed drives, because you re-partitioned it, or 'ran Linux,' or used stress-test apps?"

19 of 348 comments

Heh by Deekin_Scalesinger · 2012-12-23 05:23 · Score: 4, Insightful

Like, never. Out of the box and away she goes...good luck to thee!

--
"As the intrepid kobold companion continues his journey, he begins to wonder... if priests raises dead, why anybody die?
1. Re:Heh by JMJimmy · 2012-12-23 06:00 · Score: 4, Insightful
  
  Add to the above:
  HDD tools are useless. I recently tried a bunch of them - they all reported my HDD in perfect condition... while it was doing the click of death. HDD failed within a week.
2. Re:Heh by hairyfeet · 2012-12-23 07:13 · Score: 4, Interesting
  
  The problem is the best damned tool ever made for testing drives hasn't been updated in years and now won't work on drives bigger than 500GB. I am of course talking about Spinrite. With Spinrite on level 2 you just bypass the firmware, write patterns of zeroes and ones, and then read back what it reports; if it's spitting errors right off the bat then you know to send it back. Problem is Gibson hasn't updated the thing since '06, so it can't handle drives bigger than 500GB, which makes it all but useless today.
  So if anybody has found something that works similarly to Spinrite but works on large drives, I too would like to know. I get drives coming in from all over the place at the shop with ZERO history, so I don't know if they've been barely used or thoroughly abused, and having a tool I can run on them would be a big help.
  
  --
  ACs don't waste your time replying, your posts are never seen by me.
3. Re:Heh by greg1104 · 2012-12-23 07:54 · Score: 5, Interesting
  
  Spinrite hasn't been useful for years. There's a good analysis of why at "Does SpinRite do what it claims to do?". Everything the program does can be done more efficiently with a simpler program run from a Linux boot CD. And the fact that it takes so long is a problem--you want to get data off a dying drive as quickly as possible. Here's what I wrote on that question years ago, and the rise of SSDs makes this even more true now:
  SpinRite was a great program in the era it was written, a long time ago. Back then, it would do black magic to recover drives that were seemingly toast, by being more persistent than the drive firmware itself was.
  But here in 2009, it's worthless. Modern drives do complicated sector mapping and testing on their own, and SpinRite is way too old to know how to trigger those correctly on all the drives out there. What you should do instead is learn how to use smartmontools, probably via a Linux boot CD (since the main time you need them is when the drive is already toast).
  My usual routine when a drive starts to go bad is to back its data up using dd, run smartmontools to see what errors it's reporting, trigger a self-test and check the errors again, and then launch into the manufacturer's recovery software to see if the problem can be corrected by it. The idea that SpinRite knows more about the drive than the interface provided by SMART and the manufacturer tools is at least ten years obsolete. Also, getting the information into the SMART logs helps if you need to RMA the drive as defective, something SpinRite doesn't help you with.
  Note that the occasional reports you see that SpinRite "fixes" problems are coincidence. If you access a sector on a modern drive that is bad, the drive will often remap it for you from the spares kept around for that purpose. All SpinRite did was access the bad sector, it didn't actually repair anything. This is why you still get these anecdotal "it worked for me" reports related to it--the same thing would have been much better accomplished with a SMART scan.
4. Re:Heh by SuperTechnoNerd · 2012-12-23 07:56 · Score: 4, Interesting
  
  You have to interpret the data correctly. Looking at seek error rate and raw read errors tells you if the heads are positioning accurately. Run the drive hard (read/write patterns) and watch the temperature. And of course if you start seeing non-zero pending and reallocated sector counts, you know the end is near. Also watch the spin-up time, which increases as a drive gets older. (I rarely shut the RAID server down so this is less important.) I have smartd email and text me any time things start to get out of a happy place. I do nightly quick tests and weekly extended tests. SMART is useful - if you're smart about it...
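A minimal sketch of the pending/reallocated check, assuming the usual `smartctl -A` attribute table with RAW_VALUE in the tenth column. The function names `smart_raw` and `check_sectors` are made up for illustration, and attribute names can differ between drive vendors.

```shell
# smart_raw NAME: read `smartctl -A` output on stdin and print the raw
# value (10th column) of the attribute called NAME.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# check_sectors: read `smartctl -A` output on stdin, print a warning for
# each non-zero reallocated/pending counter, and return 1 if any was found.
check_sectors() {
    awk '
        ($2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector") && $10 + 0 > 0 {
            print "WARNING: " $2 " = " $10
            bad = 1
        }
        END { exit (bad ? 1 : 0) }'
}
```

Typical use would be something like `smartctl -A /dev/sdX | check_sectors || echo "look closer at this drive"`, with `/dev/sdX` standing in for the real device.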
5. Re:Heh by Burpmaster · 2012-12-23 07:57 · Score: 4, Informative
  
  What you want is just 'badblocks -w '.
6. Re:Heh by greg1104 · 2012-12-23 08:10 · Score: 4, Interesting
  
  SMART is a part of the modern drive's firmware. You can't bypass it. Anyone who tells you otherwise--such as the makers of Spinrite--is lying to you in order to sell a product.
  The quality of SMART implementation varies significantly based on the manufacturer. Anecdotally, I have 3 failed Western Digital drives here that flat out lie about the drive's errors. Running the tool needed to generate an RMA does a full SMART scan of the drive, remaps some bad sectors, and then says everything is good. But it's not--each drive is still broken, in a way the firmware seems downright evasive about. Try to use it again, it doesn't take long until another failure. It does seem like the sole purpose of SMART and its associated utilities on WD drives is to keep people from returning a bad drive, by providing a gatekeeper in that process that never says there's a problem.
  Most of my serious installations avoid WD drives like the plague for this reason. I think that Seagate's drives are probably less reliable overall than WD nowadays. Regardless I prefer them, simply because the firmware is more honest about the errors that do happen. Drives fail and I plan for that. What I can't deal with is drives that fail but don't admit it.
  The reason "RAID edition" firmware is available is to provide a drive that isn't supposed to be as evasive about errors. It may be that some WD RAID edition models don't have the problem I'm describing. I soured on them as a brand before those became mainstream.
7. Re:Heh by Culture20 · 2012-12-23 08:26 · Score: 4, Informative
  
  > My usual routine when a drive starts to go bad is to back its data up using dd
  ddrescue is the tool for backing up a failing drive unless you really want to manually check every failed sector read then restart a new dd (skipping to the next sector).
8. Re:Heh by thegarbz · 2012-12-23 09:53 · Score: 4, Informative
  
  > No, not SMART. I did a full range of tests with all suites on top of SMART (surface tests, etc)
  > The only HDD tool I trust is the ancient one from GRC.
  That is absolutely laughable. Spinrite is about as good at interfacing with a modern drive as an old 16-bit DOS program is at squeezing every ounce of performance out of a 64-bit processor. It had its purpose in its day. These days running it will more likely do harm than good.
  Not to mention that if your drive is at the end of its life, running a program that is widely known to give it a most horrendous thrashing is probably not a good idea.
Re:SSDs by roc97007 · 2012-12-23 05:29 · Score: 5, Insightful

> Who cares about HDDs anymore these days?
Anyone with a need for a massive amount of storage space.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
smartmontools by WD · 2012-12-23 05:35 · Score: 5, Informative

Set up the smartd.conf file to do the example short-test daily and long-test weekly, and email you when something is fishy. It's a trivial amount of effort, resulting in a significant amount of peace of mind. (In many cases, you'll have some amount of warning before your drive kicks the bucket and it's too late)
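As a rough illustration, a single smartd.conf line along these lines would cover that schedule. The hours and the `root` mail recipient are placeholders, and the `-s` schedule syntax follows the examples in the smartd.conf man page, so check it against the version installed on your system.

```
# /etc/smartd.conf (location varies by distribution)
# -a       monitor all SMART attributes on every detected drive
# -s ...   short self-test every day in the 02:00 hour,
#          long self-test every Saturday in the 03:00 hour
# -m root  mail warnings to root (placeholder address)
DEVICESCAN -a -s (S/../.././02|L/../../6/03) -m root
```

smartd has to be restarted or reloaded before a changed configuration takes effect.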
1. Re:smartmontools by Deekin_Scalesinger · 2012-12-23 05:58 · Score: 5, Funny
  
  This should be modded up for your username alone lol
  
  --
  "As the intrepid kobold companion continues his journey, he begins to wonder... if priests raises dead, why anybody die?
Yes! Especially before adding them to an array. by Anonymous Coward · 2012-12-23 05:42 · Score: 5, Interesting

I run some ZFS systems at work. With the current version of the filesystem, you can expand the zpools but you can't shrink them, so adding a bad drive causes immediate problems.
I've found that some drives are completely functional but write at extremely slow rates: maybe 10% of normal. With typical consumer drives, maybe 1/20 is like this. To ensure I don't put a slow drive into a production zpool array of disks, I always make a small test zpool consisting of just the new batch of drives and stress-test them.
This catches not only obviously bad drives, but also the slow or otherwise odd ones.
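One hedged way to pick the slow drives out of a batch is to time the same sequential write on each drive and then compare the results. The sketch below only does the comparison step. It assumes you have already collected one `name MB/s` pair per line; the function name `flag_slow` and the 50% default cutoff are arbitrary choices for illustration, not anything taken from the setup described above.

```shell
# flag_slow [FRACTION]: read "drive throughput_MBps" lines on stdin and
# print every drive slower than FRACTION (default 0.5) of the fastest one.
flag_slow() {
    awk -v frac="${1:-0.5}" '
        NF >= 2 {
            name[++n] = $1
            rate[n] = $2 + 0
            if (rate[n] > max) max = rate[n]
        }
        END {
            for (i = 1; i <= n; i++)
                if (rate[i] < frac * max)
                    printf "%s %s MB/s (fastest in batch: %s MB/s)\n", name[i], rate[i], max
        }'
}
```

Comparing against the fastest drive in the same batch avoids hard-coding an expected speed, which varies by model and interface.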
Re:dban followed by smartctl by bill_mcgonigle · 2012-12-23 05:45 · Score: 5, Interesting

Yes, this. I do it online:

dd if=/dev/zero of=/dev/sdX bs=8M

and then check smartctl. If I'm making a really big zpool, I fill them up and let ZFS fail out the turkeys:

dd if=/dev/zero of=/tank/zeros.dd bs=8M
zpool scrub tank

If I'm building a 30-drive storage server for a client I'll often see 1-2 fail out. Better to catch them now than when they're deployed (especially with the crap warranties on spinning rust these days). I need to order in staggered lots anyway, so having 10% overhead helps keep things moving along.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
SMART + badblocks by SuperBanana · 2012-12-23 06:23 · Score: 5, Interesting

I run smartctl and capture the registers, then run badblocks, and compare smartctl's output to the pre-bad-blocks check.
If there are any remapped blocks, the drive goes back, as the factory should have remapped the initial defects already, and that means new failed blocks in the first few hours of operation.
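The before/after comparison can be sketched roughly like this, assuming both snapshots were saved from `smartctl -A` and use its usual column layout with RAW_VALUE in the tenth column. The function name `smart_changes` is hypothetical.

```shell
# smart_changes BEFORE AFTER: compare two saved `smartctl -A` outputs and
# print each attribute whose raw value (10th column) differs.
smart_changes() {
    awk '
        $1 !~ /^[0-9]+$/ { next }                      # skip headers and blank lines
        FILENAME == ARGV[1] { before[$2] = $10; next } # remember the first snapshot
        ($2 in before) && before[$2] != $10 {
            print $2 ": " before[$2] " -> " $10
        }
    ' "$1" "$2"
}
```

Save one snapshot with `smartctl -A /dev/sdX > before.txt`, run badblocks, save a second snapshot to `after.txt`, then call `smart_changes before.txt after.txt`. Counters such as power-on hours and temperature will change on their own; the line to look for is a rise in Reallocated_Sector_Ct.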

--
Please help metamoderate.
Re:SSDs by cpghost · 2012-12-23 07:14 · Score: 4, Interesting

Actually, our only use for SSDs currently is as ZILs (ZFS intent logs), and we're evaluating whether to put PostgreSQL transaction logs on an SSD, but that's another story. Our main storage farm is still HDD-based.

--
cpghost at Cordula's Web.
Re:SSDs by hairyfeet · 2012-12-23 08:18 · Score: 4, Interesting

Unless you are using SLC, which is getting harder to find and more expensive every day, you are really pushing your luck. The problem is the hot/crazy scale when it comes to these drives, specifically the fact that nobody has figured out how to lick the controller issue. For those that haven't run into it yet (lucky bastards), the controller issue will cause a drive to suddenly fail without ANY warning, and unlike how SSDs are always bragged on to "fail safe" into a read-only mode, what actually happens is that when the controller fails the whole drive is completely dead; it won't even show up in BIOS/UEFI.
So until somebody figures out how to lick the controller problem (and when they do, the money they make will truly be insane), or comes up with the idea I have been advocating for years of putting a second, cheaper ARM controller on the board designed to take over as a read-only backup while you get your data out, I'd be seriously leery of trusting any data I cared about to an SSD, not without spinning rust backups at the very least. The controller bug seems to bite every OEM on the ass; I have seen it from Intel to OCZ and it's always the same. Push the button and poof! Data all gone with the drive. And of course since you can't get your data off or even wipe it, you have to hope they don't send it to some third world country for refurb where they help themselves to your data. Because of this I don't think my customers have even used 10% of their warranties for fear of the data falling into the wrong hands. Great for the OEMs, which rarely have to make good on warranties; not so good for the customer.

--
ACs don't waste your time replying, your posts are never seen by me.
Wrong Approach by nuckfuts · 2012-12-23 08:29 · Score: 4, Insightful

I've been dealing with hardware failures for 20+ years. What I've learned is that disasters WILL happen, regardless of what preventive measures are in place. So I shifted my focus toward recoverability. To me, the important question is "When something catastrophic happens, how quickly and easily can I put things back in working order?"
Since I use RAID where appropriate, and more importantly, I am positively fanatic about frequent, full, and tested backups, the only concern I have when a hard drive dies is whether I'm still entitled to a warranty replacement.
Re:SSDs by drsmithy · 2012-12-23 09:23 · Score: 4, Funny

Holy crap. Twenty 3TB spindles in a single array? What do you do to de-stress? Run between cars on a highway?