Ask Slashdot: Do You Test Your New Hard Drives?

Posted by timothy on Sunday December 23, 2012 @05:22AM from the just-bite-the-corner-a-little dept.

An anonymous reader writes "Any Slashdot thread about drive failure is loaded with good advice about EOL — but what about the beginning? Do you normally test your new purchases as thoroughly as you test old, suspect drives? Has your testing followed the proverbial 'bathtub' curve of a lot of early failures, but with those that survive the first month surviving for years? And have you had any return problems with new failed drives, because you re-partitioned it, or 'ran Linux,' or used stress-test apps?"

30 of 348 comments

Heh by Deekin_Scalesinger · 2012-12-23 05:23 · Score: 4, Insightful

Like, never. Out of the box and away she goes...good luck to thee!

--
"As the intrepid kobold companion continues his journey, he begins to wonder... if priests raises dead, why anybody die?
1. Re:Heh by JMJimmy · 2012-12-23 06:00 · Score: 4, Insightful
  
  Add to the above:
  HDD tools are useless. I recently tried a bunch of them - they all reported my HDD in perfect condition... while it was doing the click of death. HDD failed within a week.
2. Re:Heh by PlusFiveTroll · 2012-12-23 06:40 · Score: 3, Informative
  
  Sounds more like your hard drive's S.M.A.R.T. was useless. The tools can only report what the drive tells them; if SMART isn't reporting relocated sectors, resets, or whatever other terrible malfunction, then they are left in the dark.
3. Re:Heh by hairyfeet · 2012-12-23 07:13 · Score: 4, Interesting
  
  The problem is the best damned tool ever made for testing drives hasn't been updated in years and now won't work on drives bigger than 500GB. I am of course talking about Spinrite. With Spinrite on level 2 you just bypass the firmware, write patterns of zeroes and ones, and then read back what it reports; if it's spitting errors right off the bat then you know to send it back. Problem is Gibson hasn't updated the thing since '06, so it can't handle drives bigger than 500GB, which makes it all but useless today.
  So if anybody has found something that works like Spinrite but handles large drives I too would like to know. I get drives coming in from all over the place with ZERO history here at the shop, so I don't know if they've been barely used or thoroughly abused, and having a tool I can run on them would be a big help.
  
  --
  ACs don't waste your time replying, your posts are never seen by me.
4. Re:Heh by hairyfeet · 2012-12-23 07:25 · Score: 3, Informative
  
  That's nice, an OS used by less than 2% of the entire planet has some tool that reports what SMART is telling it, no different than a billion freeware programs for Windows. Just FYI, I can think of about a dozen freeware programs that will do the same damned thing in Windows, INCLUDING the email, so it's not exactly like you got anything to brag about, Ms AC.
  Now I'm gonna spell out what the REAL problem is, which any guy who has spent time in the trenches will tell you: SMART SUCKS ASS, and for several years it has been more about covering bad batches for the HDD OEMs than about actually telling you something is going bad. I have had drives in the shop that sounded like an angle grinder bouncing on pavement where SMART said "Nope, nothing wrong here la la la" while the thing just ground and sputtered. It's the most fucking pointless diagnostic tool there is.
  What we NEED is a replacement for Spinrite, something that bypasses the lying SMART and just runs a pass of zeroes and ones on the drive and reports a simple pass/fail on the read/writes. Spinrite was fucking brilliant for this: it would give you a layout of the entire drive with red for sectors that failed to report the correct data back and blue for clean, so it took just a second to glance at the readout to spot a drive that was buggy out of the box. But nobody has updated the tool in years, so it's useless now since it can't do SATA 6 or drives above 500GB.
  So how about it FOSS devs, here are the requirements: bypass SMART, do a single R/W cycle, report results. That's ALL it has to do, and so far nobody has stepped up to the plate. Damned near every shop I knew, including mine, had bought a copy of Spinrite, so there is good money to be made there if you are willing to put in the work. It's a niche, but it's a niche with money: builders, repair shops and gamers would all love to hand you money for this tool, so get on it and report back when it's done, okay?
  
  --
  ACs don't waste your time replying, your posts are never seen by me.
5. Re:Heh by JMJimmy · 2012-12-23 07:30 · Score: 3, Informative
  
  No, not SMART. I did a full range of tests with all suites on top of SMART (surface tests, etc.)
  The only HDD tool I trust is the ancient one from GRC.
6. Re:Heh by greg1104 · 2012-12-23 07:54 · Score: 5, Interesting
  
  Spinrite hasn't been useful for years. There's a good analysis of why at "Does SpinRite do what it claims to do?". Everything the program does can be done more efficiently with a simpler program run from a Linux boot CD. And the fact that it takes so long is a problem--you want to get data off a dying drive as quickly as possible. Here's what I wrote on that question years ago, and the rise of SSDs makes this even more true now:
  SpinRite was a great program in the era it was written, a long time ago. Back then, it would do black magic to recover drives that were seemingly toast, by being more persistent than the drive firmware itself was.
  But here in 2009, it's worthless. Modern drives do complicated sector mapping and testing on their own, and SpinRite is way too old to know how to trigger those correctly on all the drives out there. What you should do instead is learn how to use smartmontools, probably via a Linux boot CD (since the main time you need them is when the drive is already toast).
  My usual routine when a drive starts to go bad is to back its data up using dd, run smartmontools to see what errors it's reporting, trigger a self-test and check the errors again, and then launch into the manufacturer's recovery software to see if the problem can be corrected by it. The idea that SpinRite knows more about the drive than the interface provided by SMART and the manufacturer tools is at least ten years obsolete. Also, getting the information into the SMART logs helps if you need to RMA the drive as defective, something SpinRite doesn't help you with.
  Note that the occasional reports you see that SpinRite "fixes" problems are coincidence. If you access a sector on a modern drive that is bad, the drive will often remap it for you from the spares kept around for that purpose. All SpinRite did was access the bad sector, it didn't actually repair anything. This is why you still get these anecdotal "it worked for me" reports related to it--the same thing would have been much better accomplished with a SMART scan.
7. Re:Heh by SuperTechnoNerd · 2012-12-23 07:56 · Score: 4, Interesting
  
  You have to interpret the data correctly. Looking at seek error rate and raw read errors tells you if the heads are positioning accurately. Run the drive hard (read/write patterns) and watch the temperature. And of course if you start seeing a non-zero pending or reallocated sector count you know the end is near. And watch the spin-up time: it will increase as a drive gets older. (I rarely shut the RAID server down so this is less important.) I have smartd email and text me any time things start to get out of a happy place. I do nightly quick tests and weekly extended tests. SMART is useful - if you're smart about it...
8. Re:Heh by Burpmaster · 2012-12-23 07:57 · Score: 4, Informative
  
  What you want is just 'badblocks -w '.
9. Re:Heh by greg1104 · 2012-12-23 08:10 · Score: 4, Interesting
  
  SMART is a part of the modern drive's firmware. You can't bypass it. Anyone who tells you otherwise--such as the makers of Spinrite--is lying to you in order to sell a product.
  The quality of SMART implementation varies significantly based on the manufacturer. Anecdotally, I have 3 failed Western Digital drives here that flat out lie about the drive's errors. Running the tool needed to generate an RMA does a full SMART scan of the drive, remaps some bad sectors, and then says everything is good. But it's not--each drive is still broken, in a way the firmware seems downright evasive about. Try to use it again, it doesn't take long until another failure. It does seem like the sole purpose of SMART and its associated utilities on WD drives is to keep people from returning a bad drive, by providing a gatekeeper in that process that never says there's a problem.
  Most of my serious installations avoid WD drives like the plague for this reason. I think that Seagate's drives are probably less reliable overall than WD nowadays. Regardless I prefer them, simply because the firmware is more honest about the errors that do happen. Drives fail and I plan for that. What I can't deal with is drives that fail but don't admit it.
  The reason "RAID edition" firmware is available is to provide a drive that isn't supposed to be as evasive about errors. It may be that some WD RAID edition models don't have the problem I'm describing. I soured on them as a brand before those became mainstream.
10. Re:Heh by Culture20 · 2012-12-23 08:26 · Score: 4, Informative
  
  My usual routine when a drive starts to go bad is to back its data up using dd
  ddrescue is the tool for backing up a failing drive unless you really want to manually check every failed sector read then restart a new dd (skipping to the next sector).
11. Re:Heh by BLKMGK · 2012-12-23 08:49 · Score: 3, Informative
  
  Not exactly useless... There's a preclear script that many unRAID users use to beat up their drives while monitoring SMART. It doesn't just look at SMART for a thumbs up or down but monitors the various parameters that SMART throws out. Users run this multiple times in a row and find bad drives fairly regularly. I will admit that I've not been running it, but judging from the numbers of folks who have been finding it useful and from the fact that warranties seem to be getting ever shorter I may begin doing so. I use a decent number of the 3TB drives that are always going on sale and I'm starting to think I'm tempting fate by not testing them. I've gotten spoiled in that my unRAID box covers my ass in the event of a failure, but I see too damn many reports of new drives going toes up to not be concerned. I have 3 drives sitting on the shelf waiting to be loaded and I may beat them up beforehand just to be sure they won't screw me when I least expect it...
  
  --
  Build it, Drive it, Improve it! Hybridz.org
12. Re:Heh by thegarbz · 2012-12-23 09:53 · Score: 4, Informative
  
  No, not SMART. I did a full range of tests with all suites on top of SMART (surface tests, etc.)
  The only HDD tool I trust is the ancient one from GRC.
  That is absolutely laughable. Spinrite is about as good at interfacing with a modern drive as an old 16-bit DOS program is at squeezing every ounce of performance out of a 64-bit processor. It had its purpose in its day. These days running it will more likely do harm than good.
  Not to mention that if your drive is at the end of life running a program that is widely known to give it a most horrendous thrashing is probably not a good idea.
13. Re:Heh by Pentium100 · 2012-12-23 15:35 · Score: 3, Interesting
  
  MHDD works best for me for testing the drive. Spinrite (and ddrescue) is good for data recovery, but not that good for testing. I had one drive that had a lot of sectors that were good, except that the drive took 10-30 seconds to read them, making the PC extremely slow (Windows would drop to PIO mode and be slow even when reading the good sectors). Chkdsk didn't detect anything, Spinrite didn't detect anything, only MHDD showed lots of slow sectors (I later made a list and manually marked them as bad; getting a 2.5" IDE drive is not that easy or fast, so it will have to do until then).
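The write-then-read-back check that several replies above ask for (and that `badblocks -w` does with the patterns 0xaa, 0x55, 0xff and 0x00) can be roughed out with dd and a checksum. This is only a sketch under assumptions: `verify_pass` and its arguments are made-up names, it assumes GNU dd and sha256sum, it writes zeros only, and it destroys whatever is on the target.

```shell
#!/bin/sh
# Sketch only: one destructive write/read-back pass. Everything on the target is lost.
# verify_pass is a hypothetical helper name, not part of any existing tool.
verify_pass() {
    target=$1   # device node (e.g. /dev/sdX) or a scratch image file
    mib=$2      # number of MiB to write and verify

    # Checksum of the data we are about to write (mib MiB of zeros).
    expected=$(dd if=/dev/zero bs=1M count="$mib" 2>/dev/null | sha256sum)

    # Write the pattern; conv=fsync flushes it to the target before dd exits.
    dd if=/dev/zero of="$target" bs=1M count="$mib" conv=fsync 2>/dev/null || return 2

    # Read the same range back and checksum it.
    # On a real drive, consider adding iflag=direct so the read bypasses the page cache.
    actual=$(dd if="$target" bs=1M count="$mib" 2>/dev/null | sha256sum)

    if [ "$expected" = "$actual" ]; then
        echo PASS
    else
        echo FAIL
        return 1
    fi
}
```

For example, `verify_pass /dev/sdX 1024` would check the first GiB of a placeholder device. A PASS says nothing about the rest of the drive or about what SMART recorded along the way, so it complements the SMART checks discussed in the thread and does not replace them.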
Re:SSDs by roc97007 · 2012-12-23 05:29 · Score: 5, Insightful

> Who cares about HDDs anymore these days?
Anyone with a need for a massive amount of storage space.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
dban followed by smartctl by X0563511 · 2012-12-23 05:30 · Score: 3, Interesting

If dban can write out every sector and not have smartctl show any pending sectors after the fact (and the average speed of the dban wipe was normal) then you've got good chances the drive will be fine.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
1. Re:dban followed by smartctl by bill_mcgonigle · 2012-12-23 05:45 · Score: 5, Interesting
  
  Yes, this. I do it online:
  
  dd if=/dev/zero of=/dev/sdX bs=8M
  
  and then check smartctl. If I'm making a really big zpool, I fill them up and let ZFS fail out the turkeys:
  
  dd if=/dev/zero of=/tank/zeros.dd bs=8M
  zpool scrub tank
  
  If I'm building a 30-drive storage server for a client I'll often see 1-2 fail out. Better to catch them now than when they're deployed (especially with the crap warranties on spinning rust these days). I need to order in staggered lots anyway, so having 10% overhead helps keep things moving along.
  
  --
  My God, it's Full of Source!
  OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
smartmontools by WD · 2012-12-23 05:35 · Score: 5, Informative

Set up the smartd.conf file to do the example short-test daily and long-test weekly, and email you when something is fishy. It's a trivial amount of effort, resulting in a significant amount of peace of mind. (In many cases, you'll have some amount of warning before your drive kicks the bucket and it's too late.)
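For reference, a schedule like the one described (short test daily, long test weekly, mail on trouble) looks roughly like this in smartd.conf. It is a sketch modeled on the example in the smartmontools documentation; the mail address is a placeholder, the test hours are arbitrary, and the file location varies by distribution.

```
# /etc/smartd.conf (path is an assumption; check your distribution)
# -a        monitor all SMART properties
# -o on     enable automatic offline testing
# -S on     enable attribute autosave
# -s ...    short self-test every day between 02:00 and 03:00,
#           long self-test on Saturdays between 03:00 and 04:00
# -m ...    mail warnings to this address (placeholder)
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
```

smartd has to be restarted or reloaded after editing the file for the schedule to take effect.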
1. Re:smartmontools by Deekin_Scalesinger · 2012-12-23 05:58 · Score: 5, Funny
  
  This should be modded up for your username alone lol
  
  --
  "As the intrepid kobold companion continues his journey, he begins to wonder... if priests raises dead, why anybody die?
Yes! Especially before adding them to an array. by Anonymous Coward · 2012-12-23 05:42 · Score: 5, Interesting

I run some ZFS systems at work. With the current version of the filesystem, you can expand the zpools but you can't shrink them, so adding a bad drive causes immediate problems.
I've found that some drives are completely functional but write at extremely slow rates: maybe 10% of normal. With typical consumer drives, maybe 1/20 is like this. To ensure I don't put a slow drive into a production zpool array of disks, I always make a small test zpool consisting of just the new batch of drives and stress-test them.
This catches not only obviously bad drives, but also the slow or otherwise odd ones.
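One way to spot the slow-but-working drives described above is to time the same sequential write on every new drive and flag the outliers. A rough sketch, assuming GNU dd: `write_rate`, `flag_slow` and the percentage cutoff are illustrative choices and not an established rule, and the write destroys data on the target.

```shell
#!/bin/sh
# Sketch only: time a sequential write per drive and flag outliers.
# write_rate and flag_slow are hypothetical helper names.

# Print "<target> <MiB/s>" for writing $2 MiB of zeros to $1 (destroys data on $1).
# Timing uses whole seconds, so use a large size on real drives.
write_rate() {
    start=$(date +%s)
    dd if=/dev/zero of="$1" bs=1M count="$2" conv=fsync 2>/dev/null || return 1
    end=$(date +%s)
    elapsed=$((end - start))
    [ "$elapsed" -gt 0 ] || elapsed=1   # avoid dividing by zero on very short runs
    echo "$1 $(( $2 / elapsed ))"
}

# Read "<target> <MiB/s>" lines on stdin and print the targets that are
# slower than $1 percent of the fastest one in the batch.
flag_slow() {
    awk -v pct="$1" '
        { name[NR] = $1; rate[NR] = $2 + 0; if (rate[NR] > best) best = rate[NR] }
        END {
            for (i = 1; i <= NR; i++)
                if (rate[i] * 100 < best * pct) print name[i]
        }'
}
```

With placeholder device names, `for d in /dev/sdb /dev/sdc /dev/sdd; do write_rate "$d" 4096; done | flag_slow 50` would print any drive that managed less than half the speed of the fastest one. Comparing against the batch avoids having to know what "normal" is for a given model.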
Re:SSDs by White+Flame · 2012-12-23 05:53 · Score: 3, Insightful

Not really. People usually don't modify gigantic footprints of data per day, so standard incremental backup strategies are still very applicable. Most of the large data tends to be read-only over time, typically media, archives, large installation files, etc.
Re:SSDs by cpghost · 2012-12-23 05:58 · Score: 3, Informative

Who cares about HDDs anymore these days?
We do here at work. We need some modest 120+ TB of storage right now, and 30% of that content is highly dynamic (PostgreSQL databases). Anything but data center quality HDD would be silly, not to mention unreliable as hell and heavily expensive. SSDs are just for laptops or so, not for real data storage requirements.

--
cpghost at Cordula's Web.
SMART + badblocks by SuperBanana · 2012-12-23 06:23 · Score: 5, Interesting

I run smartctl and capture the registers, then run badblocks, and compare smartctl's output to the pre-bad-blocks check.
If there are any remapped blocks, the drive goes back, as the factory should have remapped the initial defects already, and that means new failed blocks in the first few hours of operation.

--
Please help metamoderate.
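The before/after comparison described above can be scripted. A minimal sketch: `smart_counters` is a made-up helper that pulls the raw reallocated and pending sector counts out of `smartctl -A` output. Attribute names and table layout vary between drives, so the field positions here are an assumption to check against your own output.

```shell
#!/bin/sh
# Sketch only. smart_counters is a hypothetical helper name.
# Reads `smartctl -A` output on stdin and prints NAME=RAW_VALUE for the
# reallocated and pending sector counters (assumes the usual ATA
# attribute table, where the raw value is the 10th column).
smart_counters() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
        print $2 "=" $10
    }'
}

# Usage (destructive; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | smart_counters > before.txt
#   badblocks -wsv /dev/sdX
#   smartctl -A /dev/sdX | smart_counters > after.txt
#   diff before.txt after.txt && echo "no new remapped or pending sectors"
```

Keeping only the two counters makes the diff quiet on attributes that change anyway during a test run, such as temperature and power-on hours.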
Re:SSDs by aaarrrgggh · 2012-12-23 06:29 · Score: 3, Insightful

Rebuild time. It takes our hardware raids about 24 hours to rebuild, and software raids about 72 hours. If the disk failure isn't detected immediately, even with RAID-6 you are pushing your luck.
RAID is not backup.
Re:SSDs by PlusFiveTroll · 2012-12-23 06:50 · Score: 3, Insightful

Depending on your definition of reliable and long term, people still use tapes.
Re:Used to never test by PlusFiveTroll · 2012-12-23 07:05 · Score: 3, Interesting

Two DOA of the same part isn't out of the question, a good amount of the time the same part number is from the same batch, which may suffer from the same manufacturing defects. I see things like that pretty often in batches of disks that fall out of RAIDs.
Re:SSDs by cpghost · 2012-12-23 07:14 · Score: 4, Interesting

Actually, our only use for SSDs currently is ZILs (ZFS intent logs), and we're evaluating whether to put PostgreSQL transaction logs on an SSD, but that's another story. Our main storage farm is still HDD-based.

--
cpghost at Cordula's Web.
Re:SSDs by hairyfeet · 2012-12-23 08:18 · Score: 4, Interesting

Unless you are using SLC, which is getting harder to find and more expensive every day, you are really pushing your luck. The problem is the hot/crazy scale when it comes to these drives, specifically the fact that nobody has figured out how to lick the controller issue. For those that haven't run into it yet (lucky bastards), the controller issue will cause a drive to suddenly fail without ANY warning, and unlike how SSDs are always bragged on to "fail safe" into a read-only mode, what actually happens is that when the controller fails the whole drive is completely dead; it won't even show up in BIOS/UEFI.
So until somebody figures out how to lick the controller problem (and when they do the money they make will truly be insane), or comes up with the idea that I have been advocating for years of putting a second, cheaper ARM controller on the board designed to take over as a read-only backup while you get your data out, I'd be seriously leery of trusting any data I cared about to an SSD, not without spinning rust backups at the very least. The controller bug seems to bite every OEM on the ass; I have seen it from Intel to OCZ and it's always the same. Push the button and poof! Data all gone with the drive. And of course since you can't get your data off or even wipe it you have to hope they don't send it to some third world country for refurb where they help themselves to your data. Because of this I don't think my customers have even used 10% of their warranties for fear of the data falling into the wrong hands: great for the OEMs, which rarely have to make good on warranties, not so good for the customer.

--
ACs don't waste your time replying, your posts are never seen by me.
Wrong Approach by nuckfuts · 2012-12-23 08:29 · Score: 4, Insightful

I've been dealing with hardware failures for 20+ years. What I've learned is that disasters WILL happen, regardless of what preventive measures are in place. So I shifted my focus toward recoverability. To me, the important question is "When something catastrophic happens, how quickly and easily can I put things back in working order?"
Since I use RAID where appropriate, and more importantly, I am positively fanatic about frequent, full, and tested backups, the only concern I have when a hard drive dies is whether I'm still entitled to a warranty replacement.
Re:SSDs by drsmithy · 2012-12-23 09:23 · Score: 4, Funny

Holy crap. Twenty 3T spindles in a single array ? What do you do to de-stress ? Run between cars on a highway ?