Ask Slashdot: Do You Test Your New Hard Drives?
An anonymous reader writes "Any Slashdot thread about drive failure is loaded with good advice about EOL — but what about the beginning? Do you normally test your new purchases as thoroughly as you test old, suspect drives? Has your testing followed the proverbial 'bathtub' curve of a lot of early failures, but with those that survive the first month surviving for years? And have you had any return problems with new failed drives, because you re-partitioned it, or 'ran Linux,' or used stress-test apps?"
Like, never. Out of the box and away she goes...good luck to thee!
"As the intrepid kobold companion continues his journey, he begins to wonder... if priests raises dead, why anybody die?
> Who cares about HDDs anymore these days?
Anyone with a need for a massive amount of storage space.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Set up the smartd.conf file to run the example short test daily and the long test weekly, and to email you when something is fishy. It's a trivial amount of effort, resulting in a significant amount of peace of mind. (In many cases, you'll have some amount of warning before your drive kicks the bucket and it's too late.)
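For anyone who hasn't seen what such a line looks like, here is a minimal sketch that builds one. The helper name `smartd_directive` and its defaults (02:00 for the short test, Saturday 03:00 for the long test, mail to `root`) are my assumptions for illustration, not the parent poster's settings; the `-a`, `-s` and `-m` directives are the ones described in the smartd.conf man page.

```python
def smartd_directive(device="DEVICESCAN", short_hour=2,
                     long_weekday=6, long_hour=3, mail_to="root"):
    """Build one smartd.conf line: daily short test, weekly long test, mail on trouble.

    Hypothetical helper for illustration; check the result against your
    distribution's sample smartd.conf before relying on it.
    """
    if not 0 <= short_hour <= 23 or not 0 <= long_hour <= 23:
        raise ValueError("hours must be 0-23")
    if not 1 <= long_weekday <= 7:
        raise ValueError("weekday must be 1 (Monday) to 7 (Sunday)")
    # -s takes a regex matched against T/MM/DD/d/HH
    # (test type / month / day of month / day of week / hour).
    schedule = (f"(S/../.././{short_hour:02d}"
                f"|L/../../{long_weekday}/{long_hour:02d})")
    # -a monitors all SMART attributes, -m sets the warning mail address.
    return f"{device} -a -s {schedule} -m {mail_to}"


print(smartd_directive())
```

Paste the printed line into smartd.conf and restart smartd. Whether mail actually gets delivered depends on the local mail setup.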
I run some ZFS systems at work. With the current version of the filesystem, you can expand the zpools but you can't shrink them, so adding a bad drive causes immediate problems.
I've found that some drives are completely functional but write at extremely slow rates: maybe 10% of normal. With typical consumer drives, maybe 1/20 is like this. To ensure I don't put a slow drive into a production zpool array of disks, I always make a small test zpool consisting of just the new batch of drives and stress-test them.
This catches not only obviously bad drives, but also the slow or otherwise odd ones.
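A rough sketch of the "spot the slow one" step, assuming you have already measured a sustained write rate for each drive in the test pool by whatever means you like. The function name, the 50% cutoff and the device names and numbers are all made up for illustration; none of it comes from the parent post.

```python
from statistics import median


def flag_slow_drives(write_mb_s, threshold=0.5):
    """Return names of drives writing slower than threshold * the batch median.

    write_mb_s maps a drive name to its measured write rate in MB/s.
    The median is used so that one terrible drive cannot drag the
    baseline down and hide itself.
    """
    if not write_mb_s:
        return []
    typical = median(write_mb_s.values())
    return sorted(name for name, rate in write_mb_s.items()
                  if rate < threshold * typical)


# Invented numbers for illustration only, not measurements.
batch = {"da0": 148.0, "da1": 151.5, "da2": 14.2, "da3": 146.8}
print(flag_slow_drives(batch))  # ['da2']
```

Comparing against the rest of the same batch, rather than a spec-sheet number, is a design choice: it only tells you which drives are odd relative to their siblings.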
Yes, this. I do it online:
and then check smartctl. If I'm making a really big zpool, I fill them up and let ZFS fail out the turkeys:
If I'm building a 30-drive storage server for a client I'll often see 1-2 fail out. Better to catch them now than when they're deployed (especially with the crap warranties on spinning rust these days). I need to order in staggered lots anyway, so having 10% overhead helps keep things moving along.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
I run smartctl and capture the registers, then run badblocks, and compare smartctl's output to the pre-bad-blocks check.
If there are any remapped blocks, the drive goes back, as the factory should have remapped the initial defects already, and that means new failed blocks in the first few hours of operation.
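The before/after comparison can be scripted. This is a sketch only, assuming you saved the text of `smartctl -A` before and after the badblocks run and that the drive reports the usual ATA attribute table. The function names, the list of watched attributes and the sample lines below are my own illustration, not captured output.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable")


def raw_values(smartctl_text):
    """Map attribute name -> integer raw value from `smartctl -A` style text."""
    values = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry the raw value
        # in the tenth column; header and blank lines are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def grown_defects(before_text, after_text, watched=WATCHED):
    """Return {attribute: (before, after)} for watched counters that went up."""
    before, after = raw_values(before_text), raw_values(after_text)
    return {name: (before.get(name, 0), after[name])
            for name in watched
            if name in after and after[name] > before.get(name, 0)}


# Hypothetical sample lines, not output from a real drive.
BEFORE = "  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0"
AFTER = "  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 8"
print(grown_defects(BEFORE, AFTER))  # {'Reallocated_Sector_Ct': (0, 8)}
```

An empty result means none of the watched counters moved during the test. Anything else is the "goes back" case described above.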
Please help metamoderate.
Actually, the only use we currently have for SSDs is as ZILs (ZFS intent logs), and we're evaluating whether to put PostgreSQL transaction logs on an SSD, but that's another story. Our main storage farm is still HDD-based.
cpghost at Cordula's Web.
Unless you are using SLC, which is getting harder to find and more expensive every day, you are really pushing your luck. The problem is the hot/crazy scale when it comes to these drives, specifically the fact that nobody has figured out how to lick the controller issue. For those who haven't run into it yet (lucky bastards), the controller issue will cause a drive to suddenly fail without ANY warning. And unlike the way SSDs are always bragged about as "failing safe" into a read-only mode, what actually happens when the controller fails is that the whole drive is completely dead; it won't even show up in BIOS/UEFI.
So until somebody figures out how to lick the controller problem (and when they do, the money they make will truly be insane), or comes up with the idea that I have been advocating for years of putting a second, cheaper ARM controller on the board designed to take over as a read-only backup while you get your data out, I'd be seriously leery of trusting any data I cared about to an SSD, not without spinning rust backups at the very least. The controller bug seems to bite every OEM on the ass; I have seen it from Intel to OCZ and it's always the same. Push the button and poof! Data all gone with the drive. And of course, since you can't get your data off or even wipe it, you have to hope they don't send it to some third world country for refurb where they help themselves to your data. Because of this I don't think my customers have even used 10% of their warranties, for fear of the data falling into the wrong hands. Great for the OEMs, which rarely have to make good on warranties; not so good for the customer.
ACs don't waste your time replying, your posts are never seen by me.
I've been dealing with hardware failures for 20+ years. What I've learned is that disasters WILL happen, regardless of what preventive measures are in place. So I shifted my focus toward recoverability. To me, the important question is "When something catastrophic happens, how quickly and easily can I put things back in working order?"
Since I use RAID where appropriate, and more importantly, I am positively fanatic about frequent, full, and tested backups, the only concern I have when a hard drive dies is whether I'm still entitled to a warranty replacement.
Holy crap. Twenty 3T spindles in a single array? What do you do to de-stress? Run between cars on a highway?