Ask Slashdot: Do You Test Your New Hard Drives?
An anonymous reader writes "Any Slashdot thread about drive failure is loaded with good advice about EOL — but what about the beginning? Do you normally test your new purchases as thoroughly as you test old, suspect drives? Has your testing followed the proverbial 'bathtub' curve of a lot of early failures, but with those that survive the first month surviving for years? And have you had any return problems with new failed drives, because you re-partitioned it, or 'ran Linux,' or used stress-test apps?"
Like, never. Out of the box and away she goes...good luck to thee!
"As the intrepid kobold companion continues his journey, he begins to wonder... if priests raises dead, why anybody die?
> Who cares about HDDs anymore these days?
Anyone with a need for a massive amount of storage space.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
If dban can write out every sector and not have smartctl show any pending sectors after the fact (and the average speed of the dban wipe was normal) then you've got good chances the drive will be fine.
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
My first help desk job included every computer in the company. We had a server drive fail, so I had Compaq send a replacement. The new arrival didn't work. So then I spent more time looking at RAID configuration and such, but we got a second replacement. That one didn't work either. But I tested it on arrival. The third replacement worked fine, just when I was worried it was something stupid I was missing. Two DOA RMAs for the same part. And yes, that's happened to me again since that first time.
I test every "used" part as if it's suspect. The question was about new, but they are still new to me.
Learn to love Alaska
Set up the smartd.conf file to do the example short-test daily and long-test weekly, and email you when something is fishy. It's a trivial amount of effort, resulting in a significant amount of peace of mind. (In many cases, you'll have some amount of warning before your drive kicks the bucket and it's too late)
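For anyone who hasn't looked inside that file, here is a rough sketch of what such a line looks like, generated from Python so the schedule is easy to tweak. The function name, the device `/dev/sda` and the recipient `root` are placeholders of mine, not anything from the parent post; the `-a`, `-s` and `-m` directives and the `T/MM/DD/d/HH` schedule format are the ones described in the smartd.conf man page.

```python
def smartd_directive(device, email, short_hour=2, long_weekday=6, long_hour=3):
    """Build one smartd.conf line: daily short self-test, weekly long self-test.

    The -s argument is a regular expression matched against T/MM/DD/d/HH,
    where T is the test type (S = short, L = long), d is the day of the
    week (1 = Monday ... 7 = Sunday) and HH is the hour of the day.
    """
    if not 0 <= short_hour <= 23 or not 0 <= long_hour <= 23:
        raise ValueError("hours must be in the range 0-23")
    if not 1 <= long_weekday <= 7:
        raise ValueError("weekday must be 1 (Monday) to 7 (Sunday)")
    schedule = f"(S/../.././{short_hour:02d}|L/../../{long_weekday}/{long_hour:02d})"
    # -a: monitor all SMART attributes; -m: mail this address on trouble
    return f"{device} -a -s {schedule} -m {email}"


# Hypothetical device and recipient, for illustration only.
print(smartd_directive("/dev/sda", "root"))
```

With the defaults this prints `/dev/sda -a -s (S/../.././02|L/../../6/03) -m root`, i.e. a short test every day at 2am and a long test on Saturdays at 3am.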
Old bathtubs lasted longer than old hard drives. Now it's the other way around.
Sorry, but gray text on gray background is making my eyes bleed.
I run some ZFS systems at work. With the current version of the filesystem, you can expand the zpools but you can't shrink them, so adding a bad drive causes immediate problems.
I've found that some drives are completely functional but write at extremely slow rates: maybe 10% of normal. With typical consumer drives, maybe 1 in 20 is like this. To ensure I don't put a slow drive into a production zpool, I always make a small test zpool consisting of just the new batch of drives and stress-test them.
This catches not only obviously bad drives, but also the slow or otherwise odd ones.
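A rough sketch of the "spot the slow one" part, in Python. This is not the test-zpool procedure itself, just a stand-in that writes a file on each new drive's mount point and flags anything far below the batch median. The function names, the 64 MiB default and the 0.5 cutoff are all assumptions of mine.

```python
import os
import statistics
import time


def write_throughput(path, total_bytes=64 * 1024 * 1024, block_size=1024 * 1024):
    """Write total_bytes to path, fsync, and return the rate in MiB/s."""
    # Random data so filesystem compression cannot flatter the result.
    block = os.urandom(block_size)
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(block)
            written += block_size
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually left the page cache
    elapsed = time.perf_counter() - start
    return written / (1024 * 1024) / elapsed


def flag_slow(throughputs, ratio=0.5):
    """Return the names whose throughput is below ratio * batch median."""
    median = statistics.median(throughputs.values())
    return sorted(name for name, rate in throughputs.items() if rate < ratio * median)


# Hypothetical figures in MiB/s for a batch of four drives.
print(flag_slow({"disk0": 150.0, "disk1": 145.0, "disk2": 15.0, "disk3": 152.0}))
```

In practice you'd call `write_throughput` once per drive with a much larger `total_bytes`, collect the results in a dict, and hand that to `flag_slow`. Comparing against the median of the batch rather than a fixed number means you don't need to know what "normal" is for that model.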
Sounds like a really old troll.
Only the State obtains its revenue by coercion. - Murray Rothbard
Not really. People usually don't modify gigantic footprints of data per day, so standard incremental backup strategies are still very applicable. Most of the large data tends to be read-only over time, typically media, archives, large installation files, etc.
Let me guess... if it sank to the bottom it was a good drive, but if it floated it was a bad drive and needed to be burnt at the stake.
Trying to coax an error will never reveal one. Only when you start using it "for real" will the problem manifest.
We do here at work. We need some modest 120+ TB of storage right now, and 30% of that content is highly dynamic (PostgreSQL databases). Anything but data center quality HDD would be silly, not to mention unreliable as hell and heavily expensive. SSDs are just for laptops or so, not for real data storage requirements.
cpghost at Cordula's Web.
I run smartctl and capture the registers, then run badblocks, and compare smartctl's output to the pre-bad-blocks check.
If there are any remapped blocks, the drive goes back, as the factory should have remapped the initial defects already, and that means new failed blocks in the first few hours of operation.
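For the curious, the before/after comparison can be scripted. A minimal sketch in Python, assuming the usual ATA attribute table that `smartctl -A` prints (ten columns ending in RAW_VALUE); the function names and the list of watched attributes are my own choices, and attribute names vary by vendor, so treat it as a starting point.

```python
import subprocess

# Attributes whose raw value should not grow during a burn-in (assumed list).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(text):
    """Parse the `smartctl -A` attribute table into {name: raw_value}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def read_attributes(device):
    """Run smartctl on a device (needs root) and parse its attribute table."""
    # No check=True: smartctl's exit status is a bitmask, not a plain pass/fail.
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return parse_attributes(result.stdout)


def regressions(before, after, watched=WATCHED):
    """Return {name: (before, after)} for watched attributes that increased."""
    return {
        name: (before.get(name, 0), after[name])
        for name in watched
        if after.get(name, 0) > before.get(name, 0)
    }
```

Usage would be: `before = read_attributes("/dev/sdX")`, run badblocks, `after = read_attributes("/dev/sdX")`, and send the drive back if `regressions(before, after)` is non-empty.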
Please help metamoderate.
Rebuild time. It takes our hardware raids about 24 hours to rebuild, and software raids about 72 hours. If the disk failure isn't detected immediately, even with RAID-6 you are pushing your luck.
RAID is not backup.
Depending on your definition of reliable and long term, people still use tapes.
I thoroughly test any new hdd I get for my desktop PC:
The first thing I do is format it and install windows. If that works, then we know the drive isn't DOA
From there I torture test it by copying several hundred gigabytes of software and movies, as well as installing some more programs.
After that, I let it run for a few months, using it normally. If it crashes during that time, then I know it was bad.
Actually, the only use we currently have for SSDs is as ZILs (ZFS intent logs), and we're evaluating whether to put PostgreSQL transaction logs on an SSD, but that's another story. Our main storage farm is still HDD-based.
cpghost at Cordula's Web.
Testing is simple - plug it in, and run it till it fails. Might as well use it in the mean-time.
At two companies I managed IP libraries (massive amounts of photographs and drawings used in catalogs and advertisements). The data changes only slowly, and (depending on usage) seasonally, so incremental backups are very much practical. But that's not really the issue.
This is important. Raid protects you from certain kinds of failures, usually limited to the mechanical or electrical failure of a single hard drive. (More protection can be had by nesting raid levels, but for most installations this is the case.) Raid does not protect you from a wide variety of failures including data corruption from a bad controller or application bug, systemic failure of the raid appliance (example: a catastrophic power supply failure taking out multiple drives) operator-induced data loss, either accidental or malicious, or environmental catastrophe. If your data is important, there is still no substitute for backing up your data and sending it to a remote site. Even geosynch won't necessarily help if you're synching bad data to the only remote copy. And, I'm not yet convinced that syncing to "the cloud" is a good idea.
Mind you, backups don't have to be to tape. I'm a photographer when I'm not a geek, and I typically keep tens of thousands of photographs online on my workstation. As backup to tape, DVD or even blu-ray isn't really practical, I back up to a series of hard drives using one of those plug-in hard drive toasters, then carefully store them elsewhere, disconnected from the computer. Disaster recovery is a set of drives in a safe at a friend's house.
There are examples where backups aren't necessary. I worked with one array that was essentially a huge cache for 1-800 calls, and a complete wipe would only mean that customers would see a delay on the next call as their particular part of the cache was rebuilt. But for the most part, depending on raid instead of a properly implemented backup solution is a really bad idea.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Unless you are using SLC, which is getting harder to find and more expensive every day, you are really pushing your luck. The problem is the hot/crazy scale when it comes to these drives, specifically the fact that nobody has figured out how to lick the controller issue. For those who haven't run into it yet (lucky bastards): the controller issue causes a drive to fail suddenly without ANY warning. SSDs are always bragged on as "failing safe" into a read-only mode, but what actually happens when the controller fails is that the whole drive is completely dead; it won't even show up in BIOS/UEFI.
So until somebody figures out how to lick the controller problem (and when they do, the money they make will truly be insane), or comes up with the idea I have been advocating for years of putting a second, cheaper ARM controller on the board, designed to take over as a read-only backup while you get your data out, I'd be seriously leery of trusting any data I cared about to an SSD, not without spinning rust backups at the very least. The controller bug seems to bite every OEM on the ass; I have seen it from Intel to OCZ and it's always the same. Push the button and poof! Data all gone with the drive. And of course, since you can't get your data off or even wipe it, you have to hope they don't send it to some third world country for refurb where they help themselves to your data. Because of this I don't think my customers have even used 10% of their warranties, for fear of the data falling into the wrong hands. Great for the OEMs, which rarely have to make good on warranties; not so good for the customer.
ACs don't waste your time replying, your posts are never seen by me.
I've been dealing with hardware failures for 20+ years. What I've learned is that disasters WILL happen, regardless of what preventive measures are in place. So I shifted my focus toward recoverability. To me, the important question is "When something catastrophic happens, how quickly and easily can I put things back in working order?"
Since I use RAID where appropriate, and more importantly, I am positively fanatic about frequent, full, and tested backups, the only concern I have when a hard drive dies is whether I'm still entitled to a warranty replacement.
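On the "tested" part: one cheap way to test a file-level backup is to hash every source file and compare it with the copy. A minimal sketch in Python, assuming the backup is a plain directory tree mirroring the source; the function names are mine, and this says nothing about how any particular backup tool lays out its data.

```python
import hashlib
import os


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_backup(source_root, backup_root):
    """Return relative paths that are missing from the backup or differ in content."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(backup_root, rel)
            if not os.path.isfile(dst) or file_digest(src) != file_digest(dst):
                problems.append(rel)
    return sorted(problems)
```

An empty list from `verify_backup` only means the copies match right now; it doesn't replace actually doing a restore once in a while.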
Then you, sir, are either the luckiest bastard on the planet or haven't bought any Seagate drives above 500GB, because I've seen so many Seagate 1TB and above drives dead OOTB or very soon after leaving the box that I won't even touch them anymore.
There is a reason this guy is asking this question: it's because we are now down to just 2 makers of drives and the Seagates are Russian roulette with your data. Most likely he has seen that the new Seagates are selling for as low as $50 a TB online and wants more space, but can see all the horror stories in the feedback and wants some way to help mitigate the risk.
But I'm sorry, friend, the only way I've found to mitigate the risk is to avoid Seagate like an STD, even with WD drives often double the price of the Seagate. While the WDs seem to have about a 1 in 15 failure rate, with the Seagates, depending on the size (1TB-2TB the worst, 3TB better but not great), you are looking at as bad as a 1 in 3 chance of failure. With failures THAT high, which frankly I hadn't seen since the big Maxtor mess of 2002, I would just avoid Seagate for anything I gave a shit about, as it's just not worth the risk.
ACs don't waste your time replying, your posts are never seen by me.
On Black Friday I bought a 1 TB drive at Office Depot, and of course they waved the box over their anti-theft degausser. I asked for a different drive and told them that they shouldn't do that with drives. The girl gave me the look we all have seen, but the boy behind her actually agreed with me and they gave me a drive out of the cage and let me leave the store with the alarm blaring. I've just about filled it up already and it's been working fine.
Jack of all trades, master of some.
Holy crap. Twenty 3TB spindles in a single array? What do you do to de-stress? Run between cars on a highway?
Stress testing hard disks is a particular bugbear of mine, after having some really bad luck with early hard disks. Over the 15 years that I've been doing it I've had to send back loads of hard disks and flash cards because they failed my tests, either breaking completely or returning single bit errors in your data. Mostly the manufacturers will take disks back if you can get their stupid Windows program to return an error code. Sometimes it takes a bit of arguing but ultimately the manufacturers want to keep you happy. Flash disks with single bit errors are the hardest to send back in my experience.
Here is the latest generation of my stress testing code (re-written in Go recently): https://github.com/ncw/stressdisk
(Interestingly the stressdisk program sometimes finds bad ram in your computer too!)
I generally thrash every new hard disk or memory card for 24 hours to see if I can break it before trusting any data to it!
I also run a long SMART test.
Somewhat paranoid, yes, but I really, really hate losing data!
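To give an idea of the general write-then-verify approach, here is a minimal Python sketch. It is NOT how stressdisk itself is implemented, just the same basic idea: fill a file with reproducible pseudo-random blocks, read them back, and report any block that differs. All names and sizes are my own assumptions.

```python
import os
import random


def _blocks(seed, count, block_size):
    """Yield `count` reproducible pseudo-random blocks for a given seed."""
    rng = random.Random(seed)
    for _ in range(count):
        yield rng.getrandbits(8 * block_size).to_bytes(block_size, "little")


def write_pattern(path, seed, count, block_size=1024 * 1024):
    """Write the pseudo-random pattern to path and force it out to the device."""
    with open(path, "wb") as f:
        for block in _blocks(seed, count, block_size):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, seed, count, block_size=1024 * 1024):
    """Re-generate the pattern and return the indexes of blocks that differ."""
    bad = []
    with open(path, "rb") as f:
        for index, expected in enumerate(_blocks(seed, count, block_size)):
            if f.read(block_size) != expected:
                bad.append(index)
    return bad
```

One caveat: a read straight after a write may be served from the OS page cache rather than the disk, so a real test needs to write more data than you have RAM (or bypass the cache) and loop for hours, which this sketch doesn't attempt.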
Every man for himself, all in favour say "I"
Please. Quoting Jeff Atwood as an authoritative source on SSDs?
Some anecdotal evidence and a subsequent admission of buying from the brand known for the highest failure rate in SSDs isn't going to convince anyone.
I'd like to see some proper statistics before I believe anything you say.
The most reliable statistics I've seen show SSDs performing as well as or better than HDDs when it comes to failure rates. I haven't seen any statistics on what percentage of failing drives did so spontaneously, completely, without warning and without any possibility for repair.
Mind you, I'm not claiming they don't. Just that I haven't seen any evidence beyond some anecdotes. And well, anybody that trusts a single drive with important data is an idiot or ignorant anyway.
Exactly this.
I know a (very large) Data Center belonging to a (very large) company which started replacing their HDDs with SSDs. The price difference isn't even that large; price-per-GB for a server-grade 15K RPM SAS was negligibly close to SSD price. And the advantages are really there: (much) lower heat produced, less noise, less space taken, less energy consumed. Even with a similar failure rate, the advantages are there.
...gis sdrawkcab (usually not responding to ACs; don't bother posting as AC)
--If I were you, I would look into the following:
o Test all drives before putting them into production - either with a SMART long test, or Linux 'badblocks'
o Cooling - is it adequate?
o A powerful enough power supply, plus a UPS (essential these days)
o Mount all drives with "noatime" option in Linux, or in XP and later:
' fsutil behavior set disablelastaccess 1 ' and reboot
o Spin down all HDs when not in use.
--I do all of the above, and my drives last for years and years. Just sayin'
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??