Ask Slashdot: How Do You Test Storage Media?
First time accepted submitter g7a writes "I've been given the task of testing new hardware for use in our servers. For memory, I can run it through things such as memtest for a few days to ascertain if there are any issues with the new memory. However, I've hit a bit of a brick wall when it comes to testing hard disks; there seems to be no definitive method for doing so. Aside from the obvious S.M.A.R.T. tests (i.e. long offline), are there any systems out there for testing hard disks to a similar level to that of memtest? Or any tried and tested methods for testing storage media?"
http://www.grc.com/sr/spinrite.htm
K Man
I've hit a bit of a brick wall when it comes to testing hard disks
Have you tried throwing them against the brick wall?
Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.
In previous jobs, I've used the system of:
Full Format, Verify, Erase, then a Drive fitness test.
If there are errors in the media, the format, verify or erase will pick them up; the fitness test then checks the hardware.
Hitachi has a Drive Fitness test program
I have also used hddllf (hddguru.com)
don't use consumer drives if you're concerned.
see also http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/archive/disk_failures.pdf
The Goog wrote a nice paper on hard drives.
Jet Stress does a good job of running the storage media through a lot of work.
All I usually do is:
1. smartctl -AH
Get an initial baseline report.
2. mke2fs -c -c
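Perform a read/write test on the drive (giving -c twice selects the slower read-write bad-block scan; note that mke2fs overwrites whatever is on the drive).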
3. smartctl -AH
Get a final report to compare to the initial report.
If the drive remains healthy, and error counters aren't incrementing between the smartctl reports, it's good to go.
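The before/after comparison in step 3 can be automated. Below is a minimal Python sketch, assuming both reports were saved as text from `smartctl -A` and that the attribute table has its usual ten columns (attribute name second, raw value last). The function names and the short list of watched counters are my own choices for illustration, not anything smartctl provides.

```python
# Counters whose growth usually matters most; this list is an assumption,
# adjust it to the attributes your drives actually report.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_raw_values(report):
    """Map attribute name -> integer raw value from `smartctl -A` style text."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return values


def incremented_counters(before, after, watched=WATCHED):
    """Return {name: (old, new)} for watched counters that went up."""
    old, new = parse_raw_values(before), parse_raw_values(after)
    return {
        name: (old[name], new[name])
        for name in watched
        if name in old and name in new and new[name] > old[name]
    }
```

An empty result means none of the watched counters moved during the test; anything else is a reason to look at the drive more closely.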
Can You Say Linux? I Knew That You Could.
root ~/bin # cat scandisk
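#!/bin/bash
# Non-destructive read-write scan of a hard disk. Usage: scandisk sdX
[ -n "$1" ] || { echo "usage: $0 <device name, e.g. sdb>" >&2; exit 1; }
argg="/dev/$1"
# If IDE (old kernels): enable 32-bit I/O, DMA and interrupt unmasking
hdparm -c1 -d1 -u1 "$argg"
# Speed up I/O by raising the read-ahead - also good for USB disks
blockdev --setra 16384 "$argg"
blockdev --getra "$argg"
# -c sets how many blocks are tested at a time; alternatives kept for reference
#time badblocks -f -c 20480 -n -s -v "$argg"
#time badblocks -f -c 16384 -n -s -v "$argg"
time badblocks -f -c 10240 -n -s -v "$argg"
exit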
---------
Note that this reads the existing content on the drive, writes a randomized pattern, reads it back, and then writes the original content back. With modern high-capacity drives (over 500GB), plan on leaving this running overnight. You can do this from pretty much any Linux live CD, AFAIK. If you are running your own distro, you can monitor the disk I/O with ' iostat -k 5 '.
From ' man badblocks '
-n Use non-destructive read-write mode. By default only a non-destructive read-only test is done. This option must not be combined with the -w option, as they are mutually exclusive.
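For anyone curious what the non-destructive mode boils down to, here is a rough Python sketch of the same read, overwrite, read back, restore cycle. It is an illustration only, not how badblocks is implemented: it works one block at a time on an ordinary scratch file, the function name is made up, and the read-back may be served from the operating system's cache rather than the media.

```python
import os


def nondestructive_scan(path, block_size=4096):
    """Return indices of blocks whose test pattern did not read back intact.

    Each block is saved, overwritten with random data, read back and compared,
    then restored to its original content.
    """
    bad_blocks = []
    with open(path, "r+b") as f:
        index = 0
        while True:
            offset = index * block_size
            f.seek(offset)
            original = f.read(block_size)
            if not original:
                break  # end of file
            pattern = os.urandom(len(original))
            try:
                f.seek(offset)
                f.write(pattern)
                f.flush()
                os.fsync(f.fileno())
                f.seek(offset)
                if f.read(len(pattern)) != pattern:
                    bad_blocks.append(index)
            finally:
                # Always put the original data back, even if the check failed.
                f.seek(offset)
                f.write(original)
                f.flush()
            index += 1
    return bad_blocks
```

The `finally` clause is the important design choice: whatever happens during the check, the original block is written back before moving on.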
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
Hard Disk Sentinel (http://www.hdsentinel.com/) is a great tool, and they even have a free Linux client. What it adds over plain SMART is that it takes the SMART data and weights the attributes according to how strongly they indicate failure, then gives you a score of 0-100 (100 being great, 0 being dead) for how healthy the drive is. We use this extensively and have created Nagios scripts that monitor the output. If a drive has a score of 65 or higher, I will generally continue using it (pretty much all my setups are RAID 10 or RAID 6). If the score starts dropping rapidly (a few points every day, even if it started high) or gets below 65 or so, I go ahead and replace it. It has helped out a bunch.
Even with that, using the SMART data in a smart way still only predicts about 30% of failures. The other 70% will come out of nowhere. That is why it is best to assume that any drive is suspect and can die at any time, and never to allow a single drive to be the sole copy of anything.
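The replacement rule described above (replace below roughly 65, or when the score keeps dropping a few points a day) is easy to script for a monitoring check. A minimal Python sketch follows. The reading of "a few points every day" as at least 2 points per day for 3 days running is my assumption, and the function and its parameters are hypothetical rather than part of Hard Disk Sentinel or Nagios.

```python
def should_replace(daily_scores, floor=65, drop_per_day=2, streak_days=3):
    """Decide whether a drive should be replaced.

    daily_scores: health scores (0-100), one per day, oldest first.
    floor, drop_per_day and streak_days are assumed thresholds.
    """
    if not daily_scores:
        return False
    # Rule 1: the latest score has fallen below the floor.
    if daily_scores[-1] < floor:
        return True
    # Rule 2: the score fell by at least drop_per_day on each of the
    # last streak_days days, regardless of how high it started.
    recent = daily_scores[-(streak_days + 1):]
    if len(recent) < streak_days + 1:
        return False
    drops = [earlier - later for earlier, later in zip(recent, recent[1:])]
    return all(drop >= drop_per_day for drop in drops)
```

For example, `should_replace([95, 92, 89, 86])` returns True even though 86 is well above the floor, because the score fell 3 points on each of the last three days.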
S.M.A.R.T. is a joke, but not in implementation. It's a joke because most HDD failures occur on the logic board. It's a known fix in data recovery services to simply swap out the PCB for another of the same vintage make/model/firmware rev. That said, I have run tools such as HD Tune to view out-of-spec metrics and benchmarks. For example, I once had a user who reported that her workstation was running extremely slowly. I suspected the drive was at fault and the graphs proved it, but technically it wasn't a failure. S.M.A.R.T. would have flagged it if it were mechanical, but not if it were a controller issue. That may have changed with newer drives, but that's been my overall experience.
Life is not for the lazy.
I manage a team that oversees petabytes of disk, both in enterprise arrays and internal to servers. For the arrays, since there are gigabytes of cache in front of the disks, I can only rely on the vendor to do the appropriate post-installation testing to make sure there are no DOA disks. For internal disks, as others have mentioned, you could run IOMeter for days without a problem and then the very next day the disk is dead. Unlike memory, disks have moving parts that can fail much more easily than chips. However, with proper precautions like RAID, data loss from a single disk failure can be avoided.
The bigger problem is a double disk failure, and the risk comes from the time required to rebuild the failed disk. Back when disks were 100GB this was a "relatively" quick process. However, in some of my arrays with 3TB drives, the rebuild can take much longer. It has even reached the point where hot spares have been considered not worth having, since my array vendor will have a new disk in the array within 4 hours. Given what an enterprise disk costs from the array vendor (not Fry's), spares start to add up.
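To put rough numbers on the rebuild window, here is a back-of-the-envelope Python calculation. The 50 MB/s sustained rebuild rate is purely an assumption for illustration; real arrays rebuild faster or slower depending on load and vendor.

```python
GB = 10**9
TB = 10**12

# Assumed sustained rebuild rate; not a figure from any vendor.
ASSUMED_REBUILD_BYTES_PER_SEC = 50 * 10**6


def rebuild_hours(capacity_bytes, bytes_per_sec=ASSUMED_REBUILD_BYTES_PER_SEC):
    """Hours needed to rewrite a whole drive at a constant rate."""
    return capacity_bytes / bytes_per_sec / 3600


print(f"100GB drive: {rebuild_hours(100 * GB):.1f} hours")  # 0.6 hours
print(f"3TB drive:   {rebuild_hours(3 * TB):.1f} hours")    # 16.7 hours
```

At the same rate the window grows linearly with capacity, so a 3TB drive leaves the array exposed 30 times longer than a 100GB one.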
Not completely related to how to test, but...
In 2007 Google reported that, for a sample of 100k drives, only 60% of their drives with failures had ever encountered any SMART errors. Also, NetApp has reported a significant number of drives with temporary failures, such that they can be placed back into a pool after being taken offline for a period of time and wiped. Google also had a lot of other interesting things to say, such as that heat has no noticeable effect on hard drive life under 45C, that load is unrelated to failure rates, and that if a drive doesn't fail in the first 3 months, it's very unlikely to fail until the 2-3 year timeframe.
You can find the google paper here: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
A few other notes that you can find from storage vendor tech notes if you own their arrays:
* Enterprise-level SAS drives aren't any more reliable than consumer SATA drives
- But they do have considerably different firmwares that assume they will be placed in an array, and thus have a completely different self-healing scheme than consumer-level drives (generally resulting in higher performance in failure scenarios)
* RAID 5 is a really bad idea - correlated failures are much more likely than the math would indicate, especially with the rebuild times involved with today's huge drives
* You have many more filesystem options, such as ZFS, that might not even make sense to use on top of a RAID system, as well as other mechanisms for distributing your data at a layer higher than the filesystem
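To make "the math" in the RAID 5 point above concrete, here is the naive calculation in Python: the chance that at least one of the surviving drives fails during the rebuild window, assuming failures are independent and the annual failure rate is constant. Both assumptions are exactly what the point above disputes, so treat the result as an optimistic lower bound. The function name and parameters are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def second_failure_probability(remaining_drives, rebuild_hours, annual_failure_rate):
    """Probability that any surviving drive fails during the rebuild.

    Assumes independent failures at a constant rate, which real arrays
    (same batch, same age, same workload) tend to violate.
    """
    # Probability that a single drive survives one hour.
    hourly_survival = (1.0 - annual_failure_rate) ** (1.0 / HOURS_PER_YEAR)
    # All remaining drives must survive every hour of the rebuild.
    all_survive = hourly_survival ** (remaining_drives * rebuild_hours)
    return 1.0 - all_survive
```

The probability grows with both the number of surviving drives and the length of the rebuild, which is why bigger drives make the naive figure worse even before correlation is taken into account.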
Ultimately the reality is that regardless of the testing you put them under, hard drives will fail, and you need to design your production system around this fact. You *should* burn them in with constant read/write cycles for a couple days in order to identify those drives which are essentially DOA, but you shouldn't assume any drive that passes that process won't die tomorrow.
http://lime-technology.com/forum/index.php?topic=2817.0 ... the main features of the script are:
1. gets a SMART report
2. pre-reads the entire disk
3. writes zeros to the entire disk
4. sets the special signature recognized by unRAID
5. verifies the signature
6. post-reads the entire disk
7. optionally repeats the process for additional cycles (if you specified the "-c NN" option, where NN = a number from 1 to 20, default is to run 1 cycle)
8. gets a final SMART report
9. compares the SMART reports alerting you of differences.
Check it out. Its "original" purpose was to set the drive to all "0's" for easy insertion into a parity array (read: parity drive does not need to be updated if the new drive is all zeros) but it has also shown great utility as a stress test / burn-in tool to detect infant mortality and "force the issue" as far as satisfying the criteria needed for an RMA (read: sufficient reallocated block count)
If your skill level is enough to adapt the script to your own environment then great, otherwise UnRaid Basic is free and allows 3 drives in the array, which should allow you to pre-clear three drives simultaneously. You might even be able to pre-clear more than that (up to the available hardware slots), since you aren't technically dealing with the array at that point but with enumerated hardware that the script has access to, which should be everything on the disc. Hardware requirements are minimal and it runs from flash.
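For those who would rather adapt the idea than the script, the core of the pre-clear cycle listed above (pre-read, write zeros, post-read) can be sketched in Python. This is not the unRAID script: it skips the SMART reports and the unRAID signature, the function names are made up, and it is written against an ordinary scratch file, so treat it as an outline only.

```python
import os

CHUNK = 1024 * 1024  # 1 MiB per I/O; an arbitrary choice


def read_all(path):
    """Pre-read: touch every byte once and return the total size."""
    total = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK)
            if not data:
                return total
            total += len(data)


def zero_fill(path, size):
    """Overwrite the first `size` bytes with zeros and flush to disk."""
    with open(path, "r+b") as f:
        remaining = size
        while remaining:
            n = min(CHUNK, remaining)
            f.write(bytes(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def first_nonzero_offset(path):
    """Post-read: return the offset of the first non-zero byte, or None."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK)
            if not data:
                return None
            if data.strip(b"\x00"):
                return offset + next(i for i, b in enumerate(data) if b)
            offset += len(data)


def preclear_cycle(path, cycles=1):
    """Run pre-read, zero, post-read; return a bad offset or None."""
    for _ in range(cycles):
        size = read_all(path)
        zero_fill(path, size)
        bad = first_nonzero_offset(path)
        if bad is not None:
            return bad
    return None
```

A return value of None means every cycle read back as all zeros. Keep in mind that a read straight after a write may come from the page cache, so a real burn-in would want the cache bypassed or dropped between the write and the post-read.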
If you can't be good, be good at it!
Most everything above is good, but don't overlook the obvious. Spin the drive up in a quiet room and listen to it. If it sounds different from all the other drives like it, there's a good chance something is wrong.
I replaced the drive in my TiVo. The 1st replacement was so much louder, I swapped the original back, then put the new drive in a test rig. It started getting bad sectors in a few days. RMA'd it to Seagate, and the new one was much quieter.
All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
badblocks -c 10240 -s -w -t random -v /dev/sda1
That's my standard test for all HDDs (note that -w is a destructive write test and wipes the target).
#
#\ @ ? Colonize Mars
#
Speaking as somebody who has done hardware qualifications and burn-in development at very large scale for companies you have heard of, let me tell you the tools I use:
fio: The _BEST_ tool for raw drive performance and burn-in testing. A couple of hours of random access will ensure the drive head can dance, then a full block-by-block walk-through with checksum verification will ensure that all blocks are readable and writable. I usually do 2 or 3 passes here. You can tell fio to reject drives that do not perform to a minimum standard, which is very useful for finding functional yet not quite up-to-speed drives. The statistics produced here are awesome as well: something like 70 stats per device per test.
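To illustrate the two ideas in the fio description (a block-by-block walk with checksum verification, plus rejecting anything below a minimum speed), here is a small Python sketch. It is not fio and does not use fio's options. It writes random blocks to a scratch file, remembers a CRC32 per block, then walks the file and checks each one. All names and the threshold parameter are hypothetical, and the write pass destroys whatever the target held.

```python
import os
import time
import zlib


def write_with_checksums(path, blocks, block_size=4096):
    """Fill the target with random blocks; return the CRC32 of each block."""
    checksums = []
    with open(path, "wb") as f:
        for _ in range(blocks):
            data = os.urandom(block_size)
            checksums.append(zlib.crc32(data))
            f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return checksums


def verify_walk(path, checksums, block_size=4096):
    """Return (indices of mismatched blocks, read throughput in MB/s)."""
    bad = []
    start = time.perf_counter()
    with open(path, "rb") as f:
        for index, expected in enumerate(checksums):
            if zlib.crc32(f.read(block_size)) != expected:
                bad.append(index)
    elapsed = time.perf_counter() - start
    megabytes = len(checksums) * block_size / 1e6
    speed = megabytes / elapsed if elapsed > 0 else float("inf")
    return bad, speed


def passes(path, checksums, min_mb_per_s, block_size=4096):
    """Accept only if every checksum matches and the speed floor is met."""
    bad, speed = verify_walk(path, checksums, block_size)
    return not bad and speed >= min_mb_per_s
```

Keeping the checksums in memory is fine for a sketch, but on a whole drive you would want them embedded in the blocks themselves, and the timing here includes any caching the operating system does.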
stressapptest: This is Google's burn-in tool and virtually the only one I have ever found that supports NUMA on modern dual-socket machines. This is IMPORTANT, as it's easy to ignore issues that come up with the link between the CPUs. The various testing modes give you the ability to tear the machine to pieces, which is awesome. Stressapptest is also the most power-hungry test I have ever seen, including the Intel power testing suite that you have to jump through hoops to get.
Pair this with a pass of memtest and you get a really, really nice burn-in system that can brutalize the hardware and give you scriptable systems for detecting failure.
It may not be sophisticated, but MHDD is what I use at work (among a couple of other tools). Other tools are more reliable in different circumstances, but my first stop is always MHDD, because it will give me a comprehensive R/W delay test on a disk. Extremely practical for a workshop, perhaps not practical for a data centre.
Admit it. You post strawman arguments as AC so you get modded Insightful for refuting them, rather than Troll
> I've always used "consumer" grade HD's in my servers...
A lot depends on what your servers are doing. Try putting a consumer drive in a news server some time if you want a good laugh. Or put one in any server that is really busy 24/7, and a cheap drive will crap out fairly fast, since it was never designed for that sort of constant abuse. Meanwhile, good drives can and will hold up for years of that kind of torture. Yes, they will all happily spin 24/7 and thrive if the drive is well cooled, since power cycles are when most drives die, but constant seeking is another kettle of fish.
Democrat delenda est