Ask Slashdot: How Do You Test Storage Media?

Posted by timothy on Tuesday April 3, 2012 @05:29AM from the give-her-some-storage-tarot-cards dept.

First time accepted submitter g7a writes "I've been given the task of testing new hardware for use in our servers. For memory, I can run it through tools such as memtest for a few days to ascertain whether there are any issues with the new memory. However, I've hit a bit of a brick wall when it comes to testing hard disks; there seems to be no definitive method for doing so. Aside from the obvious S.M.A.R.T. tests (i.e. long offline), are there any systems out there for testing hard disks to a similar level to that of memtest? Or any tried and tested methods for testing storage media?"

SpinRite by alphax45 · 2012-04-03 05:30 · Score: 2, Informative

http://www.grc.com/sr/spinrite.htm

--
K Man
1. Re:SpinRite by Anonymous Coward · 2012-04-03 05:38 · Score: 2, Informative
  
  Does their product actually do anything these days? Seems like the last time I used it was when you had the choice of an ARLL or RLL disk controller, haha...
  Anyway, I always stress test my drives with IOMeter. Leave them burning on lots of random IOPS to stress the head positioners, and don't forget to power cycle them a good number of times. Spin-up is when most drive motors will fail and when the largest in-rush current occurs.
2. Re:SpinRite by alphax45 · 2012-04-03 05:40 · Score: 2, Informative
  
  Still works 100% as HDD tech is still the same - just don't use it on SSDs
  
  --
  K Man
3. Re:SpinRite by SuperTechnoNerd · 2012-04-03 06:22 · Score: 4, Informative
  
  SpinRite is not that meaningful these days, since drives no longer give you the low-level access to the media that they did in the days of old. Since you can't low-level format modern drives, which was one of SpinRite's strong points, save your money and use badblocks. Run badblocks in read/write mode with a random test pattern or a worst-case test pattern a few times. Then do a SMART long self-test. Keep an eye on the pending sector count and the reallocated sector count. A drive can remap a difficult sector without you ever knowing unless you look there. Also keep an eye on drive temperature; even a new drive can act flaky if it gets too hot.
4. Re:SpinRite by linebackn · 2012-04-03 06:49 · Score: 4, Informative
  
  >Still works 100% as HDD tech is still the same
  Not entirely true. Back in the days of MFM/RLL drives, SpinRite could perform a "low level" format on each track. This ensured every last magnetic 1 and 0 was re-written to the disk. Back in the day I witnessed many times when SpinRite would completely recover bad sectors, presumably damaged by electrical/controller issues rather than physical surface issues, and a full pattern test would prove the space was safe to use.
  Modern IDE drives don't allow low-level formatting, and as far as I know, even re-writing the user content of the drive does not re-write sector header data. Modern IDE drives also have hidden reserved space for "spare" tracks and space where they store their firmware, which likewise never gets tested or re-written.
  Additionally, on MFM/RLL drives SpinRite could use low-level formatting to optimize the sector interleave for the specific system. (You would be surprised: moving some disks and their ISA controllers to a faster system would actually require a higher interleave, slowing them down incredibly until SpinRite was run.)
  Still, SpinRite is the only program that I know of that can do a controlled read/write pattern test and modify the underlying file system when needed.
5. Re:SpinRite by washu_k · 2012-04-03 06:52 · Score: 5, Informative
  
  Spinrite may do an OK job of exercising disks, but 90% of what it claims to do is BS.
  
  An easy test to prove that Spinrite is BS is to run it against a USB key. Not a SATA SSD, but a USB flash drive. Make the USB key bootable with DOS, put Spinrite on it and boot a PC with no other drives. Run its "tests" against the USB key. All the "low level" tests Spinrite claims to do will appear to work, but are impossible on a USB device.
  
  In fact, they are impossible on a modern mechanical HD as well. As yacc143 pointed out, modern drives are not the same as MFM/RLL drives of the past. The low level tests that Spinrite claims to do are simply impossible.
  
  It's also a terrible data recovery program, since it can only write recovered data back to the same disk. That's a data recovery 101 no-no, and Spinrite fails.
6. Re:SpinRite by NotBorg · 2012-04-03 08:21 · Score: 4, Informative
  
  >A drive can remap a difficult sector without you ever knowing unless you look there.
  Except when it happens a lot... then your drive is F***.............I.....N...............G slow for no apparent reason.
  We "test" our drives by filling them with whatever data we have lying around. We do this 5 to 10 times (depending on how soon we need the drives). What eventually happens with a bad drive is that the SMART counter ticks over to some magical number and starts reporting health issues (a requirement for some RMA processes). We also time each fill cycle. We expect the first two or three runs to take longer (EVERY drive these days will have reallocations going on for the first few runs). For later runs we expect to see a more consistent fill time and the reallocated sector count to stop climbing so alarmingly fast.
  There are bad sectors on your brand new drive. You can count on it. You have to make the drive find them and map around them because it won't happen in the factory. Write to every byte several times. Don't wait for it to happen naturally... you'll just hit performance problems and put yourself closer to warranty cutoff time. They're counting on you not finding a problem soon enough. You must burn them in or suffer later.
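  The timed fill cycles described above could be scripted roughly like this. It is only a sketch under assumptions: the function names, the two-pass warm-up and the 10% tolerance are made up for illustration, and it is shown against an ordinary file. Pointing it at a raw device would destroy whatever is on it.

```python
import os
import time


def fill_once(path, total_bytes, chunk_bytes=1 << 20):
    """Overwrite `path` with `total_bytes` of random data; return seconds taken."""
    start = time.monotonic()
    written = 0
    with open(path, "wb") as target:
        while written < total_bytes:
            size = min(chunk_bytes, total_bytes - written)
            target.write(os.urandom(size))
            written += size
        target.flush()
        os.fsync(target.fileno())  # push the data out of the page cache
    return time.monotonic() - start


def passes_settled(durations, warmup=2, tolerance=0.10):
    """True if every pass after the warm-up is within `tolerance` of their mean.

    The warm-up count and tolerance are illustrative assumptions,
    not thresholds taken from the comment above.
    """
    later = durations[warmup:]
    if len(later) < 2:
        return False  # not enough post-warm-up passes to judge
    mean = sum(later) / len(later)
    return all(abs(d - mean) <= tolerance * mean for d in later)
```

  Collect one duration per pass with fill_once(), then hand the list to passes_settled(); a drive whose later passes never settle is the one to look at more closely alongside its SMART counters.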
  
  --
  I want this account deleted.
7. Re:SpinRite by alanmeyer · 2012-04-03 08:58 · Score: 3, Informative
  
  >Spinrite may do an OK job of exercising disks, but 90% of what it claims to do is BS.
  
  This is a very uninformed opinion about Spinrite. Spinrite has a large population of testimonials that prove that "it works". Its main purpose is data recovery and data maintenance on magnetic-based rotational media.
  Your example of a USB drive is just another way of saying "flash", for which Spinrite is not targeted to fix.
  Indeed, there are no more "low level" commands like in the day of old HDD technology. However, Spinrite uses the standard ATA command set to do everything possible to get your data off your drive. It does this very well and you'll be hard pressed to find other programs that do it better that don't cost a lot, lot more money (think data recovery repair center).
  
  >It's also a terrible data recovery program, since it can only write recovered data back to the same disk
  
  Spinrite doesn't target this case. Backing up is what you do *after* you use Spinrite to first correct the few sectors that are preventing your system from recognizing the disk in the first place.
  You really need to review the product, what it's targeted to do, and the testimonials before you continue to bad-mouth a product that has been shipping for as long as Spinrite has.
Hard Drive Testing by Anonymous Coward · 2012-04-03 05:36 · Score: 2, Informative

In previous jobs, I've used the system of:
Full Format, Verify, Erase, then a Drive fitness test.
If there are errors in the media, the format, verify or erase will pick them up; the fitness test then checks the hardware.
Hitachi has a Drive Fitness Test program.
I have also used hddllf (hddguru.com).
The usual by macemoneta · 2012-04-03 05:39 · Score: 5, Informative

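All I usually do is (with /dev/sdX standing in for the drive under test):
1. smartctl -AH /dev/sdX
Get an initial baseline report.
2. mke2fs -c -c /dev/sdX
Perform a read/write test on the drive (mke2fs creates a new filesystem, so any existing data is lost).
3. smartctl -AH /dev/sdX
Get a final report to compare to the initial report.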
If the drive remains healthy, and error counters aren't incrementing between the smartctl reports, it's good to go.
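Comparing the two reports by eye gets tedious across a batch of drives, so here is a rough sketch of automating it. It assumes the usual ten-column attribute table that smartctl -A prints (ID#, ATTRIBUTE_NAME, ..., RAW_VALUE); the function names and the list of watched counters are illustrative choices, and attribute names can differ between vendors.

```python
# Counters worth watching; names as smartctl commonly prints them.
# Treat this list as an assumption and adjust it for your drives.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(report):
    """Map attribute name -> integer raw value from `smartctl -A` style text."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attributes[fields[1]] = int(fields[9])
    return attributes


def counters_that_grew(before, after, watched=WATCHED):
    """Return {name: (old, new)} for watched counters that increased."""
    old = parse_smart_attributes(before)
    new = parse_smart_attributes(after)
    return {
        name: (old.get(name, 0), new[name])
        for name in watched
        if name in new and new[name] > old.get(name, 0)
    }
```

Save the output of steps 1 and 3 to two text files, read them in, and pass both strings to counters_that_grew(); an empty result corresponds to "good to go" in the sense described above.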

--
Can You Say Linux? I Knew That You Could.
Re:S.M.A.R.T. by DigiShaman · 2012-04-03 05:49 · Score: 4, Informative

S.M.A.R.T. is a joke, but not in implementation. It's a joke because most HDD failures occur on the logic board. It's a known fix in data recovery services to simply swap out the PCB for another of the same vintage make/model/firmware rev. That said, I have run tools such as HD Tune to view out-of-spec metrics and benchmarks. For example, I once had a user who reported that her workstation was running extremely slowly. I suspected the drive was at fault and the graphs proved it, but technically it wasn't a failure. S.M.A.R.T. would have flagged it if it had been mechanical, but it wouldn't have if it was a controller issue. Now that may have changed with newer drives, but that's been my overall experience.

--
Life is not for the lazy.
Reliability and fault-tolerance by Mondragon · 2012-04-03 05:53 · Score: 5, Informative

Not completely related to how to test, but...
In 2007 Google reported that for a sample of 100k drives, only 60% of their drives with failures had ever encountered any SMART errors. Also, NetApp has reported a significant number of drives with temporary failures, such that they can be placed back into a pool after being taken offline for a period of time and wiped. Google also had a lot of other interesting things to say (such as that heat has no noticeable effect on hard drive life under 45C, that load is unrelated to failure rates, and that if a drive doesn't fail after 3 months, it's very unlikely to fail until the 2-3 year timeframe).
You can find the Google paper here: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
A few other notes that you can find from storage vendor tech notes if you own their arrays:
* Enterprise-level SAS drives aren't any more reliable than consumer SATA drives
- But they do have considerably different firmwares that assume they will be placed in an array, and thus have a completely different self-healing scheme than consumer-level drives (generally resulting in higher performance in failure scenarios)
* RAID 5 is a really bad idea - correlated failures are much more likely than the math would indicate, especially with the rebuild times involved with today's huge drives
* You have a lot more filesystem options that might not even make sense to use with a RAID system, like ZFS, as well as other mechanisms for distributing your data at a layer higher than the filesystem
Ultimately the reality is that regardless of the testing you put them under, hard drives will fail, and you need to design your production system around this fact. You *should* burn them in with constant read/write cycles for a couple days in order to identify those drives which are essentially DOA, but you shouldn't assume any drive that passes that process won't die tomorrow.
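For reference, the independence-based "math" that the RAID 5 note above calls too optimistic can be sketched as follows. This is an illustration only: the function name, the exponential failure model and any numbers you feed it are assumptions, not figures from the Google paper or from any vendor.

```python
import math

HOURS_PER_YEAR = 8760


def second_failure_probability(remaining_drives, annual_failure_rate, rebuild_hours):
    """Chance that at least one surviving drive fails during a rebuild.

    Assumes drives fail independently at a constant rate, which is exactly
    the assumption that correlated failures break in practice.
    """
    # Convert the annual failure rate into a constant hourly hazard rate.
    hourly_rate = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    # Probability that one given drive fails within the rebuild window.
    p_one = 1.0 - math.exp(-hourly_rate * rebuild_hours)
    # Probability that at least one of the remaining drives fails.
    return 1.0 - (1.0 - p_one) ** remaining_drives
```

Whatever this returns for your array size and rebuild time is a lower bound at best: drives from the same batch, in the same chassis, under the same rebuild load do not fail independently.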
Re:Why? by jeffmeden · 2012-04-03 05:55 · Score: 2, Informative

Hard drives, amazingly, are tested pretty effectively before leaving the factory. During tests in a controlled environment it was demonstrated that hard drives show no "failure curve" at onset, but follow a very boring, linear progression throughout their lifespan. The result: if you don't screw up when you install it you have little to worry about on day 1 that is different from day 1000, which is the cold reality that all mechanical devices will fail.
Cue the "but I have seen so many DOA drives from XYZcorp..." and to that I will pre-retort with this: if you buy a quality drive (i.e. not a refurb or one specifically designed as a consumer throwaway) from a vendor that takes some care in shipping and handling, then no you did not stumble on "the conspiracy of XYZcorp's bad drives". The weakest link was you. Try wearing a static strap next time.
UnRAID Preclear Script by Jumperalex · 2012-04-03 06:02 · Score: 3, Informative

http://lime-technology.com/forum/index.php?topic=2817.0 ... the main steps of the script are:
1. gets a SMART report
2. pre-reads the entire disk
3. writes zeros to the entire disk
4. sets the special signature recognized by unRAID
5. verifies the signature
6. post-reads the entire disk
7. optionally repeats the process for additional cycles (if you specified the "-c NN" option, where NN = a number from 1 to 20, default is to run 1 cycle)
8. gets a final SMART report
9. compares the SMART reports alerting you of differences.
Check it out. Its "original" purpose was to set the drive to all "0's" for easy insertion into a parity array (read: parity drive does not need to be updated if the new drive is all zeros) but it has also shown great utility as a stress test / burn-in tool to detect infant mortality and "force the issue" as far as satisfying the criteria needed for an RMA (read: sufficient reallocated block count)
If your skill level is enough to adapt the script to your own environment then great; otherwise unRAID Basic is free and allows 3 drives in the array, which should allow you to pre-clear three drives simultaneously. You might even be able to pre-clear more than that (up to the available hardware slots), since you aren't technically dealing with the array at that point but with the enumerated hardware that the script has access to, which should be everything on the disc. Hardware requirements are minimal and it runs from flash.

--
If you can't be good, be good at it!
Ears by Maximum+Prophet · 2012-04-03 06:37 · Score: 3, Informative

Most everything above is good, but don't overlook the obvious. Spin the drive up in a quiet room and listen to it. If it sounds different from all the other drives like it, there's a good chance something is wrong.

I replaced the drive in my TiVo. The 1st replacement was so much louder, I swapped the original back, then put the new drive in a test rig. It started getting bad sectors in a few days. RMA'd it to Seagate, and the new one was much quieter.

--
All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
Re:Why? by tlhIngan · 2012-04-03 07:44 · Score: 5, Informative

>Also: HARDWARE RAID CARDS.
>I can't stress that enough. software and semi-software raid is a joke.
Not until the hardware fails and you need the data that was on there but not on the backup (or realized the backup failed a long time ago...).
For performance, yes, hardware is fastest. For reliability though, software RAID is better (hardware RAID can have interesting firmware version issues).
Linux running an md RAID array? If the server goes down, pop the drives in another server, a couple of mdadm commands later and the array is up and running. Hell, even Windows' software RAID ought to be able to work to recover an array where the server hardware died.
So if you're using RAID not for performance reasons, but for protection against hard drive failure, soft-RAID works very well. Hell, one of my NAS appliances died, and all I did was take the drives out, attach 4 USB adapters to them, and plug them into my Linux box. Instant access to the data.
There's nothing like the panic that happens when an array goes down due to non-drive hardware failure.
My suggestions by Liquid-Gecka · 2012-04-03 07:53 · Score: 4, Informative

Speaking as somebody who has done hardware qualifications and burn-in development at very large scale for companies you have heard of, let me tell you the tools I use:
fio: The _BEST_ tool for raw drive performance and burn-in testing. A couple of hours of random access will ensure the drive head can dance, then a full block-by-block walk-through with checksum verification will ensure that all blocks are readable and writable. I usually do 2 or 3 passes here. You can tell fio to reject drives that do not perform to a minimum standard. Very useful for finding functional yet not quite up to speed drives. The statistics produced here are awesome as well. Something like 70 stats per device per test.
stressapptest: This is Google's burn-in tool and virtually the only one I have ever found that supports NUMA on modern dual socket machines. This is IMPORTANT as it's easy to ignore issues that come up with the link between the CPUs. The various testing modes give you the ability to tear the machine to pieces, which is awesome. Stressapptest is also the most power hungry test I have ever seen, including the Intel power testing suite that you have to jump through hoops to get.
Pair this with a pass of memtest and you get a really, really nice burn-in system that can brutalize the hardware and give you scriptable systems for detecting failure.
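To show what "random access, then a block-by-block walk with checksum verification" means in practice, here is a toy version of the idea. It is not how fio is implemented; the function names, block size and use of SHA-256 are all assumptions for the sketch. It also reads through the page cache, so on a small file it mostly exercises RAM, whereas a real burn-in tool would use direct I/O against the device.

```python
import hashlib
import os
import random


def write_blocks(path, block_count, block_size=4096, seed=0):
    """Write random blocks in shuffled order; return one SHA-256 digest per block."""
    order = list(range(block_count))
    random.Random(seed).shuffle(order)  # random-access write pattern
    digests = [None] * block_count
    with open(path, "wb") as target:
        target.truncate(block_count * block_size)  # size the file up front
        for index in order:
            data = os.urandom(block_size)
            target.seek(index * block_size)
            target.write(data)
            digests[index] = hashlib.sha256(data).digest()
        target.flush()
        os.fsync(target.fileno())
    return digests


def verify_blocks(path, digests, block_size=4096):
    """Walk the blocks in order; return indexes whose contents no longer match."""
    bad = []
    with open(path, "rb") as source:
        for index, expected in enumerate(digests):
            data = source.read(block_size)
            if hashlib.sha256(data).digest() != expected:
                bad.append(index)
    return bad
```

An empty list from verify_blocks() means every block read back exactly as written; any index in the list is a block that needs a closer look.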
Re:Why? by CAIMLAS · 2012-04-03 09:42 · Score: 4, Informative

To a degree, you can rule with certainty that everything is working.
New equipment does tend to have ghosts. Given enough systems, with homogeneous roles, it doesn't matter: if it starts to fail, you pull it and put another one in.
If you've got an environment with only a few servers with dedicated roles, having a new 'production server' go tits up is a very bad thing. For a system like this, you really do want to do a 'burn in' period, IMO for at least a couple weeks, where the system is not being depended upon. Your 4-year-old system doing the same thing at relatively diminished capability is not nearly as bad as doing a cut-over and having things go south, then.
You do, however, want to do a "burn in" on that new equipment. My preference is to stress a new piece of equipment with something like building kernels (which will stress every significant subsystem to some degree) while doing file operations (e.g. something like bonnie++ if you're not copying files to the machine) for a period of at least a week without any stability or significant performance problems. This is due to the following subjective observations:
* Getting a system with a defective disk is not unheard of these days, but it's not common, so it's not a serious concern.
* Short of initial failure of the disk/DOA status, the disks will likely run a number of months before your first failure (depending on how many you've got, of course)
* Instability, inconsistent behavior, flaky RAM, or odd behavior from RAID or NIC controllers, and 'ghosts' can almost invariably be traced back to the PDU or PSU. These seem to die within about two weeks to a month if they're defective/poorly designed. With a server, troubleshooting this can be a huge bitch due to how loud they are and the multiple-dependence issue on the PDU. This is kind of an end game for me, and I have a hard time trusting any of the equipment after I've had a PSU fail.
* If you plan on taxing the system at all, you'll probably have a driver-related performance problem somewhere down the line. Better to find it before you need the performance.
* Every once in a while, you've got a bad solid state device (RAM, CPU, SSD). These seem to either work or not work: if they pass the initial "does it work?" check, they tend to keep working.

--
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers