Ask Slashdot: How Do You Test Storage Media?

Posted by timothy on Tuesday April 3, 2012 @05:29AM from the give-her-some-storage-tarot-cards dept.

First time accepted submitter g7a writes "I've been given the task of testing new hardware for use in our servers. For memory, I can run it through things such as memtest for a few days to ascertain whether there are any issues with the new memory. However, I've hit a bit of a brick wall when it comes to testing hard disks; there seems to be no definitive method for doing so. Aside from the obvious S.M.A.R.T. tests (i.e. long offline), are there any systems out there for testing hard disks to a similar level to that of memtest? Or any tried and tested methods for testing storage media?"


Why? by headhot · 2012-04-03 05:35 · Score: 5, Insightful

Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.
1. Re:Why? by Shagg · 2012-04-03 05:40 · Score: 4, Insightful
  
  No, the point is to design your system so that if it fails 2 weeks down the line... it isn't a problem.
  
  --
  Unix is user friendly, it's just selective about who its friends are.
2. Re:Why? by gregmac · 2012-04-03 05:44 · Score: 5, Insightful
  
  Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.
  And then what you should test is that it actually notifies you when something does fail, so you know about it and can fix it. You can also test how long it takes to rebuild the array after replacing a disk, and how much performance degradation there is while that is happening.
  
  --
  Speak before you think
3. Re:Why? by Joce640k · 2012-04-03 05:56 · Score: 4, Insightful
  
  Point is: You can't 'test'.
  You can only tell if it's working, not when it's about to fail.
  If people could predict when hard drives were going to fail we wouldn't need RAID or backups.
  
  --
  No sig today...
4. Re:Why? by jeffmeden · 2012-04-03 06:14 · Score: 2, Insightful
  
  A plastic strap won't save you from the drive head failing to move. I've seen this happen when a bunch of unemployed temp workers unload the truck. This is why it seems "batches" of similar drives fail if you are getting them from the same source... some asshole was throwing and kicking the boxes around.
  If your static strap is made of (all) plastic, then you will have issues beyond shipping and handling woes...
5. Re:Why? by windcask · 2012-04-03 06:19 · Score: 3, Insightful
  
  Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.
  All RAID 5 does is move the single point of failure from the disk itself to the RAID controller, which could also fail at any time. This is why a truly effective solution is virtual machine redundancy with seamless failover and a rigorous backup schedule.
Hard Disk Sentinel by prestonmichaelh · 2012-04-03 05:46 · Score: 3, Insightful

Hard Disk Sentinel: http://www.hdsentinel.com/ is a great tool. They even have a free Linux client. What it adds over plain SMART is that it takes the SMART data and weights each value according to how strongly it indicates failure, then gives you a score of 0-100 (100 being great, 0 being dead) for how healthy the drive is. We use this extensively and have created Nagios scripts that monitor the output. Generally, if a drive has a score of 65 or higher, I will continue using it (pretty much all my setups are RAID 10 or RAID 6). If the score starts dropping rapidly (a few points every day, even if it started high) or gets below 65 or so, I go ahead and replace it. It has helped out a bunch.

Even with that, using the SMART data, in a SMART way, still only predicts about 30% of failures. The other 70% will come out of nowhere. That is why it is best to assume that every drive is suspect and can die at any time, and never allow a single drive to hold the sole copy of anything.
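
The replacement rule described above (replace once the score is below about 65, or when it keeps dropping a few points a day) could be sketched roughly as follows. This is only an illustration: `should_replace` is a hypothetical helper, not part of Hard Disk Sentinel or Nagios, and reading "a few points every day" as at least 2 points a day for 3 days in a row is an assumption.

```python
def should_replace(scores, min_score=65, daily_drop=2, days=3):
    """Decide whether to replace a drive from its daily health scores.

    scores: health scores (0-100), one per day, oldest first.
    min_score: the threshold of about 65 mentioned in the post.
    daily_drop, days: assumed meaning of "a few points every day".
    Returns (replace, reason).
    """
    if not scores:
        raise ValueError("need at least one health score")

    # Rule 1: the latest score has fallen below the threshold.
    if scores[-1] < min_score:
        return True, f"score {scores[-1]} is below {min_score}"

    # Rule 2: the score fell steadily on each of the last `days` days,
    # even if it started high.
    recent = scores[-(days + 1):]
    drops = [earlier - later for earlier, later in zip(recent, recent[1:])]
    if len(drops) == days and all(d >= daily_drop for d in drops):
        return True, f"score fell at least {daily_drop} points a day for {days} days"

    return False, "score is stable"


if __name__ == "__main__":
    for history in ([100, 100, 100], [98, 95, 92, 89], [70, 64]):
        print(history, should_replace(history))
```

A monitoring script would still need its own way of collecting the daily scores; that part is left out here because it depends on how the tool's output is captured.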
Are you testing an array or individual drives? by HockeyPuck · 2012-04-03 05:51 · Score: 4, Insightful

I manage a team that oversees petabytes of disk, both within an enterprise array and internal to the server. For testing the arrays, since there are gigabytes of cache in front of the disks, I can only rely on the vendor to do the appropriate post-installation testing to make sure there are no DOA disks. For internal disks, as others have mentioned, you could run IOMeter for days without a problem and then the very next day the disk is dead. Unlike memory, disks have moving parts that can fail much more easily than chips. However, with proper precautions like RAID, data loss from a single disk failure can be avoided.
The bigger problem is having a double disk failure. The risk comes from the amount of time required to rebuild the failed disk. Back when disks were 100GB this was a "relatively" quick process. However, in some of my arrays with 3TB drives in them, it can take much longer to rebuild a replaced drive. It has even reached the point where hot spares have been considered not worth it, as my array vendor will have a new disk in the array within 4 hours. With what an enterprise disk costs from the array vendor (not Fry's), it can start to add up.
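
To put rough numbers on the rebuild window, a lower bound is simply capacity divided by sustained rebuild throughput. The sketch below is only a back-of-the-envelope estimate: the 100 MB/s figure is an assumption, not something stated in the post, and a rebuild on an array that is also serving I/O will usually be slower.

```python
def rebuild_hours(capacity_gb, throughput_mb_s=100):
    """Best-case hours to rewrite one full disk.

    capacity_gb: disk size in decimal gigabytes.
    throughput_mb_s: assumed sustained rebuild rate in MB/s.
    """
    seconds = capacity_gb * 1000 / throughput_mb_s
    return seconds / 3600


if __name__ == "__main__":
    for capacity_gb in (100, 3000):
        print(f"{capacity_gb} GB: about {rebuild_hours(capacity_gb):.1f} h")
```

With these assumed numbers, a 100GB disk rebuilds in well under an hour while a 3TB disk needs more than eight hours, which is the window in which a second failure would hurt.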
Re:scsi by Galactic+Dominator · 2012-04-03 06:03 · Score: 3, Insightful

Perhaps it was an honest mistake, but the link is broken. Second, evidence has shown SATA drives are more reliable than commercial/enterprise-grade drives. Only buy those if you don't like your money, or if there is some clear advantage. That supposed advantage is not reliability, unless there is some sort of rapid replacement mechanism coming with the drive. Although replacement isn't reliability in my book.
http://lwn.net/Articles/237924/

--
brandelf -t FreeBSD /brain