Slashdot Mirror


Ask Slashdot: How Do You Test Storage Media?

First time accepted submitter g7a writes "I've been given the task of testing new hardware for use in our servers. For memory, I can run it through things such as memtest for a few days to ascertain if there are any issues with the new memory. However, I've hit a bit of a brick wall when it comes to testing hard disks; there seems to be no definitive method for doing so. Aside from the obvious S.M.A.R.T. tests (i.e. long offline), are there any systems out there for testing hard disks to a similar level to that of memtest? Or any tried and tested methods for testing storage media?"

297 comments

  1. SpinRite by alphax45 · · Score: 2, Informative
    --
    K Man
    1. Re:SpinRite by Anonymous Coward · · Score: 2, Informative

      Does their product actually do anything these days? Seems like the last time I used it was when you had the choice of an ARLL or RLL disk controller, haha...

      Anyway, I always stress test my drives with IOMeter. Leave them burning on lots of random IOPS to stress test the head positioners, and don't forget to power cycle them a good number of times. Spin-up is when most drive motors will fail and when the largest in-rush current occurs.

    2. Re:SpinRite by NIN1385 · · Score: 1

      SpinRite is a good one, have used.

      --

      If carrots got you drunk, rabbits would be fucked up. - Comedian Mitch Hedberg R.I.P. 03/30/68-2/24/05
    3. Re:SpinRite by alphax45 · · Score: 2, Informative

      Still works 100% as HDD tech is still the same - just don't use on SSDs

      --
      K Man
    4. Re:SpinRite by Anonymous Coward · · Score: 0

      http://www.grc.com/sr/spinrite.htm

      Isn't that a data recovery program? I recommend GWScan, based on WD Data Lifeguard Tools; it supports all drives.

    5. Re:SpinRite by cpu6502 · · Score: 1

      What about motor failure? My last drive became inaccessible when the motor stopped spinning (6 months continuous spin, followed by power failure, followed by no spin).

      --
      My AC stalker: " I personally agree with your posts most of the time, but that won't keep me from modding you troll"
    6. Re:SpinRite by rickb928 · · Score: 1

      SpinRite is excellent for testing. If your drives run as hot as the old Hitachi drives did, it doubles as a space heater or makeshift stove.

      Seriously, SpinRite will exercise a drive very well indeed. And it will tell you more than the manufacturer wanted you to know.

      --
      deleting the extra space after periods so i can stay relevant, yeah.
    7. Re:SpinRite by intellitech · · Score: 1

      SpinRite is good. I've also used Hitachi's DFT (extensively) and the PC-Check Suite (used while working for Geek Squad), which has a really nice stress testing routine for hard disks.

      --
      vos nescitis quicquam, nec cogitatis quia expedit nobis ut unus moriatur homo pro populo et non tota gens pereat.
    8. Re:SpinRite by 0p7imu5_P2im3 · · Score: 2

      While SpinRite does a good job of recovering data temporarily on bad drives, its intended use is to exercise the drive's SMART controller so that it will check the drive for problems more often, thus moving data from bad sectors before they fail completely. This has the fortunate side effect of reporting whether a drive is past its stable use lifetime as well as other basic statistics regarding normal drive use.

      --
      Resistance is futile. Your technological distinctiveness will be added to our own. You will become one with the morgue
    9. Re:SpinRite by NEDHead · · Score: 5, Funny

      Did you restore power?

    10. Re:SpinRite by CanHasDIY · · Score: 1

      Anything coming from GRC is, IMO, awesome by default.

      --
      An enigma, wrapped in a riddle, shrouded in bacon and cheese
    11. Re:SpinRite by SuperTechnoNerd · · Score: 4, Informative

      Spinrite is not that meaningful these days since drives don't give you that low level access to the media like in the days of old. Since you can't low level format drives, which was one of Spinrite's strong points, save money and use badblocks. Use badblocks in read/write mode with a random test pattern or worst case test pattern a few times. Then do a SMART long self test. Keep an eye on the pending sector count and the reallocated sector count. A drive can remap a difficult sector without you ever knowing unless you look there. Also keep an eye on drive temperature, even a new drive can act flaky if it gets too hot.
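
      The write-then-verify idea behind a read/write pattern test can be sketched roughly as below. This is only an illustration, not how badblocks itself is implemented: the function names, the 4096-byte block size and the choice of patterns are assumptions, and because it goes through the OS page cache the read-back may not actually touch the platters. Writing to a path destroys whatever is stored there.

```python
import os

BLOCK_SIZE = 4096  # assumed block size for this sketch


def pattern_test(path, num_blocks, pattern_byte, block_size=BLOCK_SIZE):
    """Fill num_blocks blocks of `path` with one byte pattern, read them
    back, and return the indexes of blocks that did not match."""
    expected = bytes([pattern_byte]) * block_size
    bad = []
    with open(path, "r+b") as f:
        # Write pass: overwrite every block with the pattern.
        for _ in range(num_blocks):
            f.write(expected)
        f.flush()
        os.fsync(f.fileno())
        # Verify pass: re-read every block and compare.
        f.seek(0)
        for i in range(num_blocks):
            if f.read(block_size) != expected:
                bad.append(i)
    return bad


def run_patterns(path, num_blocks, patterns=(0xAA, 0x55, 0xFF, 0x00)):
    """Run the test once per pattern; maps pattern -> list of bad blocks."""
    return {p: pattern_test(path, num_blocks, p) for p in patterns}
```

      On a healthy target every list comes back empty. The SMART pending and reallocated counts still need checking afterwards, since a remapped sector will pass this test.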

    12. Re:SpinRite by yacc143 · · Score: 2

      LOL, RLL hard disks had capacities that are, by today's standards, located somewhere between the CPU cache size and the RAM size of the average smartphone.
      (My first PC had a 20MB HDD)

      Put simply, a modern HDD is about as similar to an RLL HDD as a Cadillac is to a trike.

    13. Re:SpinRite by linebackn · · Score: 4, Informative

      >Still works 100% as HDD tech is still the same

      Not entirely true. Back in the days of MFM/RLL drives, SpinRite could perform a "low level" format on each track. This ensured every last magnetic 1 and 0 was re-written to the disk. Back in the day I witnessed many times when SpinRite would completely recover bad sectors, presumably damaged by electrical/controller issues rather than physical surface issues, and a full pattern test would prove the space was safe to use.

      Modern IDE drives don't allow low-level formatting, and as far as I know, even re-writing the user content of the drive does not re-write sector header data. Modern IDE drives also have hidden reserved space for "spare" tracks and space where they store their firmware, which likewise never gets tested or re-written.

      Additionally, on MFM/RLL drives SpinRite could use low-level formatting to optimize the sector interleave for the specific system. (You would be surprised, moving some disks and their ISA controllers to a faster system would actually require a higher interleave, slowing them down incredibly until SpinRite was run)

      Still, SpinRite is the only program that I know of that can do a controlled read/write pattern test and modify the underlying file system when needed.

    14. Re:SpinRite by washu_k · · Score: 5, Informative

      Spinrite may do an OK job of exercising disks, but 90% of what it claims to do is BS.

      An easy test to prove that Spinrite is BS is run it against a USB key. Not a SATA SSD, but a USB flash drive. Make the USB key bootable with DOS, put Spinrite on and boot a PC with no other drives. Run its "tests" against the USB key. All the "low level" tests Spinrite claims to do will appear to work, but are impossible on a USB device.

      In fact, they are impossible on a modern mechanical HD as well. As yacc143 pointed out, modern drives are not the same as MFM/RLL drives of the past. The low level tests that Spinrite claims to do are simply impossible.

      It's also a terrible data recovery program, since it can only write recovered data back to the same disk. That's a data recovery 101 no-no, and Spinrite fails.

    15. Re:SpinRite by Anonymous Coward · · Score: 0

      This is similar to the problem of low false positives in clinical testing for rare diseases.
      Let's say the HDD failure rate is 0.5% a year (pick any number).
      Let's say the real cost to your employer is $5 per test.
      That is, by my math, $1,000 of testing per failure ($5 / 0.005).
      Sounds a lot more cost effective to spend the extra thou on another drive, but YMMV.

    16. Re:SpinRite by Anonymous Coward · · Score: 0

      Then you must think bullshit is awesome, because that's most of what Gibson spews forth.

      Back in the day, he was quite notorious for this; see grcsucks.com for a collection of explanations of why exactly he and his company suck.

      This is not to say spinrite doesn't work, as well as e.g. mhdd, just that it's sold to ignorant consumers with a massive pile of impossible and nonsensical claims about what it does.

    17. Re:SpinRite by ericloewe · · Score: 2, Funny

      Sounds like your average conversation with a tech support guy.

    18. Re:SpinRite by jones_supa · · Score: 1

      Modern IDE drives don't allow low-level formatting, and as far as I know, even re-writing the user content of the drive does not re-write sector header data.

      Enhanced Secure Erase should get a bit further than the user content. When initiating that, the drive internally uses the more detailed information it knows about itself to perform a deeper erase (at least in theory). For this kind of operation I recommend Parted Magic (works for nuking SSDs too).

    19. Re:SpinRite by rrohbeck · · Score: 2

      That may have been stiction. It was a big issue for some time a couple of years ago. Media was textured in the landing zone so the heads wouldn't stick on the super smooth data surface but the head retraction mechanisms weren't perfect so the head did sometimes land in the data zone when power failed. Chances are it gets stuck there.
      These days everybody uses ramp loading and the head isn't allowed to touch the disk ever.
      Power cycling the system (after a full backup) to check if it comes back is still good advice.

    20. Re:SpinRite by NotBorg · · Score: 4, Informative

      A drive can remap a difficult sector without you ever knowing unless you look there.

      Except when it happens a lot... then your drive is F***.............I.....N...............G slow for no apparent reason.

      We "test" our drives by filling them with whatever data we have lying around. We do this 5 to 10 times (depending on how soon we need the drives). What eventually happens with a bad drive is that the SMART counter ticks over to some magical number and starts reporting health issues (a requirement for some RMA processes). We also time each fill cycle. We expect the first two or three runs to take longer (EVERY drive these days will have relocations going on for the first few runs). For later runs we expect to see a more consistent fill time and the relocated sector count to stop climbing so alarmingly fast.

      There are bad sectors on your brand new drive. You can count on it. You have to make the drive find them and map around them because it won't happen in the factory. Write to every byte several times. Don't wait for it to happen naturally... you'll just hit performance problems and put yourself closer to warranty cutoff time. They're counting on you not finding a problem soon enough. You must burn them in or suffer later.
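
      A rough sketch of the timed fill cycle described above, under some assumptions: the function name, the 1 MiB chunk size and the use of random filler are illustrative choices, and `path` is taken to be a scratch file on the drive under test (its contents are overwritten on every pass).

```python
import os
import time


def timed_fill(path, total_bytes, passes=3, chunk_size=1 << 20):
    """Overwrite `path` with total_bytes of filler `passes` times and
    return the wall-clock duration of each pass in seconds."""
    durations = []
    for _ in range(passes):
        # Generate the filler before starting the clock so that only
        # the write itself is timed.
        chunk = os.urandom(chunk_size)
        start = time.monotonic()
        with open(path, "wb") as f:
            remaining = total_bytes
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(chunk[:n])
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # push the data out of the OS cache
        durations.append(time.monotonic() - start)
    return durations
```

      Comparing the later durations against the first few gives the "fill time settles down" signal; the reallocated sector count has to be read separately from SMART.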

      --
      I want this account deleted.
    21. Re:SpinRite by mistapotta · · Score: 1

      Still works 100% as HDD tech is still the same - just don't use on SSDs

      Actually, Level 1 on SpinRite is fine for SSDs, as it's read only. That does what it needs to: verify the data is accessible and let the drive recognize if the "sector" is bad. http://www.grc.com/sn/sn-338.txt

    22. Re:SpinRite by Anonymous Coward · · Score: 0

      Odd, I treat the non-zero reallocated sector or pending sector statistics as "dead drive, replace it". I run badblocks in read-write mode for a few days to weeks straight, and even cheap 2TB "green" drives (which I expect to fail) usually complete without error and without reporting any remapped sectors. And this is both out of the box and after being run 24x7 for a year or two.

      For you to see it on many or most drives makes me wonder if you have something else wrong with your environment, such as extreme RF interference, vibration and shock, or dirty power that is affecting device reliability...
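
      The "non-zero reallocated or pending count means replace it" rule could be checked along these lines. This is a sketch that only parses text shaped like the attribute table printed by `smartctl -A`; it does not run smartctl itself. The attribute names are the ones commonly seen on ATA drives and may differ by vendor, and the sample table is made up for illustration.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

# Made-up sample in the usual `smartctl -A` table layout (not real output).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   031   045   000    Old_age   Always       -       31 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for rows of the attribute table."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                raw[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value not a plain integer; skip it
    return raw


def should_replace(text):
    """Apply the commenter's rule: any watched counter above zero."""
    attrs = parse_smart_attributes(text)
    return any(attrs.get(name, 0) > 0 for name in WATCHED)
```

      With the sample above, `should_replace(SAMPLE)` flags the drive because of the eight pending sectors.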

    23. Re:SpinRite by alanmeyer · · Score: 3, Informative

      Spinrite may do an OK job of exercising disks, but 90% of what it claims to do is BS.

      This is a very uninformed opinion about Spinrite. Spinrite has a large population of testimonials that prove that "it works". Its main purpose is data recovery and data maintenance on magnetic-based rotational media.

      Your example of a USB drive is just another way of saying "flash", for which Spinrite is not targeted to fix.

      Indeed, there are no more "low level" commands like in the day of old HDD technology. However, Spinrite uses the standard ATA command set to do everything possible to get your data off your drive. It does this very well and you'll be hard pressed to find other programs that do it better that don't cost a lot, lot more money (think data recovery repair center).

      It's also a terrible data recovery program, since it can only write recovered data back to the same disk

      Spinrite doesn't target this case. Backing up is what you do *after* you use Spinrite to first correct the few sectors that are preventing your system from recognizing the disk in the first place.

      You really need to review the product, what it's targeted to do, and the testimonials before you continue to bad mouth a product that has been shipping for as long as Spinrite has.

    24. Re:SpinRite by simplypeachy · · Score: 1

      Good god I thought I was alone until I got to your comment. Thank you for saving me.

    25. Re:SpinRite by mixmasta · · Score: 2

      You misunderstand ... Spinrite exercises the drive and the drive heals itself. Steve's pretty clear about that.

      --
      #6495ED - cornflower blue
    26. Re:SpinRite by SuperTechnoNerd · · Score: 1

      "SMART counter ticks over to some magical number and starts reporting health issues" I have seen drives in rather bad shape - dropping sectors left and right, high remapped sector count, seek retries - and SMART says they are healthy :)

    27. Re:SpinRite by Anonymous Coward · · Score: 0

      Why would you use real data to test? Badblocks and similar with random and set patterns is much better.

      If you use real data and the drive does fail then you may end up sending a drive full of your data to some Chinese warranty scrapyard. You think that is a good idea?

    28. Re:SpinRite by Michael+Spencer+Jr. · · Score: 1

      Yes you can ask Spinrite to do something to a flash drive that is useless.

      Are you arguing that same operation, done to a hard disk with spinning platters, is also useless? You seem to be saying "SpinRite is BS" and pointing to the silly flash drive use case, which doesn't apply.

    29. Re:SpinRite by washu_k · · Score: 1

      Your example of a USB drive is just another way of saying "flash", for which Spinrite is not targeted to fix.

      Indeed, there are no more "low level" commands like in the day of old HDD technology. However, Spinrite uses the standard ATA command set to do everything possible to get your data off your drive. It does this very well and you'll be hard pressed to find other programs that do it better that don't cost a lot, lot more money (think data recovery repair center).

      The fact that USB is flash isn't the point, the point is a USB key is not an ATA device. A USB mechanical hard drive would work just the same. A USB device booted in this way does not support ATA, only BIOS INT 13h calls. Same as if you used it against a SCSI drive in DOS. Spinrite is lying about even using ATA commands.

    30. Re:SpinRite by NotBorg · · Score: 1

      Drives often have a set number of sectors that are reserved for the remapping of bad sectors to good ones. You have to use some percentage (up to the manufacturer) of those reserved sectors before SMART says there's a problem. Hence the "magic number." If you have a drive dropping sectors, check your backups and just keep writing data to the drive. Keep writing to the drive until SMART fails. It can take an aggravating amount of time to do but it's worth it if you have any warranty time left.

      If you know the drive is going, don't give up on making SMART cry for you. Get SMART to fail and the manufacturer will most likely take it back. It's important that you get them to take the drive back because they get away with shipping crap way too often.

      --
      I want this account deleted.
    31. Re:SpinRite by Anonymous Coward · · Score: 0

      Who said anything about using valuable data? Just take a file that ships with your OS (e.g. sample mpeg file). Copy the file to two other files A and B. Append A to B. Now append B to A.... repeat a few times and you'll get some sizable files that you can use to fill your drive with.

      If you're really worried about the Chinese snooping your data, use a clip of Tiananmen square.
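
      The append-A-to-B, append-B-to-A trick can be sketched as follows. The function and parameter names are made up for illustration; the seed can be any file.

```python
import shutil


def append_file(src, dst):
    """Append the full contents of src to the end of dst."""
    with open(src, "rb") as s, open(dst, "ab") as d:
        shutil.copyfileobj(s, d)


def grow_filler(seed, a, b, rounds):
    """Copy seed to a and b, then alternately append a to b and b to a.
    Each round grows both files, so a few dozen rounds is plenty."""
    shutil.copyfile(seed, a)
    shutil.copyfile(seed, b)
    for _ in range(rounds):
        append_file(a, b)
        append_file(b, a)
```

      The sizes grow like Fibonacci numbers rather than doubling, which is still fast enough to reach drive-filling sizes quickly.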

    32. Re:SpinRite by soramimicake · · Score: 2

      There are bad sectors on your brand new drive. You can count on it. You have to make the drive find them and map around them because it won't happen in the factory.

      In the MFM/RLL days, SCSI disks were tested in the factory and came with a list of known bad C/H/S locations, and also kept a list of bad sectors developed afterwards. I forget whether the controller board had to skip those sectors during LBA translation or the OS had to avoid using them.

      When IDE drives came out, the 'factory list' suddenly disappeared, and all drives seemingly came with 0 bad sectors out of the factory, but it was understood that the list was just hidden. They also introduced reserved sectors used to replace bad sectors developed afterwards, so the user/OS can always see/use the same capacity as long as the reserved area is not used up.

      I believe this is still the case (test in the factory and hiding the list) as 2 new drives of the same model / batch can perform differently when tested, and sometimes there are consistent speed dips in the performance graph where you can tell something is going on.

      That said, drives nowadays are more reliable, and I've not encountered a drive that develops bad sectors during the initial fill with random data, which I always do when I buy a new drive. I would not trust any brand-new drive that does, and old drives that develop bad sectors I'll not use for anything important, even though the drive can reallocate them and might still run for years onwards.

    33. Re:SpinRite by jeffasselin · · Score: 1

      Spinrite has a large population of testimonials that prove that "it works".

      so do homeopathic recipes.

      --
      If he explores all forms and substances Straight homeward to their symbol-essences; He shall not die.
    34. Re:SpinRite by alanmeyer · · Score: 2

      Spinrite is lying about even using ATA commands

      You seem to have it in for Spinrite, but it's not clear why. If you listen to Steve's podcast (Security Now), you'll know that he is very careful on how he describes the technical aspects of his products (including Spinrite). I'd be very surprised if you or anyone could point to any of GRC's literature on Spinrite that would prove he's "lying" about anything.

      I'm not sure where Spinrite claims anything other than using Int13 commands. That's how it gets its minimalist compatibility with a DOS system using the system BIOS. For a SATA drive connected directly to the motherboard through a standard SATA host adapter, those calls end up as regular ATA commands.

      A USB connected SATA HDD simply gets help from the BIOS so that it can use Int13, converting to the USB commands, and then over the USB bridge, eventually going back to a standard ATA command (the only commands a SATA drive knows).

      My point about the ATA command is that Spinrite is only using standard commands; not undocumented commands or anything secret like that. However, what is "special" are the sequence of commands used to help the drive recover sectors that get a read error.

    35. Re:SpinRite by washu_k · · Score: 2

      You seem to have it in for Spinrite, but it's not clear why. If you listen to Steve's podcast (Security Now), you'll know that he is very careful on how he describes the technical aspects of his products (including Spinrite). I'd be very surprised if you or anyone could point to any of GRC's literature on Spinrite that would prove he's "lying" about anything.

      http://www.grc.com/spinrite.htm "and ALL OTHER file systems". Tell me, how well does Spinrite support UFS? EXT4? ZFS? Given that the ZFS driver code alone is several times the size of Spinrite that's not really possible. And filesystem support is important given Spinrite's braindead data recovery. If there is no knowledge of the underlying filesystem then Spinrite has no way of knowing if it is overwriting data, filesystem structure or empty space. Even if it was lucky and got empty space, there is no way for it to update the filesystem so you can recover the data.

      How about this beauty from http://www.grc.com/srphysics.htm: "SpinRite is actually able to lower the amplification of the drive's internal read-amplifier". I don't think I even need to explain why that is BS. Tell me, which ATA or BIOS commands can do that?

      In fact, that whole page is BS. Take a look at https://groups.google.com/group/comp.dcom.xdsl/msg/9aeee32323c2978e?dmode=source&hl=en&pli=1 That explains it better than I can.

      My point about the ATA command is that Spinrite is only using standard commands; not undocumented commands or anything secret like that. However, what is "special" are the sequence of commands used to help the drive recover sectors that get a read error.

      Ok, using what you just said, explain the "Dynastat Data Recovery" in Spinrite. To refresh your memory, that is where it claims to be working down to the bit level. You cannot address individual bits or even bytes on a drive, either with BIOS or direct ATA commands. And before you say something stupid about "averaging" or other mathematical BS, a modern drive can only return one of two things for a sector request. The correct data when the ecc matches, or an error.

      You obviously have never really read what Spinrite claims to do. Look at that "physics" link. Anyone with even passing knowledge of basic science and how computers work can figure out that it is BS.

    36. Re:SpinRite by rev0lt · · Score: 1

      In the MFM/RLL days, usually all disks came with a list of known bad locations, often handwritten on a table label on the top of the drive. There was no true skipping involved (a low-level BIOS format would wipe all the bad sector markings). Modern drives have "intelligent" skipping - all the relocation data is handled internally by the controller (including spare sectors for relocation), and the CHS notion doesn't exist, because the drives have variable density - the inner zones of each platter map to fewer sectors than the outer zones.

    37. Re:SpinRite by Anonymous Coward · · Score: 0

      If you have a read error on a sector it's at least 10 times faster to simply write to the sector so that the drive can remap it. Stop fucking around trying to read the sector. Write to it and simply restore the affected file(s) from backup. It basically does the same thing as Spinrite without spending hours jerking the head around hoping to get a lucky read that the drive has enough confidence in to remap the sector.

      If you just write to the sector in question... the drive doesn't have to hold out... it can just remap the sector having known good data to store at the new location.

      Spinrite will only "save" you something if you happen to not do the sensible thing of backing up your data.

    38. Re:SpinRite by Bengie · · Score: 1

      I've been out of IT for 4.5 years now, but we had brand-new computers coming through where spin-rite did raise them from the dead long enough to grab the data. We tried several name brand drive recovery tools and SR was the only useful one.

      Outside of SSDs, I'm not sure how much has changed in ~5 years that spin-rite is useless. AHCI an issue?

    39. Re:SpinRite by Bengie · · Score: 1

      The first few times I used SpinRite were after I had tried several other options. Several days of trying to recover user data, and SR did it effortlessly. All of the other options fell flat.

      "large population of testimonials that prove that "it works"."

      "so do homeopathic recipes."

      "so does Linux" - "so does Spin rite"

      One is different than the others.

    40. Re:SpinRite by Anonymous Coward · · Score: 0

      You seem to have it in for Spinrite, but it's not clear why. If you listen to Steve's podcast (Security Now), you'll know that he is very careful on how he describes the technical aspects of his products (including Spinrite). I'd be very surprised if you or anyone could point to any of GRC's literature on Spinrite that would prove he's "lying" about anything.

      http://www.grc.com/spinrite.htm "and ALL OTHER file systems". Tell me, how well does Spinrite support UFS? EXT4? ZFS? Given that the ZFS driver code alone is several times the size of Spinrite that's not really possible. And filesystem support is important given Spinrite's braindead data recovery. If there is no knowledge of the underlying filesystem then Spinrite has no way of knowing if it is overwriting data, filesystem structure or empty space. Even if it was lucky and got empty space, there is no way for it to update the filesystem so you can recover the data.

      Erm... it reads sectors and writes them back *in-place*. It is "below" filesystems, so the filesystems don't care -- when they are booted back up, they see things exactly as they had before, but hopefully with no CRC errors.

      How about this beauty from http://www.grc.com/srphysics.htm: "SpinRite is actually able to lower the amplification of the drive's internal read-amplifier". I don't think I even need to explain why that is BS. Tell me, which ATA or BIOS commands can do that?

      Don't need one. All data on magnetic platters degrades over time (you have things like Turbo codes and other signal majicks to "correct the errors" as long as they don't get too large); everything gets fuzzier over time as the field weakens.

      If you read valid data (corrected by ECC), and then write it back, the magnetic signature is stronger, and thus you don't need as much gain on your magnetic heads...

      In good RAID arrays, you do this with a parity scrub operation that reads/writes everything as an ongoing operation because if you let the data sit too long, it will go from ECC-Correctable to Uncorrectable, and the data will have to be rebuilt from multiple drives or lost.

      In fact, that whole page is BS. Take a look at https://groups.google.com/group/comp.dcom.xdsl/msg/9aeee32323c2978e?dmode=source&hl=en&pli=1 That explains it better than I can.

      Meh... people have been attacking Getright forever. A lot of the magic is now useless (originally, you knew and could reliably write low level format codes to the MFM/RLL controllers, and if you wrote a particularly hard sequence of bits 00101010100010101 (note the triple-0 in the middle), you could flush out weak magnetics before they eroded your data (same as "walking 1's" in memtest)).

      Nowadays, IDE keeps a list of bad/remapped sectors and does all this in the background *as long as it's told to* -- if nobody ever reads offset 17543, nobody knows that it is about to go bad (let's say ECC can correct for 1 in 96 bits, and right now there's an error in 1 of 108 bits), and over time, that will degrade to 1-in-96, at which time the data is lost. If SpinRite (or any other parity scrubber) reads the data while it's good, the drive electronics should notice the high error rate and refresh it. Some drives won't write the correct data back to the disk unless told to (it slows the drive down)...

      My point about the ATA command is that Spinrite is only using standard commands; not undocumented commands or anything secret like that. However, what is "special" are the sequence of commands used to help the drive recover sectors that get a read error.

      Ok, using what you just said, explain the "Dynastat Data Recovery" in Spinrite. To refresh your memory, that is where it claims to be working do

    41. Re:SpinRite by washu_k · · Score: 1

      Spinrite has been useless for over 25 years. IDE came out in 1986. Since then, low level access to drives has been impossible. AHCI is not an issue, as Spinrite doesn't actually use the low level commands it claims to. SSDs aren't an issue either, at least as far as what Spinrite can actually do which is very little. Spinrite can't actually tell it is running on an SSD and thinks it is a mechanical drive. Doesn't stop it from spewing BS about magnetics of SSDs!

      Which tools did you try? Were they actually meant to deal with bad sectors or were they just tools for dealing with filesystem/OS corruption?

      Spinrite can deal with bad sectors in that it can continue copying after an error and it can retry many times. However, it is not unique in this. There are other tools that can do this as well, including free and open source ones. Where Spinrite fails in this task is that it only writes the recovered data to the same drive. This can cause other data to get overwritten or just lost as you are writing from one spot on a bad drive to another spot on the same bad drive.

      If you need a tool that can deal with bad sectors look at dd_rescue or Roadkil's Unstoppable Copier
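
      The "keep copying after an error, and write to a different drive" approach could look roughly like this. It is an illustration of the idea only, not how dd_rescue or Unstoppable Copier are implemented; the function name, block size and retry count are assumptions, and `os.pread` limits it to Unix-like systems.

```python
import os


def rescue_copy(src, dst, block_size=4096, retries=2):
    """Copy src to dst block by block. A block that still fails after
    `retries` extra attempts is zero-filled in dst, and its offset is
    recorded, instead of aborting the whole copy."""
    bad_offsets = []
    src_fd = os.open(src, os.O_RDONLY)
    try:
        size = os.lseek(src_fd, 0, os.SEEK_END)
        with open(dst, "wb") as out:
            for offset in range(0, size, block_size):
                want = min(block_size, size - offset)
                data = None
                for _ in range(retries + 1):
                    try:
                        data = os.pread(src_fd, want, offset)
                        break
                    except OSError:
                        continue  # read error: try this block again
                if data is None:
                    bad_offsets.append(offset)
                    data = b""
                # Pad failed or short reads so later offsets stay aligned.
                out.write(data.ljust(want, b"\0"))
    finally:
        os.close(src_fd)
    return bad_offsets
```

      The important design point is that `dst` lives on a different, healthy drive, so nothing is ever written to the failing one.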

    42. Re:SpinRite by lems1 · · Score: 1

      Steve is pretty clear about the USB disk situation. SpinRite is meant for disks with actual heads. The magnetic media and that kind of storage.

      --
      This sig can be distributed under the LGPL license
    43. Re:SpinRite by alanmeyer · · Score: 1

      Nowadays, IDE keeps a list of bad/remapped sectors and does all this in the background *as long as it's told to* -- if nobody ever reads offset 17543, nobody knows that it is about to go bad (let's say ECC can correct for 1 in 96 bits, and right now there's an error in 1 of 108 bits), and over time, that will degrade to 1-in-96, at which time the data is lost. If SpinRite (or any other parity scrubber) reads the data while it's good, the drive electronics should notice the high error rate and refresh it. Some drives won't write the correct data back to the disk unless told to (it slows the drive down)...

      I mostly agree with this. Sectors can degrade over time, so a "refresh" will essentially fix it. The problem is that drives don't necessarily do this (some drives have offline scans, but it's not guaranteed). As you've pointed out, a degrading sector can go unnoticed until someone actually requests to read that sector. And, perhaps this is where most people are frustrated with Spinrite's marketing. Just doing a read and then re-write to a sector isn't magic. The point is that when Spinrite encounters an error that it cannot read, it goes thru a series of steps to try and *enable* the HDD to read the sector. For example, it moves the actuator at varying distances from the target LBA to try and get it to seek settle at a slightly different location on the track. This is important because tracks are typically wider than the read head, so you can have different but acceptable actuator/head locations on the same track/sector.

      My point about the ATA command is that Spinrite is only using standard commands; not undocumented commands or anything secret like that. However, what is "special" are the sequence of commands used to help the drive recover sectors that get a read error.

      Ok, using what you just said, explain the "Dynastat Data Recovery" in Spinrite. To refresh your memory, that is where it claims to be working down to the bit level. You cannot address individual bits or even bytes on a drive, either with BIOS or direct ATA commands. And before you say something stupid about "averaging" or other mathematical BS, a modern drive can only return one of two things for a sector request. The correct data when the ecc matches, or an error.

      I'm not sure what Spinrite is using, but there is a way to do this. It's not widely used, but there's a command called "Read Long", along with a complementary "Write Long":

      INT 13h AH=0Ah: Read Long Sectors From Drive (Reference: http://en.wikipedia.org/wiki/INT_13H#INT_13h_AH.3D0Ah:_Read_Long_Sectors_From_Drive)

      This returns data including bad data along with the ECC. This allows for error bits to be read along with the correction field. You can use this to essentially do offline error correction, but it will also give you the ability to get "parts of a sector".

    44. Re:SpinRite by plover · · Score: 1

      In olden times, I was able to get a "not spinning" drive working by holding the drive between thumb and forefinger right on the axis of the internal disk, applying power, then striking the long edge of the disk. My idea was to rotate the enclosure around the platter using inertia. It actually worked quite well, and I recovered several stuck drives this way.

      But back then, the main problem lurking inside hard disks was faulty lubrication. The oils they used would separate and turn to varnish over time, essentially gluing the shaft in place. And disk drive motors have relatively little torque, so even a modest bit of friction could cause them to seize up. The lubricants used in modern drives generally don't have this same problem. So you might have a faulty motor after all, in which case you are still completely screwed.

      --
      John
  2. got it in one by Quiet_Desperation · · Score: 4, Funny

    I've hit a bit of a brick wall when it comes to testing hard disks

    Have you tried throwing them against the brick wall?

    1. Re:got it in one by K.+S.+Kyosuke · · Score: 1

      I think he means that he found so many bricks between the purchased disks that he can build a wall out of them.

      --
      Ezekiel 23:20
  3. mhdd by Anonymous Coward · · Score: 0

    mhdd will test each sector and time how long it takes to access; you can blacklist weak/slow sectors. About the best I know for disk integrity.

  4. Why? by headhot · · Score: 5, Insightful

    Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.

    1. Re:Why? by Shikaku · · Score: 1

      The point is to know whether it's faulty now at the time of arrival rather than 2 weeks down the line where it becomes a problem.

    2. Re:Why? by Shagg · · Score: 4, Insightful

      No, the point is to design your system so that if it fails 2 weeks down the line... it isn't a problem.

      --
      Unix is user friendly, it's just selective about who its friends are.
    3. Re:Why? by NIN1385 · · Score: 1

      Yeah, best advice for any data storage:

      BACKUP, then back that up... then back that up... then back that up offsite.

      --

      If carrots got you drunk, rabbits would be fucked up. - Comedian Mitch Hedberg R.I.P. 03/30/68-2/24/05
    4. Re:Why? by gregmac · · Score: 5, Insightful

      Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.

      And then what you should test is that it actually notifies you when something does fail, so you know about it and can fix it. You can also test how long it takes to rebuild the array after replacing a disk, and how much performance degradation there is while that is happening.

      --
      Speak before you think
    5. Re:Why? by ColdWetDog · · Score: 1

      Yo Dawg....

      No, no. Won't do that.

      --
      Faster! Faster! Faster would be better!
    6. Re:Why? by na1led · · Score: 2

      I agree. You can't test when a piece of hardware is going to fail. I've purchased many Hard Drives for our servers, sometimes they last years, sometimes they fail after a few weeks. There is no way to tell.

      --
      -- By all means let's be open-minded, but not so open-minded that our brains drop out.
    7. Re:Why? by Anonymous Coward · · Score: 0

      I worked for a 911 emergency response centre. We had redundant raid arrays, one configured as raid 0+1 and the other configured as raid 5. Whatever wear pattern wears the drive out first will fail (and we will have graceful fail over). Then when the other array fails, the first will already be fixed and able to fail gracefully there too.

    8. Re:Why? by ByOhTek · · Score: 1

      I think both are important.

      If you have the time to test now, it will save you the hassle of swapping it out later.

      --
      Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
    9. Re:Why? by NIN1385 · · Score: 1

      Haha, I was thinking the same shit while I was typing that. That and an older rap song about backing that ass up.

      --

      If carrots got you drunk, rabbits would be fucked up. - Comedian Mitch Hedberg R.I.P. 03/30/68-2/24/05
    10. Re:Why? by Anonymous Coward · · Score: 0

      RAID5 is toast. This was true in 2009 and is definitely more true now with 2TB-4TB size disks.

      https://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162

      Use RAID6 with one hotspare. And a proper backup solution. RAID is not backup.
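
      The arithmetic behind that claim can be sketched roughly as below. This is only an illustration: it assumes an unrecoverable read error (URE) rate of one per 1e14 bits read, a figure commonly quoted on consumer drive spec sheets, and treats errors as independent. The function name is made up for the example.

```python
import math

def rebuild_failure_probability(surviving_drives, drive_tb, ure_per_bit=1e-14):
    """Chance of at least one URE while reading every surviving drive in full."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8  # decimal TB -> bits
    # P(at least one error) = 1 - (1 - p)^n, computed stably for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

for size_tb in (1, 2, 4):
    p = rebuild_failure_probability(surviving_drives=3, drive_tb=size_tb)
    print(f"4-drive RAID 5, {size_tb} TB disks: {p:.0%} chance of a URE during rebuild")
```

      Under those assumptions a 4-drive RAID 5 of 2 TB disks has about a 38% chance of hitting a URE during a rebuild, and the odds get worse as the disks grow. Real drives may do better or worse than the spec-sheet number.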

    11. Re:Why? by jeffmeden · · Score: 2, Informative

      Hard drives, amazingly, are tested pretty effectively before leaving the factory. During tests in a controlled environment it was demonstrated that hard drives show no "failure curve" at onset, but follow a very boring, linear progression throughout their lifespan. The result: if you don't screw up when you install it you have little to worry about on day 1 that is different from day 1000, which is the cold reality that all mechanical devices will fail.

      Cue the "but I have seen so many DOA drives from XYZcorp..." and to that I will pre-retort with this: if you buy a quality drive (i.e. not a refurb or one specifically designed as a consumer throwaway) from a vendor that takes some care in shipping and handling, then no you did not stumble on "the conspiracy of XYZcorp's bad drives". The weakest link was you. Try wearing a static strap next time.

    12. Re:Why? by CSMoran · · Score: 1

      Even if your storage passes the test, it could fail the next day.

      Similarly with medical checkups. Why bother, when you can get cancer the next day.

      Sarcasm aside, screening is not meant to guarantee lack of failure, but rather allow you to sort out clearly defective hardware.

      --
      Every end has half a stick.
    13. Re:Why? by Joce640k · · Score: 4, Insightful

      Point is: You can't 'test'.

      You can only tell if it's working, not when it's about to fail.

      If people could predict when hard drives were going to fail we wouldn't need RAID or backups.

      --
      No sig today...
    14. Re:Why? by Sir_Sri · · Score: 1

      That isn't really true. Lots of hard drives have various states of failure, and you might be able to write data to it even if it has SMART errors. There isn't a universal way to tell if a drive is going to permanently die.

      A classic example is a hard drive 'clicking'. The read head is contacting something intermittently, but it may still appear to work. You want to get that data off and onto another drive ASAP. Now if you get a drive out of the box like that, there's no point in even putting it into a machine to need to deal with it later. Unfortunately lots of problems can't be noticed with a visual or audio inspection.

      You should still be prepared to recover when, inevitably, drives will fail without warning, but that's not the same as verifying equipment before it fails. It's also the same problem as trying to figure out if a drive in a raid 5 has actually failed (or otherwise has a physical problem) or if there is something wrong with your software.

    15. Re:Why? by Anonymous Coward · · Score: 1

      That's a decent textbook answer but in reality it's not quite that simple. The failure rate of hardware over time is not linear. There's a higher probability of failure in the beginning of a device's average lifetime than in the middle.

      For example, railway systems are highly failsafe and redundant by design. Yet they "burn in" equipment like light bulbs for signals, i.e. they let them run for a few 100 hours in some warehouse before they are put into signals on the tracks. By doing so, they weed out parts that are more likely to cause maintenance overhead later on.

      It's better to identify defective hardware before you put them into production systems.

    16. Re:Why? by TheRealMindChild · · Score: 1

      A plastic strap won't save you from the drive head failing to move. I've seen this happen when a bunch of unemployed temp workers unload the truck. This is why it seems "batches" of similar drives fail if you are getting them from the same source... some asshole was throwing and kicking the boxes around.

      --

      "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
    17. Re:Why? by jeffmeden · · Score: 2, Insightful

      A plastic strap won't save you from the drive head failing to move. I've seen this happen when a bunch of unemployed temp workers unload the truck. This is why it seems "batches" of similar drives fail if you are getting them from the same source... some asshole was throwing and kicking the boxes around.

      If your static strap is made of (all) plastic, then you will have issues beyond shipping and handling woes...

    18. Re:Why? by windcask · · Score: 3, Insightful

      Even if your storage passes the test, it could fail the next day. What you should be doing is designing your storage to gracefully handle failure, like RAID 5 with spares.

      All RAID 5 does is move the single point of failure from the disk itself to the RAID controller, which could also fail at any time. This is why a truly effective solution is virtual machine redundancy with seamless failover and a rigorous backup schedule.

    19. Re:Why? by K.+S.+Kyosuke · · Score: 2

      During tests in a controlled environment it was demonstrated that hard drives show no "failure curve" at onset, but follow a very boring, linear progression throughout their lifespan.

      Interesting. Didn't the Google study on disk reliability show a distinct infant mortality spike in the beginning with a lowest failure rate between 1-2 years of age and after 2 years of age a sharp increase in failure rate quickly reaching a certain plateau? What you describe seems to be quite different.

      --
      Ezekiel 23:20
    20. Re:Why? by Copperhamster · · Score: 1

      Even better: Raid 6, with hot spare, cold spare on the shelf, and a unit that supports regular self-consistency checking and automatic failure notification. My primary NAS even has a wireable relay trigger, which is hooked to turn on a $20 spinning red light (old cop car style) sitting on top of the cabinet when there's an alert.

      It's also powered by two UPSes (one for each power supply) and supports network controlled shutdown on both.

      If you can, order the drive packs (we got 2 packs of 4) from different vendors to minimize the chance of getting the same 'lot' of drives. Look at the amount of storage you need and get the minimum size drives... because they rebuild faster you are at less risk of a multi failure. I'd much rather have 12x 1 TB drives than 4x 3TB drives.
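
      To put rough numbers on the rebuild-time point, here is a back-of-the-envelope sketch. The 100 MB/s sustained rebuild rate is an assumption picked for illustration, not a measured figure, and real rebuilds on a busy array are usually slower.

```python
def rebuild_hours(drive_tb, rebuild_mb_per_s=100.0):
    """Hours needed to rewrite one whole replacement drive at a sustained rate."""
    drive_mb = drive_tb * 1_000_000  # decimal TB -> MB
    return drive_mb / rebuild_mb_per_s / 3600

for size_tb in (1, 3):
    hours = rebuild_hours(size_tb)
    print(f"{size_tb} TB drive: about {hours:.1f} h exposed to a second failure")
```

      Whatever rate you plug in, the window scales linearly with drive size, so the 3 TB drive spends three times as long in the degraded state as the 1 TB one.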

      (And if you scoff at Raid 6, I've had a second drive fail hard during the rebuild when the system detected a probable failure on a drive and started to rebuild with the hot spare at 3 am...)

      Also, backup backup backup.

      If you need speed of course, you want raid 1+0. That's fine, my rule of thumb is:
      Start with one hot spare, one cold spare.
      After each 3rd mirror pair, add another hot spare. (so 6 total in use drives needs 2 hot spares)
      Add another cold spare after every other hot spare.
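
      Written out as code, the rule of thumb looks something like this. The function name is hypothetical, and the cold-spare line is one possible reading of "after every other hot spare" (one extra cold spare per two extra hot spares); treat it as an assumption.

```python
def spares_needed(mirror_pairs):
    """Return (hot, cold) spare counts for a RAID 1+0 set under the rule of thumb."""
    if mirror_pairs < 1:
        raise ValueError("need at least one mirror pair")
    extra_hot = mirror_pairs // 3   # one more hot spare after each 3rd mirror pair
    hot = 1 + extra_hot             # always start with one hot spare
    cold = 1 + extra_hot // 2       # assumed reading: one more cold spare per two extra hot spares
    return hot, cold

for pairs in (1, 3, 6):
    hot, cold = spares_needed(pairs)
    print(f"{pairs * 2} drives in use: {hot} hot, {cold} cold")
```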

      Cold spares should be testable and tested. I will swap them out with the hot spares once a month.
      But I'm paranoid.
      Also:
      Backup Backup Backup. RINB.

      Also: HARDWARE RAID CARDS.

      I can't stress that enough. software and semi-software raid is a joke.

    21. Re:Why? by Anonymous Coward · · Score: 0

      A lot of responses to this short-sighted posting assume we're all doing the same job. I send hardware to remote locations where servicing is difficult. Yes, we use RAID6 and we have a full up spare (and LTO backups, etc), but we still test drives before shipping, because we're not idiots. We do find drives with trouble (sometimes before they go bad) and replace them this way. Proper design can certainly include testing even if a bunch of Slashdot know-it-all's think otherwise.

    22. Re:Why? by DarkOx · · Score: 1

      That is not the only solution. There are plenty of multi-path, Multi-controller SAN solutions out there. You can install more than one HBA in your hosts.

      However! Once you are talking about controllers or HBAs as failure points you need to be rethinking your architecture. Disk failures are pretty common, but it's very unlikely silicon is going to up and die on you. If you can't tolerate those rare events, you really need to be looking at some cluster / application layer redundancy.

      You will never eliminate all the single points of failure from a host, or if you do you will have something called a mainframe, which is hugely expensive.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    23. Re:Why? by ericloewe · · Score: 1

      RAID 6, then RAID 10, then backup hourly, backup daily, backup weekly, make two copies and send one each to two off-site locations.

    24. Re:Why? by v1 · · Score: 4, Interesting

      The point is to know whether it's faulty now at the time of arrival rather than 2 weeks down the line where it becomes a problem.

      I would disagree. I believe it's best to be able to identify the first moment a hard drive is starting to have problems, rather than the condition it's in when you get it.

      One reason is that most of your hard drives will eventually develop a problem, and only a small fraction of the drives you buy will arrive defective.

      Another reason is that nothing of value is on the new drive, you are risking only purchase price. A year from now, you may have important, possibly irreplaceable or at least inconvenient things to replace.

      I run a piece of custom software I wrote that does a slow "disk crawl", reading ~100 MB every 5 minutes. Over the course of a month it has read every block on the drive, and starts over. I get an email if an I/O error OR slow performance is encountered. I store a lot here; I have somewhere around 25TB of storage under the roof at home. Over the years I've been notified ~8 times of a failing drive. In all cases I was able to replace it before it became inaccessible. One of them failed to spin up ever again the day after I removed it from service. I consider this a very good system, and am surprised not to see a similar commercial offering. (It's a 5,600 line bash script!)
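
      The core of a crawl like that can be sketched in a few lines. This is not the script described above, just a minimal Python illustration of the idea; the function name, the 1-second "slow" threshold and the chunk size are all assumptions. Note that reads may be served from the OS page cache unless the device is opened in a way that bypasses it.

```python
import time

def crawl(path, start=0, budget=100 * 1024**2, chunk=1024**2, slow_s=1.0):
    """Read `budget` bytes from `path` starting at byte offset `start`.

    Returns (next_offset, problems); problems holds (offset, reason) tuples.
    """
    problems = []
    offset = start
    with open(path, "rb", buffering=0) as dev:
        while offset < start + budget:
            began = time.monotonic()
            try:
                dev.seek(offset)
                data = dev.read(chunk)
            except OSError as exc:      # media errors surface as OSError (EIO)
                problems.append((offset, f"I/O error: {exc}"))
                offset += chunk         # skip the bad region and keep going
                continue
            if not data:                # end of device: next pass starts over
                return 0, problems
            if time.monotonic() - began > slow_s:
                problems.append((offset, "slow read"))
            offset += len(data)
    return offset, problems
```

      Run from cron every few minutes, with the returned offset saved between runs, it walks the whole device over time; anything in the problems list is what you would turn into an email.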

      SMART is only useful to possibly confirm that a drive has a problem. Only a fool relies on it to notify them when there's a problem. I've probably replaced somewhere around 750 hard drives here at work, and of those, under a dozen were still accessible and displaying a SMART failure. Many times I've had SMART toggle to failed while I was doing data recovery to a replacement drive, as I was fighting my way through I/O errors. Got some Cpt Obvious going on there I think.

      --
      I work for the Department of Redundancy Department.
    25. Re:Why? by Anonymous Coward · · Score: 0

      when a bunch of unemployed temp workers unload the truck.

      I think that was the problem. They weren't actually killing time by throwing and kicking the boxes around while waiting for the benefit slip. ;)
      I personally have had significant problems caused most likely by a medium age PSU which didn't reveal any signs of breaking otherwise, but seemed to kill SATA drives regardless of the brand with its sneaky, luring SATA tentacles.

    26. Re:Why? by Anonymous Coward · · Score: 0

      "Software raid is a joke."
      Says the 7-digit UID who recommends RAID 6, with today's low prices/TB. FFS, just run RAID1+0; there's no good argument for RAID 6 in serious applications today, any more than there was for RAID 5 a few years ago when all the tightwad idiots were building them with 4 identical drives, complaining about crappy performance, and then crashing it during rebuild. This is /.; there's no excuse for being that guy.

      If someone's doing some chickenshit project like a home media server where parity RAIDs make any sense, chances are the downsides of software RAID won't be significant, and if they do purchase a hardware RAID, odds are good they lowball it and get one of the cheap crappy ones that are worse than linux-md.

      Horses for courses, I say.

    27. Re:Why? by Lumpy · · Score: 1

      your "test" will tell you nothing at all.

      If you are a PHB that thinks that "testing" is important and apply it to everything then hire someone at $8.00 to waste their time.

      You might have them run some software to test the server enclosures and 19" racks at the same time.

      --
      Do not look at laser with remaining good eye.
    28. Re:Why? by Anonymous Coward · · Score: 1

      zfs with triple parity raidz.

    29. Re:Why? by Lumpy · · Score: 1

      I prefer raid 60.

      --
      Do not look at laser with remaining good eye.
    30. Re:Why? by Anonymous Coward · · Score: 0

      VM redundancy moves the failure point to the VM host, and its storage which is probably (you guessed it!) a RAID card!

      In other news, when a RAID card goes down, you don't lose data. When a drive goes down in RAID 0, you do.

    31. Re:Why? by djsmiley · · Score: 1

      Ah yes, except the clicking is commonly caused by a lack of oil between the ball bearings in the motor.

      No such failure.

      --
      - http://www.milkme.co.uk
    32. Re:Why? by Lumpy · · Score: 1

      It's more fun than that.

      Two storage vaults with 12 Hard drives each.
      Running a nice Raid 5 or raid 6 or even a raid 50 or 60.

      Knock loose ONE of those USCSI connectors going to a drive cage.

      Raid is toast. I don't care WHAT raid you are running, none of them can withstand a loss of 50% of the drives.

      Raid is NOT a backup. It's high availability mitigating the highest failure rate part, the hard drives.

      --
      Do not look at laser with remaining good eye.
    33. Re:Why? by tlhIngan · · Score: 5, Informative

      Also: HARDWARE RAID CARDS.

      I can't stress that enough. software and semi-software raid is a joke.

      Not until the hardware fails and you need the data that was on there but not on the backup (or realized the backup failed a long time ago...).

      For performance, yes, hardware is fastest. For reliability though, software RAID is better (hardware RAID can have interesting firmware version issues).

      Linux running an md RAID array? If the server goes down, pop the drives in another server, a couple of mdadm commands later and the array is up and running. Hell, even Windows' software RAID ought to be able to work to recover an array where the server hardware died.

      So if you're using RAID not for performance reasons, but for protection against hard drive failure, soft-RAID works very well. Hell, one of my NAS appliances died, and all I did was take the drives out, attach 4 USB adapters to them, and plug them into my Linux box. Instant access to the data.

      There's nothing like the panic that happens when an array goes down due to non-drive hardware failure.

    34. Re:Why? by Anonymous Coward · · Score: 0

      If you replace a drive now or later, what's the difference? I'm thirsty. Let me save the hassle and just get a drink now instead of later.

      A drive failure should be as simple as pulling it out and slapping in a new one. I do it at least twice a week where I'm at. If you have data loss and hassle recovering something with a drive failure, even on a desktop, your system design and process is wrong and faulty, not the hard drives. Dude, really.. step OUT of the box and think about it.

    35. Re:Why? by jeffmeden · · Score: 2

      During tests in a controlled environment it was demonstrated that hard drives show no "failure curve" at onset, but follow a very boring, linear progression throughout their lifespan.

      Interesting. Didn't the Google study on disk reliability show a distinct infant mortality spike in the beginning with a lowest failure rate between 1-2 years of age and after 2 years of age a sharp increase in failure rate quickly reaching a certain plateau? What you describe seems to be quite different.

      Actually no that was the study I was referring to and it didn't show anything like the "Bathtub" curve you describe. The AFR for drives 0-1 year old was steady for the first year (at most 1% higher for drives new to 3 months than in the first year), and from 1-2 years it held steady and then 2-3 years it rose precipitously until in year 5 when it became statistically chaotic (likely due to drives that were suffering from more obscure failures than the normal bearing or platter wear-out.) Basically, it demonstrated a lack of a "DOA" effect that is exactly what one would expect given that automated production testing is very effective.

      PDF available here: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf

    36. Re:Why? by Anonymous Coward · · Score: 0

      I would assume this is for one storage unit, so we can ignore VMs and clustering.
      Both are good approaches to keep uptime near 100%.

      Let's also assume this storage unit was bought from eBay (not that you would do that for a business, but if you can risk it...).
      This is what I would do to test it:
      Set it up for RAID 5, no spares (so you are using all disks).
      Then use a program like DBAN, Darik's Boot And Nuke.
      The trick is to use a version that recognises your RAID controller.
      Then let DBAN go to town wiping and rewiping your drives; set the number of passes manually.
      If it fails, OK, then you have a bad drive.

    37. Re:Why? by clarkn0va · · Score: 1

      But I'm paranoid.

      You could have started there and saved a lot of people a lot of reading.

      --
      I am literally 3000 tokens away from the chaotic crossbow --Stephen
    38. Re:Why? by whoever57 · · Score: 1

      SMART is only useful to possibly confirm that a drive has a problem. Only a fool relies on it to notify them when there's a problem. I've probably replaced somewhere around 750 hard drives here at work, and of those, under a dozen were still accessible and displaying a SMART failure.

      Are you monitoring only the Pass/Fail condition? Because IIRC, the Google study showed that an increase in the number of reallocated sectors was a good indication of future failure. While I have not swapped out anything like 750 drives, I have found that a combination of smartd to monitor the drives and logwatch to tell me when the reallocated sector counts start to increase has usually given me a warning about imminent failure.
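
      The check described here, watching the raw reallocated-sector count rather than the overall PASS/FAIL flag, can be sketched as below. It parses the attribute table printed by `smartctl -A`; the sample text is made up for illustration, and the helper names are hypothetical. Column layout can vary between drives and smartmontools versions, so treat the field positions as an assumption.

```python
def reallocated_sectors(smartctl_output):
    """Return the raw Reallocated_Sector_Ct value from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])  # RAW_VALUE is the tenth column
    return None

def should_warn(previous, current):
    """Warn on any increase since the last poll, not only on an overall FAIL."""
    return previous is not None and current is not None and current > previous

# Illustrative sample only, not output from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
"""
print(reallocated_sectors(SAMPLE), should_warn(3, reallocated_sectors(SAMPLE)))
```

      smartd plus logwatch gives you the same thing without writing code; the sketch only shows what is being compared.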

      --
      The real "Libtards" are the Libertarians!
    39. Re:Why? by georgewilliamherbert · · Score: 1

      I have worked for an OEM who installed about 30,000 drives a year; for end users with 10,000 drive environments, built out new 1,000 HDD and 600 SSD environments in the last year. I know all about static, having had the manufacturer-level training on how not to zap.

      It's not just static. Some drives come with SMART errors (or bad blocks that matter), despite $MFGR assurances. Some of the failures develop in the factory and get shipped anyways as unlikely to get worse, some develop while being packaged or shipped or unpackaged. Run SMART data collection across hundred-drive collections (or thousands or more) and you get a lot of useful and scary info.

      Also, there are well documented runs of drives - specific models, time ranges, factories involved etc - which all just blew up. Also happens to chips sometimes - I've been seriously bit by bad CPUs by Sun and Intel, support chips from several vendors. Also RAM going bad.

      One prototype CPU literally melted the system down, all the plastic nearby inside the casing melted and puddled on the bottom of the case, the CPU label plastic was carbonized.

    40. Re:Why? by georgewilliamherbert · · Score: 2

      There is a very slight bathtub type curve - all numbers rounded, it's about 3% AFR in the first quarter (i.e. about 0.75% failures in first quarter) and 2% for drives in the 3-12 month range (i.e. about 1.5%). If I read the statistics presentation there right 33% of first year failures look to happen in the first quarter, which is detectable but minor initial higher rate. That's dwarfed by 1-2 year AFR (about 8%) and 2-3 year AFR (about 9%), but drops slightly after that.
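
      For anyone checking the 33% figure, the conversion from annualized rates to actual failures works out like this. The two AFR values are the rounded numbers quoted above; the calculation ignores the (small) shrinkage of the cohort as drives fail.

```python
afr_first_quarter = 0.03  # annualized failure rate during months 0-3 (rounded)
afr_rest_of_year = 0.02   # annualized failure rate during months 3-12 (rounded)

failed_q1 = afr_first_quarter * 3 / 12    # fraction of drives lost in the first quarter
failed_rest = afr_rest_of_year * 9 / 12   # fraction lost in the remaining nine months
share_q1 = failed_q1 / (failed_q1 + failed_rest)

print(f"{failed_q1:.2%} + {failed_rest:.2%} -> {share_q1:.0%} of first-year failures in Q1")
```

      A perfectly flat failure rate would put 25% of first-year failures in the first quarter, so 33% is an early bump, just a modest one.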

      They presented the AFRs rather than the cumulative losses in an initial cohort per quarter/year, which would be slightly clarifying, but whichever way they did the analysis it's about like that.

    41. Re:Why? by moortak · · Score: 1

      I seem to recall a Google study that said almost the opposite. It points to a small spike increase in death rates in the first few months, then a drop, followed by a continued climb. Their references also point to issues with specific types of drives, though I haven't read it and can't speak to the results. research.google.com/archive/disk_failures.pdf

      --
      Xavier Rabourdin for president 2012
    42. Re:Why? by Marillion · · Score: 1

      I wholeheartedly agree. The question isn't if a drive is going to go bad, the question is when will a drive go bad. Just accept that the drive will go bad and be prepared for it with redundancy. In my experience, the MTBF has a very high variance. It's either going to fail within four weeks or last more than four years. Keep your eye on the S.M.A.R.T. stats. Reallocation of sectors is a very bad omen of pending drive failure.

      One other thing I haven't seen mentioned is the difference between consumer drives and server drives. Consumer drives will go through Herculean efforts to silently recover from media errors. The host computer is often never aware of it. Server drives will report errors back to the host computer sooner with the expectation that RAID subsystems want to know about media problems sooner rather than later.

      --
      This is a boring sig
    43. Re:Why? by georgewilliamherbert · · Score: 1

      Also: HARDWARE RAID CARDS.

      I can't stress that enough. software and semi-software raid is a joke.

      Not until the hardware fails and you need the data that was on there but not on the backup (or realized the backup failed a long time ago...).

      For performance, yes, hardware is fastest. For reliability though, software RAID is better (hardware RAID can have interesting firmware version issues).

      Old SAN / Cluster folks believe in belt+suspenders. I.e., often, use both.

      Use Software RAID 1 across a couple of LUNs (or separate controllers / drive array stacks, for non-SAN environments). Build the LUNs with internal RAID (5, 6, hot spares, figure out your rebuild times, etc.)

      Also - hugely common failure is that the operators aren't properly monitoring the underlying hardware RAID drive status. You need to know immediately when a drive fails even if there's RAID6 and a couple of hot spares in the array. When I worked for a VAR on clusters, I can't count the number of times I arrived and found that they'd had 2, 3, 4 failures nobody noticed, and were one more failure away from catastrophic data loss...
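
      For the software RAID layer, one simple version of that monitoring is to look for missing members in /proc/mdstat. The sketch below assumes Linux md and parses the `[total/active]` counters; the sample text is illustrative, the function name is hypothetical, and hardware RAID controllers need the vendor's own tooling instead.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose [total/active] counters show missing members."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)   # remember which array the next lines describe
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts and int(counts.group(2)) < int(counts.group(1)):
            degraded.append(current)
    return degraded

# Illustrative sample only.
SAMPLE = """\
Personalities : [raid1] [raid6]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid6 sdf1[3] sde1[2] sdd1[1]
      1953260544 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [_UUU]
"""
print(degraded_arrays(SAMPLE))
```

      Feed it the contents of /proc/mdstat from cron and alert whenever the list is non-empty, so the first failure gets noticed instead of the fourth.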

    44. Re:Why? by omglolbah · · Score: 1

      Which is of course also implemented...

      But when the raid is on an oil rig in the North Sea, swapping a drive gets quite expensive... it involves a helicopter ride and a lot of paperwork..
      Having a testing bed BEFORE shipping stuff offshore is quite useful.

    45. Re:Why? by Anonymous Coward · · Score: 0

      I prefer raid 69.

    46. Re:Why? by ArundelCastle · · Score: 1

      if you buy a quality drive (i.e. not a refurb or one specifically designed as a consumer throwaway) from a vendor that takes some care in shipping and handling, then no you did not stumble on "the conspiracy of XYZcorp's bad drives".

      Ah, so it was my fault that I didn't read Quantum's fine print on the box that said "Fireball (consumer throwaway edition)", because in 1995 that information was only a CompuServe search away? So I should have expected that drive to fail under warranty, as well as the THREE RMA REPLACEMENT DRIVES they sent me *also* failing under warranty?

      This took place over the span of 3 years, so it wasn't a "bad batch". I had to give up on the brand just to stop the cycle of pain, because they would always send a replacement drive no questions asked. Never had a similar issue with another brand before or since (and their shipping was totally fine).

    47. Re:Why? by georgewilliamherbert · · Score: 1

      Raid is toast. I don't care WHAT raid you are running, none of them can withstand a loss of 50% of the drives.

      Really? I used to do that as a routine acceptance test for clusters. The only times it failed for real was when we'd screwed up something.

      For that to work, you have to rigorously separate RAID mirrors into their own trays so that a whole tray failure (or cable, as you said) only takes one mirror down. For something like 10, 50, 60 you just make sure all of one side is on one array and all of the other on another (or if you have more than 2 arrays, that you separate them out into pairs with one used for one side and one for another).
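
      A layout check for that rule is easy to automate. The sketch below uses made-up drive and tray names and a hypothetical helper; it simply flags any mirror pair whose two sides live in the same tray, which is exactly the case a single tray or cable failure would take out.

```python
def shared_tray_mirrors(mirrors, tray_of):
    """Return mirror pairs whose two sides sit in the same tray.

    mirrors: list of (drive_a, drive_b) tuples; tray_of: dict of drive -> tray name.
    """
    return [(a, b) for a, b in mirrors if tray_of[a] == tray_of[b]]

# Hypothetical layout: the second mirror has both halves in trayA.
tray_of = {"d0": "trayA", "d1": "trayB", "d2": "trayA", "d3": "trayA"}
mirrors = [("d0", "d1"), ("d2", "d3")]
print(shared_tray_mirrors(mirrors, tray_of))
```

      An empty result is the precondition for the pull-the-cable acceptance test to pass.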

      Physical separation helps as well, so that you don't accidentally unplug A while starting servicing on B. That exact scenario is one of the canonical HA oopses.

    48. Re:Why? by jmorris42 · · Score: 1

      > Raid is toast. I don't care WHAT raid you are running, none of them can withstand a loss of 50% of the drives.

      Actually.... RAID 1, 1+0, 1+5 and 1+6 can do just that. Two separate cabinets connected to two controllers on possibly two hosts. Then you can kick out a cable and survive. Just depends how much money you throw at it, which usually depends on how much downtime costs.

      > Raid is NOT a backup. It's high availability mitigating the highest failure rate part, the hard drives.

      Which is also true since RAID does nothing to help when filesystem corruption is the problem.

      --
      Democrat delenda est
    49. Re:Why? by noh8rz3 · · Score: 1

      Yo dawg, I heard you like backups, so I backed up your backup so your computer can fail while it fails.

    50. Re:Why? by windcask · · Score: 1

      VM redundancy moves the failure point to the VM host, and its storage which is probably (you guessed it!) a RAID card!

      In other news, when a RAID card goes down, you don't lose data. When a drive goes down in RAID 0, you do.

      Unless you're talking RAID 1 mirroring, you're wrong. Only the original controller knows where the parity bits are stored; the individual drives are worthless.

    51. Re:Why? by AmiMoJo · · Score: 1

      Most machines are not servers, they don't need the kind of uptime that RAID is designed to provide. Obviously RAID isn't a backup either. Testing makes sense when you need to put together a machine as a workstation for someone and just want to be reasonably sure the drive isn't knackered.

      Back when I worked in a shop we did a lot of HDD testing. Naturally we didn't want second hand machines to come back, but more than that a lot of random faults were due to failing HDDs. Stuff crashing, settings being forgotten, failed updates etc. We used PC Check for HDD testing and found it to be very reliable at detecting errors that other software sometimes missed. Took ages to run on large drives, but well worth it. Not cheap though.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    52. Re:Why? by alexhs · · Score: 2

      We had redundant raid arrays

      Were they made from inexpensive disks ?

      --
      I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
    53. Re:Why? by CAIMLAS · · Score: 1

      RAID5 is not graceful. It's like a 4-wheeled vehicle getting a flat: you can limp along at 5mph, sure, but until you pull over, stop, and replace the tire, you're not going anywhere fast.

      RAID6, or RAID10, are significantly more graceful.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    54. Re:Why? by Anonymous Coward · · Score: 0

      That is something fundamental that people need to know -- RAID is to keep data in place if a HDD decides to die, not protect against malware or accidental file deletions. One still needs backups, and one needs backup rotations. I've seen people have a 4GB TrueCrypt volume. That gets backed up via Nero's SecureDisk feature to a DVD every month or so. End result is being able to recover everything, but still have the contents encrypted... as well as the fact that the SecureDisk burn adds its own ECC to the equation.

      I wish we could get consumer level tape drives again, with at least 1-2 TB native capacities. There is nothing more reliable for backups than tape (not saying 100% reliable, but more reliable than disk -- just look at disk warranties, 1-3 years, compared to lifetime for tapes.)

    55. Re:Why? by CAIMLAS · · Score: 4, Informative

      To a degree, you can rule with certainty that everything is working.

      New equipment does tend to have ghosts. Given enough systems, with homogeneous roles, it doesn't matter: if it starts to fail, you pull it and put another one in.

      If you've got an environment with only a few servers with dedicated roles, having a new 'production server' go tits up is a very bad thing. For a system like this, you really do want to do a 'burn in' period, IMO for at least a couple weeks, where the system is not being depended upon. Your 4-year-old system doing the same thing at relatively diminished capability is not nearly as bad as doing a cut-over and having things go south, then.

      You do, however, want to do a "burn in" on that new equipment. My preference is to stress a new piece of equipment with something like building kernels (which will stress every significant subsystem to some degree) while doing file operations (eg. something like bonnie+ if you're not copying files to the machine) for a period of at least a week without any stability or significant performance problems. This is due to the following subjective observations:

      * Getting a system with a defective disk is not unheard of these days, but it's not common, so it's not a serious concern.
      * Short of initial failure of the disk/DOA status, the disks will likely run a number of months before your first failure (depending on how many you've got, of course)
      * Instability, inconsistent behavior, flaky RAM, or odd behavior from RAID or NIC controllers, and 'ghosts' can almost invariably be traced back to the PDU or PSU. These seem to die within about two weeks to a month if they're defective/poorly designed. With a server, troubleshooting this can be a huge bitch due to how loud they are and the multiple-dependence issue on the PDU. This is kind of an end game for me, and I have a hard time trusting any of the equipment after I've had a PSU fail.
      * if you plan on taxing the system at all, you'll probably have a driver related performance problem somewhere down the line. Better to find it before you need the performance.
      * Every once in a while, you've got a bad solid state device (RAM, CPU, SSD). These seem to either work, or not work, if they pass initial "does it work?"

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    56. Re:Why? by CAIMLAS · · Score: 1

      From what I've noticed, DOAs are, in all likelihood, the result of improper handling more so than they are manufacturing problems.

      For sensitive applications which have high costs associated with replacement, you'd probably want to be very sure you're getting your disks from somewhere that respects drives as sensitive mechanical devices. From what I've seen, getting a different production run can make a big impact on the drives, but getting them from another supplier seems to be more effective (possibly due to manuf runs going to only one or two specific clients? just guessing on that).

      LOL at your static strap statement. That has almost no bearing except in estimated overall electrical lifecycle. It might lead to a premature failure, but then again, it might not. Percussive damage to a drive WILL cause it to fail early, or not work at all.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    57. Re:Why? by CAIMLAS · · Score: 1

      5600 lines?! What exactly is BASH doing in 5600 lines which could not be done in, say, about 56 lines of conditional statements, a loop or three, and dd using "seek="?
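
      For what it's worth, here is a minimal sketch of that "a loop or three and dd with seek=" idea, written in Python rather than bash and dd. It is not v1's script; the function name scan_reads, the chunk size, and the one-second "slow read" threshold are all made up for illustration.

```python
import os
import tempfile
import time


def scan_reads(path, chunk_size=1024 * 1024, slow_seconds=1.0):
    """Read `path` chunk by chunk and return (bytes_read, problems).

    `problems` is a list of (offset, reason) tuples for chunks that
    raised an I/O error or took longer than `slow_seconds` to read.
    """
    problems = []
    bytes_read = 0
    with open(path, "rb", buffering=0) as dev:
        size = dev.seek(0, os.SEEK_END)
        offset = 0
        while offset < size:
            dev.seek(offset)  # plays the role of dd's skip=/seek=
            started = time.monotonic()
            try:
                data = dev.read(min(chunk_size, size - offset))
            except OSError as exc:
                problems.append((offset, f"read error: {exc}"))
                data = b""
            else:
                if time.monotonic() - started > slow_seconds:
                    problems.append((offset, "slow read"))
            bytes_read += len(data)
            offset += chunk_size  # move on even if this chunk failed
    return bytes_read, problems


if __name__ == "__main__":
    # Demo on a throwaway file; a real run would point at a disk or partition.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(10_000))
    print(scan_reads(tmp.name, chunk_size=4096))
    os.remove(tmp.name)
```

      Pointed at a block device (which needs root), this only exercises reads, so it would not catch sectors that fail on write, and it does nothing about a drive that hangs the read call outright.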

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    58. Re:Why? by Hillgiant · · Score: 1

      5600 lines?! What exactly is BASH doing in 5600 lines which could not be done in, say, about 56 lines of conditional statements, a loop or three, and dd using "seek="?

      WHISKEY and BLACKJACK

      --
      -
    59. Re:Why? by PhunkySchtuff · · Score: 1

      It's been a loooong time since hard drives used oil in their bearings. They've commonly used quieter and cheaper Fluid Bearings for something like the past 10 years. No, in this case the fluid isn't oil, but air.

    60. Re:Why? by PhunkySchtuff · · Score: 1

      Thanks slashdot for mangling the link - I meant to link the words Fluid Bearings to the wiki http://en.wikipedia.org/wiki/Fluid_bearing

    61. Re:Why? by the_B0fh · · Score: 1

      I prefer 68. I owe her one.

    62. Re:Why? by the_B0fh · · Score: 1

      Seriously?! hardware raid cards? Which century are you stuck in?

    63. Re:Why? by Anonymous Coward · · Score: 0

      Given the capacity of modern drives and the long rebuild times you should go with RAID 6. With RAID 5 you are left defenseless during the hours to days that a parity rebuild may take. One more failure of a drive that you probably bought at the same time from the same lot and you lose everything.
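
      To put rough numbers on "hours to days", a back-of-the-envelope sketch follows. The drive size, rebuild rates, drive count, and 3% annual failure rate are assumptions picked for illustration, and the second function treats drive failures as independent, which understates the risk for drives bought from the same lot.

```python
HOURS_PER_YEAR = 8760


def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Hours needed to rewrite one whole drive at a sustained rebuild rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal TB to MB
    return capacity_mb / rebuild_mb_per_s / 3600


def second_failure_probability(surviving_drives, annual_failure_rate, window_hours):
    """Chance that at least one surviving drive fails during the rebuild window.

    Assumes independent failures at a constant rate, which is optimistic.
    """
    per_drive = 1 - (1 - annual_failure_rate) ** (window_hours / HOURS_PER_YEAR)
    return 1 - (1 - per_drive) ** surviving_drives


if __name__ == "__main__":
    # Assumed rates: an idle array vs. one still serving production load.
    for rate in (100, 20):
        hours = rebuild_hours(2, rate)
        risk = second_failure_probability(7, 0.03, hours)
        print(f"{rate} MB/s: {hours:.1f} h rebuild, {risk:.4%} chance of a second failure")
```

      The absolute numbers matter less than the trend: the slower the rebuild, the longer a RAID 5 set sits with no redundancy at all.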

    64. Re:Why? by the_B0fh · · Score: 2

      What a convoluted way to make sure when the shit hits, it gets spread out really evenly.

    65. Re:Why? by the_B0fh · · Score: 1

      Don't understand why people still don't understand this.

    66. Re:Why? by the_B0fh · · Score: 1

      Exactly. Only morons argue against that. And we have clear evidence of that here.

    67. Re:Why? by jamesh · · Score: 1

      I agree. You can't test when a piece of hardware is going to fail. I've purchased many Hard Drives for our servers, sometimes they last years, sometimes they fail after a few weeks. There is no way to tell.

      TFA's idea is to figure out if the drive has already failed but doesn't know it yet.

      There's also the failure curve to consider. Early life failures are typically more common than mid-life failures (over the same timeframe), so testing drives thoroughly can bring out early life failures when it's not so inconvenient (hotswap drives are great but not so great if they are hundreds of miles away)

      But it's still playing the odds - you could test for a week and then have the drive fail in the first day of production use. It's still worth trying to push the odds in your favour though.

    68. Re:Why? by Voyager529 · · Score: 1

      http://www.youtube.com/watch?v=WL2txMU50CI

      This, I believe. (Parental Advisory...)

    69. Re:Why? by v1 · · Score: 2

      Color menus (arrow key), user-editable disk database, remote updates, authenticated email relaying, support for multiple drives, auto-detect and add, speed and capacity testing during add, performance and history graphing, quite a lot really. ;) It's a big'un. I make the most of whatever language I use. The incremental scan nature of the script itself requires a good deal of code. There have also been numerous changes to be as certain as possible that the script cannot get hung. Failing hard drives are very good at hanging apps on the system they are installed into. So there are threads and signals flying around too. It's also very modular, I actually have bash libraries that dynamically load, but this has those rolled up into it since that's unnecessary there. www.vftp.net/wd.zip if you dare :)

      --
      I work for the Department of Redundancy Department.
    70. Re:Why? by swalve · · Score: 1

      at 5200, 7500, 10k or 15k rpm, your dry bearings wouldn't "click", they would squeal.

    71. Re:Why? by PhunkySchtuff · · Score: 1

      As you observe, SMART is only partially useful.
      If SMART says that a disk is going to fail soon, it's generally pretty reliable - that disk is pretty well guaranteed to fail soon.
      However the reverse is not true - if SMART says that the disk passes, that's not a reliable indicator that the drive is operating 100% perfectly and will continue to do so.

    72. Re:Why? by swalve · · Score: 1

      Yeah, just build the RAID and let it resync. If nothing fails, that's all the testing you should be doing.

    73. Re:Why? by swalve · · Score: 1

      For small orgs where the $1000 is a big deal, then yes, software RAID is probably better. But for a big org with lots of servers, just buy Dell PERC, IBM ServRaid or HP SmartArray cards. There aren't going to be any supply issues with getting the right card for the life of the server. You can still get a 15 year old Smart2DH for like $25.

    74. Re:Why? by swalve · · Score: 1

      For servers and SCSI drives, they are the way to go.

    75. Re:Why? by swalve · · Score: 1

      Only on cheap controllers. On real ones from real vendors, the data is stored on both the drives and the controller. You can swap cards (or drives to another box) all you want.

    76. Re:Why? by Lumpy · · Score: 1

      When you have the cash to do it right? yes.

      when your CTO is a moron and demands you do it with the minimal needed and then he puts in his "suggestion" that we stripe everything across the cages for more performance. Ignoring all the suggestions that it is a REALLY BAD IDEA.

      We should have had 4 cages, and spread out the mirrors like it is supposed to be setup.

      --
      Do not look at laser with remaining good eye.
    77. Re:Why? by Anonymous Coward · · Score: 0

      ZFS mirror, with 3x disks mirrored (i.e. triple redundant) could eat it on 66.66% of the drives, actually...

      a plain mirror (raid 1) can have 50% of disks chomped....
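
      The percentages above fall straight out of the copy count. A tiny sketch, assuming a single n-way mirror where any one surviving copy is enough (the function name is made up for illustration):

```python
from fractions import Fraction


def mirror_loss_tolerance(copies):
    """Largest fraction of drives an n-way mirror can lose and keep one good copy."""
    if copies < 1:
        raise ValueError("a mirror needs at least one copy")
    return Fraction(copies - 1, copies)


if __name__ == "__main__":
    for copies in (2, 3):
        share = mirror_loss_tolerance(copies)
        print(f"{copies}-way mirror: can lose {share} of its drives ({float(share):.2%})")
```

      For striped mirrors this is only the best case, since losing every copy within one mirror set kills the array no matter how many other drives are healthy.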

    78. Re:Why? by Anonymous Coward · · Score: 0

      Okay, where can I get a copy of this program?

    79. Re:Why? by dbIII · · Score: 1

      With battery backup etc on the hardware RAID cards I'd say one century later than the one you are stuck in :)
      Software RAID consumes resources that NFS or similar could be using anyway.

    80. Re:Why? by dbIII · · Score: 1

      Disks got faster rendering the above article an even bigger pile of bullshit. For other reasons RAID6 or a variety of others are better anyway, but that article has always been bullshit.

    81. Re:Why? by Pentium100 · · Score: 2

      The click is a read error. Drive cannot read a sector, so it moves the arm all the way to the edge of the platter to recalibrate (that's the "click"), then moves it back in and tries to read the sector again.

    82. Re:Why? by the_B0fh · · Score: 2

      I guess you haven't dealt with multiple conflicting firmware versions that wipe out raidsets then. You're lucky.

    83. Re:Why? by the_B0fh · · Score: 1

      And if your servers are so constrained that your cpu or ram could not handle it, then maybe you should size your servers appropriately. Or perhaps upgrade to a better OS.

    84. Re:Why? by Anonymous Coward · · Score: 0

      Yeah, your comment on PSU fail is most informative. It's no surprise if a brand-name server or desktop
      begins to develop ghosts after adding something as innocent as known-to-be-good ram.
      The big makers save every gram of copper and such by making their PSUs support only the as-sold equipment.
      Building servers from scratch with twice the wattage than required is a safe bet.
      The truly reliable range for a PSU is about half the marketed output capability.
      And over time, the capacitors tend to dry up reducing the PSU's safe range.
      Add to this only tantalum caps on the motherboard, and you'll get 5+ years lifetime
      (on the parts that have no moving parts, such as a fucked off sysadmin :) ).

      With choosing HDDs reading peer reviews and gossip is surprisingly useful.

      Anyhow, making backup dumps of your data is also the course 101 of essential computer security,
      as it's not only hardware fail that can eat your homework.
      I mean, remember that stupidity is a more common element in the universe than hydrogen.
       

    85. Re:Why? by Pentium100 · · Score: 1

      It all depends on how much money you can spend on the drives and controllers. RAID is great if you can buy a lot of new drives at once. On the other hand, I can only buy one drive at once, when I run out of space, so the file server is full of hard drives with capacities ranging from 120GB to 500GB. RAID does not work with that.

    86. Re:Why? by Anonymous Coward · · Score: 0

      Troubleshooting a desktop at home can still be a hassle. I've replaced PATA cable(s), molex cables, and the PSU, to deal with HDD 'related' issues. Although, one time, human error did cause me some problems. I didn't plug the PATA cable in fully to the HDD and I think it caused bad blocks. Thankfully, it's been fine after plugging it back in fully.

      One of my HDDs in my personal computer is nearing 12,000 power on hours. The other HDD is nearing 15,000 hours. And this isn't factoring in that my motherboard has to have at least more than that since these are replacement drives.

      But I can't help but wonder. If a HDD has run for this long, how much longer will it continue to run? I try to keep them cool. One is around 26 degrees celsius (320GB) and the other is around 31-33 degrees celsius (80GB). I don't like it when it gets over 31 degrees celsius, but it can't be helped if it's warm in my room coupled with the, well, go look at the inside of an eMachines T1090 where they put the HDD. (The second drive (320GB), which runs cooler, is shoved into where the floppy drive should be.)

    87. Re:Why? by afidel · · Score: 1

      Every array or controller that doesn't suck does background data scrubbing, disk failure prediction, etc. The vast majority of the drives that I replace in my enterprise arrays haven't hard failed, rather they've indicated a problem and the array has proactively rebuilt the data to a spare (or the distributed spare space in the case of the EVA). Over about 500 drives (average) over the last 6 years I'd say I've had a handful of hard failures and a few dozen predictive failure replacements.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    88. Re:Why? by dbIII · · Score: 1

      Or a purpose designed better value for money RAID card instead of being "penny wise and pound foolish" by scaling to a size and price unnecessary if you have the right hardware to assist.
      I'm curious to get your opinion on 3D graphics cards - are they also unnecessary and replaced by a better CPU and OS? In what I assume is your ignorance you are putting forward exactly that argument in storage space.
      Also could you please point me to those server motherboards without a RAID controller capable of running 16 or more SATA or SAS drives and no RAID capability? You can't? Now do you see that you need to think for at least three seconds before posting crap like you put above?
      We are in the century of wanting to use more than six disks. Now please leave the grownups alone.

    89. Re:Why? by KevReedUK · · Score: 1

      Ahh yes... the good old Quantum Fireball series. Always thought it was a particularly apt name for that series of drive.

      One of the jobs I had was at a site with almost exclusively Dell PowerEdge 300 series servers back in the early 2000s, and the ones we had all shipped with Fireballs in the drive carriers. It seemed, at the time, that Dell were only using this series of drive, as whenever we had a failure, the replacement supplied was always from the same series, and rapidly displayed the same symptoms. In our test lab, we stripped the Fireballs from some of the carriers and replaced them with alternative series without getting any discernible failure rate, but due to our contract with Dell (for some reason, the carriers were designated as non-user-serviceable, so swapping the drive in them would have invalidated the warranty and support contract for the whole installation of over 60 servers), not to mention the fact that we would have had to foot the bill for the replacements, we could not carry out such replacements in our production environment.

      This, however, was only one of many reasons why we dropped Dell as a supplier. The entire 300 series seemed to develop disturbing issues with RAM and CPU, as well. This coupled with the fact that the on-board component layout for some of the main daughterboards (particularly the one on which the RAM was mounted) was poorly thought out, with key components sited in just the right location that catching them with the airflow guides was practically unavoidable when reassembling the server after replacing yet another failed component (and it wasn't just our on-site staff who found this to be an issue, I personally saw three different Dell techs clip components from the corner of the RAM daughterboard whilst reassembling the server they were working on!).

      --
      Just my $0.03 (At current exchange rates, my £0.02 is worth more than your $0.02)
    90. Re:Why? by v1 · · Score: 1

      Every array or controller that doesn't suck

      I assume you mean "every array controller that only enterprise-budget users can afford"?

      (I got the impression from the OP that they weren't in that budget tier)

      --
      I work for the Department of Redundancy Department.
    91. Re:Why? by afidel · · Score: 1

      No, even the p4x0i that's built into every HP DL3xx server does background scrubbing and predictive failure notification.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    92. Re:Why? by CAIMLAS · · Score: 1

      I'd be interested in seeing this on sourceforge or similar. I've not seen anyone implement anything quite like this yet. Ever consider selling it and offering support for it, or getting it packaged for distros? :)

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    93. Re:Why? by CAIMLAS · · Score: 1

      The PSU's I've had the most problems with have been Supermicro, so your 'build your own' criteria doesn't really match my experience. I'll take the reliability of good desktop components over server components almost any day.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    94. Re:Why? by cthulhu11 · · Score: 1

      Also: HARDWARE RAID CARDS.

      I can't stress that enough. software and semi-software raid is a joke.

      On Linux systems it sure is. Separate layers for RAID, volume management, and filesystem == nightmare. ZFS wipes the floor with any other software or HBA RAID system in almost every way. And before you bring up btrfs -- get back to me when it's real instead of an incomplete vaporware project that's been ready RSN for years. Among the hassles with HBA RAID:

      * Setting it up. On an x86 system, that means an anachronistic venture into BIOS option ROM utilities at least to set up a boot volume for OS installation. Depending on the HBA used, this may require a local or redirected keyboard to enter @#$@#! function keys or local / redirected *video* -- or absurdly complex and scantily-documented CLI syntax even from the OS level -- yes LSI I'm talking about YOU.
      * No 3-way mirrors. Until last month no RAID HBA I'd encountered could do 3-way mirrors, and no, RAID 1E doesn't count, in fact it's *worse* than a plain 2-way mirror. HP's new Smart Array HBA's announced with their Gen 8 systems are touted as (finally) supporting 3-way mirrors. I like to think that the RFE I made them enter for this made a difference.
      * Limited portability of drives and volumes across systems.
      * Impossible to mirror plexes / vdevs / submirrors across HBA's for redundancy and capacity.

      For performance, yes, hardware is fastest.

      Sometimes yes, sometimes no. HBA RAID limits one to a single HBA, which in extreme cases can be a bottleneck. There are also plenty of lacklustre RAID HBA's out there, and let's not forget the hassle of monitoring and replacing backup batteries, which most require.

      For reliability though, software RAID is better (hardware RAID can have interesting firmware version issues).

      Linux running an md RAID array? If the server goes down, pop the drives in another server, a couple of mdadm commands later and the array is up and running.

      ... if you happen to have the configuration documented ahead of time. MD is a joke; it's messier and flawed even worse than SDS/ODS/SVM. As for the original topic of testing new disks, with Solaris it's easy to fire up format (1m), retrieve the factory and grown defect lists, get inquiry data, run a write/read/compare surface analysis, lay down a partition table if you have anachronistic hardware, and lay down slices. I have yet to find a tool for Linux systems that even comes close.
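
      For Linux boxes without format (1m), a crude write/read/compare pass can be sketched in a few lines of Python. This is an illustration, not a replacement for a real surface analysis: it works on a scratch file rather than the raw device, the function names are invented here, and unless the page cache is dropped or bypassed the read-back may be served from RAM instead of the platters.

```python
import hashlib
import os
import tempfile


def pattern_block(index, block_size, seed=b"burn-in"):
    """Deterministic pseudo-random block derived from its block index."""
    out = bytearray()
    counter = 0
    while len(out) < block_size:
        out += hashlib.sha256(
            seed + index.to_bytes(8, "big") + counter.to_bytes(4, "big")
        ).digest()
        counter += 1
    return bytes(out[:block_size])


def write_pattern(path, blocks, block_size):
    """Write `blocks` pattern blocks to `path` and force them out to the device."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(pattern_block(i, block_size))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, blocks, block_size):
    """Re-read `path` and return the indexes of blocks that do not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(block_size) != pattern_block(i, block_size):
                bad.append(i)
    return bad


if __name__ == "__main__":
    # Demo in a temporary directory; a real run would use a file on the new disk.
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "surface.bin")
        write_pattern(target, blocks=16, block_size=4096)
        print("mismatching blocks:", verify_pattern(target, 16, 4096))
```

      Deriving each block from its index means a block written to the wrong place shows up as a mismatch too, which a constant fill pattern would miss.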

    95. Re:Why? by the_B0fh · · Score: 1

      My Sun x4540 box laughs at you. And if you believe NetApps and EMC use RAID "cards", well, I guess, pass the pipe. A better computer does not mean the most expensive computer. It just means you don't buy the cheapest thing from walmart.

      And your motherboards using FakeRaid? And you are using that as an argument? Shows how much enterprise shit you get to play with.

      Anyone building a serious enterprise storage capability knows better than to waste money on raid cards.

      You think Google, Facebook, and those cloud providers use RAID cards? Seriously?

      http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/ sure don't show any RAID cards in there.

    96. Re:Why? by Anonymous Coward · · Score: 0

      on the flip side I've got a 2TB drive that SMART tested with some bad sectors and a warning that it was on its last legs over a year and a half ago. I use it for internal backup of my primary 2TB drive and it is still going strong, being written to every night. SMART still tells me it is almost dead, it is, I agree, so am I for that matter.

    97. Re:Why? by adavies42 · · Score: 1

      +1

      (tho i'd've said github)

      --
      Media that can be recorded and distributed can be recorded and distributed.
      -kfg
    98. Re:Why? by dbIII · · Score: 1

      Amusing - you just cited examples with specialised onboard RAID hardware. Give it up and get your supervisor to explain to you how it works before you embarrass yourself somewhere where it actually matters.

      And your motherboards using FakeRaid

      Can they run 16 disks? Of course not. WTF do you think I was writing about them at all? Thanks for the silly little insult, it really does show the level of maturity I'm dealing with and that you are just throwing "serious enterprise" etc around as buzzwords you do not understand but you think have power because you see others using them.

      You think Google, Facebook, and those cloud providers use RAID cards? Seriously?

      Yes, either controllers in file servers or self contained devices (eg. EMC) that have RAID controllers in what you appear to think are mysterious sealed black boxes that operate by magic. Just because you get somebody else to install the hardware and it comes pre-built doesn't mean that it isn't in there. I'm beginning to think you are coming from "beige box is the hard drive" territory.
      BTW - why didn't you even notice in the TITLE of the article you linked that it was about large amounts of storage on a low budget? There are no RAID controllers there to cut costs and have the tradeoff of less performance in situations where you don't need it to be fast. Your example does not address your point at all. With such a mistake, it appears that you know so little about this topic that you are going to have to start at the wikipedia level and work your way up.

    99. Re:Why? by Coren22 · · Score: 1

      When talking about 30 TB usable space arrays, there is very good reason for RAID6 +spares.

      --
      APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
    100. Re:Why? by Coren22 · · Score: 1

      You must not do VM. With VMware vSphere, you can have multiple copies of a VM running on different servers with iSCSI storage redundancy, and they auto fail-over.

      --
      APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
    101. Re:Why? by the_B0fh · · Score: 1

      Color me surprised if Netapps actually use hardware raid, since everything else is handled by the software, according to http://en.wikipedia.org/wiki/NetApp_filer and articles like http://media.netapp.com/documents/tr-3298.pdf

      Obviously they have non-commodity controllers, but are these "hardware raid"? So far, nothing indicates that they are from cursory reading. If you have links that say that they are, do share.

      But you sound like an old storage guy, so it's obvious why you like hardware raid. But it's a brave new world out there my friend.

    102. Re:Why? by dbIII · · Score: 1

      I don't know about the netapp gear but the EMC stuff you used as an example are specialised hardware built for RAID containing components not on ordinary server boards, thus, guess what, those appliances ARE hardware RAID.
      Also how is a software only solution going to cache unwritten data when the power goes off?
      I'm assuming by "old" you mean out of school :( WTF did you ever consider you know enough to dismiss an entire class of hardware across the entire range of computing when you've shown above you don't even understand the concepts behind it yet. I like software RAID for some situations (small number of hosts accessing it and data that is not so important that a battery powered cache can save pending operations), but for a lot of the systems I'm running I need both fast storage and the ability to transfer data to and from a lot of hosts at once (a cluster) - so enough CPU power to handle a lot of NFS threads and some hardware that gets the disk operations done as well as extra processing power that would cost five or so times more than a card. So I'm in exactly the sort of space that a keen gamer that finds their motherboard's onboard graphics just does not quite do the job so goes out and buys a new graphics card designed for the task instead of a 48 core monster that could handle it all via a tiny bit of all those processors at once. Is it sinking in yet?

    103. Re:Why? by hackertourist · · Score: 1

      This is just what I've been looking for. Thanks for making it available.

    104. Re:Why? by v1 · · Score: 1

      glad to hear you like it, please give me some feedback on it once you've had some time to play with it.

      --
      I work for the Department of Redundancy Department.
    105. Re:Why? by the_B0fh · · Score: 1

      I always like to go back to first principles. When Patterson, Gibson, Katz first described RAID, they detailed how it worked, and how you can break the data up over multiple inexpensive disks, how you can stripe data, and how to use parity. In 1987, x86 architectures were a 286 or maybe a 386. So, when they implemented parity for RAID, it had to be done in specialized hardware, because the CPUs were not fast enough. We have all seen how a Windows NT 4 server slowed to a crawl when you turn on the openGL screen saver, or used winmodems. The CPUs in those days were simply slow.
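
      For anyone who hasn't seen it spelled out, striping plus single parity fits in a few lines. This is a toy sketch of RAID 5-style XOR parity over one stripe, with made-up function names; it leaves out parity rotation across disks, and RAID 6's second syndrome needs Galois-field arithmetic that is not shown here.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must be non-empty and equal in length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


def make_stripe(data, data_disks):
    """Split `data` into one chunk per data disk and append a parity chunk."""
    chunk = -(-len(data) // data_disks)  # ceiling division
    padded = data.ljust(chunk * data_disks, b"\0")
    chunks = [padded[i * chunk:(i + 1) * chunk] for i in range(data_disks)]
    return chunks + [xor_blocks(chunks)]


def rebuild(stripe, lost_index):
    """Recompute the chunk at `lost_index` from every other chunk in the stripe."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    stripe = make_stripe(b"hello raid world", data_disks=3)
    for lost in range(len(stripe)):
        assert rebuild(stripe, lost) == stripe[lost]
    print("any single lost chunk, data or parity, can be rebuilt")
```

      The whole job is one XOR per byte, which is why the cost was painful on a 386 and is close to noise on anything recent.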

      Today, a Phenom II X4 945 can compute RAID6 parity at close to 8GB/s. A throughput of 500 MB/s requires less than 1.5% of CPU time. (see: http://blog.zorinaq.com/?e=10)
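
      The arithmetic behind that percentage can be checked in a couple of lines. The assumptions here (mine, not necessarily the linked blog's) are that the 8 GB/s figure is for a single core, that "CPU time" means all four cores of the Phenom II X4, and that units are decimal.

```python
def cpu_share(throughput_mb_s, parity_rate_mb_s, cores=1):
    """Fraction of total CPU time spent on parity at a given array throughput."""
    return throughput_mb_s / parity_rate_mb_s / cores


if __name__ == "__main__":
    one_core = cpu_share(500, 8000)
    whole_chip = cpu_share(500, 8000, cores=4)
    print(f"share of one core:  {one_core:.2%}")
    print(f"share of four cores: {whole_chip:.2%}")
```

      Under those assumptions 500 MB/s costs about 6% of one core, or roughly 1.6% of the whole chip, which is in the same ballpark as the figure quoted above.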

      Now, of course, there's some overhead, processes swapping, etc. And RAID is more than simply calculating parity. The numbers I have seen do not go above 5% of one core however.

      Can specialized chips implement RAID functionality faster than an Intel CPU? Sure, that's obvious. But again - if your server cannot handle a 5% load on one CPU on top of what it is already handling - you are sizing it wrong.

      Why would software RAID be better? Because you cannot easily implement new technologies like ZFS in hardware. Well, you could, after all, firmware is just software that's written to a chip instead of the hard drive, and runs whatever processes/algorithms you want it to run, on whatever processors. But when new revisions of ZFS come out, it is not as easy to upgrade.

      By the way, this is assuming well implemented software raid. Crappy software raid will give you really crappy performance, obviously.

      As an example - Cisco always used low powered CPUs (compared to their competitors) and always marketed their "we have specialized hardware ASICS for switching and routing" because they were too cheap to use real CPUs. And so what happens when you cross the threshold for whatever they designed the ASICs to hold/process? Your performance dives off a cliff. Or what happens when you implement IPSec? Or IPv6? Well, buy more Cisco equipment I guess. Cisco's CEO thanks you: http://etherealmind.com/poster-reassuringly-expensive/

      Now go watch this: http://video.google.com/videoplay?docid=-6304964351441328559# at :34'50" they tell you software raid gave them a 20% to 30% increase over hardware raid.

      This old dog has learnt to go with the times. You should too.

    106. Re:Why? by the_B0fh · · Score: 1

      And by the way, you should really check out how ZFS works, and how it overcomes all the RAID issues (atomicity, RAID hole, write when power loss, etc).

    107. Re:Why? by the_B0fh · · Score: 1

      By the way, you should really check out what Intel says as well:

      http://www.theregister.co.uk/2010/09/15/sas_in_patsburg/

    108. Re:Why? by the_B0fh · · Score: 1

      Just spoke to an EMC Solutions Architect, you know, the guy they sent round to all those Fortune 100 companies to put an EMC solution in place...

      I asked my friend:
      Does EMC use hardware raid, or is it all software, but implemented via
      firmware? For what it does, it just feels weird to me that it's
      "hardware raid" like some dell PERC card.

      Reason I'm asking - someone keeps insisting it is hardware raid and
      that really surprised me. Does it mean that Sun's new storage
      devices, the 7000 series is considered hardware raid if they put
      solaris in firmware...? :)

      EMC Solutions Architect responded:
      It's essentially software, but it appears as hardware to whatever
      accesses the storage. If you have the right access to the system then
      you can rebuild the RAID sets, make storage assignments, etc. But,
      this is all done on the back end using the array's own operating
      system and controllers.

    109. Re:Why? by dbIII · · Score: 1
      I just got around to reading your posts, and you have your answer in something you've posted if you just stop to consider and understand:

      using the array's own operating system and controllers.

      Also just because gold plated Oracle boxes can do the job does not mean that there is no space for cheaper hardware with purpose built controllers, and I'd say it's still the majority of the space and it includes those EMC devices. Of course their hardware also has software to make it run but so do the RAID cards. Also you've failed to address that getting things off the disk is only half the story - you've got to get it out to where it can be used and sometimes it doesn't make sense to have your CPUs working hard on both NFS and RAID calculations at the same time. When things are I/O bound it becomes just a bit important and in non-trivial situations things are going to get I/O bound at some point.
      One last thing - did you even READ that blog you put up as an example? WTF is a number with no context, not even the number of disks in the array, supposed to tell me about anything?

      In short, I consider your blanket statement above to be bullshit easily disproved by several things I've posted above. The thing about blanket statements is little edge cases do not prove them so I really don't give a shit about your gold plated Oracle box when things can be done almost as well at a fraction of the budget with the right choice of hardware.

    110. Re:Why? by the_B0fh · · Score: 1

      Sun box is gold plated? Then wtf is an EMC box?

      IO is a controller/interface issue. RAID is a parity calculation issue. They are not the same. If you can't understand that, you have serious issues. Calculating parity has nothing to do with IO. Don't understand why you don't get that.

      When things become IO bound, you need more controllers, not hardware raid. If you run out of cpu cycles, you get more cpu (either in the form of hardware raid, or your main cpu). As Intel itself says - the main cpu is now fast enough to calculate parity, to fill up the IO channel. And using less than 5% of the cpu.

      So, again, if your server can't handle software RAID (that Intel says is faster than hardware raid) because a piddly 5% more cpu usage kills your server, you don't know how to size servers.

  5. Hard Drive Testing by Anonymous Coward · · Score: 2, Informative

    In previous jobs, I've used the system of:
    Full Format, Verify, Erase, then a Drive fitness test.
    If there are errors on the media, the format, verify or erase pass will pick them up; the fitness test then checks the hardware.
    Hitachi has a Drive Fitness Test program.
    I have also used hddllf (hddguru.com)

    1. Re:Hard Drive Testing by sirsnork · · Score: 1

      ALL the drive manufacturers have drive testing software, WD's is called Data Lifeguard.

      Download the software from the drive manufacturer and run the extended test, which will do a full disk scan for bad sectors. Then do their full write test.

      If it passes those, it's good to go... until it fails

      --

      Normal people worry me!
  6. S.M.A.R.T. by NIN1385 · · Score: 1

    It's a joke. I've seen drives work fine for years with it showing imminent drive failure and I've seen drives die instantly with no warning given whatsoever.

    There is no perfect tool that I could name; each drive manufacturer makes their own, and there are numerous third party tools out there as well. My best advice is have them all and have them handy. One I use quite a bit is HDD Regenerator, a pretty thorough utility, but it takes some time to run.

    --

    If carrots got you drunk, rabbits would be fucked up. - Comedian Mitch Hedberg R.I.P. 03/30/68-2/24/05
    1. Re:S.M.A.R.T. by NIN1385 · · Score: 1

      Edit: HDD Regen is more aimed at repair just fyi.

      --

      If carrots got you drunk, rabbits would be fucked up. - Comedian Mitch Hedberg R.I.P. 03/30/68-2/24/05
    2. Re:S.M.A.R.T. by DigiShaman · · Score: 4, Informative

      S.M.A.R.T is a joke, but not in implementation. It's a joke because most HDD failures occur on the logic board. It's a known fix in data recovery services to simply swap out the PCB for another of the same vintage make/model/firmware rev. Though I have run tools such as HD Tune to view out-of-spec metrics and benchmarks. For example, I once had a user that reported that her workstation was running extremely slow. I suspected the drive was at fault and the graphs proved it, but technically it wasn't a failure. S.M.A.R.T would have flagged it if it was mechanical, but it wouldn't have if it was a controller issue. Now that may have changed with newer drives, but that's been my overall experience.

      --
      Life is not for the lazy.
    3. Re:S.M.A.R.T. by ethan0 · · Score: 1

      SMART is good for telling you when your drives do have problems that need addressing. It's not so great for giving you assurance that your drives do not have problems: consider a passing SMART result to be more of an "I don't know" than a "good". You should generally assume your drives can fail at any time. I don't think there's any way to reliably predict the sudden death of a drive.

    4. Re:S.M.A.R.T. by greed · · Score: 1

      I've seen masses of cabling issues that won't be reported by SMART, either.

      The symptom, at least on Linux, is logs full of stuff like this:

      kernel: ata12.01: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6
      kernel: ata12.01: irq_stat 0x03060002, device error via SDB FIS
      kernel: ata12.01: failed command: READ FPDMA QUEUED
      kernel: ata12.01: cmd 60/40:00:92:6d:06/00:00:07:00:00/40 tag 0 ncq 32768 in
      kernel: res 41/84:40:92:6d:06/00:00:07:00:00/00 Emask 0x410 (ATA bus error)
      kernel: ata12.01: status: { DRDY ERR }
      kernel: ata12.01: error: { ICRC ABRT }

      That one is actually a dodgy port replicator board--the drives never see the garbled command packets, so their CRC error count never moves.

      A consistent comm problem to the drive itself should result in at least some of the SMART counters moving, but they will NOT fail out the drive because there is no reliable evidence it is a drive problem. For those, re-seat the SATA/SAS cables, re-seat the HBA in the PCI/PCIe slot, replace the SATA/SAS cables, replace the HBA, replace the drive. In about that order: there's a lot of crappy cables on the market, and quality is independent of retail price.

      (I'd recommend having a test rig with a couple of different HBAs so you can determine which part is giving you grief; motherboard and a cheap PCI card is usually enough variety.)
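
      A rough sketch for spotting this pattern, assuming the kernel log uses the same "error: { ICRC ABRT }" wording as the excerpt above. The function name count_icrc_errors and the log path in the usage comment are made up for illustration; it prints each ATA port with its number of ICRC errors so that one noisy port or cable stands out.

```shell
#!/bin/bash
# Count link-level CRC errors per ATA port in a saved kernel log.
# Assumption: lines look like "kernel: ata12.01: error: { ICRC ABRT }";
# other kernels or drivers may word this differently.
count_icrc_errors() {
    local logfile=$1
    # Keep only the ICRC error lines, pull out the port name,
    # then print "<port> <count>" for each port.
    grep -E 'error: \{.*ICRC' "$logfile" \
        | grep -oE 'ata[0-9]+(\.[0-9]+)?' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Hypothetical usage (the log location varies by distro):
#   count_icrc_errors /var/log/kern.log
```

      If one port keeps climbing while the drive's own CRC counter stays flat, that points at the cable, backplane or HBA rather than the disk.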

    5. Re:S.M.A.R.T. by NIN1385 · · Score: 1

      True, although most IT departments and/or computer shops don't have a spare board for every make and model of HDD. We tried a few swaps at a shop I used to work at and it was successful I think one time, and that was pure luck that we had that exact PCB for that drive.

      --

      If carrots got you drunk, rabbits would be fucked up. - Comedian Mitch Hedberg R.I.P. 03/30/68-2/24/05
    6. Re:S.M.A.R.T. by CAIMLAS · · Score: 1

      I once had a drive which had no reported errors until one day, Sector 0 of the disk was unreadable. SMART would not work for some time prior to that (which I chalked up to a software failure, since the drive was performing fine). What I presume happened was the part of the disk where the SMART data and drive information is stored failed first.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
    7. Re:S.M.A.R.T. by cbope · · Score: 1

      SMART is nothing more and nothing less than a data logger for specific performance and failure metrics of a drive. It cannot predict when a drive becomes inoperable, but it can point you in the right direction when the drive is starting to have problems.

      I don't believe SMART in and of itself is a bad implementation, the problem is there are not any good tools to analyze the data on a running system continuously and predict reliability based on SMART data. At least none that I have found. Although most motherboard BIOSes can check for SMART errors on boot, I have never seen this used in a helpful way. More than likely, it is a token feature, and not really designed to catch serious errors before they become data recovery problems. Not to mention, it is utterly useless if you rarely shut down or reboot.

      Case in point: I have a 4-drive NAS at home, three drives in a RAID 5 volume, and one drive as a hot spare. The NAS is configured to run a quick SMART self-test once a day on every drive and a full test once a week. After a bit more than 6 months, drive 0 (first drive of the RAID 5 volume) was reporting a few read errors during both SMART tests, but no other indications of a serious drive problem. The NAS did not report the drive as failed, only that read errors were detected during SMART testing. Read/write performance of the RAID volume was unaffected (yes, I tested it). Since I had a hot spare and all the drives were still well within the warranty period, I decided to leave it alone* and see if errors start to increase or normal operation becomes affected, which would obviously trigger a RAID failure and switchover to the hot spare. It never happened, and a couple months later, the previously suspect drive now passes all SMART tests and the NAS is still running perfectly. It has been 3-4 months since it became "clean".

      My theory is the drive had a few bad sectors, which were remapped over time to new locations. But I have no hard data to back this up. The drives are still within warranty and if SMART errors appear again on the same drive, I will return it.

      * The other major reason to leave the drive in-use rather than return it immediately... this all happened at an inopportune time, a month or so after the Thai floods. These 1.5 TB drives suddenly were no longer available due to supply shortages, or price spikes meant that the drives I paid less than 70€ for each when new, were now over 200€ replacement cost (yes, they roughly tripled in price where I live). Due to the supply shortage, even a warranty replacement would have taken time, as replacement drives were simply not in the channel anymore.

    8. Re:S.M.A.R.T. by NotBorg · · Score: 1

      My theory is the drive had a few bad sectors, which were remapped over time to new locations. But I have no hard data to back this up.

      Do you still have the SMART data you collected? If the drive remapped the sectors, most likely it will show up in your SMART data as a variable named "Reallocated_Sector_Ct" or similar. This number should have increased as a result of such remapping.
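
      For anyone wanting to check this, a small sketch that pulls one attribute's raw value out of smartctl -A output. It assumes the usual ATA attribute table layout (name in column 2, raw value in column 10); smart_raw_value is a hypothetical helper name and /dev/sdX is a placeholder.

```shell
#!/bin/bash
# Print the raw value of one SMART attribute from `smartctl -A` output on stdin.
# Assumption: classic ATA attribute table, name in column 2, raw value in column 10.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Hypothetical usage on a real system:
#   smartctl -A /dev/sdX | smart_raw_value Reallocated_Sector_Ct
```

      Logging that number once a day would show whether the count rose around the time the read errors went away.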

      --
      I want this account deleted.
  7. scsi by rs79 · · Score: 2

    don't use consumer drives if you're concerned.

    see also http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/archive/disk_failures.pdf

    The Goog wrote a nice paper on hard drives.

    --
    Need Mercedes parts ?
    1. Re:scsi by na1led · · Score: 0

      That's total BS. SCSI Hard Drives can fail just as often as a cheap IDE Drive. Anything mechanical is prone to failure

      --
      -- By all means let's be open-minded, but not so open-minded that our brains drop out.
    2. Re:scsi by bacon.frankfurter · · Score: 1

      Error 404 (Not Found)!!1

      Google

      404. That’s an error.

      The requested URL /external_content/untrusted_dlcp/research.google.com/en/archive/disk_failures.pdf was not found on this server. That’s all we know.

    3. Re:scsi by Galactic+Dominator · · Score: 3, Insightful

      Perhaps an honest mistake, the link is broken. Second, evidence has shown SATA drives are more reliable than commercial/enterprise-grade drives. Only buy those if you don't like your money, or there is some clear advantage. That supposed advantage is not reliability, unless there is some sort of rapid replacement mechanism coming with the drive. Although replacement isn't reliability in my book.

      http://lwn.net/Articles/237924/

      --
      brandelf -t FreeBSD /brain
    4. Re:scsi by pak9rabid · · Score: 1

      Yes, but consumer-level drives are more prone to failure than their enterprise counterparts. It's a known fact that enterprise-level hard drives are built more reliably. If you don't believe me, then check this out.

      However, with proper redundancy one can still get away with using consumer-level drives with an acceptable level of risk.

    5. Re:scsi by Trubacca · · Score: 1

      Seems like sound advice. Thank you for tracking down and providing a functional link, it was a good read. Your post would have received a mod point if I had any!

    6. Re:scsi by CastrTroy · · Score: 1

      I wonder if this is at all due to the higher RPM of commercial/enterprise drives. Consumer drives usually top out at 7200 RPM, while I've seen enterprise drives that go as high as 15000 RPM. That difference is probably enough to account for a significant reduction in drive life. If the extra spin speed is really necessary, go for the faster drives, but in many cases, the faster speed won't get you much except decreased lifespan.

      --

      Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
    7. Re:scsi by na1led · · Score: 1

      The only reason those Enterprise Drives last longer is because they are used in Servers or Storage units that provide much better cooling and power regulation. Put an expensive hard drive in a cheap case and it will fail just as quickly as any other.

      --
      -- By all means let's be open-minded, but not so open-minded that our brains drop out.
    8. Re:scsi by LivinFree · · Score: 1

      Nope. They really are constructed and/or tested to higher standards. Now, you may have a poor experience putting high-end drives into a crap situation - that's not necessarily the drive's fault.

      If you start talking to a company that wants to sell you a lot of drives, and for relatively cheap (a couple of bucks per managed-RAIDed gigabyte), ask them about the duty cycle on the drives. A lot of consumer and even midline drives have duty cycles of less than 40% (put heavy read/write cycles on the disk for less than 40% of the time it is powered up). Enterprise drives are rated up to 100% utilization. If you take a midline drive and an enterprise drive of the same type (SAS, SATA, FC, etc.) and run them at full load head-to-head, statistically, the midline or consumer drives will fail sooner.

    9. Re:scsi by Anonymous Coward · · Score: 0

      So you've never been in a data center. There's a lot more vibration, usage, and heat in drives used in servers. We get on average nearly twice the life out of the same drives in an office PC. The office PC simply doesn't see the same usage as a server drive.

    10. Re:scsi by fnj · · Score: 1

      I have experience that supports that desktop drives last as long as enterprise drives, and my bullshit detector is off the scale for many of the claims made in that Intel document.

      It has become the fashion for manufacturers of late to specify "operational availability" and arbitrarily set this spec to 8x5 for "desktop" drives and 24x7 for "server drives". It is most likely bullshit. I personally have had dozens of "desktop" drives that have been operating very successfully 24x7 for a period of years, some in desktops and some in servers. Actually, turning drives on and off stresses them much more than leaving them on.

    11. Re:scsi by Carnildo · · Score: 1

      The fundamental difference between consumer and enterprise hard drives is the error-recovery strategy in the drive firmware.

      A consumer-grade drive, upon getting an unrecoverable ECC error while reading a sector, will make repeated tries at recovering the data, spending seconds or even minutes at it -- good for a drive operating on its own, but the delay could cause a storage array controller to mark the entire drive as failed. An enterprise-grade drive assumes it's part of some sort of storage redundancy array and will give up quickly, letting the higher-level controller recover the data from the rest of the array -- much faster, but less reliable if the drive is operating on its own.

      --
      "They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
    12. Re:scsi by Anonymous Coward · · Score: 0

      A quick look at the table in the first paper does tend to suggest that the deployment date and the number of disk events have some kind of correlation. Perhaps the average reliability of disks and systems has increased?

  8. standard tools by Anonymous Coward · · Score: 1

    I look at the SMART data, then I run "fsck -f -c" to test all blocks on the drive, then I look at SMART data again to see if there have been any read errors or remapped sectors. Next, I run dd if=/dev/zero of=/dev/sdx (where sdx is the new drive), to write all sectors. I look at the SMART data again, and repeat the fsck/dd commands as many times as I need. This can easily be scripted, and you can do some random writing as well to exercise the drive's seek characteristics.

  9. ZFS and a stress test by Anonymous Coward · · Score: 0

    ZFS is focused mainly on integrity, so just set copies to 3 (the maximum), with checksumming, and stress it by filling it up with files, an occasional scrub, and so forth. If there's a problem, zfs will report it.

  10. Memtest is fine by StoutFiles · · Score: 0

    Then have ghosting software auto backup periodically.

  11. Testing it for what exactly by jeffmeden · · Score: 0

    You can test that the drive works pretty easily, put it in a PC, copy a bunch of files to it (perhaps enough to fill it up), then run MD5 on those files vs the originals. That would be the "pedantic" way to test it, for "turbo-pedantic" (a bit like running memtest for 72 hours) you can test this way for your entire MP3 collection, then test again for your entire Quantum Leap upscaled 720p dvdrips collection.
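
    A minimal sketch of that copy-and-compare check, assuming GNU coreutils (md5sum with -c and --quiet). The helper names checksum_tree and verify_tree and the paths in the usage comment are invented for illustration.

```shell
#!/bin/bash
# Write one "md5  ./relative/path" line for every file under a directory.
checksum_tree() {
    ( cd "$1" && find . -type f -exec md5sum {} + )
}

# Check the files under a directory against a saved checksum list.
# The list path must be absolute because the check runs from inside "$1".
# Returns non-zero if any file is missing or differs.
verify_tree() {
    ( cd "$1" && md5sum --quiet -c "$2" )
}

# Hypothetical usage:
#   checksum_tree /data/originals > /tmp/originals.md5
#   cp -a /data/originals/. /mnt/newdisk/
#   verify_tree /mnt/newdisk /tmp/originals.md5
```

    One caveat: a read-back straight after the copy may be served from the page cache rather than the platters, so unmounting and remounting the drive before verifying is probably worth the extra step.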

    For more practical testing, most drive manufacturers offer "validation" software tools for RMA purposes to test low-level operations and performance, and most of them are generic to the extent that you can actually test any make of drive the same way. It's free and it works, what more could you ask for?

    1. Re:Testing it for what exactly by Anonymous Coward · · Score: 0

      That's crazy. You don't upscale while ripping, you do it during playback.

  12. Jet Stress by jader3rd · · Score: 2

    Jet Stress does a good job of running the storage media through a lot of work.

  13. Don't waste your time by colin_faber · · Score: 1

    If you're concerned about drive performance and reliability don't waste your time on off-the-shelf junk. Buy actual enterprise class drives from distributors which pay many dollars to have each and every drive tested for both performance and reliability in varying environmental conditions.

  14. Try Hitachi Drive Fitness Test (DFT) by Anonymous Coward · · Score: 0

    Has several levels of testing including a full blown exerciser. I've found it very effective for detecting the slightest drive problems. It's available for download from multiple sources.

  15. The usual by macemoneta · · Score: 5, Informative

    All I usually do is:

    1. smartctl -AH
    Get an initial baseline report.

    2. mke2fs -c -c
    Perform a read/write test on the drive.

    3. smartctl -AH
    Get a final report to compare to the initial report.

    If the drive remains healthy, and error counters aren't incrementing between the smartctl reports, it's good to go.
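
    The before/after comparison can be sketched roughly like this, assuming both reports were saved from smartctl -A and use the usual ATA attribute table (name in column 2, raw value in column 10). smart_increased and the file names are hypothetical.

```shell
#!/bin/bash
# List SMART attributes whose raw value grew between two saved `smartctl -A` reports.
smart_increased() {
    local before=$1 after=$2
    awk '
        # Attribute rows start with a numeric ID; column 2 is the name, column 10 the raw value.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            if (FNR == NR) { base[$2] = $10; next }      # first file: remember the baseline
            if (($2 in base) && ($10 + 0 > base[$2] + 0))
                print $2, base[$2], "->", $10            # second file: report growth
        }
    ' "$before" "$after"
}

# Hypothetical usage:
#   smartctl -A /dev/sdX > before.txt
#   ... run the read/write test ...
#   smartctl -A /dev/sdX > after.txt
#   smart_increased before.txt after.txt
```

    Counters such as Power_On_Hours rise on every run, so the ones to watch are the error and reallocation counts.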

    --

    Can You Say Linux? I Knew That You Could.

    1. Re:The usual by ChrisMaple · · Score: 1

      I recently ran mke2fs -c -c on a 2TB USB3 drive. It took 48 hours to complete. As drives get bigger, this is going to take even longer.

      --
      Contribute to civilization: ari.aynrand.org/donate
    2. Re:The usual by macemoneta · · Score: 1

      A 2-day burn-in is minimal for new hardware. Also, as drives get bigger, they also get faster: IDE, SATA1, SATA2, SATA3, Thunderbolt, ...

      --

      Can You Say Linux? I Knew That You Could.

    3. Re:The usual by jmorris42 · · Score: 1

      > I recently ran mke2fs -c -c on a 2TB USB3 drive. It took 48 hours to complete.

      I suspect USB is the problem. I can run a full raid parity check on six 1TB drives in a RAID5 in a few (3-7) hours, while backups are flowing to them from the network. Something else you might look at is RAM usage, if it is puny enough mke2fs might be swapping?

      That said, you do have a kernel of truth in your statement, improvement in read/write times hasn't kept pace with capacity growth now for years. And seek times are pretty much stuck around where they were a decade ago. It is a problem. People are working on it.

      --
      Democrat delenda est
    4. Re:The usual by Hobart · · Score: 1

      1. smartctl -AH
      Get an initial baseline report.

      2. mke2fs -c -c
      Perform a read/write test on the drive.

      3. smartctl -AH
      Get a final report to compare to the initial report.

      "mke2fs -c -c" is running badblocks -s -w for you.

      If you want more to stare at, you can also add -v , or specify your own test patterns with (multiple) -t options. ( -t 0xCAFEBABE -t 0xDEADBEEF or whatever)

      Badblocks does fill a full disk with the pattern, then read it all back confirming no changes.
      This does miss flaky devices that, for example are writing over other parts of themselves. (Fake USB flash drives that misreport their size have been known to do this.)

      Not sure what a good test would be... first thing that comes to mind is:

      1. openssl enc -rc4 -nosalt -K 0 -iv 0 < /dev/zero > /dev/sdXX
      2. openssl enc -d -rc4 -nosalt -K 0 -iv 0 < /dev/sdXX | tr -d '\0' | wc -c (should return '0' w/o errors)
      --
      o/~ Join us now and share the software ...
  16. Take a hint by Anonymous Coward · · Score: 0

    Disappear for 4 months and come back and say they are good. Even if you test there is no reason that hardware can't fail at any point after the test. That's why we buy redundancy and support.

  17. Hitachi DFT by Anonymous Coward · · Score: 0

    I've always had good results with the Hitachi Drive Fitness Test. Works fine with non-Hitachi drives too.

  18. This is what I use by Wolfrider · · Score: 3, Interesting

    root ~/bin # cat scandisk
    #!/bin/bash

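    # RW scan of HD. Usage: scandisk sdX (device name without the /dev/ prefix)
    if [ -z "$1" ]; then
        echo "usage: $0 <device, e.g. sdb>" >&2
        exit 1
    fi
    argg="/dev/$1"

    # if IDE (old kernels)
    hdparm -c1 -d1 -u1 "$argg"

    # Speedup I/O - also good for USB disks
    blockdev --setra 16384 "$argg"
    blockdev --getra "$argg"

    # -n is the non-destructive read-write test; -f forces the run even if the
    # device is mounted, so drop -f unless you are sure that is safe.
    #time badblocks -f -c 20480 -n -s -v "$argg"
    #time badblocks -f -c 16384 -n -s -v "$argg"
    time badblocks -f -c 10240 -n -s -v "$argg"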

    ---------

    Note that this reads existing content on the drive, writes a randomized pattern, reads it back, and writes the original content back. With modern high-capacity over-500GB drives, you should plan on leaving this running overnight. You can do this from pretty much any linux livecd, AFAIK. If running your own distro, you can monitor the disk I/O with ' iostat -k 5 '.

    From ' man badblocks '
    -n Use non-destructive read-write mode. By default only a non-destructive read-only test is done. This option must not be combined with the -w option, as they are mutually exclusive.

    --
    .
    == WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
  19. QuickTech Pro by Anonymous Coward · · Score: 0

    http://www.uxd.com/qtpro.shtml

    This is, by far, the best hard drive (and hardware) testing suite I've ever used.

  20. Do a Surface Scan by na1led · · Score: 1

    There are many free tools for doing surface scans of a hard drive that test for bad sectors. Usually it's bad sectors that cause Hard Drives to fail, and that's all you can really test anyway. Hard Drives will fail, it's just a matter of time, that's why you need redundancy. Other than testing, keeping the System Cool, and Dust Free is all you can do.

    --
    -- By all means let's be open-minded, but not so open-minded that our brains drop out.
  21. It isn't worth it. by Anonymous Coward · · Score: 0

    Modern failure modes tend to be catastrophic, you won't find bad sectors on a hard drive these days. The drives have so much error correcting and sector re-mapping that the very act of writing to a bad portion of the platter will silently correct and remap the sector. The main way you can see failures is to write data, do not read it for a *long* time, then get a read failure. Plus the initial part of the bathtub curve is in months not days, so testing for reliability is really not something you can do.

    1. Re:It isn't worth it. by Anonymous Coward · · Score: 0

      Found one last week on a WD1001FALS-00E8B0. It's being replaced on RMA now...

  22. Hard Disk Sentinel by prestonmichaelh · · Score: 3, Insightful

    Hard Disk Sentinel: http://www.hdsentinel.com/ is a great tool. They even have a free Linux client. What it does over SMART is that it takes the SMART data and weights them according to indications of failure, then gives you a score of 0-100 (100 being great, 0 being dead) as to how healthy the drive is. We use this extensively and have created NAGIOS scripts that monitor the output. If a drive has a score of 65 or higher, I will generally continue using it (pretty much all my setups are RAID 10 or RAID 6). If the score starts dropping rapidly (a few points every day, even if it started high) or gets below 65 or so, I go ahead and replace it. It has helped out a bunch.
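
    The replacement rule above could be sketched as a Nagios-style check along these lines. This is not Hard Disk Sentinel's own interface: the score is simply passed in as a number, check_health is a made-up name, and the 3-point daily drop is an assumption standing in for "a few points every day".

```shell
#!/bin/bash
# Map a 0-100 health score to a Nagios-style status line and exit code.
# Nagios plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
check_health() {
    local score=$1 previous=$2
    local floor=65 max_daily_drop=3   # both thresholds are assumptions
    if [ "$score" -lt "$floor" ]; then
        echo "CRITICAL - health $score is below $floor"
        return 2
    elif [ $((previous - score)) -ge "$max_daily_drop" ]; then
        echo "WARNING - health fell from $previous to $score"
        return 1
    fi
    echo "OK - health $score"
    return 0
}

# Hypothetical usage, with today's and yesterday's scores:
#   check_health 82 83
```

    Treating a fast drop as a warning even when the score is still high matches the rule of replacing drives that slide a few points a day.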

    Even with that, using the SMART data, in a SMART way, still only predicts about 30% of failures. The other 70% will come out of nowhere. That is why it is best to assume all drives will die at any time and are suspect and never allow a single drive to be the sole copy of anything.

  23. Think Performance - IOZONE by humphrm · · Score: 1

    When it comes to media, even with SMART your drives will work 'till they die, and there's no way to predict that with a test.

    Given that, your best option is to ensure that the drives are performing as expected. I've found many a faulty drive with IOZONE.

    http://www.iozone.org/

    --
    -- "In order to have power, I must be taken seriously." -Mojo Jojo
  24. old timers look here by vlm · · Score: 1, Interesting

    OK so that was the noob version of the question.

    I have a question for the old timers. has anyone ever implemented something like:
    1) log the time and temp
    2) do a run of bonnie++ or a huge dd command
    3) log the time and temp
    4) Repeat above about ten times
    5) numerical differentiation of time and temp and also any "overtemps"

    In theory run from a cold or lukewarm start that could detect a drive drawing "too much" current or otherwise being F'd up, or cooling fan malfunction
    I'm specifically looking for rate of temp increase as in watts expended, not just static workload temp.
    In practice it might be a complete waste of time.
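
    The numerical differentiation step could look something like this sketch. It assumes the logging steps have already produced lines of "epoch_seconds temperature_C"; temp_rate is a made-up name, and the Temperature_Celsius attribute in the comment is only what many drives happen to report.

```shell
#!/bin/bash
# Read "epoch_seconds temperature_C" samples on stdin and print the rate of
# temperature change, in degrees C per minute, between consecutive samples.
temp_rate() {
    awk 'NR > 1 && $1 > t { printf "%.2f\n", ($2 - c) * 60 / ($1 - t) }
         { t = $1; c = $2 }'
}

# Hypothetical way to collect one sample (attribute name varies by drive):
#   echo "$(date +%s) $(smartctl -A /dev/sdX | awk '$2 == "Temperature_Celsius" {print $10}')"
#
# Example: printf '0 30\n60 32\n180 33\n' | temp_rate
# prints 2.00 then 0.50
```

    A drive or fan that is misbehaving would presumably show a steeper slope under the same load than its identical neighbours.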

    Another one might be something like a smart reported temp vs iostat reported usage plotted on a scatterplot graph.

    So the old timer question is has anyone ever bothered to implement this, and if so, did it do anything useful other than pad your billable hours?

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    1. Re:old timers look here by na1led · · Score: 1

      Seems like a big waste of time to me.

      --
      -- By all means let's be open-minded, but not so open-minded that our brains drop out.
    2. Re:old timers look here by prestonmichaelh · · Score: 1

      If the temps are in "operating ranges" which run higher than you might think (check with the hard drive manufacturer for specs), temperature doesn't correlate to drive failure:

      http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/#more-337
      Look for the "lessons learned" section in that link.

    3. Re:old timers look here by Mister+Liberty · · Score: 1

      Sounds interesting. For the oldtimer you claim to be you
      really should have performed said tests and analyses
      long time ago already.

    4. Re:old timers look here by vlm · · Score: 1

      Ahh but the delta-temp over delta-time, assuming identical hardware, is a direct measurement of cooling capacity.

      --
      "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    5. Re:old timers look here by marcosdumay · · Score: 1

      Yep, but how do you decide that the cooling capacity is too low? Remember that it doesn't have a linear relationship with the temperature, and doesn't come on the manual.

      You'll really have to test a batch of disks, correlate (cooling power, temperature) with longevity. And then, when your disks are already obsolete, you'll have enough data to gather conclusions from that test.

    6. Re:old timers look here by SinShiva · · Score: 1

      are you concerned about... i dunno, friction? ambient temperature? don't store your drives in snow or furnaces and stop overclocking your megabytes

    7. Re:old timers look here by CAIMLAS · · Score: 1

      Nothing too scientific, mostly just seat of the pants engineering. "This is running hot, it needs more fans". If something gets close to over temp while running at maximum (eg. the dd) then you don't have enough cooling. Good engineering of mechanical/electrical needs at least a 30% overhead for worst case, and 50% over likely for thermal or electrical maximums.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
  25. Are you testing an array or individual drives? by HockeyPuck · · Score: 4, Insightful

    I manage a team that oversees PB of disk, both within an enterprise array and internal to the server. For testing the arrays, since there's GB of cache in front of the disks, I can only rely on the vendor to do the appropriate post installation testing to make sure there are no DOA disks. For internal disks, as others have mentioned you could run IOMeter for days without a problem and then the very next day it's dead. Unlike memory, disks have moving parts that can fail much easier than chips. However, with proper precautions like RAID, single disk failures can be avoided.

    The bigger problem is having a double disk failure. This is due to the amount of time required to rebuild the failed disk. Back when disks were 100GB this was a "relatively" quick process. However, in some of my arrays with 3TB drives in them, it can take much longer to replace the drive. Even to the point whereby having hotspares has been considered to be not worth it as my array vendor will have a new disk in the array within 4hrs. With what an enterprise disk costs from the array vendor (not Frys), it can start to add up.
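
    To put rough numbers on the rebuild window, here is a back-of-the-envelope sketch. The 50 MB/s sustained rebuild rate is purely an assumption for illustration (real arrays under load may be far slower), and rebuild_hours is a made-up name.

```shell
#!/bin/bash
# Estimate a best-case rebuild time, in whole hours rounded up,
# for a drive of a given size at a given sustained rebuild rate.
rebuild_hours() {
    local size_gb=$1 rate_mb_s=$2
    # seconds = size in MB / rate in MB per second
    local seconds=$(( size_gb * 1000 / rate_mb_s ))
    echo $(( (seconds + 3599) / 3600 ))
}

# With the assumed 50 MB/s rate:
#   rebuild_hours 100 50    prints 1
#   rebuild_hours 3000 50   prints 17
```

    The array is exposed to a second failure for that whole window, which is why the jump from 100GB to 3TB drives matters so much.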

    1. Re:Are you testing an array or individual drives? by na1led · · Score: 1

      It depends on how critical your data is. There are many different types of Raids, like Raid 6, 10, or 50, and if you have a good storage unit with redundant controllers and a couple of hot spares, you should be all set. The chances of a total failure of multiple drives/controllers is very slim, and that's what nightly backups are for anyway. We use a Dell PowerVault MD 3220 - Dual Controllers, Dual Power Supplies, Raid 50 with 2 hot spares. Chances of losing data from Hard Drive failure is like winning the lottery.

      --
      -- By all means let's be open-minded, but not so open-minded that our brains drop out.
    2. Re:Are you testing an array or individual drives? by Anonymous Coward · · Score: 1

      > However, in some of my arrays with 3TB drives in them, it can take much longer to replace the drive. Even to the point whereby having hotspares has been considered to be not worth it as my array vendor will have a new disk in the array within 4hrs.

      Two things for you to consider:

      1. Are you that confident that your vendor will always meet the four hour SLA? I've worked with a myriad of vendors (IBM, Dell, HP, EMC, Hitachi, etc) and there has been multiple times they couldn't meet the SLA. In a few cases it took days for the replacement drive to arrive.

      2. In some situations it may be possible to have the hot spare permanently replace the failed drive, the new drive your vendor delivers could be the new hot spare. This means your RAID volume will be in a degraded state for less time, which is better for performance. It also reduces the risk of a double failure, even though you believe the risk is minimal to begin with.

    3. Re:Are you testing an array or individual drives? by HockeyPuck · · Score: 1

      1. Are you that confident that your vendor will always meet the four hour SLA? I've worked with a myriad of vendors (IBM, Dell, HP, EMC, Hitachi, etc) and there has been multiple times they couldn't meet the SLA. In a few cases it took days for the replacement drive to arrive.

      I've got a contract with the vendor that they stock spare parts onsite, so their only requirement is to get the engineer onsite to replace the part. Also given that my datacenters aren't in the middle of the Oklahoma prairies, there are parts depots in the area.

    4. Re:Are you testing an array or individual drives? by Anonymous Coward · · Score: 0

      In what universe is having such a high level of vendor support (response time, parts on site) cheaper than having sufficient hot spares?

    5. Re:Are you testing an array or individual drives? by CAIMLAS · · Score: 2

      If you've got 3TB drives in use in RAID, you're a fool for not running double parity. Like you said, the time required is just too long; you need to be able to survive a 2-disk failure.

      --
      ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
  26. badblocks by Anonymous Coward · · Score: 0

    I generally run 'badblocks' (included in most Linux distributions).

  27. Reliability and fault-tolerance by Mondragon · · Score: 5, Informative

    Not completely related to how to test, but...

    In 2007 Google reported that for a sample of 100k drives, only 60% of their drives with failures had ever encountered any SMART errors. Also, NetApp has reported a significant number of drives with temporary failures, such that they can be placed back into a pool after being taken offline for a period of time and wiped. Google also had a lot of other interesting things to say (such as heat has no noticeable effect on hard drive life under 45C, that load is unrelated to failure rates, and that if a drive doesn't fail after 3 months, it's very unlikely to fail until the 2-3 year timeframe).

    You can find the google paper here: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf

    A few other notes that you can find from storage vendor tech notes if you own their arrays:
      * Enterprise-level SAS drives aren't any more reliable than consumer SATA drives
        - But they do have considerably different firmwares that assume they will be placed in an array, and thus have a completely different self-healing scheme than consumer-level drives (generally resulting in higher performance in failure scenarios)
      * RAID 5 is a really bad idea - correlated failures are much more likely than the math would indicate, especially with the rebuild times involved with today's huge drives
      * You have a lot more filesystem options that might not even make sense to use with a RAID system, like ZFS, as well as other mechanisms for distributing your data at a layer higher than the filesystem

    Ultimately the reality is that regardless of the testing you put them under, hard drives will fail, and you need to design your production system around this fact. You *should* burn them in with constant read/write cycles for a couple days in order to identify those drives which are essentially DOA, but you shouldn't assume any drive that passes that process won't die tomorrow.

    1. Re:Reliability and fault-tolerance by n4djs · · Score: 1

      All hard drive errors boil down to how many failed bits occur on the raw, pre-ECC-corrected media vs. the calculated post-ECC return. A hard failure is one that exceeds the span of possible corrections. Most hard block failures should be correctable by sparing of the media block in question. If you get too many non-correctable errors, it is indicative that the electronics or the heads have died... which in practice turns out to be a cataclysmic failure where the drive totally fails on a subset of reads (i.e. one surface is no longer accessible). I would think a better way to test a drive would be to perform long reads (data + ECC), programmatically calculating ECC and determining the number of bits in error, and then performing sparing of the problematic tracks (if supported by the command set of the drive - SCSI does this, I don't know if ATA drives allow sparing to occur in all cases.) Of course, a simple 'dd if=/dev/sda of=/dev/null bs=1024k' may be just as effective in the long run...

  28. rsync -nc by Anonymous Coward · · Score: 0

    I mirror data and test it periodically with rsync using the dry-run (-n) and checksum options (-c) to do a full comparison. I usually have more confidence in a new disk after I've done this a few times.

  29. StorageMojo by Anonymous Coward · · Score: 0

    Read this for more info on disk storage

    http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/

  30. rsync -nc by patniemeyer · · Score: 1

    I mirror data and test it periodically with rsync using the dry-run (-n) and checksum options (-c) to do a full comparison. I usually have more confidence in a new disk after I've done this a few times.
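
    For anyone without rsync handy, the same kind of full comparison can be sketched in Python. This is not what rsync does internally: it swaps the checksum comparison for a byte-for-byte compare via the standard library's filecmp, and the function name compare_trees is made up for illustration. Like the dry run, it changes nothing on either side.

```python
import filecmp
from pathlib import Path


def compare_trees(source: Path, mirror: Path) -> list[str]:
    """List files under `source` that are missing from, or differ in, `mirror`."""
    problems = []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src_file.relative_to(source)
        dst_file = mirror / rel
        if not dst_file.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        # shallow=False forces a full content read instead of trusting size/mtime
        elif not filecmp.cmp(src_file, dst_file, shallow=False):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

    An empty list means every source file was read back from the mirror and matched. Extra files that exist only on the mirror are not reported by this sketch.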

  31. Lithophobia by macraig · · Score: 1

    I have a favorite boulder that has served my burn-in testing needs pretty well. Would you like a photo so you can chisel your own? I added some LED bling to mine.

  32. Can't even believe this made it to the front page. by Ironhandx · · Score: 1

    I mean, really, someone working at slashdot doesn't know this? This is about as basic a question as it gets when it comes to hardware.

    Raid 5, a solid backup scheme, and a storage closet full of replacement drives. There is no good way to test HDDs.

    Due entirely to the fact that they are a WEAR item it is only possible to decide which brand you trust the most.

    Other than that, if its a big job and a lot of HDDs are going to be bought you could take 10 of each candidate drive and run them through spinrite till they fail. One that fails last wins.

    SSDs are still too expensive to be using for mass storage, and should be treated the exact same way even if you ARE using them for mass storage regardless, as they are also a wear item, and will fail without warning after so many read/write operations, the same as a HDD.

  33. ZFS by Anonymous Coward · · Score: 0

    IIRC zfs supports online "data scrubbing" http://en.wikipedia.org/wiki/Data_scrubbing#In_RAID . This, combined with the other features of zfs can help you prevent data loss.

  34. IOmeter by jd2112 · · Score: 1

    The storage team where I work uses a program called IOmeter. It runs on Windows and Linux (and I think other platforms as well) and is open source. They were using it for stress testing SAN storage but I think it would work for locally attached storage as well.

    --
    Any insufficiently advanced magic is indistinguishable from technology.
  35. testing on linux: use spew by David+Muir+Sharnoff · · Score: 1

    http://linux.die.net/man/1/spew

    While you can't predict against future failures, if you want to make sure that your drive media is okay today, there is a tool that will fill your disk with garbage and then verify that your disk has the right garbage on it: spew. Spew isn't the friendliest tool, but it does the job.

    As a side effect, it stresses your I/O systems and memory. Years ago, I discovered that some Dell 2550's I had couldn't pass this test with the SATA controller I had shoved into them that seemed to work fine otherwise.

  36. UnRAID Preclear Script by Jumperalex · · Score: 3, Informative

    http://lime-technology.com/forum/index.php?topic=2817.0 ... the main feature of the script is
    1. gets a SMART report
    2. pre-reads the entire disk
    3. writes zeros to the entire disk
    4. sets the special signature recognized by unRAID
    5. verifies the signature
    6. post-reads the entire disk
    7. optionally repeats the process for additional cycles (if you specified the "-c NN" option, where NN = a number from 1 to 20, default is to run 1 cycle)
    8. gets a final SMART report
    9. compares the SMART reports alerting you of differences.

    Check it out. Its "original" purpose was to set the drive to all "0's" for easy insertion into a parity array (read: parity drive does not need to be updated if the new drive is all zeros) but it has also shown great utility as a stress test / burn-in tool to detect infant mortality and "force the issue" as far as satisfying the criteria needed for an RMA (read: sufficient reallocated block count)

    If your skill level is enough to adapt the script to your own environment then great, otherwise UnRaid Basic is free and allows 3 drives in the array, which should allow you to simultaneously pre-clear three drives. You might even be able to pre-clear more than that (up to available hardware slots) since you aren't technically dealing with the array at that point, but with enumerated hardware that the script has access to, which should be everything on the disc. Hardware requirements are minimal and it runs from flash.

    --
    If you can't be good, be good at it!
    1. Re:UnRAID Preclear Script by aarongadberry · · Score: 1

      I use this on all new drives. I run it twice or three times. It has saved me from a failure more than once.

  37. Storage Unit is more important than the Drives by na1led · · Score: 1

    A good Storage Unit will do a good job at maintaining the Hard Drives you purchased and keeping them safe. They can also handle problems with drives to prevent data loss, and notify you when a drive is about to fail. A good SAN or DAS is what I would purchase. We purchased a Dell Powervault DAS, and have been very happy with it, I never worry about Hard Drives failing because I know the DAS will take care of it. Some companies like Dell and EMC will know if you have a bad hard drive, and ship you a new one before you realize it.

    --
    -- By all means let's be open-minded, but not so open-minded that our brains drop out.
  38. always good to do a full write with read verify by Anonymous Coward · · Score: 0

    It's always good to do a full write with read verify on new media. For my own peace of mind, I wrote a Java application that fills a drive with pseudo-random data and then reads it back to make sure (1) the data is correct, and (2) the entire drive capacity can be accessed. Use this in addition to the many good hardware diagnostic tools (see other comments). As has been pointed out, this only tells you that the drive is working now, but can't predict when it will fail.

    BLATANT ADVERTISEMENT: The Java program has been released under the GPL and can be found here (Linux, MacOS, Windows, etc): http://linux.softpedia.com/get/System/System-Administration/Erase-Disk-46749.shtml
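
    The write-then-verify idea itself is small enough to sketch. The following is not the Java program above, just a minimal Python illustration with a made-up function name, and it assumes Python 3.9+ for randbytes. It targets a scratch file on the drive under test; pointing it at a raw device would destroy whatever is on it. Note also that a read immediately after a write may be served from the OS page cache rather than the media, so a real test needs to write far more than RAM or bypass the cache.

```python
import os
import random


def write_and_verify(path: str, total_bytes: int, seed: int = 1234,
                     chunk_size: int = 1 << 20) -> bool:
    """Fill `path` with seeded pseudo-random bytes, then read back and compare."""
    # Write pass: stream reproducible pseudo-random data to the target.
    rng = random.Random(seed)
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(rng.randbytes(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device

    # Verify pass: regenerate the same stream from the same seed and compare.
    rng = random.Random(seed)
    remaining = total_bytes
    with open(path, "rb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            if f.read(n) != rng.randbytes(n):
                return False
            remaining -= n
    return True
```

    Because the data comes from a seeded generator, nothing has to be kept in memory between the two passes, and a short read near the end of the drive shows up as a mismatch.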

  39. vendors solved this problem years ago by alen · · Score: 1

    we use HP servers and HP ships a suite of software to install on the server along with the OS. they monitor the hardware and warn you of any problems. unless you like doing things the hard way, this was solved years and years ago

    i have a bad hard drive i call HP, send them a log file and in 2 hours i have a new one delivered

  40. Re:Can't even believe this made it to the front pa by Anonymous Coward · · Score: 0

    Don't use raid 5 (write hole).
    use 10 or simple mirroring.
    (raidz should be ok).
    Use the money saved on 2 expensive raid controllers (One fails you are stuffed if you cannot get another which is by no means guaranteed) to buy more disks.

  41. Testing is expensive R.A.I.D. is cheap by Anonymous Coward · · Score: 0

    Just RAID your storage or better connect to a SAN and be done with it.

  42. Hitachi DFT by mr.bri · · Score: 1

    Hitachi's (previously IBM's) Drive Fitness Test is the most thorough disk test I've used. It works on all makes, and has a "drive exerciser" that can loop a test sequence.

    I've seen it find problems with drives that the manufacturer's own tools don't expose.

    My policy is that if a drive survives 20 loops of the exerciser and then a full extended test that it's fit for production service.

  43. Testing hardware by exercising by cvtan · · Score: 1

    Testing hardware by exercising it is like testing matches to see if they are good.

    --
    Sorry, but gray text on gray background is making my eyes bleed.
  44. H2TestW - in particular for (often fake) USB media by D4C5CE · · Score: 1

    http://www.heise.de/download/h2testw.html - switchable to English of course.
    While it is primarily advertised for flash media these days (and indispensable since there have been numerous forgeries or DOAs at least on the European market lately), it evolved as an HDD tester in the first place.

    On Linux in particular, a combination of dd and smartctl (before&after writing the entire disk, as well as for self-tests) may come in handy too, of course.

  45. Testing Drives. by Rashkae · · Score: 1

    It takes a while, but if you really want to be sure of your hardware (as sure as you can be, at least.)

    Check the SMART status. If there are any re-allocated sectors, make note of the number.

    Run badblocks with the -w switch against the drive (from a Linux live cd of your choice, for example)

    That should completely read/write test the drive 4 times with multiple patterns. There should be no errors reported. This test will take longer than overnight on modern drives.

    Check the SMART data again. Be wary if there has been an increase in re-allocated sectors. Such an increase is considered normal and does not constitute drive failure. However, most drives should not have any reported re-allocations so early in life, and this may indicate you have a drive of marginal quality.

    Do not try this on SSD drives.

  46. hire someone who knows what they are doing.. by issicus · · Score: 1

    or read a book

  47. Ears by Maximum+Prophet · · Score: 3, Informative

    Most everything above is good, but don't overlook the obvious. Spin the drive up in a quiet room and listen to it. If it sounds different from all the other drives like it, there's a good chance something is wrong.

    I replaced the drive in my TiVo. The 1st replacement was so much louder, I swapped the original back, then put the new drive in a test rig. It started getting bad sectors in a few days. RMA'd it to Seagate, and the new one was much quieter.

    --
    All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
  48. Just sent it to me by damn_registrars · · Score: 1

    Send your drives to me, postage paid. I'll test them for you for no charge, and send them back to you before the warranty expires.

    --
    Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
  49. Testing methods by thewiz · · Score: 1

    Or any tried and tested methods for testing storage media?
    #2 pencil, Scantron sheet and the test.

    Works for me every time.

    --
    If "disco" means "I learn" in Latin, does "discothèque" mean "I learn technology"?
  50. h2benchw by Anonymous Coward · · Score: 0

    There's a program called h2benchw and that's great for testing hard drive.

    There might be a few versions floating around, but it was published by the German magazine "C't". I use it to test my hard drives because it's the ONLY one that I've found that actually has a "swapping" application profile and tries to mimic the swapping file/data access pattern, as a lot of what I do is swap (performance) limited.

    The rest of the usual suite that I use include ATTO Express Pro Tools (now renamed to something else I think), and also HDTach as the most basic.

    But there's no test that can replace what you're ultimately going to actually do with the system. Create a uniform benchmark for yourself and run your regular workload on it. That'll be the sure-fire way to test new hardware.

  51. badblocks by Janek+Kozicki · · Score: 4, Interesting

    badblocks -c 10240 -s -w -t random -v /dev/sda1

    that's my standard test for all HDDs

    --
    #
    #\ @ ? Colonize Mars
    #
  52. For backups - restore + check MD5 hash by JoeSchmoe007 · · Score: 1

    If storage media is used for backups - try restoring them. If files that are stored there are not changed - you can also save MD5 (or some other hash) in separate file and re-calculate/re-check it periodically.
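
    A minimal sketch of the "save hashes in a separate file, re-check later" idea, in Python. The function names and the JSON manifest format are assumptions chosen for illustration; MD5 is used only because it is the hash named above, and any other hashlib algorithm would slot in the same way. Keep the manifest file somewhere other than the media being checked.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file, read in chunks so large backups do not need to fit in RAM."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its current hash."""
    return {p.relative_to(root).as_posix(): md5_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def save_manifest(manifest: dict[str, str], manifest_path: Path) -> None:
    manifest_path.write_text(json.dumps(manifest, indent=2))


def changed_files(root: Path, manifest_path: Path) -> list[str]:
    """Files that vanished or whose hash no longer matches the saved manifest."""
    manifest = json.loads(manifest_path.read_text())
    return [rel for rel, expected in manifest.items()
            if not (root / rel).is_file() or md5_of(root / rel) != expected]
```

    Run build_manifest and save_manifest once after the backup is written, then call changed_files on a schedule; anything it returns has either disappeared or silently changed.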

  53. def check_storage_media(media): by Bill,+Shooter+of+Bul · · Score: 1

    while label_visible(media):
            apply_lighter_fluid(media)
            ignite(media)
            wait_while_burning(media)
    return media_is_bad

    --
    Well.. maybe. Or Maybe not. But Definitely not sort of.
  54. MHDD -Are you kidding me? by Anonymous Coward · · Score: 0

    MHDD, If this hasn't been mentioned yet, it's official...everyone on /. is an idiot.

  55. Yep...SPINRITE by Anonymous Coward · · Score: 0

    Spinrite is the one you want.

    http://www.grc.com

  56. Test Rig + SeaTools by Anonymous Coward · · Score: 0

    1a) Build a small test computer with several removable hard drive racks. (SNT-125B go for around $16 on Newegg)
    1b) Or plug a hard drive dock with eSata to your rig for individual drive testing.
    2a) Run SeaTools for all hard drive brands, including WD. (http://www.seagate.com/www/en-us/support/downloads/seatools)
    2b) Or run Data Lifeguard Diagnostics if it's a WD drive.

    I've run SeaTools against every brand of hard drive, and it's caught every one.

    For the most part I stick with WD drives. Their warranty is excellent.

  57. hard drive tests by Anonymous Coward · · Score: 0

    I use Hiren's boot disk and run the HDAT test, it seems to work for the stuff i do. The other best way to test a HDD is to run it through a serious burn in test. We test hard drives and ship them daily and rarely do we have failures or returns.

  58. Scrubbing/BMS by bored · · Score: 1

    Other people have posted about the manufacturers' test utilities, RAID etc.

    But I didn't see anyone mention that most newer drives can run media scans while idle.
    For example Seagate supports the T10 Background Media Scan (BMS), where the drive scans itself, and relocates sectors when it's not actively processing commands from the host. It also supports idle read after write, which reads recently written sectors and compares them with the copy in the cache during idle periods.

    Finally, all the modern RAID controllers I've seen have scrub options for validating the RAID parity and taking drives offline that are failing the parity checks. (which is mostly pointless if your drive is scrubbing itself, the ECC from the drive provides more protection than RAID5. 6 is probably safer..).

    The key of course is to make sure your RAID controller understands the drive failure metrics behind it.

  59. My suggestions by Liquid-Gecka · · Score: 4, Informative

    Speaking as somebody that has done hardware qualifications and burn-in development at very large scale for companies you have heard of, let me tell you the tools I use:

    fio: The _BEST_ tool for raw drive performance and burnin testing. A couple of hours of random access will ensure the drive head can dance, then a full block by block walk through with checksum verification will ensure that all blocks are readable and writable.. I usually do 2 or 3 passes here. You can tell fio to reject drives that do not perform to a minimum standard. Very useful for finding functional yet not quite up to speed drives. The statistics produced here are awesome as well.. Something like 70 stats per device per test.

    stressapptest: This is Google's burn-in tool and virtually the only one I have ever found that supports NUMA on modern dual socket machines. This is IMPORTANT as it's easy to ignore issues that come up with the link between the CPUs. The various testing modes give you the ability to tear the machine to pieces, which is awesome. Stressapptest also is the most power hungry test I have ever seen, including the Intel Power testing suite that you have to jump through hoops to get.

    Pair this with a pass of memtest and you get a really, really nice burn-in system that can brutalize the hardware and give you scriptable systems for detecting failure.

    1. Re:My suggestions by zigfreed · · Score: 1

      Mod parent up. I was going to suggest a grey code-ish bit pattern using badblocks:

      for i in {random,65535,21845,52428,26214,13107,39321,43690,0}; do badblocks -wst $i /dev/sda; done

      which actually cleared up a S.M.A.R.T. failure, probably by internally remapping the bad sectors. Ultimately, the disk size remained above 1TB, so if it was RMA'd I have a suspicion that it would end up in the refurbished bin. Fio, stressapptest, and if you have the system online, Furmark, will overheat anything that can overheat.
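
      The decimal values in that loop are easier to read once written out as 16-bit patterns. A small Python sketch that only converts the numbers and says nothing about how badblocks lays them out on disk (the helper name pattern_row is made up):

```python
# The numeric test patterns from the badblocks loop above, minus "random".
PATTERNS = [65535, 21845, 52428, 26214, 13107, 39321, 43690, 0]


def pattern_row(value: int) -> str:
    """Format one pattern as decimal, 16-bit hex, and 16-bit binary."""
    return f"{value:>5}  0x{value:04X}  {value:016b}"


for value in PATTERNS:
    print(pattern_row(value))
```

      Seen this way the list is all-ones, alternating single bits (0x5555 and its complement 0xAAAA), alternating bit pairs (0xCCCC, 0x3333, and the shifted 0x6666, 0x9999), and all-zeros, so every bit position gets written as both 0 and 1 next to differing neighbours.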

  60. Re:Can't even believe this made it to the front pa by Anonymous Coward · · Score: 0

    Choosing one brand to trust seems counterproductive, since this implies your disks will all be from one brand, thus increasing the chances that both/all disks in a mirror set are affected by a single process flaw. Best practice is to ensure mirrored drives are not only not of the same lot, but not produced on the same equipment, and the only easy way to do that is to use different brands.

  61. test tools by pentabular · · Score: 1

    dbench, lmbench, bonnie/++ and badblocks monitoring with smartmontools, blktrace, and seekwatcher

  62. Always running=Longest life by Anonymous Coward · · Score: 0

    I've always used "consumer" grade HD's in my servers and they work most reliably. Difference from a desktop: they never stop spinning from day1. No heat up/cool down expansion/contraction. The "green" drive that spin down after some time of no use I've had the highest failure rates with.

    1. Re:Always running=Longest life by jmorris42 · · Score: 2

      > I've always used "consumer" grade HD's in my servers...

      A lot depends on what your servers are doing. Try putting a consumer drive in a news server some time if you want a good laugh. Or put one in any server that is really busy 24/7 and a cheap drive will crap out fairly fast since they were never designed for that sort of constant abuse. Meanwhile good drives can and will hold up for years of that kind of torture. Yes, they will all happily spin 24/7 and thrive if the drive is well cooled, since power cycles are when most drives die, but constant seeking is another kettle of fish.

      --
      Democrat delenda est
  63. testing for speed or failure? by SinShiva · · Score: 1

    I see that many comments include utilities that may or may not indicate ACTUAL problems. Some of these test for speed, but how? Personally, i find the most accurate way to test storage media (and network media) is by copying data to and from the storage media in question, to a RAMDISK. For network throughput, i like to copy ramdisk to ramdisk via differing protocols, ie ftp/http/samba, etc. (i run a small ramdisk as a virtual folder in IIS (win7) for hosting small files to give to friends.) Also, installing (cpu/mem/gpu) benchmark utilities to a ramdisk yields interesting results. i don't believe most or all benchmark utilities are ACTUALLY unhindered by storage media when testing other components, like the cpu.

  64. Re:Can't even believe this made it to the front pa by Anonymous Coward · · Score: 0

    RAID 1 is what you are referring to; RAID 10 is a nested RAID setup, which is a stripe of mirrors.

    When they start selling drives large enough and can support more I/O, your RAID 1 only plan might work. Some people need more than 2TB per LUN and need far more I/O than two spindles can support. Sure, you can make one big RAID1 I guess but that is a 50% loss in space and not a good method to use.

    You can go back to /b/ now..

  65. MHDD by NoobixCube · · Score: 2

    It may not be sophisticated, but MHDD is what I use at work (among a couple of other tools). Other tools are more reliable in different circumstances, but my first stop is always MHDD, because it will give me a comprehensive R/W delay test on a disk. Extremely practical for a workshop, perhaps not practical for a data centre.

    --
    Admit it. You post strawman arguments as AC so you get modded Insightful for refuting them, rather than Troll
  66. chkdsk /r by Anonymous Coward · · Score: 0

    Or do any sort of verified format of the device.

  67. Re:SpinRite irrelevant??? by Anonymous Coward · · Score: 0

    Drives exposing logical sectors instead of physical sectors has been a problem for over 10 years. There are plenty of ways to determine the physical geometry such as timing seek times and sometimes it is as easy as looking up the drive type in a list. As far as SpinRite being BS because it works on USB drives, the program does what it is programmed for. When it encounters a condition it doesn't understand it makes lemonade and moves on. It may not be programmed to respond, "Are you sure this is rotational media?"

  68. Just use by noh8rz3 · · Score: 1

    ...passwords. Be sure they're hardened and salted. Also, redundancy and the cloud.

  69. I use badblocks -w by Anonymous Coward · · Score: 0

    Basically, check the reallocated sectors in smartctl, run badblocks -w (for greater coverage, add some "-t random" passes to the usual 4 fixed patterns), and make sure that reallocated sectors & pending sectors haven't changed.

    If they have, you can use some judgement, but if they haven't, it's probably okay.

    Some seek tests would be nice, but it already takes long enough to just read and write every sector.
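
    The before/after comparison is easy to script. A Python sketch that takes two saved attribute dumps as text and reports any watched counter that went up. It assumes the common ten-column attribute table layout (ID, name, flag, value, worst, thresh, type, updated, when-failed, raw value) and the usual attribute names; both vary by drive and vendor, and the function names here are made up for illustration.

```python
# Attribute names commonly used for reallocated and pending sectors (assumed).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_raw_values(attribute_dump: str) -> dict[str, int]:
    """Pull the raw value (10th column) of each watched attribute from a dump."""
    values = {}
    for line in attribute_dump.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def regressions(before: str, after: str) -> dict[str, tuple[int, int]]:
    """Return {attribute: (old, new)} for watched counters that increased."""
    old, new = parse_raw_values(before), parse_raw_values(after)
    return {name: (old[name], new[name])
            for name in WATCHED
            if name in old and name in new and new[name] > old[name]}
```

    Save the attribute table to a file before the badblocks run and again afterwards, feed both to regressions, and apply the judgement call above only when the result is non-empty.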

  70. Record, check, test, burn-in, monitor by simplypeachy · · Score: 1

    If the weather's been cold (<5C) between the warehouse and the place of testing then I give a disk 24 hours to acclimatise outside of any packaging before powering it up. I check it for visible signs of damage as I've had a few delivered which had undamaged packaging but were still visibly broken. Once I've noted the product code, model name and serial (great fun otherwise, if $mfg does a recall and you don't know if affected) I do a warranty check on the drive. Ebuyer in the UK send out grey imports with no UK warranty sometimes, which shows up on $mfg's web site. If this is the case I'll get written confirmation from the supplier that they will honour the manufacturer warranty. n.b. I run a business, so consumer protection laws do not apply to me - this is a necessary precaution.

    Then I start testing using a procedure which is designed to be thorough but not wasting any time if something is amiss. I don't consider SMART to be the end-all of fault finding, but always trust it if it's saying something is wrong. The manufacturer will honour returns according to SMART, so it's good enough for me.

    I check the SMART attributes for anything obvious. Any mention of any LBA being bad or possibly bad, then the drive fails. I then run the short, conveyance, offline and long tests, checking the attributes and logs afterwards.

    I run the manufacturer's short test on it, then the long test. These often lie if a disk is bad but if they do show up anything, the disk fails. I record the firmware version and check the manufacturer's site to see if they advise an upgrade (I don't trust the version printed on the label). smartmontools is also quite good at alerting you to this. I'll then DBAN the disk with DoD short to give it a good test and then check the SMART attributes again. Once these all pass I'll start using it and get smartmontools to schedule regular tests and email me if anything bad shows up.

  71. Ask manufacturer tech support! by couchslug · · Score: 1

    They might know something or have utilities for doing such things...

    --
    "This post is an artistic work of fiction and falsehood. Only a fool would take anything posted here as fact."
  72. ZFS by Anonymous Coward · · Score: 0

    Try ZFS.

  73. RAID by Anonymous Coward · · Score: 0

    If it's that important, implement an appropriate RAID solution and you don't even need to care about individual drives.

  74. Perhaps this might be of interest: by Anonymous Coward · · Score: 0

    http://www.google.com/patents/US7461298

  75. Consider the drive faulty from day one by enjar · · Score: 1

    If it's a single disk (e.g. in a compute only server), then treat it like toilet paper. Know it's going to be replaced, how to replace it, and have spares on hand. Inform all people who use these servers that data on the server is being treated as throw-away at all times, and to never, ever expect data to survive a reboot -- and when the hard drive fails, you *might* be able to recover some data. It goes without saying that you also need to be able to re-provision the system with a common image, as well.

    If people can't deal with that, then it's time to get into RAID, decent backups, NAS, SAN and other technologies that require additional cost and care to purchase and maintain but that are designed to guard against disk failures bringing down critical systems.

    You should also make sure you understand your drive replacement procedure (as well as your consumers). Are you keeping a stack of spares and self-servicing, or are you paying a vendor to provide n-hour service?

  76. PROD by Anonymous Coward · · Score: 0

    When in doubt, test in production...

  77. IOMeter by TheEldest · · Score: 1

    To answer the actual question: IOMeter. It's a load generator / benchmark. You can generate loads to test a storage device for your specific requirements and see if performance is up to snuff. You can also generate loads to stress a device until you halt it.

    As someone else mentioned, throw bunches of read/writes at a drive for a couple days then put it into production with a reliable system to gracefully handle failure. You want to find drives that would fail in the first couple weeks and keep them from hitting your production environment.

  78. Re:Can't even believe this made it to the front pa by Ironhandx · · Score: 1

    It's not. It's actually cheaper to test a particular batch if it's for a big enough job.

    In a raid 5 3 drive configuration if one disk fails you rebuild it, then swap each of the other drives out one at a time immediately afterwards. Since one drive has failed and they've all had the same use pattern you can assume that the others are going to fail as well.

    Beyond that, if things are that mission critical you should have streaming backups to an additional backup array of newer drives. Brand makes zero difference here, they just have to have less usage.

    Besides which, whatever they claim, RAID controllers to this day do not like different types of hardware inside a single array. I've seen IT guys take the manufacturers at their word and spend months looking for "ghosts in the system" due to random corrupt data errors. It's just the RAID controller spazzing because someone installed a Western Digital drive in the same array as a Maxtor, or a 120GB drive alongside two 80GB drives.

    There is no "easy" when it comes to proper hardware management. If you're working in IT you need to learn this fast before it costs your company big time down the road.

  79. Iozone for performance testing/tuning by Fallen+Kell · · Score: 1

    You set some basic parameters, such as min/max file sizes to use, stepping size between files, how many processes/threads to use in the test, etc. It will run a gamut of tests reading/writing files: sequential, random, varying read/write sizes, and varying sizes of the files it is creating. It outputs nice graphs so you can see where the peak performance values are in terms of the storage dealing with different-sized files and read/write sizes.

    http://www.iozone.org/

    --
    We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
  80. PDF's by opentunings · · Score: 1

    The company I work for reads the manufacturer's white papers on their drives. That's been our testing strategy for years; doesn't everyone do it that way?

  81. Spinrite by Anonymous Coward · · Score: 0

    Spinrite is good.

  82. They get bigger faster than they get faster by tepples · · Score: 1

    Also, as drives get bigger, they also get faster

    True, but they get bigger faster. That's why there's a movement to migrate away from RAID 3-5: the risk of two drives failing within two days is too great.
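
    To put rough numbers on "bigger faster than faster": the best-case time to rebuild a replaced drive is its capacity divided by its sustained throughput. The sketch below is only that arithmetic; the two drive figures are illustrative assumptions, not measurements, and a real rebuild on a busy array takes longer.

```python
def rebuild_hours(capacity_gb, throughput_mb_s):
    """Best-case hours to read or write a whole drive once at a sustained rate."""
    seconds = (capacity_gb * 1000) / throughput_mb_s  # 1 GB taken as 1000 MB
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical older and newer drives: (capacity in GB, sustained MB/s)
    for label, capacity_gb, mb_s in [("older", 80, 50), ("newer", 2000, 120)]:
        print(f"{label}: {rebuild_hours(capacity_gb, mb_s):.2f} h minimum rebuild")
```

    With these assumed figures capacity grows 25x while throughput grows 2.4x, so the window in which a second failure loses the array grows by roughly a factor of ten.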

  83. Why are you pretending to be dumb? by dbIII · · Score: 1

    You want to actually do something with that data don't you? I mean I/O to the node/server/whatever that the storage is supplying, didn't you notice the NFS in big capital letters?

    you don't know how to size servers.

    And you obviously don't know how to price them.

    1. Re:Why are you pretending to be dumb? by the_B0fh · · Score: 1

      In a RAID environment, the IO bottleneck is at one of various places:

      1) Disk mechanism (not under discussion)
      2) Interface (not under discussion)
      3) calculating RAID parity (hardware RAID vs software RAID)
      4) higher level functionality (moving LUNs around, RAID hole, etc - not under discussion)

      As of 2010, Intel has publicly declared that software RAID (good implementations, obviously) is faster than hardware RAID, and does not take up more than 5% of your CPU. Youtube's engineers have stated that they get 20% to 30% more performance out of software RAID than hardware RAID. Obviously *YOUR* expertise trumps them all.

      I am also worried about a guy who can't price a CPU that has 5% more performance without breaking the bank. You must be working on itty bitty machines.
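
      For anyone wondering what "calculating RAID parity" means in item 3: single-parity RAID (RAID 5, for example) stores the XOR of the data blocks in a stripe, and a lost block is recovered by XORing the survivors with the parity. The sketch below shows only that operation; the function names are made up for illustration, and a real software RAID implementation does this in optimized native code rather than a Python loop.

```python
def xor_parity(blocks):
    """XOR equal-length byte blocks together, as single-parity RAID does."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    parity = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    stripe = [b"disk0---", b"disk1---", b"disk2---"]
    parity = xor_parity(stripe)
    # Pretend the middle drive died and rebuild its block from the other two.
    recovered = rebuild_missing([stripe[0], stripe[2]], parity)
    print(recovered == stripe[1])
```

      The work is one XOR per byte per drive, which is why the argument is about whether a general-purpose CPU or a controller's onboard processor is the better place to spend it.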

    2. Re:Why are you pretending to be dumb? by dbIII · · Score: 1
      Please stop pretending to miss my point - you want to actually get the data to where something can be done with it - JUST LIKE I SAID IN THE FIRST SENTENCE ABOVE, and now you go on writing about something else while pretending to be answering my post. You are obviously not as stupid as you are pretending to be so please stop pretending, it's annoying.
      You've put up nothing that directly addresses your blanket statement that hardware RAID is entirely useless and you've gone on about gold plated solutions when instead hardware RAID is really just adding a small percentage of the cost to a controller that you need anyway. Just give it up, unless you really believe what you are writing in which case it's time for you to look up specs and prices for controller cards and learn for yourself.

      You must be working on itty bitty machines.

      Or several machines where the costs add up, then of course that hell of a lot more than 5% can push you up into a different price range while a few extra dollars on the controller card you actually need doesn't.
      How many decent controllers capable of handling a lot of discs don't have RAID6 capability on board these days anyway? Sorry, but you are the one that is living in the past. Just give it up and bitch about something you actually know about instead.

    3. Re:Why are you pretending to be dumb? by the_B0fh · · Score: 1

      At newegg, a Phenom II X4 965 at 3.4GHz is $125. A Phenom II X4 975 at 3.6GHz is $130. That 5% is $5.

      At times, you cannot avoid buying cards with RAID - they simply don't make them without it. But as youtube discovered, even when you have a hardware RAID enabled card, going the software RAID route increased performance by 20% to 30%.

      Which part of that do you not understand?

    4. Re:Why are you pretending to be dumb? by the_B0fh · · Score: 1

      I guess the real question is this:

      Do you agree/admit that current generation CPUs can calculate RAID parity faster than a hardware RAID card (in most cases, with minimal, 5% overhead)?

      If yes, great.

      If no, are you saying the realities that the youtube engineers saw, that other people saw (linux kernel hackers, etc), that Intel publicly stated are all false?

    5. Re:Why are you pretending to be dumb? by dbIII · · Score: 1

      NO - the real question is if controller cards with RAID on them still have a place "in this century", so don't just try a little bit of misdirection to pretend your silly statement that kicked all this off never happened.
      You can use the CPU for other things, and as I keep saying, just getting the stuff on and off the disk isn't the only thing a computer has to do even if it is dedicated to providing storage for other machines. It has to get it out to the network OR do something with the data locally. Please stop ignoring those statements and pretending I do "not understand".
      Sorry kid, but you made a stupid blanket statement so if you want to pretend it's not stupid you've got to prove it for everything and not just throw up edge cases I already know about.
      From where I stand, if you have a fixed budget and want to have a large array and get the data out to more than a couple of other machines you get a mid priced controller card with decent onboard RAID instead of getting another board with an extra socket (which will cost more than the difference in controllers) and another CPU (that's going to cost more than the controller in total). That's why I reject your blanket case that RAID hardware has no place and that's why I see your comments about just adding more cores as an irrelevant distraction.

    6. Re:Why are you pretending to be dumb? by the_B0fh · · Score: 1

      http://slashdot.org/comments.pl?sid=2763753&cid=39563051 stated:

      Also: HARDWARE RAID CARDS.

      I can't stress that enough. software and semi-software raid is a joke.

      In http://slashdot.org/comments.pl?sid=2763753&cid=39566051 I said:

      Seriously?! hardware raid cards? Which century are you stuck in?

      Anyone with reading comprehension will realize that I'm responding to the OP's claim that software raid is a joke and that hardware raid is a necessity. My point, all along, has been that software raid is now good enough to replace hardware raid.

      And apparently you have reading comprehension problems. Why buy another fucking socket when you can just get a cpu that can handle 5% more load? I gave an example of the CPU already - the CPU only cost *FIVE FUCKING DOLLARS MORE* for 5% more performance. In the edge case where you are sucking up every cpu cycle for your application, then, yes, hardware raid can make sense, but I'd also say you have seriously fucked up on your sizing because any server that's running at 100% is not capable of handling more load.

      You keep going over into edge cases and tiny little machines that run at 100% of the cpu. *FOR MOST PEOPLE, SOFTWARE RAID IS BETTER THAN HARDWARE RAID*.

      But you're a know it all. Can't even answer straight questions. I'm done. Enjoy being the "expert".

  84. badblocks (like memtest) is insufficient by Anonymous Coward · · Score: 0

    On a new drive it is a good idea to run badblocks in write mode, which fills the drive with several byte patterns and compares. This will cause bad blocks to be remapped internally in the drive and generally ensures the physical medium is working.

    But there can be other problems with disk and even more so with multi-disk storage. Specifically badblocks writes a single byte pattern over all of the disk and then reads that back. But what if the drive makes a mistake with the address data is written to or read from? Since the disk is filled with the same pattern an address error isn't detected. It would be better to write different patterns at different addresses.

    One quick hack to do this is to encrypt the drive and run badblocks on the encrypted volume. Most encryption algorithms produce different output depending on the position of the block, so address errors will then result in corrupt data being read back from the drive.
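
    The same idea without encryption is to derive each block's contents from its own address, so that a block landing in the wrong place fails verification. What follows is a rough sketch under assumptions, not what badblocks does: the names `pattern_for`, `write_patterns` and `verify_patterns`, the 4096-byte block size, and the use of SHA-256 as the pattern generator are all choices made for illustration. It runs against a scratch file; pointing it at a raw device would destroy the data on it.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size for the sketch


def pattern_for(block_index, seed=b"burn-in"):
    """Deterministic block contents that depend on the block's address."""
    out = bytearray()
    counter = 0
    while len(out) < BLOCK_SIZE:
        out += hashlib.sha256(
            seed + block_index.to_bytes(8, "big") + counter.to_bytes(4, "big")
        ).digest()
        counter += 1
    return bytes(out[:BLOCK_SIZE])


def write_patterns(path, num_blocks):
    """Fill the target with one address-derived pattern per block."""
    with open(path, "wb") as f:
        for index in range(num_blocks):
            f.write(pattern_for(index))
        f.flush()
        os.fsync(f.fileno())


def verify_patterns(path, num_blocks):
    """Return the indices of blocks whose contents do not match their address."""
    bad = []
    with open(path, "rb") as f:
        for index in range(num_blocks):
            if f.read(BLOCK_SIZE) != pattern_for(index):
                bad.append(index)
    return bad


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch_dir:
        target = os.path.join(scratch_dir, "scratch.bin")
        write_patterns(target, 64)
        print("bad blocks:", verify_patterns(target, 64))
```

    With a uniform fill, two swapped blocks look identical; here both show up in the list of bad blocks because neither matches the pattern expected at its position.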

  85. A blanket case is a blanket case by dbIII · · Score: 1

    Nice shift of goalposts and blame - but I was replying to YOU, that stupid comment, and your somewhat silly ones since trying to justify it. Are you really trying to pretend that you were not dismissive of every hardware solution in that post and all those ones since?
    Your mysterious and nebulous "most people" doesn't matter anyway when you were attempting to pretend that it was never useful. You've also entirely misread or pretended to misread my post immediately above - please read it again (or for the first time) before babbling about just buying a better CPU when 1 is not enough no matter how good it is.

    1. Re:A blanket case is a blanket case by the_B0fh · · Score: 1

      I'm responding to someone else.

      You responded to me.

      I pointed out that fact, and I'm the one shifting goalposts? You are the one defending hardware raid to the death. You are the one who disagrees with Intel, Youtube, Google, and Linux kernel developers (last one is left as a search exercise for you).

      And when you are asked specifically whether you agree with Intel/Youtube, you refuse to answer and weasel your way out. Such a dick move.

      ANSWER THE GODDAMNED QUESTION: DO YOU AGREE WITH INTEL AND YOUTUBE ENGINEERS THAT SOFTWARE RAID IS AS FAST AS, OR FASTER THAN HARDWARE RAID, AT MINIMUM IMPACT (5% or less) TO THE CPU?

    2. Re:A blanket case is a blanket case by dbIII · · Score: 1

      I really cannot understand how you would believe anybody would fall for you blaming somebody that said the exact opposite to your garbage for your garbage. It appears the entire waste of words in this thread is because you still have some growing up to do :(
      Your software RAID distraction is irrelevant because software RAID has been in existence for longer than hardware RAID anyway - in both cases if you pick good enough hardware (CPU or processor on the controller) of course it gets the job done. Your statement is as irrelevant as saying corrosion is never a problem anywhere because you can just coat everything in gold - technically correct but almost always useless in practice. That petty little trick of asking a different question with an obvious answer and pretending that makes me agree with your stupid blanket statement above is very childish and must come from watching one too many sleazy politicians in action. You can be a better person than that.

    3. Re:A blanket case is a blanket case by the_B0fh · · Score: 1

      And your inability to comprehend the question at hand - is hardware raid better than software raid - is incomprehensible, given all the evidence I've provided.

      And you still dodge a direct question being asked. And you talk about sleazy politicians? Oh, the irony.