Ask Slashdot: Do You Test Your New Hard Drives?
An anonymous reader writes "Any Slashdot thread about drive failure is loaded with good advice about EOL — but what about the beginning? Do you normally test your new purchases as thoroughly as you test old, suspect drives? Has your testing followed the proverbial 'bathtub' curve of a lot of early failures, but with those that survive the first month surviving for years? And have you had any return problems with new failed drives, because you re-partitioned it, or 'ran Linux,' or used stress-test apps?"
Like, never. Out of the box and away she goes...good luck to thee!
"As the intrepid kobold companion continues his journey, he begins to wonder... if priests raises dead, why anybody die?
> Who cares about HDDs anymore these days?
Anyone with a need for a massive amount of storage space.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
If dban can write out every sector and not have smartctl show any pending sectors after the fact (and the average speed of the dban wipe was normal) then you've got good chances the drive will be fine.
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
..but I did not have harddrive fail catastrophically on me.
I do test flashcards, and their survival rate is about 50% :-(. (tar czvf /dev/sdb ..., and another flashcard dead...)
And you don't even tell us the vendor?!
Slashdot Valentines Beta Massacre: iT WORKED! The boycotts killed Beta!!
My first help desk job included every computer in the company. We had a server drive fail, so I had Compaq send a replacement. The new arrival didn't work. So then I spent more time looking at RAID configuration and such, but we got a second replacement. That one didn't work either. But I tested it on arrival. The third replacement worked fine, just when I was worried it was something stupid I was missing. Two DOA RMAs for the same part. And yes, that's happened to me again since that first time.
I test every "used" part as if it's suspect. The question was about new, but they are still new to me.
Learn to love Alaska
I haven't even considered testing my personal hard drives. If they break I try to retrieve whatever is on them, but I just buy new drives instead of spending any amount of time fixing them. I've never returned a disk; I just buy a couple of new ones whenever I need more space.
At work we're using properly configured SANs with 24x7 support, so I couldn't be arsed to test disks there either. We don't have multiple racks of disks, so I don't see any good reason to test everything.
If you're testing new disk drives you must be really bored or very broke.
Set up the smartd.conf file to do the example short-test daily and long-test weekly, and email you when something is fishy. It's a trivial amount of effort, resulting in a significant amount of peace of mind. (In many cases, you'll have some amount of warning before your drive kicks the bucket and it's too late)
Yes, if it's a Windows box, I run chkdsk /F /R a few times, and defragment the drive after deploy. (Not because it needs it, but for the exercise.) Similar with fsck on Linux. If it fails, I want it to fail when the in-store return policy is still in effect, so I don't have to deal with the manufacturer.
But having a returned drive rejected because I repartitioned it or "ran Linux"? Never heard of that.
Betteridge's law of headlines applies here. Hard drives go through extensive calibration before shipping, so the need for burn-in doesn't really exist. As for problems with RMAs for hard drives that were used under Linux, repartitioned, etc.: no.
The real "Libtards" are the Libertarians!
Manufacturers do a burn-in before shipping, and that gets most of the early failures. Of course, some will still win the lottery and get a crappy early-failure drive, but it has never happened to me.
Every single platter HD I get I scan for bad sectors. I got sick and tired of returning faulty WD Black drives to different suppliers because of huge bad sector counts. Since I have been testing I have returned about 5 drives due to sector issues. I don't run any tests on SSDs.
Old bathtubs lasted longer than old hard drives. Now it's the other way around.
Sorry, but gray text on gray background is making my eyes bleed.
I run some ZFS systems at work. With the current version of the filesystem, you can expand the zpools but you can't shrink them, so adding a bad drive causes immediate problems.
I've found that some drives are completely functional but write at extremely slow rates: maybe 10% of normal. With typical consumer drives, maybe 1/20 is like this. To ensure I don't put a slow drive into a production zpool array of disks, I always make a small test zpool consisting of just the new batch of drives and stress-test them.
This catches not only obviously bad drives, but also the slow or otherwise odd ones.
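One way to pick out the slow ones once you have a throughput number per drive is to compare each against the batch median. This is only a sketch, not the poster's zpool procedure: the function name, the 50% cutoff and the sample figures are all assumptions, and how you measure the rates (zpool stress test, dd, anything else) is up to you.

```python
import statistics

def slow_drives(rates, fraction=0.5):
    """Return the names whose rate is below fraction * the batch median.

    rates maps a drive name to its measured write rate (any unit, as long
    as every drive uses the same one).
    """
    median = statistics.median(rates.values())
    return sorted(name for name, rate in rates.items() if rate < fraction * median)

# Made-up MB/s figures for a batch of four drives.
batch = {"sda": 148.0, "sdb": 151.0, "sdc": 14.0, "sdd": 150.0}
print(slow_drives(batch))
```

Using the median rather than the mean keeps one very slow drive from dragging the reference point down with it.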
Sounds like a really old troll.
Only the State obtains its revenue by coercion. - Murray Rothbard
Not really. People usually don't modify gigantic footprints of data per day, so standard incremental backup strategies are still very applicable. Most of the large data tends to be read-only over time, typically media, archives, large installation files, etc.
Let me guess... if it sank to the bottom it was a good drive, but if it floated it was a bad drive and needed to be burnt at the stake.
Trying to coax an error will never reveal one. Only when you start using it "for real" will the problem manifest.
Of course! Fucking witches are getting into everything these days!
We do here at work. We need some modest 120+ TB of storage right now, and 30% of that content is highly dynamic (PostgreSQL databases). Anything but data center quality HDD would be silly, not to mention unreliable as hell and heavily expensive. SSDs are just for laptops or so, not for real data storage requirements.
cpghost at Cordula's Web.
Do you perform extensive functional tests against third party software libraries before including them in your system? In most situations, no -- if it's established and proven. You trust that it does what it advertises, and only when it doesn't do you dig further.
Same goes for hard drives.
I always do a format and a secure erase (one pass of zeros). In addition to finding bad sectors I want to be sure to get rid of any trace of whatever crap they put on it at the factory (viruses, kiddie porn, crapware, etc).
People who need reliable, long-term storage care about HDDs. Just like how people still used tape drives even when CDs and DVDs came along.
Well, to be sure, if your HDD does float in water, it probably is possessed...
Sleep your way to a whiter smile...date a dentist!
badblocks -w -t random /dev/sdX && shred /dev/sdX
Badblocks checks for bad sectors while writing random data to the drive (this destroys whatever is on it), and after all is good, I run shred once or twice to fill the drive with random data. You can probably get by with just badblocks, though.
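For anyone curious what a write-then-verify pass amounts to, here is a rough Python sketch of the idea. It is not how badblocks is implemented: the block size, the seeding scheme and the function names are invented for illustration, and it works on an ordinary file path so it can be tried safely. A real test would also need to bypass the page cache, otherwise the read-back may never touch the platters.

```python
import os
import random
import tempfile

BLOCK = 4096  # assumed block size for this sketch

def pattern_block(seed, index):
    """Deterministic pseudo-random contents for one block."""
    rng = random.Random(seed * 1_000_003 + index)
    return rng.getrandbits(BLOCK * 8).to_bytes(BLOCK, "little")

def write_pattern(path, blocks, seed):
    """Fill the first `blocks` blocks of path with the pattern and sync."""
    with open(path, "wb") as handle:
        for index in range(blocks):
            handle.write(pattern_block(seed, index))
        handle.flush()
        os.fsync(handle.fileno())

def verify_pattern(path, blocks, seed):
    """Return the indices of blocks that do not read back as written."""
    bad = []
    with open(path, "rb") as handle:
        for index in range(blocks):
            if handle.read(BLOCK) != pattern_block(seed, index):
                bad.append(index)
    return bad

with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "pattern.bin")
    write_pattern(target, blocks=16, seed=42)
    print(verify_pattern(target, blocks=16, seed=42))
```

Regenerating the pattern from the seed means nothing has to be kept in memory between the write and the read pass.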
-- Its survival of the fittest...and we got the fucking guns!!!
I buy hard drives in pairs, using one for live data and one kept offline until it's time to back up the live drive (I use Unison sync to quickly determine what's changed between the two drives). My boot drive gets backed up every night with Macrium Reflect. The secret to a happy life: assume that every drive will fail tomorrow and keep everything backed up.
I run smartctl and capture the registers, then run badblocks, and compare smartctl's output to the pre-bad-blocks check.
If there are any remapped blocks, the drive goes back, as the factory should have remapped the initial defects already, and that means new failed blocks in the first few hours of operation.
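The before/after comparison is easy to script. The sketch below assumes you saved the attribute table printed by `smartctl -A` before and after the badblocks run; the parsing is deliberately simple (numeric ID in the first column, raw value in the tenth), the sample tables are made up, and the list of watched attribute names is my own choice, so adjust it to what your drive reports.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def parse_attributes(text):
    """Map attribute name -> raw value from `smartctl -A` style table text."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values

def grown_defects(before, after, watched=WATCHED):
    """Return {name: (old, new)} for watched counters that increased."""
    old, new = parse_attributes(before), parse_attributes(after)
    return {
        name: (old.get(name, 0), new[name])
        for name in watched
        if name in new and new[name] > old.get(name, 0)
    }

# Illustrative snapshots only, not output from a real drive.
BEFORE = """ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
AFTER = BEFORE.replace("Always       -       0\n197", "Always       -       8\n197")

print(grown_defects(BEFORE, AFTER))
```

An empty result means none of the watched counters moved during the test; anything else is a candidate for the RMA pile by the rule above.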
Please help metamoderate.
This answers most of your questions and does so using data based on a large dataset.
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/disk_failures.pdf
If you are concerned about reliability I suggest using an Intel SSD. Their failure rate is very low.
How did this make it to the front page, especially with SSD prices being what they are?
I have a 20TB RAID array that cost me about $0.10/GB, including controllers. If you can afford to build a 20TB array using SSD, you have far more money than I do. You will also need more controllers than I do (port multipliers divide the bandwidth, which you don't want to do for SSDs), since you'd need at least 20 SSDs (if you were willing to pay about $2.50/GB), but more likely more than 45 (at about $0.85/GB).
You also need special controllers that understand SSDs and can pass TRIM commands, and that will add about $0.15/GB. And, you'll need a much more expensive motherboard, since you need at least 24 PCIe lanes that can all be used for something other than video cards, but likely more than 40. Last, since this is Slashdot, you might not be able to use those special controllers, as not all of them have drivers for the kernel version you want to use.
So, yeah, for a boot drive, SSDs kick ass, but for storing your movie collection, not only are they 10 times more expensive than magnetic disks, but they are way overkill as far as performance is concerned.
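The arithmetic behind those figures, as a quick sketch. The per-GB prices are the ones quoted in the post; treating 20 TB as 20,000 GB and applying the controller surcharge to both SSD options are my assumptions.

```python
CAPACITY_GB = 20 * 1000  # 20 TB, assuming decimal units (1 TB = 1000 GB)

# Prices in cents per GB, taken from the post above.
HDD_CENTS = 10          # $0.10/GB, controllers included
SSD_LARGE_CENTS = 250   # $2.50/GB for the larger SSDs
SSD_SMALL_CENTS = 85    # $0.85/GB for the smaller SSDs
SSD_CTRL_CENTS = 15     # $0.15/GB extra for TRIM-capable controllers

def array_cost(cents_per_gb, capacity_gb=CAPACITY_GB):
    """Total cost in whole dollars for capacity_gb at the given price."""
    return capacity_gb * cents_per_gb // 100

hdd = array_cost(HDD_CENTS)
ssd_cheap = array_cost(SSD_SMALL_CENTS + SSD_CTRL_CENTS)
ssd_large = array_cost(SSD_LARGE_CENTS + SSD_CTRL_CENTS)
print(hdd, ssd_cheap, ssd_large, ssd_cheap // hdd)
```

With the cheaper SSDs plus controllers the price works out to $1.00/GB, which is where the "10 times more expensive" figure comes from.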
Rebuild time. It takes our hardware raids about 24 hours to rebuild, and software raids about 72 hours. If the disk failure isn't detected immediately, even with RAID-6 you are pushing your luck.
RAID is not backup.
How have you been treated when returning them? I'd like to know what brands and what vendor. I'm always looking for success stories especially on commodity hardware. Thanks.
I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
I used to work for a manufacturer of video raid arrays. While I was writing software, not on hardware QA, I saw a lot of drives go past. I saw no sign of high early failures, bathtub style. It seemed to me essentially random. The only tip I would have would be to monitor your bad block count. Most drives only showed one or two "grown" as opposed to factory marked bad blocks. If the bad block list grows into the teens, swap that drive.
Consciousness is an illusion caused by an excess of self consciousness.
...bought and installed in desktops & laptops over the last decade, and what I've learned is to buy Seagate drives. I have seen way fewer defects and first-year failures on Seagate than WD, and I was happy to see Maxtor go away.
Guess Google is silly then using the cheapest possible hard drives and accommodating the inevitable failures.
> SSDs are just for laptops or so, not for real data storage requirements
Yep, just for laptops
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-910-series.html
http://www.equallogic.com/products/default.aspx?id=10857
SSD isn't great for bulk data storage, but where you need high IOPS a few SSDs in arrays replace a truckload of drives.
Depending on your definition of reliable and long term, people still use tapes.
> betteridge's law of headlines applies here.
No, it doesn't. This is an actual, legitimate question.
As I correctly predicted earlier this year, lots of Slashdotters have seized upon Betteridge as the latest fad kneejerk response, and are misapplying it without understanding what it means. In his own words, Betteridge's Law applies to cases where journalists "know the story is probably bollocks, and don’t actually have the sources and facts to back it up, but still want to run it."
For example, without the evidence to back it up, a headline saying "Tomato ketchup caused AIDS that led to extinction of dinosaurs" would be obvious crap and lead to criticism of the paper and/or journalist. OTOH, "Did Tomato ketchup cause AIDS that led to the extinction of the dinosaurs?" gives them the weasellish get-out of "Well, we didn't actually *claim* that it did".
Even then, if a question headline was a genuine attempt to present a plausibly-supported but not universally-accepted idea (possibly because it was new and/or divisive), then Betteridge's wouldn't apply.
In short, Betteridge's original observation was insightful where he claimed it applied, but it was never a blanket dismissal of question headlines, so please stop the tedious, kneejerk misapplication.
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).
When installing a new disk in a Mac, I run Disk Utility with the Secure Erase option enabled. This will write 7 or 30 passes of 0000 to every block, that should find any early problems...
>So, yeah, for a boot drive, SSDs kick ass, but for storing your movie collection, not only are they 10 times more expensive than magnetic disks, but they are way overkill as far as performance is concerned.
And where performance is concerned the raid of SSDs replaces many many more disks.
Use a sledgehammer to drive railroad spikes
Use a finishing hammer to drive finishing nails.
Never.
I thoroughly test any new hdd I get for my desktop PC:
The first thing I do is format it and install windows. If that works, then we know the drive isn't DOA
From there I torture test it by copying several hundred gigabytes of software and movies, as well as installing some more programs.
After that, I let it run for a few months, using it normally. If it crashes during that time, then I know it was bad.
Recently picked up a couple 3TB Seagate drives and a Synology box for a new NAS at home. Since I was planning to move all my music, pictures, video, and general documents to the new box, I decided to download the manufacturer HDD tools and scan the drives first just in case. I think Seagate's is called SeaTools, I'm sure WD has a program as well. No errors reported on either drive, and no errors so far with the RAID array after a couple months of use.
Well, the last drive I returned to a manufacturer was one that I was running FreeBSD on and they didn't seem to care. Granted, the experience with the manufacturer (Seagate) was less-than-pleasant but that had nothing to do with my choice of OS which I don't think they ever asked.
I now buy only Western Digital.
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
Nope. SSDs are reliable enough to be used in server-grade implementations. The only issue with them is that they're highly specialized. If your regular HDDs become the bottleneck, you will need SSDs. Also, if you have some small implementations where you need fast access to read/write/modify data (some MMOs come to mind) and need to protect it against a power failure or RAM going awry, you should use SSDs.
...gis sdrawkcab (usually not responding to ACs; don't bother posting as AC)
This is part of a process for testing new server gear.
Since I use Fedora, currently at 17, burn in testing is important.
Quick tip: Most distros currently do not detect SSDs during the install and do not include the "discard" option in the fstab entries for the device.
If you install or use an SSD with a modern distro, make sure to mount the device with the mount option for TRIM support set.
For example:
UUID=xxxxxxxxxxxxxxxxx /mnt/ssd2 ext4 discard,defaults 1 2
Where xxx... is the UUID of the device, and discard enables TRIM support.
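If you want to check an existing fstab for this, something like the following sketch would do. It only reads text and reports mount points; it cannot tell whether a device is actually an SSD, so you still have to know which entries matter. The function name and the sample entries are made up.

```python
def missing_discard(fstab_text, fstypes=("ext4",)):
    """Return mount points of the given filesystem types mounted without discard."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "discard" not in options.split(","):
            missing.append(mount_point)
    return missing

# Hypothetical fstab contents for illustration.
SAMPLE = """
# <device>   <mount>    <type> <options>        <dump> <pass>
UUID=aaaa    /          ext4   defaults         1 1
UUID=bbbb    /mnt/ssd2  ext4   discard,defaults 1 2
UUID=cccc    swap       swap   defaults         0 0
"""
print(missing_discard(SAMPLE))
```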
The burn-in process for hard disks usually involves writing a file the entire size of the disk, reading it, random seeking in it, then deleting it.
I also use a custom script to drive sysbench with some common fileio tasks.
This is important for disks as it can reveal differences in firmware between SSDs used in arrays. For example, a customer of mine had a really badly performing RAID array, and it was due to the mixing and matching of firmware between drives. (It worked well for a while, but then went bad when one of the drives in the RAID 5 array died, and he replaced it with a new one with different firmware.)
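A bare-bones version of that write/read/seek/delete cycle might look like the sketch below. It is an illustration under assumptions, not the script I actually use: the sizes are tiny defaults so it can be tried anywhere, the function name is made up, and for a real burn-in you would size the file to the disk and keep the page cache out of the way.

```python
import os
import random
import tempfile
import time

def burn_in(path, size_mb=32, seeks=200, block=4096, seed=0):
    """Write a file, read it back, seek randomly in it, then delete it.

    Returns the seconds spent in the write, read and seek phases.
    """
    chunk = os.urandom(1024 * 1024)
    timings = {}

    start = time.perf_counter()
    with open(path, "wb") as handle:
        for _ in range(size_mb):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # push the data out to the device
    timings["write"] = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "rb") as handle:
        while handle.read(1024 * 1024):
            pass
    timings["read"] = time.perf_counter() - start

    rng = random.Random(seed)
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as handle:
        for _ in range(seeks):
            handle.seek(rng.randrange(0, size - block))
            handle.read(block)
    timings["seek"] = time.perf_counter() - start

    os.remove(path)
    return timings

with tempfile.TemporaryDirectory() as scratch:
    print(burn_in(os.path.join(scratch, "burnin.bin"), size_mb=4, seeks=50))
```

Comparing the three timings across otherwise identical drives is where odd firmware tends to show up.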
-Hack
Got Geometrodynamics? Awe, too hard to figure out? Too bad.
HW RAID and SW RAID have been on par in performance for at least a decade. SW RAID these days is actually exceeding HW RAID performance because of the large difference in performance and calculation capabilities of the CPU (especially with data checksumming and compression).
Custom electronics and digital signage for your business: www.evcircuits.com
Actually, the only use we currently have for SSDs is ZILs (ZFS intent logs), and we're evaluating whether to put PostgreSQL transaction logs on an SSD, but that's another story. Our main storage farm is still HDD-based.
Testing is simple - plug it in, and run it till it fails. Might as well use it in the mean-time.
If my SMART data is showing the following:
Reallocated Sectors Count = 15
Reallocation Event Count = 15
Current Pending Sector Count = 0
Uncorrectable Sector Count = 0
Could the reallocated sectors mean data was lost? I've seen conflicting information on whether reallocated sectors means data was lost. Are there any other SMART attributes I can look at to determine if data was lost on the drive?
You would know if data had been lost. Normally the drive silently copies the data off failing sectors to new sectors, reallocates the sector, and you don't notice anything. But if the sector is completely unreadable or returns an incorrect CRC (that is, the drive's internal CRC, which is independent of how the drive is formatted), then the drive will return an error to the operating system and you will be notified of it. The drive does not automatically reallocate such sectors; it waits until the OS tries to write data to the broken sector before reallocating it, precisely so that files are not silently corrupted without the user's knowledge. Case in point: the power supply on my server caught fire and disrupted the other electrical components, and on one of my drives there were a bunch of sectors with a broken internal CRC. There was nothing I could do about it, but at least I was informed of which files I had lost when I tried to read them. I proceeded to delete the files in question and wrote random data to the affected sectors, after which the reallocated sectors count increased.
I just bought a new ThinkPad which had several SSD options. I chose the slower 1 terabyte disk instead. I'd rather have everything I need with me, even if it is a little slower.
As for backups, I have a daily cron job which rsyncs between my laptop and my home server.
When I have massive changes I make sure I'm hooked up to the wired home network, otherwise it just goes on over wifi.
https://www.facebook.com/digitizeicm -- Show your support for the digitization of the Iron County Miner newspaper archiv
The Heisenberg Principle states that measuring anything changes it. So I don't check anything to see if it works for fear of it falling apart.
We should learn what we need to know about issues, before we decide what we need to feel about them.
smartmontools works brilliantly under Windows too as smartd can be run as a service. With a suitable smartd.conf and blat to email reports, it can be a double-click-installed jobby. Also writes to the Event Log.
> betteridge's law of headlines applies here.
> No, it doesn't. This is an actual, legitimate question.
Thanks for the clarification. If you read the answers here, you'll notice that while most people don't test their new drives, some people do, so that proves you're right.
At two companies I managed IP libraries (massive amounts of photographs and drawings used in catalogs and advertisements). The data changes only slowly, and (depending on usage) seasonally, so incremental backups are very much practical. But that's not really the issue.
This is important. Raid protects you from certain kinds of failures, usually limited to the mechanical or electrical failure of a single hard drive. (More protection can be had by nesting raid levels, but for most installations this is the case.) Raid does not protect you from a wide variety of failures including data corruption from a bad controller or application bug, systemic failure of the raid appliance (example: a catastrophic power supply failure taking out multiple drives) operator-induced data loss, either accidental or malicious, or environmental catastrophe. If your data is important, there is still no substitute for backing up your data and sending it to a remote site. Even geosynch won't necessarily help if you're synching bad data to the only remote copy. And, I'm not yet convinced that syncing to "the cloud" is a good idea.
Mind you, backups don't have to be to tape. I'm a photographer when I'm not a geek, and I typically keep tens of thousands of photographs online on my workstation. As backup to tape, DVD or even blu-ray isn't really practical, I back up to a series of hard drives using one of those plug-in hard drive toasters, then carefully store them elsewhere, disconnected from the computer. Disaster recovery is a set of drives in a safe at a friend's house.
There are examples where backups aren't necessary. I worked with one array that was essentially a huge cache for 1-800 calls, and a complete wipe would only mean that customers would see a delay on the next call as their particular part of the cache was rebuilt. But for the most part, depending on raid instead of a properly implemented backup solution is a really bad idea.
Isn't this what hot spares are for?
I had a sig once. It was lost in the great storm of '09.
And those building systems that want QA on their builds, people storing their family videos and photos, hell, the list could go on all day. Problem is SMART got ruined by the OEMs: it's more about CYA than actually reporting the truth, and the only program that bypassed the lying SMART hasn't been updated in over half a decade and can't support newer drives.
ACs don't waste your time replying, your posts are never seen by me.
Unless you are using SLC, which is getting harder to find and more expensive every day you are really pushing your luck. The problem is the hot/crazy scale when it comes to these drives, specifically the fact that nobody has figured out how to lick the controller issue. For those that haven't run into it yet (lucky bastards) the controller issue will cause a drive to suddenly fail without ANY warning and unlike how the SSDs are always bragged on to "fail safe" into a read only mode what actually happens is when the controller fails the whole drive is completely dead, it won't even show up in BIOS/UEFI.
So until somebody figures out how to lick the controller problem (and when they do, the money they make will truly be insane), or comes up with the idea that I have been advocating for years of putting a second, cheaper ARM controller on the board designed to take over as a read-only backup while you get your data out, I'd be seriously leery of trusting any data I cared about to an SSD, not without spinning rust backups at the very least. The controller bug seems to bite every OEM on the ass; I have seen it from Intel to OCZ and it's always the same. Push the button and poof! Data all gone with the drive. And of course, since you can't get your data off or even wipe it, you have to hope they don't send it to some third world country for refurb where they help themselves to your data. Because of this I don't think my customers have even used 10% of their warranties for fear of the data falling into the wrong hands: great for the OEMs, which rarely have to make good on warranties, not so good for the customer.
I've been dealing with hardware failures for 20+ years. What I've learned is that disasters WILL happen, regardless of what preventive measures are in place. So I shifted my focus toward recoverability. To me, the important question is "When something catastrophic happens, how quickly and easily can I put things back in working order?"
Since I use RAID where appropriate, and more importantly, I am positively fanatic about frequent, full, and tested backups, the only concern I have when a hard drive dies is whether I'm still entitled to a warranty replacement.
Yeah, I would like to see a better method for communicating these errors up from the kernel through userspace. Most of the time when a "normal" user gets EIO errors, they see some kind of crash or debug message. If the filesystem could simply put the filename with the error into a list for some userspace service, the GUI file manager(s) or some health monitoring service could notify the end user with something a little more descriptive.
This could also let the user activate the relocation write scrub for that file.
I guess this is all stuff that can be solved in the more advanced filesystems like ZFS/btrfs where they can simply read the replicated copy or recover with the RS code blocks. Then the end user doesn't even know they had a platter defect outside the relocation count.
On Black Friday I bought a 1 TB drive at Office Depot, and of course they waved the box over their anti-theft degausser. I asked for a different drive and told them that they shouldn't do that with drives. The girl gave me the look we all have seen, but the boy behind her actually agreed with me and they gave me a drive out of the cage and let me leave the store with the alarm blaring. I've just about filled it up already and it's been working fine.
Jack of all trades, master of some.
I test every single drive before deployment. I've found Gibson Research Corp (grc.com) Spinrite to be vital. It's pretty much the only drive test / repair / recover tool I use - other than RAID recovery tools. I'm astonished at the number of people who say they don't test at all.
Go visit a UPS or FedEx distribution center and watch the "slapper" kick packages off the 45 MPH belt onto a slide at the loading dock. Small boxes like hard drive packages are airborne. It doesn't matter how much the factory tests. Shipping damage is a very serious issue.
BTW - I have no connection with GRC, except as a near-daily user of their products.
Place nail here >+
But I always do a full/slow format to at least do a sanity check.
We are the 198 proof..
Shouldn't this be a question for HDD manufacturers and OEM resellers?
I don't bother doing a full test before putting a new drive in service. If I am buying a new drive, it is because I need it now, and I need to be back up and running stat.
When buying a new drive, I have expectations of the manufacturer doing their job and selling me a working product that should be operational out of the box. However, after I have copied my data over from where I had backed it up (usually onto a shared drive on another system on my network), I keep the backup handy until I know for sure that the new drive will not need to be returned to where I got it for a warranty replacement.
This space unintentionally left blank.
Not sure how effective this is, but we've been testing our hard drives using the Long Generic test with SeaTools. It appears to do a write/read test on each sector of the drive, as large drives such as a 2TB can take almost a full day to complete. There's also an option to repair bad sectors during the test. Seems to be pretty effective, and it's probably better than nothing. YMMV
Holy crap. Twenty 3T spindles in a single array ? What do you do to de-stress ? Run between cars on a highway ?
> If the filesystem could simply put the filename with the error into a list for some userspace service, the GUI file manager(s) or some health monitoring service could notify the end user with something a little more descriptive.
> This could also let the user activate the relocation write scrub for that file.
I wholeheartedly agree, and I'd also like S.M.A.R.T. capabilities to actually be properly integrated with the OS if the drive supports them and reports sane values. Alas, very few OSes by default actually monitor S.M.A.R.T. or provide facilities for reporting component health to the end-user -- if the same facilities also monitored the health status of any other possible components in the system -- GPU, CPU, motherboard, other attached devices that know how to report their health -- it could possibly save people huge amounts of needless headaches.
> I guess this is all stuff that can be solved in the more advanced filesystems like ZFS/btrfs where they can simply read the replicated copy or recover with the RS code blocks. Then the end user doesn't even know they had a platter defect outside the relocation count.
Well, it shouldn't be solved completely silently. End-user should still be warned of such defects even if the filesystem can correct them just so that the user can keep this in mind should there appear more such defects in a short amount of time.
If I can fill a drive, flush or disable anything that would interfere with read tests, then read back all the data okay, then I'm confident enough to trust the drive. Yes, it could be bad but I'm not going to waste time doing additional burn-in tests.
If I'm doing a fresh OS install, it's easy enough to install the OS on the bare drive, configure it the way I want to, then boot with another disk (bootable CDs are your friend), fill up the disk with pad data, reboot, clone the disk, reboot, then verify the clone. If the clone succeeds, I trust the drive. Delete the pad data and boot with the new drive and away we go.
If it's not a fresh install I'm probably either cloning another OS disk to this one, in which case I do pretty much what I said above, or it's going to be a data disk. If it's going to be a data disk, I copy any data I want on it then fill it up with a pad file then clone and verify it as above. If the clone verifies okay, I'm good to go.
This is for non-mission-critical use of course.
If it were mission critical it's probably in a RAID with some redundancy anyways, so this test would be adequate.
If I don't have enough scratch space to fully clone the disk, I can clone and verify it in stages, one several-hundred-GB-chunk at a time.
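Verifying in stages can be sketched roughly like this: hash each fixed-size stage of the source and of the clone and report the stages that differ. This is just an illustration of the idea, not the tool I use; the function names are invented, and it compares two existing paths, whereas with limited scratch space you would clone one stage, verify it, and reuse the space for the next.

```python
import hashlib
from itertools import zip_longest

def stage_digests(path, stage_bytes, buffer_bytes=1024 * 1024):
    """Yield (stage_index, sha256 hex digest) for each stage of the file."""
    with open(path, "rb") as handle:
        index = 0
        while True:
            digest = hashlib.sha256()
            remaining = stage_bytes
            while remaining > 0:
                data = handle.read(min(buffer_bytes, remaining))
                if not data:
                    break
                digest.update(data)
                remaining -= len(data)
            if remaining == stage_bytes:  # nothing was read: end of file
                return
            yield index, digest.hexdigest()
            index += 1

def mismatched_stages(source, clone, stage_bytes):
    """Return the indices of stages that differ between source and clone."""
    bad = []
    pairs = zip_longest(stage_digests(source, stage_bytes),
                        stage_digests(clone, stage_bytes))
    for src, dst in pairs:
        if src != dst:  # also true when one side ran out of data early
            bad.append((src or dst)[0])
    return bad
```

A call such as `mismatched_stages("/dev/sdX", "/scratch/clone.img", 200 * 10**9)` (both paths hypothetical) returns an empty list when every stage matches.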
OK, I lied. I don't really do all of this most of the time. Most of the time I just use a disk-erase program that offers verify-after-write and trust that if the drive's firmware does lie to me and report "successful read-after-zero" before the sector is zeroed out that it will either successfully remap the sector or before the end of the testing a "hard" error of some type will be reported back to the disk-erase program. I will be fooled if I bought a bum drive with lots of bad sectors but at least as many spare sectors for the firmware to remap, but that's an acceptable risk for almost every scenario I run into. I may also be fooled if all but the last failed sector remapped successfully but the last one did not. Hitting this edge case on a brand-new drive is extremely unlikely.
Oh, there's also the "ear" test, the "it's taking too long" and "gee, that was fast" tests, the "ouch, this drive is way too hot" test, and other tests you don't think about but which give an experienced person a reason to suspect a problem every time he hears an abnormal sound, waits way too long for an operation, is surprised at how fast a drive is, or touches a drive and nearly (?) burns his fingers.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
The best is having both RAID6 and at least a hot spare. With the difference between disk capacities versus I/O so disparate, even with two drives, there is a large window (~24 hours in some cases, even on the high end SANs and lower tier drives) where the array is in a degraded state and is being rebuilt. The hot spare is important in this case because it allows the array to start being rebuilt immediately.
The ironic thing is that tier 3 drives store a lot of critical data, even though they are relatively cheap and slow. Because of this, taking reasonable measures such as RAID 6 + a hot spare or two is a must.
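To see where a window of roughly 24 hours can come from, divide the drive capacity by the sustained rebuild rate. The numbers in this sketch (a 3 TB drive rebuilt at an effective 35 MB/s while the array stays in service) are assumptions picked for illustration, not measurements.

```python
def rebuild_hours(capacity_tb, rate_mb_s):
    """Hours needed to rewrite a whole drive at a sustained rate."""
    capacity_mb = capacity_tb * 1000 * 1000  # decimal units: 1 TB = 10**6 MB
    return capacity_mb / rate_mb_s / 3600

print(round(rebuild_hours(3, 35), 1))
```

Capacity has grown much faster than per-drive throughput, so this window keeps getting longer with each drive generation, which is the argument for the extra parity drive and the hot spare.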
If the drive successfully syncs up to the array, then it passes; otherwise it gets RMAd.
When a stress test wears out the SSD too much, either you do not need it in the first place, or your usual usage will be too much stress as well.
I've seen HW and SW RAID leapfrog each other. Before Oracle bought Sun, there was a time when Sun pushed software RAID and Veritas Volume Manager, saying that CPU is cheap so you might as well use it for drive I/O. A year or two later, they came out with hardware RAID appliances and the sales guys pushed how good it was to have all the CPU overhead of parity generation done on disk controllers.
HW and SW raid I use on an application basis. For enterprise tasks, this tends to be moot because the data resides on FC/FCoE LUNs, and the only time I might worry about using RAID on the system side would be if I were migrating data from one SAN to another without shutting things down. Local disk tends to be more an afterthought (mainly because it tends not to have the ability to use MPIO), so it ends up being mirrored, just out of simplicity.
Of course, for desktops that are used on a day to day basis, those get hardware RAID so a drive failure means a SNMP trap firing off and a trouble ticket to desktop support versus a user screaming bloody murder.
Whenever I get hold of a new drive I run it first through the SMART conveyance test (which usually comes up clean) followed by an extended test. The latter has shown errors in a surprising number of drives; if I had to give a rough estimate, I'd say around 5%. These are usually read errors, which can usually be 'fixed' by overwriting the sectors in question, but they generally forebode problems with the drive later on. If a drive shows errors in any of these tests I RMA it. The replacement drive gets a similar treatment.
--frank[at]unternet.org
I test new drives for performance, not for reliability. Now you got me thinking...
Stress testing hard disks is a particular bugbear of mine, after having some really bad luck with early hard disks. Over the 15 years that I've been doing it I've had to send back loads of hard disks and flash cards because they failed my tests, either breaking completely or returning single bit errors in your data. Mostly the manufacturers will take disks back if you can get their stupid Windows program to return an error code. Sometimes it takes a bit of arguing but ultimately the manufacturers want to keep you happy. Flash disks with single bit errors are the hardest to send back in my experience.
Here is the latest generation of my stress testing code (re-written in Go recently): https://github.com/ncw/stressdisk
(Interestingly the stressdisk program sometimes finds bad ram in your computer too!)
I generally thrash every new hard disk or memory card for 24 hours to see if I can break it before trusting any data to it!
I also run a long SMART test.
Somewhat paranoid, yes, but I really, really hate losing data!
Every man for himself, all in favour say "I"
I asked the same question not so long ago: http://slashdot.org/submission/2004807/ask-slashdot-how-do-you-go-about-testing-a-storage-medium. Sadly, the comments with all the helpful messages in them seem to have disappeared.
That is all correct. If you are lucky, retrying the read may help. If the hard disk does not receive any writes for that sector, it does not get relocated. If the hard disk repeatedly receives read requests for that sector, it will try to read it each time it gets a read request. If eventually the read succeeds, then the sector gets relocated at that time.
So relocated sectors don't mean lost data. However, relocated sectors are a warning sign that the drive may be dying. The study on hard drive failures that Google did a few years back found that even a single relocated sector meant a significant increase in the likelihood of the disk failing.
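For anyone who wants to watch for that warning sign automatically, here is a minimal sketch that pulls the raw counters out of the attribute table printed by `smartctl -A`. The sample text is made up for illustration (it is not from any real drive), the function names are hypothetical, and the choice of which attributes to watch is an assumption; as noted elsewhere in this thread, not every drive reports sane values.

```python
# Attribute names as they commonly appear in `smartctl -A` output.
# Which ones are worth watching is a judgment call, not a standard.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the rows of a `smartctl -A` table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the 10th column is the raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def warning_signs(attrs, watched=WATCHED):
    """Return the watched attributes whose raw value is non-zero."""
    return {name: attrs[name] for name in watched if attrs.get(name, 0) > 0}


# Invented sample output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3000
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(warning_signs(parse_smart_attributes(SAMPLE)))
```

In practice the text would come from running `smartctl -A` against the device and capturing its output; going by the Google result above, any non-zero reallocated count is worth a second look.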
Do you care about the security of your wireless mouse?
Please. Quoting Jeff Atwood as an authoritative source on SSDs?
Some anecdotal evidence and a subsequent admission of buying from the brand known for the highest failure rate in SSDs isn't going to convince anyone.
I'd like to see some proper statistics before I believe anything you say.
The most reliable statistics I've seen show SSDs performing as well as or better than HDDs when it comes to failing. I haven't seen any statistics on what percentage of failing drives did so spontaneously, completely, without warning and without any possibility for repair.
Mind you, I'm not claiming they don't. Just that I haven't seen any evidence beyond some anecdotes. And well, anybody that trusts a single drive with important data is an idiot or ignorant anyway.
We have recently purchased around 300-400 drives of 500GB from Hitachi GST.
Our test method for checking the drives is filling them up with files (by replication) and doing a hash check afterwards (comparing against the original source file). Should the drive drop out (due to retry errors), it is RMAd. We do check SMART afterwards, as based on experience it is fairly accurate on the sector reallocation count when the drive is in imminent failure. You also need to keep an eye on read statistics (we use iostat) to check if the performance is sub par. Normally, the drives will return to normal speeds after sector reallocation.
Based on our statistics, I would say that we get around a 1% defect rate for the drives (we have swapped out around 3 of them: 1 died and 2 had bad sectors). After around a month, you get a further 1% or less (typically from further bad sectors). The same goes for after around 1 year.
On another interesting note, we purchased around 8 pcs of 2TB from Hitachi GST and, probably because of a batch problem, we had to replace around half of them due to bad sectors. We had an earlier batch of around 8 pcs of 2TB, but all of those were good.
As for performance, there are times when some of the drives deliver inconsistent performance (the hash checks don't all finish at the same time). We don't classify the drives, but my guesstimate is around 5%.
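A rough sketch of the fill-and-hash idea, for anyone who wants to try it at home. This is an illustration under assumptions, not our actual tooling: the function names, the `copy_NNNNNN.bin` naming and the choice of SHA-256 are all made up for the example.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def fill_and_verify(source, target_dir, copies):
    """Replicate `source` into `target_dir` `copies` times, then re-read every
    copy and return the paths whose hash does not match the source."""
    expected = sha256_of(source)
    paths = []
    for i in range(copies):
        dest = os.path.join(target_dir, f"copy_{i:06d}.bin")  # hypothetical naming
        shutil.copyfile(source, dest)
        paths.append(dest)
    return [p for p in paths if sha256_of(p) != expected]
```

Something like `fill_and_verify("/path/to/big.iso", "/mnt/newdrive", 400)` would be the call, with the copy count chosen to roughly fill the drive. One caveat: the OS page cache can serve the read-back from RAM, so the data written should be much larger than memory (or the drive remounted between the two phases) for the check to actually hit the platters.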
Live your life each day as if it was your last.
I grind through lots of hard drives.
Among my other duties at our ASP software company, I perform the system administration, which includes backing up a few hundred databases daily and perhaps a few dozen billion files. To give you an idea of our backup size, we currently have about 20 TB of data on redundant consumer drives in RAID1 fashion, for about 10 TB of effective storage space.
I've gone through dozens of WD consumer drives with nary a failure, while I've had 2 out of 3 consumer Seagate drives fail within a few months, over several model lines.
For the past few years, I've more or less stayed clear of Seagate, although I had a number of their SCSI 10K drives in production with no trouble.
And everywhere you go, you get wildly conflicting results like this. (shrug)
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Yea, I was happy to see Ubuntu doing something with basic SMART output by default. The main problem is the more advanced health detection values are basically noise unless you're the manufacturer or a big enough disk customer that they will let you in on the secrets. But like you implied, lots of drives don't output sane values.
Yes, more bubble up health reporting would go a long way toward making computer support easier.
Anyone who cares about reliability? Probably why they would also test the drives before doing anything important with them.
I also wanted to point out that the difference between full and partial backups disappears when you do backups using file-level hard links as your backup solution.
Doing backups disk-to-disk with rsync, using the hard links option, the difference between partial and full backups disappears. All backups are full AND partial; you get the benefits of both.
We do an "incremental" backup daily, in that only the files changed in the interim transfer as part of the backup, and only the changes occupy additional disk space, but the result is a "full" backup in that we end up with a complete snapshot of the file system that can be copied or used on demand.
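To make the idea concrete, here is a toy sketch of a hard-link snapshot in the spirit of rsync's `--link-dest` option (whether that is the exact option meant above is an assumption). It is not our solution: the `snapshot` function is hypothetical, and it compares file contents byte for byte, where rsync by default compares size and modification time.

```python
import filecmp
import os
import shutil


def snapshot(source, previous, current):
    """Build `current` as a complete copy of `source`.

    Files identical to the one in the `previous` snapshot are hard-linked
    (no extra disk space); new or changed files are copied. Pass
    previous=None for the very first snapshot.
    """
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(current, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            new = os.path.normpath(os.path.join(current, rel, name))
            old = os.path.normpath(os.path.join(previous, rel, name)) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)       # unchanged: share the existing data blocks
            else:
                shutil.copy2(src, new)  # new or changed: store a fresh copy
```

Every snapshot directory looks like a full backup and can be copied or restored on its own, but only changed files cost space. Hard links require the snapshots to live on the same filesystem, and deleting an old snapshot only frees the blocks that no other snapshot still links to.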
This is really the way to go, and our particular solution is free and open sourced long ago.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Exactly this.
I know a (very large) Data Center belonging to a (very large) company which started replacing their HDDs with SSDs. The price difference isn't even that large; price-per-GB for a server-grade 15K RPM SAS was negligibly close to SSD price. And the advantages are really there: (much) lower heat produced, less noise, less space taken, less energy consumed. Even with a similar failure rate, the advantages are there.
...gis sdrawkcab (usually not responding to ACs; don't bother posting as AC)
The controller issue is mitigated by backups, RAID, and hot spares.
Of course even with RAID 1 (mirroring) and a hot spare that's 3 drives, not to mention your backup system, though that's not normally an SSD.
Ideally you'd have the same redundancy with spinning rust drives, of course, but the higher MTBF and the higher chance of detecting a pre-failure state can allow one to get away with less, say RAID with hot-swap.
Not a sentence!
The last thing I would want to do to a new, potentially-untrustworthy hard drive is get it off to a bad start by causing more wear and tear on it right from the beginning. I just put it in, fire it up and start using it. First on simple and less important things, and then after a while of regular use and after I have gained some trust in it I start using it for more important things.
Of course... this can't always be done, like when replacing the system drive, so in that case I install the drive as usual, set the OS up, make copies of the files I will need on it, and just see how it goes from there.
I have yet to buy a drive that died without working for at least several years of its life (5-8 years or more usually), so I don't typically buy a new drive with the expectation that it's a goner. I have had pretty good luck, with hard drives long outlasting computers.
ZFS is a lot quicker to rebuild so long as the array isn't completely full, for no other reason than it's doing less than a full dumb bit for bit copy.
Hardware RAID can sometimes spend stupid amounts of time copying all of that empty space, but I'm sure that will improve over time with better firmware. Even though that stuff is at the filesystem level there'll be ways to recognise empty space.
Bare minimum is dd if=/dev/zero of=/dev/sdX before any drive is put into use while monitoring with smartctl followed by a file system and large file (50% of drive size) read and hashing.
Same thing with RAM, who doesn't stress test it with memtest before using it?
Recently I purchased a bunch of WD Red drives and all six failed within 37 hours of first spin up. Dead Reds with a 37-hour MTBF.
I cannot confirm nor deny the allegation or allegations you may or may not have just made
Interesting. I have a NAS with 3TB Seagate drives set up in RAID5. There was apparently a firmware issue in these drives that made the NAS drop a drive every now and then for a spurious read or write error. I would run a verify pass on the failed drive and it all checked out OK. The drive would happily be added back into the RAID group.
Because of the capacity, the rebuild took about 2 days. But on one of the drives it actually took 6 days! Perhaps I should look into this again.
The drives have since been updated to the latest firmware and the NAS has not dropped any since.
To Terminate, or not to Terminate, that's the question - SCSIROB
Isn't this what hot spares are for?
Which model TiVo is it that lets you use hot spares?
I see even classic Slashdot is now pretty much unusable on dial up anymore.
Actually the real saving grace of ZFS is that you can't get into the absurd situation of rebuilding a whole drive that only has a single bad sector. This is a life-saver, since chances are you end up finding another bad sector somewhere else during the rebuild, but thanks to the checksums if you do you're not left with corrupted data.
I had this happen to me with an mdraid6 volume: single disk, single bad sector - kicks out the disk (and so stops syncing changes to it). Triggers a rebuild, finds another bad sector elsewhere, kicks out that disk. Suddenly my 2-disk redundancy has dropped to zero, and god help me if there's a bad sector anywhere else.
I'm pretty skeptical of the "offline hard disk" approach. Unless you are very careful with those things, they're not inert when unpowered - you've got grease slowly hardening, oxygen leaking in, thermal cycling stress - and, most importantly, they're designed to last about 3-5 years under constant use, not 10 years on the shelf.
There is no useful testing a user can do. The factory has already run more tests, at a deeper level, than anything the user could imagine or replicate....
And then it gets packed and shipped to a vendor, and repacked and shipped to a customer.
Or it gets packed and shipped to a company which divides up that shipment and ships it again to their brick and mortar outlets.
Or there's a distributor in between the manufacturer and the retailer.
Anyway, it can be perfect coming out of the factory and get used as a football somewhere during shipping.
I see even classic Slashdot is now pretty much unusable on dial up anymore.
The same model TiVo that lets you add disks to its RAID array?
I had a sig once. It was lost in the great storm of '09.
Usually if I buy a new drive, it's to rotate out an older one. My "test" is to copy the mostly full drive onto the new one and keep the old drive on the shelf for a couple months in case of problems. Most drives either suffer an early death or last a good number of years. After 2-3 months I'll reuse the old drive for other storage needs.
It's true they're not inert when unpowered, but modern drives park the heads, making them less fragile than in the old days. It's true they're more sensitive to physical abuse than are tapes, but one takes that into account.
It's important to keep track of how old they are and cycle them. I write on the face with a sharpie the date I started using them and what they're backing up. (Just as I track the start date for memory cards for the camera.) Once a year I replace the drive with my important data with a new, usually larger drive, (ghosting the data) and the old one becomes the level zero backup. Incremental backups are done to spare lower capacity drives which are used for five years or so then scrubbed and donated to the local freegeek.
I don't keep the backup drive online. I know that a lot of people use an external drive as a backup and leave it plugged in, but that does not protect you from data corruption and some types of viruses. A good backup is physically disconnected from the machine. A great backup is geographically distant from the machine.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Quite relevant post for me, as I just had a Seagate 1.5T drive (which they sneakily had branded as Samsung) go bad after just 3000 hours - I purchased the drive in May and there are plenty of upset reviewers there complaining about Seagate trashing Samsung's name. I heard clicking, but interestingly SMART returned no errors. Luckily Seagate's SeaTools software detected the error:
Model: ST2000DL004 HD204UI
Firmware Revision: 1AQ10001
SMART - Pass 12/18/2012 10:45:55 AM
Short DST - Started 12/18/2012 10:46:11 AM
Short DST - FAIL 12/18/2012 10:48:14 AM
SeaTools Test Code: 6C9AC2A4
So, I set up the RMA. I think I'll go with a WD as a real replacement - they still have drives with 5 year warranties. Even there, though, on the newegg board are allegations they're either experiencing significant delays in getting a replacement, or the replacements are also bad.
But my real reason for posting was wondering about the integrity of a replacement drive? If I'm getting a "refurbished" drive, can I be guaranteed there's no virus/worm residing on the MBR? Is there a way to completely purge the drive that would clear any virus/worms?
I wiped my drive using HDDErase which worked without a hitch. I believe that would fix any infections, so maybe I'll start doing that before installing replacement drives. Thoughts?
The HDD tool that I really needed that nobody supports is the one that lets you manage manufacturers' hidden disk partitions on PATA drives.
A couple of years ago I had a 200GB Maxtor external hard drive which eventually started getting bad blocks, so I replaced the disk with a new 500GB PATA drive. The box didn't recognize the drive (because it was newer than the box, so the model wasn't listed), so it reformatted it using the "hide a partition" feature, leaving me with a drive that pretended to be 200 GB. (The feature's sometimes used for hiding a small Windows recovery partition, but is also available so a manufacturer can do things like turning a 300 GB drive with bad blocks into a 200 GB drive with only good blocks...) Windows couldn't see the extra space at all, and the one Linux tool that was supposed to be able to support it was able to make the partition smaller but not larger, because it really didn't know about LBA yet :-) I ended up buying pizza for my local kernel wizard, and we looked at the source code for the tools and recompiled the Linux kernel to get some things to work a bit better, but still weren't able to get it going. At that point I decided it was more cost-effective to buy a new drive than keep wasting time, but it was still really annoying, and it was at least a working spare 200GB drive.
I hope that by now, SATA drives don't have that feature, or else Win7 or Linux can work with it (because the point of the feature is supposed to be that you can't have full access to that space...)
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
I see people here hating on Seagate, and on WD, and I'll happily complain about Maxtor from before they got bought, but who's left? Are there any disk makers that people don't hate?
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
I run a small hosting company (an add-on to our consulting), and we run burn-in testing on every drive we put into production.
Before we did this, we would regularly run into drives that would sporadically fall out of the RAID array because a block couldn't be read during validation. Once we started testing our drives, this all but stopped. My guess would be that some parts of the platters were marginal, and after a few read/write cycles they would fail and need to be relocated. So doing some testing would cause these blocks to be remapped.
What we do is "badblocks -svw -p 10". However, we've reduced it from 10 down to 3 because 2TB drives take so long to test now and that is our standard drive now. We target a few days to a week of burn-in testing.
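For anyone curious what a write-mode pass amounts to, here is a small sketch of the same idea: write a known pattern over the whole area, read it back, and note which blocks differ. It is an illustration only and no substitute for badblocks. The `pattern_test` function is hypothetical; the four patterns are the ones badblocks uses in `-w` mode, and like `-w` this is destructive to whatever is stored at `path`.

```python
import os

# The patterns badblocks -w cycles through.
PATTERNS = (0xAA, 0x55, 0xFF, 0x00)


def pattern_test(path, size, block_size=1 << 20, passes=1, patterns=PATTERNS):
    """DESTRUCTIVE: overwrite the first `size` bytes of `path` with each
    pattern, read them back, and return (pass, pattern, offset) for every
    block that did not read back as written."""
    bad = []
    for p in range(passes):
        for pattern in patterns:
            with open(path, "r+b") as f:
                # Write phase: fill the whole area with the pattern.
                offset = 0
                while offset < size:
                    n = min(block_size, size - offset)
                    f.write(bytes([pattern]) * n)
                    offset += n
                f.flush()
                os.fsync(f.fileno())  # push the data out to the device
                # Read phase: compare every block with what was written.
                f.seek(0)
                offset = 0
                while offset < size:
                    n = min(block_size, size - offset)
                    if f.read(n) != bytes([pattern]) * n:
                        bad.append((p, pattern, offset))
                    offset += n
    return bad
```

Called with `passes=3` this does three rounds of all four patterns. Note that the read-back here can be satisfied from the OS page cache, so on a real device it proves much less than badblocks does; treat it as a picture of the technique.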
Other things that this resulted in:
There was a Linux kernel bug with the LBA access code that caused one specific block on the drive to always report as bad with certain firmwares. The old firmware used to silently do the right thing with what the kernel was asking; the newer firmware reported a read error on this sector.
We've also found some drives that passed the testing fine, but did so at around a tenth the throughput. We were never able to track down why this was. We had a batch that was exhibiting this, and we just gave away the 10 or 15 drives that were impacted.
We also had a batch of 10 or so drives, half of which were reporting high numbers of failures. We figured something had happened upstream (at the reseller or during shipping) and so we replaced even the ones that tested out OK.
So, yes, test your drives. Even though we're putting them in RAID arrays, we like to run the tests.
Sean
Correct.
It's their latest model, the Unobtania.
I see even classic Slashdot is now pretty much unusable on dial up anymore.
Two 1.5TB Seagates failed here, one died completely just after its 90 day warranty expired, the other lasted almost 6 months before its SMART error rate abruptly became huge (with a lurid warning that the drive is about to fail). They were the first Seagates I've had in years, and they'll be the last allowed in this house for several more years. My previous experiences with Seagate had been good, but that was back in the sub-GB days (and sub-GB is not a typo) before Seagate quality became a crap-shoot. The failed 1.5TB Seagates were replaced with 2TB WD drives, which have been humming along without SMART errors for almost two years.
Those who can make you believe absurdities can make you commit atrocities. - Voltaire