Reviews of Hard Drive Reliability?
ewhac asks: "After having three 18G drives go toes-up on me in the last two months, all of them done so after about 40 days of use, I want the replacement drives to be rock-solid. While Tom's Hardware and AnandTech review individual drives and their performance, I haven't yet been able to locate any comprehensive or cohesive review of drive reliability and longevity. Does such a resource exist?"
Just had two 18GB IBM SCSI (LZX) drives die after less than a year. Also had 6 bad disks in 5 months on a shark at work.
Never, ever, ever, ever buy IBM storage.
Conformity is the jailer of freedom and enemy of growth. -JFK
I commend the request for asking for real data.
Anecdotal evidence from people who have had drives of a certain brand fail on them and then say "never use this drive" is basically worthless. Even if you hear 5 or 10 people say that, ignore them.
What you need to know is if there are enough anecdotes to show that the mfgr's MTBF rate is inaccurate and the real rate is a lot lower than what they report (or a lot lower than other mfgr's). Or maybe if there is a certain batch of drives that are anomalous.
The question is: is the mfg's MTBF rate good enough for you and is it accurate?
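One rough way to put a number on "is the claimed MTBF believable?" is to ask how likely your observed failures would be if the claim were true. The sketch below assumes a constant failure rate (so failures follow a Poisson distribution) and independent drives, and it plugs in an illustrative claimed MTBF of 500,000 hours along with the question's "3 drives, about 40 days each" figures. The function names are made up for this example.

```python
import math

def poisson_tail(k, mean):
    """Probability of seeing k or more events when `mean` are expected."""
    below_k = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k

def chance_of_failures(drives, hours, mtbf_hours, observed):
    """Expected failures and P(at least `observed`) under the claimed MTBF.

    Assumes a constant failure rate and independent drives; total exposure
    is approximated as drives * hours.
    """
    expected = drives * hours / mtbf_hours
    return expected, poisson_tail(observed, expected)

# Assumed inputs: 3 drives, ~40 days of use each, claimed MTBF of 500,000 hours.
expected, p = chance_of_failures(drives=3, hours=40 * 24,
                                 mtbf_hours=500_000, observed=3)
print(f"expected failures: {expected:.5f}")
print(f"chance of 3 or more failures: {p:.2e}")
```

If the probability comes out vanishingly small, either the claimed rate is wrong for your drives or the independence assumption is (a bad batch, heat, or power affecting all of them at once).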
www.storagereview.com has started a reliability database but I don't know if their data is statistically valuable yet.
Jesus saves....And takes 1/2 damage.
If you've had 3 hard disks die on you in 2 months, the problem may not have been with the disks themselves. The first thing to check is whether you're getting adequate ventilation to the area where the hard disks are. You might also want to test the voltage your power supply is putting out.
Questions like this about hard disks are really better answered here.
The Storage Review reliability index should serve you well. Unfortunately, the site itself may be taken down soon (for financial reasons), so get there quick.
Four 36-gig drives out of 16 in our array blew out last week. (Probably heat-related. We had some AC problems in the computer room but the room never exceeded rated temperature.) Two weeks before that, two 18-gig drives in separate machines died for unknown reasons. The 36-gig drives were IBM. The 18-gig drives were Seagate (who, at one time, made the IBM drives). In the last two months, we've also lost a few Maxtor drives.
Except for the batch of drives in one array, the above is fairly typical. We have thousands of drives from many vendors and I can't swear one is any better or worse than the other. Hard drives all pretty much suck.
Sure, we all read about MTBF being 500,000 hours for new drives but that's a pipe dream. Drives burn out every single day.
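For what it's worth, here is a back-of-the-envelope sketch of what a 500,000-hour MTBF would mean for a big shop. It assumes a constant failure rate and drives running 24x7; the fleet size of 5,000 is a made-up stand-in for "thousands of drives", and the function names are just for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365

def annual_failure_rate(mtbf_hours):
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def expected_failures_per_day(fleet_size, mtbf_hours):
    """Average failures per day across a fleet of always-on drives."""
    return fleet_size * 24 / mtbf_hours

afr = annual_failure_rate(500_000)                  # about 1.7% per drive per year
per_day = expected_failures_per_day(5000, 500_000)  # hypothetical 5,000-drive fleet

print(f"annual failure rate per drive: {afr:.2%}")
print(f"expected failures per day in the fleet: {per_day:.2f}")
```

Under these assumptions the fleet would lose a drive roughly every four days even if the spec sheet were honest, so a failure every single day points to either a larger fleet or a real-world MTBF well below the quoted one.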
If you have the money, buy a pair of top quality drives and mirror them. If you can't afford that, buy a couple of cheap drives and mirror them. Don't put important data on a single drive and expect it to be there when you get back from lunch.
InitZero
So, if you're finding your drives die in 30-60 days, there's likely another problem you're missing. If you're using SCSI, I'd guess they're probably 7200 or 10k RPM drives, which means LOTS of heat, especially if you have several. So, first of all, go buy a few 60 or 80mm fans, and stick them in front of the drives, if you can. Get some air flow across them (remember, air pushed across the drives does much more than air pulled/sucked across them). Heat will quickly kill a drive.
Barring that, you haven't said how the drives have died (won't spin up, unusual read errors, etc), but a poor power supply, especially one running at capacity could burn out a drive. Finally, any sort of shock (case constantly being moved, bounced around, kicked, etc) could do a drive in, though that is probably less likely.
As with anything else, it's all IMO, YMMV, etc.
We need a truly objective survey of hard drive reliability. My personal experience is nearly the exact opposite of yours-- I have had two Fujitsu drive failures within 2 years, and one IBM failure in 8 months. My Maxtor and Western Digital drives (even the really old ones) are all still running happily.
Just goes to show how true YMMV really is, and why anecdotal evidence isn't much help.
One problem I have is that most of the times I have had drives die early in their lifespan, it has been a 'batch' problem, and had I purchased two identical drives from the same vendor, chances are both of them would have died at about the same time.
Most mirroring solutions depend on using nearly-identical drives for the mirrored pair, right?
Another issue: I've had very few drives fail in service, where the system was running for years and then either just went dead or started getting disk errors, increasing over time. 99% of the failures I have encountered have been with drives that just would not come back up after a shutdown.
Sometimes you can hear the bearings going out, other times you shut the system down for just a few minutes, turn the power back on, and the drives just go 'clunk', but cannot spin up.
In the old days of 'stiction' this could sometimes be overcome by repeated powercycles or the old 'weak karate chop to the side of the drive' method.
Again, I've had multiple drives of about the same age fail in this manner, which in the case of a mirror, means losing the data...
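To illustrate the batch concern with a toy model (all numbers here are invented, not measured): suppose some fraction of production batches is bad, and a drive from a bad batch is far more likely to fail in a given window, such as failing to spin up after a shutdown. The sketch compares a mirror built from two drives of the same batch against one built from two independently chosen batches.

```python
def mirror_loss_same_batch(p_bad_batch, p_fail_bad, p_fail_good):
    """Chance both halves fail when both drives share one batch."""
    return (p_bad_batch * p_fail_bad**2
            + (1 - p_bad_batch) * p_fail_good**2)

def mirror_loss_mixed_batches(p_bad_batch, p_fail_bad, p_fail_good):
    """Chance both halves fail when each drive comes from its own batch."""
    p_drive = p_bad_batch * p_fail_bad + (1 - p_bad_batch) * p_fail_good
    return p_drive**2

# Hypothetical inputs: 5% of batches are bad; a bad-batch drive fails in the
# window with probability 0.5, a good-batch drive with probability 0.02.
same = mirror_loss_same_batch(0.05, 0.5, 0.02)
mixed = mirror_loss_mixed_batches(0.05, 0.5, 0.02)

print(f"same batch:    {same:.5f}")
print(f"mixed batches: {mixed:.5f}")
```

With these made-up inputs the same-batch mirror loses both halves several times more often, which is the argument for buying the two halves of a mirror from different lots when the mirroring solution allows it.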
I do not deploy Linux. Ever.
I've attempted this 'live software disconnect/spin down' with other OS's using standard SCSI, but haven't had much luck. Solaris never supported it before, and now only on FC-AL.
One trick you can do with this is to have a 'warm spare' installed, a drive that contains a mirror of the system as of the last major change, but is not constantly running. By keeping the spare drive updated, installed, and ready, you can recover from a failed disk remotely, without any need for physical intervention. Combine this with the new "RSC" (a battery-backed lights-out-management card with its own Ethernet and modem paging), and you really have something to brag about.
If the big Sunfires are out of your budget, a subset of the full feature set is in the LOM interface on some(?) Netra models.
One drawback of spinning down the disk (as I mentioned in another comment here): one of the most common failure modes is a drive that just won't spin up once you turn it off...
I do not deploy Linux. Ever.