Why RAID 5 Stops Working In 2009

← Back to Stories (view on slashdot.org)

Why RAID 5 Stops Working In 2009

Posted by kdawson on Tuesday October 21, 2008 @11:03AM from the back-'em-up-rawhide dept.

Lally Singh recommends a ZDNet piece predicting the imminent demise of RAID 5, noting that increasing storage and non-decreasing probability of disk failure will collide in a year or so. This reader adds, "Apparently, RAID 6 isn't far behind. I'll keep the ZFS plug short. Go ZFS. There, that was it." "Disk drive capacities double every 18-24 months. We have 1 TB drives now, and in 2009 we'll have 2 TB drives. With a 7-drive RAID 5 disk failure, you'll have 6 remaining 2 TB drives. As the RAID controller is busily reading through those 6 disks to reconstruct the data from the failed drive, it is almost certain it will see an [unrecoverable read error]. So the read fails ... The message 'we can't read this RAID volume' travels up the chain of command until an error message is presented on the screen. 12 TB of your carefully protected — you thought! — data is gone. Oh, you didn't back it up to tape? Bummer!"

6 of 803 comments (clear)

Min score:

Reason:

Sort:

Re:Carefully protected? by Hadlock · 2008-10-21 12:11 · Score: 4, Interesting

I can't vouch for DVD-R but I have el-cheapo store brand CD-Rs that I backed up my MP3 collection to 11 years ago and they work just fine. My solution is this:

Back everything up that's not media (mp3/video) every 6 months to CD-R, and once a year, copy all my old data onto a new hard drive that's 20+% larger than the one I bought last year and unplug the old one. I have 11 old hard drives sitting in the closet should I ever need that data, and the likelihood of a hard drive failing in the first year (after the first 30 days) is phenomenally low. Any document that I CAN'T lose between now and the next CD-R backup goes on a thumb drive or it's own CD-R and/or email it to myself.

--
moox. for a new generation.
Scrub your arrays by macemoneta · 2008-10-21 12:32 · Score: 4, Interesting

This is why you scrub your RAID arrays once a week. If you're using software RAID on Linux, for example:
echo check > /sys/block/md0/md/sync_action
The above will scrub array md0 and initiate sector reallocation if needed. You do this while you have redundancy so the bad data can be recovered. Over time, weak sectors get reallocated from the spare bands, and when you do have a failure the probability of a secondary failure is very low over the interval needed for drive replacement.
Most non-crap hardware controllers also provide this function. Read the documentation.

--
Can You Say Linux? I Knew That You Could.
I'm convinced. by m.dillon · 2008-10-21 13:26 · Score: 4, Interesting

I have to say, the ZFS folks have convinced me. There are simply too many places where bit rot can creep in these days even when the drive itself is perfect. The fact that the drive is not perfect just puts a big exclamation point on the issue. Add other problems into the fray, such as phantom writes (which have also been demonstrated to occur), and it gets very scary very quickly.
I don't agree with ZFS's race-to-root block updating scheme for filesystem integrity but I do agree with the necessity of not completely trusting the block storage subsystem and of building checks into the filesystem data structures themselves.
Even more specifically, if one is managing very large amounts of data one needs a way to validate that the filesystem contains what it is supposed to contain. It simply isn't possible to do that with storage-system logic. The filesystem itself must contain sufficient information to make validation possible. The filesystem itself must contain CRCs and hierarchical validation mechanisms to have a proper end-to-end check. I plan on making some adjustments to HAMMER to fix some holes in validation checking that I missed in the first round.
-Matt
The Black Swan by jschmerge · 2008-10-21 15:29 · Score: 4, Interesting

A Black Swan is an event that is highly improbably, but statistically probable.
Yes, it is possible for a drive in a RAID 5 array to become absolutely inoperable, and for one of the other drives to have a read failure at the same time. This is highly unlikely though, and is not the Black Swan. The math use to calculate the likelihood of these two events occurring at the same time is faulty. The MTBF metric for hard drives is measured in 'soft failures'; this is very different from a 'hard failure'.
The difference between the two types of failures is that a soft failure, while a serious error, is something that the controlling operating system can work around if it detects it. It is extremely unlikely that a hard drive will exhibit a hard failure without having several soft failures first. It is even more unlikely that two drives in the same array will exhibit a hard failure within the length of time it takes to rebuild the array. In my experience, it is more likely that the software controlling the array will run into a bug rebuilding the array. I've seen this with several consumer-grade RAID controllers.
The true Black Swan is when a disk in the array catches fire, or does something equally as destructive to the entire array.
To echo other people's points, RAID increases availability, but only an off-site backup solves the data retention problem.
Re:Carefully protected? by jaxtherat · 2008-10-21 15:48 · Score: 4, Interesting

Judging by the budget you quoted, it's a combination of all of the above: you are a crappy sysadmin for a crappy company with limited growth potential.
Sigh. *ignores flamebait*
Anyway, here's the actual reality of the situation:
I'm a not brilliant (but certainly not crappy either) sysad who is working for a company that has rapidly expanded to the point where they need a full time sysad, and then felt the kaboom of the subprime mortgage debacle, since they consult to the property market. Hence why my original upgrade budget got shrunk big time.
The company BOTH cares about their data AND can't afford a proper backup system.

--
http://www.zombieapocalypse.tv/
Re:Carefully protected? by techess · 2008-10-22 01:04 · Score: 5, Interesting

I always love it when Fed-Ex destroys something and then tries to hide it. One day I walked past the shipping office and I smelled the very strong odor of hydraulic oil coming from the room. I take a look inside since we shouldn't be receiving anything that has hydraulic oil in it. I found a bunch of boxes with the local Detroit Airport logo all over them and sealed with DET labeled tape. The cardboard was completely soaked through with the oil.
I carefully opened one of the boxes and found it contained servers! It appears that the original boxes got in some sort of accident at the airport and were completely soaked. At the airport Fed-Ex or the baggage handlers did us a "favor" and re-boxed everything. The servers were so coated (and filled) that even the new boxes were completely soaked through and the bottoms of the boxes were starting to pull apart. The Fe-Ex guy (so we wouldn't refuse them) dropped them off at lunch and then got some random person in the hall to sign off on it.
We had to pay for new servers to be built ASAP and shipped overnight (UPS this time) at huge cost for us. Since someone had signed off on the package we then had a very long fight to get Fed-Ex to pay for the equipment they destroyed. We never got the extra cost for the overnight shipping and the rush build reimbursed.

--
Don't anthropomorphize computers. They *hate* that.