Why RAID 5 Stops Working In 2009
Lally Singh recommends a ZDNet piece predicting the imminent demise of RAID 5, noting that increasing storage and non-decreasing probability of disk failure will collide in a year or so. This reader adds, "Apparently, RAID 6 isn't far behind. I'll keep the ZFS plug short. Go ZFS. There, that was it." "Disk drive capacities double every 18-24 months. We have 1 TB drives now, and in 2009 we'll have 2 TB drives. With a 7-drive RAID 5 disk failure, you'll have 6 remaining 2 TB drives. As the RAID controller is busily reading through those 6 disks to reconstruct the data from the failed drive, it is almost certain it will see an [unrecoverable read error]. So the read fails ... The message 'we can't read this RAID volume' travels up the chain of command until an error message is presented on the screen. 12 TB of your carefully protected — you thought! — data is gone. Oh, you didn't back it up to tape? Bummer!"
12 TB of your carefully protected â" you thought! â" data is gone. Oh, you didn't back it up to tape? Bummer!
If it wasn't backed up to an offsite location, then it wasn't carefully protected.
There are shills on slashdot. Apparently, I'm one of them.
... very few people correctly cycle in new drives periodically to reduce the chance of a mass failure.
That is also because very few people buy a Raid setup piecemeal. Most end up buying a solution, fully populated. The idea of swapping out some drives as you go, or growing your RAID over time doesn't always look good, either to the PHBs who usually run the budget, or to the vendor. We had a vendor trying to sell us a iSCSI SAN device tell us that varying the drive lots and dates increased the chances of failure. Needless to say we went elsewhere.
When we bought the RAID array for our Exchange box, this is going back a few years, everybody looked at my like an idiot because I asked for drives with different lot numbers. It was the best I could do as buying over time was not an option. HP was actually pretty cool about this request and out of 8 disks, no 3 have the same lot number or manufacture date.
Of course we are also running RAID on that machine for non-backup and do a nightly replication, so your mileage may vary.
"To Do Is To Be" - Socrates, "To Be Is To Do" - Sartre, "Do Be Do Be Do" - Sinatra
Wow, how incite-ful. Doesn't matter what the discussion is, some geek is bound to weigh in with all the shortcomings of any idea.
Newsflash: there is no perfect backup! No method is foolproof, especially when it's bound to be boring as hell, and you've got an inevitable human factor. You get lazy moving the tapes offsite, you put off fixing a dead drive because there are 4 others, you wipe your main partition upgrading your distro and forget that your CRON rsync script uses the handy --delete flag, and BOOM wipes out your backup.
Shit happens. Pointing out what we all already know doesn't do anything helpful.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
RAID doesn't protect against your worst enemy
rm -r *
nor is it supposed to. not being a moron seems to have protected me from "my worst enemy" just fine. RAID has protected me from random disk failures. seems to be working as designed
TIAEAE!
No, it won't. That's the point of this not-news article. It's getting to the point where (due to the size of the disks) a rebuild takes longer than the statistically "safe" window between individual disk failures. Two disks kick it in the same timeframe (the chance of which increases as you add disks) and you're screwed.
A poorly designed multi-disk storage system can easily be worse than a single disk.
The mathematical theory behind raid5 is not complicated at all. http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5
And there is parity, that's how raid5 works.
You are probably referring to "silent" errors, which for performance reasons, isn't read/detected by most raid5 implementations. And in reality there is little reason to actively read parity, unless they are running/recovering in degraded mode: Sure, you'll be informed that there is data corruption, but there is no way to tell whether the parity, or the original data is at fault (though its true, some implementations will scrub/update the parity to match the original data on an occasional basis).
I don't see a single set of raid5 disks as a backup solution at any measure though (disk reliability is only one aspect of this, hardware/driver/filesystem bugs can also cause hard or impossible to detect corruption), but it is a great 'best effort' to prevent a bit of downtime on high availability disks.
I fear the Y2038 bug
Yes. It's amazing that the article presents the basic point so horribly poorly. The problem is not the capacity of the disks.
The problem is that the capacity has been growing faster than the transfer-bandwith. Thus it takes a longer and longer time to read (or write) a complete disk. This gives a larger window for double-failure.
Simple as that.