RAID's Days May Be Numbered
storagedude sends in an article claiming that RAID is nearing the end of the line because of soaring rebuild times and the growing risk of data loss. "The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology's limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band Aid for RAID-5, and going from one parity drive to two is simply delaying the inevitable. The bottom line is this: Disk density has increased far more than performance and hard error rates haven't changed much, creating much greater RAID rebuild times and a much higher risk of data loss. In short, it's a scenario that will eventually require a solution, if not a whole new way of storing and protecting data."
Don't consider an entire drive is dead if you get a piddly one-sector error.
Just mark it read only and keep chugging.
The author says it himself in the article:
"And running software RAID-5 or RAID-6 equivalent does not address the underlying issues with the drive. Yes, you could mirror to get out of the disk reliability penalty box, but that does not address the cost issue."
but he hasn't adressed the fact that today you get 100 times as much diskspace for the same cost as you did 10 years ago when cost was a factor. In real life cost isn't a factor when it comes to datastorage, simply because it's really low in real life projects, as compared to the other costs in a project requiring storage. So if you want the reliability you go get a mirror. Drivespace is dirt cheap.
As for the rebuildtimes, fine, go buy FASTER drives. I dont see the problem. HP and many other vendors have long been trying to sell combined raid soltions (like the EVA) where you mix high storage with high performance drives (like SSD vs. SATA).
The only real argument for the validity of this article is the personal use of drives/storage. And name 3 people you know who run raid-5 on their personal PCs, and I'll show you 3 guys who can't afford an SSD drive.
--- To err is human... Am I more human than most ?
Probably the next meta solution after RAID 6 will be something like ZFS, where the filesystem that works not just on the fs-specific layer, but on the LVM layer so it can log CRCs of files and immediately be able to tell if a file got corrupted (and perhaps fix it with some ECC records.) One can see a filesystem not just writing a RAID layer, but taking recovery data and storing that away as filesystem metadata.
Of course, there is always doing redundant arrays of RAID clusters, say three groups, two data, one parity, or mirroring RAID 5 volumes. You have the usual tradeoffs: The more fancy the RAID scheme, the more disks you need, and the more computing you have to do for every bit thrown at and read off the array.
Long term solution? A move to something other than magnetic storage. This could be optical, it could be SSD if some advance allows very large density increases, or something unknown. The technology would have to have a rate of failure magnitudes better than magnetic, as well as a cost on par with magnetic for it to completely work. Holographic storage has languished for a while, perhaps as the technology improves for that, we may see drives using 3D blocks of that replacing the old fashioned spindles.
Hardware RAID is dead - software for redundant storage is just getting started. I am looking forward to making use of btrfs so I can have some consistency and confidence to how I deal with any ultimately disposable storage component.
The ZFS folks have been doing it fine for some time now.
Hardware RAID controllers have no place in modern storage arrays - except those forced to run Windows
For enterprise level storage systems, this is also a non-issue because of thin provisioning.
"I love my job, but I hate talking to people like you" (Freddie Mercury)
If you want smaller drives to speed up rebuild times then, erm, buy smaller drives? You can get ~70Gb 10Krpm and 15Krpm drives fairly readily - much smaller than the 500-to-2000-Gb monsters and faster too. You can still buy ~80Gb PATA drives too, I've seen them when shopping for larger models, though you only save a couple of peanuts compared to the cost of 250+Gb units.
If you can't afford those but still don't want 500+Gb drives because they take too long to rebuild if the array is compromised and needs a rebuild, and management won't let you buy bog standard 160Gb (or smaller) drives as they only cost 20% less than 750Gb units without the speed benefits of the high cost 15Krpm ones, how about using software RAID and only using the first part of the drive? Easily done with Linux's software RAID (partition the drives with a single 100Gb (for example) partition, and RAID that instead of the full drive) and I'm sure just as easy with other OSs. You'll get speed bonuses too: you'll be using the fastest part of the drive in terms of bulk transfer speed (most spinning drives are arranged such that the earlier tracks have higher data density) and you'll have lower latency on average as the heads will never need to move the full diameter of the platter. And you've got the rest of the drive space to expand onto if needed later. Or maybe you could hide your porn stash there.
I've managed to get this going, using the excellent FreeNAS - although proceed with caution, as only the beta build supports it, and I've already had serious (all data lost) crashes twice.
However the principle is sound, and I'm sure this will become standard before long - the only trouble being that HP, Dell and the like can't simply offer upgrades for existing RAID cards - due to the nature of ZFS, it needs a 'proper' CPU and a gig or two or RAM. Even so, it does protect against many of the problems now besetting RAID (which was never meant to handle modern, gargantuan disk sizes).
What about fountain codes? The coding there is capable of recovering from a greater variety of faults.
But really none of that should be necessary for the general case. Storing data in different physical locations is a good but entirely unrelated issue- the main problem of disk reliability is still very much in need of a solution. That's pretty much the point of the article: You can come up with various solutions which move the problem around, give multiple fallbacks for when something goes wrong.. but there's still the problem of things going wrong in the first place. I shouldn't need to use 12 separate disks spread across the globe just for basic reliability / redundancy
Read that before on slashdot. Why RAID 5 Stops Working In 2009
Actually I like the parity declustering idea that was linked to in that article, seems to me if implemented correctly it could mitigate a large part of the issue. I have personally encountered the hard error on RAID5 rebuild issue, twice, so there definitely is a problem to be addressed...and yes, I do now only implement RAID6 as a result.
For those who haven't RTFATFALT (RTFA the f*** article links to), parity declustering, as I understand it, is where you have, say, an 8 drive array, but where each block is written to only a subset of those drives, say 4. Now, obviously you loose 25% of your storage capacity (1/4), but consider a rebuild for a failed disk. In this instance only 50% of your blocks are likely to be on your failed drive, so immediately you cut your rebuild time in half, halving your data reads, and therefore your chance of encountering a hard error. Larger numbers of disks in the array, or spanning your data over fewer drives, cuts this further.
Now, consider the flexibility you could build into an implmentation of this scheme. Simply by allowing the number of drives a block spans to be configurable on a per block basis, you could then allow any filesystem that is on that array to say, on a per file basis, how many disks to span over. You could then allow apps and sysadmins to say that a given file needs to have the maximum write performance, so diskSpan=2, which gives you effectively RAID10 for that file (each block is written to 2 drives, but with multiple blocks in the file is likely to be written to a different pair of drives, not quite RAID10, but close). Where you didn't want a file to consume 2x its size on the storage system, you could allow a higher diskSpan number. You could also allow configurable parity on a per block basis, so particularly important files can survive multiple disk failures, temp files could have no parity. There would need to be a rule however that parity+diskSpan is less than or equal to the number of devices in the array.
Obviously there is an issue here where the total capacity of the array is not knowable, files with diskSpan numbers lower than the default for the array will reduce the capacity, numbers higher will increase it. This alone might require new filesystems, but you could implement todays filesystems on this array as long as you disallowed the per-block diskSpan feature.
This even helps for expanding the array, as there is now no need to re-read all of the data in the array (with the resulting chance of encountering a hard error, adding huge load to the system causing a drive to fail, etc). The extra capacity is simply available. Over time you probably want a redistribution routine to move data from the existing array members to the new members to spread the load and capacity.
How about you implement a performance optimiser too, that looks for the most frequently accessed blocks and ensures they are evenly spread over the disks. If you take into account the performance of the individual disks themselves, you could allow for effectively a hierarchical filesystem, so that one array contains, say, SSD, SAS and SATA drives, and the optimiser ensures that data is allocated to individual drives based on the frequency of access of that data and the performance of the drive. Obviously the applications or sysadmin could indicate to the array which files were more performance sensitive, so influencing the eventual location of the data as it is written.
Stealing a rhinoceros should not be attempted lightly.
Will scalable distributed storage systems like Hadoop and Google File System take over from RAID?
As others have mentioned, this is something that is discussed on the ZFS mailing lists frequently.
For more info there, check out the digest for zfs-discuss@opensolaris.org
and, in particular, check out Richard Elling's blog
(Disclaimer: I work for Sun, but not in the ZFS group)
The fundamental problem here isn't the RAID concept, is that the throughput and access times of spinning rust haven't changed much in 30 years. Fundamentally, today's hard drive is no more than 100 times as fast (both in throughput and latency) than a 1980s one, while it holds well over 1 million times more.
ZFS (and other advanced filesystems) will now do partial reconstruction of a failed drive (that is, they don't have to bit copy the entire drive, only the parts which are used), which helps. But there are still problems. ZFS's pathological case results in rebuild times of 2-3 WEEKS for a 1TB drive in a RAID-Z (similar to RAID-5). It's all due to the horribly small throughput, maximum IOPs, and latency of the hard drive.
SSDs, on the other hand, are no where near the problem. They've got considerably more throughput than a hard drive, and, more importantly, THOUSANDS of times better IOPS. Frankly, more than any other reason, I expect the significant IOPS of the SSD to signal the death knell of HDs in the next decade. By 2020, expect HDs to be gone from everything, even in places where HDs still have better GB/$. The rebuild rates and maintenance of HDs simply can't compete with flash.
Note: IOPS = I/O Per Second, or the number of read/write operations (irregardless of size) which a disk can service. HDs top out around 350, consumer SSDs do under 10,000, and high-end SSDs can do up to 100,000.
-Erik
There are always four sides to every story: your side, their side, the truth, and what really happened.
Um, don't schemes like raid 1+0 solve the parity rebuild problem? Even in the worst case of full disk loss, only one disk needs to be rebuilt and even for a large disk that doesn't take very long. Am I missing something?
In theory, there's no difference between theory and practice; in practice there is.
RAID 1 has much less reliability than RAID 6. Assume a typical case: one disk totally fails. You then start to reconstruct - in a RAID 1 scheme a single sector error will result in the rebuild failing. Not great.
In RAID 6 you start the rebuild and you get a single sector error from one of the drives you're rebuilding from. At that point you've got yet another parity scheme available (in the form of the RAID 6 bit) that figures out what that sector should have been and then continues the rebuild. Then you go back and decide what to do about that drive that had the second error.
A lot of drive failures aren't full head crashes or motor errors but just single sector, track, bits of dirt on the platter style errors. Other than the affected area the drive can be read.
With RAID 6 you can fail two disks completely and still access the data. You're still reading from the same ten 10TB disks in your example and if the implementation of RAID 6 is optimal (RAID-DP) you aren't having to read additional data from the same physical disks.
In the world you describe with 10TB drives it sounds like you'd just not be able to use the disks at all since any process that reads from the disks will kill them. There are a few things that could happen:
1. Disks get more reliable. Hasn't happened much yet but...
2. We switch to different packaging. Instead of making disks larger we cram more of them into the same space similar to CPU cores - same MTBF per disk but lots of them presented out by one physical interface.
3. We change technologies completely. SSD (interesting failure modes there too... needs RAID)
I guess we'll find out in only a few years...
Actually, storing data in a multiple data center / high availability environment is a completely related issue. The summary above talks of "entirely different paradigms." Cloud storage would be multiple data center based, which is entirely different from keeping the only copy on your local drives. In this concept, your machine would have enough OS to boot, and enough hard drive space to download the current version of whatever software you are leasing. Your personal info would always be maintained in the data centers, and only mirrored locally. Have a home failure? Drop in a new part or even a new PC, (possibly with an entirely different operating system, such as Chrome,) connect to the service, and you're 100% back.
It's no longer a novel concept for the home market. Consider Google Docs. It's not even being sold as "safer than RAID", it's being touted as "get it from anywhere" or "share with your friends". Safer than RAID is just a bonus.
So are we ready to move all our personal information to clouds? I certainly am not, but Google Docs are wildly popular and a lot of people are. I long ago learned that I can't look to myself to judge what the mainstream attitudes are in many things.
John
RAID 4 is where you have one dedicated parity drive. RAID 5 solves this by spreading the parity information for each drive to all the other drives in the array. RAID 6 adds a second parity block for increased reliability, but as a result of the increased write for that extra parity block, it slows down write speeds.
The real key to making RAID 4, 5, or 6 work is that you really need 4-6 drives in the array to take advantage of the design. I wouldn't say that it will fall out of favor though, because having solid protection from a single drive going bad really is critical for many businesses. Backups are all well and good for if your system crashes, but for most businesses, uptimes are more critical yet. So, backups for data so corruption problems can be rolled back, and RAID 5,6,10 for stability and to avoid having the entire system die if one drive goes bad. What takes more time, doing a data restore from a backup for when an individual application has problems, or having to restore the entire system from a backup, with the potential that the backup itself was corrupted?
With that said, web farms and other applications can get away with just using a cluster approach instead of a single well designed machine(or set of machines) have become popular, but there are many situations which make a system with one or more RAID arrays a better choice. The focus on RAID 0 and 1 for SMALL systems and residential setups has simply kept many people from realizing how useful a 4-drive RAID 5 setup would be.
Then again, most people go to a backup when they screw up their system, not because of a hard drive failure. With techs upgrading hardware before they run into a hard drive failure, the need for RAID 1, 4, 5, and 6 has dropped.
I will say this, since a RAID 5 array can rebuild on the fly(since it keeps working even if one drive fails), the rebuild time itself does not significantly impact system availability. Gone are the days when a rebuild has to be done while the system is down.
I use RAID6 for several high-volume machines at work. Having double parity plus a hot spare means rebuild time is no worry.
But if you are not a fan you can always throw something together with ZFS's RAIDZ or RAIDZ2 which is also distributed parity but the ZFS filesystem checksums and keeps multiple (distributed) copies of every block to detect and fix data corruption before it becomes a bigger problem.
People using ZFS have been able to detect silent data corruption from a faulty power supply that other solutions would never have found just because of the checksumming process.
Imagine a disk fails on every 100TB of reads (10^14). You have ten 1TB data disks. Imagine you keep them in perfect rotation so they've spent 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100% of their lifetime. The last disk dies and you replace it with a new drive (0%). To rebuild the drive you read 1TB from each data disk and use whatever parity you need. They've now spent 11, 21, 31, 41, 51, 61, 71, 81, 91 and 1% (your new disk) of their lifetime and you can read another 9TB before you need a new disk.
Except that it doesn't work anywhere NEAR like that. The lifetime of disks are much, much greater than the read cost to rebuild a failed drive. You're definitely not spending 1% of its total lifetime. You're not even spending 1/1000th of that 1%. You couldn't measure the difference between having to rebuild that drive and just using the disk as a new one.
The vast majority of hard disk failures are manufacturing issues, not end-of-lifetime for an average drive issue. You don't have a raid system because the average lifetime of a harddisk is small, you get a raid system because of the outliers. Every once in a while, a manufacturer puts out a deathstar, and it's going to fail a month after you put it in. At the same time, you'll have disks in there that are going to keep chugging away for five years straight, and you'll eventually replace them because you want a bigger disk, not because they've failed.
Is he saying that you can never read a whole hard disk because it will fail before you get to the end?
That's what it seems like he's saying but my hard disks usually last for years of continuous so I'm not sure it's true.
No sig today...
Here's what I want, folks:
A 5.25 inch device with 5 double-sided platters running at 5400 RPM. Basically the same size as a desktop CD/DVD drive, ala Quantum Bigfoot.
I want 8 sides of the platters dedicated to data, and the other two sides dedicated to parity (or one parity and the other servo), essentially a self-contained RAID on a single disk.
I want all data heads to write and read simultaneously, in Parallel. The idea is to have 64 byte sectors on each platter which are recombined into a 512-byte result. 8 heads writing and reading in paralell means HUGE throughput for sequential operations.
It's RAID 5 or 6 on a single disk, although without spindle redundancy.
And I also want a high-performance option: 2 sets of read/write heads 180 degrees apart, which effectively would cut seek times in half, making the drive perform more like a 10k RPM drive. With current densities, that's 12 TB in the volume of a DVD drive. It solves speed, sector error recovery and capacity issues. The only thing missing is a data bus that can handle the throughput.
I don't understand what your failure rate strategy is. First of all, there's no such thing as saying you are 90% or 10% of the way through a disk's life.. It's a probability distribution, who's probability is dramatically effected by the current events (and somewhat related to historical events). A drive might be at a 0.00005% probability of failure at any given moment, but then a large sustained read occurs which adjusts the heat and causes voltage fluctuations , so now you're operating at 0.001% probability.
Then a drive dies in hot-swap-mode, a drive spins down, then another spins up, this has massive voltage fluctuations as well as slight tension on the cabling which causes reflections in the wiring which increases your probability of failure to say 0.02%. (I'm totally making up numbers, but the trends are what's important).
So the act of powering down/up or hot-swaping intrinsically increases the probability of co-disk-failures, unless you have a very expensive system with separate AC/DC converters (e.g. fully decoupled) and obviously isolated frames, heat-compartments, etc.
BUT, you can mitigate this by having 3+-way redundancy (RAID-1; I honestly don't understand the point of using slower RAID-5 / RAID-6 anymore). So when one drive fails, you have addressed the probability of a second failure. There is a geometric reduction in probability that 3 or 4 or 5 simultaneous drives fail. Meaning even at the peek risky part of the drive-swap operation, if you have say 2% probability that another drive will fail, then there is 0.004% probability that two drives will fail simultaneously. 0.0008% that three fail, etc.
This isn't strictly correct, of course, because the probabilities are not fully independent. You have many common components, and thus their probabilities are intertwined. But sufficient to say the probabilities are less.
Now I say 3+way RAID-1 because it may be silly to swap out a single drive when one goes bad. The process I would recommend (if you have a sufficiently advanced RAID controller, and non-super-expensive disks), is this:
5-way RAID-1 with 2 powered down disks (thus effectively 3-way RAID-1)
On a drive failure, power up the two disks and initiate their syncing.
Swap out the error'd drive, and and initiate it's syncing.
For a brief-while, you have 2 valid, 2 semi-valid, and 1 semi-semi-valid drive.
As the drives sync-up(may take over 24 hours), power-down the original remaining 2 and remove them.
Recycle the good disks into JBOH (Just a bunch of hardware) clustering. Meaning boot-disks / log-file disks in say RAID-1, swapping out the oldest drive.
You can either buy several 4-way/5-way RAID controllers, or get a single 15-disk RAID controller for under than $1k. This allows you to have multiple logical volumes and share the 'spun-down disks', So now you're really only using 3 disks per logical-volume, though having two logical volumes with bad disks does reduce your ideal reliability somewhat. But this gives you 4 volumes which can be combined into RAID-10. You could build such a system for under $6k with various mixtures of high-end and low-end disks (for different partition requirements, boot/OS/linear-logging (RAID-1), random-write-data (RAID-10)).
If the data is super critical, use a block-level master-slave replication. Ideally your application supports direct master-slave or better yet, multi-master.
And if you're JBOH (Just a Bunch Of Hardware) clustering, then trivial RAID-1 with 2 or 3 disks (in software-raid) is all you need. Note, I use 3-disk RAID10 on my home linux machine, (that plus DVD drive fills up my IDE slots) - pretty clever technique. Yes I know virtually all MB's have hardware RAID these days, but unless they've got an extra 4Gig of buffer-RAM in them, they're pointless in my opinion, plus they're non-portable (screw transparent windows support, you can't distinguish disk errors from forced reboots anyway).
-Michael
Well the logical thing IMHO is after the first year you put in a new drive and do an array rebuild after making a backup.
Drives are really cheap and I would do that for as long as the array is in use.
Reuse the old drives in desktops if they are SATA.
Not perfect but it keeps you from having an array of old drives in your server.
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Having worked with plenty of enterprise grade raid (EMC symetrix, clarion, and Dell SAN devices) I can say that capacity and rebuild times are not a problem for high-end arrays.
What will bring the problem to the masses are these stupid consumer NAS boxes. It is very easy to build a 4 or 8 TB array for home use using relatively cheap hardware. Unfortunately, no home user/abuser, that I know, has the skill set to manage or protect such a large array of data.
My most recent experience with a Western Digital sharespace was awful. Here is a box with a Gigabit NIC, and 4 - 2TB hard drives in a RAID 5 array that has transfer rates around 9MB/sec at best. Combine that pitiful performance with a rebuild/reformat time of over two days - and you know where this is going.
Average joes are going to put their entire lives on these things and never back them up due to the time and space cost. When a failure does occur - it will take days to perform a rebuild of the array - vastly increasing the likelyhood of another failure and permanent data loss.
Crappy RAID's days are numbered - good RAID implementations will be with us as long as hard drives have ANY failure rate at all.
-ted
Filling the drive with helium should help; the speed of sound in helium is 3x higher than in air, and it offers less resistance.
(Hydrogen would be even better, but it has a tendency to interact with metals in unfortunate ways.)
I do like mirroring more than this parity stuff.
Let's assume I have e.g. a RAID10 containing 100TB of Data. I loose one disk. and then I loose the second one. CrashBoomBang!
Of course I'm not stupid and I have a backup of all my 100TB of Data. What traditionally would happen now is, I will need to restore all of my Data, as I don't know what I had on these disks. Takes ages!
Now comes the magic of ZFS. Through the combination of ZFS as a Volume manager and a file system, ZFS can tell me exactly which files are missing. I will then replace both failed disk. If using 1TB Disks, I don't have to restore 100TB, but only up to 1TB. This will minimize the recovery time up to 100 times.