RAID's Days May Be Numbered
storagedude sends in an article claiming that RAID is nearing the end of the line because of soaring rebuild times and the growing risk of data loss. "The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology's limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band Aid for RAID-5, and going from one parity drive to two is simply delaying the inevitable. The bottom line is this: Disk density has increased far more than performance and hard error rates haven't changed much, creating much greater RAID rebuild times and a much higher risk of data loss. In short, it's a scenario that will eventually require a solution, if not a whole new way of storing and protecting data."
The author says it himself in the article:
"And running software RAID-5 or RAID-6 equivalent does not address the underlying issues with the drive. Yes, you could mirror to get out of the disk reliability penalty box, but that does not address the cost issue."
but he hasn't adressed the fact that today you get 100 times as much diskspace for the same cost as you did 10 years ago when cost was a factor. In real life cost isn't a factor when it comes to datastorage, simply because it's really low in real life projects, as compared to the other costs in a project requiring storage. So if you want the reliability you go get a mirror. Drivespace is dirt cheap.
As for the rebuildtimes, fine, go buy FASTER drives. I dont see the problem. HP and many other vendors have long been trying to sell combined raid soltions (like the EVA) where you mix high storage with high performance drives (like SSD vs. SATA).
The only real argument for the validity of this article is the personal use of drives/storage. And name 3 people you know who run raid-5 on their personal PCs, and I'll show you 3 guys who can't afford an SSD drive.
--- To err is human... Am I more human than most ?
Hardware RAID is dead - software for redundant storage is just getting started. I am looking forward to making use of btrfs so I can have some consistency and confidence to how I deal with any ultimately disposable storage component.
The ZFS folks have been doing it fine for some time now.
Hardware RAID controllers have no place in modern storage arrays - except those forced to run Windows
For enterprise level storage systems, this is also a non-issue because of thin provisioning.
"I love my job, but I hate talking to people like you" (Freddie Mercury)
If you want smaller drives to speed up rebuild times then, erm, buy smaller drives? You can get ~70Gb 10Krpm and 15Krpm drives fairly readily - much smaller than the 500-to-2000-Gb monsters and faster too. You can still buy ~80Gb PATA drives too, I've seen them when shopping for larger models, though you only save a couple of peanuts compared to the cost of 250+Gb units.
If you can't afford those but still don't want 500+Gb drives because they take too long to rebuild if the array is compromised and needs a rebuild, and management won't let you buy bog standard 160Gb (or smaller) drives as they only cost 20% less than 750Gb units without the speed benefits of the high cost 15Krpm ones, how about using software RAID and only using the first part of the drive? Easily done with Linux's software RAID (partition the drives with a single 100Gb (for example) partition, and RAID that instead of the full drive) and I'm sure just as easy with other OSs. You'll get speed bonuses too: you'll be using the fastest part of the drive in terms of bulk transfer speed (most spinning drives are arranged such that the earlier tracks have higher data density) and you'll have lower latency on average as the heads will never need to move the full diameter of the platter. And you've got the rest of the drive space to expand onto if needed later. Or maybe you could hide your porn stash there.
What about fountain codes? The coding there is capable of recovering from a greater variety of faults.
But really none of that should be necessary for the general case. Storing data in different physical locations is a good but entirely unrelated issue- the main problem of disk reliability is still very much in need of a solution. That's pretty much the point of the article: You can come up with various solutions which move the problem around, give multiple fallbacks for when something goes wrong.. but there's still the problem of things going wrong in the first place. I shouldn't need to use 12 separate disks spread across the globe just for basic reliability / redundancy
Actually I like the parity declustering idea that was linked to in that article, seems to me if implemented correctly it could mitigate a large part of the issue. I have personally encountered the hard error on RAID5 rebuild issue, twice, so there definitely is a problem to be addressed...and yes, I do now only implement RAID6 as a result.
For those who haven't RTFATFALT (RTFA the f*** article links to), parity declustering, as I understand it, is where you have, say, an 8 drive array, but where each block is written to only a subset of those drives, say 4. Now, obviously you loose 25% of your storage capacity (1/4), but consider a rebuild for a failed disk. In this instance only 50% of your blocks are likely to be on your failed drive, so immediately you cut your rebuild time in half, halving your data reads, and therefore your chance of encountering a hard error. Larger numbers of disks in the array, or spanning your data over fewer drives, cuts this further.
Now, consider the flexibility you could build into an implmentation of this scheme. Simply by allowing the number of drives a block spans to be configurable on a per block basis, you could then allow any filesystem that is on that array to say, on a per file basis, how many disks to span over. You could then allow apps and sysadmins to say that a given file needs to have the maximum write performance, so diskSpan=2, which gives you effectively RAID10 for that file (each block is written to 2 drives, but with multiple blocks in the file is likely to be written to a different pair of drives, not quite RAID10, but close). Where you didn't want a file to consume 2x its size on the storage system, you could allow a higher diskSpan number. You could also allow configurable parity on a per block basis, so particularly important files can survive multiple disk failures, temp files could have no parity. There would need to be a rule however that parity+diskSpan is less than or equal to the number of devices in the array.
Obviously there is an issue here where the total capacity of the array is not knowable, files with diskSpan numbers lower than the default for the array will reduce the capacity, numbers higher will increase it. This alone might require new filesystems, but you could implement todays filesystems on this array as long as you disallowed the per-block diskSpan feature.
This even helps for expanding the array, as there is now no need to re-read all of the data in the array (with the resulting chance of encountering a hard error, adding huge load to the system causing a drive to fail, etc). The extra capacity is simply available. Over time you probably want a redistribution routine to move data from the existing array members to the new members to spread the load and capacity.
How about you implement a performance optimiser too, that looks for the most frequently accessed blocks and ensures they are evenly spread over the disks. If you take into account the performance of the individual disks themselves, you could allow for effectively a hierarchical filesystem, so that one array contains, say, SSD, SAS and SATA drives, and the optimiser ensures that data is allocated to individual drives based on the frequency of access of that data and the performance of the drive. Obviously the applications or sysadmin could indicate to the array which files were more performance sensitive, so influencing the eventual location of the data as it is written.
Stealing a rhinoceros should not be attempted lightly.
Will scalable distributed storage systems like Hadoop and Google File System take over from RAID?
As others have mentioned, this is something that is discussed on the ZFS mailing lists frequently.
For more info there, check out the digest for zfs-discuss@opensolaris.org
and, in particular, check out Richard Elling's blog
(Disclaimer: I work for Sun, but not in the ZFS group)
The fundamental problem here isn't the RAID concept, is that the throughput and access times of spinning rust haven't changed much in 30 years. Fundamentally, today's hard drive is no more than 100 times as fast (both in throughput and latency) than a 1980s one, while it holds well over 1 million times more.
ZFS (and other advanced filesystems) will now do partial reconstruction of a failed drive (that is, they don't have to bit copy the entire drive, only the parts which are used), which helps. But there are still problems. ZFS's pathological case results in rebuild times of 2-3 WEEKS for a 1TB drive in a RAID-Z (similar to RAID-5). It's all due to the horribly small throughput, maximum IOPs, and latency of the hard drive.
SSDs, on the other hand, are no where near the problem. They've got considerably more throughput than a hard drive, and, more importantly, THOUSANDS of times better IOPS. Frankly, more than any other reason, I expect the significant IOPS of the SSD to signal the death knell of HDs in the next decade. By 2020, expect HDs to be gone from everything, even in places where HDs still have better GB/$. The rebuild rates and maintenance of HDs simply can't compete with flash.
Note: IOPS = I/O Per Second, or the number of read/write operations (irregardless of size) which a disk can service. HDs top out around 350, consumer SSDs do under 10,000, and high-end SSDs can do up to 100,000.
-Erik
There are always four sides to every story: your side, their side, the truth, and what really happened.
Actually, storing data in a multiple data center / high availability environment is a completely related issue. The summary above talks of "entirely different paradigms." Cloud storage would be multiple data center based, which is entirely different from keeping the only copy on your local drives. In this concept, your machine would have enough OS to boot, and enough hard drive space to download the current version of whatever software you are leasing. Your personal info would always be maintained in the data centers, and only mirrored locally. Have a home failure? Drop in a new part or even a new PC, (possibly with an entirely different operating system, such as Chrome,) connect to the service, and you're 100% back.
It's no longer a novel concept for the home market. Consider Google Docs. It's not even being sold as "safer than RAID", it's being touted as "get it from anywhere" or "share with your friends". Safer than RAID is just a bonus.
So are we ready to move all our personal information to clouds? I certainly am not, but Google Docs are wildly popular and a lot of people are. I long ago learned that I can't look to myself to judge what the mainstream attitudes are in many things.
John
RAID 4 is where you have one dedicated parity drive. RAID 5 solves this by spreading the parity information for each drive to all the other drives in the array. RAID 6 adds a second parity block for increased reliability, but as a result of the increased write for that extra parity block, it slows down write speeds.
The real key to making RAID 4, 5, or 6 work is that you really need 4-6 drives in the array to take advantage of the design. I wouldn't say that it will fall out of favor though, because having solid protection from a single drive going bad really is critical for many businesses. Backups are all well and good for if your system crashes, but for most businesses, uptimes are more critical yet. So, backups for data so corruption problems can be rolled back, and RAID 5,6,10 for stability and to avoid having the entire system die if one drive goes bad. What takes more time, doing a data restore from a backup for when an individual application has problems, or having to restore the entire system from a backup, with the potential that the backup itself was corrupted?
With that said, web farms and other applications can get away with just using a cluster approach instead of a single well designed machine(or set of machines) have become popular, but there are many situations which make a system with one or more RAID arrays a better choice. The focus on RAID 0 and 1 for SMALL systems and residential setups has simply kept many people from realizing how useful a 4-drive RAID 5 setup would be.
Then again, most people go to a backup when they screw up their system, not because of a hard drive failure. With techs upgrading hardware before they run into a hard drive failure, the need for RAID 1, 4, 5, and 6 has dropped.
I will say this, since a RAID 5 array can rebuild on the fly(since it keeps working even if one drive fails), the rebuild time itself does not significantly impact system availability. Gone are the days when a rebuild has to be done while the system is down.
They aren't talking about drive speeds as much as failure rate:
The bottom line is this: Disk density has increased far more than performance and hard error rates haven't changed much, creating much greater RAID rebuild times and a much higher risk of data loss.
They are talking about the MTBF of drives has not gone up as fast as the capacity, and the fact that a missed write is actually quite likely with a modern high capacity drive. Even saying drive speeds haven't gone up is very accurate, 15k RPM drives have been around for quite a while now, at least for 10 years, and there has not been an improvement in speed in that time. Where are my 30k RPM drives?~
Also, I have a bit of a problem with your statement about OMG small enterprise drives. Enterprise drives have caught up to consumer drives in size, you can now buy 1TB SAS drives; they are just OMG expensive compared to the consumer drives.
APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
The problem becomes space in the data center. I don't know about you, but we're trying to cram Petabytes into existing computer rooms and coming up short. Plus you don't address Tier 2 or Tier 3 storage which tends to be on SATA or near-line SAS both of which have the ridiculous size problem. Calling 15,000 RPM fast in the datacenter is also misleading because those are the speeds we've been at for a few years now, 10GB iSCSI (or FCoE, which bypasses the collison problem) is about to render that untenable. The current solution tends toward storage virtualization (in this case virtualization means excessive amounts of high-speed cache in front of controllers and less control on where controllers allocate space). The future is most likely some kind of grid technology (like XIV from IBM). Where any blcok is on two random drives in the array, and only the controller knows where. This means that drive rebuilds become subject to swarm speeds (since there is an equal chance that it is pulling data from every other drive in the tower).
He effected a bored affect.
You're not likely to see 30k RPM drives any time soon. The speed of a 15k drive means that the outer edge of the 3 1/2" drive is spinning pretty fast... getting close to the speed of sound and the lions share of power consumed by 15k drives is consumed in counteracting the air buffeting the heads. With 2 1/2" drives we could go faster but while drives are open to the air it's not likely we'll see much in the short term.
It's why CDROM speeds haven't gone up much since the old day of 52x.
As areal density improves the drives will be able to push out more raw MB/sec just like DVD is better than CD, but in terms of IOPs it's not likely to dramatically improve.
Here's what I want, folks:
A 5.25 inch device with 5 double-sided platters running at 5400 RPM. Basically the same size as a desktop CD/DVD drive, ala Quantum Bigfoot.
I want 8 sides of the platters dedicated to data, and the other two sides dedicated to parity (or one parity and the other servo), essentially a self-contained RAID on a single disk.
I want all data heads to write and read simultaneously, in Parallel. The idea is to have 64 byte sectors on each platter which are recombined into a 512-byte result. 8 heads writing and reading in paralell means HUGE throughput for sequential operations.
It's RAID 5 or 6 on a single disk, although without spindle redundancy.
And I also want a high-performance option: 2 sets of read/write heads 180 degrees apart, which effectively would cut seek times in half, making the drive perform more like a 10k RPM drive. With current densities, that's 12 TB in the volume of a DVD drive. It solves speed, sector error recovery and capacity issues. The only thing missing is a data bus that can handle the throughput.
Even partial evacuation would help, but you run into the problem that the read heads are designed to use the air to keep them from contacting the platters, so you'd need to replace that effect somehow.
The Space shuttle and ISS even have special sensors to shut the hard drives down if the air pressure goes too low. Reading about which was how I found out that hard drives are designed to use air.
Not to mention that you're now trying to build an air tight container, but if you're looking at ultra-high performance drives that's less of an issue.
Still, you have to look at how much such a drive would cost, and whether the cost would ever be repaid - if I was looking at investing in such technology I'd be concerned that Flash would outpace my vacuum drives before I got them released. Even if I DO manage to find a niche, would the niche last long enough against flash memory that's getting faster and cheaper so quickly?
For certain data sets and access patterns, flash is already much cheaper than the old raid options - the best example I saw was a dataset of a few hundred gigabytes that was mostly read-only, but accessed so much so randomly they had to mirror it on 10 hard drives to meet the read demands. One professional level SSD performed BETTER, while costing less than half of the setup.
I don't read AC A human right
Filling the drive with helium should help; the speed of sound in helium is 3x higher than in air, and it offers less resistance.
(Hydrogen would be even better, but it has a tendency to interact with metals in unfortunate ways.)
Can you say "instantaneous heat death" ? Vacuum is an excellent insulator.
I want to delete my account but Slashdot doesn't allow it.
Unfortunately all that is quite a myth for the most part.
Having worked in storage for a aeons the reality is that the difference between enterprise and "consumer grade rubbish" has very little to do anything but tollerance. If you picked up a 300G 10k enterprise drive and compared it to the consumer grade rubbish you'd find nothing different. It used to be the case, way back when, that they were very different but because consumer grade drives have gotten so much better its just not worth the expense of building the same drive for enterprise as for consumers with slightly different specs. What is different is the acceptable tollerances, when a platter comes off the line if its within 2% of its manufacturing tollerances its ok to use for entperise and if its higher they throw it into consumer. The reality is that most drives are in that "better than 2% tollerance" range and that is simply because the processes to make them have gotten so good over the years. The point is that when you hit your magic tollerance number, the drive is capable of 100% duty cycle.
So essentially, the difference between "consumer" and "enterprise" when it comes to the casing, the platters, the heads and the motors is zero. There are alot of different spec drives out there today ranging from 146gb (typically the smallest you'll find these days) all the way to 2gb with speeds form 7200 to 15000 rpm and enterprise is the only place that uses all of them, but they still come off the same manufacturing line. The drivers behind it all come down to the consumer itself, in enterprise its often about performance, and with consumers its about size. Very conveniently building bigger consumer grade drives typically means improving the performance of a drive in ways that scale straight back to the enterprise. Sure, you wont see many users throwing around 15k rpm drives, but thats more because its unnecessary.
So why is it that in the mid-to-low server range do we find 300gb 15k drives? Because its a cheap way of getting performance - and that is fairly important at that end of the market where servers need to be cheap and theres alot of competition (you know, 1-2ru with 4-8 drives and a raid card, no san).
So what else differs between the two? Interface. In the mid-to-low server range we start talking SAS and this is more to do with being able to talk to several drives at once (Again not something alot of consumers do other than with usb drives perhaps). The SAS interface is quite brilliant cause it can scale quite well to a larger number of drives than can SATA and does it very cheaply. It also takes alot of load off the server when it comes to processing data transfer (for a large number of drives). But in that same space you WILL find sata drives going up to 2tb (often servers lag consumers in size simply because of certification, not because of anything to do with stability). To call a 1tb drive unstable is rather silly in reality.
Now the BIG end of town - SAN's. These days in most SAN's you'll find a mix of SATA and Fibre channel (some do do SAS as well, but its uncommon though its changing). In the SAN end of town (the big boy game) you'll see it all. 7.2k rpm 2tb SATA's sitting in the same array along side 146g 15k RPM fibre channel and its all about trading off storage density/cost to performance. Consider this: 10 1tb sata drives can consume (easily) a 8gbps FC interface - OUCH! Now alot of SAN arrays start at around 4 FC intercaes and go up to maybe 16, but they'll be supporting literally thousands of drives. Alot of the SAN industry realised some time ago that throwing 2tb SATA's into an array made alot of sense because SAN interfaces have grown very slowly in terms of throughput and single HD interfaces have grown very quickly. There are even several very popular arrays that only do SATA and that was the driver behind "enterprise" grade large-storage drives (i.e. entperise grade 1tb+ sata drives). At the server you still get the fibre channel performance. The critical difference is that the array does more work