Slashdot Mirror


RAID's Days May Be Numbered

storagedude sends in an article claiming that RAID is nearing the end of the line because of soaring rebuild times and the growing risk of data loss. "The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology's limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band Aid for RAID-5, and going from one parity drive to two is simply delaying the inevitable. The bottom line is this: Disk density has increased far more than performance and hard error rates haven't changed much, creating much greater RAID rebuild times and a much higher risk of data loss. In short, it's a scenario that will eventually require a solution, if not a whole new way of storing and protecting data."

108 of 444 comments

  1. simple idea by shentino · · Score: 2, Interesting

    Don't consider an entire drive dead if you get a piddly one-sector error.

    Just mark it read only and keep chugging.

    1. Re:simple idea by paulhar · · Score: 4, Informative

      Enterprise arrays copy all the good data off the drive to a spare drive, use RAID to recover the failed sector(s), then fail the broken disk.

    2. Re:simple idea by Eric+Smith · · Score: 4, Insightful

      The drives already do that internally. By the time they're reporting errors, bad things are happening, and it really IS time to replace the drive. Anyhow, drives are inexpensive. It's more cost effective to replace them than to spend a lot of time screwing around with them.

    3. Re:simple idea by paulhar · · Score: 3, Informative

      They do to varying degrees of success but just because a disk can't read a particular sector doesn't mean that the drive is faulty - it could be a simple error on the onboard controller that is causing the issue.

      FC/SAS drives mostly leave error handling up to the array rather than doing it themselves because the arrays can typically make better decisions as to how to deal with the problem and helps cope with time sensitive applications. The array can choose to issue additional retries, reboot the drive while continuing to use RAID to serve the data, etc.

      Consumer SATA drives, on the other hand, try really hard to recover from the problem - for example, retrying again and again with different methods to get the sector - and while admirable, that leads to the behaviour we see in consumer land where the PC just "locks up". The assumption here is that there is no RAID available and so reporting an error back to the host is "a bad thing". The enterprise SATA drives we're seeing on the market are starting to disable this automatic functionality to make them behave correctly when inserted into RAID arrays.

      Usually ;-)

    4. Re:simple idea by Anonymous Coward · · Score: 5, Insightful

      Enterprise arrays are also very VERY different from what most people know as RAID. Smart controllers, smart drive cages, drives that are an order of magnitude better than the consumer grade garbage.

      The summary talks about how speed has not kept up with capacity. Yes, that is correct for low-grade consumer junk. Enterprise server-class RAID drives are a different story. The 15,000 RPM drives I have in my RAID 50 array here on the database server are insanely fast. Plus, server-class drives don't come in silly unstable capacities like 1TB or 1.5TB; they are an "OMG small" 300GB but are stable as a rock.

      So I guess the question is, Is the summary talking about RAID on junk drives or RAID on real drives?

    5. Re:simple idea by Coren22 · · Score: 3, Interesting

      They aren't talking about drive speeds as much as failure rate:

      The bottom line is this: Disk density has increased far more than performance and hard error rates haven't changed much, creating much greater RAID rebuild times and a much higher risk of data loss.

      They are saying that the MTBF of drives has not gone up as fast as the capacity, and that a missed write is actually quite likely with a modern high-capacity drive. Even the claim that drive speeds haven't gone up is very accurate: 15k RPM drives have been around for at least 10 years, and there has not been an improvement in speed in that time. Where are my 30k RPM drives?~

      Also, I have a bit of a problem with your statement about OMG small enterprise drives. Enterprise drives have caught up to consumer drives in size, you can now buy 1TB SAS drives; they are just OMG expensive compared to the consumer drives.

      --
      APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
    6. Re:simple idea by alva_edison · · Score: 4, Interesting

      The problem becomes space in the data center.  I don't know about you, but we're trying to cram petabytes into existing computer rooms and coming up short.  Plus you don't address Tier 2 or Tier 3 storage, which tends to be on SATA or near-line SAS, both of which have the ridiculous size problem.  Calling 15,000 RPM fast in the datacenter is also misleading, because those are the speeds we've been at for a few years now; 10Gb iSCSI (or FCoE, which bypasses the collision problem) is about to render that untenable.  The current solution tends toward storage virtualization (in this case virtualization means excessive amounts of high-speed cache in front of controllers and less control over where controllers allocate space).  The future is most likely some kind of grid technology (like XIV from IBM), where any block is on two random drives in the array, and only the controller knows where.  This means that drive rebuilds become subject to swarm speeds (since there is an equal chance that it is pulling data from every other drive in the tower).

      --
      He effected a bored affect.
    7. Re:simple idea by paulhar · · Score: 4, Interesting

      You're not likely to see 30k RPM drives any time soon. The speed of a 15k drive means that the outer edge of the 3 1/2" drive is spinning pretty fast... getting close to the speed of sound, and the lion's share of power consumed by 15k drives is consumed in counteracting the air buffeting the heads. With 2 1/2" drives we could go faster, but while drives are open to the air it's not likely we'll see much in the short term.

      It's why CDROM speeds haven't gone up much since the old days of 52x.

      As areal density improves the drives will be able to push out more raw MB/sec, just like DVD is better than CD, but in terms of IOPS it's not likely to dramatically improve.

    8. Re:simple idea by JediTrainer · · Score: 2, Funny

      lion's share of power consumed by 15k drives is consumed in counteracting the air buffeting the heads

      Until some genius figures out how to build one with no air inside?

      --

      You can accomplish anything you set your mind to. The impossible just takes a little longer.
    9. Re:simple idea by denis-The-menace · · Score: 2, Interesting

      Why not add multiple heads to the same platter?

      Keep the disk spinning at 15K but add heads with their own actuator and everything. One could read only the other write only. Whatever makes sense.

      --
      Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
    10. Re:simple idea by russotto · · Score: 3, Insightful

      Pardon my ignorance here, but is there any reason the casing couldn't just be vacuum sealed such that there was no air in the chamber where the platters were spinning?

      You'd need a whole new way of keeping the head off the platter. You'd have a problem with lubricants vaporizing. Heat would be a problem as well.

    11. Re:simple idea by operagost · · Score: 4, Informative

      I'll assume you aren't trolling, and point out that disks work BECAUSE OF the air inside. The heads gain lift.

      --

      Gamingmuseum.com: Give your 3D accelerator a rest.
    12. Re:simple idea by operagost · · Score: 3, Informative

      The only real difference between WD's enterprise SATA and their consumer line (other than, perhaps, the warranty) is a firmware setting that determines how long it attempts to write to a sector before giving up and using a spare block. It has to be reduced for enterprise use so that the RAID controller doesn't fail the disk prematurely. My WD disks kept "failing" until I set this timeout shorter. It's been a year since I did that, and I've had no failures or data corruption. It's possible that this is no longer the case for their latest models.

      --

      Gamingmuseum.com: Give your 3D accelerator a rest.
    13. Re:simple idea by amoeba1911 · · Score: 3, Informative

      Speed of sound at sea level: 340.29 m/s

      ((3.5 inches) * (2.54 (cm / inches)) * pi) * (((15000 / minute) * (1 minute)) / (60 second)) * (0.01 (meter / centimeter)) = 69.8218967 m / s

      If my calculation is correct, the outer edge of a 3.5" platter spinning at 15000 RPM is moving at 69.82 m/s, which is about 20% of the speed of sound. It's fast, but it's nowhere near the speed of sound.
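      For anyone who wants to rerun the arithmetic above, here is a minimal sketch. It takes the thread's assumptions as given: 3.5" is treated as the platter diameter, and 340.29 m/s is the sea-level speed of sound. The function name is made up for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 340.29  # sea-level figure quoted in the comment above
INCH_TO_M = 0.0254


def rim_speed_m_s(diameter_in: float, rpm: float) -> float:
    """Linear speed of the outer edge of a disc with the given diameter."""
    circumference_m = math.pi * diameter_in * INCH_TO_M
    revolutions_per_second = rpm / 60.0
    return circumference_m * revolutions_per_second


if __name__ == "__main__":
    speed = rim_speed_m_s(3.5, 15_000)  # 3.5" taken as the diameter (assumption)
    print(f"{speed:.2f} m/s")
    print(f"{speed * 3600 / 1609.344:.0f} mph")
    print(f"{speed / SPEED_OF_SOUND_M_S:.1%} of the speed of sound")
```

      Note that treating 3.5" as the radius instead of the diameter doubles the result, which is one way two posters can end up a factor of two apart.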

    14. Re:simple idea by Zenaku · · Score: 4, Informative

      Air is necessary for the read/write head to operate. The piece that comes into close proximity of the platter is essentially a tiny hovercraft. It's about the size of a pepper flake, and has a microscopic pattern called an "air bearing" carved into the side facing the platter. Designing this air bearing is an exercise in fluid dynamics -- it is the shape of the bearing and how air flows over it that allows the read/write head to skim over the surface of the platter at a distance measured in microns without actually contacting the surface of the platter.

      If the read/write head does contact the surface of the platter, that is called a head crash, and is bad.

      --
      If fate makes you a motorcycle, you become a motorcycle.
    15. Re:simple idea by Firethorn · · Score: 4, Interesting

      Even partial evacuation would help, but you run into the problem that the read heads are designed to use the air to keep them from contacting the platters, so you'd need to replace that effect somehow.

      The Space shuttle and ISS even have special sensors to shut the hard drives down if the air pressure goes too low. Reading about which was how I found out that hard drives are designed to use air.

      Not to mention that you're now trying to build an air tight container, but if you're looking at ultra-high performance drives that's less of an issue.

      Still, you have to look at how much such a drive would cost, and whether the cost would ever be repaid - if I was looking at investing in such technology I'd be concerned that Flash would outpace my vacuum drives before I got them released. Even if I DO manage to find a niche, would the niche last long enough against flash memory that's getting faster and cheaper so quickly?

      For certain data sets and access patterns, flash is already much cheaper than the old raid options - the best example I saw was a dataset of a few hundred gigabytes that was mostly read-only, but accessed so much so randomly they had to mirror it on 10 hard drives to meet the read demands. One professional level SSD performed BETTER, while costing less than half of the setup.

      --
      I don't read AC A human right
    16. Re:simple idea by Gothmolly · · Score: 3, Interesting

      Can you say "instantaneous heat death" ? Vacuum is an excellent insulator.

      --
      I want to delete my account but Slashdot doesn't allow it.
    17. Re:simple idea by AlecC · · Score: 3, Informative

      No - to reconstruct 1 sector you have to read one sector from every other drive, then write 1 sector to the replacement drive. Effectively, to reconstruct you have to read the whole RAID. So the read and write speeds both count.
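      A minimal sketch of why every surviving drive has to be read, assuming plain single-parity XOR as used by RAID-5 style layouts. The block contents and the helper name are invented for illustration; real arrays work on much larger stripes and rotate the parity position.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks in one stripe (made-up contents) plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend the drive holding data[1] died: the lost block can only be rebuilt
# by reading the matching block from EVERY surviving drive, parity included.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

if __name__ == "__main__":
    print(rebuilt)  # matches the block that was on the failed drive
```

      Repeat that for every stripe on the failed drive and you have read the full contents of all the other drives once, which is why rebuild time tracks capacity rather than the size of the failure.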

      --
      Consciousness is an illusion caused by an excess of self consciousness.
    18. Re:simple idea by rayzat · · Score: 2, Informative

      I do a lot of storage work, and whenever the talk comes to spindle counts I've always wondered this as well. Since the only things not scaling in hard drives these days are the rotational and read speeds, adding another head could double the single-drive throughput and IOPS. I've looked into it and found hundreds of patents on the idea, and one drive from 1986 that had multiple heads; it had 8 if I remember correctly, looked to be the size of a record player, held 200 MB, and cost 250k.

    19. Re:simple idea by Rich0 · · Score: 2, Informative

      I'm surprised that nobody has mentioned the issue of failure of the drive material itself at higher rotational velocities.

      I believe CDs are limited to 52X because the polycarbonate they are constructed of explodes when you get too much higher than that (with a safety factor of course).

      A metal hard drive probably can take more speed, but I'm sure that at some point you get deformation of the platter. You also have bearings/etc to deal with. 30k is a pretty fast rotation rate - and we're talking about a device that is always-on.

      Additionally, even 10k SCSI drives aren't exactly consumer-grade hardware. We're already getting in to the high-end realm, and the whole point of RAID was the "I."

    20. Re:simple idea by zippthorne · · Score: 2, Informative

      You know google does the conversion for you: 2*pi*3.5 inches * 15,000 minute^-1 in m/s = 140 m / s

      --
      Can you be Even More Awesome?!
    21. Re:simple idea by Anonymous Coward · · Score: 3, Funny

      340.29 m/s is the speed of sound in a vaccuum.

      Moran.

    22. Re:simple idea by Binary+Boy · · Score: 3, Funny

      lion's share of power consumed by 15k drives is consumed in counteracting the air buffeting the heads

      Until some genius figures out how to build one with no air inside?

      Lions need air.

    23. Re:simple idea by WickedLilMonkies · · Score: 2, Informative

      You're not likely to see 30k RPM drives any time soon. The speed of a 15k drive means that the outer edge of the 3 1/2" drive is spinning pretty fast... getting close to the speed of sound ...It's why CDROM speeds haven't gone up much since the old day of 52x...

      Perhaps I haven't taken a math class in a while, but my cocktail napkin calculation says that the edge of a 3.5 inch disc spinning 15,000 times per minute will travel at just over 156 miles/hour. Nowhere near 761 mph (speed of sound).

      3.5 x Pi = 11 inch circumference x 15000 = 164,933 inches per minute / 12 inches / 5280 feet/mile * 60 minutes/hour = 156 mph.

      Furthermore, while I don't argue your point that they are spinning pretty fast, I disagree with your assertion that CDROM's haven't increased because of this. More like, I believe CDROMs are simply not manufactured within sufficient tolerances, as indicated by their frequent vibrations when they spin up, and such vibrations could cause them to shatter.

      For amusement: http://www.powerlabs.org/cdexplode.htm

    24. Re:simple idea by pjr.cc · · Score: 3, Interesting

      Unfortunately all that is quite a myth for the most part.

      Having worked in storage for aeons, the reality is that the difference between enterprise and "consumer grade rubbish" has very little to do with anything but tolerance. If you picked up a 300G 10k enterprise drive and compared it to the consumer grade rubbish you'd find nothing different. It used to be the case, way back when, that they were very different, but because consumer grade drives have gotten so much better it's just not worth the expense of building the same drive for enterprise as for consumers with slightly different specs. What is different is the acceptable tolerances: when a platter comes off the line, if it's within 2% of its manufacturing tolerances it's OK to use for enterprise, and if it's higher they throw it into consumer. The reality is that most drives are in that "better than 2% tolerance" range, and that is simply because the processes to make them have gotten so good over the years. The point is that when you hit your magic tolerance number, the drive is capable of 100% duty cycle.

      So essentially, the difference between "consumer" and "enterprise" when it comes to the casing, the platters, the heads and the motors is zero. There are a lot of different spec drives out there today, ranging from 146gb (typically the smallest you'll find these days) all the way to 2tb, with speeds from 7200 to 15000 rpm, and enterprise is the only place that uses all of them, but they still come off the same manufacturing line. The drivers behind it all come down to the consumer itself: in enterprise it's often about performance, and with consumers it's about size. Very conveniently, building bigger consumer grade drives typically means improving the performance of a drive in ways that scale straight back to the enterprise. Sure, you won't see many users throwing around 15k rpm drives, but that's more because it's unnecessary.

      So why is it that in the mid-to-low server range we find 300gb 15k drives? Because it's a cheap way of getting performance - and that is fairly important at that end of the market, where servers need to be cheap and there's a lot of competition (you know, 1-2ru with 4-8 drives and a raid card, no san).

      So what else differs between the two? Interface. In the mid-to-low server range we start talking SAS, and this is more to do with being able to talk to several drives at once (again, not something a lot of consumers do other than with usb drives perhaps). The SAS interface is quite brilliant because it can scale quite well to a larger number of drives than SATA can, and does it very cheaply. It also takes a lot of load off the server when it comes to processing data transfer (for a large number of drives). But in that same space you WILL find sata drives going up to 2tb (often servers lag consumers in size simply because of certification, not because of anything to do with stability). To call a 1tb drive unstable is rather silly in reality.

      Now the BIG end of town - SANs. These days in most SANs you'll find a mix of SATA and Fibre Channel (some do SAS as well, but it's uncommon, though it's changing). In the SAN end of town (the big boy game) you'll see it all: 7.2k rpm 2tb SATAs sitting in the same array alongside 146g 15k RPM Fibre Channel, and it's all about trading off storage density/cost against performance. Consider this: 10 1tb sata drives can (easily) consume an 8gbps FC interface - OUCH! Now a lot of SAN arrays start at around 4 FC interfaces and go up to maybe 16, but they'll be supporting literally thousands of drives. A lot of the SAN industry realised some time ago that throwing 2tb SATAs into an array made a lot of sense because SAN interfaces have grown very slowly in terms of throughput and single HD interfaces have grown very quickly. There are even several very popular arrays that only do SATA, and that was the driver behind "enterprise" grade large-storage drives (i.e. enterprise grade 1tb+ sata drives). At the server you still get the fibre channel performance. The critical difference is that the array does more work

    25. Re:simple idea by RedBear · · Score: 2, Informative

      Besides which I have no idea what the speed of sound has to do with the theoretical upper limit of the speed of a spinning disk. It's not like an airplane wing with a trailing shock wave. I would think there would be much more pressing problems that are keeping us from seeing 30K RPM hard drives anytime soon, like:

      - Shear strength of the platter material
      - Total mass of the platter, especially near the edge
      - Heat generated in the bearings
      - Energy necessary to spin the platter at that speed
      - Torsional forces from rotating the drive while it's spinning

      And probably down near the bottom of the list of potential problems:

      - Cavitation and/or shock waves from the air around the spinning platter.

  2. reallocate on write by Spazmania · · Score: 2, Informative

    Or just regenerate and write the one sector from the parity data since all modern hard disks reallocate bad sectors on write.

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    1. Re:reallocate on write by Erik+Hensema · · Score: 4, Informative

      That's what any raid controller worth their salt does. I've seen 3ware and areca controllers do this, and those aren't the most expensive controllers on the market by far.

      --

      This is your sig. There are thousands more, but this one is yours.

  3. Solved a Long Time Ago by BBCWatcher · · Score: 4, Informative

    Honestly, there really aren't that many unsolved problems in computing if you are sufficiently aware to include mainframes and mainframe operating disciplines in your consideration. The basic way the mainframe community solved this particular problem long ago was to, first, take a holistic view about mitigating data loss. Double concurrent spindle failures are just one possible risk element. What about, for example, an entire data center exploding in a spectacular fireball? (Or whatever.) IBM, for example, came up with several different flavors of GDPS and continues to refine them, and they include multiple approaches to data storage tiering across geographies, depending on what you're trying to achieve. Data loss, whether physical or otherwise (such as security breaches), is not a particular problem with this class of technology and associated IT discipline, nor do there seem to be any signs of a growing problem in this particular technology class.

    1. Re:Solved a Long Time Ago by Odinlake · · Score: 2, Insightful

      As /.:ers so eagerly point out whenever RAID is mentioned: it's not for backup. It's for reducing downtime when hd's fail. So I assume that's the issue the original poster was thinking of. Not that I know what the solution would possibly be, but there's the correct question at least.

    2. Re:Solved a Long Time Ago by DarkOx · · Score: 2, Informative

      Well, the point of the article is that if it takes your array 6 hours to rebuild instead of 4 because the capacities have gone up but the failure rate of the hardware is unchanged, you have a problem. The problem is that you are more likely to experience another failure before the first one has been mitigated. If you have that additional failure on most RAIDs (unless you are doing 5-5 or 1-5 or some other RAID over RAID scheme) you get downtime. The volume is offline and must be restored from some other location.

      The solution is usually a cluster or remote hotsite or something like that. It would be nice to have fast rebuild times back. There are lots of situations where 5 nines is not a requirement but downtime still should be avoided; shorter exposure windows for array rebuilds are a good thing.
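      To put rough numbers on the exposure window, here is a back-of-the-envelope sketch. Everything in it is an assumption made for illustration: drive failures are modelled as independent with a constant rate (an exponential model), and the MTBF, drive count, capacity and rebuild rate are made-up inputs rather than vendor data. Real failures are often correlated (same batch, same heat, same vibration), so read the output as optimistic.

```python
import math


def rebuild_hours(capacity_gb: float, rebuild_mb_per_s: float) -> float:
    """Time to rewrite one whole drive at a sustained rebuild rate."""
    seconds = capacity_gb * 1000.0 / rebuild_mb_per_s
    return seconds / 3600.0


def p_second_failure(surviving_drives: int, window_hours: float,
                     mtbf_hours: float) -> float:
    """Chance that at least one surviving drive fails inside the window,
    assuming independent failures at a constant rate of 1/MTBF."""
    return 1.0 - math.exp(-surviving_drives * window_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 1_000_000.0  # hypothetical datasheet-style MTBF, in hours
    for hours in (4.0, 6.0):
        p = p_second_failure(surviving_drives=7, window_hours=hours,
                             mtbf_hours=mtbf)
        print(f"{hours:.0f} h window: {p:.6%} chance of a second failure")
    print(f"1000 GB at 50 MB/s: {rebuild_hours(1000, 50):.1f} h to rebuild")
```

      Under this model the risk grows almost linearly with the window, so a rebuild that takes 50% longer is roughly 50% more exposed.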

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
  4. Bogus outdated thinking by twisteddk · · Score: 5, Interesting

    The author says it himself in the article:

    "And running software RAID-5 or RAID-6 equivalent does not address the underlying issues with the drive. Yes, you could mirror to get out of the disk reliability penalty box, but that does not address the cost issue."

    but he hasn't addressed the fact that today you get 100 times as much disk space for the same cost as you did 10 years ago, when cost was a factor. In real life cost isn't a factor when it comes to data storage, simply because it's really low in real-life projects compared to the other costs in a project requiring storage. So if you want the reliability you go get a mirror. Drive space is dirt cheap.

    As for the rebuild times, fine, go buy FASTER drives. I don't see the problem. HP and many other vendors have long been trying to sell combined RAID solutions (like the EVA) where you mix high-storage with high-performance drives (like SSD vs. SATA).

    The only real argument for the validity of this article is the personal use of drives/storage. And name 3 people you know who run raid-5 on their personal PCs, and I'll show you 3 guys who can't afford an SSD drive.

    --
    --- To err is human... Am I more human than most ?
    1. Re:Bogus outdated thinking by TechnoFrood · · Score: 5, Insightful

      I admit I haven't RTFA, but I don't quite get your statement of "And name 3 people you know who run raid-5 on their personal PCs, and I'll show you 3 guys who can't afford an SSD drive.", I can't see how an SSD is a replacement for a raid-5 array. Everyone I know who uses a raid-5 uses it for large amounts of storage with a basic level of protection against data loss. I could justify replacing a raid-0 set up with a SSD.

      That said I definitely couldn't afford an SSD that would be able to replace the raid-5 in my pc (4x500GB usable space of 1.34TB), the largest SSD listed on ebuyer.com are 250GB @ £360 each, I would need 8 to match my raid 5 setup which is £2880 which is probably enough to build 2 reasonable machines both with a 1.34TB raid-5 using normal HDDs.

    2. Re:Bogus outdated thinking by drsmithy · · Score: 5, Insightful

      And name 3 people you know who run raid-5 on their personal PCs, and I'll show you 3 guys who can't afford an SSD drive.

      Huh ? That's like saying show me 3 people who have a nice pair of running shoes and I'll show you 3 guys who can't afford a car.

    3. Re:Bogus outdated thinking by daybot · · Score: 3, Informative

      And name 3 people you know who run raid-5 on their personal PCs, and I'll show you 3 guys who can't afford an SSD drive.

      Yeah, every time an article on storage catches my eye, I have to check laptop SSD prices. So far, each time I do this, for the cost of a drive the size I need, I could buy a new snowboard, or a laptop, bike, half a holiday, room full of beer... etc. I really want one, but so far I haven't been able to look at that list and say "I'd rather have an SSD!"

    4. Re:Bogus outdated thinking by tg123 · · Score: 2, Insightful

      I admit I haven't RTFA, but I don't quite get your statement of "And name 3 people you know who run raid-5 on their personal PCs, and I'll show you 3 guys who can't afford an SSD drive.", I can't see how an SSD is a replacement for a raid-5 array. Everyone I know who uses a raid-5 uses it for large amounts of storage with a basic level of protection against data loss........

      I hope you're not mixing up RAID with a backup.

      Raid when used for protecting your computer will not protect your data it just makes your system able to tolerate hard drive failure.

    5. Re:Bogus outdated thinking by plover · · Score: 3, Funny

      Half a holiday is overrated. Buy the SSD! :-)

      --
      John
    6. Re:Bogus outdated thinking by Shakrai · · Score: 3, Funny

      Huh ? That's like saying show me 3 people who have a nice pair of running shoes and I'll show you 3 guys who can't afford a car.

      We need a +1 car analogy mod.... ;)

      --
      I want peace on earth and goodwill toward man.
      We are the United States Government! We don't do that sort of thing.
    7. Re:Bogus outdated thinking by Lumpy · · Score: 4, Informative

      The problem is IT guys and PHB's that think RAID=Backup.

      It's not and it never has been a backup solution. RAID is high availability and nothing more.

      RAID does its job perfectly for high availability and will continue to do so for decades. Sorry, but I have yet to see any other technology deliver the capacity I use for my small 30TB database we have at work. Our RAID 50 array works great. We also realtime mirror that to the backup SQL server (not for backup of data but backup of the entire server, so that when SQL1 goes offline SQL2 picks up the work.)

      SQL2 is backed up to a SDAT tape magazine nightly.

      RAID does what it's supposed to do perfectly; its days are not numbered, because no technology other than RAID can provide high availability.

      --
      Do not look at laser with remaining good eye.
    8. Re:Bogus outdated thinking by L4t3r4lu5 · · Score: 5, Insightful

      Raid when used for protecting your computer will not protect your data it just makes your system able to tolerate hard drive failure.

      ... Which will protect my data when a drive fails.

      RAID-5 means that I can have 3x500GB drives with 1TB of space, and not have the same worry (total loss of data) that I would if a 1x1TB drive failed.
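      The capacity arithmetic is easy to check. A minimal sketch, assuming equal-sized drives and the textbook overheads (one drive's worth of parity for RAID-5, two for RAID-6, half the drives for RAID-10); the function name is invented here:

```python
def usable_gb(level: str, drives: int, drive_gb: float) -> float:
    """Usable capacity of an array of equal-sized drives."""
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) * drive_gb  # one drive's worth of parity
    if level == "raid6":
        if drives < 4:
            raise ValueError("RAID-6 needs at least 4 drives")
        return (drives - 2) * drive_gb  # two drives' worth of parity
    if level == "raid10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID-10 needs an even number of drives, 4 or more")
        return drives // 2 * drive_gb  # every block is stored twice
    raise ValueError(f"unknown level: {level}")


if __name__ == "__main__":
    for level, n in (("raid5", 3), ("raid5", 4), ("raid6", 4), ("raid10", 4)):
        print(f"{level} with {n} x 500GB: {usable_gb(level, n, 500):.0f} GB usable")
```

      With four 500GB drives, RAID-5 gives up 500GB to parity where RAID-10 gives up 1000GB to mirroring, which is the cost gap the thread keeps arguing about.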

      We know it doesn't replace backup. We know it doesn't protect against theft, fire, malicious data destruction etc etc. You do realise who you're talking to, don't you? This is an IT article on Slashdot. Telling people on this thread that RAID isn't a replacement for regular backups is like telling a mechanic that a stick of celery is not a suitable replacement for a piston.

      --
      Finally had enough. Come see us over at https://soylentnews.org/
    9. Re:Bogus outdated thinking by Coren22 · · Score: 4, Insightful

      I will never run RAID 5 on anything but data I don't care about. The risk is too great, and the rebuild times are not near good enough. RAID 1 or 10 is the only way to go. The acronym is Redundant Array of Inexpensive Disks, if they are so Inexpensive, why are you concerned about the difference between losing 1 drive to parity, or losing half your drives to duplicates. I cannot think of a single place where RAID 5 is appropriate, the performance loss on write just isn't worth the trouble.
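      The write cost can be sketched with the usual rule-of-thumb "write penalty": a small random write costs about 2 disk I/Os on RAID 10 (write both mirrors), about 4 on RAID 5 (read old data, read old parity, write new data, write new parity) and about 6 on RAID 6. The per-drive IOPS figure below is a made-up input, and the model ignores controller cache and full-stripe writes, so treat it as an illustration rather than a benchmark.

```python
# Rule-of-thumb back-end I/Os per small random write (an approximation).
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def random_write_iops(level: str, drives: int, iops_per_drive: float) -> float:
    """Rough ceiling on small random writes per second for the whole array."""
    return drives * iops_per_drive / WRITE_PENALTY[level]


if __name__ == "__main__":
    # 8 drives at a hypothetical 150 random IOPS each.
    for level in ("raid10", "raid5", "raid6"):
        print(f"{level}: ~{random_write_iops(level, 8, 150):.0f} write IOPS")
```

      On the same spindles RAID 5 comes out at about half the small-write rate of RAID 10 in this model; whether that matters depends on how write-heavy the workload is.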

      --
      APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
    10. Re:Bogus outdated thinking by Svartalf · · Score: 2, Interesting

      RAID5 is not backup. It's resilience against a single failure bringing the whole system down.

      RAID was originally developed to make what we consider small storage capacities (then massive) affordable and reasonably reliable.

      You're using RAID5 in its "intended" use - but an SSD of the same capacity will be inherently MORE reliable (by a factor of how many of those magnetic disks you remove) than your system design right now.

      From personal experience with a system customer base of literally thousands of enterprise class servers spread out over many companies, RAID doesn't work QUITE the way people make it out to be. We're ripping it out of the equipment and reverting to warm backups instead- the RAID1 design they fielded made the servers unstable.

      The field engineer crowd (one of my friends worked with Nortel in the field engineer group and my brother is a manager for outsource company doing a lot of the same work with the same customers...) HATES RAID.

      Blow a controller? Better hope you have an identical one in stock. You can't just swap out a differing controller of the same brand or pop a different brand in- they all do things ever so slightly differently on the disks.

      Blow a disk? Better hope you can get the new drive in there and integrate it properly before you lose another.

      Disks don't have the reliability we once thought they had.
      RAID doesn't do what most people think it does for them.

      --
      I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
    11. Re:Bogus outdated thinking by metamatic · · Score: 4, Insightful

      Blow a controller? Better hope you have an identical one in stock. You can't just swap out a differing controller of the same brand or pop a different brand in- they all do things ever so slightly differently on the disks.

      That's why I prefer software RAID.

      --
      GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
    12. Re:Bogus outdated thinking by Hi_2k · · Score: 2, Informative

      That's why the smart money is based on node-based storage: Multiple boxes that are interchangeable. It's a shameless product plug, but I work for Isilon Systems, and our solution is that the whole system is considered replaceable: We don't sell a configuration that doesn't allow you to yank an entire box transparently. A drive failure is rebuilt and ready for swapping as soon as it comes up: Most of our admins don't know about disk failures until their data is already reprotected.

      Granted, our smallest config is 9TB; We're somewhat overkill for a home user. But if you need a company-wide NAS...

      Commodity hardware, standard networking (Gig and 10Gig Ethernet frontend, Infiniband backend), and a very smart filesystem (Capable of protecting from up to 4 simultaneous whole-node failures) == a killer combination; It takes some seriously bad luck for data-loss to become a problem.

      --
      When life gives you crap, Make Crapade.
      Sluggy Freelance.
    13. Re:Bogus outdated thinking by fluffernutter · · Score: 2, Informative

      To do a true raid-5, cost of the drives is fairly negligible. AFAIK, to avoid missing writes in the event of a power outage you need a true raid chassis with a battery backup which runs $4K +. Fake raid 5 and software raid 5 are pretty risky as the writes can get caught in the parity calculation and not get written out if the power goes down to everything at the same time.

      --
      Laws are rules for the court, but merely a bottom bar to hit for life. Think beyond laws in your actions always.
    14. Re:Bogus outdated thinking by Rockoon · · Score: 2, Informative

      To do a true raid-5, cost of the drives is fairly negligible.

      While you are absolutely correct about cost, I think your definition of what a true raid-5 is needs a little work.

      The purpose of RAID-n is to survive failures with near-zero downtime. The larger the disparity grows between capacity and performance as array sizes increase, the less and less these RAIDs are serving their purpose. The chance of a drive failure while rebuilding a multi-TB array is quite significant, an occurrence that RAID-n was supposed to minimize to near-zero levels.
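
      A rough sketch of why rebuild risk grows with capacity. It assumes an unrecoverable read error rate of 1 per 10^14 bits (a figure often quoted on consumer drive spec sheets, used here purely as an assumption) and independent errors; the function name and the array sizes are made up for illustration.

```python
import math

def p_read_error(bytes_read, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error (URE)
    while reading bytes_read bytes, assuming independent bit errors."""
    bits = bytes_read * 8
    # 1 - (1 - p)**bits, evaluated with log1p/expm1 so tiny p does not round away
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 1e12  # decimal terabyte, as drive vendors count it

# RAID-5 rebuild: every surviving drive has to be read in full.
for drives, size_tb in [(4, 1), (8, 2), (8, 4)]:
    p = p_read_error((drives - 1) * size_tb * TB)
    print(f"{drives} x {size_tb} TB: P(URE during rebuild) = {p:.0%}")
```

      Under these assumptions, reading 12.5 TB (10^14 bits) already gives roughly a 63% chance of at least one read error, which is the disparity described above. Real drives do not fail independently, so treat this as an order-of-magnitude guide only.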

      In the future, there will only be RAID-0 and RAID-JBOD for conventional drives. Uptime will have to be solved another way, because RAID-n solves it less and less as the years (and thus, capacity) tick away.

      --
      "His name was James Damore."
    15. Re:Bogus outdated thinking by Wdomburg · · Score: 2, Informative

      Try a rebuild on a much larger aggregate running a dual parity array under load. Trust me, they can easily run days. Say you have a 16 disk aggregate using 1TB 7200RPM disks. Because you need every block in a stripe to reconstruct parity, you need to read from the other disks to reconstruct; so 14 reads and 1 write per block.
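
      A back-of-envelope sketch of that arithmetic. The 80 MB/s sustained rate and the 10% share of bandwidth left for the rebuild under load are assumptions picked for illustration, not measurements; the function names are hypothetical.

```python
def rebuild_io(disks, parity, blocks):
    """Block reads and writes needed to reconstruct one failed disk."""
    reads_per_block = disks - parity  # e.g. 16 - 2 = 14 blocks read per stripe
    return reads_per_block * blocks, blocks

def rebuild_hours(disk_bytes, sustained_bytes_per_s, rebuild_share):
    """Hours to fill one replacement disk when only rebuild_share of
    each disk's bandwidth is available to the rebuild."""
    if not 0 < rebuild_share <= 1:
        raise ValueError("rebuild_share must be in (0, 1]")
    return disk_bytes / (sustained_bytes_per_s * rebuild_share) / 3600

reads, writes = rebuild_io(disks=16, parity=2, blocks=1)
print(f"{reads} reads and {writes} write per block")
print(f"idle array: {rebuild_hours(1e12, 80e6, 1.0):.1f} h")
print(f"under load: {rebuild_hours(1e12, 80e6, 0.1):.1f} h")
```

      With those assumed numbers an idle rebuild of a 1TB disk takes a few hours, while one that only gets a tenth of the bandwidth runs well past a day.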

      You're also misunderstanding how the SSD caching works for ZFS. Blocks are only pulled in after repeated requests, which isn't going to be the case for a resilver. There will be at least some benefit to read ahead caching in memory, but even that has sharply diminishing returns, particularly with the ZFS rebuild strategy of reconstructing at a file level rather than a linear block rebuild. That approach has significant benefits though. By walking through the metadata instead of blindly copying blocks you don't have to rebuild empty space, and if - god forbid - you lose more than one drive in a RAID-Z or two drives in a RAID-Z2 array, you still have a partial recovery to work with.

    16. Re:Bogus outdated thinking by Courageous · · Score: 2, Informative

      As for the rebuild times, fine, go buy FASTER drives.

      Hard drives are getting bigger faster than they are getting faster.

      Hard drives are getting bigger faster than they are getting more reliable.

      In an enterprise setting, SATA based storage is a reality, for cost reasons, in tiers 2 and 3.

      Your suggestion that this problem is solved simply by buying faster drives is a poor one.

      And in a few generations of high speed drives, the problem will manifest regardless.

      Henry's article is not as clear as it could be, however. He's really talking about the pending failure for traditional raid sets as we know them, such as aggregates of N drives in a set, or drives hung off a RAID controller. RAID as an algorithm for error correction is nowhere near failure. Look at the manner in which Isilon does it. All the data in an isilon system is part of a clustered RAID approach, but this is distributed in data packets far different than standard block. All nodes in an Isilon cluster participate in a "RAID rebuild" when it's needed; the system is capable of multigigabyte per second RAID rebuild, and it only rebuilds what is needed, not the "disk". This can all be done with economical SATA drives.

      Note, however, that Isilon's RAID is not really RAID at all. I.e., it's not about arrays of disks, but rather parity based correction of lost file redundancy data. I.e., it's more object based, such as Henry was alluding to.

      As for the classic RAID set, Henry is quite right when he says that it is trying to die. RAID rebuild times are already in excess of 24 hours, and are going to be that much worse with 2TB and 4TB drives. With longer RAID rebuild times, pDATALOSS increases notably, particularly if you are aware of the Google and Carnegie findings that drives actually tend to fail at the same time. I.e., pFAIL of a HD is not independent of pFAIL of other HD's in a RAID set. They tend to fail together.

      C//

  5. Enlighten me by El_Muerte_TDS · · Score: 3, Insightful

    (Certain) RAID (levels) address the issue of potential dataloss due to hardware malfunction. How does moving to an Object-Based Storage Device address this issue better? Actually, I don't see how RAID and OSD are mutually exclusive.

  6. Harddisks, not RAID by Anonymous Coward · · Score: 5, Insightful

    Now that's a stupid article.

    It basically says, you can't read a harddisk more than X times before you get an error on some sector, so RAID is dead. That's a logical non sequitur. RAID is a generic technology that also applies to flash memory cards, USB sticks, anything you can store data on basically. The base technique says "given this reliability, you can up the reliability if you add some redundancy". There's no link to harddisks other than that that's what they're used for right now.

    1. Re:Harddisks, not RAID by J4 · · Score: 2, Insightful

      RAID is here to stay for a while no doubt, but it's a response to a series of problems that has problems of its own. You can take 5+1 drives make an array where one bad chassis slot can indeed take the whole thing out, or you make a bunch of mirrors at the expense of capacity, or you can stripe one scary large fragile volume. In production it's about performance & availability. Realize that the whole data integrity thing is relative and merely an illusion. It's kinda like on Futurama when they had the tanker with 1k hulls. The only solution to the first case is double the hardware, which is a major investment and recurring cost (rack space/electricity, stamps). Murphy's law tells us that indeed "shit happens", so there are no guarantees.

      Although I didn't read the article I suspect it's promoting the cloud paradigm, which is the current ultimate expression of redundancy.

    2. Re:Harddisks, not RAID by Coren22 · · Score: 2, Insightful

      Wow, never thought I would see an obscure reference like that. Most of the people I know who read Ender's Game never bothered to read the rest of the series and would have no clue who Jane was.

      --
      APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
  7. RAID is here to stay by paulhar · · Score: 5, Insightful

    Disclaimer: I work for a storage vendor.

    > FTA: The real fix must be based on new technology such as OSD, where the disk knows what is stored on it and only has to read and write the objects being managed, not the whole device
    OSD doesn't change anything. The disk has failed. How has OSD helped?

    > FTA: or something like declustered RAID
    Just skimming that document it seems to claim: only reconstruct data, not white space, and use a parity scheme that limits damage. Enterprise arrays that have native filesystem virtualisation (WAFL for example) already do this. RAID 6 arrays do this.

    Let's recap. Physical devices including SSDs will fail. You need to be able to recover from failure. The failure could be as bad as the entire physical device failing, or as minor as a single sector being unreadable. In the former case a RAID reconstruct will recover the data but you'll hit RAID recovery errors due to the raw amount of data that needs to be recovered. Enterprise arrays mitigate the risk of recovery errors by using RAID 6. They could even recover the data from a DR mirrored system as part of the recovery scheme.

    And when RAID 6 has a high enough risk that it's worth expanding the scheme everyone will start switching from double parity schemes to triple parity schemes since they're much less expensive in terms of spindle count than RAID 6+1.

    One assumption is that, at some point in the future, reconstructions will be a continually occurring background task just like any other background task that enterprise arrays handle. As long as there is enough resiliency and performance isn't impacted then it doesn't matter if a disk is being rebuilt.

    1. Re:RAID is here to stay by Kjella · · Score: 4, Informative

      And when RAID 6 has a high enough risk that it's worth expanding the scheme everyone will start switching from double parity schemes to triple parity schemes since they're much less expensive in terms of spindle count than RAID 6+1.

      I don't think you've quite understood the problem described. You can have an infinite number of parity disks, but it does you no good if recovering one data disk causes another data disk to fail.

      Imagine a disk fails on every 100TB of reads (10^14). You have ten 1TB data disks. Imagine you keep them in perfect rotation so they've spent 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100% of their lifetime. The last disk dies and you replace it with a new drive (0%). To rebuild the drive you read 1TB from each data disk and use whatever parity you need. They've now spent 11, 21, 31, 41, 51, 61, 71, 81, 91 and 1% (your new disk) of their lifetime and you can read another 9TB before you need a new disk.

      Now we try doing the same with ten 10TB disks and the same reliability. The last disk dies and you replace it, only now you must read 10TB from each disk. Instead of adding 1% to the lifetime it adds 10% so that they've spent 20, 30, 40, 50, 60, 70, 80, 90, 100 and 10% (your new disk) of their lifetime. But now another disk fails, you can recover that but then another will fail and another and another and another.

      Basically, parity does not solve that issue. If you had a mirror, you would instead copy the mirrored disk with significantly less wear on the disks. RAID is very nice as a high-level check that the data isn't corrupted but it's a very inefficient way of rebuilding a whole disk.
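
      The toy model above can be run as a few lines of code. This is only a sketch of that thought experiment (a disk dies after a fixed 100TB of reads and every rebuild reads each disk in full); real drives do not wear out this deterministically, and the function name and the `max_rebuilds` cap are made up for illustration.

```python
def cascade(disk_tb, disks=10, lifetime_tb=100, max_rebuilds=50):
    """Count back-to-back rebuilds in the toy wear model.

    Each disk dies after lifetime_tb of reads; a rebuild reads disk_tb
    from every disk, so it costs each one disk_tb / lifetime_tb of its life.
    Returns (rebuilds, wear per disk in percent).
    """
    step = 100 * disk_tb / lifetime_tb                    # wear per rebuild, percent
    wear = [100 * (i + 1) / disks for i in range(disks)]  # 10, 20, ..., 100
    rebuilds = 0
    while max(wear) >= 100 and rebuilds < max_rebuilds:
        wear[wear.index(max(wear))] = 0.0  # swap the dead disk for a new one
        wear = [w + step for w in wear]    # the rebuild wears every disk
        rebuilds += 1
    return rebuilds, wear

print(cascade(1))                    # 1TB disks: one rebuild, then it settles
print(cascade(10, max_rebuilds=1))   # 10TB disks: another disk is at 100% already
print(cascade(10)[0])                # 10TB disks: only stops at the max_rebuilds cap
```

      With 1TB disks the loop stops after a single rebuild; with 10TB disks each rebuild pushes the next disk to 100%, so it only ends because of the cap.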

      --
      Live today, because you never know what tomorrow brings
    2. Re:RAID is here to stay by paulhar · · Score: 2, Interesting

      RAID 1 has much less reliability than RAID 6. Assume a typical case: one disk totally fails. You then start to reconstruct - in a RAID 1 scheme a single sector error will result in the rebuild failing. Not great.

      In RAID 6 you start the rebuild and you get a single sector error from one of the drives you're rebuilding from. At that point you've got yet another parity scheme available (in the form of the RAID 6 bit) that figures out what that sector should have been and then continues the rebuild. Then you go back and decide what to do about that drive that had the second error.

      A lot of drive failures aren't full head crashes or motor errors but just single sector, track, bits of dirt on the platter style errors. Other than the affected area the drive can be read.

      With RAID 6 you can fail two disks completely and still access the data. You're still reading from the same ten 10TB disks in your example and if the implementation of RAID 6 is optimal (RAID-DP) you aren't having to read additional data from the same physical disks.

      In the world you describe with 10TB drives it sounds like you'd just not be able to use the disks at all since any process that reads from the disks will kill them. There are a few things that could happen:

      1. Disks get more reliable. Hasn't happened much yet but...
      2. We switch to different packaging. Instead of making disks larger we cram more of them into the same space similar to CPU cores - same MTBF per disk but lots of them presented out by one physical interface.
      3. We change technologies completely. SSD (interesting failure modes there too... needs RAID)

      I guess we'll find out in only a few years...

    3. Re:RAID is here to stay by dpilot · · Score: 2, Insightful

      Even this doesn't handle the other side of the scenario...

      Buy your box of drives and put them in a RAID-6. Chances are you just bought all of the drives at the same time, from the same vendor, and they're probably all the same model of the same brand. Chances are also very good that they're from the same manufacturing lot. You've got N "identical" drives. Install them all into your drive enclosure, power the whole thing up, build your RAID-6, put it into service.

      Now all of your "identical" drives are running off of the same power supply, getting the same voltage. There's likely to be some temperature gradient inside the box, but overall they're all at similar temperatures. They have the same number of POH, the same number of read requests, same number of write requests. In essence, they remain very nearly "identical" through their service life.

      Next, let one drive fail. What are your chances of having a second drive failure, especially when you power the RAID down to replace the first failing drive?

      That's what I've heard anecdotally from those who manage this type of thing where I work. RAIDs tend not to have single-drive failures, or at least tend to have "time clustered" drive failures. Plan for it.

      --
      The living have better things to do than to continue hating the dead.
    4. Re:RAID is here to stay by atamido · · Score: 2, Informative

      Actually, reliability quickly scales towards RAID 1+0 as the number of drives increases. In a 14 drive array, a single drive failure in both is fine. A second drive failure has the possibility of destroying the RAID 1+0 array, but the chance of the right drive failing is low. With 3 total drive failures, RAID 6 will fail, while RAID 1+0 has a low probability of failure.

      Rebuild times are also much shorter on RAID 1+0 as only a single drive has to be read, which reduces heat produced and the chance of a second failure.

      There are some papers that describe the math of the statistical analysis to prove it, but I can't track them down at the moment. It is rather counterintuitive. But, you have significantly less drive space, so RAID 6 may still be the better option for some circumstances.
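
      Not those papers, but the basic combinatorics can be sketched directly. The assumptions here are mine: 14 drives arranged as 7 mirrored pairs for RAID 1+0, failures hitting drives uniformly at random, and no rebuild completing between failures. Function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def raid10_survival(pairs, failed):
    """P(array survives) when `failed` random drives die in a RAID 1+0
    built from `pairs` two-way mirrors: no mirror may lose both halves."""
    if failed > pairs:
        return Fraction(0)
    # choose which mirrors are hit, then which half of each one
    return Fraction(comb(pairs, failed) * 2**failed, comb(2 * pairs, failed))

def raid6_survival(failed):
    """RAID 6 tolerates any two failed drives and nothing beyond that."""
    return Fraction(1 if failed <= 2 else 0)

for k in range(1, 5):
    print(k, raid10_survival(7, k), raid6_survival(k))
```

      Under these assumptions the 14-drive RAID 1+0 survives a second failure 12 times out of 13 and a third failure 10 times out of 13, while RAID 6 is certain to survive two failures and certain to fail on the third.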

    5. Re:RAID is here to stay by maraist · · Score: 2, Interesting

      I don't understand what your failure rate strategy is. First of all, there's no such thing as saying you are 90% or 10% of the way through a disk's life. It's a probability distribution, whose probability is dramatically affected by the current events (and somewhat related to historical events). A drive might be at a 0.00005% probability of failure at any given moment, but then a large sustained read occurs which adjusts the heat and causes voltage fluctuations, so now you're operating at 0.001% probability.

      Then a drive dies in hot-swap-mode, a drive spins down, then another spins up, this has massive voltage fluctuations as well as slight tension on the cabling which causes reflections in the wiring which increases your probability of failure to say 0.02%. (I'm totally making up numbers, but the trends are what's important).

      So the act of powering down/up or hot-swapping intrinsically increases the probability of co-disk-failures, unless you have a very expensive system with separate AC/DC converters (e.g. fully decoupled) and obviously isolated frames, heat-compartments, etc.

      BUT, you can mitigate this by having 3+-way redundancy (RAID-1; I honestly don't understand the point of using slower RAID-5 / RAID-6 anymore). So when one drive fails, you have addressed the probability of a second failure. There is a geometric reduction in probability that 3 or 4 or 5 simultaneous drives fail. Meaning even at the peak risky part of the drive-swap operation, if you have say 2% probability that another drive will fail, then there is 0.04% probability that two drives will fail simultaneously. 0.0008% that three fail, etc.

      This isn't strictly correct, of course, because the probabilities are not fully independent. You have many common components, and thus their probabilities are intertwined. But suffice it to say the probabilities are less.

      Now I say 3+way RAID-1 because it may be silly to swap out a single drive when one goes bad. The process I would recommend (if you have a sufficiently advanced RAID controller, and non-super-expensive disks), is this:

      5-way RAID-1 with 2 powered down disks (thus effectively 3-way RAID-1)
      On a drive failure, power up the two disks and initiate their syncing.
      Swap out the error'd drive, and initiate its syncing.

      For a brief-while, you have 2 valid, 2 semi-valid, and 1 semi-semi-valid drive.

      As the drives sync up (may take over 24 hours), power-down the original remaining 2 and remove them.

      Recycle the good disks into JBOH (Just a bunch of hardware) clustering. Meaning boot-disks / log-file disks in say RAID-1, swapping out the oldest drive.

      You can either buy several 4-way/5-way RAID controllers, or get a single 15-disk RAID controller for under $1k. This allows you to have multiple logical volumes and share the 'spun-down disks'. So now you're really only using 3 disks per logical-volume, though having two logical volumes with bad disks does reduce your ideal reliability somewhat. But this gives you 4 volumes which can be combined into RAID-10. You could build such a system for under $6k with various mixtures of high-end and low-end disks (for different partition requirements, boot/OS/linear-logging (RAID-1), random-write-data (RAID-10)).

      If the data is super critical, use a block-level master-slave replication. Ideally your application supports direct master-slave or better yet, multi-master.

      And if you're JBOH (Just a Bunch Of Hardware) clustering, then trivial RAID-1 with 2 or 3 disks (in software-raid) is all you need. Note, I use 3-disk RAID10 on my home linux machine, (that plus DVD drive fills up my IDE slots) - pretty clever technique. Yes I know virtually all MB's have hardware RAID these days, but unless they've got an extra 4Gig of buffer-RAM in them, they're pointless in my opinion, plus they're non-portable (screw transparent windows support, you can't distinguish disk errors from forced reboots anyway).

      --
      -Michael
    6. Re:RAID is here to stay by LWATCDR · · Score: 2, Interesting

      Well the logical thing IMHO is after the first year you put in a new drive and do an array rebuild after making a backup.
      Drives are really cheap and I would do that for as long as the array is in use.
      Reuse the old drives in desktops if they are SATA.
      Not perfect but it keeps you from having an array of old drives in your server.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    7. Re:RAID is here to stay by Chris_Jefferson · · Score: 2, Insightful

      Basically you are suggesting someone would make and then sell a disk which could only be read, entirely, 10 times in its entire lifetime?

      Well that's easily solved. We won't buy those disks.

      --
      Combination - fun iPhone puzzling
  8. Hardware RAID is dead by PiSkyHi · · Score: 3, Interesting

    Hardware RAID is dead - software for redundant storage is just getting started. I am looking forward to making use of btrfs so I can have some consistency and confidence in how I deal with any ultimately disposable storage component.

    The ZFS folks have been doing it fine for some time now.

    Hardware RAID controllers have no place in modern storage arrays - except those forced to run Windows

    1. Re:Hardware RAID is dead by Chrisje · · Score: 4, Insightful

      First of all, "Hardware RAID" is still software, just executed by dedicated circuits. The distinction is kind of moot. For low-cost, low performance systems, software can run on your main box to perform this task, but for high-end applications you'll want dedicated hardware to take care of it, so your machine can do what it needs to do with more zeal.

      So my guess is that you're not working for a storage vendor. I haven't seen many people switch to SW RAID recently. If anything, the Unix world is finally crawling out of its "lvm striping" hole. Most servers anywhere are running on stuff like HP's Proliants, and I don't see customers ship back the SmartArray controllers.

    2. Re:Hardware RAID is dead by paulhar · · Score: 2, Informative

      > First of all, "Hardware RAID" is still software, just executed by dedicated circuits. The distinction is kind of moot.

      I'm not sure where in my post you saw anything about a comparison between Hardware RAID or Software RAID.

      > So my guess is that you're not working for a storage vendor. I haven't seen many people switch to SW RAID recently.

      I work for NetApp. I didn't think it mattered much in the post I made though. To your second point, as all of the NetApp Enterprise storage systems use software based RAID I can happily confirm that many hundreds of thousands of customers have switched to software RAID.

      As you mentioned earlier though the point is moot since when you're delivering an enterprise array to a customer it doesn't matter if the array uses RAID cards provided by a 3rd party vendor, uses RAID cards built in-house, or uses software RAID to write the data that the customer gives you. The ingress point for the customer is a physical port (IP/FC typically) and that port provides RAID capabilities. Maybe that's also hardware RAID?

    3. Re:Hardware RAID is dead by RulerOf · · Score: 3, Informative

      FWIW, I'm a happy 3ware customer... saddened by their sellout to LSI, but I digress.

      When I think of software RAID, I think of parity data being handled by the operating system, being done on x86 chips as part of the kernel or offloaded via a driver (thinking Fake-RAID).

      If you're abstracting your storage away from the operating system that uses it, say via iSCSI or NFS or SMB to a dedicated storage box, like a NetApp filer or a Celerra, then I would consider that hardware RAID, personally speaking. If you're saying that these dedicated storage boxes manage parity, mirroring and so on all done with the same chip that's also running their local operating systems, then I have to admit that yes, that sounds like software RAID to me, but the real distinction I've come to draw between software and hardware RAID is a matter of performance and feature set. If said boxes give the same or better performance (I/Ops and throughput) to a workload as a dedicated, internal storage system managed by something like my 9650SE, then hell..... who cares, right? Aside from being rather impressed that such is possible without dedicated XOR chips, that is.

      --
      Boot Windows, Linux, and ESX over the network for free.
  9. Non-issue ... by Lazy+Jones · · Score: 3, Interesting
    Modern RAID arrays show no dramatic performance degradation while rebuilding. Also, with RAID-50/RAID-60 arrays, only a fraction of the disk accesses are slower than usual when a single drive is replaced.

    For enterprise level storage systems, this is also a non-issue because of thin provisioning.

    --
    "I love my job, but I hate talking to people like you" (Freddie Mercury)
  10. I thought RAID was about spindle count by BlueParrot · · Score: 4, Insightful

    I admit I'm not an expert, but I was under the impression that RAID was mainly about ensuring you a large number of spindles and some redundancy so you can serve data quickly even if a couple of drives fail while the servers are under pressure. Surely you would not rely on a RAID to avoid data loss since you should be keeping external backups anyway?

    1. Re:I thought RAID was about spindle count by gedhrel · · Score: 4, Informative

      You don't rely on RAID to avoid data loss; you rely on it as a first line in providing continuity. We run backups of large systems here, but we tend to do other things too: synchronous live mirroring between sites of the critical data. And better system design. There are some systems where, whilst we _could_ go back to tape (or VTL) at a pinch, having to do so would be a disaster in itself.

      We're designing systems that permit rapid service recovery (the most live critical data) and a second tier of online recovery to get the rest back. We just can't afford the downtime.

      Double-spindle failures on RAID systems are just one of those things that you _will_ see. Deciding whether a system deserves some other measure of redundancy is mostly an actuarial, rather than a technical, decision.

  11. Wrong assumptions by vojtech · · Score: 5, Insightful

    The article assumes that when within a RAID5 array a drive encounters a single sector failure (the most common failure scenario), an entire disk has to go offline, be replaced and rebuilt.

    That is utter nonsense, of course. All that's needed is to rebuild a single affected stripe of the array to a spare disk. (You do have spares in your RAID setups, right?)

    As soon as the single stripe is rebuilt, the whole array is in a fully redundant state again - although the redundancy is spread across the drive with a bad sector and the spare.

    Even better, modern drives have internal sector remapping tables and when a bad sector occurs, all the array has to do is to read the other disks, calculate the sector, and WRITE it back to the FAILED drive.
    The drive will remap the sector, replace it with a good one, and tada, we have a well working array again. In fact, this is exactly what Linux's MD RAID5 driver does, so it's not just a theory.
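
    A minimal sketch of the "read the other disks, calculate the sector" step for single-parity RAID5. This is not the Linux MD code, just the underlying XOR arithmetic; the function names and the tiny 2-byte "sectors" are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of the same size")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(stripe, bad_index):
    """Recompute the unreadable block from every other block in the stripe
    (data and parity alike); the result is what gets written back."""
    return xor_blocks([b for i, b in enumerate(stripe) if i != bad_index])

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # one sector from each data disk
stripe = data + [xor_blocks(data)]              # parity sector goes last
assert reconstruct(stripe, 1) == data[1]        # "lost" sector recovered
```

    Because parity is just the XOR of the data blocks, the same function rebuilds a lost data sector or a lost parity sector. Only the one stripe has to be read, not the whole array.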

    Catastrophic whole-drive failures (head crash, etc) do happen, too. And there the article would have a point - you need to rebuild the whole array. But then - these are a couple of orders of magnitude less frequent than simple data errors. So no reason to worry again.

    *sigh*

    1. Re:Wrong assumptions by Anonymous Coward · · Score: 2, Insightful

      Even if only a sector in a disk has failed, I'd mark the entire disk as failed and replace it as soon as I could. Maybe I'm paranoid, but I've seen many times that when something starts to fail, it continues failing at increasing speed.

  12. If you want smaller drives... by asdf7890 · · Score: 4, Interesting

    If you want smaller drives to speed up rebuild times then, erm, buy smaller drives? You can get ~70GB 10Krpm and 15Krpm drives fairly readily - much smaller than the 500-to-2000-GB monsters and faster too. You can still buy ~80GB PATA drives too, I've seen them when shopping for larger models, though you only save a couple of peanuts compared to the cost of 250+GB units.

    If you can't afford those but still don't want 500+GB drives because they take too long to rebuild if the array is compromised, and management won't let you buy bog standard 160GB (or smaller) drives as they only cost 20% less than 750GB units without the speed benefits of the high cost 15Krpm ones, how about using software RAID and only using the first part of the drive? Easily done with Linux's software RAID (partition the drives with a single 100GB (for example) partition, and RAID that instead of the full drive) and I'm sure just as easy with other OSs. You'll get speed bonuses too: you'll be using the fastest part of the drive in terms of bulk transfer speed (most spinning drives are arranged such that the earlier tracks have higher data density) and you'll have lower latency on average as the heads will never need to move the full diameter of the platter. And you've got the rest of the drive space to expand onto if needed later. Or maybe you could hide your porn stash there.

  13. ZFS, Anyone? by Tomsk70 · · Score: 2, Interesting

    I've managed to get this going, using the excellent FreeNAS - although proceed with caution, as only the beta build supports it, and I've already had serious (all data lost) crashes twice.

    However the principle is sound, and I'm sure this will become standard before long - the only trouble being that HP, Dell and the like can't simply offer upgrades for existing RAID cards - due to the nature of ZFS, it needs a 'proper' CPU and a gig or two of RAM. Even so, it does protect against many of the problems now besetting RAID (which was never meant to handle modern, gargantuan disk sizes).

  14. Fountain codes? by andrewagill · · Score: 3, Interesting

    What about fountain codes? The coding there is capable of recovering from a greater variety of faults.

  15. ZFS by DiSKiLLeR · · Score: 5, Informative

    This is something the ZFS creators have been talking about for some time, and been actively trying to solve.

    ZFS now has triple parity, as well as actively checksumming every disk block.

    --
    You can tell how powerful someone is by the magnitude of the crime they can commit and be able to get away with.
    1. Re:ZFS by DiSKiLLeR · · Score: 5, Informative

      I thought I should add:

      ZFS speeds up rebuilding a RAID (called resilvering) over traditional non-intelligent or non-filesystem based RAIDs by only rebuilding the blocks that actually contain live data; there's no need to rebuild EVERYTHING if only half the filesystem is in use.

      ZFS also starts the resilvering process by rebuilding the most IMPORTANT parts first: the filesystem metadata, then it works its way down the tree to the leaf nodes rebuilding data. This way, if more disks fail, you have attempted to rebuild the most data possible. If filesystem metadata is hosed, everything is hosed.
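
      A toy sketch of that ordering, not ZFS's actual code or on-disk format: a breadth-first walk from the root of a block tree visits the metadata levels before the leaves, and it never touches blocks that nothing references (free space). The `Block` class and the block names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Block:
    name: str
    children: list = field(default_factory=list)

def resilver_order(root):
    """Order in which live blocks are rebuilt: top of the tree first, leaves last."""
    order, queue = [], deque([root])
    while queue:
        block = queue.popleft()
        order.append(block.name)       # rebuild this block on the new disk
        queue.extend(block.children)   # its children wait until this level is done
    return order

tree = Block("root", [
    Block("dirA", [Block("file1")]),
    Block("dirB", [Block("file2"), Block("file3")]),
])
print(resilver_order(tree))
```

      If another disk dies partway through such a walk, whatever was already rebuilt is the part of the tree closest to the root, which is why a partial recovery still has something usable to work with.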

      ZFS tells you which files are corrupt, if any are, when insufficient replicas exist due to failed disks.

      All this on top of double or triple parity. :)

      --
      You can tell how powerful someone is by the magnitude of the crime they can commit and be able to get away with.
  16. Re:Worked-around a Long Time Ago by Anonymous Coward · · Score: 5, Interesting

    But really none of that should be necessary for the general case. Storing data in different physical locations is a good but entirely unrelated issue- the main problem of disk reliability is still very much in need of a solution. That's pretty much the point of the article: You can come up with various solutions which move the problem around, give multiple fallbacks for when something goes wrong.. but there's still the problem of things going wrong in the first place. I shouldn't need to use 12 separate disks spread across the globe just for basic reliability / redundancy

  17. Old news by EmTeedee · · Score: 2, Interesting

    Read that before on slashdot. Why RAID 5 Stops Working In 2009

  18. Parity declustering by Biolo · · Score: 4, Interesting

    Actually I like the parity declustering idea that was linked to in that article, seems to me if implemented correctly it could mitigate a large part of the issue. I have personally encountered the hard error on RAID5 rebuild issue, twice, so there definitely is a problem to be addressed...and yes, I do now only implement RAID6 as a result.

    For those who haven't RTFATFALT (RTFA the f*** article links to), parity declustering, as I understand it, is where you have, say, an 8 drive array, but where each block is written to only a subset of those drives, say 4. Now, obviously you lose 25% of your storage capacity (1/4), but consider a rebuild for a failed disk. In this instance only 50% of your blocks are likely to be on your failed drive, so immediately you cut your rebuild time in half, halving your data reads, and therefore your chance of encountering a hard error. Larger numbers of disks in the array, or spanning your data over fewer drives, cuts this further.
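
    A quick way to check the 50% figure, as a sketch only: it assumes stripes are spread evenly over every possible subset of drives, which a real declustered layout only approximates. The function name is made up for this example.

```python
from itertools import combinations


def fraction_of_stripes_on_disk(num_disks, stripe_width, failed_disk=0):
    """Fraction of stripes that include the failed disk, assuming every
    possible subset of `stripe_width` disks is used equally often."""
    layouts = list(combinations(range(num_disks), stripe_width))
    affected = sum(1 for layout in layouts if failed_disk in layout)
    return affected / len(layouts)


print(fraction_of_stripes_on_disk(8, 4))   # 8 drives, each stripe on 4 of them
print(fraction_of_stripes_on_disk(8, 8))   # conventional layout: every stripe on every drive
print(fraction_of_stripes_on_disk(16, 4))  # a bigger array with the same stripe width
```

    Under that assumption the fraction is simply stripe_width / num_disks, which is why adding drives or narrowing the stripe both shrink the rebuild.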

    Now, consider the flexibility you could build into an implementation of this scheme. Simply by allowing the number of drives a block spans to be configurable on a per block basis, you could then allow any filesystem that is on that array to say, on a per file basis, how many disks to span over. You could then allow apps and sysadmins to say that a given file needs to have the maximum write performance, so diskSpan=2, which gives you effectively RAID10 for that file (each block is written to 2 drives, but each of the multiple blocks in the file is likely to be written to a different pair of drives, so not quite RAID10, but close). Where you didn't want a file to consume 2x its size on the storage system, you could allow a higher diskSpan number. You could also allow configurable parity on a per block basis, so particularly important files can survive multiple disk failures, while temp files could have no parity. There would need to be a rule however that parity+diskSpan is less than or equal to the number of devices in the array.

    Obviously there is an issue here where the total capacity of the array is not knowable: files with diskSpan numbers lower than the default for the array will reduce the capacity, numbers higher will increase it. This alone might require new filesystems, but you could implement today's filesystems on this array as long as you disallowed the per-block diskSpan feature.

    This even helps for expanding the array, as there is now no need to re-read all of the data in the array (with the resulting chance of encountering a hard error, adding huge load to the system causing a drive to fail, etc). The extra capacity is simply available. Over time you probably want a redistribution routine to move data from the existing array members to the new members to spread the load and capacity.

    How about you implement a performance optimiser too, that looks for the most frequently accessed blocks and ensures they are evenly spread over the disks. If you take into account the performance of the individual disks themselves, you could allow for effectively a hierarchical filesystem, so that one array contains, say, SSD, SAS and SATA drives, and the optimiser ensures that data is allocated to individual drives based on the frequency of access of that data and the performance of the drive. Obviously the applications or sysadmin could indicate to the array which files were more performance sensitive, so influencing the eventual location of the data as it is written.

    --
    Stealing a rhinoceros should not be attempted lightly.
  19. Re:Worked-around a Long Time Ago by Fred_A · · Score: 5, Funny

    I shouldn't need to use 12 separate disks spread across the globe just for basic reliability / redundancy

    You're trying to weasel out of paying IBM protection money !

    --

    May contain traces of nut.
    Made from the freshest electrons.
  20. Remembering an article earlier this week: by Chrisq · · Score: 3, Interesting

    Will scalable distributed storage systems like Hadoop and Google File System take over from RAID?

  21. RAID concept is fine, it's that HDs are too big by trims · · Score: 5, Interesting

    As others have mentioned, this is something that is discussed on the ZFS mailing lists frequently.

    For more info there, check out the digest for zfs-discuss@opensolaris.org

    and, in particular, check out Richard Elling's blog

    (Disclaimer: I work for Sun, but not in the ZFS group)

    The fundamental problem here isn't the RAID concept, it's that the throughput and access times of spinning rust haven't changed much in 30 years. Fundamentally, today's hard drive is no more than 100 times as fast (both in throughput and latency) as a 1980s one, while it holds well over 1 million times more.

    ZFS (and other advanced filesystems) will now do partial reconstruction of a failed drive (that is, they don't have to bit copy the entire drive, only the parts which are used), which helps. But there are still problems. ZFS's pathological case results in rebuild times of 2-3 WEEKS for a 1TB drive in a RAID-Z (similar to RAID-5). It's all due to the horribly small throughput, maximum IOPs, and latency of the hard drive.

    SSDs, on the other hand, are nowhere near the problem. They've got considerably more throughput than a hard drive, and, more importantly, THOUSANDS of times better IOPS. Frankly, more than any other reason, I expect the significant IOPS of the SSD to signal the death knell of HDs in the next decade. By 2020, expect HDs to be gone from everything, even in places where HDs still have better GB/$. The rebuild rates and maintenance of HDs simply can't compete with flash.

    Note: IOPS = I/O Per Second, or the number of read/write operations (irregardless of size) which a disk can service. HDs top out around 350, consumer SSDs do under 10,000, and high-end SSDs can do up to 100,000.
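
    A back-of-the-envelope sketch of why the pathological case is so bad. The 100 MB/s sequential rate and the 4 KiB I/O size are assumptions picked for illustration; the IOPS figures are the ones quoted above. It compares a rebuild that streams the disk with one that is limited by random I/O.

```python
def hours_sequential(capacity_bytes, throughput_bytes_per_s):
    """Rebuild time when the drive can be read front to back."""
    return capacity_bytes / throughput_bytes_per_s / 3600


def days_iops_bound(capacity_bytes, io_size_bytes, iops):
    """Rebuild time when every I/O is a small random read."""
    return capacity_bytes / io_size_bytes / iops / 86400


CAPACITY = 1e12         # 1 TB drive
SEQ_THROUGHPUT = 100e6  # assumed 100 MB/s sequential rate
IO_SIZE = 4096          # assumed 4 KiB per random I/O

print(f"HD, sequential:  {hours_sequential(CAPACITY, SEQ_THROUGHPUT):.1f} hours")
print(f"HD, 350 IOPS:    {days_iops_bound(CAPACITY, IO_SIZE, 350):.1f} days")
print(f"SSD, 10000 IOPS: {days_iops_bound(CAPACITY, IO_SIZE, 10_000) * 24:.1f} hours")
```

    With these assumptions the random-I/O rebuild on a hard drive already takes over a week; smaller I/Os or a busy array would push it further toward the figure above.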

    -Erik

    --
    There are always four sides to every story: your side, their side, the truth, and what really happened.
    1. Re:RAID concept is fine, it's that HDs are too big by SwashbucklingCowboy · · Score: 2, Insightful

      "The fundamental problem here isn't the RAID concept, is that the throughput and access times of spinning rust haven't changed much in 30 years."

      Uh, there's another bigger problem. The drive error rate (when reading data) hasn't changed that much either while data on a drive has dramatically increased.

      When doing a rebuild when you've lost all redundancy a single read error means the rebuild will fail. Increase the size of a drive (while keeping error rates constant) and you increase the likelihood of a rebuild failure.
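
      A rough sketch of that effect. The error rate of 1 unrecoverable read error per 10^14 bits is an assumed spec-sheet style figure, and the model treats every bit as an independent trial, which real drives do not strictly obey.

```python
import math


def rebuild_failure_probability(bytes_read, ure_per_bit):
    """Chance of at least one unrecoverable read error while reading
    `bytes_read` bytes: 1 - (1 - p)^bits, computed in a numerically stable way."""
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))


URE_PER_BIT = 1e-14  # assumed error rate, held constant as drives grow

for terabytes in (0.5, 1, 2, 4, 8):
    p = rebuild_failure_probability(terabytes * 1e12, URE_PER_BIT)
    print(f"{terabytes:>4} TB read during rebuild: {p:.1%} chance of a read error")
```

      The error rate never changes in the loop; only the amount of data that has to be read does.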

    2. Re:RAID concept is fine, it's that HDs are too big by lewiscr · · Score: 2, Funny

      Irregardless, I'll continue to use it.

  22. Look the solution is obvious by jayhawk88 · · Score: 5, Funny

    The cloud. Just cloud it, baby. Nothing bad ever happens in the cloud; they're so white and fluffy after all.

  23. doesn't raid 10 solve this? by davros-too · · Score: 2, Interesting

    Um, don't schemes like raid 1+0 solve the parity rebuild problem? Even in the worst case of full disk loss, only one disk needs to be rebuilt and even for a large disk that doesn't take very long. Am I missing something?

    --
    In theory, there's no difference between theory and practice; in practice there is.
  24. Re:Wrong title. Or dramatization again? by defireman · · Score: 2, Informative

    RAID 0 does not offer any redundancy. Just a performance increase from reading simultaneously from 2 drives.

  25. Re:Worked-around a Long Time Ago by plover · · Score: 4, Interesting

    Actually, storing data in a multiple data center / high availability environment is a completely related issue. The summary above talks of "entirely different paradigms." Cloud storage would be multiple data center based, which is entirely different from keeping the only copy on your local drives. In this concept, your machine would have enough OS to boot, and enough hard drive space to download the current version of whatever software you are leasing. Your personal info would always be maintained in the data centers, and only mirrored locally. Have a home failure? Drop in a new part or even a new PC, (possibly with an entirely different operating system, such as Chrome,) connect to the service, and you're 100% back.

    It's no longer a novel concept for the home market. Consider Google Docs. It's not even being sold as "safer than RAID", it's being touted as "get it from anywhere" or "share with your friends". Safer than RAID is just a bonus.

    So are we ready to move all our personal information to clouds? I certainly am not, but Google Docs are wildly popular and a lot of people are. I long ago learned that I can't look to myself to judge what the mainstream attitudes are in many things.

    --
    John
  26. RAID 4 has a dedicated parity drive, not 5 by Targon · · Score: 4, Interesting

    RAID 4 is where you have one dedicated parity drive. RAID 5 solves this by spreading the parity information for each drive to all the other drives in the array. RAID 6 adds a second parity block for increased reliability, but as a result of the increased write for that extra parity block, it slows down write speeds.
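
    A minimal sketch of single parity, for anyone who hasn't seen it spelled out. The block contents and the helper names are made up, and the rotation rule in `raid5_parity_disk` is just one simple possibility; real implementations use various layouts.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid4_parity_disk(stripe, num_disks):
    """RAID 4: parity always lives on the same (here, the last) disk."""
    return num_disks - 1


def raid5_parity_disk(stripe, num_disks):
    """RAID 5: parity rotates across the disks from stripe to stripe."""
    return stripe % num_disks


# One stripe with three data blocks and one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Lose any one data block and rebuild it from the survivors plus parity.
lost = 1
survivors = [block for i, block in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[lost])

print([raid4_parity_disk(s, 4) for s in range(4)])
print([raid5_parity_disk(s, 4) for s in range(4)])
```

    Note that the rebuild has to read every surviving block in the stripe, which is where the long rebuild times come from.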

    The real key to making RAID 4, 5, or 6 work is that you really need 4-6 drives in the array to take advantage of the design. I wouldn't say that it will fall out of favor though, because having solid protection from a single drive going bad really is critical for many businesses. Backups are all well and good for if your system crashes, but for most businesses, uptimes are more critical yet. So, backups for data so corruption problems can be rolled back, and RAID 5,6,10 for stability and to avoid having the entire system die if one drive goes bad. What takes more time, doing a data restore from a backup for when an individual application has problems, or having to restore the entire system from a backup, with the potential that the backup itself was corrupted?

    With that said, web farms and other applications that can get away with just using a cluster approach instead of a single well designed machine (or set of machines) have become popular, but there are many situations which make a system with one or more RAID arrays a better choice. The focus on RAID 0 and 1 for SMALL systems and residential setups has simply kept many people from realizing how useful a 4-drive RAID 5 setup would be.

    Then again, most people go to a backup when they screw up their system, not because of a hard drive failure. With techs upgrading hardware before they run into a hard drive failure, the need for RAID 1, 4, 5, and 6 has dropped.

    I will say this: since a RAID 5 array can rebuild on the fly (since it keeps working even if one drive fails), the rebuild time itself does not significantly impact system availability. Gone are the days when a rebuild has to be done while the system is down.

  27. Re:Ask what does Google do by Carewolf · · Score: 2, Insightful

    A search engine doesn't mind losing data, most of the storage is essentially just a cache or summary of the internet and can be regenerated. That said, Google already have so many mirrors for performance reasons that actual data loss is practically impossible.

  28. RAID6 with enterprise hardware is reliable by niola · · Score: 2, Interesting

    I use RAID6 for several high-volume machines at work. Having double parity plus a hot spare means rebuild time is no worry.

    But if you are not a fan you can always throw something together with ZFS's RAIDZ or RAIDZ2 which is also distributed parity but the ZFS filesystem checksums and keeps multiple (distributed) copies of every block to detect and fix data corruption before it becomes a bigger problem.

    People using ZFS have been able to detect silent data corruption from a faulty power supply that other solutions would never have found just because of the checksumming process.

  29. I'm not sure I get it by Joce640k · · Score: 2, Interesting

    Is he saying that you can never read a whole hard disk because it will fail before you get to the end?

    That's what it seems like he's saying but my hard disks usually last for years of continuous use, so I'm not sure it's true.

    --
    No sig today...
  30. Re:Worked-around a Long Time Ago by kickedfortrolling · · Score: 4, Funny

    Don't discourage the boy. Weaseling out of things is important to learn. It's what separates us from the animals

    --
    --AlexC
    Just because I dont agree with climate change doesnt make me a troll
  31. Re:Worked-around a Long Time Ago by 2obvious4u · · Score: 2, Insightful

    And then like AOL, Google goes out of business (shocker I know) and all your data is lost forever. The cloud is good for a lot of stuff, but for data storage it should be part of the solution, not 100% of it.

  32. Re:Worked-around a Long Time Ago by yahwotqa · · Score: 5, Funny

    Not from weasels, though...

  33. Dear Seagate, Western Digital, et. al: by ThreeGigs · · Score: 4, Interesting

    Here's what I want, folks:
    A 5.25 inch device with 5 double-sided platters running at 5400 RPM. Basically the same size as a desktop CD/DVD drive, ala Quantum Bigfoot.
    I want 8 sides of the platters dedicated to data, and the other two sides dedicated to parity (or one parity and the other servo), essentially a self-contained RAID on a single disk.
    I want all data heads to write and read simultaneously, in parallel. The idea is to have 64 byte sectors on each platter which are recombined into a 512-byte result. 8 heads writing and reading in parallel means HUGE throughput for sequential operations.

    It's RAID 5 or 6 on a single disk, although without spindle redundancy.

    And I also want a high-performance option: 2 sets of read/write heads 180 degrees apart, which effectively would cut seek times in half, making the drive perform more like a 10k RPM drive. With current densities, that's 12 TB in the volume of a DVD drive. It solves speed, sector error recovery and capacity issues. The only thing missing is a data bus that can handle the throughput.
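
    A small sketch of the arithmetic behind the 10k RPM comparison. It assumes the head sets are evenly spaced and that whichever one the sector reaches first serves the request. Strictly speaking this models rotational latency, not arm seek time.

```python
def avg_rotational_latency_ms(rpm, head_sets=1):
    """Average wait for a sector to rotate under the nearest head set."""
    ms_per_revolution = 60_000 / rpm
    # With one head set the average wait is half a revolution;
    # each extra evenly spaced head set divides that wait again.
    return ms_per_revolution / (2 * head_sets)


print(f"5400 RPM, one head set:  {avg_rotational_latency_ms(5400):.2f} ms")
print(f"5400 RPM, two head sets: {avg_rotational_latency_ms(5400, head_sets=2):.2f} ms")
print(f"10000 RPM, one head set: {avg_rotational_latency_ms(10_000):.2f} ms")
```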

    1. Re:Dear Seagate, Western Digital, et. al: by EmagGeek · · Score: 3, Insightful

      Without spindle redundancy...

      or logic element redundancy...

      or power supply redundancy...

      or cable interconnect redundancy...

      add to that the cost of adding dedicated RAID hardware to every single drive (that's an expensive PLD), and it's no wonder it's not on the market. High cost - no return.

    2. Re:Dear Seagate, Western Digital, et. al: by adisakp · · Score: 2, Interesting

      I want 8 sides of the platters dedicated to data

      More platters == more mass. Which translates to more power required for the motor, higher energy usage and much more heat generated by the drive. Generating more heat == quicker hardware failures. Also with bigger / larger / more platters, it's much harder to spin the platters faster. Usually more platters == slower RPM drive speed and much slower seek rates. If you can do fewer, smaller, and lighter platters, you can make the drive spin faster and perform better -- this is exactly what the Velociraptor does with its high RPM 2.5" format.

      Also, using only one side of the platter is often faster and more reliable because the head arm weighs less (1/2 the heads) so they don't have as much mass to impede fast seeking or to cause vibration. Plus you don't have to worry about the alignment on both sides of the platter. This is one reason why the highest speed drives do not necessarily even use both sides of the platter.

      It's RAID 5 or 6 on a single disk, although without spindle redundancy.

      No it's not... what happens if the control electronics fail, the arm actuator, or the spindle motor? RAID 5/6 have whole disk redundancy. You just have data redundancy on the platters - not full hardware redundancy. Also, all these extra components you want to add to the drive will just make it more complicated and have more points of failure so the drives will actually fail earlier.

      And I also want a high-performance option: 2 sets of read/write heads 180 degrees apart, which effectively would cut seek times in half, making the drive perform more like a 10k RPM drive.

      Except that it would slow the drive down by making calibration harder and slower. Moving the head arms causes vibration and movement in the drive. One arm would not be able to reliably read while the other was moving unless the drive was spinning slower to begin with.

      The moral of your story is that you have some interesting ideas, but believe it or not, most of them have already been tried and rejected well before coming to market because they weren't feasible or reliable or didn't actually result in performance improvements in a cost effective manner.

  34. Solved, but at a price by Anonymous Coward · · Score: 2, Insightful

    You are absolutely correct in that the mainframe world has dealt with all of the modern recovery issues. But think of the actual USE of storage these days. What used to be a colossal database is now just a bunch of home videos from my camcorder. Not only has the cost of storage dropped to nearly nothing, the threshold for using it has dropped even lower. I'm perfectly willing to commit a few megabytes every time I push the button on my digital camera. I remember college, where my mainframe disk quota was a mere 256K.

    Today's challenge is to get mainframe-class recovery without bringing back mainframe-style prices. Some of this is controlled by the way we USE data storage. And then there is all the "savings" we get from server consolidation. Everything we do to consolidate just makes storage management a bigger headache. The trick is to evolve not just the low-level, "invisible" management of storage, but the high level applications as well. If I don't truly NEED to have 10TB on a single mount point, perhaps I should have multiple volumes, distribute my storage, and find a way to be happy with twenty 500GB volumes instead. The easiest way to avoid the recovery time of a 10TB RAID set is to not build one.

    I was in mainframe IT long before RAID was commonplace. We commonly faced limits of 450MB on indexed files, because that's as much as you could get from a hard drive back in the early 1980's. Modern Oracle DBAs must be scratching their heads at all of the tablespace management options that seem so redundant when you have RAID storage. This was the pre-RAID method of storage management, in which database container files could be of any size, mounted anywhere, and utilized in all sorts of creative ways to circumvent the hardware limitations of storage in those days. Today, it represents little more than an opportunity to inadvertently bring out the worst of both worlds by setting up these two storage methodologies in conflict with each other.

  35. Crappy RAID's days are numbered by zerofoo · · Score: 2, Interesting

    Having worked with plenty of enterprise grade RAID (EMC Symmetrix, CLARiiON, and Dell SAN devices), I can say that capacity and rebuild times are not a problem for high-end arrays.

    What will bring the problem to the masses are these stupid consumer NAS boxes. It is very easy to build a 4 or 8 TB array for home use using relatively cheap hardware. Unfortunately, no home user/abuser, that I know, has the skill set to manage or protect such a large array of data.

    My most recent experience with a Western Digital sharespace was awful. Here is a box with a Gigabit NIC, and 4 - 2TB hard drives in a RAID 5 array that has transfer rates around 9MB/sec at best. Combine that pitiful performance with a rebuild/reformat time of over two days - and you know where this is going.

    Average Joes are going to put their entire lives on these things and never back them up due to the time and space cost. When a failure does occur - it will take days to perform a rebuild of the array - vastly increasing the likelihood of another failure and permanent data loss.
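
    To put a number on the time cost, a quick sketch. It assumes the array is full and that the 9MB/sec figure holds for the whole copy; the 125 MB/s line is only the theoretical ceiling of a gigabit link, for comparison.

```python
def days_to_copy(capacity_bytes, rate_bytes_per_s):
    """Time to move `capacity_bytes` at a steady transfer rate, in days."""
    return capacity_bytes / rate_bytes_per_s / 86400


DRIVES = 4
DRIVE_SIZE = 2e12                       # 2 TB each
usable = (DRIVES - 1) * DRIVE_SIZE      # RAID 5 gives up one drive's worth to parity

print(f"usable capacity: {usable / 1e12:.0f} TB")
print(f"full backup at 9 MB/s:   {days_to_copy(usable, 9e6):.1f} days")
print(f"full backup at 125 MB/s: {days_to_copy(usable, 125e6) * 24:.1f} hours")
```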

    Crappy RAID's days are numbered - good RAID implementations will be with us as long as hard drives have ANY failure rate at all.

    -ted

  36. Re:There are always more solutions... by denis-The-menace · · Score: 2, Informative

    Who says there are no errors with optical media?
    I've seen a CD with light shining through after 5 years.

    --
    Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
  37. SATA vs FC/SAS: grapes and oranges by argent · · Score: 3, Insightful

    The chart he's using goes from SCSI, to fiberchannel, to SAS... to SATA. When you go from professional/server interfaces to hobby/desktop ones, of course the rebuild time skyrockets. If you did this article a few years ago and slid ATA in as the last data point instead of fiberchannel, you'd be seeing the knee showing up then instead of now. How about looking at 2010 and doing the calculations with 6 Gb SAS interconnect and 3 Gb drives, instead of 1.5 Gb SATA and 1 Gb drives?

  38. Re:Worked-around a Long Time Ago by Hatta · · Score: 4, Insightful

    Consider Google Docs.

    If you have so much data that you're likely to encounter an error when rebuilding your RAID array, I don't think Google Docs is going to cut it.

    --
    Give me Classic Slashdot or give me death!
  39. fill the drive with helium by speedtux · · Score: 3, Interesting

    Filling the drive with helium should help; the speed of sound in helium is 3x higher than in air, and it offers less resistance.

    (Hydrogen would be even better, but it has a tendency to interact with metals in unfortunate ways.)
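
    A quick check of the 3x figure using the ideal-gas formula c = sqrt(gamma * R * T / M). The heat capacity ratios, molar masses, and the 20 C temperature are textbook values supplied for this sketch, not anything measured inside a drive.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol*K)


def speed_of_sound(gamma, molar_mass_kg_per_mol, temperature_k=293.15):
    """Ideal-gas speed of sound in m/s."""
    return math.sqrt(gamma * R * temperature_k / molar_mass_kg_per_mol)


air = speed_of_sound(1.40, 0.02897)        # diatomic mix, about 28.97 g/mol
helium = speed_of_sound(5 / 3, 0.0040026)  # monatomic, about 4.0026 g/mol

print(f"air: {air:.0f} m/s, helium: {helium:.0f} m/s, ratio: {helium / air:.2f}")
```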

    1. Re:fill the drive with helium by networkBoy · · Score: 3, Insightful

      there are these things called filters.
      They work pretty well.

      --
      whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
    2. Re:fill the drive with helium by Firethorn · · Score: 2, Informative

      don't get it, why not just seal them hermetically with helium inside, and not worry about outside air pressure?

      1. Hasn't been necessary
      2. Helium is expensive
      3. Sealing something Helium-tight is expensive, about as bad as trying to seal in hydrogen*
      4. Fairly sensitive to pressure - not a problem in a non-airtight HD, but a problem in a sealed HD that's heating up.
      5. Cooling can be an issue

      *Mostly because He tends to stay monoatomic, H pairs up into H2. End result is that the H2 molecule is around the same size as a He atom.

      --
      I don't read AC A human right
    3. Re:fill the drive with helium by the_other_chewey · · Score: 3, Informative

      Filling the drive with helium should help;

      Yeah. For about half a week. Helium has the smallest "gas particles" there are - Hydrogen atoms would
      be smaller, but those really like to bond, and an H_2 molecule is quite a bit larger than a Helium atom

      That's why He leaks out of everything. No exception. It diffuses through "leakproof" welds for vacuum tanks.
      It diffuses through the steel walls of tanks (albeit more slowly). That's also why He is used in leakage detection:
      If you see less than $not_so_few He atoms on the outside of the container you test within a couple of seconds after you injected a little bit of He, the container is considered airtight.

      The only way to keep a He atmosphere in your drive would be to constantly refill it. I don't think that there'll be any scenario where this would seem like an even remotely good idea.

    4. Re:fill the drive with helium by dkf · · Score: 2, Interesting

      Filling the drive with helium should help; the speed of sound in helium is 3x higher than in air, and it offers less resistance.

      (Hydrogen would be even better, but it has a tendency to interact with metals in unfortunate ways.)

      Thinking about it, methane might be a more practical choice. Yes, it's denser than helium so the effect won't be anything like as strong (the speed of sound in methane is only about 40% faster) but it's also very cheap and available, and won't cause too many problems from interacting with the rest of the drive. Having to seal the drive is an issue, yes, but that's not far off what's needed now; it's imperative that dust is kept out of the platter enclosure anyway...

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
  40. Re:Worked-around a Long Time Ago by tomhudson · · Score: 2, Insightful

    Faster to just copy it to a usb key. You have multiple copies of your data, and no longer have to worry about network latency, or even if there IS a network available.

  41. Re:simple idea (the bad old days) by Informative · · Score: 2, Informative

    The sealed drives we use now showed up in the '80s. Before that the platters were not part of the drive; they were in a plastic cover to keep the dust off. On the mainframes the cover held a stack of platters; on the minis there were just one or two 5 MB platters inside. We would place the whole stack with cover into the drive, then rotate the handle to pull the cover out, leaving the spindle of platters in the drive. Then just close the door and push the button to spin it up.
    In either those old open ones or the "new" sealed ones, the head flies on a cushion of air, but the distance from head to platter is microscopic; a piece of dust is big in comparison. In the old open drives, if the head hit even a tiny piece of dirt, it could "crash" into the platter and gouge out a rip. If you haven't heard it, it was actually fairly loud and startling.