Proposed Disk Array With 99.999% Availability For 4 Years, Sans Maintenance
Thorfinn.au writes with this paper from four researchers (Jehan-François Pâris, Ahmed Amer, Darrell D. E. Long, and Thomas Schwarz, S. J.), with an interesting approach to long-term, fault-tolerant storage: As the prices of magnetic storage continue to decrease, the cost of replacing failed disks becomes increasingly dominated by the cost of the service call itself. We propose to eliminate these calls by building disk arrays that contain enough spare disks to operate without any human intervention during their whole lifetime. To evaluate the feasibility of this approach, we have simulated the behaviour of two-dimensional disk arrays with N parity disks and N(N – 1)/2 data disks under realistic failure and repair assumptions. Our conclusion is that having N(N + 1)/2 spare disks is more than enough to achieve a 99.999 percent probability of not losing data over four years. We observe that the same objectives cannot be reached with RAID level 6 organizations and would require RAID stripes that could tolerate triple disk failures.
I don't see power mentioned in the paper.
So I tried to view the PDF, and it says "can't use the plugin, it causes problems on our server". So I figured I'd just download the file with wget instead. Nope, 403 forbidden.
Looks like fetch works though. If anybody else has trouble getting the file, try my local mirror.
I read the internet for the articles.
That's not long term. That's the normal life of a storage array. Long term is like 8-10 years.
because, you know...
Really, 4 year life span and they are replaced?
God I need to work for a company like that!
I am so tired of dealing with these RS/6000 systems that were made back in 1994, and these Intel systems made back in 2002.
The bottom line is that having a lot of spare disks makes a 2D array reliable over time. These 2D array configurations are quite reliable because they have many spares available to automatically replace failed disks:
Data  Parity  Spare
12    3       13
12    3       14
24    6       20
36    9       26
To understand the above table, we'll use the first row as an example. An array of 1TB disks with 12TB of data space would have 3TB of parity and 13 spare 1TB drives, for a total of 28 drives to get 12 drives' worth of net storage.
What they didn't mention is that the same reliability can be achieved with only three spares, by replacing spares at your convenience. Replacing drives can be somewhat costly if it has to be done quickly, but if you can schedule to replace the failed drive "some time in the next two months", that probably won't be costly.
selected a five-year disk array lifetime and assumed disk failures were independent events distributed according to a Poisson law with a mean time to failure (MTTF) of 100,000 hours.
100,000 hours = 273 years. Does anyone believe that?
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
I worry a lot less about losing data than I do corrupting data and not knowing it.
But hey, congratulations, you've learned about RAID mirrors with lots of copies and learned how to apply basic, well understood engineering principles to it.
Guess what, some of us were aware of this years ago, some others aware of it longer than you've probably been alive. It's been known my entire life, that's for sure, so that's at least 40 years.
Is this really a long time? 4 Years? That seems kind of short and not reliable to me.
See subject: Kids, this is why math is good for you!
"Yeah, well just put more disks in it..."
Nice idea. Only: TCO is not just based on initial spending and maintenance. There is also rackspace to consider and did I hear anyone talk about green IT?
If my day to day considerations were that one dimensional, my employer could save a ton of money on my salary.
Well, duh. RAID6 is not a serious level of redundancy. ZFS RAIDZ-3 (triple parity) FTW. And you can build in as many hot spares as you want. Dinosaurs who have still not adopted ZFS need to get a clue.
Yes, but then you're dancing around the possibility of additional disk failures while waiting on that replacement.
If you pop a few more drives (which, if you got your disks in lots is QUITE possible), you're in deep shit.
Chas - The one, the only.
THANK GOD!!!
more like 11.4 years
In academia, everything is simple and independent. I'm sure it's fun to calculate theoretical parity requirements for quintuple disk failures. ...but it's useless.
In the real world, if you have five disks simultaneously fail in an array, there was a common cause. The next step is to restore from backup because every drive in that array is now suspect. Whatever knocked out five disks probably did a number on the rest, and it would be reckless to assume they are unaffected, even if they have a clean SMART report. You're well past the point of caring about parity if an array gets crushed like that.
TL;DR version:
Replacing disks sucks sometimes. Sticking in additional spares means you don't have to replace them. They calculated an efficient RAID solution that means you don't need as many spares.
Well, I might have a way, but it only works on a semi spherical planet in a vacuum.
I'm not impressed with this at all. My last two hard drives lasted 8 or 9 years each, with no motherboard failures or controller failures, or anything. But everything mentioned in this story indicates a much bigger investment, only to get a little more security for 4 years? No thanks!
I haven't read the paper, but.
If N=4 then spares = 10.
If N=6 then spares = 21.
Seems like overkill to get 5 9s (5.26 minutes per year)
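As a quick check of the numbers above, here is a minimal sketch that applies the formulas quoted in the summary (N parity disks, N(N - 1)/2 data disks, N(N + 1)/2 spares). The function name `array_layout` is made up for illustration and is not from the paper.

```python
def array_layout(n):
    """Disk counts for an array with n parity disks, using the summary's formulas."""
    parity = n
    data = n * (n - 1) // 2      # N(N - 1)/2 data disks
    spares = n * (n + 1) // 2    # N(N + 1)/2 spare disks
    return {
        "parity": parity,
        "data": data,
        "spares": spares,
        "total": parity + data + spares,
    }

# Print the layout for the two values of N mentioned in the comment.
for n in (4, 6):
    print(n, array_layout(n))
```

With these formulas the spare count always equals parity plus data, so the spares make up half of the installed disks regardless of N.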
Yeah, and what are you going to do when 9 out of 10 of the disks all go bad, because they came from the same factory run and exhibit the same issue? This is what we usually experience: when a disk fails, most of the time it's a subcomponent issue shared by all of the disks from that and any concurrent factory runs, and we have to swap them ALL out. I guess you just throw the whole array out ... :-(
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh
Just a few things I thought of while looking at this study:
The authors are using Backblaze data. Backblaze uses consumer grade SATA disk which isn't going to be as reliable as the Enterprise SATA/SAS disk we would use.
I'm willing to bet that none of the authors of this paper have ever had to pay for colocated rack space, power, and cooling either, they've just doubled the RU that I need for storage. At $1500.00 - $2000.00 per rack that adds up.
Doubling the rack space for storage I need so I can avoid a few service calls by my storage vendor over 5 years simply isn't efficient.
We've installed close to 500TB of archival storage using commodity hardware and 2-3TB Nearline SAS. We have maybe 3 hands-and-eyes calls per year for disk replacement.
Anyway - just rambling.
To last all of 4 years, and need nearly as many hot spares as data drives? I guess the academics think they know something yet again. They took some dubious failure rates (Backblaze uses whatever is the cheapest consumer drive at the time and eventually stops buying the really bad ones; Seagate 1.5TB and 3TB, looking at you) and a rather optimistic transfer rate (200 MB/s) that assumes all sequential reads. They failed to account for the backplane, controller, and power, assuming that those never fail. By their numbers you might as well run mirrored RAID 5 or 6 with enough hot spares to make it between regularly scheduled tech visits. That gives you the ability to split chassis and controllers along mirror lines. As for rebuilds, we have better methods: predictive failure works well, SSDs make great caches while rebuilding, and so on. We also have less centralized options with distributed technologies that potentially scale better.
5 9's is not that hard of an objective when talking about raid sets, the tools have been there for decades. Sure you will never reliably reach it with a single path to anything, 5 minutes is not enough time for even a staffed site to remedy any outage.
No sir I dont like it.
One of my SATA hard drives has been running Windows XP for five years. I work with legacy programs for MS-DOS, Windows 3.1 and Windows 9x. Don't laugh. The original disk went bad. It developed several bad sectors. The MFT became damaged and some of my programs didn't run properly. S.M.A.R.T. complained about bad sectors too. Smartdisk couldn't fix the bad sectors. I ended up having to swap out the 80 GB drive for a 320 GB drive. Always keep a backup of your files.
My understanding is that disks often fail when a head touches the surface, or a piece of dirt gets between the head and the surface. Once that happens, more dirt is produced, increasing the probability of more head crashes, leading to a failure cascade. As a consequence, once one of my drives starts to show unrecoverable errors, corresponding to damaged surface areas, I replace it while it can still be read.
The spare platter strategy does nothing to reduce this failure mode. In fact, all modern disks already have spare space for bad block relocation.
Also consider the cost of a RAID card (or cards) with that many ports. You might even need dual CPUs just to get more PCIe lanes: say x4 to x8 for each RAID card, plus x4 for each 10GigE card, or about x8 for 2-4 port cards.
I don't trust anybody who has published a document with the title "C:\Users\Jehan-Francois Paris\Documents\ADAPT15\Case3.doc." Not even in .docx format. Tsk tsk.
The goal is to realize that for manufacturers, service calls are expensive. Perhaps a company has a 4 hour response time - if a disk fails, the company is still running with redundancy, but they're wanting that drive replaced pronto, which is easily $500+ per incident (need to have spares on hand, drop ship extras if a tech runs low, need to station techs around, maybe even need to fly a tech in).
So the point is that building in an extra 13 spare 1TB drives (which probably cost under $50 each in bulk) comes to $650, or the cost of just over one service call.
If enough drives have to be replaced then the tech can change a whole pile of them at once, which is still cheaper than sending people out for individual drive failures.
The goal is basically to have no service calls over the service life - then maybe refresh it periodically at one's convenience by replacing all the failed drives in one go.
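The break-even argument above is simple arithmetic. A small sketch, using only the commenter's own rough figures ($50 per drive, $500 per service call, 13 spares); the function name is hypothetical:

```python
def spare_cost_in_service_calls(spares, price_per_drive, cost_per_call):
    """Express the up-front cost of built-in spares as a number of service calls."""
    return spares * price_per_drive / cost_per_call

# Commenter's figures: 13 spares at about $50 each versus about $500 per call.
calls = spare_cost_in_service_calls(spares=13, price_per_drive=50, cost_per_call=500)
print(f"13 spares cost about as much as {calls:.1f} service calls")
```

This ignores the rack space, power, and cooling that other comments raise, so it is a lower bound on the cost of the extra spares, not a full TCO comparison.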
What about other causes of failure, especially ones that impact every single disk you own?
Human errors and manufacturing defects. What do you think happens if all your 15K rpm drives were incorrectly manufactured with bearings designed for 7200 rpm drives? I had a BIG customer where that happened. It took several years before the disk manufacturer was forced to fess up because of the crazy failure rate.
And did I actually read some posts all but saying with enough redundancy there'd be no need for backups? Umm, wrong.
If you read the article, that is exactly what they suggest. If failure rates are too far above predicted, they say to replace with new array. At least they are upfront about it.
A service call? Seriously? A sysadmin (or operator if it's a big place) can't see the yellow light on a disk and replace the pack with in-house spares? Have we become so inept as an IT community that we can no longer do a walk-through of our machine room and service simple things like this? Maybe we do deserve to be outsourced.
And if one must have a service contract such that only the vendor can touch the hardware, (why would you do that? never mind) wouldn't you negotiate a provision that includes drive replacement (as drives are consumables that must eventually be replaced) without being charged for an "office visit"?
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
We do just that: when it gets down to 1 hot spare, it's an emergency service call and we replace all the failed units. This does not happen very often, and when it does it tends to be just that: a bad batch.
"to achieve a 99.999 percent probability of not losing data over four years."
I think the summary writer doesn't understand the difference.
SIGN ME UP!
"Yes, but then you're dancing around the possibility of additional disk failures while waiting on that replacement."
Even if the mean time between failures for consumer drives was 6 months, the odds of 'popping' two more spares in the month after the first failure would be less than 3%. If the MTBF is 1 year the probability drops to 0.7%.
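The comment does not say how the 3% and 0.7% were derived. One reading that reproduces them is squaring the chance of a single failure in the month (1/6 and 1/12). The sketch below shows that back-of-envelope figure next to a Poisson estimate of "two or more failures in the window"; both assume independent failures at a constant rate, and both function names are made up for illustration.

```python
import math

def naive_two_failures(mtbf_months, window_months=1.0):
    """Back-of-envelope: square the chance of one failure in the window."""
    p_one = window_months / mtbf_months
    return p_one * p_one

def poisson_at_least_two(mtbf_months, window_months=1.0):
    """P(two or more failures in the window) for a Poisson process."""
    lam = window_months / mtbf_months  # expected failures in the window
    return 1.0 - math.exp(-lam) * (1.0 + lam)

for mtbf in (6, 12):
    print(mtbf, naive_two_failures(mtbf), poisson_at_least_two(mtbf))
```

Neither estimate accounts for correlated failures such as a bad batch, which is the objection raised elsewhere in the thread.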
If you need 99.999% reliability, I think you should consider more options. Keep your data in two separate locations. I don't know that power and internet are 99.999% available in a single location.
For that kind of reliability, they should consider diversifying the types of disks in the system. Disk failure is not a purely independent random event. The kind of power, vibration or magnetic-field glitch that could knock out one drive, would likely knock out many drives.
SAN's are great, until a whole run of disks die at the same time, or a fiber cut knocks out access to your data.
I think the major point of the paper is that pre-allocating some disks, as swap-in ready, might be cheaper than a service call. That is a change of mindset. Everything else about this is a distraction. Yes, maybe in the future they would power off those extra disks until needed, to keep the green people happy etc...
Alright, fine, ashift=12 is newer than 2009, for 2TB+ drives. And always use /dev/disk/by-id for your sanity.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
"Even if the mean time between failures for consumer drives was 6 months, the odds of 'popping' two more spares in the month after the first failure would be less than 3%. If the MTBF is 1 year the probability drops to 0.7%."
Except if you got a bad batch where some kind of material or production defect will cause many disks to fail near simultaneously. The overall MTBF might be true for all the disks they produce, but unless you make a real effort to source them from different batches over time you can't assume that's going to be your MTBF.
Live today, because you never know what tomorrow brings
Instead of keeping the spares inside as just that — spares — can it not start using all of them (in a sufficiently redundant configuration) and gradually lose capacity as physical disks fail?
Yes, it would require coordination with the driver and filesystem, but there is nothing insurmountable in that...
In Soviet Washington the swamp drains you.
One of the authors is a Catholic priest. He probably blessed the drives first.
"We observe that the same objectives cannot be reached with RAID level 6 organizations and would require RAID stripes that could tolerate triple disk failures."
That's true only if you assume that three disk failures occur faster than a single disk can be rebuilt.
If you assume no more than two disk failures *during the length of time it takes to rebuild the array* then RAID 5 or RAID 6 works fine as long as you assign enough hot spares.
>. service calls are expensive. Perhaps a company has a 4 hour response time -
Service calls are expensive BECAUSE it's an emergency. If you have four spares, plus the two parity drives, you're still six drives away from a problem. With a few spares, you can easily replace one by sending it UPS ground, rather than having a tech run out there immediately.
Indeed -- remember the experiment posted on Slashdot a year or so ago where they measured the MTBF across drives purchased in batches and outside batches? Failures tended to cascade within the batch; other batches would cascade at different times.
So that entire cluster is likely to fail catastrophically unless you're swapping in drives from new batches from time to time -- at which point it should last MUCH longer than 4 years with data integrity. Bonus points if your array can handle size boosts over time (swapping in larger disks).
Call me when there's a dick array with 99.999% availability
The number of drives seems to be large. The spare count grows quadratically, so as the cluster gets bigger the number of spare disks gets much bigger.
Drives  Spares  Total
5       15      20
10      55      65
30      465     495
That's a lot of disks. There is a point where space and power outweigh the human cost.
Run 2 drives each holding a copy of the same data. The probability of BOTH failing at the same time is clearly low, and easily calculated. Now when one fails, the other only has to operate perfectly for the period required to copy its contents to a new drive- a probability again easily calculated.
So long as the failure mode isn't something like a fire or a catastrophic power surge killing both drives (which obviously should not be sharing the same local PSU anyway), the two drive setup gives extraordinary robustness against failure- and actually illustrates how direct mirroring is the ONLY sane form of better data protection.
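The "easily calculated" probability for the mirror case can be sketched as follows, assuming independent failures with an exponential lifetime. The 100,000-hour MTTF is the figure quoted from the paper earlier in the thread; the 24-hour copy window is a hypothetical value chosen only for illustration.

```python
import math

def p_fail_within(hours, mttf_hours):
    """Probability that a single drive fails within `hours` (exponential model)."""
    return 1.0 - math.exp(-hours / mttf_hours)

MTTF_HOURS = 100_000  # per-drive MTTF quoted from the paper
COPY_HOURS = 24       # hypothetical time to copy the survivor to a new drive

# Chance that the surviving mirror also fails before the copy completes.
p_survivor_dies = p_fail_within(COPY_HOURS, MTTF_HOURS)
print(f"{p_survivor_dies:.6f}")
```

The independence assumption is exactly what the shared-PSU and fire caveats above call into question, so treat the result as a best case.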
The reason such trivial logic is not applied is because it is TOO SIMPLE. People want to sell you complex solutions, and use pseudo-science to justify such solutions, for reasons of commerce. The REAL issue of the HDD is that people are encouraged to use it past a clear point of failure- and I mean when platter surface break-up and mechanical problems are generating increasing numbers of sector write-fails. In my TWO drive situation, if both drives have hidden faults on the same data that has gone un-noticed, clearly that data is lost. The trick is KNOWING that a HDD has terminally failed long before most industry 'tests' would suggest this to be the case.
And the problem with DORMANT data is the biggest issue. If you haven't checked a file in a time, how do you know it is still 'good'? Of course you can simply change mirror-2 to mirror-n (which as any person versed in back-up theory will tell you is like having n full back-ups made across n days, where n represents an appropriate level of paranoia).
Most data is lost because there is NO BACKUP at all. Then most backed-up data is lost because of a system failure, like all the back-ups burning in the same fire that destroys the main computer, or all the back-ups being malformed (but not noticed, because up to that point, no backup was used for data restoration).
Long term archival 'back-ups' are a different issue suffering from various forms of degradation- usually cost and care related (look at old Hollywood films that still have perfect copies today, and films from the 80s that only have dreadful grainy copies available).
Automated data-loss prevention systems tend to be jokes because no-one wants to pay 3+ times the cost of storage. If someone will pay for data replication costs (and no data is easier to replicate than data on a HDD), then mirror methods make the likelihood of data loss as good as zero.
Of course, in practise, VERY real-time replication means things like horrible RAID controller chips, and these don't work well with modern OSes and HDDs. Less real-time replication, as seen in Google's de-facto cloud, for instance, is probably as good as a non-fuss automated system gets. But if a company has key files being updated all the time in real-time, NOTHING can prevent localised data loss situations.
They 'invented' RAIDZ3? Or they are perhaps using ZFS or something similar internally and not telling anyone (like so many in the industry). Sure you can achieve very high reliability using ZFS but most systems maintain those 9's by a) having hot spares and b) replacing disks that failed in a timely manner. They are simply adding more hot spares so a service call is less important, you can just go by and replace 5 disks at a time whenever you need to expand your storage.
They also forgot to mention that once disks start failing, you could easily have a whole set of them fail. Especially with firmware issues or if someone dropped an entire box in shipping. Once you drop below 2 hotspares/10 disks, you are in serious risk of degrading your system because disks could fail while rebuilding as well.
Calculating the system MTBF of 77 drives at 100,000 hours each as a subsystem, we'd expect to have a drive failure approximately every 1300 hours. That's not the reality of most observations/environments, but it's enough to justify having at least a couple of spares on hand, and it's why we have things like RAID 6 and ZFS. It also doesn't necessitate having tons of spares onsite.
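The arithmetic behind that estimate, as a small sketch. It assumes independent drives with a constant failure rate, so the array-wide interval is just the per-drive MTTF divided by the drive count; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def array_failure_interval(drive_mttf_hours, drive_count):
    """Expected hours between drive failures across the whole array."""
    return drive_mttf_hours / drive_count

mttf_years = 100_000 / HOURS_PER_YEAR            # per-drive MTTF in years
interval = array_failure_interval(100_000, 77)   # hours between failures
failures_in_4_years = 4 * HOURS_PER_YEAR / interval

print(f"per-drive MTTF: {mttf_years:.1f} years")
print(f"one failure roughly every {interval:.0f} hours")
print(f"expected failures over 4 years: {failures_in_4_years:.0f}")
```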
Harrison's Postulate - "For every action there is an equal and opposite criticism"
Their spinning disk SANs don't use individual drives but a Datapac with many drives inside. The array can take down individual disks inside and remanufacture them in situ by doing low-level formatting per drive, or down to per platter and platter side if needed. The only time you need to swap disks is when you've suffered enough internal issues that they can't be corrected with spares or reconditioning.
The downside is that you're now replacing an entire pac of something like 15 drives, and if you're pac bound through VMware RDMs, then you're looking at downtime to relocate the data and detach the connections. Other than that, it's actually a really neat system.
That's why, as the manufacturer of such a system, you refuse to sell it bare. Your customers won't complain if you tell them what the bare cost, cost per disk, and labor cost to install a disk are, and sell disks at cost and with reasonable labor. Make money on your hardware, bring in enough to pay for assembly based on disk install labor.
That's only step one, though. Start ordering disks when you start your first production run of hardware. Order direct from manufacturer, and from as many suppliers as possible, so you get disks from as many batches as possible. Then, continue placing frequent, but small, orders from whoever can get you the disks the cheapest; it may work out that you can get volume pricing from the manufacturer by telling them "I'm going to need X disks over all and am willing to pay for them up front, but I need them shipped (X/52) per week from current stock at the time of shipping, don't set aside my disks out of the current batch to ship at a future date".
It's a bit more labor, but compare serial numbers and attempt to color code by batch. Use colored dot stickers for this. When fetching drives for an installation, try and get an even distribution of colors, so you don't have an excess of drives from any given batch, and always record who has which drives, so if you start getting failure reports that indicate a bad batch, you can proactively alert the customers who have those drives that it might be a good idea to have you swap them even if they still appear to be functioning.
All of that drives up the cost, of course. I'm not going to sit here and do the math to figure out what the cost would be, as there are simply too many assumptions and I have too little time, but if you've nothing better to do and don't mind making a couple dozen, likely provably wrong, assumptions, you can have at it.
If service call costs for one or two disks are prohibitive, simply put in enough spares so you only have to roll a tech for, say, 10 drives.
Alternatively, make them user-swappable. If all the customer has to do is ask their tech to yank drives with a Blinky Amber Light of Doom, even the most untrained monkey could figure that out.
Everything works great UNTIL:
A. You discover that the drives have a firmware bug that causes silent data corruption.
B. You discover that the drives have a firmware bug that causes them to drop out of the array.
C. You end up with one or more drives that fail in a "unique" way that hangs the bus they're on, making multiple other drives drop out too.
D. You get a bad batch of drives, since you bought them all at the same time from the same supplier instead of adding more over time to increase capacity and/or replacing the failed ones over time with new ones from very different batches.
E. You realize that controllers, fans, and sometimes even cables and backplanes can die over time, especially in certain countries where air pollution (like sulfur) is a problem.
F. Discover that tin whiskers grow on some lead-free component connections and fry them over time.
G. Your datacenter cooling fails due to a breaker failing in the control room that causes the controllers to lose power (despite your multiple redundancy EVERYWHERE else) and your drives get heat stressed and fail at astronomically higher than "normal" rates (true story, actually most of these are).
H. During backup generator testing, someone screws up and your array loses power during heavy activity, and you get a surge when it comes back on.
I. Natural disaster. 'nough said.
Need I continue? 99.8% is realistic, 99.99% is doable with a lot of extra effort and expense. 99.999% is usually total, absolute and utter BS unless you have a fully separate datacenter in another region to fail over to, with synchronous data replication, or you're just lucky (which is nothing even resembling a "guarantee"). That's less than 5.3 MINUTES PER YEAR of downtime, or a total over 4 years of about 21 minutes. Usually, one single unplanned service event and you've already blown it.
That's before you waste all the capital, power, and cooling to fully expand the array on day 1 rather than thin provision and add over time and replace failed disks as you go (also keeping in mind that storage acquisition costs go DOWN over time).
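For reference, the downtime budgets behind the availability levels in this comment can be worked out with a short sketch. It assumes a 365-day year and reads the percentage as uptime, which is the commenter's interpretation; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability, years=1):
    """Allowed downtime in minutes for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR * years

for availability in (0.998, 0.9999, 0.99999):
    per_year = downtime_minutes(availability)
    over_four = downtime_minutes(availability, years=4)
    print(f"{availability}: {per_year:.2f} min/year, {over_four:.2f} min over 4 years")
```

At five nines this works out to roughly 5.26 minutes per year and about 21 minutes over four years.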
Did they think a formula for the no. of spares would impress
The formula above reduces to: spare disks required = parity disks plus data disks, i.e. 50% of the whole array.
They have stronger magnets because they need to write that data more harder than normal drives.
You could use ZFS with RAIDZ3 and multiple spares.
Those who believe in this type of risk assessment should read the Rasmussen (no relation to the pollster) report on the risks associated with commercial nuclear reactors. In real life as opposed to just the product of probabilities a single event can cause a chain of "improbable" failures. For example a short in a cable tray can wipe out a whole data array.
Disk Array, "Sans" maintenance.
Spock, if that array isn't rebuilt in two hours, get that rack out of there and back to a Service Bay.
I bought this house and you know I'm boss
Ain't no h'aint gonna run me off
It also assumes drives fail in the normal way. However, modern drives do not always fail normally. They develop slow spots and timeouts from which they might recover.
Also, the software that creates the redundancy might fail, or it might fail if you do not update the firmware.
And I am not even talking about catastrophic failure. When a drive overheats you might want to remove it from the datacenter.
How about having like 10 additional spare discs in your rack, and calling the service for replacement when 10 discs died? The cost of the service call does not matter much when it is for many discs at once.
Some company was doing this in the Bay area in 2000.
Hotplug is expensive. Cases are expensive. Making room for human access is expensive.
Design for nothing but airflow and drive density, keeping pieces as absolutely cheap as possible. Gigabit instead of 10G.
At exabyte scale, why do you care about the loss of 4TB? Using Super Micro boxes w/4TB Drives, you can have over 6 petabytes of raw storage in a 72u rack / cabinet
Metadata servers keep track of where the copies of blocks are.
Put copies of the blocks on completely disparate systems. If there is heavy read usage of a block, make more copies.
Head servers scale and have some beef to them. They are all about getting info from the commodity stuff and packaging it for (subscribers, clients, whatever).
If a drive dies or has issues - mark it bad and leave it at that. Ignore it.
If a server dies, mark it as bad. Leave it.
In 4 years you are forklifting the equipment and replacing it with new storage.
There is no "RAID", other than there are multiple copies of blocks throughout the system.
I met with a company in the bay area doing this in 2000 (I don't remember which one). It was dealing with Filesystems and not block, but with NFS, VMDKs, VHD, etc, who cares. I don't see anything new here at all.
I used the wrong Supermicro box to make my point - I selected the pure storage, vs server with storage.
So 72 drives instead of 90 per 4U. 5.5 PB per 72U instead of "over 6".
The rest of my points stand.
I'll believe others when I see the uptime....
I am the unwilling control for my Origin.
This has happened repeatedly. The most notorious example is the "IBM Deskstar", which failed en masse after consistent amounts of use. They destroyed RAID arrays around the world because the individual drives could not be replaced fast enough to secure the data before multiple drives went offline simultaneously.
They have N parity disks, and then roughly N(N-1)/2 data disks and roughly the same number of spares.
In larger arrays the overall overhead of the parity and spare disks is slightly under 50%, or roughly equivalent to RAID-1, but more reliable since the spares can be reassigned as needed.
The solution for this is checksums and parity on the disk contents at the filesystem level. Read a block off the disk and check the stored checksum against what you read...if it doesn't match then use the parity information to correct the data and store it somewhere else.
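A toy sketch of that read path, assuming a single XOR parity block per stripe and CRC32 as a stand-in checksum (real filesystems use stronger checksums and more elaborate layouts). All names here are hypothetical and do not correspond to any particular filesystem's API.

```python
import zlib
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def write_stripe(data_blocks):
    """Return (blocks, checksums, parity) as they would be stored."""
    checksums = [zlib.crc32(block) for block in data_blocks]
    parity = xor_blocks(data_blocks)
    return list(data_blocks), checksums, parity

def read_block(index, blocks, checksums, parity):
    """Read one block, repairing it from parity if its checksum mismatches."""
    block = blocks[index]
    if zlib.crc32(block) == checksums[index]:
        return block
    # Rebuild the bad block by XOR-ing the other data blocks with the parity.
    others = [b for i, b in enumerate(blocks) if i != index]
    repaired = xor_blocks(others + [parity])
    if zlib.crc32(repaired) != checksums[index]:
        raise IOError(f"block {index} is unrecoverable")
    blocks[index] = repaired  # a real filesystem would rewrite it elsewhere
    return repaired

blocks, sums, parity = write_stripe([b"AAAA", b"BBBB", b"CCCC"])
blocks[1] = b"BxBB"  # simulate silent corruption on disk
print(read_block(1, blocks, sums, parity))
```

Single XOR parity can only repair one bad block per stripe; tolerating more simultaneous errors needs additional parity, which is the motivation for the triple-parity schemes mentioned elsewhere in the thread.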
They do exactly that, with a replacement to be scheduled "some time in the next 4 years".
If you run raid 66 (a raid 6 array of raid 6 arrays) then you get that much more protection.
Not that RAID 6 is anywhere near good enough since 2TB drives came along. There's around a 10% chance that you'll lose your remaining spare during a parity rebuild from a drive loss on a 12+2 disk array, and a 1% chance that you'll lose another drive recovering from that (I've seen it happen).
This is one of the reasons for considering ZFS raidZ3. One of the other reasons is that because it uses SSD buffering and caching, drive seek activity is smoothed out and heavy head seek is one of the prime life shorteners in mechanical hard drives (I've had identical array hardware using the same batches of drives and the ones which get hit hardest for random IO are the ones where drives fail more often.)