Proposed Disk Array With 99.999% Availability For 4 Years, Sans Maintenance
Thorfinn.au writes with this paper from four researchers (Jehan-François Pâris, Ahmed Amer, Darrell D. E. Long, and Thomas Schwarz, S. J.), with an interesting approach to long-term, fault-tolerant storage: As the prices of magnetic storage continue to decrease, the cost of replacing failed disks becomes increasingly dominated by the cost of the service call itself. We propose to eliminate these calls by building disk arrays that contain enough spare disks to operate without any human intervention during their whole lifetime. To evaluate the feasibility of this approach, we have simulated the behaviour of two-dimensional disk arrays with N parity disks and N(N – 1)/2 data disks under realistic failure and repair assumptions. Our conclusion is that having N(N + 1)/2 spare disks is more than enough to achieve a 99.999 percent probability of not losing data over four years. We observe that the same objectives cannot be reached with RAID level 6 organizations and would require RAID stripes that could tolerate triple disk failures.
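For a sense of scale, the disk counts quoted in the abstract can be tabulated for a few values of N. This is only a sketch of the arithmetic in the summary, not anything taken from the paper's simulator; the function name `array_sizes` and the sample values of N are made up for illustration.

```python
def array_sizes(n):
    """Disk counts for the layout quoted in the abstract.

    N parity disks, N(N-1)/2 data disks, and N(N+1)/2 spares
    (the spare count the abstract calls "more than enough").
    """
    parity = n
    data = n * (n - 1) // 2
    spares = n * (n + 1) // 2
    return {
        "parity": parity,
        "data": data,
        "spares": spares,
        "total": parity + data + spares,
    }

# Sample values of N chosen arbitrarily for illustration.
for n in (4, 6, 8):
    print(n, array_sizes(n))
```

Since N + N(N - 1)/2 = N(N + 1)/2, that spare pool is exactly as large as the set of parity and data disks combined, so the quoted bound amounts to doubling the drive count.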
I don't see power mentioned in the paper.
So I tried to view the PDF, and it says "can't use the plugin, it causes problems on our server". So I figured I'd just download the file with wget instead. Nope, 403 forbidden.
Looks like fetch works though. If anybody else has trouble getting the file, try my local mirror.
I read the internet for the articles.
That's not long term. That's the normal life of a storage array. Long term is like 8-10 years.
The bottom line is, having a lot of spare disks for a 2D array makes it reliable over time. These configurations of 2D arrays are quite reliable over time because they have many spares available to automatically replace failed disks:
Data  Parity  Spare
12    3       13
12    3       14
24    6       20
36    9       26
To understand the above table, we'll use the first row as an example. An array of 1TB disks with 12TB of data space would have 3TB of parity and 13 spare 1TB drives, for a total of 28 drives to get 12 drives' worth of net storage.
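The same sum can be run for every row. A quick sketch, using only the numbers in the table above; the helper name `summarize` is just for illustration.

```python
# (data, parity, spare) drive counts copied from the table above.
ROWS = [(12, 3, 13), (12, 3, 14), (24, 6, 20), (36, 9, 26)]

def summarize(data, parity, spare):
    """Return total drive count and the fraction of drives holding data."""
    total = data + parity + spare
    return total, data / total

for row in ROWS:
    total, usable = summarize(*row)
    print(f"{row}: {total} drives, {usable:.0%} of them hold data")
```

For the first row this gives 28 drives, of which 12 (a bit under half) are net storage.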
What they didn't mention is that the same reliability can be achieved with only three spares, by replacing spares at your convenience. Replacing drives can be somewhat costly if it has to be done quickly, but if you can schedule to replace the failed drive "some time in the next two months", that probably won't be costly.
100,000 hours = 273 years. Does anyone believe that?
Everyone except you apparently.
When the foot seeks the place of the head, the line is crossed. Know your place. Keep your place. Be a shoe.
"Yeah, well just put more disks in it..."
Nice idea. Only: TCO is not just based on initial spending and maintenance. There is also rackspace to consider and did I hear anyone talk about green IT?
If my day to day considerations were that one dimensional, my employer could save a ton of money on my salary.
100,000 hours is 4,167 days which is ~11.4 years. That sounds pretty reasonable to me, since I've run plenty of disks for over a decade.
I read the internet for the articles.
Well, duh. RAID6 is not a serious level of redundancy. ZFS RAIDZ-3 (triple parity) FTW. And you can build in as many hot spares as you want. Dinosaurs who have still not adopted ZFS need to get a clue.
Check your math. 100,000 hours / 24 = 4166.6~ days
4166.6666~ days / 365 = 11.4 years
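For anyone who wants to check it without a calculator, here is a small sketch of the conversion in both directions. It assumes a flat 365-day year, same as the division above; the function names are made up.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: flat 365-day year, leap years ignored

def hours_to_years(hours):
    """Convert a number of hours to years."""
    return hours / HOURS_PER_DAY / DAYS_PER_YEAR

def years_to_hours(years):
    """Convert a number of years to hours."""
    return years * DAYS_PER_YEAR * HOURS_PER_DAY

print(round(hours_to_years(100_000), 1))  # 11.4
print(years_to_hours(273))                # 2391480
```

So 100,000 hours is about 11.4 years, and 273 years would be roughly 2.4 million hours.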
Silence is a state of mime.
Yeah, we get it. You like to deal with cutting-edge stuff. Now get off my lawn.
Sent from my Commodore 64.
Yeah, and what are you going to do when 9 out of 10 of the disks all go bad because they came from the same factory run and exhibit the same issue? This is what we usually experience: when a disk fails, most of the time it's a subcomponent issue shared by all of the disks from that and any concurrent factory runs, and we have to swap them ALL out. I guess you just throw the whole array out ... :-(
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh
http://www.dailywritingtips.co...
Just a few things I thought of while looking at this study:
The authors are using Backblaze data. Backblaze uses consumer-grade SATA disks, which aren't going to be as reliable as the enterprise SATA/SAS disks we would use.
I'm willing to bet that none of the authors of this paper have ever had to pay for colocated rack space, power, and cooling either; they've just doubled the RU that I need for storage. At $1500.00 - $2000.00 per rack that adds up.
Doubling the rack space I need for storage so I can avoid a few service calls by my storage vendor over 5 years simply isn't efficient.
We've installed close to 500TB of archival storage using commodity hardware and 2-3TB Nearline SAS. We have maybe 3 hands-and-eyes calls per year for disk replacement.
Anyway - just rambling.
Basically as the disk size grows you are talking about N-squared spares. I think most businesses are going to be more than happy with just hot-swapping out failed disks as needed.
My understanding is that disks often fail when a head touches the surface, or a piece of dirt gets between the head and the surface. Once that happens, more dirt is produced, increasing the probability of more head crashes, leading to a failure cascade. As a consequence, once one of my drives starts to show unrecoverable errors, corresponding to damaged surface areas, I replace it while it can still be read.
The spare platter strategy does nothing to reduce this failure mode. In fact, all modern disks already have spare space for bad block relocation.
I don't trust anybody who has published a document with the title "C:\Users\Jehan-Francois Paris\Documents\ADAPT15\Case3.doc." Not even in .docx format. Tsk tsk.
A service call? Seriously? A sysadmin (or operator if it's a big place) can't see the yellow light on a disk and replace the pack with in-house spares? Have we become so inept as an IT community that we can no longer do a walk-through of our machine room and service simple things like this? Maybe we do deserve to be outsourced.
And if one must have a service contract such that only the vendor can touch the hardware (why would you do that? never mind), wouldn't you negotiate a provision that includes drive replacement (as drives are consumables that must eventually be replaced) without being charged for an "office visit"?
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Do you have any idea how many butterflies it took to reply to your message?
Now get off my lawn!
er, last time I checked, 100,000 hours is 11 years.
273 years is 2,400,000 hours. Did you lose the use of your calculator?
"Cats like plain crisps"
That is one of the greatest subtle Wrath of Khan references I've seen yet.
Spock: "Admiral, if we go by the book, like Lieutenant Saavik, hours would seem like days."
Masterful!