Proposed Disk Array With 99.999% Availability For 4 Years, Sans Maintenance
Thorfinn.au writes with this paper from four researchers (Jehan-François Pâris, Ahmed Amer, Darrell D. E. Long, and Thomas Schwarz, S. J.), with an interesting approach to long-term, fault-tolerant storage: As the prices of magnetic storage continue to decrease, the cost of replacing failed disks becomes increasingly dominated by the cost of the service call itself. We propose to eliminate these calls by building disk arrays that contain enough spare disks to operate without any human intervention during their whole lifetime. To evaluate the feasibility of this approach, we have simulated the behaviour of two-dimensional disk arrays with N parity disks and N(N – 1)/2 data disks under realistic failure and repair assumptions. Our conclusion is that having N(N + 1)/2 spare disks is more than enough to achieve a 99.999 percent probability of not losing data over four years. We observe that the same objectives cannot be reached with RAID level 6 organizations and would require RAID stripes that could tolerate triple disk failures.
I don't see power mentioned in the paper.
So I tried to view the PDF, and it says "can't use the plugin, it causes problems on our server". So I figured I'd just download the file with wget instead. Nope, 403 forbidden.
Looks like fetch works though. If anybody else has trouble getting the file, try my local mirror.
I read the internet for the articles.
That's not long term. That's the normal life of a storage array. Long term is like 8-10 years.
Really, 4 year life span and they are replaced?
God I need to work for a company like that!
I am so tired of dealing with these RS/6000 systems that were made back in 1994, and these intel systems made back in 2002.
The bottom line is, having a lot of spare disks for a 2D array makes it reliable over time. These configurations of 2D arrays are quite reliable over time because they have many spares available to automatically replace failed disks:
Data  Parity  Spare
12    3       13
12    3       14
24    6       20
36    9       26
To understand the above table, we'll use the first row as an example. An array made up of 1TB disks with 12TB of data space would have 3TB of parity and 13 spare 1TB drives, for a total of 28 drives to get 12 drives' worth of net storage.
What they didn't mention is that the same reliability can be achieved with only three spares, by replacing spares at your convenience. Replacing drives can be somewhat costly if it has to be done quickly, but if you can schedule to replace the failed drive "some time in the next two months", that probably won't be costly.
I worry a lot less about losing data than I do corrupting data and not knowing it.
But hey, congratulations, you've learned about RAID mirrors with lots of copies and learned how to apply basic, well understood engineering principles to it.
Guess what, some of us were aware of this years ago, some others aware of it longer than you've probably been alive. It's been known my entire life, that's for sure, so that's at least 40 years.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Is this really a long time? 4 years? That seems kind of short and not reliable to me.
100,000 hours = 273 years. Does anyone believe that?
Everyone except you apparently.
When the foot seeks the place of the head, the line is crossed. Know your place. Keep your place. Be a shoe.
"Yeah, well just put more disks in it..."
Nice idea. Only: TCO is not just based on initial spending and maintenance. There is also rackspace to consider and did I hear anyone talk about green IT?
If my day to day considerations were that one dimensional, my employer could save a ton of money on my salary.
100,000 hours is 4,167 days which is ~11.4 years. That sounds pretty reasonable to me, since I've run plenty of disks for over a decade.
Well, duh. RAID6 is not a serious level of redundancy. ZFS RAIDZ-3 (triple parity) FTW. And you can build in as many hot spares as you want. Dinosaurs who have still not adopted ZFS need to get a clue.
Yes, but then you're dancing around the possibility of additional disk failures while waiting on that replacement.
If you pop a few more drives (which, if you got your disks in lots is QUITE possible), you're in deep shit.
Chas - The one, the only.
THANK GOD!!!
Check your math. 100,000 hours / 24 = 4166.6~ days
4166.6666~ days / 365 = 11.4 years
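The two conversions being argued over in this thread can be checked in a few lines. This is only a sketch of the arithmetic above; the function and variable names are made up for illustration, and leap years are ignored.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # leap years ignored, as in the comment above


def hours_to_years(hours: float) -> float:
    """Convert hours to years by going through days first."""
    return hours / HOURS_PER_DAY / DAYS_PER_YEAR


mtbf_hours = 100_000
correct_years = hours_to_years(mtbf_hours)   # divide by 24, then by 365
mistaken_years = mtbf_hours / DAYS_PER_YEAR  # skips the division by 24

print(f"{correct_years:.1f} years")          # 11.4 years
print(f"{mistaken_years:.0f} (hours treated as days)")  # 274
```

The second figure is what you get if the hours are treated as days, which is the likely source of the 273-year number.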
Silence is a state of mime.
more like 11.4 years
100,000 hours = 273 years. Does anyone believe that?
Oddly enough, it doesn't matter whether you believe it or not. What matters is whether that's the same predictive model used for estimating lifetimes of RAID arrays, or a single drive for that matter. Since you want to compare the proposed new config directly with current paradigms, you have to use the same set of underlying assumptions.
https://app.box.com/WitthoftResume Code: https://github.com/cellocgw
In academia, everything is simple and independent. I'm sure it's fun to calculate theoretical parity requirements for quintuple disk failures. ...but it's useless.
In the real world, if you have five disks simultaneously fail in an array, there was a common cause. The next step is to restore from backup because every drive in that array is now suspect. Whatever knocked out five disks probably did a number on the rest, and it would be reckless to assume they are unaffected, even if they have a clean SMART report. You're well past the point of caring about parity if an array gets crushed like that.
TL;DR version:
Replacing disks sucks sometimes. Sticking in additional spares means you don't have to replace them. They calculated an efficient RAID solution that means you don't need as many spares.
Well, I might have a way, but it only works on a semi spherical planet in a vacuum.
Girls suck at math.
Tic-Tac-Toe, Global Thermonuclear War, and relationships all have the same winning move.
Umm, 273 years is nearly 2.4 million hours. So, no, no one with basic arithmetic skills believes that 100,000 hours is 273 years.
You don't understand the meaning of MTBF.
The real "Libtards" are the Libertarians!
They did 100000/365 which equals about 274. They seem to have confused hours with days.
Actually it does matter. If you believe 100,000 hours = 273 years you lack basic arithmetic skills.
They also don't realize that 100,000 hours / 365 days is not the way you get years from hours.
Yeah, and what are you going to do when 9 out of 10 of the disks all go bad, because they came from the same factory run and exhibit the same issue? This is what we usually experience: when a disk fails, most of the time it's a subcomponent issue shared by all of the disks from that and any concurrent factory runs, and we have to swap them ALL out. I guess you just throw the whole array out ... :-(
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh
Just a few things I thought of while looking at this study:
The authors are using Backblaze data. Backblaze uses consumer grade SATA disk which isn't going to be as reliable as the Enterprise SATA/SAS disk we would use.
I'm willing to bet that none of the authors of this paper have ever had to pay for colocated rack space, power, and cooling either, they've just doubled the RU that I need for storage. At $1500.00 - $2000.00 per rack that adds up.
Doubling the rack space for storage I need so I can avoid a few service calls by my storage vendor over 5 years simply isn't efficient.
We've installed close to 500TB of archival storage using commodity hardware and 2-3TB Nearline SAS. We have maybe 3 hands-and-eyes calls per year for disk replacement.
Anyway - just rambling.
Oops my math error. Still, 11.4 years is also way out of line with the reality that, as density rises, so do failure rates. Why do you think they've lowered the warranty period?
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
Basically as the disk size grows you are talking about N-squared spares. I think most businesses are going to be more than happy with just hot-swapping out failed disks as needed.
But I have yet to see a high-density disk last more than 8,000 hours, with the median being maybe half that.
Good for you. I have a number of 2 and 3 TB drives that are more than 5 years old. Anecdotes != evidence.
I've apologized for the bad math, but sorry again. However, 11.4 years doesn't match what's actually happening as we go to higher densities. I've had a few drives last 8,000 hours, but most have died much sooner.
I screwed up. Sorry. However, even 11.4 years is overly optimistic as we cram more and more onto a single platter.
All that to last all of 4 years, and it needs nearly as many hot spares as data drives. I guess the academics think they know something yet again. They took some dubious failure rates (Backblaze uses whatever is the cheapest consumer drive at the time and eventually stops buying the really bad ones (Seagate 1.5TB and 3TB, looking at you)) and a rather optimistic transfer rate (200MB/s) that assumes all sequential reads. They failed to account for backplane, controller, and power, assuming that those never fail. By their numbers you might as well run mirrored RAID 5 or 6 with enough hot spares to make it between regularly scheduled tech visits. That gives you the ability to split chassis and controllers along mirror lines. As to rebuilds, we have better methods: predictive failure works well, SSDs make great caches while rebuilding, etc. We also have less centralized options with distributed technologies that potentially scale better.
5 9's is not that hard of an objective when talking about raid sets, the tools have been there for decades. Sure you will never reliably reach it with a single path to anything, 5 minutes is not enough time for even a staffed site to remedy any outage.
No sir I dont like it.
But thinking that 11.4 years is going to save their behind is unrealistic.
My understanding is that disks often fail when a head touches the surface, or a piece of dirt gets between the head and the surface. Once that happens, more dirt is produced, increasing the probability of more head crashes, leading to a failure cascade. As a consequence, once one of my drives starts to show unrecoverable errors, corresponding to damaged surface areas, I replace it while it can still be read.
The spare platter strategy does nothing to reduce this failure mode. In fact, all modern disks already have spare space for bad block relocation.
Yes, I goofed. However, believing that 11.4 years is what you'll get in practice is also naive, especially with the higher-density drives that haven't accumulated even 2 years of real-life experience.
Sorry, the 3 TB drives are around 3 years old. The 2 TB have passed their 5 year warranties with no issues.
Actually it does matter. If you believe 100,000 hours = 273 years you lack basic arithmetic skills.
+1 sardonic
But doesn't address my serious point about application of statistical methods.
I don't trust anybody who has published a document with the title "C:\Users\Jehan-Francois Paris\Documents\ADAPT15\Case3.doc." Not even in .docx format. Tsk tsk.
The goal is to realize that for manufacturers, service calls are expensive. Perhaps a company has a 4 hour response time - if a disk fails, the company is still running with redundancy, but they're wanting that drive replaced pronto, which is easily $500+ per incident (need to have spares on hand, drop ship extras if a tech runs low, need to station techs around, maybe even need to fly a tech in).
So the goal is that building an extra 13 spare 1TB drives (which probably cost under $50 in bulk) is $650, or the cost of just over one service call.
If enough drives have to be replaced then the tech can change a whole pile of them at once, which is still cheaper than sending people out for individual drive failures.
The goal is basically to have no service calls over the service life - then maybe refresh it periodically at one's convenience by replacing all the failed drives in one go.
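The cost argument above reduces to a break-even count. A rough sketch using only the figures quoted in the comment ($500 per service call, 13 spares at $50 each); both prices are the commenter's guesses, not measured costs, and the variable names are illustrative.

```python
# Figures taken from the comment above; both are rough estimates.
COST_PER_SERVICE_CALL = 500  # dollars per incident ("easily $500+")
COST_PER_SPARE = 50          # dollars per 1TB drive in bulk
SPARE_COUNT = 13

spares_cost = SPARE_COUNT * COST_PER_SPARE
break_even_calls = spares_cost / COST_PER_SERVICE_CALL

print(f"Spares cost: ${spares_cost}")                         # $650
print(f"Break-even: {break_even_calls:.1f} service calls")    # 1.3
```

Rack space, power, and cooling for the extra drives are not included, which is the objection raised elsewhere in the thread.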
No, they are constantly being read and written to from a NAS.
They seem to have confused hours with days.
Captain! They've broken our secret Starfleet code!
If you read the article, that is exactly what they suggest. If failure rates are too far above predicted, they say to replace with new array. At least they are upfront about it.
100,000 hours = 273 years. Does anyone believe that?
I don't, because 100,000 hours is 11.4 years.
273 (much closer to 274) years is 100,000 days.
systemd is Roko's Basilisk.
A service call? Seriously? A sysadmin (or operator if it's a big place) can't see the yellow light on a disk and replace the pack with in-house spares? Have we become so inept as an IT community that we can no longer do a walk-through of our machine room and service simple things like this? Maybe we do deserve to be outsourced.
And if one must have a service contract such that only the vendor can touch the hardware, (why would you do that? never mind) wouldn't you negotiate a provision that includes drive replacement (as drives are consumables that must eventually be replaced) without being charged for an "office visit"?
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
We do just that: when it gets down to 1 hot spare it's an emergency service and we replace all the failed units. This does not happen very often and tends to be just that: a bad batch.
PS You've already apologised more than enough for this. Sorry to compound it!
I kind of deserve it, though. That's what I get for trying to pass the vacuum, watch Dr. Phil, keep my neighbor's dog from drinking my coffee (again), and post on Slashdot at the same time.
I would hope I'm misunderstanding it, because that seems like a lot of spares to purchase ahead of time.
That seems exceptionally short. I run a repair shop, and dead/dying HDDs are the second most common problem. While I do not know the operational hours of these devices, the great majority are past the 3 year mark when they begin to fail.
I guess it also depends on your definition of high density as I do not see many drives > 1TB in consumer/SMB equipment.
Alright, fine, ashift=12 is newer than 2009, for 2TB+ drives. And always use /dev/disk/by-id for your sanity.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Even if the mean time between failures for consumer drives was 6 months, the odds of 'popping' two more spares in the month after the first failure would be less than 3%. If the MTBF is 1 year the probability drops to 0.7%.
Except if you got a bad batch where some kind of material or production defect will cause many disks to fail near simultaneously. The overall MTBF might be true for all the disks they produce, but unless you make a real effort to source them from different batches over time you can't assume that's going to be your MTBF.
Live today, because you never know what tomorrow brings
A mean time between failure of 11.4 years means you can reasonably expect half of all drives to fail before then*. Assuming a constant failure rate (which we really shouldn't do), that means you can expect ~4.4% of drives to fail every year. Which leads to the benefit of lowering the warranty period: Every year of warranty increases the expected total production/replacement cost of the drive by 4.4% - reduce the warranty period and you boost profit margins and/or can reduce the price to undercut your competitors.
*In reality it's not quite so simple, MTBF is actually the average failure rate of a large number of young drives tested for (probably) considerably less than a year, with aging effects never taken into consideration.
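For comparison, here is the ~4.4% figure next to what a textbook exponential-lifetime model would give. This is a sketch under stated assumptions: the first calculation follows the comment (half the drives gone by the MTBF, spread evenly over the years), while the second assumes exponentially distributed lifetimes with mean equal to the MTBF, which is an assumption of this example and not something the comment or the paper states.

```python
import math

mtbf_years = 100_000 / 24 / 365  # about 11.4 years

# The comment's arithmetic: 50% of drives fail by the MTBF, evenly spread.
linear_annual_rate = 0.5 / mtbf_years

# Alternative assumption: exponential lifetimes with mean = MTBF.
exp_annual_rate = 1 - math.exp(-1 / mtbf_years)
exp_failed_by_mtbf = 1 - math.exp(-1)

print(f"Linear estimate:      {linear_annual_rate:.1%} per year")
print(f"Exponential estimate: {exp_annual_rate:.1%} per year")
print(f"Exponential model, failed by the MTBF: {exp_failed_by_mtbf:.1%}")
```

Under the exponential assumption roughly 63% of drives, not half, have failed by the MTBF, so the yearly rate comes out closer to 8%. Either way, the footnote's caveat about aging effects still applies.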
--- Most topics have many sides worth arguing, allow me to take one opposite you.
Instead of keeping the spares inside as just that — spares — can it not start using all of them (in a sufficiently redundant configuration) and gradually lose capacity as physical disks fail?
Yes, it would require coordination with the driver and filesystem, but there is nothing insurmountable in that...
In Soviet Washington the swamp drains you.
One of the authors is a Catholic priest. He probably blessed the drives first.
"We observe that the same objectives cannot be reached with RAID level 6 organizations and would require RAID stripes that could tolerate triple disk failures."
That's true only if you assume that three disk failures occur faster than a single disk can be rebuilt.
If you assume no more than two disk failures *during the length of time it takes to rebuild the array* then RAID 5 or RAID 6 works fine as long as you assign enough hot spares.
> service calls are expensive. Perhaps a company has a 4 hour response time -
Service calls are expensive BECAUSE it's an emergency. If you have four spares, plus the two parity drives, you're still six drives away from a problem. With a few spares, you can easily replace one by sending it UPS ground, rather than having a tech run out there immediately.
er, last time I checked, 100,000 hours is 11 years.
273 years is 2,400,000 hours. Did you lose the use of your calculator?
"Cats like plain crisps"
Indeed -- remember the experiment posted on Slashdot a year or so ago where they measured the MTBF across drives purchased in batches and outside batches? Failures tended to cascade within the batch; other batches would cascade at different times.
So that entire cluster is likely to fail catastrophically unless you're swapping in drives from new batches from time to time -- at which point it should last MUCH longer than 4 years with data integrity. Bonus points if your array can handle size boosts over time (swapping in larger disks).
Call me when there's a dick array with 99.999% availability
That is one of the greatest subtle Wrath of Khan references I've seen yet.
Spock: "Admiral, if we go by the book, like Lieutenant Saavik, hours would seem like days."
Masterful!
The number of drives seems to be large. The calculations are exponential, therefore as the cluster gets bigger the number of spare disks gets much bigger.
Drives spares Total
5, 15, 20
10, 55, 65
30, 465, 495
That's a lot of disks. There is a point that space and power overcomes the human cost.
Not a little more reliability, a LOT more reliability.
A single drive's 100,000h MTBF translates to a 5-nines reliability period of only 1.4427 hours: 0.99999^(100,000h/1.4427h) = 0.5
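The 1.4427-hour figure can be reproduced by solving the quoted formula for the period length. A minimal sketch: it takes the formula at face value, which means assuming (as the comment implicitly does) that survival probability at the MTBF is 0.5. The function name and parameters are made up for illustration.

```python
import math


def period_for_reliability(mtbf_hours, per_period_reliability,
                           survival_at_mtbf=0.5):
    """Solve per_period_reliability ** (mtbf_hours / t) == survival_at_mtbf for t."""
    periods = math.log(survival_at_mtbf) / math.log(per_period_reliability)
    return mtbf_hours / periods


t = period_for_reliability(100_000, 0.99999)
print(f"{t:.4f} hours")  # 1.4427 hours
```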
Reread the summary - N is the number of parity disks, not the number of data disks.
N parity disks
N*(N-1)/2 data disks
N*(N+1)/2 spares
So roughly the same number of spares as data disks, and the number of parity disks scales as roughly the square root of twice that number. Pretty impressive if you're talking high-capacity data storage with 100s or thousands of data disks.
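To make the scaling concrete, here is a small sketch that applies the three formulas from the summary for a given N. The function name and the returned field names are made up for illustration; only the formulas come from the paper's abstract.

```python
def array_layout(n_parity: int) -> dict:
    """Disk counts for a 2D array with n_parity parity disks, per the summary."""
    data = n_parity * (n_parity - 1) // 2
    spares = n_parity * (n_parity + 1) // 2
    total = n_parity + data + spares
    return {
        "parity": n_parity,
        "data": data,
        "spares": spares,
        "total": total,
        "data_fraction": data / total,
    }


for n in (5, 10, 30):
    layout = array_layout(n)
    print(f"N={n:2d}: {layout['data']:3d} data, {layout['parity']:2d} parity, "
          f"{layout['spares']:3d} spares, {layout['total']:3d} total "
          f"({layout['data_fraction']:.0%} usable)")
```

For N = 30 this gives 435 data disks, 30 parity disks, and 465 spares, so the spare count tracks the data count closely while the parity count grows much more slowly.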
Also data reliability is something very different than uptime: you don't lose data for only 5.26 minutes per year - once gone it's gone.
Meanwhile a single drive's 100,000h MTBF translates to a 5-nines reliability period of only 1.4427 hours: 0.99999^(100,000h/1.4427h) = 0.5
N is the number of parity disks - the number of data disks also increases as N-squared.
They 'invented' RAIDZ3? Or they are perhaps using ZFS or something similar internally and not telling anyone (like so many in the industry). Sure you can achieve very high reliability using ZFS but most systems maintain those 9's by a) having hot spares and b) replacing disks that failed in a timely manner. They are simply adding more hot spares so a service call is less important, you can just go by and replace 5 disks at a time whenever you need to expand your storage.
They also forgot to mention that once disks start failing, you could easily have a whole set of them fail. Especially with firmware issues or if someone dropped an entire box in shipping. Once you drop below 2 hotspares/10 disks, you are in serious risk of degrading your system because disks could fail while rebuilding as well.
Custom electronics and digital signage for your business: www.evcircuits.com
Calculating the system MTBF of 77 drives at 100,000 hours as a subsystem, we'd expect to have a drive failure approximately every 1300 hours. That's not the reality of most observations/environments, but it's enough to have at least a couple of spares on hand, and it's why we have things like RAID 6 and ZFS. It also doesn't necessitate you having tons of spares onsite either.
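The 1300-hour figure follows from dividing the per-drive MTBF by the drive count. A quick sketch, assuming independent drives with a constant failure rate (the usual simplification, and one that other comments here dispute for same-batch drives); the names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def array_mtbf_hours(drive_mtbf_hours: float, drive_count: int) -> float:
    """Mean time between drive failures anywhere in the array,
    assuming independent drives with a constant failure rate."""
    return drive_mtbf_hours / drive_count


hours_between_failures = array_mtbf_hours(100_000, 77)
failures_per_year = HOURS_PER_YEAR / hours_between_failures

print(f"One failure every {hours_between_failures:.0f} hours")  # 1299
print(f"About {failures_per_year:.1f} failures per year")       # 6.7
```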
Harrison's Postulate - "For every action there is an equal and opposite criticism"
er, last time I checked, 100,000 hours is 11 years.
Oh you check that a lot?
Slashdot, fix the reply notifications... You won't get away with it...
That's why, as the manufacturer of such a system, you refuse to sell it bare. Your customers won't complain if you tell them what the bare cost, cost per disk, and labor cost to install a disk are, and sell disks at cost and with reasonable labor. Make money on your hardware, bring in enough to pay for assembly based on disk install labor.
That's only step one, though. Start ordering disks when you start your first production run of hardware. Order direct from manufacturer, and from as many suppliers as possible, so you get disks from as many batches as possible. Then, continue placing frequent, but small, orders from whoever can get you the disks the cheapest; it may work out that you can get volume pricing from the manufacturer by telling them "I'm going to need X disks over all and am willing to pay for them up front, but I need them shipped (X/52) per week from current stock at the time of shipping, don't set aside my disks out of the current batch to ship at a future date".
It's a bit more labor, but compare serial numbers and attempt to color code by batch. Use colored dot stickers for this. When fetching drives for an installation, try and get an even distribution of colors, so you don't have an excess of drives from any given batch, and always record who has which drives, so if you start getting failure reports that indicate a bad batch, you can proactively alert the customers who have those drives that it might be a good idea to have you swap them even if they still appear to be functioning.
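The "even distribution of colors" step could be done with a simple round-robin over batches. This is a hypothetical sketch, not anyone's actual inventory system: the `(serial, batch)` tuples, the batch labels, and the `pick_balanced` name are all invented for illustration.

```python
from collections import defaultdict
from itertools import zip_longest


def pick_balanced(inventory, count):
    """Pick `count` drive serials, taking one from each batch in turn.

    `inventory` is a list of (serial, batch) tuples.
    """
    by_batch = defaultdict(list)
    for serial, batch in inventory:
        by_batch[batch].append(serial)

    picked = []
    # Each round takes at most one drive from every batch.
    for round_of_drives in zip_longest(*by_batch.values()):
        for serial in round_of_drives:
            if serial is not None and len(picked) < count:
                picked.append(serial)
    return picked


inventory = [
    ("A1", "red"), ("A2", "red"), ("A3", "red"),
    ("B1", "blue"), ("B2", "blue"),
    ("C1", "green"),
]
print(pick_balanced(inventory, 4))  # ['A1', 'B1', 'C1', 'A2']
```

Recording which customer received which serials would be a separate lookup table keyed on the same batch labels.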
All of that drives up the cost, of course. I'm not going to sit here and do the math to figure out what the cost would be, as there are simply too many assumptions and I have too little time, but if you've nothing better to do and don't mind making a couple dozen, likely provably wrong, assumptions, you can have at it.
APK quotes people (including myself) without context and should not be trusted. Just thought you should know.
I'm sure he used a calculator, seems he simply forgot to divide by 24.
If service call costs for one or two disks are prohibitive, simply put in enough spares so you only have to roll a tech for, say, 10 drives.
Alternatively, make them user-swappable. If all the customer has to do is ask their tech to yank drives with a Blinky Amber Light of Doom, even the most untrained monkey could figure that out.
No, it's not your basic conversion error that's the problem.
A MTBF of 11.4 years does not mean that a typical array will have a lifetime of 11.4 years. From Wikipedia:
You are conflating "useful life period" with MTBF. They measure different things.
They have stronger magnets because they need to write that data more harder than normal drives.
That's certainly what it says in the summary. As for distinguishing between parity and spares - I should think that would be obvious: the parity disks are in active use, the array can't detect/correct errors without them. The spares meanwhile are just sitting there, presumably powered down, until one of the active disks needs to be replaced.
As for the equivalence in the number of spares... I suspect it's not exactly coincidence, more like human nature: "Okay, we've got a cool 2D parity system - let's see just how long it will maintain 5-nines reliability if we give it one spare for every active drive. Over four years! Cool, for the press release let's juice it up a little and rephrase that as 'more than enough for five nines for four years'."
You could use ZFS with RAIDZ3 and multiple spares.
Disk Array, "Sans" maintenance.
Spock, if that array isn't rebuilt in two hours, get that rack out of there and back to a Service Bay.
I bought this house and you know I'm boss
Ain't no h'aint gonna run me off
every 11 years, or when my inbuilt estimation engine says "these figures are wrong, let's just check that".
Said engine was especially useful when we used slide-rules (you might have to look that up), as I did at high school. It still is, because the world is full of people who blindly believe stuff.
Not you of course.
Even Jupiter's day is 10 hours. (Ok, 9.9, but close enough).
Maybe if we speeded up the earth's rotation a bit ... yeah, let's do that, make it one hour. Oh boy, effective gravity has gone slightly negative at the equator, we are losing our atmosphere, and cows will fly, perhaps over the moon, though mooing seems unlikely.
Nah, I vote to leave it alone and do arithmetic properly. Boring, but we should live longer (though maybe not in days).
Err... didn't see who the original bad math was done by. I mean "she"... I think...
It also assumes a normal failure of drives. However, modern drives do not always fail normally. They develop slow spots and timeouts from which they might recover.
Also, the software to create the redundancy might fail, or it might fail if you do not update the firmware.
And I am not even talking about catastrophic failure. When a drive overheats you might want to remove it from the datacenter.
How about having like 10 additional spare discs in your rack, and calling the service for replacement when 10 discs died? The cost of the service call does not matter much when it is for many discs at once.
We propose to eliminate [disk replacement] calls by building disk arrays that contain enough spare disks to operate without any human intervention during their whole lifetime ...we have simulated the behaviour of two-dimensional disk arrays with N parity disks and N(N – 1)/2 data disks under realistic failure and repair assumptions. Our conclusion is that having N(N + 1)/2 spare disks is more than enough to achieve a 99.999 percent probability of not losing data over four years.
Are you seriously telling me you read that and get that they're creating a disk array out of spare disks that can provide 5-nines reliability for four years without involving any disk replacement? Methinks you need to invest some serious effort on your reading comprehension skills. Not to mention your sanity-check skills.
Some company was doing this in the Bay area in 2000.
Hotplug is expensive. Cases are expensive. Making room for human access is expensive.
Design for nothing but airflow and drive density, keeping pieces as absolutely cheap as possible. Gigabit instead of 10G.
At exabyte scale, why do you care about the loss of 4TB? Using Super Micro boxes w/4TB Drives, you can have over 6 petabytes of raw storage in a 72u rack / cabinet
Metadata servers keep track of where the copies of blocks are.
Put copies of the blocks on completely disparate systems. If there is heavy read usage of a block, make more copies.
Head servers scale and have some beef to them. They are all about getting info from the commodity stuff and packaging it for (subscribers, clients, whatever).
If a drive dies or has issues - mark it bad and leave it at that. Ignore it.
If a server dies, mark it as bad. Leave it.
In 4 years you are forklifting the equipment and replacing it with new storage.
There is no "RAID", other than there are multiple copies of blocks throughout the system.
I met with a company in the bay area doing this in 2000 (I don't remember which one). It was dealing with Filesystems and not block, but with NFS, VMDKs, VHD, etc, who cares. I don't see anything new here at all.
I used the wrong Supermicro box to make my point - I selected the pure storage, vs server with storage.
So 72 drives instead of 90 per 4U. 5.5 PB per 72U instead of "over 6".
The rest of my points stand.
I'll believe others when I see the uptime....
I am the unwilling control for my Origin.
This has happened repeatedly. The most notorious example is the "IBM Deskstar", which failed en masse after consistent amounts of use. They destroyed RAID arrays around the world because the individual drives could not be replaced fast enough to secure the data before multiple drives went offline simultaneously.
They have N parity disks, and then roughly N(N-1)/2 data disks and roughly the same number of spares.
In larger arrays the overall overhead of the parity and spare disks is slightly over 50%, or roughly equivalent to RAID-1, but more reliable since the spares can be reassigned as needed.
The solution for this is checksums and parity on the disk contents at the filesystem level. Read a block off the disk and check the stored checksum against what you read...if it doesn't match then use the parity information to correct the data and store it somewhere else.
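The read-verify-repair loop described above might look like the following. It is a toy sketch, not how ZFS or any real filesystem implements it: it assumes equal-length blocks, a single XOR parity block per stripe, and CRC32 as the checksum, and every function name here is invented for illustration.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Return the per-block checksums and the parity block for a stripe."""
    checksums = [zlib.crc32(block) for block in data_blocks]
    return checksums, xor_blocks(data_blocks)


def read_block(index, stored_blocks, checksums, parity):
    """Read one block, rebuilding it from parity if its checksum fails."""
    block = stored_blocks[index]
    if zlib.crc32(block) == checksums[index]:
        return block
    # Silent corruption detected: XOR the other blocks with the parity block.
    others = [b for i, b in enumerate(stored_blocks) if i != index]
    repaired = xor_blocks(others + [parity])
    if zlib.crc32(repaired) != checksums[index]:
        raise IOError("block could not be repaired from parity")
    return repaired


blocks = [b"disk", b"data", b"here"]
checksums, parity = write_stripe(blocks)
damaged = [blocks[0], b"dXta", blocks[2]]  # one block silently corrupted
print(read_block(1, damaged, checksums, parity))  # b'data'
```

A real implementation would also rewrite the repaired block to a fresh location, as the comment suggests, and would need more than one parity block to survive multiple bad blocks in the same stripe.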
Most should make it to their second year (>=8640 hours).
In our small 24 bay array I've seen a lot of those bad Seagate ST3000DM001 fail at ~15000-19000 hours.
If you run raid 66 (a raid 6 array of raid 6 arrays) then you get that much more protection.
Not that RAID 6 is anywhere near good enough since 2TB drives came along. There's around a 10% chance that you'll lose your remaining spare during a parity rebuild from a drive loss on a 12+2 disk array and a 1% chance that you'll lose another drive recovering from that (I've seen it happen).
This is one of the reasons for considering ZFS raidZ3. One of the other reasons is that because it uses SSD buffering and caching, drive seek activity is smoothed out and heavy head seek is one of the prime life shorteners in mechanical hard drives (I've had identical array hardware using the same batches of drives and the ones which get hit hardest for random IO are the ones where drives fail more often.)
Just don't drop the whole thing on a concrete floor, otherwise every platter will fail immediately.