Proposed Disk Array With 99.999% Availability For 4 Years, Sans Maintenance
Thorfinn.au writes with this paper from four researchers (Jehan-François Pâris, Ahmed Amer, Darrell D. E. Long, and Thomas Schwarz, S. J.), with an interesting approach to long-term, fault-tolerant storage: As the prices of magnetic storage continue to decrease, the cost of replacing failed disks becomes increasingly dominated by the cost of the service call itself. We propose to eliminate these calls by building disk arrays that contain enough spare disks to operate without any human intervention during their whole lifetime. To evaluate the feasibility of this approach, we have simulated the behaviour of two-dimensional disk arrays with N parity disks and N(N – 1)/2 data disks under realistic failure and repair assumptions. Our conclusion is that having N(N + 1)/2 spare disks is more than enough to achieve a 99.999 percent probability of not losing data over four years. We observe that the same objectives cannot be reached with RAID level 6 organizations and would require RAID stripes that could tolerate triple disk failures.
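For anyone who wants to see how the disk counts in the abstract scale with N, here is a minimal sketch. It only encodes the three formulas quoted above (N parity disks, N(N – 1)/2 data disks, N(N + 1)/2 spares); the function name `array_layout` and the dictionary keys are made up for illustration and do not come from the paper.

```python
def array_layout(n):
    """Disk counts for an N-parity 2D array, per the formulas in the abstract.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    parity = n
    data = n * (n - 1) // 2      # N(N - 1)/2 data disks
    spares = n * (n + 1) // 2    # N(N + 1)/2 spares, the bound the authors call "more than enough"
    return {
        "data": data,
        "parity": parity,
        "spares": spares,
        "total": data + parity + spares,
    }


if __name__ == "__main__":
    for n in range(3, 9):
        print(n, array_layout(n))
```

The total works out to N + N², so by these formulas the spare pool is always larger than the data disks it protects.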
The bottom line is that having a lot of spare disks makes a 2D array reliable over time. These configurations are quite reliable over time because they have many spares available to automatically replace failed disks:
Data  Parity  Spare
12    3       13
12    3       14
24    6       20
36    9       26
To understand the above table, we'll use the first row as an example. An array of 1TB disks with 12TB of data space would have 3TB of parity and 13 spare 1TB drives, for a total of 28 drives to get 12 drives' worth of net storage.
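The same arithmetic can be run over every row of the table. This is a rough sketch: the rows are copied from the table above, while the function name `summarize` and the "usable fraction" metric are my own additions for illustration.

```python
# (data, parity, spare) drive counts, copied from the table above
ROWS = [(12, 3, 13), (12, 3, 14), (24, 6, 20), (36, 9, 26)]


def summarize(data, parity, spare):
    """Total drive count and the fraction of drives holding net data.

    Hypothetical helper for illustration only.
    """
    total = data + parity + spare
    return {"total": total, "usable_fraction": data / total}


if __name__ == "__main__":
    for row in ROWS:
        s = summarize(*row)
        print(row, "total:", s["total"], "usable: %.1f%%" % (100 * s["usable_fraction"]))
```

By this measure the usable fraction improves a little as the arrays get bigger, since the spare count grows more slowly than the data count in these rows.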
What they didn't mention is that the same reliability can be achieved with only three spares, by replacing spares at your convenience. Replacing drives can be somewhat costly if it has to be done quickly, but if you can schedule to replace the failed drive "some time in the next two months", that probably won't be costly.
A lot of high-end equipment does have fairly large capacitors to allow enough time to do a clean power-off.
I remember back in the 1990s, some PC-centric folks were looking inside a Sun workstation, and they were surprised by all the large capacitors on the motherboard. In short, they give the system enough time to finish its final calculation before the power goes out.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
"More work is still needed to define policies that would allow array users and manufacturers to detect unusually disk failure rates and take the appropriate actions before any data loss takes place." - Last line in the conclusion.
This implies that not all the spare drives are active and ready to go all the time and that some/most would be kept powered down as cold spares. Of course this same guy is likely to get another paper done where he examines the cost to run the array and how many drives could be left cold and still achieve the 5-9s reliability. Heck, if the software managing the drives is smart, it would rotate active/spare drives in and out, working them in quickly to get them all past the 'first 18 months high failure' rate to the sweet spot, then swap in and out over the lifespan of the array to enable the array to be at highest reliability for longer.
Hrmm, maybe I should look at building such an algorithm, a quick google search doesn't turn any such systems up.
...
In your fantasy there is a difference besides a hideously higher price and a somewhat longer warranty period. In real life, commodity SATA is much more cost effective. Everybody who is serious recognizes this (Google, Backblaze, Amazon).
The question posed is whether the human intervention (labor charge) saved is worth more than the power costs.
Sloppy calculation tip: 24*365 = 10000.
If you're Sloppy enough to accept that premise, then at 10 cents/kWh, a watt costs a dollar per year. It turns your $28 into $32, but hey, close enough. When I'm shopping, I can add up lifetime energy costs really fast, without actually being smart. Nobody ever catches on!
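For the curious, here is a small sketch comparing the sloppy rule against the exact figure. The function name `yearly_cost` and its parameters are made up for illustration, and the 32 W load is only my guess at what reproduces the $28 vs. $32 numbers; the comment itself does not state a wattage.

```python
def yearly_cost(watts, price_per_kwh=0.10, hours_per_year=24 * 365):
    """Dollars per year to run a constant load (hypothetical helper)."""
    kwh = watts * hours_per_year / 1000
    return kwh * price_per_kwh


if __name__ == "__main__":
    load = 32  # watts; assumed value, not stated in the comment
    exact = yearly_cost(load)                         # uses 8760 hours/year
    sloppy = yearly_cost(load, hours_per_year=10000)  # the "24*365 = 10000" shortcut
    print("exact:  $%.2f/year" % exact)
    print("sloppy: $%.2f/year" % sloppy)
```

The shortcut overestimates by about 14% (10000/8760), which is the gap between the two dollar figures.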
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Get a real SAN or a better maintenance contract.
I manage various SAN/NAS totaling about 5000 disks in different parts of the world.
3:00 AM - Email that a disk failed, followed a few seconds later by an email that a hot spare kicked in
3:30 AM - Email from our vendor that a disk failed and they are sending a replacement, reply if I would like someone on site to replace that drive or if we will do it ourselves
~3:45 AM - Email that the RG/Pool is being rebuilt
~11:00 AM - A tech in that office gets a drive delivered to their desk, they walk into the server room, replace it and put the failed one in the box, put the included label on the box and take it to their mail room.
~11:45 AM - Email that the pool/rg has been rebuilt and that the hot spare has been returned to a hot spare