Ask Slashdot: Practical Bitrot Detection For Backups?

Posted by timothy on Tuesday December 10, 2013 @05:15AM from the error-detected-goodbye dept.

An anonymous reader writes "There is a lot of advice about backing up data, but it seems to boil down to distributing it to several places (other local or network drives, off-site drives, in the cloud, etc.). We have hundreds of thousands of family pictures and videos we're trying to save using this advice. But in some sparse searching of our archives, we're seeing bitrot destroying our memories. With the quantity of data (~2 TB at present), it's not really practical for us to examine every one of these periodically so we can manually restore them from a different copy. We'd love it if the filesystem could detect this and try correcting first, and if it couldn't correct the problem, it could trigger the restoration. But that only seems to be an option for RAID type systems, where the drives are colocated. Is there a combination of tools that can automatically detect these failures and restore the data from other remote copies without us having to manually examine each image/video and restore them by hand? (It might also be reasonable to ask for the ability to detect a backup drive with enough errors that it needs replacing altogether.)"

Re:BTRFS filesystem by girlintraining · 2013-12-10 07:40 · Score: 0, Flamebait

I'll be the heretic here, but on Windows 8.1 and Windows Server 2012 R2, there is a feature called Storage Spaces. It works similar to ZFS where you toss drives into a pool, then create a volume that is either simple, mirror, or with parity, and Windows does the rest. If a volume needs more space, toss some more drives in the pool.
You have no idea what you're talking about, sir. A mirror only duplicates the data. The writes are made synchronously to both sources, and the reads are interleaved between devices to improve speed. In RAID-0, if either drive fails, the array is lost. In RAID-1, mirroring, data is written to two drives at the same time, and read back in an interleaved format. Unless the device itself reports a hardware error, the cluster will continue to read data back from every device on the chain. Mirroring can introduce silent bit rot because the data is read back from only one source at a time. RAID-1 (mirroring) is meant to prevent data loss due to hardware failure. It does not prevent corruption of the filesystem or your data via bit rot, and in fact, under most usage scenarios, increases it.
Without parity checking, you simply aren't addressing bit rot. Period. It could be Raid 9 Million(tm) and if all it's doing is copying the data, and not comparing it, bit rot will still proceed apace, silently eating your data. But let's say you're a good administrator that has enabled parity. Great! But there's still a problem: parity cannot restore data that has become corrupted due to bit rot -- it is a detection-only mechanism. So if you have two drives in a RAID-1 with parity configuration, as you also suggest... it will detect the file corruption, but as it cannot correct it, it will then promptly seize up and fall over dead. This is because for every N clusters written, a parity cluster is also written; this allows the array to detect if that data chunk was correctly committed. But if the data on any of the clusters within the chunk is altered later, the RAID array will only know that this chunk of data (known as a stripe in RAID) is invalid. It cannot correct it.
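The stripe-level check described above can be sketched roughly as follows, assuming plain byte-wise XOR parity over equal-sized blocks. The function names are hypothetical and do not correspond to any real RAID implementation's API. The point of the sketch is only that a parity mismatch flags the stripe as inconsistent without saying which block changed.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(blocks, parity):
    """True if the stored parity still matches the data blocks."""
    return xor_parity(blocks) == parity


# Hypothetical stripe of three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# Simulate silent corruption: flip one bit in the second block.
rotted = [data[0], bytes([data[1][0] ^ 0x01]) + data[1][1:], data[2]]

print("clean stripe consistent: ", stripe_is_consistent(data, parity))
print("rotted stripe consistent:", stripe_is_consistent(rotted, parity))
```

The mismatch on the rotted stripe is the same whichever block (or the parity block itself) was altered, so this check alone cannot point at the bad block.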
The only way to truly prevent bitrot is by maintaining at least three complete copies of the data and regularly comparing them. If one of them shows an inconsistency, the other two should still agree, and the inconsistent chunk is then discarded and rewritten from the copies that agree. This is how the Space Shuttle was designed with its landing computer -- three fully independent computers, and each with three complete sets of sensors independently connected along main buses. Because bit rot is a major problem in space due to radiation, the system is designed at every level with 3x redundancy (or more; there are 7 gyrostabilizer systems on the ISS, for example).
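A minimal sketch of the three-copy comparison described above, assuming each copy of a chunk can be read into memory and compared by SHA-256 digest. The name `vote_and_repair` is hypothetical, and a real tool would read from and write back to separate drives or sites rather than a Python list.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 hex digest of one copy of a chunk."""
    return hashlib.sha256(data).hexdigest()


def vote_and_repair(copies):
    """Majority-vote across copies of the same chunk.

    Returns (repaired_copies, indices_that_were_rewritten).
    Raises ValueError when no strict majority of copies agree.
    """
    digests = [digest(c) for c in copies]
    winner, count = Counter(digests).most_common(1)[0]
    if count <= len(copies) // 2:
        raise ValueError("no majority: cannot tell which copy is good")
    good = copies[digests.index(winner)]
    bad = [i for i, d in enumerate(digests) if d != winner]
    # Overwrite every dissenting copy with the majority version.
    repaired = [good if i in bad else c for i, c in enumerate(copies)]
    return repaired, bad


# Hypothetical example: the second copy has silently changed.
copies = [b"family photo", b"family phoro", b"family photo"]
repaired, bad = vote_and_repair(copies)
print("rewritten copies:", bad)
```

With only two copies a disagreement cannot be resolved this way, which is why the function refuses to guess when no strict majority exists.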
RAID10 and similar systems are two RAID5 systems which are independent and regularly compare data; these can detect which system is inconsistent, so you will always have at least one copy of your data in a consistent state. But if the RAID ever becomes non-operational and has to be rebuilt, there will be a period of time where only one known good copy is available -- bit rot could occur during this time, and all you could do is detect it, not repair it. This is why you want triple redundancy -- so you can remove one of the systems for maintenance and still have two remaining copies, thus maintaining the ability to detect bit rot.
Now that I've explained all the ways that you're wrong, let me say that bit rot is probably not the cause of the OP's problems. In fact, USB devices are well-known for corrupting filesystems because of spontaneous disconnects, power loss events, etc., and this is simply what can be expected in a typical residential environment. Even a RAID configuration in a residential environment isn't invulnerable to the "write hole" problem -- where data is partially committed to disk, but then the array suffers a power loss event.
This is what usually causes da

--
#fuckbeta #iamslashdot #dicemustdie