Ask Slashdot: Practical Bitrot Detection For Backups?
An anonymous reader writes "There is a lot of advice about backing up data, but it seems to boil down to distributing it to several places (other local or network drives, off-site drives, in the cloud, etc.). We have hundreds of thousands of family pictures and videos we're trying to save using this advice. But in some sparse searching of our archives, we're seeing bitrot destroying our memories. With the quantity of data (~2 TB at present), it's not really practical for us to examine every one of these periodically so we can manually restore them from a different copy. We'd love it if the filesystem could detect this and try correcting first, and if it couldn't correct the problem, it could trigger the restoration. But that only seems to be an option for RAID type systems, where the drives are colocated. Is there a combination of tools that can automatically detect these failures and restore the data from other remote copies without us having to manually examine each image/video and restore them by hand? (It might also be reasonable to ask for the ability to detect a backup drive with enough errors that it needs replacing altogether.)"
http://www.quickpar.org.uk/
http://chuchusoft.com/par2_tbb/
One single command will do that:
zpool scrub <poolname>
I really hope this discussion provides good answers, with practical solutions for Windows, iOS, and Linux... I think that this is the sort of thing that everyone could really use!
Are there cloud storage providers that can do this for the above example of an approx. 2 TB data set, and provide complete security?
I don't know if there's a better solution, but you could store checksums of each archived file, and then periodically check the file against its checksum. It'd be a bit resource intensive to do, but it should work. I think some advanced filesystems can do automatic checksums (e.g. ZFS, BTRFS), but those may not be an option, and I'm not entirely sure how it works in practice.
It seems to me that you've already identified an easy solution: RAID. A simple mirror of 2x 2-4TB drives is pretty cheap these days, so it would seem to be an ideal solution for one of your copies. Keep one "live" copy on your normal desktop, one backup on an off-site RAID, and if you feel like it, another copy on a cloud service or other media. Tape backups aren't sexy, but they're pretty cheap and very effective at long-term cold storage. Blu-ray discs aren't terribly expensive anymore, though the jury is still out on their long-term (decade+) durability.
You can install Perforce, which is a CM system like SVN etc., but it's very good with large binary data. You can have it run a verify command nightly (or as often as you like) and it will compare the MD5SUM of every version of every file (which was computed when that version of the file was committed to the CM system) with the current MD5SUM and let you know which ones have changed (bitrot).
If you don't CM your data, you can just do an MD5SUM recursively and store it off, then periodically repeat the procedure and diff the two.
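A minimal sketch of that "checksum everything, repeat later, diff the two" routine, using only the Python standard library. The function names are made up for illustration, and SHA-256 is swapped in as the default where the comment says MD5; either is enough to notice accidental corruption.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash one file in chunks so large videos never sit fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its checksum."""
    manifest = {}
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(directory, name)
            relative = os.path.relpath(full_path, root)
            manifest[relative] = file_digest(full_path, algorithm)
    return manifest


def diff_manifests(old, new):
    """Return (changed, missing, added) relative paths, each sorted."""
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    missing = sorted(p for p in old if p not in new)
    added = sorted(p for p in new if p not in old)
    return changed, missing, added
```

For an archive that is never edited in place, anything in the "changed" list is a bitrot suspect. The sketch cannot tell rot from a deliberate edit, so it only suits write-once collections like photos.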
If you like GUIs Beyond Compare is an excellent program and it does snapshots (CRCs of directory trees) and then lets you compare the snapshot with an updated / recomputed version.
If your physical media is dying, you'll get hardware errors so restore from a(nother) backup and replace the media.
If your files are being corrupted, what kind of crappy filesystem are you using to store these precious memories?!!
Should have parred your data to begin with.
Tape. LTO is error-correcting and extremely stable and reliable.
Is it expensive? Yes, for small datasets - for large datasets it can be much cheaper.
Is it a pain in the ass? Yes, but what's your data really worth?
rsync --checksum for the remote copies.
ZFS is a good filesystem for bitrot protection, you don't want to propagate the errors.
ZFS without RAID will still detect corrupt files, and more importantly tell you exactly which files are corrupt. So a distributed group of ZFS drives could be used to rebuild a complete backup by only copying uncorrupt files from each.
You still need redundancy, but you can get away without the RAID in each case.
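The "rebuild a complete backup by only copying uncorrupt files from each" idea can be sketched without ZFS. This is an illustration under assumptions: a trusted checksum manifest (relative path to SHA-256 digest) stands in for the per-file corruption report ZFS would give you, each replica is just a mounted directory, and all names are hypothetical.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_good_copy(relative_path, expected_digest, replica_roots):
    """Return the first replica path whose content matches, else None."""
    for root in replica_roots:
        candidate = os.path.join(root, relative_path)
        if os.path.isfile(candidate) and sha256_of(candidate) == expected_digest:
            return candidate
    return None


def rebuild(manifest, replica_roots, destination):
    """Copy one verified copy of every file into destination.

    Returns the relative paths for which no replica held a good copy.
    """
    unrecoverable = []
    for relative_path, expected_digest in sorted(manifest.items()):
        source = first_good_copy(relative_path, expected_digest, replica_roots)
        if source is None:
            unrecoverable.append(relative_path)
            continue
        target = os.path.join(destination, relative_path)
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        shutil.copyfile(source, target)
    return unrecoverable
```

A file is lost only when every replica holds a bad copy of it, which is why independent copies without per-site RAID can still add up to one complete backup.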
Windows 8.1 and Server 2012 support this if you set up a Storage Space as a mirror (it could be two standard external USB disks) and format it with the Resilient File System (ReFS) rather than NTFS. It will do background scans that correct for bitrot. It's new and not well proven, but that's Microsoft's claim anyway.
[seen in the margins of a book on data backup owned by someone claiming to be Fermat reincarnated]
"I have found an elegant solution to the problem of self-healing distributed backups which are neither co-located nor constantly aware of each other's state. The details are too long to fit in this space."
Or, similarly for BTRFS:
btrfs scrub start /btrfs
The functionality isn't new. Large robotic tape libraries would pull out tapes periodically and verify health of the media, copying to new if unwell.
About 5 years ago the RAID chip vendor folks were touting RAID 6 as required for sets of 1+ TB drives, since a read fault while rebuilding a failed RAID 5 becomes much more likely as drive capacities increase.
So the first line of defense is RAID storage with health monitoring, then backups (offsite as well).
Bitrot does happen.
When a disk detects a bad block, it will try to read the data from it and move it to a block from the reserve pool. However, the data it reads back might already be corrupt, so you lose data.
Disks do have Reed-Solomon-style error correction built in (the same kind of code par files use), so they can repair some damage, but it doesn't always succeed.
Anyway, what I do for important things, is have par2 blocks that go along with the data. All my photo-archives have par2 files attached to them.
I reckon you could even automate it: have a script that traverses all directories and tries to repair the data if it's broken. If it fails, you get notified.
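A rough sketch of such a script, under several assumptions: a par2cmdline-style `par2` binary is on the PATH, it exits with status 0 when `verify` or `repair` succeeds, and recovery volumes carry `.vol` in their file names so that only the index `.par2` files need to be passed in. The function names are hypothetical, and notification is left to the caller.

```python
import os
import subprocess


def find_par2_indexes(root):
    """Yield index files such as photos.par2, skipping photos.vol000+01.par2 volumes."""
    for directory, _subdirs, filenames in os.walk(root):
        for name in sorted(filenames):
            lowered = name.lower()
            if lowered.endswith(".par2") and ".vol" not in lowered:
                yield os.path.join(directory, name)


def run_par2(action, index_path):
    """Run `par2 verify` or `par2 repair`; True means exit status 0 (assumed success)."""
    completed = subprocess.run(
        ["par2", action, index_path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode == 0


def check_tree(root, runner=run_par2):
    """Verify every par2 set, try to repair damaged ones, return sets that stay broken."""
    failed = []
    for index_path in find_par2_indexes(root):
        if runner("verify", index_path):
            continue
        if not runner("repair", index_path):
            failed.append(index_path)
    return failed
```

Run from cron, a non-empty return value from `check_tree("/path/to/photos")` is the cue to send yourself a mail and restore that set from another copy.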
Well, don't worry about that. We can get you back before you leave. (Dr. Who)
First off, make sure you have a separate backup storage volume that doesn't get touched by normal applications and which keeps history. Backup doesn't protect you very much if accidental deletes or application bugs corrupt all your copies within one backup cycle. Use an appropriate backup tool to manage this, where appropriateness depends on your skill and willingness to tinker. You could use something as simple as an rsync --link-dest job, or rsync --inplace in combination with filesystem snapshots, or some backup suite that will store history in its own format.
For bit-rot protection of the stored backup data, make a backup volume using zfs or btrfs with at least two disks in a mirroring configuration (where the filesystem manages the duplicate data, not a separate raid layer). Set it to periodically scrub itself, perhaps weekly. It will validate checksums on individual file extents. If one copy of a file extent cannot be read successfully, it will rewrite it using the other valid mirror. This rewrite will allow the disk's block remapping to relocate a bad block and keep going. The ability to validate checksums is the value add beyond normal raid, where the typical raid system only notices a problem when the disk starts reporting errors.
Monitor overall disk health and preemptively replace drives that start to show many errors, just as with regular raid. Some people consider the first block remapping event to be a failure sign, but you may replace a lot of disks this way. Others will wait to see if it starts having many such events within days or weeks before considering the disk bad.
Warning for all UNIX newbies: that command will reset the file to 0 bytes. Just so you know.
(I've seen some cases when a rookie is setting up a Linux system and people jokingly throw him these "rm -rf /" commands and the poor guy actually ends up wrecking his system.)
We have hundreds of thousands of family pictures and videos we're trying to save
Yes, you've got to save them! Your children will be so thankful for countless extended family diashow evenings!
"Look, here is little Tim vomiting when he was 12 years old! How sweet! -- Another vomiting picture. -- Another one. -- I'll skip the next 11 images, still 12,371 to go after all..."
Add 20% par2 files.
ZFS will detect the bitrot and, given redundancy (a mirror, RAID-Z, or copies=2), correct it. FreeNAS is probably the easier solution for providing that file system to a household.
I'm glad you're bringing this up. I haven't seen any backup software that addresses bitrot. And bitrot does happen, I lost a few pics to it. What I do: I have a monthly script that makes a RAR archive from my pictures directory. RAR checks file integrity but also has "recovery" options that allow you to recover files from a damaged archive (to a point)
{Science sans conscience n'est que ruine de l'âme}
For single disk setups use ZFS with copies=2. You will lose half of your storage, but you will gain error correction. With the default copies=1, you get error detection, but not correction.
If you really want hassle free and safe, it would be expensive, but this is what I would do:
ZFS for the main storage - Either using double parity via ZFS or on a raid 6 via hardware raid.
Second location - Same setup, but maybe with a little more space
Use rsync between them using the --backup switch so that any changes get put into a different folder.
What you get:
Pretty disaster tolerant
Easy to maintain/manage
A clear list of any files that may have been changed for *any* reason (Cryptolocker anyone?)
Upgradable - just change drives
Expense - You can build it for about $1800 per machine or $3600 total if you go full-on hardware raid. That would give you about 4TB storage after parity (4 2TB drives - $800, Raid Card - $500, basic server with room in the case - $500)
What you don't get: Lost baby pictures/videos. I've been there, and I'd pay a lot more than this to get them back at this point, and my wife would pay a lot more than I would..
Your current setup is going to be time consuming, and you're going to lose things here and there anyway.. If you just try to do the same thing but make it a little better, you're still going to have the same situation, just not as bad. In this setup you have to have like 5 catastrophic failures to lose anything, sometimes even more..
Boy you must be a real hit at parties.
WinRAR isn't perfect, but it works on a number of platforms, be it OS X, Windows, Linux, or BSD. This provides not just CRC checking, but one can add recovery records for being able to repair damage. If storing data on a number of volumes (like optical media), one can make recovery volumes as well, so only four CDs out of a five CD set are needed to get everything back.
It isn't as easy as ZFS, but it does work fairly well for long term archiving, and one can tell if the archive has been damaged years to decades down the road.
Warning for all UNIX newbies: that command will reset the file to 0 bytes. Just so you know.
(I've seen some cases when a rookie is setting up a Linux system and people jokingly throw him these "rm -rf /" commands and the poor guy actually ends up wrecking his system.)
I think the general consensus is that if you're stupid enough to run a command you got from SomeRandomInternetAsshole420 without verifying what it will do first, you deserve to have your system wiped.
An enigma, wrapped in a riddle, shrouded in bacon and cheese
BTRFS and ZFS both do checksumming and can detect bit-rot. If you create a RAID array with them (using their native RAID capabilities) they can automatically correct it too. Using rsync and unison I once found a file with a nice track of modified bytes in it -- spinning rust makes a great cosmic ray or nuclear recoil detector. Or maybe the cosmic ray hit the RAM and it got written to disk. So, use ECC RAM.
But "bit-rot" occurs far less frequently than this: I find that on a semi-regular basis my entire filesystem gets trashed (about once every year or three). This happened to me just last week... my RAID1 BTRFS partitions (both of them) got trashed because one of my memory modules went bad. In the past I've had power supplies go bad causing this, or brownouts, and in other cases I never identified the cause. I've seen this happen across ext3, jfs, xfs, and btrfs so it's (probably) not the file system's fault. In such cases, fsck will often make the problem worse. (Use LVM and its "snapshot" feature to perform fsck on a snapshot without destroying the original.) You'd think these advanced filesystems would have a way to rewind to a working copy (for instance in BTRFS, mount a previous "generation") but this seems to not be the case.
Anyway, btrfs guys, your recovery tools could be a lot better. The COW enables some pretty fancy recovery techniques that you guys don't seem to be doing yet. If you've got a great btrfs or zfs recovery technique, please reply and tell us.
1^2=1; (-1)^2=1; 1^2=(-1)^2; 1=-1; 1=0.
And yet, one of FLOSS's selling points is our great community support...
You do not have a moral or legal right to do absolutely anything you want.
There's really no way around it. Storage media is not permanent. You can store your important stuff on RAID but keep the array backed up often. RAID is there to keep a disk*N failure from borking your production storage and that's it. If you can afford cloud storage, encrypt your array contents (encfs is good) and mirror the contents with rsnapshot or rsync to Amazon, Dropbox, a friend's RAID array, whatever. SATA drives are cheap enough to keep a couple sitting around to just plug in and mirror to every weekend, but you'll probably find a friend's cable modem and rsync+ssh a very handy alternative (hint: check out the --bwlimit option) when run from cron.
Join the Slashcott! Feb 10 thru Feb 17!
"We'd love it if the file-system could detect this and try correcting first, and if it couldn't correct the problem, it could trigger the restoration. But that only seems to be an option for RAID type systems, where the drives are colocated."
If you have ~2TB of irreplaceable memories, set up a NAS with a RAID array. Whilst bit-rot can be detected, it can only be corrected if the file system knows what the bits should have been. To this end BTRFS and (my recommendation) ZFS can be set to scan all data once a week/month etc. and, using the redundant data in the RAID array, correct the 'bit-rot'.
I have an Intel Atom board in an old case with 4 drives (2x 500GB mirror and 2x 1TB mirror). I have FreeNAS on this. It is powered on every night by wake-on-LAN, backs up any new data and gets shut down. Once a week it backs up new data and then runs 'zpool scrub', which checks for bit-rot or inconsistencies in the file system and corrects any that are found (it can email you a warning if you want as well). This way if any files get damaged on a home PC/laptop etc., any user can turn on the NAS and recover their files from the shared folder.
One point of warning: ZFS is RAM hungry, so 4GB is the minimum. Something to keep in mind when eBaying for an old PC to use. Others will also point out that file transfers are ~20-30MB/s with a low-powered Atom, so use something with more grunt if it's to be a 24/7 NAS.
This is a big deal in digital movie preservation. There will be a cloud solution based on Swift open source available in the next couple of months.
A two-disk RAID1, or a RAID5, theoretically ought to be able to detect when there's corruption, but shouldn't be able to correct it. If you've got two different data values, you don't know which one is right.
But it occurs to me: RAID6 (or three-or-more disk RAID1) really ought to be able to correct. Imagine a three-disk RAID1: if two disks say a byte is 03 and one disk says 02, then 03 is probably right. RAID6, similarly, has enough information to be able to do the kinds of repairs that you could do with par2.
It'd be cool to find out this is already in the kernel's md device. Probably not yet, though.
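The three-copy voting idea is easy to sketch in a few lines. This is only an illustration of the majority rule described above, applied to in-memory byte strings; it is not how md or any RAID implementation works, and the function name is made up.

```python
from collections import Counter


def majority_repair(copies):
    """Vote byte-by-byte across equal-length copies.

    Returns (repaired_bytes, unresolved_offsets). An offset is unresolved
    when no value is held by a strict majority of the copies.
    """
    if len({len(c) for c in copies}) != 1:
        raise ValueError("all copies must have the same length")
    repaired = bytearray()
    unresolved = []
    for offset, column in enumerate(zip(*copies)):
        value, votes = Counter(column).most_common(1)[0]
        if votes * 2 <= len(copies):
            unresolved.append(offset)
        repaired.append(value)
    return bytes(repaired), unresolved
```

With two copies that disagree, every differing offset comes back unresolved, which is the two-disk RAID1 problem: you can see the mismatch but not which side is right.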
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Make a torrent of your stuff, spread the copies around, use a private tracker, force a recheck on any file you think is corrupt and let the swarm do its thing.
I have been going through this issue myself. In a single weekend of photo and video taking, I can easily fill up a 16 gig memory card, sometimes a 32 gig. About 10 years ago I lost about two years' worth of pictures due to bitrot (i.e., my primary failed, and the backup DVD-Rs were unreadable after only a year - I was able to recover only a handful of photos using disc-recovery software). Since then, I kept at least three backups, reburning discs every couple of years. But if I can fill up two BD-Rs in a weekend, and given the high price of media, that wasn't an option. Extra hard drives?
I finally realized the best way was just to get a Carbonite account. They are about $70 a year for unlimited encrypted storage space (if you are really anal, I guess you could always put things into TrueCrypt encrypted file containers and upload them). The worst part is how long it takes to do a backup on a residential broadband line (it would also suck if your ISP has data caps). It has taken me about 2 weeks to do half a terabyte.
The deal is, the peace of mind that comes from this is huge, and it is cheaper than buying another harddrive.
Yes, I know that is not the question you asked, but I feel like it is a much more practical alternative. I mean, as I continue backing stuff up, I am sure I will pass a terabyte. How much are you going to pay for discs, for hard drives? Then trying to keep them safe and secure, and having to worry about bitrot?
Seriously, I've lost family pictures and videos before even though I had backups, and it sucked. Do yourself a favor and get a cloud backup. Yeah, it may take a while to do your backups and restorations, but it is worth it.
M-DISC:
DVD format presently, BLU-RAY format in the future. Someday an electronic eye will just be able to look at the disc surface and see it all in one snapshot.
They aim for 1000 years. I expect 100. It may be reasonable. Just keep drives around.
http://www.mdisc.com/proving-ground/
WARNING: DO NOT RUN ANY COMMAND IN THE PARENT, THIS COMMENT OR ANY OF THE SIBLING COMMENTS.
You really suck at being an asshole too, the right command for destroying files and being innocently obfuscated is:
dd if=/dev/zero|pv|dd bs=1024 count=$(ls -s 'filename'|awk '{print $1}' of='filename'|openssl sha1
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion. -- Spazmania (174582)
Obviously. Since your data isn't constantly changing, and is videos and photos, M-Disc is the ideal solution.
It might be overkill, but the open source backup software Bacula has a verify task, which you can schedule to run regularly. It can compare the contents of files to their saved state in backup volumes, or it can compare the MD5 or SHA1 hashes which were saved in the previous run. I assume other backup software has similar features.
We have hundreds of thousands of family pictures and videos we're trying to save using this advice. But in some sparse searching of our archives, we're seeing bitrot destroying our memories. With the quantity of data (~2 TB at present),
As the proud owner of dozens of family photo albums, a stack of PhotoCDs etc which rarely see the light of day, the bigger challenge is whether anyone will ever voluntarily look at those terabytes of photos. Having been the victim of excruciating vacation slide shows that only consisted of 40-50 images on a number of occasions (not to mention the more modern version involving a phone/tablet waving in my face), I can only imagine the pain you could inflict on someone with the arsenal you are amassing.
How are you cataloging 2TB of media?
I'd suggest git-annex for this. It can do pretty much exactly what you ask: periodically scan all files in a repository, determine if any are corrupt, attempt to repair them, and (if necessary) restore from a remote backup.
Don't forget the old-fashioned method: make archival prints of your photos and spread copies among your relatives. Although that isn't practical for "hundreds of thousands", it is practical for the hundreds of photos you or your descendants might really care about. The advantage of this method is that it is a simple technology that will make your photos accessible into the far future. And it has a proven track record.
Every other solution I've seen described here better addresses your specific question, but doesn't really address your basic problem. In fact, the more specific and exotic the technology (file systems, services, RAID, etc.) the less likely your data is to be accessible in the far future. At best, those sorts of solutions provide you a migration path to the next storage technology. One can imagine that such a large amount of data would need to be transported across systems and technologies multiple times to last even a few decades. But will someone care enough to do that when you're gone? Compare that to the humble black-and-white paper print, which if created and stored properly can last for well over a hundred years with no maintenance whatsoever.
Culling down to a few hundred photos may seem like a sacrifice, but those who receive your pictures in the future will thank you for it. In my experience, just a few photos of an ancestor, each taken at a different age or at a different stage of life, is all I really want anyway. It's also important to carefully label them on the back, where the information can't get lost, because a photo without context information is nearly meaningless. Names are especially important: a photo of an unknown person is of virtually no interest.
Sorry I don't have a low-tech answer for video, but video (or "home movies", as we used to call it) will be far less important to your descendants anyway.
A family archive maintained by the "tech guy/gal" in the family is also subject to failure from death or disability of the aforementioned maintainer. Any storage/backup solution should therefore be sufficiently documented (probably on paper, too) that the grieving loved ones can get things back after a year or two of zero maintenance and care of the system. That would also imply eschewing home-brew type systems in favor of using standard tools so a knowledgeable tech person not familiar with the creator's original design can salvage things in this tragic but possible scenario. Document the system so even if the family can't do it themselves, and an IT guy has to be contracted to resurrect the data, he'll have the information needed to do so.
Any system sufficiently dependent on regular maintenance by just one particular person is indistinguishable from a dead-man time-bomb.
I am not a crackpot.
100,000s -- like 300,000? More? How many of them will you actually ever look at again? Less than 1% I'm guessing. Here's my advice (and it's what I do). Step 1) When transferring pics to your computer, delete the ones that are out of focus, badly lit, framed poorly, etc. This is about 15%. Step 2) Once a month, go through the photos you have taken the previous month and delete those that just don't mean as much anymore (if they have decreased in emotional value in 30 days, just think how utterly worthless they would be in 5 years?). This takes care of another 30%. Step 3) Once every 3 months, my wife and I pick the cream of the crop for physical prints. This is about 10%. These are stuck into photo albums, labeled and kept in a fireproof safe in our basement. So 200 photos a month gets reduced to ~100, and then 10 per month are printed. YMMV
Convert photos to DNG in Adobe Lightroom and use the ability for it to check for file changes. Store on a Drobo with dual disk redundancy.
I work next to a moving and storage company. Occasionally the dumpster out back can be found unceremoniously overflowing with the contents of a forgotten storage locker. Anything of value has been teased out - you know what gets tossed? Everything else, especially photo albums, trophies, diplomas, etc.
“What is most personal is most general”— Carl Rogers
but there is a catch: to reliably detect bit-rot and other problems, you also need server-grade hardware with ECC.
ZFS (especially when your dataset-size increases and you add more RAM) is picky about that, too.
Bit-rot does not only occur in hard-disks or flash.
You should really, really take a hard look at every set of photos and select one or two from each "set", then have these printed (black and white, for extra longevity).
If this results in still too many images, only print a selection of the selection and let the rest die.
Windows 2000 - from the guys who brought us edlin
The solution to Bitrot and reading of old media is very simple and honestly I don't know why it comes up so much. Storage is DIRT CHEAP. 2TB of Data is NOTHING, you can get a 3TB+ external drive for $100 or even less on sale. Buy 3 drives, keep 1 in SAFELOCATION*, Back up to 1 drive every even week, and the second one every odd week, and once a month swap the one in the SAFELOCATION out for a local one and repeat the cycle. Increase or decrease frequency of SAFELOCATION swapping depending on level of paranoia.
There, the problem is simply and very cheaply solved and there is no level of bit rot that is going to cause all 3 of these backups to be destroyed within a 1 month time window.
* where SAFELOCATION is a off-premise location, either a close friend's house or a locked office desk or a family member's house or a safe deposit box
WARNING: DO NOT RUN ANY COMMAND IN THE PARENT, THIS COMMENT OR ANY OF THE SIBLING COMMENTS.
Unless you are working on the NSA's main database. Then you should run these commands several times, just to be sure the backup is complete. Then take a sledgehammer to the original files, for security. And restore from the backup, to guarantee the backup worked.
Book a flight to Moscow first though
And yet, one of FLOSS's selling points is our great community support...
Every community with a notable population size is going to have its share of bad actors.
Besides, ever since you were a kid you've been taught to not trust strangers based on their word alone.
Your comment is stupid, ignorant, and presumptuous. It's almost beyond belief.
1: Why do you think the OP is "obsessing" about the pictures?
2: RE: "once the people in the pictures are dead"
It sounds like you're assuming they're all alive.
You are quite mistaken if you think that people lose interest in their parents' (or children's) pictures after their death.
Also, are you assuming that no one has relatives with historical significance?
Suppose the original poster is a relative of Franklin D Roosevelt, or Elvis Presley.
3: Quit being so ...
How do you know how much time they are spending on their archive?
4: "enjoy life while you are still ALIVE"
How do you know what our level of enjoyment is?
I'll give you a hint, child, some of us are well off and don't have to work.
What you don't know is that we're a very happy lot, and when we're not traveling in Europe, Asia, Australia, etc, we may spend some time fooling around with our photos.
again,
Your comment is stupid, ignorant, and presumptuous. It's almost beyond belief.
We wrote our own parallel filesystem to handle just that. It stores a checksum of the file in the metadata. We can (optionally) verify the checksum when a file is read, or run a weekly "scrubber" to detect errors.
We also have Reed-Solomon 6+3 redundancy, so fixing bitrot is usually pretty easy.
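Real Reed-Solomon (presumably six data blocks plus three redundancy blocks here) needs finite-field arithmetic and is too long to show. As a much simpler stand-in, single XOR parity illustrates why stored redundancy makes a known-bad block easy to rebuild. It tolerates only one lost block per set, and it assumes a checksum has already told you which block is bad. All names are illustrative.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(result):
            raise ValueError("blocks must be the same length")
        for index, byte in enumerate(block):
            result[index] ^= byte
    return bytes(result)


def make_parity(data_blocks):
    """Compute one parity block for a set of data blocks."""
    return xor_blocks(data_blocks)


def rebuild_block(data_blocks, parity, lost_index):
    """Recompute the block at lost_index from the surviving blocks and parity."""
    survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])
```

Reed-Solomon generalizes this so that, in a 6+3 layout, any three blocks of the nine can be lost and the data still recovered.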
ZFS has proven that a wide variety of chipset bugs, firmware bugs, actual mechanical failure, etc are still present and actively corrupting our data.
And I expect that defragging aggravates this. Read a perfectly good block of data from disk into flaky RAM, have a bit flip, and write out that corrupted data to its new location. Even if the software is verifying, it's likely to verify against RAM, and it did successfully write what is in RAM.
And then there is overclocking. If a computer is just used for gaming, no problem. But if it's used for more serious things or archiving things of value to you, then you may want to pass on overclocking. Folks who say you can verify an overclocked CPU are mistaken. It's not a crash or no crash thing: at a certain unpredictable point in overclocking, an unpredictable CPU instruction may simply give an incorrect result. This incorrect result could end up in your data or image. I've seen overclocked CPUs mess up a text string that is supplied by the CPU itself, CPUID's vendor string.
As other people have mentioned, a lot of these errors can occur while you are actually copying the files. I have copied files and immediately executed md5sums on the source and dest files only to find differences. Unfortunately, I didn't start this practice until after I had to restore from backup only to find that some of the backup files were corrupted.
And given that this seems to be a common problem, why in the holiest of hells does the cp command not have a verify option? Yeah, it's easy enough to wrap the copy command with md5sums, but a verify option would be even easier. Throw in an auto-retry function on top of that and you'd be really cooking.
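Wrapping the copy with checksums, plus the auto-retry wished for above, might look like this minimal sketch. The function names are hypothetical and SHA-256 is used where the comment mentions md5sum.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(source, destination, attempts=3):
    """Copy source to destination, then re-read the copy and compare checksums.

    Retries up to `attempts` times and returns the verified digest.
    Raises OSError if no attempt produces a matching copy.
    """
    expected = sha256_of(source)
    for _attempt in range(attempts):
        shutil.copyfile(source, destination)
        if sha256_of(destination) == expected:
            return expected
    raise OSError(f"could not verify copy of {source!r} after {attempts} attempts")
```

One caveat: right after a copy, the re-read may be served from the operating system's cache rather than the disk, so a match shows that the written data was right, not that the medium stored it correctly. Checking again later, or from another machine, covers that gap.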
By the way, the submitter did not mention the current method of backup, but if they are using Linux with the cp command, they would be better served by moving over to something like rsync.
I've used ZFS under Linux for 5 years now for exactly this sort of thing. I picked ZFS because I was putting photos and other things on it for storage that I wasn't likely to be looking at actively and wouldn't be able to detect bit-rot until it was far too late. ZFS has detected and corrected numerous device corruption and unreadable-sector issues over the years, via monthly "zpool scrub" operations.
I have been backing these files up to another ZFS system off-site. But now I'm starting to look at other options because it's looking like I can begin doing it more cheaply than even my free hosting of a box I bought can provide.
Amazon Glacier reduces the cost of S3 storage by an order of magnitude, making 2TB of storage cost around $20/month. For a backup copy, it's hard to compete with this, even just buying a USB drive to stick somewhere... You do have to be careful about recovery though, they charge based on peak download speed (a very weird pricing).
The 'simplest' things in life frequently turn out to be the most complicated- at least in terms of the knowledge required to manage the system properly. And often, the complicated mathematical analysis provides very simple solutions- which seems like a paradox.
Look to how Google handles data for ultimate answers on current state-of-the-art storage systems. Google uses commodity equipment, with custom engineering approaches to managing aspects like expected errors and failures. Data loss occurs in various predictably unpredictable ways.
-check that the data is stored CORRECTLY in the first place. Sadly many systems will 'write' data, never ensuring the 'write' process occurred without fault.
-use systems like PAR2 to add maybe 10% redundancy information to allow small bit and block errors that occur later to be repaired. And NO, NO, NO- you do not need to strain your brain to wonder how systems like PAR2 work- just USE IT.
-store data you CANNOT afford to lose in more than one place, even if you cannot ensure that multiple copies are of the same generation. Recovering MOST of your data, in an older version, is far better than recovering none. You can worry about 'synchronisation' issues later, if at all.
-know that the more complex and painful your data protection method, the more likely you are NOT to use it properly, or at all. KISS (keep it simple, stupid) will ensure you take the steps to protect your data.
-if you believe 'bit rot' to be real (it isn't in the sense you suggest), you have no choice but to periodically check all your data, hopefully using a system like PAR2 to correct errors and rebuild the PAR sets. If you don't check it, someone else would have to (a service YOU would end up paying for one way or another), since there is NO way for data to passively check itself without the need to read and 'process' computationally (checksum test, PAR test, etc).
Treating all your data as of equal value will ensure lowest common denominator thinking- and you don't want this. Using PAR2, and multiple storage locations is a good enough passive defence for most data. The stuff you are paranoid about, you should periodically check and renew.
But like I said at the top, the statistical maths behind data protection is far more complicated than you might imagine. So, simply use the best working practices available from those entities with a REAL interest in doing the job properly. Anyone who says "use tape" or "only buy enterprise storage hardware" or "you must use RAID" can be safely excluded from your list of advice givers.
git annex is an open source project that lets you distribute files around various media (including external HDs, Amazon S3, SSH-connected computers, etc.). It has an fsck command for checking that your data still matches its checksums.
There's a GUI interface that makes it a lot like Dropbox, where you just add files to a folder, and they are sync'd.
It works on OS X and Linux, with an alpha for Windows.
-- rm -rf / tells you if you have root or not
I never archive any significant amount of data without first running this script at the top:
find -type f -not -name md5sum.txt -print0|xargs -0 md5sum >> md5sum.txt
Which is useful for finding errors, but not for fixing them. If the information is relatively important, you may want to check out parity archives:
https://en.wikipedia.org/wiki/Parchive
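For anyone who wants the "check it again later" half of the md5sum approach, here is a rough Python sketch. The function names (`file_md5`, `build_manifest`, `verify_manifest`) are made up for illustration, and like the one-liner it only detects damage; it does not repair anything.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, manifest_name="md5sum.txt"):
    """Map each file path (relative to root) to its MD5, skipping the manifest file."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == manifest_name:
                continue
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_md5(full)
    return manifest


def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or no longer match."""
    bad = []
    for rel_path, expected in manifest.items():
        full = os.path.join(root, rel_path)
        if not os.path.isfile(full) or file_md5(full) != expected:
            bad.append(rel_path)
    return sorted(bad)
```

Build the manifest once at archive time, keep a copy of it somewhere other than the archive, and run `verify_manifest` on whatever schedule you can tolerate.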
Research into Distributed Fault-Tolerant Filesystems has been going on for at least 40 years, with implementations flourishing since the advent of Ethernet and similar technologies. There are lots of options out there!
There are some fundamental things to consider:
1. All fault-tolerance requires redundancy. I'd recommend biting the bullet and going for full replication (redundant copies).
2. The copies should not be co-located (the real meaning of "distributed" in this context).
3. You should not trust the network: Not all copies will always be accessible simultaneously.
4. Updates should not be permitted unless a quorum (50%+1 or more) of replicated systems checks in and agree on the data content.
5. Updates should permeate all copies in the background.
6. Read-only access may be much more permissive, requiring as few as 2 or 3 replicates to be accessible.
7. History (repository-like) performance (including "undo") is often a desirable option.
The above is "Armageddon-grade" if there are at least 6 replicates with at least 3 wholly redundant networks. For basic reliable archiving, 3-4 copies on the Internet should be plenty good, depending on the system chosen.
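To make the quorum rule in point 4 concrete, here is a tiny Python sketch. It assumes each reachable replica reports a content hash; `quorum_size` and `update_allowed` are hypothetical names, not taken from any particular filesystem.

```python
from collections import Counter


def quorum_size(total_replicas):
    """Smallest strict majority: 2 of 3, 3 of 4, 4 of 6."""
    return total_replicas // 2 + 1


def update_allowed(total_replicas, reported_hashes):
    """Allow an update only if a quorum of ALL replicas agrees on the content.

    reported_hashes holds one content hash per replica that checked in;
    unreachable replicas simply do not appear in the list.
    """
    if not reported_hashes:
        return False
    _, votes = Counter(reported_hashes).most_common(1)[0]
    return votes >= quorum_size(total_replicas)
```

With 6 replicas, four matching reports are enough; three matching reports plus one dissenter are not.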
dd if=/dev/cdrom of=/dev/null bs=512 (or a convenient multiple thereof)
Works for tapes, too.
I've been using this to verify my media for 30 years.
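A rough Python equivalent of that read-everything pass, in case dd isn't handy. `read_through` is a made-up name. Like the dd version, it only surfaces errors the drive itself reports (unreadable sectors); it cannot notice data that reads back fine but is wrong.

```python
def read_through(path, block_size=512 * 128):
    """Read a whole file or device and discard the data.

    Returns (bytes_read, error), where error is None if the end was
    reached cleanly, or the OSError raised by the failed open or read.
    """
    total = 0
    try:
        # buffering=0 gives raw, unbuffered reads of up to block_size bytes.
        with open(path, "rb", buffering=0) as handle:
            while True:
                block = handle.read(block_size)
                if not block:
                    return total, None
                total += len(block)
    except OSError as exc:
        return total, exc
```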
If only I had a lawn to show for it!
~childo
CAPTCHA: 'accrue'
Here's a cheap easy solution (assuming you can write some basic scripts)
1. Start by taking an MD5 of all your pics. Save the results.
2. Backup everything to a 2nd drive. Take MD5s and be sure they match using basic scripts.
3. Periodically scan drive 1 and 2 and compare against their expected MD5 value. If one has changed, copy it from the other (assuming it is still correct).
You could expand this with more drives if you are extra paranoid. You could do this cheap, check regularly, and know when bitrot is happening.
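Step 3 could be sketched roughly like this in Python. `md5_of` and `check_and_repair` are illustrative names, and the sketch assumes both copies exist and that the saved MD5 from step 1 is trustworthy.

```python
import hashlib
import shutil


def md5_of(path):
    """MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_and_repair(copy_a, copy_b, expected_md5):
    """Compare both copies to the saved MD5 and overwrite a bad copy from a good one.

    Returns "ok", "repaired_a", "repaired_b", or "both_bad".
    """
    a_ok = md5_of(copy_a) == expected_md5
    b_ok = md5_of(copy_b) == expected_md5
    if a_ok and b_ok:
        return "ok"
    if a_ok:
        shutil.copyfile(copy_a, copy_b)
        return "repaired_b"
    if b_ok:
        shutil.copyfile(copy_b, copy_a)
        return "repaired_a"
    return "both_bad"  # needs a third copy or manual attention
```

A "both_bad" result is the case for the extra drives: with only two copies there is nothing left to restore from.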
Ninjas don't carry tic tacs
I think that when writable CDs first came out, we thought that they would last forever. And in some sense they do last long enough. The other day I found a CD binder full of games and a few backups from 1996. The most surprising of all was a collection of photos that I thought had been long lost, and with a little rsync running over and over and over, I got all the files off intact and saved them to my Flickr account.
The most important thing to understand, I think, is that we have to look at digital storage as a convenient and temporary medium and that anything longer lasting would need to be hard copied. It’s not a guarantee, but it’s a better likelihood of survival. Pictures can survive by pure chance for a couple hundred years. We’re lucky if our current stuff will handle a few years, much less natural disasters and history itself.
For many, the cloud seems to be a utopia, but corporate and national politics can make all your treasured media disappear without warning, and none of the free services give you a guarantee of safety if something craps out on their systems. And as for paid cloud services, ask yourself if anyone will bother to take care of it after you’re gone, or if anyone will bother to archive it, or if your family will just toss it aside even if they are able to get them as part of your estate. Ask yourself who you’re saving all that for. Are we just digital hoarders?
"Beware of he who would deny you access to information, for in his heart, he dreams himself your master."
I think running "rm -rf /" is a rite of passage.
I did it on a community web server (Suse on a Solaris SPARCstation 5) about ten years ago, hit CTRL+C within a few seconds once I realised what was happening. But it was too late, the filesystem was toast. Luckily the htdocs was still complete!!!!
Learning Linux the "hard way" is sometimes essential, and I sure as hell learnt my lesson about the dangers of being root, and the importance of backups.
That might actually take more work than just backing them all up properly.
If you're noticing data corruption on only 2TB it's probably not what we normally call bit-rot. A bit that changes state for no apparent reason within a very large set of data can be described as bit-rot, otherwise it's general data corruption which has many causes which all are understood: Poor media, poor transmission of data, overwriting of data etc. Once you've got the system sorted out so you don't get data corruption, start thinking about the nature of your data. How much redundancy is in it? If it's jpegs then almost none, so a single bit error could be serious to a file. If uncompressed TIFFs then there is a lot of data redundancy and the single bit error might only be an error of a single pixel, which you might not even notice. And finally, don't expect optical media to be safe from errors. Only use it as part of a DR plan.
Snapraid (free!) might be an option: http://snapraid.sourceforge.net/
It snapshots your data to some parity files on a separate drive. All you would have to do is occasionally copy those files offsite. Snapraid includes commands that allow you to check and fix bitrot as well.
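If the idea of "parity files" is unfamiliar, here is a toy Python illustration of single XOR parity. This is NOT how Snapraid lays out its data (its real format and algorithms are more involved); it only shows why one extra block lets you rebuild any one lost block. `xor_blocks` and `recover_block` are names invented for this sketch.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together and return the result."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    parity = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

For example, with data blocks A, B and C and parity P = A xor B xor C, losing B still leaves A xor C xor P, which equals B. Losing two blocks at once is beyond what a single parity block can fix.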
CrashPlan could help you a lot. First, CrashPlan is a backup system, so it makes and manages a copy of your data, including every version of every file. CrashPlan addresses the bitrot problem on their side by running their own checksums on the stored files : if they detect an issue with a stored file, they will replace it with the original version, still stored on their computer. If some files get corrupted on your computer, you can restore them from CrashPlan, but you will need something on your side to tell you that something went wrong. Now, even if you realize that the file is corrupted years after it happens, you can still recover the previous non-corrupted version from CrashPlan.
Now, 2TB is a bit much to store on CrashPlan's cloud : unless you have a very fast connection (at least 100MB) it's going to take you a while to upload your data. The solution is to run your own CrashPlan PRO Enterprise server onsite (with periodic offsite backups of course). Don't be fooled by the name, it's pretty easy to set up and administer, and the licenses are fairly affordable ($75/user/year).
I've been supporting CrashPlan PRO Enterprise in my company for 3 years, with 25 clients and about 1TB of data. While I'm not super-happy with the way the Code42 people run their CrashPlan business, the tech is solid. I'm kind of thinking that other backup systems work in similar ways.
Now, I hope that you'll excuse me for asking this question, but which kind of crappy file systems and hard drives are you using that generate significant levels of "bitrot" in files which are basically just sitting there?
Nobox: Only simple products.
Coincidentally, I have been studying this subject for a while...
Bottom line: All methods will fail eventually. Your best bet is to have multiple layers of protection.
It's not really a matter of which media holds its data longest without rotting. All media will fail eventually. Rather, the question is whether you can check your data often enough and copy it to new media before the media or data becomes unusable.
You have a small amount of data, so spinning disk is your best and really the cheapest option here!
For small amounts of data, e.g. 2TB, RAID and an error-checking file system is still just fine. For much bigger amounts of data, you'll want to get rid of those RAIDs and step up to JBOD setups stacked with an error-checking distributed filesystem. And even then you will need to verify your file checksums regularly.
Keeping three copies is always the minimum for successful voting on the good version. In addition, it never hurts to keep extra metadata, like checksums, on separate media, as it takes virtually no space.
And what else is there after bit rot ... you may always "accidentally just delete" all your precious files, or your house (and all media) can burn to ashes, or cloud providers might shut down their services without warning ... you can never be sure. So always keep copies at several different locations / providers.
Rough calculations like MTTR (mean time to recovery) may help in estimating the I/O capacity needed to checksum-check and refresh your media often enough. With modern hardware, 2TB is very easily copied within hours. Faster RAID systems achieve 1,000MB/s throughput, so the transfer would take only about 30 minutes!
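The back-of-the-envelope arithmetic, as a Python snippet you can play with. It uses decimal units (1 TB = 1,000,000 MB) and assumes sustained throughput, which real disks rarely deliver for the whole run; `transfer_minutes` is just an illustrative helper.

```python
def transfer_minutes(size_tb, throughput_mb_per_s):
    """Minutes needed to read or copy size_tb terabytes at a sustained rate."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / throughput_mb_per_s / 60


fast_raid = transfer_minutes(2, 1000)   # roughly 33 minutes
single_disk = transfer_minutes(2, 100)  # roughly 333 minutes, about 5.6 hours
```

The 100 MB/s figure for a single disk is only an assumed round number for comparison.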
So just copy your data often to new media and do those checksums. There's really no magic wand...
And probably nobody has told you about migrating legacy file formats to modern formats... but that's just another story in this preservation case...
An interesting option could be some sort of ZFS-based storage appliance. ZFS provides off-the-shelf bit rot protection using checksums at block and tree level, which are periodically scrubbed and repaired.
While RAID setups are more common with ZFS, nothing stops you from setting up a filesystem with internal redundancy (copies) inside the same vdev (let's say a disk). That would not survive the failure of an entire drive, but you can set up a RAID 1 mirror set for that. 2 TB would be easily manageable even with consumer disks nowadays.
There is also rsbep, which uses Reed Solomon FEC. This is a classic filter, so you can use it together with tar, gzip and gpg to protect archives against NSA snooping and bit rot simultaneously.
Something like:
$ tar -cz indirectory | rsbep | gpg -e > out.tar.gz.rs.gpg
La voila!
Excuse me, but please get off my Pennisetum Clandestinum, eh!
Got to carve those pics in stone, in Egypt, else nobody will care about them later.
Some highlights:
When ZFS dies, it dies in a big and fairly comprehensive way, and ZFS will die if you under-provide it. In any event, you should RTFM before contemplating a build, and know the trade-offs you're getting in to.
Schwab
Editor, A1-AAA AmeriCaptions
http://git-annex.branchable.com/
Try again, but this time with subdirectories
PAR2 with subs: Multipar and alternate
I've been using it for well over a year, it works great. Was using this for a while -- it's OK, but Multipar is much better.
Or just continue to use PAR on single directories with subs placed in some type of archive (zip, 7z, tar) file.
None of these holds a candle to ZFS as a live file system, but these all work great when archiving files to DVD/BD.
Heck, I'm currently copying multiple dirs to BD and using Multipar as "only" a checksumming and renaming repair tool -- not even bothering with the file content recovery option. For that matter, I've even created a (single) disc with 300% recovery -- if I lose all of the primary files and over half of the recovery content bits, I can STILL recover the contents. (I've tested this by manually damaging the file contents. I have multiple copies in different places, too -- there are just a few static files that I do *NOT* want to lose.)
If the universe is someone's simulation -- does that mean the stars are just stuck pixels?
www.synology.com/ + some sort of amazon glacier/backblaze-like backup.
This would give you room for expansion and a pretty neat offsite for a reasonable price.
I've had a DS1512+ for over year without it faltering and it took all of 15 minutes to shove some disks in, configure the webpage interface and back it up.
Worst thing that will happen is the Synology will completely die and you'll have to recover data from online.
You might try backing up with http://www.mdisc.com/what-is-mdisc/ I've been using them since they came out and all my backups still work. They are supposed to last a thousand years. I don't know about that, but they do seem to be better than backing up to regular DVDs, which I have had go bad in as little as a year.
No sigs in BETA. Beta SUCKS.
$15 TB / Month
http://lts2.evault.com/how-it-works/
A rite of passage? You must be joking! I've never met anyone stupid enough to have actually run that command with those params. The first time someone tried that on me, I did a 'man rm' and looked at the doc. I always thought that was the lesson: RTFM.
I want my cake sitting in the middle of the table, pretty as can be.
I want to enjoy the wonderful taste sensations as well.
It would be really great if someone could eat the cake for me by proxy, but still let me enjoy the taste. With it still sitting in the middle of the table looking as pretty as can be.
I didn't do it because someone told me to. I did it as a mistake.
Are you telling me you have never made any mistakes in life? Sheesh!!
And you think anyone is going to care to look at your massive collection of family photos by the time bit rot sets in?
Did you ever think that perhaps it would be smarter to keep backups of say 10 important life-shattering moments, instead of every 30 seconds of everyone's lives in your extended mega family?
Do what I did and build up a FreeNAS server using an HP microserver, at least 8 gigs of RAM and 4 2TB drives configured as RAID-Z. You could put this together for about $800. The HP microserver supports ECC ram which you really do want. Install FreeNAS on a USB stick and boot off that. Set it up for weekly scrubs.
Then, because multiple redundant *and* offsite backup is the only way to feel truly safe (a RAID array, even with ZFS, does nothing to protect you from fire or theft), I backup as follows... I have 2, 3TB ESATA drives (i.e. external drives) configured as a mirrored pool (so far, I don't have a need for more than 3TB of backup). For an initial backup, copy your files from the RAID-Z to the mirror generating an MD5 checksum as you go. To save time, you can generate the md5 checksum at the same time you do the filecopy (as opposed to reading the file once for the copy, and then reading it again to generate a checksum) doing something like this...
cat source | tee destination | md5 >> checksums
Note, if you do the file copy and immediately read back to compute the destination checksum, you will no doubt be reading from the cache instead of disk which means you can't be truly sure the bits made it on the disk correctly. I never could figure out how to purge the ZFS cache after a copy, so my solution is just to make a list of files to copy, copy each file while generating a checksum, then, starting at the beginning of the file list, go ahead and generate a checksum for each destination file. The first file of many that you wrote to the backup pool shouldn't be in the cache anymore at that point, the verification read will have to go to disk which is what you want.
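Here's roughly what that copy-everything-then-verify-everything loop looks like in Python, for anyone who would rather not do it with cat and tee. The names are made up. Note that re-reading in the original order only makes it likely, not certain, that the early files have left the cache; that part is a heuristic, not a guarantee.

```python
import hashlib


def copy_with_md5(source, destination, chunk_size=1 << 20):
    """Copy a file and return the MD5 of the bytes read from the source."""
    digest = hashlib.md5()
    with open(source, "rb") as src, open(destination, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            digest.update(chunk)  # hash in the same pass as the copy
            dst.write(chunk)
    return digest.hexdigest()


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of an existing file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_then_verify(pairs):
    """Copy all (source, destination) pairs first, then re-read every destination.

    Returns the destinations whose re-read checksum differs from the one
    computed during the copy.
    """
    checksums = [(dst, copy_with_md5(src, dst)) for src, dst in pairs]
    return [dst for dst, expected in checksums if md5_of(dst) != expected]
```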
But I digress... After backing up to the mirrored pool, execute a zpool export of the mirrored pool, shutdown the server and disconnect the drives. You now have 3 copies of your data, two of which are very mobile. Now, take one drive and put it in a fire safe, take the other drive and store it off site... at work or safe deposit box or whatever. Now, very bad things will have to happen for you to lose data. If your server is stolen or melts, you have two other backups. If your server AND your fire safe are stolen, you have one remaining backup offsite.
The only downside here is that backup is not continuous. If a disaster does happen, you will lose any files since your last backup. But for data that is relatively static, like movies, music, photos, the changes between incremental backups (I do mine monthly) are pretty small.
In sum, I rely on zpool scrub to prevent bit-rot in the RAID-Z pool (and indeed, on the mirror drives as well), and use MD5 to verify when I have to copy bits between distinct pools.
Laptop is the master copy, since that's where I do photo editing. That gets backed up via rsync to a NAS at home. Multiple USB SATA drives at work back up the same data via RoboCopy. Once every few months, I run integrit (http://integrit.sourceforge.net/texinfo/integrit.html) on the laptop then the remote drives. A shell script compares the integrit db output of each drive. If they match, all is good.
I haven't seen bitrot yet with this setup since I started using it in 2008. That covers 2 NAS setups, 4 laptop drives (on 2 separate laptops) and 6 different SATA drives (on 3 different USB bridges).
Would it be a good idea, for each bucket of DVDs (probably 25 or 50 DVDs in each), to make an error correction DVD with dvdisaster?
So if one of the DVDs in the bucket goes bad later, I could use the others together with the error correction DVD to recreate the faulty files. The version of dvdisaster in my Linux distro is 0.72.1-1; should I use that, or some PPA?
...which is why so few people bother.
It's the tragedy of digital photography. Taking photos is cheaper than ever, yet the number of photos actually making it into frames and albums is about as low as it was when most people could only afford to have their relatives photographed after they died.
It's effectively that way in my house. Someone (human or pet) pretty much has to die (or be exceptionally adorable) before I'll bother to print and frame a picture of them.
In terms of historical record, this is a really bad state of affairs. It's the crap we don't consider worth keeping that is the real treasure for historians and archaeologists. The day to day stuff, not the "this is how we'd like to be remembered" stuff.
Sigh ... I'd love to have out-of-focus or poorly framed shots of my grandparents. *DON'T* delete anything. Move them to a "morgue" directory. If you abandon them, they are gone for good. I remember going through old photos looking for shots of old cars my *wife's* grandparents owned, and places where they lived. That is something I will *never* be able to do for *my* grandparents.
I keep multiple copies on local drives and in the cloud. For the local copies (in different locations) I use md5deep and hashdeep to detect bitrot:
http://md5deep.sourceforge.net/
I know it's not really an answer to your question since it's not done, but I started a tool to save and check metadata of files:
https://github.com/shane-kerr/fileinfo
Right now it just outputs a file with all of the meta-data (including SHA-224 hash of the file contents). If you think this seems interesting, I can whip up the part that uses that file to check the meta-data this weekend.
I use the MD5 solution mentioned above, but also back everything up to Amazon Glacier. From what I've read, retrieving your data can be a pain, but storage is only $0.01 a gigabyte per month and they say that they store multiple copies across multiple locations and periodically check for data integrity. If data integrity is lost, they repair it using the other copies. I asked them how often data is checked for integrity and they said:
"Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing. So, to address your first question, we performs checks frequently enough to ensure that we meet our design goal of 11 9s of average annual durability for an archive. In the very unlikely event that it is determined that one of your archives is not recoverable, we would contact you promptly."
What command are you referring to?
When you have a file that has one hash on one side and different hash on other side, you can't know in a script which file is the correct one.
So add a third location of the files and hash them. Now you have two identical hashes and one different, so you know which file has gone bad and needs replacing.
Script all that checking and implement automatic corrections.
Bonus points for doing it on three computers at different locations.
It's a fun weekend project :)
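The voting part of that weekend project is only a few lines of Python. This sketch assumes you already have one hash per location for a given file; `find_outliers` and the location names are hypothetical.

```python
from collections import Counter


def find_outliers(hashes_by_location):
    """Pick the majority hash and list the locations that disagree with it.

    hashes_by_location maps a location name to that copy's hash.
    Returns (majority_hash, bad_locations). If no hash has a strict
    majority, returns (None, all_locations) because no copy can be trusted.
    """
    if not hashes_by_location:
        return None, []
    counts = Counter(hashes_by_location.values())
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(hashes_by_location):
        return None, sorted(hashes_by_location)
    bad = sorted(loc for loc, h in hashes_by_location.items() if h != winner)
    return winner, bad
```

With `{"home": "aa", "work": "aa", "cloud": "bb"}` the cloud copy is flagged for replacement; if all three differ, the script should stop and ask a human.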
I have a pair of 4TB disks that I keep cloned with rsync. Periodically I verify the contents using rsync -c, which forces rsync to do a full checksum on the files. A few times a year this will identify a file that is actually corrupt and I'll manually recover it from the good copy.
I have a home NAS (low electric use) running Ubuntu server. Every night it generates an md5 hash of all files on the drive.
On my primary system I run the same hashing on my primary drives, which takes about 6 hours. I then compare the files to look for changes and fix corruption accordingly.
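The comparison step might look something like this in Python, assuming each side's hash list has already been loaded into a dict of path to MD5. `diff_manifests` is an illustrative name, not part of any tool mentioned here.

```python
def diff_manifests(nas, primary):
    """Compare two {path: md5} mappings and report where they differ."""
    shared = nas.keys() & primary.keys()
    return {
        "only_on_nas": sorted(nas.keys() - primary.keys()),
        "only_on_primary": sorted(primary.keys() - nas.keys()),
        "mismatched": sorted(p for p in shared if nas[p] != primary[p]),
    }
```

A mismatch alone doesn't say which side is the bad one; it takes a saved older hash or a third copy to decide that.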
Bit torrent?
Set up your very own very private tracker(s).
Create a torrent of the file trees to be duplicated and protected on the original host.
Leech it at all the redundant sites.
Wait for them all to complete the download and become seeds.
From time to time, but not all at the same time, force a recheck on each member of the swarm, to detect corruption.
A failure should trigger a download to correct the corrupted block from the swarm.
You can probably get better advice on how to handle a growing archive.
I would probably try to add another torrent of the added files, then
wait for the swarm to download those files.
Then create a new torrent file that includes the old and the new in a single torrent and use that for the next forced recheck cycle.
You probably want to have a few scripts to automate the rechecks and updates.
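What a forced recheck does under the hood is, roughly, hash the file piece by piece and compare against the stored piece hashes, which is why it can pinpoint the damaged block instead of condemning the whole file. A Python sketch of the idea follows; the function names and the 256 KiB piece size are assumptions for illustration, and this is not a torrent client.

```python
import hashlib


def piece_hashes(path, piece_size=256 * 1024):
    """Return the SHA-1 digest of each fixed-size piece of a file."""
    hashes = []
    with open(path, "rb") as handle:
        for piece in iter(lambda: handle.read(piece_size), b""):
            hashes.append(hashlib.sha1(piece).digest())
    return hashes


def damaged_pieces(path, expected, piece_size=256 * 1024):
    """Return indexes of pieces whose hash differs from the expected list.

    Pieces that are missing or extra (file shorter or longer than
    expected) are reported as damaged too.
    """
    current = piece_hashes(path, piece_size)
    bad = [i for i, (now, then) in enumerate(zip(current, expected)) if now != then]
    bad.extend(range(min(len(current), len(expected)), max(len(current), len(expected))))
    return bad
```

Only the pieces listed by `damaged_pieces` need to be fetched again from another copy.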
--
The world is coming to an end, but don't stop seeding
Step 2) once a month, go through the photos you have taken the previous month and delete those that just don't mean as much anymore (if they have decreased in emotional value in 30 days, just think how utterly worthless they would be in 5 years?).
YMMV
I disagree with this. You never know what is going to "mean something" 10 or 20 years from now. When I was in the Army my buddies thought I was weird because I would take pictures of everything. Even stuff that was boring and had no emotional value whatsoever.
Now these days my old buddies want all my pictures from back then. Many times we have discussed something on FB or whatever and I mention that I have a picture of that. They express surprise and ask me to post that picture. We then talk about that thing and how happy they are that I wasted money back then (it was regular film that you had to pay for and pay to develop) on taking a picture of this.
I wouldn't be so quick to delete any pictures just because they are taking up some space. Disk space these days is cheaper and getting cheaper all the time. It's worth the time it takes to back up your data, (start it before you go to bed), not to lose any pictures that you might care about later.