Ask Slashdot: Best File System For the Ages?
New submitter Kormoran writes: After many, many years of internet, I have accumulated terabyte HDDs full of software, photos, videos, eBooks, articles, PDFs, music, etc. that I'd like to save forever. The problem is, my HDDs are fine, but some files are corrupting. Some videos show missing keyframes and some photos are ill-colored. RAID systems can protect online data (to a degree), but what about offline storage? Is there a software solution, like a file system or a file format, specifically tailored to avoid this kind of bit rot?
I prefer to chisel the 0s and 1s into a stone tablet. Very secure, no bit rot.
zfs
I've got somewhere between 20 and 30 TB that has been accumulating for more than 20 years on NTFS, and I've never seen any examples of "bit rot". My files today are identical to what they were 20+ years ago. I have to wonder what kind of filesystem the poster is using.
I don't respond to AC's.
Is this even possible long term? What would have happened if you had stored all of your information on PATA drives 10 years ago? It's rare to find a motherboard with PATA on it now. Yes, there are converters and third-party PCI cards, but those are eventually going to dry up too.
Now, say you choose SATA. What happens when M.2 becomes the de facto standard? So why don't you choose M.2? What happens when M.2 is phased out?
It is not just the file system and the data you need to think about; it's the physical hardware too. With the rate at which hardware and its interconnects change, it's unrealistic to expect to be able to use your current storage media in 10 years, let alone 20, 30 or 40 years.
portfolio
Anything but that
The only historically tried and proven method of storing data for the very, very long term is hiding clay pots in the desert.
I recommend that.
Are you sure this is from an aging HDD? Maybe it's your eyes.
I mean, they were literally made for offline storage.
Otherwise, a bit more affordable and probably somewhat future-proof, there's M-Disc, the discs that are supposedly made to last a thousand years.
The only historically tried and proven method of storing information for the long term.
If the bits on your drive are changing while the drive is offline, that isn't a filesystem issue. A filesystem issue would be if your OS wrote the wrong information to the drive, but that can't happen with an offline drive.
Tapes will store your stuff for upwards of 10 years, up to 30 if you store them really well. They're also available in large sizes and are pretty cheap (about a cent per GB).
Still RAID is a good choice for your redundancy of choice.
Or paper: http://ollydbg.de/Paperbak/#1
NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
Joking...but not really. From today's Reddit Science AMA with Yaniv Erlich: https://www.reddit.com/r/scien...
There's some very old papyrus around.
The magic phrase to Google is "error correction codes" (ECC).
PAR2 uses Reed-Solomon error correction. Parchive is the ECC file format specification; on Linux you will want PyPar or par2tbb, and on Windows you use a GUI called QuickPar.
Btrfs can be set to use ECC on a single disk.
You can slice a single disk into partitions and then use RAID1 or LVM mirroring, or RAID5 or RAID6. LVM can also be useful to divide (and combine) any number of drives into any number of volumes; then you can RAID across the volumes.
If you Google "ecc disk", "ecc backup", or "ecc archive" you'll find other options, with details about each option.
ext4 is journaled and prevents loss in case of some file-corruption-prone events (like a sudden shutdown).
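To make the "error correction codes" idea above a bit more concrete, here is a minimal sketch of the simplest textbook ECC, Hamming(7,4). This is an assumption chosen for illustration only: PAR2 uses Reed-Solomon, which is far more capable, and the function names below are made up. The point is just to show how a few extra parity bits let you locate and flip back a single damaged bit.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Layout (1-based positions): p1 p2 d1 p4 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 when the codeword is clean, otherwise the
    1-based position of the single bit that was flipped back.
    """
    bits = list(codeword)
    syndrome = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= position  # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome - 1] ^= 1  # correct the single-bit error
    return [bits[2], bits[4], bits[5], bits[6]], syndrome
```

A code this small only fixes one flipped bit per 7-bit word and will miscorrect if two bits flip, which is why real archive tools use stronger codes over much larger blocks.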
Slashdot, fix the reply notifications... You won't get away with it...
ZFS lets you use the multiple copies in a RAID array to correct such bit rot and seems to generally be popular with people storing multiple terabytes. You might also want to ask this question on reddit's /r/datahoarder for some experience. For offline storage you should then probably activate it and run a scrub once in a while.
The other suggestion would be to look at solutions employed on larger scales (libraries, archives), e.g. tape, distributed storage. For long-term storage you should also consider the possibility of soft- and hardware changes and thus maybe a "dumb" filesystem and easy access 20 years later might be more beneficial than a complicated filesystem and no access.
ZFS will guard against bit rot. That's not enough. RAID isn't enough. You need redundancy outside your home or office. Cloud may be expensive for the amount of data you have, but Amazon S3 may be the most affordable in that range. You could get S3 for maybe $15-20 a month if you have a terabyte of data. If that's cost prohibitive, rotate external drives regularly and keep one at work. You'll lose very little data since you're archiving things.
Chewbacon
The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
I'd go for any Linux file system because Linux is the platform that evolves the least. It's still in the 90s so in 2037 it will still be current.
(Watch out for the hater storm! Here they come!)
But it's kinda true if you omit the snideness of the first statement. Because it's maintained by the user base, it's less likely to "devolve" into something incompatible due to market pressure. I, myself, would go for an Apple file system, but Apple isn't so keen on keeping the Mac current and that doesn't bode well for the future. There might be a great change on the horizon.
That's a well-known problem to photographers: photo colors are affected over time. Keep the photo negatives in a safe place!
In a word, no: I don't think there's any filesystem which is designed to combat bitrot while offline. Logically that would just mean duplication of data anyway and hoping that the duplicates don't both get corrupted over whatever period they're offline.
Instead what you really want is a RAID array using ZFS with regular scrubs. A 'scrub' being where ZFS scans the entire contents of the disks, confirms checksums all still match and, if they don't, rewrites the data using the redundant disks in the array. Obviously it needs to be online to perform the scrubs but you could just boot it once a month for a few hours to do that.
If you only need a mirrored RAID rather than RAID5/6/7 then BTRFS can offer the same functionality with some additional flexibility and is also native to Linux (rather than BSD for ZFS).
Not all RAIDs are equal. If you want your data to be safe, use RAID 1 with the second volume in a remote location (aka offline backup).
Backblaze made a report of which SMART stats they see indicating imminent drive failure: https://www.backblaze.com/blog...
No media is perfect. There's just a varying likelihood of errors over time, depending on the quality of the media. Without knowing ahead of time whether a specific piece of media is going to fail, the question needs to change from "How do I keep it from getting corrupted" to "How do I mitigate eventual corruption?"
And the question basically boils down to one answer: redundancy.
Off the top of my head, I can think of three things you can do, and these are not mutually exclusive.
1. Multiple copies of data, stored in different locations. If something happens to a specific location, then at least the media is still safe elsewhere. Even if nothing happens to the location, media failure can still occur. The more copies you have, the more likely you will still have at least one good copy when the time comes that you want to access it.
2. Parity. There are plenty of tools available that allow you to add parity information to your files. For example, the RAR compression utility will allow you to add a 'recovery record' to your file. You choose how much RR you want to add, up to 10% of the file. Obviously this takes up additional space, but you can have a sizable portion of your .rar file become corrupted, and you can still retrieve it. Another thing you can use, is a tool that was popular in the old days of newsgroups: PAR. Unlike RAR which encapsulates your file into an archive, PAR files sit beside your data files. But the function is basically the same. PAR files provide parity data, which you can use to reconstruct files that have been damaged. I'm sure there are other tools available as well.
3. Migrate your data over time. The unfortunate fact is that media changes. If you want to keep your data for the long haul, you have two choices: make sure that you keep backup hardware to read the media you want to read (which brings its own longevity problems), because it may not exist in the future (e.g. it's pretty darn hard to find 8" floppy drives anymore), or periodically migrate your data to a new standard format. Just in the past 30 years, we've gone from Floppies->CDs->DVDs->Bluray->Flash(thumbdrives,SD,etc).
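To illustrate the parity idea from point 2, here is a minimal sketch using plain XOR parity. This is an assumption picked for simplicity: RAR recovery records and PAR files use stronger Reed-Solomon style codes, and the function names here are hypothetical. XOR parity can rebuild exactly one block, and only when you already know which block is bad or missing.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("need at least one block, all of the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute one parity block to store alongside the data blocks."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Rebuild the single lost block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

With three data blocks and one parity block you spend 33% extra space and can lose any one of the four. Detecting which block is bad is a separate job, usually done with per-block checksums.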
For the average Joe, you can't do much better than a simple array of disks with ZFS (it offers good quality integrity checks out of the box),
combined with an off-site backup which you would likely be unable to do anywhere else cheaper than Amazon Glacier's service.
Last time I checked, fitting 6x8 of disks striped in raidz2 configurations gave an optimal balance of reliability, capacity and speed, all feasible to have in a single box.
Offline is not a problem: you just switch off its power when not in use.
"Is there a software solution, like a file system or a file format, specifically tailored to avoid this kind of bit rot?"
Yes, ZFS is specifically tailored for this. Configure a zpool running RAID-Z2 with a hot spare or RAID-Z3. Half a dozen 6TB or 8TB disks should suffice.
Set it to auto-scrub regularly. Send logs and warnings to your email, and pay attention to them. (This is the hard part). Especially pay attention if they stop arriving. (This is even harder).
I have used Nexenta for some time, but the free product has a limit of 18TB of raw storage. If I was starting today I would use FreeNAS which has no such restriction.
The other comments about the futility of trying to do this long term are worth heeding, but that doesn't mean you shouldn't try. The key is to make this an active project rather than a passive archive, and to re-evaluate the best approach every few years.
Whatever media you choose must be tested from time to time. Even a tape can suffer data loss if it is not used occasionally. And HDDs may be susceptible to errors from cosmic rays or magnetism if not powered for a long time; just guessing...
'Forever' is a long time.
'Offline' is difficult to deal with long-term (I am thinking decades to centuries); such is the nature of technology and the lack of any real history we have of digital data management.
Personally I would say the best bet is keeping your data 'live' online to some extent, it is the only real way to monitor and control the inevitable decay.
Basically your data's lifespan is related to how long you can convince someone to care for it for you.
"I bless every day that I continue to live, for every day is pure profit."
Pick your poison:
- Tape: inexpensive and slow, requires frequent testing (backup we do, it's restoration that's the problem!), usually unreadable after 6 to 12 months or less (that's in production, people).
- WORM: more expensive than tape and just as slow, works well in the medium term (meaning 10 years, tops).
- XFS NAS: faster than the above, requires good hardware and a bit more work than either tape or WORM. Don't forget to set up replication to multiple systems. May suffer from bitrot in the long term (checksumming/hashing files might be a good idea). Very stable, large capacity file system. Tape backup is always a good idea.
- ZFS NAS: slightly slower than XFS (at least, that's my experience, YMMV). Ultra-large capacity. Snapshotting is just a breeze. Again, replication to multiple, distant systems is mandatory. Very stable file system. Tape backup is always a good idea.
- DNA, 3D crystal lattice, holographic memory: what we are all going to use in the future. Still in beta testing, though.
- DVD: don't make me laugh.
The right to offend is far more important than the right not to be offended. (Rowan Atkinson)
As has been stated, your hard drives are not fine if you are starting to see data corruption. They are starting to die. The filesystem used is irrelevant. You have a hardware deterioration problem. Hard drives only last so long. When in constant use they'll eventually wear out. When sporadically used they're susceptible to other kinds of hardware failure issues. This includes magnetic issues, heat cycle issues, etc.
Hard drives are not permanent storage. If you choose to keep your data on hard drives, and hard drives alone, then you will have to accept the fact that you will forever be buying new drives and copying the old data to the new drives. Whether that's done manually or is simply an automatic procedure is up to you and how you choose to set up your system. If you don't do this, you will eventually lose all your data because chances are your drives will stop working at some point. They all do.
As has also been mentioned, your best bet is to set up some hardware redundancy and some filesystem redundancy on top of that. In addition, keep one or more extra copies in a different physical location. I'm confused as to how this basic stuff is on slashdot, frankly.
It looks like there are (at least) two with CRC: zfs and btrfs. Here's info for btrfs CRCs: https://en.wikipedia.org/wiki/...
You'd still need a backup or RAID solution to replace a bad black.
if bits were randomly changing you'd have corruption issues not faded images and videos missing keyframes. This is ridiculous.
I came to the datacenter drunk with a fake ID, don't you want to be just like me?
HDDs will die. If you want something that will last for many decades or even centuries without getting corrupted then you need to stop using a volatile filesystem. The best option is to go with write once media. The best option I know is M-DISC.
M-DISC's design is intended to provide greater archival media longevity.[3][4] Millenniata claims that properly stored M-DISC DVD recordings will last 1000 years.[5] While the exact properties of M-DISC are a trade secret,[6] the patents protecting the M-DISC technology assert that the data layer is a "glassy carbon" and that the material is substantially inert to oxidation and has a melting point between 200 and 1000 C.[7][8] -- Wikipedia
Anons need not reply. Questions end with a question mark.
It may have nothing to do with bits. It's possible the problem is a media player and/or driver compatibility issue or bug. I've seen where one media player/displayer can display an image or video fine, but another gags on it or distorts it. Probably a bug in the encoder and/or decoder.
As far as backups, make at least 2 copies. Bit-error-recovery schemes will usually require more storage space such that it's probably less hassle and more "insurance" to keep 2 regular copies rather than one copy with some fancy bit-correcting on it. Plus, in the future you may not be able to find a decoder for the fancy file encoder scheme.
Table-ized A.I.
Essentially MS's new ReFS does everything plus self-healing, except there are no alternate data streams and no booting from it. Your files sound like archiving, which is exactly what this can be for.
ZFS is nice and I use it, but it makes assumptions about sane gear that are not safe on desktop-grade hardware. BTRFS, which I also use, works great. But for your specific use case, snapraid is the thing to use. By that use case I mean things that never change: a big pile of files you keep adding to. Mind you, you're going to have to replace drives over time.
No sir I dont like it.
An archival optical format. M-DISC DVDs and Blu-ray are theoretically able to retain data for 1000 years. And DVD uses some error correcting codes already, Reed-Solomon I believe.
An SSD is a bad choice for archival, in some cases MLC Flash can decay and accumulate errors in 3 months while unpowered.
For a file system that is likely to be understood in the distant future, ISO 9660 with no file larger than 2 GiB should do the trick.
Packing your data into a custom archive file format that has more sophisticated forward error correction, like Turbo Codes, could be useful although perhaps inconvenient if you need special software to decode the files.
Keeping a file of hashes (MD5, SHA1, crc32, cksum, cfv, whatever) for file integrity verification is very helpful for checking whether you have bit rot. I've found that most proprietary file formats cause programs to crash when they are corrupt.
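A minimal sketch of that "file of hashes" approach, assuming SHA-256 as the hash (my choice for the example; any of the ones listed above would detect rot too) and with hypothetical function names:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return a list of (relative_path, 'missing' or 'changed') problems."""
    root = Path(root)
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif sha256_of(target) != expected:
            problems.append((rel_path, "changed"))
    return problems
```

You would dump the manifest to disk (for example with `json.dump`) right after archiving, keep a copy of it with every copy of the data, and rerun `verify_manifest` whenever the media is spun up. It only tells you that something changed; you still need a second copy or parity data to fix it.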
Making N copies of your data and sending the discs to N destinations would allow you to recover most instances of partial data loss among all the discs, and total data loss of N-1 discs. I think N=2 or N=3 is plenty of paranoia without much overhead for an individual.
For short term, just throw it into the cloud. If your local backup
NOTE: in 100-1000 years, people interested in old data won't need an off-the-shelf DVD drive to read the data off a DVD; any researcher should be able to construct a purpose-built drive. I mention this because USB, SATA and PATA won't be around as standards and the old electronics likely won't work reliably anyway. Even today, I think building a device to read a CD or DVD is within reach of a clever teenager.
“Common sense is not so common.” — Voltaire
1. Add lots of redundancy in the form of PAR2 files.
2. Store the whole lot as a tar format, dumped to the drive as a block device. This format is so simple that a future programmer will have no trouble reverse-engineering it, even if all documentation has somehow been lost, and there are no key structures which will render the whole thing impossible to read if lost. Just to be sure, the first thing going on there is a copy of the tar format specification.
3. Include also a copy of the par2 software for several operating systems, source code, mathematical explanation and format specification.
4. dd copy the drive to as many other drives as your budget allows.
5. Distribute the drives.
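A rough sketch of step 2 using Python's standard `tarfile` module, assuming the plain USTAR variant of the format and a hypothetical member name for the format description. It writes to an ordinary file path; pointing it at a raw block device is left out on purpose, and the PAR2 step is not shown.

```python
import io
import tarfile
from pathlib import Path


def write_archive(archive_path, spec_text, data_paths):
    """Write an uncompressed tar whose first member describes the format.

    spec_text is whatever description of the tar format you want a future
    reader to find first; the member name below is a made-up convention.
    """
    with tarfile.open(archive_path, "w", format=tarfile.USTAR_FORMAT) as tar:
        spec_bytes = spec_text.encode("utf-8")
        info = tarfile.TarInfo(name="00-README-tar-format.txt")
        info.size = len(spec_bytes)
        tar.addfile(info, io.BytesIO(spec_bytes))
        for path in data_paths:
            # Store each file under its bare name; a real archive would
            # keep directory structure and guard against name clashes.
            tar.add(path, arcname=Path(path).name)
```

Leaving the archive uncompressed matters here: a flipped bit in a plain tar damages one file, while a flipped bit in a compressed stream can take out everything after it.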
This approach should do for the next forty years or so. After that point it might get difficult for people to source a SATA controller, so you will have to migrate to new media.
Seriously, minimalism is underrated. There is such a thing as too much useless data. It's hard to catalog, it's hard to track, and if you sat down and sorted out what you actually could still use, most of it is probably worthless or you'd never find the time to use it ever again. You might say "well it's still worth storing IN CASE I ever find a use for it", but that's a typical data-hoarder sentiment that is unsustainable. You can't just keep buying media to store everything and never delete; it's a management nightmare that results in these very issues.
I guarantee you, if you find you've deleted something and actually want to get it back, it's available somewhere on the Internet. If it's NOT, then it's a candidate for keeping. That's how minimalism works.
Just RAID it (preferably mirroring) and store multiple redundant copies, physically separated. Either use a checksumming filesystem (e.g. zfs) or make your own checksums so you can recognize bitrot.
But you'll never know when things have degraded beyond recovery.
Unless you're prepared to regularly validate that the data is still readable, you'd be better off storing the data at any major cloud vendor and let *them* verify integrity over time. Or better, mirror the data across multiple cloud providers.
My most important data is family photos (some scanned images date back to the early 1900's). I keep the image files on a RAID-6 hard disk array, which is backed up to a separate hard drive in another part of the house once a week (for quick local restores), everything is also backed up to a Crashplan cloud backup account, and all of the files are also backed up to AWS Glacier in a different country from me.
... obviously.
If I understand you correctly, you are asking what filesystems can error-correct in the face of physical bit rot.
I don't know of any commonly-used "disk-type" (local, not specifically designed for archival/offline media) file systems that have checksumming or RAID-style redundant data within the filesystem itself. Some distributed/clustered file systems have features like this, but they aren't well suited for offline storage in the way that you are thinking about (or, when used for offline storage, the redundancy is likely "optimized away."). I'm not familiar enough with the filesystems used by optical media and tape drives or their underlying hardware to know how much redundancy exists or at what layer the redundancy exists at, but I suspect it is "below" the filesystem level.
If you aren't interested in inventing an "optimal" solution in terms of storage space or time-to-create-or-read the backup, a "not much thought required/you can think it through in far less than an hour" solution is to create checksums for every backup you make (either per-file, per-"block," or some other way) then make a second copy of both the backup and the checksums. Store the two copies in different places. If it's very important, make a third copy but use a different format for this backup (for some documents, like a business letter, a printout is an acceptable backup).
--
If it's really really important, encrypt it and upload it to PasteBin and tell the world that it's political dirt on [insert politician's name here] and that you will release the encryption key if the politician doesn't resign. This will ensure that there will always be many copies in existence. *joke*
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Back in the day I had a 4.3 GB hard drive in my computer; that's right, an HDD the same size as a DVD. I had to uninstall one game to get enough space to install another. I used to run it as lean as I possibly could, but in these days of multi-terabyte drives I have relaxed and succumbed to a little bit of hoarding; maybe a few hundred GB worth.
But people who feel the need to keep multiple terabytes really need to look at themselves and think is it really necessary or is it simply hoarding. Digital hoarding can become very expensive very quickly; having to keep multiple drives containing the same data to ensure failure won't wipe it out. Having to routinely spin up the drives to ensure no damaged sectors have corrupted the data; doing a bit comparison to ensure both copies are identical.
You seriously need to look at the data and consider if it's worth keeping and if it is can it benefit from compression, even lossy. Do you really need a 50gb bluray rip when a 15gb x265 encode will be all but indistinguishable? Do you really need a 20mb png of an image that can be converted to a 1mb jpg with minimal loss of fidelity?
Just because we live in the age of single 10tb drives doesn't mean we should simply stop being selective or logical about our storage needs.
I have 2 backup routines. The first is a (nearly) whole system backup onto a 2 TB external hard drive (excluding the Steam folder and other things that can be re-downloaded). This is something I can just restore if my drives go bang.
The second routine is to back up the things I simply cannot live without: the things that I have written over the past decade or so which I simply could never replicate again. That is encrypted and mirrored in nearly a dozen separate places: on a USB stick, on my primary phone's memory and SD card, on my secondary phone's memory and SD card, on my mp3 player, in 3 separate 'clouds', on my tablet and on another external hard drive. That single encrypted file is less than 500 MB. That is the difference between keeping the things you cannot live without and digital hoarding.
Get out your chisel and mallet. Carved into stone tablets your data can last for millennia. You will want to throw in a little error correction. Better get to chiseling...
Or use an optical format made for archiving.
for your OCD, hoarding, and anxiety disorder.
It may be that the codec you are using now isn't bug-for-bug compatible with the codec that was used to store the file.
It's also possible that the file was saved in a "not quite industry standard format" but that it would look fine on vintage hardware running a vintage OS with vintage device drivers and vintage software, but today's hardware and software interprets these "not quite industry standard-format" files in a way that exposes their flaws.
Got a Pentium II computer and a copy of Windows 98 in the basement?
https://www.backblaze.com/blog... There is also rsbep, see https://www.thanassis.space/rs...
Perl Programmer for hire
You've got terabytes of information you will never access again. How about just getting rid of most of it? Pick some subset you want to keep and then buy 3 HDDs and create triple copies of it. Repeat this every year and you'll probably not lose any of the information.
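One nice property of triple copies is that you can repair by majority vote even without stored checksums. A minimal sketch, assuming the copies fit in memory as bytes objects of equal length and that rot hits different offsets in different copies (the function name is made up):

```python
from collections import Counter


def majority_repair(copies):
    """Rebuild the original bytes by voting byte-by-byte across the copies."""
    if len(copies) < 3:
        raise ValueError("need at least three copies to outvote a bad one")
    if len({len(copy) for copy in copies}) != 1:
        raise ValueError("copies differ in length; cannot vote byte by byte")
    repaired = bytearray()
    for offset, column in enumerate(zip(*copies)):
        value, votes = Counter(column).most_common(1)[0]
        if votes * 2 <= len(copies):
            raise ValueError(f"no majority at offset {offset}")
        repaired.append(value)
    return bytes(repaired)
```

For real multi-gigabyte files you would do the same thing in chunks read from each drive rather than loading whole files.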
in addition to ZFS, BTRFS also handles bitrot. I'm running a 4 disk BTRFS RAID 10 in my closet, mounting to a development machine on my desk via NFS, it's been working fine for about a year, and I scheduled a scrub a couple times a month whose purpose is exactly this, to catch and correct bitrot. It does so by using a CRC32 check, and if it detects a problem on one slice it overwrites that slice from the data on the good slice.
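The scrub logic described above can be sketched as a toy model: two mirrored lists of blocks plus a checksum recorded at write time. This is only an illustration of the idea, not how the filesystem is implemented; it uses `zlib.crc32` (plain CRC-32) as a stand-in, and all names are hypothetical.

```python
import zlib


def scrub_mirror(side_a, side_b, checksums):
    """Check every block on both mirror sides and heal from the good side.

    side_a, side_b: lists of bytes blocks, modified in place when repaired.
    checksums: CRC-32 values recorded when each block was written.
    Returns (repaired_indexes, unrecoverable_indexes).
    """
    repaired, unrecoverable = [], []
    for index, expected in enumerate(checksums):
        ok_a = zlib.crc32(side_a[index]) == expected
        ok_b = zlib.crc32(side_b[index]) == expected
        if ok_a and not ok_b:
            side_b[index] = side_a[index]  # overwrite bad copy from good one
            repaired.append(index)
        elif ok_b and not ok_a:
            side_a[index] = side_b[index]
            repaired.append(index)
        elif not ok_a and not ok_b:
            unrecoverable.append(index)  # both copies rotted: restore from backup
    return repaired, unrecoverable
```

The "unrecoverable" branch is the reason people in this thread keep saying a scrubbed mirror is still not a backup.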
Also I have offline and offsite backups of very important items.
When using BTRFS, read the wiki and settle on a kernel version and btrfs tools version that is sufficiently up to date. It's stabilized sufficiently for these kinds of things, but only if you are careful to run an up-to-date version that isn't marked as buggy on the wiki.
If your data is on-line (stored in disks that are plugged in) then you want ZFS. Preferably with either mirroring, RAID or multi-copy turned on so you have more than one copy of each file. This allows the file system to repair files that fall victim to bitrot.
If your file system is off-line (not plugged in) then you should make multiple copies of each file/disk and store your backups in a different location. Chances are the same file will not get corrupted in both places in the same time frame.
Thanks for that one!
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Somehow all those postings I made back in the 90's are still available.
It looks like there are (at least) two with CRC: zfs and btrfs. Here's info for btrfs CRCs: https://en.wikipedia.org/wiki/...
You'd still need a backup or RAID solution to replace a bad black.
If only Slashdot posts had CRC or something like that, the posts wauld say what the poster intended.
That's a job for the Linear Tape File System
https://en.wikipedia.org/wiki/...
Tape is (still) the best medium for long-term storage. Over the years tape (or more likely, the engineers) has aggressively incorporated things like FEC codes into the standards (from Reed-Solomon to more exotic ones nowadays).
And since 2010, with LTFS, you can access the files with the convenience of a normal filesystem (but bear in mind, access is slow as hell).
Back up your data to tape (more than one set), and send it to specialized offline storage facilities (climate-controlled, i.e. temperature/humidity/dust/light control) from different providers, in different geographical areas.
There is now only one true tape standard (LTO-7, released in 2015; the tape business has been shrinking, so the proliferation of standards seems to be over), so if you use that today, chances are you will still find equipment to read it 50 years from now. Nonetheless, keep a few (as in two or more) SYSTEMS (computer + drive + software) set up so that you can re-read the tapes. A cheapo micro form-factor mobo with an Atom processor (but NOT the Atom C2000 series, PLEASE), Linux, a 1 Gbps NIC and a tape drive should be more than enough. ....
Now, for Online, as other posters have said, ZFS WITH ECC memory (and therefore, a very expensive Xeon, or AMD server type mobo) and JBOD will do the trick.
*** Good luck to all and have a nice day!
Hard drives already use ECC on the physical platters (and SSDs do this too) to ensure bit rot is correctable (or at least detectable in the case of 2-bit failure). Unless you have a particular type of 3-bit failure, monitoring SMART stats and catching any read errors thrown by the drive is sufficient.
There are WAY too many ZFS/btrfs zealots. "Oh, it has CRCs!" Yeah, so what? CRCs will tell you that data is already damaged just like a hard drive will. "Oh, it has ECC!" Yeah, so does the physical medium, so how is that going to make a difference beyond the hard drive? It might (MIGHT) detect hardware failures between the medium and the CPU/RAM, but if that's going out then you've got a much bigger problem than a filesystem with ECC is going to be able to help. Plus your CRCs only help if you either read the data or scrub frequently; untouched data CRCs won't get checked otherwise which is no change from no CRCs at all.
The solution is what it always was: keep a backup copy of everything and restore from backup when something is damaged. If you want to detect data loss due to surface failure more quickly, you'll need to do a comparison of the backup against the data and see if a file has actually changed (rsync -rcvn source/path/ dest/path/ will display all changed files detected by full file checksums instead of time+size difference).
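If you'd rather script that comparison than run rsync, here is a minimal sketch of the same idea in Python. It assumes both trees are mounted locally and compares full file contents byte by byte with `filecmp` instead of computing checksums; the function name is hypothetical.

```python
import filecmp
from pathlib import Path


def changed_files(source, dest):
    """Compare every file under source with its counterpart under dest.

    Returns (differing, missing): relative paths whose contents differ,
    and relative paths that do not exist in dest at all.
    """
    source, dest = Path(source), Path(dest)
    filecmp.clear_cache()  # never trust a cached result from an earlier run
    differing, missing = [], []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = dest / rel
        if not dst.is_file():
            missing.append(rel.as_posix())
        elif not filecmp.cmp(src, dst, shallow=False):
            # shallow=False forces a content comparison, not just size/mtime
            differing.append(rel.as_posix())
    return differing, missing
```

Like the rsync dry run, this only reports differences. It cannot tell you which side is the rotten one, so it works best combined with a hash file made when the data was first archived.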
ZFS and btrfs have lots of spiffy features that may be helpful in avoiding or mitigating data loss (snapshots come to mind), but I really wish people would stop acting like they're a magic bullet that kills your bit rot problems. They're not. They never will be.
RAID-5, XFS, and an array on a different machine periodically mirroring with rsync snapshots has NEVER failed me and likely never will. If you're losing so much data that multiple files are getting corrupted, ZFS and btrfs ARE NOT GOING TO SAVE YOU.* Fix your computer.
RAID assumes statistically decoupled failure modes. One place I worked, they shut down a rack of servers for maintenance, and had over a dozen simultaneous lost drives when powering them up. Stiction. A design flaw in the bearings or heads or something caused the disks to get scraped to hell, completely unreadable. One brand of recent drives can't shut down properly because the capacitor that powers the emergency-parking the heads isn't strong enough to overcome unexpectedly high friction in the bearings, if the grease settles for too long.
So yeah, just assume you're going to lose that data some day, and learn to live with that fact.
Data recorded on parchment has survived for hundreds of years, through numerous world wars, environmental catastrophes, human stupidity, etc.
Make sure to back it all up on clay tablets.
> I'd like to save forever
Why? Nobody will want your crap after you die. What you are doing is called 'hoarding'. If you have traces of sanity, you should destroy this crap and start living your life.
A related thought: whoever worries too much about 'bit rot' should keep reminding themselves that this is exactly what happens to us - both our bodies and minds - in the course of each and every minute. Both our genetic material and our brain cells continually deteriorate.
Don't waste your time.
It's not the only solution of its type, but it is imo the best:
http://www.snapraid.it/
It is perfect for your kind of situation: long-term, reliable, efficient storage of lots of data that seldom changes. Think of it as offline RAID backup; it works like RAID, but it computes parity during your backup operations "offline".
The beauty of it, imo, is that it is not file system dependent. It works with NTFS, EXT2, HFS, whatever. It works on Linux, Windows, Macs, whatever. You don't need special controllers, and your hard drives do not have to be matched to each other. You can even include drives on different buses (some on USB, some on SATA, whatever).
It doesn't mess with your data at all. Your files are stored normally and can be accessed normally; there is no difference between using it and not using it under normal operation, and there is no performance impact at all (it only does anything during backup operations, and even then it is very lightweight if your data doesn't change drastically day to day). You just schedule it to run on a regular basis and it does its thing. It detects and recovers from bit rot in much the same way as ZFS (although you need double parity or more to really ensure full protection from multiple drive failures). You can be as paranoid as you want, it just takes more storage to be more paranoid :)
It isn't good for frequently changing data, and it isn't so great for huge amounts of small files either. It takes a long time to generate the initial parity if you have lots of data. You have to be comfortable with command line usage and you have to have some way to schedule jobs. Those issues aside, for things like media libraries and archival storage, it is easily the least painful, most effective solution I have ever used. And it's free to boot (and open source).
Highly Recommended.
- sigs are stupid
Paper Tape - As long as you don't damage it, it will never suffer data loss.
Your thin skin doesn't make me a troll
QuickPar on Windows is long-obsolete. MultiPar is the more modern variant.
... be it on an optical disk or another storage medium, I first add ~25% error correction data with http://www.dvdisaster.com/en/i....
So far I have only needed it once, when I wheeled over a DVD with my office chair, and it was enough to recover the data.
archive stuff and upload everything to usenet.
A lot of bit rot is actually caused by faulty RAM.
When data is moved around, it has to go through RAM, and even smart filesystems like ZFS may not help you there. Servers usually have ECC memory for that reason and ZFS explicitly recommends it.
For live data, some NAS devices like Synology have a 'scrubbing' option where it can rewrite your dataset once a month to prevent magnetic levels from degrading too much, and prevent bit rot by doing so.
ZFS on Linux (http://zfsonlinux.org/) is a great option, thanks to the work done by Lawrence Livermore National Laboratory. Also, have a look at http://open-zfs.org/wiki/Distr... and http://open-zfs.org/wiki/Compa... for solutions that integrate ZFS.
You'd still need a backup or RAID solution to replace a bad black.
I hope you mean a Western Digital Black.
Bit-rot doesn't happen in storage. (Because it has an ECC on the hard disk; you either get your data back, or you get a read error!)
If you have numerous examples of file corruption, they happened as you were reading, writing, moving/copying the data. Think about how you've done that in the past. Especially think about whether you used any external docks, USB adapters, or similar.
..forever is a very long time. Your stated aim is simply impossible. Delete your data, because the reality here is that no-one, not even you, if you really examined your own feelings on the matter, honestly cares about your terabytes of digital driftwood.
Of course, if you are really intent on storing this information forever, then you're going to have to consider what happens when you die. For this, you're going to have to become rich, because no-one is going to look after this stuff for free. You'll also need a library of hashes of the files, to ensure integrity, and naturally at least two copies of each. You'll have to write software to continually re-calculate the hashes, and check against your library, but that's OK, because you could probably sell this kind of archival service to other OCD-stricken humans, which takes care of your money problems.
In fact, it occurs to me, that the real and only answer to your problem, is to invest all your time in building a company that provides this service, use the proceeds to look after your data too, and write provisions into your will that your data is preserved forever.
And thermodynamics, be damned.
Over the years, I've had failures in my CD/DVD archive, hard disks, and solid state storage (USB, CF, MMC, SD). Consumer grade hardware isn't designed for longevity. The only rock solid archival medium that's never failed me is obscure and dead. That said, ISO filesystems on CD-ROM will likely be readable for a decade or two longer... as long as your media doesn't rot. FAT is the next most ubiquitous.
I agree on PAR2, simply because it's a file you can easily copy around, take backups of and so on. From a 1GB file I have ~3000 source blocks and ~30 recovery blocks, so I can recover from a lot of bit flips or failed 4 KB sectors for a 1% size gain. If it's a photo set I usually make sure I can recover at least one completely missing photo. The nice thing is that it's sufficiently overkill you can probably go through several hardware generations without checking/repairing before you accumulate an unrecoverable number of errors. Which is good, because it's fairly CPU intensive so I wouldn't really want to go through an 8TB drive often. But I've found that an on-demand check when I actually need it is fine for content that is "in storage". It's not like it happens very often, or applications and other more bit-flip-sensitive formats would be screwed up quite often.
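The block arithmetic in that comment can be checked with a few lines of Python. This is only a rough sketch: it assumes 1 GB means 10^9 bytes, that each recovery block can stand in for one damaged or missing source block (which is how PAR2 recovery sets behave), and a hypothetical 5 MB photo. The function names are made up for illustration and are not part of any par2 tool.

```python
import math


def par2_budget(file_size: int, source_blocks: int, recovery_blocks: int) -> dict:
    """Rough numbers for a PAR2-style recovery set (illustrative only)."""
    # Block size if the data is cut into `source_blocks` equal pieces.
    block_size = math.ceil(file_size / source_blocks)
    return {
        "block_size": block_size,
        # Extra storage as a fraction of the protected data.
        "overhead": recovery_blocks / source_blocks,
        # Upper bound on damaged data that can be repaired, assuming
        # one recovery block repairs one damaged source block.
        "max_repairable_bytes": recovery_blocks * block_size,
    }


def blocks_to_rebuild(missing_file_size: int, block_size: int) -> int:
    """Recovery blocks needed to rebuild a file that is missing entirely."""
    return math.ceil(missing_file_size / block_size)


budget = par2_budget(10**9, 3000, 30)
print(budget)
# A hypothetical 5 MB photo: how many of the 30 recovery blocks would it use up?
print(blocks_to_rebuild(5 * 10**6, budget["block_size"]))
```

With these assumptions the photo needs about half of the recovery blocks, which is why sizing the recovery set against the largest file you want to survive losing is a sensible rule of thumb.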
Live today, because you never know what tomorrow brings
The magic phrase to Google is "error correction codes" (ECC).
PAR2 uses Reed-Solomon error correction. Parchive is the ECC file format specification; for Linux you will want PyPar or par2tbb, and on Windows you use a GUI called QuickPar.
Btrfs can be set to use ECC on a single disk.
You can slice a single disk into partitions and then use RAID1 or LVM mirroring, or RAID5 or RAID6. LVM can also be useful to divide (and combine) any number of drives into any number of volumes, then you can RAID across the volumes.
If you Google "ecc disk", "ecc backup", or "ecc archive" you'll find other options, with details about each option.
ECC is probably not going to fully cut it. That just increases the number of errors that can be corrected, usually not by a large amount. Long term, entire disks are going to die, or if you're lucky only part of an entire disk.
I'm thinking of something like torrent. For instance, suppose your set of files decomposes into 100 torrent chunks. Each chunk is a MB or whatever. Each chunk can be copied onto multiple destinations. All you need to do is obtain a full set. You can extend this a bit, by encoding the chunks similarly to how raid-6 does, such that you effectively have to recover two out of four sub chunks to recover that larger chunk. (You would still have to recover all 100 chunks to recover the original file.) (Maybe you can do better than Raid-6...)
At any rate, each disk should contain a copy of all the file names and the checksums of the major chunks. This is a fairly small amount of info, so its being copied everywhere is harmless. The distributed ECC allows you to recover files with some parts missing or corrupt. Combine that with, as mentioned, just ensuring you have copies of all the pieces reasonably distributed, and you're in pretty good shape. The use of something like torrent chunks basically adds the ability to verify that a chunk is intact. Also, breaking things up into pieces increases the odds of finding enough intact pieces to recover the whole.
I think Backblaze may do some of this. With some actual effort, you could probably figure out how to optimize this and see what failure patterns you really protect against.
On a side note, the "torrent chunks" should typically try not to smear small files across more than one chunk. We want to be able to do a partial recovery, if the data is badly corrupted. The main key is to decompose the problem into a set of smaller problems such that you can determine with some degree of certainty just how reliable the resulting system is and what your failure profiles are. Depending on how things are structured you may be able to control the possible failures to ensure that things really do fail somewhat gracefully, as opposed to simply either having enough to reconstruct all, or nothing.
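The chunk idea above can be sketched in a few lines of Python. This is a minimal illustration and not a real torrent implementation: the chunk size, the choice of SHA-256, and the function names are all assumptions made for the example. It keeps one list of per-chunk hashes and rebuilds a file by taking each chunk from whichever copy still matches its hash.

```python
import hashlib


def split_chunks(data: bytes, size: int) -> list:
    """Cut data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(chunk: bytes) -> str:
    """Checksum used to decide whether a chunk is intact."""
    return hashlib.sha256(chunk).hexdigest()


def rebuild(expected: list, copies: list) -> list:
    """For each chunk index, take the first copy whose hash matches.

    `expected` is the list of known-good chunk hashes; `copies` is a list of
    chunk lists, one per disk. A slot is None if no copy has an intact chunk,
    so a partial recovery is still possible.
    """
    rebuilt = []
    for index, want in enumerate(expected):
        intact = None
        for copy in copies:
            if index < len(copy) and digest(copy[index]) == want:
                intact = copy[index]
                break
        rebuilt.append(intact)
    return rebuilt
```

Two copies that are each damaged in a different chunk still yield a complete file, which is the graceful-failure property the comment is after. Parity across chunks (the RAID-6-like extension) is not shown here.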
Your parents, your spouse, your children, your pets,
everyone you know, everyone you've ever heard of,
the earth, the solar system, the milky-way galaxy,
you and your pitiful little hard drive will all die in time.
I, on the other hand, purchased the Extended Warranty Plan.
UDF is the RW format for DVD-RW and can be used on HDs in all modern OSes (it requires format version 2.01).
The format is resilient, since DVD-R(W) discs may get scratched, and it has CRCs on metadata. Sadly it does not have CRCs on data: the DVD reader and physical format already carry some recovery info, so UDF didn't add it directly.
It is still a good format. Being an ISO standard, it should have a long life and be readable for a long time. Of course, for HDs, I would bet that mechanical problems will show up sooner.
Other than UDF, ZFS and BTRFS both have CRCs and should be resilient, and their formats are set and should not change. But there are other formats with CRCs; check Wikipedia for more options.
Finally, the format in which you store the files is probably also important: a solid RAR or TAR may cause more problems in the future than compressing each file with gzip. Probably the best option is to store the files using par, as it was created to permit access to the files even if several blocks can't be read. Some backup tools support this directly, as DAR or but, or indirectly, as backuppc (search ArchivePar) on the archive step.
Whatever you do, a followup of this in one year (or more) is a good idea, as the theory and real life may be different things :)
Higuita
Those green hanging folders seem to last forever. When was the last time you saw one wear out to the point where it was thrown away? Typically they seem to last a lifetime.
An Introduction to the Z File System (ZFS) for Linux.
Quote: "ZFS is capable of many different RAID levels, all while delivering performance that's comparable to that of hardware RAID controllers."
That sounds good to me. I want to avoid hardware RAID because, when hardware RAID controllers fail, they are often difficult to replace.
Lots Of Copies Keeps Stuff Safe (this is what is recommended).
Google Drive, OneDrive, Dropbox, Box, Carbonite, CrashPlan...sign up for several and spread copies around.
At least one is likely to survive the fires, floods, tornadoes that will inevitably doom any single approach.
my HDDs are fine, but some files are corrupting
Your HDDs are not fine.
When all you have is a hammer, every problem starts to look like a thumb.
CAREFUL. If you're using ECC on your file system you also need ECC memory. If not, a bad bit in memory could trigger an ECC validation failure on good data and then the file system's cascading 'data corrections' may wipe out your entire partition!
Someone mentioned this on Slashdot awhile ago with a far better explanation than I can give.
UDF on archive quality optical WORM media, such as BDR M-Disc
Snowden and Manning are heroes.
More generally, be careful of cascading error correction. Some types by nature will not cascade (these can generally be thought of as 1-dimensional), other types should check for a cascading effect before doing a correction.
That's why we know more about ancient cultures of the middle east than we do about the cultures that destroyed them.
Simple ...... Use ZFS with ECC RAM. It checksums all of your data, and using ECC RAM will prevent corruption during off-disk filesystem operations.
>"The magic phrase to Google is "error correction codes" (ECC)."
Ok, maybe o/t, but: I always knew ECC as "Error Checking and Correcting"
Has the meaning changed over time, or is this a case of a collision in TLA-space?
Yeah, yeah, I'm too lazy to google/wikipedia it.
Do NOT use btrfs. I have real world experience with it and it is horribly unstable, but everybody claims it is stable. It's the edge cases that cause the entire filesystem to become totally unusable, not to mention the features that were never really tested properly and people were using thinking they were stable for years (RAID5).
It's really alarming that it's a default option on synology disk stations!
Or, you know, you can just use ZFS and have all this and more.
Besides, BTRFS is buggy, beta, and dead.
If you have multiple backups, you can fill in "bad blocks" in one backup with another backup; that's probably the simplest and most easy-to-use solution. You can calculate the probability of an unrecoverable error easily.
If you want something more efficient, you can use various forward error correction tools or file systems. Tahoe-LAFS is one such system, though perhaps more complex than you might want.
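The probability mentioned above is easy to work out if you are willing to assume that each backup goes bad independently, which correlated failures (same drive batch, same fire) can violate. Under that assumption a block is lost only if it is bad in every copy. The sketch below uses made-up function names and example numbers.

```python
import math


def block_loss_probability(p_bad: float, copies: int) -> float:
    """Chance that one given block is bad in every one of `copies` backups,
    assuming each backup fails independently with probability `p_bad`."""
    return p_bad ** copies


def any_loss_probability(p_bad: float, copies: int, blocks: int) -> float:
    """Chance that at least one of `blocks` blocks is unrecoverable.

    Computes 1 - (1 - q)**blocks with log1p/expm1 so that tiny values of q
    are not rounded away.
    """
    q = block_loss_probability(p_bad, copies)
    if q >= 1.0:
        return 1.0
    return -math.expm1(blocks * math.log1p(-q))


# Example: 1-in-1000 chance a block is bad on any one backup, 2 backups,
# 1000 blocks in the archive.
print(any_loss_probability(1e-3, 2, 1000))
```

Adding a third copy multiplies the per-block loss probability by `p_bad` again, which is why even unreliable media become usable when there are enough independent copies.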
Concur. File corruption due to "age" will not occur without hard read errors. Also, "ill-coloured photos" likely would not be ill-coloured in the case of actual data corruption, but would have whole blocks of hash in them. The user claims to have multiple terabyte sized hard drives - hard drives in this size category used for archival storage are simply not old enough to be suffering data corruption due to age. The only hard drives suffering so are MFM hard drives that likely the poster wouldn't have a clue how to even interface into a current computer. Hard drives used for archival data storage will likely not age degrade before the interface standard they are based on becomes obsolete. Thus, a perfectly reasonable archival data storage strategy is to simply copy data from one hard drive to a newer (likely much larger and faster) drive when the next generation interface becomes standard, and before the previous generation is totally obsolete. For example, one can still get PATA + SATA USB adapters, SATA + M.2 adapters, etc.
If the user who submitted this question is actually experiencing a problem at all, suggest that PEBCAK. Better explanation is the poster is not actually experiencing current problems at all, but is simply trying to sound important with inflated claims of reams of data and that Slashdot has been had.
Further, no person with Slashdot posting authority should have been ignorant of any of the issues in this question that make its legitimacy questionable at best, and certainly not Slashdot worthy in any circumstance.
Digital photos do not become "ill colored." The degradation would have to affect the entire photo in a consistent fashion and, depending on compression used, the degradation would need to first decode, change, and then re-encode the image.
Bit rot does not cause this.
Depending on the compression used, with real bit rot, you will get either an unusable file, a usable file with a few corrupt blocks, or a regular file with a few pixels messed up here or there. You will not see whole-picture color alterations. These are the only failure modes that bit rot creates in a digital image.
It is more likely that the lighting in your room, or your monitor, or the color balance settings in the OS, or even the memory in your head has changed, than that bit rot or an unreliable file system will have created changes like you describe, because it's physically impossible unless you have a virus actively altering them.
Hammer has significant check sums that assure the health of the data. However nothing can prevent actual bit rot. In addition to check sums to verify data, you need some sort of extended error correction data stored with the file. There are user space applications that can apply this on a block-by-block basis. One example might be the dvdisaster utility available in Debian, which computes the ECC blocks to help preserve data on DVD.
I've created PAR2 files for all my photos. I've got a kid and although I make multiple backups, I neither trust the filesystem (HFS+) nor the backup (Time Machine and CrashPlan). Especially with photos, it's really easy because it makes sense to put them in directories per time period (for example every quarter or month), for instance, "/Pictures/2017 Q1". When you create a new folder, just create par2 files in the old folder, like so:
$ par2create par2file *
To verify them:
$ par2verify par2file.par2
Big advantages of par2 versus other methods:
- It works independent of file system
- It can not only verify but also repair
8 of 13 people found this answer helpful. Did you?
If you don't want to go with the great integrated solutions above, such as ZFS or PAR, you can use a hash program like md5sum. Its multithreaded equivalent md5deep also has a nice "recursive" option. Create a text file with all the md5sums of all the files. Then repeat a year later, and compare (via sort, diff). Advantages of an explicit checksum is that it's more compatible, and you only ever need one drive connected to read / write data, and the data are available instantly. Disadvantages: more work, need twice the storage (where ZFS and PAR can use more efficient encoding).
I will also note that I have never had bit rot attributable to hard drives on my ZFS pool, over about 5 years. This is using "green" 2TB drives. Check your hardware, especially RAM (credit to GuB-42 who mentioned this above), cables and power supply.
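For anyone who would rather script the manifest-and-compare routine than chain md5sum, sort and diff, here is a minimal Python sketch. The function names are invented for the example, and it uses SHA-256 by default rather than MD5 (either works for spotting accidental corruption). Note that a changed hash only tells you the file differs; it cannot tell bit rot from a deliberate edit.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in RAM."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(old, new):
    """Return (changed, missing, added) path lists between two manifests."""
    changed = sorted(k for k in old if k in new and old[k] != new[k])
    missing = sorted(k for k in old if k not in new)
    added = sorted(k for k in new if k not in old)
    return changed, missing, added
```

Save the manifest (as JSON, say) next to the archive and on a second medium, rebuild it a year later, and anything reported under `changed` for files you never touched is worth restoring from another copy.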
Disks are pretty much safe, it's the ECC-less transfer to new disks which is risky. It is very very unlikely for bitrot on disk to go unnoticed thanks to ECC, but when you migrate your data to bigger disks or new filesystems, the chance of an occasional bitflip due to critical timings (SDRAM, busses, chips, clocks), old PATA cables (no data checksum), EMI, radiation etc. increases with the size of the dataset. The solution is checksumming before and after the copy. Even deprecated algorithms (MD5, SHA1) will do.
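Checksumming before and after a copy might look like the following in Python. Treat it as a sketch: `verified_copy` is a made-up helper name, SHA-256 is used where MD5 or SHA1 would do just as well for this purpose, and reading the destination straight after writing may be served from the operating system's cache rather than from the new disk, so a second check after remounting the drive is a stronger test.

```python
import hashlib
import shutil


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def verified_copy(src, dst):
    """Copy src to dst and fail loudly if the two checksums differ."""
    before = sha256_file(src)
    shutil.copyfile(src, dst)
    after = sha256_file(dst)
    if after != before:
        raise OSError(f"checksum mismatch copying {src} to {dst}")
    return after
```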
...I'd recommend RAR with as big a recovery record as you can get away with, or PAR - These both provide really good error correction for small bits (haha) of corruption - and then store the lot on whatever you have to hand (magnetic tape, magnetic disk, optical disk, 3d printed stone tablet etc.) and transfer to newer media every now and then to avoid the data being stuck on obsolete tech. (Like the data I have trapped on my EZ135 cartridges ;_;)
Ignore the idiots suggesting ZFS and BTRFS - They only do error checking, not correction, so they will only tell you the data is corrupt, but not repair it.
Actually that's not 100% true - They can repair if they are in a RAID configuration (i.e. 1, 5 or 6) but then you'd have to keep those disk together and, esp. with 5 and 6, the potential to lose all the data on the disks is higher e.g. if some of the disks die (Then you can't even use a data recovery company to get stuff back!)
btrfs is in mainline now, and has a number of years to have settled down. Even if you don't like the more advanced features, it has some that tick all the boxes:
- good on-disk checksums to detect errors (incl bit rot) for metadata and data
- RAID modes to protect from whole disk failures
- realtime and online scrubbing to detect and recover from checksum failures from another copy (RAID1) or rebuilding from parity (RAID5/6) - no action required from user (contrast with PAR solutions proposed)
- subvolumes for segregation of data if needed, especially if there is a desire to consolidate multiple older drives and especially useful to pool capacity from these disparate sources to implement RAID modes.
- online reshaping for the above
So, even if you're not accessing the data frequently, if OP cares about data I'm sure it's no hassle to plug them in once a year and let a scrub run (for a couple of days if needed, I know this part of the code is still terribly slow). Even if btrfs is deprecated today, it will be a long time before support is removed from the kernel, and even longer before the last distro stops supporting it, and longer yet before that last distro release refuses to boot on whatever incarnation of hardware is available to plug the drives into. All the while the data is free to be migrated onto new spare/surplus drives and a new filesystem if needed.
There are several projects/tools out there.
Search for reed-solomon
https://www.thanassis.space/rs...
http://unix.stackexchange.com/...
I used par2 to put my videos on CD-R, but those are now 10 years old and I did not check if it's still readable :-)
Atari rules... ermm... ruled.
"for the ages" means a tradeoff against bitrot resistance and readability. esoteric fses are unlikely to be easily readable in a few decades.
I've settled on NTFS because of its support in $any_os and likely support in the decades to come. I supplement it with parity files to not only be aware of bitrot but also to have some limited resistance to it. I wrote a tool that allows you to easily analyse file hierarchies and check, make or monitor their par2 status. https://github.com/brenthuisman/par2deep/
"par2 c -R -r15 your_desired_par2_filename_here ./*" without the quotes
this is fine with ext4 or ntfs or hfs for that matter
15% should be ok but you can -r20 or -r25 or whatever you like
zfs with copies=2 on a single disk or a two-disk mirror or better is probably OK without the par2
Archive to M-discs. They last for 1000 years, which should be long enough for most people!
That's not how "bit rot" works and your file system doesn't have anything to do with changing files either. If your FS is the problem, you won't be able to access the file. If bitflips were the problem, your file would be corrupt (which doesn't shift your colors, but causes errors in the image. Single bitflips probably won't even be visible at all).
Then the "bit rot" isn't a problem on hard drives manufactured in this millennium. Either your drive fails or has bad sectors or your files are probably okay. The probability of a single bitflip is very very low, the bit rot you think you're observing is obviously something else. I guess you got a new monitor and a less tolerant video player or something similar.
Reed-Solomon is a block-based ECC rather than stream-based (convolutional), so memory errors would affect only that one block.
Probably the most convenient way to use Reed-Solomon, where the math works out nicely, is to apply it to 512-byte blocks. That also happens to be the native size of hard drive sectors, so that's how it's most often used. Each sector has its own ECC. The ECC of one block doesn't affect any other block. There are several decoding algorithms for Reed-Solomon which may have different characteristics as far as how many bits in that block might be affected by a memory error of one bit.
The extended binary Golay code represents 12 bits of data with 24 encoded bits and corrects up to three errors in those 24 bits. It can detect up to seven errors. A memory bit-flip wouldn't be a problem, but eight flipped bits could result in all 12 bits being read incorrectly.
The other class is convolutional (stream-based) codes. As a class, convolutional codes aren't limited to a fixed-size block, so some set of errors could propagate. Of course smart people design these codes, so I can't think of any off hand that are designed such that they actually propagate errors in an unbounded way. The general type would most likely be one that looks a bit like a cross word puzzle of many dimensions - a single bit gives information about many otherwise unrelated bits.
Convolutional codes are typically used "close to the metal", with analog values rather than digital.
Consider you're applying ECC to the electrical signal in a cable, or a wireless signal. The protocol may specify that 1 volt positive (or higher) is logical true, 1 volt negative (or lower) is logical false. Suppose you're using triple modular redundancy, which simply means you send each signal three times. You might read the following values:
+1.8
-0.6
+0.3
Even though two of the three values are invalid, we can see that they are clearly biased toward the positive and therefore treat it as logical true. Space probes sending pictures from millions of miles away require convolutional coding with high redundancy.
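The triple-redundancy example above can be written out as a few lines of Python. This is a toy sketch of the soft-decision idea only: the 1 volt threshold comes from the example, the function names are invented, and real decoders for convolutional codes are far more involved.

```python
THRESHOLD = 1.0  # volts; at or beyond +/-1 V counts as a valid logic level


def valid_samples(samples):
    """Hard decision: keep only readings that meet the protocol threshold."""
    return [v for v in samples if abs(v) >= THRESHOLD]


def soft_decision(samples) -> bool:
    """Soft decision: use the sign of the summed analog readings.

    Weak readings still carry information about which way the signal leans.
    """
    total = sum(samples)
    if total == 0:
        raise ValueError("readings show no bias either way")
    return total > 0


readings = [1.8, -0.6, 0.3]
print(valid_samples(readings))   # only one reading is strictly valid
print(soft_decision(readings))   # the three together lean positive
```

A hard-decision receiver would have a single usable sample here, while the soft decision uses all three and reads the bit as logical true.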
I've been archiving files since the 1980s, and have a ~20TB collection at the moment.
Your potential sources of data loss and solutions are:
1. Problem: Not being able to find the file in all the disks you have.
Solution: Organize stuff, even if that means using twice the number of disks to leave room to add future files in the right place. Use the Library of Congress Cataloging system when possible; They really did think of everything.
2. Problem: Not being able to find a reader for the file. (Thus mis-colored images as the image formats evolve.)
Solution: Use only a few common formats: JPEG, PDF, DOCX, TXT, GIF. Seriously, even avoid TIFF files.
3. Problem: Disk failure.
Solution: Keep multiple copies of everything in different physical locations.
4. Problem: Not being able to read the obsolete disk format (e.g., MFM, RLL, proprietary 1.76MB formatted 3.5" floppies.)
This is what you asked about.
Solution: Keep everything in the MOST COMMON formats available. ZFS is evolving so quickly that you will not be able to find a reader for today's version in 5 years. Stick to NTFS only because FAT32 refuses to deal with large files. Stick to USB enclosures, and make sure you can easily remove the SATA drives to access them directly if necessary.
It's tough to beat M-DISC's purported shelf-life of over 1,000 years. The discs are a couple dollars apiece and the drives are around $50. If you're concerned about not having the hardware required to access the filesystem on the disc in the future, simply migrate your files from one M-DISC (or equivalent) to another every so often to mitigate this risk. Each time you burn the files, perform a bit-by-bit analysis to ensure that all the hashes line-up, and you should be in good shape.
From our own bodies we can learn this (tumours).
On a long enough timeline even our own DNA becomes corrupt through replication errors and within a relatively small amount of generations little to none of it remains as to be recognisable.
One cannot reasonably expect to retain data indefinitely- whether one looks at it from a hardware or software perspective.
This needs to be put in perspective when discussing matters such as AI - eventually it shall almost certainly suffer a computational amnesia, dementia or insanity as a result of the above.
But for the purpose you seem to have asked, a combination of well chosen hardware and software solutions should give you some good mileage. The rest is really semantics. You can't buy immortality off the shelf.
Make copies of things you care about occasionally on new media. If you don't care about something, let it rot. It's very liberating, kind of like burning your desk.
Have you read my blog lately?
I like this slashdot question. But I'd like to expand it. Because like the OP, I have a couple terabytes of crap I'd prefer not to lose. I currently employ a manual mirroring of the drive to an offline drive of equal size and store that drive away from my computers.
My problem is portability though. Currently, both my drives (the online and offline copy) use NTFS for the filesystem. I chose NTFS because I want the drive to be accessible from both Linux and Windows machines. For example, I like to take the drive with me when I travel so I can watch my videos wherever I go should I get bored.
So my expanded question is: Which filesystem is the best for data retention and portability across Windows and Linux?
As an additional, if anyone wants to bite, how come there are no decent third party file system drivers for Windows? It seems like long past due for some good third-party filesystem drivers to be out there and usable.
ps. Never experienced any form of bit rot on standard spinner HDDs. Only time I've ever had issues with data loss on media is with recordable CDs and DVDs which I've long since stopped using for any purpose due to their proven unreliability. USB flash drives are also similarly unreliable as long term storage. I've multiple times gone to use a USB flash drive and discovered it's blank or scrambled and unusable.
>PAR2 uses Reed-Solomon error correction.
I'm no expert, but it seems to me that when the correct value of the data is disputed, cutting the data in half as a solution is a Bad Idea(TM) . . .
hawk
... pure, virgin 0's and 1's.
--I wonder if anyone has ever thought of porting par2 to a FUSE filesystem? ;-)
--Here's the closest thing I found with a quick search:
http://askubuntu.com/questions...
.
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
First off, people in this thread seem to think that all the information people are saving is about them.
I have lots of data as well, but most of it is about my kids and family. My kids are starting to hit their teens, and we still go back and watch old family videos. Most are short, under 2 min. They capture points of times in our lives that we won't get back. Our old house, in a different state... friends we had then, neighbors. Not like full documentaries on them, but fading memories. It's not important to anyone else. And if I leave that to them, they may keep it or destroy it. But that is their choice to make. With the ability to capture all of this with much more ease than the previous generation, why shouldn't we? I only have a few photos of my grandparents, and I would like to have more. But they don't exist. I don't have any videos of them at all.
It's not about being famous, or because you are important in the world's eye. There are entire professions dedicated to preserving that. It's about whatever YOU want to preserve. I once found a website that had information about my family name, I had never seen it anywhere before. Pictures and scans of documents, etc. I saved that information off, and about a year after that, the site went away. Since then, ancestry.com came to be, and I was able to use that information I saved to help stitch together our family tree.
We are in the information age, I don't understand why someone would be so opposed to preserving that.
You sound very angry about something, you should probably figure out what that is before it's too late.
My beliefs do not require that you agree with them.
The device error correction is probabilistic. It won't necessarily know the data is "bad". And there can be firmware bugs which make it return or store bad data. What about phantom reads and writes? https://www.youtube.com/watch?... is a very interesting presentation from Bryan Cantrill about all sorts of bitrot and storage stuff.
The ZFS on-disk format is extremely well documented and not that terribly hard to understand.
ZFS On-Disk Specification
It's conceptually well thought out and doesn't require a lot of corner cases. It's stable and common to every current ZFS implementation. I haven't looked at all the feature bits subsequently added, but I don't think many of them complicate recovery of basic files whatsoever.
Examining ZFS On-Disk Format Using mdb and zdb: Max Bruning
This is not ZFS Internals for Dummies, but do note how the available tools are first rate. Plus, the on-disk structure is integrity checked all the way down. Most likely, any misconception about bit-patterns will be brutally put to rest by the next disk block you fetch.
Finally, there's very little fundamental churn here, because ZFS was designed for the future on day one.
—
ZFS has one Achilles heel: the absence of block-pointer rewrite. Basically, the integrity layer is overly rigid about block placement, and thus certain kinds of desired flexibility are off the table, now and forever, until ZFS 2.0 comes with a different on-disk placement record (which might never happen, as the principals all refer to BPR in hushed voices as some large, daunting project—probably because maintaining the historical testing standard requires industrial-strength support).
—
Contrary to your ludicrously uninformed tar-pit scenario, ZFS is a paragon of long-term, binary-format stability.
You must somehow think the future can barely start a fire by rubbing two sticks together.
Have you ever turned the pages of Office "Open" XML?
This is potentially a hundred times more baroque, Byzantine, and baffling to some rub-stick deprived future generation, that wakes up severely hung-over and back-to-the-buggies as the dust settles on the AI apocalypse.
(How do we finally beat the AI uprising? Probably through a nefarious virus bearing an OOXML payload, which even the ascendant AI-powered globally-distributed firewall fatally misclassifies.)
—
For true geeks, ZFS origin story from the horse's mouth.
The Birth of ZFS by Jeff Bonwick
BP-rewrite mentioned at 18:00. Something technical about DVA (data virtual address). Then a horrible "bolt on" is mentioned. Then a mike drop.
Circa 7m: it's not going to be a team of 80 people, it's just going to be me and Matt at the white board all day, every day. (And y'all knows how that goes. Turns out "tank" is a character from the Matrix.)
Circa 11m20: ztest and zloop
QuickPar on Windows is long-obsolete. MultiPar is the more modern variant.
Filesystem for the ages, eh?
For this purpose I would deploy a private BitTorrent tracker. You'd have to create .torrent files and add them in your torrent client, then run a data integrity check every once in a while. Have at least two of these client nodes and you should be good to go.
This ensures data recovery in case of corruption (BitTorrent will download just the corrupted blocks from the other peer).
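For anyone wondering why only the corrupted blocks need to travel: BitTorrent v1 records a SHA-1 hash per fixed-size piece, so a client can tell exactly which pieces are bad. A rough Python sketch of that check-and-repair loop follows. The function names and the tiny piece size are made up for the demo, the "peer" is just a second byte string, and both copies are assumed to be the same length as the original.

```python
import hashlib

PIECE_SIZE = 4  # tiny for the demo; real torrents use far larger pieces


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> list:
    """SHA-1 of every fixed-size piece, like the hash list in a .torrent."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]


def bad_pieces(data: bytes, hashes: list, piece_size: int = PIECE_SIZE) -> list:
    """Indexes of pieces whose hash no longer matches the recorded one."""
    current = piece_hashes(data, piece_size)
    return [i for i, h in enumerate(current) if h != hashes[i]]


def repair(local: bytes, peer: bytes, hashes: list,
           piece_size: int = PIECE_SIZE) -> bytes:
    """Replace only the bad local pieces with verified pieces from the peer."""
    fixed = bytearray(local)
    for i in bad_pieces(local, hashes, piece_size):
        start = i * piece_size
        candidate = peer[start:start + piece_size]
        # Accept the peer's piece only if it matches the recorded hash.
        if hashlib.sha1(candidate).digest() == hashes[i]:
            fixed[start:start + piece_size] = candidate
    return bytes(fixed)


original = b"photos and videos"
hashes = piece_hashes(original)          # recorded when the data was good

damaged = bytearray(original)
damaged[5] ^= 0xFF                       # flip bits in one byte
damaged = bytes(damaged)

print("bad pieces:", bad_pieces(damaged, hashes))
assert repair(damaged, original, hashes) == original
```

If both nodes happen to rot in the same piece, nothing here can fix it, which is the argument for more than two nodes or for adding parity data.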
http://www.theverge.com/2016/2...
No info is ever lost there. It lasts (almost) forever right at the event horizon.
Retrieving it is a bit challenging, though.
Self-importance and self-indulgence is the root of ALL evil.
PAR2 is not a filesystem, but rather a means of generating error-correction codes to detect and repair damage to files.
The actual PAR2 algorithm hasn't changed, though development for PAR3 is ongoing. It's simply that one particular implementation, QuickPar, is obsolete, while MultiPar, a similar program that is completely compatible, is more modern.
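If the idea of recovery data is unfamiliar, here is a toy Python sketch. Note that it swaps in plain XOR parity, which can rebuild exactly one lost block, whereas PAR2 uses Reed-Solomon codes that can rebuild several. The function names are invented for the example, and all blocks are assumed to be the same length.

```python
from functools import reduce


def xor_blocks(blocks: list) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks: list) -> bytes:
    """One parity block covering all data blocks."""
    return xor_blocks(blocks)


def rebuild(blocks: list, parity: bytes) -> list:
    """Restore the single block marked as None using the parity block."""
    if blocks.count(None) != 1:
        raise ValueError("XOR parity can rebuild exactly one lost block")
    missing = blocks.index(None)
    survivors = [b for b in blocks if b is not None]
    restored = list(blocks)
    # XOR of the survivors and the parity yields the missing block.
    restored[missing] = xor_blocks(survivors + [parity])
    return restored


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)

damaged = [b"AAAA", None, b"CCCC"]       # block 1 is known to be lost
assert rebuild(damaged, parity) == data
```

Knowing which block is bad is a separate job; PAR2 handles that with per-block checksums before any repair is attempted.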
I know I'm terribly late, but since I read this Ask Slashdot while I was finishing up / testing the tool in subject, I can't avoid a shameless plug. So here it is.
https://github.com/MarcoPon/SeqBox - Sequenced Box container - A single file container/archive that can be reconstructed even after total loss of file system structures
You can encode a file in an SBX container, and gain better recoverability if disaster happens, and also bit-rot detection, since each block is CRC tagged.
In addition, if you make multiple copies of the SBX file (on the same volume, on different media, whatever), and every one of them is damaged in some way, the SeqBox recovery tools can scan for valid SBX blocks on multiple files/block devices, collect all the good parts from all available sources, and (hopefully) gather enough data to reassemble the original container.
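The multi-copy recovery idea can be sketched in a few lines of Python. To be clear, this is not the real SBX format: the block layout (CRC32, then a sequence number, then a small fixed payload) and every name below are assumptions for illustration, and the scan only looks at block-aligned offsets within each copy.

```python
import struct
import zlib

PAYLOAD = 8                  # payload bytes per block (tiny, for the demo)
BLOCK = 4 + 4 + PAYLOAD      # CRC32 + sequence number + payload


def encode(data: bytes) -> bytes:
    """Wrap data in self-describing blocks: CRC32 | sequence | payload."""
    out = bytearray()
    for seq, i in enumerate(range(0, len(data), PAYLOAD)):
        body = struct.pack(">I", seq) + data[i:i + PAYLOAD].ljust(PAYLOAD, b"\0")
        out += struct.pack(">I", zlib.crc32(body)) + body
    return bytes(out)


def good_blocks(container: bytes) -> dict:
    """Map sequence number to payload for every block whose CRC checks out."""
    found = {}
    for off in range(0, len(container) - BLOCK + 1, BLOCK):
        (crc,) = struct.unpack_from(">I", container, off)
        body = container[off + 4:off + BLOCK]
        if zlib.crc32(body) == crc:
            (seq,) = struct.unpack_from(">I", body)
            found[seq] = body[4:]
    return found


def recover(copies: list, length: int) -> bytes:
    """Merge the good blocks from several damaged copies into the original."""
    merged = {}
    for copy in copies:
        for seq, payload in good_blocks(copy).items():
            merged.setdefault(seq, payload)
    count = -(-length // PAYLOAD)  # ceiling division
    missing = [s for s in range(count) if s not in merged]
    if missing:
        raise ValueError(f"no intact copy of blocks {missing}")
    return b"".join(merged[s] for s in range(count))[:length]


data = b"a file worth keeping for decades"
box = encode(data)

copy_a = bytearray(box)
copy_a[10] ^= 0xFF                 # damage block 0 in the first copy
copy_b = bytearray(box)
copy_b[2 * BLOCK + 10] ^= 0xFF     # damage block 2 in the second copy

assert recover([bytes(copy_a), bytes(copy_b)], len(data)) == data
```

Since every block carries its own sequence number and CRC, no file system metadata is needed to put the pieces back in order.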