Slashdot Mirror


File Systems Best Suited for Archival Storage?

Amir Ansari asks: "There have been many comparisons between various archival media (hard drive, tape, magneto-optical, CD/DVD, and so on). Of course, the most important characteristics are permanence and portability, but what about the file systems involved? For instance, I routinely archive my data onto an external hard drive: easy to update and mirror, but which file system provides the best combination of reliability, future-proofing, data recovery, and availability across multiple platforms (Linux, OS X, BeOS/Zeta and Windows, in my case)? Open Source best guarantees the future availability of the standard and specification, but are file systems such as ext2 suitable for archival storage? Is journaling important?"

11 of 105 comments

  1. Re:How Archival? by larien · · Score: 3, Interesting
    Just to be pedantic, ISO isn't the filesystem, it's either ISO9660 (CD-ROM) or UDF (DVD).

    However, you're correct that both are ubiquitous standards and likely to be readable by all modern operating systems and should be for some time to come.

  2. No Filesystem is Best by xanalogical · · Score: 2, Interesting

    If you're only using it for archive, writing anew each time, then skip the file system altogether. Treat the media like a block device, tar or otherwise archive your backup and just write the tar as a single, linear sequence of bytes. And don't compress it, so that a bit error early in the sequence doesn't mess up later blocks.

    Now which archive format is best - tar, cpio, etc.? I've heard that cpio is a much simpler underlying format.

    And if you have the space, write the archive sequence multiple times onto the block device, so if one block is destroyed you can pick it up from a peer location.
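
    A minimal Python sketch of the approach described above, under stated assumptions: the function names, the three-copy default, and the per-block majority vote are illustrative choices rather than anything the poster specified, and the archive length is assumed to be recorded somewhere outside the medium. `target` may be an ordinary file or a raw device path.

```python
import io
import tarfile
from collections import Counter

BLOCK = 512  # tar archives are built from 512-byte blocks


def build_tar(paths):
    """Return an uncompressed tar archive of `paths` as bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:  # mode "w": no compression
        for p in paths:
            tar.add(p)
    return buf.getvalue()


def write_copies(archive, target, copies=3):
    """Write `archive` back to back `copies` times to `target`."""
    with open(target, "wb") as out:
        for _ in range(copies):
            out.write(archive)


def recover(target, length, copies=3):
    """Rebuild the archive, taking the most common version of each block."""
    with open(target, "rb") as f:
        images = [f.read(length) for _ in range(copies)]
    out = bytearray()
    for off in range(0, length, BLOCK):
        votes = Counter(img[off:off + BLOCK] for img in images)
        out += votes.most_common(1)[0][0]
    return bytes(out)
```

    Voting only helps when a given block is damaged in fewer than half of the copies; with two copies a mismatch can be noticed but not resolved.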

    -Jeff

  3. What about error correction? by F00F · · Score: 5, Interesting

    I've been wondering lately why no common file systems seem to implement error correcting codes (ECC/EDAC).

    In hardware, there's often a checksum, ECC/Hamming code, parity bit, Reed-Solomon code, etc. to detect and/or correct for inadvertent bit flips. But, as far as I know, no error correcting information is ever stored within the filesystem itself. Certainly the filesystem tracks how many blocks are dedicated to a particular file, and how many bytes long the file is, and one can always hash the file twelve ways to Sunday to assure that it hasn't changed since it was originally hashed, but none of that helps repair errors to the file should the medium that's being used to store it decay beyond what's already correctable via the medium access hardware.

    I can imagine scenarios where, for example, the RAM buffer in a hard drive is upset and perfectly encodes the wrong bit into a file (or even multiple stripes + parity in a RAID). In this case, the medium access hardware is useless (the data was, after all, encoded perfectly wrong), but ECC in the filesystem would detect and potentially correct the error the next time the file was read back, even if it were decades later. I appreciate that it would add overhead, and thus maybe shouldn't be the default, but I don't see it being even an option anywhere, and some people would pay the performance penalty to get the data integrity benefit.
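
    To make the idea concrete, here is a small Python sketch of a Hamming(7,4) code, one of the classic error-correcting codes of the kind mentioned above. It is an illustration only, not how any particular filesystem stores data; the function names are made up for this example. Each 4-bit value is stored as 7 bits, and any single flipped bit in a codeword is corrected on read.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) as a 7-bit codeword with 3 parity bits."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers codeword positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers codeword positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]  # positions 1..7
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(code):
    """Return the 4 data bits, correcting a single flipped bit if present."""
    bits = [(code >> i) & 1 for i in range(7)]  # bits[i] is position i + 1
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)  # 0 means no error detected
    if syndrome:
        bits[syndrome - 1] ^= 1  # the syndrome is the position of the bad bit
    data = [bits[2], bits[4], bits[5], bits[6]]
    return sum(b << i for i, b in enumerate(data))
```

    The cost is visible in the ratio: 7 stored bits per 4 data bits, which is the kind of overhead the paragraph above anticipates. Two flipped bits in the same codeword are beyond what this code can repair.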

    Especially in instances like encrypted (or compressed, or both) loopback file systems where one bad bit can destroy an entire partition, why don't we have more data assurance layers available? Or have I just not found them?

    Whining of which, what was the deal with GNU ecc? Everyone speaks of "oh, yeah, the algorithm was deeply flawed, bummer..." but I don't ever see any details ...

  4. Re:The best archival filesystem by jamesh · · Score: 2, Interesting

    I'm sure that a while ago I read about a system that could print encoded data onto paper at a reasonably high density (e.g. not readable by a human, but easily decoded with a scanner). At a 'plucked out of the air' figure of .25mm x .25mm per 'bit', and an equally 'plucked out of the air' figure of 11 bits of data per byte (to allow for clocking and maybe some error correction), you'd fit about 80 KB on a single page of A4, and about 40 MB per 500 sheet ream. Not that high (and possibly much higher or much lower once you stop plucking figures out of the air :), but if you had some stuff that you wanted stored for a seriously long time it might be feasible. Add in a few pages describing the encoding you have used and store it properly, and it might still be useful in thousands of years...
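
    The arithmetic can be checked with a few lines of Python. The A4 dimensions (210 x 297 mm) are standard; the margin parameter is an assumption added here, since the post does not say how much of the page is printable.

```python
A4_MM = (210, 297)  # A4 width and height in millimetres


def page_capacity_bytes(cell_mm=0.25, bits_per_byte=11,
                        page_mm=A4_MM, margin_mm=0.0):
    """Bytes that fit on one page at one printed cell per bit."""
    width = page_mm[0] - 2 * margin_mm
    height = page_mm[1] - 2 * margin_mm
    cells = int(width // cell_mm) * int(height // cell_mm)
    return cells // bits_per_byte


full_bleed = page_capacity_bytes()               # 90720 bytes per page
with_margin = page_capacity_bytes(margin_mm=10)  # 76552 bytes per page
ream = 500 * full_bleed                          # 45360000 bytes per ream
```

    Printed edge to edge, a page holds roughly 90 KB; with 10 mm margins it holds roughly 77 KB. The estimate of about 80 KB per page and 40 MB per ream sits between the two.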

  5. Simple... by evilviper · · Score: 3, Interesting
    I routinely archive my data onto an external hard drive: easy to update and mirror, but which file system provides the best combination of reliability, future-proofing, data recovery, and availability across multiple platforms (Linux, OS X, BeOS/Zeta and Windows, in my case)?

    Ext2 fs mounted rw,sync. When just reading, or just writing, async can't possibly help performance. You're strictly limited by disk I/O. Async will, however, cause irrecoverable corruption if there's a system crash or power failure, which was a source of great frustration with Linux before the journaling filesystems came along.
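
    The `sync` mount option applies to the whole filesystem. As a rough per-file analogue for a volume that is mounted async, a copy routine can ask the operating system to flush each file before moving on. The Python sketch below is an assumption-laden illustration, not the setup described above; `copy_durably` is a hypothetical helper name.

```python
import os


def copy_durably(src, dst, chunk=1 << 20):
    """Copy src to dst, then force the written data out of the OS cache."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(chunk)
            if not block:
                break
            fout.write(block)
        fout.flush()             # empty Python's own buffer
        os.fsync(fout.fileno())  # ask the OS to write the file to the device
```

    This covers the file's contents only; whether the directory entry and the drive's own write cache are flushed depends on the platform and hardware.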

    Ext2 can be read by nearly every operating system out there, and doesn't have the numerous limitations of FAT32.

    Which, incidentally, is the exact same answer I gave a few months ago, when the last guy wrote an Ask Slashdot to ask the exact same question...
    --
    Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
  6. Is "sticktion" still a problem? by Marrow · · Score: 2, Interesting


    If you leave a drive in a closet for 10 years, will it still spin up?

  7. Non-IT answer by Overzeetop · · Score: 4, Interesting

    The best file system for archival purposes is the one you're using today. Why? Because if you want that archive to be readable in any expedient manner, you are going to have to constantly monitor and update the media on which it is stored. All media will degrade over time, and you will have no idea how bad that degradation has been until you re-read it. No vendor will compensate you for the loss of your data, because there is some data which simply cannot be recreated.

    If you want archival storage, you need to have your data on- or near-line, and rewrite the data to the "new" hardware every couple of years. By choosing a filesystem that is current, you are more likely to be able to read it in a couple of years than if you (try to) stick with a single filesystem. I know this sounds like a lot of work, but if the data is truly worth archiving, it's worth keeping both the storage mechanism and format up to date.
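
    One way to find out how an archive has fared when it is re-read is to record a checksum for every file at write time and compare later. The Python sketch below assumes SHA-256 and an in-memory dictionary as the manifest; both are illustrative choices, and the function names are hypothetical.

```python
import hashlib
import os


def file_sha256(path, chunk=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_sha256(path)
    return manifest


def verify(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    bad = []
    for rel, digest in manifest.items():
        path = os.path.join(root, rel)
        if not os.path.isfile(path) or file_sha256(path) != digest:
            bad.append(rel)
    return sorted(bad)
```

    A checksum only reports damage. Repairing it still requires a second copy or some form of error-correcting data.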

    --
    Is it just my observation, or are there way too many stupid people in the world?
  8. Worry about the hardware, not software by MightyYar · · Score: 4, Interesting

    Thanks to the emulation community, I can read data from an old Commodore 64, Apple ][e, Atari, etc. on any modern computer running any mainstream operating system. What I cannot do is easily hook up an old Apple ][e disk drive to my modern hardware. The filesystem will not really matter so much, because even if Wintel goes the way of the Commodore 64, someone will make a DOSBOX-esque emulator for it. Getting data off of an ATA, SATA, USB, or Firewire drive might be more challenging once new hardware ceases to support those standards.

    Personally, I just throw stuff on external hard drives. 3-5 years later, the new drives are so much bigger, faster, and cheaper that it becomes economical to consolidate to a new drive. I still have data from a 286 that had nothing but floppies, an Apple ][e, and 2 dead Macintoshes. Even my old Windows 95 computer lives on as a VirtualPC image. I don't really use them that much, but the Apple ][e and 286 stuff is under 50 megs, and the VirtualPC image is 2GB. The images of the old Mac hard drives total less than 1GB... it's simply not worth deleting them and it's kind of fun to have my old computers still around, if only "virtually".

    --
    W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
  9. Don't use FAT by AusIV · · Score: 2, Interesting
    FAT has issues with partitions larger than 32 GB and files larger than 4 GB. It's nice for Flash drives that you're taking from a Windows PC to a Mac to a Linux box, but if you're talking about serious archives, you'll definitely run into the first problem, and quite possibly run into the second.
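
    The file-size limit is easy to test for before copying anything. FAT32 stores a file's size in a 32-bit field, so the largest possible file is 2^32 - 1 bytes. The Python sketch below lists files over that size; the function name and the `limit` parameter are illustrative.

```python
import os

FAT32_MAX_FILE = 2**32 - 1  # largest file size FAT32 can record, in bytes


def too_big_for_fat32(root, limit=FAT32_MAX_FILE):
    """Return (path, size) for every file under root larger than limit."""
    hits = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                hits.append((path, size))
    return hits
```

    An empty list means nothing under `root` would be rejected for its size alone.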

    I use Ext3 for my backup drive, and this driver for when I need to attach it to a Windows box.

  10. What are your parameters? by davidwr · · Score: 2, Interesting

    Do you need data-readback in a matter of seconds? Minutes? hours? days?
    Do you need storage for years, decades, centuries, millennia, 10,000 years, or longer?
    Do you need an indexing system based on content or just on title/filename?
    Can the data be printed out or carved into stone without losing important information?
    Is this a go-to-jail-if-you-don't legal requirement, a may-go-bankrupt-if-you-don't business requirement, or a save-us-a-bunch-of-money-nice-thing-to-have requirement?
    Do you think the cost of researching the "best" solution is worth the improvement over the 2nd- or 3rd-best solution?

    Let's assume you need it for 50 years, access is infrequent, and you can wait 24 hours for data recovery. Talk to the folks at Iron Mountain and other data-retention warehouses; they are experts in the field and will be happy to consult with you or do the entire job turn-key.

    My hunch:
    For most applications involving less than 50 years of data retention, making 2 copies of the raw data, to a currently supported stable medium such as tape or archival DVD, stored in separate locations, is key. Make sure the data is both in the original format and in a published-standard format which is widely supported.
    Keep multiple machines that can read the data around for as long as you need the original format. Every few years or as needed, verify the data is intact, re-convert the data from the original format or, if that format is unreadable, the highest-fidelity published-standard format, to a currently-supported published standard, and save it to a currently-supported archival format.
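
    With two copies on hand, the periodic check can be as simple as comparing them file by file. The Python sketch below assumes both copies are mounted as directory trees; `compare_copies` is a hypothetical name, and a mismatch shows only that the copies disagree, not which one is correct.

```python
import filecmp
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def compare_copies(copy_a, copy_b):
    """Return relative paths missing from one copy or differing in content."""
    filecmp.clear_cache()  # do not trust results cached by an earlier run
    a, b = list_files(copy_a), list_files(copy_b)
    problems = list(a ^ b)  # present in only one of the two copies
    for rel in a & b:
        same = filecmp.cmp(os.path.join(copy_a, rel),
                           os.path.join(copy_b, rel), shallow=False)
        if not same:  # shallow=False compares the actual bytes
            problems.append(rel)
    return sorted(problems)
```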

    Ideally, in 50 years time, you will have the original media plus several updated copies. You may or may not be able to read the original media but your most recent copies will be close enough to the original to be useful. If you are very lucky, the most recent copies will be identical to the originals AND you will still have the software and hardware to read them.

    Oh, for anything REALLY important, print it out on archival paper, or carve it into stone.

    --
    Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
  11. PSA: Worse comes to worst by Anonymous Coward · · Score: 1, Interesting

    Worst come to worst... The expression is "Worse comes to worst" as in "should the condition arise such that what was considered 'worse' is so bad that it's now the worst thing that could happen..."