Slashdot Mirror


File Systems Best Suited for Archival Storage?

Amir Ansari asks: "There have been many comparisons between various archival media (hard drive, tape, magneto-optical, CD/DVD, and so on). Of course, the most important characteristics are permanence and portability, but what about the file systems involved? For instance, I routinely archive my data onto an external hard drive: easy to update and mirror, but which file system provides the best combination of reliability, future-proofing, data recovery, and availability across multiple platforms (Linux, OS X, BeOS/Zeta and Windows, in my case)? Open Source best guarantees the future availability of the standard and specification, but are file systems such as ext2 suitable for archival storage? Is journaling important?"

30 of 105 comments

  1. If you are worried about interoperability use FAT by Xner · · Score: 2, Insightful
    It's simple and supported by almost all machines and devices. If worst comes to worst, you can hunt for your data with grep and dd.

    If you are not constantly editing the information (and you won't be, it's for archival purposes) the admittedly major downsides of not being journalled and being prone to fragmentation are non-issues. You might run into problems with capacity limits and/or file size limits though.

    --
    Pathman, Free (as in GPL) 3D Pac Man
  2. The best archival filesystem by Helvidius · · Score: 4, Funny
    I have heard that the most permanent way of preserving data for a long, LONG time is to write your data in stone, granite being one of the best. Aside from that, computer data will last a much shorter time than even the printed word. So buy some acid-free, archival quality paper and print those bits out!

    Of course, that's just my opinion--then again, I could be wrong.

    --
    "Care about people's opinions and you will be their prisoner." ~~Tao Te Ching~~
    1. Re:The best archival filesystem by jamesh · · Score: 2, Interesting

      I'm sure that a while ago I read about a system that could print encoded data onto paper at a reasonably high density (i.e. not readable by a human, but easily decoded with a scanner). At a 'plucked out of the air' figure of 0.25 mm x 0.25 mm per 'bit', and an equally 'plucked out of the air' figure of 11 bits of data per byte (to allow for clocking and maybe some error correction), you'd fit about 80 kbytes on a single page of A4, and about 40 MB per 500-sheet ream. Not that high (and possibly much higher or much lower once you stop plucking figures out of the air :), but if you had some stuff that you wanted stored for a seriously long time it might be feasible. Add in a few pages describing the encoding you have used and store it properly, and it might still be useful in thousands of years...
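The arithmetic above can be checked with a rough sketch. It assumes the whole A4 sheet (210 mm x 297 mm) is printable with no margins, which is an assumption added here rather than part of the comment; the function names are purely illustrative. With no margins the page comes out somewhat above the quoted 80 kbytes, so the comment's figure is plausible once margins are allowed for.

```python
# Back-of-the-envelope capacity of data printed on A4 paper.
# Cell size and bits-per-byte are the 'plucked out of the air' figures
# from the comment; the no-margin assumption is ours.
A4_WIDTH_MM = 210
A4_HEIGHT_MM = 297
CELL_MM = 0.25        # side length of one printed 'bit'
BITS_PER_BYTE = 11    # 8 data bits plus clocking / error-correction overhead


def page_capacity_bytes(width_mm=A4_WIDTH_MM, height_mm=A4_HEIGHT_MM,
                        cell_mm=CELL_MM, bits_per_byte=BITS_PER_BYTE):
    """Return how many whole bytes fit on one side of one sheet."""
    columns = int(width_mm / cell_mm)
    rows = int(height_mm / cell_mm)
    return (columns * rows) // bits_per_byte


def ream_capacity_bytes(sheets=500):
    """Return the capacity of a ream printed on one side only."""
    return sheets * page_capacity_bytes()


if __name__ == "__main__":
    print(page_capacity_bytes(), "bytes per page")
    print(ream_capacity_bytes(), "bytes per 500-sheet ream")
```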

    2. Re:The best archival filesystem by _Sharp'r_ · · Score: 2, Insightful

      Stone? Easily chipped or cracked if dropped, low tensile strength, not very portable? No thanks.

      Try thin metal plates. A little more difficult to etch by hand (which can be alleviated by using the right malleability of gold), but well worth it for the long-term benefits of damage-resistance and portability.

      --
      The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
    3. Re:The best archival filesystem by mrchaotica · · Score: 2, Insightful

      The downside of gold is that invading Conquistadors (or otherwise no-good people) might try to melt it down into bars or bullion, destroying your data.

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

  3. Don't overlook popularity by fromvap · · Score: 3, Insightful

    I would say that ubiquity is the most important factor in being able to read something in the future, not it being open source. FAT32 is certain to be easily, if not legally, accessible for the very short expected lifetime of an external hard drive. To improve data recovery capabilities, you might like to create some archives in RAR format for error checking, with PAR2 files for redundancy and recovery. Hard drive space is cheap, so for safety keep the uncompressed files as well as the archives. Since hard drives fail, you should have more than one of them. And ideally, make DVDs also. I created some files with early betas of OpenOffice 2, and it was not at all easy to open them once the file format changed before the final release. As another example, despite it being open source, the legal problems of Reiser may cause that file system to be inconvenient to access in the future. An outdated, but very popular legacy format will have support that will last far longer than people want it to. Because of the high market share that WordPerfect had in the days of Noah, even now you can open WordPerfect files in Word and OpenOffice. If you think FAT32 will be unreadable anytime soon, think again.

    1. Re:Don't overlook popularity by piranha(jpl) · · Score: 2, Insightful

      Does anyone use RAR outside of the copyright infringement scene?

    2. Re:Don't overlook popularity by MrHanky · · Score: 2, Insightful

      I second this. FAT-32 isn't the most robust file system out there, but it's ubiquitous and well understood. Robustness is probably not the most important aspect for archival storage, if that means write once and store, and it's meaningless if you can't read the format. It's not a modern file system, though, and has some problems (4 GB file size limit, etc.).

      I wouldn't say the same goes for RAR. It's a proprietary format, owned by a company and used mainly for piracy. I know you can extract it on many OSes today, but I wouldn't trust it for tomorrow. Neither would I trust Word to open WordPerfect files; I've received RTF files created by the latter that I couldn't open in Word. Market share alone doesn't guarantee anything; you need a format that is well known. Sadly, neither WP nor Word documents are.

    3. Re:Don't overlook popularity by RupW · · Score: 4, Informative

      Does anyone use RAR outside of the copyright infringement scene?

      Yep, I do. It's widely accepted, better than zip and better than .tar.gz or .tar.bz2 because it orders the files more intelligently than tar before trying to compress them. tar.rz goes some way to address that but you have to do it in two steps because rzip doesn't pipe. .tar.rz compression is about equivalent for large numbers of small files but rzip will often beat rar on single large files.

      The killer feature back in the day was the first good implementation of disk splitting. But the compression still stands up now.

      On my 'if I ever get free time' list is to implement rar's file ordering in GNU tar to see if that helps gzip and bzip2 catch up to RAR's compression ratio. But I've no idea if/when I'll ever get around to that.

      -- paid-up RAR user since 1996.
  4. How Archival? by Stone+Rhino · · Score: 4, Insightful

    Is this going to be relatively live, with data being mirrored onto it regularly, or is this going to be written once and accessed occasionally from then on? If you're only going to write to it a very small portion of the time, (or even WORM), journaling will be useless to you, since anything that takes out your data won't be stopped by it.

    How far into the future are you going to need it? I understand the whole "not wanting to become unreadable," but honestly, no one's going to bother re-implementing a filesystem to look at their old vacation photos. Pick a popular filesystem, and you'll be sure of support down the line. FAT's still doing just fine for itself, and the ISO filesystems for CDs and DVDs will be readable as long as people are making drives for them.

    All of the data integrity features on filesystems aren't going to protect against disk failure/media wearing out, and error correction on that scale is beyond the scope of any one disk to handle. Like the department jokingly advised, parity files and other methods can handle this in a robust, media-spanning manner, and protect against everything from a few flipped bits to a whole-disk data loss (assuming you have enough parity data).

    I think the reason not much talk about filesystems has been going on is because they're mostly irrelevant for this task. They're designed to handle the issues of a live environment; the issues that archives face are beyond the capability of how you choose to store your data on each piece of media to solve.

    --


    Remember, there were no nuclear weapons before women were allowed to vote.
    1. Re:How Archival? by larien · · Score: 3, Interesting
      Just to be pedantic, ISO isn't the filesystem, it's either ISO9660 (CD-ROM) or UDF (DVD).

      However, you're correct that both are ubiquitous standards and likely to be readable by all modern operating systems and should be for some time to come.

  5. No Filesystem is Best by xanalogical · · Score: 2, Interesting

    If you're only using it for archive, writing anew each time, then skip the file system altogether. Treat the media like a block device, tar or otherwise archive your backup and just write the tar as a single, linear sequence of bytes. And don't compress it, so that a bit error early in the sequence doesn't mess up later blocks.
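A minimal sketch of the "plain, uncompressed tar stream" idea, using Python's standard tarfile module. The function name and the target path are hypothetical; pointing the target at a raw device node rather than an ordinary file is an assumption that would need root privileges and great care, since it overwrites whatever is on the device.

```python
import tarfile


def write_plain_tar(source_paths, target):
    """Write the given paths into an uncompressed tar stream at `target`.

    `target` is an ordinary file path here; using a raw device node
    instead is an untested assumption and destroys the device contents.
    """
    # Mode "w" (rather than "w:gz" or "w:bz2") keeps the stream
    # uncompressed, so each member's bytes stay where tar put them.
    with tarfile.open(target, mode="w") as archive:
        for path in source_paths:
            archive.add(path)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    write_plain_tar(["photos", "documents"], "backup.tar")
```

Because the stream is uncompressed, damage to one region generally affects only the members stored there, which is the property the comment is after.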

    Now which archive format is best - tar, cpio, etc.? I've heard that cpio is a much simpler underlying format.

    And if you have the space, write the archive sequence multiple times onto the block device, so if one block is destroyed you can pick it up from a peer location.

    -Jeff

    1. Re:No Filesystem is Best by Aladrin · · Score: 4, Insightful

      You'd be MUCH better off creating PAR2 files for the archive set, instead.

      If you made 2 copies of the archive on the media, and piece 10 of both sets die, you've lost everything. If you made 1 copy of the archive, and a 10% par set, any 10% of the pieces (data and parity both) could die and you'd still have your data. If you made a 100% par set, you could lose half of the data and parity and still recover. And it doesn't matter which portions.

      Add to that the fact that if you lost piece 10 in archive 1, and piece 9 in archive 2, it would be not much fun to figure out the dead pieces and make a full archive again. With PAR2, the tool will do the work for you.
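To illustrate why parity beats plain duplication, here is a deliberately simplified sketch. PAR2 itself uses Reed-Solomon coding and can rebuild several lost blocks; the single XOR parity block below is a much weaker stand-in that can rebuild exactly one missing block, whichever one it is. All function names are illustrative.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Return one parity block covering all the data blocks."""
    return xor_blocks(data_blocks)


def recover_missing(blocks_with_gap, parity):
    """Rebuild the single block marked None, given the parity block."""
    missing = [i for i, block in enumerate(blocks_with_gap) if block is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity can rebuild exactly one missing block")
    survivors = [block for block in blocks_with_gap if block is not None]
    restored = list(blocks_with_gap)
    # XOR of all surviving blocks and the parity yields the lost block.
    restored[missing[0]] = xor_blocks(survivors + [parity])
    return restored


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = make_parity(data)
    damaged = [b"abcd", None, b"ijkl"]
    print(recover_missing(damaged, parity))
```

The point carried over from the comment is that it does not matter which block is lost: one extra block protects every data block, whereas a second full copy only helps if the same piece survives in at least one copy.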

      --
      "If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
    2. Re:No Filesystem is Best by Anonymous Coward · · Score: 2, Informative

      Depends, a 100% par set for a 100GB archive would take forever even on the faster machines. Even a simple "small" 4GB par set for a DVD backup takes hours on an Opteron 250.

  6. What about error correction? by F00F · · Score: 5, Interesting

    I've been wondering lately why no common file systems seem to implement error correcting codes (ECC/EDAC).

    In hardware, there's often a checksum, ECC/Hamming code, parity bit, Reed-Solomon code, etc. to detect and/or correct for inadvertent bit flips. But, as far as I know, no error correcting information is ever stored within the filesystem itself. Certainly the filesystem tracks how many blocks are dedicated to a particular file, and how many bytes long the file is, and one can always hash the file twelve ways to Sunday to assure that it hasn't changed since it was originally hashed, but none of that helps repair errors to the file should the medium that's being used to store it decay beyond what's already correctable via the medium access hardware.
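The "hash the file" approach mentioned above can be sketched with Python's standard hashlib. As the comment says, this only detects that a file has changed; it cannot repair anything. The manifest layout (a dict of path to hex digest) and the function names are assumptions made for illustration.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest):
    """Return the paths whose current digest differs from the recorded one.

    `manifest` maps each path to the hex digest recorded at archive time.
    """
    return [path for path, expected in manifest.items()
            if file_sha256(path) != expected]
```

A manifest like this would be stored alongside the archive and re-checked whenever the data is read back or migrated to new media.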

    I can imagine scenarios where, for example, the RAM buffer in a hard drive is upset and perfectly encodes the wrong bit into a file (or even multiple stripes + parity in a RAID). In this case, the medium access hardware is useless (the data was, after all, encoded perfectly wrong), but ECC in the filesystem would detect and potentially correct the error the next time the file was read back, even if it were decades later. I appreciate that it would add overhead, and thus maybe shouldn't be the default, but I don't see it being even an option anywhere, and some people would pay the performance penalty to get the data integrity benefit.

    Especially in instances like encrypted (or compressed, or both) loopback file systems where one bad bit can destroy an entire partition, why don't we have more data assurance layers available? Or have I just not found them?

    Whining of which, what was the deal with GNU ecc? Everyone speaks of "oh, yeah, the algorithm was deeply flawed, bummer..." but I don't ever see any details ...

    1. Re:What about error correction? by whovian · · Score: 2, Informative

      zfs supports checksums (http://en.wikipedia.org/wiki/Comparison_of_file_systems#Allocation_and_layout_policies) but it is incompatible with GPL (http://linux.inet.hr/zfs_filesystem_for_linux.html). However, Ricardo Correia has an alpha version of zfs for FUSE/Linux (http://zfs-on-fuse.blogspot.com).

      --
      To-do List: Receive telemarketing call during a tornado warning. Check.
  7. This question keeps popping up by rjforster · · Score: 3, Insightful

    In one form or another anyway. People keep asking about the _best_ way to store data for a long time (for some definition of best)

    My take on this problem is that you should use the best you reasonably can today. Then in 5 years' time when there is a new technology out there, move over to that for archiving your new data AND move your old data over while you still have working hardware.
    I went from floppy disks to LS-120 drives. From LS-120 drives to CDs. From CDs to DVDs. I'll go from DVDs to whichever of HD or BD seems best in a couple of years (unless something else crops up). I might use hard drives instead but I'm not sure yet. The point is I don't need to decide until I need to store that much.
    If you're playing in the big leagues do the same with the various formats of giganto capacity tape storage etc.

    Plan around the shelf-life and working life of the hardware you can get and the answer drops out.

  8. Simple... by evilviper · · Score: 3, Interesting
    I routinely archive my data onto an external hard drive: easy to update and mirror, but which file system provides the best combination of reliability, future-proofing, data recovery, and availability across multiple platforms (Linux, OS X, BeOS/Zeta and Windows, in my case)?

    Ext2 fs mounted rw,sync. When just reading, or just writing, async can't possibly help performance. You're strictly limited by disk I/O. Async will, however, cause irrecoverable corruption if there's a system crash or power failure, which was a source of great frustration with Linux before the journaling filesystems came along.
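The sync mount option is set at mount time, not in application code. A related but different mechanism, sketched here as an assumption about how one might get a similar guarantee per file, is to flush and fsync explicitly after writing, so the copy is not reported as finished while data still sits in the OS cache. The function name is illustrative.

```python
import os


def copy_and_flush(src, dst, chunk_size=1 << 20):
    """Copy src to dst and force the written data out to the device."""
    with open(src, "rb") as reader, open(dst, "wb") as writer:
        while True:
            chunk = reader.read(chunk_size)
            if not chunk:
                break
            writer.write(chunk)
        writer.flush()             # push Python's own buffer to the OS
        os.fsync(writer.fileno())  # ask the OS to write its cache to the device
```

This covers only the file's contents; it does not change how the filesystem handles its own metadata, which is what the mount option governs.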

    Ext2 can be read by nearly every operating system out there, and doesn't have the numerous limitations of FAT32.

    Which, incidentally, is the exact same answer I gave a few months ago, when the last guy wrote an Ask Slashdot to ask the exact same question...
    --
    Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
    1. Re:Simple... by jonesy16 · · Score: 2, Insightful

      Explore2fs is written and supported by one person and currently doesn't list support for Vista. I would find it hard to recommend to someone else that they use this and expect it to be a reliable solution 5 to 10 years down the road. And if it was so easy to support ext2 on OS X, then why is there no reliable support for Tiger? Last I checked into it (about a month ago) there was ONE person who was working on the project and it had been sitting idle for a while. Given that a lot of Mac users are also Linux users, I don't see why there wouldn't be widespread support if it was "quite easy". The advantage to the FAT filesystem is that it has been around forever with few changes. It will support MOST archival requirements for file size, etc.

  9. Is "sticktion" still a problem? by Marrow · · Score: 2, Interesting


    If you leave a drive in a closet for 10 years, will it still spin up?

  10. Non-IT answer by Overzeetop · · Score: 4, Interesting

    The best file system for archival purposes is the one you're using today. Why? Because if you want that archive to be readable in any expedient manner, you are going to have to constantly monitor and update the media on which it is stored. All media will degrade over time, and you will have no idea how bad that degradation has been until you re-read it. No vendor will compensate you for the loss of your data, because there is some data which simply cannot be recreated.

    If you want archival storage, you need to have your data on- or near-line, and rewrite the data to the "new" hardware every couple of years. By choosing a filesystem that is current, you are more likely to be able to read it in a couple of years than if you (try to) stick with a single filesystem. I know this sounds like a lot of work, but if the data is truly worth archiving, it's worth keeping both the storage mechanism and format up to date.

    --
    Is it just my observation, or are there way too many stupid people in the world?
  11. Worry about the hardware, not software by MightyYar · · Score: 4, Interesting

    Thanks to the emulation community, I can read data from an old Commodore 64, Apple ][e, Atari, etc. on any modern computer running any mainstream operating system. What I cannot do is easily hook up an old Apple ][e disk drive to my modern hardware. The filesystem will not really matter so much, because even if Wintel goes the way of the Commodore 64, someone will make a DOSBOX-esque emulator for it. Getting data off of an ATA, SATA, USB, or Firewire drive might be more challenging once new hardware ceases to support those standards.

    Personally, I just throw stuff on external hard drives. 3-5 years later, the new drives are so much bigger, faster, and cheaper that it becomes economical to consolidate to a new drive. I still have data from a 286 that had nothing but floppies, an Apple ][e, and 2 dead Macintoshes. Even my old Windows 95 computer lives on as a VirtualPC image. I don't really use them that much, but the Apple ][e and 286 stuff is under 50 megs, and the VirtualPC image is 2GB. The images of the old Mac hard drives total less than 1GB... it's simply not worth deleting them and it's kind of fun to have my old computers still around, if only "virtually".

    --
    W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    1. Re:Worry about the hardware, not software by Gothmolly · · Score: 2, Funny

      Dude, I'm sure you could find all that pr0n on the Internet again if you had to. Let it go.

      --
      I want to delete my account but Slashdot doesn't allow it.
  12. I use ext3 by rduke15 · · Score: 2, Insightful

    I use ext3 on my external backup disks because:
    - it is much better and more reliable than FAT32
    - it is both open source and (relatively) widely used, so I expect there will always be some way to read it
    - it can easily be read by attaching it to any machine and booting some Linux LiveCD or bootable USB.
    - the OS which traditionally can read ext2/3 is itself open source and also widely used, so there is no fear that it would become unavailable

    For archival and backup, I feel all these advantages far outweigh the slight inconvenience that the disks are not readable directly by Windows and Mac, requiring either a driver or a reboot into Linux.

    The important point is to label the disks very clearly. Otherwise, someone connecting them to a Windows or Mac machine may believe the disk is empty and re-partition/re-format it! I would not only put a big explanatory label on the disk's case, but also name the volume something like "Linux-..." or "Linux-ext3-...", and also explain to persons involved (manager(s) + people handling the disks) that they are not readable in Windows (some people don't read even big labels...).

  13. Don't use FAT by AusIV · · Score: 2, Interesting
    FAT has issues with partitions larger than 32 GB and files larger than 4 GB. It's nice for Flash drives that you're taking from a Windows PC to a Mac to a Linux box, but if you're talking about serious archives, you'll definitely run into the first problem, and quite possibly run into the second.

    I use Ext3 for my backup drive, and this driver for when I need to attach it to a Windows box.
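Before committing an archive to FAT32, the file size limit is easy to check for. The sketch below walks a directory tree and lists files above a size limit, defaulting to FAT32's maximum file size of 2^32 - 1 bytes (one byte short of 4 GiB). The function name and the idea of running this as a pre-flight check are illustrative assumptions.

```python
import os

# FAT32 stores file sizes in a 32-bit field, so the largest file is
# 2**32 - 1 bytes (one byte short of 4 GiB).
FAT32_MAX_FILE_BYTES = 2**32 - 1


def files_over_limit(root, limit=FAT32_MAX_FILE_BYTES):
    """Return (path, size) pairs for regular files larger than `limit`."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and other non-files
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((path, size))
    return oversized
```

Anything this reports would have to be split, or the drive formatted with a different filesystem, before the copy could succeed.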

  14. bad advice by oohshiny · · Score: 3, Insightful

    Buy something that has dedicated commercial support for the next 20-40 years

    You mean like DEC or any of the other out-of-business dinosaurs?

    As someone who has been through this, I can only say: do NOT buy anything that depends on "dedicated commercial support"; the companies and industry standards you think are going to be around for "20-40 years" are probably either not going to be, or they are not going to give a damn about you.

    Use open standards and open formats, with multi-vendor support; that's the only way to go. And you need to keep your eyes open and move to new formats and standards as the world changes.

    If LTO is the right choice, it's the right choice because of that. But I'm not convinced that LTO is going to be long-lived enough as a standard, no matter how many companies have tied their fortunes to it right now.

  15. Tape by vadim_t · · Score: 2, Informative

    Here's why: IMO, unless you're doing it for a company, the most important thing is convenience.

    If it's your job, sure, you'll do it whether it's convenient or not.

    If it isn't, you'll quickly get tired of messing with CDs, plugging/unplugging hard drives, etc. So I went with the most convenient media possible: tape. Stick a tape into the drive, walk away, store when it spits it out. It doesn't interfere with the computer's usage since nothing else uses tape.

    For absolute convenience, get a tape robot from ebay. Then it can be completely automatic.

    Filesystem: use plain tar to write to the tape. If you must use compression, compress files individually, not the whole tape.

    Paranoid implementation: Tapes have file marks. You can ask the tape drive to give you file #1 for instance. You can use this to store some useful stuff in a format that will always be recoverable so long as you have a drive that can read the tape. Store like this:

    File 1: Text document explaining what's all this stuff, and what's on the tape.
    File 2: RFC for tar format
    File 3: RFC for compression format
    File 4: source for tar program
    File 5: source for decompression program
    File 6: backup

    A tape formatted like this should be readable so long as a drive capable of reading the data in it survives. To ensure that, go with a popular tape format, which is reliable, open, and has a high capacity (so that it's unlikely to become obsolete too fast).
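Two of the ideas above, leading with a plain-text description and compressing files individually rather than the whole stream, can be sketched with Python's standard tarfile and gzip modules. This is not how a tape drive's file marks work: on real tape each of the six items would be a separate tape file, whereas this sketch packs everything into a single uncompressed tar image. The function names and member names are hypothetical.

```python
import gzip
import io
import tarfile


def add_bytes(archive, name, payload):
    """Add an in-memory byte string to an open tar archive as `name`."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))


def write_archive_image(target, readme_text, files):
    """Write an uncompressed tar whose members are individually gzipped.

    `files` maps member names to raw bytes. The README goes first and
    stays uncompressed so it is readable with nothing but tar itself.
    """
    with tarfile.open(target, mode="w") as archive:
        add_bytes(archive, "README.txt", readme_text.encode("utf-8"))
        for name, payload in files.items():
            # Each file gets its own gzip stream, so damage to one
            # member does not stop the others from decompressing.
            add_bytes(archive, name + ".gz", gzip.compress(payload))
```

Compressing per file costs some compression ratio compared with compressing the whole stream, which is the trade the comment recommends for the sake of recoverability.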

  16. ZFS - FTW by GuyverDH · · Score: 3, Informative

    While not as widely used (yet), it will eventually become the de-facto standard in safe filesystems.

    I've thrown all kinds of problems at it, and it has yet to lose a single byte of data.
    Add to that, taking snapshots every (x) minutes, you can look back in time as easily as reading a folder.

    With RAIDZ2 in the latest releases, you can set up sets that can withstand the loss of 2 physical drives. If you couple multiple RAIDZ2 sets into a single pool, you've increased the redundancy even further. With plain old JBOD and multiple controllers, you can reach levels of availability that only expensive EMC/Hitachi/StorEdge systems have reached in the past.

    It's open source as well (although it's the Sun flavor at this time), and being worked on at www.opensolaris.org. I believe Sun is contemplating switching it to GPL at this time.

    --
    Who is general failure, and why is he reading my hard drive?
  17. What are your parameters? by davidwr · · Score: 2, Interesting

    Do you need data-readback in a matter of seconds? Minutes? hours? days?
    Do you need storage for years, decades, centuries, millennia, 10,000 years, or longer?
    Do you need an indexing system based on content or just on title/filename?
    Can the data be printed out or carved into stone without losing important information?
    Is this a go-to-jail-if-you-don't legal requirement, a may-go-bankrupt-if-you-don't business requirement, or a save-us-a-bunch-of-money-nice-thing-to-have requirement?
    Do you think the cost of researching the "best" solution is worth the improvement over the 2nd- or 3rd-best solution?

    Let's assume you need it for 50 years, access is infrequent, and you can wait 24 hours for data recovery. Talk to the folks at Iron Mountain and other data-retention warehouses; they are experts in the field and will be happy to consult with you or do the entire job turn-key.

    My hunch:
    For most applications involving less than 50 year data retention, making 2 copies of the raw data, to a currently supported stable media such as tape or archival DVD, stored in separate locations, is key. Make sure the data is both in the original format and in a published-standard format which is widely supported.
    Keep multiple machines that can read the data around for as long as you need the original format. Every few years or as needed, verify the data is intact, re-convert the data from the original format or, if that format is unreadable, the highest-fidelity published-standard format, to a currently-supported published standard, and save it to a currently-supported archival format.

    Ideally, in 50 years' time, you will have the original media plus several updated copies. You may or may not be able to read the original media but your most recent copies will be close enough to the original to be useful. If you are very lucky, the most recent copies will be identical to the originals AND you will still have the software and hardware to read them.

    Oh, for anything REALLY important, print it out on archival paper, or carve it into stone.

    --
    Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
  18. Hardware isn't everything by Rob+Simpson · · Score: 2, Insightful
    Even with hardware that seems to be working perfectly fine, in the process of storing and repeatedly transferring stuff between different types of storage I've had errors crop up.

    Sure, I could use archives with checksums or RAID, but it'd be nice if there was an option to sacrifice some speed and space on a single form of storage to improve the reliability without going to such cumbersome lengths.