Slashdot Mirror


User: bollacker

bollacker's activity in the archive.


Comments · 1

  1. We are the Internet Archive on On Preservation of Digital Information · Score: 1

    Given our organization's mandate, I thought I should throw in my $.02.
    Although still ramping up and learning how to make things work, we are
    trying to ARCHIVE THE ENTIRE INTERNET FOREVER. Crawling or other
    forms of collection are used to download the information, and we store
    everything on hard drives. We plan to have about 100TB of HTML,
    images, Usenet, streaming media, etc. within two years, and we have
    some collections that reach back to 1996.

    Currently, we do no backups of the hard drives, because given their
    low failure rate (about 1% in our history), it's less lossy overall
    to use that space for new data rather than redundancy. By the time we
    reach equilibrium with the Internet so that our download rate
    approaches the information generation rate of the Internet, we'll have
    some sort of backup mechanism in place. Probably software RAID of
    some form.
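
    To put rough numbers on that trade-off, here is a back-of-the-envelope
    sketch. Only the 1% failure rate comes from the text above; the raw
    capacity, the function name, and the assumption that drive failures
    are independent are illustrative, and it assumes there is always more
    data to collect than space to hold it.

```python
# Compare filling every drive with unique data against mirroring everything.
# Only the 1% failure rate is from the comment; the rest is hypothetical.

def expected_retained(raw_tb, failure_rate, copies):
    """Expected TB of unique data still readable after drive failures.

    Each unique byte is stored `copies` times on different drives and is
    lost only if every drive holding a copy fails (failures are assumed
    to be independent).
    """
    unique_tb = raw_tb / copies
    loss_probability = failure_rate ** copies
    return unique_tb * (1.0 - loss_probability)


RAW_TB = 100.0        # hypothetical raw disk capacity
FAILURE_RATE = 0.01   # "about 1% in our history"

no_backup = expected_retained(RAW_TB, FAILURE_RATE, copies=1)
mirrored = expected_retained(RAW_TB, FAILURE_RATE, copies=2)

print(f"no backup: {no_backup:.3f} TB expected to survive")
print(f"mirrored:  {mirrored:.3f} TB expected to survive")
```

    Under these assumptions the unmirrored archive keeps roughly twice as
    much unique data, which is the sense in which skipping backups is
    "less lossy overall" while space is the limiting factor.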

    As time passes, we will copy data to new media, but because it will be
    on disk, this will be much easier than if it were on tape or printed. I
    have a vision that in the long run, we may be able to use something
    like an Intermemory (intermemory.org) to create a distributed
    filesystem that is the storage analog to distributed.net. In an
    intermemory, folks donate storage space, so that collectively, a huge
    amount of capacity is available. A lot of redundancy is used so that
    earthquakes, floods, govt. coups, and massive hardware failures are
    still unlikely to result in data loss. As folks' PCs fail or are
    upgraded, they simply plug in the new storage unit (hard drive,
    holographic, etc.) and their part of the intermemory is reconstructed
    (like RAID 5).
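
    The RAID 5 comparison can be illustrated with single-parity XOR, as in
    the minimal sketch below. This shows only the RAID-5-style analogy,
    not Intermemory's actual encoding; the block contents, participant
    layout, and function names are hypothetical, and a single parity block
    survives only one lost block, so a real system facing many
    simultaneous failures would need more redundancy than this.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute one parity block over all data blocks."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])


# Three hypothetical participants each hold one equal-sized block.
blocks = [b"archive!", b"internet", b"forever."]
parity = make_parity(blocks)

# The second participant's drive fails; rebuild its block on the replacement.
rebuilt = reconstruct([blocks[0], blocks[2]], parity)
print(rebuilt == blocks[1])
```

    XOR-ing the surviving blocks with the parity block cancels everything
    except the missing block, which is why the replacement drive can be
    refilled without the original.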

    There have also been comments about how to handle (index/search/browse)
    so much data if it is all archived. This is an area of active
    exploration in which we are working with research groups and others.
    Generally, we've found that working with flat ASCII files and Perl
    scripts is one of the few approaches that scales up to TB of
    information on reasonably priced hardware.
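
    One reason flat files scale is that they can be scanned as a stream,
    one line at a time, in constant memory. The sketch below shows the
    idea in Python rather than Perl; the tab-separated record layout
    (URL, date, size), the sample data, and the function name are all
    assumptions made for illustration, not the Archive's actual format.

```python
import io


def matching_records(lines, needle):
    """Yield (url, timestamp, size) for each record whose URL contains `needle`.

    Lines are consumed one at a time, so memory use stays constant
    regardless of how large the underlying file is.
    """
    for line in lines:
        url, timestamp, size = line.rstrip("\n").split("\t")
        if needle in url:
            yield url, timestamp, int(size)


# Stand-in for a large flat file opened with open(path); contents are made up.
sample = io.StringIO(
    "http://example.org/index.html\t19961101\t2048\n"
    "http://example.com/logo.gif\t19970315\t512\n"
    "http://example.org/about.html\t19980620\t1024\n"
)

hits = list(matching_records(sample, "example.org"))
total_bytes = sum(size for _, _, size in hits)
print(len(hits), total_bytes)
```

    Because each record is an independent line, the same scan can be split
    across many files and machines and the partial results combined.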

    From a fanciful perspective, I see us eventually being something like
    the "Library Institute" of David Brin's books, or being the digital
    analog to the Library of Alexandria. As we are a non-profit, access to
    our archives is freely available (see archive.org) and we
    encourage users of a broad range of types. If you are interested in
    seeing a large scale implementation of archiving heterogeneous digital
    information, check us out. As a shameless plug, we are also looking
    to hire developers and researchers. What we develop is open source
    and we encourage its dissemination.

    Kurt Bollacker
    Technical Director, The Internet Archive. (www.archive.org)