Slashdot Mirror


National Archives' Digital Woes

Carl Bialik from the WSJ writes "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years, according to the Wall Street Journal. 'The rapid adoption of electronic communications technology in the last decade has created a major crisis for the Archives,' the Journal reports. 'For one thing, the amount of data to be preserved has exploded in recent years, thanks to the proliferation of high-tech tools such as personal computers and wireless email devices such as BlackBerries. At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.' The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"

5 of 190 comments

  1. Google Search Appliance by TubeSteak · · Score: 2, Informative
    Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...
    Except when you're right

    The Google Search Appliance
    http://www.google.com/enterprise/gsa
    What it does

    The Google Search Appliance makes the sea of lost data on your web servers, file systems and relational databases instantly available with one mouse click. Just point it toward your content, add a search box to your site, and in a matter of hours, your users will be able to search through more than 220 different file formats in any language. The Google Search Appliance indexes up to 15 million documents, and its security features ensure that users only see the documents to which they have proper access.

    How it works

    The Google Search Appliance crawls your content and creates a master index of documents that's ready for instant retrieval using Google's search technology whenever a customer or employee types in a search query. The Google Search Appliance is easy to set up and requires minimal ongoing administration, making it extremely cost-effective. The Google Search Appliance starts at $30,000 to search up to 500,000 documents.
    FAQs

    Though it isn't really on topic, Google Search Appliances are vulnerable to various exploits, and Google does provide patches.
    --
    [Fuck Beta]
    o0t!
  2. Re:some funny math by Wildfire+Darkstar · · Score: 4, Informative

    Speaking as a trained archivist, I can say that the problem isn't finding storage space for the e-mails, per se. It's the duty and responsibility of the National Archives to preserve both content and context, and to ensure that these e-mails remain accessible for however long the retention schedules call for (which, in the case of executive communication, is not an insignificant length of time). Which means that the problem cannot be satisfactorily solved by dumping every e-mail onto a hard drive somewhere and forgetting about them. They all need to be indexed and cataloged, and provisions need to be made to ensure that the data can be migrated onto newer technology when it becomes necessary to do so without losing any of the information (or metadata) associated with it.

    The volume of material is staggering, and goes beyond what NARA (or almost anyone else, for that matter) has traditionally dealt with. While storage space itself is a concern, to some degree, given that this material will continue to accumulate, the larger problem is how to manage this material. Having 800GB of e-mail is pointless if you don't provide a means to get in and retrieve specific messages, and provide the appropriate context for that e-mail.

    --
    Sean Daugherty "I have walked in Eternity -- and Eternity weeps."
  3. Re:Plain Text by elronxenu · · Score: 2, Informative
    Legally they're not allowed to convert the documents.

    IMHO, storing them on 8-track tape is a massive blunder. 8-track is already obsolete. What they should be doing is either keeping them all on spinning storage (with massive amounts of redundancy) or burning multiple redundant copies to DVD.

    Either way, they will have to deal with the problem of unreliable storage - it's easier to cope with if the problem can be automatically detected, and the data recovered from a backup and re-copied automatically. This should be possible with both DVDs and spinning storage. DVDs would need to be regularly loaded into the machine and read in their entirety. If a DVD shows errors, another copy of the DVD needs to be re-copied to replace the failing DVD.
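
    The detect-and-recopy cycle described above can be sketched roughly as follows. This is only an illustration, not anything NARA is known to run: it assumes each record is kept as several file copies and that a SHA-256 digest was recorded when the record was first archived. The function names sha256_of and verify_and_repair are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_repair(copies: list[Path], expected: str) -> list[Path]:
    """Re-read every copy in full, then overwrite damaged copies from a good one.

    `expected` is the digest recorded at archiving time (an assumption of
    this sketch). Returns the copies that had to be repaired and raises if
    no intact copy remains.
    """
    good = [p for p in copies if p.exists() and sha256_of(p) == expected]
    if not good:
        raise RuntimeError("no intact copy left; restore from another source")
    good_set = set(good)
    damaged = [p for p in copies if p not in good_set]
    for path in damaged:
        # Replace the failing copy with a byte-for-byte copy of a good one.
        shutil.copyfile(good[0], path)
    return damaged
```

    Run on a schedule, a check like this reads each copy in its entirety, which is the same discipline the DVD case would need; the more independent copies there are, the less likely it is that all of them fail between two runs.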

    I guess this is a good time to point out:

    • The difficulty of accessing these documents in 100 years is mostly a function of the tools used to create the documents in the first place, not of the archiving system itself.
    • Start using ODF for word processor docs if you want to be able to read them in 100 years.
    • Make them readily available to the public to ensure that the good stuff is copied over and over again.
  4. Re:Internet Archive by fiji · · Score: 2, Informative

    For some value of entire.

    TIA is pretty damn impressive, but they certainly don't get all of it.

    1: There is more to the internet than the web
    2: They don't do a lot of dynamic pages... so a lot of forums will probably be ignored (not that that necessarily loses anything useful ;-)
    3: They only get images if you request them
    4: Sites can request that they not be spidered (robots.txt)
    etc.
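
    For point 4, here is a small sketch of how a crawler might honor robots.txt, using Python's standard urllib.robotparser. The rules, the user-agent name ExampleArchiveBot, and the helper may_crawl are made up for illustration and say nothing about how the Internet Archive's crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one archive crawler is shut out entirely,
# everyone else is only kept away from /private/.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

    With these rules, may_crawl(ROBOTS_TXT, "ExampleArchiveBot", "http://example.com/index.html") is False, so a well-behaved archive crawler would never capture the page.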

    -ben

  5. Re:Plain Text by Wildfire+Darkstar · · Score: 3, Informative
    What's to keep NARA from converting most electronic records to plain text?

    Potentially Armstrong v. Executive Office of the President. Format shifting is a fantastically tricky minefield to navigate. The aforementioned court case dealt specifically with the practice of printing e-mail communication and storing it as a paper record, but it speaks to the standard problems of conversion: you need to be entirely certain that you're not losing any information in the conversion process. This includes transmission information, metadata, and so on. Which isn't to say that plain text conversion can't be done in a lot of cases, but rather that it's something that needs to be undertaken very carefully.
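
    To make the information-loss point concrete, here is a rough sketch using Python's standard email package. The sample message and the helpers body_only and headers_at_risk are invented for illustration; the sketch only shows what a naive "keep the body text" conversion would discard and is not a description of any NARA process.

```python
from email import message_from_string

# Hypothetical message; every address and ID here is made up.
RAW = """\
From: alice@example.gov
To: bob@example.gov
Cc: carol@example.gov
Date: Mon, 02 Jan 2006 09:30:00 -0500
Message-ID: <1234@example.gov>
Subject: Budget draft

Please review the attached draft.
"""


def body_only(raw: str) -> str:
    """A naive plain-text conversion: keep the body, drop everything else."""
    return message_from_string(raw).get_payload()


def headers_at_risk(raw: str) -> dict[str, str]:
    """Transmission metadata that the naive conversion silently discards."""
    return dict(message_from_string(raw).items())
```

    Here body_only(RAW) keeps a single sentence, while headers_at_risk(RAW) lists the sender, recipients, date, and message ID that would no longer be part of the record.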

    And while NARA has been embarking on some wonderful digitization projects, no paper-born records have been replaced by electronic conversions as of yet, for precisely the same reason. The electronic conversion augments the original paper record, but NARA still needs to maintain and preserve the paper record for as long as they have always been legally required to do so.
    --
    Sean Daugherty "I have walked in Eternity -- and Eternity weeps."