Slashdot Mirror


National Archives' Digital Woes

Carl Bialik from the WSJ writes "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years, according to the Wall Street Journal. 'The rapid adoption of electronic communications technology in the last decade has created a major crisis for the Archives,' the Journal reports. 'For one thing, the amount of data to be preserved has exploded in recent years, thanks to the proliferation of high-tech tools such as personal computers and wireless email devices such as BlackBerries. At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.' The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"

6 of 190 comments

  1. some funny math by Yonder+Way · · Score: 4, Interesting

    100 million emails
    let's be generous and say that the average email is 8192 bytes in size (8KB)

    100,000,000 * 8KB = ~800GB

    That's not much at all. And that's if you store it uncompressed.

    Use a well-documented, unencumbered compression algorithm and it will likely all fit on a single tape.
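
The estimate above can be checked with a few lines of Python. This is only a sketch of the commenter's arithmetic: the 8192-byte average is the comment's own generous assumption, not a measured figure, and the constant and function names are illustrative.

```python
# Back-of-the-envelope storage estimate for the email archive.
EMAIL_COUNT = 100_000_000   # figure quoted in the story
AVG_EMAIL_BYTES = 8192      # the comment's assumed average size (8KB)


def archive_size_bytes(count: int, avg_bytes: int) -> int:
    """Return the uncompressed size of `count` messages of `avg_bytes` each."""
    return count * avg_bytes


total = archive_size_bytes(EMAIL_COUNT, AVG_EMAIL_BYTES)
print(f"{total / 10**9:.1f} GB (decimal units)")
print(f"{total / 2**30:.1f} GiB (binary units)")
```

Either way of counting lands near the "~800GB" figure, before any compression.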

    1. Re:some funny math by Wildfire+Darkstar · · Score: 4, Informative

      Speaking as a trained archivist, I can say that the problem isn't finding storage space for the e-mails, per se. It's the duty and responsibility of the National Archives to preserve both content and context, and to ensure that these e-mails remain accessible for however long the retention schedules call for (which, in the case of executive communication, is not an insignificant length of time). Which means that the problem cannot be satisfactorily solved by dumping every e-mail onto a hard drive somewhere and forgetting about them. They all need to be indexed and cataloged, and provisions need to be made to ensure that the data can be migrated onto newer technology when it becomes necessary to do so without losing any of the information (or metadata) associated with it.

      The volume of material is staggering, and goes beyond what NARA (or almost anyone else, for that matter) has traditionally dealt with. While storage space itself is a concern, to some degree, given that this material will continue to accumulate, the larger problem is how to manage this material. Having 800GB of e-mail is pointless if you don't provide a means to get in and retrieve specific messages, and provide the appropriate context for that e-mail.

      --
      Sean Daugherty "I have walked in Eternity -- and Eternity weeps."
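
To make the point about indexing and context concrete, here is a minimal sketch of a catalog that keeps each message's header metadata alongside a simple word index. It uses only Python's standard `email` module. The sample messages, the message IDs, and the `build_catalog` and `search` helpers are hypothetical, and a real archival system would need far more (retention schedules, access controls, attachments, provenance).

```python
import re
from collections import defaultdict
from email import message_from_string

# Hypothetical sample messages; real records would come from the archive.
RAW_MESSAGES = {
    "msg-001": (
        "From: a@example.gov\nTo: b@example.gov\n"
        "Date: Mon, 02 Jan 2006 09:00:00 -0500\nSubject: Budget draft\n\n"
        "Please review the budget draft.\n"
    ),
    "msg-002": (
        "From: c@example.gov\nTo: a@example.gov\n"
        "Date: Tue, 03 Jan 2006 10:30:00 -0500\nSubject: Travel schedule\n\n"
        "The travel schedule is attached to the budget memo.\n"
    ),
}


def build_catalog(raw_messages):
    """Return (metadata, index): headers per message and word -> message IDs."""
    metadata = {}
    index = defaultdict(set)
    for msg_id, raw in raw_messages.items():
        msg = message_from_string(raw)
        # Keep the context (who, to whom, when, about what) with the content.
        metadata[msg_id] = {h: msg[h] for h in ("From", "To", "Date", "Subject")}
        text = f"{msg['Subject'] or ''} {msg.get_payload()}"
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(msg_id)
    return metadata, index


def search(index, metadata, word):
    """Return (message ID, metadata) pairs for messages containing `word`."""
    return [(m, metadata[m]) for m in sorted(index.get(word.lower(), ()))]


metadata, index = build_catalog(RAW_MESSAGES)
for msg_id, meta in search(index, metadata, "budget"):
    print(msg_id, meta["Date"], meta["Subject"])
```

The design choice worth noting is that retrieval returns the metadata with the hit, so a match never arrives stripped of its context.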
  2. Plain Text by CWRUisTakingMyMoney · · Score: 5, Insightful

    What's to keep NARA from converting most electronic records to plain text? Surely most communications are only text themselves, so formats wouldn't be an issue there. For more complex files, OpenDocument is an option, or just any open format. On the good side, this would make searching the archives fantastically efficient. NARA is already making some formerly paper records into electronic, searchable records. Imagine if everything were like that.

    --
    Those who anthropomorphize science and/or nature already believe in an intelligent designer.
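
As a rough illustration of the plain-text idea, the sketch below flattens an email to a few headers plus its text/plain body using Python's standard `email` package. The `to_plain_text` helper and the sample message are hypothetical, and the approach deliberately ignores attachments and HTML-only messages, which is exactly where a plain-text policy loses information.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def to_plain_text(raw: str) -> str:
    """Reduce a raw email to selected headers plus its text/plain body."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the plain-text part; HTML alternatives are skipped.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""
    headers = [
        f"{name}: {msg[name]}"
        for name in ("From", "To", "Date", "Subject")
        if msg[name]
    ]
    return "\n".join(headers) + "\n\n" + text + "\n"


# Hypothetical message with both plain-text and HTML alternatives.
sample = EmailMessage()
sample["From"] = "a@example.gov"
sample["To"] = "b@example.gov"
sample["Subject"] = "Meeting"
sample.set_content("Meeting moved to 3pm.")
sample.add_alternative("<p>Meeting moved to <b>3pm</b>.</p>", subtype="html")

print(to_plain_text(sample.as_string()))
```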
  3. One Word: Google by Nova+Express · · Score: 5, Insightful
    Really, either internal or external. Take out anything that might injure national security, then turn the rest over for Google to index. Hell, send a copy of everything to Google, for that matter; they've got room. Keep a record of searches and visits to documents by codeword and frequency and build an index that way. Create a datasea, index it, and let citizens swim in it. As long as the e-mail is in at least a remotely standard format, what's the problem?

    (Note: Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...)

    --
    Lawrence Person (lawrencepersonh@gmailh.com; remove all "h"s to mail)

    http://www.lawrenceperson.com/

  4. I'd love to read those emails... by rampant+mac · · Score: 4, Funny
    "The National Archives [...] will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years"

    I'd love to read those emails, seeing as how we've gone from:

    From: bclinton@whitehouse.gov
    To: hclinton@whitehouse.giv
    CC: agore@whitehouse.gov; tgore@whitehouse.gov; monica04329@yahoo.com; ltripp@weightwatchers.com;
    Subject: omglol, you got to get me some of these!

    I want these for Christmas! http://www.big-fat-cigars.com/



    To something along the lines of:

    From: gbushjr@whitehouse.gov
    To: dickc@whitehouse.giv
    CC: crice@whitehouse.gov; jbush@whitehouse.gov; lbush@whitehouse.gov; urnotapuppet@gmail.com; osamab@msn.com; cpowell@hotmail.com;
    Subject: Are they for real? Can we attack them too?

    Subject sayz it all, any toughts Dick? I think we can git `em.

    > DYKE BOURDER OIL SERVIES
    > OFFER FOR SALE OF NIGERIAN CRUDE OIL
    >
    > Dear Sir,
    >
    > I am President of blah blah blah...

    --
    I like big butts and I cannot lie.
  5. Format obsolescence by StikyPad · · Score: 4, Insightful

    There's no reason to keep 286s around to read WordStar documents. Just because formats are updated and revised doesn't mean the data needs to be stored as such. Save the text as ASCII, and the images as PNG or another lossless format. In the unlikely event that PNG is updated in a way that isn't backward compatible, convert the old files over to the newer format. Every few years, copy the data from old media to newer media. If done regularly (rather than, say, waiting until there are 500,000 floppies to make the leap to DVD-R), it won't be much of a chore. Sure, it's a headache, but that's why they call it work.
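
The "copy the data from old media to newer media" step might look something like the following sketch, which copies a directory tree and verifies each file with a SHA-256 checksum. The `migrate` and `sha256_of` helpers and the temporary directories standing in for old and new media are assumptions for illustration, not anything NARA is known to use.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(old_root: Path, new_root: Path) -> dict:
    """Copy every file under old_root to new_root, verifying each copy.

    Returns a manifest mapping relative paths to checksums, which can be
    stored and rechecked at the next migration.
    """
    manifest = {}
    for src in sorted(p for p in old_root.rglob("*") if p.is_file()):
        rel = src.relative_to(old_root)
        dst = new_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies data and file timestamps
        before, after = sha256_of(src), sha256_of(dst)
        if before != after:
            raise IOError(f"checksum mismatch for {rel}")
        manifest[rel.as_posix()] = after
    return manifest


# Demo with temporary directories standing in for old and new media.
with tempfile.TemporaryDirectory() as old_dir, tempfile.TemporaryDirectory() as new_dir:
    (Path(old_dir) / "memos").mkdir()
    (Path(old_dir) / "memos" / "note.txt").write_text("plain ASCII text")
    for name, checksum in migrate(Path(old_dir), Path(new_dir)).items():
        print(name, checksum[:12])
```

Keeping the manifest is the useful part: at the next migration, the stored checksums show whether anything decayed on the shelf in the meantime.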