Slashdot Mirror


National Archives' Digital Woes

Carl Bialik from the WSJ writes "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years, according to the Wall Street Journal. 'The rapid adoption of electronic communications technology in the last decade has created a major crisis for the Archives,' the Journal reports. 'For one thing, the amount of data to be preserved has exploded in recent years, thanks to the proliferation of high-tech tools such as personal computers and wireless email devices such as BlackBerries. At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.' The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"

38 of 190 comments

  1. Not A Problem... by ferrellcat · · Score: 2, Funny

    "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House,.." Thanks to the Patriot Act, this number will be reduced to roughly four, including one such email with a complelling advertisement for V14GR4!!!!!!11

  2. some funny math by Yonder+Way · · Score: 4, Interesting

    100 million emails
    let's be generous and say that the average email is 8192 bytes in size (8KB)

    100,000,000 * 8KB = ~800GB

    That's not much at all. And that's if you store it uncompressed.

    Use a well documented unencumbered compression algorithm and it's likely to all fit on a single tape.
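
    As a rough check on the arithmetic above, here is a minimal sketch. The 8 KB average is the poster's own generous guess, not a measured figure, and the function name is made up for illustration.

```python
# Back-of-envelope storage estimate using the figures quoted in the post.
EMAIL_COUNT = 100_000_000   # roughly 100 million emails (from the article)
AVG_EMAIL_BYTES = 8192      # the poster's assumed 8 KB average per message


def total_bytes(count: int, avg_bytes: int) -> int:
    """Uncompressed size of the whole collection, in bytes."""
    return count * avg_bytes


total = total_bytes(EMAIL_COUNT, AVG_EMAIL_BYTES)
decimal_gb = total / 10**9   # gigabytes as drive vendors count them
binary_gib = total / 2**30   # gibibytes as most operating systems count them

print(f"{decimal_gb:.1f} GB, {binary_gib:.1f} GiB")  # prints "819.2 GB, 762.9 GiB"
```

    Either way of counting lands near the ~800GB quoted above, before any compression.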

    1. Re:some funny math by NonSequor · · Score: 3, Funny

      This is the Bush administration we're talking about. They all use HTML mail with lots of attached graphics. On top of that, many messages get forwarded hundreds of times.

      --
      My only political goal is to see to it that no political party achieves its goals.
    2. Re:some funny math by Wildfire+Darkstar · · Score: 4, Informative

      Speaking as a trained archivist, I can say that the problem isn't finding storage space for the e-mails, per se. It's the duty and responsibility of the National Archives to preserve both content and context, and to ensure that these e-mails remain accessible for however long the retention schedules call for (which, in the case of executive communication, is not an insignificant length of time). Which means that the problem cannot be satisfactorily solved by dumping every e-mail onto a hard drive somewhere and forgetting about them. They all need to be indexed and cataloged, and provisions need to be made to ensure that the data can be migrated onto newer technology when it becomes necessary to do so without losing any of the information (or metadata) associated with it.

      The volume of material is staggering, and goes beyond what NARA (or almost anyone else, for that matter) has traditionally dealt with. While storage space itself is a concern, to some degree, given that this material will continue to accumulate, the larger problem is how to manage this material. Having 800GB of e-mail is pointless if you don't provide a means to get in and retrieve specific messages, and provide the appropriate context for that e-mail.

      --
      Sean Daugherty "I have walked in Eternity -- and Eternity weeps."
    3. Re:some funny math by MillionthMonkey · · Score: 2, Funny

      That's not much at all. And that's if you store it uncompressed.

      And any compression routine will immediately tokenize the long heavily repeated phrases: "September 11, 2001", "Global War on Terror", "aid and comfort to the enemy", "America's will is strong", "central front in the war on terror", "the American people are safer", "9/11", "we will prevail". There isn't a lot of entropy in this particular dataset.
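
      Joking aside, the effect is easy to demonstrate. The following is a small sketch using Python's standard zlib module (DEFLATE) on an artificially repetitive sample; the sample text is invented for illustration and says nothing about the real archive's compression ratio.

```python
import zlib

# A deliberately repetitive sample built from a few of the phrases above.
phrases = ["Global War on Terror", "we will prevail", "the American people are safer"]
sample = (" ".join(phrases) + ". ") * 1000
raw = sample.encode("ascii")

# Level 9 asks DEFLATE for its best (slowest) compression.
packed = zlib.compress(raw, 9)

# Lossless round trip: decompressing must give back the exact original bytes.
assert zlib.decompress(packed) == raw

ratio = len(raw) / len(packed)
print(f"{len(raw)} bytes -> {len(packed)} bytes (about {ratio:.0f}x smaller)")
```

      Real mail has far more variety than this sample, so actual ratios would be much lower.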

    4. Re:some funny math by Crudely_Indecent · · Score: 2, Interesting

      Too bad they already awarded the contract to Lockheed Martin (someone had their palm greased in that deal), as my company deals with document conversion and archiving (of this scale) on a regular basis. The NA concern was converting the documents to modern formats and yet retaining the original document... Peanuts....my systems do it on the fly.

      Oh well....$308 million contract goes bye bye.....

      When did Lockheed Martin get into the document management business?

      --
      "Lame" - Galaxar
    5. Re:some funny math by Mikelikus · · Score: 2, Funny

      I know, I know, I know!! Why don't they use Google Desktop!!?

      --
      -- Would it be acceptable to just put my name on my sig?
    6. Re:some funny math by Black+Marlin · · Score: 2, Insightful
      Well, now that you know that the federal government has email storage issues, perhaps your company needs to step up and learn about how to bid on federal contracts. State governments are in smaller versions of the same boat. Our governor may not be turning over 100 million emails to our State Archives, but it will be a bunch. Even the last geezer governor transferred a big chunk.

      If you've got the best mousetrap, you need to find out more about how to make your product available to the archives community.

      Some places to learn more:
      The Society of American Archivists
      The Association of Records Managers and Administrators
      The Council of State Archivists

  3. Plain Text by CWRUisTakingMyMoney · · Score: 5, Insightful

    What's to keep NARA from converting most electronic records to plain text? Surely most communications are only text themselves, so formats wouldn't be an issue there. For more complex files, OpenDocument is an option, or just any open format. On the good side, this would make searching the archives fantastically efficient. NARA is already making some formerly-paper records into electronic, searchable records. Imagine if everything were like that.

    --
    Those who anthropomorphize science and/or nature already believe in an intelligent designer.
    1. Re:Plain Text by elronxenu · · Score: 2, Informative
      Legally they're not allowed to convert the documents.

      IMHO, storing them on 8-track tape is a massive blunder. 8-track is already obsolete. What they should be doing is either keeping them all on spinning storage (with massive amounts of redundancy) or burning multiple redundant copies to DVD.

      Either way, they will have to deal with the problem of unreliable storage - it's easier to cope with if the problem can be automatically detected, and the data recovered from a backup and re-copied automatically. This should be possible with both DVDs and spinning storage. DVDs would need to be regularly loaded into the machine and read in their entirety. If a DVD shows errors, another copy of the DVD needs to be re-copied to replace the failing DVD.

      I guess this is a good time to point out:

      • The difficulty of accessing these documents in 100 years is mostly a function of the tools used to create the documents in the first place, not of the archiving system itself.
      • Start using ODF for word processor docs if you want to be able to read them in 100 years.
      • Make them readily available to the public to ensure that the good stuff is copied over and over again.
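
      The detect-and-re-copy idea described above can be sketched with checksums. This is only an illustration under stated assumptions: the copy names and helper functions are hypothetical, the copies are held in memory rather than read from real discs, and a digest is assumed to have been recorded when the data was first archived.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_damaged(copies: Dict[str, bytes], expected: str) -> List[str]:
    """Names of copies whose digest no longer matches the recorded one."""
    return [name for name, data in copies.items() if sha256_hex(data) != expected]


def repair(copies: Dict[str, bytes], expected: str) -> Dict[str, bytes]:
    """Replace every copy with the first intact one, if any copy is still intact."""
    intact = next((d for d in copies.values() if sha256_hex(d) == expected), None)
    if intact is None:
        raise ValueError("no intact copy left to restore from")
    return {name: intact for name in copies}


# Simulate three redundant copies, one of which has silently degraded.
original = b"Subject: Plans for Wall\n\n(body)"
recorded_digest = sha256_hex(original)  # stored at archiving time
copies = {"disc_a": original, "disc_b": original[:-1] + b"?", "disc_c": original}

damaged = find_damaged(copies, recorded_digest)
repaired = repair(copies, recorded_digest)
print("damaged:", damaged)
```

      The same loop works for spinning storage or optical media; only the code that reads the bytes would differ.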
    2. Re:Plain Text by Wildfire+Darkstar · · Score: 3, Informative
      What's to keep NARA from converting most electronic records to plain text?

      Potentially Armstrong v. Executive Office of the President. Format shifting is a fantastically tricky minefield to navigate. The aforementioned court case dealt specifically with the practice of printing e-mail communication and storing it as a paper record, but it speaks to the standard problems of conversion: you need to be entirely certain that you're not losing any information in the conversion process. This includes transmission information, metadata, and so on. Which isn't to say that plain text conversion can't be done in a lot of cases, but rather that it's something that needs to be undertaken very carefully.

      And while NARA has been embarking on some wonderful digitization projects, no paper-born records have been replaced by electronic conversions as of yet, for precisely the same reason. The electronic conversion augments the original paper record, but NARA still needs to maintain and preserve the paper record for as long as they have always been legally required to do so.
      --
      Sean Daugherty "I have walked in Eternity -- and Eternity weeps."
  4. Re:OK by AvitarX · · Score: 2, Funny

    I think the problem is they are trying to store the records on 8-tracks.

    --
    Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
  5. One Word: Google by Nova+Express · · Score: 5, Insightful
    Really, either Internal or External. Take out anything that might injure National Security, then turn the rest over for Google to index. Hell, send a copy of everything to Google, for that matter; they've got room. Keep a record of searches and visits to documents by codeword and frequency and build an index that way. Create a datasea, index it, and let citizens swim in it. As long as the e-mail is in at least a remotely standard format, what's the problem?

    (Note: Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...)

    --
    Lawrence Person (lawrencepersonh@gmailh.com (remove all "h"s to mail)

    http://www.lawrenceperson.com/

    1. Re:One Word: Google by AndroidCat · · Score: 2, Funny

      Stick it all on one box, then install p2p software. Name all the files to song titles and it'll spread even faster. (Of course, the RIAA might go after John W. Doe...)

      --
      One line blog. I hear that they're called Twitters now.
  6. Technology explodes by Architect_sasyr · · Score: 3, Funny

    Well, if the technology that uses the emails is exploding, surely the software/systems that archive the software are too.

    A couple of BSD box's with some Oracle or similar should do it.

    --
    Me failed English...
    FreeBSD over Linux. If my comments seem odd, this may explain...
  7. Sounds like a business opportunity by Alcimedes · · Score: 3, Interesting

    Really, rather than talking about how horrid it is, why not be busy working on software and hardware solutions that will bring old document types up to today's standards, and devices that will pull data off of old drives?

    I'm sure a universal data conversion tool would be worth a pile of money.

  8. iPod Mod... by __aaclcg7560 · · Score: 3, Funny

    The article mentions playing eight-track tapes on an iPod. Does anyone have the link to that ultimate retro mod? Does it come with a Saturday Night Live dance cover?

  9. Re:XML? by grcumb · · Score: 2, Insightful

    "Sounds like a job for everyone's favorite do-everything markup language, XML! Seriously, why isn't it used to structure everything?"

    Because it's not the right tool for every job. XML is explicitly a data interchange format. I've worked with material like this in the past, and I can tell you from experience that processing large volumes of XML (or any text-based markup format, for that matter) is extremely expensive in terms of processor and memory resource usage.

    That said, I agree that in this case XML-formatted plain text is the right format, specifically because it is very suitable as a data interchange format. When one is archiving large volumes of data for indeterminate periods of time (possibly decades), then it's worth the extra pain to maintain the source in the most flexible format.

    I do not want to suggest, though, that this is the best format for accessing or processing the data. I'd suggest a source repository where text data is fielded with the proper metadata which can be updated periodically if necessary. Data can then be drawn from there and stored in a more accessible (e.g. database) format and that data store can be accessed by researchers, lawyers and lawmakers, etc. This has the double benefit of keeping the source material safe because we're not interacting with it constantly and making it accessible in the most appropriate technology of the day.

    As someone has already stated, this is not exactly rocket science. It does require a certain simplicity and elegance of design, so I have very little hope that it will be implemented as I've described. 8^)

    --
    Crumb's Corollary: Never bring a knife to a bun fight.
  10. Re:OK by Sinryc · · Score: 3, Funny

    Yea, there is something you're not seeing. The fact of the matter is they are talking about STORING the saved data. Not opening it.

    Good job on getting modded well. Anytime someone says "Open Source it" They get modded pretty well.

    Good job.

    --
    Yay, I have a sig.
  11. I'd love to read those emails... by rampant+mac · · Score: 4, Funny
    "The National Archives [...] will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years"

    I'd love to read those emails, seeing as how we've gone from:

    From: bclinton@whitehouse.gov
    To: hclinton@whitehouse.giv
    CC: agore@whitehouse.gov; tgore@whitehouse.gov; monica04329@yahoo.com; ltripp@weightwatchers.com;
    Subject: omglol, you got to get me some of these!

    I want these for Christmas! http://www.big-fat-cigars.com/



    To something along the lines of:

    From: gbushjr@whitehouse.gov
    To: dickc@whitehouse.giv
    CC: crice@whitehouse.gov; jbush@whitehouse.gov; lbush@whitehouse.gov; urnotapuppet@gmail.com; osamab@msn.com; cpowell@hotmail.com;
    Subject: Are they for real? Can we attack them too?

    Subject sayz it all, any toughts Dick? I think we can git `em.

    > DYKE BOURDER OIL SERVIES
    > OFFER FOR SALE OF NIGERIAN CRUDE OIL
    >
    > Dear Sir,
    >
    > I am President of blah blah blah...

    --
    I like big butts and I cannot lie.
  12. The obvious solution by the+eric+conspiracy · · Score: 2, Funny


    rm -rf /

  13. Google Search Appliance by TubeSteak · · Score: 2, Informative
    Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...
    Except when you're right

    The Google Search Appliance
    http://www.google.com/enterprise/gsa
    What it does

    The Google Search Appliance makes the sea of lost data on your web servers, file systems and relational databases instantly available with one mouse click. Just point it toward your content, add a search box to your site, and in a matter of hours, your users will be able to search through more than 220 different file formats in any language. The Google Search Appliance indexes up to 15 million documents, and its security features ensure that users only see the documents to which they have proper access.

    How it works

    The Google Search Appliance crawls your content and creates a master index of documents that's ready for instant retrieval using Google's search technology whenever a customer or employee types in a search query. The Google Search Appliance is easy to set up and requires minimal ongoing administration, making it extremely cost-effective. The Google Search Appliance starts at $30,000 to search up to 500,000 documents.
    FAQs

    Though it isn't really ontopic, Google search appliances are vulnerable to various exploits & Google does provide patches.
    --
    [Fuck Beta]
    o0t!
  14. Format obsolescence by StikyPad · · Score: 4, Insightful

    There's no reason to keep 286s around to read WordStar documents. Just because formats are updated and revised doesn't mean the data needs to be stored as such. Save the text as ASCII, and the images as png or another lossless format. In the unlikely event that png is updated in a way that isn't backward compatible, convert the old files over to the newer format. Every few years, copy the data from old media to newer media. If done regularly (rather than, say, waiting until there are 500,000 floppies to make the leap to DVD-R), it won't be much of a chore. Sure it's a headache, but that's why they call it work.

  15. Internet Archive by arrrrg · · Score: 2, Interesting

    If the Internet Archive can back up the entire internet every few months, I would think the National Archives could handle a few hundred million emails.

    1. Re:Internet Archive by fiji · · Score: 2, Informative

      For some value of entire.

      TIA is pretty damn impressive, but they certainly don't get all of it.

      1: There is more to the internet than the web
      2: They don't do a lot of dynamic pages... so a lot of forums will probably be ignored (not that that necessarily loses anything useful ;-)
      3: They only get images if you request it
      4: Sites can request that they not be spidered (robots.txt)
      etc.

      -ben

  16. ASCII Text by Spazmania · · Score: 2, Insightful

    electronic documents created today may not be legible on tomorrow's devices

    ASCII text has been around for decades and oh by the way Internet-formatted email is 100% representable as ASCII text since that's how it's still transferred today.

    This supposed problem is a real problem only for those with Exchange, Domino or Groupwise which creates email in custom, internal formats.
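
    To illustrate the claim, here is a minimal sketch using Python's standard email package. The addresses are placeholders, and forcing quoted-printable is a choice made for this example so that the non-ASCII character is carried as plain ASCII on the wire.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a message whose body contains a non-ASCII character.
msg = EmailMessage()
msg["From"] = "sender@example.gov"      # placeholder address
msg["To"] = "recipient@example.gov"     # placeholder address
msg["Subject"] = "Lunch"
msg.set_content("Meet at the café at noon.\n", cte="quoted-printable")

# Serialize to the wire format; decoding as ASCII fails if any byte is non-ASCII.
wire = msg.as_bytes()
wire_text = wire.decode("ascii")

# Parsing the ASCII form recovers the original text, accent included.
restored = message_from_bytes(wire, policy=policy.default)
body = restored.get_content()
print(wire_text)
```

    The serialized body carries the accented letter as `=C3=A9`, so the stored form is pure ASCII while the content itself is not limited to it.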

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
  17. Re:Why store all of this by YrWrstNtmr · · Score: 3, Insightful
    What the government needs is to prioritize and save only the important stuff. Official bills and memos are worth saving, the president asking his secretary for a cup of coffee isn't.

    Often, you don't know what's important until long after the fact. Storage space is so cheap and easy, it doesn't make sense to try to filter as it's happening. Inevitably, something important/crucial/worldchanging would get lost, resulting in cries of government censorship.

    And I'd say for a presidency...ALL of it is crucial.

    Random conversations, recorded by the secretary, then 'erased', have already caused one president to resign. What was in that erased 18 minutes? The NARA may actually find out.

  18. Official History? by datafr0g · · Score: 3, Insightful

    The National Archives, entrusted to preserve America's official history...

    The official history? as opposed to what - the unofficial history? Or should it be worded differently: The National Archives, entrusted to preserve America's official government records...

    Don't mean to sound nit-picky but when I first read that, a million conspiracy theories raced through my mind! :)

    --
    "Who says nothing is impossible? Some people do it every day!" - Alfred E. Neuman
  19. I know of two emails that aren't. by User+956 · · Score: 2, Funny

    Clinton only sent two emails during his entire 8 years in office.

    "His administration generated about 40 million messages - mostly memos and notes among aides and cabinet members. Of the two Mr Clinton sent, one was a test to see if the president could push an e-mail button. The other was addressed to astronaut John Glenn"

    That shouldn't be hard to archive.

    (on a slightly related note, I wonder what percentage of those are/were spam, and if they have to archive all those spam messages for online poker and hot wet bitches?)

    --
    The theory of relativity doesn't work right in Arkansas.
  20. MySQL? by jacklexbox · · Score: 2, Insightful

    Please correct me if I am wrong, as I probably am, but I would like to have this explained to me. Why couldn't all the emails be stored as plain text in a MySQL database with either a web interface (php?) or an application written in an interpreted language (Java or Ruby)? Does that make sense? Is there something I am missing?
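
    For what the question describes, a minimal sketch might look like the following. It uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs without a server; the table layout, function names, and sample message are all assumptions for illustration, not anything NARA uses. The raw message is stored untouched next to the extracted headers, so nothing is lost by indexing it.

```python
import sqlite3
from email import message_from_string

# A made-up message in standard Internet mail format.
RAW_MESSAGE = (
    "From: aide@example.gov\n"
    "To: staff@example.gov\n"
    "Subject: Meeting notes\n"
    "Date: Mon, 02 Jan 2006 09:00:00 -0500\n"
    "\n"
    "Notes attached.\n"
)


def open_archive() -> sqlite3.Connection:
    """Create an in-memory database with one table of messages."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " sender TEXT, recipient TEXT, subject TEXT, sent TEXT,"
        " raw TEXT NOT NULL)"
    )
    return conn


def store(conn: sqlite3.Connection, raw: str) -> int:
    """Index a few headers for searching and keep the original text verbatim."""
    msg = message_from_string(raw)
    cur = conn.execute(
        "INSERT INTO messages (sender, recipient, subject, sent, raw)"
        " VALUES (?, ?, ?, ?, ?)",
        (msg["From"], msg["To"], msg["Subject"], msg["Date"], raw),
    )
    return cur.lastrowid


conn = open_archive()
row_id = store(conn, RAW_MESSAGE)
found = conn.execute(
    "SELECT raw FROM messages WHERE sender = ?", ("aide@example.gov",)
).fetchone()
print(found[0])
```

    Storing one message this way is the easy part; the harder questions raised elsewhere in the thread (context, attachments, proprietary mail stores, long-term migration) are not addressed by a schema like this.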

  21. Talk to the Catholic Church by toupsie · · Score: 3, Funny

    Monks have done an amazing job preserving important documents over the years. In fact, Xerox worked with Brother Dominic in the field of document preservation. Print out all the e-mails on archive quality paper and store them underground. Be sure they are also translated into Spanish so future Americans will be able to read them.

    --
    Strange women lying in ponds distributing swords is no basis for a system of government.
  22. Re:Why store all of this by dangitman · · Score: 2, Insightful
    What the government needs is to prioritize and save only the important stuff. Official bills and memos are worth saving, the president asking his secretary for a cup of coffee isn't.

    That is an absolutely insane idea for government policy. We shouldn't decide what's important for the future - the future history writers decide that for us. Who is it that decides what is important? The public owns the government, and has the right to retain everything it does. Not storing evidence would mean that today's criminals in government will escape future punishment or disrepute, and current heroes of government will not receive their dues or recognition.

    Make no mistake, some of the most insignificant things in past people's lives have provided the most significant insights into humanity when later discovered by historians, anthropologists or archaeologists. It's what we consider "trash" today that will tell our story to future generations. When that trashball heads back to Earth, you wanna make goddamn sure you wear noseplugs and know how to make 20th Century trash.

    --
    ... and then they built the supercollider.
  23. Re:Why store all of this by Benwick · · Score: 2, Insightful

    Speaking as a trained archaeologist (and I'm not just saying that for effect), it would definitely be wrong to filter out the "unimportant" who-got-coffee when, because it makes a false judgment about what sort of information will be of interest to scholars of the future. There are all kinds of weird correlations possible, too -- "Presidential Coffee Breaks and the History of Global Commerce in the Post-Lewinsky Era," etc. One might want to study what lower-level White House bureaucrats did, too -- who knows. It's all primary source material.

    If all of this sounds boring to you, that's why you're not an Archaeologist. Of course, neither am I. But I did study it.

  24. Stop & think before posting, please by maggard · · Score: 3, Insightful
    Turning emails into text files, all graphics attachments into PNGs, etc. isn't the issue.

    How all of this stuff is connected, who it came from, when it was sent, all of that is something Historians (or Special Prosecutors) will need to know. Email from "aa204@whitehouse.gov" to "mikhail@kremvax.su" subject "Plans for Wall" isn't particularly useful if we don't have any way of tracking who aa204 was or knowing it was composed on Nov. 9, 1989 but not actually sent until Nov.10, 1989.
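
    As a small illustration of why the headers matter, here is a sketch using Python's standard email package. The addresses and subject come from the example above; the relay host names and exact timestamps are invented. It assumes the message carries a standard Date header (set when it was composed) and a Received header whose timestamp follows the last semicolon.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical message: composed late on Nov. 9, relayed the next morning.
RAW = (
    "Received: from mail.example.org by relay.example.org;"
    " Fri, 10 Nov 1989 08:15:00 +0000\n"
    "Date: Thu, 09 Nov 1989 23:40:00 +0000\n"
    "From: aa204@whitehouse.gov\n"
    "To: mikhail@kremvax.su\n"
    "Subject: Plans for Wall\n"
    "\n"
    "(body)\n"
)

msg = message_from_string(RAW)

# When the sender's mail program says the message was written.
composed = parsedate_to_datetime(msg["Date"])

# When a relay actually handled it: the timestamp after the last ';'.
received_stamp = msg["Received"].rsplit(";", 1)[1].strip()
relayed = parsedate_to_datetime(received_stamp)

delay = relayed - composed
print(f"composed {composed:%Y-%m-%d}, relayed {relayed:%Y-%m-%d}, delay {delay}")
```

    A conversion that keeps only the body and subject throws both timestamps away, and the headers alone still say nothing about who aa204 actually was.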

    Face it, most email systems are complex special-purpose systems made up of huge webs of interdependencies, from their hardware to their OS to their various applications. Imagine trying to pull emails, address books, mailing lists, undelivereds, calendars, attachments, cc's, bcc's, forwarded-forwarded-forwarded records etc. from a mass of DEC All-In-1 systems, IBM Profs, MS Exchange v.anything, and the /.-popular mbox/maildir/postfix/cyrus/exim/sendmail/dovecot/ldap/etc. environments...

    Now figure out some reasonably stable format to save 'em all in where they can be referenced, cross-referenced, timelines produced, who-knew-what-when deduced, identities tracked, policy propagation studied, etc. That's not the territory of thousands of text files, or PNGs, it's a data-miner's nightmare and what the Nat'l Archives are facing.

    So please, stop being quick-to-the-keyboards "Well d'uh" /.-trollers and assume that some reasonably clever and knowledgeable folks have already considered the problem and are appalled at its complexity. Yes, there are possibly some even more clever & knowledgeable folks who read /. but the text-&-png crowd is just so much wasted bits.

    At least the big-database folks are probably closer to what is going to be required, and anyone who is starting to think that mebbe proprietary undocumented databases cost us all more in the long-term than they're worth is even more (IMHO) on the right track...

    --
    I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I wont bother either.
    1. Re:Stop & think before posting, please by dogugotw · · Score: 2, Insightful

      A-freakin-men!

      Seems like there is about 100:1 'understand:clueless' post ratio here.

      Converting the body of an email or document (word, pdf, excel, powerpoint, html, whatever) is trivial. Maintaining all of the metadata associated with the document/email is not. Maintaining the original context is not trivial. Let's not forget that something like highlighting, font color, underlining, bold face, or italics within a message may have meaning - if you convert to all ASCII, the formatting and the meaning that went with it are gone and the saved information has less value.

      XML might be a solution but for it to work, all of the existing production systems must be changed to xml compliant systems and users must be retrained and policies to manage the newly created data must be updated.

      I work for a medical device company and struggle with these issues every day and I only need to worry about data for 10 years or so. I cannot imagine trying to keep today's data meaningful for 50 years.

      If anybody has a solution that is:
      Free
      Transparent to the users
      Transparent to the admins/developers/maintainers
      Easy to implement
      Doesn't require revalidation (oh, you didn't know regulated systems had to be validated before use and change controlled and tested at each change???)

      feel free to chime in.

  25. Are we blinded? by electrosoccertux · · Score: 2, Interesting

    Lately I've been wondering how great Google really is, and whether it's deserving of the love I give it. Sure, I think the company Google is full of geniuses coming up with some of the best ideas since bread & butter.

    But then I ask myself how much time I've spent trying to find things online. I've been finding Google to be increasingly less useful. When was the last time you googled, looking for information, and found nothing related? When was the last time you had to rephrase your search query not once, not twice, not three times, but four or five times? Now, when was the last time you googled for something besides Wikipedia (or any other well known site) and found what you wanted on the first page? I can tell you that for me, the times I've been able to check off "found in under 15 seconds" have become scarcer and scarcer. Since, I've increased results to 20 per page. That's helped a bit. But most of the time I'm having to rephrase my search query multiple times. After 5 or 6 tries, I usually find what I want halfway down the page. Why is this?

    I've had several thoughts on this issue lately. Google could be filling up with spam - pages optimized just to get a high pagerank. Or perhaps I'm asking Google to find me increasingly complex and niche information. Being a GT student, it's entirely possible I'm simply asking it for things most other people don't find useful. But I didn't have these problems until, at most, two months ago. Or perhaps what I fear is becoming a reality: Google's IPO has turned the company in a different direction. Maybe their slogan is changing from a "do no evil" to a "do less good" stance? Am I crazy? Or are we blind, and is what I say true? Are we loving Google only because they're giving Microsoft a run for their money?

    Don't get me wrong, Google has plenty of wonderful services: Google Earth, Gmail, the new click-a-button-and-have-that-company-phone-me service, etc. But is it possible that they're beginning to sell out the top results in their searches? Consider the evidence: I've been spending more time than ever finding quality links. Google's IPO was but a few months ago. Also, in talks with AOL, Google now plans to offer not only specialized AOL ads, but also FLASHier adsense ads. So is it probable that Google is selling a place in their top results? I'm very inclined to think so. And so, just recently, I've come to question my devotion to Google.

    Am I the only one wasting search time? I think it's time we re-evaluate Google's search engine, and think twice before we offer our praise.

  26. Microsoft Word by csplinter · · Score: 2, Funny

    They should experience how the latest version of Microsoft Office can help them better manage documents, organize workload, and collaborate with coworkers--not just from their desk, but from almost anywhere! Why? So that their system will deliver the features, options, and performance they need to maximize their productivity and enjoyment, to ensure that their software is authentic, properly licensed and supported by Microsoft or a trusted partner, so that they will get access to updates, enhancements, and innovations that help them protect and do more with their PC! In conclusion, if you don't believe that Microsoft Office has REAL Ultimate Power you better get a life right now or they will chop your head off!!! It's an easy choice, if you ask me.

  27. yes, oh yes by misanthrope101 · · Score: 2, Insightful
    100 million hate-emails is a lot of hate mail.
    It is all hate mail, right?
    Yes. Every last one was to people who questioned or disagreed with a decision made by the Bush administration. Strangely, every single email contained the same text: "Why do you hate America?" Apparently it's the most cogent, incisive argument around.

    Or was this about email received by the White House? All of that routed through a special team working out of the office of the Vice President. All of that email was also identical: "Cheney was right all along."

    These two may seem like odd coincidences, but only if you hate America. Your email will be forthcoming.