National Archives' Digital Woes
Carl Bialik from the WSJ writes "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years, according to the Wall Street Journal. 'The rapid adoption of electronic communications technology in the last decade has created a major crisis for the Archives,' the Journal reports. 'For one thing, the amount of data to be preserved has exploded in recent years, thanks to the proliferation of high-tech tools such as personal computers and wireless email devices such as BlackBerries. At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.' The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"
100 million emails
let's be generous and say that the average email is 8192 bytes in size (8KB)
100,000,000 * 8KB = ~800GB
That's not much at all. And that's if you store it uncompressed.
Use a well-documented, unencumbered compression algorithm and it's likely to all fit on a single tape.
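For anyone who wants to check the arithmetic above, here is a minimal sketch. The 100 million count comes from the article summary and the 8192-byte average is the guess made above; the function names are just illustrative, and no compression is modeled.

```python
# Back-of-the-envelope storage estimate.
# EMAIL_COUNT is the figure from the article summary; AVG_EMAIL_BYTES is the
# "generous" 8 KB guess from the comment above (an assumption, not measured).
EMAIL_COUNT = 100_000_000
AVG_EMAIL_BYTES = 8192


def total_bytes(count: int, avg_bytes: int) -> int:
    """Uncompressed size in bytes for `count` messages of `avg_bytes` each."""
    return count * avg_bytes


def to_gb(n_bytes: int) -> float:
    """Decimal gigabytes (10**9 bytes), the unit drive vendors use."""
    return n_bytes / 10**9


def to_gib(n_bytes: int) -> float:
    """Binary gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


raw = total_bytes(EMAIL_COUNT, AVG_EMAIL_BYTES)
print(f"{raw:,} bytes = {to_gb(raw):.1f} GB = {to_gib(raw):.1f} GiB")
```

So "~800GB" holds up: about 819 GB in decimal units, or about 763 GiB in binary units, before any compression and before counting attachments or metadata.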
What's to keep NARA from converting most electronic records to plain text? Surely most communications are only text themselves, so formats wouldn't be an issue there. For more complex files, OpenDocument is an option, or just any open format. On the good side, this would make searching the archives fantastically efficient. NARA is already making some formerly paper records into electronic, searchable records. Imagine if everything were like that.
Those who anthropomorphize science and/or nature already believe in an intelligent designer.
(Note: Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...)
Lawrence Person (lawrencepersonh@gmailh.com (remove all "h"s to mail)
http://www.lawrenceperson.com/
Well, if the technology that uses the emails is exploding, surely the software/systems that archive the emails are too.
A couple of BSD boxes with some Oracle or similar should do it.
Me failed English...
FreeBSD over Linux. If my comments seem odd, this may explain...
Really, rather than talking about how horrid it is, why not be busy working on software and hardware solutions that will bring old document types up to today's standards, and devices that will pull data off of old drives?
I'm sure a universal data conversion tool would be worth a pile of money.
The article mentions playing eight-track tapes on an iPod. Does anyone have the link to that ultimate retro mod? Does it come with a Saturday Night Live dance cover?
Yeah, there is something you're not seeing. The fact of the matter is they are talking about STORING the saved data. Not opening it.
Good job on getting modded well. Any time someone says "Open Source it", they get modded pretty well.
Good job.
Yay, I have a sig.
I'd love to read those emails, seeing as how we've gone from:
From: bclinton@whitehouse.gov
To: hclinton@whitehouse.giv
CC: agore@whitehouse.gov; tgore@whitehouse.gov; monica04329@yahoo.com; ltripp@weightwatchers.com;
Subject: omglol, you got to get me some of these!
I want these for Christmas! http://www.big-fat-cigars.com/
To something along the lines of:
From: gbushjr@whitehouse.gov
To: dickc@whitehouse.giv
CC: crice@whitehouse.gov; jbush@whitehouse.gov; lbush@whitehouse.gov; urnotapuppet@gmail.com; osamab@msn.com; cpowell@hotmail.com;
Subject: Are they for real? Can we attack them too?
Subject sayz it all, any toughts Dick? I think we can git `em.
> DYKE BOURDER OIL SERVIES
> OFFER FOR SALE OF NIGERIAN CRUDE OIL
>
> Dear Sir,
>
> I am President of blah blah blah...
I like big butts and I cannot lie.
There's no reason to keep 286s around to read WordStar documents. Just because formats are updated and revised doesn't mean the data needs to be stored as such. Save the text as ASCII, and the images as png or another lossless format. In the unlikely event that png is updated in a way that isn't backward compatible, convert the old files over to the newer format. Every few years, copy the data from old media to newer media. If done regularly (rather than, say, waiting until there are 500,000 floppies to make the leap to DVD-R), it won't be much of a chore. Sure it's a headache, but that's why they call it work.
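The "copy the data from old media to newer media" step can be sketched roughly like this. It is a minimal example, assuming both media are mounted as ordinary directories; `migrate` and `sha256_of` are hypothetical helper names, and the SHA-256 manifest is my own addition so the next migration has something to verify against.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(src_dir: Path, dst_dir: Path) -> dict[str, str]:
    """Copy every file under src_dir to dst_dir and verify each copy.

    Returns a manifest mapping relative path -> SHA-256 of the copy.
    Raises OSError if a copy does not match its source.
    """
    manifest = {}
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(src_dir)
        dst = dst_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps where it can
        copied = sha256_of(dst)
        if sha256_of(src) != copied:
            raise OSError(f"checksum mismatch for {rel}")
        manifest[rel.as_posix()] = copied
    return manifest
```

Calling `migrate(Path("/mnt/old"), Path("/mnt/new"))` (both paths made up) and saving the returned manifest next to the data would let a later run detect bit rot before copying it forward.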
https://www.eff.org/https-everywhere
Often, you don't know what's important until long after the fact. Storage space is so cheap and easy, it doesn't make sense to try to filter as it's happening. Inevitably, something important/crucial/worldchanging would get lost, resulting in cries of government censorship.
And I'd say for a presidency...ALL of it is crucial.
Random conversations, recorded by the secretary and then 'erased', have already caused one president to resign. What was in that erased 18 minutes? NARA may actually find out.
The National Archives, entrusted to preserve America's official history...
:)
The official history? As opposed to what - the unofficial history? Or should it be worded differently: The National Archives, entrusted to preserve America's official government records...
Don't mean to sound nit-picky but when I first read that, a million conspiracy theories raced through my mind!
"Who says nothing is impossible? Some people do it every day!" - Alfred E. Neuman
Monks have done an amazing job preserving important documents over the years. In fact, Xerox worked with Brother Dominic in the field of document preservation. Print out all the e-mails on archive-quality paper and store them underground. Be sure they are also translated into Spanish so future Americans will be able to read them.
Strange women lying in ponds distributing swords is no basis for a system of government.
How all of this stuff is connected, who it came from, when it was sent, all of that is something Historians (or Special Prosecutors) will need to know. Email from "aa204@whitehouse.gov" to "mikhail@kremvax.su" subject "Plans for Wall" isn't particularly useful if we don't have any way of tracking who aa204 was or knowing it was composed on Nov. 9, 1989 but not actually sent until Nov.10, 1989.
Face it, most email systems are complex special-purpose systems made up of huge webs of interdependencies, from their hardware to their OS to their various applications. Imagine trying to pull emails, address books, mailing lists, undelivereds, calendars, attachments, cc's, bcc's, forwarded-forwarded-forwarded records etc. from a mass of DEC All-In-1 systems, IBM PROFS, MS Exchange v.anything, and the /.-popular mbox/maildir/postfix/cyrus/exim/sendmail/dovecot/ldap/etc. environments...
Now figure out some reasonably stable format to save 'em all in where they can be referenced, cross-referenced, timelines produced, who-knew-what-when deduced, identities tracked, policy propagation studied, etc. That's not the territory of thousands of text files, or PNGs, it's a data-miner's nightmare and what the Nat'l Archives are facing.
So please, stop being quick-to-the-keyboard "Well, d'uh" /.-trollers and assume that some reasonably clever and knowledgeable folks have already considered the problem and are appalled at its complexity. Yes, there are possibly some even more clever & knowledgeable folks who read /. but the text-&-png crowd is just so much wasted bits.
At least the big-database folks are probably closer to what is going to be required, and anyone who is starting to think that mebbe proprietary undocumented databases cost us all more in the long term than they're worth is even more (IMHO) on the right track...
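To make the "it's not just text files" point concrete, here is a small sketch of pulling the who/when/what out of a single RFC 822-style message with Python's standard `email` package. The sample message reuses the made-up addresses from above; its Date value and the `extract_record` helper are purely hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message; the time of day is invented for illustration.
SAMPLE = """\
From: aa204@whitehouse.gov
To: mikhail@kremvax.su
Date: Thu, 09 Nov 1989 23:15:00 -0500
Subject: Plans for Wall

(body omitted)
"""


def extract_record(raw: str) -> dict:
    """Flatten the headers a historian would cross-reference into one record."""
    msg = message_from_string(raw)

    def addrs(header: str) -> list:
        # getaddresses copes with several addresses in one header line
        return [addr for _name, addr in getaddresses(msg.get_all(header, []))]

    date = msg["Date"]
    return {
        "from": addrs("From"),
        "to": addrs("To"),
        "cc": addrs("Cc"),
        "date": parsedate_to_datetime(date).isoformat() if date else None,
        "subject": msg["Subject"],
    }


record = extract_record(SAMPLE)
print(record)
```

Note what is still missing: nothing in the headers says who aa204 was, or that the message was composed one day and sent the next. That has to come from directory and queue records kept alongside the mail, which is exactly the part a pile of text files loses.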
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.