National Archives' Digital Woes
Carl Bialik from the WSJ writes "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years, according to the Wall Street Journal. 'The rapid adoption of electronic communications technology in the last decade has created a major crisis for the Archives,' the Journal reports. 'For one thing, the amount of data to be preserved has exploded in recent years, thanks to the proliferation of high-tech tools such as personal computers and wireless email devices such as BlackBerries. At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.' The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"
What's to keep NARA from converting most electronic records to plain text? Surely most communications are only text themselves, so formats wouldn't be an issue there. For more complex files, OpenDocument is an option, or just any open format. On the good side, this would make searching the archives fantastically efficient. NARA is already making some formerly paper records into electronic, searchable records. Imagine if everything were like that.
Those who anthropomorphize science and/or nature already believe in an intelligent designer.
So why don't they just use open source data formats? Is there something more complicated here that I'm not seeing?
(Note: Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...)
Lawrence Person (lawrencepersonh@gmailh.com (remove all "h"s to mail))
http://www.lawrenceperson.com/
"Sounds like a job for everyone's favorite do-everything markup language, XML! Seriously, why isn't it used to structure everything?"
Because it's not the right tool for every job. XML is explicitly a data interchange format. I've worked with material like this in the past, and I can tell you from experience that processing large volumes of XML (or any text-based markup format, for that matter) is extremely expensive in terms of processor and memory resource usage.
That said, I agree that in this case XML-formatted plain text is the right format, specifically because it is very suitable as a data interchange format. When one is archiving large volumes of data for indeterminate periods of time (possibly decades), then it's worth the extra pain to maintain the source in the most flexible format.
I do not want to suggest, though, that this is the best format for accessing or processing the data. I'd suggest a source repository where text data is fielded with the proper metadata which can be updated periodically if necessary. Data can then be drawn from there and stored in a more accessible (e.g. database) format and that data store can be accessed by researchers, lawyers and lawmakers, etc. This has the double benefit of keeping the source material safe because we're not interacting with it constantly and making it accessible in the most appropriate technology of the day.
As someone has already stated, this is not exactly rocket science. It does require a certain simplicity and elegance of design, so I have very little hope that it will be implemented as I've described. 8^)
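The two-tier arrangement described above (an XML source of record, plus a derived store that people actually query) can be sketched in a few lines. This is only an illustration under stated assumptions: the `<record>` layout, the field names, and the sample data are all hypothetical, and SQLite stands in for whatever database a real archive would choose.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical archival source records: plain-text XML with fielded metadata.
SOURCE_RECORDS = [
    """<record id="r1">
         <sender>a@example.gov</sender>
         <sent>1989-11-09</sent>
         <body>First archived message.</body>
       </record>""",
    """<record id="r2">
         <sender>b@example.gov</sender>
         <sent>1989-11-10</sent>
         <body>Second message, about the budget.</body>
       </record>""",
]


def build_access_store(xml_records):
    """Load XML source records into a throwaway in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id TEXT PRIMARY KEY, sender TEXT, sent TEXT, body TEXT)"
    )
    for xml_text in xml_records:
        rec = ET.fromstring(xml_text)
        conn.execute(
            "INSERT INTO records VALUES (?, ?, ?, ?)",
            (
                rec.get("id"),
                rec.findtext("sender"),
                rec.findtext("sent"),
                rec.findtext("body"),
            ),
        )
    conn.commit()
    return conn


def search(conn, term):
    """Return the ids of records whose body contains the search term."""
    rows = conn.execute(
        "SELECT id FROM records WHERE body LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in rows]
```

The point of the split is that the derived store is disposable: it can be rebuilt from the XML at any time, in whatever database technology is current, without touching the source.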
Crumb's Corollary: Never bring a knife to a bun fight.
There's no reason to keep 286s around to read WordStar documents. Just because formats are updated and revised doesn't mean the data needs to be stored as such. Save the text as ASCII, and the images as png or another lossless format. In the unlikely event that png is updated in a way that isn't backward compatible, convert the old files over to the newer format. Every few years, copy the data from old media to newer media. If done regularly (rather than, say, waiting until there are 500,000 floppies to make the leap to DVD-R), it won't be much of a chore. Sure it's a headache, but that's why they call it work.
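The periodic old-media-to-new-media copy could look roughly like the sketch below. The checksum verification and the manifest are additions for illustration, not something the comment above specifies, and the `migrate` function and directory layout are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(old_root: Path, new_root: Path) -> dict:
    """Copy every file under old_root to new_root and verify each copy.

    Returns a manifest mapping relative paths to SHA-256 checksums, which
    could be kept and rechecked at the next migration.
    """
    manifest = {}
    for src in sorted(old_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(old_root)
        dst = new_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies contents and basic metadata
        checksum = sha256_of(src)
        if sha256_of(dst) != checksum:
            raise IOError(f"checksum mismatch for {rel}")
        manifest[rel.as_posix()] = checksum
    return manifest
```

Something like `migrate(Path("/mnt/old_media"), Path("/mnt/new_media"))` would do one pass; both paths are placeholders.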
https://www.eff.org/https-everywhere
electronic documents created today may not be legible on tomorrow's devices
ASCII text has been around for decades and, oh by the way, Internet-formatted email is 100% representable as ASCII text, since that's how it's still transferred today.
This supposed problem is a real problem only for those with Exchange, Domino or GroupWise, which create email in custom, internal formats.
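To make the plain-text claim concrete, here is a minimal sketch that reads an Internet-format message using nothing but Python's standard `email` package. The message itself, its addresses, and the `summarize` helper are made up for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message in Internet mail format: headers, a blank line, a body.
RAW_MESSAGE = (
    "From: sender@example.gov\r\n"
    "To: recipient@example.org\r\n"
    "Subject: Meeting notes\r\n"
    "Date: Thu, 09 Nov 1989 17:30:00 +0000\r\n"
    "Message-ID: <0001@example.gov>\r\n"
    "\r\n"
    "The notes are included below as plain text.\r\n"
)


def summarize(raw: str) -> dict:
    """Parse a raw message and pull out a few commonly needed fields."""
    raw.encode("ascii")  # raises UnicodeEncodeError if the text is not pure ASCII
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "date": parsedate_to_datetime(msg["Date"]),
        "body": msg.get_payload(),
    }
```

Calling `summarize(RAW_MESSAGE)` gives back the sender, recipient, subject, a timezone-aware date, and the body text. Attachments and non-ASCII content travel as encoded text inside the same envelope, so a real archive would also need to handle MIME parts; this sketch does not.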
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Often, you don't know what's important until long after the fact. Storage space is so cheap and easy that it doesn't make sense to try to filter as it's happening. Inevitably, something important/crucial/world-changing would get lost, resulting in cries of government censorship.
And I'd say for a presidency...ALL of it is crucial.
Random conversations, recorded by the secretary, then 'erased', have already caused one president to resign. What was in that erased 18 minutes? The NARA may actually find out.
The National Archives, entrusted to preserve America's official history...
:)
The official history? As opposed to what - the unofficial history? Or should it be worded differently: The National Archives, entrusted to preserve America's official government records...
Don't mean to sound nit-picky, but when I first read that, a million conspiracy theories raced through my mind!
"Who says nothing is impossible? Some people do it every day!" - Alfred E. Neuman
Please correct me if I am wrong, as I probably am, but would like to have this explained to me. Why couldn't all the emails be stored as plain text in a MySQL database with either a web interface (php?) or an application written in an interpreted language (Java or Ruby)? Does that make sense? Is there something I am missing?
That is an absolutely insane idea for government policy. We shouldn't decide what's important for the future - the future history writers decide that for us. Who is it that decides what is important? The public owns the government, and has the right to retain everything it does. Not storing evidence would mean that today's criminals in government will escape future punishment or disrepute, and current heroes of government will not receive their dues or recognition.
Make no mistake, some of the most insignificant things in past peoples' lives have provided the most significant insights into humanity when later discovered by historians, anthropologists or archaeologists. It's what we consider "trash" today that will tell our story to future generations. When that trashball heads back to Earth, you wanna make goddamn sure you wear noseplugs and know how to make 20th Century trash.
... and then they built the supercollider.
Speaking as a trained archaeologist (and I'm not just saying that for effect), it would definitely be wrong to filter out the "unimportant" who-got-coffee when, because it makes a false judgment about what sort of information will be of interest to scholars of the future. There are all kinds of weird correlations possible, too -- "Presidential Coffee Breaks and the History of Global Commerce in the Post-Lewinsky Era," etc. One might want to study what lower-level White House bureaucrats did, too -- who knows. It's all primary source material.
If all of this sounds boring to you, that's why you're not an Archaeologist. Of course, neither am I. But I did study it.
How all of this stuff is connected, who it came from, when it was sent, all of that is something Historians (or Special Prosecutors) will need to know. Email from "aa204@whitehouse.gov" to "mikhail@kremvax.su" subject "Plans for Wall" isn't particularly useful if we don't have any way of tracking who aa204 was or knowing it was composed on Nov. 9, 1989 but not actually sent until Nov. 10, 1989.
Face it, most email systems are complex special-purpose systems made up of huge webs of interdependencies, from their hardware to their OS to their various applications. Imagine trying to pull emails, address books, mailing lists, undelivereds, calendars, attachments, cc's, bcc's, forwarded-forwarded-forwarded records etc. from a mass of DEC All-In-1 systems, IBM PROFS, MS Exchange v.anything, and the /.-popular mbox/maildir/postfix/cyrus/exim/sendmail/dovecot/ldap/etc. environments...
Now figure out some reasonably stable format to save 'em all in where they can be referenced, cross-referenced, timelines produced, who-knew-what-when deduced, identities tracked, policy propagation studied, etc. That's not the territory of thousands of text files, or PNGs, it's a data-miner's nightmare and what the Nat'l Archives are facing.
So please, stop being quick-to-the-keyboards "Well d'uh" /.-trollers and assume that some reasonably clever and knowledgeable folks have already considered the problem and are appalled at its complexity. Yes, there are possibly some even more clever & knowledgeable folks who read /. but the text-&-png crowd is just so much wasted bits.
At least the big-database folks are probably closer to what is going to be required, and anyone who is starting to think that mebbe proprietary undocumented databases cost us all more in the long term than they're worth is even more (IMHO) on the right track...
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.
Or was this about email received by the White House? All of that routed through a special team working out of the office of the Vice President. All of that email was also identical: "Cheney was right all along."
These two may seem like odd coincidences, but only if you hate America. Your email will be forthcoming.
If you've got the best mousetrap, you need to find out more about how to make your product available to the archives community.
Some places to learn more:
The Society of American Archivists
The Association of Records Managers and Administrators
The Council of State Archivists