National Archives' Digital Woes
Carl Bialik from the WSJ writes "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years, according to the Wall Street Journal. 'The rapid adoption of electronic communications technology in the last decade has created a major crisis for the Archives,' the Journal reports. 'For one thing, the amount of data to be preserved has exploded in recent years, thanks to the proliferation of high-tech tools such as personal computers and wireless email devices such as BlackBerries. At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.' The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"
100 million hate-emails is a lot of hate mail.
It is all hate mail, right?
"The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House,.." Thanks to the Patriot Act, this number will be reduced to roughly four, including one such email with a complelling advertisement for V14GR4!!!!!!11
100 million emails
let's be generous and say that the average email is 8192 bytes in size (8KB)
100,000,000 * 8KB = ~800GB
That's not much at all. And that's if you store it uncompressed.
Use a well-documented, unencumbered compression algorithm and it's likely to all fit on a single tape.
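For anyone who wants to check that arithmetic, here is a minimal sketch. The 8192-byte average is the guess from the post, not a measured figure, and the helper name `total_gb` is made up for illustration.

```python
# Back-of-envelope size of the archive, using the post's own guesses.
EMAIL_COUNT = 100_000_000   # figure quoted in the article summary
AVG_EMAIL_BYTES = 8192      # the "generous" 8KB average assumed above


def total_gb(count, avg_bytes):
    """Return the total size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return count * avg_bytes / 10**9


print(f"{total_gb(EMAIL_COUNT, AVG_EMAIL_BYTES):.1f} GB")  # prints "819.2 GB"
```

Using exactly 8192 bytes gives slightly more than the rounded ~800GB, but the conclusion is the same: uncompressed, it is under a terabyte.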
What's to keep NARA from converting most electronic records to plain text? Surely most communications are only text themselves, so formats wouldn't be an issue there. For more complex files, OpenDocument is an option, or just any open format. On the good side, this would make searching the archives fantastically efficient. NARA is already making some formerly paper records into electronic, searchable records. Imagine if everything were like that.
Those who anthropomorphize science and/or nature already believe in an intelligent designer.
So why don't they just use open source data formats? Is there something more complicated here that I'm not seeing?
(Note: Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...)
Lawrence Person (lawrencepersonh@gmailh.com (remove all "h"s to mail)
http://www.lawrenceperson.com/
Well, if the technology that uses the emails is exploding, surely the software/systems that archive the software are too.
A couple of BSD box's with some Oracle or similar should do it.
Me failed English...
FreeBSD over Linux. If my comments seem odd, this may explain...
Really, rather than talking about how horrid it is, why not be busy working on software and hardware solutions that will bring old document types up to today's standards, and devices that will pull data off of old drives?
I'm sure a universal data conversion tool would be worth a pile of money.
Sounds like a job for everyone's favorite do-everything markup language, XML! Seriously, why isn't it used to structure everything?
If everything is in MS Office, it's guaranteed to be inaccessible after just two upgrades.
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"
Plus the director should be called dorkrector. The article mentions playing eight-track tapes on an iPod. Does anyone have the link to that ultimate retro mod? Does it come with a Saturday Night Live dance cover?
Let Google handle it?
"We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive."
Those are called LUGs
I think we deserve to be told how many Library of Congresses that takes up!
the layman's guide to computer science
Lockheed officials have recommended using a handful of widely accepted formats such as the popular Internet software language HTML.
Those responsible have been sacked.
illegitimii non ingravare
Better make two tapes in case Sandy Berger sneaks off with one.
I'd love to read those emails, seeing as how we've gone from:
From: bclinton@whitehouse.gov
To: hclinton@whitehouse.gov
CC: agore@whitehouse.gov; tgore@whitehouse.gov; monica04329@yahoo.com; ltripp@weightwatchers.com;
Subject: omglol, you got to get me some of these!
I want these for Christmas! http://www.big-fat-cigars.com/
To something along the lines of:
From: gbushjr@whitehouse.gov
To: dickc@whitehouse.gov
CC: crice@whitehouse.gov; jbush@whitehouse.gov; lbush@whitehouse.gov; urnotapuppet@gmail.com; osamab@msn.com; cpowell@hotmail.com;
Subject: Are they for real? Can we attack them too?
Subject sayz it all, any toughts Dick? I think we can git `em.
> DYKE BOURDER OIL SERVIES
> OFFER FOR SALE OF NIGERIAN CRUDE OIL
>
> Dear Sir,
>
> I am President of blah blah blah...
I like big butts and I cannot lie.
rm -rf /
So they'll block all email retention, claiming, uh, national security or something.
The Google Search Appliance
http://www.google.com/enterprise/gsa FAQs
Though it isn't really ontopic, Google search appliances are vulnerable to various exploits & Google does provide patches.
[Fuck Beta]
o0t!
We've all had our "I gotta keep everything I do, download, see or hear in my records" moments, and sometimes they may last for years before we realize we don't need 99% of it anyway and will never never use it.
Information is infinite; there's no end to the amount of information any one of us can produce. Storing everything is old school; new school recognizes that fact and stores only important information.
What the government needs is to prioritize and save only the important stuff. Official bills and memos are worth saving, the president asking his secretary for a cup of coffee isn't.
There's no reason to keep 286s around to read WordStar documents. Just because formats are updated and revised doesn't mean the data needs to be stored as such. Save the text as ASCII, and the images as png or another lossless format. In the unlikely event that png is updated in a way that isn't backward compatible, convert the old files over to the newer format. Every few years, copy the data from old media to newer media. If done regularly (rather than, say, waiting until there are 500,000 floppies to make the leap to DVD-R), it won't be much of a chore. Sure it's a headache, but that's why they call it work.
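The "copy the data from old media to newer media" step could be sketched roughly like this. The checksum comparison is an extra safeguard assumed here, not something the post asks for, and the names `sha256_of` and `migrate` are hypothetical. It also assumes the destination directory is not inside the source directory.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(src_dir, dst_dir):
    """Copy every file under src_dir into dst_dir.

    Returns the list of source files whose copy failed checksum verification.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    failures = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies contents plus timestamps where possible
        if sha256_of(src) != sha256_of(dst):
            failures.append(src)
    return failures
```

An empty list back from `migrate` means every copy matched its original, so the old media could be retired.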
https://www.eff.org/https-everywhere
If the Internet Archive can back up the entire internet every few months, I would think the National Archives could handle a few hundred million emails.
electronic documents created today may not be legible on tomorrow's devices
ASCII text has been around for decades and, oh by the way, Internet-formatted email is 100% representable as ASCII text, since that's how it's still transferred today.
This supposed problem is a real problem only for those with Exchange, Domino or GroupWise, which create email in custom, internal formats.
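To illustrate the point about Internet-formatted mail being plain text, here is a small sketch using Python's standard `email` package. The message and its addresses are invented for the example, and a real archive would also have to deal with MIME attachments, which this does not cover.

```python
from email import message_from_string

# Hypothetical sample message in plain Internet mail format (headers, blank line, body).
RAW = (
    "From: alice@example.gov\n"
    "To: bob@example.gov\n"
    "Subject: Budget memo\n"
    "Date: Mon, 02 Jan 2006 10:00:00 -0500\n"
    "\n"
    "Please see the figures below.\n"
)

# Raises UnicodeEncodeError if the message contains anything outside ASCII.
RAW.encode("ascii")

msg = message_from_string(RAW)
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}
body = msg.get_payload()  # a plain string, because this message is not multipart

print(headers["Subject"])
print(body.strip())
```

Nothing here needs the original mail client, which is the poster's argument: the stored text is the record.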
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
You fail to take into account html email, attachments, large email threads where everyone replies to all (very common in a large organization).
The average email is 500,000 bytes in size (500K).
100,000,000 * 500KB = ~50,000GB = ~50 Terabytes of information
That's a lot of data even if you store it compressed.
You'd need 1250 DLT tapes or 250 LTO1 tapes or 125 LTO3 tapes to back up that data.
Compressing that data with bzip2, assuming 0.625 seconds per MB, would take:
0.625 s/MB * 50,000,000 MB = 31,250,000 seconds = ~520,833 minutes = ~8,680 hours = ~361 days = ~1 year
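The numbers above can be reproduced with a short sketch. The per-tape capacities are simply back-derived from the tape counts in the post (they are assumptions, not checked against vendor specs), and the 0.625 seconds per MB rate is the one implied by the compression figure.

```python
import math

# 100 million emails at 500,000 bytes each, in decimal gigabytes.
TOTAL_GB = 100_000_000 * 500_000 / 10**9

# Capacity per tape in GB, inferred from the tape counts in the post (assumption).
TAPE_GB = {"DLT": 40, "LTO1": 200, "LTO3": 400}
tapes = {name: math.ceil(TOTAL_GB / capacity) for name, capacity in TAPE_GB.items()}

# Single-stream compression time at the rate implied by the post (assumption).
SECONDS_PER_MB = 0.625
seconds = SECONDS_PER_MB * TOTAL_GB * 1000  # 1 GB = 1000 MB here
days = seconds / 86_400

print(tapes)
print(f"{seconds:,.0f} seconds, about {days:.0f} days")
```

Note the year-long estimate assumes one machine compressing everything serially; splitting the mail across N machines would divide the time by roughly N.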
Ya know, this whole "obsolescence" thing could probably be avoided with open document formats.
Next on /.: A neat mod to make your iPod play an 8-track tape.
The 8-track was a great idea. Bad design.
It had the best User Interface.
1. One Tape - slide it in
2. One Button - Press for Selection
3. Four Lights - Four Play List
The player was as simple as you can get with tapes.
1. One Motor
2. One Solenoid
3. One Tape head
4. One Audio Amp
5. A few lights and hardware to tie it together.
6. A Box!
I say we need to start an Open Source/Hardware rework of the design. The patents might not be a problem anymore.
1. High Quality Design of the Mechanism
2. LCD front panel Song display
3. Maybe a seek (fast forward) button added.
4. Digital/Analog Tapes
A. Both analog and digital tape ability.
B. Backwards compatibility with original 8-tracks. (Yes, I know most are broken now.)
5. Dolby/Surround on Digital Tapes.
I would love to have such an elegant design in an old '50s Chevy. Retro but High Tech at the same time.
So Cool...
I don't want a pickle; I just want a Motor-Cycle! A four foot cop arrived with a five foot gun!
The National Archives, entrusted to preserve America's official history...
:)
The official history? As opposed to what - the unofficial history? Or should it be worded differently: The National Archives, entrusted to preserve America's official government records...
Don't mean to sound nit-picky but when I first read that, a million conspiracy theories raced through my mind!
"Who says nothing is impossible? Some people do it every day!" - Alfred E. Neuman
Well.. One LOC is 11,362.5 GB (based off this)
If 100 Million e-mails is ~800GB
then 100 Million e-mails is about 7.04% of one LOC
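A quick check of that percentage, taking both the 11,362.5 GB per LOC figure and the ~800GB estimate from the posts as given:

```python
LOC_GB = 11_362.5   # size of one "Library of Congress" per the post
EMAILS_GB = 800     # the ~800GB estimate for 100 million emails

fraction = EMAILS_GB / LOC_GB
print(f"{fraction:.2%}")  # prints "7.04%"
```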
[Fuck Beta]
o0t!
Let's be honest and admit they use M$ junk. You know they are slinging around 70MB PowerPoint files, Word docs, ad nauseam. Getting that all put into something legible is hard to do. Try opening your Excel 4 files, for example. Did you remember to install the right fonts and equation editor? If all non-text were pumped to PDF or HTML, things would be a little easier but still larger. The challenge is automating the conversion given an administration that's cluelessly in love with all things M$.
Now that we've thought for two seconds, let's visit the article:
the Archives is struggling to devise a system for storing the enormous amount of digital information in a format that will allow it to be accessed 20, 75, even 200 years from now by historians, students and average Americans looking for a first-hand accounts of the federal government's activities.
Sounds a lot like that Mass. mess. Reading on ...
For example, when the electronic records of the Sept. 11 presidential commission arrived at the Archives a year ago, "it was the equivalent of all the fully processed electronic records we had received in 30 years," or about one terabyte of data, says Robert Chadduck, computer engineer overseeing the Archives' search for a solution.
Oh my, better get a bigger drive than 800GB.
Part of the impetus for wanting to come up with a comprehensive strategy for digesting electronic records is the desire to make them accessible via the Internet, rather than requiring people visit one of the Archives facilities, request a tape and then wait for a copy be mailed to them.
Suckage.
federal law requires that government documents be kept in their original formats to verify authenticity -- particularly documents that may be used in court.
Oh shit, they are going to become a Digital Williamsburg. I suggest they start learning Bochs, because it's unlikely they will be able to keep some dinky P1 running (with its "original" CD) to read Bill Clinton's love letters to Paula Jones, much less connect it to a network. Preserving the original format is a good idea, but documents must also be converted to some reasonable publication format before the ability and interest in such conversions goes away.
Friends don't help friends install M$ junk.
Clinton only sent two emails during his entire 8 years in office.
"His administration generated about 40 million messages - mostly memos and notes among aides and cabinet members. Of the two Mr Clinton sent, one was a test to see if the president could push an e-mail button. The other was addressed to astronaut John Glenn"
That shouldn't be hard to archive.
(on a slightly related note, I wonder what percentage of those are/were spam, and if they have to archive all those spam messages for online poker and hot wet bitches?)
The theory of relativity doesn't work right in Arkansas.
Please correct me if I am wrong, as I probably am, but would like to have this explained to me. Why couldn't all the emails be stored as plain text in a MySQL database with either a web interface (php?) or an application written in an interpreted language (Java or Ruby)? Does that make sense? Is there something I am missing?
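As a rough sketch of what the question describes, here is plain-text email stored in a relational table and searched with SQL. It uses SQLite as a stand-in for MySQL only because it ships with Python; the table layout, column names, and sample rows are all made up for illustration. It shows the storage part only, not the harder problems of attachments and provenance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
conn.execute(
    "CREATE TABLE emails ("
    " id INTEGER PRIMARY KEY,"
    " sender TEXT, recipient TEXT, sent_at TEXT, subject TEXT, body TEXT)"
)

# Hypothetical sample records.
rows = [
    ("alice@example.gov", "bob@example.gov", "2006-01-02T10:00:00",
     "Budget memo", "Figures attached."),
    ("bob@example.gov", "alice@example.gov", "2006-01-02T11:30:00",
     "Re: Budget memo", "Thanks, received."),
]
conn.executemany(
    "INSERT INTO emails (sender, recipient, sent_at, subject, body)"
    " VALUES (?, ?, ?, ?, ?)",
    rows,
)
conn.commit()


def search(term):
    """Return (sender, subject) for messages mentioning term, oldest first."""
    pattern = f"%{term}%"
    cursor = conn.execute(
        "SELECT sender, subject FROM emails"
        " WHERE subject LIKE ? OR body LIKE ? ORDER BY sent_at",
        (pattern, pattern),
    )
    return cursor.fetchall()


print(search("Budget"))
```

A `LIKE` scan like this would crawl at 100 million rows; a real system would want a full-text index, but the basic idea is the same.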
Take mercury delay lines. They kept data by continuously sending sound impulses inside a tank filled with mercury, and the impulses were recycled through to refresh the storage.
Well, this could be done with a *HUGE* disk array, where you add drives to increase storage, and "retire" broken or obsolete drives and it would evolve as technology does, never losing any of its data.
Monks have done an amazing job preserving important documents over the years. In fact, Xerox worked with Brother Dominic in the field of document preservation. Print out all the e-mails on archive quality paper and store them underground. Be sure they are also translated into Spanish so future Americans will be able to read them.
Strange women lying in ponds distributing swords is no basis for a system of government.
Well, if they would just run their mail through SpamAssassin it should make the problem far more manageable...
Oh well, what the hell...
ODF. Am I wrong? Isn't the whole point of ODF to have a format for documents that will be around longer than any company? As for emails, text and HTML should be easily accessible.
Please correct me if I am wrong.
PS: I am aware that some email clients butcher html.
Regards
Why not print the e-mails on paper? Seems to me that the National Archives are already well equipped to archive paper documents, and the data would last at least several hundred years.
:)
Of course, stone tablets have proven to be the most durable data storage medium to date, lasting upwards of 5,000 years, but that would probably be overkill.
...'the server ate my email' to any queries about critical email messages.
Like the Clinton team did.
resigned
How all of this stuff is connected, who it came from, when it was sent, all of that is something Historians (or Special Prosecutors) will need to know. Email from "aa204@whitehouse.gov" to "mikhail@kremvax.su" subject "Plans for Wall" isn't particularly useful if we don't have any way of tracking who aa204 was or knowing it was composed on Nov. 9, 1989 but not actually sent until Nov. 10, 1989.
Face it, most email systems are complex special-purpose systems made up of huge webs of interdependencies, from their hardware to their OS to their various applications. Imagine trying to pull emails, address books, mailing lists, undelivereds, calendars, attachments, cc's, bcc's, forwarded-forwarded-forwarded records etc. from a mass of DEC All-In-1 systems, IBM Profs, MS Exchange v.anything, and the /.-popular mbox/maildir/postfix/cyrus/exim/sendmail/dovecot/ldap/etc. environments...
Now figure out some reasonably stable format to save 'em all in where they can be referenced, cross-referenced, timelines produced, who-knew-what-when deduced, identities tracked, policy propagation studied, etc. That's not the territory of thousands of text files, or PNGs, it's a data-miner's nightmare and what the Nat'l Archives are facing.
So please, stop being quick-to-the-keyboards "Well d'uh" /.-trollers and assume that some reasonably clever and knowledgeable folks have already considered the problem and are appalled at its complexity. Yes, there are possibly some even more clever & knowledgeable folks who read /. but the text-&-png crowd is just so much wasted bits.
At least the big-database folks are probably closer to what is going to be required, and anyone who is starting to think that mebbe proprietary undocumented databases cost us all more in the long term than they're worth is even more (IMHO) on the right track...
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.
Lately I've been wondering how great Google really is, and whether it's deserving of the love I give it. Sure, I think the company Google is full of geniuses coming up with some of the best ideas since bread & butter.
But then I ask myself how much time I've spent trying to find things online. I've been finding Google to be increasingly less useful. When was the last time you googled, looking for information, and found nothing related? When was the last time you had to rephrase your search query not once, not twice, not three times, but four or five times? Now, when was the last time you googled for something besides Wikipedia (or any other well known site) and found what you wanted on the first page? I can tell you that for me, the times I've been able to check off "found in under 15 seconds" have become scarcer and scarcer. Since, I've increased results to 20 per page. That's helped a bit. But most of the time I'm having to rephrase my search query multiple times. After 5 or 6 tries, I usually find what I want halfway down the page. Why is this?
I've had several thoughts on this issue lately. Google could be filling up with spam - pages optimized just to get a high pagerank. Or perhaps I'm asking Google to find me increasingly complex and niche information. Being a GT student, it's entirely possible I'm simply asking it for things most other people don't find useful. But I didn't have these problems until, at most, two months ago. Or perhaps what I fear is becoming a reality: Google's IPO has turned the company in a different direction. Maybe their slogan is changing from a "do no evil" to a "do less good" stance? Am I crazy? Or are we blind, and is what I say true? Are we loving Google only because they're giving Microsoft a run for their money?
Don't get me wrong, Google has plenty of wonderful services: Google Earth, Gmail, the new click-a-button-and-have-that-company-phone-me service, etc. But is it possible that they're beginning to sell out the top results in their searches? Consider the evidence: I've been spending more time than ever finding quality links. Google's IPO was but a few months ago. Also, in talks with AOL, Google now plans to offer not only specialized AOL ads, but also FLASHier adsense ads. So is it probable that Google is selling a place in their top results? I'm very inclined to think so. And so, just recently, I've come to question my devotion to Google.
Am I the only one wasting search time? I think it's time we re-evaluate Google's search engine, and think twice before we offer our praise.
You've got to be kidding!!! Why didn't they just create a master DVD, then press DVD data discs for every local library that wants to have a copy?
The other big question is why hasn't Google offered a free service to archive all that data?
My sister got bitten by a moose once. Mind you, it was a pretty good moose. Those responsible for sacking the people who sacked the original editors have also been sacked. The argument was completely re-edited at the last moment by a team of Ecuadorian mountain llamas.
They should experience how the latest version of Microsoft Office can help them better manage documents, organize workload, and collaborate with coworkers--not just from their desk, but from almost anywhere! Why? So that their system will deliver the features, options, and performance they need to maximize their productivity and enjoyment, to insure that their software is authentic, properly licensed and supported by Microsoft or a trusted partner, so that they will get access to updates, enhancements, and innovations that help them protect and do more with their PC! In conclusion, If you don't believe that Microsoft Office has REAL Ultimate Power you better get a life right now or they will chop your head off!!! It's an easy choice, if you ask me.
Or was this about email received by the White House? All of that routed through a special team working out of the office of the Vice President. All of that email was also identical: "Cheney was right all along."
These two may seem like odd coincidences, but only if you hate America. Your email will be forthcoming.
Ya mean these folks don't send out Word, Powerpoint, Excel, funny pics and all that other attachment crap to the 100 folks on the cc: list? I'm thinking 8K is probably a very low end estimate given the junk that shows up in my inbox.
Print it all out using stable inks on acid-free paper. ;-)
- This will give the librarians something to do, and will be immune to technology going obsolete
"At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.'"
That's not really the fault of tech, that's a problem with companies trying to engage in vendor lock-in tactics. Keep things simple and standardized (aka ascii/plain text or open formats) and this should be a non issue. Keep everything in PDF or DOC format and yeah... you'll have problems. Take it from the State of Maine, they know full well what they are doing.
Software Media Library
I have to say the biggest problem they face is the fact that the entire US Government is not on one standard for electronic documents. NARA uses GroupWise for its e-mail. Other agencies use Exchange/Outlook. Some agencies still use text mode e-mail on a mainframe or UNIX box. People I speak with in the Navy tell me that the whole Navy uses a bunch of different formats for everything from e-mail to word processing documents.
The government is only recently adopting PDF files, because PDFs before version 1.5 of the spec were not Section 508 compliant, and a screen reader could not read them.
Flash animations on web sites are out also due to Section 508 compliance. NARA's headache would be greatly reduced if they could standardize on a format that everyone uses across the board in all agencies.
Sadly, the government doesn't work that way...
There are major technical problems with ODF.
The first and biggest one is that it doesn't help to entrench the MS Office monopoly. In fact, it tends to work against this goal, because other vendors can freely interoperate with ODF documents.
Another major problem with the ODF format is that nobody is able to impose a "tax", or to require special individual license permission for each new software which reads and writes the format.
Finally, ODF is tainted by that "open source" movement. Respected, successful business leaders of our nation have denounced it with phrases such as "...like a cancer" and "...infects other intellectual property", "is un-American", and other similar remarks.
Considering the above problems with ODF, and because I am cynical and have lost all faith in a system which is hopelessly corrupt, I don't expect ODF to actually be used by the government. It simply doesn't put money into the right pockets.
The price of freedom is eternal litigation.
I started archiving slides of archaeological digs for one of the local universities about twelve years ago. At the time Kodak gave an estimated shelf life of 100+ years for the recorded CDs. We actually started having data corruption within a few years, even with multiple copies stored in different, climate-controlled locations.
Now, the slides are more degraded than twelve years ago and we are back to looking at other methods of archiving the data since we can't predict what will happen to the digital storage down the road 10-15 years, much less hundreds of years from now.
Also, I used to have a business that would recover and convert data from one format to another. You wouldn't believe the number of businesses that archive data in one format and put it in storage, have a catastrophic failure and, upon trying to recover data, find that they no longer have the equipment to retrieve the data.
Converting the data was easy - if you still had the equipment that was used to archive the tape, cartridge, diskette, etc.
If I experience these problems in a small city within 15 or so years, I can't imagine what problems a project of that scope (archiving the whole of the gov't's data) would have while trying to preserve it for future research or historic context.
-JM
This isn't quite like 8-tracks where the players aren't made anymore.
Obviously if you dump this stuff to tape then the comparison holds...for a little while. I would expect that any company, upon upgrading their archive hardware would migrate existing data to the new equipment!
Then the issue is simply with format. I find it hard to believe that in 500 years, no one will be able to decipher a txt, doc, png, jpg, etc. These are SOFTWARE formats--not hardware. Thus, you don't need to maintain any piece of equipment, just a little code. Sure, the more obscure formats like wps, wks, etc. might give you trouble but come on--you will have the same problem with well-documented open source formats that are not very popular (like xcf?).
I don't think open source alone is the solution here.
Yes, I know; Whoosh!
Those who sacrifice security to condemn liberty deserve to repeat history or something. - Benjamin Santayana
I've never seen a more compelling argument for OpenDoc. (and/or a conversion requirement to OpenDoc.)
Take out anything that might injure National Security, then turn the rest over for Google to index.
Dear Sir,
I write to inform you of my desire to acquire [REDACTED] in your country on behalf of [REDACTED] of the [REDACTED] in Nigeria. Considering his very strategic and influential position, he would want the [REDACTED]. He further wants [REDACTED], until [REDACTED]. Hence our desire to have [REDACTED].
[28 LINES REDACTED FOR SECURITY PURPOSES]
Your quick response will be highly [REDACTED]. Thank you in anticipation of [REDACTED].
Yours [REDACTED],
[REDACTED]
I sympathize - sort of. In the '90s I worked on a defense contractor's account during their physical move to a "black hole": data, machines, everything. An engineer called me; he was close to retirement, and had worked on the original F101, and the 105. Twenty years ago he had archived his drawings to the system they were using: Macs. They didn't get UNIX until the '90s. He had all the drawings etc. under his desk and on shelves on floppies and wanted them converted to either Wintel, or current Mac (8.0 at that time). Not an easy process. But all those emails go to the retiring President for his library, and his crew there can worry about it. The Office of the POTUS and the Archives are only responsible for "historic" documents. Bush has some bright people (I think).