Archiving Digital History at the NARA
val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a "digital dark age"?"
Ok, I was tempted to make a pr0n joke about this, but I think the bigger question is what kind of indexing system will this use?
I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone who wants to find a piece of data in that haystack, err, haybarn.
It happened with the Great Library of Alexandria, with pagan libraries throughout the Christian era, and more recently has happened with antiquities in Afghanistan and Iraq. The only thing that can reliably preserve data is large-scale, geographically widespread distribution of copies.
Perhaps the answer is compression.
Does anyone know whether there is an upper limit to text compression?
In signal processing, there is a limit called the Shannon Capacity theorem, which gives the maximum amount of information that can be transmitted on a channel. In text compression, is there a similar limit?
Note that the Shannon Capacity theorem does not tell you how to reach that limit. The theorem merely tells you what the limit is. For decades, we knew that the maximum limit on a normal telephone twisted pair is about 56,000 bits per second, according to the theorem. However, we did not know how to reach it until Trellis coding was discovered, according to an electronic communications colleague at the institute where I work.
If we can calculate a similar limit for text compression, then we can know whether further research to find better text compression algorithms would be potentially fruitful. If we are already at the limit, then we should spend the money on finding denser storage media.
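For what it's worth, there is such a limit: Shannon's source coding theorem says that lossless compression cannot, on average, beat the entropy of the source. The hard part is that nobody knows the true entropy of English text; Shannon's own experiments put it at very roughly one bit per character, and any number you compute depends on the statistical model you assume. Below is a minimal Python sketch, not a definitive measurement. The function names and the sample string are made up for illustration. It computes the entropy under the crudest possible model (independent characters) and compares it with what zlib actually achieves.

```python
import math
import zlib
from collections import Counter


def entropy_bits_per_char(text: str) -> float:
    """Shannon entropy in bits per character, ASSUMING characters are
    independent (a zero-order model). Real text is not like this."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def zlib_bits_per_char(text: str) -> float:
    """Bits per character actually used by zlib at its highest level."""
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return 8 * len(compressed) / len(text)


# Hypothetical sample: a highly repetitive string, chosen to make the point.
sample = "the quick brown fox jumps over the lazy dog " * 200

print("zero-order entropy:", round(entropy_bits_per_char(sample), 2), "bits/char")
print("zlib level 9:      ", round(zlib_bits_per_char(sample), 2), "bits/char")
```

On this sample zlib lands far below the zero-order figure, because it exploits repetition that an independent-character model cannot see. So the zero-order number is only a bound for that one model; the real limit is the entropy rate under the best possible model of the text, which can only be estimated.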
In the age of pen and paper, only important stuff was written down. Nowadays all crap is preserved. This is useless. There is a big difference between data and information.
Oh well, what the hell...
I think more accurately, we are headed towards an age of super-saturation of information. I have no doubt we can store all the data we are currently and will be generating. The question is how do we process it into something meaningful? Just because we have the ability to archive everything does not mean it will be useful to the [insert personally welcomed overlord] of the future.
Maybe historians of the future will be fascinated that Clinton's instant-message signoff was "l8ter d00d", but I doubt it. We'll want to save everything now of course, because we can. But the majority of the information I suspect will just be filtered out when actually searched.
Personally, I take the "you never know" ideology and save everything.
Digital technologies mean that archivists now enjoy orders of magnitude more information than they had in the past. Consider all the hallway and phone conversations or jotted notes lost in a paper-based organization versus having an archive of e-mail, IM, and sticky-note digital files.
Digital technologies mean that archivists now enjoy orders of magnitude more potential accessibility than in the past. Even if paper has greater innate archival lifespan, its physical form makes it inaccessible to all but a select monkish class of archivists colocated with their paper archives. Even the select few archivists who are allowed access to paper archives can only effectively process at best a dozen documents per minute (and only a dozen per hour if they must wander the files to find randomly dispersed documents).
By contrast, digital technologies radically expand access on two dimensions. First, technology expands the number of people that can access an archive in terms of distance -- a remote researcher can have full access, including access to documents in use by other archivists. A low cost to copy documents means a wealth of information. Second, search tools provide prodigious access to the files -- searching/accessing/reading thousands or millions of documents per second.
To say we face a dark age is to presume that paper documents provided far more enlightenment and comprehensiveness of documentation than paper ever actually did.
Two wrongs don't make a right, but three lefts do.
And don't give me shit about my karma or whatever. My karma's fine, I don't care about it. I'm copying this because it's interesting and contributes to the discussion.
What do you think about Ralph's thoughts?
xkcd.com - a webcomic of mathematics, love, and language.
Actually, one of the main complaints historians have is incomplete information about the past. Not having every little tidbit makes it impossible to figure out how people actually lived. History _should_ be more than just names, dates, and events. If we can properly preserve and index items that seem really mundane to us, future generations have a _much_ better chance of having some real understanding of how we developed as a society.
Never attribute to malice that which can be adequately explained by incompetence. -- Jerry Pournelle
I don't know about the NASA data sets, but they could certainly save a few petabytes by stripping the stupid HTML part of all Outlook emails...
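A rough sketch of what that stripping could look like, using Python's standard `email` package. The helper name `strip_html` and the demo message are made up for illustration, and this is a deliberately lossy toy: it keeps only four headers and the text/plain body, so attachments and every other header are thrown away, which a real archive would almost certainly not accept.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def strip_html(raw: bytes) -> bytes:
    """Return a slimmed copy of a message with only its text/plain body.

    Hypothetical helper. Messages with no text/plain part come back unchanged.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    plain = msg.get_body(preferencelist=("plain",))
    if plain is None:
        return raw
    slim = EmailMessage()
    # Assumption: only these headers matter. Everything else is dropped.
    for name in ("From", "To", "Subject", "Date"):
        if msg[name] is not None:
            slim[name] = msg[name]
    slim.set_content(plain.get_content())
    return slim.as_bytes()


# Demo message with both a plain-text and an HTML version of the same body.
demo = EmailMessage()
demo["From"] = "a@example.com"
demo["To"] = "b@example.com"
demo["Subject"] = "Meeting"
demo.set_content("See you at 3pm.\n")
demo.add_alternative(
    "<html><body><p>See you at <b>3pm</b>.</p></body></html>", subtype="html"
)

raw = demo.as_bytes()
slim = strip_html(raw)
print(len(raw), "->", len(slim), "bytes")
```

Whether the savings would really add up to petabytes is anyone's guess; the sketch only shows that the HTML alternative is redundant when a plain-text part carries the same content.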
Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that lead up to the horrible decisions that lead our politicians to make the mistake of the daily business? We want records of this?
Absolutely, yes!
History is often taught as "Charlemagne took over Constantinople in the year 12xx" as though military feats really mattered to the average Joe. But, the truth is, America was colonized by people who thought that, however bad it might be in a virgin land, it was BETTER than their lives in Europe.
One of the key failures in public education today is to communicate the understanding that history is comprised mostly of PEOPLE doing ORDINARY things in their time to make life better for themselves and their families. They loved, worked, got bored, and cracked jokes at the expense of their leaders, just like we do today.
History doesn't consist of battles, any more than history consists of artworks. Capturing more detail in the average, everyday lives of people gives a much better understanding of the cultural norms, and the ideals to which people aspired.
The pyramids of ancient Egypt provide a clear, artistic monument to their culture, yet we have only a modest understanding of their day-to-day culture. Similarly, we have Stonehenge as a clear monument to the grooved-ware people of the English isles, but almost NO understanding of who they were and what they felt was important. How much would a true historian give to understand the day-to-day culture of these mysterious "grooved-ware" people of ancient times?
Those memos and IMs comprise that understanding of people today.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
To you and the countless others on /. who offer their corrections in a similar tone: Yes, we get it, the parent poster goofed and you supplied a correction. Given the trivial context here, it's hardly a big deal and doesn't warrant sarcasm. Everyone makes mistakes and plenty of people make mistakes in their work every day, including people who do work where lives are at stake. That's one reason why it is good to work with other people. In life it's far more important to be forgiving, keep things in perspective, and help other people without the wiseacre commentary and then move on.
Digital Citizen