National Archives' Digital Woes
Carl Bialik from the WSJ writes "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years, according to the Wall Street Journal. 'The rapid adoption of electronic communications technology in the last decade has created a major crisis for the Archives,' the Journal reports. 'For one thing, the amount of data to be preserved has exploded in recent years, thanks to the proliferation of high-tech tools such as personal computers and wireless email devices such as BlackBerries. At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.' The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"
100 million emails
Let's be generous and say that the average email is 8192 bytes (8 KB).
100,000,000 * 8KB = ~800GB
That's not much at all. And that's if you store it uncompressed.
Use a well-documented, unencumbered compression algorithm and it's likely to all fit on a single tape.
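For anyone who wants to check the back-of-the-envelope figure, here is a minimal sketch of the same estimate. The function name is made up for illustration, and it assumes decimal gigabytes (1 GB = 10**9 bytes), which is why the result lands slightly above the rounded ~800GB.

```python
def total_size_gb(message_count: int, avg_bytes: int) -> float:
    """Return the total size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return message_count * avg_bytes / 10**9


# 100 million emails at an assumed average of 8192 bytes each
estimate = total_size_gb(100_000_000, 8192)
print(f"{estimate:.1f} GB")  # prints "819.2 GB"
```

Changing the second argument is all it takes to try a different guess at the average message size.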
Really, rather than talking about how horrid it is, why not be busy working on software and hardware solutions that will bring old document types up to today's standards, and devices that will pull data off of old drives?
I'm sure a universal data conversion tool would be worth a pile of money.
If the Internet Archive can back up the entire internet every few months, I would think the National Archives could handle a few hundred million emails.
You fail to take into account HTML email, attachments, and large email threads where everyone replies to all (very common in a large organization).
The average email is 500,000 bytes in size (500 KB).
100,000,000 * 500KB = ~50,000GB = ~50 Terabytes of information
That's a lot of data even if you store it compressed.
You'd need 1250 DLT tapes or 500 LTO1 tapes or 125 LTO3 tapes to back up that data (at native capacities of 40GB, 100GB, and 400GB per tape).
Compressing that data with bzip2, at roughly 0.625 seconds per MB, would take:
0.625 s/MB * 50,000,000 MB = 31,250,000 seconds = ~520,833 minutes = ~8,680 hours = ~361 days = ~1 year
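The numbers above can be reproduced with a short sketch. Everything here is an assumption carried over from the estimate or added for illustration: the 500 KB average, the 0.625 seconds-per-MB bzip2 rate implied by the arithmetic, native (uncompressed) capacities of 40 GB for DLT and 400 GB for LTO3, and the helper names themselves. It also assumes a single machine compressing serially; splitting the work across machines would divide the time accordingly.

```python
import math

MESSAGES = 100_000_000
AVG_BYTES = 500_000        # assumed average email size (500 KB)
SECONDS_PER_MB = 0.625     # assumed bzip2 rate implied by the estimate above

total_mb = MESSAGES * AVG_BYTES / 10**6   # 50,000,000 MB
total_gb = total_mb / 1000                # 50,000 GB, i.e. 50 TB


def tapes_needed(size_gb: float, tape_capacity_gb: float) -> int:
    """Number of tapes needed, rounding up to a whole tape."""
    return math.ceil(size_gb / tape_capacity_gb)


def compression_days(size_mb: float, seconds_per_mb: float) -> float:
    """Serial compression time in days at a fixed per-MB rate."""
    return size_mb * seconds_per_mb / 86_400  # 86,400 seconds per day


print(tapes_needed(total_gb, 40))    # DLT at 40 GB native: 1250
print(tapes_needed(total_gb, 400))   # LTO3 at 400 GB native: 125
print(f"{compression_days(total_mb, SECONDS_PER_MB):.1f} days")  # 361.7 days
```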
Lately I've been wondering how great Google really is, and whether it's deserving of the love I give it. Sure, I think the company Google is full of geniuses coming up with some of the best ideas since bread & butter.
But then I ask myself how much time I've spent trying to find things online. I've been finding Google to be less and less useful. When was the last time you googled, looking for information, and found nothing related? When was the last time you had to rephrase your search query not once, not twice, not three times, but four or five times? Now, when was the last time you googled for something besides Wikipedia (or any other well-known site) and found what you wanted on the first page? I can tell you that for me, the times I've been able to check off "found in under 15 seconds" have become scarcer and scarcer. Since then, I've increased results to 20 per page. That's helped a bit. But most of the time I'm having to rephrase my search query multiple times. After 5 or 6 tries, I usually find what I want halfway down the page. Why is this?
I've had several thoughts on this issue lately. Google could be filling up with spam - pages optimized just to get a high PageRank. Or perhaps I'm asking Google to find me increasingly complex and niche information. Being a GT student, it's entirely possible I'm simply asking it for things most other people don't find useful. But I didn't have these problems until, at most, two months ago. Or perhaps what I fear is becoming a reality: Google's IPO has turned the company in a different direction. Maybe their slogan is changing from a "don't be evil" to a "do less good" stance? Am I crazy? Or are we blind, and is what I say true? Are we loving Google only because they're giving Microsoft a run for their money?
Don't get me wrong, Google has plenty of wonderful services: Google Earth, Gmail, the new click-a-button-and-have-that-company-phone-me service, etc. But is it possible that they're beginning to sell out the top results in their searches? Consider the evidence: I've been spending more time than ever finding quality links. Google's IPO was but a few months ago. Also, in talks with AOL, Google now plans to offer not only specialized AOL ads, but also FLASHier AdSense ads. So is it probable that Google is selling a place in their top results? I'm very inclined to think so. And so, just recently, I've come to question my devotion to Google.
Am I the only one wasting search time? I think it's time we re-evaluate Google's search engine, and think twice before we offer our praise.