Slashdot Mirror


National Archives' Digital Woes

Carl Bialik from the WSJ writes "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years, according to the Wall Street Journal. 'The rapid adoption of electronic communications technology in the last decade has created a major crisis for the Archives,' the Journal reports. 'For one thing, the amount of data to be preserved has exploded in recent years, thanks to the proliferation of high-tech tools such as personal computers and wireless email devices such as BlackBerries. At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.' The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"

190 comments

  1. A lot of hate-mail by Beuno · · Score: 0, Flamebait

    100 million hate-emails is a lot of hate mail.
    It is all hate mail, right?

  2. Not A Problem... by ferrellcat · · Score: 2, Funny

    "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House,.." Thanks to the Patriot Act, this number will be reduced to roughly four, including one such email with a complelling advertisement for V14GR4!!!!!!11

    1. Re:Not A Problem... by Anonymous Coward · · Score: 0

      Yes, but how much of that binary data is duplicated (images in signatures, Word documents, etc.)? When you're writing archive software, you should include an algorithm that detects binary duplicates and stores each one only once.
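
      A minimal sketch of that idea, assuming attachments arrive as (name, bytes) pairs and that a SHA-256 digest is an acceptable identity for "binary duplicate". The function name and the sample data are made up for illustration; a real archive would also keep the metadata for every occurrence.

```python
import hashlib

def store_deduplicated(attachments):
    """Keep one copy of each distinct byte string, keyed by SHA-256 digest."""
    blobs = {}       # digest -> bytes, one entry per distinct content
    references = []  # (name, digest) for every attachment seen
    for name, data in attachments:
        digest = hashlib.sha256(data).hexdigest()
        blobs.setdefault(digest, data)  # stored only the first time
        references.append((name, digest))
    return blobs, references

# Hypothetical sample: the same signature image attached to two messages.
logo = b"\x89PNG placeholder bytes for a signature image"
attachments = [
    ("msg1/logo.png", logo),
    ("msg2/logo.png", logo),
    ("msg3/report.doc", b"quarterly report"),
]
blobs, references = store_deduplicated(attachments)
print(len(references), "attachments,", len(blobs), "stored blobs")
```

      Every attachment keeps its own reference, so nothing about which message carried which file is lost; only the bytes are shared.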

  3. some funny math by Yonder+Way · · Score: 4, Interesting

    100 million emails
    let's be generous and say that the average email is 8192 bytes in size (8KB)

    100,000,000 * 8KB = ~800GB

    That's not much at all. And that's if you store it uncompressed.

    Use a well documented unencumbered compression algorithm and it's likely to all fit on a single tape.
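
    The estimate above can be checked in a few lines. The 8192-byte average is the poster's assumption, not a measured figure, and the result depends on whether "GB" means 10^9 or 2^30 bytes.

```python
MESSAGES = 100_000_000
AVG_BYTES = 8192  # the poster's assumed average message size (8 KB)

total_bytes = MESSAGES * AVG_BYTES
decimal_gb = total_bytes / 10**9  # gigabytes as disk vendors count them
binary_gib = total_bytes / 2**30  # gibibytes as most operating systems count them

print(f"{total_bytes} bytes = {decimal_gb:.1f} GB = {binary_gib:.1f} GiB")
```

    Either way the total lands in the neighborhood of 800GB, before attachments and before compression.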

    1. Re:some funny math by NonSequor · · Score: 3, Funny

      This is the Bush administration we're talking about. They all use HTML mail with lots of attached graphics. On top of that, many messages get forwarded hundreds of times.

      --
      My only political goal is to see to it that no political party achieves its goals.
    2. Re:some funny math by Anonymous Coward · · Score: 0
      ...say that the average email is 8192 bytes...
      Attachments can balloon the size of emails. When I create a simple word document with a standard company header and a small amount of formatted text it's about 1MB. Powerpoints are the same way and loaded with graphics.
    3. Re:some funny math by AndroidCat · · Score: 1

      Odds are, no one trims their replies. Just save the most recent one and you're done!

      --
      One line blog. I hear that they're called Twitters now.
    4. Re:some funny math by Ithika · · Score: 1

      What's that when converted to the storage capacity unit du jour, the Library of Congress (or LoC)? How many LoCs is 100 million emails?

    5. Re:some funny math by cashman73 · · Score: 1

      Plus, don't forget that the Bush Administration gets messages from both of the "internets".

    6. Re:some funny math by xdc · · Score: 1
      How many LoCs is 100 million emails?

      Moreover, how many football fields is 100 megamessages? (Speaking of the storage capacity unit du jour.)

    7. Re:some funny math by Wildfire+Darkstar · · Score: 4, Informative

      Speaking as a trained archivist, I can say that the problem isn't finding storage space for the e-mails, per se. It's the duty and responsibility of the National Archives to preserve both content and context, and to ensure that these e-mails remain accessible for however long the retention schedules call for (which, in the case of executive communication, is not an insignificant length of time). Which means that the problem cannot be satisfactorily solved by dumping every e-mail onto a hard drive somewhere and forgetting about them. They all need to be indexed and cataloged, and provisions need to be made to ensure that the data can be migrated onto newer technology when it becomes necessary to do so without losing any of the information (or metadata) associated with it.

      The volume of material is staggering, and goes beyond what NARA (or almost anyone else, for that matter) has traditionally dealt with. While storage space itself is a concern, to some degree, given that this material will continue to accumulate, the larger problem is how to manage this material. Having 800GB of e-mail is pointless if you don't provide a means to get in and retrieve specific messages, and provide the appropriate context for that e-mail.

      --
      Sean Daugherty "I have walked in Eternity -- and Eternity weeps."
    8. Re:some funny math by Anonymous Coward · · Score: 0

      yea, the real internet and the one invented by Al Gore :-)

    9. Re:some funny math by thePfhitz · · Score: 1
      Having 800GB of e-mail is pointless if you don't provide a means to get in and retrieve specific messages, and provide the appropriate context for that e-mail.

      That's easy. Easy-peasy. At ~2680 megabytes per account right now, give them 306 Gmail accounts!

    10. Re:some funny math by MillionthMonkey · · Score: 2, Funny

      That's not much at all. And that's if you store it uncompressed.

      And any compression routine will immediately tokenize the long heavily repeated phrases: "September 11, 2001", "Global War on Terror", "aid and comfort to the enemy", "America's will is strong", "central front in the war on terror", "the American people are safer", "9/11", "we will prevail". There isn't a lot of entropy in this particular dataset.
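
      Joke aside, the underlying point is real: repetitive text compresses far better than high-entropy data. A rough sketch using Python's zlib, with the phrase list taken from the comment above and random bytes standing in for incompressible data; the repeat count of 200 is arbitrary.

```python
import os
import zlib

phrases = [
    "September 11, 2001",
    "Global War on Terror",
    "aid and comfort to the enemy",
    "central front in the war on terror",
    "we will prevail",
]

# Low-entropy input: the same handful of phrases repeated many times.
repetitive = (" ".join(phrases) * 200).encode("utf-8")
# High-entropy input of the same length for comparison.
random_like = os.urandom(len(repetitive))

def ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)

print(f"repetitive text: {ratio(repetitive):.3f}")
print(f"random bytes:    {ratio(random_like):.3f}")
```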

    11. Re:some funny math by Crudely_Indecent · · Score: 2, Interesting

      Too bad they already awarded the contract to Lockheed Martin (someone had their palm greased in that deal), as my company deals with document conversion and archiving (of this scale) on a regular basis. The NA concern was converting the documents to modern formats and yet retaining the original document... Peanuts....my systems do it on the fly.

      Oh well....$308 million contract goes bye bye.....

      When did Lockheed Martin get into the document management business?

      --


      "Lame" - Galaxar
    12. Re:some funny math by commodoresloat · · Score: 1
      When did lockheed martin get into the document management business?

      Sounds like they just did.

    13. Re:some funny math by crazyphilman · · Score: 1

      If the NARA really wanted to be sharp about this, they could load all the emails into a running database instead of onto media, then back it up to tape periodically (this ensures the tapes will keep working, etc). They could go in all sorts of directions from this starting point, including cloning the database to produce WORKING backups, etc.

      Somehow, I find a running server more trustworthy than a bunch of CDs in a box. At least I can go ASK the thing whether it still works... :)
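
      A small sketch of the "running database plus working clone" idea, using SQLite only because it ships with Python; the table layout and sample rows are invented, and a production archive would use a server database and real files rather than in-memory connections.

```python
import sqlite3

# The "running" database that holds the mail.
live = sqlite3.connect(":memory:")
live.execute(
    "CREATE TABLE mail (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, body TEXT)"
)
live.executemany(
    "INSERT INTO mail (sender, subject, body) VALUES (?, ?, ?)",
    [
        ("alice@example.org", "Budget memo", "Draft attached."),
        ("bob@example.org", "Re: Budget memo", "Looks fine."),
    ],
)
live.commit()

# Clone it into a second database that can be queried immediately.
clone = sqlite3.connect(":memory:")
live.backup(clone)

def count(conn):
    return conn.execute("SELECT COUNT(*) FROM mail").fetchone()[0]

# "Go ASK the thing whether it still works."
status = clone.execute("PRAGMA integrity_check").fetchone()[0]
print(count(live), count(clone), status)
```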

      --
      Farewell! It's been a fine buncha years!
    14. Re:some funny math by ajs · · Score: 1

      This doesn't make sense to me.

      First, you have the mail itself. RFC2822 is an international standard, so that seems like the right way to go to store the mail. For indexing, there are any number of mail archival systems, some better and some worse than others, but most handle 2822 just fine.
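
      As a rough illustration of how little work the headers are, Python's standard email package parses RFC 2822 style messages directly. The message text, addresses, and index fields here are invented for the example.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A hypothetical message in RFC 2822 form: headers, blank line, body.
raw = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Retention schedule\n"
    "Date: Mon, 26 Dec 2005 09:30:00 -0500\n"
    "Message-ID: <1234@example.org>\n"
    "\n"
    "Please retain this message.\n"
)

msg = message_from_string(raw)

# The fields an archival index would most likely want.
index_entry = {
    "from": msg["From"],
    "to": msg["To"],
    "subject": msg["Subject"],
    "date": parsedate_to_datetime(msg["Date"]),
    "message_id": msg["Message-ID"],
}
print(index_entry["date"].isoformat(), index_entry["subject"])
```

      Storing the raw message untouched and building the index beside it keeps the original intact, which matters for the format-conversion concerns raised elsewhere in this thread.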

      Now you get into attachments. Here, you presumably want to convert everything down into one of PDF (semi-open format where there are at least several competing readers), Open Document (open format with an open source reader as the primary implementation) or PNG (open specification with many implementations) depending on the nature of the attachment.

      So where's the problem?

    15. Re:some funny math by Anonymous Coward · · Score: 0

      I think if they didn't contract to Google, they contracted with the wrong people for the email archive.

    16. Re:some funny math by Anonymous Coward · · Score: 0

      Plus they're sending large binaries of gay porn to eachother. All republicans are really repressed homos. That's how that whole Jeff Gannon thing happened. One homo doing another homo a favor, getting whitehouse access without background checks, etc. Plus I'm sure Bush himself constanting emailing people copies of that "monkey scratching his butt and smelling it, then falling over from the smell" video to people, even though everyone on the planet has seen it like 5 years ago. That moron still busts out loud laughing everytime he see's it. Plus tons of email traffic between Dick Cheney and Satan.

    17. Re:some funny math by number11 · · Score: 1

      Too bad they already awarded the contract to lockheed martin (someone had their palm greased in that deal), as my company deals with document conversion and archiving

      So how much did you give to Jack Abramoff? Nothing? Maybe that explains it?

    18. Re:some funny math by MagnusDredd · · Score: 1

      It's called pdf and links and displayed header information.

      It's damned simple. Use a Unix time stamp plus subject and sender for the name of the file. Then create directories for mailboxes and drop links into the directories that correlate to the sender and receiver. You name the link with the standard crap you see in an email client. You name it Date/Time|Subject|Sender.

      Now what you have is a list of emails (links to the actual PDFs) received by date and time in a time ordered fashion. Any coder worth a crap can use "|" or whatever character they wish to separate fields for alternate ordering of lists. Then all you need is a front-end to display this...

      You place links in the PDFs covering the to: and sender: fields that will take you to the appropriate mailboxes. I'd probably also divide the mailboxes into a User/Year/Month/Day directory format with the links to the emails residing in the appropriate directory in order to keep things in some kind of workable size.

      There'd be a bit more to it, but that's what I've come up with in about 2 minutes. I'm pretty sure the rest of the logistical angle would likewise be readily doable.

      The point is:
      PDF is a fairly open format that will be readable in the future.
      PDF supports linking.
      Directory structure will exist as a metaphor into the foreseeable future (and if not then a database layout of fields matching this layout will do the same thing.) Once again, writing some scripts to take a directory structure of links and import it into a database would take a bit of work, however I know of multiple Perl coders who'd be able to do this.
      This layout would be damned simple to web enable.
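
      The layout described above can be sketched in a few lines. This assumes a POSIX filesystem with symbolic links; the function names, the "store" and "mailboxes" directory names, and the placeholder PDF bytes are all invented here, and the "/" and "|" characters are replaced in field values so they cannot break the path or the field separator.

```python
import os
import tempfile
from datetime import datetime, timezone

def safe(text):
    """Replace characters that would break the path or the '|' separator."""
    return text.replace("/", "_").replace("|", "_")

def file_message(root, sent, subject, sender, recipient, pdf_bytes):
    """Write one PDF and link it into sender and recipient mailboxes."""
    name = f"{int(sent.timestamp())}|{safe(subject)}|{safe(sender)}.pdf"
    store = os.path.join(root, "store")
    os.makedirs(store, exist_ok=True)
    target = os.path.join(store, name)
    with open(target, "wb") as handle:
        handle.write(pdf_bytes)

    links = []
    for user in (sender, recipient):
        # User/Year/Month/Day keeps each directory to a workable size.
        box = os.path.join(root, "mailboxes", safe(user), sent.strftime("%Y/%m/%d"))
        os.makedirs(box, exist_ok=True)
        link = os.path.join(box, name)
        os.symlink(target, link)
        links.append(link)
    return target, links

root = tempfile.mkdtemp()
sent = datetime(2005, 12, 26, 14, 30, tzinfo=timezone.utc)
target, links = file_message(
    root, sent, "Budget memo", "alice@example.org", "bob@example.org",
    b"%PDF-1.4 placeholder",
)
print(os.path.basename(target))
```

      Because the name starts with the timestamp, a plain sorted directory listing is already in time order, which is the property the comment relies on.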

    19. Re:some funny math by Mikelikus · · Score: 2, Funny

      I know, I know, I know!! Why don't they use Google Desktop!!?

      --
      -- Would it be acceptable to just put my name on my sig?
    20. Re:some funny math by evanism · · Score: 1

      what a trivial problem. I run a biz with 16TB of RAID5 all online pumping out over *2TB per day*.

      Crikey, if one small business can hammer it out, and have it all indexed and publicly available, while being massively redundant and impossible to crash, I'm certain that the bloody government can do it... unless they are incompetent....

      --
      Just bought a new quantum computer, but I'm uncertain how it works.
    21. Re:some funny math by Black+Marlin · · Score: 2, Insightful
      Well, now that you know that the federal government has email storage issues, perhaps your company needs to step up and learn about how to bid on federal contracts. State governments are in smaller versions of the same boat. Our governor may not be turning over 100 million emails to our State Archives, but it will be a bunch. Even the last geezer governor transferred a big chunk.

      If you've got the best mousetrap, you need to find out more about how to make your product available to the archives community.

      Some places to learn more:
      The Society of American Archivists
      The Association of Records Managers and Administrators
      The Council of State Archivists

    22. Re:some funny math by Crudely_Indecent · · Score: 1

      It is a relatively new product (6 months)

      We're registered as gov't contractors, but I'd never seen anything like this come across the wire (it is a big wire)

      Thanks for the references, I'll check those out!

      --


      "Lame" - Galaxar
    23. Re:some funny math by Anonymous Coward · · Score: 0

      Retrieval is a big concern, as is the media on which the data is stored. How many of us still have 8" disks, the drives to read them, the system, the OS, AND the knowledge to use the technology?

      This is the biggest problem with ever changing technologies (and probably one of the biggest props for open standards), in that it is impossible to keep all of those technologies available and accessible.

      Ironically this is also the reason that paper and writing have survived for so long. We can still look at cuneiform tablets and hieroglyphics on papyrus because they are low level technologies that are fairly stable (don't degenerate rapidly in fluctuating climates). Leave a manuscript on the dashboard of your car in the summer and it is likely to survive. A floppy disk or even a laptop will likely go tummy up.

      I feel for those archivists - no perfect solution for the volumes of data they are collecting.

    24. Re:some funny math by greg_barton · · Score: 1

      The volume of material is staggering, and goes beyond what NARA (or almost anyone else, for that matter) has traditionally dealt with.

      You're kidding, right?

    25. Re:some funny math by Anonymous Coward · · Score: 0

      I've got my palm greased right now.

    26. Re:some funny math by ckaminski · · Score: 1

      That, sir, earned you a place in my friends list.

    27. Re:some funny math by Alpha_Traveller · · Score: 1

      "It's the duty and responsibility of the National Archives to preserve both content and context, and to ensure that these e-mails remain accessible for however long the retention schedules call for (which, in the case of executive communication, is not an insignificant length of time)."

      Yes, but it'll still only fit on a single tape. ;D

      --
      "Love is like pi - natural, irrational, and very important." (Lisa Hoffman)
    28. Re:some funny math by Anonymous Coward · · Score: 0

      They're probably just going to sub it out and skim a percentage off it... your company might want to look into that.

    29. Re:some funny math by MankyD · · Score: 1

      You may still be in luck. Lockheed Martin largely touts itself as a Systems Integrator. Depending on the contract, they make few software products themselves. Instead, they turn to Commercial Off The Shelf (COTS) software for most of their solutions. Their main role is to make products from multiple vendors (archival systems, viewing systems, email systems, and other related technologies) work together. They work to sell a complete solution, top to bottom, composed of products from companies like yours.

      --
      -dave
      http://millionnumbers.com/ - own the number of your dreams
    30. Re:some funny math by crazyphilman · · Score: 1

      Cool! Thank you, and happy New Year!

      --
      Farewell! It's been a fine buncha years!
    31. Re:some funny math by hurfy · · Score: 1

      "When did lockheed martin get into the document management business?"

      Don't know. You should email the White House and ask.

    32. Re:some funny math by hurfy · · Score: 1

      Easy, take off the robots.txt file and let someone else do it ;)

      Ok, it tells me to wait but not HOW LONG to wait....la dee da de da........

    33. Re:some funny math by joto · · Score: 1
      This is the Bush administration we're talking about. They all use HTML mail with lots of attached graphics. On top of that, many messages get forwarded hundreds of times.

      Huh, HTML?

      This is the Bush administration we're talking about. They all send each other word documents (or perhaps a powerpoint presentation if they need to invade a country). HTML is for techies!

    34. Re:some funny math by KORfan · · Score: 1

      Actually, sometimes we get stuff like a pronouncement of Elephant Safety Week or a two-page memo telling us to cooperate with the investigators when they come to pick up records for an investigation of the Secretary of the Interior. They write the memo, print it, sign it, then scan it and send it to us as an image file. argh!

    35. Re:some funny math by icbkr · · Score: 1

      Use a magical intelligent compression system and it goes to zero. No intelligent content, nothing compressed.

    36. Re:some funny math by qvek · · Score: 0

      He's referring to the speech in which Bush was replying to the Internet rumors that there was going to be a draft for Iraq. He said "I hear there's rumors on the Internets". I'm not sure why he pluralized it.

  4. Plain Text by CWRUisTakingMyMoney · · Score: 5, Insightful

    What's to keep NARA from converting most electronic records to plain text? Surely most communications are only text themselves, so formats wouldn't be an issue there. For more complex files, OpenDocument is an option, or just any Open format. On the good side, this would make searching the archives fantastically efficient. NARA is already making some formerly paper records into electronic, searchable records. Imagine if everything were like that.

    --
    Those who anthropomorphize science and/or nature already believe in an intelligent designer.
    1. Re:Plain Text by Anonymous Coward · · Score: 0

      Not sure if I agree on OpenDocument (too much in the way of politics around it, thanks to a certain competitor), but text files are the way to go in my opinion as well. I have a bunch of old text files from the past 20 years made on numerous machines -- all of which are still readable, editable, and printable on today's machines.

    2. Re:Plain Text by Stan+Vassilev · · Score: 1

      You have to recognize that not only the format is prone to become obsolete, but the media too (as in: you can't play audio tape music in your CD-ROM :).

      Digital is great, but preserving it in time is hard: you need media that can last long, media reader that works with the modern equipment, file system format you can comprehend and reader software to display the documents to you.

    3. Re:Plain Text by elronxenu · · Score: 2, Informative
      Legally they're not allowed to convert the documents.

      IMHO, storing them on 8-track tape is a massive blunder. 8-track is already obsolete. What they should be doing is either keeping them all on spinning storage (with massive amounts of redundancy) or burn multiple redundant copies to DVD.

      Either way, they will have to deal with the problem of unreliable storage - it's easier to cope with if the problem can be automatically detected, and the data recovered from a backup and re-copied automatically. This should be possible with both DVDs and spinning storage. DVDs would need to be regularly loaded into the machine and read in their entirety. If a DVD shows errors, another copy of the DVD needs to be re-copied to replace the failing DVD.
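
      A minimal sketch of that detect-and-recopy loop, assuming each stored object has a checksum recorded at ingest and several redundant copies. The byte strings stand in for whole DVDs or disk files, and the function names are invented.

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def scrub(copies, expected):
    """Read every copy, replace any that fail the checksum with a good one.

    Returns (repaired_copies, number_of_bad_copies_found).
    Raises ValueError if no intact copy remains.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise ValueError("no intact copy left to recover from")
    bad = sum(1 for c in copies if checksum(c) != expected)
    repaired = [c if checksum(c) == expected else good for c in copies]
    return repaired, bad

original = b"archived message bytes"
expected = checksum(original)  # recorded when the data was first stored

# Three redundant copies, one of which has degraded.
copies = [original, b"archived mess\x00ge bytes", original]
repaired, bad = scrub(copies, expected)
print(bad, "bad copy replaced;", len(repaired), "copies verified")
```

      Run on a schedule, a loop like this turns silent media decay into a routine repair, as long as the scrub interval is shorter than the time it takes for every copy to fail.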

      I guess this is a good time to point out:

      • The difficulty to access these documents in 100 years is mostly a function of the tools used to create the documents in the first place, not of the archiving system itself.
      • Start using ODF format for word processor docs if you want to be able to read them in 100 years
      • Make them readily available to the public to ensure that the good stuff is copied over and over again.
    4. Re:Plain Text by Wildfire+Darkstar · · Score: 3, Informative
      What's to keep NARA from converting most electronic record to plain text?

      Potentially Armstrong v. Executive Office of the President. Format shifting is a fantastically tricky minefield to navigate. The aforementioned court case dealt specifically with the practice of printing e-mail communication and storing it as a paper record, but it speaks to the standard problems of conversion: you need to be entirely certain that you're not losing any information in the conversion process. This includes transmission information, metadata, and so on. Which isn't to say that plain text conversion can't be done in a lot of cases, but rather that it's something that needs to be undertaken very carefully.

      And while NARA has been embarking on some wonderful digitization projects, no paper-born records have been replaced by electronic conversions as of yet, for precisely the same reason. The electronic conversion augments the original paper record, but NARA still needs to maintain and preserve the paper record for as long as they have always been legally required to do so.
      --
      Sean Daugherty "I have walked in Eternity -- and Eternity weeps."
    5. Re:Plain Text by Detritus · · Score: 1

      I think you mean 9-track tape. 8-track tapes are used for car audio systems. 9-track tapes have been around for 40 years. They work, and if needed, it isn't that difficult to build new tape drives. How many other data formats have come and gone in that time period? Newer isn't always better.

      --
      Mea navis aericumbens anguillis abundat
    6. Re:Plain Text by elronxenu · · Score: 1
      Silly me. 9 tracks, including the parity bit.

      But I stand by my claim that they are obsolete. NASA is faced with a huge problem to recover the data off thousands of tapes written during the earliest space missions. After 40 years the oxide is flaking off the tapes and recovery is a delicate and dangerous process, often involving the destruction of the original tape.

      It remains to be seen how long current technologies like CD and DVD will last before degradation causes data loss. Some say hundreds of years, some say only a few years, and others claim that the media is so unreliable that it can be unreadable immediately after being written.

      This topic comes up every month or two on Slashdot. The best answer I know of to the problem of data degradation and technology obsolescence is to continually reprocess archived data, upgrading it to new storage media and new storage formats every few years.

      For example, I was able to recover all my old TRS-80 code from the early 1980s from their original diskettes, with some effort. It doesn't make sense, however, to retain these programs in their original TRS-80 format (i.e. tokenised BASIC) because the meaning of the tokens can be lost very easily. So I converted them all to straight ASCII. Now anybody can read the code. And I CVSed them all too, so the history of the code can be examined (this was particularly useful when I had several similar copies of the same program). Now I don't need to worry about how to read 20-year-old 5.25" diskettes anymore; I just have to keep the processed files available.

    7. Re:Plain Text by misfit815 · · Score: 1

      Here's a suggestion: Use virtual machines. One good modern system (excuse me, two identical systems, physically separate, in bomb shelters or old minuteman silos, or whatever) with VMware running Win9x, *nix, DOS, whatever, virtual machines. All the original emails in all their original formats, with all the original software. And they're on a VM that can be moved around, backed up, etc. You wanna see emails from Janet Reno's PC in 1996? Fine, here's a *copy* of that VM. Enjoy.

      --
      Jesus told him, "I am the way, the truth, and the life. No one can come to the Father except through me. - John 14:6 NLT
  5. OK by pHatidic · · Score: 1, Insightful

    So why don't they just use open source data formats? Is there something more complicated here that I'm not seeing?

    1. Re:OK by AvitarX · · Score: 2, Funny

      I think the problem is they are trying to store the records on 8-tracks.

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    2. Re:OK by Sinryc · · Score: 3, Funny

      Yea, there is something you're not seeing. The fact of the matter is they are talking about STORING the saved data. Not opening it.

      Good job on getting modded well. Anytime someone says "Open Source it" They get modded pretty well.

      Good job.

      --
      Yay, I have a sig.
    3. Re:OK by Hemlock+Stones · · Score: 1

      Yes you must be right. Except what is the use in storing data that is never opened? Read never data storage, a concept in need of a patent!

    4. Re:OK by twitter · · Score: 1
      I think the problem is they are trying to store the records on 8-tracks.

      They are. Quoting the article is such fun:

      for now, at least, the Archives uses electronic storing methods similar to those adopted in the 1960s and 1970s, transferring data onto magnetic tapes because that is the only format the archivists know will work indefinitely.

      I hope they are using GNU tar to hold that mess together.

      The bigger problem is translating proprietary formats for the ages while maintaining the original format as required by law and making them available to people without having to mail order a big tape.

      The Archives recently awarded Lockheed Martin Corp. a $308 million, six-year contract to work on creating a system for saving and accessing electronic data over time. Lockheed officials have recommended using a handful of widely accepted formats such as the popular Internet software language HTML to save information and using digital adaptors to translate that into a new language when it becomes obsolete.

      Better them than me. Hopefully it won't just link to Word docs and their "open" XML containers. Yuck.

      --

      Friends don't help friends install M$ junk.

    5. Re:OK by Rimbo · · Score: 1

      "Is there something more complicated here that I'm not seeing?"

      Massachusetts vs. Microsoft, q.v.

    6. Re:OK by 1u3hr · · Score: 1
      Read never data storage, a concept in need of a patent!

      You'd have to hope no one came across this prior art from Signetics back in 1972.

    7. Re:OK by Anonymous Coward · · Score: 0

      Not really sure how it would help the problem in hand. I don't think the issue is whether the format specification is KNOWN so much as whether it will be readily supported, in either hardware or software terms, in 30 years.

      I think part of this comes down to people's experience of the last 30 years of IT - backward compatibility has often been poor, and migration, especially of archived data, has been expensive. Let's just consider moving the same file without changing the format at all: I might have moved it from an 8" floppy to 5.25" to 3.5" to CD-R (that's an example - the real situation with archive data would be between a series of different backup tape and optical disc formats).

      The point being that as each hardware generation has become desupported there's been a need to physically migrate archived data between formats or machines. On the other hand, the price of storage has now fallen to the point where there's far less need for off-line / archival storage (except for back-up) - and it's certainly a lot easier to migrate data from a 40g external drive onto a new 250g one, than from CD-R to DVD.

      There's also the fact that commodity on-line storage is now cheaper than specialised archive retrieval systems (automated tape libraries, CD jukeboxes, etc) - companies are now transferring archive video onto on-line systems.

      There are of course some other issues in there I've missed off, like connections (RS232, parallel, USB) etc - if you had an 8" disk and an 8" drive, it would still be difficult to get data off it onto a modern PC. File system format is another, but less important.

      The second major issue is the software one - anyone who has been working with IT over a long time period, or whose job entails thinking in a timescale of 15 years rather than 15 months will have a realistic view of backward compatibility of data formats. Especially if you're then comparing that to an archive stretching back over hundreds of years.

      Is an Open Source format less liable to change, or open source software less likely to break backward compatibility? (Possibly, as there will be fewer reasons for business driven upgrades, but formats also die for technical reasons - i.e. will all image editing software always support TIFF?).

      It would certainly minimise some risks, with regard to software becoming unavailable, or changing computer platforms, but on the other hand, we all know how easy it is to open (if not modify and send back) Word documents in O/S software.

      I think data format will continue to remain an issue so long as software continues to advance, but I don't think it is out of the scope of the National Archives' job to ensure it has systems that can read all the data it possesses, even if commonly available software discards those formats - and O/S software can certainly help there.

      The other thing I think about is the way that projects like various computer emulators have brought back dead formats. At one point it was true to say that early 80s computer games were 'lost' unless you had working 80s hardware and cassette tapes - now most of those games have been retrieved from tape and you can run them under emulation (you can get emulators of most 8-bit machines that run under Java). It's becoming increasingly true of the 16 bit machines too.

      I'm becoming less convinced this is an issue -

  6. One Word: Google by Nova+Express · · Score: 5, Insightful
    Really, either Internal or External. Take out anything that might injure National Security, then turn the rest over for Google to index. Hell, send a copy of everything to Google, for that matter; they've got room. Keep a record of searches and visits to documents by codeword and frequency and build an index that way. Create a datasea, index it, and let citizens swim in it. As long as the e-mail is in at least a remotely standard format, what's the problem?

    (Note: Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...)

    --
    Lawrence Person (lawrencepersonh@gmailh.com (remove all "h"s to mail)

    http://www.lawrenceperson.com/

    1. Re:One Word: Google by AndroidCat · · Score: 2, Funny

      Stick it all on one box, then install p2p software. Name all the files to song titles and it'll spread even faster. (Of course, the RIAA might go after John W. Doe...)

      --
      One line blog. I hear that they're called Twitters now.
    2. Re:One Word: Google by sckeener · · Score: 1

      It has been said previously, but metadata

      I don't think Google is indexing metadata, and wouldn't it be just sneaky to have a plain "Hello Jane" type email carry a secret message in the metadata? Everything has to be kept and indexed.

      --
      "Only one thing, is impossible for god: to find any sense in any copyright law on the planet." Mark Twain
    3. Re:One Word: Google by AmberBlackCat · · Score: 1

      As long as we’re being cheerleaders for Google and Open Source, I think Copernic does a good job of indexing email. I don’t have any trouble finding old mail or anything else in the context I’m looking for. Maybe the NARA guys could develop a large-scale version of that and add some mechanism to associate certain messages with certain events or situations.

    4. Re:One Word: Google by milimetric · · Score: 1

      (Note: Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...)

      You have achieved true enlightenment. Go forth my friend and enjoy nirvana.

  7. Technology explodes by Architect_sasyr · · Score: 3, Funny

    Well, if the technology that uses the emails is exploding, surely the software/systems that archive the software are too.

    A couple of BSD boxes with some Oracle or similar should do it.

    --
    Me failed English...
    FreeBSD over Linux. If my comments seem odd, this may explain...
    1. Re:Technology explodes by Anonymous Coward · · Score: 0

      I can just see someone talking about blowing up the national archives...
      parent has a point though. Just take some XML and load it all into a DB. Simple solution to a complex problem

      Prove me wrong!

    2. Re:Technology explodes by ottothecow · · Score: 1

      Logistically it would make sense to feed everything into a single type of database (it's OK to have separate ones for different things to keep the size down and the performance up, as long as they are all the same kind). Database software gets updated and makes it easy to update the database to the new version. Even if Oracle goes out of business, you can bet that every company that continues will have a function to convert from an Oracle database to grab customers. As long as they keep the database fairly modern, the actual text stored inside of it will always be accessible, and I don't think the idea of a database is going to go away anytime soon.

      --
      Bottles.
  8. Sounds like a business opportunity by Alcimedes · · Score: 3, Interesting

    Really, rather than talking about how horrid it is, why not be busy working on software and hardware solutions that will bring old document types up to today's standards, and devices that will pull data off of old drives?

    I'm sure a universal data conversion tool would be worth a pile of money.

    1. Re:Sounds like a business opportunity by dangitman · · Score: 1
      Really, rather than talking about how horrid it is, why not be busy working on software and hardware solutions that will bring old document types up to today's standards, and devices that will pull data off of old drives?

      Sounds more like a governance opportunity to me. The National Archives could spearhead the push to develop sophisticated open standards (OpenDocument doesn't satisfy all archival purposes) that all of government, and the public, could use.

      Of course, we are living in Bush-World(tm) - so any constructive and useful action by public employees is considered treason. Especially if it concerns "preserving history" - which the Republicans are deathly afraid of. Librarians are people who should be spied on, as are readers of literature. The Diebold scandal with their voting machines is the perfect example of how accurate information and recording history is opposed by the current administration.

      --
      ... and then they built the supercollider.
  9. XML? by DamienMcKenna · · Score: 1

    Sounds like a job for everyone's favorite do-everything markup language, XML! Seriously, why isn't it used to structure everything?

    1. Re:XML? by kennygraham · · Score: 1

      As much as I love XML, if you get complicated with it, the parsing is horribly slow.
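
      One common way to soften the parsing cost for large documents is to stream them instead of loading the whole tree. The following is only a minimal sketch using Python's standard library; the `<messages>`/`<message>`/`<subject>` layout and the sample data are assumptions made up for illustration, not any real archive format.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: a tiny stand-in for a very large XML export of messages.
SAMPLE = b"""<messages>
  <message id="1"><subject>Budget</subject></message>
  <message id="2"><subject>Schedule</subject></message>
  <message id="3"><subject>Budget follow-up</subject></message>
</messages>"""

def count_subjects_containing(stream, needle):
    """Stream through <message> elements and count matching subjects."""
    matches = 0
    # iterparse yields each element once its closing tag has been read,
    # so elements can be processed and discarded one at a time.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "message":
            subject = elem.findtext("subject", default="")
            if needle.lower() in subject.lower():
                matches += 1
            elem.clear()  # drop this element's children to keep memory low
    return matches

budget_count = count_subjects_containing(io.BytesIO(SAMPLE), "budget")
print(budget_count)
```

      Streaming does not make XML cheap, it just keeps memory use roughly flat as the file grows.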

    2. Re:XML? by grcumb · · Score: 2, Insightful

      "Sounds like a job for everyone's favorite do-everything markup language, XML! Seriously, why isn't it used to structure everything?"

      Because it's not the right tool for every job. XML is explicitly a data interchange format. I've worked with material like this in the past, and I can tell you from experience that processing large volumes of XML (or any text-based markup format, for that matter) is extremely expensive in terms of processor and memory resource usage.

      That said, I agree that in this case XML-formatted plain text is the right format, specifically because it is very suitable as a data interchange format. When one is archiving large volumes of data for indeterminate periods of time (possibly decades), then it's worth the extra pain to maintain the source in the most flexible format.

      I do not want to suggest, though, that this is the best format for accessing or processing the data. I'd suggest a source repository where text data is fielded with the proper metadata which can be updated periodically if necessary. Data can then be drawn from there and stored in a more accessible (e.g. database) format and that data store can be accessed by researchers, lawyers and lawmakers, etc. This has the double benefit of keeping the source material safe because we're not interacting with it constantly and making it accessible in the most appropriate technology of the day.

      As someone has already stated, this is not exactly rocket science. It does require a certain simplicity and elegance of design, so I have very little hope that it will be implemented as I've described. 8^)
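
      The split described above (an XML source repository that is rarely touched, plus a derived database that people actually query) might be sketched roughly like this. The record layout, element names and addresses are hypothetical, and SQLite stands in for whatever database is current at the time.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical archival source; the element names are assumptions.
SOURCE_XML = """<archive>
  <record id="a1">
    <meta><sender>alice@example.gov</sender><date>2005-12-01</date></meta>
    <body>Meeting moved to Tuesday.</body>
  </record>
  <record id="a2">
    <meta><sender>bob@example.gov</sender><date>2005-12-02</date></meta>
    <body>Draft memo attached.</body>
  </record>
</archive>"""

def build_access_store(xml_text):
    """Load the XML source into an in-memory SQLite database for querying."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE records (id TEXT PRIMARY KEY, sender TEXT, date TEXT, body TEXT)"
    )
    for rec in ET.fromstring(xml_text).iter("record"):
        db.execute(
            "INSERT INTO records VALUES (?, ?, ?, ?)",
            (rec.get("id"), rec.findtext("meta/sender"),
             rec.findtext("meta/date"), rec.findtext("body")),
        )
    db.commit()
    return db

# The derived store can be thrown away and rebuilt from the source at any time.
db = build_access_store(SOURCE_XML)
rows = db.execute(
    "SELECT id FROM records WHERE sender = ? ORDER BY date", ("bob@example.gov",)
).fetchall()
print(rows)
```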

      --
      Crumb's Corollary: Never bring a knife to a bun fight.
    3. Re:XML? by Geoffreyerffoeg · · Score: 1
      XML! Seriously, why isn't it used to structure everything?
      <?xml version="1.0"?>
        <bitmap>
          <title>Pathological Example</title>
          <format colors="color" bpp="24" />
          <generator>
            <software:software>
              <software:title>Slashdot XML Paint</software:title>
              <software:version>1.0</software:version>
            </software:software>
          </generator>
          <transparency>
            <pixel>
              <component color="red" value="255" />
              <component color="green" value="255" />
              <component color="blue" value="255" />
            </pixel>
          </transparency>
          <row>
            <column>
              <pixel>
                <component color="red" value="24" />
                <component color="green" value="73" />
                <component color="blue" value="65" />
              </pixel>
              <pixel>
                <component color="red" value="192" />
                <component color="green" value="168" />
                <component color="blue" value="16" />
              </pixel>
              <pixel space="CMYK">
                <component color="cyan" value="2" />
                <component color="magenta" value="8" />
                <component color="yellow" value="18" />
                <component color="black" value="36" />
              </pixel>
              <pixel>
                etc.
      That's why.
    4. Re:XML? by kennygraham · · Score: 1

      You forgot to define your software namespace, you insensitive clod!

    5. Re:XML? by phaggood · · Score: 1

      parsing can be slow for large docs

      ..for those not executing this job on a handy desktop Beowulf cluster running XSLT scripts in parallel.


      What?

    6. Re:XML? by wootest · · Score: 1

      That's like saying "Sounds like a job for everyone's favorite medium, paper!" (Or I suppose one could even argue that XML is more like wood pulp than paper in this comparison.)

      XML allows for the quick creation of data formats, but it doesn't magically make these data formats popular or parsable by actual programs - that's still a real issue. And even when they settle on an internal format, there's the question of getting existing data into that format, or exporting back into popular formats. It's not as easy as just saying "XML", even if XML makes the process of shaping a custom data format and parsing it easier, and allows for XSL-like transformations.

    7. Re:XML? by sxpert · · Score: 1

      and I can tell you from experience that processing large volumes of XML (or any text-based markup format, for that matter) is extremely expensive in terms of processor and memory resource usage

      neither of which is really a problem
      throw an unused ascii-something at it and you should be fine

  10. MS Office by sycodon · · Score: 0

    If everything is in MS Office, it's guaranteed to be inaccessible after just two upgrades.

    --
    When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
  11. Dark age for information by stontu · · Score: 0
    We have read about it in a previous Slashdot article called Dark age of information.

    The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"

    Plus the director should be called dorkrector.
  12. iPod Mod... by __aaclcg7560 · · Score: 3, Funny

    The article mentions playing eight-track tapes on an iPod. Does anyone have the link to that ultimate retro mod? Does it come with a Saturday Night Live dance cover?

    1. Re:iPod Mod... by misanthrope101 · · Score: 1
      The article mentions playing eight-track tapes on an iPod. Does anyone have the link to that ultimate retro mod?
      One exists, but it is somewhat esoteric. I've heard it called "line-in," but I think that's just a buzzword. It's probably vaporware, but you never know. Send me your credit card number just in case, and I'll be on the lookout.
  13. Here's a thought.. by Khoa · · Score: 1, Redundant

    Let Google handle it?

    1. Re:Here's a thought.. by maelstrom · · Score: 0, Offtopic

      Here's a thought, stfu.

      --
      The more you know, the less you understand.
    2. Re:Here's a thought.. by Anonymous Coward · · Score: 0

      If it is going to be available via the internet, there is no doubt Google will pick it up and archive it in many different locations. :-)

  14. Snow Removal. by Anonymous Coward · · Score: 0

    "We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive."

    Those are called LUGs

  15. That's not a real measure of data storage by Ithika · · Score: 1

    I think we deserve to be told how many Library of Congresses that takes up!

    1. Re:That's not a real measure of data storage by dangitman · · Score: 1
      I think we deserve to be told how many Library of Congresses that takes up!

      24 football-fields wide and a couple of Grand Canyons deep.

      --
      ... and then they built the supercollider.
  16. Lockheed's proposal: by rodentia · · Score: 1


    Lockheed officials have recommended using a handful of widely accepted formats such as the popular Internet software language HTML. . .

    Those responsible have been sacked.

    --
    illegitimii non ingravare
    1. Re:Lockheed's proposal: by radiotyler · · Score: 1

      Lockheed officials have recommended using a handful of widely accepted formats such as the popular Internet software language HTML. . .

      Those responsible have been sacked.


      Those responsible for the sacking have been sacked.

      --
      hi mom!
    2. Re:Lockheed's proposal: by Anonymous Coward · · Score: 0

      The sentence you quoted was no doubt written by a journalist with limited technical knowledge. Do not read much into it. Trust me.

    3. Re:Lockheed's proposal: by gmcgath · · Score: 1

      HTML is one of the worst formats for long-term archiving. Hopefully all it means is that it's the only "software language", er, format, the reporter knows by name.

  17. Better make two tapes by tlynch001 · · Score: 0

    Better make two tapes in case Sandy Berger sneaks off with one.

    1. Re:Better make two tapes by Anonymous Coward · · Score: 0

      Better store one off site in case Bush burns the library down for leaking his latest lame joke to the terrorists.

  18. I'd love to read those emails... by rampant+mac · · Score: 4, Funny
    "The National Archives [...] will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years"

    I'd love to read those emails, seeing as how we've gone from:

    From: bclinton@whitehouse.gov
    To: hclinton@whitehouse.giv
    CC: agore@whitehouse.gov; tgore@whitehouse.gov; monica04329@yahoo.com; ltripp@weightwatchers.com;
    Subject: omglol, you got to get me some of these!

    I want these for Christmas! http://www.big-fat-cigars.com/



    To something along the lines of:

    From: gbushjr@whitehouse.gov
    To: dickc@whitehouse.giv
    CC: crice@whitehouse.gov; jbush@whitehouse.gov; lbush@whitehouse.gov; urnotapuppet@gmail.com; osamab@msn.com; cpowell@hotmail.com;
    Subject: Are they for real? Can we attack them too?

    Subject sayz it all, any toughts Dick? I think we can git `em.

    > DYKE BOURDER OIL SERVIES
    > OFFER FOR SALE OF NIGERIAN CRUDE OIL
    >
    > Dear Sir,
    >
    > I am President of blah blah blah...

    --
    I like big butts and I cannot lie.
    1. Re:I'd love to read those emails... by iphayd · · Score: 1

      You know Osama's email address. You must be a terrorist. Please bend over for a Homeland Security "investigation." An agent will arrive shortly.

  19. The obvious solution by the+eric+conspiracy · · Score: 2, Funny


    rm -rf /

  20. Not to worry, it's the Bush White House by Anonymous Coward · · Score: 0

    So they'll block all email retention, claiming, uh, national security or something.

  21. Google Search Appliance by TubeSteak · · Score: 2, Informative
    Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...
    Except when you're right

    The Google Search Appliance
    http://www.google.com/enterprise/gsa
    What it does

    The Google Search Appliance makes the sea of lost data on your web servers, file systems and relational databases instantly available with one mouse click. Just point it toward your content, add a search box to your site, and in a matter of hours, your users will be able to search through more than 220 different file formats in any language. The Google Search Appliance indexes up to 15 million documents, and its security features ensure that users only see the documents to which they have proper access.

    How it works

    The Google Search Appliance crawls your content and creates a master index of documents that's ready for instant retrieval using Google's search technology whenever a customer or employee types in a search query. The Google Search Appliance is easy to set up and requires minimal ongoing administration, making it extremely cost-effective. The Google Search Appliance starts at $30,000 to search up to 500,000 documents.
    FAQs

    Though it isn't really ontopic, Google search appliances are vulnerable to various exploits & Google does provide patches.
    --
    [Fuck Beta]
    o0t!
    1. Re:Google Search Appliance by Deltaspectre · · Score: 0

      From the site:

      The Google Search Appliance is designed for the needs of large enterprises and government agencies and can support up to 15 million documents.

      Uhoh! :D
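
      For scale, a back-of-the-envelope count, assuming every email is indexed as one document and the corpus could simply be split across boxes (neither of which the product page promises):

```python
import math

EMAILS = 100_000_000             # rough figure from the story summary
DOCS_PER_APPLIANCE = 15_000_000  # per-appliance ceiling quoted above

# Round up, since a partly filled appliance is still a whole appliance.
appliances_needed = math.ceil(EMAILS / DOCS_PER_APPLIANCE)
print(appliances_needed)
```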

      --
      My UID is prime... is yours?
  22. Why store all of this by Stan+Vassilev · · Score: 1

    We've all had our "I gotta keep everything I do, download, see or hear in my records" moments, and sometimes they may last for years before we realize we don't need 99% of it anyway and will never never use it.

    Information is infinite, there's no ends to the amount of information anyone of us can produce. Storing everything is old school, new school recognizes that fact and stores only important information.

    What the government needs is to prioritize and save only the important stuff. Official bills and memos are worth saving, the president asking his secretary for a cup of coffee isn't.

    1. Re:Why store all of this by YrWrstNtmr · · Score: 3, Insightful
      What the government needs is to prioritize and save only the important stuff. Official bills and memos are worth saving, the president asking his secretary for a cup of coffee isn't.

      Often, you don't know what's important until long after the fact. Storage space is so cheap and easy, it doesn't make sense to try to filter as it's happening. Inevitably, something important/crucial/worldchanging would get lost, resulting in cries of government censorship.

      And I'd say for a presidency...ALL of it is crucial.

      Random conversations, recorded by the secretary, then 'erased', have already caused one president to resign. What was in that erased 18 minutes? The NARA may actually find out.

    2. Re:Why store all of this by Wildfire+Darkstar · · Score: 1

      The government doesn't save everything forever. All records created by the federal government have their own retention schedules, which can range from a few weeks to forever. There are dozens of potential reasons for needing to access any given record, though, and they aren't all as obvious as one would think. An e-mail from the president asking for a cup of coffee might well have some value to a historian or biographer. Personal communications might have potential legal repercussions down the line, for whatever reason. Given the importance of the president, presidential communication gets treated even more gingerly than everything else (a lot of this probably will be maintained in the long-term for historical value, if nothing else). But NARA absolutely does not "store everything."

      --
      Sean Daugherty "I have walked in Eternity -- and Eternity weeps."
    3. Re:Why store all of this by TubeSteak · · Score: 1

      Unimportant information is not important... until it is.

      This is why we fund research into the saliva of rare frogs, or we study ancient cultures. The information may not be important now but that can change.

      I'm sure the Library of Congress has lots of 'unimportant' stuff in it, but (like the National Archives) they save everything they can

      --
      [Fuck Beta]
      o0t!
    4. Re:Why store all of this by dangitman · · Score: 2, Insightful
      What the government needs is to prioritize and save only the important stuff. Official bills and memos are worth saving, the president asking his secretary for a cup of coffee isn't.

      That is an absolutely insane idea for government policy. We shouldn't decide what's important for the future - the future history writers decide that for us. Who is it that decides what is important? The public owns the government, and has the right to retain everything it does. Not storing evidence would mean that today's criminals in government will escape future punishment or disrepute, and current heroes of government will not receive their dues or recognition.

      Make no mistake, some of the most insignificant things in past peoples' lives, have provided the most significant insights into humanity when later discovered by historians, anthropologists or archaeologists. It's what we consider "trash" today that will tell our story to future generations. When that trashball heads back to Earth, you wanna make goddamn sure you wear noseplugs and know how to make 20th Century trash.

      --
      ... and then they built the supercollider.
    5. Re:Why store all of this by Anonymous Coward · · Score: 0

      Yep. We don't have records of every conversation our founding fathers had when they bumped into each other in the bathroom to take a leak, and we've managed to do just fine. We have a few important things they did, a constitution they drafted, some letters and such. I'm hardly impressed by our current chief executive's pronouncements on his best days, when he's being propped up by ivy grad speechwriters. I certainly hope we're not thinking posterity will care to rifle through the burps and outgassings in between.

      I think the National Archivist needs a counterpart: the National Janitor. It shall be the National Janitor's job to throw shit away.

      I'm not kidding.

    6. Re:Why store all of this by Benwick · · Score: 2, Insightful

      Speaking as a trained archaeologist (and I'm not just saying that for effect), it would definitely be wrong to filter out the "unimportant" who-got-coffee when, because it makes a false judgment about what sort of information will be of interest to scholars of the future. There are all kinds of weird correlations possible, too -- "Presidential Coffee Breaks and the History of Global Commerce in the Post-Lewinsky Era," etc. One might want to study what lower-level White House bureaucrats did, too -- who knows. It's all primary source material.

      If all of this sounds boring to you, that's why you're not an Archaeologist. Of course, neither am I. But I did study it.

    7. Re:Why store all of this by Stan+Vassilev · · Score: 1

      "Not storing evidence would mean that today's criminals in government will escape future punishment or disrepute, and current heroes of government will not receive their dues or recognition."

      With the few replies supporting the same point of view as yours, I tend to agree.

      HOWEVER, I ask: honestly, do you think corrupt politicians freely use logged media to exchange ideas for stealing taxes/money from corrupt businesses?

    8. Re:Why store all of this by dangitman · · Score: 1
      HOWEVER, I ask: honestly, do you think corrupt politicians freely use logged media to exchange ideas for stealing taxes/money from corrupt businesses?

      No I don't, I think they are careful, and usually maintain several layers of coverup. However, they usually slip up somewhere (or an underling does). And they WILL communicate over logged media, because they need to give some sense of legitimacy. It will look funny if they have no logged transcripts during their years in office. And what they might think "unimportant" and let slip on a logged medium might be decrypted in future to lead to the smoking gun. Heck, this happens even in front of TV cameras when the satan-worshipping pedophiles let their drunken mouths get away from them, and accidentally reveal their kitten-slaughtering plans.

      --
      ... and then they built the supercollider.
    9. Re:Why store all of this by Anonymous Coward · · Score: 0

      Not storing evidence would mean that today's criminals in government will escape future punishment or disrepute

      That is insane. By that logic, we should live in a pure Orwellian state, where everybody's activities are constantly monitored, recorded, and archived. It goes against every principle of living in a free state. Innocent until proven guilty. Right to privacy. On and on and on.

      I suggest that if you truly believe what you are saying, you should (1) start studying American history and civics pronto; and (2), if that doesn't work, go try living in a police state for a while to see how you like it. I'd suggest North Korea.

      Ignorance such as yours is exactly why America could very well end up a totalitarian state, 180 degrees opposed to its founding principles.

    10. Re:Why store all of this by Frumious+Wombat · · Score: 1

      You'll remember that one of the issues that brought Newt distress was that he was conversing freely over a cellphone, and didn't think that anyone was listening. There's also the old Chicago saying, "First guy on the bus gets the best seat"; i.e. when someone's going down, you never know what they've been saving and would be willing to reveal in return for better treatment.

      Both LBJ and RMN taped their own conversations, which turned around to bite them later. Never underestimate hubris on the part of the powerful.

      --
      the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken
  23. Format obsolescence by StikyPad · · Score: 4, Insightful

    There's no reason to keep 286s around to read WordStar documents. Just because formats are updated and revised doesn't mean the data needs to be stored as such. Save the text as ASCII, and the images as png or another lossless format. In the unlikely event that png is updated in a way that isn't backward compatible, convert the old files over to the newer format. Every few years, copy the data from old media to newer media. If done regularly (rather than, say, waiting until there are 500,000 floppies to make the leap to DVD-R), it won't be much of a chore. Sure it's a headache, but that's why they call it work.

    1. Re:Format obsolescence by HermanAB · · Score: 1

      Yes, but that would be logical and far too simple... You need to understand the archivist's golden rule: Never erase, delete, modify or shred anything.

      --
      Oh well, what the hell...
    2. Re:Format obsolescence by Anonymous Coward · · Score: 0

      Actually what the grandparent suggested is the common practice for law firms currently. All electronic documentation is converted to plain text and TIF images. Then a searchable database is built based off those files. Because of the plain text you can search for anything you might need to find and it shows you the related documents. Links direct to the images are standard and some of the database solutions even have the image viewers built in.

      Sure I'd love them to be something other than B/W tifs, but that's all the current software the law firms have can handle. But since it's good enough for the courts, I would figure it'd be fine for this problem.

      And yes there is an entire industry based around this concept already.
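
      The "plain text plus images, with a searchable database on top" setup described above boils down to an index from words to document IDs. A toy sketch, with made-up document IDs and text, and none of the ranking or image linking a real product would have:

```python
import re
from collections import defaultdict

# Hypothetical extracted text keyed by document ID; a real system would also
# store the path to each document's page images alongside the ID.
DOCS = {
    "doc-001": "Memo regarding the budget hearing on Tuesday.",
    "doc-002": "Travel schedule for the hearing.",
    "doc-003": "Budget tables, revised.",
}

def build_index(docs):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(DOCS)
print(sorted(search(index, "budget hearing")))
```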

  24. Internet Archive by arrrrg · · Score: 2, Interesting

    If the Internet Archive can back up the entire internet every few months, I would think the National Archive could handle a few hundred million emails.

    1. Re:Internet Archive by fiji · · Score: 2, Informative

      For some value of entire.

      TIA is pretty damn impressive, but they certainly don't get all of it.

      1: There is more to the internet than the web
      2: They don't do a lot of dynamic pages... so a lot of forums will probably be ignored (not that that necessarily loses anything useful ;-)
      3: They only get images if you request it
      4: Sites can request that they not be spidered (robots.txt)
      etc.
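
      Point 4 can be illustrated with Python's standard robots.txt parser. The rules below are invented, and the crawler name "ia_archiver" is used only as an example user agent; this is not a statement of how any particular archive's crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler entirely,
# and keep everyone else out of /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

archiver_allowed = parser.can_fetch("ia_archiver", "http://example.com/index.html")
other_allowed = parser.can_fetch("SomeOtherBot", "http://example.com/index.html")
other_private = parser.can_fetch("SomeOtherBot", "http://example.com/private/x.html")
print(archiver_allowed, other_allowed, other_private)
```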

      -ben

    2. Re:Internet Archive by shadowbearer · · Score: 1

      Exactly.

        The best "internet backup" is all the stuff that we rat-packers save and someday recall again... :)

        I recently reconstructed a vanquished web page thru TIA, local saved pages, and various googled caches. It was rather an enlightening experience. One application I can see for the future of the internet is distributed user archive programs such as the TIA is, but with many, many more machines. Google is really kind of a baby step towards the infrastructure needed to have a collective database of human info.

      SB

      --
      It's old. The more humans I meet, the more I like my cats. At least they are honest.
  25. ASCII Text by Spazmania · · Score: 2, Insightful

    electronic documents created today may not be legible on tomorrow's devices

    ASCII text has been around for decades and oh by the way Internet-formatted email is 100% representable as ASCII text since that's how it's still transferred today.

    This supposed problem is a real problem only for those with Exchange, Domino or Groupwise, which create email in custom, internal formats.

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    1. Re:ASCII Text by renoX · · Score: 1

      I wonder who is stupid enough to have moderated the parent as insightful?

      What if the email contains an attachment which is in a format that you can't read?
      Sure it is encoded as ASCII but it doesn't help..

    2. Re:ASCII Text by SalsaShark42 · · Score: 1
      That's only the message body--it doesn't account for the attachments. But with all the virtualization technologies out there, I don't see why someone can't build a "Year 2005" VMware image that contains all of the native applications and just pull it up on demand.

      The proprietary formats of groupware systems will become less of an issue with time. Domino already supports full-fidelity XML exports (minus encryption and digital signatures, but that's a separate topic).

      What most of the messaging compliance vendors are doing now is saving the messages in native formats (maintaining "best evidence" rules) but also keeping an RFC-822 shadow copy to support litigation discovery/review...

    3. Re:ASCII Text by Halfbaked+Plan · · Score: 1

      Well, obviously, you have your secretary print out the ASCII representation of the attachment. She puts the printed sheets in an interoffice envelope and it goes to the typing pool, where it's retyped and then re-entered on punched cards by the keypunch operator. The card deck goes to the computer operator who loads the deck and runs the job to convert it and print it out. In the end, a greenbar representation of whatever the attachment in question was is brought to your office by clerks using a skid jack to move the pallet.

      --
      resigned
    4. Re:ASCII Text by Spazmania · · Score: 1

      What if the email is encrypted? What if the author wrote using code-words that only the recipient could understand? Answer: Don't worry about it. You take the common word-processor formats and store a second copy as ASCII text. You take the common spreadsheet formats and store a copy as ASCII .csv files. Pick formats for graphics, audio and video where the primary criterion is current ubiquity and store second copies of those too. Everything more obscure you simply store as is and don't sweat it. Tomorrow's computer experts will surely include folks competent at reverse-engineering as long as we can get them a raw copy of the data on a medium they can read. Let them decide what's important enough to bother with.

      And speaking of the storage media, don't overthink it. It's impractical to come up with some medium that can last 200 years. Worse, it's not cost-effective. So, pick a medium that can last 10 years with 5-nines reliability and plan on someone copying to a new medium 10 years hence. How? COTS. Go out and buy some IDE hard drives. They're ubiquitous, cheap, and survive offline storage relatively well. They've been around for 15 years now and there's every reason to expect that computers 10 years from now will still be able to access IDE hard drives even if some other technology has overtaken them.

      The 20 meg MFM drives of 1985 were obsolete in 1995, but you could still get the equipment to read them. The 500 meg IDE drives of 1995 are obsolete now, but even if your PC doesn't have PATA ports (most still do) you can get a $20 USB converter and read the old drive that way. As long as you plan on forward-converting the data every 10 years the media choice is simple: pick whatever is currently ubiquitous.

      Even with substantial data recovery blocks (a la par2) you're still looking at a media cost of less than $1000 per terabyte for 10 years of storage with the expectation that in ten years a terabyte will only cost $100 for the following 10 years.

      Seriously, this is not a hard problem. Look at past situations where data has become unrecoverable. They share some common characteristics:

      1. Recovery was attempted too long after the media became obsolete, typically 20 years or more. Nobody is complaining about lost NASA data from the 1990s. It's the 1970s data they can't retrieve. Plan on forward-converting it all every 10 years and you won't lose it.

      2. The storage medium was some obscure magnetic tape format that never saw wide deployment so that it became unavailable at exactly the same time it became obsolete. That continues to be true of essentially all magnetic tape formats -- try recovering data from a DDS2 tape from just 10 years ago. Good luck finding a drive that matches the format used by the original. With the infamous '70s NASA data, for example, the chief problem isn't decoding the data (a good hacker could likely figure it out) but rather retrieving the data from tape. On the flip side, you still have folks around who fiddle with Commodore 64 data from 1985 -- widely available COTS equipment will win the longevity race every time.

      3. No one left documentation (or the documentation did not survive) describing the data structures used to store the data on the media. So, make sure to store it in formats that are very well known such as RFC2822, ascii, and csv. Perhaps in a well documented FAT filesystem using open source tar, gzip and par2 (and store copies of the source code for that software with the data). These formats will be trivially understood 10 years from now, even if they're not then in common use.
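      The forward-conversion step above can be sketched as a checksum manifest that travels with the data: build it on the old media, then check the new copy against it. This is only an illustration, not anyone's actual archive tooling; the function names build_manifest and verify_copy are made up for the example, and unlike par2 it only detects damage, it cannot repair it.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as fh:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def verify_copy(old_manifest, new_root):
    """Return the relative paths that are missing or changed on the new media."""
    new_manifest = build_manifest(new_root)
    return sorted(
        name
        for name, digest in old_manifest.items()
        if new_manifest.get(name) != digest
    )
```

      An empty list from verify_copy means every file recorded on the old media arrived intact; anything listed needs to be re-copied or repaired from the recovery blocks before the old media is retired.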

      Seriously, it's just not that hard.

      --
      Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    5. Re:ASCII Text by 1u3hr · · Score: 1
      Internet-formatted email is 100% representable as ascii text

      I often get two paragraph messages from people as a Word doc attachment. And yes, it actually is sent as a Base-64 encoded (ASCII) segment of the message, but I don't think that's very helpful. Personally I filter most of my email to plain text, so messages that weighed in at 50k or more come down to 1k. But an archivist doesn't have the freedom to simplify the format.

    6. Re:ASCII Text by Spazmania · · Score: 1

      an archivist doesn't have the freedom to simplify the format.

      Sure he does. He shouldn't discard the original, but he has complete freedom to store an additional copy or copies in simplified and standardized formats. And given your comments on the relative size of the alternate versions, such additions would be relatively cheap.

      --
      Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    7. Re:ASCII Text by 1u3hr · · Score: 1
      Sure he does. He shouldn't discard the original, but he has complete freedom to store an additional copy

      Storing an additional version isn't "simplification". I was talking about "instead of", which is what I do for my own archives.

    8. Re:ASCII Text by Spazmania · · Score: 1

      Storing an additional version isn't "simplification".

      Of course it is. You're giving the eventual retriever an easily read version he can get to conveniently as well as the original that he can dig in to if needed. Unless you can guarantee that the converted version is perfectly equivalent to the original (and you usually can't) you have to save the original anyway.

      What you do for your own personal archives is, of course, a very different question. You may well find that the space savings justifies the loss of the original. While that may be appropriate for you personally, it's not reasonable for an organization charged with preserving the country's history.

      --
      Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    9. Re:ASCII Text by LWATCDR · · Score: 1

      Yep except what about other documents? Things like CAD drawings, spreadsheets, and documents stored in... WORD!.
      This is why the government shouldn't allow the use of proprietary formats for any internal documents. Of course a lot of things sent in email would have never been documented before because they would have just been conversations. Things like what do you want to do for lunch today and how about them Redskins... Man if political correctness keeps going the way that it has been in 100 years some poor Washington football fan will go down in history as racist.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    10. Re:ASCII Text by Spazmania · · Score: 1

      Yep except what about other documents?

      For easy stuff like word processor and spreadsheet documents, simply store a second copy that has been downconverted to ascii. For more obscure stuff (like CAD drawings) don't worry about it. An archivist can't possibly deal with every obscure data format out there, and clever hackers that can reverse engineer a format are relatively easy to come by should the contents of the document ever be desired. All you have to do is get them the bits and bytes on a media that they can handle.

      --
      Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    11. Re:ASCII Text by LWATCDR · · Score: 1

      "For more obscure stuff (like CAD drawings) don't worry about it. "
      Oh yea stuff like CAD drawings are so much less important than text. I mean a memo about what flowers need to be planted on the White House lawn is so much more important than the blueprints of the CIA office building.
      Currently there are people restoring WWII aircraft for museums using the original blueprints.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    12. Re:ASCII Text by 1u3hr · · Score: 1
      Storing an additional version isn't "simplification".
      Of course it is

      Of course it's not. It's easier to use, which is an entirely different thing. Windows is "simpler" to use than DOS, but is not a simplification, it's millions of lines of code compared to thousands.

    13. Re:ASCII Text by Spazmania · · Score: 1

      Just how many CAD drawings do you think are flying around in the White House email?

      --
      Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    14. Re:ASCII Text by Spazmania · · Score: 1

      Of course [storing a second converted copy is] not [simplification]. It's easier to use, which is an entirely different thing.

      You want to nitpick semantics? The point is it achieves the goal of improving the chance that future researchers can retrieve the data. I call that simplification. You can call it whatever floats your boat.

      --
      Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    15. Re:ASCII Text by Anonymous Coward · · Score: 0
      You want to nitpick semantics?

      You keep picking at mine, then go ahead to make exactly the same points I did. And I'm done here.

    16. Re:ASCII Text by LWATCDR · · Score: 1

      I was speaking about the National Archive as a whole not just about what the White House produces.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
  26. Your math is wrong by Anonymous Coward · · Score: 1, Interesting

    You fail to take into account html email, attachments, large email threads where everyone replies to all (very common in a large organization).
    The average email is 500,000 bytes in size (500K).

    100,000,000 * 500KB = ~50,000GB = ~50 Terabytes of information

    That's a lot of data even if you store it compressed.

    You'd need 1250 DLT tapes or 250 LTO1 tapes or 125 LTO3 tapes to back up that data.

    Compressing that data with Bzip 2 would take:
    0.625 * 50,000,000 = 31250000 seconds = ~520833 minutes = ~8680 hours = 361 days = ~1 year
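    For anyone who wants to rerun the numbers above with a different average size, here is the same arithmetic as a short script. The per-tape capacities (40 GB DLT, 200 GB LTO1, 400 GB LTO3) and the 0.625 seconds per MB bzip2 rate are not stated in the post; they are assumptions inferred from its results, so swap in your own figures.

```python
import math

EMAILS = 100_000_000
AVG_BYTES = 500_000  # the post's 500K average per email

total_bytes = EMAILS * AVG_BYTES
total_mb = total_bytes // 10**6   # decimal megabytes
total_gb = total_bytes // 10**9   # decimal gigabytes
total_tb = total_gb / 1000

# Assumed capacities in GB per tape, inferred from the post's tape counts.
TAPE_GB = {"DLT": 40, "LTO1": 200, "LTO3": 400}
tapes = {name: math.ceil(total_gb / cap) for name, cap in TAPE_GB.items()}

# Assumed single-threaded bzip2 cost in seconds per MB, inferred from the post.
SECONDS_PER_MB = 0.625
seconds = SECONDS_PER_MB * total_mb
days = seconds / 86_400

print(f"{total_tb:.0f} TB, tapes needed: {tapes}, bzip2 time: {days:.1f} days")
```

    Note that the year-long compression figure only holds for one machine working through the pile serially; the emails are independent, so the job splits across as many machines as you care to use.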

    1. Re:Your math is wrong by Anonymous Coward · · Score: 0

      You need to start a chain letter in management speak that gives away a secret to getting your email to the other guys faster so you can make more sales: it's keeping the message small. But I have no idea how to word it.

  27. er.... by Anonymous Coward · · Score: 0

    Ya know, this whole "obsolescence" thing could probably be avoided with open document formats.

  28. I can see it now... by Anonymous Coward · · Score: 0

    Next on /. A neat mod to make your ipod play 8 track tape.

  29. Bring'em Back!!! by gmby · · Score: 1

    The 8-track was a great idea. Bad design.

    It had the best User Interface.

    1.One Tape - slide it in
    2.One Button - Press for Selection
    3.Four Lights - Four Play List

    The player was as simple as you can get with tapes.

    1.One Motor
    2.One Solenoid
    3.One Tape head
    4.One Audio Amp
    5.A few lights and hardware to tie it together.
    6.A Box!

    I say we need to start an Open Source/Hardware rework of the design. The patents might not be a problem anymore.

    1.High Quality Design of the Mechanism
    2.LCD front panel Song display
    3.Maybe a seek (fast forward) button added.
    4.Digital/Analog Tapes
        A.Both analog and digital tape ability.
        B.Backwards compatibility with original 8-tracks. (Yes, I know most are broken now.)
    5.Dolby/Surround on Digital Tapes.

    I would love to have so elegant a design in an old 50's Chevy. Retro but High Tech at the same time.
    So Cool...

    --
    I don't want a pickle; I just want a Motor-Cycle! A four foot cop arrived with a five foot gun!
  30. Official History? by datafr0g · · Score: 3, Insightful

    The National Archives, entrusted to preserve America's official history...

    The official history? as opposed to what - the unofficial history? Or should it be worded differently: The National Archives, entrusted to preserve America's official government records...

    Don't mean to sound nit-picky but when I first read that, a million conspiracy theories raced through my mind! :)

    --
    "Who says nothing is impossible? Some people do it every day!" - Alfred E. Neuman
    1. Re:Official History? by Tesla+Tank · · Score: 1

      Well, it will be biased history. I don't mean that as an attack on the government. All history is biased. It's very difficult, maybe even impossible to present history in an unbiased manner. Perhaps if you tell it from different view points. However, who's to say how many view points is sufficient. How can we be sure we're not neglecting someone's experience?

  31. In Libraries Of Congress by TubeSteak · · Score: 1

    Well.. One LOC is 11,362.5 GB (based off this)

    If 100 Million e-mails is ~800GB
    then 100 Million e-mails is about 7.04% of one LOC

    --
    [Fuck Beta]
    o0t!
  32. Some Funny Formats and Laws. by twitter · · Score: 1
    let's be generous and say that the average email is 8192 bytes in size (8KB)

    Let's be honest and admit they use M$ junk. You know they are slinging around 70MB power point files, word docs, ad nauseum. Getting that all put into something legible is hard to do. Try opening your Excel 4 files, for example. Did you remember to install the right fonts and equation editor? If all non text were pumped to pdf or html, things would be a little easier but still larger. The challenge is automating the conversion given an administration that's cluelessly in love with all things M$.

    Now that we've thought for two seconds, let's visit the article:

    the Archives is struggling to devise a system for storing the enormous amount of digital information in a format that will allow it to be accessed 20, 75, even 200 years from now by historians, students and average Americans looking for a first-hand accounts of the federal government's activities.

    Sounds a lot like that Mass. mess. Reading on ...

    For example, when the electronic records of the Sept. 11 presidential commission arrived at the Archives a year ago, "it was the equivalent of all the fully processed electronic records we had received in 30 years," or about one terabyte of data, says Robert Chadduck, computer engineer overseeing the Archives' search for a solution.

    Oh my, better get a bigger drive than 800GB.

    Part of the impetus for wanting to come up with a comprehensive strategy for digesting electronic records is the desire to make them accessible via the Internet, rather than requiring people visit one of the Archives facilities, request a tape and then wait for a copy be mailed to them.

    Suckage.

    federal law requires that government documents be kept in their original formats to verify authenticity -- particularly documents that may be used in court.

    Oh shit, they are going to become a Digital Williamsburg. I suggest they start learning Bochs, because it's unlikely they will be able to keep some dinky P1 running (with its "original" CD) to read Bill Clinton's love letters to Paula Jones, much less connect it to a network. Preserving the original format is a good idea, but documents must also be converted to some reasonable publication format before the ability and interest in such conversions goes away.

    --

    Friends don't help friends install M$ junk.

  33. I know of two emails that aren't. by User+956 · · Score: 2, Funny

    Clinton only sent two emails during his entire 8 years in office.

    "His administration generated about 40 million messages - mostly memos and notes among aides and cabinet members. Of the two Mr Clinton sent, one was a test to see if the president could push an e-mail button. The other was addressed to astronaut John Glenn"

    That shouldn't be hard to archive.

    (on a slightly related note, I wonder what percentage of those are/were spam, and if they have to archive all those spam messages for online poker and hot wet bitches?)

    --
    The theory of relativity doesn't work right in Arkansas.
    1. Re:I know of two emails that aren't. by phaggood · · Score: 1

      The other was addressed to astronaut John Glenn"

      Perhaps you meant, Senator Glenn?

  34. MySQL? by jacklexbox · · Score: 2, Insightful

    Please correct me if I am wrong, as I probably am, but would like to have this explained to me. Why couldn't all the emails be stored as plain text in a MySQL database with either a web interface (php?) or an application written in an interpreted language (Java or Ruby)? Does that make sense? Is there something I am missing?

    1. Re:MySQL? by Detritus · · Score: 1

      Is any of that technology going to be around in 20 or 30 years?

      --
      Mea navis aericumbens anguillis abundat
    2. Re:MySQL? by HermanAB · · Score: 1

      Of course, but that would not warrant a multi billion dollar budget...

      --
      Oh well, what the hell...
    3. Re:MySQL? by jacklexbox · · Score: 1

      I would think so. If not, I could imagine it would be easy to export all the info from the database into a new database/technology.

    4. Re:MySQL? by jacklexbox · · Score: 1

      You mean spending money was a requirement? I guess F/OSS is out... maybe Oracle would be interested...

    5. Re:MySQL? by mindtriggerz · · Score: 0

      I've been ruby-ing all day scaffold :emails done.

  35. Old solutions to the rescue... by Pig+Hogger · · Score: 1
    Old solutions could be helpful here... Not because acid-free paper will last for centuries (the volume of paper would be staggering, and you can't grep dead trees), but they provide methods that could be applied.

    Take mercury delay lines. They kept data by continuously sending sound impulses inside a tank filled with mercury, and the impulses were recycled through to refresh the storage.

    Well, this could be done with a *HUGE* disk array, where you add drives to increase storage, and "retire" broken or obsolete drives and it would evolve as technology does, never losing any of its data.

    1. Re:Old solutions to the rescue... by HermanAB · · Score: 1

      Congrats! You just re-invented RAID5.
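      For anyone who hasn't looked inside it, the piece of RAID 5 that lets a dead drive be "retired" without losing data is plain XOR parity. A toy sketch, assuming equal-length blocks and ignoring how real arrays rotate parity across drives; xor_blocks and rebuild are names made up for the example:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity):
    """Recover the single missing block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


data = [b"abcd", b"efgh", b"ijkl"]   # one block per data drive
parity = xor_blocks(data)            # stored on the parity drive

# Pretend the middle drive died; the other two plus parity bring it back.
recovered = rebuild([data[0], data[2]], parity)
```

      This covers exactly one lost block per stripe, which is why it is a complement to copying the data forward, not a replacement for it.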

      --
      Oh well, what the hell...
  36. Talk to the Catholic Church by toupsie · · Score: 3, Funny

    Monks have done an amazing job preserving important documents over the years. In fact, Xerox worked with Brother Dominic in the field of document preservation. Print out all the e-mails on archive quality paper and store them underground. Be sure they are also translated in Spanish so future Americans will be able to read them.

    --
    Strange women lying in ponds distributing swords is no basis for a system of government.
    1. Re:Talk to the Catholic Church by Anonymous Coward · · Score: 0

      Hello there under the bridge, the church has willfully destroyed
      many documents as well e.g; Archimedes

    2. Re:Talk to the Catholic Church by toupsie · · Score: 1

      Hello there under the bridge, the church has willfully destroyed many documents as well e.g; Archimedes

      Yet somehow you know about this. Interesting.

      --
      Strange women lying in ponds distributing swords is no basis for a system of government.
    3. Re:Talk to the Catholic Church by AK+Marc · · Score: 1

      Yet somehow you know about this. Interesting.

      Perhaps they kept excellent records of what they destroyed?

  37. SpamAssassin by HermanAB · · Score: 1

    Well, if they would just run their mail through SpamAssassin it should make the problem far more manageable...

    --
    Oh well, what the hell...
  38. Isn't this a case for by Anonymous Coward · · Score: 0

    ODF. Am I wrong? Isn't the whole point of ODF is to have a format for documents that will be around longer than any company. As for emails; text and html should be easily accessible.

    Please correct me if I am wrong.
    PS: I am aware that some email clients butcher html.

    Regards

  39. Why not print them on paper? by Anonymous Coward · · Score: 0

    Why not print the e-mails on paper? Seems to me that the National Archives are already well equipped to archive paper documents, and the data would last at least several hundred years.

    Of course, stone tablets have proven to be the most durable data storage medium to date, lasting upwards of 5,000 years, but that would probably be overkill. :)

  40. They could always just say.... by Halfbaked+Plan · · Score: 1

    ...'the server ate my email' to any queries about critical email messages.

    Like the Clinton team did.

    --
    resigned
  41. Stop & think before posting, please by maggard · · Score: 3, Insightful
    Turning emails into text files, all graphics attachments into PNGs, etc. isn't the issue.

    How all of this stuff is connected, who it came from, when it was sent, all of that is something Historians (or Special Prosecutors) will need to know. Email from "aa204@whitehouse.gov" to "mikhail@kremvax.su" subject "Plans for Wall" isn't particularly useful if we don't have any way of tracking who aa204 was or knowing it was composed on Nov. 9, 1989 but not actually sent until Nov.10, 1989.

    Face it, most email systems are complex special-purpose systems made up of huge webs of interdependencies; from their hardware to their OS to their various applications. Imagine trying to pull emails, address books, mailing lists, undelivereds, calendars, attachments, cc's, bcc's, forwarded-forwarded-forwarded records etc. from a mass of DEC All-In-1 systems, IBM Profs, MS Exchange v.anything, and the /.-popular mbox/maildir/postfix/cyrus/exim/sendmail/dovecot/ldap/etc. environments...

    Now figure out some reasonably stable format to save 'em all in where they can be referenced, cross-referenced, timelines produced, who-knew-what-when deduced, identities tracked, policy propagation studied, etc. That's not the territory of thousands of text files, or PNGs, it's a data-miner's nightmare and what the Nat'l Archives are facing.
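    To make the point concrete, here is roughly the easy part: pulling the envelope metadata out of one RFC 2822 message with Python's standard email module. The sample message is hypothetical (the addresses come from the example above; the Cc, Message-ID and time of day are invented for illustration), and extract_metadata is a made-up name. Everything the header does not say, such as who aa204 actually was or when the draft was composed, still has to come from somewhere else.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message; only the From/To/Subject echo the example above.
RAW = """\
From: aa204@whitehouse.gov
To: mikhail@kremvax.su
Cc: staff@whitehouse.gov
Date: Fri, 10 Nov 1989 09:15:00 -0500
Message-ID: <1@whitehouse.gov>
Subject: Plans for Wall

Body text here.
"""


def addresses(msg, header):
    """Return the bare addresses found in every occurrence of a header."""
    return [addr for _, addr in getaddresses(msg.get_all(header, []))]


def extract_metadata(raw):
    """Collect the cross-referencing fields a text-only dump would throw away."""
    msg = message_from_string(raw)
    return {
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],  # None when the header is absent
        "from": addresses(msg, "From"),
        "to": addresses(msg, "To"),
        "cc": addresses(msg, "Cc"),
        "sent": parsedate_to_datetime(msg["Date"]).isoformat(),
        "subject": msg["Subject"],
    }


meta = extract_metadata(RAW)
```

    Getting this dictionary for one well-formed message is trivial. Getting it consistently out of All-In-1, Profs, Exchange and a dozen mbox variants, and then linking the addresses to people and the messages to each other, is the actual job.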

    So please, stop being quick-to-the-keyboards "Well d'uh" /.-trollers and assume that some reasonably clever and knowledgeable folks have already considered the problem and are appalled at its complexity. Yes, there are possibly some even more clever & knowledgeable folks who read /. but the text-&-png crowd is just so much wasted bits.

    At least the big-database folks are probably closer to what is going to be required, and anyone who is starting to think that mebbe proprietary undocumented databases cost us all more in the long-term than they're worth are even more (IMHO) on the right track...

    --
    I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I wont bother either.
    1. Re:Stop & think before posting, please by dogugotw · · Score: 2, Insightful

      A-freakin-men!

      Seems like there is about 100:1 'understand:clueless' post ratio here.

      Converting the body of an email or document (word, pdf, excel, powerpoint, html, whatever) is trivial. Maintaining all of the meta data associated with the document/email is not. Maintaining the original context is not trivial. Let's not forget that something like highlighting, font color, underlining, bold face, or italics within a message may have meaning - if you convert to all ascii, the formatting and the meaning that went with it are gone and the saved information has less value.

      XML might be a solution but for it to work, all of the existing production systems must be changed to xml compliant systems and users must be retrained and policies to manage the newly created data must be updated.

      I work for a medical device company and struggle with these issues every day and I only need to worry about data for 10 years or so. I cannot imagine trying to keep today's data meaningful for 50 years.

      If anybody has a solution that is:
      Free
      Transparent to the users
      Transparent to the admins/developers/maintainers
      Easy to implement
      Doesn't require revalidation (oh, you didn't know regulated systems had to be validated before use and change controlled and tested at each change???)

      feel free to chime in.

    2. Re:Stop & think before posting, please by Fuzzy+Eric · · Score: 1

      The parent and sibling (both to this reply) address one of the more difficult aspects: context, access, validity, integration, and cross-linking. Perhaps of more interest is some of the *invisible* (meta-) information in these e-Mails.

      Say, for instance a PDF is attached to a message. That PDF contains a document with redacted information. That information can be retrieved by cutting and pasting the document. What do you store in the archive? The redacted text? the unredacted text? both? Clearly, "direct conversion" posters completely miss this issue. But those more technically competent who recognize that "dump in standard formats" is a non-solution may not be thinking along these lines either.

      Sadly, the only reliable way to capture all of the data and metadata that may reside in these e-Mails and in attachments is in the original format and with original document readers. I.e., you can't just store the nouns, you must also store the verbs. Converting formats quite evidently risks destruction of information and must only be undertaken with extreme care. Even careful format conversion risks loss of information (edit sequences ("Where was that sentence when the speech was first drafted and how did it move around during drafting?"), editors ("Who deleted that paragraph?"), deleted text ("What's in the other 80% of this DOC file?"), et al.). While admittedly, not all documents store all this information, those that do and which were released as official government documents containing that information, are a deeply insightful glimpse into an Administration and into the processes of that Administration -- precisely the insight to be engendered by this collection of documents.

      Interestingly, this suggests that it is important that the Office of the President and any member of the government who may provide documents which appear in the Archives use file formats that may readily be archived. Perhaps these are ODFs and perhaps not.

      Perhaps more interesting is how contextual information will be collected. Can you universally convert "From:" addresses into real people's names? You can probably get most with some work, but all? Assuming 1% of e-Mails have a new address on them, this requires tracking down 1M addresses. (Perhaps we can skip the "home loan" and "Viagra" et c. new addresses, reducing the number to a few thousand, but that's still an exorbitant amount of manual effort just to put the "From:" fields in context. Now let's move onto the "CC:" field...)

  42. Are we blinded? by electrosoccertux · · Score: 2, Interesting

    Lately I've been wondering how great Google really is, and whether it's deserving of the love I give it. Sure, I think the company Google is full of geniuses coming up with some of the best ideas since bread & butter.

    But then I ask myself how much time I've spent trying to find things online. I've been finding Google to be increasingly less useful. When was the last time you googled, looking for information, and found nothing related? When was the last time you had to rephrase your search query not once, not twice, not three times, but four or five times? Now, when was the last time you googled for something besides Wikipedia (or any other well known site) and found what you wanted on the first page? I can tell you that for me, the times I've been able to check off "found in under 15 seconds" have become scarcer and scarcer. Since, I've increased results to 20 per page. That's helped a bit. But most of the time I'm having to rephrase my search query multiple times. After 5 or 6 tries, I usually find what I want halfway down the page. Why is this?

    I've had several thoughts on this issue lately. Google could be filling up with spam - pages optimized just to get a high pagerank. Or perhaps I'm asking Google to find me increasingly complex and niche information. Being a GT student, it's entirely possible I'm simply asking it for things most other people don't find useful. But I didn't have these problems until, at most, two months ago. Or perhaps what I fear is becoming a reality: Google's IPO has turned the company in a different direction. Maybe their slogan is changing from a "do no evil" to a "do less good" stance? Am I crazy? Or are we blind, and is what I say true? Are we loving Google only because they're giving Microsoft a run for their money?

    Don't get me wrong, Google has plenty of wonderful services: Google Earth, Gmail, the new click-a-button-and-have-that-company-phone-me service, etc. But is it possible that they're beginning to sell out the top results in their searches? Consider the evidence: I've been spending more time than ever finding quality links. Google's IPO was but a few months ago. Also, in talks with AOL, Google now plans to offer not only specialized AOL ads, but also FLASHier adsense ads. So is it probable that Google is selling a place in their top results? I'm very inclined to think so. And so, just recently, I've come to question my devotion to Google.

    Am I the only one wasting search time? I think it's time we re-evaluate Google's search engine, and think twice before we offer our praise.

    1. Re:Are we blinded? by sootman · · Score: 1

      Depends what you're doing. Using google with the exact wording of an error message often gives the solution in the first match. It's still great for me. Spam is an ongoing battle but my searches usually don't result in much--just subject matter differences, I guess. I'm sure if you're looking for digital camera info it's kind of hard. But when the revolution comes and all spammers are lined up along a wall and shot, that problem will go away.

      --
      Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
    2. Re:Are we blinded? by hurfy · · Score: 1

      hehe, the one that got me was last night searching for an envelope sealer. Half the first page results were for sites that search for items one of which returned a result for another site that searches for items.....

      If i am searching google i probably do NOT want to look for a search site...if i do i can type the word SEARCH in the damn box if i want someone else to search for it. Half the ads were the same, no one had anything for sale they just want to find it for me :(

      Yes, it took several tries to even get that far i think. "mailing machine sealer" got more heat sealers and strapping machines than mailing stuff :(

      BTW anyone know of a envelope sealer between $40 and $600 :(

  43. Uh - they put it on tape? 1600 or 6250 BPI? by Anonymous Coward · · Score: 0

    You got to be kidding!!! Why didn't they just create a master DVD and then press data discs for every local library that wants to have a copy?

    The other big question is why hasn't Google offered a free service to archive all that data?

  44. My sister got bitten by a moose once by Anonymous Coward · · Score: 0

    My sister got biten by a moose once. Mind you, it was a prety good moose Thos responsible for sacking the people wo sacked the original editors have also been sacked The argument was completely re-edited at the last moment by a team of equadorian mountain Llama's

  45. Microsoft Word by csplinter · · Score: 2, Funny

    They should experience how the latest version of Microsoft Office can help them better manage documents, organize workload, and collaborate with coworkers--not just from their desk, but from almost anywhere! Why? So that their system will deliver the features, options, and performance they need to maximize their productivity and enjoyment, to insure that their software is authentic, properly licensed and supported by Microsoft or a trusted partner, so that they will get access to updates, enhancements, and innovations that help them protect and do more with their PC! In conclusion, If you don't believe that Microsoft Office has REAL Ultimate Power you better get a life right now or they will chop your head off!!! It's an easy choice, if you ask me.

  46. yes, oh yes by misanthrope101 · · Score: 2, Insightful
    100 million hate-emails is a lot of hate mail.
    It is all hate mail, right?
    Yes. Every last one was to people who questioned or disagreed with a decision made by the Bush administration. Strangely, every single email contained the same text: "Why do you hate America?" Apparently it's the most cogent, incisive argument around.

    Or was this about email received by the White House? All of that routed through a special team working out of the office of the Vice President. All of that email was also identical: "Cheney was right all along."

    These two may seem like odd coincidences, but only if you hate America. Your email will be forthcoming.

  47. Re:8K??? by dogugotw · · Score: 1

    Ya mean these folks don't send out Word, Powerpoint, Excel, funny pics and all that other attachment crap to the 100 folks on the cc: list? I'm thinking 8K is probably a very low end estimate given the junk that shows up in my inbox.

  48. Just print it all out by brainburger · · Score: 1

    Print it all out using stable inks on acid-free paper.
    - This will give the librarians something to do, and will be immune to technology going obsolete ;-)

  49. Incompatible formats not the fault of tech itself. by Anonymous Coward · · Score: 0

    "At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.'"

    That's not really the fault of tech, that's a problem with companies trying to engage in vendor lock-in tactics. Keep things simple and standardized (aka ascii/plain text or open formats) and this should be a non issue. Keep everything in PDF or DOC format and yeah... you'll have problems. Take it from the State of Maine, they know full well what they are doing.

  50. Three words under an ITIL framework by Anonymous Coward · · Score: 0

    Software Media Library

  51. Having worked as a contractor for NARA... by plazman30 · · Score: 1

    I have to say the biggest problem they face is the fact that the entire US Government is not on one standard for electronic documents. NARA uses GroupWise for its e-mail. Other agencies use Exchange/Outlook. Some agencies still use text mode e-mail on a mainframe or UNIX box. People I speak with in the Navy tell me that the whole Navy uses a bunch of different formats for everything from e-mail to word processing documents.

    The government is only recently adopting PDF files, because PDFs before version 1.5 of the spec were not Section 508 compliant, and a screen reader could not read them.

    Flash animations on web sites are out also due to Section 508 compliance. NARA's headache would be greatly reduced if they could standardize on a format that everyone uses across the board in all agencies.

    Sadly, the government doesn't work that way...

  52. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  53. Re:Major technical problem with ODF by Anonym0us+Cow+Herd · · Score: 1

    There are major technical problems with ODF.

    The first and biggest one is that it doesn't help to entrench the MS Office monopoly. In fact, it tends to work against this goal, because other vendors can freely interoperate with ODF documents.

    Another major problem with the ODF format is that nobody is able to impose a "tax", or to require special individual license permission for each new software which reads and writes the format.

    Finally, ODF is tainted by that "open source" movement. Respected, successful business leaders of our nation have denounced it with phrases such as "...like a cancer" and "...infects other intellectual property", "is un-American", and other similar remarks.

    Considering the above problems with ODF, and because I am cynical and have lost all faith in a system which is hopelessly corrupt, I don't expect ODF to actually be used by the government. It simply doesn't put money into the right pockets.

    --
    The price of freedom is eternal litigation.

  54. Experience with data archiving. by gameguy1957 · · Score: 1

    I'm almost finished with a degree in Anthropology/Archaeology and have owned a business in the past that recovered and converted data. Here are a few things I have run into while doing archival or retrieval work.

    I started archiving slides of archaeological digs for one of the local universities about twelve years ago. At the time Kodak gave an estimated shelf life of 100+ years for the recorded CDs. We actually started having data corruption within a few years, even with multiple copies stored in different, climate-controlled locations.

    Now the slides are more degraded than they were twelve years ago, and we are back to looking at other methods of archiving the data, since we can't predict what will happen to the digital storage 10-15 years down the road, much less hundreds of years from now.

    Also, I used to have a business that would recover and convert data from one format to another. You wouldn't believe the number of businesses that archive data in one format and put it in storage, have a catastrophic failure and, upon trying to recover data, find that they no longer have the equipment to retrieve the data.

    Converting the data was easy - if you still had the equipment that was used to archive the tape, cartridge, diskette, etc.

    If I experience these problems in a small city within 15 or so years, I can't imagine what problems a project of that scope (archiving the whole of the gov't's data) would have while trying to preserve it for future research or historic context.

    -JM
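
    The corruption described above, showing up across multiple stored copies, is the kind of thing a periodic checksum comparison can catch early. Below is a minimal sketch, assuming each copy can be read back as bytes; the function names and the site labels are hypothetical, and majority voting is only one simple way to pick the "good" copy.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a blob."""
    return hashlib.sha256(data).hexdigest()


def find_corrupt_copies(copies: Dict[str, bytes]) -> List[str]:
    """Return the names of copies whose digest differs from the majority digest.

    Assumption: most copies are still intact, so the most common digest
    is treated as the reference.
    """
    digests = {name: sha256_of(blob) for name, blob in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority_digest)


# Hypothetical example: three copies of one scanned slide, one damaged.
copies = {
    "site_a": b"slide-0001",
    "site_b": b"slide-0001",
    "site_c": b"slide-0O01",
}
print(find_corrupt_copies(copies))
```

    Run on a schedule, a check like this flags a bad copy while a good one still exists to restore from.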

    1. Re:Experience with data archiving. by AK+Marc · · Score: 1

      If I experience these problems in a small city within 15 or so years, I can't imagine what problems a project of that scope (archiving the whole of the gov't's data) would have while trying to preserve it for future research or historic context.

      I can imagine. And I think it would all be fixed by treating it as live data. Don't "archive" it in the sense that you save it to a tape or flashdisk or something, then toss that in a drawer for 50 years, then realize it is unusable. Save it to live media, like a hard drive. Make backups of it in case of corruption (no problems for disk failures because redundancy would be used). Sure, it isn't cheap, but it makes for reliable storage. When the disks become hard to replace, build a new storage container. The nice thing is, by the time you need new live storage (about 10 years or so), the storage will drop in price by 10x or more, so moving it every 10 years isn't as expensive as it sounds. The largest cost would be building the initial storage array.
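
      The cost argument above can be checked with a little arithmetic. This is a rough sketch, assuming the data volume stays fixed, each rebuild costs 1/10 of the previous one, and labor and power are ignored; the function name and the dollar figure are made up for illustration.

```python
def total_migration_cost(initial_cost: float, price_drop_factor: float, migrations: int) -> float:
    """Cost of the initial array plus `migrations` later rebuilds.

    Each rebuild is assumed to cost the previous build divided by
    price_drop_factor (a geometric series).
    """
    return sum(initial_cost / price_drop_factor ** k for k in range(migrations + 1))


# Hypothetical: a $1,000,000 array rebuilt every 10 years for 50 years (5 rebuilds).
print(total_migration_cost(1_000_000, 10, 5))
```

      Under those assumptions the series never exceeds initial_cost / (1 - 1/10), about 1.11 times the first build, which is why the initial array dominates the total.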

    2. Re:Experience with data archiving. by gameguy1957 · · Score: 1

      At the time, a single-speed CD-R was $7500 and we spent something like $1500 on a less-than-1GB HD. So, we couldn't afford to store it live.

      Live storage is one of the options being looked at right now.

      Thanks,

      -JM

  55. not like 8-tracks--this is software by figital · · Score: 1

    This isn't quite like 8-tracks where the players aren't made anymore.

    Obviously if you dump this stuff to tape then the comparison holds...for a little while. I would expect that any company, upon upgrading their archive hardware would migrate existing data to the new equipment!

    Then the issue is simply with format. I find it hard to believe that in 500 years, no one will be able to decipher a txt, doc, png, jpg, etc. These are SOFTWARE formats--not hardware. Thus, you don't need to maintain any piece of equipment, just a little code. Sure, the more obscure formats like wps, wks, etc. might give you trouble, but c'mon--you will have the same problem with well-documented open source formats that are not very popular (like xcf?).

    I don't think open source alone is the solution here.
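
    As a rough illustration of the "just a little code" point, here is a minimal sketch that recognizes a few formats by their leading signature bytes. The table and function name are hypothetical, and identifying a file is only the first step; actually decoding it still needs the format's specification.

```python
# Leading "magic" bytes for a few common formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"%PDF-": "pdf",
}


def sniff_format(header: bytes) -> str:
    """Guess a file's format from its first few bytes."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"


print(sniff_format(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(sniff_format(b"plain old text"))
```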

    1. Re:not like 8-tracks--this is software by AK+Marc · · Score: 1

      Obviously if you dump this stuff to tape then the comparison holds...for a little while. I would expect that any company, upon upgrading their archive hardware would migrate existing data to the new equipment!

      How is that obvious? I've never worked for a single place that migrated archived data upon a format switch. It's old data, it's supposed to last a while, and they'll keep one old reader, never tested, in case they might want to try to pull something off. Oh, and they won't notice that the reader only ever had drivers written for a VAX, and they tossed their last VAX out 5 years ago, and the 25 pin SCSI connector on it is a little damaged.

  56. Re: Dancing on the Head of a Pin by some+guy+I+know · · Score: 1

    I think we deserve to be told how many Library of Congresses that takes up!
    Since the National Archives are (presumably) part of the Library of Congress, I would guess that the answer would be < 1.

    Yes, I know; Whoosh!
    --
    Those who sacrifice security to condemn liberty deserve to repeat history or something. - Benjamin Santayana

  57. OpenDoc by greg_barton · · Score: 1

    I've never seen a more compelling argument for OpenDoc. (and/or a conversion requirement to OpenDoc.)

  58. Got this one via the Freedom of Information Act by dexter+riley · · Score: 1

    Take out anything that might injure National Security, then turn the rest over for Google to index.

    Dear Sir,

    I write to inform you of my desire to acquire [REDACTED] in your country on behalf of [REDACTED] of the [REDACTED] in Nigeria. Considering his very strategic and influential position, he would want the [REDACTED]. He further wants [REDACTED], until [REDACTED]. Hence our desire to have [REDACTED].

    [28 LINES REDACTED FOR SECURITY PURPOSES]

    Your quick response will be highly [REDACTED]. Thank you in anticipation of [REDACTED].

    Yours [REDACTED],
    [REDACTED]

  59. Archiving emails by TheBright1 · · Score: 1

    I sympathize - sort of. In the '90s I worked on a defense contractor's account during their physical move to a "black hole": data, machines, everything. An engineer called me; he was close to retirement, and had worked on the original F101, and the 105. Twenty years ago he had archived his drawings to the system they were using: Macs. They didn't get UNIX until the '90s. He had all the drawings etc. under his desk and on shelves on floppies and wanted them converted to either Wintel, or current Mac (8.0 at that time.) Not an easy process. But all those emails go to the retiring President for his library, and his crew there can worry about it. The Office of the POTUS, and the archives are only responsible for "historic" documents. Bush has some bright people (I think.)