Slashdot Mirror


Archiving Digital History at the NARA

val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a "digital dark age"?"

23 of 202 comments

  1. 16000 formats?!? by gardyloo · · Score: 3, Funny

    Hm. This sounds like a job for OpenOffice...

  2. 347 petabytes? by ravenspear · · Score: 4, Insightful

    Ok, I was tempted to make a pr0n joke about this, but I think the bigger question is what kind of indexing system will this use?

    I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack, err. haybarn.

    1. Re:347 petabytes? by OrangeSpyderMan · · Score: 3, Informative

      I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack,

      Haven't you? Have you ever worked with real archiving before? IBM have some nice solutions that allow us to stock on disk and a WORM library (Tivoli Storage Manager) and index in a (large) Oracle DB - they work and scale just fine (our experience over a couple of hundred teras). You probably wouldn't want all that data in a single archive anyway, but I'd guess you'd know that if you'd ever archived anything....

      --
      Try NetBSD... safe,straightforward,useful.
    2. Re:347 petabytes? by CodeBuster · · Score: 3, Informative

      The most common structure used to index large amounts of data stored on magnetic or other large capacity media is the B-Tree and its variants. The article linked here explains the basic idea of the balanced multiway tree or B-Tree. The advantage of this type of index is that the index can be stored entirely on the collection of tapes, cartridges, disks or whatever else, while only the portion of the tree which is currently being operated on need be read into volatile or main memory. The B-Tree allows for efficient access to massive amounts of data while minimizing disk reads and writes. Theoretically, the B-Tree and its variants could be scaled up to address an unlimited amount of data in logarithmic time.
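
      A minimal sketch of the lookup described above, assuming a small hand-built tree rather than a real on-disk index. `Node` and `search` are illustrative names, not taken from any library, and each node visit stands in for one disk or tape read. Insertion and rebalancing are left out.

```python
from bisect import bisect_left


class Node:
    """One B-Tree node: sorted keys plus, for internal nodes, len(keys) + 1 children."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []  # empty list means this node is a leaf


def search(root, key):
    """Return (found, nodes_read), walking a single root-to-leaf path."""
    node, reads = root, 0
    while node is not None:
        reads += 1  # stands in for loading one node from storage
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True, reads
        # Descend into the child that covers the key's range, if there is one.
        node = node.children[i] if node.children else None
    return False, reads


# Hypothetical two-level tree, built by hand for illustration only.
root = Node([20, 40], [
    Node([5, 10]),
    Node([25, 30]),
    Node([45, 50]),
])

print(search(root, 30))
print(search(root, 33))
```

      Only the nodes on one path are touched per lookup, which is why the rest of the index can stay on slow media.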

  3. Data loss will always be a possibility by divide+overflow · · Score: 4, Insightful

    It happened with the Great Library of Alexandria, with pagan libraries throughout the Christian era, and more recently has happened with antiquities in Afghanistan and Iraq. The only thing that can reliably preserve data is large scale, geographically widespread distribution of copies.

    1. Re:Data loss will always be a possibility by tabdelgawad · · Score: 3, Insightful

      Actually, it's more like 'inevitable'. I'll bet almost everyone has unintentionally lost digital data permanently and will do so again in the future.

      The key, I think, is prioritization. We all do it individually (important stuff gets backed up many times and often, unimportant stuff perhaps never), and NARA will have to do it too. I don't think backing up a president's email and backing up some minor White House aide's email should have equal importance. The trick will be to come up with a reasonable prioritization scheme that will make the probability of losing the most important stuff very small.

      --
      Imposing Libertarian views on everyone online since 1992.
    2. Re:Data loss will always be a possibility by Tristor · · Score: 3, Funny

      No, but it could have been lost to the strike of flint, a pregnant pause, then a "glukús theométôr" (Sweet Mother of God, for you people that suck). (Note: I spent like 20 minutes transliterating that to Latin just so I could post it on /. because it hated the Greek charset. I have no life.)

      --
      "I just karma whore to everyone." -garcia (6573)
  4. Answer is Compression? by reporter · · Score: 4, Informative
    National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a "digital dark age"?"

    Perhaps, the answer is compression.

    Does anyone know whether there is an upper limit to text compression?

    In signal processing, there is a limit called the Shannon Capacity theorem, which gives the maximum amount of information that can be transmitted on a channel. In text compression, is there a similar limit?

    Note that the Shannon Capacity theorem does not tell you how to reach that limit. The theorem merely tells you what the limit is. For decades, we knew that the maximum limit on a normal telephone twisted pair is about 56,000 bits per second, according to the theorem. However, we did not know how to reach it until trellis coding was discovered, according to an electronic communications colleague at the institute where I work.
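
    For reference, the limit in question is usually written as the Shannon-Hartley formula, C = B * log2(1 + S/N). The sketch below only evaluates that formula. The function name is made up for illustration, and the bandwidth and signal-to-noise figures are placeholder assumptions, not measurements of any real telephone line.

```python
import math


def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Channel capacity in bits per second from the Shannon-Hartley formula."""
    snr_linear = 10 ** (snr_db / 10)  # convert decibels to a plain power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)


# Placeholder inputs (assumptions): a few kHz of bandwidth at a range of SNRs.
for snr_db in (20, 30, 40):
    print(snr_db, "dB ->", round(shannon_capacity_bps(3000, snr_db)), "bit/s")
```

    The formula gives the ceiling for a given bandwidth and noise level, but says nothing about which coding scheme gets there.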

    If we can calculate a similar limit for text compression, then we can know whether further research to find better text compression algorithms would be potentially fruitful. If we are already at the limit, then we should spend the money on finding denser storage media.

    1. Re:Answer is Compression? by MasterC · · Score: 3, Interesting

      The only thing that comes to mind is information entropy. If you're given a text document, you can determine the probability distribution for each letter, letter combinations, for words, or whatever you can think of. Then given the probability distribution, you can determine the information entropy. If, in the sum, you use log with base 2 then H(x) (see formal definitions) gives you the entropy in bits.

      For example, if you have a text file with letters of equal probability (all letters have a probability of 1/27) then the number of bits required to represent a single letter turns out to be ~4.7549 bits. (Indeed, 2^4.7549 = 27)

      This is the upper limit of compression: no lossless code can average fewer bits per letter than the entropy. Such methods as the now 50-year-old Huffman coding do decent work at approaching this limit (used in JPEG, for one).

      So the answer to your question is: it's not broadly definable for "text" or "information" but based on the patterns of the English language or a specific document.
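
      A small sketch of that calculation, assuming single characters are the symbols and that their probabilities are estimated from the text itself. The function name is made up for illustration. Counting letter pairs or words instead would give a different (usually lower) per-character figure.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(text):
    """Shannon entropy H = -sum(p * log2(p)) over the characters in text."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# 26 letters plus the space, each appearing once: 27 equally likely symbols.
uniform = "abcdefghijklmnopqrstuvwxyz "
print(entropy_bits_per_symbol(uniform))  # should match log2(27)
print(math.log2(27))

# Skewed text has lower entropy, so it can be compressed further.
print(entropy_bits_per_symbol("aaaaaaab"))
```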

      --
      :wq
  5. ha by The+Big+Ugly · · Score: 3, Funny

    "Archiving Digital History at the NARA"

    You'll have to pry it from my cold, dead hands!

    Ohhhh, NARA, not NRA....

  6. Google to the rescue!!! by feloneous+cat · · Score: 3, Funny

    With the new GoogleNARA...

    nara.google.com

    Oh, wait... I'm getting ahead of myself...

    --
    IANAL, but I've seen actors play them on TV
  7. Difference between data and trash by HermanAB · · Score: 4, Insightful

    In the age of pen and paper, only important stuff was written down. Nowadays all crap is preserved. This is useless. There is a big difference between data and information.

    --
    Oh well, what the hell...
  8. Dark Ages by TimeTraveler1884 · · Score: 5, Insightful
    Are we destined for a "digital dark age"?"
    If by "dark age" you mean a time in human history where more information is recorded than ever, yes I suppose we are.

    I think more accurately, we are headed towards an age of super-saturation of information. I have no doubt we can store all the data we are currently and will be generating. The question is how do we process it into something meaningful? Just because we have the ability to archive everything does not mean it will be useful to the [insert personally welcomed overlord] of the future.

    Maybe historians of the future will be fascinated that Clinton's instant-message signoff was "l8ter d00d", but I doubt it. We'll want to save everything now of course, because we can. But the majority of the information I suspect will just be filtered out when actually searched.

    Personally, I take the "you never know" ideology and save everything.
  9. Not a dark age... was the past so bright? by G4from128k · · Score: 4, Insightful

    Digital technologies mean that archivists now enjoy orders of magnitude more information than they had in the past. Consider all the hallway and phone conversations or jotted notes lost in a paper-based organization versus having an archive of e-mail, IM, and sticky-note digital files.

    Digital technologies mean that archivists now enjoy orders of magnitude more potential accessibility than in the past. Even if paper has greater innate archival lifespan, its physical form makes it inaccessible to all but a select monkish class of archivists colocated with their paper archives. Even the select few archivists who are allowed access to paper archives can only effectively process at best a dozen documents per minute (and only a dozen per hour if they must wander the files to find randomly dispersed documents).

    By contrast, digital technologies radically expand access on two dimensions. First, technology expands the number of people that can access an archive in terms of distance -- a remote researcher can have full access, including access to documents in use by other archivists. A low cost to copy documents means a wealth of information. Second, search tools provide prodigious access to the files -- searching/accessing/reading thousands or millions of documents per second.

    To say we face a dark age is to presume that paper documents provided far more enlightenment and comprehensiveness of documentation than paper ever actually did.

    --
    Two wrongs don't make a right, but three lefts do.
  10. So? by ArchAngel21x · · Score: 3, Insightful

    By the time the government comes up with a half ass solution, archive.org will already have it all organized, online, indexed, and backed up.

  11. Have a look at the Fedora Project by pangloss · · Score: 3, Funny

    http://www.fedora.info/
    (Not to be confused with the Linux distribution)

    From the website, Fedora is "a general purpose repository service...devoted to...providing open-source repository software that can serve as the foundation for many types of information management systems".

    Problem for some is that Fedora can be a little hard to grok. It's not an out-of-the-box repository to install and run, like the repository application mentioned in the article (DSpace). It's an architecture for building repository software. Once you understand the potential for building applications on top of Fedora, you start to see some light at the end of the tunnel for just the sort of issues the article raises.

  12. Relevant, interesting post by Council · · Score: 4, Funny
    Here is a relevant post by Ralph Spoilsport on an earlier article, which can be found here. I am reproducing it here in full because it is very interesting and highly relevant.

    this is actually a BIG question

    And one that I have railed about for many years.
    I have been in the same position the Author discussed, and I have come to ONLY negative conclusions. In a few words, and I hate to say this, but buddy:

    WE'RE FUCKED.

    Digital is a loser's proposition. Backing up to analogue, or even digital data on analogic substrates (such as DV tape), fails. Simply and purely.

    The *only* thing that comes close is some kind of RAID, and those, even with the plummeting price of storage, are still too expensive given the needs.

    Also, a RAID assumes a continuity of several things that are not likely to be continuous:

    With Video:
    Framerate, number of lines, colour depth, aspect ratio, file format, compression format, Operating system compatibility, etc etc etc. All of these things are variables.

    With Audio:
    sample rate, compression format, bit depth, file format, etc.

    Basically all of it points to very bad places.

    I am fairly well convinced that our age will simply disappear. They will find our garbage, the few books not pressed on acidic paper, our paintings (fat lot of good the abstract stuff will mean to them) and drawings, and that's about it. The rest will just be shiny little bits of crap in the landfill.

    Since we will have used up all the dense energy forms, they will be appalled at the energy requirements just to get the few remaining museum piece devices to work. Archiving the 21st century will be impossible. To the 25th century, the 21st century will be seen as a dark age - not only for the holocaust of the die caused by the failure of the petroleum based economy, but from the simple fact that very little of the information formats we are totally geared into will survive, including this note on /.

    His problem of saving personal video is just the tip of the iceberg. His problem is the problem of our very civilisation, writ small.

    That's why I am abandoning video, and going back to painting. In 500 years, my painting CAN survive. The video simply won't.

    RS


    And don't give me shit about my karma or whatever. My karma's fine, I don't care about it. I'm copying this because it's interesting and contributes to the discussion.

    What do you think about Ralph's thoughts?
    --
    xkcd.com - a webcomic of mathematics, love, and language.
  13. Every mail is sacred by kfg · · Score: 3, Insightful

    Every mail is great
    If a mail is wasted
    The gods get quite irate

    Every mail is wanted
    Every mail is good
    Every mail is needed
    In your network neighborhood

    Really, the idea of not being able to record and save every post-it note being equated with those times and places where writing itself was denigrated into virtual nonexistence is a bit silly.

    KFG

  14. Re:Why do we need to archive everything? by felix71 · · Score: 4, Insightful

    Actually, one of the main complaints historians have is incomplete information about the past. Not having every little tidbit makes it impossible to figure out how people actually lived. History _should_ be more than just names, dates, and events. If we can properly preserve and index items that seem really mundane to us, future generations have a _much_ better chance of having some real understanding of how we developed as a society.

    --
    Never attribute to malice that which can be adequately explained by incompetence. -- Jerry Pournelle
  15. strip MS HTML from Outlook mails by rduke15 · · Score: 4, Funny

    I don't know about the NASA data sets, but they could certainly save a few petabytes by stripping the stupid HTML part of all Outlook emails...

  16. I'm guessing... steady state. by dpbsmith · · Score: 3, Interesting

    The Zapruder film was the beginning. In recent years, I've been dumbfounded by the vast extension in recording and documentation of things like crimes in progress, natural disasters, America's Funniest Home Videos, you name it. A plane crashes, and the next day there are ten different home videos from people in the vicinity who had camcorders.

    I believe the cost of traditional photography in constant dollars dropped enormously between my parents' time and mine. I know we took about ten times as many silver-on-paper and Kodacolor dye-on-paper snapshots as my parents did. Then we got a camcorder. My parents captured about three hours total of 8 mm silent home movies. I have about forty hours of 8mm and digital-8 camcorder tape.

    And since my wife and I got digital cameras, we've been taking five to ten times as many pictures as we did when we used film cameras.

    Now, YES, I'm on the format treadmill. Got most of the old 8mm movies transferred to VHS. Got most of the VHS transferred to DVD. Got a lot of the old slides scanned. Got most of my digital images burned to CD. In the last five years, I've probably spent a hundred hours, or 0.2% of my life, on nothing but struggling to copy from old formats to new. I've spent a small fortune getting Shutterfly to print pictures, because to tell the truth I have much more faith in the prints surviving than the CDs.

    So, I don't see a digital dark age. I see a bizarre situation in which the quantity of material recorded in digital form continues to increase exponentially for quite some time. _Most_ of it will get lost, and the percentage that survives, say, a hundred years will keep going DOWN exponentially with time.

    But I'm guessing the total quantity of 21st century material available to historians of the 23rd century will, in absolute numbers, be just about the same as the total quantity of 20th century material.

    It's one of those mind-boggling things like personal death that one can never quite come to grips with. The future is unknown, and we can accept that. But the fact that most of the past is unknown is equally true--and very hard to accept.

  17. Re:burn, knowledge, burn by mcrbids · · Score: 4, Insightful

    Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that lead up to the horrible decisions that lead our politicians to make the mistake of the daily business? We want records of this?

    Absolutely, yes!

    History is often taught as "Charlemagne took over Constantinople in the year 12xx" as though military feats really mattered to the average Joe. But, the truth is, America was colonized by people who thought that, however bad it might be in a virgin land, it was BETTER than their lives in Europe.

    One of the key failures in public education today is to communicate the understanding that history is comprised mostly of PEOPLE doing ORDINARY things in their time to make life better for themselves and their families. They loved, worked, got bored, and cracked jokes at the expense of their leaders, just like we do today.

    History doesn't consist of battles, any more than history consists of artworks. Capturing more detail in the average, everyday lives of people gives a much better understanding of the cultural norms, and the ideals to which people aspired.

    The pyramids of ancient Egypt provide a clear, artistic monument to their culture, yet we have an only modest understanding of their day to day cultures. Similarly, we have Stonehenge as a clear monument to the grooved-ware people of the English isles, but almost NO understanding of who they were and what they felt was important. How much would a true historian give to understand the day-to-day culture of these mysterious "grooved-ware" people of ancient?

    Those memos and IMs comprise that understanding of people today.

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
  18. Try to help correct other's math sans sarcasm. by jbn-o · · Score: 5, Insightful

    You were just a little over 12 times too much. Let's just hope you don't write code for a living :p [...]

    To you and the countless others on /. who offer their corrections in a similar tone: Yes, we get it, the parent poster goofed and you supplied a correction. Given the trivial context here, it's hardly a big deal and doesn't warrant sarcasm. Everyone makes mistakes and plenty of people make mistakes in their work every day, including people who do work where lives are at stake. That's one reason why it is good to work with other people. In life it's far more important to be forgiving, keep things in perspective, and help other people without the wiseacre commentary and then move on.