Slashdot Mirror


Archiving Digital Data an Unsolved Problem

mattnyc99 writes, "It's a huge challenge: how to store digital files so future generations can access them, from engineering plans to family photos. The documents of our time are being recorded as bits and bytes with no guarantee of readability down the line. And as technologies change, we may find our files frozen in forgotten formats. Popular Mechanics asks: Will an entire era of human history be lost?" From the article: "[US national archivist] Thibodeau hopes to develop a system that preserves any type of document — created on any application and any computing platform, and delivered on any digital media — for as long as the United States remains a republic. Complicating matters further, the archive needs to be searchable. When Thibodeau told the head of a government research lab about his mission, the man replied, 'Your problem is so big, it's probably stupid to try and solve it.'"

13 of 405 comments

  1. I've heard this problem over and over by csoto · · Score: 5, Interesting

    Working at a university, this is not a subject I'm unfamiliar with. We've had lots of discussions about this. Everyone always talks about how many zillions of "pieces of information" are out there. The number of web pages in existence is always bandied about. My point in these discussions is that most of what's out there is crap. Humanity is not lessened by its loss. Good stuff gets reproduced, reviewed, studied, dissected, etc. and survives. It *is* stupid to try to solve this problem, because the problem doesn't need solving.

    --
    There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
  2. Re:Not too long... by eln · · Score: 5, Interesting

    Your timeline may be a little off (at least I hope so), but you're right that it's a silly goal. Whether the US has 10 or 1000 years left, history shows us it will most likely fall at some point, and that point will be fairly soon when compared to the entirety of human history.

    Making a format that will survive a thousand years so long as our advanced civilization is still around and still cares is pointless, because as long as there is a continuous line of people that care, they will be willing to transfer at least the more important stuff to new media. The trick is coming up with something that will still be readable when archaeologists dig it up 10, 50, or 100 thousand years from now.

  3. The solution by alexwcovington · · Score: 3, Interesting

    In this era of virtualization, the solution for x86 software is as easy as retaining a copy of the primary partition of a computer originally used to work with the desired files. Searchability could be a problem for proprietary data formats, but the move to open standards in the future will mitigate that.

    The real problem is 60 years of archives of antiquated, proprietary, task-specific and mainframe computer data cards and tapes whose original programmers are halfway to cedar boxes; if the government can't get their support in time it may as well call all the early stuff a loss and hand it over to archaeologists.

    --
    (It's never too late to join the Renaissance)
  4. Re:How is this different by quanticle · · Score: 3, Interesting

    It's different because of the sheer volume of information being created today. Ancient cultures were not creating millions of pages of information every day.

    Your Rosetta Stone analogy is inappropriate. We have not discovered any sort of Rosetta Stone for the ancient Maya hieroglyphs, but we have had success in deciphering them because we can apply linguistic analysis techniques to figure out what words correspond to what actions/things. It's a little more complicated for abstract concepts, but you can figure out a surprising amount from basic language knowledge.

    --
    We all know what to do, but we don't know how to get re-elected once we have done it
  5. Re:How is this different by s20451 · · Score: 4, Interesting

    Say western civilization is disrupted for a period of time that is short by historical standards -- 40-50 years would be enough. Electrical power is only sporadically available, and as a result the Internet collapses and PCs become useless. With much more important issues to deal with, such as finding food, people ignore digital data storage.

    The era of restoration comes. However, when people blow the dust off those old DVDs and players, they discover that the DVDs have decayed to the point of unreadability. Massive quantities of archived data and knowledge are irretrievably lost.

    The main problem in our age is thermodynamics -- information is stored so densely that it tends to decay naturally, on its own. By contrast, ancient stone carvings (as well as their keys, such as the Rosetta stone), are sufficiently durable to last (basically) for ever.

    --
    Toronto-area transit rider? Rate your ride.
  6. Re:How is this different by Marxist+Hacker+42 · · Score: 4, Interesting

    Now that's the right problem. What is needed isn't some mysterious Universal Translator Format- it's storing the read hardware, with programs in ROM that understand the format, along with the electronic copy. Hell, store the whole thing in ROM chips with a well documented interface printed on the outside of the chip. Libraries could be made up of whatever reading technology exists at the time the library is built- with this common pin-level interface.

    --
    SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
  7. Re:Not too long... by FooAtWFU · · Score: 3, Interesting
    I've been wondering, with our global nature now, will we need archeologists in the future? While I believe civilizations will surely 'collapse', won't we all be around to immediately take note of it, and update Wikipedia?

    Archaeology is the search for fact. Not truth. If it's truth you're interested in, Doctor Tyree's Philosophy class is right down the hall. So forget any ideas you've got about lost cities, exotic travel, and digging up the world. We do not follow maps to buried treasure, and 'X' never, ever marks the spot. Seventy percent of all archaeology is done in the library. Research. Reading.

    -- Indiana Jones and the Last Crusade
    --
    The World Wide Web is dying. Soon, we shall have only the Internet.
  8. Re:Not too long... by nido · · Score: 3, Interesting

    Granted it's not like most people care nowadays. Look at any slashdot discussion on education, rather sad how people complain about having to take history (heck or any subject they're not "interested in deeply") in school. People want to be ignorant sheep.

    History is interesting, school makes it suck: "In Year ABC, XYZ happened. Test next week - students who regurgitate well will get an 'A'."

    People don't want to be sheep - totalitarian governments need populations to be docile. School is designed to suck the uniqueness out of children so, as adults, they'll take up a spot on a standardized assembly line.

    Kinda cruel how the government has encouraged the shipping of assembly line jobs to China... Dumb down the population, then get rid of the reason for the dumbing-down.

    See Gatto's Underground History, for example.

    --
    Learn the rules so you know how to break them properly.
    www.teslabox.com
  9. Relax... Google will take care of it... by Panaqqa · · Score: 3, Interesting

    Unless I miss my guess, Google will continue towards its stated objective of making all the world's information searchable and retrievable. Want something archived? Google will take care of it. And if Google fails, my suspicion is the entity that takes their place will take it on.

  10. Re:How is this different by Dun+Malg · · Score: 3, Interesting
    The main problem in our age is thermodynamics -- information is stored so densely that it tends to decay naturally, on its own. By contrast, ancient stone carvings (as well as their keys, such as the Rosetta stone), are sufficiently durable to last (basically) for ever.
    Of course, preserving the data is only half the battle. Figuring out what it says is the second part. This is, of course, nothing new. We still can't read Linear A. In the case of the Rosetta Stone we were simply lucky to find something relating hieroglyphics to a language we knew. The Rosetta Stone is rather unusual. Normally we have nothing so convenient.
    --
    If a job's not worth doing, it's not worth doing right.
  11. Re:How is this different by adrianmonk · · Score: 3, Interesting
    They aren't going to see files. They are going to see 1's and 0's. Lots of them - billions on a memory card and trillions on a hard drive. They won't have a clue how to interpret the file system, even for something relatively simple like FAT16. They may not even know that a byte is 8 bits.

    They might not know that a byte is 8 bits, but with a little analysis, it shouldn't be hard to figure out. There are numerous statistical properties that can be exploited to figure this out relatively easily. For example, with most types of data, the higher-order bits (in any size byte) are more likely to be 0 than the lower-order bits are. Think about how booleans are stored in most systems. Think about the characters in this message: 100% of them have a zero high-order bit. To put it a little differently, there is more entropy in the lower-order bits.

    So, to figure out how many bits there are in a byte, you take your data, and for all reasonable sizes of bytes (say, from 4 bit bytes up to 36 bit bytes), you compute the function that maps bit position (low- or high-order) to an entropy value for that bit. Then you can tell by the shape of that curve which guess about bits per byte was the right guess. Heck, it should be such a strong trend that you can probably automate it!
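
    A rough sketch of that automation, in Python. The function names, the sample vocabulary, and the selection rule (take the smallest candidate width whose mean per-position entropy is within a small tolerance of the lowest one seen) are assumptions made for illustration; the paragraph above only says the shape of the curve should give it away.

```python
import math
import random

def bit_entropy(p):
    """Shannon entropy, in bits, of a binary variable that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def to_bits(data):
    """Flatten a bytes object into a list of 0/1 ints, high-order bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]

def entropy_profile(bits, width):
    """Entropy of each bit position when the stream is cut into width-bit words."""
    words = len(bits) // width
    usable = words * width  # ignore a trailing partial word
    return [bit_entropy(sum(bits[pos:usable:width]) / words)
            for pos in range(width)]

def guess_byte_width(data, candidates=range(4, 37), tolerance=0.01):
    """Guess the word size (assumed heuristic, see the lead-in).

    A wrong width smears predictable and unpredictable bit positions
    together, which raises the average entropy. Multiples of the true
    width score about as well as the true width, so the smallest
    near-minimal candidate is taken.
    """
    bits = to_bits(data)
    mean = {w: sum(entropy_profile(bits, w)) / w for w in candidates}
    best = min(mean.values())
    return min(w for w, h in mean.items() if h <= best + tolerance)

# Hypothetical sample data: seeded pseudo-random English words as ASCII.
rng = random.Random(0)
vocabulary = ("the archive of digital data must stay readable "
              "for future people").split()
sample = " ".join(rng.choices(vocabulary, k=4000)).encode("ascii")

print(guess_byte_width(sample))
```

    On plain ASCII text the high-order bit of every 8-bit byte is zero, so that position carries no entropy when the stream is cut at the right width and gets blurred at the wrong ones. Compressed or encrypted data has no such structure, and this sketch would tell you nothing about it.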

    Remember that future civilizations will probably also use digital data, at least the ones sophisticated enough to try to read the optical and magnetic media. They may not know the FAT32 filesystem, but they will have invented statistics and information theory, and they will be able to make some awfully good guesses at things. And yeah, it might take them 10 or 20 years to be able to read a FAT32 volume correctly if some poor college student of the distant future has to do it on a shoestring budget of grant money, but if they're reading 10,000 year old data, how much does that matter?

  12. The Waste Isolation Pilot Plant site marking by frdmfghtr · · Score: 3, Interesting

    This reminds me of the study done for the Waste Isolation Pilot Plant (http://downlode.org/Etext/wipp/#executivesummary). The study looked at how to mark the site in such a way that the purpose of the site would be indicated for 10,000 years.

    While the WIPP site won't have the benefit of constant updating of the media (it's designed to survive on its own for 10,000 years), it does address some of the same points: longevity of the media, a format that will be usable into the future, and the ability of future civilizations to understand the message.

    Off-topic perhaps but an interesting read.

    --
    Government's idea of a balanced budget: take money from the right pocket to balance...oh who am I kidding?
  13. Re:Not too long... by mattpalmer1086 · · Score: 3, Interesting

    No one seriously working in digital preservation is trying to make a single thing that will last for 50, 100 or 1000 years. The point is not to preserve information in the event of a total civilization collapse, to make it easier for future archaeologists, or some such scenario. The point is to keep our historical digital records *currently* readable at any given point in time. If our civilization collapses, it will be up to those who come after to figure out what we were up to.

    There are two basic strategies to keep our digital files *currently* accessible:

    1) emulation. Check out IBM's Universal Virtual Computer project.
    2) migration. Not only migration of storage media, but migration to new and currently readable formats.

    We will need to migrate all of our digital files every 5-10 years or so to keep them current. And yes, information will get lost along the way - everything decays eventually.
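
    For the storage-media half of migration, a minimal sketch in Python might look like the following. The checksum verification step, the function names, and the directory layout are assumptions added for illustration, not something described above, and the sketch does nothing about converting files to newer formats.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def migrate(old_root, new_root):
    """Copy every file under old_root into new_root, keeping the layout.

    Returns the relative paths whose copy did not hash the same as the
    source, i.e. the files that were damaged along the way.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    damaged = []
    for source in sorted(old_root.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(old_root)
        target = new_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copies contents and file metadata
        if sha256_of(source) != sha256_of(target):
            damaged.append(relative)
    return damaged
```

    Calling migrate() with the old and new mount points every few years would be one way to run the cycle. An empty return value only says the copy matches the source; it cannot tell you whether the source had already decayed before the copy was made.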