Archiving Digital Data an Unsolved Problem
mattnyc99 writes, "It's a huge challenge: how to store digital files so future generations can access them, from engineering plans to family photos. The documents of our time are being recorded as bits and bytes with no guarantee of readability down the line. And as technologies change, we may find our files frozen in forgotten formats. Popular Mechanics asks: Will an entire era of human history be lost?" From the article: "[US national archivist] Thibodeau hopes to develop a system that preserves any type of document — created on any application and any computing platform, and delivered on any digital media — for as long as the United States remains a republic. Complicating matters further, the archive needs to be searchable. When Thibodeau told the head of a government research lab about his mission, the man replied, 'Your problem is so big, it's probably stupid to try and solve it.'"
I can't wait to hear Microsoft's explanation why the project should use one of their proprietary formats.
Apology to Ubuntu forum.
So, they're shooting for about 10 years then?
than the previous ages where all information was kept on paper or in spoken words? The problem isn't so much how to invent something that will always be readable, but some way to always have the applications to read it. If it were not for the Rosetta Stone, much of what we know about the ancient world might still be a mystery.
Support NYCountryLawyer RIAA vs People
Worked for the Egyptians didn't it?
So rise up, all ye lost ones, as one, we'll claw the clouds.
Working at a University, this is not a subject I'm unfamiliar with. We've had lots of discussions about this. Everyone always talks about how many zillions of "pieces of information" are out there. The number of web pages in existence is always bandied about. My point in these discussions is that most of what's out there is crap. Humanity is not lessened by its loss. Good stuff gets reproduced, reviewed, studied, dissected, etc. and survives. It *is* stupid to try to solve this problem, because the problem doesn't need solving.
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
I've seen this very thing happen where I work -- we've lost data over the years because of incompatibility issues. On the other hand, as with many things, it's a huge problem but not an insurmountable one. The key is in planning an anti-obsolescence strategy into every IT decision. Store data files in open formats on robust media and put someone in charge of ensuring the archives are maintained and accessible.
It's not easy, sure, but neither are many of the other tasks we take on as humans.
Give a man a match: warm him for an instant. Douse him in petrol and set him aflame: warm him for the rest of his life.
Since I shoot RAW, I also burn a copy of dcraw.c onto every disc - so even if the current platforms get lost by the wayside, there will be code to convert them still.
;)
Storage itself? Currently burning onto Delkin Archival Gold, storing cool and dark, and in two physically distant locations.
They're also stored on my harddisk, and the best are backed up onto a USB drive.
If it looks like the DVD-ROM drive is becoming obsolete I'll burn them on to whatever comes along next.
If you're truly paranoid you can always print them on archival quality paper using pigment based inks
There are only two ways of doing this: keeping a copy of every program used to create these files (and a system to run them on) or converting them to some open and well-supported format.
For text documents, HTML is probably the best bet. It is so widely used and supported that readers are almost guaranteed to exist as long as computers do in their current form. (And if something ever truly supersedes it, a mass-conversion program will be written anyway.) HTML probably works for basic spreadsheets too. Graphics support for GIF, JPEG, and PNG is probably at that level as well, and MP3 for music.
As a bonus, most of the native programs for the documents to be preserved have translators to these formats already.
Beyond that I have no idea.
'Sensible' is a curse word.
From TSA: "Popular Mechanics asks: Will an entire era of human history be lost?"
Obviously not; Popular Mechanics itself has preserved much of the era in traditional hardcopy formats, making it no less lossy than previous printed-word eras.
Of course, understanding the era from such incomplete and unreliable records will be a challenge to archaeologists and historians; again, not much different from previous eras.
In conclusion: doesn't matter, hardly news.
Any sufficiently well-organized community is indistinguishable from Government.
I'd trust that guy. If there's one thing our government knows, it's stupidity.
"Was it a millionaire who said 'Imagine No Posessions?'" -- Elvis Costello
Interestingly, this Slashdot article is shown to me with an advertisement for HD-DVD, which has a data format "forgotten" by design.
In this era of virtualization, the solution for x86 software is as easy as retaining a copy of the primary partition of a computer originally used to work with the desired files. Searchability could be a problem for proprietary data formats, but the move to open standards in the future will mitigate that.
The real problem is 60 years of archives of antiquated, proprietary, task-specific and mainframe computer data cards and tapes whose original programmers are halfway to cedar boxes; if the government can't get their support in time it may as well call all the early stuff a loss and hand it over to archaeologists.
(It's never too late to join the Renaissance)
It really isn't a question WHETHER we will be able to read old digital data in the future. After all, humans invented these formats, flawed as they may be, and humans can decipher them with enough effort. We can crack cryptography -- a deliberate attempt to make it as difficult as possible to decipher certain information. So it's hard to imagine any data format that could not be deciphered in the future with some honest effort.
Instead it is a question of whether the data is WORTH the effort. From an anthropological standpoint, this is valuable historical data, and its value is not decreased by our inability to interpret it. The benefit of digital data is that it can be copied even if we don't know what it means. It will not erode or decay like other historical artifacts, if we put in the small effort required to preserve it. Assuming humanity doesn't self-destruct, there will be plenty of time in the future for historians to decipher and interpret the data when a need arises for it.
I believe Ray Bradbury had something to say on this subject.
Perhaps more ironic -- it's a pretty good bet that whatever he wrote on the subject, it's not available online due to copyright restrictions imposed by his publisher or "estate."
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
I wonder what archaeologists will think of the Zune :)
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
It happened recently. When I was a lad, the BBC and UK schools composed a "domesday book", which was supposed to be a parallel to the original Domesday book, which was a bit more than a census from the UK made in 1086. The modern one used the popular home PC, the BBC Micro (made by Acorn). It was made on laserdisk, and distributed around the UK to the schools that had compiled the information.
Well, 15 years on, it was useless. The then-proprietary format was not readable on anything modern, and there was not much of the old hardware around either. You can google for it ("UK domesday bbc data" should do it), the first link I saw was on the Guardian Online.
I've still got stuff on floppies, but no-one builds PCs with them anymore. I've got two old laptops with floppy drives, the other three computers have none. (OK, I also have two corpses with floppy drives, and the controllers on two of the new PCs will accept floppy drives, but, please take my point - they're going out of fashion.)
In 20 years time, there will probably be no CD/DVD drives, we'll all be using a new more portable, more backupable, lighter, faster, probably online-only storage medium. Kids won't recognize laserdisks, floppies, or USB ports. They might not recognise keyboards either - who knows?
Note to ACs: I won't mod you up, even if you are being funny or insightful. So take a chance! It's not real life!
Open and widely published formats are good, of course. But if you're looking for a really long term solution (as in multiple millennia), then I think the prime requirement other than physical durability should be easy reverse engineering. This way the data has some hope of recovery even if the knowledge of the format has been lost. This generally means that simpler is better. Things like plain ASCII text. Uncompressed and unencrypted image and/or audio data. Verbose ASCII-based vector graphics. Things like that. Put it all on a durable, low density, and simply formatted medium that will easily give up its secrets to relatively low-tech and completely non-specialized tools like a microscope. It's not the most efficient way to store data, but it's much more likely to be usable by future archaeologists than things like MS-Word files, WMA files, JPGs, MP3s, etc.
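To make the "simpler is better" point concrete, here is a minimal Python sketch that stores a tiny grayscale image as plain ASCII text, using the Netpbm plain PGM ("P2") layout, and reads it back with nothing fancier than whitespace splitting. The function names and the sample pixel values are made up for illustration; this is one possible approach, not a recommendation from the article.

```python
def to_ascii_pgm(pixels, maxval=255):
    """Serialize a 2-D list of grayscale values as plain-text PGM (Netpbm 'P2')."""
    height = len(pixels)
    width = len(pixels[0])
    # Header: magic number, dimensions, maximum gray value, all human-readable.
    lines = ["P2", f"{width} {height}", str(maxval)]
    for row in pixels:
        lines.append(" ".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


def from_ascii_pgm(text):
    """Parse plain-text PGM back into a 2-D list, skipping '#' comments."""
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop anything after a comment marker
        tokens.extend(line.split())
    if not tokens or tokens[0] != "P2":
        raise ValueError("not a plain PGM file")
    width, height = int(tokens[1]), int(tokens[2])
    values = [int(t) for t in tokens[4:4 + width * height]]
    return [values[r * width:(r + 1) * width] for r in range(height)]


if __name__ == "__main__":
    image = [[0, 128, 255], [255, 128, 0]]  # hypothetical 3x2 sample image
    text = to_ascii_pgm(image)
    print(text, end="")
    assert from_ascii_pgm(text) == image
```

The file is many times larger than a JPEG would be, but someone who has never heard of the format can still guess what it means by looking at it.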
Backups are for wimps. Real men upload their data to an FTP site and have everyone else mirror it. -- Linus Torvalds
Afaik, cds are the worst media to 'backup' your precious data.
The first burnable cds you could buy (in the '90s) were of a decent quality. I still have some burned ones around, and they are still readable (older than 10 yrs).
But some newer ones (cheaper, & mass-marketing 'mode') are of an awful quality: I have plenty that "died" when reading them. It begins with some bad CRCs, and then more & more & more, till nothing valuable can be read off them. This happened in LESS THAN 2 YEARS.
The problem with cds:
- They hate sunlight
- they hate being in a too hot, or too cold place
- they hate being in a place with too much/not enough humidity
- and the worst: they react with air (oxygen).
They're built from 2mm plastic, the dye is on top, with some 'protective' layer over it. Some are better than others.
Now with DVDs, they seem to be of a much better quality already, and the explanation is simple: the dye isn't on the surface anymore, but between 2 sheets of plastic glued together. The reaction with air seems to be insignificant. Atm, I have no single failing DVDR that I know of.
But some brands are of better quality than others.
And btw:
"Real men don't use backups, they post their stuff on a public ftp server and let the rest of the world make copies." - Linus Torvalds
Pressed/stamped CDs (like commercial audio CDs) age fairly well, given appropriate handling (well, at least my 20yo copy of Greetings from Asbury Park, NJ is still playable). Recordable CDs, however, aren't stamped. Instead, they use a laser-sensitive dye. Some of the earliest used a blue dye (cyanine?) that wasn't stable and degraded after just a few years (10). Even discs with better dyes are sometimes not sealed properly and can go bad.
;)
That said, there are some newer dyes that are claimed to be stable for a hundred years. I haven't ever seen these in stores, so they may be seriously expensive, or maybe I just don't know where to shop...
Just junk food for thought...
I'm doing my part by working on a project where I'm copying every single MySpace page onto stone tablets.
When future archeologists dig them up and see "LOL Bobby Ray Sucks!" and "D00d 1 pwnz3r3d U!!1!", they'll understand that our civilization didn't just decline; our only choice was to destroy ourselves because we were so lame.
republic (plural republics)
1. A state where sovereignty rests with the people or their representatives, rather than with a monarch or emperor; a country with no monarchy.
http://en.wiktionary.org/wiki/republic
Don't forget funding. I've seen vast amounts of data disappear when nobody was willing to pay for its storage. This is common in large bureaucracies. You've spent years building and maintaining a library, and then it all ends up in a dumpster when the parent organization is eliminated.
Mea navis aericumbens anguillis abundat
Unless I miss my guess, Google will continue towards its stated objective of making all the world's information searchable and retrievable. Want something archived, Google will take care of it. And if Google fails, my suspicion is the entity that takes their place will take it on.
Just because a job isn't easy doesn't mean it's not important.
In the early 1960s a wise man spoke
/ quote
We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.
/ unquote
We went to the Moon, and all the signals received, including a high-definition version of the pictures (by the technology of the time), were recorded at NASA (and also, I believe, at the Parkes station in Australia, where the signals were received through its deep space network radio telescope). These most important "documents" of our time have been lost, never to be recovered, leaving us purely with the broadcast version, which was of much lower quality (e.g. a poor-quality photocopy).
It's important for our history, for the essence of our technology, and for who we are as a people to preserve these important events for future generations.
When you look at this planet, we regularly go on a rampage where the technology is lost and we are thrown back hundreds of years. Take Ancient Egypt, the technology of the first millennium, the Great Library of Alexandria (Atlantis, etc.): so much of the past that we have lost, and are poorer for as a result.
Can't we get it right this time, as we face our possible next destructive surge, whether it comes from climate, economics, famine, nuclear war, microbiological warfare or disease (natural or man-made), a chemical accident causing a chain reaction, etc.? There are so many risks. Let's do this before it's too late: too late to be done, and too late to be able to be done.
Darren Stephens
Adelaide, Australia
Doubtless the Anglo-Saxons felt the same way about their rubbish... and yet archaeologists get orgasmic over the everyday bits and pieces that tell them so much about how those normal people led their lives.
The question isn't IF it will disappear, the question is really WHEN and HOW. Printing to paper-based hardcopy helps for a few hundred years. It can be recopied from paper to paper easily - it's a very low context solution: ink on paper followed by ink on paper. So, important information about our society can be transferred across generations, even if the generations have no electricity at all. This is how we know Shakespeare, for instance.
Many people say "Oh, but we'll have some NEW technology that will take care of it". This assumes that the resource base for a new technology will be as generous and dense as our present resource base provides. This is a VERY unwise presumption, as there is categorically no proof that such will be the case. In fact, there are a variety of intense warning signs that suggest quite the contrary.
From the evidence I have found, and, oddly, I've studied this for a number of years now, I am fairly well convinced that industrial civilisation will simply erase itself from the human record as little more than a horribly polluted stain that destroyed itself through overpopulation and environmental stupidity. All the music you hear, all the shows you watch, all the films you cried at, it will all go away. Poof. This also means that self-absorbed hucksters like Madonna, Britney Spears, Michael Jackson, Tom Cruise, and their supporting technology of TV, Radio, DVD/CD, etc will also disappear - just the flotsam of "entertainment" culture.
The long term future will be people chasing bison/cows across the prairie or living in small agrarian villages bound by localised population bursts and die-offs. But it will take several centuries to get there. In the meantime we've got our MTV and Orange Crush. The most important thing to remember is this: not getting to that Star Trek future IS NOT A BAD THING. We pissed away the globe's resources on our Xboxes, SUVs, jetset vacationlands, and all the other minutiae and ephemera that make a society "civilised" and provide "leisure activity". All societies have that, to varying degrees. We just had more of it, thanks to our insane and unrelenting exploitation of resources, petroleum, and electrical generation. But it will all go away, and THAT'S OK.
We will disappear. We Are Atlantis.
RS
Shoes for Industry. Shoes for the Dead.
I ask: has this ever happened before?
Not necessarily in electronic bits and bytes. Not the "Alexandria Library" that was mostly duplicated in other libraries or private collections. Maybe like the Inca quipu, mats of knotted strings that recorded all their empire's operational records, other than the ceremonial records in statues and murals. But some quipu survive, despite Spaniards destroying most of them in the mid-1500s. Enough that we can at least recognize that they did have records of lots of transactions.
No, something more transient, as transient as our bits, read/written by something more transient than our metal/plastic/glass machines. Maybe songs or other performed stories, like tribal Australians. Maybe woven in more degradable material, like uncured plant matter. Maybe both, like the Pacific star navigation lore taught in temporary woven sticks, but carried in the mind. Maybe patterns in some other loseable medium, like animal pelt patterns no longer readable now that the code has been lost, or interbred back into "blankness".
If it can happen to us, it could have happened before. Our civilization rose from meager beginnings only about 12K years ago, after the last Ice Age that lasted about 12Ky. There was another one before that, with people accumulating knowledge between. And probably a half-dozen or so others since we became as genetically developed as we are today, between 7Mya and 200Kya. We don't even have many records from the first half of the last 12Ky. Could we be reinventing the wheel, literally, every 25 thousand years?
--
make install -not war
Archeologist from the 23rd century going through our email archives: "Wow! These guys must have had humongous penises with all the enlarging going on!"
As a Slashdot discussion grows longer, the probability of an analogy involving cars approaches one.
Look.
In 100 years, you will be forgotten.
In 1000 years, your country will be forgotten.
In 10000 years, your civilisation will be forgotten.
In 100000 years, your species will be forgotten.
One thing you can absolutely count on is that you and everything you find familiar will be lost and forgotten. Nothing that you accomplish, no matter how famous, infamous or worthy will be remembered in 10,000 years.
There is only one contribution you can make which will have any lasting effect at all, and I'll let you work out what that is for yourself.
As a game developer, it's profoundly disturbing how casually we treat games just a few years old. Hardware will continue to evolve and OSes will change; we really need a way to secure our ability to play old games.
Console games are semi-okay because you can at least keep the (static) hardware around, but PC games are in bad shape. PCs evolve gradually, and it only takes one small OS or video driver change to render a game unplayable. Because games are a commercial medium, games simply aren't supported once it's no longer financially beneficial.
As long as there are programmers out there willing to write emulators, I suppose we're okay... but it still makes me nervous.
As I perused the contents of said stack of discs, I found that almost 90% of them were redundant or out of date copies of files I had completely forgotten about.
Well then I have a question that I would like to throw out to Slashdot readers. Like the person who wrote the parent, I have tons of old files on my hard drives. I always run at least two hard drives, using one for backups. Then when I upgrade computers, I bring over one of the old hard drives to the new computer, copy it to the new drive, then continue to use it to back up new material. By now I have files duplicated and triplicated all over the place. After almost a decade of this, I have many GB of files which would probably condense down to a fraction if all the duplications were eliminated. What kind of software do I need that will analyze all my files and automatically find and remove duplicates? - or do I need to develop such software for myself? ...and if I do, then is there a niche for commercialization of such software?
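For what it's worth, the core of such a tool is small. Below is a rough Python sketch, assuming that "duplicate" means byte-for-byte identical content: it groups files by size first (cheap), then confirms candidates with a SHA-256 hash. The function names are made up for this example, and it only reports duplicate groups rather than deleting anything, since automatic removal from backup drives deserves a human look first.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of groups (sorted path lists) of identical files under root."""
    # Pass 1: bucket by size, since files of different sizes cannot match.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with at least one other.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for group in find_duplicates(root):
        print("identical files:")
        for p in group:
            print("   ", p)
```

Run it with a directory as the only argument. Checking size before hashing means most files on a big drive never get read at all.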
... just print all the ones and zeros out on paper, so that later on others
can just read it all back in again with OCR! Oh, I know we could use
punch cards instead, but we don't want our kids to laugh at us, do we?
Besides, if we print the ones and zeros real small, we can achieve higher
data densities.
I doubt you'd sell many Nano-Pump (tm) enlargement kits. It's all in the name, even in the future.
SAILING MISHAP
Yep. Microsoft's commitment to their "Plays for Sure" campaign with the Zune really instills confidence in their backwards compatibility.
At least with OpenOffice I can legally archive the source code and install images needed to access the data for that period (say, every year or six months.) Sort of like dropping a copy of TrueCrypt on a DVD full of crypto archives.
With the new DRM keys and license enforcement policies, I dread someday trying to resurrect an old image so I can access data archives, only to find it wants to register with a DRM verification service that no longer runs or is no longer compatible with a 4-5 year old install image.
I do not fail; I succeed at finding out what does not work.
This reminds me of the study done for the Waste Isolation Pilot Plant (http://downlode.org/Etext/wipp/#executivesummary) . The study looked at how to mark the site in such a way that the purpose of the site would be indicated for 10,000 years.
While the WIPP site won't have the benefit of constant updating of the media (it's designed to survive on its own for 10,000 years) it does address some of the same points: longevity of the media, a format that will be usable into the future, and ability of future civilizations to understand the message.
Off-topic perhaps but an interesting read.
Government's idea of a balanced budget: take money from the right pocket to balance...oh who am I kidding?
Not every piece of digital info can be saved that way, or needs to be saved, as others have pointed out. Current college textbooks, some history books, literature and music and an encyclopaedia will go a long way to create a useful memory of our times for the future.
Some years ago, in California, they opened up a 100-year time capsule. I do not remember the stuff that was in it, but it was mostly useless junk by our standards today. If we could send an e-mail back in time, we would ask them to include totally different things. It is easy to make the same mistake now as to content.
To most people, any of the files they used on computers before their first "IBM Compatible" is probably lost forever already. Think of how many files are "frozen" on 5.25" floppy disk for the Commodore 64 alone!
That doesn't have to be the case though. You can retrieve files from disks of hundreds of different 80's era computers on a modern PC using a Catweasel card. http://www.vesalia.de/e_catweaselmk4.htm
With the catweasel, a standard 5.25" PC floppy disk drive (hello, ebay), and a 3.5" PC floppy disk drive there's hardly a floppy disk you won't be able to retrieve your petrified files from.
Finding a program that can do anything with those files is another subject entirely.
... and in the DRM, bind them.