Archiving Digital Data an Unsolved Problem
mattnyc99 writes, "It's a huge challenge: how to store digital files so future generations can access them, from engineering plans to family photos. The documents of our time are being recorded as bits and bytes with no guarantee of readability down the line. And as technologies change, we may find our files frozen in forgotten formats. Popular Mechanics asks: Will an entire era of human history be lost?" From the article: "[US national archivist] Thibodeau hopes to develop a system that preserves any type of document — created on any application and any computing platform, and delivered on any digital media — for as long as the United States remains a republic. Complicating matters further, the archive needs to be searchable. When Thibodeau told the head of a government research lab about his mission, the man replied, 'Your problem is so big, it's probably stupid to try and solve it.'"
I can't wait to hear Microsoft's explanation why the project should use one of their proprietary formats.
Apology to Ubuntu forum.
So, they're shooting for about 10 years then?
How is this different from the previous ages, when all information was kept on paper or in spoken words? The problem isn't so much how to invent something that will always be readable, but how to always have the applications to read it. If it were not for the Rosetta Stone, much of what we know about the ancient world might still be a mystery.
Support NYCountryLawyer RIAA vs People
Worked for the Egyptians didn't it?
So rise up, all ye lost ones, as one, we'll claw the clouds.
Working at a University, this is not a subject I'm unfamiliar with. We've had lots of discussions about this. Everyone always talks about how many zillions of "pieces of information" are out there. The number of web pages in existence is always bandied about. My point in these discussions is that most of what's out there is crap. Humanity is not lessened by its loss. Good stuff gets reproduced, reviewed, studied, dissected, etc. and survives. It *is* stupid to try to solve this problem, because the problem doesn't need solving.
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
I've seen this very thing happen where I work -- we've lost data over the years because of incompatibility issues. On the other hand, as with many things, it's a huge problem but not an insurmountable one. The key is in planning an anti-obsolescence strategy into every IT decision. Store data files in open formats on robust media and put someone in charge of ensuring the archives are maintained and accessible.
It's not easy, sure, but neither are many of the other tasks we take on as humans.
Give a man a match: warm him for an instant. Douse him in petrol and set him aflame: warm him for the rest of his life.
Thibodeau hopes to develop a system that preserves any type of document... for as long as the United States remains a republic
So he only needs to archive up to November 7th, 2000? That should help him with managing the scope.
Basically with the draconian virtual ban on reverse engineering of formats .. this sort of thing can be expected. Especially since copyrights for even abandoned works will be extended indefinitely.
I've been trying to develop software to do it... unfortunately, my amazing abilities at procrastination and wanting to constantly redesign the project have left it languishing for nine years.
Then I keep seeing articles on archiving projects and think I really should get back to work on it...
Since I shoot RAW, I also burn a copy of dcraw.c onto every disc - so even if the current platforms get lost by the wayside, there will be code to convert them still.
;)
Storage itself? Currently burning onto Delkin Archival Gold, storing cool and dark, and in two physically distant locations.
They're also stored on my harddisk, and the best are backed up onto a USB drive.
If it looks like the DVD-ROM drive is becoming obsolete I'll burn them on to whatever comes along next.
If you're truly paranoid you can always print them on archival-quality paper using pigment-based inks.
IBM will NEVER shoot that baby in the head, so there will be Notes databases around when my grandkids are long dead.
... for as long as the United States remains a republic.
So like, what the next decade at most.... no problem.
There are only two ways of doing this: keeping a copy of every program used to create these files (and a system to run them on) or converting them to some open and well-supported format.
For text documents, HTML is probably the best bet. It is so widely used and supported that readers are almost guaranteed to exist as long as computers do in their current form. (And if something ever truly supersedes it, a mass-conversion program will be written anyway.) HTML probably works for basic spreadsheets too. Graphics support for GIF, JPEG, and PNG is probably at that level as well, and MP3 for music.
As a bonus, most of the native programs for the documents to be preserved have translators to these formats already.
Beyond that I have no idea.
'Sensible' is a curse word.
There are several companies out there which specialize in Document Imaging Software, specifically for searchable archive purposes. The primary problem is simply the manpower to write the number of conversion filters necessary to import external data formats into the database's internal format; the storage and search/retrieval problems are mostly solved already.
Disclaimer: I used to be an engineering intern at Laserfiche
This isn't the '80s, and almost any file being saved in archives is in a format that many programs can open, meaning that the specifications for those formats are known... regardless of whether or not it is legal. Even Word files are viewable by a number of applications, and nobody is archiving historical information with advanced macros, so don't even post with that macro crap.
Also to assume that future generations won't have the sense or ability to figure out how to open files we write is silly.
Just because "some" businesses (or the military, as the article suggests) find opening archived information ON THE FLY difficult doesn't mean a (more technologically advanced) society wanting to learn about its past will have the same limitations. This article is just another example of entry-level "tech writers" and of how low journalistic standards are.
PS
I am not a journalist... so save your grammer and spelling corrections for someone who is.
From TSA: "Popular Mechanics asks: Will an entire era of human history be lost?"
Obviously not; Popular Mechanics itself has preserved much of the era in traditional hardcopy formats, making it no less lossy than previous printed-word eras.
Of course, understanding the era from such incomplete and unreliable records will be a challenge to archaeologists and historians; again, not much different from previous eras.
In conclusion: doesn't matter, hardly news.
Any sufficiently well-organized community is indistinguishable from Government.
I'd trust that guy. If there's one thing our government knows, it's stupidity.
"Was it a millionaire who said 'Imagine No Posessions?'" -- Elvis Costello
Interestingly, this Slashdot article is shown to me with an advertisement for HD-DVD, which has a data format "forgotten" by design.
We are unraveling history using models of mitochondrial DNA genetic drift built on data collected across the planet, and archivists are concerned about future generations not having Office 2003 compatible software? OK, so making it broadly available and searchable to current generations may be a challenge, but they can't seriously be concerned about future researchers not being able to read our data formats. I suppose we should be concerned as to whether the physical media will survive, but I doubt we need to worry about our computer-illiterate progeny being able to figure these things out.
In this era of virtualization, the solution for x86 software is as easy as retaining a copy of the primary partition of a computer originally used to work with the desired files. Searchability could be a problem for proprietary data formats, but the move to open standards in the future will mitigate that.
The real problem is 60 years of archives of antiquated, proprietary, task-specific and mainframe computer data cards and tapes whose original programmers are halfway to cedar boxes; if the government can't get their support in time it may as well call all the early stuff a loss and hand it over to archaeologists.
(It's never too late to join the Renaissance)
'Your problem is so big, it's probably stupid to try and solve it.'"
Sounds like general end-user hate crime to me. Hey, I've been guilty many times of shunning a user because I didn't feel like fixing his stupid problem.
"how to store digital files so future generations can access them"
Quite simply, you don't store them in one format. Just move everything every 10 years or so. In fact, with Moore's Law and all, you will probably be able to store everything you had before in 1 of whatever is new 10 years later. Hire some part timers to move it or something. It's not a hard problem. It's just an inconvenient one.
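A back-of-the-envelope version of that argument, as a sketch only. It assumes storage capacity per device doubles every two years (an assumption for illustration, not a figure from this thread) and that the old archive exactly filled one old device. The function name is made up.

```python
def fraction_of_new_device(years, doubling_period_years=2.0):
    """Share of one new device needed to hold an archive that filled one old device.

    Assumes capacity per device doubles every `doubling_period_years`
    (an illustrative assumption, not a measured figure).
    """
    growth = 2 ** (years / doubling_period_years)
    return 1.0 / growth


if __name__ == "__main__":
    for years in (10, 20, 30):
        share = fraction_of_new_device(years)
        print(f"after {years} years: {share:.4%} of one new device")
```

Under those assumptions, ten years of growth leaves the old archive taking 1/32 of a single new device, roughly 3%. A slower doubling period changes the number but not the shape of the argument.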
It really isn't a question WHETHER we will be able to read old digital data in the future. After all, humans invented these formats, flawed as they may be, and humans can decipher them with enough effort. We can crack cryptography -- a deliberate attempt to make it as difficult as possible to decipher certain information. So it's hard to imagine any data format that could not be deciphered in the future with some honest effort.
Instead it is a question of whether the data is WORTH the effort. From an anthropological standpoint, this is valuable historical data, and its value is not decreased by our inability to interpret it. The benefit of digital data is that it can be copied even if we don't know what it means. It will not erode or decay like other historical artifacts, if we put in the small effort required to preserve it. Assuming humanity doesn't self-destruct, there will be plenty of time in the future for historians to decipher and interpret the data when a need arises for it.
NuParadigm is a company that has software that does scanning, indexing and workflow. I used it at WashU and it is terrific. They call the software DataFlow.
Open Office Docs, FTW!
Chums up, let's do this!
I think using open formats is good. And standardized formats and well documented formats. :)
Plain text files (.txt) are very safe.
I believe Ray Bradbury had something to say on this subject.
Perhaps more ironic -- it's a pretty good bet that whatever he wrote on the subject, it's not available online due to copyright restrictions imposed by his publisher or "estate."
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
There's an infinite amount of trivia that could be recorded. We could all go around recording "my life in HDTV" at 900GB/hr uncompressed, but it just wouldn't be meaningful. Sure, a certain sample of "everyday life in $foo" is useful, but on the whole who cares. And with digital media, this should be simpler than ever, since with proper redundancy you should never experience data loss. Obscure image format? Find a decoder, store it as PNG. Yes, it'll be a lot bigger but you'll never have to worry about lost data from the original or keeping support for a kazillion old formats. You just have to be slightly critical, and don't stretch yourself so thin you could lose something actually important. I refuse to believe that everything we do now is so much more "important" than what people did 50 years ago or 100 years ago. You should be able to do more than ever before, and perfection can never be reached. For example, I could say "show me a part of the $foo forest untouched by human hands". Biologists and whatever would love it. Suddenly you're no longer talking about 900GB*8billion; you've blown right off the scale. What's "important"? News media, encyclopedias, wikipedia etc. all screen stuff (well, you can put *almost* any trivia on wikipedia). If a bear shits in the woods, and no one gives a crap, why keep a record of it?
Live today, because you never know what tomorrow brings
This is why DRM is bad: it can make data hard to read many years later.
Create or choose a lossless, unencrypted format that fits each type of file. Make sure they are always supported with free libraries and utilities. Also, find a type of format that can shrink the size of files (like zip or something).
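To make "lossless" concrete, here is a minimal sketch using Python's standard zlib module, which implements DEFLATE, the compression method commonly used inside zip files. The helper names (pack, unpack, is_lossless) are made up for illustration; the point is only that a round trip must return byte-identical data.

```python
import hashlib
import zlib


def pack(data: bytes) -> bytes:
    """Compress losslessly with DEFLATE at the highest compression level."""
    return zlib.compress(data, 9)


def unpack(blob: bytes) -> bytes:
    """Reverse pack(); raises zlib.error if the blob is damaged."""
    return zlib.decompress(blob)


def is_lossless(data: bytes) -> bool:
    """True when compressing and decompressing returns byte-identical data."""
    restored = unpack(pack(data))
    return hashlib.sha256(restored).digest() == hashlib.sha256(data).digest()


# Highly repetitive sample data, so it should shrink noticeably.
sample = b"family photo caption, plain text\n" * 1000
packed = pack(sample)
```

Comparing hashes rather than trusting the tool is the habit worth keeping: whatever format gets chosen, check that what comes back out is exactly what went in.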
Can somebody explain to me how much CDs decay? I thought they were pretty much sealed... except for that exotic mushroom that eats the silicon layer... (and even then, although *we* can't read it now with a laser, the information should still be there in the plastic...)
--
Karma 50, and all I got was this lousy T-Shirt.
How much of this stuff really has such high priority? I'm pretty sure I won't want people looking back and finding old myspace blogs and thinking... "Wow everyone 1000 years ago deserved to die."
The good stuff will get saved, the bad stuff, who cares?
...XML of course. XML solves all of the world's ills!
i thought XML is the interim cure for this problem.
also, think of the gee...what do i say, "exabytes" worth of new data that will be created in the next 100 years. does anyone really think there will be a strong interest in dredging up and analyzing all of today's mostly circularly repetitive drivel? some of it, perhaps a snapshot from the past could be preserved, but why everything??
how often do you think about accessing your great, great, great, great grandfather's love letters?
I wonder what archaeologists will think of the Zune :)
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
It happened recently. When I was a lad, the BBC and UK schools composed a "domesday book", which was supposed to be a parallel to the original Domesday book, which was a bit more than a census of the UK made in 1086. The modern one used the popular home PC, the BBC Micro (made by Acorn). It was made on laserdisk, and distributed around the UK to the schools that had compiled the information.
Well, 15 years on, it was useless. The then-proprietary format was not readable on anything modern, and there was not much of the old hardware around either. You can google for it ("UK domesday bbc data" should do it), the first link I saw was on the Guardian Online.
I've still got stuff on floppies, but no-one builds PCs with them anymore. I've got two old laptops with floppy drives, the other three computers have none. (OK, I also have two corpses with floppy drives, and the controllers on two of the new PCs will accept floppy drives, but, please take my point - they're going out of fashion.)
In 20 years time, there will probably be no CD/DVD drives, we'll all be using a new more portable, more backupable, lighter, faster, probably online-only storage medium. Kids won't recognize laserdisks, floppies, or USB ports. They might not recognise keyboards either - who knows?
Note to ACs: I won't mod you up, even if you are being funny or insightful. So take a chance! It's not real life!
Did you happen to eat any of that exotic mushroom? Your thoughts are barely coherent.
XML on paper tape.
The era of restoration comes. However, when people blow the dust off those old DVDs and players, they discover that the DVDs have decayed to the point of unreadability. Massive quantities of archived data and knowledge are irretrievably lost.
There goes my copy of Just Like Heaven! Oh the humanity!
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
I have a HD specifically allocated for "stuff I plan on keeping forever". I limit it to one of: pdf, tiff, jpg, gif, html, wav, mp3, and plain txt files. The HD is FAT-32 formatted and reads and writes nicely both from my OS-X Mac and Windows-XP PC. On the mac I have a program (graphicConverter) which will, among other things, do batch converts. In a single command, I can convert *.xxx to *.yyy. For example, convert every single tiff on the hard drive to a pdf.
While it might be many days of crunching, it would seem that should some format be on its way out, or some new format prove itself to be the "way of the future", there will be programs to convert *.one to *.theOther. It might take a lot of cpu time, but that is not a big deal. (For example, I just recently converted 300+ GB of .wav files to circa 30 GB of poor quality MP3s so that I could take my ENTIRE music library on vacation with me and not lug around the big hard drive; this took about 4 days of background CPU time!)
Nevertheless, I cannot imagine there will not be a simple capability to convert *.one to *.theOther, on a giant scale if necessary.
This is not like the project I did a couple of years ago, where I converted my reel-to-reel tapes to digital format. That required a massive PHYSICAL effort, mounting reels, monitoring the conversion, etc. Once in digital format, converting to new formats, copying to new kinds of storage mediums, whatever I can imagine in the future, will now be as simple as dragging from one icon to another.
So why are we worried? Is this just FUD?
I know this doesn't answer the format question, but the media problem can be solved by having multiple copies "in the ether".
I reproduce all my (and several other people's) data on several different machines in different geographic locations, doing it efficiently with "rsync" (and other free tools).
Hard disks come and go, optical and magnetic media fail with time, but the strategy of multiple copies keeps things safe. When was the last time you had 4 machines fail simultaneously in 4 different parts of the country?
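Multiple copies only help if you notice when one of them goes bad. A minimal sketch of that check, assuming each replica is reachable as a local or mounted directory; the function names are hypothetical and this is not part of rsync itself.

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def find_disagreements(replicas):
    """Return relative paths that are missing from, or differ in, any replica."""
    trees = [digest_tree(replica) for replica in replicas]
    all_paths = set().union(*trees)
    # A path is fine only when every replica reports the same digest for it.
    return sorted(p for p in all_paths if len({t.get(p) for t in trees}) > 1)
```

Running something like this on a schedule turns "I have four copies" into "I have four copies that still agree". It reads whole files into memory, so very large files would want chunked hashing instead.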
Posting anonymously from Redmond. I think CVS is a good answer. Set up a server, do regular backups, and you're done. Sure, the DB grows as documents change (esp. those not in a text format for diff to work), but all your data is there. You will have to buy a new hard drive every now and then for backups, but your data is safe. If you need security as well, use ssh for the CVS connections, and use some partition encryption program. I wouldn't encrypt the whole partition though, as it might be seen as trying to hide something suspicious from the authorities.
Why not just broadcast all data out into space. Maybe we can set up a relayer way far away and bounce it back to earth and back again indefinitely.
Asking people to think is like asking them to buy you a new car
Open and widely published formats are good, of course. But if you're looking for a really long term solution (as in multiple millennia), then I think the prime requirement other than physical durability should be easy reverse engineering. This way the data has some hope of recovery even if the knowledge of the format has been lost. This generally means that simpler is better. Things like plain ascii text. Uncompressed and unencrypted image and/or audio data. Verbose ascii based vector graphics. Things like that. Put it all on a durable, low density, and simply formatted medium that will easily give up its secrets to relatively low-tech and completely non-specialized tools like a microscope. It's not the most efficient way to store data, but it's much more likely to be usable by future archaeologists than things like MS-Word files, WMA files, JPG's, MP3's, etc.
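As one example of "uncompressed, plain ascii" image data, here is a small sketch that writes and reads a grayscale image in the Netpbm plain PGM layout (magic "P2", then width, height, maximum value, then one decimal number per pixel). The function names are made up, and the reader is deliberately minimal: it does not handle the "#" comment lines that the format allows.

```python
def to_ascii_pgm(pixels, maxval=255):
    """Serialize rows of integer gray values as plain-text PGM ("P2")."""
    height = len(pixels)
    width = len(pixels[0])
    lines = ["P2", f"{width} {height}", str(maxval)]
    for row in pixels:
        lines.append(" ".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


def from_ascii_pgm(text):
    """Parse plain-text PGM back into rows of integers (no comment support)."""
    tokens = text.split()
    if not tokens or tokens[0] != "P2":
        raise ValueError("not a plain PGM file")
    width, height = int(tokens[1]), int(tokens[2])
    values = [int(token) for token in tokens[4:]]  # tokens[3] is maxval
    if len(values) != width * height:
        raise ValueError("pixel count does not match the header")
    return [values[r * width:(r + 1) * width] for r in range(height)]
```

Someone who had never heard of the format could open such a file in any text viewer, see two small numbers followed by a grid of values, and have a fair chance of guessing it is a picture. That is the reverse-engineering property in question, paid for with a much larger file.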
Backups are for wimps. Real men upload their data to an FTP site and have everyone else mirror it. -- Linus Torvalds
This is one of the reasons open standards are important. Not that open formats last forever, but at least they are documented, which means there's some hope of deciphering them after the software that does so is no longer maintained. Of course, that doesn't solve the problem of how to make the actual data survive...hard disks and tapes demagnetize, optical disks become translucent or otherwise unreadable, etc.
Please correct me if I got my facts wrong.
I think the best solution currently available, is to include with each copy of your data (or on each backup volume) some source-code implementation of a document reader or parser, in a commonly understood and well-documented language, probably ANSI C (although Ada has all of its documentation in the public domain, so you could include it as well).
This wouldn't help you if you expect people to lose the ability to read the media that you're storing the data and source code on, but that's a much more complicated problem. At that point, you're really talking about stone tablets or metal engravings, rather than backup tapes or CDs.
In terms of practical solutions, ensuring that there are source-available readers, written without external dependencies (besides a compiler), for various document formats, is probably the best way to ensure that they'll be readable. Somewhere else in this thread, someone gives an example: storing a source copy of a GPLed RAW-file processor on each CD containing RAW images. This seems like a very good idea: assuming that your eventual user can read the media, even if their machine architecture is different and readers don't exist, they have a solvable problem: either find a compiler for their architecture and build the program from the provided source, or use the source code as documentation to write an equivalent reader in a 'modern' language that can be compiled. The only weakness here is that the language might become a 'lost art,' but that's difficult to avoid. (You could provide documentation on the computer language in a natural/human language, but then you have the same problem of indecipherability of the human language; and ultimately I think a computer language is probably easier to puzzle out than a natural one is.)
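A rough sketch of what "put the reader's source next to the data" could look like in practice, using Python's standard tarfile module to build an uncompressed archive in memory. The layout (data/, readers/, README.txt) and the function name are assumptions for illustration, not an established convention.

```python
import io
import tarfile


def build_bundle(data_files, reader_sources):
    """Return the bytes of an uncompressed tar holding data plus reader source.

    Both arguments map file names to bytes. The data/ and readers/ layout
    is a hypothetical choice for this sketch.
    """
    members = {}
    for name, content in sorted(data_files.items()):
        members["data/" + name] = content
    for name, content in sorted(reader_sources.items()):
        members["readers/" + name] = content

    # A plain-text index, so a human can see what the archive contains.
    readme_lines = [
        "This archive holds data files and the source code needed to read them.",
        "",
    ] + sorted(members)
    members["README.txt"] = ("\n".join(readme_lines) + "\n").encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, content in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()
```

For the RAW-photo case mentioned above, data_files would hold the images and reader_sources the converter's source file. The archive is left uncompressed on purpose, so there is one less layer to decode before the source becomes readable.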
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
At the rate we're going, that's what... about 12 more months?
Humans should become data storage devices, in some way similar to how DNA is stored. This way, the data will stay with us as long as we are here.
Nothing beyond that really matters (to us).
"Will an entire era of human history be lost?"
How ironic that in an age where we have the highest capability to preserve our history, it can become obsolete in a matter of decades. Take the 5 1/4" floppy disk. Assuming that the disk didn't lose its magnetically bound data, I would be hard pressed in 2006 to find a drive that could read it. I don't even have a 3 1/2" drive anymore.
Another example. My father has a magnetic reel from the 30's with a radio recording of my great grandfather. We have no idea how or where we can get a copy of it on a medium that we can use, like cassette, CD, or MP3. Who knows what else is quickly evaporating from our ability to use anymore.
What a shame...
We're all hypocrites. We all have hidden parts, it's the contrast between them that make us more a hypocrite than others
Yes, it will be lost.
Just like it has been lost before.
The human race has been more advanced than we are now, and all records have been lost because they were stored in a format that became unreadable after the great war.
Krikey! This one looks well preserved! Nameless graduate student, carefully and tediously remove the volcanic ash from this "ard rive." Yes professor!
(months later)
Finally, after 6 long years of restoration and figuring out what the *metal* conductors on the back were for, we can plug the "ard rive" into our nanocomputer and get a glimpse into the life of humans from 2006!!!!
This is strange, all I can find is a massive and bloated operating system called "Windows" that keeps crashing and a huge folder labeled "pr0n." I wonder what's in it...ZOMG! Now we know why they were all killed in 2007 during WW3. Silly pacifist hippies sat around whacking it and didn't buy guns!
I'm sure to get the Bush Peace Prize for this one.
"for as long as the United States remains a republic."--So, what does he want to do with data created after 1/21/2001?
It's called "Google"
Step 1: Hire hundreds of Chinese people that are unemployed. Step 2: Buy 1000's of cheap notebooks (PAPER, not a laptop) Step 3: Write 0's and 1's down on said notebooks with pen.
Thibodeau and the rest of the people at NARA have been thinking about this problem for a while, as have other researchers around the world. If you're interested in such things, there are a few places to start looking.
CAMiLEON http://www.si.umich.edu/CAMILEON/
Cedars http://www.leeds.ac.uk/cedars/
InterPARES http://www.interpares.org/
DSpace http://www.dspace.org/
Lockheed Martin won the NARA contract to develop the Electronic Records Archives.
http://www.archives.gov/era/acquisition/option-aw
After hearing them talk about it at the Managing Electronic Records conference (http://www.merconference.com/), I'd say they have a few things to work out yet... but these are important questions for the preservation of history, culture, and more. These questions also involve authenticity, the value of evidence, and more...
Cool, someone got it right for a change.
---- Booth was a patriot ----
Today's data is different from previous generations' in two ways:
1. It's digital
2. The machines that produce it are networked
1. means that you can copy it as many times as you like without errors creeping in.
2. means that when you buy a new computer, you just hook it up to the old and copy all your existing data across; Moore's Law means that the old data will occupy a tiny corner of the new computer's hard disk (or whatever the future storage device is).
Of course, this only works if you /do/ store everything on hard disk. But with storage prices where they are nowadays, a couple TiB is pretty affordable, and I don't see many people generating that amount of data any time soon.
The problem is, this requires a fairly major rethink from the POV of archivists, who are used to storing things on physical media and hoping they don't degrade. Time to learn - the way to assure long-term survival of data is multiply redundant on-line storage. Physical media are useful solely for sneakernet and short-term disaster recovery.
The books from the early days of the written word were hand copied and translated. This was done to preserve them because the books wore out from use. Often, especially in the case of books of gospels, they were edited in an effort to keep them in the current idiom. Today, we struggle to find the earliest works so that we can know more about the original author's intent.
I suspect we are facing a similar situation with archival of the mass amounts of information/data that we are now making; the very first bits of source code, even the first copies of compiled code, are scarce. And some of it is getting corrupted. Early UseNet stuff is getting harder to dig up.
For many reasons, I hope this early work is preserved.
Best regards.
USA is not and never was a republic. It is a federation of states.
Catalin Braescu
Ofaly.com
This very problem was, in effect, a major motive for the long development of the SGML standard, and its special cases like HTML and XML. It's also part of the thought behind acronyms like ASN.1 and UTF-8.
The difficulty of decoding computer files even a few years old is a problem that dates from the earliest days of computers. Programmers have been battling this problem all along. And we know a lot of solutions.
The only problem is getting people to use the solutions. This means fighting the natural tendency of management to discourage anything not aimed at improving this quarter's bottom line.
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
I believe the author of TFA has a mixed metaphor there. His title ought to be Digital DARK Age, not ICE Age.
... I have about 15 years' worth of e-mail (ok, so I'm weird), including all that spam until about a year ago (and I'm a packrat, so sue me). Saved in text files, exported from everything from pine to Outlook and many in between. This is all readable for a while.
I had documents in WordPerfect that I took the time to batch convert to Word for Windows 2, then to Office whatever. I may convert them to ODF pretty soon. Yes, some formatting is lost, but the content is there.
And I currently have my archives on a USB/Firewire hard drive, CD-ROMs, and 4MM DAT. I'll be going from CD to DVD, and the 4MM will probably go to a DV via camera. And I need some other hard drive soon, this one is 3 years old and due to go 'click'.
The solutions include:
- Multiple physical media, and some file management.
- Converting to new file formats every few years.
- Occasionally jettisoning the true crap.
But, perhaps a spam library from the early days is not 'crap'? I got spam from my AOL account, my first FIDO account, and my first 'Internet e-mail' account. I've had my own SMTP server for 10 years, and my account is at least 12 years old. I get a lot of spam.
There is no single solution. Cover all the bases. Hard drive interfaces will become obsolete, CDs will give way to Blu-Ray, and even 4mm DAT will some day become unknown.
And I'm glad my wife doesn't know how to add up all this storage. She'd call it a waste and make me delete it all. pfft.
-rick
-ps, I already deleted 11 years' worth of old Pr0n. Sorry. I got married.
deleting the extra space after periods so i can stay relevant, yeah.
I'm doing my part by working on a project where I'm copying every single MySpace page onto stone tablets.
When future archeologists dig them up and see "LOL Bobby Ray Sucks!" and "D00d 1 pwnz3r3d U!!1!", they'll understand that our civilization didn't just decline; our only choice was to destroy ourselves because we were so lame.
It's a process, not a product or solution.
Keeping old hardware and software around is a non-starter. I've been there, it's expensive and unreliable. Instead, convert the data into an open standardised format, ASCII if possible or something simple otherwise. Then put a process in place to move the data from old to new storage media, and keep it on the live media.
Automate it if you can. There are also data lifecycle products which'll manage it for you.
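A minimal sketch of the "move it from old to new media" step, under the assumption that both are mounted as ordinary directories. Every file is copied and then re-read from the new location to confirm the bytes match. The function names are made up for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large files don't need to fit in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def migrate(old_root, new_root):
    """Copy every file under old_root to new_root, verifying each copy.

    Returns the relative paths that were copied; raises IOError on a mismatch.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    copied = []
    for src in sorted(old_root.rglob("*")):
        if not src.is_file():
            continue
        dst = new_root / src.relative_to(old_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also carries over timestamps
        if sha256_of(src) != sha256_of(dst):
            raise IOError(f"verification failed for {src}")
        copied.append(dst.relative_to(new_root).as_posix())
    return copied
```

The verification pass doubles the read time, which seems a fair price for a job that runs once per media generation. Format conversion would be a separate step before or after this one.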
I don't know about the even older stuff, but for the mac, there was freeware - Transmac I believe - that lets you read MacOS floppies on a Windows machine. (I was using it on NT4, but I'm reasonably certain it or an analog is still around..)
And I'm reasonably certain a newer mac without a floppy can still read the old floppies if you get a USB floppy drive for it.
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
What I Do:
I run my data through a program that spits out a Whitespace program that generates the original data, then print the program. Nothing beats a paper trail, and there's no need to worry about the ink fading, either.
Please correct me if I got my facts wrong.
I read somewhere that IBM and some other companies were working on a project to keep digital data accessible in the future. A quick Google search didn't bring it up and since I'm still at work I hope one of you kind readers has more details.
The idea was to create a virtual machine that runs in as many platforms as possible, and within it has viewers or players for all kinds of documents (text, graphics, sound, video, etc.). Development and maintenance of this VM is supposed to continue as time passes, porting it to new OS versions or hardware platforms as they appear, while essentially keeping the internals of the VM the same.
The catch is the need for maintenance. If it lapses and technology advances too far, the "link" breaks and we're back where we started. Should we start carving bits on stone tablets?
Quantum mechanics: the dreams that stuff is made of.
A solution to this problem has been implemented by National Archives in Australia. National Archives is responsible for the care of valuable Commonwealth government records and for making them available for present and future generations to use. This organisation has a Digital Preservation area which stores data (extracted from various formats) in XML format.
This particular project will be presented at linux.conf.au 2007 by Michael Carden and is relevantly titled, "Digital Preservation - The National Archives of Australia, Open Standards and Open Source". Yet another reason to be seen in Sydney for linux.conf.au this coming January!
Well, one question I will throw out to the Slashdot community is people's thoughts on archiving data. Considering how cheap drive space is becoming, what are your thoughts on just building a machine that holds the OS and all the files you want to store, and then simply keeping the machine shut down and unplugged? People are talking about 200-300 (estimated) years with CD-R and DVD formats. Has anyone any insight on using hard drives that are powered up just long enough to store the data and then powered down? Will the data stored on the platters begin to deteriorate? I'm curious about this myself, as I am going to start archiving some documents and, mostly, treasured family photos, and I'm wondering which way would be the best and safest way to store them. Any and all comments are appreciated! Craig
And why can't we keep the archives after we aren't a republic anymore? Once I'm emperor I'll be OK with it. So long as you all wear your underclothes on the outside, with the day of the week written on them!
(If at first you don't succeed, do it different next time!)
Not a word? Sure it is, amigo. It's like the word "infamous". It means the same as "in" and "famous"; In-n-n famous. See? So, "irregardless" simply means the same as "ir" and... never mind...
I'm generally a lazy person. This fact has always made it difficult for me to keep good backups. I recently purchased a giant external drive that syncs with my internal one nightly. Up until this point, I had everything stored on CDs (and more recently, DVDs). A week ago, I began a journey through the stack that was sitting on my desk. It had grown to be over a foot tall, and had begun falling over periodically.
As I perused the contents of said stack of discs, I found that almost 90% of them were redundant or out of date copies of files I had completely forgotten about. I would estimate that I recovered only about 500MB of files that I had no other usable copy of.
Engineering is the art of compromise.
Having mulled this problem over myself, I'd say make sure that all file formats you use can at least be decoded by open source software (written in something very widespread like C). That way, if the platform becomes extinct, you can recompile the software and recover the files.
So:
Compression: ZIP (no closed-source stuff like RAR) -- even better, don't use compression at all
Graphics: PNG (JPG for lossy)
Video: HuffYUV (MPEG2 for lossy)
Audio: WAV, FLAC (Ogg Vorbis or mp3 for lossy)
Documents: plaintext ASCII if you can help it, otherwise I'd go with HTML or PDF.
And top it all off with .PAR2 files to recover from data degradation. Make new full backups at least 4 times a year.
And do not underestimate the importance of good open source emulators :)
Don't forget funding. I've seen vast amounts of data disappear when nobody was willing to pay for its storage. This is common in large bureaucracies. You've spent years building and maintaining a library, and then it all ends up in a dumpster when the parent organization is eliminated.
Mea navis aericumbens anguillis abundat
Unless I miss my guess, Google will continue towards its stated objective of making all the world's information searchable and retrievable. Want something archived, Google will take care of it. And if Google fails, my suspicion is the entity that takes their place will take it on.
You are not entitled to your opinion. You are entitled to your informed opinion. -- Harlan Ellison
Just because a job isn't easy doesn't mean it's not important.
In the early 1960s a wise man spoke
/ quote
We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.
/ unquote
We went to the Moon, and all the signals received, including a high-definition-quality picture (by the technology of the time), were recorded at NASA (and also, I believe, at the Parkes receiving station in Australia, where the signals were received through its deep space network radio telescope). These most important "documents" of our time have been lost, never to be recovered, leaving us purely with the broadcast version, which was of much lower quality (e.g. a poor-quality photocopy).
It's important for our history, for the essence of our technology, and for who we are as a people to preserve these important events for future generations.
When you look at this planet, we regularly go on a rampage in which technology is lost and we are thrown back hundreds of years. Take Ancient Egypt, the technology of the first millennium, the Great Library of Alexandria (Atlantis, etc.): so much of the past that we have lost, and we are poorer as a result.
Can't we get it right this time as we face our possible next destructive surge, whether it be by climate, economics, famine, nuclear war, microbiological warfare or disease (whether natural or manmade), a chemical accident causing a chain reaction, etc.? There are so many risks. Let's do this before it's too late: too late to be done, and too late to be able to be done.
Darren Stephens
Adelaide, Australia
We have printers for printing rendered data (text/images) onto paper. Why not have a stone or metal printer which uses a laser with a bit more wattage to etch the information into these durable materials? I concede that searching these etched media would be more of a problem, but there has been no system that I know of to date capable of accessing vast amounts of information packed into a small space other than digital/magnetic/photonic storage (comparatively transient in nature).
After the nuclear winter, the next civilization will simply have to "scan" the recorded knowledge of the 21st century from rune stones into computers again. We did it, so why wouldn't they?
PS. I hear the moon is pretty rocky, so perhaps we can use it for something after all.
In a society that believes in nothing, fear becomes the only agenda ~ Bill Durodié
The question isn't IF it will disappear, the question is really WHEN and HOW. Printing to paper-based hardcopy helps for a few hundred years. It can be recopied from paper to paper easily - it's a very low context solution: ink on paper followed by ink on paper. So, important information about our society can be transferred across generations, even if the generations have no electricity at all. This is how we know Shakespeare, for instance.
Many people say "Oh, but we'll have some NEW technology that will take care of it". This assumes that the resource base for a new technology will be as generous and dense as our present resource base provides. This is a VERY unwise presumption, as there is categorically no proof that such will be the case. In fact, there are a variety of intense warning signs that suggest quite the contrary.
From the evidence I have found, and, oddly, I've studied this for a number of years now, I am fairly well convinced that industrial civilisation will simply erase itself from the human record as little more than a horribly polluted stain that destroyed itself through overpopulation and environmental stupidity. All the music you hear, all the shows you watch, all the films you cried at, it will all go away. Poof. This also means that self-absorbed hucksters like Madonna, Britney Spears, Michael Jackson, Tom Cruise, and their supporting technology of TV, Radio, DVD/CD, etc will also disappear - just the flotsam of "entertainment" culture.
The long term future will be people chasing bison/cows across the prairie or living in small agrarian villages bound by localised population bursts and die-offs. But it will take several centuries to get there. In the meantime we've got our MTV and Orange Crush. The most important thing to remember is this: not getting to that Star Trek future IS NOT A BAD THING. We pissed away the globe's resources on our Xboxes, SUVs, jetset vacationlands, and all the other minutiae and ephemera that make a society "civilised" and provide "leisure activity". All societies have that, to varying degrees. We just had more of it, thanks to our insane and unrelenting exploitation of resources, petroleum, and electrical generation. But it will all go away, and THAT'S OK.
We will disappear. We Are Atlantis.
RS
Shoes for Industry. Shoes for the Dead.
I ask: has this ever happened before?
Not necessarily in electronic bits and bytes. Not the "Alexandria Library" that was mostly duplicated in other libraries or private collections. Maybe like the Inca quipu, mats of knotted strings that recorded all their empire's operational records, other than the ceremonial records in statues and murals. But some quipu survive, despite Spaniards destroying most of them in the mid-1500s. Enough that we can at least recognize that they did have records of lots of transactions.
No, something more transient, as transient as our bits, read/written by something more transient than our metal/plastic/glass machines. Maybe songs or other performed stories, like tribal Australians. Maybe woven in more degradable material, like uncured plant matter. Maybe both, like the Pacific star navigation lore taught in temporary woven sticks, but carried in the mind. Maybe patterns in some other loseable medium, like animal pelt patterns no longer readable now that the code has been lost, or interbred back into "blankness".
If it can happen to us, it could have happened before. Our civilization rose from meager beginnings only about 12K years ago, after the last Ice Age that lasted about 12Ky. There was another one before that, with people accumulating knowledge between. And probably a half-dozen or so others since we became as genetically developed as we are today, between 7Mya and 200Kya. We don't even have many records from the first half of the last 12Ky. Could we be reinventing the wheel, literally, every 25 thousand years?
--
make install -not war
http://www.npr.org/templates/story/story.php?storyId=1216161
>That's the real challenge - devising a digital storage format in which presentation can be used to apply context to the data.
I know! ASCII art!!!!
This is crazy talk!
Look, ever since I began using computers, in 1979, the ability to store data has increased far faster than the data has accumulated - at least on a personal level. That trend has not stopped, and shows no sign of stopping, or even slowing down.
Ipso facto, the solution is simply to add more storage. That way you never archive anything. True, the problem then becomes how you keep track of all your stored data, and this is where I believe the technical challenge lies:
"How do I find that bunch of photos I took in about 2009, in 2059?"
Even if you know which bunch-of-disks they are stored on, you may not be able to find them without considerable effort. After all, in 2059 I may have petabytes of data stored.
I wonder if Google will be around to index it all, and keep good track of it?
I guess half the solution would just be to have an interface in my brain to allow me to get to it faster.
How many escape pods are there? "NONE,SIR!" You counted them? "TWICE, SIR!"
There is a project to help solve this problem. It is called the Universal Virtual Computer (UVC). The UVC concept consists of the UVC itself, a logical data scheme with type description, the UVC program (format decoder) and the logical data viewer.
http://www.kb.nl/hrd/dd/dd_onderzoek/uvc_voor_images-en.html
If you search google, there is a lot of information on this project...and it seems I heard about it on Slashdot.
Transporter_ii
Doctors destroy health, lawyers destroy justice, universities destroy knowledge, religion destroys spirituality
Thanks to Lulu books, and the innovation of acid-free paper, I am printing out copies of all of my failing out-of-print documents. From books of my childhood passed on to me from my father, to old documents detailing how to do everything I can think of (Such as purifying insulin from cattle. Should a sudden war destroy my supply I plan on being ready for it...), I will have a new and shiny copy to pass on to the next generation.
Though it is hard to do on a student's budget, as well as slow going... But I will manage. Right now I am formatting all of my favorite Royal Society write-ups into a single volume. Good reading in that.
3 degrees of separation from Vladimir Putin
Look.
In 100 years, you will be forgotten.
In 1000 years, your country will be forgotten.
In 10000 years, your civilisation will be forgotten.
In 100000 years, your species will be forgotten.
One thing you can absolutely count on is that you and everything you find familiar will be lost and forgotten. Nothing that you accomplish, no matter how famous, infamous or worthy will be remembered in 10,000 years.
There is only one contribution you can make which will have any lasting effect at all, and I'll let you work out what that is for yourself.
Deleted
This is not a problem.
To wit: The NSA currently has a lot of data it has no real use for, and will have no more of a need for in 2100.
Stop thinking problems, and start seeing solutions.
You get 3 DVDs. You have a system that reads all 3 DVDs. When a flaw is found in 1 DVD, that DVD is destroyed and replaced with a new one, which is then written from one of the other 2 DVDs, which should be identical. The system would be like a giant jukebox, rotating in the 3 DVDs every so often to check that all the data matches.
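The three-disc check above is essentially a majority vote. A minimal sketch in Python, assuming each disc's contents can be read into memory as bytes (the function names `digest` and `repair_copies` are made up for illustration, and SHA-256 is just one reasonable choice of checksum):

```python
import hashlib
from collections import Counter
from typing import List


def digest(data: bytes) -> str:
    """Checksum used to compare copies without a byte-by-byte diff."""
    return hashlib.sha256(data).hexdigest()


def repair_copies(copies: List[bytes]) -> List[bytes]:
    """Replace any copy that disagrees with the majority.

    Hypothetical helper: 'copies' stands in for the contents of the
    three discs. Raises if no strict majority agrees, because then
    there is no way to tell which copy is the good one.
    """
    counts = Counter(digest(c) for c in copies)
    winner, votes = counts.most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no majority: cannot tell which copy is good")
    good = next(c for c in copies if digest(c) == winner)
    return [c if digest(c) == winner else good for c in copies]
```

With three copies, one bad disc is outvoted by the other two; if all three differ, the sketch refuses to guess rather than copy bad data over good.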
God spoke to me.
I agree. The amount of information has increased by orders of magnitude, but we're certainly not going to lose any *less* information than we did in the past.
Even if something becomes entrapped in a dead format, if somebody wants it enough a way will be found. Humans are clever.
The article's first big example pretty much makes my argument.
The BBC decided to expand The Domesday Book. They stored the new contributions on laser disc. Laser disc died but, "(The multimedia version was ultimately salvaged.)"
It has been argued that Wikipedia is the end of archeology; worrying that we'll forget how to read is giving too much credit to Ned Ludd.
As a game developer, it's profoundly disturbing how casually we treat games just a few years old. Hardware will continue to evolve and OSes will change; we really need a way to secure our ability to play old games.
Console games are semi-okay because you can at least keep the (static) hardware around, but PC games are in bad shape. PCs evolve gradually, and it only takes one small OS or video driver change to render a game unplayable. Because games are a commercial medium, games simply aren't supported once it's no longer financially beneficial.
As long as there are programmers out there willing to write emulators, I suppose we're okay... but it still makes me nervous.
For physical media, the best solution I know of is clay tablets.
After thousands of years, there are warehouses full of
Babylonian cuneiform clay tablets that are still perfectly
readable.
... just print all the ones and zeros out on paper, so that later on others
can just read it all back in again with OCR! Oh, I know we could use
punch cards instead, but we don't want our kids to laugh at us, do we?
Besides, if we print the ones and zeros real small, we can achieve higher
data densities.
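For what it's worth, the print-the-bits idea is trivial to sketch. The following Python is only an illustration, assuming the OCR step hands back the printed characters perfectly (a big assumption); `to_printable_bits` and `from_printed_bits` are made-up names:

```python
def to_printable_bits(data: bytes, per_line: int = 8) -> str:
    """Render bytes as groups of eight 0/1 characters, per_line groups per row."""
    groups = [format(b, "08b") for b in data]
    rows = [" ".join(groups[i:i + per_line])
            for i in range(0, len(groups), per_line)]
    return "\n".join(rows)


def from_printed_bits(text: str) -> bytes:
    """Rebuild the bytes from the printed form (whitespace is ignored)."""
    return bytes(int(group, 2) for group in text.split())
```

A real scheme would need error correction on top, since a single misread digit silently changes a byte.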
I have followed the entire discussion, and now am prepared to add my two cents worth. In the first place, the amount of data is growing exponentially, at a rate such that we will never be able to keep up with it. For example, video games are now an important art form, taking a vast mind share away from movies. Every indication is that interactive forms of entertainment will continue to evolve and take over from former "lower-dimensional" forms of expression. Perhaps we are at the dawn of 3D television, if we are to believe recent articles. Certainly interactive television is already here, and may become dominant. Then how do you preserve these art forms? As technology progresses, it is certain that the number of dimensions will only continue to increase. Right now we have multi-channel audio, 3D images - and perhaps it won't be too long before we will have images with scent synthesis, touch, who knows?
Much of the discussion so far has been about archiving documents, images, and sounds, but as time goes by, these mediums will present a very low-dimensional view of contemporary culture - only the tip of the iceberg so to speak.
I believe the answer must lie in computers themselves as the storage medium. Perhaps there are ways of building computer chips that will last for centuries or millennia. What we need is a storage machine that is self-replicating, self-repairing, and capable of constantly updating itself. Then an archaeologist of the future, even a future where technology has taken a huge backslide, will run across one of these machines in a cave somewhere, someday. It will sense the presence of the archaeologist and begin to project an image on the cave wall, with accompanying sound track, and tell the story of our history. It will also be capable of teaching others how to regain lost knowledge and technology.
I would think whatever the data source is, it would have to be sealed from the elements at the very least 4 times, each container high endurance with a buffered feed to the next outer casing. Inside each casing would be some sort of written guide (pictures, symbols, etc.) to what is inside (to let discoverers know it isn't treasure in the physical sense) and how to access the archive. (Don't know about having displays, etc., as those would be way too fragile and may be considered 'valuable' even if broken to future generations.)
:-)
A hopeful long-term scenario would be that, at the time of discovery, at least one casing may have been broken or worn away, with a couple still to go before you might contaminate the core.
It should definitely not be small or light; big and (environmentally) impervious would probably be best.
That's what I came up with on my walk home.
"Enjoy what you're doing! If it becomes drudgery, you're doing it wrong!" - Jim Butterfield
New file formats seem to be more human readable and therefore easier to produce over time. I recently wrote a little Clipper Summer '87 program to translate old .dbf files to .sql statements so they could be imported into modern databases. It is not difficult at all to do data translation in order to ensure data persistence; the real issue here is the media on which data is stored.
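To give a feel for how small that kind of translation can be, here is a rough sketch of the record-to-SQL half in Python rather than Clipper. It assumes the .dbf records have already been decoded into dictionaries by some reader (not shown), and the names `sql_literal` and `to_insert` are invented for the example:

```python
def sql_literal(value) -> str:
    """Format one field value as a SQL literal (numbers, NULL, or quoted text)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Double any single quotes so the text survives inside a SQL string.
    return "'" + str(value).replace("'", "''") + "'"


def to_insert(table: str, row: dict) -> str:
    """Turn one decoded record into an INSERT statement.

    Assumes table and column names are already safe identifiers.
    """
    columns = ", ".join(row)
    values = ", ".join(sql_literal(v) for v in row.values())
    return f"INSERT INTO {table} ({columns}) VALUES ({values});"
```

The output is plain text, which is rather the point: the translated data no longer depends on the program that wrote it.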
Punched Cards! So what if it takes 5 boxes to store a pic of Fluffy.
Table-ized A.I.
If the whole of wikipedia was carved onto a large stone plaque today...it would be bigger than 200 libraries of congress. [cite source]
Ok, this has been happening since the beginning of computers.
I still have some punch cards that I need to convert. It doesn't matter that IMSL solved all the same problems better.
I have some:
- 5.25" floppies
- RLL/MFM hard drives with data
- parallel port QIC80 tapes (250MB ea)
- 1/4" tapes (not so important)
All of these need to be converted to useful media. Or, I figure they are probably corrupt by now.
Hence, the addition of par2 http://www.par2.net/ which provides parity protection against partial media failures. Corruption can be handled by home users. For enterprise customers, EMC, Veritas and STK have handled this for years. For home users, the extra effort that par2 requires, http://en.wikipedia.org/wiki/QuickPar, may not be worth it. But if it is your wedding pictures, RAID on disk, off-site storage and optical media are **all** required to save your marriage. That 1GB of GMail space might be worth it?
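To show the idea behind parity protection, here is a deliberately simplified Python sketch. It is not how par2 works internally (par2 uses Reed-Solomon coding and can repair several damaged blocks); this uses plain XOR parity, which can rebuild exactly one lost block, and the function names are made up:

```python
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def recover(blocks: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild the single block marked None using the parity block."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [b for b in blocks if b is not None]
    rebuilt = xor_blocks(survivors + [parity])
    return [rebuilt if b is None else b for b in blocks]
```

The parity block is stored alongside the data (parity = xor_blocks(data_blocks)); if any one data block becomes unreadable, XORing the survivors with the parity gives it back.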
If it is your corporate data - go ahead, take your chances. How much can it possibly be worth if it is missing? YOUR JOB perhaps?
The cost of the time to find all the data, and of the work interruption, is nothing compared to gracefully handling a disk failure without a production impact. Where I work, we do backups **very** well, with remote vaulting in alternate data centers.
How hard is it to read disks and punch cards from the 1950s and 1960s? And that's just 30 years ago! I don't think those 10 1/2 DEC floppies are going to fit in anything, but, hey!, at least you can read old Mac, C64, Amiga, and PC 3 1/2 and 5 1/4 inch disks (even protected) with Catweasel! http://www.jschoenfeld.de/products/cwmk3_e.htm
Yep. Microsoft's commitment to their "Plays for Sure" campaign with the Zune really instills confidence in their backwards compatibility.
At least with OpenOffice I can legally archive the source code and install images needed to access the data for that period (say, every year or six months.) Sort of like dropping a copy of TrueCrypt on a DVD full of crypto archives.
With the new DRM keys and license enforcement policies, I dread someday trying to resurrect an old image so I can access data archives, only to find it wants to register with a DRM verification service that no longer runs or is no longer compatible with a 4-5 year old install image.
I do not fail; I succeed at finding out what does not work.
Right now i've got ~300GB of digital photography. That'll likely grow closer to 500 by the time i finish digitizing all my film.
Right now it's kept on an external disk and i'm slowly uploading it to an online storage system.
Within 10 years i'm sure i'll be able to keep the entire thing on one recordable optical disk. What is today a difficult to manage quantity of data will become easier in the future.
Could we be reinventing the wheel, literally, every 25 thousand years?
Maybe... I think we can expect the USPTO to start extending the lifetime of patents a few millennia.
ASCII text has been going strong for 40 years with no signs of becoming inaccessible in the foreseeable future. If you want your text to be retrievable 40 years hence, just make sure that one of the forms you save it in is ASCII.
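If you want to be sure that the copy you saved really is plain ASCII, a check is easy to script. A small Python sketch (the helper name `first_non_ascii` is made up; `bytes.isascii` needs Python 3.7 or later):

```python
from typing import Optional


def first_non_ascii(data: bytes) -> Optional[int]:
    """Return the offset of the first byte outside 7-bit ASCII, or None."""
    if data.isascii():
        return None
    # Any byte with the high bit set is not ASCII.
    return next(i for i, byte in enumerate(data) if byte > 0x7F)
```

Running this over an archive before burning it tells you which files would need converting or a note about their encoding.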
As to the rest, it's worth noting that none of the major proprietary formats has become unrecoverable in 20 years. Even the old DOS WordPerfect files are still readable in modern office programs. Lots of minor formats have gone by the wayside, but that should have been obvious up front. It should be similarly obvious that the major compression formats (pkzip, gzip) will be around for the rest of your life while minor formats (zoo, arj) and newcomers (rar, bzip2) may or may not survive the next decade.
What it boils down to is this: If you use common sense when choosing how to archive your documents, you should have every expectation of being able to retrieve your documents with the software then available for the rest of your life. If you succumb to the urge to use the latest greatest format then you'll probably lose the data.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
This reminds me of the study done for the Waste Isolation Pilot Plant (http://downlode.org/Etext/wipp/#executivesummary) . The study looked at how to mark the site in such a way that the purpose of the site would be indicated for 10,000 years.
While the WIPP site won't have the benefit of constant updating of the media (it's designed to survive on its own for 10,000 years), it does address some of the same points: longevity of the media, a format that will be usable into the future, and the ability of future civilizations to understand the message.
Off-topic perhaps but an interesting read.
Government's idea of a balanced budget: take money from the right pocket to balance...oh who am I kidding?
by UbuntuDupe (970646) * on Monday November 20, @04:56PM (#16921476)
I can't wait to hear Microsoft's explanation why the project should use one of their proprietary formats.
Yes, because Open Office has already solved the problem... right?
BTW... I have an answer. It's called "Portable Document Format". It's not perfect, but it's there and, as long as you scan it as text, it's searchable.
information that gets used, gets saved.
create systems on cots technology that are searchable and transferable to new media - and the parts people want to keep they will, either by paying for them or creating groups to preserve them, like museums
we cannot save all our information, and trying to is foolish. no one cares about all the people in ancient Egypt - just enough of them to get a good idea of what life was like. By the same token, we won't care about every photo and every bit of information.
If you use commercial-grade storage systems - you will be guaranteed data migration to new hardware and the continuation of your data.
I had the displeasure of trying to pull up some mp3's I'd ripped a few years ago (2000/2001)
that was likely done on a cendyne cd burner at a max of 4X on imation media.
Put it in a DVD-/+RW DL and...and...zip, zilch, nada...unreadable.
Oh, FSCK!
Fired up son's computer with a (boo!) Sony dvd-rom...and, there it is (yea!).
So, IMO/E 100+ years is nothing, how about 10?
Hardware, software, os, format, media as variables to b0rken along the way?
And to the game dev: Amen! fat32 (maybe ntfs) borked tomb raider and a few other games
for a while, 2k was rough at first, Heretic2 has a 25% chance of working under XP and very
few have figured out why (but works flawlessly under 2k).
Upgrade-itis drives even the most stoic of IT lemmings over the cliff with data one step
behind.
Microsoft's OS is partly to blame, but Linux ain't so innocent, either (think breaking wordperfect 8, which I am still sad about).
New, improved, faster and shiny and with blinkin' lights (ooooh)...but will it last?
Crap shoot, russian roulette, IMO, and the current stand on virtualization by MS says about how
much of a flying fsck at a rolling doughnut they care.
Have you read the moderator guidelines? Well, have you, PUNK? (and I want a Karma: Gnarly option)
There was once a king who asked of his wise men and advisors that they provide him with some object which, when he was happy would make him sad and when he was sad would make him happy.
They researched and worked very hard and finally came up with a simple ring.
Inscribed on the ring were the words: "This, too, shall pass".
Then there's the wonderful story of the Opening of the Eye of Horus, as told by that great sage and fool, Aleister Crowley (pronounced to rhyme with 'holy' not 'foully').
I shall paste the whole darn thing into this /. comment, for posterity thereby archiving it for future generations :) mod me off topic if you dare!
1. This is the Book of the Opening of the Eye of Horus, of which the symbol in the profane world is the eye in the triangle, and of which the meaning is Illumination.
2. Thou who readest this doth not read; thou who seeketh shall not attain; thou who understandeth doth not understand. For attainment and understanding cometh only when thou art not thou, yea, when thou art nothing.
3. Once there was a monk, a disciple of that great Magus of our Order whom men name the Buddha which signifieth He Who Is Awake. For men asked the Lord Gotama, Are you a God? And he answered, No. And they asked again, Are you a saint? And he answered again, No. And they asked then, What are you? And he answered: I am awake. Thence is he known as the Buddha, the Awakened One.
4. And the monk, in order to awaken himself, practised the Art of Meditation as taught by Buddha, which in its original form before being distorted by False Imaginings and Elaborations of Theologians, was but this: To look upon all incidents and events and Remember to Say Unto Thine Soul of each: This is transitory.
5. And the monk looked upon all incidents and events, Reminding himself always: This is transitory.
6. And the monk came close to Awakening, and therefore was he in great peril, for The Lord of the Abyss of Hallucinations, whom Buddhists call Mara, the Tempter, cometh quickly to one near Awakening, to hypnotize him again into the Sleep of Fools which is the ordinary consciousness of Men.
7. And Mara did sorely afflict the monk with death of offspring, and insanity of loved ones, and eye-troubles, and slander, and malice, and the great curse of Law Suits, and diverse sufferings, but the monk thought only: This is transitory. And he was closer to Awakening.
8. And Mara, the Lord of the Abyss of Hallucinations, then caused the monk to die and reincarnate as an almost Mindless creature, a Parrot, which flitted from tree to tree deep in the jungle; and Mara thought, Now he has no chance of Awakening.
9. But a brother Monk of the Buddhist order came one day through the jungle, chanting the Teachings, and the Parrot heard, and repeated the one phrase over and over: This is transitory.
10. And Mental Activity began in the Parrot, and the memories of his past life came to him, and the meaning of the teaching, This is transitory; and Mara cursed horribly in frustration, and caused him to die again and reincarnate as an Elephant, even deeper in the jungle and further from the languages of men.
11. And many years passed, and there seemed no chance of Awakening for that soul; but the effects of good karma, like those of bad, continueth forever; and eventually Men came to the jungle, and took the Elephant captive, to sell him to a great Rajah.
12. And the Elephant lived in the courtyard of the Rajah, and many years passed.
13. And another monk of the Buddhist order came to the Rajah, and taught in the courtyard, and his teaching was: This is transitory. And memories awoke in the Elephant, and meaning was understood in the memories, and Awakening again came close.
14. And Mara cursed wrathfully, and caused the Elephant to die; and this time Mara took good care that reincarnation would recur at the furthest possible remove from all chance of Awakening, for Mara caused that the monk be reb
In the free world the media isn't government run; the government is media run.
data from the past is readable because of 3 or 4 things.
first: the data is physical and that means lasting. heat and sun and rain take a very long time to destroy words in stone.
second: oral history and evolution of languages allow us to extrapolate much of a language based on similar languages.
third: context keys like a rosetta stone that allow us to put actions and words together and build our knowledge upon that.
fourth: pure analysis of structure and recurring symbols. taking just the smallest parts of language and searching for them by how common they are and building sentence patterns from that data.
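The fourth point, counting how common the smallest parts are, is the easiest of these to show in code. A minimal Python sketch, assuming the unknown text has already been transcribed into a string of symbols (`symbol_frequencies` is a made-up name):

```python
from collections import Counter
from typing import List, Tuple


def symbol_frequencies(text: str, n: int = 1) -> List[Tuple[str, int]]:
    """Count every run of n consecutive symbols, most common first."""
    runs = [text[i:i + n] for i in range(len(text) - n + 1)]
    return Counter(runs).most_common()
```

With n=1 this ranks single symbols; with n=2 or n=3 it ranks the recurring pairs and triples that hint at words and sentence patterns. It is only the counting step; turning counts into meaning still needs the context keys from the third point.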
as long as we provide some physical medium then the data can be recovered in time. our best bet would be to print our own rosetta stones on stainless steel or other medium that would last for ages.
these keys could allow future generations to build workable mechanisms to read the data. images could be used to show direction and the corresponding words to build vocabulary. each translation would allow the next level of data to be read. even the organization of the data mediums like tapes or dvds could be shown so the significant bits and bit counts would be known and the ANSI character set would be known from previous keys.
since most of this data would still be stored digitally, it wouldn't take very many keys to get you started.
the only problem is that dvds and cds don't store data indefinitely as the dyes degrade in time. tapes lose their magnetic charge and the plastics decay.
we need a more permanent medium. something more like laser disks where the data is burned physically to the disk and not just a dye being altered. maybe have the disks also be steel and have them laser etched when technology can do it for a reasonable price (which is not far off). steel isn't going to degrade anytime soon, and by the time it does people won't care what happened in politics as they would probably only be interested in the genomes present at that period in history and climate, solar flare activity etc etc.
Set up a virtual machine with the appropriate readers installed and configured. Then save the VM along with an installer and key for VMware (Linux version, if you prefer). In the future, a Windows or Linux emulator will be able to install VMware and then load and run the VM containing all the installed apps, prepped and ready to go. The remaining problem is indexing and searching within VMs, but at least the data will be readable.
Hal Spacejock: Science Fiction with Nuts
To solve this problem, I foresee a chisel, a hammer, a stone wall, and about 200,000 slaves (Jewish or equivalent value). After they are done, they can start on my pyramid!
I am not sure I would care to watch 1,000 year old porn, probably a little too tame.
You obviously have no idea what was going on in good ol' Rome back in the day.
Somebody mentioned Microsoft. They actually participate in such a project, namely PLANETS, where they do in fact work towards open formats and preservation of our ability to access past, current and future file formats.
It's exciting work. Relevant too ;)
The Long Now Foundation had an excellent presentation by Clay Shirky on just this topic some time ago. Well worth watching.
-Grey
Silver Clipboard: Time Management Tips
Let's not forget DRM (i.e., copyright enforced by code / encryption). As the content industries move towards greater control over digital media they effectively sidestep the need for stringent copyright. In this absolute control scenario, access to information becomes a matter of market forces, whereby only those works that have some market value and/or are dutifully maintained by their owners will survive.
Here's a shameless plug for a paper I wrote on this very subject:
http://thomas.kiehnefamily.us/technologies_of_access_and_the_cultural_record
-- t_kiehne
Somewhere, I read about the event that happened in China. That article also stated that those books (thoughts) came out later under some other name or form. Some works are beyond destruction (say, E = mc^2). To avoid the loss, all information (data and knowledge) can be distilled into the simplest possible form (wisdom).
Given the recent trends in copyright term extensions and law changes (read: tightening the thumbscrews), it won't really matter, because those archaeologists a century or more from now will probably be hunted down and shot for even thinking about trying to access digital media without the copyright holder's permission.
Even assuming that the future is not some 1984-esque dystopia, there are a number of large, annoying and extremely rich and vocal organisations which would be very much against the concept of storing a lot of our data for extended periods in an accessible, searchable and unencrypted format. Google already ran into this problem with their plan to archive books. The only data that could be archived like this at the moment without someone complaining about copyright violation is pure facts and figures and anything that's been put in the public domain. And unfortunately, a lot of this sort of information is not really what archaeologists look for. The tangible aspects of our society's culture are things like art, writing and music, all of which are coincidentally copyright protected (ostensibly to encourage the development of those cultural aspects by allowing those who expend effort to create them to be compensated for their work).
So in short, either the publishers of art, music, movies and books need to all band together behind this initiative or copyright law in itself needs to change. And since neither of those looks likely to happen, I think this whole archival concept can be written off as unfeasible. After all, why allow a hypothetical future generation to have insight into our culture if you can make piles of money instead?
It doesn't, always, if it's copyrighted, and in today's world all works are copyrighted automatically for more than a hundred years unless the holder explicitly waives that right. Otherwise it'd be easy for me to get hold of a lot of old (but very important) textbooks in fields where it's not economic to republish, but which only come on the market second-hand when someone dies.
Copyright law significantly blocks the distribution of much information, to the point that the remaining stores of it are often lost or destroyed by the time it's legal to reproduce and distribute them to the populace.
Unfortunately today, this problem has become as much a legal one as a technical one.
My father's wedding-pictures are film-negatives, stored in a secure-against-fire safe in his apartment. He just has to hope that no fire burns too long or too hot (the safe has limits), that no one breaks in and breaks open or steals the entire safe, that the area is never flooded, etc. They are pretty secure, but there's only one of them and it's terribly expensive to secure against everything; in the end it's just a risk he has to accept.
My wedding-pictures are digital. I don't have a safe. But the pictures exist on around a dozen different physical hard-discs and around a dozen different individual DVDs. 3 of these hard-discs are in professionally run and daily backed-up RAIDs, standing in environmentally controlled mountain-halls. There are copies on 4 continents, and in probably 20 different buildings.
It'd literally take a collapse of civilization, global thermonuclear war or similar to even have a chance of wiping all these. Each individual copy ain't much secured (if at all), but the added security that comes automatically with multiple geographically dispersed copies means my pictures are a *LOT* more secure than my father's pictures.
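A back-of-the-envelope version of that comparison, as a small Python sketch. It assumes every copy fails independently of the others (a strong assumption; a shared format or a shared mistake would break it), and the loss probabilities below are invented for illustration, not measured.

```python
def chance_all_copies_lost(p_single, copies):
    """Probability that every copy is lost, assuming independent failures."""
    return p_single ** copies

# Assumed numbers for illustration only: one well-guarded set of negatives
# versus roughly two dozen casually stored digital copies.
one_safe = chance_all_copies_lost(0.01, 1)
many_casual = chance_all_copies_lost(0.30, 24)

print(f"single guarded copy: {one_safe:.2e}")
print(f"24 unguarded copies: {many_casual:.2e}")
```

Even with each casual copy assumed thirty times more likely to be lost than the safe, the chance of losing all of them comes out far smaller.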
Lots of information that nobody cares about will be lost. Some of that information will later turn out to have been important, and we'll curse ourselves for not having saved it. But lots of information that people *do* care about is saved, and will be saved. Since data-storage grows exponentially, the cost of storing old data falls exponentially with time, so there's basically no reason to ever stop saving something once you've saved it in the first place.
If you've taken the time and expense of saving a set of data from 1980 to 2000, you might as well save it forever. If it was savable for reasonable cost in 1980, it's savable for trivial cost in 2000.
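The "save it forever" argument is a geometric series. A minimal sketch, assuming the yearly cost of holding the same data halves every `halving_years` years; both that rate and the starting cost are assumptions for the example, not figures from anywhere.

```python
def total_storage_cost(first_year_cost, halving_years, years):
    """Sum yearly costs when the cost of holding the same data halves every halving_years."""
    ratio = 0.5 ** (1.0 / halving_years)
    return sum(first_year_cost * ratio ** t for t in range(years))

def cost_forever(first_year_cost, halving_years):
    """Limit of the geometric series above as the number of years goes to infinity."""
    ratio = 0.5 ** (1.0 / halving_years)
    return first_year_cost / (1.0 - ratio)

# Hypothetical: 100 units in the first year, cost halving every 2 years.
print(total_storage_cost(100.0, 2, 20))
print(cost_forever(100.0, 2))
```

Under these assumptions the cost of keeping the data forever is a finite multiple of the first year's bill, and most of it is paid in the early years.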
Yes, saving digital works requires active maintenance: multiple copies, regular moving to newer storage-media, and documentation of file-formats (or conversion to file-formats that are well-documented). But the cost of this is more than offset by the gargantuan capacity and the dirt-cheap copying.
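Part of that maintenance is checking that the copies still match each other. One common way is to compare checksums; here is a minimal sketch using SHA-256 from Python's standard library. The function names and the idea of a single "reference" copy are assumptions for the example.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def damaged_copies(reference, copies):
    """Return the copies whose digest differs from the reference file's digest."""
    expected = sha256_of(reference)
    return [Path(c) for c in copies if sha256_of(c) != expected]
```

Calling `damaged_copies("master.jpg", ["disc1/master.jpg", "dvd3/master.jpg"])` (hypothetical paths) would list the copies that need to be re-made from a good one. It cannot tell you whether the reference itself has rotted, which is one more reason to keep several copies.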
Not every piece of digital info can be saved that way, or needs to be saved, as others have pointed out. Current college textbooks, some history books, literature and music and an encyclopaedia will go a long way to create a useful memory of our times for the future.
Some years ago, in California, they opened up a 100-year time capsule. I do not remember the stuff that was in it, but it was mostly useless junk by our standards today. If we could send an e-mail back in time, we would ask them to include totally different things. It is easy to make the same mistake now as to content.
This seems like a good place to mention a repository of media that didn't make it. Hundreds of promising (and not so promising) formats that are now unreadable.
Another solution is to abandon filesystems and start using databases at the O/S level. Future civilizations will then be able to read the data because the type information will be in the database; and since there is much printed text about databases, the chances that the information will be unreadable in the future are minimal.
ian
At least HTML v. 1.0. Other "enhancements" have really screwed things up and killed backward compatibility.
The point here is that HTML is an "open" and "widely used" data format. Plain HTML (and now even some variants of XML) is being used by groups like Project Gutenberg as reliable enough for permanent archiving, where HTML and XML offer superior text formatting information that is not preserved with plain ASCII text documents.
And before you go off and say that HTML is ASCII, it is not. It is a mark-up language that does muck around with the text and do things that sometimes make the text hard to read with just a plain ASCII text reader. Of course this is making a distinction between plain vanilla ASCII text files and something marked up like a web page.
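To make the distinction concrete, here is a small Python sketch that shows the same sentence as marked-up HTML and as the plain text a reader would want. It uses `html.parser` from the standard library; the class name `TextOnly` and the sample page are made up for the example, and a real archive page would need more care (scripts, entities, whitespace).

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the character data of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_markup(html_source):
    """Return only the text content of html_source."""
    parser = TextOnly()
    parser.feed(html_source)
    parser.close()
    return "".join(parser.parts)

page = "<p>Call me <i>Ishmael</i>.</p>"
print(page)                # what a plain ASCII viewer shows
print(strip_markup(page))  # what the author meant you to read
```

The mark-up is still just readable characters, which is the archival appeal, but you need to know the tag convention to get back to clean text.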
As for proprietary formats... the "conversion" subroutines that reformat to a more modern data format: they "mostly" work correctly. Sometimes there are bugs in the conversion process, especially (as was common with older Microsoft Office formats) when there were "undocumented" features in the data format due to the proprietary nature of the software using it. In other words, the documentation was in the source code in the form of algorithms and nowhere else. I've seen that far too often among programmers who write these file formats for this and many other file types where only a single application is assumed to use the data generated for that format. While you might be able to recover the raw text of the document through this conversion process, often the formatting is shot to hell at best, or even unusable.
If you avoided fancy formatting of the document and simply used "default settings" for most of your older documents, the conversion process is usually pretty good. It is only when somebody decided to get fancy and use some of the more obscure formatting styles that you get into some real problems. Unfortunately, those are often the most important documents that you want to access as well.
To most people, any of the files they used on computers before their first "IBM Compatible" is probably lost forever already. Think of how many files are "frozen" on 5.25" floppy disk for the Commodore 64 alone!
That doesn't have to be the case though. You can retrieve files from disks of hundreds of different 80's era computers on a modern PC using a Catweasel card. http://www.vesalia.de/e_catweaselmk4.htm
With the Catweasel, a standard 5.25" PC floppy disk drive (hello, eBay), and a 3.5" PC floppy disk drive there's hardly a floppy disk you won't be able to retrieve your petrified files from.
Finding a program that can do anything with those files is another subject entirely.
... and in the DRM, bind them.
"When Thibodeau told the head of a government research lab about his mission, the man replied, 'Your problem is so big, it's probably stupid to try and solve it.'"
... are "White_Collar_Trash (WCT) Career Management" specialists bent on destroying the USA as a Great Nation, Culture, and People. WCT Career Management specialists will always be able to convince the semiliterate public that "probable stupidity is smart" (it is also lethal, but why tell the pitiful semiliterate public).
The above quote perfectly expresses the epitome of USA Leadership for the past 30 years in business, religion, and government. You already know the rest of the story: "Stupid is as Stupid does (Vietnam, Iraq, IPR, DMCA, TIA, broken-education, healthcare nightmare, Enron, Global Crossings, E-voting Diebold, Opt-Out Privacy, Security_by_Obscurity ... far too many more)!"
No problem (except stupidity and death) is too big to fix. It is multi-degreed, highly certified, and probably stupid Big-Chicken (White_Collar_Trash) people that can't try and won't solve problems. Government research labs (many government offices, businesses, religions) need leadership not a "PhDoctorate of Career Management" in do nothing wrong or right.
Politicians, Corporatist, Televangelists
Unaccountable leaders are masters, and unrepresented people are slaves. How do US and EU fare?
This is about safety, or "sureness" as French has it. The article, very much to the point, quoted accounts from the American Civil War that are still readable. Why are they? Because they are in text format. Written by human hand, but there is no basic difference between a hand-written letter and, say, an ASCII or Unicode text file, as long as you keep the ASCII table or the Unicode tables somewhere for reference. Even "complicated" things like UML diagrams, and entire RDBMSes, can be saved or exported in text format. What we really need, given this, is spectacularly performing text compression and transmission protocols. Who feels up to the task?
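A toy illustration of "keep the table for reference", in Python. The table here is built with `chr()` for brevity; the point is that it could just as well be typed in from a printed copy of the ASCII chart. The names `ASCII_TABLE` and `decode_with_table` and the sample bytes are assumptions for the example.

```python
# A "printed" reference table: code number -> character, printable ASCII only.
ASCII_TABLE = {code: chr(code) for code in range(32, 127)}
ASCII_TABLE[10] = "\n"  # line feed

def decode_with_table(raw_bytes, table, unknown="?"):
    """Turn stored byte values back into text using nothing but a lookup table."""
    return "".join(table.get(value, unknown) for value in raw_bytes)

# Byte values as they might be read off some old medium.
stored = bytes([68, 101, 97, 114, 32, 77, 111, 116, 104, 101, 114])
print(decode_with_table(stored, ASCII_TABLE))
```

Nothing in the decoder depends on the program that wrote the file, which is the whole attraction of plain text.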
Religious speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
http://www.ietf.org/rfc/rfc0001.txt
The "seven wonders of the world" don't all exist any longer. But we know about them, because "popular" reference to them survives. The dinosaur is long gone, but its impression in the mud lingers...
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
Build a massive structure, environment-proof as far as we can make it (down to earthquakes and flooding). Make it visible from orbit. Surround it, if necessary, with statues showing humans pondering objects or simply thinking deeply.
Inside, have a straightforward progression of information in multiple written and pictorial languages, all leading to the same location, all at least tripled at other locations in the complex, and all showing/telling how to decipher a very simple graphical metalanguage. Instructions in this metalanguage should start at stage 2 in the complex, and get physically smaller and smaller, from large attention-attracting start points down to very small 'font' size which explains (and possibly also shows in pictures) how to construct a tool to view the sub-microscopic storage format which all the rest of the archive is stored in.
Part of the data in the archive will be multiply redundant copies and translations of all the large-size instructions in the structure, plus cross-language dictionaries between as many of them as possible, plus easy-to-understand information and instructions on various redundancy and compression coding schemes, plus instructions on how to read ASCII (or Unicode, or whatever other digital format the bulk of the data is stored in).
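The simplest redundancy scheme that fits the "at least tripled" idea is a majority vote across copies. A minimal Python sketch, assuming the copies are the same length and damage hits different positions in different copies; the function name and sample data are invented for the example, and real archival codes (Reed-Solomon and the like) are far more space-efficient.

```python
from collections import Counter

def majority_vote(copies):
    """Rebuild a byte string from equal-length copies by per-position majority."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("all copies must have the same length")
    repaired = bytearray()
    for column in zip(*copies):
        value, _count = Counter(column).most_common(1)[0]
        repaired.append(value)
    return bytes(repaired)

original = b"HELLO FUTURE"
copy_a = b"HELLO FUTURE"
copy_b = b"HEL?O FUTURE"  # one damaged symbol
copy_c = b"HELLO FUT#RE"  # damage in a different place
print(majority_vote([copy_a, copy_b, copy_c]))
```

A scheme this simple is also easy to explain in pictures, which matters if the instructions have to be understood without a shared language.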
We'll all be dead by then anyway.
If the data was useful you would have it in a current format.
IANAL but write like a drunk one.