Vint Cerf: Data That's Here Today May Be Gone Tomorrow
dcblogs writes "Vinton Cerf is warning that digital things created today — spreadsheets, documents, presentations as well as mountains of scientific data — may not be readable in the years and centuries ahead. Cerf illustrates the problem in a simple way. He runs Microsoft Office 2011 on Macintosh, but it cannot read a 1997 PowerPoint file. 'It doesn't know what it is,' he said. 'I'm not blaming Microsoft,' said Cerf, who is Google's vice president and chief Internet evangelist. 'What I'm saying is that backward compatibility is very hard to preserve over very long periods of time.' He calls it a 'hard problem.'"
We're at an interesting spot right now, where we're worried that the internet won't remember everything, and also that it won't forget anything.
My data will be readable because I use bog-standard formats. If I get really froggy I use HTML, and you can just strip the tags and read that.
If his data won't be readable, that's his problem. Anything you want to save for posterity, export it now.
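For what it's worth, the "strip the tags and read that" fallback is only a few lines. This is a rough sketch using Python's standard html.parser; the TagStripper class, the strip_tags helper and the BLOCK_TAGS list are names made up for illustration, and the whitespace handling is a guess at what "readable" means.

```python
from html.parser import HTMLParser

# Assumption: these tags end a block, so a line break is inserted after them.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "tr", "br"}


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)

    def handle_endtag(self, tag):
        # Keep headings and paragraphs from running into each other.
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def text(self):
        # Collapse all whitespace runs into single spaces.
        return " ".join("".join(self.chunks).split())


def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<html><body><h1>Notes</h1><p>Export it <b>now</b>.</p></body></html>"
print(strip_tags(sample))
```

That works for prose. Anything where the layout carries meaning is another matter.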
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Support emulator/VM developers! Encapsulate your entire machine in a VM and you can run the entire software stack if necessary. Anything you need convenient access to, export to CSV, XML or some other standard format.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
We're in a difficult spot right now because for years we ignored the warnings about 'proprietary file formats'.
I'm not blaming Microsoft either. We let Microsoft do this to us of our own free ignorance.
I think you will find that there's a little known branch of academia called "history" which sometimes takes a curious interest in even the most trivial of past information.....
Yes, you're right I have this ASCII text file created in 1997 and I can't find anything to read it...
OH WAIT ACTUALLY FUCKING *EVERYTHING* STILL READS IT.
Stop gargling Microsoft's balls so much and wipe off your chin. Proprietary data formats are THE PROBLEM. Stop trying to redirect public discourse with this thinly veiled bullshit.
A perfect example of this is basically the issue of old video games. (I may as well bring this up because it's going to come up)
Recently, the Internet Archive stored a whole pile of TOSEC collections of games from various old systems (thanks to their DMCA exemption as an archival repository, which lets them do this legally). That data would otherwise have been completely lost into a digital black hole, if it weren't for the fans of the systems and the dedicated teams of people collecting and amassing this software as a hobby... in breach of copyright.
The problem with DRM is that without dedicated crackers and pirates, unless the original rights holders are around long enough to resell old titles for that long (which most aren't), old games will simply disappear into a digital copyright black hole and never be seen again. This happens once the computer/console system is old, not sold anymore, and forgotten about, and the media degrades and isn't backed up in some form (in breach of EULA). If people aren't able to collect the software and hang on to it, preserving/duplicating the media while still in copyright, it's going to vanish. Culturally important games will be lost forever, and that, if anything, is as much a crime as it is to pirate software in the first place.
It's only due to the efforts of an army of swappers/crackers, etc, that most of the old games on old systems were even preserved.
The Steam model on PC is quite good though, as it makes a few compromises where you can actually make backups and go offline if you want.
For old computers and consoles, however, this doesn't apply... and with some more restrictive attempts to squash the used game market, and to force always-connected internet authentication on upcoming consoles just to play a game... one has to wonder if the game companies deliberately want to squish all traces of their old work, let it disappear into the ether, and resell you this year's football game which is just like last year's. I fear that this is where we are headed (if we aren't there already).
READY.
PRINT ""+-0
We're living in what could well be a future dark age for archeologists / historians. Hardly anything is put into a nice hard format (stone is incredibly rare and metal gets stolen) for someone to find. What's left suffers from incompatible file formats, acid-based paper that decomposes, bit rot, cryptography, incompatible technology for data storage and, worst of all, DRM. With DRM you have active measures that try to prevent something from being usable.
In the old days people stopped use with armed guards, obfuscation and primitive crypto. Today we have servers that are required for operational functionality for many products. With the advent of the cloud you have reasons for storing things where you have a dependency on a third party. How many services that are cloud / server based have come about and gone tits up?
Even having a large well-known brand name doesn't protect you from having a server shut down. Just think of Microsoft's PlaysForSure service that lasted less than a decade. Having a license and a physical disk isn't that helpful when the DRM requires an authentication server that doesn't exist. With the movement to put more and more DRM into the cloud or with SSL certificates (again dependent upon servers and naturally time bombed) this is going to be a problem that will only grow worse.
Learning to break DRM is far more critical than file formats which require nothing more than a conversion tool.
Digital archival is one of the HARD problems. Over the last 40 years we have already lost more cultural artifacts than were created in the entirety of human history. A great deal of that is useless garbage of course, but the original moon landing tape? 1000s of government emails revealing exactly what was going on at pivotal times in history?
The truth is, we need systems for hardcopy; digital is too transient; emulators are a useful stopgap measure but don't protect against the kinds of catastrophic failures that we will likely see over the longer time frame; and we need indexing because someone at some point will want to wade through our digital detritus.
In to a usable document from scratch? Pretty hard. Ever looked at the XML of a moderately complex document?
I think that given MS office and LibreOffice are in XML, it shouldn't be difficult at all to reverse engineer in the future.
Yes, the problem is not "data" but "data in proprietary formats" ... and even that is becoming less of a problem. A converter to/from almost anything is usually just a google search away. With VMs and emulators, even proprietary binary programs are easier than ever to deal with. I can run any CP/M or C64 program on my desktop Linux computer using free emulators. This was indeed a "hard problem", but today it is mostly solved.
This has been true of all technology in the past and will continue into the future. Just look at film. How many preserved films from 1915 are still around? Just the ones that were recorded into a new format of film, then a newer format of film, then onto VHS, then onto LaserDisc, then DVD, then Blu-ray... (Metropolis, I am looking at you.)
Within arm's reach, I have floppy disks that contain files created in the AMI Pro word processor... When I say floppy, I am talking about the 5 1/4 inch floppies.
Technology hardware and software is not stagnant... It will always continue to develop and progress (ignore Windows 8). Data that is worth keeping will get converted. Data that isn't will get left behind. I would not be surprised if in about 25 years there is "classic" software as there is classic literature...
Too much typing.. going back to drinking.....
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
"First things first -- but not necessarily in that order"
-- The Doctor, "Doctor
The IRS wants to audit me, going back several years. I kept the records as required but they are unreadable now.
Thanks Microsoft!
Have gnu, will travel.
I think that given MS office and LibreOffice are in XML, it shouldn't be difficult at all to reverse engineer in the future.
Binary formats were standard for everything up through Office 2003. Office 2007 (2003 with optional converter pack and some weird bugs) could output something XML based, though I have the vague memory from the OpenDocument/Office Open XML slugfest that 2007 produced something that deviated from the theoretical ideal of OOXML in some respects, and that full conformity happened at 2010 or 2013. I might be remembering that wrong; but anything before 2003, and a lot from 2003, were definitely binary.
I think you will find that there's a little known branch of academia called "history" which sometimes takes a curious interest in even the most trivial of past information.....
Even if you don't care about the historians, I'm sure the lucky people who have the pleasure of handling property deeds at your local governance hive can tell you a story from within the last week or two about needing to pull some rather seriously dusty documents to allow a present-day transaction to go through without incident.
Many data will, indeed, be of no interest at all, or the same historical interest that neolithic refuse dumps are; but data in the nontrivial-number-of-decades range are still live in more than a few contexts.
XML doesn't magically solve everything in this regard. If there's no good documentation for the format, it's unlikely you'll be able to display everything exactly as intended. Likewise, if the format is hideously complex (see: Microsoft Office Open XML) or there are bugs in the de facto implementation, it's going to be tricky to reverse engineer.
I'd also point out that MS Office spits out compressed XML. I believe it's based on ZIP, which is very well documented, but that's yet another hurdle to cross. And then you have to deal with the character encoding of the XML itself -- ASCII, UTF-8, etc.
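To put the hurdles in perspective, here is a rough sketch of the two layers (ZIP container, then encoded XML) using only Python's standard library. It builds a toy archive in memory rather than reading a real file; make_toy_docx and paragraphs are hypothetical helper names, and the assumption that the main text lives in word/document.xml holds for typical .docx files but is still an assumption here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_toy_docx():
    """Build a tiny in-memory stand-in for a .docx (not a valid Word file)."""
    body = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello from 2013</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", body.encode("utf-8"))
    return buf.getvalue()


def paragraphs(docx_bytes):
    """Return the text of each paragraph found in word/document.xml."""
    # Layer 1: the ZIP container.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        raw = zf.read("word/document.xml")
    # Layer 2: the XML; the parser reads the encoding from the declaration.
    root = ET.fromstring(raw)
    out = []
    for p in root.iter(f"{{{W_NS}}}p"):
        out.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return out


print(paragraphs(make_toy_docx()))
```

Getting the text out is the easy part. Reproducing the layout, styles and undocumented behaviours is where it gets tricky.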
There's no -1 for "I don't get it."
MS removed the PowerPoint 4.0/95 converters completely with Office 2007 for Windows and later, and disabled them by default in Office 2003 SP3. And the PowerPoint 4.0 converter (but not 95) was disabled by default instead of fixed with MS09-017.
On the Mac, they removed them even earlier, when they ported Office to Carbon.
IMO it would be a good idea for MS to package PP4X32 and PP7X32 from PowerPoint 2003 separately, along with a utility to call the converters of course.
For a supposedly smart guy, he seems a bit silly:
He could've just downloaded MS's PowerPoint 97 viewer
I don't respond to AC's.
Have you tried disabling the file blocks first? At least Word for Mac 4.x and 5.x can be read this way.
Some are glass plate Daguerreotypes. Somehow, I am not too confident that my digital pictures will be legible 150 years from now, unless I make a good quality print on archival paper. Digital files are too easily corrupted and made totally useless. Media formats will change. 8" floppies anyone?
"Do the Right Thing. It will gratify some people and astound the rest." - Mark Twain
We're still able to restore cars from the 80s and earlier as the cars were fully mechanical or hydraulic. No computers.
Fast forward 20 years from now and nobody's going to be carrying the computer boards for a 2004 Toyota Prius or a 2013 Tesla.
However, you'll still be able to restore your grandfather's '57 Chevy...
I presented a solution to this long-standing problem last year to the Denver HTML5 Meetup.
Code should never be separated from data. This is possible with HTML5, JavaScript, and open source.
In the presentation, I steal and repurpose Hofstadter's analogy of DNA to an LP vinyl record, which is an information bearer, but useless without its information retriever (the record player). Like the cell of an animal, which contains both DNA and the means to "play" it, I ask why not the same with software?
My maxim is: data should always carry the code with it to play itself. It was inspired from the field I've spent 50% of my career in: non-destructive testing where, for example, X-Rays and ultrasounds are performed on safety-critical industrial parts with 50-year service lives. If one of those parts fails and kills someone, you're going to want to go back into the old data and find the earliest indication of the flaw or fault and reinspect every other part in the world like it that is still in service. And maybe you need to go back 50 years. Under such a context, not providing the code with the data could be considered an act of gross neglect.
In my presentation, I use the 1990s-era trick of embedding XSL into an XML file, with the addition of the XSL now being able to use HTML5/JavaScript. Sadly, I've only gotten it to work with Firefox -- the other browsers consider it a security violation.
https://en.wikipedia.org/wiki/Windows-1250
Professional Wild-Eyed Visionary
I've been part of archival problem planning. We went with DVD. Now that I am not there, I suspect they are thinking DVD sucks and are moving "forward" when the DVD was more than good enough and those plastic discs will last a century. MPEG-2 files will have open source decoders. Now physical readers will still be a problem... the only solution is to wait as long as possible and then switch to the next long-lasting format -- but not necessarily the newest one at that time (which is why moving to Blu-ray is a waste of money).
The biggest problem with other formats is the FORMAT; even with something like OpenOffice documents, the ODF format will have revisions and new features added and tweaks to the format: version 2, 3, etc. The features and changes that promote the creation of more and more formats are the biggest problem. Just like my DVD video problem above: if you go beyond your needs then you are complicating things with more and more formats.
TEXT? sucks. we need WORD! Word 1.0? the app sucks... we need WORD 20! (and all versions in between to migrate the old docs...plus labor to deal with conversion issues...)
Perhaps we need ARCHIVAL formats, like PDF, which has done well apart from the stupid additions Adobe has been making to it. Or just TEXT export... a less bloated output-only format without the feature BS problems.
Thankfully, email remains the same... sort of. although storage of the emails differs greatly; if you want to archive emails you need to pick a close-to-the-source method (and simple storage filesystem-- good luck reading that NTFS formatted disk image in 30 years.)
Democracy Now! - uncensored, anti-establishment news
Both have published specifications, so reverse engineering shouldn't be necessary. However, Microsoft's XML includes things that are not defined in the specification. That was one of the objections to giving it status as an open standard.
Seriously, why would Vinton Cerf not blame Microsoft? They have an extremely poor track record with backwards compatibility, and I don't think they even know what forwards compatibility is. If you design the data formats correctly then you can keep things usable for decades (or centuries). Guess what, twenty-year-old TeX documents still work, and yet Word X won't work with Word X-2. I've pulled runoff documents off of '70s versions of Unix that can still be printed. That says to me that one can deal with compatibility issues.
This is all intentional on Microsoft's part too. They make money when customers buy new copies of software, so it is in their best financial interests to make sure that customers have significant pressure to upgrade. I remember the solution to an acknowledged bug for Word 97 was to make sure that everyone who was going to read your document had the appropriate Word 97 plug in in their older version of Word. I completely blame Microsoft here.
This is not that hard a problem, IF the company pays attention to it and gives it even a small amount of priority.
Vint, that's bullshit and you know it. It's nothing more than preserving syntaxes, grammar, file formats. That's not hard, and it only requires someone to create a format conversion ONCE to solve the problem at each stage of the evolution.
The real problem here is proprietary non-public formats and structures. When the structure of data has been a closely guarded secret and requires reverse engineering that may not even yield a perfect result, THAT is hard.
No! Fail! You don't get it!
1) Code is data
2) Code is data that is especially hard to interpret
3) One of the main reasons for all this mess is that in all those proprietary formats, data is intermixed with code, and the whole mess is very hard to parse.
Data should be kept completely isolated, as far away from code as possible. That way, if you cannot interpret the code any more, you will still be able to analyze and parse the data. You know, it is not that hard to construct a record player.
AccountKiller
My first LaTeX publications from 20 years back and all my human-readable ASCII scientific data can still be read and used without any problem. Human-readable file formats in the UNIX tradition completely solve this problem.
This problem is only hard if the people making the data formats are either stupid or do not want their formats to be easily accessible to other applications, as Microsoft does. Of course, others are creating just as fundamentally broken formats for either of the same reasons.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
You can get emulators for just about every machine you can imagine: PDP-10, PDP-11, DOS, Atari, Amiga, C64, microcontroller, etc. You can get hardware emulators with FPGAs if you like. Almost any important format is documented or has been reverse engineered. Yes, you can easily read 1997 PowerPoint files, even if his weird choice of Office on Mac can't. And that's only with current technology. Give it a few decades and all that can happen behind the scenes and computers will just automatically perform even the most complicated data conversions behind the scenes. "Computer, scan the 1997 floppy and put the data on screen."
Have you seen what some people (and MS) do with XML? And what convoluted structures they use? Coded in binary? With compression and other eminently hard to understand stuff? Most of these things will be readable just as long as the applications that created them are around, but not longer.
Forget XML. Forget Unicode as well. Plain ASCII is the only thing that works. Simple PDF or PostScript will work also, because the standards and open-source tools to read them will still be around. But nothing as complicated as a MS office document will survive. LibreOffice formats may have a chance, because LibreOffice may still be compilable and runnable (being FOSS), but only because of that and I would not bet on it.
Incidentally, all my decades-old LaTeX documents still compile and can also be read directly. So can my 20-year-old ASCII-coded measurement data.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Who hurt you? :-(
Sand's overrated... it's just tiny little rocks.
Not even Microsoft can implement their Office XML "standard"; from examination it's pretty much a direct name-for-name serialization of their internal binary structs, with some of the more obvious gaffes like explicitly saying "do this like this old version of Word" hastily renamed to placate ISO. It needs you to implement a whole bunch of specific behaviours if you want it to work in the MS software (things like "if you update this bit, you also have to update this other bit just so or it won't work"), but these aren't documented.
You've got more of a chance, sure, just because the structs are marked and you don't have to infer where their boundaries are, but it's a far cry from ODF which was designed from the outset to be an open XML format rather than just hastily being bunged together to permit large purchasing bodies (like governments) to tick the "Open format" box on their form.
Holy shit, yeah, you're right - it's totally impossible to strip out the XML tags and be left with readable plain text content!
I bet nobody could ever decode it!
You seem to be assuming a flat-text file with predictable order. Strip the XML out of anything in a tabular format (eg a spreadsheet -- see TFS) and you lose vital data. Blank cells are lost and the tabulated data no longer lines up.
It gets worse in a filetype with unstructured formatting, e.g. DTP and slideware. You've got a collection of elements that are only ordered by their metadata. The explanatory labels you want to overlay on top of that image? They're no longer linked to it and you've no way of knowing what they're there for. Multiple news stories on the same page merge into one, and have been divorced from their headlines.
Readable != useful.
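The blank-cell point above is easy to demonstrate. A small sketch, assuming a made-up, simplified spreadsheet markup (the sheet/row/cell element names are hypothetical and match no real file format), comparing naive tag stripping with a structure-aware parse:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified spreadsheet markup; not any real file format.
SHEET = (
    "<sheet>"
    "<row><cell>Part</cell><cell>Qty</cell><cell>Note</cell></row>"
    "<row><cell>Bolt</cell><cell/><cell>backordered</cell></row>"
    "<row><cell>Nut</cell><cell>40</cell><cell/></row>"
    "</sheet>"
)


def strip_tags(xml_text):
    # Naive approach: delete anything that looks like a tag, keep the words.
    return re.sub(r"<[^>]+>", " ", xml_text).split()


def parse_rows(xml_text):
    # Structure-aware approach: keep one slot per cell, even when it is empty.
    root = ET.fromstring(xml_text)
    return [[cell.text or "" for cell in row.iter("cell")]
            for row in root.iter("row")]


print(strip_tags(SHEET))
print(parse_rows(SHEET))
```

After stripping, nothing says whether "backordered" was a quantity or a note, or where one row ends and the next begins. The parsed version keeps the empty slots, but only because the parser understands the structure.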
Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
The best safeguard is the abandonment of all existing proprietary formats to freedom (so anybody can write conversion software) and the proliferation of open formats on an ongoing basis.
"I believe in Karma. That means I can do bad things to people all day long and I assume they deserve it." : Dogbert