Vint Cerf: Data That's Here Today May Be Gone Tomorrow
dcblogs writes "Vinton Cerf is warning that digital things created today — spreadsheets, documents, presentations as well as mountains of scientific data — may not be readable in the years and centuries ahead. Cerf illustrates the problem in a simple way. He runs Microsoft Office 2011 on Macintosh, but it cannot read a 1997 PowerPoint file. 'It doesn't know what it is,' he said. 'I'm not blaming Microsoft,' said Cerf, who is Google's vice president and chief Internet evangelist. 'What I'm saying is that backward compatibility is very hard to preserve over very long periods of time.' He calls it a 'hard problem.'"
We're at an interesting spot right now, where we're worried that the internet won't remember everything, and also that it won't forget anything.
My data will be readable because I use bog-standard formats. If I get really froggy I use HTML, and you can just strip the tags and read that.
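For what it's worth, "just strip the tags" really is a few lines. A minimal sketch using only Python's standard library; the `TextOnly` class, the `strip_tags` helper, and the sample markup are made up for illustration, and real-world HTML (scripts, styles, whitespace) would need more care:

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of plain text found between tags.
        self.chunks.append(data)


def strip_tags(html):
    """Return the document text with all markup removed (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Illustrative input, not taken from any real file.
sample = "<html><body><h1>Notes</h1><p>Saved in <b>1997</b>.</p></body></html>"
print(strip_tags(sample))
```

Nothing here inserts whitespace where the tags used to be, so headings and paragraphs run together; that is probably good enough for recovery, not for pretty output.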
If his data won't be readable, that's his problem. Anything you want to save for posterity, export it now.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
We're in a difficult spot right now because for years we ignored the warnings about 'proprietary file formats'.
I'm not blaming Microsoft either. We let Microsoft do this to us of our own free ignorance.
I think you will find that there's a little known branch of academia called "history" which sometimes takes a curious interest in even the most trivial of past information.....
Yes, you're right I have this ASCII text file created in 1997 and I can't find anything to read it...
OH WAIT ACTUALLY FUCKING *EVERYTHING* STILL READS IT.
Stop gargling Microsoft's balls so much and wipe off your chin. Proprietary data formats are THE PROBLEM. Stop trying to redirect public discourse with this thinly veiled bullshit.
A perfect example of this is basically the issue of old video games. (I may as well bring this up because it's going to come up)
Recently, the Internet Archive stored a whole pile of TOSEC collections of games from various old systems (thanks to their DMCA exemption of being an archival repository so that they can legally do this). Data and information that would have otherwise been completely lost into a digital black hole, if it weren't for the fans of the system, and the dedicated teams of people collecting and amassing this software as a hobby.... in breach of copyright.
The problem with DRM is that without dedicated crackers and pirates, unless the original rights holders are around long enough to resell old titles for that long (which most aren't), old games will simply disappear into a digital copyright black hole and never be seen again. This happens once the computer/console system is old, not sold anymore, and forgotten about, and the media degrades and isn't backed up in some form (in breach of EULA). If people aren't able to collect the software and hang on to it, preserving/duplicating the media while still in copyright, it's going to vanish. Culturally significant games will be lost forever, and that, if anything, is as much a crime as it is to pirate software in the first place.
It's only due to the efforts of an army of swappers/crackers, etc, that most of the old games on old systems were even preserved.
The steam model on PC is quite good though as it makes a few compromises where you can actually make backups and go offline if you want.
For old computers and consoles, however, this doesn't apply... and with some more restrictive attempts to squash the used game market, and force internet-always-connected authentication on upcoming consoles to even play the game... one has to wonder if the game companies deliberately want to squish all traces of their old work, let it disappear into the ether, and resell you this year's football game which is just like last year's. I fear that this is where we are headed (if we aren't there already).
READY.
PRINT ""+-0
We're living in what could well be a future dark age for archeologists / historians. Hardly anything is put into a nice hard format (stone is incredibly rare and metal gets stolen) for someone to find. What's left suffers from incompatible file formats, acid-based paper that decomposes, bit rot, cryptography, incompatible technology for data storage and, worst of all, DRM. With DRM you have active measures that try to prevent something from being usable.
In the old days people stopped use with armed guards, obfuscation and primitive crypto. Today we have servers that are required for operational functionality for many products. With the advent of the cloud you have reasons for storing things where you have a dependency on a third party. How many services that are cloud / server based have come about and gone tits up?
Even having a large well known brand name doesn't protect you from having a server shut down. Just think of Microsoft's play4sure service that lasted less than a decade. Having a license and a physical disk isn't that helpful when the DRM requires an authentication server that doesn't exist. With the movement to put more and more DRM into the cloud or with SSL certificates (again dependent upon servers and naturally time bombed) this is going to be a problem that will only grow worse.
Learning to break DRM is far more critical than file formats which require nothing more than a conversion tool.
You're very clever, young man, very clever - but it's VMs all the way down!
XML doesn't magically solve everything in this regard. If there's no good documentation for the format, it's unlikely you'll be able to display everything exactly as intended. Likewise, if the format is hideously complex (see: Microsoft Office Open XML) or there are bugs in the de-facto implementation, it's going to be tricky to reverse engineer.
I'd also point out that MS Office spits out compressed XML. I believe it's based on ZIP, which is very well documented, but that's yet another hurdle to cross. And then you have to deal with the character encoding of the XML itself -- ASCII, UTF-8, etc.
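To give a feel for those hurdles, here is a rough sketch of the "unzip, then decode, then parse" layering using only the Python standard library. The package it builds is a made-up stand-in (the member name `doc/body.xml` and the `<body>`/`<p>` elements are invented for the example), not the real Office layout, which is far more involved:

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_sample_package():
    """Build a tiny in-memory ZIP with one XML part (hypothetical layout)."""
    xml_text = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<body><p>Hello</p><p>1997</p></body>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("doc/body.xml", xml_text)  # str is stored as UTF-8 bytes
    return buf.getvalue()


def read_paragraphs(package, member="doc/body.xml"):
    """Hurdle 1: open the ZIP container. Hurdle 2: parse the XML inside."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        raw = zf.read(member)  # decompressed bytes
    # Given bytes, the parser honours the encoding named in the XML declaration.
    root = ET.fromstring(raw)
    return [p.text or "" for p in root.iter("p")]


print(read_paragraphs(build_sample_package()))
```

Getting this far only hands you the element tree. Knowing what the elements mean is the part that needs documentation.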
There's no -1 for "I don't get it."
MS removed the PowerPoint 4.0/95 converters completely with Office 2007 for Windows and later, and disabled them by default in Office 2003 SP3. And the PowerPoint 4.0 converter (but not 95) was disabled by default instead of fixed with MS09-017.
On the Mac, they removed them even earlier, when they ported Office to Carbon.
IMO it would be a good idea for MS to package PP4X32 and PP7X32 from PowerPoint 2003 separately, along with a utility to call the converters of course.
For a supposedly smart guy, he seems a bit silly:
He could've just downloaded MS's Powerpoint 97 viewer
I don't respond to AC's.
We're still able to restore cars from the 80s and earlier as the cars were fully mechanical or hydraulic. No computers.
Fast forward to 20 years from now, nobody's going to be carrying the computer boards for a 2004 Toyota Prius or a 2013 Tesla.
However, you'll still be able to restore your grandfather's '57 Chevy...
I presented a solution to this long-standing problem last year to the Denver HTML5 Meetup.
Code should never be separated from data. This is possible with HTML5, JavaScript, and open source.
In the presentation, I steal and repurpose Hofstadter's analogy of DNA to an LP vinyl record, which is an information bearer, but useless without its information retriever (the record player). Like the cell of an animal, which contains both DNA and the means to "play" it, I ask why not the same with software?
My maxim is: data should always carry the code with it to play itself. It was inspired from the field I've spent 50% of my career in: non-destructive testing where, for example, X-Rays and ultrasounds are performed on safety-critical industrial parts with 50-year service lives. If one of those parts fails and kills someone, you're going to want to go back into the old data and find the earliest indication of the flaw or fault and reinspect every other part in the world like it that is still in service. And maybe you need to go back 50 years. Under such a context, not providing the code with the data could be considered an act of gross neglect.
In my presentation, I use the 1990s-era trick of embedding XSL into an XML file, with the addition of the XSL now being able to use HTML5/JavaScript. Sadly, I've only gotten it to work with Firefox -- the other browsers consider it a security violation.
Seriously, why would Vint Cerf not blame Microsoft? They have an extremely poor track record with backwards compatibility, and I don't think they even know what forwards compatibility is. If you design the data formats correctly then you can keep things usable for decades (or centuries). Guess what, twenty year old TeX documents still work, and yet Word X won't work with Word X-2. I've pulled runoff documents off of 70's versions of Unix that can still be printed. That says to me that one can deal with compatibility issues.
This is all intentional on Microsoft's part too. They make money when customers buy new copies of software, so it is in their best financial interests to make sure that customers have significant pressure to upgrade. I remember the solution to an acknowledged bug for Word 97 was to make sure that everyone who was going to read your document had the appropriate Word 97 plug in in their older version of Word. I completely blame Microsoft here.
This is not that hard a problem, IF the company pays attention to it and gives it even a small amount of priority.
Vint, that's bullshit and you know it. It's nothing more than preserving syntaxes, grammar, file formats. That's not hard, and it only requires someone to create a format conversion ONCE to solve the problem at each stage of the evolution.
The real problem here is proprietary non-public formats and structures. When the structure of data has been a closely guarded secret and requires reverse engineering that may not even yield a perfect result, THAT is hard.
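The "write each conversion once" idea can be sketched as a chain of one-step upgraders. Everything below is hypothetical (the version numbers, the field names, the dict-based "document"), and it only works because each step's format is known, which is exactly what a secret format denies you:

```python
def v1_to_v2(doc):
    """Hypothetical change: v2 renamed the 'title' field to 'heading'."""
    out = dict(doc)
    out["heading"] = out.pop("title")
    out["version"] = 2
    return out


def v2_to_v3(doc):
    """Hypothetical change: v3 stores the body as a list of lines."""
    out = dict(doc)
    out["body"] = out["body"].split("\n")
    out["version"] = 3
    return out


# One converter per step, each written once when the format changed.
UPGRADERS = {1: v1_to_v2, 2: v2_to_v3}
LATEST = 3


def upgrade(doc):
    """Walk an old document forward one version at a time."""
    while doc["version"] < LATEST:
        doc = UPGRADERS[doc["version"]](doc)
    return doc


old = {"version": 1, "title": "Q3 report", "body": "line one\nline two"}
print(upgrade(old))
```

The newest reader never has to understand version 1 directly; it only needs the chain to stay intact.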
No! Fail! You don't get it!
1) Code is data
2) Code is data that is especially hard to interpret
3) One of the main reasons for all this mess is that in all those proprietary formats, data is intermixed with code, and the whole mess is very hard to parse.
Data should be kept completely isolated, as far away from code as possible. That way, if you cannot interpret the code any more, you will still be able to analyze and parse the data. You know, it is not that hard to construct a record player.
AccountKiller
Have you seen what some people (and MS) do with XML? And what convoluted structures they use? Coded in binary? With compression and other eminently hard to understand stuff? Most of these things will be readable just as long as the applications that created them are around, but not longer.
Forget XML. Forget Unicode as well. Plain ASCII is the only thing that works. Simple PDF or PostScript will work also, because the standards and open-source tools to read them will still be around. But nothing as complicated as a MS office document will survive. LibreOffice formats may have a chance, because LibreOffice may still be compilable and runnable (being FOSS), but only because of that and I would not bet on it.
Incidentally, all my decades-old LaTeX documents still compile and can also be read directly. So can my 20-year-old ASCII-coded measurement data.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
err... plus DosBox is running x86 software I have from 198x...which is 30+ years now.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Not even Microsoft can implement their Office XML "standard" ; from examination it's pretty much a direct name-for-name serialization of their internal binary structs, with some of the more obvious gaffes like explicitly saying "do this like this old version of Word" hastily renamed to placate ISO. It needs you to implement a whole bunch of specific behaviours if you want it to work in the MS software (things like "if you update this bit, you also have to update this other bit just so or it won't work"), but these aren't documented.
You've got more of a chance, sure, just because the structs are marked and you don't have to infer where their boundaries are, but it's a far cry from ODF which was designed from the outset to be an open XML format rather than just hastily being bunged together to permit large purchasing bodies (like governments) to tick the "Open format" box on their form.