National Archive File Format Time Bomb
geordie_loz writes "The BBC is reporting that the UK National Archive is warning of old formats being a 'ticking time-bomb' where data is going to be lost because of incompatibility in newer versions of software, and software not existing at all. More surprisingly, Microsoft has offered a solution via the OOXML format."
Not unless Microsoft is going to write converters for every existing file format in the world. In the past, I've had the most luck with old document formats using open source products like AbiWord.
Just make a torrent.
ITSATRAP!
Red to red, black to black. Switch it on, but stand well back.
Don't give in to MS on this one. Some states in the US already have, and it's no better than the standard Word format because it's owned by a private entity. Use the OpenOffice format if you want to be sure that you won't get the rug pulled out from under you some years down the road.
"The Most Fun Possible on 4 wheels" is at SunBuggy in Las Vegas
There are so many idiots in this state of affairs:
1. the idiots who decided to build a huge archive in an undocumented proprietary format
2. idiots who believe they can't find even a single copy of the software they need
3. idiots who didn't store a single copy of the software that reads the format together with the archive (not very far from obvious, is it?)
4. idiots who want to convince other idiots that OOXML is an open format (rather than a straight XML serialization of whatever the binary DOC format was in the MS source code base at the time)
It predates Moses, and is quite likely to survive the heat death of the universe.
"The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
To give it its proper name, the format is "Office Open XML". They deliberately went to a lot of trouble to pick a name that's as easy to confuse as possible with OpenOffice.
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
MS benefits a lot from upgrades: that way you end up "needing" to pay for an upgrade down the road regardless of whether you bought a new computer or not. They stand to lose everything if open source is seen to be nearly or just as good by people at large or by the government, so they do just what is required of them, but not enough to weaken the future cash stream from upgrades.
Sigs are too short to say anything truly profound so read the above post instead.
This is why we should be using open formats, particularly for things that are really complex like video codecs.
I am with Linus on this one.
If you have a problem with proprietary formats you go to Microsoft to solve it for you... The word "DOH" springs to mind.
Oh yeah, their solution? Virtualised Windows 3.1. And obviously in 15 years you'll have to virtualise Vista in order to run the Win3.1 virtual machine to run Word. And Microsoft will be paid a license for each application and level of virtualisation.
You couldn't make this stuff up.
No. The obvious solution for the predicted problem of data being unavailable due to being in unsupported proprietary formats is to move it to a widely supported non-proprietary format.
As "well intentioned" as Microsoft may be, Microsoft's Open XML cannot be anything but proprietary when its code references Windows and Office API functions rather than more precise data format information as with ODF. (For more information about this, you might search out the arguments against making OOXML an ISO standard.)
As sad as it is, I've run into some PDFs that don't open with newer PDF viewers. Sort of defeated the purpose, I thought.
Ah, brilliant. It's partly Microsoft's fault that this problem exists, so their suggestion is to trust them to offer a solution that won't cause this problem later down the line. Especially smart, given this sort of longevity and interoperability would annihilate their business model. Yes, good, excellent. Explicit trust ahoy!
The guy sounds like a Microsoft salesman, not someone who should be in that position of responsibility. Look at all the MS software boxes behind the computer. A puppet.
Carry on.
I can't believe the National Archives partnered with the company that caused this mess in the first place, ie Microsoft.
Second, why on earth do they think virtualisation is a long-term solution? Sure, you can emulate Windows 95 within Windows XP today, but what happens in another ten years? Another layer gets wrapped around XP? So in 100 years, you're relying on a stack of emulators to access the old software. You better hope Moore's law holds up, because you're going to need it. Also, who will know how Word 95 worked in 10 years, let alone 100?
IMO translation of the old documents would be a better solution. Translate the documents into a well-documented, open format, and throw away all of the old formatting idiosyncrasies while you're at it. That way, you only have to maintain one way to access the documents with the software-du-jour, instead of having to prop up the entire teetering stack of virtualisation layers.
It seems to me that this is really a nonproblem. OOo is compatible with lots of "dead" formats (or can read them, at least), as are many other open source office programs. I can't imagine they're going to begin throwing away this compatibility; it isn't like it takes extra coding (as far as I know). Also, I have found Microsoft Word's "Extract text from any file" to work pretty well. I had a roommate with a corrupted Mac-formatted disk that had his deceased grandmother's journal on it in some old Mac Word file (a format still readable in Word, but the disk was corrupted so I couldn't just open the file). I popped it in my parents' now deceased iMac and the only program I found that opened it was Word, using the "Extract text from any file" function. I emailed him the journal and he thanked me profusely.
Also--as noted, the OOXML format is a nonsolution for this nonproblem. It seems like it would be a waste of effort--why convert a bunch of files to a format that may die just as quickly as any other format, when you can just leave the file as is and open it in OOo (assuming I'm correct that they won't stop read support for dead formats)?
Also, it seems to me that no current format or any future format will ever solve this nonproblem because formats will always change as new functionality is continually added. The better solution is to keep this a nonproblem by having open source software that can read old file formats.
I think I've read something that they are already unable to read some data stored on computers in the Ex-German Democratic Republic.
The only solution IMHO is _open and documented_ interfaces, protocols, programs, data types and hardware. In the future they won't be able to read our disks and files. They just can try to build a machine that reads our disks and files - for which they need documentation how they work.
What's surprising about that? Someone in MS Spin Control and Public Relations is worth his salary. The story could have exploded into an "avoid MS products if you want your data accessible some years down the road" fiasco (we all know that MS is the worst offender when it comes to changing the document formats, usually undocumented). Instead, it was turned into another push for their next format.
Brilliant.
"What, the shit I sold you yesterday stinks? Try this new shit, it's great and it has none of the problems of the old one."
That's what you hire PR people for.
Assorted stuff I do sometimes: Lemuria.org
Rather than bitching about Microsoft making an offer of 'help' which is just thinly disguised marketing (I mean, come on, par for the course, no?), could we get a discussion about real solutions? I know MS bashing is fun, but come on, we do it on just about every other thread... let's have a day off.
To kick things off here's one:
Keep EVERYTHING in the simplest possible format. ASCII would seem sensible, since it's the content we care about, not the formatting (although that wouldn't help our Asiatic brethren much). Then keep decent records of HOW you can read that format, with examples of the software and hardware. Do this bit on PAPER. V. tough paper (or rock, or plastic or whatever). Update the explanations every other year, to put them in language the next gen will understand. Maybe also have instructions on how to translate the simple format to less simple things.
I guess, basically, it's a case of KISS and then *provide a persistent and regularly updated 'Rosetta Stone'* for latecomers to work from.
As a side branch, this kind of reminds me of discussions I read about a while back of how to warn future generations about Nuclear Waste dumps (y'know, the really nasty stuff with half-lives in the thousands of years range). I don't think anyone ever came up with a decent answer....
'Speak softly and carry a beagle'
I just can't believe that Microsoft think they can get away with lecturing others about open standards.
"If you think the problem is bad now, just wait until we've solved it." --- Arthur Kasspe
The real problem seems to be the credulous morons in charge of the National Archives project.
simple solution - DON'T UPGRADE THE MACHINES - just keep all the old computers and associated stuff for looking at the archives!
If they have reasonable hardware then it should last a long time.
Call me a luddite, but if it ain't broke don't fix it.
It's not just about the software. It's the hardware, too.
I'm sure that most of the archive data created today is stored on something like DVDs but, as recently as the early 1990s, the official long-term storage medium for the UK government was Syquest 44MB removable cartridge hard drives.
I know that I have a working 44MB drive (well, it worked when I last fired it up, which would have been sometime last decade) somewhere in my attic but I doubt that too many of these drives are still in existence.
I only hope that the data that was once stored on thousands of these was successfully transferred to a more readily accessible storage format and that that new format is just as durable - media these days just seems to disintegrate after a few years.
"Accept that some days you are the pigeon, and some days you are the statue." - David Brent, Wernham Hogg
If you are going to choose a proprietary vendor to safeguard your data, wouldn't IBM be the obvious choice? They have proven their ability to keep 20 year old programs running in modern environments without modification.
It has been a while since I worked on an AS/400 system... so anyone with updated info please feel free to correct me if things have changed.
It seems like a no-brainer.
Link: http://en.wikipedia.org/wiki/AS/400
And being a government, these files are INCREDIBLY important.
Why haven't they been converted? Really, all their DIGITAL archives should be in a single format by now.
All those books are in a single format. And paper records can last a LOT longer than digital records. They still have the original Constitution and that's more than 200 years old.
They've found papyrus records that were 2,000 years old.
It looks like paper is a better choice for keeping records than digital formats.
You need to run the original software in an emulator, OS and all.
That emulator itself needs to be Open Source so that you can port it to future platforms. Otherwise, you'd be faced with running an emulator in an emulator in an emulator in an emulator in an emulator...
Keeping around multiple conversions certainly doesn't hurt. Converters vary in quality and the resulting conversions will themselves vary in future compatibility.
OK, the deal is this. Let's say you have a bunch of files in some old format, and a spec for that format, and you need some information out of those files. That spec will be useful to you if - and only if - the cost of implementing that format from the spec is less than the cost of losing those files, AND it's less than the cost of reverse-engineering enough of the format to extract the information you want from the files.
The OOXML spec is huge (expensive to implement from the spec) and complex, and the meaning of many components can't really be determined without looking at the way Office behaves (so it's incomplete; thus implementing a reader for it will require a fair amount of trial and error). Reverse-engineering Office's format may be much easier, depending on what information you're looking for... just extracting the text strings from a Word document has often been MY preferred method of reverse-engineering it...
Which means that OOXML is a poor archival format... unless you want to lock people who want to use the archives in the meantime into using Microsoft Office to read them.
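For anyone who hasn't tried the "just extract the text strings" trick mentioned above, here is a minimal Python sketch of the idea, similar in spirit to the Unix `strings` tool. The function name `extract_strings`, the sample blob and the 4-character minimum are all made up for illustration, and it only looks for plain ASCII (real Word files also store text as UTF-16, which this ignores).

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII characters found in data."""
    # Printable ASCII is 0x20 (space) through 0x7E (~).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# A made-up binary blob standing in for an old document file.
blob = b"\x00\x01\xffDear Sir,\x00\x02\x03ab\x00Yours faithfully\x1a"
print(extract_strings(blob))
```

It gets you the words and nothing else: no layout, no structure, and any short run of junk bytes that happens to look like text comes along for the ride.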
I wanted to design something that would be still usable in 100 years. (Donald E. Knuth, more than 20 years ago)
Also, LaTeX will get you nicer documents than any WYSIWYG word processor in less time (once you know it..). Oh and smaller filesize, too.
A pentabyte is 5 bytes, right? How hard is it to store 40 bits on paper? ;)
(I assume petabyte (10^15 or 2^50, depending on convention) is the word you're looking for.)
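For the curious, the gap between the two conventions is not small. A quick Python check (the constant names are just labels picked for this snippet):

```python
PETABYTE = 10 ** 15   # decimal (SI) convention
PEBIBYTE = 2 ** 50    # binary convention, often also loosely called a "petabyte"

difference = PEBIBYTE - PETABYTE
print(PETABYTE, PEBIBYTE, difference)
# Ratio of the binary figure to the decimal one, rounded to 4 places.
print(round(PEBIBYTE / PETABYTE, 4))
```

So the binary figure is roughly 12.6% larger than the decimal one.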
Ben Hocking
Need a professional organizer?
The book of Genesis was originally done in SGML.
I believe this is called "hiring the fox to guard the henhouse".
I think we've pushed this "anyone can grow up to be president" thing too far.
Unfortunately, those bright people don't get to make technical decisions.
The British Library recently introduced SED, an electronic document delivery system. With SED, you can order electronic copies of journal papers and articles from their archives. Great idea! Previously, you had to wait for the documents to come through the post, and that would take a week or so. Now you get them by email in a couple of working days.
Except that the documents are crippled by Adobe DRM, which imposes the following restrictions:
- You can only view them using certain specific versions of Acrobat Reader (6 or 7) - the latest version is not recommended.
- The software only works on Windows 2000 or XP. No Linux support, no Mac support. Vista might work, but again, it's not recommended.
- You can only look at each document for a limited time, and you can only print it once.
So, if you want to use the service, you'd better hope that you have (a) the right version of Windows, (b) the right version of Acrobat Reader, (c) a reliable net connection, and, most importantly, (d) a very reliable printer that won't chew up the document. Unless you're a filthy dirty pirate, of course.
If Adobe managed to convince the British Library to put up with this ridiculous system, I am sure that Microsoft will have no difficulty convincing them about their archive "solution". If SED is anything to go by, it'll be another awful implementation of a great idea.
>north
You're an immobile computer, remember?
I haven't gone through the ODF specification yet, but there's one thing I'd like to know:
Does the ODF specification support each and every Word/Excel/Powerpoint 2007 feature?
If not, is it extensible?
If it is extensible, do changes have to go through some sort of committee to be incorporated? How frequently are changes incorporated? How long is the process?
One of the most depressing IT related articles I've read in a long time.
bytecolor
It's partly MS's fault, but also partly a whole bunch of organizations' faults. Microsoft isn't the only one with big, ugly proprietary formats. There's still a helluva lot of documents from the age of WordPerfect. The real fault lies in the fact that the old push for standards like ASCII, which was meant to overcome much of this, was ignored in the halcyon days of the personal computer, when companies, whether through dreams of lock-in or simply because they didn't give a damn, ignored decades of work that had produced ASCII, TeX and the like.
To my mind, the very best way to amend this problem is for the archival agencies to insist that only certain formats be accepted. The applications exist to translate Word or WordPerfect documents to open file formats, and it doesn't take a rocket scientist to do it. Don't bloody well go to companies like Microsoft for solutions.
The world's burning. Moped Jesus spotted on I50. Details at 11.
"Spacing like WP6"? "Calculate incorrect leap year like Excel"?
Because if you want to include bugs etc, then no, it doesn't support each and every 2007 feature.
If you mean supporting tables, nested documents, embedded graphs, scripting and so on, yes.
It may not be feature-compatible in the "click the same buttons" sense, and probably not in the "run the same VB code" sense either.
Take a look at some of the people on the board that devised ODF. They include the US National Archives. Print media. Archivists.
Y'know, people who KNOW DOCUMENTS.
As to the remainder of your questions, there is a process, and it does have to go through committee (else how does everyone else know how to implement the new standard? MS doesn't have this problem since they only want themselves to know their updated standard). It is XML so it is extensible (decode the initialism). The process will take as long as it takes. Much the same as Vista will take as long as it takes to get SP 1 out.
I don't see how these latter issues are specific to ODF rather than part of any form of standardisation that OfficeXML will also have to go through for anyone other than MS to implement it...
... one good tester, an open mind and a week. There is almost always a simpler/cheaper solution than "adopt MS's newest buzzword" and testers love to find those solutions!
Scott Barber
Chief Technologist, PerfTestPlus
Executive Director, Association for Software Testing
Step 2: Make sure to copy the files to new storage media once they become widely available (ie. copy the documents from the tape onto DVDs and hard disks). Continue doing this when new holographic/whatever media become available.
For formats that are not easily convertible (databases, binary code, etc.), make sure to archive the original program, and hunt for emulators that will emulate the appropriate hardware.
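The "copy to new media" step above is only worth anything if you check that the copy is bit-for-bit identical. A minimal Python sketch of that idea follows; the names `sha256_of` and `migrate`, the file name, and the temporary directories standing in for "tape" and "disk" are all made up for illustration, not taken from any real archive's tooling.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def migrate(src: Path, dest_dir: Path) -> Path:
    """Copy src into dest_dir and raise if the copy's checksum differs."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # copy2 also tries to preserve timestamps
    if sha256_of(src) != sha256_of(dest):
        raise IOError(f"checksum mismatch copying {src} to {dest}")
    return dest

# Demo with temporary directories standing in for old and new media.
with tempfile.TemporaryDirectory() as tmp:
    old_media = Path(tmp) / "tape"
    new_media = Path(tmp) / "disk"
    old_media.mkdir()
    original = old_media / "minutes-1987.txt"
    original.write_bytes(b"Minutes of the meeting...\n")
    copy = migrate(original, new_media)
    print(copy.name, sha256_of(copy) == sha256_of(original))
```

Keeping the checksums alongside the files also lets you re-verify the new media every few years, before it quietly rots.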
Good! The word is carefully chosen. It now has a chance to be heard by politicians. Wouldn't there be a way to link proprietary formats to Al-Qaeda? Come on! I'm sure we can!
The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
... we already have a solution: http://www.naa.gov.au/recordkeeping/preservation/digital/applications.html
The Archives' approach to digital preservation relies on converting digital records from their original format into preservation formats. Xena (XML Electronic Normalising of Archives) is the program created by the National Archives to complete these processes.
Xena converts digital records into two preservation formats.
* Bitstream version. This is a metadata-wrapped bitstream version of the record, which is considered a secure original copy of the record. This version contains all of the information from the original, but requires access to the original hardware, operating system and application software for performance.
* Normalised version. This version is also wrapped in metadata. The process of normalising converts the record from its original format into eXtensible Mark-up Language (XML). The XML version is not considered to be an original copy of the record as some information may be lost during the normalisation process. However, the performance of the normalised object is the closest to the original that is currently possible. Xena is being continually improved so, over time, the performance of normalised versions is expected to more closely replicate the original.
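To give a feel for what a "metadata-wrapped bitstream" could look like, here is a minimal Python sketch. To be clear, this is NOT Xena's actual file layout: the element names (`record`, `metadata`, `content`), the checksum field and both function names are assumptions invented for this example.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

def wrap_bitstream(name: str, payload: bytes, original_format: str) -> bytes:
    """Wrap raw bytes in a small XML envelope (hypothetical layout, not Xena's)."""
    root = ET.Element("record")
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "name").text = name
    ET.SubElement(meta, "format").text = original_format
    ET.SubElement(meta, "sha256").text = hashlib.sha256(payload).hexdigest()
    # Base64 keeps arbitrary bytes safe inside a text-based container.
    ET.SubElement(root, "content", encoding="base64").text = (
        base64.b64encode(payload).decode("ascii")
    )
    return ET.tostring(root, encoding="utf-8")

def unwrap_bitstream(wrapped: bytes) -> bytes:
    """Recover the original bytes and verify them against the stored checksum."""
    root = ET.fromstring(wrapped)
    payload = base64.b64decode(root.findtext("content"))
    if hashlib.sha256(payload).hexdigest() != root.findtext("metadata/sha256"):
        raise ValueError("payload does not match recorded checksum")
    return payload

original = b"\xd0\xcf\x11\xe0 pretend this is an old .doc file"
wrapped = wrap_bitstream("report.doc", original, "Microsoft Word 95")
print(unwrap_bitstream(wrapped) == original)
```

The point of the wrapper is that the original bytes survive untouched, while the metadata tells a future reader what they are looking at. Actually rendering the thing still needs the original software, as the description above says.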
We live in an age when brand-new, undocumented, *encrypted* file formats are deciphered within days or weeks. You're telling me that in a few decades, NOBODY will be able to figure out a spreadsheet or word-processing document?
Oh, you're not stuck, you're just unable to let go of the onion rings.
Not everyone accepts the Pebi designation, so I included the phrase "depending on convention" in an effort to bypass arguments on both sides. Obviously, my effort failed. ;)
Ben Hocking
Need a professional organizer?
I'm curious to see what happens, if Blai^H^H^Hrown or Bush learn that the UK National Archive has got a time-bomb...
The MAFIAA is a bunch of mindless jerks who will be the first up against the wall when the revolution comes
It's a time bomb, going to explode any second causing massive data loss sending us into an eternal dark age where no one can access old copies of childrens TV programs!
:)
Come on, I know I shouldn't be surprised, you can only expect such FUD from a news company, but still, this is crap. You know what they need to do? Keep all the original recordings. Then, in a digital database, keep standardised digital copies of the recordings (in the highest quality necessary), along with information about where each original is stored, what format it was originally in, and, finally, where the equipment needed to view the original recordings is kept. They should also have a store, backed up, with every piece of equipment needed to play back every file in the library. This sounds like overkill, but it is simply a good method of backing up. The complications come from how badly standardised things were in the past, but with an effort they can maintain that library. Not only that, they should be standardising future recordings, so that backing up and future-proofing can be done more easily in future.
See, two minutes and you can think up a simple strategy to preserve all data and make it future proof. In actuality, with a concerted effort over time it can be simple: past data is safe, and by standardising future data you significantly cut down on future effort. All that, and I didn't even need to refer to explosives, terrorists or any scare tactics. Weird, isn't it?
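The database described above boils down to one record per item plus a couple of lookups. A rough Python sketch, with every class name, field name and sample entry invented purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ArchiveRecord:
    title: str
    standard_copy: str        # where the standardised digital copy lives
    original_location: str    # where the original recording is stored
    original_format: str      # what format the original is in
    playback_equipment: str   # what is needed to play the original

@dataclass
class Catalogue:
    records: list = field(default_factory=list)

    def add(self, record: ArchiveRecord) -> None:
        self.records.append(record)

    def needing(self, equipment: str) -> list:
        """Titles whose originals depend on a given piece of equipment."""
        return [r.title for r in self.records if r.playback_equipment == equipment]

    def formats(self) -> list:
        """Every original format the equipment store has to cover."""
        return sorted({r.original_format for r in self.records})

catalogue = Catalogue()
catalogue.add(ArchiveRecord("Programme A", "store/a.dat", "Vault 3, shelf 12",
                            "2-inch videotape", "2-inch videotape machine"))
catalogue.add(ArchiveRecord("Programme B", "store/b.dat", "Vault 1, shelf 4",
                            "16mm film", "16mm film projector"))
print(catalogue.formats())
print(catalogue.needing("16mm film projector"))
```

The `formats()` list doubles as the shopping list for the equipment store: if a format appears there and nothing on the shelf can play it, you know what is at risk.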
That, or stick all the files on a torrent and they'll float round the internet for years
... but isn't an archive supposed to be future proof ?
+1 Funny if I had modpoints :)
OOXML is XML -- if you want to extract plain text from it just feed it through an XML parser and strip all the tags.
Precisely my point. If the layout and non-text information in the file matters, then you've thrown it away. If it doesn't, then why are you bothering to put it in the archive?
You can do something similar with Office's format, but the solution will be far less perfect and contain lots of junk.
Yes, and (as I noted) I've done the same thing, and it's a relatively crude way of reverse-engineering the format.
On a spectrum of "what's a good archive format" OOXML is a bit better than older office formats.
But compared to even something like HTML that's got an open specification that's actually open enough to have multiple independent implementations, and easy enough to implement that you can do it in BASIC for display on a dumb terminal, OOXML is just daft.
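For reference, the "feed it through an XML parser and strip all the tags" approach discussed above looks roughly like this in Python. It assumes the usual .docx layout, a ZIP package whose main text sits in `word/document.xml`; the function name `docx_plain_text` is made up, and the in-memory package built here is a tiny stand-in, not a complete, valid .docx file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def docx_plain_text(docx_bytes: bytes) -> str:
    """Return the text of word/document.xml with all markup discarded."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml_data = archive.read("word/document.xml")
    root = ET.fromstring(xml_data)
    # itertext() yields every text node in document order, ignoring tags.
    return "".join(root.itertext())

# Build a tiny stand-in package in memory (not a complete, valid .docx).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
document_xml = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello, </w:t></w:r>"
    "<w:r><w:t>archive.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)

print(docx_plain_text(buffer.getvalue()))
```

Note what gets lost: the bold run, the paragraph boundaries and anything else that isn't a text node simply vanish, which is exactly the objection raised above about layout and non-text information.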
Ok, just finally realized that OOXML is the MS format and not the Open Office one.
:p
No prize for guessing why they used that name.
They want to use Microsoft to convert file formats? The same company who can't even save/load their own proprietary formats in their tiny little locked-in world... I think these records are in serious danger of being lost forever.
The very king of incompatible proprietary formats, where every version of their formats requires upgrading to the next version of software to continue using their format, is promising that all will be compatible with their newest proprietary format!
George Orwell could not have dreamed up this scenario!
It's worth going to one of their lectures, actually.
No, the amusing bit is that MS has shuffled its way in there and is flogging the single most important threat to longevity of digital information: an ill documented, proprietary standard that only pretends to be open. The problem the British library has is very obvious, so it's not like a sales rep has to think hard - the MS solution on offer is ludicrous to anyone who's been near a standards process and is simply a path to establish credibility for, well, the blatant after-the-facts rigging of standards to claw back the proprietary grip on the market after ODF gave that a good and well needed kicking.
IMHO, MS is yet again able to use British establishment figures to do its selling for them.
If I recall correctly, Blair was so impressed by money that he let himself be talked into being present at the UK W2K launch (or the version before that), thus providing a Government endorsement, and it seems history is about to repeat itself.
You'd think the British Library would know about history..
This example perfectly illustrates the problems with proprietary formats. Once the software that interprets a proprietary format vanishes, any information written in it is gone. Okay, it's not gone gone. I am sure you can get a bunch of good cryptanalysts to pore over binary dumps of these files. Eventually they will crack it - if your information is worth the cost, that is.
This is why we need open standard formats such as ODF and should reject pretenders such as OpenXML. Just because the name has "open" in it does not mean the information needed to completely read and write OpenXML is freely available to the public. This makes OpenXML a proprietary format despite the name.
OpenXML should be placed where it belongs - the rubbish bin.
This isn't new. For years (decades) many large defense and government projects have all the source code, documentation etc stored along with a computer all setup with the required software in order to read all this stuff.
What's new is the fact businesses are starting to realise they have the same problem.
AKA The wolf will hire himself out cheaply as a shepherd.
Get your own free personal location tracker
http://www.linux.org.au/conf/2007/talk/55.html
Michael Carden explains it well
This has been a problem with ALL media that is not readable without technology. Or even if the people who know the language die off; we couldn't read hieroglyphics if we hadn't found the Rosetta Stone.
Anyone have a wire recorder handy? They were very popular back in the (19)20s and 30s. Oh yes, can someone loan me a Dictaphone or a dictaphone belt? How about a phonograph that plays 78rpm records? How about even having a phonograph? 8 Track tape? Now, as for computer formats, does anyone have any 80 Column punch cards? Teletype or a paper tape reader? 12" magnetic tape reels, or tape drive that reads 7 track coding (as opposed to newer 9 track), presuming that they even have tape any more? Or most of the stuff used with mainframe computers. How about 8 inch or 5 1/4 inch diskette? Got any Zip disks? Now, do you have any .LBR or .ARC archive files? What about EBCDIC, read any files coded using it lately?
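As it happens, EBCDIC is one of the easier items on that list, because decoders for it are still shipped with mainstream languages. A small Python sketch, assuming the data uses code page 037 (one common EBCDIC variant; the sample bytes are made up and other code pages differ):

```python
# cp037 is one common EBCDIC code page (US/Canada); others exist.
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])

text = ebcdic_bytes.decode("cp037")
print(text)

# The same text as ASCII bytes, for comparison with the EBCDIC values above.
print(text.encode("ascii").hex())
```

The hard part is usually getting the bytes off the punch cards or 7 track tape in the first place, not decoding them afterwards.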
When was the last time you handled a photograph that had a negative? I handle probably a dozen images or more a day when I'm going through digital pictures on my computer, but it's probably been ten years since I had a picture that had a photographic negative. But we might have pictures and plates as far back as the 1890s when the camera was first developed; it's highly unlikely you can get duplicates made, or if you can, it's going to require a specialty photographic processor and is probably expensive. Does anyone even use film anymore for "home movies" or are we using video tape and now video disc? The cost differential between video and film is about 50 to 1, e.g. for $3 you can buy a high-quality tape that will record 2 hours vs. 3 minutes for 8mm, if film is even that cheap; I haven't had to buy 8mm film for twenty years. What happens to those old movies? If we can even view them, it's usually because they have been converted to tape or disc.
Oh, yes, video tape. Movies are going all disc now, and as a result most video stores are selling their tape collections at low prices ($1 per tape) because the space and cost of disc has become much more advantageous; in the space of three video tapes you can probably store ten or more discs. Which begs the question, if either the HD-DVD or BLU RAY format wars get settled, shouldn't we expect all videos to go to that format? (Or maybe they'll just release in both; in either case, you'll either need two machines or eventually they'll have to develop a dual-format machine to read both.) Oh yes, I forgot the earlier videodisc format that came out long before CDs.
The changing of storage formats has caused problems even with open format standards - let alone troubles over files using proprietary or non-standard formats - as we have changed technology. This has been noted for years and is a big problem for non-profits with limited resources - such as libraries - which might have to convert data from one device or file format to another as older systems become obsolete and data is trapped on those systems if not converted. Lots and lots of data produced at significant expense has either been lost or is inaccessible because the systems that coded it are failing as parts become unavailable and machines cannot be maintained, and where they can be, it's a huge expense to do so.
Paul Robinson - My Blog
The lessons of history teach us - if they teach us anything - that nobody learns the lessons that history teaches us.
Finding software support for obsolete file formats isn't NEARLY as serious a problem as the media that these files are stored on. Maintaining digital media (optical discs, tapes, floppies, hard drives, etc.) over an archival length of time (500+ years) is likely going to be VERY tough. Simply READING the media will be a challenge in the future. You can open a book from 1,000 years ago and read the thing (you may have to learn a specialized language, but it's easily possible). How are you going to read a hard drive 1,000 years from now--long after the hardware to read it has ceased to be manufactured and there aren't even blueprints still around to make 20th/21st century computers? Getting the bits to make sense after you access them may prove trivial next to the challenge of getting to them in the first place.
SJW: Someone who has run out of real oppression, and has to fake it.
Oh, I'm sorry sir, I thought you were referring to me, Mr. Wensleydale.
See http://www.cedu.niu.edu/blackwell/multimedia/high/library.html for some fascinating lookback, including
* bonobo trail blazes
* the Edison electric pen
* Baird mechanical television
and my personal favourite
* Rene Dagron, Pigeon Post Microfilm Balloonist