National Archive File Format Time Bomb
geordie_loz writes "The BBC is reporting that the UK National Archive is warning of old formats being a 'ticking time-bomb' where data is going to be lost because of incompatibility in newer versions of software, and software not existing at all. More surprisingly, Microsoft has offered a solution via the OOXML format."
There are so many idiots in this state of affairs:
1. the idiots who decided to build a huge archive in an undocumented proprietary format
2. idiots who believe they can't find even a single copy of the software they need
3. idiots who didn't store a single copy of the software that reads the format together with the archive (not very far from obvious, is it?)
4. idiots who want to convince other idiots that OOXML is an open format (versus a straight XML serialization of whatever the binary DOC was in the MS source code base at the time)
It predates Moses, and is quite likely to survive the heat death of the universe.
"The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
To give it its proper name, the format is Microsoft's "Office Open XML"; they deliberately went to a lot of trouble to pick a name that's as easy to confuse as possible with OpenOffice.
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
If you have a problem with proprietary formats you go to Microsoft to solve it for you... The word "DOH" springs to mind.
Oh yeah, their solution? Virtualised Windows 3.1. And obviously in 15 years you'll have to virtualise Vista in order to run the Win3.1 virtual machine to run Word. And Microsoft will be paid a license for each application and level of virtualisation.
You couldn't make this stuff up.
There is no such thing as Open Office format. Perhaps you mean OpenDocument Format, which is used by several different applications ( http://en.wikipedia.org/wiki/List_of_applications_supporting_OpenDocument ), including OpenOffice.org.
It seems to me that this is really a nonproblem--OOo is compatible with lots of "dead" formats (or, can read them at least), as are many other open source office programs. I can't imagine they're going to begin throwing away this compatibility--it isn't like it takes extra coding (as far as I know). Also, I have found Microsoft Word's "Extract text from any file" to work pretty well (I had a roommate with a corrupted Mac-formatted disk that had his deceased grandmother's journal on it in some old Mac Word file (a format still readable in Word, but the disk was corrupted so I couldn't just open the file). I popped it in my parents' now deceased iMac and the only program I found that opened it was Word, using the "Extract text from any file" function. I emailed him the journal and he thanked me profusely).
Also--as noted, the OOXML format is a nonsolution for this nonproblem. It seems like it would be a waste of effort--why convert a bunch of files to a format that may die just as quickly as any other format, when you can just leave the file as is and open it in OOo (assuming I'm correct that they won't stop read support for dead formats)?
Also, it seems to me that no current format or any future format will ever solve this nonproblem because formats will always change as new functionality is continually added. The better solution is to keep this a nonproblem by having open source software that can read old file formats.
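For anyone wondering what an "extract text from any file" style salvage might look like, here is a minimal sketch in Python. It is not how Word does it; it just pulls runs of printable ASCII out of arbitrary bytes, much like the Unix strings tool. The function name and the minimum run length of 4 are my own assumptions, and text stored in other encodings (UTF-16, for example) would be missed.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least `min_len` printable ASCII characters.

    Hypothetical helper for illustration: it ignores the file's real
    structure and keeps whatever looks like plain text.
    """
    # Printable ASCII is the range from space (0x20) to tilde (0x7e).
    pattern = re.compile(b"[\\x20-\\x7e]{" + str(min_len).encode("ascii") + b",}")
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    # Made-up bytes standing in for a damaged document.
    damaged = b"\x00\x01Dear diary\x00\xff\xfeab\x00Monday: rain\x02"
    for run in extract_text_runs(damaged):
        print(run)
```

Formatting, tables and embedded objects are lost this way, so it is a last resort rather than a substitute for software that actually understands the format.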
What's surprising about that? Someone in MS Spin Control and Public Relations is worth his salary. The story could have exploded into an "avoid MS products if you want your data accessible some years down the road" fiasco (we all know that MS is the worst offender when it comes to changing the document formats, usually undocumented). Instead, it was turned into another push for their next format.
Brilliant.
"What, the shit I sold you yesterday stinks? Try this new shit, it's great and it has none of the problems of the old one."
That's what you hire PR people for.
Assorted stuff I do sometimes: Lemuria.org
Rather than bitching about Microsoft making an offer of 'help' which is just thinly disguised marketing (I mean, come on, par for the course, no?), could we get a discussion about real solutions? I know MS bashing is fun, but come on, we do it on just about every other thread... let's have a day off.
To kick things off here's one:
Keep EVERYTHING in the simplest possible format. ASCII would seem sensible, since it's the content we care about, not the formatting (although that wouldn't help our Asiatic brethren much). Then keep decent records of HOW you can read that format, with examples of the software and hardware. Do this bit on PAPER. V. Tough Paper (or rock, or plastic or whatever). Update the explanations every other year, to put them in language the next gen will understand. Maybe also have instructions on how to translate the simple format to less simple things.
I guess, basically, it's a case of KISS and then *provide a persistent and regularly updated 'Rosetta Stone'* for latecomers to work from.
As a side branch, this kind of reminds me of discussions I read about a while back of how to warn future generations about Nuclear Waste dumps (y'know, the really nasty stuff with half-lives in the thousands of years range). I don't think anyone ever came up with a decent answer....
'Speak softly and carry a beagle'
A pentabyte is 5 bytes, right? How hard is it to store 40 bits on paper? ;)
(I assume petabyte (10^15 or 2^50, depending on convention) is the word you're looking for.)
Ben Hocking
Need a professional organizer?
Unfortunately, those bright people don't get to make technical decisions.
The British Library recently introduced SED, an electronic document delivery system. With SED, you can order electronic copies of journal papers and articles from their archives. Great idea! Previously, you had to wait for the documents to come through the post, and that would take a week or so. Now you get them by email in a couple of working days.
Except that the documents are crippled by Adobe DRM, which imposes the following restrictions:
- You can only view them using certain specific versions of Acrobat Reader (6 or 7) - the latest version is not recommended.
- The software only works on Windows 2000 or XP. No Linux support, no Mac support. Vista might work, but again, it's not recommended.
- You can only look at each document for a limited time, and you can only print it once.
So, if you want to use the service, you'd better hope that you have (a) the right version of Windows, (b) the right version of Acrobat Reader, (c) a reliable net connection, and, most importantly, (d) a very reliable printer that won't chew up the document. Unless you're a filthy dirty pirate, of course.
If Adobe managed to convince the British Library to put up with this ridiculous system, I am sure that Microsoft will have no difficulty convincing them about their archive "solution". If SED is anything to go by, it'll be another awful implementation of a great idea.
>north
You're an immobile computer, remember?
And being a government, these files are INCREDIBLY important.
;)
Why haven't they been converted? Really, all their DIGITAL archives should be in a single format by now.
No, they shouldn't. You usually want 3 formats:
- the original format of the document. Whatever whichever idiot happened to write (or record, or video) it in, you absolutely want the original in your records.
- a searchable format (eg OCR'd text from scanned image docs)
- a rendered format. (eg an image or pdf, or svg - something open enough that you can continue to show how the doc would have looked). The appropriate rendered format varies. Paper is not an appropriate format for storing CCTV footage, for example
If you're very, very lucky the original is both searchable and viewable; like, say, HTML. It gets more complicated too, because you often want to store a redacted copy of the document (think of the Onion story 'CIA realise they've been using black highlighter pen all these years') and you want that searchable too, so you have to keep a redacted searchable format too... and of course, some of the records are on actual paper. Have you started worrying about the fading inks in the originals yet?
BTW you can't restrict the format of the original. Consider an email from a corporate bidding for a govt contract, with attachments. They need to keep those.
- Mr. E
PS, posting anon because I have dealings with the national archives, and don't want to speak for my company.
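The three-copies idea above (original, searchable, rendered, plus an optional redacted text) could be sketched as a record type along these lines. Every name here is hypothetical and not taken from any real archive system; the SHA-256 checksum is an extra assumption, added so that damage to the original can be detected later.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArchiveRecord:
    """One archived document, kept in several forms (illustrative only)."""

    original: bytes            # untouched bytes, exactly as received
    original_format: str       # e.g. "application/msword"
    searchable_text: str       # OCR output or extracted text
    rendered: bytes            # how the document looked, e.g. a PDF
    rendered_format: str       # e.g. "application/pdf"
    sha256: str = field(init=False)        # checksum of the original bytes
    redacted_text: Optional[str] = None    # searchable copy safe to release

    def __post_init__(self) -> None:
        self.sha256 = hashlib.sha256(self.original).hexdigest()

    def matches(self, query: str, public: bool = False) -> bool:
        """Case-insensitive search; public searches see only the redacted text."""
        text = self.redacted_text if public else self.searchable_text
        return text is not None and query.lower() in text.lower()


if __name__ == "__main__":
    record = ArchiveRecord(
        original=b"raw bytes of the original file",
        original_format="application/msword",
        searchable_text="Secret budget memo",
        rendered=b"%PDF",
        rendered_format="application/pdf",
        redacted_text="[REDACTED] budget memo",
    )
    print(record.matches("secret"), record.matches("secret", public=True))
```

Keeping the redacted text as a separate field means a public search never touches the unredacted copy, and a record with no redacted text simply matches nothing in public mode.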
"Spacing like WP6"? "Calculate incorrect leap year like Excel"?
Because if you want to include bugs etc., then no, it doesn't support each and every 2007 feature.
If you mean supporting tables, nested documents, embedded graphs, scripting and so on, yes.
It may not be "click the same buttons" feature-correct, nor probably "run the same VB code" compatible.
Take a look at some of the people on the board that devised ODF. They include the US National Archives. Print media. Archivists.
Y'know, people who KNOW DOCUMENTS.
As to the remainder of your questions, there is a process, and it does have to go through committee (else how does everyone else know how to implement the new standard? MS doesn't have this problem since they only want themselves to know their updated standard). It is XML so it is extensible (decode the initialism). The process will take as long as it takes. Much the same as Vista will take as long as it takes to get SP 1 out.
I don't see how these latter issues are something that is a part of ODF and not of any form of standardisation that OfficeXML will have to go through for anyone other than MS to implement...
Yeah, it's XML. Also, unlike OOXML, ODF uses namespaces, so you can create a separate standard if you don't want to muck around with ODF.
It would depend. The thing about changing standards is that it causes problems for all sorts of people. There is a real need for a stable and standardized document format that just doesn't change, or if it does, very slightly.
There is no such thing as Open Office format.
Rubbish. I've worked at places with an Open Office format. Basically they open the office to any monkey who turns up for a job interview and a handful of people have to make up for their incompetence.
These posts express my own personal views, not those of my employer