National Archive File Format Time Bomb
geordie_loz writes "The BBC is reporting that the UK National Archive is warning of old formats being a 'ticking time-bomb' where data is going to be lost because of incompatibility in newer versions of software, and software not existing at all. More surprisingly, Microsoft has offered a solution via the OOXML format."
2. idiots which believe they can't find even a single copy of the software they need
.pw extension. No way to find the program.
Please give me a link to a copy of the Professional Write 3 (PW) software app. for MSDOS 6.
Yep, I had that very problem some years ago when I was cleaning my room and found several 5 1/4 diskettes which contained the
Ubuntu is an African word meaning 'I can't configure Debian'
No. The obvious solution for the predicted problem of data being unavailable due to being in unsupported proprietary formats is to move it to a widely supported non-proprietary format.
As "well intentioned" as Microsoft may be, Microsoft's Open XML cannot be anything but proprietary when its code references Windows and Office API functions rather than more precise data format information as with ODF. (For more information about this, you might search out the arguments against making OOXML an ISO standard.)
1. the idiots who decided to build a huge archive with an undocumented proprietary format
Which seems reasonable at a time when "everyone" has a computer that'll read it, for example when it comes to image viewers there's software covering literally hundreds of formats without issue.
2. idiots which believe they can't find even a single copy of the software they need
It's supposed to be an archive, not a "well we'll have to dig up a copy of the software, I'll get back to you in some months."
3. idiots who didn't store a single copy of the software that reads the format, together with the archive (not very far from obvious, is it).
Presumably a license issue, AFAIK things like national archives typically only required you to file a copy, not to file the software itself. Probably the laws need to be changed to include a copy of the reader software.
4. idiots who want to convince other idiots that OOXML is an open format (versus straight XML serialization of whatever binary DOC was in the source code base at the time in MS)
If they can make a meaningful interpretation in OOXML it's an improvement, if it's a BLOB in XML or a BLOB in OOXML it doesn't really matter, then you still need the software.
If Microsoft has taken on the job of taking some binary BLOB and making it into something human-readable, OOXML or not, then I say you'd have an easier time converting OOXML to something readable in OpenOffice than not.
Besides, this might actually be a semi-legitimate use for all the tags in OOXML which say to emulate an old version, because that's what you're doing. Particularly with stupid formatting like inserting line breaks to make pages break where you want them to, it might actually be preferable compared to manually going over them and replacing them with proper format tags.
Live today, because you never know what tomorrow brings
It's not just about the software. It's the hardware, too.
I'm sure that most of the archive data created today is stored on something like DVDs but, as recently as the early 1990s, the official long-term storage medium for the UK government was Syquest 44MB removable cartridge hard drives.
I know that I have a working 44MB drive (well, when I last fired it up, which would have been sometime last decade) somewhere in my attic but I doubt that too many of these drives are still in existence.
I only hope that the data that was once stored on thousands of these was successfully transferred to a more readily accessible storage format and that that new format is just as durable - media these days just seems to disintegrate after a few years.
"Accept that some days you are the pigeon, and some days you are the statue." - David Brent, Wernham Hogg
Don't have a link, but have Professional Write on CD of "Work software not used anymore"
Along with Professional File (database product)....
I wanted to design something that would be still usable in 100 years. (Donald E. Knuth, more than 20 years ago)
Also, LaTeX will get you nicer documents than any WYSIWYG word processor in less time (once you know it ..). Oh and smaller filesize, too.
Unfortunately, those bright people don't get to make technical decisions.
The British Library recently introduced SED, an electronic document delivery system. With SED, you can order electronic copies of journal papers and articles from their archives. Great idea! Previously, you had to wait for the documents to come through the post, and that would take a week or so. Now you get them by email in a couple of working days.
Except that the documents are crippled by Adobe DRM, which imposes the following restrictions:
- You can only view them using certain specific versions of Acrobat Reader (6 or 7) - the latest version is not recommended.
- The software only works on Windows 2000 or XP. No Linux support, no Mac support. Vista might work, but again, it's not recommended.
- You can only look at each document for a limited time, and you can only print it once.
So, if you want to use the service, you'd better hope that you have (a) the right version of Windows, (b) the right version of Acrobat Reader, (c) a reliable net connection, and, most importantly, (d) a very reliable printer that won't chew up the document. Unless you're a filthy dirty pirate, of course.
If Adobe managed to convince the British Library to put up with this ridiculous system, I am sure that Microsoft will have no difficulty convincing them about their archive "solution". If SED is anything to go by, it'll be another awful implementation of a great idea.
>north
You're an immobile computer, remember?
You always archive the original (unless you have a batch; then you sample one and call it the original), and that original can be in just about any format: hand-written, coffee-stained, in Sanskrit. When scanning a document into an electronic archive the ideal would be to have OCR create a font and layout on the fly while running, so that the electronic version of the original would still look like the dead-tree original, and yet be machine-readable.
Dead-tree is still the standard, as it has been for the past few millennia. Directives on national level will often not even recognize that electronic archiving exists since there is no standard to be used. Once we have a format of SGML such as
In practice your organization will often receive a document per snail-mail from an external source and an electronic version might not even exist. To make it electronic you have to scan it and to make it machine-readable you have to OCR it. Then the final result needs to be retrievable 50 years from now, or until the end of time. Yes, many things are actually archived without an expiration date. The cost, measured in money, of archiving something permanently is literally infinite.
From this perspective we geeks who are used to Moore's law etc. seem pretty darn impatient and narrow-minded. There are factors involved ranging from constructing immortal dinosaur pens to training staff who 'have been doing it their way for the past 30 years' and are hellbent on continuing so until they retire.
All rites reversed 2010
"There is always an easy solution to every human problem--neat,
plausible, and wrong." -- H.L.Mencken, The Divine Afflatus (1917).
Ok, you started by identifying one problem - Asian languages. In fact, pretty much every non-US language since you said ASCII and not Latin1. So we can extend that to UTF-8 with no problems, except there's probably a huge table just for the 100000 characters or so, even though the spec is quite short.
But then, you have only characters, which is probably fine for basic text. How about works containing formulas? Uh-oh. You need some kind of math language for that, something like MathML. Of course, if you don't need math, you don't need MathML either.
But wait... what if it's not all text? What if you're trying to store diagrams, illustrations or graphs which can arguably be just as meaningful as the rest, for example if you're trying to preserve the works of Leonardo da Vinci? And uncompressed bitmaps are so incredibly inefficient, maybe you need a picture spec with some basic picture formats (vector, lossless raster, lossy raster at least), you could call them something like SVG, PNG and JPG. Of course, if you don't need graphics you don't need those.
But then, maybe you're looking to add references. Yes, you could do plain-text matching but it's much more reliable and maintainable to make anchors and point to them, and they let you build tables of contents and such too.
Then, maybe there are times where fonts and layout really do matter, for example the lines of a poem? Perhaps we really should have a system to preserve that. Of course you can try to do this by embedding coding into plaintext the way say Project Gutenberg does but it's not very user-friendly, it's a bit like having "magic numbers" in code. Most of the time it's easier to just have a file format, and say "and these bits are the plaintext".
In short, if you have a problem with storing this in OOXML and the solution is to use ASCII, then I think you're solving only a very very small subset of archive problems. If you're looking at it the other way and saying "I want one document format, to archive all my documents of every shape and form" then you'll be lucky if the 738 pages of OpenDocument standard (which is actually a lot more through referring to other standards) will suffice.
Live today, because you never know what tomorrow brings
And being a government, these files are INCREDIBLY important.
;)
Why haven't they been converted? Really, all their DIGITAL archives should be in a single format by now.
No, they shouldn't. You usually want 3 formats:
- the original format of the document. Whatever whichever idiot happened to write (or record, or video) it in, you absolutely want the original in your records.
- a searchable format (eg OCR'd text from scanned image docs)
- a rendered format (eg an image, PDF or SVG - something open enough that you can continue to show how the doc would have looked). The appropriate rendered format varies. Paper is not an appropriate format for storing CCTV footage, for example.
If you're very, very lucky the original is both searchable and viewable; like, say, HTML. It gets more complicated too, because you often want to store a redacted copy of the document (think of the Onion story 'CIA realise they've been using black highlighter pen all these years') and you want that searchable too, so you have to keep a redacted searchable format too... and of course, some of the records are on actual paper. Have you started worrying about the fading inks in the originals yet?
BTW you can't restrict the format of the original. Consider an email from a corporate bidding for a govt contract, with attachments. They need to keep those.
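To make the three-format idea a bit more concrete, here is a minimal sketch of what a per-document record could look like. Everything in it (the names `ArchiveRecord`, `Rendition`, `REQUIRED_ROLES`, the choice of SHA-256 for fixity) is an assumption for illustration, not how any real archive does it.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical role names for the three copies described above.
REQUIRED_ROLES = ("original", "searchable", "rendered")


@dataclass
class Rendition:
    role: str        # one of REQUIRED_ROLES
    filename: str    # name of the stored file
    media_type: str  # e.g. "application/msword", "text/plain"
    sha256: str      # fixity checksum of the stored bytes


@dataclass
class ArchiveRecord:
    record_id: str
    renditions: list = field(default_factory=list)

    def add(self, role, filename, media_type, data):
        """Register one copy of the document and remember its checksum."""
        if role not in REQUIRED_ROLES:
            raise ValueError("unknown role: " + role)
        digest = hashlib.sha256(data).hexdigest()
        self.renditions.append(Rendition(role, filename, media_type, digest))

    def missing_roles(self):
        """Return the roles that have no stored copy yet."""
        present = {r.role for r in self.renditions}
        return [role for role in REQUIRED_ROLES if role not in present]
```

A redacted copy would just be a second record (or a second set of roles) built the same way; the point is only that the original is never replaced by its conversions.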
- Mr. E
PS, posting anon because I have dealings with the national archives, and don't want to speak for my company.
Step back, though, and think for a minute about the "house of cards" upon which that Word document rests.
It rests on
1) Physical storage medium -- whether this is Flash, Hard Drive, Optical Medium, [NV]RAM, etc., all these technologies may be very difficult to retrieve data from, especially if the level of technology happens to go down in the future (say, global thermonuclear war). Even if data is retrieved, there's no guarantee that it's intact after 1000 years (the dyes in CDs will have decomposed by that time; the Flash drives will have leaked all charge; the hard drives will have randomly demagnetized over that period of time, etc.).
2) 8 bits per byte, 32 bits for most integers in the file, IEEE specification for floating point numbers, ASCII, Unicode, GB2312, etc. -- encodings for our numbers, letters and even bit-packing of binary data will affect retrievability. It would be difficult, I imagine, for some futuristic person to wander along (with possibly a different language) and attempt to interpret all of that.
3) File format -- the MSOffice OLE2 format is incredibly complex, perhaps overly so. An OLE2 file takes the form of a miniature filesystem, with a File Allocation Table (FAT), fixed-size sectors of 512 or 4096 bytes, a master Double-Indirect FAT (DIFAT), sub-block allocation for small streams, etc. Fragments within the OLE2 container are assembled into "files" or streams in this file system and then parsed. Sure, it makes for wickedly fast saving times (since you can write to changed portions only and add fragments as needed, like a real file system), but it also makes it damn hard to parse, compared to plain text formats like XML or RTF.
There are many more layers to this system, but that's a basic overview of what a researcher (or someone else) 1000 years from now will have to contend with.
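To give a feel for the file-format layer alone, here is a rough sketch that reads just the fixed header of an OLE2 container. The field offsets are my reading of the published compound file layout and should be checked against the spec before relying on them; the function name `read_ole2_header` is made up. Note that this only tells you where the allocation tables start; walking the sector chains to reassemble a stream is a good deal more work.

```python
import struct

# Signature bytes at the start of an OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def read_ole2_header(header: bytes) -> dict:
    """Decode a few fixed header fields (offsets assumed from the CFB layout)."""
    if len(header) < 76 or header[:8] != OLE2_MAGIC:
        raise ValueError("not an OLE2 compound file")
    # Offsets 24..33: minor version, major version, byte order,
    # sector shift, mini sector shift (all little-endian uint16).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24)
    # Offsets 44..51: number of FAT sectors, first directory sector.
    num_fat, first_dir = struct.unpack_from("<2I", header, 44)
    # Offsets 60..75: mini FAT and DIFAT locations and counts.
    first_minifat, num_minifat, first_difat, num_difat = struct.unpack_from(
        "<4I", header, 60)
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,      # sizes are stored as powers of two
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": num_fat,
        "first_directory_sector": first_dir,
        "first_minifat_sector": first_minifat,
        "minifat_sectors": num_minifat,
        "first_difat_sector": first_difat,
        "difat_sectors": num_difat,
    }
```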
Sure, if you're looking at just extracting text, you can skip the last layer by simply going in the file and pulling as much out as possible, much like I used to do with corrupted Word documents. However, if you're looking to retrieve images, videos, archival audio, etc. then your job is much harder.
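The "just pull out whatever text you can" approach is roughly what the Unix strings tool does. A minimal sketch, assuming the text is stored either as plain ASCII or as UTF-16LE (the function name `extract_text_runs` and the minimum run length of 4 are arbitrary choices of mine):

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list:
    """Return (offset, text) pairs for printable runs found in raw bytes."""
    # Runs of printable ASCII, one byte per character.
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # Runs of printable ASCII stored as UTF-16LE (each character followed by a zero byte).
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    runs = []
    for m in ascii_pat.finditer(data):
        runs.append((m.start(), m.group().decode("ascii")))
    for m in utf16_pat.finditer(data):
        runs.append((m.start(), m.group().decode("utf-16-le")))
    # Report runs in the order they appear in the file.
    runs.sort(key=lambda pair: pair[0])
    return runs
```

This recovers words, not structure: fragments come out in storage order, which in a fragmented container is not necessarily reading order, and it does nothing at all for images or audio.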
We fought a lot with this at Siemens (Sietec) about fifteen years ago, when trying to decide what format to use on stackers full of 12" WORM disks, which were just nicely becoming useful for large-scale archival storage in those days. We needed a format that would outlast the disks, which probably meant 50-100 years assuming normal replacement/turnover.
We ended up with the bottom level being a WORM standard, which was served out to users via the NFS standard, which was reasonably close to a Unix filesystem, and was usable by Windows clients, and finally we stored the data in quite simple random files with tables of contents, so we could handle multi-page documents.
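For readers wondering what "simple files with tables of contents" might look like, here is a purely hypothetical sketch: a magic tag, a page count, a table of (offset, length) pairs, then the page data. This is not the layout described above, only an illustration of how little is needed; `MAGIC`, `pack_pages` and `read_page` are invented names.

```python
import struct

MAGIC = b"TOC1"  # hypothetical 4-byte file tag


def pack_pages(pages):
    """Build one blob: magic, page count, table of contents, then page bytes."""
    header_size = len(MAGIC) + 4 + 16 * len(pages)
    toc = []
    offset = header_size
    for page in pages:
        # Each table entry is an absolute offset and a length, both uint64.
        toc.append(struct.pack("<QQ", offset, len(page)))
        offset += len(page)
    return MAGIC + struct.pack("<I", len(pages)) + b"".join(toc) + b"".join(pages)


def read_page(blob, index):
    """Fetch a single page by looking it up in the table of contents."""
    if blob[:4] != MAGIC:
        raise ValueError("unknown file tag")
    (count,) = struct.unpack_from("<I", blob, 4)
    if not 0 <= index < count:
        raise IndexError("no such page")
    offset, length = struct.unpack_from("<QQ", blob, 8 + 16 * index)
    return blob[offset:offset + length]
```

A format this small can be documented on one sheet of paper stored next to the disks, which is much of the appeal.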
In practice we found the data we were storing was almost always images, as that was what businesses wanted to store: scanned images of legal, business and medical documents. As the parent suggested, we used as simple a format as possible, but no simpler (;-))
For text documents, I recollect we did support some commercial formats, but only ones for which we knew the full specification and had a translator in source form. Our own data was mostly LaTeX, the typesetting language, expressed as ASCII characters, and occasionally PostScript or PDF, ditto.
--dave
davecb@spamcop.net
For example, I keep a copy of DOS and Win3.1 ISOs (about 20MB total) and Norton Commander (3 floppy images!) on a DVDR, along with a copy of Virtual PC. This lets me recreate a Windows 3.1 virtual PC anytime I want.
Now.... You can do that now. However, in 100 years, will this be possible? You do not know what the future brings. Let's not even talk about 1000 years and beyond. Now; you backed this stuff up on a DVD and you die tomorrow. Your kids keep the data, and when they die a historian specialising in the 20th century wants to analyse the daily life of a 20th century person. VMWare is long dead, you backed it up... Sure, but his platform can't run it. We're at least 10 operating system versions later, and they run on a new platform. x86 is long forgotten and they moved to quantum computers.
Perhaps the guy is lucky, and can run what you backed up in an emulator in an emulator in an emulator in an emulator. Perhaps....
I have to this day zip files containing WordPerfect 5.1 files of the letters with a girl I was penpals with (and to whom I ultimately lost my virginity, but that is another story). Those letters, documenting life in the mid eighties to mid nineties might be interesting to a historian someday. (Historians love the daily lives of long dead people). Will they be able to read it, in 100 years? I don't know, especially in a proprietary format like WordPerfect.
My copy of Office XP won't activate on any of the computers I currently own (the hardware it was originally activated on is long-dead), and that's only 5 years old.
"I've got more toys than Teruhisa Kitahara."