Slashdot Mirror


National Archive File Format Time Bomb

geordie_loz writes "The BBC is reporting that the UK National Archive is warning of old formats being a 'ticking time-bomb' where data is going to be lost because of incompatibility in newer versions of software, and software not existing at all. More surprisingly, Microsoft has offered a solution via the OOXML format."

6 of 233 comments

  1. Re:Idiots by xtracto · · Score: 4, Interesting

    2. idiots which believe they can't find even a single copy of the software they need

    Please give me a link to a copy of the Professional Write 3 (PW) software app. for MSDOS 6.

    Yep, I had that very problem some years ago when I was cleaning my room and found several 5 1/4 diskettes containing files with the .pw extension. No way to find the program.

    --
    Ubuntu is an African word meaning 'I can't configure Debian'
  2. More surprisingly!? No, UNsurprisingly by erroneus · · Score: 3, Interesting

    No. The obvious solution for the predicted problem of data being unavailable due to being in unsupported proprietary formats is to move it to a widely supported non-proprietary format.

    As "well intentioned" as Microsoft may be, Microsoft's Open XML cannot be anything but proprietary when its code references Windows and Office API functions rather than more precise data format information as with ODF. (For more information about this, you might search out the arguments against making OOXML an ISO standard.)

  3. Bright people don't make tech decisions by Cheesey · · Score: 4, Interesting
    The idea that an institution like the British Library, which is run by people bright enough to make you look like a dead match, would accept such a preposterous idea is insulting.

    Unfortunately, those bright people don't get to make technical decisions.

    The British Library recently introduced SED, an electronic document delivery system. With SED, you can order electronic copies of journal papers and articles from their archives. Great idea! Previously, you had to wait for the documents to come through the post, and that would take a week or so. Now you get them by email in a couple of working days.

    Except that the documents are crippled by Adobe DRM, which imposes the following restrictions:
    • You can only view them using certain specific versions of Acrobat Reader (6 or 7) - the latest version is not recommended.
    • The software only works on Windows 2000 or XP. No Linux support, no Mac support. Vista might work, but again, it's not recommended.
    • You can only look at each document for a limited time, and you can only print it once.
    So, if you want to use the service, you'd better hope that you have (a) the right version of Windows, (b) the right version of Acrobat Reader, (c) a reliable net connection, and, most importantly, (d) a very reliable printer that won't chew up the document. Unless you're a filthy dirty pirate, of course.

    If Adobe managed to convince the British Library to put up with this ridiculous system, I am sure that Microsoft will have no difficulty convincing them about their archive "solution". If SED is anything to go by, it'll be another awful implementation of a great idea.
    --
    >north
    You're an immobile computer, remember?
  4. Re:Doesn't matter. by Bazzargh · · Score: 5, Interesting

    And being a government, these files are INCREDIBLY important.

    Why haven't they been converted? Really, all their DIGITAL archives should be in a single format by now.


    No, they shouldn't. You usually want 3 formats:
    - the original format of the document. Whatever whichever idiot happened to write (or record, or video) it in, you absolutely want the original in your records.
    - a searchable format (eg OCR'd text from scanned image docs)
    - a rendered format. (eg an image or pdf, or svg - something open enough that you can continue to show how the doc would have looked). The appropriate rendered format varies. Paper is not an appropriate format for storing CCTV footage, for example ;)

    If you're very, very lucky the original is both searchable and viewable; like, say, HTML. It gets more complicated too, because you often want to store a redacted copy of the document (think of the Onion story 'CIA realise they've been using black highlighter pen all these years') and you want that searchable too, so you have to keep a redacted searchable format too... and of course, some of the records are on actual paper. Have you started worrying about the fading inks in the originals yet?

    BTW you can't restrict the format of the original. Consider an email from a corporate bidding for a govt contract, with attachments. They need to keep those.
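
    A rough sketch of how those copies might hang together per record. Every name here (Rendition, ArchivedRecord, missing_renditions, the media types) is made up for illustration and is not how any real archive system models this:

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: formats that are both searchable and viewable as-is (the "very lucky" case).
SELF_SUFFICIENT_TYPES = {"text/html", "text/plain"}


@dataclass
class Rendition:
    path: str
    media_type: str


@dataclass
class ArchivedRecord:
    record_id: str
    original: Rendition                      # whatever format it arrived in
    searchable: Optional[Rendition] = None   # e.g. OCR'd text
    rendered: Optional[Rendition] = None     # e.g. image, PDF or SVG
    redacted_searchable: Optional[Rendition] = None
    redacted_rendered: Optional[Rendition] = None

    def missing_renditions(self) -> List[str]:
        """List which derived copies still need to be produced."""
        if self.original.media_type in SELF_SUFFICIENT_TYPES:
            return []
        missing = []
        if self.searchable is None:
            missing.append("searchable")
        if self.rendered is None:
            missing.append("rendered")
        return missing
```

    The original is the only mandatory field; everything else is derived from it and can be regenerated later.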

    - Mr. E

    PS, posting anon because I have dealings with the national archives, and don't want to speak for my company.

  5. Re:Idiots by nneonneo · · Score: 3, Interesting

    Step back, though, and think for a minute about the "house of cards" upon which that Word document rests.

    It rests on
    1) Physical storage medium -- whether this is Flash, Hard Drive, Optical Medium, [NV]RAM, etc., all these technologies may be very difficult to retrieve data from, especially if the level of technology happens to go down in the future (say, global thermonuclear war). Even if data is retrieved, there's no guarantee that it's intact after 1000 years (the dyes in CDs will have decomposed by that time; the Flash drives will have leaked all charge; the hard drives will have randomly demagnetized over that period of time, etc.).
    2) 8 bits per byte, 32 bits for most integers in the file, IEEE specification for floating point numbers, ASCII, Unicode, GB2312, etc. -- encodings for our numbers, letters and even bit-packing of binary data will affect retrievability. It would be difficult, I imagine, for some futuristic person to wander along (with possibly a different language) and attempt to interpret all of that.
    3) File format -- the MSOffice OLE2 format is incredibly complex, perhaps overly so. An OLE2 file takes the form of a miniature filesystem, with a File Allocation Table (FAT), 512- or 4096-byte sectors, a master Double-Indirect FAT (DIFAT), sub-sector allocation for small streams, etc. Fragments within the OLE2 container are assembled into "files" or streams in this file system and then parsed. Sure, it makes for wickedly fast saving times (since you can write to changed portions only and add fragments as needed, like a real file system), but it also makes it damn hard to parse, compared to plain text formats like XML or RTF.
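
    To give a feel for layer 3, here is a minimal sketch that reads only the fixed header of an OLE2 container, using the field offsets as I understand them from Microsoft's published Compound File Binary spec. The function name and the dictionary keys are my own, and it does not follow the allocation chains or read any streams, which is where the real work starts:

```python
import struct

# Magic bytes at the start of every OLE2 / Compound File Binary container.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(header: bytes) -> dict:
    """Decode the fixed-position fields in the first 76 bytes of an OLE2 file."""
    if len(header) < 76:
        raise ValueError("need at least 76 bytes of header")
    if header[:8] != CFB_SIGNATURE:
        raise ValueError("not an OLE2 compound file")
    # Offsets 24..33: five little-endian 16-bit fields.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24
    )
    # Offsets 40..75: nine little-endian 32-bit fields.
    (num_dir, num_fat, first_dir, _transaction, mini_cutoff,
     first_minifat, num_minifat, first_difat, num_difat) = struct.unpack_from(
        "<9I", header, 40
    )
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,       # block size of the mini filesystem
        "mini_sector_size": 1 << mini_shift,    # sub-block size for small streams
        "fat_sectors": num_fat,
        "first_directory_sector": first_dir,
        "mini_stream_cutoff": mini_cutoff,
        "first_minifat_sector": first_minifat,
        "minifat_sectors": num_minifat,
        "first_difat_sector": first_difat,
        "difat_sectors": num_difat,
        "directory_sectors": num_dir,
    }
```

    Even this much only tells you how big the blocks are and where the tables start; you still have to walk the tables to reassemble a single stream.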

    There are many more layers to this system, but that's a basic overview of what a researcher (or someone else) 1000 years from now will have to contend with.
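
    Layer 2 is easy to underestimate. As a small illustration (the function name and the choice of interpretations are mine), the same four bytes give completely different answers depending on which convention the reader assumes:

```python
import struct


def interpretations(raw: bytes) -> dict:
    """Decode the same 4 bytes under several common conventions."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return {
        "uint32_little_endian": struct.unpack("<I", raw)[0],
        "uint32_big_endian": struct.unpack(">I", raw)[0],
        "float32_little_endian": struct.unpack("<f", raw)[0],  # IEEE 754 single
        "latin-1": raw.decode("latin-1"),
        "gb2312": raw.decode("gb2312", errors="replace"),
    }


# Two Chinese characters stored as GB2312.
sample = bytes.fromhex("D6D0CEC4")
for name, value in interpretations(sample).items():
    print(f"{name}: {value!r}")
```

    Nothing in the bytes themselves says which reading is the right one; that knowledge lives outside the file.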

    Sure, if you're looking at just extracting text, you can skip the last layer by simply going in the file and pulling as much out as possible, much like I used to do with corrupted Word documents. However, if you're looking to retrieve images, videos, archival audio, etc. then your job is much harder.
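
    The "pull out as much as possible" trick looks roughly like this. It is a crude sketch in the spirit of the Unix strings tool, assuming the text sits in the file as plain ASCII or as UTF-16LE holding ASCII-range characters; it ignores the container structure entirely, and the function name and min_len default are my own choices:

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable text found in a binary blob, in file order."""
    # Runs of printable single-byte ASCII characters.
    ascii_runs = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # Runs of printable ASCII characters stored as UTF-16LE (char, 0x00 pairs).
    utf16_runs = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    found = []
    for match in ascii_runs.finditer(data):
        found.append((match.start(), match.group().decode("ascii")))
    for match in utf16_runs.finditer(data):
        found.append((match.start(), match.group().decode("utf-16-le")))
    # Sort by offset so the text comes out in the order it appears in the file.
    return [text for _, text in sorted(found)]
```

    You get the words back but none of the formatting, and because fragments are not necessarily stored in reading order, the paragraphs may come out shuffled.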

  6. Re:Doesn't matter. by Corporate+Troll · · Score: 3, Interesting

    For example, I keep a copy of DOS and Win3.1 ISOs (about 20MB total) and Norton Commander (3 floppy images!) on a DVDR, along with a copy of Virtual PC. This lets me recreate a Windows 3.1 virtual PC anytime I want.

    Now.... You can do that now. However, in 100 years, will this be possible? You do not know what the future brings. Let's not even talk about 1000 years and beyond. Now, say you backed this stuff up on a DVD and you die tomorrow. Your kids keep the data, and when they die a historian specialising in the 20th century wants to analyse the daily life of a 20th century person. VMWare is long dead, you backed it up... Sure, but his platform can't run it. We're at least 10 operating system versions later, and they run on a new platform. x86 is long forgotten and they moved to quantum computers.

    Perhaps the guy is lucky, and can run what you backed up in an emulator in an emulator in an emulator in an emulator. Perhaps....

    To this day I have zip files containing WordPerfect 5.1 files of the letters I exchanged with a girl I was penpals with (and to whom I ultimately lost my virginity, but that is another story). Those letters, documenting life from the mid eighties to the mid nineties, might be interesting to a historian someday. (Historians love the daily lives of long dead people.) Will they be able to read them in 100 years? I don't know, especially in a proprietary format like WordPerfect.