Slashdot Mirror


Converting Word Files to Text for Archiving?

Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"

5 of 81 comments

  1. Try... by curunir · · Score: 5, Informative

    this or this

    --
    "Don't blame me, I voted for Kodos!"
  2. Having worked a similar problem... by metacosm · · Score: 5, Informative

    Short Answer: Good Luck! :)

    Long Answer:

    This was a few years ago, so it doesn't take into account new applications that could have totally changed the conversion landscape.

    I have tried to do something similar in the past for a company that does JUST conversions. We actually tried to get doc -> text with formatting, both with expensive closed-source applications and with open-source apps... not much success on either front.

    In the end, we basically gave up on it, and ended up making two versions of each document: a PDF version and a text version. The text version was easily searchable, and the PDF version looked great. Both ASCII and PDF are open standards, so when they are phased out, we should be able to buy or write tools to convert to the next generation of formats. We then wrapped a GUI around it so that when you searched, you searched the text, and you got results in PDF.
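
    A minimal sketch of the "search the text, return the PDF" idea described above. It assumes (this is not from the original setup) that each document is archived as a pair of files sharing a base name, e.g. report.txt and report.pdf, in one directory tree; the function name search_archive is hypothetical.

```python
from pathlib import Path


def search_archive(archive_dir, query):
    """Return the PDFs whose sibling .txt file contains the query.

    Assumption: every document is stored as <name>.txt plus <name>.pdf.
    The match is a plain case-insensitive substring test.
    """
    needle = query.lower()
    hits = []
    for txt_path in sorted(Path(archive_dir).rglob("*.txt")):
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            pdf_path = txt_path.with_suffix(".pdf")
            # Only report documents whose PDF rendition actually exists.
            if pdf_path.exists():
                hits.append(pdf_path)
    return hits
```

    A real deployment would more likely use a proper full-text index than a linear scan, but the pairing of searchable text with a presentable PDF is the same.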

    The fact that PDF is an open standard is important. When we did this, we used the text files for searching, but nowadays you can get lots of engines to search PDFs directly, so the text conversion may not even be needed.

    Sorry I don't have the solution you are looking for... honestly good luck. :)

  3. wvWare by Alethes · · Score: 5, Informative

    wvWare has a library and a set of utilities. I use this all the time to convert Word attachments to HTML so I can read them.

    wvHtml: convert your Word document into HTML4.0

    wvLatex: convert your Word document into visually (pretty) correct LaTeX

    wvCleanLatex: convert into 'cleaner' LaTeX containing less visual mark-up, more suitable for further use and LyX import. Work in progress

    wvDVI: converts word to DVI. Requires 'latex'

    wvPS: converts word to PostScript. Requires 'dvips'

    wvPDF: converts word to Adobe PDF. Requires 'distill' from Adobe [Someone do a pdflatex or pdfhtml version :-)]

    wvText: converts word to plain text. Textually correct output requires 'lynx.' For poor output, this doesn't require anything special.

    wvAbw: converts word to Abiword format. (Far better just to use Abiword.)

    wvWml: converts word to WML for viewing on portable devices like WebPhones and Palm Pilots.

    wvRtf: a basic version exists

    wvMime: can be plugged as a MIME helper application into your browser/mail client; presents the document on-screen inside GhostView, while all intermediate files generated go into the /tmp directory.
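
    For a large archive, the wvText utility listed above could be driven in a batch. The sketch below is only an illustration: it assumes wvText is on the PATH and is invoked as "wvText input.doc output.txt" (check your installed version), and the helper names wvtext_command and convert_all are hypothetical.

```python
import subprocess
from pathlib import Path


def wvtext_command(doc_path, out_dir):
    """Build the command line for one document.

    Assumption: wvText takes the input file, then the output file.
    """
    doc_path = Path(doc_path)
    out_path = Path(out_dir) / (doc_path.stem + ".txt")
    return ["wvText", str(doc_path), str(out_path)]


def convert_all(src_dir, out_dir):
    """Run wvText on every .doc file in src_dir; return the files that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for doc in sorted(Path(src_dir).glob("*.doc")):
        # Raises FileNotFoundError if wvText is not installed.
        result = subprocess.run(wvtext_command(doc, out_dir))
        if result.returncode != 0:
            failed.append(doc)
    return failed
```

    Collecting the failures rather than stopping at the first one lets you review the problem documents by hand afterwards.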

  4. simple answer. by SN74S181 · · Score: 5, Funny
    One of the requirements of our archiving process is that the documents be stored in plain text format.


    There's one simple answer: uuencode.

    *ducks*
  5. Set up a text printer by mhesseltine · · Score: 5, Informative
    1. Add a Generic/Text only printer attached to a file port.
    2. Open file in Word
    3. Select print
    4. Select a file location and name
    5. Enjoy your new text file.

    For a variation on this, pick a good color PostScript printer, do the same print to file, then use Ghostscript to convert the PS to PDF.
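
    The Ghostscript step of that variation might look like the sketch below. It assumes the Ghostscript executable is named gs and is on the PATH (on Windows it is typically gswin64c or gswin32c); the helper names ps_to_pdf_command and ps_to_pdf are hypothetical.

```python
import subprocess


def ps_to_pdf_command(ps_path, pdf_path):
    """Build a Ghostscript command that renders a PostScript file to PDF."""
    return [
        "gs",                         # assumption: executable name on this system
        "-dBATCH",                    # exit after processing the input
        "-dNOPAUSE",                  # do not prompt between pages
        "-sDEVICE=pdfwrite",          # use the PDF output device
        f"-sOutputFile={pdf_path}",
        str(ps_path),
    ]


def ps_to_pdf(ps_path, pdf_path):
    """Convert one file; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(ps_to_pdf_command(ps_path, pdf_path), check=True)
```

    Calling ps_to_pdf("report.ps", "report.pdf") would then produce the PDF from the file printed in the steps above.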

    --
    Overrated / Underrated : Moderation :: Anonymous Coward : Posting