Slashdot Mirror


Converting Word Files to Text for Archiving?

Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"

11 of 81 comments

  1. anti-word by morgothan · · Score: 4, Informative

    There is a really nice application called antiword that strips a Word file into a text file. I'm not too sure if it will retain all the bullets and other crazy stuff. It should be worth looking at though. You can check it out at http://www.winfield.demon.nl/index.html

    Well hope that helps.
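
    For a large archive, antiword would have to be run over many files. A minimal sketch of a batch run, assuming antiword is installed and on the PATH and that it writes the converted text to standard output. The function names and the output naming scheme are hypothetical, and how much formatting survives still depends on antiword itself.

```python
import subprocess
from pathlib import Path


def text_path_for(doc_path: Path, out_dir: Path) -> Path:
    """Map e.g. reports/q1.doc to <out_dir>/q1.txt (hypothetical naming scheme)."""
    return out_dir / (doc_path.stem + ".txt")


def convert_with_antiword(doc_path: Path, out_dir: Path) -> Path:
    """Run antiword on one file and save its standard output as a text file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = text_path_for(doc_path, out_dir)
    # Assumption: antiword prints the converted text to standard output.
    result = subprocess.run(
        ["antiword", str(doc_path)],
        capture_output=True,
        check=True,  # raise if antiword reports an error for this file
    )
    target.write_bytes(result.stdout)
    return target


def convert_tree(src_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every .doc file found under src_dir, in sorted order."""
    return [convert_with_antiword(p, out_dir) for p in sorted(src_dir.rglob("*.doc"))]
```

    Files with the same base name in different subdirectories would overwrite each other under this naming scheme, so a real run would need a more careful mapping.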

  2. PDF by ObviousGuy · · Score: 2, Informative

    I believe Adobe Acrobat can import Word documents and save them as PDF.

    --
    I have been pwned because my /. password was too easy to guess.
  3. Try... by curunir · · Score: 5, Informative

    this or this

    --
    "Don't blame me, I voted for Kodos!"
  4. Having worked a similar problem... by metacosm · · Score: 5, Informative

    Short Answer: Good Luck! :)

    Long Answer:

    This was a few years ago, so it doesn't take into account new applications that could have totally changed the conversion landscape.

    I have tried to do something similar in the past for a company that does JUST conversions. We actually tried to get doc -> text with formatting, both with closed-source expensive applications and with open-source apps... not much success on either front.

    In the end, we basically gave up on it, and ended up making 2 versions of the document. A PDF version and a text version. The text version was easily searchable, and the PDF version looked great. Both ASCII and PDF are open standards, so that when they are phased out, we should be able to buy/write conversion tools to the next generation. We then wrapped a GUI around it so that when you searched, you searched the text, and you got results in PDF.

    The fact that PDF is an open standard is important. When we did this, we used the text files for searching, but nowadays you can get lots of engines to search PDFs directly, so the text conversion may not even be needed.
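
    The search-the-text, show-the-PDF arrangement can be sketched roughly as follows. This is only an illustration, not the tool described above: it assumes each document exists as a .txt and a .pdf sharing the same base name, and the function name and directory layout are hypothetical.

```python
from pathlib import Path


def find_pdfs(query: str, text_dir: Path, pdf_dir: Path) -> list[Path]:
    """Search the plain-text copies and return the paths of the matching PDFs."""
    needle = query.lower()
    hits = []
    for text_file in sorted(text_dir.glob("*.txt")):
        content = text_file.read_text(encoding="utf-8", errors="replace")
        if needle in content.lower():
            # Assumption: report.txt and report.pdf share the same base name.
            hits.append(pdf_dir / (text_file.stem + ".pdf"))
    return hits
```

    A simple case-insensitive substring match is used here to keep the sketch short; a real archive would more likely sit behind a proper full-text index.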

    Sorry I don't have the solution you are looking for... honestly good luck. :)

  5. wvWare by Alethes · · Score: 5, Informative

    wvWare has a library and a set of utilities. I use this all the time to convert Word attachments to HTML so I can read them.

    wvHtml: convert your Word document into HTML4.0

    wvLatex: convert your Word document into visually (pretty) correct LaTeX

    wvCleanLatex: convert into 'cleaner' LaTeX containing less visual mark-up, more suitable for further use and LyX import. Work in progress

    wvDVI: converts word to DVI. Requires 'latex'

    wvPS: converts word to PostScript. Requires 'dvips'

    wvPDF: converts word to Adobe PDF. Requires 'distill' from Adobe [Someone do a pdflatex or pdfhtml version :-)]

    wvText: converts word to plain text. Textually correct output requires 'lynx.' For poor output, this doesn't require anything special.

    wvAbw: converts word to Abiword format. (Far better just to use Abiword.)

    wvWml: converts word to WML for viewing on portable devices like WebPhones and Palm Pilots.

    wvRtf: a basic version exists

    wvMime: can be plugged as a MIME helper application into your browser/mail client; presents the document on-screen inside GhostView, while all intermediate files generated go into the /tmp directory.

    1. Re:wvWare by sohp · · Score: 3, Informative
      I'll second wv, formerly known as MSWordView. I've used it since before the name change and have been satisfied. According to the blurb at freshmeat,
      wv (formerly known as MSWordView) is a library that understands the Microsoft Word 2000, 97, 95 and 6 file formats (".doc"), and is able to convert Word documents into HTML, which can then be read with a browser. It also allows other programs access to Word documents for the purpose of converting them to other formats (like RTF, PostScript, and PDF), and is currently being used by Abiword as its word importer.


      If by chance you have any Java around, the POI HDF APIs are great for manipulating that Horrible Document Format.
  6. use a word macro first by joe094287523459087 · · Score: 4, Informative

    i think all these answers are missing an essential part of your requirement, which is to keep in place the "advanced" formatting like indentations, italicization, bullets

    i worked at a company that created all their docs (1100 of them) in word and wanted them to be ported to pagemaker or the like. however, at that time there was no way to do the conversion so we needed to convert them to text first. we basically had the same problem you have.

    we ended up using a Word macro to convert all the word styles (font style, indents, bullets) to plaintext equivalents (tabs, _underline_, xbullets) and then saved as text. that worked nicely and i think it's the only way to accurately preserve word-specific styles and formatting.
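
    The idea of mapping styles to plain-text stand-ins can be illustrated outside of Word. The sketch below is not the macro described above. It assumes the styles have already been extracted into a simple, hypothetical Paragraph structure, and the choice of "* " for bullets and tabs for indents is just one possible convention.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Hypothetical simplified paragraph: text plus a few style attributes."""
    text: str
    indent: int = 0          # indentation level, rendered as leading tabs
    bullet: bool = False     # rendered with a leading "* "
    underline: bool = False  # rendered as _text_


def to_plain_text(paragraphs: list[Paragraph]) -> str:
    """Render styled paragraphs using a plain-text stand-in for each style."""
    lines = []
    for p in paragraphs:
        text = f"_{p.text}_" if p.underline else p.text
        if p.bullet:
            text = "* " + text
        lines.append("\t" * p.indent + text)
    return "\n".join(lines)


doc = [
    Paragraph("Summary", underline=True),
    Paragraph("First point", indent=1, bullet=True),
]
print(to_plain_text(doc))
```

    Keeping the mapping in one small function makes it easy to change the stand-ins later without touching the code that reads the documents.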

  7. Stellent Outside In by divbyzero · · Score: 4, Informative

    Stellent (formerly INSO) Outside In is a popular commercial solution to the problem. It is an SDK for conversion of lots of document types (including Word) into plaintext, HTML, or XML. It allows you to control how much of the formatting is preserved, and in what manner. It's not perfect, not crash-proof, and not free, but it might do the trick in a corporate situation, especially when wrapped with a watchdog process. My company has had a lot of success with it.

    --
    But my grandest creation, as history will tell,
    Was Firefrorefiddle, the Fiend of the Fell.
  8. Re:Why? by Unknown+Relic · · Score: 4, Informative

    I wasn't going to get into the details of specifically why this was needed, but since a large number of the posts have been asking the question, I will explain. An AC actually indirectly hit the nail on the head, though he was accusing me of fraud in the process: "If you are seriously interested in archiving, you should save off the files in a non-updatable format."

    For legal reasons, our company is required to maintain a permanent archive of some specific documents. In this case, permanent does not mean ten or even twenty years, but one hundred. This being the case, the archiving is actually being done to an analogue medium. If you are curious, we are using a type of microfilm which is specifically designed to be readable for up to 500 years. Microfilm also has the added benefit of being relatively unalterable once developed, unlike a digital file.

    The need for doc to text conversion comes from the hardware which performs the actual archiving. It only accepts text files or specially formatted TIFF images as input. The text files are not the actual format in which the documents will be archived, but they are a necessary intermediate step in the archiving process.

  9. Setup a text printer by mhesseltine · · Score: 5, Informative
    1. Add a Generic/Text only printer attached to a file port.
    2. Open file in Word
    3. Select print
    4. Select a file location and name
    5. Enjoy your new text file.

    For a variation on this, pick a good color Postscript printer, do the same print to file, then use Ghostscript to convert the PS to PDF
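
    The Ghostscript step of that variation could be scripted along these lines. This is a sketch under assumptions: the executable is taken to be named gs (on Windows it is typically gswin32c or gswin64c instead), and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def ghostscript_command(ps_path: Path, pdf_path: Path) -> list[str]:
    """Build a Ghostscript command line that writes a PDF from a PostScript file."""
    return [
        "gs",                   # assumption: executable name on this system
        "-dBATCH",              # exit after processing instead of prompting
        "-dNOPAUSE",            # do not pause between pages
        "-sDEVICE=pdfwrite",    # use the PDF output device
        f"-sOutputFile={pdf_path}",
        str(ps_path),
    ]


def ps_to_pdf(ps_path: Path) -> Path:
    """Convert one print-to-file PostScript dump into a PDF next to it."""
    pdf_path = ps_path.with_suffix(".pdf")
    subprocess.run(ghostscript_command(ps_path, pdf_path), check=True)
    return pdf_path
```

    Building the command in its own function keeps the options visible in one place and lets them be checked without Ghostscript being installed.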

    --
    Overrated / Underrated : Moderation :: Anonymous Coward : Posting
    1. Re:Setup a text printer by jsse · · Score: 2, Informative

      That's not exactly what the poster is asking for, because he doesn't want to click through the menus for every document. However, it's the best solution, because the other converters suggested above could not transform the document so perfectly, as the transformation is part of Word itself.

      Maybe he could write a macro to automate the process?