Converting Word Files to Text for Archiving?
Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"
There is a really nice application called antiword that strips a Word file into a text file. I'm not too sure if it will retain all the bullets and other crazy stuff, but it should be worth looking at. You can check it out at http://www.winfield.demon.nl/index.html
Well hope that helps.
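If you want to try antiword on a whole directory of files, something like the following might be a starting point. It is only a sketch: the function names and the flat output layout are made up for illustration, it assumes antiword is installed and on the PATH, and it relies on antiword writing the converted text to standard output, with -w setting the line width in characters.

```python
import shutil
import subprocess
from pathlib import Path


def antiword_command(doc_path, width=None):
    """Build the argument list for one antiword run (text goes to stdout)."""
    cmd = ["antiword"]
    if width is not None:
        # -w sets the output line width in characters
        cmd += ["-w", str(width)]
    cmd.append(str(doc_path))
    return cmd


def convert_tree(src_dir, dst_dir, width=None):
    """Convert every .doc under src_dir into a .txt file in dst_dir.

    Hypothetical helper: output files are named after the source file's
    stem, so two documents with the same name in different folders
    would overwrite each other.
    """
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword not found on PATH")
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(Path(src_dir).rglob("*.doc")):
        result = subprocess.run(
            antiword_command(doc, width), capture_output=True, check=True
        )
        out = dst / (doc.stem + ".txt")
        out.write_bytes(result.stdout)
        written.append(out)
    return written
```

Whether bullets and tables survive well enough is something you would have to check by eye on a sample of your own documents.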
---
I believe Adobe Acrobat can import Word documents and save them as PDF.
"Don't blame me, I voted for Kodos!"
Short Answer: Good Luck! :)
Long Answer:
This was a few years ago, so it doesn't take into account new applications that could have totally changed the conversion landscape.
I have tried to do something similar in the past for a company that does JUST conversions. We actually tried to get doc -> text with formatting, both with closed-source expensive applications and with open-source apps... not much success on either front.
In the end, we basically gave up on it, and ended up making 2 versions of the document. A PDF version and a text version. The text version was easily searchable, and the PDF version looked great. Both ASCII and PDF are open standards, so that when they are phased out, we should be able to buy/write conversion tools to the next generation. We then wrapped a GUI around it so that when you searched, you searched the text, and you got results in PDF.
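The "search the text, get the PDF back" idea can be sketched roughly like this. This is not the code we actually used; the function name, the directory layout and the rule that a.txt pairs with a.pdf are all assumptions made up for the example, and a real system would use a proper index rather than scanning every file.

```python
from pathlib import Path


def search_archive(text_dir, pdf_dir, query):
    """Return the PDF paths whose plain-text twin contains the query.

    Assumes each document exists as <name>.txt in text_dir and
    <name>.pdf in pdf_dir. The match is a case-insensitive substring test.
    """
    needle = query.lower()
    hits = []
    for txt in sorted(Path(text_dir).glob("*.txt")):
        body = txt.read_text(encoding="utf-8", errors="replace").lower()
        if needle in body:
            # Report the good-looking PDF, not the text file that matched
            hits.append(Path(pdf_dir) / (txt.stem + ".pdf"))
    return hits
```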
The fact that PDF is an open standard is important. When we did this, we used the text files for searching, but nowadays you can get lots of engines to search PDFs directly, so the text conversion may not even be needed.
Sorry I don't have the solution you are looking for... honestly good luck.
wvWare has a library and a set of utilities. I use this all the time to convert Word attachments to HTML so I can read them.
wvHtml: convert your Word document into HTML4.0
wvLatex: convert your Word document into visually (pretty) correct LaTeX
wvCleanLatex: convert into 'cleaner' LaTeX containing less visual mark-up, more suitable for further use and LyX import. Work in progress
wvDVI: converts word to DVI. Requires 'latex'
wvPS: converts word to PostScript. Requires 'dvips'
wvPDF: converts word to Adobe PDF. Requires 'distill' from Adobe [Someone do a pdflatex or pdfhtml version :-)]
wvText: converts word to plain text. Textually correct output requires 'lynx.' For poor output, this doesn't require anything special.
wvAbw: converts word to Abiword format. (Far better just to use Abiword.)
wvWml: converts word to WML for viewing on portable devices like WebPhones and Palm Pilots.
wvRtf: a basic version exists
wvMime: can be plugged as a MIME helper application into your browser/mail client; presents the document on-screen inside GhostView, while all intermediate files generated go into the /tmp directory.
i think all these answers are missing an essential part of your requirement, which is to keep in place the "advanced" formatting like indentations, italicization, bullets
i worked at a company that created all their docs (1100 of them) in word and wanted them to be ported to pagemaker or the like. however, at that time there was no way to do the conversion so we needed to convert them to text first. we basically had the same problem you have.
we ended up using a Word macro to convert all the word styles (font style, indents, bullets) to plaintext equivalents (tabs, _underline_, xbullets) and then saved as text. that worked nicely and i think it's the only way to accurately preserve word-specific styles and formatting.
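to show the kind of mapping i mean, here is a rough sketch in Python rather than the original Word macro (which i don't have anymore). the function names, the "* " bullet marker and the " | " column separator are all invented for the example, and the table part is not something our macro did; it is just one way the "basic table layout" from the question could be handled.

```python
def render_paragraph(text, indent=0, bullet=False, underline=False):
    """Render one paragraph using plain-text stand-ins for Word styles.

    indent levels become leading tabs, bullets become '* ', and
    underlined text is wrapped in underscores.
    """
    if underline:
        text = "_" + text + "_"
    if bullet:
        text = "* " + text
    return "\t" * indent + text


def render_table(rows):
    """Render rows of cell strings as padded columns joined by ' | '.

    Assumes every row has the same number of cells.
    """
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        # Drop the padding after the last cell so lines have no trailing spaces
        lines.append(" | ".join(cells).rstrip())
    return "\n".join(lines)
```

the hard part in practice is getting the style information out of Word in the first place, which is why we did it inside Word with a macro.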
Stellent (formerly INSO) Outside In is a popular commercial solution to the problem. It is an SDK for conversion of lots of document types (including Word) into plaintext, HTML, or XML. It allows you to control how much of the formatting is preserved, and in what manner. It's not perfect, not crash-proof, and not free, but it might do the trick in a corporate situation, especially when wrapped with a watchdog process. My company has had a lot of success with it.
But my grandest creation, as history will tell,
Was Firefrorefiddle, the Fiend of the Fell.
I wasn't going to get into the details of specifically why this was needed, but since a large number of the posts have been asking the question, I will explain. An AC actually indirectly hit the nail on the head, though he was accusing me of fraud in the process: "If you are seriously interested in archiving, you should save off the files in a non-updatable format."
For legal reasons, our company is required to maintain a permanent archive of some specific documents. In this case, permanent does not mean ten or even twenty years, but one hundred. This being the case, the archiving is actually being done to an analogue medium. If you are curious, we are using a type of microfilm which is specifically designed to be readable for up to 500 years. Microfilm also has the added benefit of being relatively unalterable once developed, unlike a digital file.
The need for doc to text conversion comes from the hardware which performs the actual archiving. It only accepts text files or specially formatted TIFF images as input. The text files are not the actual format in which the documents will be archived, but they are a necessary intermediate step in the archiving process.
For a variation on this, pick a good color Postscript printer, do the same print to file, then use Ghostscript to convert the PS to PDF
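The Ghostscript step could be scripted along these lines. Treat it as a sketch: the function names are made up, it assumes the executable is called gs and is on the PATH (on Windows it usually has a different name, such as gswin32c), and it uses Ghostscript's pdfwrite output device.

```python
import subprocess


def ghostscript_pdf_command(ps_path, pdf_path):
    """Build the Ghostscript argument list that turns one PS file into a PDF."""
    return [
        "gs",
        "-dBATCH",            # exit after processing instead of prompting
        "-dNOPAUSE",          # do not pause between pages
        "-sDEVICE=pdfwrite",  # write PDF output
        "-sOutputFile=" + str(pdf_path),
        str(ps_path),
    ]


def ps_to_pdf(ps_path, pdf_path):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(ghostscript_pdf_command(ps_path, pdf_path), check=True)
```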
Overrated / Underrated : Moderation