Converting Word Files to Text for Archiving?
Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"
Short Answer: Good Luck! :)
Long Answer:
This was a few years ago, so it doesn't take into account new applications that could have totally changed the conversion landscape.
I have tried to do something similar in the past for a company that does JUST conversions. We actually tried to get doc -> text with formatting, both with expensive closed-source applications and with open-source apps... not much success on either front.
In the end, we basically gave up on it, and ended up making 2 versions of the document. A PDF version and a text version. The text version was easily searchable, and the PDF version looked great. Both ASCII and PDF are open standards, so that when they are phased out, we should be able to buy/write conversion tools to the next generation. We then wrapped a GUI around it so that when you searched, you searched the text, and you got results in PDF.
The fact that PDF is an open standard is important. When we did this, we used the text files for searching, but nowadays you can get lots of engines to search PDFs directly, so the text conversion may not even be needed.
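For what it's worth, the "search the text, show the PDF" idea can be sketched in a few lines of Python. This is not the GUI we built, just a minimal illustration under assumptions of mine: each document is stored as a pair like report.txt and report.pdf in the same directory, and the function name search_archive is made up for the example.

```python
from pathlib import Path


def search_archive(archive_dir, query):
    """Return the PDF paths whose sibling .txt file contains the query."""
    needle = query.lower()
    hits = []
    for txt_path in sorted(Path(archive_dir).glob("*.txt")):
        # Search the plain-text version (case-insensitive substring match).
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            # Assumed layout: report.txt sits next to report.pdf.
            pdf_path = txt_path.with_suffix(".pdf")
            if pdf_path.exists():
                hits.append(pdf_path)
    return hits
```

Calling search_archive("/path/to/archive", "budget") would return the PDFs to show the user; a real system would want an index rather than a linear scan.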
Sorry I don't have the solution you are looking for... honestly good luck.
wvWare has a library and a set of utilities. I use this all the time to convert Word attachments to HTML so I can read them.
wvHtml: convert your Word document into HTML4.0
wvLatex: convert your Word document into visually (pretty) correct LaTeX
wvCleanLatex: convert into 'cleaner' LaTeX containing less visual mark-up, more suitable for further use and LyX import. Work in progress
wvDVI: converts word to DVI. Requires 'latex'
wvPS: converts word to PostScript. Requires 'dvips'
wvPDF: converts word to Adobe PDF. Requires 'distill' from Adobe [Someone do a pdflatex or pdfhtml version :-)]
wvText: converts word to plain text. Textually correct output requires 'lynx.' For poor output, this doesn't require anything special.
wvAbw: converts word to Abiword format. (Far better just to use Abiword.)
wvWml: converts word to WML for viewing on portable devices like WebPhones and Palm Pilots.
wvRtf: a basic version exists
wvMime: can be plugged as a MIME helper application into your browser/mail client; presents the document on-screen inside GhostView, while all intermediate files generated go into the /tmp directory.
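If you want to run wvText over a whole directory of documents, something like the following Python sketch might do. It assumes wvText is on your PATH and accepts the two-argument form "wvText input.doc output.txt"; check your installed version before relying on that. The helper names wvtext_command and convert_all are mine, not part of wvWare.

```python
import shutil
import subprocess
from pathlib import Path


def wvtext_command(doc_path, out_dir):
    """Build the wvText command line for one document (assumed CLI form)."""
    doc_path = Path(doc_path)
    txt_path = Path(out_dir) / (doc_path.stem + ".txt")
    return ["wvText", str(doc_path), str(txt_path)]


def convert_all(src_dir, out_dir):
    """Run wvText on every .doc file in src_dir; return the files that failed."""
    if shutil.which("wvText") is None:
        raise RuntimeError("wvText not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for doc_path in sorted(Path(src_dir).glob("*.doc")):
        result = subprocess.run(wvtext_command(doc_path, out_dir))
        if result.returncode != 0:
            # Keep going so one bad file does not stop the batch.
            failed.append(doc_path)
    return failed
```

Collecting failures instead of stopping at the first one seems the safer choice for a large archive, since old Word files are often damaged or in odd versions.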
There's one simple answer: uuencode.
*ducks*
For a variation on this, pick a good color PostScript printer, do the same print to file, then use Ghostscript to convert the PS to PDF.
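The Ghostscript step can be scripted. Here is a rough Python sketch, assuming the gs binary is installed and on your PATH; it uses Ghostscript's pdfwrite device, and the function names are just for illustration.

```python
import subprocess


def ps_to_pdf_command(ps_path, pdf_path):
    """Build a Ghostscript command that converts one PS file to PDF."""
    return [
        "gs",
        "-dBATCH",            # exit when the input has been processed
        "-dNOPAUSE",          # do not prompt between pages
        "-sDEVICE=pdfwrite",  # Ghostscript's PDF output device
        f"-sOutputFile={pdf_path}",
        str(ps_path),
    ]


def ps_to_pdf(ps_path, pdf_path):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(ps_to_pdf_command(ps_path, pdf_path), check=True)
```

For example, ps_to_pdf("printed.ps", "printed.pdf") would convert the file you got from "print to file".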