Converting Word Files to Text for Archiving?
Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"
Save As... HTML? It's plain text, but it retains the formatting, as long as you don't get too exotic.
HTML Tidy has a 'clean up Word HTML' mode which works wonders. Dave Raggett developed it for W3C.
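Once the Word HTML has been cleaned up, getting from there to formatted plain text is not much code. Here is a rough Python sketch, assuming the cleaned file only uses simple paragraphs, lists and non-nested tables. The class and function names, and the layout choices ('*' bullets, two-space indents, ' | ' between columns), are made up for illustration and have nothing to do with Tidy itself.

```python
from html.parser import HTMLParser

SKIP = ("style", "script", "title")  # tags whose text is not document content


class WordHtmlToText(HTMLParser):
    """Hypothetical helper: renders simple HTML as indented plain text."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished output lines
        self.buf = []        # text collected for the current block or cell
        self.depth = 0       # current list nesting level
        self.bullet = False  # True while the pending text is a list item
        self.table = []      # rows of the table being read
        self.row = None      # cells of the row being read, if any
        self.skip = 0        # > 0 while inside a SKIP tag

    def take_text(self):
        # Return the buffered text with whitespace collapsed, and reset it.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def flush(self):
        # Emit the buffered text as one line; list items get a bullet.
        text = self.take_text()
        if not text:
            return
        if self.bullet:
            prefix = "  " * (self.depth - 1) + "* "
        else:
            prefix = "  " * self.depth
        self.lines.append(prefix + text)
        self.bullet = False

    def emit_table(self):
        # Pad every column to its widest cell so the columns line up.
        rows, self.table = self.table, []
        if not rows:
            return
        ncols = max(len(r) for r in rows)
        rows = [r + [""] * (ncols - len(r)) for r in rows]
        widths = [max(len(r[i]) for r in rows) for i in range(ncols)]
        for r in rows:
            line = " | ".join(c.ljust(w) for c, w in zip(r, widths))
            self.lines.append("  " * self.depth + line.rstrip())

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip += 1
        elif tag in ("ul", "ol"):
            self.flush()
            self.depth += 1
        elif tag == "li":
            self.flush()
            self.bullet = True
        elif tag == "p":
            self.flush()
        elif tag == "table":
            self.flush()
            self.table = []
        elif tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.buf = []

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip = max(0, self.skip - 1)
        elif tag in ("td", "th") and self.row is not None:
            self.row.append(self.take_text())
        elif tag == "tr" and self.row is not None:
            self.table.append(self.row)
            self.row = None
        elif tag == "table":
            self.emit_table()
        elif tag in ("ul", "ol"):
            self.flush()
            self.depth = max(0, self.depth - 1)
        elif tag == "li":
            self.flush()
            self.bullet = False
        elif tag == "p":
            self.flush()

    def handle_data(self, data):
        if not self.skip:
            self.buf.append(data)


def html_to_text(html):
    parser = WordHtmlToText()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text
    return "\n".join(parser.lines)
```

Feed it the contents of the saved .html file and write the result out as .txt. Anything fancier than the tags handled above (nested tables, numbered lists that need real numbers, headings) would need more cases.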
use any suitable postscript printer driver, and print to a file.
you don't need to buy any proprietary software. you will retain all formatting. you can view using free software on virtually any platform you like.
this will easily work for any other document type you may have also, so you can have one standard archival format for anything and everything. big CADD file? no problem.
bonus tip. archive or no archive, make sure your documents include text (in the footer, say) that indicates the location and filename of the document. so when you want to work backwards from paper to file, you know where to look. the print date is important, too.
i don't know why i'm not using caps today...
--Lawrence Lessig for Congress!
How about compression? Generally when archiving, you're compressing, and plain text gets a far, far better compression ratio than a PDF (not to mention it's smaller to begin with).
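If you want numbers for your own documents rather than taking my word for it, this is easy to measure. A quick Python sketch, assuming gzip is a fair stand-in for whatever your archiver uses. The function name and the file paths in the usage note are placeholders, not real files.

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(gzip.compress(data)) / len(data)


def report(path: str) -> str:
    # Read the whole file and describe how well it compresses.
    with open(path, "rb") as f:
        data = f.read()
    return f"{path}: {len(data)} bytes, ratio {compression_ratio(data):.2f}"
```

Call something like report("memo.txt") and report("memo.pdf") on the same document saved both ways and compare. PDFs often store their contents already compressed, so I would expect the second ratio to sit close to 1, but check it on your own files.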
I've had very good luck using OpenOffice's Save As.
There are a lot of options, however. I wonder what
Google uses?
One option would be saving as postscript and using
a postscript-to-text extractor.
But the best thing would be to relax the POT
requirement to allow XML. XML is so trivial to parse that
even if XML itself falls from grace with technological
advance, the few simple tags you need will certainly
be supportable with 10-15 minutes of scripting
(if the computers aren't smart enough to respond
to your voice command and retrieve the old XML
specs and interpret them well enough to perform
the transformation automatically at that future point
-- which seems unlikely, given that you use only
the simplest and most basic of XML idioms).
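To give an idea of what that scripting might look like, here is a
small Python sketch. The tag names (doc, para, list, item) are
invented for the example; your archive format would define its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive markup: only paragraphs and (nestable) lists.
SAMPLE = """<doc>
  <para>Quarterly summary</para>
  <list>
    <item>Revenue up</item>
    <item>Costs flat
      <list>
        <item>Rent unchanged</item>
      </list>
    </item>
  </list>
</doc>"""


def render(elem, depth=0):
    """Return plain-text lines for the children of elem."""
    lines = []
    for child in elem:
        text = (child.text or "").strip()
        if child.tag == "para":
            lines.append("  " * depth + text)
        elif child.tag == "item":
            lines.append("  " * depth + "* " + text)
            # A list nested inside an item is indented one level deeper.
            lines.extend(render(child, depth + 1))
        elif child.tag == "list":
            lines.extend(render(child, depth))
    return lines


def xml_to_text(source):
    return "\n".join(render(ET.fromstring(source)))
```

Calling xml_to_text(SAMPLE) gives back the paragraph followed by
the bulleted items, with the nested item indented two spaces.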
-I like my women like I like my tea: green-