Converting Word Files to Text for Archiving?
Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"
Save As... HTML? It's plain text, but it retains the formatting, as long as you don't get too exotic.
There is a really nice application called antiword that strips a Word file into a text file. I'm not too sure whether it will retain all the bullets and other crazy stuff, but it should be worth looking at. You can check it out at http://www.winfield.demon.nl/index.html
Well hope that helps.
---
I believe Adobe Acrobat can import Word documents and save them as PDF.
"Don't blame me, I voted for Kodos!"
Short Answer: Good Luck! :)
:)
Long Answer:
This was a few years ago, so it doesn't take into account new applications that could have totally changed the conversion landscape.
I have tried to do something similar in the past for a company that does JUST conversions. We actually tried to get doc -> text with formatting, both with expensive closed-source applications and with open-source apps... not much success on either front.
In the end, we basically gave up on it, and ended up making 2 versions of the document. A PDF version and a text version. The text version was easily searchable, and the PDF version looked great. Both ASCII and PDF are open standards, so that when they are phased out, we should be able to buy/write conversion tools to the next generation. We then wrapped a GUI around it so that when you searched, you searched the text, and you got results in PDF.
The fact that PDF is an open standard is important. When we did this, we used the text files for searching, but nowadays you can get lots of engines to search PDFs directly, so the text conversion may not even be needed.
Sorry I don't have the solution you are looking for... honestly good luck.
1. Create a web page (e.g. ASP) on a server running MS Word. In your back end code, create an instance of MS Word and automate the conversion of these files to HTML. Pain in the ass, but this can be done.
2. MS Word HTML comes out with all sorts of xml tags and crap, so use a simple regular expression to filter out all the tags you don't need, keeping <p>, <ol> etc etc.
And yes:
3. ???
4. Profit!!
Ladies, form queue here -->
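For the regular-expression step, here is a minimal sketch in Python. The KEEP whitelist and the function name are illustrative assumptions, not anything Word or the poster defines, and a regex like this will trip over attribute values that contain a literal ">" and leaves the contents of <style> blocks behind as text, so treat it as a starting point rather than a robust cleaner.

```python
import re

# Tags worth keeping for a plain-text conversion (an assumed whitelist).
KEEP = {"p", "ol", "ul", "li", "table", "tr", "td", "th", "br"}

# Matches an opening or closing tag, including namespaced ones such as <o:p>.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^>]*>")
# Word hides its conditional XML islands inside HTML comments.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_word_html(html):
    """Drop comments and every tag not in KEEP; kept tags lose their attributes."""
    html = COMMENT_RE.sub("", html)

    def replace_tag(match):
        closing, name = match.group(1), match.group(2).lower()
        return f"<{closing}{name}>" if name in KEEP else ""

    return TAG_RE.sub(replace_tag, html)


if __name__ == "__main__":
    sample = '<p class=MsoNormal><span style="color:red">Hi</span><o:p></o:p></p>'
    print(strip_word_html(sample))
```

The text between removed tags is left where it was, so only the markup disappears.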
- Transform your documents into a reasonable format, XML, or very simple HTML: no page layout, only tables and lists.
- Transform those files into text.
The first part is difficult, as you have to filter the data and remove a lot of unneeded information. One possible way to do this would be to convert the Word documents into RTF, and then the RTF into HTML (there are tools to do this, like RTFTOHTML). The second part is quite easy. If your data is XML, you can convert it using simple scripts (tables might be an issue); if the data is simple HTML, you could use Lynx to convert it into text with some layout.
In fact, I would keep both versions of the data handy. Having a somewhat structured version of the data never hurts, and text files do not take up much space.
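To give an idea of what the second step involves when the input really is very simple HTML, here is a rough Python sketch built on the standard library's html.parser. The class and function names are made up for illustration, and it only understands paragraphs, nested lists, and flat tables; anything fancier would need more work or a real tool.

```python
from html.parser import HTMLParser


class SimpleHtmlToText(HTMLParser):
    """Render <p>, <ul>/<ol>/<li> and <table> from very simple HTML as text."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output lines
        self.buf = []      # text collected for the current element
        self.prefix = ""   # bullet or number waiting for its text
        self.lists = []    # one entry per open list: None for <ul>, counter for <ol>
        self.rows = None   # rows of the current table, None outside a table
        self.row = None    # cells of the current row, None outside a row

    def _take(self):
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def _flush(self):
        text = self._take()
        if text:
            self.lines.append(self.prefix + text)
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush()  # text of a parent <li> comes before its sublist
            self.lists.append(0 if tag == "ol" else None)
        elif tag == "li" and self.lists:
            indent = "  " * (len(self.lists) - 1)
            if self.lists[-1] is None:
                self.prefix = indent + "* "
            else:
                self.lists[-1] += 1
                self.prefix = indent + f"{self.lists[-1]}. "
        elif tag == "table":
            self.rows = []
        elif tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.buf = []

    def handle_data(self, data):
        self.buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("p", "li") and self.row is None:
            self._flush()
        elif tag in ("ul", "ol") and self.lists:
            self.lists.pop()
        elif tag in ("td", "th") and self.row is not None:
            self.row.append(self._take())
        elif tag == "tr" and self.rows is not None and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag == "table" and self.rows is not None:
            self._emit_table()
            self.rows = None

    def _emit_table(self):
        ncols = max((len(r) for r in self.rows), default=0)
        widths = [max(len(r[i]) if i < len(r) else 0 for r in self.rows)
                  for i in range(ncols)]
        for r in self.rows:
            cells = [(r[i] if i < len(r) else "").ljust(widths[i])
                     for i in range(ncols)]
            self.lines.append(" | ".join(cells).rstrip())


def html_to_text(html):
    parser = SimpleHtmlToText()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up any trailing loose text
    return "\n".join(parser.lines)
```

Bullets become "* ", numbered items get their number, nesting is shown with two spaces per level, and table columns are padded to a common width.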
wvWare has a library and a set of utilities. I use this all the time to convert Word attachments to HTML so I can read them.
wvHtml: convert your Word document into HTML4.0
wvLatex: convert your Word document into visually (pretty) correct LaTeX
wvCleanLatex: convert into 'cleaner' LaTeX containing less visual mark-up, more suitable for further use and LyX import. Work in progress
wvDVI: converts word to DVI. Requires 'latex'
wvPS: converts word to PostScript. Requires 'dvips'
wvPDF: converts word to Adobe PDF. Requires 'distill' from Adobe [Someone do a pdflatex or pdfhtml version :-)]
wvText: converts word to plain text. Textually correct output requires 'lynx.' For poor output, this doesn't require anything special.
wvAbw: converts word to Abiword format. (Far better just to use Abiword.)
wvWml: converts word to WML for viewing on portable devices like WebPhones and Palm Pilots.
wvRtf: a basic version exists
wvMime: can be plugged as a MIME helper application into your browser/mail client; presents the document on-screen inside GhostView, while all intermediate files generated go into the /tmp directory.
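If you want to run wvText over a whole folder, something like the following Python sketch could drive it. It assumes the usual "wvText input.doc output.txt" calling convention and that wvText is on the PATH; the function names and folder layout are made up for the example.

```python
import subprocess
from pathlib import Path


def wvtext_command(doc, out_dir):
    """Build the argv for converting one .doc file; nothing is executed here."""
    doc, out_dir = Path(doc), Path(out_dir)
    return ["wvText", str(doc), str(out_dir / (doc.stem + ".txt"))]


def convert_folder(src_dir, out_dir):
    """Run wvText over every .doc in src_dir and return the names that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for doc in sorted(Path(src_dir).glob("*.doc")):
        result = subprocess.run(wvtext_command(doc, out_dir))
        if result.returncode != 0:
            failed.append(doc.name)
    return failed
```

Collecting the failures instead of stopping at the first one matters for a large archive, since a few damaged documents are almost guaranteed.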
There's one simple answer: uuencode.
*ducks*
i think all these answers are missing an essential part of your requirement, which is to keep in place the "advanced" formatting like indentations, italicization, bullets
i worked at a company that created all their docs (1100 of them) in word and wanted them to be ported to pagemaker or the like. however, at that time there was no way to do the conversion so we needed to convert them to text first. we basically had the same problem you have.
we ended up using a Word macro to convert all the word styles (font style, indents, bullets) to plaintext equivalents (tabs, _underline_, xbullets) and then saved as text. that worked nicely and i think it's the only way to accurately preserve word-specific styles and formatting.
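the mapping itself is simple once the styles have been pulled out. a rough python illustration (not the Word macro we used; the paragraph structure with "runs", "indent" and "bullet" keys is invented for the example):

```python
def render_plaintext(paragraphs):
    """Turn styled paragraphs into text using tabs, _emphasis_ and * bullets.

    Each paragraph is a dict with:
      "runs":   list of (text, emphasized) pairs
      "indent": indent level, rendered as that many tabs (optional)
      "bullet": True if the paragraph is a bullet item (optional)
    """
    lines = []
    for para in paragraphs:
        body = "".join(f"_{text}_" if emphasized else text
                       for text, emphasized in para["runs"])
        prefix = "\t" * para.get("indent", 0)
        if para.get("bullet"):
            prefix += "* "
        lines.append(prefix + body)
    return "\n".join(lines)


if __name__ == "__main__":
    doc = [
        {"runs": [("Safety notes", False)]},
        {"runs": [("Read the ", False), ("manual", True)], "indent": 1, "bullet": True},
    ]
    print(render_plaintext(doc))
```

the hard part is still getting the style information out of Word in the first place, which is what the macro was for.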
Stellent (formerly INSO) Outside In is a popular commercial solution to the problem. It is an SDK for conversion of lots of document types (including Word) into plaintext, HTML, or XML. It allows you to control how much of the formatting is preserved, and in what manner. It's not perfect, not crash-proof, and not free, but it might do the trick in a corporate situation, especially when wrapped with a watchdog process. My company has had a lot of success with it.
But my grandest creation, as history will tell,
Was Firefrorefiddle, the Fiend of the Fell.
Use MS Word to save it as HTML, then run it through lynx -dump to save it as text.
Although, you may want to give strong consideration to another poster's recommendation of using PDF. (Particularly since you care about formatting.)
-Bill
HTML Tidy has a 'clean up Word HTML' mode which works wonders. Dave Raggett developed it for W3C.
I wasn't going to get into the details of specifically why this was needed, but since a large number of the posts have been asking the question, I will explain. An AC actually indirectly hit the nail on the head, though he was accusing me of fraud in the process: "If you are seriously interested in archiving, you should save off the files in a non-updatable format."
For legal reasons, our company is required to maintain a permanent archive of some specific documents. In this case, permanent does not mean ten or even twenty years, but one hundred. This being the case, the archiving is actually being done to an analogue medium. If you are curious, we are using a type of microfilm which is specifically designed to be readable for up to 500 years. Microfilm also has the added benefit of being relatively unalterable once developed, unlike a digital file.
The need for doc to text conversion comes from the hardware which performs the actual archiving. It only accepts text files or specially formatted TIFF images as input. The text files are not the actual format in which the documents will be archived, but they are a necessary intermediate step in the archiving process.
use any suitable postscript printer driver, and print to a file.
you don't need to buy any proprietary software. you will retain all formatting. you can view using free software on virtually any platform you like.
this will easily work for any other document type you may have also, so you can have one standard archival format for anything and everything. big CADD file? no problem.
bonus tip. archive or no archive, make sure your documents include text (in the footer, say) that indicates the location and filename of the document. so when you want to work backwards from paper to file, you know where to look. the print date is important, too.
i don't know why i'm not using caps today...
--Lawrence Lessig for Congress!
For a variation on this, pick a good color Postscript printer, do the same print to file, then use Ghostscript to convert the PS to PDF
Overrated / Underrated : Moderation
Mariner Software released DropDoc, which is based on the GPL'ed wvWare libraries. It converts Word documents to .rtf, which maintains most of the basic formatting of a Word document.
I have used it and it works fairly well.
- (c) 2018 Hank Zimmerman
You are seriously asking the impossible...Nothing besides Word can do that...The other solution is to have a deep think about why you are abandoning Microsoft Word in the first place.
Mod parent up as funny, dude. Best MCSE parody I've seen in weeks.
Seriously, dude.
-=Maggie Leber=-
How about compression? Generally when archiving you're compressing, and plain text gets a far, far better compression ratio than a PDF (not to mention it's smaller to begin with).
^ | Up there somewhere, the original poster explained why he needs the conversion: he is using proprietary hardware that requires text (or a TIFF) as an intermediate format, which it then puts on microfilm. Enough of the "PDF, RTF, etc. would be a better solution" replies. Text and TIFF are the only options for this guy.
These are both good ideas I was going to suggest. The last piece of this puzzle which I was also going to suggest - if you really need plain text, the text browser "lynx" can be scripted to spit out html reformatted as plain text. Export word->html, run through tidy, and html2text with lynx, should yield decent results.
:)
Good Luck
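A sketch of how the tidy and lynx stages might be chained from Python, assuming both tools are installed and that Word has already saved the HTML. The function names and file names are invented for the example; the flags used are Tidy's word-2000 option and lynx's -dump, -nolist and -width.

```python
import subprocess


def tidy_command(src_html, cleaned_html):
    """argv for HTML Tidy with its Word clean-up option switched on."""
    return ["tidy", "-quiet", "--word-2000", "yes", "-output", cleaned_html, src_html]


def lynx_command(html_path, width=80):
    """argv for lynx rendering an HTML file as plain text on stdout."""
    return ["lynx", "-dump", "-nolist", f"-width={width}", html_path]


def word_html_to_text(src_html, cleaned_html, out_txt):
    """Clean Word's HTML with tidy, then let lynx lay it out as text."""
    # Tidy exits with 1 when it only has warnings, which Word HTML always
    # produces, so only exit codes of 2 and above are treated as failure.
    result = subprocess.run(tidy_command(src_html, cleaned_html))
    if result.returncode > 1:
        raise RuntimeError(f"tidy failed on {src_html}")
    with open(out_txt, "w", encoding="utf-8") as out:
        subprocess.run(lynx_command(cleaned_html), stdout=out, check=True)
```

Calling word_html_to_text("report.htm", "report.clean.html", "report.txt") would leave the cleaned HTML on disk next to the text, which is handy when you need to see why a particular table came out badly.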
I've had very good luck using OpenOffice's Save As.
There are a lot of options, however. I wonder what
Google uses?
One option would be saving as postscript and using
a postscript-to-text extractor.
But the best thing would be to relax the POT requirement
to allow XML. XML is so trivial to parse that
even if XML itself falls from grace with technological
advance, the few simple tags you need will certainly
be supportable with 10-15 minutes of scripting
(if the computers aren't smart enough to respond
to your voice command and retrieve the old XML
specs and interpret them well enough to perform
the transformation automatically at that future point
-- which seems unlikely, given that you use only
the most simple and basic of XML idioms.)
-I like my women like I like my tea: green-
I recently discovered that the Lotus Ami Pro word processor format (.SAM) is text-based, much like XML. You could use this, or better, go for the StarOffice/OpenOffice native formats or SGML.
"Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
What I would do is use Word to "print" the files using a PostScript printer driver. Use the "save to file" option in the printer driver. Then convert the .ps files to TIFFs. Details left as an "exercise for the reader".
Word scripts and macros also left as an "exercise for the reader"
If a TIFF file will work, then why not some kind of 'print to fax' setup?
Basic idea is:
Word: print->fax program
fax program saves as TIFF file (perhaps choosing 'fine' resolution etc.)
Same idea with a different implementation would be to print to a 'postscript' printer, then use a postscript to tiff conversion program. This could be fairly well automated with a print queue on a *nix box accepting the postscript 'print' jobs, then simply archiving the resulting tiff files off someplace handy.
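For the PostScript-to-TIFF step, Ghostscript is one candidate. A hedged Python sketch of the call, assuming gs is installed, that monochrome Group 4 fax output (the tiffg4 device) is acceptable to the archiving hardware, and with made-up function and file names:

```python
import subprocess


def ps_to_tiff_command(ps_path, out_pattern, dpi=300):
    """argv for Ghostscript rendering a PostScript file to G4 TIFF.

    A %03d in out_pattern makes Ghostscript write one numbered file per
    page instead of a single multi-page TIFF.
    """
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=tiffg4", f"-r{dpi}",
        f"-sOutputFile={out_pattern}",
        ps_path,
    ]


def ps_to_tiff(ps_path, out_pattern, dpi=300):
    """Run the conversion and raise if Ghostscript reports an error."""
    subprocess.run(ps_to_tiff_command(ps_path, out_pattern, dpi), check=True)
```

With a print queue dropping files into a spool directory, ps_to_tiff("job.ps", "job-%03d.tif") would produce job-001.tif, job-002.tif and so on. Whether those count as the "specially formatted" TIFFs the hardware wants is something only the vendor's documentation can answer.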
"But actually trying to use m4 as a general-purpose langage would be deeply perverse" --ESR
Maybe w3m would be a better choice; it also deals with tables (and even frames, but that doesn't seem to matter here).
Programming can be fun again. Film at 11.
Sounds good. One possible snag is if he cannot use multi-page TIFFs, although doubtless there exists some little open source snippet to convert a multi-page TIFF into lots of single pages.
~~~~~ BigLig2? You mean there's another one of me?
How? Does OpenOffice have a command line interface?
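For what it's worth, current LibreOffice builds (the OpenOffice descendant) can convert from the command line through the soffice binary. A small Python sketch of that call, assuming soffice is on the PATH and that the "txt:Text" filter is what you want; the function name is invented, and older OpenOffice releases may not accept these switches:

```python
import subprocess


def soffice_command(doc_path, out_dir, target="txt:Text"):
    """argv for a headless soffice conversion of one document."""
    return ["soffice", "--headless", "--convert-to", target,
            "--outdir", out_dir, doc_path]


def convert(doc_path, out_dir, target="txt:Text"):
    """Run the conversion and raise if soffice exits with an error."""
    subprocess.run(soffice_command(doc_path, out_dir, target), check=True)
```

Passing target="pdf" instead asks for PDF output with the same call.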
10b||~10b -- aah, what a question!
There's an inexpensive shareware program called "FinePrint" which works as a fake printer driver between your applications and your actual printer. This program was originally created to save you paper, by printing up to eight pages on one side of one sheet, but over time it has gained a lot of nice features.
One of the features that I've used for my online transaction receipts is the option to save the print job(s) to a file or a bunch of files. It will save in the following formats:
When I save as a TIF, I am given the option of monochrome, 4-bit, 8-bit, or 24-bit, and a resolution of 72-1200 or "custom". There is a checkbox for "Create a separate file for each page", so that solves the multi-page problem as well.
I really like this program, as I've been a registered user since nearly the beginning, and I've used nearly all of its features at one time or another. And I don't get anything for mentioning it either, so =P
FinePrint
Please consider making an automatic monthly recurring donation to the EFF
The above solution worked at my company - except we needed to convert them to PDF so we sent them to a PDF-converting printer.
The trick is to set up a macro to print all the documents in a folder/whatever. The easiest way is the following:
1) Create a new word document. You have to store the Macro somewhere, and a word document is simplest. Call the document "ConvertDocumentsToText".
2) Add some instructions in the actual text of the document. (Notice how you can combine the documentation with the code!) Generally, the instructions include:
How do you start the macro
Where do the files need to be?
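3) Start writing the macro. (Alas! I don't have my Word macro handy, so the best I can do is an untested sketch from memory.):
Sub PrintAllDocuments()
    ' Needs a reference to "Microsoft Scripting Runtime" for FileSystemObject.
    Dim ofsFileSys As New FileSystemObject
    Dim theFolder As Folder
    Dim theFile As File
    Dim theDocument As Word.Document
    Set theFolder = ofsFileSys.GetFolder("c:\mypath")
    For Each theFile In theFolder.Files
        ' ConfirmConversions:=False forces the open without a conversion dialog.
        Set theDocument = Documents.Open(FileName:=theFile.Path, ConfirmConversions:=False, ReadOnly:=True)
        ' Background:=False prints in the foreground, one document at a time.
        theDocument.PrintOut Background:=False
        theDocument.Close SaveChanges:=wdDoNotSaveChanges
    Next theFile
End Sub
4) That's pretty much it.
5) Be careful with TWO issues: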
Don't use background printing - use foreground printing. Otherwise the print function kicks off a background thread and you'll get thousands of documents open and the machine will grind to a halt.
Also, be careful with the parameters to the open command. We forced the open even if there were conversion questions (otherwise you get a dialog box that you have to click).
-Peter