Converting Word Files to Text for Archiving?

Posted by Cliff on Sunday December 8, 2002 @12:41PM from the non-trivial-conversions dept.

Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"

have you tried by gnixdep · 2002-12-08 12:44 · Score: 4, Insightful

Save As... HTML? It's plain text, but it retains the formatting, as long as you don't get too exotic.
1. Re:have you tried by 0x0d0a · 2002-12-08 13:13 · Score: 2
  
  TrueType is a proprietary font format?
  
  The format isn't, but Times New Roman and friends are.
  
  --
  May we never see th
2. Re:have you tried by aminorex · 2002-12-08 17:45 · Score: 2
  
  There are, however, plenty of freeware TrueType
  fonts -- and in fact, fonts can't be copyrighted!
  
  While this may change in the future, due to intense
  lobbying efforts by special interests and the general
  prevailing culture of intellectual property grants --
  I will not say "rights" because it is a devaluing abuse
  of the word -- the constitutional proviso precluding
  ex post facto law ensures that all currently existing
  fonts will never be copyrighted (within the
  U.S.).
  
  --
  -I like my women like I like my tea: green-
3. Re:have you tried by 0x0d0a · 2002-12-08 17:56 · Score: 2
  
  Microsoft relies on a EULA instead of straight copyright to protect their fonts, which are the relevant ones here.
  
  And few cloned fonts have the same spacing/characters as the fonts they're trying to clone. Actually, most of MS's fonts are "clone fonts", but they differ from the Adobe originals significantly.
  
  --
  May we never see th
4. Re:have you tried by aminorex · 2002-12-08 18:32 · Score: 2
  
  True. That is, however, a relatively recent innovation.
  Even limiting yourself to MS Fonts, there are still a lot
  in circulation -- including all the important ones -- from
  distributions that occurred before the font EULAs were
  introduced.
  
  --
  -I like my women like I like my tea: green-
anti-word by morgothan · 2002-12-08 12:49 · Score: 4, Informative

There is a really nice application called antiword that converts a Word file into a text file. I'm not too sure whether it will retain all the bullets and other crazy stuff, but it should be worth looking at. You can check it out at http://www.winfield.demon.nl/index.html

Well hope that helps.

--
---
PDF by ObviousGuy · 2002-12-08 12:50 · Score: 2, Informative

I believe Adobe Acrobat can import Word documents and save them as PDF.

--
I have been pwned because my /. password was too easy to guess.
1. Re:PDF by Komarosu · 2002-12-09 00:13 · Score: 2
  
  but then you're still in a format that needs to be converted to plain text...
  
  --
  
  "What do you mean you have no ice? Do you expect me to drink this coffee hot?" - Random Customer, Clerks
Try... by curunir · 2002-12-08 12:51 · Score: 5, Informative

this or this

--
"Don't blame me, I voted for Kodos!"
Having worked a similar problem... by metacosm · 2002-12-08 12:54 · Score: 5, Informative

Short Answer: Good Luck! :)

Long Answer:

This was a few years ago, so it doesn't take into account new applications that could have totally changed the conversion landscape.

I have tried to do something similar in the past for a company that does JUST conversions. We actually tried to get doc -> text with formatting, both with closed-source expensive applications and with open-source apps... not much success on either front.

In the end, we basically gave up on it, and ended up making 2 versions of the document. A PDF version and a text version. The text version was easily searchable, and the PDF version looked great. Both ASCII and PDF are open standards, so that when they are phased out, we should be able to buy/write conversion tools to the next generation. We then wrapped a GUI around it so that when you searched, you searched the text, and you got results in PDF.

The fact that PDF is an open standard is important. When we did this, we used the text files for searching, but nowadays you can get lots of engines to search PDFs directly, so the text conversion may not even be needed.

Sorry I don't have the solution you are looking for... honestly good luck. :)
Convoluted Suggestion #1 by tedDancin · 2002-12-08 12:58 · Score: 2
This is by no means the easiest solution, but one that could work. I don't envy the position you're in.
1. Create a web page (eg. ASP) on a server running MS Word. In your back end code, create an instance of MS Word and automate the conversion of these files to HTML. Pain in the ass, but this can be done.
2. MS Word HTML comes out with all sorts of xml tags and crap, so use a simple regular expression to filter out all the tags you don't need, keeping <p>, <ol> etc etc.
And yes:

3. ???
4. Profit!!
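
A rough sketch of step 2, written in Python purely for illustration (the suggestion above names no language). The allowlist of tags and the function name `strip_unwanted_tags` are assumptions, and regular expressions are fragile on real Word HTML, so treat this as a starting point rather than a finished filter.

```python
import re

# Tags worth keeping for basic structure; this allowlist is an assumption.
KEEP = {"p", "ol", "ul", "li", "table", "tr", "td", "th", "br"}

# Matches an opening or closing tag and captures its name, including
# namespaced names such as o:p that Word emits.
TAG = re.compile(r"</?\s*([A-Za-z][\w:-]*)[^>]*>")


def strip_unwanted_tags(html: str) -> str:
    """Drop every tag not in KEEP; kept tags lose their attributes."""

    def replace(match: re.Match) -> str:
        name = match.group(1).lower()
        if name not in KEEP:
            return ""
        closing = match.group(0).startswith("</")
        return f"</{name}>" if closing else f"<{name}>"

    # Remove comments first, since Word hides conditional markup in them.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return TAG.sub(replace, html)
```

The text between the tags is left untouched, so the result is still HTML, only much plainer.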
--

Ladies, form queue here -->
1. Re:Convoluted Suggestion #1 by shaitand · 2002-12-08 16:17 · Score: 2
  
  HTML is already text??? I understand there are reasons to turn html into regular old plain text but certainly not for archiving... The two issues are a format that will be around in 10yrs, html should do the trick, or xhtml and compression, either is text and should get an EXTREMELY high compression ratio.
First HTML by Matthias+Wiesmann · 2002-12-08 13:05 · Score: 2
I don't know of a direct solution, but I would decompose this into two problems.
1. Transform your documents into a reasonable format, XML, or very simple HTML: no page layout, only tables and lists.
2. Transform those files into text.
The first part is difficult, as you have to filter the data and remove a lot of unneeded information. One possible way to do this would be to convert the Word documents into RTF, and then RTF into HTML (tools exist to do this, like RTFTOHTML).
The second part is quite easy. If your data is XML, you can convert it using simple scripts (tables might be an issue); if the data is simple HTML, you could use Lynx to convert it into text with some layout.

In fact, I would keep both versions of the data handy. Having a somewhat structured version of the data never hurts, and text files do not take so much space.
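
To make the second stage concrete, here is a minimal Python sketch that turns very simple HTML (paragraphs and nested lists only) into indented text. It uses the standard library's `html.parser` as a stand-in for Lynx; the class and function names are made up for this example, and tables are deliberately left out because they are the hard part.

```python
from html.parser import HTMLParser


class SimpleHtmlToText(HTMLParser):
    """Render paragraphs and nested lists as indented plain text."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []
        self.lists: list[list] = []   # stack of [tag, next_number] per open list
        self.buffer: list[str] = []   # text collected for the current line
        self.prefix = ""              # bullet or number waiting for its text

    def flush_line(self) -> None:
        text = " ".join("".join(self.buffer).split())
        if text:
            self.lines.append(self.prefix + text)
            self.prefix = ""
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.flush_line()
            self.lists.append([tag, 1])
        elif tag == "li" and self.lists:
            self.flush_line()
            kind, number = self.lists[-1]
            marker = "* " if kind == "ul" else f"{number}. "
            self.lists[-1][1] += 1
            self.prefix = "  " * len(self.lists) + marker
        elif tag == "p":
            self.flush_line()

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.lists:
            self.flush_line()
            self.lists.pop()
        elif tag in ("li", "p"):
            self.flush_line()
            if tag == "li":
                self.prefix = ""

    def handle_data(self, data):
        self.buffer.append(data)


def html_to_text(html: str) -> str:
    parser = SimpleHtmlToText()
    parser.feed(html)
    parser.close()
    parser.flush_line()
    return "\n".join(parser.lines)
```

Each nesting level adds two spaces of indentation, unordered items get `*`, and ordered items are numbered per list.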
wvWare by Alethes · 2002-12-08 13:06 · Score: 5, Informative

wvWare has a library and a set of utilities. I use this all the time to convert Word attachments to HTML so I can read them.

wvHtml: convert your Word document into HTML4.0

wvLatex: convert your Word document into visually (pretty) correct LaTeX

wvCleanLatex: convert into 'cleaner' LaTeX containing less visual mark-up, more suitable for further use and LyX import. Work in progress

wvDVI: converts word to DVI. Requires 'latex'

wvPS: converts word to PostScript. Requires 'dvips'

wvPDF: converts word to Adobe PDF. Requires 'distill' from Adobe [Someone do a pdflatex or pdfhtml version :-)]

wvText: converts word to plain text. Textually correct output requires 'lynx.' For poor output, this doesn't require anything special.

wvAbw: converts word to Abiword format. (Far better just to use Abiword.)

wvWml: converts word to WML for viewing on portable devices like WebPhones and Palm Pilots.

wvRtf: a basic version exists

wvMime: can be plugged as a MIME helper application into your browser/mail client; presents the document on-screen inside GhostView, while all intermediate files generated go into the /tmp directory.
1. Re:wvWare by sohp · 2002-12-08 15:07 · Score: 3, Informative
  
  I'll second wv, formerly known as MSWordView. I've used it since before the name change and have been satisfied. According to the blurb at freshmeat,
  wv (formerly known as MSWordView) is a library that understands the Microsoft Word 2000, 97, 95 and 6 file formats (".doc"), and is able to convert Word documents into HTML, which can then be read with a browser. It also allows other programs access to Word documents for the purpose of converting them to other formats (like RTF, PostScript, and PDF), and is currently being used by Abiword as its word importer.
  
  If by chance you have any Java around, the POI HDF APIs are great for manipulating that Horrible Document Format.
simple answer. by SN74S181 · 2002-12-08 13:25 · Score: 5, Funny

One of the requirements of our archiving process is that the documents be stored in plain text format.

There's one simple answer: uuencode.

*ducks*
1. Re:simple answer. by argel · 2002-12-09 03:56 · Score: 2
  
  There's one simple answer: uuencode.
  And then you could use ROT13 to encrypt it! :-)
  
  --
  
  -- Argel
use a word macro first by joe094287523459087 · 2002-12-08 13:32 · Score: 4, Informative

i think all these answers are missing an essential part of your requirement, which is to keep in place the "advanced" formatting like indentations, italicization, bullets

i worked at a company that created all their docs (1100 of them) in word and wanted them to be ported to pagemaker or the like. however, at that time there was no way to do the conversion so we needed to convert them to text first. we basically had the same problem you have.

we ended up using a Word macro to convert all the word styles (font style, indents, bullets) to plaintext equivalents (tabs, _underline_, xbullets) and then saved as text. that worked nicely and i think it's the only way to accurately preserve word-specific styles and formatting.
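
The macro itself isn't shown, so the following is only a Python sketch of the mapping idea, not the poster's code. It assumes each paragraph has already been pulled out of Word as a small record; the `Run` and `Paragraph` types, the `render` function, and the `*` bullet marker are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    text: str
    underline: bool = False


@dataclass
class Paragraph:
    runs: list[Run] = field(default_factory=list)
    indent: int = 0        # indentation level, rendered as tabs
    bullet: bool = False


def render(paragraphs: list[Paragraph], bullet_mark: str = "*") -> str:
    """Turn styled paragraphs into plain text with tabs and markers."""
    lines = []
    for para in paragraphs:
        # Underlined runs become _underline_; other runs pass through.
        body = "".join(
            f"_{run.text}_" if run.underline else run.text
            for run in para.runs
        )
        marker = f"{bullet_mark} " if para.bullet else ""
        lines.append("\t" * para.indent + marker + body)
    return "\n".join(lines)
```

The point of the design is that every Word style gets exactly one plain-text equivalent, so the conversion stays predictable across all the documents.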
Stellent Outside In by divbyzero · 2002-12-08 13:40 · Score: 4, Informative

Stellent (formerly INSO) Outside In is a popular commercial solution to the problem. It is an SDK for conversion of lots of document types (including Word) into plaintext, HTML, or XML. It allows you to control how much of the formatting is preserved, and in what manner. It's not perfect, not crash-proof, and not free, but it might do the trick in a corporate situation, especially when wrapped with a watchdog process. My company has had a lot of success with it.

--
But my grandest creation, as history will tell,
Was Firefrorefiddle, the Fiend of the Fell.
1. Re:Stellent Outside In by jafuser · 2002-12-09 02:57 · Score: 2
  
  It's not perfect, not crash-proof, and not free
  Useful, stable, cheap . . . choose any zero? =)
  
  --
  Please consider making an automatic monthly recurring donation to the EFF
2. Re:Stellent Outside In by divbyzero · 2002-12-09 10:07 · Score: 2
  
  I didn't mean to be too negative; it is quite useful, in spite of certain shortcomings.
  
  --
  But my grandest creation, as history will tell,
  Was Firefrorefiddle, the Fiend of the Fell.
Word & Lynx by wdr1 · 2002-12-08 13:47 · Score: 3, Interesting

Use MS Word to save it as HTML, then run it through lynx -dump to save it as text.
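
For a batch job, the `lynx -dump` step could be wrapped roughly like this in Python. The function names are made up for the example, and the extra switches (`-nolist`, `-force_html`, `-width`) are optional choices of mine rather than part of the suggestion; lynx must be installed and on the PATH for the second function to work.

```python
import subprocess


def lynx_dump_command(html_path: str, width: int = 80) -> list[str]:
    """Build the argv for rendering an HTML file as plain text."""
    return [
        "lynx",
        "-dump",            # write the rendered page to stdout and exit
        "-nolist",          # omit the numbered list of links at the end
        "-force_html",      # treat the input as HTML regardless of extension
        f"-width={width}",  # wrap the output at this column
        html_path,
    ]


def html_file_to_text(html_path: str, width: int = 80) -> str:
    """Run lynx and return its plain-text rendering."""
    result = subprocess.run(
        lynx_dump_command(html_path, width),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```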

Although, you may want to give strong consideration to another poster's recommendation of using PDF. (Particularly since you care about formatting.)

-Bill

--
SlashSig Karma: Excellent (mostly affected by moderatio
Re:have you tried ... HTML Tidy by Louis_Wu · 2002-12-08 14:05 · Score: 4, Insightful

HTML Tidy has a 'clean up Word HTML' mode which works wonders. Dave Raggett developed it for W3C.
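
A hedged sketch of driving that mode from a script, in Python. The option spelling `--word-2000 yes` and the `-q` and `-o` switches are what I recall from Tidy's documentation, so check them against the version you have installed; the function names are invented for this example.

```python
import subprocess


def tidy_word_command(src: str, dest: str) -> list[str]:
    """Build the argv for cleaning Word-exported HTML with tidy."""
    return [
        "tidy",
        "-q",                  # suppress nonessential output
        "--word-2000", "yes",  # strip Word-specific markup
        "-o", dest,            # write the cleaned HTML here
        src,
    ]


def clean_word_html(src: str, dest: str) -> None:
    """Run tidy; it exits with 1 for warnings only and 2 for errors."""
    result = subprocess.run(
        tidy_word_command(src, dest), capture_output=True, text=True
    )
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
```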
Re:Why? by Unknown+Relic · 2002-12-08 14:50 · Score: 4, Informative

I wasn't going to get into the details of specifically why this was needed, but since a large number of the posts have been asking the question, I will explain. An AC actually indirectly hit the nail on the head, though he was accusing me of fraud in the process: "If you are seriously interested in archiving, you should save off the files in a non-updatable format."

For legal reasons, our company is required to maintain a permanent archive of some specific documents. In this case, permanent does not mean ten or even twenty years, but one hundred. This being the case, the archiving is actually being done to an analogue medium. If you are curious, we are using a type of microfilm which is specifically designed to be readable for up to 500 years. Microfilm also has the added benefit of being relatively unalterable once developed, unlike a digital file.

The need for doc to text conversion comes from the hardware which performs the actual archiving. It only accepts text files or specially formatted TIFF images as input. The text files are not the actual format in which the documents will be archived, but they are a necessary intermediate step in the archiving process.
print to postscript by wfrp01 · 2002-12-08 14:54 · Score: 3, Insightful

use any suitable postscript printer driver, and print to a file.

you don't need to buy any proprietary software. you will retain all formatting. you can view using free software on virtually any platform you like.

this will easily work for any other document type you may have also, so you can have one standard archival format for anything and everything. big CADD file? no problem.

bonus tip. archive or no archive, make sure your documents include text (in the footer, say) that indicates the location and filename of the document. so when you want to work backwards from paper to file, you know where to look. the print date is important, too.

i don't know why i'm not using caps today...

--

--Lawrence Lessig for Congress!
Setup a text printer by mhesseltine · 2002-12-08 14:57 · Score: 5, Informative
1. Add a Generic/Text only printer attached to a file port.
2. Open file in Word
3. Select print
4. Select a file location and name
5. Enjoy your new text file.
For a variation on this, pick a good color Postscript printer, do the same print to file, then use Ghostscript to convert the PS to PDF
--
Overrated / Underrated : Moderation :: Anonymous Coward : Posting
1. Re:Setup a text printer by jsse · 2002-12-08 15:25 · Score: 2, Informative
  
  That's not exactly what the poster is asking for, because he doesn't want to click through the menus for every document. However, it's the best solution, because the other converters suggested above could not transform the document so faithfully; here the transformation is part of Word itself.
  
  Maybe he could write a macro to automate the process?
2. Re:Setup a text printer by mhesseltine · 2002-12-08 17:36 · Score: 2
  
  From jsse:
  That's not exactly what the poster is asking for, because he doesn't want to click through the menus for every document. However, it's the best solution, because the other converters suggested above could not transform the document so faithfully; here the transformation is part of Word itself. Maybe he could write a macro to automate the process?
  
  I couldn't agree more. I'd hate to open each file, select print, and enter a filename. However, I also agree that VBA is probably geared toward this exact problem.
  
  --
  Overrated / Underrated : Moderation :: Anonymous Coward : Posting
On the Mac... by singularity · 2002-12-08 15:08 · Score: 2

Mariner Software released DropDoc, which is based on the GPL'ed wvWare libraries. It converts Word documents to .rtf, which maintains most of the basic formatting of a Word document.

I have used it and it works fairly well.

--
- (c) 2018 Hank Zimmerman
Re:Dude. by MaggieL · 2002-12-08 16:03 · Score: 3, Funny

You are seriously asking the impossible...Nothing besides Word can do that...The other solution is to have a deep think about why you are abandoning Microsoft Word in the first place.

Mod parent up as funny, dude. Best MCSE parody I've seen in weeks.

Seriously, dude.

--
-=Maggie Leber=-
Re:Archiving? Are you sure? by shaitand · 2002-12-08 16:14 · Score: 3, Insightful

How about compression? Generally when archiving, you're compressing, and plain text gets a far, far better compression ratio than a PDF (not to mention it's smaller to begin with).
Stop with the other formats!!! by shaitand · 2002-12-08 16:29 · Score: 2

^ | Up there somewhere the original poster gave the reason he needs the conversion. He is using proprietary hardware that requires text (or a TIFF) as an intermediate format, which it then puts on microfilm. Enough with the "PDF, RTF, etc. would be a better solution" replies. Text and TIFF are the only options for this guy.
Re:have you tried ... HTML Tidy by HRbnjR · 2002-12-08 17:08 · Score: 2

These are both good ideas I was going to suggest. The last piece of this puzzle which I was also going to suggest - if you really need plain text, the text browser "lynx" can be scripted to spit out html reformatted as plain text. Export word->html, run through tidy, and html2text with lynx, should yield decent results.

Good Luck :)
OpenOffice by aminorex · 2002-12-08 17:32 · Score: 3, Insightful

I've had very good luck using OpenOffice's Save As.

There are a lot of options, however. I wonder what
Google uses?

One option would be saving as postscript and using
a postscript-to-text extractor.

But the best thing would be to relax the POT requirement
to allow XML. XML is so trivial to parse that
even if XML itself falls from grace with technological
advance, the few simple tags you need will certainly
be supportable with 10-15 minutes of scripting
(if the computers aren't smart enough to respond
to your voice command and retrieve the old XML
specs and interpret them well enough to perform
the transformation automatically at that future point
-- which seems unlikely, given that you use only
the most simple and basic of XML idioms.)
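
As a small illustration of how little scripting such a format needs, here is a Python sketch using the standard library's `xml.etree.ElementTree`. The tag names (`doc`, `para`, `list`, `item`) are invented for the example; any real archive would define its own.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_source: str) -> str:
    """Render a flat document of <para> and <list> elements as text."""
    root = ET.fromstring(xml_source)
    lines = []
    for node in root:
        if node.tag == "para":
            lines.append((node.text or "").strip())
        elif node.tag == "list":
            # Each <item> becomes an indented bullet line.
            for item in node.findall("item"):
                lines.append("  * " + (item.text or "").strip())
    return "\n".join(lines)
```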

--
-I like my women like I like my tea: green-
The choices are obvious by mnmn · 2002-12-08 17:55 · Score: 2

I recently discovered Lotus Notes Word format (.SAM) is text based just like XML. You could use this, or better go for StarOffice/OpenOffice native formats or SGML.

--
"Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
Re:Why? by HughsOnFirst · 2002-12-08 18:14 · Score: 2

What I would do is use Word to "print" the files using a PostScript printer driver. Use the "save to file option" in the printer driver. Then convert the .ps files to TIFFs. Details left as an "exercise for the reader"
Word scripts and macros also left as an "exercise for the reader"
Re:Why? by smoon · 2002-12-08 19:46 · Score: 2

If a TIFF file will work, then why not some kind of 'print to fax' setup?

Basic idea is:
Word: print->fax program
fax program saves as TIFF file (perhaps choosing 'fine' resolution etc.)

Same idea with a different implementation would be to print to a 'postscript' printer, then use a postscript to tiff conversion program. This could be fairly well automated with a print queue on a *nix box accepting the postscript 'print' jobs, then simply archiving the resulting tiff files off someplace handy.
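
One way the PostScript-to-TIFF step might look with Ghostscript, sketched in Python. Ghostscript is my choice of converter, not something named above, and the `tiffg4` device, the 300 dpi default, and the output file pattern are assumptions; whatever "specially formatted" TIFF the archiving hardware expects may need different settings.

```python
import subprocess


def ps_to_tiff_command(ps_path: str, out_pattern: str = "page-%03d.tif",
                       dpi: int = 300) -> list[str]:
    """Build a Ghostscript argv that writes one TIFF file per page."""
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE", "-dSAFER",  # run unattended, then exit
        "-sDEVICE=tiffg4",                  # monochrome, Group 4 compressed
        f"-r{dpi}",                         # resolution in dots per inch
        f"-sOutputFile={out_pattern}",      # %03d numbers the pages
        ps_path,
    ]


def ps_to_tiff(ps_path: str, out_pattern: str = "page-%03d.tif") -> None:
    """Run the conversion (requires gs on PATH)."""
    subprocess.run(ps_to_tiff_command(ps_path, out_pattern), check=True)
```

Because the output pattern contains `%03d`, each page lands in its own file, which sidesteps any trouble with multi-page TIFFs.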

--
"But actually trying to use m4 as a general-purpose langage would be deeply perverse" --ESR
Re:have you tried ... HTML Tidy by __past__ · 2002-12-08 21:26 · Score: 2

Maybe w3m would be a better choice, it also deals with tables (and even frames, but that doesn't seem to matter here).

--
Programming can be fun again. Film at 11.
Re:Why? by biglig2 · 2002-12-08 23:19 · Score: 2

Sounds good. One possible snag is if he cannot use multi-page TIFFs, although doubtless there exists some little open source snippet to convert a multi-page TIFF into lots of single pages.

--
~~~~~ BigLig2? You mean there's another one of me?
Re:OpenOffice is your friend by forsetti · 2002-12-09 01:21 · Score: 2

How? Does OpenOffice have a command line interface?

--
10b||~10b -- aah, what a question!
Re:Why? by jafuser · 2002-12-09 03:13 · Score: 2
Even better:
There's an inexpensive shareware program called "FinePrint" which works as a fake printer driver between your applications and your actual printer. This program was originally created to save you paper, by printing up to eight pages on one side of one sheet, but over time it has gained a lot of nice features.
One of the features that I've used for my online transaction receipts is the option to save the print job(s) to a file or a bunch of files. It will save in the following formats:
- fp (FinePrint)
- bmp
- emf
- jpg
- tif
- txt
When I save as a TIF, I am given the option of monochrome, 4-bit, 8-bit, or 24-bit, and a resolution of 72-1200 or "custom". There is a checkbox for "Create a separate file for each page", so that solves the multi-page problem as well.
I really like this program. I've been a registered user since nearly the beginning, and I've used nearly all of its features at one time or another. And I don't get anything for mentioning it either, so =P
FinePrint
--
Please consider making an automatic monthly recurring donation to the EFF
Re:Why? by jafuser · 2002-12-09 03:15 · Score: 2

There's an inexpensive shareware program called "FinePrint" which works as a fake printer driver between your applications and your actual printer.
I meant to have the word "driver" at the end of this sentence.

--
Please consider making an automatic monthly recurring donation to the EFF
Re: The Macro to help out the printing by phamlen · 2002-12-09 03:42 · Score: 2

The above solution worked at my company - except we needed to convert them to PDF so we sent them to a PDF-converting printer.

The trick is to set up a macro to print all the documents in a folder/whatever. The easiest way is the following:

1) Create a new word document. You have to store the Macro somewhere, and a word document is simplest. Call the document "ConvertDocumentsToText".

2) Add some instructions in the actual text of the document. (Notice how you can combine the documentation with the code!) Generally, the instructions include:
How do you start the macro?
Where do the files need to be?

3) Start writing the macro. (Alas! I don't have my Word macro handy, so the best I can do is pseudo-code.):

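Sub PrintAllDocuments()
    ' Needs a reference to "Microsoft Scripting Runtime" (Tools > References).
    Dim ofsFileSys As New Scripting.FileSystemObject
    Dim theFolder As Scripting.Folder
    Dim theFile As Scripting.File
    Dim theDocument As Word.Document

    Set theFolder = ofsFileSys.GetFolder("c:\mypath")
    For Each theFile In theFolder.Files
        ' The Open parameters here are one plausible set, not the originals.
        ' ConfirmConversions:=False skips the conversion dialog box.
        Set theDocument = Documents.Open(FileName:=theFile.Path, _
            ConfirmConversions:=False, ReadOnly:=True, AddToRecentFiles:=False)
        ' Background:=False prints in the foreground, one document at a time.
        theDocument.PrintOut Background:=False
        theDocument.Close SaveChanges:=wdDoNotSaveChanges
    Next theFile
End Sub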

4) That's pretty much it.

5) Be careful with TWO issues:
Don't use background printing - use foreground printing. Otherwise the print function kicks off a background thread and you'll get thousands of documents open and the machine will grind to a halt.

Also, be careful with the parameters to the open command. We forced the open even if there were conversion questions (otherwise you get a dialog box that you have to click).

-Peter