Slashdot Mirror


Converting Word Files to Text for Archiving?

Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"

43 of 81 comments

  1. have you tried by gnixdep · · Score: 4, Insightful

    Save As... HTML? It's in plain text, but it retains the formatting, as long as you don't get too exotic.

    1. Re:have you tried by 0x0d0a · · Score: 2

      TrueType is a proprietary font format?

      The format isn't, but Times New Roman and friends are.

    2. Re:have you tried by aminorex · · Score: 2
      There are, however, plenty of freeware TrueType
      fonts -- and in fact, fonts can't be copyrighted!


      While this may change in the future, due to intense
      lobbying efforts by special interests and the general
      prevailing culture of intellectual property grants --
      I will not say "rights" because it is a devaluing abuse
      of the word -- the constitutional proviso precluding
      ex post facto law ensures that all currently existing
      fonts will never be copyrighted (within the
      U.S.).

      --
      -I like my women like I like my tea: green-
    3. Re:have you tried by 0x0d0a · · Score: 2

      Microsoft relies on a EULA instead of straight copyright to protect their fonts, which are the relevant ones here.

      And few cloned fonts have the same spacing/characters as the fonts they're trying to clone. Actually, most of MS's fonts are "clone fonts", but they differ from the Adobe originals significantly.

    4. Re:have you tried by aminorex · · Score: 2

      True. That is, however, a relatively recent innovation.
      Even limiting yourself to MS Fonts, there are still a lot
      in circulation -- including all the important ones -- from
      distributions that occurred before the font EULAs were
      introduced.

      --
      -I like my women like I like my tea: green-
  2. anti-word by morgothan · · Score: 4, Informative

    There is a really nice application called antiword that strips a Word file into a text file. I'm not too sure if it will retain all the bullets and other crazy stuff, but it should be worth looking at. You can check it out at http://www.winfield.demon.nl/index.html

    Well hope that helps.

    --
    ---
  3. PDF by ObviousGuy · · Score: 2, Informative

    I believe Adobe Acrobat can import Word documents and save them as PDF.

    --
    I have been pwned because my /. password was too easy to guess.
    1. Re:PDF by Komarosu · · Score: 2

      But you're still in a format that needs to be converted to plain text...

      --

      "What do you mean you have no ice? Do you expect me to drink this coffee hot?" - Random Customer, Clerks
  4. Try... by curunir · · Score: 5, Informative

    this or this

    --
    "Don't blame me, I voted for Kodos!"
  5. Having worked a similar problem... by metacosm · · Score: 5, Informative

    Short Answer: Good Luck! :)

    Long Answer:

    This was a few years ago, so it doesn't take into account new applications that could have totally changed the conversion landscape.

    I have tried to do something similar in the past for a company that does JUST conversions. We actually tried to get doc -> text with formatting, both with expensive closed-source applications and with open-source apps... not much success on either front.

    In the end, we basically gave up on it, and ended up making 2 versions of the document. A PDF version and a text version. The text version was easily searchable, and the PDF version looked great. Both ASCII and PDF are open standards, so that when they are phased out, we should be able to buy/write conversion tools to the next generation. We then wrapped a GUI around it so that when you searched, you searched the text, and you got results in PDF.
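
    The "search the text, return the PDF" arrangement can be sketched roughly as below. This is only an illustration: it assumes each document is stored as a pair of files with the same name (report.txt next to report.pdf), and the function name search_archive is made up for the example.

```python
from pathlib import Path


def search_archive(root: Path, term: str) -> list[Path]:
    """Search the plain-text copies and return the matching PDF twins.

    Assumes every foo.txt under root has a foo.pdf beside it
    (an assumption of this sketch, not a detail from the post).
    """
    needle = term.lower()
    hits = []
    for txt in sorted(root.rglob("*.txt")):
        # Case-insensitive substring match on the text version.
        if needle in txt.read_text(errors="replace").lower():
            # Hand back the good-looking version, not the searchable one.
            hits.append(txt.with_suffix(".pdf"))
    return hits
```

    A real system would use a proper index rather than rereading every file per query, but the pairing idea is the same.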

    The fact that PDF is an open standard is important. When we did this, we used the text files for searching, but nowadays you can get lots of engines to search PDFs directly, so the text conversion may not even be needed.

    Sorry I don't have the solution you are looking for... honestly good luck. :)

  6. Convoluted Suggestion #1 by tedDancin · · Score: 2
    This is by no means the easiest solution, but one that could work. I don't envy the position you're in.

    1. Create a web page (eg. ASP) on a server running MS Word. In your back end code, create an instance of MS Word and automate the conversion of these files to HTML. Pain in the ass, but this can be done.
    2. MS Word HTML comes out with all sorts of xml tags and crap, so use a simple regular expression to filter out all the tags you don't need, keeping <p>, <ol> etc etc.
    And yes:

    3. ???
    4. Profit!!
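
    The tag filter from step 2 might look something like this. It is a minimal sketch in Python rather than ASP, and the whitelist and the name strip_word_tags are invented for the example; a regular expression is good enough for Word's machine-generated markup but is not a general HTML parser.

```python
import re

# Tags worth keeping; every other tag is dropped but its text is kept.
KEEP = {"p", "ol", "ul", "li", "table", "tr", "td", "th", "b", "i"}

# Matches an opening or closing tag and captures its name.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][\w:.-]*)[^>]*>")


def strip_word_tags(html: str) -> str:
    """Remove Word-specific markup, keeping only whitelisted tags."""
    # Comments (including Word's conditional-comment XML islands) go first.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)

    def repl(match):
        closing, name = match.group(1), match.group(2).lower()
        if name in KEEP:
            # Re-emit the bare tag, which also drops attributes
            # such as class=MsoNormal.
            return f"<{closing}{name}>"
        return ""

    return TAG_RE.sub(repl, html)
```

    For example, strip_word_tags('<p class=MsoNormal><o:p>Hi</o:p></p>') gives '<p>Hi</p>'.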
    --

    Ladies, form queue here -->
    1. Re:Convoluted Suggestion #1 by shaitand · · Score: 2

      HTML is already text??? I understand there are reasons to turn HTML into regular old plain text, but certainly not for archiving... The two issues are a format that will still be around in 10 years (HTML or XHTML should do the trick) and compression. Either one is text and should get an EXTREMELY high compression ratio.

  7. First HTML by Matthias+Wiesmann · · Score: 2
    I don't know of a direct solution, but I would decompose this into two problems.
    1. Transform your documents into a reasonable format, XML, or very simple HTML: no page layout, only tables and lists.
    2. Transform those files into text.
    The first part is difficult, as you have to filter the data and remove a lot of unneeded information. One possible way to do this would be to convert Word documents into RTF, and then RTF into HTML (tools exist to do this, like RTFTOHTML).

    The second part is quite easy. If your data is XML, you can convert it using simple scripts (tables might be an issue). If the data is simple HTML, you could use Lynx to convert it into text with some layout.

    In fact, I would keep both versions of the data handy. Having a somewhat structured version of the data never hurts, and text files do not take up much space.
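
    Since tables are the part most likely to cause trouble, here is a rough sketch of turning one simple HTML table into aligned plain-text columns. It uses Python's standard html.parser; the class and function names are made up for the example, and it assumes a single table with no nesting and no rowspan/colspan.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the cell text of a simple, non-nested HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_text(html: str) -> str:
    """Render the table as padded columns separated by ' | '."""
    grabber = TableGrabber()
    grabber.feed(html)
    grabber.close()
    rows = [r for r in grabber.rows if r]
    if not rows:
        return ""
    ncols = max(len(r) for r in rows)
    rows = [r + [""] * (ncols - len(r)) for r in rows]  # pad short rows
    widths = [max(len(r[i]) for r in rows) for i in range(ncols)]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )
```

    Fixed-width padding like this only lines up in a monospaced font, which is usually a safe assumption for a plain-text archive.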

  8. wvWare by Alethes · · Score: 5, Informative

    wvWare has a library and a set of utilities. I use this all the time to convert Word attachments to HTML so I can read them.

    wvHtml: convert your Word document into HTML4.0

    wvLatex: convert your Word document into visually (pretty) correct LaTeX

    wvCleanLatex: convert into 'cleaner' LaTeX containing less visual mark-up, more suitable for further use and LyX import. Work in progress

    wvDVI: converts word to DVI. Requires 'latex'

    wvPS: converts word to PostScript. Requires 'dvips'

    wvPDF: converts word to Adobe PDF. Requires 'distill' from Adobe [Someone do a pdflatex or pdfhtml version :-)]

    wvText: converts word to plain text. Textually correct output requires 'lynx.' For poor output, this doesn't require anything special.

    wvAbw: converts word to Abiword format. (Far better just to use Abiword.)

    wvWml: converts word to WML for viewing on portable devices like WebPhones and Palm Pilots.

    wvRtf: a basic version exists

    wvMime: can be plugged as a MIME helper application into your browser/mail client; presents the document on-screen inside GhostView, while all intermediate files generated go into the /tmp directory.

    1. Re:wvWare by sohp · · Score: 3, Informative
      I'll second wv, formerly known as MSWordView. I've used it since before the name change and have been satisfied. According to the blurb at freshmeat,
      wv (formerly known as MSWordView) is a library that understands the Microsoft Word 2000, 97, 95 and 6 file formats (".doc"), and is able to convert Word documents into HTML, which can then be read with a browser. It also allows other programs access to Word documents for the purpose of converting them to other formats (like RTF, PostScript, and PDF), and is currently being used by Abiword as its word importer.


      If by chance you have any Java around, the POI HDF APIs are great for manipulating that Horrible Document Format.
  9. simple answer. by SN74S181 · · Score: 5, Funny
    One of the requirements of our archiving process is that the documents be stored in plain text format.


    There's one simple answer: uuencode.

    *ducks*
    1. Re:simple answer. by argel · · Score: 2

      There's one simple answer: uuencode. And then you could use ROT13 to encrypt it! :-)

      --

      -- Argel
  10. use a word macro first by joe094287523459087 · · Score: 4, Informative

    i think all these answers are missing an essential part of your requirement, which is to keep in place the "advanced" formatting like indentations, italicization, bullets

    i worked at a company that created all their docs (1100 of them) in word and wanted them to be ported to pagemaker or the like. however, at that time there was no way to do the conversion so we needed to convert them to text first. we basically had the same problem you have.

    we ended up using a Word macro to convert all the word styles (font style, indents, bullets) to plaintext equivalents (tabs, _underline_, xbullets) and then saved as text. that worked nicely and i think it's the only way to accurately preserve word-specific styles and formatting.
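
    the mapping itself is simple once the style information is available. here is a rough illustration in Python rather than a Word macro; the paragraph dictionaries are a made-up stand-in for whatever the macro reads from Word, and the "* " bullet marker is just one possible plaintext equivalent.

```python
def to_plain(paragraphs):
    """Turn styled paragraphs into plain text with visible markers.

    Each paragraph is a dict with a "text" key and optional
    "underline", "bullet" and "indent" keys (a hypothetical model,
    not Word's own object model).
    """
    lines = []
    for para in paragraphs:
        text = para["text"]
        if para.get("underline"):
            text = f"_{text}_"       # underline becomes _underline_
        if para.get("bullet"):
            text = "* " + text       # bullet becomes a literal marker
        # One tab per indent level.
        lines.append("\t" * para.get("indent", 0) + text)
    return "\n".join(lines)
```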

  11. Stellent Outside In by divbyzero · · Score: 4, Informative

    Stellent (formerly INSO) Outside In is a popular commercial solution to the problem. It is an SDK for conversion of lots of document types (including Word) into plaintext, HTML, or XML. It allows you to control how much of the formatting is preserved, and in what manner. It's not perfect, not crash-proof, and not free, but it might do the trick in a corporate situation, especially when wrapped with a watchdog process. My company has had a lot of success with it.

    --
    But my grandest creation, as history will tell,
    Was Firefrorefiddle, the Fiend of the Fell.
    1. Re:Stellent Outside In by jafuser · · Score: 2
      It's not perfect, not crash-proof, and not free
      Useful, stable, cheap . . . choose any zero? =)
      --
      Please consider making an automatic monthly recurring donation to the EFF
    2. Re:Stellent Outside In by divbyzero · · Score: 2

      I didn't mean to be too negative; it is quite useful, in spite of certain shortcomings.

      --
      But my grandest creation, as history will tell,
      Was Firefrorefiddle, the Fiend of the Fell.
  12. Word & Lynx by wdr1 · · Score: 3, Interesting

    Use MS Word to save it as HTML, then run it through lynx -dump to save it as text.

    You may, however, want to give strong consideration to another poster's recommendation of using PDF. (Particularly since you care about formatting.)

    -Bill

    --
    SlashSig Karma: Excellent (mostly affected by moderatio
  13. Re:have you tried ... HTML Tidy by Louis_Wu · · Score: 4, Insightful

    HTML Tidy has a 'clean up Word HTML' mode which works wonders. Dave Raggett developed it for W3C.

  14. Re:Why? by Unknown+Relic · · Score: 4, Informative

    I wasn't going to get into the details of specifically why this was needed, but since a large number of the posts have been asking the question, I will explain. An AC actually indirectly hit the nail on the head, though he was accusing me of fraud in the process: "If you are seriously interested in archiving, you should save off the files in a non-updatable format" For legal reasons, our company is required to maintain a permanent archive of some specific documents. In this case, permanent does not mean ten or even twenty years, but one hundred. This being the case, the archiving is actually being done to an analogue medium. If you are curious, we are using a type of microfilm which is specifically designed to be readable for up to 500 years. Microfilm also has the added benefit of being relatively unalterable once developed, unlike a digital file. The need for doc to text conversion comes from the hardware which performs the actual archiving. It only accepts text files or specially formatted TIFF images as input. The text files are not the actual format in which the documents will be archived, but they are a necessary intermediate step in the archiving process.

  15. print to postscript by wfrp01 · · Score: 3, Insightful

    use any suitable postscript printer driver, and print to a file.

    you don't need to buy any proprietary software. you will retain all formatting. you can view using free software on virtually any platform you like.

    this will easily work for any other document type you may have also, so you can have one standard archival format for anything and everything. big CADD file? no problem.

    bonus tip. archive or no archive, make sure your documents include text (in the footer, say) that indicates the location and filename of the document. so when you want to work backwards from paper to file, you know where to look. the print date is important, too.

    i don't know why i'm not using caps today...

    --

    --Lawrence Lessig for Congress!
  16. Setup a text printer by mhesseltine · · Score: 5, Informative
    1. Add a Generic/Text only printer attached to a file port.
    2. Open file in Word
    3. Select print
    4. Select a file location and name
    5. Enjoy your new text file.

    For a variation on this, pick a good color Postscript printer, do the same print to file, then use Ghostscript to convert the PS to PDF

    --
    Overrated / Underrated : Moderation :: Anonymous Coward : Posting
    1. Re:Setup a text printer by jsse · · Score: 2, Informative

      That's not exactly what the poster is asking for, because he doesn't want to click through the menus for every document. However, it's the best solution, because the other converters suggested above could not transform the document so perfectly, as the transformation is part of Word itself.

      Maybe he could write a macro to automate the process?

    2. Re:Setup a text printer by mhesseltine · · Score: 2

      From jsse:

      That's not exactly what the poster is asking for, because he doesn't want to click through the menus for every document. However, it's the best solution, because the other converters suggested above could not transform the document so perfectly, as the transformation is part of Word itself. Maybe he could write a macro to automate the process?

      I couldn't agree more. I'd hate to open each file, select print, and enter a filename. However, I also agree that VBA is probably geared toward this exact problem.

      --
      Overrated / Underrated : Moderation :: Anonymous Coward : Posting
  17. On the Mac... by singularity · · Score: 2

    Mariner Software released DropDoc, which is based on the GPL'ed wvWare libraries. It converts Word documents to .rtf, which maintains most of the basic formatting of a Word document.

    I have used it and it works fairly well.

    --
    - (c) 2018 Hank Zimmerman
  18. Re:Dude. by MaggieL · · Score: 3, Funny

    You are seriously asking the impossible...Nothing besides Word can do that...The other solution is to have a deep think about why you are abandoning Microsoft Word in the first place.

    Mod parent up as funny, dude. Best MCSE parody I've seen in weeks.

    Seriously, dude.

    --
    -=Maggie Leber=-
  19. Re:Archiving? Are you sure? by shaitand · · Score: 3, Insightful

    How about compression? Generally when archiving... you're compressing, and plain text gets a far, far better compression ratio than a PDF (not to mention it's smaller to begin with).

  20. Stop with the other formats!!! by shaitand · · Score: 2

    ^ | Up there somewhere the original poster explained why he needs the conversion. He is using proprietary hardware that requires text (or a TIFF) as an intermediate format, which it then puts on microfilm. Enough with the "PDF, RTF, etc. would be a better solution" posts. Text and TIFF are the only options for this guy.

  21. Re:have you tried ... HTML Tidy by HRbnjR · · Score: 2

    These are both good ideas I was going to suggest. The last piece of this puzzle which I was also going to suggest - if you really need plain text, the text browser "lynx" can be scripted to spit out html reformatted as plain text. Export word->html, run through tidy, and html2text with lynx, should yield decent results.
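
    The tidy-then-lynx half of that chain could be scripted along these lines. This is a sketch, not a tested pipeline: it assumes tidy and lynx are on the PATH, the Word-to-HTML export has already been done, and the function names are invented. The flags used (tidy's word-2000 and clean options, lynx's -dump, -nolist and -force_html) are standard ones, but check them against your installed versions.

```python
import subprocess
from pathlib import Path


def tidy_cmd(src: Path, dst: Path) -> list[str]:
    """Command line for cleaning up Word's HTML export with HTML Tidy."""
    return ["tidy", "-q", "--word-2000", "yes", "--clean", "yes",
            "-o", str(dst), str(src)]


def lynx_cmd(src: Path) -> list[str]:
    """Command line for rendering HTML as plain text with lynx."""
    return ["lynx", "-dump", "-nolist", "-force_html", str(src)]


def html_to_text(src: Path, out: Path) -> None:
    """Run tidy, then lynx, and write the rendered text to out."""
    cleaned = src.with_suffix(".clean.html")
    result = subprocess.run(tidy_cmd(src, cleaned))
    # Tidy exits with 1 for warnings only; 2 means real errors.
    if result.returncode > 1:
        raise RuntimeError(f"tidy failed on {src}")
    rendered = subprocess.run(lynx_cmd(cleaned), capture_output=True,
                              text=True, check=True)
    out.write_text(rendered.stdout)
```

    Calling html_to_text(Path("report.html"), Path("report.txt")) would leave report.clean.html behind as an intermediate file.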

    Good Luck :)

  22. OpenOffice by aminorex · · Score: 3, Insightful

    I've had very good luck using OpenOffice's Save As.

    There are a lot of options, however. I wonder what
    Google uses?

    One option would be saving as postscript and using
    a postscript-to-text extractor.

    But the best thing would be to relax the POT requirement
    to allow XML. XML is so trivial to parse that
    even if XML itself falls from grace with technological
    advance, the few simple tags you need will certainly
    be supportable with 10-15 minutes of scripting
    (if the computers aren't smart enough to respond
    to your voice command and retrieve the old XML
    specs and interpret them well enough to perform
    the transformation automatically at that future point
    -- which seems unlikely, given that you use only
    the most simple and basic of XML idioms.)

    --
    -I like my women like I like my tea: green-
  23. The choices are obvious by mnmn · · Score: 2


    I recently discovered Lotus Notes Word format (.SAM) is text based just like XML. You could use this, or better go for StarOffice/OpenOffice native formats or SGML.

    --
    "Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
  24. Re:Why? by HughsOnFirst · · Score: 2

    What I would do is use word to "print" the files using a postscript printer driver . Use the "save to file option" in the printer driver. Then convert the .ps files to tiffs . Details left as an "exercise for the reader"
    Word scripts and macros also left as an "exercise for the reader"

  25. Re:Why? by smoon · · Score: 2

    If a TIFF file will work, then why not some kind of 'print to fax' setup?

    Basic idea is:
    Word: print->fax program
    fax program saves as TIFF file (perhaps choosing 'fine' resolution etc.)

    Same idea with a different implementation would be to print to a 'postscript' printer, then use a postscript to tiff conversion program. This could be fairly well automated with a print queue on a *nix box accepting the postscript 'print' jobs, then simply archiving the resulting tiff files off someplace handy.
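
    For the PostScript-to-TIFF step, Ghostscript is one program that can do it. The sketch below only assumes that gs is installed and that the archiving hardware accepts Group 4 fax-style TIFFs (the tiffg4 device); whether that matches its "specially formatted TIFF" requirement is unknown, and the function names are invented. The %03d in the output name makes Ghostscript write one file per page.

```python
import subprocess
from pathlib import Path


def ps_to_tiff_cmd(ps_file: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """Build a Ghostscript command that writes one TIFF per page."""
    pattern = out_dir / f"{ps_file.stem}-%03d.tif"
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=tiffg4",            # monochrome, CCITT Group 4 compression
        f"-r{dpi}",                   # resolution in dots per inch
        f"-sOutputFile={pattern}",    # %03d expands to the page number
        str(ps_file),
    ]


def convert(ps_file: Path, out_dir: Path) -> None:
    """Create the output directory and run the conversion."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ps_to_tiff_cmd(ps_file, out_dir), check=True)
```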

    --
    "But actually trying to use m4 as a general-purpose langage would be deeply perverse" --ESR
  26. Re:have you tried ... HTML Tidy by __past__ · · Score: 2

    Maybe w3m would be a better choice, it also deals with tables (and even frames, but that doesn't seem to matter here).

  27. Re:Why? by biglig2 · · Score: 2

    Sounds good. One possible snag is if he cannot use multi-page TIFFs, although doubtless there exists some little open source snippet to convert a multi-page TIFF into lots of single pages.

    --
    ~~~~~ BigLig2? You mean there's another one of me?
  28. Re:OpenOffice is your friend by forsetti · · Score: 2

    How? Does OpenOffice have a command line interface?

    --
    10b||~10b -- aah, what a question!
  29. Re:Why? by jafuser · · Score: 2
    Even better:

    There's an inexpensive shareware program called "FinePrint" which works as a fake printer driver between your applications and your actual printer. This program was originally created to save you paper, by printing up to eight pages on one side of one sheet, but over time it has gained a lot of nice features.

    One of the features that I've used for my online transaction receipts is the option to save the print job(s) to a file or a bunch of files. It will save in the following formats:

    • fp (FinePrint)
    • bmp
    • emf
    • jpg
    • tif
    • txt

    When I save as a TIF, I am given the option of monochrome, 4-bit, 8-bit, or 24-bit, and a resolution of 72-1200 or "custom". There is a checkbox for "Create a separate file for each page", so that solves the multi-page problem as well.

    I really like this program. I've been a registered user since nearly the beginning, and I've used nearly all of its features at one time or another. And I don't get anything for mentioning it either, so =P

    FinePrint

    --
    Please consider making an automatic monthly recurring donation to the EFF
  30. Re:Why? by jafuser · · Score: 2
    There's an inexpensive shareware program called "FinePrint" which works as a fake printer driver between your applications and your actual printer.
    I meant to have the word "driver" at the end of this sentence.
    --
    Please consider making an automatic monthly recurring donation to the EFF
  31. Re: The Macro to help out the printing by phamlen · · Score: 2

    The above solution worked at my company - except we needed to convert them to PDF so we sent them to a PDF-converting printer.

    The trick is to set up a macro to print all the documents in a folder/whatever. The easiest way is the following:

    1) Create a new word document. You have to store the Macro somewhere, and a word document is simplest. Call the document "ConvertDocumentsToText".

    2) Add some instructions in the actual text of the document (Notice how you can combine the documentation with the code! ) Generally, the instructions include:
    How do you start the macro
    Where do the files need to be?

    3) Start writing the macro. (Alas! I don't have my Word macro handy, so the best I can do is pseudo-code.):

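    ' Sketch only. Needs a reference to "Microsoft Scripting Runtime"
    ' (Tools > References) for FileSystemObject, Folder and File.
    Sub ConvertDocumentsToText()
        Dim ofsFileSys As New FileSystemObject
        Dim theFolder As Folder
        Dim theFile As File
        Dim theDocument As Word.Document

        Set theFolder = ofsFileSys.GetFolder("c:\mypath")
        For Each theFile In theFolder.Files
            ' Suppress the conversion dialog and open read-only.
            Set theDocument = Documents.Open(FileName:=theFile.Path, _
                ConfirmConversions:=False, ReadOnly:=True, AddToRecentFiles:=False)
            ' Foreground printing, so documents do not pile up.
            theDocument.PrintOut Background:=False
            theDocument.Close SaveChanges:=wdDoNotSaveChanges
        Next
    End Sub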

    4) That's pretty much it.

    5) Be careful with TWO issues:
    Don't use background printing - use foreground printing. Otherwise the print function kicks off a background thread and you'll get thousands of documents open and the machine will grind to a halt.

    Also, be careful with the parameters to the open command. We forced the open even if there were conversion questions (otherwise you get a dialog box that you have to click ).

    -Peter