Slashdot Mirror


Converting Word Files to Text for Archiving?

Unknown Relic asks: "Our company has large quantities of old MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately, we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above-mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"

5 of 81 comments

  1. have you tried by gnixdep · · Score: 4, Insightful

    Save As... HTML? It's in plain text, but retains the formatting, as long as you don't get too exotic.

  2. Re:have you tried ... HTML Tidy by Louis_Wu · · Score: 4, Insightful

    HTML Tidy has a 'clean up Word HTML' mode which works wonders. Dave Raggett developed it for W3C.
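To illustrate the second half of the HTML route (going from cleaned-up HTML to plain text that keeps bullets and indentation), here is a minimal sketch using only Python's standard `html.parser`. It is not part of HTML Tidy; the class name `ListToText` and the helper `html_to_text` are made up for this example, and it assumes the HTML has already been cleaned so that each list item holds plain text without inline markup such as `<b>`.

```python
from html.parser import HTMLParser


class ListToText(HTMLParser):
    """Render paragraphs and nested <ul>/<li> lists as indented plain text.

    Hypothetical helper for illustration; tables and inline markup
    inside list items are not handled.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <ul> nesting level
        self.lines = []     # finished output lines
        self.prefix = ""    # bullet prefix waiting for the next text chunk

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
        elif tag == "li":
            # Two spaces of indentation per nesting level, then a bullet.
            self.prefix = "  " * (self.depth - 1) + "* "

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())   # collapse whitespace and newlines
        if text:
            self.lines.append(self.prefix + text)
            self.prefix = ""


def html_to_text(html):
    parser = ListToText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


print(html_to_text(
    "<p>Intro</p><ul><li>one<ul><li>nested</li></ul></li><li>two</li></ul>"
))
```

Numbered lists and table layout would need extra handlers along the same lines.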

  3. print to postscript by wfrp01 · · Score: 3, Insightful

    use any suitable postscript printer driver, and print to a file.

    you don't need to buy any proprietary software. you will retain all formatting. you can view using free software on virtually any platform you like.

    this will easily work for any other document type you may have also, so you can have one standard archival format for anything and everything. big CADD file? no problem.

    bonus tip. archive or no archive, make sure your documents include text (in the footer, say) that indicates the location and filename of the document. so when you want to work backwards from paper to file, you know where to look. the print date is important, too.

    i don't know why i'm not using caps today...

    --

    --Lawrence Lessig for Congress!

  4. Re:Archiving? Are you sure? by shaitand · · Score: 3, Insightful

    How about compression? Generally, when you're archiving, you're compressing, and plain text gets a far, far better compression ratio than a PDF (not to mention that it's smaller to begin with).
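The compression claim is easy to check against your own files. The sketch below measures a compression ratio with Python's standard `zlib` module; the function name `compression_ratio` and the sample text are assumptions made for this example, and it does not by itself demonstrate anything about PDF, since you would need to feed it real documents in both formats to compare.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better).

    Assumes data is non-empty.
    """
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical sample standing in for an archived plain-text document.
sample = ("Section 1. Overview of the archiving process.\n" * 200).encode("utf-8")
print(f"ratio: {compression_ratio(sample):.3f}")
```

To compare formats, read the same document saved as text and as PDF with `open(path, "rb").read()` and pass each to the function.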

  5. OpenOffice by aminorex · · Score: 3, Insightful

    I've had very good luck using OpenOffice's Save As.

    There are a lot of options, however. I wonder what
    Google uses?

    One option would be saving as postscript and using
    a postscript-to-text extractor.

    But the best thing would be to relax the POT
    requirement to allow XML. XML is so trivial to parse
    that even if XML itself falls from grace with
    technological advance, the few simple tags you need
    will certainly be supportable with 10-15 minutes of
    scripting (if the computers aren't smart enough to
    respond to your voice command, retrieve the old XML
    specs, and interpret them well enough to perform
    the transformation automatically at that future point
    -- which seems unlikely, given that you use only
    the simplest and most basic of XML idioms).
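As a rough idea of what that "10-15 minutes of scripting" might look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The tag names `doc`, `para`, `list` and `item` are invented for this example, not taken from any real schema, and tables are left out.

```python
import xml.etree.ElementTree as ET


def render(elem, depth=0):
    """Return plain-text lines for a tree of <para>, <list>, <item> tags.

    The tag vocabulary is hypothetical; unknown tags are ignored.
    """
    lines = []
    for child in elem:
        text = " ".join((child.text or "").split())
        if child.tag == "para":
            lines.append("  " * depth + text)
        elif child.tag == "item":
            lines.append("  " * depth + "* " + text)
            lines.extend(render(child, depth + 1))  # nested lists, if any
        elif child.tag == "list":
            lines.extend(render(child, depth))
    return lines


def xml_to_text(xml_source):
    return "\n".join(render(ET.fromstring(xml_source)))


print(xml_to_text(
    "<doc><para>Intro</para>"
    "<list><item>one<list><item>nested</item></list></item>"
    "<item>two</item></list></doc>"
))
```

The point is only that a handful of simple tags can be turned back into bullets and indentation with very little code.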

    --
    -I like my women like I like my tea: green-