Slashdot Mirror


Sanely Moving from Word to the Web?

FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"

13 of 547 comments

  1. Scrapping by fembots · · Score: 5, Interesting

    Interestingly, I have a similar job on a website (no link for you either, Slashdot hordes!). Here's what I do (I'm sure there are smarter ways):

    1. Place all "to-process" documents in a specific folder in a webserver
    2. Write a script to read those documents
    3. Use regexes (and similar functions) to strip or replace specific tags and wording (much like web scraping).

    Admittedly, it was a tedious job at first to identify every possible template. However, I'm amazed how predictable some documents are, and once you get hold of such a "blueprint", you can reformat documents to HTML/XML fairly easily.

    Once the changes are done, I preview them in a browser. If everything looks as expected, I simply save the page and use it; if not, it's easy enough to make a few tweaks in the familiar HTML environment.
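
    The steps above can be sketched in Python roughly as follows. This is a minimal sketch, assuming the documents have already been saved as HTML from Word; the rule list, the names `scrub` and `scrub_folder`, and the sample markup are illustrative assumptions, not fembots' actual script.

```python
import re
from pathlib import Path

# Illustrative rules only: each entry is (compiled pattern, replacement).
RULES = [
    # Word's <o:p> namespace tags
    (re.compile(r"</?o:p[^>]*>", re.I), ""),
    # purely presentational wrappers
    (re.compile(r"</?(?:font|span)\b[^>]*>", re.I), ""),
    # inline formatting attributes
    (re.compile(r"\s+(?:class|style|lang)=\"[^\"]*\"", re.I), ""),
]

def scrub(html: str) -> str:
    """Apply every rule in order and return the cleaned markup."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html

def scrub_folder(folder: str) -> dict:
    """Scrub every .html file in the 'to-process' folder, keyed by file name."""
    return {
        path.name: scrub(path.read_text(encoding="utf-8", errors="replace"))
        for path in sorted(Path(folder).glob("*.html"))
    }

sample = ('<p class="MsoNormal" style="margin:0">'
          '<span lang="EN-US">Hello<o:p></o:p></span></p>')
print(scrub(sample))
```

    Regexes are brittle against arbitrary HTML, so a rule list like this only works because the input templates are predictable; the browser preview step is what catches the cases the rules miss.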

    1. Re:Scrapping by Daengbo · · Score: 2, Interesting

      Or, you could convert all those Word docs to Writer and use Ant like this guy did to xsl transform the xml into a website. I discovered this website because I'm starting to write a CMS in PHP5 which automatically adds content from OO.o documents.

      Alternately, you could use Writer2Latex to generate XHTML 1.0 strict for yourself.

      Those two methods seem the easiest.

  2. Sounds like you should release on sourceforge by arete · · Score: 5, Interesting

    So if there are only a few templates and they were a pain to work out, how about releasing your regex scripts on SourceForge or similar? Or posting them here?

    --
    Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
  3. Get it in PDF first. by frostman · · Score: 2, Interesting

    I'm assuming you have the right to republish the Word documents. I'm also assuming you have no control over how many Word-specific formatting features are used by the authors.

    What I would do in your shoes is set up a (mostly) automated system to convert the Word files to PDF. You can buy Acrobat or you can go with a third-party, printer-driver-style converter, but in the end you'll probably save more headaches just using Acrobat.

    Once you have a document in PDF, you can use any of the numerous (free and commercial) tools to convert that to HTML, text, whatever - all much more reliably than from Word directly. It's not perfect, but it's probably the closest you'll get.

    Plus, you can post the PDFs themselves for download in case someone wants them - and at least Google will still happily index your PDFs.

    Yes, you'll probably have to live with some NT variant to get that part done (though it might work with OSX) - but it's most likely your fastest path to *quality* conversions.

    --

    This Like That - fun with words!

  4. Resign from your executive position by Fastball · · Score: 2, Interesting

    What is it with executives and directors and their fixation with sending simple memos and messages via Word attachments in e-mails? Everybody else is on board with plain text (except some folks who are smitten with font coloring). Why can't the dolts at the top of the totem pole type in their mail client's editor and hit "Send?"

  5. Re:Change the Model? by danheskett · · Score: 2, Interesting

    That's the best way. Really. You want the data to be in a structured format: semantically structured if possible, but at least structured. Define a bunch of templates, and use a templating system like Smarty or whatever to make it happen. Give your users a simple form (HTML, Windows, Java, whatever) that selects a template and reads in a list of fields from the template. Dynamically generate the form fields to be filled based on the template. Store the data.

    To generate a page, start from the master record, be it in a database, an XML file, or whatever. Load the template and fill in the data from the relational store. If you do it right you can even substitute different rendering layers and get an X/HTML version, a Word version, and a PDF version without any real substantial work.

    This also helps (1) create consistent documents, (2) create documents for more than one target format, (3) create searchable content with rich metadata, and (4) move to a more robust system later without tons of extra work. I've done it before, and if you spend a week engineering the solution properly it'll last years.
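
    A minimal Python sketch of the template idea, using only the standard library rather than Smarty. The `ARTICLE` definition, its field names, and the `render` helper are hypothetical; a real system would load templates and records from storage.

```python
import html
from string import Template

# Hypothetical template definition: a field list plus one layout per output format.
ARTICLE = {
    "fields": ["title", "author", "body"],
    "html": Template('<h1>$title</h1>\n<p class="byline">$author</p>\n<p>$body</p>'),
    "text": Template("$title\nby $author\n\n$body"),
}

def render(template: dict, record: dict, fmt: str) -> str:
    """Fill one stored record into the layout for the requested format."""
    missing = [name for name in template["fields"] if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    values = {name: record[name] for name in template["fields"]}
    if fmt == "html":
        # Escape user-supplied text so it cannot inject markup.
        values = {name: html.escape(value) for name, value in values.items()}
    return template[fmt].substitute(values)

record = {"title": "Results & Discussion", "author": "A. Student", "body": "Plain text only."}
print(render(ARTICLE, record, "html"))
print(render(ARTICLE, record, "text"))
```

    Because the record holds only plain field values, adding another output format means adding one more layout entry, not re-cleaning any documents.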

  6. Re:PDF? by ImaLamer · · Score: 2, Interesting

    "Print to PDF" seems to be the function that would solve all of these problems, but so would any others. Think you *could* print to a TIFF, PDF, virtually any image type with a *nix Word compatible program - then you can scan the image and OCR it to plain text. Antiword (mentioned by another /.er: http://www.winfield.demon.nl/) can convert DOC to plain text... there are thousands of options.

    However, if someone is getting the idea for another open source project to solve this dilemma, then I'd suggest something that can render DOC to HTML on the server side. That would allow those who just know how to set up a webserver to sit back and let the software deal with people's habit of not using standard types. Parse the Word, WordPerfect, OpenOffice, RTF, whatever, and render it in HTML. This would allow anyone in a company to dump the document on the server/share and let it be viewed by anyone else.

    But there are limitless options, like this one found on Google: http://www.doc-api.com/
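
    Once a converter such as Antiword has produced plain text, rebuilding clean markup can be sketched as below. `text_to_html` is a hypothetical helper, and it assumes paragraphs are separated by blank lines, which may not hold for every converter's output.

```python
import html
import re

def text_to_html(text: str) -> str:
    """Wrap blank-line-separated blocks in <p> tags, escaping special characters."""
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse the hard line wraps a text converter typically inserts.
        joined = " ".join(line.strip() for line in block.splitlines())
        if joined:
            paragraphs.append(f"<p>{html.escape(joined)}</p>")
    return "\n".join(paragraphs)

sample = "First paragraph,\nwrapped by the converter.\n\nSecond: a < b & c."
print(text_to_html(sample))
```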

  7. Actually, an NDA probably doesn't matter. by arete · · Score: 2, Interesting

    In all likelihood an NDA doesn't cover obvious works like this - anything that could be reasonably discovered publicly. Doubtless he couldn't post the _documents_ that he converted.

    However, I am also not willing to just assume that no company would ever consider letting someone sourceforge a script like this. It is 1) worth good advertising and 2) clearly not important enough to be worth selling. Release it in the company's name, or not depending on what they prefer.

    At a minimum a lot of small companies would be fine with this - big companies would vary wildly.

  8. Re:no link for you, Slashdot hordes! by FooAtWFU · · Score: 2, Interesting

    Actually, I'm quite all right. At first I was a trifle worried when I saw that my machine's load was a little high and the story relatively new, but then I realized that it was just running pisg to generate channel statistics for #wikipedia. It's a beefy server on a fast line, really; I don't anticipate any issues if I can hide way down in the comments page instead of in the fine summary...

    --
    The World Wide Web is dying. Soon, we shall have only the Internet.
  9. Re:Dreamweaver by Anonymous Coward · · Score: 1, Interesting

    In my experience, you have to run 'clean up Word docs' twice to get most (if not all) of that extra garbage.
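
    One plausible reason a single pass is not enough (an assumption for illustration, not a description of Dreamweaver's internals) is that removing one layer of empty markup exposes another. A minimal Python sketch that repeats a hypothetical cleanup pass until the output stops changing:

```python
import re

# Matches an element with no content, e.g. <span class="x"></span>.
EMPTY_ELEMENT = re.compile(r"<(\w+)\b[^>]*>\s*</\1>")

def clean_once(html: str) -> str:
    """Remove the empty elements visible in this pass."""
    return EMPTY_ELEMENT.sub("", html)

def clean_until_stable(html: str, max_passes: int = 10) -> str:
    """Repeat the pass until nothing changes or the pass limit is reached."""
    for _ in range(max_passes):
        cleaned = clean_once(html)
        if cleaned == html:
            break
        html = cleaned
    return html

sample = "<p><span><b></b></span></p><p>Kept</p>"
print(clean_once(sample))          # one pass still leaves empty wrappers behind
print(clean_until_stable(sample))
```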

  10. Trying this again by einhverfr · · Score: 2, Interesting

    The script (decss.sed) is:

    s:STYLE=\"[ a-zA-Z0-9\:;-]*\"::Ig
    s:</FONT>::Ig
    s:<FONT[ -=\"A-Z0-9]*>::Ig
    s:BORDER=[0-9]*::Ig
    s:ALIGN=BOTTOM::g

    --

    LedgerSMB: Open source Accounting/ERP
  11. Re:Convert to RTF first by tonsofpcs · · Score: 2, Interesting

    Yes, it is, since RTF is a text-based format where all the formatting is open and close tags, much like HTML. Save a Word doc as RTF instead and open it in Notepad, and you will see. There are many ready-made tools to convert from RTF to HTML, but you can build your own easily.

  12. Re:OpenOffice 2.0-beta "save-as" and "export" grea by sonamchauhan · · Score: 2, Interesting

    Just expanding on your suggestion...

    Perhaps he could use the OpenOffice API to automatically have a server-side instance of OpenOffice open submitted Word documents and save them as HTML. This should happen at the same time the user uploads the document - that way the user could preview the conversion to HTML, and if it was flawed, he could choose to publish the document as PDF.

    OpenOffice API:
        http://api.openoffice.org/

    This code snippet shows how simple it is to convert an OpenOffice Writer SXW document into PDF:
        http://codesnippets.services.openoffice.org/Writer/Writer.StoreWriterAsPDF.snip
    Perhaps a few small changes here would get him what he wants.

    Perl interface (ooolib):
        http://ooolib.sourceforge.net/doc/ooolib-0.1.5-doc.html#info
    There are also Java code snippets. I think it would be possible to convert the OOBasic snippet above to either Java or Perl.
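
    A minimal Python sketch of the convert-on-upload idea. It does not use the OpenOffice API described above; instead it assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless --convert-to`, which older OpenOffice releases may not. The function names and the timeout are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

def build_convert_command(doc_path: str, out_dir: str, binary: str = "soffice") -> list:
    """Build the headless conversion command (assumes LibreOffice-style flags)."""
    return [binary, "--headless", "--convert-to", "html", "--outdir", out_dir, doc_path]

def convert_upload(doc_path: str, out_dir: str) -> Path:
    """Convert one uploaded document and return the expected HTML path."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir), check=True, timeout=120)
    return Path(out_dir) / (Path(doc_path).stem + ".html")

# Show the command that would run for a hypothetical upload.
print(" ".join(build_convert_command("upload.doc", "preview")))
```

    The returned path is what the upload page would show as a preview; if the user rejects it, the same command with `pdf` in place of `html` gives the PDF fallback.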