Slashdot Mirror


Sanely Moving from Word to the Web?

FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"

31 of 547 comments (clear)

  1. Scrapping by fembots · · Score: 5, Interesting

    Interestingly, I have a similar job on a website (no link for you too, Slashdot hordes!), here's what I do (I'm sure there are smarter ways):

    1. Place all "to-process" documents in a specific folder in a webserver
    2. Write a script to read those documents
    3. Use Regex (and similar functions) to strip off and/or replace specific tags/wordings (similar to web scraping techniques).

    Admittedly it was a tedious job at first to identify every possible template, but I'm amazed how predictable some documents are, and once you get hold of such a "blueprint", you can reformat documents to HTML/XML fairly easily.

    Once the changes are done, I then preview them in a browser, and if everything looks as expected, I simply save the page and use it; if not, it's easy enough to make a few tweaks from the familiar HTML environment.
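    The regex step above can be sketched roughly as follows. This is only an illustration, not the poster's actual script: the function name `strip_word_markup` and the particular patterns are assumptions about the kind of clutter a Word export usually carries (comments, Office-namespaced tags, font/span wrappers, class/style/lang attributes).

```python
import re

# Attribute pattern: class=..., style=..., lang=... in quoted or bare form.
ATTR_PATTERN = r"""\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)"""


def strip_word_markup(html: str) -> str:
    """Remove common Word-export clutter while keeping the basic tags."""
    # Drop HTML comments, including Word's conditional comments.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Drop Office-namespaced tags such as <o:p> and </o:p>.
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.IGNORECASE)
    # Drop <font> and <span> wrappers but keep the text inside them.
    html = re.sub(r"</?(?:font|span)\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Strip presentational attributes from whatever tags remain.
    html = re.sub(ATTR_PATTERN, "", html, flags=re.IGNORECASE)
    return html


if __name__ == "__main__":
    messy = (
        '<p class=MsoNormal style="margin:0"><span lang=EN-US>'
        '<font face="Arial">Hello</font></span><o:p></o:p></p>'
    )
    print(strip_word_markup(messy))
```

    Regexes like these are blunt: the attribute pattern would also match look-alike text outside of tags, so previewing the result in a browser, as described above, is still worth doing.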

  2. Dreamweaver by necro2607 · · Score: 5, Informative

    I would suggest using Macromedia Dreamweaver... it's what we use where I work and essentially all of our content entry involves pasting in content from Word documents supplied by clients. Dreamweaver is pretty good for formatting and working with stylesheets.

    1. Re:Dreamweaver by fean · · Score: 5, Informative

      In Dreamweaver, there's a command, "Clean up MS Word HTML". It's made to clean up Word's crappy HTML, and does a pretty nice job of it.

    2. Re:Dreamweaver by drmike0099 · · Score: 3, Funny

      The OP is correct: 1) Open Dreamweaver. 2) Commands > Clean Up Word HTML... 3) Rejoice

    3. Re:Dreamweaver by Anonymous Coward · · Score: 5, Informative

      Also note that you have the ability to cut and paste formatted text from Word into the 'Design View' within Dreamweaver and DW will automatically reformat the incoming text appropriately. In my brief test to make sure I wasn't talking out my a**, I found it even supports Word tables properly.
      If you paste text into the Code view, DW removes the formatting completely and just uses the raw text.

  3. Antiword by alanp · · Score: 3, Informative

    Try antiword, it's got a real decent HTML option.

    --

    Alanp

  4. Sounds like you should release on sourceforge by arete · · Score: 5, Interesting

    So if there's only a few templates and they were a pain to work out, how about releasing your regex scripts to sourceforge or similar? Or posting here?

    --
    Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
    1. Re:Sounds like you should release on sourceforge by Anonymous Coward · · Score: 3, Insightful

      Since they were produced at work, the copyright on them is probably owned by the company and not by him.

      Plus the templates are probably in-house templates and thus would be useless outside of the company.

  5. Textism by NoInfo · · Score: 4, Informative

    Here's a tool I saw linked off of O'Reilly Radar once:

    http://textism.com/wordcleaner/

    I used it once and it did a pretty decent job at preserving the tables. Yet if they're using anything odd like graphics or it's been incredibly tweaked, it probably won't be 100% perfect.

    1. Re:Textism by e**(i+pi)-1 · · Score: 3, Informative

      A standalone Perl script I use daily is demoronizer.

  6. One suggestion by Da+Fokka · · Score: 3, Funny

    You might consider a pack of monkeys and typewriters. They can ultimately reproduce Shakespeare so maybe, maybe they might be able to properly reformat the HTML gibberish Word produces.

    Of course, you could also outsource to India but that's unethical to both the monkeys and the American economy.

    1. Re:One suggestion by Anonymous Coward · · Score: 3, Funny

      It's hard to find qualified monkeys - most of them already have jobs editing /. and cnn.com...

  7. One Word... by ScentCone · · Score: 5, Funny

    ..."Intern"

    --
    Don't disappoint your bird dog. Go to the range.
    1. Re:One Word... by Cerdic · · Score: 5, Funny

      No, no, no...

      Usually they are from academic users

      It sounds like this might be a university environment. The correct answer should be grad students .

      --
      Advice for my fellow geeks: before seeking out that threesome you dream of, you might see what a TWOsome is like first.
  8. Dreamweaver by SlashChick · · Score: 4, Informative

    Check out Commands -> Clean Up Word HTML in Dreamweaver. It does a nice job of getting rid of extraneous tags. While you're at it, take a look at Commands -> Apply Source Formatting as well. This can be customized to your specifications in the preferences section, and automatically tabs out, adds newlines, and converts tags to lowercase where appropriate in the HTML document. Dreamweaver is the closest thing I know of to a program that "automatically" cleans up Word HTML.

    Good luck!

  9. HTML Tidy by N8F8 · · Score: 5, Informative

    Save the Word document as filtered HTML and pipe the HTML through HTML Tidy. Nice clean HTML.

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
  10. no link for you, Slashdot hordes! by SeanTobin · · Score: 5, Informative

    Hmmm... sounds like a challenge to me. Let's see what we can dig up.

    Step 1: Let's look at his user page

    Ahh! He put in a website with his profile. Let's all go and check out http://fennec.homedns.org/

    Hmm... looks like a personal page. Not too sure what to make of the comic. Anyway, let's move on to..

    Step 2: Let's look at his author page. Some interesting stuff here, including three separate e-mail addresses (which I won't post here. You're welcome :)

    A-ha! There is a link to his employer! It's Economic History Services. And what do you know... there are a significant number of pages (especially under abstracts and book reviews) that seem to come straight out of a word processor, only with extensive cleaning. A quick look at the source reveals something interesting. It's clean. Very clean. We're talking on the level of I-use-vim-for-my-webpage-editor clean. Nice job.

    Anyway, it looks like it was done by hand. I'm not saying it's not good work (quite to the contrary), but I can see your need for an automated solution.

    --
    Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
    1. Re:no link for you, Slashdot hordes! by Dunbal · · Score: 5, Funny

      Which only goes to show:

            There is NO WAY the slashdot effect can be avoided. Resistance is futile...

      --
      Seven puppies were harmed during the making of this post.
    2. Re:no link for you, Slashdot hordes! by FooAtWFU · · Score: 3, Informative

      My SSH connection to my server still lives; I think my task was accomplished well enough. :)

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    3. Re:no link for you, Slashdot hordes! by jalefkowit · · Score: 4, Funny
      This just seems like a perfect opportunity for the application of a little common sense along with just a hint of courtesy.

      You must be new here.

  11. Tidy Flags by N8F8 · · Score: 5, Informative

    Almost forgot. The Tidy Docs will tell you to select "--bare" and "--word-2000" and I also recommend "--output-xhtml" and "--indent".
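    For batch use, the flags above could be wrapped in a small script. This is a sketch under assumptions: it expects a `tidy` executable on the PATH, spells each boolean option out with an explicit `yes` value, and the helper names `tidy_command` and `run_tidy` are made up for illustration.

```python
import subprocess

# The four options recommended above, each given an explicit "yes" value.
TIDY_OPTIONS = [
    "--bare", "yes",
    "--word-2000", "yes",
    "--output-xhtml", "yes",
    "--indent", "yes",
]


def tidy_command(path: str) -> list[str]:
    """Build the argument list for cleaning one filtered-HTML file."""
    return ["tidy", *TIDY_OPTIONS, path]


def run_tidy(path: str) -> str:
    """Run tidy on a file and return the cleaned markup from stdout."""
    result = subprocess.run(tidy_command(path), capture_output=True, text=True)
    # tidy exits with 1 when it only has warnings and 2 on real errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout


if __name__ == "__main__":
    print(" ".join(tidy_command("article.html")))
```

    Here `article.html` is a placeholder for a document saved from Word as filtered HTML.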

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
  12. Re:HTML Export by Marxist+Hacker+42 · · Score: 4, Informative

    Whew, I hoped I wouldn't have to post this 40 comments down in the thread. Yes, Office 2000 has the above tool, and Office 2002 or 2003 has it on the Save As menu. The option you want is "Web Page (filtered)|*.html". I saw an interview once with somebody on the Word development team, and he claimed that the original Save As HTML was built for passing Word documents over the web, and was never meant to be read by human beings as a web page at all. Web Page (filtered) cuts out all the extra shyte that Save As HTML used to put in for managing version-controlled updates and changing the font every bloody character, and builds a real web page.

    --
    SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
  13. HTML Tidy program by Todd+Knarr · · Score: 4, Informative

    One program I've had luck with is the HTML Tidy program at http://www.w3.org/People/Raggett/tidy/. It seems to clean up code (particularly from Word) quite a bit.

  14. Re:Resign from your executive position by dougmc · · Score: 4, Insightful
    Everybody else is on board with plain text
    I don't know where you live/work, but out here in the real world, not everybody is on board with plain text. Not anymore.

    I use mutt and fetchmail in a company of Exchange users. Almost every email I get at work now, from everybody, is in html. (Unless I sent it to myself.) I don't like it, but I deal with it. It's certainly easier to deal with it than to try and change everybody else.

    I could change jobs, but over something as trivial as html emails? No. I like my job, I like the people I work with, so I just bend like the reed in the wind ...

    Still, the executives are certainly worse about email etiquette than most, and it's not just in this company -- everywhere I've worked I've found this to be the case. They don't include Subjects at all, or include useless ones like `message'. Some will type up a memo and send it as a .pdf file attachment, or worse as a .bmp file. They rarely trim anything when responding to a post -- they just top post away. (But many people do that ...)

  15. WordML - FO - XHTML/PDF by room101 · · Score: 4, Informative

    Using a modern version of Word, output in WordML (XML format). Use an XSL stylesheet to convert the WordML to FO (formatting objects).

    From there, do anything you want, like XHTML or PDF.

    Or just go to XHTML from WordML with some stylesheet. XSL is teh cool!

    --
    room101 -- how much can you stand before they break you?
    (they always break you eventually)
  16. Net-It is your magical tool by netringer · · Score: 3, Informative

    Net-It Central is the magical tool you were looking for. With that you can just point it at the file share with the Word Documents (and Excel and Power Point...) on it and see them indexed and cross linked on web pages. It'll update the content as the source docs change.

    Oh, you mean non-commercial magical tools?

    --
    Ever dream you could fly? Get up from the Flight Sim. I Fly
  17. Amen by Quadraginta · · Score: 5, Funny

    Jesus, tell me about it. I get 30kb attachments merely saying "Got your email, thanks!" with "thanks" done up in some odd curly red font and a six-line sig, not to mention the twenty-seven 8x10 colored glossy JPG attachments with circles and arrows and a paragraph on the back of each one...

  18. Try this.... by mormop · · Score: 4, Informative

    Demoroniser is, in the author's own man pages words:

    A Perl script which corrects incompatible HTML generated by Microsoft applications.

    You can get it from the link in the same page. I must confess that I've not used it myself (don't use Office/Frontpage) but if it does what it says on the tin it should sort you out.
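    To give a feel for the class of problem involved, here is a tiny sketch of one such fix: turning Windows-1252 "smart" punctuation into plain ASCII. It is not Demoroniser's code and covers far less; the function name `demoronise_text` and the replacement table are assumptions made for illustration.

```python
# Windows-1252 "smart" punctuation mapped to plain ASCII equivalents.
CP1252_REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
}


def demoronise_text(raw: bytes) -> str:
    """Decode Windows-1252 bytes and replace smart punctuation."""
    text = raw.decode("cp1252", errors="replace")
    return text.translate(str.maketrans(CP1252_REPLACEMENTS))


if __name__ == "__main__":
    print(demoronise_text(b"\x93Hello\x94 \x96 it\x92s\x85"))
```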

    --
    Hmmmmmm..... Deep fried and look like Squirrel.
  19. Re:Resign from your executive position by VGR · · Score: 5, Funny

    You think that's bad?

    I was given 61 screenshots (blithely dubbed "program requirements"), each its own Word document. Each containing only a (weirdly scaled) picture, of course.

    61 Word documents.

    --
    The Internet is full. Go away.
  20. Re:Export it as XML and XSLT it to HTML by Lemuridae · · Score: 3, Informative

    From the AC above:

        1) get a copy of Word 2003
        2) "save as" an exemplar as XML
        3) write an XSLT to render it in a HTML with stylesheets etc as appropriate to your website
        4) for every document you get, "save as" XML with the XSLT from 3) as the transformation.
        5) publish

    I've been wondering how long until using XSLT and XML was suggested. XML is supposed to be a common data transport format but most of the other comments talk about starting with transformations to Word HTML. This is wrong because it assumes that the Word to HTML conversion will produce usable HTML in the first place, which is a bad assumption.

    The solution suggested by the AC could be combined into a program that drives the entire process using the Word COM API to save to XML and then, for example, the MS Jet XSLT COM object model to automate the XML conversion. This could easily be maintained (e.g., for new Word formatting not previously encountered) with small changes to the XSLT.

    If the desire is to completely control the output without having control of the input then this is the best way to go. Yes, it's a bit of work but once you have a maintainable turn-key system you will save a lot of futzing with manual formatting. Use the power of XSLT.
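    The idea of owning the output can be illustrated without XSLT at all. The sketch below swaps the stylesheet for a plain Python walk over the XML, assuming the Word 2003 WordML namespace and its `w:p` (paragraph) and `w:t` (text run) elements; the function name `wordml_to_html` is hypothetical, and all formatting other than paragraph breaks is simply discarded.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace assumed for Word 2003 "save as XML" (WordML) documents.
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"


def wordml_to_html(xml_text: str) -> str:
    """Emit one <p> per non-empty WordML paragraph, ignoring all styling."""
    root = ET.fromstring(xml_text)
    parts = []
    for para in root.iter(W + "p"):
        # Join the text of every run inside this paragraph.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        if text.strip():
            parts.append("<p>" + escape(text) + "</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = (
        '<w:wordDocument xmlns:w='
        '"http://schemas.microsoft.com/office/word/2003/wordml">'
        "<w:body><w:p><w:r><w:t>Tom &amp; Jerry</w:t></w:r></w:p>"
        "<w:p/></w:body></w:wordDocument>"
    )
    print(wordml_to_html(sample))
```

    A real stylesheet would also map headings, lists and tables; the point is only that every tag in the output is one you chose, rather than one Word chose for you.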

  21. Re:Actually, an NDA probably doesn't matter. by mrchaotica · · Score: 5, Funny
    Doubtless he couldn't post the _documents_ that he converted.
    You realize he was converting them for the purpose of putting them on a website, right? ; )
    --

    "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz