Sanely Moving from Word to the Web?
FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"
I would suggest using Macromedia Dreamweaver... it's what we use where I work and essentially all of our content entry involves pasting in content from Word documents supplied by clients. Dreamweaver is pretty good for formatting and working with stylesheets.
Here's a tool I saw linked off of O'Reilly Radar once:
http://textism.com/wordcleaner/
I used it once and it did a pretty decent job at preserving the tables. Yet if they're using anything odd like graphics or it's been incredibly tweaked, it probably won't be 100% perfect.
Check out Commands -> Clean Up Word HTML in Dreamweaver. It does a nice job of getting rid of extraneous tags. While you're at it, take a look at Commands -> Apply Source Formatting as well. This can be customized to your specifications in the preferences section, and it automatically indents, adds newlines, and converts tags to lowercase where appropriate in the HTML document. Dreamweaver is the closest thing I know of to a program that "automatically" cleans up Word HTML.
Good luck!
Save the Word document as filtered HTML and pipe the HTML through HTML Tidy. Nice clean HTML.
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
Hmmm... sounds like a challenge to me. Let's see what we can dig up.
:)
Step 1: Let's look at his user page
Ahh! He put in a website with his profile. Let's all go and check out http://fennec.homedns.org/
Hmm... looks like a personal page. Not too sure what to make of the comic. Anyway, let's move on to..
Step 2: Let's look at his author page. Some interesting stuff here, including three separate e-mail addresses (which I won't post here. You're welcome.)
A-ha! There is a link to his employer! It's Economic History Services. And what do you know... there are a significant number of pages (especially under abstracts and book reviews) that seem to come straight out of a word processor, only with extensive cleaning. A quick look at the source reveals something interesting. It's clean. Very clean. We're talking on the level of I-use-vim-for-my-webpage-editor clean. Nice job.
Anyway, it looks like it was done by hand. I'm not saying it's not good work (quite the contrary), but I can see your need for an automated solution.
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
Almost forgot. The Tidy Docs will tell you to select "--bare" and "--word-2000" and I also recommend "--output-xhtml" and "--indent".
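If you end up doing this for every incoming document, it may be worth wrapping the call in a script. Here's a rough Python sketch, assuming a `tidy` binary is on your PATH and that the four options above are passed in Tidy's `--option value` form. The function names `tidy_command` and `run_tidy` are made up for illustration.

```python
import shutil
import subprocess


def tidy_command(path):
    """Build an argv for HTML Tidy using the four options named above."""
    return [
        "tidy",
        "--bare", "yes",          # strip smart quotes and similar Word-isms
        "--word-2000", "yes",     # strip Word 2000 specific markup
        "--output-xhtml", "yes",  # emit XHTML
        "--indent", "yes",        # indent block-level content
        path,
    ]


def run_tidy(path):
    """Run Tidy on a filtered-HTML file and return the cleaned bytes."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy not found on PATH")
    result = subprocess.run(tidy_command(path), capture_output=True)
    # Tidy exits with 1 for warnings only and 2 for errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.decode("utf-8", errors="replace"))
    return result.stdout
```

Output is returned as bytes on purpose, since Word exports are not always in the encoding you expect.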
Whew, I hoped I wouldn't have to post this 40 comments down in the thread. Yes, Office 2000 has the above tool, and Office 2002 or 2003 has it on the Save As menu. The option you want is "Web Page (filtered)|*.html". I saw an interview once with somebody on the Word development team, and he claimed that the original Save As HTML was built for passing Word documents over the web, and was never meant to be read by human beings as a web page at all. Web Page (filtered) cuts out all the extra shyte that Save As HTML used to put in for managing version-controlled updates and changing the font every bloody character, and builds a real web page.
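Even filtered HTML tends to keep some `class` and `style` attributes and stray `<span>` or `<font>` wrappers, so a second scrubbing pass can help. Below is a minimal sketch using only Python's standard `html.parser`. The lists of tags and attributes to drop are my own guesses at what a site would want gone, not anything Word or Tidy defines, so adjust them to taste.

```python
from html import escape
from html.parser import HTMLParser

# Assumed lists: presentational wrappers and attributes to discard.
DROP_TAGS = {"font", "span", "o:p"}
DROP_ATTRS = {"style", "class", "lang", "align"}


class Scrubber(HTMLParser):
    """Re-emit HTML without presentational tags, attributes, or comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _attrs(self, attrs):
        return "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name not in DROP_ATTRS
        )

    def handle_starttag(self, tag, attrs):
        if tag not in DROP_TAGS:
            self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_TAGS:
            self.out.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(data, quote=False))


def scrub(html_text):
    parser = Scrubber()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Comments (including Word's conditional comments) and the doctype are dropped because the parser's default handlers for them do nothing. The text inside dropped tags is kept; only the wrapper goes away.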
SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
One program I've had luck with is the HTML Tidy program at http://www.w3.org/People/Raggett/tidy/. It seems to clean up code (particularly from Word) quite a bit.
Using a modern version of Word, save the document as WordML (Word's XML format). Use an XSL stylesheet to convert the WordML to FO (formatting objects).
From there, do anything you want, like XHTML or PDF.
Or just go to XHTML from WordML with some stylesheet. XSL is teh cool!
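If you don't want to write a stylesheet, the same idea can be roughed out in a few lines of script. This is a Python sketch using `xml.etree.ElementTree` rather than XSL, and it only pulls paragraph text out of the WordML and throws away all formatting. It assumes the Word 2003 namespace URI shown in the code and the usual `w:p` / `w:t` paragraph and text elements; the function name is made up.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed: the Word 2003 WordML namespace, in ElementTree's {uri} form.
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"


def wordml_to_paragraphs(xml_text):
    """Return one XHTML <p> per non-empty WordML paragraph, text only."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is split across runs; join every w:t inside it.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text.strip():
            paragraphs.append(f"<p>{escape(text, quote=False)}</p>")
    return "\n".join(paragraphs)
```

Headings, tables, and emphasis would each need their own mapping, which is where a real stylesheet starts to earn its keep.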
room101 -- how much can you stand before they break you?
(they always break you eventually)
Demoroniser is, in the words of the author's own man page:
A Perl script which corrects incompatible HTML generated by Microsoft applications.
You can get it from the link in the same page. I must confess that I've not used it myself (don't use Office/Frontpage) but if it does what it says on the tin it should sort you out.
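For a flavour of the kind of fix involved, one common incompatibility in Microsoft output is Windows-1252 "smart" punctuation showing up in pages that claim another character set. The Python sketch below is not Demoroniser's code, just an illustration of that one repair. It assumes the input bytes really are Windows-1252, and the replacement table is my own short list.

```python
# Assumed mapping from typographic punctuation to plain ASCII.
REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
}
TABLE = str.maketrans(REPLACEMENTS)


def plain_punctuation(raw_bytes):
    """Decode Windows-1252 bytes and replace smart punctuation with ASCII."""
    text = raw_bytes.decode("cp1252", errors="replace")
    return text.translate(TABLE)
```

If you would rather keep the typography, swap the ASCII values for the matching HTML entities instead.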
Hmmmmmm..... Deep fried and look like Squirrel.