Sanely Moving from Word to the Web?
FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"
Interestingly, I have a similar job on a website (no link for you either, Slashdot hordes!). Here's what I do (I'm sure there are smarter ways):
1. Place all "to-process" documents in a specific folder in a webserver
2. Write a script to read those documents
3. Use regexes (and similar functions) to strip off and/or replace specific tags/wordings (similar to web scraping techniques).
Admittedly it was a tedious job at first to identify every possible template, however I'm amazed how predictable some documents are and once you get hold of such "blueprint", you can reformat documents to HTML/XML fairly easily.
Once the changes are done, I preview them in a browser. If everything looks as expected, I simply save the page and use it; if not, it's easy enough to make a few tweaks from the familiar HTML environment.
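For the curious, here is a minimal sketch of that workflow in Python. It is not my actual script: it assumes the documents have already been saved as HTML, and the ".clean.html" suffix and the two rules are made up for illustration. A real rule list depends entirely on what your documents look like.

```python
import re
from pathlib import Path

# Hypothetical "blueprint": (pattern, replacement) pairs applied in order.
RULES = [
    # Word tags ordinary paragraphs with class=MsoNormal; drop the attributes.
    (re.compile(r'<p\s+class="?MsoNormal"?[^>]*>', re.IGNORECASE), "<p>"),
    # Collapse runs of non-breaking-space entities used for manual spacing.
    (re.compile(r"(?:&nbsp;\s*){2,}"), " "),
]


def apply_rules(html: str) -> str:
    """Run every rule over the markup, in order."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


def process_folder(folder: str) -> list[Path]:
    """Clean every .html file in `folder`, writing <name>.clean.html beside it."""
    written = []
    for source in sorted(Path(folder).glob("*.html")):
        if source.name.endswith(".clean.html"):
            continue  # skip our own output from an earlier run
        cleaned = apply_rules(source.read_text(encoding="utf-8", errors="replace"))
        target = source.with_name(source.stem + ".clean.html")
        target.write_text(cleaned, encoding="utf-8")
        written.append(target)
    return written
```

Keeping the rules in one table means a newly discovered template quirk is one more line, and the preview step stays the same.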
Rock that crushes, Paper & Scissors that don't matter.
How about Word -> PDF -> HTML?
... and probably a dumb one.
Just a thought
You can either hotlink to the .doc files themselves in a new window, or convert them to .pdf and do the same thing.
Thanks to file sharing, I purchase more CDs
Thanks to the RIAA, I buy them used...
It takes a Word file and spits out plain text. It can also do some more tricks.
I would suggest using Macromedia Dreamweaver... it's what we use where I work and essentially all of our content entry involves pasting in content from Word documents supplied by clients. Dreamweaver is pretty good for formatting and working with stylesheets.
Try antiword, it's got a real decent HTML option.
Alanp
So if there's only a few templates and they were a pain to work out, how about releasing your regex scripts to sourceforge or similar? Or posting here?
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
Here's a tool I saw linked off of O'Reilly Radar once:
http://textism.com/wordcleaner/
I used it once and it did a pretty decent job at preserving the tables. Yet if they're using anything odd like graphics or it's been incredibly tweaked, it probably won't be 100% perfect.
bug.gd: error search engine. Humanity working together to solve all errors.
You might consider a pack of monkeys and typewriters. They can ultimately reproduce Shakespeare, so maybe, just maybe, they might be able to properly reformat the HTML gibberish Word produces.
Of course, you could also outsource to India, but that's unethical to both the monkeys and the American economy.
..."Intern"
Don't disappoint your bird dog. Go to the range.
hello. how are you?
If you're using Office 2000, you can find the HTML filter here:
http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN
I believe this functionality is built into later versions of Word.
Per the site, this produces simpler HTML with Office-specific tags removed. With that done, you could probably use a Perl script, and you might also try writing some Word macros or COM/VBA scripts that clean up the document from within Word.
Check out Commands -> Clean Up Word HTML in Dreamweaver. It does a nice job of getting rid of extraneous tags. While you're at it, take a look at Commands -> Apply Source Formatting as well. This can be customized to your specifications in the preferences section, and it automatically indents, adds newlines, and converts tags to lowercase where appropriate in the HTML document. Dreamweaver is the closest thing I know of to a program that "automatically" cleans up Word HTML.
Good luck!
Simpli - Your source for San Jose dedicated servers and colocation!
Save the Word document as filtered HTML and pipe the HTML through HTML Tidy. Nice clean HTML.
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
Hmmm... sounds like a challenge to me. Let's see what we can dig up.
:)
Step 1: Let's look at his user page
Ahh! He put in a website with his profile. Let's all go and check out http://fennec.homedns.org/
Hmm... looks like a personal page. Not too sure what to make of the comic. Anyway, let's move on to..
Step 2: Let's look at his author page. Some interesting stuff here, including three separate e-mail addresses (which I won't post here. You're welcome).
A-ha! There is a link to his employer! It's Economic History Services. And what do you know... there are a significant number of pages (especially under abstracts and book reviews) that seem to come straight out of a word processor, only with extensive cleaning. A quick look at the source reveals something interesting. It's clean. Very clean. We're talking on the level of I-use-vim-for-my-webpage-editor clean. Nice job.
Anyway, it looks like it was done by hand. I'm not saying it's not good work (quite the contrary), but I can see your need for an automated solution.
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
HTML Tidy has a special mode for cleaning up Word's crappy HTML export. HTML Tidy is a free command-line tool that is also embedded in a lot of popular HTML editors.
HTML Tidy:
http://tidy.sourceforge.net/
HTML Kit (great integration with HTML Tidy; it includes HTML Tidy so you can just grab HTML Kit without grabbing HTML Tidy)
http://www.chami.com/html-kit/
Countless other editors integrate with HTML Tidy as well. Have fun and good luck!
OtakuBooty.com: Smart, funny, sexy nerds.
I'm fine thank you
Almost forgot. The Tidy Docs will tell you to select "--bare" and "--word-2000" and I also recommend "--output-xhtml" and "--indent".
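If you want to script it, something like the following Python wrapper might do. It is only a sketch: it assumes the `tidy` binary is on your PATH, it spells the options in Tidy's `--name value` configuration form, and the function names are my own.

```python
import subprocess


def build_tidy_command(input_path: str, output_path: str) -> list[str]:
    """Build a tidy command line for HTML that was saved from Word."""
    return [
        "tidy",
        "-quiet",                  # suppress the informational chatter
        "--bare", "yes",           # strip smart quotes, em dashes and similar
        "--word-2000", "yes",      # drop the extra markup Word 2000 inserts
        "--output-xhtml", "yes",   # emit XHTML
        "--indent", "yes",         # indent block-level elements
        "-output", output_path,
        input_path,
    ]


def run_tidy(input_path: str, output_path: str) -> int:
    """Run tidy and return its exit status.

    Tidy exits with 1 when it only issued warnings and 2 when it hit errors,
    so a nonzero status is not raised as an exception here.
    """
    result = subprocess.run(build_tidy_command(input_path, output_path), check=False)
    return result.returncode
```

Typical use would be `run_tidy("review.html", "review.clean.html")`, then checking that the status is below 2 before publishing.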
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
I'm assuming you have the right to republish the Word documents. I'm also assuming you have no control over how many Word-specific formatting features are used by the authors.
What I would do in your shoes is set up a (mostly) automated system to convert the Word files to PDF. You can buy Acrobat or you can go with a third-party, printer-driver-style converter, but in the end you'll probably save more headaches just using Acrobat.
Once you have a document in PDF, you can use any of the numerous (free and commercial) tools to convert that to HTML, text, whatever - all much more reliably than from Word directly. It's not perfect, but it's probably the closest you'll get.
Plus, you can post the PDFs themselves for download in case someone wants them - and at least Google will still happily index your PDFs.
Yes, you'll probably have to live with some NT variant to get that part done (though it might work with OSX) - but it's most likely your fastest path to *quality* conversions.
This Like That - fun with words!
What is it with executives and directors and their fixation with sending simple memos and messages via Word attachments in e-mails? Everybody else is on board with plain text (except some folks who are smitten with font coloring). Why can't the dolts at the top of the totem pole type in their mail client's editor and hit "Send?"
That's the best way. Really. You want the data to be in a structured format. Semantically structured if possible, but at least structured. Define a bunch of templates. Use a templating system like smarty or whatever to make it happen. Give your users a simple form - HTML, Windows, Java, whatever, that selects a template and reads in a list of fields from the template. Dynamically generate the form fields to be filled based on the template. Store the data. To generate a page start from the master record - be it in a database, an xml file, or whatever. Load the template and fill the data from the relational store. If you do it right you can even substitute different rendering layers and get an X/HTML version, a Word version, and a PDF version without any real substantial work. This also helps (1) create consistent documents, (2) create documents for more than one target format, (3) create searchable content with rich meta-data and (4) move to a more robust system later without tons of extra work. I've done it before, and if you spend a week engineering the solution properly it'll last years.
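To make that concrete, here is a toy version in Python using the standard library's `string.Template` in place of Smarty. The template name, its fields, and the markup are all invented for the example; a real system would load them from wherever you keep your master records.

```python
import html
from string import Template

# Hypothetical template registry: each entry lists the fields the input form
# must collect and the markup those fields are poured into.
TEMPLATES = {
    "book_review": {
        "fields": ["title", "reviewer", "body"],
        "html": Template(
            '<article class="book-review">\n'
            "<h1>$title</h1>\n"
            '<p class="byline">Reviewed by $reviewer</p>\n'
            "<div>$body</div>\n"
            "</article>"
        ),
    },
}


def form_fields(template_name: str) -> list[str]:
    """Fields a data-entry form should show for this template."""
    return list(TEMPLATES[template_name]["fields"])


def render(template_name: str, record: dict) -> str:
    """Fill the template from a stored record, escaping every value."""
    spec = TEMPLATES[template_name]
    missing = [name for name in spec["fields"] if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    safe = {name: html.escape(str(record[name])) for name in spec["fields"]}
    return spec["html"].substitute(safe)
```

Swapping in a second renderer (say, one that emits plain text) only means adding another template under a different key; the stored record does not change.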
I'm fine too.
I'm glad we have these little discussions. It makes my day so much more interesting.
Let's do lunch.
fckeditor is an in-browser WYSIWYG editor. It has a "Paste from MS Word" button that actually strips out a lot of the unnecessary baggage. I don't know how well it handles embedded images or tricky layouts, but for the basic stuff it works well.
The interface is similar to Word - maybe if you're lucky, you could get some of your content producers to use it.
TODO: come up with a clever sig
One program I've had luck with is the HTML Tidy program at http://www.w3.org/People/Raggett/tidy/. It seems to clean up code (particularly from Word) quite a bit.
Using a modern version of Word, output in WordML (XML format). Use an XSL stylesheet to convert the WordML to FO (formatting objects).
From there, do anything you want, like XHTML or PDF.
Or just go to XHTML from WordML with some stylesheet. XSL is teh cool!
room101 -- how much can you stand before they break you?
(they always break you eventually)
To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags.
The problem with conversion of documents to HTML in general is the expectation that the formatting needs to be preserved. There have been times where I needed to "post" a document to a web site, and I always try to get the author(s) not to worry about formatting. Formatted documents are pure evil simply because 9 times out of 10 the formatting does not affect the relevant information that you are trying to convey to your audience. Sometimes the authors give me grief about it, but I simply show them the possibilities of separating the content and presentation during the translation. I convert their documents to generic HTML (with whatever tools are available) and use CSS to apply relevant formatting for the type of document (a report, article, thesis, or whatever). No funky font tags or weird tables. Just let the HTML flow as it's meant to.
Coderz 4 Life
Net-It Central is the magical tool you were looking for. With that you can just point it at the file share with the Word Documents (and Excel and Power Point...) on it and see them indexed and cross linked on web pages. It'll update the content as the source docs change.
Oh, you mean non-commercial magical tools?
Ever dream you could fly? Get up from the Flight Sim. I Fly
Jesus, tell me about it. I get 30kb attachments merely saying "Got your email, thanks!" with "thanks" done up in some odd curly red font and a six-line sig, not to mention the twenty-seven 8x10 colored glossy JPG attachments with circles and arrows and a paragraph on the back of each one...
Pagify is a perl script I wrote to do this for another job. It's basically a series of regular expressions that: 1. purges all the proprietary XML gunk from the HTML file you save from Word. 2. chops the file into smaller files wherever a Heading 1 appears 3. attaches endnotes as footnotes to the appropriate pages. It's GPL'd, so go nuts.
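For anyone who wants the flavor without reading Perl, here is a rough Python sketch of the first two steps (purging the gunk and splitting at Heading 1). It is not Pagify itself, it skips the endnote handling, and the patterns are guesses at what Word typically emits rather than an exhaustive list. The split relies on Python 3.7 or later.

```python
import re

# Word's "Save as Web Page" output embeds <xml> data islands and
# conditional comments aimed at Office itself; browsers ignore both.
XML_ISLAND = re.compile(r"<xml>.*?</xml>", re.IGNORECASE | re.DOTALL)
CONDITIONAL = re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL)

# Zero-width match just before every <h1, so the heading stays with its page.
H1_START = re.compile(r"(?=<h1\b)", re.IGNORECASE)


def purge_office_blocks(markup: str) -> str:
    """Remove XML islands and Office-only conditional comments."""
    markup = XML_ISLAND.sub("", markup)
    return CONDITIONAL.sub("", markup)


def split_at_headings(body_html: str) -> list[str]:
    """Split body markup into chunks that each start at an <h1>.

    Any material before the first heading becomes its own leading chunk.
    """
    pieces = H1_START.split(body_html)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Each chunk would then be wrapped in your page template and written to its own file.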
Demoroniser is, in the words of the author's own man page:
A Perl script which corrects incompatible HTML generated by Microsoft applications.
You can get it from the link in the same page. I must confess that I've not used it myself (don't use Office/Frontpage) but if it does what it says on the tin it should sort you out.
Hmmmmmm..... Deep fried and look like Squirrel.
1) Open Word
2) Select All -> Copy
3) Open Dreamweaver
4) File -> New Html Doc
5) Paste
6) Commands -> Clean up Word Html
7) Commands -> Apply Source Formatting (if you take the time to set the programs preferences to what you like)
8) Done
9) Drink beer
10) Sleep
Ave Molech Setting
The Demoroniser was nice in its time, but it assumes the output should be 7-bit ASCII, or ISO Latin-1 at best.
The Unmoroniser is an updated version that handles Unicode properly and will do things like convert proprietary Windows-only curly quotes to the appropriate HTML4 entities instead of dropping them back to less accurate, typographically offensive straight quotes. Same with ligatures and other characters that the Demoroniser would munge instead of convert.
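The core idea is small enough to sketch in Python. This is not the Unmoroniser's code, just an illustration of the approach: decode the bytes as Windows-1252, then swap the typographic characters for named HTML 4 entities. The table below covers only a handful of characters and the function name is mine.

```python
# Typographic characters that Windows-1252 stores in the 0x80-0x9F range,
# mapped to their named HTML 4 entities. Illustrative, not exhaustive.
PUNCTUATION_ENTITIES = {
    "\u2018": "&lsquo;",   # left single quote  (byte 0x91)
    "\u2019": "&rsquo;",   # right single quote (byte 0x92)
    "\u201c": "&ldquo;",   # left double quote  (byte 0x93)
    "\u201d": "&rdquo;",   # right double quote (byte 0x94)
    "\u2013": "&ndash;",   # en dash            (byte 0x96)
    "\u2014": "&mdash;",   # em dash            (byte 0x97)
    "\u2026": "&hellip;",  # ellipsis           (byte 0x85)
}

_TRANSLATION = str.maketrans(PUNCTUATION_ENTITIES)


def fix_windows_punctuation(raw: bytes) -> str:
    """Decode Windows-1252 bytes and keep the typography as HTML entities."""
    text = raw.decode("cp1252", errors="replace")
    return text.translate(_TRANSLATION)
```

The quotes stay curly in the browser, and the file itself can remain plain ASCII.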
http://rheme.net/unmoroniser/
PDF is most suitable for documents that need to be printed with specific formatting.
For documents that are going to be viewed online, it's infinitely preferable to use a free-form format like HTML (as it was designed to be) that can adjust to varying monitor and window sizes.
In all likelihood an NDA doesn't cover obvious works like this - anything that could be reasonably discovered publicly. Doubtless he couldn't post the _documents_ that he converted.
However, I am also not willing to just assume that no company would ever consider letting someone sourceforge a script like this. It is 1) worth good advertising and 2) clearly not important enough to be worth selling. Release it in the company's name, or not depending on what they prefer.
At a minimum a lot of small companies would be fine with this - big companies would vary wildly.
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
From the AC above:
1) get a copy of Word 2003
2) "save as" an exemplar as XML
3) write an XSLT to render it in a HTML with stylesheets etc as appropriate to your website
4) for every document you get, "save as" XML with the XSLT from 3) as the transformation.
5) publish
I've been wondering how long until using XSLT and XML was suggested. XML is supposed to be a common data transport format, but most of the other comments talk about starting with transformations to Word HTML. This is wrong because it assumes that the Word to HTML conversion will produce usable HTML in the first place, which is a bad assumption.
The solution suggested by the AC could be combined into a program that drives the entire process using the Word COM API to save to XML and then, for example, the MS Jet XSLT COM object model to automate the XML conversion. This could easily be maintained (e.g. new Word formatting not previously encountered) with small changes to the XSLT.
If the desire is to completely control the output without having control of the input then this is the best way to go. Yes, it's a bit of work but once you have a maintainable turn-key system you will save a lot of futzing with manual formatting. Use the power of XSLT.
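To show how little of the Word XML you actually need, here is a sketch in Python. Note the swap: Python's standard library has no XSLT processor, so this walks the tree with ElementTree instead of applying a stylesheet, but the mapping it performs is the same kind of thing an XSLT template would express. The namespace URI and the element names (w:p, w:r, w:t, w:rPr, w:b, w:i) are what I understand Word 2003 to use; check them against one of your own saved files. It also ignores tables, lists, and explicit "off" values on bold and italic.

```python
import html
import xml.etree.ElementTree as ET

# Assumed WordprocessingML (Word 2003) namespace; verify against a real file.
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def _tag(name: str) -> str:
    """Qualified tag name in ElementTree's {namespace}local form."""
    return f"{{{W}}}{name}"


def wordml_to_html(xml_text: str) -> str:
    """Turn each Word paragraph into a <p>, keeping only bold and italic."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for para in root.iter(_tag("p")):
        runs = []
        for run in para.iter(_tag("r")):
            text = "".join(t.text or "" for t in run.iter(_tag("t")))
            text = html.escape(text)
            props = run.find(_tag("rPr"))
            if text and props is not None:
                if props.find(_tag("i")) is not None:
                    text = f"<em>{text}</em>"
                if props.find(_tag("b")) is not None:
                    text = f"<strong>{text}</strong>"
            runs.append(text)
        if any(runs):  # skip empty paragraphs used as vertical spacing
            paragraphs.append("<p>" + "".join(runs) + "</p>")
    return "\n".join(paragraphs)
```

Everything not explicitly mapped (fonts, sizes, colors) simply never reaches the output, which is the whole point of controlling the transformation yourself.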
But that's not what I came here to tell you about.
I came to talk about the draft.
One option that can work for some situations is to export / save the file from .doc into .rtf (rich text format) and then use one of the free or pay RTF->HTML converters. I find using other software than Word to convert MSDOC -> RTF produces better results.
Using that process has made preserving italics, bold, and special characters much easier for me and almost seems fully automatable.
I've been using this method recently with some very simple search and replace, and I've been able to get good results.
I personally use dreamweaver for coding.. I know, I know, all that gui overhead and only semi-compliant code if it generates it itself.. but it does have the useful clean up word html tool, then I get to working it over in pure code.
works for me anyway..
- paul
Pmp @ DeviantArt
"... a perfect opportunity for the application of a little common sense..."
What is this "common sense" of which you speak? Where may I download it from?
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
Are you using the telerik radeditor MCMS placeholder? It's free, and has capabilities that let you automatically strip out word formatting. In my experience it only sort of works... but it's better than nothing.
You can also add an event handler for the updating event that does some regex tidying. Replacing the regex "]*>" will go a long way (better double-check that). You should be able to come up with a similar one for all the smarttag nonsense that gets inserted, too.
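As an illustration of that kind of regex tidying, here is a Python sketch that strips namespaced Office tags while keeping the text inside them. This is not a reconstruction of the pattern above; the prefix list (o, st1, v, w) is my assumption about what Word usually inserts, and the function name is made up.

```python
import re

# Matches opening and closing tags with an Office-style namespace prefix,
# e.g. <o:p>, </o:p>, <st1:City w:st="on">. The prefix list is an assumption.
OFFICE_TAG = re.compile(r"</?(?:o|st1|v|w):[^>]*>", re.IGNORECASE)


def strip_office_tags(markup: str) -> str:
    """Remove the namespaced tags but leave their text content in place."""
    return OFFICE_TAG.sub("", markup)
```

As with any regex over HTML, double-check the result on real pasted content before wiring it into the updating event.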
Still, Word formatting remains a major bane to my existence. Good luck.
HTML Tidy cleans HTML, and has a special function for cleaning Word HTML junk.
It must be terrible to work at Microsoft and always do mediocre work.
--
If you support dishonesty and violence, don't say you are Christian.
The script (decss.sed) is:
s:STYLE=\"[ a-zA-Z0-9\:;-]*\"::Ig
s:</FONT>::Ig
s:<FONT[ -=\"A-Z0-9]*>::Ig
s:BORDER=[0-9]*::Ig
s:ALIGN=BOTTOM::g
LedgerSMB: Open source Accounting/ERP
Yes, it is, since RTF is a text-based format where all the formatting is open and close tags, much like HTML. Save a word doc as rtf instead and open it in notepad, and you will see. There are many tools premade to convert from RTF to HTML, but you can build your own easily.
Video Production Support
From the sourceforge page:
I'm using this to convert all of our internal documentation. It does a pretty good job, even converts the images and acts in a relatively reliable manner with 2003, 2000, & 97 formatted files. There's some oddball output sprinkled in, but nothing a little sed fanciness can't fix.
Just expanding on your suggestion...
Perhaps he could use the OpenOffice API to automatically have a server-side instance of OpenOffice open submitted Word documents and save them as HTML. This should happen at the same time the user uploads the document - that way the user could preview the conversion to HTML, and if it was flawed, he could choose to publish the document as PDF.
OpenOffice API:
http://api.openoffice.org/
Code snippet shows simplicity of converting OpenOffice Writer SXW document into PDF:
http://codesnippets.services.openoffice.org/Writer/Writer.StoreWriterAsPDF.snip
Perhaps a few small changes here would get him what he wants.
Perl interface (ooolib):
http://ooolib.sourceforge.net/doc/ooolib-0.1.5-doc.html#info
There are also Java code snippets. I think it would be possible to convert the OOBasic snippet above to either Java or Perl.