Sanely Moving from Word to the Web?
FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"
Interestingly, I have a similar job on a website (no link for you either, Slashdot hordes!). Here's what I do (I'm sure there are smarter ways):
1. Place all "to-process" documents in a specific folder in a webserver
2. Write a script to read those documents
3. Use Regex (and similar functions) to strip off and/or replace specific tags/wordings (similar to web scraping techniques).
Admittedly, it was a tedious job at first to identify every possible template. However, I'm amazed how predictable some documents are, and once you get hold of such a "blueprint", you can reformat documents to HTML/XML fairly easily.
Once the changes are done, I preview them in a browser. If everything looks as expected, I simply save the page and use it; if not, it's easy enough to make a few tweaks from the familiar HTML environment.
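For anyone wanting to try the folder-plus-regex approach, here is a minimal Python sketch. It is not the poster's actual script: the patterns are only guesses at common Word export cruft, the names `scrub` and `scrub_folder` are made up for illustration, and it assumes the documents have already been saved as HTML.

```python
import re
from pathlib import Path

# Illustrative patterns only; real documents will need their own "blueprint".
WORD_CRUFT = [
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    (re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I), ""),
    # Office-namespace tags such as <o:p> and </o:p>
    (re.compile(r"</?o:[^>]*>", re.I), ""),
    # Presentational wrappers: drop the tags, keep the text inside
    (re.compile(r"</?(span|font)\b[^>]*>", re.I), ""),
    # Inline styling attributes on whatever tags remain
    (re.compile(r"\s+(class|style|lang)=(\"[^\"]*\"|'[^']*'|[^\s>]+)", re.I), ""),
]

def scrub(html):
    """Apply each pattern in order and return the cleaned markup."""
    for pattern, replacement in WORD_CRUFT:
        html = pattern.sub(replacement, html)
    return html

def scrub_folder(in_folder, out_folder):
    """Clean every .html file in in_folder and write the result to out_folder."""
    out = Path(out_folder)
    out.mkdir(exist_ok=True)
    for path in Path(in_folder).glob("*.html"):
        text = path.read_text(encoding="utf-8", errors="replace")
        (out / path.name).write_text(scrub(text), encoding="utf-8")
```

The attribute pattern is deliberately crude and would also hit body text that happens to look like ` class=foo`, so previewing the result in a browser, as described above, is still needed.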
Rock that crushes, Paper & Scissors that don't matter.
What I basically do is paste the document into EditPlus, then I use a function called "Replace" to get rid of the big stuff and edit out the rest of the tags manually. It may not be the best solution, but it's visually easier than just using notepad.
"Simplify, simplify, simplify!" Thoreau
How about Word -> PDF -> HTML?
... and probably a dumb one.
Just a thought
Sounds like a job for Mrs FooAtWFU
Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
What truth?
There is no dupe
Chinese Children.
Coincidentally, the captcha for this post is "chinked". Fucking hillarious.
you can either hot link to the .doc themselves in a new window or convert to .pdf and do the same thing
Thanks to file sharing, I purchase more CDs
Thanks to the RIAA, I buy them used...
It takes a Word file and spits out plain text. It can also do some more tricks.
I would suggest using Macromedia Dreamweaver... it's what we use where I work and essentially all of our content entry involves pasting in content from Word documents supplied by clients. Dreamweaver is pretty good for formatting and working with stylesheets.
Try antiword, it's got a real decent HTML option.
Alanp
So if there's only a few templates and they were a pain to work out, how about releasing your regex scripts to sourceforge or similar? Or posting here?
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
Here's a tool I saw linked off of O'Reilly Radar once:
http://textism.com/wordcleaner/
I used it once and it did a pretty decent job at preserving the tables. Yet if they're using anything odd like graphics or it's been incredibly tweaked, it probably won't be 100% perfect.
bug.gd: error search engine. Humanity working together to solve all errors.
You might consider a pack of monkeys and typewriters. They can ultimately reproduce Shakespeare so maybe, maybe they might be able to properly reformat the HTML gibberish Word produces.
Of course, you could also outsource to India but that's unethical to both the monkeys and the American economy.
..."Intern"
Don't disappoint your bird dog. Go to the range.
hello. how are you?
Dreamweaver comes with a function explicitly for dealing with Word goodness (Clean Word HTML IIRC). Also, perhaps try HTML Tidy?
tidy, from w3c. Dreamweaver will clean up some Word HTML.
I don't have a link, or proper info, but I recall seeing something here a few weeks back in which someone suggested saving the Word doc as RTF, then they had a util to convert RTF to HTML - apparently it was really useful.
Perhaps offer every document as a pdf (there are plenty of conversion tools out there, such as ps2pdf, which you can use after printing the document to a postscript file), as well as offer it in whatever format was sent to you?
I Am My Own Worst Enemy
ask them to save it as an RTF file... Reading an RTF is much easier whilst supporting almost all of the important text formatting features.
OpenOffice.Org supports the ability to export a document as PDF. As you probably know, PDF viewers are available for all mainstream OSes, including Linux, from Adobe themselves.
Unless you're dealing with content that has to be accessed or updated frequently, then PDF is the way to go.
If you believe everything you read, you'd better not read. - Japanese proverb
Could you provide forms on your website for your academic users to submit the information directly?
You can't talk about Wikipedia's flaws on Wikipedia
It's a dream come true for eliminating Word formatting in an html file, or for just copy/pasting from one file to the other.
It doesn't seem to transfer colors over, but that may be user-error.
You could always just write some php/perl scripts for scrubbing, too! RegExps are your friends!
This article has recently been linked from Slashdot. Please keep an eye on the page history for errors or vandalism.
If you're using Office 2000, you can find the HTML filter here:
http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN
I believe this functionality is built into later versions of Word.
Per the site, this produces simpler HTML with Office-specific tags removed. With that done, you could probably use a Perl script, and you might also try writing some Word macros or COM/VBA scripts that clean up the document from within Word.
And it works pretty well... TIDY!!!
Frankly I didn't think much of this tool till I had to convert a LOT of pages where there was going to be a ton of cleanup by hand. In some cases it was easier to go back and get Word to spit out ugly HTML and then let tidy fix that (if you can believe it). Best of all it is FREE and easy to use!!!
Check out Commands -> Clean Up Word HTML in Dreamweaver. It does a nice job of getting rid of extraneous tags. While you're at it, take a look at Commands -> Apply Source Formatting as well. This can be customized to your specifications in the preferences section, and automatically tabs out, adds newlines, and converts tags to lowercase where appropriate in the HTML document. Dreamweaver is the closest thing I know of to a program that "automatically" cleans up Word HTML.
Good luck!
Simpli - Your source for San Jose dedicated servers and colocation!
If not, give it a try. In the past, anything I've taken from .doc to html (in particular, resumes), seemed to convert nicely if I did a cut-and-paste from Word straight into an html project in Frontpage.
Considering that they have a common core and are part of the Office suite, they seem like they should be the most directly compatible with each other.
Advice for my fellow geeks: before seeking out that threesome you dream of, you might see what a TWOsome is like first.
http://textism.com/wordcleaner/
"A tool that strips proprietary Microsoft tags and other cruft from Word HTML documents, leaving basic formatting intact. File sizes are greatly reduced, and the returned HTML is easier to read, revise and employ."
5 for a 24-hour pass, 20 for a 1-year individual subscription.
No, I don't work there.
Save the Word document as filtered HTML and pipe the HTML through HTML Tidy. Nice clean HTML.
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
http://tidy.sourceforge.net/ As I recall, HTML Tidy allows you to remove all of Word's "enhancements".
http://tidy.sourceforge.net/
Check out the -bare and -clean options to remove Microsoft cruft.
It can programmatically convert most Word documents into HTML documents, and does about as good a job as one could expect. And it makes better HTML than Word does itself.
what's it all aboot, eh?
Hmmm... sounds like a challenge to me. Let's see what we can dig up.
:)
Step 1: Let's look at his user page
Ahh! He put in a website with his profile. Let's all go and check out http://fennec.homedns.org/
Hmm... looks like a personal page. Not too sure what to make of the comic. Anyway, let's move on to..
Step 2: Let's look at his author page. Some interesting stuff here, including three separate e-mail addresses (which I won't post here. You're welcome).
A-ha! There is a link to his employer! It's Economic History Services. And what do you know... there are a significant number of pages (especially under abstracts and book reviews) that seem to come straight out of a word processor, only with extensive cleaning. A quick look at the source reveals something interesting. It's clean. Very clean. We're talking on the level of I-use-vim-for-my-webpage-editor clean. Nice job.
Anyway, it looks like it was done by hand. I'm not saying it's not good work (quite to the contrary), but I can see your need for an automated solution.
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
Open in Word. Select All. Hit Control + Space.
I would suggest installing a PDF printer driver, printing to it to generate a PDF and then going from there, such as using any number of PDF to HTML applications, available from google.com
Outsource the boring and tedious work to India.
I once had to convert a large number of pages generated by Word into something that was at least close to validating and I used Tidy HTML. It took a little bit of poking around with all the arguments to get it to do what I wanted, but once I had I just ran it on all the Word exports and it popped out clean code. It even had a special flag (though I don't remember it off the top of my head) to specifically deal with Word exports.
HTML Tidy has a special mode for cleaning up Word's crappy HTML export. HTML Tidy is a free command-line tool that is also embedded in a lot of popular HTML editors.
HTML Tidy:
http://tidy.sourceforge.net/
HTML Kit (great integration with HTML Tidy; it includes HTML Tidy so you can just grab HTML Kit without grabbing HTML Tidy)
http://www.chami.com/html-kit/
Countless other editors integrate with HTML Tidy as well. Have fun and good luck!
OtakuBooty.com: Smart, funny, sexy nerds.
That's the only way to fix the problem.
how does a racist Troll get moderated Interesting?
My days of not taking you seriously are certainly coming to a middle...
For batch conversions, there is nothing better that I know of than TextPipe. I also like askSam [the import feature lets you grab content from many different filetypes]. If there are not many files to do on a given day, or you just want a low resource-intensive approach, try PureText (I use 2.0). It is extremely easy to use. Good luck!
Yes, Dreamweaver has a handy "Clean Up Word HTML" function. You can even grab a trial version (but it's worth the money imho)
Assuming that it is not in your power to change the material coming to you, then you must change how you process it.
Quite frankly, the most cost effective way to deal with this problem is to hire an intern, temp or clerk. Train this person to format very plain HTML, to your liking (or XML, or XHTML or whatever you prefer). Then use your application to apply the style you like to the HTML the temp made.
If you want to involve more programming, you could whip up a parser to validate the intern's work. But the reality of the situation here is that unless you are working on a truly overwhelming volume of documents, it will be much cheaper to use human labor than to invest the programming time to automate the process.
-jr
For example, you can turn every underlined, 18-pt text into <H1> headers, etc.
This way you can keep the consistency quite easily, while still staying flexible.
You can even create HTML that is compatible with the IDs and CLASSes of your site's existing CSS.
This, however, requires that you know VB, and spend some time getting to know the Word object model, which is not too difficult
I'm fine thank you
Almost forgot. The Tidy Docs will tell you to select "--bare" and "--word-2000" and I also recommend "--output-xhtml" and "--indent".
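If you want to script that, here is a rough Python sketch that shells out to tidy with the four options named above. It assumes a `tidy` binary is on your PATH and that your version still accepts these option names (check its docs); the function names are hypothetical.

```python
import subprocess

def tidy_command(path):
    """Build a tidy command line from the options mentioned in the post."""
    return [
        "tidy",
        "--bare", "yes",          # strip Word-specific markup
        "--word-2000", "yes",     # extra cleanup for Word 2000 exports
        "--output-xhtml", "yes",  # emit XHTML
        "--indent", "yes",        # indent nested elements
        path,
    ]

def tidy_word_export(path):
    """Run tidy on one file and return the cleaned markup from stdout."""
    # check=False: tidy exits non-zero on mere warnings, which Word
    # exports commonly trigger, so a non-zero exit is not treated as fatal.
    result = subprocess.run(
        tidy_command(path), capture_output=True, text=True, check=False
    )
    return result.stdout
```

Looping `tidy_word_export` over a folder of exports gives you the batch run; tidy's warnings end up in `result.stderr` if you want to log them.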
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
I'm assuming you have the right to republish the Word documents. I'm also assuming you have no control over how many Word-specific formatting features are used by the authors.
What I would do in your shoes is set up a (mostly) automated system to convert the Word files to PDF. You can buy Acrobat or you can go with a third-party, printer-driver-style converter, but in the end you'll probably save more headaches just using Acrobat.
Once you have a document in PDF, you can use any of the numerous (free and commercial) tools to convert that to HTML, text, whatever - all much more reliably than from Word directly. It's not perfect, but it's probably the closest you'll get.
Plus, you can post the PDFs themselves for download in case someone wants them - and at least Google will still happily index your PDFs.
Yes, you'll probably have to live with some NT variant to get that part done (though it might work with OSX) - but it's most likely your fastest path to *quality* conversions.
This Like That - fun with words!
Adobe Acrobat PDF conversion preserves look
Many free or cheap printing filters / converters available
Homesite has a function to import and clean Word Documents.
IAAL
Some future version of Tomcat should have built-in content parsing in its filters so that filter writers could write simple filters to transform content in a meaningful way. But I haven't seen that as a proposal anywhere.
I have a similar problem in QuarkXpress. My current solution is to export the doc as HTML and then search and replace in BBEdit in order to clean things up. A regex would do the job except for the fact that Quark generates arbitrarily named stylesheets that require manual changes. I am considering writing a script that would parse Xpress-tagged output and convert it to HTML.
I'd suggest something similar for Word...export as rtf (?) and parse it into valid HTML. However, Word's HTML is *much* worse than Quark's.
Try wvWare (http://wvware.sourceforge.net/). It works amazingly well for Word, Excel and PowerPoint. I have used it in Zope applications and have had very good results.
From the site:
This is the home of the wv library. The original name of the project, mswordview, was uncomfortably close to Microsoft's own product named wordview, so the library was renamed.
wv is a library which allows access to Microsoft Word files. It can load and parse Word 2000, 97, 95 and 6 file formats. (These are the file formats known internally as Word 9, 8, 7 and 6.) There is some support for reading earlier formats as well: Word 2 docs are converted to plaintext.
wv compiles and works under most operating systems. Although most development is carried out with Linux, wv should work on BSD, Solaris, OS/2, AIX, OSF1, and even (with varying levels of success) AmigaOS and VMS. The GnuWin32 project maintains a port for Windows, and it is required to compile and work on all of AbiWord's supported platforms, of which there are a lot.
wv allows other programs access to Word documents for the purpose of converting them to other formats. It is currently being used by AbiWord as its Word importer, and concepts and bits of code are being used by the KDE folks over at KWord in their word importer.
I've had a similar task once and we used HTML Transit, a software by Stellent (http://www.stellent.com/) and distributed by Avantstar (http://www.avantstar.com/). You can define templates for all kinds of word styles and fine-tweak the HTML output quite neatly. And, another advantage, I had excellent support when some questions arose.
So long and thanx for all the fish, RaSchi
P.S. There is a Ruby Port as well.
I had to do this recently, but to print a 20-page document as 12 pages by removing page breaks. I found that kword using the html export filter and setting it to "HTML 4.01 + Light (strict xhtml)" was the best mode. This doesn't use the style sheets and just converts to basic html.... no fancy positioning or fonts, just some headers and basic styles. This was using kword 1.4.1 / kde 3.4.2 btw.
Everything else I tried sucked, including OO.o's export.
What is it with executives and directors and their fixation with sending simple memos and messages via Word attachments in e-mails? Everybody else is on board with plain text (except some folks who are smitten with font coloring). Why can't the dolts at the top of the totem pole type in their mail client's editor and hit "Send?"
I used to use homesite 4.5... it has a built-in macro to strip the office tags and styles out of html. http://www.macromedia.com/software/homesite/
We run Microsoft CMS for my company's Web site, which annoyingly accepts pastes direct from Word, complete with all the extraneous code. (As opposed to a normal text box, which strips formatting when accepting pasted text.)
Since we style the text with CSS, we have to train everyone who works on the site to first paste anything from Word into Notepad to strip out Word code crap, then paste that into the CMS browser client, then re-apply formatting with the tools in the client toolbar. What a pain! I'd love to know if anyone has figured out a way to allow people to paste Word content directly into MS CMS without having to go through all those extra steps.
Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.
Perhaps if the workforce in the US didn't use phrases like "sand monkeys", IT companies wouldn't be so inclined to look for good workers overseas.
I'd recommend using the CVS version of AbiWord. It'll preserve almost all of your visual and semantic meaning using XHTML and CSS. This includes fairly complex things like endnotes, footnotes, tables, floating text boxes, etc.
AbiWord --to=file.html file.doc
http://www.abisource.com/
"Yeah, hire some sand monkeys from overseas to do it. That's what all the IT companies are doing. Duh."
Sand monkey? Did you write that with a sheet over your head?
Most web users now seem to tolerate PDF files, and exporting from word to PDF is much more reliable than exporting from word to HTML.
As a solution goes, it is pretty crude. However, it works quickly and easily, and produces nice looking output.
Doesn't dreamweaver have an 'unfuckup' button that fixes word-html?
"What does slashdotting mean?"
"You've never heard of slashdot?"
"I know it makes websites not work."
Avoid all HTML export tools.
Edit -> Copy
Switch to gvim
Edit -> Paste
Seriously. People need to stop using Word (or FrontPage, for that matter) to design pages.
I'm fine too.
I'm glad we have these little discussions. It makes my day so much more interesting.
Let's do lunch.
http://www.winfield.demon.nl/
Sounds like a perfect job for AppleScript. You can create a scriptable folder, drop your documents in it and let it copy and paste all the paragraphs and add some html tags, etc. Very flexible.
Mind | Body | Spirit | Cash
That's what I did. Copy all the text in Word, paste it in a text editor (which kills all the formatting assuming you're not using RTF), copy that and paste it in your HTML editor (usually the same editor to code your HTML) or you can paste into Dreamweaver or similar and go that route. Quick and easy.
"He uses statistics as a drunken man uses lampposts...for support rather than illumination." - Andrew Lang
I'm seeing a lot of 'use Dreamweaver' responses that are well-meaning and probably will solve this guy's dilemma. But what about those of us running CMS systems with text area inputs in forms? Our content people copy-and-paste directly from Word and these crazy MS Word entities get crudely transposed into ASCII question marks.
Anyone got a good regsub routine for correctly substituting these entities for their approximate ASCII equivalents? I'm just looking for pattern matching here... Don't need a bunch of code.
Appreciatively,
Seth
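One possible answer to Seth's question, sketched in Python rather than as a regsub: a plain character translation table instead of pattern matching. The mapping below covers only the usual Windows-1252 "smart" punctuation and is an assumption about what your users paste; the function names are made up. The question marks typically appear because the bytes get decoded with the wrong charset, so decoding them as cp1252 first is part of the sketch.

```python
# Word's "smart" punctuation mapped to approximate ASCII stand-ins.
REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}
TABLE = str.maketrans(REPLACEMENTS)

def decode_form_bytes(raw):
    """Decode submitted bytes as Windows-1252 (an assumption about the client)."""
    return raw.decode("cp1252", errors="replace")

def asciify(text):
    """Swap each smart character for its ASCII approximation."""
    return text.translate(TABLE)
```

Usage would be something like `asciify(decode_form_bytes(raw_bytes))` on each text area value before it is stored.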
$5 / month hosted VPS on linux = awesome!
You may want to look at the WebWorks Pro application for sanely exporting Word files as HTML/XML. I've used it in the past (a handful of years ago) and it was pretty reasonable. It is worth investigating in any case.
I have run into a similar set of problems, as I run an on-line philosophy journal (see http://ejap.louisiana.edu). The solution I found was to convert the documents down into RTF format as an intermediate step. There are a number of shareware RTF-to-HTML converters available. Unfortunately, I cannot find the name of the program I usually use at the moment, or a link for it, but googling for "RTF to HTML" shareware produces quite a few likely candidates. This system works just fine for me. What I like best about the program I have is that it puts in the HTML codes in French! If you look at the source code for the most recent edition of my journal, you can see the system in action.
fckeditor is an in-browser WYSIWYG. It has a "Paste from MS Word" button that actually strips out a lot of the unnecessary baggage. I don't know how well it handles embedded images or tricky layouts, but for the basic stuff it works well.
The interface is similar to Word - maybe if you're lucky, you could get some of your content producers to use it.
TODO: come up with a clever sig
One program I've had luck with is the HTML Tidy program at http://www.w3.org/People/Raggett/tidy/. It seems to clean up code (particularly from Word) quite a bit.
Some have suggested using PDFs. To do this, I use Ghostscript and Ghostword. Here is a good description from O'Reilly's Word Hacks on how to install it in Word.
"Prepare for the worst - hope for the best."
(Perl script)demoroniser - correct moronic and gratuitously incompatible HTML generated by Microsoft applications
http://www.fourmilab.ch/webtools/demoroniser/
607-272-4817, ask for Jim. Cyrus Company is a web development firm in upstate NY. I worked there for the last 3 years - I'm in the UK now - and we had a client who needed just this type of thing. Jim set it up (he can program in all kinds of languages I don't understand) and you can copy and paste from Word to an HTML form, keeping the format. There might be a browser requirement, but that's about it. I was amazed when I first saw it myself. If you have any questions, email me at paper@paperskies.com - sorry for the ad-sounding post, but it's the truth and I can't really think of any other way to put it! Regardless, good luck with the search.
Using a modern version of Word, output in WordML (xml format). Use an XSL stylesheet to convert the WordML to FO (formatting objects).
From there, do anything you want, like XHTML or PDF.
Or just go to XHTML from WordML with some stylesheet. XSL is teh cool!
room101 -- how much can you stand before they break you?
(they always break you eventually)
Why not print it, scan it in, and post the jpg? With one of those multi-function printers with a sheet feeder for the scanner, it might even be fun!
(for some definitions of "fun", anyway)
I could really use a speling cheker.
"Prepare for the worst - hope for the best."
http://www.w3.org/People/Raggett/tidy/
Ron
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
I use a homebrew version of demoronizer with accumulated patches that I added to the script over the years + tidy to sort everything out
To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags.
The problem with conversion of documents to HTML in general is the expectation that the formatting needs to be preserved. There have been times where I needed to "post" a document to a web site, and I always try to get the author(s) to not worry about formatting. Formatted documents are pure evil simply because 9 times out of 10 it does not affect the relevant information that you are trying to convey to your audience. Sometimes, the authors give me grief about it, but I simply show them the possibilities of separating the content and presentation during the translation. I convert their documents to generic HTML (with whatever tools are available) and use CSS to apply relevant formatting for the type of document (a report, article, thesis, or whatever). No funky font tags, or weird tables. Just let the HTML flow as it's meant to.
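The "generic HTML" step could be sketched as a whitelist filter, for example with Python's standard `html.parser`. This is just one way to do it, not the tool the poster used; the `ALLOWED` set and the names `Whitelist` and `to_generic_html` are assumptions for illustration. Anything not on the list loses its tag but keeps its text, and all attributes except `href` on links are dropped so the site's CSS does the styling.

```python
from html import escape
from html.parser import HTMLParser

# Structural tags to keep; everything else is unwrapped.
ALLOWED = {"p", "h1", "h2", "h3", "ul", "ol", "li", "em", "strong",
           "table", "tr", "td", "th", "a"}
# Elements whose text content should be thrown away entirely.
SKIP = {"style", "script", "title"}

class Whitelist(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
        elif tag in ALLOWED:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.out.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skipping = max(0, self.skipping - 1)
        elif tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))

def to_generic_html(html):
    """Return html reduced to whitelisted tags with no styling attributes."""
    parser = Whitelist()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because comments are ignored by the parser's default handlers, Word's conditional comments vanish without any special casing.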
Coderz 4 Life
Net-It Central is the magical tool you were looking for. With that you can just point it at the file share with the Word Documents (and Excel and Power Point...) on it and see them indexed and cross linked on web pages. It'll update the content as the source docs change.
Oh, you mean non-commercial magical tools?
Ever dream you could fly? Get up from the Flight Sim. I Fly
Why not just copy the plain text to an HTML editor, or even Notepad? Then manually add any font variations (titles, subtitles, bullets, etc) that are needed. Even with Dreamweaver or (shudder) FrontPage, this would not be too hard to do, even with longer articles.
What did happen to RTF? Seriously.
I've got a table at Elaine's. Can you make it?
I use this program, Papyrus, from rom-logicware http://www.rom-logicware.com/ that has got a quite good HTML export function, and also a quite good M$Word import.
The quality of the import varies depending on the source document (for my kind of stuff it's very good), but the quality of the HTML export is EXCELLENT, tidy-proof.
There is a demo version whose only real limitation is that one printed page has some letters swapped; every other function is OK.
Also, its size is extremely small (kinda 2-3 MB)
Pumbaa! I don't wonder; I know.
Does anyone have the BT link for this? ;-D
Why is it important to make the code beautiful if the objective is merely to publish content for legible consumption? Why not just use Word's HTML export capability and dump the results into your web page and be done with it? Your content will be published and who cares what the code looks like if nobody's going to be doing any significant editing to it..?
Just curious. Personally, unless the document is too big for consideration, I'll usually recode the thing by hand if I need the code to be precise - I haven't met a code-generator that I like yet.
A computer?
Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
Word? Never heard of it. Request that all submissions be sent in one of these "industry" standard formats:
And, provide a Preview button so that people can preview what they are sending in.
I noticed a nice feature in tinyMCE (javascript wysiwyg editor) that allows you to copy-paste stuff from word to tinyMCE.
If this would really do any cleaning up I don't know, and sometimes tinyMCE has problems of its own just keeping track of font styles (it keeps flooding me with <font> tags (eww!)).
It's not a complete solution, and others in here have better suggestions, but this feature is certainly interesting (and relevant?)
The AbiWord project has a set of command line utilities to convert Word documents to various other formats. It's called wv in Ubuntu/Debian. The one you want is called wvHtml.
"We Don't Need No Truthless Heros!" - Project 86
I do this all the time.
Don't bother with any other lame program that some supposed /. know-it-all recommends.
1. Copy your document from Word into Wordpad.
2. Copy your document from Wordpad into your HTML editor of choice.
Jesus, tell me about it. I get 30kb attachments merely saying "Got your email, thanks!" with "thanks" done up in some odd curly red font and a six-line sig, not to mention the twenty-seven 8x10 colored glossy JPG attachments with circles and arrows and a paragraph on the back of each one...
Dreamweaver does this well. You "clean" up the HTML and it cleans things up nicely.
For anyone out there who uses Gmail, you might have noticed the JavaScript based editor for emails.
It will actually give you the ability to paste in code from other editors ( including MS Word ).
The app is called TinyMCE. TinyMCE is a platform independent web based Javascript HTML WYSIWYG editor control released as Open Source under LGPL by Moxiecode Systems AB.
It may not make perfect XHTML compliant code, but you can try the input and test the output results here.
I've found it to be a pretty useful tool and it should clean up the HTML pretty well. All in all it's going to depend on the basic page layout you have to decide if this is the right fit for you.
rm
Works every time.
When I was looking into doing this sort of format stripping (for a college newspaper, oddly enough) I started really looking into wvWare. It really is a marvelous little program - it takes a document in word format and translates it into a document in some text-based format by essentially replacing any given bit of word formatting with the text from a tag in an XML file describing the destination format. If you want it to, for instance, keep paragraphs, bold and italic markings, but nothing else, you would write (Or, as I was doing, edit) an XML file specifying that for each you replace the beginning with the appropriate start tag, and the end with the appropriate end, and all other formatting in the document with nothing at all. I found the format for doing WML pages to be marvelously close to a very minimal HTML (only a tag or two away).
1) get a copy of Word 2003
2) "save as" an exemplar as XML
3) write an XSLT to render it in a HTML with stylesheets etc as appropriate to your website
4) for every document you get, "save as" XML with the XSLT from 3) as the transformation.
5) publish
Pagify is a Perl script I wrote to do this for another job. It's basically a series of regular expressions that: 1. purge all the proprietary XML gunk from the HTML file you save from Word; 2. chop the file into smaller files wherever a Heading 1 appears; 3. attach endnotes as footnotes to the appropriate pages. It's GPL'd, so go nuts.
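This is not Pagify itself, but the chopping step (number 2) is easy to picture with a small Python sketch. It assumes Heading 1 ends up as an `<h1>` tag in the saved HTML; `split_on_h1` is a hypothetical name.

```python
import re

H1_OPEN = re.compile(r"<h1\b", re.I)

def split_on_h1(html):
    """Cut html into chunks, each starting at an <h1> (plus any lead-in text)."""
    starts = [m.start() for m in H1_OPEN.finditer(html)]
    # Keep whatever precedes the first heading as its own chunk.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(html)]
    # Drop chunks that are empty or whitespace only.
    return [html[s:e] for s, e in zip(starts, ends) if html[s:e].strip()]
```

Each returned chunk could then be written out as its own page; re-attaching endnotes (step 3) is a separate problem this sketch does not touch.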
Demoroniser is, in the author's own man pages words:
A Perl script which corrects incompatible HTML generated by Microsoft applications.
You can get it from the link in the same page. I must confess that I've not used it myself (don't use Office/Frontpage) but if it does what it says on the tin it should sort you out.
Hmmmmmm..... Deep fried and look like Squirrel.
As you've found out, Word is intended to create paper documents, not web content. I think you really need to look at the bigger picture here. You're currently taking Word documents and desperately trying to convert them to HTML so they can be published on the web. Great, but not a good solution to your larger problem.
Really, the question is, why are you accepting Word documents in the first place? If your authors are serious about publishing on the web you should really be pushing them to give you content in html, or look into some kind of content management system. Any major website and most minor ones would cringe at the thought of using Word as a content creation program. Those should really be your long term goals as this word->html business is really quite a crappy system. If you don't start pushing people now to change to something more sane they never will.
Eventually you're going to end up with a difficult site to manage because of all the word->html conversions. Solve the short term problem with some cleanup program, but you really need to work on the long term problem before it kills you, or the site.
AccountKiller
Inspect the copy closely in the Design view before you strip the unsightly word commands so you don't miss any little trick that might get stripped in the process. This has happened to me once or twice.
But don't hit the undo, usually there's a quick fix in Dreamweaver that will bring the page back to the way it looked before.
A small aside... Attention Dreamweaver fans: let's all let Adobe know how much we love this program as they absorb Macromedia later this year.
"Where did this apple come from?"
--Alan Turing
Clean up HTML with HTML Tidy (http://tidy.sourceforge.net/)
This can be easily scripted; no gui needed. Of course, you seem to want even /cleaner/ code ... this is only a starting point, but it seems like it will do most of the work for you.
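As a rough illustration of the "easily scripted" part, here is a minimal Python sketch that shells out to the tidy command-line program. It assumes tidy is installed and on the PATH; the function names are made up for this example, and the option set (word-2000, bare, clean, drop-proprietary-attributes) is just one plausible starting point, not a recommended configuration.

```python
import subprocess

def tidy_command(path: str) -> list[str]:
    """Build a tidy invocation aimed at Word-generated HTML."""
    return [
        "tidy",
        "-quiet",                                # suppress non-essential output
        "-asxhtml",                              # emit XHTML
        "--word-2000", "yes",                    # strip Word 2000 surplus markup
        "--bare", "yes",                         # replace smart quotes, nbsp, etc.
        "--clean", "yes",                        # replace presentational tags
        "--drop-proprietary-attributes", "yes",
        path,
    ]

def run_tidy(path: str) -> str:
    """Run tidy on one file and return the cleaned markup from stdout."""
    result = subprocess.run(tidy_command(path), capture_output=True, text=True)
    # tidy exits with 0 for success, 1 for warnings, 2 for errors
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.stdout
```

Calling run_tidy on every file in a folder gives you a no-GUI batch job; further scrubbing can be layered on top of the output.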
Use my userscript to add story images to Slashdot. There's no going back.
I used a combination of Perl and Ole (using the Win32::Ole package). MS Word has Ole hooks, designed to work seamlessly with the native Windows VBScript. With Perl's Win32::Ole, you can always do the equivalent, but you may have to use some syntactic acrobatics. After figuring out the proper syntax to use, it worked smoothly for me.
:-) If you have any further questions, reply to this thread, and hopefully I'll get back to you with some answers.
Unfortunately, to check for bold, italic, or underlined text, I had to check every individual character to see if it was formatted. Very inefficient, especially with large documents.
Google Groups is your friend
- The <span... for every paragraph
- giant and identical "style=...." qualifiers for each <td> and <tr>
- each cell with an explicit paragraph inside it
- each cell ending with an explicit </td>.
Manual cleansing ended up reducing the file size by about 60%, without changing how it looks.
In Soviet Washington the swamp drains you.
Step 1: Buy a cheap Mac Mini
Step 2: Buy copy of Office 2004
Step 3: Open your horrendous Word .DOC in Word 2004
Step 4: Print to PDF
OR
Step 4: Export to XHTML in Word
Done.
Exporting to XHTML in Word 2004 seems to do a pretty good job usually. However you should just print the DOCs to PDF and put those up instead.
There is a nifty program called Cross Eyes which reveals all of the formatting in a Word doc (basically shows you the "source" of a .doc). It can help you see what is tripping you up and get it removed.
You want to create a bloated internet. Or maybe you just want to drive users away from your site! Save PDF for documents that are meant to be printed.
Why would you want to force your users to open up a different application to view your online content!?
I can go on about why this is such a bad idea but I think it is very obvious.
nt
1) Open Word
2) Select All -> Copy
3) Open Dreamweaver
4) File -> New Html Doc
5) Paste
6) Commands -> Clean up Word Html
7) Commands -> Apply Source Formatting (if you take the time to set the programs preferences to what you like)
8) Done
9) Drink beer
10) Sleep
Ave Molech Setting
Print to pdf in OO.o.
Either post the PDFs, or, last time I checked, there are a lot of cleaner tools to deal with them. (Google obviously does it with little problem, so it's out there.)
Roses are red
Violets are blue
In Soviet Russia
Poems write you!
Did anyone else read that as "Sanely Moving from World to the Web?"
I was thinking "Oh God, not another article from Wired!"
And while I don't want to encourage the posting of every article in Wired on /., I also feel compelled to cite why I thought the above. So, realizing that this post will probably only be read by 2 other people at most... the article is here:
http://wired.com/wired/archive/13.08/start.html?pg=3
Go ahead and mod me Troll.
90% of being smart is knowing what you're dumb at.
First off, like others have recommended, use HTML Tidy.
Secondly, create a set of standard templates/formatting rules and make sure that your guys keep to them. This makes everything a lot easier, possibly allowing you to even script the exportation of the documents, as well as making sure that there'll be a standard look and feel across the pages without a need for so much editing.
I mentioned HTML Tidy already, but you could also automate MS Word, crawl through the Word object model, and reproduce the page in HTML. One advantage of this method is that you can identify all the parts of the document, including footnotes and formulas. Here is some code to start with. The tricky part is to realize you have to crawl through each paragraph/range at the word level and check styles. Do it at the character level and the thing is a dog. Another tricky part is crawling through the document in the correct order. The link isn't my code but was the closest I could find on the web before I sat down and figured it out myself.
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
What always got me was that MSWord has all these built in styles e.g. Heading 1, Heading 2 etc which get turned to complete html font size mush when you use save as html AND this goes for every other single convertor I've used.
Why not use these to convert to structured html, eg h1 h2 etc?
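To make the idea concrete, here is a tiny Python sketch of the mapping step only. It assumes some earlier step (Word automation, say) has already produced a list of (style name, text) pairs for the paragraphs; that input shape, the function name and the choice of which styles to map are assumptions for illustration, not any existing converter's behaviour.

```python
from html import escape

# Built-in Word style names mapped to structural HTML tags (illustrative subset)
STYLE_TO_TAG = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Heading 3": "h3",
    "Normal": "p",
}

def render(paragraphs):
    """Turn (style_name, text) pairs into structural HTML, one element per line."""
    lines = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style, "p")  # unknown styles fall back to <p>
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)
```

All font and size information is simply thrown away; the site's stylesheet decides what an h1 looks like.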
This is why, back in 1998 at a previous company (I was lead developer in a publishing company), we had to build our own, which is still being used by some of my current clients.
There are a number of problems with it; it's one of those "solutions" that I've always meant to get back to, correct all the issues and publish online.
If the documents you use are heavily structured using the built-in styles then it may be of some use to you.
It's basically a VB (yuk) automation on Word. We had to use PowerPoint to get all the images out as GIFs (yuk again), and we had to write a really interesting algorithm to get tables out as HTML rather than the awful images some convertors produce.
Maybe you could get in contact if you're interested.
CourseGenie is a product designed for exactly what you're looking for -- taking Word documents and bringing them out to sane, accessible HTML.
It's especially designed for academic uses like you're looking for.
http://www.horizonwimba.com/products/coursegenie/
[disclaimer: Yes, I do work for the company :) ]
Ctrl-a, Ctrl-c, open notepad, Ctrl-v, Ctrl-s, type document.html, press OK. Done.
Stop Global Warming!
Just say no to irreversible processes!
Any "academic" user using Word should not call themselves academic. The real ones use LaTeX. ;-) LaTeX allows you to do everything you want, and more.
one post above you.
too bad i'm A to the mothafuckin C.
This is common to other M$ Office applications.
Try loading PowerPoint-created HTML into something other than Internet Explorer and see what happens.
I've got a short sample with IE and Firefox screenshots here: http://home.mindspring.com/~fredthompson/
OpenOffice doesn't properly load all M$ Office files, especially those with fine formatting control or embedded video.
Have you ever used WinFax to send a Word document? You'll notice the margins change.
PDF seems to be the only way to keep the formatting but then you don't have the raw text content.
Supposedly, the upcoming major release of M$ Office won't use proprietary formats. Yeah, well, we've heard that before so maybe yes, maybe no.
PDF is most suitable for documents that need to be printed with specific formatting.
For documents that are going to be viewed online, it's infinitely preferable to use a free-form format like HTML (was designed to be) that can adjust to varying monitor and window sizes.
The simplest way to proceed would be to set up an online form for content submission. Tell them it's the only way that you'll take submissions. Then they can cut & paste text into the fields that you specify, or, if they are professors, more likely they'll give it to their grad students or department secretaries to do.
You can give them some formatting options by using textarea tags and allowing a limited set of html tags into the content, the same way that slashdot does.
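A minimal sketch of that kind of tag whitelist, using only Python's standard library. The set of allowed tags and the names here are hypothetical (this is not how Slashdot actually does it): anything not on the list is dropped, all attributes are discarded, and text is re-escaped.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; adjust to whatever formatting the site permits
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br", "ul", "ol", "li", "blockquote"}

class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def sanitize(submission: str) -> str:
    """Keep only whitelisted tags from a submitted fragment."""
    parser = WhitelistFilter()
    parser.feed(submission)
    parser.close()
    return "".join(parser.out)
```

Text inside disallowed tags is kept, only the tags themselves go, so pasted Word markup degrades to plain paragraphs rather than vanishing.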
In all likelihood an NDA doesn't cover obvious works like this - anything that could be reasonably discovered publicly. Doubtless he couldn't post the _documents_ that he converted.
However, I am also not willing to just assume that no company would ever consider letting someone sourceforge a script like this. It is 1) worth good advertising and 2) clearly not important enough to be worth selling. Release it in the company's name, or not depending on what they prefer.
At a minimum a lot of small companies would be fine with this - big companies would vary wildly.
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
- Open the word document, then tell Word to "Save as Filtered HTML".
- Close the document in Word, then open it up in Dreamweaver MX 2004.
- Tell DW to "Clean up Word HTML", then do "Clean up HTML" and under specific tag put "span".
Aside from some table issues, this cleans up about 99% of Word's garbage.
HJ
Actually the problem may be with which browser you preview CSS pages in. As you may know, Explorer is getting further and further from compliance with CSS standards. When I preview in other browsers, my CSS pages done in Dreamweaver look fine.
I use a Mac and no longer bother to check my CSS pages in Explorer since MS quit supporting the program for the Mac platform. For a while I previewed on a friend's PC, and still do for pages that don't use CSS, but with the browser going so far out of support for web standards, especially regarding CSS, I depend on other browsers for compliance.
It would be nice to get Explorer updated for the Mac; I like the way they do bookmarks. Do any of you Slashdotters know if there is a group asking MS to update?
"Where did this apple come from?"
--Alan Turing
Give it to the intern. :-D
From TFAskSlashdot:
He's already doing what you suggest. He wants a better way.
Bogtha Bogtha Bogtha
I find myself in a similar state but I'm trying to get from word (and html) to wiki markup.
I agree that filtered html | tidy gives workable output for html. How do I get html converted to wiki markup? Or word to wiki markup for that matter?
Of course, that means your users would have to have an Acrobat plugin for their browser.
Rather than PDF, you could use FlashPaper. Of course this would upset those that don't like add-ins, but the installed base of Flash is quite high, and the files open faster than PDFs.
Here's a method that's practically 100% reliable. You won't lose *any* formatting details. Also, the way I describe is by hand, but I'm pretty sure you could set something up to automate this.
1) Open the document in Word
2) Maximize the window
3) Take a screenshot
4) Upload the screenshot to the web
5) Done!
& I wish I knew the password to your heart . . . &
The "Web Page (filtered)" solution is money. Easy and effective, given the scenario presented in the initial post. Further posts should all just herald this info. NAZIS!!!
are meant to be printed
In high school (several years ago) our school newspaper was produced in QuarkXPress, which did not lend itself to HTML at all (at least at the time). We would print the document as a PDF and then use BCL Magellan: http://www.bcltechnologies.com/document/products/magellan/magellan.htm to convert it to HTML (and HTML that was readable on any browser at that...). It seems the company now has a web based solution: http://www.gohtm.com/ and that Magellan now converts from .doc as well.
I run a small newspaper and get press releases and stories filed by people using absolutely any file format and style that you can imagine. Quickly stripping the text down is hugely important to me.
I've become a huge fan of Textutil, a command-line tool built by Apple that was included in Mac OS X 10.4. It can process Doc, RTF, text and anything that Apple's OS can read. And it can spit files back out in any format you want.
I wrote an AppleScript for BBEdit (you could just as easily do a Perl script) to strip out everything but the most generic tags -- italics and bolds -- so that I can use those files to my own ends. It rocks.
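For batch use, the Textutil step can be wrapped in a few lines of Python. This is only a sketch under assumptions: it must run on a Mac where textutil exists, the folder layout and function names are invented for the example, and it covers just the conversion, not the later tag stripping.

```python
from pathlib import Path
import subprocess

def textutil_command(doc: Path, out_dir: Path) -> list[str]:
    """Build the textutil call that converts one document to HTML."""
    target = out_dir / (doc.stem + ".html")
    return ["textutil", "-convert", "html", "-output", str(target), str(doc)]

def convert_folder(in_dir: Path, out_dir: Path) -> None:
    """Convert every .doc file in in_dir, writing .html files to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for doc in sorted(in_dir.glob("*.doc")):
        subprocess.run(textutil_command(doc, out_dir), check=True)
```

Something like convert_folder(Path("incoming"), Path("converted")) would then process a whole drop folder in one go.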
Deny the receipt of the documents in anything other than plain HTML.
you can save word documents as html documents, directly in word.
if it looks like shit, oh well... it was made in word.
the only permanence in existence, is the impermanence of existence.
I long for the days when I used FrameMaker. Its style system is much easier to use than Word's, and makes it much easier to enforce standard formatting. And MIF output was great for Perl transformations.
"This mission is too important to allow you to jeopardize it." -- HAL
here's what I do at a certain work-study web "developer" gig I work:
In Word:
- Select All
- Copy
Open Notepad and:
- Paste
Voila! Plain text! Now:
- Copy again
- Paste into Frontpage
- add formatting tags at will
The only way to fly for mass document uploads. Just ask the IRS and thousands of companies that have significant document delivery via the web.
Don't fight reality!
http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN
The Luddites were ahead of their time.
http://www.fckeditor.net/
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
Microsoft makes exactly such a tool and it's available here:
http://office.microsoft.com/downloads/2000/Msohtmf2.aspx
You could have saved us all a lot of time if you just searched for it instead of posting an Ask Slashdot question.
Future Wiki -- If you don't think about the future, you cannot have one.
Nice detective work. However, I disagree that all that nice clean code has all been done by hand. This is a guy who didn't just start looking for a solution to the problem. From randomly chosen source:
"HTML Tidy for Linux (vers 1st July 2004), see www.w3.org"...
My solution is to do a Save As on the horror.doc or messy.pdf, and to save it as
I use GoLive for the fix, but mostly in code view.
It's nice to see other people caring about the issue.
This app is great and free. http://www.wimb.net/index.php?s=delphi&page=27
This is a job for perl. I used to code for a company that does exactly what you need to do for universities. The ideal is to write your own scripts based on the document's structure (extract title, paragraph and so on), but there's a billion scripts to handle html parsing on cpan.
My first thought is that the XM-109 25 mm rifle has been shown to be effective against armored vehicles at nearly 2,000 meters. It should be able to face down a Microsoft Word document, no matter what the formatting.
And it is cost-effective, requiring only a single soldier to operate, with no more training than would be required to, say, produce a Microsoft Word document.
Nevertheless, experience compels me to add that you can never have too much firepower where a Microsoft Word document is concerned.
Assuming "Save as HTML" isn't an option (if you're getting your Word docs from someone you can't easily have re-save them for you, for example), I've used Antiword (a Word-to-text converter) for this sort of thing. It's been years, though, and I can't say with a whole ton of certainty that it works as well now as it did then.
Problem: I have writers with no HTML skills, not even basic ones, despite two years of trying to get them to learn bold tags. Some people just don't get it.
Solution: they paste into a plain textarea, save, and are sent to a page with a scaled-down HTMLarea-type editor where they can do the basic formatting they are allowed to do on the site. There should be nothing in their Word document formatting that can't be accomplished.
Tables/Images are not meant to be part of the article and as such they should ask for that ability on an as needed basis.
This has made my life better because I now get perfect XHTML transitional from their stories without the headaches caused from pasting in from Word.
I have yet to hear anyone complain they can't get the format they want in their story.
-- taking over the world, we are.
Again as someone mentioned there is antiword which is here http://www.winfield.demon.nl/index.html
Rather than trying to fix the horribly broken html microsoft word generates you might be better to try Abiword.
Abiword is fast light and can be used at the command line to batch convert Word documents into HTML that is even cleaner than Microsoft Word 97 ever managed to produce.
IIRC the command you need is abiword --to html filename.doc
Personally, I love FCKEditor [fckeditor.com]. It's open source, and can support ASP/PHP/etc. Simply put, it's a nice emulator for Word on the web, and it has an option that allows direct pasting from Word to it. Check out the demo, it's great! I use it in many projects of mine for clients (php/mysql scripting, mostly).
Open it in Dreamweaver. There's an option called, appropriately enough, "Fix Microsoft Word HTML." Hit it, and things get a whole lot cleaner.
http://www.fckeditor.net/
a very competent web-based word processor with one killer feature: "paste from word." i've tried it and it generates pretty clean html from even complicated ms word formatting. the only thing it doesn't seem to handle well is fonts, but i still think it'd be an excellent solution.
(really unfortunate name though. worse when you realize it came from the author's initials.)
Coffee. Lots and lots of coffee (or bawls).
If you just took anything I said seriously, read it again.
But that's not what I came here to tell you about.
I came to talk about the draft.
use the tag on the for preformatted text, and just put in heading tags when needed
and load the word dll from asp.net. Works great.
The war with islam is a war on the beast
The war on terror is a war for peace
Try this program. I find it really useful...
Easy Text To HTML Converter (freeware)
http://www.easyhtools.com/
SED
Panel F, Relay #70
When copying/pasting any Word stuff to a HTML editor (like FrontPage) an intermediate paste to WordPad preserves most formatting.
WordPad is also useful for opening bit-rotted Word documents which cause Word to freeze, allowing content to be copied/pasted into "fresh" .DOCs.
Assuming you have a recent enough version of word you could save it as Wordml (Microsoft Word XML vocab). Then you can convert the Wordml to DocBook (search wordml to docbook) or write your own stylesheet.
Once in Docbook format, there exists a stylesheet (which can be customized if necessary) to convert to html.
What you need is the app Magellan from BCL.
We use this in our ASP called "Diligent Boardbooks" (www.diligentbooks.com) and Magellan will convert .DOC, .XLS, .PPT and .PDF files into quite acceptable .HTML documents.
The software is not perfect - but it is probably the best thing currently available.
There are issues in the conversion process, such as text outside the printable area of the page, and transparent .GIF and .PNG files aren't seen as single images; they're broken into multiple 1 pixel high images (which places a huge strain on an HTTPS connection due to the added latencies of multiple requests), but overall it works very well indeed.
Our "Boardbooks" app is perhaps, the only product of its kind in the world: specifically designed for company directors, and board meetings, so that materials are available as soon as they are uploaded and approved - with note making capability, and the system keeps track of what pages you have viewed, and which have changed.
All this with nothing except a web browser.
Well worth investigating
How many escape pods are there? "NONE,SIR!" You counted them? "TWICE, SIR!"
I switched from html to docbook about a year ago and am very happy. It's xml so it can be edited with anything or you can find editors for it (XMLMIND is ok) and it can be easily translated to pdf and other formats.
The difference between Canada and the USA is that in Canada healthcare is a right and gun ownership is a privilege.
Depending on what you tell it in a configuration file, it removes or warns you about specified tags, attributes, and/or content bracketed by specified tags. You can use it as a filter and pipe the output into other tools as needed for other kinds of massaging.
One option that can work for some situations is to export / save the file from .doc into .rtf (rich text format) and then use one of the free or pay RTF->HTML converters. I find using other software than Word to convert MSDOC -> RTF produces better results.
Using that process has made preserving italics, bold, and special characters much easier for me and almost seems fully automatable.
I've been using this method recently with some very simple search and replace and able to get good results.
I haven't done this before, but what about outputting from Word to Postscript and then running pstohtml to convert the postscript to HTML. Does pstohtml make better HTML than Word?
www.timcoleman.com is a total waste of your time. Never go there.
For years we used to accept any kind of garbage from our contributing authors until we put our foot down and asked them to adhere to a Word style guide we authored: use headings, not the font drop-down; use the list styles, not the auto-bullets. If your Word document is structurally sound, there are a lot of open source tools (we used a commercial solution called HTML Transit) that will do a near-perfect translation, including tricky tables. Garbage in...
body massage!
"... a perfect opportunity for the application of a little common sense..."
What is this "common sense" of which you speak? Where may I download it from?
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
The book Dive into Python (free online, just google) has a cool chapter or two on HTML processing. It may be worth a look.
You can convert to html first and then run a script to parse the html with the sgmlparser module. Then just ignore all the msword crappola when writing the output files.
Dunno, maybe that is an option.
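A rough sketch of that parse-and-ignore approach. Note that sgmllib was removed in Python 3, so this uses html.parser from the standard library instead; the class name and the lists of tags and attributes to drop are my own guesses at "msword crappola", not a complete inventory.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"span", "font"}             # unwrapped: text kept, tag discarded
DROP_ATTRS = {"class", "style", "lang"}  # attributes Word sprinkles everywhere

class WordCruftStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _skip(self, tag):
        # Namespaced tags such as <o:p> are Office-specific
        return tag in DROP_TAGS or ":" in tag

    def handle_starttag(self, tag, attrs):
        if self._skip(tag):
            return
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name not in DROP_ATTRS and ":" not in name
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are written out as <br>
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if not self._skip(tag):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def strip_word_cruft(html_text: str) -> str:
    parser = WordCruftStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Comments, including Word's conditional comments, are dropped because the parser's default comment handler does nothing with them.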
#6495ED - cornflower blue
kses - http://sourceforge.net/projects/kses - can be useful for cleaning up the 'filtered HTML' from word, which is still rubbish.
You could try just getting a torrent such as at http://www.google.ca/search?&q=alice%27s+restaurant+torrent
I'd think stuff from academic sources would be in TeX.
Talk to these people
It's primarily a Learning Management System, but they do integration with other web sites for content posting.
Why?
Alternatively for $10 you can get the album at the iTunes Music Store, no turntable required...
HTML Tidy (http://tidy.sourceforge.net/) and its Java derivative JTidy (http://jtidy.sourceforge.net/) both have options to de-gunk HTML produced from MS Word. Does the job.
I looked at the site, using a link found below and I'm wondering why you don't use something like Drupal. (drupal.org)
Setup is easy, it's free and after a slight learning curve, you'll save yourself hours or days of effort. With a bit of CSS tweaking, the site could even look the same. You could use wordpress, but with an academic site like that, you probably need something with more advanced taxonomies.
Plus, you'd get things like RSS syndication, which adds immeasurably to the site's usefulness.
For simple text documents in Word, just copy and paste. If you need something formatted exactly, use a downloadable PDF. (Preferably with the text somewhere that Google can index it easily.)
-- My Weblog.
This is a really interesting question, which I think we will start to see more of in the coming decade.
There is a fundamental problem with the modern content ecosystem (one facet being: Word to web). What questions such as this point out is that our current thinking about content creation is to attempt to cure the problem once it has already been presented (as in, someone has already created the Word document, now we have to migrate it to HTML).
We need prevention, not a cure. Prevent the problem before it ever appears.
Solution : Store the content separate from formatting until it needs to be published to a particular format.
Large groups of people create various content which ideally should be a)produced in one or many formats and b) shared as chunks between common users.
I've been working with enterprise-level documentation problems for years; hell, I started in the days when documentation problems meant someone had lost the stapler. Today it's no easier: people have thousands of documents, chunks of content and data stored in a never-ending puzzle of directories. No one shares it, people can't find it, and you cannot reuse it.
Databases, people! What's taken everyone so long?
I've found only one product capable at this time of what I speak of that I would be comfortable recommending.
AuthorIT http://www.authorit.com/
Other than AuthorIT, XML is looking promising, yet still far from an elegant solution and ultimately far from the best solution for authors.
It's time content creation took the next step. Most other enterprise solutions have sensibly moved to databases, why shouldn't content?
I created SWL http://cixar.com/swl because I had the same problem. It's just a small document language that lets me change a global template for my web sites without having to make any changes to my content documents, and lets me copy and paste relevant text from offensive document formats into VIM and spend a minute adding "half tags" to the beginning of lines that require style information. It's highly abbreviated, so making a bulleted list takes just four characters, no matter how many items it has. Same with an outline. Same with a table. Same with full page calendars. It lets me make macro tags if I need to produce some repetitive data without having to repeat the surrounding HTML I want. The template system lets me do things that I couldn't with HTML and CSS, like make ornate horizontal rules by overriding their tag. I'm making a new version that'll produce PDF's and HTML from the same source files. Come crash my server, if you will.
I may be crazy, but this pipeline works great if you (1) know how to program a DFA and (2) can rely on some semblance of consistency across various documents from a single source. Majix will also automate Word to convert it to RTF before you start processing it, at least the last time I used it.
Social media and technology thoughts: http://jasonkinner.wordpress.com
Part of my job is doing exactly that...
I wrote an ActiveX DLL in VB that drives MS Word: it looks at the doc template and the mappings defined in an INI file, then exports using MSXML. Most of the time a new format only needs a new section (for the new template) in the INI file.
the dll is called from the submit btn in the ASP
Sounds like you just need to get PHP and MySQL and make it database-driven... then you could update the website from the web. Also, there are WYSIWYG editors made in JavaScript that make editing content similar to Word. That's what I use at my site... or you could slap PHP-Nuke on it like I did on my site.
word even has tools to do this as a batch job
That's false. I don't like Microsoft either, but I don't see any evidence of that (in fact the last two versions of Office for Mac have been hailed as better in some ways than the Windows versions). I don't know because I don't use Windows. There are incredibly bad compatibility issues with .doc files over PC/Mac versions of office, but the latest versions seem to read everything (besides, there are Office incompatibilities with Office documents from previous versions irrespective of platform).
In any case, on Mac Office 2004 there is an option if you choose "Save as Web Page" for "Save only display information as HTML". This cuts out some of the cruft but not a ton of it; HTML from word is just tedious. I have taken to designing my documents first in HTML and then converting a copy to .doc rather than the other way around -- that way I don't have to deal with Word's output. Seriously there should be an option to ignore font settings, to not create new style sheets, etc. -- when converting DOC to HTML usually all I want is the information and any footnotes and tables properly converted; I don't care about font face and size info (unless it is different than the rest of the page in which case relative tags are what I want) and I sure don't want Microsoft's imperialist stylesheet information all over the place (right down to the style names like MsoNormal and MsoFootnoteText -- yeesh).
If you have a Mac available with Office etc. you can just print and save the output as pdf. Pages may also do a good job of making web pages. It has worked well for me so far at least with various formats.
Against the grain
Granted that the filtered option cleans a little bit,
you still ain't free of the M$... stuff.
M$ can't even leave the "filtered" version alone. Word leaves in a bunch of nasty extraneous attributes that are not XHTML compliant. However, the "filtered" version is a whole heck of a lot easier to write a clean-up script for.
I can't speak from specific experience, but perhaps a conversion from Word to RTF, and then RTF to HTML, would give the best results. Word does a fair job converting its documents to RTF. That conversion will help get rid of some of the weirdness that is Word. Googling for RTF2HTML gives a variety of options. Once it's RTF, you might have better luck with scripting tools or other editors that can take the doc to HTML.
You could use a WYSIWYG editor. Not HTMLarea or Fckeditor but xstandard (http://xstandard.com/).
You have to get the pro version for it to clean MS Word formatting. But among all the other WYSIWYGs I've tried this is the best.
Our company's Media Officer used to give me the Media Releases in Word format, which I would diligently convert to html.
To ease my pain, I added a form to the intranet with a "rich textarea" in which she copy/pastes from Word. Add a few RegEx's and nice clean code - handles tables nicely too.
Mongrel News all the news that fits and froths
Don't forget HTML Tidy. It has an option to clean Word HTML output.
If Openoffice can open and edit the file - can it "export" something more sane than MShtml?
I'll try it later.
Yes, but whats that got to do with the price of tea in D'ni?
In addition to taking the opportunity to shamelessly plug my book, I've posted a detailed response on the O'Reilly Developer Weblogs site, touching on using XSLT, VBA, Perl, Ruby, and more to get those Word docs into shape.
Andrew Savikas VP, Digital Initiatives O'Reilly Media, Inc.
For the Caravel Project (an OSS enterprise CMS) we chose RTF 2 HTML and HTMLTidy to automatically convert RTF files to HTML during the upload process. Despite the limitations, we found that exporting to RTF and doing our own conversion produced far cleaner code than anything MS did. If your Word documents are text-only, you can get away without additional editing unless the document uses lots of over-ridden stylesheets--the converter respects the stylesheets, while Word respects the overrides, which can yield some unpredictable results.
We also recently switched editors to TinyMCE which has very reasonable 'paste from Word' and 'paste as plain text' features.
I'm also interested in checking out the DocFrac project (http://docfrac.sourceforge.net/status.html) which looks like it might be a step up from RTF 2 HTML. While I think we're offering reasonable solutions, I would still consider Word conversion to be one of the weaker features.
Michael Sherer
http://caravelcms.org/
Not sure if anyone's mentioned this yet, but I recently found this little gem from Microsoft which does a pretty good job of cleaning up their horrendous MS Office HTML output.
I've tried just about every free solution out there, and this seems to do the job better than anything else. Ironic really.
I always go for standards-compliance so I don't tolerate any junk HTML code. This utility has saved me countless hours of hacking away at Word XP's awful HTML output.
HTML Tidy cleans HTML, and has a special function for cleaning Word HTML junk.
It must be terrible to work at Microsoft and always do mediocre work.
--
If you support dishonesty and violence, don't say you are Christian.
Have a look at the Servoo project. Might be just what you are looking for. http://www.servoo.net/
grow a farkin brain, moran!!
.doc to .html.. give me a farkin break.. how does this junk get cleared.
jesus.
convert
had i'd asked that completely stupid question, i'd of been flamed worse than the likes of hell..
The tool you use to convert Word docs to other formats (HTML and PDF included) is, for the most part, irrelevant if those Word docs lack internal structure (semantic information), which in Word comes in the form of paragraph styles. These paragraph styles are analogous to HTML tags, like h, p, li, and so on. Unfortunately, most people who use Word are oblivious to the existence of these styles. In the typical Word doc, all paragraphs are just "Normal" with a ton of inline formatting; headings may look like headings (bigger font, bold) but they don't have a heading style. The structure of these docs is, thus, hidden from screen readers. Garbage in, garbage out.
Actually, converting semantic Word docs to other formats isn't very difficult; the greater challenge is teaching authors to use Word properly. (Why can't MS better document styles, templates, et al?) The time investment is definitely worth it for people who use Word on a regular basis.
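To illustrate why styled documents convert so easily, here is a minimal Python sketch. It assumes some other tool has already extracted each paragraph as a (style name, text) pair; the STYLE_TO_TAG table and the paragraphs_to_html function are hypothetical names made up for this example, not part of any converter mentioned in this thread.

```python
from html import escape

# Hypothetical mapping from Word paragraph style names to HTML tags.
# Extend it with whatever styles your authors actually use.
STYLE_TO_TAG = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Heading 3": "h3",
    "List Bullet": "li",
    "Normal": "p",
}


def paragraphs_to_html(paragraphs):
    """Turn (style_name, text) pairs into HTML, one element per paragraph."""
    out = []
    in_list = False
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style, "p")  # unknown styles fall back to <p>
        if tag == "li" and not in_list:
            out.append("<ul>")  # open a list at the first bullet paragraph
            in_list = True
        elif tag != "li" and in_list:
            out.append("</ul>")  # close the list when the bullets stop
            in_list = False
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)
```

If every paragraph arrives as "Normal", all you get back is a pile of p tags, which is the garbage-in, garbage-out problem in a nutshell.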
BTW, when you convert Word to PDF in Acrobat, you can embed Word style information in the PDF in the form of tags, Adobe's answer to PDF accessibility issues. It works pretty well (if the Word doc has meaningful styles) and helps ensure your PDF files are 508-compliant.
Good luck.
Hahaha!! Here is his website. Let's bring his servers down!
Or as it is now known, wvWare. Includes wvHtml, which, "converts word documents into W3C certified HTML4.0 format." FOSS (GPL) command line.
The script (decss.sed) is:
O TTOM::g
s:STYLE=\"[ a-zA-Z0-9\:;-]*\"::Ig
s:</FONT>::Ig
s:<FONT[ -=\"A-Z0-9]*>::Ig
s:BORDER=[0-9]*::Ig
s:ALIGN=B
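For anyone without sed handy, the same idea can be sketched in Python. This is not a faithful port of the script above (which is cut off here): the patterns are my own, slightly broader guesses at what the intact rules are after, and strip_word_markup is a made-up name.

```python
import re

# Patterns loosely modelled on the intact sed rules; all case-insensitive.
PATTERNS = [
    r'\s*style="[^"]*"',  # inline style attributes
    r'</font>',           # closing font tags
    r'<font\b[^>]*>',     # opening font tags with any attributes
    r'\s*border=\d+',     # unquoted numeric border attributes
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]


def strip_word_markup(html):
    """Remove presentational attributes and font tags from an HTML string."""
    for rx in _COMPILED:
        html = rx.sub("", html)
    return html
```

Regexes like these are fine for predictable Word output but will trip over anything unusual, such as a quote character inside a style value, so eyeball the result before publishing.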
LedgerSMB: Open source Accounting/ERP
I also often have to batch process many Word documents into HTML. I've had success with running the file through antiword, outputting it as a text file, and then loading it into the CMS I use for the site.
For static files, run the file through antiword, output to text, and run a simple regexp across the file that wraps the first line in an H1 tag (presumably this would be some sort of header) and adds a break tag to every line break.
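A rough Python sketch of that last step, assuming the plain text has already come out of antiword (or anything else) as a string. The function name text_to_html is made up, and the escaping of special characters is my addition rather than part of the recipe above.

```python
from html import escape


def text_to_html(text):
    """Wrap the first line in <h1> and put a <br> at every later line break."""
    lines = text.strip().splitlines()
    if not lines:
        return ""
    heading = f"<h1>{escape(lines[0].strip())}</h1>"
    # Escape each remaining line so stray < and & don't break the page.
    body = "<br>\n".join(escape(line) for line in lines[1:])
    return heading + ("\n" + body if body else "")
```

Calling text_to_html on the contents of the text file gives you a fragment you can drop into whatever page template the site already uses.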
See this site for PureText.
I believe the magical tool you are looking for is called the LART.
The shareholder is always right.
See now that would actually be racist, if India was in the Middle East, where I hear there are lots of deserts and sand dunes and stuff. But you see, India's not in Middle East. So apart from a very small desert, there's also a huge coastline full of beaches, jungles, urban agglomerations and ..oh yeah the HIMALAYAS, with some of the tallest mountains in the world, covered in snow. So, please take the time to come up with a slightly more informed racist comment, it'll make you look smarter, and might even get you laid more often.
Seriously, if we wanted all of the smart tags and proprietary mark-up that Word generates, shouldn't that be an option that we would choose to include in our exported Word->HTML files? In the year 2005, "Export to HTML" should do just that: export to clean HTML.
How hard can that be? It's got to be easier than exporting all of the other junk that Word currently generates.
If you think about this problem for more than two minutes, it's pretty clear that Microsoft has made the decision not to export to clean HTML.
PDF the documents through Adobe. It would probably be your best bet, as there is no magic solution to Word HTML, other than to start from scratch or quit and find a new job.
Simply don't accept any format other than HTML. Deploy a simple and useful HTMLArea script so users can edit documents (or copy-paste from Word) directly inside the web browser. You can learn more about HTMLArea at http://www.htmlarea.com/ (there are plenty of them). It is released under a BSD license and plays nicely with Mozilla and IE 5.5+.
My choice is the one from InteractiveTools: http://sourceforge.net/projects/itools-htmlarea/
1) Google for: clean up word html
1.1) Click I am lucky (if you type it yourself)
1.2) Download and use HtmlTidy, without checking I am sure it is the first hit.
2) I don't think you have to worry about keeping your sanity. Too late.
Congratulations! How did you pass the job interview?
Nevermind.
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
Seeing that Word now (supposedly) saves in some sort of "M$ Sane" type of XML schema, it should be possible to write a script to reform it into some sane (as in standards-compliant sane) HTML.
http://www.microsoft.com/office/xml/default.mspx
Otherwise, I know OpenOffice saves documents as a handful of XML files and zip-compresses them, so you don't have to worry about more than one file per document.
Either way, it should be possible to make some sort of sense of the XML file, and write some sort of script to translate the sane XML to HTML...
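For the OpenOffice route, here is a small Python sketch of what such a script might look like. It assumes an OpenDocument (.odt) file, where the body text lives in content.xml under the OASIS "text" namespace; older .sxw files, as far as I know, use different namespace URIs, so TEXT_NS would need changing. odt_to_html is a made-up name, and the sketch handles only headings and paragraphs, ignoring lists, tables and images.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# Namespace used by OpenDocument text elements (assumption: .odt, not .sxw).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_to_html(source):
    """Convert headings and paragraphs of an .odt file (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    out = []
    for el in root.iter():
        if el.tag == f"{{{TEXT_NS}}}h":
            # Heading level comes from the outline-level attribute, clamped to 1-6.
            level = int(el.get(f"{{{TEXT_NS}}}outline-level", "1"))
            level = min(max(level, 1), 6)
            out.append(f"<h{level}>{escape(''.join(el.itertext()))}</h{level}>")
        elif el.tag == f"{{{TEXT_NS}}}p":
            # itertext() flattens inline spans, so character formatting is dropped.
            out.append(f"<p>{escape(''.join(el.itertext()))}</p>")
    return "\n".join(out)
```

Throwing away the inline formatting is deliberate here: for a site that wants visual consistency, structure is the only thing worth carrying over.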
seems like the easiest way to post standard content that will display correctly anywhere and everywhere
No, seriously
For my website, http://www.badstep.net/, I edit the files in abiword then use an ant script to drive the export method from the CLI of Abiword.
The reason I use Abiword is that its HTML export is that much better than anything else (at the time I looked at it), including XHTML, which is cool for XSLT to transform.
I just wish the ant developers would integrate Cygwin better into ant so's the whole operation could be seamless.
Patriotism is a virtue of the vicious
I have had some of the same trouble. My best free option to date is to get a blogger.com account.
- Start to create a new blog post.
- Copy the junk into the compose tab editor.
- Switch over to the "Edit Html" tab
- Copy all the HTML code except for the initial embedded graphic and paste it into your new HTML document.
You are now golden.
Anyhow, it's only a matter of time before companies start using cheap African labor. (2 weeks later they'll remember that blackies are a bunch of five-fingering, lazy ass, crackhead sex fiends that want to rape the CEO's hot teen daughter).
I work for a university in Scotland. I have used CG a few times and I love the clean HTML it generates.
There are a few formatting problems with it, but nothing a few minutes in Notepad can't fix.
All spelling mistakes are due to solar flares...honest
I wanted to do something like this a few weeks back and ran into similar problems. The difference was that I was starting from OpenOffice sxw files. I thought, and it would seem that I was wrong, that OOo could run the file against a custom XSLT style sheet to create output. If it can, I couldn't get it working, and no one replied to my request for more information on it. The solution I decided upon (but haven't got round to implementing) is to simply write a little code to unzip the sxw file and run the relevant file against a style sheet. Not as nice as getting Writer to do it, but it should work. You would need to add another step in front of that, opening the Word document and saving it as an sxw, but that is easy enough and could probably be scripted.
I used to have a better sig but it broke.
Sorry if this has already been mentioned (I didn't find TinyMCE, though).
I recently found out about TinyMCE (no, I don't work for them). It's a WYSIWYG editor that adds text-processing features to HTML textareas. This one's especially cool because it produces clean XHTML code, can be completely modified, and is free.
So, I would strip the functionality of TinyMCE down (it's really easy) to a format dropdown box (p, h1, h2, h3 ...), numeric and bullet lists, bold and italic. Then I'd copy the whole Word document into the textarea, removing all of Word's formatting. And then I'd apply all the markup via TinyMCE ... clean, fast and consistent with your existing design.
Take a look at the examples (remember, you can remove all the stuff you don't want).
"Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML."
Are you trolling? When they start sending in MS Word documents, we stop calling them "academic users", regardless of how useful MS Word is for their particular needs. No self-respecting academic uses MS Word, and at my old university you would quite deservedly get a low mark if you turned something in that had been made in MS Word. (And yes, that would be noticeable; footnotes all over the place, misnumbered TOCs, and lack of substance, because the students often spent up to 90% of their allotted time in wrestling with the program instead of writing the damn paper.)
If I were you, I would just outlaw MS Word. Force users to use more sensible formats. It's good for you and even better for them.
Macromedia Dreamweaver MX (the version I currently have) has a little utility under the Commands menu, which is called "Clean Up Word HTML." It does a passable job. You can choose between cleaning up Word 2000/2002 or Word 97/98. Check boxes let you choose to remove all Word-specific markup, clean up CSS, clean up font tags, set background color, or apply source formatting. It's possible that the more recent versions of Dreamweaver can do more, but I can't say for sure.
If you have Dreamweaver, sometimes you can convert the Word document into Word HTML, open it in Dreamweaver, and use the clean up Word HTML feature. It's not always 100%, but it helps.
Make the people who want this stuff posted give it to you in HTML instead of Word. HTML is not that difficult, and unless they are doing some goofy things in Word they will only need to learn a few tags.
Although it doesn't exactly meet your requirements, printing to PDF and then posting the PDF would preserve all of the original formatting and would demand very little effort or time.
Don't Do It!
I have to support a bunch of Contribute users at work, and the keys are constantly getting corrupted on the users' PC's. I may have less-clueful users than many, but I suspect they're representative.
There is no 'i' in team, but there is in fiasco...
Give your document providers a LaTeX stylesheet. Then they can get on with producing content, while your stylesheet handles _all_ formatting and possibly layout.
I can throw myself at the ground, and miss.
I don't know if anybody posted this, but depending on how much you want to formalize the content-push process, then perhaps Tidy is what you are looking for.
http://tidy.sourceforge.net/
If you are considering any kind of scripting solution, I would look into it. I would also include the DreamWeaver option as well in your thinking, if you are not considering a scripted solution.
Best of luck! ~tim
Some power user you are. You can get it from the Russians [allofmp3.com] for a buck 46.
This was a quote of Kurt Vonnegut that didn't fit.
Surely:
1) Document a DTD for your contributors and make them aware of it (I'm sure they already exist for academic papers and such); the DTD can be used in conjunction with some editors to validate the document produced.
2) Use Xalan or similar stylesheet transformation tool to parse document and generate HTML.
3) Create a page that your contributors can use to upload their documents and see how they look once style has been added.
4) Write something to transform word documents into your DTD.
5) Clean up documents that you have translated yourself
6) Bounce documents back to their owner citing web based preview tool if they do not conform.
Easy!
Now you just got to implement all that stuff...:)
Good luck.
I use Dreamweaver MX 2004. I first pull the doc into Word and save it as HTML, then load it into Dreamweaver, where there is an option to clean up Word HTML, and voilà! Out comes my perfectly clean, still-formatted Word doc with a new CSS section in the head that is perfectly W3C compliant.
A Finnish company, Davisor, has developed a pure Java tool called Offisor that converts Word documents to a corresponding custom rich XML format (XMSW). The tool can be embedded in any Java-compatible application or service, or it can be used as a standalone (command line) application.
There is a free downloadable demo version available for anyone who wishes to try it out. The company also sells some feature-rich XSL transformations from the custom XMSW format to popular standard formats, including XHTML.
Please note that I'm an engineer working for this company, and therefore at least part of my pay-check comes from Offisor sales.
Use a CMS like Plone (built on top of Zope). Its built-in document type can automatically convert input documents in various formats (including MS-Word) into something more web-friendly.
Admittedly, it has been some time since I have used BBEdit, but the app used to have a feature called "Remove Gremlins" that got rid of a majority of the word trash.
For instance, the Microsoft Office interface is "obvious" - because they publicly released the software.
If someone made StarOffice (later OpenOffice) while under NDA at Sun that NDA would not keep them from releasing parts of that code unless the programming techniques used were different than what any programmer would expect.*
(Copyright WOULD prevent them from releasing it, of course!)
But the fact that a clone of MSOffice is "obvious" does not mean it is trivial, it was still a huge project.
*Curiously, I think reverse engineered MS protocols/formats probably ARE covered by NDA even at a company reverse-engineering them. But if the OpenOffice format is published it would no longer be protected.
I am not a lawyer.
Ben
Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
I'm a tech writer and when I go to a new job, one of the first things I do is survey the existing documents and create a Word stylesheet that incorporates:
- the three most common headings (H1, H2, and H3)
- the most common paragraph format (in two formats - one with a space after, the other without a space after the paragraph)
- and any special paragraph formats I see that I know users are married to.
Then I start producing and make the stylesheet available, so that before long - miracle of miracles - a number of people are using it and simplifying my job a lot.
(By the way, I also publish a glossary early on so that everybody knows how to spell email, log-on, user ID, and Internet, as well as present phone numbers without parentheses. Every company has a list of vocabulary and words that need standardization.)
I am expected to create trade show booth-size graphics from logos that people send - in Word.
"Would it help if we brought you our business card to scan?"
No, sir, it f*cking would not. But thanks for trying.
The House Between - Original Sci-Fi Series
Evaluation version, works for 10 days, and only 5 files at a time. http://www.flash-utility.com/download/doc2html.exe
sorry if someone else has posted the same thing before, but I can't go through all the 500 comments before posting this.
Manojar - pronounced like Manager
PDF export via OpenOffice? It's a standard and it respects page breaks.
Be or ben't
Yeah, I hate that about Word. I have a Perl script that strips out the worst of the junk that MS Word seems to add. It does the job for me. Your mileage may vary.
http://bill.herrin.us/freebies/striphtml.pl
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
From the sourceforge page:
I'm using this to convert all of our internal documentation. It does a pretty good job, even converts the images and acts in a relatively reliable manner with 2003, 2000, & 97 formatted files. There's some oddball output sprinkled in, but nothing a little sed fanciness can't fix.
Back in 1999, I was hired by a company to fix just such a problem as this Word to HTML thing.
I think Microsoft made this horribly mangled HTML in an effort to thwart Netscape. I fixed it by creating a program that parsed the HTML file, repaired the mistakes and spit out a new file. Worked pretty good as I recall.
codifex
Open the word document with WordPerfect and export it to pdf. done.
You should save them as pdf files (or whatever) and store them in a database with some metadata. That way you can build a searchable archive of everything you ever post. It would be pretty straight forward to build a dynamic page that builds itself from the database. That way you never edit the page source, you only make changes to the database when you want the page content to change.
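A bare-bones sketch of that archive in Python with SQLite, just to show how little is needed. The table layout, column names and the three helper functions are all invented for this example; a real site would want more metadata and proper full-text search.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the archive database; defaults to an in-memory one."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               author TEXT,
               posted TEXT,          -- ISO date string, e.g. 2005-08-01
               body BLOB NOT NULL    -- the stored file itself (PDF or other)
           )"""
    )
    return conn


def add_document(conn, title, author, posted, body):
    """Store one file with its metadata and return the new row id."""
    cur = conn.execute(
        "INSERT INTO documents (title, author, posted, body) VALUES (?, ?, ?, ?)",
        (title, author, posted, body),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, term):
    """Find documents whose title or author contains the term, newest first."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id, title, author, posted FROM documents "
        "WHERE title LIKE ? OR author LIKE ? ORDER BY posted DESC",
        (pattern, pattern),
    )
    return rows.fetchall()
```

The dynamic page then only has to call something like search() and link to each row; nobody edits page source by hand.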
I have a self-help Web site and it has tonnes of pictures. HTML (I call it MS-HTML) saved from Word would turn all PNG graphics into bloated GIF/JPG files and would only look decent in IE.
I switched to OpenOffice and I am better for it. In v1.1.4, I have to fix the line spacing in each HTML file but then it looks identical in IE and FF. This is fixed in OOo 2.0 beta.
My advice: Unless you plan to give your DOC files to other people, switch to OpenOffice. Give PDFs to other people and export to HTML for your site in one step.
Layout of menus is 90% like Word and it's free.
Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
Control-A, Control-C, open WordPad, Control-V.
Control-A, Control-C, goto your CMS, Control-V.
Retains all the links and text formatting but gets rid of all the garbage, and you don't have to pay an outrageous amount of money for Dreamweaver.
I haven't seen your site (although there are those in this discussion that have found it, mwahaha), but if you haven't already, you may want to consider a database-driven site. All of the supporting documents that relate to your main file could be uploaded to the database and pulled out dynamically as needed. I've done quite a few implementations of this where we take Word documents and save them to the database using a tool called FCKeditor. It has a function for pasting in content from Word docs. We then save that to the database, create the relationships, and voilà! It's done.
FCKeditor is open source, but if you need any help getting it up and running let me know.
THIS SPACE FOR RENT
Under Windows, some dynamic languages like Python and Ruby have COM and Win32 APIs. Use one of them to access the Office COM objects, iterate through and grab the content you want, and have it spit out the HTML you need.
Since shortly after the introduction of the typewriter, publishers have required that submissions be typed, not hand-written. In the early days, a typewriter was an incredibly expensive machine. But budding authors either bought one or rented one if they wanted their work to be published. Try submitting a hand-written (or probably even TYPED) document to a modern publisher and you will get a polite response (which will include your unread masterpiece) that suggests that you read their guidelines and try again.
Just because MS Word has a 90% market share among the non-computerati doesn't mean that you must do away with submission standards and guidelines. I'd suggest that you write up what you expect, and if a submission doesn't conform, it should be returned, unread, with a politely worded note explaining:
1) Why it was rejected.
2) Where to go to find the guidelines.
3) Hints on how to conform to those guidelines (Word's Save As.. text/html or similar.) based on the format submitted.
The internet should not be multi-cultural when it comes to simple content. All such content should be submitted as HTML, period. This is not a heavy burden, takes only a few seconds per document and the author has the greatest stake in getting it on the web.
Submission guidelines: Good enough for print publishers, good enough for web publishers.
Here is a thought to consider, the page I am typing in right now as I write this text contains submission guidelines. These guidelines include: the acceptable text tags, how to format URLs and even hints on how to be a good submitter. Everyone who has responded to you (and even you, yourself) have managed to follow those guidelines. I bet your customers can too.
I would suggest Chami's HTML-Kit. It has a fine Word 2K tag removal feature. It has saved me a few hours of staring blankly at thousands of lines of useless MS Source.
There is this handy tool that can go through all of the HTML for you and do all kinds of custom formatting, cleaning up and simplification custom per your specifications!
It is called an intern.
....has a great "Clean up Word HTML" filter. Use that, then use the "Clean up HTML" filter in the same menu, specifying that it also take out font and span tags.
Then go through and search and replace the few that are left. Works great.
or copy plain text and reformat.
I've got the same situation. All the content is generated in word, I need to get it live in HTML
Just copy and paste the Word document into plain text and open it in Dreamweaver. Then it's style sheets to the rescue, as you (should) only need to enter a tiny bit of markup.
... Americon ...
Come on someone, bring out some Transformers jokes!
Someone had to do it.
I always have the font layout and text size problems with Word docs. My question is do you think longhorn oops I mean vista will fix this problem?
Why do it the hard way? Use something like LyX as a frontend. It offers a nice equation editor and reduces the need for manual tex/latex writing significantly (basically, if you want something fancy, you may have to do it manually, but routine stuff becomes pretty transparent).
(Well, it does not help with the original question about Word->HTML conversion...)
Everyone who makes generalizations should be shot.
If you really want to keep an ASCII text version, you can use the "Extract Text" feature of Acrobat, possibly performing OCR (which is a built-in function of Acrobat) on the document first. Granted, this is a $500 solution (I'm a hardcore open-source software user and _never_ use commercial software personally), but it works extremely well.
Open Office Version 1.9.122 (2.0-beta) is quite good for this.
Load Micro$oft Word file.
Export to HTML/PDF/whatever format you like. I've used it for my novel, and use both export-as-pdf and save-as-html, and with the exception of multi-columned text, saving as HTML works perfectly. Saving as PDF works perfectly for everything (including multicolumned entries and embedded fonts), as this example shows.
The Future of Human Evolution: Autonomy
Use PDF995 http://www.995software.com/ to convert the docs to PDF, then convert the PDF to HTML.
"I am a Bomb-Disposal Technician. If you see me running, try and keep up."
There are several free tools to convert Word to PDF (and do a really good job).
Lex/flex + yacc/byacc/bison
Or just use Perl, that's what it's there for.
Non sequitur: Your facts are uncoordinated.
Wow. Thanks for the reference to that wysiwyg textarea editor!!! For me, in my work, this is huge! I'm going to implement this ASAP!
Appreciatively,
Seth
$5 / month hosted VPS on linux = awesome!
OpenOffice saves files as an XML format. Many languages, such as PHP, have external libraries that can read, parse or rewrite XML documents. Once you have the file stripped down to just the important XML sections, it should be very trivial to rewrite this into HTML. I did something kind of like this for a client who wanted to alter RSS news sources on a local website.
Copy the text, and paste it into Nvu without formatting (found in the menus, not the default option unfortunately). From there it shouldn't be a big deal to format it however you like.
Under capitalism man exploits man. Under communism it's the other way around.
bbedit has a neat feature which allows for batch editing of files, and if you wish, allows you to verify the change before committing... many programming tag libraries are available (html,perl,java,php), it really helps process large number of files. plus, if you use it with tiger, i bet you could do some funky automator scripts to really custom fit your needs.
three can keep a secret, if two are dead - benjamin franklin
Just expanding on your suggestion...
Perhaps he could use the OpenOffice API to automatically have a server-side instance of OpenOffice open submitted Word documents and save them as HTML. This should happen at the same time the user uploads the document - that way the user could preview the conversion to HTML, and if it was flawed, he could choose to publish the document as PDF.
OpenOffice API:
http://api.openoffice.org/
Code snippet shows simplicity of converting OpenOffice Writer SXW document into PDF:
http://codesnippets.services.openoffice.org/Writer/Writer.StoreWriterAsPDF.snip
Perhaps a few small changes here would get him what he wants.
Perl interface (ooolib):
http://ooolib.sourceforge.net/doc/ooolib-0.1.5-doc.html#info
There are also Java code snippets. I think it would be possible to convert the OOBasic snippet above to either Java or Perl.