Slashdot Mirror


Sanely Moving from Word to the Web?

FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"

547 comments

  1. Scrapping by fembots · · Score: 5, Interesting

    Interestingly, I have a similar job on a website (no link for you too, Slashdot hordes!), here's what I do (I'm sure there are smarter ways):

    1. Place all "to-process" documents in a specific folder in a webserver
    2. Write a script to read those documents
    3. Use Regex (and similar functions) to strip off and/or replace specific tags/wordings (similar to web scraping techniques).

    Admittedly it was a tedious job at first to identify every possible template, however I'm amazed how predictable some documents are and once you get hold of such "blueprint", you can reformat documents to HTML/XML fairly easily.

    Once the changes are done, I then preview them in a browser, and if everything's as expected, I simply save the page and use it; if not, it's easy enough to make a few tweaks from the familiar HTML environment.
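
    The three steps above could be sketched roughly as follows. This is only an illustration in Python (the poster does not name a language); the rule list, the function names `scrub` and `scrub_folder`, and the `*.html` file pattern are all assumptions, and the patterns target markup that Word's HTML export commonly produces rather than any particular template.

```python
import re
from pathlib import Path

# Hypothetical rules: (compiled pattern, replacement) pairs, applied in order.
RULES = [
    (re.compile(r"<!--.*?-->", re.S), ""),                # comments, incl. conditional ones
    (re.compile(r"</?o:p[^>]*>", re.I), ""),              # Office namespace tags
    (re.compile(r"</?(?:span|font)\b[^>]*>", re.I), ""),  # presentational wrappers
    # class/style/lang attributes, quoted or unquoted
    (re.compile(r"\s+(?:class|style|lang)=(?:\"[^\"]*\"|'[^']*'|[^\s>]+)", re.I), ""),
]


def scrub(html: str) -> str:
    """Apply every rule to one document and return the cleaned markup."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


def scrub_folder(src: Path, dst: Path) -> None:
    """Read each to-process file in src and write the cleaned copy to dst."""
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.html")):
        cleaned = scrub(path.read_text(encoding="utf-8", errors="replace"))
        (dst / path.name).write_text(cleaned, encoding="utf-8")
```

    The attribute rule is deliberately crude: it will also eat text such as ` class=foo` if that ever appears in the prose itself, which is why previewing the result in a browser remains part of the workflow.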

    1. Re:Scrapping by dougmc · · Score: 2, Insightful
      Use Regex (and similar functions) to strip off and/or replace specific tags/wordings (similar to web scraping techniques).
      Of course, you really can't properly parse HTML just using regular expressions. You can get it right 90% of the time relatively quickly, and a day or so of work will get you 5% more, but you could spend weeks trying to get that last 5% -- and never quite get it.

      It's really better to use things that other people have made for parsing html. For example, if you use perl (and you should -- it's the ideal tool for this), HTML::Parser works pretty well, though there's a significant learning curve in using it.
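
      To illustrate the point with an existing parser, here is a minimal sketch using Python's standard `html.parser` module instead of the Perl module named above (a swapped-in tool, not the poster's). The class name `TextExtractor` and the helper `visible_text` are made up for the example. It pulls out visible text while skipping `<style>` and `<script>` bodies.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <style> and <script>."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the document's visible text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

      A naive pattern such as `<[^>]*>` stops at the first `>` it sees, so an attribute like `title="a > b"` leaks half a tag into the output; the parser treats the quoted value as part of the tag.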

    2. Re:Scrapping by cloudmaster · · Score: 2, Informative

      For simplifying documents, I've found HTML::TreeBuilder to be a handy module. Then you just have to write code to simplify HTML (throw away useless tags, merge adjacent tags, etc), rather than worrying about reformatting word docs.
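
      A rough sketch of the "throw away useless tags, merge adjacent tags" idea, written in Python with `xml.etree.ElementTree` rather than HTML::TreeBuilder. It assumes the input has already been made well-formed (for example by a tidying step), and the `USELESS` and `MERGEABLE` sets and the function names are illustrative choices, not anything from the module mentioned above.

```python
import xml.etree.ElementTree as ET

USELESS = {"span", "font"}                   # wrappers to unwrap (assumed list)
MERGEABLE = {"b", "i", "em", "strong", "u"}  # inline tags worth merging (assumed)


def _add_text(parent, index, text):
    """Attach text immediately before child position `index` of parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def simplify(elem):
    """Unwrap useless wrappers and merge adjacent identical inline tags, in place."""
    for child in list(elem):
        simplify(child)                      # work bottom-up
    i = 0
    while i < len(elem):
        child = elem[i]
        if child.tag in USELESS:
            # Replace the wrapper with its own text, children, and tail.
            elem.remove(child)
            _add_text(elem, i, child.text)
            for offset, grandchild in enumerate(list(child)):
                elem.insert(i + offset, grandchild)
            _add_text(elem, i + len(child), child.tail)
            continue                         # re-examine whatever now sits at i
        if i > 0 and child.tag in MERGEABLE:
            prev = elem[i - 1]
            if prev.tag == child.tag and prev.attrib == child.attrib and not prev.tail:
                # Fold this element into its identical left-hand neighbour.
                _add_text(prev, len(prev), child.text)
                prev.extend(list(child))
                prev.tail = child.tail
                elem.remove(child)
                continue
        i += 1
```

      Real Word output is rarely well-formed XML, so in practice a lenient HTML tree builder would have to stand in for `ET.fromstring`; the tree-walking logic stays the same.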

    3. Re:Scrapping by Daengbo · · Score: 2, Interesting

      Or, you could convert all those Word docs to Writer and use Ant like this guy did to xsl transform the xml into a website. I discovered this website because I'm starting to write a CMS in PHP5 which automatically adds content from OO.o documents.

      Alternately, you could use Writer2Latex to generate XHTML 1.0 strict for yourself.

      Those two methods seem the easiest.

    4. Re:Scrapping by sushibot · · Score: 1

      Copy and paste from Word to WordPad, then to Dreamweaver.

    5. Re:Scrapping by aGuyNamedJoe · · Score: 1

      Some time ago, for a similar job, I did pretty much as you suggest.

      Then I'd use the SaveAsHTML from word and hack the html with a perl script. In those days the HTML was really bad and took a lot of fixing. These days it's not as bad.

      Still, I've found it easier to save as text, and now use a script that makes some standard insertions, calls Markdown (from Daring Fireball) to create basic HTML, and includes a CSS file reference.

      Then I have another Perl script that lets me put Perl regular expressions (generally substitutions) in another file and applies those changes for me. That way I don't have to edit the Perl script, just the file of regexes.

      I generally have to iterate a few times, but because the files I'm working on are reports from a fixed group of people, the changes are relatively small, and I can often reuse many of the files from the previous round of these files.

      I can often get a set of reports (< 10) up in a couple of hours -- in nasty cases it may be easier to edit by hand.
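
      The "file of regexes" idea could look something like this in Python (the poster's version is Perl). The file format here is an assumption made up for the sketch: one rule per line, pattern and replacement separated by a tab, with blank lines and `#` comments ignored. `load_rules` and `apply_rules` are hypothetical names.

```python
import re


def load_rules(text):
    """Parse 'pattern<TAB>replacement' lines into compiled substitution rules."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue                          # skip blanks and comments
        pattern, _, replacement = line.partition("\t")
        rules.append((re.compile(pattern), replacement))
    return rules


def apply_rules(rules, document):
    """Run every substitution over the document, in file order."""
    for pattern, replacement in rules:
        document = pattern.sub(replacement, document)
    return document
```

      Because the rules live in their own file, each round of reports only needs that file tweaked, and rules from a previous round can be reused as they are.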

    6. Re:Scrapping by griffjon · · Score: 1

      Unless they were created by people who drink the format koolaid and use styles instead of font changes and so on, I suggest the following method:

      As the above poster suggested, you put them all into one directory.

      Now, this is where it differs.

      You delete them and run screaming.

      Beyond some basic paragraph markings, I'd say you're pretty hosed in terms of automation. Now, if they did correctly use styles, you can search-and-replace within Word for style/formatting and put tags around it.

      But most likely, you're hosed.

      --
      Returned Peace Corps IT Volunteer
    7. Re:Scrapping by Sheriff+of+Rockridge · · Score: 1

      or straight to dreamweaver (code view)

    8. Re:Scrapping by ozmanjusri · · Score: 1

      or straight to dreamweaver (code view)

      Macromedia's own tool for converting Word etc to web pages is Contribute http://www.macromedia.com/software/contribute/productinfo/overview/. It would be an easy tool to set up so the users could do their own updates, or you could keep it to yourself.
      Another option would be to use Microsoft's own tools for the job. Infopath/Sharepoint are intended as a toolchain for getting Office documents to inter/intranets, but they'd be harder (and more expensive) to set up.

      --
      "I've got more toys than Teruhisa Kitahara."
    9. Re:Scrapping by TCDooM · · Score: 1

      FreeTextBox.com is for you.

    10. Re:Scrapping by Anonymous Coward · · Score: 0

      Simple. Here's the Microsoft-approved programmatic way:

      http://support.microsoft.com/default.aspx?scid=kb;en-us;291325

      There's also a RegEx way we've found to work pretty well, but Slashdot's dumbass filter won't let me post it.

    11. Re:Scrapping by Baorc · · Score: 1

      Copy and paste from Word to WordPad, then to Dreamweaver.

      Actually, I've had huge, huge documents to convert from Word to HTML, and I had some graphics in there as well. What I used to do is copy-paste to FrontPage; it would render the tables much better, and occasionally it would even render the pictures well too. On top of that, it isn't cluttered with a bunch of dirty HTML code.

      Anyways, that's what I use, I'm sure you could just write a script after copy pasting that would get rid of some small constant details.

    12. Re:Scrapping by fshalor · · Score: 1

      Did it render the tables "correctly" or did it just "look" correct. :) ...

      [me tries to shake off toooo many bad renders...]

      --
      -=fshalor ::this post not spellchecked. move along::
    13. Re:Scrapping by Baorc · · Score: 1

      It rendered them pretty well. So I would say correctly, because the only thing I had to change was the width (it's in pixels rather than %, and for the job I needed % to accommodate people on 800x600 and 1024x768 resolutions); I also needed to change the border width, color, cellspacing and cellpadding. It basically put the table on the default look, which is better than a ton of code trying to make it "look" correct. No style code either. I was pretty impressed. And as well, the FrontPage working page was a lot closer to its preview page than Dreamweaver's working page was to its preview page.

    14. Re:Scrapping by Joseppi+Blauinski · · Score: 0

      I recommend Beer be included. Doesn't get the job done any faster or better, but you could care less.

    15. Re:Scrapping by jc42 · · Score: 1

      It's really better to use things that other people have made for parsing html. For example, if you use perl, HTML::Parser works pretty well, though there's a significant learning curve in using it.

      Well, I've tried that; more often than not, I gave up.

      The problem is that I usually had to use it with HTML that came from a Microsoft product. HTML::Parser did pretty well with well-formed HTML, but when it came to the malformed stuff, it usually couldn't handle it sensibly. I found too many cases where some critical part of the text was simply missing from HTML::Parser's parsing. Or if it was there somewhere, I couldn't find it.

      I've found that a better approach is to use HTML::Parser initially, but when it fails on the real-world input, don't waste too much more time. Just write a quick-and-dirty parser that handles the minimal markup needed to get the information out. Pass on a few tags; delete the rest. Don't worry about doing a perfect job of every little detail, and especially don't endanger your own sanity by complaining about the garbage that passes for HTML these days.

      This saves a lot of time and grief in the long run.

      I've often wished I didn't have to say such things. But I have no power over Microsoft's developers ...
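
      A minimal sketch of the "pass on a few tags; delete the rest" approach, in Python rather than Perl. The whitelist in `KEEP` and the name `minimal_markup` are assumptions for illustration. Kept tags lose all attributes except an `href` on links; everything else is dropped while its text content stays.

```python
import re

KEEP = {"p", "i", "em", "b", "strong", "a", "br"}   # assumed whitelist

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9:]*)\b[^>]*>")
HREF = re.compile(r'href\s*=\s*"([^"]*)"', re.I)


def _rewrite(match):
    """Return the bare form of a kept tag, or nothing for any other tag."""
    closing, name = match.group(1), match.group(2).lower()
    if name not in KEEP:
        return ""
    if closing:
        return f"</{name}>"
    if name == "a":
        href = HREF.search(match.group(0))
        return f'<a href="{href.group(1)}">' if href else "<a>"
    return f"<{name}>"


def minimal_markup(html):
    """Drop comments and style/script blocks, then filter the remaining tags."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    html = re.sub(r"<(style|script)\b.*?</\1\s*>", "", html, flags=re.S | re.I)
    return TAG.sub(_rewrite, html)
```

      This is exactly the kind of shortcut that breaks on a `>` inside an attribute value, so it trades correctness on odd input for being short enough to adjust per batch.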

      --
      Those who do study history are doomed to stand helplessly by while everyone else repeats it.
  2. Handy alternative to Notepad by bigwavejas · · Score: 1
    I use a program called, "EditPlus" http://www.editplus.com/ It has syntax highlighting for the most common extensions (HTML, CSS, PHP, ASP, Perl, C/C++, Java, JavaScript and VBScript).

    What I basically do is paste the document into EditPlus, then I use a function called "Replace" to get rid of the big stuff and edit out the rest of the tags manually. It may not be the best solution, but it's visually easier than just using notepad.

    --
    "Simplify, simplify, simplify!" Thoreau
    1. Re:Handy alternative to Notepad by Anonymous Coward · · Score: 1, Informative

      Why use EditPlus when you can use Crimson Editor, which is free, open source, and has all the capabilities of EditPlus and then some? (http://www.crimsoneditor.com/) The built-in macro functionality is really sweet too!

    2. Re:Handy alternative to Notepad by maotx · · Score: 1

      I've personally always enjoyed Scite.
      Free, os, macros, cross compatible (Linux, Windows), and recognizes syntax from multiple languages.

      --
      I'm a virgo and on Slashdot. Coincidence? Yes.
    3. Re:Handy alternative to Notepad by Anonymous Coward · · Score: 0

      Why not just use Emacs, which does all of the above and then some? It is available for Windows you know.

    4. Re:Handy alternative to Notepad by Anonymous Coward · · Score: 0

      A free alternative to EditPlus is ConTEXT (http://context.cx/). It's fast and lightweight, customizable, actively developed, has a good support community, has a lot of nice features and even more features planned (some already implemented) for the next version.

    5. Re:Handy alternative to Notepad by Anonymous Coward · · Score: 0

      Don't forget that Crimson also has a built-in FTP browser!

    6. Re:Handy alternative to Notepad by B3ryllium · · Score: 1

      When it does syntax highlighting, can it change the background colour - so, for instance, inline PHP code would appear on a grey background, and inline HTML would appear on a white background?

      That, along with project management, is my favourite feature of Allaire HomeSite 4.5 :)

    7. Re:Handy alternative to Notepad by kelnos · · Score: 1

      Funny, poking around the Crimson Editor website, it doesn't appear to be open source. The license says it's redistributable, but there doesn't appear to be a source package. The download page calls it 'freeware', and says you can redistribute it as long as you don't modify it (though the actual license on the 'tips & notice' page doesn't say anything about modification). Not that not being OSS is a bad thing, necessarily (I don't subscribe to RMS' religion), but it's kinda lame to label something OSS when it clearly isn't.

      --
      Xfce: Lighter than some, heavier than others. Just right.
    8. Re:Handy alternative to Notepad by trewornan · · Score: 1

      That's confusing! For a few seconds I thought you were talking about ConteXt which I don't think would make a good replacement for a text editor.

    9. Re:Handy alternative to Notepad by c0d3h4x0r · · Score: 1, Flamebait

      Why not gnaw your own arms off and independently figure out quantum mechanics, since you're already willing to endure the pain of learning how to configure Emacs?

      --
      Moderator hint: a comment is neither "Flamebait" nor "Troll" if it is true.
    10. Re:Handy alternative to Notepad by laxiepoo · · Score: 1

      I'm a huge believer in Edit Plus. It's been my favorite text editor for the past 6 years (geez, has it been that long already??). I'm actually going from a poorly, POORLY written Word doc to an HTML page right now. Just doing it the old fashioned way, but I wanted to read the comments here. I did a CTRL-A, CTRL-C, CTRL-V into a blank TXT file in Edit Plus. I'll mark it up and drop it into an HTML page template for our website later.

    11. Re:Handy alternative to Notepad by arch_avaj · · Score: 1

      I have Crimson Editor, and it was decent for a simple text editor. The macros can make it easy to generate and test code quickly too.

      But what I really prefer to use now is this Eclipse Plugin, Colorer which does do the above mentioned highlighting for recognizing inline code. Well I haven't tried it on PHP, but it recognizes a lot of languages and syntaxes, and worked well with JSP.

    12. Re:Handy alternative to Notepad by Freexe · · Score: 1
      I can't live without jEdit. I don't know how I lived without code folding before this day!

      Plus it's free and Open Source and by far the best editor out there.

      --
      "In a time of universal deceit - telling the truth is a revolutionary act." - George Orwell
    13. Re:Handy alternative to Notepad by Fez · · Score: 1

      EditPlus is nice, but it's still a far cry from UltraEdit.

      I just wish there were an open source editor (for Windows or *nix) that came close to its functionality and ease of use. I've come to depend on load/save directly to FTP/SFTP, great column editing, etc. Kate comes close, but not quite close enough...

    14. Re:Handy alternative to Notepad by FooAtWFU · · Score: 1

      Actually.... I already do, in combination with emacs and tidy. It's still very tedious work.

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    15. Re:Handy alternative to Notepad by inphorm · · Score: 2, Informative

      I personally use dreamweaver for coding.. I know, I know, all that gui overhead and only semi-compliant code if it generates it itself.. but it does have the useful clean up word html tool, then I get to working it over in pure code.

      works for me anyway..

      - paul

    16. Re:Handy alternative to Notepad by Anonymous Coward · · Score: 0

      Why make trillions, when you can make... (pinkie on the lips) billions?

    17. Re:Handy alternative to Notepad by qw(name) · · Score: 1

      The only bad thing about jedit is its print function. When I print out a color document such as my perl code, bolded letters oftentimes overlap other characters. When I need a color print-out I open the code up in Vim and print it from there. :)

      I love Vi but it is hard to beat all the features jedit has built in.

    18. Re:Handy alternative to Notepad by magma · · Score: 1
      "gVim" http://www.vim.org/. A little bit of a learning curve but once you learn how to do range substitutions it is cake.

      :%s/find/replace/g
      :.,$s/find/replace/g
      :.,.+3s/find/replace/g

      But these are just simplified sed scripts. Better to drop vim and just run sed.
      Sed - it's there - use it.
      (Windows is no longer a good excuse http://www.cygwin.com/)
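
      For readers who don't use vim: the three commands above substitute over the whole file (`%`), from the current line to the end (`.,$`), and over the current line plus the next three (`.,.+3`). The same idea as a small Python sketch, where `substitute` and its 0-based `start`/`stop` parameters are names invented for the example and the pattern uses Python's `re` syntax, not vim's or sed's:

```python
import re


def substitute(lines, pattern, replacement, start=0, stop=None):
    """Apply a regex substitution to lines[start:stop] only (stop is exclusive)."""
    stop = len(lines) if stop is None else stop
    regex = re.compile(pattern)
    return [
        regex.sub(replacement, line) if start <= i < stop else line
        for i, line in enumerate(lines)
    ]
```

      With the cursor on line index `n`, `:%s` corresponds to the defaults, `:.,$s` to `start=n`, and `:.,.+3s` to `start=n, stop=n + 4`.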
    19. Re:Handy alternative to Notepad by Anonymous Coward · · Score: 0

      Because without my arms, I won't be able to learn emacs. Duh.

    20. Re:Handy alternative to Notepad by Kick+the+Donkey · · Score: 1
      I don't think CE is oss. Free? Yes. Opensource? No.

      I used to use CE all the time, but I grew tired of it throwing up on large files. I switched to Programmers Notepad about a year ago, and never looked back.

      --
      /. is a bunch of nerds at a million typewriters. It's not a political conspiracy determined to undermine your beliefs.
    21. Re:Handy alternative to Notepad by luna69 · · Score: 1

      I've got to second this.

      UltraEdit offers exactly one thing: everything I need in an editor. While it's not OSS, it's cheap, regularly updated (but without breaking itself when new versions are released), has all of the features I could want: macros, extensible syntax highlighting, column editing, function collapse, templates, HTML validation & tidying, project & workspace management, etc.

      Recommended.

      --
      No gods, no demons, and no masters. Secular Humanism!
  3. PDF? by Anonymous Coward · · Score: 2, Insightful

    How about Word -> PDF -> HTML?

    Just a thought ... and probably a dumb one.

    1. Re:PDF? by cloudmaster · · Score: 2, Funny

      I hate you and your kind. Yes, hate. :)

    2. Re:PDF? by lebow · · Score: 1
      This is only a good idea if it is an alternative content. Don't make me leave my web browser, and don't make me install a plug in.

      (it would also be nice if it worked in lynx)

    3. Re:PDF? by fbjon · · Score: 1

      How about OCR? No seriously, no need to print it out, just make image files of every page, and scan the output into some simple html format with and where appropriate.

      --
      True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.
    4. Re:PDF? by ImaLamer · · Score: 2, Interesting

      "Print to PDF" seems to be the function that would solve all of these problems, but so would any others. Think you *could* print to a TIFF, PDF, virtually any image type with a *nix Word compatible program - then you can scan the image and OCR it to plain text. Antiword (mentioned by another /.er: http://www.winfield.demon.nl/) can convert DOC to plain text... there are thousands of options.

      However, if someone is getting the idea for another open source project to solve this dilemma, then I'd suggest something that can render DOC to HTML on the server side. That would allow those who just know how to "setup" a webserver to sit back and let the software deal with people's problem with not using standard types. Parse the Word, WordPerfect, OpenOffice, RTF, whatever and render it in HTML. This would allow anyone in a company to dump the document on the server/share and let it be viewed by anyone else.

      But there are limitless options like this http://www.doc-api.com/ found on google...

    5. Re:PDF? by keltor · · Score: 1

      Cause it would just be a waste of bandwidth? I am actually going to try the word->pdf (this will be pdf with text, not images) -> html

    6. Re:PDF? by FooAtWFU · · Score: 2, Informative

      Not quite what I'm looking for. Maybe I should clarify: I want to remove the nonessential formatting, while keeping certain niceties (in particular, italics for the names of papers they reference, hyperlinks for footnotes, etc) and convert the rest into something simple and plain with just-the-basics of HTML, so I can then style it to match the other pages on the site. Many of these documents go to collections: encyclopedia articles, book reviews, abstracts of papers. If they don't look consistent, then people do complain. (And my site has enough formatting-consistency issues as it is ;)
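
      One way to read that requirement is as a filter that keeps italics and links, keeps basic block structure, and drops everything else. The following is a hedged sketch with Python's standard `html.parser`; the `BLOCKS`, `VOID` and `SKIP` sets, the rule that an italic `font-style` on a `span` or `font` becomes `<i>`, and the names `Simplifier` and `simplify_html` are all assumptions for illustration, not a description of what any tool in this thread does.

```python
import re
from html import escape
from html.parser import HTMLParser

ITALIC_STYLE = re.compile(r"font-style\s*:\s*italic", re.I)
BLOCKS = {"p", "h1", "h2", "h3", "ul", "ol", "li", "blockquote"}  # kept as-is
VOID = {"br", "img", "hr", "meta", "link", "input"}               # dropped outright
SKIP = {"head", "style", "script"}                                # contents dropped


class Simplifier(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []     # (source tag, closing markup owed) per open element
        self.skipping = 0   # >0 while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        closer = ""
        if tag in SKIP:
            self.skipping += 1
        elif self.skipping:
            pass
        elif tag in ("i", "em") or (
            tag in ("span", "font") and ITALIC_STYLE.search(attrs.get("style") or "")
        ):
            self.out.append("<i>")
            closer = "</i>"
        elif tag == "a" and attrs.get("href"):
            self.out.append('<a href="%s">' % escape(attrs["href"], quote=True))
            closer = "</a>"
        elif tag in BLOCKS:
            self.out.append(f"<{tag}>")
            closer = f"</{tag}>"
        self.stack.append((tag, closer))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return                               # stray end tag, ignore it
        while self.stack:
            open_tag, closer = self.stack.pop()
            if open_tag in SKIP:
                self.skipping -= 1
            self.out.append(closer)              # also closes unclosed children
            if open_tag == tag:
                break

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def simplify_html(html):
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

      The output carries no classes or inline styles at all, so the site's own stylesheet decides how every collection looks.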

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    7. Re:PDF? by try_anything · · Score: 1

      I also hate you. PDF is just terrible for online browsing. Don't do this to your readers!

    8. Re:PDF? by Taladar · · Score: 1

      Why do you accept Word documents in the first place? Usually scientific documents are much better off when written in Tex, Docbook or something similar. Especially Docbook offers all the options you mention without offering any layout in the content document (total separation of layout and semantics).

    9. Re:PDF? by child_of_mercy · · Score: 1

      Beat me to it.

      But you don't even need to buy acrobat if you print to a .ps or .prn file, then use ps2html.

      --
      'There is a Light that never goes out.'
    10. Re:PDF? by FooAtWFU · · Score: 2, Insightful

      The people we're dealing with here are social sciences people, specifically economics. I'd be perfectly fine with taking DocBook or TeX documents, but nobody's going to send them. It's not happening. We accept Word documents because we have ALWAYS accepted Word documents and most contributors probably aren't even aware that something like TeX or DocBook even exists, let alone how to use it. And they're not willing to learn it just to send us stuff.

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    11. Re:PDF? by MarsLander · · Score: 1

      On Windows, it's easiest to create a PDF printer, using either Cute PDF or FreePDF XP. Both products are simply a wrapper for AFPL Ghostscript, and create a printer that you can print to from any application to create a PDF. Oh, and they're free (beer I think).

      I even find that they produce better PDFs than Adobe Distiller a fair bit of the time.

      Cute PDF
      FreePDF XP

    12. Re:PDF? by child_of_mercy · · Score: 1

      But in both cases they're producing postscript and then re-encoding to pdf, which is an unnecessary step.

      --
      'There is a Light that never goes out.'
    13. Re:PDF? by MarsLander · · Score: 1

      Are we interested in the means or the end here? You can go straight to PDF with the proprietary Acrobat Distiller, but I find that the resulting PDF is often larger and/or looks worse anyway.

      We're really talking about Word->PDF->HTML here anyway.

    14. Re:PDF? by child_of_mercy · · Score: 1

      No,

      Just because you don't notice the extra step doesn't mean it isn't happening.

      All print operations involving pdf generate postscript.

      Which then gets compressed down to pdf, even if you aren't aware of it.

      So you're doing Word-->PS-->PDF-->HTML

      I'm saying the PDF step there is not necessary.

      Possibly depending on your pdf to html converter you're actually doing the even worse: Word-->PS-->PDF-->PS-->HTML

      Most *nix PDF handlers also do PS transparently in my experience.

      If the file is particularly large and richly formatted even on a screaming new machine you can save a lot of time and grunt by cutting out the PDF steps.

      --
      'There is a Light that never goes out.'
    15. Re:PDF? by the_womble · · Score: 1
      I can see you have to accept Word. However, is there any way you can make contributors aware of alternatives? You would be doing everyone a favour if you could.

      For example, if there is a contributors area on the web site you could put up a page there describing the advantages to both you and them of other formats. Tell them about easy to use software (i.e. NOT editing TeX in Vi!). My favourite is Lyx. It is hardly difficult to use, it can produce DocBook, LaTeX, HTML, PDF etc and it is easier to use than MS Word for any document longer than a letter.

    16. Re:PDF? by Anonymous Coward · · Score: 0

      Jesus Christ I am happy you do not work for me. Tell me, does your mommy know you are hanging around here??

    17. Re:PDF? by tigersha · · Score: 1

      What makes you open source freak whackos think you can force your favorite toys down everyone's throats in the first place?? They are your users, and they will use whatever they damn please to get their job done. They do not exist for your convenience, you exist for theirs.

      --
      The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
    18. Re:PDF? by anthony_dipierro · · Score: 2, Insightful

      Do you really think the extra bandwidth costs would be more than the cost of his salary while coming up with a better solution?

    19. Re:PDF? by Evil+Grinn · · Score: 1

      Slap a web interface on your site that allows (forces) the AUTHORS to enter their content into a plain textarea. Spend your time working on the design of the site and let them worry about the content.

    20. Re:PDF? by dfiguero · · Score: 1

      I thought TeX was like the l33t stuff for academics to use to write their papers...

      --
      My penguin ate my sig
  4. Tedious cleanup? by Timesprout · · Score: 1, Funny

    Sounds like a job for Mrs FooAtWFU

    --
    Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
    What truth?
    There is no dupe
    1. Re:Tedious cleanup? by FooAtWFU · · Score: 1

      This is pure speculation based on your post, but at a guess... I don't suppose there is a Mrs. Timesprout, now, is there?

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
  5. Two words: by Anonymous Coward · · Score: 0

    Chinese Children.

    Coincidentally, the captcha for this post is "chinked". Fucking hillarious.

  6. PDF? by night_flyer · · Score: 2, Insightful

    you can either hot link to the .doc themselves in a new window or convert to .pdf and do the same thing

    --


    Thanks to file sharing, I purchase more CDs
    Thanks to the RIAA, I buy them used...
  7. Use antiword by shura57 · · Score: 2, Informative

    It takes Word file and spits out plain text. It can also do some more tricks.

    1. Re:Use antiword by uberdave · · Score: 1

      I tried it before, but it merely spit out keystrokes and mouse gestures.

      Oh wait! You're serious.

    2. Re:Use antiword by cloudmaster · · Score: 2, Informative

      He's obviously got access to Word, and saving as text in Word at least preserves most of the whitespace-based formatting. IIRC, antiword is mostly useful as a last-ditch effort to read a word doc, a step above piping the doc through "strings". :)

  8. Dreamweaver by necro2607 · · Score: 5, Informative

    I would suggest using Macromedia Dreamweaver... it's what we use where I work and essentially all of our content entry involves pasting in content from Word documents supplied by clients. Dreamweaver is pretty good for formatting and working with stylesheets.

    1. Re:Dreamweaver by fean · · Score: 5, Informative

      In Dreamweaver, there's a command "Clean up MS Word HTML". It's made to clean up Word's crappy HTML, and does a pretty nice job of it.

    2. Re:Dreamweaver by kortex · · Score: 1

      Ditto on this idea - Dreamweaver also includes a nifty "Clean Up Word HTML" tool that really kicks ass. It will nicely and completely sterilize all the redundancy and nested bs --- as well as Word-specific tags to leave you with nice, clean HTML afterwards.
      Apply style sheets to format and voila! - yer done :)

      --
      -- kortex "Not everything that counts can be counted, and not everything that can be counted counts"
    3. Re:Dreamweaver by zerofret · · Score: 1

      I also find Dreamweaver to be pretty good at converting Word DOCs into HTML. It doesn't do a fully automated conversion as I've got some mandated from above 'corporate image' weirdness I have to adjust for, but it gets me about 80% of the way there.

    4. Re:Dreamweaver by Anonymous Coward · · Score: 0

      "Dreamweaver is pretty good for formatting and working with stylesheets."

      Are you serious!!?? I almost spilled my coffee all over my keyboard when I read that line. Tables are also perfect for Standards Based XHTML layout!

      Dreamweaver absolutely butchers Stylesheets...

      I don't mean any disrespect to the poster, but unless you switch away from Dreamweaver and Tables, you are toast in the near future for Web Design.

    5. Re:Dreamweaver by drmike0099 · · Score: 3, Funny

      The OP is correct: 1) Open Dreamweaver. 2) Commands > Clean Up Word HTML... 3) Rejoice

    6. Re:Dreamweaver by ricosalomar · · Score: 0

      They're all correct. Crappy as DW can be, it does this job really, really well. I use it daily as a first step to posting our internal docs to our intranet.

    7. Re:Dreamweaver by ozbon · · Score: 2, Informative

      Yeah, I found that the "clean up word docs" did about 80% of the work, and then it was just a matter of a buttload of search/replace stuff in order to get it to finish the rest.

      Worked pretty well, once I'd got the search/replace stuff sussed out.

      Mind you, on a big word file you can think it's crashed when actually it's just doing lots of thinking...

      --
      I say we take off and nuke it from orbit. It's the only way to be sure...
    8. Re:Dreamweaver by Orrin+Bloquy · · Score: 1

      Dreamweaver does nothing of the sort to stylesheets... unless you're writing crap with syntax-based hacks in it (e.g. badly formed comments).

      Assuming you write normal CSS, DW handles it like it handles HTML: a language with a particular syntax and structure.

      About the only thing DWMX doesn't do well with CSS is handle nested sheets (supposedly fixed in MX04). Our site redesign relies on one sheet to manage CSS menus, and the core sheet imported by the pages imports the menu.css. This resulted in DW's Design Mode slowing to a crawl and refusing to display anything below the scroll area. We changed the menu.css' @import url to an absolute URL (which DW can't understand) and the problem was fixed.

      But the only time I've ever seen DW mangle a sheet was when the sheet's code used 'hide from browser X' malformed comment hacks, typically the ones aimed at IE 5 and older versions of Opera. Can you point to posts on css-discuss supporting your position?

      The grandparent poster never implied tables were acceptable formatting.

      --
      "Made up/misattributed quote that makes me look smart. I am on /. and I must look smart."
    9. Re:Dreamweaver by Anonymous Coward · · Score: 0

      No, no, no. Don't mess with exporting to HTML in Word and then opening in Dreamweaver and doing "clean up." And don't copy and paste either.

      Open a new document in Dreamweaver and do an Import > Word Doc. What you then get will come in cleaner than any of the other mentioned alternative methods.

    10. Re:Dreamweaver by uberchicken · · Score: 0

      I'm hoping Dreamweaver is going to fix the problems we're having with Word documentation containing images that we generate .html files from, in order to compile into a .chm help file. Your post gives me hope!

      I have to see if this really works: thank you, I'd like a "+1 Interesting".

    11. Re:Dreamweaver by necro2607 · · Score: 2, Insightful

      We use Dreamweaver exclusively for HTML/CSS programming and it doesn't create code that doesn't work - in fact the code it creates is very very compatible: I've never had to manually tweak code to obtain proper design and layout with the sites I've built.

      Indeed we'd love to move to advanced CSS for page formatting but that's a big step right now - there are no professional WYSIWYG editors that have the sheer range and quality of features we need - Page templates, ability for clients to update the site later in a very convenient WYSIWYG interface, high compatibility with ultra-common web media such as Flash, etc. etc...

      Trust me we're keeping our eyes peeled for a better solution but right now Dreamweaver is the best available. Sometimes simply sticking to "standards" isn't necessarily the best idea. In fact, sticking to proper standards creates sites that differ in appearance from browser to browser. Dreamweaver has very impressive awareness of inconsistencies and standards-deviation in many browsers.

    12. Re:Dreamweaver by Anonymous Coward · · Score: 5, Informative

      Also note that you have the ability to cut and paste formatted text from Word into the 'Design View' within Dreamweaver and DW will automatically reformat the incoming text appropriately. In my brief test to make sure I wasn't talking out my a** I found it even supports Word tables properly.
      If you paste text into the Code view, DW removes the formatting completely and just uses the raw text.

    13. Re:Dreamweaver by darkonc · · Score: 1
      ... "clean up word docs" did about 80% of the work, and then it was just a matter of a buttload of search/replace stuff in order to get it to finish the rest.

      That boatload of search/replace stuff might be able to be replaced with a perl/sed/awk script.

      If you're in an all-Windows shop, you can always load up knoppix to do that part -- or set aside 10 MB to do a desktop install of your favorite distro (knoppix is, once again an option) and dual boot. Better yet, just find an old machine in some storage room, somewhere that you can assign to the task.

      --
      Sometimes boldness is in fashion. Sometimes only the brave will be bold.
    14. Re:Dreamweaver by sconeu · · Score: 1

      Or Cygwin.

      --
      General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
    15. Re:Dreamweaver by shyster · · Score: 1
      That boatload of search/replace stuff might be able to be replaced with a perl/sed/awk script. If you're in an all-Windows shop, you can always load up knoppix to do that part -- or set aside 10 MB to do a desktop install of your favorite distro (knoppix is, once again an option) and dual boot. Better yet, just find an old machine in some storage room, somewhere that you can assign to the task.

      Yeah - that's much easier than just grabbing the Windows ports. Or Cygwin.

    16. Re:Dreamweaver by kderby2000 · · Score: 2, Informative

      Macromedia Contribute might be a better tool. It's (basically) a web browser, where you surf to a page, hit "edit", make your changes, and it posts back to the site. You can also upload Word & Excel. A site can also be set up to honor style sheets, Dreamweaver templates, etc.

    17. Re:Dreamweaver by bach37 · · Score: 1, Funny

      in Dreamweaver, there's a command "Clean up MS Word HTML".

      You mean the menu option that says 'Delete All'?

    18. Re:Dreamweaver by Anonymous Coward · · Score: 1, Informative

      My experience:

      I have DW MX on a PC and it doesn't do the table copying and pasting from Word.

      On my Mac I can copy tables from Word and paste them in to DW MX 2004 and they'll drop in nicely, minus all the funky Word stuff. DW will also attempt to keep logical formatting in good HTML. Which is nice.

      Alternatively, depending on the complexity of the Word docs, consider saving them all out as plain text and then going back to format correctly. If you have many tables and images this could be a pain, but if you have reams of text it may be much quicker.

    19. Re:Dreamweaver by necro2607 · · Score: 1

      Yeah that's what we suggest our clients use so they can update their sites in the future. Then they can work with the templates and files we've created just as you mentioned.

    20. Re:Dreamweaver by zippthorne · · Score: 1

      a live cd would be easier for just a few runs, you don't have to install anything on the windows drive and figure out where it all is, mess with paths and whatnot: it's all set up on the cd. but the inconvenience of rebooting would limit its usefulness.

      --
      Can you be Even More Awesome?!
    21. Re:Dreamweaver by racermd · · Score: 1

      The solution I came up with where I work is very similar, but I've been able to use some Regex to do the rest of the "Find/Replace" work.

      Keep in mind, I use this to *remove* Word 2003 extra tag garbage and leave basic, mostly non-formatted text in the HTML:

      "Span" tags:
      [<][s][p][a][n][\s]*[\w]*[\W][\w]*[>]
      -AND-
      [<][/][s][p][a][n][>]

      "Location" tags (State/City/Place/Etc.)
      [<][s][t][1][\s]*[\W]*[\w]*[\s]*[\w]*[\W]*[\S]*[>]

      Of course, this isn't an all-inclusive list. But it's definitely a large chunk of what Dreamweaver's cleanup utility misses. Afterwards, manual formatting is much easier as there's much less to get in the way.

      Oh, and if anyone wants to see Microsoft's regular expression reference, it can be found here.
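
      A rough Python sketch of the same cleanup, for anyone not working in VBScript. It assumes the "location" tags are Word 2003 smart tags with an st1: prefix (for example <st1:City>), which is a guess based on the pattern above; the function name is made up for illustration. Like the originals, it removes only the tags and leaves the text inside them alone.

```python
import re

# Opening <span ...> tags with any attributes, and closing </span> tags.
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)

# Assumed form of the Word 2003 smart tags: <st1:City ...> ... </st1:City>.
SMART_TAG = re.compile(r"</?st1:[^>]*>", re.IGNORECASE)


def strip_word_tags(html: str) -> str:
    """Remove span and st1 smart-tag markup, keeping the enclosed text."""
    html = SPAN_TAG.sub("", html)
    return SMART_TAG.sub("", html)


if __name__ == "__main__":
    sample = ('<p><span style="font-size:12pt">Hello</span> '
              '<st1:City w:st="on">Boston</st1:City></p>')
    print(strip_word_tags(sample))
```

      This is still regex-on-HTML, so an attribute value containing a literal > would trip it up; it is a list of the common cases, not a parser.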

      --
      My sources are unreliable, but their information is fascinating. -- Ashleigh Brilliant
    22. Re:Dreamweaver by dubl-u · · Score: 1

      4) Now clean up Dreamweaver's HTML.

      (To be fair, I haven't actually tried this. My bias is entirely from some hideous globs of tag soup that a Dreamweaver user has handed me. It might be a PEBKAC issue rather than a Dreamweaver issue.)

    23. Re:Dreamweaver by Anonymous Coward · · Score: 0

      i used DWMX '04 on XP in my testing for my previous post. When i pasted into DW it quite nicely handled the table as well as all font formatting (bold, italic, etc).

    24. Re:Dreamweaver by dubiousmike · · Score: 1

      This is if you paste Word HTML into Dreamweaver HTML.

      It is better to copy and paste the text into Dreamweaver and then do your quick formatting using Dreamweaver. There is still shit HTML in Dreamweaver after using the "clean up MS HTML" command...

    25. Re:Dreamweaver by Anonymous Coward · · Score: 0

      you need the latest version of DW and the latest version of Office on XP to be able to cut and paste tables.

    26. Re:Dreamweaver by Anonymous Coward · · Score: 1, Interesting

      From my experience I've found that you have to do the 'clean up word docs' twice to get most (if not all) of that extra garbage.

    27. Re:Dreamweaver by KevinH456 · · Score: 2, Informative

      Dreamweaver MX 2004 goes one step farther and allows you to copy and paste word documents INCLUDING clip art and drawn images and graphs. It will automatically take all that and make it decent markup. It even picks up formatting. It strips all the word specific crap for you and then you can just format it as you like (using a stylesheet of course to make your life so much easier). From here you have nice html you can import into a CMS or code you can insert in your HTML template.

      We do this where I work. My job involves taking hundreds of word documents from professors and formatting them for online coursework at a major university in florida. Dreamweaver has made my life so much easier.

      --
      All sigs are created equal.
    28. Re:Dreamweaver by ace_brickman · · Score: 0, Offtopic

      I don't want to start a holy war here, but what is the deal with you Mac fanatics? I've been sitting here at my freelance gig in front of a Mac (a 8600/300 w/64 Megs of RAM) for about 20 minutes now while it attempts to derive some simple HTML from a 17MB Word file in Dreamweaver MX. 20 minutes. At home, on my Pentium Pro 200 running NT 4, which by all standards should be a lot slower than this Mac, the same operation would take about 2 minutes. If that. Try resetting your DW preferences to "default" and convert again. Tables and everything..

      --
      Users of the world: We're here to help you, but help us help you. (your IT dept)
    29. Re:Dreamweaver by brettper · · Score: 1

      It's a real shame that this command isn't available in Word.

      Oh well

    30. Re:Dreamweaver by osssmkatz · · Score: 1

      Perhaps they use Tidy. I know I won't get as many mod points, but hey, karma isn't everything. Tidy should do a nice job of cleaning up Word HTML/X(H)ML: http://tidy.sourceforge.net/ (it also has a simple win32 editor, links to win32 help files, and good documentation for all operating systems). I generally use it integrated with a good editor: www.chami.com (Windows and WINE [seriously]).

    31. Re:Dreamweaver by ozbon · · Score: 1

      I agree - but as it was with a local authority/council, who thought that even the Win2K I was using was new-fangled (and this was working for them last year) and they preferred to stick with Win98, such concepts as Knoppix and LiveCDs were a bit of a no-no.

      I could've used them, sure, but if anyone else had seen it would've been more hassle than it was worth.

      --
      I say we take off and nuke it from orbit. It's the only way to be sure...
    32. Re:Dreamweaver by sehryan · · Score: 1

      You can save your searches in Dreamweaver to be reused. Very helpful if you find yourself repeating a particular search.

      --
      The world moves for love. It kneels before it in awe.
    33. Re:Dreamweaver by gnu-generation-one · · Score: 1

      "I would suggest using Macromedia Dreamweaver..."

      Dreamweaver's "sanitise Word's HTML" filter seems to lock-up the computer once you feed in anything larger than 50 pages or so

    34. Re:Dreamweaver by Evil+Grinn · · Score: 1

      [s][p][a][n][\s]*[\w]*[\W][\w]*

      How is this different from "span\s*\w*\W\w*" ?
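
      For the pattern itself, a single character inside brackets is a one-member character class, so the two spellings should match the same text. A quick Python check of that claim (the sample strings are made up, and this says nothing about how VBScript treats the brackets):

```python
import re

# The bracketed spelling from the parent post and the plain spelling.
bracketed = re.compile(r"[s][p][a][n][\s]*[\w]*[\W][\w]*")
plain = re.compile(r"span\s*\w*\W\w*")

SAMPLES = ["span class=MsoNormal", "span lang=EN", "spanx", "SPAN style"]


def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    found = pattern.search(text)
    return found.group(0) if found else None


def same_matches(text: str) -> bool:
    """True when both spellings match exactly the same substring."""
    return matched_text(bracketed, text) == matched_text(plain, text)


if __name__ == "__main__":
    for text in SAMPLES:
        print(repr(text), same_matches(text))
```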

    35. Re:Dreamweaver by Anonymous Coward · · Score: 0

      i'd just like to point out that one experience with a mac is anecdotal evidence that is hardly empirically validated and definitely not means for a generalized derision of mac "fanatics"

    36. Re:Dreamweaver by rabbit994 · · Score: 1

      It is. I found Dreamweaver's code in design mode to be pretty decent. Frontpage is far worse, though I would still do almost all my coding by hand in Dreamweaver unless you're under a time limit or it's something tedious like tables or ordered lists.

    37. Re:Dreamweaver by PriceIke · · Score: 1

      The 8600 is eight years old. For frak's sake, get a decent computer to do your work on. I don't know any "Mac fanatics" that still use things like 8600s for real work these days. I'd consider myself a "Mac fanatic" and I have a two year old G5.

      --
      It's not a lie. It's the truth with lossy compression.
    38. Re:Dreamweaver by dave420 · · Score: 1

      Dreamweaver produces really terrible code. Every WYSIWYG editor does. Word's HTML is even worse, but dreamweaver's isn't fit to be put on a website.

    39. Re:Dreamweaver by KevinH456 · · Score: 1

      Dreamweaver MX 2004 does a pretty clean coding job. Combined with Word import, built in validation, regex find/replace, auto formatting of source, etc.... it's actually not a bad alternative. You will have to hand tweak the output, but the time savings is more than worth that.

      --
      All sigs are created equal.
    40. Re:Dreamweaver by darkonc · · Score: 1

      Besides... The whole idea is to sneak Linux in the back door. if you explain to them that it's legally free, they might be a bit less worried about it.

      --
      Sometimes boldness is in fashion. Sometimes only the brave will be bold.
    41. Re:Dreamweaver by racermd · · Score: 1

      Yeah.... I'm used to using Regex in vbscript code I write on a regular basis, and the brackets tend to help with that.

      Besides, I just copy-paste the regex I want to use from my handy plain-text cheat/helper file. I had the brackets in there when I started putting it together, and I haven't really "fixed" it in any meaningful way - ever. It just works.

      --
      My sources are unreliable, but their information is fascinating. -- Ashleigh Brilliant
  9. Antiword by alanp · · Score: 3, Informative

    Try antiword, it's got a real decent HTML option.

    --

    Alanp

  10. Sounds like you should release on sourceforge by arete · · Score: 5, Interesting

    So if there's only a few templates and they were a pain to work out, how about releasing your regex scripts to sourceforge or similar? Or posting here?

    --
    Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
    1. Re:Sounds like you should release on sourceforge by Anonymous Coward · · Score: 3, Insightful

      Since they were produced at work, the copyright on them is probably owned by the company and not by him.

      Plus the templates are probably in-house templates and thus would be useless outside of the company.

    2. Re:Sounds like you should release on sourceforge by iron-kurton · · Score: 1

      I'm not the parent poster, but since he works for a company, chances are he cannot publicly post something that falls under the NDA.

      --
      Change is inevitable, except from a vending machine -- Robert C. Gallagher
    3. Re:Sounds like you should release on sourceforge by extrasolar · · Score: 2, Funny

      See, if we were really elite, we would all automatically know how to do stuff like this with our favorite editor. There would be no Ask Slashdot. There's a reason why emacs is one of the most popular editors around and it's because it saves us from having to do this kind of repetitive text work that should be done autonomously.

      But we're not elite and I'm now going to learn how to do macros in emacs :)

    4. Re:Sounds like you should release on sourceforge by Anonymous Coward · · Score: 0

      No worries... we're all family here. Just post it! No one will ever know.

    5. Re:Sounds like you should release on sourceforge by einhverfr · · Score: 1

      I did something similar with SED. Not completely foolproof but something like:

      s:STYLE=\"[ a-zA-Z0-9\:;-]*\"::Ig
      s:::Ig
      s:::Ig
      s:BORDER=[0-9]*::Ig
      s:ALIGN=BOTTOM::g

      This does not fix everything, but it is a good start. Then I can use CSS to apply the changes I want.
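
      For those without sed handy, roughly the same rules in Python. Only the three rules whose patterns are visible above are carried over; the two rules that show up empty are left out rather than guessed at. The function name scrub is made up for illustration.

```python
import re

# (pattern, replacement) pairs mirroring the visible sed rules, in order.
RULES = [
    # s:STYLE=\"[ a-zA-Z0-9\:;-]*\"::Ig  (case-insensitive)
    (re.compile(r'STYLE="[ a-zA-Z0-9:;-]*"', re.IGNORECASE), ""),
    # s:BORDER=[0-9]*::Ig  (case-insensitive)
    (re.compile(r"BORDER=[0-9]*", re.IGNORECASE), ""),
    # s:ALIGN=BOTTOM::g  (case-sensitive, as in the sed script)
    (re.compile(r"ALIGN=BOTTOM"), ""),
]


def scrub(html: str) -> str:
    """Apply each substitution to the whole string, like sed's g flag."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


if __name__ == "__main__":
    print(scrub('<IMG SRC="a.gif" ALIGN=BOTTOM BORDER=0 style="color: red;">'))
```

      As with the sed version, a style value containing characters outside the bracketed set (a period in 12.0pt, say) is left in place, so this is a starting point rather than a complete scrubber.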

      Since the script originally scrubbed out style tags, I called it decss.sed.....

      I should publish it on my web site :-)

      --

      LedgerSMB: Open source Accounting/ERP
    6. Re:Sounds like you should release on sourceforge by Anonymous Coward · · Score: 0

      You know, if you want a good operating system with a smaller footprint, switching to Linux+Bash is a LOT nicer than emacs. If you want a good text editor, allow me to recommend vi. If you just can't live without the emacs operating system, you can always run vi on it.

    7. Re:Sounds like you should release on sourceforge by c0n0 · · Score: 1

      Well, AFAIK, if he did it at work and he is an employee (salary) the intellectual property belongs to the company that hired him.
      If he is an independent contractor, there should be a contract explicitly stating the work to be done, and outlining who the owner of the intellectual property is. If such contract does not exist, then I *believe* he could do it.

      I could be wrong though.

    8. Re:Sounds like you should release on sourceforge by elrous0 · · Score: 1
      But we're not elite

      Speak for yourself. I've got the leet skillz to pwn your ass.

      Wait a minute... We were talking about Halo 2, right?

      -Eric

      --
      SJW: Someone who has run out of real oppression, and has to fake it.
    9. Re:Sounds like you should release on sourceforge by Bob+Uhl · · Score: 1
      Well, emacs keyboard macros are very useful, but essentially they just remember each keystroke you type. You probably want to look into writing elisp functions. It's a pretty simple little language, but is extremely powerful. Most likely, you'll end up with a large number of save-excursion and replace-regexp calls. But once it's written once, it'll be written forever.

      If you really wanted to be clever, you'd define a standard interface for mapping vendor HTML into real HTML, and offer the possibility for plug-in modules, and then you'd take over the world:-)

    10. Re:Sounds like you should release on sourceforge by Anonymous Coward · · Score: 0

      Probably because the post was made up. Check the guy's posting history, he's a whore.

    11. Re:Sounds like you should release on sourceforge by extrasolar · · Score: 1

      Well, one thing I realized recently is that just being able to use a text editor with macros solves 90% of the tedious text editing jobs out there. Yes, keyboard macros don't sound all that powerful, but their utility is in how easy and quick that you write one up. As you are doing some sort of text-conversion in the editor, you notice a pattern in your keystrokes, you quickly put it into a macro. Then, when you notice an even wider pattern, you quickly edit your macro to take that into account. The usefulness in this is that it's dynamic. Creating elisp functions, on the other hand, is static. There is a lot of stuff you'll only need to do in one session. That's where macros really shine.

  11. Textism by NoInfo · · Score: 4, Informative

    Here's a tool I saw linked off of O'Reilly Radar once:

    http://textism.com/wordcleaner/

    I used it once and it did a pretty decent job at preserving the tables. Yet if they're using anything odd like graphics or it's been incredibly tweaked, it probably won't be 100% perfect.

    1. Re:Textism by e**(i+pi)-1 · · Score: 3, Informative

      A standalone Perl script I use daily is demoronizer.

    2. Re:Textism by Intron · · Score: 1

      Thanks for the link. It's worth it just to read the comments.

      --
      Intron: the portion of DNA which expresses nothing useful.
    3. Re:Textism by bjdevil66 · · Score: 2, Informative

      Textism looked pretty cool, so I tried it out with a typical .htm export (67K). However, it requires a subscription (Paypal payment) to process files larger than 20K. In my experience many .htm files pumped out by MS Office are larger than 20K, so I imagine the submitter may want to look elsewhere...

      With that said, if the people that have assigned the submitter with the web work want their employee to have the tools they need to do quality web work, they should pay for quality tools so the submitter can get the job done.

      Side note: The Dreamweaver "Command..." option suggested below just worked great on the same file.

    4. Re:Textism by hobbesx · · Score: 1

      But why would you want to do that? They're nice enough, even if it takes a bit of time when they come knocking on the door...

      Oh, wait. Demoronizer?
      ::blushes::

      --
      This rating is Unfair ( ) ( ) Fair (*) Funny
      Sigh... If only. Modding would be so much more fun.
    5. Re:Textism by Lars83 · · Score: 1

      If you're still writing HTML with tables, I'd like to introduce you to my friend CSS.

    6. Re:Textism by deepestblue · · Score: 2, Informative

      The demoronizer is b0rk3n. See http://www.unicode.org/faq/unicode_web.html#2

    7. Re:Textism by cyclomedia · · Score: 1

      except that if you're writing a scientific or engineering or political document that requires tables of data you should use tables. obviously

      --
      If you don't risk failure you don't risk success.
    8. Re:Textism by Lars83 · · Score: 1

      Agreed.

    9. Re:Textism by Lifewish · · Score: 1

      I'm amused that the creator has listed the fact that it runs on perl as a bug :P

      --
      For the love of God, please learn to spell "ridiculous"!!!
  12. One suggestion by Da+Fokka · · Score: 3, Funny

    You might consider a pack of monkeys and typewriters. They can ultimately reproduce Shakespeare so maybe, maybe they might be able to properly reformat the HTML gibberish Word produces.

    Of course, you could also outsource to India but that's unethical to both the monkeys and the Americon economy.

    1. Re:One suggestion by Anonymous Coward · · Score: 3, Funny

      It's hard to find qualified monkeys - most of them already have jobs editing /. and cnn.com...

    2. Re:One suggestion by Anonymous Coward · · Score: 0

      Americon economy

      Freudian slip?

      Perhaps 'Americonomy' can be a devilishly witty new term for use in reference to big business in the US of A.

    3. Re:One suggestion by Anonymous Coward · · Score: 0

      That's so racist...

      Seriously...

  13. One Word... by ScentCone · · Score: 5, Funny

    ..."Intern"

    --
    Don't disappoint your bird dog. Go to the range.
    1. Re:One Word... by Cerdic · · Score: 5, Funny

      No, no, no...

      Usually they are from academic users

      It sounds like this might be a university environment. The correct answer should be grad students .

      --
      Advice for my fellow geeks: before seeking out that threesome you dream of, you might see what a TWOsome is like first.
    2. Re:One Word... by ScentCone · · Score: 1

      The correct answer should be grad students

      Or better yet, exchange students. The mythical tri-lingual graduate exchange students are, of course, ideal. They can work on the HTML and translate the content while they're at it. All your footnote are belong to us, etc.

      --
      Don't disappoint your bird dog. Go to the range.
    3. Re:One Word... by patricksevenlee · · Score: 1
      ..."Intern"

      Add a "cigar" in there and I think you've got something.

    4. Re:One Word... by jhmaughan · · Score: 1

      He probably IS the intern...

    5. Re:One Word... by ScentCone · · Score: 1

      He probably IS the intern...

      Touche!

      --
      Don't disappoint your bird dog. Go to the range.
    6. Re:One Word... by FooAtWFU · · Score: 1

      I'll admit it: I am an intern. As a matter of fact, I'm one of the best interns money can buy. (oooooh look shiny resume). But it's only a summer internship and has nothing repeat NOTHING (sort of quasiobligatory disclaimer here in case someone at IBM is watching) to do with my work with this web site management position, which I mostly do during the school year (and is officially some sort of part-time salaried position with a university's economics department, if I recall correctly).

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    7. Re:One Word... by Anonymous Coward · · Score: 0

      Might wanna run a spellcheck on that puppy. One of the headings is misspelled. Now if it were in a post on /. I promise I wouldn't be pedantic enough to say anything, but it is a resume...

    8. Re:One Word... by cos(0) · · Score: 1

      Thomas, some friendly advice: spell-check your resume.

    9. Re:One Word... by simong_oz · · Score: 1

      It sounds like this might be a university environment. The correct answer should be grad students .

      The original poster is doing IT admin/support for academics, so they're probably one themselves!

      --
      "Because it's there." - George Mallory, when asked why he wanted to climb Mt Everest, March 18, 1923 (New York Times)
  14. Re:hi by Anonymous Coward · · Score: 2, Funny

    hello. how are you?

  15. Dreamweaver by GroovinWithMrBloe · · Score: 1

    Dreamweaver comes with a function explicitly for dealing with Word goodness (Clean Word HTML IIRC). Also, perhaps try HTML Tidy?

  16. Tidy by Anonymous Coward · · Score: 0

    tidy, from w3c. Dreamweaver will clean up some Word HTML.

  17. Convert to RTF first by Anonymous Coward · · Score: 0

    I don't have a link, or proper info, but I recall seeing something here a few weeks back in which someone suggested saving the Word doc as RTF, then they had a util to convert RTF to HTML - apparently it was really useful.

    1. Re:Convert to RTF first by El_Servas · · Score: 1

      Convert TFA? Now i'm confused....

    2. Re:Convert to RTF first by tonsofpcs · · Score: 2, Interesting

      Yes, it is, since RTF is a text-based format where all the formatting is open and close tags, much like HTML. Save a word doc as rtf instead and open it in notepad, and you will see. There are many tools premade to convert from RTF to HTML, but you can build your own easily.

    3. Re:Convert to RTF first by aspx · · Score: 1

      If you only want to pull out the meat, defined as the text and simple formatting, then you're right. But RTF can describe much more than that. Look at the spec. Not all RTF tags have HTML equivalents, and building the equivalent HTML for arbitrary RTF streams is especially non-trivial.

  18. PDF? by nizo · · Score: 1

    Perhaps offer every document as a pdf (there are plenty of conversion tools out there, such as ps2pdf, which you can use after printing the document to a postscript file), as well as offer it in whatever format was sent to you?

  19. You could... by Anonymous Coward · · Score: 0

    ask them to save it as an RTF file... Reading an RTF is much easier whilst supporting almost all of the important text formatting features.

  20. OpenOffice.Org, but not HTML export by KillerBob · · Score: 1

    OpenOffice.Org supports the ability to export a document as PDF. As you probably know, PDF viewers are available for all mainstream OSes, including Linux, from Adobe themselves.

    Unless you're dealing with content that has to be accessed or updated frequently, then PDF is the way to go.

    --
    If you believe everything you read, you'd better not read. - Japanese proverb
    1. Re:OpenOffice.Org, but not HTML export by emaneman · · Score: 0

      I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?

      --
      HAW HAW HAW
    2. Re:OpenOffice.Org, but not HTML export by Maxo-Texas · · Score: 1

      Yes that is what the article says- but like many users the question is why is he requiring HTML? It may be that using PDF would solve his problem even tho it does not answer his question.

      --
      She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  21. Change the Model? by Stanistani · · Score: 1

    Could you provide forms on your website for your academic users to submit the information directly?

    1. Re:Change the Model? by danheskett · · Score: 2, Interesting

      That's the best way. Really. You want the data to be in a structured format. Semantically structured if possible, but at least structured. Define a bunch of templates. Use a templating system like smarty or whatever to make it happen. Give your users a simple form - HTML, Windows, Java, whatever, that selects a template and reads in a list of fields from the template. Dynamically generate the form fields to be filled based on the template. Store the data. To generate a page start from the master record - be it in a database, an xml file, or whatever. Load the template and fill the data from the relational store. If you do it right you can even substitute different rendering layers and get an X/HTML version, a Word version, and a PDF version without any real substantial work. This also helps (1) create consistent documents, (2) create documents for more than one target format, (3) create searchable content with rich meta-data and (4) move to a more robust system later without tons of extra work. I've done it before, and if you spend a week engineering the solution properly it'll last years.
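
      A minimal Python sketch of the template-plus-record idea, using only the standard library. The template text, the field names and the function names are all hypothetical; a real setup would load templates from files and records from a database or XML store, and would need a separate policy for fields that are allowed to carry rich text.

```python
from html import escape
from string import Template

# Hypothetical template; the $placeholders double as the list of form fields.
BOOK_REVIEW = Template(
    '<article>\n'
    '  <h1>$title</h1>\n'
    '  <p class="byline">Reviewed by $reviewer</p>\n'
    '  <div class="body">$body</div>\n'
    '</article>'
)


def required_fields(template: Template) -> list:
    """List the placeholder names a template expects, in order of appearance."""
    names = []
    for match in template.pattern.finditer(template.template):
        name = match.group("named") or match.group("braced")
        if name and name not in names:
            names.append(name)
    return names


def render(template: Template, record: dict) -> str:
    """Escape every stored value, then fill the template with it."""
    safe = {key: escape(value) for key, value in record.items()}
    return template.substitute(safe)  # raises KeyError if a field is missing


if __name__ == "__main__":
    print(required_fields(BOOK_REVIEW))
    print(render(BOOK_REVIEW, {
        "title": "Markets & Hierarchies",
        "reviewer": "A. Reader",
        "body": "Plain text of the review.",
    }))
```

      The same record could be fed to a second template to get another output format, which is the point of keeping the stored data separate from the rendering layer.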

    2. Re:Change the Model? by Wiseleo · · Score: 1

      Have you tried entering a complex enough resume through one of those forms and exceeding their parameters? Now multiply that by dozens of pages in a typical paper...

      No way in hell I would chew up my paper and feed it paragraph by paragraph to a "smart" system. Yuck!

      --
      Leonid S. Knyshov
      Find me on Quora :)
    3. Re:Change the Model? by FooAtWFU · · Score: 1
      The problem with your suggestion is that the data isn't especially structured in nature. There are about three major inputs here: an Encyclopedia, Book Reviews, and Abstracts for various papers. While we do maintain templates for the various metadata for all these in a structured format, there's not much structure for the rest of the entries besides generic rich text formatting and the occasional table. The trouble is extracting the Useless Formatting (repeated insistence that This is Black 12-Point Times New Roman, using both font tags AND span style="") from the Useful Formatting (italics, please: this is the title of a book or journal, this is a footnote).

      Don't get me started on stupid formatting for footnotes, either:

      <a name="_ftnref1" href="#_ftn1"><span class="MsoFootnoteReference"><sup><span><!--[if !supportFootnotes]--><span class="MsoFootnoteReference"><sup><font size="3" face="Times New Roman" color="black"><span style="font-size: 12pt; font-family: "Times New Roman"; color: black;">[1]</span></font></sup></span><!--[endif]--></span></sup></span></a>

      Most of the cruft I can sorta-kinda-vaguely understand (though I don't like it or agree with it) but why are there TWO <sup>s? And what's with the <span> that has no attributes?
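
      One way to separate the two kinds of formatting is a whitelist: keep a short list of tags that carry meaning, drop everything else, and refuse to open a tag that is already open (which takes care of the doubled <sup>). A rough sketch with Python's standard html.parser; the class name, the function name and the choice of which tags and attributes count as useful are all assumptions for illustration.

```python
from html import escape
from html.parser import HTMLParser

KEEP_TAGS = {"a", "sup", "i", "em", "b", "strong"}   # assumed "useful" set
KEEP_ATTRS = {"href", "name"}                        # everything else is dropped


class WordScrubber(HTMLParser):
    """Re-emit only whitelisted tags; comments are ignored by default."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # (tag, emitted) for every element currently open

    def handle_starttag(self, tag, attrs):
        already_open = any(t == tag and emitted for t, emitted in self.stack)
        emit = tag in KEEP_TAGS and not already_open
        self.stack.append((tag, emit))
        if emit:
            kept = ""
            for key, value in attrs:
                if key in KEEP_ATTRS:
                    kept += ' {}="{}"'.format(key, escape(value or "", quote=True))
            self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        # Close the innermost open element with this name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            name, emitted = self.stack[i]
            if name == tag:
                del self.stack[i]
                if emitted:
                    self.out.append("</{}>".format(tag))
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def scrub_footnote(html: str) -> str:
    parser = WordScrubber()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

      On a footnote reference like the one above (with the style attribute simplified so its quotes are not nested), this should leave just the anchor, one <sup> and the [1]. It makes no attempt to repair badly nested input.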

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    4. Re:Change the Model? by danheskett · · Score: 1

      So true. It's crappy markup at the worst. Really, really, really bad. But it's still formatted. And it can still be done in a formatted way - though with much more work. Along with what I suggested before, devising some semantic tags that you can convert later to real markup might be handy for long blocks of text... much like how BBCode takes unstructured text and makes it somewhat structured enough to be workable.

  22. Macromedia Dreamweaver by jonthegm · · Score: 1

    It's a dream come true for eliminating Word formatting in an html file, or for just copy/pasting from one file to the other.

    It doesn't seem to transfer colors over, but that may be user-error.

    You could always just write some php/perl scripts for scrubbing, too! RegExps are your friends!

  23. PDF is your friend by jeffs72 · · Score: 1
    I've found it easiest to just PDF bothersome word files myself. Call me lazy, but it works.

    --
    This article has recently been linked from Slashdot. Please keep an eye on the page history for errors or vandalism.
    1. Re:PDF is your friend by B3ryllium · · Score: 1

      PDF is a Web Pestilence in its own right.

  24. HTML Export by electroniceric · · Score: 2, Informative

    If you're using Office 2000, you can find the HTML filter here:

    http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN

    I believe this functionality is built into later versions of Word.

    Per the site, this produces simpler HTML with Office-specific tags removed. With that done, you could probably use a PERL script, and you might also try writing some Word macros or COM/VBA scripts that clean up the document from within Word.

    1. Re:HTML Export by Marxist+Hacker+42 · · Score: 4, Informative

      Whew- I hoped I didn't have to post this 40 comments down in the thread. Yes, Office 2000 has the above tool- and Office 2002 or 2003 has it on the Save As menu. The option you want is "Web Page (filtered)|*.html". I saw an interview once with somebody on the Word development team, and he claimed that the original Save As HTML was built for passing Word Documents over the web- and never meant to be read by human beings as a web page at all. Web Page (filtered) cuts out all the extra shyte that Save As HTML used to put in for managing version controlled updates and changing the font every bloody character- and builds a real web page.

      --
      SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
    2. Re:HTML Export by MemeRot · · Score: 1

      I second this suggestion. It does a great job of producing comprehensible html out of the normal word mess. Now why this isn't the default option I'm at a loss to explain.

      Dreamweaver and the like may do a better job, but if you don't want to buy a tool like that for the occasional content paste, then this technique eliminates at least 80% of the work.

    3. Re:HTML Export by Petrushka · · Score: 1

      Office 2002 or 2003 has it on the Save As menu. The option you want is "Web Page (filtered)|*.html".

      WOW, that's quite a difference from older versions of Office - I don't use MS Office any more, so I hadn't spotted this. It's still very messy if you export "other people's" Office documents into html, because no one uses styles and so all the formatting is done by ad hoc class tags; but it is most definitely a hell of a lot better than the screwed up mess that Office used to produce.

      Now if MS just removed the "Save as Web Page" option and replaced it with this "filtered" option, wouldn't the OP's life be a lot easier ...

      Oh, and as other people have noted, the Dreamweaver command to "clean up MS Word html" is pretty good, though nothing except a competent human is going to produce perfect tidy CSS-ed output.

    4. Re:HTML Export by Marxist+Hacker+42 · · Score: 1

      The OP's life would be easier- but other people's lives wouldn't be. Specifically those who use a copy of Personal Webserver, IE, and Word to edit Word Documents in place. I'm not sure I've ever heard of anybody outside of Microsoft doing that- but I know it can be done.

      --
      SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
    5. Re:HTML Export by plieb · · Score: 1

      Word 2004 for Macintosh does not appear to have the Filtered HTML option.

    6. Re:HTML Export by Marxist+Hacker+42 · · Score: 1

      I have no idea- I don't work on Macs. But somehow I'm not surprised- Microsoft specifically sabotages the Macintosh version of Office quite often.

      --
      SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
    7. Re:HTML Export by arch_avaj · · Score: 1

      Hmm this does work pretty well. I just tried it on one of my documents, the end result was similar formatting, and relatively organized HTML.

      A few things it messed up were table borders (for some reason it got rid of borders) and bullets.

      Here is what a bullet's code looked like. It only displayed properly in IE (Firefox/Opera give an ugly dot)
      <p class=MsoNormal style='margin-left:.5in;text-indent:-21.0pt'><span
      lang=EN-CA style='font-size:10.0pt;font-family:Symbol'>&#183; <span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;
      </span></span><span lang=EN-CA style='font-size:11.0pt'>Bulleted Text</span></p>
      Note that the &#183; actually was a dot character; /. just standardizes the encoding.
    8. Re:HTML Export by Anonymous Coward · · Score: 0
      Well, it helps, but it still puts in extensive bonus stuff. The specific goal is to unify the style sheets. The user wants simple tags like

      and not

      .

    9. Re:HTML Export by Anonymous Coward · · Score: 0


      To me, this looks like a much better version of totally unusable html garbage from Word.

      If you open the resulting document in IE, it looks pretty good. But it still specifies fonts that you're not going to find on everyone's machines (though it does appear to use PANOSE, which I wasn't familiar with previously, and which may alleviate that problem).

      My test document had 14 pages of style definitions compared to 3 pages of code for the content.

      Does anybody know how this version of the export renders in other browsers?

  25. T-I-D-Y works for me!!! by jaltoids · · Score: 1

    And it works pretty well... TIDY!!!

    Frankly I didn't think much of this tool till I had to convert a LOT of pages where there was going to be a ton of cleanup by hand. In some cases it was easier to go back and get Word to spit out ugly html and then let tidy fix that (if you can believe it). Best of all it is FREE and easy to use!!!

  26. Dreamweaver by SlashChick · · Score: 4, Informative

    Check out Commands -> Clean Up Word HTML in Dreamweaver. It does a nice job of getting rid of extraneous tags. While you're at it, take a look at Commands -> Apply Source Formatting as well. This can be customized to your specifications in the preferences section, and automatically tabs out, adds newlines, and converts tags to lowercase where appropriate in the HTML document. Dreamweaver is the closest thing I know of to a program that "automatically" cleans up Word HTML.

    Good luck!

  27. Have you tried Microsoft Frontpage? by Cerdic · · Score: 1

    If not, give it a try. In the past, anything I've taken from .doc to html (in particular, resumes), seemed to convert nicely if I did a cut-and-paste from Word straight into an html project in Frontpage.

    Considering that they have a common core and are part of the Office suite, they seem like they should be the most directly compatible with each other.

    --
    Advice for my fellow geeks: before seeking out that threesome you dream of, you might see what a TWOsome is like first.
    1. Re:Have you tried Microsoft Frontpage? by Anonymous Coward · · Score: 0

      They don't have a common core; Frontpage was developed by a third party independent of Microsoft. It was designed to have the same look and feel as Word, and Microsoft liked it enough to buy it.

    2. Re:Have you tried Microsoft Frontpage? by Anonymous Coward · · Score: 0

      FrontPage works very well. The key is to drag the file from Windows Explorer onto a blank document in FrontPage. Do NOT open it with FrontPage or copy/paste. FrontPage will convert the document to basic html with some formatting, but no Word HTML.

  28. Textism's Word HTML Cleaner rocks by sco · · Score: 1

    http://textism.com/wordcleaner/

    "A tool that strips proprietary Microsoft tags and other cruft from Word HTML documents, leaving basic formatting intact. File sizes are greatly reduced, and the returned HTML is easier to read, revise and employ."

    5 for a 24-hour pass, 20 for a 1-year individual subscription.

    No, I don't work there.

    1. Re:Textism's Word HTML Cleaner rocks by sco · · Score: 1

      Ahem. Those prices should read EUR 5 and EUR 20.

  29. HTML Tidy by N8F8 · · Score: 5, Informative

    Save the Word document as filtered HTML and pipe the HTML through HTML Tidy. Nice clean HTML.

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
    1. Re:HTML Tidy by urdine · · Score: 1

      I second this method, after trying all the various horrible methods myself (including horribly complex Word macros). And if you need manipulation after that, bring out the perl regex script, should be easy enough to make it however you like at this point. Of course, if you need Word formatting and tables, you're screwed.

    2. Re:HTML Tidy by N8F8 · · Score: 1

      I've also written several COM add-ins to export Word to HTML and XML. A relatively simple one automated the "Save as Filtered HTML" method and used a Tidy COM component to run Tidy on the output. A little extra Regex work after that produced decently formatted HTML. And it was pretty quick. I also wrote one to crawl the Word Object model but it runs slower.

      --
      "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
    3. Re:HTML Tidy by zmokhtar · · Score: 1

      There's also Tidy Online if you'd rather not have to download and install the tool.

      http://infohound.net/tidy/

      There is a specific checkbox for cleaning Word 2000 generated HTML.

      --
      Why aren't we told when editors moderate our posts?
    4. Re:HTML Tidy by BlueWire · · Score: 1

      Most times I use this option I end up with an empty document. It used to work - something that got fixed in new versions of MS Word?

      --
      Yes, but whats that got to do with the price of tea in D'ni?
  30. Tidy by s4f · · Score: 0, Redundant

    http://tidy.sourceforge.net/ As I recall HTML-Tidy allows you to remove all of Word's "enhancements".

  31. Tidy by valis · · Score: 1

    http://tidy.sourceforge.net/

    Check out the -bare and -clean options to remove Microsoft cruft.

  32. wvWare by dougmc · · Score: 1
    wvWare is what you want.

    It can programmatically convert most Word documents into html documents, and does about as good a job as one could expect. And it makes better html than Word does itself.

    1. Re:wvWare by javester · · Score: 1

      We used wvWare in an open-source CMS for a Fortune 10 to let PR English Majors upload Word files.

      The site takes about 15 million hits/month.

      It works but there are a lot of caveats. Also, wvWare is no longer an active project so I wouldn't recommend it.

      We ended up giving the PR guys a Word macro to sanitize Word a bit before giving it to wvWare for processing.

      Also, the HTML it produces is very verbose and very heavy. It removes the MSisms but doesn't necessarily take out the redundant formatting info.

    2. Re:wvWare by dougmc · · Score: 1
      I apparently buggered the link. It's http://wvware.sourceforge.net/ .

      javester said that it's not being maintained anymore, but I do see some recent updates at the Sourceforge page. If it's not being actively developed anymore, it would seem to be because it works pretty well now.

      In any event, I've found it to work very well for my use.

  33. your website by Anonymous Coward · · Score: 0
  34. no link for you, Slashdot hordes! by SeanTobin · · Score: 5, Informative

    Hmmm... sounds like a challenge to me. Let's see what we can dig up.

    Step 1: Let's look at his user page

    Ahh! He put in a website with his profile. Let's all go and check out http://fennec.homedns.org/

    Hmm... looks like a personal page. Not too sure what to make of the comic. Anyway, let's move on to..

    Step 2: Let's look at his author page. Some interesting stuff here, including three separate e-mail addresses (which I won't post here. You're welcome :)

    A-ha! There is a link to his employer! It's Economic History Services. And what do you know... there are a significant number of pages (especially under abstracts and book reviews) that seem to come straight out of a word processor, only with extensive cleaning. A quick look at the source reveals something interesting. It's clean. Very clean. We're talking on the level of I-use-vim-for-my-webpage-editor clean. Nice job.

    Anyway, it looks like it was done by hand. I'm not saying it's not good work (quite to the contrary), but I can see your need for an automated solution.

    --
    Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
    1. Re:no link for you, Slashdot hordes! by Dunbal · · Score: 5, Funny

      Which only goes to show:

            There is NO WAY the slashdot effect can be avoided. Resistance is futile...

      --
      Seven puppies were harmed during the making of this post.
    2. Re:no link for you, Slashdot hordes! by slo_learner · · Score: 2, Funny

      I can understand why you would hunt this information down for your own demented prurient interest, but why did you have to post it?

      Didn't he clearly state that he didn't want to be slashdotted? This just seems like a perfect opportunity for the application of a little common sense along with just a hint of courtesy.

    3. Re:no link for you, Slashdot hordes! by javaxman · · Score: 1
      A-ha! There is a link to his employer!

      Why is it that I'm a tad disappointed the page loaded quickly?

    4. Re:no link for you, Slashdot hordes! by Anonymous Coward · · Score: 0

      There is NO WAY the slashdot effect can be avoided.

      Which must mean that the only way to win Slashdot is not to play.

      How about a nice game of chess?

    5. Re:no link for you, Slashdot hordes! by FooAtWFU · · Score: 3, Informative

      My SSH connection to my server still lives; I think my task was accomplished well enough. :)

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    6. Re:no link for you, Slashdot hordes! by FooAtWFU · · Score: 1
      Actually, I use emacs. But I really appreciate people who can grok vim (I can't). =D

      And you've hit the nail on the head: Book reviews and encyclopedia entries and abstracts (oh my). These things aren't exactly "structured" beyond the basic metadata (title author etc).

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    7. Re:no link for you, Slashdot hordes! by The_Wilschon · · Score: 1

      Actually, I use emacs. But I really appreciate people who can grok vim (I can't). =D

      Funny. I'm just the opposite. When I first started using linux (1998 or 1999), I knew that vi and emacs existed, so I tried both. I completely failed to understand emacs in the slightest, whereas I grasped vi rather dimly.

      Thus, my use of vi continued and greatly improved, and my use of emacs is nonexistent. But I do feel slightly in awe of people who grok emacs, simply because I can't.

      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
    8. Re:no link for you, Slashdot hordes! by FooAtWFU · · Score: 2, Interesting

      actually, I'm quite all right. At first I was a trifle worried when I saw that my machine's load was a little high and the story relatively new, but then I realized that it was just running pisg to generate channel statistics for #wikipedia. It's a beefy server on a fast line, really; I don't anticipate any issues if I can hide way down in the comments page instead of in the fine summary...

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    9. Re:no link for you, Slashdot hordes! by Anonymous Coward · · Score: 0
      if you care a quick look at his source code has email addresses

      When you say his "source code" you mean this link, clicking on e-mail addresses right?

      lame

    10. Re:no link for you, Slashdot hordes! by Anonymous Coward · · Score: 0

      Never underestimate the power of lazy!

    11. Re:no link for you, Slashdot hordes! by jalefkowit · · Score: 4, Funny
      This just seems like a perfect opportunity for the application of a little common sense along with just a hint of courtesy.

      You must be new here.

    12. Re:no link for you, Slashdot hordes! by pVoid · · Score: 1
      You ain't usin' real SQL if you put quotes around your column and table names. Only square brackets would do.

      E.g. Select [karma] from Whores where [WhoreID] = 138474

    13. Re:no link for you, Slashdot hordes! by Anonymous Coward · · Score: 0

      He also is a volunteer for Wikipedia. Now try to slashdot them!

    14. Re:no link for you, Slashdot hordes! by Evil+Grinn · · Score: 1

      You ain't usin' real SQL if you put quotes around your column and table names.

      What is real SQL? From the SQL-92 standard:

      <delimited identifier> ::= <double quote> <delimited identifier body> <double quote>

  35. Word has this feature... by fervent_raptus · · Score: 1

    Open in Word, Select All, then hit Control + Space.

    1. Re:Word has this feature... by Rude+Turnip · · Score: 1

      I just tried that out on a 12 page letter in Word...what does it do exactly? I saw the bold formatting disappear from the cover page, but couldn't see any other changes.

  36. Print to PDF... by Anonymous Coward · · Score: 0

    I would suggest installing a PDF printer driver, printing to it to generate a PDF and then going from there, such as using any number of PDF to HTML applications, available from google.com

  37. Outsource by Anonymous Coward · · Score: 0

    Outsource the boring and tedious work to India.

  38. HTML Tidy by LittleVito · · Score: 1

    I once had to convert a large number of pages generated by Word into something that was at least close to validating and I used Tidy HTML. It took a little bit of poking around with all the arguments to get it to do what I wanted, but once I had I just ran it on all the Word exports and it popped out clean code. It even had a special flag (though I don't remember it off the top of my head) to specifically deal with Word exports.

  39. HTML Tidy by John_Booty · · Score: 2, Informative

    HTML Tidy has a special mode for cleaning up Word's crappy HTML export. HTML Tidy is a free command-line tool that is also embedded in a lot of popular HTML editors.

    HTML Tidy:
    http://tidy.sourceforge.net/
    HTML Kit (great integration with HTML Tidy; it includes HTML Tidy so you can just grab HTML Kit without grabbing HTML Tidy)
    http://www.chami.com/html-kit/

    Countless other editors integrate with HTML Tidy as well. Have fun and good luck!

    --

    OtakuBooty.com: Smart, funny, sexy nerds.
  40. Donkey Punch Them by Anonymous Coward · · Score: 0

    That's the only way go fix the problem.

  41. Re:Duh by sTalking_Goat · · Score: 0, Offtopic

    how does a racist Troll get moderated Interesting?

    --

    My days of not taking you seriously are certainly coming to a middle...

  42. Three applications in Windoze to suggest... by Anonymous Coward · · Score: 0

    For batch conversions, there is nothing better that I know of than TextPipe. I also like askSam [the import feature lets you grab content from many different filetypes]. If there are not many files to do on a given day, or you just want a low resource-intensive approach, try PureText (I use 2.0). It is extremely easy to use. Good luck!

  43. MOD PARENT by Anonymous Coward · · Score: 0


    Yes, Dreamweaver has a handy "clean up word HTML" function, you can even grab a trial version (but it's worth the money imho)

  44. You need an intern by supercolony · · Score: 1

    Assuming that it is not in your power to change the material coming to you, then you must change how you process it.

    Quite frankly, the most cost effective way to deal with this problem is to hire an intern, temp or clerk. Train this person to format very plain HTML, to your liking (or XML, or XHTML or whatever you prefer). Then use your application to apply the style you like to the HTML the temp made.

    If you want to involve more programming, you could whip up a parser to validate the intern's work. But the reality of the situation here is that unless you are working on a truly overwhelming volume of documents, it will be much cheaper to use human labor than to invest the programming time to automate the process.

    -jr

    1. Re:You need an intern by 0xABADC0DA · · Score: 1

      So basically:

      1. Hire intern to transcribe work .doc into html
      2. Outsource html transcription job to india ...
      4. Profit!!

  45. Macro in VBA by Yonatanz · · Score: 1
    If the articles are mostly text, then you can write yourself a simple VBA macro in Word, that iterates over the object model of the Word document, and creates the simplified HTML code.

    For example, you can turn every underlined, 18-pt text into <H1> headers, etc.

    This way you can keep the consistency quite easily, while still staying flexible.

    You can even create HTML that is compatible with the IDs and CLASSes of your site's existing CSS.

    This, however, requires that you know VB, and spend some time getting to know the Word object model, which is not too difficult.
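
    To make the mapping idea concrete, here is a rough sketch of the rule logic in Python instead of VBA. It does not touch the Word object model at all: the Para record, its fields, and the "body" CSS class are hypothetical stand-ins for whatever your macro reads from each paragraph and whatever classes your site's stylesheet defines.

```python
from dataclasses import dataclass
from html import escape

@dataclass
class Para:
    """Hypothetical record of one paragraph's text and formatting."""
    text: str
    size_pt: float = 12.0
    underline: bool = False
    italic: bool = False

def para_to_html(p: Para) -> str:
    body = escape(p.text)
    # Rule from the comment above: underlined 18-pt text becomes a header.
    if p.underline and p.size_pt == 18:
        return f"<h1>{body}</h1>"
    if p.italic:
        body = f"<em>{body}</em>"
    # "body" is an assumed class name from the site's existing CSS.
    return f'<p class="body">{body}</p>'

def document_to_html(paras) -> str:
    # Skip paragraphs that contain only whitespace.
    return "\n".join(para_to_html(p) for p in paras if p.text.strip())
```

    Every font and size the author picked is thrown away unless a rule explicitly maps it to something, which is what keeps the output consistent.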

  46. Re:hi by Anonymous Coward · · Score: 2, Funny

    I'm fine thank you

  47. Tidy Flags by N8F8 · · Score: 5, Informative

    Almost forgot. The Tidy Docs will tell you to select "--bare" and "--word-2000" and I also recommend "--output-xhtml" and "--indent".
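
    If you want to script it, something like the following Python wrapper should do. It is only a sketch: it assumes a "tidy" binary is on your PATH, the file names are whatever you pass in, and the options are just the ones named above, written in Tidy's "--option value" form.

```python
import shutil
import subprocess

def tidy_command(src: str, dst: str) -> list:
    """Build a tidy invocation using the options named in this thread."""
    return [
        "tidy",
        "--bare", "yes",          # strip Microsoft-specific HTML
        "--word-2000", "yes",     # extra cleanup for Word 2000 HTML
        "--output-xhtml", "yes",
        "--indent", "yes",
        "-o", dst,                # write the cleaned document here
        src,
    ]

def run_tidy(src: str, dst: str) -> int:
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy not found on PATH")
    # Tidy exits with 1 on warnings and 2 on errors, so check=True is
    # deliberately not used; the caller decides what to do with the code.
    return subprocess.run(tidy_command(src, dst)).returncode
```

    Word output nearly always produces warnings, so an exit status of 1 is normal here.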

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
    1. Re:Tidy Flags by Bogtha · · Score: 2, Informative

      I also recommend "--output-xhtml"

      Why? XHTML isn't any better than HTML 4.01 for almost anybody, and it's less compatible.

      --
      Bogtha Bogtha Bogtha
    2. Re:Tidy Flags by .nuno · · Score: 1

      You might want to use the HTML in some XML compliant system, such as Tridion as I have to nowadays, and believe me, you do want your stuff to be XML compliant. Or you might just want to look smart to all the other completely ignorant co-workers and say that all your code is XHTML compliant - with the bonus of having everyone looking at you and wondering why the hell they even bother...

      --
      .sig
    3. Re:Tidy Flags by Bogtha · · Score: 1

      You might want to use the HTML in some XML compliant system, such as Tridion as I have to nowadays

      That's an unusual requirement, which I'm sure the Ask Slashdotter would have mentioned had it been the case.

      and believe me, you do want your stuff to be XML compliant.

      No offence, but I'd rather have an explanation than simply take the unqualified assertion of some random Slashdotter at face value.

      Or you might just want to look smart to all the other completely ignorant co-workers and say that all your code is XHTML compliant

      Using a less-compatible format to boost your ego? Doesn't sound very smart to me.

      --
      Bogtha Bogtha Bogtha
    4. Re:Tidy Flags by Dryth · · Score: 1

      Using a less-compatible format to boost your ego? Doesn't sound very smart to me.

      Forgive my ignorance, but I wasn't aware that compatibility issues were so severe?

      Fundamentally an XHTML document is a well-formed HTML document. Sending with application/xhtml+xml as a mime type has its issues, but that isn't an issue if the file's saved with anything other than a .xhtml extension or someone goes out of their way to send as xhtml+xml.

      Having a doctype throws some browsers into strict mode as opposed to quirks mode, but if anything this yields greater consistency between modern browsers. If that's a problem, just drop the doctype. This isn't even an XHTML-specific issue though.

      Most of the content coming out of Word shouldn't see much damage either way. The most complex things you'll encounter will be tables for which HTML vs. XHTML shouldn't even be an issue.

    5. Re:Tidy Flags by Bogtha · · Score: 1

      Forgive my ignorance, but I wasn't aware that compatibility issues were so severe?

      Don't get me wrong, there aren't lots of big problems, but why bother dealing with them at all if you don't have to?

      Fundamentally an XHTML document is a well-formed HTML document.

      This isn't true. XHTML is fundamentally incompatible with HTML as its empty element syntax means something different due to SHORTTAG NET being used in HTML. Granted, this particular issue doesn't cause problems often, but that doesn't change the fact that they are fundamentally incompatible.

      Having a doctype throws some browsers into strict mode as opposed to quirks mode

      This isn't about doctype switching or tables. It's about things like not being able to use a character encoding other than UTF-8 or UTF-16 because the defaults and constraints of XML and Appendix C of the XHTML 1.0 recommendation prevent you from using anything else. Unicode is usually the best choice when dealing with Western languages, but still has quite severe compatibility issues when dealing with Asian languages among others.

      --
      Bogtha Bogtha Bogtha
    6. Re:Tidy Flags by Dryth · · Score: 1

      Don't get me wrong, there aren't lots of big problems, but why bother dealing with them at all if you don't have to?

      Because as previously suggested there are certain additional benefits to working with XHTML. The grass isn't very green on either side of the fence, but forgive my optimism in opting to move forward, rather than back.

      Personally, as someone who's been in a similar position of having to convert a large number of user-submitted documents for posting, I've found converting to XHTML with numeric entities invaluable in the past; first Tidy cleans the document, then I can post-process with an XML parser.
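
      As a rough illustration of that second step (Python here, and the attribute list is my assumption about what you'd want gone): once the markup is well-formed and uses numeric entities, a stock XML parser can load it and you can walk the tree.

```python
import xml.etree.ElementTree as ET

def strip_presentation(xhtml: str) -> str:
    """Parse well-formed XHTML and drop presentational attributes."""
    root = ET.fromstring(xhtml)
    for el in root.iter():
        for attr in ("style", "class", "lang"):
            el.attrib.pop(attr, None)   # remove if present, ignore otherwise
    return ET.tostring(root, encoding="unicode")
```

      The numeric entities matter because a plain XML parser only knows the five built-in XML entities; feed it &nbsp; and it fails with an undefined-entity error, while &#160; goes through. Real Tidy output also carries the XHTML namespace, so element names come back namespace-qualified, but the attribute stripping works the same way.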

      This isn't true. XHTML is fundamentally incompatible with HTML as its empty element syntax means something different due to SHORTTAG NET being used in HTML. Granted, this particular issue doesn't cause problems often, but that doesn't change the fact that they are fundamentally incompatible.

      Fair enough, but again, the instances where this becomes an issue are relatively rare. Personally I'd leave the choice of accepting rare instances of incompatibility for rare beneficial applications up to the user.

      This isn't about doctype switching or tables. It's about things like not being able to use a character encoding other than UTF-8 or UTF-16 because the defaults and constraints of XML and Appendix C of the XHTML 1.0 recommendation prevent you from using anything else. Unicode is usually the best choice when dealing with Western languages, but still has quite severe compatibility issues when dealing with Asian languages among others.

      You can specify other character encoding in the XML declaration and meta content type (C.9). Though admittedly this isn't something that Tidy itself takes into consideration.

  48. Get it in PDF first. by frostman · · Score: 2, Interesting

    I'm assuming you have the right to republish the Word documents. I'm also assuming you have no control over how many Word-specific formatting features are used by the authors.

    What I would do in your shoes is set up a (mostly) automated system to convert the Word files to PDF. You can buy Acrobat or you can go with a third-party, printer-driver-style converter, but in the end you'll probably save more headaches just using Acrobat.

    Once you have a document in PDF, you can use any of the numerous (free and commercial) tools to convert that to HTML, text, whatever - all much more reliably than from Word directly. It's not perfect, but it's probably the closest you'll get.

    Plus, you can post the PDFs themselves for download in case someone wants them - and at least Google will still happily index your PDFs.

    Yes, you'll probably have to live with some NT variant to get that part done (though it might work with OSX) - but it's most likely your fastest path to *quality* conversions.

    --

    This Like That - fun with words!

    1. Re:Get it in PDF first. by Jeff+DeMaagd · · Score: 1

      I don't know about scripting, but OS X does include a "print to PDF" feature. I haven't found it necessary to use it yet. If you need to be very specific about how a document will look, then going to PDF is the way to go.

    2. Re:Get it in PDF first. by markdowling · · Score: 1

      You can set up a PDF Printer which emails you results using Samba 2+ (which is what we use) or download the Acrobat 7 Trial if you want to see what extra features can get you :)

    3. Re:Get it in PDF first. by StormShaman · · Score: 1

      This is a stupid suggestion. He wanted to remove the formatting; PDF would only preserve it. Also, why convert to PDF then HTML, when he could just convert straight to HTML? It seems you are answering the wrong question.

  49. Convert to PDF by Anonymous Coward · · Score: 1, Informative

    Adobe Acrobat PDF conversion preserves look

    Many free or cheap printing filters / converters available

    1. Re:Convert to PDF by FooAtWFU · · Score: 1

      I don't want to preserve the look. I want to destroy the look and replace it with another one. Every author of every book review or article has his or her own look. I don't want it. I want MY look, the look of all the other pages on the site. On the other hand, I can't go around destroying all the hyperlinks or italics/underlines/etc around titles or anything like that.

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
  50. Homesite by Hank+Chinaski · · Score: 1

    Homesite has a function to import and clean Word Documents.

    --
    IAAL
  51. HTML parsing by ChiralSoftware · · Score: 1
    If you really want to do it right, use an HTML parser to extract the content, and then re-render it. That's exactly what our mobile search engine does to convert web pages to mobile pages. It's non-trivial stuff. The advantage of doing it is that you do end up with clean, uniform HTML (or WML or XHTML in our case).
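
    To give a flavour of the parse-and-re-render approach, here is a small sketch using Python's standard html.parser. This is not our engine's code. The whitelist of tags and attributes is an assumption; adjust it to whatever structure your site actually wants to keep (links and italics, say).

```python
from html import escape
from html.parser import HTMLParser

KEEP = {"p", "a", "em", "i", "b", "strong", "u", "ul", "ol", "li", "h1", "h2", "h3"}
KEEP_ATTRS = {"a": {"href"}}          # everything else loses all attributes
SKIP = {"head", "style", "script"}    # drop these elements and their contents

class Rerender(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0                 # depth inside SKIP elements

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip += 1
        elif self.skip == 0 and tag in KEEP:
            allowed = KEEP_ATTRS.get(tag, set())
            kept = "".join(
                f' {k}="{escape(v or "")}"' for k, v in attrs if k in allowed
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip = max(0, self.skip - 1)
        elif self.skip == 0 and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip == 0:
            self.out.append(escape(data, quote=False))

def rerender(html: str) -> str:
    parser = Rerender()
    parser.feed(html)
    parser.close()                    # flush any buffered text
    return "".join(parser.out)
```

    Because the output is generated from the parse events rather than edited in place, anything not on the whitelist simply never makes it into the result.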

    Some future version of Tomcat should have built-in content parsing in its filters so that filter writers could write simple filters to transform content in a meaningful way. But I haven't seen that as a proposal anywhere.

  52. similar problem in Quark by Anonymous Coward · · Score: 0

    I have a similar problem in QuarkXpress. My current solution is to export the doc as HTML and then search and replace in BBEdit in order to clean things up. A regex would do the job except for the fact that Quark generates arbitrarily named stylesheets that require manual changes. I am considering writing a script that would parse Xpress-tagged output and convert it to HTML.

    I'd suggest something similar for Word...export as rtf (?) and parse it into valid HTML. However, Word's HTML is *much* worse than Quark's.

  53. wvWare by funkmeister · · Score: 1

    Try wvWare (http://wvware.sourceforge.net/). It works amazingly well for Word, Excel and PowerPoint. I have used it in Zope applications and have had very good results.

    From the site:

    This is the home of the wv library. The original name of the project, mswordview, was uncomfortably close to Microsoft's own product named wordview, so the library was renamed.

    wv is a library which allows access to Microsoft Word files. It can load and parse Word 2000, 97, 95 and 6 file formats. (These are the file formats known internally as Word 9, 8, 7 and 6.) There is some support for reading earlier formats as well: Word 2 docs are converted to plaintext.

    wv compiles and works under most operating systems. Although most development is carried out with Linux, wv should work on BSD, Solaris, OS/2, AIX, OSF1, and even (with varying levels of success) AmigaOS and VMS. The GnuWin32 project maintains a port for Windows, and it is required to compile and work on all of AbiWord's supported platforms, of which there are a lot.

    wv allows other programs access to Word documents for the purpose of converting them to other formats. It is currently being used by AbiWord as its Word importer, and concepts and bits of code are being used by the KDE folks over at KWord in their word importer.

  54. Try HTML Transit by Stellent by RaSchi.de · · Score: 1

    I've had a similar task once and we used HTML Transit, a software by Stellent (http://www.stellent.com/) and distributed by Avantstar (http://www.avantstar.com/). You can define templates for all kinds of word styles and fine-tweak the HTML output quite neatly. And, another advantage, I had excellent support when some questions arose.

    --
    So long and thanx for all the fish, RaSchi
  55. Beautiful Soup by Anonymous Coward · · Score: 1, Informative
    If you like Python, there is an app out there called Beautiful Soup which can suck in ugly, malformed markup and give you a parse tree you can play with before dumping it back out to html.

    P.S. There is a Ruby Port as well.

  56. use kword by 0xABADC0DA · · Score: 1

    I had to do this recently, but to print a 20-page document as 12 pages by removing page breaks. I found that kword using the html export filter and setting it to "HTML 4.01 + Light (strict xhtml)" was the best mode. This doesn't use the style sheets and just converts to basic html.... no fancy positioning or fonts, just some headers and basic styles. This was using kword 1.4.1 / kde 3.4.2 btw.

    Everything else I tried sucked, including OO.o's export.

  57. Resign from your executive position by Fastball · · Score: 2, Interesting

    What is it with executives and directors and their fixation with sending simple memos and messages via Word attachments in e-mails? Everybody else is on board with plain text (except some folks who are smitten with font coloring). Why can't the dolts at the top of the totem pole type in their mail client's editor and hit "Send?"

    1. Re:Resign from your executive position by dougmc · · Score: 4, Insightful
      Everybody else is on board with plain text
      I don't know where you live/work, but out here in the real world, not everybody is on board with plain text. Not anymore.

      I use mutt and fetchmail in a company of Exchange users. Almost every email I get at work now, from everybody, is in html. (Unless I sent it to myself.) I don't like it, but I deal with it. It's certainly easier to deal with it than to try and change everybody else.

      I could change jobs, but over something as trivial as html emails? No. I like my job, I like the people I work with, so I just bend like the reed in the wind ...

      Still, the executives are certainly worse about email etiquette than most, and it's not just in this company -- everywhere I've worked I've found this to be the case. They don't include Subjects at all, or include useless ones like `message'. Some will type up a memo and send it as a .pdf file attachment, or worse as a .bmp file. They rarely trim anything when responding to a post -- they just top post away. (But many people do that ...)

    2. Re:Resign from your executive position by VGR · · Score: 5, Funny

      You think that's bad?

      I was given 61 screenshots (blithely dubbed "program requirements"), each its own Word document. Each containing only a (weirdly scaled) picture, of course.

      61 Word documents.

      --
      The Internet is full. Go away.
    3. Re:Resign from your executive position by Detritus · · Score: 2, Funny

      Because it isn't a "real memo" unless it is printed on company letterhead, formatted according to the company's style guide.

      --
      Mea navis aericumbens anguillis abundat
    4. Re:Resign from your executive position by Larry+Lightbulb · · Score: 2, Funny

      You get specs? And you're complaining?

    5. Re:Resign from your executive position by advocate_one · · Score: 2, Insightful
      because Microsoft put that "Oh, so convenient", send as email entry in the File menu of ms-word.... that's why... then the top of the totem pole dunderheads don't have to go to the trouble of firing up their email client and creating a message and then finding the file to attach it...

      and as another poster has suggested, perhaps it's the quality department to blame as a memo or whatever, isn't a real memo or whatever unless it has been created with the official approved template...

      --
      Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
    6. Re:Resign from your executive position by Nurgled · · Score: 1

      Everyone sends HTML in my workplace too. It seems the only purpose of this is to attach a bloated corporate signature containing a company logo in JPEG format. The body of the message is generally devoid of any special markup.

      I have my email client configured to ignore the HTML part of incoming mail, so I always see an amusingly butchered version of this signature. I don't actually include the signature in my outgoing mail, but no-one's really called me on it since I'm not in a customer-facing position. (They don't let the developers talk to the customers. :) )

      I do keep meaning to make a text-only rendition of the signature, though. Possibly featuring the logo in ASCII Art, though probably just eschewing the logo completely.

    7. Re:Resign from your executive position by mrjb · · Score: 1

      Oh, not me. I'd never do that. And I'd say it's pretty obvious why.

      Everybody else is on board with plain text

      I don't know where you live/work, but out here in the real world, not everybody is on board with plain text. Not anymore.

      I use mutt and fetchmail in a company of Exchange users. Almost every email I get at work now, from everybody, is in html. (Unless I sent it to myself.) I don't like it, but I deal with it. It's certainly easier to deal with it than to try and change everybody else.

      I could change jobs, but over something as trivial as html emails? No. I like my job, I like the people I work with, so I just bend like the reed in the wind ...

      Still, the executives are certainly worse about email etiquette than most, and it's not just in this company -- everywhere I've worked I've found this to be the case. They don't include Subjects at all, or include useless ones like `message'. Some will type up a memo and send it as a .pdf file attachment, or worse as a .bmp file. They rarely trim anything when responding to a post -- they just top post away. (But many people do that ...)

      --
      Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book
    8. Re:Resign from your executive position by khakipuce · · Score: 2, Insightful
      I use mutt and fetchmail in a company of Exchange users
      isn't exactly "bending like a reed in the wind" -using a graphical mail client would be.

      Why do you use mutt and fetchmail? Why? Why? Why? Just about everywhere I have worked it has been easier (and often there is no choice) to just use what they use rather than trying to be clever or different. It is good to gain wide experience and it is good to have the flexibility to use the tools at hand.

      --
      Art is the mathematics of emotion
    9. Re:Resign from your executive position by dougmc · · Score: 1
      isn't exactly "bending like a reed in the wind" -using a graphical mail client would be.
      Let's not say `graphical mail client' when we really should be more specific and say Outlook.

      I guess one way to deal with it would be to just use Outlook. But that's certainly not the only way.

      Why don't I use Outlook? That's easy enough to answer --

      • I run Linux on my desktop at work. (I'm more productive under Linux, even in a land of Microsoft users.) Outlook doesn't run under Linux, though I might be able to make it run under Wine, and VMWare or rdesktop to a remote Windows box is an option
      • I archive most of the technical email aliases at the company and make them searchable for everybody. IT ought to do this, but for now I'm the one doing it, and it's far easier to do this under *nix than Windows.
      • I like to be able to access my mail similarly from home or from work or from the road. Most people do this by always running Outlook from their laptop, the same laptop, but I prefer to just ssh in.
      And it all works quite well. I don't have a problem with html emails -- I just have a line in ~/.mailcap that converts them to text for mutt --
      text/html; lynx -dump %s | sed 's/^ //' ; copiousoutput; nametemplate=%s.html
      and I have other lines that will fire up OpenOffice for Word files, xpdf for pdf files, etc. If I get an .html email that I need to view with a graphical browser (rare), I have mutt tell Mozilla to load it up for me.

      I wasn't complaining about the html emails -- I deal with them. Certainly, I'm not going to tell the rest of the company to change just so I can use my favorite mail reader.

      I was agreeing that management often doesn't seem to know how to properly deal with email, and that's true no matter what mail reader I use.

    10. Re:Resign from your executive position by jvagner · · Score: 1

      Maybe, just maybe, he likes using a mail client to deal with text that.. you know, lets him do it exclusively with keyboard commands.

      I often wish I could go back to Pine. Really.

  58. Homesite by PenchantToLurk · · Score: 1

    I used to use homesite 4.5... it has a built-in macro to strip the office tags and styles out of html. http://www.macromedia.com/software/homesite/

  59. More specifically: Word into MS CMS by snowwrestler · · Score: 1

    We run Microsoft CMS for my company's Web site, which annoyingly accepts pastes direct from Word, complete with all the extraneous code. (As opposed to a normal text box, which strips formatting when accepting pasted text.)

    Since we style the text with CSS, we have to train everyone who works on the site to first paste anything from Word into Notepad to strip out Word code crap, then paste that into the CMS browser client, then re-apply formatting with the tools in the client toolbar. What a pain! I'd love to know if anyone has figured out a way to allow people to paste Word content directly into MS CMS without having to go through all those extra steps.

    --
    Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.
    1. Re:More specifically: Word into MS CMS by Alien+Conspiracy · · Score: 1

      Can you edit the CMS server page so that it just has a regular textarea instead of anything fancy?

    2. Re:More specifically: Word into MS CMS by sbma44 · · Score: 2, Informative

      Are you using the telerik radeditor MCMS placeholder? It's free, and has capabilities that let you automatically strip out word formatting. In my experience it only sort of works... but it's better than nothing.

      You can also add an event handler for the updating event that does some regex tidying. Replacing the regex "]*>" will go a long way (better double-check that). You should be able to come up with a similar one for all the smarttag nonsense that gets inserted, too.

      Still, Word formatting remains a major bane to my existence. Good luck.
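
      A rough sketch of the kind of regex tidying described above, in Python rather than a CMS event handler. The patterns are assumptions about what typical Word paste output looks like (Office-namespaced tags such as <o:p> and smart tags such as <st1:place>, class="Mso..." attributes, inline style attributes, span wrappers); they are not the pattern from the post, which did not survive intact. strip_word_markup is a hypothetical name.

```python
import re

# Office-namespaced tags: <o:p>, </o:p>, <st1:place>, <w:wrap ...> and so on.
NAMESPACED_TAG = re.compile(r"</?[a-zA-Z][\w-]*:[^>]*>")
# Opening and closing <span> tags, with or without attributes.
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)
# class="MsoNormal" and friends, quoted or unquoted.
MSO_CLASS = re.compile(
    r"""\s+class\s*=\s*(?:"Mso[^"]*"|'Mso[^']*'|Mso[\w-]*)""", re.IGNORECASE
)
# Quoted inline style attributes.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_word_markup(html_text):
    """Remove the most common Word-specific clutter, leaving plain tags."""
    for pattern in (NAMESPACED_TAG, SPAN_TAG, MSO_CLASS, STYLE_ATTR):
        html_text = pattern.sub("", html_text)
    return html_text
```

      Regexes like these will trip over a ">" inside an attribute value, so it is worth previewing the result before saving.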

    3. Re:More specifically: Word into MS CMS by RedSteve · · Score: 1

      UGH! We have the exact same problem. And no matter how many times we tell our authors to NOT paste things into CMS directly from word, they still ignore us. In fact, I think a good number of them ignore us because they want their web pages to look exactly like their word docs; a couple have even complained that the page that went live looks nothing like what they submitted for approval.

      We are supposed to be implementing the telerik RAD editor in the next couple months, but our lead developer is stymied over how to make it strip much of anything from Word text, so I think I'm still going to be telling users to paste in plain text. Personally, I will continue to pre-convert text to HTML and paste it in in to HTML mode.

    4. Re:More specifically: Word into MS CMS by Anonymous Coward · · Score: 0

      Wow. Slashdot users really let me down on this thread.

      Paste into WordPad and it gets rid of all the junk but retains the text formatting and any embedded links.

      Losers all.

    5. Re:More specifically: Word into MS CMS by Uzuri · · Score: 1

      Oh lord, we have the same problem (different CMS). I've test-driven WordCleaner and it works nicely, but costs money (so we haven't been able to talk any depts. into buying it). Our big problem is that our CMS eats curly quotes... and of course, everybody just has to use curly quotes.

      --
      I'm a she-slashdotter... but I make up for it by living with my folks.
  60. Re:Duh by rlandrum · · Score: 1

    Perhaps if the workforce in the US didn't use phrases like "sand monkeys", IT companies wouldn't be so inclined to look for good workers overseas.

  61. Use AbiWord on the command line by dominator · · Score: 1

    I'd recommend using the CVS version of AbiWord. It'll preserve almost all of your visual and semantic meaning using XHTML and CSS. This includes fairly complex things like endnotes, footnotes, tables, floating text boxes, etc.

    AbiWord --to=file.html file.doc

    http://www.abisource.com/
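
    For a whole folder of submissions, the same command could be driven from a small script. This is only a sketch: the executable name and the --to=file.html form are copied from the line above, the binary is often lowercase abiword on Linux, and the accepted options may differ between AbiWord versions. abiword_command and convert_folder are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def abiword_command(doc_path, exe="AbiWord"):
    """Build the argv list for one conversion, mirroring the command above."""
    doc = Path(doc_path)
    out = doc.with_suffix(".html")
    return [exe, f"--to={out}", str(doc)]


def convert_folder(folder, exe="AbiWord"):
    """Convert every .doc file in a folder; returns the expected output paths."""
    converted = []
    for doc in sorted(Path(folder).glob("*.doc")):
        # check=True raises CalledProcessError if the converter exits non-zero.
        subprocess.run(abiword_command(doc, exe), check=True)
        converted.append(doc.with_suffix(".html"))
    return converted
```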

  62. Re:Duh by Anonymous Coward · · Score: 0

    "Yeah, hire some sand monkeys from overseas to do it. That's what all the IT companies are doing. Duh."

    Sand monkey? Did you write that with a sheet over your head?

  63. is HTML really necessary by lakeland · · Score: 1

    Most web users now seem to tolerate PDF files, and exporting from word to PDF is much more reliable than exporting from word to HTML.

    As solutions go, it is pretty crude. However, it works quickly and easily, and produces nice looking output.

    1. Re:is HTML really necessary by ryanov · · Score: 1

      PDF is often irritating to use for certain applications, and for me is a turn off (load time for Adobe). When putting up my resume I had the same issue, as I wished to provide it in HTML as a third option (HTML, Word, PDF).

    2. Re:is HTML really necessary by Anonymous Coward · · Score: 0

      7.0 loads pretty much instantly now for me...

    3. Re:is HTML really necessary by mrchaotica · · Score: 1

      Some things -- like résumés -- probably shouldn't be in an editable format (copy-and-pastable text, yes; editable, no). That's one of the few things I'd use PDF instead of HTML for.

      Of course, I create the PDF by writing in HTML, and then converting it.

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    4. Re:is HTML really necessary by ryanov · · Score: 1

      What speed machine? I bet it's a four digit number. None of my machines are -- there are few things I need that kind of speed for. PDF shouldn't be one of them.

    5. Re:is HTML really necessary by ryanov · · Score: 1

      Unless there is a good way to lock PDF's, you could certainly edit it with any number of programs. Few things, short of paper, are truly uneditable.

    6. Re:is HTML really necessary by mrchaotica · · Score: 1

      True, but it is more complicated than opening a HTML document in a text editor. For most purposes (such as the one I described), just making it a PDF is enough to keep people from messing with it.

      If you want real security, though, you'd sign it with PGP.

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    7. Re:is HTML really necessary by Sivaram_Velauthapill · · Score: 1

      What's wrong with editable resumes? If you don't trust a potential employer with your resume, I'm not sure it's even worth pursuing them for a job. Employers can do whatever they want with your resume anyway...

      Now, if you are putting your resume on a website for public access or something then I would agree with your point that it is perhaps best to have it in PDF or something...

      --
      Sivaram Velauthapillai
      Seeking the meaning of life... @slashdot of all places ;)
    8. Re:is HTML really necessary by mrchaotica · · Score: 1

      Well, for example, I wouldn't want my résumé to look like crap because the formatting got screwed up or he doesn't have the right font, or whatever. Those things can happen much more easily with file formats that aren't PDF.

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

  64. Dreamweaver by brickballs · · Score: 1

    Doesn't dreamweaver have an 'unfuckup' button that fixes word-html?

    --
    "What does slashdotting mean?"
    "You've never heard of slashdot?"
    "I know it makes websites not work."
  65. Sure. by Anonymous Coward · · Score: 0

    Avoid all HTML export tools.

    Edit -> Copy
    Switch to gvim
    Edit -> Paste

    Seriously. People need to stop using Word (or FrontPage, for that matter) to design pages.

  66. Re:hi by Anonymous Coward · · Score: 2, Funny

    I'm fine too.

    I'm glad we have these little discussions. It makes my day so much more interesting.

    Let's do lunch.

  67. antiword by pizza_milkshake · · Score: 1
  68. AppleScript! by jimijon · · Score: 1

    Sounds like a perfect job for AppleScript. You can create a scriptable folder, drop your documents in it and let it copy and paste all the paragraphs and add some html tags, etc. Very flexible.

    --
    Mind | Body | Spirit | Cash
    1. Re:AppleScript! by diamondmagic · · Score: 1

      Or Perl/Shell script/python/PHP/ruby/... (in that order)

  69. Use a text editor by chia_monkey · · Score: 1

    That's what I did. Copy all the text in Word, paste it in a text editor (which kills all the formatting assuming you're not using RTF), copy that and paste it in your HTML editor (usually the same editor to code your HTML) or you can paste into Dreamweaver or similar and go that route. Quick and easy.

    --

    "He uses statistics as a drunken man uses lampposts...for support rather than illumination." - Andrew Lang
  70. avoiding the hand-edit by SethJohnson · · Score: 1



    I'm seeing a lot of 'use Dreamweaver' responses that are well-meaning and probably will solve this guy's dilemma. But what about those of us running CMS systems with text area inputs in forms? Our content people copy-and-paste directly from Word and these crazy MsWord entities get crudely transposed into ASCII question marks.

    Anyone got a good regsub routine for correctly substituting these entities for their approximate ASCII equivalents? I'm just looking for pattern matching here... Don't need a bunch of code.

    Appreciatively,

    Seth
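
    One possible shape for such a substitution, sketched in Python. It assumes the form data has already been decoded to Unicode text (Word's "smart" characters usually arrive as Windows-1252 bytes, so raw bytes would need decoding as cp1252 first). The mapping covers only the usual suspects and the ASCII stand-ins are a matter of taste; asciify is a hypothetical name.

```python
# Word's typographic characters and rough ASCII stand-ins.
WORD_TO_ASCII = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u2022": "*",    # bullet
    "\u00a0": " ",    # non-breaking space
}
_TABLE = str.maketrans(WORD_TO_ASCII)


def asciify(text):
    """Replace Word's typographic characters with ASCII approximations."""
    return text.translate(_TABLE)
```

    Anything not in the table passes through untouched, so accented letters and the like still need their own handling.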

    1. Re:avoiding the hand-edit by Uggy · · Score: 1

      http://xinha.python-hosting.com/ is a textarea replacement based on HTMLArea. Works in IE and Gecko-based browsers. It has "Remove Word Formatting" and Tidy buttons.

      If you were going to use forms to cut and paste Word content, this should be your route.

      --
      Toddlers are the stormtroopers of the Lord of Entropy.
  71. Webworks Pro by 99BottlesOfBeerInMyF · · Score: 1

    You may want to look at the WebWorks Pro application for sanely exporting Word files as HTML/XML. I've used it in the past (a handful of years ago) and it was pretty reasonable. It is worth investigating in any case.

    1. Re:Webworks Pro by hcdejong · · Score: 1

      If WWP for Word is like WWP for FrameMaker, it relies on paragraph styles to decide on output formatting so it'll only create usable output if the Word document is well-formatted.
      The company I work for has developed a tool that can reformat Word documents based on paragraph properties. You can use this to convert a document where every paragraph is style 'Normal' with some overrides into a document where every paragraph has a named style assigned to it and no overrides.

  72. A solution by Dr_Ish · · Score: 1

    I have run into a similar set of problems, as I run an on-line philosophy journal (see http://ejap.louisiana.edu). The solution I found was to convert the documents down into RTF format as an intermediate step. There are a number of shareware RTF-to-HTML converters available. Unfortunately, I cannot find the name of the program I usually use at the moment, or a link for it, but googling for "RTF to HTML" shareware produces quite a few likely candidates. This system works just fine for me. What I like best about the program I have is that it puts the HTML codes in in French! If you look at the source code for the most recent edition of my journal, you can see the system in action.

  73. fckeditor by mixmasterjake · · Score: 2, Informative

    fckeditor is an in-browser WYSIWYG editor. It has a "Paste from MS Word" button that actually strips out a lot of the unnecessary baggage. I don't know how well it handles embedded images or tricky layouts, but for the basic stuff it works well.

    The interface is similar to Word - maybe if you're lucky, you could get some of your content producers to use it.

    --
    TODO: come up with a clever sig
  74. HTML Tidy program by Todd+Knarr · · Score: 4, Informative

    One program I've had luck with is the HTML Tidy program at http://www.w3.org/People/Raggett/tidy/. It seems to clean up code (particularly from Word) quite a bit.

  75. PDF - GhostScipt by Embedded+Geek · · Score: 1

    Some have suggested using PDFs. To do this, I use Ghostscipt and Ghostword. Here is a good description from O'Reilly's Word Hacks on how to install it in Word.

    --

    "Prepare for the worst - hope for the best."

  76. DEMORONISER by Anonymous Coward · · Score: 0

    (Perl script)demoroniser - correct moronic and gratuitously incompatible HTML generated by Microsoft applications

    http://www.fourmilab.ch/webtools/demoroniser/

    1. Re:DEMORONISER by Kelson · · Score: 2, Informative

      The Demoroniser was nice in its time, but it assumes the output should be 7-bit ASCII, or ISO Latin-1 at best.

      The Unmoroniser is an updated version that handles Unicode properly and will do things like convert proprietary Windows-only curly quotes to the appropriate HTML4 entities instead of dropping them back to less accurate, typographically offensive straight quotes. Same with ligatures and other characters that the Demoronizer would munge instead of convert.

      http://rheme.net/unmoroniser/
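
      The general idea (decode the Windows-1252 bytes, then emit entities for anything outside ASCII) can be sketched in a few lines of Python. This is not the Unmoroniser's code, just an illustration of the same conversion; cp1252_to_entities is a hypothetical name, and it emits numeric character references rather than named entities.

```python
def cp1252_to_entities(raw: bytes) -> str:
    """Decode Windows-1252 bytes and express non-ASCII characters as
    numeric character references, e.g. a left curly quote as &#8220;."""
    # The few byte values cp1252 leaves undefined become U+FFFD.
    text = raw.decode("cp1252", errors="replace")
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

      Existing markup characters (&, <, >) pass through unchanged, so this suits text that is already HTML.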

  77. My past employer can by Anonymous Coward · · Score: 0

    607-272-4817, ask for Jim. Cyrus Company is a web development firm in upstate NY. I worked there for the last 3 years - I'm in the UK now - and we had a client who needed just this type of thing. Jim set it up (he can program in all kinds of languages I don't understand) and you can copy and paste from Word to an HTML form, keeping the format. There might be a browser requirement, but that's about it. I was amazed when I first saw it myself. If you have any questions, email me at paper@paperskies.com - sorry for the ad-sounding post, but it's the truth and I can't really think of any other way to put it! Regardless, good luck with the search.

  78. WordML - FO - XHTML/PDF by room101 · · Score: 4, Informative

    Using a modern version of Word, output in WordML (xml format). Use a XSL stylesheet to convert the WordML to FO (formatting objects).

    From there, do anything you want, like XHTML or PDF.

    Or just go to XHTML from WordML with some stylesheet. XSL is teh cool!
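
    For anyone who wants to see the shape of that transformation without writing a stylesheet, here is a rough sketch that swaps XSL for Python's ElementTree (the standard library has no XSLT processor). It assumes the Word 2003 WordML namespace and its w:p / w:r / w:t paragraph, run and text elements; the style names in STYLE_TO_TAG are illustrative, and real documents carry far more structure (tables, lists, footnotes) than this handles.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed namespace for Word 2003 XML (WordML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}
# Illustrative mapping from paragraph style ids to HTML tags.
STYLE_TO_TAG = {"Heading1": "h1", "Heading2": "h2"}


def wordml_to_html(xml_text):
    """Turn WordML paragraphs into bare headings and paragraphs."""
    root = ET.fromstring(xml_text)
    out = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        tag = STYLE_TO_TAG.get(style_id, "p")
        parts = []
        for run in para.iter(f"{{{W}}}r"):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            if not text:
                continue
            text = escape(text)
            # Treat any <w:i> run property as italics (ignores w:val="off").
            if run.find("w:rPr/w:i", NS) is not None:
                text = f"<em>{text}</em>"
            parts.append(text)
        if parts:
            body = "".join(parts)
            out.append(f"<{tag}>{body}</{tag}>")
    return "\n".join(out)
```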

    --
    room101 -- how much can you stand before they break you?
    (they always break you eventually)
    1. Re:WordML - FO - XHTML/PDF by dubbreak · · Score: 1

      Excellent, this should be modded +5 informative.

      If you can save as XML (both Word and OO can) then you can use XSL. There is a project at the University of Victoria that is doing that with Shakespeare's works (all works are being converted to XML manually, then XSL/XSLT is used to convert the files for presentation in various formats).

      --
      "If you are going through hell, keep going." - Winston Churchill
    2. Re:WordML - FO - XHTML/PDF by tigersha · · Score: 1

      There are quite a few marked-up versions of old Will's works.

      If you buy the XML Bible (definite must if you work in this field) the CD that comes with the book has all the works of shakespeare as well as some other goodies. The Bible and Koran and a few other religious books, for instance.

      It also contains the text of the book as PDF.

      --
      The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
    3. Re:WordML - FO - XHTML/PDF by dubbreak · · Score: 1

      Thanks for the tip! I am still an undergrad and have not done any work with XML (hey, I have a year or so to go, so I have time), but a friend's work in the area has piqued my interest (the Shakespeare stuff).

      A quote I liked, from another student doing something that involved XML: "XML: It says everything and nothing at the same time."

      --
      "If you are going through hell, keep going." - Winston Churchill
  79. This is easy, while paper stays cheap by Anonymous Coward · · Score: 0

    Why not print it, scan it in, and post the jpg? With one of those multi-function printers with a sheet feeder for the scanner, it might even be fun!

    (for some definitions of "fun", anyway)

    1. Re:This is easy, while paper stays cheap by kubis · · Score: 1

      THIS is not a joke! One of my customers regularly sends emails as JPGs. They simply edit the incoming email (add remarks, notations, comments, ...), print it out, scan it and send it back as a JPG. I have tried to explain to them that they can use the Forward button in their email client, but they're still doing it the old way. After the third attempt to teach them how to forward emails, I began to charge them for every multi-megabyte email I get from them. And guess what, they pay!

  80. Grrr... by Embedded+Geek · · Score: 1
    That should be "Ghostsc r ipt," of course.

    I could really use a speling cheker.

    --

    "Prepare for the worst - hope for the best."

  81. html tidy by roc97007 · · Score: 1
    I save as html from within word and then use "html tidy" to clean up the html. I think it's built into PHP now.

    http://www.w3.org/People/Raggett/tidy/

    Ron

    --
    Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
  82. demoronizer by cyrilc · · Score: 1

    I use a homebrew version of demoronizer with accumulated patches that I added to the script along the years + tidy to sort everything up

  83. Recreating formatting? by RoadWarriorX · · Score: 2, Insightful

    To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags.

    The problem with conversion of documents to HTML in general is the expectation that the formatting needs to be preserved. There have been times where I needed to "post" a document to a web site, and I always try to get the author(s) to not worry about formatting. Formatted documents are pure evil simply because 9 times out of 10 the formatting does not affect the relevant information that you are trying to convey to your audience. Sometimes, the authors give me grief about it, but I simply show them the possibilities of separating the content and presentation during the translation. I convert their documents to generic HTML (with whatever tools are available) and use CSS to apply relevant formatting for the type of document (a report, article, thesis, or whatever). No funky font tags, or weird tables. Just let the HTML flow as it's meant to.
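
    As a sketch of what "generic HTML" could mean in practice: keep a small whitelist of structural tags, drop everything else (font, span, inline styles, classes) and leave the look to CSS. The whitelist below is an assumption about what a typical article needs, StructureOnly and strip_presentation are hypothetical names, and this uses only Python's standard html.parser rather than any of the tools mentioned in the thread.

```python
from html import escape
from html.parser import HTMLParser

# Tags worth keeping because they carry structure, not presentation.
KEEP = {"p", "h1", "h2", "h3", "em", "i", "strong", "b",
        "ul", "ol", "li", "a", "sup", "blockquote"}
KEEP_ATTRS = {"a": {"href"}}              # attributes allowed per tag
SKIP_CONTENT = {"style", "script", "title"}  # drop these and their contents


class StructureOnly(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP:
            allowed = KEEP_ATTRS.get(tag, set())
            kept = ""
            for name, value in attrs:
                if name in allowed:
                    kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
            self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_presentation(html_text):
    """Return the markup with only whitelisted tags and attributes left."""
    parser = StructureOnly()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

    Comments (including Word's conditional ones) are dropped as a side effect, since the parser ignores them unless told otherwise.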

    1. Re:Recreating formatting? by FooAtWFU · · Score: 1
      Exactly. This is my problem. I don't want the pretty fonts, I want the pretty HTML. The problem is when you get to referencing things like the Journal of Economic History all over- titles are supposed to be italics, there are occasional hyperlinks I'd like to leave, headings are kind of awesome... and apart from the insane overformatting which I referenced in another post, Word does a decent job of generating footnotes/endnotes.

      Now, getting this pretty stuff from Word is the tricky part.

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    2. Re:Recreating formatting? by RedSteve · · Score: 1

      OK, it sounds like you're on the right track -- knowing that your text needs to be marked up to indicate its structure. The next step is to actually make Word work with you.

      1. Set up Word styles to reflect the major structural parts of your document. By this I mean put your main headings in Heading 1 style, subheads in Heading 2 style, paragraphs in a new "paragraph" style, article titles in a character style called "articleTitle", etc.

      2. Use these styles to mark up your documents appropriately by applying the styles to the appropriate parts of the doc.

      3. Use the search & replace command to search for the styles...

      3a. use the "advanced > format > style" option and select the style you are looking to grab

      3b. replace it with the "Find What Text" (see the "special" option in the S&R dialog) surrounded by the appropriate tag.

      3c. the replacing text can range from something as simple as <h1>^&</h1> (for surrounding the selected text with h1 tags) to a custom span or a tag with a class added to reflect the actual semantic use of the text (e.g. ^& )

      4. After some cleanup -- Word will insert paragraph marks before your ending tags where you have used paragraph styles -- you can save your file as a text file to strip the word formatting and you should have a clean html file.

      Granted, this process assumes that you're willing to do a little Word formatting work up front, that you know how to use word styles, and that your formatting is fairly straightforward. It's also the result of some very cursory testing on word:Mac v.X; YMMV.

      Hope this helps.
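
      The steps above boil down to a lookup from style name to tag. As a sketch of that mapping outside Word, assume the document has already been reduced to (paragraph style, runs) pairs by some other tool; that data shape, the span used for the articleTitle character style, and the name styles_to_html are all illustrative choices rather than anything Word produces directly.

```python
from html import escape

# Paragraph style name -> block-level tag.
PARAGRAPH_TAGS = {"Heading 1": "h1", "Heading 2": "h2", "paragraph": "p"}
# Character style name -> (opening tag, closing tag).
CHARACTER_TAGS = {"articleTitle": ('<span class="articleTitle">', "</span>")}


def styles_to_html(paragraphs):
    """paragraphs: list of (paragraph_style, [(character_style, text), ...])."""
    lines = []
    for para_style, runs in paragraphs:
        tag = PARAGRAPH_TAGS.get(para_style, "p")  # unknown styles become <p>
        pieces = []
        for char_style, text in runs:
            open_tag, close_tag = CHARACTER_TAGS.get(char_style, ("", ""))
            pieces.append(open_tag + escape(text) + close_tag)
        lines.append("<{0}>{1}</{0}>".format(tag, "".join(pieces)))
    return "\n".join(lines)
```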

    3. Re:Recreating formatting? by Anonymous Coward · · Score: 0

      Somebody mod parent up, please.

      I just tried this and, dammit, it works. After a 'save as web page', paragraphs styled as 'Title 1' now come wrapped in h1 tags etc. and, after a quick visit to the Dreamweaver 'clean up Word HTML' function, the only Word cruft left is a few span tags.

      Now, if there was a way to automate the search/replace in Word I'd be even happier...

  84. Net-It is your magical tool by netringer · · Score: 3, Informative

    Net-It Central is the magical tool you were looking for. With that you can just point it at the file share with the Word Documents (and Excel and Power Point...) on it and see them indexed and cross linked on web pages. It'll update the content as the source docs change.

    Oh, you mean non-commercial magical tools?

    --
    Ever dream you could fly? Get up from the Flight Sim. I Fly
  85. HTML Editor? by BradNelson · · Score: 1, Redundant

    Why not just copy the plain text to an HTML editor, or even Notepad? Then manually add any font variations (titles, subtitles, bullets, etc) that are needed. Even with Dreamweaver or (shudder) FrontPage, this would not be too hard to do, even with longer articles.

    1. Re:HTML Editor? by tweek · · Score: 1

      Have you seen the crap that word exports as HTML? It's impossible to navigate. You really need a tool that understands the special tags that MS uses in the "HTML" it creates.

      --
      "Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
  86. RTF by Anonymous Coward · · Score: 0

    What did happen to RTF? Seriously.

  87. Elaine's by Anonymous Coward · · Score: 1, Funny

    I've got a table at Elaine's. Can you make it?

    1. Re:Elaine's by Anonymous Coward · · Score: 1, Funny

      Not sure; I don't know Elaine and I'm not very handy with tools.

  88. Papyrus by Paolo+DF · · Score: 1

    I use this program, Papyrus, from rom-logicware http://www.rom-logicware.com/ that has got a quite good HTML export function, and also a quite good M$Word import.
    The quality of the import varies depending on the source document (for my kind of stuff it's very good), but the quality of the HTML export is EXCELLENT, tidy-proof.
    There is a demo version whose only basic limitation is 1 page printed, with some letters swapped; every other function is OK.
    Also, its size is extremely small (kinda 2-3 MB)

    --
    Pumbaa! I don't wonder; I know.
  89. Bittorrent link? by ari_j · · Score: 1

    Does anyone have the BT link for this? ;-D

  90. But why... by skelly33 · · Score: 1

    Why is it important to make the code beautiful if the objective is merely to publish content for legible consumption? Why not just use Word's HTML export capability and dump the results into your web page and be done with it? Your content will be published and who cares what the code looks like if nobody's going to be doing any significant editing to it..?

    Just curious. Personally, unless the document is too big for consideration, I'll usually recode the thing by hand if I need the code to be precise - I haven't met a code-generator that I like yet.

    1. Re:But why... by Paolo+DF · · Score: 1

      Have you ever seen what Word is capable of *inventing* to produce a ONE WORD html page? Add that code cleanliness is a must if you want to be portable and time-proof.

      --
      Pumbaa! I don't wonder; I know.
    2. Re:But why... by cranos · · Score: 1

      Probably because even the thought of the crap flood that is Word generated HTML offends mightily.

      On a less anti-MS note, cleaning up the HTML would probably reduce file size as well as moving towards more standards based output

    3. Re:But why... by Anonymous Coward · · Score: 0

      Yeah, damn that CSS and how it bloats up a webpage that has only one word on it!

      Course then if you use it to output an entire document, suddenly it's about half the size of straight HTML! How dare they!

      Geezuz!

  91. Does anyone have any...magical tools...? by exp(pi*sqrt(163)) · · Score: 1

    A computer?

    --
    Doesn't it make you feel good to know that our freedoms are protected by politicans, lawyers and journalists.
  92. Word? by Anonymous Coward · · Score: 0

    Word? Never heard of it. Request that all submissions be sent in one of these "industry" standard formats:

    • Plain Old Text
    • HTML Formatted
    • Extrans (html tags to text)
    • Code

    And, provide a Preview button so that people can preview what they are sending in.

  93. WYSIWYG javascript editor by scarlac · · Score: 1

    I noticed a nice feature in tinyMCE (javascript wysiwyg editor) that allows you to copy-paste stuff from word to tinyMCE.

    If this would really do any cleaning up I don't know, and sometimes tinyMCE has problems of its own just keeping track of font styles (it keeps flooding me with <font> tags (eww!)).

    It's not a complete solution, and others in here have better suggestions, but this feature is certainly interesting (and relevant?)

  94. Abiword by forlornhope · · Score: 1

    The abiword project has a set of command line utilities to convert word documents to various other formats. It's called wv in Ubuntu/Debian. The one you want is called wvHtml.

    --
    "We Don't Need No Truthless Heros!" - Project 86
  95. The answer is insanely easy and cheap by Anonymous Coward · · Score: 0

    I do this all the time.

    Don't bother with any other lame program that some supposed /. know-it-all recommends.

    1. Copy your document from Word into Wordpad.
    2. Copy your document from Wordpad into your HTML editor of choice.

  96. Amen by Quadraginta · · Score: 5, Funny

    Jesus, tell me about it. I get 30kb attachments merely saying "Got your email, thanks!" with "thanks" done up in some odd curly red font and a six-line sig, not to mention the twenty-seven 8x10 colored glossy JPG attachments with circles and arrows and a paragraph on the back of each one...

    1. Re:Amen by EeeJay · · Score: 1

      with circles and arrows and a paragraph on the back of each one...
      That sounds like Alice's Restaurant to me!

    2. Re:Amen by archeopterix · · Score: 1
      Jesus, tell me about it. I get 30kb attachments merely saying "Got your email, thanks!" with "thanks" done up in some odd curly red font and a six-line sig, not to mention the twenty-seven 8x10 colored glossy JPG attachments with circles and arrows and a paragraph on the back of each one...
      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML> <HEAD> <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=windows-1250"> <TITLE></TITLE> <META NAME="GENERATOR" CONTENT="OpenOffice.org 1.1.4 (Win32)"> <META NAME="CREATED" CONTENT="20050810;9430237"> <META NAME="CHANGED" CONTENT="20050810;9445070"> <STYLE> <!-- @page { size: 21cm 29.7cm; margin: 2cm } P { margin-bottom: 0.21cm } --> </STYLE> </HEAD> <BODY LANG="en-US" DIR="LTR"> <P STYLE="margin-bottom: 0cm; text-decoration: none"><FONT COLOR="#800000"><FONT SIZE=5 STYLE="font-size: 20pt"><B><SPAN STYLE="background: #00b8ff">THANK YOU, VERY INTERESTING POST!!!!!!!!!!!</SPAN></B></FONT></FONT></P> </BODY> </HTML>
  97. Dreamweaver works well for this by Paleolithic · · Score: 1

    Dreamweaver does this well. You "clean" up the HTML and it cleans things up nicely.

  98. This might fit the job by InternetVoting · · Score: 1

    For anyone out there who uses Gmail, you might have noticed the JavaScript based editor for emails.

    It will actually give you the ability to paste in code from other editors ( including MS Word ).

    The app is called TinyMCE. TinyMCE is a platform independent web based Javascript HTML WYSIWYG editor control released as Open Source under LGPL by Moxiecode Systems AB.

    It may not make perfect XHTML compliant code, but you can try the input and test the output results here.

    I've found it to be a pretty useful tool and it should clean up the HTML pretty well. All in all it's going to depend on the basic page layout you have to decide if this is the right fit for you.

  99. Word Document Reader and Converter by bradtes · · Score: 1

    rm

    Works every time.

  100. wvWare: Word to text-based format via XML by jonored · · Score: 1

    When I was looking into doing this sort of format stripping (for a college newspaper, oddly enough) I started really looking into wvWare. It really is a marvelous little program - it takes a document in word format and translates it into a document in some text-based format by essentially replacing any given bit of word formatting with the text from a tag in an XML file describing the destination format. If you want it to, for instance, keep paragraphs, bold and italic markings, but nothing else, you would write (Or, as I was doing, edit) an XML file specifying that for each you replace the beginning with the appropriate start tag, and the end with the appropriate end, and all other formatting in the document with nothing at all. I found the format for doing WML pages to be marvelously close to a very minimal HTML (only a tag or two away).

  101. Export it as XML and XSLT it to HTML by Anonymous Coward · · Score: 0

    1) get a copy of Word 2003
    2) "save as" an exemplar as XML
    3) write an XSLT to render it in a HTML with stylesheets etc as appropriate to your website
    4) for every document you get, "save as" XML with the XSLT from 3) as the transformation.
    5) publish

    1. Re:Export it as XML and XSLT it to HTML by Lemuridae · · Score: 3, Informative

      From the AC above:

          1) get a copy of Word 2003
          2) "save as" an exemplar as XML
          3) write an XSLT to render it in a HTML with stylesheets etc as appropriate to your website
          4) for every document you get, "save as" XML with the XSLT from 3) as the transformation.
          5) publish

      I've been wondering how long until using XSLT and XML was suggested. XML is supposed to be a common data transport format, but most of the other comments talk about starting with transformations to Word HTML. This is wrong because it assumes that the Word to HTML conversion will produce usable HTML in the first place, which is a bad assumption.

      The solution suggested by the AC could be combined into a program that drives the entire process using the Word COM API to save to XML and then, for example, the MS Jet XSLT COM object model to automate the XML conversion. This could easily be maintained (eg: new Word formatting not previously encountered) with small changes to the XSLT.

      If the desire is to completely control the output without having control of the input then this is the best way to go. Yes, it's a bit of work but once you have a maintainable turn-key system you will save a lot of futzing with manual formatting. Use the power of XSLT.

    2. Re:Export it as XML and XSLT it to HTML by coolGuyZak · · Score: 1
      I've been wondering how long until using XSLT and XML was suggested.

      It occurs to me that you could have drastically shortened that time had you posted the solution yourself...

    3. Re:Export it as XML and XSLT it to HTML by llZENll · · Score: 1

      I'm guessing that since the guy doesn't know how to search for "word html filter" in a search engine he would have a pretty hard time doing what you suggest.

    4. Re:Export it as XML and XSLT it to HTML by Lemuridae · · Score: 1

      Well, the first post went with regex which I think is much harder than XSLT so the difficulty bar was set pretty high right out of the gate.

      XSLT isn't really that hard: find the tags you care about and transform them. The rest just get left behind - perfect for stripping things down and very easy to do incrementally.

      I read in a prior comment that the OP's web site had totally clean HTML. If you are going to be all OCD about your HTML this solution is probably where you would end up (after much messing about).

      The various tools and gizmo utilities will get you part of the way but never all the way there in an automated fashion.
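
      The "find the tags you care about, leave the rest behind" idea can be sketched without XSLT as well. The following is a minimal Python sketch, not the poster's method: it assumes the document was saved as Word 2003 XML (WordprocessingML, with its `w:p`, `w:r`, `w:t` elements), and the style-to-tag mapping and function names are made up for illustration. Anything it does not explicitly recognise (fonts, colours, sizes) is simply dropped.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace used by Word 2003 XML (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Hypothetical mapping from Word paragraph styles to HTML elements.
STYLE_TO_TAG = {"Heading1": "h1", "Heading2": "h2"}


def run_to_html(run):
    """Render one w:r run, keeping only bold and italic."""
    text = escape("".join(t.text or "" for t in run.findall("w:t", NS)))
    props = run.find("w:rPr", NS)
    if props is not None:
        if props.find("w:i", NS) is not None:
            text = f"<em>{text}</em>"
        if props.find("w:b", NS) is not None:
            text = f"<strong>{text}</strong>"
    return text


def wordml_to_html(xml_text):
    """Turn every w:p into <p>, <h1> or <h2>; ignore all other formatting."""
    root = ET.fromstring(xml_text)
    out = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        name = style.get(f"{{{W}}}val") if style is not None else None
        tag = STYLE_TO_TAG.get(name, "p")
        body = "".join(run_to_html(r) for r in para.findall("w:r", NS))
        if body:  # skip empty paragraphs
            out.append(f"<{tag}>{body}</{tag}>")
    return "\n".join(out)
```

      New formatting you decide to care about later becomes one more branch in `run_to_html` or one more entry in the mapping, which is the same incremental property claimed for the XSLT.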

  102. Pagify by bckspc · · Score: 2, Informative

    Pagify is a perl script I wrote to do this for another job. It's basically a series of regular expressions that: 1. purges all the proprietary XML gunk from the HTML file you save from Word. 2. chops the file into smaller files wherever a Heading 1 appears 3. attaches endnotes as footnotes to the appropriate pages. It's GPL'd, so go nuts.
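
    For readers who want the flavour of the approach without the Perl, here is a rough Python sketch of the first two steps (purging Word-specific markup, then chopping at each Heading 1). It is not Pagify's code; the patterns and function names are assumptions chosen for illustration, and the endnote handling is left out.

```python
import re


def strip_word_markup(html):
    """Remove common Word-specific markup from HTML saved by Word."""
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.S)
    # Namespaced Office elements such as <o:p> and </o:p>
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.I)
    # class and style attributes, whether quoted or bare
    html = re.sub(
        r"""\s+(?:class|style)=(?:"[^"]*"|'[^']*'|[^\s>]+)""",
        "",
        html,
        flags=re.I,
    )
    # <span> wrappers, which are mostly styling hooks
    html = re.sub(r"</?span[^>]*>", "", html, flags=re.I)
    return html


def split_on_h1(html):
    """Split the document into chunks, each starting at an <h1>."""
    parts = re.split(r"(?=<h1\b)", html, flags=re.I)
    return [part for part in parts if part.strip()]
```

    Regular expressions are fragile against arbitrary HTML, so a sketch like this only works because Word's output is so predictable.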

  103. Try this.... by mormop · · Score: 4, Informative

    Demoroniser is, in the author's own man pages words:

    A Perl script which corrects incompatible HTML generated by Microsoft applications.

    You can get it from the link in the same page. I must confess that I've not used it myself (don't use Office/Frontpage) but if it does what it says on the tin it should sort you out.

    --
    Hmmmmmm..... Deep fried and look like Squirrel.
  104. take a couple steps back... by Vellmont · · Score: 1

    As you've found out word is intended to create paper documents, not web content. I think you really need to look at the bigger picture here. You're currently taking word documents and desperately trying to convert them to HTML so they can be published on the web. Great, but not a good solution to your larger problem.

    Really, the question is: why are you accepting word documents in the first place? If your authors are serious about publishing on the web you should really be pushing them to give you content in html, or look into some kind of content management system. Any major website and most minor ones would cringe at the thought of using word as a content creation program. Those should really be your long term goals, as this word->html business is really quite a crappy system. If you don't start pushing people now to change to something more sane they never will.

    Eventually you're going to end up with a difficult site to manage because of all the word->html conversions. Solve the short term problem with some cleanup program, but you really need to work on the long term problem before it kills you, or the site.

    --
    AccountKiller
    1. Re:take a couple steps back... by _Sharp'r_ · · Score: 1

      Right, but his users have tenure and their bosses don't care if they make the Sysadmin's job harder.

      The real solution is to start with the automated conversion stuff mentioned above and then use the free time generated to read back issues of BOFH. That will better prepare him for an educational establishment environment.

      --
      The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
    2. Re:take a couple steps back... by Vellmont · · Score: 1


      their bosses don't care if they make the Sysadmin's job harder.

      But they do (or should) care about added cost. Harder usually means it takes extra time, extra expertise, or both. That's more expensive.

      --
      AccountKiller
  105. Dreamweaver double plus plus by also+aswell · · Score: 1
    I also use Dreamweaver and would like to add to the previous post.

    Inspect the copy closely in the Design view before you strip the unsightly word commands so you don't miss any little trick that might get stripped in the process. This has happened to me once or twice.

    But don't hit the undo, usually there's a quick fix in Dreamweaver that will bring the page back to the way it looked before.

    A small aside... Attention Dreamweaver fans, let's all let Adobe know how much we love this program as they absorb Macromedia later this year.

    --
    "Where did this apple come from?"
    --Alan Turing
  106. command-line solution: wv and tidy by Khopesh · · Score: 1
    Convert to HTML using wvWare (http://wvware.sourceforge.net/)

    Clean up HTML with HTML Tidy (http://tidy.sourceforge.net/)

    This can be easily scripted; no gui needed. Of course, you seem to want even /cleaner/ code ... this is only a starting point, but it seems like it will do most of the work for you.
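
    A minimal sketch of how the two steps might be chained from Python, assuming `wvHtml` and `tidy` are installed and on the PATH. The function names and the `.raw.html` intermediate file name are made up for illustration. Note that tidy exits with status 1 when it only has warnings, so the sketch treats anything above 1 as failure.

```python
import subprocess
from pathlib import Path


def build_commands(doc_path, out_dir):
    """Return the wvHtml command, the tidy command, and the final HTML path."""
    doc = Path(doc_path).resolve()
    raw_name = doc.stem + ".raw.html"   # intermediate output from wvHtml
    clean_name = doc.stem + ".html"     # final output from tidy
    convert = ["wvHtml", str(doc), raw_name]
    tidy = ["tidy", "-quiet", "-clean", "-asxhtml", "-o", clean_name, raw_name]
    return convert, tidy, Path(out_dir) / clean_name


def convert_doc(doc_path, out_dir):
    """Run both tools inside out_dir and return the path of the cleaned file."""
    convert, tidy, result = build_commands(doc_path, out_dir)
    subprocess.run(convert, cwd=out_dir, check=True)
    status = subprocess.run(tidy, cwd=out_dir).returncode
    if status > 1:  # 0 = clean, 1 = warnings only, 2 = errors
        raise RuntimeError(f"tidy reported errors for {doc_path}")
    return result
```

    Looping `convert_doc` over a folder of incoming .doc files gives the "no gui needed" batch job described above.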

    --
    Use my userscript to add story images to Slashdot. There's no going back.
  107. I just had to do this by gmajor · · Score: 1

    I used a combination of Perl and Ole (using the Win32::Ole package). MS Word has Ole hooks, designed to work seamlessly with the native Windows VBScript. With Perl's Win32::Ole, you can always do the equivalent, but you may have to use some syntactic acrobatics. After figuring out the proper syntax to use, it worked smoothly for me.

    Unfortunately, to check for bold, italic, or underlined text, I had to check every individual character to see if it was formatted. Very inefficient, especially with large documents.

    Google Groups is your friend :-) If you have any further questions, reply to this thread, and hopefully I'll get back to you with some answers.

    1. Re:I just had to do this by gmajor · · Score: 1

      Btw, some more notes:

      I originally used perl+antiword. Unfortunately, antiword choked on handling the formatting (bold, italic, etc.) correctly, so I was forced to use the OLE package.

      The advantage of perl is that you can minimize having to copy and paste from the word files. You want as little human involvement with this process as possible.

  108. Had a similar problem with AbiWord by mi · · Score: 1
    AbiWord's HTML-generation may be better, but it is still ugly:
    • The <span... for every paragraph
    • giant and identical "style=...." qualifiers for each <td> and <tr>
    • each cell with an explicit paragraph inside it
    • each cell ending with an explicit </td>.
    Manual cleansing ended up reducing the file size by about 60%, without changing, how it looks.
    --
    In Soviet Washington the swamp drains you.
    1. Re:Had a similar problem with AbiWord by FooAtWFU · · Score: 1
      each cell ending with an explicit </td>.

      It's called XHTML. Maybe you've heard of it.

      I agree with the rest though...

      --
      The World Wide Web is dying. Soon, we shall have only the Internet.
    2. Re:Had a similar problem with AbiWord by mi · · Score: 1
      I have heard of XHTML, but I don't care to produce a document in it. I just want HTML, where redundant closing tags aren't required.

      I realize, that it may be more complex to validate/parse, but that is an engineering problem, not user's.

      And it is solved already -- numerous browsers parse it just fine.

      --
      In Soviet Washington the swamp drains you.
  109. Here is how to do it: by dspisak · · Score: 1

    Step 1: Buy a cheap Mac Mini
    Step 2: Buy copy of Office 2004
    Step 3: Open your horrendous Word .DOC in Word 2004
    Step 4: Print to PDF
    OR
    Step 4: Export to XHTML in Word

    Done.

    Exporting to XHTML in Word 2004 seems to do a pretty good job usually. However you should just print the DOCs to PDF and put those up instead.

    1. Re:Here is how to do it: by RomanySaad · · Score: 1

      You forgot:

      Step 5: Profit!!!

  110. Cross-Eyes by ndansmith · · Score: 1

    There is a nifty program called Cross Eyes which reveals all of the formatting in a Word doc (basically shows you the "source" of a .doc). It can help you see what is tripping you up and get it removed.

  111. NO NO NO NO NO NO NO! by lebow · · Score: 1
    NO that is a very bad idea! Unless ....

    You want to create a bloated internet. Or maybe you just want to drive users away from your site! Save PDF for documents that are meant to be printed.

    Why would you want to force your users to open up a different application to view your online content!?

    I can go on about why this is such a bad idea but I think it is very obvious.

    1. Re:NO NO NO NO NO NO NO! by Anonymous Coward · · Score: 0
      Why would you want to force your users to open up a different application to view your online content!?

      Surely you're not arguing that using PDF to post dead-tree documents on the Web is bad because an interpreter has to be fired up in order to view the file? I don't know about your browser of choice, but with Mozilla, Acroread is fired up automatically and runs inside of the existing window. As for a bloated Internet, it's not 1991 anymore, most people have fast connections.

      PDF files are fantastic to use on the Web. There are free readers available for many platforms, the content can be formatted (including table of contents, list of figures, index, etc.) much better than can be done with HTML and it looks great when printed (we have production quality color laserjet printers where I work).

    2. Re:NO NO NO NO NO NO NO! by lebow · · Score: 1
      Sorry to inform you but, Acroread is not just part of mozilla ( even though it runs in the same window ).

      umm I also don't see why you can't do lists of figures and indexes and all these other things you said with HTML ?

      Why would you want to force your users to open up a different application to view your online content!? Referring more to hot-linking to the word doc.

      There are also other alternatives to PDF, like PS and TeX. I agree that if it is only going to be printed then fine. But not everyone has a production quality color laserjet printer and some of us do have a slow internet connection at times. ( have you ever used your cell phone as a modem ? )

      What am I supposed to do with your word/pdf docs if I need/want to use a text only web browser ?

  112. screenshots by rnx · · Score: 0, Offtopic

    nt

  113. Dreamweaver MX 2004 by greymond · · Score: 2, Informative

    1) Open Word
    2) Select All -> Copy
    3) Open Dreamweaver
    4) File -> New Html Doc
    5) Paste
    6) Commands -> Clean up Word Html
    7) Commands -> Apply Source Formatting (if you take the time to set the programs preferences to what you like)
    8) Done
    9) Drink beer
    10) Sleep

    1. Re:Dreamweaver MX 2004 by senocular · · Score: 1

      11) Profit!!

      Sorry, this is Slashdot, it had to be done.

  114. Pdf by wot.narg · · Score: 0

    Print to pdf in OO.o.

    Either post the pdfs, or last time I checked there are a lot of cleaner tools to deal with them. (Google obviously does it with little problem, so it's out there)

    --
    Roses are red
    Violets are blue
    In Soviet Russia
    Poems write you!
  115. Another F-ing article from Wir... oh, I see. by stevenharman · · Score: 0

    Did anyone else read that as "Sanely Moving from World to the Web?"

    I was thinking "Oh God, not another article from Wired!"

    And while I don't want to encourage the posting of every article in Wired on /., I also feel compelled to cite why I thought the above. So, realizing that this post will probably only be read by 2 other people at most... the article is here: http://wired.com/wired/archive/13.08/start.html?pg=3


    Go ahead and mod me Troll.
    --
    90% of being smart is knowing what you're dumb at.
  116. Standard formatting/html tidy by La+Camiseta · · Score: 1

    First off, like others have recommended, use HTML Tidy.

    Secondly, create a set of standard templates/formatting rules and make sure that your guys keep to them. This makes everything a lot easier, possibly allowing you to even script the exportation of the documents, as well as making sure that there'll be a standard look and feel across the pages without a need for so much editing.

  117. Automate Word -COM Add-In or Macro by N8F8 · · Score: 1

    I mentioned HTML Tidy already, but you could also automate MS word and crawl through the Word Object model and reproduce the page in HTML. One advantage of this method is you can identify all the parts of the document including footnotes and formulas. Here is some code to start with. The tricky part is to realize you have to crawl through each paragraph/range at the word level and check styles. Do it at the character level and the thing is a dog. Another tricky part is crawling through the document in the correct order. The link isn't my code but was the closest I could find on the web before I sat down and figured it out myself.

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
  118. Word and styles by jammed · · Score: 1


    What always got me was that MSWord has all these built in styles e.g. Heading 1, Heading 2 etc which get turned to complete html font size mush when you use save as html AND this goes for every other single convertor I've used.

    Why not use these to convert to structured html, eg h1 h2 etc?

    This is why back in 1998 in a previous company (i was lead developer in a publishing company) we had to build our own - which is still being used by some of my current clients.

    There are a number of problems with it; it's always been one of those "solutions" that I've meant to get back to, correct all the issues and publish it online.

    If the documents you use are heavily structured using the built in styles then it may be of some use to you.

    It's basically a VB (yuk) automation on Word. We had to use Powerpoint to get all the images out as gifs (yuk again), and we had to write a really interesting algorithm to get out tables as html rather than the awful images some convertors do.

    Maybe you could get in contact if you're interested.

  119. CourseGenie -- sounds like what you're looking for by stevek · · Score: 1

    CourseGenie is a product designed for exactly what you're looking for -- taking Word documents and bringing them out to sane, accessible HTML.

    It's especially designed for academic uses like you're looking for.

    http://www.horizonwimba.com/products/coursegenie/

    [disclaimer: Yes, I do work for the company :) ]

  120. Simple by Chosen+Reject · · Score: 1, Redundant

    Ctrl-a, Ctrl-c, open notepad, Ctrl-v, Ctrl-s, type document.html, press OK. Done.

    --
    Stop Global Warming!
    Just say no to irreversible processes!
  121. Academic users using Word? by MerlinTheWizard · · Score: 0

    Any "academic" user using Word should not call themselves academic. The real ones use LaTeX. ;-) LaTeX allows you to do everything you want, and more.

    1. Re:Academic users using Word? by Franck+Binard · · Score: 1

      Not true anymore. I did all my grad work in logic using latex2E. It was hell. Either suffer long chains of tex code or macro ur way out. Either was a pain. Then you had to wait for the dvi load, then transfer to pdf, then back to correct... The new equation editors for word that are out there are so much better for math. Finally, see what ur doing when ur doing it. simple...

    2. Re:Academic users using Word? by colinrichardday · · Score: 1

      It may be easier, but will it look as nice? Also, why did you go from dvi to pdf? Why not dvips to print? Also, is it really easier to use an equation editor than to type control sequences?

    3. Re:Academic users using Word? by Franck+Binard · · Score: 1

      "It may be easier, but will it look as nice? "

      Not by default, but if you tweak word using the styles formatting, then you can get it to look just as good.

      "Also, why did you go from dvi to pdf? Why not dvips to print? "

      Even then, you still have to compile it, fix the tex errors look at the dvi, then fix the actual text errors then recompile and so on. gets tedious on long articles

      Also, is really easier to use an equation editor than to type control sequences?

      hell yeah !!! (except for search and replace which is much easier in tex. but that's it)

    4. Re:Academic users using Word? by colinrichardday · · Score: 1

      Yeah right, it's sooooo much easier clicking for an integral sign than typing \int.

      Also, you can use LyX as a front end to LaTeX.

  122. dude i totally beat you to it by Anonymous Coward · · Score: 0

    one post above you.

    too bad i'm A to the mothafuckin C.

  123. PowerPoint is just as bad by FredThompson · · Score: 1

    This is common to other M$ Office applications.

    Try loading PowerPoint-created HTML into something other than Internet Explorer and see what happens.

    I've got a short sample with IE and Firefox screenshots here: http://home.mindspring.com/~fredthompson/

    OpenOffice doesn't properly load all M$ Office files, especially those with fine formatting control or embedded video.

    Have you ever used WinFax to send a Word document? You'll notice the margins change.

    PDF seems to be the only way to keep the formatting but then you don't have the raw text content.

    Supposedly, the upcoming major release of M$ Office won't use proprietary formats. Yeah, well, we've heard that before so maybe yes, maybe no.

  124. Yes! by Kelson · · Score: 2, Insightful

    PDF is most suitable for documents that need to be printed with specific formatting.

    For documents that are going to be viewed online, it's infinitely preferable to use a free-form format like HTML (was designed to be) that can adjust to varying monitor and window sizes.

  125. only accept submissions via online form by FLoWCTRL · · Score: 1

    The simplest way to proceed would be to set up an online form for content submission. Tell them it's the only way that you'll take submissions. Then they can cut & paste text into the fields that you specify, or if they are professors, more likely they'll give it to their grad students or department secretaries to do.

    You can give them some formatting options by using textarea tags and allowing a limited set of html tags into the content, the same way that slashdot does.
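
    A minimal sketch of such a tag whitelist in Python, using only the standard library. The allowed set and the names `TagFilter` and `sanitize` are assumptions for illustration, not what Slashdot actually runs, and this is a formatting scrubber rather than a vetted security filter. It keeps the listed tags, drops every attribute, and lets the text of any other element through as plain text.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; adjust to whatever the site's stylesheet supports.
ALLOWED = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "blockquote"}


class TagFilter(HTMLParser):
    """Re-emit only whitelisted tags, without any of their attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept, re-escaped so it cannot smuggle in markup.
        self.out.append(escape(data, quote=False))


def sanitize(fragment):
    """Return the fragment with every non-whitelisted tag removed."""
    parser = TagFilter()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

    Because pasted Word content arrives wrapped in font, span and class attributes, running submissions through a filter like this gives the visual consistency the original question asks for.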

  126. Actually, an NDA probably doesn't matter. by arete · · Score: 2, Interesting

    In all likelihood an NDA doesn't cover obvious works like this - anything that could be reasonably discovered publicly. Doubtless he couldn't post the _documents_ that he converted.

    However, I am also not willing to just assume that no company would ever consider letting someone sourceforge a script like this. It is 1) worth good advertising and 2) clearly not important enough to be worth selling. Release it in the company's name, or not depending on what they prefer.

    At a minimum a lot of small companies would be fine with this - big companies would vary wildly.

    --
    Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
    1. Re:Actually, an NDA probably doesn't matter. by fandog · · Score: 1

      Problem is liability; as a company you don't want to get sued if the friendly script (allegedly) hoses some user's computer, who undoubtedly would tell the courts it had really expensive stuff on it.

      I agree it would be useful, but we're graduating lawyers each year who need jobs....

    2. Re:Actually, an NDA probably doesn't matter. by Anonymous Coward · · Score: 0

      In all likelihood an NDA doesn't cover obvious works like this - anything that could be reasonably discovered publicly. Doubtless he couldn't post the _documents_ that he converted.

      If it's obvious, then why don't you post the scripts to SourceForge? After all, they're obvious.

    3. Re:Actually, an NDA probably doesn't matter. by Lord+Pillage · · Score: 1

      Obvious does not always equal ease.

      --
      try { Signature mysig = new CleverAttempt(); } catch(NonCleverSignatureException e) { postanyway(); }
    4. Re:Actually, an NDA probably doesn't matter. by mrchaotica · · Score: 5, Funny
      Doubtless he couldn't post the _documents_ that he converted.
      You realize he was converting them for the purpose of putting them on a website, right? ; )
      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    5. Re:Actually, an NDA probably doesn't matter. by Anonymous Coward · · Score: 0

      You realize he was converting them for the purpose of putting them on a website, right? ; )
       
      intranet

    6. Re:Actually, an NDA probably doesn't matter. by pembo13 · · Score: 1

      Sad. I remember when people just used to post useful code. No worries about NDA, IPO, or what not. And I'm just 20.

      --
      "Thanks for all the money you paid to us. We've used it to buy off ISO among other things" -Microsoft
    7. Re:Actually, an NDA probably doesn't matter. by mikkom · · Score: 1
      In all likelihood an NDA doesn't cover obvious works like this - anything that could be reasonably discovered publicly. Doubtless he couldn't post the _documents_ that he converted.
      An NDA doesn't cover that, but copyright will. Everything you do at work is owned by the company that you work for.
    8. Re:Actually, an NDA probably doesn't matter. by Evil+Grinn · · Score: 2, Insightful

      You realize he was converting them for the purpose of putting them on a website, right? ; )

      Did he say the website was public?

    9. Re:Actually, an NDA probably doesn't matter. by mrchaotica · · Score: 1

      Well, no -- but the guy elsewhere in the thread who figured out what the "unnamed website" was did!

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

  127. DreamWeaver MX - Clean up Word HTML by Hack+Jandy · · Score: 1
    Here is what I do:
    • Open the word document, then tell Word to "Save as Filtered HTML".
    • Close the document in Word, then open it up in Dreamweaver MX 2004.
    • Tell DW to "Clean up Word HTML", then do "Clean up HTML" and under specific tag put "span".
    Aside from some table issues, this cleans up about 99% of Word's garbage.

    HJ
    1. Re:DreamWeaver MX - Clean up Word HTML by Anonymous Coward · · Score: 0

      good idea....

      Just curious, is anyone planning to pony-up the $399 upgrade to Macromedia Studio 8?

    2. Re:DreamWeaver MX - Clean up Word HTML by Anonymous Coward · · Score: 0

      Certainly not I.

  128. Dreamweaver and CSS by also+aswell · · Score: 1

    Actually the problem may be with which browser you preview css pages in. As you may know, Explorer is getting further and further from compliance with css standards. When I preview in other browsers, my css pages done in Dreamweaver look fine.

    I use a mac and no longer bother to check my css pages in Explorer since MS quit supporting the program for the mac platform. For a while I previewed on a friend's pc, and still do for pages that don't use css, but with the browser going so far out of support for web standards, esp. regarding css, I depend on other browsers for compliance.

    It would be nice to get Explorer updated for the Mac, I like the way they do bookmarks. Do any of you Slashdotters know if there is a group asking MS to update?

    --
    "Where did this apple come from?"
    --Alan Turing
  129. Easy: INTERN by Anonymous Coward · · Score: 0

    Give it to the intern. :-D

  130. Reading is an incredibly underrated skill by Bogtha · · Score: 1

    Why not just copy the plain text to an HTML editor, or even Notepad? Then manually add any font variations (titles, subtitles, bullets, etc) that are needed.

    From TFAskSlashdot:

    ...it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?

    He's already doing what you suggest. He wants a better way.

    --
    Bogtha Bogtha Bogtha
  131. Word to Wikipedia? by WamBamBoozle · · Score: 1

    I find myself in a similar state but I'm trying to get from word (and html) to wiki markup.

    I agree that filtered html | tidy gives workable output for html. How do I get html converted to wiki markup? Or word to wiki markup for that matter?

    1. Re:Word to Wikipedia? by N8F8 · · Score: 1
      Either come up with a HTML-Wiki converter or develop or find a Word Add-In to do the conversion:
      --
      "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
  132. what about PDF by FreeBSD+evangelist · · Score: 1
    If you converted the documents to PDF (http://sourceforge.net/projects/pdfcreator) they would appear exactly as created.

    Of course, that means your users would have to have an Acrobat plugin for their browser.

  133. word to web how about flash by Anonymous Coward · · Score: 0

    rather than pdf, you could use flash paper, of course this would upset those that don't like add ins but the installed base of flash is quite high, and they open faster than pdf's.

  134. 100% reliable method by andy_fish · · Score: 1

    Here's a method that's practically 100% reliable. You won't lose *any* formatting details. Also, the way I describe is by hand, but I'm pretty sure you could set something up to automate this.

    1) Open the document in Word
    2) Maximize the window
    3) Take a screenshot
    4) Upload the screenshot to the web
    5) Done!

    --
    & I wish I knew the password to your heart . . . &
  135. Mod parent up by uncoolcentral · · Score: 1

    The "Web Page (filtered)" solution is money. Easy and effective, given the scenario presented in the initial post. Further posts should all just herald this info. NAZIS!!!

  136. most academic papers by Anonymous Coward · · Score: 0

    are meant to be printed

  137. Method usable for any source document format by JoeGTN1 · · Score: 1

    In high school (several years ago) our school newspaper was produced in QuarkXPress, which did not lend itself to HTML at all (at least at the time). We would print the document as a PDF and then use BCL Magellan: http://www.bcltechnologies.com/document/products/magellan/magellan.htm to convert it to HTML (and HTML that was readable on any browser at that...). It seems the company now has a web based solution: http://www.gohtm.com/ and that Magellan now converts from .doc as well.

  138. Tiger's Textutil + BBEdit by mmarlett · · Score: 1

    I run a small newspaper and get press releases and stories filed from people using absolutely any file format and style that you can imagine. Quickly stripping the text down is hugely important to me.

    I've become a huge fan of Textutil, a command-line tool built by Apple that was included in Mac OS X 10.4. It can process Doc, RTF, text and anything that Apple's OS can read. And it can spit files back out in any format you want.

    I wrote an AppleScript for BBEdit (you could just as easily do a Perl script) to strip out everything but the most generic tags -- italics and bolds -- so that I can use those files to my own ends. It rocks.

  139. Simple answer by Anonymous Coward · · Score: 0

    Deny the receipt of the documents in anything other than plain HTML.

  140. Save As? by v3xt0r · · Score: 0

    you can save word documents as html documents, directly in word.

    if it looks like shit, oh well... it was made in word.

    --
    the only permanence in existence, is the impermanence of existence.
  141. Standard Templates by HalWasRight · · Score: 1
    Having a standard Word template with format styles that users must use might help you. Do you have any power to place standards on the submissions?

    I long for the days when I used Framemaker. Its style system is much easier to use than Word's, and makes it much easier to enforce standard formatting. And MIF output was great for Perl transformations.

    --
    "This mission is too important to allow you to jeopardize it." -- HAL
  142. Unfortunately, by nothingHappens · · Score: 1

    here's what I do at a certain work-study web "developer" gig I work:

    In Word:
      - Select All
      - Copy
    Open Notepad and:
      - Paste
    Voila! Plain text! Now:
      - Copy again
      - Paste into Frontpage
      - add formatting tags at will

  143. Yes PDF for sure by Anonymous Coward · · Score: 0

    The only way to fly for mass document uploads. Just ask the IRS and thousands of companies that have significant document delivery via the web.

    Don't fight reality!

  144. MS provides a solution (Office 2000 HTML Filter) by joelsanda · · Score: 1
    --
    The Luddites were ahead of their time.
  145. The answer is FCKeditor by Mustang+Matt · · Score: 1
    --
    The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
  146. Office 2000 HTML Filter 2.0 by danila · · Score: 1

    Microsoft makes exactly such a tool and it's available here:
    http://office.microsoft.com/downloads/2000/Msohtmf2.aspx.

    You could have saved us all a lot of time if you just searched for it instead of posting an Ask Slashdot question.

    --
    Future Wiki -- If you don't think about the future, you cannot have one.
    1. Re:Office 2000 HTML Filter 2.0 by binford2k · · Score: 1

      You could have saved the time you wasted posting by realizing that Microsoft HTML is not the HTML spoken by the rest of the world.

    2. Re:Office 2000 HTML Filter 2.0 by The+Cisco+Kid · · Score: 1

      Yes, but that tool runs only on Microsoft platforms - useless for 'the rest of us'.

      And as noted elsewhere, the 'HTML' produced by Microsoft programs is bloated and nonstandard, and looks like crap in pretty much anything except MSIE

    3. Re:Office 2000 HTML Filter 2.0 by danila · · Score: 1

      Have you tried this program? No. Then please don't comment on its quality, ok?

      This tool can strip all the bloat from Microsoft generated HTML to the bare essentials of HTML 2.0. And I am sure you can run it under wine.

      --
      Future Wiki -- If you don't think about the future, you cannot have one.
    4. Re:Office 2000 HTML Filter 2.0 by quaxzarron · · Score: 1

      Mod Parent Up! As compared to the "Save As... html" option, this is a *lot* better.

      --
      .sig(Anarchy Rules)
  147. Rerere no link for you, Slashdot hordes! by Anonymous Coward · · Score: 0



    Nice detective work. However, I disagree that all that nice clean code has all been done by hand. This is a guy who didn't just start looking for a solution to the problem. From randomly chosen source:
     
    ...meta name="generator" content=
        "HTML Tidy for Linux (vers 1st July 2004), see www.w3.org"...

    My solution is to do a Save As on the horror.doc or messy.pdf, and to save it as .XML (preferred) or .HTML or even .txt. It's easier to put quality markup into a naked or seminude document than it is to rip out all the proprietary Microcruft.

    I use GoLive for the fix, but mostly in code view.

    It's nice to see other people caring about the issue.

  148. delphi html tag stripper by bostonhobbit · · Score: 1

    This app is great and free. http://www.wimb.net/index.php?s=delphi&page=27

  149. Perl :-) by Franck+Binard · · Score: 1

    This is a job for Perl. I used to code for a company that does exactly what you need to do for universities. The ideal is to write your own scripts based on the document's structure (extract title, paragraph and so on), but there are a billion scripts to handle HTML parsing on CPAN.

  150. Re:Dreamweaver okay, XM-109 rifle better by Anonymous Coward · · Score: 0

    My first thought is that the XM-109 25 mm rifle has been shown to be effective against armored vehicles at nearly 2,000 meters. It should be able to face down a Microsoft Word document, no matter what the formatting.

    And it is cost-effective, requiring only a single soldier to operate, with no more training than would be required to, say, produce a Microsoft Word document.

    Nevertheless, experience compels me to add that you can never have too much firepower where a Microsoft Word document is concerned.

  151. Antiword? by Trick · · Score: 1

    Assuming "Save as HTML" isn't an option (if you're getting your Word docs from someone you can't easily have re-save them for you, for example), I've used Antiword (a Word-to-text converter) for this sort of thing. It's been years, though, and I can't say with a whole ton of certainty that it works as well now as it did then.

    1. Re:Antiword? by binford2k · · Score: 1

      Microsoft HTML is not the HTML spoken by the rest of the world.

    2. Re:Antiword? by Trick · · Score: 1

      Yep, I realize that. However, a lot of the suggestions here were for how to convert Word HTML to "real" HTML, which obviously isn't possible if you don't have Word HTML to begin with.

  152. Same problem with my solution by aliens · · Score: 1

    Problem: I have writers with no HTML skills, not even basic ones, despite two years of trying to get them to learn bold tags. Some people just don't get it.

    Solution: they paste into a plain textarea, save, and are sent to a page with a scaled-down HTMLarea-type editor where they can do the basic formatting they are allowed to do on the site. There should be nothing in their Word document formatting that can't be accomplished.

    Tables/Images are not meant to be part of the article and as such they should ask for that ability on an as needed basis.

    This has made my life better because I now get perfect XHTML transitional from their stories without the headaches caused from pasting in from Word.

    I have yet to hear anyone complain they can't get the format they want in their story.

    --
    -- taking over the world, we are.
  153. Antiword by dayton967 · · Score: 1

    Again as someone mentioned there is antiword which is here http://www.winfield.demon.nl/index.html

  154. Abiword --to html filename.doc by Anonymous Coward · · Score: 0

    Rather than trying to fix the horribly broken html microsoft word generates you might be better to try Abiword.

    Abiword is fast and light, and can be used at the command line to batch convert Word documents into HTML that is even cleaner than Microsoft Word 97 ever managed to produce.

    IIRC the command you need is abiword --to html filename.doc
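
    For the batch part, a rough sketch in Python of looping that command over a folder. The flag spelling is taken as-is from the comment above (which hedges with "IIRC"), so check it against your AbiWord version; the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def abiword_command(doc_path):
    # Flag spelling copied from the comment above; verify it for your AbiWord build.
    return ["abiword", "--to", "html", str(doc_path)]


def convert_folder(folder, run=subprocess.run):
    """Run the converter on every .doc file in folder, in name order.

    `run` is injectable so the loop can be dry-run without AbiWord installed.
    Returns the list of paths that were handed to the converter.
    """
    docs = sorted(Path(folder).glob("*.doc"))
    for doc in docs:
        # check=True raises if the converter exits with a non-zero status.
        run(abiword_command(doc), check=True)
    return docs
```

    Passing something like `run=lambda cmd, check: print(cmd)` shows what would be executed before letting it loose on real files.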

  155. FCKEditor -- a great tool. by dpreston · · Score: 1

    Personally, I love FCKEditor [fckeditor.com]. It's open source, and can support ASP/PHP/etc. Simply put, it's a nice emulator for Word on the web, and it has an option that allows direct pasting from Word to it. Check out the demo, it's great! I use it in many projects of mine for clients (php/mysql scripting, mostly).

  156. Dreamweaver MX. by Blacken00100 · · Score: 1

    Open it in Dreamweaver. There's an option called, appropriately enough, "Fix Microsoft Word HTML." Hit it, and things get a whole lot cleaner.

  157. the solution...fckeditor! by happymedium · · Score: 1

    http://www.fckeditor.net/

    a very competent web-based word processor with one killer feature: "paste from word." i've tried it and it generates pretty clean html from even complicated ms word formatting. the only thing it doesn't seem to handle well is fonts, but i still think it'd be an excellent solution.

    (really unfortunate name though. worse when you realize it came from the author's initials.)

  158. Ultimate Solution by waltznumber3 · · Score: 0

    Coffee. Lots and lots of coffee (or bawls).

    --
    If you just took anything I said seriously, read it again.
  159. You can get anything you want.... by Quadraginta · · Score: 2, Funny

    But that's not what I came here to tell you about.

    I came to talk about the draft.

    1. Re:You can get anything you want.... by kfhickel · · Score: 1

      Yep, thanksgiving is getting close, almost time to see if I can find my turntable, and if it still works......

    2. Re:You can get anything you want.... by Quadraginta · · Score: 1

      It's probably where you put the typewriter and the carbon paper.

  160. Simple formated text by dariuscardren · · Score: 0

    use the <pre> tag for preformatted text, and just put in heading tags when needed
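
    A tiny sketch of that idea in Python, assuming the tag meant above is <pre> and that the input is the plain-text copy of the document. The function name and the choice of <h2> for headings are illustrative only.

```python
from html import escape


def text_to_html(text, heading=None):
    """Wrap plain text in <pre>, with an optional <h2> heading above it."""
    parts = []
    if heading:
        parts.append(f"<h2>{escape(heading)}</h2>")
    # Escape &, < and > so the text cannot be mistaken for markup.
    parts.append(f"<pre>{escape(text, quote=False)}</pre>")
    return "\n".join(parts)
```

    Line breaks and indentation survive because <pre> preserves whitespace, which is the whole appeal of this approach.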

  161. use .net by minus_273 · · Score: 1

    and load the word dll from asp.net. Works great.

    --
    The war with islam is a war on the beast
    The war on terror is a war for peace
  162. http://www.easyhtools.com/ by Anonymous Coward · · Score: 0

    Try this program. I find it really useful...

    Easy Text To HTML Converter (freeware)

    http://www.easyhtools.com/

  163. nuff sed by swordsaintzero · · Score: 1

    SED

    --
    Panel F, Relay #70
  164. WordPad - just the job. by markdowling · · Score: 1

    When copying/pasting any Word stuff to a HTML editor (like FrontPage) an intermediate paste to WordPad preserves most formatting.

    WordPad is also useful for opening bit-rotted Word documents which cause Word to freeze, allowing content to be copied/pasted into "fresh" .DOCs.

  165. Docbook? by Anonymous Coward · · Score: 0

    Assuming you have a recent enough version of word you could save it as Wordml (Microsoft Word XML vocab). Then you can convert the Wordml to DocBook (search wordml to docbook) or write your own stylesheet.

    Once in Docbook format, there exists a stylesheet (which can be customized if necessary) to convert to html.

  166. BCL - Magellan by M0b1u5 · · Score: 1

    What you need is the app Magellan from BCL.

    We use this in our ASP called "Diligent Boardbooks" (www.diligentbooks.com) and Magellan will convert .DOC, .XLS, .PPT and .PDF files into quite acceptable .HTML documents.

    The software is not perfect - but it is probably the best thing currently available.

    There are issues in the conversion process: such as text outside the printable area of the page, and transparent .GIF and .PNG files aren't seen as single images, they're broken into multiple 1 pixel high images (Which places a huge strain on an HTTPS connection due to the added latencies of multiple requests.) but overall, it works very well indeed.

    Our "Boardbooks" app is perhaps, the only product of its kind in the world: specifically designed for company directors, and board meetings, so that materials are available as soon as they are uploaded and approved - with note making capability, and the system keeps track of what pages you have viewed, and which have changed.

    All this with nothing except a web browser.

    Well worth investigating

    --
    How many escape pods are there? "NONE,SIR!" You counted them? "TWICE, SIR!"
  167. Docbook by smartin · · Score: 1

    I switched from html to docbook about a year ago and am very happy. It's xml so it can be edited with anything or you can find editors for it (XMLMIND is ok) and it can be easily translated to pdf and other formats.

    --
    The difference between Canada and the USA is that in Canada healthcare is a right and gun ownership is a privilege.
  168. html_scrub by mck9 · · Score: 1
    Consider my html_scrub utility.

    Depending on what you tell it in a configuration file, it removes or warns you about specified tags, attributes, and/or content bracketed by specified tags. You can use it as a filter and pipe the output into other tools as needed for other kinds of massaging.

  169. RTF by yrte · · Score: 2, Informative

    One option that can work for some situations is to export / save the file from .doc into .rtf (rich text format) and then use one of the free or pay RTF->HTML converters. I find using other software than Word to convert MSDOC -> RTF produces better results.

    Using that process has made preserving italics, bold, and special characters much easier for me and almost seems fully automatable.

    I've been using this method recently with some very simple search and replace and able to get good results.

  170. Postscript? by Ctrl-Z · · Score: 1

    I haven't done this before, but what about outputting from Word to Postscript and then running pstohtml to convert the postscript to HTML. Does pstohtml make better HTML than Word?

    --
    www.timcoleman.com is a total waste of your time. Never go there.
  171. Help me, help you... by gravyface · · Score: 1

    For years we used to accept any kind of garbage from our contributing authors until we put our foot down and asked them to adhere to a Word style guide we authored: use headings, not the font drop-down; use the list styles, not the auto-bullets. If your Word is structurally sound, there are a lot of open source tools (we used a commercial solution called HTML Transit) that will do a near perfect translation, including tricky tables. Garbage in...

    --
    body massage!
  172. Common... what? by DragonHawk · · Score: 2, Funny

    "... a perfect opportunity for the application of a little common sense..."

    What is this "common sense" of which you speak? Where may I download it from?

    --

    dragonhawk@iname.microsoft.com
    I do not like Microsoft. Remove them from my email address.
  173. Dive into Python by mixmasta · · Score: 1

    The book Dive into Python (free online, just google) has a cool chapter or two on HTML processing. It may be worth a look.

    You can convert to HTML first and then run a script to parse the HTML with the SGMLParser class from the sgmllib module. Then just ignore all the MS Word crappola when writing the output files.

    Dunno, maybe that is an option.
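
    Roughly what that could look like today. This is a sketch, not the book's code: sgmllib was removed in Python 3, so the standard library's html.parser stands in for it, and the whitelist of tags and attributes is an assumption you would tune for your own site.

```python
from html import escape
from html.parser import HTMLParser

# Structural tags to keep; anything else is unwrapped (text kept, tag dropped).
ALLOWED_TAGS = {"p", "h1", "h2", "h3", "ul", "ol", "li", "b", "i", "em", "strong", "a"}
# Attributes worth keeping, per tag; class/style/lang and friends are discarded.
ALLOWED_ATTRS = {"a": {"href"}}
# Elements whose content is not document text.
SKIPPED = {"style", "script", "title"}


class WordScrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skipping += 1
        elif tag in ALLOWED_TAGS and not self.skipping:
            keep = ALLOWED_ATTRS.get(tag, set())
            rendered = "".join(
                f' {name}="{escape(value or "")}"'
                for name, value in attrs if name in keep
            )
            self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skipping = max(0, self.skipping - 1)
        elif tag in ALLOWED_TAGS and not self.skipping:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Entities were decoded by the parser, so re-escape on the way out.
        if not self.skipping:
            self.parts.append(escape(data, quote=False))


def scrub(markup):
    parser = WordScrubber()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

    Whitelisting what you keep, rather than blacklisting what Word adds, means a new flavor of Word junk is dropped by default instead of leaking through.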

    --
    #6495ED - cornflower blue
  174. kses by Anonymous Coward · · Score: 0

    kses - http://sourceforge.net/projects/kses - can be useful for cleaning up the 'filtered HTML' from word, which is still rubbish.

  175. Re:You can get anything you want torrents by j-beda · · Score: 1

    You could try just getting a torrent such as at http://www.google.ca/search?&q=alice%27s+restaurant+torrent

  176. alas times have changed by ecloud · · Score: 1

    I'd think stuff from academic sources would be in TeX.

  177. Get a contract with these guys by Rasta+Prefect · · Score: 1

    Talk to these people

    It's primarily a Learning Management System, but they do integration with other web sites for content posting.

    --
    Why?
  178. "You can get anything you want.." - iTunes by j-beda · · Score: 1

    Alternatively for $10 you can get the album at the iTunes Music Store, no turntable required...

  179. use the Tidy tools by amberarcher · · Score: 1

    HTML Tidy (http://tidy.sourceforge.net/) and its Java derivative JTidy (http://jtidy.sourceforge.net/) both have options to de-gunk HTML produced from MS Word. Does the job.

  180. Why not a CMS? by wirefarm · · Score: 1

    I looked at the site, using a link found below and I'm wondering why you don't use something like Drupal. (drupal.org)
    Setup is easy, it's free and after a slight learning curve, you'll save yourself hours or days of effort. With a bit of CSS tweaking, the site could even look the same. You could use WordPress, but with an academic site like that, you probably need something with more advanced taxonomies.
    Plus, you'd get things like RSS syndication, which adds immeasurably to the site's usefulness.
    For simple text documents in Word, just copy and paste. If you need something formatted exactly, use a downloadable PDF. (Preferably with the text somewhere that Google can index it easily.)

    --
    -- My Weblog.
  181. Spook says : move to a database, its logical by Anonymous Coward · · Score: 0

    This is a really interesting question, which I think we will start to see more of in the coming decade.

    There is a fundamental problem with the modern content ecosystem (one facet being: Word to web). What questions such as this point out is that our current thinking about content creation is to attempt to cure the problem once it has already been presented (as in, someone has already created the Word document, now we have to migrate it to HTML).

    We need prevention, not a cure. Prevent the problem before it ever appears.

    Solution: Store the content separate from formatting until it needs to be published to a particular format.

    Large groups of people create various content which ideally should be a) produced in one or many formats and b) shared as chunks between common users.

    I've been working with enterprise level documentation problems for years; hell, I started in the days when documentation problems meant someone had lost the stapler. Today it's no easier: people have thousands of documents, chunks of content and data stored in a never-ending puzzle of directories. No one shares it, people can't find it, and you cannot reuse it.

    Databases, people! What's taken everyone so long?

    I've found only one product capable at this time of what I speak of that I would be comfortable recommending.

    AuthorIT http://www.authorit.com/

    Other than AuthorIT, XML is looking promising, yet still far from an elegant solution and ultimately far from the best solution for authors.

    It's time content creation took the next step. Most other enterprise solutions have sensibly moved to databases, why shouldn't content?

  182. Hand editing solutions by Anonymous Coward · · Score: 0

    I created SWL http://cixar.com/swl because I had the same problem. It's just a small document language that lets me change a global template for my web sites without having to make any changes to my content documents, and lets me copy and paste relevant text from offensive document formats into VIM and spend a minute adding "half tags" to the beginning of lines that require style information. It's highly abbreviated, so making a bulleted list takes just four characters, no matter how many items it has. Same with an outline. Same with a table. Same with full page calendars. It lets me make macro tags if I need to produce some repetitive data without having to repeat the surrounding HTML I want. The template system lets me do things that I couldn't with HTML and CSS, like make ornate horizontal rules by overriding their tag. I'm making a new version that'll produce PDF's and HTML from the same source files. Come crash my server, if you will.

  183. Majix -> XSLT -> HTML by Sheepless · · Score: 1

    I may be crazy, but this pipeline works great if you (1) know how to program a DFA and (2) can rely on some semblance of consistency across various documents from a single source. Majix will also automate Word to convert it to RTF before you start processing it, at least the last time I used it.

    --
    Social media and technology thoughts: http://jasonkinner.wordpress.com
  184. Office Automation & VB & ASP by Anonymous Coward · · Score: 0

    Part of my job is doing exactly that...

    I wrote an ActiveX DLL in VB that drives MS Word: it looks at the doc's template and the mappings defined in an INI file, then exports using MSXML. Most of the time, a new format only needs a new section (for the new template) defined in the INI file.

    The DLL is called from the submit button in the ASP page.

  185. PHP & Mysql by Anonymous Coward · · Score: 0

    Sounds like you just need to get PHP and MySQL and make it database driven... then you could update the website from the web. Also, there are WYSIWYG editors made in JavaScript that make editing content similar to Word. That's what I use at my site... or you could slap php-nuke on it like I did on my site.

  186. save it as RTF and work from that by nickgrieve · · Score: 1

    word even has tools to do this as a batch job

  187. Mac Office & HTML by commodoresloat · · Score: 1
    Microsoft specifically sabotages the Macintosh version of office quite often.

    That's false. I don't like Microsoft either, but I don't see any evidence of that (in fact the last two versions of Office for Mac have been hailed as better in some ways than the Windows versions). I don't know because I don't use Windows. There are incredibly bad compatibility issues with .doc files over PC/Mac versions of office, but the latest versions seem to read everything (besides, there are Office incompatibilities with Office documents from previous versions irrespective of platform).

    In any case, on Mac Office 2004 there is an option if you choose "Save as Web Page" for "Save only display information as HTML". This cuts out some of the cruft but not a ton of it; HTML from word is just tedious. I have taken to designing my documents first in HTML and then converting a copy to .doc rather than the other way around -- that way I don't have to deal with Word's output. Seriously there should be an option to ignore font settings, to not create new style sheets, etc. -- when converting DOC to HTML usually all I want is the information and any footnotes and tables properly converted; I don't care about font face and size info (unless it is different than the rest of the page in which case relative tags are what I want) and I sure don't want Microsoft's imperialist stylesheet information all over the place (right down to the style names like MsoNormal and MsoFootnoteText -- yeesh).

    1. Re:Mac Office & HTML by extra88 · · Score: 1
      Microsoft specifically sabotages the Macintosh version of office quite often.

      That's false
      Yeah? Where's Outlook for Mac? Even with the latest version of everything on the client and server, Entourage's list of missing Exchange client features is as long as my arm.

      The Mac Business Unit has a pretty good track record for making decent programs and does sometimes have nice features prior to Office for Windows, but there's hardly feature parity between Office for the two platforms, particularly when it comes to "enterprise" features that leave Macs locked out of the organizations that use them.
    2. Re:Mac Office & HTML by commodoresloat · · Score: 1
      Yeah? Where's Outlook for Mac?

      I always thought the fact that Microsoft didn't provide Outlook for Mac was a way of sabotaging Windows users, not Mac users. ;)

    3. Re:Mac Office & HTML by Fuzzy+Eric · · Score: 1

      Yeah? Where's Outlook for Mac?

      Umm... Somewhere else, not infecting everyone in your addressbook with the latest script-kiddie virus?

    4. Re:Mac Office & HTML by extra88 · · Score: 1

      Umm... Somewhere else, not infecting everyone in your addressbook with the latest script-kiddie virus?

      Bah, Outlook 2000 was secured years ago by a patch (issued in 2001 I believe) and subsequent versions have had such protection built in, and then some.

      Outlook Express on the other hand...

  188. OS X by ninjakoala · · Score: 1

    If you have a Mac available with Office etc. you can just print and save the output as pdf. Pages may also do a good job of making web pages. It has worked well for me so far at least with various formats.

    --
    Against the grain
  189. Mod Parent Down by Anonymous Coward · · Score: 0

    Granted that the filtered option cleans a little bit,
    you still ain't free of the M$... stuff.

    M$ can't even leave the "filtered" version alone. Word leaves in a bunch of nasty extraneous attributes that are not XHTML compliant. However, the "filtered" version is a whole heck of a lot easier to write a clean-up script for.
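
    A minimal sketch of such a clean-up script in Python. The patterns are guesses at the usual leftovers (Mso* classes, inline styles, lang attributes, bare spans, spacer paragraphs), not a complete list, and plain regexes will not cope with every nesting case.

```python
import re

RULES = [
    # class=MsoNormal or class="MsoBodyText" (Word's built-in style names)
    (re.compile(r'\s+class=("?)Mso\w+\1', re.I), ""),
    # inline style attributes, single- or double-quoted
    (re.compile(r'\s+style=("[^"]*"|\'[^\']*\')', re.I), ""),
    # language attributes such as lang=EN-US
    (re.compile(r'\s+lang=("?)[\w-]+\1', re.I), ""),
    # spans left with no attributes: unwrap them, keeping the text
    (re.compile(r'<span>(.*?)</span>', re.I | re.S), r"\1"),
    # paragraphs holding only &nbsp; or whitespace, used as vertical spacing
    (re.compile(r'<p>(?:&nbsp;|\s)*</p>\s*', re.I), ""),
]


def clean_filtered_html(markup):
    """Apply each rule in order; order matters because later rules
    rely on attributes having already been stripped."""
    for pattern, replacement in RULES:
        markup = pattern.sub(replacement, markup)
    return markup
```

    Nested spans only lose one layer per pass, so run it again (or pipe the result through a real parser or HTML Tidy) if the input is deeply nested.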

  190. RTF2HTML by jcross · · Score: 1

    I can't speak from specific experience, but perhaps a conversion from Word to RTF, and then RTF to HTML would give the best results. Word does a fair job converting its documents to RTF. That conversion will help get rid of some of the weirdness that is Word. Googling for RTF2HTML gives a variety of options. Once it's RTF, you might have better luck with scripting tools or other editors that can take the doc to HTML.

  191. Nifty solution, but not free by Anonymous Coward · · Score: 0

    You could use a WYSIWYG editor. Not HTMLarea or Fckeditor but xstandard (http://xstandard.com/).

    You have to get the pro version for it to clean MS Word formatting. But among all the other WYSIWYGs I've tried this is the best.

  192. rich textarea by weighn · · Score: 1
    Search Google for "rich textarea". There is a neat DHTML/JavaScript replacement for the HTML <textarea>.

    Our company's Media Officer used to give me the Media Releases in Word format, which I would diligently convert to html.

    To ease my pain, I added a form to the intranet with a "rich textarea" in which she copy/pastes from Word. Add a few RegEx's and nice clean code - handles tables nicely too.

    --
    Mongrel News all the news that fits and froths
  193. Prevention vs Cure by Havsy · · Score: 1

    This is a really interesting question, which I think we will start to see more of in the coming decade. There is a fundamental problem with the modern content ecosystem (one facet being: Word to web). What questions such as this point out is that our current thinking about content creation is to attempt to cure the problem once it has already been presented (as in, someone has already created the Word document, now we have to migrate it to HTML).

    We need prevention, not a cure. Prevent the problem before it ever appears.

    Solution: Store the content separate from formatting until it needs to be published to a particular format.

    Large groups of people create various content which ideally should be a) produced in one or many formats and b) shared as chunks between common users.

    I've been working with enterprise level documentation problems for years; hell, I started in the days when documentation problems meant someone had lost the stapler. Today it's no easier: people have thousands of documents, chunks of content and data stored in a never-ending puzzle of directories. No one shares it, people can't find it, and you cannot reuse it.

    Databases, people! What's taken everyone so long?

    I've found only one product capable at this time of what I speak of that I would be comfortable recommending: AuthorIT http://www.authorit.com/

    Other than AuthorIT, XML is looking promising, yet still far from an elegant solution and ultimately far from the best solution for authors. It's time content creation took the next step. Most other enterprise solutions have sensibly moved to databases, why shouldn't content?

  194. HTML Tidy by Futurepower(R) · · Score: 1


    Don't forget HTML Tidy. It has an option to clean Word HTML output.

  195. openoffice... by BlueWire · · Score: 1

    If Openoffice can open and edit the file - can it "export" something more sane than MShtml?

    I'll try it later.

    --
    Yes, but whats that got to do with the price of tea in D'ni?
  196. "Word Hacks" author weighs in by andrewsavikas · · Score: 1

    In addition to taking the opportunity to shamelessly plug my book, I've posted a detailed response on the O'Reilly Developer Weblogs site, touching on using XSLT, VBA, Perl, Ruby, and more to get those Word docs into shape.

    --
    Andrew Savikas VP, Digital Initiatives O'Reilly Media, Inc.
  197. RTF 2 HTML, HTML Tidy and TinyMCE by msherer · · Score: 1

    For the Caravel Project (an OSS enterprise CMS) we chose RTF 2 HTML and HTMLTidy to automatically convert RTF files to HTML during the upload process. Despite the limitations, we found that exporting to RTF and doing our own conversion produced far cleaner code than anything MS did. If your Word documents are text-only, you can get away without additional editing unless the document uses lots of over-ridden stylesheets--the converter respects the stylesheets, while Word respects the overrides, which can yield some unpredictable results.

    We also recently switched editors to TinyMCE which has very reasonable 'paste from Word' and 'paste as plain text' features.

    I'm also interested in checking out the DocFrac project (http://docfrac.sourceforge.net/status.html) which looks like it might be a step up from RTF 2 HTML. While I think we're offering reasonable solutions, I would still consider Word conversion to be one of the weaker features.

    Michael Sherer
    http://caravelcms.org/

  198. Office 2000 HTML Filter 2.0 by indiechild · · Score: 1

    Not sure if anyone's mentioned this yet, but I recently found this little gem from Microsoft which does a pretty good job of cleaning up their horrendous MS Office HTML output.

    I've tried just about every free solution out there, and this seems to do the job better than anything else. Ironic really.

    I always go for standards-compliance so I don't tolerate any junk HTML code. This utility has saved me countless hours of hacking away at Word XP's awful HTML output.

  199. HTML Tidy cleans Word HTML. by Futurepower(R) · · Score: 2, Informative


    HTML Tidy cleans HTML, and has a special function for cleaning Word HTML junk.

    It must be terrible to work at Microsoft and always do mediocre work.

    --
    If you support dishonesty and violence, don't say you are Christian.

  200. Servoo by Espressoman · · Score: 1

    Have a look at the Servoo project. Might be just what you are looking for. http://www.servoo.net/

  201. heres a thought by Anonymous Coward · · Score: 0

    grow a farkin brain, moran!!

    jesus.

    convert .doc to .html.. give me a farkin break.. how does this junk get cleared.

    had i'd asked that completely stupid question, i'd of been flamed worse than the likes of hell..

  202. Bad Word docs make for bad HTML by TheDoctorJimmyLuv · · Score: 1

    The tool you use to convert Word docs to other formats (HTML and PDF included) is, for the most part, irrelevant if those Word docs lack internal structure (semantic information), which in Word comes in the form of paragraph styles. These paragraph styles are analogous to HTML tags, like h, p, li, and so on. Unfortunately, most people who use Word are oblivious to the existence of these styles. In the typical Word doc, all paragraphs are just "Normal" with a ton of inline formatting; headings may look like headings (bigger font, bold) but they don't have a heading style. The structure of these docs is, thus, hidden from screen readers. Garbage in, garbage out.
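    To make the styles-as-tags analogy concrete, here is a minimal sketch in Python. It assumes some earlier step has already pulled (style name, text) pairs out of the document; the mapping table and the function name are illustrative assumptions, not part of any particular converter.

```python
from html import escape
from itertools import groupby

# Assumed mapping from Word paragraph style names to HTML tags.
STYLE_TO_TAG = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Heading 3": "h3",
    "Normal": "p",
    "List Bullet": "li",
}

def paragraphs_to_html(paragraphs):
    """Convert (style_name, text) pairs to HTML, wrapping list items in <ul>."""
    out = []
    # Group consecutive paragraphs by target tag so a run of <li> shares one <ul>.
    # Unknown styles fall back to plain paragraphs.
    for tag, group in groupby(paragraphs, key=lambda p: STYLE_TO_TAG.get(p[0], "p")):
        items = [f"<{tag}>{escape(text)}</{tag}>" for _, text in group]
        if tag == "li":
            out.append("<ul>")
            out.extend(items)
            out.append("</ul>")
        else:
            out.extend(items)
    return "\n".join(out)
```

    A document where everything is "Normal" comes out of this as a flat run of p tags, which is the garbage-in, garbage-out point in a nutshell.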

    Actually, converting semantic Word docs to other formats isn't very difficult; the greater challenge is teaching authors to use Word properly. (Why can't MS better document styles, templates, et al?) The time investment is definitely worth it for people who use Word on a regular basis.

    BTW, when you convert Word to PDF in Acrobat, you can embed Word style information in the PDF in the form of tags, Adobe's answer to PDF accessibility issues. It works pretty well (if the Word doc has meaningful styles) and helps ensure your PDF files are 508-compliant.

    Good luck.

  203. His website by Anonymous Coward · · Score: 0

    Hahaha!! Here is his website. Let's bring his servers down!

  204. mswordview by sohp · · Score: 1

    Or as it is now known, wvWare. Includes wvHtml, which, "converts word documents into W3C certified HTML4.0 format." FOSS (GPL) command line.

  205. Trying this again by einhverfr · · Score: 2, Interesting

    The script (decss.sed) is:

    s:STYLE=\"[ a-zA-Z0-9\:;-]*\"::Ig
    s:</FONT>::Ig
    s:<FONT[ -=\"A-Z0-9]*>::Ig
    s:BORDER=[0-9]*::Ig
    s:ALIGN=BOTTOM::g
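
    For anyone without GNU sed handy, a rough Python equivalent of the same five substitutions might look like the sketch below. It assumes the last rule is meant to read ALIGN=BOTTOM, and the function name is made up for illustration.

```python
import re

# (pattern, flags) pairs mirroring the five sed substitutions.
# The first four are case-insensitive (sed's I flag); the last is not.
RULES = [
    (r'STYLE="[ a-zA-Z0-9:;-]*"', re.IGNORECASE),
    (r'</FONT>', re.IGNORECASE),
    (r'<FONT[ -="A-Z0-9]*>', re.IGNORECASE),  # " -=" is a range: space through "="
    (r'BORDER=[0-9]*', re.IGNORECASE),
    (r'ALIGN=BOTTOM', 0),  # assumption: the rule targets ALIGN=BOTTOM
]

def strip_presentational(html):
    """Delete every match of each rule, in order, like `sed -f decss.sed`."""
    for pattern, flags in RULES:
        html = re.sub(pattern, "", html, flags=flags)
    return html
```

    Like the sed version, this leaves stray spaces behind inside the tags, which browsers ignore.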

    --

    LedgerSMB: Open source Accounting/ERP
  206. my tricks for this by Anonymous Coward · · Score: 0

    I also often have to batch process many Word documents into HTML. I've had success with running the file through antiword, outputting it as a text file, and then loading it into the CMS I use for the site.

    For static files, run the file through antiword, output to text, and run a simple regexp across the file that wraps the first line in an H1 tag (presumably this would be some sort of header) and adds a break tag to every line break.
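
    The static-file step can be sketched in a few lines of Python. This assumes the plain text has already been produced by antiword and read into a string; the function name is made up, and it uses string handling rather than a regexp for the same effect.

```python
from html import escape

def text_to_html(text):
    """Wrap the first line in <h1> and put a <br> at every later line break."""
    lines = text.strip().splitlines()
    if not lines:
        return ""
    heading = f"<h1>{escape(lines[0])}</h1>"
    # Escape each line so stray < or & in the text cannot break the markup.
    body = "<br>\n".join(escape(line) for line in lines[1:])
    return heading + ("\n" + body if body else "")
```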

  207. Try PureText by ZoomieDood · · Score: 1, Informative

    See this site. for PureText.

  208. Make them give you HTML by jesser · · Score: 1

    I believe the magical tool you are looking for is called the LART.

    --
    The shareholder is always right.
  209. Re:Duh by Anonymous Coward · · Score: 0

    See now that would actually be racist, if India was in the Middle East, where I hear there are lots of deserts and sand dunes and stuff. But you see, India's not in Middle East. So apart from a very small desert, there's also a huge coastline full of beaches, jungles, urban agglomerations and ..oh yeah the HIMALAYAS, with some of the tallest mountains in the world, covered in snow. So, please take the time to come up with a slightly more informed racist comment, it'll make you look smarter, and might even get you laid more often.

  210. Shouldn't Clean HTML Be A Standard Feature By Now? by buckminster · · Score: 1

    Seriously, if we wanted all of the smart tags and proprietary mark-up that Word generates, shouldn't that be an option that we would choose to include in our exported Word->HTML files? In the year 2005 "Export to HTML" should do just that. Export to clean HTML.
    How hard can that be? It's got to be easier than exporting all of the other junk that Word currently generates.

    If you think about this problem for more than two minutes it's pretty clear that Microsoft has made the decision not to export to clean HTML.

  211. Adobe Acrobat Professional by Anonymous Coward · · Score: 0

    PDF the documents through Adobe. It would probably be your best bet, as there is no magic solution to Word HTML, other than to start from scratch or quit and find a new job.

  212. MacDonaldize 'em! by stronger · · Score: 1

    Simply don't accept formats other than HTML. Deploy a simple and useful HTMLArea script so users can edit documents (or copy-paste from Word) directly inside the web browser. You can learn more about HTMLArea at http://www.htmlarea.com/ - there are plenty of them.
    My choice is the one from InteractiveTools (http://sourceforge.net/projects/itools-htmlarea/). It is released under a BSD license and plays nice with Mozilla and IE 5.5+.

  213. Two free tips of the day! by tod_miller · · Score: 1

    1) Google for: clean up word html
    1.1) Click I am lucky (if you type it yourself)

    1.2) Download and use HtmlTidy, without checking I am sure it is the first hit.

    2) I don't think you have to worry about keeping your sanity. Too late.

    Congratulations! How did you pass the job interview?

    Nevermind.

    To confirm you're not a script,
    please type the word in this image: involve

    random letters - if you are visually impaired, please email us at pater@slashdot.org

    --
    #hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
    1. Re:Two free tips of the day! by `Sean · · Score: 1

      involve

  214. Hasn't M$ released the XML schemas for Office 2003 by Anonymous Coward · · Score: 0

    Seeing that Word now (supposedly) saves in some sort of "M$ Sane" type of XML schema, it should be possible to make a script to reform it to some sane (as in standards-compliant sane) HTML.
    http://www.microsoft.com/office/xml/default.mspx

    Otherwise, I know Open Office saves its documents as a handful of XML files and zip-compresses them, so you don't have to worry about more than one file per document.

    Either way, it should be possible to make some sort of sense of the XML file, and make some sort of script to translate the sane XML to HTML...

  215. pdf? by Anonymous Coward · · Score: 0

    seems like the easiest way to post standard content that will display correctly anywhere and everywhere

  216. Abiword by hachete · · Score: 1

    No, seriously

    For my website, http://www.badstep.net/, I edit the files in abiword then use an ant script to drive the export method from the CLI of Abiword.

    The reason I use Abiword is that the HTML export is that much better than anything else (at the time I looked at it), including XHTML, which is cool for XSLT to transform.

    I just wish the ant developers would integrate Cygwin better into ant so's the whole operation could be seamless.

    --
    Patriotism is a virtue of the vicious
  217. blogger.com editor by JumperCable · · Score: 1

    I have had some of the same trouble. My best free option to date is to get a blogger.com account.

    - Start to create a new blog post.
    - Copy the junk into the compose tab editor.
    - Switch over to the "Edit Html" tab
    - Copy all the HTML code except for the initial embedded graphic & paste it into your new HTML document.

    You are now golden.

  218. Re:Duh by Anonymous Coward · · Score: 0
    i agree. I mean, what the fuck is a "sand monkey?" Indians? Can't we find a better term for them? "Sand Monkey" sounds like "Sand Nigger" aka "Camel Fucker" "Carpet Beater", "Ayatola Assahola", etc.

    Anyhow, it's only a matter of time before companies start using cheap African labor. (2 weeks later they'll remember that blackies are a bunch of five-fingering, lazy ass, crackhead sex fiends that want to rape the CEO's hot teen daughter).

  219. Re:CourseGenie -- sounds like what you're looking by AngryScot · · Score: 1

    I work for a University in Scotland I have used CG a few times and I love the clean HTML it generates.

    There are a few formatting problems with it but nothing a few minutes in notepad can't fix.

    --

    All spelling mistakes are due to solar flares...honest

  220. Different Solution by squoozer · · Score: 1

    I wanted to do something like this a few weeks back and ran into similar problems. The difference was that I was starting from OpenOffice sxw files. I thought (and it would seem that I was wrong) that OOo could run the file against a custom XSLT style sheet to create output. If it can, I couldn't get it working, and no one replied to my request for more information on it. The solution I decided upon (but haven't got round to implementing) is to simply write a little code to unzip the sxw file and run the relevant file against a style sheet. Not as nice as getting Writer to do it, but it should work. You would need to add another step in front of that, opening the Word document and saving it as an sxw, but that is easy enough and could probably be scripted.
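
    The unzip half of that idea is short in Python. The sketch below is only an assumption about how it might be done: it opens the archive, parses content.xml, and collects heading and paragraph text by local element name rather than by exact namespace. The function name is made up, and a real version would hand the XML to a style sheet instead.

```python
import zipfile
import xml.etree.ElementTree as ET

def extract_paragraphs(path_or_file):
    """Return (kind, text) for each heading ('h') or paragraph ('p') in content.xml."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraphs = []
    for element in root.iter():
        # Tags look like "{namespace-uri}p"; compare only the local name so the
        # sketch does not depend on the exact namespace URI. This is a
        # simplification and could match same-named elements elsewhere.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in ("p", "h"):
            paragraphs.append((local_name, "".join(element.itertext())))
    return paragraphs
```

    Nested spans are flattened into the paragraph text, so inline formatting is lost here; that is often what you want when the site's own CSS is supposed to win.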

    --
    I used to have a better sig but it broke.
  221. Modify your CMS' input form with TinyMCE by ben81 · · Score: 1

    Sorry if this has already been mentioned (I didn't find TinyMCE, though).

    I recently found out about TinyMCE (no, I don't work for them). It's a WYSIWYG editor that adds text-processing features to HTML textareas. This one's especially cool, because it produces clean XHTML code, can be completely modified and is free.

    So, I would strip the functionality of TinyMCE down (it's really easy) to a format dropdown box (p, h1, h2, h3 ...), numeric and bullet lists, bold and italic. Then I'd copy the whole Word document into the textarea, removing all of Word's formatting. And then I'd apply all the markup via TinyMCE ... clean, fast and consistent with your existing design.

    Take a look at the examples (remember, you can remove all the stuff you don't want).
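
    The same reduced tag set could also be enforced on the server, in case someone pastes Word HTML in anyway. This is not TinyMCE code, just a minimal Python sketch using the standard library parser; the allowed-tag list and the names are assumptions chosen to mirror the dropdown described above.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"p", "h1", "h2", "h3", "ul", "ol", "li", "strong", "em"}
SKIP_CONTENT = {"style", "script"}  # drop these tags and everything inside them

class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip += 1
        elif tag in ALLOWED:
            self.parts.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in ALLOWED:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(escape(data, quote=False))

def clean(html):
    """Keep only whitelisted tags (without attributes) and the text between them."""
    parser = WhitelistFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

    Anything not on the list (span, font, b, Word's o:p and friends) loses its tag but keeps its text.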

  222. Are you trolling? by bbc · · Score: 1

    "Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML."

    Are you trolling? When they start sending in MS Word documents, we stop calling them "academic users", regardless of how useful MS Word is for their particular needs. No self-respecting academic uses MS Word, and at my old university you would quite deservedly get a low mark if you turned something in that had been made in MS Word. (And yes, that would be noticeable; footnotes all over the place, misnumbered TOCs, and lack of substance, because the students often spent up to 90% of their allotted time in wrestling with the program instead of writing the damn paper.)

    If I were you, I would just outlaw MS Word. Force users to use more sensible formats. It's good for you and even better for them.

    1. Re:Are you trolling? by windowpain · · Score: 1

      Are you trolling?

      "No self-respecting academic uses MS Word"

      Bombastic idiocy.

      "Force users to use more sensible formats."

      Typically techno-fascism. Don't let users select the tools they want to use. Force them to use the tools that make your job easier.

      Word has lots of problems and annoyances but it's used by tens of millions of people around the world and is reasonably inexpensive when purchased with the academic discount.

      There are also a great many books, tutorials and other sources of information on it. And there is much information about it available for free on the Web.

      End users don't exist to make techies' jobs easier. Techies exist to make end users' jobs easier.

      --
      Insert witty sig here.
    2. Re:Are you trolling? by bbc · · Score: 1

      "Typically techno-fascism. Don't let users select the tools they want to use. Force them to use the tools that make your job easier."

      Yada-yada-yada. It's "Force them to make their job easier." In some realms (most?) MS Word is just a toy. You wouldn't expect a professional cook to use a Fisher Price plastic oven, nor would you expect somebody in academia to use MS Word, unless they need to write a letter to auntie Jane.

      That's got nothing to do with arrogance, it's got to do with "use the right tool for the job at hand". Somehow, Microsoft have managed to convince a hell of a lot of people that MS Word is suited for writing long and complex documents. Again and again that implicit claim has been disproved. There is a reason some faculties make the use of TeX obligatory, and it's got nothing to do with forcing users to do it the hard way.

    3. Re:Are you trolling? by windowpain · · Score: 1

      I've been writing long complex documents for more than twenty years. I've done complete (printed) newsletters, 100+ page technical manuals, press releases, a 300+ page book (which I submitted to my publisher as Word files--it was their requirement, btw), trifold brochures, flyers, ad mockups and much more using Word.

      I'm not defending Word as a great program. It can be annoying and frustrating. But it's no more annoying and frustrating than most other programs I've used. Outside of the legal profession where WordPerfect holds on to some diehard loyalists it is the de facto standard for word processing in much of the world.

      All that having been said I've tried AbiWord, OOO and a couple of other open source alternatives and none of them could match Word in the features I need.

      As dismaying as you may find it, Word's power and its hegemony means it often is the right tool for the job.

      But don't get too discouraged. The end of Word's hegemony is within our grasp. I'm convinced that a single word processor that can match Word feature-for-feature and apes its interface can topple Word. All it has to do is output files in pure XML using some schema that is open and widely accepted and Word will start to wither.

      I have seen the future. And it is XML.

      --
      Insert witty sig here.
  223. re: Sanely Moving from Word to the Web? by banewood · · Score: 1

    Macromedia Dreamweaver MX (the version I currently have) has a little utility under the Commands menu, which is called "Clean Up Word HTML." It does a passable job. You can choose between cleaning up Word 2000/2002 or Word 97/98. Check boxes let you choose to remove all Word-specific markup, clean up CSS, clean up font tags, set background color, or apply source formatting. It's possible that the more recent versions of Dreamweaver can do more, but I can't say for sure.

  224. document conversion by Anonymous Coward · · Score: 0

    If you have Dreamweaver, sometimes you can convert the Word document into Word HTML, open it in Dreamweaver, and use the clean up Word HTML feature. It's not always 100%, but it helps.

  225. some one most likely beat me to it, but by Anonymous Coward · · Score: 0

    Make the people who want this stuff posted give it to you in HTML instead of Word. HTML is not that difficult, and unless they are doing some goofy things in Word they will only need to learn a few tags.

  226. Solution by phishtrader · · Score: 1

    Although it doesn't exactly meet your requirements, printing to PDF and then posting the PDF would preserve all of the original formatting and would demand very little effort or time.

  227. Contribute by dejaffa · · Score: 1

    Don't Do It!

    I have to support a bunch of Contribute users at work, and the keys are constantly getting corrupted on the users' PC's. I may have less-clueful users than many, but I suspect they're representative.

    --
    There is no 'i' in team, but there is in fiasco...
    1. Re:Contribute by jtjin · · Score: 1

      I support a handful of Contribute users too, but we set up the connection right on their machines without using the connection keys. Sometimes we come across a few problems with the permissions set on the autogenerated folders by Contribute, but they're easy to fix.

      Other than that, Contribute has been a godsend, especially when you're trying to keep xhtml compliancy on a site managed by ... not so computer-literate content managers.

      --
      No rest for the livid.
  228. LaTeX by dodobh · · Score: 1

    Give your document providers a LaTeX stylesheet. Then they can get on with producing content, while your stylesheet handles _all_ formatting and possibly layout.

    --
    I can throw myself at the ground, and miss.
  229. TIDY it up by Gargamell · · Score: 1


    I don't know if anybody posted this, but depending on how much you want to formalize the content-push process, then perhaps Tidy is what you are looking for.

    http://tidy.sourceforge.net/

    If you are considering any kind of scripting solution, I would look into it. I would also include the DreamWeaver option as well in your thinking, if you are not considering a scripted solution.

    Best of luck! ~tim

  230. Re:"You can get anything you want.." - iTunes by chiefthe · · Score: 1

    Some power user you are. You can get it from the Russians [allofmp3.com] for a buck 46.

    --
    This was a quote of Kurt Vonnegut that didn't fit.
  231. Ween them off word.... by mergatoriod · · Score: 1

    Surely:

    1) Document a DTD for your contributors and make them aware of it (I'm sure they already exist for academic papers and such); the DTD can be used in conjunction with some editors to validate the document produced.
    2) Use Xalan or similar stylesheet transformation tool to parse document and generate HTML.
    3) Create a page that your contributors can use to upload their documents and see how they look once style has been added.
    4) Write something to transform word documents into your DTD.
    5) Clean up documents that you have translated yourself
    6) Bounce documents back to their owner citing web based preview tool if they do not conform.
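
    As a rough sketch of steps 2 and 6, here is what the transform might look like in Python. It swaps in plain standard-library XML parsing for Xalan/XSLT, and the element names (article, title, heading, para) are hypothetical stand-ins for whatever your DTD actually defines. It does not validate against a DTD; it only rejects elements it does not know.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical element names; a real DTD would define its own vocabulary.
TAG_MAP = {"title": "h1", "heading": "h2", "para": "p"}

def article_to_html(xml_text):
    """Map a flat <article> document to HTML, rejecting unknown elements."""
    root = ET.fromstring(xml_text)
    if root.tag != "article":
        raise ValueError("expected an <article> root element")
    lines = []
    for child in root:
        if child.tag not in TAG_MAP:
            # Step 6: bounce non-conforming documents back to their owner.
            raise ValueError(f"unexpected element <{child.tag}>")
        html_tag = TAG_MAP[child.tag]
        text = escape("".join(child.itertext()).strip())
        lines.append(f"<{html_tag}>{text}</{html_tag}>")
    return "\n".join(lines)
```

    The ValueError message is the sort of thing the preview page in step 3 could show to the contributor.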

    Easy!

    Now you just got to implement all that stuff...:)

    Good luck.

  232. Word Clean up by Anonymous Coward · · Score: 0

    I use Dreamweaver MX 2004. I first open the Word doc in Word and save it as HTML, then load that into Dreamweaver, where there is an option to clean up Word HTML, and voilà! Out comes my perfectly clean, still formatted Word doc with a new CSS section in the head that is perfectly W3C compliant.

  233. A Java tool to convert Word documents to XML by RaJuAlf · · Score: 1

    A Finnish company, Davisor, has developed a pure Java tool called Offisor that converts Word documents to a corresponding custom rich XML (XMSW). The tool can be embedded in any Java-compatible application or service, or it can be used as a standalone (command line) application.

    There is a free downloadable demo version available for anyone who wishes to try it out. The company also sells some feature-rich XSL transformations from the custom XMSW format to popular standard formats, including XHTML.

    Please note that I'm an engineer working for this company, and therefore at least part of my pay-check comes from Offisor sales.

  234. Zope with Plone by Feneric · · Score: 1

    Use a CMS like Plone (built on top of Zope). Its built-in document type can automatically convert input documents in various formats (including MS-Word) into something more web-friendly.

  235. BBEdit by Koredor · · Score: 1

    Admittedly, it has been some time since I have used BBEdit, but the app used to have a feature called "Remove Gremlins" that got rid of a majority of the word trash.

  236. Obvious doesn't mean trivial by arete · · Score: 1

    For instance, the Microsoft Office interface is "obvious" - because they publicly released the software.

    If someone made StarOffice (later OpenOffice) while under NDA at Sun that NDA would not keep them from releasing parts of that code unless the programming techniques used were different than what any programmer would expect.*

    (Copyright WOULD prevent them from releasing it, of course!)

    But the fact that a clone of MSOffice is "obvious" does not mean it is trivial, it was still a huge project.

    *Curiously, I think reverse engineered MS protocols/formats probably ARE covered by NDA even at a company reverse-engineering them. But if the OpenOffice format is published it would no longer be protected.

    I am not a lawyer.

    Ben

    --
    Looking for freelance Actionscript (Flash/Flex) or ColdFusion work and/or freelance developers. Email me, put Slashdot
  237. Standardized templates simplify everybody's work by HWheel · · Score: 1

    I'm a tech writer and when I go to a new job, one of the first things I do is survey the existing documents and create a Word stylesheet that incorporates:
      - the three most common headings (H1, H2, and H3)
      - the most common paragraph format (in two formats - one with a space after, the other without a space after the paragraph)
      - and any special paragraph formats I see that I know users are married to.

    Then I start producing and make the stylesheet available so that before long - miracle of miracles - a number of people are using them and simplifying my job a lot.

    (By the way, I also publish a glossary early on so that everybody knows how to spell email, log-on, user ID, and Internet, as well as present phone numbers without parentheses. Every company has a list of vocabulary and words that need standardization.)

  238. That's nothing... by The+Queen · · Score: 1

    I am expected to create trade show booth-size graphics from logos that people send - in Word.

    "Would it help if we brought you our business card to scan?"

    No, sir, it f*cking would not. But thanks for trying.

    --

    The House Between - Original Sci-Fi Series
  239. Use this utility by manojar · · Score: 1

    Evaluation version, works for 10 days, and only 5 files at a time: http://www.flash-utility.com/download/doc2html.exe Sorry if someone else has posted the same thing before, but I can't go through all 500 comments before posting this.

  240. PDF! by sirber · · Score: 1

    PDF export via OpenOffice? It's a standard and respects page breaks.

    --
    Be or ben't
  241. I have a perl script by Spazmania · · Score: 1

    Yeah, I hate that about Word. I have a perl script that strips out the worst of the junk that MS Word seems to add. It does the job for me. Your mileage may vary.

    http://bill.herrin.us/freebies/striphtml.pl

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
  242. wvHtml by itomato · · Score: 2, Informative
    http://wvware.sourceforge.net/

    From the sourceforge page:

    wv Utilities

    Provided with the wv distribution is an application called wvWare. wvWare is a "power-user" application with lots of command-line options, doo-dads, bells, and whistles. Less interesting, but more convenient, are the helper scripts that use wvWare. These are:

            * wvHtml: convert your Word document into HTML4.0

    (there are more utilities for LaTeX, etc.)


    I'm using this to convert all of our internal documentation. It does a pretty good job, even converts the images and acts in a relatively reliable manner with 2003, 2000, & 97 formatted files. There's some oddball output sprinkled in, but nothing a little sed fanciness can't fix.
  243. Word to HTML by Anonymous Coward · · Score: 0

    Back in 1999, I was hired by a company to fix just such a problem as this Word to HTML thing.
    I think Microsoft made this horribly mangled HTML in an effort to thwart Netscape. I fixed it by creating a program that parsed the HTML file, repaired the mistakes and spit out a new file. Worked pretty good as I recall.

    codifex

  244. WordPerfect by Anonymous Coward · · Score: 0

    Open the word document with WordPerfect and export it to pdf. done.

  245. Use a database by OreoCookie · · Score: 1

    You should save them as pdf files (or whatever) and store them in a database with some metadata. That way you can build a searchable archive of everything you ever post. It would be pretty straight forward to build a dynamic page that builds itself from the database. That way you never edit the page source, you only make changes to the database when you want the page content to change.

  246. I went through this. Ended up with OpenOffice by denis-The-menace · · Score: 1

    I have a self-help Web site and it has tonnes of pictures. HTML saved from Word (I call it MS-HTML) would turn all PNG graphics into bloated GIF/JPG files and would only look decent in IE.

    I switched to OpenOffice and I am better for it. In v1.1.4, I have to fix the line spacing in each HTML file but then it looks identical in IE and FF. This is fixed in OOo 2.0 beta.

    My advice: Unless you plan to give your DOC files to other people, switch to OpenOffice. Give PDFs to other people and export to HTML for your site in one step.

    Layout of menus is 90% like Word and it's free.

    --
    Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
  247. Newspapers have the same problem. by Anonymous Coward · · Score: 0

    Control-A, Control-C, open WordPad, Control-V.

    Control-A, Control-C, goto your CMS, Control-V.

    Retains all the links and text formatting but gets rid of all the garbage, and you don't have to pay an outrageous amount of money for Dreamweaver.

  248. Create a DB driven Site by n-baxley · · Score: 1

    I haven't seen your site (although there are those in this discussion that have found it, mwahaha), but if you haven't already, you may want to consider a database-driven site. All of the supporting documents that relate to your main file could be uploaded to the database and pulled out dynamically as needed. I've done quite a few implementations of this where we take Word documents and save them to the database using a tool called FCKeditor. It has a function for pasting in content from Word docs. We then save that to the database, create the relationships, and voilà! It's done.

    FCKeditor is open source, but if you need any help getting it up and running let me know.

  249. COM + Python/Ruby by ralphc · · Score: 1

    Under Windows, some dynamic languages like Python and Ruby have COM and Win32 APIs. Use one of them to access Office COM objects, iterate through and grab the content you want and have it spit out the HTML you need.

  250. When did submission guidelines become forbidden? by cyberzod · · Score: 1

    Since shortly after the introduction of the typewriter, publishers have required that submissions be typed, not hand-written. In the early days, a typewriter was an incredibly expensive machine. But budding authors either bought one or rented one if they wanted their work to be published. Try submitting a hand-written (or probably even TYPED) document to a modern publisher and you will get a polite response (which will include your unread masterpiece) that suggests that you read their guidelines and try again.

    Just because MSWord has a 90% market share of the non-computerati doesn't mean that you must do away with submission standards and guidelines. I'd suggest that you write up what you expect and if a submission doesn't conform, it should be returned, unread, with a politely worded note explaining:

    1) Why it was rejected.
    2) Where to go to find the guidelines.
    3) Hints on how to conform to those guidelines (Word's Save As.. text/html or similar.) based on the format submitted.

    The internet should not be multi-cultural when it comes to simple content. All such content should be submitted as HTML, period. This is not a heavy burden, takes only a few seconds per document and the author has the greatest stake in getting it on the web.

    Submission guidelines: Good enough for print publishers, good enough for web publishers.

    Here is a thought to consider, the page I am typing in right now as I write this text contains submission guidelines. These guidelines include: the acceptable text tags, how to format URLs and even hints on how to be a good submitter. Everyone who has responded to you (and even you, yourself) have managed to follow those guidelines. I bet your customers can too.

  251. Chami HTML-Kit by Steauengeglase · · Score: 1

    I would suggest Chami's HTML-Kit. It has a fine Word 2K tag removal feature. It has saved me a few hours of staring blankly at thousands of lines of useless MS Source.

  252. Simple! by Dman33 · · Score: 1

    There is this handy tool that can go through all of the HTML for you and do all kinds of custom formatting, cleaning up and simplification custom per your specifications!

    It is called an intern.

  253. Dreamweaver.... by CyberdogOSX · · Score: 0

    ....has a great "Clean up Word HTML" filter. Use that, then use the "Clean up HTML" filter in the same menu specifying that it also take out font and span tags.

    then go through a search and replace the few that are left. works great.

    or copy plain text and reformat.

  254. Exactly how it works where I work by conJunk · · Score: 1

    I've got the same situation. All the content is generated in word, I need to get it live in HTML

    Just copy and paste the Word document into plain text and open it in Dreamweaver. Then it's style sheets to the rescue, as you (should) only need to enter a tiny bit of markup.

  255. Cue Transformers by ichigo+2.0 · · Score: 1

    ... Americon ...

    Come on someone, bring out some Transformers jokes!

  256. Not for nothing... by skids · · Score: 1

    ...but if footprint is your concern, there are smaller emacs-like editors that are good enough for everyday use, and can save you the pain and suffering of vi. At minimum, "jove" has keyboard macros.

  257. Word Problems by jimmyjim · · Score: 1

    I always have the font layout and text size problems with Word docs. My question is do you think longhorn oops I mean vista will fix this problem?

  258. LyX helps. by Explo · · Score: 1

    Why do it the hard way? Use something like LyX as a frontend. It offers a nice equation editor and reduces the need for manual tex/latex writing significantly (basically, if you want something fancy, you may have to do it manually, but routine stuff becomes pretty transparent).

    (Well, it does not help with the original question about Word->HTML conversion...)

    --
    Everyone who makes generalizations should be shot.
  259. Get a Copy of Adobe Distiller and Acrobat by Anonymous Coward · · Score: 0
    I'm a software developer and systems administrator for a team developing B2B and B2C applications for in-house use. While I personally haven't used Distiller, the content guys use it all the time in order to convert Word documents into PDF files (which are then posted to a document library on the B2B site). I also generate quite a bit of PDF content since I keep handwritten daily logs and hardcopy from application testing that gets scanned and posted on my in-house WWW site. Free PDF readers are available for many platform.

    If you really want to keep an ASCII text version, you can use the "Extract Text" feature of Acrobat, possibly performing OCR (which is a built-in function of Acrobat) on the document first. Granted, this is a $500 solution (I'm a hardcore open-source software user and _never_ use commercial software personally), but it works extremely well.

  260. OpenOffice 2.0-beta "save-as" and "export" great by FreeUser · · Score: 1

    Open Office Version 1.9.122 (2.0-beta) is quite good for this.

    Load Micro$oft Word file.

    Export to HTML/PDF/whatever format you like. I've used it for my novel, and use both export-as-pdf and save-as-html, and with the exception of multi-columned text, saving as HTML works perfectly. Saving as PDF works perfectly for everything (including multicolumned entries and embedded fonts), as this example shows.

    --
    The Future of Human Evolution: Autonomy
  261. And some people type the whole message on the subj by bhudson · · Score: 1

    ject line.

  262. KISS - Keep it simple Stupid by Texas1st · · Score: 1

    Use PDF995 http://www.995software.com/ to convert the docs to PDF, then convert the PDF to HTML.

    --
    "I am a Bomb-Disposal Technician. If you see me running, try and keep up."
  263. Have you considered PDF conversion? by Anonymous Coward · · Score: 0

    There are several free tools to convert Word to PDF (and do a really good job).

  264. write a parser for it. by God+of+Lemmings · · Score: 1


    Lex/flex + yacc/byacc/bison

    Or just use Perl; that's what it's there for.

    --
    Non sequitur: Your facts are uncoordinated.
  265. terrific! by SethJohnson · · Score: 1



    Wow. Thanks for the reference to that wysiwyg textarea editor!!! For me, in my work, this is huge! I'm going to implement this ASAP!

    Appreciatively,

    Seth

  266. OpenOffice + PHP by HappyDrgn · · Score: 1

    OpenOffice saves files in an XML-based format (a zip archive with the document body in content.xml). Many languages, such as PHP, have external libraries that can read, parse or rewrite XML documents. Once you have the file stripped down to just the important XML sections, it should be fairly trivial to rewrite this into HTML. I did something kind of like this for a client who wanted to alter RSS news sources on a local website.
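
    A rough sketch of that idea, in Python rather than PHP and using only the standard library. It assumes the document is a zip archive containing content.xml, with paragraphs and headings stored as text:p and text:h elements; the function names and the choice to map every heading to h2 are illustrative, not anything the comment specifies.

```python
import html
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces for the "text" vocabulary. The first is OpenDocument (.odt);
# the second is, as far as I know, what OpenOffice.org 1.x (.sxw) used.
TEXT_NAMESPACES = (
    "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "http://openoffice.org/2000/text",
)


def content_xml_to_html(xml_bytes: bytes) -> str:
    """Turn text:h / text:p elements into bare <h2> / <p> tags."""
    root = ET.fromstring(xml_bytes)
    out = []
    for elem in root.iter():
        if not elem.tag.startswith("{"):
            continue  # element without a namespace, not ours
        namespace, local = elem.tag[1:].split("}", 1)
        if namespace not in TEXT_NAMESPACES or local not in ("p", "h"):
            continue
        # itertext() drops every span and style, keeping only the words.
        text = " ".join("".join(elem.itertext()).split())
        if not text:
            continue  # skip empty spacer paragraphs
        tag = "h2" if local == "h" else "p"  # assumption: flatten heading levels
        out.append(f"<{tag}>{html.escape(text)}</{tag}>")
    return "\n".join(out)


def document_to_html(path_or_file) -> str:
    """Read content.xml out of the document archive and convert it."""
    with zipfile.ZipFile(path_or_file) as archive:
        return content_xml_to_html(archive.read("content.xml"))
```

    This deliberately throws away all formatting. Tabs, forced line breaks, tables, lists and images are not handled, and a paragraph nested inside another (a text box, say) would show up twice.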

    1. Re:OpenOffice + PHP by HappyDrgn · · Score: 1

      Browsing PHP.NET, I found an even better way than this.

      http://us3.php.net/manual/en/class.com.php

      The COM class lets you instantiate an OLE-compatible COM object (which MS Word provides). Examples are provided for opening MS Word files. Note that this needs PHP running on Windows with Word installed. Enjoy!

  267. Nvu by MrResistor · · Score: 1

    Copy the text, and paste it into Nvu without formatting (found in the menus, not the default option unfortunately). From there it shouldn't be a big deal to format it however you like.

    --
    Under capitalism man exploits man. Under communism it's the other way around.
  268. bbedit by capsteve · · Score: 1

    bbedit has a neat feature which allows for batch editing of files, and if you wish, allows you to verify the change before committing... many programming tag libraries are available (html,perl,java,php), and it really helps when processing a large number of files. plus, if you use it with tiger, i bet you could do some funky automator scripts to really custom fit your needs.

    --
    three can keep a secret, if two are dead - benjamin franklin
  269. Re:OpenOffice 2.0-beta "save-as" and "export" grea by sonamchauhan · · Score: 2, Interesting

    Just expanding on your suggestion...

    Perhaps he could use the OpenOffice API to automatically have a server-side instance of OpenOffice open submitted Word documents and save them as HTML. This should happen at the same time the user uploads the document - that way the user could preview the conversion to HTML, and if it was flawed, he could choose to publish the document as PDF.

    OpenOffice API:
        http://api.openoffice.org/

    Code snippet shows simplicity of converting OpenOffice Writer SXW document into PDF:
        http://codesnippets.services.openoffice.org/Writer/Writer.StoreWriterAsPDF.snip
    Perhaps a few small changes here would get him what he wants.

    Perl interface (ooolib):
        http://ooolib.sourceforge.net/doc/ooolib-0.1.5-doc.html#info
    There are also Java code snippets. I think it would be possible to convert the OOBasic snippet above to either Java or Perl.
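
    As a rough illustration of the server-side idea, here is a sketch that shells out to a headless office process instead of calling the API directly. It assumes a LibreOffice-style soffice binary on the PATH that accepts --headless, --convert-to and --outdir (the OpenOffice build discussed above may not have these options); the function names are made up for the example.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_convert_command(source: Path, out_dir: Path,
                          target: str = "html",
                          binary: str = "soffice") -> list[str]:
    """Build the argument list for one headless conversion."""
    return [binary, "--headless", "--convert-to", target,
            "--outdir", str(out_dir), str(source)]


def convert(source: Path, out_dir: Path, target: str = "html") -> Path:
    """Convert an uploaded document and return the expected output path."""
    binary = shutil.which("soffice")
    if binary is None:
        raise RuntimeError("soffice not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises if the office process exits with an error;
    # the timeout guards against a hung conversion on the server.
    subprocess.run(build_convert_command(source, out_dir, target, binary),
                   check=True, timeout=120)
    # Assumption: the output keeps the input's base name and takes the
    # target format as its extension (true for plain "html" and "pdf").
    return out_dir / f"{source.stem}.{target}"
```

    An upload handler could call convert(path, out_dir) to build the HTML preview, then call it again with target="pdf" if the user rejects the result.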