Slashdot Mirror


Convert from HTML to XML With HTML Tidy

An anonymous reader writes "HTML Tidy is a powerful tool to help convert old HTML pages to newer standards, such as XML. This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools."

17 of 43 comments

  1. I'll check it out by Kethinov · · Score: 2

    I've always been interested in X(HT)ML, but I've never wanted to sit down and convert every single page by hand. This tool might be just what I need.

    --
    You're right, I wouldn't steal a car. But if it were possible, I sure as hell would download one!
    1. Re:I'll check it out by in10se · · Score: 2, Informative

      It's extremely useful for converting the "HTML" generated by Microsoft Office products into nice, clean, well formatted XHTML.

      --
      Popisms.com - Connecting pop culture
    2. Re:I'll check it out by fm6 · · Score: 4, Interesting

      I've sneered at XHTML in the past, but I was speaking out of ignorance. I was assuming it was just a silly attempt to preserve HTML in an XML world. Actually, it's a very convenient bridge between HTML and XML. It's only incidentally about web content, since browsers will always need to support legacy HTML, and thus will never adopt all of XHTML's structure and restrictions. But once you have your content in XHTML format, you can transform it into any XML application you choose, using XSLT scripts. Which opens up a whole world of possibilities for people with all their content in messy old word processor formats, since word processors now tend to come with HTML export filters.

  2. libxml2? by dakkar · · Score: 5, Informative

    A few days ago I had to convert HTML pages into XHTML, stripping out a few extra elements and attributes. I used xsltproc, from libxslt, which uses the parser from libxml2, and this has the option of parsing strict HTML into an XML DOM.

    HTML Tidy can be useful when you have not-so-strict HTML, but for most quick conversions I've found libxml2 & co. to be quite light and easy.

    --
    dakkar - mobilis in mobile
  3. This isn't new... by in10se · · Score: 3, Informative

    HTML Tidy has been out for years.



    Check out the Tidy Homepage or the project on SourceForge.

    --
    Popisms.com - Connecting pop culture
    1. Re:This isn't new... by AShocka · · Score: 2, Informative
      It's used in ... There are GUI versions, command-line versions, etc.
  4. news for nerds? by Anonymous Coward · · Score: 2, Insightful

    more like tips for newbies

    but yeah this is a great tip.. especially if you are writing web-scrapers to extract data from web pages and/or convert them to RSS. Just use Tidy to tidy it, and then your favorite XML parser can slurp it right up and you can use XPath to pull out what you need.
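
    A minimal sketch of that scrape-after-Tidy step, using only Python's standard library. The TIDIED string is a hand-written stand-in for Tidy output (it is an assumption, not real output), and extract_links is a hypothetical helper name. ElementTree only supports a subset of XPath, so a full XPath engine would be needed for fancier queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for Tidy's XHTML output (not produced by a real run).
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Headlines</title></head>
<body>
<ul>
<li><a href="/story/1">First story</a></li>
<li><a href="/story/2">Second story</a></li>
</ul>
</body>
</html>"""

# XHTML elements live in this namespace, so queries must name it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_links(xhtml_text):
    """Return (href, text) pairs for every link directly inside a list item."""
    root = ET.fromstring(xhtml_text)
    return [(a.get("href"), a.text) for a in root.findall(".//x:li/x:a", NS)]


links = extract_links(TIDIED)
print(links)
```

    From there the pairs can be written out as RSS items or anything else.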

    Look out though, there are some cases that Tidy chokes on. One that I keep running into is shit like this:

    <table>
    <form>
    <tr><td>...</td></tr>
    </table>
    </form>

    Basically mixing stuff in between table rows. Something like that anyway. Just be ready to handle a fatal error from Tidy, it surprised me at first because I thought it could eat anything.
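
    One way to be ready for that, sketched in Python under a few assumptions: a tidy executable is on the PATH, and its exit status follows the documented convention (0 for success, 1 for warnings, 2 for errors). The function names tidy_command, describe_exit and convert are made up for illustration.

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build the argument list for an HTML-to-XHTML conversion."""
    # -q: suppress nonessential output, -asxhtml: emit XHTML, -o: output file
    return ["tidy", "-q", "-asxhtml", "-o", dst, src]


def describe_exit(code):
    """Map Tidy's exit status to a short label."""
    return {0: "ok", 1: "warnings", 2: "errors"}.get(code, "unknown")


def convert(src, dst):
    """Run Tidy if it is installed and report the outcome as a label."""
    if shutil.which("tidy") is None:
        return "tidy not installed"
    result = subprocess.run(tidy_command(src, dst), capture_output=True, text=True)
    # With status 2 Tidy normally refuses to write output, so the caller
    # should skip the XML parsing step for this page.
    return describe_exit(result.returncode)
```

    A scraper could call convert() per page and only hand the result to an XML parser when the label is "ok" or "warnings".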

    1. Re:news for nerds? by aWalrus · · Score: 3, Insightful

      That's because that's invalid markup. When it gets tidied, where would you put the form? inside or outside the table?

      --
      Overcaffeinated. Angry geeks.
    2. Re:news for nerds? by mhesseltine · · Score: 2, Informative
      That's because that's invalid markup. When it gets tidied, where would you put the form? inside or outside the table?

      Well, from the W3C page on HTML 4.01

      The FORM element acts as a container for controls. It specifies:
      * The layout of the form (given by the contents of the element).
      * The program that will handle the completed and submitted form (the action attribute). The receiving program must be able to parse name/value pairs in order to make use of them.
      * The method by which user data will be sent to the server (the method attribute).
      * A character encoding that must be accepted by the server in order to handle this form (the accept-charset attribute). User agents may advise the user of the value of the accept-charset attribute and/or restrict the user's ability to enter unrecognized characters.

      A form can contain text and markup (paragraphs, lists, etc.) in addition to form controls.

      Given that information, I would put the form outside the table element, since the table controls the layout of the form. Besides, shouldn't that be part of tidying up code?

      --
      Overrated / Underrated : Moderation :: Anonymous Coward : Posting
  5. Re:how many years has it produced XHTML ? by in10se · · Score: 2, Informative

    Again, it's converted to XHTML for a few years. I only posted the original message because I was quite surprised to see it on Slashdot. It's not uncommon to see a story that is a month or two old on the homepage, but several years old is crazy.

    --
    Popisms.com - Connecting pop culture
  6. Why use XHTML? by Alethes · · Score: 3, Informative

    Ian Hickson makes a good case here that using XHTML may not be the right direction to go -- at least at this point.

  7. Re:Why not use HTML4 then? by Enrico Pulatzo · · Score: 3, Informative
    I use XHTML so I can:
    1. say that I use XHTML
    2. make it easier to parse my pages
    HTML 4.01 doesn't make you expressly close your tags, which causes XML processors to choke and die. I'd rather write it in a usable format once than have to Tidy-parse every time I want to update my search engine. Plus XSLT really is cool. I've got (somewhere) a stylesheet I wrote that will validate form data for me and then I can apply other xslt stylesheets to make the output, further separating the output from the script that does the magic. Great way to update the look of a page without messing up (accidentally, of course) the code I wrote months ago.
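
    A small illustration of the "choke and die" point, using Python's standard XML parser. The two sample strings and the is_well_formed helper are invented for this sketch; the first string uses the unclosed paragraphs that HTML 4.01 permits, the second closes them as XHTML requires.

```python
import xml.etree.ElementTree as ET

# HTML 4.01 allows omitting </p>; XML does not.
HTML_STYLE = "<body><p>First paragraph<p>Second paragraph</body>"
XHTML_STYLE = "<body><p>First paragraph</p><p>Second paragraph</p></body>"


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(HTML_STYLE))   # False: </body> arrives while <p> is open
print(is_well_formed(XHTML_STYLE))  # True
```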
  8. BBTidy BBEdit plugin (Mac OS) by jpkunst · · Score: 3, Informative

    If you are running MacOS with BBEdit, you can use the BBTidy plugin to get HTML Tidy integration in BBEdit.

    JP

  9. Re:how many years has it produced XHTML ? by LarryRiedel · · Score: 2, Informative

    The date for the referenced article is 18 Sep 2003, less than two weeks ago.

    Larry

  10. Re:how many years has it produced XHTML ? by kalidasa · · Score: 2, Informative

    4 years. I remember it being implemented before the XHTML recommendation was final. I remember it particularly because I've been using XHTML on my website since I converted it, then, in 1999, with HTML-Tidy.

  11. HTML to XHTML can only be made manually by chrysalis · · Score: 2, Interesting

    Yes, HTMLTidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and creates CSS classes instead of attributes like "background".

    But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.

    When you have a horror like:

    My company

    to display a title, how do you expect an automatic tool like Tidy to convert it to:

    My company

    ?

    It just can't. It will see a table with no caption, no column headers and three elements: two images and some text that is not marked as a title at all.

    Converting an HTML web site with no semantics to XHTML using Tidy is useless. The result will still be effectively unparsable (it will parse, but the elements will have no meaning), the site will still be inaccessible to alternative browsers, it will still be hell to maintain, etc. Of course easy navigation with the keyboard shortcuts using Mozilla is out of the question.

    And the code will even be larger because of the indentation, closing and styles created by Tidy.

    All benefits of XHTML/CSS are totally lost.

    Look at a horror like:

    http://www.skyrock.com/

    Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (ex: Sony/Ericsson P800).

    You don't see anything but the names of three files supposed to be images. And this is all you can see on the web site. You don't see any link nor any text.

    Convert this to XHTML using Tidy.

    The site still doesn't look like anything but three useless filenames. It just takes twice as long to load because the code is larger.

    Correct XHTML sites have to be designed the right way from the ground up. There's no magic to convert a horror to something clean. And even manually, the best way to do so is almost always to restart from scratch.

  12. HTML to XHTML can only be made manually (extrans) by chrysalis · · Score: 2, Interesting

    Argl, I forgot to enable "Extrans" before submitting the previous post :(

    Let's try again, sorry for the noise, I believed
    "plain old text" would escape HTML tags.

    ---

    Yes, HTMLTidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and creates CSS classes instead of attributes like "background".

    But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.

    When you have a horror like:

    <table><tr><td width="100%" align="center"><img src="transparentpix.gif" width="20"><font size="9"><b>My company</b></font><img src="transparentpix.gif" width="20"></td></tr></table>

    to display a title, how do you expect an automatic tool like Tidy to convert it to:

    <h1>My company</h1>

    ?

    It just can't. It will see a table with no caption, no column headers and three elements: two images and some text that is not marked as a title at all.

    Converting an HTML web site with no semantics to XHTML using Tidy is useless. The result will still be effectively unparsable (it will parse, but the elements will have no meaning), the site will still be inaccessible to alternative browsers, it will still be hell to maintain, etc. Of course easy navigation with the keyboard shortcuts using Mozilla is out of the question.
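
    To make the "it will parse, but elements will have no meaning" point concrete, here is a rough Python sketch. TABLE_TITLE is a hand-written well-formed version of the table markup above (an assumption, not actual Tidy output), and find_headings is a hypothetical stand-in for any tool that looks for titles, such as an outline generator or a search indexer.

```python
import xml.etree.ElementTree as ET

# Hand-written well-formed version of the table-based "title" (not real Tidy output).
TABLE_TITLE = (
    '<div><table><tr><td width="100%" align="center">'
    '<img src="transparentpix.gif" width="20" alt="" />'
    '<font size="9"><b>My company</b></font>'
    '<img src="transparentpix.gif" width="20" alt="" />'
    "</td></tr></table></div>"
)
SEMANTIC_TITLE = "<div><h1>My company</h1></div>"


def find_headings(markup):
    """Return the text of every h1 element in a well-formed fragment."""
    root = ET.fromstring(markup)
    return [h.text for h in root.iter("h1")]


# Both fragments parse, but only one of them tells a program where the title is.
print(find_headings(TABLE_TITLE))     # []
print(find_headings(SEMANTIC_TITLE))  # ['My company']
```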

    And the code will even be larger because of the indentation, closing and styles created by Tidy.

    All benefits of XHTML/CSS are totally lost.

    Look at a horror like:

    http://www.skyrock.com/

    Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (ex: Sony/Ericsson P800).

    You don't see anything but the names of three files supposed to be images. And this is all you can see on the web site. You don't see any link nor any text.

    Convert this to XHTML using Tidy.

    The site still doesn't look like anything but three useless filenames. It just takes twice as long to load because the code is larger.

    Correct XHTML sites have to be designed the right way from the ground up. There's no magic to convert a horror to something clean. And even manually, the best way to do so is almost always to restart from scratch.
