Slashdot Mirror


Convert from HTML to XML With HTML Tidy

An anonymous reader writes "HTML Tidy is a powerful tool that helps convert old HTML pages to newer standards, such as XML. This tip demonstrates how to convert HTML documents to XML (or, more specifically, XHTML) with a simple, open source tool. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools."

2 of 43 comments

  1. news for nerds? by Anonymous Coward · · Score: 2, Insightful

    more like tips for newbies

    but yeah, this is a great tip, especially if you are writing web scrapers to extract data from web pages and/or convert them to RSS. Just use Tidy to tidy it, and then your favorite XML parser can slurp it right up and you can use XPath to pull out what you need.
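A rough sketch of the "parse it, then XPath it" half, assuming the page has already been run through something like `tidy -q -asxml -numeric` so that it is well-formed XHTML with numeric entities. The function name `extract_links` is made up for illustration, and Python's built-in ElementTree only supports a subset of XPath, so a fuller XPath library may be needed for fancier queries.

```python
import xml.etree.ElementTree as ET

# Tidy's XHTML output puts every element in the XHTML namespace,
# so the XPath query needs a prefix mapped to it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_links(xhtml_text):
    """Return the href of every <a href="..."> in a tidied XHTML page (hypothetical helper)."""
    root = ET.fromstring(xhtml_text)
    # ".//x:a[@href]" selects all <a> elements, at any depth, that carry an href attribute.
    return [a.get("href") for a in root.findall(".//x:a[@href]", XHTML_NS)]
```

Feeding it the tidied text of a page gives back a plain list of link targets, which is most of what an HTML-to-RSS scraper needs to get started.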

    Look out though, there are some cases that Tidy chokes on. One that I keep running into is shit like this:

    <table>
    <form>
    <tr><td>...</td></tr>
    </table>
    </form>

    Basically, mixing stuff in between table rows. Something like that anyway. Just be ready to handle a fatal error from Tidy; it surprised me at first because I thought it could eat anything.
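One way to "be ready" for that, sketched under a few assumptions: the `tidy` command-line binary is on the PATH, and it follows its documented exit statuses (0 for clean, 1 for warnings only, 2 for errors). The names `TidyError`, `check_tidy_result` and `tidy_to_xhtml` are invented for this example, and whether the table/form snippet above actually trips the error path may depend on the Tidy version.

```python
import subprocess


class TidyError(Exception):
    """Raised when tidy reports errors it could not repair (hypothetical name)."""


def check_tidy_result(returncode, stderr):
    """Return True if tidy only warned, False if clean; raise TidyError on errors."""
    # Documented tidy exit statuses: 0 = OK, 1 = warnings, 2 = errors.
    if returncode >= 2:
        raise TidyError(stderr.strip() or "tidy failed without a message")
    return returncode == 1


def tidy_to_xhtml(html, tidy_path="tidy"):
    """Pipe HTML through the tidy binary and return XHTML, or raise TidyError."""
    proc = subprocess.run(
        [tidy_path, "-q", "-asxml", "-numeric", "-utf8"],
        input=html,
        capture_output=True,
        encoding="utf-8",
    )
    check_tidy_result(proc.returncode, proc.stderr)
    return proc.stdout
```

A scraper can then wrap `tidy_to_xhtml(page)` in a `try`/`except TidyError` and skip or log the offending page instead of dying halfway through a crawl. If the binary is not installed at all, `subprocess.run` raises `FileNotFoundError` instead.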

    1. Re:news for nerds? by aWalrus · · Score: 3, Insightful

      That's because that's invalid markup. When it gets tidied, where would you put the form? Inside or outside the table?

      --
      Overcaffeinated. Angry geeks.