Convert from HTML to XML With HTML Tidy
An anonymous reader writes "HTML Tidy is a powerful tool that helps convert old HTML pages to newer standards, such as XML. This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools."
more like tips for newbies
but yeah, this is a great tip, especially if you are writing web scrapers to extract data from web pages and/or convert them to RSS. Just run the page through Tidy, and then your favorite XML parser can slurp it right up and you can use XPath to pull out what you need.
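For anyone who wants to try the Tidy-then-XPath routine, here is a rough sketch in Python. It assumes the `tidy` command-line tool is installed and on your PATH; the function names `tidy_to_xhtml` and `extract_links` are made up for illustration, and the standard library's ElementTree only supports a limited subset of XPath, so treat this as a starting point rather than the one true way.

```python
import subprocess
import xml.etree.ElementTree as ET

# Tidy's XHTML output puts every element in the XHTML namespace,
# so XPath queries need a prefix mapped to it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def tidy_to_xhtml(html: str) -> str:
    """Pipe HTML through the tidy CLI (assumed installed) and return XHTML.

    -asxml   convert HTML to well-formed XHTML
    -numeric emit numeric entities, so the XML parser never meets &nbsp;
    -q       suppress the banner and summary
    """
    result = subprocess.run(
        ["tidy", "-q", "-asxml", "-numeric", "-utf8"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8")


def extract_links(xhtml: str) -> list:
    """Return (href, text) pairs for every link in an XHTML document."""
    root = ET.fromstring(xhtml)
    links = []
    for a in root.findall(".//x:a[@href]", XHTML_NS):
        text = "".join(a.itertext()).strip()
        links.append((a.get("href"), text))
    return links
```

Usage would be something like `extract_links(tidy_to_xhtml(page_source))`, where `page_source` is whatever HTML your scraper fetched. From there, turning the pairs into RSS items is just formatting.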
Look out though, there are some cases that Tidy chokes on. One that I keep running into is shit like this:
<table>
<form>
<tr><td>...</td></tr>
</table>
</form>
Basically, it's mixing stuff in between table rows, or something like that anyway. Just be ready to handle a fatal error from Tidy. It surprised me at first because I thought it could eat anything.
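Being ready for that failure could look roughly like the Python sketch below. It leans on the exit status documented in the tidy man page (0 for clean input, 1 for warnings, 2 for errors) and again assumes the `tidy` CLI is on your PATH. The names `TidyFailed`, `interpret_tidy_result`, and `tidy_or_none` are hypothetical, and whether your particular broken page actually triggers status 2 depends on your Tidy version.

```python
import subprocess


class TidyFailed(Exception):
    """Raised when tidy reports errors or produces no usable output."""


def interpret_tidy_result(returncode: int, stdout: bytes, stderr: bytes) -> str:
    """Return tidy's output, or raise TidyFailed if it gave up on the page."""
    # Documented exit status: 0 = clean, 1 = warnings only, 2 = errors.
    # Anything else (or empty output) is treated as a failure here.
    if returncode not in (0, 1) or not stdout.strip():
        message = stderr.decode("utf-8", errors="replace").strip()
        raise TidyFailed(message or f"tidy exited with status {returncode}")
    return stdout.decode("utf-8", errors="replace")


def tidy_or_none(html: str):
    """Run tidy on one page; return XHTML, or None so a scraper can skip it."""
    try:
        proc = subprocess.run(
            ["tidy", "-q", "-asxml", "-utf8"],
            input=html.encode("utf-8"),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        return interpret_tidy_result(proc.returncode, proc.stdout, proc.stderr)
    except (TidyFailed, FileNotFoundError) as exc:
        # FileNotFoundError covers the case where tidy is not installed.
        print(f"skipping page: {exc}")
        return None
```

Returning None instead of crashing lets a scraper log the bad page and move on to the next one. Tidy also has a `--force-output yes` option that makes it emit its best guess even after errors, though the result may not be well-formed enough for a strict XML parser.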