Convert from HTML to XML With HTML Tidy
An anonymous reader writes "HTML Tidy is a powerful tool that helps convert old HTML pages to newer standards, such as XML. This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools."
A few days ago I had to convert HTML pages into XHTML, stripping out a few extra elements and attributes. I used xsltproc, from libxslt, which uses the parser from libxml2, and this has the option of parsing strict HTML into an XML DOM.
HTML Tidy can be useful when you have not-so-strict HTML, but for most quick conversions I've found libxml2 and co. to be quite light and easy.
dakkar - mobilis in mobile
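For anyone who wants to see the idea in the comment above without installing anything, here is a rough sketch using only the Python standard library. It is not libxml2 or xsltproc; it is a simplified stand-in that parses lenient HTML into a tree, strips a few elements and attributes, and writes well-formed XML. The names LenientBuilder and html_to_xml, and the drop lists, are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID = {"br", "hr", "img", "input", "meta", "link"}  # never closed in HTML
AUTO_CLOSE = {"p", "li"}                # closed implicitly by a sibling of the same name
DROP_ELEMENTS = {"font"}                # hypothetical rule: unwrap these elements
DROP_ATTRIBUTES = {"bgcolor", "align"}  # hypothetical rule: strip these attributes


class LenientBuilder(HTMLParser):
    """Build an ElementTree from loosely written HTML (simplified sketch)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("body")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in DROP_ELEMENTS:
            return  # unwrap: keep the content, drop the element itself
        if tag in AUTO_CLOSE and self.stack[-1].tag == tag:
            self.stack.pop()  # e.g. a new <p> closes the still-open <p>
        kept = {k: (v or "") for k, v in attrs if k not in DROP_ATTRIBUTES}
        node = ET.SubElement(self.stack[-1], tag, kept)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text that follows a child element goes in that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(fragment: str) -> str:
    """Return a well-formed XML string for a body-level HTML fragment."""
    builder = LenientBuilder()
    builder.feed(fragment)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")


print(html_to_xml('<P align="center">one<br>two<p><font color="red">three</font>'))
```

A real HTML parser handles far more cases (tables, nested lists, character encodings), so treat this only as a picture of the approach, not a replacement for libxml2 or Tidy.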
HTML Tidy has been out for years.
Check out the Tidy Homepage or the project on SourceForge.
Popisms.com - Connecting pop culture
Ian Hickson makes a good case here that using XHTML may not be the right direction to go -- at least at this point.
That's because that's invalid markup. When it gets tidied, where would you put the form? Inside or outside the table?
Overcaffeinated. Angry geeks.
- say that I use XHTML
- make it easier to parse my pages
HTML 4.01 doesn't make you expressly close your tags, which causes XML processors to choke and die. I'd rather write it in a usable format once than have to Tidy-parse every time I want to update my search engine. Plus XSLT really is cool. I've got (somewhere) a stylesheet I wrote that will validate form data for me, and then I can apply other XSLT stylesheets to make the output, further separating the output from the script that does the magic. Great way to update the look of a page without messing up (accidentally, of course) the code I wrote months ago.
If you are running MacOS with BBEdit, you can use the BBTidy plugin to get HTML Tidy integration in BBEdit.
JP
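The point made a little earlier, that HTML 4.01 lets you leave tags unclosed and XML processors choke on the result, is easy to check. A minimal sketch using Python's standard XML parser; the helper name is_well_formed and the sample markup are just for illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False if it rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


html401_style = "<ul><li>first<li>second</ul>"          # legal HTML 4.01, <li> left open
xhtml_style = "<ul><li>first</li><li>second</li></ul>"  # every element closed

print(is_well_formed(html401_style))  # False
print(is_well_formed(xhtml_style))    # True
```

The first string is fine for a browser, but an XML parser stops at the mismatched tag, which is why the markup has to be tidied before any XML tooling can touch it.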
I've sneered at XHTML in the past, but I was speaking out of ignorance. I was assuming it was just a silly attempt to preserve HTML in an XML world. Actually, it's a very convenient bridge between HTML and XML. It's only incidentally about web content, since browsers will always need to support legacy HTML, and thus will never adopt all of XHTML's structure and restrictions. But once you have your content in XHTML format, you can transform it into any XML application you choose, using XSLT scripts. Which opens up a whole world of possibilities for people with all their content in messy old word processor formats, since word processors now tend to come with HTML export filters.
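To make the "bridge" idea concrete, here is a small sketch of turning XHTML into a different XML vocabulary. The comment above talks about XSLT; Python's standard library has no XSLT processor, so this uses ElementTree as a stand-in for what a stylesheet would do. The target format (article, title, para) and the function name xhtml_to_article are invented for the example.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so lookups must be namespace-aware.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def xhtml_to_article(xhtml: str) -> str:
    """Map an XHTML page onto a made-up <article> vocabulary."""
    page = ET.fromstring(xhtml)
    article = ET.Element("article")
    title = page.find("x:head/x:title", XHTML_NS)
    ET.SubElement(article, "title").text = title.text if title is not None else ""
    for p in page.iterfind("x:body/x:p", XHTML_NS):
        # itertext() flattens inline markup such as <em> into plain text
        ET.SubElement(article, "para").text = "".join(p.itertext())
    return ET.tostring(article, encoding="unicode")


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Notes</title></head>"
    "<body><p>Exported from a <em>word processor</em>.</p></body>"
    "</html>"
)
print(xhtml_to_article(sample))
```

This only works because the input is already well-formed XHTML; raw word-processor HTML would need to go through something like Tidy first.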