Convert from HTML to XML With HTML Tidy
An anonymous reader writes "HTML Tidy is a powerful tool that helps convert old HTML pages to newer standards such as XML. This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools."
I've always been interested in X(HT)ML, but I've never wanted to sit down and convert every single page by hand. This tool might be just what I need.
You're right, I wouldn't steal a car. But if it were possible, I sure as hell would download one!
A few days ago I had to convert HTML pages into XHTML, stripping out a few extra elements and attributes. I used xsltproc, from libxslt, which uses the parser from libxml2, and this has the option of parsing strict HTML into an XML DOM.
HTML Tidy can be useful when your HTML is not so strict, but for most quick conversions I've found libxml2 & co. to be quite light and easy.
dakkar - mobilis in mobile
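For anyone who wants to see what "stripping out a few extra elements and attributes" looks like once the page is already in a tree, here is a minimal sketch. It is not the xsltproc approach described above: it swaps in Python's standard xml.etree.ElementTree, and the element and attribute names in DROP_ELEMENTS and DROP_ATTRIBUTES, like the SAMPLE input, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical choices; pick whatever your own cleanup needs.
DROP_ELEMENTS = {"script"}
DROP_ATTRIBUTES = {"align", "bgcolor"}

# Hypothetical, already well-formed input (no namespace, to keep tag names simple).
SAMPLE = '<body bgcolor="white"><p align="center">Hello <script>x()</script>world</p></body>'

def strip_extras(root):
    """Remove unwanted attributes and elements in place, keeping surrounding text."""
    for parent in list(root.iter()):
        # Drop the unwanted attributes on this element.
        for attr in DROP_ATTRIBUTES & set(parent.attrib):
            del parent.attrib[attr]
        prev = None
        for child in list(parent):
            if child.tag in DROP_ELEMENTS:
                # Keep the text that followed the removed element.
                if child.tail:
                    if prev is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        prev.tail = (prev.tail or "") + child.tail
                parent.remove(child)
            else:
                prev = child
    return root

if __name__ == "__main__":
    tree = ET.fromstring(SAMPLE)
    print(ET.tostring(strip_extras(tree), encoding="unicode"))
```

An XSLT identity transform with a couple of empty templates does the same job more declaratively; this is just the idea in a few lines.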
HTML Tidy has been out for years.
Check out the Tidy Homepage or the project on SourceForge.
Popisms.com - Connecting pop culture
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
more like tips for newbies
but yeah, this is a great tip, especially if you are writing web scrapers to extract data from web pages and/or convert them to RSS. Just use Tidy to tidy it, and then your favorite XML parser can slurp it right up and you can use XPath to pull out what you need.
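A rough sketch of the "slurp it up and use XPath" step, assuming Tidy has already produced well-formed XHTML. The TIDIED sample, the "story" class and the story_links helper are invented for illustration, and ElementTree only supports a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Tidy's XHTML output puts elements in the XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample standing in for a page that Tidy has already cleaned up.
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>News</title></head>
<body>
<div class="story"><a href="/a.html">First story</a></div>
<div class="story"><a href="/b.html">Second story</a></div>
<div class="ad"><a href="/buy.html">Buy now</a></div>
</body>
</html>"""

def story_links(xhtml_text):
    """Return (href, link text) pairs for links inside div class="story"."""
    root = ET.fromstring(xhtml_text)
    links = root.findall(".//x:div[@class='story']/x:a", XHTML_NS)
    return [(a.get("href"), a.text) for a in links]

if __name__ == "__main__":
    for href, text in story_links(TIDIED):
        print(href, text)
```

From there, turning the pairs into RSS items is just string templating.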
Look out though, there are some cases that Tidy chokes on. One that I keep running into is shit like this:
<table>
<form>
<tr><td>...</td></tr>
</table>
</form>
Basically mixing stuff in between table rows. Something like that anyway. Just be ready to handle a fatal error from Tidy, it surprised me at first because I thought it could eat anything.
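One way to "be ready" for that, sketched under the assumption that the tidy command-line tool is installed and on your PATH. Tidy documents its exit status as 0 for success, 1 for warnings and 2 for errors. The TidyFailed exception and the tidy_to_xhtml wrapper are names invented for this sketch.

```python
import subprocess

# Assumption: `tidy` is on PATH. -q suppresses the summary, -asxhtml asks for XHTML.
TIDY_CMD = ["tidy", "-q", "-asxhtml"]

class TidyFailed(Exception):
    """Raised when Tidy cannot produce usable output (hypothetical helper type)."""

def tidy_to_xhtml(html, cmd=TIDY_CMD):
    """Pipe HTML through Tidy and return its output, or raise TidyFailed."""
    try:
        proc = subprocess.run(cmd, input=html, capture_output=True, text=True)
    except FileNotFoundError:
        raise TidyFailed("tidy executable not found")
    # Exit status: 0 = clean, 1 = warnings only, 2 = errors.
    if proc.returncode >= 2 or not proc.stdout.strip():
        raise TidyFailed(proc.stderr.strip() or "tidy produced no output")
    return proc.stdout
```

In a scraper you would wrap the call in try/except TidyFailed and log or skip the page, rather than letting one mangled table kill the whole run.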
No surprise really.
I think people must submit to /. instead of just bookmarking.
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Ian Hickson makes a good case here that using XHTML may not be the right direction to go -- at least at this point.
Yeah, the first line is a little flamebait-ey, but other than that it's informative.
If you are running Windows, there is a nice HTML editor called HTML-Kit that integrates HTML-Tidy right in. It's not WYSIWYG, but it color-codes your HTML and can format it in a number of ways.
Properly formed HTML 4.01 is pretty close to XHTML 1.0 transitional.
If you are running MacOS with BBEdit, you can use the BBTidy plugin to get HTML Tidy integration in BBEdit.
JP
MOD PARENT UP!! He's right, HTML-Kit is the choice of even those who use Dreamweaver MX, because Dreamweaver does not respect the formatting of your HTML. HTML coders use HTML Tidy and HTML-Kit to clean up Dreamweaver HTML output, and you-know-who's HTML output, of course, which is so disrespectful it would stomp on your toes if it could.
I recently used HTML Tidy to convert my website from HTML to XHTML. It works well. After Tidy, I used WDG HTML Validator to verify that the code was correct. (It validates XHTML as well.) If you install your own version of the validator you can more easily check your entire website. This is important if you have a lot of pages.
Yes, HTML Tidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and creates CSS classes instead of attributes like "background".
But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.
When you have a horror like
<table><tr><td width="100%" align="center"><img src="transparentpix.gif" width="20"><font size="9"><b>My company</b></font><img src="transparentpix.gif" width="20"></td></tr></table>
to display a title, how do you expect an automatic tool like Tidy to convert it to
<h1>My company</h1>
?
It just can't. It will see a table with no caption, no column headers and three elements: two images and a piece of text that is not marked as a title at all.
Converting an HTML web site with no semantics to XHTML using Tidy is useless. The result will still be effectively unparsable (it will parse, but the elements will carry no meaning), the site will still be inaccessible to alternative browsers, it will still be hell to maintain, etc. Of course, easy keyboard-shortcut navigation in Mozilla is out of the question.
And the code will even be larger because of the indentation, closing tags and styles created by Tidy.
All benefits of XHTML/CSS are totally lost.
Look at a horror like
http://www.skyrock.com/
Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (e.g. the Sony Ericsson P800).
You don't see anything but the names of three files supposed to be images. And this is all you can see on the web site. You don't see any links or any text.
Convert this to XHTML using Tidy.
The site still doesn't look like anything but three useless filenames. It just takes twice as long to load because the code is larger.
Correct XHTML sites have to be designed the right way from the ground up. There's no magic that converts a horror into something clean. And even manually, the best way to do so is almost always to restart from scratch.
This article is very appropriate for me; I'm in the process of updating about 200 Dreamweaver-produced HTML pages to CSS-based XHTML pages. No font tags, no tables for layout. Full update from the ground up. But I had already heard of Tidy and decided to use it weeks ago. The article comes off as a press release, with little content other than "Use Tidy!"
Now the problem comes, as the parent comment says, in interpreting meaning from the XHTML spit out by Tidy. I'm hoping to use simple XSL to get that little bit out. In fact, that's the only useful thing in the article; the writer uses a simple XSL stylesheet to extract title, description, and date information from the XHTML file. For my project it's necessary because while Tidy will change font tags into styles, it must use internal style sheets, and can't determine which site-global styles to use. (Nor can it replace tables with CSS-positioned divs.) Hopefully, the XSL will do 90% of that work for me.
Tidy is great though. It solves 90% of the problem. (The other 10%, as always, is a b*tch.)
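For comparison, the same kind of title/description/date extraction can be sketched without XSL. This is not the article's stylesheet. It assumes the XHTML keeps the description and date in meta elements named "description" and "date", which is a guess, and SAMPLE and page_summary are made up for illustration.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace once Tidy has converted the page.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page; the meta names are an assumption about how the data is stored.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My company</title>
<meta name="description" content="What we do" />
<meta name="date" content="2003-09-20" />
</head>
<body><h1>My company</h1></body>
</html>"""

def page_summary(xhtml_text):
    """Pull title, description and date out of a well-formed XHTML page."""
    root = ET.fromstring(xhtml_text)

    def meta(name):
        # Returns the content attribute of <meta name="...">, or None if absent.
        node = root.find(f".//x:meta[@name='{name}']", NS)
        return node.get("content") if node is not None else None

    return {
        "title": root.findtext("x:head/x:title", default=None, namespaces=NS),
        "description": meta("description"),
        "date": meta("date"),
    }

if __name__ == "__main__":
    print(page_summary(SAMPLE))
```

Mapping Tidy's generated inline styles onto site-global styles is the harder 10% and is not attempted here.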
If you have a page in HTML, chances are that it should be in HTML. Using XML for web presentation is just stupid, unless you're playing buzzword bingo.