Convert from HTML to XML With HTML Tidy
An anonymous reader writes "HTML Tidy, a powerful tool to help convert old HTML pages to newer standards, such as XML. This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools."
I've sneered at XHTML in the past, but I was speaking out of ignorance. I was assuming it was just a silly attempt to preserve HTML in an XML world. Actually, it's a very convenient bridge between HTML and XML. It's only incidentally about web content, since browsers will always need to support legacy HTML, and thus will never adopt all of XHTML's structure and restrictions. But once you have your content in XHTML format, you can transform it into any XML application you choose, using XSLT scripts. Which opens up a whole world of possibilities for people with all their content in messy old word processor formats, since word processor now tend to come with HTML export filters.
Yes, HTMLTidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and create CSS classes instead of attributes like "background".
:
:
:
But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.
When you have an horror like
My company
to display a title, how do you want an automatic tool like Tidy to convert it to
My company
?
It just can't. It will see a table with no caption, no column headers and three elements : two images and a text that is not supposed to be a title at all.
Converting an HTML web site with no semantic to XHTML using Tidy is useless. The result will still be unparsable (it will, but elements will have no meaning), the site will still be unaccessible to alternative browsers, it will still be a hell to maintain, etc. Of course easy navigation with the keyboard shortcuts using Mozilla is out of question.
And the code will even be larger because of the indentation, closing and styles created by Tidy.
All benefits of XHTML/CSS are totally lost.
Look at an horror like
http://www.skyrock.com/
Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (ex: Sony/Ericsson P800).
You don't see anything but the names of three files supposed to be images. And this is all you can see on the web site. You don't see any link nor any text.
Convert this to XHTML using Tidy.
The site still doesn't look like anything but three useless filenames. It's just twice longer to load because the code is larger.
Correct XHTML sites have to be designed the right way from the ground up. There's no magic to convert an horror to something clean. And even manually, the best way to do so is almost always to restart from scratch.
{{.sig}}
Argl, I forgot to enable "Extrans" before submitting the previous post :(
:
:
:
Let's try again, sorry for the noise, I believed
"plain old text" would escape HTML tags.
---
Yes, HTMLTidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and create CSS classes instead of attributes like "background".
But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.
When you have an horror like
<table><tr><td width="100%" align="center"><img src="transparentpix.gif" width="20"><font size="9"><b>My company</b></font><img src="transparentpix.gif" width="20"></td></tr></table>
to display a title, how do you want an automatic tool like Tidy to convert it to
<h1>My company</h1>
?
It just can't. It will see a table with no caption, no column headers and three elements : two images and a text that is not supposed to be a title at all.
Converting an HTML web site with no semantic to XHTML using Tidy is useless. The result will still be unparsable (it will, but elements will have no meaning), the site will still be unaccessible to alternative browsers, it will still be a hell to maintain, etc. Of course easy navigation with the keyboard shortcuts using Mozilla is out of question.
And the code will even be larger because of the indentation, closing and styles created by Tidy.
All benefits of XHTML/CSS are totally lost.
Look at an horror like
http://www.skyrock.com/
Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (ex: Sony/Ericsson P800).
You don't see anything but the names of three files supposed to be images. And this is all you can see on the web site. You don't see any link nor any text.
Convert this to XHTML using Tidy.
The site still doesn't look like anything but three useless filenames. It's just twice longer to load because the code is larger.
Correct XHTML sites have to be designed the right way from the ground up. There's no magic to convert an horror to something clean. And even manually, the best way to do so is almost always to restart from scratch.
{{.sig}}