Slashdot Mirror


Convert from HTML to XML With HTML Tidy

An anonymous reader writes "HTML Tidy is a powerful tool that helps convert old HTML pages to newer standards, such as XML. This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools."

43 comments

  1. I'll check it out by Kethinov · · Score: 2

    I've always been interested in X(HT)ML, but I've never wanted to sit down and convert every single page by hand. This tool might be just what I need.

    --
    You're right, I wouldn't steal a car. But if it were possible, I sure as hell would download one!
    1. Re:I'll check it out by in10se · · Score: 2, Informative

      It's extremely useful for converting the "HTML" generated by Microsoft Office products into nice, clean, well formatted XHTML.

      --
      Popisms.com - Connecting pop culture
    2. Re:I'll check it out by fm6 · · Score: 4, Interesting

      I've sneered at XHTML in the past, but I was speaking out of ignorance. I was assuming it was just a silly attempt to preserve HTML in an XML world. Actually, it's a very convenient bridge between HTML and XML. It's only incidentally about web content, since browsers will always need to support legacy HTML, and thus will never adopt all of XHTML's structure and restrictions. But once you have your content in XHTML format, you can transform it into any XML application you choose, using XSLT scripts. That opens up a whole world of possibilities for people with all their content in messy old word processor formats, since word processors now tend to come with HTML export filters.

    3. Re:I'll check it out by Kethinov · · Score: 1

      Enlighten me. Should I convert a web page from old HTML into XHTML, exactly how is the code reusable in areas other than web page design? I've always been under the impression that XHTML would one day just replace legacy HTML, but you seem to think otherwise.

      --
      You're right, I wouldn't steal a car. But if it were possible, I sure as hell would download one!
    4. Re:I'll check it out by Enrico+Pulatzo · · Score: 1

      XHTML is great for a bunch of reasons.

      First off, every reason to use HTML 4.0 is a reason to use XHTML, unless that reason happens to be "it's not XHTML!".

      Secondly, using XHTML allows you all the niceties of XML. This is great when you decide to update your site so it works in, say, cell-phone browsers rather than just a PC browser. This alone is a great reason to use XHTML. As more and more data sources become XML-aware, being able to easily connect them becomes important. XHTML allows you to do this in a simpler way than using HTML and an easily incomplete parser you wrote in Perl.

      Thirdly, (this is kind of a stretch) I'm going to assume that HTML won't be updated in any meaningful way ever again. If and when new and (more importantly) useful additions become available (like HTML 4.0's fieldset and colgroup tags) they won't be available to an HTML page. This again assumes that HTML is done (which is reasonable, as XHTML is the replacement for it), that XHTML is incomplete (which it is), and that a great feature may come about in the future that you want to use.

      Finally, I think you should use XHTML because I said so ;)

    5. Re:I'll check it out by fm6 · · Score: 1
      Well first of all, legacy HTML will never go away -- not as long as millions of people are hacking out web pages by hand, or using antiquated HTML editors. XHTML will never completely replace legacy HTML, and if I still thought that was XHTML's central purpose, I would still consider XHTML a waste of effort.

      The big virtue of XHTML is the big virtue of all XML document types: it's open. You can do anything with an XML document. I suppose that's also true of say TeX or RTF. Except these formats are very messy, and it's hard to extract the content from them. A good XML document type is well-structured, and thus relatively easy to access and manipulate.

      If all you want to do with a document is display it as a single web page, that's not a big deal. But suppose you want to add it to some well-structured document management system? Or make it a chapter in a book? Or deliver it to a cell phone browser that uses WML or some other simplified markup language? Then all you have to do is write a filter that transforms your XHTML into the necessary XML document type. The possibilities are endless, and all of them are enabled by the simple openness of XML.

      There are pitfalls, of course. A good XML application is carefully structured, and thoroughly separates presentation (layout, fonts, etc.) from content. That's why XHTML deprecates the use of formatting tags, like <center> and <font>, which act as if they designate content, but actually designate presentation. But there's nothing to prevent XHTML users from using deprecated features, or designers of other XML applications from structuring their documents carelessly. So even after you run your document through HTML Tidy, you still might have to jump through a few hoops to transform it into a more sophisticated XML document type, such as DocBook. But the openness of XML makes that hoop-jumping a lot easier.

      Anybody who's interested in playing the XML transformation game needs to learn to program in the #1 XML transformation language, XSLT. This person has written some good introductory material, both online and in book form. Plus her web site neatly demonstrates the flexibility of the technology she teaches and advocates.
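
As a rough illustration of the "write a filter that transforms your XHTML" idea above, here is a minimal sketch. It is not XSLT: it uses Python's standard xml.etree.ElementTree as a stand-in, and the sample document, the function name xhtml_to_chapter, and the DocBook-like element names (chapter, subtitle, para) are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical input: a tiny, already well-formed XHTML document.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>My chapter</title></head>
  <body>
    <h1>Getting started</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>"""


def xhtml_to_chapter(xhtml_text):
    """Map a flat XHTML body onto a DocBook-like <chapter> element."""
    root = ET.fromstring(xhtml_text)
    ns = {"x": XHTML_NS}
    chapter = ET.Element("chapter")
    title = ET.SubElement(chapter, "title")
    title.text = root.findtext("x:head/x:title", default="", namespaces=ns)
    # Only direct children of <body> are handled; nested markup is ignored.
    for child in root.find("x:body", ns):
        if child.tag == "{%s}h1" % XHTML_NS:
            ET.SubElement(chapter, "subtitle").text = child.text
        elif child.tag == "{%s}p" % XHTML_NS:
            ET.SubElement(chapter, "para").text = child.text
    return chapter


chapter = xhtml_to_chapter(SAMPLE)
print(ET.tostring(chapter, encoding="unicode"))
```

A real transformation would more likely be an XSLT stylesheet, as the comment suggests; the point here is only that well-formed input makes the mapping a few lines long.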

  2. What about converting RSS to HTML ? by redelm · · Score: 1
    There seem to be some nice cut-down pages available as XML/RSS. Any good conversions or text-based readers?

    1. Re:What about converting RSS to HTML ? by Enrico+Pulatzo · · Score: 1

      If you're into the whole "roll your own" kind of thing, check out XSLT. You can probably find a stylesheet that converts RSS to HTML or with a day's worth of effort, you could write your own.
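
For readers who would rather not write a stylesheet, the same RSS-to-HTML conversion can be sketched with Python's standard library. This assumes a plain RSS 2.0 feed with no namespaces; the sample feed, the example.com URLs, and the function name rss_to_html are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical RSS 2.0 feed used only for this example.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First &amp; foremost</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def rss_to_html(rss_text):
    """Render the channel title and its items as a heading plus a link list."""
    channel = ET.fromstring(rss_text).find("channel")
    lines = ["<h1>%s</h1>" % escape(channel.findtext("title", default="")), "<ul>"]
    for item in channel.findall("item"):
        # Re-escape text because the XML parser has already decoded entities.
        title = escape(item.findtext("title", default=""))
        link = escape(item.findtext("link", default=""), quote=True)
        lines.append('  <li><a href="%s">%s</a></li>' % (link, title))
    lines.append("</ul>")
    return "\n".join(lines)


html_out = rss_to_html(SAMPLE_RSS)
print(html_out)
```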

  3. libxml2? by dakkar · · Score: 5, Informative

    A few days ago I had to convert HTML pages into XHTML, stripping out a few extra elements and attributes. I used xsltproc, from libxslt, which uses the parser from libxml2, and this has the option of parsing strict HTML into an XML DOM.

    HTML Tidy can be useful when you have not-so-strict HTML, but for most quick conversions I've found libxml2 & co. to be quite light and easy.

    --
    dakkar - mobilis in mobile
    1. Re:libxml2? by Anonymous Coward · · Score: 0

      XmlStarlet http://xmlstar.sourceforge.net/
      which is based on libxml2/libxslt can be pretty useful too.

      Ex:

      xml fo --html your.html

  4. This isn't new... by in10se · · Score: 3, Informative

    HTML Tidy has been out for years.



    Check out the Tidy Homepage or the project on SourceForge.

    --
    Popisms.com - Connecting pop culture
    1. Re:This isn't new... by AShocka · · Score: 2, Informative
      It's used in There's GUI versions, command line versions, etc.
  5. how many years has it produced XHTML ? by DrSkwid · · Score: 1

    . . . ..

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    1. Re:how many years has it produced XHTML ? by in10se · · Score: 2, Informative

      Again, it has been converting to XHTML for a few years. I only posted the original message because I was quite surprised to see it on Slashdot. It's not uncommon to see a story that is a month or two old on the homepage, but several years old is crazy.

      --
      Popisms.com - Connecting pop culture
    2. Re:how many years has it produced XHTML ? by LarryRiedel · · Score: 2, Informative

      The date for the referenced article is 18 Sep 2003, less than two weeks ago.

      Larry

    3. Re:how many years has it produced XHTML ? by kalidasa · · Score: 2, Informative

      4 years. I remember it being implemented before the XHTML recommendation was final. I remember it particularly because I've been using XHTML on my website since I converted it, then, in 1999, with HTML-Tidy.

    4. Re:how many years has it produced XHTML ? by adrizk · · Score: 1

      The date for the referenced article is 18 Sep 2003, less than two weeks ago.

      Yeah, but the fact remains that HTML Tidy has been around for years. Essentially this article is a tutorial on how to use tidy. It's almost like submitting a story about a man page.

  6. news for nerds? by Anonymous Coward · · Score: 2, Insightful

    more like tips for newbies

    but yeah this is a great tip.. especially if you are writing web-scrapers to extract data from web pages and/or convert them to RSS. Just use Tidy to tidy it, and then your favorite XML parser can slurp it right up and you can use XPath to pull out what you need.

    Look out though, there are some cases that Tidy chokes on. One that I keep running into is shit like this:

    <table>
    <form>
    <tr><td>...</td></tr>
    </table>
    </form>

    Basically mixing stuff in between table rows. Something like that anyway. Just be ready to handle a fatal error from Tidy, it surprised me at first because I thought it could eat anything.
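
The "tidy it, then pull out what you need with XPath" workflow mentioned above might look roughly like this once the page is already well-formed XHTML. The sample markup, the class names, and the function story_links are assumptions for the example, and the Tidy step itself is not shown. Note that Python's standard ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page, assumed to have been cleaned up into XHTML already.
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Headlines</title></head>
  <body>
    <div class="story"><a href="/a.html">Story A</a></div>
    <div class="story"><a href="/b.html">Story B</a></div>
    <div class="ad"><a href="/buy.html">Buy now</a></div>
  </body>
</html>"""


def story_links(xhtml_text):
    """Return (text, href) pairs for links inside div elements of class 'story'."""
    root = ET.fromstring(xhtml_text)
    # Attribute predicates are part of the XPath subset ElementTree understands.
    anchors = root.findall(".//x:div[@class='story']/x:a", NS)
    return [(a.text, a.get("href")) for a in anchors]


links = story_links(TIDIED)
print(links)
```

If the parse step raises an error, that is the scraper's cue that the cleanup stage did not produce well-formed output, which matches the advice above to be ready for failures.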

    1. Re:news for nerds? by Anonymous Coward · · Score: 0

      It's likely because the form is both inside and outside the table.

    2. Re:news for nerds? by aWalrus · · Score: 3, Insightful

      That's because that's invalid markup. When it gets tidied, where would you put the form? inside or outside the table?

      --
      Overcaffeinated. Angry geeks.
    3. Re:news for nerds? by mhesseltine · · Score: 2, Informative
      That's because that's invalid markup. When it gets tidied, where would you put the form? inside or outside the table?

      Well, from the W3C page on HTML 4.01

      The FORM element acts as a container for controls. It specifies:
      * The layout of the form (given by the contents of the element).
      * The program that will handle the completed and submitted form (the action attribute). The receiving program must be able to parse name/value pairs in order to make use of them.
      * The method by which user data will be sent to the server (the method attribute).
      * A character encoding that must be accepted by the server in order to handle this form (the accept-charset attribute). User agents may advise the user of the value of the accept-charset attribute and/or restrict the user's ability to enter unrecognized characters.

      A form can contain text and markup (paragraphs, lists, etc.) in addition to form controls.

      Given that information, I would put the form outside the table element, since the table controls the layout of the form. Besides, shouldn't that be part of tidying up code?

      --
      Overrated / Underrated : Moderation :: Anonymous Coward : Posting
    4. Re:news for nerds? by sahala · · Score: 1
      ...

      There's actually a reason why people write code like this. The <form> tag in many browser implementations emits the equivalent of a line break after the tag. Heavily styled pages with forms therefore get unexpected spaces, which pisses off a lot of designers. Putting the <form> tags as in the above prevents this, even if it's incorrect. Browsers don't seem to have a problem.

      Nowadays the correct way to do this is with CSS. An inline style on the form tag should do the trick, or just toss a form {margin:0px} in the document style sheet.

    5. Re:news for nerds? by JeanPaulBob · · Score: 1

      Yes, it makes sense to nest the table inside the form. That problem was that the form started inside the table, and ended outside. It was improperly nested.

      Correct:
      <table>
      <form>
      </form>
      </table>

      Incorrect:
      <table>
      <form>
      </table>
      </form>
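
The difference between the two snippets above can be checked mechanically with a simple tag stack. The sketch below uses Python's standard html.parser; the class and function names are made up for the example, and it is not a description of how Tidy works. It is also stricter than HTML itself, which allows some end tags (such as those for p, li, and td) to be omitted.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Record end tags that do not match the most recently opened element."""

    # Elements that never take a closing tag in HTML.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # Report (closing tag seen, element that was actually open).
            expected = self.stack[-1] if self.stack else None
            self.errors.append((tag, expected))


def nesting_errors(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors


print(nesting_errors("<table><form></form></table>"))
print(nesting_errors("<table><form></table></form>"))
```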

  7. hehe stupid /. by DrSkwid · · Score: 1

    No surprise really.

    I think people must submit to /. instead of just bookmarking

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
  8. Why use XHTML? by Alethes · · Score: 3, Informative

    Ian Hickson makes a good case here that using XHTML may not be the right direction to go -- at least at this point.

    1. Re:Why use XHTML? by Enrico+Pulatzo · · Score: 1

      Well, this is veering off-topic, but the reason the MIME type isn't used for the most part is that the user agents on the market don't know how to handle the application/xhtml+xml type. I don't see this as any real reason not to use XHTML; you've just got to be careful to make it well formed. Ian's argument stands for crappy HTML too, and more than a few people I've run into don't want to use HTML for anything, as the HTML they've run into doesn't make much sense. Some tags are left open, and sometimes you close one tag before another (a poorly formed mix of a block element and invisible form is what I'm thinking of here).

      In the end, crappy use of any standard will make people not want to use that standard. Just because it's functionally impossible to fully utilize the XHTML standard doesn't mean we shouldn't use it.

    2. Re:Why use XHTML? by UnuMondo · · Score: 1

      I use XHTML on my site for two reasons.

      The first is that I'm a nerd and I want to use a cutting-edge standard. I imagine that that is a big motivation for a lot of XHTML users.

      The second is that I'm a big LaTeX fan and its system of separating appearance from content really appeals to me. XHTML does the same for the web. One can concentrate on just putting information in there, and then can keep visual appearance in a separate place and easily replace it site-wide if necessary. A consequence of this is that the site's content is available to visually-impaired users. Some months ago Slashdot did an interview with a web-accessibility guru, and from that story it became evident that if one writes good, conforming XHTML from the beginning, it doesn't take much extra effort for the blind to be able to enjoy its content.

      Hickson's essay - whose gist is now over a year old - takes a pragmatic view: don't use XHTML because IE doesn't handle the ideal MIME type yet. It makes sense for more pedestrian sites to continue to use HTML 4.01. But for those who have personal sites whose visitors will mostly be using Mozilla or another browser that can handle application/xhtml+xml, there's really no reason not to start using XHTML.

      --
      GPG Key ID: 8C444E97 Fingerprint: E7BA D851 9714 8D97 C4F9 1777 8168 6913 8C44 4E97
    3. Re:Why use XHTML? by AkaXakA · · Score: 1

      This is FUD.

      At the end it even links to http://www.mozillaquestquest.com/, which says, in their own words:
      After much soul searching we have decided to shut down MozillaQuestQuest. In our opinion we cannot compete with MozillaQuest's content for humour value.

      Keywords: Humour value.

    4. Re:Why use XHTML? by fm6 · · Score: 1
      This is sort of ad hominem, but it's hard to accept criticism of markup languages from a guy whose web pages are hand-formatted text!

      But let's skip the Latin and look at Hickson's actual arguments. They're a little convoluted, so maybe I'm not reading him correctly. As far as I can tell, he's mostly saying that correct XHTML is hard to prepare. Well, yeah, that's the whole point of using a tight-assed XML document type like XHTML instead of a tolerant, laid-back SGML document type like HTML: you're embracing restrictions that promote a well structured document. Following all the XML rules in XHTML is hard, and it's supposed to be. That's why I never prepare an XHTML document by hand. It's not that I don't believe in the goals of XHTML (structure, separating content and presentation), I just don't find it practical to manually touch all the bases that an XML document type requires me to touch. I use HTML 4, and avoid deprecated legacy tags and attributes. If you ever need a proper XML document, it's not that hard to do the conversion, probably using some kind of standard utility that will remember all the little details I'd forget.

      There are souls brave enough to prepare XHTML or other XML documents by hand. That's not a mistake in and of itself, but it's foolish to do so without passing the document through a validator. Actually, the best XML editors validate your document on the fly.

      Hickson also criticizes XHTML for its inconsistent browser support. Well, that's an issue with HTML as well, and something we need to take up with the browser vendors, not the XHTML advocates.

      Hickson makes the assumption (one I once shared) that XHTML is primarily for delivering web content. As I said in a previous post, that's not correct.

  9. *sigh* Who modded the parent down? Why? by JeanPaulBob · · Score: 1

    Yeah, the first line is a little flamebait-ey, but other than that it's informative.

  10. HTML-Kit by Mark+Pitman · · Score: 1

    If you are running Windows, there is a nice HTML editor called HTML-Kit that integrates HTML-Tidy right in. It's not WYSIWYG, but it color-codes your HTML and can format it a number of ways.

    1. Re:HTML-Kit by superyooser · · Score: 1
      Many web development applications have HTML Tidy built in. One I use is HTML Builder XP. Don't let the name fool you. It's more than an HTML editor. It comes with functions for creating CSS, ASP, and PHP (4.x integrated!) and customizable DHTML scripts. It has tabbed preview windows to check your rendered code in as many browsers as you have.

      HTML Builder XP is created by one of the two developers of the now-defunct 1st Page 2000 by Evrsoft. Evrsoft is now just the one remaining developer who has essentially abandoned the app. The vaporous "1st Page v. 3.0" is like the Duke Nukem Forever of web dev apps. If you happen to be among the thousands of web monkeys who have been waiting for the next version of 1st Page 2000, HTML Builder XP is what you're looking for.

    2. Re:HTML-Kit by Mark+Pitman · · Score: 1

      HTML Builder XP looks pretty good. I'll have to check that out!

  11. Why not use HTML4 then? by Anonymous Coward · · Score: 0

    Properly formed HTML 4.01 is pretty close to XHTML 1.0 transitional.

    1. Re:Why not use HTML4 then? by Enrico+Pulatzo · · Score: 3, Informative
      I use XHTML so I can:
      1. say that I use XHTML
      2. make it easier to parse my pages
      HTML 4.01 doesn't make you expressly close your tags, which causes XML processors to choke and die. I'd rather write it in a usable format once than have to Tidy-parse every time I want to update my search engine. Plus XSLT really is cool. I've got (somewhere) a stylesheet I wrote that will validate form data for me, and then I can apply other XSLT stylesheets to make the output, further separating the output from the script that does the magic. Great way to update the look of a page without messing up (accidentally, of course) the code I wrote months ago.
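
The "choke and die" behaviour is easy to reproduce. In this small sketch the two sample strings and the helper name is_well_formed are assumptions for illustration; the first string leans on HTML 4.01's optional end tags, the second closes everything the way XHTML requires.

```python
import xml.etree.ElementTree as ET

# Legal-looking HTML 4.01 style: unclosed <p> and <br>.
HTML4_STYLE = "<html><body><p>First<p>Second<br></body></html>"
# The same content with every element explicitly closed.
XHTML_STYLE = "<html><body><p>First</p><p>Second<br /></p></body></html>"


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(HTML4_STYLE))
print(is_well_formed(XHTML_STYLE))
```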
  12. BBTidy BBEdit plugin (Mac OS) by jpkunst · · Score: 3, Informative

    If you are running MacOS with BBEdit, you can use the BBTidy plugin to get HTML Tidy integration in BBEdit.

    JP

  13. He's right, HTML-Kit is the choice... by Futurepower(R) · · Score: 1


    MOD PARENT UP!! He's right, HTML-Kit is the choice of even those who use Dreamweaver MX, because Dreamweaver does not respect the formatting of your HTML. HTML coders use HTML Tidy and HTML-Kit to clean up Dreamweaver HTML output, and you-know-who's HTML output, of course, which is so disrespectful it would stomp on your toes if it could.

    1. Re:He's right, HTML-Kit is the choice... by FedeTXF · · Score: 1

      I used to have HTMLTidy inside (Allaire or Macromedia's) Homesite as the default CodeSweep to turn everything to XHTML 1.0 since 1999 or 2000.
      Just pressing CTRL-ALT-F.

  14. HTML Tidy and WDG by Muttonhead · · Score: 1

    I recently used HTML Tidy to convert my website from html to xhtml. It works well. After Tidy, I used WDG HTML Validator to verify that the code was correct. (It validates XHTML as well.) If you install your own version of the validator you can more easily check your entire website. This is important if you have a lot of pages.

  15. HTML to XHTML can only be made manually by chrysalis · · Score: 2, Interesting

    Yes, HTMLTidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and creates CSS classes instead of attributes like "background".

    But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.

    When you have a horror like :

    My company

    to display a title, how do you want an automatic tool like Tidy to convert it to :

    My company

    ?

    It just can't. It will see a table with no caption, no column headers and three elements : two images and a text that is not supposed to be a title at all.

    Converting an HTML web site with no semantics to XHTML using Tidy is useless. The result will still be effectively unparsable (it will parse, but the elements will have no meaning), the site will still be inaccessible to alternative browsers, it will still be hell to maintain, etc. Of course easy navigation with the keyboard shortcuts using Mozilla is out of the question.

    And the code will even be larger because of the indentation, closing and styles created by Tidy.

    All benefits of XHTML/CSS are totally lost.

    Look at a horror like :

    http://www.skyrock.com/

    Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (ex: Sony/Ericsson P800).

    You don't see anything but the names of three files supposed to be images. And this is all you can see on the web site. You don't see any link nor any text.

    Convert this to XHTML using Tidy.

    The site still doesn't look like anything but three useless filenames. It just takes twice as long to load because the code is larger.

    Correct XHTML sites have to be designed the right way from the ground up. There's no magic to convert a horror to something clean. And even manually, the best way to do so is almost always to restart from scratch.

    --
    {{.sig}}
  16. HTML to XHTML can only be made manually (extrans) by chrysalis · · Score: 2, Interesting

    Argl, I forgot to enable "Extrans" before submitting the previous post :(

    Let's try again, sorry for the noise, I believed
    "plain old text" would escape HTML tags.

    ---

    Yes, HTMLTidy can "convert" an HTML page to XHTML. It basically adds CDATA marks, closes tags and creates CSS classes instead of attributes like "background".

    But correct XHTML is more than that. The goal is to actually give the right context to every element of the text.

    When you have a horror like :

    <table><tr><td width="100%" align="center"><img src="transparentpix.gif" width="20"><font size="9"><b>My company</b></font><img src="transparentpix.gif" width="20"></td></tr></table>

    to display a title, how do you want an automatic tool like Tidy to convert it to :

    <h1>My company</h1>

    ?

    It just can't. It will see a table with no caption, no column headers and three elements : two images and a text that is not supposed to be a title at all.

    Converting an HTML web site with no semantics to XHTML using Tidy is useless. The result will still be effectively unparsable (it will parse, but the elements will have no meaning), the site will still be inaccessible to alternative browsers, it will still be hell to maintain, etc. Of course easy navigation with the keyboard shortcuts using Mozilla is out of the question.

    And the code will even be larger because of the indentation, closing and styles created by Tidy.

    All benefits of XHTML/CSS are totally lost.

    Look at a horror like :

    http://www.skyrock.com/

    Try to access it with Lynx or the built-in browser of a phone or PDA with no support for styles (ex: Sony/Ericsson P800).

    You don't see anything but the names of three files supposed to be images. And this is all you can see on the web site. You don't see any link nor any text.

    Convert this to XHTML using Tidy.

    The site still doesn't look like anything but three useless filenames. It just takes twice as long to load because the code is larger.

    Correct XHTML sites have to be designed the right way from the ground up. There's no magic to convert a horror to something clean. And even manually, the best way to do so is almost always to restart from scratch.

    --
    {{.sig}}
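
The point about missing semantics can be made concrete. In the sketch below, the first string is a hand-written, well-formed version of the table markup quoted above (it is not actual Tidy output, only an assumption of roughly what closing the img tags would give), and the helper name headings is invented. Both strings parse as XML, but only one of them contains anything a program could recognise as a title.

```python
import xml.etree.ElementTree as ET

# Hand-closed version of the layout-table markup: well formed, no semantics.
WELL_FORMED_BUT_MEANINGLESS = (
    '<table><tr><td width="100%" align="center">'
    '<img src="transparentpix.gif" width="20" />'
    '<font size="9"><b>My company</b></font>'
    '<img src="transparentpix.gif" width="20" />'
    '</td></tr></table>'
)
# What the author actually wants the markup to say.
SEMANTIC = "<h1>My company</h1>"


def headings(markup):
    """Return the text of every h1, h2, or h3 element in the markup."""
    root = ET.fromstring(markup)
    return [el.text for el in root.iter() if el.tag in ("h1", "h2", "h3")]


print(headings(WELL_FORMED_BUT_MEANINGLESS))
print(headings(SEMANTIC))
```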
  17. Re:HTML to XHTML can only be made manually (extran by Anonymous Coward · · Score: 0

    This article is very appropriate for me; I'm in the process of updating about 200 Dreamweaver produced HTML pages to CSS-based XHTML pages. No font tag, no tables for layout. Full update from the ground up. But I already heard of Tidy and decided to use it weeks ago. The article comes off as a press release, with little content other than "Use Tidy!"

    Now the problem comes, as the parent comment says, in interpreting meaning from the XHTML spit out by Tidy. I'm hoping to use simple XSL to get that little bit out. In fact, that's the only useful thing in the article; the writer uses a simple XSL stylesheet to extract title, description, and date information from the XHTML file. For my project it's necessary because while Tidy will change font tags into styles, it must use internal style sheets, and can't determine which site-global styles to use. (Nor can it replace tables with CSS-positioned divs.) Hopefully, the XSL will do 90% of that work for me.

    Tidy is great though. It solves 90% of the problem. (The other 10%, as always, is a b*tch.)
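
The title, description, and date extraction mentioned above is done with an XSL stylesheet in the article; the following is a stand-in sketch in Python rather than that stylesheet. It assumes the description and date live in meta elements named "description" and "date", which is a guess about the page layout, and the sample page and function name page_metadata are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical XHTML page with metadata in the head.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Release notes</title>
    <meta name="description" content="What changed in this release." />
    <meta name="date" content="2003-09-18" />
  </head>
  <body><p>Body text.</p></body>
</html>"""


def page_metadata(xhtml_text):
    """Collect the title plus the 'description' and 'date' meta values."""
    head = ET.fromstring(xhtml_text).find("x:head", NS)
    # Build a name -> content lookup from all meta elements in the head.
    meta = {m.get("name"): m.get("content") for m in head.findall("x:meta", NS)}
    return {
        "title": head.findtext("x:title", default="", namespaces=NS),
        "description": meta.get("description"),
        "date": meta.get("date"),
    }


info = page_metadata(PAGE)
print(info)
```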

  18. XHTML can go to hell by Anonymous Coward · · Score: 0

    If you have a page in HTML, chances are that it should be in HTML. Using XML for web presentation is just stupid, unless you're playing buzzword bingo.