Slashdot Mirror


DTD vs. XML Schema

AShocka writes "The W3C XML Schema Working Group has released the first public Working Draft of Requirements for XML Schema 1.1. Schemas are a technology for specifying and constraining the structure of XML documents. The draft adds functionality and clarifies the XML Schema Recommendation Part 1 and Part 2. The XML Schema Valid FAQ highlights development issues and resources using XML Schema. This article at webmasterbase.com addresses the XML DTDs vs. XML Schema issue. Also see the W3C Conversion Tool from DTD to XML Schema and other XML Schema/DTD Editors."

17 of 248 comments

  1. Who needs XML when you got PXML? by Anonymous Coward · · Score: 2, Informative

    PXML is a subset of XML - an alternative to the bloated XML language.

    Believe me, you won't use XML anymore once you've tried PXML.

  2. Re:One is derided, one is end-of-life'd by mir · · Score: 4, Informative

    I think James Clark's "RELAX NG and W3C XML Schema" is the best description (if slightly biased ;--) of the relative strengths of the two technologies. Note that James Clark also just released a new version of Trang, a tool that does conversions between RELAX NG, Schemas and DTDs.

    --
    Look, that's why there's rules, understand? So that you think before you break 'em. (Terry Pratchett)
  3. XML Schemas are in XML by M.C.+Hampster · · Score: 3, Informative

    One of the greatest things about XML Schemas is that they themselves are well-formed XML documents. This makes it a breeze to parse and create XML Schemas. I've been using XML Schemas in development for the past few months, and they are fantastic. A huge improvement over both DTD and XDR (Microsoft's temporary schema format until XML Schemas came out).
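
    To illustrate the "schemas are just XML" point, here is a minimal Python sketch that reads a schema with an ordinary XML parser. The schema text and the function name declared_elements are made up for the example; only the XML Schema namespace URI is real.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# A tiny, made-up schema used only for illustration.
SCHEMA_TEXT = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="errors">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="error" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def declared_elements(schema_text):
    """Return the name of every xs:element declared anywhere in the schema."""
    root = ET.fromstring(schema_text)
    return [el.get("name") for el in root.iter(f"{{{XS}}}element") if el.get("name")]

print(declared_elements(SCHEMA_TEXT))
```

    No schema-specific library is involved: the same parser that reads the instance documents reads the schema, which is not possible with a DTD.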

    --
    Forget the whales - save the babies.
  4. Re:Validating with XML Schemas by VP · · Score: 4, Informative

    This is a misunderstanding of the way schema validation is supposed to work. Schemas have what are called "location hints", which should be used in case you have never before encountered a particular namespace. The key word, however, is "hints" - i.e. you should never have to remotely obtain a schema if you don't need to.

    In most cases, if you are doing schema validation, you already know what schemas you can expect, so they should be not only locally available, but also cached in memory...

    As for the ..."master" XSD schema... you never ever have to get it remotely - the parser should be implementing it already...
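
    A rough Python sketch of treating the location as a hint only: read the namespace/location pairs from xsi:schemaLocation and prefer a local copy whenever the namespace is already known. The LOCAL_SCHEMAS catalogue, the example.com namespaces and the file paths are all hypothetical; a real validator would have its own resolver mechanism.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical local catalogue: namespace -> schema file already on disk.
LOCAL_SCHEMAS = {"http://example.com/ns/errors": "schemas/errors.xsd"}

INSTANCE = """\
<errors xmlns="http://example.com/ns/errors"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://example.com/ns/errors http://example.com/errors.xsd
                            http://example.com/ns/other http://example.com/other.xsd"/>
"""

def schema_sources(xml_text):
    """Map each hinted namespace to a local copy if known, else to the hint."""
    root = ET.fromstring(xml_text)
    # xsi:schemaLocation is a whitespace-separated list of namespace/location pairs.
    hints = root.get(f"{{{XSI}}}schemaLocation", "").split()
    pairs = zip(hints[0::2], hints[1::2])
    return {ns: LOCAL_SCHEMAS.get(ns, hint) for ns, hint in pairs}

for namespace, source in schema_sources(INSTANCE).items():
    print(namespace, "->", source)
```

    Only the namespace that is missing from the catalogue falls back to the remote hint; the known one never touches the network.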

  5. XML is Great of Content Syndication and much more by valmont · · Score: 4, Informative
    I notice that this topic is generating many comments from hard-core backend programmers who mainly focus on inter-application messaging and various equivalents of remote procedure calls.

    In my experience, many benefits of XML come when dealing with the presentation layers of many application architectures, with the ability to repurpose syndicated data at will. Here are a few examples:

    • RSS, which defines an easy standard for any site to provide "News" in a well-defined XML format. This allows developers to write software that aggregates news from different sites into one convenient interface, and lets sites exchange news headlines with each other.
    • Google Web APIs, which allow developers to create their own custom Google-powered search site with their own look and feel. The site simply proxies a user's search query to the Google server, which returns search results as XML; these can then be transformed into HTML, for example via XSLT, before being sent back to the user.
    • Amazon Web API, similar in principle to the Google API above, lets developers enhance their sites by allowing users to search for Amazon products without having to go to the Amazon site itself. One interesting side effect of such an API is that an Amazon competitor, say Barnes and Noble, could offer a similar API for their own site. Now I could let my users search for books through my service and offer them results and price comparisons from both Amazon and Barnes and Noble.

    Effective use of XML and XSLT allows you to easily aggregate informational data from one or multiple sources and "repurpose" it for an infinite variety of business and technological goals.

    One of the main benefits of XML is that it offers an effective, textual representation of "structured data" that can be conveniently accessed and manipulated through a slew of surrounding standards such as XPath, DOM, XSLT and namespaces.

  6. Re:one is pathetic, the other ludicrous by Anonymous Coward · · Score: 1, Informative

    Yes, it's a pain to constrain the number of occurrences, but it can be done:

    <!ELEMENT p (c,c,c?,c?,c?)>
    <!ELEMENT c EMPTY>

    would constrain each parent element "p" to 2 to 5 child elements "c". Something like:

    <!ELEMENT p (c{2,5})>
    <!ELEMENT c EMPTY>

    would be much better.
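
    Until DTDs grow such a syntax, the expansion can be generated. The Python sketch below (function names are made up) turns a name{min,max} range into the flat model shown above. As far as I can tell, the XML spec's compatibility rule on deterministic content models objects to adjacent "c?" particles, so a nested variant is included for stricter validators; treat that as an assumption to check against your parser.

```python
def flat_model(name, minimum, maximum):
    """Expand name{minimum,maximum} into the flat form, e.g. (c,c,c?,c?,c?)."""
    if not 0 <= minimum <= maximum or maximum == 0:
        raise ValueError("need 0 <= minimum <= maximum and maximum >= 1")
    parts = [name] * minimum + [name + "?"] * (maximum - minimum)
    return "(" + ",".join(parts) + ")"

def nested_model(name, minimum, maximum):
    """Same range, but with nested optional groups, e.g. (c,c,(c,(c,c?)?)?)."""
    if not 0 <= minimum <= maximum or maximum == 0:
        raise ValueError("need 0 <= minimum <= maximum and maximum >= 1")
    optional = ""
    for _ in range(maximum - minimum):
        # Wrap what we have so far: each extra child is only allowed
        # if the one before it was present.
        optional = f"({name},{optional})?" if optional else f"{name}?"
    parts = [name] * minimum + ([optional] if optional else [])
    return "(" + ",".join(parts) + ")"

print(f"<!ELEMENT p {flat_model('c', 2, 5)}>")
print(f"<!ELEMENT p {nested_model('c', 2, 5)}>")
```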

  7. Re:Blah, blah, blah by Anonymous Coward · · Score: 1, Informative

    Oh my, where to begin. Please, can't we get some folks in here that have actually worked on real, professional systems? Only a complete moron would make a statement like the parent post. And when I read it someone marked it Insightful, wow, only a bigger idiot (or maybe a PHB) would do that.

    First off, Xml is not hype. In its simplest form it's a format that has standard parsers on every platform. In its most robust form, it's a terrific data description language that can be used to describe really complex data.

    Here's an example of the power of Xml: in .NET there are Xml serialization and deserialization engines. Basically, you can take any object and get an Xml representation of it (Serialize), and by the same token you can take an Xml representation of an object and Deserialize it back into the object. Using these techniques allows you to pass data between application layers or between servers without getting all talky (i.e., in one call instead of setting individual properties, etc).
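
    This is not the .NET API, but the same object-to-Xml round trip can be sketched in a few lines of Python. The Customer type and the serialize/deserialize helpers are hypothetical, and the sketch only handles flat objects with string fields.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields

@dataclass
class Customer:  # hypothetical type, for illustration only
    name: str
    city: str

def serialize(obj):
    """Turn a flat dataclass instance into an XML string, one child per field."""
    root = ET.Element(type(obj).__name__)
    for field in fields(obj):
        ET.SubElement(root, field.name).text = str(getattr(obj, field.name))
    return ET.tostring(root, encoding="unicode")

def deserialize(cls, xml_text):
    """Rebuild an instance of cls from the XML produced by serialize()."""
    root = ET.fromstring(xml_text)
    return cls(**{field.name: root.findtext(field.name) for field in fields(cls)})

wire = serialize(Customer("Ada", "London"))
print(wire)
print(deserialize(Customer, wire))
```

    The whole object crosses the layer boundary as one string, which is the "one call instead of setting individual properties" point.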

    Now, this is basically in the MS world what COM does but the power here is that you're passing complex data types from one application to another in a standard format.

    Here's another example: we wanted to store all error messages for an application in a standard Xml file. I created an Xml Schema for the file to make sure that all of our developers entered the error codes in a proper format. At build time, I have a script that validates that the file is correct. Furthermore, to help the developers when they update the Xml file, VS.NET provides IntelliSense to let them know what tags go where (thanks to the schema reference).

    To me, that's pretty powerful stuff, considering that now I know at build time that everything concerning that section of the app is set up correctly.

    Personally, I think those of you out there who don't understand the value of Xml and Xml Schema also don't have a lick of real world programming experience. Hopefully I'll never have to work with you...

  8. I agree with you on entities by ttfkam · · Score: 2, Informative

    but I think you are totally off base with regard to CDATA sections. If anything, they make life easier for the parser, not harder -- at least when I was writing a parser, CDATA made things faster and easier. In cases where you are including a great deal of symbolic data -- for example, when you want to include a source code segment or ASCII art -- it is easier to read, faster to parse, and *less* bloated.

    '<' takes up less space than '&lt;'. Assuming you have more than three or four of these in your text node, a CDATA section reduces the size of your document. For the parser, after the CDATA section is begun, only the character sequence ']]>' can end it. This means the parser only has to check for ']]>' and not '<', '&', '<?', '<!', etc.
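
    A quick Python check of the size argument, using a made-up code snippet. Both encodings parse back to the same text; the CDATA form pays a fixed 12 characters for its delimiters, while escaping pays 3 extra characters per '<' and 4 per '&'.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Made-up sample with four '<' and two '&'; it must not contain "]]>".
snippet = "if (a < b && b < c) { x = a << 2; }"

escaped = f"<code>{escape(snippet)}</code>"
cdata = f"<code><![CDATA[{snippet}]]></code>"

# Both forms carry exactly the same character data.
assert ET.fromstring(escaped).text == snippet
assert ET.fromstring(cdata).text == snippet

print(len(escaped), "characters escaped")
print(len(cdata), "characters as CDATA")
```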

    And yes, there is such a beast called XInclude, but it's currently only a Candidate Recommendation. It's used like this:

    <foo>
      <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="bar.xml" parse="xml">
        <xi:fallback>
          <para>This text goes in if bar.xml cannot be found or has an error</para>
        </xi:fallback>
      </xi:include>
    </foo>

    Hopefully most entities can go the way of the dodo.

    --

    - I don't need to go outside, my CRT tan'll do me just fine.
  9. Parsing without a DTD by Animats · · Score: 3, Informative
    It's actually possible to parse even SGML without a DTD, most of the time. I do this routinely in the SEC filing parser behind Downside. SEC filings come as a horrible mix of SGML and HTML, with occasional uuencoded PDFs and images. The SEC's validation is very light, and isn't based on a DTD. What comes through is a mess.

    The key to robust parsing is deferring the decision as to whether a tag has a closing tag until you've seen enough input to know. You have to read the whole document in, build a tree, then work on the tree, but for anything serious you want to do that anyway.
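
    This is not the Perl parser mentioned here, just a minimal Python sketch of the deferral idea under a simplifying assumption: a tag name counts as a container only if a closing tag for it appears somewhere in the document. All names and the sample input are made up.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")

def tokenize(text):
    """Split loose markup into ("open" | "close" | "text", value) tokens."""
    tokens, pos = [], 0
    for match in TAG.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            tokens.append(("text", chunk))
        kind = "close" if match.group(1) else "open"
        tokens.append((kind, match.group(2).lower()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        tokens.append(("text", tail))
    return tokens

def parse_loose(text):
    """Build a (name, children) tree after reading the whole input."""
    tokens = tokenize(text)
    # Deferred decision: only now do we know which tags ever get closed.
    containers = {value for kind, value in tokens if kind == "close"}
    root = ("#root", [])
    stack = [root]
    for kind, value in tokens:
        if kind == "text":
            stack[-1][1].append(value)
        elif kind == "open":
            node = (value, [])
            stack[-1][1].append(node)
            if value in containers:
                stack.append(node)
        elif any(name == value for name, _ in stack[1:]):
            # Close: unwind to the matching open tag; stray closes are ignored.
            while stack[-1][0] != value:
                stack.pop()
            stack.pop()
    return root

SAMPLE = "<DOCUMENT><TYPE>10-K<COMPANY>Example Corp<TEXT><P>Hello</P></TEXT></DOCUMENT>"
print(parse_loose(SAMPLE))
```

    Tags that are never closed (TYPE and COMPANY in the sample) end up as empty marker nodes with their text as the following sibling, while closed tags nest normally.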

    This parser is in Perl. If anyone would like to take it over and put it on CPAN, let me know.

  10. Re:OT: how do you correctly embed flash by ubernostrum · · Score: 2, Informative
    Is there a correct way to put flash on a page and pass validator.w3c.org for valid HTML 4.01?

    Yup. Even in XHTML. Check out this article on A List Apart for a useful method.

  11. Re:Power by Lulu+of+the+Lotus-Ea · · Score: 4, Informative

    There certainly is a "vs." involved. There are many good reasons to choose DTDs for a given validation requirement rather than W3C XML Schemas. I address some of those in an IBM developerWorks article:


    Comparing W3C XML Schemas and Document Type Definitions (DTDs)

    This is a bit old, but still correct. Not a lot has changed in either spec.

    I am currently working on a series of articles on RELAX NG. In most ways, I think RELAX NG really is the best of all worlds. It is more powerful than W3C XML Schemas, while being a natural extension of the semantics of DTDs. Moreover, if you choose to use the compact syntax (non-XML), you get something very easy to read and edit by hand.

    David...

  12. Re:All this hype about XML by sporty · · Score: 4, Informative

    That's funny, I just looked at the man page for gzip.

    Gzip uses the Lempel-Ziv algorithm used in zip and PKZIP.
    The amount of compression obtained depends on the size of
    the input and the distribution of common substrings.
    Typically, text such as source code or English is reduced
    by 60-70%. Compression is generally much better than that
    achieved by LZW (as used in compress), Huffman coding (as
    used in pack), or adaptive Huffman coding (compact).


    Mind you, XML is highly repetitive in its tag use on long documents. Long as in multiple records, not necessarily byte length.

    Now let's take a larger file, since modem users can download 5k of HTML really quickly anyway. I've taken the SOAP distribution from Apache (or was it Sun) and concatenated all the XML files in it together: a 22k XML file. Not huge, but big enough for this example.

    Here are my findings:

    [caligraphy:~] spencerp% ls -al o.xml
    -rw-r--r-- 1 spencerp staff 22118 Jan 23 21:21 o.xml
    [caligraphy:~] spencerp% gzip o.xml
    [caligraphy:~] spencerp% ls -al o.xml.gz
    -rw-r--r-- 1 spencerp staff 3021 Jan 23 21:21 o.xml.gz
    [caligraphy:~] spencerp% gzip -l o.xml.gz
    compressed uncompr. ratio uncompressed_name
    3021 22118 86.4% o.xml


    Not bad for non-repetitive text with random XML schemas: 86.4%. Now imagine a larger one with a consistent schema. Compression goes even higher. Granted, it will be slightly larger than a binary. But even a 100meg file can be moved across a 100megabit network in 5 minutes time. And THAT is a lot of data.

    Btw, there is a fallacy in your math. If I get 50% compression of an XML file, which could have been implemented in binary format, it doesn't mean the binary format would be 49 times smaller.
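
    The gzip measurement above is easy to repeat in Python. The feed below is synthetic (made-up record layout, 500 near-identical records), so it will compress better than a real mixed bag of files; it only illustrates the "consistent schema" case.

```python
import gzip

# Synthetic document: one record layout repeated with a changing id.
records = "".join(
    f"<item><id>{i}</id><title>Example headline</title>"
    f"<link>http://example.com/{i}</link></item>\n"
    for i in range(500)
)
document = ("<feed>\n" + records + "</feed>\n").encode("utf-8")

packed = gzip.compress(document)
saved = 1 - len(packed) / len(document)

print(f"{len(document)} bytes -> {len(packed)} bytes ({saved:.1%} saved)")
```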

    --

    -
    ping -f 255.255.255.255 # if only

  13. Re:XML is Great of Content Syndication and much mo by sporty · · Score: 3, Informative
    Wow, I'm just runnin into you all over the place, aren't I.

    Ok- what if google, amazon, etc were to do the same thing, but translate in binary data, without tunneling over port 80 (which is bad, evil, and vile. Just ask any sys admin), and provide a library that parses the binary data for you?


    Well, that's why you'd use HTTPS with certificates, no? And nothing is wrong with the port. If you meant HTTP, then yeah, it's plaintext.

    Mind you, I don't have a choice of OS's at work. We use solaris and linux. Now amazon, being a windows shop (i'm guessing), only gives out dll's. Great, now I'm not supported. So fine, we use java. Did you know java class (binaries) are versioned? I'm stuck with 1.3.1 ATM and a 1.4 jdk is in the works. Problem is, some jdk's use one version of the binary while another uses.. another. I always hoped it was a universal format. Sadly let down.


    It would be the exact same thing- except it would be faster, use less bandwidth, be more secure, have session level security (which HTTP lacks). But it wouldn't be buzzword compliant.


    That's why technologies like JAXB and translets are popping up. With JAXB, you can bind particular classes to particular schemas/DTDs. It speeds up processing. Translets are just compiled XSLT. Really fast, since your XSLT can be compiled/interpreted once, run anywhere. Kind of a chain technology. translet->xslt->java->machine language.

    And mind you, nothing is more secure about a binary format. It's just obfuscated. Hell, I hacked Renegade BBS's user database format so I could write a user deletion tool. Were they going for security? Prolly not. Point is, binary is just obfuscated.

    As for your session level security, that's not the job of your data format. Your data format and transport layer should be independent. It's why you can do SOAP over HTTP, SMTP/mail and possibly anything else that has a function() like response format. request->response. It's probably why ssh is so great. All it is, is a way of authentication, communication and encryption. You can create ssh tunnels for http as a proxy.
    --

    -
    ping -f 255.255.255.255 # if only

  14. Re:All this hype about XML by c_g12 · · Score: 2, Informative

    Have you ever used Castor? Its Marshalling Framework allows you to easily convert between Java classes and XML documents. This means that you can generate Java source code from an XML Schema (but not DTD, I think). Very useful: simply define your object model using XML Schema, and use Castor's Sourcecode Generator to spit out your Java source.

  15. Re:All this hype about XML by Dr.+Photo · · Score: 2, Informative
    The only new thing about XML, IMHO, is that it has spawned more sub-specifications than any previous pretender to the crown.

    Sub-specifications?

    You mean like MathML, SMIL, SVG, XHTML, et al.?

    These are all modular languages that use XML.

    The XML client application uses one or more DTDs or schemas to determine how to interpret the various elements in the XML file, and you can intermingle e.g. MathML and XHTML and so forth all in the same XML file.
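
    A small Python sketch of that intermingling, assuming the vocabularies are told apart by their XML namespaces (the two namespace URIs are the real XHTML and MathML ones; the document and the function name are made up).

```python
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>The value is
      <math xmlns="http://www.w3.org/1998/Math/MathML">
        <mi>x</mi><mo>+</mo><mn>1</mn>
      </math>
    </p>
  </body>
</html>
"""

VOCABULARIES = {
    "http://www.w3.org/1999/xhtml": "XHTML",
    "http://www.w3.org/1998/Math/MathML": "MathML",
}

def count_by_vocabulary(xml_text):
    """Count elements per vocabulary, keyed on each element's namespace."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        # ElementTree spells qualified names as "{namespace}localname".
        tag = element.tag
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        counts[VOCABULARIES.get(namespace, "unknown")] += 1
    return dict(counts)

print(count_by_vocabulary(DOC))
```

    A real client would hand each subtree to the matching renderer or validator instead of just counting, but the dispatch key is the same.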

    Unless I'm grossly misinterpreting your comment (in which case I apologize), I can safely say that you didn't understand the article, since these "sub-specifications" you mentioned are exactly what DTDs/Schema are for, and exactly what makes XML a Good Thing.

    They didn't call it "Extensible" just so they could put a nice pretty "X" in "XML". (Though in all fairness, I must wonder if anyone could take something called "EML" seriously... ;)

  16. Check out Relax NG (RNG) by sbwoodside · · Score: 3, Informative

    I recently decided to go with RNG for my schemas after reading up on W3C XML Schema (WXS) and Relax NG (RNG). RNG is just so much easier to read and understand. The real clincher for me was the inability in WXS 1.0 to describe non-deterministic structures. I mean, give me a break. I can't allow people to put the elements in a different order? That's just lame.

    What's more, there's a fantastic tool, dtdinst, that converts DTDs into Relax NG. There are also tools to convert back and forth between WXS and RNG. So if I ever need to provide someone with a WXS schema I can just run it off automatically.

    Now I'm working on a system using AxKit to parse out the RNG schema, generate HTML forms for completion, roundtrip the data back to the server, assemble an instance document using DOM and display it using XSLT and CSS. But that's another story. People who don't "get" XML should really check out AxKit.

    simon

  17. Re:XML Schemas aren't just for validation by Anonymous Coward · · Score: 1, Informative

    Castor and Sun's JAXB do the same thing for Java. You can use a Schema to generate Java classes, or define bindings between existing Java classes and an XML schema to do serialization and deserialization.