DTD vs. XML Schema
AShocka writes "The W3C XML Schema Working Group has released the first public Working Draft of Requirements for XML
Schema 1.1. Schemas are technology for specifying and constraining
the structure of XML documents. The draft adds functionality and
clarifies the XML Schema Recommendation Part 1 and Part 2. The XML Schema Valid FAQ
highlights development issues and resources using XML Schema. This article at webmasterbase.com addresses the
XML DTDs Vs XML Schema issue.
Also see the W3C Conversion Tool from DTD to XML Schema
and other XML Schema/DTD Editors."
There's no "vs."
XML Schema are much more flexible and powerful.
There're also about 100 times more difficult and confusing.
PXML is a subset of XML - an alternative to the bloated XML language.
believe me, you won't use XML anymore if you once tried PXML
1. DTD 2. XML Schema 3. CowboyNeal validation (via SOAP over SMTP)
While the W3 continues to push Schema, they are also forming working groups for RELAX after pressure from XML luminaries such as James Clark.
XML Schema is also kinda whacked. It shows all the signs of being a committee specification.
The big problem with schema is that you actually have two type systems going. Element definitions are types for elements. Type definitions are actualy types for types for elements. I saw a hopelessly confused attempt by some UML people to express XML schema in UML, they simply could not understand that there was no way it could ever work. UML has completely different semantics.
There are a bunch of schema proposals that folk have said good things about. Eve keeps telling me I should look at Relax. But for the time being XML schema is going to be the basis for standards in W3C and OASIS.
There might be an opportunity to do a clean up job on XML schema in 4 or 5 years but that will only happen if it is causing real problems.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
dammit, right after I buy a book to finally learn XML in detail, they change the standards. :P
I am a programmer for a commercial company (yes I like to make money, and I program on WinTel). I year ago we had the XML craze we converted all our internal protocols to XML. I discovered that XML was just a lot of hype about nothing. There is nothing self-describing about it. Or maybe there is, just like the section names in an INI file describe the keys in them...
On the other hand the one thing that I did find XML useful for is easy parsing. If you use XML to develop a lower level protocol you end up with bloated 10k messages. But for high-level protocols or for configuration files it's great for only one reason: There are lots of ready-made tools. If you want to parse XML in Windows just load the IXMLDocument interface and it works at lightening speed. If you want to parse the messages in a web-browser through together a quick DOM parser or even use the build in DOM one! If you want to parse XML in PERL or C/C++ there are great libs. The only reason XML is good is because all the hype got people developing very neat tools. In one of my latest projects that needs to pass information between two programs written in different languages a used a Home-Made SOAP and designed a base class the persists using XML. I developed it in both langauges in under an hour!
So although it wastes bandwidth and there really isn't anything neat about it, it is comfortable I'll give it that.
God made the natural numbers; all else is the work of man - Kronecker
RelaxNG vs. W3C Schema makes a much more interesting discussions. DTD is obsolete in many ways... and most of the XML parsers support schema now.
XML is a very powerful tool.
On very important use is in creating interfaces between heterogeneous systems. Areadable character set and meaningful tags is very handy for developers. The hierarchical structure is extremely powerful. And, of course, the fact that it is a standard with common tools is invaluable.
However, one useful principle of such interfaces is "if you don't understand it, ignore it." In other words, when you get a message, look for what you want in it and use it. Ignore anything that isn't what you want. XML is ideally suited for this approach - especially if you use path based access rather than DOM tree traversal.
This approach to interfaces allows systems to interchange messages without exact version consistency, and without requiring a tight congruence of the applications. It allows a system to "tell what it knows" and another system to "read what it needs" without further ado.
Unfortunately, the use of schemas goes against this idea. It is IMHO a more old fashioned approach of rigidly constraining the messages to an exact specification. This can make interfaces far less robust and flexible, and increase the amount of work.
Schema processing may also be promoted to "verify" message integrity before processing. However, it only does so in the most primitive ways. Real world messages, especially in the business world, tend to have integrity rules that go far beyond what can be expressed in anything short of a complex computer program or equivalent declarations.
I am sure there are plenty of places where schemas make sense, but in the areas of commercial message interchange, they take a powerful and flexible construct and hobble it.
The only good weather is bad weather.
Honestly, I believe that xml is just a nice way to replace INI files and CSV files. Seems to be about it really. The odd business may use xml for b2b.
And XHTML really bites. You can tell the w3c doesn't listen.
When you bend over to take the latest XML Schema, don't forget your SOAP.
It's occurred to me maybe we are being too diligent in actually validating the schema itself, but I'm wondering what others think?
One of the greatest things about XML schemas is that they themselves are well-formed XML documents. This makes it a breeze to parse and create XML Schemas. I've just started using XML Schemas in development for the past few months, and they are fantastic. A huge improvement over both DTD and XDR (Microsoft's temporary schema format until XML Schemas came out).
Forget the whales - save the babies.
This approach to interfaces allows systems to interchange messages without exact version consistency, and without requiring a tight congruence of the applications. It allows a system to "tell what it knows" and another system to "read what it needs" without further ado.
Unfortunately, the use of schemas goes against this idea. It is IMHO a more old fashioned approach of rigidly constraining the messages to an exact specification. This can make interfaces far less robust and flexible, and increase the amount of work.
If your talking about using XML for data messaging not using schemas is just lazy. XML Schema allows optional elements and attributes and/or default values. So if it isn't required, then just make it optional. If you want multiversion interfaces, you have a different XMLSchema for each version. Then each side knows explicitly what the messaging protocol is.
While it's probably true that things mostly kinda work if the versions don't match, you shouldn't be relying on this. There's lots of software out there that does this but that doesn't mean it's the ideal.
If your using XML for markup of documents, schemas are somewhat less useful since the underlying semantics of the tags is usually more important.
I am not a number! I am a man! And don't you
Trimming bloat like namespaces and comments? Are you nuts?
How do you embed MathML in another document (like XHTML)? Currently it's with namespaces. How do you propose to do that without namespaces? Just the prefixes? What happens when two different markups use the same prefix? Wups! You're screwed!
No comments? This is supposed to make a better alternative to XML? It won't help readability, and it certainly isn't a major bottleneck during parsing.
Don't want the "bloat" of namespaces and comments? Wait for it... Wait for it... Don't use namespaces and comments in your documents! Wow! What a concept!
Maybe no Unicode in PXML hunh? So much for interoperability for any kind of data. You don't ever want your pet project used in East Asia (or Russia or Greece or most other places in the world) do you? Unicode too bloated? Why not just use ISO-8859-15 (basically ASCII w/ a Euro character -- which incidentally a Euro character isn't available in ASCII)? Oh wait! That's right. You don't want to allow processing instructions, which in XML tell you what encoding is used.
What happens if you want to change some of the basic syntax of PXML? Because you've nuked processing instructions, you can't specify a markup version like you can in XML.
Yes, yes. We've all seen your little pet project. I hope it was just a class assignment.
- I don't need to go outside, my CRT tan'll do me just fine.
I can't believe nobody's mentioned this yet. Microsoft has a tool that will do several things:
This makes writing your XSD almost trivial. The code-generation capabilities are very powerful, as well, as you can generate runtime classes for serialization/deserialization or classes derived from DataSet so you can treat XML files like any other database, etc. It's very useful if you're doing any
I'd be very surprised if there weren't other tools out there doing similar things. I simply mentioned xsd.exe because that's what I'm familiar with.
When you're talking about standards you need to have things specified exactly, and schemas give you a standard way to do that. And they also allow you to do things like automatically generate code blocks to represent your data in memory, saving developers of data-processing apps a lot of time. And not only that, they create a simple way to communicate between organizations. What would you have people do, look at the XML themselves and guess it's structure (which would work about 95% of the time, but that 5% will bite you in the ass when you get something unexpected).
And finally, Schemas don't force any of that on you. If you don't need schema support, then don't turn it on in your parser. You can still grab what you need out of the tree. Although you might not be able to throw just anything into it, that's probably a good thing. The last thing the world needs is thousands of tiny, ill-conceived exotic extensions to various Datatypes. It would make achieving universal compatibility a nightmare.
If your app doesn't need schemas, don't use 'em. If you don't need to validate, don't check em. If you need to put more data into your tree, maybe you should rethink what your doing or rewrite maybe your schema.
autopr0n is like, down and stuff.
When I was in school, the plural of "schema" was "schemata".
</>
I've already selected "No Karma Bonus". Beyond that I can't mod myself downward.
Prime numbers are exactly what Alan Greenspan says they are -S. Minsky
In my experience, many benefits of XML come when dealing with the presentation layers of many application architectures, with the ability to repurpose syndicated data at wil, here are a few examples:
Effective use of XML and XSLT allows you to easily aggregate informational data from one or multiple sources and "repurpose" for an infinite variety of business and technological goals.
One of the main benefits of XML is that it offers and effective, textual representation of "scructured data", that can be conveniently accessed and manipulated according to a slew of various surrounding standards such as XPath, DOM, XSLT, namespaces.
Extraordinary Vacations. Exceptional Prices
but I think you are totally off base with regard to CDATA sections. If anything, they make life easier for the parser, not harder -- at least when I was writing a parser, CDATA made things faster and easier. In cases where you are including a great deal of symbolic data -- for example, when you want to include a source code segment or ASCII art -- it is both easier to read, faster to parse, and *less* bloated.
'<' takes up less space than '<'. Assuming you have more than three or four of these in your text node, a CDATA section reduces the size of your document. For the parser, after the CDATA section is begun, only the character sequence ']]>' can end it. This means the parser only has to check for ']]>' and not '<', '&', '<?', '<!', etc.
And yes, there is such a beast called XInclude, but it's currently only a candidate release. It's used like this:
<foo>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="bar.xml" parse="xml">
<xi:fallback>
<para>This text goes in if bar.xml cannot be found or has an error</para>
</xi:fallback>
</xi:include>
</foo>
Hopefully most entities can go the way of the dodo.
- I don't need to go outside, my CRT tan'll do me just fine.
The key to robust parsing is deferring the decision as to whether a tag has a closing tag until you've seen enough input to know. You have to read the whole document in, build a tree, then work on the tree, but for anything serious you want to do that anyway.
This parser is in Perl. If anyone would like to take it over and put it on CPAN, let me know.
I don't agree with you that schema validation is useless. In many cases the documents are fully processed for business rules much later, but you want acknowledgement that your document has reached correctly and it passes atleast the most basic validation (e.g. dtd or schema validation). XML Schema do wonderful job at that. In our case, we always keep schema validation on new doc types until the system is stable and bug free and then remove validation for efficiency (for internal docs). We have discovered many subtle bugs in system which would have been extremely hard to track by looking at application error but were easier to find by looking at parser errors.
It would be somewhat unfortunate if both end up popular, because it will be more work to maintain both sets of tools than either one alone. That's probably what will happen, though, at least in the short term.
Yup. Even in XHTML. Check out this article on A List Apart for a useful method.
Well, that's why you'd use HTTPS with certificates, no? And nothing is wrong with the port. If you meant HTTP, then yeah, it's plaintext.
Mind you, I don't have a choice of OS's at work. We use solaris and linux. Now amazon, being a windows shop (i'm guessing), only gives out dll's. Great, now I'm not supported. So fine, we use java. Did you know java class (binaries) are versioned? I'm stuck with 1.3.1 ATM and a 1.4 jdk is in the works. Problem is, some jdk's use one version of the binary while another uses.. another. I always hoped it was a universal format. Sadly let down.
That's why technologies like JAXB and translets are poping up. with JAXB, you can bind particular classes to particular schemas/dtd. It speeds up processing. Translets are just compiled XSLT. Really fast since your xslt can be compiled/interpted once, run anyhwere. Kind of a chain technology. translet->xslt->java->machine language.
And mind you, nothing is more secure about a binary format. It's just obfuscated. Hell, I hacked rengeade bbs's users database format so i can write a user deletion tool. Were they going for security, prolly not. Point is, binary is just obfuscated.
As for your sessoin level security, that's not the job of your data format. Your data format and transport layer should be indepenent. It's why you can do SOAP over HTTP, SMTP/mail and possibly anything else that has a function() like response format. request->response. It's probably why ssh is so great. All it is, is a way of authentication, communication and encryption. You can create ssh tunnels for http as a proxy.
-
ping -f 255.255.255.255 # if only
The Schema WG decided on "schemas" so as not to add unexpected obscurity to the specification.
See this message.
Expected obscurity is of course just fine.
.. was a language only something you can "program"?
If it was called it a programming language that would be wrong, but it's certainly a language.
I'm sure their library would give you the data in exactly the format that you need it in, be available for the language that you want it in, the platform that you need it on, and they will continue to update and support every single variety. You would also, therefore, completely trust this closed, 3rd party code that you've now integrated into your product, to not have any bugs or security holes.
Open data formats are a good thing.
that the same applications of XML that drive the keening about bloat and hype seen in these comments are precisely those which are driving the specs to the wrong side of the 80/20 for XML/XSL's original goals: bringing the semantic power of SGML and DSSSL to the Web. Goals for which its purist cousins RelaxNG, REST, et. al. remain admirably suited.
The back-end curmudgeons are right, XML stinks for a universal wire format. But for loosely-coupled, message-based, semantically-rich systems it is hard to beat. And document-oriented systems which don't use XML barely deserve notice any longer.
I gently refer s-expression trolls to paul and oleg
illegitimii non ingravare
Absolutely. All the possible attributes, and kids of any element are there in one (OK, two) place(s) and you can garner the information about any element in a matter of seconds. With XML Schema you have to keep track of the levels of nesting and rifle through a series of name/value pairs to get the same information. It is in its greater expressiveness that the advantage of XSD is seen to lie. And there might be applications where this expressiveness necessitates the use of XSD.
However, XML Schema, has besides this expressivenss, one other great advantage. It is XML. As such it can be processed with the same XML tools one uses elsewhere with an XML application.
As an example, in one application, I take a DTD, translate it into XSD, and then run an XSL stylesheet over the XSD file to generate some base code used in my application. In this way I can ensure that my code will automatically be changed to reflect any minor changes made to my Schema.
So while I continue to write DTDs, I look on XML Schema as a way to translate, and bring my DTD into the XML universe, with all its attendant advantages.
Better to be despised for too anxious apprehensions, than ruined by too confident a security. --Edmund Burke
The value of XML is not the structure of the data. The tags, nodes, elements and attributes are just another format for parsing data. The power comes with the ability to VALIDATE the format. No other data exchange format has such an integrated approach to assuring the validity AND structure of data. Also, the hirearchical nature of XML makes it idealy suited to most information sets. It also, takes the organization of relationalal data to another level because node groupings inherently define a relationship between the information that is contained in the document. XML as just XML isn't that special but the ability to nest information and validate the structure make XML a more *reliable* data format. In the world of CSV files, ini files, and excel spread sheets and the like, it is a welcome change. As the tools evolve to take the comlexity out of creating things such as schemas. XML's potential as an interchange format will be fully realized. As for its verbosity, it is needed. The less structure the more the format is left open to interpretation.
Well, while I'm not particularly thrilled about having to learn another ten thousand things (XPath, XSLT, etc) in order to use XML to it's fullest extent, but it does have a huge advantage: easy to validate a document and it's data. I don't know about you, but I'm sick until my nuts ache having to parse through mile-long flat files produced by a legacy mainframe only to find out after-the-fact they forgot to leave out the damn carriage-return line-feed pairs and the layout is thrown off by one byte! Anyway, I can validate an XML document in like three or four lines of code, regardless of language. Plus, there are XML parsers written for just about every language and platform, even frickin OS390/Natural/ADABAS, which means I don't have to return a flat file to the COBOL folks! I can just have SQL Server 2K chew up the data, spit it out in XML format, and then have a DTS package automatically FTP the XML document back to the mainframe just in time for the JCL schedule to kick up! Dunno about you, but it makes my life easier for dang sure.
Spread the RC luvin'
I reallly like the approach of the REST guys, a much lighter weight and more intuitive approach to web services than SOAP.
Basically, they are saying - use HTTP as it was intened to be used, not abusing it in a way it was not meant to be abused.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
I see a lot of replies here that flame XML as an RPC protocol. Using XML as a message format for RPC is just one small part of what you can do with it.
:-)
The real strength you get for free when using XML is that you can use standard parsers and transformers to handle different kinds of data in a uniform manner.
I'll give you an example..
Today I was working on my wxWindows application, and I needed to translate (i18n) a lot of windows, dialogs and menus to a few languages. The resources are specified in XRC which is an XML format.
Within XRC files labels for buttons and other widgets are written in the language they were created in, usually english.
What I did was write an XML catalog file for each language. I then ran an XSLT processor with the XRC files, using the catalogs as plugins with a simple pattern match rule -- and I didn't have to write one single line of customized code to do it.
Though for RPC I'll stick to CORBA with IIOP any day
HTML is a subset of XML - an alternative to the bloated XMl language.
believe me, you wont use XML (and those pesky XSLTs) anymore if you once tried HTML
AND (most importantly) in virtually every single web browser that you can find, support for viewing this format over the internet is available and built into the browser itself!
Karma: NaN
I recently decided to go with RNG for my schemas after reading up on W3C XML Schema (WXS) and Relax NG (RNG) . RNG is just so much easier to read and understand. The real clincher for me was the inability in WXS 1.0 to describe non-deterministic structures. I mean, give me a break. I can't allow people to put the elements in a different order? That's just lame.
What's more there's a fantastic tool dtdinst that converts DTDs into Relax NG. There's also tools to convert back and forth between WXS and RNG. So if I ever need to provide someone with a WXS schema I can just run it off automatically.
Now I'm working on a system using AxKit to parse out the RNG schema, generate HTML forms for completion, roundtrip the data back to the server, assemble an instance document using DOM and display it using XSLT and CSS. But that's another story. People who don't "get" XML should really check out AxKit.
simon
home page
I hate snobby XML evangelists who do nothing but propose that companies change the systems that that have been working for them for 50 years, all to XML and the dozens of related technologies.
People should realize that XML adoption is a matter of degrees and it is perfectly acceptable to adopt just the concept of a simple markup language. Adoption of XML should NOT necessitate the adoption of XSLT, XHTML, SOAP, and every thing else under the sun that is related to it.
If you have a database system with a protocol for passing messages, then maybe add a new type of XML message with a privately known schema. Don't rip up your system and rewrite it as an XML web service, have it talk only the SQL Server XML services, only use XSLT to transform data, and only XPath for retrieving data.
I'm so fucking sick to death of overarchitects who never actually seem to get the job done on time. (And I'm no XP fan, either, so that's saying a lot!)
I would happily play around with XML Schema if only my Emacs/PSGML mode would accept a schema and treat it in the same way as it treats a DTD.
:-)
And sorry, I have neither the time to write my own Emacs mode nor the money to buy commercial XML tools.
Well, so I keep watching the tools and if they are Schema ready then so am I.
What happens when you want to have an Alpha box, a Pentium box, a handheld device, and an UltraSPARC box talk to one another? Simple, right? After all, an int is always 32 bits...err...umm...and everything is big endian...err...ummm...and all architectures use the same data structure padding... Well, at least your program took care of the padding issue...for that one data structure.
Wups! We've got a core dump waiting to happen. Okay, so we'll just make sure that everyone is using the same sizes and padding all around for any data structure I may need to pass over the network. Of course, this requires a mapping layer so I don't have to do this for every app and data structure that I write. I know: it's for interfaces and defines the general structure. I'll call it Interface Definition Language or IDL for short. Now I'll make sure that all of this information serializes to the network correctly and decodes on the other end without errors. This will be kind of like a stock broker it that I tell it what I want, and it translates it into something usable but more complex than I need to deal with for each app. I think I'll call it a broker too...an object broker...wait...missing something...messages going back and forth...asking for resources...aha! Object Request Broker! Yeah! Oh wait, but people may have different implementations and I want to be able to work with others. Let's agree on this. We'll call it the Common Object Request Broker...ummm...Architecture! Yeah, that's it!
Hmmm...now I need to make a configuration file for my program. I'll make it plain text. Hmmm...but it needs some kind of structure. I'll make it key/value pairs -- just put in a few equal signs and I'm done. Uh oh. My program is fairly modular, but I want to keep all of the settings in one place. If it's just key/value pairs, everything will get jumbled together. I know! I'll use an INI file. Microsoft used to use those to group items together. Now I can just use those nifty GetPrivateProfileString calls, specify the group and the key, and away I go. But uh oh! I have this subcomponent that requires a group within a group. Let me hack something together... Argh! This data file is getting tougher and tougher to parse. I want to finish writing my program that does something useful, not fiddle away at a dumb configuration file parser. What I need is a standardized, hierarchical format that is still plain text and human readable. Hmmm...what's this "XML" thing? I can have the configuration all in memory or read it in piecemeal? Parsers are already written? If I don't like the parser I'm using, I can just plug in another one? I can read the file from any programming language out there? Sign me up!
FYI: This binary vs. "plain text" tripe needs to go away. All text files are binary files. What is the letter 'B' but a 0x42 (66 in decimal, 01000010 in binary)? It's a piece of translation software that turns that 0x42 into the character 'B' on our screens. I just so happens that <foo/> is clearer to the human eye -- after the preliminary software translation step -- than a serialized C data structure. Clearer to the human eye means that the human fixing bugs can see the error faster. CPUs are hovering aroung the 3GHz range now, but the human mind seems to be falling further and further behind Moore's Law. Perhaps we should help the human mind out a bit and give a bit more work to the CPU.
Yes...I know...I'm a dick. I'm comfortable with that.
- I don't need to go outside, my CRT tan'll do me just fine.