Effective XML

Posted by timothy on Monday February 28, 2005 @09:30AM from the under-weaknesses-you-put-xml dept.

James Edward Gray II writes "I'm not an XML junkie and I thought this was a very good book, so I'm betting that XML aficionados will love it. Effective XML covers 50 best practices that all developers should know and use. This amounts to a book of distilled wisdom that will push you a good distance up the chart of XML mastery." Read on for the rest of Gray's review.

Effective XML
author: Elliotte Rusty Harold
pages: 304
publisher: Addison-Wesley
rating: 8
reviewer: James Edward Gray II
ISBN: 0321150406
summary: A guide to the correct use of XML.

Before I tell you what's inside though, let me tell you what you won't find in these pages. Primarily, you need to know that this book does not teach XML. I know a lot of books say that, yet still include an introduction or appendix that covers the basics, but this isn't one of them. You're expected to know XML from page one. Even syntax is only covered from a proper usage angle. Personally, I appreciated this. It always bothers me when an obvious non-beginner's book starts off by wasting a chapter on things I should already know. You just need to be aware when you buy that you won't learn XML here. Knowledge of namespaces, DTDs, the W3C's Schema Language, XSLT, and more isn't strictly required to get something out of this book, but it certainly would help you get a lot more out of it.

What you will get here is coverage of fifty miscellaneous topics spread across four sections on "Syntax", "Structure", "Semantics", and "Implementation". In "Syntax", ten topics delve into the details of things like DTDs, entity references and the XML declaration itself. It may sound silly to dig deep into a single line of XML that simply declares the format, but I doubt you will think so after reading that topic. There's a lot going on in that line and you want to be in control of those decisions instead of just copying and pasting. Entity references are an even smaller chunk of XML output, but they too get illuminated by a rare insight on how and when they should be used, and for what. Did you know that it is possible to write a namespace savvy DTD? I do now and I learned that in this section as well.
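To make the point about the XML declaration concrete, here is a minimal Python sketch (my own illustration, not an example from the book). It assumes the standard library's xml.etree.ElementTree parser and a made-up <name> element, and shows that the encoding named in the declaration decides how the bytes that follow are read.

```python
import xml.etree.ElementTree as ET

# The same character data, serialized as bytes in two different encodings.
# The <name> element and its content are hypothetical sample data.
latin1_doc = '<?xml version="1.0" encoding="iso-8859-1"?><name>Jos\u00e9</name>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><name>Jos\u00e9</name>'.encode("utf-8")

# With a truthful declaration, both byte streams decode to the same text.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# Without a declaration the parser assumes UTF-8, so Latin-1 bytes
# are rejected instead of being silently misread.
undeclared = "<name>Jos\u00e9</name>".encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```

Copying and pasting a declaration whose encoding does not match the bytes you actually write is the kind of mistake this guards against.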

The second section of the book covers "Structure", and to me it was the best part. This collection of seventeen topics is loaded with good advice about how to build an XML document that will be ideal for anyone who needs to work with it. Here you see how metadata should be stored in XML, get tips on embedding binary content, learn which schema language is better for which tasks, and finally understand rare XML constructs like processing instructions and exactly what they are for. Additionally, there's a lot of general advice on the right way to mark up content that's really worth its weight in gold. Just one example of what I learned here is that I underappreciate mixed content for great constructs like <name><given>John</given> <family>Doe</family>, <title>Ph.D.</title></name>. If you like that, you'll enjoy this whole section.
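As a rough illustration of why that mixed-content form is handy, here is a small Python sketch using the standard library's xml.etree.ElementTree. The code is my own, not the book's; only the <name> markup comes from the example above.

```python
import xml.etree.ElementTree as ET

doc = "<name><given>John</given> <family>Doe</family>, <title>Ph.D.</title></name>"
name = ET.fromstring(doc)

# Each child has its own text plus a "tail": the character data that
# follows its end tag inside the parent. The tails hold the space and comma.
for child in name:
    print(child.tag, repr(child.text), repr(child.tail))

# itertext() walks text and tails in document order, so the display form
# of the name falls out of the markup with no extra formatting code.
full_name = "".join(name.itertext())
print(full_name)  # John Doe, Ph.D.
```

The structured parts stay individually addressable (name.findtext("family") still returns "Doe"), while the punctuation between them lives in the document rather than in every program that prints it.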

Section three, "Semantics", deals primarily with parsers and their APIs. Again, you won't learn any APIs here. What's covered is their strengths and weaknesses and why you should choose a given API for a given task. SAX and DOM are the main focus of these ten topics, but there are other details sprinkled in, like XPath.
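For readers who have not used both styles, the following Python sketch contrasts them on one task. It is my own illustration using the standard library's xml.sax and xml.dom.minidom; the sample document and the SalaryTotal class are invented for the example, not taken from the book.

```python
import xml.sax
from xml.dom import minidom

DOC = (b"<employees>"
       b"<employee><salary>48000</salary></employee>"
       b"<employee><salary>50000</salary></employee>"
       b"</employees>")


class SalaryTotal(xml.sax.ContentHandler):
    """SAX: the parser pushes events at us and we keep only the state we need."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self._in_salary = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "salary":
            self._in_salary = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect before converting.
        if self._in_salary:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "salary":
            self.total += int("".join(self._chunks))
            self._in_salary = False


handler = SalaryTotal()
xml.sax.parseString(DOC, handler)
sax_total = handler.total

# DOM: the whole document is built in memory first, then queried.
dom = minidom.parseString(DOC)
dom_total = sum(int(node.firstChild.data)
                for node in dom.getElementsByTagName("salary"))
```

The SAX version never holds more than one salary in memory but needs explicit state tracking; the DOM version is a one-liner to query but pays for a full tree. That trade-off is the kind of thing this section of the book weighs.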

The fourth and final section is all about "Implementation". The thirteen topics here address client-side XML styling, server-side transformations, signatures, encryption, compression, and more. My favorite topic here was a terrific coverage of Unicode and how it affects XML. All developers should know at least as much about Unicode as what's printed here and this is a fine source to learn it from.

One thing that really stands out in the whole text is that the author isn't afraid to cover the dark side of XML. He will tell you where the design process was less than perfect, which tools have little practical value, and some of the problems with where XML technologies are headed. This isn't complaining though. All of this is targeted at how it affects XML developers today. You learn what you can safely skip and what should be outright avoided. The author even tells you what XML is bad at and gives you advice about when you shouldn't use it. That's the mark of a man who knows his subject, if you ask me.

All told, I think the author failed to completely convince me his way is perfect on only 2 topics. That means I learned 48 expert XML tricks. Surely that's worth the cost of the book in time and money. This isn't the first XML book you need, but I think it is the second XML book everyone should read.

You can purchase Effective XML from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

20 of 269 comments

Re:The Problem With XML by Further82 · 2005-02-28 09:54 · Score: 1, Insightful

It is inefficient machine-wise because string parsing is an extremely slow and computationally expensive operation. While Perl and friends make it seem easy to the programmer, the machine is still trudging through the text one character at a time. Try writing an XML parser from scratch in C (no std::string) and see how difficult it is.
XML Seems Cool by Aknaton · 2005-02-28 09:56 · Score: 2, Insightful

XML seems cool to me. I like the thought of being able to design a schema to suit my personal needs. But when it comes time to make use of that schema and actually keep data in it, it seems to be useless, at least as far as an end user (non-programmer) is concerned.

Do I have the wrong impression?
1. Re:XML Seems Cool by gizmofan · 2005-02-28 10:09 · Score: 2, Insightful
  
  XML is a way of decorating data with meaning, but it's not the most efficient or effective way of doing it. From a software point of view it's expensive to parse, incredibly so when heavily nested or structured, and it can be huge relative to the raw data that it's actually transmitting. The main problem I have with the way XML is often used is that it's the worst of both worlds. It documents the data that it encapsulates badly from a human point of view (it's difficult to read and repetitive) and verbosely from a machine point of view (ditto). Why not use something more apt from a machine point of view (Lisp s-expressions?) and something more apt from a human point of view (a document?).
2. Re:XML Seems Cool by Creosote · 2005-02-28 11:24 · Score: 2, Insightful
  
  XML isn't really "useless", but keeping data in XML files is probably a bad idea. What if you mistype one character in one tag for instance? What does your document mean now?
  This is sort of like saying that programming in C is a bad idea, because what happens if you mistype a function name, and your program refuses to run? That's what debuggers are for. Likewise, the XML world is full of open-source or low-cost schema-aware editors and validators. Minimally you should use an editor that knows which elements and attributes are legal while you're entering data. If you design a schema appropriately for your data, you can constrain data types with a great degree of precision.
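A minimal sketch of the first line of defense against a mistyped tag, assuming Python's standard xml.etree.ElementTree parser. The check_well_formed helper and the sample documents are hypothetical. Note that this only checks well-formedness; the schema-aware validation described above needs an additional library and is not shown.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return "ok" or a message locating the first well-formedness error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position
        return f"not well-formed at line {line}, column {column}: {err}"
    return "ok"


good = "<employee><salary>48000</salary></employee>"
# One mistyped character in the end tag: </salery> instead of </salary>.
bad = "<employee><salary>48000</salery></employee>"

good_result = check_well_formed(good)
bad_result = check_well_formed(bad)
print(good_result)
print(bad_result)
```

A typo that is repeated consistently in both the start and end tag would still be well-formed, which is exactly the case a schema catches.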
Re:n00b - help! by aldoman · 2005-02-28 10:02 · Score: 5, Insightful

XML is totally overhyped, which sadly makes people think it is a lot more complex than it is.

Think of it more like CSV than MySQL. It's just a format for representing structured data. It also happens to be quite easily read by humans.

Yes, you can do incredibly advanced things with XML, but there is nothing you can do in XML that you couldn't do in your own proprietary data storage format.

The reason people use XML instead of writing their own data storage format is simple: there are a lot of tools for parsing it, which you'd have to write yourself if you had your own format.

As for the JavaScript and XML example, it's impressive, but it's far more JavaScript than XML.

--
IntechHosting - Free domain, 2GB, PHP, £4.95/$8.95
Tip #1 by Anonymous Coward · 2005-02-28 10:08 · Score: 1, Insightful

1) XML is not designed to be used for everything under the sun.
Re:The Problem With XML by Further82 · 2005-02-28 10:09 · Score: 4, Insightful

They are supposed to be written so people can make programs to read the data without spending hours reading huge cryptic implementation manuals. You forget that computers do not program themselves yet. People still need to do that, and XML is easier for people to read and thus easier for them to make programs to read. When machines can program themselves... we wouldn't be having this conversation.
Just because you CAN... by IGnatius+T+Foobar · 2005-02-28 10:27 · Score: 4, Insightful

Sometimes, the most effective use of XML is to simply not use XML at all. XML is a wonderfully useful tool when applied correctly. It's architecture-independent and is a great way to communicate unstructured and/or hierarchical data.

Sometimes, though, your data can be simple enough that XML is overkill. Software developers need to make themselves aware of situations when they might be better served by a simple "flat file" of delimited data. In situations like this, using XML can amount to what I like to call "gratuitous complexity."

Always use the right tool for the job.

--
Tired of FB/Google censorship? Visit UNCENSORED!
Re:The Problem With XML by pyrrho · 2005-02-28 10:28 · Score: 2, Insightful

one of the original ideas of XML was that a simple (SAX like) parser can be written by "a graduate student in two weeks".

The validation etc is more difficult, but then it's not a matter of parsing the XML in the first place.

It matters what you mean, but in general XML is easily parsed by machines... and easily represented in internal data structures, which are as efficient as you make them.

--
-pyrrho
Re:The Problem With XML by Anonymous Coward · 2005-02-28 10:32 · Score: 1, Insightful

Are you kidding? Let's say you have three employees. You want to send the following data on each to a remote machine: first name, last name, salary in dollars.

Joe Smith $48000
Jane Smith $50000
Steve Shmo $65000

Method #1: send the following as a string:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE employees .. blah blah blah lots of crap here>
<employees>
<employee>
<first-name>Joe</first-name>
<last-name>Smith</last-name>
<salary>48000</salary>
</employee>
<employee>
<first-name>Jane</first-name>
<last-name>Smith</last-name>
<salary>50000</salary>
</employee>
<employee>
<first-name>Steve</first-name>
<last-name>Shmo</last-name>
<salary>65000</salary>
</employee>
</employees>

Method #2: send the following as a single string:

3R3:Joe5:Smith5:480004:Jane5:Smith5:500005:Steve4:Shmo5:65000

I.e., the number of records, followed by R, then for each field its length as a decimal string, a colon, and the string itself in iso-8859-1; this is repeated three times per row (first name, last name, salary), and once for each row.

There's no way a parser for #1 is going to be more efficient than one for #2. In fact a parser for #2 can be made secure more easily because you can pre-allocate your buffers.

#2 is easier to generate, and easier to explain than a full XML parser.

Yeah, #2 is a little harder to read. That's because it's a *machine* format, designed to be easy for a machine to parse, and somewhat easy for a human to debug occasionally. If you need to read lots of it, write a program to dump it in a human-readable format. (I bet your human-readable format won't look anything like XML.. it'll probably look more like YAML)

The advantage of XML is that you *don't* have to write the parser at all. It is slightly more programmer-efficient in many situations. And the tags give you the illusion of understanding the meaning. (I say illusion because there's no way to know which tags are optional in my example. You still need a description (a schema or DTD), which you need for method #2 anyway).
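For comparison, Method #2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the poster's code: the encode and decode names are made up, every row is assumed to have exactly three fields, fields are assumed to be packed with no whitespace between them, and lengths are counted in characters (which equals bytes for iso-8859-1).

```python
def encode(rows):
    """Encode rows of strings as <count>R followed by <length>:<value> fields."""
    out = [f"{len(rows)}R"]
    for row in rows:
        for value in row:
            out.append(f"{len(value)}:{value}")
    return "".join(out)


def decode(data, fields_per_row=3):
    """Decode the format produced by encode(); raise ValueError on bad input."""
    count_text, _, rest = data.partition("R")
    count = int(count_text)
    pos = 0
    rows = []
    for _ in range(count):
        row = []
        for _ in range(fields_per_row):
            colon = rest.index(":", pos)      # end of the decimal length
            length = int(rest[pos:colon])
            start = colon + 1
            value = rest[start:start + length]
            if len(value) != length:
                raise ValueError("truncated field")
            row.append(value)
            pos = start + length
        rows.append(row)
    if pos != len(rest):
        raise ValueError("trailing data after last record")
    return rows


employees = [["Joe", "Smith", "48000"],
             ["Jane", "Smith", "50000"],
             ["Steve", "Shmo", "65000"]]
wire = encode(employees)
print(wire)  # 3R3:Joe5:Smith5:480004:Jane5:Smith5:500005:Steve4:Shmo5:65000
```

The decoder is short, but note what it hard-codes: the number of fields per row and their meaning live in the program, not in the data, which is the trade-off the last paragraph of the comment describes.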
Re:n00b - help! by Piquan · 2005-02-28 10:46 · Score: 5, Insightful

The coolness of XML is not in the format (which sucks); it's in the technologies around it.
RelaxNG, for instance, lets you verify that your XML file is built correctly for your app: you write a RelaxNG spec for your XML file format, and then it verifies that all the mandatory fields are there, in whatever order is necessary, with the correct datatypes, etc, etc. RelaxNG processors are part of most major XML libraries now, so if you're writing Perl you can just tell your Perl library to validate your file and it's done. If you're editing in Emacs (with nxml-mode), you can point Emacs at your RelaxNG file, and have tab completion, error highlighting, etc, etc-- all customized for your file format.
XSLT lets you take an XML file and perform transformations on it into another (possibly XML) file format. Need to convert XML into SQL INSERTS? Piece of cake. I use it to extract particular parts of an XML file and convert them into a significantly differently-ordered Lisp structure.
Most modern web browsers are becoming CSS engines rather than HTML engines. So you can stick a CSS stylesheet reference at the top of your XML file, and have the CSS generate something that looks like what you want the user to see. The data file looks good to the app, and looks good to the user. You can also (with some browsers) use more powerful transformations using something like DSSSL or XSLT.
DOM for a standard data manipulation API, so each program you write doesn't have a different data access language. XPath as a language to perform more complex queries. XML Namespaces to let users or apps tag their data with extensions. XInclude for data sharing. All of these are things you get for free with XML.
All of these are general technologies, not specific apps. So they should be usable in most major libraries in most languages. (If you're using Perl, I'd recommend XML::LibXML.)
Don't think of XML as just a file format, because that part sucks. Think of it as a buffet table of technologies. When you write a program, 10% is to do the program's processing; the other 90% is to handle I/O, data management, and other housekeeping. Using XML lets you get a lot of that for free.
PS: I'm not an XML fanatic. A year ago, I was told to use XML for one particular project and was disgusted at the idea. I still think that XML gets a lot wrong, but I've come to recognize what benefits XML provides.
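A small taste of the "free" query tooling, sketched in Python rather than Perl. It assumes the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath; RelaxNG validation and XSLT need third-party libraries and are not shown. The employee data is invented sample data.

```python
import xml.etree.ElementTree as ET

DOC = """\
<employees>
  <employee><first-name>Joe</first-name><last-name>Smith</last-name><salary>48000</salary></employee>
  <employee><first-name>Jane</first-name><last-name>Smith</last-name><salary>50000</salary></employee>
  <employee><first-name>Steve</first-name><last-name>Shmo</last-name><salary>65000</salary></employee>
</employees>
"""

root = ET.fromstring(DOC)

# XPath-style predicate: employees that have a <last-name> child equal to "Smith".
smiths = [e.findtext("first-name")
          for e in root.findall("./employee[last-name='Smith']")]

# Descendant search: every <salary> anywhere under the root.
payroll = sum(int(s.text) for s in root.findall(".//salary"))

print(smiths)
print(payroll)
```

No parsing or tree-walking code had to be written for either query, which is the point of the buffet-table argument.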
Re:Mod parent up by Anonymous Coward · 2005-02-28 10:47 · Score: 1, Insightful

I strongly prefer the RDF data model

Ugh, no. How do I say that object X has the following TWO properties? I can't. I have to say: "person has first name tom". "person has last name jones". I can't say "person has first name tom and last name jones".

The Relational model is the best model for data because 1) it allows multiple attributes in a single predicate and 2) you don't have to repeat the attributes, they just go in the relation header. But these are DATA MODELS, they aren't TEXT FORMATS, which is what XML is. They are trying to reverse-engineer a hierarchic data model for XML, but hierarchic data models are flawed because they are optimized for certain uses and not others (i.e., what if the data I want is at the leaves of the tree?)
Re:The Problem With XML by eap · 2005-02-28 10:50 · Score: 4, Insightful

Of course someone else might find a good way to tell me why I should use 40 characters to transmit what should have taken 10 characters and how it should have been faster or more efficient some way to use it. The whole concept was definitely good for a lot of programmer payroll time.
I would not be so quick to dismiss XML because of traditional arguments. Having worked with several different ways of storing and transmitting structured information, I can say without question XML comes out easiest in the end.
If you're only transmitting 10 characters, then yes XML is not for you. However, if you're describing dynamically changing, complex data, even in large amounts, XML is very handy.
There are turnkey parsers for XML that are well tested and which allow the client to see an abstracted view of the data as an object, at any level of detail desired.
Platform independence is built in.
It's easy to syntactically validate XML, as it's done automatically. It's also easy to isolate logical validation into discrete units since XML couples easily to object oriented designs.
Very large XML messages can be processed quickly using a pull parser. Pull parsing is faster than SAX and has the intuitive benefit of being client driven, not event driven.
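A rough Python sketch of the client-driven style, assuming the standard library's xml.etree.ElementTree.XMLPullParser as a stand-in for whichever pull parser the poster had in mind. The chunked input simulates data arriving from a network socket; the element names are made up.

```python
import xml.etree.ElementTree as ET

# Simulated network chunks; note the first one ends in the middle of a tag.
chunks = [
    "<employees><employee><sal",
    "ary>48000</salary></employee>",
    "<employee><salary>50000</salary></employee></employees>",
]

parser = ET.XMLPullParser(events=("end",))
total = 0
for chunk in chunks:
    parser.feed(chunk)
    # The client decides when to ask for events; nothing calls back into us.
    for event, elem in parser.read_events():
        if elem.tag == "salary":
            total += int(elem.text)
        elif elem.tag == "employee":
            elem.clear()  # drop finished records so memory stays flat
parser.close()
print(total)
```

Because the loop owns the control flow, it can stop early, interleave other work, or skip whole subtrees, which is harder to arrange with SAX-style callbacks. Whether it is actually faster depends on the parser implementation.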
Re:Really? by sosume · 2005-02-28 10:50 · Score: 2, Insightful

I'm more interested in using XML as a means for language-independent object persistence (not just cheesy .NET XmlSerializer class stuff either). How much coverage of such things is there in the book? I.e., creating an object in Java on one machine, persisting it and its state to an XML file, and recreating it on some other machine in C++ or C#. I'm tired of writing my own "protocols" to migrate running code from one app to another.

You have obviously never looked into SOAP, which seems to be able to address every requirement you are describing.

But, not using Soap is quite common on Slashdot ;)
Re:hmmm by elharo · 2005-02-28 11:18 · Score: 5, Insightful

Ever try to debug deeply nested LISP in a plain vanilla text editor? Ever try to find exactly which closing parenthesis is missing where? That's why end-tags have names. It's pure human factors. Computers don't care about this. People do.

SGML (XML's precursor) did have minimized end-tags like </>. Experience proved this caused more pain than it alleviated. Hence the lack of minimized end-tags in XML.
so do we love or hate Mozilla and FireFox today? by roman_mir · 2005-02-28 11:51 · Score: 4, Insightful

After all XUL and RDF together with js, css and resource files - that's what makes FireFox tick.

--
You can't handle the truth.
What's so bad about XML? by rikkus-x · 2005-02-28 12:16 · Score: 3, Insightful

I give customers a specification showing how I would like data sent to me. They can use the specification to tell them how to store their data, because they can read it. They can check that their data matches the specification, because their machine can read it.

When I receive their data, I can check that it matches the specification, because my machine can read it. If there is something wrong with their data, I can point out where it's broken, because it's human-readable.

Writing specifications is easy. Writing generators and parsers is easy. The tools are ubiquitous. Generation and parsing are usually fast 'enough'. The standards are freely available. Complex data structures may be described. Data may be transformed using a common language based on XML itself.

Yes, I'd like it to be easier to write XML parsing tools. Yes, I'd like it to be easier to write tools which handle XML more efficiently. No, the two points above don't make XML the devil's data encapsulation.

Rik
1. Re:What's so bad about XML? by Xorkid · 2005-02-28 12:54 · Score: 2, Insightful
  
  Nothing,
  People just fail to realise what XML is (or isn't). Basically XML is just a way for you to define your own (markup) language for any purpose.
  That's it. It is not a database replacement. It won't walk on water or feed the hungry or kill all the communists/terrorists.
  But if you want to persist textual data with structure, in a form that will most probably be readable in 20 years time, XML is for you.
  
  --
  www.microsoft.com/athome/security/children/kidtalk.mspx Was This Information Useful?
Re:The Problem With XML by Proc6 · 2005-02-28 13:39 · Score: 4, Insightful

This is like comment #492 that XML is slow and a poor format to use for databasing.
People are trying to use XML for something other than what it was intended for, then complaining about the sub-standard results. Surprise? XML is a common format that makes it possible to move data between different, I'll use the word, "domains" (as in division, not URL). It should be used for "just" that.
In other words, XML should be a "transport" mechanism. It's so I'm not writing a new parser by hand every time some wanker like you sends me a file in yet another made-up-on-the-spot type. Your example is relatively clean, but in the real world, as the data gets harder to describe, humans start to make more ignorant made-up-on-the-spot rules like "Well ok, if there's a sub record the line will start with a -, well ok it could be a + too, if the subrecord can only contain numbers... no, you know what, let's make it -n if the sub records can contain numbers only..". No matter how ingenious your "format" is, the problem isn't your format, it's that your format isn't my other customers' format.
XML should be used in scenarios where the time saved by using the readily available XML parsing and validating tools, instead of re-inventing the wheel, is more than the milliseconds lost parsing a longer document "once".
Don't use XML as your main, permanent datastore for a gigantic database and complain. It's not for that. It's for when I need a copy of your data and I don't want to pay for a copy of "JackoffDb version 5" that you run, or hire a team of programmers to write a translator just to read your files. Gimme XML, I can take that and understand its contents and schema with ease, then I'll import it into my own system here.

--
I'm Rick James with mod points biatch!
Re:damn by charlieo88 · 2005-02-28 14:23 · Score: 2, Insightful

HA! I'd mod you up if you weren't already maxed out.