Effective XML


Posted by timothy on Monday November 24, 2003 @07:15AM from the specificity dept.

milaf writes "Who doesn't know about XML nowadays? Quite a few people, actually: there has been so much hype around it that some people think that XML is a programming language, a database, or both at the same time. On the other hand, if you are a developer, chances are that you feel that -- no matter its usefulness -- there is not much to XML. After all, it may take just a few hours to get the hang of creating and parsing an XML document. Maybe this is why most of the many and voluminous books discuss numerous XML-related technologies, but say less about the usage of XML itself." Read on for milaf's review of a book that takes the opposite tack.

Effective XML: 50 Specific Ways to Improve Your XML
author: Elliotte Rusty Harold
pages: 336
publisher: Addison-Wesley
rating: 10/10
reviewer: milaf
ISBN: 0321150406
summary: Very well written collection of topics on XML Best Practices

In Effective XML: 50 Specific Ways to Improve Your XML, Elliotte Rusty Harold takes a different approach: know your elements and tags -- they are not the same thing! -- and weigh your choices in a context, because any technology applied for the wrong reasons may fail to deliver on its promises.

Following Scott Meyers' groundbreaking Effective C++, the author invites us to re-evaluate seemingly trivial issues to discover that life is not as simple as it seems in the world of XML. In each of the 50 items (chapters), he gets into the inner workings of the language, its usage and related standards, thus giving us specific advice on how to use XML correctly and efficiently. The 300-page book is divided into four parts: Syntax, Structure, Semantics, and Implementation. Yet in the introduction, the author sets the tone by discussing such fundamental issues as "Element versus Tag," "Children versus Child Elements versus Content," "Text versus Character Data versus Markup," etc. On these first pages the author started earning my trust and admiration for his knowledge and ability to get right to the point in clear and simple language.

The first part, Syntax, contains items covering issues related to the microstructure of the language, and best practices in writing legible, maintainable, and extensible XML documents. (In it, over 19 pages are dedicated to the implications of the XML declaration!) That seems a lot for one XML statement that most people cut-and-paste at the top of their XML documents without giving it much thought, doesn't it? Actually not, if you follow the author's reasoning and examples.
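
As a rough illustration of why the declaration matters, here is a minimal Python sketch (the sample bytes and the element name are invented for this example, not taken from the book). The same bytes either parse or fail depending on whether a declaration names their encoding:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the text "café" stored as ISO-8859-1 bytes (0xE9 = é).
body = b"<name>caf\xe9</name>"

# With a declaration naming the encoding, the parser decodes the byte correctly.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
text_with_declaration = ET.fromstring(declared).text

# Without a declaration the parser has to assume UTF-8, and 0xE9 followed
# by "<" is not valid UTF-8, so parsing fails.
try:
    ET.fromstring(body)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

print(text_with_declaration, parsed_without_declaration)
```

This only shows one consequence (encoding); the book's item presumably covers more ground than this sketch does.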

The second part, Structure, discusses issues that arise when creating data representations in XML, i.e., mapping real-world information into the trees, elements, and attributes of an XML document; it also talks about tools and techniques for designing and documenting namespaces and schemas.

The third part, Semantics, explains the best ways to turn the structural information represented in XML documents into meaningful data. It teaches us how to choose the appropriate API and tools for different types of processing to achieve the best effect. This part has a lot of good advice for creating solutions that are simple, effective, and robust.

The final part, Implementation, advises the reader on design and integration issues related to the utilization of XML; these issues include data integrity, verification, compression, authentication, caching, etc.

This book will be useful to a professional with any level of experience. It may be used as a tutorial and read from cover to cover, or one can enjoy reading selected items, depending on experience and taste. The book's very detailed index makes it an excellent reference on the subject as well. In the preface to the book, the author writes, "Learning the fundamentals of XML might take a programmer a week. Learning how to use XML effectively might take a lifetime." I'm not sure about the "lifetime" -- that's an awfully long time for using one technology -- but for the most confident of us this still may not be enough :) . Your mileage may vary, but I suspect that you could shave a few months off that time by browsing through this book once in a while. Most importantly, it will make you a better professional and make you proud of the results of your work. Wouldn't this be worth your while?


27 of 312 comments

library by Pompatus · 2003-11-24 07:19 · Score: 5, Interesting

If you want to read any book for free, just ask your local library to order it and they will. Libraries guess at what books people want to read, so if anyone shows any interest in any book, they order it. They lose their federal funding if they don't spend the money they are allocated, so they are generally VERY willing to buy as much as possible.

--

----
Squirrel ... It's not just for breakfast anymore
One thing is for sure... by foistboinder · 2003-11-24 07:19 · Score: 4, Funny

It's got to be better than Ineffective XML

--
Yet Another Web Site
Government Health Warning by NickFitz · 2003-11-24 07:20 · Score: 4, Funny

Learning how to use XML effectively might take a lifetime
...
you could shave a few months off that time by browsing through this book

Reading this book shortens life expectancy. Still, it's your choice...

--
Using HTML in email is like putting sound effects on your phone calls. Just say <strong>no</strong>.
Unix Tab-Separated ASCII Files vs. XML by billstewart · 2003-11-24 07:23 · Score: 4, Interesting

Sure, XML isn't inherently that deep - but neither are the tab-separated ASCII files which Unix tools used to do all kinds of really powerful things. Similarly, LISP property lists aren't that complex. XML's a bit more flexible, and carries enough decoration with it that people are willing to use it for building interfaces that they might not build using ASCII or XDR. And anything that lets the EDI people replace their stuff with simpler, more open technology is good too..

--

Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
1. Re:Unix Tab-Separated ASCII Files vs. XML by Anml4ixoye · 2003-11-24 07:35 · Score: 4, Interesting
  
  And anything that lets the EDI people replace their stuff with simpler, more open technology is good too..
  
  My current project for the last 8 months has been working on just that - parsing HIPAA EDI transactions. We do it by converting them to XML data structures. There is a decent white paper about it too.
  
  What I've found is that, for readability, XML is the way to go. For performance, EDI is definitely better. I have one EDI file that is 23k. When expanded to XML, it is close to 5000 lines long.
  
  I agree with an earlier post. If you are using a hardware XML accelerator, or using small XML documents (config, etc), or needing readability over performance, then it is great. But I have a hard time believing that it will replace tab-separated files any time soon (not that the parent poster was implying this).
  
  --
  Random Musings
The main issue with XML is performance by Anonymous Coward · 2003-11-24 07:23 · Score: 4, Informative

Others have said it before, but I'll say it again. XML is heavy weight and isn't free. The best example of this is SQLXML. Although it sounds nice to use SQLXML, most commercial databases see a huge drop in performance with it. This is due to the fact that parsing XML blows and eats up copious amounts of CPU and memory. I've had people ask me about how to solve problems with SOAP on Windows and Java applications. The bottom line is, unless you're using hardware XML accelerators, XML is a resource hog.
On a related note, more details on Microsoft Indigo are finally available. According to this article on XML mania, Microsoft's future platform will use XML as much as possible. More details are available on Microsoft's site. The funniest part is they are claiming Indigo + Longhorn will be the best thing since sliced bread. Maybe they haven't learned the hard lesson that parsing XML kills performance.
1. Re:The main issue with XML is performance by I8TheWorm · 2003-11-24 07:47 · Score: 5, Insightful
  
  To put it another way...
  
  this single record
  
  Doe, John 1234567 12/1/2001
  
  took 31 bytes, while its XML companion (using short, simple tags) took 96 bytes.
  
  Not all XML files wind up being 3 times the size of their flatfile counterparts, but they are inherently larger. There really isn't a way to make loading/parsing that data any faster, by the nature of working with ASCII/ANSI files. XML will always be slower.
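
The size comparison is easy to try for yourself. The sketch below is only illustrative: the XML version of the record did not survive in the post, so the element names here are invented, and the flat record is taken with single spaces between fields, so the byte counts will not match the 31 and 96 quoted above.

```python
import xml.etree.ElementTree as ET

# The flat record quoted above, with single spaces between fields
# (the original spacing was not preserved, so the counts are illustrative).
flat = "Doe, John 1234567 12/1/2001"

# Hypothetical XML equivalent; the element names are invented for this sketch.
record = ET.Element("rec")
ET.SubElement(record, "last").text = "Doe"
ET.SubElement(record, "first").text = "John"
ET.SubElement(record, "id").text = "1234567"
ET.SubElement(record, "date").text = "12/1/2001"
xml_bytes = ET.tostring(record)  # bytes, no XML declaration by default

flat_size = len(flat.encode("ascii"))
xml_size = len(xml_bytes)
print(flat_size, xml_size, round(xml_size / flat_size, 1))
```

The exact ratio depends entirely on how long the tag names are relative to the data they wrap.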
  
  --
  Saying Android is a family of phones is akin to saying Linux is a family of PCs.
2. Re:The main issue with XML is performance by Boing · 2003-11-24 08:31 · Score: 4, Insightful
  
  Doe, John 1234567 12/1/2001
  took 31 bytes, while its XML companion (using short, simple tags) took 96 bytes.
  
  Uh huh. Now let me ask you, is that record space-delimited? Comma-delimited? Fixed-width [shudder]? If it's fixed width, and the first name is fixed at four characters, is the person's name "John" or "John-Paul"?
  31 bytes for your record, and 96 for equivalent XML... but how many extra bytes were spent on code to manage your particular flavor of data? How much time was spent in development of that code? How does that time (and associated cost) compare to the extra millisecond/record required to transmit and process the XML data?
  XML is standard. It can fit almost any type of data (though binary data is not currently the most effective thing in the world, but it can be incorporated). Since MS is integrating XML into all of their products, we won't have to worry about many people who don't have a good XML library installed on their systems. So instead of 50 programs with their own (limited and likely buggy) data formatting subsystems, we'll have 50 programs that each call one library on disk, in a standard, robust system with enough exposure to squash the show-stopping bugs.
  XML will always be slower.
  
  Depends on how you look at it. If the aforementioned widely-available XML parser gets enough of a beating, it will be optimized like you wouldn't believe. Yes, two data processors (one XML, one markupless) with equal amounts of work spent on them will perform in favor of the simpler format... but XML's simplicity and universality will make it so that the XML parsers will have more eyes.
  The same philosophy is why the well known open-source programs (linux, apache, etc) are functional and stable as hell:
  Wide use + Openness = Greatness.
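
The delimiter question raised at the top of this comment can be made concrete with a small Python sketch. The flat line is the one quoted above; the XML form and its element names are hypothetical, made up for the example.

```python
import xml.etree.ElementTree as ET

flat = "Doe, John 1234567 12/1/2001"

# Two plausible readings of the same line, depending on the assumed delimiter.
as_comma_delimited = [field.strip() for field in flat.split(",")]
as_space_delimited = flat.split()

# Hypothetical XML form: each field is named, so no delimiter convention
# (and no fixed width for the first name) has to be agreed on in advance.
doc = "<rec><last>Doe</last><first>John-Paul</first><id>1234567</id></rec>"
fields = {child.tag: child.text for child in ET.fromstring(doc)}

print(as_comma_delimited)
print(as_space_delimited)
print(fields)
```

Neither reading of the flat line is wrong in itself; the point is that the reader has to know the convention from somewhere outside the data.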
XML... by the+man+with+the+pla · 2003-11-24 07:23 · Score: 5, Insightful

I think one of the main problems with the embedding of XML architecture into office productivity software is unfortunately the end user. I mean, how long have programmes like MS Word had "document properties" contained in them, and how many people are actually using them? I'm currently working on a project to retrieve documents across a company's backed-up data from the past 10 years, and there is very very little metadata available for us to do any searching on. Unless the embedded XML contained within office suites is brought more "to the fore" and in the face of users, instead of being a behind the scenes 'option', people just are not going to use it.

--
The linux hacker
where are the open source XML repositories by acomj · 2003-11-24 07:25 · Score: 4, Interesting

XML would work better if there were consistent DTDs for tagging information that everyone would use. There should be an open database of these DTDs.

I was looking for a simple one to tag photos with. Couldn't find it, made my own. Is there a repository of these DTDs out there?
1. Re:where are the open source XML repositories by Anonymous Coward · 2003-11-24 07:45 · Score: 5, Informative
  
  Maybe here?
2. Re:where are the open source XML repositories by GeckoX · 2003-11-24 08:08 · Score: 5, Informative
  
  You have absolutely NO idea what you are talking about, and of course have been modded +3 insightful. Good one mods.
  
  XML is extensible by its very nature. By itself, an xml file is just that, an xml file, it means absolutely NOTHING without context and definition.
  
  This is what DTDs do. They don't limit xml in any way, rather they describe a particular use of xml. For example: SVG, MathML and XHTML are all languages that use xml. Each one of these languages has a DTD that defines the format for a valid xml document FOR THAT LANGUAGE.
  
  Just because a DTD for SVG exists doesn't mean that anything at all has changed with xml itself.
  
  Next, XSLT is a technology with a very specific purpose, simply put: To take an xml file as input and create a new xml file for output based on the rules written into the transform.
  
  So, with all of that said, there is absolutely NO reason why there shouldn't be a DTD repository, and again, there is no reason why there shouldn't be a PhotoAlbum DTD in that repository. What problems would this cause? None. What benefits could be observed? Instead of everyone needing an xml document to describe photo albums rolling their own format, people might just reuse a standard DTD to do so. And application writers just might too. And lo and behold, Application X on platform Y might be able, with no work involved, to open Album AA created by Application BB on platform CC.
  
  Getting some of the big picture?
  
  --
  No Comment.
milaf, if you could expand a bit... by Randolpho · 2003-11-24 07:27 · Score: 4, Insightful

Does the book discuss the pros and cons of XML? Such as, when is it a good idea to use XML? When would a CSV, INI, or other structured text document be a better choice than XML?

These are issues that need to be solved first, before one creates an effective XML structure. Does the book address them?

--
"Times have not become more violent. They have just become more televised."
-Marilyn Manson
1. Re:milaf, if you could expand a bit... by LetterJ · 2003-11-24 07:48 · Score: 4, Insightful
  
  Unfortunately, most Slashdot reviews are little more than book reports with pretty much no analysis. They end up just listing what the chapters contain.
  
  Incidentally, one of the main reasons to choose XML over either CSV or INI is that both of those formats are pretty much driven by rigid "column" type structures. In most INI files there's only room for pairs of names and single values. In CSV, each record is one row with a set number of fields.
  
  XML lets you expand the children fully and represent more complex data. For instance, a classical CSV file with address information for customers would have columns for street address, city and then start to have problems when you start having columns for State (when you actually consider the world outside the US), postal codes, etc. If this is in XML, you can have your schema be more flexible and say that each <customer> contains a <shippingaddress> element which can contain either a <state> or a <province> or neither.
  
  In other words, you can use trees to represent data instead of flat rows. I'm not saying that it's the be-all and end-all that the evangelists say it is. There are still lots of places that simpler text files and other data storage formats are better, but XML can be useful.
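
A minimal Python sketch of the address example above. The sample data is invented, and the <customers> wrapper and <city> element are assumptions added for the example; <customer>, <shippingaddress>, <state> and <province> come from the comment itself.

```python
import xml.etree.ElementTree as ET

# Sample data invented for this sketch: one address with a state,
# one with a province, and one with neither.
doc = """
<customers>
  <customer><shippingaddress><city>Austin</city><state>TX</state></shippingaddress></customer>
  <customer><shippingaddress><city>Toronto</city><province>ON</province></shippingaddress></customer>
  <customer><shippingaddress><city>Singapore</city></shippingaddress></customer>
</customers>
"""

def region(address):
    """Return the state or province text, or None if the address has neither."""
    for tag in ("state", "province"):
        value = address.findtext(tag)
        if value is not None:
            return value
    return None

regions = [region(addr) for addr in ET.fromstring(doc).iter("shippingaddress")]
print(regions)
```

In a CSV file the same data would need either an empty column per optional field or a convention for overloading one column, which is the rigidity the comment is describing.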
  
  --
  
  The Glass is Too Big: My Take on Things
XML Limited in at least one regard. by AllergicToMilk · 2003-11-24 07:28 · Score: 4, Interesting

One of the things that I have found limiting about XML is that it is inherently hierarchical. Real "things" can be categorized many ways. Hierarchical classification systems (such as our modern file systems) work poorly to classify a broad scope of information. Hence some of the new development in the FS in Longhorn, and also some for Linux that I've heard about but can't remember.

--
There are only 6,863,795,529 types of people in the world.
Here's the list of 50 by FearUncertaintyDoubt · 2003-11-24 07:32 · Score: 5, Informative

Syntax:
Include an XML Declaration
Mark Up with ASCII if Possible
Stay with XML 1.0
Use Standard Entity References
Comment DTDs Liberally
Name Elements with Camel Case
Parameterize DTDs
Modularize DTDs
Distinguish Text from Markup
White Space Matters

Structure:
Make Structure Explicit through Markup
Store Metadata in Attributes
Remember Mixed Content
Allow All XML Syntax
Build on Top of Structures, Not Syntax
Prefer URLs to Unparsed Entities and Notations
Use Processing Instructions for Process-Specific Content
Include All Information in the Instance Document
Encode Binary Data Using Quoted Printable and/or Base64
Use Namespaces for Modularity and Extensibility
Rely on Namespace URIs, Not Prefixes
Don't Use Namespace Prefixes in Element Content and Attribute Values
Reuse XHTML for Generic Narrative Content
Choose the Right Schema Language for the Job
Pretend There's No Such Thing as the PSVI
Version Documents, Schemas, and Stylesheets
Mark Up According to Meaning

Semantics:
Use Only What You Need
Always Use a Parser
Layer Functionality
Program to Standard APIs
Choose SAX for Computer Efficiency
Choose DOM for Standards Support
Read the Complete DTD
Navigate with XPath
Serialize XML with XML
Validate Inside Your Program with Schemas

Implementation:
Write in Unicode
Parameterize XSLT Stylesheets
Avoid Vendor Lock-In
Hang On to Your Relational Database
Document Namespaces with RDDL
Preprocess XSLT on the Server Side
Serve XML+CSS to the Client
Pick the Correct MIME Media Type
Tidy Up Your HTML
Catalog Common Resources
Verify Documents with XML Digital Signatures
Hide Confidential Data with XML Encryption
Compress if Space Is a Problem
My experience with XML by Valar · 2003-11-24 07:35 · Score: 4, Insightful

It has been my experience with XML that it is like a lot of other things in development: the good developers understand it immediately and have native intuition towards best practices. The bad developers never really get it and spend their time reproducing tricks they saw in a cookbook. That's good and fine until you need something that doesn't quite fit into categories a, b or c. Another example of this is how high school and university data structure/algorithm classes never spend any time on the development of new data structures that exactly meet the problem specification. Instead they lay out half a dozen types of linear lists, a couple of trees, and some hashing functions and say, "Well, you can glue just about anything together from this." Perhaps this book takes what is, IMHO, the better approach-- laying out the tools and politely explaining what the implications of each are, rather than attempting to list out pages of cute examples of what each can do.

--

====
Crudely Drawn Games
5 years in the business... by pong · 2003-11-24 07:42 · Score: 5, Insightful

... and it is starting to dawn on me that trends like pervasive XMLization are going to haunt us forever. The combination of business-minded consultants that push a market to create demand for themselves and a huge number of clueless but enthusiastic developers that will jump on any new idea and push it where it doesn't want to go unsurprisingly leads to this kind of instability.

I hate XML with a passion. Let me present you with three examples

1) Programming languages based on XML.

Yes, it is true. Perverted minds, somewhere on this planet, actually seem to think that this is a neat idea! Since their initial conception, the pivotal point of programming languages has been to raise the level of programming. To move from the computer's domain to the human domain - to make it more intuitive and natural for a human being to program a computer. With these new XML-based languages we are moving a step backwards, because truly the only benefit of XML in this context is that it is easier for computers to parse, while it is certainly harder for humans.

2) XSLT

Have you tried it? I rest my case.

3) SOAP

Okay, initially this actually seemed like a good idea to me, but having thought about it, I really think it sucks. Okay, so it is easier to implement SOAP for a particular platform or programming language, but a wire protocol is like a compiler or an OS kernel in a certain sense - it is okay that it is very hard to write, as long as it is stable and high performance, because it is such a central component.
X is for the Xtensions, M is for the Metadata... by jefu · 2003-11-24 07:44 · Score: 4, Interesting

and L is for the Laughter it brings us.
I have not read this book, but it sounds interesting already.
XML is an interesting technology that has the potential for changing the way we use technology in all kinds of weird and wonderful ways. (And in a few ways that may not be so wonderful.) But using XML correctly is tough. I've written and discarded more DTDs and schemata than I care to admit because they were seriously flawed. Getting it right is important and very, very hard.
XML looks simple, and in some ways it is. But in so many other ways it is not simple at all - in large part because it gives us a tool to approach some very hard problems. And hard problems, often even when expressed in the simplest way around, tend to stay hard. (Calculus makes saying some things simple, for example, but understanding those things still takes work and insight.)
I will be taking a good look at this book in the near future to see what it has to say. And I'd urge those who dislike XML to do the same. And finally, even those who like XML need to think hard about how to use it well, so perhaps this would be a good read for them too.
What are you talking about? by mellon · 2003-11-24 07:47 · Score: 5, Insightful

XML is just text! If the XML parser is slow, write a faster one! Figure out where the bottlenecks are! Don't give me this XML is slow crap. This is slashdot - you're supposed to be a geek. If you don't like XML, fine, but come up with a geeky reason not to like it, not some problem whose solution is just to roll up your sleeves and do some hacking!

Oy! :')
1. Re:What are you talking about? by nat5an · 2003-11-24 07:54 · Score: 5, Insightful
  
  Okay, fine, XML isn't slow by nature. But it's a generalized solution. Not every set of data needs to be stored in a general tree, so putting every set into one will often create a lot of extra work. The benefit of XML is its portability, and the price is the performance hit you take from packing and unpacking all that data.
  
  --
  Head down, go to sleep to the rhythm of the war drums...
2. Re:What are you talking about? by Anonymous Coward · 2003-11-24 08:40 · Score: 5, Funny
  
  Have you ever tried storing a picture in it?
  
  <pixelrow>
    <pixel>
      <value channel="red" level="0.023"/>
      <value channel="blue" level="0.22"/>
      <value channel="green" level="0.5"/>
    </pixel>
    ...
  </pixelrow>
  ...
  :)
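
For contrast with the per-pixel markup above, the list of 50 items elsewhere in this discussion includes "Encode Binary Data Using Quoted Printable and/or Base64". A minimal Python sketch of the Base64 route follows; the <image> element, its encoding attribute, and the pixel values are all made up for the example.

```python
import base64
import xml.etree.ElementTree as ET

# Three hypothetical RGB pixels as raw bytes (values invented for the sketch).
pixels = bytes([6, 128, 56, 255, 0, 0, 0, 0, 255])

# Store the raw bytes as Base64 text inside a single element.
image = ET.Element("image", encoding="base64")
image.text = base64.b64encode(pixels).decode("ascii")
serialized = ET.tostring(image)

# Round trip: parse the XML and decode the payload back to the original bytes.
restored = base64.b64decode(ET.fromstring(serialized).text)
print(serialized, restored == pixels)
```

Base64 still costs roughly a third more bytes than the raw data, so this is a compromise rather than a free lunch.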
W3C by sielwolf · 2003-11-24 08:08 · Score: 4, Informative

Browse the Technical Reports, Recommendations and Proposed Recommendations at W3C as there are a lot of DTDs and Schemas there. I found a DTD for generic simulation representation there. There's quite a bit if you take the time to look.

--
What is music when you despise all sound?
XML is just tagged s-lists. by BrittPark · 2003-11-24 09:10 · Score: 4, Insightful

XML is highly overrated and generally over-used. Admittedly XML + CSS is better than html, but beyond that its only reasonable use is as a generalized syntax for configuration files, and as such does a good job, or at least I've had success using it that way in the past. Many (if not most) of its other uses are just poor program design. SOAP is an extremely silly idea. Why use XML for a marshalling syntax for RPC? It's slower, bulkier, and just a bad choice in comparison to a binary marshalling mechanism. Now as a syntax for an RPC's IDL, XML makes a lot of sense, but not as a transport.

Glad to get that off my chest. I have a bitter history with XML. I was the first person at my former company to bring XML in as a uniform configuration file format for our product, but then found myself a couple of years later forced into adding XML-specific features to the filesystem that was the core of our company's product. I spent a week thinking about the idea, and concluded that it was a bad one. Thus followed a long (and fruitless) battle with management to scratch the plan. The end result was a technically nifty but useless set of features. The work remains unreleased for lack of customer interest. At least I get a bit of "I told you so." pleasure.
Sure: by rodentia · 2003-11-24 09:22 · Score: 5, Informative

<?xml version="1.0" ?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN"
  "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg>
  <line x1="50" y1="50" x2="300" y2="300"
        style="stroke:#FF0000; stroke-width:4;stroke-opacity:0.3;"/>
  <line x1="50" y1="100" x2="300" y2="350"
        style="stroke:#FF0000; stroke-width:4;stroke-opacity:1;"/>
</svg>

--
illegitimii non ingravare
XML is very fast by Doug+Merritt · 2003-11-24 10:08 · Score: 4, Interesting

XML is heavy weight ... ...see a huge drop in performance. This is due to the fact that parsing XML blows and eats up copious amounts of CPU and memory.

That's because everyone uses slow XML parsers. Some years ago at one of the then-top 5 web portals I was unhappy with the standard SAX/DOM parser in use; it was ridiculously slow (and buggy).
So I wrote a new one. Parsing XML became one hundred fold faster! I timed it quite carefully.
Other people in this thread are saying "of course XML is slower than binary formats, it's 3 times bigger." But a factor of 3 in performance is nothing, considering some of the advantages.
A slowdown of 100, on the other hand, is absurd.
I don't know why people don't rebel against this and make faster XML parsers the widely-used ones; for whatever reason, apparently everyone continues using slow parsers.
At any rate, no, XML is not slow. It's just a simple, easy to parse format, for which IBM and others have written very, very slow parsers.
And everyone just assumes that it has to be slow. Sheesh, why should an XML parser be slower than a C++ compiler??? Come on.
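
Parser speed is something you can measure rather than assume. The sketch below times two of Python's standard-library parsers on a synthetic document; it is not the parser described above (which isn't available here), the document and its size are arbitrary, and the timings will vary by machine and implementation.

```python
import time
import xml.sax
import xml.etree.ElementTree as ET

# Synthetic document: 10,000 small records (size chosen arbitrarily).
data = ("<recs>" + "<rec><id>1</id></rec>" * 10_000 + "</recs>").encode("ascii")

class Counter(xml.sax.ContentHandler):
    """Counts start tags as they stream past, without building a tree."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1

# Event-based (SAX) parse.
start = time.perf_counter()
handler = Counter()
xml.sax.parseString(data, handler)
sax_seconds = time.perf_counter() - start

# Tree-building parse, then a walk over every element.
start = time.perf_counter()
tree_count = sum(1 for _ in ET.fromstring(data).iter())
tree_seconds = time.perf_counter() - start

print(f"SAX: {handler.count} elements in {sax_seconds:.4f}s")
print(f"ElementTree: {tree_count} elements in {tree_seconds:.4f}s")
```

Which of the two comes out ahead depends on the implementation behind each API, which is rather the point of the comment above.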

--
Professional Wild-Eyed Visionary
Several chapters are online by elharo · 2003-11-25 03:50 · Score: 4, Informative

Nice review. Thanks! It's interesting how many of the comments here relate directly to chapters in the book. For instance, there's a lot of concern about XML's perceived verboseness. This is addressed directly in Item 50, Compress if space is a problem. This chapter and ten others are online at http://www.cafeconleche.org/books/effectivexml/ . Check it out.