Effective XML

Posted by timothy on Monday November 24, 2003 @07:15AM from the specificity dept.

milaf writes "Who doesn't know about XML nowadays? Quite a few people, actually: there has been so much hype around it that some people think that XML is a programming language, a database, or both at the same time. On the other hand, if you are a developer, chances are that you feel that -- no matter its usefulness -- there is not much to XML. After all, it may take just a few hours to get the hang of creating and parsing an XML document. Maybe this is why most of the many and voluminous books discuss numerous XML-related technologies, but say less about the usage of XML itself." Read on for milaf's review of a book that takes the opposite tack.

Effective XML: 50 Specific Ways to Improve Your XML
author: Elliotte Rusty Harold
pages: 336
publisher: Addison-Wesley
rating: 10/10
reviewer: milaf
ISBN: 0321150406
summary: Very well written collection of topics on XML Best Practices

In Effective XML: 50 Specific Ways to Improve Your XML, Elliotte Rusty Harold takes a different approach: know your elements and tags -- they are not the same thing! -- and weigh your choices in a context, because any technology applied for the wrong reasons may fail to deliver on its promises.

Following Scott Meyers' groundbreaking Effective C++, the author invites us to re-evaluate seemingly trivial issues to discover that life is not as simple as it seems in the world of XML. In each of the 50 items (chapters), he gets into the inner workings of the language, its usage and related standards, thus giving us specific advice on how to use XML correctly and efficiently. The 300-page book is divided into four parts: Syntax, Structure, Semantics, and Implementation. Yet in the introduction, the author sets the tone by discussing such fundamental issues as "Element versus Tag," "Children versus Child Elements versus Content," "Text versus Character Data versus Markup," etc. On these first pages the author started earning my trust and admiration for his knowledge and ability to get right to the point in clear and simple language.

The first part, Syntax, contains items covering issues related to the microstructure of the language, and best practices in writing legible, maintainable, and extensible XML documents. (In it, over 19 pages are dedicated to the implications of the XML declaration!) That seems like a lot for one XML statement that most people cut-and-paste at the top of their XML documents without giving it much thought, doesn't it? Actually not, if you follow the author's reasoning and examples.

The second part, Structure, discusses issues that arise when creating data representation in XML, i.e. mapping real-world information into trees, elements, and attributes of an XML document; it also talks about tools and techniques for designing and documenting namespaces and schemas.

The third part, Semantics, explains the best ways to convert the structural information represented in XML documents into data that carries its semantics. It teaches us how to choose the appropriate API and tools for different types of processing to achieve the best effect. This part has a lot of good advice for creating solutions that are simple, effective, and robust.

The final part, Implementation, advises the reader on design and integration issues related to the utilization of XML; these issues include data integrity, verification, compression, authentication, caching, etc.

This book will be useful to a professional with any level of experience. It may be used as a tutorial and read from cover to cover, or one can enjoy reading selected items, depending on experience and taste. The book's very detailed index makes it an excellent reference on the subject as well. In the preface to the book, the author writes, "Learning the fundamentals of XML might take a programmer a week. Learning how to use XML effectively might take a lifetime." I'm not sure about the "lifetime" -- that's an awfully long time for using one technology -- but for the most confident of us this still may not be enough :) . Your mileage may vary, but I suspect that you could shave a few months off that time by browsing through this book once in a while. Most importantly, it will make you a better professional and make you proud of the results of your work. Wouldn't this be worth your while?

You can purchase Effective XML: 50 Specific Ways to Improve Your XML from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

20 of 312 comments

library by Pompatus · 2003-11-24 07:19 · Score: 5, Interesting

If you want to read any book for free, just ask your local library to order it and they will. Libraries guess at what books people want to read, so if anyone shows any interest in any book, they order it. They lose their federal funding if they don't spend the money they are allocated, so they are generally VERY willing to buy as much as possible.

--

----
Squirrel ... It's not just for breakfast anymore
Unix Tab-Separated ASCII Files vs. XML by billstewart · 2003-11-24 07:23 · Score: 4, Interesting

Sure, XML isn't inherently that deep - but neither are the tab-separated ASCII files which Unix tools used to do all kinds of really powerful things. Similarly, LISP property lists aren't that complex. XML's a bit more flexible, and carries enough decoration with it that people are willing to use it for building interfaces that they might not build using ASCII or XDR. And anything that lets the EDI people replace their stuff with simpler, more open technology is good too..

--

Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
1. Re:Unix Tab-Separated ASCII Files vs. XML by Anml4ixoye · 2003-11-24 07:35 · Score: 4, Interesting
  
  And anything that lets the EDI people replace their stuff with simpler, more open technology is good too..
  
  My current project for the last 8 months has been working on just that - parsing HIPAA EDI transactions. We do it by converting them to XML data structures. There is a decent white paper about it too.
  
  What I've found is that, for readability, XML is the way to go. For performance, EDI is definitely better. I have one EDI file that is 23k. When expanded to XML, it is close to 5000 lines long.
  
  I agree with an earlier post. If you are using a hardware XML accelerator, or using small XML documents (config, etc), or needing readability over performance, then it is great. But I have a hard time believing that it will replace tab-separated files any time soon (not that the parent poster was implying this).
  
  --
  Random Musings
where are the open source XML repositories by acomj · 2003-11-24 07:25 · Score: 4, Interesting

XML would work better if there were consistent DTDs for tagging information that everyone would use. There should be an open database of these DTDs.

I was looking for a simple one to tag photos with. Couldn't find it, made my own. Is there a repository of these DTDs out there?
1. Re:where are the open source XML repositories by mugnyte · 2003-11-24 07:46 · Score: 0, Interesting
  
  No. Because XML is supposed to be extensible. The closest thing may be an XSLT that taught everyone looking at your file what tags you had to display and their data types. But as for standardizing - everything that gets fixed becomes a limit to be broken some day.
  
  Other than that, you're defining a file format. Why bother? Many formats already exist that encapsulate some form of metadata. Keeping the metadata with the file is another whole subject anyway. Metadata itself should be connect-able to the target file, but not bound. There's a lot of thought about this already going around.
  
  mug
XML Limited in at least one regard. by AllergicToMilk · 2003-11-24 07:28 · Score: 4, Interesting

One of the things that I have found limiting about XML is that it is inherently hierarchical. Real "things" can be categorized many ways. Hierarchical classification systems (such as our modern file systems) work poorly to classify a broad scope of information. Hence some of the new development in the file system in Longhorn, and also some I've heard about, but can't remember, for Linux.

--
There are only 6,863,795,529 types of people in the world.
Re:The main issue with XML is performance by I8TheWorm · 2003-11-24 07:40 · Score: 2, Interesting

I share your opinion regarding XML, and have yet to find a great reason to use it, other than feeding data to our vendors' systems through their proprietary file layouts.

On that note though, I wonder if this author has some insight into better uses for XML than what I've typically seen (XML does everything!). I won't, however, be running out to buy it, as XML will always be just more bloat and a resource hog by nature.

--
Saying Android is a family of phones is akin to saying Linux is a family of PCs.
X is for the Xtensions, M is for the Metadata... by jefu · 2003-11-24 07:44 · Score: 4, Interesting

and L is for the Laughter it brings us.
I have not read this book, but it sounds interesting already.
XML is an interesting technology that has the potential for changing the way we use technology in all kinds of weird and wonderful ways. (And in a few ways that may not be so wonderful.) But using XML correctly is tough. I've written and discarded more DTDs and schemata than I care to admit because they were seriously flawed. Getting it right is important and very, very hard.
XML looks simple, and in some ways it is. But in so many other ways it is not simple at all - in large part because it gives us a tool to approach some very hard problems. And hard problems, often even when expressed in the simplest way around, tend to stay hard. (Calculus makes saying some things simple, for example, but understanding those things still takes work and insight.)
I will be taking a good look at this book in the near future to see what it has to say. And I'd urge those who dislike XML to do the same. And finally, even those who like XML need to think hard about how to use it well, so perhaps this would be a good read for them too.
Re:The main issue with XML is performance by Citizen+of+Earth · 2003-11-24 08:01 · Score: 3, Interesting

Others have said it before, but I'll say it again. XML is heavy weight and isn't free.

XML needs to be updated to allow binary encoding. The open-source high-performance parser/generator library at the link demonstrates the performance gain.
Sick of XML? Try YAML. by Chromodromic · 2003-11-24 08:20 · Score: 3, Interesting

Reading through the posts on this board, I tend to agree with the criticisms about XML. It's a big dreadnought of a specification when, in most cases, a nice light corsair or even single-seat fighter would do the trick. Still, I would normally be inclined to say of XML what is said about Democracy: it's the worst system out there, except for all the others.

Then I found YAML. Long and short, YAML is very lightweight, eminently readable, easy to use (parsers exist in multiple languages) and a pleasure for all kinds of projects that require data serialization. Where XML branches off into other types of uses, like XSL programming, YAML doesn't really compete. I find this to be a strength, actually, because once you've used YAML and seen it in action, XSL seems like a big, fat add-on. But for those that rely on XSL and other things, YAML won't do the trick.

But if all you need is data serialization in a compact, easy-to-read, easy-to-use package -- and this, in my opinion, is by far what XML is most used for -- then YAML is great. Give it a shot.

As for XML: I used to hate it with a passion. Now I still hate it, but I'm less passionate. The creators of XML are ambitious people, and they tried to do something in that spirit. It works, basically, and XML doesn't deserve *all* the bad press it gets.

--
Chr0m0Dr0m!C
1. Re:Sick of XML? Try YAML. by oren · 2003-11-24 18:34 · Score: 2, Interesting
  
  XML and YAML have different "sweet spot" domains, though you can apply both technologies outside their intended domain.
  
  XML is great for "documents" - text documents, that is. XML does an admirable job separating "content" from "markup" which can be used to drive "presentation". It really is a big improvement over SGML. Things like DocBook, and CSS stylesheets, make XML the choice for writing documents.
  
  YAML is great for "data" - data structures, that is. YAML directly maps to common application data structures, so the result is more readable for both humans and computer programs. It is still very new, but is gaining acceptance, and IMVHO is the way to serialize data.
  
  Sure you can use XML for data (lots of people do) and YAML for documents (the YAML spec is written as a YAML document, just as a test of how far this can be pushed). But in both cases you are using the technologies outside their intended domain and suffer the consequences. It is all about using the right tool for the job.
  
  XML was never designed for data, it is an "Extensible Mark Up Language" for crying out loud. Promoting it as the end-all be-all solution for serializing data is strange - it is like promoting the use of the C++ programming language for writing scripts (it is all "programs", right?).
  
  In contrast, YAML Ain't Markup Language - it was designed specifically for data, and is very good at what it does. Just as the world has mostly come to accept that "system languages" and "scripting languages" are different animals, it will discover that "document formats" and "data formats" are different animals - hence the need for both XML and YAML.
  
  (I'm one of the YAML spec authors, so the above reflects about 33% of the "official YAML position" :-)
Re:What are you talking about? by Anml4ixoye · 2003-11-24 08:37 · Score: 3, Interesting

You bring up some really good points. The reason that you hear a lot of "XML is slow" is because of the usage of XPath. To use XPath expressions, most implementations parse the entire XML document into memory.

I suppose you *could* write a custom parser. If your structure is well-defined, and not subject to a lot of changes, you could significantly increase performance that way. The other option is to parse the document once, get out what you need to get out into smaller chunks, dump the larger document, and only work off the smaller chunks.
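The "parse once, keep only the smaller chunks" option can be sketched with a streaming parser. This is a minimal illustration in Python's standard library, not a benchmark; the `<record>` and `<name>` element names and the `extract_names` helper are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

def extract_names(xml_source):
    """Stream through a document and keep only the small pieces needed.

    `xml_source` is a file path or file-like object. The <record>/<name>
    element names are hypothetical.
    """
    names = []
    # iterparse hands back each element as its end tag is read, so the
    # caller never needs the complete tree before starting work.
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "record":
            names.append(elem.findtext("name"))
            elem.clear()  # discard this record's children and text
    return names

SAMPLE = (b"<records>"
          b"<record><name>Ann</name></record>"
          b"<record><name>Bob</name></record>"
          b"</records>")
print(extract_names(io.BytesIO(SAMPLE)))
```

Only the extracted strings survive the loop; how much memory this actually saves depends on the parser and the document.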

Looks like TMTOWTDI is not just for Perl

--
Random Musings
Re:XML... by Zo0ok · 2003-11-24 08:57 · Score: 2, Interesting

I saw a Microsoft demo that was supposed to show how powerful and useful it could be to insert XML-tags into Word documents. The idea was to fill the Word document with useful information (just fill in the users name here, and all information about the user is automatically inserted, now how good isn't that?). MS calls this Smart Document.

So, I took a look in the XML-file that was connected to the Word document to make it smart. I wasn't very impressed (but fairly amused) when I saw that the XML-file was like 30 lines of blah blah, and in the middle of it I found a reference to a .dll-file.

If I need to write a .dll-file that conforms to Word interfaces (that MS of course will have debugged and patched in about two years, and then they'll obsolete it by releasing a new version), then writing something that GENERATES a Word document and gives it to the user makes much more sense to me...

And in any case, XML has nothing to do with it... they could of course have created "tag" functionality in Word without using XML.
Re:5 years in the business... by LetterJ · 2003-11-24 09:13 · Score: 1, Interesting

"2) XSLT
Have you tried it? I rest my case."

Yes and I wouldn't rest my case on that statement if I were you.

I've been working with XSLT professionally (for big clients including 3M) for 3 years, building the top tier in 3 tier architectures and have no problems working with it. It makes perfect sense for what it is: a solution for turning XML into something else, whether another XML document, another XSLT stylesheet (which I'll admit can be a brainbending exercise), HTML or plain formatted ASCII. In places where multiple presentations will exist for a given dataset or the presentation will change due to constantly redefined presentation requirements (ahem marketing ahem), XSLT gives you the flexibility to just keep building the same XML documents in your app and make them look like they're supposed to with different XSLT.

<shamelessplug>
Incidentally, I'm looking for a web development contract in St. Paul/Minneapolis if anyone's looking for an XSLT expert (or PHP or any of my other areas of expertise) who actually knows how to solve real problems. Email me for more info.
</shamelessplug>

--

The Glass is Too Big: My Take on Things
Re:One way to improve it. Don't use it. by JohnnyCannuk · 2003-11-24 09:28 · Score: 2, Interesting

Well, duh, if you are using XML for non-hierarchical data, then you're using it wrong.

On the other hand if it looked more like this:

<Records>
<RECORD id=".." NAME=".." ADDRESS=".." AGE=".."/>
<RECORD id=".." NAME=".." ADDRESS=".." AGE=".."/>
<RECORD id=".." NAME=".." ADDRESS=".." AGE=".."/>
<RECORD id=".." NAME=".." ADDRESS=".." AGE=".."/>
</Records>

and if the tag was nested in something else, then xml is appropriate.

At the risk of sounding trite "right tool for the job".

I am currently working on an EDI application where the highly structured and hierarchical nature of our data makes it perfect for xml. Add in good tools and searching capabilities (Like XSLT for transforming the raw structure to something else or XPath for searching it) and you have a very powerful data exchange that is platform and language neutral.
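As a rough illustration of the XPath searching mentioned above, here is a sketch using Python's standard library, which supports only a limited XPath subset. The attribute values are invented for the example; only the Records/RECORD layout comes from the snippet earlier in this comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical data in the Records/RECORD layout shown earlier.
DOC = """
<Records>
  <RECORD id="1" NAME="Ann" ADDRESS="1 Main St" AGE="34"/>
  <RECORD id="2" NAME="Bob" ADDRESS="2 Oak Ave" AGE="51"/>
  <RECORD id="3" NAME="Cy" ADDRESS="3 Elm Rd" AGE="34"/>
</Records>
"""

root = ET.fromstring(DOC)
# The [@attr='value'] predicate selects elements by attribute value.
matches = root.findall("./RECORD[@AGE='34']")
names = [rec.get("NAME") for rec in matches]
print(names)
```

The query names the structure it is looking for, which is the "self-describing" payoff; a tab-separated file would need the column order agreed on out of band.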

But just as you wouldn't use VB to program kernel modules or device drivers, you wouldn't (and shouldn't) use XML for everything, just because it's cool and new.

I am always amazed by the XML luddites on /. The same folks who insist that obscure languages like Haskell, Dylan or Eiffel are "better than ${your language here} why doesn't anybody else use it" will still insist on transmitting and storing data in language and platform dependent binary files or in non-self describing data structures such as:

..\t..\t..

..\t..\t..

..\t..\t..

As for it not being efficient, well that really depends on what you mean by efficient. If you mean that it is slow to read, then you have chosen the wrong parser. Not a fault of the markup itself. Perhaps the design of your document is inefficient. But if you want a way to efficiently exchange self-describing data between applications written on different platforms in different languages, then use XML.

Or come up with something better.

--
Never by hatred has hatred been appeased, only by kindness - the Buddha
piffle by rodentia · 2003-11-24 09:47 · Score: 2, Interesting

Bandwidth is an order of magnitude more limiting than tree parsing, egg. That and the facilities the tool vendors decorate their stuff with. Of course it's not free, what is?

SQLXML and most other value-adds are bull. Your business objects should optimize the hell out of their DB access and return XML. XML is messaging and presentation tier glue. Read the book.

--
illegitimii non ingravare
XML is very fast by Doug+Merritt · 2003-11-24 10:08 · Score: 4, Interesting

XML is heavy weight ... ...see a huge drop in performance. This is due to the fact that parsing XML blows and eats up copious amounts of CPU and memory.

That's because everyone uses slow XML parsers. Some years ago at one of the then-top 5 web portals I was unhappy with the standard SAX/DOM parser in use; it was ridiculously slow (and buggy).
So I wrote a new one. Parsing XML became one hundred fold faster! I timed it quite carefully.
Other people in this thread are saying "of course XML is slower than binary formats, it's 3 times bigger." But a factor of 3 in performance is nothing, considering some of the advantages.
A slowdown of 100, on the other hand, is absurd.
I don't know why people don't rebel against this and make faster XML parsers the widely-used ones; for whatever reason, apparently everyone continues using slow parsers.
At any rate, no, XML is not slow. It's just a simple, easy to parse format, for which IBM and others have written very, very slow parsers.
And everyone just assumes that it has to be slow. Sheesh, why should an XML parser be slower than a C++ compiler??? Come on.

--
Professional Wild-Eyed Visionary
Re:What are you talking about? by mellon · 2003-11-24 10:11 · Score: 2, Interesting

This is by no means assured. When you store data in a binary format, you generally have to have code to deal with byte-swapping and other format conversions. Also, generally speaking, the limitation on character parsing is memory bandwidth - if you are using a modern CPU, it is going to spend most of its time waiting for bits to come out of memory, and it doesn't care whether they're an ASCII (or utf8) byte stream or binary words.
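The byte-swapping point can be made concrete with a small sketch. The value 1025 is arbitrary and the variable names are only for illustration.

```python
import struct

value = 1025  # 0x00000401

big = struct.pack(">I", value)     # 4-byte unsigned int, big-endian
little = struct.pack("<I", value)  # same value, little-endian

# A reader that assumes the wrong byte order gets a different number
# and no error to warn it.
misread = struct.unpack("<I", big)[0]

# The text form is the same bytes on every platform.
text = str(value).encode("ascii")
roundtrip = int(text.decode("ascii"))

print(big, little, misread, text, roundtrip)
```

A binary format therefore has to fix a byte order and convert on some platforms, while the text form needs parsing but no swapping.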

Also, a lot of stuff that goes around in packets is free-form text anyway, not binary data. So in the case where you're just passing numbers around, yes, XML is going to be a bit slower simply because there are more bits to pull out of the buffer. But in the case of plain text, the difference is probably not going to be very significant. In cases where it is significant, you probably don't want to use XML.

You are right that XML is not a panacea - I wouldn't use it for every application. I think a lot of the anti-xml rhetoric we hear is because so many people do use it for the wrong applications, and then other people see what they've done and start retching.

A couple more points - XML::Twig allows you to parse XML in Perl without sucking the whole file in at once. Also, the article to which I was replying was talking about SQLXML, which I presume is already plain text. It's tough to imagine that XML is really going to make that significantly slower - if it is, it's probably because of a poor implementation, not increased data size.
xml or perl by Anonymous Coward · 2003-11-24 10:27 · Score: 1, Interesting

Recently I was developing a pseudo file system and was using xml to store the metadata (ie date, name, link references, permissions, etc.). The chief advantage of using xml was that the data files were text and could be readily edited and read. However, they needed to be accessed often and performance was a dog. My boss saw what I was doing and recommended I use perl syntax to represent the hierarchical data and use Data::Dumper and Safe::rdo. I did and performance improved several times while still retaining the advantages of text. For example (using a nominal order record), instead of

<order>
  <customer>
    <name>
      <fname>Bill</fname>
      <lname>Brune</lname>
      ...
    </name>
  </customer>
  <manifest>
    <item>
      <id>209</id>
      <title>Grapes of Wrath</title>
      <qnt>1</qnt>
      <unit_price>$10.75</unit_price>
    </item>
    ...

it would look something like (compacted to avoid the lameness filter):

order => {
  customer => { fname=>'Bill', lname=>'Brune', ... },
  manifest => [
    { id=>1, title=>'Grapes wrath', qnt=>1, unit_price=>'$10.75' },
    { ... }

The added advantage is that you can also add code, such as

{ 'timestamp' => scalar localtime, 'pid' => getppid, ... }
What are you talking about?-If I'm Lion I'm dying. by Anonymous Coward · 2003-11-24 10:31 · Score: 1, Interesting

<svg width="160" height="160" stroke="none"> <polygon fill="#f2cc99" points="28,7 33,3 40,1 47,2 54,5 60,8 62,5 66,4 71,5 73,11 72,20 66,36 62,43 62,46 60,48 56,51 56,54 62,82 63,100 50,137 53,143 51,150 33,150 30,147 27,140 24,140 21,148 2,148 1,144 2,142 5,137 6,128 2,103 2,98 3,87 4,72 10,51 17,37 13,31 12,28 10,27 6,20 7,14 7,9 12,5 16,3 21,3 25,5 28,7 28,7 28,7"/> <polygon fill="#e5b27f" points="57,32 54,30 55,33 53,31 53,34 51,31 51,34 50,32 50,35 48,33 48,36 50,40 50,38 51,40 51,38 52,39 53,37 54,39 54,37 55,39 56,38 56,39 57,38 58,34 57,32 57,32 57,32"/> <polygon fill="#eb8080" points="51,40 53,40 55,40 58,40 57,42 54,44 51,40 51,40 51,40"/> <polygon fill="#f2cc99" points="71,92 63,99 56,118 50,140 55,142 63,143 73,137 85,133 94,115 94,104 91,101 85,100 75,100 71,92 71,92 71,92"/> <polygon fill="#9c826b" points="22,92 19,96 19,100 23,112 25,130 28,135 32,126 30,128 32,124 33,120 30,123 32,119 29,121 30,118 28,119 30,117 28,117 30,114 31,111 28,111 30,110 27,109 28,107 26,107 27,104 24,106 25,104 26,101 23,103 24,100 22,102 22,99 24,95 22,96 23,94 22,94 22,92 22,92 22,92"/> <polygon fill="#9c826b" points="30,145 32,147 32,147 34,145 36,145 37,148 38,149 40,149 43,144 44,148 45,149 46,148 48,143 49,145 49,148 50,148 52,147 53,143 54,144 52,150 51,151 38,151 34,150 30,148 30,145 30,145 30,145"/> <polygon fill="#9c826b" points="85,100 88,100 91,103 94,108 94,115 90,122 82,133 71,137 68,141 63,143 66,141 67,138 67,136 66,133 62,131 62,129 64,128 66,126 68,126 67,125 68,125 67,123 69,124 68,122 71,122 70,123 71,124 70,124 70,126 68,126 70,128 67,128 67,129 70,131 72,133 73,130 74,133 76,129 76,131 78,128 78,130 80,126 80,128 82,125 82,126 83,124 84,122 88,119 90,115 92,112 91,106 90,104 87,101 85,100 85,100 85,100"/> <polygon fill="#9c826b" points="60,82 60,95 60,101 56,107 51,113 48,120 52,120 50,125 47,130 46,135 48,138 53,141 53,136 55,133 58,132 62,131 61,128 61,116 63,108 68,104 71,111 77,100 70,86 60,82 60,82 60,82"/> <polygon fill="#9c826b" points="31,51 
36,57 38,62 43,66 50,67 56,70 60,82 61,76 56,56 48,59 40,54 31,51 31,51 31,51"/> <polygon fill="#9c826b" points="8,23 14,25 15,27 13,28 17,30 16,32 19,32 22,33 18,38 14,32 13,29 10,26 8,23 8,23 8,23"/> <polygon fill="#9c826b" points="28,14 27,14 26,11 24,10 22,7 19,7 16,9 12,10 11,12 12,16 15,18 12,18 14,22 16,24 16,28 20,28 22,28 22,23 27,21 30,17 30,16 27,18 28,14 28,14 28,14"/> <polygon fill="#9c826b" points="56,30 56,33 57,36 58,42 59,42 62,42 62,34 63,31 62,29 60,31 58,31 56,30 56,30 56,30"/> <polygon fill="#9c826b" points="42,18 41,21 43,23 44,25 45,22 42,18 42,18 42,18"/> <polygon fill="#9c826b" points="56,19 56,22 58,23 56,25 55,26 54,24 55,21 56,19 56,19 56,19"/> <polygon fill="#9c826b" points="39,54 42,52 42,54 43,53 43,54 45,54 45,55 46,54 46,56 48,56 50,56 51,56 53,55 56,53 56,56 50,58 42,58 39,54 39,54 39,54"/> <polygon fill="#9c826b" points="39,46 41,48 41,46 44,47 46,47 49,46 51,43 54,44 57,43 56,46 58,47 60,48 58,50 56,50 51,48 45,50 40,50 39,46 39,46 39,46"/> <polygon fill="#9c826b" points="59,13 61,14 63,14 61,12 64,12 62,11 64,11 64,10 65,10 65,8 66,9 68,9 67,7 69,8 70,7 70,9 70,9 71,11 71,13 70,15 70,16 70,18 68,20 67,21 66,23 64,27 62,28 62,24 60,20 58,17 58,14 59,13 59,13 59,13"/> <polygon fill="#9c826b" points="34,29 36,30 37,30 40,30 42,30 41,32 38,32 35,30 34,29 34,29 34,29"/> <polygon fill="#9c826b" points="34,86 32,88 30,93 33,90 31,96 33,94 31,98 32,97 32,102 34,100 34,107 35,102 36,108 36,103 38,108 37,102 38,100 37,101 37,97 36,101 36,96 34,100 35,94 33,98 35,92 33,92 36,88 34,88 34,86 34,86 34,86"/> <polygon fill="#ffcc7f" points="37,27 38,29 40,29 42,29 43,26 42,25 40,25 37,27 37,27 37,27"/> <polygon fill="#ffcc7f" points="58,26 57,27 57,29 58,30 60,29 62,26 60,25 58,26 58,26 58,26"/>