Tim Bray On The Origin Of XML
gManZboy writes "Queue just posted an interview with XML co-inventor Tim Bray (currently at Sun Microsystems). Interestingly enough the interviewer is none other than database pioneer Jim Gray (currently at Microsoft). Among other things, in their discussion Tim reveals where the idea for XML actually came from: Tim's work on the OED at Waterloo."
We all know Microsoft invented XML. How else could they have filed a patent for it? :)
** "It's not my job to stand between the people talking to me, and the ones listening to me." -- Pego the Jerk
I think it's very funny that XML looks like it is based on SGML.
But according to the interview, it seems that the similarities are merely coincidental.
How's that old saying go?
Those that do not understand Lisp are doomed to reinvent it, badly.
Why can't someone reinvent C so that it sucks less?
From the "Jim Gray" link:
Jim Gray is a "Distinguished Engineer" in Microsoft's Scaleable Servers Research Group and manager of Microsoft's Bay Area Research Center (BARC).
OK, Xerox has their famous Palo Alto Research Center (PARC), so Microsoft just has to have its own similarly named center in the same general vicinity. Sheesh!
---------------------------------------------
SERENITY NOW!!!!!!!!!!!!!!!!
But seriously, XML is good technology, but how can Microsoft patent something they didn't even invent..... Oh, sorry, they filed in the US. Got it!
Thanks Tim, the world owes you one!
But okay you're right, you gotta use those CPU cycles for something...
--Don't give the world what it asks for, but what it needs.
I was damned by [GNU Project founder] Richard Stallman in egregiously profane language for working on it.
Why do I not find this hard to believe...
"database pioneer ... (currently at Microsoft)"
translated for slashdot readers:
"sellout"
TB And we missed. XML is a lot more complex than it really needs to be. It's just unkludgy enough to make it over the goal line. The burning issues? People were already starting to talk about using the Web for various kinds of machine-to-machine transactions and for doing a lot of automated processing of the things that were going through the pipes.
Amazingly, for such a popular method of 'communication' between and within applications, XML is admitted by most to be rather flawed and bulky...
Tim Bray and Jim Gray. Which one's the ward and which one's Batman?
Gray interviews Bray, should have done it in May. Over by the bay.
Is that my karma burning? Oh what the hay.
That's hogwash. Everyone knows that the idea for XML came from the tablets of stone that Moses brought down from Mount Sinai. In these tablets were the beginnings of self-describing data. That alone was where the commandments of the W3C were originally sent out to the world.
But only in the last decade have scholars used transformation style sheets and super-computers to find more declarative complex types, hidden in the original Hebrew CDATA. It is thought there are tens if not hundreds of specifications in these texts that may never have a finalized draft.
Progress has been slow. While the discovery of SOAP in the 1800s has made the hygiene of data possible, there is much that has yet to be standardized. Considering the aging DTD schemas left from the era of King James, it will be crucial to the data-exchange of humanity to uncover more secrets of XML.
Isn't that the guy who was hounding Pete Rose at the All Star game?
I work with XML every day. And every day I wonder the same thing: why the hell does the end tag name have to be repeated? Why can't it just be optional? In other words, why can't it just be abbreviated as: <tagname>data</> ?
Oh MAN I wish they could have done just that one little thing for us. It would cut our datagram size down by at least 30%, maybe more.
I wonder how I'm supposed to write real comments including code examples here. Slashdot sure seems stupid sometimes.
Now this is what I call understatement.
Have you ever seen these guys in the same room at the same time? No? I thought as much.
"Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
You know, the people who invented XML were a bunch of publishing technology geeks, and we really thought we were doing the smart document format for the future. Little did we know that it was going to be used for syndicated news feeds and purchase orders.
The most amazing thing is that back then in 1995-1996 at Open Text we were already using SGML as a data exchange protocol. All of us there (including Tim) ought to have known that XML would also have a life as a computer-to-computer communication protocol. Problem was that at the time so much of the SGML discourse was wrapped around the content versus format debate that we missed the obvious: the main use of XML was not as a replacement for HTML as a text format for the web, but as a kind of uber ASCII to allow the ready exchange of data between dissimilar applications (just like ASCII in its time had eased the transfer of data between dissimilar hardware and/or software platforms).
TB: I spent two years sitting on the Web consortium's technical architecture group, on the phone every week and face-to-face several times a year with Tim Berners-Lee. To this day, I remain fairly unconvinced of the core Semantic Web proposition.
Everyone who has actually done work on knowledge representation in the real world knows that this is a huge, difficult problem, unlikely to be solved anytime soon, as Tim Bray claims.
The only people who claim otherwise are either frauds or ignorant. The Semantic Web initiative has both: Tim Berners-Lee is very smart, but not a computer scientist, so he's not aware of the size of the challenge, plus he's a genuinely nice person, so he tends to trust others too much.
He has surrounded himself with the snake oil AI salesmen from the early 1980s who had promised us impending ubiquitous intelligent computers. Those fraudsters got found out back then, and spent the next fifteen years in academic limbo, only to be rescued by Tim Berners-Lee's naivete.
I hadn't thought about that. Very insightful.
There has got to be a reason though. Maybe that validation wouldn't be as good or something like that?
That's the only thing I can think of. With the </> notation you can tell that something is wrong, but not necessarily where.
The Internet is full. Go Away!!!
Perhaps because it becomes ambiguous when you start nesting and overlapping tags?
why the hell does the end tag name have to be repeated?
Because that is the single biggest source of headaches in parsing SGML, the precursor of XML, in which such a construct is allowed.
It also makes error recovery very difficult, something that we know is quite important from all that malformed HTML code out there. The XML creators knew that too.
It's not ambiguous for nesting; it would just close the closest opening tag.
The only reason I can think of is readability. When I have a long "if" statement in C or Perl, for example, I'll comment the closing curly brace with the statement's conditional.
something that we know is quite important from all that malformed HTML code out there.
The problem here is that browsers display malformed HTML at all. If they didn't, missing end tags would be detected with the most rudimentary tests and there'd be little malformed HTML around.
Anyway, if I were to design XML, I'd go with a construct like <tagname>{...}. This would've made it just as readable and much easier to manage in a text editor.
Theirs is, in reality, a proprietary format, but to stay buzzword-compliant they use XML, which hurts performance -- sometimes dearly...
For example, to pass a couple of thousand floating-point numbers from the front end to a computation engine, each is converted to a text string with something like <Parameter> around it. The giant strings (memory is cheap, right?) are kept in memory until the whole collection is ready to be sent out... The engine then parses the arriving XML and fills out the array of doubles for processing.
It really is disgusting, especially since freely available alternatives exist... For instance, PVM solved the problem of efficiently passing datasets between computers a decade ago, but nooo, we only studied XML in college -- and it is, like, really cool, dude...
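To put rough numbers on the complaint above, here is a minimal sketch comparing the two encodings for 2,000 doubles. The <Parameter> element name comes from the comment; the length-prefixed binary layout and the function names are assumptions made up for illustration, not PVM's wire format or anything the poster's system actually used.

```python
import struct

def to_xml(values):
    # One <Parameter> element per number, as described in the comment.
    return "".join(f"<Parameter>{v!r}</Parameter>" for v in values).encode("ascii")

def to_binary(values):
    # Hypothetical layout: 4-byte count, then raw little-endian IEEE 754 doubles.
    return struct.pack(f"<I{len(values)}d", len(values), *values)

def from_binary(payload):
    # Read the count, then unpack that many doubles starting at offset 4.
    (count,) = struct.unpack_from("<I", payload, 0)
    return list(struct.unpack_from(f"<{count}d", payload, 4))

values = [i / 7.0 for i in range(2000)]
xml_bytes = to_xml(values)
bin_bytes = to_binary(values)
print(f"XML: {len(xml_bytes)} bytes, binary: {len(bin_bytes)} bytes")
```

The binary form is a fixed 8 bytes per number and needs no text parsing on the receiving side, which is the cost the comment is objecting to.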
In Soviet Washington the swamp drains you.
I think XML should have looked more like this:
XML sucks ... too verbose for humans and too ambiguous for machines.
One day we'll look back and laugh!
I work with XML every day. And every day I wonder the same thing: why the hell does the end tag name have to be repeated? Why can't it just be optional? In other words, why can't it just be abbreviated as: <tagname>data</> ?
Same thing for me, although I'd rather have C-like blocks e.g. {data} so it's easier to jump from one side to the other (as any good editor will allow you to do). And quoting could be made easier, too (Come on, &lt;? What were they thinking?!). The only advantage of not using \ as everyone else does is that if you actually have to store a string that's quoted for something else you don't have to write e.g. \\\\\\\\ (what you need for instance to grep for a \\ from a shell).
The second open source project was fortune(s) which quickly developed a -o option! Fortunes most likely started with very bored OED programmers.
If more than 60% of your datagram size is element names, your element names are too long. Or you're using nested elements when you should be using attributes.
Because, as I've pointed out for the one-billionth time, XML is not S-Expressions. Now why don't you all give it a rest.
Yeah, that'd work great if you knew 100% of the time that you'd never get bad data. If you've got a multi-nested element hierarchy, however, and you lose one or two of your </>, how do you know where to put them back in? It's very easy to look for an opening <element> tag followed by a closing tag of the same name, especially when building a parser that error-checks.
You know what would cut down the datagram size more? Smaller tag names. Tag names don't have to be readable so much as uniquely identifiable; you can use an interface layer in the editor to make the tag names user friendly and then de-friendify them for transit. Then you've got:
<a>
<b>woo</b>
</a>
instead of:
<element>
<subelement>woo</subelement>
</element>
According to wc, switching to single-character element names instead of the multicharacter ones would give a 41% reduction in bulk, for the example above.
Reinvent the wheel only at either a lower cost, greater effectiveness, or your own personal enrichment and satisfaction.
Lots of people have thought about it. Not Very Insightful.
The reason is that if the parser encounters unbalanced end-tags, and they're all just </>, the parser will go farther and get very confused before it dies.
It will be very difficult to pinpoint *which* tag isn't closed, like C's optional {} after an if(), or SGML's optional closing tags.
It's much easier to correct if your parser can say "You forgot to close <account> on line 115" rather than "Something or other is unbalanced somewhere before line 224."
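The diagnostic difference can be sketched with a small stack-based checker. This is an illustration only, not a conforming XML parser: it ignores comments, CDATA, and processing instructions, and the function name and message wording are made up for the example.

```python
import re

# Matches <name ...>, </name>, and <name .../>; a rough approximation of XML tags.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")

def find_unclosed(text):
    """Return a diagnostic for the first mismatch, or None if tags balance."""
    stack = []  # (tag name, line where it was opened)
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TAG.finditer(line):
            closing, name, self_closing = match.groups()
            if self_closing:
                continue
            if not closing:
                stack.append((name, lineno))
            elif stack and stack[-1][0] == name:
                stack.pop()
            elif stack:
                open_name, open_line = stack[-1]
                return (f"line {lineno}: found </{name}> but <{open_name}> "
                        f"opened on line {open_line} is still open")
            else:
                return f"line {lineno}: </{name}> has no matching open tag"
    if stack:
        open_name, open_line = stack[-1]
        return f"end of input: <{open_name}> opened on line {open_line} was never closed"
    return None

doc = """<bank>
  <account>
    <owner>Ann</owner>
  <account>
    <owner>Bob</owner>
  </account>
</bank>"""
print(find_unclosed(doc))
```

Because the close tag carries a name, the mismatch is caught at the first close tag that disagrees with the top of the stack. With anonymous </> every close tag would match whatever is on top, so the error could only surface at the end of the input.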
Or just maybe something like
Q: How does an XML newbie go about learning what it is, including XSLT and DTDs, and how to structure XML, XSLT, and DTDs so that it does not break in 5 years and is not ungodly complex?
My initial impression is that XML is essentially as good as the VSAM/ISAM/Network Database Model and for similar reasons may drop out of use after 10 years.
&op;exampletag&cp;
I work with XML every day.
I'm sorry..
I've heard this quote in relation to XML before, and I don't get it. LISP is a programming language. XML is a method for storing data. About the only relation between the two that I can find is that both use nesting. So, why does this get brought up whenever XML is being discussed?
XML... but what is it good for?
Because it would make spotting your bug harder. Did you _mean_ to close that tag, or did you think you were closing a different tag? If all closing tags look the same it would make tracing certain bugs harder.
Which causes the same problem described in the grandparent.
when i work with XML in java, i generally just pass the XML through a GZIP stream. need to see the file contents? zcat. XML compresses well since it's repetitive text. Lately I've been doing a lot of XUL code with PHP/smarty as the back-end, and again, I transparently gzip this...
So, this solves the problem of the size of the XML to be stored on disk or transmitted over network... The only difference is parsing. Again, when i'm in java, i use PICCOLO to parse the XML -- it uses a lexical analyzer (jflex?) to parse XML more like a compiler parses code, by tokenizing it. turns out, this is really fast.
Disk space is cheap. CPUs are fast. Mainstream XML parsing technology can always be made faster. Why must we abandon our beloved, human-readable, standardized format for files and protocols alike in favor of binary files?
<ele1> <ele2> <ele3> </> </> <ele4> <ele5> </> </>
Which element did I forget to close?
<ele1> <ele2> <ele3> </ele3> </ele1> <ele4> <ele5> </ele5> </ele4>
Clearer now?
You know what would cut down the datagram size more? Smaller tag names.
Not really, because any protocols that exchange large amounts of XML data should be compressing the data anyway, right?
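A quick way to check this claim is to gzip the same data with long and short element names and compare. The sketch below uses the element names from the earlier example; the row count and the build() helper are assumptions for illustration, and real results will depend on the actual documents.

```python
import gzip

def build(outer, inner, rows=1000):
    # A synthetic document: one wrapper element holding many small children.
    body = "".join(f"<{inner}>{i}</{inner}>" for i in range(rows))
    return f"<{outer}>{body}</{outer}>".encode("utf-8")

long_doc = build("element", "subelement")
short_doc = build("a", "b")
long_gz = gzip.compress(long_doc)
short_gz = gzip.compress(short_doc)

print(f"raw:     long={len(long_doc)}  short={len(short_doc)}")
print(f"gzipped: long={len(long_gz)}  short={len(short_gz)}")
```

Repeated tag names are exactly the kind of redundancy a dictionary-based compressor removes, so the gap between the two versions shrinks considerably after compression, though the CPU cost of compressing still has to be paid somewhere.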
also known as: BARF. The name was changed, no
doubt, in order to instill a greater sense among
MSFT employees there that they actually might
(someday) have a workable product. Hence, BARC.
XML is more complicated than it should be, but
it is NOT a MSFT "invention", and has no business
being patented by MSFT. Let alone, encumbered
with their viral and restrictive and expensive
licensing scheme. What it IS is yet another
example of the slimy "embrace/extend/extinguish"
monopolistic business practices of MSFT. If the
DoJ weren't more like a 90 year old grandmother
that misplaced her full dentures (aka the Dubya
regime), they would have MSFT back into court to
exact "new & improved" punishment on the 800 lb.
gorilla.
The element you forgot to close is the one whose content model doesn't allow the content that follows. (If you don't have a DTD or schema, you might as well get used to handling garbage, because you're going to see a lot of it.)
Do the words "Slashdot effect" ring a bell? Server farms are not cheap, and when you're supporting a truly large number of clients, gzip is no longer your friend. Pissing away cycles per octet per request per user just to ignore optional whitespace and comments(!) in my RPC is just stupid.
I don't think you got the joke.
See here for enlightenment.
Locoscript (on the Amstrad PCW Word Processor) was really big amongst British Academia during the 1980s. It would be interesting to know if their [+bold]tagged format[-bold] had any influence on the OED..
If we're going to categorise the web then a fuzzy definition set with multiple overlapping definitions is going to be necessary. I suspect that del.icio.us is going to be the first step in this direction - link it into google and you've got a good stab at understanding what concepts web pages are actually connected to.
Sure, the difference may or may not be smaller on larger files. But it exists.
Which is not _actually_ a problem.
-Lasse
"The essence of XML is this: the problem it solves is not hard, and it does not solve the problem well." -- Phil Wadler
"When in doubt, use brute force." Ken Thompson
Interesting suggestion. This technique might only work for XML documents above a certain level of size, number of tag types, and possibly even parsing complexity. The applicability might depend upon whether your utilization of XML serialized data is batch-oriented or transactional in nature. This technique would yield lots of benefits for scenarios where someone is dumping a few, very large XML documents across the wire, but perhaps not so much for scenarios where lots of small, quick XML documents are being exchanged back and forth. CPU saturation (and eventually memory I/O saturation) for example, might become a concern in certain scenarios.
In any case, it seems one name for this technique is XML "compaction". I searched around Sourceforge and found quite a few projects trying to tackle the general problem domain of efficient XML transmission. The compaction terminology was used and explicitly described by the Xqueeze project. There are other projects that either directly apply themselves against the XML compression problem or are tangentially resolving the problem by completely changing the representation format (no transcoding): xmltk, XMLPPM, XBIS XML, WAP Binary XML (WBXML). I will probably look at Xqueeze and XMLPPM for my own programming work that requires handling XML formatted data in a more batch-oriented setting.
So what if I stop at ele2 or ele3 - the document is wrong anyway and must be rejected. The result is the same - humans don't read XML anyway - only machines.
But bandwidth and even LANs aren't that fast which is where the bottleneck occurs. We're experimenting with the excellent Infragistics NetAdvantage suite of web controls like the grid. These things are ending up sending a 2MB HTML file across - okay, it's not XML in this case but it's the same idea/problem.
Rob.
Try marshalling and unmarshalling 1,000,000,000 floating point numbers as used in many mathematical simulations - XML is a non-starter.
How is your post insightful? Not closing a tag is not unique to Lisp-ish expressions - XML suffers from it as well.
Because you're an idiot. The differences you point out in your article are mostly not substantial, as e.g. SXML, the entire "XML Infoset" expressed in Scheme, illustrates. The _only_ part of value in XML is, as you point out, the dealing with different character encodings. And that's really pretty independent from the rest of XML and could be applied just as easily to a SEXP-based-but-not-XML file format.
"Hmmm, OED might be unclear to tons of people reading this, I'll make them have to click on a link to know what I'm talking about."
Obligatory relation to discussion content:
Providing a link instead of writing a clear summary is choosing the wrong tool for the task at hand. Authors of some other comments in this thread have shown that XML also is the wrong tool for many of the tasks to which it is applied. Whether it's passing data internally within an application or summarizing an article for the homepage, choosing the right alternative can make a difference between efficient clarity and an inelegant kludge.
Applying the right algorithmic tool to the right problem is actually a focus of CS. This is why sorting routines are often studied -- for instance, a routine which is more efficient at sorting millions of unordered pieces of data may be very wasteful when dealing with nearly presorted data.
The distinction is not often understood and has more of an impact than the observer might think. For instance, when writing an application for a handheld in which data is kept sorted and is usually viewed between insertions it makes sense to sort after every data element added to the database. However, this means adding a single item to a mostly-ordered set. Understanding that quicksort is a poor choice for this application means a difference in battery life.
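The comment does not say what to use instead of quicksort. One common choice, assumed here purely for illustration, is to binary-search for the position of the new item and insert it there, which keeps an already-sorted list sorted without re-sorting everything.

```python
import bisect

def insert_sorted(items, value):
    # Binary search finds the slot in O(log n) comparisons;
    # the list insert then shifts the tail once.
    bisect.insort(items, value)

records = [3, 8, 15, 23, 42]
insert_sorted(records, 19)
print(records)
```

The point is the same as the comment's: the cost of adding one element should scale with the one element, not with re-ordering the whole collection.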
Somebody get that guy an ambulance!
Or even (:tagname data)?
Get this man a copy of Practical Common Lisp!
And they have infix notation...
S-expressions are in prefix notation. Infix describes expressions such as "1+2". Lots of parentheses are hard to read, but twice that number of angle brackets is certainly not easier.
Blurring the line between data and code is a useful technique...
This only matters if you use the data in Lisp without being careful. Any non-interpreted language could use it just as safely as XML.
P.S. I don't even like Lisp, being a person who likes type checking before I actually execute a snippet of code. On the other hand, they really do have a point regarding S-expressions and XML.
And PAT was:
Sorry Tim, I couldn't resist. But you have to admit that PAT was rather ugly...
F U NE X N M? Son: "Dad... How do you spell 'hourly'?" Dad: "0 * * * *"
It's very easy to look for an opening <element> tag followed by a closing tag of the same name
It's easy for you, but it's impossible for me! Why? Because we consume XML on embedded devices with very small memory. We have no space for all the extra code that would be needed to do sophisticated recovery from XML syntax errors. When any XML error is detected, we simply discard the packet and jump into recovery mode.
The right solution is to let the USER decide if they want to bloat their datagram to facilitate sophisticated error recovery.
The incorrect assumption that you made is a perfect example of the lack of thinking that XML's design suffers from.
Yeah that'd work great if you knew 100% of the time that you'd never get bad data.
Pretty much. We use TCP, which has extremely high protection against packet corruption. The only XML error we have seen in deployment is truncation when somebody pulls out a CAT5 cable, or when a program crashes in the middle of a socket write.
(Obviously, we get other XML errors during development, but those errors are caused by software bugs. We don't need a sophisticated parser to make dynamic corrections in our buggy XML, when we can simply fix the bug.)
Please mod the parent post up!
First, let's add matching close parens to make error detection easier. (It might be handy to have a way (such as the "/") to indicate an end tag, but we're going for brevity here.)
(html ... html)
Now let's add attributes. It is probably most convenient to put these in a list right after the element name. Obviously if there are no attributes we need to put in an empty list so the parsing won't be ambiguous:
(html '() (head '() (style '( type."text/css" ...) style) head) html)
(Of course, one could use some special syntax to indicate attribute lists, or even map them into attributes. And then the attribute lists would not be required to follow the element name and the parser could tell the difference easily enough.)
(html (attribute-list ... attribute-list) html)
Or perhaps:
(html #( ... attributes in here )# html)
Now we have the question of attribute lists that might not follow the element name - is that legal? Better not let it be, I can see several problems with that that would change the semantics - not unusefully, but incompatibly with XML.
OK. That looks like it will work and be (more or less) isomorphic to XML. How much space does it really save? One ">" for each start element and two characters for each end element - but there are the added characters for the empty attribute lists.
I doubt anyone would be terribly bothered if someone built a syntax that was isomorphic to that of XML (meaning that a syntax transformer and its inverse run on a document of either sort would produce the same document).
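A transformer and its inverse can be sketched for the element/attribute/text subset of XML. In the sketch below, nested Python tuples stand in for S-expressions, and render() prints them in a parenthesized form loosely modeled on the notation above. The function names and the tuple layout are assumptions for illustration; comments, processing instructions, and DTDs are not preserved, so this is isomorphic only for that subset.

```python
import json
import xml.etree.ElementTree as ET

def to_sexpr(elem):
    """XML element -> (tag, attributes, children) with text kept as strings."""
    children = []
    if elem.text:
        children.append(elem.text)
    for child in elem:
        children.append(to_sexpr(child))
        if child.tail:
            children.append(child.tail)
    return (elem.tag, tuple(sorted(elem.attrib.items())), tuple(children))

def from_sexpr(node):
    """Inverse of to_sexpr: rebuild an ElementTree element."""
    tag, attrs, children = node
    elem = ET.Element(tag, dict(attrs))
    last = None
    for child in children:
        if isinstance(child, str):
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                last.tail = (last.tail or "") + child
        else:
            last = from_sexpr(child)
            elem.append(last)
    return elem

def render(node):
    """Print the tuple form with parentheses; the attribute list always comes second."""
    tag, attrs, children = node
    attr_text = " ".join(f'{key}."{value}"' for key, value in attrs)
    parts = [tag, f"({attr_text})"]
    parts += [render(c) if isinstance(c, tuple) else json.dumps(c) for c in children]
    return "(" + " ".join(parts) + ")"

source = '<html><head><style type="text/css">p {}</style></head></html>'
tree = to_sexpr(ET.fromstring(source))
print(render(tree))
print(ET.tostring(from_sexpr(tree), encoding="unicode"))
```

Keeping the attribute list in a fixed position, even when empty, is what keeps the parse unambiguous, which is the trade-off the comment describes.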
Go for it - I'm sure there are a lot of people who would like to be able to human-author xml without all the syntax. But wait - there are XML editors that do just that! (Most could do it better, but that's another problem.)
Pretty much. We use TCP, which has extremely high protection against packet corruption. The only XML error we have seen in deployment is truncation when somebody pulls out a CAT5 cable, or when a program crashes in the middle of a socket write.
How about when the power goes out? When a hard drive has a bad sector and transfers a malformed file? When your parser misses a closing tag, how does it know which XML element parents the next XML element? Does it guess? How would you recover from such an error?
I've written my share of XML parsers and routines, including a de-bloating script which does the single-character tag naming. I actually used 2 characters because we had more than 26 elements and I wanted to remain strictly alphabetic; but 2 characters is suitable for a document possessing 676 elements or fewer. 3 characters for 17576 elements. Anyhow, I've never done anything on an embedded platform. I've always had the space available to me to fix broken XML. I actually had this parser once which would try to recover data from a bad XSL transform and make it standards compliant. It did so by adding and removing tags as it saw fit (based on a config file I created) in order to ensure that every element was parented validly and only possessed valid child elements.
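The name-shortening step described above amounts to counting in base 26. The following is a minimal sketch under that reading; the function names and the sample element names are hypothetical, and the poster's actual script is not shown in the thread.

```python
import string

LETTERS = string.ascii_lowercase

def short_name(index, width=2):
    """Map 0..26**width-1 to a fixed-width, strictly alphabetic name."""
    if not 0 <= index < len(LETTERS) ** width:
        raise ValueError(f"index {index} does not fit in {width} letters")
    chars = []
    for _ in range(width):
        index, remainder = divmod(index, len(LETTERS))
        chars.append(LETTERS[remainder])
    return "".join(reversed(chars))

def build_mapping(element_names, width=2):
    """Assign each distinct element name a short name, in first-seen order."""
    unique = list(dict.fromkeys(element_names))
    return {name: short_name(i, width) for i, name in enumerate(unique)}

print(build_mapping(["element", "subelement", "element"]))
print(26 ** 2, 26 ** 3)
```

The inverse mapping (short name back to readable name) is just the dictionary flipped, which is what an editor-side interface layer would use.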
mod parent up. That is so damn true.
How about when the power goes out?
Same case as a program crash. There's no way for me to know if my client crashed or lost power. And I don't care.
When a hard drive has a bad sector and transfers a malformed file?
Same thing. It causes a crash or some other error. All errors are the same to me.
When your parser misses a closing tag, how does it know which XML element parents the next XML element?
Don't care. The XML is either perfect, or it's garbage. In our limited-memory embedded environment, we simply don't have the ability to be dainty about this. All errors are handled the same.
How would you recover from such an error?
Hard reset.
I've always had the space available to me to fix broken XML.
That's the whole problem here. People make assumptions about the environment.
I wrote my XML parser by hand, in C. It does only the absolute minimum that it needs to extract a substring from the XML. I don't have the luxury of doing anything else.
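The poster's parser is in C and is not shown. As a rough idea of what "the absolute minimum" can look like, here is a sketch in Python that extracts one element's text by plain substring search and treats anything unexpected as garbage. The function name and the sample packet are made up for the example, and it handles no attributes, nesting of the same tag, or entities.

```python
def extract(xml, tag):
    """Return the text between <tag> and </tag>, or None if either is missing."""
    open_tag = f"<{tag}>"
    close_tag = f"</{tag}>"
    start = xml.find(open_tag)
    if start < 0:
        return None
    start += len(open_tag)
    end = xml.find(close_tag, start)
    if end < 0:
        return None  # truncated packet: caller discards it
    return xml[start:end]

packet = "<config><baud>9600</baud><port>2</port></config>"
print(extract(packet, "baud"))
```

Returning None for anything malformed matches the all-or-nothing policy described above: the caller discards the packet instead of attempting recovery.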
I actually had this parser once which would try to recover data
That's a nice feature. But it's totally different from what I have the resources to do.
We've got a big ol' basket of apples and oranges here.
How much memory are you working with? What's your architecture? I don't want to get you into any trouble with your company or anything for improper disclosure, but I'm curious what sort of project you're working on, specifically.
I don't want to get you into any trouble with your company or anything for improper disclosure, but I'm curious what sort of project you're working on, specifically.
No problem; that's why I'm posting as AC.
See: http://www.rabbitsemiconductor.com/ (2000 series). It's an 8-bit processor, similar to the Z80. We're very happy with the processor's speed, but we're trying to minimize memory usage to save cost. We use it for controlling various devices via RS232 serial ports. We also need it to run a TCP server so that we can remotely send it configuration data (in XML). Everything needs to fit in 256 KB of flash memory. The compiler is supplied by: http://www.zworld.com/
Last time I tried the Perl YAML module I could generate a pathological perl data structure (strings designed to look suspiciously like bits of YAML) and corrupt the output sufficiently that it didn't parse back into the same data structure.
This was a bit over a year ago.
I'm sorry, but I'm just not interested in using a format where I can't rely on it being clean enough to even pass printable text cleanly through a conversion and back again. Get back to me when you've got a format which isn't a crock of shit.
The extra redundancy of closing tag labels makes sense when your documents are generated by humans, like most SGML was.
It makes no sense at all for documents that are generated by programs, especially programs that create documents in some canonical manner like building DOM structures and then serializing them - if you trust the serializer, then it can't drop a close tag.
To a Lisp hacker, XML is S-expressions in drag.
But bandwidth and even LANs aren't that fast which is where the bottleneck occurs.
Like I said in my post, gzip works pretty damn well for networks too, as it supports streams. If you're running a web server, use something like mod_gzip -- if you're writing a network application with a custom-made XML-based protocol, you can simply wrap a gzip compressor/decompressor around the socket stream.
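For the custom-protocol case, wrapping a compressor around the stream can be sketched with Python's zlib in gzip mode. The class names are hypothetical and the socket itself is left out: encode() produces the bytes that would be written to the socket, and decode() is fed whatever bytes arrive.

```python
import zlib

GZIP_WBITS = 31  # 16 + 15: ask zlib for a gzip header and trailer

class GzipSender:
    def __init__(self):
        self._compressor = zlib.compressobj(6, zlib.DEFLATED, GZIP_WBITS)

    def encode(self, message):
        # Z_SYNC_FLUSH pushes out everything so far, so the peer can
        # decode this message without waiting for the next one.
        return self._compressor.compress(message) + self._compressor.flush(zlib.Z_SYNC_FLUSH)

class GzipReceiver:
    def __init__(self):
        self._decompressor = zlib.decompressobj(GZIP_WBITS)

    def decode(self, chunk):
        return self._decompressor.decompress(chunk)

sender, receiver = GzipSender(), GzipReceiver()
for message in (b"<msg>hello</msg>", b"<msg>hello again</msg>"):
    wire = sender.encode(message)
    print(len(message), "->", len(wire), receiver.decode(wire))
```

Because one compressor is kept for the life of the connection, later messages can reference tag names seen in earlier ones, which is where repetitive XML benefits most. Very short messages may still come out larger than the input.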
Binary XML is intended to make things parse faster, but as others have said, it's worth the extra CPU power to preserve a format that is human-readable. Compression is an easy fix for disk storage / network transmission problems.
What fucking retard modded this to a troll?
Whilst this isn't quite on-topic, it is to do with text versus binary formats :-) Our problem is that when you start trying to make a webapp do more client type functions (like grid, tree controls etc.) then the size of the HTML file with Javascript but most often state information becomes a problem. Even with an include, it still downloads a large jscript file as text and then compiles (or more likely interprets).
The same is true of using XML.
Is there any way to compress the HTML and jscript files as they are fetched? Good-old modems used to compress the text stream on the fly, does the same thing happen with broadband?
Cheers, Rob.
As long as you can very easily rebuild the text version of the XML/HTML at the client end. Hmm, that's a nice way to secure things as well - encrypt the binary as well with keys and then people can't look at your source code if you don't want them to.
Binary XML data would also be inherently more secure to packet sniffers although not perfect.
Somebody must have thought about this before.
Cheers, Rob.
>Somebody must have thought about this before.
:-)
Ohh hang on, I've just re-invented "the compiler"
Rob.
Is there any way to compress the HTML and jscript files as they are fetched? Good-old modems used to compress the text stream on the fly, does the same thing happen with broadband?
Yes, as I've said now in 2 posts on this thread, you can transparently compress stuff from web servers. Web browsers send the server an "Accept-Encoding" header, that shows the different formats it can take. Most (all?) modern browsers support gzip.
The project I'm doing at work is XUL (XML user interface language, as used in Mozilla browsers, Thinlet and Macromedia Flex) and JavaScript. Both of these are sent to the client gzipped.
Furthermore his question about why no-one did this before 1996 is wrong. Electronic Arts did IFF in 1985 for the exact same purpose: interchange of data between applications and computers.
If Bray worked on the OED in 1985, he must have seen IBM Fellow Mike Cowlishaw's LEXX editor, which was the thing that displayed "the electronic version[] of dictionary. It was what we would now call XML. It had little embedded tags saying entry, word, and then pronunciation, etymology, a brief quotation, and the date, source, text, and so on." It was a color-terminal application, running (initially) on IBM's VM mainframe system. LEXX was the subject of an IBM Journal of Research and Development article back in 1987 - it's worth reading, even though the screen-shots didn't survive being scanned into the PDF.