SQL, XML, and the Relational Database Model
Kardamon writes "In an article on DBAzine, Fabian Pascal writes that SQL is not a good representation of the relational model, and is afraid the situation will get worse with XML and XQUERY. An overview of some of the reactions on the positions Pascal and also C.J. Date take on this issue is given in this article over at SearchDatabase.com by Sara Cushman."
Celko is misquoting Darwin in saying that "The idea that you will always know everything is arrogant".
t o.web/Missing-info-without-nulls.pdf
Date/Darwin/Pascal propose that you codify what you don't know (so to speak). Read their proposed solution here:
http://www.hughdarwen.freeola.com/TheThirdManifes
And yes, XML DBMS are a throwback to IBM IMS and other hierarchical DBMS products. Anyone who has ever used a hierarchical DBMS will tell you that there are some pretty non-trivial problems that you cannot work around due to their hierarchical data model, yet XML DBMS proponents propose we go back to that old, inflexible system!
Thanks,
--
Matt
Here is a history of SQL. I wanted to check the article's facts. Also, I was curious... History of SQL
No.
XML is a file format, it has nothing to do with objects, no more than HTML does, which is not at all.
However to counter your claim that XML is hard to represent in a relational database. Uhm. No, it's not.
XML consists (simplifying) of elements and attributes, elements may be nested.
A generic mapping to a relational database is that elements correspond to the entity tables, attributes correspond to columns in those tables, and the nesting of elements is modelled as a foreign key in the child entity records.
Whats so hard?
NZ Electronics Enthusiasts: Check out my Trade Me Listings
Readers interested in what Pascal and Date have to say may be interested in visiting Database Debunkings. It has lots of articles written by one or both of them.
Personally, I recommend the articles written by Date because they are clearly, concisely, accurately, and calmly written. Pascal's tend to turn into a rant, which I wouldn't mind but he always seems to refer to his books rather than give a detailed justification of his arguments.
There are a couple of "problems" that they have identified:
1) You can write a given query and number of different ways. This is not necessarily a SQL problem but due to this the query optimizers have to be enormously complex to handle complicated queries and by association you can have queries which describes two identical sets but have vastly different runtimes/costs.
2) Little/No support for relational domains (e.g. complex data types)
3) Non-updateable views (partially due to duplicate handling and/or allowing relations with no primary key)
4) Weak support for complex integrity constraints (e.g. business rules)
5) No support for entity sub/supertype relationships
6) Supports NULLs (Date/Pascal/Darwin do not like NULLs)
Try searching www.dbdebunk.com for SQL. Or pick up the great book "Practical Issues in Database Management" by Fabian Pascal.
Thanks,
--
Matt
"Just because two queries return the same results today do not mean that they will continue to do so in the future."
Total misunderstanding of what I wrote. To put it another way:
SQL allows you to write queries which are mathematically equivalent but result in vastly different query plans and performance. Again, not a particularly stinging-indictment of SQL as such but had it been designed differently it could have avoided such ambiguity in the language.
"Not a bug, it's a feature. The S in SQL is for "structure"... go hammer out your data into a structured format rather than a complex one and then come back."
So you're saying a tree has no "structure"? That a domain has no structure? If it had no structure, it would be a little difficult for computers to process.
"View stuff"
Pascal (or Date, can't remember) provides an iron-clad (mathematical definition) method of creating views which will always be updatable. There are structural deficiencies in SQL which prohibit this. I will not waste time/typing here illustrating them, they are all identified at their web site.
"SQL triggers" etc.
It is precisely the reason that applications were enforcing business rules that DBMS were invented all those years ago! There are plenty of reasons that application-enforcement of business rules is a bad thing. Again these are illustrated on their web site. Also, your quote about "SQL triggers" is basically re-stating what I mentioned: that SQL is poor at implementing business rules!
"Plenty of support, just not intrinsically."
Which is exactly the same as saying "no support for entity sub/supertypes". Plus, one-to-many tables are not the same, you're thinking of something else. Chapter 6 of Fabian Pascal's book "Practical Issues in Database Management" covers this in some depth.
"That's like trying to do math without a concept of zero."
Not quite the same. Remember that the relational model is based upon predicate logic and set theory. Set theory has the empty set, which is not the same as NULL. SQL products currently handle null in a ridiculous manner (some sort NULL oddly, comparison is difficult, summation is odd). Pascal/Date do not suggest that the concept of "unknown" is bad, just that the SQL representation as NULL is.
Thanks,
--
Matt
The problem is that most "Relational Database Management Systems" only represent one type of "relationship", that being "the table".
This, in turn, means that every operator (programmer, statement, etc) on the database must _individually_ "already understand" all the relationships that lie outside "the table" before they can act on the data at any significant scope.
That is, you, the programmer or operator must know, from some source besides the RDBMS itself, how the different tables work with eachother.
In simpler SQL-biased terms, you have to know, before you start, what is "good" to put in your WHERE CLAUSE to make a join. And then the RDBMS query optimizer needs to guess how to satisfy your needs in something other than glacial time.
Consider a new verb "EXPOSE thing, thing, thing, thing, thing..." that would fish out of a database the one-or-more relationships between the things, and produce a table-looking vector of "tuples" that consist sets of actual values for those things. This is what the theoretical "perfect" RDBMS would do.
Given (somewhat denormal 8-):
Customer ID -> Customer Name
Customer ID -> Street Address
Customer ID -> Zip Code
Customer ID -> Purchase Order ID
Purchase Order ID -> Part Number, Quantity
One should be able to "EXPOSE Part Number, Zip Code" and have the database "know the relationship" and produce the correct vector of tuples.
But that doesnt happen.
Now all the people bleating about the Higherarchial databases and bad things from the past are doing this harping because they remember the bad-old-days when a database would maintian one tree-structured set of relationships like this. In the higherarchical model, you could basically do this EXPOSE operation, but only if you had, by dint of pre-knowledge, asked for things lying on one linear path through the tree. (* simplified for brevity).
In essence, SQL requires you, the programmer, to be in the business of making up relationships that should be in the data or schema structure but isn't.
A magically complete RDBMS would take a series of vectors of the form "Independent Datum (key)->Dependent Datum (value)" (where either side of the arrow could be a list of atomic values). The RDBMS would then assemble and maintian tables or linked lists or whatever with no exposure of SQL-esque "tables" and the accessors would be storage method agnostic. (That is, there would be no such thing as a FROM CLAUSE.)
For instance, in the above list of relationships, Customer ID, Customer Name, Street Address, and Zip Code *could* all live in a "table", or not, but you would never know that. But a better vector of
Customer ID -> Street Name, Building Number, Suite to replace "Customer ID -> Street Address" has a table-feel, but would not bias against "EXPOSE Street Name, Part Number".
The power of this comes from being able to do:
EXPOSE Building, Part Number, Quantity
Where Building == "Word Trade Center 2";
And have the RDBMS already know the sequence of relationships to get from Building to (address elements) to Customer ID, to Part Number without the programmer writing the three stage join across the "uninteresting middle tables".
(The above presumes you have a building relation that has Street Name, Street Number -> Building or some such.)
All the XML nonsense is nonsense *_BECAUSE_* the strict-nesting enclosure requirements of XML make it "naturally" become hierarchically organized. But data exists outside the single-inheretance strict parantage trees that the hierarchical model dictates.
The problem is that SQL got real popular and so the idea of structural inferrence got pared down to tables and Query-Like constraints on tables before anybody had a chance to formalize the idea of living, encoded relationships between arbitrarily stored datum. So we never really got a language or system that could "EXPOSE".
Innocent people shouldn't be forced to pay for inferior software development.
--"Code Complete" Microsoft Press
Would you please enlighten me on how an XSLT job operating on a structure like above and showing all the employees who have a salary over 100k, having more than 20 directs and have travelled in the past 12 months look like? Also please include their manages names and phone numbers int he result. And what performance it would have?
] " > /> />
/>
/> />, Phone #<xsl:value-of select="@PhoneNumber" />, />
It sounds like you're kidding (or being sarcastic), and I'm not going to debug it (I'd need data, anyway.), but probably something like this:
<xsl:template match="/root/Employees/Employee[@Salary>100000
<xsl:variable name="EmployeeID" select="@EmployeeID"
<xsl:variable name="NumReports" select="count(/root/Employees/Employee[@ManagerID = $EmployeeID])"
<!--
XSLT sucks with dates, no matter what your data is arranged like. Storing dates as YYYYMMDD does
at least allow for comparisons. Also, there is no facility that I know of to retrieve the current
date, so I'm going to hard-code that.
-->
<xsl:variable name="RecentTravel" select="boolean(/root/Trips/Trip[@EmployeeID = $EmployeeID and @Date>=20030628 ])"
<xsl:if test="$RecentTravel and ($NumReports>20)" >
<xsl:variable name="ManagerID" select="@ManagerID"
<xsl:text>
Name: <xsl:value-of select="@Name"
Manager Name: <xsl:value-of select="/root/Employees/Employee[@EmployeeID = $ManagerID]/@Name"
</xsl:text>
</xsl:if>
</xsl:template>
And this approach seems to perform just fine in MSXSL, which I believe is DOM-based. It might give a SAX-based engine problems, because it jumps around so much.
Lest I give a false impression, I'm not suggesting that this sort of XML replace relational databases. The point is that a roughly relational layout is still a good approach, even when you need to be working in XML.