SQL, XML, and the Relational Database Model
Kardamon writes "In an article on DBAzine, Fabian Pascal writes that SQL is not a good representation of the relational model, and is afraid the situation will get worse with XML and XQUERY. An overview of some of the reactions on the positions Pascal and also C.J. Date take on this issue is given in this article over at SearchDatabase.com by Sara Cushman."
Celko is misquoting Darwin in saying that "The idea that you will always know everything is arrogant".
t o.web/Missing-info-without-nulls.pdf
Date/Darwin/Pascal propose that you codify what you don't know (so to speak). Read their proposed solution here:
http://www.hughdarwen.freeola.com/TheThirdManifes
And yes, XML DBMS are a throwback to IBM IMS and other hierarchical DBMS products. Anyone who has ever used a hierarchical DBMS will tell you that there are some pretty non-trivial problems that you cannot work around due to their hierarchical data model, yet XML DBMS proponents propose we go back to that old, inflexible system!
Thanks,
--
Matt
Here is a history of SQL. I wanted to check the article's facts. Also, I was curious... History of SQL
No.
XML is a file format, it has nothing to do with objects, no more than HTML does, which is not at all.
However to counter your claim that XML is hard to represent in a relational database. Uhm. No, it's not.
XML consists (simplifying) of elements and attributes, elements may be nested.
A generic mapping to a relational database is that elements correspond to the entity tables, attributes correspond to columns in those tables, and the nesting of elements is modelled as a foreign key in the child entity records.
Whats so hard?
NZ Electronics Enthusiasts: Check out my Trade Me Listings
Readers interested in what Pascal and Date have to say may be interested in visiting Database Debunkings. It has lots of articles written by one or both of them.
Personally, I recommend the articles written by Date because they are clearly, concisely, accurately, and calmly written. Pascal's tend to turn into a rant, which I wouldn't mind but he always seems to refer to his books rather than give a detailed justification of his arguments.
There are a couple of "problems" that they have identified:
1) You can write a given query and number of different ways. This is not necessarily a SQL problem but due to this the query optimizers have to be enormously complex to handle complicated queries and by association you can have queries which describes two identical sets but have vastly different runtimes/costs.
2) Little/No support for relational domains (e.g. complex data types)
3) Non-updateable views (partially due to duplicate handling and/or allowing relations with no primary key)
4) Weak support for complex integrity constraints (e.g. business rules)
5) No support for entity sub/supertype relationships
6) Supports NULLs (Date/Pascal/Darwin do not like NULLs)
Try searching www.dbdebunk.com for SQL. Or pick up the great book "Practical Issues in Database Management" by Fabian Pascal.
Thanks,
--
Matt
Whats so hard?
The "arbitrary XML" part. You must have existing mappings set up to process the XML. New forms of XML thus require a great deal of work on the part of the DB developer.
XML databases such as Xindice OTOH, allow you to create a table and insert XML in whatever format you chose. XPATH queries take a bit of getting used to, but you can query on tags, attributes, CDATA, or whatever else you chose at whatever level in the XML hierarchy you choose. Thus I can query for the list of addresses for all records that have a firstname attribute that is LIKE "Bob". Or I can dive down to the individual address level and query for all records that have an address of "Drury Lane" (important for tracking down the Muffin Man when you need a giant gingerbread cookie).
It's not like you can't do this stuff with SQL databases, it's simply a different method of accomplishing the task. Depending on the data you're working with, an XML database may very well be a more efficient method of storage and queries.
Javascript + Nintendo DSi = DSiCade
And the problem happens as soon as you want to know which producs uses a specific screw. Doing that query against the xml would be difficult, and/or extreamly slow.
Martin
Of course XML is going to be hard to represent in a relational database. Unless your tables are ( id, object text) and you pull out your XML and parse it.
This is completely false. If you had RTFA, it is mentioned that the relational model can represent hierarchies (and thus XML) just fine. It is SQL that is deficient for this purpose.
Also, it makes no sense to call XML "object-oriented," which is a programming language term[semi-OO? LOL]. XML is a syntactic hierarchy that can be used to represent "objects", just as it can be used to represent other types of data.
Last, just about every major database now supports XML as a native datatype, meaning you don't have to pull out XML documents and parse them.
bp
> An iterative SQL call would be needed unless the overall part depth was known at query time.
Both oracle & db2 have very good support for recursion. DB2 in particular can easily handle hierarchies and networks of unknown depth in a single query.
Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
"Just because two queries return the same results today do not mean that they will continue to do so in the future."
Total misunderstanding of what I wrote. To put it another way:
SQL allows you to write queries which are mathematically equivalent but result in vastly different query plans and performance. Again, not a particularly stinging-indictment of SQL as such but had it been designed differently it could have avoided such ambiguity in the language.
"Not a bug, it's a feature. The S in SQL is for "structure"... go hammer out your data into a structured format rather than a complex one and then come back."
So you're saying a tree has no "structure"? That a domain has no structure? If it had no structure, it would be a little difficult for computers to process.
"View stuff"
Pascal (or Date, can't remember) provides an iron-clad (mathematical definition) method of creating views which will always be updatable. There are structural deficiencies in SQL which prohibit this. I will not waste time/typing here illustrating them, they are all identified at their web site.
"SQL triggers" etc.
It is precisely the reason that applications were enforcing business rules that DBMS were invented all those years ago! There are plenty of reasons that application-enforcement of business rules is a bad thing. Again these are illustrated on their web site. Also, your quote about "SQL triggers" is basically re-stating what I mentioned: that SQL is poor at implementing business rules!
"Plenty of support, just not intrinsically."
Which is exactly the same as saying "no support for entity sub/supertypes". Plus, one-to-many tables are not the same, you're thinking of something else. Chapter 6 of Fabian Pascal's book "Practical Issues in Database Management" covers this in some depth.
"That's like trying to do math without a concept of zero."
Not quite the same. Remember that the relational model is based upon predicate logic and set theory. Set theory has the empty set, which is not the same as NULL. SQL products currently handle null in a ridiculous manner (some sort NULL oddly, comparison is difficult, summation is odd). Pascal/Date do not suggest that the concept of "unknown" is bad, just that the SQL representation as NULL is.
Thanks,
--
Matt
This is off-topic, but it bothers me when someone blasts another for ignorance and then links into an otherwise perfectly good article ("The Myth of Self-Describing XML", by Eric Browne)which contains the following pearl:
Doesn't it bother anyone that someone who doesn't know his Latin (but uses it, anyway, where a perfectly good "after the fact" would suffice) is used as an authoritative source by the author? If you're trying to prove someone's incompetence, you won't want to counteract it with further incompetence. Mr. Browne may be a genius in relational databases, but makes stuff up just the same.
And in the unlikely case Mr. Browne should read this post, the correct form is a posteriori, as any law student probably knows.
The problem is that most "Relational Database Management Systems" only represent one type of "relationship", that being "the table".
This, in turn, means that every operator (programmer, statement, etc) on the database must _individually_ "already understand" all the relationships that lie outside "the table" before they can act on the data at any significant scope.
That is, you, the programmer or operator must know, from some source besides the RDBMS itself, how the different tables work with eachother.
In simpler SQL-biased terms, you have to know, before you start, what is "good" to put in your WHERE CLAUSE to make a join. And then the RDBMS query optimizer needs to guess how to satisfy your needs in something other than glacial time.
Consider a new verb "EXPOSE thing, thing, thing, thing, thing..." that would fish out of a database the one-or-more relationships between the things, and produce a table-looking vector of "tuples" that consist sets of actual values for those things. This is what the theoretical "perfect" RDBMS would do.
Given (somewhat denormal 8-):
Customer ID -> Customer Name
Customer ID -> Street Address
Customer ID -> Zip Code
Customer ID -> Purchase Order ID
Purchase Order ID -> Part Number, Quantity
One should be able to "EXPOSE Part Number, Zip Code" and have the database "know the relationship" and produce the correct vector of tuples.
But that doesnt happen.
Now all the people bleating about the Higherarchial databases and bad things from the past are doing this harping because they remember the bad-old-days when a database would maintian one tree-structured set of relationships like this. In the higherarchical model, you could basically do this EXPOSE operation, but only if you had, by dint of pre-knowledge, asked for things lying on one linear path through the tree. (* simplified for brevity).
In essence, SQL requires you, the programmer, to be in the business of making up relationships that should be in the data or schema structure but isn't.
A magically complete RDBMS would take a series of vectors of the form "Independent Datum (key)->Dependent Datum (value)" (where either side of the arrow could be a list of atomic values). The RDBMS would then assemble and maintian tables or linked lists or whatever with no exposure of SQL-esque "tables" and the accessors would be storage method agnostic. (That is, there would be no such thing as a FROM CLAUSE.)
For instance, in the above list of relationships, Customer ID, Customer Name, Street Address, and Zip Code *could* all live in a "table", or not, but you would never know that. But a better vector of
Customer ID -> Street Name, Building Number, Suite to replace "Customer ID -> Street Address" has a table-feel, but would not bias against "EXPOSE Street Name, Part Number".
The power of this comes from being able to do:
EXPOSE Building, Part Number, Quantity
Where Building == "Word Trade Center 2";
And have the RDBMS already know the sequence of relationships to get from Building to (address elements) to Customer ID, to Part Number without the programmer writing the three stage join across the "uninteresting middle tables".
(The above presumes you have a building relation that has Street Name, Street Number -> Building or some such.)
All the XML nonsense is nonsense *_BECAUSE_* the strict-nesting enclosure requirements of XML make it "naturally" become hierarchically organized. But data exists outside the single-inheretance strict parantage trees that the hierarchical model dictates.
The problem is that SQL got real popular and so the idea of structural inferrence got pared down to tables and Query-Like constraints on tables before anybody had a chance to formalize the idea of living, encoded relationships between arbitrarily stored datum. So we never really got a language or system that could "EXPOSE".
Innocent people shouldn't be forced to pay for inferior software development.
--"Code Complete" Microsoft Press
I beg to differ:
The ancestors of 'Mary' are:SQL might still not be true relational calculus (but then again, it might as well be, I don't know, and the parent doesn't know either), but at least it't possible to express the ancestor relationship.
Now, if only the PostgreSQL guys would implement this ...
FP, Date, and Darwen may or may not be extremists, but the above characterization is just plain wrong.
Date and Darwen have extensively documented the differences their vision and the commerial options in The Third Manifesto Chapter 7 (IIRC) of Pascal's Practical Issues in Database Management is devoted to tree processing in a relational language. There's extensive discussions of tree processing and handling transitive closure in The Third Manifesto as well. The cover hierarchies and networks, and in addition have several chapters devoted to inheritance.
You may not agree with their critique, but to say they haven't articulated the specifics is just wrong; they've spent 20 years doing exactly that.
No. You. Can't.
You simply cannot know what the data means without specified semantics (which you have to agree upon somehow; they aren't magically apparent from the XML itself -- that's why such abominations as DTD and XML-Schema exist).
HAND.
No, they are not, but the way to rectify them is a bit extreme. Keep in mind that -- as Date says again and again -- there is a difference between the logical model and the physical model. I'll summarize the example he used: a EMP_SALARY table.
Let's start simple:
[Sorry, ecode doesn't seem to want to do vertical alignment.]
Now, what do those two NULLs mean; do they carry value? If Bob is unemployed, we could write 0, but then it could be confused with unemployed and employed for no pay. Perhaps we don't know how much money Chris makes, but we do know that he makes some money.
You could fix this by adding a TYPE enumeration column that would take on values like EMPLOYED, UNEMPLOYED and UNKNOWN, but you'll still need the NULL value for the UNEMPLOYED and UNKNOWN cases.
This last part can be solved by logically segregating the table above (sans TYPE column) into three tables (one per type). Both the UNEMPLOYED and UNKNOWN_EMPLOYMENT tables would lack the SALARY column -- they would have only the EMP_ID as there's no more information to add. All rows in the EMPLOYED table would have a known salary.
NULLs have been removed and the design is further normalized -- some would say to the extreme.
Now, how you would model that physically without using NULL and still managing some level of performance I do not know. But that at least explains the reasoning behind NULL not being necessary.
As for myself, after nearly fifteen years of database design and implementation, I'm quite satisfied with using NULL where appropriate. I've never been befuddled by it nor sidelined by its behavior with respect to logical operators. Learn the rules and move on.
Freedom to fear. Freedom from thought. Freedom to kill.
I guess the War on Terror really is about freedom!
multi user read consistency. ...
online backup.
roll-forward recovery.
stored procedures, views, triggers, analytic functions, partitioning, function-based indexes
Enterprise class support from the vendor.
mysql just does not cut it yet.
I don't care what the RDBMS is provided that we don't have to recode every function provided in database kernel-space in user application space (huge waste of time).
the license costs are not what they used to be.
That's why sequences (as implemented by PostgreSQL and Oracle) are handy. Simply create a sequence, and call NEXTVAL(sequence_name) to get the identity of the next record. It isn't an autoincrementer (but can be used as one with default values), and you get the advantage of knowing what ID you will use before your insert - very handy for inserting a lot of related data at once. You can also do tricks like having sequences with different increment numbers, different base values, or even concatenate them with a string to get multi-master friendly safe replication.
For a sequence/nextval to be useful, the increment must be absolutely atomic - that is, it must return the next value and increment without any chance of the same number being given to another caller. Oracle and PostgreSQL do this for you.
You can simulate these in SQL Server with the following stored procedure (original source here: Sequence table: CREATE TABLE sequences ( -- sequence is a reserved word seq varchar(100) primary key, sequence_id int ); MS SQL Server stored procedure: CREATE PROCEDURE nextval @sequence varchar(100), @sequence_id INT OUTPUT AS -- return an error if sequence does not exist -- so we will know if someone truncates the table set @sequence_id = -1 UPDATE sequences SET @sequence_id = sequence_id = sequence_id + 1 WHERE seq = @sequence RETURN @sequence_id I've used this to great effect in SQL server. I've had to emulate it in Access once (yuck! it worked, though!). I've never tried it in MySQL, but it can probably be done.
This page (see LAST_INSERT_ID(expr) talks about how to do it in MySQL. I don't use MySQL when I can avoid it; it's a great, fast database for high-read environments without triggers and similar, but I've never seemed to end up working in one of those!
Lead developer, http://wisptools.net
Would you please enlighten me on how an XSLT job operating on a structure like above and showing all the employees who have a salary over 100k, having more than 20 directs and have travelled in the past 12 months look like? Also please include their manages names and phone numbers int he result. And what performance it would have?
] " > /> />
/>
/> />, Phone #<xsl:value-of select="@PhoneNumber" />, />
It sounds like you're kidding (or being sarcastic), and I'm not going to debug it (I'd need data, anyway.), but probably something like this:
<xsl:template match="/root/Employees/Employee[@Salary>100000
<xsl:variable name="EmployeeID" select="@EmployeeID"
<xsl:variable name="NumReports" select="count(/root/Employees/Employee[@ManagerID = $EmployeeID])"
<!--
XSLT sucks with dates, no matter what your data is arranged like. Storing dates as YYYYMMDD does
at least allow for comparisons. Also, there is no facility that I know of to retrieve the current
date, so I'm going to hard-code that.
-->
<xsl:variable name="RecentTravel" select="boolean(/root/Trips/Trip[@EmployeeID = $EmployeeID and @Date>=20030628 ])"
<xsl:if test="$RecentTravel and ($NumReports>20)" >
<xsl:variable name="ManagerID" select="@ManagerID"
<xsl:text>
Name: <xsl:value-of select="@Name"
Manager Name: <xsl:value-of select="/root/Employees/Employee[@EmployeeID = $ManagerID]/@Name"
</xsl:text>
</xsl:if>
</xsl:template>
And this approach seems to perform just fine in MSXSL, which I believe is DOM-based. It might give a SAX-based engine problems, because it jumps around so much.
Lest I give a false impression, I'm not suggesting that this sort of XML replace relational databases. The point is that a roughly relational layout is still a good approach, even when you need to be working in XML.
I think that document refers to new (connect by features) rather than (new connect by) features.
I've been happily using it on version 8. The connect_by_root & sys_connect or whatever it is may well be introduced in 10g. Certainly I could have killed for the ORDER SIBLINGS BY clause in 8 but it's only available in 9i+.
...useless until you realize you have to connect to the DB from several applications written in several different languages for which you have either to reimplement your way to manipulate the data everywhere, or you have to put in a middle layer of some sort which is able to talk multiple languages (via CORBA, SOAP, plain XML RPC, custom protocol, whatever) and ensure that everyone is accessing data exclusively through it.
Reimplementing logic everywhere across different languages is usually a bad approach because it doubles the development and testing effort.
Middle layers are usually hard to get right the first time, much harder than using stored procedures and triggers, since the typical procedural language is not so at ease at manipulating relational data.
On MySQL: it has a somewhat bad reputation in the field because of the people abusing it: it's fast, it's free, it's easy to set up, but living with these gotchas is definitively too painful for developers more concerned about correctness than speed.
In the end: firing up Firebird/Oracle/PostgreSQL/SAPDB for simple data is plainly stupid, but often it is done anyways since they do a decent job even in that cases; firing up MySQL for your 30+ GB DB containing your whole network topology which is used by everyone for billing, service assurance, troubleshooting, network planning and whatever is stupid as well. MySQL AB knows that, and in fact it now proposes MaxDB (was: SAPDB) as well.
You are wrong. The SQL:2003 standard specifies IDENTITY. See http://troels.arvin.dk/db/rdbms/#mix-identity
I'm not going to try and reply in detail, but since I participate in the W3C XML Query Working Group and am also the w3C XML Activity Lead, a few comments may be useful.
:-)
The article seems to says "I don't like SQL and I don't like XML and I think XML Query is about mergin them although I don't understand it very well, so the people working on XML Query must be stupid, and in any case it's easier to attack people than understand a specification".
Perhaps that's unfair, but it's clear to me that the writer is a little fuzzy on the design goals of XML and also on the focus of SQL development over the past 10 or 15 years.
In both cases the story is about interoperability.
If you look at the XML Query Home Page you'll see approximately two dozen implementations of the XML Query draft, including a number of open source ones. If you look at the public mailing list for comments, you'll see we received over 1100 detailed technical comments at the last public review. So there's a lot of interest in this work.
Why is that? One reason is that, like Web services and SOAP, XML Query is able to replace a lot of proprietary and hard-to-maitain middleware. Another reason is that for the first time we'll have a standard way to search over multiple kinds of data source.
Don is the primary editor of the XQuery language, but the technical decisions reflected in the specification are a result of collaboration, and are agreed on by aconsensus process by a much larger number of particpants. The goal is to make a language that people agree to implement and to use. With support announced by Microsoft, Oracle, IBM, BEA and others (see Web page mentioned above) and judging by the public interest, I think it's fair to say that's going to happen.
It's pretty rare to see a large complex system that everyone is happy with. It's actually pretty rare to see a small system that everyone is happy with. There are people who are unhappy with some features in the Unix cat program, but it's better to have cat in every Unix system than to have millions of shell scripts break on systems where it's missing! The trick, then, is often to include features that will lead to massively wider adoption, even if some people would rather be without them.
Then we have (as part of W3C Process) a public call for implementations so that we can test to see how confident we are that all the major features can be implemented compatibly (i.e. interoperably) in multiple independent implementations.
Features that were not implemented get removed before the specifications are final.
Is XML Query a waste of time? Is XML evil? Is SQL evil? A lot of people think otherwise, and some of them are pretty smart, so if you are concerned, take the time to read the specs and decide for yourself.
Live barefoot!
free engravings/woodcuts
I've read the article (maybe i've misread it but its 8:30am and a saturday:)) and although I can see lots of bashing of SQL, XML and nulls going on I can't see any 'alternative' or solutions suggested.
----- I refuse to have an argument with an unarmed person