Do XML-based Databases Live Up to the Hype?
douthitb asks: "I have recently started work as a contractor with a company developing/improving an application for exchanging large amounts of data. The current solution exchanges data via XML, but the data itself is stored in a SQL Server database. There is a concern about the overhead involved with wrapping and unwrapping the XML to get the data in and out of a relational database. The proposed solution is to use Tamino, an XML-based database. Neither I nor any of the other developers have any experience with Tamino, but the desired result is to remove the bottleneck of converting the XML back and forth. Does anyone have experience using Tamino (or any other XML-based database)? What benefits and/or difficulties did you have in using an XML database, as opposed to its relational counterpart? How large of a learning curve should be expected with a product like this? Do XML databases really live up to the hype? A similar topic was discussed on Slashdot way back when, so I was hoping to get some more up-to-date feedback on the subject."
"Sales reps from Software AG, the makers of Tamino, were brought in to discuss the benefits of their product with us. They, of course, presented Tamino as the end all, cure all database system (it will even clear your acne and make you popular with the girls!). The management of the company I'm contracting with were basically eating out of the sales reps' hands, without asking any of the "tough" questions about what the product can do; I was less convinced. Doing some initial searching on the Internet, I have had trouble finding much information about Tamino outside of the Software AG website."
Oracle and XSQL/XSLT works fine for the database we use at work. The overhead of wrapping and unwrapping the data doesn't seem to be any problem.
There is a concern about the overhead involved with wrapping and unwrapping the XML to get the data in and out of a relational database.
So, can you explain how an XML database will fix this?
Your database still needs to translate the verbose, human readable XML into an internal storage representation. If you're transfering the data between two SQL databases now, then I can't see why it should matter if you're parsing XML and putting into a "traditional" row-column RDBMS or parsing XML and putting into a datastructure more suited for storing XML data. The parsing is going to take exactly the same amount of time.
The XML database would help if you've mapped your data representation to XML, and are having a difficult time persisting it to SQL. For some data representations, going from XML to parsed binary RDBMS representation back XML may be difficult, and it may be easier to just go from XML to parsed binary representation of XML back to XML again. But either way, you're doing the parsing.
You're solving the wrong damned problem.
XML is kinda nice for some things, and really rotten for some things. Please do yourself a favor and sit down and try to decide what problem you are trying to solve. XML really stinks when it comes to sets: something that SQL based databses excel at.
I think that with the XML fetish we have these days, that we are reverting to the preSQL days of CODASYL or IMS (pre 1980s for those of you young'uns).
A few years back, I was brought in to a small company to build their new software on top of the Tamino DB. XML was "the way of the future" and we were asked to use it as much as we could. Software AG promised that everything would be easy to program and that their software functioned perfectly. Software AG's sales rep used the fact that Tamino was used in production by (insert major national company here) as a major selling argument. I later found out from a friend working there that they had only evaluated Tamino, found it useless, and never used it in production.
...) We couldn't even get a reliable unique key from the database. The id we did get "could change" if we were to backup and restore the database. Tamino also scaled very badly with simple queries taking up to a minute on the fastest PC we could buy.
Well, we did finish the software on time, but it was a complete nightmare. Software AG hardly gave us any straight answers (even though they charged big $ for customer support).
Tamino itself was missing a lot of features and seemed designed as a system for storing documents, totally lacking traditional database qualities (uniqueness, reliability, scalability,
Needless to say, the software was thrown away and rebuilt with a reliable SQL database.
I would strongly discourage anyone from bilding an application on top of an xml database, especially Tamino. If you really want to build your application on top of an xml database, I would seriously ask myself why and what difference it would make. Also, if you really need an xml interface, choose an ordinary sql db that has a xml plugin.
The thing the XML databases are nice for is if folks can't really lock down the schema
If you don't know the structure of your data, you're not dealing with data at all, but incoherent noise, which should be treated as an opaque object.
There are few shortcuts in life, and data storage is no exception. If you don't take the time to understand your data OR admit you don't understand it and treat it as an opaque object, you will likely get burned. Sometimes you won't, but don't let that fool you. You can drive for years without using your seat belt, until you get in an accident...
Just some food for thought for the budding data designers in the audience tonight.
But as the evolution from Assembler to C to Java has shown in programming languages, the day may come when raw performance takes a back seat to other concerns.
The point of a database is *data integrity*, not data storage and retrieval. Those are side issues. I can store data very quickly by dumping it to a raw disk device (/dev/hda1). But I will have a hell of a time guaranteeing data integrity (for instance, does each order item have a corresponding inventory item?).
Your evolution example of C to Java is one of increasing *abstraction* at the expense of speed. In a database, you don't want abstractions, you want your data to come out the way you put it in, and you want to be guaranteed that you will never have an invalid set of data in your database.
A bug in a C program means you have to rewrite your program. A bug in your data (bad data, in other words) could mean mistakes compounded on mistakes, that you can't ever unwind. I worked once on an order system that didn't cascade to the order line-items when the order itself was deleted or canceled. And royalties were paid to authors based on the order line-items. You can imagine after 5-6 years the shock when they realized they had paid 5-10% too much every single year because deleted order items remained in the database!
To put it formally, if your database asserts both X and NOT(X) in any given database due to inconsistent data, you can then create a result for ANY ARBITRARY PREDICATE as either true or false. In other words your database is completely broken and can return any arbitrary fact.
My point is, the poster should first ask themselves if data integrity is of utmost importance. If so, they should learn and understand the relational data model, then learn their database (whatever it is) and how it can be mapped onto the relational model. Since no truly relational databases exist today (SQL is bad joke), you need to perform this mapping step. Then program accordingly.
If data integrity is not important, then use whatever you want.
"I believe that you are confusing an RDBMS with an object store."
Excellent point... I've worked with some huge CORBA systems with semi-custom object databases and have seen firsthand the pain these systems can put you through.
One of the bigger vendors whose software we use claims to be porting their entire system to an Oracle or DB2 backed system instead.
Of course, they'll probally use some J2EE monstrosity to implement the new system, so performance will still suck.
Conformity is the jailer of freedom and enemy of growth. -JFK
In case anybody is interested, here are some suggestions for making a more relational-friendly alternative to XML, Here is a wiki topic.
Another potential problem is that existing RDBMS tend to be strong-typed. However, "dynamic relational" is not out of the question. Just because current RDBMS are strong-typed and have "static schemas" does not mean that is the only way to do it. There is a distinction between limits of implementations and limits of relational theory.
Table-ized A.I.
Juat as background - I work in Application Integration and spend much of my time dealing with moving data to and from RDBMS systems (Oracle, SQL Server, etc) and various external formats (EDI, flat file, XML, etc.)
It sounds like you are in danger of changing the original data store (the relational database) in order to preserve a data transfer mechanism (XML). This is probably a bad idea.
Why is the data in the database to begin with? Is it the database for some other business application? Probably - be carefull not to adversly impact that other application by changing it's database.
Why is the data being moved as XML? Do you have control of both ends of the transfer or does some outside entity control the data format?
Be carefull not to solve the wrong problem. Is the _primary_ business requirement to move data between two business systems, or is it to store and transfer XML data? It almost always is the former, even with an externally driven need to use XML to transfer data. In which case: leave the database alone and deal with the problems integrating the data tranfer (XML or otherwise) with the database.
And the parent post is right - XML is a bit of a fetish right now. This leads to it getting mis-applied to a lot of problems.
You're not likely to convince me to use a full XML database, no.
However, we should consider the viability of storing what you and others describe as unstructured documents in blobs with server-side operations available to you. Just because you're going to have some XML values (that's what they are) in your database doesn't mean the whole thing needs to be XML, nor does it mean you should have to do all operations client-side because you're using a relational database. What it does mean is that if you're determined to have XML values, you should have XML functions that match them. Nothing about the relational model prevents you from having this sort of complexity available to you, most vendors have just been slow to provide tools. A lot more could (and should) be done in the area of functional indexing so you don't have to "take things apart" in order to index them, too. I shouldn't have to create a separate "words_used" table to do full-text indexing on an attribute. To be fair, the relational model also doesn't say you have to break things down into small fields; I think people often get confused about this. RDBMSs usually only come with basic datatypes defined (integer, text, date/time, etc.) but it's perfectly acceptable to have field types of "list of integer" or "set of text" or "mapping of text to pair of integer and string" (yes, I generally code in C++, so STL structures come to mind). Having a field type of "XML stuff" is also acceptable.
The key element here is, however, the claim that you don't control the format of the data you're receiving. Yes, you can use XML-only tools on your documents because they're all known to be XML documents. But if you truly don't control the input, shouldn't you also have to deal with PDFs, TXTs, TIFFs, etc.? The point is that you do control your input, you have a baseline spec to deal with. In fact, you might have more: you might require all XML documents to have, more or less, the same structure. Do you? If you do, then that's an extra assumption you can use to your advantage. Every time you make such an assumption, you're working toward normalizing your data.
Abstractly, it would be just as appropriate to require all documents (particularly in the case of repair manuals where there are obvious patterns) to be in a very specific format, relational even. Why not? You've got them all using XML, and that's not necessarily the easiest thing to deal with -- in fact its "model" (if you can call it that) is far from simple, with a lot of gotchas (difference between putting data in a tag's attribute vs. putting it between tags.) You can't query what you don't have; just because manuals are in XML format doesn't mean you can ask "how long will this procedure take" and have it calculate the sum(step.time_required) for you, unless you actually have the data normalized. And if you've got it normalized, then the argument for XML ("it's not normalized and can't be") falls apart. The only reason you can ask how often certain part numbers are mentioned together in the same procedure is, specifically, that you know how and where to find part numbers (not just numbers of any sort) in the repair procedure. To do so, you've got to have assumptions about your document. Assumptions lead to normalization in a rather straightforward manner.
But to be clear to those who might think us confused: XML is physical (file format), relational is logical (data model). Speed is physical, features are logical. I've seen fast SQL and I've seen dog-slow SQL, just as I've seen fast and slow sorting algorithms, memory-management algorithms, etc. Speed, in general, is easy to improve. What's not easy to improve is a feature-set when you've locked yourself in. And that's why the relational model is important: it's logically, mathematically proven. Its operations are well-defined.
I'm not sure what we'd try to convince each other of at this point. Pretty much just talking past each other. As you say, convincing seems unlikely.