Choosing the Right XML Database?
Saqib Ali asks: "Later this year, I will be starting a project, that will involve storing XML data in a database. I understand why a Relational DB is not a good choice. I also understand why a pure OODB like Objectivity is not a good option either. So I started doing some research into various XML DBs like Apache Xindice, exist-db, Oracle 9i, and others, but I am unable to decide which XML DB to use. What criteria should one use when evaluating whether an XML DB will be a good option for a particular application? I would prefer using an Open Source solution. Initially my application wil involve storing reports in an XML repository, for retrieval via XPath, but the reports will get larger with time. Any suggestions on how to decide which database to use?"
To pick the right database, you need to analyze the structure of your data and the operations you intend to perform on it. XML is a useful general format for interchange of serialized data, but just because you have some data represented in XML doesn't mean you should store it in XML. What is the structure of the data? What will you do with it? Why is a relational database or a object database a bad choice for your application?
Thre are the things I would question first:
....
... it was very fast ... and well, you programmed your XML manipulation by directly manipulation "virtual" DOM trees inside of the DB. In SQL and in a relational DB, of course.
:-)
... XML. It makes no sense when you think you need to use XML because of hype or something ....
a) does it use XQuerry/XPath to access the DB or an other standard way or is it proprietary?
b) does it support your programming language of choice?
c) Where do you get fastest a running prototype?
C) is the most important point IMHO. If you have chosen the right DB you only know AFTER you have implemented your application. (( well, you can try to find test cases and try to predict if the DB is the right one by trying to scale tests up)) Note: I used the word try several times, because such an approach is only trial and error.
Ok, if you can just start coding, that was point c), and a standard like a) is supported, then you should be easyly able to hide the actual DB behind an suiting interface.
b) is only a matter of your flexibility
I would guess the appliacation has more constraints which will likely limit you or challange you to overcome than the DB used behind it.
I once read an article in a german magazine, they have put a DOM writer and a DOM reader as stored procedures into an SQL data base.
And all XML was stored in a few tables, element, attribute and such
So much to "relational" wont fit your needs
Regards,
angel'o'sphere
P.S. You gave not many hints why you need an XML database. A XML database makes only sense if your natural document format is
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Well, if you're just sticking the entire XML document into the table like a blob of text, then yeah, there's no problem. But then you can't really do anything with it other than retrieve it (i.e., you can't run a query on parts of it).
.. etc.. you have a complex many-to-one structure you have to store in multiple tables .. how would you do it? Well, you'd make a document table, a chapter table, a sentence table, and link them all together with unique id's .. etc.. you get the point I hope, that the XML doc's rich structure has to be decomposed into rows and columns.
But if you want the database to be aware of the *structure* of the data, you have to decompose the data into pieces, stick them in various tables, keep the integrity between the tables, and, oh yeah, write some code to convert the data back into XML when you want to get the whole document.
For instance if you are storing an XML document that's made with one-or-more Chapters, Paragraphs, or Sentences and each Chapter has one-or-more Paragraphs, and each Paragraph can contain Sentences
XML databases take care of this automatically and also can *index* the various parts of the document so that queries (XPath or otherwise) run faster (i.e., give me the documents that contain sentences beginning with "Hello").
XML databases take care of this automatically
Take care of what? Parsing? That is a parser, not a database. How about a specific example.
Relational is pretty flexible if you just know how to use it. (I agree that existing commercial relational systems could use some adjustments, but lets not throw the Cray out with the bathwater.)
Too much of this XML database stuff sounds like a return to the "navigational" databases of the 1960's. Do we really want that? Dr. Codd rescued us from those. Now you want to be un-rescued?
Table-ized A.I.
The relational model is a logical model and I challenge you to find any example of data that cannot be represented quite easily in the relational model. In your example, you have traded any notion of data integrity for what you assume will be faster data access. In fact, since the relational model makes no recommendations on how data is physically stored, this is not necessarily the case.
How would XPath enforce your rules on how parts can relate to other parts? Why don't you just try flat files and grep?
XML is a perfectly acceptable means of data representation. It does not however by any means provide a formal, coherent theory of data management.
I really hope you were kidding about teaching anyone anything. You have a lot of learning to do.
You can store anything in a SQL database, but you do have to take the time to design it and migrate the data as the schema changes.
Spending lots of time and money designing a system that the customer can not imagine is a waste of money, because you will have to change the design as the business units focus on what they want, normally after they see your initial results.
Sometimes you have to use duct tape.
I have one app in production that uses XML files as data stores. There are about 24 users. I also have apps in production with 1000s of users that use 20GB+ SQL databases. I use apps that utilize lisp dumps for fast read only datastores (in addition to emacs). There is a place for everything.
Joe
Joe Batt Solid Design
In my observation, if you don't have enough info to create a starting schema, then you need to do some more analysis.
This is exactly the problem. How do you get any analysis if the customer doesn't know what to ask for. Applictions evolve. The flexibility offered by an unstructed data store like XML lets you eveolve the data model like the rest of the application.
You gloss over the hard part with "etc..." Attributes or even structured child tags can not be anticipated and built into the schema or else by the time you do then you've just built an XML database.
Joe Batt Solid Design
If you look at the needs of most "complex hierarchical structures", it often turns out that trees are the wrong "structure" to begin with
What about cases where entities can contain instances of *themselves*? Or where the depth and width of the nesting is not necessarily known up front?
You end up creating these artificial "id" fields, and in so doing build a "tree" on top of the relational database, which is a very silly thing to do.
And what about cases where ordering of contained elements is important?
We would probably have to explore a specific case study to settle this.
I don't think it can be "settled". Apples and oranges, and all that. Sure, you can *force* a relational database to represent anything, it's just often like putting a round peg in a square hole, particularly for engineering, scientific, and mathematical problem domains.
If you wish to take this elsewhere as to not bloat up slashdot, I would be happy to
Take it where? I suggest your journal.
The worst case I have had was one in which ordering was important. Relational databases just can't handle that very well, being based on set theory IIRC. Sure, you can add a field containing an ordinal to represent the "order", but that breaks badly if you need to insert or remove items.