With XML, is the Time Right for Hierarchical DBs?

← Back to Stories (view on slashdot.org)

With XML, is the Time Right for Hierarchical DBs?

Posted by Cliff on Sunday November 18, 2001 @07:40AM from the digital-evolution-of-data-storage dept.

DullTrev asks: "The hierarchical database model existed before the far more familiar relational model. Hierarchical databases were blown away by relational versions because it was difficult to model a many-to-many relationship - the very basis of the hierarchical model is that each child element has only one parent element. However, we now live in a web world that demands quick access to a variety of data on a variety of platforms. XML is being used to facilitate this, and XML has, of course, a hierarchical structure." Do you think a hierarchical database would really be a better answer for storing XML data over the existing relational counterparts?

"There have been some pushes to create pure XML databases (info on XML in connection to databases is here and info on XML database products is here) with claims that as they support XML natively, they can offer many advantages over relation databases.

Some of these claims include speed, better handling of audio, graphic and other digital files, easier administration, and handling of unexpected elements. Software AG, a German firm, produce and sell a suite of XML products, including Tamino, a native XML database. They have lots of information on why they think there database is great, not surprisingly, but no benchmarks. So, do the Slashdot community think that with XML the time has come for hierarchical databases? Or is it better simply to use a relational database that can output in XML, or script your way to achieve the same goal?"

7 of 276 comments (clear)

Min score:

Reason:

Sort:

Reversed Question by devnullkac · 2001-11-18 07:47 · Score: 5, Insightful

From a purist perspective, I suspect the question is actually reversed: we shouldn't be talking about "XML data" is if it was somehow the core representation. It's usual intent is as a transmission format and, as such, needn't correspond directly to the organization of the source data.
Rather than discard the advantages of relational and object databases, should we instead ask how XML can be used to represent those kinds of relationships?

--
What do you mean they cut the power? How can they cut the power, man? They're animals!
Why relational databases dominate by coyote-san · 2001-11-18 08:14 · Score: 5, Insightful

Relational databases didn't come to dominate the database market because they pushed aside equally valid alternatives, they dominate the market because relational databases implement relational calculus. Indeed, that's the very touchstone that distinguishes relational databases from something like DBM and its many descendants.

And *that* is important because it assures the desiger and user that every possible operation is well-defined and (hopefully) correctly implemented. The exact syntax for a "join" may differ, and a specific implementation may be flawed, but everyone agrees to a common baseline.

For hierarchial databases to really take off, they need to have an equally strong mathematical underpinning. For now, AFAIK, there is none other than that you get when you map a hierarchial database into relational tables and use exactly those relational properties. That's a good start, but if you're only using the properties in relational databases, why not stick with them?

As for XML, that's completely irrelevant. It's a good format for transferring data, but that's about it. You can store hierarchial data in an XML file, but you can also use it to store purely relational data or completely unstructured data (in some CDATA block).

--
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
1. Re:Why relational databases dominate by Zeinfeld · 2001-11-18 13:39 · Score: 5, Insightful
  
  Relational databases didn't come to dominate the database market because they pushed aside equally valid alternatives, they dominate the market because relational databases implement relational calculus.
  That's rubbish. Back in in the 1960s when the first relational databases emerged nobody had a formal specification for a relational calculus. Today we can create a formal calculus for any data model, the Entity relational model is no different in that regard.
  SQL is a very 1960s / COBOL way of looking at a data structure. Most of the people using it simply do not have the breadth of experience of other data models to know its strengths or weaknesses. Most of the posts in the thread are as empty as those in an editor choice flamewar.
  The entity relationship model has been discarded by the programming language community in favor of typed set theory. Java and C# both have representations of sets, lists, etc., the only reason to use an entity relational model is to get persistence for the data structure.
  So you get this impedance mismatch and a pile of code whose sole purpose is to rewrite the data structures used in the program so that they match the data structures used in the persistence store.
  What we need is a persistence store with a data model that matches our programming language data model. Unfortunately most of the attempts to do this are half baked. All it should take is to add transaction statements into the language so that you declare a procedure to be transactional, it will be all or nothing.
  Unfortunately Sun made a pact with Oracle over Java and so they have remained stuck in the obsolete SQL world. C# looks to me to be a much better opportunity, Microsoft has little to lose from unifying the data model of the language with that of the persistence store and everything to gain.
  
  --
  Looking for an Information Security student project suggestion?
  Try http://dotcrimeManifesto.com/
2. Re:Why relational databases dominate by tim_maroney · 2001-11-18 15:13 · Score: 5, Insightful
  
  So you get this impedance mismatch and a pile of code whose sole purpose is to rewrite the data structures used in the program so that they match the data structures used in the persistence store.
  
  Exactly. What's more, this pile of code takes months to write even for a few dozen object types; it doesn't understand the idea of dependencies between objects so you have to add a whole layer to make sure that objects get persisted in the right order; it's incredibly hard to change, so the system design can't iterate; and simple objects like collections proliferate tables to the point of significant performance losses. It's a terrible way to build a software system unless the user model just happens to be adequately modeled by a fill-in-the-blanks table.
  
  This is why serious applications traditionally roll their own file formats. It's actually less work to manage most data models from scratch than it is to map them into the straitjacket of a relational database. Custom file formats serve in essence as hand-rolled object databases. Unfortunately, the rise of the three-tier client-server architecture has made the RDBMS layer an unquestioned assumption, with the result that modeling two dozen object types winds up generating over 50,000 lines of convoluted, slow and buggy source code. Modeling the same objects from scratch on a custom B-tree would take less than one fifth the code size. Doing it in a good ODBMS would be almost as trivial as specifying the data structures in XML.
  
  On my latest project, we ran into a strange issue when specifying the user interface of a discussion system. The designers wanted to mark read and unread messages per user -- in other words, functionality critical to providing a friendly user experience, which rn had fifteen years ago. The engineers hit the roof and said it was impossible. It turned out the reason was that this is an intrinsically hard problem on an RDBMS, although it's a trivial problem to solve in a hand-rolled .newsrc text file. Over the course of the project we ran into tons of these issues, and the interface design took a severe beating because of compromises to the limitations of an RDBMS back-end.
  
  Tim
Repeat after me ... by Serpent+Mage · 2001-11-18 08:32 · Score: 5, Informative

XML is not a magic bullet. Relational database won out over the Hierarchical model for a lot of reasons. For instance, there exists a number of integrity constraints with the Hierarchical model such as

1) No record occurrences except root records can exist without being related to a parent record occurrence. This means that
a) a child record cannot be inserted unless it is linked to a parent record.
b) a child record may be deleted independently of its parent however, deletion of the parent record automatically results in the deletion of all its child and descendent records.
c) the above rules do not apply to virtual child records and virtual parent records.

2) If a child record has 2 or more parent records from the SAME record type, the child record must be duplicated once under each parent record.

3) A child record having 2 or more parent records of DIFFERENT record types can do so only by having at most 1 real parent, with all the others represented as virtual parents. IMS limites the number of virtual parents to 1.

In addition to these flaws, relational databases have had over a decade to become mature, optimized, and enterprise scalable. Harddrive partitioning for such databases as oracle work out perfectly with the cylinder, sector, and tracks of a hard drive to allow for the fastest read/write times as can be possible.

Too often people see that XML "can" do so many things and decides that it should be the way things are done but XML is NOT a magic bullet and just because it has the potential to do something does not make it the best methodology for doing so.
Some thoughts... by Coventry · 2001-11-18 09:10 · Score: 5, Interesting
I have been struggling with these issues for awhile now, for various reasons. Why? Because I like Zope, but am, like most developers, more comfortable with relational data structures.

Zope uses an object database known as the ZODB. Some forms of many-to-many relationsships and such can be handled via the use of selection and multi-selection properties, which are designed to distinguish between a selected element and the list of available elements. The list of elements can be derived from a property on the current object, a property on a parent object, or be created via a method call - allowing for non-traditional (for OODBMS) cross-linking of objects. Of course, since this sort of thing is a workaround, no true relational links are created... 'Soft Relations' may be ok for MySQL, but in big application development, relationships must be enforced! Thus, the big-boys in RDBMS all enforce foreign keys (mysql does not)...

Of course, I've found that by careful creation of object heirarcies, very complex applications can be created on top of a OODBMS that are in fact more robust, in some ways, then the relational couterparts. The Bigest hurdle (Short-term) I see to OODBMS (including ones based upon XML [the ZODB can export objects as XML but they are stored differently internally]) is the lack of a true query and data manipulation language - like SQL. Sure, OQL exists, and is even technically a standard, but it A) sucks and B) is geared towards large java applications with huge amounts of active objects, not general purpose OODB queries. Thus, without such language, OODBMS are all disimilar in how one queries and creates/updates data, and in many cases, the only interface is a truely procedural one! Thus OODBMS are forced to use proprietary tools, and are locked into one system - not to mention speed of development (something normally associated with OO development and OODBMS in general) is hindered by the excessive amount of procedural calls one needs to simply query thier data...

Recently, an add-on to Zope addressed some of these issues. Called 'ZOQL' - it uses a SQL like syntax and allows for very discrete querying of the ZODB (something one had to do programatically using the 'ZCatalog' before) with all of the familar aggregate and comparison operators SQL users love... Of course, this _still_ doesn't address the issue of soft-relationships:

I think the bigest hurdle to OODBMS in the long term (tools like ZOQL are interfaces to existing systems, thus can be mplemented easily) is the lack of handling relationships. It seems that most RDBMS force a developer to think in Relational terms about the data, and most OODBMS force you to think in terms of objects... Most problems can be mapped to either of these domains, but you are forcing the data-model-type onto the problem. What is needed is a hybrid system, an 'Object-Relational' DBMS. This is to say that OODBMS system makers desist with the traditional OO idea that relations are of the following types:
- Object A is a Object B
- Object A Has a/many Object B(s)
What RDBMS systems excelled in (and thus fell into pupular use for) was ease of management and allowing common data to be moved and grouped. A 'Look-up Table' - for instance, which simply holds a list of common data (an enumerated list) and can be centrally maintained is a Boon in the RDBMS world. For example, you have a lookup table of car manufactureres, and one of them changes its name... Instead of updating all N Cars that are made by the manufacturer, you simply update the single record in lookup table. Since each car would have somehting akin to a 'Manuafactuer_ID' column linking it to the lookup table, the Cars belonging to the manufacturer are all taken care of.

How does one do this in a hierarchal system? Well, the easy answer would be that each manufacturer object contains all the cars that manufacturer makes. Simple, right? WRONG. Why?

Because each car also has a body-type (compact, sedan, SUV, truck, van, etc...) - which in a relational database would simple by another lookup table, but in an OODBMS poses data management issues. Do we put body-type higher then manufacturer? If so, then we have to maintain the list of manufacturers for each body type, causing headaches. Or do we put body-type below manufacturer, causing us to need to maintain a seperate list of body types for each manufacturer - these lists of course need to match exactly if we ever plan on being able to search or do reports based upon all cars of a specific body type.
Sadly enough, this sort of seperate-enumeration-relationship isn't implemented (well) in any OODBMS I've found.
Take the ZOBD for example, its selection and multiselection lists Try to handle this situation, but fail because relational integrety is not maintained! That is to say, behind the scenes it's not a true reference to a value in the enumerated list, but just a text entry representing a value in the list. If the value in the list changes, the selection-property does not update, leaving you with the equivilent of MySQL's bastard-children, the orphaned records.
This sort of soft-relationship handling is Ugly and BAD for maintainaility, but OODBMS users are faced with two ugly choices each time they map such a relationship: Do I store this as a plain-text property and just update N records each time this changes, or do I map it into the hierarchy and deal with the headaches incurred by doing so...?

I don't think I've answered the question, but hopefully I've at least shed some light on the subject for members of both the OODBMS camps and RDBMS camps... Now if only a useful ORDBMS were to come along...

(Note that PostgreSQL and some other RDBMS actualy can be used in a semi-OO manner, but this is usually reserved for inheritable structures of data to be used for specific extensions to the data model - thus the SUV table inherits from the Cars table and adds some columns - but all other relationships SUV has will still be relational)
--
man is machine
Data storage format is irrelevant by dgroskind · 2001-11-18 11:00 · Score: 5, Insightful

XML may be hierarchical but the data it is used to markup is not necessarily hierarchical. For instance, XML can be used to markup conventional fielded (flat file) data to serve as an interchange format.
More importantly, XML is used to impose some structure on inherently unstructured text. The structure it provides is based on some assumptions of how the data will be used or how it will be presented. If the data is used in some otherway, the markup can be useless.
An example is a book. For XML purposes, it can be described as structured by chapter, section, subsection, and paragraph. For information purposes, tags are assigned to represent the ideas, terminology, names and other index-like content. There is virtually no structure in these index type of tags but they convey the most important information in the book.
Or not. These tags are assigned based on assumptions about what readers are interested in. A different set of assumptions would produce a different set of tags even thought the structure of the document would stay the same. If the sentences and paragraphs are shuffled and exerpted for some other publication, even the structure becomes irrelevant.
How this inherently unstructured information is stored is relevant to how it is managed, that is, how it is backed up, how access is controled, how changes are tracked. However, when it comes to putting the information to some useful purpose, it is the retrieval mechanisms that are important. The issues here are how easily the user can specify the type of information he wants and how accurately the mechanism can find it. This process is usually independent of the underlying structure and uses some higher level concepts of relevance and context.
The question of whether to use a hierarchical, relational or object-oriented data structures misses the point for textual data, for which XML is commonly used, because none of these structures capture meaning.
Topic maps make a heroic stab at capturing meaning in XML markup but still only within a set of assumption. I suspect a true meaning markup language is theoretically impossible, or at least theoretically very far in the future.