Are Relational Databases Obsolete?
jpkunst sends us to Computerworld for a look at Michael Stonebraker's opinion that RDBMSs "should be considered legacy technology." Computerworld adds some background and analysis to Stonebraker's comments, which appear in a new blog, The Database Column. Stonebraker co-created the Ingres and Postgres technology while a researcher at UC Berkeley in the early 1970s. He predicts that "column stores will take over the [data] warehouse market over time, completely displacing row stores."
Okay, at the risk of sounding stupid...
Since when is a column store database and a relational database mutually exclusive concepts? I thought that both column store and row store (i.e. traditional) databases were just different means of storing data, and had nothing to do with whether a database was relational or not. I think the article misinterpreted what he said.
Also, I don't think it's news that Michael Stonebraker (a great name, by the way), co-founder and CEO of a company that (surprise!) happens to develop column store database software, thinks that column store databases are going to be the Next Big Thing. Right or wrong, his opinion can't exactly be considered unbiased...
The name of his blog is The Database Column after all.
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
Relational databases aren't being obsoleted. Some schema design heuristics are.
0xfeedface
Is there a dual-mode db, that lets you create a row-based or column-based "table"? I imagine cross-mode queries would kill performance, but at least you could have a system front-loaded with row tables, where data comes in, and then archive this data over time into the column-based tables, so that reads were fast.
stuff |
every article linked makes it clear that this is about warehousing as opposed to oltp. so is the technology dead? no - can it do everything? no
It's hard to believe that's how Micronians are made. Why don't we see it right now by having you both kiss one another?
>"column stores will take over the [data] warehouse market over time, completely displacing row stores."
Hmmmm. So if I rotate my Paradox or Excel table by 90 degrees, I have achieved database coolness? Who knew it was so easy.
Some mornings it's hardly worth chewing through the restraints to get out of bed.
...is duping himself and thus Slashdot is duping the stories by extension.
Stonebraker has been pushing the concept of column-oriented databases for quite some time now, trying to get someone, ANYONE, to listen that it's superior. While I think he has a point, I'm not sure if he really goes far enough. Our relational databases of today are heavily based on the ISAM files of yesteryear. Far too many products threw foreign keys on top of a collection of ISAMs and called it a day. Which is why we STILL have key integrity issues to this day.
It would be nice if we could take a step back and re-engineer our databases with more modern technology in mind. e.g. Instead of passing around abstract id numbers, it would be nice if we had reference objects that abstracted programmers away from the temptation of manually managing identifiers. Data storage is another area that can be improved, with Object Databases (really just fancy relational databases with their own access methods) showing how it's possible to store something more complex than integers and varchars.
The demands on our DBMSes are only going to grow. So there's something to be said for going back and reengineering things. If column-oriented databases are the answer, my opinion is that they're only PART of the answer. Take the redesign to its logical conclusion. Let's see databases that truly store any data, and enforce the integrity of their sets.
Javascript + Nintendo DSi = DSiCade
*headdesk* heaven forbid that solutions might be different depending on the situation and the data....
Nope! Apparently there is a new method and thus it must be the real One Twue Way.
You've all heard of the IBM product called DB2, right? So what was DB1? Answer: IMS, which is a hierarchical database. They were a pain in the ass to use--PSBs and all--but they were/are faster than hell and I doubt any company is going to throw them out for any reason. Same goes for relational databases. They're going nowhere. Sure, we have room for more but nobody is going to displace the RDBMS anytime soon.
No. There, that was easy !
It's like the packet of crisps that says "Is there a 20 pound note in here !!?" - the answer should always be 'No'.
Except maybe for one person.
sed -e 's/crisps/potato chips/' -e 's/pound/dollar/'
Obviously, he's biased. But more importantly, he just said that column-store databases are going to take over the WAREHOUSE market. That doesn't mean that row-store databases are going to become obsolete, because there will always be applications out there that do a substantial amount of writing as well as reading.
In fact, the new wave of user-generated-content websites and webapps seems to me to indicate the exact opposite - if anything, row-store databases, with their usefulness in write-heavy applications, should becoming, if anything, more and more necessary/useful on the web.
So...chalk this one up to some grandstanding on the part of a guy who wants to put more money in his pockets...
Meh, content is content. As long as the linked "article" is informative and sparks discussion, I'm happy.
From TFA:
"Column-oriented databases -- such as the one built by Stonebraker's latest start-up, Andover, Mass.-based Vertica Systems Inc. -- store data vertically in table columns rather than in successive rows. "
Marketing hype for his startup.
What a sleezeball.
Relational databases will be around as long as humans generate relational data. Take the classic example of an invoice that may have many entries, each entry referencing an inventory item. This sort of thing is likely to exist forever, and RDBMSes model that pretty well.
As far as whether the backend is row- or column-oriented - who cares? As long as I can use the one most appropriate to my access pattern, the implementation details just don't interest me enough to get worked into a furor. Don't get me wrong - I think that there are some neat developments in the works - it's just that I don't have a strong general preference on how my information is physically laid down on the platters.
A more interesting question for me is whether SQL is obsolete. For the most part, I'd say that it is in the sense that most people need never use it directly. We use SQLAlchemy instead of writing raw SQL, and the Java folks seem to be fond of Hibernate. I still look at the generated queries sometimes to convince myself that it's sane or for debugging or optimization purposes, but if they inserted a new middle layer between Alchemy and PostgreSQL that used something completely different, our code wouldn't notice the difference.
Dewey, what part of this looks like authorities should be involved?
Column stores are great (better than a row store) if you're just reading tons of data, but they're much more costly than a row store if you're writing tons of data.
Therefore, pick your method depending on your needs. Are you storing massive amounts of data? Column stores are probably not for you...Your application will run better on a row store, because writing to a row store is a simple matter of adding one more record to the file, whereas writing to a column store is often a matter of writing a record to many files...Obviously more costly.
On the other hand, are you dealing with a relatively static dataset, where you have far more reads than writes? Then a row store isn't the best bet, and you should try a column store. A query on a row store has to query entire rows, which means you'll often end up hitting fields you don't give a damn about while looking for the specific fields you want to return. With column stores, you can ignore any columns that aren't referenced in your query...Additionally, your data is homogenous in a column store, so you lose overhead attached to having to deal with different datatypes and can choose the best data compression by field rather than by data block.
Why do people insist that one size really does fit all?
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
The next big thing in DBMS:
turning your head sideways.
You can't talk about Wikipedia's flaws on Wikipedia
In my IT business, a vast majority of our top tier clients (grossing over US$100 million annually) are still using antiquated software that is still using a relational database backend. While these companies are generally VERY efficient in terms of providing services or products to their market, their accounting, purchase orders and project management software is decades outdated. Many of the companies that maintain these packages have merely made the interface more current (but still 5+ years old, but are still using terribly outdated software. I can't begin to tell you how often the words "FoxPro" and "MS SQL" come up and it ends up being a relational database "solution" or even worse.
It is very frustrating because we do have programmers on staff that create third party plug-ins to these databases to try to make solutions that the OEM code doesn't. When you meet younger programmers, many of them are frustrated themselves to work on ancient solutions that have no hope of being upgraded, because these industries we work in are not in a rush to try anything new and shiny, but instead are happy with the status quo.
I just bid a job a few months back that would cost $150,000 to upgrade their database infrastructure, and likely save the company $300,000+ annually in added efficiency, less downtime, and a more robust report system. Guess what they said? "We all think it is fine the way it is." That's money thrown out the window, employees who are frustrated (without knowing why), and forcing the company to lose efficiency by not being able to compete with newer companies that are utilizing newer technology to better their bottom line.
Ugh.
The main problem is SQL is just a description language for set data, and a relational database is exactly that a set data.
The main problem is so far nobody really has brought out something more reable to deal with sets in a mathematical sense, you could use mathematical operators but then things would become even less readable than SQL is.
All approaches on the programming side I have seen (criteria objects etc...) make things only easier in some domains, after that you revert to plain sql and its derivates.
Elminiating SQL would mean you probably have to eliminate the data storage model of sets as well.
Even Slashdot has become obsolete.
I tried turning my Oracle server on its side to get column-store access. Strangely, I didn't see any increase in performance. Perhaps I'll try the other side...
It must have been something you assimilated. . . .
Let me get this straight by paraphrasing: Column databases are the wave of the future, says a column database distributor on his new column database blog. And Red Hat would recommend you run your new column database on Red Hat Enterprise Linux, perhaps? I wonder what brand of kit Dell would recommend I run RHEL on ...
Would you like it better with IronPython and Linq?
From the perspective of an application developer, this is pure nonsense. I practically don't care wether my DB stores data in columns or rows or whatever.
;)
What I need is a good, consistant layer that can handle object-based tree structures - nothing more, nothing less. I want to dump my Java/Objective-C/C#/C++/PHP/Python objects in some storage layer, and I want to be able to get it back, search for it, etc.pp.
Yes, relational databases are (or should be) dead for most modern application designs. But not because of RDBMs are going to be replaced by column-oriented DBMS (which is, from application perspective, no difference - IMHO), but because OODBs solves most application problems better (not that good solutions exists, yet... *sigh*)
On the other hand, I never got any master degree. Shame on me. Just won a best paper award for a paper I did not even wrote. Maybe I just don't have the wits for this stuff
Are they now officially an also-ran? Has the whole concept failed to be usefully implemented commercially, or will it be another Lisp--elegant, beautiful, and largely unused because it's kind of weird?
Anyone who loves or hates any language, platform, or manufacturer, doesn't know what they're talking about.
TRUNCATE TABLE SQL_LANGUAGE;
COMMIT;
There, feel better?
For data warehousing, a higher or different level of abstraction may be useful and make database design easier, particularly as paralellism becomes more and more common. Storing rich markup language or media in a database might be problematic as well.
But there's no way that RDBMS's are going away -- relational algebra simply solves too many data storage problems.
Conformity is the jailer of freedom and enemy of growth. -JFK
Once someone shows that there is no longer a use for any relationship between data entries, then we'll be able to say that RDBMSs are obsolete. Actually both headlines (/. and the linked article) are mistaken about what Michael Stonebraker is saying. He is talking about read intensive applications mostly and he is talking about optimization of data for reading purposes. This does not mean that RDBMSs are obsolete for all uses, just that he sees a faster way to retrieve data for certain uses.
You can't handle the truth.
SenSage built a column-oriented DB in 2001 and has had much success with the approach for their fast-input, fast-query, high-density, multi-TB databases. Stonebreaker was on their technical advisory board. Interesting that he now centers his own startup on the same principles. See http://en.wikipedia.org/wiki/Column-oriented_DBMS.
sed -e 's/crisps/potato chips/' -e 's/pound/dollar/' -e 's/note/bill/' -e 's/packet/bag/'
Not A Sig
sed -e 's/pound/pound note' -e 's/dollar/dollar bill'
To be applied to your sedscript. That's right, I sed'd your sed.
Vertica's getting a load of it free.
Hey, you never know, some good may come of it. Maybe some of the "Nobody ever got fired for buying Oracle/SQL Server/DB2" people will have to explain why they're using inappropriate technology.
Deleted
There, finished that off for ya. :)
This will be the year of the column based database.
If you mod this up, your slashdot background will turn into a beautiful sunset!
Staying OT with you, I do not agree with your opinion on SQL in general, although using it for basic CRUD in an object-oriented application is tedious and bug-prone. For that, I warmly recommend the up-and-coming Python-based ORM called SQLAlchemy. See: http://www.sqlalchemy.org/.
Along with Procedural Programming, this could REVOLUTIONIZE the software industry!!
Yep, I'd say they're obsolete given the wild success of their replacement(s)....oh...wait....
The Dvorak keyboard is more efficient by a factor of 10 and you don't see it taking over the keyboard layout landscape.
Just because something is "better", even in technology, doesn't mean it's going to take over.
I've also lived through the decline of mainframes...still around. The internet was going to replace faxes...I still have a fax machine.
Linux is better than Windows, columns are better than rows but I wouldn't get all a-twitter over either of them just yet. Particularly from someone selling column based data stores.
That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
I can understand people not reading every link on a slashdot article they comment on. But if you post the bloody link, is it too much asked to actually RTFA?! It's an article about a column. The actual column is quite interesting.
To add some content, this is about optimal storage for SQL databases in a data warehouse context where there are some interesting products that use something more optimal than the one size fits all solutions currently available from the big RDBMS vendors. The API on top is the same (i.e. SQL and other familiar dataware house APIs), which makes it quite easy to integrate.
Regarding the obsolescence question, one size fits all will be good enough for most for some time to come. Increasingly people are more than happy with lightweight options that are even less efficient on which they slap persistence layers that reduce performance even more just because it allows them to autogenerate all the code that deals with stuffing boring data in some storage. Not having to deal with that makes it irrelevant how the database works and allows you to focus on how you work with the data rather than worrying about tables, rows and ACID properties. Autogenerating code that interacts with the database allows you to do all sorts of interesting things in the generated code and the layers underneath. For example, the hibernate (a popular persistence layer for Java) people have been integrating Apache Lucene, a popular search index product, so that you can index and search your data objects using lucene search queries rather than sql. It's a quite neat solution that adds real value (e.g. fully text searchable product catalogs are dead easy with this).
Column based storage is just an optimization and not really that critical to the applications on top. If you need it, there are some specialized products currently. The author of the column is probably right about such solutions finding their way into mainstream products really soon. At the application level, you'll still be talking good old SQL to the damn thing though.
Jilles
Exactly. If you're using Python, you're not allowed to complain about SQL because there are good alternatives. Besides Alchemy, Django has a very nice object mapper of its own. Both of those have progressed to the point that writing raw SQL is simple unnecessary for almost any application development.
Dewey, what part of this looks like authorities should be involved?
Maybe his approach is all wrong. The database my company uses has MANY tables that are rarely written to, but a few that are written to ALL the time. Instead of trying to cram his 'one size fits all' database scheme down our throats and replace the current 'one size fits all' database scheme, maybe he should be trying to create a database engine that can do both.
I think you would have to determine the main use of the table beforehand (write-seldom or write-often), but the DB engine could use a different scheme for each table that way. I know some will claim that it can't be more efficient to split things this way, but remember that this guy is claiming 50x the speed for write-seldom operations.
As for Relational Databases... How is this exclusive to that? This is simply how the data is stored and accessed. If he is claiming 50x speed-up because he doesn't deal with the relational stuff, that's bunk. You could write a row-store database with much greater speed as well, given those parameters.
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
'"In every major application area I can think of, it is possible to build a SQL DBMS engine with vertical market-specific internals that outperforms the 'one size fits all' engines by a factor of 50 or so," he wrote.'
I know very little about DBMS systems, but I thought it has always been true that you can achieve monumental performance increases by building somewhat specialized database systems in which the internals of the system make assumptions, and are tied to, the structure of the data being modelled. In fact, when RDBMS systems came in, one of the knocks on them was that they were far more resource-intensive than the hierarchical databases they displaced. However, the carved-in-stone assumptions of those models made them difficult and expensive change or repurposed.
I'm sure I remember innumerable articles insisting that "relational databases don't need to be really all that much terribly slower if you know how to optimize this that and the other thing..."
In other words, as an outsider viewing from a distance, I've assumed that the increasing use of RDBMS was an indication that in the real world it turned out that it was better to be slow, flexible, and general, than fast, rigid, and specialized.
So, what is a "column store?" It sounds like it is an agile, rapid development methodology for generating fast, rigid, specialized databases?
"How to Do Nothing," kids activities, back in print!
While I didn't particularly pay much attention in my database class, or go to it that often, from my current work with databases and a quick skim of some definitions for RDBMS, it strikes me that a good portion of people in this thread and the articles are using the term RDBMS incorrectly. (Though some of the posts appear to be in agreement with me) As far as I can determine RDBMS is solely (and simplified) the concept of relating data between different tables to decrease the repetition of said data. It's a method, a widely applied method, but just a method, not an actual type of database storage.
Sooo, wtf does RDBMS have to do with storing data with either columns or rows in a file?
"Now you know, and knowing is half the battle!"
Traditionally perl-objects are hashes with one blessed hash per instance. The hash contains all the instance variable values using their names as keys.
instead one can use blessed scalars holding a single integer value for instances and let the class variable contain all the instance data in arrays indexed by the instances scalar value.
This technique was originally promoted as an indirection to protect object data from direct manipution that bypassed get/set methods. But it also allows the object to be either row or column oriented internally. that is the class could store all the instance hashes in an array indexed by the scalar. or it could store each instance variable in a separate array that is indexed by the scalar value.
Thus the perl class can, on-the-fly, switch itself from column-oriented to row-oriented as needed while maintaining the same external interface.
Of course this is not a perl-exclusive feature and it can implemented in other languages. It just happens to be particularly easy and natural to do in perl.
Some drink at the fountain of knowledge. Others just gargle.
I would hardly use the word elegant to describe object databases? They seem elegant, right up until you try to version a schema or point a reporting tool at them. Interesting, yes. Elegant, hell no.
The near future. Mr. Stonebraker walks into a store.
Mr. Stonebraker: How much are these plums?
Checkout girl: Plums? They're $0.99, $1.39, $12.49, $15.99, $26.38, $13.37...
-Rob
Biblical fiscal responsibility
Stop doing it.
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
I can see how column stored data is faster for querying, but for these benefits, it will be slower for writes, and the quote claims that databases this way will replace row-format databases. In many cases it may, however, there are many write intensive applications. Additionally the index features of current row-store database engines is pretty much equivalent to column store, (including possible indices on column store tables.)
Ideally a RDBMS would allow a admin to select the primary storage strategy. To me, ideally lookup tables could be column and transaction tables could be row... seamlessly. And as mentioned previously, the term RDBMS really is used incorrectly in the summary, though I suspect it was meant as "traditional RDBMS"
The relational concept will still exist regardless of the underlying storage methods.
You are being MICROattacked, from various angles, in a SOFT manner.
IMHO, databases would be much, much different today if IBM had extended the VSAM file type from the mainframe to the PC (for the uninitiated, think of a combination flat-file table with one built-in index). In my experience, the vast majority of database requirements of office workers are simple, so simple that even applications like Access are overkill. What do people use a database for? Scan through an entire table looking for one specific piece of data using only one key. VSAM files would be perfect!
And things like the column-based database table (which strangely enough strikes me as an attempt by a spreadsheet user who naturally thinks sideways instead of down to create a sideways-oriented database) is even greater overkill.
Besides, everyone knows that the database query users *really* want to run is: "SELECT * FROM *", and get upset when told they can't do it. That's another reason they prefer flat file databases to relational. It's easier to print the data out on fan-fold paper and do searches the old fashioned way.
"My country, right or wrong; if right, to be kept right; and if wrong, to be set right." --Senator Carl Schurz (1872)
Sorry, but I still believe in B-Trees, ISAM files, fseek(), ftell() and text configuration files. Every time something new comes out somebody calls the old thing dead. I think a huge # of uses of databases could be handled by the techniques previously mentioned but we hit up Mysql causes its easiest to get going. Will this ever end? This is one of the main reasons I see for people getting that shiny new 3.2Ghz system and seeing that it isn't really that much faster than the 200Mhz Pentium. Lemmings use the newest ideas which are actually take more computing power to use but less brain power to implement. Not actually sure on this technique but I bet if I dig I'll find it.
I would love for a DB to store in rows when I have data that gets written to a lot but where the data is mainly just read I would love it to be stored in the column format.
I would select it based upon the table that way I could control the data easily without making it too complicated.
This would give the best of both worlds IMO
On the contrary.
... etc
From a standard 3rd generation programing language one can read and write into flat files and we can do close to this with a hierachical database.
We lose this with relational databases because the way the database organises data has no direct mapping to the way it might be set up in a standard programming language.
What this means is that every transaction to and from the database must go through a literally horrible re-mapping. IE. The language data structures do not correspond to the RDBMS data structures and visa versa.
As an example - in postgreSQL the last I looked at writing a simple row into a table where there were something like 100 columns in the row...
In the 3rd generation programming languages this was just a simple structure with 100 entries.
The data transfer from that structure generated a function call with more than 1000 parameters. This was to be mapped and re-mapped with each call to transfer data, this is even though the structure itself is static and determined at compile time.
Next: There were about 10 parameters per field (column).
1: Column name
2: Column name length
3: data type
4: data length
5: character representation
finally 10: Address where the data lives.
The thing is such a table could be set up very easily and populated with a simple loop that rolls in the required values via say a mapping function with about 10 arguments. This could be done ONCE at run time to prepare for the transfer of data and then the same table could be referenced for each call and simply an address could be sent with the transfer.
Noooo.. It was dynamic and the data was encoded as parameters on the stack. This means the stack must be build and torn down and rebuilt for each call.
Next - the implementation was so bad that the program would run in test mode with only a few parameter but it failed when the whole row was to be transfered.
I gave up on that interface.
---------------
Oracle had pre-compilers. They did the same damn thing. The code generated by the pre-compilers was just awful.
---------------
While there is much good to say about RDBMS's in general. The issue I ran into was the interface from 3rd generation languages took a HUGE step backward. IMHO we should have a high level language statement called DBRead() and DBWrite(). In C this should generally correspond to fread() and fwrite(). If this is too complex then DBWriteStruct() could be implemented with suitable mapping helper function.
Nooo...
In the old days one could read and write into a flat file at a given location with a single statement or function call depending on the language. Of course "where" to read and write became a real issue and I do fully understand the complexity of file based tree structures and so forth, especially since I wrote a lot of code to implement these algorithms.
The thing is now we have RDBMS and other solutions that give us the data organisational abilities we need - and we lose the ease of mapping these structures into a suitable structure or object in the programming language.
I for one do not think we have stepped forward very far at all.
-------------
I'll toss in a case in point made by a good buddy of mine who just happens to be one of the top geophysical programmers in this city.
One of his clients was running an application hooked to an Oracle database running on a fast SUN. Run times were measured in close to a day.
Finally they removed the Oracle interface and replaced it with a glorified flat file. They clearly built in some indexing. The result is the run times dropped to under 20 minuets.
As my buddy says - He will NOT use any RDBMS. He can take 5 views of the data comprising 1000's of seismic lines and the user can click on any trace number, line number, well tie and so forth and in real time he can modify all views of the data on as many as 5 s
This guy isn't really making a prediction, he's just generating publicity for himself, and that requires a gimmick. Creating the gimmick is easy, and it's done like this:
First, pick a dualism. Any dualism. Here's some examples:
1) You can type in a word processor, and you can type in a spreadsheet.
2) You've got computers, and you've got networks.
3) Skirts can be short, or skirts can be long.
Now, make the unsupported claim that the dualism is really a continuum. It might be true, or it might be bullshit. Like this:
1) Things you can type in resemble word processors or spreadsheets, to certain degrees.
2) You can do computing activity on an isolated computer, or you can do it on a totally networked system.
3) Skirts can be any length between long or short.
Now, assert that there is movement. Look at which end of the bullshit continuum you've created that your asserted movement is pointing at. There's your prediction for the future. If you hype that like a pro, you too can be a tech pundit on Slashdot or even somewhere important.
Here's the predictions that result:
1) Over time, word processors are going to become more like spreadsheets. Entire sections of documents are going to be calculated by formulas.
2) In the future, the network will be the computer.
3) Next year we're going to have short skirts. Yay!
How often do we see this kind of thing? Very often. And sometimes they can even be good ideas, worthy of making products out of. But even if they are terrible ideas, they're useful for hyping yourself.
Here's some more:
1) This article - Rows and columns. You assert that rows are more important that columns. Except in the glorious future where columns take over.
2) We've got Embedded processors outnumbering desktop computers. In the future, no desktops, just embedded processors in our skulls.
3) Paper is giving way to computers. In the future, completely paperless office!
4) Mouse, Joystick. In the future, you tilt your mouse.
5) Structured and object. Obviously, everything's going objects in the future. The death of structured programming means that loops and decisions are obsolete.
And so on.
I know this because I was a consultant. It's a basic skill.
No weapon in the arsenals of the world is so formidable as the will and moral courage of free men.-Ronald Reagan
I agree. In web development column-oriented storage is a way to go. I even tried to simulate stuff like that with PHP and MySQL, and it worked, but the maintenance of such databases was too difficult. The problem is, whenever you suggest that there might be something wrong with relational databases (for some application areas) there is always some smart-ass who accuses you of "not getting" it and promoting "outdated" hierarchical or network models. "ZOMG, u want to have efficient tree support with variable nodes? U r in teh stone age. We, on the other hand, are firmly rooted in 70s!"
"He predicts that "column stores will take over the [data] warehouse market over time, completely displacing row stores."
Can't I just turn my head sideways?
(PS - postgresql rocks)
Here's what I don't understand. Most relational databases provide the ability to create one or more indexes upon one or more columns within a database--essentially creating an alternate, column-specific file which maps column values into the row data object. In other words, an 'index' upon a column is a column-oriented database object.
So how is it that creating a column store makes reading a database more efficient, when indexing a row-oriented database essentially creates a column store alongside the row-oriented data?
The only place where I can see column-oriented databases would be more efficient is by using uniformity of datatype across the data in a column can make finding un-indexed data far more efficient. In other words, if you have an un-indexed column storing integers, the file becomes dirt-simple: it's essentially an array of values on disk, and finding the 30th row in a particular column of integers is as simple as a fseek() followed by an fread().
The differences between column and row orientation is not merely in the backend. Most column dbs support a more extensive query language that is made possible by the column orientation. They tend to be far better at ordered data (time series, symbol groupings, etc) and have extensions to exploit the ordering, such as moving average functions that aren't poorly bolted on.
SQL is also a huge part of the problem, and one of the benefits of column dbs is that is goes beyond SQL and allows more expressive queries.
Also I don't know where this poor idea keeps coming up that column-oriented tables aren't good for reading. They are often used for things like stock market data that spin tremendous amounts of data into a table very quickly, far faster than any row-oriented db could handle (especially if you need to index the row-based table).
Relational or not has nothing to do with how data is stored. Nor does it have anything to do with performance.
It's a standardized abstraction of a record keeping system. As such it has advantages and disadvantages.
The strategy behind relational databases is what we in modern parlance would call "separation of concerns". The application programmer concentrates on what he wants. The RDBMS systems programmer concentrates on storage and optimization. The DBA concentrates on tweaking and balancing the viewpoints of different users of the data.
The payoff for this is that data is no longer strongly coupled to a single purpose. It also turns out that by working on a small number of standard paradigms for years and years, the RDBMS developers can provide application programmers with systems that give them what they want faster than the app programmers would bother to make it in most instances. Certainly more reliably.
The cost is that the people who do this have to work within a common model or set of abstractions. It has always been possible to obtain faster results by throwing out the idea of a common abstraction and coding closer to the metal. It always will be. Nor has it ever been realistic to assume that the relational model would be optimal for all application domains; merely most (although we weren't so cognizant of this back in the 80s).
There are three things that have changed since the advent of the relational model, none of which obsolete the model, but do somewhat change its slice of the application domain pie.
(1) The vision of Grand Unified Databases in which all an organization's data assets live seamlessly has proven to be marketing malarkey. It's proven necessary to separate transaction processing from analytical processing of course, but even more fundamentally organizations have structure in part to hide irrelevant or sensitive data or to control the scope of policy making in various areas. Even though some vendors (notably Oracle) are somewhat better at creating massive database management systems with distributed administration, in the end the vision of enterprise GUDs would never have worked because people in different parts of an organization can't spend their lives in meetings hashing out each other's data models. Everybody's heard the analogy of the blind men and the elephant, but the reason organizations have parts is because it's more efficient to specialize and have a trunk expert who thinks of an elephant as snaky.
In general, the need to share data between databases is greater than we thought it would be. In part this is a consequence of the impracticality of Grand Unified Databases, but also because everybody is, in effect, networked to everybody else these days.
(2) There are new application domains which involve processing huge amounts of data for relatively narrow sets of operations. Examples would be DNA databases in bioinformatics or document fingerprinting for web indexing. These new applications don't diminish the need for relational databases in their traditional application domains.
(3) There's just too much friggen stuff for people to learn. I've worked with people out of master CS programs who could do amazing XSLT transformation, could draw UML for common design patterns blindfolded, but could not give a cogent explanation of what NULL means in a relational database system. The so called object-relational impedance mismatch has always been exaggerated in my opinion; the problem is finding people who really understand the relational paradigm and tools.
Overall, I think the relational model is a huge success, and will continue to be, although other models may be useful in restricted problem domains. However, the way we use the relational paradigm should be realistic.
One example is the concept of identity in the relational model. Nearly every relational database design flaw eventually boils down to getting the criteria for ide
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
You can store data internally in either row-major or column-major order, as you wish, and still present it to the user in exactly the same way onscreen. So how does this sort of internal storage change make the relational database obsolete?
I am more interested in biased facts.
Here's how I write out a customer record:
$dbh->do(
'insert into customer (id,name,yada1,yada2,yada3) values (?,?,?,?,?)',
undef,
@customer{id,name,yada1,yada2,yada3}
);
I think that's even easier than your 3rd-gen code, and I didn't have to write my own indexing code.
If i had $1 for everyone that predicted the end of the RDBMS i'd have about $20 by now... hmmm, thats not very impessive is it?
If i had $1,000,000 for everyone that predicted the end of the RDBMS i'd be a rich man by now. Ahhhh, thats better.
Seriously though, i can remember any number of dbms concepts that have come out and supposedly been potential for replacing rdbms as "the thing" for data storage. OODBMS springs to mind when c++ started to take hold, and even more so when java took off, "yeah, we need an OODBMS"
In reality, rdbms is and probably always will be flexible enough to handle almost any role. Maybe when we all have enterprise (starship that is) AI in our homes we'll have developed some form of natural data storage that will become the norm, but until then - i'll be a cynic!
I've been hearing people ringing the death bell of RDBMSs since the mid 90's. If I had a nickle for every Object Database I was forced to use because it was the next big thing... well, I'd have probably 20 or 30 cents, but the important thing is that every single one of those systems never because "the big thing". Some of the OLAP systems I've seen look promising because #1, they interoperate with relational databases nicely, and #2, they actually solve some performance problems. Still, does the storage engine really equal a paradigm shift?
So apparently the answer to the
NO.
Considering we're talking about row store versus column store, and the layout of data is dependent upon whether you are doing OLTP versus OLAP.
This really isn't my field, so I'm curious. What are some examples of non-relational database models?
When it becomes mainstream, it'll be just a checkbox I tick when I create a table on my Oracle, MS SQL, PGSQL or MySQL store to use it.
Stop with the nonsense.
I really quite like -some- aspects of M / Mumps / Cache in how it stores data. A large chunk of the healthcare industry still relies on programs which use exclusively this weird non-SQL language for reading and writing to the "database". Think of it as more like a permanent global disk-based array of arrays of data. There are no rows/columns, just an associative array which can contain any level of nested arrays or data, but is always synchronized with the disk.
Morphing Software
I don't think relational databases are obsolete, as there are a great many applications which the technology fits nicely. I do, however, think they are greatly overused. From what I can tell, people are being trained to use RDBMS anytime they need to persist data. I can't count the number of times I've asked somebody why they're using one and the response is, "It's what I know."
I read a very good statement about relational technology. It said one should not think of an RDBMS as a persistence tool. Sure, it can be used for that, but it's better to think of it as an integration tool. It should be used as a mechanism to permit the uniform sharing of information between different applications, built using different languages, running on different machines, hosted by different operating systems. That's, the statement argued, is the real power.
I'm inclined to agree. If you've got a simple blogger app that doesn't need distribution capabilities, doesn't need language-independence, doesn't need OS-independence, just needs to persist an ordered list of very basic data, an RDBMS is way overkill.
Stonebreaker's complaint is that modern RDMBS's are "one size fits all" when different approaches can outperform them in particular niches. The problem here is that real enterprises often need solutions that fill multiple roles (OLTP, OLAP, etc.) and integrate seemlessly. Having a solution that fills one role with less fuss and overhead and DBA workload than existing RDMBS's is of limited utility if it makes all the other things that need to connect to it take correspondingly more fuss than they would with a "one-size fits all" database system.
Of course, if it can be wrapped up in a way which is transparent to the user and integrates seemlessly "behind the scenes" with existing systems, this stops being an issue, but then you still have an RDBMS, just with a different underlying implementation of the storage engine. And its hardly as if optimized storage implementations for particular roles are something new to the RDBMS world.
I met Michael Stonebreaker once. He has a very very fine mind and while wha he comes up with may look like whackjob conclusions, when you hear him explain them they, and he make perfect sense.
I suggest very strongly you talk to him directly and tell him what you told us. I promise that time will not be wasted.
Need Mercedes parts ?
If I recall a commit after a truncate isn't necessary. No undo info is generated so the operation can not be rolled back.
Living in Kentucky, I can tell you a relational database would be handy if actually used. Why I've got in-laws that, well, I won't go into details. Let's just say I suspect my wife married an out of stater for fresh genes.
What? Not that kind of relational?
The world is made by those who show up for the job.
which appears to be the front-runner.
The concepts are interesting - they claim speed, and significantly lower disk-space requirements. They have a video on their site explaining the concepts (http://vertica.com/techoverview - you'll need to provide contact info to see it).
But the last time I talked to their sales representative, they did not support triggers, stored procedures, or any sort of high-availability options.
While the concept is interesting, until they roll out what are now standard database features (even MySQL has views, triggers, and stored procedures now), it won't bet taken seriously. Even then, they might go the way of object-oriented database systems (we tried Matisse without much luck - too much locking). Row-based database systems have a lot of momentum.
You seem to think that RDBMS's are obsolete. What do you think they should be replaced with? Even Stonebreaker isn't suggesting that set-theory-based math isn't the right way to go-- he is talking aobug physical storage choices (which work better in some cases), not the theoretical math basis for the information management.
LedgerSMB: Open source Accounting/ERP
What's the difference. Don't you guys feel the term relational database is actually the only thing dated here.
A "data" "base" is just a big pile of data. How you access it truly defines the database, not simply how the data is stored.
In the world of practicality we may have names for efficient ways to access the data stores, but ultimately they are just naming conventions that means little. The term relational database simply doesn't really mean anything. It's just a BS term made to describe one set of logic for access more or less the same data in the same format.
In it's most abstract and original form 'relational database' means nothing more than a file with fields that 'relate' to each other. So I'd argue that realistically, most any database can be defined relational. If in fact the fields can be shown to have a relational value in any sense to each other then it can be considered a relational database.
Therefore, this guy is obviously not the master of database knowledge or is trying to hype some new technology as so often is the case on slashdot.
Saying relational databased are obsolete is like MS saying procedural programming is obsolete because object oriented programming is the future. The reality is that object oriented programming contains procedural programming, not that they are exclusive. The same can be applied for new forms of indexing relational databases. The fact the field relate in an intelligent fashion does not change based on how your search the data. What you are seeing here is the corruption of terminology for the sake of selling a new technology. While we call standard SQL databases relational that doesn't mean other formats are not also relational even though they are significantly different at either the point of formating the data or accessing it. Saying something impressive like quantum database would still imply the fact that the field relate. A quantum database would redefine the methods of access, but none the less would not redefine the fact that databases are all relational.
This is just a term used to sell technology. All too often Slashdot focuses on this type of technological terminology.
I've heard similiar stories being blurted for years. First it was object relational data stores, then it was various abstraction layers.
Now somehow tilting a table 90 degrees will replace the 'legacy' RDBMS. Modern OLAP implementations have already surpassed whatever niche need these people think they will be filling...
... only old people use relational databases!
If you are doing this by altering your SQL, it has nothing to do with column store. Column store means, as others have noted, that the physical memory is organized by columns, not rows. Writing clever SQL that mimics a 'pivot table' is not going to alter the internal structure of the data. As a practical matter, I would achieve a 'dual mode' database by using PostgreSQL for 'row store' and Vertica when I wanted 'column store'. Then I would at least be able to write portable SQL since they use the same PostgreSQL front end. Since the front end has the SQL parser, I would be able to move my work product from one DB to the other with a maximum likelihood of success (query optimization is another story, of course). In practice, this would mean that I use PostgreSQL for transaction processing and Vertica for analytics. Sybase also offers both sorts of databases, so that would also be an option. But SQL Server doesn't support column store, see the Wikipedia article on column oriented DBMS at http://en.wikipedia.org/wiki/Column-oriented_DBMS.
Think global, act loco
Aren't relationships the core fundamental of a database?
I've got a database in about 6 shoeboxes in my bedroom. Old bills and documents. It's not relational.
Is that a better way than a relational database?
I think what we're referring to here is a mass of data with tags that are searchable. This method, while more modern, does not mean better.
Those tags have to be applied. And, they're still a relation.
I still think that, until someone presents a more efficient and reliable method, we should follow Codd's 12 rules.
They're using their grammar skills there.
Not only is the Relational DBMS dead, it is buried.
We now have the Non-relational DBMS. Data is stored in a manner not as it relates to other data, but how it does not relate to other data.
Features of the NRDBMS:
The NRDBMS is based on a non-column, non-row, non-cell oriented theoretical system outlined in paragraph 1D.10.T of the Savant manual.
Politics is the art of looking for trouble, finding it everywhere, diagnosing it incorrectly and applying the wrong fix.
Can I do this by turning my monitor on its side so the old rows are the new columns? Does it matter if I turn it left or right?
Everyone knows the best way to store data requires columns labelled with letters, and rows labeled with numbers!
If you turn the box sideways, that is all messed up. Well, I guess if you turn your monitor sideways too it would correct this... I am a genius!
I only look human.
My mother is a halfling and my dad is an ogre, so that makes me an Ogreling
- He helped created THE first relational DB.
- He later moved to creating an Object-Oriented Relational DB with Postgres in the 80's. Much of that tech has found its way into other DBs such as Oracle and even helped create the OODBMS world.
- Now, he is creating the Column store DB and announces that this will be the next big thing.
I would listen to him. Biased or not, He has a better track record than most intelligent ppl (and all the wanna-be/hasbeens; dvrack comes to mind) in the tech field.I prefer the "u" in honour as it seems to be missing these days.
I want a Beowulf cluster of column databases.
This sort of crap was dug up when "object-oriented databases" were invented,
Every little while somebody who doesn't even understand relational theory suggests that relational databases aren't "necessary", "efficient", "legacy", or some other crap.
Here are the facts: short of some sort of AI conceptual processing being developed, relational databases are the ONLY known method to CORRECTLY reflect the realities of modeling the real world as data (and even then there are problems.)
Go talk to C. J. Date or Fabian Pascal - they'll set you straight on that nonsense.
Go to the Database Debunkings site and learn something.
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
Transpose.
Done.
Profit.
42.
You get the idea.
A site like slashdot have a lot more reads than writes. Perhaps these column thingies could help here too.
It could become a new storage engine for MySql, so applications that benefit from a column store can use it instead of one of the other available storage engines.
#1: Assuming what you think your customer needs is what your customer wants.
#2: Assuming they are the ones who made the mistake when you lost the job.
I know, I know, it's Microsoft.. BUT - check out the upcoming LINQ (Language Integrated Query) and DLINQ (LINQ for databases) in the .NET languages - I think this might be a step in the direction you desire.
SixD
What a "column oriented database" (never heard that term before) sounds like is IBM's old IMS, which grouped columns together (they didn't call them columns because relational databases didn't exist back then) to make it easy to process all of the same fields for all records in the table (they didn't call it a table). For example, you could iterate through all the phone numbers for every record. Mike Murach's company still sells a book on IMS DB (two volumes, but you only need part one) if you're interested in the power of this approach. IMS was designed in the days when config files were coded in assembler and assembled into a program because of efficiency constraints, so it is major ugly to code in. Basically, IMS was a framework (again, they didn't use that name) which loaded your assembled file definitions and ran against them. The IMS query language was so bizarre it's worth looking at if you've never seen it just to see something that bizarre. (And I mean bizarre, such as a space being part of the relational operator -- you have to code a space if you use a one-byte operator like "> " instead of a two-byte one like ">="!) If someone could dust this off and make it work in a more modern way, it would be one of the most remarkable comebacks in computer history. With modern partitioning for relational databases, and the fact that you'd have to have some sort of query language, I don't think this could replace general-purpose RDBMSes, but it could be a good extension to them some day.
On the contrary.
From a standard 3rd generation programing language one can read and write into flat files and we can do close to this with a hierachical database.
I think there is a key distinction here. Application object store vs data management. Hierarchical db's are far better at storing object information, but *far* worse at real data managment.
We lose this with relational databases because the way the database organises data has no direct mapping to the way it might be set up in a standard programming language.
What this means is that every transaction to and from the database must go through a literally horrible re-mapping. IE. The language data structures do not correspond to the RDBMS data structures and visa versa.
In LedgerSMB, we solved this by putting a functional interface in the db. Then we dynamically map the objects and their properties into functions and arguments. Works great :-)
As an example - in postgreSQL the last I looked at writing a simple row into a table where there were something like 100 columns in the row...
You are either trolling or you need to fire the DB architect who designed that. THere is *no way* that a 100-column table is good DB design. (Ok, mathematically, there is nothing that precludes it being good db design, but I can't even imagine a scenario where this would be OK).
In the 3rd generation programming languages this was just a simple structure with 100 entries.
Oh, you were the one who designed the 100-column table. Sorry..... Please go out and get some books on db design. You will thank me :-)
The data transfer from that structure generated a function call with more than 1000 parameters. This was to be mapped and re-mapped with each call to transfer data, this is even though the structure itself is static and determined at compile time.
IMO, your problem honestly is in the fact that you are using a monkey wrench as a ball peen hammer. It may sorta work but you are using the wrong tool for the job. If you want a simple object store use BDB or something like it. If you want a real data management solution, build your db *first.* If that is not your goal, use something other than an RDBMS.
Next: There were about 10 parameters per field (column).
... etc
1: Column name
2: Column name length
3: data type
4: data length
5: character representation
finally 10: Address where the data lives.
The thing is such a table could be set up very easily and populated with a simple loop that rolls in the required values via say a mapping function with about 10 arguments. This could be done ONCE at run time to prepare for the transfer of data and then the same table could be referenced for each call and simply an address could be sent with the transfer.
Noooo.. It was dynamic and the data was encoded as parameters on the stack. This means the stack must be build and torn down and rebuilt for each call.
How is this an issue with RDBMS's?
Next - the implementation was so bad that the program would run in test mode with only a few parameter but it failed when the whole row was to be transfered.
Again, this is not a PostgreSQL problem ;-)
I gave up on that interface.
From your description, that sounds like a wise choice.
While there is much good to say about RDBMS's in general. The issue I ran into was the interface from 3rd generation languages took a HUGE step backward. IMHO we should have a high level language statement called DBRead() and DBWrite(). In C this should generally correspond to fread() and fwrite(). If this is too complex then DBWriteStruct() could be implemented with suitable mapping helper function.
Again, this is an issue with the frameworks you are using. Personally, I tend to do the
LedgerSMB: Open source Accounting/ERP
I think the object database management system (ODBMS) will overtake RDBMSs in the future for several reasons (from the link and my own musings):
.NET, C++ and Smalltalk. This makes implimentation quick and easy - yet stable and scalable at the same time.
1. Object-oriented databases are designed to work well with object-oriented programming languages such as Python, Java, C#, Visual Basic
2. ODBMSs use exactly the same model as object-oriented programming languages.
3. It is also worth noting that object databases hold the record for the World's largest database (over 1000 Terabytes at Stanford Linear Accelerator Center).
4. Access to data can be faster because joins are often not needed (as in a tabular implementation of a relational database). This is because an object can be retrieved directly without a search, by following pointers (e.g. the objects are stored in trees for fast retrieval). Dynamic indexing schemes further speeds up retrieval of full text searches.
5. Provides data persistence to applications that are not necessarily 'always on' - e.g. HTTP based stateless applications.
I think RDBMSs will be around for some time -- but they will be relegated to more structured situations and smaller data sets. ODBMSs will take over where data is changing, persistence is critical, data types are mostly large binary objects with associated meta-data, and datasets are humongous.
Right now my favorite ODBMS is the ZODB (Zope Object Data Base) - an ODBMS system tightly integrated with both Python (implimented using Python's native 'pickle' object persistence functionality), and the Zope web application development system - which itself is built with and uses Python. You can learn more about Zope at Zope.org.
Lodragan Draoidh
The more you explain it, the more I don't understand it. - Mark Twain
Or just humans?
disk seek time is a real killer.
;-)
Instead of writing all the data in one operation, you get to seek, write, seek, write, seek, write, seek, write, etc.
Moving parts are slow
LedgerSMB: Open source Accounting/ERP
The point is clear. People could(and in many cases should) be getting a lot of use out of column oriented databases. The title of his post is "One Size Fits All - A Concept Whose Time Has Come and Gone". His argument basically comes down to saying that blindly throwing Oracle at all of your database problems is stupid and outdated. There are better solutions for many situations out there, and custom built solutions can offer many benefits as well.
I think there's some sort of "editor TOS" that specificaly prohibits any "article-based" editing. Or correction of grammar/spelling/typos. Or checking of whether links point to some random profit-blog.
Heck, I think the editor TOS pretty much reads: post 'em if you got 'em
Interested in a Flash-based MAME front end? Visit mame.danzbb.com
Indeed, I am using Django for several projects and it really has freed me from reading/writing SQL queries. Django has actually made my life easier and more pleasant in so many ways...
I've recently shifted from a cutting-edge linux shop to a behind-the-times microsoft shop and have been struggling to find decent MS tools rather than giving up and just going with what I know.
PS: looking at the code sample
int[] numbers = { 5, 4, 1, 3, 9, 8, 6, 7, 2, 0 };
var lowNums =
from n in numbers
where n 5
select n;
Doesn't it remind you of a PROC DATA step in SAS?
sed -e 's/crisps/potato chips/' -e 's/pound/dollar/'
Oh, you English and your funny language! A chip is a "crisp", and a truck is a "lorry". Next you'll be saying a wrench is a "spanner", a car's trunk is a "boot", and the President of the United States is a "wanker".
Soylent Green is peoplicious!
Column-oriented database systems are not new. Sybase has successfully sold its column-based IQ database for years as a high-performance business intelligence solution.
I have worked with IQ, they should have called it 'DUMB', it was bought on the premise that all 'TRANSACTSQL' was 100% compatible with SYBASE, well it wasn't, in fact it didn't even come close, what a pig.
You never catch me alive
And besides, I thought Object Databases were the next new thing.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
With regard to say 100 fields in a row.
In many cases your observation would be true that once you are getting up to 100 fields in a row that this is really pushing it.
However with geological data this is not always the case. Consider a borehole in a mining setting. The hole lives at a certain location and has directional information. The borehole is identified by an ID. It was drilled by such adn such a company at such and such a time and hence it still makes sense to carry all this data in a row identified by the borehole ID. If one breaks this data into multiple rows - then a merge may need to take place and this can be costly compared to the much easier job of just ignoring a few fields which are not needed.
Add to this it issue that if you have split the rows up then some of the data may be missing and some may be duplicate because the data structure allows it.
Now if you had two contractors drilling the same borehole then I would agree and say split the contractor data into one table and the technical data into another. In fact that was done even though there is no possiblity of having two or more contractors on a borehole. If something funny happens one can always assign a new borehole ID anyway.
Look at the technical data.
Isotopic signatures may be taken. There are over 100 isotopes alone.
Various mineral assays can be conducted. The number of fields here is also large.
In short it is quite easy to end up with more than 100 fields.
------------------
I'd like to address your assertion that an RDBMS is a "data managment tool" and not an "applications development tool". In my mind - one should not preclude the other.
By the time one has defined a view of the data it looks just like a flat file. We have a fixed number of columns of data - some of which may be dupicated and of course some of which may be composite data such as if we average something.
I have no problem with this at all.
My point is that if the data looks like a flat file and can be printed as a flat file by pretty much any SQL query engine, then why can't the RDBMS software present this to a 3rd generation programming language as a flat file that can be read. In fact it can.
One _could_ for instance fire up a child process and feed it the query and have it hand this off to some SQL query engine which then writes the data out to a flat file which the application then opens and reads with standard old 3rd generation programming statements like fread().
This would actually be fairly quick to program.
The thing is that its a horrible way to write an application. So why doesn't the RDBMS provide what looks like a flat file interface but do it in a fashion that is at least as elegant as the good old read and write logic that's been around for 30 years or more?
Similarly for writing data into the database.
One should not have to stand on their head and do back flips simply because there is so little attention to the interface between the programming language and the data management software.
The example I cited was from PostgreSQL where over 1000 parameters were generated by their precompiler. It was poorly implemented and I have not looked at it since.
Of course I do agree that what I am talking about is the API to the RDBMS.
Still, in my mind this is just as important as the data management capabilities of the system.
----------------
Oh - I want to address the comment that is a language like Perl you use a float.
For monetary values it is rare that it is valid to EVER use a float. In C and C++ you do have floats available. It is just worng under almost all circumstances to use them.
The reason is that when you add floats you cannot guarantee that rounding errors will crop up which will render what you are doing incorrect. For instance, if you wish to tally the items in an invoice then the items should in fact add up to the total you print. With floats this cannot
I'm not sure if I get you correctly. Are you looking for some Object-Relational Mapper perhaps? So, for Python there is SQLObject, SQLAlchemy and some other; from what i know Java has Hibernate which is more or less the same. Google for "ORM" plus your language of choice.
Excuse me but, isn't that what an index is for?
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
What this means is that every transaction to and from the database must go through a literally horrible re-mapping. IE. The language data structures do not correspond to the RDBMS data structures and visa versa.
// data dictionary knows keys and types
I suggest you use dynamic languages (or type-free language) and just have the query API map a given row into an associative array (aka "map", "dictionary array"). Same with saving it: change the map as needed, and then push it to a SaveRow function. (Saving may require a data dictionary because generated SQL statements need to know whether each item is a string or not.)
resultSet = query(std, "select * from fooTable where bar=7");
while (rowMap = getNextRow(resultSet)) {
print("Name: " . rowMap["name"]);
rowMap["name"] = "I changed the name";
saveRow(std, rowMap, "fooTable");
}
Finally they removed the Oracle interface and replaced it with a glorified flat file. They clearly built in some indexing. The result is the run times dropped to under 20 minuets.
What a file system (hierarchical database) does is improve performance on one search factor at the *expense* of others. Yes, that particular operation may have been faster, but RDBMS are designed to work well under a *variety* of search "paths", something hierarchical stores (like files) cannot do well. Plus, there are often ways to tune RDBMS for common search paths if one wants to trade a single search path (factor) performance for general purpose. It may take some expertise though.
For example, I often want to search my files by date changed, regardless of folder. This takes a long time because the folder tree is "indexed" based on file/folder names, not other attributes such as dates. "Cross tree" queries are slow and combersome. (Some tools will index on other attributes, but this turns it into a network or navigational database instead of a hierarchical one).
Using RDBMS effectively does take experience and education. They are not a simple technology to use. But, they can be very powerful.
Table-ized A.I.
The next geekSessions is on this topic. If you're interested in hearing about alternatives to RDBMS in and using them in practice, take a look at the site. The event will be held on October 2nd in downtown SF and will also be available via webcast. In addition to the presentations, we'll have a Q&A session along with some food and FREE BEER. Speakers at the event are:
Josh Berkus from the PostgreSQL core team
Paul Querna from Apache and Bloglines (wrote his own filesystem for Bloglines)
Chad Walters from Powerset who is implementing BigTable there.
Hope to see you there!
If I desire the benefits of a column store in my relational database, I simply create an index. For warehouse-flavor queries, I might create a multi-column indexes. Robust RDBMS's will choose multi-column indexes as exclusive read targets if all of the queried columns reside within that index.
If you gents and ladies want to try out a completely different approach to organized data storage I would suggest that you give CouchDB a try.
First off, Stonebreaker has a vested commercial interest in the matter. Take everything he says with a grain of salt. Actually, take the whole damned salt lick. Secondly, whether data is stored by columns or rows is not a concern for the interface of an RDBMS. That is a physical implementation detail that any RDBMS could do. Don't forget Codd's 12 Rules. Pay close attention to Rule 8.
Some other alternatives like object approach has appeared, but the thing is that relational approach is the one it is mainly used without competition (for the moment)
VirtualWorldsHub.com - News, forums, resources
QUESTION: How is a column-store DB better than a row-store one using single-column indexes?
WHAT I THINK COLUMN-STORE DB ENGINES DO:
I don't see how column-stores make retrieval any easier. Most search conditions have more than one column involved. Won't several column-files have to be opened & read to match the conditions?
SELECT * FROM T WHERE ID BETWEEN 1 AND 10 AND LAST_NAME LIKE 'S%';
There are two column-files here, ID & LAST_NAME. Assume the data is sorted, but there are no indexes. ID is not a PRIMARY KEY!
The DB engine has to open two files. Then, it binary-searches each file. It has to hit ID twice for 1 and 10. Next, it sequentially checks all matches by the search pointer. Finally, it has to open & read the row-slices of all other column-files in that range.
Isn't all this what a single-column index provides? We can get a very small set of candidate rows. Yet, a row-store DB has all the related data in the same file, and WHERE can check all the other columns at once.
Can someone expalin the data-warehouse advantage here?
Row/column hybrids are a good idea, but there are a half dozen technical reasons why data in different formats won't work for redundancy. The big problem is those functions are in radically different sections of the system. The disk controller doesn't know anything about the database and vice versa.
Also, modern databases never have to wait for data blocks to actually hit the disk during normal operation. They only have to wait for log records to actually hit the disk (at transaction commit time). Once the log records are persistent, if the system crashes, they can be used during the database recovery process to bring all the data blocks back up to date.
Journalling file systems are similar, except usually only with regard to meta data changes.
Well, kinda, but you're still storing the original table, so while the index speeds stuff up, you're still left with the original bloated table hanging around backstage.
Column-style databases offer some of the advantages of an index, but without the storage-inefficient back end.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
Of course there are tradeoffs, but a natural key only policy in a relational database of any sophistication is more or less insane. The biggest problem is that natural keys add at least one column for each level you go down in a database hierarchy. It is not uncommon for the natural key at the detail level of a typical snowflake table structure to have four or five columns.
And of course all of these columns have to be carried in all foreign keys, including primary foreign keys as well. So suddenly, simple association tables have *ten* primary key columns. That is a programming nightmare. Any relational database that requires more than two columns per table join on average is going to be difficult to manage, develop for, and maintain. It is also going to waste a lot of unnecessary resources.
In addition, a no synthetic key policy forces users to come up with unique identifiers instead of the database in cases where they do not care and where it is unnecessary. That is bad user interface design, plain and simple. Names and descriptions are usually unique - certainly enough for user purposes, but what kind of system uses a free form text field as a primary or foreign key?
Worse, if anything the user can change is used as a primary key, cascading updates will be required. Cascading updates are inefficient and most databases do not do them automatically.
That's why row-oriented databases have indexes and perform index scans.
> If I recall a commit after a truncate isn't necessary. No undo info is generated so the operation can not be rolled back.
Maybe in your toy RDBMS. Mine can even roll back DDL.
If one breaks this data into multiple rows - then a merge may need to take place and this can be costly compared to the much easier job of just ignoring a few fields which are not needed.
I'm not sure I follow that form of logic. Yes, sometimes it's easier just to have a big number of columns, but almost always is the first quick death to RDBMS. The strength of RDBMS comes from the relationships portion, not the tables portion. In fact, when you enter that large a number of columns per a row, and you query that row, the database has to read the entire row and think that you want all of that data. Even if every field is not in use, they may still take up space on the disk, so you're losing storage space AND time to transfer the data back to the application because you're essentially trying to communicate that "50 fields in this row are unused."
Let's consider your borehole example for a moment. You would have a table representing each unique borehole followed by a way of identifying and attributes tied to the borehole. At first glance, your approach is right, but at second glance when all attributes are not in use, it isn't necessarily a good idea. So take for example your approach, your table would look something like:
A simple structure, right? Except the notion that you don't necessarily use each column all the time. That alone is a hint that perhaps the columns are not always necessary. Instead, I would organize the data into two tables: one to identify the borehole and key pieces of information directly associated with the borehole and another table to store the attributes:
Now to reconstruct the data, I need to perform a join. Luckily, the Borehole ID is a primary key foreign key relationship so the database should have built some index to quickly map the pieces of data together so the join isn't that bad at all. It only becomes bad if I happen to ask for all of the data (in this case one table *might* be faster) or there is no index or relationship between the joined columns.
So here comes the critical question: how much of the data do you need? If you are querying for all of the data at a time, but do not use any of the SQL query features, why did you use RDBMS in the first place? The benefit of RDBMS is that you can look at slices of the data quickly and easily. For example if I wanted to know what companies operated on which boreholes, that is an easy query even with my proposed data arrangement:
If implemented as a flat file, I lose that functionality and have to implement it myself. In fact, in a flat file format, that operation can be horrible because I would have to search the entire flat file. In Oracle, I can use indexes and optimizer hints to speed up the query.
The hard part now becomes what happens when you want to query on multiple attributes? For example if I wanted to query on boreholes by company ACME and on directionX = 0 and directionY = 1, now I have to use the set operations. One straight-forward way to do that is:
Now I
Actually **his** article is fine - he is simply saying the current horozontal approach of RDMBS is not as efficient as vertical for data wharehousing applications. He is saying the concept "one size fits all" is obsolete as data wharehouses provide different stresses than OLTP applications. Nowhere does he say that RDBMS as a concept is obsolete or anything like that.
It's the moron at computerworld that misread his article and thought that he's saying that vertical is superior to RDMBS, which complete nonsense (that's like arguing TCP is better than Ethernet - meaningless!). Stonebraker is arguing about the on-disk layout used by common RDBMSs and saying what they currently use is less than ideal for some applications, which I can appreciate. He argues that his product, Vertica, outperforms conventional RDBMSs for datawharehousing.
As proof that the computerworld monkey is the moron, Vertica itself is described as "a brand-new column-oriented RDBMS". Yeap, Vertica is a RELATIONAL DATABASE. Sheesh.
One might occasionally want to query by company or some other attribute in order to find and correct data.
However with geological data in general queries like a borehole drilled by company X in such and such a direction makes no sense whatsoever.
In fact - this is where we really run into problems with data like this. In general - one is interested in data that falls within some 3-D space. It might just be 2-D mind you - like data that intersects a zone or is above a zone or below a zone.
The thing is that in all cases, one is generally best to determine what data could fall within a volume or zone of interest and then simply scan the data after that. This means a flat file approach actually works pretty good for data of this type. Geophysical data falls into this.
So with the borehole data one would want to carry the xyz or top hole location and bottom hole location with the main data for simple holes and then add a flag if there is a deviation survey. Sometimes these holes do look like a snake.
Then one wants to group this data via an artificially determined key which is 3-D in nature if possible to eliminate all data which cannot possibly participate in the query.
The borehole ID field doesn't accomplish much other than to let someone look up and possibly correct some data. As such - one is generally not even interested in how efficient this is. The number of updates and lookups by borehole ID are very low.
Its the areal retrievals and the volume scans that take the time. This is the mapping side of the picture. You are correct that flat files actually can make more sense because spinning through even the best RDBMS's is quite slow when compared to reading a flat file.
So then the question becomes, how long are the records. Does it make sense to split them up. The answer to this depends on the type of query. In our case we have a suite of queries that go through and generate flat files of certain formats. When one looks at these flat files it didn't make much sense to split them up.
The thing is in an RDBMS if you read from two tables your disk heads may seek constantly between the two read points = and then there is the merge operation. Even a fairly large record can be fetched from the disk within a few reads... a few for the index and one or two to fetch the data.
My point is that the power of an RDBMS is really not applicable to this type of data. However RDBMS's do confer other advantages such as standardized interfaces, maintanance tools, ability to do ad hoc queries and updates. All in all the RDBMS is a decent choice.
My experiances however is that the API sux and this is a major distraction.
To be a relational database the database must meet a very specific set of requirements. While a standard view of the databases from the DB administrators and normal users view may allow limited ways to manipulate the data, looking at the actual storage structures and how to efficiently use the resultant files can provide some extremely efficient computational methods. For example, if one structures data in a relational database with few fields, perhaps as low as two, per representation where each representation represents a single file then the data read directly from the file, the reads may be via flat file techniques, into an array or efficient storage table, then you achieve the best of both worlds. If the above example requires compressed data, then the relational data must be read from a single file via DB operations with two different mappings overlaying the single database file. The programmer must understand the data orientation so that compression/decompression occurs correctly. Extremely fast reads/writes may be achieved this way. The advantage of a column oriented database is that the files are inherently optimized for data mining without the need to hire an expensive programmer. If the company had multiple requirements for the same data then multiple databases may be required, unless they are willing to hire the expensive programmer. A key problem results when multiple databases loose synchronization. BTW, row based databases optimized for storage size (footprint) compress data by column, not by row or record. Using these techniques, I have achieved far greater data compression than comparable Google stored data.
What?
Replying just to tag the comment... hoping I won't post 23 more before I understand the first two paragraphs.
Making laws based on opinions that stem up from false informations leads to witch hunts.
Michael Stonebraker is certainly a well respected nae and he was been right on these issues in the past. Coinsidently I'd testing my software with a new version of PostgreSQL as I type. I think colum vs. row storage can be considered simply a option. I can even see it being an option that you specify at the table level. Most DBMS users really don't have much data. Today a 1,000,000 row table can be cached in RAM on even a low-end PC based server. Once cached in RAM row vs. column storage does not matter. I would imagine that 99& of the database table in te world have far fewer then a million rows. This discussion applies only to the very few that are really large.
Dr. Pauker, with Borland's Dr. Kahn, gave us Paradox for DOS. It was the column oriented Query by
Example. Just another implementation of Relational Algebra for SQL, but it was great for us business users.
I paid $800 for it, still use it. With 20 million records, it gives an answer before your pinkie comes off the enter key.
Don't forsake me.
With regard to say 100 fields in a row.
In many cases your observation would be true that once you are getting up to 100 fields in a row that this is really pushing it.
My experience is that db tables of more than about 10 columns tend to be indications that something may be wrong. The only tables I have ever seen with more than 50 columns have all been badly designed (thus far). While this is not a mathematical rule, it does seem to match my experience.
However with geological data this is not always the case. Consider a borehole in a mining setting. The hole lives at a certain location and has directional information. The borehole is identified by an ID.
Ok, so far we have latitude, longitude, and id.
It was drilled by such adn such a company at such and such a time and hence it still makes sense to carry all this data in a row identified by the borehole ID. If one breaks this data into multiple rows - then a merge may need to take place and this can be costly compared to the much easier job of just ignoring a few fields which are not needed.
Ok, we add a timestamp, and a company identifier (which joins to another company table).
So far we have:
Add to this it issue that if you have split the rows up then some of the data may be missing and some may be duplicate because the data structure allows it.
THe above seems very straight-forward. Note the NOT NULL constraints, and the foreign key constraints.
Note that if oyu put all company information in the same row you are adding duplicate information, and this can cause data management problems.
Now if you had two contractors drilling the same borehole then I would agree and say split the contractor data into one table and the technical data into another. In fact that was done even though there is no possiblity of having two or more contractors on a borehole. If something funny happens one can always assign a new borehole ID anyway.
No, I think you have that backwards. If you can have two holes drilled by the same contractor, you want to split the contractor data into another table. If you have two contractors drilling the same hole, you will need to create a pivot table for the many->many mapping.
Look at the technical data.
Isotopic signatures may be taken. There are over 100 isotopes alone.
Each isotope signature record belongs in a different row.
Various mineral assays can be conducted. The number of fields here is also large.
Same as above.
What happens if you want to start looking for another isotope or run a different mineral assay? Do you want to alter the schema of the table? Or rather add a field into an isotope or assay table that tells people about the test so you can store the results in another one.
In short it is quite easy to end up with more than 100 fields.
You still need to look up normalization.
Ok, take the above table for borehole and add the following tables:
LedgerSMB: Open source Accounting/ERP
This problem has been solved many many times over with modern OR mapping tools. It used to be the case where your options were pretty much:
.net shop and afaik DB was for VS2005/.net2 only - but I have no doubt theres similar tools for Netbeans or whatever your language of choice is half a google away.
/cry
* Stuff around with a bunch of mapping files and write the code by hand. (Time consuming)
* Use a template code generator (shitloads of code - gets you like 90% of the way there, and then you are stuffed and its impossible to maintain)
Now theres a third option, which is a plugin for your environment of choice that keeps some definitions synchronised behind the scenes and reads a bunch of metadata so nothings hardcoded and it can optimise queries at runtime.
Case in point: Earlier this week we bought a copy of Diamond Binding. Configuration was pretty minimal - it might not be the most customisable of tools, but I was never one for learning mapping theory anyway. Basically the setup time was ticking what tables I wanted in my data layer - it detected all the relationships and just worked out the box. I dont recall the price, but it was a bit over $500AU - which is about 2 days salary here.
We're a
So basically if you want to be tight and not spend any money, then its going to suck for you. Same as coding without an IDE. The tools are there though - and it seems that "mainstream" OR/M tools are out there - or at least "instant data layer" stuff for your average noobie VB.Net programmer thats actually got a solid OR/M foundation (instead of template codegen BS).
Actually its probably a bad thing. Anything that makes it easier for your average coder is just going to drive my rates down
3laws: No freebies, no backsies, GTFO.
Having read Slashdot for a couple of years now, I have come to the conclusion that computerworld magazine is a trashy rag of a publication. Just from recent memory, I recall articles on text messaging replacing email and a top ten list of obsolete programming languages that included C. Now RDMS, the most critical piece of technology underlying our information age, has been marked for deletion. Flame us no longer, computerworld, with your tales of misery and woe, and just stick with what you guys do best: hailing the latest Microsoft "innovation". (My Cheerios had a strange taste this morning. And the milk looked kinda yellowish.)
FAQs are evil.
Will the imminent transition to SSDs make any difference? Because row-based DBs means you're typically reading large chunks (one raw) sequentially, while columns stores means you're reading many small (number of columns) chunks for every row. I'd think that if random access time was almost none, you'd get almost the same write performance while read performance could be greatly improved because you only read the columns you need. It'd certainly make DB design easier too if you didn't have to worry about putting very light information in the same table as heavy blobs.
Live today, because you never know what tomorrow brings
This is exactly how search indexing works only with proper multi-node scaling and a ton of different ways to query it and push/extract structured data from non-structured data. So congratulations to this guy and his company. This is about 20% of the software most people in enterprise search/'data warehousing' have been using for over a decade.
Now there was another reply concerning the PIC system. I worked with Ultimates ( such hubris! ) version of PIC and DBasic in the late 80's. It was BLAZINGLY fast at reading data, but as others have mentioned writing data was another story.
There is an issue of preallocating your files. In PIC's implementation there was the concept of overflow frames as well. These were to handle the growth of Multivalued fields, which were in fact row based data. Now you have to understand that this was a time when hard disk storage was VERY expensive and nothing was counted in Gigabytes.
But even these days when you are talking about truly HUGE databases running into the Tera bytes of data both the preallocation overflow to be contended with. Now PIC had a pretty fine solution for this, but it waisted disk space. Tera Byte sized hard drives are on their way but I still do not think that the aforementioned problems are going to go away, in fact I think they will be amplified and here is why:
Databases of yesterday, and this new proposal is in fact not new, but its rather an old idea were there before CLOBS and BLOBS. Imagine a situation when you attempt to store a CLOB or BLOB with a row oriented approach. You will either have to have a massive overflow capability or you will have to preallocate your file to such a large extent as to make the technique cost prohibitive as your enormous, yet empty file occupies almost all of the available disk space. Additionally if you anticipated CLOB or BLOB storage estimates are wrong the file(s) will need resizing and reallocation on a fairly regular basis and this is not optimal as doing so requires you to have at least the size of the current data file in empty storage available.
I like the idea, but row oriented storage is not a prevalent design feature in any of the current databases out there, and I think for good reason.
Hey KID! Yeah you, get the fuck off my lawn!
Row stores and column stores are an implementation issue; both can (and often are) used with relational calculus. That row stores were a stupid idea in many application areas was obvious to many people already even when Stonebreaker was still pushing them. Now, he seems to have fallen into the other extreme. The biggest thing wrong with relational databases is probably SQL. SQL has become complex, and it is hard to predict performance of particular expressions across different implementations.
People will continue to use row stores, column stores, array stores, object stores, link stores, and simple persistent hash table, and they'll use it with relational and non-relational models. None of those approaches are obsolete, and none of them will be in the foreseeable future.
I don't get it. Storing data by column instead of row is a change at the physical level and has nothing to do with the data model. In other words, storage by column can be implemented with the relational data model, there is no contradiction.
If a column-based DBS is something entirely new, what data model does it use?
Ok, attribute-value modelling has its uses. We use it in LedgerSMB in exactly 2 places (argument lists for functions called on menu element activation and system settings). However it is *horrible* when you are trying to impose a natural structure on data.
LedgerSMB: Open source Accounting/ERP
What the blog refers to is an alternative to row based databases, not just a database catering to a niche part of the market. The author plays up the 'faster' bit too much (it turns into an advertisement).
However, the more important reason that relation databases fail to meet many application requirements is the lack of query flexibility (only noted in passing in the blog).
I don't think it is controversial to say that when one designs a relational database, you need to have an idea of what queries are likely to be made on that data. If the data is highly orthogonal, the queries don't matter much; there is only one way to factor the data. But in real life, collections of data are often not orthogonal, and the information being gathered may change over time. In a relational database, this means prophetic planning. Or refactoring the database every few years, a painful process for everyone concerned.
I would suggest that novel database architectures can be widely relevant even if they are not faster than the relational databases of our youth... Twenty years ago the relational database was necessary because of the slow speed of the hardware; now that speed has become a niche requirement, it is time for new architectures.
The XML database will take over eventually.
Actually, it sounds like he is using ECPG or another such precompiler. I'm sure those things have their place, but I'd personally never use anything like that. It's just too restrictive and all of the interfaces like that I've encountered are downright archaic, mostly due to being invented 3 decades ago.
Straight libpq is much nicer.
You are confusing an RDBMS with a persistent storage mechanism. It's really not hard to just keep data persistent any more. You don't need a 3GL or anything fancy, just some hooks to record your modifications on permanent storage, and keep a small working set cached in memory. It's an easy, trivial, solved problem. And it was solved before relational databases were invented.
RDBMSs do a lot more. Here are just a few advantages:
* different applications can access the same data
* guarantee integrity via declarative constraints that can validate against all of the data at once, not just the single record in question
* different applications can have the same guarantees of integrity, and a bug in the first application can't break the guarantee for the second application
RDBMSs were invented for a reason. Many, many software bugs can be traced back to a bad data state -- some invariant that was broken and uncaught. Often, these bugs are not caught until long after the insert has taken place, and often cause a cascade of new bad data and you don't find out until many records are wrong. A lot of code is imperative, and re-stating the invariant declaratively (i.e. a database constraint) helps catch a lot of those bugs.
Trying to put these declarative constraints in the application is a bad idea. When should they be checked? And in which applications should they be checked (all of them, one would hope)? If you see a declarative constraint, are you sure it's correct, or might it have been added after inconsistent data was entered and before the constraint was actually run?
Databases solve this by making some promises. If you put a ".. CHECK (age > 0)" on an attribute, it will check all the records before applying it, and then all the records afterward will need to pass through that constraint. That's a lot better of a guarantee, and you know it's true for all applications. Someone else's bug or quick hack won't violate it, so your application can rely on that as the truth. Same with UNIQUE or FOREIGN KEY.
If you think about your reasoning for a moment, it's very narrowly focused on storing and retrieving single records. Presumably, anything needing to look at the data as a whole would need to read it fully into the application and process it from application code.
You don't take into account other readers of data who might require consistent reports or anything else that needs to look at more than one record. You also don't take into account the horrible mess you have when the application is wrong and stores bad data, or when you need to do data format changes. In the types of databases you describe, almost any change requires reorganizing the data physically. In an RDBMS, you can make many changes without physically changing the physical layout.
Social scientists are inspired by theories; scientists are humbled by facts.
I don't think that has anything to do with the article. That is about storage on disk, not about manipulating pointers in memory to such an extent that a programming language that should never have been invented in the first place becomes even more ununderstandable.
If only I could add tags...
... and best of all, it's Open Source and free!
Official page
Wikipedia info
The real interest of column-based storage is that each column only carries a given data type. This means that allocation can be optimal (up to the bit level). Moreover, memory-mapping a column gives a contiguous array of uniform data, which is randomly accessible. Scanning a column consumes much less memory than for row-based tables, since only the relevant data has to be present in memory. Here is a great intro to its view model.
Another big advantage of Metakit (and probably other column-based DBMSs) is that it supports hierarchical tables (dubbed subviews). A column type can thus be a table itself. This means that rows can contain variable data, which eliminates most of the needs for table joins and relational algebra. Moreover the table definitions can be self-referential: one can easily represent tree data in a single table, such as an XML document or a file hierarchy. For example, the following table definition is sufficient to store a whole tree of ints: tree[value:I,children[^]] where ^ references the table itself.
I've used Metakit with success in several commercial projects. The learning curve is a bit steep, mostly because of the sometimes obscure documentation, the radical choices, and terminology which often depart from that used in traditional RDBMS. I had to experiment a bit to make the right choices. However, once in hand, it is lightning fast. I was able to design an object-based system where all data were stored in memory-mapped storage, and were accessed in a fully transactional manner. This means that in-memory objects were only wrappers around memory-mapped Metakit storage: setting an object property actually altered the data in the right column inside the storage.
Stonebraker is a crank.
None of his previous efforts have set the database world on fire, this will be nothing different.
I wish there was a "Bile Blog" for database stupidity.
Blessed hash, you say? Are you some kind of Rastafari?
One swallow does not a fellatrix make
Relational databases are not obsolete.
Are there any Relational databases available commercially? That is, a DBMS that actually implements the relational model? Not flaming, I genuinely want to know.
Here's an interesting site on the subject: DB Debunk.
Sometimes, I think they shouldn't be allow to speak in public. Making statements like this simply shows he's either a) ignorant or b) bias for some reason (which has already been discussed).
Anyone who works with databases professionally knows there is always a way between making the database fast for people entering data and making it fast for people reading the data. There's no 'right' answer and the entire statement of this article should have been laughed off - it certainly isn't news nor does it present a solution to the problem. I suspect that the overhead of adding information far outweighs the advantages of retrieval else there would already be widley available and in use solutions using it.
Column oriented DBMS's aren't some sort of new tech. For whatever reason, row oriented simply rules the market at this time. I remember when XML was the 'answer' to everything for a while when it was first 'discovered' (My buddies and I looked it over - weren't particularly impressed - and just sat and listend to the rantings) - everyone thought (stupidly) that we'd start storing everything in XML - which is assinine.
Last time I checked you create Index's on columns which are regularly accessed in a serial manner for lookups
And you make a balance between the number of index's you need versus the type of queries to be done ( more index's = faster lookup but slower row by row insertion )
Does sound like Column Store Database is a subset of the functionality of a half decent Relational Database
Perhaps we should consider RDBM's obsolete when they aren't widely used anymore.
1) those variables can be perl-ties to disk files
2) There are profound simmilarities.
Namely when storing variables by column they are (generally) the same type, allowing much greater access speed, serial processing speed, and compression.
Object oriented programming usualy has problems with speed when one must iterate over some deeply nested attribute value in an large collection of objects (for example to sort on a number). Column oriented storage already has those attributes collected in a single place.
So this has everything to do with the original article
Some drink at the fountain of knowledge. Others just gargle.
Well, mysql allows multiple storage engines. Each has its own proprietary file structures - anything from key-value pairs to isam.
DB2 used to support ISAM (and perhaps still does in its mainframe version?), but now mostly uses one main engine (which can store data in file system directories, in raw devices or raw volumes) but also has a completely separate xml storage engine that's pretty good. And just within the primary storage engine allows many different table types:
- mdc (multi-dimensional clustering) tables (like oracle - table is split into smaller tables on one server)
- compressed tables (gets you about 80% compression with old, nasty rows)
- typed tables (object tables)
- range partitioned tables (like oracle - table is split into smaller tables on one server)
- hash partitioned tables (partitions are split across many servers)
So, although I'm sure it would take work - I can't imagine they couldn't also offer a column-oriented table.
It looks to me like this argument is conflating an abstraction (relational theory) with its physical implementation. An RDBS shouldn't really care about whether data is stored by rows or columns.
Ok, so aside from warehouse vs mart discussions - when we're talking about a data warehouse we're generally talking about a star-schema. And in this data model you've typically got:
1. many dimension tables
- columns: many wide character columns
- rows: few (generally less than 5,000)
- examples: time, location, product, organization
- bottom line - column-orientation would be great here - except these tables are typically only a few mbytes anyway and are easily cachable.
2. very few (maybe 1) central fact table
- columns: a couple dozen integer columns - mostly just keys to the dimensions
- rows: many (often billions)
- examples: product sales, security events, etc
- bottom line: column-orientation doesn't buy you much - since the columns are so small and are just integers.
3. a dozen or more summary tables (pre-aggregated subsets of the fact table)
- columns: less than a dozen
- rows: typically 1% or less than what's in the fact table
- bottom line: column-orientation doesn't buy you much here either - since the columns are small and the number of rows aren't that significant.
Now consider:
a. partitioning: data is often partitioned by time ranges since time-series or current-version queries are the most common. This works extremely well with row-orientation since all the rows can be put into their own virtual table. But it won't work with column orientation. So, for example when you need to get all data for march 2007 - you won't be able to just scan that 2% of the data, you'll have to scan 100% of the column-oriented data.
b. compression: in db2 you can expect to achieve 80% compression of your fact tables. All those integers are just simple recurring patterns. Column orientation would certainly improve on that. But by how much?
c. writing: data warehouses may be read-mostly, but more and more there's a desire to get close to real-time reporting. The warehouse that I support gets loaded every hour for one set of data, and every minute for another. And users expect one set of data to be available for reporting within 30 minutes of being created. That set comes in batches of 100,000 or more rows. If the write-delay was sufficient we wouldn't be able to meet that requirement.
So, I'd consider that there may be some advantages to column-orientation. However, if you've got a good star-schema data model, are using partitioning and compression - then the benefits shouldn't be that significant and the loss of fast writes and easy partitioning may completely offset those meagre benefits.
Anyone for a game of Tetris? I'm betting it's Stonebraker's favorite as the goal of each stage is to rack up points by using columns to build and destroy rows....hehehehe
A fool throws a stone into a well and a thousand sages can not remove it.
Not all languages have this problem, however because it is not the only way to procede. Basically in any base number system, you have an ordered list of digits each of which is related to its neighbors by being a multiplier to the base raised to a power related to its position. So if I write: 1034.23, what I really have is 1x10^3 + 0*10^2 + 3*10^1 + 4*10^0 + 2*10^-1 + 3*10^-2. Addition is then straight-forward, and multiplaction is based on its distributive property. Since any integer digit can be stored in binary, one can build arbitrary base systems (and operate with them) even though the data is actually stored in binary. Decima-safe floating point systems tend to use a power of 10 as its base. For example, PostgreSQL uses base 1000 for its numeric data type (larger bases are more efficient when it comes to binary representation because you have fewer bits lost).
I know that C and Javascript use floating point types which are not decimal-safe. Perl, Python, and Ruby seem to have decimal safety built-in.
AI would also note that arbitrary precision libraries can give you roundoff errors too but for a different reason. These libraries are generally designed to work primarily with scientific measurements where you have to take into account the accuracy and precision of those measurements. Operations like rounding numbers can have unexpected effects down the road if you are not familiar with these.
For example, suppose we take: Hope this makes more sense.
LedgerSMB: Open source Accounting/ERP
Ouch. And this exactly why SQL should die as the primary interface to the RDMBS. How the hell is my compiler going to help me find simple typing errors when the interface to use the RDBMS is built upon ... strings! You are generating code from code, that's not integration, that's a hack! It's useful, as there is no sound other way to approach an rdbms, but it's not a pretty sight.
... having worked mostly in Oracle for the last year or so, I've been missing SQL Server.This would be the first time I read such a comment.
I went from SQL Server to Oracle/PostgreSQL and would never consider going back, ever.
T-SQL is the most obtuse PL I've ever had the disgrace to use. Compared to it, both PL/SQL and PL/pgSQL are a breeze.
- Sw Usr
Your comments are well received. Of course, Perl hashes are in-memory data structures, and in-memory structures are infinitely more flexible than on-disk structures (not just in Perl). The topic is actually about comparing methods of on-disk storage.
Of course, don't expect any Java weenies to understand the beauty and flexibility of the inside-out objects technique.
I wrote a time series database, Commodity Server, and I've come to the conclusion that the only thing a SQL database really needs to do this is to have the notion of a key as a range, rather than a discrete value.
Say, for example, I want to store:
5/1/2007 - 6/1/2007, 22
5/8/2007 - 5/10/2007 10
I would expect the history to be, on query:
5/1/2007 - 5/7/2007, 22
5/8/2007 - 5/10/2007 10
5/11/2007 - 6/1/2007 22
(neglecting time).
Storing that in SQL Server can be done in several ways, and, none of them are all that great. One way would be to store each row as 5/1/2007 - 5/7/2007, 22, and so forth, and another might be 5/1/2007, 22, 5/8/2007 10. In both cases, you get the problem. There's no real way within SQL to ask "4/1 - 10/1" and get a sane answer back.
To get that, you really just need to have something that hacks on the little tiny part of how keys are searched. You need to do what time series databases do, and understand ranges of dates. But, really, ultimately, to do it the right way, you need a sort of relational algebra that describes keys not in terms of discrete values but as linear functions of x or at least ranges of values across 2..n dimensions. Applications would be far greater than just time series and stock markets if you solved this problem. It would be good for maps, neural networks and all sorts of other problems as well. And THAT would make relational databases as we know them today obsolete for certain jobs.
This is my sig.
You are either trolling or you need to fire the DB architect who designed that. THere is *no way* that a 100-column table is good DB design.
Never say never. I have parts that I am legally required to keep nearly 300 pieces of information on for each and every part. I have a table with just over 300 columns that is 3NF. It is good DB design. Of course if some of the information only applied to certain types of parts then it would be possible to normalize and use smaller tables.
I've seen several other systems that needed huge tables because every item in the row needed every piece of information. It is not bad design. It's just a fact of life. Calling someone a bad designer because they're accurately modeling the real world is ridiculous.
Upon thinking about it, I've realized that it probably has to be somewhat ugly, being that it's a glue layer. We're stuck interfacing on both sides of the call, and interfacing interfaces to interfaces, via an interface (I know... just let it go, you know what I'm saying) is ugly because you're trying to create that piece of the puzzle that makes everything else fit and you're doing it in about 4 dimensions at the same time, sometimes blindfolded (see DAO documentation if you can find it).
And on top of that, there are 'dialects' for SQL, and JDBC is a slow implementation of a slow protocol. Check out ODBC vs. JET with a 3.6 MDAC, JET has committed a SQL query before ODBC has the connection fully open (I'm exaggerating, but not by much) - and JET is from the Access '95 era! What we need is to stop gluing interfaces and working between 5 layers. Let's simplify this mess by starting fresh with an RFC or something. 1 language, 1 engine programmed in that language, 1 dialect, 1 database object with which to interface.
I know it can't be done, and I know it's a bad way to do it, but it would make my life so much easier than trying to fix old Access '95 (upgraded to '97, upgraded to 2K) DAO code. I'm so sick of memory leaks from recordsets of workspaces of databases that haven't been explicitly closed in the correct order or scope (not that Access really has correct scoping, but we can pretend for my sanities sake) or whatever. Rant over. Thank you, I feel better, now... what was the question, again?
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
I was not aware that mathematics changed depending on what the regulations were. The normal forms are defined by cold hard mathematics.
One of the main points of third normal form is that there are no transitive functional dependencies. While it is not mathematically impossible that you could have a 300-row table which has no transitive funcitonal dependencies, I would find it highly unlikely.
Suppose, for example, that you are required to keep the supplier's current address and tax id for a part (and you only order from one supplier). First normal form would allow you to put all these in the same table. Third normal form would not because you have a transitive functional dependency. Part_id -> supplier_tax_id -> supplier_current_address. Hence you would have to break off the supplier information into a new table.
LedgerSMB: Open source Accounting/ERP
Are you trolling or simply just inexperienced? If I have >200 floating point measurements I have to keep for each part that is described by the row, how could I possibly normalize further? Are you claiming that because I have a 5.51111 in two different rows that I would be required to move that to a single row in another table to be 3NF? That is absolutely not true.
Suppose, for example, that you are required to keep the supplier's current address and tax id for a part (and you only order from one supplier)
Nice strawman. It's annoying to argue with idiots. They get into bad habits like this. They get tired of losing arguments so they get into the bad habit of building strawmen to knock down that are so ridiculously simple that even they can understand their made-up issue completely. Of course your simple example is true. The real world is much more complicated. In most nontrivial systems you will have tables with a large number of columns that are 3NF.
Are you trolling or simply just inexperienced? If I have >200 floating point measurements I have to keep for each part that is described by the row, how could I possibly normalize further? Are you claiming that because I have a 5.51111 in two different rows that I would be required to move that to a single row in another table to be 3NF? That is absolutely not true.
I said that nothing mathematically precluded the possibility, but that I think it would be very difficult to imagine a situation where even a 100-column table would not have transitive functional dependencies within the table.If you are telling me that there are definitely *none* of these, then I will take your word for it. BUt if it were my application, I would look for transitive dependencies.
Secondly my example, was only an example of the math to illustrate the concept of transitive dependencies. I will now give a clearer math example.
Basically, in algebra, suppose Y is a function of X. This means that for every X there is *exactly* one value of Y. Correct? It also means that Y is functionally dependant on X (i.e. Y -> X)
Now, suppose that at the same time, Z is a function of Y. This means that for every Y there is exactly one value of Z, correct?
(Z -> Y)
This also means that for every X, there is exactly one function of Z because functional dependencies are transitive. Hence in this case, 2NF would allow:
(Z -> Y -> X implies Z -> X) However, to reach 3NF, we should do: Hope this helps a little more.
Again, there is no matheamtical limit to the number of direct functional dependencies one can have, but in general, you are going to have to start justifying dependencies when you start to have a lot of columns.
LedgerSMB: Open source Accounting/ERP
A beta site called DemocraSay has posted a purse on this very subject, with a $300 purse split 50/50 among the winning author and the commenters. The competition lasts through 9/16. If you're interested, give it a try: http://www.democrasay.com/node/202. / L