Is the One-Size-Fits-All Database Dead?
jlbrown writes "In a new benchmarking paper, MIT professor Mike Stonebraker and colleagues demonstrate that specialized databases can have dramatic performance advantages over traditional databases (PDF) in four areas: text processing, data warehousing, stream processing, and scientific and intelligence applications. The advantage can be a factor of 10 or higher. The paper includes some interesting 'apples to apples' performance comparisons between commercial implementations of specialized architectures and relational databases in two areas: data warehousing and stream processing." From the paper: "A single code line will succeed whenever the intended customer base is reasonably uniform in their feature and query requirements. One can easily argue this uniformity for business data processing. However, in the last quarter century, a collection of new markets with new requirements has arisen. In addition, the relentless advance of technology has a tendency to change the optimization tactics from time to time."
Well it's about time we had some change around here!
One line blog. I hear that they're called Twitters now.
Have you noticed when you code your own routines for manipulating data (in effect, your own application specific database) you can produce stuff that is very, very fast? In the good old days of the Internet Bubble 1.0 I took an application specific database like this (originally for a record store) and generalized it into a generic database capable of handling all sorts of data. But every change I made to make the code more general also made it less efficient. The end result wasn't bad by any means: we solid it as an eCommerce database to a number of solutions, but as far as the original record store database went, the original version was by far the best. Yes. I *know* generic databases with fantastic optimization engines designed by database experts should be faster, but noticed how much time you have to spend with the likes of Oracle or MySQL trying to get it to do what to you is an exceedingly obvious way of doing something?
1) More and more specialized databases will begin cropping up.
2) Mainstream database systems will modularize their engines so they can be optimized for different applications and they can incorporate the benefits of the specialized databases while still maintaining a single uniform database management system.
3) Someone will write a paper about how we've gone from specialized to monolithic...
4) Something else will trigger specialization... (repeat)
Dvorak if you steal this one from me I'm going to stop reading your writing... oh wait.
It's natural to look at the edges of any feature or performance envelope. People that want to store petabytes of particle accellerator data, do complex queries to serve a million webpages a second, have hundreds of thousands of employees doing concurrent things to the backend.
But for most uses of databases - or any back-end processing - performance just isn't a factor and haven't been for years. Enron may have needed a huge data warehouse system; "Icepick Johhny's Bail Bonds and Securities Management" does not. Amazon needs the cutting edge in customer management; "Betty's Healing Crystals Online Shop (Now With 30% More Karma!)" not so much.
For the large majority of uses - whether you measure in aggregate volume or number of users - one size really fits all.
Trust the Computer. The Computer is your friend.
How did Perl & CSV fare?
It failed the "relational" part of the test. But it failed very quickly.
steve
(+1 Sarcastic)
Oh, you're not stuck, you're just unable to let go of the onion rings.
I was just thinking about writing an article on the same issue.
The problem I've noticed is that too many applications are becoming specialized in ways that are not handled well by traditional databases. The key example of this is forum software. Truly heirarchical in nature, the data is also of varying sizes, full of binary blobs, and generally unsuitable for your average SQL system. Yet we keep trying to cram them into SQL databases, then get surprised when we're hit with performance problems and security issues. It's simply the wrong way to go about solving the problem.
As anyone with a compsci degree or equivalent experience can tell you, creating a custom database is not that hard. In the past it made sense to go with off-the-shelf databases because they were more flexible and robust. But now that modern technology is causing us to fight with the databases just to get the job done, the time saved from generic databases is starting to look like a wash. We might as well go back to custom databases (or database platforms like BerkeleyDB) for these specialized needs.
Javascript + Nintendo DSi = DSiCade
Who thinks that a specialized application (or algorithm) won't beat a generalized one in just about every case?
The reason people use general databases is not because they think it's the ultimate in performance, it's because it's already written, already debugged, and -- most importantly -- programmer time is expensive, and hardware is cheap.
See also: high level compiled languages versus assembly language*.
(*and no, please don't quote the "magic compiler" myth... "modern compilers are so good nowadays that they can beat human written assembly code in just about every case". Only people who have never programmed extensively in assembly believe that.)
Sometimes it's best to just let stupid people be stupid.
Sheesh...and it took someone from MIT to point this out? Look at a prime example of a high-end, heavily-scaled, specialized database: American Airlines' SABRE. The reservations and ticket-sales database system alone is arguably one of the most complex databases ever devised, is constantly (and I do mean constantly) being updated, is routinely accessed by hundreds of thousands of separate clients a day...and in its purest form, is completely command-line driven. (Ever see a command line for SABRE? People just THINK the APL symbol set looked arcane!) And yet this one system is expected to maintain carrier-grade uptime or better, and respond to any command or request within eight seconds of input. I've seen desktop (read: non-networked) Oracle databases that couldn't accomplish that!
All the world's an analog stage, and digital circuits play only bit parts.
We're all sick with "new fad: X is dead?" articles. Please reduce lameness to an acceptable level!
Can't we get used to the fact that specialized & new solutions don't magically kill existing popular solution to a problem?
And it's not a recent phenomenon, either, I bet it goes back to when the first proto-journalistic phenomenons formed in early uhman societies, and haunts us to this very day...
"Letters! Spoken speech dead?"
"Bicycles! Walking on foot dead?"
"Trains! Bicycles dead?"
"Cars! Trains dead?"
"Aeroplanes! Trains maybe dead again this time?"
"Computers! Brains dead?"
"Monitors! Printing dead yet?"
"Databases! File systems dead?"
"Specialized databases! Generic databases dead?"
In a nutshell. Don't forget that a database is a very specialized form of a storage system, you can think of it as a very special sort of file system. It didn't kill file systems (as noted above), so specialized systems will thrive just as well without killing anything.
He's the CTO of Streambase, so he's not just a "neutral" academic.
http://www.streambase.com/about/management.php
There's a difference between fitting and being forced to fit into something ;)
Me failed English...
FreeBSD over Linux. If my comments seem odd, this may explain...
I've made some similar discoveries myself!
Who woulda thought that specific-use items might improve the outcome of specific situations?
$nice = $webHosting + $domainNames + $sslCerts
Yep. On the plus side, the Perl hacker who put it together only wasted the time it took to write one line. Granted, the line was 103,954 characters long. He considered breaking it up into two lines to improve readability but ultimately rejected the notion -- anyone not capable of reading the program clearly had no business messing with it anyhow. (Quick question aside from the snark: since Perl has associative arrays can't it emulate a relational database? It was my understanding that after you've got associative arrays you can get to any other conceivable data structure... assuming you're willing to take the performance hit.)
Help poke pirates in the eyepatch, arr.
> It was my understanding that after you've got associative arrays you can get to any other conceivable data structure
Once you have lambda you can get to any conceivable data structure. The question is, do you really want to?
sub Y (&) { my $le=shift; return &{sub {&{sub {my $f=shift; &$f($f)}}(sub {my $f=shift; &$le(sub {&{&$f($f)}(@_)})});}}}
Done with slashdot, done with nerds, getting a life.
Almost as bad as trying to take seriously someone who dosn't know his it's from his its, right?
Don't forget that a database is a very specialized form of a storage system, you can think of it as a very special sort of file system. It didn't kill file systems
Very specialized? Please explain. Anyhow, I *wish* file systems were dead. They have grown into messy trees that are unfixable because trees can only handle about 3 or 4 factors and then you either have to duplicate information (repeat factors), or play messy games, or both. They were okay in 1984 when you only had a few hundred files. But they don't scale. Category philosophers have known since before computers that hierarchy taxonomies were limited.
The problem is that the best alternative, set-based file systems, have a longer learning curve than trees. People pick up hierarchies pretty fast, but sets take longer to click. Power does not always come easy. I hope that geeks start using set-oriented file systems and then others catch up. The thing is that set-oriented file systems are enough like relational that one might as well use relational. If only the RDBMS were performance-tuned for file-like uses (with some special interfaces added).
Table-ized A.I.
I think it implements a Y combinator. Then again, it could just print out "Just another perl hacker". But I'm guessing on the Y combinator. Lets break it down so its readable:
... another block. This one actually makes sense if you look at it -- take the first argument from the list, evaluate it as a function on itself. We're assuming that is going to return a function. Why? Because that opening parent means we have arguments, such as they are, coming to the function.
sub Y (&) {
my $le=shift;
return &{
sub { ## SUB_A
&{
sub { ## SUB_B
my $f=shift;
&$f($f)
}
} ##Close SUB_A's block
(sub { ## SUB_C
my $f=shift;
&$le(sub { ##SUB_D
&{
&$f($f)
}
(@_)
}## END SUB_D
)} ##END SUB_C
); ##End the block enclosing SUB_C
} ## END SUB_A
} ## Close the return line
} ##Close sub Y
Y can have any number of parameters you want (this is sort of a "welcome to Perl, n00b, hope you enjoy your stay" bit of pain). The first line of the program assigns le to the first parameter and pops that one off the list. That & used in the next line passes the rest of the list to the function he's about to declare. So we're going to be returning the output of that function evaluated on the remaining argument list. Clear so far?
OK, moving on to SUB_A. We again use the & to pass the list of arguments through to
OK, unwrapping the arguments. There is only one argument -- a block of code encompassing SUB_C. (Wasted 15 minutes figuring that out. Thats what I get for doing this in Notepad instead of an IDE that would auto-indent for me. Friends don't let friends read Perl code.)
By now, bits and pieces of this are starting to look almost easy, if no closer to actual readable computer code. We reuse the function we popped from the list of arguments earlier, and we use the same trick to get a second function off of the argument list. We then apply that function to itself, assume the result is a function, and then run that function on the rest of the argument list. Then we pop that up the call stack and we're, blissfully, done.
So, now that we understand WTF this code is doing, how do we know its the Y combinator? Well, we've essentially got a bunch of arguments (f, x, whatever). We ended up doing LAMBDA(f,(LAMBDA(x,f (x x)),(LAMBDA(x,f (x x)))) . Which, since I took a compiler class once and have the nightmares to prove it, is the Y combinator.
Now you want to know the REALLY warped thing about this? I program Perl for a living (under protest!), I knew the answer going in (Googled the code), and I have an expensive theoretical CS education which includes all of the concepts trotted out here... and the Perl syntax STILL made me bloody swim through WTF was going on.
I. Hate. Perl.
And the reason I hate Perl, more than the fact that the language makes it *possible* to have monstrosities like that one-liner, is that the community which surrounds the language actively encourages them.
Help poke pirates in the eyepatch, arr.
As any English teacher will tell you, any language that will support great poetry and prose will also make it possible to write the most gawdawful cr*p. Perl bestows great powers, but the perl user must temper his cleverness with wisdom if he is to truly master his craft.
However in this specific case Google reveals that
was simply "borrowed" from y-combinator.pl. This is an instance of Perl being used in a self-referential manner to add a new capability (the Y combinator allows recursion of anonymous subroutines (why anyone would bother to do such an arcane thing comes back to the English teacher's remarks)). Self-referential statements are always difficult to understand because, well, they just are that way (including this one).Maybe they could make rubber databases ?
(or it's a bit of a stretch)
May contain traces of nut.
Made from the freshest electrons.
There's a reason for that. Many years ago, the Wisconsin database group (David DeWitt in particular) authored one of the first popular database benchmarks, the Wisconsin benchmarks. They showed that some databases performed embarrassingly poorly, which made a lot of people really angry. In fact, Larry Ellison got so angry, he tried to get DeWitt fired (Ellison wasn't clear on the concept of tenure). Since then, major databases have a "DeWitt clause" in their end-user license, which says that the name of the database can't be used when reporting benchmark results.
And this years ahead of Microsoft not allowing users to benchmark Vista at all!
Don't wrestle with pigs; you'll both get muddy, but the pig likes it.
Has anyone noticed the This article is published under a Creative Commons License Agreement, its the first time I've seen this applied to an academic paper. Another small step for the open-content movement.
There are four sorts of people in the world: fools, lunatics, idiots and morons. - Umberto Eco, Foucaut's pendulum.
Which reminds me of the Robin Williams joke
"They came in 3 sizes, extra large, large and white man"
"The weirdest thing about a mind, is that every answer that you find, is the basis of a brand new cliche" -
Btw. Postgres was a project from Stonebreaker meant to deal with the limitiations of SQL (POST inGRES).
See the history of PostgreSQL.
When the community picked the old, dormant Postgres source code up (no problem due to the BSD licensing), the first that was added (after some debates) was the SQL syntax, hence the name change to PostgreSQL.
Bye egghat.
-- "As a human being I claim the right to be widely inconsistent", John Peel
Outside of "Golfing", I'd strongly disagree. I don't think the community encourages it for the most.
This is from someone who's spent the last seven years with Perl and in the community. YMMV
-William Shatner can be neither created nor destroyed.
And the reason I hate Perl, more than the fact that the language makes it *possible* to have monstrosities like that one-liner, is that the community which surrounds the language actively encourages them.
Not all of us encourage this.
Its considered *clever* and a mark of great skill that you can strip out all the code that actually explains WTF your code is doing and be left with the perfectly compressed version.
They call this Perl Golf (shaving strokes of your game. Get it?)
Many of us do not consider it clever. Rather, we consider it stupid and counter-productive.
On the other hand, all of the sample answers posted at the Python Challenge are all golf style, and the Python Challenge is supposed to be a learning tool.
This is modeled as good Perl style to folks just starting with the language,
People who do this should be tied up with string and left in small dark rooms. For a month.
the Llama book has lots of code which looks like that, and code samples you find will look like it too.
This just isn't the case. The code samples in the Llama are no more or less obtuse the code samples in my Pragmatic Ruby book.
It appears that the community largely does not teach perl like it is a language that needs to be read.
I wish I could argue more strongly with you here, other than to assert that I come across code in many languages (Perl, Ruby, Java, C, Lisp), on a regular (daily, weekly, monthly) basis, at work and at home, in books, magazines, and online that appear written to not be read.
Your complaint of bad coding practices is endemic to the industry, and should not be used to condemn a language because it allows the freedom to code poorly.
My Heart Is A Flower