Is the One-Size-Fits-All Database Dead?
jlbrown writes "In a new benchmarking paper, MIT professor Mike Stonebraker and colleagues demonstrate that specialized databases can have dramatic performance advantages over traditional databases (PDF) in four areas: text processing, data warehousing, stream processing, and scientific and intelligence applications. The advantage can be a factor of 10 or higher. The paper includes some interesting 'apples to apples' performance comparisons between commercial implementations of specialized architectures and relational databases in two areas: data warehousing and stream processing." From the paper: "A single code line will succeed whenever the intended customer base is reasonably uniform in their feature and query requirements. One can easily argue this uniformity for business data processing. However, in the last quarter century, a collection of new markets with new requirements has arisen. In addition, the relentless advance of technology has a tendency to change the optimization tactics from time to time."
How did Perl & CSV fare?
The opposite of progress is congress
Well it's about time we had some change around here!
One line blog. I hear that they're called Twitters now.
The closest thing I can think of that fits that description is Postgres.
Have you noticed when you code your own routines for manipulating data (in effect, your own application specific database) you can produce stuff that is very, very fast? In the good old days of the Internet Bubble 1.0 I took an application specific database like this (originally for a record store) and generalized it into a generic database capable of handling all sorts of data. But every change I made to make the code more general also made it less efficient. The end result wasn't bad by any means: we sold it as an eCommerce database to a number of solutions, but as far as the original record store database went, the original version was by far the best. Yes. I *know* generic databases with fantastic optimization engines designed by database experts should be faster, but have you noticed how much time you have to spend with the likes of Oracle or MySQL trying to get it to do what to you is an exceedingly obvious way of doing something?
1) More and more specialized databases will begin cropping up.
2) Mainstream database systems will modularize their engines so they can be optimized for different applications and they can incorporate the benefits of the specialized databases while still maintaining a single uniform database management system.
3) Someone will write a paper about how we've gone from specialized to monolithic...
4) Something else will trigger specialization... (repeat)
Dvorak if you steal this one from me I'm going to stop reading your writing... oh wait.
I was hoping the article would mention specific relational databases (Oracle, PostgreSQL) results versus specialized ones.
It's natural to look at the edges of any feature or performance envelope: people who want to store petabytes of particle accelerator data, do complex queries to serve a million webpages a second, or have hundreds of thousands of employees doing concurrent things to the backend.
But for most uses of databases - or any back-end processing - performance just isn't a factor and hasn't been for years. Enron may have needed a huge data warehouse system; "Icepick Johnny's Bail Bonds and Securities Management" does not. Amazon needs the cutting edge in customer management; "Betty's Healing Crystals Online Shop (Now With 30% More Karma!)" not so much.
For the large majority of uses - whether you measure in aggregate volume or number of users - one size really fits all.
Trust the Computer. The Computer is your friend.
steve
(+1 Sarcastic)
Oh, you're not stuck, you're just unable to let go of the onion rings.
I was just thinking about writing an article on the same issue.
The problem I've noticed is that too many applications are becoming specialized in ways that are not handled well by traditional databases. The key example of this is forum software. Truly hierarchical in nature, the data is also of varying sizes, full of binary blobs, and generally unsuitable for your average SQL system. Yet we keep trying to cram them into SQL databases, then get surprised when we're hit with performance problems and security issues. It's simply the wrong way to go about solving the problem.
As anyone with a compsci degree or equivalent experience can tell you, creating a custom database is not that hard. In the past it made sense to go with off-the-shelf databases because they were more flexible and robust. But now that modern technology is causing us to fight with the databases just to get the job done, the time saved from generic databases is starting to look like a wash. We might as well go back to custom databases (or database platforms like BerkeleyDB) for these specialized needs.
Javascript + Nintendo DSi = DSiCade
Who thinks that a specialized application (or algorithm) won't beat a generalized one in just about every case?
The reason people use general databases is not because they think it's the ultimate in performance, it's because it's already written, already debugged, and -- most importantly -- programmer time is expensive, and hardware is cheap.
See also: high level compiled languages versus assembly language*.
(*and no, please don't quote the "magic compiler" myth... "modern compilers are so good nowadays that they can beat human written assembly code in just about every case". Only people who have never programmed extensively in assembly believe that.)
Sometimes it's best to just let stupid people be stupid.
This reminds me of the parallel databases class I took in college. Sure, specialized parallel databases (not distributed, mind you, parallel) using specialized hardware were definitely faster than the standard SQL-type relational databases...but so what? The costs were so much higher they were not feasible for most applications.
Specialized software and hardware outperforms generic implementations! Film at 11!
We don't have a state-run media we have a media-run state.
SW platform development always features a tradeoff between general purpose APIs and optimized performance engines. Databases are like this. The economic advantages for everyone in using an API even as awkward and somewhat inconsistent as SQL are more valuable than the lost performance in the fundamental relational/query model.
But it doesn't have to be that way. SQL can be retained as an API, but different storage/query engines can be run under the hood to better fit different storage/query models for different kinds of data/access. A better way out would be a successor to SQL that is more like a procedural language for objects with all operators/functions implicitly working on collections like tables. Yes, something like object lisp, best organized as a dataflow with triggers and events. So long as SQL can be automatically compiled into the new language, and back, for at least 5 years of peaceful coexistence.
--
make install -not war
Sheesh...and it took someone from MIT to point this out? Look at a prime example of a high-end, heavily-scaled, specialized database: American Airlines' SABRE. The reservations and ticket-sales database system alone is arguably one of the most complex databases ever devised, is constantly (and I do mean constantly) being updated, is routinely accessed by hundreds of thousands of separate clients a day...and in its purest form, is completely command-line driven. (Ever see a command line for SABRE? People just THINK the APL symbol set looked arcane!) And yet this one system is expected to maintain carrier-grade uptime or better, and respond to any command or request within eight seconds of input. I've seen desktop (read: non-networked) Oracle databases that couldn't accomplish that!
All the world's an analog stage, and digital circuits play only bit parts.
Back when he did Postgres, it was at Berkeley. He then moved on to the private world to do a start-up from it. So now he is at MIT. Well, at least MIT picks up good ones.
I had a "simple" optimization project. It came down to one critical function (ISO JBIG compression). I coded the thing by hand in assembler, carefully manually scheduling instructions. It took me days. Managed to beat GNU gcc 2 and 3 by a reasonable margin. The latest Microsoft C compiler? Blew me away. I looked at the assembler it produced -- and I don't get where the gain is coming from. The compiler understands the machine better than I do.
Go figure -- I hung up my assembler badge. Still a useful skill for looking at core dumps, though. And for dealing with micro-controllers.
So, have you had at it and benchmarked your assembler vs. a compiler's?
Just another "Cubible(sic) Joe" 2 17 3061
We're all sick with "new fad: X is dead?" articles. Please reduce lameness to an acceptable level!
Can't we get used to the fact that specialized & new solutions don't magically kill existing popular solutions to a problem?
And it's not a recent phenomenon, either. I bet it goes back to when the first proto-journalistic phenomena formed in early human societies, and haunts us to this very day...
"Letters! Spoken speech dead?"
"Bicycles! Walking on foot dead?"
"Trains! Bicycles dead?"
"Cars! Trains dead?"
"Aeroplanes! Trains maybe dead again this time?"
"Computers! Brains dead?"
"Monitors! Printing dead yet?"
"Databases! File systems dead?"
"Specialized databases! Generic databases dead?"
In a nutshell. Don't forget that a database is a very specialized form of a storage system, you can think of it as a very special sort of file system. It didn't kill file systems (as noted above), so specialized systems will thrive just as well without killing anything.
Hardware is cheap.
Developer time slightly less so.
Operational expenses will break your kneecaps and charge you for the bat. So often I have heard developers piss and moan about another debugging stage or standardisation in terms of installation procedures or configuration options or whatever, and haul out the hoary chestnut about their time being heinously expensive, when in fact what they could fix once will mean major benefits again, and again, and again, often on a weekly or monthly basis for scheduled tasks. Even accelerating a simple weekly data update from a three-hour, five-man task to a one-hour, one-man task saves you fourteen man-hours. (This example taken from real life with details stripped to protect the idiotic.) You don't have to have an MBA from Harvard to count these beans.
Remember, kids:
Up front costs (including dev time) are cheap.
Recurring costs are expensive.
He's the CTO of Streambase, so he's not just a "neutral" academic.
http://www.streambase.com/about/management.php
There never has been, and probably never will be. A small embedded database will never be replaced by a fat-assed SQL database any more than Linux will ever find a place in the really bottom-end microcontroller systems.
Engineering is the art of compromise.
I've made some similar discoveries myself!
Who woulda thought that specific-use items might improve the outcome of specific situations?
$nice = $webHosting + $domainNames + $sslCerts
I've seen drop dead performance on flat file databases. I've seen molasses slow performance on mainframe relational databases. And I've seen about everything in between.
What I see as a HUGE factor is less the database chosen (though that is obviously important) and more how interactions with the database (updates, queries, etc) are constructed and managed.
For example, we once had a relational database cycle application that was running for over eight hours every night, longer than the allotted time for all night time runs. One of our senior techs took a look at the program, changed the order of a couple of parentheses, and the program ran in less than fifteen minutes, with correct results.
I've also written flat file "database" applications, specialized with known characteristics, that operated on extremely large databases (for the time, greater than 10G), and transactions were measured in milliseconds (typically .001 - .005 seconds) under heavy load. This application would never have held up under any kind of moderate requirement for updates, but I knew that.
I've many times seen overkill with hugely expensive databases hammering lightweight applications into some mangled relational solution.
I've never seen the world as a one-size-fits-all database solution. Vendors of course would tell us all different.
The problem with database articles like this is that they pretty much ignore the core objectives of a database. Wait, I hear you, "That's the point, by ignoring ACID and transactions, we can improve performance." Therein lies the difference between understanding the "whole problem" vs a part of the problem.
Specialized solutions typically accomplish their objective by removing one or more aspects of good database design.
Databases are complex beasts and the good ones encompass a LOT of expertise and theory of data access, stuff that takes many months or even years to really understand. Specialized systems tend to focus exclusively on the highest level "problem" while ignoring the inherent problems of data access and modifications.
While there are specialized solutions that work within a limited range of criteria, and, in fact, improve performance, it should be the exception rather than the rule, because the good SQL databases are REALLY good (MySQL is not one of them) at parsing and creating a good query.
We still use Access every day. It's not dead yet!
Who has _not_ known this for years? The only reasons for generic DBs have been low cost and flexibility. Does it take studies to point out common knowledge now?
I actually started a Free/Open Source Social Networking project (FlightFeather) based on similar reasoning, which I also covered in a talk at LinuxWorld San Francisco 2006.
Specifically, when dealing with Web applications, get as much of the stored data as possible in HTML format -- then serve the resulting static pages. Much faster than pulling the data out of a generic SQL database, and rebuilding the page on every request. Even the page generation side can be very efficient. My benchmarking indicates that a single one-CPU machine should be able to handle several hundred comments per second in a discussion forum.
Bad Boss? Discuss it on bosstats.com
Almost as bad as trying to take seriously someone who doesn't know his it's from his its, right?
Boats perform better on water than cars, stones don't work well as parachutes and fire warms your house better than blocks of ice. How long did it take this genius to come up with this?
If you mod me down, I will become more powerful than you can imagine....
Don't forget that a database is a very specialized form of a storage system, you can think of it as a very special sort of file system. It didn't kill file systems
Very specialized? Please explain. Anyhow, I *wish* file systems were dead. They have grown into messy trees that are unfixable because trees can only handle about 3 or 4 factors and then you either have to duplicate information (repeat factors), or play messy games, or both. They were okay in 1984 when you only had a few hundred files. But they don't scale. Category philosophers have known since before computers that hierarchy taxonomies were limited.
The problem is that the best alternative, set-based file systems, have a longer learning curve than trees. People pick up hierarchies pretty fast, but sets take longer to click. Power does not always come easy. I hope that geeks start using set-oriented file systems and then others catch up. The thing is that set-oriented file systems are enough like relational that one might as well use relational. If only the RDBMS were performance-tuned for file-like uses (with some special interfaces added).
Table-ized A.I.
As any English teacher will tell you, any language that will support great poetry and prose will also make it possible to write the most gawdawful cr*p. Perl bestows great powers, but the perl user must temper his cleverness with wisdom if he is to truly master his craft.
However in this specific case Google reveals that
was simply "borrowed" from y-combinator.pl. This is an instance of Perl being used in a self-referential manner to add a new capability (the Y combinator allows recursion of anonymous subroutines (why anyone would bother to do such an arcane thing comes back to the English teacher's remarks)). Self-referential statements are always difficult to understand because, well, they just are that way (including this one).
Maybe they could make rubber databases ?
(or it's a bit of a stretch)
May contain traces of nut.
Made from the freshest electrons.
It's just not called SQL driven RDBMS. It's called Sleepycat.
Religion is what happens when nature strikes and groupthink goes wrong.
Speak for yourself weener-boy!
Engineering is the art of compromise.
So, they will always come off sounding a bit hokey to us.
Having said that, I will now say that we are entering into a "new age" -- gee, how often have you heard that hackneyed adage? -- of massive torrents of information with a damning need to manage it all.
What I'm surprised not to hear about are hardware projects to manage specialized requirements for massive data streams. Couldn't someone come up with, for instance, a clever way to arrange a battery of Field-Programmable Logic Arrays to crunch data in real time for some really specialized application?
Ruby Neural Evolution of Augmenting Topologies
Aren't most condoms sold in one size fits all ?
Maybe they could make rubber databases ?
Or maybe they should make rubber condoms?
23 years ago I wrote a custom DB to maintain the status of millions of "universal" gift cards. It ran 3-5 orders of magnitude faster (on a 6 MHz IBM AT) than a commercial database running on a big IBM mainframe.
I reduced the key operations (what is the value of this gift card, when was it sold, has it been redeemed previously? etc) to just one operation:
Check and clear a single bit in a bitmap.
My program used 1 second to update 10K semi-randomly-ordered (i.e. in the order we got them back from the shops that had accepted them) records in a database of approximately 10 M records.
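The check-and-clear idea above can be sketched roughly as follows. This is a minimal illustration in Python, not the original program; the class and method names (CardBitmap, issue, redeem) are hypothetical, and it assumes card numbers can be used directly as bit positions.

```python
class CardBitmap:
    """One bit per card number: 1 = issued and not yet redeemed (assumed encoding)."""

    def __init__(self, n_cards: int) -> None:
        # 10 M cards fit in about 1.25 MB at one bit each.
        self.bits = bytearray((n_cards + 7) // 8)

    def issue(self, card: int) -> None:
        """Mark a card as sold by setting its bit."""
        self.bits[card >> 3] |= 1 << (card & 7)

    def redeem(self, card: int) -> bool:
        """Check and clear in one step: True if the card was valid, else False."""
        byte, mask = card >> 3, 1 << (card & 7)
        was_set = bool(self.bits[byte] & mask)
        self.bits[byte] &= ~mask & 0xFF  # clear the bit either way
        return was_set


bitmap = CardBitmap(10_000_000)
bitmap.issue(1_234_567)
first = bitmap.redeem(1_234_567)   # valid card, bit is cleared
second = bitmap.redeem(1_234_567)  # same card presented again
```

Each update touches a single byte, which is why the order in which cards come back from the shops hardly matters.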
20 years later I wrote a totally new version of the same application, but this time the gift cards are electronic debit cards. This time I used Linux-Apache-MySQL-Perl to make a browser-based version, and I stored everything in the DB. Today that is plenty fast enough, and it allows us to make any kind of queries against the DB, like "How many transactions of less than 100 kr were accepted in December, broken down by business area/chain/shop/etc"
Terje
"almost all programming can be viewed as an exercise in caching"
Has anyone noticed the "This article is published under a Creative Commons License Agreement" notice? It's the first time I've seen this applied to an academic paper. Another small step for the open-content movement.
There are four sorts of people in the world: fools, lunatics, idiots and morons. - Umberto Eco, Foucault's Pendulum.
Which reminds me of the Robin Williams joke
"They came in 3 sizes, extra large, large and white man"
"The weirdest thing about a mind, is that every answer that you find, is the basis of a brand new cliche" -
This is titled "OSFA? - Part 2: Benchmarking Results." Has anyone found Part 1?
There is an article, and it has many references. How is a 'Captain Obvious' sort of comment labeled Insightful? The insightful part is in the article. The first author, Michael Stonebraker, architected Ingres and Postgres. He looked at OLAP databases, which is a market that is much larger than a special case. He proposed storing the data in columns rather than in rows. He tested this, and it works. In fact it works so well that he can clobber a $300,000 server cluster with an $800 PC. I know that I would be pretty happy to spend a year porting to his database if I could pocket half of that annual hardware cost savings. The savings in electricity would be enough to pay for several pretty serious Starbucks addictions. His key insight seems to be that he can vastly improve OLAP performance by storing the data in columns rather than in rows. This change could be quite transparent to the end users & developers, except for the massive speed-up and cost savings, of course. This paper describes a general solution for a common problem. Stonebraker has developed Vertica, which still supports ad-hoc queries in SQL. This seems like a pretty general purpose solution for OLAP.
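To make the rows-versus-columns point concrete, here is a toy sketch in Python. It is only an illustration of the two layouts, not how Vertica or any real engine is implemented; the sample data and the names (rows, columns, total_row_store, total_column_store) are made up for the example.

```python
from array import array

# Hypothetical sample data: (order_id, region, amount)
rows = [
    (1, "north", 120.0),
    (2, "south", 75.5),
    (3, "north", 30.0),
]

def total_row_store(rows):
    """Row layout: every whole record is visited just to read one field."""
    return sum(record[2] for record in rows)

# Column layout: one contiguous sequence per attribute.
columns = {
    "order_id": array("q", (r[0] for r in rows)),
    "region": [r[1] for r in rows],
    "amount": array("d", (r[2] for r in rows)),
}

def total_column_store(columns):
    """Column layout: the aggregate reads only the 'amount' column."""
    return sum(columns["amount"])

row_total = total_row_store(rows)
col_total = total_column_store(columns)
```

Both functions return the same answer. The difference is that an OLAP-style aggregate over the column layout never has to read order_id or region at all, which is where the I/O savings would come from on wide tables.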
Think global, act loco
They have too much of a tendency to get stuck in loops.
*ducks*.
I think you have to take into account how securely the specific-use databases are going to be maintained. Who is going to write updates for security flaws in a specific-use database and support new technologies? Might it be safe to say we'll see companies forget or drop a specific-use database in favor of a more general platform? I'd rather use a database that might be a little slower if I know that I can get regular updates to keep it current with my ever patching / updating / and re-securing operating systems. How often do we see proprietary hardware stop working because M$ decided to make a new service pack? What happens when the latest version of your Debian or Red Hat install won't run your widget-PostMySQL super designed build because the new install no longer supports an older kernel?
This is, of course, what MUMPS advocates have been saying for years.
MUMPS is a very peculiar language that is very "politically incorrect" in terms of current language fashion. Its development has been entirely governed by pragmatic real-world requirements. It is one of the purest examples of an "application programming" language. It gets no respect from academics or theoreticians.
Its biggest strength is its built-in "globals," which are multidimensional sparse arrays. These arrays and the elements in them are automatically created simply by referring to them. The array indices are arbitrary strings. There can be an arbitrary number of subscripts and the same array can have elements with different numbers of subscripts. Oh, and they're always sorted automatically; each element is created automatically in its proper sequence, and there are fundamental operators for traversing arrays in sequence.
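For readers who have never seen one, the behaviour described above can be imitated loosely in Python. This is a rough sketch, not MUMPS: the class name Global and its methods are hypothetical, it lives in memory only, and it sorts subscripts as plain strings, which is a simplification of real MUMPS collation.

```python
import bisect

class Global:
    """Toy sparse array with string subscripts, kept in sorted order."""

    def __init__(self):
        self._keys = []   # sorted list of subscript tuples
        self._data = {}   # subscript tuple -> value

    def set(self, subscripts, value):
        """Elements spring into existence on first assignment."""
        key = tuple(str(s) for s in subscripts)
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, subscripts, default=None):
        return self._data.get(tuple(str(s) for s in subscripts), default)

    def walk(self, prefix=()):
        """Yield (subscripts, value) for everything under prefix, in order."""
        p = tuple(str(s) for s in prefix)
        i = bisect.bisect_left(self._keys, p)
        while i < len(self._keys) and self._keys[i][:len(p)] == p:
            yield self._keys[i], self._data[self._keys[i]]
            i += 1

g = Global()
g.set(("patient", 1001, "name"), "Smith")
g.set(("patient", 1001, "visit", "2006-01-05", "bp"), "120/80")
g.set(("patient", 1000, "name"), "Jones")
under_1001 = list(g.walk(("patient", 1001)))
```

Note that elements with two, three and five subscripts coexist in the same array, and walk returns them already in sequence without any explicit sort.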
"Global" arrays are persistent across sessions, are stored on the disk, and as in ordinary practice can be hundreds of megabytes in size.
Before you say "this can all be done simply by writing a C++ class," I have to mention the important point, which is that the use of the globals is so intrinsic to the ordinary way MUMPS is really used in practice, that successful implementations of MUMPS must and in practice do make the implementation of globals efficient.
You really can just use "globals" all the time for everything. They work well enough that you don't need to reserve their use for when they're really needed. They're not a luxury. MUMPS programmers rarely use files, except for interchange in and out of the MUMPS universe. Within MUMPS, data is simply kept in globals; it's just the MUMPS way.
"Globals" are extremely flexible and lend themselves naturally to representations of real-world databases. These representations are typically one-off, ad-hoc representations designed by the programmer, who needs to make up-front decisions about the hierarchical organization in which the data will be stored, and writes special-purpose code to perform the accesses. Naturally, this sounds like the dark ages compared to relational technology, but there is an impressive tradeoff. If MUMPS fits the application, development times are short, and performance is dramatically better than for relational systems.
Whether or not this is important in the year 2006, it was very clear a decade ago when medium-scale database applications were typically hosted on minicomputers, that the same hardware resources could support several times as many users running a MUMPS application as a similar application implemented with a relational database, as various organizations found when they converted... in either direction.
Of course relational systems can be and are implemented on top of MUMPS.
MUMPS underlies InterSystems' Cache product, and a MUMPS-like language with historical connections to MUMPS underlies the products of Meditech. I'm not sure what the current status of Pick is, but it has some similarities. The company I currently work for has nothing whatsoever to do with either system... except that our business IT system happens to be Pick-based.
Regardless of what you think of MUMPS itself, there are almost certainly lessons to be learned from the durability of this language and its effectiveness.
"How to Do Nothing," kids activities, back in print!
Like many academics, he has founded a company. It hardly invalidates his research. They aren't trying to hide it either - one of the other authors is contributing in his capacity as a StreamBase employee, as shown right at the top of the paper.
Also, since nobody else has said anything about his neutrality or otherwise, you shouldn't put the word "neutral" in quotes like that. It makes you look like you are trying to set up a straw man. Neutrality is not even particularly important in a researcher. You'd have to go a long way to find one who didn't want his own particular theory to prevail, whatever the reason. Peer review of the content is the criterion on which papers are judged.
"The Milliard Gargantubrain? A mere abacus - mention it not."
The same argument is what gave rise to re-programmable generic processing components: CPUs. You'll note that the processor industry today (AMD in particular) is now also moving towards this kind of diversification. Gaming systems have been using dedicated GPUs for ages (today they're more powerful than entire PCs from 5 years ago) and I'm sure we remember back when math co-processors (i387) were introduced. You'll note that math co-processors were just absorbed back into the generic model.
It's another pendulum in the computing world (much like the serial/parallel dichotomy). Moving from a disparate number of diverse systems to a small number of all-purpose systems. The advances are always for performance, and they typically happen when the current generation plateaus. We've mastered the concepts of one generation, time to explore new concepts (by re-exploring old concepts).
In 10 or 15 years people will be complaining about the difficulty of data portability, the esoteric nature of these unique data files, and the lack of features in area X in one product and area Y in a second product, and the archaic languages you have to use on these old, unsupported systems. There will be a move back to generic storage engines, bringing with it the lessons learned from that round of insight.
Of course, there will always be demands for specialized components just as there will always be demand for generic, standard components. It's the centrists whose demands are for the best combination of performance and features that determine popularity.
The road to tyranny has always been paved with claims of necessity.
Good point, but bad example. Sabre migrated to a generic database (with MySQL and HP as significant contractors on the project).
b re-HP-MySQL-case-study.pdf
http://h71028.www7.hp.com/enterprise/downloads/Sa
But your point is well taken, and practically all the large financial institutions (Fidelity, UBS, Merrill Lynch, etc) use KDB from kx.com; also APL derived like I think the old Sabre system was.
See also KDB from kx.com.
/ 11/14/22741/791
It's the database used by most large financial institutions (Fidelity, Merrill Lynch, UBS, etc); and stores all its variables on the global K stack.
For the types of database work stock traders do, it's orders of magnitude faster than the Oracles, etc of the world.
http://kx.com/
http://www.kuro5hin.org/?op=displaystory;sid=2002
Databases already have the ability to change storage engines as long as they support SQL. The reason my company shuns the database for many specific tasks is that SQL is ill-suited to perform many types of transformations, calculations, and aggregations on data. What may take many pages of SQL (and many temp tables) in a stored proc can be written in a simple Java class and will perform much better, as well as being easier to maintain. A lot of our processing goes like this Raw data from database (simple select queries, which are very fast) -> flat files -> custom Java code -> reporting engine or another database. The speedup over using stored procs or SQL based ETL tools ranges between a factor of 10 and a factor of 100. MDX is a better language than SQL for a lot of purposes, but not all.
Kinda sad that this was modded Funny rather than Insightful.
Many many many people get stuck on the fancy things the DB writers tell you SQL can do rather than thinking about the data differently. A truly classic mistake is wetting yourself over sub-queries when you're still using parent-IDs to do hierarchies.
k o4 is how you do trees in SQL that obviate that. Doing it that way removes most of the need to produce recursive queries to get all the levels. Unfortunately, *most* forum software *doesn't* do it this way because such advanced data manipulation doesn't occur to them.
Feh.
Parent IDs are a bad idea if you want to be able to look at multiple levels of the tree at once, especially the trees you get in threaded comments. http://www.dbazine.com/oracle/or-articles/tropash
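The link above is cut off, so the following is not necessarily the technique that article describes. As one commonly used alternative to parent IDs, a "materialized path" column lets a single non-recursive query fetch a whole subtree. A minimal sketch using Python's sqlite3 module; the table and column names (comment, path, body) and the subtree helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comment (path TEXT PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO comment VALUES (?, ?)",
    [
        ("0001", "root comment"),
        ("0001.0001", "first reply"),
        ("0001.0001.0001", "reply to the reply"),
        ("0001.0002", "second reply"),
        ("0002", "another root"),
    ],
)

def subtree(conn, path):
    """Return the comment at path plus all descendants, in thread order."""
    return conn.execute(
        "SELECT path, body FROM comment "
        "WHERE path = ? OR path LIKE ? ORDER BY path",
        (path, path + ".%"),
    ).fetchall()

thread = subtree(conn, "0001")
# Nesting depth falls out of the path itself: count the separators.
depths = [p.count(".") for p, _ in thread]
```

Zero-padded, fixed-width path segments are assumed so that plain string ordering matches thread order; the trade-off is that moving a subtree means rewriting the paths of every comment in it.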