Is the Relational Database Doomed?
DB Guy writes "There's an article over on Read Write Web about what the future of relational databases looks like when faced with new challenges to its dominance from key/value stores, such as SimpleDB, CouchDB, Project Voldemort and BigTable. The conclusion suggests that relational databases and key value stores aren't really mutually exclusive and instead are different tools for different requirements."
That's efficient: a summary that refutes the inflammatory headline.
I'm just sayin'
Bla bla bla
Headline: Is the Relational Database Doomed?
Summary: "The conclusion being that relational databases and key value stores aren't really mutually exclusive and instead are different tools for different requirements."
WTF?
Someone forgot to put a where clause on that delete.
Reading slashdot one-liner: (irm http://rss.slashdot.org/Slashdot/slashdot).rdf.item | fl title,desc*
The flexibility offered in key/value databases is simply too good of a feature to pass up. However, do you really think you can get people to give up MSSQL? It'll be nice for smaller projects, but corporations won't even consider it for a number of years.
It isn't up for debate that tuple stores are a very useful tool. That being said, they aren't a silver bullet for *ALL* data storage situations. For types of data that are inherently tabular, I really doubt that 40 years of RDBMS development will be trumped by a tuple store. When you move to hierarchical data though, things are reversed.
Someone type this up and submit it to Digg.
Hey, read my article! Just to make sure you do, I'll pull a Dvorak and put in some incredibly sensational headline about how RDBMs are dewmed!!!!!! BWAHAHA, feed my advertisers!!!!
(Tune in next week, when I write about how C programming is going to become extinct in the light of fantastic new development tools like C# and Ruby on Rails!!!)
The world's burning. Moped Jesus spotted on I50. Details at 11.
Is anyone else experiencing a long delay when loading the Slashdot homepage, like a couple of seconds during which the browser is unresponsive? I'd like to know if there's something I can do about it besides blocking the offending script and reducing Slashdot to an unusable shadow of itself. I don't intend to dive into 400 KBytes (!) of minified Javascript code to find what the hell it is doing.
This isn't digg. Posting that doesn't guarantee you +5
There's a db called Project Voldemort? That's awesome! I'm switching to that just for the name! I think my manager is a Harry Potter fan so getting approval shouldn't be too hard.
This same basic story keeps getting submitted from the same group of people who are generally trying to sell non-relational-DB stuff. This is an ad. Move along.
99.9% of databases claim to follow the relational model.
The rest have scalability problems that 99.9% of developers will never see throughout their entire careers.
So the answer is a simple, emphatic, no.
Leave us RDBMS dinosaurs alone. String Name/Value pairs, that is a great innovation. In other news, Sun will be dropping all types from the Java object system and rely on the VOID type. Idiots.
The conclusion being that relational databases and key value stores aren't really mutually exclusive and instead are different tools for different requirements.
In related news, black is not white.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Map<String, String> db = new HashMap<>();
beginTransaction(); // Synchronize on the map // Just serialize the fucker to a file. The idiots using this won't know the difference.
db.put("key", "value"); // Map has put(), not add()
commitTransaction();
Ugh, yet another superficial blog post pimped out on slashdot. The guy doesn't have a solid technical grasp of data systems and what really constitutes the difference between a system like BigTable or SimpleDB versus an RDBMS. Instead of talking about the differences in transaction management, consistency guarantees, etc. he comes up with brilliant ideas like RDBMSes are slower because they are more consistent.
Enough with the bad blog posts already, it's like facebook, only less interesting.
And people complain that I don't go read the articles and rely on summaries.
This is one of the reasons.
---- Booth was a patriot ----
The big dumb thing about key/value stores is that they are actually just a subset of relational algebra in theory and are thus readily implementable in a relational database in practice. If you really wanted a database that just does key/value storage, you could quite easily do that in any RDBMS.
This is my sig.
I can't believe someone dared to write this down: "If you don't need a relational database, then relational databases are inefficient". Wow, this is rocket science, guy! You should apply for the Fields Medal.
It is obvious that if you do not need a structured and coherent database but just a big hash table with properties, then you do not need a relational database.
I won't believe it until Netcraft confirms it.
I have something in common with Stephen Hawking...
they think Nissan makes the Civic!
It has been suggested before that the life of the relational DB is coming to an end. I must say that while I agree with this statement: -
Relational databases scale well, but usually only when that scaling happens on a single server node. When the capacity of that single node is reached, you need to scale out and distribute that load across multiple server nodes. This is when the complexity of relational databases starts to rub against their potential to scale.
I disagree with the following statement: -
Try scaling to hundreds or thousands of nodes, rather than a few, and the complexities become overwhelming, and the characteristics that make RDBMS so appealing drastically reduce their viability as platforms for large distributed systems.
I submit that the complexity can be managed and that's why we have jobs.
I am an IT consultant at a major bank and we keep all kinds of data: data that many would find useless, spread across 27 [major] nodes. Our biggest table holds about 57 million records with 49 columns. I can tell you that querying the data and maintaining integrity are a breeze if the schema design is correct in the first place.
We are always designing and testing different scenarios. In cases where we have had to change the schema, it has been simple if one knows what to do.
I must say that Open Source DBs have worked for us though we rely on products from IBM and Oracle.
Our philosophy is: If it works in PostgreSQL, it will even do wonders on DB2 or Oracle. I do not see how we can do away with the relational DB. Whoever designed it in the beginning did a marvelous job.
In headlines, "?" implies that something is a serious question, whose answer is likely to be yes. One that makes it worth spending the time to read the article.
Imagine the headline said "Does Obama Smoke Crack?" and the article had a bunch of stuff about the president, with a last paragraph saying: "There is absolutely no reason to think that President Obama has ever smoked crack."
-- Support a free market in the field of government
Really, relational is the best way to take a data set and be able to access it in various ways. Many of the other concepts are indeed regressions and reintroduce problems a relational database solves. Relational allows you to display and view data in various different ways and apply the dataset in new ways, ways that may not have been part of the original design of the application. Every so often we hear someone harp about some new database technology that reintroduces all of the problems of the past, but relational is still the best and most versatile way to store your data in a way that allows for query flexibility.
Turns out, there's something called a "skateboard." You can use it to travel as far as the Quickie Mart, with nothing but your feet to propel it.
In conclusion, skateboards and automobiles aren't the same thing, so probably not.
This is.
What you need here is a quote from Monty Python. It always gets the poster a +5 here. (Yes, it's not Karma whoring due to funny mods not contributing but I would assume funny was what GP was after too). And XKCD strips are very common ways to get +5 too. As are bash.org quotes. And those are just +5 funnies. I could count a lot of ways to get nearly certain insightful mods on any Linux related article, etc...
So get rid of your elitism. There are a lot of stupid mods here, too.
I keep hoping that someday I'll read a Slashdot article headlined, "Is xxx doomed?" and the answer is... Yes!
So far... no.
Maybe we could see something like, "Is this the year that the Linux desktop is doomed?"
Seriously? Get the Harry Potter out. Out now!
"Lack of speed can be overcome. In the worst case by patience." --Znork
Uh, no, it is so pointless I didn't even have to read it to know it's pointless!
Is the Earth really flat?
LOLOLOL.
"Malo periculosam, libertatem quam quietam servitutem." -- Jefferson
You should have seen how quickly flat file usage flatlined when relational databases came out. I mean *nobody* uses plain text files anymore. Can you imagine not having your crontabs in a SQL DB?
The relational database is not going anywhere and nothing in that article is based on any firm understanding of managing data.
Is the notion of a "join" obsolete? No, but it is typically impractical in a high volume system. You would probably use denormalization as a strategy.
Scaling many nodes? OK, you still gotta put your data "in" something.
key/value indexing? yawn. select val from keyvalue_tab where key = foo;
The value can be basically anything, and most "relational" databases have good object support as well as XML, JSON, etc.
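For anyone who wants to see the point above in running form, here is a minimal sketch, assuming SQLite through Python's standard sqlite3 module. The table and column names are taken from the query quoted above; the put/get helpers and the sample record are invented for illustration, and this is not tuned for anything.

```python
import json
import sqlite3

# In-memory SQLite database; table and column names follow the query
# quoted above (keyvalue_tab, key, val).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyvalue_tab (key TEXT PRIMARY KEY, val TEXT)")

def put(key, value):
    """Store any JSON-serializable value under key, replacing old data."""
    conn.execute(
        "INSERT OR REPLACE INTO keyvalue_tab (key, val) VALUES (?, ?)",
        (key, json.dumps(value)),
    )

def get(key, default=None):
    """Fetch and decode the value stored under key, or return default."""
    row = conn.execute(
        "SELECT val FROM keyvalue_tab WHERE key = ?", (key,)
    ).fetchone()
    return default if row is None else json.loads(row[0])

# Hypothetical sample data: the value can be a nested structure.
put("foo", {"name": "mailserver1", "ips": ["10.0.0.5"]})
print(get("foo")["ips"])
print(get("missing"))
```

The primary key gives you the index for free, and nothing stops you from adding real columns later if the data turns out to have structure after all.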
So we can establish that a SQL relational database can do *everything* a simpler system can do. Now, think about ALL the things you can do with your data in a real database.
What is the point of using a limited and less functional system? A good system, like Oracle, DB2, PostgreSQL, etc (!mysql of course) will do what you need AND allow you to do more should you be successful.
The problem with data is twofold: managing reads/writes/deletes and finding what you are looking for. These problems have been solved. A good database will do this for you. Want to store objects? XML, JSON, binary objects, or a specialized database extension works perfectly.
As an architect I tend to see databases as fancy storage systems, and in general they annoy me. I love object databases and distributed key-value pair store mechanisms. But even as jaded as I am, I can see that the RDBMS isn't going anywhere any time soon. There are a lot of things that alternative storage systems simply cannot do well.
As for the Google argument (i.e., "but Google does it and it works"), I've heard it a few times in meetings where a bright-eyed company executive is trying to make a case for their use. My response is usually "yeah, all you need now is to hire people like the ones that work at Google", at which point the argument is usually dropped. Making applications work with storage systems like that takes an engineering mindset that's simply different from the talent at the average Fortune 500. RDBMS are pretty good at masking crappy development practices.
Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
There are plenty of solutions yet to be explored to handle that problem without having to dump the relational database model. One I can think of right now might be to view a server park as a hard drive, view a table like a file and apply all that filesystem technology to a problem that arises today. We can for example apply the UNIX inode filesystem design so you can efficiently store a database with great scalability.
From the article: "For example, a relatively simple SELECT statement could have hundreds of potential query execution paths, which the optimizer would evaluate at run time. All of this is hidden to us as users, but under the cover, RDBMS determines the "execution plan" that best answers our requests by using things like cost-based algorithms." So, you have no idea how optimizers work and how you can access tuning information, and you'd like to tell us RDBMSs are bad? Get off my lawn! (yay, I'm getting old)
Does that example of a relational DB have a serious error, or is that just me? Why have the make key in two tables?
He lost cred right then.
No comprende? Let me type that a little slower for you...
In theory, I agree the most costly actions in a database are joins. It seems like the key/value model is a great solution to this, on the surface. However, what the key/value model does is push the cost to the application layer. Instead of ensuring relational integrity and conformity in the database, suddenly all app code has to do this on the frontend. Also, instead of managing this process in a single place, suddenly this process is distributed among multiple methods. Sure, the DB is more scalable, but suddenly the app is a mess.
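A rough sketch of what "pushing the cost to the application layer" tends to look like. Everything here is made up for illustration: a plain dict stands in for the key/value store, and the key layout and function names are assumptions, not any real product's API.

```python
# A plain dict stands in for the key/value store; the key layout
# ("customer:<id>", "order:<id>") is invented for this example.
store = {}

def add_customer(customer_id, name):
    store[f"customer:{customer_id}"] = {"name": name}

def add_order(order_id, customer_id, item):
    # The store itself accepts anything, so the foreign-key check an
    # RDBMS would enforce has to be written by hand here...
    if f"customer:{customer_id}" not in store:
        raise ValueError(f"unknown customer {customer_id}")
    store[f"order:{order_id}"] = {"customer_id": customer_id, "item": item}

def orders_for(customer_id):
    # ...and the "join" becomes a scan in application code.
    return [
        value["item"]
        for key, value in store.items()
        if key.startswith("order:") and value["customer_id"] == customer_id
    ]

add_customer(1, "Ann")
add_order(10, 1, "widget")
print(orders_for(1))
```

Every code path that writes an order has to remember to go through add_order, which is exactly the "distributed among multiple methods" problem.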
Relational databases need to die. I loved them and preached the goodness of them 10 years ago, but they are just too rigid for contemporary needs. I've learned better ways of organizing and filtering data.. but the old RDBMS school is too canonical (stubborn) and self-indulging to realize that needs are changing and their model doesn't fit.
We need efficient attribute/value models. We need to stop referencing data by where it is and start referencing it by what it is. There is too much data that needs to exist in different views, based on policy--not explicit placement.
Dumb-tags (attributes without values) like those used with Delicious bookmarks are also broken. They are too vague.
My own approach is that every attribute may have any number of value instances. Each value instance may, in turn, have sub-attributes. So you can look up data based on its characteristics even with disregard for its name. For example: /mycompany/mailserver1/ip of zone = infirewall
This returns all IP addresses under the "zone" attribute while also under the mailserver1 attribute that is under the mycompany attribute.
When validating instances of the "ip" attribute, it looks backward in the path because it is extremely quick that way.
The data server's sole responsibility is storing and retrieving information (not just data) in context (aka filtering).
Sorting is the responsibility of the client. This makes sense because there are an infinite number of algorithms one could have for sorting data (e.g. alphabetic mixed case, ASCII order, etc). To facilitate this, I wrote a method to return the number of values that would be returned if the values were requested. If that is too big a bite for the client, it can re-request the size of a smaller chunk, segmented according to the client's ordering method. This is useful for scale, in any case. Processing in chunks makes sense whether over a network of limited capacity or directly from disk with limited memory.
And this is a columnar approach, like Google's BigTable. That means you get 10+ times faster read performance.
Matthew
This is slashdot, it was on digg a week ago and reddit a week before that.
...or at least an attempt at bad advertising or persuasive writing (cognitive justification.)
OODBMS have been pushing this, and many of them are pushed as light weight key-value stores.
http://en.wikipedia.org/wiki/ODBMS
This isn't new, like OpenDoc's Bento
http://en.wikipedia.org/wiki/OpenDoc
That became IronDoc
http://linuxfinances.info/info/oodbms.html
The problem with any of it is that relational databases rule the enterprise space. You cannot get away from them, and they are far from dead, because you will always have business people wanting to do ad hoc reports, and those are best done against denormalized models (where object stores tend to get super normalized which is just bad for reporting because cross table joins are the most expensive thing you can do in any database.)
Yay.
/\/\icro/\/\uncher
As long as there is data that is related, I predict a form of relational database.
Genius! I know, pure genius.
There are still multi-billion dollar businesses operating the core of their business on COBOL systems, and they're decades older than relational database technology.
So don't bet on it.
Question everything
DOOOOOOOOOOMED!
The Kruger Dunning explains most post on
I keep hoping that someday I'll read a Slashdot article headlined, "Is xxx doomed?" and the answer is... Yes!
Feel free to submit any or all of the following:
Is CP/M doomed?
Is Microsoft Word for OS/2 doomed?[*]
Is pets.com doomed?
Is the passenger pigeon doomed?
Is T-Rex doomed?
Is Shoemaker-Levy doomed?
Get any of these accepted, and your wish will come true. :)
[*] Note, I own a copy of this, but I still suspect it's pretty doomed.
Quick someone tell CCP!
it will be a long time before the death of Relational DBs.
People will make them work regardless. It is a bit like COBOL, there's lots of it out there and it will be maintained for years to come...put it in the bank.
Yes, I read most of the article and perhaps reading the rest would answer this, but how is a key/value database different from a MySQL database running MyISAM where you store a bunch of different objects as a string, maybe json_encoded or whatever, in the row?
I am already in the process of migrating all my enterprise systems from Oracle to Project Voldemort.
Except...usually the xkcd links that get modded up are...relevant. That comic is about validating user input, it really doesn't have much to do with RDBMS at all.
You might want to look up what relational means, because it has nothing to do with one piece of data being related to another.
1. Netcraft...
2. Overlords...
3. Shampoo...
4. Skynet...
5. ????
6. Profit!
I say don't drink and drive, you might spill your drink. Before you get behind the wheel just stop and think.
About as doomed as COBOL
Well, apparently the relational database has been doomed for the last 20 years or so.
You'd think that the people running the thousands of systems with databases managing data on everything from bank accounts to medical records to what you bought from Wal-Mart last Tuesday would have heard the news by now and moved on to the Next Big Thing.
Insisting on "correct" English is like saying that there is only one, definitive recipe for chili.
It seems like every time I read one of these articles, it is written by someone with no knowledge of what is actually out there. I suppose that is normally because anyone with enough time to mouth off on the internet in article or blog form is not actually doing real work. The rest of us are too busy, you know, working?
Anyway, as someone that has been programming, particularly against just about every major database platform out there, I can tell you that there has always been a battle between relational vs. other types of databases, most notably graph databases. Full disclosure: I am partial to graph databases, but that is because I find a lot of utility in working in pure code and I also apply graph theory to a lot of my work.
There is no silver bullet and each kind of database is better at certain tasks. Long ago, hardware made much more of a difference than it does today and was one reason relational databases "won" out. Other reasons include marketing, the development of SQL and other standards, and the ease of applying relational mathematics which are easy to understand. I spent most of my life working with relational databases as they were simply in my comfort zone, until I realized I needed to get off my ass and learn more than just new languages and programming techniques (and realized relational dbs were such a huge fail at tasks like traversing graphs, storing dynamic columns ala a CRM, etc).
Particularly when you look at older languages like LISP, relational was a great fit vs. graph. Since then, many factors have caught up and many more languages, solutions, and designs are out there. In the meantime, graph databases never went away. It seems suddenly since the outbreak of "web 2.0" frameworks and ORMs, a lot of people who don't have a clue about SQL, and especially databases in general of all types find it a great idea to go out and make their own or put some huge hack on top of a relational db, or perhaps worse, try to come up with something entirely new that is not based on fact or need.
I've played around with some of the databases mentioned above, and just like MySQL, they are mainly reinventing the wheel badly. The google implementation is the only one I have seen that is not completely shoddy, but color me somewhat unimpressed as well. FYI, just because xyz company such as google or IBM uses some product does not make it good. On the contrary, it's often a warning sign as big companies are often what keeps sites like thedailywtf going.
Regarding graph databases, I'm currently using Gemstone with Smalltalk for instance, and I have used it with Java as well. I can tell you it is great, but it is no panacea. It's been around forever like many of its competitors which are also noteworthy, but a side-by-side comparison is best left outside the scope of this comment. Gemstone lets me avoid any ORM overhead and I can write and maintain queries, transactions, reports, etc in one place: my code, where they belong for my projects. It's fast as hell and I have collections with millions of objects and there is no slowdown. In fact, I migrated many of my databases from other systems including Oracle, DB2, MS SQL, and Postgres, and it smokes them all, but I am confident that I could think of a huge laundry list of tasks where the opposite would be true.
For instance, I can do reporting using Gemstone as I mentioned, but it is left up to me writing my own code or using a library, where in something like MS SQL, I can just use reporting services to do more complicated cubes, or one of the 100 enterprise reporting tools of various size for Oracle. It also sucks in some ways if you need to share a lot of data between applications because you have to make decisions about what do you want to split out and how to manage the memory and load.
Even Gemstone itself is not so bold as to presume there is no use for relational databases. You can build a SQL layer on top of it and it has many tools to move data back and forth between any relational db, for instance.
The authors
O'Reilly Radar recently covered the topic, as did Richard Jones, a last.fm person. Some decent reading in both.
Coherence gives you all the magic you would want to have were you starting to put together a high-powered hashmap. Does key/value pairs, across multiple machines, with cache invalidation, etc. It also lets you perform interesting queries on the cache. It also can front-end Hibernate or other databases and act as a cache there too. It also works better than most HTTP session stores. It also.... Gah. This is one of my favorite JARs in my toolkit.
+++ UGUCAUCGUAUUUCU
There really isn't a true implementation of the relational model as per Codd and Date.
Also, SQL is a nightmare. A badly designed programming language which is not quite functional and not quite procedural and so needs a bunch of hacks to work properly. And then there is the issue of NULLS. And the fact that you can end up with ugly bag operations and path dependencies in SQL.
And just to start yet another flame war (I know, I just know someone is going to mod me as a troll today) key/value is just another way of saying "network database".
And another thing which I will probably get hammered for: if you normalize a DB properly you will get your objects almost for free. And vice versa. Where I see people having problems is that they are either:
1) lazy about defining and understanding their data
2) or likewise for their objects
3) or both.
If you do it properly you will get a nice set of multidimensional objects and fact/attribute tables which are orthogonal and lean. Easy to understand, search, join, build, compose, decompose, signal and track.
As opposed to a snarled up hacked together, overloaded, over inherited nightmare with hidden dependencies which I have seen too many times.
OK, you can slam me now.
putting the 'B' in LGBTQ+
SQL and all its pointy-headed progeny are the real problem with databases, not the relational vs. newMarketingBuzzwordDuJour arguments.
Database operations do not need to look like code or algorithms, the only reason they do is to provide jobs for database programmers.
Over 15 years ago Paradox's query-by-example was light-years ahead of today's soul-killing SQL crap.
SQL is not going away, though, any more than its idiot older brother Mumps (M, Caché).
"Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
Some of those systems appear to more or less still be "relational". If each row is treated like a map (associative array) of strings, then the "schema" for a given table is the set union of all attributes used in the table, and non-existing columns for a given row can be treated as nulls.
As long as an asterisk is not used in a query (ex: "select * from tableX"), then it will pretty much act like existing RDBMS, and as long as the type-explicitness issues are resolved based on dynamic language conventions. (Asterisks can be implemented perhaps, but it could be computationally expensive.)
It's kind of like dynamic (AKA "scripting") languages versus static or type-heavy languages. The static kind of languages requires more up-front info that "protects" the integrity of the thing at the expense of flexibility and declaration volume. The same dichotomy can be applied to RDBMS also. We have RDBMS that like a lot of info up-front, and now those which accept incremental or ad-hoc insertions are starting to be common (but still less standardized).
And constraints can be incrementally added, such as later requiring that every new record in a "Cars" table have a value for "brand" or the like.
One possible exception is that there were some examples that violated "map-ness" of records, such as having two colors for a car. If they instead supplied "color_1" and "color_2", then map set rules would not be violated, keeping it closer to true relational.
In short: We don't have to abandon relational to get dynamism.
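To make the rows-as-maps idea above concrete, here is a small sketch in plain Python. The sample "Cars" rows and the helper names (schema, select, require) are all invented for illustration; it is a toy under those assumptions, not how any existing product does it.

```python
# Hypothetical "Cars" table where each row is a map of attributes.
rows = [
    {"brand": "Honda", "model": "Civic", "color": "red"},
    {"brand": "Nissan", "model": "Sentra"},
    {"brand": "Ford", "year": 2008},
]

def schema(table):
    """The implied schema: the set union of every attribute in use."""
    columns = set()
    for row in table:
        columns.update(row)
    return sorted(columns)

def select(table, columns, where=lambda row: True):
    """Project named columns; attributes a row lacks come back as None."""
    return [
        tuple(row.get(col) for col in columns)
        for row in table
        if where(row)
    ]

def require(table, column):
    """A constraint added later: fail if any row lacks a value for column."""
    missing = [row for row in table if row.get(column) is None]
    if missing:
        raise ValueError(f"{len(missing)} row(s) lack {column!r}")

print(schema(rows))
print(select(rows, ["brand", "color"]))
require(rows, "brand")  # passes: every row has a brand
```

Note that select always takes an explicit column list; supporting an asterisk would mean computing schema() over the whole table first, which is where the extra cost mentioned above comes from.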
Table-ized A.I.
Last I checked FDR was our only Unconstitutionally long office holder in the presidency....
No, FDR was constitutionally the president the whole time he was president. Amendment 22, limiting presidents to two terms, wasn't passed by Congress until 21 March 1947 (and wasn't ratified until 1951), after he died.
Falcon
Should there be a Law?
Actually I read TFA, and I just couldn't make sense of the benefits offered by the key value thing.
While I don't expect RDBMS to ever disappear, there are cases where key value data stores can be more efficient. One such is name, address. Look up a name to get the address associated with the name. When all you need is the name and address, why use a database? Where this falls down is when you have more data, such as orders the named entity placed and what was ordered.
Falcon
Should there be a Law?
MySQL : Doomed or key value store ?
Had to chime in here. I work with a PICK database daily and can tell you, it blows. Lack of tools, compatibility, and structure makes living with it more than a notion. If this is the way of the DB; I'd rather shovel shit at a hog farm.
http://en.wikipedia.org/wiki/Pick_operating_system
I can see the meeting now.
Developer: "Hey boss, I found a better product for the transaction processing data! It might save us a bunch of money on Oracle licenses!"
Boss: "Great, what is it?"
Developer: "Project Voldemort!"
Boss: "..."
Developer: "No really, let me explain..."
Boss: "I have a meeting to get to, but hey, let me know if you have any other great ideas."
A SQL query walks into a bar and sees two tables. He walks up to them and says 'Can I join you?'
From Tom Kyte's blog sql joke
Analytic & algebraic topology of locally Euclidean metrization of infinitely differentiable Riemannian manifold
So how does this have anything to do with dooming relational databases anyway?
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
The real problem isn't table selection but all of your columns have to be named "pika!" with only different intonations.
I don't see relational databases going away any time soon.
Most (>70%) of the web is using them, and so far, they've worked very well.
What is missing is good support for them from the programming language point of view.
The nature of relational databases is declarative, as you define mathematically what you want, not how. That's a job for the database, and they've got huge compilers and optimizers for it.
Of course, the SQL language is a leaky abstraction of the pure relational calculus, and you have to know certain rules in order for your query to be answered efficiently.
SQL doesn't fit well in imperative languages, where all you can do is write down instructions. Compare that with a language like Prolog, which is OOTB a relational database.
With apologies to Family Guy:
"Coming up next: Can bees think? A new study indicates that no, they cannot."
Congratulations, Slashdot, you're as ridiculous as a parody.
Seriously, when are you going to get real jobs, it must be hard lying to your parents every Christmas that you're doing something more honest and worthwhile like pimping.
My view on relational DBs is that architecturally they are a bad way to implement software.
I think they should just be used for tables with indexes: no stored procs, triggers, or anything else.
There should be code written in the language of your choice to control all the transactions and business logic etc.
Giving out database schemas as an interface and giving out database logins to client software is a disaster IMHO.
Much richer, more explicit and typesafe interfaces can be provided by modern programming languages than are possible with DB scripting procedures.
The DB providers have a vested interest in developers using all their more complicated, DB specific features to avoid their product being a mere commodity.
But like any other API or technology on offer, it is just as much what you reject that makes good software as what you accept.
In summary I think the relational DB as a marketable technology may be dead, as to my way of thinking it is just an API that does indexes on tables larger than memory and knows all the searching tricks and disk access performance tricks necessary to scale to large data sizes.
The transition to other databases is probably imminent and will be a matter of months. The big database companies will die soon. Just outsource rewriting billions of lines of code to India and to newbie programmers who don't have such fixed ideas yet. This will work out great.
Reading this I keep seeing OOP in there, and data as an object class.
This is just the OOP crowd trying to not learn SQL and do things their way. It won't replace a full RDBMS. And an RDBMS can scale quite nicely if you know what the hell you're doing.
Smells like PICK to me. Those days are gone, let them go. http://en.wikipedia.org/wiki/Pick_operating_system
Scaling when a hard disk was 6MB was a serious matter. But scaling when a hard disk is 10 TB, when a server has 4 processors and 12 GB of RAM? How many websites are there that could overwhelm such a server?
Scaling to, like, 50 TB of data? What can it be?
The name of the MapReduce framework comes from the functional programming operations "map" and "reduce." Map takes as its input a collection of data, and a function that transforms data elements into other elements; it outputs a collection where each element of the input collection has been replaced by the result of applying that function to it. Reduce takes a collection of elements, an initial value of the same type as the elements, and a two-place, commutative and associative operation; it produces as its output the value that results from applying the operation to the initial value and each element of the collection in turn, accumulating the partial results.
Map and reduce are operations that can be trivially parallelized. To parallelize map, you divide the collection into subcollections (in any arbitrary manner), and map over each of them in parallel. To parallelize reduce, you divide the collection into subcollections, also arbitrarily, reduce each subcollection independently, then apply the reduction operation to the partial results. (That works because the reduction operation is commutative and associative, so neither the grouping nor the order of the elements affects the result.)
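To make the splitting argument concrete, here is a minimal Python sketch of both operations. The helper names (chunks, parallel_map, parallel_reduce) and the thread pool are illustrative assumptions, not part of any MapReduce framework, and the sketch assumes the initial value is an identity for the operation (0 for addition, for example), since it is applied once per subcollection.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def chunks(items, n):
    """Split a list into at most n contiguous subcollections."""
    size = max(1, -(-len(items) // n))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_map(fn, items, workers=4):
    """Map fn over each subcollection in parallel, then concatenate in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda part: [fn(x) for x in part],
                         chunks(items, workers))
        return [y for part in parts for y in part]


def parallel_reduce(op, items, initial, workers=4):
    """Reduce each subcollection independently, then reduce the partials.

    Assumes op is associative and commutative and that `initial` is an
    identity for op, because it is used once per subcollection.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda part: reduce(op, part, initial),
                                 chunks(items, workers)))
    return reduce(op, partials, initial)


squares = parallel_map(lambda x: x * x, list(range(10)))
total = parallel_reduce(operator.add, list(range(101)), 0)
```

Threads are used only to keep the sketch short; the same split-and-combine structure applies when the subcollections live on different machines.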
Well, guess what: this sort of technique is trivially applicable to relational database queries. A SQL query translates down to a combination of joins (the FROM clause), filters (the WHERE clause) and maps (the SELECT clause). Joins are trivially parallelizable; you give each execution unit a subset of the tuples of the driving relation. Filtering (the WHERE clause) is a kind of reduce operation. SELECT is a kind of map operation. This means that relational queries are not any less amenable to parallel execution than the stuff Google does.
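As a rough illustration of that decomposition, the sketch below evaluates one query as join, then filter, then map, over independent partitions of the driving relation. The tables, column names and the query itself are made up for the example, and the loop over partitions stands in for separate execution units.

```python
# Hypothetical query being evaluated:
#   SELECT e.name, d.name
#   FROM employees e JOIN departments d ON e.dept_id = d.id
#   WHERE e.salary > 50000

employees = [
    {"name": "Ann", "dept_id": 1, "salary": 70000},
    {"name": "Bob", "dept_id": 2, "salary": 40000},
    {"name": "Cid", "dept_id": 1, "salary": 55000},
    {"name": "Dee", "dept_id": 2, "salary": 90000},
]
departments = {1: {"name": "Eng"}, 2: {"name": "Ops"}}  # keyed by id


def query_partition(partition, departments):
    """Run the whole query on one subset of the driving relation."""
    # FROM ... JOIN: pair each employee tuple with its department tuple.
    joined = ((e, departments[e["dept_id"]])
              for e in partition if e["dept_id"] in departments)
    # WHERE: keep only the pairs that satisfy the predicate.
    filtered = (pair for pair in joined if pair[0]["salary"] > 50000)
    # SELECT: map each surviving pair to the output columns.
    return [(e["name"], d["name"]) for e, d in filtered]


def run_query(employees, departments, n_partitions=2):
    """Split the driving relation and combine the per-partition results."""
    partitions = [employees[i::n_partitions] for i in range(n_partitions)]
    results = []
    for part in partitions:  # each iteration could run on its own worker
        results.extend(query_partition(part, departments))
    return results


rows = run_query(employees, departments)
```

Result order depends on how the driving relation is partitioned, which matches SQL semantics when no ORDER BY is given.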
But the killer thing here is that MapReduce says absolutely nothing about the updates problem. This is one of the big features of RDBMSs: the ability to handle concurrent query and modification. It also says nothing about the data integrity problem, which is also one of the big RDBMS features.
So, when you get down to it, there is a good argument to be made that many applications could make use of database technologies that support much faster querying, at the expense of very little updating. But there's no convincing argument that that technology isn't best implemented in the context of an RDBMS.
Are you adequate?
Okay, I RTFA. I've made a living with relational databases for ten years, but I have no experience with key/value databases. So maybe someone can explain this better than TFA did.
Bain writes: "The first benefit is that they are simple and thus scale much better than today's relational databases. If you are putting together a system in-house and intend to throw dozens or hundreds of servers behind your data store to cope with what you expect will be a massive demand in scale, then consider a key/value store."
So, they scale well if you're going to throw tons of hardware at them? Okay, I guess I can live with that. But then, in the "The Bad" section:
"In the cloud, key/value databases are usually multi-tenanted, which means that a lot of users and applications will use the same system. To prevent any one process from overloading the shared environment, most cloud data stores strictly limit the total impact that any single query can cause.... These limitations aren't a problem for your bread-and-butter application logic (adding, updating, deleting, and retrieving small numbers of items)."
So the resource usage in the cloud with a large number of users consumes so many resources (of your "hundreds of servers") that you need to limit your application to retrieving "small numbers of items."
I can see how there'd be some benefit to providing a subset of RDBMS functionality to improve efficiency. But, if I'm understanding this article, they're apparently offsetting any gains from simplicity by duplicating data, justified by "hard drives are cheap." How the hell is this scaling "much better than today's relational databases?"
--I'm so big, my sig has its own sig.
-- See?
Getting past the buzzwords, the real issue here is that key/value databases don't do joins. Joins are expensive and hard to distribute across machines. For many web apps, one key is enough to find the relevant data, and general joins aren't necessary.
Both Google and Amazon realized this, and their key/value systems don't support joins. Developers of both systems have spoken in EE380 at Stanford, and were grilled over this issue. The big advantage of a join-free system is that a database can be split across machines without the need for elaborate intercommunication between them. You can simply put keys A-L on machine 1, and keys M-Z on machine 2. There are no crosslinks between the machines, and you don't have to do inter-machine locking. The front end machines just direct the query to the appropriate back-end machine based on the key.
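A minimal sketch of that front-end routing, assuming keys are split by first letter as in the A-L / M-Z example. The RangeRouter class and its in-memory dict "machines" are hypothetical stand-ins for real back-end servers, not how Google's or Amazon's systems are implemented.

```python
import bisect


class RangeRouter:
    """Front end that sends each key to one back end based on its first letter."""

    def __init__(self, split_points):
        # split_points: sorted first letters at which a new back end begins.
        # ["M"] means back end 0 holds A-L and back end 1 holds M-Z.
        self.split_points = split_points
        self.backends = [dict() for _ in range(len(split_points) + 1)]

    def _backend_for(self, key):
        # Only the key decides the machine; no back end talks to another.
        idx = bisect.bisect_right(self.split_points, key[:1].upper())
        return self.backends[idx]

    def put(self, key, value):
        self._backend_for(key)[key] = value

    def get(self, key):
        return self._backend_for(key).get(key)


router = RangeRouter(["M"])
router.put("alice", {"cart": ["book"]})
router.put("mallory", {"cart": []})
```

Because every request touches exactly one back end, no inter-machine locking is needed; the cost is that anything resembling a join across keys has to be done by the application.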
There's a set of things you can't do this way, but they seem not to be the high-volume queries in web applications. That's the real insight here.
Arguably, the web crowd just reinvented ISAM.
Just in case you never heard about them, have a look at Tokyo Products http://tokyocabinet.sourceforge.net/index.html by the wonderful guy who already wrote QDBM, Hyper Estraier, etc.
The presentation tells you the basics, but Tokyo Products are quickly improving, and there's already a bunch of useful new features since the presentation, as seen on Mixi's blog: http://alpha.mixi.co.jp/blog/
Tokyo Products + Flare ( http://labs.gree.jp/Top/OpenSource/Flare-en.html ) makes SQL relational databases totally useless for almost every web app, except for beginners or conservative people.
Also, with the rise of products like Terracotta (for Java) and Maglev (Ruby VM), going back to SQL really seems backward.
I would love to find an object database that keeps relations between objects and the data, e.g. child, siblings, parents. It could be done with one of these key/value DBs, but not so nicely.
Glad I'm not the only one who makes this typo. I think that our fingers get so used to "ing" that we automatically type it. But why don't we mis-type "pink" as "ping", "link" as "ling", or "fink" as "fing"?
Maybe it is the "th" part. "Something", "Everything", "Nothing".
I'm not sure how to break this habit either. Word 2007 seems to pick this up, but none of the browser spell checks will. ... anyway, back on topic I suppose.
But the killer thing here is that MapReduce says absolutely nothing about the updates problem.
That's because MapReduce is a data processing system, not a data storage system. You should read about BigTable, which is the data storage system we use (I work at Google), which does support updates.
In your comments on this thread, I think you miss the key difference between an RDBMS and a system like BigTable. BigTable is almost perfectly horizontally scalable. When you need more capacity, it really is as simple as throwing more machines at the problem.
RDBMS's can never give you this kind of horizontal scalability, because they make a promise to you that you can transactionally modify any two bits of data anywhere in your database. Fulfilling this promise requires that either your whole database lives on a single machine, or that you use a distributed transaction protocol like 2PC (which totally kills performance).
So when your database gets busier than a single machine can handle, you have to manually partition your database into multiple physical databases. All the nice RDBMS features like transactions, joins, foreign keys, triggers, etc. can only (reasonably) work within a single physical database. The divide between physical databases is something your application code has to deal with -- it has to know to direct its queries to the correct partition. And repartitioning your data to run on more machines later is an invasive procedure, both operationally and to your application's code.
BigTable is designed around the reality that a database of any significant size will need to run on more than one machine. It only guarantees that you can transactionally modify data within a single row. This gives BigTable the ability to move rows around between machines without the application even knowing this is happening. If you add more machines, BigTable can immediately start moving some subset of your rows onto this new machine.
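The two properties described above (atomic changes limited to one row, and rows that can move between machines without the application noticing) can be sketched with a toy in-memory store. ToyRowStore and all of its methods are invented for illustration and bear no relation to BigTable's actual API or implementation; a single lock stands in for whatever per-row coordination a real system uses.

```python
import threading


class ToyRowStore:
    """Toy model: single-row atomic updates, rows addressed only by key."""

    def __init__(self, n_machines):
        self.machines = [dict() for _ in range(n_machines)]
        self.location = {}  # row key -> index of the machine holding it
        self.lock = threading.Lock()  # toy stand-in for per-row locking

    def mutate_row(self, key, fn):
        """Atomically replace one row with fn(current_row)."""
        with self.lock:
            # New rows are placed round-robin; existing rows keep their place.
            idx = self.location.setdefault(
                key, len(self.location) % len(self.machines))
            row = self.machines[idx].get(key, {})
            self.machines[idx][key] = fn(dict(row))

    def read_row(self, key):
        with self.lock:
            idx = self.location.get(key)
            return None if idx is None else dict(self.machines[idx][key])

    def add_machine(self):
        """Add a machine and move a subset of rows onto it."""
        with self.lock:
            self.machines.append({})
            new_idx = len(self.machines) - 1
            for key in list(self.location)[::len(self.machines)]:
                old_idx = self.location[key]
                self.machines[new_idx][key] = self.machines[old_idx].pop(key)
                self.location[key] = new_idx


store = ToyRowStore(2)
for user in ("a", "b", "c"):
    store.mutate_row(user, lambda r: {**r, "visits": r.get("visits", 0) + 1})
store.mutate_row("a", lambda r: {**r, "visits": r["visits"] + 1})
before = {k: store.read_row(k) for k in ("a", "b", "c")}
store.add_machine()
after = {k: store.read_row(k) for k in ("a", "b", "c")}
```

The application only ever passes a row key, so moving rows in add_machine changes nothing from its point of view. There is deliberately no method that updates two rows together, which is the guarantee being given up.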
I recommend reading this paper for a far more in-depth look at this pattern. The key point of this paper is:
BigTable calls such entities "rows."
If you do it properly, you will get a nice set of multidimensional objects and fact/attribute tables which are orthogonal and lean. Easy to understand, search, join, build, compose, decompose, signal and track.
I'm led to believe it's not that easy, but I'd love to be shown wrong.
Also, SQL is a nightmare
I agree, and I think one of the interesting questions is why we don't have something better, or even just something else. There are probably millions of man-hours put into ORM or QBE layers, some with their own hacked-up query languages... that are eventually re-written as an SQL query. But as far as I know, despite the fact that we have open source databases, despite the fact that storage engines aren't married to queries... we don't have any other query languages directly supported by the database (unless, I don't know, is QUEL still supported by Postgres?).
Where's D? Why not have Prolog (or a tabled prolog if you're worried about unbounded queries)? Given the fecundity of the field with regards to all kinds of different programming languages, I don't understand why there seems to have to be One Query Language(TM).
Tweet, tweet.
Do you really need automatic referential integrity, or are you just trying to save the programmers some time?
Do you really need ad-hoc queries, or are you just trying to save the programmers some time?
Do your application programmers really write such buggy code that they cannot be trusted to write their own integrity/query code, or does your QA just suck ass?
Do you really need a client/server solution for data-storage, or is it just what you know and are comfortable with?
Do you really need an RDBMS as an integration tool for several applications, or are there better options?
Even in mathematical usage, a relation determines a relationship between entities. If (x,y) is in a relation R, we say that x and y are related by R.
So, shut the fuck up, you illiterate twat.
I know I'm being pedantic, but it grates on me to hear the term "materialized view". There is no such thing, and "materialized view" is a contradiction in terms, since a view by definition is never fixed and changes as the data in the tables it references change. You would be better off referring to "materialized views" as snapshots. Because once you "materialize" something, it is definitely no longer a view.
Beware of bugs in the above code; I have only proved it correct, not tried it.
The threat of Key/Value stores to Relational Databases is about as big as the threat of Visual Basic to C++. Yes, they're easier (so average script-kiddie posers might prefer them), but for many applications (especially in the pro business) they're just not powerful enough.
The MAFIAA is a bunch of mindless jerks who will be the first up against the wall when the revolution comes
Sure I know - ? marks the place where parameter is inserted in a prepared statement
In other words: you have to read the actual code below to know ...
Way to miss the point - there are countless NON-relational database systems that can also handle "data that is related."
Oh, and here's your gratuitous insult: you're a cranially underdeveloped moribund coprophage.
1. Tutorial-D
2. SMEQL
Table-ized A.I.
I've used "real" RDBMSes (MySQL and PostgreSQL -- yes, I know, MySQL doesn't really count compared to Postgres, Oracle or DB2, but I've used it...), MS-Jet (ugh), and SQLite, and my favorite by far is SQLite. PostgreSQL is lovely, and can do maybe 90% of what Oracle can for 0% of the price, but it is massive overkill for small apps (not to mention not very friendly to non-DBA system admins).
As for things that can get away with just a simple key-value mapping and don't need to be written to disk, I love std::map. The STL makes C++ that much nicer to live with (and it beats Microsoft's MFC utility classes any day).
-lee
PL/1 to doom COBOL.
Linux to doom Windows.
Ruby to doom PHP.
Chrome to doom Firefox.
Boxers to doom Y-Fronts.
Derivatives to doom economy (OK, OK - they did).
Her lips were softer than a duck's bill, but her quacks
I think having this arbitrary pool of key/value pairs, especially when a value is at most (and at least) a string, has the potential to cause more chaos and havoc than what relational DBs have already imposed on the software development industry. Yes, simple key/value pairing has the benefits of simplicity and can be distributed easily, but ask yourself, would you really want to distribute it?
I agree with the author that relational databases are not a solution - they never were, but let's not lower ourselves to the level of simple key/value pairs. Just reading the specs of the free implementations out there, two followed the eventual consistency model which, let's face it, is absolutely pathetic for anything that calls itself a "database" in this day and age. It's fine for a social network site like Facebook, but would you use it in a financial application? Or, as you put it, in a ticket reservation app.
A proper database is one that mirrors and evolves in parallel with the application that uses it. And the only database flavour that manages to do this right is a native object oriented database. Yes, it has been kludged up in the past, and there are way too many post-relational flavours that are nothing more than a glorified relational database. In my opinion, nothing that offers a SQL-like syntax for querying a database can be considered relational: not in this universe and not in any one of the parallel universes that may exist. If you really want to solve a long term problem, you need to solve it right. Have a look at native object databases, and transaction-oriented databases, which are a completely new breed of OO DBs combining transaction processing, auditability and native object support. And they can run in a distributed environment, scaling out as the load goes up. Take a look at DTS/S1 from Obsidian Dynamics.
http://www.obsidiandynamics.com/dts
The idea behind these kinds of offerings is dead simple. If I'm writing my application using objects, instantiating classes, passing references around, then why would I resort to a relational product, let alone a key/value (albeit distributed) hashmap, to persist my data? Trust me, and I have been in this industry for many years, all roads lead to objects.