SQL and NoSQL are Two Sides of the Same Coin
An anonymous reader writes "NoSQL databases have become a hot topic with their promise to solve the problem of distilling valuable information and business insight from big data in a scalable and programmer-friendly way. Microsoft researchers Erik Meijer and Gavin Bierman ... present a mathematical model and standardized query language that could be used to unify SQL and NoSQL data models."
Unify is not quite correct; the article shows that relational SQL and key-value NoSQL models are mathematically dual, and provides a monadic query language for what they have coined coSQL.
An inverse tachyon pulse would disperse the relational quantum silica into a focused warp field, thus purging all forms of slipstream space based SQL databases from subspace.
Resistance is futile. Your technological distinctiveness will be added to our own. You will become one with the morgue
so why the hell are we still using SQL?
Why are we still using C? Why are we still using HTML? Why are we still using FORTRAN (in the scientific community)? Same reason.
Might add that all these - C, HTML and FORTRAN - are still being updated, with new standards. So is SQL. It's really the same thing, and they all stick around for the same reasons, too.
We DO still program in COBOL.
And we DO still use SQL.
And we do so because it works.
Not only does SQL work, it is the best at what it does.
The only people who hate on SQL are the people who don't understand databases.
Generally, these are the same people who like labels, tag clouds and ruby on rails.
They produce a lot of high level hand waving with regards to the actual code and endless amounts of "herp derp I dunno" when asked why their shit performs slower than the 10 year old system it's supposed to replace. These are bad people.
What really pisses me off is that everyone fucking agreed with me until Android came out, then suddenly Java was cool, the performance was considered "good", and the quality of code and coders that it tends to bring about is now the acceptable norm.
Actually COBOL predates SQL by about 10 years. AFAIK nobody has written a language that implements the relational query model to replace SQL. And (though I have never written anything in COBOL he says thankfully) COBOL has its place even today. I would not be surprised if there are as many lines of COBOL still running in enterprises everywhere as there are of PHP or Perl in those same enterprises.
And COBOL even now is without question a better solution for business and application programming than C ever was or ever will be. (Of course it's arguable that there are other languages better for those tasks than COBOL as well.) C is good for device drivers, kernels and as a target for interpreted and scripting languages with compiled code generators. C is, as Kernighan, Ritchie or Thompson (I forget which) said, "a structured PDP-11 Macro-assembler". Today (putting my Nomex suit on...) IMHO application programmers should not be wasting their time coping with segfaults and compile-link cycles. Their time is worth more than the machine time that any cycle-saving difference. :)
It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
To my mind, SQL's biggest problem over the years has been really shitty implementations (and yeah, I'm looking at you, MySQL).
The world's burning. Moped Jesus spotted on I50. Details at 11.
There are a few things with SQL that could be done better, and there's still some standardization needed. I'd like to see a SQL 2012 standard or something.
But you're entirely right. There is one place nosql makes sense, and it's gigantic data stores like facebook and google, where the quantity is overwhelming, and the quality isn't that important...it's okay miss a few things, and you're looking for sorta-random stuff. That is why NoSQL was invented.
No one should ever 'choose' to use NoSQL...if you're on the size of a project that needs NoSQL, I promise you you are nowhere near that decision...it will be decided between the seven project architects as they buy thirty servers to run the damn thing. That's the guys who have a legit need, or at least it's a legit option, of using NoSQL.
It's not useful for any other system in existence.
It's especially funny when toy projects try to use NoSQL. It's like idiots trying to run their watches off geothermal power. 'It's free power! FROM THE EARTH!'
Dude, you're using half an amp, perhaps you should learn how to use a watch battery instead of driving 2 mile polls into the ground as you walk around. It's not like SQL is fucking rocket science. In fact, right now, NoSQL is actually more complicated to use.
If corporations are people, aren't stockholders guilty of slavery?
An SQL statement walks into a bar. He sees 2 tables and asks "May I join you?"
Yes, no, and yes, in that order. I'm basing my answers on HBase, with which I have the most experience. My answers are also practically guaranteed to be wrong in somebody's eyes, because HBase is so much more flexible than an RDBMS. If I describe one way of doing something, another layout may work just as well, and somebody's going to favor that way.
How does indexing work in NoSQL? Are there EXPLAIN-type tools available?
EXPLAIN tools aren't really necessary in HBase, because almost all nontrivial queries are a scan over a small chunk of the alphanumerically-sorted rows. It will take a while, but please allow me to explain. Each row is a multi-value key-value store, with each value having a column name. If you really want to stick to the RDBMS style, you could have your key be a numeric row ID, and scan everything for every query. It would suck, because you're not using any indexes.
Indexes are more or less left up to the programmer. Creating an index is effectively just adding more rows to the table. For example, that RDBMS-style layout in the last paragraph could be a table of ID numbers, usernames, passwords, and permissions (for 50 billion people, I guess...). For whatever business reason, the main key will be the ID number. Those rows are easy. They have the expected value columns: username, password, permissions. To index by username, we add new rows, with just a column for the ID number. We could just duplicate the data, but let's not. Now, our table is going to be huge, but sparse. Half of the rows have three of four columns filled, and the other half has only one. Searching by name, it'll take two requests to get to the actual row we want, but that's okay. Doubling the amount of work lets us run faster.
The reason for that is HBase's split design. HBase's table is split into column families and regions. Column families are a means to group columns, so that even on data with overlapping key space, separate data could remain separate. Column families are stored as separate files in Hadoop. In our example, the username "index" could be a separate column family. That could speed up scanning, because the rows keyed by numeric usernames won't be interspersed with the rows keyed by user id. More importantly, the table is split into regions, each containing a number of rows. Those regions are also stored as separate files, and distributed across the entire Hadoop cluster.
The cluster is really where Hadoop gets its speed. If we were to run all of our processing from one central location, it would be horribly slow and require a ridiculous number of requests. Instead, we'll distribute everything, including the query, similar to how some RDBMS sharding schemes work. We send a request to all nodes, asking for "the row with the key that matches the value of the 'userid' column of the row with a given key". Each node will report back its results. Unlike RDBMS sharding, the partitioning is handled automatically by HBase into regions that are optimal. It's these regions that are scanned for every request.
After all of that, it should be quite clear: With HBase, the programmer is expected to know the layout of the data, and write requests based on the key. There is no EXPLAIN tool, because everything is just a key-value lookup.
Whew. Next question...
Can you do just about any query you could with SQL?
Yes, but it's different. Every lookup is handled by scanning a region (in parallel on nodes that have that region's data files), and checking each column of each row to see if:
Note that last item. The filter is simply a program that tells Hadoop whether the row (or some part of it) should be included in the returned results. That program can include other HBase requests, using other filters. If you're really stuck on using RDBMS
You do not have a moral or legal right to do absolutely anything you want.