SQL and NoSQL are Two Sides of the Same Coin
An anonymous reader writes "NoSQL databases have become a hot topic with their promise to solve the problem of distilling valuable information and business insight from big data in a scalable and programmer-friendly way. Microsoft researchers Erik Meijer and Gavin Bierman ... present a mathematical model and standardized query language that could be used to unify SQL and NoSQL data models."
Unify is not quite correct; the article shows that relational SQL and key-value NoSQL models are mathematically dual, and provides a monadic query language for what they have coined coSQL.
An inverse tachyon pulse would disperse the relational quantum silica into a focused warp field, thus purging all forms of slipstream space based SQL databases from subspace.
Resistance is futile. Your technological distinctiveness will be added to our own. You will become one with the morgue
...is that SQL sucks as a language. It's not terribly expressive, the ordering of arguments is inconsistent, and whoever designed the way JOIN works should be in jail.
Frankly, I'd like to see SQL die and get replaced with something more modern. We don't program in Cobol anymore, so why the hell are we still using SQL?
There's no -1 for "I don't get it."
To my mind, SQL's biggest problem over the years has been really shitty implementations (and yeah, I'm looking at you, MySQL).
The world's burning. Moped Jesus spotted on I50. Details at 11.
There are only 2 types of languages:
- those people bitch about, and
- those no one ever used
An SQL statement walks into a bar. He sees 2 tables and asks "May I join you?"
Yes, no, and yes, in that order. I'm basing my answers on HBase, with which I have the most experience. My answers are also practically guaranteed to be wrong in somebody's eyes, because HBase is so much more flexible than an RDBMS. If I describe one way of doing something, another layout may work just as well, and somebody's going to favor that way.
How does indexing work in NoSQL? Are there EXPLAIN-type tools available?
EXPLAIN tools aren't really necessary in HBase, because almost all nontrivial queries are a scan over a small chunk of the alphanumerically-sorted rows. It will take a while, but please allow me to explain. Each row is a multi-value key-value store, with each value having a column name. If you really want to stick to the RDBMS style, you could have your key be a numeric row ID, and scan everything for every query. It would suck, because you're not using any indexes.
Indexes are more or less left up to the programmer. Creating an index is effectively just adding more rows to the table. For example, that RDBMS-style layout in the last paragraph could be a table of ID numbers, usernames, passwords, and permissions (for 50 billion people, I guess...). For whatever business reason, the main key will be the ID number. Those rows are easy. They have the expected value columns: username, password, permissions. To index by username, we add new rows, with just a column for the ID number. We could just duplicate the data, but let's not. Now, our table is going to be huge, but sparse. Half of the rows have three of four columns filled, and the other half has only one. Searching by name, it'll take two requests to get to the actual row we want, but that's okay. Doubling the amount of work lets us run faster.
The reason for that is HBase's split design. HBase's table is split into column families and regions. Column families are a means to group columns, so that even on data with overlapping key space, separate data could remain separate. Column families are stored as separate files in Hadoop. In our example, the username "index" could be a separate column family. That could speed up scanning, because the rows keyed by numeric usernames won't be interspersed with the rows keyed by user id. More importantly, the table is split into regions, each containing a number of rows. Those regions are also stored as separate files, and distributed across the entire Hadoop cluster.
The cluster is really where Hadoop gets its speed. If we were to run all of our processing from one central location, it would be horribly slow and require a ridiculous number of requests. Instead, we'll distribute everything, including the query, similar to how some RDBMS sharding schemes work. We send a request to all nodes, asking for "the row with the key that matches the value of the 'userid' column of the row with a given key". Each node will report back its results. Unlike RDBMS sharding, the partitioning is handled automatically by HBase into regions that are optimal. It's these regions that are scanned for every request.
After all of that, it should be quite clear: With HBase, the programmer is expected to know the layout of the data, and write requests based on the key. There is no EXPLAIN tool, because everything is just a key-value lookup.
Whew. Next question...
Can you do just about any query you could with SQL?
Yes, but it's different. Every lookup is handled by scanning a region (in parallel on nodes that have that region's data files), and checking each column of each row to see if:
Note that last item. The filter is simply a program that tells Hadoop whether the row (or some part of it) should be included in the returned results. That program can include other HBase requests, using other filters. If you're really stuck on using RDBMS
You do not have a moral or legal right to do absolutely anything you want.