Is It Time For NoSQL 2.0?
New submitter rescrv writes "Key-value stores (like Cassandra, Redis and DynamoDB) have been replacing traditional databases in many demanding web applications (e.g. Twitter, Google, Facebook, LinkedIn, and others). But for the most part, the differences between existing NoSQL systems come down to the choice of well-studied implementation techniques; in particular, they all provide a similar API that achieves high performance and scalability by limiting applications to simple operations like GET and PUT.
HyperDex, a new key-value store developed at Cornell, stands out in the NoSQL spectrum with its unique design. HyperDex employs a unique multi-dimensional hash function to enable efficient search operations: that is, objects may be retrieved without using the key (PDF) under which they are stored. Other systems employ indexing techniques to enable search, or enumerate all objects in the system. In contrast, HyperDex's design enables applications to retrieve search results directly from servers in the system. The results are impressive. Preliminary benchmark results on the project website show that HyperDex provides significant performance improvements over Cassandra and MongoDB. With its unique design, and impressive performance, it seems fitting to ask: Is HyperDex the start of NoSQL 2.0?"
Er...
Um...
Yeah...
I use Riak in production and, while it is notably slow, I appreciate its fault-tolerance. I wonder if HyperDex is just as resilient? Also, Riak 1.1 just came out, finally adding management and diagnostics. Secondary indexing may not be as fast as HyperDex, but, depending on the design, "crashproof-ness" can come out the winner.
From the paper:
Efficient lookup of fully-specified objects is critical to
object insertion and deletion performance, and requires
a deterministic object to node mapping. Much like in
ring-based key-value stores[3, 17, 38], HyperDex maps
both object coordinates and nodes to the same hyperspace.
Specifically, HyperDex tessellates the hyperspace
into a grid of N regions in space. Zones, which we previously
defined to be mutually exclusive regions belonging
to one or more nodes, are created by assigning nodes
to each of the tessellated regions to be responsible for
all objects which hash to a coordinate within the region.
The zone mapping is disseminated to all clients which
may operate directly on the mapping without any routing
between server nodes.
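A minimal sketch of the mapping the quoted passage describes, under toy assumptions that are not from the paper: three attributes, two intervals per axis, four nodes, and a round-robin assignment of regions to nodes. The names `axis_bucket`, `region_of`, `build_zone_map` and `node_for` are made up for illustration and are not HyperDex's API.

```python
import hashlib
from itertools import product

# Toy parameters (assumptions): 3 attributes, 2 intervals per axis -> 8 regions.
ATTRIBUTES = ("first", "last", "phone")
CUTS_PER_AXIS = 2
NODES = ["node-a", "node-b", "node-c", "node-d"]


def axis_bucket(value: str) -> int:
    """Hash one attribute value onto its axis and return the interval index."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % CUTS_PER_AXIS


def region_of(obj: dict) -> tuple:
    """Grid region that a fully specified object hashes to."""
    return tuple(axis_bucket(obj[name]) for name in ATTRIBUTES)


def build_zone_map() -> dict:
    """Assign every region of the grid to a node (round-robin, for illustration)."""
    regions = product(range(CUTS_PER_AXIS), repeat=len(ATTRIBUTES))
    return {region: NODES[i % len(NODES)] for i, region in enumerate(regions)}


# In the quoted design this map is handed to every client.
ZONE_MAP = build_zone_map()


def node_for(obj: dict) -> str:
    """A client holding the zone map picks the node locally, with no routing hop."""
    return ZONE_MAP[region_of(obj)]
```

Because the mapping is deterministic, any client computing `node_for` on the same object reaches the same node, which is the property the passage says insertion and deletion depend on.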
This sounds like the old Berkeley DB/Sleepycat software.
Key/Value pairs instead of relational stuff. Worked with a product years ago that was built on Berkeley -- offered some pretty useful features that simply didn't map to object-relational stuff.
For some applications, you really do need something that works a little differently than an RDB ... however, there's still loads of things I can't imagine trying to do without one.
Choice is good in technology.
Lost at C:>. Found at C.
Er...
Um...
Why not learn both, and use whichever's strengths suit the application the best?
You do not have a moral or legal right to do absolutely anything you want.
http://www.youtube.com/watch?v=URJeuxI7kHo
is the best introduction to this subject I've seen. Until someone can explain the pros of hyperdex with a funny video featuring cute animals I'm sticking with technology that's been tested more thoroughly.
http://rareformnewmedia.com/
The hashing system is pretty neat. The idea that you could get at records without their specific key via search criterion is astounding.
In the future more advanced hashing systems will allow NoSQL databases to extract a set of records all containing a similar subset of data without keys at all!
Of course we'd need a name for the sections that are matching. Perhaps "Columns", yeah, then each result returned could be called a "Row", makes sense. I bet you could then create even more complex matching patterns for multiple "Columns" against each record in the data-set. If only there was a language to describe the query we're sending to the servers... Oh! Server Query Language!
I can't wait to use SQL with NoSQL 3.0!
Many of the key-value pair DBs supply a Perl library that lets you tie a Perl hash (%Variable) to the DB directly, giving you persistent hashes.
Makes database storage virtually a native feature of the language. Anybody who uses Perl is probably already a hash buff, so it is a win-win if you and your app already use Perl.
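For readers who don't use Perl, roughly the same idea can be sketched in Python with the standard-library `shelve` module, which exposes a dbm-backed file through a dict-like object. This is only an analogue of a tied hash, not the Perl mechanism itself; the helper names `remember`, `recall` and `demo` are made up for illustration.

```python
import os
import shelve
import tempfile


def remember(path: str, key: str, value) -> None:
    """Store a value under a key; it is written to disk when the shelf closes."""
    with shelve.open(path) as db:
        db[key] = value


def recall(path: str, key: str, default=None):
    """Read a value back, possibly in a later run of the program."""
    with shelve.open(path) as db:
        return db.get(key, default)


def demo() -> int:
    """Count a visit in a throwaway directory and return the stored counter."""
    path = os.path.join(tempfile.mkdtemp(), "counters")
    remember(path, "visits", recall(path, "visits", 0) + 1)
    return recall(path, "visits")
```

Each call reopens the file, so the value read by `recall` comes from disk rather than from an in-memory dict.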
Disclaimer: I run a 10yo web "app" (Perl/CGI/Apache), so I'm a bit biased. But, the thing is rock-solid, so I'm not going to be too apologetic.
"Flame away, I wear asbestos underwear"
So, at what point do we all admit that a NoSQL database is basically a glorified file-system over a network and start calling it a file-system again?
This is a type of index, not a type of database. See locality-sensitive hashing. It's an efficient way to find keys which are "near" the search key in some sense.
Such a mechanism could be provided in a key/value store or an SQL database. It's even possible to do it on top of an SQL database. It's more powerful in a database that can do joins, because you can ask questions with several approximate keys.
This is an area of active research. Many machine-learning algorithms are scaled up by locality-sensitive hashing, so they can work on big data.
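To make "near in some sense" concrete, here is a small sketch of one locality-sensitive hashing family, random hyperplanes, where vectors pointing in similar directions tend to get similar bit signatures. The choice of this family, the parameters, and the function names are assumptions for illustration; nothing here is taken from HyperDex.

```python
import random


def make_planes(dimensions: int, bits: int, seed: int = 0) -> list:
    """Draw one random hyperplane (as a normal vector) per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dimensions)] for _ in range(bits)]


def signature(vector: list, planes: list) -> tuple:
    """Record, for each hyperplane, which side of it the vector falls on."""
    bits = []
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def hamming(a: tuple, b: tuple) -> int:
    """Number of differing bits; a small distance suggests a small angle."""
    return sum(x != y for x, y in zip(a, b))
```

A store would bucket records by signature (or by slices of it) and compare a query only against records in matching buckets, trading exactness for speed.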
Isn't that what XML is for? XML files are also compatible across systems.
XML is more useful for transferring data between systems. For storing data it kind of sucks, since there are no indexes (not the kind we need for fast lookups anyway) and it's extremely verbose.
Because NoSQL wasn't hipster enough.
Kill all hipsters.
until I noticed that there seems to be a single point of failure in this system. From the site:
The HyperDex coordinator maintains the "hyperspace." This involves making sure that servers are up, detecting failed or slow nodes, taking them out of the system, and replacing them where necessary. The coordinator maintains a critical data structure, the hyperspace map, that establishes the mapping between the hyperspace and servers. Clients use this map to locate the servers they need to contact, while servers use it to perform object propagation and replication to achieve the application's desired goals.
How can people call a system "fault tolerant" and "distributed" when it might as well be running off a single box?
When keys are concatenated with a joining symbol, objects can only be retrieved when one possesses all of the joining keys. Hyperspace hashing allows object retrieval when only a subset of the attributes is available.
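The contrast can be sketched as follows, under toy assumptions (three attributes, four intervals per axis, hypothetical function names). A concatenated key needs every part, while a per-attribute hash lets a partial query pin down the axes it knows and enumerate only the rest.

```python
import hashlib
from itertools import product

# Toy parameters (assumptions): 3 attributes, 4 intervals per axis -> 64 regions.
ATTRIBUTES = ("first", "last", "phone")
CUTS_PER_AXIS = 4


def axis_bucket(value: str) -> int:
    """Hash one attribute value onto its axis and return the interval index."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % CUTS_PER_AXIS


def composite_key(obj: dict) -> str:
    """Concatenated key: raises KeyError unless every attribute is supplied."""
    return "|".join(obj[name] for name in ATTRIBUTES)


def region_of(obj: dict) -> tuple:
    """Region where a fully specified object is stored."""
    return tuple(axis_bucket(obj[name]) for name in ATTRIBUTES)


def candidate_regions(query: dict) -> list:
    """Regions that could hold a match for a query naming only some attributes."""
    axes = []
    for name in ATTRIBUTES:
        if name in query:
            axes.append([axis_bucket(query[name])])  # known attribute: one interval
        else:
            axes.append(range(CUTS_PER_AXIS))  # unknown attribute: every interval
    return list(product(*axes))
```

With these numbers, a query on `last` alone touches 16 of the 64 regions, and one on `first` plus `last` touches 4; each further attribute shrinks the candidate set by the number of cuts on that axis.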
there are indexed xml retrieval systems with their own query languages to boot. Oracle XML DB is one (built on top of its sql dbms)
Facebook, Google and friends wouldn't need such databases if they respected privacy. Solve the privacy issues and MyISAM will be enough for everyone. And for the marketing, just send pregnancy coupons to everyone; you'll get 'em.
In using Oracle RDBMS, I see that for very large data set queries, using a hash join causes lots of disk activity (lots of paging, going to swap). Though hash functions are fast, this performance comes from scanning through a hash table that's fully mapped in memory. Once your hash table gets too big for the available memory, you start using disk space (unindexed, sequential full reads). Isn't this a bottleneck in a distributed database that relies on hash functions? Wouldn't you want to have a distributed DB based on a distributed version of a B-Tree descendant (B+Tree, B*Tree, B**Tree) that would use memory AND storage and scale out more than just the available memory on all your nodes? Not only that, but you'd likely have better performance on range scans. Just thinking...
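The range-scan point can be illustrated with a small in-memory sketch. A sorted list searched with `bisect` stands in for a B-tree descendant here (an assumption for brevity, not a real B+Tree and not how any of the systems discussed store data), and a plain dict stands in for a hash index. The class names are made up.

```python
import bisect


class HashIndex:
    """Hash-based index: fast point lookups, but no key ordering."""

    def __init__(self, rows):
        self._rows = dict(rows)

    def get(self, key):
        return self._rows.get(key)

    def range(self, low, high):
        # Without ordering, every key has to be examined.
        return sorted((k, v) for k, v in self._rows.items() if low <= k <= high)


class SortedIndex:
    """Ordered index: a sorted list standing in for a B-tree descendant."""

    def __init__(self, rows):
        self._items = sorted(rows, key=lambda kv: kv[0])
        self._keys = [k for k, _ in self._items]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

    def range(self, low, high):
        # Two binary searches find the ends; the slice is one contiguous read.
        start = bisect.bisect_left(self._keys, low)
        stop = bisect.bisect_right(self._keys, high)
        return self._items[start:stop]
```

Both classes return the same rows for a range query; the difference is that the hash version inspects every key while the ordered version reads only the matching run, which is the access pattern that favors B-tree-style structures on disk.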
Whenever a /. headline asks a question, the answer is always No.
Hand it over to Mozilla. We could then have NoSQL 5.0 by the end of summer.
Why does a new product operating in the very same space as other key-value stores warrant an increment of the buzzword version number?
Mumps was NoSql before NoSql was cool: MUMPS and NoSql
Disclaimer: my only interaction with MUMPS has been via thedailywtf: A Case of the MUMPS
-- The Genesis project? What's that?
There's a difference between a database that returns XML and an XML file.
I didn't RTFA but are they trying to reinvent IMS?
Try it! Library of Babel
But they don't store the data as XML. They usually decompose it down to Infoset, and then store that in some relational fashion with indexes and stuff; and reconstitute XML when returning results of a query.
I don't know about Oracle, but in my experience XML databases built on top of RDBMSes (I'm looking at you Microsoft) suck. XML data is often highly unstructured, and SQL Server, at least, tries to force unstructured XML into a structure and ends up doing it poorly.
How about consistency? Does this database even support the notion of transactions?
If Pandora's box is destined to be opened, *I* want to be the one to open it.
the topic was storage and querying of xml, plenty of products do that, some without involving an sql database at all
eh, all XML is structured, hierarchically.
All of that processing and reconstitution really destroys the nutritional value, and excess compression contributes to high 0x80 levels. You really should be mindful about the data that you are putting into your program.
Yes, they always screw up the carefully balanced amounts of different kinds of character entities in the process.
I'm kind of confused about your reply, but to clarify -- if you want fast random access, XML is a terrible format. If you want to transfer data between two systems, XML can be excellent. The example of Oracle being able to return XML data just confirms what I'm saying -- the data is stored in Oracle's binary format, and *transferred to you* as XML.
You seem to be thinking I'm claiming there's something wrong with XML, but all I'm saying is that XML files are not designed to be databases.
CouchDB has native map-reduce indexing of arbitrary fields of the stored data. Doesn't appear to be anything new here in that regard.
Be careful. People in masks cannot be trusted.
http://www.mongodb-is-web-scale.com/
Which does not denote structure.
Oracle Coherence (formerly Tangosol) is a distributed cache with queries too. It has existed for many years and does exactly this (and more).
I don't think this is anything new.
Against