Slashdot Mirror


The NoSQL Ecosystem

abartels writes 'Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years. Collectively, these alternatives have become known as NoSQL databases. The fundamental problem is that relational databases cannot handle many modern workloads. There are three specific problem areas: scaling out to data sets like Digg's (3 TB for green badges) or Facebook's (50 TB for inbox search) or eBay's (2 PB overall); per-server performance; and rigid schema design.'

32 of 381 comments

  1. Why worry? by Anonymous Coward · · Score: 5, Funny

    Microsoft Access is here!

    1. Re:Why worry? by MichaelSmith · · Score: 4, Funny

      Don't forget excel!

    2. Re:Why worry? by Manos_Of_Fate · · Score: 4, Funny

      Because there's no "scary because it's true" mod.

      --
      Isn't enough that I ruined a pony, making a gift for you?
    3. Re:Why worry? by Anonymous Coward · · Score: 5, Interesting

      You laugh, but the things I see done in Excel on a daily basis in production environments getting a LOT of work done are a testament to its power. It is one of the best rapid application development platforms in existence. People with no CS background programming away in a functional style and getting shit done and not even realising they are programming. It could be so much better but it's still the best of breed. And yes I have tried, and seen others try, O.O. et al. Forget it. Let's not go down that worn old road.

  2. hmm by buddyglass · · Score: 4, Insightful

    With regard to scalability, it strikes me that the problem isn't so much SQL but the fact that current SQL-based RDBMS implementations are optimized for smaller data sets.

    1. Re:hmm by phantomfive · · Score: 5, Insightful

      The biggest problem is the cloud. A lot of cloud APIs don't allow full relational database access, so now it seems we are coming up with all these justifications for why we don't really need it. Notice that this blog is from a company pushing a cloud based solution.

      --
      Qxe4
    2. Re:hmm by MightyMartian · · Score: 4, Insightful

      That's my take as well. We have these crippled semi-databases that lack a lot of useful features that anyone that has used RDBMSs over the last few decades has gotten used to, so suddenly it becomes a justification game; "Well, SQL doesn't deliver the output we need, so here's some half-way-to-SQL tools which are really better, kinda... oh yes, and Netcraft confirms it, SQL is dying!!!!"

      I have a feeling that this is part hype, part inept programmers who don't actually understand SQL, or database optimization. The first sign for me that someone is selling bullshit is when they try to act like this is some never before seen problem, when in fact there is a good four decades of research of database optimization.

      --
      The world's burning. Moped Jesus spotted on I50. Details at 11.
    3. Re:hmm by KalvinB · · Score: 4, Insightful

      For the vast majority of use cases, large data sets can be made logically small with indexes or physically small with hashes.

      If you're dealing with massive data you're probably not dealing with complex relationships. E-Mail servers associate data with only one index: the e-mail address. Google only associates content with keywords. E-mail servers logically and physically separate email folders. Google logically and physically separates the datasets for various keywords. So by the time you hit it, it knows instantly where to look for what you want. You don't have a whole complex system of relationships between the data. It looks at the keywords, finds the predetermined results for each and combines the results.

    4. Re:hmm by mzito · · Score: 4, Insightful

      Uh, no, that is not correct. Relational DBMSes such as Oracle, Teradata, DB2, even SQL Server are all designed to scale into the multi-terabyte to petabyte range. The issue is one of a couple of things:

      - Cost - "real" relational databases are expensive. I once had a conversation with someone who worked at Google, who talked about how much infrastructure they have written/built/maintain to deal with MySQL. Many of those problems were solved in an "enterprise" DBMS 3-10 years ago. However, the cost of implementing one of those enterprise DBMS is so high that it is cheaper to build application layer intelligence on top of a stupid RDBMS than purchase something that works out of the box.
      - Workload style - most of the literature around tuning DBMS is for OLTP or DSS workloads. Either small question, small response time (show me the five last things I bought from amazon.com) or big question, long response time (look through the last two years worth of shipping data and figure out where the best places to put our distribution centers would be). Many of these workloads are combos - there could be very large data sets and complex data interdependencies, with low latency requirements. It may be possible to write good SQL that does these things (in fact, I know a couple luminaries in the SQL space that will claim just that), but the community knowledge isn't there.
      - Application development - when you're building your app from scratch, you can afford to work around "quirks" (bugs) and "gaps" (fatal flaws) to get what you need. This dovetails with the other issues, but when your core business is building infrastructure, it's worth your while to deal with this. When your core business is selling insurance or widgets, or whatever, it is not.

      None of this is to say that the "nosql" movement is a bad thing, or that there's no reason for its existence, or that no one should bother looking at it. However, there is a definite trend of "this is so much better than SQL" for no good reason. SQL has scaled for years, and I know loads of companies who work with terabytes and terabytes of data on a single database without any issue.

      A far more interesting discussion is the data warehouse appliance space - partitioning SQL down to a large number of small CPUs and pushing those as close to the disk as possible.

      --
      me@mzi.to
    5. Re:hmm by buchner.johannes · · Score: 4, Interesting

      The first sign for me that someone is selling bullshit is when they try to act like this is some never before seen problem, when in fact there is a good four decades of research of database optimization.

      Your point is valid, but I think there is more to it. And the problems these solutions try to solve are quite old too. For example:

      Ever tried to design a database, but got the requirement that you should be able to reconstruct the modification history? It boils down to not deleting (ever), and 'deleted' flag fields and other ugliness. A multi-version relational database would be nice, you actually don't need modification/delete operations in this scenario, just 'updates' that add to the previous status. CouchDB does append operations.

      In some cases you may not need a complete SQL database, just key->value relations, but have them scaling very well. http://project-voldemort.com/ states: "It is basically just a big, distributed, persistent, fault-tolerant hash table." Then they state that they provide horizontal scalability, which MySQL doesn't (OTOH, we should really look at Oracle for these things).

      And you can't really say MapReduce/Hadoop is pointless.

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
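
A minimal sketch of the append-only history idea from the comment above, using Python's bundled sqlite3. The `customer_version` table and its columns are hypothetical, chosen only for illustration; this is not how CouchDB stores revisions.

```python
import sqlite3

# Hypothetical schema: every change is a new row; nothing is updated or deleted.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_version (
        customer_id INTEGER NOT NULL,
        version     INTEGER NOT NULL,
        name        TEXT,
        deleted     INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (customer_id, version)
    )
    """
)


def append_version(customer_id, name, deleted=False):
    """Record a change by inserting the next version instead of updating."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM customer_version "
        "WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO customer_version VALUES (?, ?, ?, ?)",
        (customer_id, latest + 1, name, int(deleted)),
    )


def current(customer_id):
    """Return the latest name, or None if the latest version is a deletion."""
    row = conn.execute(
        "SELECT name, deleted FROM customer_version "
        "WHERE customer_id = ? ORDER BY version DESC LIMIT 1",
        (customer_id,),
    ).fetchone()
    return None if row is None or row[1] else row[0]


def history(customer_id):
    """Return every recorded state, oldest first."""
    return conn.execute(
        "SELECT version, name, deleted FROM customer_version "
        "WHERE customer_id = ? ORDER BY version",
        (customer_id,),
    ).fetchall()


append_version(1, "Ann")
append_version(1, "Ann Smith")
append_version(1, None, deleted=True)  # a "delete" is just another version
```

After these three calls `current(1)` reports the record as gone, while `history(1)` still returns all three states.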
    6. Re:hmm by QuoteMstr · · Score: 4, Insightful

      I think I'd rather see the opposite: That non-relation DBs become the mainstream, and they have SQL added for the odd occasion it is useful. Relational has some nice properties for ad-hoc querying, but for everything else they are a nuisance.

      Berkeley DB is a very good non-relational database with multiple language bindings, several storage engines, and transaction support. It's been around for 24 years, and has seen some appreciable use.

      But that use was nothing compared to the database explosion that SQLite brought about when it was released. SQLite is almost exactly like Berkeley DB, except that it has a SQL engine on top. Almost everyone is using SQLite, and many Berkeley DB users are moving over to it.

      Why? Because SQLite is relational! That constitutes some serious evidence that relational databases are more than "a nuisance".

    7. Re:hmm by Hognoxious · · Score: 3, Funny

      Rigid schema design is an asset, not a liability!

      Not to people who think a free format text field is the ideal place to store the price, quantity and delivery date of an order. Why not, it's long enough for it all to fit. And it saves all that moving between fields.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    8. Re:hmm by geminidomino · · Score: 3, Insightful

      It's not without precedent. Drop all the features of SQL databases that make them a good idea and you end up with MySQL.

      (Burn, baby, burn)

    9. Re:hmm by larry+bagina · · Score: 3, Informative

      create post insert, update, and delete triggers which file the data (as well as the action, timestamp, and user) into an audit table.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.
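
A rough sketch of the audit-trigger approach suggested above, again with sqlite3. The `orders` and `orders_audit` tables are made up for the example, and the "user" column is left out because SQLite has no notion of a database user; a server RDBMS would typically record it as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, qty INTEGER NOT NULL);

    -- Hypothetical audit table: one row per change to orders.
    CREATE TABLE orders_audit (
        action     TEXT NOT NULL,
        order_id   INTEGER NOT NULL,
        qty        INTEGER,
        changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
        INSERT INTO orders_audit (action, order_id, qty)
        VALUES ('INSERT', NEW.id, NEW.qty);
    END;

    CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_audit (action, order_id, qty)
        VALUES ('UPDATE', NEW.id, NEW.qty);
    END;

    CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
        INSERT INTO orders_audit (action, order_id, qty)
        VALUES ('DELETE', OLD.id, OLD.qty);
    END;
    """
)

# Ordinary statements; the application never writes to the audit table itself.
conn.execute("INSERT INTO orders (id, qty) VALUES (1, 5)")
conn.execute("UPDATE orders SET qty = 7 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

audit_rows = conn.execute(
    "SELECT action, order_id, qty FROM orders_audit ORDER BY rowid"
).fetchall()
```

The `orders` table ends up empty, but `audit_rows` still holds the full insert, update, and delete trail.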

  3. Re:bad design by bennomatic · · Score: 5, Funny

    I'm a terabyte sized binary blob, you insensitive clod!

    --
    The CB App. What's your 20?
  4. Dynamic Relational: change it, DON'T toss it by Tablizer · · Score: 5, Interesting

    The performance claims will probably be disputed by Oracle whizzes. However, the "rigid schema" claim bothers me. RDBMS can be built that have a very dynamic flavor to them. For example, treat each row as a map (associative array). Non-existent columns in any given row are treated as Null/empty instead of an error. Perhaps tables can also be created just by inserting a row into the (new) target table. No need for explicit schema management. Constraints, such as "required" or "number" can incrementally be added as the schema becomes solidified. We have dynamic app languages, so why not dynamic RDBMS also? Let's fiddle with and stretch RDBMS before outright tossing them. Maybe also overhaul or enhance SQL. It's a bit long in the tooth.

    More at:
    http://geocities.com/tablizer/dynrelat.htm
    (And you thought geocities was de
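
To make the "each row is a map" idea above concrete, here is a toy in-memory sketch. `DynamicTable`, `insert_into`, and `add_required` are invented names for illustration, not part of any real RDBMS; the point is only that missing columns read as null and constraints can be bolted on later.

```python
class DynamicTable:
    """Toy table whose rows are plain dicts; columns exist only where set."""

    def __init__(self):
        self.rows = []
        self.required = set()  # columns that must be present and non-null

    def add_required(self, column):
        """Tighten the schema later; refuse if existing rows violate it."""
        for row in self.rows:
            if row.get(column) is None:
                raise ValueError(f"existing row lacks required column {column!r}")
        self.required.add(column)

    def insert(self, **row):
        missing = sorted(c for c in self.required if row.get(c) is None)
        if missing:
            raise ValueError(f"missing required columns: {missing}")
        self.rows.append(row)

    def select(self, *columns):
        """Absent columns read as None rather than raising an error."""
        return [tuple(row.get(c) for c in columns) for row in self.rows]


tables = {}  # tables spring into existence on first insert


def insert_into(table_name, **row):
    tables.setdefault(table_name, DynamicTable()).insert(**row)


insert_into("people", name="Mary", city="Oslo")
insert_into("people", name="Bob")  # this row simply has no city
```

Selecting `name` and `city` from `people` yields `None` for Bob's city; calling `add_required("name")` afterwards makes later inserts without a name fail.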

    1. Re:Dynamic Relational: change it, DON'T toss it by sco08y · · Score: 4, Interesting

      However, the "rigid schema" claim bothers me. RDBMS can be built that have a very dynamic flavor to them. For example, treat each row as a map (associative array).

      You described an entity attribute value model, which winds up reinventing half the DBMS, poorly. Don't worry, *everyone* does one once until they realize it's a bad idea.

      Constraints, such as "required" or "number" can incrementally be added as the schema becomes solidified.

      A "rigid" schema is preventing a ton of totally redundant code being written on the app side. All those constraints wind up in the schema because your UI designer doesn't want to consider that Mary might have 5 addresses or 6 mothers or work 7 jobs simultaneously. And your UI tester doesn't want to test an exploding combinatorial number of possibilities.

      I'd like to see, however, a decent type system, proper logical / physical separation, etc.

      Maybe also overhaul or enhance SQL. It's a bit long in the tooth.

      I'm starting from scratch. (Currently I'm slowly retyping about 40 pages into LaTeX...)

  5. NoSQL? That'd Be DL/I, Right? by BBCWatcher · · Score: 4, Informative

    I think I've heard of non-relational databases before. There's a particularly famous one, in fact. What could it be? Let's see: first started shipping in 1969, now in its eleventh major version, JDBC and ODBC access, full XML support in and out, available with an optional paired transaction manager, extremely high performance, and holds a very large chunk of the world's financial information (among other things). It also ranks up there with Microsoft Windows as among the world's all-time highest grossing software products.

    ....You bet non-relational is still highly relevant and useful in many different roles. Different tools for different jobs and all.

  6. Re:bad design by munctional · · Score: 5, Insightful

    Ever heard of bloom filters? Sharding? Indexes? They are clearly not doing a table scan on 50TB of data every time you open your Facebook inbox.

    You know, there's a certain point where people need to stop and actually think about the implimentation.

    Um, they do. They regularly blog about their solutions to their problems and open source their solutions and contributions to existing projects. They come up with amazing solutions to their large scale problems. They're running over five million Erlang processes for their chat system!

    http://developers.facebook.com/news.php?blog=1

    http://github.com/facebook

    Also, when was the last time you tried to visit Facebook and it was down? They're doing quite well for people who need to stop and actually think about their "implimentation".

    --
    Functional programming... for real men!
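
For readers who have not met the Bloom filters mentioned above, here is a small sketch of the data structure. The sizes and the SHA-256-based hashing are arbitrary choices for the example and say nothing about what Facebook actually runs.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for user in ("alice", "bob", "carol"):
    seen.add(user)
```

A lookup that returns False can skip the expensive disk or network read entirely, which is why such filters sit in front of large stores.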
  7. Starting to love the idea by Just+Some+Guy · · Score: 4, Interesting

    I'm a huge PostgreSQL fan and took classes in formal database theory in college. I'm saying this as someone who understands and thoroughly appreciates relational databases: I'm starting to love schema-less systems. I've only been playing with CouchDB for a few weeks but can certainly see what such stores bring to the table. Specifically, a lot of the data I've stored over the years doesn't neatly map to a predefined tuple, and while one-to-one tables can go a long way toward addressing that, they're certainly not the most elegant or efficient or convenient representation of arbitrary data.

    I'm certainly not going to stop using an RDBMS for most purposes, but neither am I going to waste a lot of time trying to shoehorn an everchanging blob into one. Each tool has its place and I'm excited to see what niche this ecosystem evolves to fill.

    --
    Dewey, what part of this looks like authorities should be involved?
  8. Re:bad design by JavaPunk · · Score: 4, Interesting

    Yes it does (look through 50TB of data), and how would you design it? It has to access all of your friends and find their postings. Robert Johnson gave an excellent talk on Facebook's design two weeks ago at OOPSLA (it should be in the ACM digital library soon). He stated that there is no clear segregation of data, the (friend) network is too connected and extracting groups of friends isn't possible. Basically they have a huge MySQL farm with memcached on top. Loading an inbox will hit multiple servers (maybe even a different server for each of your friends) across the farm.

  9. Re:bad design by socceroos · · Score: 5, Funny

    Ever heard of bloom filters? Sharding? Indexes?

    Don't forget flux capacitors, FTL drives and crossfading splicers.

  10. Everything old is new again by QuoteMstr · · Score: 5, Interesting

    We didn't start with relational databases. RDBMSes were responses to the seductive but unmanageable navigational databases that preceded them. There were good reasons for moving to relational databases, and those reasons are still valid today.

    Computer Science doesn't change because we're writing in Javascript now instead of PL/1.

    1. Re:Everything old is new again by QuoteMstr · · Score: 3, Interesting

      Your question reminds me of the people who say, "if flight recorders are so strong, why don't we just build the whole plane out of the stuff they use to make them?" You might as well ask, "if DNS is so great, why don't we implement filesystems in terms of it?" Your post demonstrates that you haven't considered context and purpose.

      Relational databases are models. You can certainly describe DNS in terms of a relational schema. In principle, you could construct a wrapper and query it with SQL. But there's no reason to do that, because with something as simple as DNS, the full power of a relational query engine doesn't buy you much.

      Most datasets aren't that simple.

      Furthermore, DNS is an open standard that needs to be accessible in as simple a way as possible. Complicating it with relational semantics wouldn't have been worthwhile (because of DNS's relative simplicity), and would have significantly hampered DNS's interoperability.

      That is, if relational databases had existed when DNS was implemented, which they didn't.

      Furthermore, DNS is a distributed, decentralized database. You couldn't use a RDBMS (the software that realizes the abstract model of a relational database) to manage it even if you wanted to. That doesn't apply to most datasets, which however large, are still managed by a single organization, and which are accessed by software under the control of that organization.

      Your comparison really makes no sense whatsoever. The vast majority of databases aren't put under the same constraints as DNS, and so can take advantage of the much greater flexibility an RDBMS affords.

      You're basically arguing that we can't have efficient engines in automobiles because a few of them might need to tow 18 ton trailers and withstand mortar rounds. It's ridiculous.

  11. Vendor Hype Orange Alert (Re:hmm) by Tablizer · · Score: 3, Interesting

    Notice that this blog is from a company pushing a cloud based solution.

    That is indeed suspicious. But if they want to sell clouds, then make a RDBMS that *does* scale across cloud nodes instead of bashing SQL. (SQL as a language doesn't define implementation; that's one of its selling points.) It may be that since there's not one out yet, they instead hype the existing non-RDBMS that can span clouds.

    (I agree that SQL could use some improvements, such as named sub-queries instead of massive deep nesting to make one big run-on statement. Some dialects already have this to some extent.)
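
As an illustration of the "named sub-queries" mentioned above, the sketch below runs the same query twice in SQLite: once with a nested subquery and once with a `WITH` clause (a common table expression), which is one form some dialects offer. The `sales` table and its data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('north', 10), ('north', 30), ('south', 5);
    """
)

# Nested form: the aggregate is buried inside the FROM clause.
nested = """
    SELECT t.region, t.total
    FROM (SELECT region, SUM(amount) AS total FROM sales GROUP BY region) AS t
    WHERE t.total > 20
"""

# Named form: the same aggregate gets a name up front and is then reused.
named = """
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    )
    SELECT region, total FROM region_totals WHERE total > 20
"""

nested_rows = conn.execute(nested).fetchall()
named_rows = conn.execute(named).fetchall()
```

Both statements return the same rows; the difference is purely in how readable the query stays as more steps are added.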

    1. Re:Vendor Hype Orange Alert (Re:hmm) by Just+Some+Guy · · Score: 4, Informative

      A lot of times people who don't know about joins do the basic join of select x.a, y.b from x, y where x.c = y.c Not realizing that Most SQL engines will take all the records of x and cross them with y so you will have x.records*y.records Loaded in your system, then it goes and removes the matches. So O(n^2) in performance, Vs. If you do a Select x.a, y.b from x left join y on x.c

      Dude. That is so unbelievably wrong. First, implicit (comma) joins are inner, not left: your results will differ from the original query. Second, please name one popular database released in the last 3 years that implements inner joins with predicates in the way you describe. I can't speak for the others, but PostgreSQL sure as hell doesn't:

      => select count(1) from invoice;
      count
      ---------
      1241342

      => select count(1) from ship;
      count
      --------
      664708

      => select invoice.invid from invoice, ship where invoice.shipid = ship.shipid and ship.name_delpt = 'redacted';
      invid
      ---------
      12345
      12346

      Each of those queries against our live production database ran in under a second (and I only edited the input and output of the final query). PostgreSQL may be quick, but I promise you it didn't have time or RAM to create 825,129,958,136 tuples and then winnow out the non-matches. Maybe you're stuck on an ancient version of a DB that was crappy to start with, but the rest of us don't put up with the same insanities you describe.

      --
      Dewey, what part of this looks like authorities should be involved?
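
The first point in the reply above (comma joins are inner, not left) can be checked on a tiny dataset. This sketch reuses the `invoice` and `ship` table names from the comment, but the rows are invented and it runs on SQLite rather than PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ship (shipid INTEGER PRIMARY KEY, name_delpt TEXT);
    CREATE TABLE invoice (invid INTEGER PRIMARY KEY, shipid INTEGER);
    INSERT INTO ship VALUES (1, 'dock-a'), (2, 'dock-b');
    -- Invoice 102 has no shipment, so only an outer join will return it.
    INSERT INTO invoice VALUES (100, 1), (101, 2), (102, NULL);
    """
)


def ids(sql):
    """Run a query and return the first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


comma_join = ids(
    "SELECT invoice.invid FROM invoice, ship "
    "WHERE invoice.shipid = ship.shipid"
)
inner_join = ids(
    "SELECT invoice.invid FROM invoice "
    "INNER JOIN ship ON invoice.shipid = ship.shipid"
)
left_join = ids(
    "SELECT invoice.invid FROM invoice "
    "LEFT JOIN ship ON invoice.shipid = ship.shipid"
)
```

The comma form and the explicit `INNER JOIN` return the same two invoices, while the `LEFT JOIN` also returns the unshipped one, so swapping one for the other changes the result.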
  12. Re:bad design by ErikTheRed · · Score: 4, Funny

    So... every time I open my inbox in Facebook, it has to search through 50TB of data? That sounds like a design problem. What has always floored me is why people think everything needs to be stuffed into a database. Terabyte sized binary blobs? You know, there's a certain point where people need to stop and actually think about the implimentation.

    Could be worse. They could try to find something on my desk.

    --

    Help save the critically endangered Blue Iguana
  13. Re:bad design by Ragzouken · · Score: 3, Interesting

    "Also, when was the last time you tried to visit Facebook and it was down? They're doing quite well for people who need to stop and actually think about their "implimentation"."

    When was the last time you tried to use Facebook or Facebook chat and didn't get failed transport requests, unsent chat messages, unavailable photos, or random blank pages?

  14. Re:bad design by Zombywuf · · Score: 4, Insightful

    The problem is when people don't think about the solution and apply the cargo cult mentality. Facebook uses Eeeerlaaaang therefore we should. Facebook wrote its own database, therefore we should. People end up writing their own database engines that do exactly the same thing as modern relational engines, with all the bugs that were fixed in the relational engines 10 years ago (5 for Microsoft). Even MS SQL will split a large group by aggregate operation (which takes 3 lines to specify) across multiple CPUs by turning it into a map reduce problem, and it will do this all without you having to be aware of it. Oracle (and many others, Oracle's is supposed to be the best) will maintain multiple concurrent versions of your data in order to allow multiple users to work with a snapshot that doesn't change under them while others are changing the data, and this happens transparently. You can go ahead and implement all this stuff yourself if you want, in C and sockets, call me when you're done, in 10-20 years.

    The real issue I have with the NoSQL people is they're a bunch of whiny babies, who haven't even taken the time to understand the problem before lashing out at the first thing they see. Just the name tells you this, they call themselves "No SQL" and then lash out at relational databases. SQL is a terrible language, which really needs replacing, but it is only one possible language for querying relational databases. Relational databases represent several decades of research into how to query data in a fault tolerant scalable way as a standing implementation, re-implementing them is a waste of time.

    --
    If you can read this you've gone too far.
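
The claim above that a group-by aggregate can be treated as a map-reduce problem can be sketched in a few lines. This is only an illustration of the idea in plain Python with invented data; it makes no claim about how any particular database engine parallelizes queries internally.

```python
from collections import Counter
from functools import reduce

# Equivalent in spirit to: SELECT region, SUM(amount) FROM sales GROUP BY region
rows = [("north", 10), ("south", 5), ("north", 30), ("east", 7), ("south", 1)]


def partial_sums(chunk):
    """Map step: aggregate one chunk independently (one chunk per worker)."""
    totals = Counter()
    for key, amount in chunk:
        totals[key] += amount
    return totals


def merge(left, right):
    """Reduce step: combine two partial aggregates into one."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds counts rather than replacing
    return merged


# Pretend each chunk is handed to its own CPU.
chunks = [rows[i::2] for i in range(2)]
grouped = dict(reduce(merge, map(partial_sums, chunks), Counter()))
```

Because the merge step is associative, the chunks can be aggregated in any order or in parallel and the final totals come out the same.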
  15. Re:And I am missing it greatly on Linux by Errol+backfiring · · Score: 4, Informative

    I did profile my code. It is not my gut feeling, but my experience.

    --
    Nae king! Nae laird! Nae yurrupiean pressedent! We willna be fooled again!
  16. solution looking for a problem? by timmarhy · · Score: 3, Insightful

    SQL databases if designed properly DO handle enormous datasets. The problem starts when you have wits designing the database and then managers attempting to use the DB for purposes it wasn't meant for.

    --
    If you mod me down, I will become more powerful than you can imagine....
  17. Re:I know the type well by QuoteMstr · · Score: 3, Interesting

    Right. Don't forget PostgreSQL too. Really, the problem here is MySQL. Hell, look at the "tips and tricks" comments for this story: they all deal with ways to work around deficiencies in MySQL (and old versions of MySQL at that.)

    The guy who recommends using the first two characters of the MD5 hash to select a table is particularly hilarious. Doesn't he realize that's what a database index already does, and that databases (even MySQL) will do that for him?
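
For context, the trick being mocked above looks roughly like the sketch below: hash the key with MD5 and use the first two hex characters to pick one of 256 tables. The `users_` table prefix and the `table_for` helper are hypothetical names for the example.

```python
import hashlib


def table_for(key, base="users"):
    """Pick one of 256 tables from the first two hex digits of the MD5 hash."""
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:2]
    return f"{base}_{prefix}"


# The application then has to build every query against the chosen table,
# e.g. f"SELECT ... FROM {table_for(username)} WHERE name = ?".
example_table = table_for("hello")
```

A single table with an index on the key column narrows the lookup in a comparable way without the application having to route queries by hand, which is the commenter's point.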