NoSQL Document Storage Benefits and Drawbacks
Nerval's Lobster writes "NoSQL databases sometimes feature a concept called document storage, a way of storing data that differs in radical ways from the means available to traditional relational SQL databases. But what does 'document storage' actually mean, and what are its implications for developers and other IT pros? This SlashBI article focuses on MongoDB; the techniques utilized here are similar in other document-based databases."
It's so cute how NoSQL developers have reinvented the XML database.
The article is barely a description of MongoDB records. It does not really detail any real drawbacks or benefits beyond "look ma, random structure in my record!"
Oh, look, it's a NoSQL article.
Cue the hundreds of Slashdotters who proclaim "Oh, they've reinvented obsolete databases" and "Just wait until they need ACID, then they'll be fucked", the NoSQL blind-faith followers who harp about pure scalability and clustering, and at least a dozen references to an animated video of a strawman saying "webscale" repeatedly.
Somewhere in the depths of poorly-researched comments will be some guy who thinks that NoSQL is a tool that really just might be useful for particular use cases, and should be used where appropriate, and nowhere else. Sadly, his post will be missed because everyone's too busy talking about how everything can be done just as easily on a $500,000 server farm running Oracle's latest and greatest turd.
I'm not sure what the point of this "article" is. It is light on actual information or anything useful; it's basically just a few paragraphs that say "a NoSQL database called Mongo stores data in JSON format. This may or may not work for you".
If we're going to have "BI" articles, they should be informative, containing useful information that we couldn't have gathered ourselves in 10 secs of googling.
How about some comparisons between various NoSQL solutions? How about a binary access API vs. a RESTful approach à la Couch? How about clustering, replication and scalability? How about stability concerns (with Couch, for example)? Real-world use cases? Examples of companies using them for specific solutions? Performance comparisons with RDBMSs? Problem domains that a NoSQL/schemaless DB is better suited to than an RDBMS?
I'm not trying to be pointlessly critical here, I'm trying to provide some constructive feedback on the new slashdot BI format. This article wasn't useful to me at all. I'll probably not spend time reading these articles in the future if the content is as light as this article.
I don't know when unstructured data turned into NoSQL or Big Data, but it is a pretty simple concept with complex Enterprise level requirements. I work in this field and have for various companies. The biggest obstacle is conforming to the laws of various jurisdictions and levels of government.
You have unstructured data, but it NEEDS some level of structure. That structure is there to restrict access to certain groups within the organization and also for retention rules, which differ by type of data being stored. Not to mention that you must store certain documents in the country of origin, so structured field-based distributed storage plays a role. Oh yea, laws/policies around encryption and whether or not an index violates those laws/policies.
This doesn't work well with a relational database. Sure, you can jam it into a RDBMS like IBM Content Manager, but it becomes inflexible. However, there are constraints that must be followed and all documents need some kind of structure wrapped around them in a RDBMS-like fashion.
I haven't dove into these NoSQL systems myself. They seem like a good idea, but I hesitate if they are too loose. In an Enterprise with sensitive information, you need to deny first. Also, how do they index the fields? Like when you have 100,000,000 documents with invoice numbers...
I read this article with the hope of seeing some of the benefits and drawbacks (as the title implied). No talk of scalability, indexing, speed, etc. I actually feel dumber for having read the article.
The "old old boss" would be the CDF/NetCDF/HDF family of self-describing distributed storage solutions. They predate XML by a long way and are - I believe - the first true self-describing method of storing, indexing and searching data.
For the most part, they support network interconnections between instances, so you can have your virtual storage distributed over as many physical systems as you like. The users will never see the difference except in terms of speed. This gives you all the benefit of NoSQL's distributed model (which XML lacks) but with several decades more development in the database design.
But wait! There's more! If you order in the next gazillion years, you get OPeNDAP absolutely free! (Which it is anyway.) OPeNDAP will translate between any two data formats, so if one site wants to view the data as, say, a conventional database, another wants to look at it as a collection of spreadsheets and a third is expecting XML data, you'd have OPeNDAP translate between client form and central repository form.
I have no objections to Mongo or Memcache, they're very powerful and are very useful, but we're still ultimately talking about technology everyone else has had since 1985, thanks be to NASA, and many NoSQL technologies are really just network-aware versions of the DBM/NDBM/BDB/GDBM/QDBM family which have existed since Unix began.
NoSQL definitely has a place - I would not want to try serving cached web data from HDF5 - and it's an important place. But that's just as true for Hierarchical Databases, Star Databases (aka "Data Warehouses"), "genuine" (ie: actually complies with Codd's rules) relational databases (SQL isn't truly relational in the Codd model, merely a subset), and so on.
It's time we got away from one-size-fits-all ideas, which violate the Unix ethos anyway, and got back to using the best solutions for specific problems rather than passable solutions that fail at everything. These are all wonderful, highly specialized solutions to highly specific problem types. Treating them as such will always produce a better answer than force-fitting solutions into not-quite-failing with problems they aren't designed for.
With MongoDB, the lack of a hard schema requirement doesn't mean your data model can be all willy-nilly. You have to put some thought into it. Several people have mentioned they are looking for the benefits and drawbacks. I'm really enjoying it. Here's my short list of drawbacks.
Case sensitivity: If you load data with mixed case, you will have to use a regular expression to find it all.
Data type strictness: If you load zip codes as a string "11223" and try to find them with a numeric 11223, you are out of luck. Also, if you load your data with mixed data types for the same set of keys, good luck finding it again without having to work for it.
Rich documents are cool: Until you try to remove a struct inside an array that's inside a struct. There is little documentation that explains how to manipulate complex data structures. You can join the mongodb-user Google group; they are very responsive. Plus, learning to use JSON as your query language is a different way of thinking from SQL queries.
Things I like about MongoDB: It is fast. Replicating shards is very handy and relatively easy to do. It has a growing audience and user base. And finally, it's fun to watch people's faces when you tell them "I replicated a bunch of shards today, what did you do?"
peace
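To make the case and type complaints above concrete, here is a rough sketch in plain Python rather than against a real MongoDB server. The records, field names and the two helper functions are made up for illustration; they only mimic the general idea that an equality match is strict about both case and type.

```python
import re

# Hypothetical records, standing in for documents in a collection.
docs = [
    {"name": "Alice", "zip": "11223"},   # zip loaded as a string
    {"name": "alice", "zip": 11223},     # zip loaded as a number
    {"name": "ALICE", "zip": "11223"},
]

def find(collection, field, value):
    """Exact match on both value and type (an assumed model of equality matching)."""
    return [d for d in collection
            if field in d and type(d[field]) is type(value) and d[field] == value]

def find_regex(collection, field, pattern):
    """Case-insensitive match on string fields via a regular expression."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [d for d in collection
            if isinstance(d.get(field), str) and rx.fullmatch(d[field])]

string_hits = find(docs, "zip", "11223")      # only the string zips
number_hits = find(docs, "zip", 11223)        # only the numeric zip
exact_name_hits = find(docs, "name", "alice") # one spelling only
any_case_hits = find_regex(docs, "name", "alice")
```

The point is just that mixed types and mixed case split one logical value into several physical ones, so it pays to normalize on the way in.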
Where is Lotus Notes in all this brouhaha? They were the original NoSQL system.
JSON also introduces a fantastic new method of inserting arbitrary executable code into a web application
How so, if you parse the JSON in your own code instead of eval()ing it?
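A small sketch of the parse-versus-eval distinction, using Python's standard json module as a stand-in for a JavaScript JSON parser (the thread is about JavaScript, so treat this as an analogy). The payload strings are hypothetical.

```python
import json

# Hypothetical untrusted input: valid JSON whose string value looks like code.
payload = '{"name": "x", "note": "__import__(\'os\').getcwd()"}'

data = json.loads(payload)   # parsed as data; nothing is executed
note = data["note"]          # still just a string

# Input that is code rather than JSON is rejected by the parser.
not_json = "__import__('os').getcwd()"
try:
    json.loads(not_json)
    rejected = False
except json.JSONDecodeError:
    rejected = True
```

The risk only appears when the text is handed to an evaluator instead of a parser.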
You can create associative arrays in JSON.
{
    "associations": {
        "foo": "bear",
        "fu": "bar",
        "one": 1
    }
}
As for executable code, all JSON is JavaScript but not all JavaScript is JSON.
You can put arbitrary code in any string, regardless of the encoding.
I was really hoping for a more in-depth description of what NOSQL has to offer over other DB options.
my article about porn stars using kickstarter and other donation websites to provide each other with healthcare (because there is no healthcare in the porn industry) got instantly blacklisted.
instead, you get this.
shrug.
The comments on SlashBI are great too. I also wanted to know how to query data out of your "documents" as the Wikipedia page doesn't describe that. Using the SlashBI example, show me all contact objects with state = "DC" or all records where last name ilike 'o_ama'. Does performing a search like that iterate over all records? Do you need to enable some full-text indexing of your entire document store to be able to execute queries like that?
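As a rough illustration of what those two queries amount to when no index is available, here is a plain-Python sketch. The contact documents, their field names and the ilike_to_regex helper are assumptions for the example, not anything from the SlashBI article.

```python
import re

# Hypothetical contact documents; field names are assumptions.
contacts = [
    {"first": "Barack", "last": "Obama", "address": {"state": "DC"}},
    {"first": "Jane", "last": "Doe", "address": {"state": "VA"}},
    {"first": "Ann", "last": "Smith", "address": {"state": "DC"}},
]

def ilike_to_regex(pattern):
    """Translate SQL LIKE wildcards (_ and %) into a case-insensitive regex."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")
        elif ch == "%":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

# Without an index, both queries inspect every document (a full scan).
in_dc = [c for c in contacts if c.get("address", {}).get("state") == "DC"]
rx = ilike_to_regex("o_ama")
like_hits = [c for c in contacts if rx.fullmatch(c.get("last", ""))]
```

An index on the queried field is what lets a database skip the scan for the equality case; a pattern match like the second query is generally harder to serve from an ordinary index.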
... to describe non-relational databases. There are many different ways to store data in various database designs. The main ones are: relational (MySQL, PostgreSQL), column based (HBase, Cassandra), document based (MongoDB, Couchbase), hash based (Redis, memcached), and graph based (Neo4j). Databases can also be categorized as single system databases or distributed databases. MongoDB and most (all?) relational DBs are single system databases. HBase, Cassandra, Couchbase, and Riak are distributed databases. NoSQL was developed to solve problems that traditional SQL databases cannot solve, or solve only with difficulty, such as real-time updates on a massive scale, scaling horizontally with relative ease, or lightning-fast query speeds. They also have drawbacks such as eventual consistency and synchronization issues. A good architect / programmer should do a solid analysis of the use case against the different flavors of DBs, run benchmarks, be familiar with their failover / recovery procedures, and have a good understanding of the underlying technologies before considering using them in a serious production environment.
MongoDB has no way to guarantee data integrity (no defined fields, foreign keys, constraints or triggers), nor can it provide much in the way of security, but it is very fast at querying huge data sets, great for making persistent objects in many languages, and even better for treating those objects more like a database (finding objects that have parameters in common). Lack of data validation means a risk of errors, and you wouldn't want to run a mission-critical system on it, but it's great for data gathering and scrubbing before moving it elsewhere. The best example is scraping and storing web pages, most likely for later parsing or use, but I've found many a time where I needed to gather buckets of data prior to actually figuring out how best to manipulate and use it, or just wanted to store data that took a lot of I/O to gather and that I didn't want to gather again. Here's a simple perl example for storing and pulling an object
What is the point of document storage in a noSQL database? If you're not going to store docs in a RDBMS, why not just store them in a filesystem? What is the point of Mongo or whatever this stuff is?
We used MongoDB as a query/cache accelerator for semi-structured data. The key bottleneck was delegating queries outside of the application (pre-filtering results according to ACLs, date transforms, etc.).
We don't have a shockingly huge dataset, and site traffic wouldn't be considered as webscale, but the ad-hoc schema and ability to delegate complex queries to the DBs as JS was really powerful and bought us a lot of performance for very little effort.
And it's only a cache of the authoritative data store, so we can trash mongo and re-load the whole dataset in a few hours.
I'm currently using MongoDB for analytics data in a fairly large company. I must say this article was complete weaksauce. Aside from the fairly well documented scalability, and yes, it delivers, I've found that working with objects rather than related tables has given me a new perspective on what high "normalization" is. Being schema-less is not just about storing whatever you want. It's about storing what you need, and also being able to perform some rather interesting observations on that data.
A working example is the fact that I can store, in most cases, only differentiating data. For example, it is very easy for me to take the URL that someone is on, compare it to the URL that someone is going to (via a click), and store only the difference in information. If the domain does not differ, I can store it in a single location. Could the same be achieved with SQL... of course, two tables, each with similar structure, but one allowing null values, and then another table which links them. However, then my queries become increasingly complex for pretty simple data.
In this same area, it is important for me to have things like query parameters normalized. I want to be able to easily query if the user hit a URL with a particular parameter, and a particular value for that parameter. Using an embedded object solves this quite nicely, and there is very little code required... the property names are the parameter name, their values, the values passed. Compare this to what you might have to do in SQL... either store the full query string and try to get away with LIKE '%param=value%' or parse this query string each time... or worse, in my opinion, create a table with columns like url_id, param_name, value and then store one record for each parameter, not to mention the fact that having a schema would basically require me to say all parameters are the same type. And while that is true in the URL sense (they are all strings), my data model clearly understands that my 'page' parameter is an integer for searches and my 'query' parameter is a string... why shouldn't my database be able to do the same thing with ease?
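Something like the following is what I read the embedded-object idea as, sketched with the Python standard library only. The URLs, the document layout and the coerce rule (digit-only values become integers) are assumptions for illustration, not the poster's actual schema.

```python
from urllib.parse import urlsplit, parse_qsl

def coerce(value):
    """Store digit-only values as integers, everything else as strings (an assumed rule)."""
    return int(value) if value.isascii() and value.isdigit() else value

def url_to_document(url):
    """Build a document with the query parameters as an embedded object."""
    parts = urlsplit(url)
    params = {name: coerce(value) for name, value in parse_qsl(parts.query)}
    return {"domain": parts.netloc, "path": parts.path, "params": params}

# Hypothetical hits; the URLs are made up for illustration.
hits = [
    url_to_document("http://example.com/search?query=mongodb&page=2"),
    url_to_document("http://example.com/search?query=couchdb&page=10"),
    url_to_document("http://example.com/about"),
]

# "Did the user hit a URL with parameter page equal to 2?"
page_two = [h for h in hits if h["params"].get("page") == 2]
```

Each parameter keeps its own type inside the document, which is the contrast with a url_id / param_name / value table where every value shares one column type.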
In addition to this, I like a number of features such as aggregate operations. A single URL object can easily have its click count aggregated in a single atomic "query." No need to select the value first and then update, not to mention wrap it in a transaction to avoid the possibility of concurrency issues messing with the count.
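A toy model of that single-step increment, in plain Python with threads. This is not MongoDB code; the ClickCounter class is a hypothetical stand-in that only shows why doing the read, add and write as one operation avoids lost updates (in MongoDB the corresponding update operator is $inc).

```python
import threading

class ClickCounter:
    """Toy stand-in for a store that applies increments atomically."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    def increment(self, url, amount=1):
        # The read, add and write happen as one step under the lock,
        # so callers never do a separate "select, then update".
        with self._lock:
            self._counts[url] = self._counts.get(url, 0) + amount
            return self._counts[url]

    def get(self, url):
        with self._lock:
            return self._counts.get(url, 0)

counter = ClickCounter()

def worker():
    for _ in range(1000):
        counter.increment("/home")

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = counter.get("/home")
```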
Additionally, some bringing up XML databases (which I agree I'm not very familiar with) seem to be missing the point that JSON data, but more specifically, BSON, can carry type information without the need for additional parsing and/or standards. Not everything is a 'string' -- and point of fact, you can even have the possibility of more complex objects which are stored as natural types... MongoDB supports date objects for example, as well as geolocation data which is query-able via concepts specific to that data.
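To show the gap that typed storage closes, here is a sketch using only Python's json module. Plain JSON has no date type, so a date has to travel as a string and be re-parsed by convention. The event document is hypothetical.

```python
import json
from datetime import datetime, timezone

event = {"url": "/home", "clicks": 3,
         "seen": datetime(2012, 5, 1, tzinfo=timezone.utc)}

# Plain JSON has numbers, strings, booleans, null, arrays and objects,
# but no date type, so the standard encoder refuses the datetime.
try:
    json.dumps(event)
    json_has_dates = True
except TypeError:
    json_has_dates = False

# A common workaround: encode the date as a string and re-parse it later.
encoded = json.dumps({**event, "seen": event["seen"].isoformat()})
decoded = json.loads(encoded)
restored = datetime.fromisoformat(decoded["seen"])
```

With a format that carries type information, the reader does not need to know out of band that "seen" is a date.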
These are, in my opinion, the more powerful features of MongoDB and presumably other object store databases that don't get discussed. There are some limitations, of course. For users who get too carried away, they may, for example, run into the maximum object size limitations. Furthermore, the benefits may not be immediately realized when coming from a relational background. It took me some time to really understand the distinction.
I never understood this No-SQL fad. You can turn off transaction isolation and cram serialized record data into a single BLOB field, and you will get the same thing right? Or, use a freaking filesystem? Why do they keep patting themselves on the shoulder over performance of particular implementation that is due to lack of features and safety, and comparing it to relational databases in general as if it was somehow superior? Apples and oranges. Like saying MyISAM backend is superior to InnoDB in MySQL because it is faster. SQL as a performance bottleneck? Having to escape certain characters? mysql_real_escape_string()? They probably never heard of bound parameters and prepared statements. Once they find out they need to start addressing things like durability (when they acknowledge successful completion of a transaction to the remote client, and then there is a crash immediately after that, will the transaction be lost?) and isolation (multiple concurrent transactions modifying the same data jeopardizing the integrity of the data) they will eventually find out that transaction processing is about more than just atomic updates, and find themselves doing the stuff they loathe on the SQL databases. And it will hurt performance to do it compared to the case where they don't care about these things, surprise, surprise. Reminds me of the postgresql guys when they thought they could somehow make their great idea of snapshot based concurrency control into a proper serializable isolation, and everyone else was doing it wrong. They couldn't, it only works for read only transactions. Now they know.
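For anyone who hasn't seen bound parameters, a minimal sketch with Python's built-in sqlite3 module. The table and the hostile input are made up; the point is only that the value is passed separately from the SQL text, so there is nothing to escape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# Hypothetical hostile input containing a quote and SQL syntax.
name = "O'Brien'; DROP TABLE users; --"

# The ? placeholder sends the value separately from the SQL text,
# so no manual escaping is needed.
conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
rows = conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()

# The table is still there and holds exactly the one row.
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```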
While "NoSQL" the term itself is not new, what most of the posters here really are complaining about is relational databases vs. a subset of everything else. I am also not sure where the assumption comes from that all "NoSQL" databases have no schema, or why there is such an attachment to schemas. Overall, use the right tool for the right job, and sometimes that even means multiple tools. Just because something worked for you on 10 projects doesn't mean it will fit every situation for everyone else.
Firstly, there is no reason past, present, or future why a relational database cannot forgo using SQL. SQL is simply a tool for relational databases, or rather a semi-standard powerful DSL with an accompanying API per implementation. It's a decent fit with relational databases. Is a relational DB without SQL "NoSQL?" Honestly, "NoSQL" is probably one of the dumbest terms and just reminds me of people who complain like little children about the failings (of which there are indeed many) of SQL against relational databases.
Moreover, some of the oldest databases in software history are not relational and do not use SQL. For instance, there are quite a number of object databases that are as old or older than many popular relational implementations. These databases do not use SQL. Are they NoSQL? Supposedly, yes. They also most definitely can have schemas. The schemas are defined by classes usually, which creates many advantages in various OO languages vs. the relational model. Some of the largest databases in the world are object databases ranging from applications such as nuclear and genetic sciences to shipping. Gemstone object database is used for shipping and is baked in with Smalltalk. If anything, this ensures that this database can actually have a more complex and rigid schema than any SQL implementation because you can define classes, business rules, and callback-like behavior to handle and enforce your schema. (Aside: this brings up an important point that why not use more functional languages with DBs since after all they are a closer fit than OO, hence less impedance mismatch and so on).
Another category of NoSQL databases is graph databases. These again are some of the largest databases on the planet. In most implementations, whether we're talking about Neo4j, InfiniteGraph, sones, AllegroGraph, whatever, these databases typically can have a similar level of schema control as object databases or very little. Sometimes that is the point of the graph. You can not just enforce "column" like functionality via class fields, hashes, or something similar, but actually enforce how the structure of the graph can grow. This is usually done via programmer logic. In other words, the responsibility in both object and graph databases can be passed to the programmer to augment the schema, rather than force one more general schema as in the relational world. Therefore in a graph database, I can typically create directed and undirected graphs, b-trees, tries, binary trees, red-black trees, linked lists, whatever.
Additionally, we have the newer set of databases which is probably what most people think of including mongodb, couchdb, cassandra, hadoopfs, riak, and so on. These databases have their purpose like any other. Would I use Neo4j or Oracle as my persistent cache? Probably not. Redis? A much better chance. Really none of these are a magic bullet and they are all growing like any other. I honestly can't stand some of them in my own work, but that doesn't make them bad. Some use JSON, some don't. This isn't a magic language or the new SQL. You can easily write a JSON layer for almost any database, so that's all it is - a way of talking. Don't like it? In most cases, you can use something else or write your own.
The more I hear about NoSQL, the more I want to puke. Yay, some idiots invented a new term akin to renaming air "non-solid." None of this is new, and we've all seen it before, we'll see it again. It's the same annoyance level as cloud computing which we've already also seen. At the end of the day, a com
Most talk of NoSQL seems to focus on the key-value / document-oriented databases (Dynamo, Cassandra, Mongo, etc.). IMHO graph databases (e.g. neo4j) sound a lot more interesting and relevant to most use cases (at least the ones I deal with), focusing on the relationships among the data elements and new ways to query that data.
You can turn off transaction isolation and cram serialized record data into a single BLOB field, and you will get the same thing right?
Not really. Schemaless databases provide indexing and search capabilities that are impossible to achieve using SQL blobs without either loading all your data back into memory whenever you want to search for something or providing your own index mechanism.
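A rough sketch of the difference being described, in plain Python. The invoice records and both helper functions are hypothetical; a real document store maintains its indexes internally, this only shows why a lookup against opaque blobs degenerates into deserializing everything.

```python
import json
from collections import defaultdict

# Hypothetical invoices stored as serialized blobs.
blobs = {
    1: json.dumps({"invoice": "INV-100", "total": 25}),
    2: json.dumps({"invoice": "INV-200", "total": 40}),
    3: json.dumps({"invoice": "INV-100", "total": 15}),
}

def scan(blobs, field, value):
    """Blob-style lookup: every record must be deserialized and checked."""
    return sorted(k for k, raw in blobs.items()
                  if json.loads(raw).get(field) == value)

def build_index(blobs, field):
    """Secondary index: field value -> record ids, built once and reused."""
    index = defaultdict(list)
    for key, raw in blobs.items():
        doc = json.loads(raw)
        if field in doc:
            index[doc[field]].append(key)
    return index

by_invoice = build_index(blobs, "invoice")
via_scan = scan(blobs, "invoice", "INV-100")
via_index = sorted(by_invoice.get("INV-100", []))
```

Both lookups return the same ids; the indexed one does not touch the other records at query time, which matters when there are 100,000,000 of them.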
Or, use a freaking filesystem?
Which as well as lacking indexing and search as the SQL-based system would, also does not provide any useful mechanism for concurrent updates, or for ensuring consistency (whether eventual or guaranteed at all times). It would also probably be much slower.
Why do they keep patting themselves on the shoulder over performance of particular implementation that is due to lack of features and safety, and comparing it to relational databases in general as if it was somehow superior? Apples and oranges.
They don't. You're misreading the articles because you haven't spotted the context: people are using oranges and complaining that the cider they're ending up with just doesn't taste right. I.e., a lot of people are using SQL databases for tasks for which one of the various NoSQL systems would be better suited simply because they don't realise there's a better tool for them. These people need to see comparisons between SQL & the NoSQL systems that are available in order to realise this.
SQL as a performance bottleneck? Having to escape certain characters? mysql_real_escape_string()? They probably never heard of bound parameters and prepared statements.
My experience is that SQL isn't so much the bottleneck (although it is slow, even with prepared statements) as object-relational mapping. And yes, my application does need some form of ORM (whether handbuilt or using an off-the-shelf library) because I'm working with many types of polymorphic object in single collections. Schemaless databases make this much, much simpler, as they remove the need for large numbers of table joins to make a deep inheritance hierarchy work.
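Roughly what I mean by polymorphic objects in a single collection, sketched with Python dataclasses. The Shape classes, the "_type" tag and the registry are all hypothetical names for the example; a list of dicts stands in for the collection.

```python
from dataclasses import dataclass, asdict

@dataclass
class Shape:
    name: str

@dataclass
class Circle(Shape):
    radius: float

@dataclass
class Rect(Shape):
    width: float
    height: float

# Maps the stored type tag back to a class (an assumed convention).
REGISTRY = {cls.__name__: cls for cls in (Shape, Circle, Rect)}

def to_document(obj):
    """Flatten an object into one document, tagging it with its class name."""
    return {"_type": type(obj).__name__, **asdict(obj)}

def from_document(doc):
    """Rebuild the right subclass from the type tag; no joins involved."""
    fields = {k: v for k, v in doc.items() if k != "_type"}
    return REGISTRY[doc["_type"]](**fields)

collection = [to_document(Circle("c1", 2.0)), to_document(Rect("r1", 3.0, 4.0))]
restored = [from_document(d) for d in collection]
```

Each subclass keeps its own fields in its own document, where a table-per-class mapping would need one join per level of the hierarchy.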
Wine? Which web browser's JavaScript engine changes its array indexing behavior in this way when run under a free reimplementation of the Windows API?
I, for one, would have a much better time taking NoSQL seriously if so many of the arguments for it didn't reduce to -and to truly express this reduction properly I need to put on my best Barbie voice- "The relational model is haaaaard." Some say SQL instead (for example, whoever came up with the NoSQL moniker), but except for a couple of arguments that amount to pure syntax baw it reduces to pretty much the same thing.
NoSQL has its place: there are some things it does really well. The problem is that the things it does best are not the things most of its advocates call for.
You really have no idea what you are doing.
Your variable, stuff, is in fact an Array with literal strings for its first three indices. But calling stuff.foo = "bar" does not add to them. Instead, what you have created is a new property on that instance called foo, which joins other Array object properties like length. Any half-intelligent JSON serialization routine will notice the object has type Array and will go about looking only at the indexed values.
Why you would ever want to do something so confusing as combining indexed and mapped values in an object this way is beyond me.
By the way, you can very easily iterate over object properties.
for (var property in object) {
    var value = object[property];
}
Seriously, learn the tool before you start criticizing it. And, as it happens, this is one reason JavaScript has developed such bad reputations: clueless hacks like you apply the language in utterly bizarre and foolish ways, and then go on about how “LOL Javascripts are Teh SUX0R d00dz, it's sl0w and lAaMe. U shuld all codez teh Ruby.”
There is no reason this should be moderated anything other than “-1, Misinformed.”