NoSQL Document Storage Benefits and Drawbacks
Nerval's Lobster writes "NoSQL databases sometimes feature a concept called document storage, a way of storing data that differs in radical ways from the means available to traditional relational SQL databases. But what does 'document storage' actually mean, and what are its implications for developers and other IT pros? This SlashBI article focuses on MongoDB; the techniques utilized here are similar in other document-based databases."
It's so cute how NoSQL developers have reinvented the XML database.
The article is barely a description of MongoDB records. It does not really detail any real drawbacks or benefits beyond "look ma, random structure in my record!"
Oh, look, it's a NoSQL article.
Cue the hundreds of Slashdotters who proclaim "Oh, they've reinvented obsolete databases" and "Just wait until they need ACID, then they'll be fucked", the NoSQL blind-faith followers who harp about pure scalability and clustering, and at least a dozen references to an animated video of a retarded strawman saying "webscale" repeatedly.
Somewhere in the depths of poorly-researched comments will be some guy who thinks that NoSQL is a tool that really just might be useful for particular use cases, and should be used where appropriate, and nowhere else. Sadly, his post will be missed because everyone's too busy talking about how everything can be done just as easily on a $500,000 server farm running Oracle's latest and greatest turd.
I'm not sure what the point of this "article" is. It is light on actual information or anything useful, it's basically just a few paragraphs that say "a NoSQL database called Mongo stored data in JSON format. This may or may not work for you".
If we're going to have "BI" articles, they should be informative, containing useful information that we couldn't have gathered ourselves in 10 secs of googling.
How about some comparisons between various NoSQL solutions? How about a binary access API vs. a RESTful approach à la Couch? How about clustering, replication and scalability? How about stability concerns (with Couch, for example)? Real-world use cases? Examples of companies using them for specific solutions? Performance comparisons with RDBMSs? Problem domains that a NoSQL/schemaless DB is better suited to than an RDBMS?
I'm not trying to be pointlessly critical here, I'm trying to provide some constructive feedback on the new slashdot BI format. This article wasn't useful to me at all. I'll probably not spend time reading these articles in the future if the content is as light as this article.
I don't know when unstructured data turned into NoSQL or Big Data, but it is a pretty simple concept with complex Enterprise level requirements. I work in this field and have for various companies. The biggest obstacle is conforming to the laws of various jurisdictions and levels of government.
You have unstructured data, but it NEEDS some level of structure. That structure is there to restrict access to certain groups within the organization and also for retention rules, which differ by type of data being stored. Not to mention that you must store certain documents in the country of origin, so structured field-based distributed storage plays a role. Oh yeah, and there are laws/policies around encryption and whether or not an index violates those laws/policies.
This doesn't work well with a relational database. Sure, you can jam it into a RDBMS like IBM Content Manager, but it becomes inflexible. However, there are constraints that must be followed and all documents need some kind of structure wrapped around them in a RDBMS-like fashion.
I haven't dove into these NoSQL systems myself. They seem like a good idea, but I hesitate if they are too loose. In an Enterprise with sensitive information, you need to deny first. Also, how do they index the fields? Like when you have 100,000,000 documents with invoice numbers...
I read this article with the hope of seeing some of the benefits and drawbacks (as the title implied). No talk of scalability, indexing, speed, etc. I actually feel dumber for having read the article.
The "old old boss" would be the CDF/NetCDF/HDF family of self-describing distributed storage solutions. They predate XML by a long way and are - I believe - the first true self-describing method of storing, indexing and searching data.
For the most part, they support network interconnections between instances, so you can have your virtual storage distributed over as many physical systems as you like. The users will never see the difference except in terms of speed. This gives you all the benefit of NoSQL's distributed model (which XML lacks) but with several decades more development in the database design.
But wait! There's more! If you order in the next gazillion years, you get OpENDAP absolutely free! (Which it is anyway.) OpENDAP will translate between any two data formats, so if one site wants to view the data as, say, a conventional database, another wants to look at it as a collection of spreadsheets and a third is expecting XML data, you'd have OpENDAP translate between client form and central repository form.
I have no objections to Mongo or Memcache, they're very powerful and are very useful, but we're still ultimately talking about technology everyone else has had since 1985, thanks be to NASA, and many NoSQL technologies are really just network-aware versions of the DBM/NDBM/BDB/GDBM/QDBM family which have existed since Unix began.
NoSQL definitely has a place - I would not want to try serving cached web data from HDF5 - and it's an important place. But that's just as true for Hierarchical Databases, Star Databases (aka "Data Warehouses"), "genuine" (ie: actually complies with Codd's rules) relational databases (SQL isn't truly relational in the Codd model, merely a subset), and so on.
It's time we got away from one-size-fits-all ideas, which violates the Unix ethos anyway, and get back to using best solutions for specific problems rather than passable solutions that fail at everything. These are all wonderful, highly specialized solutions to highly specific problem types. Treating them as such will always produce a better answer than force-fitting solutions into not-quite-failing with problems they aren't designed for.
With MongoDB, the lack of a hard schema requirement doesn't mean your data model can be all willy-nilly. You have to put some thought into it. Several people have mentioned they are looking for the benefits and drawbacks. I'm really enjoying it. Here's my short list of drawbacks.
Case Sensitive: If you load data with mixed case, you will have to use a regular expression to find it all.
Data Type Unstrictness: If you load zip codes as a string "11223" and try to find them with a numeric 11223, you are out of luck. Also, if you load your data with mixed data types for the same set of keys, good luck finding it again without having to work for it.
Rich Documents Are Cool: Until you try to remove a struct inside an array that's inside a struct. There is little documentation that explains how to manipulate complex data structures. You can join the mongodb-user Google Group; they are very responsive. Plus, learning to use JSON as your query language is a different way of thinking from SQL queries.
Things I like about MongoDB: It is fast. Replicating shards is very handy and relatively easy to do. It has a growing audience and user base. And finally, it's fun to watch people's faces when you tell them "I replicated a bunch of shards today, what did you do?" Peace.
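For anyone who wants to see the case and data type drawbacks in miniature, here is a rough sketch. It is plain in-memory JavaScript, not the MongoDB API; the records and field names are made up, and the filters only mimic the exact, type-sensitive matching described above.

```javascript
// Hypothetical records, loaded with inconsistent casing and types.
const zipRecords = [
  { city: "Brooklyn", zip: "11223" },
  { city: "BROOKLYN", zip: 11223 },
  { city: "brooklyn", zip: "11223" },
];

// Type-sensitive equality: the string "11223" and the number 11223 are different values.
const byNumericZip = zipRecords.filter((r) => r.zip === 11223);
const byStringZip = zipRecords.filter((r) => r.zip === "11223");

// Exact string matching is case sensitive, so only one spelling is found...
const exactCity = zipRecords.filter((r) => r.city === "Brooklyn");

// ...while a case-insensitive regular expression finds all of them.
const anyCaseCity = zipRecords.filter((r) => /^brooklyn$/i.test(r.city));

console.log(byNumericZip.length, byStringZip.length);
console.log(exactCity.length, anyCaseCity.length);
```

The usual way around both problems is to normalize on the way in (one type per key, one casing per field) so you never need the regular expression in the first place.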
Where is Lotus Notes in all this brouhaha? They were the original NoSQL system.
JSON also introduces a fantastic new method of injecting arbitrary executable code into a web application.
How so, if you parse the JSON in your own code instead of eval()ing it?
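A minimal sketch of that point, assuming an environment with the standard JSON.parse. The two payload strings and the safeParse helper are invented for illustration. A strict parser rejects anything that is not JSON instead of running it, which is the difference from eval().

```javascript
// Hypothetical payloads: one is JSON, the other is executable JavaScript.
const honestPayload = '{"name": "Bob", "tags": ["x", "y"]}';
const hostilePayload = 'console.log("injected"), {"a": 1}';

// Returns the parsed value, or null when the text is not strict JSON.
function safeParse(text) {
  try {
    return JSON.parse(text);
  } catch (err) {
    // JSON.parse throws a SyntaxError on non-JSON input; nothing is executed.
    return null;
  }
}

const parsedHonest = safeParse(honestPayload);
const parsedHostile = safeParse(hostilePayload);

console.log(parsedHonest.name);
console.log(parsedHostile);
```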
Ok, take that associative array and add non associative elements to it.
Or more accurately, take a non associative array and add associative elements to it
["a", "b", "c" "foo":"bar"] is not valid JSON
Neither is {"a", "b", "c" "foo":"bar"}
Yet I can do:
var stuff = ["a", "b", "c"];
stuff.foo = "bar";
That JavaScript object cannot be serialized to valid JSON.
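Here is what actually happens with the standard JSON.stringify, as a quick sketch (the variable names are arbitrary). The output is valid JSON, but the named property is silently dropped, so the round trip does not give back the same object.

```javascript
// An array that also carries a named property.
const mixed = ["a", "b", "c"];
mixed.foo = "bar";

// JSON.stringify serializes only the indexed elements of an array.
const wire = JSON.stringify(mixed);
const roundTripped = JSON.parse(wire);

console.log(wire);
console.log(mixed.foo, roundTripped.foo);
```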
I was really hoping for a more in-depth description of what NOSQL has to offer over other DB options.
Sure it won't create an instance of Array, but if you're using an Array to also be an associative array then really I think JSON is the least of your worries.
Or if you want to be avant garde, I suppose you could begin the numbering at zero *blames wine*
The comments on SlashBI are great too. I also wanted to know how to query data out of your "documents" as the Wikipedia page doesn't describe that. Using the SlashBI example, show me all contact objects with state = "DC" or all records where last name ilike 'o_ama'. Does performing a search like that iterate over all records? Do you need to enable some full-text indexing of your entire document store to be able to execute queries like that?
... to describe non-relational databases. There are many different ways to store data in various database designs. The main kinds are: relational (MySQL, PostgreSQL), column based (HBase, Cassandra), document based (MongoDB, CouchBase), hash based (Redis, memcached), and graph based (Neo4j). Databases can also be categorized as single-system databases or distributed databases. MongoDB and most (all?) relational DBs are single-system databases. HBase, Cassandra, CouchBase and Riak are distributed databases. NoSQL was developed to solve problems that traditional SQL databases find difficult or impossible to solve, such as real-time updates on a massive scale, scaling horizontally with relative ease, or lightning-fast query speeds. They also have drawbacks, such as eventual consistency and synchronization issues. A good architect / programmer should do a solid analysis of the use case against different flavors of DBs, perform benchmarks, be familiar with their failover / recovery procedures, and have a good understanding of the underlying technologies before considering using them in a serious production environment.
What is the point of document storage in a noSQL database? If you're not going to store docs in a RDBMS, why not just store them in a filesystem? What is the point of Mongo or whatever this stuff is?
We used MongoDB as a query/cache accelerator for semi-structured data. The key bottleneck was delegating queries outside of application (pre-filtering results according to ACLs, date transforms, etc.).
We don't have a shockingly huge dataset, and site traffic wouldn't be considered as webscale, but the ad-hoc schema and ability to delegate complex queries to the DBs as JS was really powerful and bought us a lot of performance for very little effort.
And it's only a cache of the authoritative data store, so we can trash mongo and re-load the whole dataset in a few hours.
I never understood this No-SQL fad. You can turn off transaction isolation and cram serialized record data into a single BLOB field, and you will get the same thing right? Or, use a freaking filesystem? Why do they keep patting themselves on the shoulder over performance of particular implementation that is due to lack of features and safety, and comparing it to relational databases in general as if it was somehow superior? Apples and oranges. Like saying MyISAM backend is superior to InnoDB in MySQL because it is faster. SQL as a performance bottleneck? Having to escape certain characters? mysql_real_escape_string()? They probably never heard of bound parameters and prepared statements. Once they find out they need to start addressing things like durability (when they acknowledge successful completion of a transaction to the remote client, and then there is a crash immediately after that, will the transaction be lost?) and isolation (multiple concurrent transactions modifying the same data jeopardizing the integrity of the data) they will eventually find out that transaction processing is about more than just atomic updates, and find themselves doing the stuff they loathe on the SQL databases. And it will hurt performance to do it compared to the case where they don't care about these things, surprise, surprise. Reminds me of the postgresql guys when they thought they could somehow make their great idea of snapshot based concurrency control into a proper serializable isolation, and everyone else was doing it wrong. They couldn't, it only works for read only transactions. Now they know.
You can turn off transaction isolation and cram serialized record data into a single BLOB field, and you will get the same thing right?
Not really. Schemaless databases provide indexing and search capabilities that are impossible to achieve using SQL blobs without either loading all your data back into memory whenever you want to search for something or providing your own index mechanism.
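To make the "providing your own index mechanism" part concrete, here is a rough sketch in plain JavaScript. It is not any database's API; the documents, field paths and helper names are all invented. It shows the two options you have with opaque blobs: scan every record, or build and maintain a lookup structure yourself. A document store does the second part for you.

```javascript
// Hypothetical documents with differing shapes, as a schemaless store allows.
const contactDocs = [
  { _id: 1, name: { first: "Ann", last: "Lee" }, address: { state: "DC" } },
  { _id: 2, name: { first: "Raj", last: "Patel" }, address: { state: "VA" } },
  { _id: 3, name: { first: "Kim", last: "Osei" }, address: { state: "DC" }, invoice: 1001 },
  { _id: 4, name: { first: "Lou", last: "Marsh" } }, // no address at all
];

// Follow a dotted path such as "address.state"; undefined if any step is missing.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

// Full scan: touches every document on every query.
function scan(docs, path, wanted) {
  return docs.filter((doc) => getPath(doc, path) === wanted);
}

// Hand-rolled secondary index: field value -> documents carrying that value.
function buildIndex(docs, path) {
  const index = new Map();
  for (const doc of docs) {
    const key = getPath(doc, path);
    if (key === undefined) continue; // documents without the field are skipped
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  return index;
}

const stateIndex = buildIndex(contactDocs, "address.state");
const dcByScan = scan(contactDocs, "address.state", "DC");
const dcByIndex = stateIndex.get("DC") || [];

console.log(dcByIndex.map((doc) => doc._id));
```

Note that this toy index is never updated after it is built; keeping it consistent with inserts, updates and deletes is exactly the work you take on if you roll your own.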
Or, use a freaking filesystem?
Which, like the SQL blob approach, lacks indexing and search, and also provides no useful mechanism for concurrent updates or for ensuring consistency (whether eventual or guaranteed at all times). It would also probably be much slower.
Why do they keep patting themselves on the shoulder over performance of particular implementation that is due to lack of features and safety, and comparing it to relational databases in general as if it was somehow superior? Apples and oranges.
They don't. You're misreading the articles because you haven't spotted the context: people are using oranges and complaining that the cider they're ending up with just doesn't taste right. I.e., a lot of people are using SQL databases for tasks for which one of the various NoSQL systems would be better suited simply because they don't realise there's a better tool for them. These people need to see comparisons between SQL & the NoSQL systems that are available in order to realise this.
SQL as a performance bottleneck? Having to escape certain characters? mysql_real_escape_string()? They probably never heard of bound parameters and prepared statements.
My experience is that SQL isn't so much the bottleneck (although it is slow, even with prepared statements) as object-relational mapping. And yes, my application does need some form of ORM (whether handbuilt or using an off-the-shelf library) because I'm working with many types of polymorphic object in single collections. Schemaless databases make this much, much simpler, as they remove the need for large numbers of table joins to make a deep inheritance hierarchy work.
Wine? Which web browser's JavaScript engine changes its array indexing behavior in this way when run under a free reimplementation of the Windows API?
I, for one, would have a much better time taking NoSQL seriously if so many of the arguments for it didn't reduce to -and to truly express this reduction properly I need to put on my best Barbie voice- "The relational model is haaaaard." Some say SQL instead (for example, whoever came up with the NoSQL moniker), but except for a couple of arguments that amount to pure syntax baw it reduces to pretty much the same thing.
NoSQL has its place: there are some things it does really well. The problem is that the things it does best are not the things most of its advocates call for.
I know exactly what I am doing. I know I can iterate over all properties and array contents using in. You merely have poor reading comprehension.
I am adding a name property to an object, one that also happens to be an array. Exactly like I said I was. I am fully aware that I am not adding another element to the array when I do stuff.foo = "bar";
Your half-intelligent JSON serialization routine that ignores the properties added to an array is wrong. Just like the other guy's suggestion to implement it as an associative array with numeric identifiers. If you hack it to produce valid JSON you do not get the same object out the other end of the pipe. You get something else that is almost but not quite entirely unlike tea.
I know the tool, I know the technology. My criticism is precisely because I use this shit every day.
Here is a question for you. What is the simplest and most efficient way to create an XML mapping to JavaScript objects? XML elements have attributes and contents that may be other XML elements or strings. The names of those attributes are arbitrary; one cannot simply assume that an XML element won't have an attribute called "contents" or "innerXML" or whatever you decide to call it.
The answer to this question is this: You take a javascript array, the objects that the array contains are the child elements of the element. You then add the xml attributes as properties of the object (that is also an array).
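A small sketch of that mapping, for the curious. The xmlElement helper and the sample elements are invented here, and keeping the tag name under a Symbol is my own assumption, since the description above does not say where the tag name lives. One caveat the sketch does not handle: an attribute literally named "length" would clash with the array's own length.

```javascript
// Assumption: the tag name is kept under a Symbol so it cannot clash with any attribute name.
const TAG = Symbol("tagName");

// Build an array whose elements are the child nodes (strings or other elements)
// and whose named properties are the XML attributes.
function xmlElement(tagName, attributes, children) {
  const node = Array.from(children);
  node[TAG] = tagName;
  for (const [name, value] of Object.entries(attributes)) {
    node[name] = value; // attribute stored as a property on the array itself
  }
  return node;
}

// Roughly: <p class="intro">See <a href="/home" contents="yes">Home</a>.</p>
const linkNode = xmlElement("a", { href: "/home", contents: "yes" }, ["Home"]);
const paraNode = xmlElement("p", { class: "intro" }, ["See ", linkNode, "."]);

console.log(paraNode.length, paraNode.class);
console.log(paraNode[1].href, paraNode[1][0]);
```

Because the attributes sit on the array as named properties, an attribute called "contents" never competes with the child list, which is the collision the question is about.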
Could the same be achieved with SQL... of course, two tables, each with similar structure, but one allowing null values, and then another table which links them.
Uh, no. In SQL, you store data by putting each record in a row. No one has the slightest idea what you're talking about, or why you'd need to 'store only the differences', or why you'd need three tables for that.
However, then my queries become increasingly complex for pretty simple data.
Oh noes! Complicated queries!
Compare this to what you might have to do in SQL... either store the full query string and try to get away with LIKE '%param=value%' or parse this query string each time
That sure is a complicated query. Why, it's so complicated you managed to type the important part in a slashdot post off-hand.
And while that is true in the URL sense (they are all strings), my data model clearly understands that my 'page' parameter is an integer for searches and my 'query' parameter is a string... why shouldn't my database be able to do the same thing with ease?
Yes, especially as MongoDB somehow does magically understand the types of URL parameters...oh, wait, it doesn't? You're having to parse each one as you put it in and figure out the type? Well then that hardly seems relevant to complain that SQL would also require that, does it? (Not that I have any idea why you're complaining about having to store something as a string that actually is a string. There's not any sort of loss there. If you want to treat it as a number, you can do that just as easily after you pull it from the database as before.)
In addition to this, I like a number of features such as aggregate operations. A single URL object can easily have its click count aggregated in a single autonomous "query." No need to select the value first and then update, not to mention wrap it in a transaction to avoid the possibility of concurrency issues messing with the count.
Yes, it sure is complicated to do something like: UPDATE urltable SET counter = counter + 1 WHERE url = 'blah'; (With INSERT...ON DUPLICATE KEY if needed.)