Moving From CouchDB To MySQL
itwbennett writes "Sauce Labs had outgrown CouchDB and too much unplanned downtime made them switch to MySQL. With 20-20 hindsight they wrote about their CouchDB experience. But Sauce certainly isn't the first organization to switch databases. Back in 2009, Till Klampaeckel wrote a series of blog posts about moving in the opposite direction — from MySQL to CouchDB. Klampaeckel said the decision was about 'using the right tool for the job.' But the real story may be that programmers are never satisfied with the tool they have."
Of course, then they say things like: "We have a TEXT column on all our tables that holds JSON, which our model layer silently treats the same as real columns for most purposes. The idea is the same as Rails' ActiveRecord::Store. It’s not super well integrated with MySQL's feature set — MySQL can’t really operate on those JSON fields at all — but it’s still a great idea that gets us close to the joy of schemaless DBs."
And in another three years they will switch to whatever is the coolest up-and-coming storage solution. Incompetent developers will always be incompetent developers.
That's actually a rather insightful point...
If your application fits well with the methodologies of a traditional RDBMS, use a traditional RDBMS, and hire people who are trained and experienced in using those methodologies to their full potential.
If you're dealing with the latest Big Data paradigms and designs, where you can sacrifice some of the rigidity of a RDBMS to gain some flexibility and cheaper scalability, use a NoSQL database, and hire people who aren't stuck in their old RDBMS ways.
You do not have a moral or legal right to do absolutely anything you want.
PostgreSQL 9.2 beta improves scalability, adds JSON
http://www.h-online.com/open/news/item/PostgreSQL-9-2-beta-improves-scalability-adds-JSON-1573815.html
You can get json support using the PLV8 extension http://code.google.com/p/plv8js/wiki/PLV8
or altenatively you can use the hstore data type.
MongoDB is Webscale. MySQL is not Webscale, because it uses joins. SQL also has impetus mismatch.
But the real story may be that programmers are never satisfied with the tool they have.
Ah typo
But the real story may be that programmers don't know how to store data
They many not know because no one knows the business needs, but more often because they have no idea what they're doing WRT to data storage.
IT training tends to cover data manipulation pretty well "how to add two numbers'
IT training gets shakey on data structures "So, in junior level class we will talk about data structures, which is too bad because you've already developed at least two years of bad habits first"
IT training tends to pretty much skip data storage "In a senior level class, you might talk about scalability, maybe in an optional class. Or maybe you'll take a semester of cobol instead"
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
That's actually a rather insightful point...
If your application fits well with the methodologies of a traditional RDBMS, use a traditional RDBMS, and hire people who are trained and experienced in using those methodologies to their full potential.
If you're dealing with the latest Big Data paradigms and designs, where you can sacrifice some of the rigidity of a RDBMS to gain some flexibility and cheaper scalability, use a NoSQL database, and hire people who aren't stuck in their old RDBMS ways.
The real key is for the person doing the hiring to understand which of those of methodologies fits their application.
And most importantly, make sure you know the difference.
Because I should think someone who thinks you should ditch your RDBMS when it's the thing you need to keep using is going to cause you more problems than they're worth. Of course, the opposite is true ... I remember someone who insisted in writing ER diagrams to describe our system, despite it not being an RDB, and not being accurately described by ER diagrams -- but to him everything was an ER diagram.
It's not uncommon for geeks to push to use the latest stuff simply because it's the latest. (Or, as you point out, use something because that's what they've always used)
I've actually seen someone suggesting we scrap an architecture to go with something he'd read recently -- despite having insisted we switch to the current architecture after reading about that.
After a certain point, you just realize they're a technology magpie and tell them to STFU if they're not providing solid reasoning for why this is better in this context. After a while "because it's newer and better" becomes code for "shiny and pretty". Especially if these whims happen in shorter periods than your development lifecycle.
Lost at C:>. Found at C.
MySQL is not Webscale, because it uses joins.
Then how does a non-webscale database power popular web sites such as Wikipedia and Slashdot? If you don't do joins in the database, you'll probably end up doing the equivalent of joins (using one value as the key in another table) in your application.
RDBMS systems can be flexible also. It just takes a bit of planning, a good understanding of your data and a well designed application...which you should do/have regardless of your storage solution.
Call me set in my old RDBMS ways, but if I'm supporting it then I want to know what the hell is gong on with the data.
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
The real key is for the person doing the hiring to understand which of those of methodologies fits their application.
This is insighful. I've worked extensively with RDBMS solutions and now quite a bit with NoSQL technologies. They each have their place. An entire article could be written on where each fits most naturally, but in general if you don't need to join between tables, need to throw data to your store at a high velocity (e.g. logging), and/or need a loose schema, a NoSQL solution works best. If what you're doing can be naturally modeled (i.e. users HAVE AND BELONG TO stations, stations HAVE MANY playlists, etc. etc.), use an RDBMS.
One can see in the subtext of the GP that they may not get this, with their comment that people using RDBMS solutions are "stuck in old ways". It seems like they are saying that NoSQL is effectively always best. I'm curious why they think that. Nail, hammer, etc...
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
PostgreSQL 9.2 (now in beta) includes native JSON fields:
http://www.h-online.com/open/news/item/PostgreSQL-9-2-beta-improves-scalability-adds-JSON-1573815.html
It's also available as an extension for the current 9.1 release:
http://people.planetpostgresql.org/andrew/index.php?/archives/255-JSON-for-PG-9.2-...-and-now-for-9.1!.html
http://www.glasswings.com/
In NoSQL systems such as MongoDB and CouchDB, what do you call the operation where you retrieve one document, pull an identifier out of that document, and use that identifier as the key to retrieve another document?
Hop into the wayback machine and fire up any flavor of PICK. The database where schema is applied on use, not on storage. No length limits on fields and very fast on old hardware (really fast on new). Storing bits of xml and code are no problem. And for those users who simply must have SQL, many versions will support that too (UniData and UniVerse are two examples). It's not cool, not new, but it does work.
A CSV or XML or JSON file is a db (a DB is just structured data).
Are relational DBs always required? Certainly not.
The big benefit to a relational DB with lots of enforcement at the data layer is that you can have one or more applications reading/writing to it with minimal concern of data corruption.
What isn't obvious is that second application is often aggregate reporting for management. "How many customers are using $foo and where do they live geographically". With a relational DB, I might knock that query out in a few minutes across millions of customers.
With a flat XML file per customer spread across a number of servers, this could take days to assemble, particularly if $foo is nested deep in the structure.
Having spent far too much time writing one-off scripts to gather customer data because the middleware didn't support that type of query, I've actually gone the other way and started shoving some business logic into the DB.
Functions such as isCustomerPaymentOverdue are now in the relational DB with a very thin model in the middleware to allow for much easier and faster reporting.
Rod Taylor
Or use a better DB like Postgres. How the MySQL still is popular I will never know. I think it is a conspiracy to prove FREE DBs suck.
Urban Airship went PostgreSQL to MongoDB to Cassandra to PostgreSQL. http://wiki.postgresql.org/images/7/7f/Adam-lowry-postgresopen2011.pdf
It's a good presentation because they're in love with none of them and are moving for specific reasons each time, handling different issues. It's not coders chasing the new hotness.
Anyone who loves or hates any language, platform, or manufacturer, doesn't know what they're talking about.
If all your application is ever going to do is read and write to fixed sized record structured data with little relational (or any) attributes then COBOL will suit you fine as that's what it was designed for. Unfortunatly those sorts of apps are few and far between these days, but in its ever decreasing niche COBOL is still good.
It depends on the task though, I'd wager 90% of SQL work that is done by developers day to day isn't in such a performance sensitive environment that it needs to favour performance over normalisation, and I agree with the GP, there's far too many developers out there that just don't do it and hence simply don't have the performance excuse. It really is just bad database design as a result of incompetence most the time.
I can definitely see the value in making an informed tradeoff, but like you said, a lot of the time it's not an informed decision--they just do it to make it work and don't really have the expertise to know which is the right way to go. I've definitely seen enough bad database designs to know that most developers just have no clue how to design them. The worst I've seen had bad designs and poor performance, and were built in a completely ad hoc manner without any eye toward maintainability, performance, or data integrity/consistency. The philosophy was just "make it work."
I think developers need to realize that databases are a lot like code: first you prototype, then you throw away the prototype and do it right. (Then again, plenty of developers just keep the prototype and use it for production.)
Check out my world simulator thingy.
Yeah, it really depends on what you are doing. But any time you break normalization there should be a good reason. Performance is certainly a valid reason. "I'm too lazy to make a well-designed database," however, is not.
If you find yourself breaking normalization all the time, then you've probably found a use case where a relational database isn't the best tool for the job.
While there is a "right" way to use a given tool, there is no one tool that is right for every situation. People who get this backwards are zealots and will often make poor decisions.
Check out my world simulator thingy.
The big benefit to a relational DB with lots of enforcement at the data layer is that you can have one or more applications reading/writing to it with minimal concern of data corruption.
Not just that, but good use of relations and normalization makes whole classes of bug impossible.
SJW n. One who posts facts.
not using arcane tools
I know database concepts are difficult for some people, but it's by no means magic.
a majority of our unplanned downtime was due to CouchDB issues
Nowhere on the CouchDB home page is reliability even mentioned. And that's the real issue. Developing a reliable database system is a difficult design and programming task. It requires real software engineering. The hacks who write PHP and use JSON aren't up to a job like that. The "aw, we'll fix it in the next release" attitude doesn't cut it in databases.
And in many databases, there'd be more performance gains from proper normalization than pre-mature optimization. I'm working with a legacy database that has this problem. Proper normalization would probably make it lightning fast, but instead it's slow as fuck because too many concerns are put in one table when they should be put in several tables. Also, it uses functions to retrieve values, which is just...so wrong.
In what other language would this statement compile without error:
PERFORM makemoney UNTIL rich.
(Note the the full stop at the end)
So the thing is, traditional joins (on, say, Postgres or MySQL) aren't blocking operations. You can run more than one at a time. MapReduce (as well as writes, any aggregation, and any use of JavaScript) are blocking operations on Mongo. They block the entire mongo process. The MapReduce case gets around this with a bit of cooperative multitasking (yielding every few hundred or thousand rows), but writes, aggregation, and other use of javascript do not. So there's already a much bigger need to distribute MapReduce on Mongo than there is to distribute a JOIN on an SQL database.
Plus, MapReduce on Mongo is painfully slow, so you'll need to break things down into really small partitions to scale at all. Aggregating 800,000 documents (group by + sum) took me about 20 seconds with Mongo using the existing aggregation framework (which is universally credited with being faster than the MapReduce case). Porting the whole thing over to an SQL database allowed me to a.) not block the entire freaking process and b.) run the query in about 800ms.
So, sure, we could have partitioned the data and spread it across multiple nodes. Maybe that would have been faster (but you can only run MapReduce across multiple nodes, using the existing aggregation framework you can only operate on one node). Dunno. But it would have been a lot more expensive since we would have required more hardware to accomplish the same thing that an SQL database is optimized for.
The reason that "they" mock JOINs is because you simply can't do that efficiently with Mongo.
The revolution will be mocked
Also, I know what I'm doing.
This line really doesn't count for anything. How many people are really going to say "I don't know what I'm doing", or "I'm incompetent"? Everyone thinks they know what they're doing.
You may or may not really know what you're doing, we have little way to know for sure, but you saying it about yourself is meaningless.