Digg Says Yes To NoSQL Cassandra DB, Bye To MySQL
donadony writes "After Twitter, now it's Digg who's decided to replace MySQL and most of their infrastructure components and move away from LAMP to another architecture called NoSQL that is based on Cassandra, an open source project that develops a highly scalable second-generation distributed database. Cassandra was open sourced by Facebook in 2008 and is licensed under the Apache License. The reason for this move, as explained by Digg, is the increasing difficulty of building a high-performance, write-intensive application on a data set that is growing quickly, with no end in sight. This growth has forced them into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead."
In other news, Cassandra developers are celebrating the fact that their database is now used to store the largest amount of worthless information in history.
Negative moral value of force outweighs the positive value of good intentions.
Reddit also recently switched to Cassandra.
Or away from MySQL? There is a difference.
From the Digg blog - http://about.digg.com/node/564
"And if that doesn't sound like a big enough challenge, we're replacing most of our infrastructure components and moving away from LAMP."
Cassandra Linux Apache PHP?
creation science book
The sad thing is that Monty's MySQL fanboys will blame this on Oracle when in reality the move to Cassandra (or other NoSQL databases) is what a lot of web sites should be doing regardless of who holds the MySQL reins.
Almost no hosting company has Twitter, Facebook, Reddit, or Digg as a client. It's a different market. MySQL does have a competitor in this space called PostgreSQL. It's pretty good. Pretty much every hosting company I would consider doing business with also offers it. But again, PostgreSQL wouldn't have saved the day for these companies; they've reached a different sector of the market due to their enormous scale.
Well.. maybe. Or Maybe not. But Definitely not sort of.
Yes! These products are wonderful! They are spectacular! They are a beam of sunshine refreshing my soul! I'm so happy with them! Daisies!
The World Wide Web is dying. Soon, we shall have only the Internet.
If you need a comparison chart... you don't need to switch.
It's probably not necessary to change such a huge part of your architecture if it's not worth investing serious time investigating and benchmarking the alternatives.
Postgres, for people who care about their data.
These slides present a balanced and comprehensive overview of the current state of free databases. Whether you're in the NoSQL camp or not, they're worth reading.
That said, here's my take:
It's currently fashionable to replace MySQL with some "NoSQL" database or other. This trend is driven by two factors:
I haven't seen any consideration from potential "NoSQL" adopters of the benefits of using a good relational database like PostgreSQL. There's a world of difference between it and MySQL, and condemning all relational database systems because of bad experiences with MySQL is like condemning all sandwiches because McDonald's once made you sick. In giving up RDBMSes entirely, these developers lose quite a bit of safety, flexibility, and convenience. It's a huge over-reaction.
This field should not be about following trends, though unfortunately, that's how most people choose which technologies to use: it should be about choosing the best tool for the job. And I believe that in the vast majority of cases, the advantages conferred by a relational system --- enforced integrity, interoperability based on SQL, query flexibility, storage flexibility --- make an RDBMS the best choice for almost any job. If you need sloppier semantics for some cases (for example, "eventual consistency"), you can layer that on top of a robust RDBMS.
You should probably look at what queries you're running and what the planner/optimizer is doing with them to verify the problem is MySQL and not your schema and indexes.
Do you even lift?
These aren't the 'roids you're looking for.
Don't be too quick to put Java down.. it's slower but it scales fairly well.
The page you cited, on column-oriented databases, describes an implementation strategy that's applicable to many types of databases. There are database engines that present a perfectly normal SQL interface to a column store, and there's actually a direct link to LucidDB from the article. Likewise, there's nothing stopping a Cassandra-like database from serializing its on-disk bits the other way around.
Column-orientation has nothing to do with the "NoSQL" databases that are in vogue. It's completely orthogonal. You're talking about using vectors or linked lists when everyone else is arguing over whether to serialize data with XML or JSON.
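To make the "orthogonal" point concrete, here is a minimal Python sketch (the records and helper names are made up for illustration, not taken from any real engine). It lays the same three records out row-wise and column-wise and shows that the logical answer to a query does not change; only the physical layout does.

```python
# Illustrative only: the same three records laid out two ways.
rows = [
    {"id": 1, "user": "alice", "diggs": 10},
    {"id": 2, "user": "bob", "diggs": 3},
    {"id": 3, "user": "carol", "diggs": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def to_rows(columns):
    """Pivot a dict of column lists back into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


columns = to_columns(rows)

# The same logical query ("sum of diggs") against either layout.
total_from_rows = sum(record["diggs"] for record in rows)
total_from_columns = sum(columns["diggs"])
```

The column layout only has to touch one list to answer the aggregate, which is the usual argument for column stores; whether the interface on top is SQL or a key-value API is a separate decision.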
On a related note, Reddit's performance and reliability has dropped off significantly since switching to Amazon's "Cloud", and dropped off even further after this switch to Cassandra.
The constant 503 errors, plus horrendous load times when it does manage to work, have driven me and many others away from Reddit. That's why I'm posting here on Slashdot.
Cloud hosting is a stupid idea for anything beyond a blog getting 10 hits per day. All the talk about scalability is pure bunk. I mean, even with the extensive knowledge and infrastructure of Amazon, the Reddit site is slow (and it wasn't like that before they switched).
If you're trying to run a site on a $15/month hosting account, then no, this is probably not for you. But if you're at the stage where MySQL isn't able to handle all the data you're throwing at it, then chances are you won't care about the extra few MB of memory that the Java runtime requires.
> But if you're at the stage where MySQL isn't able to handle all the data you're throwing at it...
...it's time to move up to PostgreSQL.
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
The relational model is consistent and easy to work with. It's easy to specify constraints that describe what the data should look like, and to allow several applications to interact with the data. It's also easier to optimize a database when you can describe discrete queries instead of directly following links from program code as you would in a navigational/object/document/etc. database.
Furthermore, application data models aren't all that object-oriented. Most of the time, the manipulated data types (say, "story", "post", and "user") fall into well-defined categories that correspond well to rows in a table. The few mismatches are easily dealt with in application code.
Sure, using an object database might be "easier" for the first 15 minutes, but you'll kick yourself when you have to manipulate it in any kind of sophisticated fashion.
Go with PostgreSQL. Reliable, standards-compliant, fast.
___
If you think big enough, you'll never have to do it.
Am I the only one who frowns at this moniker?
First, it creates a false premise where people need to pick "SQL" versus "no SQL", while many real-world systems intelligently combine relational and non-relational data storage for their needs. There is no conflict.
Second, there's nothing wrong with SQL as a language in particular, and in fact many of the "noSQL" engines are starting to support and extend basic SQL queries, instead of reinventing their own query language for the same purpose.
I suppose "lessRDBMSabuse" was less catchy...
Bullshit. Languages don't scale: programs do.
Writing a program in Java makes it scalable in the same way that painting a car red makes it fast. The JVM is quite good these days, but don't make up advantages that don't exist.
Come on, it cannot be any sloppier than actual UniVerse: It performs extremely poorly on large files, especially when record sizes vary wildly. I've seen in-memory files in which any insert or update operation took 5+ seconds! In my experience, even Postgres on far weaker hardware just spanks UniVerse even on the simple queries where it should have an advantage. If you ever need to read two or three files, either by hand or through I dictionary entries, UniVerse is orders of magnitude slower. When you add the low quality of the system monitoring and debugging tools that are available for it, it turns into one big stinker.
If Cassandra is any slower, it'd have to lock the system up while idle.
A bad policy when dealing with your data.
Once it's broke, it is way too late.
You can't un-LOSE the past 6 hours of transactions or table referential integrity that MySQL trashed, due to an unclean shutdown.
MySQL's great until it comes up to bite you in the arse.
Note: Facebook, Twitter, Digg: they aren't moving to PostgreSQL. It's not better enough to make any kind of difference at that kind of scale. They don't need features, they need speed.
First of all, if he's asking Slashdot for advice (which is barely a step above reading tea leaves [which itself is a step above asking 4chan]), he doesn't need Facebook-level scalability.
Second, you're confusing scalability and performance. Scalable solutions tend to actually be slower than non-scalable ones: the difference is that a scalable system increases in capacity linearly with the number of machines you throw at it ("horizontal" scalability), whereas a fast non-scalable system generally needs the same number of faster, individual machines to increase capacity ("vertical" scaling).
Third, PostgreSQL has excellent performance, and PostgreSQL does, in fact, scale horizontally.
In my experiences developing applications in both the business and gaming industries, most applications beyond a simple cookbook app/crappy blog are highly object oriented. How else can you explain the wealth of approaches like ORM mappers, the repository and active record patterns, etc ? They are just patches on the relational model to make them friendly to application code. If your domain objects are consistently flat, you are probably doing something wrong. I for one do not want to use an API with Address1 - Address5 string properties. What you just listed as story, post, etc are all just objects, usually with nesting. Relational databases suck at dealing with complex object hierarchies, hence all the joins just because object A has a collection of object Bs which contain an object C.
Can you please define what a sophisticated fashion means? Unless you are a DBA and love SQL/config work, it is far easier to write constraints using an object database. You simply use the same validation and rules you should already be using in your application. If you rely on your database alone to enforce things like required fields, atomicity, etc, then you have failed at creating a good application and likely are ripe for exploits, security holes, bad data, etc anyway. It is true that relational DBs provide certain easy facilities, but any decent Object Database provides most if not more of these same constructs in another form through its API. For instance, most object databases I have used provide some sort of transactional data structure that supports far more types of locking and concurrency/conflict management than any relational DBs I have ever seen. Further, since most object databases are defined and consumed in the languages you develop against with them, the sophistication is limited to the language. I'd say you can do a lot more in Smalltalk than SQL for instance.
If you're referring to querying, apparently you've never queried in Smalltalk, C# with LINQ, LISP, or even just using lambdas in python or ruby. Querying using the actual object is typically far easier than writing a SQL query. These days it is becoming increasingly rare that someone rolls all their own queries in your average app anyway (see ORMs). You'll often end up with something like an ORM translating some things from the UI into a boat load of queries, then you'll have to go and find fixes for the ORM to avoid making the application grind to a halt due to all the chatter. Although a lot of that is often the function of UI elements, ultimately there is a lot of overhead created by patching the relational and object disconnect.
I am wondering how you think going from relational back to objects, even flat ones, is somehow easier and more consistent. You're adding an extra language, more layers, and more configuration/management for what gain? Object databases hold records for things like throughput for transactions, data population, etc. The performance thing is a myth of the past. I'd say the stumbling block if anything is simply bad developers. An RDBMS does add somewhat of an idiot-proof layer, but really in the end you just end up with even crappier code in other spots.
Finally, you mention that discrete queries are easier to optimize. I again must disagree. If you want discrete queries, you could describe each query on an object with another object. This is exactly what any good developer should be doing with an ORM anyway. For instance, you could use the specification pattern with the repository pattern to describe and issue your queries, object db or rdbms. Secondly, instead of some crappy tools from the maker of the RDBMS, using an object DB I now have the full facilities of the language to do performance optimization, profiling, logging, etc rather than what a vendor provides. MSSQL provides some great tools for example, but most other DBs while nice implementation wise, provide horrific tool chains.
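For readers who haven't met the specification and repository patterns mentioned above, here is a rough Python sketch of the idea. Everything in it (`Story`, `Spec`, `by_author`, `min_diggs`, `InMemoryStoryRepository`) is a hypothetical name invented for illustration, and the backend is just an in-memory list; a repository sitting on an RDBMS or an object database would have to translate the same spec objects into its own query form, which is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Story:
    title: str
    author: str
    diggs: int


class Spec:
    """A query described as an object; specs compose with & and |."""

    def __init__(self, predicate: Callable[[Story], bool]):
        self._predicate = predicate

    def is_satisfied_by(self, story: Story) -> bool:
        return self._predicate(story)

    def __and__(self, other: "Spec") -> "Spec":
        return Spec(lambda s: self.is_satisfied_by(s) and other.is_satisfied_by(s))

    def __or__(self, other: "Spec") -> "Spec":
        return Spec(lambda s: self.is_satisfied_by(s) or other.is_satisfied_by(s))


def by_author(name: str) -> Spec:
    return Spec(lambda s: s.author == name)


def min_diggs(threshold: int) -> Spec:
    return Spec(lambda s: s.diggs >= threshold)


class InMemoryStoryRepository:
    """Hypothetical repository; callers only ever hand it a Spec."""

    def __init__(self, stories: Iterable[Story]):
        self._stories = list(stories)

    def find(self, spec: Spec) -> List[Story]:
        return [s for s in self._stories if spec.is_satisfied_by(s)]


repo = InMemoryStoryRepository([
    Story("Cassandra at Digg", "alice", 120),
    Story("Why I like joins", "bob", 45),
    Story("Sharding war stories", "alice", 8),
])

popular_by_alice = repo.find(by_author("alice") & min_diggs(50))
popular_titles = [story.title for story in popular_by_alice]
```

The calling code never mentions the storage engine, so the "discrete query" lives in one named object either way; whether the store underneath can optimize it as well as a SQL planner can is the part this thread is actually arguing about.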
It is true there are some problems an RDBMS is good for, but your post comes off like someone who has never really use
Thanks for the comprehensive reply.
ORMs are syntactic sugar for the underlying database operations. It's possible to bypass them when you need SQL's full power and access the same data store.
So create a table of addresses and use foreign keys to connect them to whatever other table you'd like. Since when does a relational structure require a garbage schema like your example? But surely you know all that.
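A minimal sketch of that addresses-table idea, using Python's built-in sqlite3 module purely so it runs anywhere (the table and column names are made up for the example, and SQLite stands in for whatever RDBMS you actually use):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless asked; most other
# RDBMSes enforce declared foreign keys by default.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE addresses (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        line    TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.executemany(
    "INSERT INTO addresses (user_id, line) VALUES (?, ?)",
    [(1, "1 Main St"), (1, "PO Box 42")],
)

# Any number of addresses per user, no Address1..Address5 columns needed.
address_lines = [
    line
    for (line,) in conn.execute(
        "SELECT a.line FROM addresses a "
        "JOIN users u ON u.id = a.user_id "
        "WHERE u.name = ? ORDER BY a.id",
        ("alice",),
    )
]

# The database itself refuses an address that points at no user.
try:
    conn.execute("INSERT INTO addresses (user_id, line) VALUES (99, 'nowhere')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The integrity check at the end is the part that is hard to get from application-only validation when several programs write to the same data.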
But doesn't that then preclude accessing the same data set from programs written in other languages? The beauty of SQL is that it's language-agnostic.
You also make several points relating to toolchains and testing: sure, some databases have better tools than others. But we're talking about differences between models, not differences between particular tools.
I can't tell from the 4 lines of text buried in ads that make up this supposed article, but I'm guessing this "NoSQL" still uses an SQL database backend?
And why wouldn't a relational database system be perfect for Facebook?
1) NoSQL databases are just that: NO SQL. There is no relational database involved.
2) No, relational models are not good for Facebook-style data. Facebook uses a lot of trees, networks and graphs, none of which are easy to store in a relational system. Facebook also has a lot of dynamic schema requirements, and again SQL does not cope with this well. At the scale Facebook operates at, they are forced to use techniques like sharding and partitioning of their data sets, at which point a lot of what makes the relational model useful becomes difficult to use, e.g. joins across database servers are really hard to do.
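To show why joins get awkward once data is sharded, here is a toy Python sketch. It is an assumption-laden illustration, not how Facebook or Digg actually do it: the shards are plain in-memory dicts standing in for separate servers, and `shard_for`, `put_user`, `put_post` and `posts_with_profiles` are hypothetical helpers.

```python
import hashlib

NUM_SHARDS = 4


def shard_for(key: str) -> int:
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Each dict stands in for a separate database server.
user_shards = [dict() for _ in range(NUM_SHARDS)]  # user name -> home town
post_shards = [dict() for _ in range(NUM_SHARDS)]  # post id -> post record


def put_user(name: str, town: str) -> None:
    user_shards[shard_for(name)][name] = town


def put_post(post_id: str, author: str, text: str) -> None:
    post_shards[shard_for(post_id)][post_id] = {"author": author, "text": text}


def posts_with_profiles():
    """Application-side 'join': visit every post shard, then look each
    author up on whichever user shard holds that author."""
    joined = []
    for shard in post_shards:
        for post_id, post in shard.items():
            author = post["author"]
            town = user_shards[shard_for(author)].get(author)
            joined.append((post_id, author, town))
    return sorted(joined)


put_user("alice", "Oslo")
put_user("bob", "Lima")
put_post("p1", "alice", "hello")
put_post("p2", "bob", "first post")
put_post("p3", "alice", "again")

joined = posts_with_profiles()
```

On a single server this would be one JOIN that the planner optimizes; here the application has to fan out across shards and stitch the results together itself, which is the cost being described.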
PostgreSQL is a real relational database that supports views, nested SQL, triggers, foreign keys, and even statistical analysis.
I think MySQL supports foreign keys now and my info might be dated. But if a database does not support foreign keys then it's not a real relational database, and MySQL had that problem for years.
Once you switch over, you'll find that hard, processor-intensive tasks that took minutes can be done in seconds using the PostgreSQL features I described above. You can save a lot of time on complex queries with PostgreSQL.
http://saveie6.com/
I imagine with the continual growth of these social networks, high performance DB methodologies will experience tremendous growth, and perhaps even paradigm shifts in the way we logically think and design database architectures.
Your statement that social networks push databases to their theoretical limits is laughable. Larger, more frequently accessed, more complicated databases have existed for years (decades?) before the current crop of Friendster clones existed. Just because Facebook is the largest, most "high performance" database application that you can think of doesn't make it remotely true.
The problem of dealing with very large, frequently changing databases has been addressed and solved, already. The problem is that most PHP-monkeys have -zero- database knowledge, and instead of doing the work to figure out the right way to do things, they feel like they need to re-invent the wheel. A better solution is to pick up a book written by somebody who's been working with RDBMS' for a few decades. It's not a quick fix, but this problem has already been solved many, many times over.
I don't respond to AC's.
Java is a whole platform that is scalable. It's not just about using identifiers and objects but using the vast APIs. Some would say Java is even an OS, as it has its own I/O, threads, etc.
I suppose you could write your own threading and processes code, but most Java developers just use what's built into the API.
I have worked with large PostgreSQL databases (150GB or so) and really, Postgres isn't a solution. You run into issues anyway when some of your tables contain millions or even billions of rows. At that stage things like vacuuming or altering the schema start to become damn near impossible, and even querying starts to become a bottleneck.
Now how do you scale that if your database is still growing? Postgres doesn't have a decent clustering solution that I know of, so your options are either to roll your own, or to scale vertically. Both of those are expensive options.
Based on my experience, I don't think that relational databases are appropriate for really large databases, and at present the only realistic option is horizontal scaling which is a lot easier with things like Cassandra or MongoDB.
Free posters and articles for business analysts and project managers
The 'n' stands for 'Not' and the 'o' stands for 'Only', so it's wrong to read it as NO SQL, it should be seen as Not Only SQL. I.o.w.: not a move away from sql, but exploring other options besides SQL
Never underestimate the relief of true separation of Religion and State.
I just read your comment and checked the PostgreSQL DB I am working with. It's only 1.7GB at this point, but growing, and the most rows in a table is 12.6 million. This DB is heavily used by a number of background processes, which select, insert, update and delete large volumes of data, and by 14 people at this point, who each run about 400 various reports per day as well as updating some data. The average time that a single user has to wait is 6 seconds per report. Those reports are optimized of course, but they normally span between 1 day and one month worth of sales data, the average being 1 week, while in a day there are on average 5000 sales (the DB grows by that number of sales a day, plus various other product data, client data etc.). The DB is on a single quad-core 5504 Intel, 12GB of RAM, RAID 1 on Intel's 160GB X25 SSD (2 of them), and it's a Gigabit network. This DB is used by the app server, which is a 2 x quad-core 5405 Intels, 16GB RAM, Java 6 and Tomcat 6 for the front end, with a number of back end systems also talking to the DB from the app server.
My point is that for this given setup, PostgreSQL is showing good performance, however I am sure there are differences in the data model setup that really can kill or make the DB work.
You can't handle the truth.
Now, I'm not an expert on database use and don't want to come across as sarcastic, but it's my impression that a lot of the questions that are being asked of these new types of databases simply don't have past analogues, or if they did, they were solved with this sort of approach in an RDBMS, basically using an RDBMS but without the relational part. Hadoop, Google, and all these social networking sites surely aren't all just... confused? Are they?
Please elaborate on how an RDBMS is applicable to what I guess is now called "scaling horizontally", or perhaps more formally known as sharding, or partitioning with redundancy. It's my impression that most of the RDBMS products available today are simply atrocious at this, but if you can point out which books I need to look at, and which products have good support for this sort of scale, I'd love to learn.
Thanks.
Oh, absolutely, I'm not surprised that your setup works well, Postgres is a great RDBMS. Of course, how you design your schema matters a great deal too.
But here is another issue I thought of: backup. For our database it was 24 hours to do a full restore, which isn't practical. The only reasonable solution I know is to use replication, which is a nuisance with Postgres and adds maintenance overhead (keeping the schemas in sync). I'd prefer to have built-in redundancy. Again, I think you get that with Cassandra and MongoDB.
I guess in a few years we'll probably end up with something that combines good properties of both key-value stores (redundancy and scalability) and RDBMS (powerful query language, transactions).
Or you could just sporge some jargonistic keywords together in an attempt to advertise your get-rich-slowly scheme.
Nerd rage is the funniest rage.
You need a good RDBMS engine, and as much as people pooh-pooh MSSQL Server, it's a good engine. I have used it for databases in the 150TB range. If you do your schema right, build your indexes correctly, and plan your partitions and file groups well, you can get great performance out of affordable hardware. Now you do need to maintain this thing, or develop the automation around building those partitions and moving data into and out of them based on tombstones or some other criteria, or you get underwater real fast.
I don't care what technology you pick; if you are going to deal with that much data you need to:
1. Understand the problem well.
2. Spend the time with whatever tools you select to really understand how they work, and build whatever you need to fill in where they are deficient.
When you start doing anything that big it's not plug and play anymore, no matter how you go about it.
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Putting a proxy between the client and the server to handle the replication does not make Postgres horizontally scalable. Nor does doing a periodic table dump and copying it to the other machines. Postgres might be a ton more efficient than MySQL, but it is in no way scalable.
While insightful and informative in its own right, that isn't a logical response to my post.
He was asking for an alternative to MySQL. I was pointing out that moving from MySQL to PostgreSQL was not done by large companies with a lot of smart people working for them, because any performance improvements were not worth it. PostgreSQL's vertical and horizontal scalability did not represent an improvement over MySQL. I didn't even mention vertical vs horizontal scalability. In the end you end up with a raw number saying we can handle X many requests in our total system, regardless of the individual performance numbers of any part of the system.
You're right, he probably isn't the lead engineer of Flickr and probably doesn't need Cassandra's power, but I think it really says something that while a lot of these companies are switching away from MySQL, they aren't switching towards PostgreSQL. But as always, anyone considering any kind of switch must do their due diligence in assessing the potential performance improvements of any new solution.