Digg Says Yes To NoSQL Cassandra DB, Bye To MySQL
donadony writes "After Twitter, now it's Digg who's decided to replace MySQL and most of their infrastructure components and move away from LAMP to another architecture called NoSQL that is based on Cassandra, an open source project that develops a highly scalable second-generation distributed database. Cassandra was open sourced by Facebook in 2008 and is licensed under the Apache License. The reason for this move, as explained by Digg, is the increasing difficulty of building a high-performance, write-intensive application on a data set that is growing quickly, with no end in sight. This growth has forced them into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead."
Cassandra is basically a sloppy implementation of UniVerse and related products. Why sloppy? Because the idea of a separate file access for each column sucks - use a union or struct as necessary, people!
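For readers wondering why horizontal partitioning eats into the value of a relational database, here is a rough sketch in Python. The shard names, the `user_id` key, and the `record_digg` / `users_who_dugg` helpers are all made up for illustration; this is not Digg's actual scheme. The point is only that once rows are spread across servers by key, a question that spans keys can no longer be a single join and has to be stitched together in application code.

```python
import zlib

# Hypothetical shard names; in practice these would be separate database servers.
SHARDS = ["db0", "db1", "db2", "db3"]


def shard_for(user_id: str) -> str:
    """Pick a shard from a stable hash of the key (horizontal partitioning)."""
    return SHARDS[zlib.crc32(user_id.encode("utf-8")) % len(SHARDS)]


# One in-memory dict per shard stands in for one database per shard.
stores = {name: {} for name in SHARDS}


def record_digg(user_id: str, story_id: int) -> None:
    """A single-user write touches exactly one shard."""
    stores[shard_for(user_id)].setdefault(user_id, set()).add(story_id)


def users_who_dugg(story_id: int) -> list:
    """A cross-user question has to ask every shard and merge the answers."""
    found = []
    for shard in stores.values():
        found.extend(u for u, stories in shard.items() if story_id in stories)
    return sorted(found)


record_digg("alice", 564)
record_digg("bob", 564)
record_digg("carol", 7)
diggers = users_who_dugg(564)
```

Writes stay cheap because each one lands on a single shard, but the fan-out in `users_who_dugg` is work the database used to do for you.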
In other news, Cassandra developers are celebrating the fact that their database is now used to store the largest amount of worthless information in history.
Negative moral value of force outweighs the positive value of good intentions.
Reddit also recently switched to Cassandra.
I imagine that with the continual growth of these social networks, high-performance DB methodologies will experience tremendous growth, and perhaps even paradigm shifts in the way we logically think about and design database architectures. Instead of this flat 2D table mentality, imagine n-dimensional matrices of data, scaling dimensions instead of table and row counts.
I bet if you converted Facebook to this n-dimensional 'table' model, and did a couple inner-joins and unions, you could rip space-time wide-open!
'We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.' RPF
Or away from MySQL? There is a difference.
From the Digg blog - http://about.digg.com/node/564
"And if that doesn't sound like a big enough challenge, we're replacing most of our infrastructure components and moving away from LAMP."
Cassandra Linux Apache PHP?
creation science book
The sad thing is that Monty's MySQL fanboys will blame this on Oracle, when in reality the move to Cassandra (or other NoSQL databases) is what a lot of web sites should be doing regardless of who holds the MySQL reins.
And why wouldn't a relational database system be perfect for Facebook?
If you mod me down, I will become more powerful than you can imagine....
I too have a site running on MySQL and I am thinking of switching.
Can anyone tell me if there is any "comparison chart" listing the various features / usability of the various OSS DB packages available so I can make a better educated decision?
Please help !
Thank you !
Muchas Gracias, Señor Edward Snowden !
100% of hosting companies do not have Twitter, Facebook, Reddit, or Digg as their clients. It's a different market. MySQL does have a competitor in this space called PostgreSQL. It's pretty good. Pretty much every hosting company I would consider doing business with also offers it. But again, PostgreSQL wouldn't have saved the day for these companies; they've reached a different sector of the market due to their enormous scale.
Well.. maybe. Or Maybe not. But Definitely not sort of.
Will Slashdot switch?
You couldn't even be bothered to read up on what ANT actually was, could you...
"Ant is a Java-based build tool. In theory, it is kind of like Make, without Make's wrinkles and with the full portability of pure Java code."
Well, I don't know too many people who program in C and use Ant. And a glance at the FAQ implies it's Java-based (it talks about the JVM a bit).
I guess Cassandra just isn't really targeted at the market segment where the overhead of a JVM would make much of a difference, even if it would make redundancy easier.
The World Wide Web is dying. Soon, we shall have only the Internet.
So what's the advantage of switching?
I have a policy: if it ain't broke, don't fix it.
These slides present a balanced and comprehensive overview of the current state of free databases. Whether you're in the NoSQL camp or not, they're worth reading.
That said, here's my take:
It's currently fashionable to replace MySQL with some "NoSQL" database or other. This trend is driven by two factors:
I haven't seen any consideration from potential "NoSQL" adopters of the benefits of using a good relational database like PostgreSQL. There's a world of difference between it and MySQL, and condemning all relational database systems because of bad experiences with MySQL is like condemning all sandwiches because McDonald's once made you sick. In giving up RDBMSes entirely, these developers lose quite a bit of safety, flexibility, and convenience. It's a huge over-reaction.
This field should not be about following trends, though unfortunately, that's how most people choose which technologies to use: it should be about choosing the best tool for the job. And I believe that in the vast majority of cases, the advantages conferred by a relational system --- enforced integrity, interoperability based on SQL, query flexibility, storage flexibility --- make an RDBMS the best choice for almost any job. If you need sloppier semantics for some cases (for example, "eventual consistency"), you can layer that on top of a robust RDBMS.
MySQL has never been a good example of a relational database; the underlying implementation is limited. It's MySQL that is the problem here, not relational databases.
I suspect that it is not the relational model at fault here, but the lack of creativity and competence in implementing a relational database technology. MySQL has perhaps never been a particularly scalable platform; it has a number of severe limitations and does not seem to have been designed with much thought for a distributed environment. Its developers seem to have developed it for small-scale webpages, and have been notorious for leaving out many advanced features, and thus have limited its effectiveness to small, low-powered pages.
It's all in the implementation: it's not the relational database model that needs fixing, it's the underlying implementations.
Don't be too quick to put Java down.. it's slower but it scales fairly well.
On a related note, Reddit's performance and reliability have dropped off significantly since switching to Amazon's "Cloud", and dropped off even further after this switch to Cassandra.
The constant 503 errors, plus horrendous load times when it does manage to work, have driven me and many others away from Reddit. That's why I'm posting here on Slashdot.
Cloud hosting is a stupid idea for anything beyond a blog getting 10 hits per day. All the talk about scalability is pure bunk. I mean, even with the extensive knowledge and infrastructure of Amazon, the Reddit site is slow (and it wasn't like that before they switched).
If you're trying to run a site on a $15/month hosting account, then no, this is probably not for you. But if you're at the stage where MySQL isn't able to handle all the data you're throwing at it, then chances are you won't care about the extra few MB of memory that the Java runtime requires.
> But if you're at the stage where MySQL isn't able to handle all the data you're throwing at it...
...it's time to move up to PostgreSQL.
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
The relational model is consistent and easy to work with. It's easy to specify constraints that describe what the data should look like, and to allow several applications to interact with the data. It's also easier to optimize a database when you can describe discrete queries instead of directly following links from program code as you would in a navigational/object/document/etc. database.
Furthermore, application data models aren't all that object-oriented. Most of the time, the manipulated data types (say, "story", "post", and "user") fall into well-defined categories that correspond well to rows in a table. The few mismatches are easily dealt with in application code.
Sure, using an object database might be "easier" for the first 15 minutes, but you'll kick yourself when you have to manipulate it in any kind of sophisticated fashion.
This isn't your grandfather's JVM.
These days, Java is quite fast and efficient, and there are even a lot of different alternative VMs you can try. Sure, startup time isn't the best, and Swing is still a lumbering, over-engineered, ill-fitting albatross: but these problems don't matter for server applications.
IMHO, the best part is that you can write programs that run on the JVM in a dialect of Lisp and interact seamlessly with other code on the JVM.
Am I the only one who frowns at this moniker?
First, it creates a false premise where people need to pick "SQL" versus "no SQL", while many real-world systems intelligently combine relational and non-relational data storage for their needs. There is no conflict.
Second, there's nothing wrong with SQL as a language in particular, and in fact many of the "noSQL" engines are starting to support and extend basic SQL queries, instead of reinventing their own query language for the same purpose.
I suppose "lessRDBMSabuse" was less catchy...
Bullshit. Languages don't scale: programs do.
Writing a program in Java makes it scalable in the same way that painting a car red makes it fast. The JVM is quite good these days, but don't make up advantages that don't exist.
So do ASP.NET and C#.
I'm sorry, but Java still doesn't compare to C, and those differences *especially* apply to high load server applications.
Java the language isn't scalable on its own.. there's no magic scaling technology built into the JVM.. but the general Java "culture" tends to (in my opinion) achieve at least medium scalability.
When judging a language, you _have_ to look at the culture around it. These days nothing is 100% custom built.. a sizable project is going to import a wide variety of third-party libraries. The general attitude of the community is going to determine how suitable these libraries are for whatever scale you will be using them at.
Same as how languages like Perl on their own don't produce unmaintainable code.. it's the Perl "write once, read never" culture that leads to so much unreadable code.
No DBMS will save the day if you aren't very careful about how you design your usage of very large databases.
In the case of SQL, that would be things like the schema, choice of index columns, views, stored procs, joins, and the contents of SQL statements.
In some cases, the performance of a SQL statement can be horrible, but can be rewritten in a different way to answer the same question but have stellar performance.
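To illustrate the kind of rewrite meant here, a small sketch using Python's built-in sqlite3 module. The `stories` / `comments` schema and the sample rows are invented for the example. Both queries ask "how many comments does each story have?" and return identical rows; which form is faster depends entirely on the engine, the indexes, and the data, so measure on your own system rather than trusting a rule of thumb.

```python
import sqlite3

# Hypothetical schema for illustration: stories and their comments.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stories (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        story_id INTEGER NOT NULL REFERENCES stories(id)
    );
    CREATE INDEX comments_story_idx ON comments(story_id);
""")
conn.executemany(
    "INSERT INTO stories (id, title) VALUES (?, ?)",
    [(1, "Digg moves to Cassandra"), (2, "Reddit on EC2"), (3, "No comments yet")],
)
conn.executemany(
    "INSERT INTO comments (story_id) VALUES (?)",
    [(1,), (1,), (1,), (2,)],
)

# Form 1: a correlated subquery, logically evaluated once per story row.
correlated = """
    SELECT s.title,
           (SELECT COUNT(*) FROM comments c WHERE c.story_id = s.id) AS n
    FROM stories s
    ORDER BY s.id
"""

# Form 2: one outer join plus GROUP BY, asking the same question.
joined = """
    SELECT s.title, COUNT(c.id) AS n
    FROM stories s
    LEFT JOIN comments c ON c.story_id = s.id
    GROUP BY s.id, s.title
    ORDER BY s.id
"""

rows_correlated = conn.execute(correlated).fetchall()
rows_joined = conn.execute(joined).fetchall()
```

Prefixing either statement with `EXPLAIN QUERY PLAN` in SQLite (or `EXPLAIN` in MySQL and PostgreSQL) shows how the engine intends to execute it, which is usually the first step in finding the "stellar" version.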
In my experiences developing applications in both the business and gaming industries, most applications beyond a simple cookbook app/crappy blog are highly object oriented. How else can you explain the wealth of approaches like ORM mappers, the repository and active record patterns, etc ? They are just patches on the relational model to make them friendly to application code. If your domain objects are consistently flat, you are probably doing something wrong. I for one do not want to use an API with Address1 - Address5 string properties. What you just listed as story, post, etc are all just objects, usually with nesting. Relational databases suck at dealing with complex object hierarchies, hence all the joins just because object A has a collection of object Bs which contain an object C.
Can you please define what a sophisticated fashion means? Unless you are a DBA and love SQL/config work, it is far easier to write constraints using an object database. You simply use the same validation and rules you should already be using in your application. If you rely on your database alone to enforce things like required fields, atomicity, etc, then you have failed at creating a good application and likely are ripe for exploits, security holes, bad data, etc anyway. It is true that relational DBs provide certain easy facilities, but any decent object database provides most if not more of these same constructs in another form through its API. For instance, most object databases I have used provide some sort of transactional data structure that supports far more types of locking and concurrency/conflict management than any relational DB I have ever seen. Further, since most object databases are defined and consumed in the languages you develop with, the sophistication is limited only by the language. I'd say you can do a lot more in Smalltalk than SQL for instance.
If you're referring to querying, apparently you've never queried in Smalltalk, C# with LINQ, LISP, or even just using lambdas in Python or Ruby. Querying using the actual object is typically far easier than writing a SQL query. These days it is becoming increasingly rare that someone rolls all their own queries in your average app anyway (see ORMs). You'll often end up with something like an ORM translating some things from the UI into a boatload of queries, then you'll have to go and find fixes for the ORM to avoid making the application grind to a halt due to all the chatter. Although a lot of that is often the function of UI elements, ultimately there is a lot of overhead created by patching the relational and object disconnect.
I am wondering how you think going from relational back to objects, even flat ones, is somehow easier and more consistent. You're adding an extra language, more layers, and more configuration/management for what gain? Object databases hold records for things like throughput for transactions, data population, etc. The performance thing is a myth of the past. I'd say the stumbling block if anything is simply bad developers. An RDBMS does add somewhat of an idiot-proof layer, but really in the end you just end up with even crappier code in other spots.
Finally, you mention that discrete queries are easier to optimize. I again must disagree. If you want discrete queries, you could describe each query on an object with another object. This is exactly what any good developer should be doing with an ORM anyway. For instance, you could use the specification pattern with the repository pattern to describe and issue your queries, object DB or RDBMS. Secondly, instead of some crappy tools from the maker of the RDBMS, with an object DB I have the full facilities of the language to do performance optimization, profiling, logging, etc rather than what a vendor provides. MSSQL provides some great tools, for example, but most other DBs, while nice implementation-wise, provide horrific toolchains.
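For anyone who hasn't seen the specification and repository patterns together, a minimal Python sketch follows. The `Story` type, the `by_author` / `min_diggs` helpers, and the in-memory repository are hypothetical names chosen for the example; a real repository would translate the specification into whatever the backing store (object DB or RDBMS) understands rather than filtering a list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Story:
    title: str
    author: str
    diggs: int


class Specification:
    """A query described as an object: a named, composable predicate."""

    def __init__(self, predicate: Callable[[Story], bool]):
        self._predicate = predicate

    def is_satisfied_by(self, story: Story) -> bool:
        return self._predicate(story)

    def __and__(self, other: "Specification") -> "Specification":
        # Combining two specifications yields a new one requiring both.
        return Specification(
            lambda s: self.is_satisfied_by(s) and other.is_satisfied_by(s)
        )


class InMemoryStoryRepository:
    """Stand-in repository; callers only ever hand it specifications."""

    def __init__(self, stories: Iterable[Story]):
        self._stories = list(stories)

    def find(self, spec: Specification) -> List[Story]:
        return [s for s in self._stories if spec.is_satisfied_by(s)]


def by_author(name: str) -> Specification:
    return Specification(lambda s: s.author == name)


def min_diggs(n: int) -> Specification:
    return Specification(lambda s: s.diggs >= n)


repo = InMemoryStoryRepository([
    Story("Cassandra at scale", "kevin", 250),
    Story("My lunch", "kevin", 3),
    Story("LAMP is dead", "rose", 400),
])
popular_by_kevin = repo.find(by_author("kevin") & min_diggs(100))
```

Because each query is a discrete object, it can be logged, profiled, or cached by name, which is the property being argued for above.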
It is true there are some problems an RDBMS is good for, but your post comes off like someone who has never really use
Thanks for the comprehensive reply.
ORMs are syntactic sugar for the underlying database operations. It's possible to bypass them when you need SQL's full power and access the same data store.
So create a table of addresses and use foreign keys to connect them to whatever other table you'd like. Since when does a relational structure require a garbage schema like your example? But surely you know all that.
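A quick sketch of that alternative to Address1 - Address5 columns, using Python's built-in sqlite3 module. The `customers` / `addresses` tables and the sample data are made up for illustration. Each address is its own row tied to its owner by a foreign key, so a customer can have any number of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless asked; most other RDBMSes enforce by default.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE addresses (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        line        TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO addresses (customer_id, line) VALUES (?, ?)",
    [(1, "1 Main St"), (1, "PO Box 42"), (1, "99 Harbour Rd")],
)

# One join returns however many addresses the customer happens to have.
address_lines = [
    row[0]
    for row in conn.execute(
        "SELECT a.line FROM customers c "
        "JOIN addresses a ON a.customer_id = c.id "
        "WHERE c.name = ? ORDER BY a.id",
        ("Alice",),
    )
]
```

An ORM would typically expose the same structure as a collection property on the customer object, so application code never sees numbered columns either way.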
But doesn't that then preclude accessing the same data set from programs written in other languages? The beauty of SQL is that it's language-agnostic.
You also make several points relating to toolchains and testing: sure, some databases have better tools than others. But we're talking about differences between models, not differences between particular tools.
So much horseshit in just one slide deck. No matter what you do, unless you have at least a hundred machines at your disposal, Hadoop won't be faster than a single-box grep from SSDs. LucidDB is excruciatingly slow for all but the tiniest datasets. I've tried a good half dozen "solutions" from this slide deck (including Aster), and other than Postgres all of them suck ass, more or less. If you see ANYTHING other than Nutch with Hadoop as a backend, head for the hills right away.
It's kind of like Make, but with a lot more XML
Climate Progress - Hell and High Water
MySQL sucked for many years but is getting better with each release. It was never designed to be a full RDBMS.
In Japan people use PostgreSQL, and I am surprised that it's not more common among geeks. Many ISPs now offer it as well as MySQL. The problem is that the trendy word is NoSQL, and the movement is mostly promoted by non-database programmers who had bad experiences trying to learn MySQL to do very complicated things.
PostgreSQL is very easy to switch your existing code to if you used SQL-compliant code in languages such as PHP. Triggers, views, stored procedures, and the ability to self-repair after a power failure make PostgreSQL an easier platform to develop for.
http://saveie6.com/
Languages with namespaces scale better than languages without namespaces.
http://michaelsmith.id.au
Java is a whole platform that is scalable. It's not just about using identifiers and objects but using the vast APIs. Some would say Java is even an OS, as it has its own I/O, threads, etc.
I suppose you could write your own threading and process code, but most Java developers just use what's built into the API.
Which is more expensive, a few extra machines or developer time? (I'm assuming a solution that scales properly here; you can write scalable solutions in any language.)
HAND.
Moving from LAMP to CLAP sounds like a new STD stack for open source development.
Many thanks for the explanation ! :)
The 'n' stands for 'Not' and the 'o' stands for 'Only', so it's wrong to read it as NO SQL; it should be seen as Not Only SQL. I.o.w.: not a move away from SQL, but exploring other options besides SQL.
Never underestimate the relief of true separation of Religion and State.
There seems to be this angry pushback from a core of dedicated SQL programmers, acting as if someone had insulted their tin god and wanted to invalidate their lives' work. Not at all. All that has been developing is the realization that RDBMS's are not the best fit for all applications, and that other storage schemes might have a better impedance match with the needs of a particular design. RDBMS's are still robust and reliable and useful for (maybe most) applications. Only some apps' data does not fit nicely into rows and columns. And you should design your code around the data, not try to morph the data to your software.
Java is a whole platform that is scalable. It's not just about using identifiers and objects but using the vast APIs. Some would say Java is even an OS, as it has its own I/O, threads, etc.
OMFG! The amount of fanboyism is amazing...
Java libraries may be good, but they IN NO WAY make a program 'automatically scale'.
You can't just write a non-trivial program in Java and have it automatically scale horizontally.
Don't take away from the merits of the Cassandra developers by saying it was easy because of Java.
how long until
I think the entire platform is the issue: at the language level you get threading, vast concurrency APIs, etc...
At the platform level you get cloud features like distribution over multiple servers in real time, transactional locking over an entire cloud, etc...
It really depends on which features of the stack you choose but the scaling features of Java and JEE are phenomenal without too much effort.
Or you could just sporge some jargonistic keywords together in an attempt to advertise your get-rich-slowly scheme.
Nerd rage is the funniest rage.
How nice and readable ! ;-)
I deal with JavaScript pretty much every day and like it, but even I thought it was ugly the first time I looked at it.
A second look just tells me that indentation would really help here.
db.things.find ( {x:4}, {j:true} ).forEach (
    function (x) {
        print ( tojson(x) );
    }
);
I can see how that would work with map/reduce.
New things are always on the horizon
No idea where you got that particular piece of misinformation. :)
If you rely on your database alone to enforce things like required fields, atomicity, etc, then you have failed at creating a good application
At some point, I don't want to create a good application. I want to create a good model for the business, and then create applications that interact with that model. If I enforce constraints only in an application, someone else will forget to enforce the constraints when she writes a different application written in a different language sitting on the same database. So I enforce some constraints, such as not null or foreign key, in the table definitions. In trickier cases, I write test case reports to make sure no other application interacting with the same database has made it less than consistent. Where I denormalize to work around limitations in the RDBMS's query optimizer *cough*MySQL subqueries with GROUP BY*cough*, I put the denormalized columns into a separate table whose name ends with "cache", with a table comment pointing to which module of which app is responsible for maintaining it.
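A small sketch of what enforcing constraints in the table definitions buys you, using Python's built-in sqlite3 module. The `users` / `posts` schema and the `try_insert` helper are invented for the example. Whatever application issues the statement, in whatever language, the database itself refuses rows that break the rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite needs foreign key enforcement switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE posts (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        body    TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")


def try_insert(sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


insert_post = "INSERT INTO posts (user_id, body) VALUES (?, ?)"
ok = try_insert(insert_post, (1, "hello"))       # valid row
missing_body = try_insert(insert_post, (1, None))  # violates NOT NULL
orphan = try_insert(insert_post, (999, "hi"))    # violates the foreign key
```

Application-level validation is still worth having for friendly error messages; the table constraints are the backstop for the application that forgets.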
Another example of technically knowledgeable people picking a really bad name.
The term shared-nothing architecture dates back to 1986. The concept goes back further; for example, the Teradata RDBMS dates back to 1976-1983.
So ignorance about database technologies and products is one issue, and the one hardest to excuse. There are however other issues that are more understandable:
The existing solutions are proprietary, and web startups tend to prefer open source solutions.
The existing solutions are biased toward OLAP, analytics and data mining, i.e., taking large volumes of data and analyzing large sets of it at a time. Some of the "NoSQL" products are built for this (e.g., Hadoop + HDFS), but there are others that are more aimed toward simple transactional processing.
So it really is the case that there do not exist good relational products that tackle something like Digg, Facebook or Twitter, which want to use commodity hardware in geographically dispersed locations. However, it is also the case that the "NoSQL movement" that has sprung up to fill this gap has a combination of ignorance and animosity towards the relational model, and its members are just not thinking the problem through.
Are you adequate?
http://www.firebirdsql.org/
Firebird is a relational database offering many ANSI SQL standard features that runs on Linux, Windows, and a variety of Unix platforms. Firebird offers excellent concurrency, high performance, and powerful language support for stored procedures and triggers. It has been used in production systems, under a variety of names, since 1981.
Slower than what? C++, definitely, in most scenarios; but PHP and other scripting languages? It runs circles around them.