How Twitter Is Moving To the Cassandra Database
MyNoSQL has up an interview with Ryan King on how Twitter is transitioning to the Cassandra database. Here's some detailed background on Cassandra, which aims to "bring together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model." Before settling on Cassandra, the Twitter team looked into: "...HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable, and probably some others I'm forgetting. ... We're currently moving our largest (and most painful to maintain) table — the statuses table, which contains all tweets and retweets. ... Some side notes here about importing. We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast — it would saturate the backplane of our network. We've switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster." Relatedly, an anonymous reader notes that the upcoming NoSQL Live conference, which will take place in Boston March 11th, has announced their lineup of speakers and panelists including Ryan King and folks from LinkedIn, StumbleUpon, and Rackspace.
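By the interview's own figures (about a week throttled versus about 7 hours unthrottled), the load runs roughly 24 times slower than the cluster could absorb. The interview does not say how Twitter throttles its loader. The following is only a minimal sketch of one common approach, a token bucket, assuming a hypothetical `write_batch` callable that stands in for whatever client call actually sends rows; `Throttle` and `bulk_load` are illustrative names, not part of Cassandra or Thrift.

```python
import time


class Throttle:
    """Token bucket: at most `rate` rows per second, bursts up to `burst` rows."""

    def __init__(self, rate, burst, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.clock = clock           # injectable so the sketch can be tested
        self.sleep = sleep
        self.last = clock()

    def acquire(self, n=1):
        """Block until n tokens are available, then consume them."""
        if n > self.burst:
            raise ValueError("n exceeds bucket capacity")
        while True:
            now = self.clock()
            # Refill in proportion to elapsed time, capped at the bucket size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            # Sleep just long enough for the missing tokens to accumulate.
            self.sleep((n - self.tokens) / self.rate)


def bulk_load(rows, write_batch, throttle, batch_size=100):
    """Send rows in batches, waiting on the throttle before each write.

    `write_batch` is a hypothetical callable (an assumption of this sketch).
    Returns the number of rows sent.
    """
    sent = 0
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            throttle.acquire(len(batch))
            write_batch(batch)
            sent += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        throttle.acquire(len(batch))
        write_batch(batch)
        sent += len(batch)
    return sent
```

Lowering `rate` trades total load time for headroom on the network, which is the trade-off the interview describes.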
heh heh heh.
Nostalgia's not what it used to be.
I hear Cassandra can even predict when disastrous system failures are going to occur! Unfortunately, for some reason nobody ever believes the warnings.
Facebook uses Cassandra, Digg uses Cassandra, Twitter is moving to Cassandra. Maybe in 5 years Slashdot will get with it.
First time I have ever heard anyone say that a database was too fast. Maybe there are network problems that also need to be addressed.
I look forward to a brand new twitter that randomly doesn't display expected data and sometimes doesn't take my status updates!
Why is it that whenever Twitter makes any random change to some part of its infrastructure, we need a front page story about it?
who cares what twuufter is running off.
The more interesting aspect of this whole 'NoSQL' movement is the belief that achieving some speed improvement over some relational databases makes these systems so much better.
If you don't really need a database to run your 'website', then who cares if you use flat files or an in-memory hashmap for all your data needs? Databases are not being replaced by NoSQL in projects that need databases. The projects that may never have needed databases may benefit from this NoSQL idea, but if you actually need a database... well, you had better be really good at working around all the kinds of problems that this will create for you.
I think that relational databases are good at what they do and that many projects may not need them, but if you do need them on the back end, you will end up with them on the back end. Of course there may be some caching/hashmaps/files on the front end, but at the back, stuff will be sorted out within a real datastore.
Is there really a huge issue with RDBMS speeds? Well, if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.
You can't handle the truth.
I'm reluctant to believe that Twitter is a good technology bellwether. Twitter seems to have so many technology issues, fail whales, outages, security breaches...
So I'm left wondering: what does this move say? Does it say that Cassandra is so bad that Twitter is using it? Or does it say that a fail whale population boom is imminent?
They should move to InterSystems Caché. SQL, objects, XML and even MUMPS. It will make SQL and NoSQL fans equally happy. And it's damn fast. Much leaner than Oracle, DB2 or Informix, too. Excellent support. Extremely good. Not cheap, though.
I hear Cassandra is really a trojan. Can anyone verify? I don't want a trojan on my computer.....
LedgerSMB: Open source Accounting/ERP
I love how ass backwards twitter has always been with learning how to scale their 90s infrastructure up. I remember when they called out the Ruby community because they didn't understand MySQL replication and memcached.
I guess without a profit model they couldn't use a real RDBMS like Oracle. EFD (Enterprise Flash Drive) support anyone? 11g supports EFD on native SSD block-levels. Write scale? How about 1+ million transactions/sec on a single node Oracle DB using <$100K worth of equipment and licenses? Anyway, I've built HUGE databases for a long time, odds are most of you have interfaced with them. Just because it's free and open-source doesn't make it cheap.
I love FOSS, don't get me wrong, but best-in-class is best-in-class. I only use FOSS when it happens to be best-in-class. I laugh at how none of the requirements included disaster recovery. No single point of failure does not preclude failing at every point simultaneously. EMP bomb at your primary datacenter, anyone?
Speed is latency (how long it takes).
Scalability is throughput (how many concurrent). Or, put another way: speed is the quality, throughput is the width.
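The distinction above can be put in numbers with Little's law, which relates the two: throughput equals requests in flight divided by latency per request. Little's law is brought in here as an illustration and is not something the comment cites; the figures below are made up purely for the example.

```python
def throughput(concurrency, latency_s):
    """Little's law rearranged: requests/sec = requests in flight / seconds per request."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return concurrency / latency_s


# A "fast" system: 1 ms per request, but only one request at a time.
fast_but_narrow = throughput(concurrency=1, latency_s=0.001)

# A "slow" system: 50 ms per request, but 200 requests in flight at once.
slow_but_wide = throughput(concurrency=200, latency_s=0.050)

# The slower system still moves more requests per second overall.
print(fast_but_narrow, slow_but_wide)
```

With these made-up numbers, the system with 50 times worse latency has about 4 times the throughput, which is why a benchmark of one property says little about the other.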
> who cares what twuufter is running off.
Well, developers and their managers do. They're nothing if not fashion victims.
RDBMSs aren't the be-all and end-all of scalability (or speed; they perform a shit load of management functions you may or may not need). When you attempt to scale a conventional RDBMS, you run into write consistency problems and lookup performance problems unless you design your data structures specifically for it. You end up fighting with the relational data model.
Most developers never even think about it; they just develop against their local MySQL install and are overjoyed that their app actually runs. Not all apps even need an RDBMS. I've seen apps with a single table of two columns, one of which is a key, running on an RDBMS, because that's what you do... The words WTF sprang to mind.
It's always funny to read things written by people who obviously are inexperienced with high volume transaction processing in the mainframe environment. The systems behind airline, rail, and hotel reservations as well as emergency response messaging often are built on IBM mainframes using TPF/ZTPF as the operating system and TPFDB (formerly known as ACPDB) as the underlying database. If someone would take the time to study TPFDB, they would notice its nonrelational character, as well as some interesting similarities to what the Cassandra developers unknowingly chose to do. By the way, these systems are happily handling 10K-12K transactions per second without bunny farm racks of servers.
Sometimes progress is not about what will be done, but about understanding the benefits of older things that have been done.
> By the way, these systems are happily handling 10K-12K transactions per second without bunny farm racks of servers.
The airline systems have entire *data centers* instead, to say nothing of the enormous transaction processing infrastructure in between.
Flamebait?
Do I have to spell out the joke to people?
Or is it just that nobody reads Homer anymore?
It's fascinating how, after initially being a poster boy for the post-Java revolution, Twitter is gradually moving its architecture to the JVM, piece by piece. I think it's actually a credit to them that they seem to have level heads and are evaluating technology on its merits (whereas most of the Ruby / Python crowd, if you talk to them, would rather stick toothpicks in their eyes than endorse a solution that involves Java).
Nice point. Thanks for this. Data/transaction processing is not really my area of expertise, but I've always worked on the assumption that nothing I'm doing is new on a technical level. This goes to show it. What the F/OSS community should focus on, be it through research groups or otherwise, is human-computer interaction. This is a relatively new field of study, maybe 20 years old, so there's a lot less catching up to do. My conspiracy theory hat of yesteryear would probably take a stab that this is why Oracle cut funding to the accessibility projects of Sun/GNOME: just to extend the gap between free and commercial HCI offerings.
But found that its backup policy required horcruxes.
Teradata seems to win typical OLTP and OLAP benchmarks. I would think for airline reservations and such that would be my choice of platform.
A lot of the complaints from the NoSQL side seem to be about DBMSes being too slow and SQL being too hard. And yet a lot of them invent query languages similar to SQL. Supposedly Oracle scales up really well. There is a paper that compares MapReduce to parallel databases, and Hadoop takes a huge beating from the RDBMSes in performance. Now the funny thing is that Oracle was not included, yet most contend that if you pay enough, Oracle scales really well. DB2 also scales: in 1999 I worked at a place with terabytes of database space, and they had a few nodes running DB2 on AIX boxes and seemed to be getting adequate performance.
But most open source databases seem unable to compete with the likes of the commercial parallel databases. It seems like an open source parallel database would do a lot to silence many NoSQL critics. There is still the complaint about needing to define a schema; however, if you are not exploring the data and are processing the same data over and over again, it seems like a good idea to define a schema anyway, so that you can better detect files that don't conform.
I was going to try to write something funny about twitter only needing three tables to run and how hard is it to change but then I thought about how much money they're going to make off those three tables and I started to cry.
Cassandra has the goods for high availability and is optimized for non-financial data.
That said, I am amazed at how much time, money, and effort has gone into Twitter.
Now a distributed scalable super duper database will keep track of who is pooping. http://poop.obtoose.com/
The problem is that 10K-12K transactions per second is 1/100 of what Twitter needs.
http://slashdot.org/~roman_mir/comments - I imagine the twater storm of moderation points was spent well this time: every single post I had on this issue was above 3 points, and now, within 1 hour, all comments were moderated down. To me that's just funny - someone does not like the truth.
I just wonder: is it the twater birds, or does it have something to do with the NoSQL ideologists?
Now, how about the ACI part?