How Twitter Is Moving To the Cassandra Database
MyNoSQL has posted an interview with Ryan King on how Twitter is transitioning to the Cassandra database. Here's some detailed background on Cassandra, which aims to "bring together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model." Before settling on Cassandra, the Twitter team looked into: "...HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable, and probably some others I'm forgetting. ... We're currently moving our largest (and most painful to maintain) table, the statuses table, which contains all tweets and retweets. ... Some side notes here about importing. We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast: it would saturate the backplane of our network. We've switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster." Relatedly, an anonymous reader notes that the upcoming NoSQL Live conference, which will take place in Boston on March 11th, has announced its lineup of speakers and panelists, including Ryan King and folks from LinkedIn, StumbleUpon, and Rackspace.
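For readers wondering what "throttling" a bulk load might look like in practice, here is a minimal sketch of a token-bucket rate limiter in Python. It is not Twitter's importer: the interview does not describe their mechanism, so the names `TokenBucket`, `bulk_load`, and the `send` callback (a stand-in for whatever client call writes a batch to the cluster) are all hypothetical, and the rate values are purely illustrative.

```python
import time


class TokenBucket:
    """Byte-rate limiter: at most `rate` bytes/second on average,
    with bursts of up to `capacity` bytes. Hypothetical helper."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so the limiter can be tested
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, n):
        """Block until `n` bytes' worth of tokens are available, then spend them."""
        if n > self.capacity:
            raise ValueError("batch is larger than the bucket capacity")
        while True:
            self._refill()
            if self.tokens >= n:
                self.tokens -= n
                return
            # Sleep just long enough for the missing tokens to accumulate.
            self.sleep((n - self.tokens) / self.rate)


def bulk_load(batches, send, bucket):
    """Send each batch through `send`, never exceeding the bucket's byte rate.

    `send` is an assumed placeholder for the real write call.
    Returns the total number of bytes sent.
    """
    sent = 0
    for batch in batches:
        bucket.acquire(len(batch))
        send(batch)
        sent += len(batch)
    return sent
```

Going by the figures quoted in the interview (about a week throttled versus about 7 hours unthrottled), the load would be running at roughly 1/24 of what the cluster itself could absorb, which is the kind of gap a limiter like this is meant to enforce deliberately.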
Scaling. If something turns out to be robust and fast enough for Twitter, it is definitely of interest to anyone working on large, busy websites.
De gustibus et coloribus non est disputandum
I love how ass-backwards Twitter has always been about learning how to scale up their 90s infrastructure. I remember when they called out the Ruby community because they didn't understand MySQL replication and memcached.
I guess without a profit model they couldn't use a real RDBMS like Oracle. EFD (Enterprise Flash Drive) support, anyone? 11g supports EFD on native SSD block-levels. Write scale? How about 1+ million transactions/sec on a single-node Oracle DB using <$100K worth of equipment and licenses? Anyway, I've built HUGE databases for a long time; odds are most of you have interfaced with them. Just because it's free and open source doesn't make it cheap.
I love FOSS, don't get me wrong, but best-in-class is best-in-class. I only use FOSS when it happens to be best-in-class. I laugh at how none of the requirements included disaster recovery. Having no single point of failure does not preclude failing at every point simultaneously. EMP bomb at your primary datacenter, anyone?