How Twitter Is Moving To the Cassandra Database
MyNoSQL has posted an interview with Ryan King on how Twitter is transitioning to the Cassandra database. Here's some detailed background on Cassandra, which aims to "bring together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model." Before settling on Cassandra, the Twitter team looked into: "...HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable, and probably some others I'm forgetting. ... We're currently moving our largest (and most painful to maintain) table - the statuses table, which contains all tweets and retweets. ... Some side notes here about importing. We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast — it would saturate the backplane of our network. We've switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster." Relatedly, an anonymous reader notes that the upcoming NoSQL Live conference, which will take place in Boston on March 11th, has announced its lineup of speakers and panelists, including Ryan King and folks from LinkedIn, StumbleUpon, and Rackspace.
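To put the quoted numbers in perspective: a week is 168 hours, so a throttled load that takes a week instead of 7 hours is running at roughly 1/24th of what the cluster could absorb. The sketch below works that out and shows one simple way a bulk loader can be rate-limited. It is only an illustration, not Twitter's loader: `throttled_load`, `send`, and `max_batches_per_sec` are hypothetical names, and `send` stands in for whatever client call actually writes a batch.

```python
import time


def slowdown_factor(throttled_hours, unthrottled_hours):
    """How many times slower the throttled load is than the unthrottled one."""
    return throttled_hours / unthrottled_hours


def throttled_load(batches, send, max_batches_per_sec,
                   clock=time.monotonic, sleep=time.sleep):
    """Send each batch with send(), spacing sends at least 1/max_batches_per_sec apart.

    `send` is a hypothetical callable standing in for the real write call.
    `clock` and `sleep` are injectable so the pacing can be tested without waiting.
    """
    interval = 1.0 / max_batches_per_sec
    next_slot = clock()  # earliest time the next batch may be sent
    sent = 0
    for batch in batches:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)  # we are ahead of schedule, so wait for the slot
        send(batch)
        sent += 1
        # If send() itself was slow, schedule from "now" rather than bursting to catch up.
        next_slot = max(next_slot, clock()) + interval
    return sent


# "About a week" versus "about 7 hours", both figures taken from the interview quote.
week_vs_seven_hours = slowdown_factor(7 * 24, 7)
```

A fixed-interval pacer like this is the bluntest possible throttle; a real loader would more likely adapt the rate to observed network or cluster load.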
Yeah, the actual requirements of the twooter should really be thought over once more.
They may not need any database for their front end at all. That's their problem: they can't scale with the old back end, so they think they'll fix it with this new silver bullet? Maybe they'll have it run faster for a while, but what about some real design? Do they actually need to generate any content for every HTTP request? I doubt it. Maybe all they need is a small cluster of large enough servers to generate all of the necessary static pages and push them periodically to their front end web servers. For the inbound requests they probably don't need a database either, just a queue for the generator cluster to work on to generate the static pages.
That may be all they need, but instead of doing some actual design work and maybe changing some implementation, they'll just do what management normally does in the pointy-haired boss way: get a hammer, hopefully a silver one, and do the same old thing, hopefully marginally faster.
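For what it's worth, the queue-plus-static-pages idea in this comment could look roughly like the sketch below. It is a toy under loud assumptions: every name in it (`drain`, `render_page`, `regenerate`, the `(user, text)` message shape) is made up for illustration, timelines live only in memory, user names are assumed to be safe as file names, and follower fan-out and durability are ignored entirely.

```python
import html
import queue
from collections import defaultdict
from pathlib import Path


def drain(inbound):
    """Pull every pending (user, text) message off the queue without blocking."""
    messages = []
    while True:
        try:
            messages.append(inbound.get_nowait())
        except queue.Empty:
            return messages


def render_page(user, posts):
    """Render one user's posts (newest first) as a static HTML page."""
    items = "\n".join(f"<li>{html.escape(p)}</li>" for p in posts)
    return (f"<html><body><h1>{html.escape(user)}</h1>\n"
            f"<ul>\n{items}\n</ul></body></html>\n")


def regenerate(inbound, timelines, out_dir):
    """One generator pass: apply queued posts, then rewrite only the pages that changed."""
    changed = set()
    for user, text in drain(inbound):
        timelines[user].insert(0, text)  # keep newest posts at the top
        changed.add(user)
    out_dir.mkdir(parents=True, exist_ok=True)
    for user in changed:
        # Assumption: user names are filesystem-safe; a real system must sanitize them.
        page = render_page(user, timelines[user])
        (out_dir / f"{user}.html").write_text(page, encoding="utf-8")
    return changed
```

A generator node would call `regenerate` on a timer and push the output directory to the front-end web servers. The sketch leaves out the hard part of the real problem, which is that one post changes the page of every follower, not just the author's.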
Certainly Amazon is in a different business from the twater; they can put many more minds together to compensate for all of the deficiencies of a non-transactional system where transactions are needed. For example, excessive journaling can be done, and then back-end systems can sort out the details, process 99% of cases successfully, and throw the last 1% at some CSRs in India or wherever they have the call centers.
I am sure that Amazon would have preferred to have a completely transactional system, and their specific problem may well be performance deficiencies of the RDBMSs of their choice. On the other hand, it is also possible that their architecture could be changed to do so, but maybe it was less expensive to go the other way. I haven't worked for them yet, so I don't know. However, I am building a retailer solution right now with a cluster of PostgreSQL nodes that process a few million transactions a day with large growth potential, and where possible I'll stick to the RDBMS, but I certainly do caching and use hashmaps in memory to speed up quite a few report generations and other features.
My point is that twufter never really needed an RDBMS in the first place, so it doesn't matter what they use; a fast enough roll of toilet paper may be sufficient for their purposes, who knows.
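To make the "RDBMS plus in-memory hashmaps for reports" approach concrete, here is a minimal sketch. It is not the poster's system: `sqlite3` stands in for PostgreSQL so the example runs on its own, and the `sales` table, the `SalesReports` class, and its methods are all hypothetical. Writes go through a normal transaction and drop the cached figure for the affected day, and report reads hit the database only on a cache miss.

```python
import sqlite3


class SalesReports:
    """Daily sales totals backed by a SQL table, memoized in a plain dict."""

    def __init__(self, conn):
        self.conn = conn
        self._cache = {}   # day -> total in cents (the in-memory "hashmap")
        self.db_hits = 0   # counts report queries that actually reached the database

    def record_sale(self, day, amount_cents):
        # Transactional write: the connection context manager commits or rolls back.
        with self.conn:
            self.conn.execute(
                "INSERT INTO sales (day, amount_cents) VALUES (?, ?)",
                (day, amount_cents),
            )
        self._cache.pop(day, None)  # invalidate the cached total for that day

    def daily_total(self, day):
        if day not in self._cache:
            row = self.conn.execute(
                "SELECT COALESCE(SUM(amount_cents), 0) FROM sales WHERE day = ?",
                (day,),
            ).fetchone()
            self._cache[day] = row[0]
            self.db_hits += 1
        return self._cache[day]


def make_demo_db():
    """Create an in-memory database with the hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (day TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    return conn
```

With several application nodes, a per-process dict like this goes stale as soon as another node writes, so a real deployment would need shared invalidation or short expiry times.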
You can't handle the truth.