Slashdot Mirror


How Twitter Is Moving To the Cassandra Database

MyNoSQL has up an interview with Ryan King on how Twitter is transitioning to the Cassandra database. Here's some detailed background on Cassandra, which aims to "bring together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model." Before settling on Cassandra, the Twitter team looked into: "...HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable, and probably some others I'm forgetting. ... We're currently moving our largest (and most painful to maintain) table — the statuses table, which contains all tweets and retweets. ... Some side notes here about importing. We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast — it would saturate the backplane of our network. We've switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster." Relatedly, an anonymous reader notes that the upcoming NoSQL Live conference, which will take place in Boston March 11th, has announced their lineup of speakers and panelists including Ryan King and folks from LinkedIn, StumbleUpon, and Rackspace.
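
As a rough back-of-envelope reading of the figures quoted above (a sketch, not something stated in the interview): if the load would take about 7 hours when limited only by the cluster, and takes about a week when throttled, the import is running at roughly 1/24 of what the cluster could absorb. The variable names below are illustrative.

```python
# Back-of-envelope: how heavily is the bulk load throttled?
# The two durations are the approximate figures quoted in the summary.
HOURS_PER_DAY = 24

throttled_hours = 7 * HOURS_PER_DAY   # "about a week"
unthrottled_hours = 7                 # "about 7 hours" with infinite bandwidth

slowdown = throttled_hours / unthrottled_hours
fraction_of_capacity = 1 / slowdown

print(f"slowdown factor: {slowdown:.0f}x")
print(f"share of cluster write capacity used: {fraction_of_capacity:.1%}")
```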

26 of 157 comments

  1. Don't believe them! by smellsofbikes · · Score: 4, Funny
    They keep saying that the Cassandra database is better, but somehow I don't believe them. I can't imagine they know what they're talking about. Maybe in the long-term they'll be proven right but I really don't think they are. I don't know why, though...

    heh heh heh.

    --
    Nostalgia's not what it used to be.
  2. Cassandra, eh? by maugle · · Score: 4, Funny

    I hear Cassandra can even predict when disastrous system failures are going to occur! Unfortunately, for some reason nobody ever believes the warnings.

    1. Re:Cassandra, eh? by einhverfr · · Score: 2, Funny

      Especially when trojan horses are the cause of such a disaster....

      --

      LedgerSMB: Open source Accounting/ERP
  3. network issues? by QuietLagoon · · Score: 4, Insightful
    "We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast - it would saturate the backplane of our network."

    First time I have ever heard anyone say that a database was too fast. Maybe there are network problems that also need to be addressed.

    1. Re:network issues? by b0bby · · Score: 2, Insightful

      I know next to nothing about NoSQL, but what they're talking about there seems to be using BinaryMemtable for the one-time move of data. You can see that you wouldn't want to "saturate the backplane of our network" for several days while that completes, so they're using a slower method & throttling it. It will take a week to do the move, but everything else will keep working.

    2. Re:network issues? by Bill, Shooter of Bul · · Score: 2, Informative

      Yes and no. They are specifically talking about importing their data into Cassandra, which will be a one-time event and not worth upgrading the network bandwidth for. They need to throttle it to allow more time-sensitive traffic to use the bandwidth. The bandwidth to the database in normal use will be much, much less than the import bandwidth.

      --
      Well.. maybe. Or Maybe not. But Definitely not sort of.
    3. Re:network issues? by ryansking · · Score: 4, Informative

      If we're going to have to slow the system down, we'd rather use the standard interface, because that means the bulk loading doubles as a load test and the tools we build for it can be re-used for normal operations.
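
One way to picture the throttled load discussed in this sub-thread is a token-bucket rate limiter wrapped around an ordinary per-row write call. This is only a minimal sketch under stated assumptions, not Twitter's tooling: `TokenBucket`, `bulk_load`, and `write_row` are hypothetical names, and `write_row` stands in for whatever client call actually performs the insert.

```python
import time
from typing import Callable, Iterable


class TokenBucket:
    """Allow about `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            # Sleep long enough to earn the missing fraction of a token.
            self.sleep((1 - self.tokens) / self.rate)
            # Any extra credit from oversleeping is dropped (conservative).
            self.last = self.clock()
            self.tokens = 1.0
        self.tokens -= 1


def bulk_load(rows: Iterable[dict], write_row: Callable[[dict], None],
              bucket: TokenBucket) -> int:
    """Write every row through `write_row`, pacing the calls with `bucket`."""
    count = 0
    for row in rows:
        bucket.acquire()
        write_row(row)   # hypothetical stand-in for the real client call
        count += 1
    return count
```

Because the loader only sees a `write_row` callable, the same pacing code could sit in front of the normal write path, which is in the spirit of the point above about reusing bulk-loading tools for regular operations.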

  4. And this is front page news, why? by Lunix Nutcase · · Score: 2, Interesting

    Why is it that whenever twitter makes any random change to some part of its infrastructure that we need a front page story about it?

    1. Re:And this is front page news, why? by BarryJacobsen · · Score: 4, Funny

      Why is it that whenever twitter makes any random change to some part of its infrastructure that we need a front page story about it?

      Because the change prevented them from posting it to twitter.

    2. Re:And this is front page news, why? by Gruuk · · Score: 5, Insightful

      Scaling. If something turns out to be robust and fast enough for Twitter, it is definitely of interest to anyone working on significantly large and busy websites.

      --
      De gustibus et coloribus non est disputandum
    3. Re:And this is front page news, why? by Lunix Nutcase · · Score: 3, Insightful

      Yes, because twitter is the epitome of robustness and speed. Oh wait... Just in the 2 months of this year alone they've had something like 4 outages.

    4. Re:And this is front page news, why? by Monkeedude1212 · · Score: 2, Insightful

      I suppose then why would we care if any site made any random change to any part of its infrastructure?

      Twitter is a -very- busy site.

      They are changing their infrastructure to accommodate. Here's what they looked at, here is what they chose. If you are looking for something with equal performance, you don't have to shop around.

    5. Re:And this is front page news, why? by kriston · · Score: 4, Insightful

      No way. Their architecture is about as "best guess" engineering as Facebook. I don't think that's actually what engineering is. "Maybe this one will work?"

      In the meantime, I have not been able to update my avatar image on Twitter, and a TwitPic-like feature is still a faint glimmer in Twitter's amateur eyes. Speaking of missed opportunities, why drive so much traffic to Twitter parasites Bit.ly, TwitPic, TinyURL, Twitition, TwitLonger?

      What in the world are Twitter's engineers actually DOING should be the real question.

      --

      Kriston

    6. Re:And this is front page news, why? by u38cg · · Score: 2, Interesting

      Does Twitter really have loads which are more difficult to manage than, say, the BBC, CNN, Google, or Wikipedia? I would have thought serving up a fairly straightforward page, a stylesheet, a background image and the tweets or twits or whatever they're called can't be that difficult compared to, say, Facebook.

      --
      [FUCK BETA]
  5. pfffft twatter tweeter by roman_mir · · Score: 2, Insightful

    who cares what twuufter is running off.

    The more interesting aspect of all of this 'NoSQL' movement is the belief that achieving some speed improvement over some relational databases makes them so much better.

    If you don't really need a database to run your 'website', then who cares if you use flat files or an in memory hashmap for all your data needs? Databases are not being replaced by NoSQL in projects that need databases. The projects that may not have ever needed databases may benefit by this NoSQL idea, but if you actually need a database... well, you better be really good at working around all kinds of problems that this will create for you.

    I think that relational databases are good at what they do and that many projects may not need them, but if you do need them on the back end, you will end up with them on the back end. Of course there may be some caching/hashmaps/files on the front end, but at the back stuff will be sorted out within a real datastore.

    Is there really a huge issue with rdbms speeds? Well if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.

    1. Re:pfffft twatter tweeter by AndrewNeo · · Score: 4, Insightful

      I think their point is not everything needs an RDBMS, whereas before it was the 'go to' method of storing data.

    2. Re:pfffft twatter tweeter by azmodean+1 · · Score: 4, Interesting

      I think you're missing the point here: the problem with RDBMSs isn't that they are "slow" per se, which implies that they just need some good ol' fashioned optimization. The problem is that there is a cost associated with the data integrity guarantees they make (it usually appears as scalability bottlenecks rather than as pure computational inefficiencies), regardless of how good the implementation is, and if you don't need some of those guarantees, you can dispense with them and end up with better performance (again, this typically means better scalability). Additionally, this is the kind of bottleneck that you just can't throw more resources at. Sure, you can find the bottleneck and beef up that particular component to do more transactions/second, but at a certain point you've isolated the bottleneck on a world-class server that is doing nothing but that, and it's still a bottleneck. At that point (preferably long before you reach that point) you have to look at transitioning to an infrastructure that makes some kind of tradeoff that allows the removal of the bottleneck, which is what NoSQL does.

      I doubt Twitter wants very many RDBMS-type data coherency guarantees at all. 140-character text strings with a similarly-sized amount of metadata, and no real-time delivery guarantees? Sounds like their database can get pretty inconsistent without messing things up badly. It seems to me they would be well served by using a database that offers just what they want/need in that area and better performance.

      Oh and this:

      Is there really a huge issue with rdbms speeds?

      yes, and what are you smoking that you would even ask this question?

    3. Re:pfffft twatter tweeter by Abcd1234 · · Score: 4, Insightful

      Or: use the right tool for the job. The only difference is, now alternative tools actually exist.

    4. Re:pfffft twatter tweeter by roman_mir · · Score: 2, Insightful

      your question is answered in my post: google does not need a database for ACID properties.

      Can you complain much if in one location google gives you results that are very different for the same search query as for the same query in a different location at the same time? Well, if you do complain, you can ask google for your money back.

  6. Re:I'm Reluctant by binarylarry · · Score: 2, Insightful

    Twitter's only moving to this new database written in Java because everyone else is.

    --
    Mod me down, my New Earth Global Warmingist friends!
  7. Don't want to install Cassandra by einhverfr · · Score: 2, Funny

    I hear Cassandra is really a trojan. Can anyone verify? I don't want a trojan on my computer.....

    --

    LedgerSMB: Open source Accounting/ERP
  8. Twitter needs scalability experts by Heretic2 · · Score: 5, Interesting

    I love how ass backwards twitter has always been with learning how to scale their 90s infrastructure up. I remember when they called out the Ruby community because they didn't understand MySQL replication and memcached.

    I guess without a profit model they couldn't use a real RDBMS like Oracle. EFD (Enterprise Flash Drive) support anyone? 11g supports EFD on native SSD block-levels. Write scale? How about 1+ million transactions/sec on a single node Oracle DB using <$100K worth of equipment and licenses? Anyway, I've built HUGE databases for a long time, odds are most of you have interfaced with them. Just because it's free and open-source doesn't make it cheap.

    I love FOSS, don't get me wrong, but best-in-class is best-in-class. I only use FOSS when it happens to be best-in-class. I laugh at how none of the requirements included disaster recovery. No single point of failure does not preclude failing at every point simultaneously. EMP bomb at your primary datacenter anyone?

    1. Re:Twitter needs scalability experts by guru42101 · · Score: 2, Insightful

      I've never dealt with an EMP, but a more realistic threat with similar effects would be a hurricane or earthquake. I used to work at an international bank and we had to deal with both (offices in FL and CA). For the most part the best solution was to have an identical setup at another office and to make all applications available via VPN and/or web access. We had a separate pipe that was used only for backup data transfers. The DB transaction logs were written both locally and remotely. All user files saved to the server were immediately copied to the backup server. On several occasions the systems were tested due to black/brownouts. The users were sent home, where they could work just as effectively as at the office.

      Our general emergency plan for hurricanes (I worked at the FL office; we used the CA office as our backup) was to let the users go well in advance of the hurricane and switch CA to being our primary servers with FL as the backup. Once the users were settled, they could continue working from home. The only way we would be screwed is if a hurricane and earthquake happened simultaneously. At that point we'd have to restore VM backups on hardware located at the main corporate offices in NYC or Sydney.

    2. Re:Twitter needs scalability experts by ryansking · · Score: 2, Informative

      You're right, I failed to mention disaster recovery. It was something we looked at; it's just been a while since we went through the evaluation process, so I've forgotten a few things. We actually liked Cassandra for DR scenarios: the snapshot functionality makes backups relatively straightforward, plus multi-DC support will make operational continuity in the case of losing a whole DC a possibility.

  9. Re:Java / JVM Wins Again ... by codepunk · · Score: 3, Funny

    Until recently I thought the same way: I would never endorse a solution that involves Java. However, I recently came to the same realization that Sun did when they created it. Java is a fantastic way to oversell gobs of expensive hardware. I am a system administrator, so the more hardware it takes to run a solution the better off I am: more machines, more money, and better job security. So I have now fully jumped on the Java bandwagon. Java makes me smile.

    --


    Got Code?
  10. Re:Java / JVM Wins Again ... by zuperduperman · · Score: 2, Informative

    Sure - but I think the whole point is that you'd be smiling even more if they were using one of the modern & trendy dynamic languages because you'd likely have 2 - 3 times the amount of hardware to look after. I'm not sure what alternative you would propose that uses less hardware but there actually aren't many that are better than the JVM these days.