How Twitter Is Moving To the Cassandra Database

Posted by kdawson on Tuesday February 23, 2010 @06:55AM from the big-table-doesn't-capture-the-half-of-it dept.

MyNoSQL has up an interview with Ryan King on how Twitter is transitioning to the Cassandra database. Here's some detailed background on Cassandra, which aims to "bring together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model." Before settling on Cassandra, the Twitter team looked into: "...HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable, and probably some others I'm forgetting. ... We're currently moving our largest (and most painful to maintain) table — the statuses table, which contains all tweets and retweets. ... Some side notes here about importing. We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast — it would saturate the backplane of our network. We've switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster." Relatedly, an anonymous reader notes that the upcoming NoSQL Live conference, which will take place in Boston March 11th, has announced their lineup of speakers and panelists including Ryan King and folks from LinkedIn, StumbleUpon, and Rackspace.

157 comments

Don't believe them! by smellsofbikes · 2010-02-23 07:07 · Score: 4, Funny

They keep saying that the Cassandra database is better, but somehow I don't believe them. I can't imagine they know what they're talking about. Maybe in the long-term they'll be proven right but I really don't think they are. I don't know why, though...
heh heh heh.

--
Nostalgia's not what it used to be.
1. Re:Don't believe them! by Yvan256 · 2010-02-23 07:08 · Score: 0
  
  Do you have an ex-girlfriend called Cassandra, by any chance?
2. Re:Don't believe them! by Anonymous Coward · 2010-02-23 07:11 · Score: 0
  
  It's a reference to Greek mythology you idiot
3. Re:Don't believe them! by Push Latency · 2010-02-23 07:21 · Score: 1
  
  I took an axe to my last Cassandra cluster and feel quite better now.
4. Re:Don't believe them! by Anonymous Coward · 2010-02-23 07:34 · Score: 0
  
  Yeah and she told everyone about the GP's micropenis.
5. Re:Don't believe them! by sconeu · 2010-02-23 07:35 · Score: 1
  
  Damn... you beat me to it. I was going to say, "Cassandra? I don't believe it!"
  
  --
  General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
6. Re:Don't believe them! by mariushm · 2010-02-23 07:51 · Score: 1
  
  For some reason my mind went to Cassandra Crossing (http://en.wikipedia.org/wiki/The_Cassandra_Crossing)
7. Re:Don't believe them! by Anonymous Coward · 2010-02-23 07:55 · Score: 0
  
  Intelligence is not knowledge. - Einstein
8. Re:Don't believe them! by Anonymous Coward · 2010-02-23 08:06 · Score: 0
  
  GP still looks dumb.
9. Re:Don't believe them! by Anonymous Coward · 2010-02-23 08:50 · Score: 0
  
  And you still look like a douche.
10. Re:Don't believe them! by einhverfr · 2010-02-23 09:08 · Score: 1
  
  Drink a few beers. Read the Iliad. You'll feel better.
  (AJAX: When second-best is good enough. Or maybe AJAX is almost as good as ACHILLES.)
  
  --
  
  LedgerSMB: Open source Accounting/ERP
Cassandra, eh? by maugle · 2010-02-23 07:15 · Score: 4, Funny

I hear Cassandra can even predict when disastrous system failures are going to occur! Unfortunately, for some reason nobody ever believes the warnings.
1. Re:Cassandra, eh? by einhverfr · 2010-02-23 08:31 · Score: 2, Funny
  
  Especially when trojan horses are the cause of such a disaster....
  
  --
  
  LedgerSMB: Open source Accounting/ERP
2. Re:Cassandra, eh? by idontgno · 2010-02-23 08:41 · Score: 1
  
  And, of course, when the system failure strikes, Cassandra will be blamed, not the underlying issues Cassandra warned of.
  
  --
  Welcome to the Panopticon. Used to be a prison, now it's your home.
3. Re:Cassandra, eh? by einhverfr · 2010-02-23 08:45 · Score: 1
  
  Of course!
  Because Cassandra is a Trojan....
  
  --
  
  LedgerSMB: Open source Accounting/ERP
4. Re:Cassandra, eh? by Anonymous Coward · 2010-02-23 09:17 · Score: 0
  
  You mean the Fail Whale Watching component?
5. Re:Cassandra, eh? by Hurricane78 · 2010-02-23 10:04 · Score: 1
  
  Disastrous failure? Twitter? There’s at least one joke in there somewhere. ^^
  
  --
  Any sufficiently advanced intelligence is indistinguishable from stupidity.
hmmm by Anonymous Coward · 2010-02-23 07:19 · Score: 0

facebook uses cassandra, digg uses cassandra, twitter is moving to cassandra. Maybe in 5 years slashdot will get with it.
1. Re:hmmm by Anonymous Coward · 2010-02-23 07:29 · Score: 1, Insightful
  
  Maybe in 5 years slashdot will get with it.
  Do you realize how many years it took Slashdot to just remove their HTML table layout from Slashcode? I wouldn't bet on a major backend change for Slashdot, ever.
2. Re:hmmm by Anonymous Coward · 2010-02-23 07:33 · Score: 0
  
  facebook uses cassandra
  Bye bye Twitter, it was nice knowing you.
  I'd rather not get tweets from last week showing up as "latest".
3. Re:hmmm by clarkkent09 · 2010-02-23 07:58 · Score: 1
  
  Yeah, but in those cases if something horrible happens and all data gets deleted, nothing of value will be lost, whereas with slashdot........ok never mind
  
  --
  Negative moral value of force outweighs the positive value of good intentions.
network issues? by QuietLagoon · 2010-02-23 07:26 · Score: 4, Insightful

We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast: it would saturate the backplane of our network.
First time I have ever heard anyone say that a database was too fast. Maybe there are network problems that also need to be addressed.
1. Re:network issues? by Anonymous Coward · 2010-02-23 08:17 · Score: 0
  
  Yeah, suck it up. why discount it for that reason? Sounds like there's real room for growth down that path.
2. Re:network issues? by KermodeBear · 2010-02-23 08:43 · Score: 1
  
  Could someone explain to me why this kind of speed would be a problem? It seems to me that if BinaryMemtable is so incredibly fast that other things become a bottleneck, then you're in a great position. You have something very fast for storing and retrieving data - you just need to get bigger, faster pipes.
  
  --
  Love sees no species.
3. Re:network issues? by b0bby · 2010-02-23 08:50 · Score: 2, Insightful
  
  I know next to nothing about NoSQL, but what they're talking about there seems to be using BinaryMemtable for the one-time move of data. You can see that you wouldn't want to "saturate the backplane of our network" for several days while that completes, so they're using a slower method & throttling it. It will take a week to do the move, but everything else will keep working.
4. Re:network issues? by Anonymous Coward · 2010-02-23 08:54 · Score: 1, Funny
  
  you just need to get bigger, faster pipes
  That's what she said!
5. Re:network issues? by SanityInAnarchy · 2010-02-23 09:32 · Score: 1
  
  I'm surprised they didn't use the faster method and throttle it.
  
  --
  Don't thank God, thank a doctor!
6. Re:network issues? by Bill, Shooter of Bul · 2010-02-23 09:33 · Score: 2, Informative
  
  Yes and no. They are specifically talking about importing their data into Cassandra, which will be a one-time event and not worth upgrading the network bandwidth for. They need to throttle it to allow more time-sensitive traffic to use the bandwidth. The bandwidth to the database in normal use will be much, much less than the import bandwidth.
  
  --
  Well.. maybe. Or Maybe not. But Definitely not sort of.
7. Re:network issues? by KermodeBear · 2010-02-23 09:53 · Score: 1
  
  Ah, that makes sense. For some reason I thought they were talking about general usage. Thanks for clearing that up. (o:
  
  --
  Love sees no species.
8. Re:network issues? by ryansking · 2010-02-23 09:55 · Score: 4, Informative
  
  If we're going to have to slow the system down, we'd rather use the standard interface, because that means the bulk loading doubles as a load test and the tools we build for it can be re-used for normal operations.
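The throttling discussed in this thread can be sketched as a token bucket that paces a bulk load to a fixed byte rate, so other traffic keeps its share of the network. This is an illustration only, not Twitter's loader: `TokenBucket`, `bulk_load`, `write_batch` and `slowdown` are hypothetical names, and no Cassandra or Thrift API is used. `slowdown` simply restates the story's figures of about a week throttled versus about 7 hours with unlimited bandwidth.

```python
import time


class TokenBucket:
    """Allow at most `rate` units per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so the pacing can be tested
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, amount):
        """Block until `amount` tokens are available, then spend them."""
        if amount > self.capacity:
            raise ValueError("amount exceeds bucket capacity")
        while True:
            self._refill()
            if self.tokens >= amount:
                self.tokens -= amount
                return
            # Sleep just long enough for the missing tokens to accumulate.
            self.sleep((amount - self.tokens) / self.rate)


def bulk_load(batches, write_batch, bucket):
    """Pass each batch to `write_batch`, paced by `bucket` (units are bytes)."""
    sent = 0
    for batch in batches:
        bucket.acquire(len(batch))
        write_batch(batch)
        sent += len(batch)
    return sent


def slowdown(throttled_hours, unthrottled_hours):
    """How many times longer the throttled load takes."""
    return throttled_hours / unthrottled_hours
```

With the figures quoted in the story, `slowdown(7 * 24, 7)` is 24, so the week-long import runs at roughly 1/24 of what the cluster could absorb.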
9. Re:network issues? by SanityInAnarchy · 2010-02-23 10:28 · Score: 1
  
  That actually makes a lot of sense. Thanks!
  
  --
  Don't thank God, thank a doctor!
10. Re:network issues? by geniusj · 2010-02-23 12:26 · Score: 1
  
  I haven't checked, but I'd bet that BinaryMemtable uses UDP, which, combined with the high speed, could easily cause significant network saturation.
11. Re:network issues? by QuietLagoon · 2010-02-23 14:07 · Score: 1
  
  OK, that additional information helps. It seems appropriate. Thanks for the follow-up.
12. Re:network issues? by magus_melchior · 2010-02-24 08:06 · Score: 1
  
  I was thinking that myself; if their backplane is being saturated, surely there's a way to throttle the import process using the datacenter network hardware, QoS, or something similar? For that matter, why don't they have a redundant network so that the production net isn't impacted by datacenter ops (I know, I know... cost)?
  
  --
  "We are Microsoft. You shall be assimilated. Competition is futile."
Huzzzah! by tthomas48 · 2010-02-23 07:28 · Score: 1

I look forward to a brand new twitter that randomly doesn't display expected data and sometimes doesn't take my status updates!
1. Re:Huzzzah! by Target Practice · 2010-02-23 08:59 · Score: 1, Flamebait
  
  I know you're being sarcastic, but I think some of us around here really do look forward to a non-functioning twitter. Maybe, if it's down long enough, everyone will take a step back and realize what a complete tool they've been, telling the world how their last coffee was, where the Best Place to Buy Things is, or some other third thing equally mundane and self serving.
  Here's to them royally screwing up!
  
  --
  There's a 68.71% chance you're right.
2. Re:Huzzzah! by badboy_tw2002 · 2010-02-23 14:02 · Score: 1
  
  Amazingly I've never gone to twitter or signed up for it, and somehow have never been bothered by it. Go figure!
3. Re:Huzzzah! by Anonymous Coward · 2010-02-23 16:40 · Score: 0
  
  I had a lame ass college professor who mandated using Twitter despite having a web classroom interface for his course. Of course, none of the tweets were interesting or relevant to the course. Great.
And this is front page news, why? by Lunix Nutcase · 2010-02-23 07:36 · Score: 2, Interesting

Why is it that whenever twitter makes any random change to some part of its infrastructure that we need a front page story about it?
1. Re:And this is front page news, why? by BarryJacobsen · 2010-02-23 07:44 · Score: 4, Funny
  
  Why is it that whenever twitter makes any random change to some part of its infrastructure that we need a front page story about it?
  Because the change prevented them from posting it to twitter.
  
  --
  Track your TV Shows with your iPhone - FREE
2. Re:And this is front page news, why? by Gruuk · 2010-02-23 07:44 · Score: 5, Insightful
  
  Scaling. If something turns out to be robust and fast enough for Twitter, it is definitely of interest to anyone working on significantly large and busy websites.
  
  --
  De gustibus et coloribus non est disputandum
3. Re:And this is front page news, why? by TheTyrannyOfForcedRe · 2010-02-23 07:45 · Score: 1
  
  It's interesting because Twitter is one of the Big Guys and it's cool to know what the Big Guys are up to. Also, a lot of us maintain Twitter based websites and/or apps.
  
  --
  "Liechtenstein is the world's largest producer of sausage casings, potassium storage units, and false teeth."
4. Re:And this is front page news, why? by Lunix Nutcase · 2010-02-23 07:56 · Score: 3, Insightful
  
  Yes, because twitter is the epitome of robustness and speed. Oh wait... Just in the 2 months of this year alone they've had something like 4 outages.
5. Re:And this is front page news, why? by Gruuk · 2010-02-23 08:27 · Score: 1
  
  Which is exactly why it would be huge and relevant news if there was something that could make Twitter run way better. It's a perfect example, as it is a very well-known website, with very well-known problems related to scalability.
  Thank you for helping me prove the point, by the way, that was mighty kind of you.
  
  --
  De gustibus et coloribus non est disputandum
6. Re:And this is front page news, why? by Monkeedude1212 · 2010-02-23 08:33 · Score: 2, Insightful
  
  I suppose then why would we care if any site made any random change to any part of its infrastructure?
  Twitter is a -very- busy site.
  They are changing their infrastructure to accommodate. Here's what they looked at, here is what they chose. If you are looking for something with equal performance, you don't have to shop around.
7. Re:And this is front page news, why? by kriston · 2010-02-23 08:53 · Score: 4, Insightful
  
  No way. Their architecture is about as "best guess" engineering as Facebook. I don't think that's actually what engineering is. "Maybe this one will work?"
  In the meantime, I have not been able to update my avatar image on Twitter, and a TwitPic-like feature is still a faint glimmer in Twitter's amateur eyes. Speaking of missed opportunities, why drive so much traffic to Twitter parasites Bit.ly, TwitPic, TinyURL, Twitition, TwitLonger?
  What in the world are Twitter's engineers actually DOING should be the real question.
  
  --
  Kriston
8. Re:And this is front page news, why? by e2d2 · 2010-02-23 08:57 · Score: 1
  
  Which is exactly why developers need to pay attention - So we can avoid these mistakes ourselves.
9. Re:And this is front page news, why? by Lunix Nutcase · 2010-02-23 08:58 · Score: 1
  
  So instead of just seeing what an actual robust and fast site uses we should instead follow Twitter, which switches at whim to whatever is the technology du jour while still being unstable and slow?
10. Re:And this is front page news, why? by Anonymous Coward · 2010-02-23 09:29 · Score: 0
  
  You're the perfect example of what a morass of stupidity Slashdot has become. Thanks for reminding me why I stopped coming here years ago.
11. Re:And this is front page news, why? by u38cg · 2010-02-23 09:42 · Score: 2, Interesting
  
  Does Twitter really have loads which are more difficult to manage than, say, the BBC, CNN, Google, or Wikipedia? I would have thought serving up a fairly straightforward page, a stylesheet, a background image and the tweets or twits or whatever they're called can't be that difficult compared to, say, Facebook.
  
  --
  [FUCK BETA]
12. Re:And this is front page news, why? by roman_mir · 2010-02-23 09:53 · Score: 1
  
  Twwweeeeter can also probably generate static pages just as well on some large node and then push them to web servers, that just might have worked better for them.
  Do they really need dynamic pages at all or could they live with something that's regenerated every 10 minutes? Just saying.
  
  --
  You can't handle the truth.
13. Re:And this is front page news, why? by theshowmecanuck · 2010-02-23 10:53 · Score: 1
  
  Or is this possibly a case where people are attempting to use technology as a silver bullet against bad design? Your point cannot be made unless we know that they have a good design to begin with and the reason for the outages lies specifically with the database technology. Sometimes the overlooked problem is with bad or dogmatic coding (i.e. too concerned with good form or over-using patterns, and not enough with performance or the 'KISS' principle), and not with the server or hardware technology. Chances are their code is good, but we don't know that. So your premise is not valid.
  
  --
  -- I ignore anonymous replies to my comments and postings.
14. Re:And this is front page news, why? by DragonWriter · 2010-02-23 11:37 · Score: 1
  
  Why is it that whenever twitter makes any random change to some part of its infrastructure that we need a front page story about it?
  Because in some areas Twitter is at an extreme of scale, so what they are doing to deal with that extreme of scale (even if it isn't necessarily always the ideal choice) is usually interesting since, if you are looking for things that have been done in production to deal with the kind of scaling they experience, there aren't a lot of other data points to find.
15. Re:And this is front page news, why? by DragonWriter · 2010-02-23 11:51 · Score: 1
  
  Does Twitter really have loads which are more difficult to manage than, say, the BBC, CNN, Google, or Wikipedia?
  (1) In some measures, probably;
  (2) When Google or Wikipedia makes announcements about technology (whether it's a "change" or not) they use in their backend, that's usually a front-page story on Slashdot, too. The BBC and CNN don't, AFAIK, tend to make big public announcements about back-end technology.
  
  I would have thought serving up a fairly straightforward page, a stylesheet, a background image and the tweets or twits or whatever they're called can't be that difficult compared to, say, Facebook.
  Processing the tweets is the big scale issue at Twitter, AFAIK, and while Facebook does something similar with its status updates, ISTR that the scale at Twitter is bigger. But it's not really a big issue either way, as when Facebook talks about their technology backend, that also gets attention from Slashdot.
16. Re:And this is front page news, why? by turbidostato · 2010-02-23 12:21 · Score: 1
  
  Thanks for reminding me why I stopped coming here years ago.
  Are you the Ghost of Slashdot Past per chance?
17. Re:And this is front page news, why? by Anonymous Coward · 2010-02-23 13:27 · Score: 0
  
  And nothing of value was lost. Sorry, it has to be said.
18. Re:And this is front page news, why? by jmcvetta · 2010-02-23 13:52 · Score: 1
  
  Most high-traffic sites are serving content, primarily database reads. Twitter has incoming tweets, on top of outgoing content. Not many sites have as much concurrent read and write activity as Twitter. Amazon & Facebook, sure -- but if they do major infrastructure upgrades, they will also get a front page story.
  Can you suggest some other very-high-traffic sites whose infrastructure might be better (or more interesting) than Twitter?
19. Re:And this is front page news, why? by Xest · 2010-02-23 21:11 · Score: 1
  
  Well that's actually why I like this news.
  I like to think of Twitter's technology experiments as how not to build and run a high performance web application. Hell, they brought us confirmation that Ruby on Rails wasn't exactly ready for prime time in terms of high performance work, for example.
  We have a lot to thank them for, but you're right, one of those things is not how to run a stable, secure, scalable web site, it is the opposite- how not to. I suspect before long we'll be able to see for ourselves how well, or how badly Cassandra does, measured by the increase or decrease in failwhales.
20. Re:And this is front page news, why? by Anonymous Coward · 2010-02-24 02:07 · Score: 0
  
  @Lunix Nutcase: We rarely see what the sites that work well are using posted here or anywhere else for that matter. Until we do, Twitter
21. Re:And this is front page news, why? by Gruuk · 2010-02-24 02:25 · Score: 1
  
  You're cute when you confuse the problem and the solution.
  Or you're a bit slow. Either way, rock on, you sweet, sweet kid.
  
  --
  De gustibus et coloribus non est disputandum
22. Re:And this is front page news, why? by haruchai · 2010-02-24 11:48 · Score: 1
  
  That may not be what actual engineering is but that describes a lot of software "engineering"
  
  --
  Pain is merely failure leaving the body
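roman_mir's suggestion in this thread, regenerating pages every 10 minutes instead of building them on every hit, can be sketched as a small time-based cache. This is a minimal illustration under stated assumptions, not a description of Twitter's stack: `PageCache` and `render` are hypothetical names, and this version re-renders lazily on the first request after expiry rather than pushing static files out to web servers as the comment proposes.

```python
import time


class PageCache:
    """Serve a rendered page from memory, re-rendering at most once per `ttl` seconds."""

    def __init__(self, render, ttl=600.0, clock=time.monotonic):
        self.render = render   # hypothetical callable: key -> page text
        self.ttl = ttl         # 600 seconds = the 10 minutes from the comment
        self.clock = clock     # injectable so expiry can be tested without waiting
        self._pages = {}       # key -> (rendered_at, page)

    def get(self, key):
        now = self.clock()
        hit = self._pages.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]      # still fresh: no rendering work at all
        page = self.render(key)
        self._pages[key] = (now, page)
        return page
```

The trade-off is the one the thread argues about: readers may see a page up to `ttl` seconds stale, which only works if the application tolerates that.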
pfffft twatter tweeter by roman_mir · 2010-02-23 07:45 · Score: 2, Insightful

who cares what twuufter is running off.
The more interesting aspect of all of this 'NoSQL' movement is how they believe that if they achieve some speed improvement against some relational databases, how that makes them so much better.
If you don't really need a database to run your 'website', then who cares if you use flat files or an in memory hashmap for all your data needs? Databases are not being replaced by NoSQL in projects that need databases. The projects that may not have ever needed databases may benefit by this NoSQL idea, but if you actually need a database... well, you better be really good at working around all kinds of problems that this will create for you.
I think that relational databases are good at what they do and that many projects may not need them, but if you do need them on the back end, you will end up with them on the back end. Of course there may be some caching/hashmaps/files on the front end, but at the back stuff will be sorted out within a real datastore.
Is there really a huge issue with rdbms speeds? Well if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.

--
You can't handle the truth.
1. Re:pfffft twatter tweeter by codepunk · 2010-02-23 07:52 · Score: 1
  
  Is there really a huge issue with rdbms speeds? I don't know, perhaps you should pose that question to Google, for instance.
  
  --
  
  Got Code?
2. Re:pfffft twatter tweeter by AndrewNeo · 2010-02-23 07:56 · Score: 4, Insightful
  
  I think their point is not everything needs an RDBMS, whereas before it was the 'go to' method of storing data.
3. Re:pfffft twatter tweeter by Anonymous Coward · 2010-02-23 08:04 · Score: 1, Insightful
  
  Is there really a huge issue with rdbms speeds? Well if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.
  Surely that's the point. It isn't possible to practically scale RDBMSs up to the sort of scale you need for a huge website such as Amazon. The requirement to continue to meet all of the constraints of the relational model makes it very hard to split databases over a large cluster without a lock-bound hell. There are two solutions to this - either you spend a vast amount of effort trying to get the relational model to scale a bit, or you bite the bullet and relax the relational model's constraints.
  Don't get me wrong - there are good reasons why the relational model has constraints in the data model to ensure ACID qualities. However beyond a certain point it is easier to deal with the problems that come from using a different model than it is to stretch a conventional RDBMs and deal with the problems of keeping multiple distributed copies of data consistent.
  Take the collection of user reviews and product pictures on a large site like Amazon. Does this need the analytical power of a RDBMS? No. Does it need something a lot more advanced than "flat files or an in memory hash-map" in order to scale to heavy loads across multiple continents? Yes. That's the sort of thing NoSQL databases are working on.
  In general your attitude reminds me of the people who thought personal computers would always be toys. "Proper work" would be done on mainframes/supercomputers and trivial office tasks may as well be done on paper. Well, mainframes / supercomputers are still faster than personal computers, but few people would claim the PC had no impact on the office.
4. Re:pfffft twatter tweeter by azmodean+1 · 2010-02-23 08:08 · Score: 4, Interesting
  
  I think you're missing the point here, the problem with RDBMSs isn't that they are "slow" per se, which implies that they just need some good ol' fashioned optimization. The problem is that there is a cost associated with the data integrity guarantees they make (usually appears in scalability bottlenecks rather than in pure computational inefficiencies), regardless of how good the implementation is, and if you don't need some of those guarantees, you can dispense with them and end up with better performance (again, this typically means better scalability). Additionally, this is the kind of bottleneck that you just can't throw more resources at. Sure you can find the bottleneck and beef up that particular component to do more transactions/second, but at a certain point you've isolated the bottleneck on a world-class server that is doing nothing but that, and it's still a bottleneck. At that point (preferably long before you reach that point) you have to look at transitioning to an infrastructure that makes some kind of tradeoff that allows the removal of the bottleneck, which is what NoSQL does.
  I doubt Twitter wants very many RDBMS-type data coherency guarantees at all. 140-character text strings with a similarly-sized amount of metadata, and no real-time delivery guarantees? Sounds like their database can get pretty inconsistent without messing things up badly. It seems to me they would be well served by using a database that offers just what they want/need in that area and better performance.
  Oh and this:
  
  Is there really a huge issue with rdbms speeds?
  yes, and what are you smoking that you would even ask this question?
5. Re:pfffft twatter tweeter by Abcd1234 · 2010-02-23 08:51 · Score: 4, Insightful
  
  Or: use the right tool for the job. The only difference is, now alternative tools actually exist.
6. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 08:52 · Score: 2, Insightful
  
  your question is answered in my post: google does not need a database for ACID properties.
  Can you complain much if in one location google gives you results that are very different for the same search query as for the same query in a different location at the same time? Well, if you do complain, you can ask google for your money back.
  
  --
  You can't handle the truth.
7. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 08:56 · Score: 1
  
  As I said, there are projects and then there are projects. Tweater is not the project that requires any real database in the first place, who cares if a commit is transactional there?
  As for your last comment: problems with database performance are all about design. You think NoSQL will not hit the same roadblocks in projects that don't do design right? What are they going to move to when that one fails? NoNoSQL++?
  
  --
  You can't handle the truth.
8. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 09:04 · Score: 0, Flamebait
  
  In general your attitude reminds me of the people who thought personal computers would always be toys. "Proper work" would be done on mainframes/supercomputers and trivial office tasks may as well be done on paper. Well, mainframes / supercomputers are still faster than personal computers, but few people would claim the PC had no impact on the office.
  - in general your attitude reminds me of everyone who ever thought that the latest fad is the silver bullet that will absolve them of any responsibility for a hacky design and kludgy implementation, all sprinkled with hair bossy management attitude. Good luck with your new silver bullet, hope you kill the vampire of incompetence.
  
  --
  You can't handle the truth.
9. Re:pfffft twatter tweeter by einhverfr · 2010-02-23 09:22 · Score: 1
  
  I am going to add a few other things here. The first is that "not possible to scale" is not really accurate. I believe there are ways to design structures so that write capacity on an RDBMS can scale upward with the nodes on the network. Of course this only works for some types of applications (the approach I have in mind would work with Twitter, for example). And even with Amazon, you would CERTAINLY want RI on purchases even if you don't care about reviews.
  However, the larger point is that an RDBMS is a tool which is useful for certain types of applications and not others. For example, managing the financial data on product purchases at Amazon is going to be integrity critical but it still has to work and perform well all the time. Reviews, OTOH, won't but integrity problems add to the load of work for the tech support guys.
  So there are solutions to this problem which involve middleware, but even in Amazon's case unless you want to cobble together something with baling twine and duct tape, the RDBMS is going to likely be the go-to solution there. With twitter? Not so much.
  
  --
  
  LedgerSMB: Open source Accounting/ERP
10. Re:pfffft twatter tweeter by tokul · 2010-02-23 09:26 · Score: 1
  
  Is there really a huge issue with rdbms speeds? Well if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.
  
  WW2 and Korea called.
  Is there really a huge issue with those propeller plane speeds? We can always speed them up, right? The fastest prop planes reach 850-870 km/h. The Me 262 reached 900 km/h. The MiG-15 went to 1075 km/h.
  If other tools are faster and better than an rdbms, then why should people waste their time with the slower option?
11. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 09:32 · Score: 1
  
  you really should have left that comment to the 'bad analogy guy', he could make it sound good.
  
  If other tools are faster and better than an rdbms, then why should people waste their time with the slower option?
  - faster and better, ha? So you don't really mind if your bank switches from its datastore to a 'faster and better' NoSQL system, whatever the latest fad name is? I mean what's a few dollars not rolling back in a transaction that fails when your employment check is deposited?
  Propeller would have been a file in a makeshift file system; we use jet engines now for large commercial aircraft. You don't see ramjets or rocket boosters on those, though, do you?
  
  --
  You can't handle the truth.
12. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 09:43 · Score: 1, Insightful
  
  You know, the truth is, most data is still stored in individual files, not in databases. So RDBMSs were always a very niche thing used for projects because they are understood and it's easier to develop for them if you really have massive data requirements.
  Files - that's what many projects even today use, not databases. This is basically what they are going back to - files with whatever window dressing on top - a facade of hashes, it's all key/value pairs. It is, my friends, the old old idea of property files.
  I mean, really, I wrote a system in August that uses property files for storage as a database. Property file as a database - because it works. But that's a storage method. So in the NoSQL space they also do clustering by replication across nodes, but it does not really matter much if the data is all the same on all nodes.
  But you can do the same with an RDBMS, really, you can skip the principles of ACID and replicate across nodes and hope that it's good enough. Maybe the implementation for things like 'Cassandra' allows faster replication than what is normally done in an RDBMS, but just you wait and see how the RDBMSs of tomorrow provide a few flags to do the same thing in some 'partial ACID mode' with quick replication.
  This is intended for applications that do not really care about consistency of data - Google does not care. Twewter does not care. Amazon has to jump through more hoops I am sure than Tweeter, because real money is involved.
  
  --
  You can't handle the truth.
13. Re:pfffft twatter tweeter by Anonymous Coward · 2010-02-23 09:56 · Score: 0
  
  In principle, a database itself has no cost associated with integrity compared to Cassandra or the others. If you do away with foreign keys, the only "slowdown" would be due to primary/unique key constraints, which *any* map type storage will incur, because checking uniqueness is O(1) if you're indexing at the same time. Now, there *is* a cost associated with transactional integrity, but that is a latency and not a throughput problem. To simplify, if you require transactional integrity, you need to flush and thus wait for the seek and the platter to rotate to wherever the data needs to be written. It matters little if you're committing 1 or 1000 transactions: once you're there the disk bandwidth can take it; it's getting there that is the issue, and that means latency (ignoring SSDs).
  Why does it matter then?
  Because every single DB interface in existence is synchronous. So while the DB can handle 1M TXs a second, that would require 10000 threads on the web side, each working for a negligible amount of time and waiting for 10ms. And that doesn't work.
  Since they can't fix that, the only option is to have the request complete in .1 ms. Then you only need 100 threads. But then you need to do away with transactional integrity. They decided they could do away with that. Fair enough. But the problem there is not RDBMS themselves, it's the sorry APIs and drivers we have to work with.
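  The thread counts above follow from Little's law (requests in flight = throughput x time each request waits). A minimal sketch of that arithmetic, using the figures from this comment; the function name is made up for illustration:

```python
def threads_needed(tx_per_second: float, latency_seconds: float) -> float:
    """Little's law: requests in flight = arrival rate * time each one waits."""
    return tx_per_second * latency_seconds


# Figures from the comment: 1M TX/s, each call blocking for 10 ms vs. 0.1 ms.
blocking = threads_needed(1_000_000, 0.010)
fast = threads_needed(1_000_000, 0.0001)
print(round(blocking), round(fast))
```

  With a synchronous driver every in-flight request pins a thread, which is why the 10 ms case is the one that hurts.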
14. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 09:59 · Score: 0
  
  Let me put it this way, it will make it perfectly clear: if twoofter is regenerating every page on every hit and they are running into issues with speed, then their problem is not their data storage, it's their design. Now that it is clear they don't care about data consistency, I have the solution for them.
  They just need to regenerate the pages once in a few minutes on some large node and then push the static content to their webservers. Done. And that's why they sometimes pay me the big bucks :) to think of the obvious.
  
  --
  You can't handle the truth.
15. Re:pfffft twatter tweeter by Knowbuddy · 2010-02-23 10:12 · Score: 1
  
  I don't think you understand the niche that NoSQL databases are trying to fill.
  
  The more interesting aspect of all of this 'NoSQL' movement is how they believe that if they achieve some speed improvement against some relational databases, how that makes them so much better.
  It's not a black and white, panacea-type situation. Relational databases are good at some things, non-relational databases are good at others. Where non-relational databases are better is at solving very specific problems, many of which happen to map directly to the needs of web developers.
  A Viper is a fun car to take you to and from work, but it's probably not the best to shuttle around a little league baseball team--that's what minivans are for. (Whether the Viper is the relational or non-relational database in the analogy is up to you.)
  I teach a course titled Advanced Database Concepts, so I'll give you the same example I give my students: blogs. It's the sort of canonical example--I didn't make it up.
  To show a blog's home page, you need a list of recent posts. Each post is probably associated with a category, maybe some tags, and an author. Just to get that data, you're looking at joining 3 tables: Posts, Categories, and Users. What if you want a comment count? That's another join, and the query just got hairier--do you do a simple aggregation (join then group), or do you see that might be inefficient and so transform it into a harder-to-read-but-more-efficient subquery? That might even involve a fifth join, if you have registered user accounts and avatars for your commenters.
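  For the curious, here is roughly what that home-page query looks like. This is a minimal sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration, and the comment count uses the subquery variant rather than join-then-group:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE categories (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT,
                    author_id INTEGER REFERENCES users(id),
                    category_id INTEGER REFERENCES categories(id));
CREATE TABLE comments (id INTEGER PRIMARY KEY,
                       post_id INTEGER REFERENCES posts(id), body TEXT);
INSERT INTO users VALUES (1, 'alice');
INSERT INTO categories VALUES (1, 'databases');
INSERT INTO posts VALUES (1, 'Hello NoSQL', 1, 1), (2, 'Joins', 1, 1);
INSERT INTO comments VALUES (1, 1, 'first'), (2, 1, 'second');
""")

# Posts joined to author and category, plus a correlated subquery for the count.
HOME_PAGE_SQL = """
SELECT p.title, u.name, c.title,
       (SELECT COUNT(*) FROM comments cm WHERE cm.post_id = p.id)
FROM posts p
JOIN users u ON u.id = p.author_id
JOIN categories c ON c.id = p.category_id
ORDER BY p.id DESC
"""
home_page = conn.execute(HOME_PAGE_SQL).fetchall()
print(home_page)
```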
  All of which is fine and good until you're running LiveJournal or WordPress.com and you have millions of bloggers generating hundreds of millions of posts and who knows how many comments. With beefy machines and proper indexes you're probably okay ... but I wouldn't want to be the DBA who had to tell management that a new column needed to be added to any of those tables.
  Enter NoSQL/non-relational databases: why not fetch everything with just one query? (I'd show you some JSON, as that's what many of the NoSQL databases speak, but the /. filter considers it too much junk.) You put your comments in the same document as your posts, and the replies to those comments in child arrays, and the user info right inside the comments. If your users can't change their username, this isn't a bad solution. There are other tricks, but the point is that you reduce everything down to a single denormalized query.
  This design makes it trivially easy to build data-driven web pages, as effectively every web language has a JSON deserializer. No ORM impedance mismatch, and you get horizontal scalability pretty much for free.
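  A sketch of the kind of denormalized document described above, assuming a plain dict as a stand-in for a document store. The field names and the `fetch_post` helper are hypothetical, not any particular product's API:

```python
import json

# One document holds the post, its author, its comments, and their replies.
post_document = {
    "title": "Hello NoSQL",
    "category": "databases",
    "tags": ["nosql", "blogs"],
    "author": {"username": "alice"},
    "comments": [
        {
            "user": {"username": "bob", "avatar": "bob.png"},
            "body": "first",
            "replies": [{"user": {"username": "alice"}, "body": "thanks"}],
        },
    ],
}

# A dict of JSON strings stands in for the document store.
store = {"post:1": json.dumps(post_document)}


def fetch_post(key):
    """One lookup returns everything the page needs; no joins."""
    return json.loads(store[key])


def count_comments(comments):
    """Count comments and nested replies by walking the child arrays."""
    return sum(1 + count_comments(c.get("replies", [])) for c in comments)


page = fetch_post("post:1")
print(page["title"], count_comments(page["comments"]))
```

  The catch is the one already mentioned: the username is copied into every comment, so a rename means rewriting every document that embeds it.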
  
  If you don't really need a database to run your 'website', then who cares if you use flat files or an in memory hashmap for all your data needs?
  Because it's still a database, even if it's non-relational. You're still doing inserts and updates and deletes, you just get a nice hunk of denormalized clay to play with instead of the normalized rigidity of Tinker Toys.
  
  I think that relational databases are good at what they do and that many projects may not need them, but if you do need them on the back end, you will end up with them on the back end.
  But that's the point I think you're missing: until relatively recently, relational databases were the only game in town. Relational databases are ubiquitous because they solved the problems of the 60s-90s. They aren't going anywhere, as those types of problems (financial, transactional, etc) aren't going anywhere. But now we have a relatively new class of problems (graphs, etc) that need to be nailed down just as thoroughly. Many web applications are straining to fit within the relational model, and this explosion of NoSQL software is because people are realizing that all that straining c
16. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 10:24 · Score: 1
  
  Freedom of choice, definitely. I had projects just recently I used property files as a database - inserts, deletes, updates, all in a property file. Easy enough because it is just a hash map. You don't impress me with any of it, it's not in any way new first of all, but it does not replace any RDBMS where RDBMS is needed.
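  A minimal sketch of what "a property file as a database" can look like: a hash map in memory that is rewritten to a key=value file on every change. The class name and file layout are assumptions for illustration, and there is no locking, escaping, or crash safety:

```python
import os
import tempfile


class PropertyFileStore:
    """Tiny key=value store: a dict in memory, rewritten to disk on change."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if line and not line.startswith("#"):
                        key, _, value = line.partition("=")
                        self.data[key] = value

    def _flush(self):
        # Rewrites the whole file; no locking, atomic rename, or escaping.
        with open(self.path, "w", encoding="utf-8") as f:
            for key, value in self.data.items():
                f.write(f"{key}={value}\n")

    def put(self, key, value):
        """Insert or update a key."""
        self.data[key] = value
        self._flush()

    def delete(self, key):
        self.data.pop(key, None)
        self._flush()

    def get(self, key, default=None):
        return self.data.get(key, default)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "app.properties")
    store = PropertyFileStore(path)
    store.put("user.1.name", "alice")
    store.put("user.1.name", "bob")      # update
    store.put("user.2.name", "carol")
    store.delete("user.2.name")
    reloaded = PropertyFileStore(path)   # a fresh instance reads it back from disk
    print(reloaded.data)
```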
  My entire point is that Twooter never needed an RDBMS in the first place. They should be just fine without any database usage on the front end, and forget about JSON. The problem with them, if they can't scale right now, is that they don't do design, they just jumped on a silver bullet train. Tweuter can just as well serve static content, that's as fast as you can get. The design is obvious - generate static content and periodically replace it on the front end web servers. Done, no need for anything else. Who cares what's on the back end? They never needed an RDBMS, you see, that's my point. Just like google. To your point - we always had the choice, it's all the same stuff in different wrapping, so fine, who fights that?
  
  --
  You can't handle the truth.
17. Re:pfffft twatter tweeter by DragonWriter · 2010-02-23 11:47 · Score: 1
  
  I think their point is not everything needs an RDBMS, whereas before it was the 'go to' method of storing data.
  Except, of course, that it never was the "go to" method of storing data. There was no point in history where RDBMS's were anywhere close to the exclusive method of persisting data. Non-relational document-oriented storage has pretty much always dominated in the era in which relational databases existed, whether it was proprietary binary document formats, fairly direct text-based document formats, or highly structured (XML, etc.) text-based document formats.
18. Re:pfffft twatter tweeter by Anonymous Coward · 2010-02-23 12:00 · Score: 0
  
  i think you mean the "werewolf"
19. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 12:19 · Score: 1
  
  you don't think I know what I mean?
  
  --
  You can't handle the truth.
20. Re:pfffft twatter tweeter by Doomdark · 2010-02-23 12:38 · Score: 1
  
  Is there really a huge issue with rdbms speeds? Well if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.
  What makes you think this has not been done? Sometimes combination of arrogance and ignorance here is amazing. Very bright minds are working on all kinds of approaches; and of course Oracle (et al) are working on their set of tools to improve them as well.
  In reality it is ALL about different compromises. RDBMS pay hefty price for ACID, and that is ok if that is what you absolutely need. But there is no way to horizontally scale them efficiently (or, after some point, at all). This can be solved by rethinking what your actual requirements are -- if you can loosen some of the requirements by adopting "eventual consistency", you can get much better scalability and availability. You can not just add more boxes to your Oracle cluster: you need bigger box(es). Period. But you can easily add new hosts on your no-sql clusters (depends on system, for some it's easier than others; but this is big focus for all of them). It is not even so much about speed (of individual requests) but throughput, and ability to incrementally increase it as needed.
  There are certainly cases where you'd rather want full ACID set for authoritative data. And then there are many cases -- not just read-only/caching -- where it is acceptable to have intermediate inconsistent states. At Amazon, Dynamo was used for shopping carts, for example. Oracle database was not cost-effective, and by cost I don't mean license costs, but maintenance (and license, h/w etc). Dynamo was designed to solve a specific problem. Other companies are building similar solutions.
  
  --
  I like paying taxes. With them I buy civilization -- Oliver Wendell Holmes
21. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 12:55 · Score: 0, Troll
  
  Yeah, the actual requirement of the twooter should be really thought over once more.
  They may not need any database for their front end at all, that's their problem: they can't scale with the old back end, they think they'll fix it with this new silver bullet? Maybe they'll have it run faster for a while, but what about some real design? Do they actually need to generate any content for every http request? I doubt it. Maybe all they need is a small cluster of large enough servers to generate all of the necessary static pages and push them periodically to their front end web servers. For the inbound requests they probably don't need a database either, just a queue for the generator cluster to work on to generate the static pages.
  That may be all they need, but instead of doing some actual design work and maybe changing some implementation they'll just do what management normally does in the pointy hair boss way: get a hammer, hopefully a silver one and do the same old thing hopefully marginally faster.
  Certainly Amazon is in business different from the twater, they can put many more minds together to compensate for all of the deficiencies of a non-transactional system where transactions are needed. For example excessive journalling can be done and then back end systems can sort out the details and process 99% of cases successfully and throw the last 1% at some CSRs in India or wherever they have the call centers.
  I am sure that Amazon would have preferred to have completely transactional system and their specific problem may as well be performance deficiencies of RDBMSs of their choice. On the other hand it is also possible that their architecture could be changed to do so, but maybe it was less expensive to go the other way, I haven't worked for them yet, so I don't know. However I am building a retailer solution right now with a cluster of PostgreSQL nodes that process a few million transactions a day with a large growth potential and where possible, I'll stick to the RDBMS but I certainly do caching and use hashmaps in memory to speed up quite a few report generations and other features.
  My point is that twufter never really needed an RDBMS in the first place, so it doesn't matter what they use, a fast enough roll of toilet paper may be sufficient for their purposes, who knows.
  
  --
  You can't handle the truth.
22. Re:pfffft twatter tweeter by Eil · 2010-02-23 13:19 · Score: 1
  
  Just like there is no universal programming language for every type of software, there is no universal database engine for every type of data storage.
23. Re:pfffft twatter tweeter by maraist · 2010-02-23 13:37 · Score: 1
  
  Regional data has nothing to do with BigTable or RDBMS. Have you read the white-papers on BigTable? If google leverages any IP isolated network solutions, then it's at the networking/application level ABOVE BigTable.
  
  BigTable itself leverages map-reduce to cascade the query to potentially thousands of machines, reducing their results back to a SINGLE requesting node.
  
  Geo-location would pick one of several data-centers which house an isolated effective database. The upper layered code would act identically if it was Oracle or BigTable.
  
  The consistent view comes from the fact that all columns have a version (the timestamp). The reduce phase guarantees that only the latest version of a column is returned.. Thus if an update is mid-flight in the replication stage, you'll still get the correct data. Now this is completely separate from multi column updates - though BigTable leverages Chubby which is a clustered locking system which presumably would facilitate multi-table consistent updates. But this falls into the category of problems RDBMS's have solved that you have to fend-for-yourself.
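  The "latest version wins" read described above can be sketched as follows. This only illustrates the idea as the comment states it, not how BigTable is actually implemented; plain dicts stand in for storage nodes and every name here is hypothetical:

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    timestamp: int   # the version of the column value
    value: str


def read_latest(replicas, row, column) -> Optional[str]:
    """Ask every node for its copy, then keep only the newest version."""
    candidates = [r[(row, column)] for r in replicas if (row, column) in r]
    if not candidates:
        return None
    return max(candidates, key=lambda cell: cell.timestamp).value


# An update is mid-flight: replica_b has the new version, replica_a does not.
replica_a = {("user:1", "status"): Cell(1, "old tweet")}
replica_b = {("user:1", "status"): Cell(2, "new tweet")}
replica_c = {}
print(read_latest([replica_a, replica_b, replica_c], "user:1", "status"))
```

  Note that this only gives a consistent answer per column; nothing here coordinates an update that touches several columns or rows at once.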
  
  --
  -Michael
24. Re:pfffft twatter tweeter by jmcvetta · 2010-02-23 14:00 · Score: 1
  
  They just need to regenerate the pages once in a few minutes on some large node and then push the static content to their webservers.
  
  Every few minutes is too infrequent. Why not once a minute? Once a second? That will still be a lower load than regenerating on every hit.
25. Re:pfffft twatter tweeter by foxylad · 2010-02-23 14:42 · Score: 1
  
  > Is there really a huge issue with rdbms speeds? Well if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.
  To my mind, it's scaling rather than speed that is the issue. Having seen an RDBMS web app grow in popularity until we needed two DB machines, I have some inkling of how painful that transition is. So now I use Appengine when I can, which scales completely painlessly. There is a trade-off because you have to un-learn long-held habits (like normalisation), but if my app hits Oprah I'll be listening to champagne corks popping, not processors.
  
  --
  Do as you would be done to.
26. Re:pfffft twatter tweeter by Anonymous Coward · 2010-02-23 14:55 · Score: 0
  
  A good commenting system is hierarchical. Guess what? Hierarchical databases existed years before SQL.
27. Re:pfffft twatter tweeter by artsrc · 2010-02-23 15:35 · Score: 0
  
  From what I can tell Google's Big Table is more ACID than your bank's Oracle. Big Table commits on two nodes before it considers its write successful. Your bank's Oracle commits on one and schedules replication to Arizona for as soon as possible.
28. Re:pfffft twatter tweeter by artsrc · 2010-02-23 15:55 · Score: 0
  
  RDBMS systems did not invent ACID. There were solid ACID databases on mainframes before SQL or the relational model were thought of. The non-relational ACID databases were faster than relational databases then, and they still are after 40 years of work on relational databases. You seem to be conflating ACID and RDBMS. ACID is not free. However there are many issues with the relational model and with the SQL implementations that additionally negatively impact performance. The relational model provides a logical model for data. This model is sometimes less convenient than other models (Object Oriented etc.). This model is frequently harder to provide an efficient implementation for than other models (key/value etc.). It is likely that the SQL vendors will respond with claims. Claims are free. They won't respond with a low cost, fully scalable, Oracle App Engine with an SQL backend, running on low cost commodity hardware, with low administration costs. They won't because they can't.
29. Re:pfffft twatter tweeter by artsrc · 2010-02-23 16:03 · Score: 0
  
  > The more interesting aspect of all of this 'NoSQL' movement is how they believe that if they achieve some speed improvement against some relational databases, how that makes them so much better.
  Or the most interesting aspect of the NoSQL movement is that many of the most successful web companies have rejected the SQL orthodoxy and achieved great success. As someone in a conservative, SQL only, environment this is interesting.
  > Is there really a huge issue with rdbms speeds?
  There have always been issues with database speed, we have plenty. Some are best solved by adding an index, caching some results or re-writing a query. Some might be best solved by switching to Cassandra or using the file system.
30. Re:pfffft twatter tweeter by maraist · 2010-02-23 16:06 · Score: 1
  
  RDBMS's are optimized for READS, not writes. You can produce a 1,000 machine mysql-INNODB cluster that will be faster than memcached and be fully ACID compliant. But you'll only ever have 1 write node. You CAN do sharded masters with interleaved auto-incremented values, but then your foreign keys are totally out the window - as is your ACIDity. Oracle has clustered lock managers, but very quickly is going to max out its scalability - especially if it's limited to a single SAN.
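  The "interleaved auto-incremented values" trick works by giving each master its own arithmetic sequence of IDs so that no two masters ever hand out the same key. A minimal sketch, with a made-up helper name (MySQL exposes the same idea through its `auto_increment_increment` and `auto_increment_offset` settings):

```python
import itertools


def interleaved_ids(node_index, node_count):
    """IDs for one master: start at node_index + 1, then step by node_count."""
    return itertools.count(start=node_index + 1, step=node_count)


# Three write masters, each drawing from its own non-overlapping sequence.
masters = [interleaved_ids(i, 3) for i in range(3)]
first_ids = [[next(gen) for _ in range(4)] for gen in masters]
print(first_ids)
```

  The keys never collide, but nothing stops a row on one master from referencing an ID that lives (or was deleted) on another, which is the foreign key problem mentioned above.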
  
  Relatively expensive 15,000 RPM disks are going to max out near 15,000 random seeks per second. RAID-10 or even RAID-50 (if you're sadistic) is only going to give you a small constant multiplier to this performance. And if you're maxing out said items, then the SCSI queueing and multi-gig RAID-controller memory cards will buy you mere seconds of peak performance.. Utterly useless in sustained writes.
  
  SSD's alleviate some of the disk-based limitations, EXCEPT that you are constrained by three factors.. 1) Disk size 2) SSD's like large block-sizes 3) SSD's can't write to one location too often. Thus the modern high performance SSDs do address remapping, which eventually degrades the overall performance. And ironically, while random-IO is faster on SSD's than disks, linear writes seem to be faster on disks than SSDs. Obviously SSD's are still in their infancy. The game may change any year now.
  
  The two core write-scaling problems above are the inter-table dependencies (the foreign keys) and the random-IO necessary for diskhashtable or B+Tree backingstore layouts (factorial-layouts alleviate this somewhat). This also applies to read when you do large M-way joins of multi-giga-record tables. You're essentially requiring over a billion disk seeks to satisfy a single query - completely unmanageable. Yes there were ways to re-architect the product to mitigate this type of query (denormalization, externalized batch journaling, etc) - but your argument was that RDBMS's solve problems - this is a problem that requires hackery to avoid the intrinsic flaw in RDBMS's foreign key/join-key architecture.
  
  Google's BigTable paradigm does away with the need for foreign keys by simply providing a 1-to-many relationship as a 3'rd dimension to the simple flat table. Yes this doesn't solve 4D problems, but MOST RDBMS's could be done away with by simply living in this 3D space.
  
  You could achieve this pattern with existing RDBMS's simply by storing a hash-map in a blob column. But this would not be efficient at all. You'd have to lock the entire row and rewrite the entire blob to change a single value.
  
  BigTable gives you MVCC on each 3D key-value pair with locking to the primary-key of the row. It's a column-oriented database (of which there are many in RDBMSs), but almost all the real meat is in this 3'rd dimension which is stored in a versioned, replicated, append-only manner.
  
  The append-only immutable data naturally fits a read scaling model.. Once saved to disk, you replicate the recordset to dozens, if not hundreds or thousands of machines (typically on a copy-on-cache-miss model when hitting one of a thousand servers). You then leverage the map-reduce model, to make sure you catch any writing nodes for the given column of interest, then on the reduce, you choose the newest version. Thus you have consistency (unlike some scalable approaches that do an 'eventually consistent' model).
  
  Because of the map-reduced MVCC, you can then shard out writes to random machines.. It literally doesn't matter where the inserts/updates/deletes get written to, because on the next read, only the newest version will be passed to the client. There is some contention in centrally managing which nodes are doing writes, but at least you can spread writes to at least a dozen machines per column.. And with say a dozen columns, that means spreading writes across a hundred nodes. And with a dozen tables, you're over a thousand write nodes (across multiple data-centers or at least isolated networks) (though obviously you
  
  --
  -Michael
31. Re:pfffft twatter tweeter by DragonWriter · 2010-02-23 18:49 · Score: 1
  
  Or: use the right tool for the job. The only difference is, now alternative tools actually exist.
  In point of fact alternative persistence mechanisms to relational databases predate relational databases.
32. Re:pfffft twatter tweeter by DragonWriter · 2010-02-23 18:57 · Score: 1
  
  If you don't really need a database to run your 'website', then who cares if you use flat files or an in memory hashmap for all your data needs?
  There is a difference between needing a structured storage mechanism (database) and needing a database that implements the relational model and provides ACID guarantees. Further, many non-relational databases provide specific, weaker forms of ACID guarantees that are better than (say) naive flat file storage would, while providing better scalability in certain applications than existing RDBMS products.
  There's certainly a lot of work going on on providing better scalability for relational databases providing ACID guarantees, too, and as that progresses (because strong ACID guarantees do have value), RDBMS's may be better in some of the roles that "NoSQL" products are good for now. There are challenges to scalability with ACID guarantees, and maybe even some hard barriers, so at best it's going to be easier to build scalable products with weaker guarantees in the near future. And real apps need real solutions now, not solutions that might materialize years down the line.
  
  Is there really a huge issue with rdbms speeds?
  Yes, in certain applications with certain workloads there is. Otherwise people would just use existing products.
33. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 20:47 · Score: 0
  
  In the context of this story I am confusing nothing. They are moving out of ACID space that was provided to them by an RDBMS into some NoSQL stuff that specifically says it does not guarantee transaction properties. They are doing this for speed supposedly or maybe to save money.
  They are not going to have major improvements with this I bet, they will see some moderate speed improvement, but the point is they never needed data to be transactional in the first place, since they obviously don't care about that property of data now, however, I doubt very much that all of their data is truly not relational. Some of it is not, but that's always the case.
  What they should have done instead is proper design, which might have ended up being a small cluster of big machines generating static pages out of the input content and replacing static pages on a cluster of web servers every minute or 5 or whatever is acceptable. Would have been faster than generating all of the dynamic content for each http request. The inbound data only has to be thrown into some queue, memory queue even, then a cluster of generators would grab things off the queue and produce static content while one of the machines could persist the data into some relational database for backup.
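  A toy, single-process sketch of that design: inbound messages go into an in-memory queue, and a periodic generator pass drains it and rewrites only the static pages that changed. All names are invented for illustration; a real deployment would push the rendered pages out to the web servers and persist the inputs somewhere:

```python
import html
import queue
from collections import defaultdict

inbound = queue.Queue()          # inbound posts land here, nothing else happens
timelines = defaultdict(list)    # generator-side state, per user
static_pages = {}                # stand-in for files pushed to the web servers


def accept(user, text):
    """Front end: just enqueue the message and return."""
    inbound.put((user, text))


def regenerate():
    """Generator pass: drain the queue, then re-render only the dirty pages."""
    dirty = set()
    while True:
        try:
            user, text = inbound.get_nowait()
        except queue.Empty:
            break
        timelines[user].append(text)
        dirty.add(user)
    for user in dirty:
        items = "".join(
            f"<li>{html.escape(t)}</li>" for t in reversed(timelines[user])
        )
        static_pages[f"/{user}.html"] = f"<ul>{items}</ul>"
    return len(dirty)


accept("alice", "hello")
accept("alice", "world")
regenerate()                     # would run every minute or so
print(static_pages["/alice.html"])
```

  Readers only ever see what the last pass produced, so the regeneration interval is the staleness you are agreeing to.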
  
  --
  You can't handle the truth.
34. Re:pfffft twatter tweeter by Anonymous Coward · 2010-02-23 21:45 · Score: 0
  
  Cause SQL doesnt scale proper.
  Check out Ars Technica's feature about NoSQL :
  http://arstechnica.com/business/data-centers/2010/02/-since-the-rise-of.ars
35. Re:pfffft twatter tweeter by roman_mir · 2010-02-23 22:39 · Score: 1
  
  goodness, that is a terrible article you are referring to, just awful.
  
  Additionally, managing relational databases in a production environment can become labor intensive and error-prone. Each database package comes with its own world of configuration options, performance sensitivities, bugs, and tools
  - just like any other software. What, the NoSQL software will not have its own 'labor intensive, error-prone, configuration options, performance sensitive, bugs and tools' features? What, they invented a way to write software in the NoSQL camp that can avoid being complex to maintain in real production environments while being totally bug free while not having complex configuration options and will not be performance sensitive?
  
  While these issues usually start small, they can become a drain on developers' time and resources as the product matures and its needs become more complex. This complexity of management arises from the complexity of the database packages themselves; it is their very breadth of capabilities which makes them difficult to manage.
  - and this NoSQL thing will start small and will eventually have the same exact issues.
  You want a prediction? I predict that eventually NoSQL camp will add some sort of 'pseudo-sql' features on top of their paradigm to become more compatible with the current SQL systems, while adding more and more features, bugs, tools, complexities and at the end a new paradigm will be born, bug free and tool free and sensitivity free, it will be called: NoNoSQL++-+-.
  I mean, look at this stupid article, it says that existing features that are provided by the databases are the reason why people don't want to use them? That's exactly like all those stupid idiots arguing that Java is a terrible language because there are so many libraries available for it!
  Here it is - the epiphany of stupidity:
  
  Finally, SQL encourages (but does not require) developers to perform data processing in the database itself, in addition to data storage. Much of the time, the easiest way to map two tables together is to use a JOIN, and the easiest way to sort the results is with an ORDER BY, and so forth. Doing so adds load to the database's CPU, often a precious resource, while saving load on the application host--a bad trade-off that leads more quickly to the relational database's scaling wall.
  
  - Muhahahahahaha! Morons. Is that the reason why Twfuuufter is moving away from RDBMSs, because they can do joins and orders in the database rather than in their own application? Idiotic at best. Whoever wrote this piece of garbage technobabble 'article' is a total moron.
  The problem with Twyyter is that they don't design, they take off the shelf stuff and expect it to do 'thinking' for them. They should have a cluster of dedicated servers to churn out static content off a queue of inputs and have one machine commit the inputs into a datastore, whatever it is. Then the static content is really easy to serve from a cluster of web servers. The database specific design is irrelevant then, but I bet that most of their data is relational.
  
  --
  You can't handle the truth.
36. Re:pfffft twatter tweeter by QuoteMstr · 2010-02-24 01:28 · Score: 1
  
  Thank you for the informative and thought-provoking post. It's certainly refreshing to see discourse on a level above "I hate MySQL, therefore SQL sucks." You make some good points.
  Nevertheless, an RDBMS is still the way to go. You hint at the reason in your last paragraph, actually. The entire NoSQL "movement" is predicated on a confusion of implementation and interface. You describe various problems with the way conventional RDBMSes employ the disk: who said RDBMSes had to use those approaches?
  There's nothing stopping a high-quality system from using BigTable-style backing storage when the schema permits, or when the user specifically authorizes that kind of consistency. The problems you mention are not "intrinsic" to the RDBMS concept, but are rather features of implementations and schemas. Those can be tweaked without throwing the entire system away and starting from scratch.
  When an RDBMS improves, all the applications using that system also improve automatically. On the other hand, BigTable will remain BigTable forever: it's a bespoke system from people who were convinced they would never need the power of the relational features a system offers.
37. Re:pfffft twatter tweeter by Abcd1234 · 2010-02-24 02:53 · Score: 1
  
  Yeah, no kidding, it's called a filesystem. But when was the last time you heard announced a mainstream, high-performance, non-relational data store that was intended to be an alternative to an RDBMS (BTW, I'm intentionally discounting OODBMSes, as I think they and RDBMSes are intended to target largely the same application space)? I know I haven't. People simply rolled their own and moved on. But times are changing and that niche is finally being filled (in part because that niche isn't so niche anymore).
38. Re:pfffft twatter tweeter by TheSunborn · 2010-02-24 03:51 · Score: 1
  
  The problem is that with this kind of backing storage, you can't implement most of sql effectively. So you might end up with a 'sql' database where you can't use joins in production due to performance. So you end up with the worst of both worlds. A 'relational' database where you can't use most of the relational operations due to performance issues. And you still have a relational interface, so you can't do the kind of magic optimizations you can do with a simple key/value storage.
  As I see it, the problem is that with the relational model (and especially sql) it is very difficult to implement the kind of query optimizations needed. The No SQL movement often gives the user an abstraction that is much closer to the implementation. This does require more code from the developer, but it also allows him to make a much more effective client implementation exactly because all details are controlled by the developer, and not the query optimizer.
  My preferred solution would be a dual db interface solution, where the developers could interface with the db using either sql or a very very low level interface that insert data directly into the database b+* tree. Allowing usage of pointers directly to database entries and other low level code.. But I don't think the relational databases will implement this, because it will force them to freeze their backend data storage structure and it will be very difficult to implement in a safe way with concurrent sql running.
  But maybe someday one of the current "No SQL" databases will implement an optional sql layer above their current storage engine. That would be kind of ironic but not bad at all.
  * Or what ever kind of data structures the database use.
39. Re:pfffft twatter tweeter by DragonWriter · 2010-02-24 05:42 · Score: 1
  
  But when was the last time you heard announced a mainstream, high-performance, non-relational data store that was intended to be an alternative to an RDBMS (BTW, I'm intentionally discounting OODBMSes, as I think they and RDBMSes are intended to target largely the same application space)? I know I haven't. People simply rolled their own and moved on. But times are changing and that niche is finally being filled (in part because that niche isn't so niche anymore).
  One of the most recent, well-known major successes before the recent "NoSQL" movement, in terms of a product that sacrificed ACID for performance as an alternative to databases providing ACID guarantees, was MySQL. (And, insofar as scalability in the database size dimension is an aspect of performance, I guess the WWW itself could count.) I can't think of a time in history where there wasn't a tension between offerings with maximum performance in particular dimensions and offerings with the strongest integrity guarantees.
  Aside from that, many of the "new alternatives" are non-relational, high-performance systems that are updated versions of non-relational, high-performance systems that have been around in large-scale production deployments and have continued to be maintained since before relational databases were widespread -- some even before Codd's paper laying out the relational model was published in 1970. E.g., InterSystems Cache is a development of MUMPS, which has been continuously in use in large production installations since the late 1960s; a number of other of the recently--and amusingly--labelled "post-relational" databases are the products of decades of revisions -- with continuous production deployments -- from the similar MultiValue database included as part of the PICK operating system, also from the late 1960s.
  Since these had numerous, large-scale, production deployments, I wouldn't exactly call them "niche". They may not have overlapped with the experience of most web developers, so they may seem new from that perspective.
40. Re:pfffft twatter tweeter by Abcd1234 · 2010-02-24 07:00 · Score: 1
  
  One of the most recent, well-known major successes before the recent "NoSQL" movement, in terms of a product that sacrificed ACID for performance as an alternative to databases providing ACID guarantees, was MySQL.
  I said nothing about ACID compliance. I specifically mentioned non-relational datastores, and clearly MySQL isn't that. As such, it still forces the developer to work with a relational data model, and one of the main things these so-called "NoSQL" projects do is lift that requirement.
  Aside from that, many of the "new alternatives" are non-relational, high-performance systems that are updated versions of non-relational, high-performance systems that have been around in large-scale production deployments and have continued to be maintained since before relational databases were widespread -- some even before Codd's paper laying out the relational model was published in 1970. E.g., InterSystems Cache is a development of MUMPS, which has been continuously in use in large production installations since the late 1960s; a number of the other recently--and amusingly--labelled "post-relational" databases are the products of decades of revisions -- with continuous production deployments -- from the similar MultiValue database included as part of the PICK operating system, also from the late 1960s.
  Well bully for you having a chance to show off your obscure knowledge of non-relational data stores, I'm sure you must be very proud. But have any of those been targeted at modern enterprise application deployments? Not that I'm aware of. Which is why I asked the question "when was the last time you heard announced a mainstream, high-performance, non-relational data store that was intended to be an alternative to an RDBMS". Answer: there hasn't been. Rather, the RDBMS has, for decades, been considered the solution that should *replace* the types of systems you describe, because the RDBMS has been largely considered *the* answer for large-scale data management. This whole "NoSQL" (god I hate that name) trend, on the other hand, is a move away from relational models to ones that may be more appropriate for the kinds of applications people are building today.
41. Re:pfffft twatter tweeter by Anonymous Coward · 2010-02-24 07:50 · Score: 0
  
  Epic.
  
  Why do silver bullets kill vampires?
42. Re:pfffft twatter tweeter by DragonWriter · 2010-02-24 12:11 · Score: 1
  
  I said nothing about ACID compliance. I specifically mentioned non-relational datastores, and clearly MySQL isn't that.
  
  Um, the reason MySQL with MyISAM doesn't provide ACID guarantees (particularly, its deficiencies with regard to consistency) is related to the ways in which MySQL with MyISAM fails to implement the relational model. Merely using a dialect of SQL as a query language doesn't make a database relational.
  
  Well bully for you having a chance to show off your obscure knowledge of non-relational data stores, I'm sure you must be very proud. But have any of those been targeted at modern enterprise application deployments? Not that I'm aware of.
  
  Your ignorance on the point is noted, but since they have been continuously deployed in large enterprise applications (including "modern" ones), and since they are in many cases the exact same products now being touted as "new" "post-relational" alternatives for modern enterprise applications, the answer is "yes".
  There's nothing obscure about them; some of the largest production databases used by the largest institutions (e.g., the U.S. Department of Veterans Affairs) run on these systems.
  
  Which is why I asked the question "when was the last time you heard announced a mainstream, high-performance, non-relational data store that was intended to be an alternative to an RDBMS". Answer: there hasn't been.
  
  Well, sure, if you ignore the mainstream, high-performance, non-relational datastores which predate RDBMSs and which RDBMSs never managed to displace for many large-scale, production installations. Which happen, in many cases, to be the exact same systems now hyped as "post-relational" alternatives to RDBMSs.
  
  Rather, the RDBMS has, for decades, been considered the solution that should *replace* the types of systems you describe, because the RDBMS has been largely considered *the* answer for large-scale data management.
  
  It has, by some people. It hasn't, by others -- including the users and vendors of these large-scale production systems -- which is why these systems have continued to be deployed and developed, and are now around to be some of the primary subjects of "NoSQL" hype.
  
  This whole "NoSQL" (god I hate that name) trend, on the other hand, is a move away from relational models to ones that may be more appropriate for the kinds of applications people are building today.
  Which, it turns out, are often the current versions of the same systems which have been continuously deployed in large production environments since before RDBMSs were even around to compete with.
  IOW, web developers are starting to build systems that reach the scale of the large-scale production database systems which RDBMS's never displaced, and are finding that the same systems which RDBMSs never managed to displace are, in fact, often better choices than RDBMSs at that scale, which makes these tried-and-true systems new and shiny to the web development community.
43. Re:pfffft twatter tweeter by haruchai · 2010-02-24 12:36 · Score: 1
  
  Just because you haven't heard of something doesn't make it obscure. Tens of millions of Americans still can't find Iraq on a map - doesn't mean it doesn't exist or isn't "mainstream".
  And, at least one major "mainstream" US news network can't tell Egypt from Iraq.
  http://mediamatters.org/mmtv/200907270040
  Caché / MUMPS is heavily used in Healthcare and Finance.
  Your life and your financial future may well depend on apps that run on them.
  Just because something isn't incessantly hyped by egocentric CEOs doesn't mean it's not dependable, versatile or worthwhile. If you're Joe Couchsitter waiting for the next big thing to break down your door, then, okay, you probably would never have heard of any databases besides Oracle and its closest competitors (or maybe DBase if you're above a certain age).
  But, getting on this newfangled Internet thingie and using an obscure tool called a search engine would have given you lots of alternatives to ponder, at any time in the last decade.
  
  --
  Pain is merely failure leaving the body
44. Re:pfffft twatter tweeter by maraist · 2010-02-24 14:05 · Score: 1
  
  I totally love RDBMS's, don't get me wrong. You can manipulate schema on-the-fly (more or less). Introduce optimizations independently of the source-code. You don't have to think about fringe cases, or data-integrity. But when a project grows to a certain point, you have two decisions: Go to a hyper-expensive RDBMS solution ($100k .. $500k) (for a project that may only be worth $100k), or identify the key bottlenecks and try to re-architect the tables of interest.
  
  I've often found that DRBD+NFS flat file solutions scale 100x that of a mysql solution and with 1/5th the hardware horsepower. 'Cursors are just a tad bit more efficient at the file-system level'. But obviously when doing this you're introducing massive volumes of potential bugs. But 3 hours of debugging relatively basic CIS practices to save you $96k of the $100k Oracle bill is well worth it in my opinion.
  
  To me BigTable style solutions are the middle ground. (though the file-system cursoring is more akin to a Hadoop map-reduce than the explicit table architecture of HBase/Hypertable/Cassandra).
  
  --
  -Michael
I'm Reluctant by Anonymous Coward · 2010-02-23 07:51 · Score: 1, Insightful

I'm reluctant to believe that Twitter is a good technology bellwether. Twitter seems to have so many technology issues, fail whales, outages, security breaches...
SO, I'm left wondering; what does this move say? Does it say that Cassandra is so bad that Twitter is using it? Or does it say that a fail whale population boom is imminent?
1. Re:I'm Reluctant by binarylarry · 2010-02-23 08:11 · Score: 2, Insightful
  
  Twitter's only moving to this new database written in Java because everyone else is.
  
  --
  Mod me down, my New Earth Global Warmingist friends!
2. Re:I'm Reluctant by Doomdark · 2010-02-23 12:40 · Score: 1
  
  I'm not sure this is true -- there are plenty of no-sql alternatives written in other languages (erlang (CouchDB), c++ (mongo), ...). I think the choice has to do with the somewhat more powerful model Cassandra has (compared to other choices that are simpler), or its good distribution (others only support limited sharding, not true distribution; AFAIK CouchDB has this issue).
  
  --
  I like paying taxes. With them I buy civilization -- Oliver Wendell Holmes
Intersystems Caché by paugq · 2010-02-23 08:17 · Score: 1

They should move to Intersystems Caché. SQL, objects, XML and even MUMPS. It will make SQL and NoSQL fans equally happy. And it's damn fast. Much leaner than Oracle, DB2 or Informix, too. Excellent support. Extremely good. Not cheap, though.
1. Re:Intersystems Caché by edmicman · 2010-02-23 09:06 · Score: 1
  
  Not cheap, though.
  That might be part of it....
2. Re:Intersystems Caché by Anonymous Coward · 2010-02-23 11:09 · Score: 0
  
  How does the write performance scale across several servers ("scale horizontally")? That is what NoSQL is all about.
3. Re:Intersystems Caché by paugq · 2010-02-23 18:28 · Score: 1
  
  Excellent. That's, in fact, our use case. We have a very specialized application where every client is also server (multi-master). We even wrote our own database replication software, which is better than Intersystems' (the "shadow replication" Caché is able to do does not allow multi-master replication).
Don't want to install Cassandra by einhverfr · 2010-02-23 08:33 · Score: 2, Funny

I hear Cassandra is really a trojan. Can anyone verify? I don't want a trojan on my computer.....

--

LedgerSMB: Open source Accounting/ERP
1. Re:Don't want to install Cassandra by turbidostato · 2010-02-23 12:36 · Score: 1
  
  I hear Cassandra is really a trojan. Can anyone verify? I don't want a trojan on my computer.....
  But, but... what if I gift it to you? I swear I'm not Trojan but Greek.
Twitter needs scalability experts by Heretic2 · 2010-02-23 08:51 · Score: 5, Interesting

I love how ass backwards twitter has always been with learning how to scale their 90s infrastructure up. I remember when they called out the Ruby community because they didn't understand MySQL replication and memcached.
I guess without a profit model they couldn't use a real RDBMS like Oracle. EFD (Enterprise Flash Drive) support anyone? 11g supports EFD on native SSD block-levels. Write scale? How about 1+ million transactions/sec on a single node Oracle DB using <$100K worth of equipment and licenses? Anyway, I've built HUGE databases for a long time, odds are most of you have interfaced with them. Just because it's free and open-source doesn't make it cheap.
I love FOSS don't get me wrong, but best-in-class is best-in-class. I only use FOSS when it happens to be best-in-class. I laugh at how none of the requirements included disaster recovery. No single point of failure does not preclude failing at every point simultaneously. EMP bomb at your primary datacenter anyone?
1. Re:Twitter needs scalability experts by MindVirus · 2010-02-23 09:12 · Score: 1
  
  If you have dealt with an EMP at your primary datacenter, you are a more hardened sysadmin than us all.
2. Re:Twitter needs scalability experts by msimm · 2010-02-23 09:25 · Score: 1
  
  ...real RDBMS like Oracle...
  Holy fuck, the right tool for the right job, please? Oracle does some things for some markets really well, but for the rest of us who don't need such a high degree of transactional safety, that $90k + two-node RAC price tag might just end up taking your great web 3.0 business through development, maybe early beta, before you begin liquidating assets. That's per-processor licensing too, on a database that scales vertically well (very well really) but not horizontally well (sharding anyone?), so the better your project does the more processor licenses you'll be looking at, and using higher-priced hardware to do it too (because cheap boxes scale best horizontally).
  
  So sure, if you can afford it get the big iron and with any luck your industry works with the kind of margins you'll never even need to know the cost of going that way. Personally, after having done the Sun/Oracle thing I hope to never find myself sitting at a business meeting trying to figure out how we can meet capacity demands after we've run out of money paying for high priced hardware and license fees.
  
  I'm glad products like Oracle's exist, even somewhat impressed by them, but not every project will need them and there are very real cost considerations that should also be taken into account. Know your business and for the love of God, do a thorough survey of all the available tools before you commit to one.
  
  --
  Quack, quack.
3. Re:Twitter needs scalability experts by mini+me · 2010-02-23 09:28 · Score: 1
  
  They are seeing about 1/2 million transactions per second with this setup based on the information given, but no word of what their cluster consists of. If it is just a handful of generic PCs, $100,000 for your setup looks pretty expensive.
4. Re:Twitter needs scalability experts by guru42101 · 2010-02-23 09:37 · Score: 2, Insightful
  
  I've never dealt with an EMP, but a more realistic threat with similar effects would be a hurricane or earthquake. I used to work at an international bank and we had to deal with both (offices in FL and CA). For the most part the best solution was to have an identical setup at another office and to have all applications available via VPN and/or web access. We had a separate pipe that was used only for backup data transfers. The DB transaction logs were written both locally and remotely. All user files saved to the server were immediately copied to the backup server. On several occasions the systems were tested due to black/brownouts. The users were sent home, where they could work just as effectively as at the office.
  Our general emergency plan for hurricanes (I worked at the FL office; we used the CA office as our backup) was to let the users go well in advance of the hurricane and switch CA to being our primary servers with FL as the backup. Once the users were settled, they could continue working from home. The only way we would be screwed is if a hurricane and an earthquake happened simultaneously. At that point we'd have to restore VM backups on hardware located at the main corporate offices in NYC or Sydney.
5. Re:Twitter needs scalability experts by Jazz-Masta · 2010-02-23 09:47 · Score: 1
  
  you are a more hardened sysadmin than us all.
  You have no idea...
  http://xkcd.com/705/
6. Re:Twitter needs scalability experts by Anonymous Coward · 2010-02-23 09:47 · Score: 0
  
  ...real RDBMS like Oracle...
  So sure, if you can afford it get the big iron and with any luck your industry works with the kind of margins you'll never even need to know the cost of going that way. Personally, after having done the Sun/Oracle thing I hope to never find myself sitting at a business meeting trying to figure out how we can meet capacity demands after we've run out of money paying for high priced hardware and license fees.
  Oracle runs on almost anything, including cheap Linux boxes, so hardware costs are rarely the bottleneck unless you're locked into a contract with a hardware vendor. (RDBMS license costs are a different matter.)
  Then again, most apps I've seen are badly designed around the DB - especially if you let the developers design their own schema - so I'm not surprised they have scaling problems. Most well-designed apps running on Oracle can take huge loads without sweating.
7. Re:Twitter needs scalability experts by ryansking · 2010-02-23 10:07 · Score: 2, Informative
  
  You're right, I failed to mention disaster recovery – it was something we looked at, it's just been a while since we went through the evaluation process, so I've forgotten a few things. We actually liked Cassandra for DR scenarios – the snapshot functionality makes backups relatively straightforward, plus multi-DC support will make operational continuity in the case of losing a whole DC a possibility.
8. Re:Twitter needs scalability experts by codepunk · 2010-02-23 10:08 · Score: 1
  
  I love Oracle, it is a fine database. Would I personally buy it? Nope, but as long as
  it is OPM (Other People's Money) I am perfectly fine with it. Now say I was designing something
  like a medical records system, Oracle would be a no-brainer. Missing a couple of tweets here
  and there, who is really going to care?
  
  --
  
  Got Code?
9. Re:Twitter needs scalability experts by msimm · 2010-02-23 10:57 · Score: 1
  
  You're either intentionally missing my point or have simply missed it.
  
  Sure, you could run Oracle on x86 desktop hardware. You could even go so far as buying two cheap e-machines and run them both as Linux-based RAC nodes. Granted, those shitty nodes with the fibre-channel attached disk array would still have set you back probably over $90k with the additional hardware and licenses.
  
  Eventually, if you're lucky enough to see rapid or continued growth you'll find yourself moving onto better enterprise grade gear, and the best way to push performance (aside from application/network/database optimization) is to begin your vertical scaling by increasing your available memory, disk speed (SSD cache disks, etc) and total number of cores.
  
  Oracle is amazing with multi-core architecture. I've personally performed days, probably several weeks, of stress testing and watching Oracle's core/page/memory use which, as compared to MySQL for example, is a thing of beauty.
  
  But that all comes with a per-core license cost attached, which can make capacity increases cost significantly more in database licenses than in physical hardware. That's fine if you have a business that really needs some of the features that come with Oracle. But I'd argue that the majority of businesses like Twitter or Facebook need flexible, cheaply scalable, high-volume read-writes more than they need the reliability or datamining/statistical features that come with the Oracle price tag.
  
  But to each their own. The premise of my post was proper evaluation, the right tool for the right jobs and there are certainly times Oracle is/will be that tool.
  
  --
  Quack, quack.
10. Re:Twitter needs scalability experts by Anonymous Coward · 2010-02-23 11:29 · Score: 0
  
  Cool! Let's design twitter in a slashdot forum! That sounds fun!
  EFDs are 10x cheaper now than when Twitter started. But, we'll ignore that. You need two nodes for fail-over, probably. So $200k. You also need some hardware and licenses for test and development environments. So, probably $250k minimum?
  Twitter is doing ~50 million tweets per day. That's about 600 tweets per second, probably with peak load around 6k/s. I'm just guessing about the peak load. So, your point load capability of 1 million simple row writes per second with no indexing is extreme overkill. But, you aren't measuring the right things. The critical problem is read throughput and latency when under write load. Twitter currently supports an average page read load of 1600/s. So, there are probably bursty point loads approaching 10k/s.
  You need to write those raw tweets into a FollowTweet table like |ID|UserID|Date|AuthorID|AuthorName|TweetID|Tweet|, or you'll end up doing essentially random searches across the core tweet table for every page view. The dominant query pattern for this table is
  SELECT TOP 50 * FROM FollowTweet WHERE UserID = @pUserID ORDER BY Date DESC
  You probably need a clustered covered index to make that query fast or you'll do a lot of random seeks on read. I've already denormalized that a bit for you to get rid of the join against the core Tweet table. You can argue about that if you like. Test both if you disagree.
  So, what's your write rate to a table that looks like that while maintaining a clustered covered index, after writing 50 million tweets per day for 6 months? It's far lower than 1 million transactions/sec. But, I'm curious about the exact number.
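  To make the denormalized layout above concrete, here is a minimal fan-out-on-write sketch. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 rather than Oracle or SQL Server, so TOP 50 becomes LIMIT 50 and the index is an ordinary composite index, not a clustered covering one. The names post_tweet, timeline, followers and idx_follow_user_date are made up for the example.

```python
import sqlite3

# Hypothetical schema modeled on the FollowTweet layout described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE FollowTweet (
        ID         INTEGER PRIMARY KEY,
        UserID     INTEGER NOT NULL,  -- follower whose timeline owns this row
        Date       INTEGER NOT NULL,  -- post time as a Unix timestamp
        AuthorID   INTEGER NOT NULL,
        AuthorName TEXT    NOT NULL,
        TweetID    INTEGER NOT NULL,
        Tweet      TEXT    NOT NULL
    )
""")
# Composite index so one user's newest rows are adjacent in the index.
conn.execute(
    "CREATE INDEX idx_follow_user_date ON FollowTweet (UserID, Date DESC)"
)

# Made-up follower graph: author_id -> list of follower ids.
followers = {1: [10, 11, 12]}


def post_tweet(author_id, author_name, tweet_id, text, ts):
    """Fan out on write: copy the tweet into every follower's timeline."""
    rows = [
        (uid, ts, author_id, author_name, tweet_id, text)
        for uid in followers.get(author_id, [])
    ]
    conn.executemany(
        "INSERT INTO FollowTweet "
        "(UserID, Date, AuthorID, AuthorName, TweetID, Tweet) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    return len(rows)


def timeline(user_id, limit=50):
    """Newest-first page for one user; no join against a core tweet table."""
    cur = conn.execute(
        "SELECT TweetID, AuthorName, Tweet FROM FollowTweet "
        "WHERE UserID = ? ORDER BY Date DESC LIMIT ?",
        (user_id, limit),
    )
    return cur.fetchall()
```

  The trade-off is the one the comment is arguing about: every tweet costs one write per follower, in exchange for a page view that is a single index range scan.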
  Twitter also provides free text search. So, add in a free text search index on tweets, ordered by time of post descending.
  Now that you're maintaining a free text index on the core Tweet table, how many writes per second can you sustain, after writing 50 million tweets per day for 6 months?
  Much more importantly, how many user queries per second can you sustain against the FollowTweet table, and what is the 99.9% latency of a read? And, how many free text searches per second can you sustain, and what is the 99.9% latency of a free text search? And, how long does it take to completely write a tweet and make it available to all followers and in the free text index?
  I'm sure you know that latencies are much more important than throughput to the user's subjective experience. I bet you can't sustain the necessary write load on a single node while maintaining good read latency. So, you're going to need some read only mirror nodes to scale reads. So, add that hardware and license cost to your system.
  I could, however, take 12 commodity 1u boxes with cheap disks and 8GB ram, install mongo on them and turn them into mirrored pairs, and handle all of the load I just described with consistently low latency. That'd be about 133/s direct key lookups per mongo box for a page view. And, 100/s writes per mongodb box. Each mongodb box should scale to 10x that load without worry, so we should be fairly safe for point load. I'd have to see the real twitter numbers to be sure. They do seem to have high burst peaks. We can snapshot the mongodbs for disaster recovery, and I'm mirrored for fast failover if a single node goes down.
  You're going to need another 6-12 front end boxes to render pages, with either storage system.
  The total cost for the release environment machines would be ~$24k from a reputable server hardware builder. So, we're looking at $24k for the NoSql FOSS design, vs. $200k for the Oracle, big iron design. We're both ignoring network equipment, bandwidth, and hosting costs.
  Also, with an oracle design, you get to spend your $500k a year on operational specialists with pagers. With FOSS, you can probably spend $240k a year on that staff.
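  For anyone who wants to check the back-of-envelope load figures used in this comment, the arithmetic can be sketched as follows. The inputs are the numbers quoted above; how the load is spread (reads over all 12 boxes, writes sharded over 6 mirrored pairs with each mirror taking its pair's writes) is an assumption about what the comment intends.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 50_000_000   # figure quoted in the comment
page_reads_per_sec = 1600     # figure quoted in the comment
boxes = 12                    # commodity boxes
pairs = boxes // 2            # arranged as mirrored pairs (assumed sharding unit)

# Average write rate if tweets arrived evenly over the day.
avg_writes_per_sec = tweets_per_day / SECONDS_PER_DAY

# Reads spread across every box; writes spread across pairs,
# with both members of a pair receiving that pair's writes.
reads_per_box = page_reads_per_sec / boxes
writes_per_box = avg_writes_per_sec / pairs

print(f"average writes/s overall: {avg_writes_per_sec:.0f}")
print(f"reads/s per box:          {reads_per_box:.0f}")
print(f"writes/s per box:         {writes_per_box:.0f}")
```

  The average comes out a little under the rounded "about 600 tweets per second", which is why the per-box write figure lands just below the 100/s quoted above.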
11. Re:Twitter needs scalability experts by lawpoop · 2010-02-23 11:38 · Score: 1
  
  I only use FOSS when it happens to be best-in-class
  Just curious, what FOSS have/do you use?
  
  --
  Computers are useless. They can only give you answers.
  -- Pablo Picasso
12. Re:Twitter needs scalability experts by Anonymous Coward · 2010-02-23 12:02 · Score: 0
  
  From detailed background on cassandra link:
  
  Multi-datacenter awareness: you can adjust your node layout to ensure that if one datacenter burns in a fire, an alternative datacenter will have at least one full copy of every record.
  I didn't know that IP tunnels were able to transport EMP's. Awesome!
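  Jokes aside, the idea in the quoted passage (every record has at least one copy in another datacenter) can be illustrated with a toy replica placement routine. This is a sketch of the general idea only, not Cassandra's actual placement code; place_replicas, the ring layout and the node names are all hypothetical.

```python
def place_replicas(ring, start, replication_factor):
    """Pick replica nodes for a key whose ring position is `start`.

    `ring` is a list of (node, datacenter) tuples in ring order.
    Pass 1 takes the first node met in each datacenter, so no single
    datacenter holds every copy. Pass 2 fills the remaining slots in
    plain ring order.
    """
    order = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    chosen = []
    seen_dcs = set()

    # Pass 1: one replica per datacenter, while slots remain.
    for node, dc in order:
        if dc not in seen_dcs and len(chosen) < replication_factor:
            chosen.append(node)
            seen_dcs.add(dc)

    # Pass 2: fill any remaining slots with the next unused nodes.
    for node, dc in order:
        if len(chosen) >= replication_factor:
            break
        if node not in chosen:
            chosen.append(node)

    return chosen


ring = [
    ("a1", "east"), ("a2", "east"), ("a3", "east"),
    ("b1", "west"), ("b2", "west"),
]
print(place_replicas(ring, 0, 3))
```

  A naive "next three nodes on the ring" walk from position 0 would pick a1, a2 and a3, all in the same datacenter; the two-pass version trades one of those for a node in the other site.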
13. Re:Twitter needs scalability experts by lisany · 2010-02-23 13:06 · Score: 1
  
  I lost interest in Oracle when I had to develop for it and learned that it can't support index names of more than 32 characters in length. Oh, and needing a degree to figure out licensing costs doesn't help either; best hope for a non-crooked VAR to set you straight.
14. Re:Twitter needs scalability experts by Bazouel · 2010-02-23 13:20 · Score: 1
  
  I am curious what someone with your experience thinks of PostgreSQL ? Would you say that it can scale properly as Oracle does ?
  This is a genuine question as I am deciding between the two for my startup. Even though I have already done my investigations, one more opinion cannot hurt :) Assuming my current DB design holds, it will have about 50 tables, most having fewer than 10,000 records and some having a few million records (they will be partitioned). The volume of reads will be much higher than writes. Write queries will involve mostly 1-2 tables and short transactions. Typical read queries will require many joins (though most can be cached or materialized as the data is quite stale).
  
  --
  Intelligence shared is intelligence squared.
15. Re:Twitter needs scalability experts by Anonymous Coward · 2010-02-23 13:34 · Score: 0
  
  That's because you don't need index names more than 32 characters in length. Seriously.
16. Re:Twitter needs scalability experts by Eil · 2010-02-23 13:37 · Score: 1
  
  I laugh at how none of the requirements included disaster recovery. No single point of failure does not preclude failing at every point simultaneously. EMP bomb at your primary datacenter anyone?
  1) They never said they didn't plan for disaster recovery. It's silly to deride them for not discussing the entirety of their backups and disaster recovery efforts when the whole topic of the article was their move to Cassandra as a primary data store.
  2) Disaster recovery looks at realistic threat scenarios. Fire, sabotage, natural disaster, and so on. "EMP bomb at your primary datacenter" is wholly unlikely. Nobody can plan for failure at all points simultaneously because "all points" includes everything in their entire operation including backups and redundant systems. What do you want them to do, make hourly offline backups and bury the tapes under a mountain in China? The point of DR is to make your systems diverse, redundant, and operable against a broad category of general failures. Not fully invulnerable to every random specific movie plot threat someone happens to come up with.
17. Re:Twitter needs scalability experts by magus_melchior · 2010-02-24 08:03 · Score: 1
  
  EMP bomb at your primary datacenter anyone?
  I'm pretty sure that's what Faraday cages are for. I know that EMP bombs (AKA nuke detonation in the upper atmosphere) are a favorite doomsday scenario, but with the right electrical hardening (re: Switzerland), they're pretty easy to defend against.
  Now, fires (and by "fire" I mean something like "thermite"), floods, dirty bombs, and earthquakes, on the other hand...
  
  --
  "We are Microsoft. You shall be assimilated. Competition is futile."
18. Re:Twitter needs scalability experts by Tablizer · 2010-02-24 08:22 · Score: 1
  
  Maybe G.W.Bush selects their index names
  idx_mississippian_transactionicational_userites_registerification_trackeratizing
  
  --
  Table-ized A.I.
19. Re:Twitter needs scalability experts by Anonymous Coward · 2010-02-24 12:50 · Score: 0
  
  For $40000/core, if the customer said they needed 33+ characters, frankly I would give it to them.
20. Re:Twitter needs scalability experts by Anonymous Coward · 2010-03-01 03:37 · Score: 0
  
  XKCD pasted, virgin detected
Speed *isn't* scalability by Colin+Smith · 2010-02-23 09:22 · Score: 1

Speed is latency. (how long it takes)
Scalability is throughput. (how many concurrent). Or put another way; Speed is the quality, throughput is the width.

who cares what twuufter is running off.
Well, developers, and their managers do. They're nothing if not fashion victims.
RDBMSs aren't the be all and end all of scalability (or speed; they perform a shit load of management functions you may or may not need). While attempting to scale a conventional rdbms you get into write consistency problems and lookup performance problems unless you specifically design your data structures properly. You end up fighting with the relational data model.
Most developers never even think about it, they just develop against their local mysql install and are overjoyed that their app actually runs. Not all apps even need an rdbms. I've seen apps with a single table, two columns, one of which is a key and it's running on an rdbms, because that's what you do... The words WTF sprang to mind.
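The speed/scalability distinction above can be put into numbers with a rough sketch. It assumes a steady state in which a fixed number of requests are always in flight, so throughput is simply concurrency divided by latency; the function name and the example figures are made up for illustration.

```python
def throughput_per_sec(concurrency, latency_ms):
    """Requests completed per second with `concurrency` requests
    always in flight, each taking `latency_ms` milliseconds."""
    return concurrency * 1000 / latency_ms


# One fast box: low latency ("speed"), but only one request at a time.
fast_single = throughput_per_sec(concurrency=1, latency_ms=2)

# Ten slower boxes: five times the latency, yet more total throughput.
slow_cluster = throughput_per_sec(concurrency=10, latency_ms=10)

print(fast_single, slow_cluster)
```

Each request on the cluster is slower, yet the cluster as a whole completes twice as many requests per second, which is the sense in which speed and scalability are different measurements.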

--
Deleted
Too bad they don't know about TPF/ZTPF and TPFDB/ACPDB by emes · 2010-02-23 09:24 · Score: 1

It's always funny to read things written by people who obviously are inexperienced with high volume transaction processing in the mainframe environment. The systems behind airline, rail, and hotel reservations as well as emergency response messaging often are built on IBM mainframes using TPF/ZTPF as the operating system and TPFDB (formerly known as ACPDB) as the underlying database. If someone would take the time to study TPFDB, they would notice its nonrelational character, as well as some interesting similarities to what the Cassandra developers unknowingly chose to do. By the way, these systems are happily handling 10K-12K transactions per second without bunny farm racks of servers.
Sometimes progress is not always about what will be done, but understanding the benefits of older things that have been done.
Re:Too bad they dont about TPF/ZTPF and TPFDB/ACPD by Anonymous Coward · 2010-02-23 09:55 · Score: 0

> By the way, these systems are happily handling 10K-12K transactions per second without bunny farm racks of servers.
The airline systems have entire *data centers* instead, to say nothing of the enormous transaction processing infrastructure in between.
To the clueless mod (Homer Simpson, is that you?) by einhverfr · 2010-02-23 10:06 · Score: 1

Flamebait?
Do I have to spell out the joke to people?
Or is it just that nobody reads Homer anymore.

--

LedgerSMB: Open source Accounting/ERP
Java / JVM Wins Again ... by zuperduperman · 2010-02-23 10:09 · Score: 1

It's fascinating how after initially being a posterboy for the post-Java revolution Twitter is gradually moving their architecture to the JVM, piece by piece. I think it's actually a credit to them that they seem to have level heads and are evaluating technology on its merits (whereas if you talk to most of the ruby / python crowd they would rather stick toothpicks in their eyes than endorse a solution that involves java).
1. Re:Java / JVM Wins Again ... by codepunk · 2010-02-23 10:18 · Score: 3, Funny
  
  Until recently I thought the same way, I would never endorse a solution that involves java. However
  I recently came to the same realization that sun did when they created it. Java is a fantastic
  way to over sell gobs of expensive hardware. I am a system administrator so the more hardware it takes to
  run a solution the better off I am, more machines, more money and better job security. So I have now
  fully jumped on the java bandwagon, java makes me smile.
  
  --
  
  Got Code?
2. Re:Java / JVM Wins Again ... by zuperduperman · 2010-02-23 11:31 · Score: 2, Informative
  
  Sure - but I think the whole point is that you'd be smiling even more if they were using one of the modern & trendy dynamic languages because you'd likely have 2 - 3 times the amount of hardware to look after. I'm not sure what alternative you would propose that uses less hardware but there actually aren't many that are better than the JVM these days.
3. Re:Java / JVM Wins Again ... by DragonWriter · 2010-02-23 11:31 · Score: 1
  
  It's fascinating how, after initially being a poster boy for the post-Java revolution, Twitter is gradually moving their architecture to the JVM, piece by piece.
  I think it's fascinating, too -- but probably in a very different way than you do. You seem to think that it is a repudiation of some mythical "post-Java revolution", when in many ways I think it is a validation of exactly the approach that was common to pushing Ruby, Python, and similar languages as more agile alternatives to Java. The appeal of tools noted for their suitability for rapid development of software that works and is maintainable, even if it isn't going to set any kind of performance records, is that it supports getting new functionality (and, thus, often new businesses) off the ground, and supports the kind of rapid change that is often necessary when a product is first exposed to a mass market, gets used in new and unexpected (by the developers) ways, etc. The right time to optimize performance is often once the concept is validated; trying to do too much of that too early means you lose agility in introduction and early development of the product.
  Shifting, component by component, to more "enterprisey" solutions as a service/product matures is entirely consistent with that understanding.
  
  (whereas if you talk to most of the Ruby / Python crowd, they would rather stick toothpicks in their eyes than endorse a solution that involves Java).
  I don't think that's particularly true. Sure, some of the people in any language community are going to be partisans for that language exclusively, but the Ruby community (which I'm more familiar with than the Python community) seems particularly friendly to Java as a platform, and to Ruby being used in the role of a "glue" language instead of an exclusive language.
  In the case of the Ruby community, I think that the appearance of anti-Java sentiment there stems largely from the early days of Rails, where lots of people were pushing Rails by extolling (often in a rather hyperbolic manner) its virtues as compared to enterprise-oriented, XML-configuration-heavy, Java frameworks.
Re:Too bad they dont about TPF/ZTPF and TPFDB/ACPD by Anonymous Coward · 2010-02-23 10:33 · Score: 0

Nice point. Thanks for this. Data/transaction processing is not really my area of expertise, but I've always worked with the thought that nothing I'm doing is new on a technical level. This goes to show it. What the F/OSS community should focus on, perhaps through research groups, is human-computer interaction. This is a relatively new field of study - maybe 20 years old - and there's a lot less catching up to do. My conspiracy theory hat of yesteryear would probably take a stab that this is why Oracle cut funding to the accessibility projects of Sun/GNOME: just to extend the gap between free and commercial HCI offerings.
They considered Voldemort by Anonymous Coward · 2010-02-23 10:35 · Score: 0

But found that its backup policy required horcruxes.
Re:Too bad they dont about TPF/ZTPF and TPFDB/ACPD by einhverfr · 2010-02-23 11:29 · Score: 1

Teradata seems to win typical OLTP and OLAP benchmarks. I would think for airline reservations and such that would be my choice of platform.

--

LedgerSMB: Open source Accounting/ERP
Open Source Parallel Databases by cervo · 2010-02-23 11:39 · Score: 1

A lot of the complaints from NoSQL seem to be regarding DBMSes being too slow and SQL being too hard. And yet a lot of them invent query languages/query languages similar to SQL. Supposedly Oracle scales up really well. There is a paper that compares MapReduce to parallel databases and Hadoop takes a huge beating via the RDBMSes in performance. Now the funny thing is that Oracle was not included, yet most contend that if you pay enough Oracle scales really well. DB2 also scales, because in 1999 I worked at a place with terabytes of database space and they had a few nodes running DB2 on AIX boxes and seemed to be getting adequate performance.

But most open source databases seem to not be able to compete with the likes of the commercial parallel databases. But it seems like an open source parallel database would do a lot to silence many NoSQL critics. There is still the complaint about needing to define a schema; however, if you are not exploring the data and are processing the same data over and over again, it seems like a good idea to define a schema anyway, since that way you can better detect files that don't conform.
1. Re:Open Source Parallel Databases by einhverfr · 2010-02-23 12:50 · Score: 1
  
  But most open source databases seem to not be able to compete with the likes of the commercial parallel databases. But it seems like an open source parallel database would do a lot to silence many NoSQL critics. There is still the complaint about needing to define a schema; however, if you are not exploring the data and are processing the same data over and over again, it seems like a good idea to define a schema anyway, since that way you can better detect files that don't conform.
  I have actually thought it would be really cool to come up with a REALLY NICE parallel-processing-capable shared-everything-clustered db. I suspect this could be done by modifying PostgreSQL in a number of ways: changing shared memory operations to file operations, and changing semaphores to a DLM system. Unfortunately a lot of this ends up causing performance loss on the low-end. While Greenplum offers a nice lower-cost Pg spinoff that handles these things in an OLAP environment, I think there is no OLTP equivalent.
  The real issues with FOSS databases in this area have to do with parallel query execution across servers. One reason that DB2 and Oracle can scale so well is that the query can be shared across nodes on a server, and this allows you to run BI stuff on a server with better performance. When combined with clustered filesystems, DLM-based locking, etc. you can get something that scales up very well for high traffic databases. Add high-end battery-backed caches for RAID 1/0 arrays and your throughput keeps going up both on read and write.
  You can get remarkably good performance with PostgreSQL, but the performance is limited by the lack of parallel execution of queries. Consequently a query takes how long it takes and beyond a certain point, throwing hardware at the problem won't fix it. On the other hand, for very many applications it is quite good, and for OLTP environments I haven't yet run into a problem I couldn't solve with it.
  
  --
  
  LedgerSMB: Open source Accounting/ERP
2. Re:Open Source Parallel Databases by maraist · 2010-02-23 16:55 · Score: 1
  
  [complaining that] "SQL being too hard"? Well, one can assume you can ignore this class of amateurs - there's no lack of free learning tools for SQL - and it's dirt simple.
  
  "And yet a lot of them invent query languages/query languages similar to SQL. " - See, I think you're magically associating two classes of programmers. There are people, like myself, that love the expressiveness of SQL over virtually any other language for data-set manipulation. Thus we would like, as an option, to utilize SQL on even a simple key-value store. And there are tons of noSQL solutions that provide SQL front ends (hell, there are SQL front-ends to CSV stores). You typically lose efficiency at this point, but for rarely run reports, the lack of bugs makes it worth it.
  
  " Hadoop takes a huge beating via the RDBMSes in performance" - By Hadoop, I assume you mean HBase which runs on top of several layers of technologies - the lowest of which is Hadoop. Naturally this layering produces inefficiencies. Consequently, things like HyperTable came about as functional equivalents of HBase without all the layers (and written in raw C I might add). When people say scales well, they typically mean runs slowly on a given node. And thus something like HBase requires several dozen machines before it can overtake an optimized single-node (mysql/Oracle/what-have-you). Then when you jack up the performance of the single node-cluster (Oracle RAC), you need a lot more machines before you can overtake. This may not make sense for 90% of companies out there - having 1,000 machines just isn't practical and the maintenance costs will be killer in year-2. For Google it made absolute sense. They simply can't make a single DB configuration go fast enough. And therein is the driving model that noSQL is trying to replicate.
  
  "Now the funny thing is that Oracle was not included" - yeah funny how Oracle has a clause in their license that says you may NEVER publish performance results. Guess why. Makes it easier to suck suckers in to paying $100k, only to find that a mysql setup is faster on the same hardware. Yes, you can spend $200k on Oracle and have it faster than mysql will ever be, but you didn't budget for that when you were suckered in.
  
  "1999 I worked at a place with terabytes of database space" - A petabyte of archive data is not the same as 100 gigabytes of actively manipulated data when you can only get 32Gig of RAM on the box (such that any random indexed lookup is almost guaranteed to hit the disk). It's somewhat easy to add disks to a virtual server cloud. Use iSCSI via 100 mounted partitions into a tablespace that spans all 100 partitions (in linearly appended mode) - you can do this with mysql today. Not sure what the limits of LVM are, but you could do it that way too. Not too expensive either if you use cheap 2TB disks in sufficiently RAIDed configurations. This used to be mainframe class (tons of IO with fully redundant hardware was their mantra). But my point is there are problem-spaces that make this not scale with RDBMS - unless you treat the offending table as a simple key-value store, such that you can shard it - thereby not properly utilizing the RDBMS.
  
  "But it seems like an open source parallel database would do a lot to silence many nosql critics" - you're not going to silence people that think of data as simple key-value pairs, or highly specialized full-text-searching (which is related to but independent of RDBMS activity). Or even as log-file-processing (such as apache page-view reporting). These are things that RDBMS isn't the best solution for. It CAN do these things, which is why it's become the multi-use hammer. But, when batch processing 10 million records a day, I have the choice of having a 30 minute load time in the RDBMS and a pretty hefty sustained load over several hours (due to the random-seeks that can't fit in memory). Or I can just store to a flat text file CSV, and maintain a cursor (I mean file-handle) to the last read item, and both load and process the entire
  
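  For what it's worth, the flat-file idea above (append to a CSV, remember where the last read stopped) can be sketched roughly like this in Python. This is only an illustration under stated assumptions: the function name process_new_rows, the separate offset file, and the one-record-per-line layout (no newlines inside quoted fields) are all made up for the example, not anything from a real batch system.

```python
import csv
import os


def process_new_rows(csv_path, offset_path, handle_row):
    """Process only the rows appended to csv_path since the previous call.

    Hypothetical helper: the "cursor" is a byte offset kept in offset_path.
    Assumes one record per line (no embedded newlines in quoted fields).
    """
    # Load the byte offset saved by the previous run (0 on the first run).
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)

    processed = 0
    # Binary mode keeps seek()/tell() simple and exact.
    with open(csv_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            # Stop at end of file, or at a partial line still being written.
            if not line.endswith(b"\n"):
                break
            row = next(csv.reader([line.decode("utf-8")]))
            handle_row(row)
            offset = f.tell()  # advance the cursor past this complete line
            processed += 1

    # Persist the cursor so the next run starts where this one stopped.
    with open(offset_path, "w") as f:
        f.write(str(offset))
    return processed
```

  Each run is a sequential read from the saved offset, so there are no random seeks; the price is that you get no indexes and no ad-hoc queries.
  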
  --
  -Michael
3. Re:Open Source Parallel Databases by einhverfr · 2010-02-23 17:30 · Score: 1
  
  "But it seems like an open source parallel database would do a lot to silence many nosql critics" - you're not going to silence people that think of data as simple key-value pairs, or highly specialized full-text-searching (which is related to but independent of RDBMS activity).
  Simple key/value pairs work for some things. Most data cannot be managed reasonably as key/value pairs. And full text searches are entirely orthogonal. That involves searching through text rather than questions of semantic information management. While some things are better stored in flat files or simple non-SQL db's, or even text files, richer information is better handled in an RDBMS IMNSHO.
  As for parallelism, I think you misunderstand the problem. On DB2, Oracle, Teradata, etc. the queries themselves are broken down into subsets, and each node runs a piece of the query. The data is handed back and processed. On PostgreSQL, each query runs in a single-threaded process. You just can't scale the processing high enough this way. You can scale up the disk access to a point, but between that and the processing power limitations, this only goes so far.
  The real solution is to have the capacity for real parallel execution on shared-everything clusters. This currently cannot be done on any of the major FOSS RDBMS's. It requires DLM-based coordination, avoiding inter-thread and inter-process communication and using distributed methods instead, etc. This will cost performance on the lower end.
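  As a toy illustration of "each node runs a piece of the query" and the results get handed back and combined, here is a small Python sketch. It is not how DB2, Oracle, or Teradata implement it: worker threads stand in for nodes, in-memory lists stand in for data partitions, and the names partial_aggregate and parallel_average are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(rows):
    """The piece of the query one 'node' runs on its own partition.

    Returns a partial (SUM, COUNT) instead of a final answer, so the
    coordinator can merge results from every partition.
    """
    total = 0
    count = 0
    for amount in rows:
        total += amount
        count += 1
    return total, count


def parallel_average(partitions):
    """Scatter an AVG() over the partitions, then gather and merge.

    Threads stand in for cluster nodes here (an assumption made only
    to keep the sketch self-contained).
    """
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, partitions))

    # Merge step: AVG cannot be averaged again, but SUM and COUNT can be added.
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None
```

  The point of the sketch is the shape of the work: the expensive scan is split up, and only small partial results travel back to the coordinator. A single-process query, by contrast, has to walk every partition itself.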
  
  --
  
  LedgerSMB: Open source Accounting/ERP
How hard can it be? by FloydTheDroid · 2010-02-23 11:52 · Score: 1

I was going to try to write something funny about twitter only needing three tables to run and how hard is it to change but then I thought about how much money they're going to make off those three tables and I started to cry.
Amazing by caller9 · 2010-02-23 15:32 · Score: 1

Cassandra has the goods for high availability and is optimized for non-financial data.
That said, I am amazed at how much time, money, and effort has gone into Twitter.
Now a distributed scalable super duper database will keep track of who is pooping. http://poop.obtoose.com/
Re:Too bad they dont about TPF/ZTPF and TPFDB/ACPD by TheSunborn · 2010-02-23 22:44 · Score: 1

The problem is that 10K-12K transactions is 1/100 of what Twitter needs.
he he he, hit the soft spot, did I, twater? by roman_mir · 2010-02-24 05:57 · Score: 1

http://slashdot.org/~roman_mir/comments - I imagine a twater storm of moderation points was spent well this time: every single post I had on this issue was above 3 points, and now, within 1 hour, all comments were moderated down. To me that's just funny - someone does not like the truth.
I just wonder is it the twater birds or does it have something to do with the nosql ideologists?

--
You can't handle the truth.
That's the D part. by Anonymous Coward · 2010-02-24 18:48 · Score: 0

Now, how about the ACI part?