Horizontal Scaling of SQL Databases?
still_sick writes "I'm currently responsible for operations at a software-as-a-service startup, and we're increasingly hitting limitations in what we can do with relational databases. We've been looking at various NoSQL stores and I've been following Adrian Cockcroft's blog at Netflix which compares the various options. I was intrigued by the most recent entry, about Translattice, which purports to provide many of the same scaling advantages for SQL databases. Is this even possible given the CAP theorem? Is anyone using a system like this in production?"
Just store everything in a big XML file.
It would be a lot easier to talk about solutions if you said which limitations you run into.
Is your dataset to large (large tables), are you having to much joins, too many transactions per second? In short, what is the problem we're trying to solve here?
After that go diagonal, you get a preemptive database which can guess your sql needs.
Learn partitioning principles, get a database product that does partitioning properly, learn normalization, never worry again about not being able to scale with relational databases. It just requires some real skills but relational databases really do scale all the way up.
Another idea is to scale using other layers, if there are problems at the SQL server level.
At the lower areas, one can go with a mainframe (parallel sysplex) and have geographically separate pieces of hardware acting coherently.
At the higher layers, have the app use multiple SQL servers and handle the redundancy in this layer.
Call me skeptical but there are companies out there with massive amounts of data in relational databases, if you as a setup are "constantly hitting limitations" you're either a very odd startup or using it very wrong. As long as the volume is small you can make almost anything happen on SQL. Hell, most small business I've known run mostly on Excel. Somehow I don't see a startup needing NoSQL unless they specialized in processing huge amounts of data, in which case trying to make slashdot work on your core business seems stupid. But maybe I missed something...
Live today, because you never know what tomorrow brings
Translattice is not consistent... it is eventually consistent ...
Given my past 12 years between working at consultancies and start ups, I've seen this a few times. It's usually not a technical hurdle, it's a "We can't solve this problem within our budget" problem. Either by going out and hiring someone who is an expert at performance tuning with their DB of choice or moving from certain db's to real databases that could handle the work like MSSQL, DB2, Oracle, or in some cases Teradata if dealing with Data warehousing.
Because I've worked around some very large database installs in my day. Every time the scaling question/problem came up, it was solvable with RDBMS's, but the solution wasn't cheap.
"The problem with socialism is eventually you run out of other people's money" - Thatcher.
"I'm currently responsible for operations at a software-as-a-service startup, and we're increasingly hitting limitations in what we can do with relational databases. "
Relational databases scale to pretty amazing heights. The notion that you are hitting some limit of relational databases at a startup stretches the imagination. I mean, really, you've already hit exabyte data sizes? That's typically where relational starts to struggle.
You really need to define your problem with much greater specificity to get a valuable answer.
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Let Google worry about it. Pricing is stupid cheap.
I used to work at a managed service provider and we often had clients complain that the SQL Server was slow or did not scale. 99 times out of 100 the issue was that their code was horribly inefficient. Either it was eating up connections or executing inefficient queries thousands of times more than necessary.
It's often hard to convince the developers that their code is bad, but if you do some profiling, capture the most frequent queries, and show them the results, that may help.
If in fact the code is behaving and you are still having trouble scaling, here are a few hints:
1. See if there is some caching that you can do on the application tier
2. Reorganize and index your data structure to optimize for the queries that you find are inefficient
3. Separate the database logically onto separate servers.
If it was really good, it would create itself, if it hasn't already.
rewriting history since 2109
Please post the name of your company so we can learn more about what kind of data you're storing and what kind of issues you are seeing. And so we can avoid using your services until you hire somebody competent. Thanks.
I didn't expect we'd be on Slashdot just yet. I'm Michael Lyle, CTO and cofounder of Translattice.
With regards to the original submitter's question, we'd love to talk to him. How much we can help, of course, depends on the specific scenario he's hitting.
What we've built is an application platform constituted from identical nodes, each containing a geographically decentralized relational database, a distributed (J2EE compatible) application container, and distributed load balancing and management capabilities. Massive relational data is transparently sharded behind the scenes and assigned redundantly to the computing resources in the cluster, and a distributed consensus protocol keeps all of the transactions in flight coherent and provides ACID guarantees. In essence, we allow existing enterprise applications to scale out horizontally while keeping the benefits of the existing programming model for transactional applications, by letting computing resources from throughout an organization combine to run enterprise workloads.
Current stacks are really complicated, multi-vendor, and require extensive integration/custom engineering for each application install. We're striving to create a world where massively performing infrastructure can be built from identical pieces.
Perl seems to work well for me. You may want to try it.
Some drink at the fountain of knowledge. Others just gargle.
The post is so vaguely worded, I imagine the author is merely trying to find some justification to purchase some new toys. "See, Slashdot people think this is a good idea!"
I agree with most of the posts so far -- if you're truly hitting a limit, you are most likely doing something wrong. Hire an outside DBA to make recommendations if you don't have the resources in-house. I strongly suspect this is the real issue.
story. It's pretty obvous from "I was intrigued by the most recent entry, about Translattice, which purports to provide many of the same scaling advantages for SQL databases."
Yours In Akademgorodok,
Kilgore Trout
Most of the time, when someone says they're having trouble scaling their database, it's a case of a developer that has an incorrectly configured database. Installing MySQL is easy, but configuring it is VERY difficult. That's why you need a person with very specialized knowledge to properly configure a database for efficiency or throughput or whatever you're going for in your specific case.
It would be like saying that anyone can go to a hardware store, buy some lumber, and nail them together to create a rudimentary shelter, but if you want a *house*, something that will weather the elements and keep you warm and comfortable and secure, you need to hire a professional carpenter.
Set up and scaling has been really easy in comparison to similar MySQL clusters I have set up previously.
I recently read that someone moved their large operation from Cassandra to Hbase, a hadoop file system. http://hbase.apache.org/
HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
HBase is an open-source, distributed, versioned, column-oriented store modeled after Google' Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase includes:
Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
Query predicate push down via server side scan and get filters
Optimizations for real time queries
A high performance Thrift gateway
A REST-ful Web service gateway that supports XML, Protobuf, and binary data encoding options
Cascading, hive, and pig source and sink modules
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
HBase 0.20 has greatly improved on its predecessors:
No HBase single point of failure
Rolling restart for configuration changes and minor upgrades
Random access performance on par with open source relational databases such as MySQL
TossableDigits.com: Temporary Phone Numb
Is to write better queries, I mean how hard can it be:
select * from (select * from A,B,C,D,E,F,G WHERE A.ID=B.AID(+) AND B.ID=C.BID(+) AND C.ID=D.CID(+) AND D.ID=E.DID(+) AND E.ID=F.EID(+) AND F.ID=G.FID(+) order by F.name ASC) where F.name='zzzzz'
Everything will work out, I swear.
Bye!
Are you sure you are hitting a limitation of the RDBMS or a limitation in the way your services are built? I'm just a little skeptical that a SaaS startup is already hitting limits with what you "can do with relational databases". How many hundred terabytes are you talking about here?
Usually when I hear this I see a PHP application which hits the database synchronously for every request. Or worse, a Java/Python/Ruby/.NET/whatever application built like it was a PHP app.
http://thedailywtf.com/Articles/Lyle-Can-Do-Anything-Better-Than-You.aspx
*SCNR*, don't take it personally :)
What limits are you hitting. And why are you mentioning but one of the many solutions to your problem one which is probably mighty expensive compared to the other solutions.
If you're genuinely hitting a limit, you're doing it wrong. You're probably not Google so most likely you're having issues scaling your proprietary and expensive SQL database (Oracle, MSSQL) but don't want to buy more $10-20k licenses. Most likely you can fix it by simply throwing better and more hardware at it (SSD, more hard drives and RAM) and while you're at it changing to a cheaper database solution (MySQL or PostgreSQL) which you can scale further for less money.
Custom electronics and digital signage for your business: www.evcircuits.com
I work with some very high traffic sites, storing large data sets (100GB+).
Depending on the application (if it allows for different write-only/read-only database configurations) we'll have a master-master replication setup, then a number of slaves hanging off each MySQL master. In front of all of this is haproxy* which performs TCP load balancing between all slaves, and all masters. Slaves that fall behind the master are automatically removed from the pool to ensure that clients receive current data.
This provides:
* Redundancy
* Scaling
* Automatic failover
The whole NoSQL movement is as bad as the XML movement. I'm sure it's a great idea in some cases, but otherwise it's a solution looking for a problem.
(*) http://haproxy.1wt.eu/
Just because you disagree doesn't make it offtopic or flamebait.
depending on what specifically you're trying to do, it may be the way to go.
Look at ParAccel.
"I'm currently responsible for operations at a software-as-a-service startup, and we're increasingly hitting limitations in what [OUR OUTSOURCED INDIAN DEVELOPERS] can do with relational databases.
FIXED IT FOR YOU MY PRETTY LITTLE OPERATIONS MANAGER. (Just using all caps to make you feel more at home)
Save the headaches and just use SQL Server
Have you looked at voltdb ? http://www.voltdb.com .
My 2 cents.
Depending on your intended application this may help: http://www.infinidb.org/
NoSQL isn't some passing fad invented by high school kids.
Luckily, most of you will probably never discover that fact for yourselves, because you'll never have experience with a successful internet-scale architecture. Relational DBs are just fine for internal "enterprisey" apps, or for your hobby website that drives an astounding 1200 page views/month, or for your failed attempt at launching a web service that only ever garners 300,000 users, so you can continue to delude yourselves that there just isn't a problem here, and SQL is the only skillset you'll ever need.
For the elite few who actually achieve success, you'll totally know where the OP is coming from. Intimately. And you'll either be very glad that there is a path (hadoop, cassandra, mongodb, etc) to migrate to that solves your problems, or you'll be very glad that you started with one of those solutions in the first place.
The fastest DB I've ever used is based on PICK OS/DB. Reality is the retail name for it now(essentially an emulator with an API for *nix/Windows). The military used it for inventory tracking and various companies still use it today for a great deal of things. ADP uses it for extremely large databases with tons of history for accounting, financials, inventory, etc. Even very old systems with 20+ years of data are very responsive/quick(these systems are running Digital Unix 4 with Alpha processors) Pick/Reality is a hashfile oriented multivalue database. Wikipedia has a pretty good explanation and I believe Northgate Systems markets Reality today
http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html
https://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL
Take a look at SQL alternatives. There's an interesting PostgreSQL use case which uses only open source tools to achieve a good horizontal scaling solution. This post tells a little about how they did it: http://highscalability.com/skype-plans-postgresql-scale-1-billion-users The post also says that there is a very similar approach using MySQL.
Have you looked at TokuDB for MySQL (http://tokutek.com)
You're going to stay stuck there (and getting progressively worse) until the designers of your database start to implement 5th normal form.
That means taking into account the relationships between data elements and implementing them as something other than aggregated tuples.
The aggregation problem is getting worse as you try to implement new relationships.
MSBPodcast.com The opinions expressed here are my own. If you don't like 'em... Think up your own stuff.
The real question is *what* you are doing with relational databases and what are you trying to accomplish. There is rarely a magic new buzzword or hip new thing that magically solves all your problems.
Are you doing important financial calculations, or anything that requires ACID? If the answer is yes, then you may want to investigate sharding and figure out ways to safely split your data amongst multiple databases while still insuring that you can do a transaction on one system for important situations.
If you are serving up read heavy web content that doesn't need any fancy transactions or specific versions you can easily set up an intermediary cache with something like memcache, or with many of the NoSQL storage systems available (which typically act like memcache with disk backing and persistence). This might mean that you have your primary source in a SQL database or perhaps just the portions that require ACID and then you periodically sync changes from your SQL to memcache or your NoSQL system.
If you are doing lots of writes you may want to consider a system of local storage per node, whether it be SQL, NoSQL or something else and then aggregate and synchronize that at a later point for reporting or some sort of centralized usage.
SQL and NoSQL are just tools, not a religious philosophy. If you need to screw in a screw, don't use a hammer. The same goes for using or not using a relational database, I find that most business need a relational database somewhere in their technology stack, this is because businesses ultimately deal with things that relate to making money... and dealing with money properly often requires ACID features and transactions.
Lastly, remember that disks are slow and memory is fast... sounds silly but people seem to forget where their data is coming from and bottleneck the disk on just about any web application that is slow.
Luckily, we won't become so condescending, I hope ;-)
Really, NoSQL, MongoDB, Cassandra sounds neat. However, I've yet to encounter a design where they are truly superior. However, they would be neat to use in the right project..
MSSQL...Oracle....
I've been working on a highly scalable system where we have managed to alleviate the load on the database to near zero and hence save some big bucks on Oracle licensing.
The application is java based using an ORM (hibernate) with L2 caching. The cache is distributed using an open source technology called terracotta. The performance we have achieved is through the roof.
www.terracotta.org
look at the specific requirements of your systems and build a custom solution.
that I have trouble imagining you have any kind of skill for tackling the issue.
You do realize you give us ZERO info on what the problem is, but do push a very specific (if not fringe) approach to your non-question ?
If your problem-solving skills are on a par with your problem-describing skills, you're in for hard times.
The Cloud - because you don't care if your apps and data are up in the air.
Relationahl databases do not scale when the person(s) working schema and systems design have not put sufficient effort into resolving the underlying flow/concurrency hotspots inherent in their design.
At the end of the day design trumps selection of underlying data management technology when viewed at scale.
See subject. Like other sites, there are a-holes around here with nothing better to do than harass or bother others, just because their lives are f'd up, so they want others' to be as well. (This doesn't go for all /.'ers, there are some truly nice folks here and greatly informative posters too, but again there's also what you noticed: a-holes. Most of them are probably hormonally imbalanced teens I'd guess, but others are just losers that want others to feel as they do, miserable. Hilariously, that kind always wonders why their lives are so messed up. If they could only look at their negative attitude from someone else's view of them, they'd understand then I think). Anyways, if the poster of the original article reads this, I am certain he will understand, that this site, like any other? Has its share of permanently f'd up a-holes that post here regularly to try to harass and annoy others.
SeriouslyI mean I’m used to the EDITORSnot reading the articles, but at least the SUBMITTERS used to read them, then this one said "Doesn't this violate the CAP theorem?" Hmm, in the first page of the blog, it sayd "... It's a distributed relational SQL database that supports eventual consistency..." Eventual Consistency means NOT CAP.. sheesh.
Heh, just wait until you try SugarCRM.. Reading your post made me realize there are other projects out there with the _exact_ same flaws / annoyances as Sugar, love it or hate it.
I'm sure Sugar is nice if you have the 6-12 months pouring over all the code and making design for new modules and layout / DB. But for _efficiency_ it's a mirage in the desert of hopeless "open source" projects, which in reality are paywalled spagghetti monsters.
Same with Typo3.
Don't get me wrong. It CAN work. I've been on successful projects with these. Just don't underestimate the complexity of already-written code, especially when they require you to commit to SO MUCH.. I mean who else uses Typoscript or that internal MVC templating engine? Yeah right, nobody. Guess why...
It's a nightmare.
http://www.debunkingskeptics.com/
This has struck me numerous times of late. One of the problems of SQL culture and conflict is the idea that schemas are defined by a separate caste of user and/or that they are fixed.
There are application domains where end users need to be involved in schema creation/maintenance, and many SQL methodologies have failed here. Too many applications and/or "mapping layers" do not grant schema management rights to end users, and people go off to badly reinvent RDBMS when they really ought to just take the RDBMS out of its glass room.
If you use RDBMS properly, at any point during the "runtime" lifecycle of an application where there is a need for a new type of relation, you just create a new table! There is no reason this should be more difficult than inserting rows into existing tables, except traditionally narrow-minded programming tools.
I recently came across Rick Cattell's site which addresses just the questions you're asking.
Rick Cattell has written an excellent comparison guide of horizontally scalable datastores of different types (RDBMS as well as a variety of NoSQL systems).
Cattell has also written an academic paper with database expert Mike Stonebraker, which weighs the system design factors required to make a datastore scalable.
Executive summary of Cattell's work: although NoSQL may be a huge fad, the things that make a datastore scalable can be implemented in SQL RDBMS systems as well. Also, implementing do-it-yourself ACID in NoSQL systems is extremely difficult and error-prone, and is a significant advantage of most RDBMS systems. Stonebraker is the author of VoltDB, which is an open-source RDBMS designed for horizontal scalability, but they give a very fair and thorough look at competing datastores as well.
My bicyles
What about something like Greenplum (http://www.greenplum.com)? It is supposed to be able to optimize a query (optimize...I hate that word) and then execute it in parallel among multiple machines.
If people are getting inconsistencies, they're not following good CS practices in the first place. Any proper course on the subject will teach algorithms and database-design, among other subjects that alleviate problems before they arise. This is nothing new, but has been maturing for the last 30-40 years.
I don't think people will automatically "get it", by just doing an exam on the subject. However, respect and a bit of dedication for such knowledge is required, to further refine one's skills.
I believe it's because people are _not_ trained properly, or their brains are unable to comprehend sound principles, they're getting inconsistencies in the first place. That, or the tight timelines which are expected to become tighter and tighter, until people are out of jobs..
"NoSQL"-practices has a place too, but not for most projects. If you're using NoSQL to "save time" not thinking about schemas, you're most probably inviting a world of hurt later, when inconsistencies, vertical scaling and immature support/reporting-tools catch up to you. The more value of the information in the system, the more hurt.
Of course, information on Facebook and its equivalent has almost 0 value, and don't really need to be that consistent or up-to-date. Although it has interesting scaling challenges, facebook is hardly the standard to meet for most serious projects out there.
This is coming from someone who has reviewed numerous "nosql" solutions, but not yet found something more compelling than RDBMs for general projects. I would very much like to use them for "fun" however, but seem unable to give up on so many sound practices, just to squeeze a bit more juice out of the system. Often, I find parallellizing the task gets the job done just as quick, with a "relational" solution, and with less headaches and support-nightmares later..
Just being able to use some mature support-tools is enough to make my decision. Hopefully, "nosql" will mature and become a viable alternative. Right tool for the right job and all that..
http://www.debunkingskeptics.com/
KDB+ - http://kx.com/ will handle it.
There's no universal general purpose answer to your question, especially not with the level of detail you have posted. To suggest otherwise would be irresponsible.
For some applications, db scaling is easy. For others it may require some enormously complicated considerations about things like indexing and transactions.
-fb Everything not expressly forbidden is now mandatory.
Can you shard the same SQL data store in Chicago, London, and Tokyo? Not with standard SQL databases, unless you write your own complicated replication techniques or pay through the nose. (See CAP Theorem).
Yes, the company I work for has expressed the world-wide SQL database need, so this is not just a thought experiment.
Have you heard of GemFire/GemStone, VoltDB, or Xeround?
If you can get rid of the SQL requirement, try
XML (or other format) on Amazon's S3
or try one of the NoSQL databases, such as MongoDB, Riak, or CouchDB.
All of the above scale horizontally, most even scale in a geographically diverse environment.
Yeah, SQL will show limits at some point. You'll hear otherwise from people who think it'll work for you because there's about 1000x as many people who are experts in SQL vs. NoSQL. I am an expert in NoSQL, and suggest you look at MarkLogic, MongoDB or something, depending on if you want a document store, key/value store or what. Be aware that your architecture and lessons learned will be different, so some consulting will be a good idea.
In particular, development and CPU costs of all that relational mapping just doesn't support all business processes anymore.
Let's put it this way -- Facebook and Google ain't runnin' on Oracle.
http://voltdb.com/
This database project tries to eliminate a lot of the bottlenecks that cause poor scalability. Here is a talk Stonebreaker gave (requires registration): http://voltdb.com/content/mike-stonebraker-sql-urban-myths-webinar-recording
From what I know of it, it puts everything into memory and uses a cluster to distribute the load. It's more interesting my description, but people should really look up the product.
Except for ending slavery, the Nazis, communism, & securing American independence, war has never solved anything.
>I'm just a little skeptical that a SaaS startup is already hitting limits with what you "can do with relational
>databases".
Or if they are, then it will be relatively easy to find people willing to help out with the cash situation.
-fb Everything not expressly forbidden is now mandatory.
If you really use the right hardware, the right database server parameters, the right indexes, the right queries and the right isolation level, you shouldn't care about performance. You should try to read database performance books and the apply the knowledge.
Anyway, have you tried to use any cache library for your software?
XML is THIS CLOSE to becoming sentient...
Sleep your way to a whiter smile...date a dentist!
I guess if you are building a business bigger than eBay, then relational databases may not do the trick anymore. If you lack imagination anyway.
See here for more information. And eBay is not the only one. I wouldn't put mission critical data on a garden variety noSQL database unless I really really hated my customers and planned to go out of business fast.
Garden variety NoSQL is great for data you can mostly throw away and no one is going to sue you as a result. Facebook, Twitter, Google. Perhaps not so much for financial transactions. If the knowledge leaked out that any bank was using a typical not-particularly consistent, nor particularly durable NoSQL system for transaction data, a run on the bank would soon ensue, at least from anyone advised by anyone who had a clue.
Buy some SSDs. OCZ's RevoDrive X2 can do 120k IOPS for just $3-4/G. RAID a few of them together for 240k IOPS, which will fix most any bottleneck in your your database. There are very few applications that need that much performance.
it's still just a parsing fancy
rewriting history since 2109
If you want horizontal scalability with near linear progression, take a look at DB2 pureScale. This scales (in the first release) to 128 member nodes, and the demonstrations at this level of scaling are very impressive. Oh, and you don't need to keep making changes to your application to get this level of scaling - unlike Oracle RAC (which can't really climb much past 8 nodes anyway without dying a horrible death from the law of diminishing returns).
I dunno what Translattice does - no source available.
The infrastucture description on its website looks kind of simillar to Askemos.org
(which is GPLed).
Recently I wrote this note, which shall become kind of a tutorial.
I hope it will give you an idea, whether ot not this could be for you:
using SQL
(So far I'm working in two environments with it: a) public (cutomer+ours) websites run from a mixture of Linux and FreeBSD on mixture of 32bit, 64bit Intel and ARM; and b) a node 5 Segate Docstar network in a suitcase. And without breaking CAP: 3/2n+1 nodes required.)
I think you should use Mauve, it has the most RAM.
If you're just a start up I would suggest that, without me knowing the size of your client base using these databases, you should probably use a vertical scaling database situation. If you're using application servers or web servers that connect to that SQL database. you would want those application servers horizontally scaled in a load balanced environment. But in every scenario I can think of, horizontally scaling a Database would have a detrimental effect on performance.
Have you tried more RAM? How about 15 Disk arrays. I take care of 10 TB (yes Tera) with good old SQLSERVER 2005 on 6 machines. 4 Million books on 25 market places means 50,000,000 prices/listings that must be worked on constantly. Full/Diff Backups every day, tran logs every 15 minutes. Oh, and hardware, even dell has stuff >twice as strong as what I'm using. sure >$100K in licenses, but don't for 1 second try to tell me about LIMITS. go double digit on your T, reprocess most of it every couple of days, and then talk to me about how big you is.
I think he's talking about a different problem altogether.
We have about 40,000 people, any one of which may want a web site. They've all got storage in our campus storage. Provisioning our web servers for all those people is just a matter of ....
<Directory "/home/*/public_html">
AllowOverride [some options]
Options [some other options]
</Directory>
All they have to do is put some html pages in place.
Now, if some subset of those users wants to put a MySQL backed blog or some other low traffic app in their html space, they've got to stand up their own MySQL, or talk to a dba and do a lot of hand holding. This doesn't scale horizontally -- lots of users doing basically the same thing. You can't say
<Directory "/home/*/public_dbspace"> ...
[appropriate database defaults]
AllowOverride [some other defaults]
</Directory>
and if the users put the right stuff in their ~/public_dbspace, they get data base service. We're not talking about high performance or large data. We're talking about large number of mostly very small users being provisioned with very little intervention on our part.
If you think about it, web servers and data bases _in_this_application_ do the same thing: they provide highly specialized interfaces to data in specifically provided files. There's no inherent reason providing the M to a LAMP stack should be any harder than the A, but configuration for the masses is clearly easier for the latter.
I recently attended an excellent mint.com talk about scaling beyond 2,000,000 users. Their setup: apache/spring/hibernate with mysql backend.
They engaged in some interesting tweats to get the multiple MySQL boxen to cope with the load but they basically come down to 'customised sharding'. Each new users data (all of it) was set to store on a random (I think) mysql box in the cluster. When the user logged in, they were connected to the correct mysql db (on one of many mysql servers) and their data was retrieved and loaded fast using some tricks with varchar keys (apparently, in mysql, they load all together in concurrent blocks).
This provides sufficient data access speed for 2,000,000 users. I've not got the number of mysql servers handy but from memory it was less than 20.
The talk was very impressive and heartening as we also deploy similar architecture (except we love Postgresql instead of mysql) and would love to have this particular user problem.