Horizontal Scaling of SQL Databases?
still_sick writes "I'm currently responsible for operations at a software-as-a-service startup, and we're increasingly hitting limitations in what we can do with relational databases. We've been looking at various NoSQL stores and I've been following Adrian Cockcroft's blog at Netflix which compares the various options. I was intrigued by the most recent entry, about Translattice, which purports to provide many of the same scaling advantages for SQL databases. Is this even possible given the CAP theorem? Is anyone using a system like this in production?"
Just store everything in a big XML file.
It would be a lot easier to talk about solutions if you said which limitations you run into.
Is your dataset to large (large tables), are you having to much joins, too many transactions per second? In short, what is the problem we're trying to solve here?
Learn partitioning principles, get a database product that does partitioning properly, learn normalization, never worry again about not being able to scale with relational databases. It just requires some real skills but relational databases really do scale all the way up.
Another idea is to scale using other layers, if there are problems at the SQL server level.
At the lower areas, one can go with a mainframe (parallel sysplex) and have geographically separate pieces of hardware acting coherently.
At the higher layers, have the app use multiple SQL servers and handle the redundancy in this layer.
Call me skeptical but there are companies out there with massive amounts of data in relational databases, if you as a setup are "constantly hitting limitations" you're either a very odd startup or using it very wrong. As long as the volume is small you can make almost anything happen on SQL. Hell, most small business I've known run mostly on Excel. Somehow I don't see a startup needing NoSQL unless they specialized in processing huge amounts of data, in which case trying to make slashdot work on your core business seems stupid. But maybe I missed something...
Live today, because you never know what tomorrow brings
Given my past 12 years between working at consultancies and start ups, I've seen this a few times. It's usually not a technical hurdle, it's a "We can't solve this problem within our budget" problem. Either by going out and hiring someone who is an expert at performance tuning with their DB of choice or moving from certain db's to real databases that could handle the work like MSSQL, DB2, Oracle, or in some cases Teradata if dealing with Data warehousing.
Because I've worked around some very large database installs in my day. Every time the scaling question/problem came up, it was solvable with RDBMS's, but the solution wasn't cheap.
"The problem with socialism is eventually you run out of other people's money" - Thatcher.
"I'm currently responsible for operations at a software-as-a-service startup, and we're increasingly hitting limitations in what we can do with relational databases. "
Relational databases scale to pretty amazing heights. The notion that you are hitting some limit of relational databases at a startup stretches the imagination. I mean, really, you've already hit exabyte data sizes? That's typically where relational starts to struggle.
You really need to define your problem with much greater specificity to get a valuable answer.
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
I didn't expect we'd be on Slashdot just yet. I'm Michael Lyle, CTO and cofounder of Translattice.
With regards to the original submitter's question, we'd love to talk to him. How much we can help, of course, depends on the specific scenario he's hitting.
What we've built is an application platform constituted from identical nodes, each containing a geographically decentralized relational database, a distributed (J2EE compatible) application container, and distributed load balancing and management capabilities. Massive relational data is transparently sharded behind the scenes and assigned redundantly to the computing resources in the cluster, and a distributed consensus protocol keeps all of the transactions in flight coherent and provides ACID guarantees. In essence, we allow existing enterprise applications to scale out horizontally while keeping the benefits of the existing programming model for transactional applications, by letting computing resources from throughout an organization combine to run enterprise workloads.
Current stacks are really complicated, multi-vendor, and require extensive integration/custom engineering for each application install. We're striving to create a world where massively performing infrastructure can be built from identical pieces.
The post is so vaguely worded, I imagine the author is merely trying to find some justification to purchase some new toys. "See, Slashdot people think this is a good idea!"
I agree with most of the posts so far -- if you're truly hitting a limit, you are most likely doing something wrong. Hire an outside DBA to make recommendations if you don't have the resources in-house. I strongly suspect this is the real issue.
I recently read that someone moved their large operation from Cassandra to Hbase, a hadoop file system. http://hbase.apache.org/
HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
HBase is an open-source, distributed, versioned, column-oriented store modeled after Google' Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase includes:
Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
Query predicate push down via server side scan and get filters
Optimizations for real time queries
A high performance Thrift gateway
A REST-ful Web service gateway that supports XML, Protobuf, and binary data encoding options
Cascading, hive, and pig source and sink modules
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
HBase 0.20 has greatly improved on its predecessors:
No HBase single point of failure
Rolling restart for configuration changes and minor upgrades
Random access performance on par with open source relational databases such as MySQL
TossableDigits.com: Temporary Phone Numb
Is to write better queries, I mean how hard can it be:
select * from (select * from A,B,C,D,E,F,G WHERE A.ID=B.AID(+) AND B.ID=C.BID(+) AND C.ID=D.CID(+) AND D.ID=E.DID(+) AND E.ID=F.EID(+) AND F.ID=G.FID(+) order by F.name ASC) where F.name='zzzzz'
Everything will work out, I swear.
Bye!
I work with some very high traffic sites, storing large data sets (100GB+).
Depending on the application (if it allows for different write-only/read-only database configurations) we'll have a master-master replication setup, then a number of slaves hanging off each MySQL master. In front of all of this is haproxy* which performs TCP load balancing between all slaves, and all masters. Slaves that fall behind the master are automatically removed from the pool to ensure that clients receive current data.
This provides:
* Redundancy
* Scaling
* Automatic failover
The whole NoSQL movement is as bad as the XML movement. I'm sure it's a great idea in some cases, but otherwise it's a solution looking for a problem.
(*) http://haproxy.1wt.eu/
Just because you disagree doesn't make it offtopic or flamebait.
Geez, you guys. There's a real person behind the question. Do you HAVE to be an asshole?
I recently came across Rick Cattell's site which addresses just the questions you're asking.
Rick Cattell has written an excellent comparison guide of horizontally scalable datastores of different types (RDBMS as well as a variety of NoSQL systems).
Cattell has also written an academic paper with database expert Mike Stonebraker, which weighs the system design factors required to make a datastore scalable.
Executive summary of Cattell's work: although NoSQL may be a huge fad, the things that make a datastore scalable can be implemented in SQL RDBMS systems as well. Also, implementing do-it-yourself ACID in NoSQL systems is extremely difficult and error-prone, and is a significant advantage of most RDBMS systems. Stonebraker is the author of VoltDB, which is an open-source RDBMS designed for horizontal scalability, but they give a very fair and thorough look at competing datastores as well.
My bicyles
Can you shard the same SQL data store in Chicago, London, and Tokyo? Not with standard SQL databases, unless you write your own complicated replication techniques or pay through the nose. (See CAP Theorem).
Yes, the company I work for has expressed the world-wide SQL database need, so this is not just a thought experiment.
Have you heard of GemFire/GemStone, VoltDB, or Xeround?
If you can get rid of the SQL requirement, try
XML (or other format) on Amazon's S3
or try one of the NoSQL databases, such as MongoDB, Riak, or CouchDB.
All of the above scale horizontally, most even scale in a geographically diverse environment.