Database Clusters for the Masses
grugruto writes "Cluster of databases is no more the privilege of few high-end commercial databases, open-source solutions are striking back! ObjectWeb, an Apache-like group, has announced the availability of Clustered JDBC (or C-JDBC). C-JDBC is an open-source software that implements a new concept called RAIDb (Redundant Array of Inexpensive Databases). It is simple: take a bunch of MySQL or PostgreSQL boxes, choose your RAIDb level (partitioning, replication, ...) and you obtain a scalable and fault tolerant database cluster."
So a few things come up just reading the docs on this:
;)
1. A Controller. It looks as tho a single controller is used by the clients to communicate to the various RAID'd dbs. I'm sure there can be multiple controllers since there would be little point to make some db's redundant, yet the access to them not. Still looking into this.
2. And also, it looks as tho the default port is 1099 - RMI. If you have, for a web app, your EJBs and web app local to that containter, that might not be a problem. However, I happen to have my EJB server on its own box and this might very well cause probs. I think it said you could specify our own ports, but I haven't seen any examples in the docs yet of this being the case. Also, still looking.
A few other things exist as well which are in the docs as known limitations:
* XAConnections
* Blobs
* batch updates
* callable statements
These could be serious issues for some. My last project used CLOBs/BLOBs, batch updates and callable statements, so this would rule that out. Of course, all the db stuff was strictly tied to Oracle, so I think that would rule this all regardless.
All in all tho, this looks like a good start. As my current project progresses, clustered dbs will become more and more of an issue. I've looked into some other projects out there for Postgres, but nothing yet really satisfactory. I think this is a good step in the right direction - for Java developers. It'll be interesting to watch.
Database clustering is typically used for high-avaliability, not performance.
There are better ways to improve the performance of a database, horizontal partitioning, federated servers, etc.
This would be very cool if there was a generic implementation; we build many Microsoft SQL clusters and just the hardware requirements for an MSCS cluster easily exceed $50k, let alone the licensing...as an MCDBA I'd consider an open source solution if I could use it as a back-end ot an ASP/VB.NET application, just to save the licensing $$ for consulting! ; )
second society
If you want to cluster Oracle, use Oracle RAC (Real Application Clusters). It's based on Parallel Server so is mature enough to put forward for consideration... and even then it might be eschewed from above. Cheap databases are not going to ring the bells of the people with the say-so simply because Oracle (and DB2 etc) are proven over the years, and the cost of losing your data because you went for the cheap option is going to lose your company a lot of money, and you your job!
Technically better, cheaper and all those good things does not mean better for a business. Databases are predominantly used for *business*, and as such a *business* reason it used when choosing one over another, not technical reasons.
MySQL AB and a few Postgre companies.
they do consultant work for there products.
I am the Alpha and the Omega-3
What you missed is that this thing only forwards SQL requests. Therefore you can also build clusters of Oracle if you want. You will not miss any Oracle feature this way.
When you look at Oracle pricing policy, you can have Oracle RAC for the price of just Oracle (+ a free RAIDb), which is already a 50% discount!
Oracle has database links.
Create a database link (for example to an AS400) and you can query the remote tables just like local tables.
select * from somelib.sometable@as400
Oracle will pass as much SQL as it can to the remote DB engine in order to keep things speedy.
LOAD "SIG",8,1
LOADING...
READY.
RUN
C-JDBC can handle more than just full partitioning or replication, it also provides partial replication (a little bit like you would use RAID-5 with disks).
The idea is that with full replication you have to broadcast the write to all databases (to be consistent) and you can only balance the reads. By controlling the replication of each database table, you can have scalable performance. Look also at the nested RAIDb levels, it's pretty cool to build large configurations.
Some tests have been done with TPC-W and performance scales linearly up to 6 nodes (we did not have a larger cluster to test bigger configurations).
Sure it will not replace very large Oracle configurations at the end of the year, but it looks very promising.
Your problem wouldn't be solved with the product mentioned in the story. However, because you are using MS SQL Server, this is really easy. You just need to get the postgres ODBC Driver, and setup a Linked Server on the MS box.
Check out This page for the postgresql ODBC Driver.
You should also look at the linked servers documentation in SQL Server Books Online (under sp_addlinkedserver) as well as the interface in enterprise manager (security -> linked servers)
As I was searching a bit, I see that people have had trouble using the server.database.owner.object syntax, however using the SQL Server OpenQuery(servername, query) function seems to work, and will allow you to control the exact SQL Statement sent to postgres.
Not to argue in any way that Sun botched Java, but what I meant is, this implementation is for Java programs. It provides no functionality for programs not written in Java. Even if Sun had done Java correctly, my statement would still be true - this isn't a generic implementation, as it requires the code be written in Java. Even if Java itself were generic, this implementation wouldn't be generic, it'd be Java-specific.
;)
When I said "generic implementation" I meant "an implementation which doesn't require your programs be written in a particular language." Which is probably a bit of a pipe dream, you'd still need some sort of glue code (ODBC, JDBC, DBD, etc). But, as was alluded to above, I was trying to beat the Beowulf comment when I asked my question.
Josh, know what you're talking about before you post. MySQL (the company which does the vast majority of development of MySQL) offers a variety of levels of support and consulting, regardless of the number of systems that you admin. For $48,000/year, you get:
Does Oracle match that for the price?
This simply isn't true. Oracle's clustered database solution (9i Real Application Clusters) are designed to increase the ability to gracefully recover from individual node failures. Additionally, they can scale the performance of your database application by increasing the number of CPUs with access to shared storage. For CPU bound database applications, this technology provides near linear scalability!
Actually, if you look at RAIDb-0, it is very close to this, maybe even identical. They show having different tables on different database servers. They also indicate that C-JDBC can be used without modifcations to the application. This would imply that if you get a JDBC driver for MSSQL, a JDBC for PostreSQL, and write your code using JDBC, you should be able to do the type of selects you are talking about.
--LeBleu
If you're reading this you're part of the mass hallucination that is Kevin the Blue.
A paper on this has been submitted to a conference with a blind review process. Therefore, we are not allowed to publicly disclose the results before the notification of acceptance/rejection.
Maybe you can get a copy by asking directly one the C-JDBC team member?
Good point. The sooner people realize this, the better. Running SQL server in partition mode doesn't provide any failover. for that matter running any database in partition mode without replication provides no failover.
The C-JDBC controller embedds a recovery log that allows backends to recover from failures (check the recovery log part in the doc).
If one backend fails in the cluster, it is automatically disabled and the controller always ensures that data that are sent back to the application are consistent.
By the way, you can tune how you want distributed queries to complete (return as soon as the first node has commited, wait for a majority or safer wait for all nodes to commit). There are many options that helps tuning the performance/safety tradeoff.
After actually reading the documentation, here's my informed take on this:
1) In it's current incarnation, it's only useful for very very simple database access. No transactions, no blobs, etc. Basically if you're just storing some simple weblication tables and doing single-statements against them for selects/updates (no big cross-table transactions), you can use it.
2) It's JDBC only. Perhaps someone could port the concept to ODBC though.
3) There's a new middle tier between the JDBC driver and the database itself, which is the bulk of their code. This tier actually re-implements some database constructs like recovery logging, query caching, etc. Of course this is neccesary, as trying to do replication from the client-code side alone would be impossible (what do you do when one of 3 DB mirrors goes offline for an hour? have every jdbc client cache the requests and replay them later, hoping those clients are even stilla round later?)
For some applications and some companies, in it's current state this is a godsend - but it's not a general solution yet. Making it ODBC (or even better, having the front of it emulate a native postgresql or mysql listener) would broaden it's applicability.
Supporting transactions would be a big win too, although I'm not sure how feasible this is - I think at that point they may as well just write their own new database engine which is parallel from the start, seeing as they'll be re-implementing in their cluster tier almost everything the database server does except for actual physical storage.
Still, it's nice to see that someone did this and made it work - and for a lot of simple databases behind java apps it's all you really need.
PostgreSQL has all the transaction support in place already, so of all the free DBs out there it would seem they have the best shot at doing their own native parallelism, if they would just get it done someday.
11*43+456^2
There are many problems with this design, some have already been mentioned. There are serious issues with performing atomic updates. Modern databases use locking to allow high levels of concurrency. Foreign key constraint checking is one thing that would be very hard to implement in this design, as it is generally implemented in the indexes themselves. Likewise, to get all databases in a "RAIDb 0" group to reflect the same state, operations such as concurrent delete and insert must be completely serialized to assure consistency...serialized across all clients, not just from one source.
Furthermore, to scale up systems generally take advantage of stripping. At the IO level that means striping across multiple disks (modern convention is to stripe across all!). In a parallel database one usually stripes a single table across multiple nodes for parallel query processing. While it is possible with C_JDBC to put table X on node A, table Y on node B I don't see any provision for striping the data. It will be very difficult to use your hardware efficiently in this scenario.
If you are going to go through the trouble of implementing a complete query processor (that can handle jobs larger than ram), a full update/query scheduler (lock manager), and a journalling mechanism that can (somehow) even maintain atomic transactions (even in the face of multiple failures) then why not just build your own database. This system might be useful in certain rare cases but I wouldn't use it except possibly for replication.
JJ