One main issue I see with the Emic solution is that it does not support transactions. I saw their demo at the last LinuxWorld in SF and they are just using a multicast layer to broadcast the queries to all nodes (they don't parse SQL so they can't handle transactions properly).
Moreover, if you have queries like UPDATE ... WHERE date=NOW(), you will just get a different result on every node! At least solutions like C-JDBC replace macros such as NOW or RAND on the fly so that all databases stay consistent.
Fortunately, not all transactions are distributed, and that makes life much simpler. In most applications, a transaction deals with a single datasource at a time and doesn't need to be distributed.
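To make the macro point concrete, here is a minimal sketch of the idea: evaluate NOW() and RAND() once in the middleware and broadcast the rewritten statement. This is not C-JDBC's actual code; `rewrite_macros` and its parameters are names I made up for illustration, and the regex approach is naive (it would also rewrite text inside string literals).

```python
import random
import re
from datetime import datetime

# Hypothetical patterns for the two macros mentioned above.
_NOW = re.compile(r"\bNOW\s*\(\s*\)", re.IGNORECASE)
_RAND = re.compile(r"\bRAND\s*\(\s*\)", re.IGNORECASE)


def rewrite_macros(sql, now=None, rng=random):
    """Replace NOW() and RAND() with literals before broadcasting the query."""
    now = now or datetime.now()
    timestamp = "'" + now.strftime("%Y-%m-%d %H:%M:%S") + "'"
    sql = _NOW.sub(timestamp, sql)
    # Each RAND() gets its own value, but every backend receives the same one.
    sql = _RAND.sub(lambda match: repr(rng.random()), sql)
    return sql


fixed = rewrite_macros(
    "UPDATE t SET x=1 WHERE date=NOW()",
    now=datetime(2003, 8, 5, 12, 0, 0),
)
print(fixed)
```

Every node then executes the same literal timestamp instead of evaluating NOW() locally.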
... I assumed "XA Connections" meant standard transactions.
No, XA is the X/Open standard for Distributed Transaction Processing (DTP). If you access multiple data sources, you need two-phase commit to ensure that either all sources commit or all of them roll back. It is not acceptable for some sources to commit and others to roll back in a distributed transaction. There is a lot of literature available on the web about this.
C-JDBC fully supports transactions (else it would be completely useless). The only thing that is not supported is distributed transactions (a transaction that would span over different clusters). XA connections for C-JDBC are under development but standard transactions are fully supported.
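For readers who have not met two-phase commit, here is a rough sketch of the all-or-nothing rule described above. It is a toy model, not the XA interface and not anything from C-JDBC; the `Participant` class and `two_phase_commit` function are invented for illustration, and a real implementation also needs durable logging and timeout handling.

```python
class Participant:
    """Hypothetical data source taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote. A real source would make its changes durable here.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    # Phase 1: ask each source to prepare; all() stops at the first "no" vote.
    if all(p.prepare() for p in participants):
        # Phase 2: everyone voted yes, so everyone commits.
        for p in participants:
            p.commit()
        return True
    # Phase 2: at least one source refused, so everyone rolls back.
    for p in participants:
        p.rollback()
    return False
```

With one source that refuses to prepare, `two_phase_commit` returns False and every participant ends up rolled back, which is exactly the guarantee a distributed transaction needs.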
When Linux came out, did anyone take it seriously? Did you see customers suddenly jumping on it? Didn't all commercial Unixes already have all the features that Linux was just starting to have?
As someone already mentioned, C-JDBC and RAIDb are certainly not ready for prime time, but at least they are worth debating!
Maraist is right, C-JDBC is database independent and can even support clusters of heterogeneous databases (as long as they can understand the same SQL subset).
In fact, scheduling the requests in front of the database usually performs better than just letting the database do the locking (check this article). The bet is that this solution gives us a generic way to provide clustering: it is much easier than implementing it inside an RDBMS engine (see the Postgres-R work) and can perform at least as well as a DB-specific implementation.
"StupiedEngineer(102134)", yes, I am part of the project.
1. Yes, you can have multiple controllers started at any time that use group communication to synchronize the requests to be sent to the backends. Clients that were communicating with the failed controller automatically redirect their queries to another controller.
2. The assumption made in C-JDBC is that communications between controllers and backends are fast (like in a cluster environment). What generates inter-controller traffic is write queries. If you have many of them, the performance will go down. If your workload is read-mostly it should scale well.
If you want to distribute controllers across multiple sites, the performance will depend on the speed and reliability of the links between the sites (not to mention security issues such as crossing firewalls, the possible need to encrypt data, ...).
As a side note, you never need to replicate reads, just writes.
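The "replicate writes, not reads" rule can be sketched as a tiny router. This assumes full replication and a crude first-keyword test for writes; `ReplicatedCluster` and the backend names are hypothetical, not C-JDBC classes.

```python
import itertools


class ReplicatedCluster:
    """Hypothetical scheduler: writes go to every backend, reads to one."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = itertools.cycle(range(len(self.backends)))

    @staticmethod
    def is_write(sql):
        # Crude classification by first keyword; an assumption for this sketch.
        return sql.lstrip().split(None, 1)[0].upper() != "SELECT"

    def route(self, sql):
        if self.is_write(sql):
            # Writes must reach every replica to keep them consistent.
            return list(self.backends)
        # Reads are served by a single backend, chosen round-robin.
        return [self.backends[next(self._next)]]


cluster = ReplicatedCluster(["db1", "db2", "db3"])
print(cluster.route("UPDATE t SET x=1"))
print(cluster.route("SELECT * FROM t"))
```

This is why a read-mostly workload scales: each extra backend adds read capacity, while every write still costs one execution per backend.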
The Sun Fire 15K can maintain 43.2 GB/sec bandwidth connecting all the CPUs
That is really nice, but in a database engine a query is usually handled by a single thread that doesn't need to communicate with another thread on another CPU. Therefore this bandwidth is useless.
What would be more interesting is the IO bandwidth of the machine and how many Ethernet adapters you can plug in. Can you plug 36 Ethernet adapters to handle the same aggregated bandwidth as a cluster does?
You have your web site backed by an open source database?
Just put a replica on a second node and you will have fault tolerance (even if only for maintenance) and you will be able to handle peak loads. Two nodes are already a cluster; you don't need hundreds of nodes.
Another usage could be to keep a single Oracle instance and put a bunch of open-source databases to offload your main Oracle database. You could have all the write queries (orders,...) handled by your [safe] main Oracle database and have all other open-source databases handle the read requests for browsing your web site (which is the main part of the load). What do you think of this idea of scaling Oracle with open-source databases?
The C-JDBC controller embeds a recovery log that allows backends to recover from failures (check the recovery log part in the doc).
If one backend fails in the cluster, it is automatically disabled and the controller always ensures that data that are sent back to the application are consistent.
By the way, you can tune how you want distributed queries to complete: return as soon as the first node has committed, wait for a majority, or, more safely, wait for all nodes to commit. There are many options that help tune the performance/safety tradeoff.
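As a rough illustration of the three completion policies, the sketch below computes how many backend acknowledgements the controller would wait for before answering the client. The policy labels "first", "majority" and "all" are my own names, not actual C-JDBC configuration values.

```python
def acks_required(policy, backend_count):
    """Number of backend commits to wait for under a given policy."""
    if policy == "first":
        return 1
    if policy == "majority":
        return backend_count // 2 + 1
    if policy == "all":
        return backend_count
    raise ValueError("unknown policy: " + policy)


def can_return(policy, acks, backend_count):
    """True once enough backends have committed to answer the client."""
    return acks >= acks_required(policy, backend_count)


for policy in ("first", "majority", "all"):
    print(policy, acks_required(policy, 5))
```

The fewer acknowledgements you wait for, the lower the latency and the larger the window in which a failure can leave some backends behind.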
The client (driver side) cannot be generic. It will always be application dependent. Therefore, you will always have to port (at least) the driver. But the controller itself (where the cluster logic is really implemented) just deals with SQL strings sent over sockets whatever the client is on the other side.
Could be interesting to have an ODBC driver sending the requests to a Java C-JDBC controller.
1. Yes, you can have multiple controllers that synchronize using group communication. In the driver, you give a comma-separated list of the host names running controllers. The driver has built-in failover and load balancing among multiple controllers (check the doc here).
2. Yes, all ports are customizable when you start the controller (check the doc here).
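To show what driver-side failover over a comma-separated controller list amounts to, here is a small sketch. It is not the C-JDBC driver and does not use its URL syntax; the `host:port` format, the function names and the caller-supplied `connect` function are all assumptions made for the example, and load balancing is left out.

```python
def parse_controllers(spec):
    """Split 'host1:port1,host2:port2' into (host, port) pairs."""
    controllers = []
    for item in spec.split(","):
        host, _, port = item.strip().partition(":")
        controllers.append((host, int(port)))
    return controllers


def connect_with_failover(spec, connect):
    """Try each controller in turn.

    `connect` is a caller-supplied function that returns a connection
    or raises OSError when the controller is unreachable.
    """
    last_error = None
    for host, port in parse_controllers(spec):
        try:
            return connect(host, port)
        except OSError as exc:
            # This controller is down; remember why and try the next one.
            last_error = exc
    raise ConnectionError("no controller reachable") from last_error
```

In a real driver `connect` would open a socket to the controller; passing it in keeps the sketch testable without a network.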
This is just an alpha version, so as you mentioned, there are still many features missing but it is a good starting point and contributions are welcome (remember it is open source software;-))!
What you missed is that this thing only forwards SQL requests. Therefore you can also build clusters of Oracle if you want. You will not miss any Oracle feature this way.
When you look at Oracle pricing policy, you can have Oracle RAC for the price of just Oracle (+ a free RAIDb), which is already a 50% discount!
You should check it out, it's open source.