Slashdot Mirror


Database Clusters for the Masses

grugruto writes "Database clustering is no longer the privilege of a few high-end commercial databases; open-source solutions are striking back! ObjectWeb, an Apache-like group, has announced the availability of Clustered JDBC (or C-JDBC). C-JDBC is open-source software that implements a new concept called RAIDb (Redundant Array of Inexpensive Databases). It is simple: take a bunch of MySQL or PostgreSQL boxes, choose your RAIDb level (partitioning, replication, ...) and you get a scalable and fault-tolerant database cluster."
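
To make the replication level concrete, here is a minimal sketch of the write-to-all, read-from-one idea. It uses Python's in-memory SQLite connections as stand-ins for the MySQL or PostgreSQL boxes, and `ReplicatedController` with its `write`/`read` methods is a hypothetical name chosen for illustration, not C-JDBC's API.

```python
import itertools
import sqlite3


class ReplicatedController:
    """Toy controller: write to all backends, read from one (round-robin)."""

    def __init__(self, n_backends):
        # Each in-memory SQLite connection stands in for one database box.
        self.backends = [sqlite3.connect(":memory:") for _ in range(n_backends)]
        self._next_reader = itertools.cycle(range(n_backends))

    def write(self, sql, params=()):
        # Replication: every backend applies the same statement.
        for db in self.backends:
            db.execute(sql, params)
            db.commit()

    def read(self, sql, params=()):
        # Load balancing: any replica can answer a read.
        db = self.backends[next(self._next_reader)]
        return db.execute(sql, params).fetchall()


controller = ReplicatedController(3)
controller.write("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
controller.write("INSERT INTO users (name) VALUES (?)", ("alice",))
# Three consecutive reads hit three different replicas.
rows = [controller.read("SELECT name FROM users") for _ in range(3)]
```

The sketch deliberately skips the hard parts: ordering concurrent writes from several clients, and deciding what to do when one backend fails halfway through a write.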

29 of 278 comments

  1. Non-Java Implementations? by the_quark · · Score: 4, Interesting

    Just started looking at the site. I've wanted this for years. I was ecstatic with what load-balancing cheap Apache boxes did for the cost of web hosting. Unfortunately, reliability has still required hundreds of thousands of dollars of high-end equipment and software for databases. I've been hoping the open-source community would make headway on this front.

    So, the question is - is anyone working on anything like this for Perl, C, or generic implementations?

    1. Re:Non-Java Implementations? by Etcetera · · Score: 3, Insightful


      Exactly -- given that the RAIDb itself sits elsewhere, I can't imagine it would be that hard to take the source itself and make a Perl DBD::Module out of it.

      If only I had the spare time... :/

    2. Re:Non-Java Implementations? by akadruid · · Score: 4, Insightful

      Unfortunately there is no free open source hardware available :)
      Seriously though, this may reduce the costs for some users but I don't think it will get wide take-up. Most people will not want to give up the deniability you can have with large corps like Oracle. Oracle is a 'safe' solution for the purchaser with their ass on the line, which is most corporate users these days.
      And the more entrepreneurial users will not usually have the hardware to use this properly anyway.
      Anyone who is financing this lot will want proven standards.
      Just my flawed £0.02

      --
      "Those who cast the votes decide nothing; those who count the votes decide everything." (attrib. Joseph Stalin)
    3. Re:Non-Java Implementations? by palad1 · · Score: 3, Insightful
      So, the question is - is anyone working on anything like this for Perl, C, or generic implementations?

      Am I the only one a bit saddened by the fact that Sun botched it with Java so badly that we now exclude Java from 'generic implementations'?

      Build once, run anywhere, riiiiight.

    4. Re:Non-Java Implementations? by the_quark · · Score: 4, Informative

      Not to argue in any way that Sun botched Java, but what I meant is, this implementation is for Java programs. It provides no functionality for programs not written in Java. Even if Sun had done Java correctly, my statement would still be true - this isn't a generic implementation, as it requires the code be written in Java. Even if Java itself were generic, this implementation wouldn't be generic, it'd be Java-specific.

      When I said "generic implementation" I meant "an implementation which doesn't require your programs be written in a particular language." Which is probably a bit of a pipe dream, you'd still need some sort of glue code (ODBC, JDBC, DBD, etc). But, as was alluded to above, I was trying to beat the Beowulf comment when I asked my question. ;)

    5. Re:Non-Java Implementations? by palad1 · · Score: 4, Interesting

      Please don't take my previous post as a flame, I completely agree with your point. What I was whining about was the fact that Java doesn't play nice with system libs: it is 'easy' to import other libs, but exporting Java classes to other languages is ...
      Let's say that few people feel like embedding a JVM in their C app :)

  2. hmmm by the_2nd_coming · · Score: 4, Interesting

    now if only MySQL or PostgreSQL can get the reputation that Oracle has, maybe we will start to see Oracle DBs go away in favor of the cheaper solutions using RAIDb

    --
    I am the Alpha and the Omega-3
  3. If only replication was so trivial by marcink1234 · · Score: 4, Insightful

    Running many databases is easy. Organizing and serializing replication is hard, even if one has distributed transactions handy, which is not the case here. But let's read their code...

  4. Performance? by deranged+unix+nut · · Score: 5, Interesting

    Hmm, interesting idea. I didn't see performance listed as a feature.

    I wonder how much slower my query will be when the data is spread across several machines. I'd imagine that a few complex queries that aren't correctly optimized would bring this system to its knees rather quickly.

    1. Re:Performance? by jsin · · Score: 5, Informative

      Database clustering is typically used for high availability, not performance.

      There are better ways to improve the performance of a database, horizontal partitioning, federated servers, etc.

      This would be very cool if there was a generic implementation; we build many Microsoft SQL clusters and just the hardware requirements for an MSCS cluster easily exceed $50k, let alone the licensing...as an MCDBA I'd consider an open source solution if I could use it as a back-end to an ASP/VB.NET application, just to save the licensing $$ for consulting! ; )

  5. This is a threat to the big vendors by Jack+William+Bell · · Score: 4, Insightful

    This is a major threat to the big vendors. In fact I would say it is even more of a threat to Oracle than it is to MS! After all, MS can continue to go after the midrange market that is already locked into them for the OS.

    But Oracle shops are dealing with expensive boxes they would love to replace, not to mention expensive Oracle licenses. Often the only reason they use Oracle (other than Oracle salesmen licking their buttholes) is because only Oracle has the horsepower to meet their requirements. Give them a cheaper alternative with the same capabilities and they will bail out faster than you can say 'Geronimo'.

    Expect Larry Ellison to start talking about the dangers of using Open Source software now...

    --
    Are you an SF Fan? Are you a Tru-Fan?
    1. Re:This is a threat to the big vendors by DavidpFitz · · Score: 3, Informative
      Give them a cheaper alternative with the same capabilities and they will bail out faster than you can say 'Geronimo'.
      But there isn't anything close to Oracle when it comes to availability, reliability, etc. And even if there were, IT managers would not go for it for some years because it's not proven in the enterprise. Oracle is so embedded in management brains, and its reputation is well deserved.

      If you want to cluster Oracle, use Oracle RAC (Real Application Clusters). It's based on Parallel Server so is mature enough to put forward for consideration... and even then it might be eschewed from above. Cheap databases are not going to ring the bells of the people with the say-so simply because Oracle (and DB2 etc) are proven over the years, and the cost of losing your data because you went for the cheap option is going to lose your company a lot of money, and you your job!

      Technically better, cheaper and all those good things does not mean better for a business. Databases are predominantly used for *business*, and as such a *business* reason is used when choosing one over another, not technical reasons.

    2. Re:This is a threat to the big vendors by FortKnox · · Score: 3, Insightful

      I have to say this is a major point. This is why you don't see people using open source. If my DB goes down, I call up Oracle, and make them bring someone down here to fix the problem. If my open source DB goes down, I crap my pants and hope to keep my job.

      What does proprietary software have that Open Source doesn't? Insurance.

      The best way to knock over Oracle is to start up a company that supports open source for a fee (which is cheaper than running Oracle for a year).

      --
      Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!
    3. Re:This is a threat to the big vendors by mangu · · Score: 3, Insightful
      please prove to me that this new "threat" has all of the features of Oracle


      That's exactly the point. Who needs all the features of Oracle? Maybe the IRS or Mastercard, but the vast majority of Oracle users are getting just one feature: the Oracle reputation that their marketing has built.


      And with all those features comes the big problem of managing them: no matter how small the application is, once you choose Oracle you need a team of experienced DBAs to correctly and reliably configure the system.

    4. Re:This is a threat to the big vendors by valisk · · Score: 5, Insightful
      People will always buy Oracle because "No one ever got fired for choosing Oracle". If something goes wrong, you always have someone to blame. With open source, your job is more on the line because you have to take responsibility.

      Prior to Oracle taking off in a big way people used to say:

      People will always buy IBM because "No one ever got fired for choosing IBM". If something goes wrong, you always have someone to blame. With the Seven Dwarfs (the common name for IBM's competitors back then), your job is more on the line because you have to take responsibility.

      Then Larry E. shamelessly put together a cool SQL database which copied every major innovation IBM had made and added in a few more for good measure. He also cut the price by a third. IBM's database customers deserted in droves; after all, if this Oracle thing turned out to be shit, they could always get IBM to come clean up the mess. It turned out, though, that Oracle wasn't and isn't shit.

      That does not mean that Oracle is immortal and will always be top of the pile. Postgres now replicates almost all of the major features and is proven in the reliability stakes. Tools like this are only going to make it more likely that corporate data departments will dip their toes into the Free software waters; after all, if it turns out to be shit, they could always get Oracle to come clean up the mess.

      --

      Economic Left/Right: -0.62
      Social Libertarian/Authoritarian: -3.69
    5. Re:This is a threat to the big vendors by leviramsey · · Score: 4, Informative

      Josh, know what you're talking about before you post. MySQL (the company which does the vast majority of development of MySQL) offers a variety of levels of support and consulting, regardless of the number of systems that you admin. For $48,000/year, you get:

      • Access to the entire development team 24x7x365, with a guaranteed response within 30 minutes
      • Ability to request developers by name
      • Just about every issue is supported (from APIs to configuration to OS, kernel, library, and filesystem dependencies to custom compiles, to recovery, to tuning and so on)

      Does Oracle match that for the price?

  6. Quick thru the docs... by Lysol · · Score: 4, Informative

    So a few things come up just reading the docs on this:

    1. A Controller. It looks as tho a single controller is used by the clients to communicate to the various RAID'd dbs. I'm sure there can be multiple controllers since there would be little point to make some db's redundant, yet the access to them not. Still looking into this.

    2. And also, it looks as tho the default port is 1099 - RMI. If you have, for a web app, your EJBs and web app local to that container, that might not be a problem. However, I happen to have my EJB server on its own box and this might very well cause probs. I think it said you could specify your own ports, but I haven't seen any examples in the docs yet of this being the case. Also, still looking.

    A few other things exist as well which are in the docs as known limitations:
    * XAConnections
    * Blobs
    * batch updates
    * callable statements

    These could be serious issues for some. My last project used CLOBs/BLOBs, batch updates and callable statements, so this would rule that out. Of course, all the db stuff was strictly tied to Oracle, so I think that would rule this out regardless. ;)

    All in all tho, this looks like a good start. As my current project progresses, clustered dbs will become more and more of an issue. I've looked into some other projects out there for Postgres, but nothing yet really satisfactory. I think this is a good step in the right direction - for Java developers. It'll be interesting to watch.

  7. Sigh - Looks like I have my work cut out for me... by Yoda2 · · Score: 4, Funny

    Off I go to start coding a FORTRAN port...

  8. Where are the benchmarks that they speak of ? by a7244270 · · Score: 5, Insightful

    I looked at the diagram, and it looks very nice, but they seem to be very light on the details.

    Supposedly, "This new version has been successfully tested with Tomcat, JOnAS, MySQL and PostgreSQL. Excellent results have been obtained with the TPC-W and RUBiS benchmarks."

    Don't get me wrong, I like the idea, and I have been wanting something like this for years, but I sure would like to _see_ the test results, even if they are preliminary.

  9. How about a meta-database adapter? by Frater+219 · · Score: 4, Interesting
    Since the article suggests the idea of applying disk-volume concepts (RAID) to databases, I thought I'd bring this up: For a while now I've been wishing there was an equivalent of NFS for databases, a way to mount a running database's tablespace into another database. This would allow one to draw together disparate databases, creating views and running joins across tables which natively reside in different databases, on different hosts.

    Here's an example of an application: I have a database-driven Web application that allows my onsite clients to register network services for openings in the firewall. Another software component probes the registered hosts for daemon version information and records it in the database, so that we can send out alerts when security holes are discovered in particular versions. I use PostgreSQL on Debian and Solaris. Independently of my work, our networking office has a Microsoft SQL Server database of IP addresses, MAC addresses, and physical switch ports and jack numbers.

    What I'd like to do is mount both my database and the networking office's database into some sort of "meta-database" -- analogous to mounting filesystems from two different hosts via NFS -- and run SQL queries that span both data sets. I wouldn't expect to be able to write to this conjoined database -- locking would be a nightmare -- but being able to SELECT across the two sets would be incredibly valuable.
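
    A small-scale version of this "mount" idea does exist in SQLite, whose `ATTACH DATABASE` statement lets one connection query several database files at once. The sketch below only illustrates the read-only cross-database join described above; the table names, columns, and sample rows are invented for the example, and it says nothing about how to bridge PostgreSQL and SQL Server.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
services_path = os.path.join(workdir, "services.db")
network_path = os.path.join(workdir, "network.db")

# Database 1: registered services (hypothetical schema).
db = sqlite3.connect(services_path)
db.execute("CREATE TABLE services (ip TEXT, daemon TEXT, version TEXT)")
db.execute("INSERT INTO services VALUES ('10.0.0.5', 'sshd', '3.7')")
db.commit()
db.close()

# Database 2: the networking office's inventory (hypothetical schema).
db = sqlite3.connect(network_path)
db.execute("CREATE TABLE ports (ip TEXT, mac TEXT, jack TEXT)")
db.execute("INSERT INTO ports VALUES ('10.0.0.5', '00:11:22:33:44:55', 'B-214')")
db.commit()
db.close()

# "Mount" both files into one session and run a SELECT that spans them.
meta = sqlite3.connect(services_path)
meta.execute("ATTACH DATABASE ? AS net", (network_path,))
rows = meta.execute(
    "SELECT s.daemon, s.version, p.jack "
    "FROM services AS s JOIN net.ports AS p ON s.ip = p.ip"
).fetchall()
meta.close()
```

    Here `rows` ends up holding the daemon, its version, and the wall jack for every host present in both files, which is the kind of join the two separate databases cannot answer on their own.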

  10. More info on transactions by binaryDigit · · Score: 3, Interesting

    Maybe I missed it but their info is pretty sparse on how they handle updates (i.e. adds/deletes/updates). Does it do two-phase commit, so if I'm striping data and one of the updates fails then everything fails? If they are replicating, will they automatically update replication servers if they are down at the time of the update? If one of the databases in the RAIDb doesn't support online backups and it's backing up, what will their system do? After all, this would be the true grunt work; without these features what they have isn't a big deal at all. Does anyone have more info?
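
    For readers who haven't met two-phase commit, here is a textbook sketch of the all-or-nothing behaviour being asked about. It is not a description of what C-JDBC does; `Participant` and `two_phase_commit` are hypothetical names, and a real coordinator would also have to survive its own crash between the two phases.

```python
class Participant:
    """Hypothetical backend that can vote on, commit, or roll back an update."""

    def __init__(self, name, will_fail=False):
        self.name = name
        self.will_fail = will_fail
        self.state = "idle"

    def prepare(self):
        # Phase 1: do the work tentatively and vote yes (True) or no (False).
        self.state = "aborted" if self.will_fail else "prepared"
        return not self.will_fail

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2a: unanimous yes, so everyone commits.
        for p in participants:
            p.commit()
        return True
    # Phase 2b: a single "no" vote aborts the update everywhere.
    for p in participants:
        p.rollback()
    return False


ok = two_phase_commit([Participant("a"), Participant("b")])
failed = two_phase_commit([Participant("a"), Participant("b", will_fail=True)])
```

    With one failing participant the healthy one is rolled back too, which is exactly the "one update fails, everything fails" outcome the question describes.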

    1. Re:More info on transactions by grugruto · · Score: 3, Informative

      The C-JDBC controller embeds a recovery log that allows backends to recover from failures (check the recovery log part in the doc).
      If one backend fails in the cluster, it is automatically disabled and the controller always ensures that data sent back to the application is consistent.
      By the way, you can tune how you want distributed queries to complete (return as soon as the first node has committed, wait for a majority, or, safer, wait for all nodes to commit). There are many options that help tune the performance/safety tradeoff.
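
      The three completion choices mentioned above (first, majority, all) boil down to how many backend commits the controller waits for. A rough sketch, with the policy strings and function names invented here rather than taken from the C-JDBC configuration:

```python
def acks_required(policy, n_backends):
    """Number of backend commits to wait for before answering the client."""
    if policy == "first":
        return 1
    if policy == "majority":
        return n_backends // 2 + 1
    if policy == "all":
        return n_backends
    raise ValueError(f"unknown policy: {policy}")


def completes(policy, commit_results):
    """commit_results holds one boolean per backend: True if it committed."""
    return sum(commit_results) >= acks_required(policy, len(commit_results))
```

      With five backends, "first" answers after 1 commit, "majority" after 3, and "all" after 5, so the latency of a write grows with the safety level you pick.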

  11. supposed to be at RDBMS level by Arethan · · Score: 4, Insightful

    Isn't clustering supposed to be a function of the database system, not the software you use to access it?

    I mean, this is neat and all, but I really don't want to have to use this interface just so that I can cluster my database. You're much better off placing clustering functions within the database itself. Then you can access the data by any method (ODBC, native libraries, hell even with the provided command line interface).

    Take a look at how MS SQL Server performs clustering sometime. Everything (and I mean EVERYTHING) is performed via triggers and tsql. All the clustering setup does is set up a bunch of known working trigger scripts to propagate the data. You can even edit them to your liking afterwards if you wish. Now I'm not saying that MS's solution for clustering is the cat's ass. Personally, I think it is kind of hackish, but then again I believe that clustering should be something you simply turn on, and shouldn't be able to fuss with. Realistically, I can't think of any good reason to change the cookie cutter tsql scripts that perform the clustering, so I only see the ability to modify them as a potential way to fsck it up (that being an obviously bad thing).

    Clustering really isn't that hard to implement. I'm pretty surprised that MySQL and Postgres don't have better support for it. Especially Postgres, since transaction support is really the one big key that makes clustering possible. Maybe no one has really had an itch to make it happen yet. Hopefully it will happen soon, since I'd love clustering to be another argument for why OSS databases can play with the big kids just as easily.

    1. Re:supposed to be at RDBMS level by Vihai · · Score: 3, Insightful

      You are right: clustering is not only better implemented in the DBMS itself, it actually NEEDS support from the DBMS.

      You are wrong saying that implementing clustering isn't hard.

      If we are talking about REAL DBMSes (no, MySQL is not a real DBMS), enabling any form of clustering which maintains the ACID properties we expect from a DBMS is a major step. It means becoming a distributed application, and that is one of the most complex things to implement.

      Just for example, suppose you have two machines in a master-to-master configuration and suddenly the network becomes partitioned: each server thinks that the other is offline, but the clients can reach both of them.

      Suppose now that the clients update the same record on the two servers in an incompatible way... you could imagine what will happen when the servers become visible to each other again...
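
      The scenario above can be played out in a few lines. This is a toy model under stated assumptions: each master keeps a single record and a version counter, and `reconcile` is a hypothetical helper that only detects the conflict, it does not resolve it.

```python
class Master:
    """Toy master holding one record plus a version counter."""

    def __init__(self, name):
        self.name = name
        self.value = "initial"
        self.version = 0

    def update(self, value):
        self.value = value
        self.version += 1


def reconcile(a, b, last_synced_version):
    """Compare two masters after the network heals.

    Returns "in_sync", "copy_a_to_b", "copy_b_to_a", or "conflict".
    """
    a_changed = a.version > last_synced_version
    b_changed = b.version > last_synced_version
    if a_changed and b_changed:
        # Both sides accepted writes during the partition: no safe automatic merge.
        return "conflict"
    if a_changed:
        return "copy_a_to_b"
    if b_changed:
        return "copy_b_to_a"
    return "in_sync"


east, west = Master("east"), Master("west")
# Partition: each master believes the other is down, but clients reach both.
east.update("balance=100")
west.update("balance=250")
outcome = reconcile(east, west, last_synced_version=0)
```

      Here `outcome` is "conflict": both masters moved past the last synced version, and nothing in the data says which write should win.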

  12. Slightly Offtopic.... by frodo+from+middle+ea · · Score: 3, Insightful

    But seriously, do you see Oracle/DB2 etc. customers suddenly jumping on this?
    My view is that it may be difficult to migrate OSes or even hardware, but it's almost darn impossible to migrate existing databases.
    A database is the most fundamental and most cared-about aspect of a major business. There is a lot of time and effort and MONEY spent to incorporate it into the company.
    Lots and lots of critical business applications are written using the proprietary extensions of these vendors. Is it very easy to migrate this code?
    May be interesting for a future pilot project, but if you expect businesses to change their database vendors... that's not going to happen very soon.

    --
    for the last time people, I am "frodo from middle eaRTH", not "middle eaST".
  13. Also new! by Dark+Lord+Seth · · Score: 4, Funny

    RAID -- Redundant Array of Inexpensive Developers

    RAID 0
    Multiple developers work on the same project but none of them has any idea what the other is doing at the same time. One developer failing (caffeine dehydration, severe electrostatic shock, sex, etc) will cause the entire project to screw up and become a mess.

    RAID 1
    Extreme Programming.

    RAID 2
    Inefficient way to keep track of what developers are doing. For every 10 developers, 4 are needed to keep track of them and recover any error by the aforementioned 10 while they don't work together at all. Level of efficiency comparable to a modern government.

    RAID 3
    Equal to RAID 2, except all responsibility for checking the code is now granted to one person. The rest have been budget-cut away. A bit more effective, but considering people still don't cooperate, not too good.

    RAID 4
    Equal to RAID 3, except people are finally working together now. Kinda efficient and fast, except it all still relies on that one person who checks the data.

    RAID 5
    Everyone knows what everyone else is doing, they all work perfectly together and they can easily miss one person because of that.

  14. Clusters aren't performance? Just not true! by CharlesDarwin · · Score: 3, Informative

    This simply isn't true. Oracle's clustered database solution (9i Real Application Clusters) is designed to increase the ability to gracefully recover from individual node failures. Additionally, it can scale the performance of your database application by increasing the number of CPUs with access to shared storage. For CPU-bound database applications, this technology provides near-linear scalability!

  15. Thorough rundown by photon317 · · Score: 4, Informative


    After actually reading the documentation, here's my informed take on this:

    1) In its current incarnation, it's only useful for very very simple database access. No transactions, no blobs, etc. Basically if you're just storing some simple weblication tables and doing single statements against them for selects/updates (no big cross-table transactions), you can use it.

    2) It's JDBC only. Perhaps someone could port the concept to ODBC though.

    3) There's a new middle tier between the JDBC driver and the database itself, which is the bulk of their code. This tier actually re-implements some database constructs like recovery logging, query caching, etc. Of course this is necessary, as trying to do replication from the client-code side alone would be impossible (what do you do when one of 3 DB mirrors goes offline for an hour? Have every JDBC client cache the requests and replay them later, hoping those clients are even still around later?)
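
    Point 3 is easier to picture with a toy middle tier that keeps its own write log and replays it to a mirror that was offline. This is a sketch of the general idea only: `LoggingController` and its methods are hypothetical, SQLite in-memory databases stand in for the mirrors, and the log is kept in memory where a real one would have to be durable.

```python
import sqlite3


class LoggingController:
    """Toy middle tier: replicates writes and logs them for offline backends."""

    def __init__(self, names):
        self.backends = {n: sqlite3.connect(":memory:") for n in names}
        self.enabled = set(names)
        self.log = []                         # ordered list of (sql, params)
        self.applied = {n: 0 for n in names}  # log position reached per backend

    def write(self, sql, params=()):
        self.log.append((sql, params))
        for name in sorted(self.enabled):
            self._catch_up(name)

    def _catch_up(self, name):
        # Apply every logged write this backend has not seen yet, in order.
        db = self.backends[name]
        for sql, params in self.log[self.applied[name]:]:
            db.execute(sql, params)
        db.commit()
        self.applied[name] = len(self.log)

    def disable(self, name):
        self.enabled.discard(name)

    def enable(self, name):
        # Replay missed writes before the backend serves traffic again.
        self._catch_up(name)
        self.enabled.add(name)

    def count(self, name, table):
        query = f"SELECT COUNT(*) FROM {table}"
        return self.backends[name].execute(query).fetchone()[0]


ctl = LoggingController(["a", "b", "c"])
ctl.write("CREATE TABLE t (x INTEGER)")
ctl.disable("c")                      # one mirror goes offline
ctl.write("INSERT INTO t VALUES (1)")
ctl.write("INSERT INTO t VALUES (2)")
behind = ctl.count("c", "t")          # c has missed both inserts
ctl.enable("c")                       # controller replays from its log
caught_up = ctl.count("c", "t")
```

    Because the log lives in the middle tier, the clients never need to remember anything: mirror "c" comes back with zero rows and leaves `enable` with the same two rows as the others.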

    For some applications and some companies, in its current state this is a godsend - but it's not a general solution yet. Making it ODBC (or even better, having the front of it emulate a native postgresql or mysql listener) would broaden its applicability.

    Supporting transactions would be a big win too, although I'm not sure how feasible this is - I think at that point they may as well just write their own new database engine which is parallel from the start, seeing as they'll be re-implementing in their cluster tier almost everything the database server does except for actual physical storage.

    Still, it's nice to see that someone did this and made it work - and for a lot of simple databases behind java apps it's all you really need.

    PostgreSQL has all the transaction support in place already, so of all the free DBs out there it would seem they have the best shot at doing their own native parallelism, if they would just get it done someday.

    --
    11*43+456^2
  16. I wouldn't get too excited by godofredo · · Score: 3, Informative

    There are many problems with this design; some have already been mentioned. There are serious issues with performing atomic updates. Modern databases use locking to allow high levels of concurrency. Foreign key constraint checking is one thing that would be very hard to implement in this design, as it is generally implemented in the indexes themselves. Likewise, to get all databases in a replicated ("RAIDb-1") group to reflect the same state, operations such as concurrent delete and insert must be completely serialized to assure consistency...serialized across all clients, not just from one source.

    Furthermore, to scale up, systems generally take advantage of striping. At the IO level that means striping across multiple disks (modern convention is to stripe across all!). In a parallel database one usually stripes a single table across multiple nodes for parallel query processing. While it is possible with C-JDBC to put table X on node A and table Y on node B, I don't see any provision for striping the data. It will be very difficult to use your hardware efficiently in this scenario.
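
    For comparison, this is roughly what striping a single table across nodes looks like. It is a minimal sketch, not anything C-JDBC provides: the modulo placement rule, the `orders` table, and the helper names are assumptions made for the example, and the nodes are queried one after another where a parallel database would query them concurrently.

```python
import sqlite3

N_NODES = 3
# Each in-memory SQLite connection stands in for one database node.
nodes = [sqlite3.connect(":memory:") for _ in range(N_NODES)]
for db in nodes:
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")


def node_for(order_id):
    # Stripe rows by key with a simple modulo placement rule.
    return nodes[order_id % N_NODES]


def insert_order(order_id, amount):
    db = node_for(order_id)
    db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
    db.commit()


def total_amount():
    # Scatter the aggregate to every node, then gather and combine partial sums.
    partials = [
        db.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
        for db in nodes
    ]
    return sum(partials)


for order_id in range(1, 10):
    insert_order(order_id, order_id * 10)

rows_per_node = [
    db.execute("SELECT COUNT(*) FROM orders").fetchone()[0] for db in nodes
]
```

    Each node ends up with a third of the rows and computes a third of the sum, which is where the parallel speedup comes from. Joins and constraints that span stripes are the part this sketch conveniently avoids.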

    If you are going to go through the trouble of implementing a complete query processor (that can handle jobs larger than RAM), a full update/query scheduler (lock manager), and a journalling mechanism that can (somehow) even maintain atomic transactions (even in the face of multiple failures), then why not just build your own database? This system might be useful in certain rare cases but I wouldn't use it except possibly for replication.

    JJ