Open Source Database Clusters?

Posted by Cliff on Thursday September 11, 2003 @11:45AM from the prime-time-league-contenders-yet? dept.

grugruto asks: "A lot of open source solutions are available to scale web sites with clusters, but what about databases? I can't afford an Oracle RAC license, but can I have something more reliable and fault tolerant than my single Postgres box? I have seen this recent article that looks promising for open source solutions. Does anyone have experience with clusters of MySQL, Postgres-R, C-JDBC, or other solutions? How do they compare to commercial products?"

11 of 350 comments

Not personally, but by revividus · 2003-09-11 11:53 · Score: 5, Interesting

I've been looking into MySQL for a bit, and I recently saw this article, which directly concerns clustered database servers running MySQL.
Maybe it will be of interest...

--
philcrissman.com.
The big problem is replication by MarkusQ · 2003-09-11 11:56 · Score: 5, Interesting

IMHO, the biggest problem is replication: keeping them all consistent in the face of asynchronous updates. It can also reduce or eliminate the advantages of clustering if you have a significant number of updates compared to the number of queries.
I guess the best answer depends on how dynamic your data is. If it's static, there are all sorts of easy answers. If all the updates come from a central source, or on a predictable schedule, you're almost as well off. If updates come from the great unwashed but the data can be partitioned in some way (say, geographically) you can still do it. If updates come from all over but queries can be centralized, or if your database is tiny, or if latency isn't a problem, or if you have a machine that prints money, it can still be done.
If you want to do everything for everyone everywhere, right now if not sooner, for under twenty bucks, you're screwed.
So, what are your needs?
-- MarkusQ
PostgreSQL and pg_dump by zulux · 2003-09-11 11:57 · Score: 4, Interesting

Check out the new replication at postgresql.org: it's master -> multiple slave replication.

Then have your slave database query the master database - and if it no longer responds, it could promote itself to master.

The replication is the easy bit - the slave promotion is the hard and gritty bit.
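
To make the "promote itself" idea concrete, here is a minimal sketch of the decision logic only. Everything in it is an assumption for illustration: `StandbyMonitor`, `probe` and `max_failures` are hypothetical names, not part of PostgreSQL or of any replication package. It waits for several consecutive failed checks so that a single dropped probe does not trigger a promotion.

```python
from typing import Callable


class StandbyMonitor:
    """Decides when a standby should promote itself to master.

    Hypothetical helper: `probe` is any callable that returns True when
    the master answered a health check (for example a trivial query).
    """

    def __init__(self, probe: Callable[[], bool], max_failures: int = 3):
        self.probe = probe
        self.max_failures = max_failures
        self.failures = 0
        self.promoted = False

    def tick(self) -> bool:
        """Run one health check; return True once promotion has happened."""
        if self.promoted:
            return True
        if self.probe():
            # Master answered: forget earlier transient failures.
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                # The gritty part would live here: stop applying
                # replication, fence the old master, repoint the clients.
                self.promoted = True
        return self.promoted


# Simulated probe results: one good answer, then three failures in a row.
answers = iter([True, False, False, False])
monitor = StandbyMonitor(lambda: next(answers))
states = [monitor.tick() for _ in range(4)]
```

The sketch deliberately leaves out the hard part: making sure the old master really is gone before the standby starts accepting writes, so that two masters never run at once.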

--
Moneyed corporations, non-working 'poor' and criminal prisoners are turning productive citizens into tax-slaves.
eRserver by linuxwrangler · 2003-09-11 12:01 · Score: 5, Interesting

I have found PostgreSQL to be nearly bullet-proof. I routinely have connections up for months at a time (that's individual persistent connections - the server is up much longer and the connections usually get dropped when I upgrade the client software). Still, sh*t happens and replication has been a sore point for many databases both open and commercial.
You should investigate eRserver. It was originally a commercial replication product for Postgres but has been open-sourced. I haven't tried it yet but it's on my to-do list.

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
Emic, InnoDB Hot Backup by vinsci · 2003-09-11 12:06 · Score: 5, Interesting
Two MySQL products I found interesting (neither of which is open source at this time):
- CLUSTERING IN TUNE WITH APACHE AND MYSQL (Free registration might be required. Also see Emic Application Cluster for MySQL.)
- InnoDB Hot Backup (with point in time backup)
The rest of this comment is quoted verbatim from InnoDB News
MySQL/InnoDB-4.0.1 and Oracle 9i win the database server benchmark of PC Magazine and eWEEK. February 27, 2002 - In the benchmark eWEEK measured the performance of an e-commerce application on leading commercial databases IBM DB2, Oracle, MS SQL Server, Sybase ASE, and MySQL/InnoDB. The application server in the test was BEA WebLogic. The operating system was Windows 2000 Advanced Server running on a 4-way Hewlett-Packard Xeon server with 2 GB RAM and 24 Ultra3 SCSI hard drives.

eWEEK writes: "Of the five databases we tested, only Oracle9i and MySQL were able to run our Nile application as originally written for 8 hours without problems."

The whole story. The throughput chart.
--

Trusted Computing FAQ | Free Dawit Isaak!
What is slashdot doing? by rtnz · 2003-09-11 12:06 · Score: 5, Interesting

What does Slashdot do for this? I recall that, way back in the day, there was some information about what the Slashdot tech looks like. Does anyone have info regarding their database setup?
Re:eRserver, more info. by ron_ivi · 2003-09-11 12:49 · Score: 3, Interesting

.org and .info are both using it.
The press release of ER Server becoming open source is quite informative (karma?) as well.
Marc of PostgreSQL Inc is an incredible resource on the PostgreSQL mailing lists too, and PostgreSQL Inc has a really cool policy that allowed them to donate their code to the community that way:
From their release: "DATELINE FRIDAY, DECEMBER 15, 2000 Open Source vs. Proprietary: We advocate Open Source, BSD style :) We will consider and develop short term (up to 24 month) proprietary applications and solutions where there is a strong business and intellectual property case to be made. *All* proprietary developments that we are involved in *will* become open source within two years of implementation, without exception."
Also cool, they provide hosting http://www.pgsql.com/hosting/ which donates "25% of all profit from these services ... directly back into the PostgreSQL Project. "
Ron
I'm not affiliated with them in any way, just appreciative of Marc's contributions on the mailing lists and to PostgreSQL as well.
Agreed. by oneiros27 · 2003-09-11 12:53 · Score: 4, Interesting

Availability is one of the basic issues when sizing your system. [i.e., can you have it down at night for a cold backup, or does it have to be available 24x7? Can you even get a maintenance window once a month?]

As with sizing your UPS and/or generators, you need to determine what the cost to your business is for downtime.

Now, yes, you might have SLA terms that spell out how much it'll cost you if you have to refund customers' money [for service-based orgs], or you might know how much profit you'd lose if your customers couldn't purchase items [for sales-based orgs]. But unfortunately, you also have to consider the recovery costs, the costs of damage to your reputation, etc.

If it's not worth your purchasing Oracle or another, more expensive database, odds are good that it's not worth the headaches of maintaining a high availability cluster with automatic failover. Instead, you can mirror the data, and keep transaction logs that you can replay.

You can have a spare system on standby, that you can keep updated on a regular basis (again, your cost of downtime, and the necessary time to recover the system will affect your choices), and when your main system should fail, you can push the most recent diffs to your standby, reconfigure the application servers to recognize the new server as the old one, and you're back in business.

It requires a bit of planning, and making sure that the necessary manual steps are well documented [so that anyone can do it, should the server outage be caused by something serious enough to take out your administrator, too], but it's easier and cheaper to build and maintain than a true cluster.
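
As a rough illustration of the standby approach described above, here is a toy model of log replay. It is a sketch under stated assumptions, not how any particular database ships its logs: the `Standby` class and the `(sequence, key, value)` entry format are made up for the example, and a `None` value stands in for a delete.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    """Toy standby: a dict of rows plus the last log sequence applied."""
    rows: dict = field(default_factory=dict)
    last_applied: int = 0

    def replay(self, log):
        """Apply (sequence, key, value) entries newer than last_applied."""
        applied = 0
        for seq, key, value in sorted(log, key=lambda entry: entry[0]):
            if seq <= self.last_applied:
                continue  # already shipped in an earlier sync
            if value is None:
                self.rows.pop(key, None)  # None stands in for a delete
            else:
                self.rows[key] = value
            self.last_applied = seq
            applied += 1
        return applied


log = [(1, "a", "x"), (2, "b", "y")]
standby = Standby()
standby.replay(log)                     # regular scheduled sync
log += [(3, "a", None), (4, "c", "z")]  # changes made since that sync
caught_up = standby.replay(log)         # final push after the main system fails
```

Tracking the last applied sequence number is what makes the final push cheap: only the entries written since the previous sync have to be replayed before the application servers are pointed at the standby.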

--
Build it, and they will come^Hplain.
Re:-1:Troll by Tmack · 2003-09-11 12:54 · Score: 4, Interesting

I would have to second this. I use MySQL at work as the main database for the NOC and service activations and circuit delivery groups. The database (running off an old Sun Netra box) handles the load of all the scripts (mostly perl) used by all those groups. This includes scripts that monitor circuit status (a la Netcool), test new circuits, keep track of customer installations, change requests, troubles, router configs, etc... The MySQL server has never caused data loss, and the only instances where it "crashed" were errant queries in alpha CGI script releases that caused basically an infinite loop around a search on the 20K+ circuit entries on a non-indexed field, which a simple restart of the mysqld fixed. Even when the Beta version was released running on a Linux P4 box we never had issues, as opposed to the Oracle-backed system used for the main corp. database, which regularly causes much frustration among co-workers (not to mention the internal conflict between 2 development teams (corp vs us) trying to control the access and data of the corp database vs the ease of development of new utilities to make customer installation and support easier).
TM
P.S. Can't wait until our Sun V280r shows up!

--
Support TBI Research: http://www.raisinhope.org
Need to define the problem better by koreth · 2003-09-11 13:02 · Score: 4, Interesting

Why do you want clustering? Do you need to scale up transactions per second? If so, are these primarily reads or writes? The answer to that question can make a huge difference in your clustering and replication strategy.
Clustering read-mostly data for performance reasons is relatively easy; for many applications, where a second or two of staleness on the replicated databases is acceptable, you can make do with a bunch of independent copies of the database, with all updates going to an authoritative database and getting replicated out from there asynchronously.
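
A minimal sketch of that read-mostly layout, assuming the application decides where each statement goes: writes are sent to the one authoritative server and reads are spread round-robin over the copies. `ReplicatedRouter` and the server names are hypothetical, and the asynchronous replication itself is left to the database.

```python
import itertools


class ReplicatedRouter:
    """Routes writes to one authoritative server and reads to replicas."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE")

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        """Return the name of the server that should run `sql`."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb in self.WRITE_VERBS:
            return self.primary
        # Reads tolerate a second or two of staleness, so any copy will do.
        return next(self._replicas)


router = ReplicatedRouter("db-primary", ["db-replica-1", "db-replica-2"])
targets = [
    router.route(q)
    for q in ("SELECT 1", "UPDATE t SET x = 1", "select 2", "SELECT 3")
]
```

A read that has to see a write made a moment ago would need to go to the authoritative server as well; this sketch does not handle that case or multi-statement transactions.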
If your data can be partitioned cleanly -- that is, if you have groups of tables that are never joined with tables in other groups -- then you can perhaps get some benefit from putting different data on different servers, with no replication required. Obviously that's only worthwhile if the query load is comparable between groups.
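
For the partitioned case, the routing can be as simple as a lookup from table group to server. The group names, table names and `server_for` helper below are invented for illustration; the one rule the sketch enforces is that a query touching tables from two groups cannot be routed at all.

```python
# Hypothetical table groups; no query joins across two groups.
TABLE_GROUPS = {
    "orders-db": {"orders", "order_items"},
    "catalog-db": {"products", "categories"},
}


def server_for(tables):
    """Return the one server that holds every table in `tables`."""
    wanted = set(tables)
    if not wanted:
        raise ValueError("no tables given")
    for server, group in TABLE_GROUPS.items():
        if wanted <= group:
            return server
    raise ValueError(f"tables span groups, cannot route: {sorted(wanted)}")


orders_server = server_for(["orders", "order_items"])
catalog_server = server_for(["products"])
```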
If, on the other hand, you require ACID-compliant updates of all the replicas as a unit, you're entering difficult territory and you might have no choice but to go with a commercial solution depending on the specifics of your needs.
At just about all of the places where I've done database programming where this has come up, we ended up buying a much beefier database server (lots of processors and memory, good I/O bandwidth, redundant networking and power supplies) with disk mirroring, rather than get into the headaches of replication. A big Sun or HP server is certainly more expensive than some mid-range Dell or no-name PC, but it may end up being cheaper than the engineering time you'd spend getting anything nearly as robust and high-performance on less expensive hardware.
I've also found that very often when there's a database bottleneck that looks like it requires bigger hardware, the problem is the data model or the queries (unnecessary joins, no indexes where they're needed, poorly-thought-out normalization, etc.) or the physical layout of the data (indexes competing with data for access to the same disk, fragmentation in indexes/data, frequently-used tables spaced far apart on disk.)
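
The missing-index case is easy to demonstrate. The sketch below uses SQLite through Python's standard `sqlite3` module purely because it runs anywhere; it is a stand-in, not one of the databases discussed here, and the `circuits` table and its contents are made up. It asks the planner how it would run the same query before and after an index is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE circuits (id INTEGER PRIMARY KEY, customer TEXT)")
conn.executemany(
    "INSERT INTO circuits (customer) VALUES (?)",
    [(f"cust{i % 500}",) for i in range(20000)],
)

QUERY = "SELECT id FROM circuits WHERE customer = 'cust42'"


def plan(sql):
    """Return SQLite's plan description for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


before = plan(QUERY)  # no index on customer yet, so a full table scan
conn.execute("CREATE INDEX idx_circuits_customer ON circuits (customer)")
after = plan(QUERY)   # the planner can now use idx_circuits_customer
print(before)
print(after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan reports a scan of the table and the second names the new index. PostgreSQL, MySQL and Oracle each have their own `EXPLAIN` facility for the same kind of check.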
If I'm dealing with Oracle, sometimes the solution is as simple as adding an optimizer hint to make the query do its joins in a sensible way. Sometimes denormalization is helpful, though you want to be careful with that. Sometimes a small amount of data caching in the application can mean a tremendous decrease in database load. And so on.
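
As one example of application-side caching, here is a small time-to-live cache placed in front of a lookup. It is a sketch, not a recommendation of specific settings: `TTLCache`, `load_price` and the 30 second lifetime are assumptions chosen for the example, and `load_price` stands in for whatever query would otherwise hit the database.

```python
import time


class TTLCache:
    """Caches loader results per key for a fixed number of seconds."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self.loader = loader
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}

    def get(self, key):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]            # fresh enough: no database round trip
        value = self.loader(key)     # missing or stale: ask the database
        self._entries[key] = (now, value)
        return value


calls = []


def load_price(sku):
    """Stand-in for a database query; records each call it receives."""
    calls.append(sku)
    return len(calls)


fake_now = [0.0]  # controllable clock so the example does not have to sleep
cache = TTLCache(load_price, ttl_seconds=30.0, clock=lambda: fake_now[0])
first = cache.get("sku-1")
second = cache.get("sku-1")   # served from the cache
fake_now[0] = 31.0
third = cache.get("sku-1")    # entry expired, loader runs again
```

The trade-off is staleness: for up to the lifetime of an entry the application may show a value the database has already changed, which is why this suits data that is read often and updated rarely.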
If you can tell us more about the specifics of your situation, there are lots of people here who can offer more specific advice.
Well you sort of can! by codepunk · 2003-09-11 18:02 · Score: 3, Interesting

I run two types of clusters; one of them is RAC 9i on Linux. Nothing, and I mean nothing, has the functionality of RAC 9i. You can put a bullet through one of the nodes right in the middle of a query being returned and still get your records just like nothing ever happened. The other database I run is PostgreSQL on Red Hat Advanced Server, with the database files symlinked into the SAN (this is high availability only). If I had to do it again I would not use PostgreSQL because it scales for shit and I cannot under any circumstances keep it up in a 24/7 configuration. The database needs to have vacuum run on it once a day and I have to do that manually because half the time it fails. Running a vacuum on the database while clients are connected basically locks everyone tight until it is finished.

If you cannot spend any money and want a fast, scalable and highly available system, my advice is first SAP DB and/or MySQL, and Advanced Server on some sort of shared SCSI.

Now all of you big postgresql advocates flame away but it does not change the facts. I love the database but if you need heavy lifting it just does not cut the mustard.

--

Got Code?