Open Source Databases Revisited
pusakat writes "If you've been following performance comparisons of the different Open Source databases, Tim Perdue revisits PostgreSQL v. MySQL with production data from SourceForge and comes up with interesting results. This may be fodder for yet another 'my database is better than your database' exchange from both camps but the results are interesting anyway."
Linux is great for manufacturing.. I work for a mid-sized auto remanufacutuer, and we use postgresql all over the place, to do inventory, and core checkin. postgres has made huge strides in the last few years.. we started out with an old pentium running 6.3 postgres.. it was slow as dirt.. we are currently running 6.5.3, on a 600mhz alpha system.. and will probably move to 7 one of these days.
you have a budget? what's that? *smirk* I'm lucky if i get $50 for a stack of CD-R's.
I think the lack of tools, row-level locking, transaction support, etc. in MySQL are a shame. Yet, people talk like its the best thing since sliced bread. This has lead me to discover two kinds of MySQL users:
Not necessarily. Database design is one of a very few true black arts (read: serious voodoo.) The number of queries alone isn't a problem. The layout of the data, indexing, number of queries, complexity of queries, and the volume of returned data all factor in.
Let me provide a real world example. Some years ago, it was my task to roll out RADIUS for an ISP's dialup network. The stock setup for the USR RADIUS server -- it was the only one I could get to talk to a SecurID server correctly -- was, shall we say, non-optimal. USR's stats said one (1) authenication per second when using postgres. Let's ignore the way they had postgres setup (fsync enabled) for the moment. The reason everything was so bloody slow was the database schema. Even though there was only 1 query against the USERS table, it had to return 5k. Each row of the USERS table contained everything you could ever imagine (well, that USR could imagine.) Each row consumed almost 5k of space.
After three months of testing, data collection, and analysis, I redesigned the mess. USR literally laughed at my design because it made between three and six queries depending on the user. They stopped laughing when I showed one of their engineers the setup actively handling over 100 authentications per second. The USERS table, holding 10,000 entries, was about 2M.
actually, in the mysql fulltext search docs the mysql guys do a decent job of detailing the _two_ options for querying the fulltext index. putting the "match (a,b) against ('words')" in the "select" list returns the relevance rank against all rows in the table, including those with zero relevance. if you put the match clause as part of the "where" clause then mysql returns only those rows with non-zero relevance, ordered by the relevance score (but the relevance score is not one of the return columns).
- mark
Which of them is most stable (with HUGE databases) ?
What does HUGE mean? Gigabytes? Terabytes? Exebytes? Also, size is not the only metric involved in stability. Stability with a given number of records or fields, for instance. A 100 gigabyte database could have 10 records or a million records. Depends on how big each record is. Another metric is the number of simulatenous requests it can handle. One might work fine if it gets a 100 requests at a time, but chokes if it gets a 1000 requests at a time. A million record database might be only accessed by 2 clients at a time...size doesn't equal number of users.
Which of them is most likely to be able to recover withouth glich from an "impure" shutdown ?
Good point, but they're probably about the same in that regard (just taking an educated guess here).
Which will benefit more from a HUGE ram quantity or multiple cpus ?
Most well-written software will take advantage of as much RAM as it can get. The question would be which program is better at managing its own memory?
As far as mutiple CPUs go, programs have to be written specifically to support multiprocessing. On Linux or Windows NT, for the most part, this means writing the code to use multiple threads. I'm not sure if either of them are or are not, but that would be a very good point.
My journal has hot
I don't know about MySQL, but from what I can tell psql for postgreSQL is fairly full featured in terms of wrapping SQL in functional language-- and comparing the minimal work I've done in GUI DB tools to the experiences I've had working in command line or Perl with pgsql, I'd take the non-gui process any day, it's much more direct, flexible, and doesn't suffer from the same tool-specific learning curve (i.e. solid Perl + solid SQL = extremely portable skill, whereas any of the listed commercial tools may or may not be portable-- this is important for both ramp-up time on new experiences/jobs/whatever, and for personal development or career building). As for the tools you mention, do any of these function via ODBC, and therefore have the capability to connect to any database with drivers? Not to say that you don't have a valid point, I'm just wondering if there isn't a good reason why the Free Software types haven't embraced GUI and ungeek-friendly tools...
I do not have a signature
Interbase is free, it is not restricted by the GPL.
Interbase has been released under a variant of the MPL that "protects" the Interbase name (IOW, only Interbase may release certified Interbase builds). Anyone may grab the code and use it whichever way they like.
I'd love to use db.linux on a couple of projects but this limitation is a killer. The complete license is at License Terms.
Please note that this benchmark (and any benchmark, for that matter) applies only to the system and application that it's testing. Database applications vary so widely that it's very difficult to get any meaningful numbers outside of very specific areas.
As an example, I recently ran some tests that came up with the exact opposite results. My application is a large message tracking database. Query speed is secondary; insertion speed is critical. MySQL handles my test data set in 20 minutes; PostgreSQL takes over 3 hours (!). For this application, PostgreSQL is Right Straight Out.
Other notes:
1. PostgreSQL's tables took up roughly twice the space of MySQL's.
2. MySQL's lack of transactions is a real pain. I can, however, work around it (in this particular application, at least).
3. MySQL's "text indexing" is useless. The evaluation function returns every record that contains any of the search terms; there is no way I've found to require all search terms. No documentation, of course.
4. The latest beta of MySQL can use Berkeley DB tables to get real transaction handling. Unfortunately, this is even slower than PostgreSQL.
Obvious conclusion: Run your own tests and draw your own conclusions.
--
Welcome to the Turing Tarpit, where everything is possible but nothing interesting is easy.
I'm pretty sure k5 sets a limit of <100 queries on a page before they start worrying about performance. At least that's what Rusty said if I remember correctly. I'm guessing they probably run about 10-20 queries on most pages, however.
My site runs about 10-20, but thats only after caching lots of data in memory.
--
Why? The answer is simple: load controlling. Basically, as concurrency increases, the cost of locking and that of buffer pool misses become prohibitive and transactions begin to starve. So most "smart" DBMSes will queue a few transactions for later processing after a certain point.
Because MySQL uses such coarsely-grained locks (at the table level), the opportunity for concurrency is very low to start with; hence the decreased throughput and starvation.
If you're interested in this stuff, look at Jim Gray's seminal paper "Granularity of Locks and Degrees of Consistency in a Shared Database" and Chou and DeWitt's "An Evaluation of Buffer Management Strategies for Relational Database Systems." You should be able to find either from the ACM Digital Library or in an anthology (like Stonebraker's _Readings in Databse Systems_).
~wog
Essentially what that test showed was that mysql had to lock all those tables in order to do the join.. And that with Postgres's versioning system, it was a trivial matter.
I remember reading on the topic of excessive joins in a Sybase Admin book once (excelent book on "practical" database, since it shows you exactly how one company solved all the DB problems). It essentially said that determining join order was actually a very slow process.. Exponential in fact.. Especially if your DB tries to determine which indexes to used based on statistics... The selection of indexes increases the number of possible variations incredibly. Sybas's solution was to only compare a select few of the possible join methods - which is a trade-off.. Alternatively, if you were to create a stored procedure, then which-ever method was best at creation time will be used from there-on-out...
Stored procedures - in Sybase at least - reduce a great deal of the overhead, especially in a 16-way join. Sadly, I don't know if Postgres does ANY pre-compiling of stored procedures. They have two main methods: straight named 'sql' statements, and 'psql' (a true stored procedure 'language'). My guess is that they do little more than pre-parse the statment (if even that). If this is the case, then using a stored procedure might even be slower than raw SQL since a call requires the parsing of the function name, then loading of the function into memory from a table, then the parsing of that statement (two more steps than raw processing). It is, however, necessary for triggers and rules.
Postgres (and I believe mysql) has raw-c extensions, but I really dislike these for general DBA operations (not to mention I haven't spent the time learning them).
As for the 16-table join, a general web page probably shouldn't do this (especially a heavily used page), but a database SHOULD at least let the DBA perform the ad-hoc query when trying to extract information. The only real solution is summary information, which often tends to be more trouble then their worth... Another solution is de-normalization (which makes me feel dirty)... Basically like what Blacknova.net uses... If you have 1 to N data, then just pick some upper bound N and include that many extra columns directly in your main table. Fast and efficient, provided that you choose the right N.
-Michael
-Michael
Anyway, the number of tables you have depends greatly on the problem you are trying to tackle. If you've used SourceForge, you'll see that there is a whole lot of data displayed on each page. I would be more worried if he would have been able to get all of this from a single database query (could you imagine a table that would do that - yuch)
Engineering and the Ultimate
I'm sorry but I think this grossly misses several points and undermines the appropriate difference between threads and processes.
One way to think of threads is as a solution to bloated processes. In operating systems like Solaris, which are designed to scale to 32 or 64 (or more) processors, the number of locks in the kernel is enormous. As a result, context switches and process creation time tend to be higher. In gross terms, you pay for the concurrency at high numbers of processors with poor performance at lower numbers of processors. This is not necessarily bad (especially if you plan to use 16 or more processors at some point) but it is just the way things are.
Linux, on the other hand, penalizes scalability for large numbers of processors in order to get much better performance with smaller numbers of processors. Linux does this because Linus doesn't believe that 16+ processor machines are common or sensical and that the kernel should be optimized for common (and sensical) cases.
Why do I mention all of this? Because Linux process creation times are slightly faster than Solaris thread creation times. Anyone who fetishizes multi-threading versus multi-processing doesn't really understand the difference between the two and when it really matters.
Processes are contexts of execution. Threads are contexts of execution that may share a memory space with other contexts of execution. These are fundamentally different things in some operating systems. In Linux, a thread is just a process that shares a memory space with another process. This is because processes are *fast* to create and switch to in Linux (optimized for the common case, remember?).
Anyway, I don't mean to bash maraist at all, just to point out a common set of misconceptions about these things. When considering these architectures, each database system must try to optimize across all of the operating systems they plan to be deployed on (just like the apache project, which i think was an excellent analogy).
I work for a large manufacturing company, and recently we submitted our budget for next year (2001)... well, on there we had SQLServer, and a bunch of other commercial software. We got our budget back in its final form, and it did NOT include our SQLServers... we needed a database system, so I threw together a proposal for some smaller MySQL servers.
My manager liked the idea, but wanted info on other open source databases... I pointed him towards Tim's article (this was last week) and we've been on our merry way since. We now have both MySQL and PostgreSQL boxes in production for testing, and will make our determination later on!
Tim's article played a HUGE roll in our initial impression forming, and I would recommend it to all and any who are looking for info on open source databases!
-C
"This above all, to thine own self be true"
We are a medium sized company developing web apps. We are not poor, but we would be if we ruled out open source software. We use db2 for most work, but postgres is being investigated and we have found it quite workable for our many needs.
Your comments remind me of similar ones a few years back when we started using linux and ppl said "serious companies don't use linux" Nobody (except a small-minded minority of change-haters) is saying that now.
The only explanation here is the caching behavior of the data base. Howerver, this also indicates that benchmarking databases is not a really trivial task, because exactly such effects must be considered for database benchmarks.
Further, besides a few graphs. The test says actually nothing different then: PostgreSQL improved strongly, while MySQL is a dog. Actually it says nothing at all.
Not that I mind the results. I worked with both databases briefly and believe that PostgreSQL is far closer to an industrial strength database (Lack of transaction control disqualifies MySQL for that in the first place). Nevertheless, I think the results really lack any significance.
ich bin der musikant
mit taschenrechner in der hand
kraftwerk
- Which of them is most stable (with HUGE databases) ?
- Which of them is most likely to be able to recover withouth glich from an "impure" shutdown ?
- How much disk space can they waste in tablespaces with frequent spaces ?
- Which will benefit more from a HUGE ram quantity or multiple cpus ?
--
1% APY, No fees, Online Bank https://captl1.co/2uIErYq Don't let your $$$ sit in a no-interest acct.
As a followup idea.. 16 joins reminds me of a project that I did once where we used ID's for every table.. Thus every table stored meta data and a single piece of actual data.. Thus if you had 17 pieces of data to display, then you'd have to do 16 joins. There were three solutions to this:
First, cache all the smaller tables, and do the translation (from say user_id to user_name) in an external hash (I know perl can easily do this; don't know about php). For static fields such as user-names, email addresses, etc, this is easy. Unfortunately this doesn't help when trying to find all users with email addresses going to aol and souce code that's over-due, etc. But it does solve the summary report problem.
Second is to break up the query into multiple operations.. You need an incredible amount of memory to do certain types of joins, and sometimes the sql-processor guesses the wrong method. Even though postgres's "explain" command told me it was doing hash-lookups for a user-name summary, I found that it was faster to first get all the data with user-id's, then perform a second search with
"select user_name from users where user_id in (x,y,z,....)"
So long as you don't have more than a couple dozen entries, this works nicely.. ESPECIALLY if you have thousands of rows of real-data with only one or two unique user-names. This was a nice-short term fix, but "clean" sql should all be done on a single statement.. Soo onto the third
Denormalize back to form 4 or 3. Namely, whenever you have a user-name, don't use a user-id.. Sure you consume 8 more bytes of memory per row, but it greatly speeds up searches. The method is to ONLY use ID's when you can not uniquely represent a primary key with a single column.. (such as first-name, last-name,...).. But in our computer world, "user-name" tends to be unique. "project-name" should be unique, "working-branch" should be unique.. In which case, at the design level, it's worth the extra consumed space.. Now I know that Mysql advocates (including me a year ago), cringe at the idea of keying off variable length strings.. For me, it was the idea that you can compare integers much faster than strings.. BUT, if the string is an index, then your sorting is going to be MUCH faster, strings distribute much better than auto-incremented integers (which tend to want to be linear). Additionally, the fact that you won't have to perform a join at all removes almost all of the performance problems.
When you don't use meta-data as primary keys, then you find that you only perform joins when you actually have relational data... That joins based on summary data almost always goes away.
Additionally, if your DB ever get's corrupt (say you lose a table), then you don't have to fret over "who the hell was user-id 155?". Given that Mysql doesn't have roll-backs, the potential for data corruption with meta-data is enormous (when you do both reads/writes).
-Michael
-Michael
Actually, you are quite wrong. PostgreSQL has more documentation, I know this cuz I have combed every possible link on both site. Go back to PostgreSQL page, there is a full book free on PostgreSQL on their documentation page. Also, PostgreSQL has bindings for a zillion languages as well, what do you want? Perl? C? Python? C++? more? Before you post, please update yourself before you post your misinformations. I like both database, and I use both. :)
------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
First, I started off with the words "Sadly, I don't know".. Therefore removing myself as an expert in that area. It was more a question than anything else.
As for the other languges, perl, tcl, etc, those can only be less optimial forms, and are more for procedural operations, which has nothing to do with over-comming the 16 join problem. (With the exception maybe of hashed-lookups, but that's dangerous)
It's great to hear about cached query plans (I'd forgotten the term).
As for the problem with speculation.. If nobody were to speculate, and those in the know don't impart their knowledge, then has anyone gained anything? 'Qualified' speculation does exactly as here; Inciting those knowing specific facts to quality errors. My speculation _was_ based on emperical analysis, however, so it wasn't as if I was blindly stateing something.
But now, as a question, is the compiled sql stored in any more optimal fashion? Or does postgres use an interpreter (which does little more than tokenize). To me, a compiled sql statement means that its making raw calls to table operations..
Simple benchmarks that I've done haven't shown any performance increases with using stored procedures, but that might just be because the compilation / interpretation stages were insignificant with respect to the overall operation. The other interpretation is that little is actually gained by the compilation of the sql statement (at least until the cached query plans come around).
-Michael
-Michael
Open source databases -- great. But how about tools to facilitate database development? In commercial terms you've got modeling systems (e.g.: Erwin), Extract-Transform-Load (ETL) tools (e.g.: Informatica, Sagent), Reporting tools (e.g.: Brio, Business Objects, etc.). GUI managers, GUI drag-n-drop development tools, etc.
In the open-source world you've got... Well you've got to custom code in some non-database-centric language, e.g.: C, C++, Java, Perl, etc.
Anyone want to start up an open-source alternative for an ETL tool? Target open-source databases as source and target systems. A usable open-source system would really raise the visibility of open source in the (hitherto) commercial-only data warehousing market.
"But actually trying to use m4 as a general-purpose langage would be deeply perverse" --ESR
Anyone who has designed a site and its database in such a way that generating a page requires 16 queries should not taken seriously when conducting a database performance test. Still, it does seem to suggest that MySQL does not perform too great (which doesn't come as a surprise to me really.. All I'm saying is that a site with a page that requires 16 queries is *VERY BADLY DESIGNED* and will not scale very well at all!
Centura Software makes an open source mobile database software called db.star and db.linux
db.star is the first open source, embeddable data management solution for Information Appliances (IAs). "It is a high performance, small footprint, developer friendly database engine which facilitates the development of rich and powerful applications for popular IA platforms, such as the Palm® Computing Platform and Windows®-powered Pocket PC. "
Open source but not free.
db.linux is OSS Data Management for Information Appliances
Open source and royalty free.
PostgreSQL is faster the MySQL for this guy.
But, as always in the case of database server choices, nothing replace real world test. You should test the various servers on *your* data.
Frankly, I found the numbers quoted in the article ridiculously low. But as we don't know te volumetry of the data, the hardware used, the database cache size, etc, etc, this boils down to 'someone found PostgreSQL better than MySQL for its usage'.
It is not uncommon to have orders of magnitude performance differences between databases. I would love to know how what a tuned ORACLE would get.
Cheers,
--fred
1 reply beneath your current threshold.
I should say at the outset: I've used both Postgres and MySQL on production projects and like both for very different reasons. But here's the thing (and this is a common problem with technology product evaluations): these products have very different design goals and aren't really that comparable.
I'm not saying the standard "MySQL's not a database because it doesn't support transactions and databases have to. ACID!" (although I'm sympathetic to that point of view, I don't think individual words like "database" are worth fighting over--If MySQL wants to call itself a database, fine). What I am saying is that Postgres was designed to be a full-fledged SQL92-compliant database with transactions and triggers and foreign keys and the whole lot. MySQL was designed to be a SQL-based front-end to a very fast database file format. These products are not the same and comparing them without agknowledging that seems foolish.
I've been very pleased by the speed improvements in Postgres recently (partisan testing aside, Postgres 7 really is much faster). I've also been impressed by feature additions in MySQL (although it still isn't close to what you would expect to get if you're coming from the Oracle or DB2 world). But both remain inherently died to their design goals. This isn't a bad thing, at all, because different projects need different products with different design goals.
I compare this to the (often senseless) comparisons of NetBSD, OpenBSD and FreeBSD. One is designed for portability, another for security (features be damned!) and the last for performance, features and multi-processing under (primarily) intel. Although they come from a common code base, they have obviously diverged in design goals. Instead of bashing one product or another (all of the *BSD's and the two databases discussed have *major* problems that are bashable) use the one that seems most appropriate to your needs.
Lest we forget, technical superiority is only a very small part of the battle, you must also win Mind share
Well, that depends on what you mean by "win." Open source is not a popularity contest. In the past, it may have been true that to ensure the continued existence of a cool project, it needed to have a chance of dominating the market at large. But nowadays, as long as there are enough developers interested in a project--oftentimes only because of its "technical superiority"--it can stay alive, grow and improve. If a project can attain real technical superiority in this way, I call this winning.
That being said, I think it should be remembered that for many simple web sites, MySQL was the right choice because its superior performance with a read-oriented setup. Of course, Postgresql people claim a read-oriented setup isn't a real database application at all, but having designed several such small sites, I really wouldn't use anything else. (Flat files?? What an unnecessary pain!)
Just my two cents . . .
Respondeo dicendum quod . . .