Open Source Databases Revisited
pusakat writes "If you've been following performance comparisons of the different Open Source databases, Tim Perdue revisits PostgreSQL v. MySQL with production data from SourceForge and comes up with interesting results. This may be fodder for yet another 'my database is better than your database' exchange from both camps but the results are interesting anyway."
Linux is great for manufacturing.. I work for a mid-sized auto remanufacutuer, and we use postgresql all over the place, to do inventory, and core checkin. postgres has made huge strides in the last few years.. we started out with an old pentium running 6.3 postgres.. it was slow as dirt.. we are currently running 6.5.3, on a 600mhz alpha system.. and will probably move to 7 one of these days.
you have a budget? what's that? *smirk* I'm lucky if i get $50 for a stack of CD-R's.
Please note that this benchmark (and any benchmark, for that matter) applies only to the system and application that it's testing. Database applications vary so widely that it's very difficult to get any meaningful numbers outside of very specific areas.
As an example, I recently ran some tests that came up with the exact opposite results. My application is a large message tracking database. Query speed is secondary; insertion speed is critical. MySQL handles my test data set in 20 minutes; PostgreSQL takes over 3 hours (!). For this application, PostgreSQL is Right Straight Out.
Other notes:
1. PostgreSQL's tables took up roughly twice the space of MySQL's.
2. MySQL's lack of transactions is a real pain. I can, however, work around it (in this particular application, at least).
3. MySQL's "text indexing" is useless. The evaluation function returns every record that contains any of the search terms; there is no way I've found to require all search terms. No documentation, of course.
4. The latest beta of MySQL can use Berkeley DB tables to get real transaction handling. Unfortunately, this is even slower than PostgreSQL.
Obvious conclusion: Run your own tests and draw your own conclusions.
--
Welcome to the Turing Tarpit, where everything is possible but nothing interesting is easy.
Why? The answer is simple: load controlling. Basically, as concurrency increases, the cost of locking and that of buffer pool misses become prohibitive and transactions begin to starve. So most "smart" DBMSes will queue a few transactions for later processing after a certain point.
Because MySQL uses such coarsely-grained locks (at the table level), the opportunity for concurrency is very low to start with; hence the decreased throughput and starvation.
If you're interested in this stuff, look at Jim Gray's seminal paper "Granularity of Locks and Degrees of Consistency in a Shared Database" and Chou and DeWitt's "An Evaluation of Buffer Management Strategies for Relational Database Systems." You should be able to find either from the ACM Digital Library or in an anthology (like Stonebraker's _Readings in Databse Systems_).
~wog
Anyway, the number of tables you have depends greatly on the problem you are trying to tackle. If you've used SourceForge, you'll see that there is a whole lot of data displayed on each page. I would be more worried if he would have been able to get all of this from a single database query (could you imagine a table that would do that - yuch)
Engineering and the Ultimate
I'm sorry but I think this grossly misses several points and undermines the appropriate difference between threads and processes.
One way to think of threads is as a solution to bloated processes. In operating systems like Solaris, which are designed to scale to 32 or 64 (or more) processors, the number of locks in the kernel is enormous. As a result, context switches and process creation time tend to be higher. In gross terms, you pay for the concurrency at high numbers of processors with poor performance at lower numbers of processors. This is not necessarily bad (especially if you plan to use 16 or more processors at some point) but it is just the way things are.
Linux, on the other hand, penalizes scalability for large numbers of processors in order to get much better performance with smaller numbers of processors. Linux does this because Linus doesn't believe that 16+ processor machines are common or sensical and that the kernel should be optimized for common (and sensical) cases.
Why do I mention all of this? Because Linux process creation times are slightly faster than Solaris thread creation times. Anyone who fetishizes multi-threading versus multi-processing doesn't really understand the difference between the two and when it really matters.
Processes are contexts of execution. Threads are contexts of execution that may share a memory space with other contexts of execution. These are fundamentally different things in some operating systems. In Linux, a thread is just a process that shares a memory space with another process. This is because processes are *fast* to create and switch to in Linux (optimized for the common case, remember?).
Anyway, I don't mean to bash maraist at all, just to point out a common set of misconceptions about these things. When considering these architectures, each database system must try to optimize across all of the operating systems they plan to be deployed on (just like the apache project, which i think was an excellent analogy).
I work for a large manufacturing company, and recently we submitted our budget for next year (2001)... well, on there we had SQLServer, and a bunch of other commercial software. We got our budget back in its final form, and it did NOT include our SQLServers... we needed a database system, so I threw together a proposal for some smaller MySQL servers.
My manager liked the idea, but wanted info on other open source databases... I pointed him towards Tim's article (this was last week) and we've been on our merry way since. We now have both MySQL and PostgreSQL boxes in production for testing, and will make our determination later on!
Tim's article played a HUGE roll in our initial impression forming, and I would recommend it to all and any who are looking for info on open source databases!
-C
"This above all, to thine own self be true"
The only explanation here is the caching behavior of the data base. Howerver, this also indicates that benchmarking databases is not a really trivial task, because exactly such effects must be considered for database benchmarks.
Further, besides a few graphs. The test says actually nothing different then: PostgreSQL improved strongly, while MySQL is a dog. Actually it says nothing at all.
Not that I mind the results. I worked with both databases briefly and believe that PostgreSQL is far closer to an industrial strength database (Lack of transaction control disqualifies MySQL for that in the first place). Nevertheless, I think the results really lack any significance.
ich bin der musikant
mit taschenrechner in der hand
kraftwerk
- Which of them is most stable (with HUGE databases) ?
- Which of them is most likely to be able to recover withouth glich from an "impure" shutdown ?
- How much disk space can they waste in tablespaces with frequent spaces ?
- Which will benefit more from a HUGE ram quantity or multiple cpus ?
--
1% APY, No fees, Online Bank https://captl1.co/2uIErYq Don't let your $$$ sit in a no-interest acct.
Open source databases -- great. But how about tools to facilitate database development? In commercial terms you've got modeling systems (e.g.: Erwin), Extract-Transform-Load (ETL) tools (e.g.: Informatica, Sagent), Reporting tools (e.g.: Brio, Business Objects, etc.). GUI managers, GUI drag-n-drop development tools, etc.
In the open-source world you've got... Well you've got to custom code in some non-database-centric language, e.g.: C, C++, Java, Perl, etc.
Anyone want to start up an open-source alternative for an ETL tool? Target open-source databases as source and target systems. A usable open-source system would really raise the visibility of open source in the (hitherto) commercial-only data warehousing market.
"But actually trying to use m4 as a general-purpose langage would be deeply perverse" --ESR
I should say at the outset: I've used both Postgres and MySQL on production projects and like both for very different reasons. But here's the thing (and this is a common problem with technology product evaluations): these products have very different design goals and aren't really that comparable.
I'm not saying the standard "MySQL's not a database because it doesn't support transactions and databases have to. ACID!" (although I'm sympathetic to that point of view, I don't think individual words like "database" are worth fighting over--If MySQL wants to call itself a database, fine). What I am saying is that Postgres was designed to be a full-fledged SQL92-compliant database with transactions and triggers and foreign keys and the whole lot. MySQL was designed to be a SQL-based front-end to a very fast database file format. These products are not the same and comparing them without agknowledging that seems foolish.
I've been very pleased by the speed improvements in Postgres recently (partisan testing aside, Postgres 7 really is much faster). I've also been impressed by feature additions in MySQL (although it still isn't close to what you would expect to get if you're coming from the Oracle or DB2 world). But both remain inherently died to their design goals. This isn't a bad thing, at all, because different projects need different products with different design goals.
I compare this to the (often senseless) comparisons of NetBSD, OpenBSD and FreeBSD. One is designed for portability, another for security (features be damned!) and the last for performance, features and multi-processing under (primarily) intel. Although they come from a common code base, they have obviously diverged in design goals. Instead of bashing one product or another (all of the *BSD's and the two databases discussed have *major* problems that are bashable) use the one that seems most appropriate to your needs.
Lest we forget, technical superiority is only a very small part of the battle, you must also win Mind share
Well, that depends on what you mean by "win." Open source is not a popularity contest. In the past, it may have been true that to ensure the continued existence of a cool project, it needed to have a chance of dominating the market at large. But nowadays, as long as there are enough developers interested in a project--oftentimes only because of its "technical superiority"--it can stay alive, grow and improve. If a project can attain real technical superiority in this way, I call this winning.
That being said, I think it should be remembered that for many simple web sites, MySQL was the right choice because its superior performance with a read-oriented setup. Of course, Postgresql people claim a read-oriented setup isn't a real database application at all, but having designed several such small sites, I really wouldn't use anything else. (Flat files?? What an unnecessary pain!)
Just my two cents . . .
Respondeo dicendum quod . . .