I am pretty sure that PostgreSQL uses thread from version 8.0.
That is not true. There is an auxiliary thread used for signal handling on Windows (which is somewhat hacky), but the basic model of one process per client connection is unchanged, and applies to Postgres on both Windows and Unix.
the tuning documentation for PostgreSQL *states* that good IO performance has more of an impact than good CPU performance
This is often, but not always the case -- it depends on the nature of the query workload. In this case, the benchmark was done using read-only queries, specifically for the purpose of minimizing the I/O requirements and therefore focusing on CPU performance.
Sure -- Greenplum DB is one such database. There's a lot of research on distributed databases from the 1980s onward that addresses this topic in detail.
SQL standard explicitly specifies that writing to the database under READ UNCOMMITTED isolation is not allowed.
Which part of the SQL standard says that? The SQL specification for isolation levels talking specifically in terms of what visibility behavior is disallowed at each isolation level; it makes no guarantees about what behavior will be provided (and it typically doesn't get into the nitty gritty details of whether database writes will be allowed). For example, it is perfectly legal per spec for a database to only implement SERIALIZABLE, and simply accept the other isolation levels as aliases with the same behavior (since those other isolation levels are defined in terms of which isolation defects are tolerable, not guaranteed).
Further, PostgreSQL doesn't even support READ UNCOMMITTED.
Per SQL spec it does, in the sense that it doesn't reject it, it is just implemented via a more strict isolation level (READ COMMITTED).
Asynchronous transactions seem like a bad idea
In real applications, there are often transactions whose durability requirements can be relaxed for better performance (session state in a web app is a typical example). The async commit feature is a nice way of speeding up those transactions, without harming the ACID guarantees provided for other transactions (it can be enabled on a per-transaction basis).
While it is true that dblink and dbi-link can be used to do cross-database joins, they are not even remotely close to being able to do them in an efficient manner.
Well, the point is just that the total memory usage is not a simple product, as you had suggested. As for the relative size of the Perl DLL code sections against the total memory consumption of the process, I wouldn't say the DLLs are "tiny". For example:
% size/usr/lib/libperl.so.5.8
text data bss dec hex filename 1246792 23876 8328 1278996 138414/usr/lib/libperl.so.5.8
(I don't have a Windows box handy, but DLLs should only differ to a relatively small degree.) For many small Perl programs, I'd expect the total size of the code sections of all the shared objects used could be a pretty significant portion of the process's total memory consumption.
Imagine a Perl program, which requires about 35 MB of memory (25 MB for the parsed code, 10 MB for actual work). The user wants to run 6 instances of the program. Without fork (that is, on Windows) the user will need 210 MB of memory.
Not true: the code section of any DLLs required by each instance of the program will only be loaded once.
Some work has gone into improving Postgres' default configuration in recent releases. With 8.2 on my machine, for example, the default configuration allocates:
1MB of work_mem. This is a reasonable figure: because work_mem can be allocated once for every sort operation, each backend can allocate several times work_mem concurrently, so setting it much higher than 1MB could actually consume more memory than would be desirable out of the box, IMHO.
24MB of shared_buffers (maybe less if initdb can't allocate sufficient SysV shared memory). This is on the low side for a modern server, but still not too bad, considering that many people using PG leave most page caching up to the kernel.
128MB for effective_cache_size. Again, maybe not perfect (this box has 2GB of RAM), but probably needs DBA configuration to be at all accurate anyway.
There's probably room for improvement, but the above behavior is definitely an improvement, IMHO.
most platforms today alignment requirements are going to pad the extra byte in memory and disk
MySQL uses a different representation for in-memory and on-disk data, so it's not too difficult to avoid the padding byte in the on-disk representation.
Computers aren't good at retaining knowledge and recognizing patterns? That's news to me... this statement is obviously made by someone who doesn't know what he's talking about...
Heh, you're kidding, right? The text you quoted is from the article, and the person who said it is Jonathan Shaeffer: a CS prof at the U of Alberta, the person who initiated much of the current research into poker, the author of the world champion checkers AI -- in other words, someone who's forgotten more about AI than any of us are likely to learn. His area of specialty is game-playing AI, so I'd trust his opinion on this subject...
The reason this game is difficult is not based on a computer's inability to solve problems, rather that there are so many possibilities that we cannot effectively design algorithms that the can be put to use.
Nonsense. That is one of the reasons that writing a good Go AI is difficult (among others), but the set of possible possibilities in Poker is relatively small. In Hold'em, for example, your opponent has one of 50*49=2450 possible hands -- easily enough to do static precomputation on, for example. Similarly, given the flop, there are 48*47=2256 possible sequences of turn and river cards. And that's assuming you care about the suit of the cards, which is not always the case (e.g. given a rainbow flop there's only a small chance of an opponent drawing to a flush). Random sampling over these relatively small sets of possibilities is perfectly feasible, especially when you can infer information about the way your opponent acts to limit the set of cards they are likely to hold (e.g. if they raise preflop, bias your samples toward them holding a strong hand). Read some of Schaeffer's papers (e.g. this one) on their poker work, I think it's fairly clear where the challenge in poker AI lies.
In short, the difficulty in poker AI is very much a question of reasoning with imperfect information (a classical AI problem, but one that is still fairly open), and changing your behavior in response to how your opponent acts.
I love the "alleged". Denial won't help PgSQL gain market share.
My apologies; "alleged" was the wrong word. My intent was just to suggest that, given the appropriate tuning, VACUUM can have a minimal effect on the performance of concurrent queries. That is, the basic technology is already in place, but it can take a fair bit of Postgres knowledge and manual tweaking to get something that works in production. This could definitely be improved, I agree.
This is market data. The records are time stamped and 99% of selects are constrained by the time stamp "ts>'...'". My guess is that inserts skew timestap index statistics and it becomes useless.
FYI, this sounds like the sort of situation in which table partitioning ("constraint exclusion", new in 8.1) might be useful.
since pgsql relies on OS caching, which one does a proper job of caching for pgsql?
AFAIK there is no unanimous recommendation on which OS to use with PG (or which filesystem, for that matter). Solaris is generally seen as slow (although that is being worked on), and I wouldn't personally consider Windows, but FreeBSD, Linux et al. should all be suitable. That is, I wouldn't expect to see the problems you encountered magically resolved by switching OSs.
I am sure something can be done. Like keep a single "big" counter per table, read in into transaction on start, calculate difference in the transaction. When the transaction ends update the "big" counter.
That would impose a significant amount of overhead for all transactions, whereas only a tiny fraction of queries actually need count(*) on a large table with no WHERE clause.
It took only 7 years of constant nagging, but pg developers did find a way to fix min & max. There is simply no excuse to have such often used feature broken for a DBMS which positions itself for enterprise.
Well, I certainly have no objection in principle to optimizing count(*) with no WHERE clause, and the topic has been discussed extensively on pgsql-hackers. However, I've yet to see anyone come up with a feasible way to implement it: optimizing count(*) with no WHERE clause is hard in a system that uses MVCC.
The only potential issue that comes to mind is that very large hashed aggregations can sometimes run the server out of RAM, if the planner makes a drastically incorrect guess about the size of the hashed aggregate's result set (this should just result in crashing the backend executing the query and not effecting concurrent backends, but on Linux the OOM killer might take down something unrelated). That's definitely a defect (hashed aggs should spill to disk once they exceed a certain point), but the code in question was in 7.4 as well, so perhaps that is not the problem.
If you encounter any query that crashes Postgres in any situation, please report it (either to the pgsql-bugs mailing list, or the #postgresql channel on irc.freenode.net). We'd be happy to try to help you resolve the problem.
Also, PG is perceived as less stable than Oracle, and even less than MySQL.
Not in my experience: Postgres is very stable. Very old releases of Postgres (6.5 and older) definitely had some bugs, but anything released within the last few years is very stable. It's somewhat difficult to quantify "stability", but if you have any evidence for the alleged instability of Postgres, I'd like to hear it.
FYI, Postgres 8.1 supports partitioned tables (although the implementation isn't as nice as Oracle's yet).
work_mem => Memory Postgres will allocate PER database backend for sorting and other purposes. This is not shared between backends.
Nitpick: work_mem is actually the per-operation limit on temporary memory usage. If a single backend is doing several sorts simultaneously (which is quite possible with reasonably complex queries), it can consume several times work_mem.
Vacuum kills performance. Some uses maybe OK with loosing 50% or more while VACUUM runs.
What version of PG were you using? Recent versions should be fairly good about minimizing the effect of VACUUM on concurrent operations. There are many people who run VACUUM regularly during production operation without degrading their overall performance too badly -- look into vacuum delay configuration, among other things.
In our case (a lot of inserts with majority of selects going for the newly inserted records) performance degrades within 6-8 hours after running VACUUM & friends.
Well, INSERT and SELECT will not generate expired tuples (except if the INSERTing transaction aborts), so if your workload is mostly INSERT / SELECT I wouldn't expect you to need to VACUUM very often. Did you determine what the cause of this alleged performance decrease was?
It caches the last few results, but otherwise goes to disk for data even if there is anough RAM to cache the whole DB.
Nonsense, PG does not cache "the last few results." If anything, the shared buffer cache is organized around caching frequently used pages, not results.
The PG developers keep saying that it's the job for the OS. Now, which OS should we use then? FreeBSD, Linux, Windows? Which one?
Why is letting the OS handle most of the I/O caching unacceptable for you?
As for which OS you should use, that's really up to you, Postgres works well on a lot of platforms. I'd personally wouldn't run an important DBMS on Windows, though.
Forever broken COUNT(). Although MIN & MAX were fixed in the latest release, COUNT() is still broken and there is no fixing in sight. Yes, I beieve 10 seconds execution time for count() on a table with just a few million records qualifies it as a broken feature.
Well, if it's "broken" it seems hard to envision how to fix it. What would you suggest? Due to MVCC, you can't just maintain a single count of the tuples in the table (or rather Postgres does maintain such a count, but it is merely an approximation for planning purposes) -- different concurrent clients will see different snapshots of the table. You could potentially optimize COUNT(*) by keeping multiple counts for different clients etc, but that would impose significant overhead for all clients, most of whom aren't interested in COUNT(*).
And do people really use SELECT COUNT(*) FROM large_table; regularly in realistic applications? Not in my experience...
Postgres (last I checked) preferred to let the OS do the data-caching. Thanks, but no thanks.
Well, Postgres does do its own caching in userspace (see the shared_buffers configuration parameter and related documentation). It just does that caching in addition to (or rather, on top of) the caching and I/O scheduling done by the kernel. Why do you consider this to be a problem?
(Yes, letting the kernel do most of the caching does result in a minor performance hit, but I think that the amount of work required to implement raw I/O doesn't justify the returns, at least at this point.)
And no 64-bit version [of Postgres] (though I've read a few people have managed to compile one, I wouldn't trust it unless Postgres gave it the thumbs up).
This is not true: Postgres has supported 64-bit architectures for many years. The official list of supported platforms includes many 64-bit architectures (AMD64, IA64, Alpha, Sparc64, ppc64, MIPS, PA-RISC).
Here the big databases allow you to assign different amounts of memory to different buffer pools, which are then assigned to different tablespaces.
Yeah, Postgres doesn't currently support this. IMHO it isn't that useful -- the performance improvement I'd expect would be pretty small (for one thing, all Postgres buffering is done in addition to the kernel's buffering, so the net impact will be smaller). It also adds a significant administrative burden -- you need to configure which objects go in which pools, as well as how large each pool is.
5. process management: in db2 your application writes to a buffer pool, an asychronous agent picks up that change and writes it to a log file, another asynchronous agent picks it up and writes it to the table. This heavily-asychronous behavior (and yes, with memory & processor tuning available for each agent type) allows you to maximize write-throughput. Postgresql and mysql are still in the slower sychronous world.
DB2 may well be better than Postgres here, but your explanation above doesn't make a lot of sense. In Postgres, a committing transaction only needs to wait for the WAL record describing the transaction to be flushed to disk (multiple transactions that commit concurrently can be flushed via a single fsync(2)). That is the only I/O that needs to be done synchronously -- the rest can be done async (notably, this includes the table I/O itself -- the modified buffers are just marked dirty in memory and are subsequently flushed to disk). Note that a backend may also need to wait for dirty pages to be flushed from the buffer pool if it is trying to replace a dirty page with a clean one, but (a) those flushes are done via write(2), so there is not necessarily a disk flush involved (b) the background writer in 8.0+ is intended to resolve this by ensuring that most of the work of flushing dirty pages is not done by a normal backend.
6. parallelism: in mysql and postgresql all queries are single-threaded. In db2 and oracle a large query will actually split itself up into multiple sub-queries to maximize throughput for multiple cpus and storage arrays. This provides db2 & oracle with linear performance improvements up to 4-8 cpus. In othe words, large queries that perform table scans can take advantage of SMP hardware for the commercial products - and cut down your query time by 75% on a 4-way compared to mysql and postgresql.
... assuming your table scan is CPU-bound, which is almost certainly not the case. In practice, intra-query parallelism is useful for two scenarios that I can think of: creating large indexes, and OLAP workloads in which you are running a small number of concurrent queries, each of which is extremely expensive. In more normal OLTP circumstances, the number of concurrent clients far exceeds the number of CPUs in the box, so you don't need to parallelize within each query. Still, I agree this would be useful to have in some circumstances, although it's a bit difficult to see how to implement it reasonably.
7. partitioning: btree indexes only work for very selective queries - like when you want 1% or less of the data of a table. But for many queries you need to crunch 5,10,or 15% of the data. That's where range partitioning comes in: you just scan the data you absolutely need to.
PostgreSQL 8.1 (currently in beta) includes "constraint exclusion", which is essentially a primitive form of table partitioning (using inheritence and check constraints, you divide the data into tables with distinct check constraints; the optimizer has been improved to recognize when a child table can be omitted from the query plan by looking at the check constraints involved).
Affilias (which runs the.org TLD) employs Jan Wiek (sp?) who is a member of the core development team. They paid for him to develop a Master-Slave async replication system which is called Slony-I (released under a BSD-style license). Also under development is a synchronous replication solution called Slony-II.
Right; SL2 will also be multi-master. (Whether it is truly synchronous or not is open to debate if this algorithm is used -- but it is certainly "almost" synchronous, and in any case that algorithm isn't necessarily going to be used.)
Now as for multimaster replication, it is possible to partition tables using PostgreSQL's inheritance features and then have each system be the master of one partition of the table.
And companies such as Affilias have developed open source replication systems (which could be used in multimaster configurations with a sufficiently small number of masters though these configurations are largely undocumented).
Having to run vacuum all the time to help the query optimizer figure things out. Why this doesn't happen automagically in the background without me having to worry about it is beyond me.
VACUUM does not have much to do with the optimizer's statistics. Statistics are primarily updated by ANALYZE (and VACUUM ANALYZE), and should be updated periodically (say, once a day or week), or when the distribution of the data radically changes.
As for scheduling VACUUM and/or ANALYZE automatically, the pg_autovacuum daemon (integrated into 8.1, available as a contrib/ module prior to that) can be used. There are various people thinking about how to improve automatic vacuuming in the future, but pg_autovacuum solves the basic problem. BTW, automatically scheduling maintenance operations in an optimal fashion (and without imposing overhead on routine operation) is not an easy problem...
I find the documentation on mysql.com superior to on postgresql.org
Is there anything specific that you noticed could be improved with the PG docs?
3. PostgreSQL is heavier, and I hate statistics collector subprocess via udp (which seems to be eating 1-2% cpu all the time)
Note that via configuration settings you can disable the stats collector, or reduce the amount of stats that are collected. I'm a bit surprised that the collector is imposing a noticeable amount of overhead with the default configuration.
I didn't intern for Microsoft this summer, but I've interned for them the past two summers, and I'd highly recommend it (especially for college students who have done some F/OSS hacking on the side, as those are exactly the kind of people I think will get a lot out of software design (SDE) intern position at MS). Perhaps the nature of the experience differs depending on the group you intern with, but I've interned with two very different groups and had enjoyable experiences both times. Also, I did a software design (SDE) internship; interning in project management or test would probably be a bit different, I'm not sure. Highlights:
Interesting work. Interns typically work on a single project for the summer, and you usually get a degree of choice about what to work on. Both times I was given an interesting, challenging project. The nice thing is that your project isn't usually "port some code we don't care about from Platform X to Platform Y" or similar grunt work — it's actually interesting and relevant.
Interesting, competent coworkers. I was pretty amazed at the quality of the full-timers at MS -- pretty much everybody I met who had a technical role was a pretty shit-hot programmer. But more important, everyone was a huge geek -- most people really loved what they were working on. That sense of enthusiasm was infectious; people actually enjoyed the work they were doing! Again, maybe it differs between product groups, but pretty much everyone I met was exactly the kind of person I'd imagine hacking F/OSS if they weren't working for MS (which is what I would have been doing, too).
Nice campus, good pay. All the rest was pretty good pretty good -- the MS Redmond campus is pretty nice, the pay is pretty decent, Seattle weather in the summer was nice, etc. MS hire a lot of interns every summer (in the ballpark of 700-900 in Redmond, say), so there are lots of students in the same position as you to hang out with.
Obviously if you have a moral objection to working for Microsoft, then don't bother applying. But if that doesn't apply to you and you like hacking code, MS has been a great experience for me in the past.
That is not true. There is an auxiliary thread used for signal handling on Windows (which is somewhat hacky), but the basic model of one process per client connection is unchanged, and applies to Postgres on both Windows and Unix.
the tuning documentation for PostgreSQL *states* that good IO performance has more of an impact than good CPU performance
This is often, but not always the case -- it depends on the nature of the query workload. In this case, the benchmark was done using read-only queries, specifically for the purpose of minimizing the I/O requirements and therefore focusing on CPU performance.
Sure -- Greenplum DB is one such database. There's a lot of research on distributed databases from the 1980s onward that addresses this topic in detail.
SQL standard explicitly specifies that writing to the database under READ UNCOMMITTED isolation is not allowed.
Which part of the SQL standard says that? The SQL specification for isolation levels talking specifically in terms of what visibility behavior is disallowed at each isolation level; it makes no guarantees about what behavior will be provided (and it typically doesn't get into the nitty gritty details of whether database writes will be allowed). For example, it is perfectly legal per spec for a database to only implement SERIALIZABLE, and simply accept the other isolation levels as aliases with the same behavior (since those other isolation levels are defined in terms of which isolation defects are tolerable, not guaranteed).
Further, PostgreSQL doesn't even support READ UNCOMMITTED.
Per SQL spec it does, in the sense that it doesn't reject it, it is just implemented via a more strict isolation level (READ COMMITTED).
Asynchronous transactions seem like a bad idea
In real applications, there are often transactions whose durability requirements can be relaxed for better performance (session state in a web app is a typical example). The async commit feature is a nice way of speeding up those transactions, without harming the ACID guarantees provided for other transactions (it can be enabled on a per-transaction basis).
While it is true that dblink and dbi-link can be used to do cross-database joins, they are not even remotely close to being able to do them in an efficient manner.
Although note that Flex Data Services has not been open sourced, and AFAIK Adobe has not announced any plans to make it so.
Well, the point is just that the total memory usage is not a simple product, as you had suggested. As for the relative size of the Perl DLL code sections against the total memory consumption of the process, I wouldn't say the DLLs are "tiny". For example:
/usr/lib/libperl.so.5.8 /usr/lib/libperl.so.5.8
% size
text data bss dec hex filename
1246792 23876 8328 1278996 138414
(I don't have a Windows box handy, but DLLs should only differ to a relatively small degree.) For many small Perl programs, I'd expect the total size of the code sections of all the shared objects used could be a pretty significant portion of the process's total memory consumption.
Not true: the code section of any DLLs required by each instance of the program will only be loaded once.
There's probably room for improvement, but the above behavior is definitely an improvement, IMHO.
Bitmap indexes will almost definitely be in 8.3. Gavin Sherry submitted a revised patch for them a few days ago.
MySQL uses a different representation for in-memory and on-disk data, so it's not too difficult to avoid the padding byte in the on-disk representation.
Heh, you're kidding, right? The text you quoted is from the article, and the person who said it is Jonathan Shaeffer: a CS prof at the U of Alberta, the person who initiated much of the current research into poker, the author of the world champion checkers AI -- in other words, someone who's forgotten more about AI than any of us are likely to learn. His area of specialty is game-playing AI, so I'd trust his opinion on this subject...
Nonsense. That is one of the reasons that writing a good Go AI is difficult (among others), but the set of possible possibilities in Poker is relatively small. In Hold'em, for example, your opponent has one of 50*49=2450 possible hands -- easily enough to do static precomputation on, for example. Similarly, given the flop, there are 48*47=2256 possible sequences of turn and river cards. And that's assuming you care about the suit of the cards, which is not always the case (e.g. given a rainbow flop there's only a small chance of an opponent drawing to a flush). Random sampling over these relatively small sets of possibilities is perfectly feasible, especially when you can infer information about the way your opponent acts to limit the set of cards they are likely to hold (e.g. if they raise preflop, bias your samples toward them holding a strong hand). Read some of Schaeffer's papers (e.g. this one) on their poker work, I think it's fairly clear where the challenge in poker AI lies.
In short, the difficulty in poker AI is very much a question of reasoning with imperfect information (a classical AI problem, but one that is still fairly open), and changing your behavior in response to how your opponent acts.
My apologies; "alleged" was the wrong word. My intent was just to suggest that, given the appropriate tuning, VACUUM can have a minimal effect on the performance of concurrent queries. That is, the basic technology is already in place, but it can take a fair bit of Postgres knowledge and manual tweaking to get something that works in production. This could definitely be improved, I agree.
FYI, this sounds like the sort of situation in which table partitioning ("constraint exclusion", new in 8.1) might be useful.
AFAIK there is no unanimous recommendation on which OS to use with PG (or which filesystem, for that matter). Solaris is generally seen as slow (although that is being worked on), and I wouldn't personally consider Windows, but FreeBSD, Linux et al. should all be suitable. That is, I wouldn't expect to see the problems you encountered magically resolved by switching OSs.
That would impose a significant amount of overhead for all transactions, whereas only a tiny fraction of queries actually need count(*) on a large table with no WHERE clause.
Well, I certainly have no objection in principle to optimizing count(*) with no WHERE clause, and the topic has been discussed extensively on pgsql-hackers. However, I've yet to see anyone come up with a feasible way to implement it: optimizing count(*) with no WHERE clause is hard in a system that uses MVCC.
Interesting -- what platform are you using?
The only potential issue that comes to mind is that very large hashed aggregations can sometimes run the server out of RAM, if the planner makes a drastically incorrect guess about the size of the hashed aggregate's result set (this should just result in crashing the backend executing the query and not effecting concurrent backends, but on Linux the OOM killer might take down something unrelated). That's definitely a defect (hashed aggs should spill to disk once they exceed a certain point), but the code in question was in 7.4 as well, so perhaps that is not the problem.
If you encounter any query that crashes Postgres in any situation, please report it (either to the pgsql-bugs mailing list, or the #postgresql channel on irc.freenode.net). We'd be happy to try to help you resolve the problem.
Not in my experience: Postgres is very stable. Very old releases of Postgres (6.5 and older) definitely had some bugs, but anything released within the last few years is very stable. It's somewhat difficult to quantify "stability", but if you have any evidence for the alleged instability of Postgres, I'd like to hear it.
FYI, Postgres 8.1 supports partitioned tables (although the implementation isn't as nice as Oracle's yet).
Nitpick: work_mem is actually the per-operation limit on temporary memory usage. If a single backend is doing several sorts simultaneously (which is quite possible with reasonably complex queries), it can consume several times work_mem.
What version of PG were you using? Recent versions should be fairly good about minimizing the effect of VACUUM on concurrent operations. There are many people who run VACUUM regularly during production operation without degrading their overall performance too badly -- look into vacuum delay configuration, among other things.
Well, INSERT and SELECT will not generate expired tuples (except if the INSERTing transaction aborts), so if your workload is mostly INSERT / SELECT I wouldn't expect you to need to VACUUM very often. Did you determine what the cause of this alleged performance decrease was?
Nonsense, PG does not cache "the last few results." If anything, the shared buffer cache is organized around caching frequently used pages, not results.
Why is letting the OS handle most of the I/O caching unacceptable for you?
As for which OS you should use, that's really up to you, Postgres works well on a lot of platforms. I'd personally wouldn't run an important DBMS on Windows, though.
Well, if it's "broken" it seems hard to envision how to fix it. What would you suggest? Due to MVCC, you can't just maintain a single count of the tuples in the table (or rather Postgres does maintain such a count, but it is merely an approximation for planning purposes) -- different concurrent clients will see different snapshots of the table. You could potentially optimize COUNT(*) by keeping multiple counts for different clients etc, but that would impose significant overhead for all clients, most of whom aren't interested in COUNT(*).
And do people really use SELECT COUNT(*) FROM large_table; regularly in realistic applications? Not in my experience...
Actually, 2PC will be in PostgreSQL 8.1 (which is currently in beta).
Well, Postgres does do its own caching in userspace (see the shared_buffers configuration parameter and related documentation). It just does that caching in addition to (or rather, on top of) the caching and I/O scheduling done by the kernel. Why do you consider this to be a problem?
(Yes, letting the kernel do most of the caching does result in a minor performance hit, but I think that the amount of work required to implement raw I/O doesn't justify the returns, at least at this point.)
This is not true: Postgres has supported 64-bit architectures for many years. The official list of supported platforms includes many 64-bit architectures (AMD64, IA64, Alpha, Sparc64, ppc64, MIPS, PA-RISC).
Yeah, Postgres doesn't currently support this. IMHO it isn't that useful -- the performance improvement I'd expect would be pretty small (for one thing, all Postgres buffering is done in addition to the kernel's buffering, so the net impact will be smaller). It also adds a significant administrative burden -- you need to configure which objects go in which pools, as well as how large each pool is.
DB2 may well be better than Postgres here, but your explanation above doesn't make a lot of sense. In Postgres, a committing transaction only needs to wait for the WAL record describing the transaction to be flushed to disk (multiple transactions that commit concurrently can be flushed via a single fsync(2)). That is the only I/O that needs to be done synchronously -- the rest can be done async (notably, this includes the table I/O itself -- the modified buffers are just marked dirty in memory and are subsequently flushed to disk). Note that a backend may also need to wait for dirty pages to be flushed from the buffer pool if it is trying to replace a dirty page with a clean one, but (a) those flushes are done via write(2), so there is not necessarily a disk flush involved (b) the background writer in 8.0+ is intended to resolve this by ensuring that most of the work of flushing dirty pages is not done by a normal backend.
PostgreSQL 8.1 (currently in beta) includes "constraint exclusion", which is essentially a primitive form of table partitioning (using inheritence and check constraints, you divide the data into tables with distinct check constraints; the optimizer has been improved to recognize when a child table can be omitted from the query plan by looking at the check constraints involved).
Right; SL2 will also be multi-master. (Whether it is truly synchronous or not is open to debate if this algorithm is used -- but it is certainly "almost" synchronous, and in any case that algorithm isn't necessarily going to be used.)
Ah, I see. That sounds fairly dodgy, IMHO...
Can you elaborate on this?
VACUUM does not have much to do with the optimizer's statistics. Statistics are primarily updated by ANALYZE (and VACUUM ANALYZE), and should be updated periodically (say, once a day or week), or when the distribution of the data radically changes.
As for scheduling VACUUM and/or ANALYZE automatically, the pg_autovacuum daemon (integrated into 8.1, available as a contrib/ module prior to that) can be used. There are various people thinking about how to improve automatic vacuuming in the future, but pg_autovacuum solves the basic problem. BTW, automatically scheduling maintenance operations in an optimal fashion (and without imposing overhead on routine operation) is not an easy problem...
Is there anything specific that you noticed could be improved with the PG docs?
Note that via configuration settings you can disable the stats collector, or reduce the amount of stats that are collected. I'm a bit surprised that the collector is imposing a noticeable amount of overhead with the default configuration.
Obviously if you have a moral objection to working for Microsoft, then don't bother applying. But if that doesn't apply to you and you like hacking code, MS has been a great experience for me in the past.