PostgreSQL 9.2 Out with Greatly Improved Scalability
The PostgreSQL project announced the release of PostgreSQL 9.2 today. The headliner: "With the addition of linear scalability to 64 cores, index-only scans and reductions in CPU power consumption, PostgreSQL 9.2 has significantly improved scalability and developer flexibility for the most demanding workloads. ... Up to 350,000 read queries per second (more than 4X faster) ... Index-only scans for data warehousing queries (2–20X faster) ... Up to 14,000 data writes per second (5X faster)" Additionally, there's now a JSON type (including the ability to retrieve row results in JSON directly from the database) à la the XML type (although lacking a broad set of utility functions). Minor, but probably a welcome relief to those who need them: 9.2 adds range types. For the gory details, see the what's new page, or the full release notes.
E) stop using oracle and start using postgres
insensitive clod overlords obligatory xkcd car analogy russian reversals whoosh pedant fanbois ftfy in 3...2...1..PROFIT
9.3. Seriously.
http://rhaas.blogspot.com/2012/06/absurd-shared-memory-limits.html
http://postgresapp.com/
I just posted this to the blog, but I will repeat it here --
There is a very good reason we OS vendors do not ship with SysV default limits high enough to run a serious PostgreSQL database. Very little software other than PostgreSQL uses SysV shared memory in any serious way, and there is a fixed overhead to increasing those limits. You end up wasting RAM for all the users who do not need the limits to be that high. That said, you are late to the party here: vendors have finally decided that the fixed overheads are low enough relative to modern RAM sizes that the defaults can be raised quite high. DragonFly BSD has shipped with greatly increased limits for a year or so, and I believe FreeBSD has as well.
There is a serious problem with this patch on BSD kernels. All of the BSD SysV implementations have a shm_use_phys optimization which forces the kernel to wire up the memory pages used to back SysV segments. This increases performance by not requiring the allocation of pv entries for these pages, and it also reduces memory pressure. Most serious users of PostgreSQL on BSD platforms use this well-documented optimization. After switching to 9.3, large and well-optimized Pg installations that previously ran well in memory will be forced into swap because of the pv entry overhead.
you atheists love to take all the fun out of things, don't you?
Eliminate the human sacrifice now and next you'll be saying we have to get rid of our Steve Jobs altars.
I think everyone has glossed over the single most important feature in this PostgreSQL release, IMHO: range data types. Let's say you have a meeting schedule DB application. Currently, if you want to reserve a room between two times (start and stop) so that no one else can have the room during that time, you have to write that logic in your application.
PostgreSQL's range data type lets you create constraints that reject overlapping ranges of time. In a couple of lines of code, this can do every logic check needed to ensure no two people schedule the same room at the same time.
How this is not showing up on anyone's radar is beyond me, or maybe we all just use Outlook or Google Calendar now. However, range types are not limited to time. They apply to anything that must not overlap along a linear scale, as opposed to just checking whether any other record exactly matches the one you are trying to insert.
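To make concrete what the database takes off your hands, here is a minimal Python sketch of the overlap check an application would otherwise have to carry itself. The `RoomSchedule` class and its method names are made up for illustration and are not part of PostgreSQL. In the database, the same rule is typically written as an exclusion constraint on a range column, along the lines of `EXCLUDE USING gist (room WITH =, during WITH &&)`, which needs the btree_gist extension for the equality part.

```python
from datetime import datetime


def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap iff each starts before the other ends."""
    return a_start < b_end and b_start < a_end


class RoomSchedule:
    """Hypothetical in-application booking check (illustration only)."""

    def __init__(self):
        self._bookings = {}  # room name -> list of (start, end) tuples

    def book(self, room, start, end):
        """Record the booking and return True, or return False on a conflict."""
        if start >= end:
            raise ValueError("start must be before end")
        for existing_start, existing_end in self._bookings.get(room, []):
            if overlaps(start, end, existing_start, existing_end):
                return False
        self._bookings.setdefault(room, []).append((start, end))
        return True


if __name__ == "__main__":
    schedule = RoomSchedule()
    nine = datetime(2012, 9, 10, 9, 0)
    ten = datetime(2012, 9, 10, 10, 0)
    half_past_nine = datetime(2012, 9, 10, 9, 30)
    print(schedule.book("A", nine, ten))            # first booking is accepted
    print(schedule.book("A", half_past_nine, ten))  # overlaps the first one
```

Note that this in-application version is only safe with a single writer. Two concurrent clients can both pass the check and both insert, which is exactly the race a constraint enforced inside the database avoids.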
TL;DR: Is there an advanced PostgreSQL for MySQL Users guide out there somewhere? Something more than basic command-line equivalents? And preferably from the last two major releases of the software?
Long version
I've been using MySQL personally and professionally for a number of years now. I have set up read-only slaves, reporting servers, and multi-master replication, converted between database types, set up hot backups (regardless of database engine), and recovered crashed databases, and I generally know most of the tricks. However, I'm not happy with the rumors I'm hearing about Oracle's handling of the software since their acquisition of MySQL's grandparent company, and I'm open to something else if it's more flexible, powerful, and/or efficient.
I've always heard glowing, wonderful things online about PostgreSQL, but I know no one who knows anything about it, let alone advanced tricks like replication, performance tuning, or showing all the live database connections and operations at the current time. So for any Postgres fans on Slashdot, is there such a thing as a guide to PostgreSQL for MySQL admins, especially with advanced topics like replication, tuning, monitoring, and profiling?
... And so it comes to this.
Oracle is not that big of a concern.
There is MariaDB which is data-compatible with MySQL, and has some nice additions (like microsecond performance data), and there is also Percona Server.
If Oracle messes up, like they did with OpenOffice, there will be another version that they cannot touch, like LibreOffice.
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.
I've been searching for a comparison chart of the various SQL databases, but all I can find are very, very old articles.
There's a database project that I'm working on, and I'm choosing which SQL database to use.
MySQL is obviously not up to par.
I don't know how good PostgreSQL is. So, is there a comparison chart or something similar that can help those of us who are going to make the purchasing decision choose one over the other?
Thank you !
Muchas Gracias, Señor Edward Snowden !
I don't see your comment on the blog (maybe it has to be approved?), but the same issue was raised here during review of the patch. The concern was mostly blown off (most PG developers use Linux instead of BSD, that might well be part of it), but if you had some numbers to back up your post, the -hackers list would definitely be interested. Ideally, you could give numbers and a repeatable benchmark showing a deterioration of 9.3-post-patch vs. 9.3-pre-patch on a BSD. If that's too much work, just the numbers from a dumb C program reading/writing shared memory with mmap() vs. SysV would be a good discussion basis.
http://cltracker.net -- powerful craigslist multi-city search
Each client connected to the DB has its own child process. The shared memory is a buffer shared across PostgreSQL child processes with the same parent. That's why the proposed patch would work using an anonymous shared memory segment: the memory is only passed to children of the same process.
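For anyone who wants to see that inheritance in miniature, here is a small Python sketch (Unix only) in which a parent creates an anonymous shared mapping, forks, and then reads back what the child wrote. It only illustrates the mechanism and is not how PostgreSQL itself is written; the function name is invented for the example.

```python
import mmap
import os


def child_writes_parent_reads(message: bytes) -> bytes:
    """Fork a child that writes into an anonymous shared mapping; return what the parent sees."""
    if len(message) > mmap.PAGESIZE:
        raise ValueError("message must fit in one page for this sketch")

    # fileno=-1 requests an anonymous mapping; on Unix the default flag is
    # MAP_SHARED, so the mapping is shared with children created by fork().
    buf = mmap.mmap(-1, mmap.PAGESIZE)

    pid = os.fork()
    if pid == 0:
        # Child: write into the inherited mapping and exit without cleanup handlers.
        buf[:len(message)] = message
        os._exit(0)

    # Parent: wait for the child, then read the bytes it wrote.
    os.waitpid(pid, 0)
    result = bytes(buf[:len(message)])
    buf.close()
    return result


if __name__ == "__main__":
    print(child_writes_parent_reads(b"hello from the child"))
```

No name or key is ever published for the segment, so an unrelated process has no way to attach to it. That is the property the comment relies on.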
While PostgreSQL does use the Apache model, there is middleware available (google 'pgpool' for an example) that, amongst other things, will queue requests so they can be serviced by a limited number of children. Of course, this only matters if there are an awful lot of simultaneous queries (without the corresponding amount of server RAM).
However, your claim about threads per CPU is oversimplified, and especially wrong for a DB server, where processes will most likely be I/O bound. With 1 core, for example, there is nothing wrong with having 5 processes parsing and planning a query for a few microseconds while the 6th is monopolising I/O actually retrieving query results. Or the reverse: having 1 CPU-bound process occasionally interrupted to service 5 I/O-bound processes, which would negligibly impact the CPU-bound query while hugely improving latency on the I/O-bound queries.
I don't think this is true any more. Threads are lightweight... that's the whole point. They all share the same pmap (same hardware page table). Switching overhead is very low compared to switching between processes.
The primary benefit of the thread is to allow synchronous operations to be synchronous and not force the programmer to use async operations. Secondarily, people often don't realize that async operations can actually be MORE COSTLY, because it generally means that some other thread, typically a kernel thread, is involved. Async operations do not reduce thread switches, they actually can increase thread switches, particularly when the data in question is already present in system caches and wouldn't block the I/O operation anyway.
There is no real need to match the number of threads to the number of CPUs when the threads are used to support a synchronous programming abstraction. There's no benefit from doing so. For scalability purposes you don't want to create millions of threads (of course), but several hundred or even a thousand just isn't that big a deal.
In DragonFly (and in most modern Unixes) the overhead of a thread is sizeof(struct lwp) = 576 bytes of kernel space, +16K kernel stack, +16K user stack. Everything else is shared. So a thousand threads has maybe ~40MB or so of overhead on a machine that is likely to have 16GB of RAM or more. There is absolutely no reason to try to reduce the thread count to the number of CPU cores.
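As a quick sanity check on that arithmetic, here is a small Python sketch using only the per-thread sizes quoted above. The constant and function names are made up for the example, and the 16GB figure is the machine size assumed in the post.

```python
# Per-thread costs quoted in the post (DragonFly figures).
LWP_STRUCT_BYTES = 576            # sizeof(struct lwp)
KERNEL_STACK_BYTES = 16 * 1024    # 16K kernel stack
USER_STACK_BYTES = 16 * 1024      # 16K user stack


def per_thread_overhead_bytes():
    """Unshared memory cost of one thread under the quoted figures."""
    return LWP_STRUCT_BYTES + KERNEL_STACK_BYTES + USER_STACK_BYTES


def total_overhead_bytes(n_threads):
    """Total unshared overhead for n_threads threads."""
    return n_threads * per_thread_overhead_bytes()


if __name__ == "__main__":
    total = total_overhead_bytes(1000)
    ram_bytes = 16 * 2**30  # assumed 16GB machine, as in the post
    print(f"per thread: {per_thread_overhead_bytes()} bytes")
    print(f"1000 threads: {total / 2**20:.1f} MiB")
    print(f"share of 16GB RAM: {100 * total / ram_bytes:.2f}%")
```

With these numbers a thousand threads come to a bit under 32 MiB, in the same ballpark as the rough figure above and well under one percent of a 16GB machine.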
--
There are two reasons for using locked memory for a database cache. The biggest and most important is that the database will be accessing the memory while holding locks, and the last thing you want is for a thread to stall on a VM fault paging something in from swap. This is also why a database wants to manage its own cache and NOT mmap() files shared... because it is difficult, even with mincore(), to work out whether the memory accesses will stall or not. You just don't want to be holding locks during these sorts of stalls. It messes up performance across the board on an SMP system.
Anonymous memory mmap()'s can be mlock()'d, but as I already said, on BSD systems you have the pv_entry overhead which matters a hell of a lot when 60+ forked database server processes are all trying to map a huge amount of shared memory.
Having a huge cache IS important. It's the primary mechanism by which a database, including postgres, is able to perform well. Not just to fit the hot dataset but also to manage what might stall and what might not stall.
In terms of being I/O bound, which was another comment someone made here... that is only true in some cases. You will not necessarily be I/O bound even if your hot data exceeds available main memory, provided you have an SSD (or several) between memory and the hard drive array. Command overhead to an SSD clocks in at around 18µs (versus 4-8ms for a random disk access). SSD caching layers change the equation completely. So now instead of being I/O bound at your RAM limit, you have to go all the way past your SSD storage limit before you truly become I/O bound. A small server example of this would be a machine with 16G of RAM and a 256G SSD. Whereas without the SSD you can become I/O bound once your hot set exceeds 16G, with the SSD you have to exceed 256G before you truly become I/O bound. SSDs can essentially be thought of as another layer of cache.
-Matt
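The tiering argument in the post above can be put into a few lines of Python. This is a rough sketch using only the figures quoted there (16G of RAM, a 256G SSD, about 18µs per SSD command, 4-8ms per random disk access). The function names are invented for illustration, and treating each tier as a hard threshold is a simplification.

```python
RAM_GB = 16               # small-server example from the post
SSD_GB = 256
SSD_LATENCY_US = 18       # quoted SSD command overhead, in microseconds
DISK_LATENCY_US = (4000, 8000)  # quoted random disk access range, in microseconds


def serving_tier(hot_set_gb, ram_gb=RAM_GB, ssd_gb=SSD_GB):
    """Return the slowest tier that must be touched to serve a hot set of this size."""
    if hot_set_gb <= ram_gb:
        return "ram"
    if hot_set_gb <= ssd_gb:
        return "ssd"
    return "disk"


def disk_to_ssd_ratio(disk_us, ssd_us=SSD_LATENCY_US):
    """How many times slower a random disk access is than an SSD command."""
    return disk_us / ssd_us


if __name__ == "__main__":
    for size in (8, 100, 300):
        print(f"{size}G hot set -> {serving_tier(size)}")
    low, high = (disk_to_ssd_ratio(us) for us in DISK_LATENCY_US)
    print(f"disk is roughly {low:.0f}x to {high:.0f}x slower than the SSD")
```

On the quoted latencies, a miss that lands on the SSD is a couple of hundred times cheaper than one that goes to spinning disk, which is why the SSD behaves like one more cache layer rather than like storage.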