Slashdot Mirror


PostgreSQL 9.1 Released

With his first posted submission, sega_sai writes "The new version of open source database PostgreSQL was released. This version provides many important and interesting features such as synchronous replication, serializable snapshot isolation, support of per-column collations, K-nearest neighbor indexing, foreign data wrappers, support of SELinux permission controls and many others. The complete list of changes is available here"

29 of 148 comments (clear)

  1. vs Oracle? by pak9rabid · · Score: 4, Interesting

    So...how does PostgreSQL compete with Oracle nowadays as far as features go (specifically, spatial, and data-guard-like replication)? Anybody here tried making the switch?

    1. Re:vs Oracle? by Talderas · · Score: 2

      That dumb reason is backwards compatibility for their customers due to Oracle being older than the SQL standard.

      --
      "Lack of speed can be overcome. In the worst case by patience." --Znork
    2. Re:vs Oracle? by amorsen · · Score: 2

      Backwards compatibility does not prevent Oracle from supporting FETCH FIRST. See Limiting result sets. Row value constructors wouldn't be a problem to handle either, especially since Oracle already has the functionality just with a silly syntax.

      --
      Finally! A year of moderation! Ready for 2019?
    3. Re:vs Oracle? by discord5 · · Score: 5, Informative

      So...how does PostgreSQL compete with Oracle nowadays as far as features go (specifically, spatial, and data-guard-like replication)?

      I can't speak for Oracle, but if you're interested in spatial stuff you should have a look at PostGIS. We've recently been using it to store tons (magnitude of several million) of points and polygons, and we're very happy with it. We've got about hundred simultaneous users connecting to the WFS in peak hours, and it bears the load pretty well if you properly index your tables. I can't speak much for updates, since our database updates in bursts (we import new data every X weeks). I can't go too much into detail about the type of data other than that it's polygons, points, and mostly distance calculations and intersections.

      We briefly looked at Oracle Spatial for a while, looked at the pricetag and the project budget and made the decision to try the PostgreSQL+PostGIS combination and see how far it'd get us. We were pleasantly surprised. I had some experience with PostgreSQL before in the 7.X releases in a previous lifetime but in the end wasn't all that pleased with it, especially on busy servers. Nowadays, I'm running 9.0 and I'm pretty much content about it. Replication wise we've got a PITR setup up and running which is more than enough for our purposes. It's pretty well documented, but be sure to test everything, etc etc etc... It doesn't quite hold your hand when you're setting it up, so double check everything.

      I'm sure that there will be people on here that have more extensive experience with PostgreSQL (and Oracle) to fill you in on the juicy details, but in general I'm pretty pleased with it so far. It scratches my particular itch, and does so without all too much headaches.

    4. Re:vs Oracle? by Talderas · · Score: 2

      Under the SQL standard a NULL string is different from a zero length string (""). It's subtle but it's a difference.

      Oracle does not differentiate between the two. A zero-length string behaves identically to a NULL. If a customer wrote an application where ("") equated to a NULL then fixing Oracle to differentiate between the two would break those applications.

      If they make it an optional feature Oracle needs to make it an opt-out feature in order to maintain backward compatibility while still allowing their customers the ability to update.

      In otherwise, it's easier to ignore the standard on this one thing than screw with things for their customers.

      --
      "Lack of speed can be overcome. In the worst case by patience." --Znork
    5. Re:vs Oracle? by h4rr4r · · Score: 5, Insightful

      It is easier, it is still wrong. It should be opt-in to prevent people from using it for new installations.

      It is not a subtle fucking difference. It is a huge big honking difference. Either you don't know the information which is null or you know it to be "".

    6. Re:vs Oracle? by dkleinsc · · Score: 2

      With regards to LIMIT, Oracle does have a couple of equivalents:
      1. WHERE rownum < end

      2. select * from (select S1.*, ROWNUM rn FROM ( real query ) S1 WHERE ROWNUM <= end) WHERE rn >= start

      Obviously, the second syntax is a bit painful, but it works, and it has the important behavior that it doesn't calculate any of the rows that aren't returned (as Postgres says it does for LIMIT...OFFSET in 8.1, see the docs). This is important when you're selecting items 4990-5000 of a 10,000 row result set.

      I've used both PostGres and Oracle - they're both pretty good at their jobs, both have their quirks, upsides, and downsides.

      --
      I am officially gone from /. Long live http://www.soylentnews.com/
    7. Re:vs Oracle? by greg1104 · · Score: 5, Interesting

      Lots of companies are converting from Oracle Spatial to PostgreSQL plus PostGIS because it's faster and has better compliance to GIS standards. The text of the talk isn't available, but the FAA Airports GIS and PostgreSQL presentation was a typical report I was in the audience for. The FAA's first conversion happened very quickly: just export their data in a standard format, import into PostgreSQL, and tweak some queries. The result worked so much better that they've standardized on PostgreSQL for spatial applications at the FAA now. Internal projects needing a spatial database have to justify why they want the budget for Oracle Spatial, and it's default deny unless you have a really good reason.

      The addition of synchronous replication to 9.1 has made it a pretty even match for Oracle's Data Guard now. The main bonus is that you can control the integrity level you want at the transaction level. So you can have a database with a mix of important data (only considered safe when on two nodes) and fast, best attempt eventual consistency data, all in one place. Nothing else can replace Oracle at the top end while still having a lean enough mode to be competitive with NoSQL database when integrity isn't the top priority.

      We convert Oracle installs to PostgreSQL all the time at my day job. The main obstacles I keep seeing that don't have simple solutions are 1) using a lot of PL/SQL, 2) differences in query handling, such as OUTER JOIN behavior or reliance on optimizer hints, and 3) can't limit the resources used by individual users easily in PostgreSQL yet. I actually have a design outline for how to solve (3)...would only cost a fraction of a typical Oracle license to sponsor that feature. EnterpriseDB's version of Oracle comes with PL/SQL compatibility, but only in a commercial product that lags behind the open-source releases--and buying from them just switches which vendor you're locked into.

    8. Re:vs Oracle? by shish · · Score: 2

      As a demonstration of postgres' spatial abilities, check out OpenStreetMap and the various sub-projects

      --
      I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
    9. Re:vs Oracle? by Bengie · · Score: 2

      It's about as different as a null pointer vs a pointer pointing to an empty string object.

    10. Re:vs Oracle? by fuzzytv · · Score: 2

      I doubt that's an Oracle issue, my guess is they're using a custom-developed tool to export the data and it's buggy. I'm dealing with a lots of data exported from Oracle (CSV, columnar, ...) and I've never had this problem. External tables actually made exporting even easier.

      So while I'm a PostgreSQL fan, let's not blame Oracle for the mistakes of others.

    11. Re:vs Oracle? by fuzzytv · · Score: 2

      The streaming replication is generally equal to Oracle DataGuard (physical standby). The hot_standby actually gives you about the same as Active DataGuard, i.e. the ability to run read-only queries on the standby for free (you have to pay for that with Oracle). With Oracle you'll get a management console to handle all this, with PostgreSQL you have to set it up manually (5-minute task), but there are several tools that may help (e.g. repmgr).

      Spatial ... although it's not a built-in feature, there's a PostGIS (www.postgis.org). A great package to manage geospatial data.

      There are companies that are migrating from Oracle, but they don't want to go public for good reasons. I know there were some case studies about how Sony replaced Oracle with EnterpriseDB - although it's mostly a marketing mumbo jumbo.

    12. Re:vs Oracle? by fusiongyro · · Score: 2

      Another neat thing about Oracle that you obviously don't understand is that Oracle makes you pay for the features you use, but the software lets you turn on whatever features you want. If you're in a large organization with poor database oversight, you may wake up one day to discover that you're using features you aren't paying for, and you're one audit away from a few more tens of thousands of dollars being flushed down the Oracle toilet.

      Lots of organizations bought Oracle in the past because it was trendy or had some little feature that necessitated it, and now that they're growing, they're discovering that the costs are going up too, but they don't need the features that necessitated the original purchase, or the open source databases now have those features. Or maybe they just realize that trendy and expensive doesn't beat free.

    13. Re:vs Oracle? by greg1104 · · Score: 3, Informative

      This isn't an optimization problem. As noted elsewhere in the comments here, Oracle handles string nulls oddly. Because of that, joins happen differently against a null value vs. an empty string. That will produce different query results in Oracle vs. PostgreSQL. It's one of those really annoying conversion quirks, either going to Oracle or escaping from it.

    14. Re:vs Oracle? by hplus · · Score: 2

      Imagine if your form did not always include a blank for middle name, though one was added later. People who filled out the form prior to the addition would have a NULL middle name, while those who filled it out later would have it set, if only to "" in some cases.

    15. Re:vs Oracle? by cduffy · · Score: 5, Insightful

      No foreign keys

      You mean just like MyISAM? It has it's [sic] uses.

      If we grant that a substantial set of use cases where a full-scale, non-embedded, out-of-process database is called for but relational integrity is unimportant exist (something I grant only for the sake of the ongoing argument), that still leaves people who try to use it in situations outside that set. I once took maintainership of an accounting system for car dealerships built on top of a "database" without relational integrity. Cases where bugs in the software resulted in orphaned records and numbers that didn't add up were legion.

      Saying you don't need relational integrity is right up there with saying you don't need bind variables because your hand-built escaping code is good enough. There might be cases where hand-built escaping is good enough, but encouraging J. Random Slashdot Reader to take that approach means it's going to be abused... and I assure you, the person who built that accounting system considered himself an expert.

      no guarantees of that data has been flushed to disk when a COMMIT comes back

      It uses a two-phase commit process.

      Interesting documentation. Reading the links -- suffice to say that MySQL's multi-master clusters are vastly less safe than Oracle RAC. Depending on after-the-fact conflict resolution rather than having proper locking... really? (And with respect to NDB not supporting fsync() before reporting a commit complete -- yes, I can provide a link for that. Being in a cluster isn't good enough -- sometimes you lose a rack, or a full DC).

      Yes, it's cheaper. Yes, it might be good enough for someone. But if someone needs RAC features enough to pay $50K in licensing fees for a tiny cluster, NDB just ain't gonna' cut it. And if I'm handling medical data (which I was last time I deployed RAC), I'm not going to risk massive liability by deciding that I can afford to get a COMMIT back before data is not just ACKed by multiple nodes but sitting on a platter.

      PostgreSQL 9.1's synchronous streaming replication is a sane middle ground -- it's not multi-master, but it actually preserves the same semantics and performance characteristics (durability semantics being the most critical of those -- but at the same time, needing to rewrite everywhere your code uses savepoints or autoincrement or relies on referential integrity exceptions because you're switching to a database backend that doesn't support them isn't any fun either) you'd have in a non-clustered environment. Having it available six years ago would have saved me one helluva lot of money.

      Provided whomever sets the cluster up knows what they are doing and doesn't half-ass it, it works just fine.

      Really? Compared to RAC, wherein moving to a cluster involves no loss of semantics, using NDB looks a whole lot like half-assing it already.

  2. not excited by roman_mir · · Score: 4, Interesting

    I am not excited about any of these changes unfortunately, they are somewhat specialized, though having synchronous replication and serializable transaction isolation sounds more useful than other stuff.

    But there are real things that are missing. Most obvious is distributing of one SQL request into parallel processes or threads to speed up query execution on multi-core systems (which are all multi-core today). The other is the entire issue of attempting to calculate execution time and failing in various cases in the planner, like the really sad cases of completely mis-handling of the mergejoin estimates, which then forces people to set enable_mergejoin to false unfortunately, it's a sledgehammer approach, but otherwise things that can execute in a few milliseconds can take tens of seconds and even minutes instead.

    There are so many ways to improve performance and really kick it up, and instead there are more features added. I think database performance is now more important for PostgreSQL than features (unless this means introducing parallelization of single SQL requests.)

    Otherwise it's a good database, it already provides tons of features. The one weird thing that I find though, is that for replication or hot stand by or just for creating a dynamic backup, the segments that are written to the disk are always of fixed size.

    You can modify the size, which is 16MB by default, but you can only modify the size when you configure the source code before compiling it: configure --with-wal-segsize=1 - this configures the segments to 1MB, which allows the second drive to last that much longer if all you are doing is using a second drive to keep dynamic backup (and that asynchronous backup method, by the way, the problem that they are solving with "synchronous replication", it's that you either have these segments fill up, and then the segment is written to disk, or you wait until time expires for segment to be written to disk if you set checkpoint_timeout). I imagine treating fixed sized segments is easier than generating segments that are of exact size equal to amount of data that was produced in a time period, but it's a waste of disk though.

    The other big thing that I would love to have in a database is ability to scale the database to multiple machines, so have a logical database span multiple disks on multiple machines, have multiple postgres processes running against those multiple disks, but have it all as one scalable database in a way that's transparent to the application. That would be some sort of a breakthrough (SAN or not).

    1. Re:not excited by simcop2387 · · Score: 2

      The other big thing that I would love to have in a database is ability to scale the database to multiple machines, so have a logical database span multiple disks on multiple machines, have multiple postgres processes running against those multiple disks, but have it all as one scalable database in a way that's transparent to the application. That would be some sort of a breakthrough (SAN or not).

      The big reason you don't find that and it would be a tremendous breakthrough, is that it is currently believed to be actually impossible to get that. Have a look at the CAP Theorem. http://en.wikipedia.org/wiki/CAP_theorem

    2. Re:not excited by Anonymous Coward · · Score: 3, Informative

      I am not excited about any of these changes unfortunately, they are somewhat specialized, though having synchronous replication and serializable transaction isolation sounds more useful than other stuff.

      Synchronous replication for many is a must have. In many cases this single feature was preventing adoption of PostgreSQL as many applications require synchronous replication support.

      Per column collation is another feature which is commonly required in complex applications; especially if they are multilingual capable.

      Extensions support is very important and will benefit anyone who uses any contrib or third party extension for PostgreSQL. Unless your database work starts and stop at the most basic of features, this is a something you'll likely use, if not today, then tomorrow.

      Serializable isolation level support, for many, is a critical feature. I agree most people likely won't benefit, but its an important feature to round out support for all database applications and loads. It means PostgreSQL is now an option for whom it wasn't previously.

      Unlogged tables is extremely important, super cool, and very powerful. Many applications require time consuming kludges to work around this previously lacking feature. For many applications this is a make it or break it feature from a performance perspective.

      For many geo-location applications, but even beyond that, nearest-k is an extremely important feature which not only simplifies code, but also reduces development and test while also providing for a nice performance boost.

      And of course, the SE Linux stuff is very important for customers (you'd be surprised how many there are, including the NSA) who consider this the most important feature of the release.

      But there are real things that are missing. Most obvious is distributing of one SQL request into parallel processes or threads to speed up query execution on multi-core systems (which are all multi-core today).

      You want EnterpriseDB. It already has this. As for the rest of your rant, pragmatically, it just doesn't happen very often. Assuming you're not just trolling, what did the PostgreSQL Perf guys have to say about your issue. Generally these types of issues are considered planner bugs (unlike all other SQL vendors) and if possible, will gladly create a fix, if able/appropriate. But 99% of this time, this is just one of those things people love to lie and troll about.

      There are so many ways to improve performance and really kick it up, and instead there are more features added. I think database performance is now more important for PostgreSQL than features (unless this means introducing parallelization of single SQL requests.)

      Except that PostgreSQL is already one of the fastest databases available for 90% of the likely workloads. Performance, in of itself, typically isn't the primary focus right now simply because, for the vast majority of users, it already spanks or is at least on par with most every other option.

      You can modify the size, which is 16MB by default, but you can only modify the size when you configure the source code before compiling it: configure --with-wal-segsize=1 - this configures the segments to 1MB, which allows the second drive to last that much longer if all you are doing is using a second drive to keep dynamic backup (and that asynchronous backup method, by the way, the problem that they are solving with "synchronous replication", it's that you either have these segments fill up, and then the segment is written to disk, or you wait until time expires for segment to be written to disk if you set checkpoint_timeout). I imagine treating fixed sized segments is easier than generating segments that are of exact size equal to amount of data that was produced in a time period, but it's a waste of disk though.

      Pragmatically, just not an issue. Which is why its a compile time option and not a runtime option.

    3. Re:not excited by roman_mir · · Score: 2

      people love to lie and troll about.

      - more people saying that I am trolling, yet again. Yet again, incorrectly, I don't troll ever.

    4. Re:not excited by roman_mir · · Score: 2

      I address all of these shortcomings that I am writing about here within the application.

      Of-course it's much simpler for me to do from application perspective, because I know what the business logic is, so I break SQLs into pieces that can run in parallel, then I execute them in multiple connections against the database (thread or process per connection), and then I merge data as it comes back. This speeds up execution dramatically, not even close to what a single serial SQL can do.

      As to adding more machines to the cluster - again, in the application level I have to split data logically into separate instances, application knows where to go for different segments of data, so this is not transparent to the application of-course, it must know where different data is.

    5. Re:not excited by fuzzytv · · Score: 3, Interesting

      Cassandra is just one of many NoSQL databases, but yes - NoSQL can be an answer to workaround the CAP theorem in some cases.

      But in many cases it's not a solution. If the data are relational, if you need full ACID, etc. then ditching "consistency" is not a choice. There are projects to build PostgreSQL clustering solutions, that may resemble RAC a bit, although none of them uses shared disk (so each instance needs a separate disk). Let's mention PGCluster, PGCluster II or Postgres-XC (aiming to build write-scalable cluster, something Cassandra does in the NoSQL world). Sure, all this has to follow the CAP theorem.

    6. Re:not excited by fuzzytv · · Score: 5, Interesting

      The reliability probably improved since ENIAC, but the the question still is "when it is going to fail" and not if it is going to fail. Because it is going to fail - it may be a drive, CPU, PSU, a network switch, an AC unit, the whole AWS data center ... something is going to fail.

      The beauty of CAP theorem as I see it that it says "You can't get all three at the same time, face it." If you don't need the strong consistency (and with most apps you don't), then ditch it and it'll be much easier and cheaper to built and scale the system. I'd say once you realize this inner beauty, it clears your mind - something like a Zen of distributed computing.

  3. Not Sqlite by wandazulu · · Score: 2

    Just one small nitpick...sqlite is really meant as an embedded database into an application, it's not a full-fledged database like any of the others mentioned (it doesn't have networking, for example). I suppose you could be scaling up from an embedded sqlite db, but that suggests your application has gotten so big that an external database is necessary.

    It's also one of the backing store options for Apple's Core Data framework.

  4. Re:Replication Drawback by greg1104 · · Score: 2

    It's not trivial to figure out, but we've been deploying PostgreSQL 9.0 without the problem you describe (must do a fresh dump from the master) for a while now. The repmgr software we've released takes care of all the promotion trivia. Worst-case, unusual situations can require you use a tool like rsync to make an out of date standby node into a copy of the new master. That's not the expected case though.

  5. Re:Hate to say it by X0563511 · · Score: 3, Insightful

    What's your point?

    Are you going to tell me word processors are boring, spreadsheets put you to sleep, and calculators suck the life out of the party?

    --
    For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
  6. should be excited - what you want already exists by ron_ivi · · Score: 2

    The other big thing that I would love to have in a database is ability to scale the database to multiple machines, so have a logical database span multiple disks on multiple machines, have multiple postgres processes running against those multiple disk

    This exists for Postgres in the form of Yale's HadoopDB project: http://db.cs.yale.edu/hadoopdb/hadoopdb.html http://radar.oreilly.com/2009/07/hadoopdb-an-open-source-parallel-database.html

    HadoopDB is comprised of Postgres on each node (database layer), Hadoop/MapReduce as a communication layer that coordinates the multiple nodes each running Postgres, and Hive as the translation layer. The result is a shared-nothing parallel database, that business analysts can interact with using a SQL-like language. [Technical details can be found in the following paper.]

    as well as for commercial forks of Postgres such as EMC's GreenPlum.

  7. Re:Hate to say it by Steauengeglase · · Score: 2

    It's not their fault they are completely normalized.

  8. Re:Hate to say it by GaryOlson · · Score: 2

    You failed to trigger the proper result.

    --
    Every mans' island needs an ocean; choose your ocean carefully.