Slashdot Mirror


User: nconway

nconway's activity in the archive.

Stories
0
Comments
211
First seen
Last seen
Profile
(view on slashdot.org)

Comments · 211

  1. Re:That's really dumb. on PostgreSQL 8.0 Released · · Score: 1
    If you have an index on a NOT NULL column, count(*) will count the index, and it will be fast(er).


    Unfortunately, this can't be done in PostgreSQL. Tuple visibility information (for MVCC) is only stored in heap tuples, not index entries, so index-only plans like you're suggesting are not possible (you'd still need to lookup the index entry's heap tuple to see if it was visible to your transaction). It might be possible to add tuple visibility information to indexes, but that would be very complicated and have a lot of drawbacks. Also, index-only plans would only result in a marginal speedup in count(*), although of course they are useful for other queries.
  2. Re:Apples and oranges on High Performance MySQL · · Score: 1
    As far as queries go, PostgreSQL seems fast, if you've done "VACUUM ANALYZE" on the database.
    To nitpick, only ANALYZE is necessary to update the optimizer's statistics (which is the source of the performance problem you mentioned) -- VACUUM garbage-collects expired row versions, and VACUUM ANALYZE does both.
    For example, if I have a simple table, indexed on an integer field, and do a select that was "WHERE that_field = 123", it is fast, and uses the index, but if I do "WHERE that_field IN (123, 456)", it doesn't use the index, unless you analyzed the table--it scans the whole freaking table. MySQL is smart enough to actually use an index without you having to run a special command first. Geesh!
    Simply put, you need to run ANALYZE on any database before running any serious queries on it with PostgreSQL (you really ought to run ANALYZE periodically, say once a day or once a week -- it is a fairly fast operation). Doing any meaningful query optimization without statistics about the data is impossible, so the fact that PostgreSQL happened to do an index scan for one query but not the other is pure dumb luck.

    The problem with "just using an index" even if the table hasn't been ANALYZE'd is that there are plenty of situations in which using an index results in worse performance than doing a sequential scan (consider a query that fetches nearly all the rows in a table, for example: in PostgreSQL, using an index scan to process that query would mean reading in the whole index, then reading in the whole table -- and worse yet, doing both with random I/O, rather than the sequential I/O that is obviously used by a sequential scan). The situation becomes much more complex when you consider more realistic queries (involving joins, subqueries, set operations, etc.) -- here, to have even a hope of getting a good plan you need to gather some statistics about the distribution of data in the tables (which is what ANALYZE does). Gathering that information on-the-fly is possible, but it would add overhead to routine database operation.

    So I'm not disagreeing with you -- the need to run ANALYZE does impose a small burden on new users of PostgreSQL, and of course it would be great if PostgreSQL could pick a better query plan without ANALYZE being run. But actually figuring out how to do that is rather more difficult...

  3. Re:Apples and oranges on High Performance MySQL · · Score: 1
    Actually, I did have the complete processing for each report in a begin/commit pair.


    That will be cheaper than doing each INSERT in its own transaction, but it is still significantly slower than it could be. You should really be using COPY for loading lots of data into PostgreSQL, as it is significantly faster than INSERT. Other tips for improving bulk-loading performance:

    • Bulk load the data into the table first, then create indexes and foreign key constraints. Creating an index once the data has already been loaded is significantly faster than incrementally updating the index as each new row is added. (This advice applies to most RDBMSs, AFAIK.)
    • Consider temporarily disabling fsync.
    • Increase maintainence_work_mem and checkpoint_segments -- the former increases the amount of memory PostgreSQL will use for in-memory sorting (e.g. when creating B+-tree indexes), and the latter increases the number of WAL segments.
  4. Re:chaotic? i don't think so... on Chaotic Computing In Practice · · Score: 1
    I don't believe so (can you name a Lisp implementation that supports self-modifying code?).



    If you're referring to macros, they are not an example of self-modifying code. They just allow the definition of new syntactic structures; they can essentially be expanded at compile-time.



    In contrast, self-modifying code was an idea that I believe was originally proposed by van Neumann back in the 40s and 50s. The idea is that a computer consists of a processor that as input a stream of data and instructions (both stored in memory and loaded into the processor). All computers are also capable of modifying data (i.e. writing to memory); van Neumman suggested that programs should also be written by modifying their instructions -- i.e. by rewriting parts of their executable code at runtime. That naturally makes a program difficult to debug, since the code it is executing might change from instruction to instruction.

  5. Re:chaotic? i don't think so... on Chaotic Computing In Practice · · Score: 1
    And, yes, there are reasons we're not all programming in LISP.


    Perhaps, but what relevance does that have to the parent or to the article? If you're suggesting that Lisp is an example of self-modifying code, you are mistaken.
  6. Re:Oh, the stupidity... on Firebird Relational Database 1.5 Final Out · · Score: 3, Informative
    First, you can back up any database that uses OS-level files using tar and gzip - that's certainly nothing special for Interbase/Firebird.


    Uh, no. A typical modern RDBMS will use a scheme similar to WAL (write ahead logging) to record data changes and allow recovery from crashes. Unless you can manage to make tar+gzip atomic (which is true if you're using a filesystem-level snapshot feature, but not true if you're actually using tar), you won't get a consistent on-disk snapshot of a DBMS like that. The reason is that the WAL and the data files will be out of sync.

    In order to effectively back up such a database using tar, you need to shut down the DBMS (which is hardly an appropriate backup technique),
  7. Re:When will PostgreSQL have a Windows version? on PostgreSQL Beta Testers Needed · · Score: 1

    There is a fairly good chance that there will be a native Win32 port of PostgreSQL in the 7.5 release. If you'd like to help make that happen, join the pgsql-hackers-win32 mailing list.

  8. Re:oidrand() on PostgreSQL Beta Testers Needed · · Score: 3, Insightful
    Since I was responsible for removing oidrand(), I should probably speak up in my own defense :-)

    The motivation for removing oidrand() was that:

    1. it was undocumented
    2. it was poorly named
    3. it was crufty: the unused OID argument is intended to workaround an optimizer bug that has been fixed for years
    4. it didn't play nicely with other portions of the backend that also use a PNRG.


    Note that it is easy to workaround this change: you can either replace references to oidrand(123, y) with 'random() < 1.0/y', or just define the following function:


    CREATE FUNCTION oidrand(oid, int) RETURNS bool AS 'SELECT random() < 1.0/$2' LANGUAGE sql;


    The former being arguably cleaner, the latter meaning that you won't need to change a single line of application code.

    For more information, the thread on pgsql-patches discussing the change is here.

    I'm pretty convinced that removing oidand() is justified, but if you feel otherwise, please speak up. Thanks for raising the issue.
  9. Re:The big problem is replication on Open Source Database Clusters? · · Score: 1

    FYI, "PostgreSQL" is an open-source database, whereas "PostgreSQL Inc." is a Canadian-based company that offers PostgreSQL support, hosting, and development. The replication product, which was previously being sold, has been released by PostgreSQL Inc..

  10. Re:-1:Troll on Open Source Database Clusters? · · Score: 1
    My main problem with PostgreSQL is the query optimiser. Oracle's query optimiser is definitely superior as Postgres occasionally comes up with some peculiar query plans.


    Can you elaborate on the problems you've had with the PostgreSQL query optimiser? It's worth noting that the query optimizer in PostgreSQL 7.4 (which is in beta right now) should be significantly improved, but if you've had specific problems, I'd be interested in hearing about them.



    (FWIW, in the future reporting bugs about poor optimization of a given query would be nice, too :-) )

  11. Re:Just how much faster than Postgresql is it real on MySQL 4 Declared Production-Ready · · Score: 1
    is there a way to make postgresql keep up to mysql so I can justify using it and right away get access to those cool things like views, triggers, functions


    There's really not enough information here to help you out very much: database optimization is a complex task. However, the PostgreSQL mailing lists (such as pgsql-performance) frequently handle questions about performance tuning -- if you'd like to improve the performance of PostgreSQL for your application, my suggestion is to post more details about your particular configuration and workload to one of the mailing lists.



    However, two easy tasks you can do to improve performance are too tweak the configuration (for example, increase the default shared_buffers setting), and upgrade to the llatest release of PostgreSQL (7.3.2 is current).

  12. Re:gigabytes? on Object Prevalence: Get Rid of Your Database? · · Score: 1
    And how does this differ from storing non-object-oriented data structures in RAM?


    Because changes are written to disk (a la a transaction log), so you don't lose data in a crash.

    You'd still need to implement searches, and how do you search an collection of objects without placing them on the relational line.


    If you don't know how to do that, you don't know how to program.

    (Hint: use a hash table, binary tree, heap, or other data structure of your choice.)
  13. Re:Did not prove causality on Scientists Don't Read the Papers They Cite · · Score: 1
    So they copy and paste, that doesn't imply that they didn't read it.


    Given that the paper specifically studies the misprints of a 1972 article, it seems unlikely that most people would have an electronic copy of the paper to "copy and paste" from.
  14. Re:Not necessarily... on Scientists Don't Read the Papers They Cite · · Score: 1
    Many citiations now are copied from ONLINE SOURCES. We read the papers, but we hate typing in our bibliography from scratch.


    Actually, the study refers specifically to a paper published in 1972; furthermore, the paper specifically notes that its analysis of misprints applies only to "the pre-internet era". The paper itself (which is linked from the article) addresses this objection.
  15. Re:Quick question on PostgreSQL 7.3 Released · · Score: 1
    Good Win32 support.


    This is planned for 7.4 -- the plan is to have native Win32 support in CVS by the end of December.

    executeBatch and such that I think would be helpful for inserting a lot of rows quickly. There's COPY, but I think it's completely non-standard.


    Can you elaborate on what executeBatch is?

    Other than that, I agree that all the points you raised are worth implementing.
  16. Re:Unhelpful answer. on Darwin 6.0.2 for x86 Released · · Score: 1, Insightful
    Myself, I'd rather run *BSD (in Free/Net/Open forms) over RH8 for a number of reasons- a few technical, a few based on rational-self-interest


    How is a "technical" reason (by which I assume you mean "using the best tool for the job") not based on rational self-interest?
  17. Re:Ummm.... on Novell Releases PostgreSQL for NetWare · · Score: 1
    Actually, there's a trivial work-around: min() is really just syntactic sugar for

    SELECT col FROM table ORDER BY col LIMIT 1;

    and similarly, max() is equivalent to

    SELECT col FROM table ORDER BY col DESC LIMIT 1;

    And both of these queries will use indexes in PostgreSQL. These queries are, of course, SQL-standard.

    I don't know what drawbacks he was talking about, but one drawback which stopped our group from working with PGSQL was lack of support for indexes in aggregates


    Letting the developers know that this is a major impediment for you would be appreciated, I think -- AFAIK most people don't regard it as a major issue.
  18. Re:This surprises me on Novell Releases PostgreSQL for NetWare · · Score: 1

    No, it's not. IIRC it's implemented by adding a new column to the table, copying the data over, and marking the original column as "attisdropped", which marks it as invisible.

  19. Re:This surprises me on Novell Releases PostgreSQL for NetWare · · Score: 1
    However, I think that the drop column issue is a problem and so I do all my prototyping on MySQL.


    ALTER TABLE DROP COLUMN is fully implemented in 7.3beta1, which was released in early September.
  20. Re:Did you *READ* the article? on Helping Computers Help Themselves · · Score: 1
    AFAIK, the free and open-source PostgreSQL also has similar technology built in.


    Not AFAIK -- the PostgreSQL query optimizer uses statistics collected periodically (namely, when the ANALYZE or VACUUM ANALYZE commands are run); optimizer statistics are not updated with any data collected by during query execution. I'm not saying that self-tuning optimizer statistics are a bad idea, but they haven't yet been implemented in PostgreSQL. You might be referring to GEQO (a version of the query optimizer built-in to PostgreSQL that uses a genetic algorithm to avoid an exhaustive search of the solution space for large join queries), but that is obviously completely different.



    Here's a good paper on a related Microsoft technology: STHoles: A Multidimensional Workload-Aware Histogram. IMHO that's the most interesting part of the AutoAdmin stuff mentioned in the article -- I don't care for the performance tuning wizard so much. Also, the design of the IBM LEO query optimizer, mentioned in the article, is described in this paper: LEO - DB2's LEarning Optimizer.

  21. Re:Not for a while. on MySQL A Threat To The Big Database Vendors? · · Score: 1
    IIRC for raw disks it was far easier to implement asynchronous I/O which is a significant performance boost over synchronous I/O. With filesystems you had to rely on the OS which may not have it or implement it differently, and it tends to blow away the disk cache anyway.


    Yeah, that's true. That said, the I/O buffer algorithms used by most modern file systems and operating systems are the result of a lot of hard work. So while there is a theoretical benefit to implementing I/O buffer management inside the RDBMS itself, you need to do a lot of work just to equal the implementation you get "for free" with a modern Unix system, before you even see any performance improvement. I'd also suspect that raw disk I/O will be less portable than using the filesystem, and would likely mean that you'd need to maintain two I/O subsystems (one that used the filesystem, one that did raw disk I/O). Of course, I haven't looked into very thoroughly, so that may not be the case...

    As you point out, it's also possible to do asynchronous I/O without needing to resort to raw disk access (one, two for Linux, I believe there are also implementations of the POSIX AIO API for other Unixen). Although this would require rewriting a lot of the PostgreSQL core, it might well be worth it.

    p.s. isn't dsync a slightly better method to use than fsync if you're using pre-defined/sized filesystem devices?


    If you mean fdatasync(), then yeah -- I'd expect that would be faster on most systems. I was just using fsync() for clarity -- PostgreSQL can actually use 4 different but similar methods for forcing the OS to flush it's buffers (see the wal_sync_method GUC var).
  22. Re:Not for a while. on MySQL A Threat To The Big Database Vendors? · · Score: 2
    Most real DB's bypass the OS/File system and write directly to the disk.


    That's true; however, I think such a feature is mostly a waste of time. It takes a lot of effort to implement something like that. While there are some benefits (such as the capability for a more intelligent buffering scheme), the benefits are fairly small: if you're using a good filesystem to begin with, the performance difference is likely to be quite small. In fact, I've heard that part of the reason that raw disk I/O was implemented in the first place is that the filesystem implementations at the time were pretty poor -- so avoiding the FS buffer was a good idea. Since that's no longer the case for the most part, IMHO the enormous amount of work required to implement raw disk I/O is not justified by the performance improvement.

    This is done mostly for reliability, sicne the OS will cache things and tell the application that it is written before its actualy commited to the drive.


    On the contrary, you don't need raw disk I/O to implement provide reliability. You simply write() the WAL log entry, and then call fsync(), which forces the change to be written to disk. The actual (non-WAL) data change is not fsynced, so it stays in the OS's buffers and is written to disk at some later time -- but that's fine, because the integrity of the data can be checked by comparing the WAL records with the on-disk data.

    If you're using an IDE disk that does it's own caching, you're still going to have problems, since the disk lies to the OS following an fsync() -- it reports back that the changes have been written to disk, while in fact they have not, so if you lose power at that instant, the changes may be lost. However, I fail to see how using raw disk I/O improves the situation in this case -- an IDE disk will still lie to the OS or the RDBMS, either way.
  23. Re:Not for a while. on MySQL A Threat To The Big Database Vendors? · · Score: 2

    Are you crazy? In any substantially complex database having to manage all those files would be a nightmare.


    I don't agree, but I'll accept that it's a valid concern.


    The way Oracle does it, with a file(s) per tablespace, is a much easier option. You can put whatever you want in that tablespace and it makes it very easy to manage.


    Ah, ok -- yes, there are plans to implement tablespaces. I haven't really looked at the concept extensively, but as I understand it, tablespaces don't simply "put multiple tables in one file", which is dumb -- they allow you to organize the physical storage of the database's data in a flexible manner. I can see the need for that -- all I objected to was the alleged deficiency of not storing all database tables in a single file (which would be pretty braindamaged).
  24. Re:Not for a while. on MySQL A Threat To The Big Database Vendors? · · Score: 5, Informative
    I wish TPC www.tpc.org would do some MySQL tests and show just how it really perfoms when compared, to DB2, Oracle, and MS SQL Server.


    The specifications for the TPC benchmarks are freely available -- it's fairly easy to write a client application that follows one of the benchmark specs to test a specific database. contrib/pgbench in the PostgreSQL tree, for example, implements a "TPC-B-like" benchmark.

    Log based transactions.


    That's a fairly vague (or rather, inaccurate) term, but if you mean write-ahead logging, then PostgreSQL has done this since version 7.1. Some of the additional enhancements to this feature (such as point in time recovery) are planned for the near future, likely 7.4

    One data file for multiple tables.


    This sounds like a complete waste of time, IMHO. Since the database client shouldn't have any idea what the physical representation of the data is, it's not clear to me why this would be an important feature to have. Can you elaborate?

    I think Postgres is going to do this soon according to some of the discussions.


    Oh? I haven't heard anything about this...

    Complete SQL-92 support.


    "Complete" support for SQL92/99 is pretty damn difficult (SQL92 is 650+ pages, SQL99 is 1200+). Nevertheless, PostgreSQL aims to support as much of the standard as possible.

    MySQL and postgres are not as good of Multiuser system as the big boys.


    Erm, PostgreSQL uses MVCC, the same concurrency control scheme used by Oracle. It allows for "better than row-level locking" (readers and writers never conflict; one writer only blocks another if they update/delete the same row). In what way is PostgreSQL deficient in this regard?

    Right now postges sends a message to the application saying, this is locked retry transaction, this is just not up to real enterprise levels.


    When exactly does this happen, and what's the exact error message?
  25. Re:Who cares? on Probing Hash Tables? · · Score: 1
    It's a table. Which means it'll be faster than a list.
    Sequentially searching through a hash table is slower than sequentially searching through a list. Similarly, finding a given element in a 10-element list is much faster than finding an element in a 10-element hash table. The point being: hash tables are (much) faster for certain types of access patterns, but not so for others.
    Write well-designed, clean, maintainable code.
    How is choosing an effective hash algorithm not "writing well-designed, clean, maintainable" code? I would include choosing good algorithms and data structures as "good design" -- and an application which uses a single, efficient hash function throughout the application is much more maintainable than one which uses a myriad of hash functions chosen at random throughout the code.