Slashdot Mirror


Linux Databases with Huge Tables?

Eugene writes "I am working on a project to create an expansive network monitoring and all encompassing database solution. With archival and ageing of all data we are looking at a database that grows by 40Gb a year, with some tables being well over the 2Gb file limit in Linux/x86. We have narrowed ourselves down to Oracle 8i (for it's stalwart referencial integrity checking, and PL/SQL for tracing across aged route information) and PostgreSQL (for it's object polymorphism and CIDR data types for IP addresses). We are concerned with the robustness of Postgres compared to "grown up" Oracle, as well as file size limitations. Can anyone give us any advice on these issues? "

13 of 220 comments (clear)

  1. Re:Oracle 8i on Linux by gavinhall · · Score: 3

    Posted by patg:

    OCI programming? Uhhhhgghhh. It's cool, but very different/difficult to the uninitiated. The Oracle distribution comes with some good OCI sample programs, but I wish the interface were like mysql's C interface.

    With OCI, you have to bind every one of your columns before doing an insert,update, or query.

    I suppose that one could write a C wrapper library around OCI which would do that binding for you.

    Essentially, DBD::Oracle is such a beast, except it being for the use of perl.

    I was at one company where the Oracle consultants suggested using a perl cartridge with Oracle's web server because DBD::Oracle wasn't "supported". I went on the explain that DBD::Oracle is really just an OCI application and that it indeed is something Oracle should look at as being "supported".

    By the way, DBD::Oracle (along with DBI) is an excellent piece of code that powers many a site.

  2. Postgresql is good by Grim · · Score: 4
    We have been using postgresql for all our database needs for the last 18 months where I work. It has gone through a number of changes and is now a LOT more stable than it was only a year ago.

    If your data is critical and 100% availability is essential, I would probably still go for oracle, as it has a far more mature background and, whilst closed source, is very stable. Otherwise, postgresql is a very good choice. It handles tables of effectively infinite size (it just breaks them into 1GB chunks to avoid the filesystem limit). And it has the bonus that is is open source. If you need a feature, you can add it. I am sure that if you are part of an organisatin generating 40GB of data a year, you probably have a few programmers floating around to add enhancements.

    As an addition, I have found the postgres development team to be VERY responsive and helpful. Personally I would say they are probably the most active and helpful open source dev team I have seen.

    If you want something out of the box, with 100% availability, go oracle. If you want something that you can control, that will grow with you, and will eventually get 100% reliability (it is very close right now), go postgres.

  3. PostgreSQL and Large Tables on Linux by EricTheRed · · Score: 5

    PostgreSQL on Linux handles large tables pretty well. It breaks each table into 1Gb segments to get round the 2Gb file size limit. We chose 1Gb as it was a nice round figure, which makes it easy to organise the tables when they get really big.

    When they do become huge, you can then move the segments onto different filesystems to ease the load (although currently this is not a simple task). This also applies to indices.

    I know of a few large databases out there using PostgreSQL, but one I have access to is the TASS project (http://www.tass-survey.org), which currently has about 28Gb of astronomical data.

    The largest database I have is currently 5Gb, but that's primarily limited by disk space and not by postgresql.

    --
    Java gaming nut - http://www.retep.org/ or for the rail http://uktra.in/
  4. Oracle 8i on Linux by IainBowen · · Score: 3

    I've been using Oracle a long time now and Linux for a few less years and I've so far found Oracle 8i on Linux a stable and enjoyable experience - however, I have yet to base a real system on it. I've also toyed with Postgres under Linux, but after using Oracle for so long, it just doesn't compare.

    I would say that tablesize going over 2Gb is not a problem. the tables go in tablespaces which can be made up of many datafiles.

    PL/SQL is useful for fast and dirty code, but lacks a little to C for performance. If you can master it, Oracle Call Interface with 'C' will produce very fast code.

    I'd also recommend reading a good book both on Oracle Design and on Tuning. There's a couple by O'Reilly, oddly enough.

  5. Database choices... by tomierna · · Score: 3
    There are so many RDBMSs out there that it is very difficult to make a good choice quickly.

    Even if you limit your choices to one OS, you still have at least five or six great DB platforms.

    The choices you've given are by no means the only ones you have, and the reasons you give for narrowing down your choice of engines seem to be pretty sparse.

    For instance, Sybase ASE and ASA both support referential integrity constraints. ASE's are more limited than ASA's, but ASA has RI checks that are comparable to what I've seen of Oracle 8i.

    All of Sybase's products have T-SQL, which are their programmatic extensions to SQL92. Sybase has a robust C API for writing server extensions.

    Sybase ASE is fully capable of handling the data load you've specified. So is Oracle's enterprise-scale product. So are some of Informix's products.

    Interesting tidbit about Postgres - parts of it found its way directly into a product called Illustra. Illustra was bought lock, stock and barrel by Informix. Informix has in one of their products the object-polymorphism you crave.

    From the PostgreSQL Guide:

    The authors of PostgreSQL 1.01 were Andrew Yu and Jolly Chen. Many others have contributed to the porting, testing, debugging and enhancement of the code. The original Postgres code, from which PostgreSQL is derived, was the effort of many graduate students, undergraduate students, and staff programmers working under the direction of Professor Michael Stonebraker at the University of California, Berkeley.

    Michael Stonebraker spun Postgres into Illustra. For a while after the Informix buyout he was grafted into their executive management. I dunno if he's still there, since Informix doesn't seem to list that information on their site.

    Informix still has grants at UCB as well as many, many other colleges. (source) Some of those grants sound interesting...

    Now, I realize that the subject was "Linux Databases", but there are *solid* RDBMSs that run on the various BSD's and commercial *nix, too. I mention this because of the large scale data needs. I'm not sure that Linux running on any PC hardware is capable of supporting the amount of data you are looking at. (Notice I said PC hardware...)

    Because of that, I'd suggest a commercial *nix, running on a server-class machine and an engine from one of the "big three" database vendors: Sybase, Oracle or Informix.

    All of these vendors offer their enterprise-level engines at very reasonable prices, since you are buying at the per-seat level in most cases.

    Of course, I'll likely get flamed to hell and back for the above statements because I'm suggesting commercial, non-GPL engines running on commercial, non-GPL OS's on top of hardware that wasn't built by stuffing armfuls of parts from Fry's into a whitebox ATX case.

    If I were stuck cobbling things together because of budgetary constraints, I'd go with Linux running on a whitebox and use the free Sybase ASE installation that bundled with RedHat.

  6. Be aware of the complexity of Oracle by ZamZ · · Score: 5
    If you are looking for a 'black box' to put your data in be aware that Oracle, even at only 40G of data, requires some heavy administration at times. You may want to consider MySql for an easier ride on admin.



    As well as knowledge on tuning the database you'll have very specific Oracle SQL tuning to do. I've been doing this type of thing for a number of years now and I still don't know 100% of what is needed - there are always knew features to get aquainted with.


    In saying that there are some good resources to get you started. Have a look at http://www.orapub.com for white papers and before you even start to lay the database down see the Oracle white paper (from their website) 'How to stop defragmenting and start living'.


    You should also take into account that 8i is a new-ish release of Oracle. Keep an eye on the bug lists for it and make sure you won't be impacted by anything.


    Apart from the warnings, I've run around 60G Oracle db's on Linux and as long as your server is beefy enough you should have no problem.


    One big thing, if you do go down the Oracle route have t-shirts made up with 'I've looked at v$system_event today, have you?' written on them and wear them. If there is one way of finding what your performance problems are/will be its this table. The only real exception is bad SQL which you need to trap in other ways

    1. Re:Be aware of the complexity of Oracle by hey! · · Score: 3

      I agree with this wholeheartedly, especially about the documentation. You need to be a good skimmer. If you can't sit down and skim through a couple thousand pages of documentation in a couple of days, then you could be in trouble. In the past, I've also got bitten by Oracle documentation that was just plain wrong.

      But... if you aren't constantly under the gun from cranky PHBs, tuning an Oracle database can be _fun_. If you've never worked on a system this configurable before, it's like you were a model rocketer and suddenly were given a chance to fire off a Saturn V. Chances are it will blow up on the launchpad, but if you get everythign right the results are spectacular.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    2. Re:Be aware of the complexity of Oracle by Get+Behind+the+Mule · · Score: 3

      If you are looking for a 'black box' to put your data in be aware that Oracle, even at only 40G of data, requires some heavy administration at times. You may want to consider MySql for an easier ride on admin.

      That's quite true (but see my other post about shortcomings of MySQL). I can't help the original poster much because I have no experience with PostGres, but I can tell you that while Oracle is very powerful, the learning curve is very steep, and you'll need a skilled, experienced DBA to really get all of its benefits to work for you.

      I have found few things in the world more frustrating than Oracle documentation, not because there's not enough of it, but because you can't see the forest for the trees. The first time I had to administer an Oracle DB I was overwhelmed, and there was almost nothing there to help you learn how to just get started. By now, I feel relatively comfortable with Oracle, but it's taken months of practice.

  7. Interystems' Cach� by dublin · · Score: 4

    I'm not a database expert, but I've done a fair amount of healthcare IT consulting over the years, and run across Intersystems' Caché database quite often.

    Most people aren't familiar with Caché from Intersystems, but if you're building a very large, sparsely populated transactional database (like an electronic medical record, for instance), it's at least an order of magnitude faster than Oracle, even after doing unnatural things with Oracle. (This info came directly from a major EMR vendor that benchmarked Cache and all the usual suspects and found nothing in the same league for this application. According to his tests, Caché was 30x faster than Oracle out of the box - Intersystems' website claims Caché is 20x faster than RDBMS competitors.)

    The technology is quite interesting in some regards: It is very mature, having its roots in the old MUMPS or "M" system (about as old as Unix), but has been updated quite nicely as time has passed to provide quite capable SQL and object-oriented interfaces, even though it is neither an RDBMS or an OODB internally.

    Another nice feature is its distributed caching protocol, which allows you to build a logically huge database server piecewise from a number of distributed servers. A number of large installations, including Boston Children's Hospital (IIRC), use this for their EMR - last I heard, BCH's database was spread over 100-150 servers. This would seem to fit nicely with the Linux way of doing things.

    It runs on Linux and a free download is even available on thier website, or they'll send you a CD for the asking if you're one of the unfortunates stuck on the end of a phone line.

    You can find Intersystems at http://www.intersys.com, not exactly the world's most intuitive URL, so I thought I should include it here...

    --
    "The future's good and the present is nothing to sneeze at." - Roblimo's last ./ post
  8. Re:SQL 7 by hey! · · Score: 3

    Yeah, but the problem with SQLAnywhere is that Sybase doesn't want it to get TOO good. One of the things they've done is provide a wacky implementation of SQL, and taken out things like statistical aggregate functions like standard deviation. It's small, fast and useful, but in terms of compatibility its the pits.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  9. Re:4D by hey! · · Score: 3

    Except that from what I can see, four out of five 4D projects end in a bloody train wreck. I know, I've been involved in a number of rescue missions for 4D users.

    4D has a number of major problems.

    First, it is not relational in a technical sense; that is it does not allow for non-procedural programming. It uses a kind of foreign key structure to "relate" (ugh) one table to another, but the programming style is based on traversing links, and it implements that badly (imagine programming in a system which only allowed you one cursor). The programming style this enforces is much more like a hierarchical or network database. The lack of cursors means that 4D programs show poor modularity; whether any piece of code works depends on the global state of the system. This results in code that behaves inconsistently. Furthermore the weak object model tends to encourage use of global variables to control system behavior.

    4D was a database designed by people with only a vague understanding of database technology, and this shows by their confusing tedency to use technical terms like "project" (not as in project management but as in "project an image") but in ways completely unrelated to their accepted meaning.

    The 4D file format is at once highly fragile and stunningly slow. Past work I've done with recovering crashed 4D database indicate that it depends critically on internal file pointers, which tend to get corrupted. The internals are not documented, and the tools ACI gives you are utterly inadequate. You absolutely _must_ buy third party tools from people who have reverse engineered the structure, and you absolutely must run these any time the database has terminated abnormally. These tools are very slow. So after any abnormal termination, you can count on much down time. Whats worse is that the combination of slowness, and the lack of anything like SQL, means that designs almost always have to carry acucmulator fields in tables that must be updated (to keep monthly totals etc). So in addition to running the third party tools, you have to gin up your own tools to fix these fields.

    The 4D developer environment is one of the most poorly thought out I've ever seen. You MUST lay out your database in a graphical layout window, and "relate" your tables using a connection tool in this window. This works fine for three or four tables, but in a system with twenty, thirty or more tables it often takes minutes of painstaking mousing to figure out the "relationship" between two or three tables.

    This points out 4D's real weakness. Things like the "current record" model of programming and the graphical table layout make it fairly easy to get started with a small system, but unless that system stays tiny (in complexity AND number of records) you are bound for trouble.

    I could go on and on about the shortcomings of 4D; but I know some developers like it. Different strokes, I guess, but one thing is absolutely certain: 4D is a very BAD choice for large, mission critical applications.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  10. Perl and Monitoring Oracle by moscow · · Score: 3
    Are you aware of the perl DBI? It works for both Oracle and PostgreSQL (and mysql and ...). This can be used to provide access to databases regardless of type in a uniform manner, and are truly useful for remote access and regular running jobs, particularly those which use standard SQL. In Oracle terms, you can run Perl DBI anywhere that you can run sqlplus.

    There are a number of GPL products which use Perl + DBI. One of the best, IMHO, is Orac which also uses Perl/Tk and so provides GUI access from multiple platforms (Solaris, Linux, NT). Orac offers loads of SQL scripts to help with tuning, or just seeing the layout, of databases. It also provides realtime database monitoring, which is the current thrust for improvement of the tool. You can find it on CPAN, e.g. here .

    Another monitoring tool, which is capable of emailing you when it's unhappy as well as putting up current status on a web page is Karma. This is still developing rapidly and is intended for Oracle on Linux. This can be found here on freshmeat.

    Hope this is of some use.

    --
    Who would believe in penguins,unless he had seen them? Conor O Brien - Across Three Oceans
  11. Re:Some good information... by frossi · · Score: 3

    MySQL does not support transactions. The good point is that there is NO way to make a transactionnal database as efficient as a non transactionnal one. So MySQL is faster than PostgreSQL for many benches.

    The bad point is that when you need transaction, you have to perform table level locks which are far LESS efficient than row level locks of PostgreSQL.