What Capacities Do Databases Have?
razor69 asks: "Having programmed PHP in conjunction with MySQL for some time now, I'm quite happy with the whole setup. However, I'm now working up to some more serious clients, which require larger and larger databases. Looking for database comparisons, I'm stuck with the stats MySQL give me and none of the database manufacturers tell me what the maximum amount of records their system can handle is. Now I know that Oracle is unlimited, but so is their price. Which database program is suitable for what price, how many records can they hold (practically as well as theoretically) and how do they all REALLY compare?"
Gee DB2 doesn't count as one of the "Big" Databases? C'mon it even runs on Linux.
Perhaps somebody on this thread can answer my question, since it seems relevant. Most of the time people speak of Oracle they use expressions like "unlimited price" in the same sentence. But the few people I've talked to that are users (and therefore evangelists) of Oracle have all said $25k. That's not really *that* bad, especially for a company that knows its database needs when they make the budget. My question is, is that really the price? What exactly does one get for that price? If not, what's the more accurate price for Oracle in the wild?
www.HearMySoulSpeak.com
We are read-only ~10 million row database in MySQL on a dual PIII 800 w/1GB RAM. Indexed lookups are fast (sub second). MySQL was created as a data warehousing database, and when you use it for its original design purpose, it is fast.
We found that the limiting factor is the number of indexes on each table. All indexes for each table are stored in one file. As you add indexes, reindex performance degrades; we're guessing that the cause is b-tree rebalancing. Anyway, we addressed that issue by creating and indexing sub tables. It's a query-only application, so we can afford to do things like that.
Maximum size of database: 32TB using multiple files; largest recorded InterBase database in production is over 200GB
Maximum size of one file: 4GB on most platforms; 2GB on some platforms
Maximum number of tables: 64K Tables
Maximum size of one table: 32TB
Maximum number of rows per table: 4GB Rows
Maximum row size: 64K
Maximum number of columns per table: Depends on the datatypes you use. (Example: 16,384 INTEGER (4 byte) values per row.)
Maximum number of indexes per table: 64KB indexes
Maximum number of indexes per database: 4G indexes
And don't forget, Interbase/Firebird is absolutely free and open source!
1: It's not the record count, it's the data size. 2: Quite a bit depends on the underlying OS, and what it can and cannot do. 3: Quite a bit relies on what you're doing; a database optimized for transactions will have different high water marks than a database optimized for querying. 4: That having been said, the Big Four (Oracle, Sybase, Informix, MS SQL Server) can pretty much handle what you throw at them, so long as you can shell for the hardware.
Vintage computer games and RPG books available. Email me if you're interested.
since no one has mentioned it I thought I'd dig up a few links I remember reading.
From the PostgreSQL FAQ, linked above:
If you couldn't tell I like postgres but as a business you should get what you think is best. From what I've heard Oracle on Solaris is where it's at if you can afford it. (I can't)
Leknor
I have practical experience with Oracle on Solaris (while I was at Borland and SST), with Access on Windows NT (while I was at Actuate), with CSV flat-file databases on Solaris, and with MySQL on Linux (at SST). In addition, the employee I found thanks to the /. thread on hiring, has experience with MySQL on Linux as well.
From what I can tell, MySQL is fast, fast, fast for certain limited uses, and can handle databases into the hundreds of thousands of records, although I have not pushed it into millions yet. At somewhere around 250,000 database calls every day, the thing starts to have problems, like the server needs a reboot. I don't know if the same problem happens if those calls are spread out over a long period of time. Throwing more iron at it, or reworking the indexes may help. We haven't upgraded to 3.23 stable yet, though, so that may solve the problem. In addition, we've begun using the "p" functions for databases -- you know, the "persistent" database connections. Use "mysql_pconnect" instead of "mysql_connect" for (sometimes) better response.
One of the nice backups for MySQL (and few people seem to use this, but I like it a lot) is that it can write out every SQL update/insert/delete statement to a file. In the case of database corruption, you can take that file, go though the last few statments, and try to correct it. Then regen the ENTIRE database from that SQL file. In addition, 3.23 has something new -- replication or mirroring, something where it duplicates the database onto backup disks. I haven't used that yet. Perhaps someone reading this could share more.
As for Oracle, yes, it can handle just about anything. I don't like the speed on older versions of the database. 8i is good, but actually not as fast as MySQL for some selects. I haven't tried 9. But it is a workhorse, it is storing hundreds of tables and millions of records for us. My current employer is trying to move away from it, because of the cost issue you cited. One of the things I like about Oracle is triggers, which MySQL doesn't have yet.
I know that the worst database I used was back at Borland in 1996 -- I don't remember what the database was called, it was part of a Lotus Domino system. In any case, with 100,000 member records in the database, it took a good 20 minutes for it to add a new record. It was miserable. We had to put big bold text on the "become a member" page that warned of the wait, and I had to write a JavaScript (onclick) for the submit button, that returned false on subsequent clicks, so that people didn't abort the procedure on accident. I hated that. I felt like we were this technology company that didn't understand Web technology.
As an interesting aside, back then Borland had a product called IntraBuilder. It was one of the first Cold Fusion-like systems, but it's database system was just poor. That's partly why they named it as they did -- they needed to brand it as an intranet tool, because they knew it wasn't fast enough for a full Internet deployment. But once the product was out there, the pressure was on to use it on Borland's site. We tried, Chris Malatesta and I. But I recall showing the IntraBuilder team some pages it served up, and how there was a full-second delay per user. Their response? "That seems acceptable to us." Chris and I tried to talk to them about scalability, about how the seconds accumulate and build up. They had no idea how hammered the product would be on a "real" Internet Web site. The product died, thankfully, when we ran a test with dozens of users, and database corruption was rampant -- one person's password appearing on another person's browser, and so on. Ugh. I don't miss those early days.
My Greasemonkey scripts for Digg &
I think any large DB project should be treated as an infrastructure project...
When you're dealing with clients who have (relatively) massive data requirements (ie., data requirements that outstrip the smaller DB services such as MySQL that can be hosted on cheap, commodity hardware), you've got to step back and take a second look at the requirements of their business.
Before you start looking at particular DB packages, and their relative merits, you should really start by identifying the business requirements for the database. For example, what are the uptime and recoverability requirements? If your client requires 24x365 uptime, and also wants a hot standby site, you've got to start spec'ing out the system at the hardware and network level.
Identify the infrastructure that will sustain the requirements of your client, and then begin the process of selecting a DB package that works well within the constraints of that environment. When dealing with larger clients, always keep in mind that you are fundamentally delivering a DB infrastructure, not simply a DB.
A couple things you might want to consider:
Availability - does the system have to be redundant? Does it have to support automatic failover between redundant server?
Scalability - does the DB need to grow beyond the constraints of a single server? Can you forsee this data existing on a large array (ala EMC), and do all nodes of the DB complex need simultaneous access at some level?
Middleware - If it's a large DB, possibly being accessed from multiple physical locations, does it look like a queued middleware (ala IBM MQ) is needed, or perhaps appropriate?
Disaster recovery - If the client is a 24x7 shop, or even if they aren't, are you going to have a sufficient window to back this beast up? Are they going to ask you to architect an off-site recovery strategy? If so, does it have to be a "hot" standby site? If so, does your hardware/network infrastructure support dynamic mirroring over a WAN?
Once you've hammered out these kind of questions (and there are lots more..), then you are in a position to start evaluating DB packages that will work well withing the infrastructure you've designed.
As with any project, getting the requirements defined out front is going to save you a lot of time and agony, and is going to save your client a lot of money and aggravation.