> So why would he want to waste nearly double the money for a 1 processor system unless he really has to only have 1 processor?
how about: 1. hot-swappable power-supplies 2. pair of scsi internal disk in a hardware mirror configuration 3. onsite 24x7 support 4. significant vendor testing that ensures that all components really are compatible
> You are right but only for OLTP types of applications. Full table scans can't be buffered adequately.
yeah, cache hits are more critical in oltp - where a single percentage point in the hit rate can be a noticable difference.
but even on non-oltp applications (data warehouses, data marts, olap, whatever), it makes a difference - and you (typically) get can get good hit rates. I'm looking at stats on a 500 gbyte data warehouse right now, and see: 1. dimension tables: 99% cache hit rate 2. summary tables: 98% cache hit rate 3. fact indexes: 99% cache hit rate 4. fact tables: 92% cache hit rate
In the above list the vast majority of data is in the fact tables - where over 90% of the queries are coming out of memory. Since many queries actually hit the summary tables instead of detail, those queries are getting a 98% hit rate. And when the cache is missed - it doesn't necessarily mean a tablescan: this particular database is partitioned via db2, so a cache miss generally just results in a partition scan of 2-4 seconds.
So, while you're right - caching is tougher in a reporting environment, it still helps enormously in most implementations. At least in my experience.
> Actually it's hard for me to think of an app which doesn't require more than one CPU and needs faster > access than a properly configured RAID can give it.
was wondering this too, how about this unlikely scenario:
1. very low concurrency application - in which you never or almost never have more than one connection at a time to a database.
2. you need the fastest possible response to the query.
3. you've got 15 gbytes of data along with 500 mbytes of indexes
4. your queries don't fit into any pattern - so normal database buffering seldom gives you above a 50% hit rate.
5. all access to data is via indexes, and is never for more than a few rows: so parallel queries won't help.
6. availability doesn't matter - you'll just get two of these, and they are read-only.
In this scenario I imagine that a ramdisk would provide a tiny performance margin over a properly designed disk-based database.
But the other 99 out of 100 scenarios would do better on disk.:-)
Actually, a well-tuned high-end database typically doesn't need to be completely in-memory to get 98% of the speed of one on a ramdisk.
For example, let's say that you've got 15 gbytes of raw data on db2/oracle/informix, as well as 4-8 gbytes of memory.
First step is storage: don't use raid-5 (requires extra writes), instead you'll want something like raid-10. Then ideally get 28-56 or so drives, and you can spread data, indexes, logs, and tempspace across all drives. Obviously 15k rpm ultra320 drives with a caching controller would be best (if you can't afford fibre).
Next make sure that your memory is tuned right - multiple gbytes set aside for buffering, sorting, asynchronous agents, etc. With these databases it's easy to ensure that all of your indexes are always in memory, that all writes to disk are asynchronous, and that most sorts are in-memory. The buffering should ensure that 98% or so of your disk access actually hits the bufer instead of the disk.
Given the above scenario a database running on a ramdisk could be faster - what? 0-5%? And in return for that possible speed increase it'll cost more and lack the reliability of the disk-based database. Probably not a good deal.
the issue *isn't* that open source advocates are attacking her analysis.
the isue *is* that open source advocates have discovered that: 1. she teamed up with a microsoft gold partner to perform the analysis 2. they sent the survey to subscribers to a microsoft publication (a completely biased sample) 3. the analysis & survey don't match up well - a considerable amount of apparently unfounded interpretation occured.
So, are open source advocates sometimes excessive? Sure. But more to the point: Didio's analysis was beyond flawed - it was deceptive. And that discredits her as well as Forester.
Hey thanks for the informative reply. Still suspicious of the value of something that thinly skins simple sql & doesn't help with complex sql - but I'll have to take a closer look.
Seriously can't imagine developing powerful business apps today without writing sql by hand.
Sure, there's plenty of simple CRUD (create, replace, update,delete) stuff that could be hidden behind something like this if you want. It's tedious stuff - but hardly difficult.
But beyond that - applications are expected to provide a greater amount of analysis than they were ten years ago. Just doing CRUD isn't enough - people also want to see simple analytical data (trends, totals, medians, averages, rolling averages) all throughout their transactional applications.
And without learning how to write some sql, you'll end up unable to provide this functionality - the app will be too slow to code and too slow to run it.
Ok, here are a few reasons why I *wouldn't* want to use that hibernate syntax:
1. how do I separately validate the syntax & query results? Can I run it in on a command-line? In a query tool, etc?
2. how do I diagnose subtle problems caused by hibernate join failures? Exactly what sql is it running?
3. how much flexibility does it have? Can it handle outer joins? Subselects? Group bys? Aggregate functions? No? Then why use such a poor subset? Yes? Then why not just use sql?
4. How do you leverage dbms-specific features? Yes, I know that it's best to stay generic, but when you need to add more analytical queries to your app (to deliver something beyond simple CRUD), you want to use olap functions, etc. Now, yeah, you could do this in your app. But it'll be slower. *Far* slower: you might have to process a hundred thousand rows - and the database will do it in less than a second. Your app? Maybe 5-10 seconds?
My recommendation is to learn sql, use standard sql whenever possible for portability. But learn the advanced sql required to deliver robust functionality. And when required be prepared to dip into vendor extensions.
Or stay away from huge, challenging, and powerful business applications.
> Is MySQL 5.x the answer to my prayers? Or just a cruel reminder of why MS software costs what it does?
Well, what's your time worth? what's the business worth that would be supported by this database? How much risk can you take?
Assuming that you've got money at stake here, your best bet would be to stay completely away from mysql at this point in time: you aren't seriously thinking of porting 1000 triggers and 800 stored procedures to mysql's initial beta release of this functionality are you?
Given your requirements, I can't imagine considering mysql or even postgresql right now. On the other hand, there might be scenarios in which you could begin to break up this database into components, and perhaps some of these components could live in other technologies.
For instance, can some of your reporting & read-only queries run against a separate database? This is a typical oltp / olap split in which transactions hit a small transactional database with minimal data retention, then the data is moved over to a separate database for reporting & analysis. The reporting database ('data warehouse', 'datamart',etc) typically uses a denormalized model to avoid so many joins, typically also uses parallelism & partitioning for far better performance. You'll miss db2, oracle, sybase, or informix here. Even sql server most likely.
The above wouldn't get you completely out of sql server, but is probably achievable. And would help to avoid trying to port thousands of triggers & procs.
> "ok, how about using their non-isam option. Ok, now we've got the kind of data integrity we've taken for granted since about 1981."
That's true, there were holdouts back in the early 80s. Used to be lots of these guys running around saying that relational technology would never work, PCs couldn't do real work, etc, etc. I think most of them lived in bomb shelters from the 50s. We next heard from them in y2k.
> I've worked with raw disk I/O; flat-files; single-key databases; and primative table-managers > like MyISAM; more robust table managers like innoDB; pure time-series databases like fame; > "enterprise" databases like Sybase and Oracle, etc. and they all have a broad appeal for > different reasons.
Sure. And just like you don't use oracle for storing data in real-time automotive transmissions, you don't use isam in 2005 for storing business data. There are exceptions. But very few a professional would consider: read-only access where the source of record is somewhere else, fully redundant systems, etc.
> Yes, if you look up in the thread you will see discussion of the fact that you can trade > performance for guarantees.
In this particular case half of the issue is due to space-bloat by innodb. They've accepted that this is a problem, and are working on it. In other words, it isn't an engineering compromise it's a design flaw.
Additionally, remember that myism is dead last in speed in write-intensive applications. Dead last. So, the elimination of common-sense data protections dosn't buy myisam any speed in this scenario.
> No one solution is right for everyone.
No, but "gotos are considered harmful". And so is ISAM or mysql for vital business data.
> They just don't exist in the same space together.
1. the op claim *fastest* wasn't limited to any particular "space"
2. mysql is pretty cheap as far as commercial databases go, at $600 / database server. But both oracle and db2 have versions within this price range. For example, DB2 Express is somewhere between $500 & $1500, and is limited to 2 cpus. On the other hand, it includes partitioning (MDC), parallelism, a mature optimizer, standard features, etc. It can handle 5000 transactions / second - in a standard database environment in which you've got referential, check, and unique constraints on your data.
So, yeah, mysql far slower than db2 & oracle at the top end, where it doesn't really compete. But it's also generally slower at the low end as well - where its only edge is simpler installation. Ah, and a GPL version that almost no corporate clients qualify for. And perhaps almost nobody else either. Who can tell? http://www.mysql.com/company/legal/licensing/comme rcial-license.html
> It's faster than everything else because it doesn't have all the data integrity features that a RDBMS does.
Hmmm, really? Lets see...
if you're using myisam (the fast io layer), then all locking is done at the table level. Which means that if you get multiple connections attempting to write to the same resource they'll all have to wait while their operations run in series. Fast? Nope, hard to imagine a slower scenario than that.
again, if you're using myisam (again the fast solution), and need to select 10% of the rows of a table with 10 million rows, what will it do? It'll do a scan of all 10 million rows. What would oracle, db2, or informix do? Rely on partitioning to just scan those 1 million rows. Of course, these commercial databases will also break the query up into pieces and run them all in parallel across multiple cpus on your server. You could easily find mysql taking 40x as long to respond to this query as one of the others.
How about if you've got a complex query? Well, mysql admits that their optimizer is very primitive, and will take a while to fine-tune. Great, so now you just write a few different versions of your query until you get the speed you want, probably depending on 'gaming' the optimizer. Then when you upgrade the software next year you'll find that your gaming is killing performance. Time to rewrite the query. No big deal, we all went through this (about twenty years ago, and glad to have it behind us).
ok, how about using their non-isam option. Ok, now we've got the kind of data integrity we've taken for granted since about 1981. But, reads are very slow, often 1/10th the speed of myisam.
On the Innodb site they brag about getting 800 inserts / second. I think the most recent db2 benchmark on a 4-way intel box hit almost 3,000 transactions per second. Note: Not just simple inserts & updates, it was a tpc benchmark. And that's on a low-end box. At the high-end db2 is running 80,000 of these transactions per second.
Fastest? please, only in read-only transactions running on tiny hardware...
> If they can fix 10 years of criticiscm in one release, why couldn't do that before? Or maybe in > several fixes rolled out within the 10 years?
If you recall, their management made the (unconvincing) argument that 99% of the time people didn't need fluff like:
* referential integrity (pk & fk constraints)
* views
* triggers
* stored procedures
So, they never implemented this because they had been arguing that nobody needs it anway. Nevermind that it's been standard fare for over twenty years.
Personally, although I'm glad to see them support views, I would rather have them clean up their exception-handling than add triggers & stored procedures right away. The problems with silent errors, silent truncation of strings & integers, date validation errors, etc are far more serious than what they are adding in this *beta* release.
Maybe in five years that'll be the release that fixes 15 years of criticism?
Just like Yahoo Finance - I've seen quite a few financial departments in fortune 100s that relied on spreadsheets. Nothing unique to small departments. Yahoo Finance is probably better off on mysql than spreadsheets, so it's a win for them. But it proves nothing.
> but for an awful lot of what people want to do, MySQL is Good Enough (tm), and it requires virtually > zero administration, that's the real pull (and why you find MySQL available on every damn hosting > account out there).
Hmm, I think the reason it is everywhere is more to do with the lack of any other options 2-4 years ago than anything else. Except maybe native support for windows. That helped as well.
Today I continually discover people using mysql who are blissfully unaware of the data quality problems and licensing fees. Since they aren't fixing their silently corrupted data (silent errors, truncated strings, truncated numbers, invalid dates, invalid relationships, etc, etc), they probably feel that mysql is a very low-management solution. Meanwhile, I've known several who've migrated *from* mysql exactly because they couldn't deal with the efforts involved in fixing data quality problems. Likewise, they'll tell you how cheap the product is - completely unaware that they are using it in violation of the licensing agreemments and need to cough up some cash.
And this data quality issue can be ok - lets face it, a bowling score application built for fun doesn't have critical data in it. Who cares if it occasionally bungles the data?
There is no question that mysql isn't yet to where db2 & oracle were in 1983. That's easy to show (no views, silent errors, etc, etc). And there's no real question about the value of practices, techniques, and technology that protect data in a database. That's easy to show, and no amount of fans of mysql are going to budge 40 years of data management progress. The only real question is how long it'll take mysql to seriously address these issues.
And when it does, it'll be an extremely respectable solution. Until then, it's best for applications where the data doesn't have much value.
> As for mission critical, Yahoo Finance, Associated Press, Lycos, Los Alamos Laboratory, NASA and Suzuki (to name but a few) would disagree with you there.
Ya, and I've seen a dozen companies that managed mission-critical data within the complex and often incorrect spreadsheets of secretaries insecure pcs.
So? The fact that company x doesn't care about data quality problems says more about the company's competance than about the importance of those problems.
"You silly sausage . . . check your facts first next time before posting such knee-jerk drivel" wrote:
> I guess SAP is well, clueless, being that MySQL MaxDB has been their open source database of choice > for years.
MaxDB? Ah, "Adabase". That heap of 70s technology has been around for at least 20 years, and predates most relational databases. The product was probably open-sourced since it is reliable with data - and so a great addition to the mysql stable. However, it's been a 'has-been' in the database world since the 80s. I have no idea why anyone would want to use it today.
Kind of like someone getting excited about using IBM's IMS DB for their small site. Why?
> I'm afraid. 95% of software development problems would be solved by having good, descriptive, well thought out specs.
yeah, that's the problem: requirements gathering is the most failure-prone activity in software development.
Waterfall methods approach this problem by generating pallet-loads of documentation, and having everyone and their brother (theoretically) review the paperwork. This of course almost never happens. When it does happen it's almost impossible for the users to visualize the system anyway - so tons of gaps & errors occur.
Agile methods approach this problem by delivering the software one step at a time. Something the users can actually interact with, which they do, and then much more easily detect requirements gaps and errors.
Agile isn't a silver bullet - it relies very heavily on developer skills, and isn't helpful at getting a specific set of requirements completed at a specific date. On the other hand, brooks documented the benefits of focusing on developer skills 40 years ago, and nothing else is accurate at determining target dates anyway.
> Now, yes, with clustering and keeping stuff in memory, you can get good response times. That's not > some deep, fundamental property of the database, it's a simple add-on that sits on top of the > database. I believe there are already solutions like that for PostgreSQL and MySQL, although they > are not quite as flexible as those for DB2 yet.
Um, no. We're not talking about just setting up veritas on a couple of sun 6500s with 24 gbytes of memory, assuming that every product in the dbms category from MS Access to Terradata has the same scalability, crossing your fingers and hoping for the best.
We're talking about multiple forms of data partitioning, parallel query, and query rewrite (to redirect against summary tables). All built into the database - so that the optimizer can take full effect, and so it's (relatively) easy to build & manage, and will run portable sql.
Add to this dozens of fine-tuning knobs to dial in the specifics of your environment, very clever optimization, very functional sql, etc - and comparing an open source database to oracle/db2/teradata at this scale is rediculous.
It's like some kid infatuated with his brand-new honda bragging about how it can pull stumps as well as a farm tractor.
> PostgreSQL cannot compete with Terradata in this space.
right, but this is a huge space: and given the rate of data explosion, 3TB databases have become extremely common. And that's a good thing for all database products & customers: it means that db2, oracle, etc can keep adding more exotic features to support multi-TB databases while postgresql, etc can quietly pick up more of the smaller ones.
> Also one question-- suppose you have your storage go bad on a blade-- what happens to your > relational integrity if the records are spread across various servers?
Depends on which dbms you're talking about: on oracle it's no big deal, since they use shared-disk, and one node can quickly take over for another on 10g (was OPS). Of course, that's a complex, high-priced product, and doesn't scale as well as the shared-nothing options like db2 & terradata. I can't answer for terradata, but if you're on db2 then you've got a few options: 1. implement a HA solution between pairs of nodes, so that one node can take over for another. This can be via HACMP or Veritas (both complex) to take over the storage from the crashed node, or even HADR (client reroute - not complex) to support a second duplicate node. 2. implement a failover database. This keeps the architecture very simple, and is a more complete solution (read: disaster recovery) - since it can protect you against datacenter floods, etc: by locating the failover in an entirely different data center.
Since we're generally talking read-only data here, you've got more options than on a large transactional database. I'd typically prefer #2, followed by #1/hadr. HACMP and Veritas are a pain in the butt, IMHO.
> something like Craig's list, if you had to pick between MySQL and Postgres only, not based on claims > by the distributors, but based on the capabilities you elaborate in your post and what you know > personally. Starting from scratch.
ok, just shooting from the hip here, ymmv, etc, etc, here's my $0.02 on that:
Assuming that you partitioned the application like Craigslist - by city, that drastically simplifies the scaling issue: since each city can be its own server, with possibly another server to support cross-city summaries. Your application would have to know which city translates to which server - but that can be handled easily many ways.
I'd go with postgresql given the choice, since: 1. partitioning by city drops the top read performance requirement. mysql has better read performance using myisam, but you could probably meet the requirement at the city level just using postgresql. 2. postgresql's write performance is *way* better than mysql's using myisam, and also better than mysql's using innodb according to anecdotes. 3. postgresql is just simply a far more mature database right now. Mysql has a few attractive features to it, but its gotchas (http://sql-info.de/mysql/gotchas.html) are bizarre in a data management product. And not everything you do will be simply serving up content - you'll have transactions supporting calendars, the creation of classified ads, discussion forums, etc, etc. 4. postgresql is free, mysql is not. If you're trying to save money, and might want to have a bunch of low-end servers, you don't want to be shelling out $500 each. Then extra for online backup software.
A few other notes: 1. mysql has two things going for it for something like this: myisam query speed, and its query-cache. This is one of the few places the mysql query cache is actually useful. And it would probably have a huge impact here. In fact, if your read:write ratio is sufficiently in favor of the reads, and your data volumes are sufficiently high - then I'd have to give mysql a second look. But as long as postgresql can meet your performance needs (and I suspect it can), then its other strengths (quality & exception handling, online backup, standard sql features, etc) put it way ahead of mysql. 2. I'd definitely treat each subject area within the site as a separate set of tables - and try to completely encapsulate them in the application. These subjects are discussion forum, housing, community, personals, etc, etc. By doing this you'll be able to scale them independently, move them around on different servers, even use a different product for just one component if you wanted. 3. I'd also define some performance targets - and early in the development cycle test out your performance to see if you can hit those targets. 4. Assuming a typical growth scenario, I'd start with multiple databases on one server, then gradually move them to separate servers as the hits & revenue increased (hopefully at the same time). 5. As far as the boxes are concerned, to keep costs low I suppose I'd go with lots of memory (4gbytes), battery-backed hardware raid with 128 mbytes or so of cache using 15,000 rpm drives if possible. raid-5 probably won't hurt you on postgresql or mysql with innodb as much as it would mysql with myisam. 6. Commercial databases are probably also worth considering - since they would allow you some of the scalability benefits via partitioning - and could help you avoid having to run 20 servers, most of which are idle most of the time. Labor's money too, and commercial database products start around $500-$1500.
well, i guess it's time to get back to work. oh, and good luck.
> Take a look at the hardware and the cost per transaction per second. I am not saying this could > not be done on PostgreSQL (it could) but it would expensive as well.
Good point - a database performing 50,000 transactions a second isn't running on your grandmother's pc. But it wouldn't be correct to assume that all databases scale equally. This is where the greater tuning opportunities of db2 or oracle come into play:
- instead of just a single 8k page size, db2 has 4k,8k,16k, and 32k page sizes.
- db2 has much more flexible statistics gathering than either mysql or postgresql
- db2 has a far more robust & intelligent optimizer than any open source database
- db2/oracle/etc have query parallelism
- db2 in particular is very good at performing most database operations asynchronously: data writes are sent to a buffer pool by one agent, to a log buffer by another, from the log buffer to the log file by another, from the log buffer to actual storage by another, etc, etc.
- both mysql & postgresql have very primitive memory tuning capabilities: mysql can't share io buffers between innodb & mysql. Postgresql just doesn't have many options to manage sort memory, separation of memory buffers by tablespace, etc, etc.
But keep in mind - many of these weaknesses are also strengths: since the missing configurability is translated to simpler management for smaller & less-demanding databases.
> However, unless you have all the information in memory or unless the the data is spread across a > large number of disk arrays, I don;t see how you can even get the information to process it in less > than a second.
Ok, here's an exeample architecture using db2. Would work identically on informix, and probably Terradata. Oracle has similar capabilities, though a little different since they're shared-disk rather than shared-nothing for db2/informix/terradata.
Step 1: spread data across 10 2-way 64-bit AMD blades. Each blade has 8 gbytes of memory and a fibre connection to storage. You can expand this up to *hundreds* of blades (nodes) - and get almost linear performance improvement the entire time. You've now got a 20 CPU system for about 10% the price you'd pay for a Sun E10000 or HP Superdome. Since the data is spread across all nodes by hashkey - every query will get 20 CPUs processing it together - each on 1/10 (100 gbytes) of the original data.
Step 2: partition the data on each node by range or value via MDC. Say, by 'day' & 'department' - so that you've got (for example) 1000 days & 1000 departments. Now, each query will only tablescan the data within the blocks it needs: whether it needs to scan 5% or 75% it gets a linear performance improvement corresponding to the reduction of data.
Step 3: implement parallelism on each node. Now, this sample config only has 2 CPUs - so the opportunities aren't as great as they are on an 8-way. But a two-way will still usually cut your query time in half.
So, now a few examples: 1. query for 10 days of data for 1 department:
a. clustering cuts 100 gbytes on each node down to about 1 mbyte to scan on each node.
b. query parallelism cuts down scan on each node to about 500 kbytes
2. query for 100 days of data for 100 departments:
a. clustering cuts 100 gbytes on each node down to about 1 gbyte to scan on each node.
b. query parallelism cuts down scan on each node to about 500 mbytes
3. query for 365 days of data for all departments:
a. clustering cuts 100 gbytes on each node down to about 40 gbytes to scan on each node.
b. query parallelism cuts down scan on each node to about 20 gbytes
All three queries - whether very selective or not, all use the same facility for a linear performance improvement (unlike indexes). The last query is unlikely to complete in 1 second - but I'll bet it'll be fast nevertheless - especially since it'll probably end up with over 2
> So why would he want to waste nearly double the money for a 1 processor system unless he really has to only have 1 processor?
how about:
1. hot-swappable power-supplies
2. pair of scsi internal disk in a hardware mirror configuration
3. onsite 24x7 support
4. significant vendor testing that ensures that all components really are compatible
> You are right but only for OLTP types of applications. Full table scans can't be buffered adequately.
yeah, cache hits are more critical in oltp - where a single percentage point in the hit rate can be a noticable difference.
but even on non-oltp applications (data warehouses, data marts, olap, whatever), it makes a difference - and you (typically) get can get good hit rates. I'm looking at stats on a 500 gbyte data warehouse right now, and see:
1. dimension tables: 99% cache hit rate
2. summary tables: 98% cache hit rate
3. fact indexes: 99% cache hit rate
4. fact tables: 92% cache hit rate
In the above list the vast majority of data is in the fact tables - where over 90% of the queries are coming out of memory. Since many queries actually hit the summary tables instead of detail, those queries are getting a 98% hit rate. And when the cache is missed - it doesn't necessarily mean a tablescan: this particular database is partitioned via db2, so a cache miss generally just results in a partition scan of 2-4 seconds.
So, while you're right - caching is tougher in a reporting environment, it still helps enormously in most implementations. At least in my experience.
ken
> Actually it's hard for me to think of an app which doesn't require more than one CPU and needs faster
:-)
> access than a properly configured RAID can give it.
was wondering this too, how about this unlikely scenario:
1. very low concurrency application - in which you never or almost never have more than one connection at a time to a database.
2. you need the fastest possible response to the query.
3. you've got 15 gbytes of data along with 500 mbytes of indexes
4. your queries don't fit into any pattern - so normal database buffering seldom gives you above a 50% hit rate.
5. all access to data is via indexes, and is never for more than a few rows: so parallel queries won't help.
6. availability doesn't matter - you'll just get two of these, and they are read-only.
In this scenario I imagine that a ramdisk would provide a tiny performance margin over a properly designed disk-based database.
But the other 99 out of 100 scenarios would do better on disk.
Actually, a well-tuned high-end database typically doesn't need to be completely in-memory to get 98% of the speed of one on a ramdisk.
For example, let's say that you've got 15 gbytes of raw data on db2/oracle/informix, as well as 4-8 gbytes of memory.
First step is storage: don't use raid-5 (requires extra writes), instead you'll want something like raid-10. Then ideally get 28-56 or so drives, and you can spread data, indexes, logs, and tempspace across all drives. Obviously 15k rpm ultra320 drives with a caching controller would be best (if you can't afford fibre).
Next make sure that your memory is tuned right - multiple gbytes set aside for buffering, sorting, asynchronous agents, etc. With these databases it's easy to ensure that all of your indexes are always in memory, that all writes to disk are asynchronous, and that most sorts are in-memory. The buffering should ensure that 98% or so of your disk access actually hits the bufer instead of the disk.
Given the above scenario a database running on a ramdisk could be faster - what? 0-5%? And in return for that possible speed increase it'll cost more and lack the reliability of the disk-based database. Probably not a good deal.
Didio is just trying to discredit her critics:
the issue *isn't* that open source advocates are attacking her analysis.
the isue *is* that open source advocates have discovered that:
1. she teamed up with a microsoft gold partner to perform the analysis
2. they sent the survey to subscribers to a microsoft publication (a completely biased sample)
3. the analysis & survey don't match up well - a considerable amount of apparently unfounded interpretation occured.
So, are open source advocates sometimes excessive? Sure. But more to the point: Didio's analysis was beyond flawed - it was deceptive. And that discredits her as well as Forester.
Hey thanks for the informative reply. Still suspicious of the value of something that thinly skins simple sql & doesn't help with complex sql - but I'll have to take a closer look.
ken
Just don't get this - what kind of a brainstorming session did these guys have?
mansnivel?
mansnot?
manmucus?
manbile?
manslurry?
mandribble?
mandriven?
mandriver?
mandriveme?
man-drivin' female?
Just like my old scout, I'll probably just go on calling it what I've always called it- Mandingo.
Seriously can't imagine developing powerful business apps today without writing sql by hand.
Sure, there's plenty of simple CRUD (create, replace, update,delete) stuff that could be hidden behind something like this if you want. It's tedious stuff - but hardly difficult.
But beyond that - applications are expected to provide a greater amount of analysis than they were ten years ago. Just doing CRUD isn't enough - people also want to see simple analytical data (trends, totals, medians, averages, rolling averages) all throughout their transactional applications.
And without learning how to write some sql, you'll end up unable to provide this functionality - the app will be too slow to code and too slow to run it.
Ok, here are a few reasons why I *wouldn't* want to use that hibernate syntax:
1. how do I separately validate the syntax & query results? Can I run it in on a command-line? In a query tool, etc?
2. how do I diagnose subtle problems caused by hibernate join failures? Exactly what sql is it running?
3. how much flexibility does it have? Can it handle outer joins? Subselects? Group bys? Aggregate functions? No? Then why use such a poor subset? Yes? Then why not just use sql?
4. How do you leverage dbms-specific features? Yes, I know that it's best to stay generic, but when you need to add more analytical queries to your app (to deliver something beyond simple CRUD), you want to use olap functions, etc. Now, yeah, you could do this in your app. But it'll be slower. *Far* slower: you might have to process a hundred thousand rows - and the database will do it in less than a second. Your app? Maybe 5-10 seconds?
My recommendation is to learn sql, use standard sql whenever possible for portability. But learn the advanced sql required to deliver robust functionality. And when required be prepared to dip into vendor extensions.
Or stay away from huge, challenging, and powerful business applications.
You're right - I shouldn't have taken liberties with Saddam's record in this discussion of linux & microsoft. ;-)
http://www.sunbelt-software.com/surveys/040213_Lin ux.htm
0 85956154
oh, and btw - it was sent out to readers of the w2k news magazine:
http://www.w2knews.com/index.cfm?id=463
So, the sample of survey respondents was about as controlled as a george bush or saddam hussein political ralley.
More at: http://www.groklaw.net/article.php?story=20040324
> Is MySQL 5.x the answer to my prayers? Or just a cruel reminder of why MS software costs what it does?
Well, what's your time worth? what's the business worth that would be supported by this database? How much risk can you take?
Assuming that you've got money at stake here, your best bet would be to stay completely away from mysql at this point in time: you aren't seriously thinking of porting 1000 triggers and 800 stored procedures to mysql's initial beta release of this functionality are you?
Given your requirements, I can't imagine considering mysql or even postgresql right now. On the other hand, there might be scenarios in which you could begin to break up this database into components, and perhaps some of these components could live in other technologies.
For instance, can some of your reporting & read-only queries run against a separate database? This is a typical oltp / olap split in which transactions hit a small transactional database with minimal data retention, then the data is moved over to a separate database for reporting & analysis. The reporting database ('data warehouse', 'datamart',etc) typically uses a denormalized model to avoid so many joins, typically also uses parallelism & partitioning for far better performance. You'll miss db2, oracle, sybase, or informix here. Even sql server most likely.
The above wouldn't get you completely out of sql server, but is probably achievable. And would help to avoid trying to port thousands of triggers & procs.
> "ok, how about using their non-isam option. Ok, now we've got the kind of data integrity we've taken for granted since about 1981."
That's true, there were holdouts back in the early 80s. Used to be lots of these guys running around saying that relational technology would never work, PCs couldn't do real work, etc, etc. I think most of them lived in bomb shelters from the 50s. We next heard from them in y2k.
> I've worked with raw disk I/O; flat-files; single-key databases; and primative table-managers
> like MyISAM; more robust table managers like innoDB; pure time-series databases like fame;
> "enterprise" databases like Sybase and Oracle, etc. and they all have a broad appeal for
> different reasons.
Sure. And just like you don't use oracle for storing data in real-time automotive transmissions, you don't use isam in 2005 for storing business data. There are exceptions. But very few a professional would consider: read-only access where the source of record is somewhere else, fully redundant systems, etc.
> Yes, if you look up in the thread you will see discussion of the fact that you can trade
> performance for guarantees.
In this particular case half of the issue is due to space-bloat by innodb. They've accepted that this is a problem, and are working on it. In other words, it isn't an engineering compromise it's a design flaw.
Additionally, remember that myism is dead last in speed in write-intensive applications. Dead last. So, the elimination of common-sense data protections dosn't buy myisam any speed in this scenario.
> No one solution is right for everyone.
No, but "gotos are considered harmful". And so is ISAM or mysql for vital business data.
> They just don't exist in the same space together.
e rcial-license.html
1. the op claim *fastest* wasn't limited to any particular "space"
2. mysql is pretty cheap as far as commercial databases go, at $600 / database server. But both oracle and db2 have versions within this price range. For example, DB2 Express is somewhere between $500 & $1500, and is limited to 2 cpus. On the other hand, it includes partitioning (MDC), parallelism, a mature optimizer, standard features, etc. It can handle 5000 transactions / second - in a standard database environment in which you've got referential, check, and unique constraints on your data.
So, yeah, mysql far slower than db2 & oracle at the top end, where it doesn't really compete. But it's also generally slower at the low end as well - where its only edge is simpler installation. Ah, and a GPL version that almost no corporate clients qualify for. And perhaps almost nobody else either. Who can tell?
http://www.mysql.com/company/legal/licensing/comm
> It's faster than everything else because it doesn't have all the data integrity features that a RDBMS does.
Hmmm, really? Lets see...
if you're using myisam (the fast io layer), then all locking is done at the table level. Which means that if you get multiple connections attempting to write to the same resource they'll all have to wait while their operations run in series. Fast? Nope, hard to imagine a slower scenario than that.
again, if you're using myisam (again the fast solution), and need to select 10% of the rows of a table with 10 million rows, what will it do? It'll do a scan of all 10 million rows. What would oracle, db2, or informix do? Rely on partitioning to just scan those 1 million rows. Of course, these commercial databases will also break the query up into pieces and run them all in parallel across multiple cpus on your server. You could easily find mysql taking 40x as long to respond to this query as one of the others.
How about if you've got a complex query? Well, mysql admits that their optimizer is very primitive, and will take a while to fine-tune. Great, so now you just write a few different versions of your query until you get the speed you want, probably depending on 'gaming' the optimizer. Then when you upgrade the software next year you'll find that your gaming is killing performance. Time to rewrite the query. No big deal, we all went through this (about twenty years ago, and glad to have it behind us).
ok, how about using their non-isam option. Ok, now we've got the kind of data integrity we've taken for granted since about 1981. But, reads are very slow, often 1/10th the speed of myisam.
On the Innodb site they brag about getting 800 inserts / second. I think the most recent db2 benchmark on a 4-way intel box hit almost 3,000 transactions per second. Note: Not just simple inserts & updates, it was a tpc benchmark. And that's on a low-end box. At the high-end db2 is running 80,000 of these transactions per second.
Fastest? please, only in read-only transactions running on tiny hardware...
> If they can fix 10 years of criticiscm in one release, why couldn't do that before? Or maybe in
> several fixes rolled out within the 10 years?
If you recall, their management made the (unconvincing) argument that 99% of the time people didn't need fluff like:
* referential integrity (pk & fk constraints)
* views
* triggers
* stored procedures
So, they never implemented this because they had been arguing that nobody needs it anway. Nevermind that it's been standard fare for over twenty years.
Personally, although I'm glad to see them support views, I would rather have them clean up their exception-handling than add triggers & stored procedures right away. The problems with silent errors, silent truncation of strings & integers, date validation errors, etc are far more serious than what they are adding in this *beta* release.
Maybe in five years that'll be the release that fixes 15 years of criticism?
Just like Yahoo Finance - I've seen quite a few financial departments in fortune 100s that relied on spreadsheets. Nothing unique to small departments. Yahoo Finance is probably better off on mysql than spreadsheets, so it's a win for them. But it proves nothing.
> but for an awful lot of what people want to do, MySQL is Good Enough (tm), and it requires virtually
> zero administration, that's the real pull (and why you find MySQL available on every damn hosting
> account out there).
Hmm, I think the reason it is everywhere is more to do with the lack of any other options 2-4 years ago than anything else. Except maybe native support for windows. That helped as well.
Today I continually discover people using mysql who are blissfully unaware of the data quality problems and licensing fees. Since they aren't fixing their silently corrupted data (silent errors, truncated strings, truncated numbers, invalid dates, invalid relationships, etc, etc), they probably feel that mysql is a very low-management solution. Meanwhile, I've known several who've migrated *from* mysql exactly because they couldn't deal with the efforts involved in fixing data quality problems. Likewise, they'll tell you how cheap the product is - completely unaware that they are using it in violation of the licensing agreemments and need to cough up some cash.
And this data quality issue can be ok - lets face it, a bowling score application built for fun doesn't have critical data in it. Who cares if it occasionally bungles the data?
There is no question that mysql isn't yet to where db2 & oracle were in 1983. That's easy to show (no views, silent errors, etc, etc). And there's no real question about the value of practices, techniques, and technology that protect data in a database. That's easy to show, and no amount of fans of mysql are going to budge 40 years of data management progress. The only real question is how long it'll take mysql to seriously address these issues.
And when it does, it'll be an extremely respectable solution. Until then, it's best for applications where the data doesn't have much value.
> As for mission critical, Yahoo Finance, Associated Press, Lycos, Los Alamos Laboratory, NASA and Suzuki (to name but a few) would disagree with you there.
Ya, and I've seen a dozen companies that managed mission-critical data within the complex and often incorrect spreadsheets of secretaries insecure pcs.
So? The fact that company x doesn't care about data quality problems says more about the company's competance than about the importance of those problems.
"You silly sausage . . . check your facts first next time before posting such knee-jerk drivel" wrote:
> I guess SAP is well, clueless, being that MySQL MaxDB has been their open source database of choice
> for years.
MaxDB? Ah, "Adabase". That heap of 70s technology has been around for at least 20 years, and predates most relational databases. The product was probably open-sourced since it is reliable with data - and so a great addition to the mysql stable. However, it's been a 'has-been' in the database world since the 80s. I have no idea why anyone would want to use it today.
Kind of like someone getting excited about using IBM's IMS DB for their small site. Why?
ken
> however hard you try, you can't polish a turd
.net they'll have the flexibility to revisit the internals and start improving them?
you can if it's frozen
maybe once microsoft gets enough users migrated over to
of course, this is just idle speculation, and not that they ever seem interested in anything so far removed from shiney new product features.
> I'm afraid. 95% of software development problems would be solved by having good, descriptive, well thought out specs.
yeah, that's the problem: requirements gathering is the most failure-prone activity in software development.
Waterfall methods approach this problem by generating pallet-loads of documentation, and having everyone and their brother (theoretically) review the paperwork. This of course almost never happens. When it does happen it's almost impossible for the users to visualize the system anyway - so tons of gaps & errors occur.
Agile methods approach this problem by delivering the software one step at a time. Something the users can actually interact with, which they do, and then much more easily detect requirements gaps and errors.
Agile isn't a silver bullet - it relies very heavily on developer skills, and isn't helpful at getting a specific set of requirements completed at a specific date. On the other hand, brooks documented the benefits of focusing on developer skills 40 years ago, and nothing else is accurate at determining target dates anyway.
ken
> Now, yes, with clustering and keeping stuff in memory, you can get good response times. That's not
> some deep, fundamental property of the database, it's a simple add-on that sits on top of the
> database. I believe there are already solutions like that for PostgreSQL and MySQL, although they
> are not quite as flexible as those for DB2 yet.
Um, no. We're not talking about just setting up veritas on a couple of sun 6500s with 24 gbytes of memory, assuming that every product in the dbms category from MS Access to Terradata has the same scalability, crossing your fingers and hoping for the best.
We're talking about multiple forms of data partitioning, parallel query, and query rewrite (to redirect against summary tables). All built into the database - so that the optimizer can take full effect, and so it's (relatively) easy to build & manage, and will run portable sql.
Add to this dozens of fine-tuning knobs to dial in the specifics of your environment, very clever optimization, very functional sql, etc - and comparing an open source database to oracle/db2/teradata at this scale is rediculous.
It's like some kid infatuated with his brand-new honda bragging about how it can pull stumps as well as a farm tractor.
> PostgreSQL cannot compete with Terradata in this space.
right, but this is a huge space: and given the rate of data explosion, 3TB databases have become extremely common. And that's a good thing for all database products & customers: it means that db2, oracle, etc can keep adding more exotic features to support multi-TB databases while postgresql, etc can quietly pick up more of the smaller ones.
> Also one question-- suppose you have your storage go bad on a blade-- what happens to your
> relational integrity if the records are spread across various servers?
Depends on which dbms you're talking about: on oracle it's no big deal, since they use shared-disk, and one node can quickly take over for another on 10g (was OPS). Of course, that's a complex, high-priced product, and doesn't scale as well as the shared-nothing options like db2 & terradata. I can't answer for terradata, but if you're on db2 then you've got a few options:
1. implement a HA solution between pairs of nodes, so that one node can take over for another. This can be via HACMP or Veritas (both complex) to take over the storage from the crashed node, or even HADR (client reroute - not complex) to support a second duplicate node.
2. implement a failover database. This keeps the architecture very simple, and is a more complete solution (read: disaster recovery) - since it can protect you against datacenter floods, etc: by locating the failover in an entirely different data center.
Since we're generally talking read-only data here, you've got more options than on a large transactional database. I'd typically prefer #2, followed by #1/hadr. HACMP and Veritas are a pain in the butt, IMHO.
> something like Craig's list, if you had to pick between MySQL and Postgres only, not based on claims
> by the distributors, but based on the capabilities you elaborate in your post and what you know
> personally. Starting from scratch.
ok, just shooting from the hip here, ymmv, etc, etc, here's my $0.02 on that:
Assuming that you partitioned the application like Craigslist - by city, that drastically simplifies the scaling issue: since each city can be its own server, with possibly another server to support cross-city summaries. Your application would have to know which city translates to which server - but that can be handled easily many ways.
I'd go with postgresql given the choice, since:
1. partitioning by city drops the top read performance requirement. mysql has better read performance using myisam, but you could probably meet the requirement at the city level just using postgresql.
2. postgresql's write performance is *way* better than mysql's using myisam, and also better than mysql's using innodb according to anecdotes.
3. postgresql is just simply a far more mature database right now. Mysql has a few attractive features to it, but its gotchas (http://sql-info.de/mysql/gotchas.html) are bizarre in a data management product. And not everything you do will be simply serving up content - you'll have transactions supporting calendars, the creation of classified ads, discussion forums, etc, etc.
4. postgresql is free, mysql is not. If you're trying to save money, and might want to have a bunch of low-end servers, you don't want to be shelling out $500 each. Then extra for online backup software.
A few other notes:
1. mysql has two things going for it for something like this: myisam query speed, and its query-cache. This is one of the few places the mysql query cache is actually useful. And it would probably have a huge impact here. In fact, if your read:write ratio is sufficiently in favor of the reads, and your data volumes are sufficiently high - then I'd have to give mysql a second look. But as long as postgresql can meet your performance needs (and I suspect it can), then its other strengths (quality & exception handling, online backup, standard sql features, etc) put it way ahead of mysql.
2. I'd definitely treat each subject area within the site as a separate set of tables - and try to completely encapsulate them in the application. These subjects are discussion forum, housing, community, personals, etc, etc. By doing this you'll be able to scale them independently, move them around on different servers, even use a different product for just one component if you wanted.
3. I'd also define some performance targets - and early in the development cycle test out your performance to see if you can hit those targets.
4. Assuming a typical growth scenario, I'd start with multiple databases on one server, then gradually move them to separate servers as the hits & revenue increased (hopefully at the same time).
5. As far as the boxes are concerned, to keep costs low I suppose I'd go with lots of memory (4gbytes), battery-backed hardware raid with 128 mbytes or so of cache using 15,000 rpm drives if possible. raid-5 probably won't hurt you on postgresql or mysql with innodb as much as it would mysql with myisam.
6. Commercial databases are probably also worth considering - since they would allow you some of the scalability benefits via partitioning - and could help you avoid having to run 20 servers, most of which are idle most of the time. Labor's money too, and commercial database products start around $500-$1500.
well, i guess it's time to get back to work. oh, and good luck.
> Take a look at the hardware and the cost per transaction per second. I am not saying this could
> not be done on PostgreSQL (it could) but it would expensive as well.
Good point - a database performing 50,000 transactions a second isn't running on your grandmother's pc. But it wouldn't be correct to assume that all databases scale equally. This is where the greater tuning opportunities of db2 or oracle come into play:
- instead of just a single 8k page size, db2 has 4k,8k,16k, and 32k page sizes.
- db2 has much more flexible statistics gathering than either mysql or postgresql
- db2 has a far more robust & intelligent optimizer than any open source database
- db2/oracle/etc have query parallelism
- db2 in particular is very good at performing most database operations asynchronously: data writes are sent to a buffer pool by one agent, to a log buffer by another, from the log buffer to the log file by another, from the log buffer to actual storage by another, etc, etc.
- both mysql & postgresql have very primitive memory tuning capabilities: mysql can't share io buffers between innodb & mysql. Postgresql just doesn't have many options to manage sort memory, separation of memory buffers by tablespace, etc, etc.
But keep in mind - many of these weaknesses are also strengths: since the missing configurability is translated to simpler management for smaller & less-demanding databases.
> However, unless you have all the information in memory or unless the the data is spread across a
> large number of disk arrays, I don;t see how you can even get the information to process it in less
> than a second.
Ok, here's an exeample architecture using db2. Would work identically on informix, and probably Terradata. Oracle has similar capabilities, though a little different since they're shared-disk rather than shared-nothing for db2/informix/terradata.
Step 1: spread data across 10 2-way 64-bit AMD blades. Each blade has 8 gbytes of memory and a fibre connection to storage. You can expand this up to *hundreds* of blades (nodes) - and get almost linear performance improvement the entire time. You've now got a 20 CPU system for about 10% the price you'd pay for a Sun E10000 or HP Superdome. Since the data is spread across all nodes by hashkey - every query will get 20 CPUs processing it together - each on 1/10 (100 gbytes) of the original data.
Step 2: partition the data on each node by range or value via MDC. Say, by 'day' & 'department' - so that you've got (for example) 1000 days & 1000 departments. Now, each query will only tablescan the data within the blocks it needs: whether it needs to scan 5% or 75% it gets a linear performance improvement corresponding to the reduction of data.
Step 3: implement parallelism on each node. Now, this sample config only has 2 CPUs - so the opportunities aren't as great as they are on an 8-way. But a two-way will still usually cut your query time in half.
So, now a few examples:
1. query for 10 days of data for 1 department:
a. clustering cuts 100 gbytes on each node down to about 1 mbyte to scan on each node.
b. query parallelism cuts down scan on each node to about 500 kbytes
2. query for 100 days of data for 100 departments:
a. clustering cuts 100 gbytes on each node down to about 1 gbyte to scan on each node.
b. query parallelism cuts down scan on each node to about 500 mbytes
3. query for 365 days of data for all departments:
a. clustering cuts 100 gbytes on each node down to about 40 gbytes to scan on each node.
b. query parallelism cuts down scan on each node to about 20 gbytes
All three queries - whether very selective or not, all use the same facility for a linear performance improvement (unlike indexes). The last query is unlikely to complete in 1 second - but I'll bet it'll be fast nevertheless - especially since it'll probably end up with over 2