> However, in a commercial environment, defining the spec is really writing a > contract, which protects both the customer and the editor. Specs there defines > what the software can and must do, and ensures it will do.
And by focusing on the letter of the agreement rather than the spirit it also increases the chance of failing to meet the needs of the user. The fact that you've got a contract may protect you in court, but it won't win you any return business or references from yet-another-pissed-off-customer.
Just one of many reasons why agile methodologies are replacing paper-heavy & model-driven methodologies on so many commercial development projects.
Ten years ago Informix supported MPP configurations in which it would sit on top of MPP servers like the IBM SP2 (Deep Blue - the chess computer). In this configuration you'd have 20-200 separate nodes, each with its own memory & disk and Informix would be responsible for spreading its data across the nodes, running a query across nodes, and joining the results.
The sp2 was originally intended for scientific computing, but then most were sold as giant database servers. Around 1996 (I think) db2 also supported this kind of clustering.
Today the SP2 is no longer sold, but IBM supports the ICE configuration for db2 - in which you put together 5,10,200 blade servers, each with fibre connection to disk, and local memory. Then db2 handles queries spread across all blades. And of course, you don't have ot use blades - you can use desktops, or whatever. And it supports windows, aix, linux, solaris, etc. Informix still supports this kind of clustering as well.
Oracle now supports clustering over linux with RAC. It's more expensive, and doesn't scale as high, but probably is better at failover. And I think the teradata server (unix variant os & dbms) is also an mpp. At least it used to be.
Anyhow, I'd be surprised if there are as many scientific applications of clusters in production as there are databases.
> It gets 50 million page views a day and it's up to a 13+ million post count, with at least 1 million registered users. Is Slashdot not proof that MySQL has scalability?
Great question. But no it isn't.
Scalability isn't about size - it's about the economy of size. That is, if one million pcs running MS Access could also handle that load you wouldn't call MS Access scalable - because the cost to handle the load would be so much higher than other solutions.
So, running Slashdot isn't like getting a good score on the TPC: 1. it doesn't tell us what the hardware investment is 2. it doesn't describe how many of those queries come from cache that doesn't hit the database vs actual database hits 3. it doesn't tell us what compromises were taken - that might make it faster, but occasionally (or not so occasionally) result in data integrity problems 4. and we do know that slashdot provides very little in the way of real-time trending, in depth analysis of the data, etc. Exactly the kinds of queries that mysql *can't* scale to. The type of functionality offered is mostly restricted to view posting list, view details of one posting, insert new posting. This is *very easy* to scale on most databases. What wouldn't be easy to scale to do - is to provide a charts that showed (real-time, or nearly so):
- how the participation in a discussion ebbs & flows
- how angry or confrontational a discussion is, and how that changes over time
- which parts of the discussion are the "hottest"
- etc, etc, etc
So, yep - mysql can probably handle a good read-intensive *transactional* load. But it can't handle a very heavy mixed load, nor a very heavy load in which genuine data analysis occurs. Which isn't something far fetched at all any more - but constitutes most of what I am delivering these days. With Oracle and DB2...
> That page made 10 calls to the db with a total processing time of roughly 0.0055 seconds.
Right, looks like that's working well for you. With mysql getting that query performance on a 4m row table, it must be using an index, which means that you're selecting less than 3% of the data each time. The scalability issue emerges when your result set is 10% of the total table, you have to sort 100,000 rows, etc.
> Oh, it would still be available... I would just connect to a different db to pull it in. My code is > already written for that to happen, I just need to change the connection string for the historical db. > Right now it is exactly the same as the current db connection... so they both live in the same db.
> I would say that MySQL and PostgreSQL are free and DB2 starts at $700, but that cost is just the beginning. postgresql is free mysql is only free under some circumstances, and if you are distributing client apps that connect to a mysql database you might want to get legal advise before assuming that it is free.
> What happens if you run a processor with multiple cores? doesn't matter to db2's licensing on intel (not positive about power5, etc)
> How much if you want to run it on one of their mainframes? costs quite a lot more to run db2 on a mainframe. Then again at least it runs there, mysql doesn't. So, I'd count that as a bonus for db2, not a penalty. Of course, if you're running linux on the mainframe, they can both run there. Not sure about db2 licensing then, but assume that it would still follow its linux licensing model.
> It is my experience that Oracle and IBM tend to "give" away some of their products, but they do that to hook you in to the upgrades.
Not really - five years ago there was no really cheap version of either. The difference is that now there is more competitition on the low-end, so DB2 & Oracle have to be more competitive there in order to keep market share against sql server, postgresql, mysql, etc.
Besides, mysql is really the nasty example of this behavior with their hybrid licensing model.
> OK... I'm not arguing about the *premise*... but... I don't consider your numbers to be justification for a different db.
well, depends on what you do with those rows: mysql or postgresql can certainly handle a database with ten million rows in a single table. But unless you can rely on a b-tree index to select just the data you want - you'll end up doing a table scan - and then it'll take 60 seconds or so to get your query back.
With db2/oracle/informix/etc - you've got other options for this 10 million row table. If some of your queries select too much data (like 100% of the data for one customer, group by host & date, then show changes over time for their hosts) you can implement a range partition by customer. Then when you can't use your b-tree you can at least just scan the data for that customer. That might be just 5-10-15% of the total number of rows. Then in addition, you've got query parallelism. Since this is on a 4-way I expect to get response time 25% of what I'd see with a single threaded query like mysql or postgresql performs. Then you've got materialized views, better optimization, etc, etc, etc.
In the end, this table could be supported by postgresql/mysql. But I'd get 60-second response time on queries against that table. With db2 I'm going to get sub-second response time.
> I figured in a few years I may need to split the db and store the historical data on a different machine... > But, I'm not thinking it is too big at the moment... or for a few more years.
Sure - especially if you really don't need history very much. Personally though, I like to enable trending for data analysis - even on your typical transactional databases. And so I find that historical data is pretty valuable.
> You are talking about "massive" tables and a "reporting" application. Of course Oracle and DB/2 are the right choices for this. Has anyone ever thought differently?
Yes - everytime a product or technology is overhyped people believe it will do everything. Search for mysql and data warehouse - you'll find plenty of people who think it can handle massive data without really understanding what db2/oracle/informix/etc do that's different.
> The only time I've ever heard anyone compare MySQL to Oracle and say that MySQL was just as good as > Oracle is when using small tables on a web app, which is exactly what MySQL is used for 99% of the time.
Right - in that context mysql/postgresql are competitive. Of Course, that isn't what MySQL AB is telling people. Little surprise, these are the same people that a few years ago were telling people that they didn't need primary key/foreign key constraints either.
Anyhow, I've got an application right now that for its first two years stayed quite small - just a few thousand rows. Implemented in db2 - since we need a major database anyway for the larger databases with a billion rows, and we can save labor by staying consistent. Anyway, this little database is about to now explode in size due to expanded requirements and new customers - we're expecting one little critical table to go from 3,000 rows to around 500,000. Its history table will go from maybe 10,000 to 10,000,000 rows. Other tables throughout the databae will likewise expand. DB2 was overkill when this database was first created, now it's perfect: it'll support the heavy transactional loads, automated system failover, and very fast scans, reports and other tough queries. This application growth is not at all uncommon - everyone is putting far more data into databases than they were just a few years ago.
So yeah, mysql and postgresql are neat products, but their lack of scalability for massive tables or analytical queries is a major gap. And the requirements for this capability are now commonplace - unlike ten years ago when only a few data warehouses really need it.
1. The reason that mysql sometimes silently truncates data isn't because they are being clever - it's because they were being sloppy: it's occurs inconsistently, has been acknowledged by the company (relucantly) to be a bug, and they are now bragging about fixing it.
2. Even if it was a clever strategy it is inconsistent with every other database product out there. This interferes with portability, standardization, and commoditization.
3. The value of constraints in a database has been embraced by every other database product, and many other application and database designers. There is no credible argument against database constraints, so your defense of their sloppiness via this argument is irrelevant. But with this in mind - here's a quick explanation of why you're in the minority in this view: consider that your application layer is changing over time, you may add additional applications pointing to the same data, but the older persisted data is not changing in the same way. You may add validations & edits in place today for new data - but your old data idn't benefit from these. If you only validate at the application layer then you will end up with inconsistent data. For this reason you want constraints at both the persistence and application layers - they complement each other well. It's not one versus the other.
Agreed - it is a real database, that can do real work and provide real value.
However, it has a history of inexcusible quality problems with silent errors, silent data truncations, etc, etc. The leadership of mysql also has famously stated that almost nobody really needs transactions. So, there's quite a lot of skepticism about the company and the product.
MySQL might turn that around. Hopefully they will.
Back to price - the difference isn't always so huge: I know that db2 is around $700 for a license limited to a two-way SMP, MySQL is around $500. I think Oracle is also under $1000 for a similar machine. Now, db2 & oracle can go *way* up from there (to support 16-way partitioning, etc). But mysql doesn't even compete there. In the space it does compete, oracle and db2 are in the same ballpark.
> InnoDB supports the features you mentioned, and that I quoted, so why is that less desireable than having the table handler builtin?
Because it is core functionality.
Having a separate product responsible for managing io may be one of the reasons that mysql's optimizer is so primitive - and may be a reason why they will struggle to improve it. Note: the optimizer is the component responsible for determination of how your joins will be performed under the covers - mysql is notorious for failing to use indexes it should, using indexes when it shouldn't, never using star-joins, five-way joins that day 10x as long as they would in any other database, etc.
Perhaps it's also why views took forever to implement, and materialized views might take another forever.
It's also perhaps the reason for all the inconsistencies in table creation.
There are also benefits to using innodb - it has undoubtably speed up mysql's development by several years. Still, now it's a boat anchor that should be abandoned.
> For me, the massive community is a big plus. A rich set of tools is another advantage.
yep, although it may be the only advantage that mysql has, it's a huge one.
> And Oracle 7 for the two years before -- in a not-so-large database -- I think there is not much to fear...
Actually, no: many databases these days are now supporting various types of logs - in which you've got tables with tens of millions of rows. Oracle and DB2 have the following in place to support massive tables: 1. query parallelism - that provides linear performance improvements up to 4 cpus or so 2. data partitioning - that allows the database to just scan data needed, rather than entire table when indexes aren't appropriate (b-tree indexes only work for around 1-3% of data) 3. materialized views - in which views actually hold data - and this data is kept up to date by the database server. Often used for summary tables. 3. query rewrite - in which your queries are automatically re-written to apply to a summary table (see materialized view) if one exists. 4. clustering - in which data spans multiple servers, and all servers work together on your query. 5. smart optimizer - intelligent score-based optimization responsible for determination of best access paths for your queries.
With the above features db2 or oracle can drive 40x the performance of mysql or postgresql in a reporting application (or transaction one with a few large scan-oriented tables) on identical hardware (say a 4-way SMP). If you didn't see an impact from these features then either you have one of the fairly rare apps that can't benefit, or you should revist the design.
Don't get me wrong - I really like postgresql. But neither it nor mysql is even in the ballpark for performance on db2 or oracle. Nor is the price much different - db2 is only around $700 for 4/5 of the above features on a 2-way smp vs $500 for mysql. Eventually mysql & postgresql will support these features. But it'll probably be five years before they are working well.
> Add in that PostgreSQL's core engine has long been about 5x faster than Oracle's (not to mention orders > of magnitude easier to set up and administer) and basically the only reason left to go with Oracle is > their clustering. No doubt there are places that need that, but it's a pretty small niche.
It's not that small a niche anymore: the need to query vast amounts of information has changed from a specialization within data warehousing to a commonplace requirement that a significant minority of applications now requires.
So, it's not unusual for the kind of applications that we used to build five years ago to today also require the logging of everything in the world - requiring the ability to quickly query tables with tens of millions of rows.
And sure, postgresql/mysql/firebird can query such a table via a btree index quickly. But if you need to process more than just 1-3% of that table you're out of luck with those options. You want partitioning, query parallelism, fine grained memory management, flexible storage options, automated query rewrite against summary tables, sophisticated optimizer, etc. And db2/oracle/informix all have those features. And it will make the difference between a 2 second query vs a 80 second query.
The open source options will eventually get there. In the meanwhile, you need other options to manage large volumes of data well. And if you *must* use db2/oracle/informix to handle your large databases, you might as well use them for your small databases as well - since licensing cost for small databases is often cheaper than the OS, and it's easier to reuse your labor - which costs far more than the licensing costs.
> This is not a troll. If IBM wants to become an OSS company - they should open up their programs - > especially DB2. It is a nightmare to use that in collaboration with Samba, LDAP etc.
Why can't ibm just open up part of its software catalog and still be an oss company? I think over time we'll see it continually releasing software to open source as revenue from those products diminishes.
And sure, it might be nice to release db2 as open source now - but there are already a handful of good open source options (even if none can scale as high as db2).
Regarding samba & ldap: i haven't tried db2 on fileshares or samba, but wouldn't normally want to anyway - very fast access to disk is so critical to good performance. And db2 now supports (as of about nine months ago) an open security API that you can use to connect to any authentication/athorization mechanism you want. Granted, better out of the box connectors for ldap would still be good tho, and I'm assuming we'll see that in next six months: security requirements for databases has changed quite a bit over last few years.
ah, so the wise argument is to treat all subjects like they are balanced controversies?
Then how about adding to this new ID/Science teaching: 1. the earth is flat to all geography classes 2. there's no record of jesus actually existing to all bible classes 3. that the sun rotates around the earth to all astronomy classes 4. that mental illness is caused by bad humors in the brain that can be solved by releasing them through holes in the skull
However, the magna carta mostly focused on protection of rights of nobles. Didn't do much for anyone else (at least not directly).
Additionally, I don't see how Athens was hardly democratic since it only allowed male citizens to vote, but the early US was democratic even though it only allowed non-indian, non-black, male, land-owning citizens to vote.
It looks like you may have rearranged the facts to fit your theory.
that was my motivation as well: after writing cobol, jcl, cics, and ims dc & db - rexx was just really fun.
Since it was the macro language for ispf - i first wrote about twenty macros - so you you could put the cursor on a filename, hit a pfkey and automatically browse the file, etc, etc.
Wrote code that would convert a cobol program to cobol II (all periods except one replaced in every paragraph by scope terminators, etc). Then systematically converted *hundreds* of programs to this newer style.
Wrote a ton of little apps with Dialog Manager, rexx, and db2 - these little apps could be written in 1-2 days, were easy to maintain and performed CRUD on about a dozen tables. Probably more productive than Ruby on Rails today;-).
Also started writing batch apps in rexx instead of cobol. That also made it easier to eventually transition to c.
But the world of perl, java, and c wasn't fun. Really no better than cobol & jcl. Rexx was fun. But luckily now I've found python. Python is also fun. So i'm writing code again.
I suppose i could still use rexx (it's a stored procedure language in db2, and works on most platforms still). But these days python and ruby look more promising. And still fun.
> Yes, it is. I've worked on a project that allowed offline modification of a database by replicating a > copy to user's PCs, and it originally used XML as the format for data transfer. We got a 30% speedup > by switching to tab-separated variables with a line of metadata at the start of each chunk of the > stream.
Yeah, but I don't think that's a reasonable example - replicating a copy of a database to a user?
You also probably could have gotten a 75% speedup by compressing it before you sent it. But that doesn't mean that plain ascii is a bad thing.
And no, I don't enjoy working with xml. I think it's bloated & overhyped. But...in sending data between organizations or systems, its self-documenting nature is a plus. Usually.
Actually, I think that using sqlite would be far better than oracle for this purpose: Oracle carries a lot of baggage along with it - and really isn't optimized to just as few resources as possible on a desktop.
SQLite on the other hand, is so small and has so little overhead that it's probably very good for system performance. And with so little code involved it's much less likely to require upgrades. And (unlike another light database) it's highly sql compliant.
Re:The future of databases is... no Database at al
on
The Future of Databases
·
· Score: 2, Informative
> So, instead of investing on a DBMS why not buy massive quantities of ECC memory and keep all > instances of your data in-memory for near instant access?
because a *well-tuned* relational database with a 1:4 ratio of memory to disk is almost as fast as an in-memory database - due to efficient caching
because some queries require an enormous amount of temp space. supporting them can easily double your space requirements - which have to be purchased in memory.
because if you just want to run your database in-memory you can already do that with most databasees.
because you don't have the same speed requirements for every piece of data in your database. You might have some tables used for session & user management that are often read & written to and must be very fast. But other tables that just hold seldom-accessed historical data. A modern database would allow you to keep the small & fast tables effectively in memory, and the huge 100 gb history table on disk. And you don't have to buy 100 gbytes of memory to do it.
> In other words, some indexes are great for "field > number" queries, while others are better at "field matches string_pattern" queries.
yep
> So in order to have effective heuristic or "fuzzy logic" queries, somebody will need to work > out indexes and hashes for each fuzzy logic matching operator, or write an algorithm that > figures out how to make those indexes and hashes. And that's ummm... rather more difficult. > So until then, we have to catch as catch can with with analyzers and aggregators doing > underlying exact queries and applying as many optimizations as they can.
Right - we need new sql operations as well as new optimization capabilities. Some of these might be in the works, I don't know.
But we also need new ways of modeling - right now quite a few of us are creating more flexible models that sit within a relational database - but aren't able to take full advantage of key relational database capabilities like referential integrity, unique constraints, etc. That might sound a little wacky but sometimes we need more flexibility in modeling data than you can get out of 3NF: like the ability to add new types of things, new relationships between existing types, or perhaps attributes (weight, etc) upon the relationship. And do this without schema changes.
Lastly, one of the big obstacles is that we're quickly moving into new terrain here. I remember once discussing an e-commerce catalog with a team of engineers - and they *could not* understand how a hierarchy could be insufficient for describing items in the catalog. The notion of network of items in which there are n-parents caused a lot of sweating!
> Digital did this over ten years ago. One of the things that Oracle inherited when they bought RdB > from Digital was the cluster support. However it seems they tool a long time to get the technology > into their own RDMS.
I've got fond memories of an 800 gbyte billing & customer data warehouse on rdb around 1995 - giving sub-second access and running on a vms quad. That was such a slow system compared to what we've got now - but it sure handled a ton of data well.
On the other hand, I don't remember how much was due to the excellent clustering in vms - or how much was due to rdb...
agreed - as a human-readable way of persisting data it stinks. as a way of persisting data it stinks. But it isn't bad as an over-the-wire protocol.
> vertical partitioning...
Hmmm, don't see much vertical partitioning in data warehouses any more. Used to on oracle years ago, but can't even remember why. But I am finding that both mean range & hash partitioning work well together. The range is cheaper & easier to implement but only gives a performance benefit when you only want a subset of data. The hash comes into its own when you want to query 50% or more of the data.
> materialized views...
have to disagree: whether or not to use them depends a lot of on the update frequency, query frequency and query cost. But I've got queries running 1000x faster on summary tables (not automated materialized views, but same idea) than they would on base data. *1000x* is a lot faster.
> BUT all of this ignores the very real possibility that hardware is getting fast enough > to not have to do any of this crap.
yeah, but that ignores the very real fact that data is increasing *faster* than moore's law. So the cheap commodity machines *are* table to blow away the machines of ten years ago in doing the tasks from ten year ago. But today it isn't unusual to find a database loading ten million rows a day. Then someone wants to crunch away at 500 million rows in a query - and wants it to come back in 5 seconds. Commodity hardware is a long way away from doing this. Unless of course, you connect them in clusters. DB2 can do that very well - but Oracle is just starting and doesn't scale so well there yet.
Re:moving past relational model? I thinketh not
on
The Future of Databases
·
· Score: 2, Insightful
> it reads like a battle cry for us to move beyond the relational model.... it's NOT going to happen...
a couple of thoughts on that -
1. relational databases are really quite wonderful for analytical apps. Need to store two years of firewall/sales/whatever data - then churn away analysis? Great - no problem. And it's easy enough to do either through hand-written sql or via a tool. There's plenty that requires third-party tools (and data stores), but even in this scenario the staging area is almost always the relational database.
2. a lot of folks who would like to eliminate relational databases fail to account for point #1. They complain of the object/relational mapping problems. Ok, that's fine - but if you put your data in container-managed persistence or an object database - you'll then have to pay someone to pull it out and put it into a separate relational database for analysis. Of course, you might be fired right about that time...
3. java in the database is mostly a pain in the butt: On the performance side you've got optimization complexity, on the managability side you've got unusual dependencies, build processes, etc, on the availability side you've got the ability to take out the entire server with some bad code (ok, sometimes).
4. two-tiered architectures with a web service driven directly out of the database is more than just a pain in the butt. It's a security disaster. A cobbled together architecture. And Jim Gray shilling for microsoft.
5. the column-store as Jim Gray described it has never really left us. And we don't really need new technology to handle it.
6. users love tables. there are quite a few users out there that truly love tables - they understand them, they look just like spreadsheets, they can query them. This is important: it's fabulous when your users can easily understand your design.
7. however - like Gray said, we now need methods of working with data that go far beyond boolean logic. We need fuzzy logic queries. And we need new types of models - allowing for multiple many-to-many relationships via relationship tables. This breaks codd's rules - but is essential for agile & fast-moving projects.
Re:I want clustered databases for high-availabilit
on
The Future of Databases
·
· Score: 4, Informative
> The "next great advancement" in databases will be when I can setup 2 or more linux servers and have >them act as a single database server. Our database server is the most expensive item in our datacenter >because it's an N-way IBM server.
lol, IBM has supported *exactly* what you are talking about for at least five years.
That is, you can spread your db2 database across 10,100, or 1000+ linux commodity boxes (ideally blades). Or you can use windows, or aix, or solaris, or hp-ux, etc. Of course, those individual boxes can be SMPs in their own right - so a thousand 8-way aix boxes is certainly possible, if not cheap.
Oracle is now in this game as well - oracle 10g can certainly support 32, and maybe 64 individual linux boxes in a cluster. The techniques are different between the two - oracle might be better at transactional systems. db2 is definitely better at data warehousing, data mining, etc.
Of course, there are still benefits to a big smp: a single P570 16-way will cost you $250k. But each of those 16 cpus is multi-code (and far faster than intel or amd), and with its micro-partitioning - it can run at least 150 linux or aix lpars (logical partitions). These lpars can grow or shink as they need - so you aren't always over-buying for size, buying new hardly-used hardware, or having to colocate apps on a busy server - when a different os would be preferable. Not to say everyone should go this way - but there are definite benefits.
It all comes down to how reliable he wants it, and what his time is worth. If his time isn't worth much, and he can afford to scrap some parts a few times, the build-it-yourself route can certainly save money.
On the other hand, if his time is worth money, and if a loss of this machine's availability will cost money, it's worth spending that extra cash on a full-tested vendor solution.
> However, in a commercial environment, defining the spec is really writing a
> contract, which protects both the customer and the editor. Specs there defines
> what the software can and must do, and ensures it will do.
And by focusing on the letter of the agreement rather than the spirit it also increases the chance of failing to meet the needs of the user. The fact that you've got a contract may protect you in court, but it won't win you any return business or references from yet-another-pissed-off-customer.
Just one of many reasons why agile methodologies are replacing paper-heavy & model-driven methodologies on so many commercial development projects.
Ten years ago Informix supported MPP configurations in which it would sit on top of MPP servers like the IBM SP2 (Deep Blue - the chess computer). In this configuration you'd have 20-200 separate nodes, each with its own memory & disk and Informix would be responsible for spreading its data across the nodes, running a query across nodes, and joining the results.
The sp2 was originally intended for scientific computing, but then most were sold as giant database servers. Around 1996 (I think) db2 also supported this kind of clustering.
Today the SP2 is no longer sold, but IBM supports the ICE configuration for db2 - in which you put together 5,10,200 blade servers, each with fibre connection to disk, and local memory. Then db2 handles queries spread across all blades. And of course, you don't have ot use blades - you can use desktops, or whatever. And it supports windows, aix, linux, solaris, etc. Informix still supports this kind of clustering as well.
Oracle now supports clustering over linux with RAC. It's more expensive, and doesn't scale as high, but probably is better at failover. And I think the teradata server (unix variant os & dbms) is also an mpp. At least it used to be.
Anyhow, I'd be surprised if there are as many scientific applications of clusters in production as there are databases.
> It gets 50 million page views a day and it's up to a 13+ million post count, with at least 1 million registered users. Is Slashdot not proof that MySQL has scalability?
Great question. But no it isn't.
Scalability isn't about size - it's about the economy of size. That is, if one million pcs running MS Access could also handle that load you wouldn't call MS Access scalable - because the cost to handle the load would be so much higher than other solutions.
So, running Slashdot isn't like getting a good score on the TPC:
1. it doesn't tell us what the hardware investment is
2. it doesn't describe how many of those queries come from cache that doesn't hit the database vs actual database hits
3. it doesn't tell us what compromises were taken - that might make it faster, but occasionally (or not so occasionally) result in data integrity problems
4. and we do know that slashdot provides very little in the way of real-time trending, in depth analysis of the data, etc. Exactly the kinds of queries that mysql *can't* scale to. The type of functionality offered is mostly restricted to view posting list, view details of one posting, insert new posting. This is *very easy* to scale on most databases. What wouldn't be easy to scale to do - is to provide a charts that showed (real-time, or nearly so):
- how the participation in a discussion ebbs & flows
- how angry or confrontational a discussion is, and how that changes over time
- which parts of the discussion are the "hottest"
- etc, etc, etc
So, yep - mysql can probably handle a good read-intensive *transactional* load. But it can't handle a very heavy mixed load, nor a very heavy load in which genuine data analysis occurs. Which isn't something far fetched at all any more - but constitutes most of what I am delivering these days. With Oracle and DB2...
> That page made 10 calls to the db with a total processing time of roughly 0.0055 seconds.
Right, looks like that's working well for you. With mysql getting that query performance on a 4m row table, it must be using an index, which means that you're selecting less than 3% of the data each time. The scalability issue emerges when your result set is 10% of the total table, you have to sort 100,000 rows, etc.
> Oh, it would still be available... I would just connect to a different db to pull it in. My code is
> already written for that to happen, I just need to change the connection string for the historical db.
> Right now it is exactly the same as the current db connection... so they both live in the same db.
sounds like a good plan
> I would say that MySQL and PostgreSQL are free and DB2 starts at $700, but that cost is just the beginning.
postgresql is free
mysql is only free under some circumstances, and if you are distributing client apps that connect to a mysql database you might want to get legal advise before assuming that it is free.
> What happens if you run a processor with multiple cores?
doesn't matter to db2's licensing on intel (not positive about power5, etc)
> How much if you want to run it on one of their mainframes?
costs quite a lot more to run db2 on a mainframe. Then again at least it runs there, mysql doesn't. So, I'd count that as a bonus for db2, not a penalty. Of course, if you're running linux on the mainframe, they can both run there. Not sure about db2 licensing then, but assume that it would still follow its linux licensing model.
> It is my experience that Oracle and IBM tend to "give" away some of their products, but they do that to hook you in to the upgrades.
Not really - five years ago there was no really cheap version of either. The difference is that now there is more competitition on the low-end, so DB2 & Oracle have to be more competitive there in order to keep market share against sql server, postgresql, mysql, etc.
Besides, mysql is really the nasty example of this behavior with their hybrid licensing model.
> OK... I'm not arguing about the *premise*... but... I don't consider your numbers to be justification for a different db.
well, depends on what you do with those rows: mysql or postgresql can certainly handle a database with ten million rows in a single table. But unless you can rely on a b-tree index to select just the data you want - you'll end up doing a table scan - and then it'll take 60 seconds or so to get your query back.
With db2/oracle/informix/etc - you've got other options for this 10 million row table. If some of your queries select too much data (like 100% of the data for one customer, group by host & date, then show changes over time for their hosts) you can implement a range partition by customer. Then when you can't use your b-tree you can at least just scan the data for that customer. That might be just 5-10-15% of the total number of rows. Then in addition, you've got query parallelism. Since this is on a 4-way I expect to get response time 25% of what I'd see with a single threaded query like mysql or postgresql performs. Then you've got materialized views, better optimization, etc, etc, etc.
In the end, this table could be supported by postgresql/mysql. But I'd get 60-second response time on queries against that table. With db2 I'm going to get sub-second response time.
> I figured in a few years I may need to split the db and store the historical data on a different machine...
> But, I'm not thinking it is too big at the moment... or for a few more years.
Sure - especially if you really don't need history very much. Personally though, I like to enable trending for data analysis - even on your typical transactional databases. And so I find that historical data is pretty valuable.
> You are talking about "massive" tables and a "reporting" application. Of course Oracle and DB/2 are the right choices for this. Has anyone ever thought differently?
Yes - everytime a product or technology is overhyped people believe it will do everything. Search for mysql and data warehouse - you'll find plenty of people who think it can handle massive data without really understanding what db2/oracle/informix/etc do that's different.
> The only time I've ever heard anyone compare MySQL to Oracle and say that MySQL was just as good as
> Oracle is when using small tables on a web app, which is exactly what MySQL is used for 99% of the time.
Right - in that context mysql/postgresql are competitive. Of Course, that isn't what MySQL AB is telling people. Little surprise, these are the same people that a few years ago were telling people that they didn't need primary key/foreign key constraints either.
Anyhow, I've got an application right now that for its first two years stayed quite small - just a few thousand rows. Implemented in db2 - since we need a major database anyway for the larger databases with a billion rows, and we can save labor by staying consistent. Anyway, this little database is about to now explode in size due to expanded requirements and new customers - we're expecting one little critical table to go from 3,000 rows to around 500,000. Its history table will go from maybe 10,000 to 10,000,000 rows. Other tables throughout the databae will likewise expand. DB2 was overkill when this database was first created, now it's perfect: it'll support the heavy transactional loads, automated system failover, and very fast scans, reports and other tough queries. This application growth is not at all uncommon - everyone is putting far more data into databases than they were just a few years ago.
So yeah, mysql and postgresql are neat products, but their lack of scalability for massive tables or analytical queries is a major gap. And the requirements for this capability are now commonplace - unlike ten years ago when only a few data warehouses really need it.
1. The reason that mysql sometimes silently truncates data isn't because they are being clever - it's because they were being sloppy: it's occurs inconsistently, has been acknowledged by the company (relucantly) to be a bug, and they are now bragging about fixing it.
2. Even if it was a clever strategy it is inconsistent with every other database product out there. This interferes with portability, standardization, and commoditization.
3. The value of constraints in a database has been embraced by every other database product, and many other application and database designers. There is no credible argument against database constraints, so your defense of their sloppiness via this argument is irrelevant. But with this in mind - here's a quick explanation of why you're in the minority in this view: consider that your application layer is changing over time, you may add additional applications pointing to the same data, but the older persisted data is not changing in the same way. You may add validations & edits in place today for new data - but your old data idn't benefit from these. If you only validate at the application layer then you will end up with inconsistent data. For this reason you want constraints at both the persistence and application layers - they complement each other well. It's not one versus the other.
Agreed - it is a real database, that can do real work and provide real value.
However, it has a history of inexcusible quality problems with silent errors, silent data truncations, etc, etc. The leadership of mysql also has famously stated that almost nobody really needs transactions. So, there's quite a lot of skepticism about the company and the product.
MySQL might turn that around. Hopefully they will.
Back to price - the difference isn't always so huge: I know that db2 is around $700 for a license limited to a two-way SMP, MySQL is around $500. I think Oracle is also under $1000 for a similar machine. Now, db2 & oracle can go *way* up from there (to support 16-way partitioning, etc). But mysql doesn't even compete there. In the space it does compete, oracle and db2 are in the same ballpark.
> InnoDB supports the features you mentioned, and that I quoted, so why is that less desireable than having the table handler builtin?
Because it is core functionality.
Having a separate product responsible for managing io may be one of the reasons that mysql's optimizer is so primitive - and may be a reason why they will struggle to improve it. Note: the optimizer is the component responsible for determination of how your joins will be performed under the covers - mysql is notorious for failing to use indexes it should, using indexes when it shouldn't, never using star-joins, five-way joins that day 10x as long as they would in any other database, etc.
Perhaps it's also why views took forever to implement, and materialized views might take another forever.
It's also perhaps the reason for all the inconsistencies in table creation.
There are also benefits to using innodb - it has undoubtably speed up mysql's development by several years. Still, now it's a boat anchor that should be abandoned.
> For me, the massive community is a big plus. A rich set of tools is another advantage.
yep, although it may be the only advantage that mysql has, it's a huge one.
> And Oracle 7 for the two years before -- in a not-so-large database -- I think there is not much to fear...
Actually, no: many databases these days are now supporting various types of logs - in which you've got tables with tens of millions of rows. Oracle and DB2 have the following in place to support massive tables:
1. query parallelism - that provides linear performance improvements up to 4 cpus or so
2. data partitioning - that allows the database to just scan data needed, rather than entire table when indexes aren't appropriate (b-tree indexes only work for around 1-3% of data)
3. materialized views - in which views actually hold data - and this data is kept up to date by the database server. Often used for summary tables.
3. query rewrite - in which your queries are automatically re-written to apply to a summary table (see materialized view) if one exists.
4. clustering - in which data spans multiple servers, and all servers work together on your query.
5. smart optimizer - intelligent score-based optimization responsible for determination of best access paths for your queries.
With the above features db2 or oracle can drive 40x the performance of mysql or postgresql in a reporting application (or transaction one with a few large scan-oriented tables) on identical hardware (say a 4-way SMP). If you didn't see an impact from these features then either you have one of the fairly rare apps that can't benefit, or you should revist the design.
Don't get me wrong - I really like postgresql. But neither it nor mysql is even in the ballpark for performance on db2 or oracle. Nor is the price much different - db2 is only around $700 for 4/5 of the above features on a 2-way smp vs $500 for mysql. Eventually mysql & postgresql will support these features. But it'll probably be five years before they are working well.
> Add in that PostgreSQL's core engine has long been about 5x faster than Oracle's (not to mention orders
> of magnitude easier to set up and administer) and basically the only reason left to go with Oracle is
> their clustering. No doubt there are places that need that, but it's a pretty small niche.
It's not that small a niche anymore: the need to query vast amounts of information has changed from a specialization within data warehousing to a commonplace requirement that a significant minority of applications now requires.
So, it's not unusual for the kind of applications that we used to build five years ago to today also require the logging of everything in the world - requiring the ability to quickly query tables with tens of millions of rows.
And sure, postgresql/mysql/firebird can query such a table via a btree index quickly. But if you need to process more than just 1-3% of that table you're out of luck with those options. You want partitioning, query parallelism, fine grained memory management, flexible storage options, automated query rewrite against summary tables, sophisticated optimizer, etc. And db2/oracle/informix all have those features. And it will make the difference between a 2 second query vs a 80 second query.
The open source options will eventually get there. In the meanwhile, you need other options to manage large volumes of data well. And if you *must* use db2/oracle/informix to handle your large databases, you might as well use them for your small databases as well - since licensing cost for small databases is often cheaper than the OS, and it's easier to reuse your labor - which costs far more than the licensing costs.
> This is not a troll. If IBM wants to become an OSS company - they should open up their programs -
> especially DB2. It is a nightmare to use that in collaboration with Samba, LDAP etc.
Why can't ibm just open up part of its software catalog and still be an oss company? I think over time we'll see it continually releasing software to open source as revenue from those products diminishes.
And sure, it might be nice to release db2 as open source now - but there are already a handful of good open source options (even if none can scale as high as db2).
Regarding samba & ldap: i haven't tried db2 on fileshares or samba, but wouldn't normally want to anyway - very fast access to disk is so critical to good performance. And db2 now supports (as of about nine months ago) an open security API that you can use to connect to any authentication/athorization mechanism you want. Granted, better out of the box connectors for ldap would still be good tho, and I'm assuming we'll see that in next six months: security requirements for databases has changed quite a bit over last few years.
ken
ah, so the wise argument is to treat all subjects like they are balanced controversies?
Then how about adding to this new ID/Science teaching:
1. the earth is flat to all geography classes
2. there's no record of jesus actually existing to all bible classes
3. that the sun rotates around the earth to all astronomy classes
4. that mental illness is caused by bad humors in the brain that can be solved by releasing them through holes in the skull
etc
However, the magna carta mostly focused on protection of rights of nobles. Didn't do much for anyone else (at least not directly).
Additionally, I don't see how Athens was hardly democratic since it only allowed male citizens to vote, but the early US was democratic even though it only allowed non-indian, non-black, male, land-owning citizens to vote.
It looks like you may have rearranged the facts to fit your theory.
that was my motivation as well: after writing cobol, jcl, cics, and ims dc & db - rexx was just really fun.
;-).
Since it was the macro language for ispf - i first wrote about twenty macros - so you you could put the cursor on a filename, hit a pfkey and automatically browse the file, etc, etc.
Wrote code that would convert a cobol program to cobol II (all periods except one replaced in every paragraph by scope terminators, etc). Then systematically converted *hundreds* of programs to this newer style.
Wrote a ton of little apps with Dialog Manager, rexx, and db2 - these little apps could be written in 1-2 days, were easy to maintain and performed CRUD on about a dozen tables. Probably more productive than Ruby on Rails today
Also started writing batch apps in rexx instead of cobol. That also made it easier to eventually transition to c.
But the world of perl, java, and c wasn't fun. Really no better than cobol & jcl. Rexx was fun. But luckily now I've found python. Python is also fun. So i'm writing code again.
I suppose i could still use rexx (it's a stored procedure language in db2, and works on most platforms still). But these days python and ruby look more promising. And still fun.
> Yes, it is. I've worked on a project that allowed offline modification of a database by replicating a
> copy to user's PCs, and it originally used XML as the format for data transfer. We got a 30% speedup
> by switching to tab-separated variables with a line of metadata at the start of each chunk of the
> stream.
Yeah, but I don't think that's a reasonable example - replicating a copy of a database to a user?
You also probably could have gotten a 75% speedup by compressing it before you sent it. But that doesn't mean that plain ascii is a bad thing.
And no, I don't enjoy working with xml. I think it's bloated & overhyped. But...in sending data between organizations or systems, its self-documenting nature is a plus. Usually.
Actually, I think that using sqlite would be far better than oracle for this purpose: Oracle carries a lot of baggage along with it - and really isn't optimized to just as few resources as possible on a desktop.
SQLite on the other hand, is so small and has so little overhead that it's probably very good for system performance. And with so little code involved it's much less likely to require upgrades. And (unlike another light database) it's highly sql compliant.
> So, instead of investing on a DBMS why not buy massive quantities of ECC memory and keep all
> instances of your data in-memory for near instant access?
because a *well-tuned* relational database with a 1:4 ratio of memory to disk is almost as fast as an in-memory database - due to efficient caching
because some queries require an enormous amount of temp space. supporting them can easily double your space requirements - which have to be purchased in memory.
because if you just want to run your database in-memory you can already do that with most databasees.
because you don't have the same speed requirements for every piece of data in your database. You might have some tables used for session & user management that are often read & written to and must be very fast. But other tables that just hold seldom-accessed historical data. A modern database would allow you to keep the small & fast tables effectively in memory, and the huge 100 gb history table on disk. And you don't have to buy 100 gbytes of memory to do it.
because...it's just a bad idea.
> In other words, some indexes are great for "field > number" queries, while others are better at "field matches string_pattern" queries.
yep
> So in order to have effective heuristic or "fuzzy logic" queries, somebody will need to work
> out indexes and hashes for each fuzzy logic matching operator, or write an algorithm that
> figures out how to make those indexes and hashes. And that's ummm... rather more difficult.
> So until then, we have to catch as catch can with with analyzers and aggregators doing
> underlying exact queries and applying as many optimizations as they can.
Right - we need new sql operations as well as new optimization capabilities. Some of these might be in the works, I don't know.
But we also need new ways of modeling - right now quite a few of us are creating more flexible models that sit within a relational database - but aren't able to take full advantage of key relational database capabilities like referential integrity, unique constraints, etc. That might sound a little wacky but sometimes we need more flexibility in modeling data than you can get out of 3NF: like the ability to add new types of things, new relationships between existing types, or perhaps attributes (weight, etc) upon the relationship. And do this without schema changes.
Lastly, one of the big obstacles is that we're quickly moving into new terrain here. I remember once discussing an e-commerce catalog with a team of engineers - and they *could not* understand how a hierarchy could be insufficient for describing items in the catalog. The notion of network of items in which there are n-parents caused a lot of sweating!
> Digital did this over ten years ago. One of the things that Oracle inherited when they bought RdB
> from Digital was the cluster support. However it
seems they tool a long time to get the technology
> into their own RDMS.
I've got fond memories of an 800 gbyte billing & customer data warehouse on rdb around 1995 - giving sub-second access and running on a vms quad. That was such a slow system compared to what we've got now - but it sure handled a ton of data well.
On the other hand, I don't remember how much was due to the excellent clustering in vms - or how much was due to rdb...
> XML SUCKS. PASS IT ON.
agreed - as a human-readable way of persisting data it stinks. as a way of persisting data it stinks. But it isn't bad as an over-the-wire protocol.
> vertical partitioning...
Hmmm, don't see much vertical partitioning in data warehouses any more. Used to on oracle years ago, but can't even remember why. But I am finding that both mean range & hash partitioning work well together. The range is cheaper & easier to implement but only gives a performance benefit when you only want a subset of data. The hash comes into its own when you want to query 50% or more of the data.
> materialized views...
have to disagree: whether or not to use them depends a lot of on the update frequency, query frequency and query cost. But I've got queries running 1000x faster on summary tables (not automated materialized views, but same idea) than they would on base data. *1000x* is a lot faster.
> BUT all of this ignores the very real possibility that hardware is getting fast enough
> to not have to do any of this crap.
yeah, but that ignores the very real fact that data is increasing *faster* than moore's law. So the cheap commodity machines *are* table to blow away the machines of ten years ago in doing the tasks from ten year ago. But today it isn't unusual to find a database loading ten million rows a day. Then someone wants to crunch away at 500 million rows in a query - and wants it to come back in 5 seconds. Commodity hardware is a long way away from doing this. Unless of course, you connect them in clusters. DB2 can do that very well - but Oracle is just starting and doesn't scale so well there yet.
> it reads like a battle cry for us to move beyond the relational model. ... it's NOT going to happen...
a couple of thoughts on that -
1. relational databases are really quite wonderful for analytical apps. Need to store two years of firewall/sales/whatever data - then churn away analysis? Great - no problem. And it's easy enough to do either through hand-written sql or via a tool. There's plenty that requires third-party tools (and data stores), but even in this scenario the staging area is almost always the relational database.
2. a lot of folks who would like to eliminate relational databases fail to account for point #1. They complain of the object/relational mapping problems. Ok, that's fine - but if you put your data in container-managed persistence or an object database - you'll then have to pay someone to pull it out and put it into a separate relational database for analysis. Of course, you might be fired right about that time...
3. java in the database is mostly a pain in the butt: On the performance side you've got optimization complexity, on the managability side you've got unusual dependencies, build processes, etc, on the availability side you've got the ability to take out the entire server with some bad code (ok, sometimes).
4. two-tiered architectures with a web service driven directly out of the database is more than just a pain in the butt. It's a security disaster. A cobbled together architecture. And Jim Gray shilling for microsoft.
5. the column-store as Jim Gray described it has never really left us. And we don't really need new technology to handle it.
6. users love tables. there are quite a few users out there that truly love tables - they understand them, they look just like spreadsheets, they can query them. This is important: it's fabulous when your users can easily understand your design.
7. however - like Gray said, we now need methods of working with data that go far beyond boolean logic. We need fuzzy logic queries. And we need new types of models - allowing for multiple many-to-many relationships via relationship tables. This breaks codd's rules - but is essential for agile & fast-moving projects.
> The "next great advancement" in databases will be when I can setup 2 or more linux servers and have
>them act as a single database server. Our database server is the most expensive item in our datacenter
>because it's an N-way IBM server.
lol, IBM has supported *exactly* what you are talking about for at least five years.
That is, you can spread your db2 database across 10,100, or 1000+ linux commodity boxes (ideally blades). Or you can use windows, or aix, or solaris, or hp-ux, etc. Of course, those individual boxes can be SMPs in their own right - so a thousand 8-way aix boxes is certainly possible, if not cheap.
Oracle is now in this game as well - oracle 10g can certainly support 32, and maybe 64 individual linux boxes in a cluster. The techniques are different between the two - oracle might be better at transactional systems. db2 is definitely better at data warehousing, data mining, etc.
Of course, there are still benefits to a big smp: a single P570 16-way will cost you $250k. But each of those 16 cpus is multi-code (and far faster than intel or amd), and with its micro-partitioning - it can run at least 150 linux or aix lpars (logical partitions). These lpars can grow or shink as they need - so you aren't always over-buying for size, buying new hardly-used hardware, or having to colocate apps on a busy server - when a different os would be preferable. Not to say everyone should go this way - but there are definite benefits.
Good points.
It all comes down to how reliable he wants it, and what his time is worth. If his time isn't worth much, and he can afford to scrap some parts a few times, the build-it-yourself route can certainly save money.
On the other hand, if his time is worth money, and if a loss of this machine's availability will cost money, it's worth spending that extra cash on a full-tested vendor solution.