Why Don't Open Source Databases Use GPUs?
An anonymous reader writes "A recent paper from Georgia Tech (abstract, paper itself) describes a system than can run the complete TPC-H benchmark suite on an NVIDIA Titan card, at a 7x speedup over a commercial database running on a 32-core Amazon EC2 node, and a 68x speedup over a single core Xeon. A previous story described an MIT project that achieved similar speedups. There has been a steady trickle of work on GPU-accelerated database systems for several years, but it doesn't seem like any code has made it into Open Source databases like MonetDB, MySQL, CouchDB, etc. Why not? Many queries that I write are simpler than TPC-H, so what's holding them back?"
...because I/O is the limiting factor of database performance, not compute power?
The people with the skills have day jobs and want to enjoy time off with other projects.
The people with the skills have no jobs and want to write the code but the hardware is too expensive.
Domestic spying is now "Benign Information Gathering"
The R&D effort in the SQL field is roughly zero, so it's not surprising people aren't keeping up with the latest developments in the hardware field.
It's bad enough that the only standardized access system is ODBC, designed 25 years ago when pipes were short and thin and a WAN was the next building over. If we can't get that problem fixed, what's the hope for integrating new technologies?
so what's holding them back?
Wrong question. It is open source. If you need it, you fix it.
Because a lot of us have personal experience on how "reliable" GPU calculations are.
A few screen "artifacts" tend to be less painful than db "artifacts". Maybe things have changed. But it's not been that long since nvidia had a huge batch of video cards that were dying in all sorts of ways.
As for AMD/ATI, I suspect you'd normally use some of their crappy software when doing that GPU processing.
"Many queries that I write are simpler than TPC-H, so what's holding them back?" -- simple queries don't need acceleration.
A "SELECT * FROM users WHERE user_id = 12", or a "SELECT SUM(price) FROM products" doesn't need a GPU, it's IO bound and would benefit much more from having plenty cache memory, and a SSD. A lot of what things like MySQL get used for is forums and similar, where queries are simple. The current tendency seems to be to use the database as an object store, which results in a lack of gnarly queries that could be optimized.
I do think such features will eventually make it in, but this isn't going to benefit uses like forums much.
MapD is a GIS-centric database.
One problem is OS and toolchain support. You might get something together for Windows, OS X and Linux, but that's where the buck stops.
The next problem is that standalone compute cards are rather expensive and putting in a high power GPU has considerable power requirements. Then most server racks are full of 1u wonders not designed to get rid of heat or even hold a huge AMD or NVIDIA GPU.
Open source databases are great, but they're often pushed as a cost savings to companies. To turn around and buy extra hardware to make them faster isn't going to cut it.
Finally, there's the oddity of programming languages that some databases are written in. The popular SQL databases are in C so that's not a problem. Some of the others are in Java, erlang, or some other crazy language that may or may not have OpenCL or CUDA support.
MidnightBSD: The BSD for Everyone
Most servers do not have powerful GPUs, and that is where heavy production databases are run.
Servers turn over comparatively quickly, though (sure, every shop has ol' reliable trucking away on the 13GB SCSI drive that was pretty cool when it left the factory, doing something obscure but vital; but the population as a whole churns faster than that), and servers with nice chunks of PCIe (typically intended for your zippy network cards or fancy storage HBAs; but they are perfectly normal PCIe slots) aren't at all difficult to find. Nor has (Nvidia in particular, AMD trailing a touch) Team Graphics been shy about pushing server-suitable GPU compute parts.
It is true that servers today mostly have little to no GPU power; but if the case were made, that would change rather quickly.
It's waiting for you to get on it.
Sheesh, evil *and* a jerk. -- Jade
But for that money, more ram or faster drives makes more of a difference...
Research shows that there is good news and bad news on this approach.
The good news: Certain SQL queries can get a massive speedup by using a GPU.
The bad news: Only a small subset of queries got any benefit. They generally looked like this:
SELECT pixels FROM characters JOIN polygons JOIN textures
ON characters.character_id = polygons.character_id
WHERE characters.name = 'orc-wielding-mace' AND textures.name = 'heavy-leather-armor' AND color_theme = 'green'
ORDER BY y, x
Just a few projects into Database Performance Optimization would convince you that's not a true statement. IO/Memory/CPU are in fact largely interchangeable resources on a database. And depending on your schema you can just as easily run out of any of these resources equally.
For instance, I'm currently tuning a SQL Server database that's CPU heavy based on our load projection targets. We could tweak/increase query caching that would cause more resultsets to stay in memory. This would mean that less complex queries would be run, drastically reducing I/O and some CPU resource usage. But then drastically increasing memory usage. This is just a simple example of course to illustrate the point.
Databases run out of CPU resources all the time. And a CPU advancement would be very well received.
My guess as to why this hasn't been done is that it would require end-users to start buying/renting/leasing GPU enabled hardware for their Database infrastructure. This would be a huge change from how we do things today and this sector moves very slowly.
Also we have many fairly old but more important Database advancements which have been around for years and are still almost unusable. If you ever tried to horizontally scale most popular Open-source databases you may know what I'm talking about. Multi-master, or just scaling technology in general, is required by about every growing "IT-dependent" company at some point. But that technology ( though available ) is still "in the dark ages" as far as I'm concerned based on reliability and performance measurements.
Looks like exactly what PostgreSQL's PGStrom project is trying to acheive.
A GPU, even a GTX Titan, simply isn't 7 times faster than a modern 32-core x86 CPU in real life. Most of the gain probably comes from just general optimization that could have been done on the CPU too.
Gee, a $1000 GPU that runs 7x as fast a 1/8th of an $1500 CPU. It woud be good idea if you didn't need that CPU to run it, but just barely so. If you cheap out on the CPU and only spend ~$750 on it, assuminng there is no slowdown on the GPU because of it, then the economics break. And people wonder why GPU compute on databases isn't catching on.
Then there is the power use aka TCO/running costs to think about. And everything mentioned above. And.... This study has all he hallmarks of an Nvidia research project who's targets are financial analysts rather than potential customers. The science is fine but that is not the intent.
-Charlie
I don't do much of this sort of thing anymore but there was a time when I tried to look inside every file on my computer, telnet to every host available just to learn everything I could about this wonderful new tool that had come into my grasp. I sometimes miss those days and that guy, but now I tire easily and kind of just want everything to work.
I do feel what you are feeling.
As we get older we get more easily tired. But that doesn't mean I will rest more just because I get tired.
On the contrary - I push myself harder simply because I get more easily tired.
Only by doing so I get things to even out - maybe I am not as fast and as sharp, in both physical sense and in mental state, but as I push myself harder, I will do more in the same 24 hours allocated to me every single day.
Why should I let the young uns having all the fun ?
Why should I let my scopes be idle just because I tell myself "oh, I'm tired now, I can test that thing tomorrow" ?
If I need to find out something, I don't waste time debating with myself whether I should do that thing or not - I just go ahead and do it.
That's how I live my life, anyway.
Muchas Gracias, Señor Edward Snowden !