Why Don't Open Source Databases Use GPUs?
An anonymous reader writes "A recent paper from Georgia Tech (abstract, paper itself) describes a system that can run the complete TPC-H benchmark suite on an NVIDIA Titan card, at a 7x speedup over a commercial database running on a 32-core Amazon EC2 node, and a 68x speedup over a single core Xeon. A previous story described an MIT project that achieved similar speedups. There has been a steady trickle of work on GPU-accelerated database systems for several years, but it doesn't seem like any code has made it into Open Source databases like MonetDB, MySQL, CouchDB, etc. Why not? Many queries that I write are simpler than TPC-H, so what's holding them back?"
...because I/O is the limiting factor of database performance, not compute power?
The people with the skills have day jobs and want to enjoy time off with other projects.
The people with the skills have no jobs and want to write the code but the hardware is too expensive.
The R&D effort in the SQL field is roughly zero, so it's not surprising people aren't keeping up with the latest developments in the hardware field.
It's bad enough that the only standardized access system is ODBC, designed 25 years ago when pipes were short and thin and a WAN was the next building over. If we can't get that problem fixed, what's the hope for integrating new technologies?
so what's holding them back?
Wrong question. It is open source. If you need it, you fix it.
Because a lot of us have personal experience on how "reliable" GPU calculations are.
A few screen "artifacts" tend to be less painful than db "artifacts". Maybe things have changed. But it's not been that long since nvidia had a huge batch of video cards that were dying in all sorts of ways.
As for AMD/ATI, I suspect you'd normally use some of their crappy software when doing that GPU processing.
"Many queries that I write are simpler than TPC-H, so what's holding them back?" -- simple queries don't need acceleration.
A "SELECT * FROM users WHERE user_id = 12", or a "SELECT SUM(price) FROM products" doesn't need a GPU; it's IO bound and would benefit much more from having plenty of cache memory and an SSD. A lot of what things like MySQL get used for is forums and similar, where queries are simple. The current tendency seems to be to use the database as an object store, which results in a lack of gnarly queries that could be optimized.
I do think such features will eventually make it in, but this isn't going to benefit uses like forums much.
Most servers do not have powerful GPUs, and that is where heavy production databases are run.
Isn't everyone using them? I do 3D and the one drag about 3D is render time, I have a piece of software that uses the GPU and I am able to get a decent render in real time.
Premiere and AfterFX run much better and quite often real time renders too.
The way GPU's work seems to be the future, so I am puzzled why it isn't more prevalent, and I'm sure there is some technical reason I'm not aware of... right?
MapD is a GIS-centric database.
Many queries that you write are simpler than TPC-H. Necessity is the mother of invention.
Databases in the real world are rarely CPU bound (and when I have seen them CPU bound, it was because something was going badly wrong). Generally they are data bound, and the GPU has several times lower bandwidth than the real CPUs, so effectively it will be even slower. The computation on the GPU may be 10x faster, but feeding the data in and out is 10x slower, meaning it did not do anything for you except require a lot of extra coding complication to use it.
Benchmarks tend not to look like real world queries; often you can do something that helps a benchmark but does nothing in the real world.
Bus-installed co-processors (PCI/PCIe/VME) are only useful if you can fit the entire dataset in the co-processor's memory. When you have to do large accesses outside of that RAM because the data does not fit, the co-processor usually becomes much slower and all advantages go away. That is why it works for supercomputing: the dataset being worked on is tiny in the cases the GPU works well for.
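To put rough numbers on the bandwidth argument above, here is a minimal back-of-envelope sketch in Python. The function name and every throughput figure are made-up assumptions for illustration, not measurements from the paper or from any real hardware.

```python
def offload_speedup(data_gb, cpu_gbps, gpu_gbps, link_gbps, resident=False):
    """Estimate GPU-offload speedup over a CPU-only scan.

    data_gb   -- size of the data the query touches (hypothetical)
    cpu_gbps  -- rate at which the CPU can process that data
    gpu_gbps  -- rate at which the GPU can process it once it is on the card
    link_gbps -- host-to-device bus bandwidth
    resident  -- True if the data already sits in GPU memory (no transfer)
    """
    cpu_time = data_gb / cpu_gbps
    transfer_time = 0.0 if resident else data_gb / link_gbps
    gpu_time = transfer_time + data_gb / gpu_gbps
    return cpu_time / gpu_time


# Assumed figures: the GPU computes 10x faster than the CPU,
# but the bus feeding it is slower than the CPU's own processing rate.
over_the_bus = offload_speedup(10, cpu_gbps=10, gpu_gbps=100, link_gbps=8)
in_gpu_memory = offload_speedup(10, cpu_gbps=10, gpu_gbps=100, link_gbps=8,
                                resident=True)
```

With these assumed numbers the offloaded query ends up slower than the CPU (a speedup of about 0.74), while the same query against data already resident on the card gets the full 10x. The real ratio depends entirely on the hardware and on how much of the transfer can be overlapped with compute, which this sketch ignores.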
It's waiting for you to get on it.
What's holding them back? I'd have thought it was obvious!
The big issue with GPGPU for DB work is that you have to have the DB entirely in memory or your performance will suck (even SSDs aren't that fast). To get a big database to work in such a scenario, you have to split it into many smaller pieces, but that makes working with these sorts of things expensive even with an open source DB. The paper even says this. That makes this sort of work only really interesting for people with significant budgets, and they can easily use a commercial DB; the additional cost isn't prohibitive in that scenario.
Without general hardware availability, there just aren't that many people pushing to have the feature; OSS thrives on having many people want it and many developers able to work on it.
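As a rough illustration of the "split it into many smaller pieces" point above, the sketch below counts how many cards an all-in-GPU-memory database would need and shows one simple way to split rows across them. The 6 GB per card figure, the function names, and the modulo partitioning scheme are assumptions for illustration only, not anything taken from the paper.

```python
import math


def cards_needed(db_gb, card_mem_gb):
    """Number of cards required to hold the whole database in GPU memory."""
    return math.ceil(db_gb / card_mem_gb)


def partition_by_key(keys, n_parts):
    """Assign integer keys to n_parts pieces by key modulo n_parts."""
    parts = [[] for _ in range(n_parts)]
    for key in keys:
        parts[key % n_parts].append(key)
    return parts


# Assumed sizes: a 1000 GB database and 6 GB of memory per card.
n_cards = cards_needed(1000, 6)

# Toy example: ten row keys spread over three pieces.
pieces = partition_by_key(range(10), 3)
```

Under those assumptions a 1 TB database needs 167 cards before a single query runs, which is the budget problem the comment describes. A real system would also have to handle queries that join across pieces.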
All of these DBMSs are actually toys being sold for more than they are capable of. So developers there have to try to catch up to PostgreSQL before it becomes (even) easier to use and eats their lunch.
Meanwhile, the issues meriting scarce development and, mainly, review time at PostgreSQL are more interesting than accelerating a few workloads on hardware which is not yet in the servers out there. Things like making PostgreSQL even easier to install, set up and manage, even more ISO SQL compliant, even more capable, even better than NoSQL at NoSQL loads.
Now, if you can show your GPU-aware PostgreSQL extension or modification, and show it is generally useful enough to merit review time for the next release, why not?
Leandro Guimarães Faria Corcete DUTRA
DA, DBA, SysAdmin, Data Modeller
GNU Project, Debian GNU/Lin
Research shows that there is good news and bad news on this approach.
The good news: Certain SQL queries can get a massive speedup by using a GPU.
The bad news: Only a small subset of queries got any benefit. They generally looked like this:
SELECT pixels FROM characters JOIN polygons JOIN textures
ON characters.character_id = polygons.character_id
WHERE characters.name = 'orc-wielding-mace' AND textures.name = 'heavy-leather-armor' AND color_theme = 'green'
ORDER BY y, x
Just a few projects into Database Performance Optimization would convince you that's not a true statement. IO/Memory/CPU are in fact largely interchangeable resources on a database. And depending on your schema you can just as easily run out of any of these resources equally.
For instance, I'm currently tuning a SQL Server database that's CPU heavy based on our load projection targets. We could tweak/increase query caching that would cause more resultsets to stay in memory. This would mean that less complex queries would be run, drastically reducing I/O and some CPU resource usage. But then drastically increasing memory usage. This is just a simple example of course to illustrate the point.
Databases run out of CPU resources all the time. And a CPU advancement would be very well received.
My guess as to why this hasn't been done is that it would require end-users to start buying/renting/leasing GPU enabled hardware for their Database infrastructure. This would be a huge change from how we do things today and this sector moves very slowly.
Also we have many fairly old but more important Database advancements which have been around for years and are still almost unusable. If you ever tried to horizontally scale most popular Open-source databases you may know what I'm talking about. Multi-master, or just scaling technology in general, is required by about every growing "IT-dependent" company at some point. But that technology ( though available ) is still "in the dark ages" as far as I'm concerned based on reliability and performance measurements.
I'm responsible for a large university learning management system (Sakai). The database is completely CPU limited. I assume that's because the working set of data fits in memory. I would think lots of university and enterprise applications would be similar. Another data point is the experiments done on a no-SQL interface to InnoDB. That shows very large speedups. Surely some of this is due to the CPU overhead in processing SQL.
Guess nobody ever heard of the pgstrom
It is really that simple. The companies that would gain the most from this do not (as a general statement) equip their servers with GPU's. Even if the DB's started supporting it first, giving people a reason to add GPU's into servers, the processing isn't the major bottleneck for DB servers. So there isn't a tremendous value in either adding them into the servers (so they are useful), or in adding code to support the GPU (when there aren't many servers that have them).
It's a chicken and egg and usefulness problem. ;)
Besides datasets not fitting into GPGPU memory, and I/O bottlenecks, I'm still seeing plenty of badly written SQL.
A current contract has plenty of SQL work (not for me though), and the bulk of their time is spent cleaning up data exceptions, fixing badly written report queries, and moving oft-used or large-dataset queries to stored procedures. GPGPUs will hide some of the rot, but if the SQL is written better in the first place, we're able to use parallelism and make better use of existing commodity hardware in clients' virtualised environments.
I'm not dissing the prospect of GPU acceleration, just the priority TFA gives to it.
You mean you don't understand what a compute cluster IS.
If you need more computing power, you're doing it wrong.
Looks like exactly what PostgreSQL's PGStrom project is trying to achieve.
In PostgreSQL we have a project called PGStrom http://wiki.postgresql.org/wiki/PGStrom
...maybe it has something to do with the fact that it's called a Graphics Processing Unit? Why the fuck are we using them as CPUs?
it doesn't seem like any code has made it into Open Source databases like MonetDB, MySQL, CouchDB, etc.
Lemme guess, MySQL fanatic?
You can already go download:
https://wiki.postgresql.org/wiki/PGStrom
if it fits your problem domain and PostGIS has some hackers adding GPU support:
http://data-informed.com/fast-database-emerges-from-mit-class-gpus-and-students-invention/
Why not the others? Perhaps because PostgreSQL makes developing extensions easier - it's got the largest extension ecosystem, so I'm just presuming there. If it turns out well in Pg-land, the others will naturally adopt it.
So the answer to the story title is "they do." The next question would be, "why isn't it widely deployed", and the answer would be, "it's not done yet." Yadda, yadda, yadda, patches welcome. If the whole summary is just a way to try to turn "hey this is neat" (it is) into an ill-founded complaint story, then write a better story next time. It's neat stuff, no need to whine.
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Indeed. We have a large WordPress based site and it is bound by database CPU despite the fairly powerful CPU it uses. It should scale to many cores, so I'm thinking of trying a pair of the 8 core AMD processors. Intel is faster PER CORE, but an AMD rig could have 16 cores.
I think the problem is that copying data from the GPU to the CPU is too slow to work on all queries and would be hard to model in a query optimizer. When AMD's Kaveri processors are released in a few weeks, I think several DBs will add patches to improve DB performance. The 7x improvements are for queries where the entire DB is stored in GPU memory.
A GPU, even a GTX Titan, simply isn't 7 times faster than a modern 32-core x86 CPU in real life. Most of the gain probably comes from just general optimization that could have been done on the CPU too.
Don't use open source db. Use SQL Server for security and speed.
Putting MonetDB, CouchDB and MySQL in a single line already shows how serious the question is. First of all, TPC-H is a decision support (ad-hoc, analytic) workload, and putting all data into memory needs comparison with in-memory ad-hoc platforms like MemSQL, where the difference might not be as pronounced. Also, it is easy to have hundreds of gigs of RAM for CPU-driven systems, whereas GPU memory is still tiny. Yes, doing complex window functions on streaming data may seem fine, but anything requiring larger arenas of random data access would fall off the cliff.
For all the people talking about speeding up your WordPress, you need to look at OLTP or even more read-only small data benchmarks (TPC-C is already too complex). A database that is efficient at transactional workloads has too much overhead for analytical processing. Then there are all the datacenter considerations: getting rid of heat once you have thousands of GPUs around is no longer a trivial task, and may involve oil immersion or water cooling.
Yes, sorting a dataset can be faster, but assembling it from all the I/O devices and memory is the major expense. That's why multithreading works: there are lots of waits for memory already, and making them even longer would be difficult. The I/O we talk about is not just reading from disk or disks or SSDs, it is also about getting into the chip, and bandwidth there is still constrained. And yes, the CPU can get quite busy in a database server, for all the network, storage, compression, page and row mangling code has to run somewhere; sizing hardware for large scale databases is a tough balancing act. But very little of that can work on a GPU in the online world. Nice research though :)
From my experience: I don't think database programs use many mathematical formulas or computations that would benefit from a graphics processing unit or even a floating-point unit.
A spreadsheet, on the other hand, might be able to take advantage of a GPU.
I should move my personal databases (OpenOffice.org base, Oracle Express) to an SSD drive. I can't afford an SSD yet. Progsql MySQL won't run properly on my Windows XP box for some odd reason.
Gee, a $1000 GPU that runs 7x as fast as 1/8th of a $1500 CPU. It would be a good idea if you didn't need that CPU to run it, but just barely so. If you cheap out on the CPU and only spend ~$750 on it, assuming there is no slowdown on the GPU because of it, then the economics break. And people wonder why GPU compute on databases isn't catching on.
Then there is the power use aka TCO/running costs to think about. And everything mentioned above. And.... This study has all the hallmarks of an Nvidia research project whose targets are financial analysts rather than potential customers. The science is fine but that is not the intent.
-Charlie
GPU instruction sets change all the time. Intel needs to make money from compatibility with an obsolete 70's stopgap embedded architecture that was already outdated when they designed it. If it can't boot DOS 1.0, Intel won't make it.
Too recent for OS to respond to this new technological development.
This is clearly the question that corporate co-authors Nvidia and Logicblox hoped you would ask.
The paper seems to represent more of an evolutionary rather than revolutionary approach, but suffers from some unfortunate hand-waving, particularly in their attempt to negate the real cost of memory->PCIe transfers (to their credit, at least they call out that latency), their unwillingness to perform comparisons on like-to-like base hardware, and their rather odd choice of front-end environment. Coupled with their odd price-performance metric, I suspect that Nvidia marketing got way up in Gatech's business on this. My suspicion is that there are real use cases where SIMD processing is going to substantially speed up relational database performance on easily partitioned datasets, but as more vectorization effort is placed on the main CPU, the advantages of kicking off to a coprocessor will eventually go the way of the 387.
Running the same test on one thread of my desktop CPU with MySQL and InnoDB completes in 75 secs, while the article posts a result of 359 secs, a difference of roughly 5 times. So there you have it: it's simply not worth the effort or cost to optimise for GPU architecture; it's easier to just optimise for CPU architectures alone. Plus it works on 100Gb databases too.
Open sores developers are followers, not leaders.
Typical responses above:
(a) DB operations aren't CPU intensive
(b) Servers don't come with dedicated graphics cards of any note
(c) Loading each server with an AMD or Nvidia card would increase power usage
So in summary, certain operations may benefit using GPUs but there's not a cost-effective solution to warrant such experimentation.
I'd be surprised if ARM hasn't sponsored cloud research into OpenCL on the Mali GPUs.
I don't do much of this sort of thing anymore but there was a time when I tried to look inside every file on my computer, telnet to every host available just to learn everything I could about this wonderful new tool that had come into my grasp. I sometimes miss those days and that guy, but now I tire easily and kind of just want everything to work.
I do feel what you are feeling.
As we get older we get more easily tired. But that doesn't mean I will rest more just because I get tired.
On the contrary - I push myself harder simply because I get more easily tired.
Only by doing so I get things to even out - maybe I am not as fast and as sharp, in both physical sense and in mental state, but as I push myself harder, I will do more in the same 24 hours allocated to me every single day.
Why should I let the young uns having all the fun ?
Why should I let my scopes be idle just because I tell myself "oh, I'm tired now, I can test that thing tomorrow" ?
If I need to find out something, I don't waste time debating with myself whether I should do that thing or not - I just go ahead and do it.
That's how I live my life, anyway.
Well do ya punk ? :)
Just wanted to mention AlenkaDB :
https://github.com/antonmks/Alenka
It is open source and it runs pretty fast on datasets that do not fit into gpu memory - check the 1TB TPC-H benchmarks.
Database users are pretty conservative, so I haven't seen many people using it so far, aside from some folks at U.S. DoD.
http://wiki.postgresql.org/wiki/PGStrom
The right question would be : when will a major Open Source DB be released with a GPU accelerator...
The issue is that it is quite hard to do, and it's harder to do it in a way that is sufficiently generic and elegant to deserve public dissemination.
This is the reason. We need a standard, vendor-independent GPU access API. If the DBs need a GPU to speed up, please use Parallella or support OpenGraphics.
PG-Strom is an FDW (foreign data wrapper) module for the PostgreSQL database. It was designed to utilize GPU devices to accelerate sequential scans on massive numbers of records with complex qualifiers. Its basic concept is that the CPU and GPU should each focus on the workload that plays to their strengths, and run concurrently. The CPU has much more flexibility, so it has the advantage on complex stuff such as disk I/O. The GPU has much more parallelism for numerical calculation, so it has the advantage on massive but simple stuff such as checking qualifiers for each row.
http://wiki.postgresql.org/wiki/PGStrom
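The division of labour described above can be sketched in plain Python. This is not PG-Strom's code or API: the function names are hypothetical, the "GPU side" is simulated by an ordinary list comprehension, and the two stages run one after another here rather than concurrently.

```python
from itertools import islice


def read_chunks(rows, chunk_size):
    """CPU-side role: pull rows from storage in fixed-size chunks."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def eval_qualifier(chunk, qualifier):
    """GPU-side role (simulated): apply the same simple test to every row.

    Each row is checked independently of the others, which is what makes
    this step a candidate for massively parallel hardware.
    """
    return [qualifier(row) for row in chunk]


def scan(rows, qualifier, chunk_size=4):
    """Sequential scan: keep the rows whose qualifier evaluates to True."""
    matches = []
    for chunk in read_chunks(rows, chunk_size):
        mask = eval_qualifier(chunk, qualifier)
        matches.extend(row for row, keep in zip(chunk, mask) if keep)
    return matches


# Toy table and a qualifier with two conditions, both made up for illustration.
table = [{"id": i, "price": i * 10} for i in range(10)]
result = scan(table, lambda r: r["price"] > 50 and r["id"] % 2 == 0)
```

Here `result` holds the rows with ids 6 and 8. In a real system the chunk would be copied to device memory and the mask computed by a kernel, so the payoff depends on the qualifier being expensive enough to outweigh that copy.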
1 - GPU memory is limited, so you better have some nice compression/great columnar based databases. What is REALLY cool, is a distributed or clustered GPU system. That can have value.
2 - Disk I/O is the real limit
3 - Cost and complication - Relying on CUDA/OpenCL. The true value of GPU's is parallel processing when processing data, so honestly performing analytics would be best served with GPU
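On point 1, one common reason columnar layouts compress well is that a single column often holds long runs of repeated values. A minimal run-length encoding sketch in Python follows; the column contents and function names are made up for illustration, and real columnar engines use a mix of more sophisticated encodings.

```python
def rle_encode(column):
    """Collapse consecutive repeated values into [value, run_length] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical sorted "country" column from a columnar table.
country = ["DE", "DE", "DE", "FR", "FR", "US"]
encoded = rle_encode(country)
```

The six-value column shrinks to three runs and decodes back to the original. How much this helps in limited GPU memory depends on how sorted and how repetitive the real data is.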
A simple google of "GPU Database engine" gives this, an open source GPU DB:
https://github.com/antonmks/Alenka