Why Don't Open Source Databases Use GPUs?
An anonymous reader writes "A recent paper from Georgia Tech (abstract, paper itself) describes a system that can run the complete TPC-H benchmark suite on an NVIDIA Titan card, at a 7x speedup over a commercial database running on a 32-core Amazon EC2 node, and a 68x speedup over a single-core Xeon. A previous story described an MIT project that achieved similar speedups. There has been a steady trickle of work on GPU-accelerated database systems for several years, but it doesn't seem like any code has made it into Open Source databases like MonetDB, MySQL, CouchDB, etc. Why not? Many queries that I write are simpler than TPC-H, so what's holding them back?"
...because I/O is the limiting factor of database performance, not compute power?
The people with the skills have day jobs and want to enjoy time off with other projects.
The people with the skills have no jobs and want to write the code but the hardware is too expensive.
Domestic spying is now "Benign Information Gathering"
The R&D effort in the SQL field is roughly zero, so it's not surprising people aren't keeping up with the latest developments in the hardware field.
It's bad enough that the only standardized access system is ODBC, designed 25 years ago when pipes were short and thin and a WAN was the next building over. If we can't get that problem fixed, what's the hope for integrating new technologies?
so what's holding them back?
Wrong question. It is open source. If you need it, you fix it.
Because a lot of us have personal experience on how "reliable" GPU calculations are.
A few screen "artifacts" tend to be less painful than db "artifacts". Maybe things have changed. But it's not been that long since nvidia had a huge batch of video cards that were dying in all sorts of ways.
As for AMD/ATI, I suspect you'd normally use some of their crappy software when doing that GPU processing.
"Many queries that I write are simpler than TPC-H, so what's holding them back?" -- simple queries don't need acceleration.
A "SELECT * FROM users WHERE user_id = 12" or a "SELECT SUM(price) FROM products" doesn't need a GPU; it's IO bound and would benefit much more from plenty of cache memory and an SSD. A lot of what things like MySQL get used for is forums and similar, where queries are simple. The current tendency seems to be to use the database as an object store, which results in a lack of gnarly queries that could be optimized.
I do think such features will eventually make it in, but this isn't going to benefit uses like forums much.
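To make the "simple queries" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for MySQL and friends, and the table layouts are assumptions invented for illustration. It asks the engine for its plan: the lookup by user_id is a single primary-key search, and the SUM is one linear pass over the table, so neither leaves much arithmetic for a GPU to speed up.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The description is the last column of each plan row.
    return [row[-1] for row in rows]

# Hypothetical schema, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO products (price) VALUES (?)", [(1.5,), (2.5,)])

# Point lookup: answered by a primary-key search, not a full scan.
lookup_plan = query_plan(conn, "SELECT * FROM users WHERE user_id = 12")

# Aggregate: one sequential scan, dominated by reading the rows.
sum_plan = query_plan(conn, "SELECT SUM(price) FROM products")
total = conn.execute("SELECT SUM(price) FROM products").fetchone()[0]

print(lookup_plan)
print(sum_plan, total)
```

The exact wording of the plan text differs between SQLite versions, so treat the printed strings as indicative rather than something to parse.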
Isn't everyone using them? I do 3D and the one drag about 3D is render time, I have a piece of software that uses the GPU and I am able to get a decent render in real time.
Premiere and AfterFX run much better and quite often real time renders too.
The way GPUs work seems to be the future, so I am puzzled why it isn't more prevalent, and I'm sure there is some technical reason I'm not aware of... right?
"If any question why we died, Tell them because our fathers lied."
MapD is a GIS-centric database.
Many queries that you write are simpler than TPC-H. Necessity is the mother of invention.
Most servers do not have powerful GPUs, and that is where heavy production databases are run.
Servers turn over comparatively quickly, though (sure, every shop has ol' reliable trucking away on the 13GB SCSI drive that was pretty cool when it left the factory, doing something obscure but vital; but the population as a whole churns faster than that), and servers with nice chunks of PCIe (typically intended for your zippy network cards or fancy storage HBAs; but they are perfectly normal PCIe slots) aren't at all difficult to find. Nor has (Nvidia in particular, AMD trailing a touch) Team Graphics been shy about pushing server-suitable GPU compute parts.
It is true that servers today mostly have little to no GPU power; but if the case were made, that would change rather quickly.
It's waiting for you to get on it.
Sheesh, evil *and* a jerk. -- Jade
So while the computation on the GPU may be 10x faster, feeding the data in and out is 10x slower, meaning it did not do anything for you except require a lot of extra coding complication to use it. ...
Benchmarks tend not to look like real-world queries; often you can do something that helps a benchmark but does nothing in the real world.
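The transfer-cost argument can be put into a back-of-the-envelope model. This is a rough sketch, not a measurement: the function name, the 10x kernel speedup, and the 8 GB/s bus figure are all assumptions picked for illustration, and it ignores any overlap of copying with computing.

```python
def effective_speedup(cpu_seconds, gpu_compute_speedup, bytes_moved, bus_bytes_per_sec):
    """Overall speedup of offloading a query once transfer time is counted.

    cpu_seconds: time the query takes on the CPU alone.
    gpu_compute_speedup: how much faster the GPU is at the computation itself.
    bytes_moved: data copied to and from the GPU for this query.
    bus_bytes_per_sec: assumed host-to-device bandwidth.
    """
    gpu_compute = cpu_seconds / gpu_compute_speedup
    transfer = bytes_moved / bus_bytes_per_sec
    return cpu_seconds / (gpu_compute + transfer)

# Hypothetical numbers: a 1-second CPU query with a 10x faster GPU kernel.
heavy_transfer = effective_speedup(1.0, 10, 8e9, 8e9)    # moves 8 GB over the bus
light_transfer = effective_speedup(1.0, 10, 0.8e9, 8e9)  # moves 0.8 GB over the bus

print(heavy_transfer, light_transfer)
```

With these made-up inputs the heavy-transfer case ends up slightly slower than the CPU alone, while the light-transfer case keeps about half of the kernel speedup, so the outcome depends on how much data has to cross the bus per unit of computation.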
But what if the benchmark is larger than the memory size of the GPU? I don't know the actual size, but I guess they use at least realistic amounts of data (larger than the memory of the GPU card), so that would prove your theory wrong!
By the way, there's more to databases than just queries. Skimming through the abstract, I see that they only address speeding up the queries. The commit phase of a database is also interesting, but they don't seem to address it.
If Pandora's box is destined to be opened, *I* want to be the one to open it.
What's holding them back? I'd have thought it was obvious!
The big issue with GPGPU for DB work is that you have to have the DB entirely in memory or your performance will suck (even SSDs aren't that fast). To get a big database to work in such a scenario, you have to split it into many smaller pieces, but that makes working with these sorts of things expensive even with an open source DB. The paper even says this. That makes this sort of work only really interesting for people with significant budgets, and they can easily use a commercial DB; the additional cost isn't prohibitive in that scenario.
Without general hardware availability, there just aren't that many people pushing to have the feature; OSS thrives on having many people want it and many developers able to work on it.
"Little does he know, but there is no 'I' in 'Idiot'!"
But for that money, more ram or faster drives makes more of a difference...
All of these DBMSs are actually toys being sold for more than they are capable of. So developers there have to try to catch up to PostgreSQL before it becomes (even) easier to use and eats their lunch.
Meanwhile, the issues meriting scarce development and, mainly, review time at PostgreSQL are more interesting than accelerating a few workloads in hardware which is not yet in the servers out there. Things like making PostgreSQL even easier to install, set up and manage, even more ISO SQL compliant, even more capable, even better than NoSQL at NoSQL loads.
Now, if you can show your GPU-aware PostgreSQL extension or modification, and show it is generally useful enough to merit review time for the next release, why not?
Leandro Guimarães Faria Corcete DUTRA
DA, DBA, SysAdmin, Data Modeller
GNU Project, Debian GNU/Lin
Research shows that there is good news and bad news on this approach.
The good news: Certain SQL queries can get a massive speedup by using a GPU.
The bad news: Only a small subset of queries got any benefit. They generally looked like this:
SELECT pixels FROM characters JOIN polygons JOIN textures
ON characters.character_id = polygons.character_id
WHERE characters.name = 'orc-wielding-mace' AND textures.name = 'heavy-leather-armor' AND color_theme = 'green'
ORDER BY y, x
Just a few projects into Database Performance Optimization would convince you that's not a true statement. IO/Memory/CPU are in fact largely interchangeable resources on a database. And depending on your schema you can just as easily run out of any of these resources equally.
For instance, I'm currently tuning a SQL Server database that's CPU heavy based on our load projection targets. We could tweak/increase query caching that would cause more resultsets to stay in memory. This would mean that less complex queries would be run, drastically reducing I/O and some CPU resource usage. But then drastically increasing memory usage. This is just a simple example of course to illustrate the point.
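A toy version of that memory-for-CPU trade, as a sketch only: functools.lru_cache stands in for a database result cache, and run_expensive_query is a made-up placeholder for a CPU-heavy query, not anything SQL Server exposes.

```python
from functools import lru_cache

# Counts how often the expensive path actually runs.
calls = {"count": 0}

def run_expensive_query(customer_id):
    """Hypothetical stand-in for a CPU-heavy query."""
    calls["count"] += 1
    return sum(i * customer_id for i in range(1000))

@lru_cache(maxsize=128)  # memory budget: at most 128 cached result sets
def cached_query(customer_id):
    return run_expensive_query(customer_id)

first = cached_query(7)   # computed: costs CPU
second = cached_query(7)  # served from memory: costs RAM instead

print(first, second, calls["count"], cached_query.cache_info())
```

Raising maxsize keeps more result sets resident and avoids more recomputation, which is the same dial described above: less CPU and I/O at the price of more memory.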
Databases run out of CPU resources all the time. And a CPU advancement would be very well received.
My guess as to why this hasn't been done is that it would require end-users to start buying/renting/leasing GPU enabled hardware for their Database infrastructure. This would be a huge change from how we do things today and this sector moves very slowly.
Also we have many fairly old but more important Database advancements which have been around for years and are still almost unusable. If you ever tried to horizontally scale most popular Open-source databases you may know what I'm talking about. Multi-master, or just scaling technology in general, is required by about every growing "IT-dependent" company at some point. But that technology ( though available ) is still "in the dark ages" as far as I'm concerned based on reliability and performance measurements.
I'm responsible for a large university learning management system (Sakai). The database is completely CPU limited. I assume that's because the working set of data fits in memory. I would think lots of university and enterprise applications would be similar. Another data point is the experiments done on a no-SQL interface to InnoDB. That shows very large speedups. Surely some of this is due to the CPU overhead in processing SQL.
Guess nobody ever heard of the pgstrom
IT staff needs GPUs to play Crysis. Your DBMS gets a lower priority.
Have gnu, will travel.
Besides datasets not fitting into GPGPU memory, and I/O bottlenecks, I'm still seeing plenty of badly written SQL.
A current contract has plenty of SQL work (not for me though), and the bulk of their time is spent cleaning up data exceptions and badly written report queries, and moving oft-used or large-dataset queries to stored procedures. GPGPUs will hide some of the rot, but if the SQL were written better in the first place, we'd be able to use parallelism and make better use of existing commodity hardware in clients' virtualised environments.
I'm not dissing the prospect of GPU acceleration, just the priority TFA gives to it.
"We know what happens to people who stay in the middle of the road. They get run over." - Aneurin Bevan
Looks like exactly what PostgreSQL's PGStrom project is trying to achieve.
it doesn't seem like any code has made it into Open Source databases like MonetDB, MySQL, CouchDB, etc.
Lemme guess, MySQL fanatic?
You can already go download:
https://wiki.postgresql.org/wiki/PGStrom
if it fits your problem domain and PostGIS has some hackers adding GPU support:
http://data-informed.com/fast-database-emerges-from-mit-class-gpus-and-students-invention/
Why not the others? Perhaps because PostgreSQL makes developing extensions easier - it's got the largest extension ecosystem, so I'm just presuming there. If it turns out well in Pg-land, the others will naturally adopt it.
So the answer to the story title is "they do." The next question would be, "why isn't it widely deployed", and the answer would be, "it's not done yet." Yadda, yadda, yadda, patches welcome. If the whole summary is just a way to try to turn "hey this is neat" (it is) into an ill-founded complaint story, then write a better story next time. It's neat stuff, no need to whine.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Indeed. We have a large WordPress based site and it is bound by database CPU despite the fairly powerful CPU it uses. It should scale to many cores, so I'm thinking of trying a pair of the 8 core AMD processors. Intel is faster PER CORE, but an AMD rig could have 16 cores.
Oh, with the exception of dedicated GPU compute setups, definitely, that's why the servers in use are configured as they are. My point was not that servers should have more GPU power; but that (if a change in software made doing so a good idea) the existing hardware wouldn't provide too much 'inertia' to stop or slow adoption.
There doesn't seem to be too much interest, on the whole; but if one were interested they could change the composition of their servers in fairly short order; and a broader shift could happen comparatively quickly (again, given suitable software).
A GPU, even a GTX Titan, simply isn't 7 times faster than a modern 32-core x86 CPU in real life. Most of the gain probably comes from just general optimization that could have been done on the CPU too.
Putting MonetDB, CouchDB and MySQL in a single line already shows the seriousness of the question. First of all, TPC-H is a decision support (ad-hoc, analytic) workload, and putting all data into memory needs comparison with in-memory ad-hoc platforms like MemSQL, where the difference might not be as pronounced. Also, it is easy to have hundreds of gigs of RAM for CPU-driven systems, whereas GPU memory is still tiny. Yes, doing complex window functions on streaming data may seem fine, but anything requiring larger arenas of random data access would fall off the cliff. For all the people talking about speeding up your WordPress, you need to look at OLTP or even more read-only small data benchmarks (TPC-C is already too complex). A database that is efficient at transactional workloads has too much overhead for analytical processing. Then there are all the datacenter considerations - getting rid of heat once you have thousands of GPUs around is no longer a trivial task, and may involve oil immersion or water cooling. Yes, sorting a dataset can be faster, but assembling it from all the I/O devices and memory is the major expense. That's why multithreading works - there are lots of waits for memory already, and making them even longer would be difficult. The I/O we talk about is not just reading from disk or disks or SSDs, it is also about getting into the chip, and bandwidth there is still constrained. And yes, the CPU can get quite busy in a database server, for all the network, storage, compression, page and row mangling code has to run somewhere - sizing hardware for large scale databases is a tough balancing act. But very little of that can work on a GPU in the online world. Nice research though :)
We use them as CPUs because we don't suffer from that cognitive bias known as functional fixedness.
Gee, a $1000 GPU that runs 7x as fast as 1/8th of a $1500 CPU. It would be a good idea if you didn't need that CPU to run it, but just barely so. If you cheap out on the CPU and only spend ~$750 on it, assuming there is no slowdown on the GPU because of it, then the economics break. And people wonder why GPU compute on databases isn't catching on.
Then there is the power use aka TCO/running costs to think about. And everything mentioned above. And.... This study has all the hallmarks of an Nvidia research project whose targets are financial analysts rather than potential customers. The science is fine but that is not the intent.
-Charlie
This is clearly the question that corporate co-authors Nvidia and Logicblox hoped you would ask.
The paper seems to represent more of an evolutionary rather than revolutionary approach, but suffers from some unfortunate hand-waving, particularly in their attempt to negate the real cost of memory->PCIe transfers (to their credit, at least they call out that latency), their unwillingness to perform comparisons on like-to-like base hardware, and their rather odd choice of front-end environment. Coupled with their odd price-performance metric, I suspect that Nvidia marketing got way up in Gatech's business on this. My suspicion is that there are real use cases where SIMD processing is going to substantially speed up relational database performance on easily partitioned datasets, but as more vectorization effort is placed on the main CPU, the advantages of kicking off to a coprocessor will eventually go the way of the 387.
Don't use open source db. Use SQL Server for security and speed.
I agree, simply because I'm paid 50 cents to post this.
How much were you paid?
Typical responses above:
(a) DB operations aren't CPU intensive
(b) Servers don't come with dedicated graphics cards of any note
(c) Loading each server with an AMD or Nvidia card would increase power usage
So in summary, certain operations may benefit using GPUs but there's not a cost-effective solution to warrant such experimentation.
I'd be surprised if ARM hasn't sponsored cloud research into OpenCL on the Mali GPUs.
I don't do much of this sort of thing anymore but there was a time when I tried to look inside every file on my computer, telnet to every host available just to learn everything I could about this wonderful new tool that had come into my grasp. I sometimes miss those days and that guy, but now I tire easily and kind of just want everything to work.
I do feel what you are feeling.
As we get older we get more easily tired. But that doesn't mean I will rest more just because I get tired.
On the contrary - I push myself harder simply because I get more easily tired.
Only by doing so I get things to even out - maybe I am not as fast and as sharp, in both physical sense and in mental state, but as I push myself harder, I will do more in the same 24 hours allocated to me every single day.
Why should I let the young uns having all the fun ?
Why should I let my scopes be idle just because I tell myself "oh, I'm tired now, I can test that thing tomorrow" ?
If I need to find out something, I don't waste time debating with myself whether I should do that thing or not - I just go ahead and do it.
That's how I live my life, anyway.
Muchas Gracias, Señor Edward Snowden !
Just wanted to mention AlenkaDB :
https://github.com/antonmks/Alenka
It is open source and it runs pretty fast on datasets that do not fit into gpu memory - check the 1TB TPC-H benchmarks.
Database users are pretty conservative, so I haven't seen many people using it so far, aside from some folks at U.S. DoD.
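On datasets that do not fit into GPU memory: one generic way to cope is to stream the data through the device in fixed-size chunks and combine the partial results. The sketch below shows that pattern in plain Python. It is not taken from Alenka's code, the function name is made up, and the slice and the built-in sum merely stand in for a host-to-device copy and a GPU kernel.

```python
def chunked_sum(values, chunk_size):
    """Aggregate a column that is larger than the (pretend) device memory.

    Each slice stands in for one host-to-device copy, and the inner sum
    stands in for the kernel that would run on that chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]  # at most chunk_size items resident
        total += sum(chunk)                       # partial result for this chunk
    return total

# Hypothetical column of 10 values processed 3 at a time.
result = chunked_sum(list(range(10)), 3)
print(result)
```

This works directly for aggregates whose partial results combine cheaply, such as SUM or COUNT; joins and sorts need more elaborate partitioning.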
http://wiki.postgresql.org/wiki/PGStrom
The right question would be : when will a major Open Source DB be released with a GPU accelerator...
The issue is that it is quite hard to do, and it's harder to do it in a way that is sufficiently generic and elegant to deserve public dissemination.
This is the reason. We need a standard, vendor-independent GPU access API. If the DBs need a GPU to speed up, please use Parallela or support OpenGraphics.
It IS consumer shit, but versions where you can enable ECC and the clock/voltage profiles are toned down. BTW the ECC is not available on midrange FirePro and Quadro, or on older generations.
But what if the benchmark is larger than the memory size of the GPU? I don't know the actual size, but I guess they use at least realistic amounts of data (larger than the memory of the GPU card), so that would prove your theory wrong!
They didn't (1GB), he's right.
(I stand corrected then. Thanks for looking it up.)
If Pandora's box is destined to be opened, *I* want to be the one to open it.