tmostak · Slashdot Mirror

Re:Large datasets are mostly IO limited on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 18:22 · Score: 4, Informative

Hi - MapD creator here. Agreed, GPUs aren't going to be of much use if you have petabytes of data and are I/O bound. But what I think unfortunately gets missed in the rush to indiscriminately throw everything into the "big data bucket" is that a lot of people do have medium-sized (say 5GB-500GB) datasets that they would like to query, visualize and analyze in an iterative, real-time fashion. That's something existing solutions won't allow you to do (even big clusters often incur enough latency to make real-time analysis difficult).

And then you have super-linear algorithms like graph processing, spatial joins, neural nets, clustering, and rendering blurred heatmaps, which do really well on the GPU: there the formerly memory-bound speedup of 70X turns into 400-500X. Particularly since databases are expected to do more and more viz and machine learning, I don't think these are edge cases.

Finally, although GPU memory will always be more expensive (but faster) than CPU memory, MapD already can run on a 16-card 128GB GPU ram server, and I'm working on a multi-node distributed implementation where you could string many of these together. So having a terabyte of GPU RAM is not out of the question, which, given the column-store architecture of the db can be used more efficiently by caching only the necessary columns in memory. Of course it will cost more, but for some applications the performance benefits may be worth it.
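To make the column-store point concrete, here is a back-of-the-envelope sketch. The schema, the per-value byte widths, and the row count are all made up for illustration and are not MapD's actual layout; the only point is that a query touching three narrow columns needs far less memory than the whole table.

```python
# Hypothetical schema: column name -> bytes per value (assumed, not MapD's real layout).
schema = {"lon": 4, "lat": 4, "time": 4, "user_id": 8, "text": 140}
num_rows = 1_000_000_000  # assumed row count


def bytes_to_cache(schema, num_rows, needed_columns):
    """Memory needed to hold only the listed columns for every row."""
    return sum(schema[name] * num_rows for name in needed_columns)


# Caching the full table versus only the columns a heatmap query would read.
full_table = bytes_to_cache(schema, num_rows, schema)
heatmap_only = bytes_to_cache(schema, num_rows, ["lon", "lat", "time"])

print(full_table // 10**9, "GB for every column")
print(heatmap_only // 10**9, "GB for lon/lat/time only")
```

Under these assumed widths the heatmap columns fit comfortably in 128GB of GPU RAM while the full table would not.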

I just think people need to realize that different problems need different solutions, and just b/c a system is not built to handle a petabyte of data doesn't mean it's not worthwhile.

Re:Q: Whats better than a GPU database? on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 17:17 · Score: 3, Insightful

Try to heatmap or do hierarchical clustering on a billion rows in a few milliseconds with just the aid of indexes - not all applications need lots of cores and high memory bandwidth - but some do.

Re:Good to see things like this. on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 16:09 · Score: 1

Thanks for the kind words! Hopefully this is just the start of a fun project... Todd

Re:Could have... on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 14:45 · Score: 2

Umm... no I didn't submit this. Perhaps the author of the article did. But I may have just done a super-hacky bandaid fix (also disallowed click requests - which may be a bit buggy) - we'll see if it holds up. Todd

Re:That Didn't Take Long: Database Down For Maint. on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 14:27 · Score: 5, Informative

Har har... Well things got tricky when I wrote the code to support streaming inserts (not implemented in the current map) so you could view tweets or whatever else as they came in - this required a lot of fine-grained locking. May just bandaid this and give locks to connections as they come in until I can figure out what's going on. Todd
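For readers unfamiliar with the trade-off, the "bandaid" of one coarse lock can be sketched roughly as follows. This is an illustration only, not MapD's code: `TinyStore` and `client` are hypothetical names, and Python threads stand in for server connections. With a single lock that is never acquired while already held, there is no lock-ordering cycle to deadlock on, at the cost of serializing every request.

```python
import threading


class TinyStore:
    """Hypothetical stand-in for a database guarded by one coarse lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.rows = []

    def insert(self, row):
        # Every operation takes the same single lock, so requests run one at a time.
        with self._lock:
            self.rows.append(row)

    def count(self):
        with self._lock:
            return len(self.rows)


def client(store, client_id, n):
    """Simulate one connection streaming n inserts."""
    for i in range(n):
        store.insert((client_id, i))


store = TinyStore()
threads = [threading.Thread(target=client, args=(store, c, 100)) for c in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = store.count()
print(total)
```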

Re:sounds like... on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 12:58 · Score: 5, Informative

I'm not using thrust - I rolled my own hash join algorithm. This is something I still haven't optimized a great deal and I'm sure your stuff runs much better. Would love to talk. Just contact me on Twitter (@toddmostak) and I'll give you my contact details. Todd
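For context, the general build/probe shape of a hash join looks roughly like this. It is a plain CPU sketch in Python with hypothetical table and column names, not the GPU implementation described above.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dict rows using a build phase and a probe phase."""
    # Build phase: index one input by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: look up each row of the other input in the hash table.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical example data.
users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
tweets = [
    {"user_id": 1, "text": "hi"},
    {"user_id": 1, "text": "yo"},
    {"user_id": 3, "text": "x"},
]

result = hash_join(users, tweets, "user_id", "user_id")
print(result)
```

Building on the smaller input keeps the hash table small; here that is `users`.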

Re:sounds like... on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 12:54 · Score: 5, Informative

So I use Postgres all the time, but MapD isn't built on Postgres. It actually stores its own data on disk in column form in (I admit crude) memory-mapped files. I have written a connector that links MapD to Postgres, though, since I use Postgres to store the tweets I harvest for long-term archiving. The connector uses pqxx (the C++ Postgres library). Todd
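As a rough illustration of the idea of a column stored in a memory-mapped file, here is a sketch using only Python's standard library. The file name, the fixed-width native int layout, and the helper names are assumptions for the example and say nothing about MapD's real on-disk format.

```python
import mmap
import os
import struct
import tempfile


def write_column(path, values):
    """Write a column of ints as a flat array of native C ints."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}i", *values))


def sum_column(path):
    """Memory-map the column file and scan it without reading it into a list."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Both views are released before the mapping is closed.
            with memoryview(mm) as raw, raw.cast("i") as column:
                return sum(column)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "retweets.col")  # hypothetical column file
    write_column(path, [3, 0, 7, 12])
    total = sum_column(path)

print(total)
```

Because each column lives in its own flat file, a scan over one column only pages in that column's bytes.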

Re:PostgreSQL used GPU 2 years ago on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 12:14 · Score: 5, Informative

The 70X is actually highly conservative - and this was benched against an optimized parallelized main-memory (i.e. not off of disk) CPU version, not say MySQL. On things like rendering heatmaps, graph query operations, or clustering you can get 300-500X speedups. The database caches what it can in GPU memory (could be 128GB on one node if you have 16 GPUs) and only sends back a bitmap of the results to be joined with data sitting in CPU memory. But yeah, if the data's not cached, then it won't be this fast. That's true, a lot of work has been done on GPU database processing - this is a bit different I think b/c it runs on multiple GPUs and b/c it tries to cache what it can on the GPU. Todd (MapD creator)
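The "bitmap of the results" step can be pictured with a small CPU-only sketch. The column names and data are hypothetical, and a Python list stands in for the GPU-cached column: the filter yields one bit per row, and the host then uses those bits to pick matching rows out of columns that were never on the GPU.

```python
def filter_to_bitmap(column, predicate):
    """Return a packed bitmap with bit i set when predicate(column[i]) is true."""
    bitmap = bytearray((len(column) + 7) // 8)
    for i, value in enumerate(column):
        if predicate(value):
            bitmap[i // 8] |= 1 << (i % 8)
    return bitmap


def gather(bitmap, column):
    """Select the values of another column at the rows whose bit is set."""
    return [v for i, v in enumerate(column) if (bitmap[i // 8] >> (i % 8)) & 1]


# Stand-in for a column cached on the GPU.
retweet_counts = [3, 0, 7, 12, 1]
# Stand-in for a column that stays in CPU memory.
texts = ["a", "b", "c", "d", "e"]

bitmap = filter_to_bitmap(retweet_counts, lambda n: n >= 5)
selected = gather(bitmap, texts)
print(selected)
```

Only one bit per row crosses back to the host, which is much smaller than shipping the matching rows themselves.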

Re:sounds like... on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 12:04 · Score: 5, Informative

Hi, MapD creator here - and I have to disagree with you. The database ultimately stores everything on disk, but it caches what it can in GPU memory and performs all the computation there. So all the SQL operations are occurring on the GPU, after which, in the case of the tweetmap demo, the results are rendered to a texture before being sent out as a png. But it works equally well as a traditional database - it doesn't do the whole SQL standard yet but can handle aggregations, joins, etc. just like a normal database, just much faster. Todd

Re:That Didn't Take Long: Database Down For Maint. on Harvard/MIT Student Creates GPU Database, Hacker-Style · 2013-04-22 11:59 · Score: 5, Informative

Hi... MapD creator here... this is the first time we've been seriously load tested, and I realize I might have a "locking" bug that's creating a deadlock when people hit the server at the exact same time. Todd

Comments · 10