Harvard/MIT Student Creates GPU Database, Hacker-Style

Posted by Unknown on Monday April 22, 2013 @11:21AM from the search-faster dept.

First time accepted submitter IamIanB writes "Harvard Middle Eastern Studies student Todd Mostak's first tangle with big data didn't go well; trying to process and map 40 million geolocated tweets from the Arab Spring uprising took days. So while taking a database course across town at MIT, he developed a massively parallel database that uses GeForce Titan GPUs to do the data processing. The system sees 70x performance increases over CPU-based systems, and can outcrunch a 1,000-node MapReduce cluster in some cases. All for around $5,000 worth of hardware. Mostak plans to release the system under an open source license; you can play with a data set of 125 million tweets hosted at Harvard's WorldMap and see the millisecond response time." I seem to recall a dedicated database query processor, built from a few hundred really small processors, that was integrated with INGRES in the '80s.

Two thoughts based on this story by Anonymous Coward · 2013-04-22 11:23 · Score: 5, Interesting

1. Facebook would like to have a discussion with him.
2. The FBI would like to have a discussion with him.
Re:I'm not a computer scientist, and... by gubon13 · 2013-04-22 11:36 · Score: 5, Informative

Sort of a lazy effort on my part to not summarize, but here's a great explanation: https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU.
Re:I'm not a computer scientist, and... by PhamNguyen · 2013-04-22 11:40 · Score: 5, Informative

GPUs are much faster for code that can be parallelized (basically this means having many cores doing the same thing, but on different data). However, there is significant complexity in isolating the parts of the code that can be done in parallel. Additionally, there is a cost to moving data to the GPU's memory, and also from the GPU memory to the GPU cores. CPUs, on the other hand, have a cache architecture that means that much of the time, memory access is extremely fast.
Given progress in the last 10 years, the set of algorithms that can be parallelized is very large. So the GPU advantage should be overwhelming. The main issue is that the complexity of writing a program that does things on the GPU is much higher.
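To make the transfer-cost point above concrete, here is a minimal sketch of a back-of-the-envelope model. The function name and all timings are hypothetical, chosen only for illustration; nothing here is measured.

```python
def effective_speedup(cpu_seconds, gpu_compute_seconds, transfer_seconds):
    """Speedup of the GPU path once host<->device copies are included."""
    return cpu_seconds / (gpu_compute_seconds + transfer_seconds)


# Hypothetical timings: the kernel alone is 100x faster than the CPU,
# but copying the data to and from the GPU costs nine times the kernel time.
cpu = 1.0
gpu = 0.01
copy = 0.09

kernel_only = effective_speedup(cpu, gpu, 0.0)
with_copy = effective_speedup(cpu, gpu, copy)
print(f"kernel only: {kernel_only:.0f}x, including copies: {with_copy:.0f}x")
```

Under these made-up numbers the copies eat most of the gain, which is why keeping data resident in GPU memory matters so much.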
Re:That Didn't Take Long: Database Down For Maint. by tmostak · 2013-04-22 11:59 · Score: 5, Informative

Hi... MapD creator here... this is the first time we've been seriously load tested, and I realize I might have a "locking" bug that's creating a deadlock when people hit the server at the exact same time. Todd
Re:sounds like... by tmostak · 2013-04-22 12:04 · Score: 5, Informative

Hi, MapD creator here - and I have to disagree with you. The database ultimately stores everything on disk, but it caches what it can in GPU memory and performs all the computation there. So all the SQL operations are occurring on the GPU, after which, in case of the tweetmap demo, the results are rendered to a texture before being sent out as a png. But it works equally well as a traditional database - it doesn't do the whole SQL standard yet but can handle aggregations, joins, etc just like a normal database, just much faster. Todd
Re:PostgreSQL used GPU 2 years ago by tmostak · 2013-04-22 12:14 · Score: 5, Informative

The 70X is actually highly conservative - and this was benched against an optimized parallelized main-memory (i.e. not off of disk) CPU version, not say MySQL. On things like rendering heatmaps, graph query operations, or clustering you can get 300-500X speedups. The database caches what it can in GPU memory (could be 128GB on one node if you have 16 GPUs) and only sends back a bitmap of the results to be joined with data sitting in CPU memory. But yeah, if the data's not cached, then it won't be this fast. That's true, a lot of work has been done on GPU database processing - this is a bit different I think b/c it runs on multiple GPUs and b/c it tries to cache what it can on the GPU. Todd (MapD creator)
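The idea of sending back only a bitmap of matching rows can be sketched in a few lines. This is an illustration under assumptions, not MapD's code: the column names, the predicate, and the use of a Python integer as the bitmap are all hypothetical, and the "GPU side" is just an ordinary loop.

```python
def filter_bitmap(column, predicate):
    """Stand-in for the GPU filter: set bit i when row i matches."""
    bits = 0
    for i, value in enumerate(column):
        if predicate(value):
            bits |= 1 << i
    return bits


def select_rows(bitmap, rows):
    """Host side: use the bitmap to pick matching rows from CPU memory."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]


# Hypothetical data: a numeric column "cached on the GPU" and the
# tweet text kept in CPU memory.
latitudes = [10.0, 45.0, 30.0, 60.0]
texts = ["a", "b", "c", "d"]

bitmap = filter_bitmap(latitudes, lambda lat: lat > 25.0)
matches = select_rows(bitmap, texts)
print(bin(bitmap), matches)
```

The bitmap costs one bit per row, so it is far cheaper to move across the bus than the matching rows themselves.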
Re:I'm not a computer scientist, and... by Morpf · 2013-04-22 12:19 · Score: 5, Informative

Close, but not quite correct.
The point is GPUs are fast doing the same operation on multiple data (e.g. multiplying a vector by a scalar). The emphasis is on _same operation_, which might not be the case for every problem one can solve in parallel. You will lose speed as soon as the elements of a wavefront (e.g. 16 threads, executed in lockstep) diverge into multiple execution paths. This happens if you have something like an "if" in your code and for one work item the condition evaluates to true and for another it evaluates to false. Your wavefront will only execute one path at a time, so your code becomes kind of "sequential" at this point. You will lose speed, too, if the way you access your GPU memory does not fulfill some restrictions. And by the way: I'm not speaking about some mere 1% performance loss but quite a number. ;) So generally speaking: not every problem one can solve in parallel can be efficiently solved by a GPU.
There is something similar to caches in OpenCL: it's called local data storage, but it's the programmer's job to use it efficiently. Memory access is always slow if it's not registers you are accessing, be it CPU or GPU. When using a GPU you can hide part of the memory latency by scheduling way more threads than you can physically run and always switching to those that aren't waiting for memory. This way you waste fewer cycles waiting for memory.
I agree with your view that writing for the GPU takes quite a bit of effort. ;)
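The divergence effect described above can be imitated on a CPU with a toy model. This is only a sketch: `run_wavefront`, the threshold, and the two branch bodies are hypothetical, and real hardware masks lanes rather than looping over them.

```python
def run_wavefront(values, threshold):
    """Toy model of lockstep execution over an if/else.

    Each branch that at least one work item takes costs one pass;
    items on the other branch sit idle (masked off) during that pass.
    Returns the per-item results and the number of passes.
    """
    taken = [v > threshold for v in values]
    results = [None] * len(values)
    passes = 0
    for path in (True, False):
        active = [i for i, t in enumerate(taken) if t == path]
        if not active:
            continue  # nobody took this branch, so it costs nothing
        passes += 1
        for i in active:
            results[i] = values[i] * 2 if path else values[i] + 1
    return results, passes


uniform_results, uniform_passes = run_wavefront([5] * 16, 0)
mixed_results, mixed_passes = run_wavefront(list(range(16)), 7)
print(uniform_passes, mixed_passes)
```

When all 16 items agree, the group needs one pass; when they split, both branches run back to back, so the same work takes twice as many passes in this model.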
Re:I'm not a computer scientist, and... by UnknownSoldier · 2013-04-22 12:24 · Score: 5, Informative

If one woman can have a baby in 9 months, then 9 women can have a baby in one month, right?
No.
Not every task can be run in parallel.
Now however if your data is _independent_ then you can distribute the work out to each core. Let's say you want to search 2000 objects for some matching value. On an 8-core CPU each core would need to do 2000/8 = 250 searches. On the Titan each core could process 1 object.
There are also latency vs bandwidth issues, meaning it takes time to transfer the data from RAM to the GPU, process, and transfer the results back, but if the GPU's processing time is vastly less than the CPU's, you can still have HUGE wins.
There are also SIMD / MIMD paradigms which I won't get into, but in layman's terms SIMD is able to process more data in the same amount of time.
You may be interested in reading:
http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html
http://stackoverflow.com/questions/7091958/cpu-vs-gpu-when-cpu-is-better
When your problem domain & data are able to be run in parallel then GPUs totally beat CPUs in terms of processing power AND in price. i.e.
An i7 3770K costs around $330. Counting its 8 hardware threads (4 physical cores) as cores, Price/Core is $330/8 = $41.25/core
A GTX Titan costs around $1000. Price/Core is $1000/2688 = $0.37/core
Remember computing is about 2 extremes:
Slow & Flexible < - - - > Fast & Rigid
CPU (flexible) vs GPU (rigid)
* http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501
* http://www.newegg.com/Product/Product.aspx?Item=N82E16814130897
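The arithmetic in the comment above is easy to check. The sketch below simply reuses the comment's own figures (8 and 2688 workers, $330 and $1000); the function names are made up for illustration.

```python
import math


def rounds_needed(items, workers):
    """Sequential rounds needed if each worker handles one item per round."""
    return math.ceil(items / workers)


def price_per_core(price_usd, cores):
    return price_usd / cores


cpu_rounds = rounds_needed(2000, 8)
gpu_rounds = rounds_needed(2000, 2688)
cpu_price = price_per_core(330, 8)
gpu_price = price_per_core(1000, 2688)

print(cpu_rounds, gpu_rounds)
print(f"${cpu_price:.2f}/core vs ${gpu_price:.2f}/core")
```

This counts cores only; it says nothing about how much work a single CPU core does compared with a single GPU core.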
Re:sounds like... by tmostak · 2013-04-22 12:54 · Score: 5, Informative

So I use postgres all the time, but MapD isn't built on Postgres, it actually stores its own data on disk in column-form in (I admit crude) memory-mapped files. I have written a Postgres connector that connects MapD to Postgres though since I use postgres to store the tweets I harvest for long-term archiving. The connector uses pqxx (the C++ Postgres library). Todd
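For readers unfamiliar with the idea, a column stored in a memory-mapped file can be sketched as follows. This is not MapD's on-disk format; the file name, the little-endian float64 layout, and the helper names are all assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile


def write_column(path, values):
    """Store one column as a flat run of little-endian float64 values."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}d", *values))


def column_mean(path):
    """Map the column file into memory and scan it without an explicit read()."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            count = len(mapped) // 8  # 8 bytes per float64
            values = struct.unpack_from(f"<{count}d", mapped, 0)
            return sum(values) / count


with tempfile.TemporaryDirectory() as directory:
    lat_path = os.path.join(directory, "lat.col")  # hypothetical file name
    write_column(lat_path, [30.0, 31.5, 29.5])
    mean_lat = column_mean(lat_path)

print(mean_lat)
```

Because each column lives in its own file, a query that touches only latitude never has to read the tweet text at all.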
Re:sounds like... by tmostak · 2013-04-22 12:58 · Score: 5, Informative

I'm not using thrust - I rolled my own hash join algorithm. This is something I still haven't optimized a great deal and I'm sure your stuff runs much better. Would love to talk. Just contact me on Twitter (@toddmostak) and I'll give you my contact details. Todd
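As background, the textbook build-and-probe hash join looks roughly like the sketch below. It is a plain single-threaded version, not the GPU algorithm mentioned above; the table contents and key names are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Equi-join two lists of dicts with a build phase and a probe phase."""
    # Build: index the left table by its join key.
    table = defaultdict(list)
    for row in left:
        table[row[left_key]].append(row)
    # Probe: look up each right row and emit the merged matches.
    joined = []
    for row in right:
        for match in table.get(row[right_key], ()):
            joined.append({**match, **row})
    return joined


users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
tweets = [
    {"user_id": 1, "text": "x"},
    {"user_id": 1, "text": "y"},
    {"user_id": 3, "text": "z"},
]

joined = hash_join(users, tweets, "user_id", "user_id")
print(joined)
```

The build phase is usually run over the smaller table so the hash table stays compact.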
Large datasets are mostly IO limited by zbobet2012 · 2013-04-22 13:07 · Score: 5, Interesting

While cool and all, 125 million tweets with geotagging is at most 125,000,000 * 142 bytes = 17.75 GB. That is not what "big data" considers a large data set. Indeed, most "big data" queries are IO limited. For around 16k USD you can fit that entire working set in memory. You are not really in the "big data" realm until you have datasets in the tens of TB compressed (hundreds of TB uncompressed).
For these kinds of datasets, and where more compute is necessary there is MARs.
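The size estimate above can be reproduced with a few lines. The 142 bytes per tweet is the comment's own upper bound; the helper names are made up, and the ten-times-larger row count is included only for comparison.

```python
BYTES_PER_TWEET = 142  # per-row upper bound taken from the comment


def dataset_bytes(rows, bytes_per_row=BYTES_PER_TWEET):
    return rows * bytes_per_row


def to_gb(n_bytes):
    return n_bytes / 10**9  # decimal gigabytes


def to_gib(n_bytes):
    return n_bytes / 2**30  # binary gibibytes


size = dataset_bytes(125_000_000)
size_10x = dataset_bytes(1_250_000_000)  # ten times as many rows, for comparison

print(f"125 million rows: {to_gb(size):.2f} GB ({to_gib(size):.1f} GiB)")
print(f"1.25 billion rows: {to_gb(size_10x):.1f} GB ({to_gib(size_10x):.0f} GiB)")
```

Either way the working set fits in the RAM of a single well-equipped server, which is the comment's point.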
Re:That Didn't Take Long: Database Down For Maint. by tmostak · 2013-04-22 14:27 · Score: 5, Informative

Har har... Well things got tricky when I wrote the code to support streaming inserts (not implemented in the current map) so you could view tweets or whatever else as they came in - this required a lot of fine-grained locking. May just bandaid this and give locks to connections as they come in until I can figure out what's going on. Todd
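One reading of the stopgap described above is a single coarse lock held for the whole of each connection, trading concurrency for safety. The sketch below shows that pattern only; `TinyStore` and its methods are hypothetical and have nothing to do with MapD's actual code.

```python
import threading


class TinyStore:
    """Toy store where each connection holds one big lock while it works."""

    def __init__(self):
        self._big_lock = threading.Lock()
        self._rows = []

    def handle_connection(self, rows_to_insert):
        # One lock for the whole request: no fine-grained locking and no
        # lock ordering to get wrong, at the cost of serializing connections.
        with self._big_lock:
            for row in rows_to_insert:
                self._rows.append(row)
            return len(self._rows)

    def count(self):
        with self._big_lock:
            return len(self._rows)


store = TinyStore()
threads = [
    threading.Thread(
        target=store.handle_connection,
        args=([(t, i) for i in range(100)],),
    )
    for t in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

total_rows = store.count()
print(total_rows)
```

With only one lock in play there is no second lock to wait on while holding the first, so this pattern cannot deadlock on its own; the price is that simultaneous requests queue up behind each other.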