Harvard/MIT Student Creates GPU Database, Hacker-Style
First time accepted submitter IamIanB writes "Harvard Middle Eastern Studies student Todd Mostak's first tangle with big data didn't go well; trying to process and map 40 million geolocated tweets from the Arab Spring uprising took days. So while taking a database course across town at MIT, he developed a massively parallel database that uses GeForce Titan GPUs to do the data processing. The system sees 70x performance increases over CPU-based systems, and can out-crunch a 1000-node MapReduce cluster in some cases. All for around $5,000 worth of hardware. Mostak plans to release the system under an open source license; you can play with a data set of 125 million tweets hosted at Harvard's WorldMap and see the millisecond response time."
I seem to recall a dedicated database query processor that worked by having a few hundred really small processors that was integrated with INGRES in the '80s.
1. Facebook would like to have a discussion with him.
2. The FBI would like to have a discussion with him.
I want to know why GPUs are so much better at some tasks than CPUs? And, why aren't they used more often if they are orders of magnitude faster?
Thanks.
Slashdotted? I happened to catch the story just as it went live, and hit the link to the service. After scrolling the map and getting a couple of updates: Database is down for maintenance. The front end may not be as high performance as the back... or it may have been coincidence.
Luke, help me take this mask off
as the TFS states he uses GPUs to do the data processing, but you are never going to believe what he uses to store the actual data, you won't believe it, that's why it's not mentioned in TFS. Sure sure, it's PostgreSQL, but the way the data was stored physically was in the computer monitor itself. Yes, he punched holes in computer monitors with a chisel and used punch card readers to read those holes from the screens.
You can't handle the truth.
Could anyone give a brief and not overly technical explanation of this?!
It sounds like he's doing standard GPU computations, loading everything into memory, and then calling it a "database", even though it really isn't a "database" in any traditional sense.
I'd hardly call them "really small processors" haha.
The 70x figure seems optimistic. Does this include ALL the overheads for the GPU?
But this was done and patented over 2 years ago.
http://www.scribd.com/doc/44661593/PostgreSQL-OpenCL-Procedural-Language
And there has been earlier work using SQLite on GPUs.
The Egyptian government...
Still waiting for one/two....to play games on....
The 70X is actually highly conservative - and this was benched against an optimized parallelized main-memory (i.e. not off of disk) CPU version, not say MySQL. On things like rendering heatmaps, graph query operations, or clustering you can get 300-500X speedups. The database caches what it can in GPU memory (could be 128GB on one node if you have 16 GPUs) and only sends back a bitmap of the results to be joined with data sitting in CPU memory. But yeah, if the data's not cached, then it won't be this fast. That's true, a lot of work has been done on GPU database processing - this is a bit different I think b/c it runs on multiple GPUs and b/c it tries to cache what it can on the GPU. Todd (MapD creator)
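For anyone trying to picture the "only sends back a bitmap" part, here is a rough sketch of the idea. It is purely illustrative and is not MapD's code: the function names `scan_to_bitmap` and `join_with_bitmap` are made up, the sample data is invented, and plain Python stands in for what would be a GPU kernel.

```python
# Hypothetical illustration only; plain Python stands in for a GPU kernel.

def scan_to_bitmap(column, predicate):
    """Scan a cached column and return a bitmap (stored in an int) with
    bit i set when row i matches. A GPU would test the rows in parallel."""
    bitmap = 0
    for i, value in enumerate(column):
        if predicate(value):
            bitmap |= 1 << i
    return bitmap


def join_with_bitmap(bitmap, host_rows):
    """Use the bitmap to pick the matching rows out of data in CPU memory."""
    return [row for i, row in enumerate(host_rows) if (bitmap >> i) & 1]


# Column assumed to be cached on the GPU: tweet latitudes (invented values).
latitudes = [30.0, 51.5, 29.9, 40.7, 31.2]
# Wider rows assumed to stay in CPU memory (invented values).
tweets = ["cairo #1", "london", "giza", "new york", "alexandria"]

bitmap = scan_to_bitmap(latitudes, lambda lat: 29.0 <= lat <= 32.0)
matches = join_with_bitmap(bitmap, tweets)
print(bin(bitmap), matches)
```

The point of the design is that one bit per row crosses the bus instead of the rows themselves, so the transfer stays small even when the scanned column is large.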
Anyone else think that this would be a searchable database of all GPUs that exist? Because that sounded kinda useful.
does it blend?
Altera and Xilinx both have high-level synthesis tools out that can target FPGAs using generic C. The Altera one allows you to target GPUs, CPUs, or FPGAs. In the case of highly parallel tasks, an FPGA can run many times faster than even a GPU. There are fairly large gate count devices with ARM cores available now, so you can move the tasks around for better performance. I'd love to see some of these tasks targeting these devices.
Maybe we should make a habit of giving the owner some warning before slashdotting them. I know that if I ever get any concept development project up and running, I'm pretty excited to show my friends and tend to make it accessible before it's optimized enough to handle that kind of onslaught.
MouseClass extends ScrollClass, which extends TabClass, which extends SidebarClass, which extends PowerClass, w
Ingres was renamed to Actian and has released an analytic/reporting database called "Vectorwise", which makes use of SIMD and many other innovations in data throughput techniques (everything in the Intel optimisation manual plus a lot more), and it gets more than 70 times performance. Check out the TPC-H results. "This is not an advertisement"
While cool and all, 125 million geotagged tweets is at most 125,000,000 * 142 bytes = 17.75 GB. That is not what "big data" considers a large data set. Indeed, most "big data" queries are I/O limited. For around 16k USD you can fit that entire working set in memory. You are not really in the "big data" realm until you have datasets in the tens of TB compressed (hundreds of TB uncompressed).
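A quick sanity check of that size estimate, taking the 125 million tweet count from the summary and the 142 bytes per tweet assumed above (the per-tweet size is the commenter's upper bound, not a measured value):

```python
# Back-of-the-envelope size check for the tweet data set.
TWEETS = 125_000_000       # "125 million tweets" from the summary
BYTES_PER_TWEET = 142      # assumed upper bound per geotagged tweet

total_bytes = TWEETS * BYTES_PER_TWEET
total_gb = total_bytes / 10**9     # decimal gigabytes
total_gib = total_bytes / 2**30    # binary gibibytes

print(f"{total_gb:.2f} GB ({total_gib:.2f} GiB)")
```

Either way you count it, that fits in the RAM of a single well-equipped server.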
For these kinds of datasets, and where more compute is necessary there is MARs.
Actually, depending on the specific problem GPU can still be significantly faster than FPGAs mostly because of the large number of processing units.
The FPGAs are far more power efficient though.
For data processing workloads, a frequent problem with GPU acceleration is that the working dataset size is too large to fit into the available GPU memory and the whole thing slows to a crawl on data ingest (physical disk seeks, random much of the time) or disk writes for persisting the results.
For folks serious about getting good ROI on their GPU hardware in real world scenarios, I strongly recommend you take a look at the fusion IO PCIe flash cards, which now support writing to and reading from them directly from CUDA via DMA, with little to no CPU handling required. (See: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0619-GTC2012-Flash-Memory-Throttle.pdf).
I can't talk about what we do with it, but let's just say the following hardware combination has led to interesting results:
i) 16x PCIe slot chassis: http://www.onestopsystems.com/expansion_platforms_3U.php
ii) 8x Nvidia Kepler K20x's
iii) 8x Fusion IO 2.4TB IoDrive 2 Duo's
We have been able to sustain over 4 million data operations a second, each one processing ~16 K of data in a recoverable, transactionally consistent manner, totaling up to around 50 Gigabytes of data processed per second. All in a 5U deployment drawing less than 4 kilowatts.
Granted it's not free or cheap, but IBM will ship you a prebuilt rack of 'stuff' that will load 5TB/hour and scan 128GB/sec. PGStrom came out in the last year. Custom hardware/ASIC/FPGA for this sort of thing is not new.
I want to delete my account but Slashdot doesn't allow it.
They're massively more parallel, running many more smaller, simpler cores.
It's the same reason these guys can make a 16 core parallel computer for $99.... the cores are focused on their job so they can be smaller and cheaper and can put more on a die.
http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone/
So these guys can run 8 daughter boards, with 64 cores per board, 512 cores, and it looks like they plan on scaling to 4096 cores because they use the top 12 bits of the address as the core routing id.
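To make the addressing point concrete: if the top 12 bits of an address pick the core, there are 2^12 = 4096 possible core IDs. The sketch below assumes 32-bit addresses, which leaves 20 bits of offset within each core. The helper names are invented for illustration and do not come from any Adapteva SDK.

```python
ADDRESS_BITS = 32                          # assumption: 32-bit global addresses
CORE_ID_BITS = 12                          # "top 12 bits" as described above
OFFSET_BITS = ADDRESS_BITS - CORE_ID_BITS  # bits left for the per-core offset


def split_address(addr):
    """Return (core_id, local_offset) for a global address."""
    core_id = addr >> OFFSET_BITS
    offset = addr & ((1 << OFFSET_BITS) - 1)
    return core_id, offset


def make_address(core_id, offset):
    """Build a global address from a core ID and a local offset."""
    return (core_id << OFFSET_BITS) | offset


max_cores = 1 << CORE_ID_BITS
core, offset = split_address(0x80800040)
print(max_cores, hex(core), hex(offset))
```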
The tradeoff with all those cores is that they're dirt-simple cores: moves, adds, branches, and some floating-point ops. There isn't even a divide instruction; it's done in software. But for signal processing, multiply-add is the one that needs to be fast, and it's coded as a single instruction.
If you read up on your high-end graphics card, it might have 900+ CUDA cores, but those are really just ALUs; the cores that actually run threads are far fewer than that. But the ALUs can be run in parallel.
So a vector multiply is done as a parallel operation on these ALU blocks, and many other operations break down to be parallel in the same way.
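A toy sketch of how an element-wise vector multiply splits across lanes. This is ordinary sequential Python, so nothing actually runs in parallel here; the lane count of 4 and the function names are arbitrary choices for illustration. The thing to notice is that no lane ever needs another lane's result.

```python
def lane_multiply(a, b, lane, num_lanes, out):
    """Work done by one lane: every num_lanes-th element, starting at `lane`."""
    for i in range(lane, len(a), num_lanes):
        out[i] = a[i] * b[i]


def vector_multiply(a, b, num_lanes=4):
    """Element-wise product of a and b, split across independent lanes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    out = [0] * len(a)
    # On real hardware the lanes would run at the same time; here they
    # simply run one after another.
    for lane in range(num_lanes):
        lane_multiply(a, b, lane, num_lanes, out)
    return out


result = vector_multiply([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60])
print(result)
```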
As a data analyst/software engineer, it makes me glad to see that these kinds of actual strides are being made to ensure that both data and software will eventually start being designed properly from their inception. To have a single cluster database with anything more than a few thousand entries is nothing short of incompetence, and I believe anyone who does this should be publicly shamed and flogged. When dealing with excessively large amounts of data, it quickly becomes a necessity to have a parallel database design to ensure that searches aren't hampered by long query times. It genuinely makes me thrilled to see someone other than me use this kind of design, so when I put out numbers on my end, maybe my results won't seem as fantastical or unbelievable. Even though I don't know you personally, keep up the good work, Todd.
...and do big data on an FPGA cluster.
A: Indexes that don't suck.
Using GPUs and massively parallel blah blah blah is cool and all, but most databases are not processor limited, so why should we care?
40 million rows is what we used to manage in Oracle tables in the late 80s. Jeez, did this guy have no clue how to build a database?
Korma: Good
AFAIR using a database with a GPU has been patented by IBM some years ago
It's great the GPU is faster than the CPU for massively parallel non-conditional operations. Why not use the CPU in addition to the GPU? Does the computer memory speed or bus bandwidth prevent it?
Student writes inefficient code, learns how to optimize it using known techniques, it becomes faster. Film at 11.
In Soviet Russia, GPU database creates you. Oh wait, wrong GPU
No left turn unstoned.
I too would hope that a cluster of 16 GPUs would be 300x-500x faster than a non-clustered CPU.
A truck can deliver 1000 times more goods at once than a compact, but if you need to deliver babies, your truck can only deliver one, in the passenger seat.
A map-reduce cluster, such as Hadoop, is useful when you have a lot of data to sift through. It brings the data to the CPUs, rather than the other way around. It allows you to do a bunch of I/O in parallel, so you're not I/O bound. Contrast this to crunching numbers with GPUs or CPUs, where the bottleneck is processing throughput instead of I/O. These two architectures are optimized for solving different problems.
So comparing this very cool GPU-centric solution with a 1000-node map-reduce cluster is not useful. It's like saying that printers are better than FAX machines because they can print more pages per minute.
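For anyone who hasn't met the map-reduce pattern mentioned above, here is a minimal single-process sketch that counts hashtags. The tweets are invented and the chunks stand in for data spread across nodes. A real cluster such as Hadoop runs the map step on the machines that already hold each chunk, which is where the parallel I/O comes from; none of that distribution is modelled here.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (hashtag, 1) pair for each hashtag in one chunk of tweets."""
    return [
        (word.lower(), 1)
        for tweet in chunk
        for word in tweet.split()
        if word.startswith("#")
    ]


def shuffle(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts for each hashtag."""
    return {key: sum(values) for key, values in groups.items()}


# Each inner list stands in for the data held by one node (invented tweets).
chunks = [
    ["#jan25 tahrir tonight", "Protest #Egypt #jan25"],
    ["#egypt update"],
]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```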
Thanks for the link to the GPU "on-board" flash memory presentation. Interesting to see that original Apple ][ hardware guru Wozniak is the chief scientist on this for Fusion I O hardware. I hadn't seen that about him on any other sites. Merci!