BOINC Now Available For GPU/CUDA
GDI Lord writes "BOINC, open-source software for volunteer computing and grid computing, has posted news that GPU computing has arrived! The GPUGRID.net project from the Barcelona Biomedical Research Park uses CUDA-capable NVIDIA chips to create an infrastructure for biomolecular simulations. (Currently available for Linux64; other platforms to follow soon. To participate, follow the instructions on the web site.) I think this is great news, as GPUs have shown amazing potential for parallel computing."
As someone who is interested in software neural nets, this announcement practically gives me a chubber.
And let me be the first to welcome our new Distributed Overlord. The lack of an 's' on "Overlord" is the exciting part of this article.
Video conversion for GPU/CUDA (an amd64 version for ubuntu heron, if I get to be really choosy)
saw something about this, and they were getting unbelievable transcoding speeds...
It takes 40+ muscles to frown, but only four to extend your arm and bitchslap the motherfucker
The only sad thing is that CUDA is a single-platform API that only supports a handful of cards from a single manufacturer. For a project like BOINC, which tries to get as many computers working together as possible, it would also be good if they tried to support at least one more API.
Brook would also have been a nice candidate. It has already been used by another distributed computing project (Folding@home), and it supports multiple back-ends (including a multi-CPU one which actually works(*), an OpenGL one which works with most hardware, and AMD/ATI's CAL back-end featured in their Brook+ fork).
Too bad that both nVidia and Intel are currently trying to attract customers to proprietary single-platform APIs (CUDA and Ct, respectively).
Especially given some memory management weirdness in CUDA.
(*) : unlike CUDA's device emulation mode which is just a ridiculous joke performance-wise.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
- Implement CUDA in Gallium, so all Gallium-capable HW can run CUDA
~ C.
Yes, but sorry, CUDA is as much oriented toward other graphics manufacturers as Microsoft's ISO Office XML with all its "use_spacing_as_in_word_96='ture' " options is an open standard.
It is very heavily oriented toward nVidia's architecture, and it has several deeply asinine quirks. (You see, there are several different types of memory. The twist is that three of them are accessed using regular pointer arithmetic, but textures are accessed through dedicated functions, because using the "[]" operator like every other memory type would have been too straightforward.)
Also, instead of just being able to declare stream buffers and bind them to some data with a language extension (as in Brook, for example), you have to go through a couple of specific function calls into the CUDA API. It's 1980s-style C all over again.
The whole thing is very much directed toward an architecture like nVidia's, which can't apply a kernel on the fly while loading data from main memory to the graphics card, but instead relies on concurrent kernels and loads.
And don't ask me about the weird tendency to require the user to go through function calls just to set a constant to its default value (instead of simply declaring and accessing it directly).
CUDA provides a nice C-like language for kernels. But the host code itself looks like a direct dump of the driver's interface.
It's definitely not something that third-party developers will find easy to use, or that will map nicely to other architectures.
That's why ATI isn't interested: most of the host API is designed in a very nVidia-oriented way.
FYI, I've worked on several projects using both CUDA and Brook. Although I appreciate the speed gain of CUDA, and I appreciate having several C dialects that let you port an algorithm between C, CUDA and Brook without too much effort, I still find that Brook has a nicer and much more abstract architecture.
Does Brook provide access like CUDA does to fast shared memory and registers vs. device memory vs. host memory?
No. Being multiplatform to begin with, Brook exposes fewer details of the memory architecture underneath (because it can vary widely between platforms, like CPU versus GPU, or not be exposed at all by the platform underneath, like OpenGL).
But what it does have is that data is represented by simple C-like arrays, and the compiler remaps those to cached fast texture accesses. No weird "tex2D" functions, unlike CUDA. That's something I find odd in an architecture which is supposed to abstract and simplify GPGPU coding, especially when all the other memory types are accessed in CUDA using C pointer math.
Now that ATI's Brook+ is maturing, extra attributes on variable declarations could probably be introduced to give more influence over the memory organisation on that specific back-end.
CUDA is nice because it enables very low-level control over how memory is used. But this currently comes at the cost of syntax complexity.
It's interesting to note that both CUDA and Brook+ use matrix multiplication as an example of language usage. Brook+ simply explains how to partition the work to keep the data nicely inside the fast cache. CUDA devotes a significant number of code lines to moving data between several Hungarian-notation-prefixed pointers, which is a little more confusing.
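For anyone wondering what "partition the work to keep the data inside the fast cache" looks like, here is a minimal sketch in plain Python rather than Brook+ or CUDA. The function names and the `tile` parameter are made up for illustration and are not taken from either SDK's sample; the point is only that the tiled version walks the matrices block by block, so each small block of inputs is reused while it would still be sitting in fast memory.

```python
def matmul_naive(a, b):
    """Reference product of two list-of-lists matrices."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same product, computed one tile x tile block at a time.

    `tile` is a hypothetical block size; on real hardware it would be
    picked so that the blocks being worked on fit in fast memory.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):              # block row of the result
        for j0 in range(0, p, tile):          # block column of the result
            for k0 in range(0, m, tile):      # block along the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions give the same result for any tile size; only the order in which the inputs are touched changes. In Python this will not be faster, it just shows the loop structure.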
Just to pick a nit, I'm pretty sure that the point of device emulation mode is ease of debugging, not performance.
But to be debuggable, the code must at least be runnable. Sadly, the emulation is so slow that it can run real-world complex algorithms only on really small data sets. Those might be corner cases, and you might miss bugs that only happen on larger data sets. Also, it always runs single-threaded, no matter how many cores are available in the system, which may lead to missing some concurrency problems (code works fine on the CPU but breaks on the GPU because a sync is missing somewhere).
It can be used to debug short matrix-operation algorithms, but it's very hard to debug more complex things like sequence analysis (and there are even a couple of teams trying to do parallelised antivirus on the GPU)
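Here is a toy model of the "works on the CPU, breaks on the GPU because a sync is missing" problem. It is plain Python, not CUDA, and it does not claim to reproduce how the emulator or the hardware scheduler actually orders threads; `run_kernel`, `schedule` and `with_barrier` are names invented for this sketch. Each pretend thread stages its input into a shared buffer and then reads its left neighbour's slot. In real CUDA the barrier between the two phases would be `__syncthreads()`.

```python
def run_kernel(data, schedule, with_barrier):
    """Run a two-phase toy kernel over pretend threads.

    schedule     -- order in which the pretend threads get to run
    with_barrier -- if True, every thread finishes staging before any reads
    """
    n = len(data)
    shared = [None] * n   # None marks a slot that nobody has written yet
    out = [None] * n

    def stage(i):
        shared[i] = data[i]

    def read(i):
        # thread 0 has no left neighbour, so it reads its own slot
        out[i] = shared[max(i - 1, 0)]

    if with_barrier:
        for i in schedule:
            stage(i)
        # barrier: all staging is done before the first read
        for i in schedule:
            read(i)
    else:
        # no barrier: each thread runs both phases back to back
        for i in schedule:
            stage(i)
            read(i)
    return out
```

Run one thread at a time in ascending order and the missing barrier goes unnoticed, because the left neighbour has always finished already. Any other ordering that is equally legal on parallel hardware (descending, in this sketch) reads slots that have not been written yet. With the barrier, the order no longer matters.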
But at this early stage, with things still emerging, using CUDA directly seems to have some advantages.
There are cases where the low-levelness of CUDA definitely makes sense:
when developing code for purpose-built hardware. Say the lab you work in has built a machine with a couple of GeForces inside for your project (given the price of graphics cards and the performance increase between each generation, it makes sense to just throw in a couple of hundred bucks per graphics card for a specific project when the performance need arises). CUDA makes sense there, even if it is ugly in places, because it'll let you squeeze the last possible cycle out of the hardware.
But for something that will run distributed across a huge number of home configurations, like "@home" distributed computing, adding an API which is more abstract and brings in additional architectures makes sense. Going for a single API restricts the code to running on roughly half of the gaming population's machines.
Yes, indeed, F@H sports quite an original zoo of various computation engines in order to squeeze as much performance as possible from as many clients as possible. Including a client running on PS3's Cell.
I agree that BOINC should include support for more than one API. Either adding CAL as you suggest (although it's rather low-level stuff) or adding Brook (which has a CAL backend; I would think that would be better as it is much higher level).
And you presume correctly: currently Brook only supports nVidia through the OpenGL/GLSL backend, which lacks some advanced features. There has been some discussion on some forums about trying a CUDA backend for Brook, but the idea doesn't have enough followers, mainly because the most speed-critical optimisation (shared memory) won't be easy to implement automatically (in CUDA it's a voodoo art done manually by the coder).