Writing Linux Kernel Functions In CUDA With KGPU

← Back to Stories (view on slashdot.org)

Writing Linux Kernel Functions In CUDA With KGPU

Posted by Soulskill on Friday May 6, 2011 @07:59AM from the easy-as-pie dept.

An anonymous reader writes "Until today, GPGPU computing was a userspace privilege because of NVIDIA's closed-source policy and AMD's semi-open state. KGPU is a workaround to enable Linux kernel functionality written in CUDA. Instead of figuring out GPU specs via reverse-engineering, it simply uses a userspace helper to do CUDA-related work for kernelspace requesters. A demo in its current source repository is a modified eCryptfs, which is an encrypted filesystem used by Ubuntu and other distributions. With the accelerated performance of a GPU AES cipher in the Linux kernel, eCryptfs can get a 3x uncached read speedup and near 4x write speedup on an Intel X25-M 80G SSD. However, both the GPU cipher-based eCryptfs and the CPU cipher-based one are changed to use ECB cipher mode for parallelism. A CTR, counter mode, cipher may be much more secure, although the real vanilla eCryptfs uses CBC mode. Anyway, GPU vendors should think about opening their drivers and computing libraries, or at least providing a mechanism to make it easy to do GPU computing inside an OS kernel, given the fact that GPUs are so widely deployed and the potential future of heterogeneous operating systems."

28 of 101 comments (clear)

Min score:

Reason:

Sort:

Best possible example by Anonymous Coward · 2011-05-06 08:07 · Score: 2, Interesting

Hand off encryption routines to a closed source black box. Brilliant.
1. Re:Best possible example by icebraining · 2011-05-06 08:17 · Score: 2
  
  Yes, because the CPU isn't, we're all running open hardware /s
  
  --
  Dilbert RSS feed
2. Re:Best possible example by Jaqenn · 2011-05-06 08:17 · Score: 4, Insightful
  
  As opposed to having them done by my Intel CPU, for which Intel has helpfully provided full schematics.
  
  --
  You are awash in a sea of fiercely stated opinions. Obvious exits are: 'File->Quit', 'Reply', and 'Page Down'.
Question: by Jaqenn · 2011-05-06 08:11 · Score: 3, Interesting

(I have never written kernel level code, and the statement that follows is only from listening to what other people are doing)

I thought that a tiny bit of kernel code reflecting calls into a user level process was old news, and has become established as the preferred development model. Is there a reason that it's undesirable?

Because the summary makes it sound like we're sad to be following this model, and we're only doing it because we can't pull NVidia's driver source into the linux kernel.

--
You are awash in a sea of fiercely stated opinions. Obvious exits are: 'File->Quit', 'Reply', and 'Page Down'.
1. Re:Question: by sockman · 2011-05-06 08:18 · Score: 2
  
  The NVIDIA extensions are only available in userland.
  So a call to the kernel level crypto system gets routed back out to user land, and back to kernel land via the GPU module. That's why we're sad.
2. Re:Question: by killmenow · 2011-05-06 08:23 · Score: 2
  
  I've never written kernel modules either so take this with a grain of salt: my understanding is there is a cost associated with the switching/passing back and forth between userspace and kernelspace and it's best to minimize that. I remember similar discussions going back as far as NT4 when Microsoft decided to implement the entire GDI in kernelspace, which is what led to a billion BSODs because video drivers are notoriously shitty code and you'd be way better off stability-wise having that code run in userspace. Performance-wise, not so much.
  
  The interesting thing about encryption code working this way is there is such a tremendous speedup by running the bulk of the encryption code on the GPU as opposed to the CPU that the cost incurred in the user/kernel switch is well worth it.
3. Re:Question: by afidel · 2011-05-06 08:27 · Score: 2
  
  The reason it's undesirable is the hit you taking when moving back and forth between kernel space and user space. The move in each direction requires the CPU to change ring levels which increases latency.
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
4. Re:Question: by PoochieReds · 2011-05-06 09:07 · Score: 4, Interesting
  
  There are also other concerns than the context switch overhead...particularly when dealing with filesystems or data storage devices.
  For instance, suppose part of your userspace daemon gets swapped out, and you now need to upcall to userspace. That part that got paged out then has to be paged back in. If memory is tight, then the kernel may have to free some memory, and it may decide to flush out dirty data to the filesystem or device that is dependent on the userspace daemon. At that point, you're effectively deadlocked.
  Most of those sorts of problems can be overcome with careful coding and making sure the important parts of the daemon are mlocked, but you do have to be careful and it's not always straightforward to do that.
Re:Did a anyone else's brain switch off half way.. by h4rr4r · 2011-05-06 08:11 · Score: 4, Informative

GTFO!
This is what should be on slashdot, not stories about the latest iphone.
Re:AES-NI by gman003 · 2011-05-06 08:16 · Score: 2

Yes, but that was comparing a Pentium 4 (last one came out in 2006) to a brand-new processor (2011). That is NOT scientifically accurate - they are completely different designs, which will produce vastly different runtimes for the exact same instructions. How about doing a comparison between Crypto++ running on a 2500k, and Crypto++ running on a 2500k without being compiled with AES-NI support. That would be infinitely more rigorous.
Wow by killmenow · 2011-05-06 08:17 · Score: 2

I came to read a discussion of writing kernel functions in CUDA and a discussion about the vagaries of encryption methods broke out.
Re:AES-NI by Anonymous Coward · 2011-05-06 08:19 · Score: 2, Informative

KGPU uses AES just as a demonstration, it's architecture is general to any GPU-friendly algorithm.
All in good time by deadline · 2011-05-06 08:34 · Score: 2

Proof of concepts are nice, but when the GPU is firmly planted in the CPU, this will make more sense. The PCI bus can be a bottleneck in these types of situations. AMD fusion is a great example of this idea.

--
HPC for Primates. Read Cluster Monkey
Re:AES-NI by wagnerrp · 2011-05-06 09:12 · Score: 2

It's for an entirely different application. AES-NI is one application specific set of instructions. While encryption and decryption is an application in which dedicated hardware can have tremendous gains, introducing dozens of application specific hardware modules into a CPU is going to fall to diminishing returns, and just result in an oversized, expensive, and power hungry CPU. It's an inherently limiting design methodology. Introducing GPU access to the kernel opens up a very powerful piece of hardware to be used for a wide range of applications, enhancing any process that is suitable for the architecture found on a GPU.
Think of GPUs like picking up a new math co-processor 20 years ago.
Re:Recipe for a corrupted filesystem by calmofthestorm · 2011-05-06 09:15 · Score: 2

fragility of an encrypted file system{citationneeded}.
I've been using them since 2006. Never had any problems.

--
93rd rule of Slashdot: No matter how obvious my sarcasm is, my comment will be taken seriously by someone.
Re:AES-NI by gman003 · 2011-05-06 09:32 · Score: 2

No, cycle per byte is EXTREMELY important. Even contemporary processors can execute instructions in highly different amounts of time - a K5 can perform some instructions in 80% the time of an identically-clocked Pentium. And when you compare it to such wildly different architectures as Sandy Bridge and NetBurst, all bets are off. You might as well be throw an 8086 and a SPARC into the mix, because that'll be about as rigorous.
Re:Cool test... by Panaflex · 2011-05-06 09:43 · Score: 2

That's a pretty cool project! But I do think they still suffer the same latency problems - in order to take advantage of the GPU's full throughput - they have to have a huge number of client connections (chosen solution) or a very deep queue (hard to optimize, only works with larger file sizes).
Certainly this is a great solution for what it is - but it's not a general purpose solution. And you can get a much more reliable and supported solution out there. (e.g. BIG-IP SSL Accelerator, which uses certified FIPS 140-2 hardware.)

--
I said no... but I missed and it came out yes.
ECB Mode is totally insecure by jasonwc · 2011-05-06 09:44 · Score: 3, Interesting

I hope this is just a proof-of-concept design because ECB mode should not be used for this purpose. Wikipedia provides a pretty obvious example of the weakness of ECB mode:
"The disadvantage of this method is that identical plaintext blocks are encrypted into identical ciphertext blocks; thus, it does not hide data patterns well. In some senses, it doesn't provide serious message confidentiality, and it is not recommended for use in cryptographic protocols at all. A striking example of the degree to which ECB can leave plaintext data patterns in the ciphertext is shown below; a pixel-map version of the image on the left was encrypted with ECB mode to create the center image, versus a non-ECB mode for the right image."
http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation#Initialization_vector_.28IV.29
1. Re:ECB Mode is totally insecure by Melkhior · 2011-05-06 19:02 · Score: 2
  
  And because a picture straight from the horse's mouth is worth a thousand words, here's what NVidia has to say about it:
  http://http.developer.nvidia.com/GPUGems3/gpugems3_ch36.html
  Go to 36.5, figure 36-11 & 36-13.
Re:AES-NI by OeLeWaPpErKe · 2011-05-06 10:02 · Score: 2

The problem is that parallelized encryption is not as secure as the other modes. Let me show you the difference between CBC, ECB and CTR ( block(i) means the i'th block of data)
1) CBC
CBC(pwd, block(i)) == encrypt(pwd, block(i)) xor block(i-1)
* block(-1) = hash(pwd, 0) (sometimes half the password is used as block(-1))
2) ECB
ECB(block(i)) = encrypt(block(i))
3) CTR
CTR(block(i)) = encrypt(block(i)) xor i
I hope it's obvious why CBC and CTR are the only candidates for parallelization. CBC can only be done in sequence. But there's a huge issue. Ciphers have weak spots, and there are rainbow tables. So let's suppose you have an encrypted file in ECB mode.
encrypt(block(1)) : encrypt(block(2)) : ... : encrypt(block(n)) * bing rainbow table hit ! (ie. somehow you're able to decrypt block(3))
now you have a combination block(n), encrypt(block(n)) and password. Well you've broken the encryption. The problem is the contents of blocks are quite predictable (e.g. you will pretty much know every bit in an ext3 superblock if you know the size of the volume, so you can generate targeted rainbow tables). The only thing you need to find is the password.
Suppose the same happens in CBC mode
encrypt(block(1) xor initializer) : encrypt(block(2) xor encrypt(block(1) xor initializer) : encrypt(block(3) xor encrypt(block(2) xor encrypt(block(1) xor initializer)) ...
Now block(1) is still perfectly predictable, block(1) xor initializer, however, is not. You have to generate 2^(passwordlength + blocksize)/2 rainbow tables before you'd get a single hit. Also, just because you get one hit, doesn't mean it's the correct one (in ECB you know it's the correct one because the plaintext is meaningful. "Bob, I secretly loved your brother last night" is easily recognized as plaintext, while that same string xorred with a pseudorandom value doesn't make sense to anything). That means that you know have to find both the password and the plaintext. That generally, with a 256 bit password and 4 kb blocksize, that you effectively have a "password" that's 4.5 kb. This makes CBC orders of magnitude harder to crack.
It should be said that attacks on ECB or CTR, while a LOT easier, are only theoretical for recent algorithms (e.g. AES). However, CBC remains secure much longer than ECB, both using the same encryption algorithm. CBC 3DES encryption, for example, is considered safe (and it is very doubtful even the NSA or CIA has the resources even for CBC DES).
So, in short, NVIDIA cheated.
Why not OpenCL? by gerddie · 2011-05-06 10:09 · Score: 3, Interesting

They should go with OpenCL, then there would be a chance that at one point one can use it with free drivers (and other hardware), but I guess that's the prise you pay for a graduate fellowship from NVIDIA.
1. Re:Why not OpenCL? by GameboyRMH · 2011-05-06 23:34 · Score: 2
  
  Came here to say this. Why the hell are they writing things in CUDA instead of OpenCL? CUDA is closed and Nvidia-proprietary!
  
  --
  "When information is power, privacy is freedom" - Jah-Wren Ryel
Encryption is not the main beneficiary by voss · 2011-05-06 10:40 · Score: 2

Imagine mysql database GPU accelerated...
GPU accelerated routers, gpu acceleration of anti-virus software.
The use of gpus to accelerate search engines.
Re:AES-NI by draconx · 2011-05-06 10:51 · Score: 2

3) CTR
CTR(block(i)) = encrypt(block(i)) xor i
Sorry, but what you describe is not CTR mode. Using your notation, CTR would look (roughly) like this:
CTR(block(i)) = encrypt(counter) xor block(i)
where "counter" is usually constructed by concatenating a nonce value with i
(the block number). It is critical that the resulting counter never be re-used
with the same key for a different block).
Re:AES-NI by kasperd · 2011-05-06 11:05 · Score: 2

CTR(block(i)) = encrypt(block(i)) xor i
That's not how CTR works. Rather it works like
CTR(block(i)) = encrypt(IV || i) xor block(i)
However since most storage encryptions cheat and use an IV that is the same every time you write to the same logical sector, the CTR mode will actually turn into a pseudorandom one-time-pad. This means if you ever write to the same logical sector number twice, you are potentially leaking data. In the case of ecryptfs it is probably only a problem if you overwrite sectors in an existing file as the design of ecryptfs would make it easy to use a new IV per file, but not per sector.

If you want an encryption that is highly parallelizable and doesn't lose a lot of security when you cut corners and use a fixed IV, I think LRW is your best bet. (I don't like the name LRW as I find it an offence against the inventors of tweakable block ciphers, but I am not aware of any other name for that mode, and I don't even know who invented it.)

--

Do you care about the security of your wireless mouse?
Re:AES-NI by doublebackslash · 2011-05-06 13:56 · Score: 2

I'm curious, would CTR be less vulnerable if one XORed before encryption? Call the operation CXR.
Where ^ is the XOR operator
CXR(block(i)) = encrypt(IV ^ i ^ block(i))
I'm not sure if there is analysis that can be done on the block at that point that makes this undesirable. Methinks not because as far as I know having a well known IV in, say, CBC is not a vulnerability. That implies to me that the security still rests firmly in the key. At the very least it stops being vulnerable to bitwise changes and reinstates the Confusion and Diffusion principals.
There might also be some magic in reading the whole block (since we are talking about block level devices) and having, say, a CBC over the block with an IV calculated with encrypt(IV ^ i) but I think that goes out of scope of my question.

--
md5sum /boot/vmlinuz
d41d8cd98f00b204e9800998ecf8427e /boot/vmlinuz
CUDA? That makes zero sense by tyrione · 2011-05-06 15:35 · Score: 2

Instead, one should use OpenCL. It's Platform Agnostic for a reason, but don't let Linux's chance to be hypocritical step in the way.
Re:AES-NI by kasperd · 2011-05-06 20:01 · Score: 2

CXR(block(i)) = encrypt(IV ^ i ^ block(i))
This is about as secure as ECB, but that's still better than what you get from incorrect use of CTR that degenerates to multiple use of a one-time-pad. What you want is a tweakable block cipher. Just use the block using i as tweak. That is how LRW mode works, with a specific construction for the tweakable block cipher.

One of the constructions for the tweakable block cipher is encrypt(t ^ encrypt(plaintext)), a more efficient construction (but requires a larger key) is (t*k2)^encrypt((t*k2)^plaintext). In this construction * is multiplication in a finite field. * is a bit expensive, but still less than the cipher itself. And, * can be optimized if you are doing multiple operations where the different values of t are related.

You should take a look on the paper that introduced tweakable block ciphers. It explains the constructions much better than I could do.

as far as I know having a well known IV in, say, CBC is not a vulnerability.
It is, but only a minor weakness. With early disk encryptions, that simply used sector number as IV, it was possible to construct a file that when written to that file system would produce an easily recognizable pattern in the encrypted data. I have an example of such a file here http://kasperd.net/ivtest.txt

There might also be some magic in reading the whole block (since we are talking about block level devices) and having, say, a CBC over the block with an IV calculated with encrypt(IV ^ i) but I think that goes out of scope of my question.
The best way I know to produce an IV is to do a calculation over the plaintext of the entire sector except from the first block of the sector. You could say hash the complete sector with first block replaced by sector number and then encrypt the hash value. The advantage of such a construction is that any change anywhere in the sector will affect every block of the encrypted sector.

--

Do you care about the security of your wireless mouse?