Slashdot Mirror


Writing Linux Kernel Functions In CUDA With KGPU

An anonymous reader writes "Until today, GPGPU computing was a userspace privilege because of NVIDIA's closed-source policy and AMD's semi-open state. KGPU is a workaround to enable Linux kernel functionality written in CUDA. Instead of figuring out GPU specs via reverse-engineering, it simply uses a userspace helper to do CUDA-related work for kernelspace requesters. A demo in its current source repository is a modified eCryptfs, which is an encrypted filesystem used by Ubuntu and other distributions. With the accelerated performance of a GPU AES cipher in the Linux kernel, eCryptfs can get a 3x uncached read speedup and near 4x write speedup on an Intel X25-M 80G SSD. However, both the GPU cipher-based eCryptfs and the CPU cipher-based one are changed to use ECB cipher mode for parallelism. A CTR, counter mode, cipher may be much more secure, although the real vanilla eCryptfs uses CBC mode. Anyway, GPU vendors should think about opening their drivers and computing libraries, or at least providing a mechanism to make it easy to do GPU computing inside an OS kernel, given the fact that GPUs are so widely deployed and the potential future of heterogeneous operating systems."

101 comments

  1. AES-NI by RightSaidFred99 · · Score: 1

    Wonder how this compares in performance to AES-NI, because it sure as hell sounds a lot more complex and fragile.

    1. Re:AES-NI by adisakp · · Score: 1

      It might be more "complicated" but it's probably more useful since currently a lot more systems have GPUs than AES-NI, given that AES-NI is only on a subset of Intel's most recent CPUs.

    2. Re:AES-NI by Anonymous Coward · · Score: 0

      In AES-NI Performance Analyzed, Patrick Schmid and Achim Roos found, "... impressive results from a handful of applications already optimized to take advantage of Intel's AES-NI capability".[6] A performance analysis using the Crypto++ security library showed an increase in throughput from approximately 28.0 cycles per byte to 3.5 cycles per byte with AES/GCM versus a Pentium 4 with no acceleration.[7] [8]

      Looks like an 8x speedup with AES-NI, versus a 3-4x speedup using KGPU.

    3. Re:AES-NI by gman003 · · Score: 2

      Yes, but that was comparing a Pentium 4 (last one came out in 2006) to a brand-new processor (2011). That is NOT scientifically accurate - they are completely different designs, which will produce vastly different runtimes for the exact same instructions. How about doing a comparison between Crypto++ running on a 2500k, and Crypto++ running on a 2500k without being compiled with AES-NI support. That would be infinitely more rigorous.

    4. Re:AES-NI by Anonymous Coward · · Score: 2, Informative

      KGPU uses AES just as a demonstration; its architecture is general to any GPU-friendly algorithm.

    5. Re:AES-NI by DarkOx · · Score: 1

      Well I am sure it compares very favorably if you have an old CPU or a CPU of a different architecture which does not feature those instructions.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    6. Re:AES-NI by wagnerrp · · Score: 2

      It's for an entirely different application. AES-NI is one application specific set of instructions. While encryption and decryption is an application in which dedicated hardware can have tremendous gains, introducing dozens of application specific hardware modules into a CPU is going to fall to diminishing returns, and just result in an oversized, expensive, and power hungry CPU. It's an inherently limiting design methodology. Introducing GPU access to the kernel opens up a very powerful piece of hardware to be used for a wide range of applications, enhancing any process that is suitable for the architecture found on a GPU.

      Think of GPUs like picking up a new math co-processor 20 years ago.

    7. Re:AES-NI by JonySuede · · Score: 1

      They were comparing cycles per byte, not the total run time, so the difference between the CPU generations is less important. But the rest of your argument is still quite valid.

      --
      Jehovah be praised, Oracle was not selected
    8. Re:AES-NI by gman003 · · Score: 2

      No, cycle per byte is EXTREMELY important. Even contemporary processors can execute instructions in highly different amounts of time - a K5 can perform some instructions in 80% the time of an identically-clocked Pentium. And when you compare it to such wildly different architectures as Sandy Bridge and NetBurst, all bets are off. You might as well throw an 8086 and a SPARC into the mix, because that'll be about as rigorous.

    9. Re:AES-NI by OeLeWaPpErKe · · Score: 2

      The problem is that parallelized encryption is not as secure as the other modes. Let me show you the difference between CBC, ECB and CTR ( block(i) means the i'th block of data)

      1) CBC
        CBC(pwd, block(i)) == encrypt(pwd, block(i)) xor block(i-1)
      * block(-1) = hash(pwd, 0) (sometimes half the password is used as block(-1))

      2) ECB
        ECB(block(i)) = encrypt(block(i))

      3) CTR
        CTR(block(i)) = encrypt(block(i)) xor i

      I hope it's obvious why ECB and CTR are the only candidates for parallelization. CBC can only be done in sequence. But there's a huge issue. Ciphers have weak spots, and there are rainbow tables. So let's suppose you have an encrypted file in ECB mode.

      encrypt(block(1)) : encrypt(block(2)) : ... : encrypt(block(n)) * bing rainbow table hit ! (ie. somehow you're able to decrypt block(3))

      now you have a combination block(n), encrypt(block(n)) and password. Well you've broken the encryption. The problem is the contents of blocks are quite predictable (e.g. you will pretty much know every bit in an ext3 superblock if you know the size of the volume, so you can generate targeted rainbow tables). The only thing you need to find is the password.

      Suppose the same happens in CBC mode

      encrypt(block(1) xor initializer) : encrypt(block(2) xor encrypt(block(1) xor initializer) : encrypt(block(3) xor encrypt(block(2) xor encrypt(block(1) xor initializer)) ...

      Now block(1) is still perfectly predictable; block(1) xor initializer, however, is not. You have to generate 2^(passwordlength + blocksize)/2 rainbow tables before you'd get a single hit. Also, just because you get one hit doesn't mean it's the correct one (in ECB you know it's the correct one because the plaintext is meaningful. "Bob, I secretly loved your brother last night" is easily recognized as plaintext, while that same string xorred with a pseudorandom value doesn't make sense to anything). That means that you now have to find both the password and the plaintext. With a 256 bit password and 4 kb blocksize, that generally means you effectively have a "password" that's 4.5 kb. This makes CBC orders of magnitude harder to crack.

      It should be said that attacks on ECB or CTR, while a LOT easier, are only theoretical for recent algorithms (e.g. AES). However, CBC remains secure much longer than ECB, both using the same encryption algorithm. CBC 3DES encryption, for example, is considered safe (and it is very doubtful even the NSA or CIA has the resources even for CBC DES).

      So, in short, NVIDIA cheated.
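The practical difference between ECB and CBC can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_encrypt` is a hypothetical 4-round Feistel construction built from SHA-256 so the example runs with the standard library alone. It stands in for AES and is not a secure cipher. The modes follow the standard definitions: ECB encrypts each block independently, while CBC XORs each plaintext block with the previous ciphertext block before encrypting.

```python
import hashlib

BLOCK = 16  # block size in bytes, the same as AES


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for AES: a 4-round Feistel network. NOT secure."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        f = hashlib.sha256(key + bytes([rnd]) + right).digest()[:8]
        left, right = right, _xor(left, f)
    return left + right


def ecb_encrypt(key, blocks):
    # Every block is encrypted on its own, so this loop parallelizes trivially.
    return [toy_encrypt(key, b) for b in blocks]


def cbc_encrypt(key, iv, blocks):
    # Each block is XORed with the previous ciphertext block first,
    # so block i cannot start until block i-1 has been encrypted.
    out, prev = [], iv
    for b in blocks:
        prev = toy_encrypt(key, _xor(b, prev))
        out.append(prev)
    return out


key = b"demo key"
iv = bytes(BLOCK)  # all-zero IV, for illustration only
blocks = [b"A" * BLOCK, b"B" * BLOCK, b"A" * BLOCK]  # blocks 0 and 2 are equal

ecb = ecb_encrypt(key, blocks)
cbc = cbc_encrypt(key, iv, blocks)
print("ECB repeats:", ecb[0] == ecb[2])
print("CBC repeats:", cbc[0] == cbc[2])
```

Under ECB the two identical plaintext blocks encrypt to identical ciphertext blocks, which is the kind of structure a predictable block (such as a filesystem superblock) exposes. Under CBC the chaining hides the repetition, at the price of a sequential encryption loop.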

    10. Re:AES-NI by slew · · Score: 1

      Just a couple small nits to pick..

      Although CBC encryption needs to be done in sequence, CBC decryption can be done mostly in parallel (don't have to wait until you do the AES part of the previous block)...

      Also security is better than other modes only in some cases. As a trivial example, in CBC it's easier to tamper with the plain-text: all you have to do to flip a bit in the plaintext of a CBC encrypted stream is to flip that same bit in the previous block's cipher-text. Although that kills that previous block's decrypted plaintext, it makes it possible to easily and arbitrarily manipulate some things (of course if that is a threat model, you should really be doing a MAC, but that is another discussion)...

      So, in short, it depends... ;^)
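Both points about CBC can be checked with a small Python sketch. The assumptions are the same kind as in any toy demo: `toy_encrypt` and `toy_decrypt` are a hypothetical SHA-256-based Feistel cipher standing in for AES (not secure), and the message contents are made up. Decryption of block i needs only ciphertext blocks i and i-1, so it has no sequential dependency, and flipping bits in ciphertext block i-1 flips the same bits in plaintext block i.

```python
import hashlib

BLOCK = 16


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _f(key: bytes, rnd: int, half: bytes) -> bytes:
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:8]


def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for AES (4-round Feistel). NOT secure."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        left, right = right, _xor(left, _f(key, rnd, right))
    return left + right


def toy_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _f(key, rnd, left)), left
    return left + right


def cbc_encrypt(key, iv, blocks):
    out, prev = [], iv
    for b in blocks:
        prev = toy_encrypt(key, _xor(b, prev))
        out.append(prev)
    return out


def cbc_decrypt(key, iv, cipher):
    # Each output needs only cipher[i] and cipher[i-1]; no result feeds the
    # next iteration, so every element could be computed in parallel.
    prevs = [iv] + cipher[:-1]
    return [_xor(toy_decrypt(key, c), p) for c, p in zip(cipher, prevs)]


key = b"demo key"
iv = bytes(BLOCK)
plain = [b"header block 001", b"amount=0000100.0"]
cipher = cbc_encrypt(key, iv, plain)

# Tamper: change byte 7 of the SECOND plaintext block from "0" to "9"
# by flipping the matching bits in the FIRST ciphertext block.
first = bytearray(cipher[0])
first[7] ^= ord("0") ^ ord("9")
tampered = [bytes(first), cipher[1]]

result = cbc_decrypt(key, iv, tampered)
print(result[1])  # b'amount=9000100.0'
print(result[0] == plain[0])  # False: the first block is now garbage
```

The attacker never needs the key, which is why a MAC is the usual answer when tampering is part of the threat model.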

    11. Re:AES-NI by draconx · · Score: 2

      3) CTR
          CTR(block(i)) = encrypt(block(i)) xor i

      Sorry, but what you describe is not CTR mode. Using your notation, CTR would look (roughly) like this:

          CTR(block(i)) = encrypt(counter) xor block(i)

      where "counter" is usually constructed by concatenating a nonce value with i
      (the block number). It is critical that the resulting counter never be re-used
      with the same key for a different block.
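A minimal Python sketch of CTR mode as described here, under labeled assumptions: HMAC-SHA-256 truncated to 16 bytes is a hypothetical stand-in for `encrypt(counter)` (a real implementation would use AES), and the key, nonce, and block contents are made up. The point is that every block can be processed independently and in any order, which is what makes the mode attractive for a GPU.

```python
import hashlib
import hmac

BLOCK = 16


def keystream_block(key: bytes, nonce: bytes, i: int) -> bytes:
    # counter = nonce || block number; HMAC stands in for the block cipher.
    counter = nonce + i.to_bytes(8, "big")
    return hmac.new(key, counter, hashlib.sha256).digest()[:BLOCK]


def ctr_block(key: bytes, nonce: bytes, i: int, block: bytes) -> bytes:
    # Encryption and decryption are the same operation: XOR with the keystream.
    return bytes(a ^ b for a, b in zip(block, keystream_block(key, nonce, i)))


key = b"demo key"
nonce = b"nonce-01"
blocks = [b"block number %03d" % i for i in range(4)]

# Encrypt the blocks in an arbitrary order: no block depends on another.
cipher = [b""] * len(blocks)
for i in [2, 0, 3, 1]:
    cipher[i] = ctr_block(key, nonce, i, blocks[i])

sequential = [ctr_block(key, nonce, i, b) for i, b in enumerate(blocks)]
recovered = [ctr_block(key, nonce, i, c) for i, c in enumerate(cipher)]
print(cipher == sequential)  # True
print(recovered == blocks)   # True
```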

    12. Re:AES-NI by kasperd · · Score: 2

      CTR(block(i)) = encrypt(block(i)) xor i

      That's not how CTR works. Rather it works like

      CTR(block(i)) = encrypt(IV || i) xor block(i)

      However since most storage encryptions cheat and use an IV that is the same every time you write to the same logical sector, the CTR mode will actually turn into a pseudorandom one-time-pad. This means if you ever write to the same logical sector number twice, you are potentially leaking data. In the case of ecryptfs it is probably only a problem if you overwrite sectors in an existing file as the design of ecryptfs would make it easy to use a new IV per file, but not per sector.
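The leak from rewriting a sector under a fixed IV can be shown with a short Python sketch. As before, HMAC-SHA-256 is a hypothetical stand-in for the block cipher, and the sector number and plaintexts are invented for the demo. When the same keystream is used twice, the XOR of the two ciphertexts equals the XOR of the two plaintexts, so anyone who knows one version of the sector recovers the other without touching the key.

```python
import hashlib
import hmac

BLOCK = 16


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def ctr_block(key: bytes, iv: bytes, i: int, block: bytes) -> bytes:
    # keystream = PRF(key, IV || i); HMAC stands in for the block cipher.
    stream = hmac.new(key, iv + i.to_bytes(8, "big"), hashlib.sha256).digest()
    return xor(block, stream[:BLOCK])


key = b"demo key"
iv = b"sector-7"  # fixed per logical sector: the "cheat" described above

old = b"password=hunter2"  # first write to the sector
new = b"password=letmein"  # later overwrite of the same sector

c_old = ctr_block(key, iv, 0, old)
c_new = ctr_block(key, iv, 0, new)

# An observer holding both ciphertexts learns old XOR new without the key.
leak = xor(c_old, c_new)
print(leak == xor(old, new))   # True
# Knowing (or guessing) the old contents then reveals the new contents.
print(xor(leak, old) == new)   # True
```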

      If you want an encryption that is highly parallelizable and doesn't lose a lot of security when you cut corners and use a fixed IV, I think LRW is your best bet. (I don't like the name LRW as I find it an offence against the inventors of tweakable block ciphers, but I am not aware of any other name for that mode, and I don't even know who invented it.)

      --

      Do you care about the security of your wireless mouse?
    13. Re:AES-NI by makomk · · Score: 1

      introducing dozens of application specific hardware modules into a CPU is going to fall to diminishing returns, and just result in an oversized, expensive, and power hungry CPU

      More oversized, expensive and power-hungry than the GeForce GTX 480 they used for this benchmark? It's right at the limits of manufacturability in terms of chip size, costs hundreds of dollars, and has a 300W power consumption at load. You'd need an awful lot of application-specific hardware modules before you even got close to that.

    14. Re:AES-NI by doublebackslash · · Score: 2

      I'm curious, would CTR be less vulnerable if one XORed before encryption? Call the operation CXR.
      Where ^ is the XOR operator
      CXR(block(i)) = encrypt(IV ^ i ^ block(i))

      I'm not sure if there is analysis that can be done on the block at that point that makes this undesirable. Methinks not because as far as I know having a well known IV in, say, CBC is not a vulnerability. That implies to me that the security still rests firmly in the key. At the very least it stops being vulnerable to bitwise changes and reinstates the Confusion and Diffusion principles.

      There might also be some magic in reading the whole block (since we are talking about block level devices) and having, say, a CBC over the block with an IV calculated with encrypt(IV ^ i) but I think that goes out of scope of my question.

      --
      md5sum /boot/vmlinuz
      d41d8cd98f00b204e9800998ecf8427e /boot/vmlinuz
    15. Re:AES-NI by kasperd · · Score: 2

      CXR(block(i)) = encrypt(IV ^ i ^ block(i))

      This is about as secure as ECB, but that's still better than what you get from incorrect use of CTR that degenerates to multiple use of a one-time-pad. What you want is a tweakable block cipher. Just encrypt the block using i as the tweak. That is how LRW mode works, with a specific construction for the tweakable block cipher.

      One of the constructions for the tweakable block cipher is encrypt(t ^ encrypt(plaintext)), a more efficient construction (but requires a larger key) is (t*k2)^encrypt((t*k2)^plaintext). In this construction * is multiplication in a finite field. * is a bit expensive, but still less than the cipher itself. And, * can be optimized if you are doing multiple operations where the different values of t are related.

      You should take a look at the paper that introduced tweakable block ciphers. It explains the constructions much better than I could do.
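The second construction, (t*k2)^encrypt((t*k2)^plaintext), can be sketched in Python as follows. Everything here is illustrative rather than a conforming LRW implementation: `toy_encrypt`/`toy_decrypt` are a hypothetical SHA-256-based Feistel cipher standing in for AES (not secure), the field is GF(2^128) with the reduction polynomial x^128 + x^7 + x^2 + x + 1, and the byte and bit ordering is simply big-endian for convenience, which real standards define differently.

```python
import hashlib

BLOCK = 16


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _f(key: bytes, rnd: int, half: bytes) -> bytes:
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:8]


def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for AES (4-round Feistel). NOT secure."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        left, right = right, _xor(left, _f(key, rnd, right))
    return left + right


def toy_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _f(key, rnd, left)), left
    return left + right


def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^128) modulo x^128 + x^7 + x^2 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> 128:  # reduce as soon as the degree reaches 128
            a ^= (1 << 128) | 0x87
    return result


def tweak_mask(k2: int, tweak: int) -> bytes:
    return gf_mul(tweak, k2).to_bytes(BLOCK, "big")


def lrw_encrypt(k1: bytes, k2: int, tweak: int, block: bytes) -> bytes:
    mask = tweak_mask(k2, tweak)
    return _xor(mask, toy_encrypt(k1, _xor(mask, block)))


def lrw_decrypt(k1: bytes, k2: int, tweak: int, block: bytes) -> bytes:
    mask = tweak_mask(k2, tweak)
    return _xor(mask, toy_decrypt(k1, _xor(mask, block)))


k1 = b"first key"
k2 = int.from_bytes(hashlib.sha256(b"second key").digest()[:BLOCK], "big")
plain = b"same plaintext!!"

c1 = lrw_encrypt(k1, k2, 1, plain)  # block index 1 used as the tweak
c2 = lrw_encrypt(k1, k2, 2, plain)  # block index 2 used as the tweak
print(c1 != c2)
print(lrw_decrypt(k1, k2, 1, c1) == plain)
```

Each block depends only on its own index and contents, so the mode stays as parallel as ECB while identical plaintext blocks at different positions no longer produce identical ciphertext.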

      as far as I know having a well known IV in, say, CBC is not a vulnerability.

      It is, but only a minor weakness. With early disk encryptions, that simply used sector number as IV, it was possible to construct a file that when written to that file system would produce an easily recognizable pattern in the encrypted data. I have an example of such a file here http://kasperd.net/ivtest.txt

      There might also be some magic in reading the whole block (since we are talking about block level devices) and having, say, a CBC over the block with an IV calculated with encrypt(IV ^ i) but I think that goes out of scope of my question.

      The best way I know to produce an IV is to do a calculation over the plaintext of the entire sector except for the first block of the sector. You could, say, hash the complete sector with the first block replaced by the sector number and then encrypt the hash value. The advantage of such a construction is that any change anywhere in the sector will affect every block of the encrypted sector.
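One way to read that IV construction is sketched below in Python. This is an interpretation under assumptions, not a description of any shipping disk encryption: the toy Feistel cipher stands in for AES, SHA-256 truncated to one block is the hash, and the sector is encrypted with plain CBC. Decryption still works because CBC can recover every block except the first without the IV, after which the IV can be recomputed.

```python
import hashlib

BLOCK = 16


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _f(key: bytes, rnd: int, half: bytes) -> bytes:
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:8]


def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for AES (4-round Feistel). NOT secure."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        left, right = right, _xor(left, _f(key, rnd, right))
    return left + right


def toy_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _f(key, rnd, left)), left
    return left + right


def derive_iv(key: bytes, sector_no: int, tail: list) -> bytes:
    # Hash the sector with its first block replaced by the sector number,
    # then encrypt the (truncated) hash value.
    data = sector_no.to_bytes(BLOCK, "big") + b"".join(tail)
    return toy_encrypt(key, hashlib.sha256(data).digest()[:BLOCK])


def encrypt_sector(key: bytes, sector_no: int, blocks: list) -> list:
    prev = derive_iv(key, sector_no, blocks[1:])
    out = []
    for b in blocks:
        prev = toy_encrypt(key, _xor(b, prev))
        out.append(prev)
    return out


def decrypt_sector(key: bytes, sector_no: int, cipher: list) -> list:
    # Blocks 1..n-1 need only neighbouring ciphertext blocks, not the IV.
    tail = [_xor(toy_decrypt(key, c), p) for c, p in zip(cipher[1:], cipher[:-1])]
    iv = derive_iv(key, sector_no, tail)
    return [_xor(toy_decrypt(key, cipher[0]), iv)] + tail


key = b"demo key"
blocks = [b"block %02d of data" % i for i in range(4)]
cipher = encrypt_sector(key, 7, blocks)

changed = blocks[:-1] + [b"block 03 of DATA"]  # edit only the last block
cipher2 = encrypt_sector(key, 7, changed)

print(decrypt_sector(key, 7, cipher) == blocks)
print(all(a != b for a, b in zip(cipher, cipher2)))
```

Editing only the last plaintext block changes the derived IV, and the CBC chain then carries that change through every ciphertext block of the sector.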

      --

      Do you care about the security of your wireless mouse?
  2. Did a anyone else's brain switch off half way.... by Anonymous Coward · · Score: 0

    ..... through the summary??? Sorry, But, I had to read it 3 times, to sink in.... Sorry... but, as a geek myself, I find this just far too geeky!....Sorry. (hands back geek card!)

  3. Best possible example by Anonymous Coward · · Score: 2, Interesting

    Hand off encryption routines to a closed source black box. Brilliant.

    1. Re:Best possible example by icebraining · · Score: 2

      Yes, because the CPU isn't, we're all running open hardware /s

    2. Re:Best possible example by Jaqenn · · Score: 4, Insightful

      As opposed to having them done by my Intel CPU, for which Intel has helpfully provided full schematics.

      --
      You are awash in a sea of fiercely stated opinions. Obvious exits are: 'File->Quit', 'Reply', and 'Page Down'.
    3. Re:Best possible example by Anonymous Coward · · Score: 1

      Good point.

      In fact, Intel CPUs are worse in this regard, as they contain special AES instructions. GPUs, as far as I know, don't do this yet, so you'll now have a higher level of confidence that the correct code is indeed running.

    4. Re:Best possible example by Lunix+Nutcase · · Score: 1

      Yes, and those AES instructions are well documented.

    5. Re:Best possible example by Noughmad · · Score: 1

      How can you be sure that what's going on on the processor is the same thing as what's described in the documentation?

      --
      PlusFive Slashdot reader for Android. Can post comments.
  4. Question: by Jaqenn · · Score: 3, Interesting

    (I have never written kernel level code, and the statement that follows is only from listening to what other people are doing)

    I thought that a tiny bit of kernel code reflecting calls into a user level process was old news, and has become established as the preferred development model. Is there a reason that it's undesirable?

    Because the summary makes it sound like we're sad to be following this model, and we're only doing it because we can't pull NVidia's driver source into the linux kernel.

    --
    You are awash in a sea of fiercely stated opinions. Obvious exits are: 'File->Quit', 'Reply', and 'Page Down'.
    1. Re:Question: by sockman · · Score: 2

      The NVIDIA extensions are only available in userland.
      So a call to the kernel level crypto system gets routed back out to user land, and back to kernel land via the GPU module. That's why we're sad.

    2. Re:Question: by killmenow · · Score: 2

      I've never written kernel modules either so take this with a grain of salt: my understanding is there is a cost associated with the switching/passing back and forth between userspace and kernelspace and it's best to minimize that. I remember similar discussions going back as far as NT4 when Microsoft decided to implement the entire GDI in kernelspace, which is what led to a billion BSODs because video drivers are notoriously shitty code and you'd be way better off stability-wise having that code run in userspace. Performance-wise, not so much.

      The interesting thing about encryption code working this way is there is such a tremendous speedup by running the bulk of the encryption code on the GPU as opposed to the CPU that the cost incurred in the user/kernel switch is well worth it.

    3. Re:Question: by Hatta · · Score: 1

      There is overhead in a context switch from kernel space to user space.

      --
      Give me Classic Slashdot or give me death!
    4. Re:Question: by afidel · · Score: 2

      The reason it's undesirable is the hit you take when moving back and forth between kernel space and user space. The move in each direction requires the CPU to change ring levels which increases latency.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    5. Re:Question: by blair1q · · Score: 1

      Context-switching is always expensive, but avoiding it without regard to the actual benefit leads to system bloat, so learning where it is and isn't significant is a good skill to have.

      The speedup from GPU hardware is so big that it's worth giving up a few hundred cycles of context switching to get a few thousand cycles of reduction in computing.

      But (not having read TFA yet) I wonder just how much kernel functionality is really that parallelizable. When does the context switching cost you more than the CUDA gains you? Crypto stuff relying on gigundous keys would be a no-brainer, but where else could it be economical?
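The break-even question can be framed with a back-of-the-envelope cost model. The Python sketch below is only that: the function names are hypothetical and every number is a made-up placeholder, not a measurement of KGPU or of any real hardware. It assumes a fixed per-request overhead (context switches plus transfers) and a constant per-byte cost on each side.

```python
def offload_pays_off(n_bytes, cpu_ns_per_byte, gpu_ns_per_byte, overhead_ns):
    """True when the GPU path, including its fixed overhead, beats the CPU path."""
    cpu_time = n_bytes * cpu_ns_per_byte
    gpu_time = overhead_ns + n_bytes * gpu_ns_per_byte
    return gpu_time < cpu_time


def break_even_bytes(cpu_ns_per_byte, gpu_ns_per_byte, overhead_ns):
    """Request size above which offloading wins, or None if it never does."""
    if cpu_ns_per_byte <= gpu_ns_per_byte:
        return None
    return overhead_ns / (cpu_ns_per_byte - gpu_ns_per_byte)


# Placeholder figures, chosen only to make the arithmetic easy to follow.
CPU_NS_PER_BYTE = 10.0
GPU_NS_PER_BYTE = 2.0
OVERHEAD_NS = 50_000.0

print(break_even_bytes(CPU_NS_PER_BYTE, GPU_NS_PER_BYTE, OVERHEAD_NS))  # 6250.0
print(offload_pays_off(4096, CPU_NS_PER_BYTE, GPU_NS_PER_BYTE, OVERHEAD_NS))   # False
print(offload_pays_off(65536, CPU_NS_PER_BYTE, GPU_NS_PER_BYTE, OVERHEAD_NS))  # True
```

With these placeholder values a single 4 KB page is not worth shipping to the GPU while a 64 KB batch is, which suggests that batching requests matters as much as how parallel the algorithm is.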

    6. Re:Question: by Anonymous Coward · · Score: 1

      Many developers feel that Nvidia's userspace driver workaround, only done to avoid licensing issues, shouldn't be permitted at all. This would be seen as validating Nvidia's actions.

      It's also a giant architectural hack so that won't help matters.

    7. Re:Question: by Anonymous Coward · · Score: 1

      a tiny bit of kernel code reflecting calls into a user level process

      You mean generally? This could be said of micro-kernels but the Linux kernel is monolithic; drivers for devices typically live entirely inside the kernel.

      That being said I don't think it's necessarily desirable to pull every conceivable hardware interaction into the kernel. There is an endless variety of hardware and APIs. Why must all of this churn live in the kernel? The kernel<-->user-space bridge that was built to make the GPU vendors user-space API accessible by the kernel isolates the kernel from the frequent driver updates published by the vendor. The vendor can distribute all the drivers it wants, create new and vastly different hardware and the bridge doesn't have to change as long as the user space API survives.

      Note: the above isn't a 'rights' argument; it applies whether or not the hardware and/or drivers are 'open.'

      If you've run 'menuconfig' et al. recently and waded through the thousands of devices with their subtle dependencies and relationships, it might occur to you that this may not scale forever. Relegating some of the less ubiquitous stuff to user space through a robust and common interface could be a good idea.

      The world is messy. There will always be stuff that can't be sanitized by the kernel gnomes. Making these cases work smoothly contributes to world domination.

    8. Re:Question: by Jah-Wren+Ryel · · Score: 1

      Crypto stuff relying on gigundous keys would be a no-brainer, but where else could it be economical?

      Maybe RAID computations. Block-level data-deduplication is starting to catch on and that needs to hash every block written to disk. I bet that could benefit from a GPU but the userland overhead may be enough to kill the practicality, at least for anything but long streaming writes.

      --
      When information is power, privacy is freedom.
    9. Re:Question: by emanem · · Score: 1

      I've written kernel code in both OpenGL (GPGPU old school)/OpenCL.
      Main issue might be context switching? Or writing GPU binary code without having to compile via driver (i.e. a la math accelerator FPU?)
      Cheers!

    10. Re:Question: by PoochieReds · · Score: 4, Interesting

      There are also other concerns than the context switch overhead...particularly when dealing with filesystems or data storage devices.

      For instance, suppose part of your userspace daemon gets swapped out, and you now need to upcall to userspace. That part that got paged out then has to be paged back in. If memory is tight, then the kernel may have to free some memory, and it may decide to flush out dirty data to the filesystem or device that is dependent on the userspace daemon. At that point, you're effectively deadlocked.

      Most of those sorts of problems can be overcome with careful coding and making sure the important parts of the daemon are mlocked, but you do have to be careful and it's not always straightforward to do that.

    11. Re:Question: by sjames · · Score: 1

      What I would like to know is since they're already taking the hit for downcalls into userspace, why not use fuse instead and let the userland filesystem daemon use the GPU. Why produce yet another mechanism to protect the kernel from the weirdness that can happen when it depends on userspace rather than the other way around?

  5. Re:Did a anyone else's brain switch off half way.. by h4rr4r · · Score: 4, Informative

    GTFO!
    This is what should be on slashdot, not stories about the latest iphone.

  6. F*ck Nvidia AND AMD by Anonymous Coward · · Score: 1

    Until they open-source drivers, I refuse to buy them. Stuff like this is typically a nightmare to install and keep running anyway.

    1. Re:F*ck Nvidia AND AMD by blair1q · · Score: 1

      Just what are you using for graphics hardware, then? Intel's integrated core?

    2. Re:F*ck Nvidia AND AMD by Anonymous Coward · · Score: 0

      Serial Terminal.

    3. Re:F*ck Nvidia AND AMD by Anonymous Coward · · Score: 0

      Yeah, right. Good luck running Battlefield 3 on that crap.

    4. Re:F*ck Nvidia AND AMD by jd · · Score: 1

      The Hercules graphics card. :)

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    5. Re:F*ck Nvidia AND AMD by TeknoHog · · Score: 1

      I only used open source graphics drivers, including Intel's integrated, until about 6 months ago when I needed to run some OpenCL code on a Radeon. There is nothing wrong with Intel graphics and the opensource Radeon drivers, unless you are a gamer or need serious GPGPU power. Both are capable of plenty of 3D, for example molecular modelling in my case.

      I am posting this on a Powerbook running Linux, and for some strange reason AMD does not release binary drivers for PPC Linux ;) but the opensource Radeon driver is good enough. There are some artefacts in 3D but no serious problems.

      --
      Escher was the first MC and Giger invented the HR department.
    6. Re:F*ck Nvidia AND AMD by gerddie · · Score: 1

      You might want to rethink your opinion on AMD, they are getting there: http://www.x.org/wiki/RadeonFeature

    7. Re:F*ck Nvidia AND AMD by serviscope_minor · · Score: 1

      Just what are you using for graphics hardware, then? Intel's integrated core?

      Yes, why? I don't play 3D games, so it's fine and stable.

      --
      SJW n. One who posts facts.
    8. Re:F*ck Nvidia AND AMD by dbIII · · Score: 1

      It looks like the old SGI guys at Nvidia know that as soon as source is released they are going to get jumped on by patent trolls and have to spend a lot of time and money on pointless court cases that can do nothing of value to anyone apart from shifting money into patent troll pockets. They've been bitten once before and the closed drivers are the result.

    9. Re:F*ck Nvidia AND AMD by Noughmad · · Score: 1

      Uphill, both ways?

      --
      PlusFive Slashdot reader for Android. Can post comments.
  7. Wow by killmenow · · Score: 2

    I came to read a discussion of writing kernel functions in CUDA and a discussion about the vagaries of encryption methods broke out.

    1. Re:Wow by Obfuscant · · Score: 1

      I came to read a discussion of writing kernel functions in CUDA and a discussion about the vagaries of encryption methods broke out.

      Be careful what you say, next we'll have a hockey game break out.

      I'm sorry, but am I the only one here who thinks this is, well, not a good way to go? Even if the code could be kernel-space code on the GPU? I mean, if I buy a CUDA GPU, I'm doing it because I have serious computing I want to do on it, not because I want my file system reads to be faster. I'd be rather miffed if I spent the time writing my CUDA code to speed things up and then found out it wasn't speeding things up because the GPU was already busy doing encryption on the filesystem.

      This seems a lot like the problem that cheap HP printers cause. You buy a $50 printer, but the software "driver" consumes the computer you've hooked it to, to the point where you should just consider the computer as part of the printer and get a new one to do computing on.

    2. Re:Wow by wild_berry · · Score: 1

      They're racking their brains as to what to do next.

      I would aim for kernel threads running directly through CUDA and the Scheduler knowing the performance profile of suitable work for the GPU and the message-passing cost of moving work to the GPU^H^H^H parallelism co-processor. Make the interface right and you should be able to shift tasks across heterogeneous processing units. Do it perfectly and you can have a Linux Virtual Processor model which allows you to start running a task on your desktop, shuffle it to a laptop for transit, pare it down to use on your mobile phone, buy some CPU time from an internet cluster to grind through some calculations before transferring it home. Choose x86: there's already enough x86 junk in other trees, and it might fix up the ARM shenanigans too!

    3. Re:Wow by tibit · · Score: 1

      You seem to be seriously overstating the impact of host-based printing. Obviously when you're not printing (and that's probably most of the time!), there's no overhead. And when you are printing, then the rasterizer consumes a little bit of memory and plenty of CPU, but that's transient. I would never venture as far as calling it "consuming" the computer.

      I haven't personally felt it to be a problem, and I'm using a host-based printer (HP LJ P1006). It spits out about 17 pages per minute, not too shabby if you ask me. Having a CPU capable of rasterizing that fast in the printer itself would probably double its cost, so I'm not complaining. The Core2 Duo in the iMac is already paid for ;)

      Heck, I've used plenty of PCL-only LaserJets, and they were -- for all practical purposes -- host-rendering printers. Some of them could render scalable fonts, but that only helped if you were printing text. As soon as there were graphics involved, or output from professional typesetting/design packages, the printer was receiving a huge monochrome bitmap wrapped in a couple PCL instructions. In all recent-enough cases, though, the host CPU was much faster than the one on the printer, so it actually helped with throughput if the PC would do the rasterization.

      --
      A successful API design takes a mixture of software design and pedagogy.
    4. Re:Wow by Obfuscant · · Score: 1

      You seem to be seriously overstating the impact of host-based printing.

      Uhhh, no. I was there. Firsthand experience.

      Obviously when you're not printing (and that's probably most of the time!), there's no overhead.

      Other than the half a dozen monitor demons that tell you when there are updates for the drivers, when the printer is out of paper, when the printer is out of ink, when the printer is low on ink and would you like to buy official HP products now?, and whatever other things they had demons doing.

      then the rasterizer consumes a little bit of memory and plenty of CPU,

      The last 200Mb of disk is "a little bit"?

      I haven't personally felt it to be a problem,

      And thus it cannot have been a problem for me. Thanks.

      Having a CPU capable of rasterizing that fast in the printer itself would probably double its cost,

      Wow. A whole $100 for a printer. So the printer could actually be a printer and not just the printhead and servos and the rest of the printer installed as drivers on your main CPU.

    5. Re:Wow by tibit · · Score: 1

      The monitors and stuff are not a problem inherent in host-based printing. Not at all. For reasons better left to be explained by marketing types, HP's Windows printing support for home printer product line sucks donkey balls. Their support on Linux and Mac doesn't come with any of the overhead.

      So what you're complaining against is not host based printing per se, but broken drivers peddled by HP and others, bundled with bloatware. There's no inherent technical reason for it to be that way. And the problem is not because printer is not performant enough. The problem is bloatware. What you feel as the problem is not the rasterizer, it's everything else. Note that the same bloatware, unfortunately, comes with printers that have a built-in rasterizer, too.

      As for 200MB of disk: you are not complaining about what the rasterizer is doing, merely about bloatware that came with the printer. A monochrome letter-sized page at 600dpi takes 4.2 Mbytes. At that resolution, you could stuff uncompressed bitmaps for about 45 monochrome pages in 200MB, or 12 pages worth of CMYK bitmaps.

      --
      A successful API design takes a mixture of software design and pedagogy.
    6. Re:Wow by sjames · · Score: 1

      To be fair, most of that crap isn't actually the printer driver, it's the HP marketing trojan combined with REALLY bad design. A sane driver would only check paper and ink just before, during and just after a print job.

  8. AES speed by afidel · · Score: 1

    I wonder if this would be any faster than an implementation that took advantage of the hardware AES on the newer Intel CPU's? Latency should be lower for the CPU based version as would memory bandwidth.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    1. Re:AES speed by Anonymous Coward · · Score: 0

      Yes, it would be faster with AES-NI. But APUs may make a difference because the GPU and CPU can then share even the L3 cache.
      Another issue is: what if we use other algorithms instead of AES? Designing specialized instructions for every algorithm doesn't make sense.

    2. Re:AES speed by afidel · · Score: 1

      I'm really not sure why you would use anything other than AES at this point, and the AES-NI instructions also foil most side-channel attacks.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  9. Re:Did a anyone else's brain switch off half way.. by thePowerOfGrayskull · · Score: 1

    It wasn't too geeky, but it was written as if by someone with ADD. Perhaps no surprise?

  10. Re:Did a anyone else's brain switch off half way.. by Anonymous Coward · · Score: 0

    Poorly written maybe, but not that geeky. If you were that confused, I'm not certain you ever had a geek card to hand back.

  11. GPU by MM-tng · · Score: 0

    The hardware that is so brilliantly made nobody is allowed to know how it works. And as a result it actually doesn't work. I say congratulations.

  12. All in good time by deadline · · Score: 2

    Proofs of concept are nice, but when the GPU is firmly planted in the CPU, this will make more sense. The PCI bus can be a bottleneck in these types of situations. AMD Fusion is a great example of this idea.

    --
    HPC for Primates. Read Cluster Monkey
    1. Re:All in good time by cnettel · · Score: 1

      If you are indeed reading from something like an SSD, the data bandwidth shouldn't be a problem. The data pipe to any recent GPU is much wider than SATA, and quite favorable latency-wise as well. Of course, you are adding another layer of latency and transfers, but the situation is quite different from a case where you are offloading some computation whose data could otherwise stay in the CPU cache all the time.

  13. Recipe for a corrupted filesystem by drewm1980 · · Score: 1

    Wow, the fragility of an encrypted file system plus the instability of a GPU, implemented in the kernel. Do not even read TFA without doing a full backup of your system.

    1. Re:Recipe for a corrupted filesystem by calmofthestorm · · Score: 2

      fragility of an encrypted file system{citationneeded}.

      I've been using them since 2006. Never had any problems.

      --
      93rd rule of Slashdot: No matter how obvious my sarcasm is, my comment will be taken seriously by someone.
  14. Cool test... by Panaflex · · Score: 1

    As someone who's doing a lot of the same work, this is pretty spectacular! I'm surprised they get > 100MB/sec in software - but I guess that's due to using ECB mode vs. CBC. I think the real I/O limit here is probably in the user/kernel mem copies - context switch weight can be optimized with good buffer alignments.

    We did a lot of testing with CUDA under openssl 3-4 years ago - in the end it was better to just stick with software. The latencies are the real killers.

    --
    I said no... but I missed and it came out yes.
    1. Re:Cool test... by Anonymous Coward · · Score: 0

      You may wanna take a look at SSLShader (http://shader.kaist.edu/sslshader/) and rethink your 3~4-year-old work again.

      Although it is still 'will be available later'

    2. Re:Cool test... by Panaflex · · Score: 2

      That's a pretty cool project! But I do think they still suffer the same latency problems - in order to take advantage of the GPU's full throughput - they have to have a huge number of client connections (chosen solution) or a very deep queue (hard to optimize, only works with larger file sizes).

      Certainly this is a great solution for what it is - but it's not a general purpose solution. And you can get a much more reliable and supported solution out there. (e.g. BIG-IP SSL Accelerator, which uses certified FIPS 140-2 hardware.)

      --
      I said no... but I missed and it came out yes.
  15. Re:Did a anyone else's brain switch off half way.. by MarkRose · · Score: 1

    Completely off-topic, but I've been looking for a decent ssh client for my crapberry -- thanks!

    --
    Be relentless!
  16. Protection by Adrian+Lopez · · Score: 1

    Is it a good idea for the protected kernel to rely on unprotected code for critical functions such as filesystem operations? I know that user-space code cannot directly interfere with the kernel, but it also doesn't have to do anything the kernel requests of it. Unless the kernel is designed to treat such user-space code as altogether untrustworthy, it seems to me a bad idea for the kernel to rely on user-space code in this manner.

    --
    "In prison you just have to shut your eyes and take it. Here you have to shut your eyes and give it."
  17. ECB Mode is totally insecure by jasonwc · · Score: 3, Interesting

    I hope this is just a proof-of-concept design because ECB mode should not be used for this purpose. Wikipedia provides a pretty obvious example of the weakness of ECB mode:

    "The disadvantage of this method is that identical plaintext blocks are encrypted into identical ciphertext blocks; thus, it does not hide data patterns well. In some senses, it doesn't provide serious message confidentiality, and it is not recommended for use in cryptographic protocols at all. A striking example of the degree to which ECB can leave plaintext data patterns in the ciphertext is shown below; a pixel-map version of the image on the left was encrypted with ECB mode to create the center image, versus a non-ECB mode for the right image."

    http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation#Initialization_vector_.28IV.29
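
The pattern leak described in that quote can be reproduced with a small sketch. To stay self-contained it uses a toy 4-round Feistel cipher built on SHA-256 as a stand-in for AES; the cipher, the key, and the function names are all hypothetical and are not what eCryptfs or KGPU use. The counter mode here also omits the nonce a real design would need. The point is only that ECB maps equal plaintext blocks to equal ciphertext blocks, while a counter mode does not.

```python
import hashlib

BLOCK = 16  # bytes per block, the same size as an AES block

def _round(key, i, half):
    # Feistel round function: keyed hash truncated to half a block.
    return hashlib.sha256(key + bytes([i]) + half).digest()[:BLOCK // 2]

def toy_encrypt_block(key, block):
    """Toy 4-round Feistel cipher. NOT AES and NOT secure; illustration only."""
    left, right = block[:BLOCK // 2], block[BLOCK // 2:]
    for i in range(4):
        left, right = right, bytes(a ^ b for a, b in zip(left, _round(key, i, right)))
    return left + right

def ecb_encrypt(key, data):
    # ECB: every block is encrypted on its own with the same key.
    return [toy_encrypt_block(key, data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def ctr_encrypt(key, data):
    # CTR: encrypt a per-block counter and XOR the result into the plaintext.
    out = []
    for n, i in enumerate(range(0, len(data), BLOCK)):
        keystream = toy_encrypt_block(key, n.to_bytes(BLOCK, "big"))
        out.append(bytes(a ^ b for a, b in zip(data[i:i + BLOCK], keystream)))
    return out

key = b"demo key"                        # hypothetical key
plaintext = b"A" * 48 + b"B" * 16        # three identical blocks, then a different one
ecb = ecb_encrypt(key, plaintext)
ctr = ctr_encrypt(key, plaintext)

print(len(set(ecb)))  # number of distinct ciphertext blocks under ECB
print(len(set(ctr)))  # number of distinct ciphertext blocks under CTR
```

With ECB the three repeated plaintext blocks collapse into one repeated ciphertext block, which is the same effect that keeps the outline visible in the Wikipedia image.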

    1. Re:ECB Mode is totally insecure by drinkypoo · · Score: 1

      I hope so too, because I was excited by the idea of using my CUDA-capable GPU to do encryption, which might actually get me to use it. It's barely ticking over providing Compiz functions.

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    2. Re:ECB Mode is totally insecure by Anonymous Coward · · Score: 0

      eCryptfs uses CBC with a secret IV.

    3. Re:ECB Mode is totally insecure by jasonwc · · Score: 1

      According to the summary, the GPU enhanced version uses ECB:

      "A demo in its current source repository is a modified eCryptfs, which is an encrypted filesystem used by Ubuntu and other distributions . . . .However, both the GPU cipher-based eCryptfs and the CPU cipher-based one are changed to use ECB cipher mode for parallelism. "

    4. Re:ECB Mode is totally insecure by Melkhior · · Score: 2

      And because a picture straight from the horse's mouth is worth a thousand words, here's what NVidia has to say about it:

      http://http.developer.nvidia.com/GPUGems3/gpugems3_ch36.html

      Go to 36.5, figure 36-11 & 36-13.

    5. Re:ECB Mode is totally insecure by lucag · · Score: 1

      Writing parallel code is difficult. Writing parallel code which makes sense is even more so. Actually, if you have a quad-core CPU and do ECB instead of CBC, then you can manage a 4x increase in performance ... no need to use a GPU!
      (The reason is that ECB encryptions can be done in parallel, as each of them is independent; for CBC you need to know the ciphertext of block n-1 in order to produce that of block n.)
      A counter mode (CTR) might make sense for eCryptfs, but the security analysis is definitely non-trivial to make.
      Actually, it is amateurish at best to say that this implementation of eCryptfs "is not a toy" (per http://code.google.com/p/kgpu/wiki/IozoneBenchmarkResults ); it is, in fact, something which seriously compromises security.
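
The dependency argument in this comment can be sketched as follows. The block cipher is a toy SHA-256-based Feistel construction standing in for AES, and every name in the code is hypothetical. The thread pool only demonstrates that ECB blocks can be handed to independent workers in any order; it is not a speed measurement, since Python threads will not show a real 4x gain here.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # bytes per block

def toy_encrypt_block(key, block):
    """Toy 4-round Feistel cipher. NOT AES and NOT secure; illustration only."""
    left, right = block[:BLOCK // 2], block[BLOCK // 2:]
    for i in range(4):
        f = hashlib.sha256(key + bytes([i]) + right).digest()[:BLOCK // 2]
        left, right = right, bytes(a ^ b for a, b in zip(left, f))
    return left + right

def split(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def ecb_parallel(key, data, workers=4):
    # Each block depends only on the key and its own plaintext, so workers need no ordering.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: toy_encrypt_block(key, b), split(data)))

def cbc_sequential(key, iv, data):
    # Block n needs the ciphertext of block n-1, so the loop cannot be split up.
    prev, out = iv, []
    for block in split(data):
        prev = toy_encrypt_block(key, bytes(a ^ b for a, b in zip(block, prev)))
        out.append(prev)
    return out

key, iv = b"demo key", bytes(BLOCK)      # hypothetical key and all-zero IV
msg_a = bytes(range(64))                 # four blocks
msg_b = b"\xff" + msg_a[1:]              # same message with the first byte changed

ecb_a, ecb_b = ecb_parallel(key, msg_a), ecb_parallel(key, msg_b)
cbc_a, cbc_b = cbc_sequential(key, iv, msg_a), cbc_sequential(key, iv, msg_b)

changed_ecb = [i for i in range(4) if ecb_a[i] != ecb_b[i]]
changed_cbc = [i for i in range(4) if cbc_a[i] != cbc_b[i]]
print(changed_ecb)  # indices of ciphertext blocks affected under ECB
print(changed_cbc)  # indices of ciphertext blocks affected under CBC
```

A one-byte change in the first block touches only that block under ECB but ripples through every later block under CBC, which is exactly why CBC encryption resists being spread across cores or GPU threads.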

  18. Do NOT use ECB mode by buglista · · Score: 1

    it doesn't obscure patterns in your input data. Please take a look at the tux images here; http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation#Electronic_codebook_.28ECB.29 (it may be faster, but it doesn't f---ing work.)

  19. Why not OpenCL? by gerddie · · Score: 3, Interesting

    They should go with OpenCL, then there would be a chance that at one point one can use it with free drivers (and other hardware), but I guess that's the price you pay for a graduate fellowship from NVIDIA.

    1. Re:Why not OpenCL? by Anonymous Coward · · Score: 0

      OK, I'd better say something about our decision.
      Use OpenCL? No, the reason is that new features of GPGPU computing on Fermi are not available on OpenCL.
      Price for fellowship? Maybe in future... In fact, we already implemented most of KGPU before submitting our application.

    2. Re:Why not OpenCL? by GameboyRMH · · Score: 2

      Came here to say this. Why the hell are they writing things in CUDA instead of OpenCL? CUDA is closed and Nvidia-proprietary!

      --
      "When information is power, privacy is freedom" - Jah-Wren Ryel
    3. Re:Why not OpenCL? by gerddie · · Score: 1

      You might also want to consider this thread on the Linux kernel mailing list. It is about adding a module to the kernel that has only one use: to talk to proprietary user space code. The module got rejected from mainline for this reason. By using CUDA and the proprietary user space portions from NVIDIA, your module will also never make it into mainline (unless hell freezes over and NVIDIA opens up their drivers).

    4. Re:Why not OpenCL? by Anonymous Coward · · Score: 0

      We are doing some similar work at the University of Toronto. Our work is OpenCL-based, and we have a paper appearing at HotCloud in June. If you're interested, take a gander: http://sysweb.cs.toronto.edu/projects/21

  20. Encryption is not the main beneficiary by voss · · Score: 2

    Imagine mysql database GPU accelerated...

    GPU accelerated routers, gpu acceleration of anti-virus software.
    The use of gpus to accelerate search engines.

    1. Re:Encryption is not the main beneficiary by kvvbassboy · · Score: 1

      Imagine weather prediction and stock prediction using these. I am surprised that the guys in New York haven't used it already given the massive amount of gold they have in their coffers.

    2. Re:Encryption is not the main beneficiary by makomk · · Score: 1

      These days, automated stock trading is in fashion, which depends on having really tiny latencies - the exact opposite of what you get from GPU acceleration. I believe companies are experimenting with implementing stock trading algorithms on FPGAs connected directly to network interfaces...

    3. Re:Encryption is not the main beneficiary by Anonymous Coward · · Score: 0

      Problem needs to parallelize well. Encryption usually does. Encoding usually does.
      Databases, not so much, at least not with current architectures.
      Can run a lot of queries in parallel, but one query gets at most one core in most modern databases, mysql inclusive.

    4. Re:Encryption is not the main beneficiary by Anonymous Coward · · Score: 0

      GPU accelerated routers, gpu acceleration of anti-virus software.

      I hope that is a joke, imagining that is making me feel ill. It's also ridiculous.

      The reason we HAVE CPUs and GPUs rather than just GPUs that do everything is because the two are complementary and work better on different problems. CPUs are good at decision making, you have short sequences of instructions with a bunch of if-this-then-that thrown in, they also handle sharing (not exactly well but anyway) through locking. GPUs are good at long sequences of branch-free calculations which are independent of what other cores are doing, they choke and shed performance rapidly on if-statements and die a rapid death on problems that require results from other cores which can't be performed in multiple passes. This is what makes it so much faster for 3d rendering, each pixel on the screen is mostly independent of the others so each of the 100+ 'GPU cores' can do a separate pixel each then collate the result together at the end.

      I'd love to see you try and break down a SQL query execution engine into a multi-pass algorithm with minimal decision making logic. [Even if you succeed, getting the records off of the storage is always the bottleneck in a well designed DB, good luck making your HDD spin faster with magic GPU power]

      Commercial routers are already accelerated, and in better ways: rather than installing $100s of GPU, they use specialised network chips (probably programmed FPGAs but same difference) which will always be faster anyway. This is where GPGPU obsession gets ridiculous: dedicated hardware designed to perform a specific well-bounded task is always faster than general purpose hardware wired (either physically or in software) for that purpose. As GPUs get more general and less graphics specialised, they become more like CPUs and therefore worse at their primary task of pixel banging. (We can see this with nVidia's current GPGPU specialised Fermi architecture, which consumes 50% more power than AMD's equivalent GPUs across the board and, despite all that wasted power, is barely holding the performance line in 3d graphics.)

    5. Re:Encryption is not the main beneficiary by smorken · · Score: 1

      The only time that you want to use a GPU is when your code has a high proportion of numerical operations, and when your problem can be executed in parallel. (modeling, graphics) If this is not the case then using a GPU is not going to speed things up. Code where you are mostly just moving data around with sparse calculations (routers, databases, webservers, AV) is not a good problem for video cards.

    6. Re:Encryption is not the main beneficiary by kvvbassboy · · Score: 1

      Hmm.. I am pretty much in GPU architecture, but here's why I thought it would be great in the stock and weather forecasting.

      1. They involve a lot of matrix multiplications and matrix inversion algorithms, which, from what I heard can be handled nicely by the GPU.

      2. This is a very naive thought, but TFA talked about easy parallelization using GPU. This can be harnessed by the multitude of parallel, machine learning algorithms out there.

      However, after some searching, I came across a white paper (dammit, I am really not able to find the link now) which mentioned that GPUs are poor when it comes to using and re-using a large data size due to some kind of latency, which I think is what you are talking about.

    7. Re:Encryption is not the main beneficiary by kvvbassboy · · Score: 1

      "I am pretty much ignorant* in GPU architecture"

      Fixed.

    8. Re:Encryption is not the main beneficiary by nochez · · Score: 1

      Imagine the day when someone finally implements a GPU accelerated "make me a sandwich" (http://xkcd.com/149/) ... that, would be pure awesomeness.

    9. Re:Encryption is not the main beneficiary by Anonymous Coward · · Score: 0

      I'm also pretty ignorant on the subject, but I suspect the key to understanding GPGPU computing is that it's SIMD (Single Instruction Multiple Data).

      A SIMD machine is analogous to a robot with lots of arms. Each arm can hold any of a number of tools (let's say hammer, screwdriver, paintbrush). But you can't have one arm holding a hammer and another holding a paintbrush at the same time. All the arms have to hold the same tool at the same time. So you have a machine that excels at jobs like hammering a million nails, or driving a million screws, but is useless for jobs like building a house.
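
The many-armed robot can be sketched as a tiny lockstep simulation. This is a simplified model with hypothetical function names, not how any particular GPU schedules work: all lanes apply the same operation in each pass, and when a condition splits the lanes, each side of the branch costs its own pass while a mask decides which lanes keep the result.

```python
def simd_branch(cond, then_op, else_op, lanes):
    """Run an if/else over all lanes in lockstep; return (results, passes used)."""
    mask = [cond(x) for x in lanes]
    result = list(lanes)
    passes = 0
    if any(mask):
        # Pass for the 'then' side: every lane runs it, only masked-in lanes keep it.
        vals = [then_op(x) for x in lanes]
        result = [v if m else r for m, v, r in zip(mask, vals, result)]
        passes += 1
    if not all(mask):
        # Pass for the 'else' side: needed only if at least one lane took it.
        vals = [else_op(x) for x in lanes]
        result = [r if m else v for m, v, r in zip(mask, vals, result)]
        passes += 1
    return result, passes

is_even = lambda x: x % 2 == 0
halve = lambda x: x // 2
triple_plus_one = lambda x: 3 * x + 1

uniform = simd_branch(is_even, halve, triple_plus_one, [2, 4, 6, 8])
mixed = simd_branch(is_even, halve, triple_plus_one, [1, 2, 3, 4])
print(uniform)  # all lanes agree on the branch
print(mixed)    # lanes disagree, so both sides are executed
```

When every lane agrees, the branch costs one pass; when they disagree, both tools have to be picked up in turn, and half the arms idle each time.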

    10. Re:Encryption is not the main beneficiary by Anonymous Coward · · Score: 0

      The advantage of GPU is that it can do parallelized, relatively simple (only a few operations) tasks fast. So it probably wouldn't help MySQL, since that relational database is unstable garbage that can't get databases big enough to make it worth it to parallelize. Routing is not simple enough, and network data streams are not reliable enough to parallelize. Antivirus software is mostly limited by disk speed, and I don't think the heuristic algorithms are simple enough.

      With search engines, ask Google. From what I recall they use ARM cpus, so I guess the search engine algorithms are not simple.

  21. Re:Did a anyone else's brain switch off half way.. by kvvbassboy · · Score: 1

    Sure.. but the summary is still badly written. Read the TFA, and that makes a lot more sense for us illiterate folks.

  22. CUDA? That makes zero sense by tyrione · · Score: 2

    Instead, one should use OpenCL. It's Platform Agnostic for a reason, but don't let Linux's chance to be hypocritical step in the way.

  23. I like the random reference to Ubuntu by RichiH · · Score: 1

    In former times, people made sure you knew they used Slackware, then LFS, then Gentoo, now Ubuntu.

    Distributions are like a penis and religion...

    Anyway, get off my lawn.

  24. 4x speedup is nothing by loufoque · · Score: 1

    4x speedup is nothing. Using the GPU correctly should bring much higher speedups.
    That kind of gain could simply be obtained by optimizing the CPU code.

    1. Re:4x speedup is nothing by Rockoon · · Score: 1

      Indeed. It has been my experience that when crypto writers move their libs from C to well optimized x86 assembly language they get at least 2x performance boost.

      These guys are getting 4x, but only on a fairly powerful GTX 480 GPU. How will typical mobile GPUs compare? Probably even slower than the CPU, right? This article makes me sad.

      --
      "His name was James Damore."
  25. SSE , AltiVec by mehemiah · · Score: 1

    there are plenty of architecture-specific vector instruction sets on the CPU that the kernel could be taking advantage of instead; for example SSE and AltiVec for x86 and PPC respectively.

    1. Re:SSE , AltiVec by mehemiah · · Score: 1

      or VIS for SPARC

  26. open CUDA or give up. by bored · · Score: 1

    For the last ~8 years I've needed extremely fast encryption (and compression) in the project I use. A few years ago when CUDA began to gain traction, I got all excited and actually decided to see what was necessary to make it work and see how fast it was.

    Well at the time, I discovered that CUDA enabled encryption is quite fast. The problem is that copying the data segment to the GPU, doing the encryption and then copying the result back is painful. The copies and setup/interrupt/etc add so much latency that it runs at roughly the same speed as just doing the operation on the CPU. Adding a couple of user/kernel space crossings probably makes the problem even worse. So during this timeframe we used dedicated compression/encryption boards for the customers that needed it fast, and everyone else just got a couple of extra CPUs dedicated to the effort. Now with AES-NI dedicated boards generally aren't necessary. Sure you have to buy a machine specifically with AES-NI right now, but I suspect that with all these instruction set extensions, within a couple of years it will be widespread.

    To patch the kernel to support such an ugly hack would be quite stupid, given the fact that AES is already fairly respectable (~100MB/sec or so per CPU) anyone that needs it faster could use blowfish, or find a CPU with AES-NI.
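
The latency argument can be put into a rough cost model. Only the ~100 MB/sec CPU figure comes from the comment; the GPU rate, bus rate, and fixed per-request overhead below are hypothetical placeholders chosen to show the shape of the trade-off, not measurements of any real hardware.

```python
def cpu_time(n_bytes, cpu_rate):
    """Seconds to encrypt n_bytes on the CPU at cpu_rate bytes/sec."""
    return n_bytes / cpu_rate

def gpu_time(n_bytes, gpu_rate, bus_rate, fixed_overhead):
    """Seconds for a GPU round trip: setup, copy in, encrypt, copy out."""
    # Data crosses the bus twice: plaintext in, ciphertext out.
    return fixed_overhead + 2 * n_bytes / bus_rate + n_bytes / gpu_rate

def break_even_bytes(cpu_rate, gpu_rate, bus_rate, fixed_overhead):
    """Request size above which the GPU path wins, or None if it never does."""
    per_byte_gain = 1 / cpu_rate - 2 / bus_rate - 1 / gpu_rate
    if per_byte_gain <= 0:
        return None
    return fixed_overhead / per_byte_gain

CPU_RATE = 100e6   # ~100 MB/sec per CPU, the figure quoted in the comment
GPU_RATE = 1e9     # hypothetical GPU cipher throughput
BUS_RATE = 4e9     # hypothetical host-to-device copy bandwidth
OVERHEAD = 1e-4    # hypothetical setup/interrupt/crossing cost per request, in seconds

threshold = break_even_bytes(CPU_RATE, GPU_RATE, BUS_RATE, OVERHEAD)
print(round(threshold))  # request size in bytes where both paths take equal time
```

Under these made-up numbers a single 4 KB page is faster on the CPU and only requests well above the threshold benefit from the GPU, which matches the experience described: the fixed cost dominates unless requests are large or batched.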

  27. Re:Did a anyone else's brain switch off half way.. by thePowerOfGrayskull · · Score: 1

    Excellent, glad it helps! Look for some updates coming in the fairly near future...

  28. Actually they shouldn't by Anonymous Coward · · Score: 0

    "Anyway, GPU vendors should think about opening their drivers and computing libraries, or at least providing a mechanism to make it easy to do GPU computing inside an OS kernel"
    Actually they shouldn't. There's always debate about this kind of thing, but in my humble opinion adding large and complex systems that don't have to be in the kernel into the kernel is not a good thing. For this, a cryptfs userspace crypto shim is a clean solution; this would allow for adding arbitrary new crypto systems too. Regarding "the chicken and the egg", if you have an encrypted root filesystem, a lot of distros already build an initramfs -- basically a preloaded RAM disk -- this is how all the SATA and SCSI drivers can be built as modules, but the right ones are loaded before the system tries to mount your hard disk. So any extra cryptfs stuff can be handled there.