Slashdot Mirror


Gzip on a PCI card

steve writes "The German tech news site heise.de is reporting here (in German, of course) about a PCI card developed by the University of Wuppertal and Vigos AG being shown at CeBIT, which does Gzip compression in hardware, thus freeing the CPU to do other tasks. The PCI card can compress 32MB/sec, which is more than enough to compress a 100Mbit LAN in realtime. A future version will do 64MB/sec. The article mentions that this will be of particular interest for web servers. The card should be on sale by the end of the year."

30 of 141 comments

  1. Useful for netbackups too by walt-sjc · · Score: 5, Insightful

    Seems this would be a great help to those doing backups over a LAN. Shouldn't take too much to alter a version of tar, rsync, etc. to use this card.

    1. Re:Useful for netbackups too by Bazzargh · · Score: 4, Informative

      rsync doesn't use gzip, or the deflate algorithm - it uses the Burrows-Wheeler Transform, same as used in bzip2. If you read Tridge's thesis you'll see that he actually proposes an rzip algorithm based on the BWT and his work on rsync that compresses better than gzip or bzip2 on typical files.

      -Baz

    2. Re:Useful for netbackups too by walt-sjc · · Score: 3, Insightful

      Interesting, didn't know that. I just assumed it used the same code. Note that one of the cool things about open source is that you could swap out the compression code, which is exactly what I was suggesting, so it wouldn't really matter what algorithm the code originally used. (Of course it would no longer be compatible, but I'm also assuming that this wouldn't be an issue in this case for this application.) I normally don't use the built-in compression with rsync; instead I use the compression in ssh, which I believe IS gzip.

      It would be Very cool if the card supported multiple compression algorithms. Considering that GNU tar supports bzip as well, this would definitely be useful.

    3. Re:Useful for netbackups too by stilwebm · · Score: 2, Interesting

      Maybe you're thinking of dynamic linking against zlib or other compression libraries. This would use the same code, quite literally. That would be the most useful way to support a card like this. The zlib.so (or zlib.dll) could be modified to interface with the drivers for the card, so programs linked against zlib would transparently use the faster hardware acceleration. Few programs will be statically linked to zlib anyway, and those exceptions are likely to either be binaries you don't mind recompiling for speed (e.g. you linked it statically and tweaked the binary for speed already) or binaries on some rescue disk or small root filesystem where zlib.so may not be readable.
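
      A rough sketch of that shim idea, written in Python rather than C just to keep it short. The HardwareDeflate class is purely hypothetical (nothing here describes the card's driver API); the point is only that callers keep using one compress() entry point and the library decides where the work happens.

```python
import zlib


class HardwareDeflate:
    """Hypothetical stand-in for a driver binding; not a real API."""

    def available(self):
        # A real binding would probe for the device; this stub never finds one.
        return False

    def compress(self, data):
        raise NotImplementedError("no accelerator present")


def compress(data, level=6, accel=None):
    """Use the accelerator when one is present, otherwise fall back to software zlib."""
    if accel is not None and accel.available():
        return accel.compress(data)
    return zlib.compress(data, level)


if __name__ == "__main__":
    payload = b"backup block " * 1000
    packed = compress(payload, accel=HardwareDeflate())
    print(len(payload), "->", len(packed), "bytes (software fallback)")
```

      Because the output would still be an ordinary deflate stream, programs on the receiving end would not need to know which path produced it.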

  2. bandwidth saving by buro9 · · Score: 5, Insightful

    the key to using gzip is really not to compress at too high a ratio... a low rate of compression offers a pretty sizeable saving in bandwidth for an acceptable CPU usage... once you edge up to the higher compression levels then you pay for it in the CPU and your app slows.

    i love the idea of a hardware based gzip... but i'd start by educating the software users on the cost vs benefit ratio of their existing configuration... i always seem to find that those who don't know what they're doing are the ones that have it set to maximum compression
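
    One way to check the cost vs benefit on your own data, sketched with Python's zlib module (the same deflate algorithm gzip uses). The sample_page() helper is made up for illustration; substitute real pages or log files to get numbers that mean something.

```python
import time
import zlib


def measure(data, levels=(1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        assert zlib.decompress(packed) == data  # sanity check: lossless round trip
        results[level] = (len(packed), elapsed)
    return results


def sample_page(n_rows=2000):
    """Build repetitive HTML-like text, a made-up stand-in for a typical web page."""
    rows = ["<tr><td>item %d</td><td>%d</td></tr>" % (i, i * 37 % 1000)
            for i in range(n_rows)]
    return ("<table>" + "\n".join(rows) + "</table>").encode("ascii")


if __name__ == "__main__":
    page = sample_page()
    for level, (size, secs) in measure(page).items():
        print("level %d: %d -> %d bytes in %.4f s" % (level, len(page), size, secs))
```

    Comparing the size column against the time column shows whether the last few percent of compression are worth the extra CPU for a given workload.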

  3. A bzip2 version would be nice ... by geirt · · Score: 4, Insightful

    I try to avoid bzip2 because it is so slow, even on modern hardware. bzip2 compresses very well, much better than gzip. A bzip2 version of this card makes sense ....

    --

    RFC1925
    1. Re:A bzip2 version would be nice ... by arvindn · · Score: 5, Informative
      No, bzip2 is something that won't work for applications like serving web pages.

      gzip works with streams, producing output as it gets input. OTOH bzip2 treats the input as blocks. Thus it needs to get a whole block before it produces any output. Similarly the client needs to get a whole block of data before it can even start rendering the page. The man page of bzip2 says that the default block size is 900,000 (!) bytes. So while using bzip2 may improve bandwidth it will result in large latency.
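
      The latency difference can be seen with Python's zlib and bz2 modules. This is only a sketch: it feeds the same small chunks to both compressors and records how many bytes each one hands back after every chunk. Z_SYNC_FLUSH is used on the deflate side to push out whatever is buffered, roughly what a web server does when it wants a partial page on the wire.

```python
import bz2
import zlib


def bytes_emitted_per_chunk(chunks):
    """Feed the same chunks to both compressors; record output size after each chunk."""
    z = zlib.compressobj()
    b = bz2.BZ2Compressor()  # default compresslevel 9, i.e. 900k blocks
    z_sizes, b_sizes = [], []
    for chunk in chunks:
        # Z_SYNC_FLUSH makes deflate emit everything it has buffered so far.
        z_sizes.append(len(z.compress(chunk) + z.flush(zlib.Z_SYNC_FLUSH)))
        # Python's bz2 module exposes no mid-stream flush; data is held
        # until a block fills or the stream is finished.
        b_sizes.append(len(b.compress(chunk)))
    b_tail = len(b.flush())  # finishing the stream releases the held block
    return z_sizes, b_sizes, b_tail


if __name__ == "__main__":
    page_parts = [b"<p>hello world</p>\n" * 50] * 3
    print(bytes_emitted_per_chunk(page_parts))
```

      With chunks this small, the deflate side returns bytes after every chunk while the bzip2 side returns nothing until the final flush.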

    2. Re:A bzip2 version would be nice ... by ianezz · · Score: 4, Interesting
      gzip works with streams, producing output as it gets input. OTOH bzip2 treats the input as blocks.

      Gzip works with blocks of data too, but the block size is 32KB instead of nearly 1MB and it is not nearly as CPU intensive as bzip2, so this is why it appears to produce a continuous stream of compressed data (even if, strictly speaking, it doesn't).

      Gzip just seems to be a well-balanced compromise between resources and resulting compression ratio, plus it is Free Software (hint: bzip2 is Free Software too, but Rar isn't).

  4. Re:Complete, Utter, Comprehension! by Specialist2k · · Score: 3, Informative

    A translation: A joint venture between the University of Wuppertal and Vigos AG showcases the prototype of a "GZIP accelerator board" at CeBIT (Hall 11, D26). The PCI card removes the burden of performing time-consuming data compression tasks from the system CPU and already achieves a data throughput of 32 MB/s in its current development state. This is sufficient to compress the traffic generated by a 100 MBit LAN connection in real-time; through the modular design, it will be possible to reach 64 MB/s in the future. [end of first paragraph] Specialist

  5. Re:Hardware Gzip by Lord+Sauron · · Score: 4, Funny

    A hardware that does the dirty processing job while freeing the CPU ? Wow, that's new. I'm going to the USPTO to get my patent on this.

    Maybe I can even make some money on Intel, as they were in clear violation of my patent with their arithmetic coprocessor for use with the 80386SX family of microprocessors.

  6. Comparison by Merlin42 · · Score: 3, Interesting

    For comparison I ran gzip on two machines I happen to have immediate access to. I compressed a 32MB file gotten from /dev/urandom, which probably would be a worst case scenario for a compressor.

    dd if=/dev/urandom of=32m bs=1024k count=32 ; time gzip 32m

    P4-1.8Ghz:
    real 0m4.428s
    user 0m4.220s
    sys 0m0.170s

    AthlonXP2200+
    real 0m3.579s
    user 0m3.310s
    sys 0m0.160s

    So 32MB/s sounds pretty good to me.
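
    Spelling out the arithmetic behind that conclusion, using only the wall-clock ("real") times posted above for the 32MB file:

```python
def throughput_mb_per_s(size_mb, real_seconds):
    """Throughput implied by compressing size_mb megabytes in real_seconds."""
    return size_mb / real_seconds


# Wall-clock times from the two runs above.
timings = {"P4-1.8Ghz": 4.428, "AthlonXP2200+": 3.579}
rates = {cpu: throughput_mb_per_s(32, secs) for cpu, secs in timings.items()}

if __name__ == "__main__":
    for cpu, rate in rates.items():
        print("%s: %.1f MB/s" % (cpu, rate))
```

    That works out to roughly 7 and 9 MB/s on this worst-case input, so the card's claimed 32MB/s would be several times faster than either CPU in this particular test.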

  7. Not a good comparison by TheSHAD0W · · Score: 2, Interesting

    You're assuming the card is using the same settings as your version of gzip defaults to. More likely it's using a much lower compression level and a considerably slower processor.

    Note that this isn't necessarily a bad thing; at the expense of maybe 5-10% less compression, you're getting that high throughput. Depending on your task, it's a good trade-off.

    1. Re:Not a good comparison by Merlin42 · · Score: 3, Interesting

      Good point ... lets test a little more:
      P4-1.8Ghz: gzip -9
      real 0m4.437s
      user 0m4.200s
      sys 0m0.210s
      P4-1.8Ghz: gzip -1
      real 0m4.366s
      user 0m4.130s
      sys 0m0.200s
      AthlonXP2200+: gzip -9
      real 0m3.387s
      user 0m3.160s
      sys 0m0.210s
      AthlonXP2200+: gzip -1
      real 0m3.427s
      user 0m3.200s
      sys 0m0.170s

      The really funny part is that I ran the Athlon one several times and the gzip -9 was always just ever so slightly faster than the gzip -1 version.

      Maybe random data is not the best for testing the different compression levels though, since if it is truly random it cannot be compressed no matter how hard you try.
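
      A quick way to see this, sketched in Python with zlib (same deflate algorithm as gzip). The seeded pseudo-random generator is only there to make the run repeatable; it stands in for /dev/urandom and is an assumption of this sketch, not what was used above.

```python
import random
import zlib


def ratio(data, level):
    """Compressed size divided by original size (1.0 means no gain at all)."""
    return len(zlib.compress(data, level)) / len(data)


def pseudo_random_bytes(n, seed=0):
    """Repeatable noise; a stand-in for reading n bytes from /dev/urandom."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


if __name__ == "__main__":
    noise = pseudo_random_bytes(100000)
    text = b"the quick brown fox " * 5000
    for name, data in (("noise", noise), ("text", text)):
        print(name, "level 1: %.3f" % ratio(data, 1), "level 9: %.3f" % ratio(data, 9))
```

      On the noise input both levels stay at about 1.0, so the compression level has nothing to work with and the timings end up nearly identical.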

      Even if this is not a perfect (or even reasonable) "apples to apples" comparison, it is a good end-to-end system level comparison. While it may not be "4x faster than a 2Ghz CPU", when building a system that _needs_ to do compression, adding this card would _effectively_ boost my CPU speed.

    2. Re:Not a good comparison by The+Ego · · Score: 2, Interesting

      If gzip -9 (a.k.a. gzip --best) is faster than gzip -1, it must be because you are IO limited, so writing a smaller file ends up as wall-clock saving.

      It clearly is a flawed test to compare the CPU loads of -9 and -1 but it is an excellent example that IO is often the bottleneck.

  8. Browser Compression by Kalak · · Score: 3, Informative

    Most all current browsers will automatically uncompress gzipped files sent to them, allowing things such as the mod_gzip module to compress web pages and have them rendered on the browser transparently. The bandwidth savings can be huge, with all the associated benefits (less bandwidth for the server, less for the clients and less congestion on the net). Without bzip2 support built into the browser, hardware bzip2 compression isn't useful for general web traffic, as it can't be used for the pages being sent.
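
    For anyone wondering how the "transparent" part works: the browser lists the encodings it accepts in an Accept-Encoding request header, and the server only compresses when gzip is on that list. A minimal sketch in Python follows; encode_body() is a made-up helper, not mod_gzip's actual logic, and it ignores q-values for brevity.

```python
import gzip


def encode_body(body, accept_encoding):
    """Return (payload, extra_headers) honoring a client's Accept-Encoding header."""
    # Strip any ";q=..." parameters and normalize case; q-values are ignored here.
    offered = [token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")]
    if "gzip" in offered:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}  # client did not offer gzip: send the page as-is


if __name__ == "__main__":
    page = b"<html>" + b"<p>hello</p>" * 500 + b"</html>"
    payload, headers = encode_body(page, "gzip, deflate")
    print(len(page), "->", len(payload), headers)
```

    A client that only offered bzip2 would fall through to the uncompressed branch, which is the point made above.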

    It'd be nice if I could convince my boss to get some of these for us, but our CPU usage is pretty low right now with the mod_gzip module installed, so it'd be an unnecessary luxury at this point for us.

    --
    I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
  9. How cute but useless. by _Eric · · Score: 5, Interesting

    The general trend in the industry goes to non-intelligent interconnections (Gigabit cards used to have a processor (Alteon), they don't anymore (see latest Intels)). I2O never took off because you don't really need to relieve a computer from computation when your computation power is plethoric.

    On a Xeon 2.8GHz, I just got 71 MB/s for gzip.

    What's the use for such hardware then?

    Plus it will eat the PCI bus because data has to go out of memory to processing card, back to memory, then to network card. You triple the PCI bus bandwidth. (Not true if the compression is embedded in the network card).
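
    To put rough numbers on the three bus crossings, here is a small sketch in Python. The 132 MB/s peak is the 32-bit PCI figure quoted elsewhere in this discussion; the compression ratio parameter is an assumption added for illustration, with 1.0 (incompressible data) giving the full tripling.

```python
PCI_PEAK_MB_S = 132.0  # 32-bit PCI figure quoted elsewhere in this discussion


def bus_load_with_card(uncompressed_mb_s, ratio=1.0):
    """Total PCI traffic when a separate card compresses data bound for the NIC.

    ratio is compressed size / original size (an illustrative assumption).
    """
    to_card = uncompressed_mb_s            # memory -> compression card
    from_card = uncompressed_mb_s * ratio  # card -> memory
    to_nic = uncompressed_mb_s * ratio     # memory -> network card
    return to_card + from_card + to_nic


if __name__ == "__main__":
    for ratio in (1.0, 0.25):
        load = bus_load_with_card(32, ratio)
        print("ratio %.2f: %.0f MB/s, %.0f%% of PCI peak"
              % (ratio, load, 100 * load / PCI_PEAK_MB_S))
```

    At the card's 32 MB/s input rate the worst case is 96 MB/s of bus traffic; data that compresses well costs less because the two return trips shrink.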

    1. Re:How cute but useless. by sporty · · Score: 2, Insightful

      Not really. Can you cheaply create a cluster of say.. 50 web servers, all that use mod_gzip for line compression?

      Xeons aren't THAT cheap, but hey, 1ghz machines (or even 500mhz machines) with this card would easily match your Xeon once the 64MB/s cards come out. Or was that 64mb/s. Well, you get the point.

      As for the bus latency, well.. you are right, it'd be better in the network card, but remember, that's layer 1 and 2 stuff you'd be meddling with, where gzip would end up in layer 4. Layer 3 is tcp/udp, 4 is app data, right?

      --

      -
      ping -f 255.255.255.255 # if only

  10. Reconfigurable by KingPrad · · Score: 5, Interesting
    This is cool - dedicated chips can process monstrous amounts of data and much faster than a general purpose CPU. So it's a good idea to let this card do the heavy lifting of compression. Of course the use extends to many types of data analysis: encryption, scientific number crunching, graphics compression.

    The best idea would be to make the chip an FPGA not a specially-designed processor. Then you could load in different chip designs for whatever was currently needed. Need to do RSA encryption? The board reconfigures the FPGA for it. Same goes for Divx compression, gzip, SETI@Home, etc. FPGAs take a few milliseconds to reconfigure but when they operate as a dedicated signal processor they can leave a general purpose processor in the dust - leaving the main CPU to run the other apps, the desktop, etc.

    Check out the IEEE archives and journals, searching for "adaptive computing" or "reconfigurable computing".

    KingPrad

    --
    Stop the Slashdot Effect! Don't read the articles!
    1. Re:Reconfigurable by UranusReallyHertz · · Score: 2, Interesting

      I always thought it would be cool if some of the transistors in general purpose cpus could be used as an FPGA to serve as an "algorithm cache". When a program is run the most frequently used algorithms are automatically implemented in hardware on the FPGA, resulting in speedups anywhere between 10 and 1000 times. Seeing as how CPUs will have a billion or more transistors in the near future, this would seem like an excellent use for them.

      --
      Smoking is an expensive, slow, and unreliable method of suicide.
  11. Only useful for dynamic sites? by d-Orb · · Score: 2, Interesting

    I guess that this would only be useful for dynamic sites, wouldn't it? Otherwise, static pages would be cached on the server, only needing compression the first time they are served :-?
    At any rate, most of the visitors to my site rarely get the gzipped pages, as their browsers don't seem to support it :(

  12. Cool by arvindn · · Score: 5, Informative
    gzip was designed with such considerations in mind. Throughput of the algorithm took precedence over compression level. Good to see their farsightedness paying off. And the algorithm is pretty simple so that it can be implemented in hardware directly.

    Another thing about gzip is that it is asymmetric: decompression is much faster than compression. Again this is a nice feature, because most files will be decompressed many times but compressed only once. Thus for instance, all man pages are stored in gzipped form and decompressed on demand.

    But I can't see the point of implementing it in a PCI card. Wouldn't it be better to integrate it with either the processor or the network interface?

  13. Not quite yet... by buzzbomb · · Score: 4, Informative

    The article mentions that this will be of particular interest for web servers.

    I'm assuming one is referring to something that will work with mod_gzip. That may be fine and dandy, but I just recently had to disable mod_gzip on my server. You can blame Microsoft.[1] It seems that both IE 5.5 and 6.0 have nasty little "sometimes" bugs[2] where they won't know what to do with gzipped content. I tried to disable it by user agent header with no luck. If anyone else has some good pointers or perhaps even a link to a patched version of mod_gzip that'll avoid those two bugs, I would appreciate it.

    [1] No, really. This isn't a troll. They even admit the bugs.
    [2] Microsoft Knowledge Base Articles: Q313712 IE 5.5 Q312496 IE 6.0
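
    For reference, the kind of user agent filter being attempted looks roughly like the Python sketch below. should_gzip() is a made-up helper, not mod_gzip configuration, and as noted above this approach did not actually solve the problem here; it only shows the matching logic.

```python
import re

# Browser versions named in the two Knowledge Base articles above.
_AFFECTED = re.compile(r"MSIE (5\.5|6\.0)\b")


def should_gzip(user_agent, accept_encoding):
    """Skip compression for the affected IE versions, even if they offer gzip."""
    if "gzip" not in accept_encoding.lower():
        return False
    return _AFFECTED.search(user_agent) is None


if __name__ == "__main__":
    ie6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    moz = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030312"
    print(should_gzip(ie6, "gzip, deflate"), should_gzip(moz, "gzip, deflate"))
```

    The obvious cost is that every IE 5.5 and 6.0 visitor gets uncompressed pages, whether or not they would have hit the bug.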

    1. Re:Not quite yet... by arvindn · · Score: 2, Funny

      You might want to try out mod_msff: the Microsoft-free friday apache module ;)

  14. You have an important point... by mnmn · · Score: 4, Interesting


    When the PCI bus is taken, other stuff that the CPU needs to do will also be halted. And then the PCI bus is much slower than the FSB.

    I think what we need to push distributed computing more is altering the RAM and DMA channels. There should be many physical channels to the RAM capable of simultaneously reading/writing different parts of it. As in if the ram can output 200 MB per sec, 16 devices could attach themselves to the RAM via maybe EDMA (enhanced DMA?) and simultaneously be able to read at 200MB each. This might be done by:

    (1) Altering the addressing logic in the memory ICs, maybe put 16 different addressing systems and multiply their pins x16. Then have an external matrix, more advanced than the 802x DMA chip to allow simultaneity.

    (2) Separate the addressing schemes of each chip, so an OS kernel could smartly put data of important processes in the right chip to be worked on by external devices.. again also having an external matrix for the address multiplexing.

    This way such a PCI gzip device could have its PCI address space, IRQ as well as (EDMA?) address which it would use to access the data to gzip and put back into the RAM, at full speed, not taking up RAM bandwidth, PCI bandwidth, IRQs or the CPU at all.

    The AGP has achieved this by separating the AGP channel from PCI, but still using dedicated memory rather than smartly-shared memory. I understand multiprocessor systems technically do the same thing, but in this case we are treating the external devices like complete slaves, like the GPU, for only dedicated purposes, and I'm emphasizing the smart sharing of memory that doesn't exist in multiprocessor systems either. In this scheme, one could add CPU cards, maybe hot-plugged, and have insta-multiprocessor system or use it to offload kernel compilation, zipping, 3d transformations, or even take user tasks while the main CPU just works in supervisor mode.

    --
    "Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
  15. Re:Moo by benjamindees · · Score: 2, Informative
    When the PCI bus is used in conjunction with a 32-bit CPU, the bandwidth is 132 Mbytes/s

    That's Bytes, as in 8 bits. A 100 Mbit/sec NIC is only 12.5 MBytes/sec.
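
    The same conversion as a couple of lines of Python, with the 32 MB/s card and the 132 MB/s PCI figure from this discussion plugged in for comparison:

```python
def mbit_to_mbyte(mbit_per_s):
    """Convert a link speed in Mbit/s to MBytes/s (8 bits per byte)."""
    return mbit_per_s / 8


nic_mb_s = mbit_to_mbyte(100)     # a 100 Mbit/sec NIC
card_headroom = 32 / nic_mb_s     # card throughput relative to the NIC
pci_headroom = 132 / nic_mb_s     # PCI peak relative to the NIC

if __name__ == "__main__":
    print("NIC: %.1f MB/s, card: %.2fx, PCI: %.2fx"
          % (nic_mb_s, card_headroom, pci_headroom))
```

    So the card can take in about two and a half times what a single 100 Mbit link can carry, and the bus has roughly ten times that link's bandwidth.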

    --
    "I assumed blithely that there were no elves out there in the darkness"
  16. Sun machines use PCI busses, too. by Vengeful+weenie · · Score: 3, Insightful
    A little late posting, but I did want to point out that modern Sun machines use PCI buses, and the Enterprise class [4000+] machines have a crap load of bandwidth through their backplanes.

    I think it's a little naive to say "Oh, my 1000 hit a day web box, running on a cheap 686 wouldn't benefit from this, so it must suck." Hey, don't get mad! You said it! :P

  17. Here is a thought! by f00zbll · · Score: 2, Insightful

    What if you run a website that gets say 5million+ page views a day and you generate around 2gigs of logs per day per machine across 8 machines. At night you set up an automated batch to zip the logs and ftp them to a log reporting server. Then a cron job kicks off log analysis of all 16gigs of logs. Wouldn't this hardware acceleration help? Now let's try to scale that up to 20million+ page views a day. Or what if you're Yahoo who gets 1billion page views a day. How many gigs of logs do you have to process now? Not everyone needs hardware acceleration, but I would hardly call it useless.
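
    Back-of-the-envelope numbers for that nightly batch, as a small Python sketch. It assumes 1 gig = 1024 MB and that compression runs at a steady rate, both simplifications; the 32 and 64 MB/s rates are the card figures from the article.

```python
def batch_seconds(gb_per_machine, machines, throughput_mb_s):
    """Time to compress one night's logs at a steady throughput (1 GB = 1024 MB)."""
    total_mb = gb_per_machine * machines * 1024
    return total_mb / throughput_mb_s


if __name__ == "__main__":
    for rate in (32, 64):
        secs = batch_seconds(2, 8, rate)
        print("%d MB/s: %.0f s (%.1f min)" % (rate, secs, secs / 60))
```

    At 32 MB/s the 16 gigs take about eight and a half minutes of pure compression time, and the figure scales linearly with log volume.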

  18. Very interesting, but a little late by monish · · Score: 3, Interesting

    We at Indra Networks developed a PCI based gzip accelerator a long time ago. It has been on sale for almost a year. The current version of the card is already at 50 MB/s and we have been shipping that since last September. A higher performance version is on the way.

    The card is being sold on an OEM basis to manufacturers of load balancers and SSL accelerators. These boxes front-end multiple Web servers and have very high performance requirements. Also, the CPU has plenty of other work to do, for example TCP/IP processing. This is the application that needs hardware acceleration.

    For a low performance site, mod_gzip is fine. But, if you have a busy site with hundreds of Web servers, you don't want to go around installing mod_gzip hundreds of times. It is a lot cheaper to buy a load balancer with gzip hardware acceleration.

    bzip2 is irrelevant here as IE and Netscape would not understand bzip2 encoding anyway. But they understand gzip just fine (unless you have a version that is many years old).

    Monish Shah
    CTO, Indra Networks
    www.indranetworks.com

  19. Much better: Reprogrammable Co-processors by pacc · · Score: 3, Informative

    A lot of computing records over the years have been set by vector computers or other specialized hardware. Putting that power on a PCI-card like this gzip-solution and in addition making the algorithm reprogrammable and reconfigurable you get: Mitron Co-processor on a PCI-card.

    has been traditional areas for these kinds of devices, but with the new FPGA's and PCI-express on the horizon I can see it becoming usable for even more specialized applications.

    Here is a crude translation of an article in Swedish (source: Elektroniktidningen)

    FPGA enhances PC
    You don't have to be a logic constructor to make use of FPGA-chips. Using a normal PCI-card and a compiler from the innovation startup Flow Computing in Lund, programming in Flow's dialect of C is enough.
    - We can make a normal PC do calculations that otherwise would have needed supercomputers or large Linux-clusters, said Josef Macznik of Carlstedt Research & Technology, a company that invested in and works together with Flow Computing.
    The main idea is parallelism. That implies that the PC hardware has to be supplemented in some way, since normal PC-processors work sequentially and normal programs are written to be executed in that way.
    Flow has chosen to use normal PCI-cards. The cards are equipped with an FPGA-chip from Xilinx with two million gates, but the size of the chip can be selected depending on requirements according to Josef Macznik.
    The corporate secret lies in the compiler. Software has to be written in Flow's own variety of C, and the compiler can decide which processes gain the most from parallel execution, configuring the FPGA for maximum efficiency.
    - The user doesn't see the FPGA-chip and doesn't really have to know what kind of hardware there is on the card. We are directed towards programmers - that's where the market is, said Josef Macznik.
    Flow's solution is currently used by a bioinformatics company in Lund. But the technology can according to the company be used for all purposes where the computing power in a PC needs to be multiplied using parallelism and where the effort to adapt their programs to the special variety of C is worthwhile.

  20. Re:Why use Gzip? by NtG · · Score: 2, Insightful

    There are many many issues with this test, which has proved absolutely nothing:

    a. It appears (as someone mentioned elsewhere) that you are compressing an already compressed file

    b. You have not specified the options used when compressing, which can seriously alter the result

    c. You have thrown in TAR, which can be overlooked, however tarring a single file before gzip compressing it is simply a waste of time unless there is some particularly pertinent permissions/directory structure data you want to preserve. Basically, you have inflated the gzip output by doing this

    d. Each of these compression methods has its own benefits and shortfalls. Good compression ratio is not the be-all and end-all. Certainly many people have explained the whole block-compression theory and why gzip is so versatile.

    e. You seem to be trying to prove here that RAR is a superior compression method. It is also not free. It certainly can't be used without licensing fees as gzip can.

    f. Where is output such as time taken, i/o and cpu demands, etc?

    You may want to rethink your research.