Gzip on a PCI card
steve writes "The German tech news site heise.de is reporting here (in German, of course) about a PCI card developed by the University of Wuppertal and Vigos AG being shown at CeBIT, which does Gzip compression in hardware, thus freeing the CPU to do other tasks. The PCI card can compress 32MB/sec, which is more than enough to compress a 100Mbit LAN in realtime. A future version will do 64MB/sec. The article mentions that this will be of particular interest for web servers. The card should be on sale by the end of the year."
For comparison, I ran gzip on two machines I happen to have immediate access to. I compressed a 32MB file read from /dev/urandom, which is probably a worst-case scenario for a compressor:
dd if=/dev/urandom of=32m bs=1024k count=32 ; time gzip 32m
P4-1.8GHz:
real 0m4.428s
user 0m4.220s
sys 0m0.170s
AthlonXP2200+:
real 0m3.579s
user 0m3.310s
sys 0m0.160s
So 32MB/s sounds pretty good to me.
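To put numbers on that, here is a rough sketch that turns the "real" timings above into MB/s and compares them with the card's quoted 32MB/s and a 100Mbit link. The variable names are mine, and the 12.5 MB/s LAN figure simply assumes 100 Mbit/s divided by 8 with no protocol overhead.

```python
# Timings copied from the benchmark above; the file is the 32 MB one made with dd.
FILE_MB = 32
real_seconds = {"P4-1.8GHz": 4.428, "AthlonXP2200+": 3.579}

CARD_MB_PER_S = 32      # figure quoted in the story
LAN_MB_PER_S = 100 / 8  # assumption: 100 Mbit/s as MB/s, ignoring protocol overhead


def throughput(size_mb, seconds):
    """Return the average throughput in MB/s."""
    return size_mb / seconds


rates = {name: throughput(FILE_MB, t) for name, t in real_seconds.items()}

for name, rate in rates.items():
    keeps_up = "yes" if rate >= LAN_MB_PER_S else "no"
    print(f"{name}: {rate:.1f} MB/s, keeps up with 100Mbit LAN: {keeps_up}, "
          f"card advantage: {CARD_MB_PER_S / rate:.1f}x")
```

Both machines come out well under 10 MB/s on this input, so with default settings neither would keep a 100Mbit link busy on random data.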
Thoughts on tech, Software Engineering, and stuff
You're assuming the card is using the same settings as your version of gzip defaults to. More likely it's using a much lower compression level and a considerably slower processor.
Note that this isn't necessarily a bad thing; at the expense of maybe 5-10% less compression, you're getting that high throughput. Depending on your task, it's a good trade-off.
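If you want to see the level trade-off on your own box, something like the following should do. It is only a sketch: it uses Python's zlib module (the same deflate algorithm gzip uses), and the sample data is made-up request-like text that I chose for illustration, not the random file from the benchmark. Timings and sizes will vary by machine and by input.

```python
import time
import zlib

# Hypothetical sample: repetitive request-like text, chosen only for illustration.
sample = b"".join(
    b"GET /page/%d HTTP/1.1\r\nHost: example.com\r\n\r\n" % (i % 500)
    for i in range(20000)
)


def measure(data, level):
    """Compress data at the given level; return (compressed bytes, seconds)."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    return packed, time.perf_counter() - start


results = {}
for level in (1, 6, 9):  # fastest, default, best compression
    packed, elapsed = measure(sample, level)
    results[level] = packed
    print(f"level {level}: {len(sample)} -> {len(packed)} bytes "
          f"in {elapsed * 1000:.1f} ms")
```

Run it against data that looks like your real traffic before drawing conclusions; the gap between levels depends heavily on the input.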
The general trend in the industry is toward non-intelligent interconnects: Gigabit cards used to have a processor (Alteon), but they don't anymore (see the latest Intel cards). I2O never took off because you don't really need to relieve a computer of computation when computing power is plentiful.
On a Xeon 2.8GHz, I just got 71 MB/s for gzip.
What's the use for such hardware then?
Plus it will eat PCI bus bandwidth, because data has to go from memory to the processing card, back to memory, then to the network card. You triple the PCI bus traffic. (Not true if the compression is embedded in the network card.)
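A back-of-the-envelope version of the bus argument, as a sketch. The assumptions are mine: plain 32-bit/33MHz PCI with a theoretical peak of about 133 MB/s, and a hypothetical `ratio` parameter (compressed size divided by original size) that shrinks the two return trips. With ratio 1.0, i.e. incompressible data, you get the "triple" figure.

```python
PCI_PEAK_MB_S = 133  # assumption: theoretical peak of 32-bit/33MHz PCI


def bus_traffic_with_card(input_mb_s, ratio):
    """Bus traffic when a PCI card compresses.

    Uncompressed data goes memory -> card, then compressed data goes
    card -> memory and memory -> network card.
    """
    return input_mb_s * (1 + 2 * ratio)


def bus_traffic_cpu_only(input_mb_s, ratio):
    """Bus traffic when the CPU compresses: only memory -> network card."""
    return input_mb_s * ratio


for ratio in (1.0, 0.25):
    with_card = bus_traffic_with_card(32, ratio)
    cpu_only = bus_traffic_cpu_only(32, ratio)
    print(f"ratio {ratio}: card {with_card:.0f} MB/s "
          f"({with_card / PCI_PEAK_MB_S:.0%} of peak), CPU only {cpu_only:.0f} MB/s")
```

Real PCI never sustains its theoretical peak, so the percentages here are optimistic.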
The best idea would be to make the chip an FPGA, not a specially designed processor. Then you could load in different chip designs for whatever was currently needed. Need to do RSA encryption? The board reconfigures the FPGA for it. Same goes for DivX compression, gzip, SETI@Home, etc. FPGAs take a few milliseconds to reconfigure, but when they operate as a dedicated signal processor they can leave a general-purpose processor in the dust, leaving the main CPU to run the other apps, the desktop, etc.
Check out the IEEE archives and journals, searching for "adaptive computing" or "reconfigurable computing".
KingPrad
Stop the Slashdot Effect! Don't read the articles!
I guess that this would only be useful for dynamic sites, wouldn't it? Otherwise, static pages would be cached on the server, only needing compression the first time they are served :-? :(
At any rate, most of the visitors to my site rarely get the gzipped pages, as their browsers don't seem to support it.
Maybe you're thinking of dynamic linking against zlib or other compression libraries. This would use the same code, quite literally. That would be the most useful way to support a card like this. The zlib.so (or zlib.dll) could be modified to interface with the drivers for the card, so programs linked against zlib would transparently use the faster hardware acceleration. Few programs will be statically linked to zlib anyway, and those exceptions are likely to either be binaries you don't mind recompiling for speed (e.g. you linked it statically and tweaked the binary for speed already) or binaries on some rescue disk or small root filesystem where zlib.so may not be readable.
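The "compress once, serve many times" idea for static pages can be sketched like this. `PageCache` and `read_page` are hypothetical names I made up for the example; a real server would also have to invalidate entries when a file changes, which this ignores.

```python
import gzip


class PageCache:
    """Keep the gzipped form of each static page after the first request."""

    def __init__(self, read_page):
        self._read_page = read_page  # hypothetical callable: path -> bytes
        self._cache = {}
        self.compress_calls = 0      # counts how often we actually compressed

    def get(self, path):
        if path not in self._cache:
            self.compress_calls += 1
            self._cache[path] = gzip.compress(self._read_page(path))
        return self._cache[path]


pages = {"/index.html": b"<html>" + b"hello " * 1000 + b"</html>"}
cache = PageCache(pages.__getitem__)

for _ in range(3):
    body = cache.get("/index.html")

print(f"served 3 times, compressed {cache.compress_calls} time(s), "
      f"{len(pages['/index.html'])} -> {len(body)} bytes")
```

Dynamic pages do not get this benefit, since every response is different and has to be compressed as it is generated.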
When the PCI bus is taken, other stuff that the CPU needs to do will also be halted. And then the PCI bus is much slower than the FSB.
I think what we need in order to push distributed computing further is a change to the RAM and DMA channels. There should be many physical channels to the RAM, capable of simultaneously reading/writing different parts of it. For example, if the RAM can output 200 MB per sec, 16 devices could attach themselves to the RAM via maybe EDMA (enhanced DMA?) and simultaneously be able to read at 200MB each. This might be done by:
(1) Altering the addressing logic in the memory ICs, maybe put 16 different addressing systems and multiply their pins x16. Then have an external matrix, more advanced than the 802x DMA chip, to allow simultaneity.
(2) Separate the addressing schemes of each chip, so an OS kernel could smartly put data of important processes in the right chip to be worked on by external devices, again also having an external matrix for the address multiplexing.
This way such a PCI gzip device could have its PCI address space, IRQ as well as (EDMA?) address which it would use to access the data to gzip and put back into the RAM, at full speed, not taking up RAM bandwidth, PCI bandwidth, IRQs or the CPU at all.
AGP has achieved this by separating the AGP channel from PCI, but still using dedicated memory rather than smartly-shared memory. I understand multiprocessor systems technically do the same thing, but in this case we are treating the external devices like complete slaves, like the GPU, for only dedicated purposes, and I'm emphasizing the smart sharing of memory that doesn't exist in multiprocessor systems either. In this scheme, one could add CPU cards, maybe hot-plugged, and have an insta-multiprocessor system, or use it to offload kernel compilation, zipping, 3D transformations, or even take user tasks while the main CPU just works in supervisor mode.
"Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
Gzip works with blocks of data too, but it only looks back over a 32KB window, whereas bzip2 compresses blocks of nearly 1MB, and gzip is not nearly as CPU intensive as bzip2. This is why it appears to produce a continuous stream of compressed data (even if, strictly speaking, it doesn't).
Gzip just seems to be a well-balanced compromise between resources and resulting compression ratio, plus it is Free Software (hint: bzip2 is Free Software too, but Rar isn't).
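The streaming behaviour is easy to see with a small sketch that feeds data in 32KB chunks and hands back compressed bytes as they become available. It uses Python's zlib module with wbits=31 to get gzip framing; the chunk size and the sample data are my own choices for illustration.

```python
import gzip
import zlib

CHUNK = 32 * 1024  # feed the compressor 32KB at a time (illustrative choice)


def stream_gzip(chunks, level=6):
    """Yield gzip-format output incrementally as input chunks arrive."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)  # wbits=31: gzip header/trailer
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:  # the compressor may buffer and return nothing for some chunks
            yield out
    yield comp.flush()  # emit whatever is still buffered, plus the trailer


data = b"The quick brown fox jumps over the lazy dog.\n" * 20000
chunks = (data[i:i + CHUNK] for i in range(0, len(data), CHUNK))

pieces = list(stream_gzip(chunks))
compressed = b"".join(pieces)
print(f"{len(data)} bytes in, {len(compressed)} bytes out, in {len(pieces)} pieces")
```

The compressor never needs the whole input in memory, which is what makes gzip practical for compressing a response while it is still being sent.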
We at Indra Networks developed a PCI based gzip accelerator a long time ago. It has been on sale for almost a year. The current version of the card is already at 50 MB/s and we have been shipping that since last September. A higher performance version is on the way.
The card is being sold on an OEM basis to manufacturers of load balancers and SSL accelerators. These boxes front-end multiple Web servers and have very high performance requirements. Also, the CPU has plenty of other work to do, for example TCP/IP processing. This is the application that needs hardware acceleration.
For a low performance site, mod_gzip is fine. But, if you have a busy site with hundreds of Web servers, you don't want to go around installing mod_gzip hundreds of times. It is a lot cheaper to buy a load balancer with gzip hardware acceleration.
bzip2 is irrelevant here as IE and Netscape would not understand bzip2 encoding anyway. But they understand gzip just fine (unless you have a version that is many years old).
Monish Shah
CTO, Indra Networks
www.indranetworks.com
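On the browser side, whether a client understands gzip comes down to the Accept-Encoding request header. A minimal sketch of that check follows; `accepts_gzip` is a hypothetical helper, and it deliberately ignores the "*" wildcard and other corner cases of the HTTP spec.

```python
def accepts_gzip(accept_encoding):
    """Return True if the Accept-Encoding header value allows gzip.

    Simplified: handles "gzip" and "x-gzip" with an optional q-value,
    and ignores the "*" wildcard.
    """
    for item in accept_encoding.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() not in ("gzip", "x-gzip"):
            continue
        q = 1.0  # default quality when no q-value is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if q > 0:
            return True
    return False


for header in ("gzip, deflate", "deflate, gzip;q=0.8", "gzip;q=0", "identity", ""):
    print(repr(header), "->", accepts_gzip(header))
```

A server or load balancer would send the compressed body with "Content-Encoding: gzip" only when this check passes, and fall back to the plain body otherwise.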