Gzip on a PCI card
steve writes "The German tech news site heise.de is reporting here (in German, of course) about a PCI card developed by the University of Wuppertal and Vigos AG being shown at CeBIT, which does Gzip compression in hardware, thus freeing the CPU to do other tasks. The PCI card can compress 32MB/sec, which is more than enough to compress a 100Mbit LAN in realtime. A future version will do 64MB/sec. The article mentions that this will be of particular interest for web servers. The card should be on sale by the end of the year."
"for sale", not "on sale"
So why doesn't this card do bzip2? ;^)
Seriously, this is a funny development. The inverse of the winmodems.
Other interesting ideas for dedicated cards?
Seems this would be a great help to those doing backups over a LAN. It shouldn't take too much to alter a version of tar, rsync, etc. to use this card.
the key to using gzip is really not to compress at too high a ratio... a low rate of compression offers a pretty sizeable saving in bandwidth for an acceptable CPU usage... once you edge up to the higher compression levels then you pay for it in the CPU and your app slows.
i love the idea of a hardware based gzip... but i'd start by educating the software users on the cost vs benefit ratio of their existing configuration... i always seem to find that those who don't know what they're doing are the ones that have it set to maximum compression
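To put rough numbers on the cost vs benefit point, here is a minimal sketch using Python's zlib module, which implements the same DEFLATE compression gzip uses. The sample data is a made-up, log-like byte string (an assumption, not real traffic), so the sizes and timings it prints are only indicative.

```python
import time
import zlib

# Hypothetical sample: repetitive, log-like text (an assumption, not real traffic).
data = b"".join(
    b"GET /page/%d.html HTTP/1.1 200 %d\n" % (i % 500, 1000 + i % 97)
    for i in range(50_000)
)

def measure(level):
    """Return (compressed size in bytes, seconds taken) for one zlib level."""
    start = time.perf_counter()
    out = zlib.compress(data, level)
    return len(out), time.perf_counter() - start

# Level 1 is the fastest setting, 9 the most thorough, 6 the usual default.
results = {level: measure(level) for level in (1, 6, 9)}
for level, (size, seconds) in results.items():
    print(f"level {level}: {size} bytes ({size / len(data):.1%}) in {seconds:.3f}s")
```

Run it against your own content before picking a level; the gap between levels depends heavily on the data.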
The gzip algorithms I have seen look as though they were designed so that they could be done in hardware. I was under the impression that was intended.
As an aside, this could of course easily be done using an FPGA PCI card, one that can do anything you want. Reprogram it to accelerate SETI@home or stick some routines used in Quake into it. Much more versatile.
The only problems are standardisation and convincing developers to use them.
Mouse powered Chips, Open source Processors and Lego
I try to avoid bzip2 because it is so slow, even on modern hardware. bzip2 compresses very well, much better than gzip. A bzip2 version of this card makes sense ....
RFC1925
GZIP compression in hardware
A joint venture between the University of Wuppertal and the Hagen-based Vigos AG is showing the prototype of a "GZIP Accelerator Board" at CeBIT (Hall 11, D26). The PCI plug-in card takes the time-consuming compression off the processor and, in its current version, is said to compress 32 MByte per second already. That is enough to compress the network traffic of a 100 Mbit line in real time; thanks to a modular design, up to 64 MByte per second should be reached later.
Above all in web servers, the outgoing data volume is to be compressed on the fly, relieving both the CPU and the network connection, a welcome help for Internet providers that have to be careful with resources. They are also the primary target group for the process, which has since been patented and is due to go into the first production units at the end of 2003. By then the manufacturer also intends to have adapted the card's still very bulky layout to the conditions inside server cases. (Christopher Kunz) / (sun/iX)
oh OH!!!!! NOW it makes sense!
Get paid to code OSS
For comparison I ran gzip on two machines I happen to have immediate access to. I compressed a 32 MB file read from /dev/urandom, which is probably a worst-case scenario for a compressor.
dd if=/dev/urandom of=32m bs=1024k count=32 ; time gzip 32m
P4-1.8Ghz:
real 0m4.428s
user 0m4.220s
sys 0m0.170s
AthlonXP2200+:
real 0m3.579s
user 0m3.310s
sys 0m0.160s
So 32MB/s sounds pretty good to me.
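For anyone who wants the arithmetic behind that comparison, here is a small sketch that turns the "real" times above into throughput. The machine labels are just dictionary keys, and the 100 Mbit conversion ignores protocol overhead (an assumption).

```python
SIZE_MB = 32  # size of the test file created by the dd command above

# Wall-clock ("real") times reported in the post, in seconds.
timings = {"P4 1.8GHz": 4.428, "AthlonXP 2200+": 3.579}

throughput = {name: SIZE_MB / secs for name, secs in timings.items()}
for name, rate in throughput.items():
    print(f"{name}: {rate:.1f} MB/s")

CARD_MB_S = 32       # figure claimed for the card in the story
LAN_MB_S = 100 / 8   # 100 Mbit/s expressed in MB/s, ignoring protocol overhead
print(f"card vs. 100 Mbit line: {CARD_MB_S / LAN_MB_S:.2f}x")
```

That works out to about 7.2 MB/s and 8.9 MB/s for the two CPUs on this random-data test, against a 12.5 MB/s line rate and the card's claimed 32 MB/s.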
Thoughts on tech, Software Engineering, and stuff
Not a professional job, just babelfished..
GZIP compression by hardware A Joint venture of the University of Wuppertal with the Hagener Vigos AG points to the CeBIT (, D26 resounds to 11) the prototype of a "GZIP accelerator board". The PCI plug-in card removes the time-consuming compression from the processor and is in the current version already 32 MByte per second to compress together to be able. Thus the Netzwerktraffic of a 100-MBit-Leitung can be already compressed in real time; by a modular structure are to be achieved later up to 64 MByte per second. Particularly in Web servers so the outgoing volume of data is to be compressed on-the-fly and be relieved thus both the CCU and the network binding -- a welcome assistance for InterNet Provider, which must act resources-carefully. These are also the primary target group for the procedure patented meanwhile, which is to be used in first standard sets at the end of of 2003. Up to then the manufacturer wants to have adapted also the still very klobige layout of the map on the conditions in server housings. (Christopher Kunz)/(sun/iX)
does the article mention anything about decompression? my german is lousy but it seems it doesn't. Is decompression really that fast so that it doesn't need dedicated hardware??
John Carmack fan, browsing at +5 since 1999.
You're assuming the card is using the same settings as your version of gzip defaults to. More likely it's using a much lower compression level and a considerably slower processor.
Note that this isn't necessarily a bad thing; at the expense of maybe 5-10% less compression, you're getting that high throughput. Depending on your task, it's a good trade-off.
Almost all current browsers will automatically uncompress gzipped files sent to them, allowing things such as the mod_gzip module to compress web pages and have them rendered in the browser transparently. The bandwidth savings can be huge, with all the associated benefits (less bandwidth for the server, less for the clients and less congestion on the net). Without bzip2 support built into the browser, a bzip2 hardware compressor isn't useful for general web traffic, as it can't be used for the pages being sent.
It'd be nice if I could convince my boss to get some of these for us, but our CPU usage is pretty low right now with the mod_gzip module installed, so it'd be an unnecessary luxury at this point for us.
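The transparent part works because the browser announces what it can decode in its Accept-Encoding request header and the server marks what it sent with Content-Encoding. A minimal sketch of that negotiation follows; the function name encode_response is made up for illustration, this is not how mod_gzip itself is written, and q-values are deliberately ignored.

```python
import gzip

def encode_response(body: bytes, accept_encoding: str):
    """Gzip the body only if the client's Accept-Encoding header lists gzip.

    Simplified on purpose: q-values (e.g. "gzip;q=0") are ignored.
    Returns (body_to_send, extra_response_headers).
    """
    offered = {token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")}
    if "gzip" in offered:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}

page = b"<html>" + b"<p>hello</p>" * 1000 + b"</html>"
sent, headers = encode_response(page, "gzip, deflate")
print(len(page), len(sent), headers)
```

A client that does not list gzip gets the page back untouched, which is also why a bzip2-only accelerator would have nothing to negotiate with.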
I am, and always will be, an idiot. Karma: Coma (mostly effected by
The general trend in the industry is toward non-intelligent interconnects (Gigabit cards used to have a processor (Alteon); they don't anymore (see the latest Intels)). I2O never took off because you don't really need to relieve a computer of computation when computing power is plethoric.
On a Xeon 2.8GHz, I just got 71 MB/s for gzip.
What's the use for such hardware then?
Plus it will eat the PCI bus, because data has to go out of memory to the processing card, back to memory, then to the network card. You triple the PCI bus traffic. (Not true if the compression is embedded in the network card.)
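A back-of-the-envelope model of that data path, assuming all three transfers cross the same shared bus. The function name and the ratio parameter are illustrative assumptions, not anything from the card's documentation.

```python
def pci_traffic_mb_s(input_mb_s: float, ratio: float) -> float:
    """Bus traffic when a separate card compresses data bound for a NIC.

    ratio = compressed size / original size (1.0 means nothing was gained).
    """
    to_card = input_mb_s              # memory -> compression card
    from_card = input_mb_s * ratio    # compression card -> memory
    to_nic = input_mb_s * ratio       # memory -> network card
    return to_card + from_card + to_nic

print(pci_traffic_mb_s(32, 1.0))   # incompressible data: the full tripling
print(pci_traffic_mb_s(32, 0.25))  # data that shrinks to a quarter
```

Under these assumptions the tripling is the worst case (incompressible data); the better the data compresses, the smaller the two return trips get.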
The best idea would be to make the chip an FPGA not a specially-designed processor. Then you could load in different chip designs for whatever was currently needed. Need to do RSA encryption? The board reconfigures the FPGA for it. Same goes for Divx compression, gzip, SETI@Home, etc. FPGAs take a few milliseconds to reconfigure but when they operate as a dedicated signal processor they can leave a general purpose processor in the dust - leaving the main CPU to run the other apps, the desktop, etc.
Check out the IEEE archives and journals, searching for "adaptive computing" or "reconfigurable computing".
KingPrad
Stop the Slashdot Effect! Don't read the articles!
I guess that this would only be useful for dynamic sites, wouldn't it? Otherwise, static pages would be cached on the server, only needing compression the first time they are served :-? :(
At any rate, most of the visitors to my site rarely get the gzipped pages, as their browsers don't seem to support it
Another thing about gzip is that it is asymmetric: decompression is much faster than compression. Again this is a nice feature, because most files will be decompressed many times but compressed only once. Thus for instance, all man pages are stored in gzipped form and decompressed on demand.
But I can't see the point of implementing it in a PCI card. Wouldn't it be better to integrate it with either the processor or the network interface?
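The asymmetry is easy to check for yourself. A rough sketch with Python's zlib (same DEFLATE algorithm as gzip); the sample text is invented and the timings will vary by machine, so treat the output as indicative only.

```python
import time
import zlib

# Hypothetical sample text, loosely man-page-like (an assumption).
data = b"".join(
    b"man page line %d: some fairly ordinary text\n" % i for i in range(200_000)
)

start = time.perf_counter()
packed = zlib.compress(data, 9)      # maximum effort on the compress side
compress_s = time.perf_counter() - start

start = time.perf_counter()
unpacked = zlib.decompress(packed)   # the decompressor has no effort setting
decompress_s = time.perf_counter() - start

print(f"compress:   {compress_s:.3f}s")
print(f"decompress: {decompress_s:.3f}s")
```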
The article mentions that this will be of particular interest for web servers.
I'm assuming one is referring to something that will work with mod_gzip. That may be fine and dandy, but I just recently had to disable mod_gzip on my server. You can blame Microsoft.[1] It seems that both IE 5.5 and 6.0 have nasty little "sometimes" bugs[2] where they won't know what to do with gzipped content. I tried to disable it by user agent header with no luck. If anyone else has some good pointers or perhaps even a link to a patched version of mod_gzip that'll avoid those two bugs, I would appreciate it.
[1] No, really. This isn't a troll. They even admit the bugs.
[2] Microsoft Knowledge Base Articles: Q313712 IE 5.5 Q312496 IE 6.0
Yeah, I'm stupid. Correct me where I'm wrong.
This thing is going to sit on the PCI bus? Isn't that where your hard drives are too? On older computers which use a 33 megahertz bus, that would mean that compression @33 megahertz would keep the hard drive from receiving any of the data. So, it would actually have to compress it at a slower rate, unless it caches everything. Even at 133 megahertz, the hard drive would be both reading and writing when trying to compress, and that's without worrying about swap.
Have you read my journal today?
Good point, Ego.
Merlin? Mind running those tests one more time, this time to a ramdisk?
Oh, one more thing I found out in extensive tests: the MS IE patches don't always work as advertised. If they did, it would be easy to say "if you get garbage on these pages, install SP1 for your browser." They appear to fix it somewhat, but not always. The "sometimes" bug still exists in 5.5 SP1 and 6.0 SP1...and that is why mod_gzip is disabled now.
When the PCI bus is taken, other stuff that the CPU needs to do will also be halted. And then the PCI bus is much slower than the FSB.
I think what we need to push distributed computing more is altering the RAM and DMA channels. There should be many physical channels to the RAM capable of simultaneously reading/writing different parts of it. As in if the ram can output 200 MB per sec, 16 devices could attach themselves to the RAM via maybe EDMA (enhanced DMA?) and simultaneously be able to read at 200MB each. This might be done by:
(1) Altering the addressing logic in the memory ICs, maybe put 16 different addressing systems and multiply their pins x16. Then have an external matrix, more advanced than the 802x DMA chip to allow simultaneity.
(2) Separate the addressing schemes of each chip, so an OS kernel could smartly put data of important processes in the right chip to be worked on by external devices.. again also having an external matrix for the address multiplexing.
This way such a PCI gzip device could have its PCI address space, IRQ as well as (EDMA?) address which it would use to access the data to gzip and put back into the RAM, at full speed, not taking up RAM bandwidth, PCI bandwidth, IRQs or the CPU at all.
AGP has achieved this by separating the AGP channel from PCI, but still using dedicated memory rather than smartly-shared memory. I understand multiprocessor systems technically do the same thing, but in this case we are treating the external devices like complete slaves, like the GPU, for only dedicated purposes, and I'm emphasizing the smart sharing of memory that doesn't exist in multiprocessor systems either. In this scheme, one could add CPU cards, maybe hot-plugged, and have an insta-multiprocessor system or use it to offload kernel compilation, zipping, 3d transformations, or even take user tasks while the main CPU just works in supervisor mode.
"Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
could someone tell me what this has to do with apache?
On current PCI architectures, you already have that implemented.
Here is the description of the Serverworks chipset (scroll down to the drawings). Intel's (e7500/7501) is very similar, in architecture at least.
The memory subsystem is one leg of the northbridge (the center of the chipset). Two channels allow the chipset to double the bandwidth, but do not improve the latency.
The CPU(s) sit on another bus.
The PCI busses are interconnected through hubs and specialised links. With this kind of architecture, you can reach 4 times 400 MB/s (1.6 GB/s aggregate) using busses in 64-bit/66MHz PCI mode. Even better can be expected with PCI-X interfaces.
About the address tricks, you can do that kind of things, but in this case, expect to have to write many things ad-hoc, and forget the general-purpose side of your system. You usually want a real-time system, and I see no point in doing that for a simple web server. RADAR systems, avionics, and stuff like that can be expected to use that kind of trick and optimisations a lot (lots of processing done, and in a very systematic way). 3D rendering seems a nice application for that as well, but I don't know what the state of the art is for high quality (movies) rendering.
The hard part of Web sites is usually database access, which implies complex algorithms that don't fit well in specialised hardware. Compression is, I think, anecdotal in a web server.
Running gzip on a PCI card could invalidate its warranty. Make a backup of /proc/bus/pci/(card number) first.
The article mentions that this will be of particular interest for web servers.
Why? Gzip already uses minimal processor time...and many sites already use Mod_Gzip...
So, as far as I'm concerned, unless the Mod_Gzip project supports this hardware, it's not gonna float...
Now, Gcc on a PCI card is something I'd pay for...
HIV Crosses Species Barrier... into Muppets
As someone who has been working with a large number of new P4 and Athlon PCs, I can tell you that most new PCs still use one single 32 bit, 33 MHz PCI bus. Even whiz-bang mobos with onboard RAID controllers tend to use a single PCI bus of this type... a major I/O bottleneck if you plan on moving more than 100 MB/sec of data. (Granted, RAM, AGP, and CPU still have lots of legroom.) Keep this in mind when building your next server... you may want to consider a board with 64 bit, 66 MHz PCI or even 133 MHz PCI-X.
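For reference, the theoretical peak of a parallel PCI bus is just width times clock. A tiny sketch, using the nominal clocks of 33.33, 66.66 and 133.33 MHz; the helper name pci_peak_mb_s is made up, and real sustained throughput is lower than these peaks because of arbitration and protocol overhead.

```python
def pci_peak_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak: one transfer of width_bits per clock cycle."""
    return width_bits / 8 * clock_mhz

buses = [
    ("PCI 32-bit / 33 MHz", 32, 33.33),
    ("PCI 64-bit / 66 MHz", 64, 66.66),
    ("PCI-X 64-bit / 133 MHz", 64, 133.33),
]
for name, width, clock in buses:
    print(f"{name}: about {pci_peak_mb_s(width, clock):.0f} MB/s peak")
```

The first line comes out at about 133 MB/s, shared by every device on the bus, which is why 100 MB/sec of traffic already counts as a bottleneck.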
Then implement bzip2 AND gzip on the same card! And while they're at it, include an implementation of lzip, rzip, s3tc, and flac on it as well. Isn't the point to have a dedicated micro-computer, ie small package, do all this? Hello! Microcomputer should do more than just gzip!
When are they gonna offload something interesting, like 3-d rendering, to cards instead of abusing the poor cpu?!
Oh... wait...
Sorry about that; my computer date was set for January 3rd, 1987... let me get out my soldering iron and correct it
Just because I doubt myself does not mean I find your position compelling.
I think it's a little naive to say "Oh, my 1000 hit a day web box, running on a cheap 686 wouldn't benefit from this, so it must suck." Hey, don't get mad! You said it! :P
What if you run a website that gets say 5million+ page views a day and you generate around 2gigs of logs per day per machine across 8 machines. At night you set up an automated batch to zip the logs and ftp them to a log reporting server. Then a cron job kicks off log analysis of all 16gigs of logs. Wouldn't this hardware acceleration help? Now let's try to scale that up to 20million+ page views a day. Or what if you're Yahoo, who gets 1billion page views a day. How many gigs of logs do you have to process now? Not everyone needs hardware acceleration, but I would hardly call it useless.
The PCI GZip will work under Ninnle Linux, of course.
Now to come out with a single "web server accelerator card"
that does both ssl/cram-md5/AES/etc.. and gzip/zlib/other compression
I can see my clients salivating already (saving the processors for those
.jsp pages etc...)
well except for the IO-bound jobs...
Sometimes I shudder when I hear of people zipping large volumes onto backup. Hopefully hardware compression won't aggravate this problem by making it easier.
One of the big problems with compressed backups, particularly if you are tar-gzipping something, is that any resulting damage/error in the file can render an entire archive unusable.
Hopefully, most people are into tar-clustering files (that is to say... tar'ing large archives as a group of files, then gzip'ing the grouped archive). You might save a little on CPU and grow the file a bit, but the saving in integrity and possibly speed can be worth it.
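Here is a small sketch of the integrity argument, using zlib streams as a stand-in for gzip files. The "files" are seeded pseudo-random text and the damage is a single flipped byte, both assumptions made purely for illustration. It contrasts one big compressed stream with independently compressed members.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sketch is repeatable
files = [bytes(rng.choice(b"abcdefgh \n") for _ in range(20_000)) for _ in range(4)]

def corrupt(blob: bytes) -> bytes:
    """Flip every bit of one byte in the middle of the blob."""
    mid = len(blob) // 2
    return blob[:mid] + bytes([blob[mid] ^ 0xFF]) + blob[mid + 1:]

def recoverable(blob: bytes) -> bool:
    """True if the whole stream decompresses and passes its checksum."""
    try:
        zlib.decompress(blob)
        return True
    except zlib.error:
        return False

# Strategy A: everything concatenated into one compressed stream.
single = zlib.compress(b"".join(files))
single_ok = recoverable(corrupt(single))

# Strategy B: each file compressed on its own; the damage hits only one member.
members = [zlib.compress(f) for f in files]
members[1] = corrupt(members[1])
members_ok = [recoverable(m) for m in members]

print("single stream intact:", single_ok)
print("per-file members intact:", members_ok)
```

This all-or-nothing check is stricter than real life: streaming tools can often still salvage the data that precedes the damaged spot. Everything after it in the same stream is the part at risk.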
We at Indra Networks developed a PCI based gzip accelerator a long time ago. It has been on sale for almost a year. The current version of the card is already at 50 MB/s and we have been shipping that since last September. A higher performance version is on the way.
The card is being sold on an OEM basis to manufacturers of load balancers and SSL accelerators. These boxes front-end multiple Web servers and have very high performance requirements. Also, the CPU has plenty of other work to do, for example TCP/IP processing. This is the application that needs hardware acceleration.
For a low performance site, mod_gzip is fine. But, if you have a busy site with hundreds of Web servers, you don't want to go around installing mod_gzip hundreds of times. It is a lot cheaper to buy a load balancer with gzip hardware acceleration.
bzip2 is irrelevant here as IE and Netscape would not understand bzip2 encoding anyway. But they understand gzip just fine (unless you have a version that is many years old).
Monish Shah
CTO, Indra Networks
www.indranetworks.com
When I execute your command I only get about 1.5MB of random data instead of 32MB. I'm running Gentoo Linux on this box.
Learn from the mistakes of others. There isn't enough time to make them all yourself.
but do I remove my tv tuner card for it?
How would this be implemented in unix? Would there be a device to stream to and a replacement for the gzip command and compression libraries?
You can't judge a book by the way it wears its hair.
When rar gives better compression? Since CPU speed won't be a factor anymore, it would make sense to go with a compression system that is more compact.
.RAR is best and is cross-platform. I would have used .RAR...
Using just the standard options, here's my results:
Original file: 732,921,856 bytes
.ZIP compressed: 725,244,234 bytes
.CAB compressed: 719,244,234 bytes
.RAR compressed: 719,855,409 bytes
.TAR compressed: 732,928,000 bytes
.BZ2 compressed: 732,884,505 bytes
.LHA/.LZH compressed: 725,886,696 bytes
.BH compressed: 725,251,468 bytes
.tar.gz compressed: 725,254,634 bytes
.CAB actually won, but that one has some problems (like being Windows), and of the remainder
A lot of computing records over the years have been set by vector computers or other specialized hardware. Put that power on a PCI card like this gzip solution, make the algorithm reprogrammable and reconfigurable as well, and you get: Mitron Co-processor on a PCI-card.
has been traditional areas for these kinds of devices, but with the new FPGA's and PCI-express on the horizon I can see it becoming usable for even more specialized applications.
Here is a crude translation of an article in Swedish (source: Elektroniktidningen)
FPGA enhances PC
You don't have to be a logic constructor to make use of FPGA-chips. Using a normal PCI-card and a compiler from the innovation startup Flow Computing in Lund, programming in Flow's dialect of C is enough.
- We can make a normal PC do calculations that otherwise would have needed supercomputers or large Linux clusters, said Josef Macznik at Carlstedt Research & Technology, a company that has invested in and works together with Flow Computing.
The main idea is parallelism. That implies that the PC hardware has to be extended in some way, since normal PC processors work sequentially and normal programs are written to be executed in that way.
Flow has chosen to use normal PCI-cards. The cards are equipped with an FPGA-chip from Xilinx with two million gates, but the size of the chip can be selected depending on requirements according to Josef Macznik.
The corporate secret lies in the compiler. Software has to be written in Flow's own variety of C, and the compiler can decide which processes gain the most from parallel execution, configuring the FPGA for maximum efficiency.
- The user doesn't see the FPGA-chip and doesn't really have to know what kind of hardware there is on the card. We are aiming at programmers - that's where the market is, said Josef Macznik.
Flow's solution is currently used by a bioinformatics company in Lund. But according to the company the technology can be used for all purposes where the computing power in a PC needs to be multiplied using parallelism and where the effort to adapt programs to the special variety of C is worthwhile.
from memory, gzip takes in a byte at a time, and outputs a bitstream of huffman-like variable width tokens. As each input token (0..255 and EOF (say 256)) is applied to the engine sequentially, it is possibly replaced by a compound output token (numbered 257...2^N-1) encoding an already-seen sequence (old-output-token, new-input-token).
The 32KB blocking is mainly to simplify resync, IIRC.
There is *no* lookahead in gzip, just that one pending output token, and memory of past input.
Bzip2 OTOH analyzes the entire block (default 900kB) before outputting a single bit, and thus can do a better job with changing pattern space.
Go explore gzip and friends -- they're beautiful.
The trick is to modify the encoder's state (learning a better encoding for a sequence) only *after* that token has been emitted, so the decoder learns exactly the same lesson as the encoder has, just by watching the token stream.
No metadata has to be passed between the machines.
*Really* simple in hardware (for sufficiently complex values of 'simple')
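The scheme described above (single-byte tokens 0..255, an EOF token, and new compound tokens learned only after a token has been emitted) reads like the LZW algorithm used by Unix compress; gzip itself uses DEFLATE, which is LZ77 matching plus Huffman coding. With that caveat, here is a minimal LZW sketch showing how the decoder rebuilds the same table just by watching the token stream. It emits plain integer codes and skips the variable-width bit packing; the function names are made up.

```python
EOF = 256  # reserved end marker, as in the description above

def lzw_encode(data: bytes) -> list:
    table = {bytes([i]): i for i in range(256)}
    next_code = EOF + 1
    out = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate            # keep extending the known sequence
        else:
            out.append(table[current])     # emit first...
            table[candidate] = next_code   # ...then learn the longer sequence
            next_code += 1
            current = bytes([byte])
    if current:
        out.append(table[current])
    out.append(EOF)
    return out

def lzw_decode(codes: list) -> bytes:
    table = {i: bytes([i]) for i in range(256)}
    next_code = EOF + 1
    out = bytearray()
    previous = b""
    for code in codes:
        if code == EOF:
            break
        if code in table:
            entry = table[code]
        else:
            # Code the decoder has not learned yet: it can only be
            # the previous sequence plus that sequence's own first byte.
            entry = previous + previous[:1]
        out += entry
        if previous:
            table[next_code] = previous + entry[:1]   # learn the same lesson
            next_code += 1
        previous = entry
    return bytes(out)

print(lzw_encode(b"aaaa"))
```

No table is ever transmitted: the decoder's dictionary stays one step behind the encoder's, and the "not learned yet" branch covers the single case where that lag matters.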
^..^ OO (oo)
Doesn't urandom need to wait to collect entropy before it produces output?
Also... who told you random data can't be compressed? That's completely wrong.
Consider a binary string X of length 1000000, where the contents of X are set randomly. It is possible for the contents of X to be all 0's.
I can describe the contents of X in fewer bits than the string itself. I shall do so now. "A binary string of length 1000000 where the contents are 0's.".
Since I can describe the string with another binary string of length less than 1000000, I can compress it. Since it's possible for that string to result if it's been randomly chosen, some random strings can be compressed.
When someone might yell at me, it has to be OpenBSD.
gzip finds repeats among the most recent 32K of the stream it's processing, using a hash table etc. to match its current position against previous ones.
IIRC it hashes the three bytes from its current position and looks for a match against hashes from 32k previous positions, then does a lookup in the hash bucket for as much as it can match following the initial 3 bytes.
The BWT actually sorts every position in the block. It's not streamable in any significant way.
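The gzip matching step described above can be sketched in a few lines. This is a toy, not zlib's code: it uses a Python dict keyed on the exact 3 bytes instead of a real hash table, it searches every candidate in the bucket (zlib caps how far it walks the chain), and it emits tokens as Python objects rather than Huffman-coded bits. The 32K window, 3-byte minimum and 258-byte maximum match are DEFLATE's limits; the function names are made up.

```python
WINDOW = 32 * 1024   # how far back a match may reach
MIN_MATCH = 3        # the three bytes that get hashed
MAX_MATCH = 258      # DEFLATE's longest allowed match

def lz77_tokens(data: bytes) -> list:
    """Return literals (ints) and matches ((distance, length) tuples)."""
    positions = {}   # 3-byte key -> earlier positions (the "hash buckets")
    tokens = []
    i = 0
    while i < len(data):
        key = data[i:i + MIN_MATCH]
        best_len, best_dist = 0, 0
        if len(key) == MIN_MATCH:
            for j in reversed(positions.get(key, [])):
                if i - j > WINDOW:
                    break   # everything older is even further away
                length = MIN_MATCH
                while (length < MAX_MATCH and i + length < len(data)
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - j
        tokens.append((best_dist, best_len) if best_len else data[i])
        step = best_len if best_len else 1
        # Index every position covered by this token for later lookups.
        for k in range(i, i + step):
            k_key = data[k:k + MIN_MATCH]
            if len(k_key) == MIN_MATCH:
                positions.setdefault(k_key, []).append(k)
        i += step
    return tokens

def lz77_expand(tokens: list) -> bytes:
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])   # byte-wise, so overlaps work
        else:
            out.append(token)
    return bytes(out)

print(lz77_tokens(b"abcabcabc"))
```

Everything here only looks backwards over a bounded window, so it can run on a stream as it arrives, unlike a block sort.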
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger