Gzip Encoding of Web Pages?
Both Brendan Quinn and msim were curious about the ability to send gzip-encoded Web pages. Brendan asks: "It's possible to make Apache detect the "Accept-encoding: gzip" field sent by NS 4.7+, IE 4+ and Lynx, and send a gzip-encoded page, thus saving lots of bandwidth all over the place. So why don't people do it?
Here is a module written by the Mozilla guys a couple of years ago that -almost- does what I want, and I could change it pretty easily... but I thought someone else would have done it by now? eXcite do it, does anyone know of any other large-scale sites that use gzip encoding?"
"If you have LWP installed, you can check with:
GET -p '<my proxy>' -H 'Accept-encoding: gzip' -e http://www.site.com/ | less
Try that with 'www.excite.com' and you'll get binary (gzipped) data. That's what I want to do."
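For anyone without LWP handy, the same check can be sketched in Python. This is only a rough equivalent of the GET command above, assuming the standard urllib and gzip modules; the function names build_gzip_request, decode_body and fetch are made up for illustration and the proxy option is left out.

```python
import gzip
import urllib.request


def build_gzip_request(url):
    """Build a GET request that advertises gzip support, like the -H option above."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(content_encoding, body):
    """Return the page bytes, gunzipping only if the server says it gzipped them."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch(url):
    """Fetch url and return (content_encoding, bytes_on_the_wire, decoded_page)."""
    with urllib.request.urlopen(build_gzip_request(url)) as resp:
        encoding = resp.headers.get("Content-Encoding")
        body = resp.read()
    return encoding, len(body), decode_body(encoding, body)
```

Calling fetch('http://www.site.com/') and seeing 'gzip' come back as the encoding would mean the site answers the header the way the poster describes; a value of None means it sent the page uncompressed.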
bleh, dead web site. Try axkit.org.
Matt. Want XML + Apache + Stylesheets? Get AxKit.
Why bother compressing data? Face it, 99% of all web pages out there consist of the following:
- Text - the easiest to compress, but for most sites it's the quickest element to load..
- Graphics - already in compressed (.GIF, .JPG, .PNG) format, so gzip won't (shouldn't be able to) compress them any further - and these are usually the bulk of most page downloads...
I would also think that there is some level of traffic you would have to reach before the improvement in bandwidth compensates for the extra load placed on the server by having to gzip everything dynamically. Of course, for high-text, heavy traffic sites (for example, right here on /.), this may make some sense. But for the majority of sites, it doesn't seem to make sense to me.
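The text-versus-graphics point is easy to try out. The following is a small sketch, not a benchmark: the "page" is synthetic markup and the "image" is just random bytes standing in for data that is already compressed, so both inputs are assumptions rather than real site content.

```python
import gzip
import os


def gzip_ratio(data):
    """Fraction of the original size that remains after gzip (lower is better)."""
    return len(gzip.compress(data)) / len(data)


# Stand-in for a text-heavy page: markup with lots of repeated structure.
html = b"".join(
    b"<tr><td class='comment'>Comment number %d</td></tr>\n" % i
    for i in range(500)
)

# Stand-in for an already-compressed image: random bytes have no redundancy left.
image_like = os.urandom(len(html))

print("text page:  %.0f%% of original size" % (100 * gzip_ratio(html)))
print("image-like: %.0f%% of original size" % (100 * gzip_ratio(image_like)))
```

The text shrinks a lot and the image-like data does not shrink at all, which is the argument for only compressing the text part of a page.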
On the other hand, I might just be a grumpy old man who can't understand all these new-fangled things... :=]
________________________
Corporate Jenga: You take a blockhead from the bottom and you put him on top...
This could actually be used to get around content-based 'net access filters. You could use this method of requesting compressed text to thwart any keyword-scanning filter. Of course, if this became very popular, it would only be a matter of time before the filtering programs added the capability to decode gzip or whatever other compression people were using.
But you can do that pretty easily with mod_rewrite and PHP.
Have the PHP script write both an HTML copy and a gzipped copy of its output whenever it is called (there's a bunch of ob_* functions in PHP that can help you do that). Then use mod_rewrite to have the server serve:
the gzipped copy if available and supported by the client
the HTML copy if available and gzip is not supported
the PHP file if no cached file is available
You can refresh the content by deleting the HTML and gzipped copies. That way you have optimal load on the server and a bandwidth-friendly site.
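The three-way choice above can be sketched in a few lines. This is Python rather than real mod_rewrite rules, purely to show the decision order; the file naming (page.html.gz, page.html, page.php) and the function name choose_file are assumptions, and the Accept-Encoding check is deliberately naive (it ignores q-values).

```python
import os
import tempfile


def choose_file(base, accept_encoding):
    """Pick what to serve for `base`: cached gzip copy, cached HTML copy, or the script."""
    # Naive check: a real server should also honour q-values such as "gzip;q=0".
    client_takes_gzip = "gzip" in (accept_encoding or "").lower()
    if client_takes_gzip and os.path.exists(base + ".html.gz"):
        return base + ".html.gz"
    if os.path.exists(base + ".html"):
        return base + ".html"
    # No cached copy: fall through to the script, which would regenerate both.
    return base + ".php"


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    base = os.path.join(d, "page")
    print(os.path.basename(choose_file(base, "gzip")))   # nothing cached yet
    open(base + ".html", "w").close()
    open(base + ".html.gz", "w").close()
    print(os.path.basename(choose_file(base, "gzip")))   # gzip-capable client
    print(os.path.basename(choose_file(base, None)))     # client without gzip
```

Deleting the two cached files sends the next request back to the script, which matches the refresh trick described above.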
Would you dynamically gzip the whole site before sending it? On some sites where the page content goes on for (screen) pages the browser loads what it has and when it has more it loads more. Think about the ramifications of waiting for the entire (site) page to be gzip'd then sent, then you have to unzip it...
What about parts and pieces??
Here is a tar command I use to move files around from system to system occasionally:
tar -cf - ./$1 | rsh destination "cd /export/home1; tar -xBpf -"
It goes in chunks - not the whole thing, maybe you should think about incorporating this type of duck movement...
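The same chunk-at-a-time idea works for gzip itself, so a page does not have to be compressed whole before anything is sent. Here is a minimal sketch using Python's zlib module; gzip_chunks and gunzip_chunks are hypothetical helper names, and flushing after every chunk is an assumption about how eager the server wants to be (it costs a little compression ratio).

```python
import zlib


def gzip_chunks(chunks):
    """Yield gzip-framed output piece by piece as page chunks become available."""
    # wbits=31 tells zlib to write the gzip header and trailer.
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        out = compressor.compress(chunk)
        # Z_SYNC_FLUSH pushes out everything seen so far so the client can decode it.
        out += compressor.flush(zlib.Z_SYNC_FLUSH)
        if out:
            yield out
    # Final deflate block plus the gzip trailer (CRC and length).
    yield compressor.flush()


def gunzip_chunks(pieces):
    """Yield the decoded bytes for each piece as it arrives, without waiting for the end."""
    decompressor = zlib.decompressobj(wbits=31)
    for piece in pieces:
        yield decompressor.decompress(piece)


page = [b"<html><body>", b"<p>first screen</p>", b"<p>second screen</p>", b"</body></html>"]
for decoded in gunzip_chunks(gzip_chunks(page)):
    print(decoded)
```

Each piece can be decoded as soon as it arrives, so the receiving end can start laying out the first screen while later chunks are still being generated.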
Wheeeee
I was thinking more along the lines of NS for *nix might be able to handle it, but the Win version might not be able to. I just went from a T1 in my college dorm to a 56k on my dad's computer at home, and anything to speed up the downloads would keep hair in my head. So my original question stands,
This is something I don't know how to test, and I don't know where to start an intelligent search, so if anyone has a good place for me to start looking, I would be grateful. Thanks. BTW, my criteria for a new place to live just grew to include DSL/cable modem access. How do people live on 56k?
Louis Wu
"Where do you want to go ...
Does this trick need gzip installed already, or is it included in the huge download of NS?
Louis Wu
"Where do you want to go ...
I realised a long long time ago that I could save space on my Linux box's hard drive by going into the html documentation directories and doing a gzip -9 `find . -name "*.html"` .
Since I was opening these files through the file system, not via http, Netscape had no problem whatsoever opening and displaying them.
I just tried this using Netscape on an SGI with http, ( like this http://server/path/page.html.gz ) and it still works... I seem to remember that when I tried this at home with Linux, it didn't work...
I'm running a server, dishing up static HTML batch generated from source files once per month. The saving can be enormous... two HTML files of 25kB and 13kB were reduced to just 2kB each! Admittedly, the body of files only takes up 100MB, so I'm not going to run out of space anytime soon...
Now surely the server would fetch a small file off the disc faster than it could fetch a bigger file. And since I'm not compressing these files on the fly, there's no overhead on the server side. The LAN should get some benefit, too, since there is less data being whizzed around. There's going to be some overhead on the client side, as Netscape needs to gunzip the data at some point...
However, I was under the impression that analog modems already had some dedicated data compression hardware... so if you have a server grabbing gzipped data off its discs, pushing that out to an analog modem, then the hardware of the modem won't be able to compress it (much) further in any case. And if your server is generating the HTML on the fly, maybe it would be better to just push uncompressed data to the modem, and let the hardware compression take care of things.
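The batch approach described in this comment (compress the static HTML once, serve it many times) can be sketched as a small script. One deliberate difference from the gzip -9 command above: this version keeps the original .html and writes a .gz copy next to it, which is an assumption made so that clients without gzip support can still be served. The function name precompress_html is made up.

```python
import gzip
import os
import tempfile


def precompress_html(root):
    """Write a level-9 .gz copy next to every .html file under root.

    Returns (bytes_before, bytes_after) summed over all files handled.
    """
    before = after = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            packed = gzip.compress(data, compresslevel=9)
            with open(path + ".gz", "wb") as f:
                f.write(packed)
            before += len(data)
            after += len(packed)
    return before, after


# Demonstration on a throwaway directory with one synthetic page.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "index.html"), "wb") as f:
        f.write(b"<p>static documentation page</p>\n" * 500)
    before, after = precompress_html(d)
    print("%d bytes before, %d bytes after" % (before, after))
```

Since the work happens once per batch run, the server pays no compression cost per request, which is the advantage claimed above.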
s errare humanum est, sed merda futare machinem necessit
Will you have my babies?
--Giving to trolls for the benefit of us all
I thought we had something, you and I, but was I ever wrong.
--Giving to trolls for the benefit of us all
Well if you're not going to take my babies at least fix that CV.rtf link on your website.
--Giving to trolls for the benefit of us all
This could make it more difficult for prying eyes (e.g. ISPs, CIA, CSIS, MI6..) to search passing packets for keywords. It wouldn't be secure, but you'd need to look at the entire application layer packet to know it's gzipped, and have enough contextual information to decompress it properly.
I adblock all animated gifs.
Blessed be the prime numbered slashdotters
- A.P.
--
* CmdrTaco is an idiot.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
This is a bit of a plug, but I found a really big win for the server side (not the client side) when I added this feature to AxKit (link in .sig). I'm behind a 64Kb line, and some of the AxKit pages are pure documentation. This feature reduced the outgoing page size by about 80% for many pages, which seriously helps me deliver more content to my users. And the gzipped content is cached, so it's just as fast as the non-gzipped content when using cacheable pages.
Yes, it's not much help for images, but then you just shouldn't enable this concept for images.
Apache::GzipChain can also provide this option for people working with static pages on mod_perl enabled servers, but it has a serious memory leak in it that I found last week (and posted details of to the mod_perl mailing list).
Matt. Want XML + Apache + Stylesheets? Get AxKit.
Whenever I try to open a file that's been gzipped, Netscape (4.75 on linux) automatically prompts me with a file dialog box. This is even if I'm reading it straight from the file system. Thanks
the good ground has been paved over by suicidal maniacs
You just made it so that pages can't incrementally load any more. The browser would have to wait until the whole .pak was downloaded before it could start laying out the page.
Yes, there are many places along the transmission lines where compression is attempted, but like the standard setting in most disk compression packages it's a little simple and typically does the worst job of compression in the system. Since compression in a modem is handled independent of any CPU, if you can do better somewhere else then it doesn't really matter if the modem's efforts are wasted.
In addition, people have been saying it isn't worth compressing .gif or .jpg files. While that's typically true with .gif files, .jpgs can usually have 10-15% of their bulk squeezed out even with the humble zip program.
I'm a huge fan of compression and I strongly believe that transmission of compressed HTML files will have a major positive impact on the 'Net. Don't just think of the lower serving overhead on the servers, think of all the (caching) proxies and other routers and gateways. HTML files seriously lose 80% of their bulk when compressed.
But we need to go further. We need to start bringing in a new highly compressed image format now so it's in popular use before 2005. There are a couple of nice fractal formats around that result in smaller files than the equivalent zipped .jpg -- we need to get at least one into the standard installation of the next IE or NS.
Actually, you can display the files in the order they're packed, you just can't parallel download so some of the multilink systems might be disadvantaged...
Something like;
- Client: I want http://blah.com/foo.html
- Server: That has files; foo.html, foopic1.gif, foopic2.jpg/foopic2.fractal, fooflash & adiframe10111.html
- C: I have adiframe10111.html and I support .fractal
- S: Here is foo.html.your.pak
Make any sense?
Doesn't Keep-Alive in HTTP/1.1 take care of the problem of sending multiple resources for one page?
Though I definitely agree with you about the whole multiple-version of a single resource thing (foopic2.jpg/foopic2.fractal)
pooptruck
Actually I built GZIP compression into the core product at the company I'm working for (a web application) about a year ago. All HTML content coming out of our application passes through a layer which examines the browser and compresses the output. The programmers never need to think about it. All the compression is done in realtime though, so there is a minute cpu overhead associated with it. We average about 4% extra cpu time because of GZIP. However, we've been averaging about 75% compression of our html. That -triples- the speed of page loads on modems. It's really noticeable when I'm doing work from home. GZIP is a stream compression (DEFLATE), not run-length encoding, so if the page load stalls halfway through, the part already received still renders perfectly fine.
GZIP compression is supported in NS4.5 and higher, IE4.01 and higher, and all versions of Mozilla. We have, in the past year, never had a reported problem with the GZIP compression. There are some known bugs if you try to compress MIME types other than html.
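The claim about stalled page loads can be checked directly: a gzip stream that is cut off partway still decodes to the beginning of the page. The sketch below assumes Python's zlib and gzip modules and a synthetic page; decode_partial is a made-up helper name, and how a given browser actually handles a truncated stream is not something this shows.

```python
import gzip
import zlib


def decode_partial(truncated):
    """Decode as much as possible of a gzip stream that was cut off early."""
    # wbits=31 selects gzip framing; a streaming decompressor does not need the trailer.
    return zlib.decompressobj(wbits=31).decompress(truncated)


# Synthetic long page, one paragraph per line.
page = b"".join(b"<p>paragraph %d of a long page</p>\n" % i for i in range(2000))

packed = gzip.compress(page)
received = packed[: len(packed) // 2]   # the connection stalls halfway through
partial = decode_partial(received)

print("received %d of %d compressed bytes" % (len(received), len(packed)))
print("recovered %d of %d page bytes" % (len(partial), len(page)))
print("recovered data is a prefix of the page:", page.startswith(partial))
```

What comes back is the top of the page, in order, which is exactly what a browser needs to start rendering.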
On a side note in probably about a month or so, I will be releasing into open source a java servlet web application framework. Included, among other goodies, is a layer which can automatically do GZIP encoding if the browser supports it. So anybody writing a web application using this automatically gets the benefits. Eventually coming to http://www.projectapollo.org
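The poster's code is not shown, so here is only a guess at what such a layer boils down to, written in Python instead of a Java servlet. It assumes the decision is driven by the Accept-Encoding header and the content type; maybe_gzip is a hypothetical name and q-values in the header are ignored for brevity.

```python
import gzip


def maybe_gzip(body, content_type, accept_encoding):
    """Return (headers, body), gzipping only HTML for clients that ask for gzip."""
    headers = {"Content-Type": content_type, "Vary": "Accept-Encoding"}
    # Only HTML is compressed, following the warning about other MIME types.
    is_html = content_type.split(";")[0].strip().lower() == "text/html"
    # Split "gzip, deflate" into tokens; parameters such as ";q=0.5" are dropped, not honoured.
    offered = [
        token.split(";")[0].strip().lower()
        for token in (accept_encoding or "").split(",")
    ]
    if is_html and "gzip" in offered:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body


headers, body = maybe_gzip(b"<html>" + b"<p>hello</p>" * 200 + b"</html>",
                           "text/html; charset=iso-8859-1", "gzip, deflate")
print(headers)
```

Because the check lives in one place, page code upstream never has to know whether its output will be compressed.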
http://perl.apache.org/guide/modules.html#Apache_GzipChain_compress_HTM