Slashdot Mirror


Gzip Encoding of Web Pages?

Both Brendan Quinn and msim were curious about the ability to send gzip-encoded Web pages. Brendan asks: "It's possible to make Apache detect the "Accept-encoding: gzip" field sent by NS 4.7+, IE 4+ and Lynx, and send a gzip-encoded page, thus saving lots of bandwidth all over the place. So why don't people do it? Here is a module written by the Mozilla guys a couple of years ago that -almost- does what I want, and I could change it pretty easily... but I thought someone else would have done it by now? eXcite do it, does anyone know of any other large-scale sites that use gzip encoding?"

"If you have LWP installed, you can check with:

GET -p '<my proxy>' -H 'Accept-encoding: gzip' -e http://www.site.com/ | less

Try that with 'www.excite.com' and you'll get binary (gzipped) data. That's what I want to do."
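
As a rough illustration of the negotiation Brendan describes, here is a minimal Python sketch. It is not the Mozilla module and not how Apache works internally; the function names `client_accepts_gzip` and `negotiate_gzip` and the returned header dictionary are hypothetical. The only things taken from the question are the `Accept-encoding: gzip` request header and the idea of answering with a gzip-encoded body.

```python
import gzip


def client_accepts_gzip(accept_encoding):
    """Return True if the Accept-Encoding header value lists gzip.

    Simplification: q-values are only checked for an explicit q=0.
    """
    for item in accept_encoding.split(","):
        parts = [p.strip().lower() for p in item.split(";")]
        if parts[0] != "gzip":
            continue
        # "gzip;q=0" means the client explicitly refuses gzip.
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    if float(param[2:]) == 0:
                        return False
                except ValueError:
                    return False
        return True
    return False


def negotiate_gzip(accept_encoding, body):
    """Return (headers, payload), compressing only when the client allows it."""
    # Vary tells caches that the response depends on Accept-Encoding.
    headers = {"Vary": "Accept-Encoding"}
    if client_accepts_gzip(accept_encoding):
        payload = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    else:
        payload = body
    headers["Content-Length"] = str(len(payload))
    return headers, payload


if __name__ == "__main__":
    page = b"<html><body>" + b"<p>hello world</p>" * 200 + b"</body></html>"
    for header in ("gzip", "gzip, deflate", ""):
        hdrs, data = negotiate_gzip(header, page)
        print(repr(header), hdrs.get("Content-Encoding"), len(data))
```

Clients that send no `Accept-encoding` header get the body untouched, so older browsers keep working.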

24 of 42 comments

  1. err... by Wakko+Warner · · Score: 2
    Would this really make much of a difference for web pages? It's stuff like images, sound files, MP3s, pr0n jpegs, etc that make up the bulk of web transfers. HTML text files, gzipped or not, make up such a tiny fraction of web traffic that I don't see how it'd matter if they were zipped or not. Perhaps that's the reason nobody uses the gzip module?

    - A.P.

    --
    * CmdrTaco is an idiot.

    --
    "Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
    1. Re:err... by Ian+Bicking · · Score: 2
      >The CPU usage on the server scares a lot of people away from this - it's not a big deal for static content (zip once and cache), but for dynamic sites (say, /.) gzipping 5-700K of text each time would kill a loaded server pretty quickly...
      If you have enough bandwidth to waste on putting 80% of redundant (compressible) data over it, but you don't have enough computer power to run gzip on that data, your resource allocation is seriously messed up. Fast computers to gzip the data are a lot cheaper than fat pipes to send it.

      The only place where it might not make sense is in an academic environment where (for artificial reasons) the bandwidth is very cheap, and the servers might still be overwhelmed.
      --

    2. Re:err... by Matts · · Score: 2

      I'm not talking about outgoing bandwidth!!!

      I'm talking about people on slow links USING your web site. People with modems. I don't care how fast your pipe is outgoing - these people on slow modems can effectively crush your site, shocking as it may sound, because they end up spawning more httpd's, eventually either forcing you to your httpd limit (if you've taken the time to set it sensibly), or forcing your server into swap. And you don't want that to happen.

      Please go and read some real quality information on people who have worked with these high end solutions before thinking about replying again. Such as the mod_perl guide, at http://perl.apache.org/guide/

      --

      Matt. Want XML + Apache + Stylesheets? Get AxKit.
    3. Re:err... by Matts · · Score: 2
      >Will you have my babies?

      Not if you don't want them! ;-)

      --

      Matt. Want XML + Apache + Stylesheets? Get AxKit.
    4. Re:err... by Matts · · Score: 2

      It all depends on your architecture. Sure if you have a caching proxy front end it might not be worth it. But if you don't, and have a slow client connecting (say a 56K modem), the time taken to gzip a 700K file (assuming this is mostly text) vs the time it takes to actually download that file make the benefit definitely worthwhile.

      People easily forget that, and assume that their bandwidth is big enough that the file will just instantly disappear down the pipe. Your server will get overloaded an awful lot quicker if every httpd is waiting on a slow client to download 700K when they could be downloading 100K.
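
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes an ideal 56 kbit/s link with no protocol overhead, latency, or modem-level compression, and it takes the 700K and 100K figures from the comment; real modems rarely reach the nominal rate, so actual times would be longer. The helper name `transfer_seconds` is made up for this example.

```python
def transfer_seconds(size_kbytes, link_kbits_per_s):
    """Ideal time to push size_kbytes (1K = 1024 bytes) over the link."""
    bits = size_kbytes * 1024 * 8
    return bits / (link_kbits_per_s * 1000)


if __name__ == "__main__":
    for size in (700, 100):
        secs = transfer_seconds(size, 56)
        print(f"{size}K over 56 kbit/s: about {secs:.0f} s")
```

Under those assumptions the uncompressed page occupies an httpd for more than a minute and a half, against roughly fifteen seconds for the compressed one.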

      --

      Matt. Want XML + Apache + Stylesheets? Get AxKit.
    5. Re:err... by Tower · · Score: 2

      >I'm not talking about outgoing bandwidth!!!
      I wasn't either...
      ...that was just sort of an extra thought I tacked on at the end, the rest of it wasn't directed in that fashion...

      I agree with your points here, as I had said, my previous posts were coming from a viewpoint where everyone had fairly high bandwidth, especially considering the increasing availability of DSL/cable modems. I know there are a lot of lower bandwidth links out there, and from a time perspective, they spend megapercentages more connected to each httpd. If you can afford the hardware to throw at it for gzipping and large dynamic generation, that's fine. I've found that you fill a (even large) pipe faster than you run out of CPU time on a fairly powerful system (which agrees with your assessments more than mine). I was trying to provide a different viewpoint (since almost none of the people who use my site are on anything slower than 256k DSL, and it runs with a small amount of mem/CPU reserve). If you have a quad-xeon with a couple of gigs of memory, or an S80, then, by all means, go right ahead - it can and will save the slow people time. I haven't run anything to the scale that would need any of this (mostly since I have 80% static content), and my end-user demographic is much more bandwidth-enabled than the typical cross-section.

      >Please go and read some real quality information on people who have worked with these high end solutions before thinking about replying again.

      Thanks for the kind comment... I have read the mod_perl guide... relax a little, will ya?

      --
      "It's tough to be bilingual when you get hit in the head."
    6. Re:err... by Tower · · Score: 2

      Yeah, but I'm biased 8^) I spent 4 years on the campus LAN (with a few T3s), cable modem in the last year, and I'll probably be switching to DSL soon... Bandwidth spoiled... There are always tradeoffs, and yes, if you are going to be running with a slow endpoint, there can be savings, but I still think that the overhead of gzipping all the files (from memory to CPU time) outweighs another httpd that is waiting for the client (since it is now waiting for the gzip before it waits for the client). Of course, if your server is behind a slow pipe, and you have static pages, it will save a bundle.

      I'll have to see if I can get one of those modules, and give something a shot with webbench.

      --
      "It's tough to be bilingual when you get hit in the head."
    7. Re:err... by Tower · · Score: 2

      The CPU usage on the server scares a lot of people away from this - it's not a big deal for static content (zip once and cache), but for dynamic sites (say, /.) gzipping 5-700K of text each time would kill a loaded server pretty quickly...
      --
      "It's tough to be bilingual when you get hit in the head."
  2. AxKit does this by Matts · · Score: 2

    This is a bit of a plug, but I found a really big win for the server side (not the client side) when I added this feature to AxKit (link in .sig). I'm behind a 64Kb line, and some of the AxKit pages are pure documentation. This feature reduced the outgoing page size by about 80% for many pages, which seriously helps me deliver more content to my users. And the gzipped content is cached, so it's just as fast as the non-gzipped content when using cacheable pages.

    Yes, it's not much help for images, but then you just shouldn't enable this concept for images.

    Apache::GzipChain can also provide this option for people working with static pages on mod_perl enabled servers, but it has a serious memory leak in it that I found last week (and posted details of to the mod_perl mailing list).

    --

    Matt. Want XML + Apache + Stylesheets? Get AxKit.
  3. How do you get Netscape to do this? by ksheff · · Score: 2

    Whenever I try to open a file that's been gzipped, Netscape (4.75 on linux) automatically prompts me with a file dialog box. This is even if I'm reading it straight from the file system. Thanks

    --
    the good ground has been paved over by suicidal maniacs
  4. Re:file-by-file is okay, but all together is bette by Jon+Evans · · Score: 2
    S: Here is foo.html.your.pak

    You just made it so that pages can't incrementally load any more. The browser would have to wait until the whole .pak was downloaded before it could start laying out the page.

  5. compression of compressed files problem by Kris_J · · Score: 2
    I like to think that I know a bit about the practical side of compression, so I'll jump in here.

    Yes, there are many places along the transmission lines where compression is attempted, but like the standard setting in most disk compression packages it's a little simple and typically does the worst job of compression in the system. Since compression in a modem is handled independently of any CPU, if you can do better somewhere else, then it doesn't really matter if the modem's efforts are wasted.

    In addition, people have been saying it isn't worth compressing .gif or .jpg files. While that's typically true with .gif files, .jpgs can usually have 10-15% of their bulk squeezed out even with the humble zip program.

    I'm a huge fan of compression and I strongly believe that transmission of compressed HTML files will have a major positive impact on the 'Net. Don't just think of the lower serving overhead on the servers, think of all the (caching) proxies and other routers and gateways. HTML files seriously lose 80% of their bulk when compressed.
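
The gap between markup and already-dense data is easy to check. The Python sketch below is only an illustration: the markup is synthetic and far more repetitive than a real page, and random bytes are an assumed stand-in for high-entropy data such as already-compressed images, so neither figure it prints should be read as confirming the percentages quoted in this comment. The helper name `saved_fraction` is made up for the example.

```python
import os
import zlib


def saved_fraction(data):
    """Fraction of bytes removed by zlib at the default compression level."""
    return 1 - len(zlib.compress(data)) / len(data)


if __name__ == "__main__":
    # Synthetic, highly repetitive markup: an assumption, not a real page.
    row = b'<tr><td class="comment"><a href="/user">name</a></td></tr>\n'
    html = b"<html><body><table>\n" + row * 500 + b"</table></body></html>"
    # Random bytes stand in for data that is already compressed.
    noise = os.urandom(len(html))
    print(f"markup: {saved_fraction(html):.0%} saved")
    print(f"random: {saved_fraction(noise):.0%} saved")
```

To test the claims properly, feed the same function real HTML and real image files from your own site.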

    But we need to go further. We need to start bringing in a new highly compressed image format now so it's in popular use before 2005. There are a couple of nice fractal formats around that result in smaller files than the equivalent zipped .jpg -- we need to get at least one into the standard installation of the next IE or NS.

  6. Re:file-by-file is okay, but all together is bette by Kris_J · · Score: 2

    Actually, you can display the files in the order they're packed, you just can't parallel download so some of the multilink systems might be disadvantaged...

  7. file-by-file is okay, but all together is better by Kris_J · · Score: 2
    Compressing a file at a time, without reference to data that has gone before, can only do so well. There needs to be some way to quickly determine which files of an impending page the client doesn't have, then package them up into a single compressed wad. Obviously, the gains would need to exceed the negotiation overhead in terms of both time and size, but I believe that's the next step after every individual file is compressed.

    Something like:

    • Client: I want http://blah.com/foo.html
    • Server: That has files; foo.html, foopic1.gif, foopic2.jpg/foopic2.fractal, fooflash & adiframe10111.html
    • C: I have adiframe10111.html and I support .fractal
    • S: Here is foo.html.your.pak
    Make any sense?
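
The server side of that exchange could be sketched roughly as follows in Python. This is purely hypothetical: nothing like it exists in HTTP, the function name `build_pak` is invented, and a gzipped tar archive is just one convenient container chosen for the example. The file names are the ones from the exchange above.

```python
import io
import tarfile


def build_pak(page_files, client_has):
    """Pack every file the client lacks into one gzipped tar archive.

    page_files maps file name -> bytes; client_has is a set of names.
    Returns (names_packed, archive_bytes).
    """
    missing = sorted(name for name in page_files if name not in client_has)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in missing:
            data = page_files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return missing, buf.getvalue()


if __name__ == "__main__":
    # Placeholder contents; only the names come from the example exchange.
    files = {
        "foo.html": b"<html>" + b"<p>text</p>" * 100 + b"</html>",
        "foopic1.gif": b"GIF89a placeholder",
        "foopic2.fractal": b"fractal placeholder",
        "fooflash": b"flash placeholder",
        "adiframe10111.html": b"<html>ad</html>",
    }
    packed, pak = build_pak(files, {"adiframe10111.html"})
    print(packed, len(pak), "bytes")
```

Because everything shares one compression stream, text repeated across files is only paid for once, which is the gain being argued for here.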
  8. Re:file-by-file is okay, but all together is bette by smileyy · · Score: 2

    Doesn't Keep-Alive in HTTP/1.1 take care of the problem of sending multiple resources for one page?

    Though I definitely agree with you about the whole multiple-version of a single resource thing (foopic2.jpg/foopic2.fractal)

    --
    pooptruck
  9. We've been doing this in production for a year by Leghk · · Score: 2

    Actually, I built GZIP compression into the core product at the company I'm working for (a web application) about a year ago. All HTML content coming out of our application passes through a layer which examines the browser and compresses it. The programmers never need to think about it. All the compression is done in realtime though, so there is a minute CPU overhead associated with it. We average about 4% extra CPU time because of GZIP. However, we've been averaging about 75% compression of our HTML. That -triples- the speed of page loads on modems. It's really noticeable when I'm doing work from home. GZIP is a streaming compression format, so if the page load stalls halfway through, it still renders perfectly fine.
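
The point about stalled page loads can be checked with a short Python sketch. gzip data is a stream, so a decoder can emit output as the compressed bytes arrive. The helper name `decode_prefix` and the sample page are made up for this illustration; it says nothing about how any particular browser implements decoding.

```python
import gzip
import zlib


def decode_prefix(gz_bytes, received):
    """Decompress only the first `received` bytes of a gzip stream."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decoder.decompress(gz_bytes[:received])


if __name__ == "__main__":
    page = b"".join(b"<p>comment number %d</p>\n" % i for i in range(2000))
    gz = gzip.compress(page)
    partial = decode_prefix(gz, len(gz) // 2)
    # The partial output is a prefix of the page, not garbage.
    print(len(partial), "of", len(page), "bytes recovered from half the stream")
    print(page.startswith(partial))
```

What the truncated stream cannot provide is the trailing checksum, so a client that insists on verifying it would have to wait for the end.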

    GZIP Compression is supported in NS4.5 and higher, IE4.01 and higher, and all versions of Mozilla. We have, in the past year, never had a reported problem with the GZIP compression. There are some known bugs if you try to compress MIME types other than HTML.

    On a side note in probably about a month or so, I will be releasing into open source a java servlet web application framework. Included, among other goodies, is a layer which can automatically do GZIP encoding if the browser supports it. So anybody writing a web application using this automatically gets the benefits. Eventually coming to http://www.projectapollo.org

  10. Re:Why do it at all? by Tower · · Score: 2

    >Of course, for high-text, heavy traffic sites (for example, right here on /.), this may make some sense.

    Ah, but (like I mentioned in another comment) when you have a page that is say 500k of text (a hundred or so comments), dynamically generated for each hit, the overhead of compression is rather dangerous, and if a server is already somewhat near capacity, it could slow it dramatically... if you can't cache it, and have high traffic, it's a big problem.

    [Insert your own joke about Jon Katz wasting even more time with compression]
    --
    "It's tough to be bilingual when you get hit in the head."
  11. Re:Does it work with Windows? by technos · · Score: 2

    Actually, there are the needed provisions to render those.. For owners of 95(c), 98, 98SE, Millennium and W2K, the needed .dlls come with the OS. For MacOS, 95(a), and (b), they were supplied when you installed Internet Explorer 4+.

    Also, IE4+ does work correctly with gzipped pages.

    --
    .sig: Now legally binding!
  12. Re:Does it work with Windows? by Quietust · · Score: 2

    I know for a fact that Netscape 4.75 can handle gzip-compressed data.

    I set up a program to listen on port 80 and told NS to browse to localhost. It sent the "Accept-encoding: gzip". I then telnetted to www.excite.com:80 and sent that data. I got gzipped data in return. I then browsed the site using Netscape, and it loaded properly; therefore, Netscape 4.75 can handle gzipped downloads.

    I then tricked IE 5.5 into sending the same HTTP request; I connected to a proxy (127.0.0.1) which would transparently forward to excite.com, filtered out the HTTP request, pasted in Netscape's; it also loaded properly.

    So yes, gzip downloads work fine under Windows systems using Netscape 4.75 or IE5.5 (not sure about older versions, though), though IE5.5 sends an odd "Accept-encoding: gzip, deflate" which results in some sites not compressing it at all.

    -- Sig (120 chars) --
    Your friendly neighborhood mIRC scripter.

    --
    * Q
    P.S. If you don't get this note, let me know and I'll write you another.
  13. Re:Does it work with Windows? by Quietust · · Score: 2
    Since I'm stuck with Windows for a couple more months, I'm wondering if this will work on Netscape 4.7+ for Windows. Or even IE 4+ for Windows. Does Opera do this?
    Let me quote part of the original question:
    ...the "Accept-encoding: gzip" field sent by NS 4.7+, IE 4+...
    If Netscape 4.7+ and IE 4+ claim that they can accept gzipped data, they had better know how to handle it.

    -- Sig (120 chars) --
    Your friendly neighborhood mIRC scripter.
    --
    * Q
    P.S. If you don't get this note, let me know and I'll write you another.
  14. Re:Why do it at all? by Evil+Grinn · · Score: 2
    Why bother compressing data?

    For conventional web pages, I agree. The slowness of most web sites is either due to graphics, or they are using some slow CGI on the server side. Compression of HTML wouldn't help them much.

    There are also cases where the HTML is just plain resource-intensive for the browser to render (lots of nested tables, for example). Adding in the extra step of de-compressing wouldn't help there either.

    However, I could see clients (not necessarily browsers) sucking down large chunks of XML in a gzipped form. It could be used for things like sending thousands of raw database records to a client application for further processing and presentation to the end user.

  15. How about this? by kevin42 · · Score: 3

    http://perl.apache.org/guide/modules.html#Apache_GzipChain_compress_HTM

  16. Re:Does it work with Windows? by Quietust · · Score: 3

    Here's what IE5.5 gives when I go to http://127.0.0.1/:

    GET / HTTP/1.1
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
    Accept-Language: en-us
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
    Host: 127.0.0.1
    Connection: Keep-Alive


    In comparison, Netscape 4.75:

    GET / HTTP/1.0
    Connection: Keep-Alive
    User-Agent: Mozilla/4.75 [en] (Win98; U)
    Host: 127.0.0.1
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Encoding: gzip
    Accept-Language: en
    Accept-Charset: iso-8859-1,*,utf-8


    The main points of interest are that IE5.5 can handle HTTP/1.1 while Netscape only requests HTTP/1.0, and that IE5.5 also claims to handle gzip AND deflate encoding, even though they're exactly the same (last time I checked, gzip used the deflate algorithm).

    I also tried sending the IE5.5 HTTP request via telnet to www.excite.com; it returned plain text, whereas Netscape's HTTP request returned gzipped data.
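
The two tokens name different wire formats even though both rely on the DEFLATE algorithm: in HTTP, `gzip` is the gzip file format (RFC 1952) and `deflate` is the zlib format (RFC 1950) wrapped around DEFLATE data (RFC 1951). A small Python sketch makes the difference visible. The sample payload is arbitrary and the helper `is_gzip` is made up for the example; it does not explain why excite.com answered the two requests differently.

```python
import gzip
import zlib

payload = b"<html><body>" + b"<p>same text</p>" * 100 + b"</body></html>"

as_gzip = gzip.compress(payload)  # gzip wrapper: magic bytes 1f 8b, CRC-32 trailer
as_zlib = zlib.compress(payload)  # zlib wrapper: 2-byte header, Adler-32 trailer


def is_gzip(data):
    """gzip streams start with the magic bytes 0x1f 0x8b."""
    return data[:2] == b"\x1f\x8b"


if __name__ == "__main__":
    print("gzip starts with:", as_gzip[:2].hex())
    print("zlib starts with:", as_zlib[:2].hex())
    print("sizes:", len(as_gzip), len(as_zlib))
    # Both round-trip to the same bytes with the matching decoder.
    print(gzip.decompress(as_gzip) == zlib.decompress(as_zlib) == payload)
```

A decoder expecting one wrapper rejects the other, so a server has to label the response with the coding it actually used.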

    -- Sig (120 chars) --
    Your friendly neighborhood mIRC scripter.

    --
    * Q
    P.S. If you don't get this note, let me know and I'll write you another.
  17. Re:Why do it at all? by AT · · Score: 4

    The page quoted in the article shows it's a pretty big win for some "typical use" sites on slower modems.

    Incidentally, no extra load would be necessary on the server for static content if it was pre-compressed.
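
A minimal Python sketch of that pre-compression idea, assuming a made-up convention of storing the compressed copy next to the original with a `.gz` suffix. The function name `precompress` is hypothetical, and wiring the `.gz` file into a web server's responses is left out entirely.

```python
import gzip
import os
import shutil
import tempfile


def precompress(path):
    """Write path + '.gz' if it is missing or older than the source file.

    Returns the path of the compressed copy.
    """
    gz_path = path + ".gz"
    if (not os.path.exists(gz_path)
            or os.path.getmtime(gz_path) < os.path.getmtime(path)):
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return gz_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        page = os.path.join(tmp, "index.html")
        with open(page, "wb") as f:
            f.write(b"<p>static content</p>\n" * 1000)
        gz = precompress(page)
        print(os.path.getsize(page), "->", os.path.getsize(gz), "bytes")
```

The compression cost is paid once per change to the file instead of once per request, which is the "zip once and cache" approach mentioned earlier in the thread.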