Look-Ahead Caching For HTTP Proxies?

Posted by Cliff on Monday February 5, 2001 @10:49AM from the predicting-the-next-click dept.

ryandlugosz asks: "Why can't I find an easy way to do look-ahead web caching/proxying for my network? I'm running the Squid web proxy/cache right now and it's great, but I want to take it a step further. Why doesn't Squid look at the HTML I download, parse out the 'A HREF' and 'IMG SRC' tags and go get those documents while I'm reading the page that I just loaded? That way, when I click on any of those first-level links on the page, there is a good chance that they're already sitting in the cache waiting for me. There would have to be a way to prioritize these look-aheads, so that explicit page requests are always loaded before look-ahead pages are downloaded. I've got lots of bandwidth and disk space at my disposal... why can't I do this easily in Linux? (BTW: there used to be a couple of Windows apps that would do this for you, such as Peak's Net.Jet, but they appear to be gone now.) I'd appreciate any recommendations other /.ers can provide."
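For illustration only, here is a minimal Python sketch of the two pieces the question describes: pulling 'A HREF' and 'IMG SRC' targets out of a page, and a queue in which explicit requests always come out ahead of look-ahead fetches. It is not a Squid feature or plugin. The names `LinkExtractor`, `FetchQueue`, `queue_lookaheads`, `EXPLICIT` and `LOOKAHEAD` are hypothetical, and the actual fetching and cache storage are left out.

```python
import heapq
import itertools
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical priority levels: lower numbers are served first.
EXPLICIT, LOOKAHEAD = 0, 1


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from A HREF and IMG SRC attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(urljoin(self.base_url, value))


class FetchQueue:
    """Priority queue: explicit requests always pop before look-aheads."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a priority

    def push(self, url, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        priority, _, url = heapq.heappop(self._heap)
        return url, priority

    def __len__(self):
        return len(self._heap)


def queue_lookaheads(queue, base_url, html):
    """Parse a just-served page and enqueue its first-level links."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    for link in parser.links:
        queue.push(link, LOOKAHEAD)
    return parser.links
```

A worker would pop from the queue and fetch each URL into the cache; because a user's click is pushed as `EXPLICIT`, it jumps ahead of any look-aheads still waiting.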

Very un-nice to web sites by pjrc · 2001-02-06 00:55 · Score: 4

My little web site suffers regularly from people running archiver programs: Teleport Pro, WebZIP, WebReaper, WebCopier, HTTrack, Wget, WebSymmetrix, Xyro, and many others. Some people set these things to run late at night, which helps a bit, but still impacts the responsiveness for overseas visitors. Often the people running these things have dialup... when a DSL user runs Teleport Pro on my little site, it really hurts performance for everyone. My site offers free technical resources, hardware and firmware development tools, and other similar stuff, so many people want to download the whole thing to their hard drives. I have years of experience trying to deal with the problems from this sort of activity.
It's easy to get into the "gim'me gim'me gim'me" mode of thinking when surfing the web. After all, "information wants to be free", right? The sad truth is that it costs real money to host a web site. Sure, there are some crappy free hosting services, but their performance is dismal AND your page gets served with their adverts. User account websites at an ISP (www.some-isp.com/~username) come at no extra charge, but only for a few megs of data and rather limited transfer each month.
Web-unfriendly software raises the cost of hosting a web site. Low-cost hosting usually seems to be billed on the number of bytes transferred, and software like what you're proposing will needlessly increase the site owner's costs. High-end hosting tends to be billed on bandwidth (not total transfer), so this software doesn't hit the site's owner directly in the wallet; instead, it just makes the site less responsive for other users.
This idea is even worse than archivers, as most of the bytes will sit in a cache and get expired, instead of in an archive directory where they _might_ someday be seen or used. In the case of an archiver, the user went out of their way to obtain a complete local copy of the web site, presumably because they are interested in the material and might actually read it off-line. With a predictive caching proxy, there's no indication from the user that they will ever make any use of the material downloaded. The vast majority will sit in the cache, which will in all likelihood rapidly need to remove the least recently obtained pages. In normal caching terminology, one would say "least recently used", but in this caching scheme, the vast majority of pages in the cache will never have been viewed. Utter waste.
This sort of net-unfriendly behavior is analogous to pollution. Even if just one person pollutes the environment, there is some small harm to a small number of people, perhaps significant harm to a couple if the pollution is severe. If a large company pollutes recklessly, perhaps a community or two is badly affected. If pollution becomes widespread, it's a global problem and almost everyone is harmed.
Likewise, if you hack together a predictive look-ahead caching proxy and use it among yourself and your friends, you're impacting the net much as if you'd taken your used motor oil and other waste and dumped it directly in a local stream. If a large company or two replaces their bandwidth-conserving Squid proxy with your bandwidth-abusing look-ahead caching, a lot of sites will suffer increased costs. If its use becomes widespread, it would significantly increase overall bandwidth usage on the net, in all likelihood raising costs enough to be passed all the way back down to end users, and it'd raise the cost of hosting web sites, which would need to be made up somehow. Perhaps large websites could absorb the cost of more bandwidth? Smaller sites, like mine, would be in a world of hurt. For quite some time, I paid a couple hundred dollars a month out of pocket to keep the site up. Now, we're making some small sales from the site... getting close to covering the costs.
Well, that's been a long rant. I hope you'll take a moment to consider that web site operators pay real dollars to make their sites available to you, and keep that in mind when you consider designing networking software.

--
PJRC: Electronic Projects, 8051 Microcontroller Tools

it's a waste... by whydna · 2001-02-05 14:28 · Score: 4

This is not the purpose of Squid. Squid is designed for helping a large number of users handle a not-so-large amount of bandwidth. If Squid performed the actions that you mention in your post, it would be defeating this purpose. Imagine if you have 200 users sharing a DSL line. Squid would cache all those pages that everybody constantly loads up (Yahoo, MSN, etc.). If it also grabbed all the extra links, it would be wasting bandwidth because the cache-miss-rate would be so incredibly large.
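To put a rough number on that waste, here is a back-of-the-envelope Python sketch. The function `prefetch_waste` and the figures in the usage note are hypothetical, not measurements from any real proxy; it simply assumes every link on a viewed page is prefetched and only a few are ever followed.

```python
def prefetch_waste(pages_viewed, links_per_page, followed_per_page):
    """Fraction of prefetched documents that nobody ever requests.

    All three inputs are illustrative assumptions, not measured values.
    """
    prefetched = pages_viewed * links_per_page
    if prefetched == 0:
        return 0.0
    # A user cannot follow more links than the page contains.
    used = pages_viewed * min(followed_per_page, links_per_page)
    return (prefetched - used) / prefetched
```

For example, with an assumed 20 links per page and one link actually followed, `prefetch_waste(100, 20, 1)` gives 0.95: under those assumptions, 95% of what the proxy pulled over the shared line was never looked at.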

What should really happen is that the first user that goes to a site should have to wait a bit. Squid will then have a cache and it'll be fast after that. But you know that.

Squid just isn't the right tool... perhaps if you rewrite it/make your own proxy you can do what you're looking for...

-Andy