Look-Ahead Caching For HTTP Proxies?
ryandlugosz asks: "Why can't I find an easy way to do look-ahead web caching/proxying for my network? I'm running the squid web proxy/cache right now and it's great, however I want to take it a step further. Why doesn't Squid look at the HTML I download, parse out the 'A HREF' and 'IMG SRC' tags and go get those documents while I'm reading the page that I just loaded? This way, when I click on any of those first-level links on the page there is a good chance that they're already sitting in the cache waiting for me. There would have to be a way to prioritize these look-aheads, e.g. explicit page requests are always loaded before look-ahead pages are downloaded. I've got lots of bandwidth and disk space at my disposal... why can't I do this easily in Linux? (BTW: there used to be a couple of Windows apps that would do this for you, such as Peak's Net.Jet, but they appear to be gone now.) I'd appreciate any recommendations other /.ers can provide."
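For readers who want to see what the question is asking for, here is a minimal Python sketch of the two pieces it describes: pulling the 'A HREF' and 'IMG SRC' targets out of a downloaded page, and a fetch queue in which explicit requests always come out ahead of look-aheads. This is not how Squid works internally; the names `LinkExtractor`, `FetchQueue`, `queue_lookaheads`, `EXPLICIT` and `LOOKAHEAD` are invented for illustration, and the actual downloading is left out.

```python
import heapq
import itertools
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical priority levels: the lower number is served first.
EXPLICIT, LOOKAHEAD = 0, 1


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href> and <img src> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the page that was loaded.
                self.links.append(urljoin(self.base_url, value))


class FetchQueue:
    """Priority queue: explicit requests are popped before look-aheads."""

    def __init__(self):
        self._heap = []
        # The counter keeps first-in, first-out order within one priority.
        self._counter = itertools.count()

    def push(self, url, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        priority, _, url = heapq.heappop(self._heap)
        return url, priority

    def __len__(self):
        return len(self._heap)


def queue_lookaheads(page_url, html, queue):
    """Parse one downloaded page and queue its first-level links."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    for link in parser.links:
        queue.push(link, LOOKAHEAD)
    return parser.links
```

A proxy built this way would call `queue_lookaheads` on each page it serves and push anything the user actually clicks with `EXPLICIT`, so that a click is never stuck waiting behind a backlog of speculative fetches.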
Also if the link is a redirect it should dump into a stack for your own visual approval of the redirect. Also when you look at pornography, you wouldn't have to close windows super-quick when you get trapped.
I gotta say this would affect a company's stats for ad viewing as well, so nobody with major content providing interest (ZDNet, Tucows, everyone who sells ads) would likely be too interested in hosting.
So I ramble, I started on-topic . . .
Just imagine if your look-ahead proxy decides to prefetch all the one click buy links on Amazon's (and nobody else's) site.
1. It's not very net friendly
Imagine, if you will, that you have a web site. Let's say that it's a simple site with 50 static pages with text and graphics. You pay to have this hosted with certain bandwidth restrictions. Now let's say that your site is linked to from another site. If some yokel with a look-ahead caching proxy pulls up the other site and is reading the info there, in the background, your site is getting hit. Possibly very hard, if the yokel has broadband. Now, this yokel didn't ever want to see your 50 pages at all, but you know what, you just paid for that yokel to saturate your web server for a period of time even though he never saw your site.
2. You might not want what's out there.
Let's say you are looking for a tidbit of information. Say, some information on the book "Little Women". You go to your favorite search engine and type in "little women" and click GO. The results page may indeed have pertinent information, but it may also have links to websites that may contain stuff you don't want to see, like kiddie porn. Despite you not wanting to download the kiddie porn site, your computer sees the link, goes to the site and downloads away into your cache. "No problem, I delete my cache regularly," you might say. With the current state of law enforcement, how sure can you be that you are not being monitored for illegal activity such as possession of child pornography? According to all logs, you visited the kiddie porn site. If the Feds bust down your door *before* you delete your cache, you are busted! (And if you *do* delete your cache, they still may be able to recover the information anyway.) Even if you can convince the jury that you didn't mean to download kiddie porn (like they'd believe you), you are still looking at several months of the police holding your computer(s) "for evidence".
If you have your look-ahead proxy looking more than 1 link deep, you are really asking for some bad juju. Even 1 link deep can be really bad.
Is it worth the damage? Is it worth the risk?
-Joe
HTTP 1.1 already has transfer encoding as an optional parameter that can be negotiated between the client and server. I'll bet we can all guess which side of the communication is lacking support for it.
PJRC: Electronic Projects, 8051 Microcontroller Tools
I think the WWWoffle cache that comes with Debian can do things like this. I remember when I first installed Debian, I was playing with it, and told it to go and fetch /., AND everything it linked to, to a depth of 5, or was it 15, I can't remember. What I do remember was waking up the next morning to find that my 13GB drive had 20K of free space. OOPS!!!
>~~~~~~~~~~~~~~~~
>~~~~~~~~~~~~~~~~
Pilchie
because, the thing would have to be INTELLIGENT!..
for example, say if i'm on slashdot.org and there is an ad for thinkgeek, pointing to the "thinkgeek.com" domain, of course it wouldn't get that - as that is on a different domain..
if it looks like a big message board, with lots of links, of course it wouldn't follow it.
also, it'd have to have a selectable read-ahead level - default would be only 1 level - hence if i did it on slashdot, it would go to the "read more's" - and a few others, but wouldn't KEEP GOING..
there is so much more intelligence it could have - eg, if it discovered on slashdot i never clicked the "rob's page" link, it would stop getting that after a while..
and of course, to save space, these links would have a low expiry time (30min)..
shouldn't be too hard to implement into squid, maybe there's a plugin somewhere...
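The rules listed in this comment could be sketched roughly as follows in Python. This is only an illustration, not a Squid plugin: the function names and the thresholds `MAX_LINKS_PER_PAGE` and `IGNORE_AFTER_MISSES` are made up, and only the 1-level depth and the 30-minute expiry come from the comment itself.

```python
import time
from urllib.parse import urlsplit

MAX_DEPTH = 1              # default read-ahead level from the comment
LOOKAHEAD_TTL = 30 * 60    # 30-minute expiry from the comment, in seconds
MAX_LINKS_PER_PAGE = 50    # assumed cutoff for "looks like a big message board"
IGNORE_AFTER_MISSES = 5    # assumed: give up on a link prefetched 5 times, never clicked


def should_prefetch(page_url, link, depth, links_on_page, miss_counts):
    """Decide whether a link found on page_url is worth fetching ahead.

    miss_counts maps a URL to how many times it was prefetched but
    never actually clicked by the user.
    """
    if depth > MAX_DEPTH:
        return False  # don't KEEP GOING past the configured level
    if links_on_page > MAX_LINKS_PER_PAGE:
        return False  # too many links, probably a message board
    if urlsplit(link).hostname != urlsplit(page_url).hostname:
        return False  # different domain, e.g. an ad pointing elsewhere
    if miss_counts.get(link, 0) >= IGNORE_AFTER_MISSES:
        return False  # the user never clicks this one, stop getting it
    return True


def is_expired(fetched_at, now=None):
    """True once a look-ahead entry is older than the short expiry time."""
    now = time.time() if now is None else now
    return now - fetched_at > LOOKAHEAD_TTL
```

Comparing hostnames exactly is the simplest possible "same domain" test; it would treat www.slashdot.org and slashdot.org as different sites, so a real filter would probably need something looser.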
crazney
"Who is General Failure and why is he reading my hard disk ?"
stuff
It's easy to get into the "gim'me gim'me gim'me" mode of thinking when surfing the web. After all, "information wants to be free", right? The sad truth is that it costs real money to host a web site. Sure, there are some crappy free hosting services, but their performance is dismal AND your page gets served with their adverts. User account websites at an ISP (www.some-isp.com/~username) come at no extra charge, but only for a few megs of data and rather limited transfer each month.
Web-unfriendly software raises the cost of hosting a web site. Low-cost hosting usually seems to be billed on the number of bytes transferred, and software like what you're proposing will needlessly increase the site owner's costs. High-end hosting tends to be billed on bandwidth (not total transfer), so this software doesn't hit the site's owner directly in the wallet; instead it just makes the site less responsive for other users.
This idea is even worse than archivers, as most of the bytes will sit in a cache and get expired, instead of in an archive directory where they _might_ someday be seen or used. In the case of an archiver, the user went out of their way to obtain a complete local copy of the web site, presumably because they are interested in the material and might actually read it off-line. With a predictive caching proxy, there's no indication from the user that they will ever make any use of the material downloaded. The vast majority will sit in the cache, which will in all likelihood rapidly need to remove least recently obtained pages. In normal caching terminology, one would say "least recently used", but in this caching scheme, the vast majority of pages in the cache will never have been viewed. Utter waste.
This sort of net-unfriendly behavior is analogous to pollution. Even if just one person pollutes the environment, there is some small harm to a small number of people, perhaps significant harm to a couple if the pollution is severe. If a large company pollutes recklessly, perhaps a community or two is badly affected. If pollution becomes widespread, it's a global problem and almost everyone is harmed.
Likewise, if you hack together a predictive look-ahead caching proxy and use it amongst yourself and your friends, you're impacting the net much as if you'd taken your used motor oil and other waste and dumped it directly in a local stream. If a large company or two replaces their bandwidth-conserving squid proxy with your bandwidth-abusing look-ahead caching, a lot of sites will suffer increased costs. If its use becomes widespread, it would significantly increase overall bandwidth usage on the net, in all likelihood raising costs enough to be passed all the way back down to end users, and it'd raise the cost of hosting web sites, which would need to be made up somehow. Perhaps large websites could absorb the cost of more bandwidth? Smaller sites, like mine, would be in a world of hurt. For quite some time, I paid out-of-pocket a couple hundred dollars a month to keep the site up. Now, we're making some small sales from the site... getting close to covering the costs.
Well, that's been a long rant. I hope you'll take a moment to consider that web site operators pay real dollars to make their sites available to you, and keep that in mind when you consider designing networking software.
PJRC: Electronic Projects, 8051 Microcontroller Tools
This is not the purpose of squid. Squid is designed for helping a large number of users share a not-so-large amount of bandwidth. If squid performed the actions that you mention in your post, it would be defeating this purpose. Imagine if you have 200 users sharing a DSL line. Squid would cache all those pages that everybody constantly loads up (yahoo, msn, etc.). If it also grabbed all the extra links, it would be wasting bandwidth, because most of what it fetched would never be requested by anyone.
What should really happen is that the first user that goes to a site should have to wait a bit. Squid will then have a cache and it'll be fast after that. But you know that.
Squid just isn't the right tool... perhaps if you rewrite it/make your own proxy you can do what you're looking for...
-Andy
Use Google's cache. http://www.google.com/search?q=cache:www.domain.com will bring up the cached version of the page (with the Google header). They can probably tweak this to avoid abuse, but for now it's open.
I think a lot of time is wasted waiting for pages to appear. I routinely use Google's cached version of a page to avoid indefinite waits (and in fact the "stale" version is often better than a page that's gone dead; after all, it's more likely to match my query). Large-scale caching a la Akamai is probably the way to go for a lot of sites.
It would be interesting to build a distributed network of cache sites to decrease latency. My thought is to use a proxy of some sort to get requested pages, and then stream them out on IP multicast to other proxies. Each client proxy could then maintain a certain number of recently-broadcast pages as a cache. I think this has a lot of similarity to Gnutella, so maybe a gateway could be built around this program instead.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger