Cloaking Detection?
drcrja asks: "I am conducting some academic research on the use of cloaking and how it affects search engine rankings (cloaking is the practice of delivering a specially optimized page to search engine spiders while delivering a completely different page to the user). I am currently using Alta Vista's Babel Fish to retrieve pages and compare those pages to the pages on the actual web sites but I am trying to find other methods of detecting cloaking. I am wondering if any members of the /. community have any experience with this?"
I imagine wget or another HTTP client can be coaxed into sending the User-Agent strings that search engine spiders and ordinary browsers use. It would be a simple, straightforward hack to write a script that requests a page twice, once identifying itself as a search engine spider (and requesting the robots.txt file for good measure) and once as a regular browser, and then does a simple compare.
You could give it a list of sites and it could go through dozens or hundreds of sites a minute, rather than you doing it by hand. You could have it save pages that show differences, or at least give you the URLs so you could load them later and study the differences (if that is a goal).
You could use PHP, Perl, Java, etc. to do this very simply as well. I imagine a simple PHP script could well be under 50 lines, and could even launch your browser and load the two pages side by side each time it found a difference.
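Here is a rough sketch of that script, in Python rather than PHP, using only the standard library. The two User-Agent strings and the names fetch, looks_cloaked and scan are my own placeholders, not anything a real spider or tool is guaranteed to use, so treat it as a starting point rather than a finished detector.

```python
import hashlib
import urllib.request

# Assumed User-Agent strings, for illustration only; real spiders and
# browsers send different (and changing) values.
SPIDER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"


def fetch(url, user_agent, timeout=10):
    """Return the response body for url, requested with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def looks_cloaked(url, fetcher=fetch):
    """Return True if the spider and browser responses differ byte for byte."""
    as_spider = fetcher(url, SPIDER_UA)
    as_browser = fetcher(url, BROWSER_UA)
    return hashlib.sha256(as_spider).digest() != hashlib.sha256(as_browser).digest()


def scan(urls, fetcher=fetch):
    """Return the URLs whose two responses differ, skipping fetch failures."""
    suspects = []
    for url in urls:
        try:
            if looks_cloaked(url, fetcher):
                suspects.append(url)
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            continue
    return suspects
```

Calling scan() with your list of URLs gives back the ones worth a closer look. A byte-for-byte compare will also flag pages that change on every request (rotating ads, timestamps, session IDs), so a difference is only a hint that the page deserves a manual check, not proof of cloaking.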
-Adam
Hi
First: sometimes Google's cached copies of pages can be informative, since they show what the spider was actually served.
Changing your browser's User-Agent string won't always detect the cloaking, as the cloaking server is quite likely to be configured to work by IP address block too (Googlebot!). Similarly, Babel Fish may not show cloaked pages because it comes from a different IP than Alta Vista's index bots, and this can be checked for in the cloaking server's config.
Second: it is *imperative* that search engines keep unique User-Agent strings that identify them. Perhaps none of you who suggested the engines change their User-Agent string runs a website? It would remove a great tool from log analysis, and in the end make no difference to cloakers, as they'd just do engine detection by IP anyway.
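To make the IP point concrete, here is a toy sketch in Python of how a cloaking server might decide which page to send. The address block is a reserved documentation range standing in for a spider's real addresses, and the function names are made up for illustration; I am not claiming any actual cloaking package works exactly like this.

```python
import ipaddress

# Hypothetical crawler address block (a reserved documentation range),
# standing in for whatever ranges a cloaker has on file for a spider.
SPIDER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_spider_ip(remote_addr):
    """Return True if remote_addr falls inside a known spider block."""
    address = ipaddress.ip_address(remote_addr)
    return any(address in network for network in SPIDER_NETWORKS)


def choose_page(remote_addr, user_agent):
    """Pick a page by source IP only; the User-Agent is deliberately ignored."""
    if is_spider_ip(remote_addr):
        return "optimized page for the spider"
    return "normal page for visitors"
```

With a check like this, a request from outside the block that claims to be Googlebot in its User-Agent still gets the normal page, which is why faking the string from your own machine can miss the cloaking entirely.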