Cloaking Detection?
drcrja asks: "I am conducting some academic research on the use of cloaking and how it affects search engine rankings (cloaking is the practice of delivering a specially optimized page to search engine spiders while delivering a completely different page to the user). I am currently using AltaVista's Babel Fish to retrieve pages and comparing them to the pages served on the actual web sites, but I am trying to find other methods of detecting cloaking. Do any members of the /. community have experience with this?"
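One way to make that comparison repeatable is to request the same URL twice, once announcing a spider user-agent and once announcing a browser, and score how much the two responses differ. The sketch below is only an illustration: the user-agent strings, the 0.5 threshold, and the names `fetch`, `similarity`, and `looks_cloaked` are assumptions, not anything from the question. It also only catches sites that cloak on the User-Agent header; a site that cloaks by IP address would serve both requests the same page.

```python
import difflib
import urllib.request

# Assumed, illustrative user-agent strings; substitute whatever you want to test.
SPIDER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


def fetch(url, user_agent, timeout=10):
    """Download a page while announcing the given User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def similarity(a, b):
    """Word-level similarity in [0, 1]; 1.0 means the word sequences match."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()


def looks_cloaked(url, fetcher=fetch, threshold=0.5):
    """Return (suspicious, score) for one URL.

    `fetcher` is injectable so the comparison can be tried on saved pages
    without touching the network. The threshold is a guess, not a tuned value.
    """
    spider_page = fetcher(url, SPIDER_UA)
    browser_page = fetcher(url, BROWSER_UA)
    score = similarity(spider_page, browser_page)
    return score < threshold, score
```

Pages with rotating ads or timestamps will never score exactly 1.0, so a low score is a reason to look at the two copies by hand rather than proof of cloaking.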
...why don't the search engines just play the game? Cloak themselves to look like regular users.
Download the robots.txt file through one set of IP addresses, with your normal user-agent header. Then request the actual pages using a Mozilla or MSIE user-agent ID, from different IP addresses that cannot be traced back to Google (or whoever) using DNS. Queue up URLs to be downloaded in a random order so that a really clever website can't detect your robot by examining traffic patterns (for example, take a full day or week to download all the pages from a site).
If they did all this, could a site still detect that it's a search engine robot and cloak for it?
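The scheduling half of that idea can be sketched roughly as follows: honor robots.txt under the crawler's real name, then shuffle the remaining URLs and spread them at random moments across a long window. This is a minimal sketch, not how any search engine actually works; the names `allowed_urls` and `plan_crawl`, the agent name `ExampleBot`, and the one-day default window are all assumptions, and routing requests through untraceable IP addresses is not shown.

```python
import random
import urllib.robotparser


def allowed_urls(urls, robots_txt, agent="ExampleBot"):
    """Keep only the URLs that robots.txt permits for the crawler's real name.

    `robots_txt` is the text of a robots.txt file that was already downloaded
    with the normal user-agent; `agent` is a hypothetical crawler name.
    """
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [u for u in urls if rp.can_fetch(agent, u)]


def plan_crawl(urls, window_seconds=86400, seed=None):
    """Return (offset_seconds, url) pairs in random order over the window.

    Shuffling the URLs and drawing independent random offsets avoids the
    steady, in-order request pattern that gives a crawler away.
    """
    rng = random.Random(seed)
    order = list(urls)
    rng.shuffle(order)
    offsets = sorted(rng.uniform(0, window_seconds) for _ in order)
    return list(zip(offsets, order))
```

A caller would sleep until each offset and then fetch that URL with a browser-style user-agent. Whether this leaves anything detectable depends on what else the site can observe, which is exactly the question being asked.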
"And like that