Slashdot Mirror


Cloaking Detection?

drcrja asks: "I am conducting some academic research on the use of cloaking and how it affects search engine rankings (cloaking is the practice of delivering a specially optimized page to search engine spiders while delivering a completely different page to the user). I am currently using AltaVista's Babel Fish to retrieve pages and comparing them to the pages on the actual web sites, but I am trying to find other methods of detecting cloaking. Do any members of the /. community have experience with this?"
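The comparison described in the question can be sketched in Python: fetch the same URL two ways, then score how much the visible text differs. This is only a rough sketch under stated assumptions. The `fetch`, `visible_words`, `similarity`, and `looks_cloaked` names are hypothetical, as is the 0.5 threshold, and a User-Agent switch alone will not catch sites that cloak by IP address.

```python
import difflib
import re
import urllib.request


def fetch(url, user_agent):
    """Fetch a URL with the given User-Agent header and return the body as text."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def visible_words(html):
    """Crude text extraction: drop script/style blocks and tags, then split into lowercase words."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity(html_a, html_b):
    """Return a ratio in [0, 1]; 1.0 means the visible word sequences are identical."""
    matcher = difflib.SequenceMatcher(
        None, visible_words(html_a), visible_words(html_b), autojunk=False
    )
    return matcher.ratio()


def looks_cloaked(html_a, html_b, threshold=0.5):
    """Flag a pair of responses whose visible text overlaps less than the (assumed) threshold."""
    return similarity(html_a, html_b) < threshold
```

In use, one would call `fetch` twice for the same URL (once with a spider-like User-Agent string, once with a browser-like one, or once through a third-party proxy) and pass both bodies to `looks_cloaked`. Pages with rotating ads or timestamps will never score exactly 1.0, so the threshold needs tuning against known-honest sites.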


  1. If cloaking becomes a problem... by tswinzig · Score: 3, Interesting

    ...why don't the search engines just play the game? Cloak themselves to look like regular users.

    Download the robots.txt file through one set of IP addresses, with your normal user-agent header. Then request the actual pages using a Mozilla or MSIE user-agent ID, from new IP addresses that cannot be traced back to Google (or whoever) using DNS. Queue up URLs to be downloaded in a random order so that a really clever website can't detect your robot by examining traffic patterns (i.e., maybe take a full day or week to download all the pages from a site).
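The "random order, spread over a day or a week" part of this idea can be sketched as a scheduling function. This is a minimal illustration, not how any real search engine works: `randomized_schedule` is a hypothetical name, and it only plans the order and timing of requests, leaving the fetching (and the choice of IP addresses and user-agent strings) to the caller.

```python
import random


def randomized_schedule(urls, window_seconds, rng=None):
    """Return (offset_seconds, url) pairs covering every URL exactly once.

    URLs are shuffled, and each is assigned a random offset inside the
    window, so requests arrive in no particular order and at irregular
    intervals. The result is sorted by offset.
    """
    rng = rng or random.Random()
    shuffled = list(urls)
    rng.shuffle(shuffled)
    offsets = sorted(rng.uniform(0, window_seconds) for _ in shuffled)
    return list(zip(offsets, shuffled))


if __name__ == "__main__":
    pages = [f"/page{i}.html" for i in range(5)]
    # Spread five requests over one day (86400 seconds).
    for offset, url in randomized_schedule(pages, 86400):
        print(f"{offset:9.0f}s  {url}")
```

A crawler would sleep until each offset before issuing the request. Uniformly random gaps are an assumption here; real visitor traffic tends to be burstier, so a determined site could still look for requests that are too evenly spread.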

    If they did all this, could someone still detect it's a search engine robot and use cloaking?

    --

    "And like that ... he's gone."
    1. Re:If cloaking becomes a problem... by mclearn · Score: 2, Interesting

      Well, I guess that's what happens when you don't hit preview, eh? :-)

      I was going to qualify that by saying that, statistically speaking, one could deliver the false page to the set of requestors following closely behind the IP that grabbed /robots.txt. Of course, you go on to ask why the search engine companies don't just spread the requests out.
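The statistical heuristic in the paragraph above can be sketched from the site's side as a pass over an access log. This is a hypothetical illustration: the `suspects_after_robots` name, the `(timestamp, ip, path)` log format, and the 60-second window are all assumptions, and the rule will also flag ordinary visitors who happen to arrive just after a robot.

```python
def suspects_after_robots(log, window_seconds=60):
    """Return the set of IPs that look like part of a crawl.

    `log` is an iterable of (timestamp_seconds, ip, path) tuples sorted by
    time. An IP is flagged if it fetched /robots.txt, or if it requested
    any page within `window_seconds` after the most recent /robots.txt fetch.
    """
    suspects = set()
    last_robots = None  # time of the most recent /robots.txt request
    for ts, ip, path in log:
        if path == "/robots.txt":
            last_robots = ts
            suspects.add(ip)
        elif last_robots is not None and ts - last_robots <= window_seconds:
            suspects.add(ip)
    return suspects
```

Spreading requests over a long window, as the parent comment suggests, defeats this kind of rule because the follow-up requests fall outside any reasonable window.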

      Well, how do the search engines know who is a cloaker and who isn't? Search engines *should* be good netizens and abide by rules of conduct. Hence /robots.txt, throttling, some form of search that doesn't kill a server (e.g. breadth-first), etc.
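As a rough sketch of the "good netizen" rules listed above, here is a breadth-first traversal that checks robots.txt before visiting each path. It uses Python's standard `urllib.robotparser`; the `polite_bfs` name, the `ExampleBot` user-agent, and the in-memory `links` dictionary standing in for real fetched pages are assumptions made so the example runs without a network.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def polite_bfs(start, links, robots_txt, user_agent="ExampleBot", max_pages=100):
    """Return paths in breadth-first order, skipping anything robots.txt disallows.

    `links` maps each path to the paths it links to (a stand-in for parsing
    fetched HTML). `robots_txt` is the text of the site's robots.txt file.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    order = []
    seen = {start}
    queue = deque([start])
    while queue and len(order) < max_pages:
        path = queue.popleft()
        if not rules.can_fetch(user_agent, path):
            continue  # disallowed: neither visit it nor follow its links
        order.append(path)  # a real crawler would fetch here, then pause
        for nxt in links.get(path, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    site = {"/": ["/a", "/private/x", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": []}
    print(polite_bfs("/", site, robots))
```

Throttling is left out of the loop to keep the sketch short; a real crawler would sleep between fetches, and `max_pages` caps how much of any one site gets pulled in a single pass.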