Cloaking Detection?
drcrja asks: "I am conducting some academic research on the use of cloaking and how it affects search engine rankings (cloaking is the practice of delivering a specially optimized page to search engine spiders while delivering a completely different page to the user). I am currently using Alta Vista's Babel Fish to retrieve pages and compare those pages to the pages on the actual web sites but I am trying to find other methods of detecting cloaking. I am wondering if any members of the /. community have any experience with this?"
Ingredients:
Computer
Perl
Internet Connection
libwww-perl (LWP), LWP::UserAgent, and all the dependencies (I forget which)
Optional: Perl Cookbook, by Christiansen and Torkington.
Directions:
Start with the Perl Cookbook to give you a quick background of how to design an autonomous www agent that will crawl around gathering webpages. You can have them visit links or read from a list of links or whatever you want.
Read the documentation from the Perl UserAgent libs and figure out how to change the http headers to spoof various browsers. I've done this before. I think that I ended up going into the UserAgent code and doing this manually. I don't remember exactly how I accomplished this, I just remember that it was easy.
Now have two agents crawl websites and compare the output from each site using the "Spider" HTTP headers with the output from spoofing the "IE" HTTP headers. Websites would sometimes still think the IE headers were a robot. The key is to pause between requests so that it looks as though a human is reading the page, clicking the links, etc.
Keep track of the sites that are different, or keep track of whatever stats you need.
Mix, Stir, Burn, Enjoy.
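The recipe above can be sketched roughly as follows. This is only a minimal sketch, and it uses Python's standard library as a stand-in for Perl's LWP. The two User-Agent strings, the function names, and the 5 to 15 second pause are all assumptions for illustration, not anything from the original recipe.

```python
import hashlib
import random
import time
import urllib.request

# Illustrative User-Agent strings (assumptions); substitute whatever the
# spider and browser you want to imitate actually send.
SPIDER_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


def build_request(url, user_agent):
    """Return a GET request for url that carries the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=30):
    """Download url while claiming to be user_agent; return the raw body."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def bodies_differ(body_a, body_b):
    """Compare two page bodies by SHA-256 digest."""
    return hashlib.sha256(body_a).digest() != hashlib.sha256(body_b).digest()


def looks_cloaked(url, pause_range=(5.0, 15.0), fetcher=fetch, sleep=time.sleep):
    """Fetch url as a spider, pause like a human reader, then fetch as a browser."""
    as_spider = fetcher(url, SPIDER_UA)
    sleep(random.uniform(*pause_range))  # hypothetical "human" reading time
    as_browser = fetcher(url, BROWSER_UA)
    return bodies_differ(as_spider, as_browser)
```

Calling looks_cloaked("http://example.com/") would return True when the two bodies are not byte-identical. Pages with rotating ads or timestamps will also trip an exact comparison like this, so treat a True as "worth a look" rather than proof.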
I've actually done this type of thing before in order to test various IE-only websites on non-IE browsers (non-MS computers). My results were that all of the pages that *require* IE render perfectly in Mozilla and most render fine in Opera. I still don't understand why businesses would *turn away* potential customers only for having different HTTP headers!
Bringing irony to the Slash-masses
I imagine wget or another HTTP client can be coaxed to spit out the spider and browser type strings associated with search engine spiders. It would be a simple, straightforward hack to make a script that would request a page twice, once reporting itself as a search engine (and requesting the robots.txt file for good measure) and secondly as a regular browser. Then do a simple compare.
You could give it a list of sites and it could go through dozens or hundreds of sites a minute, rather than you doing it by hand. You could have it save pages that show differences, or at least give you the URLs so you could load them later and study the differences (if that is a goal).
You could use PHP, Perl, Java, etc. to do this very simply as well. I imagine a simple PHP script could well be less than 50 lines, and could even call your browser and load the two pages side by side each time it found a difference.
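A rough sketch of that request-twice-and-compare script, written in Python rather than PHP. The function names and the idea of returning a unified diff per URL are assumptions of this sketch; the User-Agent strings are left for the caller to supply.

```python
import difflib
import urllib.request


def fetch_text(url, user_agent, timeout=30):
    """Fetch url with the given User-Agent and return the body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def page_diff(spider_html, browser_html):
    """Return unified-diff lines between the two versions (empty if equal)."""
    return list(difflib.unified_diff(
        spider_html.splitlines(),
        browser_html.splitlines(),
        fromfile="as-spider",
        tofile="as-browser",
        lineterm="",
    ))


def scan(urls, spider_ua, browser_ua, fetcher=fetch_text):
    """Map each URL whose two versions differ to its diff; skip the rest."""
    report = {}
    for url in urls:
        diff = page_diff(fetcher(url, spider_ua), fetcher(url, browser_ua))
        if diff:
            report[url] = diff
    return report
```

Feeding scan() a list of sites gives back only the URLs that showed differences, along with the differing lines, so they can be studied later.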
-Adam
--
Evan
"$30 for the One True Ring. $10 each additional ring!" -- JRR "Bob" Tolkien
...why don't the search engines just play the game? Cloak themselves to look like regular users.
Download the robots.txt file through one set of IP addresses, with your normal user-agent header. Then request the actual pages using a Mozilla or MSIE user-agent ID, and using new IP addresses that cannot be traced back to Google (or whoever) using DNS. Queue up URLs to be downloaded in a random order so that a really clever website can't detect your robot by examining traffic patterns. (I.e. maybe take a full day or week to download all the pages from a site.)
If they did all this, could someone still detect it's a search engine robot and use cloaking?
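For illustration, the random-order queue described above might look something like this in Python. It is a sketch only: plan_crawl, user_agent_for, and the choice of a uniform random spread over the time window are hypothetical, and the separate IP addresses are outside what a snippet can show.

```python
import random
from urllib.parse import urlsplit


def plan_crawl(urls, window_seconds, rng=None):
    """Give each URL a random start offset inside the window.

    Returns (offset_seconds, url) pairs sorted by time, so the visiting
    order is random and the requests are spread over the whole window.
    """
    if rng is None:
        rng = random.Random()
    plan = [(rng.uniform(0, window_seconds), url) for url in urls]
    plan.sort()
    return plan


def user_agent_for(url, crawler_ua, browser_ua):
    """Use the honest crawler ID for robots.txt, a browser ID otherwise."""
    return crawler_ua if urlsplit(url).path == "/robots.txt" else browser_ua
```

With window_seconds set to 86400, a site's pages would be requested at scattered moments over a full day rather than in one burst.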
"And like that
Some sites may apply cloaking based on the IP address of the spider.
I suggest using Google's cache as a method to detect cloaking. The advantage is that the page cached is exactly the same page used for indexing, and Google is the most popular search engine, and thus you win.
Make even shorter URLs - 8LN.org
Read the documentation from the Perl UserAgent libs and figure out how to change the http headers to spoof various browsers. I've done this before. I think that I ended up going into the UserAgent code and doing this manually. I don't remember exactly how I accomplished this, I just remember that it was easy.
It's quite straightforward:
    use LWP::UserAgent;
    my $ua = LWP::UserAgent->new;
    # Send any User-Agent string you like with every request from $ua
    $ua->agent("Whatever 3.11/sun4u");
Use your long range sensors to detect search-space anomalies.
"Your superior intellect is no match for our puny weapons!"
Hi
First: sometimes Google's cached copies of pages might be informative.
Changing your browser's User-agent string won't always detect the cloaking, as it is quite likely to be configured to work by IP address block too (googlebot!). Similarly, Babel Fish may not show cloaked pages because it comes from a different IP than AltaVista's index bots, and this can be checked for in the cloaking server's config.
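To see why a spoofed User-agent does nothing against this, here is a minimal sketch of how a cloaking server could decide by IP block alone. Everything in it is hypothetical: the networks are placeholder documentation ranges (RFC 5737), not real spider addresses, and the page names are made up.

```python
import ipaddress

# Placeholder networks from the RFC 5737 documentation ranges (assumptions);
# a real cloaker would list the address blocks its target spiders crawl from.
SPIDER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_spider_ip(remote_addr):
    """True if the client address falls inside a listed spider network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in SPIDER_NETWORKS)


def page_for(remote_addr, user_agent):
    """Choose which page to serve; the User-Agent is deliberately ignored."""
    return "optimized.html" if is_spider_ip(remote_addr) else "normal.html"
```

A request claiming to be Googlebot from an unlisted address still gets normal.html here, which is exactly the case a User-agent comparison cannot catch.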
Second: it is *imperative* that search engines keep unique user-agent strings that identify them. P'haps none of you who suggested the engines change their user-agent string runs a website? It would remove a great tool from log analysis, and in the end make no difference to cloakers, as they'd just do engine detection by IP anyway.
I recommend you look at WebmasterWorld; there is a massive amount of knowledge there.
One of the guys from Google even posts there on occasion.
no sig.
(cloaking is the practice of delivering a specially optimized page to search engine spiders while delivering a completely different page to the user).
There are good reasons for a site to respond differently to different clients. Indeed responding to the capabilities of the client should be considered 'best practice'.
There are a host of client types out there other than just PC browsers and robots: IDTV set-top boxes, 3G and WAP phones, convergent devices. The range is set to explode.
This is the whole reason for the HTTP 'Accept' header, which is provided to allow a server to handle clients with different capabilities.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec
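As a rough illustration of the Accept-based approach described above, a server could pick a representation along these lines. This is a simplified sketch, not a full implementation of the RFC's rules: it ranks media ranges by q-value only and ignores the precedence of more specific ranges, and the function names are hypothetical.

```python
def parse_accept(header):
    """Parse an Accept header into (media_range, q) pairs, highest q first."""
    ranges = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_range = fields[0]
        if not media_range:
            continue
        q = 1.0  # a range without an explicit q parameter defaults to 1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        ranges.append((media_range, q))
    # sorted() is stable, so equal q-values keep their order in the header
    return sorted(ranges, key=lambda item: item[1], reverse=True)


def matches(media_range, content_type):
    """True if content_type falls within media_range (supports * wildcards)."""
    r_type, _, r_sub = media_range.partition("/")
    c_type, _, c_sub = content_type.partition("/")
    return r_type in ("*", c_type) and r_sub in ("*", c_sub)


def choose(header, available):
    """Return the available type matching the client's best-ranked range."""
    for media_range, q in parse_accept(header):
        if q <= 0:
            continue
        for content_type in available:
            if matches(media_range, content_type):
                return content_type
    return None
```

A WAP phone sending "text/vnd.wap.wml, text/html;q=0.5" would be handed the WML version, while a desktop browser preferring text/html gets the HTML one. The point is that the server varies its response on what the client says it can handle, not on who the client is.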
Didn't that require some sort of inverse tachyon pulse from the main deflector?