Amazon Bots Cause Grief For Associate Web Sites
theodp writes "Amazon Associates and Web Services developers are crying foul over the hammering they're taking from ill-behaved bots that Amazon had subsidiary Alexa Internet dispatch to evaluate the 'quality and reliability' of their sites. Amazon fessed up and acknowledged problems exist, but points to recent Operating Agreement changes that not only give Amazon and any of its corporate affiliates the right to do so, but also to use unstated technical means to overcome any methods that are used to try to block or interfere with such crawling or monitoring. Interesting stance from the folks who called on the Senate to prosecute those who degrade the technical quality of service at web sites."
We've noticed quite a few requests for robots.txt by the Alexa archiver. So one suggestion is to put the following in a file called robots.txt in the root directory of your domain's web site:
User-agent: ia_archiver
Disallow: /
And if it's really annoying, bloody hell, just do an active firewall block and put the sharks (lawyers) away with those goofy lawsuits before they start wasting our senators' time and taxpayer cash.
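For anyone who wants to check what a given robots.txt entry actually does, here is a minimal sketch using Python's standard urllib.robotparser. The example.com URL, the OtherBot name and the allowed() helper are made up for illustration. It only shows how a well-behaved parser reads the rules, not whether any particular bot honors them. Note that an empty Disallow line permits everything, while "Disallow: /" shuts the named agent out of the whole site.

```python
from urllib.robotparser import RobotFileParser

# Two candidate robots.txt bodies, given as lists of lines.
BLOCK_ALL = ["User-agent: ia_archiver", "Disallow: /"]
EMPTY_RULE = ["User-agent: ia_archiver", "Disallow:"]

# Hypothetical URL, used only for this illustration.
URL = "http://www.example.com/cgi-bin/search?q=books"


def allowed(robots_lines, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed(BLOCK_ALL, "ia_archiver", URL))   # the named agent is shut out
    print(allowed(EMPTY_RULE, "ia_archiver", URL))  # an empty Disallow allows everything
    print(allowed(BLOCK_ALL, "OtherBot", URL))      # other agents are unaffected
```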
The problem is that the bots are way too diligent. They go to every single link on every single page, even if the page is dynamically generated. Many sites have an infinite combination of URLs, and as a result, the bots sit on them trying to download every single variation of the query. That means that Joe Amazon Associate's web site is hammered with requests and his bandwidth fees go through the roof.
The simple solution would be to just stop Amazon from sucking up the bandwidth via a robots.txt file. But Amazon says that is not allowed. There's the dilemma.
Amazon.com has been silent on this issue for the last several days. My bet is that the bots won't come back without some heavy-duty tuning.
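One way a crawler could avoid sitting on an endless space of query variations is to cap how many distinct query strings it will fetch for any one path. The sketch below is only an illustration of that idea, not a description of how the Alexa bot works; the BoundedFrontier class, its default limit of 5 and the example.com URLs are all assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


class BoundedFrontier:
    """Tracks visited URLs and caps query-string variants per (host, path)."""

    def __init__(self, max_variants_per_path=5):
        self.max_variants = max_variants_per_path
        self._seen = set()           # exact URLs already accepted
        self._per_path = Counter()   # accepted query variants per (host, path)

    def should_fetch(self, url):
        if url in self._seen:
            return False
        parts = urlsplit(url)
        key = (parts.netloc.lower(), parts.path)
        # Only URLs that carry a query string count against the cap.
        if parts.query and self._per_path[key] >= self.max_variants:
            return False
        self._seen.add(url)
        if parts.query:
            self._per_path[key] += 1
        return True


if __name__ == "__main__":
    frontier = BoundedFrontier(max_variants_per_path=2)
    for i in range(4):
        url = f"http://example.com/search?q={i}"
        print(url, frontier.should_fetch(url))
```

With a cap like this, a dynamically generated search page costs the site a handful of requests instead of one per possible query.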
The way to make this palatable is to lower the request rate to something like 1 per minute.
Most robots do something like that.
Of course - it takes a lot longer....
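A rough sketch of that kind of throttling, assuming a fixed minimum gap between requests to one site. The RequestPacer class and its parameters are hypothetical names for this example; the clock and sleep functions are passed in so the pacing can be exercised without really waiting.

```python
import time


class RequestPacer:
    """Enforces a minimum interval (in seconds) between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - self._clock())
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    pacer = RequestPacer(min_interval=2.0)  # use 60.0 for one request per minute
    for n in range(3):
        slept = pacer.wait()
        print(f"request {n} after sleeping {slept:.1f}s")
```

A crawler would call wait() before each fetch, so raising min_interval to 60 gives the one-per-minute rate at the cost of a much longer crawl.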
It's Christmas everyday with BitTorrent.
Therefore, you agree that we and our corporate affiliates may take such actions and that you will not seek to block or otherwise interfere with such crawling or monitoring (and that we and our corporate affiliates may use technical means to overcome any methods used on your site to block or interfere with such crawling or monitoring).
Actually, doesn't it say that you aren't allowed to block it, but if you do, they can try and get around it?
Looked to me like the latest update on the noamazon.com site that you link to was Valentine's Day, 2001. Hardly the most active looking protest.
-- Oh Well
That is so blatantly wrong, how can it be modded up to 4?!
"you agree ... and that you will not seek to block or otherwise interfere with such crawling or monitoring"
It says exactly that you agree not to block them!
It looks like these people are signing agreements they didn't read or understand.
They have a few options that I can see.
Terminate the agreement.
Bill for the bandwidth, or sue for damages.
Various technical measures (which are prohibited by the agreement)
Point out to your contacts at Amazon that this is pointless and dumb in such a manner they actually listen.
Make a mini site for the Amazon site/bot, but put the rest of your website in a second location (that they don't have access to).
Why deal with a company like this anyway? They're obviously inconsiderate pricks (at least). Move on with your life.
http://forums.prosperotechnologies.com/n/mb/message.asp?webtag=am-associhelp&msg=2579.1&maxT=3
OK, that is a post from the associates board, in which Amazon states:
"Hello Associates.
Thank you for providing such valuable feedback. The Alexa crawl (id amzn_assoc) has ceased while we investigate the statements made in this post. We plan to address the following concerns:
1. The impact the crawler may have on bandwidth
2. The number of pages the crawler hits per second
3. How the Alexa crawler might identify and ignore AWS pages or links
Points of clarification:
1. Regarding Archive.org, Alexa has confirmed that material that is crawled by the 'amzn_assoc' crawler is not donated to the internet archive. It is used exclusively for the purposes of the Broken Link Reports.
2. The Alexa crawler 'amzn_assoc' differs from the 'ia_archiver' crawler. The 'ia_archiver' can be excluded by using a robots.txt file and will not violate the Amazon.com Associates operating agreement.
You should expect a response from us by COB Friday as it may take a few days to research your concerns. This issue is important to us and we will get it resolved. Thank you for your patience.
The Amazon.com Associates Program"
I participated in that conversation myself, though, and I don't think I saw one person who was happy about the agreement being changed so that we have to let them crawl our sites as often as they like.
cj.com reports broken links, but they do it from the server end. Amazon's system is just stupid, and it was only done to try and give their Alexa company some work to do.
So I guess it's just wait and see now, until we know if the bot starts back up again.
As has been noted elsewhere, the affiliate bot ignores robots.txt. Disallowing ia_archiver will have the effect of removing the site from the wayback machine (http://www.archive.org/), which may not be what you want to do.
A while back (when I was still using a Cobalt RaQ2 - adequate for the job, but not particularly speedy with CGI scripts) I got DoSSed by ia_archiver (yes, cgi-bin is in robots.txt; no, I'm not associated with Amazon, but someone else who links to the CGI script in question probably was). I thought ia_archiver was another Teleport Pro, and just modified the actual script to display a rejection page if it saw ia_archiver in the HTTP_USER_AGENT.
Finally, I know what it is...
It was trying to crawl *every* available url for the CGI script - and it appeared to be buggy because it got itself into an endless loop changing from one mode to the other.
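The kind of check described above might look roughly like this. It is a sketch in Python rather than whatever the original script was written in; the function names and the wording of the rejection page are invented for the example, and only the HTTP_USER_AGENT variable and the ia_archiver name come from the post.

```python
import os

# Agent substrings to turn away; ia_archiver is the one named in the post.
BLOCKED_AGENTS = ("ia_archiver",)


def is_blocked(environ):
    """Return True if the CGI environment's user agent matches a blocked name."""
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    return any(name in agent for name in BLOCKED_AGENTS)


def respond(environ):
    """Build a CGI response: a rejection page for blocked agents, else the normal page."""
    if is_blocked(environ):
        return (
            "Status: 403 Forbidden\r\n"
            "Content-Type: text/plain\r\n\r\n"
            "Automated crawling of this script is not permitted.\n"
        )
    # Placeholder for the script's real output.
    return "Content-Type: text/plain\r\n\r\nNormal page output goes here.\n"


if __name__ == "__main__":
    # Under a CGI server, the request headers arrive in os.environ.
    print(respond(os.environ), end="")
```

This only helps against a bot that sends an honest user-agent string, and each rejected request still costs a script invocation, so it reduces the load rather than removing it.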
Oolite: Elite-like game. For Mac, Linux and Windows
http://www.booksense.com/affiliate/
I am an Amazon Associate who has experience with the Alexa Crawler. I believe the crawl is intended to find broken links, or links to products that are no longer stocked.
The Amazon Associates program has been around long enough for "page rot" to kick in, and I am sure there are many sites out there with links to nonexistent products, such as old editions of books, etc. Historically, associates had to build static links (for the most part) by hand, and embedded them in more or less static pages.
The problem comes in due to the recent introduction of their web services, where sites can build essentially unlimited pages based on dynamic real-time queries to Amazon. I don't believe their intent is to "thrash around" in these sites, which is what is occurring.
A few months ago, I asked to have the Alexa bot crawl my site (StarvingMind.net); I was curious about the reports it was able to generate. The bot ended up in endless loops and had to be manually stopped by someone at Alexa. They spent an impressive amount of time trying to identify and fix the problem my site was creating for their bot. I don't know whether my specific problem was ever resolved, but I have the impression the bug was found and fixed. I also have the impression that the bot is very immature code and buggy.
Based on the personal and public responses I have seen from the Amazon and Alexa people involved, they actually do care about these issues very much, and don't wish to cause harm by the bot's use. I believe their goal is to eliminate the link rot that has accumulated on associate sites over the years, many times with the site owner unaware of the problem.
Web services threw a curve into the mix, and that is where the major problems are occurring. The post I am replying to seems to imply Amazon may want to "use then throw out" the associates. I think that is pure speculation without any knowledge of the facts. Amazon has recently gone from what appeared to be no full-time staff to a team of people dedicated to supporting and running the associates program. I believe they consider it a very cost-effective way of advertising, and I expect it is doing quite well for them. Based on their recent actions, I believe they are trying to build a strong long-term relationship with the active ones of us, as we bring them a fair amount of business.
Another post has pointed out they have stopped the crawl while the issues talked about here are looked into. They realize they may have made a mistake, and are trying to figure out how to address the problem. They have been responsive (with me at least) in resolving problems like this in the past; they deserve a chance to resolve it this time as well. They have started down the right path by stopping the crawl.
-Pete
Soccer Goal Plans
That sounds weird... Isn't the US the "Land of the lawsuit"? I've read about people suing companies for sexual harassment, and winning. Now you get physical damage, assault and whatnot, and she has to quit? Wouldn't one of those late-night 1-800-SUE-ME lawyers take this case? Seems pretty much open and shut to me.
Marriage is considered capital punishment for the theft of a goat in some third world countries...
> Looking at user agents, the browser war is over. IE is #1, and Netscape often isn't even in the top 10; various indexer 'bots generate more traffic than Netscape.
Looking at user agents is incredibly foolish since most browsers' agent strings default to IE and most users don't change that default.
Terminate the agreement.
Bill for the bandwidth, or sue for damages.
Various technical measures (which are prohibited by the agreement)
Point out to your contacts at Amazon that this is pointless and dumb in such a manner they actually listen.
Here's an idea.... How about politely posting a question or two about it in the appropriate forums? Who knows, something crazy might happen, like responsible people at Amazon might respond and turn the bot off while they investigate. Then, they might post a reasonable explanation and take reasonable steps to make sure they're not abusing associates' servers.
Here's another idea.... Try reading the pages that slashdot linked to. I know that's a lot of work, so I'll save you a bit of effort by posting each slashdot link, and a brief summary of what you would have found had you bothered to click on it and ACTUALLY READ it (before posting here with a subject advocating actually reading the terms and conditions).
This just isn't that sensational of a story. Yet another 'bot that needs some refinement, but it IS designed to avoid more than one hit every 2 seconds (and the evidence posted seems to be consistent with that). They at least did respond to people's concerns and they took the bot off-line while they investigated it. Sounds pretty reasonable. It's not clear what might actually be done, and in places Amazon appears to be claiming the problem isn't so great... but clearly they are attempting to respond to people's concerns.
Amazon feels they have a right to check the links on associate sites, and they put it in the terms. Again, it's really not that unreasonable.
What is unreasonable is the inflammatory summary appearing on the main slashdot page. Yes, timothy and other slashdot "editors" can claim it's all just editorial from "theodp" who submitted the summary. But what kind of editing is that?
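The summary concludes with:
"Interesting stance from the folks who called on the Senate to prosecute those who degrade the technical quality of service at web sites."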
The link is to Amazon's position on DDOS attacks... there's really no similarity to a well-intentioned 'bot, which clearly identifies itself, limits itself to 0.5 Hz access rate, AND was responsibly taken off-line and reexamined when some people complained that it used too much bandwidth.
PJRC: Electronic Projects, 8051 Microcontroller Tools