Amazon Bots Cause Grief For Associate Web Sites

Posted by timothy on Sunday December 8, 2002 @10:49PM from the all-bots-must-be-attended dept.

theodp writes "Amazon Associates and Web Services developers are crying foul over the hammering they're taking from ill-behaved bots that Amazon had subsidiary Alexa Internet dispatch to evaluate the 'quality and reliability' of their sites. Amazon fessed up and acknowledged problems exist, but points to recent Operating Agreement changes that not only give Amazon and any of its corporate affiliates the right to do so, but also to use unstated technical means to overcome any methods that are used to try to block or interfere with such crawling or monitoring. Interesting stance from the folks who called on the Senate to prosecute those who degrade the technical quality of service at web sites."

The Alexa archiver -- you can stop that one. by Deal-a-Neil · 2002-12-08 22:59 · Score: 5, Informative

We've noticed quite a few requests for robots.txt by the Alexa archiver. So one suggestion is to put the following in the root directory of your domain's web site (in a file called robots.txt):

User-agent: ia_archiver
Disallow: /

And if it's really annoying, bloody hell, just do an active firewall block and put the sharks (lawyers) away with those goofy lawsuits before they start wasting our senators' time and taxpayer cash.
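To sanity-check a robots.txt rule like the one above before deploying it, here is a minimal sketch using Python's standard urllib.robotparser. The example.com URLs and the name "SomeOtherBot" are made up for illustration. It only shows what a crawler that honors robots.txt would do; it does nothing against a bot that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# The rule suggested above, as it would appear in robots.txt
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs, for illustration only
archiver_ok = is_allowed(ROBOTS_TXT, "ia_archiver", "http://example.com/cgi-bin/shop.cgi")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/index.html")
print("ia_archiver allowed:", archiver_ok)
print("SomeOtherBot allowed:", other_ok)
```

Agents not named in the file are unaffected by this rule, so a separate crawler with a different user agent would need its own entry.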
1. Re:The Alexa archiver -- you can stop that one. by hairmare · 2002-12-08 23:07 · Score: 5, Informative
  
  The problem seems to be the amzn_assoc crawler that Alexa uses on Amazon's behalf to find out about broken links.
  
  The bot was scanning through some kind of CGI script, thus generating thousands of requests.
  
  At issue is not only the frequency of page retrievals, but also the duration of the crawl. For example, on Nov 26th, amzn_assoc visited one of my sites 13,406 times over a period of 17 hours, consuming approximately 200 Mbytes of bandwidth via calls to MrRat's CGI script.
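For anyone who wants to check what those figures mean per request, here is a quick back-of-the-envelope sketch in Python. The only inputs are the numbers quoted above; treating 1 Mbyte as 1000 kbytes is an assumption made for the estimate.

```python
# Figures quoted in the post above
REQUESTS = 13_406
HOURS = 17
MEGABYTES = 200  # "approximately", per the post


def crawl_stats(requests: int, hours: float, megabytes: float) -> dict:
    """Derive average request spacing and size from the crawl totals."""
    return {
        "seconds_per_request": hours * 3600 / requests,
        "requests_per_minute": requests / (hours * 60),
        # Assumption: 1 Mbyte = 1000 kbytes
        "kbytes_per_request": megabytes * 1000 / requests,
    }


stats = crawl_stats(REQUESTS, HOURS, MEGABYTES)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

That works out to roughly one request every 4.6 seconds, sustained for 17 hours, at about 15 kbytes per request on average.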
I've been following that problem... by dagg · 2002-12-08 23:05 · Score: 5, Informative

I'm an Amazon associate, and I've been following this problem. Amazon's web bots are looking for outdated links to books that don't exist, etc. The reasoning is that if the associate fixes the dead links, then Amazon (and the associate) will presumably make more money.
The problem is that the bots are way too diligent. They go to every single link on every single page, even if the page is dynamically generated. Many sites have an effectively infinite number of URL combinations, and as a result, the bots sit on them trying to download every single variation of every query. That means that Joe Amazon Associate's web site is hammered with requests and his bandwidth fees go through the roof.
The simple solution would be to just stop Amazon from sucking up the bandwidth via a robots.txt file. But Amazon says that is not allowed. There's the dilemma.
Amazon.com has been silent on this issue for the last several days. My bet is that the bots won't come back without some heavy-duty tuning.
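One common way for a crawler to avoid the runaway behavior described above is to cap how many query-string variants of the same path it will fetch. The sketch below is only an illustration of that idea, not a description of how the Alexa bot works; the class name, the limits, and the example.com URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class BoundedFrontier:
    """Decides whether a crawler should fetch a URL, with simple caps."""

    def __init__(self, max_variants_per_path: int = 5, max_urls_per_host: int = 1000):
        self.max_variants_per_path = max_variants_per_path
        self.max_urls_per_host = max_urls_per_host
        self.seen = set()                 # exact URLs already accepted
        self.variants = defaultdict(int)  # (host, path) -> query variants accepted
        self.per_host = defaultdict(int)  # host -> URLs accepted

    def should_fetch(self, url: str) -> bool:
        if url in self.seen:
            return False
        parts = urlsplit(url)
        key = (parts.netloc, parts.path)
        if self.per_host[parts.netloc] >= self.max_urls_per_host:
            return False
        # Only URLs with a query string count as "variants" of a script
        if parts.query and self.variants[key] >= self.max_variants_per_path:
            return False
        self.seen.add(url)
        self.per_host[parts.netloc] += 1
        if parts.query:
            self.variants[key] += 1
        return True


frontier = BoundedFrontier(max_variants_per_path=3)
# A dynamic script that can generate endless query variations (hypothetical URL)
candidates = [f"http://example.com/cgi-bin/shop.cgi?page={i}" for i in range(100)]
accepted = [u for u in candidates if frontier.should_fetch(u)]
print(len(accepted), "of", len(candidates), "dynamic URLs would be fetched")
```

The trade-off is that a cap like this can miss genuinely broken links deep inside a dynamic site, so the limits would need tuning.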
--
Sex - Find It
1. Re:I've been following that problem... by oo7tushar · 2002-12-08 23:10 · Score: 3, Informative
  
  Most bots have this problem when they're initially made.
  
  Remember when you could boost your ratings on Google by trapping the bots?
  
  --
  internet like monkeys'
2. Re:I've been following that problem... by cottonmouth · 2002-12-08 23:41 · Score: 1, Informative
  
  Strange thing is I am not an Amazon associate and I get those bots.
All they have to do by TerryAtWork · 2002-12-08 23:18 · Score: 4, Informative

To make this palatable is lower the request rate to something like 1 per minute.

Most robots do something like that.

Of course - it takes a lot longer....
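A minimal Python sketch of that kind of throttle, together with an estimate of how much longer a crawl takes at the lower rate. The class and function names are hypothetical, and the 13,406-page figure is borrowed from the crawl reported elsewhere in this discussion purely as an example.

```python
import time


class Throttle:
    """Enforces a minimum interval between successive requests."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self.clock = clock   # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self._last = None    # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


def crawl_duration_hours(pages: int, interval_seconds: float) -> float:
    """Lower bound on crawl time when requests are spaced interval_seconds apart."""
    return pages * interval_seconds / 3600


# A crawler would call throttle.wait() before each request:
#   throttle = Throttle(60)   # one request per minute
#   throttle.wait(); fetch(url)   # fetch() is a placeholder, not defined here
hours = crawl_duration_hours(13_406, 60)
print(f"{hours:.1f} hours at one request per minute")
```

At one request per minute, the same 13,406 requests would be spread over about 223 hours, a little over nine days, instead of 17 hours.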

--
It's Christmas everyday with BitTorrent.
Re:Just block it? by Anonymous Coward · 2002-12-08 23:18 · Score: 1, Informative

Therefore, you agree that we and our corporate affiliates may take such actions and that you will not seek to block or otherwise interfere with such crawling or monitoring (and that we and our corporate affiliates may use technical means to overcome any methods used on your site to block or interfere with such crawling or monitoring).

Actually, doesn't it say that you aren't allowed to block it, but if you do, they can try and get around it?
Re:Boohoo!! by scrutty · 2002-12-08 23:22 · Score: 5, Informative

Looked to me like the latest update on the noamazon.com site that you link to was Valentine's Day, 2001. Hardly the most active looking protest.

--
-- Oh Well
Re:Just block it? by Anonymous Coward · 2002-12-08 23:40 · Score: 5, Informative

That is so blatantly wrong how can it be modded up to 4?!

It says exactly that you agree not to block them!

"you agree ... and that you will not seek to block or otherwise interfere with such crawling or monitoring"
Read before you sign. by nuggz · 2002-12-08 23:43 · Score: 5, Informative

It looks like these people are signing agreements they didn't read or understand.

They have a few options that I can see.

- Terminate the agreement.
- Bill for the bandwidth, or sue for damages.
- Various technical measures (which are prohibited by the agreement).
- Point out to your contacts at Amazon that this is pointless and dumb in such a manner they actually listen.

Make a mini site for the Amazon bot, and put the rest of your website in a second location (that they don't have access to).

Why deal with a company like this anyway? They're obviously inconsiderate pricks (at least). Move on with your life.
It's not running at the moment by jkcity · 2002-12-08 23:47 · Score: 5, Informative

http://forums.prosperotechnologies.com/n/mb/message.asp?webtag=am-associhelp&msg=2579.1&maxT=3

OK, that is a post from the associates board, in which Amazon states:

"Hello Associates.

Thank you for providing such valuable feedback. The Alexa crawl (id amzn_assoc) has ceased while we investigate the statements made in this post. We plan to address the following concerns:

1. The impact the crawler may have on bandwidth
2. The number of pages the crawler hits per second
3. How the Alexa crawler might identify and ignore AWS pages or links

Points of clarification:

1. Regarding Archive.org, Alexa has confirmed that material that is crawled by the 'amzn_assoc' crawler is not donated to the internet archive. It is used exclusively for the purposes of the Broken Link Reports.

2. The Alexa crawler 'amzn_assoc' differs from the 'ia_archiver' crawler. The 'ia_archiver' can be excluded by using a robots.txt file and will not violate the Amazon.com Associates operating agreement.

You should expect a response from us by COB Friday as it may take a few days to research your concerns. This issue is important to us and we will get it resolved. Thank you for your patience.

The Amazon.com Associates Program"

I participated in that conversation myself, though, and I don't think I saw one person who was happy about the agreement being written so that we have to let them crawl our sites as often as they like.

cj.com reports error links too, but they do it from the server end. Amazon's system is just stupid, and it was only done to try and give their Alexa company some work to do.

So I guess it's just wait and see now, until we know whether the bot starts back up again.
ia_archiver = wayback machine by cstrom · 2002-12-09 00:05 · Score: 5, Informative

As has been noted elsewhere, the affiliate bot ignores robots.txt. Disallowing ia_archiver will have the effect of removing the site from the Wayback Machine (http://www.archive.org/), which may not be what you want to do.
Ah, so that's what ia_archiver is... by Alioth · 2002-12-09 01:58 · Score: 5, Informative

A while back (when I was still using a Cobalt RaQ2 - adequate for the job, but not particularly speedy with CGI scripts) I got DoSed by ia_archiver (yes, cgi-bin is in robots.txt; no, I'm not associated with Amazon, but someone else who links to the CGI script in question probably was). I thought ia_archiver was another Teleport Pro, and just modified the actual script to display a rejection page if it saw ia_archiver in the HTTP_USER_AGENT.

Finally, I know what it is...

It was trying to crawl *every* available url for the CGI script - and it appeared to be buggy because it got itself into an endless loop changing from one mode to the other.
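The user-agent check described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the original script; the function names, the 403 status, and the rejection message are assumptions. Whether blocking of this kind is compatible with the Associates agreement is a separate question.

```python
# User-agent substrings to reject; ia_archiver is the one named above
BLOCKED_AGENTS = ("ia_archiver",)


def is_blocked(user_agent, blocked=BLOCKED_AGENTS) -> bool:
    """Case-insensitive substring match against the blocked list."""
    ua = (user_agent or "").lower()
    return any(name.lower() in ua for name in blocked)


def respond(environ) -> str:
    """Build a CGI-style response from the request environment."""
    if is_blocked(environ.get("HTTP_USER_AGENT")):
        return (
            "Status: 403 Forbidden\r\n"
            "Content-Type: text/plain\r\n\r\n"
            "Automated crawling of this script is not permitted.\n"
        )
    return (
        "Status: 200 OK\r\n"
        "Content-Type: text/plain\r\n\r\n"
        "(normal script output would go here)\n"
    )


# In a real CGI script the environment would be os.environ
crawler_reply = respond({"HTTP_USER_AGENT": "ia_archiver"})
browser_reply = respond({"HTTP_USER_AGENT": "Mozilla/4.0 (compatible)"})
print(crawler_reply.splitlines()[0])
print(browser_reply.splitlines()[0])
```

A check like this is cheap because it runs before the script does any real work, which is what matters on slow hardware. It only helps against bots that send an honest user-agent string.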

--
Oolite: Elite-like game. For Mac, Linux and Windows
Alternatives to Amazon! by Flow · 2002-12-09 02:41 · Score: 5, Informative

If you don't like the tactics of Amazon, there are alternatives. One of the best is BookSense.com. Not only do they offer an affiliate/partner program, you'll also be supporting independent bookstores (rather than the chains or Amazon):
http://www.booksense.com/affiliate/
Informed View by peterdaly · 2002-12-09 03:58 · Score: 5, Informative

I am an Amazon Associate who has experience with the Alexa Crawler. I believe the crawl is intended to find broken links, or links to products that are no longer stocked.

The Amazon Associates program has been around long enough for "page rot" to kick in, and I am sure there are many sites out there with links to nonexistent products, such as old editions of books, etc. Historically, associates had to build static links (for the most part) by hand, and embedded them in more or less static pages.

The problem comes in due to the recent introduction of their web services, where sites can build essentially unlimited pages based on dynamic real-time queries to Amazon. I don't believe their intent is to "thrash around" in these sites, which is what is occurring.

A few months ago, I asked to have the Alexa bot crawl my site (StarvingMind.net); I was curious about the reports it was able to generate. The bot ended up in endless loops and had to be manually stopped by someone at Alexa. They spent an impressive amount of time trying to identify and fix the problem my site was creating for their bot. I don't know whether my specific problem was ever resolved, but I have the impression the bug was found and fixed. I also have the impression that the bot is very immature code and buggy.

Based on the personal and public responses I have seen from the Amazon and Alexa people involved, they actually do care about these issues very much, and don't wish to cause harm by the bot's use. I believe their goal is to eliminate the link rot that has accumulated on associate sites over the years, many times with the site owner unaware of the problem.

Web services threw a curve into the mix, and that is where the major problems are occurring. The post I am replying to seems to imply Amazon may want to "use then throw out" the associates. I think that is pure speculation without any knowledge of the facts. Amazon has recently gone from what appeared to be no full-time staff to a team of people dedicated to supporting and running the associates program. I believe they consider it a very cost-effective way of advertising, and I expect it is doing quite well for them. Based on their recent actions, I believe they are trying to build a strong long-term relationship with the active ones of us, as we bring them a fair amount of business.

Another post has pointed out they have stopped the crawl while the issues talked about here are looked into. They realize they may have made a mistake, and are trying to figure out how to address the problem. They have been responsive (with me at least) in resolving problems like this in the past, so they deserve a chance to resolve it this time as well. They have started down the right path by stopping the crawl.

-Pete

--
Soccer Goal Plans
Re:Amazon sucks. by Cedric+C.+Girouard · 2002-12-09 05:02 · Score: 3, Informative

My wife used to work for Amazon. She was attacked by a coworker and forced to quit because the management would not do anything about it. She had to visit the doctor for months after the attack that gave her whiplash and nerve damage. In my mind, Amazon is a very bad company and should be punished.

That sounds weird... Isn't the US the "Land of the lawsuit"? I've read about people suing companies for sexual harassment, and winning. Now you get physical damage, assault and whatnot, and she has to quit? Wouldn't one of those late-night 1-800-SUE-ME lawyers take this case? Seems pretty much open and shut to me.

--
Marriage is considered capital punishment for the theft of a goat in some third world countries...
Re:OK from here by Anonymous Coward · 2002-12-09 05:22 · Score: 1, Informative

> Looking at user agents, the browser war is over. IE is #1, and Netscape often isn't even in the top 10; various indexer 'bots generate more traffic than Netscape.

Looking at user agents is incredibly foolish since most browsers' agent strings default to IE and most users don't change that default.
Re:Read before you sign (and before you post) by pjrc · 2002-12-09 06:50 · Score: 2, Informative
> They have a few options that I can see.
> Terminate the agreement.
> Bill for the bandwidth, or sue for damages.
> Various technical measures (which are prohibited by the agreement)
> Point out to your contacts at Amazon that this is pointless and dumb in such a manner they actually listen.

Here's an idea.... How about politely posting a question or two about it in the appropriate forums? Who knows, something crazy might happen, like responsible people at Amazon might respond and turn the bot off while they investigate. Then, they might post a reasonable explanation and take reasonable steps to make sure they're not abusing associates' servers.
Here's another idea.... Try reading the pages that Slashdot linked to. I know that's a lot of work, so I'll save you a bit of effort by posting each Slashdot link, and a brief summary of what you would have found had you bothered to click on it and ACTUALLY READ it (before posting here with a subject advocating actually reading the terms and conditions).
- Amazon Associates and Web Services developers are crying foul over the hammering they're taking - Alan Richmond comments that the bot made 13406 hits in 17 hours on November 26, transferring a total of 200 megs. Many posts precede this, and several follow it. It's all pretty level-headed discussion. Many people seem to feel the bot is not designed that well and ought to be improved, but very little of it amounts to "crying foul". Even Alan says he wants an explanation. Nobody is terminating their agreement, attempting to recoup significant losses, threatening to sue, or advocating blocking (other than discussion of robots.txt). People in the forum are expressing their concerns "in such a manner they actually listen", which happens to be a polite, level-headed manner... which you would know had you actually read the forum, rather than blindly posting here that the associates should read the terms and conditions before they "sign".
- Amazon fessed up - Amazon explains what they're doing, and why, and the steps they've taken to avoid abusing servers. They claim they've designed the bot to avoid accessing any server more than once every two seconds (Alan's example is 13406 hits in 17 hours, or one hit every 4.56 seconds, on average)
- Amazon acknowledged problems exist - They actually say they're investigating, and while they're investigating their bot's impact, they've taken it off-line. They also answer the question that appears frequently in the forum... the purpose of "ia_archiver" vs "amzn_assoc". It's not clear what they'll actually do, but they obviously are trying to respond to people's legitimate concerns
- but points to recent Operating Agreement changes - Yes, while Amazon appears to be taking the matter seriously, they also are making it clear that they expect to be able to verify the accuracy of links from associates. They explain the purpose in the agreement (and it's really not that unreasonable, is it?)
This just isn't that sensational of a story. Yet another 'bot that needs some refinement, but it IS designed to avoid more than one hit every 2 seconds (and the evidence posted seems to be consistent with that). They at least did respond to people's concerns and they took the bot off-line while they investigated it. Sounds pretty reasonable. It's not clear what might actually be done, and some of it appears that Amazon is claiming the problem isn't so great... but clearly they are attempting to respond to people's concerns.
Amazon feels they have a right to check the links on associate sites, and they put it in the terms. Again, it's really not that unreasonable.
What is unreasonable is the inflammatory summary appearing on the main Slashdot page. Yes, timothy and other Slashdot "editors" can claim it's all just editorial from "theodp" who submitted the summary. But what kind of editing is that?
The summary concludes with:
... Amazon and any of its corporate affiliates the right to do so, but also to use unstated technical means to overcome any methods that are used to try to block or interfere with such crawling or monitoring. Interesting stance from the folks who called on the Senate to prosecute those who degrade the technical quality of service at web sites.
The link is to Amazon's position on DDOS attacks... there's really no similarity to a well-intentioned 'bot, which clearly identifies itself, limits itself to 0.5 Hz access rate, AND was responsibly taken off-line and reexamined when some people complained that it used too much bandwidth.
--
PJRC: Electronic Projects, 8051 Microcontroller Tools