Amazon Bots Cause Grief For Associate Web Sites
theodp writes "Amazon Associates and Web Services developers are crying foul over the hammering they're taking from ill-behaved bots that Amazon had subsidiary Alexa Internet dispatch to evaluate the 'quality and reliability' of their sites. Amazon fessed up and acknowledged problems exist, but points to recent Operating Agreement changes that not only give Amazon and any of its corporate affiliates the right to do so, but also to use unstated technical means to overcome any methods that are used to try to block or interfere with such crawling or monitoring. Interesting stance from the folks who called on the Senate to prosecute those who degrade the technical quality of service at web sites."
I am not able to view any of the mentioned links. Keeps on redirecting between login and some other page.
Funny to see that someone complaining about abuse links to pages that do not work with Webwasher filtering.
.. something about not accepting any cookies? cookie filtering is just great ;)
Given that many people still boycott Amazon for their stance on software patents, I guess that they won't be shedding many tears.
One could argue something about watching out for who your bed-partners are! Bear in mind that a company that has such a disregard for even their affiliates has to have a pretty poor respect for anyone else out there! Caveat emptor!
A little planning goes a long way...
My God,
The Nigerian 419 scam: anyone wishing for a bit of humor should check out the recent UserFriendly strips on the subject. http://www.userfriendly.org
crashandburn99
We've noticed quite a few requests for robots.txt by the Alexa archiver. So one suggestion is to put the following in a file called robots.txt in the root directory of your domain's web site:
User-agent: ia_archiver
Disallow: /
And if it's really annoying, bloody hell, just do an active firewall block and put the sharks (lawyers) away with those goofy lawsuits before they start wasting our senators' time and taxpayer cash.
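For anyone who wants to check what a robots.txt rule actually does before uploading it, here is a minimal sketch using Python's standard urllib.robotparser. The example.com URLs and the is_allowed helper are made up for illustration. Keep in mind that robots.txt is only advisory: it tells a well-behaved crawler what to skip and does nothing against a bot that ignores it.

```python
from urllib.robotparser import RobotFileParser

# The rule suggested above: keep ia_archiver out of the whole site.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not a site from this thread.
    for agent in ("ia_archiver", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "http://www.example.com/cgi-bin/search")
        print(agent, "allowed" if allowed else "blocked")
```

Note that an empty "Disallow:" line means "nothing is disallowed", so the slash matters.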
The problem is that the bots are way too diligent. They go to every single link on every single page, even if the page is dynamically generated. Many sites have an effectively infinite number of URL combinations, and as a result, the bots sit on them trying to download every single variation of a query. That means that Joe Amazon Associate's web site is hammered with requests and his bandwidth fees go through the roof.
The simple solution would be to just stop Amazon from sucking up the bandwidth via a robots.txt file. But Amazon says that is not allowed. There's the dilemma.
Amazon.com has been silent on this issue for the last several days. My bet is that the bots won't come back without some heavy-duty tuning.
--Your Sex
Sex - Find It
The Associates Operating Agreement states:
:) Does anyone believe they'd actually do that? Most likely they'll just leave you alone.
Therefore, you agree that we and our corporate affiliates may take such actions and that you will not seek to block or otherwise interfere with such crawling or monitoring (and that we and our corporate affiliates may use technical means to overcome any methods used on your site to block or interfere with such crawling or monitoring).
As such, it doesn't say that you agree not to block them or that you're violating their license if you do block them. All you agree to is that they can monitor your site, but if you don't like how they do it, it doesn't state that you have to put up with their crawler. The only thing you do agree to is that they can use "technical means to overcome" your blocking. But so what? Let them waste money on attempting to monitor your site by modifying their crawler.
To make this palatable, lower the request rate to something like 1 per minute.
Most robots do something like that.
Of course - it takes a lot longer....
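A rough sketch of what "1 request per minute" could look like on the crawler side, assuming the limit is applied per host. The PoliteScheduler class and its names are hypothetical, not anything Alexa or Amazon is known to use. It only computes how long the crawler should wait before each fetch; the caller passes in the current time and does the sleeping.

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_interval_seconds: float = 60.0):
        self.min_interval = min_interval_seconds
        self._next_allowed = {}  # host -> earliest allowed start time

    def wait_time(self, url: str, now: float) -> float:
        """Reserve a slot for url and return how many seconds to wait first."""
        host = urlsplit(url).hostname or ""
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_interval
        return start - now


if __name__ == "__main__":
    scheduler = PoliteScheduler(min_interval_seconds=60.0)
    for url in ("http://aaa.com/a", "http://aaa.com/b", "http://bbb.com/"):
        print(url, scheduler.wait_time(url, now=0.0))
```

The trade-off is the one already noted: at one request per minute, a site with 100k URLs takes roughly 70 days to crawl.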
It's Christmas everyday with BitTorrent.
we ignore /robots.txt and we'll circumvent every action you take to not let us crawl your cgi-bins
kneel down and I'll spank you, associates...?
Nothing comes for free. Presumably, they are Amazon Affiliates to get a cut of each book sold. You don't get anything for free. Perhaps an opportune time to do the Barnes and Noble thing?
It looks like these people are signing agreements they didn't read or understand.
They have a few options that I can see.
Terminate the agreement.
Bill for the bandwidth, or sue for damages.
Various technical measures (which are prohibited by the agreement)
Point out to your contacts at Amazon that this is pointless and dumb in such a manner they actually listen.
Make a mini site for the Amazon site/bot, but put the rest of your website in a second location (that they don't have access to).
Why deal with a company like this anyway? They're obviously inconsiderate pricks (at least). Move on with your life.
Seems to be going the Microsoft way. They seem to be exploiting their monopoly in their sphere of business. Their recent ploy to patent their click n buy commerce system attracted lots of attention from the people and the OS community. Many open letters were exchanged. But people seem to have already forgotten; the average human, understandably, is worried only about factors that affect him, and that too, immediately. Now this new issue....
http://forums.prosperotechnologies.com/n/mb/message.asp?webtag=am-associhelp&msg=2579.1&maxT=3
OK, that is a post from the associates board, in which Amazon states:
"Hello Associates.
Thank you for providing such valuable feedback. The Alexa crawl (id amzn_assoc) has ceased while we investigate the statements made in this post. We plan to address the following concerns:
1. The impact the crawler may have on bandwidth
2. The number of pages the crawler hits per second
3. How the Alexa crawler might identify and ignore AWS pages or links
Points of clarification:
1. Regarding Archive.org, Alexa has confirmed that material that is crawled by the 'amzn_assoc' crawler is not donated to the internet archive. It is used exclusively for the purposes of the Broken Link Reports.
2. The Alexa crawler 'amzn_assoc' differs from the 'ia_archiver' crawler. The 'ia_archiver' can be excluded by using a robots.txt file and will not violate the Amazon.com Associates operating agreement.
You should expect a response from us by COB Friday as it may take a few days to research your concerns. This issue is important to us and we will get it resolved. Thank you for your patience.
The Amazon.com Associates Program"
I participated in that conversation myself, though, and I don't think I saw one person who was happy about the agreement being changed so that we had to let them crawl our sites as often as they like.
cj.com reports error links, but they do it from the server end. Amazon's system is just stupid, and it was only done to try and give their Alexa company some work to do.
So I guess it's just wait and see now until we know if the bot starts back up again.
I haven't read 1984 in a long time, but I don't remember big brother coming from the amazon.
sig.
As has been noted elsewhere, the affiliate bot ignores robots.txt. Disallowing ia_archiver will have the effect of removing the site from the wayback machine (http://www.archive.org/), which may not be what you want to do.
I participated in that conversation myself, though, and I don't think I saw one person who was happy about the agreement being changed so that we had to let them crawl our sites as often as they like.
:)
"Sleep with the Devil, get hammered hard in the ass".
Ok, so I just made that one up. But it could have been an old saying. It's not as if it is any worse than some clichés people spew out all the time, after all.
Maybe he's one of the boys from brazil.
Seems every other link on the 'net is a link to some book on Amazon. All too often I'll follow an innocent looking link and find myself at Amazon yet another time.
Reminds me of that old horror movie where they try to drive away from a haunted house, but every road they take leads them back up the driveway to the place.
If you agree not to block or interfere with crawling or monitoring, you're not telling them they can do whatever they want. You agree they can crawl and/or monitor your site, but not that they can do it in any way *they* want to.
It's OK if they crawl/monitor my site using a bunch of people surfing my site all day long. I won't attempt to block that. Anything else, I might.
"... called on the Senate to prosecute those who degrade the technical quality of service at web sites." Would that include the Slashdot effect?
Do as I say, not as I do. I am not surprised by this attitude.
Simple 'nuff...
Just temporarily (perhaps 1 day) block ANY client's class C (not just that of Alexa's crawler) that starts generating more than X hits per second for longer than five minutes.
By doing so, you haven't taken steps to specifically thwart *Amazon's* activity, you have simply enacted a reasonable security measure to block DOS attacks. If Amazon actually dared to sue for blocking them, you'd have a HELL of a countersuit on the grounds that their 'bot triggered your DOS alarm.
Personally, I'd just block their bot and if they complain, tell them where they can stick their partner agreement. No self respecting online retailer needs their own "partners" degrading their QOS. Anyway, When I want to buy something, I use either Google, or a product-specific price-search engine (like PriceWatch). Amazon counts as my LAST choice for finding something (actually not quite true... If I need to use Google to find a product for sale, I often check Amazon first, just to get things like UPC or ISBN numbers to narrow my search).
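The temporary class C block described above could be sketched roughly like this. Everything here is an assumption for illustration: the SubnetRateBlocker name, the default threshold of 5 hits per second, IPv4-only addresses, and the simplification that "more than X hits per second for longer than five minutes" is treated as "average rate over the last five minutes exceeds X". A real setup would more likely do this in the firewall or web server than in application code.

```python
import ipaddress
from collections import defaultdict, deque


class SubnetRateBlocker:
    """Blocks a whole /24 for a while once its request rate gets too high."""

    def __init__(self, max_hits_per_second=5.0, window_seconds=300, block_seconds=86400):
        self.window = window_seconds
        self.limit = max_hits_per_second * window_seconds  # max hits per window
        self.block_seconds = block_seconds
        self._hits = defaultdict(deque)   # subnet -> timestamps of recent hits
        self._blocked_until = {}          # subnet -> time the block expires

    @staticmethod
    def subnet_of(ip: str) -> str:
        """Map an IPv4 address to its /24 ("class C") network."""
        return str(ipaddress.ip_network(f"{ip}/24", strict=False))

    def allow(self, ip: str, now: float) -> bool:
        """Record a hit from ip at time now; return False if it should be refused."""
        net = self.subnet_of(ip)
        if self._blocked_until.get(net, 0.0) > now:
            return False
        hits = self._hits[net]
        hits.append(now)
        # Drop hits that have fallen out of the sliding window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) > self.limit:
            self._blocked_until[net] = now + self.block_seconds
            hits.clear()
            return False
        return True
```

Because the rule keys on traffic volume rather than on a user agent or a named company, it treats every client the same way, which is the whole point of the suggestion.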
Instead of crawling websites, why don't Amazon and other companies just require you to have a formatted index of all the links you provide on your website? It could be amazon.xml in the root. And this file could be dynamic or hand-typed...
http://www.yourwebsite.com/amazon.xml http://www.somewebsite.com/~yoursite/amazon.xml
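To make the amazon.xml idea concrete, here is one way such an index could be generated. The element names (links, link, page, target) are purely hypothetical, since no such format has been defined by Amazon, and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET


def build_link_index(links):
    """Build an XML index from (page_url, target_url) pairs.

    The schema is invented for illustration: one <link> element per
    outgoing link, recording the page it appears on and where it points.
    """
    root = ET.Element("links")
    for page_url, target_url in links:
        link = ET.SubElement(root, "link")
        ET.SubElement(link, "page").text = page_url
        ET.SubElement(link, "target").text = target_url
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_link_index([
        ("http://www.example.com/books.html", "http://shop.example.com/product/123"),
    ]))
```

A checker would then need one request for the index plus one per listed target, instead of walking every page on the site.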
They'd only read User Friendly if they were seeking a really, really tiny bit of humor. Like perhaps some kind of trace element of humor.
A while back (when I was still using a CobaltRaQ2 - adequate for the job, but not particularly speedy with cgi scripts) I got DoSSed by ia_archiver (yes, cgi-bin is in robots.txt, no I'm not associated with Amazon, but someone else who links to the cgi-script in question probably was). I thought ia_archiver was another Teleport Pro, and just modified the actual script to display a rejection page if it saw ia_archiver in the HTTP_USER_AGENT.
Finally, I know what it is...
It was trying to crawl *every* available url for the CGI script - and it appeared to be buggy because it got itself into an endless loop changing from one mode to the other.
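The user agent check described above might look something like this in a Python CGI script. This is a sketch, not the original script: the function names and the wording of the rejection page are made up, and only the HTTP_USER_AGENT variable and the ia_archiver string come from the post.

```python
import os


def should_reject(environ, blocked_agents=("ia_archiver",)):
    """Return True if the request's User-Agent contains a blocked name."""
    user_agent = environ.get("HTTP_USER_AGENT", "").lower()
    return any(agent.lower() in user_agent for agent in blocked_agents)


def rejection_response():
    """A minimal CGI response refusing the request."""
    return (
        "Status: 403 Forbidden\r\n"
        "Content-Type: text/plain\r\n"
        "\r\n"
        "Automated crawling of this script is not permitted.\n"
    )


if __name__ == "__main__":
    # A CGI script receives request headers through environment variables.
    if should_reject(os.environ):
        print(rejection_response(), end="")
```

This only helps against a crawler that identifies itself honestly; one that changes its user agent would pass straight through.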
Oolite: Elite-like game. For Mac, Linux and Windows
I think that you've just violated your agreement by sharing your revenue information, and your choice of punishment is either an amzn_crawler DoS or a monetary penalty of 20 times your gross revenue generated.
Is it time to add Amazon to the /etc/hosts.deny file?
If you're a member company, employing Amazon's services, then in my opinion you should be responsible for providing Amazon with the links you want Amazon to vend, not that Amazon should crawl through your site for your pricing information...
There is no guarantee that the "formatted index of all links" is accurate, or up-to-date. Amazon wants to make sure that every single amazon affiliate link meets their criteria.
Your solution would work only for the intelligent and diligent and lucky. There are many Amazon affiliates who are neither.
Goofballs.
... take a look at this comment: http://yro.slashdot.org/comments.pl?sid=47296&cid=4843032
Utilizing magnetic schemata since
http://www.booksense.com/affiliate/
Amazon is crawling these sites so that they can be featured on their website. When you search for an item, Amazon lists the prices and availability from the associates--everyone wins.
It seems that Amazon is searching a bit too often--combined with some affiliated sites that have very s-l-o-w dynamic pages, this is causing some problems. It's hardly a crime that Amazon is committing--after all they want the most accurate, up-to-the-minute information on their website.
Best Buy can have you arrested
The timing of this problem is interesting. A few years back, we had the problem of the one-click patent and the fact that Amazon used it to disrupt the Christmas sales of Barnes and Noble. It seems that the one-click thing became a less pressing problem on December 26. Although I can't remember the specifics of other events, it sticks in my mind that other ploys used to disrupt competitors' businesses have been timed to screw with the Christmas season.
I know that the people being DOS'ed by Amazon are defined as 'affiliates', but maybe Amazon perceives 'affiliates' in the same way Microsoft perceives 'partners': people to use and then buy or destroy. How much you wanna bet that this problem goes away after Christmas? Of course, the claim will be that it was brought to their attention and it was fixed, but the timing of the whole thing is very suspicious. Perhaps this was the plan all along.
In these days of slim margins in business, maybe Amazon figures the average internet user is smart enough that, if their preferred site is slow, they will go directly to Amazon for their purchase, and Amazon would be able to avoid reimbursing its 'affiliate' for the sale.
Has this problem been going on, but been unnoticed for a while, or did it just start? I'm no conspiracy theorist, but the elements seem to be there for this to have been intentional and the timing is very suspicious. Why couldn't they have done this last month, or the month before if they're just checking for outdated links? Am I out in left field with this idea?
Anyway... just a different perspective and some food for thought.
War is Peace. Freedom is Slavery. Ignorance is Strength. - George Orwell or George Bush?
Amazon pays so much in affiliate fees that they can have all the bandwidth they like from us ... I've seen much worse crawlers, from German search engines to broken proxies doing 10 hits/second on dynamic pages to stupid Windows users who wanted to make our (very dynamic) website available for offline browsing. If you can't take a few 1000 hits/day because your CGIs are so slow, then what is your site doing on the web anyway? ;-)
"I love my job, but I hate talking to people like you" (Freddie Mercury)
Powells Books offers a better associate program for web sites. Why even deal with Amazon's crap?
I am an Amazon Associate who has experience with the Alexa Crawler. I believe the crawl is intended to find broken links, or links to products that are no longer stocked.
The Amazon Associates program has been around long enough for "page rot" to kick in, and I am sure there are many sites out there with links to non-existent products, such as old editions of books, etc. Historically, associates had to build static links (for the most part) by hand, and embed them in more or less static pages.
The problem comes in due to the recent introduction of their web services, where sites can build essentially unlimited pages based on dynamic real-time queries to Amazon. I don't believe their intent is to "thrash around" in these sites, which is what is occurring.
A few months ago, I asked to have the Alexa bot crawl my site (StarvingMind.net), as I was curious about the reports it was able to generate. The bot ended up in endless loops and had to be manually stopped by someone at Alexa. They spent an impressive amount of time trying to identify and fix the problem my site was creating for their bot. I don't know whether my specific problem was ever resolved, but I have the impression the bug was found and fixed. I also have the impression that the bot is very immature code and buggy.
Based on the personal and public responses I have seen from the Amazon and Alexa people involved, they actually do care about these issues very much, and don't wish to cause harm by the bot's use. I believe their goal is to eliminate the link rot that has accumulated on associate sites over the years, many times with the site owner unaware of the problem.
Web services threw a curve into the mix, and that is where the major problems are occurring. The post I am replying to seems to imply Amazon may want to "use then throw out" the associates. I think that is pure speculation without any knowledge of the facts. Amazon has recently gone from what appeared to be no fulltime staff to a team of people dedicated to supporting and running the associates program. I believe they consider it a very cost effective way of advertising, and I expect it is doing quite well for them. Based on their recent actions, I believe they are trying to build a strong long term relationship with the active ones of us, as we bring them a fair amount of business.
Another post has pointed out they have stopped the crawl while the issues talked about here are looked into. They realize they may have made a mistake, and are trying to figure out how to address the problem. They have been responsive (with me at least) resolving problems like this in the past, they deserve a chance to resolve it this time as well. They have started down the right path, by stopping the crawl.
-Pete
Soccer Goal Plans
...to what any sensible software engineering team would have built as a re-active solution?
Problem:
Some of our affiliates have out of date links.
Dumb Solution:
Create stupid high bandwidth consuming spider that endlessly crawls affiliate sites looking for out of date links;
or
Sensible Solution:
When an out of date link comes along to the website, display an apology screen to the visitor (whilst not letting up on any other sales opportunity) and email the affiliate telling them to get their site up to date.
Some people just don't fink.
Alexa's web crawler is great from one perspective and terrible from another.
On the great side their crawler can easily use an entire T3 with just a stock PC driving the requests.
On the terrible side, the crawler is stateless - it has NO IDEA OF WHAT IT'S RECENTLY DONE. It doesn't know when it has hit a particular site 1M times in the last hour.
So when they say "it only crawled each site on average every 4 seconds" that is on average. You know, take total urls divided by total time. Doesn't say anything about how hard they hit aaa.com
The problem is that the crawler is designed in the extreme to be efficient. Keeping site stats and blocking GETs is inefficient.
You generate a list of URLs for it to crawl. It blindly crawls this list in order. To prevent aaa.com from getting hit with the first 100k requests (assuming aaa.com has 100k urls in the list) you randomize the list before crawling.
Problem is the randomization isn't perfect, and also any site with a high % of urls in the list is still going to get hammered.
Now I don't know if this is the crawler Alexa used on the associates. But I wouldn't be too surprised.
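To illustrate the point about URL lists, here is a small sketch that counts how many requests each host will receive and reorders the list round-robin by host instead of relying on a random shuffle. This is not Alexa's code; the function names are invented. As the post says, a site that owns most of the list still gets most of the requests, so reordering only spreads them out and does not reduce them.

```python
from collections import Counter, defaultdict, deque
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Extract the host name from a URL (empty string if there is none)."""
    return urlsplit(url).hostname or ""


def requests_per_host(urls):
    """Count how many URLs in the crawl list belong to each host."""
    return Counter(host_of(u) for u in urls)


def interleave_by_host(urls):
    """Reorder urls so consecutive requests rotate across hosts where possible."""
    queues = defaultdict(deque)
    for u in urls:
        queues[host_of(u)].append(u)
    pending = deque(queues.values())
    ordered = []
    while pending:
        q = pending.popleft()
        ordered.append(q.popleft())
        if q:                      # host still has URLs left: back of the line
            pending.append(q)
    return ordered


if __name__ == "__main__":
    crawl_list = ["http://aaa.com/1", "http://aaa.com/2", "http://aaa.com/3", "http://bbb.com/1"]
    print(requests_per_host(crawl_list))
    print(interleave_by_host(crawl_list))
```

The per-host counts are what an "on average every 4 seconds" figure hides: the average over all sites says nothing about the host holding most of the list.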
Looking at user agents, the browser war is over. IE is #1, and Netscape often isn't even in the top 10; various indexer 'bots generate more traffic than Netscape.
Amazon's hiring practices are questionable at best. I interviewed with them four years ago for a coding job. The interview consisted of a woman in a room asking me if I had any questions. I had a bunch, which she attempted to answer ... but after 5 or 6 minutes, I realized this was all there was to the interview, and that the company had either sent a horrible interviewer, or that this wasn't the place for me.
The one question she did ask was whether I was familiar with Windows. I wasn't sure in what context she was speaking, so I told her I'd done some programming in it, but I was more familiar with programming in UNIX. A frown crossed her face, and she said "I guess you wouldn't know why I can't get my email then". The woman had a dial-up laptop card that wasn't plugged in to the wall.
She promised to follow up with me. Never did. I wasn't upset. Perhaps it was one bad interviewer, but if somebody gave her a job, I'd hate to know what else they have working for them.
From the Amazon User Agreement:
"You are granted a limited, revocable, and nonexclusive right to create a hyperlink to the home page of Amazon.com so long as the link does not portray Amazon.com, its affiliates, or their products or services in a false, misleading, derogatory, or otherwise offensive matter."
So if I write Amazon sucks, I'm no longer allowed to visit or buy stuff from Amazon. Oops, darn.
This is slightly offtopic, but if you are in the NY area, I highly recommend you see the play "21 Dog Years: Doing Time@Amazon.com" about a guy who went from customer service to bizdev to resignation. It's based on this book; and yes it is very funny that Amazon carries it. They profit from their own critics.
That sounds weird... Isn't the US the "Land of the lawsuit"? I've read about people suing companies for sexual harassment, and winning. Now you get physical damage, assault and whatnot, and she has to quit? Wouldn't one of those late-nite 1-800-SUE-ME lawyers take this case? Seems pretty much open and shut to me.
Marriage is considered capital punishment for the theft of a goat in some third world countries...
"Absent from our suggested federal response is a role for the Federal Communications Commission. The reason is straightforward: the distributed denial of service attacks involve coordinated and criminal transmission of content over the Internet. It is hard to see how the FCC has statutory authority over such matters. Yet even if it had, or were given, such authority, the agency currently lacks the resources and expertise to do what is necessary at this point, namely, to fight the criminal activity. Simply put, useful FCC involvement would require statutory changes, additional resources, and additional expertise to succeed. This is work better left to law enforcement agencies."
Okay, note the line "...distributed denial of service attacks involve coordinated and criminal transmission of content over the Internet"
Criminal transmission of content? WTFF??
Note also how it goes on to say the FCC shouldn't get involved since "FCC involvement would require statutory changes..." In other words, let's not waste time with all this analysis and law-making business and just get straight to the enforcement of what we want.
all your bots are belong to us...
Terminate the agreement.
Bill for the bandwidth, or sue for damages.
Various technical measures (which are prohibited by the agreement)
Point out to your contacts at Amazon that this is pointless and dumb in such a manner they actually listen.
Here's an idea.... How about politely posting a question or two about it in the appropriate forums? Who knows, something crazy might happen, like responsible people at Amazon might respond and turn the bot off while they investigate. Then, they might post a reasonable explanation and take reasonable steps to make sure they're not abusing associates' servers.
Here's another idea.... Try reading the pages that slashdot linked to. I know that's a lot of work, so I'll save you a bit of effort by posting each slashdot link, and a brief summary of what you would have found had you bothered to click on it and ACTUALLY READ it (before posting here with a subject advocating actually reading the terms and conditions).
This just isn't that sensational of a story. Yet another 'bot that needs some refinement, but it IS designed to avoid more than one hit every 2 seconds (and the evidence posted seems to be consistent with that). They at least did respond to people's concerns and they took the bot off-line while they investigated it. Sounds pretty reasonable. It's not clear what might actually be done, and in some of it Amazon appears to be claiming the problem isn't so great... but clearly they are attempting to respond to people's concerns.
Amazon feels they have a right to check the links on associate sites, and they put it in the terms. Again, it's really not that unreasonable.
What is unreasonable is the inflammatory summary appearing on the main slashdot page. Yes, timothy and other slashdot "editors" can claim it's all just editorial from "theodp" who submitted the summary. But what kind of editing is that?
The summary concludes with:
The link is to Amazon's position on DDOS attacks... there's really no similarity to a well-intentioned 'bot, which clearly identifies itself, limits itself to 0.5 Hz access rate, AND was responsibly taken off-line and reexamined when some people complained that it used too much bandwidth.
PJRC: Electronic Projects, 8051 Microcontroller Tools
Alexa is all over my web logs every day....I don't even link to amazon (or any other commercial site, just some basic open source ones...apache, openbsd, sourceforge, etc)
Soon I might just block them....but I would like to know how I got on their list of sites to crawl to excess.
... why don't they just collect the 404s off the requests to their site? No need for spiders; if someone puts up a bad link, they can find out as soon as someone clicks on it. *sheesh*
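The "collect the 404s" idea can be sketched from the receiving site's access log. This assumes the log is in the common "combined" format (with Referer and User-Agent fields), which is an assumption about the server setup and not something stated in the thread; the function name and regular expression are illustrative only.

```python
import re
from collections import Counter

# Matches the combined log format: host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "[^"]*"'
)


def broken_links_by_referer(log_lines):
    """Count 404 responses, grouped by (referring page, requested path)."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match or match.group("status") != "404":
            continue
        referer = match.group("referer")
        if referer and referer != "-":   # skip hits with no referring page
            counts[(referer, match.group("path"))] += 1
    return counts
```

Each entry names the page that carries the stale link, so the site owner could be notified without anyone crawling their site. The obvious limit is that a bad link is only found once a visitor clicks it.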
Never ascribe to intent what may be accounted for by simply rolling out premature code that has been subjected to very little testing. Amazon has a bias toward making schedules at the expense of testing.
Amazon's software schedules follow the same seasonal cycles as the rest of the company. It is likely that this happened simply because the software was getting rushed out into production just in time for the holiday rush (so the engineers working on it can be assigned to holiday duty somewhere else).
Too bad you were too much of a pussy to do anything about it, yourself.
Too busy jerking off to the latest Michael Jackson photos, eh?
Here Amazon admits the issue and says they have stopped the bot until they can investigate.
Amazon is actually very affiliate friendly. They have banned the scumware like wurldmedia, ebates and others that try and hijack affiliate commissions. Unlike affiliate programs by overstock.com, buy.com and others that are so desperate for short term cash they will screw over their current affiliates for some quick cash.
Considering buy.com is so deep in with the scumware people, i am surprised slashdot.org advertises them.
I did read the links.
Amazon released a bot that negatively affected the affiliate websites.
This is at the very least inconsiderate.
I posted my opinion how this or similar activities COULD be handled.
You seem quite defensive about it, were you the one who wrote a buggy bot?
Funny though they can do no wrong
It's a matter of proof. The only people who saw it happen are on the same side as Amazon because they do not want to lose their jobs or get in trouble. None of the lawyers that we have contacted will take the case. It's really retarded. We tried for 6 months to do something, and now we have given up. a) We do not have the money to fight them. b) No free lawyer will go against Amazon. We are suing the attacker though. I didn't want her to work there anyway.
The above is not worth reading.
Yeah, I'm not really into picking fights with women. Especially 7' 300 pound black women from the desert. She is being sued though.
The above is not worth reading.
I too have steadfastly not used Amazon, and I find the noamazon.com site quite useful.
I didn't know you read my http logs. Not only do they hit my site that often, they also hit my office mate's homepage that often.
That's right 28,000 hits to a single dns entry in one hour, which is about 7.78 hits/second.
I used to work for Amazon.com as a Unix admin, and I can tell you Amazon and Alexa are barely related. They are two different companies; it's just that one owns the other. Barely anything between them on the computer system level is integrated. The main offices for Amazon.com are in Seattle and Alexa's offices are in S.F., Ca.
If someone is making a mistake at Alexa, Amazon.com can not really be held responsible.
Linux O Muerte!