Amazon Bots Cause Grief For Associate Web Sites
theodp writes "Amazon Associates and Web Services developers are crying foul over the hammering they're taking from ill-behaved bots that Amazon had subsidiary Alexa Internet dispatch to evaluate the 'quality and reliability' of their sites. Amazon fessed up and acknowledged problems exist, but points to recent Operating Agreement changes that not only give Amazon and any of its corporate affiliates the right to do so, but also to use unstated technical means to overcome any methods that are used to try to block or interfere with such crawling or monitoring. Interesting stance from the folks who called on the Senate to prosecute those who degrade the technical quality of service at web sites."
I am not able to view any of the mentioned links. Keeps on redirecting between login and some other page.
Funny to see that someone complaining about abuse links to pages that do not work with Webwasher filtering.
.. something about not accepting any cookies? cookie filtering is just great ;)
Given that many people still boycott Amazon for their stance on software patents, I guess that they won't be shedding many tears.
One could argue something about watching out for who your bed-partners are! Bear in mind that a company that has such a disregard for even their affiliates has to have a pretty poor respect for anyone else out there! Caveat emptor!
A little planning goes a long way...
The Associates Operating Agreement states:
:) Does anyone believe they'd actually do that? Most likely they'll just leave you alone.
Therefore, you agree that we and our corporate affiliates may take such actions and that you will not seek to block or otherwise interfere with such crawling or monitoring (and that we and our corporate affiliates may use technical means to overcome any methods used on your site to block or interfere with such crawling or monitoring).
As such, it doesn't say that you agree not to block them or that you're violating their license if you do block them. All you agree to is that they can monitor your site, but if you don't like how they do it, it doesn't state that you have to put up with their crawler. The only thing you do agree to is that they can use "technical means to overcome" your blocking. But so what? Let them waste money on attempting to monitor your site by modifying their crawler.
Seems like Alexa sold Amazon a whole lotta nothing when they agreed to verify the links on AWS sites.
According to one of the posts here:
Again, I don't get how my links can be broken since Amazon is delivering the content.
He painted a unicorn in outer space. I'm askin' ya, what's it breathin'?
That's why they should cap it at one page request per minute from any one site, but check many thousands of sites during that one minute.
Robots have been around since the web started, and it surprises me that the designers of this robot haven't looked at previous designs and good practice.
If any of you Alexa numbskulls happen to be reading this, perhaps you could buy yourself a copy of HTTP: The Definitive Guide from O'Reilly, which has a tremendously clear explanation of what to think about to prevent your robots from destroying every site they visit that isn't sitting on a T3 and a Sun Fire w/ 64 CPUs and 64 GB RAM.
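For what it's worth, here is a rough sketch of the "one page per minute per site, thousands of sites per minute" idea. It only plans a schedule and fetches nothing. The function name `polite_schedule` and the `min_delay` parameter are made up for illustration; nothing here is Alexa's actual design.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def polite_schedule(urls, min_delay=60.0):
    """Plan (start_time_seconds, url) pairs so that no single host is
    requested more than once per min_delay seconds.

    Hypothetical helper: it only builds the plan, it does not fetch.
    """
    # Group the URLs into one queue per host, preserving input order.
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)

    schedule = []
    round_no = 0
    while queues:
        # Each round takes at most one URL from every host still waiting,
        # so many hosts are visited per round but each host only once.
        for host in list(queues):
            schedule.append((round_no * min_delay, queues[host].popleft()))
            if not queues[host]:
                del queues[host]
        round_no += 1
    return schedule


if __name__ == "__main__":
    demo = ["http://aaa.com/1", "http://aaa.com/2", "http://bbb.com/1"]
    for when, url in polite_schedule(demo):
        print(when, url)
```

The point of the design is that the crawler keeps a small amount of per-host state (the queues), which is exactly what lets it stay busy overall without leaning on any one site.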
Economic Left/Right: -0.62
Social Libertarian/Authoritarian: -3.69
If you agree not to block or interfere with crawling or monitoring, you're not telling them they can do whatever they want. You agree that they can crawl and/or monitor your site, not that they can do it in any way *they* want to.
It's OK if they crawl/monitor my site using a bunch of people surfing my site all day long. I won't attempt to block that. Anything else, I might.
Instead of crawling websites, why don't Amazon and other companies just require you to have a formatted index of all the links you provide on your website? It could be amazon.xml in the root. And this file could be dynamic or hand-typed...
http://www.yourwebsite.com/amazon.xml
http://www.somewebsite.com/~yoursite/amazon.xml
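To make the idea concrete, here is a small sketch of what writing and reading such an index might look like. No such format exists as far as this thread says: the `associate-links` and `link` element names, the `href` attribute, and both function names are invented for illustration.

```python
import xml.etree.ElementTree as ET


def build_index(links):
    """Serialize a list of outbound link URLs as a hypothetical amazon.xml."""
    root = ET.Element("associate-links")  # element name is an assumption
    for href in links:
        ET.SubElement(root, "link", href=href)
    return ET.tostring(root, encoding="unicode")


def read_index(xml_text):
    """Return the link URLs listed in a hypothetical amazon.xml document."""
    root = ET.fromstring(xml_text)
    return [el.get("href") for el in root.iter("link")]


if __name__ == "__main__":
    # Placeholder URLs, not real associate links.
    links = ["http://www.example.com/item/1?tag=a&ref=b",
             "http://www.example.com/item/2"]
    print(read_index(build_index(links)))
```

A crawler could then fetch one small file per site instead of walking every page.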
There is no guarantee that the "formatted index of all links" is accurate, or up-to-date. Amazon wants to make sure that every single amazon affiliate link meets their criteria.
Your solution would work only for the intelligent and diligent and lucky. There are many Amazon affiliates who are neither.
Amazon is crawling these sites so that they can be featured on their website. When you search for an item, Amazon lists the prices and availability from the associates--everyone wins.
It seems that Amazon is searching a bit too often--combined with some affiliated sites that have very s-l-o-w dynamic pages, that is causing some problems. It's hardly a crime that Amazon is committing--after all, they want the most accurate, up-to-the-minute information on their website.
Best Buy can have you arrested
Powells Books offers a better associate program for web sites. Why even deal with Amazon's crap?
Alexa's web crawler is great from one perspective and terrible from another.
On the great side their crawler can easily use an entire T3 with just a stock PC driving the requests.
On the terrible side, the crawler is stateless - it has NO IDEA OF WHAT IT'S RECENTLY DONE. It doesn't know when it has hit a particular site 1M times in the last hour.
So when they say "it only crawled each site on average every 4 seconds" that is on average. You know, take total urls divided by total time. Doesn't say anything about how hard they hit aaa.com
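A quick bit of arithmetic shows how an "average every 4 seconds" figure can hide a hammered site. The numbers are invented, and reading the average as total crawl time times number of sites divided by total requests is my assumption about how such a figure might be produced.

```python
def seconds_between_hits(hits_per_host, duration_s):
    """Compare the all-sites average gap between requests with the gap
    seen by the busiest host.

    hits_per_host maps host name to request count; duration_s is the
    length of the crawl window in seconds. Illustrative helper only.
    """
    total = sum(hits_per_host.values())
    # Average gap per site if the requests were spread evenly.
    average = duration_s * len(hits_per_host) / total
    # Actual gap experienced by the most-requested host.
    worst_host = max(hits_per_host, key=hits_per_host.get)
    worst = duration_s / hits_per_host[worst_host]
    return average, worst_host, worst


if __name__ == "__main__":
    # Made-up counts for a 20-minute window.
    hits = {"aaa.com": 880, "bbb.com": 10, "ccc.com": 10}
    print(seconds_between_hits(hits, 1200))
```

With these made-up counts the average works out to one request per site every 4 seconds, while aaa.com is actually seeing a request well under every 2 seconds.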
The problem is that the crawler is designed in the extreme to be efficient. Keeping site stats and blocking GETs is inefficient.
You generate a list of URLs for it to crawl. It blindly crawls this list in order. To prevent aaa.com from getting hit with the first 100k requests (assuming aaa.com has 100k urls in the list) you randomize the list before crawling.
Problem is the randomization isn't perfect, and also any site with a high % of urls in the list is still going to get hammered.
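Here is a small simulation of that last point, assuming a plain uniform shuffle (I don't know what randomization the real crawler uses). The function name and the URL mix are hypothetical.

```python
import random
from collections import Counter
from urllib.parse import urlsplit


def busiest_host_share(urls, window, seed=0):
    """Shuffle the URL list, then report which host dominates the first
    `window` requests and what fraction of them it receives."""
    shuffled = list(urls)
    random.Random(seed).shuffle(shuffled)  # seeded so runs are repeatable
    hosts = Counter(urlsplit(u).hostname for u in shuffled[:window])
    host, hits = hosts.most_common(1)[0]
    return host, hits / window


if __name__ == "__main__":
    # aaa.com owns 900 of 1000 URLs; 100 other hosts own one each.
    urls = [f"http://aaa.com/{i}" for i in range(900)]
    urls += [f"http://site{i}.example/" for i in range(100)]
    print(busiest_host_share(urls, window=100))
```

Shuffling spreads aaa.com's requests over the whole crawl instead of bunching them at the start, but aaa.com still receives roughly its share of every window, so a site with most of the URLs gets most of the traffic at all times.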
Now I don't know if this is the crawler Alexa used on the associates. But I wouldn't be too surprised.
This is slightly offtopic, but if you are in the NY area, I highly recommend you see the play "21 Dog Years: Doing Time@Amazon.com" about a guy who went from customer service to bizdev to resignation. It's based on this book; and yes it is very funny that Amazon carries it. They profit from their own critics.
"Absent from our suggested federal response is a role for the Federal Communications Commission. The reason is straightforward: the distributed denial of service attacks involve coordinated and criminal transmission of content over the Internet. It is hard to see how the FCC has statutory authority over such matters. Yet even if it had, or were given, such authority, the agency currently lacks the resources and expertise to do what is necessary at this point, namely, to fight the criminal activity. Simply put, useful FCC involvement would require statutory changes, additional resources, and additional expertise to succeed. This is work better left to law enforcement agencies."
Okay, note the line "...distributed denial of service attacks involve coordinated and criminal transmission of content over the Internet"
Criminal transmission of content? WTFF??
Note also how it goes on to say the FCC shouldn't get involved since "FCC involvement would require statutory changes..." In other words, let's not waste time with all this analysis and law-making business and just get straight to the enforcement of what we want.
... why don't they just collect the 404s off the requests to their site? No need for spiders; if someone puts up a bad link, they can find out as soon as someone clicks on it. *sheesh*
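Something along these lines, say. This sketch assumes the server writes Apache-style combined access logs with a referer field; the sample lines, paths, and the `broken_links` name are made up, and I have no idea what Amazon's logs actually look like.

```python
import re
from collections import Counter

# Matches the request, status, size, and referer fields of a
# combined-log-format line (an assumed log format).
LOG_LINE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)


def broken_links(log_lines):
    """Count 404 responses per (referring page, requested path)."""
    counts = Counter()
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m and m.group("status") == "404":
            counts[(m.group("referer"), m.group("path"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [10/Oct/2003:13:55:36 -0700] "GET /item/123 HTTP/1.0" 404 209 "http://aaa.com/books.html" "Mozilla/4.0"',
        '192.0.2.2 - - [10/Oct/2003:13:55:40 -0700] "GET /item/456 HTTP/1.0" 200 5120 "http://aaa.com/books.html" "Mozilla/4.0"',
    ]
    for (referer, path), n in broken_links(sample).items():
        print(n, referer, path)
```

The catch is that a bad link only shows up after a real visitor has clicked it, and browsers that strip the referer would show up with an empty referring page.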