Microsoft Bots Effectively DDoSing Perl CPAN Testers
at_slashdot writes "The Perl CPAN Testers have been suffering issues accessing their sites, databases and mirrors. According to a posting on the CPAN Testers' blog, the CPAN Testers' server has been being aggressively scanned by '20-30 bots every few seconds' in what they call 'a dedicated denial of service attack'; these bots 'completely ignore the rules specified in robots.txt.'"
From the Heise story linked above: "The bots were identified by their IP addresses, including 65.55.207.x, 65.55.107.x and 65.55.106.x, as coming from Microsoft."
Anyone know which pages on Microsoft's front-facing sites are most computationally intensive, and yet always dynamically generated? :D
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
probably a PERL script to handle that!
Run, Microsoft is coming to get you!
Have you heard about SoylentNews?
Bing?
Until I read the summary I thought it was another article about Windows botnets and was wondering why the "Microsoft" was tacked on, since Windows is the default OS assumption. Of course it would be interesting if these were new CPAN mirrors that MS was setting up.
Sooooo, let's all go to the testers' blog and DDOS that too. Dumbass...
Excuse me, but please get off my Pennisetum Clandestinum, eh!
I manage some networks in my home city in Italy, and in the past year I've often seen strange traffic coming from some of their IP addresses. Guess they were exploited by someone a long time ago, and didn't even notice it.
Lazy, feckless, inconsiderate crooks.
Looks like Microsoft's Bing managers are on it. They'll make it worse in no-time flat. :)
BTW, the difference between a DDOS and a Slashdotting? You know why your site went down -- you got linked!
--
# Canmephians for a better Linux Kernel
$Stalag99{"URL"}="http://stalag99.net";
It's not like ASP.NET is the most efficient way to sling web pages to begin with.
This is my sig.
From TFA:
Hi,
I am a Program Manager on the Bing team at Microsoft, thanks for bringing this issue to our attention. I have sent an email to nospam@example.com as we need additional information to be able to track down the problem. If you have not received the email please contact us through the Bing webmaster center at nospam@example.com.
I mean, what additional information is needed wrt "respecting robots.txt" and "not letting loose more than one bot on a site at a time"?
Bing. Meh.
I know everyone likes to assume that Microsoft is being evil here, but wouldn't the more realistic assumption be that they were just being incompetent?
This is my sig.
It's not a bug, it's a feature to index a site with a new, rapid, powerful, direct, personalised crawler :)
http://arstechnica.com/microsoft/news/2010/01/microsoft-outlines-plan-to-improve-bings-slow-indexing.ars
Domestic spying is now "Benign Information Gathering"
I had a registration page - static content basically. The only thing that was dynamic was that it was referred to by many pages on the site with a variable in the querystring. Bing decided that it needed to check on this one page *thousands* of times per day.
They ignored robots.txt.
I sent a note to an address on the Bing site that requested feedback from people having issues with the Bing bots - nothing.
The only thing they finally 'listened' to was placing "" in the header.
This kind of sucked because it took the registration page out of the search engines' index, however it was much better than being DDOS'd. Plus, the page is easy to find on the site so not *that* big a deal.
Bing has been open for months now and if you search around there are tons of stories just like this. Maybe now that a site with some visibility has been 'attacked', the engineers will take a look at wtf is wrong.
I have noticed the microsoft crawlers (msnbot) being fairly inefficient on many of my sites...
In contrast to googlebot and spiders from other search engines msnbot is far more aggressive, ignores robots.txt and will frequently re-request the same files repeatedly, even if those files haven't changed... Looking at my monthly stats (awstats) which groups traffic from bots, msnbot will frequently have consumed 10 times more bandwidth than googlebot, but is responsible for far less incoming traffic based on referrer headers (typically 1-2% of the traffic generated by google on my sites).
Other small search engines don't bring much traffic either, but their bots don't hammer my site as hard as msnbot does.
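For anyone who wants to check the same thing without awstats, here is a minimal sketch that totals bytes served per crawler from Combined Log Format lines. The first sample line is the msnbot request quoted elsewhere in this thread; the other sample lines, the `BOTS` tuple and the function name are made up for illustration, and your log format may differ.

```python
import re
from collections import defaultdict

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) '
    r'(?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical substrings used to bucket crawlers; adjust for your own logs.
BOTS = ("msnbot", "googlebot")


def bytes_per_bot(lines):
    """Sum response sizes per known bot, keyed by the matching substring."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        size = 0 if match.group("size") == "-" else int(match.group("size"))
        for bot in BOTS:
            if bot in agent:
                totals[bot] += size
                break
    return dict(totals)


# Only the first line comes from the thread; the rest are invented samples.
SAMPLE = [
    '65.55.106.138 - - [19/Jan/2010:01:00:46 +0100] "GET /robots.txt HTTP/1.1" 200 30 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '65.55.106.138 - - [19/Jan/2010:01:00:47 +0100] "GET /index.html HTTP/1.1" 200 5000 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.10 - - [19/Jan/2010:01:01:00 +0100] "GET /index.html HTTP/1.1" 200 1200 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.20 - - [19/Jan/2010:01:02:00 +0100] "GET /index.html HTTP/1.1" 304 - "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    print(bytes_per_bot(SAMPLE))
```

Comparing these totals against referrer counts from the same log would give the bandwidth-versus-traffic ratio described above.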
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
Are we sure this traffic comes from Microsoft? Could it not consist of forged network packets? You don't need a reply if you are running a DDOS. On the other hand, why would anyone, including Microsoft, want to bring down CPAN?
Nae king! Nae laird! Nae yurrupiean pressedent! We willna be fooled again!
If they've identified the IP ranges, why not just block them? You can do it at the router or TCP level (drop packets), or just throw up a 403 Forbidden.
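A rough sketch of the "throw up a 403" option at the application level, assuming the three prefixes from the Heise quote are treated as /24 networks. The function name is hypothetical, and dropping packets at the router or firewall would happen before any code like this runs.

```python
import ipaddress

# The three ranges named in the Heise quote, written as /24 networks
# (an assumption about what "65.55.207.x" etc. mean).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("65.55.207.0/24", "65.55.107.0/24", "65.55.106.0/24")
]


def status_for(client_ip: str) -> int:
    """Return 403 for clients inside a blocked range, 200 otherwise."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return 403
    return 200


if __name__ == "__main__":
    for ip in ("65.55.106.138", "192.0.2.1"):
        print(ip, status_for(ip))
```

Serving a 403 still costs a connection and a response per request, which is why blocking lower in the stack is usually cheaper.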
rooooar
They know how.
Block the IP addresses and send Microsoft email?
What am I missing here?
Yes, Evil more so
I suppose Microsoft can offer a simple explanation: "Our servers and other internal infrastructure are so vulnerable that they have been hacked and being used as remote-controlled botnets."
The largest prime factor of my UID is 263267.
So, by your definition of evil, if you fail a math exam, you're being evil?
If you trip down the stairs, and crash into somebody, you're evil?
Do not attribute to malice, what can very well be attributed to incompetence, or just bad luck.
Else, your mistaking this quote, is also evil then, according to your own definition of evil. ;)
However, that is logically impossible, since it falsifies the very premise, thus I must conclude you are false, and also probably with good intentions,
if not just to get some modpoints, but I wouldn't call that evil
Can anyone here clarify what robots.txt stands for, as in:
Is it an 'agreement' to not scan the site at all (by a search engine bot), or is it meant to just not -display- those results in the search engine?
I'd assume, since everything on a site is more or less public, that it would be the second. And if so, I can't see anything wrong with what Microsoft's bots did.
I can see how scanning a site's content (even if you're not going to list the results in your search engine) can have some value to a company.
When you shoot a mime, do you use a silencer?
For additional examples, see Government, US.
I'm a right winger and I like to see smaller, less intrusive government, but, I think it is wrong to say that the US government is competent.
The US Gov't has successfully operated as a going concern for 220+ years, with a proven and reliable management structure. Few, if any corporations, have been able to do that.
This is my sig.
AFAIK, the one doesn't exclude the other.
However, assuming evil is more fun :-)
Insert
> ...issues accessing their sites...
"Issues"? What's wrong with "problem"? "Issues" is marketing-speak. Microsoft marketing-speak.
And yes, get off my lawn.
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
Sure, it should not ignore robots.txt. And if that's true, there's a problem - but I'd like MS's side of the story before assuming that it ignores robots.txt - who knows, maybe the robots.txt is malformed.
I'd also like to know what user agent string is the crawler using.
But all that said, this is not exactly newsworthy. I've run large, dynamic internet sites for years. I've had problems with many, many different kinds of crawlers, from many companies (including companies like Google). There's a ton of bots out there that do ignore robots.txt (there were a few hundred bots that scanned the site I used to run, back in 2001, that ignored robots.txt). So it's something a programmer really needs to be ready to deal with.
Yes, these bots are rude, abusive, and inconsiderate of the site owners (go figure - most of the companies running them, the small bots, are pretty much unethical anyhow - anything for a buck). But it's on the internet, just like spam and a bunch of other things we all get annoyed with. You have to deal with it.
I suggest applications like mod_bwshare to even out this type of behavior, traffic shaping at the network layer for known abusers you don't just want to block, etc. Those are the tactics I use.
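On the "maybe the robots.txt is malformed" point: one quick sanity check is to feed the file to a parser and ask what it thinks a given user agent may fetch. This is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt contents and the example.org URLs are invented for illustration, not CPAN Testers' actual file, and a real crawler may interpret the rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Invented example rules; not the real CPAN Testers robots.txt.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /cgi-bin/

User-agent: *
Disallow: /private/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    agent = "msnbot/2.0b"
    print(rules.can_fetch(agent, "http://example.org/cgi-bin/report"))
    print(rules.can_fetch(agent, "http://example.org/index.html"))
    print(rules.crawl_delay(agent))
```

If the parser's answers do not match what you intended, the file is at least worth a second look before blaming the bot.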
I redirect lost bots home, seems a polite thing to do. 301 www.microsoft.com
O, it's just a pumpkin :-(
Here's the real address goatse.fr. Doesn't Mr Sarkozy have a lovely face?
However, look at the private CEOs. When the company goes under, they get the golden parachute and off to another business.
I'm pretty sure the first "D" in DDoS stands for "Distributed."
If it was really a DDoS, you wouldn't be able to filter the IP out with a simple regex (like the /^65\.55\.(106|107|207)/. from TFA).
To boot, TFA didn't even say DDoS. Maybe that's too much to expect the editors to oh... I don't know...say... RTFA or Fact-Check it?
I should drop my bar a bit, I suppose.
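To make the point concrete, here is a small sketch that applies the quoted pattern to a list of client addresses. The pattern is copied from the comment (minus the Perl delimiters); the function names and the sample addresses in the usage are assumptions for illustration.

```python
import re

# Pattern as quoted in the comment and attributed there to TFA.
MS_RANGE = re.compile(r"^65\.55\.(106|107|207)")


def from_flagged_range(ip: str) -> bool:
    """True if the address starts with one of the three flagged prefixes."""
    return MS_RANGE.match(ip) is not None


def split_hits(ips):
    """Separate addresses into (flagged, others), preserving order."""
    flagged = [ip for ip in ips if from_flagged_range(ip)]
    others = [ip for ip in ips if not from_flagged_range(ip)]
    return flagged, others


if __name__ == "__main__":
    print(split_hits(["65.55.207.12", "65.55.34.139", "192.0.2.7"]))
```

A traffic source that one anchored regex can isolate is concentrated rather than distributed, which is the poster's objection to the headline.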
ipchains -A input -j REJECT -p all -s 65.55.207.0/24 -i eth0 -l
ipchains -A input -j REJECT -p all -s 65.55.107.0/24 -i eth0 -l
ipchains -A input -j REJECT -p all -s 65.55.106.0/24 -i eth0 -l
problem solved
Don't kid yourself. It's the size of the regexp AND how you use it that counts.
The CPAN folks could complain to their ISP and have them drop the traffic that's coming in to their boxes.
Most ISP's will work with you to correct DDOS problems.
If you don't know, you should Google it; that will make it clear. /. is not a help mailing list, and this was stupid, feckless and criminal, as in misuse of a computer system beyond authorisation.
How dare you sir (or madam)!! How dare you! It is clear from the title of your post that you were not so subtly casting aspersions on an organization who I hold dear -- namely the Hirsute Dungeons n' Dragons society. You can frame your remarks in some obscure racial epithets, but to those of us who twirl our mustaches or stroke our beards while rolling dice, your insidious implication is brazenly clear. As the leader of a group of men (and women) with decorative facial hair who play Dungeons n' Dragons every Wednesday night, I cannot help but express the strongest offense to your euphemistically delivered hidden acronym. In the future, should you have such thoughts I would urge you to Do Not Say them.
"You never pushed a noun against a verb except to blow up something" (Spencer Tracy, 'Inherit the Wind')
Yeah, in my site's statistics Microsoft bots are the most active visitors. Really, they crawl the site hundreds of times more often than Googlebot.
Hide your files and folders from others!
You know women with decorative facial hair? mkaaaay....
For every failure you list, I can give you three that succeeded.
War on Poverty - yeah, that worked out *real* well, didn't it.
Homestead act, Rural electrification act, Highways
War on Drugs - See any results there?
CDC, Peace Corps - cures smallpox worldwide. I don't know -any- government that can make that claim, but our US government.
Social Security, Medicare - unless you really want your grandma to move in and then die.
Food and Drug administration, Small Business Administration, Student Loans. Safe food, help for small businesses, put kids in college.
Fannie Mae - yeah, it blew up, but look at how many people actually have -homes-. The whole banking crisis could have been Bush's finest hour. When the Democrats were railing on about the mortgage meltdown, Bush could have said, "yeah, but we put people into homes. We tried to put people into homes and give them a chance, and for the 95% of people who did NOT default on their mortgages, it totally worked."
War on Terror - With this one, I can't really tell if it's bungling, or actual malice
That's on all of us. Americans overreacted. We voted for the war on terror and the invasion of Iraq. We lost our cool after 9/11, and now we pay the price for our own stupidity.
But, I'll see your war on drugs and raise you one US Military. Brings democracy to Japan and Germany, deters Commies from taking over Europe. The military is a government operation, and for the most part, it's actually worked pretty well.
PS. Who's saving lives in Haiti right now? Why, it's fresh water from American aircraft carriers, US Marines acting as peacekeepers. Our government did that, and we should be proud.
This is my sig.
What happens when the MS bots (which apparently ignore the robots.txt file) start indexing some site which provides pay-per-view information? Can we expect a fix to the problem then? All it takes is to get some lawyers involved, you know how that snowball goes.
How's it possible that, on Slashdot of all sites, *I*, of all people, need to tell you that IP packets do not necessarily come from the address inscribed in their headers?
While he could be more polite, it is indeed embarrassing for Microsoft if they cannot check their own network
a) for the existence of computers with given IPs
b) what these computers are doing
I think that deserves an "insightful" that cancels out the "flamebait".
C - the footgun of programming languages
Robots.txt is merely advisory. Ignoring it is discourteous and oafish but not illegal.
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
if it's a scan (TCP established stream, taxing the SERVERS, not the NETWORK) that's the problem, as opposed to a SYN flood etc, and the IP addresses are in a very small range, why aren't they just using a hardware firewall at the router and blocking the IPs? There's not a whole lot to "distributed" when it's coming from a pair of C's.
Not saying they should be DOING it, but this is not a Denial of Service, it's a Denial of Stupid.
I work for the Department of Redundancy Department.
Wow, this article is prescient.
I was just noticing in my web logs that small, out of the way sites that I host that used to get 1,000 hits a month were suddenly getting 1,000 hits PER DAY. Sure enough, anybody care to guess what netblock the 26,000 hits came from?
Microsoft.com just earned a ban.
Sweet, got any room in your group? I have my own dice, mustaches and beard, though sadly lacking women for some reason.
Question reality.
"never ascribe to malice that which can be adequately explained by stupidity. (Insert lame joke about MSFT being full of stupidity here)."
Insert true story about Microsoft being EVIL here, sometimes even unintentionally evil.
I believe soon we will see a new Bing feature - real time results. This will definitely beat Google
Got it! Bing is written in Perl. They do regular expression matching while crawling and forgot to have a \Q ... \E escape sequence for the regex matching. They got so much Perl code on CPAN, full of special characters, that somehow the crawler engine went into an infinite loop.
Bingo Dictionary - Pragmatist, n. A myopic idealist.
Never attribute to malice that which can be adequately explained by stupidity.
Have gnu, will travel.
Looking at another Robots.txt file seems to return what I expect.
Let no rock remain unthrown when it shows Microsoft is in the wrong - even if they aren't
I have mod points and I am not afraid to use them
Add to your .htaccess file:
deny from 65.55.207.
deny from 65.55.106.
deny from 65.55.107.
http://www.womenwithmustaches.com/
"Lazy, feckless, inconsiderate crooks." You forgot abusive and ignorant and socially backward.
Don't you hate it when people are excessively positive about Microsoft?
Steve Ballmer has little technical knowledge, and any good people who were at Microsoft left long ago, I'm guessing.
Bing should have used Wget first to download the articles to a local hard drive, and also to add a 2 to 3 second wait. Let it run over the weekend. Then test the search indexing algorithms on the local HTML files. They were probably performing indexing tests. I know they have smart people working for them, so it probably involved a contractor who didn't think about performance issues.
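The "add a 2 to 3 second wait" idea is easy to sketch. This is not Wget, just a Python illustration of fetching a list of URLs with a pause between requests; `polite_fetch`, its parameters and the injected `fetch` callable are all hypothetical names, and the actual download function is left to the caller.

```python
import time
from typing import Callable, Dict, Iterable


def polite_fetch(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages


if __name__ == "__main__":
    # Stand-in fetcher so the example runs without touching the network.
    fake = lambda url: "<html>" + url + "</html>"
    print(polite_fetch(["http://example.org/a", "http://example.org/b"], fake, 0.1))
```

Once the pages are stored locally, indexing experiments can be rerun as often as needed without sending another request to the site.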
If you remember the history of robots.txt because you were there at the time, rather than because you read it in some history book somewhere, the purpose was to protect small web servers from being trashed by big search robots, initially altavista, and secondarily to protect them from other well-behaved web crawlers of whatever sorts. There were no script-generated pages back then, or at least hardly any; just handing out static html could be difficult enough if you had a small pipe and a slow server, though serving images to a robot was obviously a waste of time back then.
Tarpits of various sorts existed soon after robots.txt, as a way of trapping spammer-run crawlers that ignored robots.txt, but that was as much for fun as for necessity :-)
And yes, people did have /private/ directories back then and still do now, thinking that because Google's polite about not looking in directories robots.txt says not to, there aren't humans or impolite robots that will look there.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
It'll just keep them from bothering you, and you're (almost by definition) too small for them to care that they're not indexing your site.
Advertising their IP address block with BGP, if your ISP is careless enough to let you do that, now *that* would get their attention :-)
As an intermediate level of annoyance, you could set up your DNS server to respond to queries from Microsoftland to return entertaining IP addresses, such as 127.0.0.2 or bing's IP addresses or whatever.
Bill Stewart
Well, it's true here to.
You mean "too", as in "also". "to" is the opposite of "from".
The primary reason for robots.txt was to protect small slow web servers from being swamped by Altavista's big fast web crawlers. Dynamic pages weren't a problem back then. On the other hand, after robots.txt became common, setting up dynamic pages to trap crawlers that ignored it into infinite loops became common also, because most of them were run by spammers of various sorts.
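A toy version of such a trap, as a sketch only: every generated page links to more generated pages under a path that robots.txt would list as disallowed, so only crawlers that ignore the file wander in. The `/trap/` prefix and the function name are hypothetical, and nothing here is taken from a real tarpit implementation.

```python
import hashlib

# Hypothetical path; it would be listed under Disallow in robots.txt.
TRAP_PREFIX = "/trap/"


def trap_page(path: str, links: int = 3) -> str:
    """Build an HTML page whose links all lead deeper into the trap."""
    if not path.startswith(TRAP_PREFIX):
        raise ValueError("not a trap path")
    anchors = []
    for i in range(links):
        # Derive a stable token from the current path so pages never run out.
        token = hashlib.sha1(f"{path}#{i}".encode()).hexdigest()[:8]
        anchors.append(f'<a href="{TRAP_PREFIX}{token}">more</a>')
    return "<html><body>" + " ".join(anchors) + "</body></html>"


if __name__ == "__main__":
    print(trap_page("/trap/start"))
```

Well-behaved crawlers never see these pages, because the disallowed prefix keeps them out.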
Bill Stewart
Search engines try to tell humans what web sites would have interesting contents based on their queries. They use robots and content models to approximate that so they can produce results quickly and economically. SEOs try to get the robots to tell the humans "my page is really interesting", when it usually isn't, which is scummy lying, and you shouldn't encourage such people.
They've really got three things to offer:
Bill Stewart
I don't see any requests for Robots.txt in my logs. It's always lower case:
65.55.106.138 - - [19/Jan/2010:01:00:46 +0100] "GET /robots.txt HTTP/1.1" 200 30 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
The spec for robots.txt says that strings matched internally in the text file should be done in a case insensitive manner.
It would only make sense for a "reasonable person" to assume that any web fetches for a file named 'robots.txt' should also match in a case insensitive manner.
This sounds like Microsoft being used to uppercasing the first letter of words -- which looks aesthetically pleasing, and not having it make any real difference on 70% of the computers on the planet (running Microsoft) and (in my experience) on most webservers running Apache. Never noticed any case sensitivity.
This looks like a case of the Perl guys being at fault. They likely have a web server written in Perl and didn't do a case ignore when processing requests for 'robots.txt'. This violates the intent if not the letter of the spec.
Check out http://www.robotstxt.org/orig.html. It specifies that all of its strings should be matched in a case insensitive manner. It doesn't explicitly say that the filename 'robots.txt' should also be matched by the webserver in a case insensitive manner, but if it specifies that all of the web addresses in the file should be handled in a case-insensitive manner, doesn't it make sense that the file name itself should also be case insensitive?
People should use a little common sense before going off and blaming Microsoft for doing something that is perfectly natural and perfectly understandable, while the supposed victims should be a bit more robust in the design of the web server.
At least, that's how it appears to me -- anyone care to show me a sound reasoning why it should be otherwise or why one would expect otherwise?
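For what it's worth, the server-side tolerance being argued for here is only a few lines. This is a sketch of the mechanics, with a hypothetical function name; whether a server ought to behave this way is the poster's argument, not something the sketch settles.

```python
def is_robots_request(path: str) -> bool:
    """True if the request path names robots.txt, ignoring letter case."""
    # Drop any query string before comparing.
    bare_path = path.split("?", 1)[0]
    return bare_path.lower() == "/robots.txt"


if __name__ == "__main__":
    for candidate in ("/robots.txt", "/Robots.txt", "/robots.txt.bak"):
        print(candidate, is_robots_request(candidate))
```

A request handler could call this before its normal routing and serve the same file for any capitalisation.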
Here it is another one from some minutes ago:
IPv4: 65.55.34.139 -> 83.211.46.34
hlen=5 TOS=192 dlen=162 ID=46000 flags=0 offset=0 TTL=0 chksum=7990
Payload: Priority Count: 5
Connection Count: 6
IP Count: 7
Scanner IP Range: 78.130.238.2:212.90.12.134
Port/Proto Count: 7
Port/Proto Range: 80:40210
65.55.34.139 resolving to col0-omc3-s1.col0.hotmail.com
Bing
Bing
Ya, give them an excuse to get away with it. "it wasn't us attacking our competition, really"
---- Booth was a patriot ----
http://www.networksolutions.com/whois/results.jsp?ip=65.55.207.0