Is Microsoft Crawling Google?

Don't concern yourself with this crap... by garcia · 2004-11-11 07:37 · Score: 4, Insightful

Has anyone out there seen similar behavior on their own sites? Please comment with your qualitative/objective data if so.

Sure, I see crawlers on my site all the time sometimes hitting the same URL over and over again. Do I understand their repetitive behavior? No. Do I care what they are doing? No, as long as they are obeying my robots.txt.

I have complained before about MSNbot ignoring changes to robots.txt while Google happily changed its habbits (I can't find the link sorry). My recent fighting with Googlebot has come to a head when I had to disallow them access to my gallery completely because they refused to honor anything except Disallow: /. I had to go so far as to point Googlebot at my robots.txt and tell it to remove all the previous links. It was rather annoying dealing with support via email from Googlebot as they have apparently taken on the stance of "we don't care but you should put meta tags in all your files so that we don't index those pages." Umm, you are crawling MY site for YOUR profit, you do as I say, not the other way around.

Do I care if MSNbot is crawling Google and then finding sites and links to search? No as it's none of OUR concern. What is OUR concern is our own robots.txt and how the spiders interact with our sites through that file. Let Google deal with Microsoft/MSNbot if that's what needs to be done but don't concern yourself with it otherwise.

Re:Don't concern yourself with this crap... by finkployd · 2004-11-11 07:45 · Score: 4, Insightful

Umm, you are crawling MY site for YOUR profit, you do as I say, not the other way around.

No offense dude, but you are the one who put the site out there publicly. Now if they are DoSing you then you have a valid complaint but robots.txt is just there as a friendly suggestion. I can write a search bot today that completely ignores it and there is nothing wrong with that (except perhaps ethically but even that is arguable). If you don't want people (or bots) viewing it then password protect it or take it off the public interweb.
Re:Don't concern yourself with this crap... by garcia · 2004-11-11 07:47 · Score: 2, Insightful

Now if they are DoSing you then you have a valid complaint but robots.txt is just there as a friendly suggestion.

Crawling a gallery of images (and all image property links as well) all day for several days might be considered "DoSing"; I consider it rude.

You're right, they don't have to obey the robots.txt but they should when they say they will.
Re:Don't concern yourself with this crap... by mollymoo · 2004-11-11 07:59 · Score: 5, Interesting

No offense dude, but you are the one who put the site out there publicly. Now if they are DoSing you then you have a valid complaint but robots.txt is just there as a friendly suggestion.

There's more to it than that. Google caches your pages and makes that cache of your copyright material available. Arguably if you have used your robots.txt file to tell it not to index (and therefore cache) your pages and it still does they are breaching copyright. OK, the Google cache is the world's largest breach of copyright anyway, but if you have told its spider not to index and it does regardless, that's a different ballgame.
Putting it out there on the web does not give anyone the right to do with it as they please.

--
Chernobyl 'not a wildlife haven' - BBC News
Re:Don't concern yourself with this crap... by Eric+Giguere · 2004-11-11 08:12 · Score: 3, Interesting

Sure, I see crawlers on my site all the time sometimes hitting the same URL over and over again. Do I understand their repetitive behavior? No.

Google gives a partial answer to this on their GoogleBot page:

In general, Googlebot should only download one copy of each file from your site during a given crawl. Occasionally the crawler is stopped and restarted, and it may recrawl pages that it has recently retrieved. These recrawls should happen infrequently.

If they're playing around with new indexing algorithms then I would expect to see more of these multiple hits.
Eric
How to (gently) detect Internet Explorer
Re:Don't concern yourself with this crap... by liquidsin · 2004-11-11 08:13 · Score: 4, Interesting

Hmmm...let's call "robots.txt" a "copyright control device" in that it states who may and may not have access to my copyrighted images directory. I'd bet a DMCA suit or two for circumventing your copyright control device would get them to pay attention...

--
do not read this line twice.
Re:Don't concern yourself with this crap... by gUmbi · 2004-11-11 08:15 · Score: 2, Funny

If you don't want people (or bots) viewing it then password protect it or take it off the public interweb.

Interweb? Is that the same as the 'Information superhighway'?
Re:Don't concern yourself with this crap... by Thumpnugget · 2004-11-11 08:21 · Score: 3, Funny

Interweb? Is that the same as the 'Information superhighway'?

They're very similar. One notable difference is that the Information Superhighway was invented by Al Gore.

--
Free yourself. Everything else will follow.
Re:Don't concern yourself with this crap... by Anonymous Coward · 2004-11-11 08:24 · Score: 2, Funny

In the early days of MSN Bot, it ate up about 4 GB of my bandwidth on ONE html page, requesting it constantly, every few seconds for days! I emailed Microsoft and they replied with an 'oops, we found the problem'. That doesn't pay my bandwidth overage charges, does it?
Re:Don't concern yourself with this crap... by Anonymous Coward · 2004-11-11 08:32 · Score: 2, Insightful

Well anything on the internet that doesn't have normal web server access controls blocking access, is open slather IMO. That's what makes the internet so cool. Doesn't mean you can't still copyright your material so others can't use it, but I think for search engine purposes there is an implied agreement between YOU and THEM - and I think there should be.

In a sense it's like tourism. The world is full of stuff like Historical buildings and the owners of those places have legal rights against theft/damage etc. But the tour companies can still take people around the streets and show them the places without having to necessarily pay a fee.
Re:Don't concern yourself with this crap... by Anonymous Coward · 2004-11-11 08:43 · Score: 2, Insightful

Since databases are currently copyrightable, I would argue that a website is a database. If Google insists, I would imagine that MSN, hitting Google's database...er, website, would amount to using a copyrighted database without its owner's permission, which in this case could amount to being a robots.txt file that punts known websites that link w/o attribution.

I would imagine a metacrawler, which attributes its links back to Google, is probably OK, because it keeps Google's adstream intact when the user clicks on the link to a Google search result.

But MSNSearch (or whatever it's called), taking Google search results as its own without attribution, well, that might be a copyright infringement...

If you were an on-line bookstore and deep-linked to Amazon's reviews while portraying them as your own, well, you're gonna get a C&D from Amazon's lawyers awfully fast.
Re:Don't concern yourself with this crap... by nofx_3 · 2004-11-11 08:55 · Score: 4, Funny

Yes, but I invented the "Information Historic Old Country Road" its not fast, and there ain't much information, but its so durn quaint you gotta love it.

-kaplanfx

--
Visualize Whirled Peas
Re:Don't concern yourself with this crap... by CowboyBob500 · 2004-11-11 09:17 · Score: 2, Interesting

As far as I see, MSNBot is behaving itself whilst Googlebot is hungriest - (much as I hate to stick up for Microsoft).

Robot                        Hits  Bandwidth  Last visit
Googlebot (Google)             74  945.51 KB  11 Nov 2004 - 03:02
Netcraft Web Server Survey     13  0          10 Nov 2004 - 23:48
Mirago                          6  76.44 KB   02 Nov 2004 - 04:13
MSNBot                          6  76.44 KB   05 Nov 2004 - 05:58

It's interesting that Mirago and MSNBot have taken exactly the same bandwidth in the same amount of visits. Are MS innov^H^H^H^H^H buying new technology again?

Bob

--
Listen to my latest album here
Re:Don't concern yourself with this crap... by Jahf · 2004-11-11 09:31 · Score: 2, Informative

IANAL but I would see this as falling under fair use.

1) the LoC is not profiting from your works nor is it re-using them (with the exception of providing an archive to others, see next item).

2) the LoC regularly tells people requesting copies of their information to first obtain permission from the copyright holder (in other words, as with any library, you can browse but you can't copy without permission and copy permission does not equal permission to reuse in a commercial work).

3) Copy protection schemes require active protection to fall under the DMCA, even if it is so simple that anyone can defeat it. Robots.txt is -passive- protection because you have to purposefully search for the file and then purposefully utilize it. To be active protection the document should not come up without the viewer (or blocked viewer) performing some form of action. When someone/something visits an unprotected public web page there is not a way for your web server to invoke the robots.txt file, therefore it is not an active mechanism.

--
It is more productive to voice thoughtful opinions (reply) than to judge (moderate) others.
Re:Don't concern yourself with this crap... by ad0gg · 2004-11-11 09:33 · Score: 5, Informative

If you don't want your site indexed or cached by Google, go here and follow the directions.
Remove yourself from google
"Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, your webmaster must first insert the appropriate meta tags into the page's HTML code. "

--
Have you ever been to a turkish prison?
Re:Don't concern yourself with this crap... by mollymoo · 2004-11-11 11:41 · Score: 4, Insightful

If you don't want your site indexed or cached by Google, go here and follow the directions.

I shouldn't need to go and fill out some form for every search engine to protect my rights. One accepted standard way to say "do not index this" should be sufficient. This is an automated system. There is an accepted automated method to stop crawlers indexing your site (robots.txt). If they (Google or anyone else) take your copyrighted content and reproduce it automatically when their automatic system could have automatically respected your explicitly stated and legally protected rights, they are knowingly making a flagrant copyright violation.

--
Chernobyl 'not a wildlife haven' - BBC News
Re:Don't concern yourself with this crap... by djcapelis · 2004-11-11 13:26 · Score: 3, Informative

To remove all the images on your site from our index, place the following robots.txt file in your server root:
User-agent: Googlebot-Image
Disallow: /

That should work? No?
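
For what it's worth, the rule is easy to sanity-check offline. Here is a minimal sketch using Python's standard urllib.robotparser. The gallery URL is made up for illustration, and this only shows how a well-behaved parser reads the file, not what Googlebot actually does on the wire:

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /
"""

def allowed(user_agent, url):
    """Return True if the robots.txt above lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical URL, for illustration only.
URL = "http://example.com/gallery/photo.jpg"

print(allowed("Googlebot-Image", URL))  # the image crawler is shut out
print(allowed("Googlebot", URL))        # the regular crawler is not covered by this rule
```

Note that the rule only names the image crawler, so the ordinary Googlebot would still be free to fetch the HTML pages of the gallery.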

--
I touch computers in naughty places
Re:Don't concern yourself with this crap... by big_gibbon · 2004-11-11 21:16 · Score: 2, Insightful

It was rather annoying dealing with support via email from Googlebot as they have apparently taken on the stance of "we don't care but you should put meta tags in all your files so that we don't index those pages." Umm, you are crawling MY site for YOUR profit, you do as I say, not the other way around.

Google should follow the robots.txt - definitely. But there needs to be some way of confirming on your website that you actually want the pages removed - otherwise what's to stop your competitors "accidentally" entering your URL into the removal form? Meta elements would seem to be the natural choice.
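
A crawler-side check for such a meta element could look roughly like this. It is a minimal Python sketch using the standard html.parser; the class name, the function name, and the sample pages are my own inventions, and I am only assuming the usual "robots" meta tag with a "noindex" directive:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so normalise them.
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())

def wants_removal(html):
    """Return True if the page itself asks not to be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

# Made-up pages, for illustration only.
OPTED_OUT = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
NORMAL = '<html><head><title>Gallery</title></head></html>'

print(wants_removal(OPTED_OUT))
print(wants_removal(NORMAL))
```

The point of the design is that only someone who can edit the page can set the flag, which is exactly the confirmation a removal form lacks.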

P

Difficult to do if Google doesn't want them to by Anonymous Coward · 2004-11-11 07:37 · Score: 5, Insightful

All Google has to do is run some unusual queries through MSN, check their logs, find the IP addresses and block them.
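
The blocking half of that is simple enough. Below is a minimal sketch with Python's standard ipaddress module. The netblocks are reserved documentation ranges standing in for whatever addresses turned up in the logs; they are not real crawler addresses, and the function name is mine:

```python
import ipaddress

# Hypothetical blocklist: reserved documentation ranges, not real crawler netblocks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(ip):
    """Return True if the address falls inside any blocked netblock."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in BLOCKED_NETWORKS)

print(is_blocked("192.0.2.17"))   # inside the first blocked range
print(is_blocked("203.0.113.5"))  # outside both ranges
```

The hard part is the first half (working out which addresses belong to the crawler), and a crawler routed through proxies would sidestep a list like this.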

Re:Difficult to do if Google doesn't want them to by carpe_noctem · 2004-11-11 07:41 · Score: 4, Funny

Why stop there? Google should just ban all of Microsoft's netblocks to prevent their employees from gathering useful information from them...

"Begun, this war of the corporations has!"

--
"Quoting famous computer scientists out of context is the root of all evil (or at least most of it) in programming." - K
Re:Difficult to do if Google doesn't want them to by Anonymous Coward · 2004-11-11 07:43 · Score: 2, Funny

Microsoft could create a new distributed crawler that comes bundled with Windows! Every Windows user could crawl Google for them, and then Google's only option would be to block everyone using an MS product.

Remember, helping Microsoft is like helping yourself.
Re:Difficult to do if Google doesn't want them to by blamanj · 2004-11-11 08:09 · Score: 5, Interesting

Yes, and don't think Google wouldn't notice. My company had a summer intern that once wrote a program that started sucking a lot of information out of Google. They blocked our entire site for about three days until everything got straightened out.
Re:Difficult to do if Google doesn't want them to by zentigger · 2004-11-11 09:55 · Score: 3, Interesting

Better yet, Provide those addresses with the correct search results, but change all the links to the raunchiest porn (or pictures of little puppy dogs, if that better suits your sense of moral rectitude)

--
the above is my personal opinion and does not necessarily reflect that of the little voices in my head
Re:Difficult to do if Google doesn't want them to by asavage · 2004-11-11 12:28 · Score: 3, Interesting

If you go to whatismyip you get a website that displays your IP address. If you search msn and google for that site the search results show the IP address of the bot that indexed that site.
For google I get: crawl-66-249-64-167.googlebot.com [66.249.64.167]
for msn I get: fj1011.inktomisearch.com [66.196.91.16]
and msn beta I get: 65.54.188.83 (can't find associated domain)
So we can tell that at least this result wasn't stolen from Google.
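
If anyone wants to repeat this against their own logs, here is a rough Python sketch. The suffix table is built only from the two hostnames quoted above, the function names are mine, and the reverse lookup needs network access, so the classification step is kept separate from it:

```python
import socket

# Suffix-to-operator table taken only from the hostnames quoted above.
KNOWN_SUFFIXES = {
    ".googlebot.com": "Google",
    ".inktomisearch.com": "Inktomi",
}

def reverse_lookup(ip):
    """Reverse-resolve an IP address; return None if there is no usable record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

def crawler_owner(hostname):
    """Guess the crawler operator from a reverse-DNS hostname, or return None."""
    if not hostname:
        return None
    host = hostname.lower().rstrip(".")
    for suffix, owner in KNOWN_SUFFIXES.items():
        if host.endswith(suffix):
            return owner
    return None

print(crawler_owner("crawl-66-249-64-167.googlebot.com"))
print(crawler_owner("fj1011.inktomisearch.com"))
```

Reverse DNS is controlled by whoever owns the address block, so a careful check would also resolve the hostname forward again and confirm it points back at the same IP.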

Does it violate Google's Terms of Service by winkydink · 2004-11-11 07:38 · Score: 4, Insightful

If so, they have legal remedies.

If not, it's called doing business and gaining an advantage any legitimate way that you can.

I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.

--

"I'd rather be a lightning rod than a seismometer." -Ken Kesey

Re:Does it violate Google's Terms of Service by Lev13than · 2004-11-11 07:44 · Score: 3, Insightful

Does it violate Google's Terms of Service? If so, they have legal remedies.
If not, it's called doing business and gaining an advantage any legitimate way that you can.
I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.

If I copy your work and take credit for it, does it violate your terms of service? If so, you have legal remedies. If not, it's called doing business and gaining an advantage any legitimate way that I can.

Furthermore, I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.

--
When you have nothing left to burn you must set yourself on fire
Re:Does it violate Google's Terms of Service by TheRaven64 · 2004-11-11 07:58 · Score: 4, Interesting

Do Google's terms of service have any legal standing? Click-through EULAs don't in many jurisdictions, and I don't remember ever even seeing Google's ToS, let alone agreeing to them.

--
I am TheRaven on Soylent News
Re:Does it violate Google's Terms of Service by nick13245 · 2004-11-11 08:22 · Score: 5, Informative

Yes it does.
From Google's Privacy Center (http://www.google.com/terms_of_service.html):

Personal Use Only

The Google Services are made available for your personal, non-commercial use only. You may not use the Google Services to sell a product or service, or to increase traffic to your Web site for commercial reasons, such as advertising sales. You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not "meta-search" Google. If you want to make commercial use of the Google Services, you must enter into an agreement with Google to do so in advance. Please contact us for more information.

Yea, and by BrianGa · 2004-11-11 07:38 · Score: 5, Funny

The new search engine's name will be Mooglesoft.

Re:Yea, and by MooseByte · 2004-11-11 07:46 · Score: 4, Funny

"The new search engine's name will be Mooglesoft."
Which will subsequently be sued by SCOogle, the latest startup from The Canopy Group, after announcing they purchased the rights to the Internet in a complex transaction which is documented in a briefcase somewhere in Germany.
Re:Yea, and by meabolex · 2004-11-11 07:48 · Score: 3, Funny

Initiating a Mooglesoft search:

Instead of clicking a button named Google Search, it simply says "KupoKupo!"

You are then returned a page where 100% of the text is the word "Kupo"

This is slightly less optimized than a Marklar search (which at least has some words other than 'Marklar').

--
FORTUNE FAVORS IRONY

But will this mean Google can crawl back? by biffnix · 2004-11-11 07:39 · Score: 5, Funny

Couldn't Google just crawl Microsoft in return? Then they'd be stuck in an endless loop, and William Shatner can then swoop in, crack some skulls, and save the day.

Or something like that.

biffnix

--
Don't Die Wondering

Microsoft stealing someone elses technology??? by Shant3030 · 2004-11-11 07:39 · Score: 4, Funny

Nah, never happens....

--
100% Insightful

Re:Microsoft stealing someone elses technology??? by isometrick · 2004-11-11 07:56 · Score: 3, Interesting

Google's "data" is collected, generated, and stored by their technology.

I won't steal your oven, but I'll steal your food!
Re:Microsoft stealing someone elses technology??? by isometrick · 2004-11-11 08:03 · Score: 2, Informative

Google Terms of Service

" ... You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not "meta-search" Google. If you want to make commercial use of the Google Services, you must enter into an agreement with Google to do so in advance ..."
Re:Microsoft stealing someone elses technology??? by netringer · 2004-11-11 08:11 · Score: 4, Interesting

I fail to see how they are stealing any of Google's technology. Data maybe.
Or are they stealing Google's innovations?

Lo! Note how the review articles of the last few days mention the innovative NEW FEATURE of MSN search called, "Search Near Me" which stores the calculated lat/long of addresses on web pages and returns matches near you.

Note how Google's long in beta Google Local (http://local.google.com) stores the calculated lat/long of addresses on pages and returns matches near you. Google Local works better.

Another Microsoft innovation! Let's hope WE remember who had it first!

--
Ever dream you could fly? Get up from the Flight Sim. I Fly

They been crawling like mad lately by mpost4 · 2004-11-11 07:40 · Score: 5, Interesting

I can say that they've been crawling like mad as of late: Google, Yahoo, and MSN. I say this because my site has had a lot of traffic from all three, and it is not a popular or even an important one. Not just once a week or a few times a week, but every day. There are big updates coming. I was not surprised to see the article about Google doubling their index; I knew something was coming from the way they are crawling unimportant/unpopular sites.

Try this term on MSN search by bbzzdd · 2004-11-11 07:40 · Score: 5, Funny

more evil than satan

ROOFLES!

Re:Try this term on MSN search by JohnnyKlunk · 2004-11-11 07:47 · Score: 5, Funny

OK. This is really freaky. Try

more evil than god and you get FIREFOX as the first result (then google, of course)
Re:Try this term on MSN search by finkployd · 2004-11-11 07:49 · Score: 4, Funny

That they put google up there as the number one search result is not that surprising. What gets me is they have themselves at number four.
Re:Try this term on MSN search by fireshipjohn · 2004-11-11 07:51 · Score: 2, Informative

Now try it on google and you get articles about the 'more evil than....' debate.

I know which search engine I'm sticking with :)
Re:Try this term on MSN search by Kalak451 · 2004-11-11 07:52 · Score: 2, Interesting

Also note that the "SPONSORED SITES" part of the page goes away on that search.
Re:Try this term on MSN search by hehman · 2004-11-11 07:54 · Score: 2, Interesting

I think you meant this URL: more evil than microsoft
Re:Try this term on MSN search by }InFuZeD{ · 2004-11-11 07:55 · Score: 2, Funny

I'm not sure if it's funnier that Google is #1, or that Microsoft lists itself as #4.
Re:Try this term on MSN search by Garion+Maki · 2004-11-11 07:57 · Score: 3, Informative

pretty funny :)

but it seems like google started it several years ago.

http://www.cnn.com/TECH/computing/9911/15/search.engine.ms.idg/
and
http://searchenginewatch.com/sereport/article.php/2167621
btw, it doesn't seem to work on google anymore...

--
All indicators show that the human race is selectively breeding itself for stupidity.
Re:Try this term on MSN search by stratjakt · 2004-11-11 07:57 · Score: 2, Funny

Realize it takes into account popularity of the site, and occurrence of the words, and I believe the word types are ranked too, nouns before verbs before adjectives before adverbs.

The Firefox page is fairly popular, and the words "more" and "than" appear over and over, as with Google. (Uh, Google's motto "do no evil" wouldn't hit another word, hmmmmmmmmm)

Try this one (seriously): more gay than slashdot

--
I don't need no instructions to know how to rock!!!!
Re:Try this term on MSN search by Red+Alastor · 2004-11-11 08:30 · Score: 4, Interesting

Sure. Bill Gates is an atheist so he thinks that God is evil. Open Source too, especially that pesky browser eating his market share.
Before you mod me down for that, I'd like to mention that this isn't Microsoft bashing since I am an atheist too and so are Linus and RMS.

--
Slashdot anagrams to "Sad Sloth"
Re:Try this term on MSN search by mormop · 2004-11-11 09:33 · Score: 4, Funny

It's not so much the result that scares me as the thought processes that led you to try it ;)

--
Hmmmmmm..... Deep fried and look like Squirrel.
Re:Try this term on MSN search by KFury · 2004-11-11 09:55 · Score: 3, Interesting

That they put google up there as the number one search result is not that surprising. What gets me is they have themselves at number four.

Not anymore. They apparently hand-edited their own company out of the results about an hour ago.

--

Kevin Fox
Re:Try this term on MSN search by StikyPad · 2004-11-11 10:19 · Score: 3, Informative

His thought process probably started here

--
https://www.eff.org/https-everywhere

They wouldn't... by Wrathie · 2004-11-11 07:40 · Score: 4, Funny

Such trouble. Just buy the damned company.

Re:They wouldn't... by RobertB-DC · 2004-11-11 07:45 · Score: 4, Funny

Such trouble. Just buy the damned company.

Come on, be serious. Google doesn't plan to buy Microsoft until *after* they reach the one-year post-IPO mark, silly.

--
Stressed? Me? Of course not. Stress is what a rubber band feels before it breaks, silly.

Shocked I tell you by finkployd · 2004-11-11 07:41 · Score: 5, Funny

Well, that kind of business practice would be completely out of character for Microsoft.

This is a non-story. A good Slashdot headline will be when they get caught actually NOT doing something like this.

Microsoft Has Original Idea and Implements it By Themselves
From the 70%-of-slashdot-editors-suffered-heart-attacks-reading-this-submission Dept.

Re:Shocked I tell you by oGMo · 2004-11-11 07:51 · Score: 2, Funny

Microsoft releases "Bob"
From the laugh-it's-funny Dept.

--
Don't think of it as a flame---it's more like an argument that does 3d6 fire damage

Google is Catholic? by TheAmazingBob · 2004-11-11 07:43 · Score: 5, Funny

"Google happily changed its habbits..."

Google is Catholic?

--

The Geek Crew

Meta-search? by grasshoppa · 2004-11-11 07:44 · Score: 3, Interesting

The question is why? If they are doing this, are they simply going to present the results as their own, or are they going to work some magic and find the most relevant search results from ALL the engines and use those.

In the first case, it's a slimy business practice. In the second, it's fairly cunning ( and has been tried before ).

In either case, I doubt google is in any real danger. They are to search engines what MS is to the desktop. And while MS has squandered that advantage in the desktop arena ( reader homework: 250 word essay as to why ), google is only improving on their work.

--
Mod me down with all of your hatred and your journey towards the dark side will be complete!

Msn Crawling by clinko · 2004-11-11 07:46 · Score: 3, Informative

If you've been watching the logs to your site lately Microsoft has been RAPING most servers. Most crawlers will pick through pages with large lists 1 at a time, then come back every hour or so.

MSN starting last week has been pulling EVERY LINK in sequence from my site. Even the larger Artist Index pages of my site.

Seriously, I've had this same spider on my site for about 36 hours now.
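
For anyone who wants numbers rather than impressions, a per-crawler hit count is a few lines of Python. This is only a sketch: the log lines below are made up, the bot names are placeholders, and I am assuming Apache's "combined" log format, so adjust the pattern to whatever your server writes:

```python
import re
from collections import Counter

# Made-up access-log lines in Apache "combined" format, for illustration only.
SAMPLE_LINES = [
    '192.0.2.10 - - [11/Nov/2004:07:40:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [11/Nov/2004:07:40:03 +0000] "GET /artists/a.html HTTP/1.1" 200 20480 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [11/Nov/2004:07:40:05 +0000] "GET /artists/b.html HTTP/1.1" 200 20480 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [11/Nov/2004:08:15:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "OtherBot/2.0"',
]

# Match the request, status, size and referrer, then capture the last quoted field (the user agent).
LINE_RE = re.compile(r'"[A-Z]+ \S+ [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')

def hits_by_agent(lines):
    """Count requests per user-agent string, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts

for agent, hits in hits_by_agent(SAMPLE_LINES).most_common():
    print(agent, hits)
```

Run it over a day of real logs and compare the totals per crawler before deciding who is hammering you.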

Violates Google's TOS by Anonymous Coward · 2004-11-11 07:46 · Score: 5, Informative

From Google's Terms of Service

Personal Use Only

The Google Services are made available for your personal, non-commercial use only. You may not use the Google Services to sell a product or service, or to increase traffic to your Web site for commercial reasons, such as advertising sales. You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not "meta-search" Google. If you want to make commercial use of the Google Services, you must enter into an agreement with Google to do so in advance. Please contact us for more information.

Re:Violates Google's TOS by Dhalka226 · 2004-11-11 08:47 · Score: 2, Interesting

Ahhh. So, let's see. If you use google at work, you should be going to jail. Sounds fair.
Can anybody take your comments seriously after you say something like "you should be going to jail?" I don't know when Google became a government agency that could send officers to your door for violating a TOS. No, at best it would be a civil issue. More likely, as you say, they have that clause as a justification if they choose to block usage.
However, of all the companies out there, Google would be one of the least anal ones I could think of. Almost certainly that clause exists for only the purpose of blocking people doing what MS is (rightly or wrongly) accused of: Crawling them to offer a competing service. And THAT is taking money directly out of their pockets--you can bet if it were true and could be proven, they would do more than start firewalling. They'd be suing somebody's ass off.
Frankly, I think that is a perfectly legitimate attempt to protect one's business. But hey, if you think it's moronic and crappy, that's your call.

Re:You don't say! by Ryan+Stortz · 2004-11-11 07:47 · Score: 2, Funny

Wasn't that the "plot" to the movie Anti-Trust?

--
Bugs are just features that have been fixed.

Re:More lies from garcia by calibanDNS · 2004-11-11 07:47 · Score: 2, Insightful

Actually, search engines profit from ad revenue displayed on search result pages (among other things). The search engine with the best results SHOULD attract the most users. Increasing the number of users can correlate to increasing profits from ads. Thus, search engine sites profit from having THEIR 'bots crawl YOUR site. On the flip side, we as web users, profit (non-monetarily) by having a better search engine.

Absurd by targo · 2004-11-11 07:50 · Score: 4, Insightful

The claims are so absurd I don't even know where to start.
1) His whole theory is based on the "fact" that the only way in the world to find his pages is to use site:www.sitename.com in Google, implying that Google has cached the results from an earlier crawl. Of course, there is no way that the Microsoft search couldn't have also cached it.
2) Then, he claims that Microsoft is probably screen-scraping Google's results (for all the millions of sites out there), and using these results to recrawl those sites? This doesn't even make any sense.
3) And last but not least, Microsoft is certainly basing its whole search architecture on the assumption that Google wouldn't ever notice MSN mirroring its whole index. Yeah right.

--
When men used to be men

Probably Not.. by DelawareBoy · 2004-11-11 07:52 · Score: 2, Interesting

My website is the #1 site listed with specific Criteria on Google. Consistently for the last 2 months. I try the same thing with MSN search and My site does not even show up at all.

If they are searching Google, they haven't done it recently, or else they haven't gotten to my site yet.

Spike the results, then sue by G4from128k · 2004-11-11 07:52 · Score: 4, Informative

It would be easy for Google to insert a small fraction of non-sequiturs in the results, look at Microsoft's search results, and then sue for misuse. Even if MSFT uses random proxies to avoid detection, it cannot manually recheck all the hits to make sure they are correct (if it had the resources to check all the sites, it would not need to crawl Google). A few made-up sites or inappropriate search hits would be enough to establish a pattern of abuse.
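
The detection side of that idea can be sketched in a few lines of Python. Everything here is hypothetical: the planted URLs, the competitor's result list, and the function name are made up purely to show the comparison:

```python
def copied_canaries(planted, competitor_results):
    """Return the planted (made-up) URLs that also appear in a competitor's results."""
    return sorted(set(planted) & set(competitor_results))

# Hypothetical data, for illustration only.
PLANTED = ["http://canary-one.example/", "http://canary-two.example/"]
COMPETITOR = [
    "http://real-site.example/",
    "http://canary-two.example/",
    "http://another-real-site.example/",
]

print(copied_canaries(PLANTED, COMPETITOR))
```

Since the planted entries exist nowhere else, any match at all is hard to explain as independent crawling.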

--
Two wrongs don't make a right, but three lefts do.

Re:Spike the results, then sue by Dogun · 2004-11-11 09:54 · Score: 2, Informative

Seems you don't understand how search engines work^^

What a normal spider does is generally try different IP's, see if they're running a webserver. Then they do a DNS lookup, fetch http:///robots.txt and read that to decide if indexing is allowed, and where. Then it just walks through the website. A number of places on the website might not be directly accessible, but also not disallowed for indexing by robots.txt.

If some other site has a link to that webserver in some disconnected region of the website, then the crawler generally makes sure it's okay to index that against the robots.txt, and if so, indexes.

The accusation here is that Microsoft isn't finding these addresses on their own, but instead using google's 'site:host.domain' results as a shortcut, which would constitute a violation of google's terms of service.

They really only need to seed their crawler... by JustNiz · 2004-11-11 07:55 · Score: 5, Interesting

You can't get to every page on the internet just by starting at one page and recursively following links, therefore the more places you start from, the more likely you are to have 100% coverage.

I could imagine that Microsoft just needs a few thousand URL's evenly-spread across the internet just to seed their crawler, which they can get from Google by using a list of most popular queries.

Once their crawler has so many starting points it can do the rest itself.
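
To make the seeding point concrete, here is a toy Python sketch. The link graph is entirely made up (two small sites that never link to each other) and the function name is mine; it is a plain breadth-first walk, not a claim about how any real crawler is built:

```python
from collections import deque

# Toy link graph: two regions with no links between them (hypothetical URLs).
LINKS = {
    "a.example/": ["a.example/1", "a.example/2"],
    "a.example/1": ["a.example/"],
    "a.example/2": [],
    "b.example/": ["b.example/1"],
    "b.example/1": [],
}

def reachable(seeds):
    """Return every page reachable by following links from the seed pages."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(len(reachable(["a.example/"])))                # one seed: only the first region
print(len(reachable(["a.example/", "b.example/"])))  # two seeds: both regions
```

With one seed the walk never leaves the first site; adding a second seed in the other region is the only way to cover it, which is why a good seed list matters so much.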

a company I worked for did this once... by Skuld-Chan · 2004-11-11 08:00 · Score: 2, Insightful

And got banned from using google. Seriously.

Terrible article by angio · 2004-11-11 08:02 · Score: 4, Insightful

The author suggests that microsoft must be scraping google b/c the only place _he_ could find the URLs they're requesting was google's cache.

Uh.

Microsoft has been developing their internal search engine for quite a while now. Part of developing a search engine is using it to crawl and creating a large corpus of test data. It's hugely likely that M$ has had a working crawler system for much, much longer than would be indicated by their public announcement. Quite a few people who helped develop Altavista at HP/Compaq/DEC research joined Microsoft Research about two years ago - the kind of people who could write a high-performance crawler in their sleep and wake up feeling refreshed.

That article seems like baseless, uninformed speculation, to put it not-so-politely.

This could be entirely natural... by theluckyleper · 2004-11-11 08:02 · Score: 4, Insightful

I'm certainly no Microsoft groupie, but this behavior may not be as sinister as it seems. After all, Google is on the internet, too. There are links found all over the internet to Google, with some specific search term embedded in the URL. If MSN's bot happened upon a link to a Google search page, is it somehow wrong for the MSN bot to follow that link, and spider as normal?

--
Visit the Game Programming Wiki!

Re:This could be entirely natural... by IIH · 2004-11-11 08:47 · Score: 2, Informative

If MSN's bot happened upon a link to a Google search page, is it somehow wrong for the MSN bot to follow that link, and spider as normal?
Find a link, fine
Follow the link, fine
Spider the link, not fine - google's Robots.txt does not give them permission to.

--
Exigo spamos et dona ferentes

Interesting by Eric119 · 2004-11-11 08:02 · Score: 2, Insightful

Try entering a known Googlebomb into the MS search engine. "litigious bastards" shows www.sco.com as the number one hit.

In other news... by dfj225 · 2004-11-11 08:03 · Score: 2, Funny

Microsoft's beta search engine's index doubled in size to over 8 billion pages.

--
SIGFAULT

Hey Google, please don't make us... by potus98 · 2004-11-11 08:07 · Score: 4, Funny

Hey Google, please don't make us read those wacky JPG/GIF letter scrambles with criss-cross lines and input the random characters into a field before submitting a search.

"Hold on a sec while I Goog- Huh? Grrrr.... H... P... 7... O... wait no, 7... zero... ummm..."

--
This one gang kept wanting me to join cause I'm pretty good with a bo staff.

Bogus article by YU+Nicks+NE+Way · 2004-11-11 08:13 · Score: 2, Insightful

This whole article is based on the speculation of a web master who notices that a bot which allegedly isn't leaving behind a bot name is crawling his site. He then figures out that, oh look, there is a standard record in his server log.

And I'm supposed to take this clown's "friend" seriously? That's not a good start, anyway.

But then there's the real howler: the site can allegedly only be found through site: on Google. How does the friend know that? Has he done a complete crawl of the web to find all forward links to any image in his site -- even broken ones? MSNBot, like all bots, recognizes that many anchors are broken, and tries plausible corrections around the broken links. That's particularly useful with a deep link, where the deep link may have timed out but the shallow link still exists.

Re:You don't say! by cortana · 2004-11-11 08:16 · Score: 4, Funny

Movie? I thought that thing was a documentary!

Full Circle by Guppy06 · 2004-11-11 08:17 · Score: 5, Interesting

"Dowell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own."

It's interesting to know that Bill Gates has been forced to go back to his roots...

The best way to prepare [to be a programmer] is to write programs, and to study great programs that other people have written. In my case, I went to the garbage cans at the Computer Science Center and fished out listings of their operating system.

Arg I hate M$ by OverlordQ · 2004-11-11 08:28 · Score: 3, Interesting

Yes this might sound like a rant, but somehow (partly my fault), the MSN Spider bot found one of my joke cgi scripts that translate pages to my own imaginary language. It's linked nowhere on my site, and maybe 3-4 places on the entire web. Said MSNBot began to pull PDF after PDF through the script, in addition to other large files, it also tried mailto: links. All in all said spider pulled about 1GB of data in a single day. My site's previous average was about maybe 300-400MB a Month. Let's just say that entire M$ IP Netblock was quickly filtered through iptables.

--
Your hair look like poop, Bob! - Wanker.

Highly unlikely by David+Leppik · 2004-11-11 08:28 · Score: 3, Insightful

Google keeps track of IP addresses and blocks which are doing an unusually high number of searches and disables requests from them.

How do I know? Because a friend of mine decided to find out how common all TLAs are (three-letter acronyms) by counting Google hits on each TLA. This was before the Google API, so he did it with good old fashioned HTTP/HTML. It didn't take long for Google to flag him as evil and block access from his IP block.
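How Google actually detects this isn't public as far as I know, but the general idea of throttling by address block can be sketched like so. Everything here (the `BlockLimiter` class, the /24 grouping, the thresholds) is an assumption for illustration, not Google's method.

```python
import ipaddress
from collections import defaultdict, deque


class BlockLimiter:
    """Sliding-window request limiter keyed by IPv4 block (hypothetical)."""

    def __init__(self, max_requests, window_seconds, prefix=24):
        self.max_requests = max_requests
        self.window = window_seconds
        self.prefix = prefix
        self.hits = defaultdict(deque)  # block -> timestamps of accepted hits

    def allow(self, ip, now):
        """Return True if a request from ip at time now is within the limit."""
        block = ipaddress.ip_network(f"{ip}/{self.prefix}", strict=False)
        times = self.hits[block]
        # Forget requests that have fallen out of the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        if len(times) >= self.max_requests:
            return False
        times.append(now)
        return True


if __name__ == "__main__":
    limiter = BlockLimiter(max_requests=2, window_seconds=60)
    for ip, t in [("10.0.0.1", 0), ("10.0.0.2", 1), ("10.0.0.3", 2)]:
        print(ip, limiter.allow(ip, t))
```

Because the counter is kept per block, hopping between neighbouring addresses doesn't help; the scraper would need addresses spread across many unrelated blocks.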

Sure, Microsoft could find some way around this-- using different enough IP addresses to conceal the source-- but that's more trouble than it's worth. Worse yet, it sets up a cat-and-mouse game and keeps M$ dependent on Google-- when their stated goal is to beat Google at its own game.

I've got a simpler explanation for what the author is seeing. His evidence is based on the fact that some pages being requested exist only in Google's cache. Well, spiders are supposed to do breadth-first searches so they don't hit the same site too often. Microsoft is probably going against data it collected a few weeks ago but hasn't put on its public servers yet. (Why not? Could be lots of things. Maybe they haven't put enough hardware on the front end to support the amount of data they have on the back end. Or maybe they're just slow.)

As much as I'd like to bash M$, there's nothing here that really looks suspicious to me.

Not quite by SamMichaels · 2004-11-11 08:36 · Score: 3, Insightful

Dowell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own.
My garbage doesn't have a copyright statement, contain my patented technology, nor does it come with terms of service or licensing agreements.

Re:Don't block them! by NuclearDog · 2004-11-11 09:15 · Score: 2, Funny

Or just return a bunch of fake links:

"Madame X's House of Leather"
"Hot slutty teens!"
"Wet & Wild College Girls!"

Etc.

Microsoft would stop leeching REAL quick.

--
This statement is forty-five characters long.

Re:More lies from cowardly trolls by Asphalt · 2004-11-11 09:59 · Score: 2, Informative

They do profit from your data. However, being that it is publicly available on an HTTP server, that's pretty much their right. That's like you handing me $5 for me to tell you which magazines you might like to buy.

And MSN crawling Google's site is really no different. As long as the Google data is on a public server, it is fair game to crawl.

what ridiculous logic... by the-build-chicken · 2004-11-11 10:01 · Score: 4, Funny

microsoft is looking at old pages, google uses a cache...ergo microsoft must be using google.

if we're going to use that kind of logic, I could just as easily come up with "afghanistan is in the middle east and supports terrorists, iraq is in the middle east...ergo, iraq must support terrorists", and use it to make a case for invading iraq...but you don't see......oh wait

google doesn't allow bots to crawl google.com... by lixlpixel · 2004-11-11 10:33 · Score: 2, Informative

see their robots.txt: http://www.google.com/robots.txt

so if the msn bot does what they say, it isn't doing what it's supposed to do.
