Is Microsoft Crawling Google?

Posted by CmdrTaco on Thursday November 11, 2004 @07:36AM from the put-on-your-foil-hat dept.

triplecoil writes "Jason Dowdell over at WebProNews has written a piece questioning a tactic Microsoft might be using to beef up its new search engine. He thinks they might be dipping into Google's results to supplement its own. Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own."

Don't concern yourself with this crap... by garcia · 2004-11-11 07:37 · Score: 4, Insightful

Has anyone out there seen similar behavior on their own sites? Please comment with your qualitative/objective data if so.

Sure, I see crawlers on my site all the time, sometimes hitting the same URL over and over again. Do I understand their repetitive behavior? No. Do I care what they are doing? No, as long as they are obeying my robots.txt.

I have complained before about MSNbot ignoring changes to robots.txt while Google happily changed its habits (I can't find the link, sorry). My recent fighting with Googlebot came to a head when I had to disallow them access to my gallery completely because they refused to honor anything except Disallow: /. I had to go so far as to point Googlebot at my robots.txt and tell it to remove all the previous links. It was rather annoying dealing with support via email from Googlebot as they have apparently taken on the stance of "we don't care but you should put meta tags in all your files so that we don't index those pages." Umm, you are crawling MY site for YOUR profit, you do as I say, not the other way around.

Do I care if MSNbot is crawling Google and then finding sites and links to search? No, as it's none of OUR concern. What is OUR concern is our own robots.txt and how the spiders interact with our sites through that file. Let Google deal with Microsoft/MSNbot if that's what needs to be done, but don't concern yourself with it otherwise.
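
For anyone who has not looked at how a well-behaved spider is supposed to consult robots.txt, here is a minimal sketch using Python's standard urllib.robotparser. The rules, the bot names, and the example.com URLs are all made up for illustration; this is not the poster's actual file, and it says nothing about how Googlebot or MSNbot implement the check internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption for this sketch): keep every crawler
# out of /gallery/, and keep one made-up bot out of the whole site.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /gallery/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `allowed("OtherBot", "http://example.com/gallery/1.jpg")` is refused while the same bot may fetch pages outside /gallery/. Nothing forces a crawler to run a check like this, which is the point being argued in the replies.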
1. Re:Don't concern yourself with this crap... by finkployd · 2004-11-11 07:45 · Score: 4, Insightful
  
  Umm, you are crawling MY site for YOUR profit, you do as I say, not the other way around.
  
  No offense dude, but you are the one who put the site out there publicly. Now if they are DoSing you then you have a valid complaint but robots.txt is just there as a friendly suggestion. I can write a search bot today that completely ignores it and there is nothing wrong with that (except perhaps ethically, but even that is arguable). If you don't want people (or bots) viewing it then password protect it or take it off the public interweb.
2. Re:Don't concern yourself with this crap... by garcia · 2004-11-11 07:47 · Score: 2, Insightful
  
  Now if they are DoSing you then you have a valid complaint but robots.txt is just there as a friendly suggestion.
  
  Crawling a gallery of images (and all image property links as well) all day for several days might be considered "DoSing"; I consider it rude.
  
  You're right, they don't have to obey the robots.txt but they should when they say they will.
3. Re:Don't concern yourself with this crap... by Anonymous Coward · 2004-11-11 08:32 · Score: 2, Insightful
  
  Well anything on the internet that doesn't have normal web server access controls blocking access, is open slather IMO. That's what makes the internet so cool. Doesn't mean you can't still copyright your material so others can't use it, but I think for search engine purposes there is an implied agreement between YOU and THEM - and I think there should be.
  
  In a sense it's like tourism. The world is full of stuff like historical buildings, and the owners of those places have legal rights against theft, damage, etc. But the tour companies can still take people around the streets and show them the places without necessarily having to pay a fee.
4. Re:Don't concern yourself with this crap... by Anonymous Coward · 2004-11-11 08:43 · Score: 2, Insightful
  
  Since databases are currently copyrightable, I would argue that a website is a database. If Google insists, I would imagine that MSN, hitting Google's database...er, website, would amount to using a copyrighted database without its owner's permission, which in this case could amount to being a robots.txt file that punts known websites that link w/o attribution.
  
  I would imagine a metacrawler, which attributes its links back to Google, is probably OK, because it keeps Google's adstream intact when the user clicks on the link to a Google search result.
  
  But MSNSearch (or whatever it's called), taking Google search results as its own without attribution, well, that might be a copyright infringement...
  
  If you were an on-line bookstore and deep-linked to Amazon's reviews while portraying them as your own, well, you're gonna get a C&D from Amazon's lawyers awfully fast.
5. Re:Don't concern yourself with this crap... by mollymoo · 2004-11-11 11:41 · Score: 4, Insightful
  
  If you don't want your site indexed or cached by Google, go here and follow the directions.
  
  I shouldn't need to go and fill out some form for every search engine to protect my rights. One accepted standard way to say "do not index this" should be sufficient. This is an automated system. There is an accepted automated method to stop crawlers indexing your site (robots.txt). If they (Google or anyone else) take your copyrighted content and reproduce it automatically when their automatic system could have automatically respected your explicitly stated and legally protected rights, they are knowingly making a flagrant copyright violation.
  
  --
  Chernobyl 'not a wildlife haven' - BBC News
6. Re:Don't concern yourself with this crap... by Anonymous Coward · 2004-11-11 13:35 · Score: 1, Insightful
  
  I don't think so.
  
  They cached your page with whatever copyright notices you put on it.
  
  What, are you going to tell me that when you zipped your website and put it in your p2p shared folder, and someone downloaded it, that they are committing a flagrant copyright violation?
  
  IF YOU DON'T WANT PEOPLE TO SEE IT, DON'T PUT IT ON THE INTERNET NUMBSKULL.
7. Re:Don't concern yourself with this crap... by big_gibbon · 2004-11-11 21:16 · Score: 2, Insightful
  
  It was rather annoying dealing with support via email from Googlebot as they have apparently taken on the stance of "we don't care but you should put meta tags in all your files so that we don't index those pages." Umm, you are crawling MY site for YOUR profit, you do as I say, not the other way around.
  
  Google should follow the robots.txt - definitely. But there needs to be some way of confirming on your website that you actually want the pages removing - otherwise what's to stop your competitors "accidentally" entering your URL into the removal form? Meta elements would seem to be the natural choice.
  
  P
Difficult to do if Google doesn't want them to by Anonymous Coward · 2004-11-11 07:37 · Score: 5, Insightful

All Google has to do is run some unusual queries through MSN, check their logs, find the IP addresses and block them.
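
A rough sketch of that idea: scan a web server access log for a marker query nobody else would type, and collect the client IPs that asked for it. The log layout (Common Log Format style), the marker string, and the sample addresses are assumptions for illustration, not anything known about Google's or Microsoft's systems.

```python
import re
from urllib.parse import unquote_plus

# Assumed layout: client IP first, timestamp in brackets, request line in quotes.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)"')


def ips_requesting(marker: str, log_lines) -> set:
    """Collect client IPs whose request line contains the marker query."""
    hits = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        # Decode %xx escapes and '+' so the marker matches as it was typed.
        if match and marker in unquote_plus(match.group("request")):
            hits.add(match.group("ip"))
    return hits


# Made-up sample log lines using documentation-only IP ranges.
SAMPLE_LOG = [
    '192.0.2.10 - - [11/Nov/2004:07:40:00 -0500] "GET /search?q=zxqv+plugh+42 HTTP/1.0" 200 512',
    '198.51.100.7 - - [11/Nov/2004:07:41:00 -0500] "GET /search?q=slashdot HTTP/1.0" 200 900',
]
```

Here `ips_requesting("zxqv plugh 42", SAMPLE_LOG)` picks out only the first address. Whatever turns up would still need a human look before anything gets blocked.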
Does it violate Google's Terms of Service by winkydink · 2004-11-11 07:38 · Score: 4, Insightful

If so, they have legal remedies.
If not, it's called doing business and gaining an advantage any legitimate way that you can.
I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.

--
"I'd rather be a lightning rod than a seismometer." -Ken Kesey
1. Re:Does it violate Google's Terms of Service by Lev13than · 2004-11-11 07:44 · Score: 3, Insightful
  
  Does it violate Google's Terms of Service? If so, they have legal remedies.
  If not, it's called doing business and gaining an advantage any legitimate way that you can.
  I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.
  
  If I copy your work and take credit for it, does it violate your terms of service? If so, you have legal remedies. If not, it's called doing business and gaining an advantage any legitimate way that I can.
  
  Furthermore, I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.
  
  --
  When you have nothing left to burn you must set yourself on fire
Why not? by Anonymous Coward · 2004-11-11 07:39 · Score: 1, Insightful

Doesn't that mean even more results?

I'd do the same thing if I could. This is all "speculation" anyway, but since it feeds the stereotype of the insidious Microsoft, it gets posted front page to this "tech news" site.
Re:More lies from garcia by calibanDNS · 2004-11-11 07:47 · Score: 2, Insightful

Actually, search engines profit from ad revenue displayed on search result pages (among other things). The search engine with the best results SHOULD attract the most users. Increasing the number of users can correlate to increasing profits from ads. Thus, search engine sites profit from having THEIR 'bots crawl YOUR site. On the flip side, we as web users profit (non-monetarily) by having a better search engine.
Absurd by targo · 2004-11-11 07:50 · Score: 4, Insightful

The claims are so absurd I don't even know where to start.
1) His whole theory is based on the "fact" that the only way in the world to find his pages is to use site:www.sitename.com in Google, implying that Google has cached the results from an earlier crawl. Of course, there is no way that the Microsoft search couldn't have also cached it.
2) Then, he claims that Microsoft is probably screen-scraping Google's results (for all the millions of sites out there), and using these results to recrawl those sites? This doesn't even make any sense.
3) And last but not least, Microsoft is certainly basing its whole search architecture on the assumption that Google wouldn't ever notice MSN mirroring its whole index. Yeah right.

--
When men used to be men
Re:But will this mean Google can crawl back? by Anonymous Coward · 2004-11-11 07:59 · Score: 0, Insightful

Google would gain nothing from crawling Microsoft. All they'd be getting is their own material.

If Microsoft is indeed doing this however, they could become real competition a lot sooner than you'd think.
a company I worked for did this once... by Skuld-Chan · 2004-11-11 08:00 · Score: 2, Insightful

And got banned from using google. Seriously.
Terrible article by angio · 2004-11-11 08:02 · Score: 4, Insightful

The author suggests that microsoft must be scraping google b/c the only place _he_ could find the URLs they're requesting was google's cache.

Uh.

Microsoft has been developing their internal search engine for quite a while now. Part of developing a search engine is using it to crawl and creating a large corpus of test data. It's hugely likely that M$ has had a working crawler system for much, much longer than would be indicated by their public announcement. Quite a few people who helped develop Altavista at HP/Compaq/DEC research joined Microsoft Research about two years ago - the kind of people who could write a high-performance crawler in their sleep and wake up feeling refreshed.

That article seems like baseless, uninformed speculation, to put it not-so-politely.
This could be entirely natural... by theluckyleper · 2004-11-11 08:02 · Score: 4, Insightful

I'm certainly no Microsoft groupie, but this behavior may not be as sinister as it seems. After all, Google is on the internet, too. There are links found all over the internet to Google, with some specific search term embedded in the URL. If MSN's bot happened upon a link to a Google search page, is it somehow wrong for the MSN bot to follow that link, and spider as normal?

--
Visit the Game Programming Wiki!
Interesting by Eric119 · 2004-11-11 08:02 · Score: 2, Insightful

Try entering a known Googlebomb into the MS search engine. "litigious bastards" shows www.sco.com as the number one hit.
worthless by Anonymous Coward · 2004-11-11 08:09 · Score: 1, Insightful

This article is an example of why blogs are worthless ... He never thought of *asking* Microsoft, did he?
Bogus article by YU+Nicks+NE+Way · 2004-11-11 08:13 · Score: 2, Insightful

This whole article is based on the speculation of a web master who notices that a bot which allegedly isn't leaving behind a bot name is crawling his site. He then figures out that, oh look, there is a standard record in his server log.

And I'm supposed to take this clown's "friend" seriously? That's not a good start, anyway.

But then there's the real howler: the site can allegedly only be found through site: on Google. How does the friend know that? Has he done a complete crawl of the web to find all forward links to any image in his site -- even broken ones? MSNBot, like all bots, recognizes that many anchors are broken, and tries plausible corrections around the broken links. That's particularly useful with a deep link, where the deep link may have timed out but the shallow link still exists.
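
To make the "plausible corrections" point concrete: one simple guess a crawler could make when a deep link is dead is to walk up the path one segment at a time. The sketch below is only an illustration of that idea; the function name and the example URL are hypothetical, and nothing here is claimed to be MSNBot's real behavior.

```python
from urllib.parse import urlsplit, urlunsplit


def shallower_candidates(url: str) -> list:
    """List progressively shallower URLs to try when a deep link is dead."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    # Drop one trailing path segment at a time, ending at the site root.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates
```

For a dead link such as http://example.com/gallery/2004/img.jpg this yields the /gallery/2004/ directory, then /gallery/, then the site root, so a webmaster could see requests for URLs that no page on the live site links to directly.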
Highly unlikely by David+Leppik · 2004-11-11 08:28 · Score: 3, Insightful

Google keeps track of IP addresses and blocks which are doing an unusually high number of searches and disables requests from them.

How do I know? Because a friend of mine decided to find out how common all TLAs are (three-letter acronyms) by counting Google hits on each TLA. This was before the Google API, so he did it with good old fashioned HTTP/HTML. It didn't take long for Google to flag him as evil and block access from his IP block.
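
For scale, there are 26^3 = 17,576 three-letter combinations, so that experiment meant many thousands of automated queries from one place. A minimal sketch of how a site might flag that kind of traffic is below: count requests per address block in a sliding time window. The /24 grouping, the threshold, and the window length are assumptions chosen for illustration, not Google's actual policy.

```python
from collections import defaultdict, deque
import ipaddress


class BlockRateLimiter:
    """Flag IPv4 /24 blocks that exceed max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # block -> timestamps of recent requests

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now`; return False once the block is over the limit."""
        # Group by /24 so neighbouring addresses share one counter (an assumption).
        block = ipaddress.ip_network(f"{ip}/24", strict=False)
        stamps = self._seen[block]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        stamps.append(now)
        return len(stamps) <= self.max_requests
```

Counting per block rather than per address is what makes "use a few different IPs" less effective unless the addresses are spread widely.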

Sure, Microsoft could find some way around this-- using different enough IP addresses to conceal the source-- but that's more trouble than it's worth. Worse yet, it sets up a cat-and-mouse game and keeps M$ dependent on Google-- when their stated goal is to beat Google at its own game.

I've got a simpler explanation for what the author is seeing. His evidence is based on the fact that some pages being requested exist only in Google's cache. Well, spiders are supposed to do breadth-first searches so they don't hit the same site too often. Microsoft is probably going against data it collected a few weeks ago but hasn't put on its public servers yet. (Why not? Could be lots of things. Maybe they haven't put enough hardware on the front end to support the amount of data they have on the back end. Or maybe they're just slow.)
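
As a small illustration of breadth-first ordering, here is a sketch over an in-memory link graph, with no network access. The graph and page names are made up, and a real crawler would add politeness delays and per-host queues that are left out here.

```python
from collections import deque


def breadth_first_order(start: str, links: dict) -> list:
    """Return pages in the order a breadth-first crawl would visit them."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # Queue each newly discovered link once; deeper pages wait their turn.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

Because discovered URLs sit in a queue until their turn comes, a spider can end up requesting a page long after the link that led to it was recorded, which fits the stale-data explanation above.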

As much as I'd like to bash M$, there's nothing here that really looks suspicious to me.
Not quite by SamMichaels · 2004-11-11 08:36 · Score: 3, Insightful

Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own.
My garbage doesn't have a copyright statement, contain my patented technology, nor does it come with terms of service or licensing agreements.