Slashdot Mirror


Is Microsoft Crawling Google?

triplecoil writes "Jason Dowdell over at WebProNews has written a piece questioning a tactic Microsoft might be using to beef up its new search engine. He thinks it might be dipping into Google's results to supplement its own. Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own."

38 of 480 comments

  1. Don't concern yourself with this crap... by garcia · · Score: 4, Insightful

    Has anyone out there seen similar behavior on their own sites? Please comment with your qualitative/objective data if so.

    Sure, I see crawlers on my site all the time, sometimes hitting the same URL over and over again. Do I understand their repetitive behavior? No. Do I care what they are doing? No, as long as they are obeying my robots.txt.

    I have complained before about MSNbot ignoring changes to robots.txt while Google happily changed its habits (I can't find the link, sorry). My recent fighting with Googlebot came to a head when I had to disallow them access to my gallery completely because they refused to honor anything except Disallow: /. I had to go so far as to point Googlebot at my robots.txt and tell it to remove all the previous links. It was rather annoying dealing with support via email from Googlebot, as they have apparently taken the stance of "we don't care but you should put meta tags in all your files so that we don't index those pages." Umm, you are crawling MY site for YOUR profit, you do as I say, not the other way around.

    Do I care if MSNbot is crawling Google and then finding sites and links to search? No, as it's none of OUR concern. What is OUR concern is our own robots.txt and how the spiders interact with our sites through that file. Let Google deal with Microsoft/MSNbot if that's what needs to be done, but don't concern yourself with it otherwise.
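
    For anyone who wants to check what a given robots.txt actually permits, here is a minimal sketch using Python's standard urllib.robotparser. The rules, the example.com URLs, and the helper name allowed() are made up for illustration; they are not taken from any real site, and a well-behaved crawler is only assumed to apply the rules this way.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep msnbot out of the gallery, everyone out of /private/.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /gallery/

User-agent: *
Disallow: /private/
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given user agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("msnbot", "Googlebot"):
        for path in ("/gallery/pic.jpg", "/private/notes.html"):
            url = "http://example.com" + path
            print(agent, path, allowed(agent, url))
```

    Note that a crawler matching a specific User-agent group uses only that group, so in this sketch msnbot is not bound by the catch-all /private/ rule.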

    1. Re:Don't concern yourself with this crap... by finkployd · · Score: 4, Insightful

      Umm, you are crawling MY site for YOUR profit, you do as I say, not the other way around.

      No offense dude, but you are the one who put the site out there publicly. Now if they are DoSing you then you have a valid complaint, but robots.txt is just there as a friendly suggestion. I can write a search bot today that completely ignores it and there is nothing wrong with that (except perhaps ethically, but even that is arguable). If you don't want people (or bots) viewing it then password protect it or take it off the public interweb.

    2. Re:Don't concern yourself with this crap... by mollymoo · · Score: 5, Interesting
      No offense dude, but you are the one who put the site out there publicly. Now if they are DoSing you then you have a valid complaint, but robots.txt is just there as a friendly suggestion.

      There's more to it than that. Google caches your pages and makes that cache of your copyrighted material available. Arguably, if you have used your robots.txt file to tell it not to index (and therefore cache) your pages and it still does, they are breaching copyright. OK, the Google cache is the world's largest breach of copyright anyway, but if you have told its spider not to index and it does regardless, that's a different ballgame.

      Putting it out there on the web does not give anyone the right to do with it as they please.

      --
      Chernobyl 'not a wildlife haven' - BBC News
    3. Re:Don't concern yourself with this crap... by liquidsin · · Score: 4, Interesting

      Hmmm...let's call "robots.txt" a "copyright control device" in that it states who may and may not have access to my copyrighted images directory. I'd bet a DMCA suit or two for circumventing your copyright control device would get them to pay attention...

      --
      do not read this line twice.
    4. Re:Don't concern yourself with this crap... by nofx_3 · · Score: 4, Funny

      Yes, but I invented the "Information Historic Old Country Road". It's not fast, and there ain't much information, but it's so durn quaint you gotta love it.

      -kaplanfx

      --
      Visualize Whirled Peas
    5. Re:Don't concern yourself with this crap... by ad0gg · · Score: 5, Informative
      If you don't want your site indexed or cached by Google, go here and follow the directions.

      Remove yourself from google

      "Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, your webmaster must first insert the appropriate meta tags into the page's HTML code."

      --

      Have you ever been to a turkish prison?

    6. Re:Don't concern yourself with this crap... by mollymoo · · Score: 4, Insightful
      If you don't want your site indexed or cached by Google, go here and follow the directions.

      I shouldn't need to go and fill out some form for every search engine to protect my rights. One accepted standard way to say "do not index this" should be sufficient. This is an automated system. There is an accepted automated method to stop crawlers indexing your site (robots.txt). If they (Google or anyone else) take your copyrighted content and reproduce it automatically when their automatic system could have automatically respected your explicitly stated and legally protected rights, they are knowingly committing a flagrant copyright violation.

      --
      Chernobyl 'not a wildlife haven' - BBC News
  2. Difficult to do if Google doesn't want them to by Anonymous Coward · · Score: 5, Insightful

    All Google has to do is run some unusual queries through MSN, check their logs, find the IP addresses and block them.
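
    A rough sketch of the log-checking half of that idea: plant a nonsense "canary" query, then look for it in your own access logs and collect the client IPs that requested it. The Common Log Format, the sample lines, the canary string, and the function name ips_requesting() are all assumptions for illustration, not anything Google is known to run.

```python
import re
from collections import Counter

# Assumes Common Log Format: ip ident user [time] "METHOD path protocol" ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?:GET|POST) (?P<path>\S+)'
)


def ips_requesting(log_lines, canary):
    """Count requests per client IP whose request path contains the canary."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and canary in match.group("path"):
            hits[match.group("ip")] += 1
    return hits


if __name__ == "__main__":
    # Made-up sample log lines using documentation-only IP ranges.
    sample = [
        '192.0.2.10 - - [12/Nov/2004:10:00:00 +0000] "GET /search?q=zxqv+flarn HTTP/1.0" 200 512',
        '198.51.100.7 - - [12/Nov/2004:10:00:01 +0000] "GET /search?q=weather HTTP/1.0" 200 512',
        '192.0.2.10 - - [12/Nov/2004:10:00:02 +0000] "GET /search?q=zxqv+flarn&start=10 HTTP/1.0" 200 512',
    ]
    print(ips_requesting(sample, "zxqv+flarn"))
```

    The addresses that come back are candidates for blocking; a real deployment would also have to allow for proxies and shared addresses.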

    1. Re:Difficult to do if Google doesn't want them to by carpe_noctem · · Score: 4, Funny

      Why stop there? Google should just ban all of Microsoft's netblocks to prevent their employees from gathering useful information from them...

      "Begun, this war of the corporations has!"

      --
      "Quoting famous computer scientists out of context is the root of all evil (or at least most of it) in programming." - K
    2. Re:Difficult to do if Google doesn't want them to by blamanj · · Score: 5, Interesting

      Yes, and don't think Google wouldn't notice. My company had a summer intern that once wrote a program that started sucking a lot of information out of Google. They blocked our entire site for about three days until everything got straightened out.

  3. Does it violate Google's Terms of Service by winkydink · · Score: 4, Insightful
    If so, they have legal remedies.

    If not, it's called doing business and gaining an advantage any legitimate way that you can.

    I think the interesting bit is in the conclusion. If MS is using this to establish a baseline, they can benchmark their spider against Google's over time.

    --

    "I'd rather be a lightning rod than a seismometer." -Ken Kesey

    1. Re:Does it violate Google's Terms of Service by TheRaven64 · · Score: 4, Interesting

      Do Google's terms of service have any legal standing? Click-through EULAs don't in many jurisdictions, and I don't remember ever even seeing Google's ToS, let alone agreeing to them.

      --
      I am TheRaven on Soylent News
    2. Re:Does it violate Google's Terms of Service by nick13245 · · Score: 5, Informative

      Yes it does.
      From Google's Privacy Center (http://www.google.com/terms_of_service.html):

      Personal Use Only

      The Google Services are made available for your personal, non-commercial use only. You may not use the Google Services to sell a product or service, or to increase traffic to your Web site for commercial reasons, such as advertising sales. You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not "meta-search" Google. If you want to make commercial use of the Google Services, you must enter into an agreement with Google to do so in advance. Please contact us for more information.

  4. Yea, and by BrianGa · · Score: 5, Funny

    The new search engine's name will be Mooglesoft.

    1. Re:Yea, and by MooseByte · · Score: 4, Funny

      "The new search engine's name will be Mooglesoft."

      Which will subsequently be sued by SCOogle, the latest startup from The Canopy Group, after announcing they purchased the rights to the Internet in a complex transaction which is documented in a briefcase somewhere in Germany.

  5. But will this mean Google can crawl back? by biffnix · · Score: 5, Funny

    Couldn't Google just crawl Microsoft in return? Then they'd be stuck in an endless loop, and William Shatner can then swoop in, crack some skulls, and save the day.

    Or something like that.

    biffnix

    --
    Don't Die Wondering
  6. Microsoft stealing someone else's technology??? by Shant3030 · · Score: 4, Funny

    Nah, never happens....

    --
    100% Insightful
    1. Re:Microsoft stealing someone else's technology??? by netringer · · Score: 4, Interesting
      I fail to see how they are stealing any of Google's technology. Data maybe.
      Or are they stealing Google's innovations?

      Lo! Note how the review articles of the last few days mention the innovative NEW FEATURE of MSN search called, "Search Near Me" which stores the calculated lat/long of addresses on web pages and returns matches near you.

      Note how Google's long-in-beta Google Local (http://local.google.com) stores the calculated lat/long of addresses on pages and returns matches near you. Google Local works better.

      Another Microsoft innovation! Let's hope WE remember who had it first!
      --
      Ever dream you could fly? Get up from the Flight Sim. I Fly
  7. They been crawling like mad lately by mpost4 · · Score: 5, Interesting

    I can say that they've been crawling like mad as of late: Google, Yahoo, and MSN. I say this because my site has had a lot of traffic from all three, and my site is not a popular or even an important one, but I've seen a lot of traffic from them. Not just once a week or a few times a week but every day. There are big updates coming. I was not surprised to see the article about Google doubling their index; I knew something was coming from the way they are crawling unimportant/unpopular sites.

  8. Try this term on MSN search by bbzzdd · · Score: 5, Funny
    1. Re:Try this term on MSN search by JohnnyKlunk · · Score: 5, Funny

      OK. This is really freaky. Try

      more evil than god and you get FIREFOX as the first result (then google, of course)

    2. Re:Try this term on MSN search by finkployd · · Score: 4, Funny

      That they put google up there as the number one search result is not that surprising. What gets me is they have themselves at number four.

    3. Re:Try this term on MSN search by Red+Alastor · · Score: 4, Interesting
      Sure. Bill Gates is an atheist so he thinks that God is evil. Open Source too, especially that pesky browser eating his market share.

      Before you mod me down for that, I'd like to mention that this isn't Microsoft bashing since I am an atheist too and so are Linus and RMS.

      --
      Slashdot anagrams to "Sad Sloth"
    4. Re:Try this term on MSN search by mormop · · Score: 4, Funny

      It's not so much the result that scares me as the thought processes that led you to try it ;)

      --
      Hmmmmmm..... Deep fried and look like Squirrel.
  9. They wouldn't... by Wrathie · · Score: 4, Funny

    Such trouble. Just buy the damned company.

    1. Re:They wouldn't... by RobertB-DC · · Score: 4, Funny

      Such trouble. Just buy the damned company.

      Come on, be serious. Google doesn't plan to buy Microsoft until *after* they reach the one-year post-IPO mark, silly.

      --
      Stressed? Me? Of course not. Stress is what a rubber band feels before it breaks, silly.
  10. Shocked I tell you by finkployd · · Score: 5, Funny

    Well, that kind of business practice would be completely out of character for Microsoft.

    This is a non-story. A good Slashdot headline will be when they get caught actually NOT doing something like this.

    Microsoft Has Original Idea and Implements it By Themselves
    From the 70%-of-slashdot-editors-suffered-heart-attacks-reading-this-submission Dept.

  11. Google is Catholic? by TheAmazingBob · · Score: 5, Funny

    "Google happily changed its habits..."

    Google is Catholic?

  12. Violates Google's TOS by Anonymous Coward · · Score: 5, Informative
    From Google's Terms of Service
    Personal Use Only

    The Google Services are made available for your personal, non-commercial use only. You may not use the Google Services to sell a product or service, or to increase traffic to your Web site for commercial reasons, such as advertising sales. You may not take the results from a Google search and reformat and display them, or mirror the Google home page or results pages on your Web site. You may not "meta-search" Google. If you want to make commercial use of the Google Services, you must enter into an agreement with Google to do so in advance. Please contact us for more information.
  13. Absurd by targo · · Score: 4, Insightful

    The claims are so absurd I don't even know where to start.
    1) His whole theory is based on the "fact" that the only way in the world to find his pages is to use site:www.sitename.com in Google, implying that Google has cached the results from an earlier crawl. Of course, there is no way that the Microsoft search couldn't have also cached it.
    2) Then, he claims that Microsoft is probably screen-scraping Google's results (for all the millions of sites out there), and using these results to recrawl those sites? This doesn't even make any sense.
    3) And last but not least, Microsoft is certainly basing its whole search architecture on the assumption that Google wouldn't ever notice MSN mirroring its whole index. Yeah right.

  14. Spike the results, then sue by G4from128k · · Score: 4, Informative

    It would be easy for Google to insert a small fraction of non-sequiturs in the results, look at Microsoft's search results, and then sue for misuse. Even if MSFT uses random proxies to avoid detection, it cannot manually recheck all the hits to make sure they are correct (if it could, it would have the resources to check all the sites itself and would not need to crawl Google). A few made-up sites or inappropriate search hits would be enough to establish a pattern of abuse.
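
    The detection step can be sketched in a few lines: keep a list of the planted entries and see which of them show up in the other engine's results. The URLs and the function name leaked_canaries() below are invented for illustration; this says nothing about how either company actually works.

```python
def leaked_canaries(planted_urls, competitor_results):
    """Return the planted (fake) URLs that also appear in a competitor's results."""
    return sorted(set(planted_urls) & set(competitor_results))


if __name__ == "__main__":
    # Hypothetical planted entries that exist nowhere else on the web.
    planted = ["http://example.com/zz-fake-1", "http://example.com/zz-fake-2"]
    # Hypothetical results returned by the other search engine.
    results = [
        "http://example.org/real-page",
        "http://example.com/zz-fake-2",
    ]
    print(leaked_canaries(planted, results))
```

    Since the planted URLs have no other source, any non-empty result is evidence that the listings were copied rather than crawled independently.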

    --
    Two wrongs don't make a right, but three lefts do.
  15. They really only need to seed their crawler... by JustNiz · · Score: 5, Interesting

    You can't get to every page on the internet just by starting at one page and recursively following links; therefore, the more places you start from, the more likely you are to have 100% coverage.

    I could imagine that Microsoft just needs a few thousand URLs evenly spread across the internet to seed their crawler, which they can get from Google by using a list of most popular queries.

    Once their crawler has so many starting points it can do the rest itself.
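
    To make the seeding point concrete, here is a toy breadth-first crawl over an in-memory link graph. The graph, the page names, and the function name crawl() are hypothetical; a real crawler would fetch pages over HTTP and parse links instead of reading a dict.

```python
from collections import deque


def crawl(link_graph, seeds):
    """Return every page reachable from the seeds by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        # Pages with no entry in the graph are treated as having no links.
        for target in link_graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


if __name__ == "__main__":
    # Two islands of pages that do not link to each other.
    graph = {"a": ["b"], "b": ["a"], "c": ["d"], "d": []}
    print(sorted(crawl(graph, ["a"])))       # reaches only the first island
    print(sorted(crawl(graph, ["a", "c"])))  # a second seed reaches both
```

    With a single seed the crawl never leaves its own island of links; each extra seed in a different island extends the coverage.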

  16. Terrible article by angio · · Score: 4, Insightful

    The author suggests that microsoft must be scraping google b/c the only place _he_ could find the URLs they're requesting was google's cache.

    Uh.

    Microsoft has been developing their internal search engine for quite a while now. Part of developing a search engine is using it to crawl and creating a large corpus of test data. It's hugely likely that M$ has had a working crawler system for much, much longer than would be indicated by their public announcement. Quite a few people who helped develop Altavista at HP/Compaq/DEC research joined Microsoft Research about two years ago - the kind of people who could write a high-performance crawler in their sleep and wake up feeling refreshed.

    That article seems like baseless, uninformed speculation, to put it not-so-politely.

  17. This could be entirely natural... by theluckyleper · · Score: 4, Insightful

    I'm certainly no Microsoft groupie, but this behavior may not be as sinister as it seems. After all, Google is on the internet, too. There are links found all over the internet to Google, with some specific search term embedded in the URL. If MSN's bot happened upon a link to a Google search page, is it somehow wrong for the MSN bot to follow that link, and spider as normal?

    --
    Visit the Game Programming Wiki!
  18. Hey Google, please don't make us... by potus98 · · Score: 4, Funny

    Hey Google, please don't make us read those wacky JPG/GIF letter scrambles with criss-cross lines and input the random characters into a field before submitting a search.

    "Hold on a sec while I Goog- Huh? Grrrr.... H... P... 7... O... wait no, 7... zero... ummm..."

    --
    This one gang kept wanting me to join cause I'm pretty good with a bo staff.
  19. Re:You don't say! by cortana · · Score: 4, Funny

    Movie? I thought that thing was a documentary!

  20. Full Circle by Guppy06 · · Score: 5, Interesting
    "Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own."

    It's interesting to know that Bill Gates has been forced to go back to his roots...
    The best way to prepare [to be a programmer] is to write programs, and to study great programs that other people have written. In my case, I went to the garbage cans at the Computer Science Center and fished out listings of their operating system.
  21. what ridiculous logic... by the-build-chicken · · Score: 4, Funny


    microsoft is looking at old pages, google uses a cache...ergo microsoft must be using google.

    if we're going to use that kind of logic, I could just as easily come up with "afghanistan is in the middle east and supports terrorists, iraq is in the middle east...ergo, iraq must support terrorists", and use it to make a case for invading iraq...but you don't see......oh wait