Slashdot Mirror


Is Microsoft Crawling Google?

triplecoil writes "Jason Dowdell over at WebProNews has written a piece questioning a tactic Microsoft might be using to beef up its new search engine. He thinks they might be dipping into Google's results to supplement its own. Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own."

22 of 480 comments

  1. They been crawling like mad lately by mpost4 · · Score: 5, Interesting

    I can say that they have been crawling like mad lately: Google, Yahoo, and MSN. I say this because my site has had a lot of traffic from all three, and it is not a popular or even an important site. Not just once or a few times a week, but every day. There are big updates coming. I was not surprised to see the article about Google doubling their index; I knew something was coming from the way they are crawling unimportant/unpopular sites.

  2. Meta-search? by grasshoppa · · Score: 3, Interesting

    The question is why? If they are doing this, are they simply going to present the results as their own, or are they going to work some magic and find the most relevant search results from ALL the engines and use those?

    In the first case, it's a slimy business practice. In the second, it's fairly cunning ( and has been tried before ).

    In either case, I doubt google is in any real danger. They are to search engines what MS is to the desktop. And while MS has squandered that advantage in the desktop arena ( reader homework: 250 word essay as to why ), google is only improving on their work.

    --
    Mod me down with all of your hatred and your journey towards the dark side will be complete!
  3. Re:Don't concern yourself with this crap... by Anonymous Coward · · Score: 1, Interesting

    This is insightful? If your stuff is on the net, you should not expect it to remain private. So their bot is crawling your site. Get over it. If you don't want them crawling your stuff for profit, protect the directory or just ban them. Or just put meta tags in your pages like they said.

    The bot should be treated as no different from another anonymous human. If not the Googlebot, one of the other search engines is bound to find it.
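
The "just ban them" option above usually means a robots.txt rule, which a well-behaved crawler checks before fetching anything. Here is a minimal sketch of that check using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URLs are made up for illustration, and nothing here says how any real bot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one crawler entirely, and keep every other
# crawler out of a single directory.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow: /private/
"""

def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed(ROBOTS_TXT, "msnbot", "http://example.com/index.html"))
print(allowed(ROBOTS_TXT, "Googlebot", "http://example.com/index.html"))
print(allowed(ROBOTS_TXT, "Googlebot", "http://example.com/private/notes.html"))
```

Note that this only works on crawlers that choose to run such a check. A bot that ignores robots.txt has to be blocked at the server or firewall instead.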

  4. Re:Try this term on MSN search by Kalak451 · · Score: 2, Interesting

    Also note that the "SPONSORED SITES" part of the page goes away on that search.

  5. Probably Not.. by DelawareBoy · · Score: 2, Interesting

    My website is the #1 site listed for specific criteria on Google, and has been consistently for the last 2 months. I try the same thing with MSN search and my site does not show up at all.

    If they are searching Google, they haven't done it recently, or else they haven't gotten to my site yet.

  6. Re:Try this term on MSN search by hehman · · Score: 2, Interesting

    I think you meant this URL: more evil than microsoft

  7. They really only need to seed their crawler... by JustNiz · · Score: 5, Interesting

    You can't get to every page on the internet just by starting at one page and recursively following links, therefore the more places you start from, the more likely you are to have 100% coverage.

    I could imagine that Microsoft just needs a few thousand URLs evenly spread across the internet to seed their crawler, which they can get from Google by using a list of the most popular queries.

    Once their crawler has so many starting points it can do the rest itself.
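
The seeding argument above can be sketched with a toy breadth-first crawl. The link graph below is invented (two clusters of pages with no links between them), and `crawl` is a hypothetical helper, not anyone's actual crawler. It only shows that pages unreachable from one seed become reachable once a second seed is added.

```python
from collections import deque

def crawl(links, seeds):
    """Breadth-first crawl over a link graph, starting from every seed URL."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical toy web: the a.example and b.example pages never link to
# each other, so neither cluster can be discovered from the other.
TOY_WEB = {
    "a.example/": ["a.example/1", "a.example/2"],
    "a.example/1": ["a.example/"],
    "a.example/2": [],
    "b.example/": ["b.example/1"],
    "b.example/1": ["b.example/"],
}

one_seed = crawl(TOY_WEB, ["a.example/"])
two_seeds = crawl(TOY_WEB, ["a.example/", "b.example/"])
print(len(one_seed), len(two_seeds))  # prints "3 5"
```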

  8. Re:Microsoft stealing someone elses technology??? by isometrick · · Score: 3, Interesting

    Google's "data" is collected, generated, and stored by their technology.

    I won't steal your oven, but I'll steal your food!

  9. Re:Does it violate Google's Terms of Service by TheRaven64 · · Score: 4, Interesting

    Do Google's terms of service have any legal standing? Click-through EULAs don't in many jurisdictions, and I don't remember ever even seeing Google's ToS, let alone agreeing to them.

    --
    I am TheRaven on Soylent News
  10. Re:Don't concern yourself with this crap... by mollymoo · · Score: 5, Interesting
    No offense dude, but you are the one who put the site out there publicly. Now if they are DoSing you then you have a valid complaint, but robots.txt is just there as a friendly suggestion.

    There's more to it than that. Google caches your pages and makes that cache of your copyright material available. Arguably if you have used your robots.txt file to tell it not to index (and therefore cache) your pages and it still does they are breaching copyright. OK, the Google cache is the world's largest breach of copyright anyway, but if you have told its spider not to index and it does regardless, that's a different ballgame.

    Putting it out there on the web does not give anyone the right to do with it as they please.

    --
    Chernobyl 'not a wildlife haven' - BBC News
  11. Re:Difficult to do if Google doesn't want them to by blamanj · · Score: 5, Interesting

    Yes, and don't think Google wouldn't notice. My company had a summer intern that once wrote a program that started sucking a lot of information out of Google. They blocked our entire site for about three days until everything got straightened out.

  12. Re:Microsoft stealing someone elses technology??? by netringer · · Score: 4, Interesting
    I fail to see how they are stealing any of Google's technology. Data maybe.
    Or are they stealing Google's innovations?

    Lo! Note how the review articles of the last few days mention the innovative NEW FEATURE of MSN search called, "Search Near Me" which stores the calculated lat/long of addresses on web pages and returns matches near you.

    Note how Google's long in beta Google Local (http://local.google.com) stores the calculated lat/long of addresses on pages and returns matches near you. Google Local works better.

    Another Microsoft innovation! Let's hope WE remember who had it first!
    --
    Ever dream you could fly? Get up from the Flight Sim. I Fly
  13. Re:Don't concern yourself with this crap... by Eric+Giguere · · Score: 3, Interesting

    Sure, I see crawlers on my site all the time sometimes hitting the same URL over and over again. Do I understand their repetitive behavior? No.

    Google gives a partial answer to this on their GoogleBot page:

    In general, Googlebot should only download one copy of each file from your site during a given crawl. Occasionally the crawler is stopped and restarted, and it may recrawl pages that it has recently retrieved. These recrawls should happen infrequently.

    If they're playing around with new indexing algorithms then I would expect to see more of these multiple hits.

    Eric
    How to (gently) detect Internet Explorer
  14. Re:Don't concern yourself with this crap... by liquidsin · · Score: 4, Interesting

    Hmmm...let's call "robots.txt" a "copyright control device" in that it states who may and may not have access to my copyrighted images directory. I'd bet a DMCA suit or two for circumventing your copyright control device would get them to pay attention...

    --
    do not read this line twice.
  15. Full Circle by Guppy06 · · Score: 5, Interesting
    "Dowdell likens it to leaving your garbage on the curb--anyone could conceivably go through it and take whatever is there for their own."

    It's interesting to know that Bill Gates has been forced to go back to his roots...
    The best way to prepare [to be a programmer] is to write programs, and to study great programs that other people have written. In my case, I went to the garbage cans at the Computer Science Center and fished out listings of their operating system.
  16. Arg I hate M$ by OverlordQ · · Score: 3, Interesting

    Yes, this might sound like a rant, but somehow (partly my fault) the MSN spider bot found one of my joke CGI scripts that translates pages to my own imaginary language. It's linked nowhere on my site, and in maybe 3-4 places on the entire web. Said MSNBot began to pull PDF after PDF through the script, in addition to other large files; it also tried mailto: links. All in all, said spider pulled about 1GB of data in a single day. My site's previous average was maybe 300-400MB a month. Let's just say that the entire M$ IP netblock was quickly filtered through iptables.

    --
    Your hair look like poop, Bob! - Wanker.
  17. Re:Try this term on MSN search by Red+Alastor · · Score: 4, Interesting
    Sure. Bill Gates is an atheist so he thinks that God is evil. Open Source too, especially that pesky browser eating his market share.

    Before you mod me down for that, I'd like to mention that this isn't Microsoft bashing since I am an atheist too and so are Linus and RMS.

    --
    Slashdot anagrams to "Sad Sloth"
  18. Re:Violates Google's TOS by Dhalka226 · · Score: 2, Interesting

    Ahhh. So, let's see. If you use google at work, you should be going to jail. Sounds fair.

    Can anybody take your comments seriously after you say something like "you should be going to jail?" I don't know when Google became a government agency that could send officers to your door for violating a TOS. No, at best it would be a civil issue. More likely, as you say, they have that clause as a justification if they choose to block usage.

    However, of all the companies out there, Google would be one of the least anal ones I could think of. Almost certainly that clause exists only for the purpose of blocking people doing what MS is (rightly or wrongly) accused of: crawling them to offer a competing service. And THAT is taking money directly out of their pockets--you can bet if it were true and could be proven, they would do more than start firewalling. They'd be suing somebody's ass off.

    Frankly, I think that is a perfectly legitimate attempt to protect one's business. But hey, if you think it's moronic and crappy, that's your call.

  19. Re:Don't concern yourself with this crap... by CowboyBob500 · · Score: 2, Interesting

    As far as I see, MSNBot is behaving itself whilst Googlebot is hungriest - (much as I hate to stick up for Microsoft).

    Robot                       Hits  Bandwidth  Last visit
    Googlebot (Google)            74  945.51 KB  11 Nov 2004 - 03:02
    Netcraft Web Server Survey    13  0          10 Nov 2004 - 23:48
    Mirago                         6  76.44 KB   02 Nov 2004 - 04:13
    MSNBot                         6  76.44 KB   05 Nov 2004 - 05:58

    It's interesting that Mirago and MSNBot have taken exactly the same bandwidth in the same amount of visits. Are MS innov^H^H^H^H^H buying new technology again?

    Bob

  20. Re:Difficult to do if Google doesn't want them to by zentigger · · Score: 3, Interesting

    Better yet, provide those addresses with the correct search results, but change all the links to the raunchiest porn (or pictures of little puppy dogs, if that better suits your sense of moral rectitude).

    --

    the above is my personal opinion and does not necessarily reflect that of the little voices in my head

  21. Re:Try this term on MSN search by KFury · · Score: 3, Interesting

    That they put google up there as the number one search result is not that surprising. What gets me is they have themselves at number four.

    Not anymore. They apparently hand-edited their own company out of the results about an hour ago.

  22. Re:Difficult to do if Google doesn't want them to by asavage · · Score: 3, Interesting
    If you go to whatismyip you get a website that displays your IP address. If you search msn and google for that site the search results show the IP address of the bot that indexed that site.

    For google I get: crawl-66-249-64-167.googlebot.com [66.249.64.167]

    for msn I get: fj1011.inktomisearch.com [66.196.91.16]

    and msn beta I get: 65.54.188.83 (can't find associated domain)

    So we can tell that at least this result wasn't stolen from Google.
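
The check above boils down to a reverse DNS lookup on the crawler's IP, then reading the operator off the hostname. A rough sketch in Python: the suffix table is an assumption built from the two hostnames quoted above, and `operator_from_hostname` and `reverse_lookup` are hypothetical helper names. A real check would also resolve the hostname forward again, since PTR records can be forged.

```python
import socket

# Assumed mapping from reverse-DNS suffix to crawler operator. Only the two
# suffixes come from the hostnames quoted above; the labels are illustrative.
KNOWN_SUFFIXES = {
    ".googlebot.com": "Google",
    ".inktomisearch.com": "Inktomi",
}

def operator_from_hostname(hostname):
    """Guess which crawler operator a reverse-DNS hostname belongs to."""
    host = hostname.lower().rstrip(".")
    for suffix, operator in KNOWN_SUFFIXES.items():
        if host.endswith(suffix):
            return operator
    return None

def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if there is no record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None

print(operator_from_hostname("crawl-66-249-64-167.googlebot.com"))
print(operator_from_hostname("fj1011.inktomisearch.com"))
```

An address with no PTR record, like the MSN beta one above, makes `reverse_lookup` return None, so the hostname tells you nothing and you are left with a whois lookup on the netblock.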