Slashdot Mirror


Follow Up on Google Favoring Yahoo

After yesterday's story about Google favoring Yahoo links, I got word from Sergey Brin at Google. He says that the reason the site tested showed so poorly is that a robots.txt file prevented Google's crawler from fully indexing the site. The robots.txt file has since disappeared, and the next index should show a change in the rankings.

96 comments

  1. Google indexing method by mikeraz · · Score: 1

    Is it just rumor that Google results are at least partially based on the number of links from other pages pointing to a page? robots.txt wouldn't (shouldn't?) have an effect on that.

    --

    There's more to it than this.

    1. Re:Google indexing method by gaudior · · Score: 1

      If I obey a Disallow, and never see a page on your site, I will never take into account the links that other pages have to it.
      --

  2. Re:Partial retraction from MedWebPlus by MikeTheYak · · Score: 1

    The answer is that Google is spidering Yahoo!. Google provides a free way for people to submit URLs to be spidered. Yahoo! is entitled to make use of this just like everybody else.

  3. Re:Partial retraction from MedWebPlus by Fishstick · · Score: 1

    >a partial retraction from MedWebPlus:

    Actually a 'partial retraction' from Eric Rumsey who does not appear to be affiliated with MedWeb, rather the University of Iowa, Hardin Library for the Health Sciences.

    >but they still question why Yahoo is on the rise)

    "...Google reportedly says that they are now crawling *all* of Yahoo! as part of their agreement...If Google is changing the way they're treating Yahoo! in rankings, they should say something about it. "

    Sounds like a 'save-face' whine if I ever heard one. Why should anyone expect Google to announce something like this? "We just made a deal with Yahoo! to be their search engine. We will be increasing our coverage of Yahoo! sites from around 20% to 100%". Duh! If you were Yahoo! and you were starting to use Google, wouldn't you want to make sure 100% of your own content will be indexed? Sheesh!

    --

    There is much cruelty in the universe, John.
    Yeah, we seem to have the tour map.

  4. amazing the ignorance by Anonymous Coward · · Score: 1

    Amazing how many people don't know how a robots.txt file works. Using a robots.txt, you can specify which parts of a site get indexed. That way you can make sure private, or 'not ready for prime time' pages aren't indexed by search engines. Take a look at the robots.txt that's up there now.

    http://www.yahoo.com/robots.txt

    User-agent: *
    Disallow: /gnn
    Disallow: /msn
    Disallow: /pacbell
    Disallow: /pb

    # Rover is a bad dog
    User-agent: Roverbot
    Disallow: /

    No spiders (aka site indexing) are allowed in the /gnn, /msn, /pacbell, or /pb subdirectories. It looks like the spider with the user-agent of Roverbot is disallowed entirely.

    Also, it's been a while but last time I checked, complying with the robots.txt file was completely voluntary. It should be possible to write a spider that ignores the robots file completely and goes ahead and indexes the entire site. Of course, if an admin sees an IP that's run through his entire site several times in the past week, they'd probably get suspicious and block the IP.

    1. Re:amazing the ignorance by klanza · · Score: 1

      Of course complying with robots.txt is voluntary. It has to be. That's the nature of the web. What happens on the client (browser, spider, etc) side is completely independent of what happens on the server side. But people continue to think their server can "force" the browsers to do something, or their browser can download a CGI, etc, etc, etc. Why don't people understand?

  5. Impressive by Fervent · · Score: 1
    I think that's pretty impressive: that a reader from within Google would react so quickly to a Slashdot post (and in an honest, favorable manner).

    As a side-note, I'm sincerely glad that Yahoo chose Google as its new search engine provider. For a while I stopped visiting Yahoo in order to go to Google directly, but now I can go back to Yahoo for all my searching/calendar/weather/club needs.

    --

    - I don't care if they globalize against free speech. All my best free thoughts are done in my head.

  6. Re:what good is a robots.txt nowadays... by Pinball+Wizard · · Score: 2
    According to this Slashdot article, "the unauthorized alteration, damage or use of a computer system" is now a felony in Michigan.

    A robot not respecting robots.txt is certainly in the class of unauthorized use. So if Michigan's law catches on across the U.S., maybe there will be some real protection for web admins to protect sites or directories from being indexed. Slap the bot company with a felony! Maybe the law isn't so odious after all.

    --

    No, Thursday's out. How about never - is never good for you?

  7. Re:what good is a robots.txt nowadays... by SEWilco · · Score: 2
    bots from companies such as the above mentioned continue aggressive spidering.

    If their robots will not honor your robots.txt then you do not have to honor their robots nor give them useful information. You could detect them and feed them random responses -- either the types of responses which they do like or the types which they don't like. 43,000 links to metallica -- which when an expensive human looks at them will be found to be artwork made with glitter-covered glue...

  8. Re:Partial retraction from MedWebPlus by rgmoore · · Score: 2

    And this may be where the cause and effect of the Yahoo/Google agreement comes into play. Before there was an agreement between Yahoo and Google, Yahoo would have some reason not to want Google to be spidering their site. After all, you don't want your competitor to take advantage of your hard work. After the agreement, though, they would certainly want Google to spider their site, since they now want to show up as well as possible on Google. The result is that Google is taken off their spiders.txt (and we now know that Google is polite and obeys spiders.txt) and their rankings start shooting up.

    --

    There's no point in questioning authority if you aren't going to listen to the answers.

  9. Re:So what about yahoo? by Fishstick · · Score: 2

    No, it means that Yahoo!'s robots.txt doesn't block crawlers from 100% of their site like MedWeb was doing:

    http://www.yahoo.com/robots.txt

    User-agent: *
    Disallow: /gnn
    Disallow: /msn
    Disallow: /pacbell
    Disallow: /pb

    # Rover is a bad dog <http://www.roverbot.com>
    User-agent: Roverbot
    Disallow: /

    So they let just about anybody index most of their site, except for the listed exceptions (except roverbot, he is a bad dog :-) ). Google apparently wasn't indexing their whole site for some other reason; now, as a result of the new agreement, they are indexing 100%.

    The presence of a robots.txt file doesn't block crawlers by default. The bots are supposed to look at the contents of robots.txt and follow the rules.
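A rough way to check how a rule-following bot would read the file quoted above is Python's standard `urllib.robotparser`. This is only a sketch: the `load_rules` helper, the "Googlebot" agent string, and the example paths are illustrative assumptions, not anything Google or Yahoo! has published.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted above.
YAHOO_RULES = """\
User-agent: *
Disallow: /gnn
Disallow: /msn
Disallow: /pacbell
Disallow: /pb

# Rover is a bad dog <http://www.roverbot.com>
User-agent: Roverbot
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text without touching the network (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(YAHOO_RULES)

# "Googlebot" is just an illustrative user-agent name; the paths are made up.
print(rules.can_fetch("Googlebot", "http://www.yahoo.com/Computers/"))
print(rules.can_fetch("Googlebot", "http://www.yahoo.com/gnn/index.html"))
print(rules.can_fetch("Roverbot", "http://www.yahoo.com/Computers/"))
```

The first check should come back allowed and the other two disallowed, which matches the reading above: the file only restricts the listed directories, plus everything for Roverbot.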

    --

    There is much cruelty in the universe, John.
    Yeah, we seem to have the tour map.

  10. Isn't it obvious? by WPL510 · · Score: 1

    Think about it... Yahoo gets, by far, more visitors than Google. Thus, it's no surprise that Yahoo results utilizing the Google database will rise. Correct me if I'm wrong, but doesn't Google base results on popularity? Therefore, if many more Yahoo users view sites in a certain category than Google, and Yahoo rankings are different, then Google will in its own way reflect Yahoo usage. So, Yahoo is affecting Google - not being propped up. Give it a rest. And if I'm wrong that's ok too.

  11. Re:So what about yahoo? by qwerty823 · · Score: 1

    I guess that means we should start adding a humans.txt file to our sites that tell people what they are allowed to read and remember.

  12. sample size by onShore_Jake · · Score: 1

    It's lame to use a small number of sites to base the conclusions on. LAME.

    1. Re:sample size by Fishstick · · Score: 1

      Well, his area of focus was medical directory sites, not overall search engine coverage.

      Take a look at his site and you'll see that they (University of Iowa, Hardin Library for the Health Sciences) are one of these directory sites and they saw they were at #15 at one point among their peers. When he noticed a shift in Yahoo! sites going up, while WebMed was dropping from #1, he was reasonably suspicious. Seems he presented his information factually, although it would have been better if he had just contacted Google or Yahoo! first. Where everyone got riled up was here on /. where the ugly jumping to conclusions occurred. _That_ is what I think is lame.

      --

      There is much cruelty in the universe, John.
      Yeah, we seem to have the tour map.

    2. Re:sample size by edhall · · Score: 1
      It's lame to use a small number of sites to base the conclusions on.

      Wow; after 200 posts on the subject yesterday and today, this is the first post I've seen mentioning this simple fact. I'd hope that /.'ers would be a little more statistics-savvy (they usually are).

      -Ed
  13. Just like 'more evil than satan himself' by braindamage.org · · Score: 1

    This doesn't mean that Google has sold out,
    just that their ranking is somewhat imperfect.
    When you type in random phrases you often get
    randomish results. It used to be the case that
    'more evil than satan himself' returned microsoft
    as the #1 result; I don't think that MS paid for
    that. (or disney, who came up second or third)

    1. Re:Just like 'more evil than satan himself' by david+duncan+scott · · Score: 1
      Google looks for links using the terms as well as the content of the page itself. Every time somebody had a link reading "The Great Satan" that pointed at Microsoft, the little counter associating "Satan" with "Microsoft" clicked up.

      When that story got around, of course, people put mentions of it on their sites, causing the same counters to spin even more.
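The "little counter" idea can be sketched as a tally of the words used in link text, keyed by the page the link points to. This is only an illustration of the general technique, not Google's actual method; `AnchorCollector` and `anchor_term_counts` are hypothetical names, and the sample pages are made up.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, link text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def anchor_term_counts(pages):
    """Map each link target to a Counter of the words used to link to it."""
    counts = defaultdict(Counter)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, text in collector.links:
            for word in text.lower().split():
                counts[href][word] += 1
    return counts


# Made-up pages that link to the same target with similar wording.
sample_pages = [
    '<a href="http://www.microsoft.com/">The Great Satan</a>',
    '<p>See <a href="http://www.microsoft.com/">Satan</a> here</p>',
]
sample_counts = anchor_term_counts(sample_pages)
```

In this toy version the target page never has to contain the word itself; the words other people use in their links are enough to associate it with the term.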

      --

      This next song is very sad. Please clap along. -- Robin Zander

  14. Not exactly true by braindamage.org · · Score: 1

    robots.txt can disallow crawling, it does not disallow indexing. (I think that only Meta-noindex can do that) Google can "partially" index pages based on links even if it has never seen the pages. This is what you get when searching for 'medwebplus' on google.
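To make the crawl/index distinction concrete, here is a minimal sketch of how an indexer could look for a robots meta tag once it has already fetched a page. The class and function names are hypothetical, and treating a missing tag as "may index" is an assumption of this sketch.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def may_index(html):
    """Return False only when the page carries a robots meta 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" not in parser.directives
```

Note the catch this implies: a crawler can only see the meta tag on a page it is allowed to fetch, so a page blocked in robots.txt never gets the chance to say "noindex".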

  15. Re:what good is a robots.txt nowadays... by eudas · · Score: 1

    quoth the poster:

    "Maybe the law isn't so odious after all."

    live by the sword, die by the sword...

    eudas

    --
    Blessed is he who expects the worst, for he shall not be disappointed.
  16. Re:robots.txt by tinla · · Score: 2

    I know not everyone knows how Search Engines work, and mostly you don't need to know. Everyone who has a page on the web should read this though: "A Standard for Robot Exclusion". It's been a standard since 30 June 1994 and that's not bad for an Internet standard.

    I assure you that Google.com follows it to the letter. All the main SEs do.. if they didn't they might even be leaving themselves open to legal challenges. Read the old mailing lists at Webcrawler (search for "robots.txt" on google) and you'll see that people used to get quite wound up by rude SEs back in 94. A Web server's CPU time was worth something then.

    As for all the lone gunmen out there cooking up theories...read this. Google has ALREADY sold the top links for some keywords. They don't hide it, read the FAQ on their site and you'll find the address to write to to buy listings. Maybe you should read the Demographics. You're the market being sold. Seems fair to me.

    The actual search results (not the adverts) are genuine and not sold. Makes sense... consider the whole Google model (who links to you affects your ranking) and its clear Yahoo, Disney etc will all rank very highly. Lots of links into them because they are quality sites.

    I've done a lot of work with SEs over the years and Google is far more genuine than anyone else in the market, but they have to make ends meet.

    Take a look at this also. Can we spot the paid for listings yet?

    --
    0daymeme.com: Great stuff.
  17. Re:Why. by farsighed · · Score: 1

    Yahoo doesn't want you to ask, maybe?

  18. How Google indexed even the excluded parts by yerricde · · Score: 3

    If robots.txt was there, how did Google index the site at all (instead of just poorly)?

    • The presence of robots.txt doesn't automatically exclude everything, only the directories specified in the file.
    • Google can index even robots-excluded sites by looking at the 50 or so characters on either side of a link, on the pages that link to the excluded pages. That's why Google sometimes gives URLs without any content.

    <O
    ( \
    XGNOME vs. KDE: the game!
    --
    Will I retire or break 10K?
  19. Re:Morons, all of *us*. :) by Xzzy · · Score: 1

    > you really should get your facts straight before
    > calling anyone else a moron

    I was refering to humanity as a whole, really. ;)

    Us humans have at least 6000 years of recorded history, which would offer plenty of "facts" dictating how we, as a whole, can be complete dimwits.

    A good number of the conflicts we've had in the past were based on a lack of communication. I was just borrowing this Google thing as a recent example, which to me, is the real issue. It's not the results given by a search engine that's bothersome; it's the fact that people are displaying no inclination to communicate despite all the (good) reasons for attempting it.

  20. Re:what good is a robots.txt nowadays... by Alanzilla · · Score: 1

    I'll try anything once... twice if I like it... three times if it's blonde. :)

  21. Distributed search engines and robots.txt by yerricde · · Score: 2

    some of these poorly written programs check the robots.txt file every 5 minutes when they're in a spidering mood. Nice. You've got to wonder how much bandwidth is wasted due in part to moronic programming practice.

    Many spiders (e.g. Googlebot) are distributed among many colocated boxen so they can get better network performance. Each box needs its own copy of robots.txt so it can choose whether or not to index pages and follow links. Read your server logs again; are all the robots.txt hits from the same IP address, or are they from different machines?
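One way to answer that question from your own logs is to tally robots.txt requests per client address. This sketch assumes the server writes Common Log Format lines; `robots_hits_by_host` is a hypothetical helper and the field positions would need adjusting for other log formats.

```python
from collections import Counter


def robots_hits_by_host(log_lines):
    """Count /robots.txt requests per client host.

    Assumes Common Log Format:
    host ident user [date zone] "METHOD path protocol" status size
    """
    hits = Counter()
    for line in log_lines:
        fields = line.split()
        # After whitespace splitting, the request path is the seventh field.
        if len(fields) >= 7 and fields[6].split("?")[0] == "/robots.txt":
            hits[fields[0]] += 1
    return hits


# Made-up log lines using documentation-only addresses.
sample_log = [
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /robots.txt HTTP/1.0" 200 120',
    '192.0.2.11 - - [10/Oct/2000:13:55:40 -0700] "GET /robots.txt HTTP/1.0" 200 120',
    '192.0.2.10 - - [10/Oct/2000:13:56:02 -0700] "GET /index.html HTTP/1.0" 200 4096',
    '192.0.2.10 - - [10/Oct/2000:14:00:36 -0700] "GET /robots.txt HTTP/1.0" 200 120',
]
sample_hits = robots_hits_by_host(sample_log)
```

Many hosts with a few hits each would point to a distributed crawler; one host with a large count looks more like the badly behaved program described in the quote.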


    <O
    ( \
    XGNOME vs. KDE: the game!
    --
    Will I retire or break 10K?
    1. Re:Distributed search engines and robots.txt by substrate · · Score: 1
      They're both, I can understand the distributed effect somewhat, but some idiot-bots hang on hoping my robots.txt will just disappear. Others are distributed but seem to retry every day.

      There have been several occasions where they've slowed my cable modem to a crawl. My favourites are the ones that don't even appear in a DNS lookup. It's so tempting to denial of service them at that point.

  22. I'm sorry by Cyno · · Score: 1

    I'd just like to say I'm sorry for ever thinking google might have been corrupted by Yahoo or any other corporate entity. Google, you're my favorite search engine! You kick ass! And I will never question your loyalty again... at least 'til the next slashdot article tells me otherwise.

  23. Re:robots.txt ? by Chasuk · · Score: 2

    Uhm, because it was a troll?

  24. Don't know what robots.txt is? by dale@redhat.com · · Score: 3

    If you don't know what robots.txt is, look at A Method for Web Robots Control Internet RFC...

    --

    -- A hundred thousand lemmings can't be wrong!
  25. Re:So one company favors another... by LordNimon · · Score: 1

    What's wrong with Latin-American fruit?
    --
    And the men who hold high places must be the ones who start
    To mold a new reality... closer to the heart
  26. Re:what good is a robots.txt nowadays... by GlassUser · · Score: 1

    Would you consider a robots.txt to be an access control mechanism? Doesn't that mean that the aggressive spiders (or operators thereof) are violating the DMCA? Yeah, a stretch, more for laugh value.

  27. Morons, all of 'em. by Xzzy · · Score: 5

    Gee, I wonder how many problems in the world could be solved if people put out a little bit of effort into communicating with each other. Rather than asking Google what's up.. the guys in the story yesterday put MONTHS of effort into proving how they're getting shafted by Google's search engine. They make accusations.

    Google hears about it via Slashdot, and in less than 24 hours, the real reason is revealed.

    Kinda makes me wonder at humanity, when we're all so locked into our own little shells that we occupy ourselves trying to prove something that five minutes of talking could solve. Sort of like how most Americans never say hello to their neighbor, and can live next to them for years without ever exchanging niceties.

    1. Re:Morons, all of 'em. by rabidcow · · Score: 1

      That's assuming Google's people wouldn't just ignore them. Sometimes you need a megaphone to be heard.

    2. Re:Morons, all of 'em. by Vassily+Overveight · · Score: 2

      I've written to google a number of times and except for once, have always gotten an answer.

      --

      "If I have seen further than other men, it is by stepping on their glasses." - Michael Swaine

    3. Re:Morons, all of 'em. by _Bean_ · · Score: 2

      The guys in yesterday's story weren't the ones being "shafted". They said we've been running this study of google and we happened to notice a trend that seems pretty fishy. The study was never done to prove that anyone was being shafted. Although you're right that talking to google would have been the right thing to do, you really should get your facts straight before calling anyone else a moron.

    4. Re:Morons, all of 'em. by knowbody · · Score: 1

      it shows a lot about the author's fundamental assumptions - that the corps who run search engines are out to make money and are willing to sell their integrity of results for a buck - the author probably bases this opinion on what happened to altavista. altavista DID begin to sell their rankings (i believe fravia's site proved this a few years ago) and that is probably one reason everyone uses google now. but the author let his cynical view determine his research angle. that is a very typical weakness, something scientists try to avoid but sometimes fail.

  28. Robot Exclusion Protocol by Anonymous Coward · · Score: 3

    The robot exclusion protocol (http://info.webcrawler.com/mak/projects/robots/norobots.html) is a way for websites to tell robots what they shouldn't be crawling. When a robot wants to crawl http://foo.bar.com/ it will first fetch http://foo.bar.com/robots.txt. If that file does NOT exist, that is taken to mean implicit permission to crawl anything it can find on that site. If it does exist, then the patterns contained in it are used to restrict what portions of that site are crawled. Every site has its own robots.txt (or lack thereof). To look at Yahoo's robots.txt, just point your browser to http://www.yahoo.com/robots.txt.

    If a site has a robots.txt that is telling the robots not to crawl, they have no business yelling at search engines when their pages don't show up.
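The first two steps described above (work out where robots.txt lives, then decide what its presence or absence means) can be sketched like this. Both function names are hypothetical, and the "defer" branch for any status other than 200 or 404 is an assumption of this sketch rather than something the protocol text above spells out.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def crawl_policy(status_code):
    """Map the HTTP status of the robots.txt fetch to a crawler decision."""
    if status_code == 404:
        return "allow-all"      # no file: implicit permission to crawl
    if status_code == 200:
        return "apply-rules"    # file exists: restrict crawling by its patterns
    return "defer"              # assumption: anything else, try again later
```

For example, `robots_url("http://foo.bar.com/some/page.html")` gives `http://foo.bar.com/robots.txt`, which is why every host has exactly one file to check.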

  29. Re:ok now say your sorry everyone by don_carnage · · Score: 1
    Hell no! [rats] That would be like [rats] the media [rats] admitting that subliminal [rats] advertising does not and has never worked![rats]

    Sorry for the offtopic post, but it just bothers me how easily the media can mislead the public.


    --
  30. Buggy SingingFish spider? by KlomDark · · Score: 1

    Anybody encountered a buggy bot coming from herring.singingfish.com? My web server log showed thousands of duplicate entries last week for the same file. I traced it through the log, and it was spidering along fine, and then got stuck on a file and re-requested it over 3500 times, each request 9.5k. Really annoying bandwidth sucker.

  31. Wait, I'm confused by um...+Lucas · · Score: 1

    Is all this saying that if you don't have a robots.txt file, your site will be indexed, but if you have one, regardless of what it says, your site won't be? I thought you just used robots.txt files to direct the indexers.... I'd finally added one, just to prevent bothersome 404 messages, so does that mean I won't get crawled by many engines, even though it is set to allow all of them?

    I've noticed some hits from inktomi that just get the file and then go away... what's the deal?

    1. Re:Wait, I'm confused by AgentWebRanking+Free · · Score: 1

      I checked with Agentwebranking Freeware ( http://www.aadsoft.com ) that your site is listed on the following search engines: Altavista: OK (only 2 pages); Google: OK (91 pages); Hotbot: OK (only 1 page); Lycos: OK (4 pages).

      --
      Freeware - Search engines ranker and analyzer - 5 stars Zdnet - http://www.aadsoft.com
    2. Re:Wait, I'm confused by um...+Lucas · · Score: 1

      Oh COOL! I must bookmark your site!

  32. Archiver programs by pjrc · · Score: 2
    Perhaps slightly off-topic, I find that the main bandwidth abuse of my site, which robots.txt is completely useless in preventing, is people running archiver programs like Teleport Pro, WebZIP, WebReaper, WebCopier, WebSymmetrix, Offline Explorer, and Wget. Some of them try to send a user agent string of "Mozilla" or "Mozilla/4.0", but looking through the log files it is obviously not an interactive user who rapidly downloads every single file as fast as the connection will support.

    If any of you web admin gurus (I know you're reading) have any ideas of how to deal with these programs, I could really use some help. I'd like to detect them and feed them the files at a controlled bandwidth.

    I find these archiver programs usually (but not always) behave much worse than any robot... often times they completely saturate my bandwidth for many minutes. Not nice.
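One possible starting point for the detection half is a sliding-window counter per client: anything that asks for more than a set number of files in a short window gets flagged, whatever user agent it claims. This is a sketch only; `RateWatcher` and its thresholds are hypothetical, and actually slowing the responses would have to happen in the web server itself.

```python
from collections import defaultdict, deque


class RateWatcher:
    """Flag hosts that exceed max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self._seen = defaultdict(deque)

    def record(self, host, timestamp):
        """Record one request; return True if the host is over the limit."""
        recent = self._seen[host]
        recent.append(timestamp)
        # Forget requests that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_requests
```

With `RateWatcher(3, 10.0)`, a host making four requests within ten seconds is flagged on the fourth, and the flag clears once it stays quiet long enough for the window to empty.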

  33. Re:what good is a robots.txt nowadays... by PD · · Score: 1

    What's good for the goose is good for the gander.

  34. Re:what good is a robots.txt nowadays... by Ben+Jackson · · Score: 2

    You could build a trap for such crawlers in the form of randomly generated HTML documents which each reference a few more fake URLs which generate more random HTML documents... Disallow that tree in your robots.txt and let the robots who disregard it suffer.

    The best random document generator would be a Markov chainer which had been fed all of the top level category pages from Yahoo! to make sure you have lots of juicy keywords to index. :-)
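A toy version of that trap could look like the sketch below: a first-order Markov chain built from some seed text, plus a page generator whose links lead to more generated pages. All names and the `/trap/` path are hypothetical; the matching robots.txt would carry a `Disallow: /trap/` line so that only bots ignoring it ever wander in.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the seed text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, rng):
    """Walk the chain from start, producing at most length words."""
    word = start
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: no word ever followed this one
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)


def trap_page(chain, start, rng, links=3, words=20):
    """Build one fake HTML page that links to further fake pages."""
    body = generate(chain, start, words, rng)
    anchors = "".join(
        '<a href="/trap/%d.html">more</a>\n' % rng.randrange(10**9)
        for _ in range(links)
    )
    return "<html><body><p>%s</p>\n%s</body></html>" % (body, anchors)
```

Passing in a `random.Random` instance keeps the output reproducible for testing; a real trap would serve a fresh page for every URL requested under the disallowed tree.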

  35. Can SEs search unreferenced pages? by skoda · · Score: 2

    This may be a trivial question, but I'd really like an answer:

    Can search engines find and index pages (html, php, etc.) that are not explicitly linked from the starting index.*htm* page in a given directory?

    Put another way, can a search engine find my directory /web/foo/bar and then index the page opus.html, even though neither the directory nor the file are referenced or mentioned in any of the "public" files?

    I ask because I was using non-referenced pages (can only be found by knowing the address) as part of a way to limit access to certain files to specific people.

    I hope someone can provide some insight into this issue.

    Thanks
    -----
    D. Fischer

    1. Re:Can SEs search unreferenced pages? by jareds · · Score: 1

      Can search engines find and index pages (html, php, etc.) that are not explicitly linked from the starting index.*htm* page in a given directory?

      Search engines only find pages that are linked to from some other page on the web. There is nothing in the HTTP protocol that allows them to get the full directory structure of the server or anything like that. Unless someone links to your page or submits the URL to a search engine, it won't be found.

      That said, I advise that you just use a robots.txt file just to be sure.
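The point about discovery can be illustrated with a small graph walk: a crawler that only follows links reaches exactly the pages connected to its starting points. The `discover` function and the sample link graph are hypothetical and exist only to show the idea.

```python
from collections import deque


def discover(link_graph, seeds):
    """Return every page reachable from seeds by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Made-up site: opus.html exists on the server but nothing links to it.
sample_graph = {
    "/index.html": ["/about.html"],
    "/about.html": ["/index.html"],
    "/web/foo/bar/opus.html": [],
}
found = discover(sample_graph, ["/index.html"])
```

Here `/web/foo/bar/opus.html` never shows up in `found`. That holds only until someone links to it or submits the URL, which is why an unlinked address is not a real access control.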

  36. Re:They've addressed half the problem by Fishstick · · Score: 2

    From the original article's author's "partial retraction":

    "...Google reportedly says that they are now crawling *all* of Yahoo! as part of their agreement..."

    http://www.lib.uiowa.edu/hardin/md/notes7a.html

    No real big mystery, Google wasn't indexing all of Yahoo's content before for some reason, now they are. If Yahoo went to all the trouble of pushing a pile of money at Google to be their search engine, why wouldn't they expect them to index all of their content?

    --

    There is much cruelty in the universe, John.
    Yeah, we seem to have the tour map.

  37. Re:Reasonable expectation of privacy? by skoda · · Score: 2

    "Do you really believe you have a reasonable expectation of privacy? You put it online for the world to see."

    That's an interesting point, which I would have agreed with a few months back. Now that I have my own website, though, my attitude has changed. In my mind, I have leased a service by which I can make materials available to various people via a global computer network. That means that I have the right to restrict who sees what.

    The majority of my online info is freely available for the world to see. But there is information that is meant for a specific group of people. Thus, I've given the URL to only those people who should have access. Some of it is password protected as well. Could certain unsavory types get to that info, despite my precautions? Probably, but I don't think that merely putting it on an online computer automatically gives them that right.

    <Bad Analogy>I lease an apartment which is visible to the world, and anyone can access the foyer. But that does not implicitly confer the right for anyone to enter my apartment and go through my belongings. And just because anyone can get into the foyer doesn't mean they have the right to read my magazines that are there because they don't fit in the mailboxes. If they want access to that material and my belongings, they can call me or 'buzz' me and ask to be let in.</Bad Analogy>

    Put another way, eavesdropping is bad form even in the online world.
    -----
    D. Fischer

  38. Stop me if I am wrong but... by pres · · Score: 1

    Lets see.

    They index all of yahoo's sites.

    If you are searching from yahoo and see a yahoo link come up, chances are you like yahoo and would start by following that link.

    Repeat for the millions of searches that occur on yahoo every day and yahoo will move up in the ranking.

  39. Re:ok now say your sorry everyone by don_carnage · · Score: 1
    Thank you so much for posting that! I know that this is way off topic, but it's about time I saw what all the fuss is about.

    It's very clear that the whole 'rats' thing was blown out of proportion by the media. The text was an effect and was not intended to be a subliminal message.

    *sigh* I guess that's what you get when you're the republican candidate and the press is in the democrat's back pocket. [Not that I support either campaign at this moment. It's worse than Clinton and Dole. ugh]


    --
  40. The PBS Factor by skoda · · Score: 1

    When in doubt, trust that PBS is in effect: People are Basically Stupid.

    :)
    -----
    D. Fischer

  41. Re:Partial retraction from MedWebPlus by Zagadka · · Score: 1

    and we now know that Google is polite and obeys spiders.txt

    That's robots.txt.

  42. Re:ok now say your sorry everyone by dubl-u · · Score: 1
    Isn't this clear evidence that google is accepting money from other companies in return for higher rating and even inclusion in searches that don't match?
    Not really. If you click on the "cached" link on the Google results page, you can see another way this could happen. Google, wisely I think, uses the words in links leading to a page to index stuff. But it gives a way for people to cheat. So it could be that Epinions is the culpable party here, not Google.
  43. Robots.txt by zpengo · · Score: 2
    That's the old text floating about BBS about how to h4x0r robots, right?

    Seriously, though, I have a question: If robots.txt was there, how did Google index the site at all (instead of just poorly)?

    --


    Got Rhinos?
    1. Re:Robots.txt by Sean · · Score: 1

      Because robots.txt can specify not to index specific directories. Some robots.txt tell it not to index a site at all, some say not to index /tmp, etc.

      /Sean/

      --

    2. Re:robots.txt by aufait · · Score: 2

      The robots.txt file contains instructions to the webcrawlers on what pages should be indexed and which should not. Well-behaved robots follow the instructions. Apparently, the site complaining about its ranking was telling google not to index the entire site.

      --
      I feel like picking a fight with everyone who thinks they are right. - Rainmakers
    3. Re:robots.txt by gaudior · · Score: 1
      Search engines (and any webcrawling 'bots') don't index sites where they find a 'robots.txt' file. This is called the Robot Exclusion Principle

      This is not strictly true. The robots.txt file contains patterns which tell the robot what parts of the site they are allowed to index.
      --

    4. Re:robots.txt by baywulf · · Score: 3

      A robots.txt file is used to control web page indexing done by autonomous search engines. It states which search engines are allowed and what they may index. It is somewhat advisory in nature in that a rogue search engine may disregard that information and do what they please but they may suffer the wrath of the owners of that website or others if this is done too often.

    5. Re:robots.txt by don_carnage · · Score: 3
      The robots.txt file is used at the web-root to prevent search engines from indexing certain parts of your website -- not the whole site all-together.

      See this link for more information.

      --

    6. Re:robots.txt by pb · · Score: 1

      Good call; an empty robots.txt file does have no effect. I was going for a basic explanation, but apparently that was a little too basic.

      However, for more info, the other reply I got has a handy link in it! :)
      ---
      pb Reply or e-mail; don't vaguely moderate.

  44. robots.txt ? by zuffy · · Score: 1

    That's interesting, because recently I was receiving a hit on my server from a specific host, trying to access /robots.txt from me. I thought that it was most likely a search engine just indexing a site, but it occurred rather frequently over the span of about two weeks.

    Does this have something to do with that?

    --
    {justin.filip | jfilip AT gmail DOT com} {http://jfilip.ca/}
    1. Re:robots.txt ? by don_carnage · · Score: 5

      What's interesting is that sometimes people look for the robots.txt file to find hidden directories on a server. Hmmm... /journal, /naked_school_girls, /personal_finances...

      --

    2. Re:robots.txt ? by twidfeki · · Score: 1

      I had the same thing happen to my web server over a couple week period.

  45. robots.txt by crgrace · · Score: 1
    Could someone please explain what a robots.txt file is and how it could affect a Google search?

    I'm sure I'm not the only slashdot reader who doesn't know.

  46. ok now say your sorry everyone by Emugamer · · Score: 3

    After reading all those great flames from yesterday I think this is a good time to apologize. simple mistake, no conspiracy. now show them that you are human and admit you were in error!

    ----------
    Geeks make mistakes to!

    1. Re:ok now say your sorry everyone by interiot · · Score: 2
      http://epinions.com/ appears in all of these searches:


      You won't find all the searched words on epinion's root page. Google's queries search for "all words", so I don't see how this link could have come up in their searches.

      Isn't this clear evidence that google is accepting money from other companies in return for higher rating and even inclusion in searches that don't match?

    2. Re:ok now say your sorry everyone by Captain+Pillbug · · Score: 1

      Did someone mention rats?

    3. Re:ok now say your sorry everyone by interiot · · Score: 2
      The cached root page doesn't contain the words "gerber", "forks", "babies", or "ass".

      Are you suggesting that Google combines all this information:

      and decides that http://epinions.com/ is a close enough match?

      Perhaps, but I'd think that many more general-purpose sites (e.g. Yahoo) would match in this way, and I haven't seen any sites show up nearly as much as Epinions has.

    4. Re:ok now say your sorry everyone by dubl-u · · Score: 1

      Actually, I'm suggesting that words in links are the problem. If you search for "Ass" on epinions, you'll see that there's a link from the movie "The Golden Ass" to this URL:

      http://www.epinions.com/book_mu-2053922

      Now if I were writing a link parser for a search engine, I might throw out that last part, since it's not obviously HTML or a directory. (Yes, that wouldn't be optimal, but it's a reasonable mistake.) And presumably you can find similar links for "gerber", "forks", and "babies", too.

      So perhaps Google's engine sees about a zillion links with all sorts of words pointing to Epinions' top level, and assumes that Epinions' top level is relevant to those. It's generally a pretty good assumption, even if it falls down in the case of Epinions.

      Of course, it's completely possible that the guys at Epinions walk a briefcase full of cash over to Google on the first of every month, and that's why Epinions ranks highly for people wanting to fork babies in their asses. But there are other possibilities, too, so maybe you should give Google the benefit of the doubt.
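
      The link-parsing guess above can be sketched roughly as follows. This is only an illustration of the hypothesis, not Google's actual parser: `naive_target` and `index_anchor_words` are made-up names, and the "drop a last path segment that has no extension" rule is an assumption taken from the comment's own wording.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def naive_target(url):
    """Hypothetical buggy normalizer: drop a last path segment that has no
    file extension and no trailing slash, collapsing the link onto its parent."""
    parts = urlsplit(url)
    head, _, last = parts.path.rpartition("/")
    path = parts.path
    if last and "." not in last:
        path = head + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def index_anchor_words(links):
    """Credit each word of a link's anchor text to the normalized target URL."""
    index = defaultdict(set)
    for anchor_text, url in links:
        index[naive_target(url)].update(anchor_text.lower().split())
    return index


# Sample links: the first is the one quoted in the comment, the second is invented.
links = [
    ("The Golden Ass", "http://www.epinions.com/book_mu-2053922"),
    ("Site help", "http://www.example.com/help/index.html"),
]
index = index_anchor_words(links)
print(sorted(index["http://www.epinions.com/"]))
```

      Under that assumed bug, every word used in a link to any extensionless Epinions page ends up credited to the site's root page, which would explain the root page matching queries whose words it never contains.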

    5. Re:ok now say your sorry everyone by interiot · · Score: 2
      so maybe you should give Google the benefit of the doubt.

      One of America's dogmas is "question everything", so that's what I was doing.


      I guess your guesses sound possible, and possibly testable, but I can't figure out how at the moment.

  47. So what about yahoo? by RJ11 · · Score: 1

    Does this mean that Yahoo completely ignores the robots.txt file? This certainly can't be a Good Thing(TM), can it?

    1. Re:So what about yahoo? by Pinball+Wizard · · Score: 2

      Yes, they do. And Rover is a bad dog.

      --

      No, Thursday's out. How about never - is never good for you?

    2. Re:So what about yahoo? by kaphka · · Score: 3

      Considering that Yahoo! is compiled by humans, not robots, it would be kind of insulting to expect them all to "parse" robots.txt.

      --

      MSK

    3. Re:So what about yahoo? by drivers · · Score: 2

      No. It means that Yahoo doesn't have a robots.txt file. Think about it.

  48. robots.txt by pb · · Score: 2

    Search engines (and any webcrawling 'bots') don't index sites where they find a 'robots.txt' file. This is called the Robot Exclusion Principle.

    If you run a web site, check your error log for notes to that effect. (You'll get a random bot from, say, 'inktomi' or something; it checks for a robots.txt file, doesn't find it, you get a message in your error log, and then your site gets crawled...)
    ---
    pb Reply or e-mail; don't vaguely moderate.

  49. Re:what good is a robots.txt nowadays... by substrate · · Score: 1
    I can feel your pain. There are a few robots that incessantly try to either index my web site or my ftp server. One of the worst offenders was mp3search.lycos.com but they removed my site (at the time) from their spider at my request.

    With a dynamic IP you're liable to be indexed because someone at some point submitted their site for indexing, despite the fact that their IP lease might only be around for a few hours/days/weeks.

    I could handle it if they parsed robots.txt and read the "GO AWAY!" lines and didn't come back, but some of these poorly written programs check the robots.txt file every 5 minutes when they're in a spidering mood. Nice. You've got to wonder how much bandwidth is wasted due in part to moronic programming practice.

  50. Re:what good is a robots.txt nowadays... by True+Dork · · Score: 2

    I doubt it. You have to take the attitude that if you have something on an open webserver, people can see it. If you don't want a spider hitting your site, ban the subnet that it comes from. If the data is something you don't want the government or anyone else to see, don't place it in plain view.

  51. They've addressed half the problem by xant · · Score: 2
    And actually, they addressed the sillier half of the claim - that Yahoo was going out of its way to demote every conceivable directory on the web, including obscure medical directories nobody's ever heard of. And that Google agreed to help them do this.

    So fine, they didn't do that - now explain why Yahoo's rankings shot UP? I heard a few plausible and non-evil theories on how this happened, but I want to hear it from Yahoo.

    --
    It's rare that you're presented with a knob whose only two positions are Make History and Flee Your Glorious Destiny.
  52. Reasonable expectation of privacy? by FallLine · · Score: 4

    Do you really believe you have a reasonable expectation of privacy? You put it online for the world to see. Just because some parties are a little more interested than others doesn't mean they're violating your privacy.

    As for searching beyond the request of robots.txt's and _really aggressively_ searching, that strikes me as being something of a different issue. It seems to me that robots.txt is more of a practical and protective issue than it is one of privacy. It's more of a request not to bother you than it is a request for privacy, at least in my opinion. Also, failure to adequately process and obey robots.txt can easily be the fault of programming error or ignorance, not necessarily a willful or particularly unreasonable act; one need not necessarily take special measures to circumvent its intention.

    This is not to say that I can't sympathize with parties that get hammered by such spiders, but I don't believe the privacy argument per se holds any water. I see legitimate complaints on both sides of the issue. For instance, let's say you're a software company and you find a LINKED and self-proclaimed warez page, but the hosting site doesn't allow spidering. Is that still so criminal? Even if the desire is to simply catalogue and document all of it?

  53. Partial retraction from MedWebPlus by Frac · · Score: 5
    Here's a partial retraction from MedWebPlus: (they admit they know now why their rankings dropped, but they still question why Yahoo is on the rise)

    http://www.lib.uiowa.edu/hardin/md/notes7a.html

  54. Explanation why robots.txt file affects ordering by bkosse · · Score: 4

    It's actually pretty simple, really. The reason the site in question would have plummeted is that as Google is updating its stats, it probably makes some allowances for screwups and inability to reach a given site. However, after a time, the fact that Google was not allowed to search the page must have some sort of impact, and probably an exponential one. "OK, not here, probably a screw up, but we can't verify the search terms will be there" happened at the beginning and eventually as it aged out of relevance, it became "Well, lots of people think this page is good, but it's just not there!" from Google's perspective.

    That makes sense.

    Now, we know Google weights other sites by the weight of the site that links them. As the original directory started sliding, anything it linked to started sliding as well. Which means Yahoo! fills the void. Particularly in such a specialized example, where your likelihood of getting a good match is based on a few key sites.

    --
    Ben Kosse

    --
    Ben Kosse
    Remember Ed Curry!
  55. Re:what good is a robots.txt nowadays... by Alanzilla · · Score: 1

    What's good for the goose is good for the gander.

    Yes, but in this case, it's good for neither.

  56. DMCA Violation? by stinkydog · · Score: 1

    Could you not claim that the robots.txt file is a content protection system?

    Then a spider programmed to ignore robots.txt would be circumventing your system. Any web hosting company could clog the courts with DMCA suits in a couple of days.

    --
    “Who knew something as harmless as willful ignorance could end up having real consequences?”
  57. Privacy in public by Felinoid · · Score: 1

    Privacy in public in effect doesn't exist.
    If I walk around with no clothes in my home and someone looks in my window, they are invading my privacy... if I walk outside with no clothes, then it's indecent exposure...
    Since I don't wish to cause mass insanity and no one wishes to see my ugly butt, there is no chance of either...

    Basically, you're outside...

    Unreasonable use of computing resources?
    Maybe such a law needs to be put in place...
    Something like "Use of equipment that is far outside the equipment's known function, its publicly accepted function, or the owner's publicly stated function."
    Spam would fall outside this standard: since collecting e-mail addresses is "far outside" and e-mailing itself isn't, it's actually the collecting of addresses that would be made illegal...
    (I like that... ban "blind harvesting" of e-mail)

    Anyway, there is a clear and obvious alternative: sending out humans to look at web URLs that are banned to bots... gives jobs to a lot of kids... (Wanted: Kids to surf websites to look for illegal material)

    --
    I don't actually exist.
  58. Nobody mentioned Wpoison by toed · · Score: 2

    Robots can't find things not linked to.
    Good robots obey /robots.txt.
    Bad robots use /robots.txt to find juicy things.

    So...

    Create /youfucker, don't link to it anywhere,
    deny access to it explicitly in /robots.txt, and
    install Wpoison, freely available at
    http://www.e-scrub.com/wpoison/

    Fix your web server to take requests into /youfucker and feed them to Wpoison.

    Too bad for Mister Bad Robot.
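
    The detection half of this trick (spotting who walked into a directory that is only ever named in /robots.txt) can be sketched as a log scan. This does not use or reproduce Wpoison itself; the `/trap` path, the sample log lines, and the function name `trap_visitors` are hypothetical, and the log is assumed to be in Common Log Format.

```python
import re

# Common Log Format: host ident user [time] "METHOD path PROTOCOL" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[A-Z]+ (\S+)')


def trap_visitors(log_lines, trap_path="/trap"):
    """Return the hosts that requested the trap directory or anything under it."""
    hosts = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        host, path = match.groups()
        if path == trap_path or path.startswith(trap_path + "/"):
            hosts.add(host)
    return hosts


# Invented sample log using documentation-only IP addresses.
sample_log = [
    '192.0.2.10 - - [01/Jan/2000:00:00:01 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '198.51.100.7 - - [01/Jan/2000:00:00:02 +0000] "GET /robots.txt HTTP/1.0" 200 45',
    '198.51.100.7 - - [01/Jan/2000:00:00:03 +0000] "GET /trap/page1.html HTTP/1.0" 200 512',
    '192.0.2.10 - - [01/Jan/2000:00:00:04 +0000] "GET /trapdoor.html HTTP/1.0" 200 77',
]
print(trap_visitors(sample_log))
```

    Since nothing links to the trap directory, any host this returns most likely learned the path from /robots.txt and ignored the Disallow line.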

  59. Re:what good is a robots.txt nowadays... by PD · · Score: 1

    Oh, I thought we were quoting witticisms in this thread.

    "Early to bed and early to rise makes a man healthy, wealthy, and wise."

  60. More specifically... by Derek+Pomery · · Score: 2

    The default (no robots.txt) is to crawl your site. If you have a robots.txt, it follows the rules therein.
    http://www.searchtools.com/robots/robots-txt.html
    List of rules - found with google. :>
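
    For anyone who wants to see that behaviour concretely, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleBot`, the example.com URLs, and the helper `allowed` are made up for illustration, and an empty rule set is used as a stand-in for the "no rules" case.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, user_agent, url):
    """Check whether the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


no_rules = []                                          # nothing to obey
some_rules = ["User-agent: *", "Disallow: /private/"]  # one directory off limits
block_all = ["User-agent: *", "Disallow: /"]           # whole site off limits

page = "http://example.com/index.html"
private_page = "http://example.com/private/notes.html"

print(allowed(no_rules, "ExampleBot", private_page))    # no rules: crawl away
print(allowed(some_rules, "ExampleBot", page))          # outside the rule
print(allowed(some_rules, "ExampleBot", private_page))  # inside the rule
print(allowed(block_all, "ExampleBot", page))           # everything refused
```

    Note that this only describes what a well-behaved robot would do; nothing in the file stops a robot that chooses not to check it.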

    --
    -- perl -e'print pack"H*","6e656d6f406d38792e6f7267"' /. ate my old sig. Bastards.
  61. can't you exclude yahoo.com pages from results? by jlusk4 · · Score: 1
    Can't you just go to the "advanced search" page and exclude yahoo.com pages from the search results?

    Wouldn't this be an appropriate way for users to respond to "search result pollution" by Yahoo?

    John.

  62. The Implications being... by Christopher+B.+Brown · · Score: 3
    • ... The real finding of such a research project would be an answer to the question: Which search engines ignore the robots.txt file? Any "inclusions" would then represent either:
      • Search engines that don't respider very often, thus providing obsolete data, or
      • Search engines that ignore requests not to spider, and that thus are bad Internet "citizens."
    • ... That this was a very successful "troll" for discussion on the part of both the research group as well as the operators of Slashdot.

      After all, if there was no "crime" to complain about, and any "damage" was done by themselves to themselves, this never merited one story let alone two.

      Since no lawyers were involved, it's not a case where "the lawyers won" (as is often seen in big, bloody trials); instead, it could be said that "the journalists won," as they got a bunch of blather out of no real story.

    --
    If you're not part of the solution, you're part of the precipitate.
  63. Re:what good is a robots.txt nowadays... by benedict · · Score: 1

    Maybe you can sue them under the DMCA. ;-)

    --
    Ben "You have your mind on computers, it seems."
  64. Excellent Point - wrongly moderated by Maniacal · · Score: 1

    I agree with you whole heartedly. That was the first thing I thought about when I read the post. That guy was so excited about his discovery and "uncovering the TRUTH" that he failed to consider the fact that it might just be a mistake. One 5 minute e-mail would have saved him months of useless research.

    Your post was moderated to Troll unfairly. Mods should read first and consider content. Just because a poster seems emotional it doesn't mean he/she is necessarily ranting.

    Mike

    --
    MG
  65. Site Magically Disappeared. by rwhite · · Score: 1

    As interesting as this is, I added my site [just so you can verify it], neopaintball.com, about 2 months ago. Although I didn't look very hard for the site to show up in keyword matches, I knew it would show up on a simple search for neopaintball. It did, consistently, for about 3 weeks.

    Then one day the link:url search stopped working [for any site] and I emailed Google about it. They told me that they had fixed it and to try again. I tried again and, poof, my site no longer existed in their index.

    I thought this was VERY weird but figured I would just submit the site again, but nothing. It has not shown back up in their search results for 3 weeks.
    This works. So does this. But mine does not.
    I know it sounds like griping, but I really don't care; I just thought it was VERY odd.

  66. Commercially Altering Results by Calum+I+Mac+Leod · · Score: 1

    As interiot wrote in yesterday's /. thread, Google says "Unlike other search engines, Google is structured so no one can purchase a higher PageRank or commercially alter results.".

    But in Eric Rumsey's "Partial retraction", he writes "In refuting my article, Google reportedly says that they are now crawling *all* of Yahoo! as part of their agreement, which might have changed rankings."

    Can both be true? In my opinion, artificially spidering a domain as part of a "commercial agreement" is at variance with "no one can purchase a higher PageRank or commercially alter results.".

    That having been said, let's not all forget that amongst the mainstream search engines, Google has significant advantages over other ranking algos.

  67. what good is a robots.txt nowadays... by 2quam4 · · Score: 4

    Even with robots.txt utilizing:
    User-agent: *
    Disallow: /
    I continue to receive spidering from companies such as NetCurrents and Cyveillance because it is easy to ignore robots.txt. Rude, yes -- but easy. It is also easy for me to deny access via Apache, but bots from companies such as the above-mentioned continue aggressive spidering.

    It seems that standards (such as those for robots.txt) are useless, particularly for companies who spider the Net in search of copyright/trademark violations.

    Granted, some companies have an interest in policing their products, but when do they go too far? Wouldn't deliberate/aggressive spidering into areas of my site which I have instituted restrictions/blocking constitute some sort of invasion of privacy? If a government entity is doing the spidering, wouldn't a search warrant be required?