Slashdot Mirror


Google Index Doubles

geekfiend writes "Today Google updated their website to indicate over eight billion pages crawled, cached and indexed. They've also added an entry to their blog explaining that they still have tons of work to do."

9 of 324 comments

  1. More pages v.s more relevant pages by xiando · · Score: 5, Insightful

    Personally I find that the lack of relevant pages is the biggest problem with search engines, not the lack of pages with information. It seems I always find what I'm looking for eventually; what I need improved is the time I spend looking through spam-bomb pages before I find a page with the correct information.

    These spam pages seem to be increasing; I mean those pages with just a bunch of keywords or the output of some search system.

    1. Re:More pages v.s more relevant pages by Kithraya · · Score: 5, Insightful

      I'm especially irritated by the increasing number of highly-ranked pages that are nothing more than another search engine's results. If Google could find some way to identify and remove these from my result set, Google's usefulness to me would increase 10 times over.

    2. Re:More pages v.s more relevant pages by PsychoSlashDot · · Score: 5, Insightful

      What I've read on the Google help pages seems to indicate that they don't index punctuation or capitalization. When you search for something, your string is looked for within an existing index, and appropriate reference materials are shown. Including punctuation wouldn't result in any hits within their index, meaning no results.

      Now, obviously, it is theoretically possible to do just about anything. But in this case, with the architecture they have in place, doing what you're asking would require a full-text search through their multi-TB dataset, which I suspect is highly impractical.

      My point is that, as I understand it, Google has coded a number of shortcut tricks which allow reasonable search times, and full-text string-exact searching would prevent them from using those shortcuts, resulting in search times they don't seem to think are reasonable.
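To make the index argument concrete, here is a toy sketch of an inverted index whose tokenizer throws away punctuation and capitalization before anything is stored. It is not Google's actual implementation; the names `tokenize`, `build_index` and `search`, and the sample documents, are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "Use ant to make .war",
    2: "War and Peace",
    3: "C++ is not C#.",
}
index = build_index(docs)
```

With this index, searching for ".war" and "WAR" returns the same two documents, and "c++" cannot be told apart from "c#", because the punctuation never made it into the index. Telling them apart would mean rescanning the original text of every document.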

      --
      "Oh no... he found the .sig setting."
  2. Makes you wonder... by manmanic · · Score: 5, Insightful

    Does this mean that I've been missing a huge amount of important information until now? I'd just assumed that Google covered the entire relevant web, but now it seems to cover the same amount all over again. My Google alerts also seem to have started producing a lot more results, which suggests that a lot of these new pages are rated quite highly. Who knows how much more quality content on the web we're just not seeing?

    1. Re:Makes you wonder... by jlar · · Score: 5, Interesting

      "Does this mean that I've been missing a huge amount of important information until now?"

      Maybe the steep increase is due to all the new file formats they are indexing now. That might be useful for some people (although I sometimes find it kind of annoying that a search returns MS-Word documents).

  3. Re:Google thieves my bandwidth by Anonymous Coward · · Score: 5, Informative

    Google respects the robots.txt file. Use it.
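For anyone who wants to check what a given robots.txt actually permits, here is a small sketch using Python's standard `urllib.robotparser`. The rules, the `/downloads/` path and the `example.com` URLs are assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of a bandwidth-heavy
# directory, and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /downloads/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("Googlebot", "http://example.com/downloads/big.iso"))
print(allowed("Googlebot", "http://example.com/index.html"))
```

Replacing `Disallow: /downloads/` with `Disallow: /` would shut the crawler out of the whole site.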

  4. Doubled? Wait a minute... by 't is DjiM · · Score: 5, Funny

    From 4 to 8 billion pages... I guess they just indexed the google cache...

    --
    --Use ant to make .war
  5. Re:Google thieves my bandwidth by jvj24601 · · Score: 5, Informative

    Well, if you know that Google is indexing your site and "stealing" your bandwidth, then you must have looked at the server logs, right? You'd see the name of the search bot is googlebot. Search for it, and you'll find that the first relevant link explains how to prevent googlebot from accessing your site.

    The logs would probably also show failed attempts to find the file /robots.txt. Similar info is gained from searching on that term as well.
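A rough sketch of that log check, assuming the server writes Apache-style common or combined log lines. The regular expression, the `summarize` helper and the sample lines are all made up for illustration; real log formats vary.

```python
import re

# Assumed layout: host, two ignored fields, [timestamp], "request",
# status, size, then optionally "referer" and "user agent".
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def summarize(lines):
    """Count requests from Googlebot and failed fetches of /robots.txt."""
    crawler_hits = 0
    missing_robots = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent") or ""
        if "googlebot" in agent.lower():
            crawler_hits += 1
        if match.group("path") == "/robots.txt" and match.group("status") == "404":
            missing_robots += 1
    return crawler_hits, missing_robots

# Invented sample lines, not real traffic.
SAMPLE = [
    '192.0.2.1 - - [10/Nov/2004:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 404 209 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.1 - - [10/Nov/2004:10:00:01 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Nov/2004:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

crawler_hits, missing_robots = summarize(SAMPLE)
```

A non-zero second count means a crawler asked for /robots.txt and the server had nothing to give it.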

  6. So, to sum up... by kahei · · Score: 5, Insightful


    I am feeding this troll because there are people who really _do_ think like that and I wish I could yell at them to their faces :)

    You put content in a place where it is publicly accessible. You explicitly and proactively made that content available to everyone, including 'the average surfer' and googlebots. You took no steps to make it available only to the select few of whom you approve.

    Now you are all cross and bothered because average surfers / googlebots have read / copied your content, such as it is.

    The solution is to drown yourself in a bucket. I have a bucket.

    --
    Whence? Hence. Whither? Thither.