Slashdot Mirror


Google's Bigger Index

WebGangsta writes "Google Inc. today announced it expanded the breadth of its web index to more than 6 billion items. This innovation represents a milestone for Internet users, enabling quick and easy access to the world's largest collection of online information."

47 of 412 comments

  1. Here's hoping by r_glen · · Score: 5, Interesting

    ... this will lead to an increase in the integrity of PageRank(TM), and vintage Google will return in all her glory.

    1. Re:Here's hoping by Destoo · · Score: 5, Interesting

      So it's not just me..

      First, the reindex that happened a few months ago removed all cross-reference with accents.
      (where google would find the same number of links for both the word and the unaccented word... right now: soupçon: 9,750 - soupcon: 88,500)

      Then, when searching for anything regarding RAS error messages, I get 30 links from spammers and then the real stuff.
      Example: 711 error yields multiple links for similar pages...
      "Your one stop resource for all things error 711 remote access connection
      management related. ... error 711 remote access connection management. ... "

      Vintage Google.. in Net years, that's 15-16 months ago, right?

      --
      Nouvelles de jeux et technologies en français. TC
    2. Re:Here's hoping by DeadSea · · Score: 3, Interesting
      Google does deal with spammers of the sort that you pointed out. It does take some prodding though. Last time that I found one of these, I submitted it to their problem report form on google.com. After a month nothing had been done. I then posted it in a slashdot comment that got modded up. A day later all the spammers were gone.

      Google search: 711 error

      Come on, Google. Stop reading slashdot and fix the problems.

  2. how many? by QuantumRiff · · Score: 4, Interesting

    How many of these 6 billion items are in the form of www.massivepopups.com/your_search_term.html

    --

    What are we going to do tonight Brain?
    1. Re:how many? by BuckaBooBob · · Score: 2, Interesting

      I would like to see a new element added to reduce ranking based on the number of pop-ups contained in pages indexed or linked to on sites :) That would really kill a lot of the garbage sites that skew their rankings and in the same breath reduce the need for pop-up blockers :)

      --
      Who needs WiFi when we can have Packet Over Sheep! http://datacomm.org/PoS-InternetDraft.txt
  3. Heh by PaintyThePirate · · Score: 5, Interesting

    Anyone else find it funny that Google has around one item for every man woman and child on earth?

    1. Re:Heh by Eslyjah · · Score: 4, Interesting

      Well, we're a bit over 6 billion now. It's more like 6,348,951,839. Wait. Now it's 6,348,951,840. And now 6,348,951,841...

    2. Re:Heh by ktanmay · · Score: 2, Interesting

      With more than 50% of them not even aware of what google is.
      If a few hundred million people can generate more than 6 billion pages, just imagine what number all of humanity can produce?

  4. Google thumping its chest? by LostCluster · · Score: 4, Interesting

    What's going on here? This isn't like Google to put out a press release just because the index size just passed a round number.

    Is Google setting up for its IPO and therefore becoming less like the Google we know and love?

    1. Re:Google thumping its chest? by Joel+Bruick · · Score: 2, Interesting

      It did the same thing over two years ago. Please, Google and stock market trolls, think before writing.

  5. The real question by Anonymous Coward · · Score: 3, Interesting

    Did they hit some sort of internal limit just above 4 billion? Were they using an unsigned int? Is that why all these extra items are in a "supplemental" index?

  6. It's only a matter of time.. by pacsman · · Score: 5, Interesting

    I'm waiting for them to come up with a sound search and an image search that look at the subject of the image rather than its file name. After that I'm not sure what's left. Maybe comparative searches for sounds and images, where you can upload a source to compare? Who knows! I hope these guys don't follow the normal path of spiralling into inconsequence after they go public.

  7. 4.28 billion web pages... by hanssprudel · · Score: 3, Interesting

    2^32 = 4.29 x 10^9

    Does it sound to anybody else like the rumours of Google hitting a dead end in the number of index positions for the web search are true? Especially given that it has been more than a year since they announced 4 billion.

    Apparently pagerank assigns an unsigned int to every page as id, and their index is so huge they cannot convert it to a 64 bit number. (You wonder why they didn't think of that 2 billion pages ago, when a UTF-8-like solution would still have been possible).
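
The arithmetic behind the rumour can be checked in a few lines. This is only a sketch: the 32-bit ID width is the rumour being discussed here, not a confirmed detail of Google's design, and `id_capacity` and `CLAIMED_INDEX_SIZE` are names made up for the illustration.

```python
def id_capacity(bits: int) -> int:
    """Number of distinct values an unsigned integer of the given width can hold."""
    return 2 ** bits


# "more than 6 billion items", taken from the announcement in the story.
CLAIMED_INDEX_SIZE = 6_000_000_000

for bits in (32, 64):
    capacity = id_capacity(bits)
    fits = CLAIMED_INDEX_SIZE <= capacity
    print(f"{bits}-bit IDs: {capacity:,} values, room for 6 billion items: {fits}")
```

If IDs really were a single 32-bit space, 6 billion items would not fit in it, which is why the jump past 4.29 billion is interesting either way.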

  8. What I want to know... by Bob+McCown · · Score: 5, Interesting

    ...is how to get rid of those pseudo-pages in Google. The ones with names like "thing_that_youre_searching_for.html", and all they are is either a page of dead links to crap on ebay, or a "Hey, we do great searches for your stuff".

    1. Re:What I want to know... by Anonymous Coward · · Score: 1, Interesting

      Just how do these sites know what I was searching for anyway? They don't have a cross-referenced page for every word in the dictionary, do they?

  9. is it just me? by trans_err · · Score: 5, Interesting

    Google has become so flooded with internet crap that it's quickly losing its status as a useful tool. Google needs some form of moderation to move out the superfluous blog entries and advertising fronts so it can someday become as useful as it always was.

  10. Faked URLs by Professr3 · · Score: 3, Interesting
    Surely a lot of these results are for search engines that prey on google. You can't run a lookup on anything these days without finding a link that goes straight to some other search page, filled with ads of course. Is this a problem, and is Google actually counting those pages in the 6 billion figure?

    </curious>

  11. Still nok by mirko · · Score: 5, Interesting
    • I own a forum on top of which I put a robots.txt file which is supposed to STOP any spider from visiting it.
      I however find my posts while googling for words they contain.
      How can one explicitly forbid Google from indexing a site?
    • My wife developed 2 web sites which never got indexed even though we submitted these using Google's interface. As they might not be linked, I suppose Google just considers that if nobody mentions a site, then the site should not be registered as existing? Does Google think it actually is the web?
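
One way to sanity-check a robots.txt file is to feed it to Python's standard `urllib.robotparser` and ask what a given crawler is allowed to fetch. A minimal sketch, assuming a blanket "disallow everything" rule; the forum URL and the `is_allowed` helper are hypothetical. Crawlers look for the file at the root of the host (`/robots.txt`), and honouring it is voluntary on the crawler's side.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up forum URL, checked against the blanket rule above.
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://forum.example.com/thread/1"))
```

If this prints `True` for your real file, the rules are not saying what you meant them to say.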

    Sorry, I'll keep using Altavista.
    --
    Trolling using another account since 2005.
    1. Re:Still nok by justMichael · · Score: 2, Interesting
      My wife developed 2 web sites which never got indexed even though we submitted these using Google's interface. As they might not be linked, I suppose Google just considers that if nobody mentions a site, then the site should not be registered as existing? Does Google think it actually is the web?

      Put it in your sig as a link, get a few high rated posts and google will visit.
  12. No Good... by Mork29 · · Score: 4, Interesting

    I don't want MORE things to search for, I want it to return more relevant results. I know that the information I usually search for is out there; the problem is that there's so much chaff out there that I can't find what I want. No matter what I search for, there are at least 2 or 3 responses related to porn. I understand that there is a lot of variety of porn out there, but come on... Search engines are getting even worse by throwing in search results that are hardly relevant, just because they got paid money by the company. I would even be willing to pay for a "google membership" if they eliminated the advertisers mixed in with search results and maybe gave me another special feature or 2. I'd want a search engine that returns just 1 or 2 good results over one that returns 5 good results mixed in with 200 bad ones.

  13. Sort out their indexing problems first by jolyonr · · Score: 5, Interesting

    I do hope they manage to sort out their recent indexing problems first. For many searches altavista is now showing far more relevant search results than google - since their attempted cull of 'spam' sites last December, which kind of backfired. They have improved things this year, but the quality of their search results is not as good as it was last year. Now, they need to figure out how to get rid of all the useless sites that are just shopping directories full of espotting URLs and similar and with no real content. Funnily enough, their anti-spamsite code seemed to actually promote these up the rankings on many search terms, while penalising many sites containing genuine content.

    Many people said that Google were using deliberate tactics to encourage small e-commerce websites to spend more on adwords, but I believe this wasn't deliberate - their index is so big that they simply can't tell what the results of their changes are going to do to the search orders for all the search options that people are going to use - and they simply didn't realise in advance the problems they were going to cause. And google have made efforts to minimise the damage since then, but they still need to do more.

    Jolyon

    --


    Please read my Canon EOS tech blog at http://www.everyothershot.com
  14. Re:Their search has apparently improved as well ! by trans_err · · Score: 2, Interesting

    Television antennas Information at Business.com Television antennas industry web links for business products, services, information and resources. ... Television antennas. FEATURED LISTINGS, ... www.business.com/directory/media_and_entertainment/television/equipment_and_supplies/television_antennas/ - 28k - Cached - Similar pages -- wow, a false advertising front... how USEFUL!

  15. Run out of indexing space? by rqqrtnb · · Score: 5, Interesting

    I heard that Google is using 4-byte ints for DOCids and they have been running out of indexing space since they are pretty close to 2^32 pages already. Is that true?

    1. Re:Run out of indexing space? by kindofblue · · Score: 3, Interesting
      Not likely. I would imagine that each item has a unique id, not just each web page, since there needs to be some way to identify what the target of a link is. Just because a link ends in pdf, or jpg, or gif, does not mean that it is of that type. The crawlers undoubtedly record the content-type of fetched resources.

      So I would guess that they already use more than 32 bits per item with everything in a single item ID space, or they use 32 bits plus some code indicating the ID space, or perhaps a variable-length code depending on the item type, e.g. like UTF-8. In any case, they should have exceeded 32 bits long ago.
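
A variable-length ID code of the kind hinted at here can be sketched with a base-128 "varint": 7 payload bits per byte, with the high bit marking that more bytes follow. This is not UTF-8 itself, just a simpler scheme in the same spirit, and nothing here is known to match what Google actually does; `encode_varint` and `decode_varint` are names made up for the example.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte, low bits first."""
    if n < 0:
        raise ValueError("only non-negative IDs are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


for doc_id in (127, 128, 2 ** 32, 6_000_000_000):
    encoded = encode_varint(doc_id)
    print(doc_id, len(encoded), decode_varint(encoded) == doc_id)
```

Small IDs stay small on disk, and IDs past 2^32 simply take one more byte instead of overflowing.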

  16. Good for Google...but: by master_p · · Score: 4, Interesting

    I am still waiting for a search engine that does topic matching instead of text matching. In other words, I would like the search engine to return a list of urls with related topics instead of related text. As it is right now, all search engines, including Google, return pages that contain text equal or related to the input, but they might be 98% unrelated. I still can't consider the Internet as a library of knowledge due to this fact.

    For example, if one searches for "TCP/IP tutorials", it would return many unrelated links like posts in newsgroups, college lectures, etc.

  17. Re:Marching In Step by WebGangsta · · Score: 4, Interesting
    My comment was left off the posting indicating that I noticed the change in "number of hamburgers served" message on the Google home page this morning, leading me to wonder what other changes we should be looking for today (and hence leading me to this news, albeit a press release - Search Engine Watch didn't have it mentioned on their home page at the time).

    And the press release doesn't say that they're indexing over 6B pages, so anyone who's saying that here is mistaken.

  18. How much space do they use for caching? by The+One+KEA · · Score: 4, Interesting

    With 6 billion pages indexed and cached, and maybe an average of 50K per page (which is probably pretty conservative - it's probably twice that in some cases), that's nearly 30TB, IICIC!!!

    The hard disk and RAID folks must LOVE Google....
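
The back-of-the-envelope figure is easy to redo. A sketch only: the 50K average is the guess from this comment, "K" is taken as 1,000 bytes and "TB" as 10^12 bytes, and `total_terabytes` is a name invented for the example.

```python
ITEMS = 6_000_000_000  # "6 billion pages" from the story


def total_terabytes(items: int, avg_bytes_per_item: int) -> float:
    """Total storage in decimal terabytes (1 TB = 10**12 bytes)."""
    return items * avg_bytes_per_item / 1e12


# Try a few guesses for the average page size, in bytes.
for avg_bytes in (5_000, 10_000, 50_000):
    tb = total_terabytes(ITEMS, avg_bytes)
    print(f"{avg_bytes:>6} bytes/page -> {tb:,.0f} TB")
```

Under these assumptions 50K per page comes out nearer 300 TB than 30 TB; 30 TB would correspond to roughly 5K per page.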

    --
    SCREW THE ADS! http://adblock.mozdev.org/ Proud user of teh Fox of Fire - Registered Linux User #289618
    1. Re:How much space do they use for caching? by stratjakt · · Score: 2, Interesting

      They don't cache images and shockwave/java bloat though, just text. I'd say most pages are well under 10k. But who cares.

      --
      I don't need no instructions to know how to rock!!!!
    2. Re:How much space do they use for caching? by ediron2 · · Score: 4, Interesting
      With 6 billion pages indexed and cached, and maybe an average of 50K per page (which is probably pretty conservative - it's probably twice that in some cases), that's nearly 30TB, IICIC!!! The hard disk and RAID folks must LOVE Google....
      30tb... at a buck a gig, those $30,000 sure do look appetizing to all the hard drive and raid makers.

      Not!

      Hell, even doing 2x or 3x this amount for server-class drives still leaves us talking lame amounts. Just one Hitachi/Sun 9980 Fiber Channel drive costs several times more than this.

      Seriously, everything I've heard indicates that google's methods hinge on a lot of white boxes, each one covering a subset of the google data. Put another way, drivespace per server isn't the limiting factor. A distributed system with several hundred white box servers can't HELP but have tens of terabytes of storage, given drive capacities of tens and hundreds of gigs each.

      A client just bought a Hitachi 9980. As sweet as the Hitachi arrays are, I thought it was the most horrendous waste of cash I'd ever seen, considering this client's more modest needs. THOSE are the customers that raid/drive makers love... all it takes is one IT guy with hardware lust who has the trust of a Fortune-500 firm.

    3. Re:How much space do they use for caching? by RedWizzard · · Score: 2, Interesting
      30tb... at a buck a gig, those $30,000 sure do look appetizing to all the hard drive and raid makers.
      I've heard that it's all kept in RAM. 30TB of RAM is going to cost a lot more than $30,000. If it is also on disk would they use cheap IDE disks or a server class solution?
  19. Google pulled us out of "The Dark Ages" by leoaugust · · Score: 4, Interesting

    There is an interesting article in the Washington Post, "Search For Tomorrow", on Google and possible AI in search.

    Some excerpts:

    We stumbled around in libraries. We lifted from the World Book Encyclopedia. We paged through the nearly microscopic listings in the heavy green volumes of the Readers' Guide to Periodical Literature. We latched onto hearsay and rumor and the thinly sourced mutterings of people alleged to be experts. We guessed. We conjectured. And then we gave up, consigning ourselves to ignorance.

    Only now in the bright light of the Google Era do we see how dim and gloomy was our pregooglian world. In the distant future, historians will have a common term for the period prior to the appearance of Google: the Dark Ages.

    There have been many fine Internet search engines over the years -- Yahoo!, AltaVista, Lycos, Infoseek, Ask Jeeves and so on -- but Google is the first to become a utility, a basic piece of societal infrastructure like the power grid, sewer lines and the Internet itself.


    --
    To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
  20. Re:Is /. pro Google? by selderrr · · Score: 2, Interesting

    It all depends on how often they rotate their logs and how long they store their backups. I honestly don't believe they can keep logs longer than a few weeks. Any longer and they'd need a 2nd server farm to store the archive. And no terrorists would go from a google query to a bomb in a few weeks. So I guess you're quite toptinfoiled indeed.

  21. Re:They said 6 billion items, not webpages. by Jugalator · · Score: 2, Interesting

    Yes, and while the press release says they doubled their image search index size, I'm more interested in how much their regular web index increased in size? I have a vague feeling it was around 4 billion before too? :-/

    --
    Beware: In C++, your friends can see your privates!
  22. Re:Most press-release like post ever by glinden · · Score: 2, Interesting
    • this is so obviously just a link to a press release
    It really is an uninformative press release. Surprising it made it to Slashdot.

    I would have liked to see some information about the underlying technology that allowed this bigger index, especially if it allowed the broader coverage without a reduction in search result quality.
  23. Image search: What's your experiences? by GQuon · · Score: 4, Interesting

    Both Google and Fast have image and picture search. They're all right. But I have had more luck with Lycos.

    What are your experiences?

    Of course, none of these services search in the image data itself. They search filenames, special features (like image size), and the content of the pages they are found in.
    What is the state of searching in images today? Facial recognition systems have existed for a while, but they are made for a specific purpose.

    How long before we can take a picture of that piece of your IKEA furniture and find the same model in pictures of celebrity houses, Babylon 5 sets and crime scenes? Or taking a picture of that familiar-looking person walking down the street, searching for her, and remembering that she was in that "reality" series two years ago.

    --
    Irene KHAAAAAAN!
  24. Re:Going Public & Pay Per Search by /dev/trash · · Score: 2, Interesting

    Like, I search for say "perl code" and they instead present me with a page to login with my credit card number?

    I highly doubt that. I'd no longer use Google, and I bet a lot of others wouldn't either. Free is pretty addictive, even if they do have a lot of stuff indexed.

  25. But should we still be using Google? by mshiltonj · · Score: 1, Interesting

    Is Google becoming a task master for Big Brother?

    Et tu, Google?

  26. "miserable failure" top 5, update by Anonymous Coward · · Score: 1, Interesting

    Let's see if the 'new' index adds interesting stuff:

    1.- Michael Moore (still no. 1)
    2.- Dubya
    3.- Jimmy Carter
    4.- Sen. Hillary R. Clinton
    5.- Howard Dean

    PS: litigious bastards still clean, just pointing to litigiousbastards.com.

  27. Mailing lists by ajs · · Score: 4, Interesting

    The thing that is starting to bother me is not the search-spam (easily removed over time with increasingly smart ranking), but the mailing lists. If 20 sites around the net archive the same mailing list, then I'll get the first 20 hits in most technical searches from the same list. Google really needs some way to identify duplicate archives (which is hard given that they're all formatted differently) and treat them as one "site".
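
One common way to spot the same message under different page furniture is to compare overlapping word n-grams ("shingles") with Jaccard similarity. A rough sketch, not a claim about how Google handles this; the sample texts and the helper names `shingles` and `jaccard` are made up for the illustration.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared shingles divided by all shingles; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two made-up archive pages carrying the same message, plus an unrelated page.
archive_a = ("[perl-users] Re: sorting a hash by value. You can sort the keys "
             "with a custom comparison block and then look up each value.")
archive_b = ("Mailing list archive: perl-users | Subject: Re: sorting a hash by value | "
             "You can sort the keys with a custom comparison block and then look up "
             "each value. | Next message")
unrelated = ("Television antennas industry web links for business products, "
             "services, information and resources.")

print("a vs b:        ", round(jaccard(shingles(archive_a), shingles(archive_b)), 2))
print("a vs unrelated:", round(jaccard(shingles(archive_a), shingles(unrelated)), 2))
```

Pages whose score clears some threshold could then be grouped and shown as one hit; picking that threshold is the hard part.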

  28. Number One by Michael.Forman · · Score: 2, Interesting


    The upgrade has been quite good to me! Before the upgrade a search for my name would rank my website many pages down and then only secondary links not the root site. Now I rank number one! It looks like all my slashdot posting has finally paid off.

    Ahh. The small victories of the computer geek.

    Michael.

    --
    Linux : Mac :: VW : Mercedes
  29. Re:but... by Anonymous Coward · · Score: 1, Interesting

    I'd rather they start beating the spammers...

    Random advertising sites are working 24/7 to flood Google with crap :(

    Some sites are even using sneaky things to display special pages that only those with the Google spider's user-agent will ever see...

    Dastardly--I only caught them due to a broken PHP script seen in the cache...

  30. Already been done, sort of by first.last · · Score: 1, Interesting

    Kind of like this?

    --
    Wishing I was a millionaire since 1969.
  31. PNG! by pmsyyz · · Score: 4, Interesting

    ... Advanced features include search by image size, format (JPEG and/or GIF) ...

    They didn't mention PNG, the turbo-studly image format which Google Image Search does indeed support.

    It seems they used to have very few PNGs in their database, but now a search for +a filetype:png returns 700,000 results!

    --
    Phillip
  32. Google "search engine" by Anonymous Coward · · Score: 1, Interesting

    And click on "I'm feeling lucky"

  33. adsense is making sense by DrSkwid · · Score: 3, Interesting


    Google's adsense service https://www.google.com/adsense/overview

    is certainly a winner

    The ads presented are similar to the paid ads shown on a standard google search but using the keywords of the page displayed and also tailored to the country of the viewer via their ip address.

    In this way webmasters can maximize the global potential of their website.

    We have some very highly ranked pages (i.e. top 10) but for UK only content. Now our visitors who find us via search engines and discover we aren't quite what they want are presented with a relevant exit strategy and we get a commission!

    We're getting an average 1.7% click through rate which is translating into a nice tidy sum.

    go google! keep kicking MSN's dirty butt

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
  34. Worried about reliability by xihr · · Score: 2, Interesting

    Especially with this announcement, I'm starting to get worried about the reliability of Google. More and more groups are taking advantage of quirks in Google's ranking system, as has been mentioned in previous Slashdot articles, to the point now where if you're searching for something even a little outside of the pop-culture mainstream (where you will be inundated with valid hits) you will find tons and tons of automatically generated garbage hits on "providers" who boost their indexes by feeding links to each other. Google is a great service; I hope that in its desire to continue its ever-expanding dominance of the search engine market, they don't let themselves get too complacent and let their search engine technology become stale in the sense of it being so abused that for reliable results you need to look elsewhere.

  35. You hum it, we'll find the mp3 or midi.. by zcat_NZ · · Score: 2, Interesting

    Waikato University has a music recognition system that would be awesome on google - if you can hum a few notes, it'll match it with the original tune. Remember all those emusic tunes that ended up as 'elevator' music? A lot of them are free downloads and still available on the artists' websites, but if you hear a tune you like while you're waiting on hold how do you find it?

    Also, it would be cool if I could upload a text-overlayed, renamed thumbnail from usenet and google could find the matching full-size image for me.

    --
    455fe10422ca29c4933f95052b792ab2