Slashdot Mirror


Google URL Index Hits 1 Trillion

mytrip points out news that Google's index of unique URLs has reached a milestone: one trillion. Google's blog provides some more information, noting, "The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we've seen a lot of big numbers about how much content is really out there. To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google's index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day."

37 of 249 comments

  1. Screenshot. by Shaitan+Apistos · · Score: 5, Funny

    Or it didn't happen.

    1. Re:Screenshot. by Anonymous Coward · · Score: 3, Funny

      Seriously, you just want to get into his/her pants...

    2. Re:Screenshot. by Shaitan+Apistos · · Score: 5, Funny

      That can be arranged.

  2. How long till.. by loconet · · Score: 5, Funny

    Once the index reaches a google (or rather a googol), the universe explodes.

    --
    [alk]
    1. Re:How long till.. by txoof · · Score: 4, Funny

      Is that the modern equivalent of the Mayan calendar running out of days?

      --
      This one's tricky. You have to use imaginary numbers, like eleventeen... --Hobbes
    2. Re:How long till.. by rho · · Score: 5, Insightful

      I'm more interested in when Google starts returning relevant results to my queries.

      I can't believe that I'm the only one that finds Google's quality of service somewhat below par. I guess they're better than randomly stabbing in the dark, and there certainly isn't any alternative that's obviously better, but Google sure isn't everything they think they are.

      I know--stop trying to compete with Wikipedia and cut out Experts-Exchange.com from your search results since their pages don't actually return the information you think they do.

      --
      Potato chips are a by-yourself food.
    3. Re:How long till.. by onedotzero · · Score: 5, Informative

      ... and cut out Experts-Exchange.com from your search results since their pages don't actually return the information you think they do.

      Perhaps you should try scrolling to the bottom of the page... :)

    4. Re:How long till.. by cdrudge · · Score: 4, Informative

      It took me a while to realize it, but if you scroll clear to the bottom of an Experts-Exchange post, you'll find the comments unhidden and relevant.

    5. Re:How long till.. by Anonymous Coward · · Score: 5, Informative

      ...and cut out Experts-Exchange.com from your search results since their pages don't actually return the information you think they do.

      If you block cookies from experts-exchange.com you can actually see the answers on any e-e page - after you visit the first time, it normally sets a cookie to not show results next visit, which is how they get Google to index their pages anyway. With cookies from them blocked, you can then see the answers - you just have to scroll 7/8s of the way down the page past all the fake "Please sign up to see this result" boxes.
      (First AC post in years... tee hee. :)

    6. Re:How long till.. by blahplusplus · · Score: 4, Interesting

      "I'm more interested in when Google starts returning relevant results to my queries.

      I can't believe that I'm the only one that finds Google's quality of service somewhat below par."

      You're not the only one, but for the most part it is better than most other search engines out there. The real problem is spammers and paid advertising; I think spammers have really made search frustrating for a lot of companies. And ad companies pay other people to promote their sites for them (digg, slashdot, etc). I've noticed the increase in spam-vertised websites in search results for a lot of things.

      Personally I think the idea of sharding and search being more specific for what you're looking for is needed. I'd like to see a google with 'tags' and a delicious interface, things like educational institutions and universities get lumped into their own search engine space for instance, this would help narrow down what one is looking for, although it would take time and feedback to design something well for other areas. The fact is that search results get diluted as you put more and more stuff online (numbers and geometric scale).

      For fun, I've noticed StumbleUpon and del.icio.us are not bad alternatives when looking for new and interesting sites without having to use search.

    7. Re:How long till.. by Eddi3 · · Score: 3, Informative

      Actually, if you go to the cached version of those pages, you can see all the answers. You can also just use the Googlebot's user agent via the User Agent Switcher.

    8. Re:How long till.. by hairyfeet · · Score: 3, Interesting

      That is why I switched to Yahoo in FF. Google search just seems to be getting crappier of late while Yahoo seems IMHO to be getting better. At least for me it hits what I'm looking for on the first page a lot more than Google, and I love that when I type something like Bioshock I can hit the more tab and get Bioshock demo, cheats, patches, reviews, etc. Whereas if you hit the more tab in Google you just get more crap like Google groups and blogger. But of course it will probably be bought out by MSFT and turn into a giant turd like Live Search, so I'm just going to enjoy it while it lasts. And as always this is my 02c, YMMV.

      --
      ACs don't waste your time replying, your posts are never seen by me.
  3. Wow, that's a lot of porn. by Anonymous Coward · · Score: 5, Funny

    Seriously, since the web is something like 42% porn. (Yes, that is the ultimate answer.) So that's on average, 60-70 pages of each person in the world naked.

    1. Re:Wow, that's a lot of porn. by sweet_petunias_full_ · · Score: 4, Interesting

      "the web is something like 42% porn"

      That probably stopped being the case after namespace speculators started buying up expired domains in large numbers just to put up a mildly useless index on *each* and *every* site to collect ad revenue or marketing statistics off of unwary visitors. I would also include typosquatters in that category, and maybe someone else can name a few other examples of utter namespace hogging uselessness.

      Whatever it is, you can rest assured that it's mostly repetitive trash... no need to stand in awe of it.

      --
      You can't send a takedown notice to an already printed newspaper.
  4. 1 trillion url's by jollyreaper · · Score: 5, Funny

    How many of those are automatically generated rank-spoofers, 80%?

    My favorite spoof pages were the ones that randomly substituted search terms into porno stories.

    "Yes!" she screamed as he thrust his SAMSUNG CD PLAYER deep into her. "I want you balls-deep in my CHEAP HARD DRIVES!" The smell of DISCOUNT SOFTWARE filled the room.

    --
    Kwisatz Haderach
    Sell the spice to CHOAM
    This Mahdi took Shaddam's Throne
  5. Re:Amazing by timmarhy · · Score: 5, Insightful

    i wish they would work on weeding out the crap. anything you google now is infested with cheesy search sites that list other websites and try to plaster you with ads. they contribute zero to the web.

    --
    If you mod me down, I will become more powerful than you can imagine....
  6. And I rest in peace.. by consonant · · Score: 3, Funny

    ..knowing that the vast amounts of porn just keep getting vaster. And more searchable. Amen. *sheds a tear or two*

  7. Re:Odd by Anonymous Coward · · Score: 5, Funny

    So unless there is a screenshot showing the 1,000,000,000,000 site count, Google's index didn't reach that milestone? Even if it now shows 1,000,000,000,001?

    The 1,000,000,000,000th page had only one word on it:

    "woosh"

  8. Some numbers by Reality+Master+101 · · Score: 5, Interesting

    Counts of words:

    the: 18.3 billion pages
    a: 23.9B
    0: 12.7B
    1: 25.4B
    in: 17.1B
    I: 10.2B

    I know these numbers aren't exact, but you'd think one of them would be over 100B if Google is really indexing a trillion pages. What's on them? Anyone find any keywords that produce more?
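    To put those counts next to the headline figure, here is a minimal sketch that takes the numbers listed above at face value (they are search-result estimates, not exact) and works out what share of one trillion the most common term would cover. The variable and function names are just for illustration.

```python
# Page counts (in billions) exactly as listed in the post above; approximate.
reported_hits_billions = {
    "the": 18.3,
    "a": 23.9,
    "0": 12.7,
    "1": 25.4,
    "in": 17.1,
    "I": 10.2,
}

CLAIMED_URLS_BILLIONS = 1000.0  # one trillion, the headline figure


def share_of_claimed(hits_billions, total_billions=CLAIMED_URLS_BILLIONS):
    """Fraction of the claimed URL count that a term's hit count represents."""
    return hits_billions / total_billions


# Pick the term with the largest reported hit count.
top_term = max(reported_hits_billions, key=reported_hits_billions.get)
top_share = share_of_claimed(reported_hits_billions[top_term])

print(f"Most common term {top_term!r}: {top_share:.2%} of one trillion")
```

    Even the biggest count on the list comes to only about 2.5% of a trillion, which is the gap being pointed out.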

    --
    Sometimes it's best to just let stupid people be stupid.
    1. Re:Some numbers by Shaitan+Apistos · · Score: 4, Funny

      My hobby:

      Getting the fewest possible google results above 0 with a quoted string.

      "interspecies gangbang": 6
      "hot topic meets disney world": 2
      "died in a blogging accident": 15,300
      "can boys make babies": 4
      "why does it hurt when I read": 1

    2. Re:Some numbers by miraboo · · Score: 3, Insightful

      My hobby:

      Getting the fewest possible google results above 0 with a quoted string.

      "interspecies gangbang": 6
      "hot topic meets disney world": 2
      "died in a blogging accident": 15,300
      "can boys make babies": 4
      "why does it hurt when I read": 1

      My Hobby

      Attributing my sources: http://xkcd.com/369/

    3. Re:Some numbers by Shaitan+Apistos · · Score: 5, Interesting
  9. What's going on with the founders' studies? by bogaboga · · Score: 4, Interesting

    This might be off-topic but I wonder what's going on with Sergey Brin and Larry Page's [PhD] education? Just wondering...did they give up?

  10. Re:Amazing by Freaky+Spook · · Score: 5, Informative

    I couldn't agree more.

    Many of the clients I support are constantly asking me "Is there a program that does this? or Can you find me a program to do this" etc etc.

    I used to be able to just use google to help me get started but these days the top level searches are all those bloody link farms peddling "free" software, even when typing in the word review you come up with link farms that offer no reviews.

  11. Re:No concern for the foreign readers? by kclittle · · Score: 4, Funny

    Google is headquartered in Mountain View, CA -- I know, 'cause I googled it. Now, California is rather inclined to think of itself as its own country (some would say, universe), but it is indeed part of the United States of America (again, I checked with Google). And in the US, "trillion" == 1E12 (again, Google).

    --
    Generally, bash is superior to python in those environments where python is not installed.
  12. No, it didn't. by aiken_d · · Score: 5, Informative

    They have identified that there are 1T pages out there, somewhere. They have indexed 40 billion pages. Read the entire Google post. It says it right there.

    Bad on Google for the misleading post. Bad on the submitter for not reading the misleading post. Bad on Slashdot for further descending into mindless repetition of mindless submissions of mindless PR announcements.

    --
    If I wanted a sig I would have filled in that stupid box.
  13. And most of them are webspam by Animats · · Score: 3, Insightful

    But how many of those trillion pages have unique, useful content? E-mail is over 95% spam, and the web is getting there.

    There were about 153 million registered domains at the beginning of the year. The ones from the spam-friendly registrars are mostly junk. Tim Berners-Lee said in 2006 that web junk was becoming a major problem, and it's become worse since then.

    If you throw out all the anonymous but commercial domains (we call them "bottom-feeders"), as we do with SiteTruth, the Web looks a lot better. Search engines are getting stricter about this. You don't see that many "landing pages" in Google any more. Bad news for companies like Marchex, the publicly traded web spammer that cranks out all those junk "What you need, when you need it" sites.

    "The mass trials are going well. There will be fewer Russians, but better ones." - Greta Garbo in Ninotchka.

  14. Re:Amazing by arotenbe · · Score: 4, Informative

    Many of the clients I support are constantly asking me "Is there a program that does this? or Can you find me a program to do this" etc etc.

    I used to be able to just use google to help me get started but these days the top level searches are all those bloody link farms peddling "free" software

    Have you tried SourceForge? That's what it's there for, you know.

    --
    Tomato wedge sperm darts that are Republican.
  15. Try "Live" search by symbolset · · Score: 3, Interesting

    And you'll be back faster than a Google search result. Weeding out the crap?

    Just for a sample, try this one: getfirefox. If the first link on that search goes to a Mozilla mirror you will win one Internet. Try Linux. Hey, this is fun. Spoiler: the first link there is always "www.Microsoft.com/Windows : Special Offers from Windows Vista® w/ the Purchase of Select Laptops." The first time I tried this I was looking for Open Office and wound up misdirected to a members only site where you had to register to download a probably spyware infested Open Office and signing up for unlimited pharma spam. The scary part is that the text of the link misled me to believe I was headed for "OpenOffice.org". Try it and see. Let's find more horrifically inappropriate ad placements and query results, shall we? I'll bet you could come up with a really funny one.

    Note: Please don't go to any of the sites linked to those search results through live.com. Bad things might happen to your Windows box and there's nothing there of interest for your powerbook.

    Yeah, that's a good search result ad, don't you think? No wonder Google is becoming a verb.

    --
    Help stamp out iliturcy.
    1. Re:Try "Live" search by pagaboy · · Score: 5, Funny

      Turns out Live.com's market share for today has tripled due to Slashdot users clicking on the above links...

  16. Re:First Post by Vectronic · · Score: 3, Interesting

    -1 Redundant sure...

    But that's sort of along the lines I was/am thinking... take txoof's post alone (or mine, or whoever may reply): there are 3 separate URLs for each Slashdot comment.

    The Header:
    http://search.slashdot.org/comments.pl?sid=626647&cid=24345519

    The User:
    http://slashdot.org/~txoof

    The Score:
    http://search.slashdot.org/article.pl?sid=08/07/26/0036245#

    How many Slashdot comments are there? It's probably in the high millions (rhetorical, but I'm interested to know nonetheless). There's an average of about 250 comments per article and about 25 articles a day; that's about 2 million comments a year, so 6 million links. Then take into consideration stuff like Facebook, which bounces URLs (http://www.facebook.com/link=###/etc), or sites that generate a random identifier every few minutes, making those URLs "unique", and the number gets unexciting quite quickly, although billions is still fairly high.
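    The back-of-the-envelope figures above can be checked with a quick sketch. The inputs are the rough guesses from this post (250 comments per article, 25 articles a day, 3 URLs per comment), not measured values, and the function name is made up for illustration.

```python
# Rough inputs taken from the post above; all of them are guesses.
COMMENTS_PER_ARTICLE = 250
ARTICLES_PER_DAY = 25
URLS_PER_COMMENT = 3  # header, user, and score links
DAYS_PER_YEAR = 365


def yearly_comment_urls(comments_per_article=COMMENTS_PER_ARTICLE,
                        articles_per_day=ARTICLES_PER_DAY,
                        urls_per_comment=URLS_PER_COMMENT):
    """Return (comments per year, comment-related URLs per year)."""
    comments = comments_per_article * articles_per_day * DAYS_PER_YEAR
    return comments, comments * urls_per_comment


comments, urls = yearly_comment_urls()
print(f"{comments:,} comments/year, {urls:,} URLs/year")
```

    That works out to roughly 2.3 million comments and 6.8 million links a year, in line with the rounded numbers above. Treat the link count as an upper bound, since the user link is shared by every comment from the same user.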

  17. Re:Amazing by cammoblammo · · Score: 3, Funny

    I imagine that certain sites, such as sites the size of Slashdot (in terms of dynamically generated pages), make a difference. After all, the index talks in pages, not domains. I bet there's also a lot of junk and redundancy in there, but still, it's quite an achievement to be able to deal with that much data.

    Surely you're not saying that Slashdot's full of junk and redundancy and redundancy?

    --

    Cogito, ergo sig.

  18. Re:First Post by repvik · · Score: 3, Funny

    Considering your comment is #24345983, I'd say about 24.3 million comments. Also, I believe there's about 1.5 million different users.

  19. google's search becoming steadily useless. by blind+biker · · Score: 3, Insightful

    I think google.com's search engine achieved its peak usefulness about 5 years ago. Now, for the most part when I google for a certain electronic component I get some crappy webstore front (and by crappy I mean I can't actually order the component but must "contact by phone" first) or if I search for an electronic device, be it pro or just home electronics, I get those "Read reviews and compare prices"-sites. Which I hate with a passion. WTF google, you have the world's most talented programmers, can't you weed out this crap from your search? At least so it doesn't come up as top hits?

    --
    "The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
  20. Re:Amazing by Anonymous Coward · · Score: 3, Insightful

    i wish they would work on weeding out the crap.

    There are a *lot* of people at Google working on that problem. Please understand that it is really difficult to keep up with new attacks when your site is #1, because many people out there are aiming directly for it. No matter how many work on ranking and relevance inside the company, there will always be 10x-100x that number of people outside who are working on the shady side of SEO, spamming, etc. It's a never-ending battle, much like spam email. We're trying.

    anything you google now is infested with cheesy search sites that list other websites and try to plaster you with ads. they contribute zero to the web.

    We're working on that both from the search side (ranking) and the ads side (not letting those sites run using Google ads).

    If you want to help, you have many options:

    1. Join Google. If you get in, and say you want to work improving search results or stopping spammy ads sites, you'll have no trouble joining an appropriate group.
    2. If you've got a better approach, start a company. If it is a better approach, sooner or later you'll get noticed, and probably bought by a search company. Good ideas are worth a lot in a big industry.
    3. Read the available research in the area, do your own experiments, and contribute to the pool of knowledge. Could easily lead to #1 or #2.
    4. If a company you recognize engages in shady practices, tell them you'll take your business elsewhere unless they clean up. If you're a blogger, remind them of that fact. If the company you work for starts to do shady things, point out that you don't think it's OK.
  21. Re:First Post by Anonymous Coward · · Score: 5, Funny

    Also, I believe there's about 1.5 million different users.

    yeah but if you take out Twitter and all his sock-puppets you'll just be left with 500K unique users...

  22. Dynamic pages pollute count by Coolhand2120 · · Score: 4, Informative

    There are so many dynamic pages on the net now that one web site, like slashdot as an earlier poster commented, can contain literally millions of pages. People use programs like modrewrite, isapirewrite and linkfreeze to manipulate spiders into crawling pages that are near identical. For more than one customer I've made meta, title and content randomization, serialization and/or URL rewriting schemes to make damn sure spiders index every possible dynamic page, and it works. I have a single dynamic page that must have been indexed hundreds, maybe thousands of times with slightly different content, and they are all in the index.

    Google tries to detect a dynamic page by looking for ampersands and equal signs, as well as looking at the content of the page, but it is really quite easy to fool.

    e.g.: http://somesite.com/itemlist.php?listmode=1&category=beds&orderby=7
    when 'rewritten' shows up as
    http://somesite.com/items/1/beds/7.html
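    As a rough illustration of the example above, here is a minimal sketch of a naive "is this URL dynamic?" check and of a rewrite that defeats it. Both functions are hypothetical: looks_dynamic() only mirrors the ampersand/equal-sign heuristic described in this post and is not how any real crawler is known to work, and rewrite() hard-codes the one rule from the somesite.com example.

```python
from urllib.parse import urlsplit, parse_qs


def looks_dynamic(url):
    """Naive heuristic: a URL is 'dynamic' if it has a query string
    or contains '&' or '=' after the host name."""
    parts = urlsplit(url)
    tail = parts.path + parts.query
    return bool(parts.query) or "&" in tail or "=" in tail


def rewrite(url):
    """Hypothetical rewrite rule matching the example above:
    itemlist.php?listmode=L&category=C&orderby=O -> /items/L/C/O.html"""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return "{}://{}/items/{}/{}/{}.html".format(
        parts.scheme,
        parts.netloc,
        query["listmode"][0],
        query["category"][0],
        query["orderby"][0],
    )


original = "http://somesite.com/itemlist.php?listmode=1&category=beds&orderby=7"
rewritten = rewrite(original)

print(looks_dynamic(original), original)
print(looks_dynamic(rewritten), rewritten)
```

    The original URL trips the check, while the rewritten one looks like a plain static .html file even though the same script generates both.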

    So 1 billion web pages could really be just a few hundred thousand dynamic pages (and I know of a few thousand pages like this). Not that the pages don't have relevant information, but some of the stuff can be redundant. For instance, when the spider crawls across "Records per page = 10" > "Records per page = 20" > "Records per page = 30" etc., or when lazy programmers don't use cookies and databases to store information but try and concatenate the URL with the user's selections. Thank god for that GET limit. People need to use POST!
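    To show how the "records per page" variants inflate a page count, here is a small sketch that collapses URLs differing only in presentation parameters. The parameter names (perpage, orderby) and the sample URLs are assumptions made up for this example, and this is only one simple way to canonicalize, not a description of what any search engine actually does.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that change how a list is shown, not what it contains.
PRESENTATION_PARAMS = {"perpage", "orderby"}


def canonical(url):
    """Drop presentation-only parameters and sort the rest so that
    equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in PRESENTATION_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


# Made-up crawl results: the same list shown 10, 20 and 30 records at a time.
crawled = [
    "http://somesite.com/itemlist.php?category=beds&perpage=10",
    "http://somesite.com/itemlist.php?category=beds&perpage=20",
    "http://somesite.com/itemlist.php?perpage=30&category=beds",
]

unique = {canonical(url) for url in crawled}
print(len(crawled), "crawled URLs ->", len(unique), "distinct page(s)")
```

    All three sample URLs reduce to the same canonical form, so a count of "unique URLs" can be several times larger than the number of genuinely different pages.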

    If someone knows how to stop this message board from creating links out of false URLs please, let me know.