Google URL Index Hits 1 Trillion
mytrip points out news that Google's index of unique URLs has reached a milestone: one trillion. Google's blog provides some more information, noting,
"The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we've seen a lot of big numbers about how much content is really out there. To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google's index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day."
Or it didn't happen.
And about 600,000,000 of those are, "FrIST P0ST!"
This one's tricky. You have to use imaginary numbers, like eleventeen... --Hobbes
Once the index reaches a google (or rather a googol), the universe explodes.
[alk]
Wow! Five of those are mine! Mine I tells you!
I'm rich! I'm famous!
Don't you fuck with me, man! I am in Google! and I will rip you new ones!
As someone who is partially engineering/analytically minded (but not a great programmer), it amazes me how Google has managed to index so much data, yet at the same time, serve up results in a fraction of a second to so many people.
So unless there is a screenshot showing the 1,000,000,000,000 site count, Google's index didn't reach that milestone? Even if it now shows 1,000,000,000,001?
Seriously, since the web is something like 42% porn. (Yes, that is the ultimate answer.) So that's on average, 60-70 pages of each person in the world naked.
How many of those are automatically generated rank-spoofers, 80%?
My favorite spoof pages were the ones that randomly substituted search terms into porno stories.
"Yes!" she screamed as he thrust his SAMSUNG CD PLAYER deep into her. "I want you balls-deep in my CHEAP HARD DRIVES!" The smell of DISCOUNT SOFTWARE filled the room.
Kwisatz Haderach
Sell the spice to CHOAM
This Mahdi took Shaddam's Throne
Wow! Five of those are mine! Mine I tells you!
Only if Google has indexed your sites.
Trillion can mean 1E+12 or 1E+18 depending on which country you are in.
..knowing that the vast amounts of porn just keep getting vaster. And more searchable. Amen. *sheds a tear or two*
[Slashdot Comments We Liked]
End-user: "I want to download the internet. Will I need a bigger hard drive?"
Google: "Yes"
End-user: "I didn't even tell you how big my hard drive is!"
Counts of words:
the: 18.3 billion pages
a: 23.9B
0: 12.7B
1: 25.4B
in: 17.1B
I: 10.2B
I know these numbers aren't exact, but you'd think one of them would be over 100B if Google is really indexing a trillion pages. What's on them? Anyone find any keywords that produce more?
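A quick back-of-the-envelope check of that point, using only the result counts quoted above. The variable names are made up for the sketch, and the counts are Google's rough estimates rather than exact figures.

```python
# Claimed number of unique URLs, per the story.
INDEX_CLAIM = 1_000_000_000_000

# Approximate result counts quoted in the comment above.
counts = {
    "the": 18.3e9,
    "a": 23.9e9,
    "0": 12.7e9,
    "1": 25.4e9,
    "in": 17.1e9,
    "I": 10.2e9,
}

# Find the most common term and its share of the claimed trillion.
top_word, top_count = max(counts.items(), key=lambda kv: kv[1])
share = top_count / INDEX_CLAIM

print(f"Most common term: {top_word!r} at {top_count / 1e9:.1f}B pages")
print(f"Share of one trillion: {share:.2%}")
```

Even the most common term covers only a few percent of a trillion, which fits the reading that the trillion counts URLs seen rather than pages indexed.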
Sometimes it's best to just let stupid people be stupid.
This might be off-topic but I wonder what's going on with Sergey Brin and Larry Page's [PhD] education? Just wondering...did they give up?
to the increment of garbage in American landfill since 1998? ;)
http://www.365tomorrows.com/09/12/the-nine-billion-names-of-god/
the first couple hundred results for every search didn't consist of useless price comparison sites, a Target ad, youtube videos of dubious quality, link farms and experts exchange crap.
No wonder I cringe every time I need to start doing research on the web.
They have identified that there are 1T pages out there, somewhere. They have indexed 40 billion pages. Read the entire Google post. It says it right there.
Bad on Google for the misleading post. Bad on the submitter for not reading the misleading post. Bad on Slashdot for further descending into mindless repetition of mindless submissions of mindless PR announcements.
If I wanted a sig I would have filled in that stupid box.
I wonder how much bandwidth the daily/continuous Google index process takes.
I can tell you without reservation that more than half those "unique URLs" are dead pages. Over the course of my 12+ years, which pre-dates Google, 99% of all my "unique URLs" are dead and buried. I've done 1000s of pages, the bulk being online manuals. I know, I know. This is an oddity among readers here, but manuals are essential parts of software products, even those that are only around for a few years. It's not a product if it doesn't have a real, honest-to-goodness manual.
But how many of those trillion pages have unique, useful content? E-mail is over 95% spam, and the web is getting there.
There were about 153 million registered domains at the beginning of the year. The ones from the spam-friendly registrars are mostly junk. Tim Berners-Lee said in 2006 that web junk was becoming a major problem, and it's become worse since then.
If you throw out all the anonymous but commercial domains (we call them "bottom-feeders"), as we do with SiteTruth, the Web looks a lot better. Search engines are getting stricter about this. You don't see that many "landing pages" in Google any more. Bad news for companies like Marchex, the publicly traded web spammer that cranks out all those junk "What you need, when you need it" sites.
"The mass trials are going well. There will be fewer Russians, but better ones." - Greta Garbo in Ninotchka.
A trillion URLs, and still no sign of clownpenis.fart in the index anywhere!
At this rate it really will be the last one to go.
NO CARRIER
And you'll be back faster than a Google search result. Weeding out the crap?
Just for a sample, try this one: getfirefox. If the first link on that search goes to a Mozilla mirror you will win one Internet. Try Linux. Hey, this is fun. Spoiler: the first link there is always "www.Microsoft.com/Windows : Special Offers from Windows Vista® w/ the Purchase of Select Laptops." The first time I tried this I was looking for Open Office and wound up misdirected to a members only site where you had to register to download a probably spyware infested Open Office and signing up for unlimited pharma spam. The scary part is that the text of the link misled me to believe I was headed for "OpenOffice.org". Try it and see. Let's find more horrifically inappropriate ad placements and query results, shall we? I'll bet you could come up with a really funny one.
Note: Please don't go to any of the sites linked to those search results through live.com. Bad things might happen to your Windows box and there's nothing there of interest for your powerbook.
Yeah, that's a good search result ad, don't you think? No wonder Google is becoming a verb.
Help stamp out iliturcy.
And not just those stupid "parked by GoDaddy" or "domain farms"? Like those ones where when you Google something, and you get a promising result, so you click on it, and it turns out to be nothing more than a page full of ads.
It's harder to do now. If you can find two words unquoted that result in one result it's a googlewhack like this one was before Google found "eltiguan parainaugurarme" on this page and made it the second result.
BTW, my first google search was "war" and it returned something equivalent to "Your search term is too common to return a meaningful result. Narrow your search." Today it returns about 974 million results. It was long ago...
Help stamp out iliturcy.
You'll be back faster than a Google search result.
Help stamp out iliturcy.
On my home Web server, I accidentally left a copy of the PHP manual in a browsable folder, which was linked to the homepage. So when Google indexed my homepage, guess what it also checked for? Every single page the homepage linked to! Including that manual... and damn the PHP manual has a LOT of pages.
So when I got back on the server and pulled up the logs (it was running strangely slow) I found Googlebot accessing page after page after page of the PHP manual. Thousands of pages. Lagging the server and Internet to hell.
Specifically which page was the trillionth?
Help stamp out iliturcy.
They grew up.
According to the googleblog post, which of the following is not true?
A. The initial google had 26 million pages.
B. Google has seen 1 trillion unique urls.
C. Google indexes every one of those trillion pages.
D. The web is infinite.
Now what isn't said here is that 999,999,998,000 of those pages are porn.
I think google.com's search engine achieved its peak usefulness about 5 years ago. Now, for the most part when I google for a certain electronic component I get some crappy webstore front (and by crappy I mean I can't actually order the component but must "contact by phone" first) or if I search for an electronic device, be it pro or just home electronics, I get those "Read reviews and compare prices"-sites. Which I hate with a passion. WTF google, you have the world's most talented programmers, can't you weed out this crap from your search? At least so it doesn't come up as top hits?
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
I mean, really. 90% of it is junk.
There are so many dynamic pages on the net now that one web site, like slashdot as an earlier poster commented, can contain literally millions of pages. People use programs like mod_rewrite, ISAPI_Rewrite and LinkFreeze to manipulate spiders into crawling pages that are near identical. For more than one customer I've made meta, title and content randomization, serialization and/or URL rewriting schemes to make damn sure spiders index every possible dynamic page, and it works. I have a single dynamic page that must have been indexed hundreds, maybe thousands of times with slightly different content, and they are all in the index.
Google tries to detect a dynamic page by looking for ampersands and equal signs, as well as looking at the content of the page, but it is really quite easy to fool.
e.g.: http://somesite.com/itemlist.php?listmode=1&category=beds&orderby=7
when 'rewritten' shows up as
http://somesite.com/items/1/beds/7.html
So 1 billion web pages could be (and I know a few thousand pages like this) just a few hundred thousand dynamic pages. Not that the pages don't have relevant information; some of the stuff can be redundant though. For instance, when the spider crawls across "Records per page = 10" > "Records per page = 20" > "Records per page = 30" etc., or when lazy programmers don't use cookies and databases to store information but try and concatenate the URL with the user's selections. Thank god for that GET limit. People need to use POST!
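To make the trick concrete, here is a rough sketch of the idea in Python. Both functions are hypothetical: `looks_dynamic` is only a guess at the kind of naive "ampersands and equal signs" check described above, not Google's actual logic, and `rewrite` just mirrors the somesite.com example rather than any real rewrite configuration. Real rewrite modules usually work in the other direction on the server, mapping the clean path back to the script.

```python
from urllib.parse import urlsplit, parse_qs


def looks_dynamic(url):
    """Naive heuristic (an assumption): flag URLs containing '&' or '='."""
    return "&" in url or "=" in url


def rewrite(url):
    """Hypothetical rule: express itemlist.php parameters as a static-looking path."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return "{}://{}/items/{}/{}/{}.html".format(
        parts.scheme,
        parts.netloc,
        query["listmode"][0],
        query["category"][0],
        query["orderby"][0],
    )


original = "http://somesite.com/itemlist.php?listmode=1&category=beds&orderby=7"
rewritten = rewrite(original)

print(original, "-> dynamic?", looks_dynamic(original))
print(rewritten, "-> dynamic?", looks_dynamic(rewritten))
```

The first URL trips the check and the second does not, even though both are served by the same script with the same parameters. Any detection that looks only at the URL misses this, so a crawler would have to compare page content to spot the duplicates.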
If someone knows how to stop this message board from creating links out of false URLs please, let me know.
Google has proved, to become rich one has to do evil.
Adsense is the cause of 90% of poor quality MFA web pages.
All three of his examples go directly to the most official site for Firefox, Linux and OpenOffice respectively. Nice try though.
That said, Google's results are still generally better.
"Why is McDonald's still counting? How insecure is this company? Forty million eighty jillion killion trillion....is anyone really impressed anymore? Oh eighty-nine billion sold! All right I'll have one."
-- Jerry Seinfeld
Call me when the index is bigger than the pr0n index on my PC
Regardless of the number of legitimate websites that are a part of that 1 trillion, Google always deserves positive recognition in my opinion. I really wish to see them grow even bigger than they are because of all the incredibly helpful things they have provided for humanity, mostly free of charge. I really hope to see more businesses especially start using Google's online services a lot more, such as Google Docs, Google Calendar and Gmail. So many businesses in small towns are tricked into wasting their money on Microsoft licensing because they simply aren't aware of any other solutions. Between the amount of money that is wasted on MS Exchange and MS Office, think of the money that could be saved. When you think of how big Google is, you still don't think about the fact that the majority of people have no idea what other services Google provides other than their search engine, or if they are lucky they may have seen Google Earth on one of their friend's computers. There are many uneducated people out there that don't know even 3/4ths of the services that Google provides. It sounds stupid, but unfortunately it's true, and I can only imagine what some business owners would think if there was an absolutely free version of the software they use on a regular basis.
*plays the Apogee theme song music*
Sometimes it's nice to be called Dr. Brin or Dr. Page. Especially when dealing with somebody who doesn't know you or talking to a roomful of PhD's (such as their employees).
Now go to google and type "live search", "Microsoft" and "Microsoft Office" (without the quotes). If you can't explain why the ads and search results that Google puts on those pages are qualitatively better then you're not qualified to judge my comment. On a lark I went back to Live and did those searches too. If you search for "Microsoft" on Live today it shows three ads, each of which is likely to be more harm than help. This is why Microsoft is third in search and fading despite wasting billions on it. They just don't get it and they never will.
Help stamp out iliturcy.
google finally has a use for a bigint auto-increment primary key now!!!!
And yet, the phrase "hot teen ass" accounts for less than a million of that trillion....
As for the phrase "star trek ass", it's barely even worth mentioning...
Isn't 10E12 in the range of soft errors in computing? The statistical chance of a bit changing somewhere... how do they deal with that? And we're just talking about the number of URLs (basically pointers), not the amount of data.
"Violence is the last refuge of the competent, and, generally, the first refuge of the incompetent" - Thing_1
nine hundred and ninety nine million nine hundred and ninety nine thousand nine hundred ninety nine!!!!!!!
The saturation level of the "net"???
Joe Investor
If your guestbook is empty, if the only person who checks your page is your mom, know that you are not alone: a friendly Google bot (think Wall-e) is visiting your page regularly, thoroughly parses through your lols and other pimples, indexes them and assigns shiny sunny ranks.
Kinda warms your heart, ah?
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
Maybe the www will finally have multiple encoded links enabled to get google a kicking in Chinese and Arabic.
"My God, It's full of ADS!"
Hope I will still be alive when they reach Googol indexed pages
Now go round up the understander because if the team doesn't get what I'm saying here their mission will fail no matter how often you dismiss informed criticism.
Not that I care -- making fun of them is one of my favorite things. They should buy AOL and Yahoo. It would complete the triumvirate of Internet Suck and implode. That would be fun to watch.
Help stamp out iliturcy.
good for google
www.forcesign.com