WWW Surpasses One Billion Documents
Gary William Flake writes "A new study by Inktomi and NEC Research Institute shows that there are at least one billion unique indexable Web pages on the Internet. The details are pretty interesting; for example, Apache dominates the server market."
approximately 7 of them are useful.
Does narcissism count as a hobby? --Shawn Latimer
Longest domain name:
http://www.tax.taxadvice.taxation.irs.taxservices.taxrepresentation.taxpayerhelp.internalrevenueservice.audit.taxes.com
Gee. A tax site with a long, unintelligible, confusing domain name. Go figure.
"You want to kiss the sky? Better learn how to kneel." - U2
Sig:
Barbeque is a noun. Not a verb.
The net will not be what we demand, but what we make it. Build it well.
For all you know - the web has surpassed at least 1 webpage count. Big Fscking Deal!!!
<DrEvil>One... billion pages</DrEvil>
Sorry - couldn't resist. :=]
________________________
Corporate Jenga: You take a blockhead from the bottom and you put him on top...
Why is one of them Hamster Dance? Don't go there with an 18-month-old child on your lap. For an adult, this is funny once. For a toddler, it is funny every time the computer is on.
The net will not be what we demand, but what we make it. Build it well.
Dynamic content makes the technical quantity of distinct "pages" far greater than a billion.
Well, as any of us geeks know, this isn't really news. I'm sure we passed the billion mark a long, long time ago. Inktomi just wants the publicity, and some news service will probably pick this up, most likely CNN.
One thing of interest, though. If you look under the "Web server market share", Red Hat and mod_perl are apparently web servers now.
So were there three links to www.extraghost.com before they wrote the page, or after? And which one of the band members works at Inktomi? And will it be four after I post this comment?
Finding information on the web is increasingly going to be like trying to find hay in a needle stack. Already the current indexing engines can't keep up, and you have unscrupulous web authors putting bunches of keywords unrelated to their site in their meta tags to ensure that they get mentioned in every single search. Some indexing engines already ignore meta tags for that reason. And how many times have you tried Altavista, Excite or Google only to find that the page you're trying to get to has expired or is 8 years old and hasn't been changed in 7?
This issue is going to have to be addressed, because the web is going to continue growing.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Really, this article says nothing. Unless it states (and it does not) *exactly* what they mean by "unique", I'm not going to take this seriously. A more interesting statistic (and one I haven't seen updated in a while) would be the information conversion ratio between the "RealWorld" and the web, i.e. how much of the information that you can find in a library can you also find online in its entirety. That is a more accurate measure of growth than raw page numbers.
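To show why the definition of "unique" matters, here is a rough Python sketch that counts the same small crawl two ways: by distinct URL and by distinct content. It is only an illustration, not how the study counted anything; the function name `count_unique`, the example URLs, and the whitespace normalisation rule are all made up for this post.

```python
import hashlib


def count_unique(pages):
    """pages: iterable of (url, body) pairs.

    Returns (number of distinct URLs, number of distinct bodies).
    """
    urls = set()
    digests = set()
    for url, body in pages:
        urls.add(url)
        # Collapse runs of whitespace so trivially reformatted copies
        # of the same text hash to the same digest (an assumed rule).
        normalised = " ".join(body.split())
        digests.add(hashlib.sha1(normalised.encode("utf-8")).hexdigest())
    return len(urls), len(digests)


# Hypothetical crawl: a www alias and a mirror serve the same text.
crawl = [
    ("http://example.com/faq.html", "Linux FAQ\nversion 1"),
    ("http://www.example.com/faq.html", "Linux FAQ\nversion 1"),
    ("http://mirror.example.org/faq.html", "Linux  FAQ version 1"),
    ("http://example.com/news.html", "News for today"),
]

by_url, by_content = count_unique(crawl)
print(by_url, "unique URLs,", by_content, "unique documents")
```

Counting by URL gives twice as many "unique pages" here as counting by content, so a headline number means little until you know which definition was used.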
49.5% Broken links to mp3s
49.5% pr0n pages with javascript popups
1% other
We humans should be so proud of ourselves.
:)
This sig is false.
They say they are the world's largest search engine and I get many hits spanning my pages from *.inktomisearch.com, but how do you search their site?
Is inktomi publicly searchable? If it is not, then my pages wouldn't be publicly searchable. So, what's the point of them making connections to my sites?
Is the following how you ban a site from your server?
/etc/httpd/conf/access.conf
#deny from domain
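Note that a line starting with # is a comment in Apache configuration files, so the line as quoted would do nothing until the # is removed. As far as I understand it, a host-based deny rule with a partial domain name matches hosts whose names equal that domain or end in it, on whole name components only. The Python sketch below imitates that matching rule so you can see which hosts would be caught; it is not Apache's code, and `host_is_denied` and the crawler hostname are invented for the example.

```python
def host_is_denied(hostname, denied_domains):
    """Return True if hostname equals, or is a subdomain of, a denied domain."""
    host = hostname.lower().rstrip(".")
    for domain in denied_domains:
        d = domain.lower().strip(".")
        # Match whole labels only: "x.inktomisearch.com" matches,
        # but "notinktomisearch.com" does not.
        if host == d or host.endswith("." + d):
            return True
    return False


denied = ["inktomisearch.com"]

# Hypothetical client hostnames as they might appear in an access log.
for client in ["crawler1.inktomisearch.com", "notinktomisearch.com", "www.example.com"]:
    print(client, "denied" if host_is_denied(client, denied) else "allowed")
```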
Inktomi and NEC Researcher: "Oh no!!! I can't remember if I counted our own web page. ARRRGGGHH!!! 1, 2, 3, 4, 5, ................."
IE
1,000,000,000 (US)
or
1,000,000,000,000 (UK)
There's a large difference.
Google is one of the best search engines available for most purposes, because it ignores meta tags, and scores pages higher based on links to the site from other high-scoring pages (this is a recursive definition but the recursion bottoms out).
The result of this is that it gives useful results even when very common words are used. Try searching for Linux on Google. The first ten results are
While a human being might be able to come up with a better list, a machine came up with that list, based solely on the structure of the web. (I wonder why linux.davecentral.com rates so high -- possibly because it's attached to a high-ranking site, davecentral.com).
ObAdvocacy: and Google runs on Linux.
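For the curious, the "recursive definition that bottoms out" can be sketched as a simple repeated calculation: start every page with the same score, then keep redistributing each page's score across the pages it links to until the numbers settle. The Python below is a simplified PageRank-style iteration, not Google's actual code; the damping value of 0.85, the fixed iteration count, and the four-page toy web are all assumptions for illustration.

```python
def link_scores(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to.

    Returns a dict page -> score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly over its links.
                share = damping * scores[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its score over everyone.
                share = damping * scores[page] / n
                for p in pages:
                    new[p] += share
        scores = new
    return scores


# Toy web (hypothetical): three pages link to "hub", nobody links to "c".
web = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
scores = link_scores(web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

The page that everybody links to ends up on top, and the page nobody links to ends up at the bottom, without any meta tags being read at all.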
Netcraft's measure is by number of servers, while this measure is by number of pages.
It's not surprising that they both agree, but it's certainly possible that larger sites might run a different server from the average site, causing a difference.
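A quick way to see how the two measures could disagree is to compute both from the same data. The Python sketch below uses made-up numbers (three small sites on one server, one big site on another) and a hypothetical `shares` helper; it says nothing about the real figures from either survey.

```python
from collections import Counter


def shares(sites):
    """sites: list of (server_software, page_count) pairs, one per site.

    Returns (share by number of sites, share by number of pages).
    """
    site_counts = Counter()
    page_counts = Counter()
    for software, pages in sites:
        site_counts[software] += 1
        page_counts[software] += pages
    total_sites = sum(site_counts.values())
    total_pages = sum(page_counts.values())
    by_site = {s: c / total_sites for s, c in site_counts.items()}
    by_page = {s: c / total_pages for s, c in page_counts.items()}
    return by_site, by_page


# Invented data: "OtherServer" is a placeholder name, not a real product.
sites = [
    ("Apache", 100),
    ("Apache", 100),
    ("Apache", 100),
    ("OtherServer", 700),
]
by_site, by_page = shares(sites)
print("by site:", by_site)
print("by page:", by_page)
```

In this invented case one server wins by site count and the other wins by page count, which is exactly the kind of gap a single very large site could cause.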
Well, my take from the site is that what they're actually saying is "Look at our lovely indexing cluster. It can index 1 billion web thingies! Shouldn't you be buying a search engine product that powerful?"
Or, in other words, it's another example of meaningless statistics spewed in the name of marketing, vaguely covered-up as serious research.
References: Car MPG & top speed figures vs actual usage, Processor MHz as function of system throughput, quoted battery life as function of laptop utilisation, quaketest FPS compared to average internet multiplayer experience etc etc etc...
--
I'd rather have a bottle in front of me than a frontal lobotomy
Hair splitting alert ON.
The number of (different) pages on the web is actually infinite. Here is a sample infinite component.
(Actually it's finite because the maximal accepted length for a URL is finite. But it's way above the billions.)
Note that these are not dynamic pages. Dynamic pages (i.e. pages whose content changes for the same URL) don't count: they're cheating.
(The source used to generate this infinite number of pages is available under the GPL.)
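To make the idea concrete, here is a minimal Python sketch of how such a component could work: every URL of the form /n/<number> gets its own fixed page, and each page links to the next one, so a spider never runs out. This is not the GPL source mentioned above; the `/n/` path scheme and the `page_for` function are assumptions of mine.

```python
def page_for(path):
    """Return the HTML for paths like '/n/123', or None for anything else.

    The content depends only on the URL, so the same URL always yields
    the same page and different URLs always yield different pages.
    """
    prefix = "/n/"
    if not path.startswith(prefix):
        return None
    tail = path[len(prefix):]
    if not tail.isdecimal():
        return None
    n = int(tail)
    # Reject non-canonical spellings such as '/n/007' so that each
    # page has exactly one URL.
    if tail != str(n):
        return None
    return (
        "<html><body><p>This is page number %d.</p>"
        '<a href="/n/%d">next</a></body></html>' % (n, n + 1)
    )


print(page_for("/n/41"))
```

Hooking `page_for` up to any web server gives a crawler an endless chain of distinct pages, limited only by how long a URL the server will accept.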
The one billion documents were found to be a plot by The Cult of Arthur C Clarke to end the Universe - each page having a unique name of God on it.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I'm willing to help moderate on some subjects.
The net will not be what we demand, but what we make it. Build it well.
Grabbing just one page from each server is going to be faster than spidering the entire site. Therefore I'd expect Netcraft to be ahead of all the search engines.
The Internet does not represent an infinite number of users (at least, not yet) but you're still more likely to get an infinite volume of monkey shit out of it while you try to dig up the works of Shakespeare.
Or you could save time and go here.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Check out Ghost sites...
Your Working Boy,
Just another for-all-practical-purposes-meaningless statistic to nonetheless feel overwhelmed by, I suppose.
If there were a billion pages to look at, I don't know when I'd have the time to do anything else, being the info-junkie that I am. Fortunately, a sufficient quantity of these pages do not interest me.
Then, too, I wonder how many of these pages are de facto duplicates? ("Department of redundancy department, redundant division speaking...")
That also makes me wonder more about this statistic. Are there one billion ACTIVE pages, or merely one billion pages that have ever existed? If the former, how many pages have ever existed? That would be an interesting question.
Well, by making this post I'm probably creating yet another page and adding to the noise and confusion. Consider it my chaotic deed for the day.
"Somebody exploded a letter-bomb today