WWW Surpasses One Billion Documents
Gary William Flake writes "A new study by Inktomi and NEC Research Institute shows that there are at least one billion unique indexable Web pages on the internet. The details are pretty interesting; for example, Apache dominates the server market."
Porn and quotes
I can just see giant billboards now:
Apache: Millions and Millions Served
(For Free!)
when Push Comes to Shove
Not first post, but first with my threshold ;) Seriously, it's nice to see Apache sticking out again. Should do fairly well for marketing Linux.
follow me on Twitter: http://twitter.com/moeffju
I can't believe it took this long.
-- Hi! I'm a
approximately 7 of them are useful.
Does narcissism count as a hobby? --Shawn Latimer
Longest domain name: http://www.tax.taxadvice.taxation.irs.taxservices.taxrepresentation.taxpayerhelp.internalrevenueservice.audit.taxes.com
Gee. A tax site with a long, unintelligible, confusing domain name. Go figure.
"You want to kiss the sky? Better learn how to kneel." - U2
Sig:
Barbeque is a noun. Not a verb.
Glorious gravy, the web has breached 1,000,000,000 indexable pages. And just like radio, network television, cable and satellite before it, the new gag is:
A billion pages of information, and nothing's on.
Now, if real life exemplified the web, we'd know that 85% of the earth's population speaks English and, as can be expected, the IRS's domain name proves to be a lesson in redundancy and triplicate.
The net will not be what we demand, but what we make it. Build it well.
For all you know - the web has surpassed at least 1 webpage count. Big Fscking Deal!!!
Apache is still holding around 60%. It may take a while for the industry to change, but eventually the vast majority of software will become open source, at all levels. The market will switch to a service model. Software will no longer be 'sold' per se. It will be provided. What will be sold will be ports, extensions, customizations, translations, support, codebase maintenance commitments, and commissions for new code. A lot of business and PR types are against the open source movement because they only understand the old model of the software business. In the new order there will be lots of money to be made; however, it will be harder to concentrate it like Microsoft did. And any company's main and in fact only asset will continue to be its coders.
<DrEvil>One... billion pages</DrEvil>
Sorry - couldn't resist. :=]
________________________
Corporate Jenga: You take a blockhead from the bottom and you put him on top...
Of course it's about taxes. You've got to hand it to the IRS, even their URLs are hard to read and understand. I wasn't able to open this link, can anybody else?
Its karma, Kramer.
Why is one of them Hamster Dance? Don't go there with an 18 month old child on your lap. For an adult, this is funny once. For a toddler, it is funny every time the computer is on.
The net will not be what we demand, but what we make it. Build it well.
dynamic content makes the technical quantity of distinct "pages" far greater than a billion.
is that big cluster of Sun boxen. Now that is some serious admining and "play?". Just think of clustering them all together and playing a mean game of Quake. Enough processing power to run the super bot that is smart enough not to get detected. I'm dreaming.
http://www.freebsd.org
Alright, there are 6 billion people on the planet. Many of these people work for companies or governments that have webpages that are probably hundreds of indexable pages. Some places auto-update things, producing hundreds of indexable pages. There are also millions of (pointless) personal sites, and some people manage more than one site. I'm shocked that we're only at one billion now.
-BlightX
Anybody following Netcraft's Web Server Survey already knew this. But it's still nice to get it confirmed from additional sources.
GNU/Linux. The Freshmaker.
The web has infinite amounts of indexable webpages, just look at dynamic webpages and CGI driven webpages. If you want proof go search for [A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z] at www.altavista.com. That's an example of 208827064576 indexable webpages (Each one different). Wee.
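The figure quoted above can be checked with a quick calculation. This is only a sketch of the arithmetic, assuming each of the eight positions independently takes one of the 26 uppercase letters; the function name `count_queries` is made up for illustration.

```python
import string

# Assumption: the pattern [A-Z] repeated 8 times, each position independent.
ALPHABET_SIZE = len(string.ascii_uppercase)  # 26 letters
QUERY_LENGTH = 8


def count_queries(alphabet_size: int, length: int) -> int:
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length


total = count_queries(ALPHABET_SIZE, QUERY_LENGTH)
print(f"{total:,} distinct eight-letter queries")
```

So the number is large but finite: 26 to the 8th power, one result page per distinct query string.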
Well, as any of us geeks know, this isn't really news. I'm sure we passed the billion mark a long, long time ago. Inktomi just wants the publicity, and some news service will probably pick this up, most likely CNN.
One thing of interest, though. If you look under the "Web server market share", Red Hat and mod_perl are apparently web servers now.
Online gaming for motivated, sportsmanlike players: www.steelmaelstrom.org.
Just looking at the top three:
Apache 60.33%
Microsoft-IIS 25.26%
Netscape-Enterprise 3.79%
Wow - Apache still kicks everyone else's butts, and not by a small margin! I think Apache is about the perfect case for OSS development - not just being a blip on the radar getting larger, but, covering almost the entire radar screen!
I'd love to see more stats out of Inktomi on this, but it's still cool to see what little they did provide (261,472 links to MP3.com should say something about the digital music scene).
Davis Ray Sickmon, Jr - looking for something to read? Check out my three free novels at MidnightRyder.org
Stuff like that makes me smile ;)
-Peace
Dave
Free as in "the Truth shall set you..."
So were there three links to www.extraghost.com before they wrote the page, or after? And which one of the band members works at Inktomi? And will it be four after I post this comment?
Also note that while these pages exist, there is also a lot of random crap out there that really just wastes space and time. As the number of pages increases, I'm sure that it will be harder and harder to find quality documents among the wasteland of stuff we don't need.
"You ever have that feeling where you're not sure if you're dreaming or awake?"
"You spoony bard!" -Tellah
These results seem a little strange to me; there is no explanation or context for the results.
Why did they list the number of links to rickymartin.com or cooking.com?
Why did they list the longest URL as a nonworking URL that was probably used to spam the search engines?
oh great, uh hey guys, today I have determined there are 1 billion webpages!
Finding information on the web is going to increasingly be like trying to find hay in a needle stack. Already the current indexing engines can't keep up, and you have unscrupulous web authors putting bunches of keywords unrelated to their site in their meta tags to ensure that they get mentioned in every single search. Some indexing engines already ignore meta tags for that reason. And how many times have you tried Altavista, Excite or Google only to find that the page you're trying to get to has expired or is 8 years old and hasn't been changed in 7?
This issue is going to have to be addressed, because the web is going to continue growing.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
An almost infinite number of monkeys banging away on a similar number of typewriters will eventually reproduce the works of Shakespeare.
The internet disproves this hypothesis.
But seriously - has anyone figured out how long it would take to reproduce certain random documents? - such as the works of Shakespeare?
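For a rough feel of the scale, here is a back-of-the-envelope sketch. It assumes a typist who picks each key uniformly at random from a 27-key alphabet (26 letters plus space) and types in independent attempts of exactly the target length; under those assumptions the expected number of attempts is the alphabet size raised to the text length. The function name and the sample phrase are illustrative only.

```python
import math


def log10_expected_attempts(text_length: int, alphabet_size: int = 27) -> float:
    """Return log10 of alphabet_size ** text_length.

    Working in logarithms avoids building astronomically large numbers.
    Assumes uniform random keystrokes and independent fixed-length attempts.
    """
    return text_length * math.log10(alphabet_size)


phrase = "to be or not to be"  # 18 characters over a 27-key alphabet
exponent = log10_expected_attempts(len(phrase))
print(f"about 10^{exponent:.1f} attempts expected for {len(phrase)} characters")
```

Even one short line needs more attempts than there are pages on the web by many orders of magnitude, so a whole play is far out of reach under this model.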
Really, this article says nothing. Unless it states (and it does not) *exactly* how they mean "unique" I'm not going to take this seriously. A more interesting statistic (and one I haven't seen updated in a while) would be what the information conversion ratio is between the "RealWorld" and the web - ie: how much of the information that you can find in a library can you also find online in its entirety. That is a more accurate measure of growth than raw page numbers.
49.5% Broken links to mp3s
49.5% pr0n pages with javascript popups
1% other
We humans should be so proud of ourselves.
:)
This sig is false.
How can 90% of Internet content be crud if over 50% of it is p0rn ;)
...there are three types of lies:lies, damn lies, and statistics. Take from that what you will. BTW, 90% of everything is CRAP (or crud...or even s#|% depending on your frame of mind at any given time;^D)
They say they are the world's largest search engine and I get many hits spanning my pages from *.inktomisearch.com, but how do you search their site?
Is inktomi publicly searchable? If it is not, then my pages wouldn't be publicly searchable. So, what's the point of them making connections to my sites?
Is the following how you ban a site from your server?
/etc/httpd/conf/access.conf
#deny from domain
Billions and billions of pages lost in the cosmic consciousness...
This sig is false.
According to Netcraft (http://www.netcraft.com/survey/), "In the December 1999 survey we received responses from 9,560,866 sites". If each site has 1000 pages (not terribly unreasonable) we're at 9.5 billion, nearly 10 times more than this PR-plug. And this is only counting static pages; my guess is that auctions on eBay do not count. I wonder if they count Deja - how many pages do you think they have in all those news groups?
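The estimate above works out as follows. This is a sketch of the commenter's arithmetic only: the site count is the Netcraft figure quoted in the comment, the 1000 pages per site is the commenter's own guess, and the helper name `estimate_pages` is made up for illustration.

```python
SITES_DEC_1999 = 9_560_866        # Netcraft figure quoted in the comment
PAGES_PER_SITE = 1_000            # the commenter's guess, not a measured value
INKTOMI_ESTIMATE = 1_000_000_000  # "at least one billion" from the study


def estimate_pages(sites: int, pages_per_site: int) -> int:
    """Total pages if every site had the same number of pages."""
    return sites * pages_per_site


total = estimate_pages(SITES_DEC_1999, PAGES_PER_SITE)
ratio = total / INKTOMI_ESTIMATE
print(f"{total:,} pages, {ratio:.2f} times the Inktomi figure")
```

The result depends entirely on the pages-per-site guess; halving it halves the total.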
The Internet is large. Leave it at that.
Cheers,
Slak
Almost 4000 links point to rickymartin.com -
I'm just curious if that was supposed to be impressive or disturbing. Of course, a good lot of those one billion pages are made by teenagers so-
-Noiz,
Who thinks Ricky Martin looks too much like a clone to be a "hottie".
---------
I'd kill for a Nobel Peace Prize.
Haha! I wonder how many have noticed?!?!?
Did I fall asleep for 20 years, or are Inktomi's claims about its search software a little inflated? They stop just short of claiming to read my mind and provide the doc I want as soon as I open the browser.
Someone please tell me if I'm missing some great coolness here. After all, I haven't used anything other than Google for months.
"You can't get something for nothing." - my grandfather, on the stock market and Reaganomics.
seems that I've been on websites that appeared to have more pages than that.
There are still a billion folks in the world who haven't even made a phone call.
-Peace
Dave
Free as in "the Truth shall set you..."
because 90% of online p0rn is crap too
erm
or so I am told :+)
--
-=DaveHowe=-
Marge: "Does anybody really need that much porn?"
Homer: "Mmmmm.... one million times.... aaaagggggghhhhh!!!!!!!!!!"
Are they trying to claim that all pages are in English, French or Dutch? What does this indicate as to the rest of their research? I would have thought that the number of pages in Russian (Cyrillic) or one of the eastern languages such as Korean or Japanese, would have been statistically significant enough for inclusion. Makes me wonder about the validity of any of their numbers.
Since when is RedHat a webserver and not a distribution? I'd like to know the method these guys used to get these stats, and why they listed Redhat as a server.
Inktomi and NEC Researcher: "Oh no!!! I can't remember if I counted our own web page. ARRRGGGHH!!! 1, 2, 3, 4, 5, ................."
now this is news!
http://www.ircnews.com/mirc.html
From the press release:
"By examining the entire Web and analyzing the billions of links between all of its documents, Inktomi can distill an index of the highest quality documents to provide users with
more relevant and intuitive results."
Isn't that the "technology" that google has patented?
-Rob Ewaschuk
Actually, this is an interesting idea. Replacing each page on a web site with its own subdomain. I like it! All right DNS, let's see what you've got!
IE
1,000,000,000 (US)
or
1,000,000,000,000 (UK)
There's a large difference.
Google is one of the best search engines available for most purposes, because it ignores meta tags, and scores pages higher based on links to the site from other high-scoring pages (this is a recursive definition but the recursion bottoms out).
The result of this is that it gives useful results even when very common words are used. Try searching for Linux on Google. The first ten results are
While a human being might be able to come up with a better list, a machine came up with that list, based solely on the structure of the web. (I wonder why linux.davecentral.com rates so high -- possibly because it's attached to a high-ranking site, davecentral.com).
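The recursive, link-based scoring described above can be sketched as a simple power iteration. This is a minimal illustration of the general idea, not Google's actual algorithm: the damping factor of 0.85, the iteration count, the function name `link_scores`, and the toy page names are all assumptions made for the example.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages by the links pointing at them, weighted by the linker's score.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with equal scores; repeated passes resolve the recursion.
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical toy web: everything links to "hub"; nothing links to "lonely".
toy_web = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "lonely": ["hub"],
}
scores = link_scores(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(f"{page}: {scores[page]:.3f}")
```

In this toy graph the heavily linked page ends up on top and the page nobody links to ends up at the bottom, which is the behaviour the comment describes: a page attached to a high-scoring site inherits part of that site's score.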
ObAdvocacy: and Google runs on Linux.
The Dilbert Zone is using the wrong symbol for Red Hat on the Ratbert Index.
...just to see how much free advertising The Dilbert Zone can get from it.
And I thought that cable television was a vast waste land...
all persons, living and dead, are purely coincidental. - Kurt Vonnegut
And of those 4 billion, probably 1 billion are on AOL and another billion are on yageohooties (Yahoo+Geocities)/angelfire/dragonfire/.../. This means that at a reasonable guess, a minimum of half of the pages on the net consist of a purple background (or image) with lime green text, broken html, and a couple dozen animated gifs reminiscent of a carnival (and no content beyond "Hi, my name is _____ I was born in _____, my drivers license, SSN, and major credit card are _____, _____, and _____.").
Geez....I say that there are far too many people on the net who just don't belong, and freedom of speech or no, some people shouldn't be allowed to make web sites.
Who am I?
Why am here?
Where is the chocolate?
What is your Slash Rating?
This man is correct! The correlation between making associations that do not exist, and poking another man's anus, is extraordinary!
Inktomi are an American company - one billion is a thousand million. That's the number of docs they have in their index. Inktomi was around a long long time before Google, and their technology is a rather cool cluster-based one. It currently runs on Solaris for their search. Part of the "Battle" in the search market is on the size of the index that people store. Inktomi are currently trying to leapfrog their competitors (Altavista et al.), which they have done nicely. Most people have at some time or another used Inktomi's search indirectly through hotwired.com, yahoo.com or one of the many other portals Inktomi power. As to "Other languages" - Inktomi are a multinational corporation providing services in Japan (goo.com) and a lot of European and South American countries.
Well, my take from the site is that what they're actually saying is "Look at our lovely indexing cluster. It can index 1 billion web thingies! Shouldn't you be buying a search engine product that powerful?"
Or, in other words, it's another example of meaningless statistics spewed in the name of marketing, vaguely covered-up as serious research.
References: Car MPG & top speed figures vs actual usage, Processor MHz as function of system throughput, quoted battery life as function of laptop utilisation, quaketest FPS compared to average internet multiplayer experience etc etc etc...
--
I'd rather have a bottle in front of me than a frontal lobotomy
Hair splitting alert ON.
The number of (different) pages on the web is actually infinite. Here is a sample infinite component.
(Actually it's finite because the maximal accepted length for a URL is finite. But it's way above the billions.)
Note that these are not dynamical pages. Dynamical pages (i.e. pages whose content changes for the same URL) don't count: they're cheating.
(The source used to generate this infinite number of pages is available under the GPL.)
The one billion documents were found to be a plot by The Cult of Arthur C Clarke to end the Universe - each page having a unique name of God on it.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Many search engines score pages higher that have the keywords in the hostname, so creating tons of subdomains to get every possible keyword into the hostname might actually get the page in top positions for several keywords.
Guess it's time someone anti-Microsoft gets microsoft.ms.windows.windows2000.windowsnt.office.office2000.2000.windows95.windows98.mswindows.mswindows2000.sucks.org. ;)
This message is provided under the terms outlined at http://www.bero.org/terms.html
In my opinion, 1 Natalie Portman site is one too many.
Vivent longtemps la résistance de Natalie Portman!!
I've seen numerous stats on the different types of servers out on the net serving web pages, and I've never seen one when Apache was even close to being taken over by another type of server, was this a surprise to anyone?
E.
Congratulations to the WWW on this accomplishment (Those of you who run or are members of porno sites: You don't deserve it). However, with the increasing number of homepages on the internet, who in the heck is gonna keep track of them all?? I am fully aware of the different directories and search engines, but they have such stringent rules for Link Submission that it discourages many newbies from starting a webpage. I am also fully aware of the necessity of the META tags required, but then there are also many other criteria that I don't think anyone is really aware of. I have repeatedly e-mailed Yahoo! and Excite to ask whether they have a criteria list for homepage submission, only to wind up with a reply from an automated service, then never hear from them again!! Luckily, news of my webpage gets around via word of mouth, not on some search engine, but I'm going to change that.
But keeping track of all these billions of pages will be difficult, and sooner or later, people are going to demand satisfaction! (Slap me with that glove again, and I'll give you satisfaction, in a .223 caliber!!)
The Gray Wolf
My 80286 is like the Bible: I swear by it every night when I try to run something.
.
because 90% of online p0rn is crap too :+)
Do you mean fecofilia, or just low quality? *impertinent smirk*
--unDees
"I call a baby goat a 'goatse.'" -- my non-Internet-savvy 6-year-old stepdaughter
AND SECRET SAUCE HGALHGLAGHLGHAGLHAGLHAG
I'm willing to help moderate on some subjects.
The net will not be what we demand, but what we make it. Build it well.
hmm...
so that means that if each and every page on the WWW were worth $100, then it would equal bill gates' pocket.
that's nuts
Look out honey, 'cause I'm using technology; Ain't got time to make no apology
> writes "A new study by Inktomi and NEC Research Institute shows that there are at least one billion unique indexable Web pages on the internet. The details are pretty interesting; for example, Apache dominates the server market."
Is Apache paying Cmdrfucko to say shit like this? No need to mention Apache dominates the server market, we already know that, thank you..
The Internet does not represent an infinite number of users (at least, not yet) but you're still more likely to get an infinite volume of monkey shit out of it while you try to dig up the works of Shakespeare.
Or you could save time and go here.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
I guess there is something to be hopeful about.
:)
This sig is false.
Gopher is better!
Gopher doesn't clog up your screen with complex and confusing "images." Gopher cuts through all that meaningless drivel and gives you your information crystal clear.
WWW's search engine naming conventions make them difficult to remember? Why is this? There isn't one, that's why! What does "altavista" mean to you? What about "lycos"? Sounds like a cough serum, doesn't it? "Yahoo"? Wie doggie! Gopher is different because most of Gopher's search engine names rely on a standard instead of incomprehensible gibberish. This standard is, of course, the Archie comics from your childhood (or perhaps your second childhood). You've got all your ol' pals Veronica and Jughead ready to surf the Goph and give you some friendly advice on where to go next. Gopher is truly better than the WWW!
God bless Gopher, each and every one!
watch ms go down even more and apache go up, for the same reason the pentagon got rid of ms and went up with apache.
Actually, Redheads, Inc. ain't doin' too bad!
And how many of those one billion web pages are actively updated? I quit a job with (unnamed employer) a year and a half ago, and nobody has updated the Web server there since I left. The only reason it no longer has my name on it is because I changed the contact info...but the content is completely unchanged.
Finally, this is a press release. Press releases are written by the companies or commissioned by companies and distributed to news agencies which usually don't bother to do any but the most basic redaction. It's like free advertising. As I believe someone else pointed out, this isn't a "hey cool, a billion Web pages," it's "hey, our indexing software can index a billion pages, don't you want to buy it?". Always be wary of the source...usually, if you read something positive about a product or a technology, you can bet that somebody is getting paid for it. (Yes, this includes reviews...remember, we have to keep the advertisers happy.)
I would have liked to see what the oldest pages were. You know, NO updates since 1995, or whatever? Web senior citizens.
:-)
sk
...that the 1,000,000,000th document was porn? One chance in five? One in three?
So, knowing all of this, why did /. post this?
Makes me wonder........
Insert mind here.
And in other news...
President Clinton has proclaimed the One Billionth web page to belong to a young Bosnian orphan of indeterminate gender.
"Trademarks are the heraldry of the new feudalism."
no it didn't.
I work for one of Google's competitors and we did *exactly* what they claim they are doing and got completely different results. They apparently are crawling sites like yahoo and dmoz and using positions there to affect their ranking...
Google also now uses RealNames[tm], which doesn't run on Linux, and as an arm of the old bad intellectual property junta
don't believe the hype...
these comments do not represent those of my employer. in fact, I'd probably get busted if they found out I did this!
dynamic content makes the technical quantity of distinct "pages" far greater than a billion.
Surely you are correct. However, the operative term in that phrase is "indexable". I'm quite sure that neither Inktomi nor many other "spiders" such as AltaVista (to name one big one) can traverse links to dynamically generated pages. So even if the number of indexable pages is over one billion, that indeed leaves much content out of the big picture.
Quidquid latine dictum sit, altum viditur.
I only post comments when someone on the internet is wrong.
Okay, I'll admit I'd be happier if a *machine* had determined that yahoo and dmoz were worth crawling for ranking information, but let's face it, that's pretty hard to do.
The fact remains that google outdoes every other search engine on the net, and returns useful links for obscure queries that would be a lost cause on other search engines.
Ok, so they have a large database of links. These links point to pages that were online once, and may in fact be online now.
So! The web has grown. But it's grown like algae across a fish pond. Some of it is useful (for fish food) but most of it's a waste of space. An example: I have been looking all day for deck plans for the Vasa without luck.
Plenty of Mediterranean cruise liners with deck plans. It's the same for most searches (unless you are looking for porn, good strike rate there)...
1 billion pages, and my local library is more often a better source of information... Usenet is more flames and spam than useful chatter. The Net is becoming a has-been; the golden age has passed us already. Sure, video and audio streaming seem cool, but I already have a TV and stereo. What I want is a world class library, free and at my fingertips.
Tokyo Joe
You know what that means? Assuming, on average, one web page per individual, that means that 4/5ths to 2/3rds of the world's population does not yet have A WEB PAGE!!!! It also means that most or all of "The Internet" is controlled by only 1/5-1/6th of the world's population?!!!! Think about that before all of you web developers start patting yourselves on the back. It's the socially responsible thing to do ;)
How do you know that you are doing "exactly" what Google does? Do you have the Google source? Maybe your search engine is just crap and stealing a few features from Google isn't enough to make it good.
Just another for-all-practical-purposes-meaningless statistic to nonetheless feel overwhelmed by, I suppose.
If there were a billion pages to look at, I don't know when I'd have the time to do anything else, being the info-junkie that I am. Fortunately, a sufficient quantity of these pages do not interest me.
Then, too, I wonder how many of these pages are de facto duplicates? ("Department of redundancy department, redundant division speaking...")
That also makes me wonder more about this statistic. Are there one billion ACTIVE pages, or merely one billion pages that have ever existed? If the former, how many pages have ever existed? That would be an interesting question.
Well, by making this post I'm probably creating yet another page and adding to the noise and confusion. Consider it my chaotic deed for the day.
"Somebody exploded a letter-bomb today
cornz all the way !
moderate this and die biatch !!?
Heh. Is it any better for the 1e9 pages? ...
"Well, its one louder, i'n it?"