Slashdot Mirror


WWW Surpasses One Billion Documents

Gary William Flake writes "A new study by Inktomi and NEC Research Institute shows that there are at least one billion unique indexable Web pages on the internet. The details are pretty interesting; for example, Apache dominates the server market."

7 of 157 comments

  1. the best part by Capt Dan · · Score: 4

    Longest domain name:
    http://www.tax.taxadvice.taxation.irs.taxservices.taxrepresentation.taxpayerhelp.internalrevenueservice.audit.taxes.com


    gee. A tax site with a long, unintelligible, confusing domain name. Go figure.

    "You want to kiss the sky? Better learn how to kneel." - U2

    --
    Sig:
    Barbeque is a noun. Not a verb.
  2. And at least one of them already comments on that by dsplat · · Score: 4
    Yes, and the Jargon File already has a comment on that, originally from Theodore Sturgeon:

    Sturgeon's Law prov.

    "Ninety percent of everything is crap". Derived from a quote by science fiction author Theodore Sturgeon, who once said, "Sure, 90% of science fiction is crud. That's because 90% of everything is crud." Oddly, when Sturgeon's Law is cited, the final word is almost invariably changed to `crap'. Compare Hanlon's Razor, Ninety-Ninety Rule. Though this maxim originated in SF fandom, most hackers recognize it and are all too aware of its truth.


    --
    The net will not be what we demand, but what we make it. Build it well.
  3. Thaaat's great... by Greyfox · · Score: 4
    Now INDEX it.

    Finding information on the web is going to increasingly be like trying to find hay in a needle stack. Already the current indexing engines can't keep up, and you have unscrupulous web authors putting bunches of keywords unrelated to their site in their meta tags to ensure that they get mentioned in every single search. Some indexing engines already ignore meta tags for that reason. And how many times have you tried Altavista, Excite or Google only to find that the page you're trying to get to has expired or is 8 years old and hasn't been changed in 7?

    This issue is going to have to be addressed, because the web is going to continue growing.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  4. unique? by Signal 11 · · Score: 4
    Well, Yahoo has hundreds, nay thousands, nay hundreds of thousands of "uniquely indexable pages" in their database. It's a web of links. How does one define unique?

    Really, this article says nothing. Unless it states (and it does not) *exactly* how they mean "unique" I'm not going to take this seriously. A more interesting statistic (and one I haven't seen updated in a while) would be what the information conversion ratio is between the "RealWorld" and the web, i.e., how much of the information that you can find in a library you can also find online in its entirety. That is a more accurate measure of growth than raw page numbers.

  5. 1 Billion useless pages. by Dast · · Score: 4

    49.5% Broken links to mp3s
    49.5% pr0n pages with javascript popups
    1% other

    We humans should be so proud of ourselves.

    :)

    --

    This sig is false.

  6. Use Google by JoeBuck · · Score: 4

    Google is one of the best search engines available for most purposes, because it ignores meta tags, and scores pages higher based on links to the site from other high-scoring pages (this is a recursive definition but the recursion bottoms out).

    The result of this is that it gives useful results even when very common words are used. Try searching for Linux on Google. The first ten results are

    • linux.org
    • linux.com
    • www.debian.org
    • www.linuxworld.com
    • linux.davecentral.com
    • www.varesearch.com (VA Linux)
    • linux.corel.com
    • www.li.org (Linux International)
    • lwn.net (Linux Weekly News)
    • www.linuxhq.com

    While a human being might be able to come up with a better list, a machine came up with that list, based solely on the structure of the web. (I wonder why linux.davecentral.com rates so high -- possibly because it's attached to a high-ranking site, davecentral.com).
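
The recursive, link-based scoring described above can be sketched roughly as follows. This is only an illustrative toy, not Google's actual implementation: the function name `link_scores`, the damping value of 0.85, the fixed iteration count, and the `*.example` pages are all assumptions made up for the example. The "recursion bottoms out" in the sense that repeatedly recomputing the scores settles toward a stable set of values.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages by links from other well-scored pages (toy sketch).

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values, not taken from the thread.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page scored equally.
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical four-page web: three pages link to one hub,
# and the hub links back to two of them.
toy_web = {
    "hub.example": ["a.example", "b.example"],
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "c.example": ["hub.example"],
}
scores = link_scores(toy_web)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy web the hub ends up ranked first because every other page links to it, and the page nobody links to ends up last, with no reference to meta tags or page text at all.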

    ObAdvocacy: and Google runs on Linux.

  7. Impressive Marketing statistics by henley · · Score: 4

    Well, my take from the site is that what they're actually saying is "Look at our lovely indexing cluster. It can index 1 billion web thingies! Shouldn't you be buying a search engine product that powerful?"

    Or, in other words, it's another example of meaningless statistics spewed in the name of marketing, vaguely covered-up as serious research.

    References: Car MPG & top speed figures vs actual usage, Processor MHz as function of system throughput, quoted battery life as function of laptop utilisation, quaketest FPS compared to average internet multiplayer experience etc etc etc...

    --
    I'd rather have a bottle in front of me than a frontal lobotomy