Slashdot Mirror


WWW Surpasses One Billion Documents

Gary William Flake writes "A new study by Inktomi and NEC Research Institute shows that there are at least one billion unique indexable Web pages on the internet. The details are pretty interesting; for example, Apache dominates the server market."

35 of 157 comments

  1. In related news... by Sick+Boy · · Score: 2

    approximately 7 of them are useful.

    --
    Does narcissism count as a hobby? --Shawn Latimer
  2. the best part by Capt+Dan · · Score: 4

    Longest domain name:
    http://www.tax.taxadvice.taxation.irs.taxservices.taxrepresentation.taxpayerhelp.internalrevenueservice.audit.taxes.com


    Gee. A tax site with a long, unintelligible, confusing domain name. Go figure.

    "You want to kiss the sky? Better learn how to kneel." - U2

    --
    Sig:
    Barbeque is a noun. Not a verb.
    1. Re:the best part by Nodatadj · · Score: 2

      I had in.2032.the.world.as.we.know.it.will.self-destruct.com whenever I was running "illegal"* servers off my university network.

      *There was nothing illegal about them, except that the university banned servers.

    2. Re:the best part by Nodatadj · · Score: 2

      The one I really want is i.should.co.co

      but I dunno how to register a hostname in Colombia (or wherever CO is)

  3. And at least one of them already comments on that by dsplat · · Score: 4
    Yes, and the Jargon File already has a comment on that, originally from Theodore Sturgeon:

    Sturgeon's Law prov.

    "Ninety percent of everything is crap". Derived from a quote by science fiction author Theodore Sturgeon, who once said, "Sure, 90% of science fiction is crud. That's because 90% of everything is crud." Oddly, when Sturgeon's Law is cited, the final word is almost invariably changed to `crap'. Compare Hanlon's Razor, Ninety-Ninety Rule. Though this maxim originated in SF fandom, most hackers recognize it and are all too aware of its truth.


    --
    The net will not be what we demand, but what we make it. Build it well.
  4. Meaningless Statistics by (void*) · · Score: 3
    That's what I hate about such "statistics". No information or context is given. One is not told how this estimate of "one billion" is gotten. No details about the research methodology were forthcoming. Instead one is only supposed to stare slack-jawed in amazement at the touted figure of one billion and be impressed. That anyone can impress oneself that THIS is an achievement is amazing.

    For all you know - the web has surpassed at least 1 webpage count. Big Fscking Deal!!!

    1. Re:Meaningless Statistics by dsplat · · Score: 2

      That's what I hate about such "statistics". No information or context is given. One is not told how this estimate of "one billion" is gotten.

      Remember, 53.4% of all statistics are invented on the spot. Of those, 63.1% are never checked against any reliable source. The rest are attributed to a survey done by Expensive Management Consultants. You can buy a copy of the report from them for only $2499, which includes the introductory price of a year's subscription to their weekly newsletter containing the abstracts of other reports you can purchase, at a substantial 10% discount off the regular price that no one ever pays them anyway.

      --
      The net will not be what we demand, but what we make it. Build it well.
  5. Heh... by Anonymous+Commando · · Score: 2

    <DrEvil>One... billion pages</DrEvil>

    Sorry - couldn't resist. :=]
    ________________________

    --
    Corporate Jenga: You take a blockhead from the bottom and you put him on top...
  6. Why? by dsplat · · Score: 3

    Why is one of them Hamster Dance? Don't go there with an 18-month-old child on your lap. For an adult, this is funny once. For a toddler, it is funny every time the computer is on.

    --
    The net will not be what we demand, but what we make it. Build it well.
    1. Re:Why? by dsplat · · Score: 2

      The link you provided doesn't respond well. I think they've been slashdotted. So I did a search at Google for Hamsterdeath and found this. Enjoy!

      --
      The net will not be what we demand, but what we make it. Build it well.
  7. technically inaccurate statistics by TheCodeMaster · · Score: 3

    Dynamic content makes the technical quantity of distinct "pages" far greater than a billion.

  8. Inktomi, publicity, and mod_perl by billh · · Score: 3

    Well, as any of us geeks know, this isn't really news. I'm sure we passed the billion mark a long, long time ago. Inktomi just wants the publicity, and some news service will probably pick this up, most likely CNN.

    One thing of interest, though. If you look under the "Web server market share", Red Hat and mod_perl are apparently web servers now.

    1. Re:Inktomi, publicity, and mod_perl by annarchy · · Score: 2

      some news service did pick this up...slashdot.

  9. Did they bump the count to extraghost.com? by foolishj · · Score: 2

    So were there three links to www.extraghost.com before they wrote the page, or after? And which one of the band members works at Inktomi? And will it be four after I post this comment?

  10. Thaaat's great... by Greyfox · · Score: 4
    Now INDEX it.

    Finding information on the web is going to increasingly be like trying to find hay in a needle stack. Already the current indexing engines can't keep up, and you have unscrupulous web authors putting bunches of keywords unrelated to their site in their meta tags to ensure that they get mentioned in every single search. Some indexing engines already ignore meta tags for that reason. And how many times have you tried Altavista, Excite or Google only to find that the page you're trying to get to has expired or is 8 years old and hasn't been changed in 7?

    This issue is going to have to be addressed, because the web is going to continue growing.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  11. unique? by Signal+11 · · Score: 4
    Well, Yahoo has hundreds, nay thousands, nay hundreds of thousands of "uniquely indexable pages" in their database. It's a web of links. How does one define unique?

    Really, this article says nothing. Unless it states (and it does not) *exactly* how they mean "unique" I'm not going to take this seriously. A more interesting statistic (and one I haven't seen updated in a while) would be what the information conversion ratio is between the "RealWorld" and the web, i.e. how much of the information that you can find in a library can you also find online in its entirety. That is a more accurate measure of growth than raw page numbers.

  12. 1 Billion useless pages. by Dast · · Score: 4

    49.5% Broken links to mp3s
    49.5% pr0n pages with javascript popups
    1% other

    We humans should be so proud of ourselves.

    :)

    --

    This sig is false.

  13. A public or private search engine? by dattaway · · Score: 2

    They say they are the world's largest search engine and I get many hits spanning my pages from *.inktomisearch.com, but how do you search their site?

    Is inktomi publicly searchable? If it is not, then my pages wouldn't be publicly searchable. So, what's the point of them making connections to my sites?

    Is the following how you ban a site from your server?

    /etc/httpd/conf/access.conf
    #deny from domain

    1. Re:A public or private search engine? by JoeBuck · · Score: 2

      Inktomi sells their technology to other companies; they don't operate a search engine under their own name. HotBot is Inktomi-based; there are others as well but I don't know who.

    2. Re:A public or private search engine? by dattaway · · Score: 2

      Thanks for the good page with all the answers. It wasn't immediately obvious how to search their web site. You see, I get hundreds of entries for inktomi.com and others over my 56K dialup. Naturally, I'm curious to see what they are and do a www.them and didn't see anything useful until now.

    3. Re:A public or private search engine? by jfunk · · Score: 2

      I just checked out hotbot to see if any of my sites (which are constantly hammered by the Inktomi crawler, as I'm sure is the case with most sites) would come up.

      No hits.

      Google finds them, though.

      Something's definitely amiss regarding Inktomi.

    4. Re:A public or private search engine? by JoeBuck · · Score: 2

      Hotbot uses Inktomi technology. They don't use Inktomi's database (I don't know who does).

    5. Re:A public or private search engine? by jfunk · · Score: 2

      Hotbot uses Inktomi technology. They don't use Inktomi's database

      Ahh, I see now. They are crawling my sites but not letting anybody search the results unless they pay big bucks.

      Hmmm, looks like I'll be making a modification to my robots.txt files and possibly adding some new rules to my firewall.

      I should be allowed to find out what info about my sites they are trying to sell. If I can't, they won't be getting access.
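
A minimal sketch of the robots.txt approach mentioned above, assuming the crawler identifies itself with the user-agent token "Slurp" (an assumption here; check your own access logs for the real string). The rules and the helper name `allowed` are hypothetical, and Python's standard `urllib.robotparser` is used only to check what a well-behaved crawler would conclude from them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one crawler, leave everyone else alone.
# "Slurp" as the crawler's user-agent token is an assumption; check your logs.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com URLs are placeholders, not real sites from this thread.
print(allowed("Slurp", "http://www.example.com/index.html"))
print(allowed("Googlebot", "http://www.example.com/index.html"))
```

Keep in mind that robots.txt is only a request. A crawler that ignores it has to be blocked at the web server or firewall instead.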

  14. Recount by Jupiter2 · · Score: 2

    Inktomi and NEC Researcher: "Oh no!!! I can't remember if I counted our own web page. ARRRGGGHH!!! 1, 2, 3, 4, 5, ................."

  15. UK or US? by Nodatadj · · Score: 2

    IE
    1,000,000,000 (US)
    or
    1,000,000,000,000 (UK)

    There's a large difference.

  16. Use Google by JoeBuck · · Score: 4

    Google is one of the best search engines available for most purposes, because it ignores meta tags, and scores pages higher based on links to the site from other high-scoring pages (this is a recursive definition but the recursion bottoms out).

    The result of this is that it gives useful results even when very common words are used. Try searching for Linux on Google. The first ten results are

    • linux.org
    • linux.com
    • www.debian.org
    • www.linuxworld.com
    • linux.davecentral.com
    • www.varesearch.com (VA Linux)
    • linux.corel.com
    • www.li.org (Linux International)
    • lwn.net (Linux Weekly News)
    • www.linuxhq.com

    While a human being might be able to come up with a better list, a machine came up with that list, based solely on the structure of the web. (I wonder why linux.davecentral.com rates so high -- possibly because it's attached to a high-ranking site, davecentral.com).

    ObAdvocacy: and Google runs on Linux.
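
The recursive scoring described above can be sketched as a simple iteration: start every page with an equal score, then repeatedly hand each page's score to the pages it links to until the numbers settle. This is only an illustration of the general idea, not Google's actual code; the function name `link_scores`, the damping value of 0.85, the iteration count, and the toy link graph are all assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages by the scores of the pages that link to them.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of links.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score everywhere.
                share = damping * scores[page] / n
                for target in pages:
                    new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical four-page web: three pages link to the hub, the hub links to two.
toy_web = {
    "hub.example": ["a.example", "b.example"],
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "c.example": ["hub.example"],
}
scores = link_scores(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

The recursion "bottoms out" here in the sense that repeating the update stops changing the scores much; the damping term is what keeps the iteration stable.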

  17. Re:Apache is the largest by gorilla · · Score: 2
    I think this is a different measure.

    Netcraft's measure is by number of servers, while this measure is by number of pages.

    It's not surprising that they both agree, but it's certainly possible that larger sites might have a different server to the average site, causing a difference.

  18. Impressive Marketing statistics by henley · · Score: 4

    Well, my take from the site is that what they're actually saying is "Look at our lovely indexing cluster. It can index 1 billion web thingies! Shouldn't you be buying a search engine product that powerful?"

    Or, in other words, it's another example of meaningless statistics spewed in the name of marketing, vaguely covered-up as serious research.

    References: Car MPG & top speed figures vs actual usage, Processor MHz as function of system throughput, quoted battery life as function of laptop utilisation, quaketest FPS compared to average internet multiplayer experience etc etc etc...

    --

    --
    I'd rather have a bottle in front of me than a frontal lobotomy
  19. Infinity by David+A.+Madore · · Score: 2

    Hair splitting alert ON.

    The number of (different) pages on the web is actually infinite. Here is a sample infinite component.

    (Actually it's finite because the maximal accepted length for a URL is finite. But it's way above the billions.)

    Note that these are not dynamical pages. Dynamical pages (i.e. pages whose content changes for the same URL) don't count: they're cheating.

    (The source used to generate this infinite number of pages is available under the GPL.)
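
One way such an unbounded family of pages could be produced is sketched below. This is not the GPL source mentioned above, only a hypothetical illustration: the names `render_page` and `page_for_path` and the `/N` URL scheme are assumptions. Each numbered URL always returns the same content and links to the next number, so the pages are static in the sense used above, yet a crawler following links never runs out.

```python
def render_page(n):
    """Build the HTML for page n; the same n always gives the same text."""
    return (
        "<html><head><title>Page %d</title></head><body>"
        "<p>This is page number %d.</p>"
        '<p><a href="/%d">Next page</a></p>'
        "</body></html>" % (n, n, n + 1)
    )


def page_for_path(path):
    """Map a URL path such as '/42' to its page, or None if it is not a number."""
    digits = path.lstrip("/")
    if not digits.isdecimal():
        return None
    return render_page(int(digits))


print(page_for_path("/3"))
```

Wiring `page_for_path` into any web server's request handler would be enough to expose the whole chain.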

  20. In related news... by jd · · Score: 2

    The one billion documents were found to be a plot by The Cult of Arthur C Clarke to end the Universe - each page having a unique name of God on it.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  21. My idea, you can't patent it by dsplat · · Score: 2
    Okay, it isn't new and it isn't my idea originally, but I'll put a new spin on it. Is there a need for a moderated index to the most useful stuff on the web? Hey andover.net, I'm talking to you too. An index to everything open source related would be great. After all, an index to the whole web is a huge project that never ends and eventually sucks up all your free time. But it may be useful to have moderators rate the links on two factors:

    1. General usefulness of the information on the page/site. Good stuff is good, no matter how you got there.
    2. Specific applicability of the index to the page. Getting to the wrong good stuff or seeing too many links for a particular idea doesn't help.


    I'm willing to help moderate on some subjects.
    --
    The net will not be what we demand, but what we make it. Build it well.
  22. Re:Apache is the largest by gorilla · · Score: 2

    Grabbing just one page from each server is going to be faster than spidering the entire site. Therefore I'd expect netcraft to be ahead of all the search engines.

  23. Large but Finite number of monkeys by Greyfox · · Score: 2
    I believe the original thought experiment calls for an infinite number of monkeys. It does not say anything about the infinite volume of monkey shit that would be produced over the course of the experiment.

    The Internet does not represent an infinite number of users (at least, not yet) but you're still more likely to get an infinite volume of monkey shit out of it while you try to dig up the works of Shakespeare.

    Or you could save time and go here.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  24. Re:Web Antiques by otis+wildflower · · Score: 2

    Check out Ghost sites...

    Your Working Boy,

  25. One billion channels and nothing on ... by fable2112 · · Score: 2


    Just another for-all-practical-purposes-meaningless statistic to nonetheless feel overwhelmed by, I suppose.

    If there were a billion pages to look at, I don't know when I'd have the time to do anything else, being the info-junkie that I am. Fortunately, a sufficient quantity of these pages do not interest me. :)

    Then, too, I wonder how many of these pages are de facto duplicates? ("Department of redundancy department, redundant division speaking ...") For instance, I'm right in the middle of moving my pages off of geocities and onto drak.net. At the moment, the pages that I've put up on drak.net that were part of my old geocities page still exist on geocities because I'm not done moving everything yet, and can't shut down my old page until EVERYthing is transported. I went through a similar process when I moved TO geocities from my college web page two and a half years ago.

    That also makes me wonder more about this statistic. Are there one billion ACTIVE pages, or merely one billion pages that have ever existed? If the former, how many pages have ever existed? That would be an interesting question ....

    Well, by making this post I'm probably creating yet another page and adding to the noise and confusion. Consider it my chaotic deed for the day. :)

    --
    "Somebody exploded a letter-bomb today ... but it wasn't anybody I knew" -The Moody Blues, "Dear Diar