Slashdot Mirror


Indexing the Entire Web?

cah1 writes "BBC is carrying a story about another new search engine, All The Web. The designers are planning to have the whole shooting match, all billion pages, indexed by the end of the year." You can also read press from the company. I'm skeptical: they claim to be able to catch up within the first year, and keep up thereafter. But they claim to have 200 million already, so who knows?

15 of 98 comments

  1. Re:ad infinitum et ad nauseam by Tom+Christiansen · · Score: 2
    Assuming it will be a rather large amount of data, who will index their index? (and who will index that index... and that one... and that one......)
    It seems to me that this is not a new problem. Juvenal said, `Sed quis custodiet ipsos custodes?' :-)
  2. Re: Distributed Spidering by davie · · Score: 2

    I've only considered this as a strictly volunteer project, directed by a university and the top level hosts and database hosted there, with some corporate sponsorship thrown in for good measure.

    I don't know if this would work if commercialized, since a lot of the folks who have the knowledge, experience and compute power to participate would probably not feel too warm or fuzzy about helping to build the next Yahoo!, especially when the IPO made the company worth millions overnight. It would certainly be tough to maintain the same level of participation after going commercial, unless some hitherto unforeseen way of rewarding participation per contribution were discovered. Perhaps corporate sponsors could offer premiums to contributors based on sites spidered? Maybe something along the lines of frequent flyer miles?

    --
    slashdot broke my sig
  3. Let's do it! by foop · · Score: 2
    Hey, I've been thinking about this very same problem for quite some time and some fellow nerds and I have been thinking about how to do it. How about we start a mailing list to further discuss this as an open source initiative?

    I just created http://www.egroups.com/group/dizz-net/ as an email discussion list. You can subscribe by sending email to dizz-net-subscribe@egroups.com. There are a lot of interesting issues, many already mentioned here:

    • quality is usually more important than quantity
    • a distributed app has the potential to be much more "fresh" than other search services
    • a network protocol needs to be designed carefully -- you don't want to be sending all the web haphazardly around the web every day. clients might be assigned to monitor nearby sites. there are some cool opportunities to use this system just to map the internet.
    • searching is a different beast from crawling. parallel searching -- like FAST and others -- requires major resources which an open source project couldn't manage.
    • full text vs topic searching: does a distributed system whose clients fetch documents index every word, or summarize? Topic searching is probably more appropriate for distributed searching, but full text is often more desirable.
    • interesting security issues come up, like how to keep clients from polluting the database.
    • etc...
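
    On the last security point, one possible approach (an assumption for illustration, not something proposed in this thread) is to hand the same URL to several clients and only accept a result that enough of them agree on. A minimal sketch in Python; the client ids, the `quorum` value and both function names are hypothetical:

```python
import hashlib
from collections import Counter
from typing import Dict, Optional


def digest(page_text: str) -> str:
    """Fingerprint a fetched page so reports can be compared cheaply."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def accept_report(reports: Dict[str, str], quorum: int = 2) -> Optional[str]:
    """Return the digest that at least `quorum` clients agree on, else None.

    `reports` maps a (hypothetical) client id to the digest that client
    submitted for one URL.
    """
    if not reports:
        return None
    winner, votes = Counter(reports.values()).most_common(1)[0]
    return winner if votes >= quorum else None


# Two honest clients and one that submits junk for the same URL.
good = digest("<html>hello</html>")
bad = digest("<html>something else entirely</html>")
reports = {"client-a": good, "client-b": good, "client-c": bad}
print(accept_report(reports) == good)
```

    This only catches a minority of bad clients, and pages that legitimately change between fetches would fail the comparison, so it is a starting point at best.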

    -david.

  4. Wow - Looks Great by Aaron+M.+Renn · · Score: 3

    I judge search engines by the most important criterion of all - how many references to me they have. Alltheweb now has vastly more than runner-up Google, making them the biggest ever. I typed in "Aaron M. Renn" and got 1604 on AllTheWeb, ~500 on Google and only ~180 on AltaVista. Even if that number drops as I search through the pages, it's still impressive. I did look through the plain "Aaron Renn" listings too, where they also crushed the competition (though it's a much smaller number of pages since I virtually always use my middle initial). Believe it or not, there is a page out there with another "Aaron Renn" on it. Pretty weird.

  5. It's fast, anyway. by rde · · Score: 2

    I've been using it for a few days now, and it seems impressive. It's certainly fast. Google is still my engine of choice (even though it's visited my page a ton of times, and still won't find it when I search for it).
    As for its coverage: it may be "the result of more than a decade of research into optimising search algorithms and architectures", but frankly this sounds dubious.
    If it covers 30% of the web it'll be twice as good as existing engines, but I suppose thirdoftheweb.com isn't that catchy.

    1. Re:It's fast, anyway. by Helten+E · · Score: 2

      All the Web doesn't use the pattern matching chip, it's all done in software on 50 Dell servers running BSD. What's new is the 200M documents and the official announcement (it's only been up on trial until now).

      Eyvind Bernhardsen

  6. It seems that... by Gestahl · · Score: 2

    This would be a great fit for a distributed computing application: lots of computers indexing the web, and after they finish that, they can revisit sites for broken, moved and changed content... First post?

    1. Re:It seems that... by spooky+ghost · · Score: 2

      On that topic: surely this could actually be done by the web browser rather than a distributed client. If you have a page online you're bound to check it yourself to make sure it's OK. With an appropriate browser or plugin your page could then be indexed and submitted to a search engine. And then once you start surfing, any page you visit could be automatically indexed. The only problem is the millions of submissions you'd get each day.

      --

      No matter what it looks like, there isn't a .sig here.
    2. Re:It seems that... by davie · · Score: 3

      Not to harp on one of my pet ideas or anything, but I think a distributed spidering project could be pulled off. The trick would be to delegate the work based on compute power and bandwidth, with the "low-end" clients doing the grunt work of spidering, then passing the raw data up to the bigger iron with more bandwidth where the relationships between sites could be ferreted out, keywords could be indexed and context established, etc. These sites could then pass the cooked data back to the top level servers (compressed, of course) for whatever final work needs to be done and then insertion into the database. The idea is to have each client do the work it's best suited for, and to distribute the load more evenly. Bandwidth could be a problem, but I think a lot of the data could be "tokenized" somewhat once references have been established, and some compression would probably help.
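
      A rough sketch of the delegation step described above, in Python. Everything specific here is an assumption for illustration: the `Client` fields, the two thresholds and the tier names "spider" and "indexer" are made up, and a real project would need measured rather than self-reported figures.

```python
from dataclasses import dataclass


@dataclass
class Client:
    name: str
    cpu_score: float       # hypothetical benchmark number, higher is faster
    bandwidth_kbps: float  # link speed, assumed to be measured somehow


def assign_tier(client: Client,
                cpu_threshold: float = 100.0,
                bandwidth_threshold: float = 1500.0) -> str:
    """Pick a role for a client; both thresholds are placeholder values."""
    if (client.cpu_score >= cpu_threshold
            and client.bandwidth_kbps >= bandwidth_threshold):
        return "indexer"  # the "bigger iron": link analysis, keyword indexing
    return "spider"       # the "low-end" clients: fetch pages, pass raw data up


clients = [
    Client("dialup-box", cpu_score=20, bandwidth_kbps=56),
    Client("campus-server", cpu_score=400, bandwidth_kbps=10000),
]
for c in clients:
    print(c.name, "->", assign_tier(c))
```

      A client needs both the CPU and the bandwidth to be promoted, since a fast machine on a slow link could not receive the raw data quickly enough to be useful as an indexer.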

      If I had the networking know-how I would put together a proposal and start taking flame-mail, er, suggestions. Since I don't, I hope someone who does and is as crazy as me will pick up on the idea.

      --
      slashdot broke my sig
  7. Re:What is the problem ? by Anonymous Coward · · Score: 2
    It's a problem of cost, bandwidth, and enough hardware, all of which can be solved relatively easily. The software to do the indexing is hardly difficult to write - I have written one myself, and indexed a few million pages with it. The reason I don't put up a search engine tomorrow is that I certainly couldn't afford the hardware, and that it's a lot of work to retrieve data from the index in a way that gives good results.

    But another problem is the amount of dynamically generated content. There simply ISN'T any way for a search engine to safely index everything on the web, because it can't know which CGIs just serve up a finite selection of pages from a database and which randomly generate content, as long as no decent clues are given.

    The amount of dynamically generated content is growing dramatically, so this will be an increasing problem.
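
    Since a crawler cannot tell the two kinds of CGI apart, one common workaround is to cap how many distinct query strings it will fetch per script. The sketch below is a heuristic under that assumption, not a solution to the problem; the class name and the default limit of 50 are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class DynamicPageBudget:
    """Cap how many distinct query strings are crawled per script path."""

    def __init__(self, max_variants: int = 50):
        self.max_variants = max_variants  # arbitrary placeholder limit
        self.seen = defaultdict(set)      # (host, path) -> query strings seen

    def should_fetch(self, url: str) -> bool:
        parts = urlsplit(url)
        if not parts.query:
            return True                   # no query string: treated as static
        variants = self.seen[(parts.netloc, parts.path)]
        if parts.query in variants:
            return True                   # already counted against the budget
        if len(variants) >= self.max_variants:
            return False                  # budget spent: skip new variants
        variants.add(parts.query)
        return True


budget = DynamicPageBudget(max_variants=2)
for q in ("id=1", "id=2", "id=3"):
    url = "http://example.com/cgi-bin/show.cgi?" + q
    print(q, budget.should_fetch(url))
```

    The cost is obvious: a script that legitimately serves thousands of database pages gets cut off at the same point as one that generates random content forever.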

  8. Non-scientific analysis by Snotboble_ · · Score: 2

    I wondered about those 200M pages already indexed, and I dug into Altavista, which says it has ~140M pages indexed.

    I made two searches; one for the word 'Microsoft' and the other for 'Linux'.

    Altavista gave: 12,682,370 (M$) and 4,526,430 (LX).
    FAST gave: 4,689,227 (M$) and 2,570,827 (LX).

    So.. If FAST currently is ~40% bigger than Altavista, how come they return numbers that much lower? With such large numbers it can't be pure coincidence, In My Humble Opinion.
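
    To put rough numbers on that, the hit counts can be divided by each engine's claimed index size. This is a small Python sketch using only the figures quoted above; the index sizes are the approximate claims (~140M and 200M), and the function name is made up.

```python
# Claimed index sizes (approximate) and the hit counts quoted above.
index_size = {"AltaVista": 140_000_000, "FAST": 200_000_000}
hits = {
    "AltaVista": {"Microsoft": 12_682_370, "Linux": 4_526_430},
    "FAST": {"Microsoft": 4_689_227, "Linux": 2_570_827},
}


def hit_share(engine: str, term: str) -> float:
    """Fraction of an engine's claimed index that matched the term."""
    return hits[engine][term] / index_size[engine]


for engine in hits:
    for term in hits[engine]:
        print(f"{engine:9s} {term:9s} {hit_share(engine, term):.1%}")
```

    On these figures roughly 9% of Altavista's claimed index matches 'Microsoft' against a little over 2% of FAST's, which says nothing about why the shares differ, only that they do.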

    -Snotboble

    --
    Q: How does a Unix guru have sex? A: unzip;strip;touch;finger;mount;fsck;more;yes;umount;sleep
    1. Re:Non-scientific analysis by jandrese · · Score: 2

      Probably because alltheweb is indexing EVERY page it comes across, even those "Hello, I'm so and so and I love cats..." pages that most search engines thankfully ignore. It even had my webpage in there, which is a first for search engines.

      --

      I read the internet for the articles.
  9. It's the custom hardware, stupid.. by gaute · · Score: 2
    Sloppy journalists...

    Check out this
    http://www.fast.no/product/fastpmc.html

    gaute


    --
    -- We plunge for the slipstream the realness to find
    The incredible String Band
  10. Re:They're running Apache/FreeBSD by Jordy · · Score: 2

    No one ever said Linux was stable on every single machine in the world; it supports a whole lot of hardware which isn't all that stable itself. :)

    Linux Max Uptime: 845 days, 08:59m
    FreeBSD Max Uptime: 690 days, 23:48m

    Then again, there are about 1/10th the number of FreeBSD entrants... overall not a real big sampling group in general.

    Plus there's no information about what hardware anyone is using or why the machine was rebooted (kernel upgrades, hardware upgrades or crash).

    Overall, it's sorta pointless other than a nice figure to say my oscar meyer is bigger than yours.

    --
    The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.
  11. Re:Wow - I'm famous! by Hard_Code · · Score: 2

    Just for fun I decided to search for myself on Alltheweb. To my surprise I found:

    1. The plan for an old CS group project from college, where my name was referenced!

    2. 2 broken links to ZDNet talkbacks of mine.

    3. A CNet page with a dorky little media player I wrote and released as freeware.

    4. Some random Italian site hosting Win95 software including my dorky media player with full description extracted!!

    Wow...my head is swelling...

    Hmm...it didn't find my page though...heh

    Aaron

    --

    It's 10 PM. Do you know if you're un-American?