Indexing the Entire Web?
cah1 writes "BBC is carrying a story about another new search engine, All The Web. The designers are planning to have the whole shooting match, all billion pages, indexed by the end of the year." You can also read press from the company. I'm skeptical: they claim to be able to catch up within the first year, and keep up thereafter. But they claim to have 200 million already, so who knows?
I've only considered this as a strictly volunteer project, directed by a university, with the top-level hosts and database hosted there and some corporate sponsorship thrown in for good measure.
I don't know if this would work if commercialized, since a lot of the folks who have the knowledge, experience and compute power to participate would probably not feel too warm or fuzzy about helping to build the next Yahoo!, especially when the IPO made the company worth millions overnight. It would certainly be tough to maintain the same level of participation after going commercial, unless some hitherto unforeseen way of rewarding participation per contribution were discovered. Perhaps corporate sponsors could offer premiums to contributors based on sites spidered? Maybe something along the lines of frequent flyer miles?
slashdot broke my sig
I just created http://www.egroups.com/group/dizz-net/ as an email discussion list. You can subscribe by sending email to dizz-net-subscribe@egroups.com. There are a lot of interesting issues, many already mentioned here:
-david.
I judge search engines by the most important criterion of all - how many references to me they have. Alltheweb now has vastly more than runner-up Google, making them the biggest ever. I typed in "Aaron M. Renn" and I got 1604 on AllTheWeb, ~500 on Google and only ~180 on AltaVista. Even if that number drops as I search through the pages, it's still impressive. I did look through the plain "Aaron Renn" listings too, where they also crushed the competition (though it's a much smaller number of pages since I virtually always use my middle initial). Believe it or not, there is a page out there with another "Aaron Renn" on it. Pretty weird.
I've been using it for a few days now, and it seems impressive. It's certainly fast. Google is still my engine of choice (even though it's visited my page a ton of times, and still won't find it when I search for it).
As for its coverage: it may be "the result of more than a decade of research into optimising search algorithms and architectures", but frankly this sounds dubious.
If it covers 30% of the web it'll be twice as good as existing engines, but I suppose thirdoftheweb.com isn't that catchy.
This would be a great application for distributed computing: lots of computers indexing the web, and after they finish that, they can revisit sites for broken, moved and changed content... First post?
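For what it's worth, here is a rough sketch of how the work might be split among volunteer machines. This is only an assumption about one possible scheme, not anything All The Web or anyone in this thread has described: each URL is assigned to a node by hashing its hostname, so all pages of one site land on the same machine and that machine can pace its own requests. The names node_for_url and partition are made up for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Pick a crawler node for a URL by hashing its hostname (hypothetical scheme)."""
    # urlsplit lowercases the hostname, so WWW.FAST.NO and www.fast.no match.
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce to a node id.
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes: int):
    """Group URLs into one work list per node."""
    buckets = {n: [] for n in range(num_nodes)}
    for url in urls:
        buckets[node_for_url(url, num_nodes)].append(url)
    return buckets


if __name__ == "__main__":
    sample = [
        "http://www.fast.no/product/fastpmc.html",
        "http://www.fast.no/",
        "http://www.egroups.com/group/dizz-net/",
    ]
    for node, work in partition(sample, 4).items():
        print(node, work)
```

A plain modulo like this reshuffles most hosts whenever the number of nodes changes, so a real volunteer network with machines coming and going would probably want something smarter.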
But another problem is the amount of dynamically generated content. There simply ISN'T any way for a search engine to safely index everything on the web, because as long as no decent clues are given, it can't know which CGIs just serve up a finite selection of pages from a database and which randomly generate content.
The amount of dynamically generated content is growing dramatically, so this will be an increasing problem.
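One crude workaround a crawler could use, sketched below, is to cap how many query-string URLs it fetches from any single script. This is purely an illustrative heuristic under my own assumptions (the class name and the default limit of 50 are invented), and it does not solve the problem above: it will still cut off a legitimately large database and still waste fetches on a random generator.

```python
from collections import Counter
from urllib.parse import urlsplit


class DynamicPageBudget:
    """Limit fetches of query-string URLs per (host, path) script. Hypothetical heuristic."""

    def __init__(self, max_fetches_per_script: int = 50):
        self.max_fetches = max_fetches_per_script
        self.fetched = Counter()  # (host, path) -> dynamic fetches allowed so far

    def allow(self, url: str) -> bool:
        parts = urlsplit(url)
        if not parts.query:
            # No query string: treat the page as static and always allow it.
            return True
        key = (parts.hostname, parts.path)
        if self.fetched[key] >= self.max_fetches:
            # Budget for this script is used up; skip further variants.
            return False
        self.fetched[key] += 1
        return True


if __name__ == "__main__":
    budget = DynamicPageBudget(max_fetches_per_script=2)
    for n in range(4):
        url = f"http://example.com/cgi-bin/random.cgi?seed={n}"
        print(url, budget.allow(url))
```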
I wondered about those 200M pages already indexed, and I dug into Altavista, which says it has ~140M pages indexed.
I made two searches; one for the word 'Microsoft' and the other for 'Linux'.
Altavista gave : 12,682,370 (M$) and 4,526,430 (LX).
FAST gave : 4,689,227 (M$) and 2,570,827 (LX).
So.. If FAST currently is ~40% bigger than Altavista, how come they return numbers that much lower? With such large numbers it can't be pure coincidence, In My Humble Opinion.
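To put numbers on that, here is a small sketch that redoes the arithmetic using only the figures quoted above (200M and ~140M pages, plus the four hit counts). It assumes the reported hit counts are comparable between the two engines, which may not be true if either one estimates its totals or counts matches differently.

```python
# Index sizes as claimed in this thread (approximate).
ALTAVISTA_INDEX = 140_000_000
FAST_INDEX = 200_000_000

# Hit counts reported by each engine for the two test queries.
HITS = {
    "Microsoft": {"altavista": 12_682_370, "fast": 4_689_227},
    "Linux": {"altavista": 4_526_430, "fast": 2_570_827},
}


def share(hit_count: int, index_size: int) -> float:
    """Fraction of an index that matched the query."""
    return hit_count / index_size


# How much bigger FAST's claimed index is than AltaVista's.
index_ratio = FAST_INDEX / ALTAVISTA_INDEX

# FAST hits divided by AltaVista hits, per query.
hit_ratios = {term: c["fast"] / c["altavista"] for term, c in HITS.items()}

if __name__ == "__main__":
    print(f"index size ratio FAST/AltaVista: {index_ratio:.2f}")
    for term, counts in HITS.items():
        print(
            f"{term}: FAST/AltaVista hits = {hit_ratios[term]:.2f}, "
            f"AltaVista share = {share(counts['altavista'], ALTAVISTA_INDEX):.1%}, "
            f"FAST share = {share(counts['fast'], FAST_INDEX):.1%}"
        )
```

If both indexes sampled the web in the same way, the hit ratios should sit near the index size ratio; with these figures they come out well below 1 for both queries.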
-Snotboble
Q: How does a Unix guru have sex? A: unzip;strip;touch;finger;mount;fsck;more;yes;umount;sleep
Check out this
http://www.fast.no/product/fastpmc.html
gaute
-- We plunge for the slipstream the realness to find
The Incredible String Band
No one ever said Linux was stable on every single machine in the world; it supports a whole lot of hardware which isn't all that stable itself. :)
Linux Max Uptime: 845 days, 08:59m
FreeBSD Max Uptime: 690 days, 23:48m
Then again, there are about 1/10th the number of FreeBSD entrants... overall not a real big sampling group in general.
Plus there's no information about what hardware anyone is using or why the machine was rebooted (kernel upgrades, hardware upgrades or crash).
Overall, it's sorta pointless other than a nice figure to say my oscar meyer is bigger than yours.
--
The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.
Just for fun I decided to search for myself on Alltheweb. To my surprise I found:
1. The plan for an old CS group project from college, where my name was referenced!
2. 2 broken links to ZDNet talkbacks of mine.
3. A CNet page with a dorky little media player I wrote and released as freeware.
4. Some random Italian site hosting Win95 software including my dorky media player with full description extracted!!
Wow...my head is swelling...
Hmm...it didn't find my page though...heh
Aaron
It's 10 PM. Do you know if you're un-American?