Building a Bigger Search Engine
skreuzer writes "Wired is running a story about a distributed web crawler called Grub. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed. Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day."
Also, the Grub engine crawls everything, including adult content and other questionable content. They have a setting to turn it off, but it does not block it. With the current questioning of international law relating to accessing illegal websites, this could have major consequences for the average user.
So for the time being I have stopped using the Grub client until some serious questions are answered. It's an interesting concept, and if it were being used in more of an academic setting it could be worthwhile. However, I believe that search engines like Google are doing pretty well by themselves.
Go calculate something
LookSmart hopes to tap the altruistic nature of many Internet users.
That unfortunately seems like a naively optimistic hope. While the
vast majority of people may be altruistic, it only takes a few
unscrupulous individuals to completely undermine a fair result.
It's interesting that this idea is an extension of Google's model in
many ways. Essentially, Google is able to index so much of the
Internet by having 50,000+ servers. I don't think that's what makes
Google such a useful search tool; rather, I think it's accuracy and
relevancy. If my search results started getting polluted with bogus
hits, I would stop using it almost immediately.
Unfortunately, by letting people run the client on their machine and
having it send the results back to the server, I think spoofed
results are inevitable. I don't think it will be possible to
safeguard the results either. It will be interesting to see how well
this project survives *when* people start spoofing results. It's
been a problem for SETI@home, and it's something that undermined some
people's faith in the project as a whole. If the spoofed results are
more widespread and have a larger impact, as they would in a system
like this, it may ultimately prove fatal to the project.
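
For what it's worth, one commonly discussed mitigation for spoofed
client results is redundancy: hand the same URL to several unrelated
clients and only accept a value that most of them agree on. Below is a
minimal sketch of that idea, not anything Grub is known to do. The
function name, the thresholds, and the idea of reporting a content
checksum per URL are all assumptions made for illustration.

```python
from collections import Counter


def accept_result(reports, min_reports=3, min_agreement=0.6):
    """Return the value most clients agree on, or None if no consensus.

    reports: hashable values (for example content checksums) that
    different clients reported for the same URL. Hypothetical format.
    """
    if len(reports) < min_reports:
        return None  # too few independent reports to judge

    # Find the single most frequently reported value and its count.
    value, count = Counter(reports).most_common(1)[0]

    # Accept only if a large enough share of clients reported it.
    if count / len(reports) >= min_agreement:
        return value
    return None
```

This only helps if the clients checking a given URL are really
independent. A group of colluding clients that ends up with the
majority of reports for a URL would still win the vote, so it raises
the cost of spoofing rather than ruling it out.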
One factor that has been absolutely critical to Google's success has
been their ability to remain resistant to spoofing attempts. It's
still a question mark how well Grub will perform in that context.
Doug Tolton
"The destruction of a value which is, will not bring value to that which isn't." -John Galt
until someone figures out a way to compromise their local client's results and "escalate" their fave URLs.
It still sounds like a really cool idea though.
Don't think that a small group of dedicated individuals can't change the world. It's the only thing that ever has.
So if I choose to run this client, how do I know that it won't accidentally index content that is only accessible from behind my firewall?
What's the difference between my machine indexing them and the university students recently being hauled into court for indexing open shares? Why would I not be held liable for contributory copyright infringement?
No thanks.
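On the firewall question above: whether the Grub client does anything like this is not stated anywhere in the story, but a crawler could in principle refuse any URL whose host resolves to a private, loopback, or link-local address. A rough sketch, assuming a hypothetical `resolve` callable that returns a list of IP address strings for a hostname (in practice it might wrap `socket.getaddrinfo`):

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_ip(ip_text):
    """True for addresses that normally live behind a firewall."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def safe_to_crawl(url, resolve):
    """Return True only if every address for the URL's host is public.

    resolve: hypothetical callable mapping a hostname to a list of
    IP address strings; it may raise OSError on lookup failure.
    """
    host = urlsplit(url).hostname
    if host is None:
        return False  # no hostname to check, so refuse
    try:
        addresses = resolve(host)
    except OSError:
        return False  # unresolvable hosts are skipped
    # Refuse if the lookup came back empty or any address is internal.
    return bool(addresses) and not any(is_internal_ip(a) for a in addresses)
```

This does nothing for content that sits on a public address but is only reachable from your network because of access rules, so it narrows the problem rather than solving it.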
I prefer grid.org to grub.org. There the cycles are going to cancer or smallpox research. Currently over 2 million machines are participating.
Altruism has its place, but since I'm more likely to die of cancer than of not having the complete www indexed I think I'll be selfish and work towards a cure for something that may affect me.
But it still kind of irks me that people think a computerized 'dumb' search result could compete with a human rating system that filters spam, porn, and other garbage results. Google should hire some REAL PEOPLE who can build some sort of categorized, intelligent directory so we can have QUALITY at the beginning of a search result. Some sort of HUMAN RATING system is needed to sort the results. The software is not up to par.
Yeah. If you help Grub, Grub gives your web site a preferential listing. Building the biggest search engine, sure. Building good search results, not so sure.
You can always use the Google API for more than 2,000 searches per day if you pay licensing fees for it. That's just Google ensuring that it can remain a viable company. Little text-box advertisements just don't cut it in this day and age where blatant pop-ups and colorful banner ads don't even have much turn-around. That's not the point though.
The point is that I wouldn't look anytime soon for LookSmart to allow unlimited usage of this API. It's too large a project for them to just let people use it. It's simple economics. They may not be investing the computing resources in this project's web-spidering software, but it's still using TONS of resources to keep this data catalogued and readily accessible.
Of course, I am the first one to question this trend. Has anyone else considered the possibility that one day we'll wake up and notice that Google is charging for access to its basic searching services?
I, for one, would probably pay. I have become so dependent on it. What price? That's a good question...