Building a Bigger Search Engine

Posted by ryuzaki0 on Saturday April 19, 2003 @02:15PM from the size-isn't-everything dept.

skreuzer writes "Wired is running a story about a distributed web crawler called Grub. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed. Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day."

5 of 278 comments

Re:Firewalls? by friedegg · 2003-04-19 14:40 · Score: 3, Informative

You can always put an entry in your robots.txt to block it.

Actually, the robots.txt issue is one they're still working on. Right now it doesn't check the file very often, which upsets some webmasters.
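For reference, a robots.txt entry of the kind mentioned above might look like the rules embedded in this sketch. The `grub-client` token is an assumption taken from the user-agent string the client sends (`grub-client-1.07`); check Grub's own documentation for the token it actually honors. The snippet only shows how a standard parser reads such rules, and says nothing about how often Grub re-reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out one crawler and leaves others alone.
# "grub-client" is an assumed token, not confirmed by the Grub documentation.
ROBOTS_TXT = """\
User-agent: grub-client
Disallow: /
"""


def is_allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_allowed("grub-client-1.07", "http://example.com/index.html"))
    print(is_allowed("SomeOtherBot", "http://example.com/index.html"))
```

Whether this helps in practice depends on the crawler fetching robots.txt recently enough to see the rule.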

They're open to suggestions, so maybe you could suggest a list of blacklisted IPs/hostnames. I asked them to consider supporting gzip-compressed web pages, and they said they'd look into it.
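To make the gzip suggestion concrete: a client that supports compressed pages advertises it with an `Accept-Encoding: gzip` request header and unpacks any response marked `Content-Encoding: gzip`. The sketch below is not Grub's code; `REQUEST_HEADERS` and `decode_body` are hypothetical names, and only the decoding step is shown so it runs without a network connection.

```python
import gzip

# Header a crawler could send to tell servers it accepts gzip-compressed pages.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}


def decode_body(body, content_encoding):
    """Return the page bytes, decompressing if the server used gzip.

    `content_encoding` is the value of the Content-Encoding response
    header, or None when the header is absent.
    """
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


if __name__ == "__main__":
    page = b"<html>" + b"hello " * 100 + b"</html>"
    packed = gzip.compress(page)
    # Repetitive HTML compresses well, which is the bandwidth saving at stake.
    print(len(page), len(packed))
    print(decode_body(packed, "gzip") == page)
```

Since Grub is mostly a bandwidth consumer, the saving would apply on both the crawler's side and the webmaster's.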

--
Google doesn't index user sigs, so stop trying to "Google Bomb" with them.

They realize they aren't the REAL GRUB by anagama · 2003-04-19 14:55 · Score: 5, Informative

From the README in the Linux version (no idea what the other READMEs might say). It appears that they are sensitive to the fact that the GRUB bootloader predates their program, and they are asking for catchy new names. Here is an excerpt:

Notice
======
The main executable has been renamed to "grubclient" out of respect for the GNU Grub bootloader, whose executable is named "grub". They were out first, so we decided to pick another name. If you have a catchy suggestion for a new name, please let us know.

--
What changed under Obama? Nothing Good

You can run both by friedegg · 2003-04-19 15:08 · Score: 3, Informative

Grub isn't a heavy CPU user. Right now, on my Athlon (~2400+), it's using between 0 and 2% of the CPU at any given time. Grub is mainly interested in your excess bandwidth.

--
Google doesn't index user sigs, so stop trying to "Google Bomb" with them.

Re:Will Grub take off or be smashed? by bcrowell · 2003-04-19 16:48 · Score: 4, Informative

Do you have any references? Please back up your claims.
here, and here
Actually I think the hole potentially gave the ability to run arbitrary code, which isn't the same as a root vulnerability.

--
Find free books.

Re:Grub does NOT look for robots.txt by Anonymous Coward · 2003-04-20 01:44 · Score: 3, Informative

Here it is on mine requesting it:

64.241.242.18 - - [18/Mar/2003:17:25:30 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
64.241.242.18 - - [19/Mar/2003:19:41:05 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
64.241.243.81 - - [30/Mar/2003:22:10:41 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
64.241.243.81 - - [01/Apr/2003:23:11:21 -0700] "GET /robots.txt HTTP/1.1" 200 223 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"

Notice those are LookSmart-owned IPs and not just normal user crawlers. They seem to crawl for robots.txt centrally. They do know, however, that they need to crawl for robots.txt more often.
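Anyone who wants to check their own logs the same way could use something like the sketch below. It assumes the access log is in the common "combined" format, which the lines above appear to follow; the function names and the `grub-client` marker are illustrative choices, not anything Grub provides. It pulls out the robots.txt requests from the Grub client and reports the gap between consecutive fetches.

```python
import re
from datetime import datetime

# Pattern for one line of a "combined"-style access log (assumed format).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def robots_fetches(lines, agent_marker="grub-client"):
    """Return (ip, timestamp) pairs for robots.txt requests, oldest first."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m["path"] == "/robots.txt" and agent_marker in m["agent"]:
            when = datetime.strptime(m["when"], "%d/%b/%Y:%H:%M:%S %z")
            hits.append((m["ip"], when))
    return sorted(hits, key=lambda hit: hit[1])


def gaps_in_days(hits):
    """Whole days elapsed between each pair of consecutive fetches."""
    times = [when for _, when in hits]
    return [(later - earlier).days for earlier, later in zip(times, times[1:])]


UA = "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
# The four requests quoted in the comment above.
SAMPLE = [
    f'64.241.242.18 - - [18/Mar/2003:17:25:30 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "{UA}"',
    f'64.241.242.18 - - [19/Mar/2003:19:41:05 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "{UA}"',
    f'64.241.243.81 - - [30/Mar/2003:22:10:41 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "{UA}"',
    f'64.241.243.81 - - [01/Apr/2003:23:11:21 -0700] "GET /robots.txt HTTP/1.1" 200 223 "-" "{UA}"',
]

if __name__ == "__main__":
    print(gaps_in_days(robots_fetches(SAMPLE)))
```

For the four requests quoted above, the gaps come out to 1, 11, and 2 whole days, which fits the complaint that robots.txt is not re-fetched very often.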