Building a Bigger Search Engine
skreuzer writes "Wired is running a story about a distributed web crawler called Grub. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed. Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day."
You can always put an entry in your robots.txt to block it.
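For anyone who wants to check such an entry before relying on it, here is a minimal sketch using Python's standard `urllib.robotparser`. The `grub-client` user-agent token is an assumption taken from the User-Agent string the client sends; the thread does not say which token the crawler actually matches against, and `example.com` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The "grub-client" token is an assumption;
# check the crawler's own documentation for the token it really honors.
ROBOTS_TXT = """\
User-agent: grub-client
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("grub-client", "SomeOtherBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "http://example.com/index.html")
        print(agent, "allowed" if verdict else "blocked")
```

This only tells you what a well-behaved parser would conclude; it cannot tell you whether a given crawler has re-read the file recently.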
Actually, the robots.txt issue is one they're still working on. Right now it doesn't check the file very often, which upsets some webmasters.
They're open to suggestions, so maybe you could propose a list of blacklisted IPs/hostnames. I suggested they support gzip-compressed web pages, and they said they'd look into it.
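To make the gzip suggestion concrete, here is a rough sketch of what client-side support involves: advertise `Accept-Encoding: gzip` in the request, then decompress the body when the server answers with `Content-Encoding: gzip`. This is not Grub's code; `decode_body` and `REQUEST_HEADERS` are hypothetical names used only for illustration.

```python
import gzip
from typing import Optional

# Header a crawler would send to tell the server it accepts gzip (illustrative).
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}


def decode_body(body: bytes, content_encoding: Optional[str] = None) -> bytes:
    """Decompress a response body if the server marked it as gzip-encoded."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    # No recognized compression: return the bytes unchanged.
    return body


if __name__ == "__main__":
    page = b"<html><body>hello</body></html>" * 50
    compressed = gzip.compress(page)
    print(len(page), "bytes raw,", len(compressed), "bytes on the wire")
    assert decode_body(compressed, "gzip") == page
```

Since the client is mostly spending bandwidth rather than CPU, trading a little decompression work for smaller transfers seems like a reasonable deal.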
Google doesn't index user sigs, so stop trying to "Google Bomb" with them.
This is from the README in the Linux version; I have no idea what the other READMEs might say. However, it appears that they are sensitive to the fact that the GRUB bootloader pre-existed their program. They are requesting catchy names. Here is an excerpt:
Notice
======
The main executable has been renamed to "grubclient" out of respect for the GNU Grub bootloader, who's executable is named "grub". They were out first, so we decided to pick another name. If you have a catchy suggestion for a new name, please let us know.
Grub isn't a heavy CPU user. Right now, on my Athlon (~2400+), it's using between 0 and 2% of the CPU at any given time. Grub is mainly interested in your excess bandwidth.
here, and here
Actually I think the hole potentially gave the ability to run arbitrary code, which isn't the same as a root vulnerability.
Here it is on mine requesting it:
64.241.242.18 - - [18/Mar/2003:17:25:30 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
64.241.242.18 - - [19/Mar/2003:19:41:05 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
64.241.243.81 - - [30/Mar/2003:22:10:41 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
64.241.243.81 - - [01/Apr/2003:23:11:21 -0700] "GET /robots.txt HTTP/1.1" 200 223 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
Notice those are LookSmart-owned IPs and not just normal user crawlers. They seem to crawl for robots.txt centrally. They do know, however, that they need to fetch robots.txt more often.
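If you want to see how often the crawler re-reads robots.txt on your own server, a small script can pull those requests out of the access log. The sketch below assumes the Apache "combined" log format shown in the excerpt; the sample lines are the log entries above reassembled into that format, and `robots_fetches` and `gaps_in_hours` are hypothetical helper names.

```python
import re
from datetime import datetime

# Apache "combined" log format (assumed from the excerpt above).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

AGENT = ("Mozilla/4.0 (compatible; grub-client-1.07; "
         "Crawl your own stuff with http://grub.org)")

# Sample lines reassembled from the log excerpt in the comment above.
SAMPLE_LOG = [
    f'64.241.242.18 - - [18/Mar/2003:17:25:30 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "{AGENT}"',
    f'64.241.242.18 - - [19/Mar/2003:19:41:05 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "{AGENT}"',
    f'64.241.243.81 - - [30/Mar/2003:22:10:41 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "{AGENT}"',
    f'64.241.243.81 - - [01/Apr/2003:23:11:21 -0700] "GET /robots.txt HTTP/1.1" 200 223 "-" "{AGENT}"',
]


def robots_fetches(lines, agent_substring="grub-client"):
    """Return (ip, timestamp) pairs for robots.txt requests by the given agent."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if not match:
            continue  # skip lines that are not in the combined format
        if match["path"] != "/robots.txt" or agent_substring not in match["agent"]:
            continue
        when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        hits.append((match["ip"], when))
    return sorted(hits, key=lambda hit: hit[1])


def gaps_in_hours(hits):
    """Hours elapsed between consecutive robots.txt fetches."""
    return [(b[1] - a[1]).total_seconds() / 3600 for a, b in zip(hits, hits[1:])]


if __name__ == "__main__":
    hits = robots_fetches(SAMPLE_LOG)
    for ip, when in hits:
        print(ip, when.isoformat())
    print([round(gap, 1) for gap in gaps_in_hours(hits)])
```

Point it at a real log by passing an open file instead of `SAMPLE_LOG`; the gaps between fetches give a rough picture of how stale the crawler's copy of your robots.txt can get.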