Slashdot Mirror


Building a Bigger Search Engine

skreuzer writes "Wired is running a story about a distributed web crawler called Grub. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed. Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day."

10 of 278 comments

  1. Re:Firewalls? by friedegg · · Score: 3, Informative

    You can always put an entry in your robots.txt to block it.

    Actually, the robots.txt issue is one they're still working on. Right now it doesn't check the file very often, which upsets some webmasters.

    They're open to suggestions, so maybe you could suggest a list of blacklisted IPs/hostnames. I suggested they support gzip-compressed web pages, and they said they'd look into it.
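For anyone who wants to test a robots.txt entry before relying on it, here is a minimal sketch using Python's standard `urllib.robotparser`. The `grub-client` token is an assumption taken from the User-Agent string the crawler sends, and `example.com` and the `/private/` path are placeholders; the token Grub actually honors may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "grub-client" is an assumed token based on the
# crawler's User-Agent string; the paths are placeholders.
ROBOTS_TXT = """\
User-agent: grub-client
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Pass only the product token, not the full "Mozilla/4.0 (...)" string:
    # robotparser matches on the part of the agent name before the first "/".
    print(is_allowed(ROBOTS_TXT, "grub-client-1.07", "http://example.com/index.html"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/index.html"))
```

This only shows what a compliant crawler would conclude from the file. It does nothing about a crawler that rarely re-reads robots.txt.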

    --
    Google doesn't index user sigs, so stop trying to "Google Bomb" with them.
  2. My Take on Grub by Anonymous Coward · · Score: 2, Informative

    LookSmart is only using Grub to save on bandwidth. Essentially, Grub just compresses web pages before sending them to LookSmart's indexer, reducing the bandwidth they have to pay for by a factor of 5 or so. The same thing could be accomplished with a proxy that compresses web pages. Eventually, once HTTP's mechanism for requesting compressed pages (the Accept-Encoding request header) is better supported by web servers, Grub will not be necessary.
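To get a feel for the compression argument, here is a rough sketch using Python's standard `gzip` module. The sample page is synthetic and highly repetitive, so its ratio says nothing about the "factor of 5" figure for real pages. `build_request` only constructs a request asking for gzip via the `Accept-Encoding` header and never sends it; the URL is a placeholder.

```python
import gzip
import urllib.request


def compression_ratio(page: bytes) -> float:
    """Original size divided by gzip-compressed size."""
    return len(page) / len(gzip.compress(page))


def build_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a request that asks the server for gzip."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


# Synthetic, repetitive HTML; real pages will compress less dramatically.
SAMPLE_PAGE = (
    "<html><body>"
    + "<p>Grub is a distributed web crawler.</p>\n" * 200
    + "</body></html>"
).encode("utf-8")

if __name__ == "__main__":
    print(f"ratio: {compression_ratio(SAMPLE_PAGE):.1f}x")
    print(build_request("http://example.com/").header_items())
```

If servers honored that header everywhere, the crawler could pull compressed pages directly, which is the scenario the comment describes.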

  3. They realize they aren't the REAL GRUB by anagama · · Score: 5, Informative

    From the readme in the Linux version - no idea what the other readmes might say. However, it appears that they are sensitive to the fact that the GRUB bootloader pre-existed their program. They are requesting catchy names. Here is an excerpt:

    Notice
    ======
    The main executable has been renamed to "grubclient" out of respect for the GNU Grub bootloader, who's executable is named "grub". They were out first, so we decided to pick another name. If you have a catchy suggestion for a new name, please let us know.

    --
    What changed under Obama? Nothing Good
  4. Re:Firewalls? by friedegg · · Score: 2, Informative

    Well, if you're getting into what-ifs, she could also email someone outside the company anything from inside the firewall. Or set up a file sharing client like Kazaa and share things on local and network drives.

    If you wanted to forbid the client from working, network admins could block port 3136 (I think it is), which would prohibit communication with the central server.

    My understanding is that Grub does not just crawl away randomly; rather, it's given a list of things to crawl by the central server. So, assuming it hasn't crawled your intranet before, and you don't give it a local site to crawl, it shouldn't normally find those pages. But, like I said, they're open to suggestions, so if you have some, offer them.

    --
    Google doesn't index user sigs, so stop trying to "Google Bomb" with them.
  5. You can run both by friedegg · · Score: 3, Informative

    Grub isn't a heavy CPU user. Right now, on my Athlon (~2400+), it's using between 0% and 2% of the CPU at any given time. Grub is mainly interested in your excess bandwidth.

    --
    Google doesn't index user sigs, so stop trying to "Google Bomb" with them.
  6. Re:Search engine software and lack of A.I. by Anonymous Coward · · Score: 1, Informative

    Google is very responsive to spam reports. Rather than simply remove spam sites as they find them, they prefer to "teach" their software what's bad by example. This can take a bit of extra time, but it seems worth it to me. Google even has a link on their search results for feedback if you're unhappy. Try reporting bad searches some time.

  7. Re:Legalities? by Anonymous Coward · · Score: 1, Informative

    A. I don't believe it caches anything except CRCs for the URLs. It downloads a page, calculates the CRC, checks whether the page has changed, and then discards it. And B. it doesn't download images or other media files, so no kiddie porn, unless it's text.
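The CRC check described above can be sketched roughly as follows. This illustrates the general technique with Python's `zlib.crc32` and is not Grub's actual code; `known_crcs`, `page_crc`, and `has_changed` are hypothetical names, and the checksum variant Grub uses is not stated here.

```python
import zlib
from typing import Dict


def page_crc(body: bytes) -> int:
    """CRC-32 of the page body as an unsigned 32-bit integer."""
    return zlib.crc32(body) & 0xFFFFFFFF


def has_changed(url: str, body: bytes, known_crcs: Dict[str, int]) -> bool:
    """Compare the page's CRC with the stored one, then update the store.

    Only the checksum is kept; the body itself can be thrown away.
    A URL seen for the first time counts as changed.
    """
    new_crc = page_crc(body)
    changed = known_crcs.get(url) != new_crc
    known_crcs[url] = new_crc
    return changed


if __name__ == "__main__":
    store: Dict[str, int] = {}
    print(has_changed("http://example.com/", b"<html>v1</html>", store))  # first visit
    print(has_changed("http://example.com/", b"<html>v1</html>", store))  # same content
    print(has_changed("http://example.com/", b"<html>v2</html>", store))  # edited page
```

A 32-bit CRC can collide, so an edit will occasionally go unnoticed. That is usually an acceptable trade for storing four bytes per URL instead of the page.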

  8. The open faucet, not the blown dam by SmartGamer · · Score: 2, Informative

    A DDoS is only effective because it's a whole bunch of messages all at once to one target, in the 100,000,000 range for a full-scale attack, to always cover all the positions.

    The database of "check-me"s is randomized rather evenly. Even if this takes off, I don't see how it could really do serious damage to any but the truly dinky servers: the hits will not come in all at once and flood the whole connection. While it very well could end up a constant stream, it's unlikely to be the massive stream that makes a DDoS.

    It does have the potential to slow servers across the world, but that's okay: it will slow home users' connections across the world by using 1/4 of them, too, so nobody will actually notice.

    --
    Warning: Poster of this comment is a nerd. Just like everybody else here.
  9. Re:Will Grub take off or be smashed? by bcrowell · · Score: 4, Informative

    Do you have any references? Please back up your claims.
    here, and here

    Actually I think the hole potentially gave the ability to run arbitrary code, which isn't the same as a root vulnerability.

  10. Re:Grub does NOT look for robots.txt by Anonymous Coward · · Score: 3, Informative

    Here it is on mine requesting it:

    64.241.242.18 - - [18/Mar/2003:17:25:30 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
    64.241.242.18 - - [19/Mar/2003:19:41:05 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
    64.241.243.81 - - [30/Mar/2003:22:10:41 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
    64.241.243.81 - - [01/Apr/2003:23:11:21 -0700] "GET /robots.txt HTTP/1.1" 200 223 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"

    Notice those are LookSmart-owned IPs and not just normal user crawlers. They seem to crawl for robots.txt centrally. They do know, however, that they need to crawl for robots.txt more often.
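To see how often robots.txt is fetched on your own server, a small sketch like the one below can pull those requests out of an access log and measure the gaps between them. It assumes the Apache combined log format shown above; the regular expression, the function names, and the `grub-client` filter token are illustrative choices. `SAMPLE_LOG` holds the first two log lines from this comment.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional

# Assumed layout: Apache "combined" log format, as in the lines above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[Dict]:
    """Parse one log line into a dict, or return None if it doesn't match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def robots_fetches(lines: Iterable[str], agent_token: str = "grub-client") -> List[Dict]:
    """Keep only /robots.txt requests whose User-Agent contains agent_token."""
    records = (parse_line(line) for line in lines)
    return [
        r for r in records
        if r and r["path"] == "/robots.txt" and agent_token in r["agent"]
    ]


def gaps(fetches: List[Dict]) -> List[timedelta]:
    """Time between consecutive fetches, in log order."""
    return [b["time"] - a["time"] for a, b in zip(fetches, fetches[1:])]


SAMPLE_LOG = [
    '64.241.242.18 - - [18/Mar/2003:17:25:30 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"',
    '64.241.242.18 - - [19/Mar/2003:19:41:05 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"',
]

if __name__ == "__main__":
    for gap in gaps(robots_fetches(SAMPLE_LOG)):
        print(gap)
```

Run against a full log, long gaps in the output would back up the complaint that the file isn't re-checked often enough.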