Slashdot Mirror


Peer-to-Peer Search Engine Wants You To Help Grub

FuzzyMan45 writes: "Check this out! Grub.org has finally finished writing their internet crawler. For those of you who don't know, Grub is a distributed internet crawler that is indexing the internet and working towards an almost real-time index of all the pages, combined with a search engine. Think about it, a search with no dead links and no out-of-date pages! Grub is the way to go." It sounds like a cool hybrid of client- and server-side information (a crawler works from your computer, with updated findings sent to a central repository) and all GPL'd. If Grub outperforms Google, I'll be happy as a Google-using clam -- but unless they have Google's logic and caching, that is a very tall order. A better search engine is a pleasant dream though.

13 of 64 comments

  1. Free work, bad indexes and huge loads? by Klaruz · · Score: 5

    I have a few 'issues' with this.

    1) Lots of people running Grub means that they don't have to spend money on bandwidth to index sites; they get nice indexed data for free. Sure, it's nice for them, but it seems like a kind of lame thing to do. I wonder how this will sit with people.

    2) What if I modify/run their GPL software to index my collection of web pages, and I return indexes that contain pages that aren't really there, to taint their search engine with my 10,000 nonexistent pr0n pages? Do they have a way around that problem?

    3) Userland's editthispage.com and weblogs.com stopped allowing certain bots to index them because they were being abusive. Google was accidentally blocked, even though it was being nice. But still, the load of Google crawling all those pages is huge. Grub says they can crawl every page on the internet every day. What kind of load will this cause? The FAQ says your client will only do part of the internet, so you won't have everybody hitting your site daily, but I'm wondering what kind of flaws exist in that logic. Could this result in huge loads on webservers, like the Slashdot effect or a DDoS?
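
    To make the "your client will only do part of the internet" claim concrete, here is a minimal sketch of one way the work could be split so that a site is never hit by every client at once. It assumes the scheduler hashes each URL's host name to pick a client; Grub's real scheduler is not described in this thread, and assign_client and num_clients are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def assign_client(url: str, num_clients: int) -> int:
    """Pick one client index for a URL by hashing its host name.

    Every page on the same host maps to the same client, so a site
    sees one crawler rather than the whole volunteer pool.
    (Hypothetical scheme, not Grub's documented behavior.)
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_clients


if __name__ == "__main__":
    for u in ("http://example.com/a", "http://example.com/b", "http://example.org/"):
        print(u, "-> client", assign_client(u, 8))
```

    Note that this only limits how many clients touch a site. It does nothing about how fast the one assigned client fetches, so a daily crawl of a large site could still be a heavy load.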

    1. Re:Free work, bad indexes and huge loads? by Restil · · Score: 3

      Actually, we're not really providing an index for them, we're providing to them a list of pages that were updated. The clients don't do much except monitor pages to see when they get updated. Of course, the clients COULD just send all the data to the servers, it makes no difference on the server side, they'll have to archive all the data anyways. The advantage here is they get notified of all the pages that change and can poll them at that time, instead of having to poll their entire index constantly if only 0.01% of those pages change every time.
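
      The "monitor pages and report only the ones that changed" idea can be sketched roughly as follows. This assumes a client remembers a content hash per URL and compares it on each visit; the function names and the choice of SHA-256 are illustrative assumptions, not taken from the Grub client.

```python
import hashlib
from typing import Dict, List


def fingerprint(body: bytes) -> str:
    """Return a hex digest that stands in for the page content."""
    return hashlib.sha256(body).hexdigest()


def changed_urls(previous: Dict[str, str], fetched: Dict[str, bytes]) -> List[str]:
    """Return the URLs whose content is new or differs from the stored hash.

    `previous` maps URL -> last seen digest and is updated in place.
    `fetched` maps URL -> page body from the current crawl pass.
    """
    changed = []
    for url, body in fetched.items():
        digest = fingerprint(body)
        if previous.get(url) != digest:
            changed.append(url)
            previous[url] = digest
    return changed
```

      Only the returned list would need to go to the server, which is where the saving comes from if a tiny fraction of pages change between passes.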

      Ultimately this comes down to fairness and who owns the database itself. If it's open and free to everyone, then this is a good cause, even if they have to generate revenue to support the site that serves the searches. In fact, if they distribute their servers properly, it won't even be necessary to have a large revenue stream, as the load could be scattered more or less evenly over the entire volunteer pool.

      Unfortunately, due to their somewhat fishy intentions with regards to revenue, it might end up killing the project before it ever takes off.
      And it really seems like a good idea too.

      Oh well.

      -Restil

      --
      Play with my webcams and lights here
  2. What's the license on the database? by Paul+Crowley · · Score: 5

    The software may be open source, but what's the license on the content of the database? I don't want to put huge amounts of work into creating what will become someone else's proprietary content, a la CDDB...

    1. Re:What's the license on the database? by baptiste · · Score: 5
      Read their Investor Page - they absolutely plan on charging the search engines to use the data AND to sell top result spots to the highest bidder. Open source or no open source - this is a joke - they won't get a sliver of my bandwidth.

      Here is the section outlining what they plan to do with all this free data 'volunteers' give them:

      The first revenue stream will come from selling URL status information to companies like Google and Altavista. This status information will enable existing crawlers to target the crawls for a particular day, based on the highly up-to-date information contained in our database. These status updates are similar in nature to the service provided by someone like NetMind, in which a change on a website triggers an action. Grub's database will be much vaster by comparison however, enabling it to provide services directly to wholesale search engines.

      Second, Grub will begin selling "wholesale searches" to other search engines and companies. Grub will make strategic alliances with other search engines much in the same way that Google has done with Yahoo and Inktomi has done with Hotbot. Grub will also provide one-shot search results for a large search query, delivering the data in a database format (like XML) instead of a web format.

      Third, Grub will begin charging website customers for content control. Content control consists of indexing updated information on a regular basis and controlling link placement in search results. Large sites whose revenue depends on sustained inbound web traffic will be charged based on the amount of data that they submit into Grub's database, and on what placement they get in Grub's search result sets.

      Fourth, Grub will provide consulting services for companies wanting to set up their own Grub networks. Large corporate intranets could be quickly and efficiently indexed into a central database with the Grub client/server model. Consulting and coding for these proprietary installations is a common model in Open Source oriented businesses like Sendmail, MySQL and Apache.

      Guess they thought we were really that stupid!

    2. Re:What's the license on the database? by GNU+Zealot · · Score: 3

      Why don't we create an open, wholesome Grub?

      We could use the existing Grub software, but modify it to report to a community-run free database. Modifying the software to report to different servers would be rather easy, as would reverse engineering and replicating their database and website. However, the sticking point would be funding for the large amount of bandwidth a site like this would need.

  3. What about manipulation? by Florian · · Score: 3

    What if web masters manipulate the spider and return false search results (for luring people into pr0n, spam, propaganda...)?
    The Grub concept sounds good, but I doubt it will stand up to reality unless you create a complex "web of trust" system. (Which in turn would be too complex for Grub to become popular.)

    --
    gopher://cramer.plaintext.cc http://cramer.plaintext.cc:70
  4. Some Clarifications on Grub. by maddboyy · · Score: 4

    Wow, I'm surprised to see Grub on Slashdot this morning. The first client beta was _just_ released last night!

    Anyways, I know the Grub guys and was there when Grub was just an idea being discussed over coffee. Although I can't speak with 100% authority, I feel that I can give some insight and perhaps some clarification to a few concerns/questions floating around. It appears that Kord and Iggy may have left a bit to be desired on the FAQ :)

    From my understanding, the initial desired audience is the ISP admin. As an ISP, you'd be able to have your grub client index and crawl sites that you host. In turn, those sites will be available on whatever search engine Grub is supplying data to. Those running an ISP or hosting websites know how often clients request that you make sure they get crawled and listed in a search engine; this is a pretty nice value-add for your ISP service then. In this case, it's a win-win scenario. Grub gets up to date information on sites and the ISP gets to provide a much requested service to its customers.

    Later, I believe the plan to encourage individuals on broadband connections is to provide rewards for a certain number of sites crawled and also prizes for top crawlers.

    There are some concerns about the licensing of the database. It's my understanding that Grub is taking a commercial-pay/non-commercial-free approach. That means for instance, if you started an opensource search engine like aspseek.org you could use the Grub data for free. But if you're Google or Inktomi, you'll have to pay for access.

    The data will not be free to everyone. There's just no way anybody can provide the overhead costs for that kind of service free to everyone. I think charging only for commercial use is the best option in this case. Also, keep in mind that the server will eventually be released as well. This means that individuals could run their own grub servers and stockpile their own data.

    As far as the few statements regarding the stock options payment, I'm pretty sure all of the in house full-time developers get paid real money. However, Kord is really determined to make sure that those people kicking in 5-10hrs a week in their spare time get to share in some of the success when Grub hits it big. Once again, that's a win-win situation. The contributors get to work on a promising, useful OS project and if the world comes knocking for this better mouse trap, the contributors also get a bit of cash for their troubles.

    I'd encourage those that have concerns or are curious about the project to go ahead and download the client now while it's in such early development. Take a look at the code. Email Kord and Iggy and tell them what you think. Even email them if you think Grub is a stupid idea, but tell them why. I don't think wanting to make a successful commercial P2P application is a bad idea in and of itself.

  5. Interesting business plan by Russ+Nelson · · Score: 3

    Interesting business plan. Pay people 100% in stock options -- and in a business where many stock options have proven to be worthless. Well, it might fly.
    -russ

    --
    Don't piss off The Angry Economist
  6. A Grub's Response by kordless · · Score: 5
    Ok. It's us, the Grub guys here.

    First off, we didn't expect to get Slashdotted so fast. We really weren't ready, but we do appreciate the attention that it has brought. I'd like to address some of the issues that you guys have brought up because they ARE important to us. After all, you are (hopefully) running the client and we do care about what you think.

    OK, about the money thing. A few of you are blasting us for trying to make a buck off this idea. Last time I looked, it takes money to pay for servers, bandwidth and programmers! We didn't start this thing with the intention of ripping people off - we did it because current search engine crawling technology is behind the times and we thought we could fix it. Don't fault us for having a revenue model and a desire to build a solid company that feeds us.

    A LOT of you guys could benefit from having your web pages continuously indexed. It would help your customers AND it possibly could increase the quality of service you provide. Besides, we aren't proposing to charge you for this service if you help out by running the client - that would defeat the purpose of the whole project.

    Don't you already pay for bandwidth that gets used up by Google, or even Excite? What's the difference if we use it instead, and it possibly works better for you in the end?

    About the bootloader thing. Sorry about that guys, I didn't realize there was an Open Source project named Grub until we had the cards printed, domain registered, incorporated and had the plaque on the door. I've fielded a few emails about Grub (the bootloader) and we try to get them pointed in the right direction. BTW, we don't have anything to do with grub.com either.

    About the security problems. We have thought about this and do have a solution proposed (though not implemented). We are planning on scheduling the same URLs out to multiple clients, in much the same way that SETI@Home does. If we get bogus results back from a particular client, then we'll know fairly quickly that someone is pulling a fast one on us. There are a few other things we can do, but it will take time to implement them.
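
    As a rough illustration of the redundancy idea above (it is proposed, not implemented, so this is only a sketch under assumptions): suppose each client reports a content hash for a URL, and the server accepts a hash only when a strict majority of clients agree. The function name check_reports and the hash-per-client report format are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def check_reports(reports: Dict[str, str]) -> Tuple[Optional[str], List[str]]:
    """Compare the hashes several clients reported for the same URL.

    `reports` maps client id -> reported content hash.
    Returns (accepted hash or None, sorted list of suspect client ids).
    Without a strict majority nothing is accepted and every reporting
    client is listed, since the server cannot tell who is right.
    """
    if not reports:
        return None, []
    counts = Counter(reports.values())
    top_hash, votes = counts.most_common(1)[0]
    if votes * 2 <= len(reports):
        return None, sorted(reports)
    suspects = sorted(c for c, h in reports.items() if h != top_hash)
    return top_hash, suspects
```

    A disagreement is not proof of cheating, because a page can legitimately change between two clients' fetches, so a real system would probably need repeated mismatches before distrusting a client.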

    About the database. We really don't know about licensing the thing. Any comments or suggestions are MORE than welcome. We would like to leave it open for anyone to use or query, but charge LARGE corporations (like Google) for accessing LARGE bits of it.

    Give us some ideas on what you would like to see us do and we'll listen.

    Kord
    kord@grub.org

  7. Why bother? by totalslacker · · Score: 4

    Let me get this straight: they want me to run their client on my machines, using up my cpu and network bandwidth so that they can resell that information to other search engines?

    I particularly like this piece from their "Investors" page:

    Third, Grub will begin charging website customers for content control. Content control consists of indexing updated information on a regular basis and controlling link placement in search results. Large sites whose revenue depends on sustained inbound web traffic will be charged based on the amount of data that they submit into Grub's database, and on what placement they get in Grub's search result sets.

    So basically, the sites who are willing to spend the most money will get their URLs pushed up to the top of the list. Relevancy be damned.

    Someone please tell me why I should dedicate my resources to this?

    I think the smartest thing about the whole idea was putting the whole thing under the guise of an "open source", "peer to peer", "distributed", "let's make the world better" search engine. They might have managed to get some real interest if they had done a better job at hiding their financial motives.

  8. Neat idea - but I'm gonna pass... by baptiste · · Score: 5
    So it sounds like they want to provide the info they gather to other existing search engines. Hey - now Grub crawling the internet and sending its data to Google to make Google even better - I'm all over that. Of course, if they send data to Excite, I'll stop running the client. I cannot believe how Excite (and all the affiliated search engines they have now purchased) pretty much requires payment to get added, and if you use the free form, 'the site will be reviewed and there is no assurance it will be added. Process may take 4 to 6 weeks.'

    Thank goodness for Google!

    But again - this brings up the question similar to what happened with CDDB. Here you have internet volunteers providing free CPU power and bandwidth to provide raw material to for profit companies. Now granted - it is slightly different since you can still Google for free :) I'm not that selfish, but obviously there are some companies I'd be HAPPY to play a small part in improving their data set (Google) and others that given recent developments with URL submission and monetary sorting of search results that I wouldn't want to give data to unless they paid for it :)

    Which, now that I read the site more, is their business plan. Read their Investor Page. I get a squirrelly feeling about this. I don't care if the client is open source or not. Why should I use up my precious bandwidth to supply content to a for profit company to sell to other for profit companies? Yes, they give the data away to non profits, but heck - most of them use Google anyway :)

    And of course they are following the lead of the other greedy search sites - adjusting search result order for money, which I can't stand. Google is the one search engine that got it right - sort data by relevance and popularity.

    I'll read more about it - but I think I'm gonna pass on this one - I just don't see the benefit for the volunteers who run this, both on a selfish individual scale and a broader Internet community scale.

  9. They say they WILL provide a searchable index... by baptiste · · Score: 5
    In reading the FAQ over - they state that they will, at some point, have a front end to search the data:

    Q: How much longer will it be before grub.org has a searchable index?
    A: The first phase of the client and server project has just started. We expect that phase to take somewhere between 2-3 months to complete. At that time, we will begin deploying the client to beta testers - at which time the database will begin to grow. A searchable index will become available sometime between now and then that will access the database directly. Update: We expect the database to come online sometime in Jan 2001.

    Looks like they are a little behind schedule on this one.

    And another telling tidbit from their FAQ:

    Q: Why would I want to run this client? At least with SETI, I'm doing something - like looking for aliens.
    A: We like aliens too, but ours is a noble cause if there ever was one - to have a decent index of the Internet free for any individual to use when they need it. The reasons that you'll want to run it will vary, but we think you'll see the advantages to be gained by running our client - especially if you are a system admin, or author of a web site.

    So I guess my main concerns are that a) they could pull a GraceNote and b) they plan on selling top result spots to big companies that may have NOTHING to do with what I'm searching for.

  10. Problems with Google by 6EQUJ5 · · Score: 3

    Preface: Google is by far the best search available for general, random stuff.

    That said, I get frustrated by some of its quirks.

    For one thing, it excludes common small words like "to" "that" "the". Those words can be important when you're searching for a specific quote, say an old song or a line from a movie that you once heard.
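
    A tiny sketch of why dropping small words hurts quote searches. The STOP_WORDS set below is an assumed example list, not Google's actual one, and strip_stop_words is a hypothetical helper standing in for whatever the engine does to a query.

```python
# Assumed example list of "common small words"; the real list is unknown here.
STOP_WORDS = {"a", "an", "and", "be", "not", "of", "or", "that", "the", "to"}


def strip_stop_words(query: str) -> list:
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


if __name__ == "__main__":
    # A famous quote made up entirely of small words leaves nothing to search for.
    print(strip_stop_words("To be or not to be"))
    print(strip_stop_words("the way to go"))
```

    With a list like this, a quote built only from small words is reduced to an empty query, so there is nothing left to match on.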

    Google doesn't seem to understand strict, logical use of parentheses, almost like it's really searching for the characters "(...)", or even the word OR, which contradicts my first complaint!

    Lastly, it's still not clear to me whether a search for "naked cheerleader" gives the same result as "naked cheerleaders". Hence, I tend to use OR and AND (+) a lot in my searches, which as I just said doesn't seem to work very well.