Peer-to-Peer Search Engine Wants You To Help Grub

Posted by ryuzaki0 on Saturday May 12, 2001 @11:49PM from the beats-seti@home-for-me dept.

FuzzyMan45 writes: "Check this out! Grub.org has finally finished writing their internet crawler. For those of you who don't know, Grub is a distributed internet crawler that is indexing the internet and working towards an almost realtime index of all the pages combined with a search engine. Think about it, a search with no dead links and no out-of-date pages! Grub is the way to go." It sounds like a cool hybrid of client- and server-side information (a crawler works from your computer, with updated findings sent to a central repository) and all GPL'd. If Grub outperforms Google, I'll be happy as a Google-using clam -- but unless they have Google's logic and caching, that is a very tall order. A better search engine is a pleasant dream though.

6 of 64 comments

Free work, bad indexes and huge loads? by Klaruz · 2001-05-12 20:14 · Score: 5

I have a few 'issues' with this.

1) Lots of people running grub means that Grub.org doesn't have to spend money on bandwidth to index sites; they get nicely indexed data for free. Sure, it's nice for them, but it seems like a kind of lame thing to do. I wonder how this will sit with people.

2) What if I modify/run their GPL software to index my collection of web pages, and I return indexes that contain pages that aren't really there, to taint their search engine with my 10,000 non-existent pr0n pages? Do they have a way around that problem?

3) Userland's editthispage.com and weblogs.com stopped allowing certain bots to index them because the bots were being abusive. Google was accidentally blocked, even though its crawler was being nice. But still, the load of Google crawling all those pages is huge. Grub says they can crawl every page on the internet every day. What kind of load will this cause? The FAQ says your client will only do part of the internet, so you won't have everybody hitting your site daily, but I'm wondering what kind of flaws exist in that logic. Could this result in huge loads on webservers, like the Slashdot effect or a DDoS?
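
For illustration only: the thread doesn't say how Grub splits the web between clients, so the following is a guess at one way "your client will only do part of the internet" could work. It hashes each URL's hostname so that every page on a given site is assigned to the same client. The function name `assign_client` and the `num_clients` parameter are made up for this sketch and are not from Grub's code or FAQ.

```python
import hashlib
from urllib.parse import urlsplit


def assign_client(url, num_clients):
    """Pick which client crawls this URL (hypothetical scheme, not Grub's).

    Hashing the hostname rather than the full URL keeps all pages of one
    site on one client, so a site is not hit by every volunteer at once.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return int.from_bytes(digest[:8], "big") % num_clients


if __name__ == "__main__":
    for u in ("http://example.com/a", "http://example.com/b?x=1"):
        print(u, "-> client", assign_client(u, 10))
```

Note that this only limits who hits a site, not how often. A scheme like this would still need a per-host delay and robots.txt handling to avoid the load problem raised above.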
What's the license on the database? by Paul Crowley · 2001-05-12 20:08 · Score: 5

The software may be open source, but what's the license on the content of the database? I don't want to put huge amounts of work into creating what will become someone else's proprietary content, a la CDDB...
--
Xenu loves you!
1. Re:What's the license on the database? by baptiste · 2001-05-12 20:23 · Score: 5
  
  Read their Investor Page - they absolutely plan on charging the search engines to use the data AND to sell top result spots to the highest bidder. Open source or no open source - this is a joke - they won't get a sliver of my bandwidth.
  Here is the section outlining what they plan to do with all this free data 'volunteers' give them:
  The first revenue stream will come from selling URL status information to companies like Google and Altavista. This status information will enable existing crawlers to target the crawls for a particular day, based on the highly up-to-date information contained in our database. These status updates are similar in nature to the service provided by someone like NetMind, in which a change on a website triggers an action. Grub's database will be much vaster by comparison however, enabling it to provide services directly to wholesale search engines.
  Second, Grub will begin selling "wholesale searches" to other search engines and companies. Grub will make strategic alliances with other search engines much in the same way that Google has done with Yahoo and Inktomi has done with Hotbot. Grub will also provide one-shot search results for a large search query, delivering the data in a database format (like XML) instead of a web format.
  Third, Grub will begin charging website customers for content control. Content control consists of indexing updated information on a regular basis and controlling link placement in search results. Large sites whose revenue depends on sustained inbound web traffic will be charged based on the amount of data that they submit into Grub's database, and on what placement they get in Grub's search result sets.
  Fourth, Grub will provide consulting services for companies wanting to set up their own Grub networks. Large corporate intranets could be quickly and efficiently indexed into a central database with the Grub client/server model. Consulting and coding for these proprietary installations is a common model in Open Source oriented businesses like Sendmail, MySQL and Apache.
  Guess they thought we were really that stupid!
  
  --
  Top Most Bizarre/Disturbing Error Messages
A Grub's Response by kordless · 2001-05-13 04:42 · Score: 5

Ok. It's us, the Grub guys here.
First off, we didn't expect to get Slashdotted so fast. We really weren't ready, but we do appreciate the attention that it has brought. I'd like to address some of the issues that you guys have brought up because they ARE important to us. After all, you are (hopefully) running the client and we do care about what you think.
OK, about the money thing. A few of you are blasting us for trying to make a buck off this idea. Last time I looked, it takes money to pay for servers, bandwidth and programmers! We didn't start this thing with the intention of ripping people off - we did it because current search engine crawling technology is behind the times and we thought we could fix it. Don't fault us for having a revenue model and a desire to build a solid company that feeds us.
A LOT of you guys could benefit from having your web pages continuously indexed. It would help your customers AND it possibly could increase the quality of service you provide. Besides, we aren't proposing to charge you for this service if you help out by running the client - that would defeat the purpose of the whole project.
Don't you already pay for bandwidth that gets used up by Google, or even Excite? What's the difference if we use it instead, and it possibly works better for you in the end?
About the bootloader thing. Sorry about that guys, I didn't realize there was an Open Source project named Grub until we had the cards printed, domain registered, incorporated and had the plaque on the door. I've fielded a few emails about Grub (the bootloader) and we try to get them pointed in the right direction. BTW, we don't have anything to do with grub.com either.
About the security problems. We have thought about this and do have a solution proposed (though not implemented). We are planning on scheduling the same URLs out to multiple clients, in much the same way that SETI@Home does. If we get bogus results back from a particular client, then we'll know fairly quickly that someone is pulling a fast one on us. There are a few other things we can do, but it will take time to implement them.
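
A minimal sketch of that cross-checking idea, assuming each client reports a content hash per URL. This is not Grub's implementation (the comment says the scheme is only proposed). The names `cross_check`, `reports` and `quorum` are invented for the example.

```python
from collections import Counter, defaultdict


def cross_check(reports, quorum=2):
    """Compare results for URLs that were scheduled to several clients.

    reports: iterable of (client_id, url, content_hash) tuples.
    Returns (accepted, suspects): accepted maps url -> agreed hash,
    suspects counts how often each client disagreed with the majority.
    """
    by_url = defaultdict(list)
    for client_id, url, content_hash in reports:
        by_url[url].append((client_id, content_hash))

    accepted = {}
    suspects = Counter()
    for url, votes in by_url.items():
        tally = Counter(h for _, h in votes)
        winner, count = tally.most_common(1)[0]
        # Need at least `quorum` matching reports and a strict majority;
        # otherwise the URL should simply be scheduled again.
        if count < quorum or count * 2 <= len(votes):
            continue
        accepted[url] = winner
        for client_id, h in votes:
            if h != winner:
                suspects[client_id] += 1
    return accepted, suspects


if __name__ == "__main__":
    sample = [
        ("a", "http://example.com/", "h1"),
        ("b", "http://example.com/", "h1"),
        ("c", "http://example.com/", "bogus"),
    ]
    print(cross_check(sample))
```

One caveat with any scheme like this: pages that change between two fetches will look like disagreements, so a single mismatch is weak evidence. Repeated mismatches from the same client are the more useful signal.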
About the database. We really don't know about licensing the thing. Any comments or suggestions are MORE than welcome. We would like to leave it open for anyone to use or query, but charge LARGE corporations (like Google) for accessing LARGE bits of it.
Give us some ideas on what you would like to see us do and we'll listen.
Kord
kord@grub.org
Neat idea - but I'm gonna pass... by baptiste · 2001-05-12 20:16 · Score: 5

So it sounds like they want to provide the info they gather to other existing search engines. Hey - now Grub crawling the internet and sending its data to Google to make Google even better - I'm all over that. Of course, if they send data to Excite, I'll stop running the client. I cannot believe how Excite (and all the affiliated search engines they have now purchased) pretty much requires payment to get added, and if you use the free form, 'the site will be reviewed and there is no assurance it will be added. Process may take 4 to 6 weeks.'
Thank goodness for Google!
But again - this brings up a question similar to what happened with CDDB. Here you have internet volunteers providing free CPU power and bandwidth to provide raw material to for-profit companies. Now granted - it is slightly different since you can still Google for free :) I'm not that selfish, but obviously there are some companies whose data set I'd be HAPPY to play a small part in improving (Google), and others that, given recent developments with URL submission and monetary sorting of search results, I wouldn't want to give data to unless they paid for it :)
Which, now that I read the site more, is their business plan. Read their Investor Page. I get a squirrelly feeling about this. I don't care if the client is open source or not. Why should I use up my precious bandwidth to supply content to a for-profit company to sell to other for-profit companies? Yes, they give the data away to non-profits, but heck - most of them use Google anyway :)
And of course they are following the lead of the other greedy search sites - adjusting search result order for money, which I can't stand. Google is the one search engine that got it right - sort data by relevance and popularity.
I'll read more about it - but I think I'm gonna pass on this one - I just don't see the benefit for the volunteers who run this, both on a selfish individual scale and a broader Internet community scale.

--
Top Most Bizarre/Disturbing Error Messages
They say they WILL provide a searchable index... by baptiste · 2001-05-12 20:30 · Score: 5

In reading the FAQ over - they state that they will, at some point, have a front end to search the data:
Q: How much longer will it be before grub.org has a searchable index?
A: The first phase of the client and server project has just started. We expect that phase to take somewhere between 2-3 months to complete. At that time, we will begin deploying the client to beta testers - at which time the database will begin to grow. A searchable index will become available sometime between now and then that will access the database directly. Update: We expect the database to come online sometime in Jan 2001.
Looks like they are a little behind schedule on this one.
And another telling tidbit from their FAQ:
Q: Why would I want to run this client? At least with SETI, I'm doing something - like looking for aliens.
A: We like aliens too, but ours is a noble cause if there ever was one - to have a decent index of the Internet free for any individual to use when they need it. The reasons that you'll want to run it will vary, but we think you'll see the advantages to be gained by running our client - especially if you are a system admin, or author of a web site.
So I guess my main concerns are that a) they could pull a GraceNote and b) they plan on selling top result spots to big companies that may have NOTHING to do with what I'm searching for.

--
Top Most Bizarre/Disturbing Error Messages