Peer-to-Peer Search Engine Wants You To Help Grub
FuzzyMan45 writes: "Check this out! Grub.org has finally finished writing their internet crawler. For those of you who don't know, grub is a distributed internet crawler that is indexing the internet and working towards an almost realtime index of all the pages combined with a search engine. Think about it, a search with no dead links and no out-of-date pages! Grub is the way to go." It sounds like a cool hybrid of client- and server-side information (a crawler works from your computer, with updated findings sent to a central repository) and all GPL'd. If grub outperforms Google, I'll be happy as a google-using clam -- but unless they have google's logic and caching, that is a very tall order. A better search engine is a pleasant dream though.
I have a few 'issues' with this.
1) Lots of people running grub means that they don't have to spend money on bandwidth to index sites; they get nice indexed data for free. Sure, it's nice for them, but it seems like a kind of lame thing to do. I wonder how this will sit with people.
2) What if I modify and run their GPL software to index my collection of web pages, and return indexes containing pages that aren't really there, tainting their search engine with my 10,000 non-existent pr0n pages? Do they have a way around that problem?
3) Userland's editthispage.com and weblogs.com stopped allowing certain bots to index them because they were being abusive. Google was accidentally blocked, even though it was being nice. But still, the load of Google crawling all those pages is huge. Grub says they can crawl every page on the internet every day. What kind of load will this cause? The FAQ says your client will only do part of the internet, so you won't have everybody hitting your site daily, but I'm wondering what kind of flaws exist in that logic. Could this result in huge loads on webservers, like the Slashdot effect or a DDoS?
The software may be open source, but what's the license on the content of the database? I don't want to put huge amounts of work into creating what will become someone else's proprietary content, a la CDDB...
--
Xenu loves you!
Wow, I'm surprised to see Grub on Slashdot this morning. The first client beta was _just_ released last night!
Anyways, I know the Grub guys and was there when Grub was just an idea being discussed over coffee. Although I can't speak with 100% authority, I feel that I can give some insight and perhaps some clarification to a few concerns/questions floating around. It appears that Kord and Iggy may have left a bit to be desired on the FAQ :)
From my understanding, the initial desired audience is the ISP admin. As an ISP, you'd be able to have your grub client index and crawl sites that you host. In turn, those sites will be available on whatever search engine Grub is supplying data to. Those running an ISP or hosting websites know how often clients request that you make sure they get crawled and listed in a search engine; this is a pretty nice value-add for your ISP service then. In this case, it's a win-win scenario. Grub gets up to date information on sites and the ISP gets to provide a much requested service to its customers.
Later, I believe the plan to encourage individuals on broadband connections is to provide rewards for a certain number of sites crawled and also prizes for top crawlers.
There are some concerns about the licensing of the database. It's my understanding that Grub is taking a commercial-pay/non-commercial-free approach. That means, for instance, that if you started an open source search engine like aspseek.org, you could use the Grub data for free. But if you're Google or Inktomi, you'll have to pay for access.
The data will not be free to everyone. There's just no way anybody can provide the overhead costs for that kind of service free to everyone. I think charging only for commercial use is the best option in this case. Also, keep in mind that the server will eventually be released as well. This means that individuals could run their own grub servers and stockpile their own data.
As far as the few statements regarding the stock options payment, I'm pretty sure all of the in house full-time developers get paid real money. However, Kord is really determined to make sure that those people kicking in 5-10hrs a week in their spare time get to share in some of the success when Grub hits it big. Once again, that's a win-win situation. The contributors get to work on a promising, useful OS project and if the world comes knocking for this better mouse trap, the contributors also get a bit of cash for their troubles.
I'd encourage those that have concerns or are curious about the project to go ahead and download the client now while it's in such early development. Take a look at the code. Email Kord and Iggy and tell them what you think. Even email them if you think Grub is a stupid idea, but tell them why. I don't think wanting to make a successful commercial P2P application is a bad idea in and of itself.
First off, we didn't expect to get Slashdotted so fast. We really weren't ready, but we do appreciate the attention that it has brought. I'd like to address some of the issues that you guys have brought up because they ARE important to us. After all, you are (hopefully) running the client and we do care about what you think.
OK, about the money thing. A few of you are blasting us for trying to make a buck off this idea. Last time I looked, it takes money to pay for servers, bandwidth and programmers! We didn't start this thing with the intention of ripping people off - we did it because current search engine crawling technology is behind the times and we thought we could fix it. Don't fault us for having a revenue model and a desire to build a solid company that feeds us.
A LOT of you guys could benefit from having your web pages continuously indexed. It would help your customers AND it could possibly increase the quality of service you provide. Besides, we aren't proposing to charge you for this service if you help out by running the client - that would defeat the purpose of the whole project.
Don't you already pay for bandwidth that gets used up by Google, or even Excite? What's the difference if we use it instead, and it possibly works better for you in the end?
About the bootloader thing. Sorry about that guys, I didn't realize there was an Open Source project named Grub until we had the cards printed, domain registered, incorporated and had the plaque on the door. I've fielded a few emails about Grub (the bootloader) and we try to get them pointed in the right direction. BTW, we don't have anything to do with grub.com either.
About the security problems. We have thought about this and do have a solution proposed (though not implemented). We are planning on scheduling the same URLs out to multiple clients, in much the same way that SETI@Home does. If we get bogus results back from a particular client, then we'll know fairly quickly that someone is pulling a fast one on us. There are a few other things we can do, but it will take time to implement them.
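For readers wondering what that kind of redundant scheduling might look like, here is a minimal sketch. It is not Grub's code (the post says the scheme is not implemented yet); the function names, the round-robin assignment, the replica count of 3, and the use of SHA-256 digests to compare results are all assumptions made for illustration.

```python
from collections import Counter
import hashlib


def assign_urls(urls, clients, replicas=3):
    """Hand every URL to `replicas` distinct clients, round-robin.

    Hypothetical helper: a real scheduler would also weigh client
    bandwidth, locality, and how recently a page was crawled.
    """
    if replicas > len(clients):
        raise ValueError("need at least as many clients as replicas")
    assignments = {}
    for i, url in enumerate(urls):
        # Consecutive offsets guarantee the clients for one URL are distinct.
        assignments[url] = [clients[(i + r) % len(clients)] for r in range(replicas)]
    return assignments


def digest(content: bytes) -> str:
    """Fingerprint of a fetched page, so results can be compared cheaply."""
    return hashlib.sha256(content).hexdigest()


def find_suspects(reports):
    """Count, per client, how often it disagreed with a strict majority.

    `reports` maps url -> {client: digest}. URLs without a strict
    majority are skipped, since nobody can be singled out there.
    """
    suspects = Counter()
    for by_client in reports.values():
        counts = Counter(by_client.values())
        top_digest, top_count = counts.most_common(1)[0]
        if top_count * 2 <= len(by_client):
            continue  # no strict majority for this URL
        for client, d in by_client.items():
            if d != top_digest:
                suspects[client] += 1
    return suspects
```

A client that keeps showing up in the suspect counts is probably sending bogus results. One caveat of this sketch: pages that legitimately change between fetches (ads, timestamps, dynamic content) would also produce mismatches, so a real system would need a fuzzier comparison than an exact digest.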
About the database. We really don't know about licensing the thing. Any comments or suggestions are MORE than welcome. We would like to leave it open for anyone to use or query, but charge LARGE corporations (like Google) for accessing LARGE bits of it.
Give us some ideas on what you would like to see us do and we'll listen.
Kord
kord@grub.org
Let me get this straight: they want me to run their client on my machines, using up my cpu and network bandwidth so that they can resell that information to other search engines?
I particularly like this piece from their "Investors" page:
Third, Grub will begin charging website customers for content control. Content control consists of indexing updated information on a regular basis and controlling link placement in search results. Large sites who's revenue depends on sustained inbound web traffic will be charged based on the amount of data that they submit into Grub's database, and on what placement they get in Grub's search result sets.
So basically, the sites that are willing to spend the most money will get their URLs pushed up to the top of the list. Relevancy be damned.
Someone please tell me why I should dedicate my resources to this?
I think the smartest thing about the whole idea was putting the whole thing under the guise of an "open source", "peer to peer", "distributed", "let's make the world better" search engine. They might have managed to get some real interest if they had done a better job at hiding their financial motives.
Thank goodness for Google!
But again - this brings up the question similar to what happened with CDDB. Here you have internet volunteers providing free CPU power and bandwidth to provide raw material to for profit companies. Now granted - it is slightly different since you can still Google for free :) I'm not that selfish, but obviously there are some companies I'd be HAPPY to play a small part in improving their data set (Google) and others that given recent developments with URL submission and monetary sorting of search results that I wouldn't want to give data to unless they paid for it :)
Which, now that I've read the site more, is their business plan. Reading their Investors page, I get a squirrelly feeling about this. I don't care if the client is open source or not. Why should I use up my precious bandwidth to supply content to a for-profit company to sell to other for-profit companies? Yes, they give the data away to non-profits, but heck - most of them use Google anyway :)
And of course they are following the lead of the other greedy search sites - adjusting search result order for money, which I can't stand. Google is the one search engine that got it right - sort data by relevance and popularity.
I'll read more about it - but I think I'm gonna pass on this one - I just don't see the benefit for the volunteers who run this, both on a selfish individual scale and a broader Internet community scale.
--
Top Most Bizarre/Disturbing Error Messages
Q: How much longer will it be before grub.org has a searchable index?
A: The first phase of the client and server project has just started. We expect that phase to take somewhere between 2-3 months to complete. At that time, we will begin deploying to the client to beta testers - at which time the database will begin to grow. A searchable index will become available sometime between now and then that will access the database directly. Update: We expect the database to come online sometime in Jan 2001.
Looks like they are a little behind schedule on this one.
And another telling tidbit from their FAQ:
Q: Why would I want to run this client? At least with SETI, I'm doing something - like looking for aliens.
A: We like aliens too, but ours is noble cause if there ever was one - to have a decent index of the Internet free for any individual to use when they need it. The reasons that you'll want to run it will vary, but we think you'll see the advantages to be gained by running our client - especially if you are a system admin, or author of a web site.
So I guess my main concerns are that a) they could pull a GraceNote and b) they're selling top result spots to big companies that may have NOTHING to do with what I'm searching for.
--
Top Most Bizarre/Disturbing Error Messages