Peer-to-Peer Search Engine Wants You To Help Grub
FuzzyMan45 writes: "Check this out! Grub.org has finally finished writing their internet crawler. For those of you who don't know, grub is a distributed internet crawler that is indexing the internet and working towards an almost realtime index of all the pages combined with a search engine. Think about it, a search with no dead links and no out-of-date pages! Grub is the way to go." It sounds like a cool hybrid of client- and server-side information (a crawler works from your computer, with updated findings sent to a central repository) and all GPL'd. If grub outperforms Google, I'll be happy as a Google-using clam -- but unless they have Google's logic and caching, that is a very tall order. A better search engine is a pleasant dream though.
I have a few 'issues' with this.
1) Lots of people running grub means that they don't have to spend money on bandwidth to index sites; they get nice indexed data for free. Sure, it's nice for them, but it seems like a kind of lame thing to do. I wonder how this will sit with people.
2) What if I modify and run their GPL software to index my own collection of web pages, and return indexes containing pages that aren't really there, tainting their search engine with my 10,000 nonexistent pr0n pages? Do they have a way around that problem?
3) Userland's editthispage.com and weblogs.com stopped allowing certain bots to index them because they were being abusive. Google was accidentally blocked, even though it was being nice. But still, the load of Google crawling all those pages is huge. Grub says they can crawl every page on the internet every day. What kind of load will this cause? The FAQ says your client will only do part of the internet, so you won't have everybody hitting your site daily, but I'm wondering what kind of flaws exist in that logic. Could this result in huge loads on webservers, like the Slashdot effect or a DDoS?
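To make the load question in point 3 concrete, here is a minimal sketch of one way a coordinator could keep many clients from piling onto the same site. It is not Grub's actual scheduler (the FAQ only says each client covers part of the internet). The names `owner` and `fetch_times` and the 30-second delay are assumptions made up for illustration: each host is hashed to exactly one client, and that client spaces out its requests to the host.

```python
import hashlib
from urllib.parse import urlsplit


def owner(url, num_clients):
    """Map a URL's host to one client index, so each host has a single crawler.

    Hypothetical helper: hashing the host (not the full URL) keeps every page
    of a site on the same client.
    """
    host = urlsplit(url).hostname or ""
    h = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") % num_clients


def fetch_times(urls, min_delay_s=30.0):
    """Plan one client's queue so requests to the same host are spaced out.

    Returns a list of (offset_seconds, url) pairs. The delay value is an
    arbitrary assumption, not a number from the Grub FAQ.
    """
    next_slot = {}
    plan = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        t = next_slot.get(host, 0.0)
        plan.append((t, url))
        next_slot[host] = t + min_delay_s
    return plan
```

With a scheme like this, a site would see one crawler at a bounded request rate no matter how many volunteers join. The flaw the question hints at remains: nothing forces a modified client to respect its assignment or the delay.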
The software may be open source, but what's the license on the content of the database? I don't want to put huge amounts of work into creating what will become someone else's proprietary content, a la CDDB...
--
Xenu loves you!
First off, we didn't expect to get Slashdotted so fast. We really weren't ready, but we do appreciate the attention that it has brought. I'd like to address some of the issues that you guys have brought up because they ARE important to us. After all, you are (hopefully) running the client and we do care about what you think.
OK, about the money thing. A few of you are blasting us for trying to make a buck off this idea. Last time I looked, it takes money to pay for servers, bandwidth and programmers! We didn't start this thing with the intention of ripping people off - we did it because current search engine crawling technology is behind the times and we thought we could fix it. Don't fault us for having a revenue model and a desire to build a solid company that feeds us.
A LOT of you guys could benefit from having your web pages continuously indexed. It would help your customers AND it could possibly increase the quality of service you provide. Besides, we aren't proposing to charge you for this service if you help out by running the client - that would defeat the purpose of the whole project.
Don't you already pay for bandwidth that gets used up by Google, or even Excite? What's the difference if we use it instead, and it possibly works better for you in the end?
About the bootloader thing. Sorry about that guys, I didn't realize there was an Open Source project named Grub until we had the cards printed, domain registered, incorporated and had the plaque on the door. I've fielded a few emails about Grub (the bootloader) and we try to get them pointed in the right direction. BTW, we don't have anything to do with grub.com either.
About the security problems. We have thought about this and do have a solution proposed (though not implemented). We are planning on scheduling the same URLs out to multiple clients, in much the same way that SETI@Home does. If we get bogus results back from a particular client, then we'll know fairly quickly that someone is pulling a fast one on us. There are a few other things we can do, but it will take time to implement them.
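The redundancy idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Grub's implementation (which the post says does not exist yet). The replica count of 3, the SHA-256 content digest, the strict-majority rule, and the function names `assign`, `digest`, and `find_suspects` are all hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

REPLICAS = 3  # hypothetical: how many different clients receive each URL


def assign(urls, clients, replicas=REPLICAS):
    """Give every URL to `replicas` distinct clients, round-robin."""
    if replicas > len(clients):
        raise ValueError("need at least as many clients as replicas")
    schedule = defaultdict(list)
    for i, url in enumerate(urls):
        for j in range(replicas):
            schedule[clients[(i + j) % len(clients)]].append(url)
    return schedule


def digest(page_bytes):
    """Fingerprint of the fetched page that a client reports back."""
    return hashlib.sha256(page_bytes).hexdigest()


def find_suspects(reports):
    """Count, per client, how often it disagreed with the majority.

    reports: {url: {client: digest}}.
    """
    suspects = Counter()
    for url, by_client in reports.items():
        counts = Counter(by_client.values())
        top, top_count = counts.most_common(1)[0]
        if top_count * 2 <= len(by_client):
            continue  # no strict majority; the page may simply have changed
        for client, d in by_client.items():
            if d != top:
                suspects[client] += 1
    return suspects
```

A client whose count keeps growing across many URLs is a candidate for being ignored. Pages that legitimately differ per request (ads, timestamps) would need a looser comparison than an exact digest, and a group of colluding clients could still outvote an honest one.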
About the database. We really don't know about licensing the thing. Any comments or suggestions are MORE than welcome. We would like to leave it open for anyone to use or query, but charge LARGE corporations (like Google) for accessing LARGE bits of it.
Give us some ideas on what you would like to see us do and we'll listen.
Kord
kord@grub.org
Thank goodness for Google!
But again - this brings up a question similar to what happened with CDDB. Here you have internet volunteers providing free CPU power and bandwidth to supply raw material to for-profit companies. Now granted - it is slightly different, since you can still Google for free :) I'm not that selfish, but obviously there are some companies whose data set I'd be HAPPY to play a small part in improving (Google), and others that, given recent developments with paid URL submission and monetary sorting of search results, I wouldn't want to give data to unless they paid for it :)
Which, now that I read the site more, is their business plan. Reading their Investor Page, I get a squirrelly feeling about this. I don't care if the client is open source or not. Why should I use up my precious bandwidth to supply content to a for-profit company to sell to other for-profit companies? Yes, they give the data away to non-profits, but heck - most of them use Google anyway :)
And of course they are following the lead of the other greedy search sites - adjusting search result order for money, which I can't stand. Google is the one search engine that got it right - sort data by relevance and popularity.
I'll read more about it - but I think I'm gonna pass on this one - I just don't see the benefit for the volunteers who run this, both on a selfish individual scale and a broader Internet community scale.
--
Top Most Bizarre/Disturbing Error Messages
Q: How much longer will it be before grub.org has a searchable index?
A: The first phase of the client and server project has just started. We expect that phase to take somewhere between 2-3 months to complete. At that time, we will begin deploying the client to beta testers - at which time the database will begin to grow. A searchable index will become available sometime between now and then that will access the database directly. Update: We expect the database to come online sometime in Jan 2001.
Looks like they are a little behind schedule on this one.
And another telling tidbit from their FAQ:
Q: Why would I want to run this client? At least with SETI, I'm doing something - like looking for aliens.
A: We like aliens too, but ours is a noble cause if there ever was one - to have a decent index of the Internet free for any individual to use when they need it. The reasons that you'll want to run it will vary, but we think you'll see the advantages to be gained by running our client - especially if you are a system admin, or author of a web site.
So I guess my main concerns are that a) they could pull a GraceNote, and b) they plan to sell top result spots to big companies that may have NOTHING to do with what I'm searching for.