Building a Bigger Search Engine
skreuzer writes "Wired is running a story about a distributed web crawler called Grub. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed. Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day."
Also the grub engine crawls everything, including adult content and other questionable content. They have a setting to turn it off, but it does not block it. With the current questioning of international law relating to accessing illegal websites this could have major consequences for the average user.
So for the time being I have stopped using the grub client until some serious questions are answered. It's an interesting concept and if it was being used in more of an academic setting it could be interesting. However I believe that search engines like Google are doing pretty good themselves.
Go calculate something
LookSmart hopes to tap the altruistic nature of many Internet users.
That unfortunately seems like a naively optimistic hope. While the
vast majority of people may be altruistic, it only takes a few
unscrupulous individuals to completely undermine a fair result.
It's interesting that this idea is an extension to Google's model in
many ways. Essentially Google is able to index so much of the
Internet by having 50,000+ servers. I don't think that's what makes
Google such a useful search tool; rather, I think it's the accuracy and
relevancy. If my search results started getting polluted with bogus
hits, I would stop using it almost immediately.
Unfortunately, by letting people run the client on their machine and
having it send the results back to the server, I think spoofed
results are inevitable. I don't think it will be possible to
safeguard the results either. It will be interesting to see how well
this project survives *when* people start spoofing results. It's
been a problem for SETI@home, and it's something that undermined some
people's faith in the project as a whole. If the spoofed results are
more widespread and have a larger impact as they would in a system
like this, it may ultimately prove fatal to the project.
One factor that has been absolutely critical to Google's success has
been their ability to remain resistant to spoofing attempts. It's
still a question mark how well grub will perform in that context.
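One common mitigation for spoofed results in volunteer computing is redundancy: hand the same work unit to several clients and accept a result only when enough of them agree. The sketch below is a minimal illustration of that idea applied to crawl results. It is not how Grub works; the function names, the quorum value, and the use of SHA-256 digests are all assumptions made for the example.

```python
import hashlib
from collections import Counter
from typing import Dict, Optional


def digest(content: bytes) -> str:
    """Fingerprint of a fetched page (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(content).hexdigest()


def accept_result(reports: Dict[str, str], quorum: int = 2) -> Optional[str]:
    """Pick the digest to trust for one URL.

    `reports` maps a (hypothetical) client id to the digest that client
    reported. A digest is accepted only if at least `quorum` clients agree
    on it and no other digest has the same number of votes.
    """
    if not reports:
        return None
    ranked = Counter(reports.values()).most_common(2)
    top_digest, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if top_count >= quorum and not tied:
        return top_digest
    return None  # no agreement: reassign the URL to other clients
```

Redundancy multiplies the bandwidth spent per URL, and it does nothing against a group of colluding clients that is large enough to reach the quorum, so it narrows the problem rather than solving it.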
Doug Tolton
"The destruction of a value which is, will not bring value to that which isn't." -John Galt
I bet one of the big successes in Folding and distributed.net is that many people run the clients on work boxes, knowing that there's little actual overhead incurred to their work. How different that is for a URL sucker.
I wonder what broadband ISPs think of Grub.
Grub searches the web
... and I have a suggestion. Has anyone written a program called "E-Coli" yet? No? I can just imagine my mom ...
Sniffing out all the good porn
Not just bootloader
I love being a Slashdot subscriber - it gives me fifteen minutes to figure out a good joke before anyone has a chance to post!
Seriously though, shouldn't they change the name? "GRUB" is already a bootloader. They should change the name
"Agh! You have E-Coli on your computer!"
Cyde Weys Musings - Scrutinizing the inscrutable
until someone figures out a way to compromise their local client's results and "escalate" their fave URLs.
It still sounds like a really cool idea though.
Don't think that a small group of dedicated individuals can't change the world. It's the only thing that ever has.
1. Tech-savvy people will install this.
2. Tech-savvy people tend to be loners.
3. Loners most often search for porn.
C1. Tech-savvy people search for porn.
4. Items searched for most often reach the top of the list.
5. Porn is searched for often by tech-savvy people.
C2. Porn will be easier to find with this new search engine.
Count me in!
Oh wait, you mean it's not related to GRUB, the Linux/etc boot loader. *slaps forehead* But I guess this solves everything - we can call Phoenix "Grub" too, and just treat it as the generic name to call everything we're having problems thinking up a name for...
You are not alone. This is not normal. None of this is normal.
So if I choose to run this client, how do I know that it won't accidentally index content that is only accessible from behind my firewall?
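One way a crawler client could guard against this is to refuse any URL whose host is obviously internal before fetching it. The following is a minimal sketch of such a filter, not a description of what the Grub client does; `is_internal_url` is a hypothetical helper, and the rule that single-label hostnames count as internal is an assumption.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_url(url: str) -> bool:
    """Return True if the URL's host looks like it sits behind the firewall."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # no usable host: refuse rather than guess
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Treat localhost and single-label names
        # (e.g. "intranet") as internal; this is a heuristic assumption.
        return host == "localhost" or "." not in host
    return ip.is_private or ip.is_loopback or ip.is_link_local
```

This only inspects the URL itself. A public-looking DNS name that resolves to a private address would slip through unless the client also checked the resolved address, so the question raised above is only partly answered by a filter like this.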
Couldn't Google do this anyway with the Google Toolbar? 'Cause with the advanced features version it tracks every page you visit. If they offered some incentive to install the toolbar, Google could just beat them at this game. I actually use the Google Toolbar already by choice (it makes my web searching more productive) every day. All they have to do is get lots of people using it, and wouldn't that work just as well or better?
...those pigeons can't be beat.
What's the difference between my machine indexing them and the university students recently being hauled into court for indexing open shares? Why would I not be held liable for contributory copyright infringement?
No thanks.
Here is what slashdotters were saying about grub almost 2 years ago.
Raisinettes are my raison d'etre
From the readme in the Linux version - no idea what the other readmes might say. However, it appears that they are sensitive to the fact that bootloader grub pre-existed their program. They are requesting catchy names. Here is an excerpt:
Notice
======
The main executable has been renamed to "grubclient" out of respect for the GNU Grub bootloader, who's executable is named "grub". They were out first, so we decided to pick another name. If you have a catchy suggestion for a new name, please let us know.
What changed under Obama? Nothing Good
I prefer grid.org to grub.org. There the cycles are going to cancer or smallpox research. Currently over 2 million machines are participating.
Altruism has its place, but since I'm more likely to die of cancer than of not having the complete www indexed I think I'll be selfish and work towards a cure for something that may affect me.
I expected some way to search... this looks more like a project to index the web rather than make the results available for public use via web interface. Did it strike anyone else odd that there was no web form on the home page with which to search?!
It seems like a good concept, but the availability of the information collected needs to be accessible without installing the client. I'm not game to install distributed computing apps without some freely available benefit. The "for the good of the world" motivation went out the window for me about a day after my first Seti At Home experience. (But now BitTorrent, there was appreciable benefit. I had RedHat 9 isos within 8 hours of their initial release!)
There is no need to use a SlashDot sig for SEO...
You have to be kidding or working for Microsoft, or both! Have you ever searched for Linux on MSN? Try it - here.
Notice the third result? "Learn about the Microsoft alternatives and how to move to them from open source products." I shit you not! I don't think Google would ever use this kind of dirty, underhanded trick. Great "hand-picking", mate.
We're only gonna die from our own arrogance, that's why we might as well take our time...
Grub isn't a heavy CPU user. Right now, on my Athlon (~2400+), it's using between 0-2% of the CPU at any given time. Grub is mainly interested in your excess bandwidth.
Google doesn't index user sigs, so stop trying to "Google Bomb" with them.
Isn't Looksmart/Sprinks a big pay-per-listing deal? The looksmart logo in the upper right corner was enough to make me just close that page right away without any second thought.
Morphing Software
But it still kind of irks me that people think that a computerized 'dumb' search result could compete with a human rating system that filters spam, porn, and other garbage results. Google should hire some REAL PEOPLE who can do some sort of categorized, intelligent directory so we can have QUALITY at the beginning of a search result. Some sort of HUMAN RATING system is needed to sort. The software is not up to par.
(Oh, I can't remember. Have I MetaModerated Recently?)
sulli
RTFJ.
Yea. If you help Grub, Grub gives your web site a preferential listing. Building the biggest search engine, sure. Building good search results, not so sure.
- makes me feel warm and fuzzy about my altruism
- can run in the background on a Unix box
- is open-source (so I don't have to run someone's closed-source app on my box and trust their security through obscurity)
Well, #1 rules out Grub, #2 rules out Folding@Home, and #3 rules out both SETI@Home and Folding@Home. So what worthy causes are out there?
Find free books.
If this thing gets too popular without proper throttling, they could cause real havoc.
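For what "proper throttling" might look like, here is a minimal sketch of a per-host politeness delay: no host is scheduled more often than once every `min_interval` seconds. The class name, the 30-second default, and the injectable clock are all assumptions for illustration. In a distributed crawler this logic would presumably have to live in the central scheduler, since a single client cannot see what the other clients are fetching.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Spaces out requests to the same host (hypothetical helper)."""

    def __init__(self, min_interval: float = 30.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so the logic can be tested
        self.next_allowed = {}      # host -> earliest time of the next fetch

    def wait_time(self, url: str) -> float:
        """Reserve the next slot for the URL's host.

        Returns how many seconds the caller should wait before fetching.
        """
        host = urlsplit(url).hostname or ""
        now = self.clock()
        ready_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready_at + self.min_interval
        return ready_at - now
```

A caller would sleep for the returned number of seconds (or queue the URL until then) before issuing the request.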
Copyright Violation:"theft, piracy"::Anti-Trust Violation:"thermonuclear price terrorism"<-Overly dramatic language.
Alright, I have 3 major problems with this...
1) How different is this than the Princeton kiddies' system? I don't know about you, but I don't want a 95 billion dollar bill arriving in the mail...
2) What if your local (cache?) contains a few links to kiddie porn? Not your fault, right? The software does its own thing, which you cannot control, BUT what will the FBI think? The FBI, Scotland Yard, and the RCMP are currently heavily investigating kiddie porn cases (good work IMHO), but what if you're the unlucky sap who gets stuck with a few sketchy URLs? Or worse yet, what if this GRUB keeps a cache of the website like Google does? Then what?
3) What about material that is legal locally, but illegal somewhere else... e.g. Nazi stuff in Germany, Falun Gong in China, etc... The last thing I want is to be refused a travel visa cuz my PC has an illegal cache...
Good idea in principle, but with sketchy content on the web, I don't think I will be the one keeping track of it all. If there is a way to filter out the questionable stuff then maybe, but since the purpose is to be as inclusive as possible, it seems incompatible.
_CMK
Bad spellers of the world untie!
You can always use the Google API for more than 2,000 searches per day if you pay licensing fees for it. That's just Google ensuring that it can remain a viable company. Little text-box advertisements just don't cut it in this day and age where blatant pop-ups and colorful banner ads don't even have much turn-around. That's not the point though.
The point is that I wouldn't look anytime soon for LookSmart to allow unlimited usage of this API. It's too large of a project for them to just let people use it. It's simple economics. They may not be investing the computing resources into this project's web spidering software, but it's still using TONS of resources to keep this data catalogued and readily accessible.
It is too easy to send corrupted information into the database. They have *no choice* but to trust the clients. Sure they could run spot checks on the results, but they would be very partial and it would be easy enough to fake responses for those as well.
So the more popular it gets, the more incentive people will have to promote their sites by feeding it fake index information. If this magically got to be very popular, within weeks search results would become meaningless and it would drop back into obscurity. The more likely result would be that it will never become popular in the first place.
Besides, who wants to donate his CPU and bandwidth resources for a commercial company, anyway?
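To make the spot-check idea concrete, here is a minimal sketch of what a server-side check could look like: re-fetch a random sample of submitted URLs and count, per client, how often the reported digest disagrees. Everything here is hypothetical (the `Submission` layout, the `refetch` callable, the 5% default sample rate); nothing in the thread says LookSmart does it this way.

```python
import hashlib
import random
from typing import Callable, Dict, Iterable, Optional, Tuple

# (client_id, url, reported_sha256_hex) -- an assumed record layout
Submission = Tuple[str, str, str]


def spot_check(
    submissions: Iterable[Submission],
    refetch: Callable[[str], bytes],
    sample_rate: float = 0.05,
    rng: Optional[random.Random] = None,
) -> Dict[str, int]:
    """Re-fetch a random sample of submissions and count mismatches per client."""
    rng = rng or random.Random()
    mismatches: Dict[str, int] = {}
    for client_id, url, reported in submissions:
        if rng.random() >= sample_rate:
            continue  # not sampled this time
        actual = hashlib.sha256(refetch(url)).hexdigest()
        if actual != reported:
            mismatches[client_id] = mismatches.get(client_id, 0) + 1
    return mismatches
```

As noted above, the coverage is only partial: at a 5% sample rate most fake submissions go unexamined. A mismatch is also not proof of cheating, since dynamic pages can legitimately change between the client's fetch and the server's re-fetch.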
Here it is on mine requesting it:
64.241.242.18 - - [18/Mar/2003:17:25:30 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
64.241.242.18 - - [19/Mar/2003:19:41:05 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
64.241.243.81 - - [30/Mar/2003:22:10:41 -0700] "GET /robots.txt HTTP/1.1" 200 222 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
64.241.243.81 - - [01/Apr/2003:23:11:21 -0700] "GET /robots.txt HTTP/1.1" 200 223 "-" "Mozilla/4.0 (compatible; grub-client-1.07; Crawl your own stuff with http://grub.org)"
Notice those are LookSmart-owned IPs and not just normal user crawlers. They seem to centrally crawl for robots.txt. They do know, however, that they need to crawl for robots.txt more often.
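For anyone who wants to look for the same visits in their own logs, here is a minimal sketch that pulls Grub requests out of an access log. It assumes the Apache combined log format with one request per line, which is what the excerpt above appears to be; the regex and the `grub_requests` helper are illustrative, not part of any Grub tooling.

```python
import re

# Apache "combined" log format, one request per line (assumed)
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def grub_requests(lines):
    """Return (ip, timestamp, path) for every request whose user agent
    contains "grub-client"."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and "grub-client" in m.group("agent"):
            hits.append((m.group("ip"), m.group("time"), m.group("path")))
    return hits
```

Usage would be along the lines of `grub_requests(open("access.log"))`, where the file name is whatever your server writes to. Grouping the returned IPs then shows whether the robots.txt fetches come from a handful of central addresses or from individual volunteers' machines.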