Building a Bigger Search Engine
skreuzer writes "Wired is running a story about a distributed web crawler called Grub. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed. Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day."
Also the grub engine crawls everything, including adult content and other questionable content. They have a setting to turn it off, but it does not block it. With the current questioning of international law relating to accessing illegal websites this could have major consequences for the average user.
So for the time being I have stopped using the grub client until some serious questions are answered. It's an interesting concept and if it was being used in more of an academic setting it could be interesting. However I believe that search engines like Google are doing pretty good themselves.
Go calculate something
LookSmart hopes to tap the altruistic nature of many Internet users.
That unfortunately seems like a naively optimistic hope. While the
vast majority of people may be altruistic, it only takes a few
unscrupulous individuals to completely undermine a fair result.
It's interesting that this idea is an extension to Google's model in
many ways. Essentially Google is able to index so much of the
internet by having 50,000+ servers. I don't think that's what makes
Google such a useful search tool, rather I think it's its accuracy and
relevancy. If my search results started getting polluted with bogus
hits, I would stop using it almost immediately.
Unfortunately, by letting people run the client on their machine and
having it send the results back to the server, I think spoofed
results are inevitable. I don't think it will be possible to
safeguard the results either, it will be interesting to see how well
this project survives *when* people start spoofing results. It's
been a problem for SETI@home, and it's something that undermined some
people's faith in the project as a whole. If the spoofed results are
more widespread and have a larger impact as they would in a system
like this, it may ultimately prove fatal to the project.
One factor that has been absolutely critical to Google's success has
been their ability to remain resistant to spoofing attempts. It's
still a question mark how well grub will perform in that context.
Doug Tolton
"The destruction of a value which is, will not bring value to that which isn't." -John Galt
I bet one of the big successes in Folding and distributed.net is that many people run the clients on work boxes, knowing that there's little actual overhead incurred to their work. How different that is for a URL sucker.
I wonder what broadband ISPs think of Grub.
Grub searches the web
... and I have a suggestion. Has anyone written a program called "E-Coli" yet? No? I can just imagine my mom ...
Sniffing out all the good porn
Not just bootloader
I love being a Slashdot subscriber - it gives me fifteen minutes to figure out a good joke before anyone has a chance to post!
Seriously though, shouldn't they change the name? "GRUB" is already a bootloader. They should change the name
"Agh! You have E-Coli on your computer!"
Cyde Weys Musings - Scrutinizing the inscrutable
What are sensible business plans for this type of endeavour?
Should we expect to see many commercial efforts focussed on providing similar "crawl" or "index" capabilities, but each honed to a specific niche market? A scientific crawler? A retail links database?
One could argue that similar efforts targeting music resources have resorted to less automated techniques, i.e. human-driven sharing.
Thoughts?
until someone figures out a way to compromise their local client's results and "escalate" their fave URLs.
It still sounds like a really cool idea though.
Don't think that a small group of dedicated individuals can't change the world. It's the only thing that ever has.
Hmm searchengine eh? Why don't you call it grab ?
Robert
1. Tech-savvy people will install this.
2. Tech-savvy people tend to be loners.
3. Loners most often search for porn.
C1. Tech-savvy people search for porn.
4. Items searched for most often reach the top of the list.
5. Porn is searched for often by tech-savvy people.
C2. Porn will be easier to find with this new search engine.
Count me in!
This is going to challenge Google's search, which will entice them to cut loose some of those really cool google labs concepts. Froogle, Google News, and all of the other cool things that they are working on are great services and are going to be the focus of innovation over at Google.
Also, Looksmart needs to develop and release an API for this system. You can only use the google api for 2,000 searches per day. If they allowed unlimited usage, it would get a lot of developer backing.
I don't keep a lid on my coffee so when I walk around I look busy -me
grub has been crawling my site for weeks if not months now. How is this news? Because someone at Wired wrote about it? Geesh.
I want to delete my account but Slashdot doesn't allow it.
Oh wait, you mean it's not related to GRUB, the Linux/etc boot loader. *slaps forehead* But I guess this solves everything - we can call Phoenix "Grub" too, and just treat it as the generic name to call everything we're having problems thinking up a name for...
You are not alone. This is not normal. None of this is normal.
So if I choose to run this client, how do I know that it won't accidentally index content that is only accessible from behind my firewall?
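One guard a crawler client could apply here, sketched under the assumption that it resolves each hostname to an address before fetching: skip anything that is private, loopback, link-local or reserved. The function name `is_publicly_routable` is made up for this example, and nothing here describes what the Grub client actually does.

```python
import ipaddress

def is_publicly_routable(address: str) -> bool:
    """Return True only if the address is not private, loopback, link-local or reserved."""
    ip = ipaddress.ip_address(address)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved)

# Typical intranet addresses are rejected, a public one is accepted.
for addr in ["192.168.1.10", "10.0.0.5", "127.0.0.1", "8.8.8.8"]:
    print(addr, is_publicly_routable(addr))
```

This only catches hosts on private address ranges. An intranet server with a public address that is merely shielded by firewall rules would still slip through, so the question stands for that case.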
Couldn't google do this anyways with the google toolbar? Cause with the advanced features version it tracks every page you visit. If they offered some incentive to install the toolbar, google could just beat them at this game. I actually use the google toolbar already by choice (it makes my web searching more productive) everyday, all they have to do is get lots of people using it and wouldn't that work just as well or better?
Assuming they had enough people, they could always crawl twice to see if the submitted stuff matches.
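A minimal sketch of that "crawl it twice and compare" idea, assuming the server hands the same URL to several clients and only trusts a result that enough of them agree on. The names `accept_by_agreement` and `quorum` are invented for illustration; this is not Grub's actual protocol.

```python
import hashlib
from collections import Counter

def digest(page: bytes) -> str:
    """Fingerprint a page body so submissions can be compared cheaply."""
    return hashlib.sha256(page).hexdigest()

def accept_by_agreement(submissions: dict, quorum: int = 2):
    """submissions maps a client id to the page body it reported for one URL.
    Returns the digest that at least `quorum` clients agree on, or None."""
    if not submissions:
        return None
    counts = Counter(digest(body) for body in submissions.values())
    winner, votes = counts.most_common(1)[0]
    return winner if votes >= quorum else None

reports = {
    "client-a": b"<html>real page</html>",
    "client-b": b"<html>real page</html>",
    "client-c": b"<html>spoofed page</html>",
}
print(accept_by_agreement(reports) == digest(b"<html>real page</html>"))
```

Exact matching like this would misfire on pages that legitimately differ between fetches (rotating ads, timestamps), and it does nothing against several colluding clients, so treat it as the simplest possible version of the idea.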
Well, Google's been targeting straight-up, no-frills search for a while now, and manages to sell this very successfully to its advertisers.
Of course, once a context-specific search engine wins the majority share of its targeted market (as Google has done for the entire general market), then it can branch out and offer enhanced "pay" services, or usage statistics.
Some markets are more cash-strong than others, for example the construction industry, the entertainment industry, or banking. The dot-com boom saw the failure of many efforts to bring internet-and-technology to the construction industry. In the banking industry many such efforts succeeded. In the entertainment industry, well, you tell me?
...rather a crawl with a distributed component.
They use the screensaver grub clients to check if a web page has been modified since the last time it was crawled (by the centralized crawl done by Looksmart). They probably use some smart MD5 checksum of the pages and send that with the urls to be crawled to the clients. If the checksum of what the grub client crawled doesn't match then the centralized crawl is instructed to re-fetch that url.
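The checksum comparison guessed at above could look something like this. It is only a sketch of the idea as described in this comment; the function names are hypothetical and the real client may work quite differently.

```python
import hashlib

def page_checksum(body: bytes) -> str:
    """MD5 fingerprint of a page body, used only for change detection."""
    return hashlib.md5(body).hexdigest()

def needs_refetch(stored_checksum: str, fresh_body: bytes) -> bool:
    """True if the page fetched by the client differs from the stored copy."""
    return page_checksum(fresh_body) != stored_checksum

# The central crawl stored this checksum on its last visit.
stored = page_checksum(b"<html>old content</html>")

print(needs_refetch(stored, b"<html>old content</html>"))
print(needs_refetch(stored, b"<html>new content</html>"))
```

The appeal is that only a short checksum travels between client and server per URL, and the whole page is transferred again only when something changed.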
They go this route because the If-Modified-Since HTTP request header is not supported by many webservers (and even if it is, you can't really trust it). This is especially true for dynamically generated web pages. I.e., if If-Modified-Since worked reliably, then it would be a simple operation to check whether a previously crawled page has changed. Since that's not the case, they are outsourcing the expensive refetching of whole pages.
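For comparison, this is roughly what the cheap check would be if servers could be trusted with it: a conditional request carrying the time of the last crawl. The sketch only builds the request and does not send it; `conditional_request` is a made-up helper name and the URL is a placeholder.

```python
from email.utils import formatdate
from urllib.request import Request

def conditional_request(url: str, last_crawl_epoch: float) -> Request:
    """Build a GET request that asks for the page only if it changed since the last crawl."""
    stamp = formatdate(last_crawl_epoch, usegmt=True)  # HTTP-style date, always GMT
    return Request(url, headers={"If-Modified-Since": stamp})

req = conditional_request("http://example.com/", 0)
# urllib normalizes header names, hence the lowercase lookup key.
print(req.get_header("If-modified-since"))
```

A server that honors the header answers 304 Not Modified with no body when nothing changed. A dynamically generated page typically has no meaningful modification time to compare against, which is the weakness described above.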
It will be interesting to see how this pans out. I think they could run into trouble with ISPs if this really takes off (because bandwidth consumption per user would increase and make flatrate deals less profitable for some ISPs).
If my search engine client ever became ubiquitous enough, I wonder how good a search index you could build, not by actively crawling, but passively harvesting all the efforts of your huge collective of clients. Sounds way too scary to want on my machine.
It's kind of funny and a bit ironic that search engines are generally used to search information from a central repository and Grub uses a distributed network to index pages. It's almost like having a distributed google cache (that's updated more frequently). Perhaps a better idea would be to invent a crawling daemon that runs on each server with a standard protocol that reports to a central server the relevance of search terms (hey it's DNS for search terms!!) - too bad it would be heavily abused (mostly by Buy Now, Free Money and Pron avenues I suppose).
Ok now tell me that it's already been done, 'cause I'm pretty sure it has (and probably by Microsoft for ad money).
Well it's an idea that might be more efficient and updatable than Grub anyway.
Who is this "Poster" guy and why does he own all of my comments?!?
...those pigeons can't be beat.
Looksmart is only using Grub to save on their bandwidth. Essentially Grub just compresses web pages before sending them to Looksmart's indexer thus reducing the bandwidth they have to pay for by a factor of 5 or so. The same thing could be accomplished through a proxy which compresses web pages. Eventually, once the HTTP mime standard for requesting compressed web pages is better supported by web servers, Grub will not be necessary.
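To get a feel for the compression claim, here is a small sketch that measures how much gzip shrinks a page body. The sample page is invented and far more repetitive than real HTML, so its ratio says nothing about the "factor of 5" figure, which is this commenter's estimate.

```python
import gzip

def compression_ratio(body: bytes) -> float:
    """Original size divided by gzip-compressed size."""
    return len(body) / len(gzip.compress(body))

# Artificial sample: real pages are less repetitive and compress less.
sample = (
    b"<html><body>"
    + b"<p>repetitive markup compresses well</p>" * 200
    + b"</body></html>"
)
print(round(compression_ratio(sample), 1))
```

In plain HTTP the same saving is negotiated with the Accept-Encoding request header and the Content-Encoding response header, which is presumably the mechanism the comment has in mind.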
I think Tim the tool-man Taylor is the man for this job. Nobody overbuilds engines better than this lovable Tim Allen character.
More POWER! ugh ooough ooough!
What's the difference between my machine indexing them and the university students recently being hauled into court for indexing open shares? Why would I not be held liable for contributory copyright infringement?
No thanks.
From the readme in the linux version - no idea what the other readmes might say. However, it appears that they are sensitive to the fact that bootloader grub pre-existed their program. They are requesting catchy names. Here is an excerpt:
Notice
======
The main executable has been renamed to "grubclient" out of respect for the GNU Grub bootloader, who's executable is named "grub". They were out first, so we decided to pick another name. If you have a catchy suggestion for a new name, please let us know.
What changed under Obama? Nothing Good
It seems that google is actually crawling my site a lot more than grub is. Over the past 6 days:
$ grep -c Googlebot access_log
827
$ grep -c grub-client access_log
153
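The same per-crawler tally can be done in one pass over the log. This is a sketch that mirrors `grep -c` (it counts lines containing each string); the sample log lines and the version suffix on the grub user agent are invented for illustration.

```python
from collections import Counter

def count_agents(log_lines, agents):
    """Count, per agent string, how many log lines contain it (like grep -c)."""
    counts = Counter({agent: 0 for agent in agents})
    for line in log_lines:
        for agent in agents:
            if agent in line:
                counts[agent] += 1
    return counts

# Hypothetical access-log lines, not taken from any real server.
sample_log = [
    '1.2.3.4 - - [01/May/2003:10:00:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [01/May/2003:10:00:05 +0000] "GET /a HTTP/1.0" 200 128 "-" "grub-client-1.4"',
    '1.2.3.4 - - [01/May/2003:10:00:09 +0000] "GET /b HTTP/1.0" 200 256 "-" "Googlebot/2.1"',
]
print(count_agents(sample_log, ["Googlebot", "grub-client"]))
```

With a real file you would pass `open("access_log")` as `log_lines` instead of the sample list.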
I prefer grid.org to grub.org. There the cycles are going to cancer or smallpox research. Currently over 2 million machines are participating.
Altruism has its place, but since I'm more likely to die of cancer than of not having the complete www indexed I think I'll be selfish and work towards a cure for something that may affect me.
i wonder if google has already seen this coming (i've seen that grub fellow in my logs a number of times and sort of wondered about it), and is going to use their own distributed search engine once they get the bugs hammered out...
*Another* bunch of spiders chewing up my bandwidth, ignoring my robots.txt files, and bringing my server(s) to their knees.
Joy of freaking joys.
Ed R.Zahurak
You know, oblivion keeps looking better every day.
I expected some way to search... this looks more like a project to index the web rather than make the results available for public use via web interface. Did it strike anyone else odd that there was no web form on the home page with which to search?!
It seems like a good concept, but the availability of the information collected needs to be accessible without installing the client. I'm not game to install distributed computing apps without some freely available benefit. The "for the good of the world" motivation went out the window for me about a day after my first Seti At Home experience. (But now BitTorrent, there was appreciable benefit. I had RedHat 9 isos within 8 hours of their initial release!)
There is no need to use a SlashDot sig for SEO...
You have to be kidding or working for Microsoft, or both! Have you ever searched for Linux on MSN? Try it - here.
Notice the third result? "Learn about the Microsoft alternatives and how to move to them from open source products." I shit you not! I don't think Google would ever use this kind of dirty, underhanded trick. Great "hand-picking", mate.
We're only gonna die from our own arrogance, that's why we might as well take our time...
Set Up Your Account Please register for your Grub account. We will NOT release your personal information to anyone, and your email address will not be displayed on the site. Your email address will be your Grub login.
* Email:
* Username:
* New Password:
Learn lisp today!
just another extension of the 1998 zeitgeist;
It's all about eyeballs.
baloney.
Show me the profits.
These are my friends, See how they glisten. See this one shine, how he smiles in the light.
Grub isn't a heavy CPU user. Right now, on my Athlon (~2400+), it's using between 0-2% of the CPU at any given time. Grub is mainly interested in your excess bandwidth.
Google doesn't index user sigs, so stop trying to "Google Bomb" with them.
of course not.
These guys buy a bunch of servers and let some DUMB designed software try and find what you're looking for. It's really stupid shit if you ask me.
What's really needed is some sort of A.I.: either real people actually giving you advice on finding something, or programs that can think like humans and know what humans want.
Also needed is a way to categorize FREE CONTENT separately from the commercial websites. The internet, which was invented for government information sharing, has turned into a FILTHY, CRASS, Commercial overloaded sack of SHIT.
Anyone that can implement these features, I tip my hat to you. Google, yahoo, lycos all suck in my opinion. These companies should hire some of those soon to be out of work telemarketers, but they won't because they think their software is so special, when it's actually ridiculously unprofessional, shoddy and cheap.
If anyone wants evidence then try and find content in these search engines about starting an internet business. I get SPAM site after spam site. Nothing legitimate. It doesn't help that these overhyped so-called search engines take under the table cash from businesses for placement on their searches.
Google is very responsive to spam reports. Rather than simply remove spam sites as they find them, they prefer to "teach" their software what's bad from example. This can take a bit of extra time, but it seems worth it to me. Google even has a link on their search results for feedback if you're unhappy. Try reporting bad searches some time.
cough
Selling software wont make you money, selling a service will.
this is totally off topic... but has anyone seen MadPenguin.org's logo? I about fell out of my chair when I saw it. Seems they have been endorsed by Muhammed Saeed al-Sahaf LOL.
:)
Thought I would share
nox
An enormous amount of spiders that are hunting for an enormous amount of web flies - pages...
they're going to sneak in file sharing support with a kazaa plugin.
Analytic & algebraic topology of locally Euclidean meterization of infinitely differentiable Riemmanian manifold
Isn't Looksmart/Sprinks a big pay-per-listing deal? The looksmart logo in the upper right corner was enough to make me just close that page right away without any second thought.
Morphing Software
anyways
adv. Nonstandard
In any case.
But it still kind of irks me that people think that a computerized 'dumb' search result could compete with a human rating system that filters spam, porn, and other garbage results. Google should hire some REAL PEOPLE that can do some sort of categorized intelligent directory so we can have QUALITY at the beginning of a search result. Some sort of HUMAN RATING system is needed to sort. The software is not up to par.
Grub has had problems forever. I remember when they first announced it. It sounded cool, so I went to check it out. Turns out the actual crawling was done by.. wait for it.. wget. How lame is a web crawler that uses wget?
Then people started to realize that grub didn't have a good set of AI back at the mothership--lots of pages got crawled way too often, grub didn't obey robots.txt, etc. Many webmasters just started banning grub altogether.
Now we find out that LookSmart has bought grub and its three developers. LookSmart is the company that stabbed its customers in the back by starting to charge for every click from its directory instead of a one-time fee for inclusion.
These two groups deserve each other. Grub was supported by the community, but now that they've sold out to commercial interests, who wants to give up their bandwidth for free to LookSmart? The grub code was GPL--I wonder if grub will start to change the license to make the code closed source..
Not to mention:
Results 1-15 of about 609 containing "linux"
I seem to remember there being more than 609 websites with Linux information on them...
So, pray tell, where does that result belong? I agree, it shouldn't be number three, but where then? It's nowhere to be found in the first ten pages of Google. Am I to assume Google does not weight search results? No, just look at the Search King case. I don't think we can really rely on any search engine with an agenda, but we have no other choice.
A programmer is a machine for converting coffee into code.
Those pigeons can eat the grubs, solving two problems at once.
Just watch out for the part about killing two birds with one stone.
According to the Grub FAQ, it respects robots.txt although not the META tags. Although it takes a week or two for it to listen to the robots.txt, it does eventually...
The sheer volume of this project concerns me, however. The very fact that it got Slashdotted may cause it to be a bit heavier than expected!
It sounds like a good use of spare bandwidth, but if it's going to wind up a superscanner, it's going to send a hell of a lot of requests.
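For reference, this is the kind of robots.txt check a well-behaved crawler is expected to make before each fetch, sketched with Python's standard parser. The rules below are an illustrative example of a site banning one crawler by its user agent, not anything taken from the Grub FAQ.

```python
from urllib.robotparser import RobotFileParser

# Example rules a webmaster might publish: ban grub-client entirely,
# keep everyone else out of /private/ only.
rules = """\
User-agent: grub-client
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("grub-client", "http://example.com/index.html"))
print(parser.can_fetch("SomeOtherBot", "http://example.com/index.html"))
print(parser.can_fetch("SomeOtherBot", "http://example.com/private/x.html"))
```

A crawler that caches the parsed rules for a long time would produce exactly the "takes a week or two to listen" behavior described above, since changes are only seen when the cached copy is refreshed.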
I tried it and deleted it as quickly: it's not very good at being a bottom feeder, it redlined my system resources immediately and slowed everything down. Duration between installation and uninstallation: twenty-nine seconds.
Warning: Poster of this comment is a nerd. Just like everybody else here.
...as the web gets larger and more cluttered.
I've already discovered this with comic books turned into movies. Finding synopses of the comic book X-Men is nigh impossible. Finding synopses of the movies is much, much easier. Damn near every site online about X-Men, Spiderman, The Hulk, Batman, etc. deals with the movies, and sifting through the cruft is not easy. And that's just comic books. Other topics can be just as hard to find, and this doesn't even touch upon fake search results that only turn up porn or worse, a blank page (happens frequently).
Searching for MORE stuff isn't going to help. Searching better is the key. Google goes a long way towards this, but even it has the same problems of finding too much crud.
(Oh, I can't remember. Have I MetaModerated Recently?)
sulli
RTFJ.
Or not. What a difference maturity makes.
I saw another poster say you can stop the GRUB client from crawling porn, but what if you could pick the types of content you wanted to crawl for?
Since there is not any monetary incentive to run the client, and you won't find any Aliens (but maybe some freaks ;), give the user (client) the ability to improve results for things that matter to them....
Let's say for example I use search engines but find them lacking or would like better results for the types of content I SEARCH FOR???
So one solution would just be to pick the types of content manually, or select keywords, etc, manually....
Another option might be to sniff my use of Google.com or Altavista.com (is that still up? ;) and then help the Engines refine the content in its indexes according to what I ACTUALLY SEARCHED FOR???
Silly Rabbit: tricks are for kids.
Haha
Yea. If you help Grub, Grub gives your web site a preferential listing. Building the biggest search engine, sure. Building good search results, not so sure.
Something that plugs into, e.g., a squid cache and acts as a client of that kind of network would be more useful, at least for common users: those who don't yet have a proxy cache would gain a lot in browsing speed, and it would use no extra bandwidth, only what they have already downloaded. For the "search" engine it would give another approach to ranking results, favoring the sites that are accessed more, not just the ones that are linked more.
It could have problems, of course. Sites that are rarely visited would become even harder to find, but maybe this can be compensated for with an optional crawler.
- makes me feel warm and fuzzy about my altruism
- can run in the background on a Unix box
- is open-source (so I don't have to run someone's closed-source app on my box and trust their
security through obscurity)
Well, #1 rules out Grub, #2 rules out Folding@Home, and #3 rules out both SETI@Home and Folding@Home. So what worthy causes are out there?
Find free books.
If this thing gets too popular without proper throttling, they could cause real havoc.
Copyright Violation:"theft, piracy"::Anti-Trust Violation:"thermonuclear price terrorism"<-Overly dramatic language.
Alright, I have 3 major problems with this...
1) How different is this than the Princeton kiddies' system? I don't know about you, but I don't want a 95 billion dollar bill arriving in the mail...
2) What if your local (cache?) contains a few links to kiddie porn? Not your fault, right? Software does its own thing, you cannot control it, BUT what will the FBI think? The FBI, Scotland Yard, and RCMP are currently heavily investigating Kiddie Porn cases (good work IMHO), but what if you're the unlucky sap who gets stuck with a few sketchy URLs? Or Worse Yet, what if this GRUB keeps a cache of the website like google does? Then what?
3) What about material that is legal locally, but illegal somewhere else... eg. Nazi stuff in Germany, Falun Gong in China, etc... The last thing I want is to be refused a travel visa cuz my PC has an illegal cache...
Good idea in principle, but with sketchy content on the web, I don't think I will be the one keeping track of it all. If there is a way to filter out the questionable stuff then maybe, but since the purpose is to be as inclusive as possible, it seems incompatible.
_CMK
Bad spellers of the world untie!
Hey! Have you heard of Yahoo?
Yeah, but all the others actually run linux and can't stay up long enough to get indexed.
A) Google does have a human-created directory (might be the same as DMOZ)
B) I imagine that they manually have given pages on Yahoo and other web directories high weights.
You can always use the Google API for more than 2,000 searches per day if you pay licensing fees for it. That's just Google ensuring that it can remain a viable company. Little text-box advertisements just don't cut it in this day and age where blatant pop-ups and colorful banner ads don't even have much turn-around. That's not the point though.
The point is that I wouldn't look anytime soon for LookSmart to allow unlimited usage of this API. It's too large of a project for them to just let people use it. It's simple economics. They may not be investing the computing resources into this projects web spidering software, but it's still using TONS of resources to keep this data catalogued and readily accessible.
Or DMOZ (which Google actually does use.)
A DDoS is only effective because it's a whole bunch of messages all at once to one target- in the 100,000,000 range for a full-scale attack, to always cover all the positions.
The database of "check-me"s is randomized rather evenly. Even if this takes off, I don't see how it could really do serious damage to any but the truly dinky servers: the hits will not come in all at once and flood the whole connection. While it very well could end up a constant stream, it's unlikely to be the massive stream that makes a DDoS.
It does have the potential to slow servers across the world, but that's okay- it will slow home users' connections across the world by using 1/4 of them, too, so nobody will actually notice.
Warning: Poster of this comment is a nerd. Just like everybody else here.
google's pigeons
Another quote I like is, "Windows operating systems do not provide X Windows. For X Windows connectivity, developers need a third-party X Windows server.". Of course Microsoft would never be anticompetitive by competing with third-party suppliers of implementations of an open standard, right?
It's not as bad as you make it out to be. They do point out (in fine print) that it is a "featured" site. They list the "featured" sites first, then the sponsored links, and then general web hits. And they mark each category. I guess that the only difference between featured and sponsored is in the price. All this was far from obvious to me when I saw the results at first (being used to Google), but I imagine that if you used them on a daily basis you would quickly become used to skipping down to the real results.
Another damn web spider adding to the collective noise of the internet.
Why don't these people try to work out some way of sharing information so I don't have to have my webserver poked at by every person and their brother's search engine?
Only on slashdot can a posting be rated "Score -1, Insightful".
It's a "featured site". Meaning it's a site from Microsoft, a Microsoft partner, or someone who paid some money to Microsoft for the privilege.
Nothing that other search sites don't do. They just mark their paid adverts a little more obviously.
I am NOT a man!
I am a free number!
My web site has gotten a few hits from the grub bots, none of which were for robots.txt.
Hello grub, welcome to my BANNED BOT LIST.
Okay, i found the source at sourceforge CVS. unfortunately, all the files checked in are >4 months old. If this is under the GPL, where the hell is the source for the binaries they are putting out?
Results 801 - 878 of about 58,500,000
In order to show you the most relevant results, we have omitted some entries very similar to the 878 already displayed.
609 pages with Linux info isn't so bad, when you consider Google only shows 878 "relevant pages". Not one link to MSN in those 878 pages.
Anyone care to look through the 58,499,222 omitted entries?
This I dispute sir. Targeted keywords on google, where my clickthrough ratio has averaged 1.3-1.5%, are a goldmine for my site and money very well-spent (averaging $500 a month on those ads, paying .05 in 97% of all cases.)
I've been a google advertiser since Feb. 02, consider their program extremely lucrative, and I guess they like me 'cause I got a picture frame from them last Christmas. It was a Coach picture frame....
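A quick back-of-the-envelope check of those figures, assuming for simplicity that every click costs the $0.05 quoted (the comment says 97% of them do). The variable and function names are just for this sketch.

```python
monthly_spend = 500.00   # dollars per month, from the comment
cost_per_click = 0.05    # dollars, simplifying "97% of all cases" to all clicks

clicks = monthly_spend / cost_per_click

def impressions_needed(clicks: float, ctr: float) -> float:
    """Impressions required to produce `clicks` at a given clickthrough rate."""
    return clicks / ctr

low = impressions_needed(clicks, 0.015)   # at the better 1.5% rate
high = impressions_needed(clicks, 0.013)  # at the weaker 1.3% rate
print(f"{clicks:.0f} clicks, {low:,.0f} to {high:,.0f} impressions")
```

So that budget buys about 10,000 clicks a month, which at the quoted clickthrough rates means the ads are being shown somewhere between roughly 670,000 and 770,000 times.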
I'm not disputing whether or not the advertising is effective in fulfilling its purpose of promoting the advertiser's site. I am simply stating that Google would not be a very viable company if they relied on advertising alone to make their money.
I won't argue with you on how much Google makes off the ads, as I am willing to bet that about 80% or more of their funds comes from advertising. However, advertising alone has always proven an ineffective means of remaining viable. You simply have to have other sources of income.
What a lame piece of shit code... Didn't work at all on any of the 5 machines I tried it on... Tried logging in to the forum, but unfortunately that was broken too. Basically, that's OSS in a nutshell.
"Your CPU came with a keyboard? What kind of ghetto deal is that?" -McSuede
Im sure you'r apostrophe's and ",quotes", have good grammars
We need a higher signal to noise ratio. I don't think crawling MORE will really help that. Even google has been fooled and sites can quickly master the rankings with little effort.
I'm not sure of a good way to improve the signal to noise ratio, but this certainly doesn't seem like a solution. I also question the ethics of releasing such software that, if it contains security holes, is a potential launch platform for debilitating internet attacks.
Yes, Google's algo only asked Microsoft to go to hell, of course, taking it down after the story was reported far and wide.
More than mere navel gazing.
It is too easy to send corrupted information into the database. They have *no choice* but to trust the clients. Sure they could run spot checks on the results, but they would be very partial and it would be easy enough to fake responses for those as well.
So the more popular it gets, the more incentive people will have to promote their sites by feeding it fake index information. If this magically got to be very popular, within weeks search results would become meaningless and it would drop back into obscurity. The more likely result would be that it will never become popular in the first place.
Besides, who wants to donate his CPU and bandwidth resources for a commercial company, anyway?
The internet has become an ever growing tree of knowledge that will some day lead to something even bigger.
nothing about grub here, but personally i really like this web site that has a few search engines on it: http://freddo.netfirms.com/. It also refers to Fravia's new website and his invaluable forum.
A good reference about search engines is also Search Engine Watch
have fun...
-- search the web
Normally, most search engines' spidering methods are designed to be pretty nice to servers - such as only requesting pages once every 30 seconds or so.
However, I've seen times when the methods of some of the search engine spiders were foiled by such simple things as having a large number of virtual hosts on a machine. Combine that with a number of front-end machines all connected to the same database server, and things can get really nasty.
In one particularly bad incident, several fairly big-name search engines were spidering us simultaneously, and only hitting each domain name relatively infrequently. However, with 500+ virtual hosts on several front-end servers, and several search engines, we were getting something like 50-100 requests per *second* from the search engines. When those hits were to pages generated from the database, our servers kept up, but performance was definitely degraded.
So, where am I going? I see the potential for small bugs, weak algorithms, idiotic end-users, or even malicious end-users causing the same sort of havoc. Even if it weren't meant as an actual DDOS, it could certainly end up that way. And it would be much, much harder to prevent than merely blocking (or rate-limiting) requests from one company's spiders.
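The virtual-host problem described above comes from throttling per domain name instead of per machine. A minimal sketch of the alternative, assuming the crawler already knows which address each hostname resolves to (the mapping below is hard-coded and hypothetical, as is the class name):

```python
class PolitenessGate:
    """Allow at most one request per `min_interval` seconds per server address."""

    def __init__(self, min_interval: float = 30.0):
        self.min_interval = min_interval
        self._next_allowed = {}  # server address -> earliest time of next request

    def try_acquire(self, server_ip: str, now: float) -> bool:
        """Return True and reserve the slot if the server may be contacted at `now`."""
        if now < self._next_allowed.get(server_ip, 0.0):
            return False
        self._next_allowed[server_ip] = now + self.min_interval
        return True

# Two virtual hosts served by the same machine (made-up names and address).
ip_of = {"a.example": "203.0.113.7", "b.example": "203.0.113.7"}

gate = PolitenessGate(min_interval=30.0)
print(gate.try_acquire(ip_of["a.example"], now=0.0))
print(gate.try_acquire(ip_of["b.example"], now=1.0))
print(gate.try_acquire(ip_of["b.example"], now=30.0))
```

Keying on the address means the second hostname has to wait its turn instead of getting its own 30-second budget. It would not help with the other two cases in the comment: several front-end machines sharing one database, or several independent crawlers that know nothing about each other.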
Oh, you're not stuck, you're just unable to let go of the onion rings.
1. Design a search engine
2. Let everyone else fill it
3. Profit
The second step is finally found!!! YAY
it is only after a long journey that you know the strength of the horse.
I'm sure grub will indeed build a larger database than most other search engines, since grub (or grub-client, or whatever it's calling itself) has never, not even once bothered to look at a robots.txt file on any web site I've ever administered. This is what webmasters call a misbehaved robot, and it is not something to be looked at with respect.
--Mythos
Many people around here work for the Government.
As to what's wrong with Corporations and Big Business:
in one word: Enron
In more words than that:
http://www.corpwatch.org/
www.linkloader.com
The common point made by these "distributed" software authors is that there are "wasted" CPU cycles in your computer that you could donate to a project for free.
However, that is not true at all! CPU cycles are not wasted. When the CPU has nothing to do, it sleeps. At least in a modern operating system (i.e. about everything after Windows 95).
By "donating your wasted CPU cycles" you will actually increase the power consumption of your computer. This will be very noticeable in a laptop, but when you watch the CPU temperature in your home system you will also see a noticeable increase in temperature between an idle system and a system running a computationally intensive background task.
Probably the effect will be worse for things like keysearches, prime number searches, SETI etc than for this GRUB bot, because that probably also spends time waiting for the network (and thus returns the CPU to idle).
So before you "donate your wasted CPU cycles", please realize that this will actually cost you money.
I like the idea. But it should benefit the community as well, so it should crawl something like 80% community assigned pages, and 20% "my" pages. That would still benefit the user much more than he deserves.
Holy fuck Tony Blair, what the HELL are you doing?
Ensuring that American Dollars and Popular Opinion flow toward Britain. Not to mention military toys and training for British troops.
Brilliant of him to pick the winning side. Now he can reap the rewards for his people.
You're certainly right that every business should have other sources of income (I do worry about my own site's single source). But I think google's raking it in on the click-through ads.
.40 cents per click, on keywords that generate around 500,000 impressions a month.
Typically, where I advertise, there are eight or nine other people trying for the same keyword. I've got the green-shifted look despite paying the minimum because I'm allowed to include "free" in my description, but there's usually five people above me, meaning they're paying at least six cents; often as much as
That number really starts to add up when you think of all the web businesses, and all the keywords, and all the searches, and all the clicks, but I guess we won't have a better idea until google files with the SEC prior to their IPO...
One thought, however, is the way google text ads are now showing at places like Metafilter or a number of the PDA news sites. Google's out to score more impressions any way they can... must be worth something to them.
The Austrian version of MSN is even better. If you search for Linux, the first two results are WinXP ads on the Microsoft site. And, while you're at it, try searching for google or yahoo. This will produce a popup saying "Why look for a search engine when you've already found one?".
As it is not possible to really track who is an authentic Grub, the agent is subject to abuse. If I were a spammer (I am not) and wanted to do email address gathering, I would use the agent name for any spam crawlers I would run... I do not think I am giving away any dark secrets here.
Since the makers of Grub claim it is a robots.txt compliant spider and since I have seen the Grub agent not always follow the behavior a robots.txt compliant spider should follow, I can only conclude that my speculation above is already happening.
Imagine your horror when you think your site is the most well indexed on the planet, because it is being crawled at DoS frequencies, but in fact it is really being crawled by an army of spambots getting in as something you might actually want to have crawl your site. As you watch those logs, consider that those Grub accesses may be the spambot engine from hell and not the real Grub rummaging through your site.
So, even if you were of a mind to discriminate between the real thing and the pretenders, how do you do it? Search engines from fixed IP blocks are easy enough to authenticate and really allow a webmaster control over who indexes what. I cannot think of a way to do that with a distributed spider application without having a way to communicate with the "mother ship". If I have to make the investment in time to communicate with the "mother ship" each time a Grub client shows up, I might just as well crawl my own sites and disallow external Grub clients anyway.
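To make the fixed-IP-block point concrete, here is a rough sketch using Python's standard `ipaddress` module. The crawler name `examplebot` and the address ranges are placeholders (reserved documentation ranges), not real search engine blocks, and `is_authentic` is a hypothetical helper.

```python
import ipaddress

# Hypothetical table: crawler name -> address blocks it is known to use.
TRUSTED_BLOCKS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
}


def is_authentic(claimed_name, remote_ip):
    """True only if the request comes from a block registered for that name."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in TRUSTED_BLOCKS.get(claimed_name, []))


print(is_authentic("examplebot", "192.0.2.17"))    # inside the listed block
print(is_authentic("examplebot", "203.0.113.5"))   # right name, wrong address
print(is_authentic("grub-client", "192.0.2.17"))   # no blocks known at all
```

A distributed client has no entry you could put in that table, since every volunteer's machine has a different address. That leaves the user-agent string, which anyone can type.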
Elitist dirt bag? Me? Not really, blocking and managing search engines is a fair approach, because if you let every budding search engine / indexing tool on the planet have at your site, you really might be facing a de facto DoS just from this kind of activity alone.
Well-behaved search engines always are allowed and welcome to crawl our sites. Ill-behaved engines are met with a different attitude and fate.
"We reserve the right to serve or refuse service to anyone for any reason"
In any case, a collaborative search engine API using distributed computing might still be a nice thing for not-for-profit purposes. One of the applications I wanted to use this API for was a plagiarism search for teachers to quickly scan student papers to see if they were simply pulled off the net. This was bombed by the 1000 query limit of Google's API, as to do the search properly would require a few tens of queries for each paper. If you have to check tens of these papers the limit can be reached fairly soon.
For this purpose speed wouldn't be so much of an issue, so maybe a distributed cataloguing (sp) and search system might be something interesting?
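A quick back-of-the-envelope check of that limit. The 1000-query figure is the one mentioned above; the per-paper query counts are guesses standing in for "a few tens", and `papers_per_quota` is just a name made up for the sketch.

```python
QUERY_LIMIT = 1000  # the API query limit mentioned above


def papers_per_quota(queries_per_paper, limit=QUERY_LIMIT):
    """How many whole papers fit into one query quota."""
    return limit // queries_per_paper


# Assumed values for "a few tens of queries" per paper.
for q in (20, 30, 50):
    print(q, "queries/paper ->", papers_per_quota(q), "papers")
```

At 30 queries per paper that is 33 papers per quota, so one or two classes' worth of essays would use it up.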
When you download Kazaa, you authorize the corporation to utilize any unused processor or disk space -- this doesn't seem that much more dangerous than all those Kazaa users out there. As a non-Kazaa subscriber, I think I will also skip on grub -- I paid for my computing space and power thank you, and I don't plan on just giving it away to all of these corporations looking to further themselves.
**When craziness is bliss, 'tis folly to be sane**
The only way I'd run grub is on a low-bid DMZ host (like that old P133 I have laying around), with the adult content searching filters disabled. Then I'd let it do whatever it wanted to do as long as it wanted to do it and I'd forget about it. Who cares about the search results? Just use Google like before. They aren't going to make a good search engine anyway.
But if I ever got a subpoena which included information about my web browsing and online history, I could tell the judge that I couldn't honestly say if that particular bit of outbound traffic was me or that grub thing doing its searching. So as long as I was running it, I'd be free to look at "subversive" literature, pr0n, Arab websites, the Cato Institute's homepage, whatever I wanted. If I got on a list and they tried to PATRIOT ACT me, I'd use grub as my get out of (Ashcroft's mystery) jail free card. Hell, I'd throw grub and freenet on the same box and cover every base.
That's if I was paranoid. And wanted to surf Arab web sites or pr0n. Which I'm not. And I don't. :-)
-B
Ash and Hickory, straight-grained and true, make excellent bludgeons, dandy for the cudgeling of vegetarians.
"Looksmart"?? Is this supposed to be a clever company name?
I keep thinking "Look smart as opposed to what? Being smart?" or "Look smart even if you're not??"
With a name like that, these clowns don't even LOOK smart!
make up your own mind: www.linuks.mine.nu/people/kord/
Windoze not found: (C)heer, (P)arty or (D)ance
It would be interesting to just see a database that is connected to browsers, so that whenever I were to look at a page, the page data would be processed and sent to whatever search engine. Then, those sites that are updated frequently and get a lot of traffic would be more easily searched.
Just a thought.
Ok, so I'll just hack it a bit, and all my websites will FINALLY make it to #1 in search engines on ANY keyword! Doh, I need to subscribe to a few click-to-pay banner sites...
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
What? Are you part of the "Yahoo Publicity Spread FUD department strike team" or something?
Here's a hint for ya;
1) Go to Google
2) Click on the Fourth Link from the left in the bar. (The green one that says "Directory" on it.)
3) Enjoy!
Or, if you are particularly patient, just visit http://www.dmoz.org/ directly.
Built by humans, edited by humans, unpaid volunteers that know something and care about the directories they edit. You too can even volunteer to help!
Yahoo sucks PRECISELY BECAUSE they tried to pay people to get sites in their directory, found out they could not keep up, and then started making site owners pay to get in. Obviously, GRUB won't do what you want either, but what you are complaining about lacking already exists.
The idea is cool and I imagine it won't be long before an org. without links (unverified) to M$, will do the same thing. There's at least a couple of people on the grub forum who are figuring out some of the shadier sides of this code: potential spyware? security hole? And the licensing is vague (no links).
Note the tone of their pitch as well: you are participating in a competitive group effort akin to Seti@home and Distributed Net? I don't think so... caveat emptor.
Updating a search engine of general web material is an important objective, but there are diminishing marginal returns to immediacy. Google News is an example of a subset of web material -- news sites -- for which immediacy is a more important goal. It's no surprise that Google offers a very fast refresh there. A distributed system that would do that for the entire net is interesting, but not necessarily worthwhile.
i don't know about you.
i don't search for porn, it looks more like porn searches for me.
Practical Semantic Web Log
I've been noticing some hits from my website mentioning something called "grub," but never knew what it was.
For the webmasters out there, this is what the UserAgent string shows up as on my site:
Mozilla/4.0 (compatible; grub-client-1.2.1; Crawl your own stuff with http://grub.org)
(There are variations on the grub-client-1.2.1 version number, so if you for some reason decide to search, you may want to do grub-client-*.)
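If you want to pick these out of your logs, something like the following would do. It is only a sketch: the pattern is inferred from the single user-agent string quoted above, and `grub_version` is a hypothetical helper name. As others in this thread point out, anyone can send this string, so a match says nothing about who is really behind the request.

```python
import re

# Matches "grub-client-" followed by a dotted version number.
GRUB_UA = re.compile(r"grub-client-(\d+(?:\.\d+)*)")


def grub_version(user_agent):
    """Return the version string if the UA claims to be grub-client, else None."""
    match = GRUB_UA.search(user_agent)
    return match.group(1) if match else None


ua = ("Mozilla/4.0 (compatible; grub-client-1.2.1; "
      "Crawl your own stuff with http://grub.org)")
print(grub_version(ua))
print(grub_version("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```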
________________________________________________
suwain_2
From the CNET article (linked from the grub website)
"LookSmart, which licenses editorial and commercial directory listings to Microsoft's MSN and other Web sites, paid $1.3 million in cash and stock for Grub, according to a recent filing with the Securities and Exchange Commission. LookSmart said it is testing the Grub system and plans to unveil the distributed computing project in early April."
Does this mean whoever runs the client will be helping Microsoft build a "good" search engine? It appears to me that will be the case. Also, "the client is open source"? Oh great - so we can do all the labor and look at the source of it, but the server, which the corporations (Looksmart? M$?) own and run, will not be open source.
Isn't this a dirty trick on the open source community?
Has anyone been able to get/look at the source code of the client yet?
There is no mention on their home page (www.grub.org)
on who these guys are and what their intention is. Just a
[Also there is a thread on their forum on the client trying to act as a server - the thread is inconclusive on whether any spyware is included.]
How often the dupes are getting! Pretty soon it will be once a week..
Is it slashdottable?
Why is it looking for robots.txt in a subdirectory?
i couldn't believe that, so i tried. While my german isn't that great, i worked it out. that is hilarious
Pfft - Sorry, what?
From taxes. Taxes on businesses, and taxes on people who work for businesses. No taxes means no money to pay these government whiners.
Do people actually think that there's some magical money tree, and government just gets it from there? If one dislikes business so much, BAN IT! Make it illegal. We'll go to 100% socialism. Let's see how THAT works out.
Uh huh, Grub is going to "run in the background" ?
No thanks!!. It just doesn't feel right. It is sort of like lending a firearm to an untrustworthy neighbor. What is in it for the lender other than potential problems?
Spyware "runs in the background" and slows up peoples machines. What really happens to one's machine performance with Grub? And, more importantly, where is my check?
Harpo Tunnel Syndrome--my wrist feels funny.
Your take sounds exactly like what Matt Wells, the programmer of the Gigablast search engine, said on his rants and raves page.
....
"Rants & Raves
by Matt Wells
My Take on Looksmart's Grub
Apr 19, 2003
There's been some press about Grub, a program from Looksmart which you install on your machine to help Looksmart spider the web. Looksmart is only using Grub to save on their bandwidth. Essentially Grub just compresses web pages before sending them to Looksmart's indexer thus reducing the bandwidth they have to pay for by a factor of 5 or so. The same thing could be accomplished through a proxy which compresses web pages. Eventually, once the HTTP mime standard for requesting compressed web pages is better supported by web servers, Grub will not be necessary."
[Your supposed take:] "Looksmart is only using Grub to save on their bandwidth. Essentially Grub just compresses web pages before sending them to Looksmart's indexer thus reducing the bandwidth they have to pay for by a factor of 5 or so. The same thing could be accomplished through a proxy which compresses web pages. Eventually, once the HTTP mime standard for requesting compressed web pages is better supported by web servers, Grub will not be necessary."
You have been caught plagiarizing. Dork, you hardly changed a word, nice copy and paste job.
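Plagiarism aside, the compression claim in the quoted passage is easy to try for yourself. This is only a sketch with a made-up, highly repetitive sample page, so the ratio it prints says nothing about real pages or about the "factor of 5" figure; `compression_ratio` is a hypothetical helper. In HTTP, a client asks for compressed responses with the `Accept-Encoding: gzip` request header.

```python
import gzip


def compression_ratio(page_bytes):
    """Original size divided by gzip-compressed size."""
    compressed = gzip.compress(page_bytes)
    # Sanity check: compression must be lossless.
    assert gzip.decompress(compressed) == page_bytes
    return len(page_bytes) / len(compressed)


# Made-up sample page; real HTML is usually far less repetitive than this.
sample = (
    "<html><body>"
    + "<p>Some repetitive markup and text.</p>" * 200
    + "</body></html>"
).encode("utf-8")

print(round(compression_ratio(sample), 1))
```

Run it against pages saved from your own site to see what ratio you actually get.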
- You can develop any application you want, but you must abide by the Google Web APIs terms of service. One condition is you cannot create a commercial service using Google Web APIs without first obtaining written consent from Google. Another is that you can only create one account for your personal use.
Do what it says - obtain written permission. That written permission will be in the form of a commercial contract/license. If this becomes popular, legal issues will crop up and it will be shut down and banned through mega-corporations' legal clout. I hope not, but I wouldn't be surprised at all. Today's net kinda sucks.
So Grub is commercial. Big deal. Any large-scale project like this furthers our knowledge of distributed computing and helps pave the way to other things, like on-demand mirroring of popular content.
> As to whats wrong with Corporations and Big Business: in one word: Enron
Wow, try using a little more thought next time. There is nothing wrong with business. What was wrong with Enron was the people running it (or, not running it). There are plenty of big businesses out there that are not corrupted like Enron. Stop beating your chest with stupid remarks that don't hold up.
Google has other income sources. Take a look at their Google devices: a 1U unit costs $28,000! It's really just a Dual PIV with 2 GB RAM running a modified Linux version and their software. There is also a limit on the number of docs you can index.
For $1,000 you can have a kick ass open source search engine on the same hardware that you can actually customize and disk space is pretty much your only restriction. See ht://dig project @ http://htdig.org