Building a Bigger Search Engine

Posted by ryuzaki0 on Saturday April 19, 2003 @02:15PM from the size-isn't-everything dept.

skreuzer writes "Wired is running a story about a distributed web crawler called Grub. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed. Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day."

Will Grub take off or be smashed? by Blaine+Hilton · 2003-04-19 14:17 · Score: 4, Insightful

I started to use grub, but then questions started cropping up. First we are using this to further a commercial organization. This is not research such as SETI or Folding At Home; this is doing the dirty work of a large commercial search engine. There is not even any potential reward such as with distributed.net.
Also the grub engine crawls everything, including adult content and other questionable content. They have a setting to turn it off, but it does not block it. With the current questioning of international law relating to accessing illegal websites this could have major consequences for the average user.
So for the time being I have stopped using the grub client until some serious questions are answered. It's an interesting concept and if it was being used in more of an academic setting it could be interesting. However I believe that search engines like Google are doing pretty good themselves.
Go calculate something
1. Re:Will Grub take off or be smashed? by Threni · 2003-04-19 15:03 · Score: 1, Insightful
  
  "Also the grub engine crawls everything, including adult content and other questionable content."
  
  Adult content isn't questionable. You either look at it, or you don't. Don't tell me that stuff about children being harmed by looking at photographs of the naked body has got to you?
  
  Also, the legal problems exist mainly in your head. No user will be prosecuted for supplying a URL of a website to a third party who then makes it available to people using their search engine, as it simply isn't illegal.
  
  Unlike SETI, this thing isn't a complete and utter waste of time, although I agree with you about the folding thing.
  
  "So for the time being I have stopped using the grub client until some serious questions are answered."
  
  No serious questions have been posed at this time.
2. Re:Will Grub take off or be smashed? by bcrowell · 2003-04-19 15:59 · Score: 3, Insightful
  
  This is not research such as SETI or Folding At Home; this is doing the dirty work of a large commercial search engine.
  Actually, if I had a gun to my head, I'd choose to run Grub, because the client is open-source. I used to run SETI@home, but then the news came out that they'd been sitting on a potential root vulnerability for a long time. That really brought home to me the risks of running someone else's closed-source app on my box.
  
  --
  Find free books.
3. Re:Will Grub take off or be smashed? by kaden · 2003-04-19 16:00 · Score: 5, Insightful
  
  Um, I think you're missing the point. This client could download highly illegal files, and make it look like I'm knowingly downloading them. Say I run it, and it downloads anything from kiddy porn to some Al Qaida webpage from an FBI sting server. I would quite possibly be arrested and charged, and while I wouldn't be convicted, it's quite an ordeal, and there's an ugly social stigma to even being charged with Kiddy Porn or conspiring with a terrorist. So that's a serious question that's posed by running Grub.
4. Re:Will Grub take off or be smashed? by Moonwick · 2003-04-19 16:55 · Score: 2, Insightful
  
  Yeah, god forbid you help a commercial organization, especially when the results could stand to benefit you.
  
  God knows that Google, by virtue of being a commercial entity, has absolutely nothing to offer you.
  
  Anti-capitalist fucktard.
  
  --
  Only on slashdot can a posting be rated "Score -1, Insightful".
Great idea, but will it pan out? by dtolton · 2003-04-19 14:17 · Score: 5, Insightful

LookSmart hopes to tap the altruistic nature of many Internet users.

That unfortunately seems like a naively optimistic hope. While the
vast majority of people may be altruistic, it only takes a few
unscrupulous individuals to completely undermine a fair result.

It's interesting that this idea is an extension to Google's model in
many ways. Essentially Google is able to index so much of the
internet by having 50,000+ servers. I don't think that's what makes
Google such a useful search tool, rather I think it's its accuracy and
relevancy. If my search results started getting polluted with bogus
hits, I would stop using it almost immediately.

Unfortunately, by letting people run the client on their machine and
having it send the results back to the server, I think spoofed
results are inevitable. I don't think it will be possible to
safeguard the results either, it will be interesting to see how well
this project survives *when* people start spoofing results. It's
been a problem for SETI@home, and it's something that undermined some
people's faith in the project as a whole. If the spoofed results are
more widespread and have a larger impact as they would in a system
like this, it may ultimately prove fatal to the project.

One factor that has been absolutely critical to Google's success has
been their ability to remain resistant to spoofing attempts. It's
still a question mark how well grub will perform in that context.
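
One common defense against spoofed client reports, sketched here as an assumption rather than anything Grub is known to do, is to hand the same URL to several clients and accept a result only when a strict majority of them agree. The names `accept_report` and `quorum` are hypothetical.

```python
from collections import Counter


def accept_report(digests, quorum=2):
    """Return the digest a strict majority of clients agree on, or None.

    `digests` is the list of content digests that different clients
    reported for the same URL. `quorum` is the minimum number of
    matching reports required (a hypothetical tuning knob).
    """
    if not digests:
        return None
    digest, votes = Counter(digests).most_common(1)[0]
    # Require both the quorum and a strict majority, so that a tie
    # between two competing digests is rejected.
    if votes >= quorum and votes * 2 > len(digests):
        return digest
    return None


# Made-up reports: three clients crawled the first URL, two the second.
reports = {
    "http://example.com/": ["a1", "a1", "ff"],
    "http://example.org/": ["b2", "c3"],
}
accepted = {url: accept_report(d) for url, d in reports.items()}
```

Redundancy like this costs extra crawls and only helps while spoofers are a minority of the clients assigned to a given URL.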

--

Doug Tolton

"The destruction of a value which is, will not bring value to that which isn't." -John Galt
Business Plan? by Anonymous Coward · 2003-04-19 14:22 · Score: 2, Insightful

What are sensible business plans for this type of endeavour?

Should we expect to see many commercial efforts focussed on providing similar "crawl" or "index" capabilities, but each honed to a specific niche market? A scientific crawler? A retail links database?

One could argue that similar efforts targeting music resources have resorted to less automated techniques, i.e. human-driven sharing.

Thoughts?
Hrmm, I wonder how long... by bergeron76 · 2003-04-19 14:22 · Score: 3, Insightful

until someone figures out a way to compromise their local client's results and "escalate" their fave URLs.

It still sounds like a really cool idea though.

--
Don't think that a small group of dedicated individuals can't change the world. It's the only thing that ever has.
1. Re:Hrmm, I wonder how long... by CaptainMunchies · 2003-04-19 14:38 · Score: 3, Insightful
  
  Grub's clients don't come up with a ranking for each website they crawl; rather, they check to see if this website has changed since the last time it was crawled. For any website that has changed, the client notifies the server. The search engine asks the server which sites in its index need to be updated, and the server gleefully replies.
  
  Clients artificially increasing their ranking isn't an issue, since the client has nothing to do with a site's ranking.
  
  --
  Spam removed for the Internet's pleasure ...
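
The change-detection flow CaptainMunchies describes could look roughly like the sketch below: the client hashes what it fetched, compares against the digest from the previous crawl, and reports only the URLs that differ. The thread does not say which hash or data structures Grub uses, so SHA-256 and the names `digest` and `changed_urls` are assumptions.

```python
import hashlib


def digest(body: bytes) -> str:
    """Return a hex digest of the page body (SHA-256 is an assumption)."""
    return hashlib.sha256(body).hexdigest()


def changed_urls(fetched, last_seen):
    """Return the sorted URLs whose content is new or differs from last time.

    fetched:   dict mapping URL -> page body (bytes) from this crawl
    last_seen: dict mapping URL -> digest recorded on the previous crawl
    """
    return sorted(
        url for url, body in fetched.items()
        if last_seen.get(url) != digest(body)
    )


# Made-up data: /a changed, /b is unchanged, /c has never been seen.
last_seen = {
    "http://example.com/a": digest(b"old page"),
    "http://example.com/b": digest(b"same page"),
}
fetched = {
    "http://example.com/a": b"new page",
    "http://example.com/b": b"same page",
    "http://example.com/c": b"first visit",
}
to_report = changed_urls(fetched, last_seen)
```

Nothing here touches ranking; the client only says "this changed", which is the point the comment makes.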
grub is already taken by stock · 2003-04-19 14:23 · Score: 2, Insightful

Grub is the GRand Unified Bootloader, a GNU project, so the name is already taken.
Hmm, search engine eh? Why don't you call it grab?
Robert
Not news for us webmasters by Gothmolly · 2003-04-19 14:27 · Score: 1, Insightful

grub has been crawling my site for weeks if not months now. How is this news? Because someone at Wired wrote about it? Geesh.

--
I want to delete my account but Slashdot doesn't allow it.
1. Re:Not news for us webmasters by hswerdfe · 2003-04-19 19:11 · Score: 2, Insightful
  
  dude, get over yourself....
  
  I never heard tell of Grub.org before.
  
  I found it interesting....
  
  not every link on slashdot is going to directly relate to you....
  
  --
  --meh--
Firewalls? by adam_megacz · 2003-04-19 14:28 · Score: 5, Insightful

So if I choose to run this client, how do I know that it won't accidentally index content that is only accessible from behind my firewall?
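
One way a client could guard against this, offered as a sketch and not as a description of what Grub actually does, is to resolve each hostname and refuse to crawl anything that lands on a private, loopback, or link-local address. The function names `is_internal` and `safe_to_crawl` are hypothetical; the resolver is injectable so the check can be exercised without DNS.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_internal(ip_text):
    """True for private (RFC 1918 etc.), loopback, or link-local addresses."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def safe_to_crawl(url, resolve=socket.gethostbyname):
    """Return True only if the URL's host resolves to a non-internal address.

    Anything that cannot be parsed or resolved is treated as unsafe.
    """
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        return not is_internal(resolve(host))
    except (OSError, ValueError):
        # Resolution failed or the resolver returned something odd.
        return False
```

This does not cover intranet servers that sit on public addresses but are only reachable from inside the firewall; for those, the client would need an explicit deny list.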
What about the RIAA? by One+Louder · 2003-04-19 14:51 · Score: 3, Insightful

So...let's say my instance of Grub crawls over a repository of .mp3s and supplies that information to the combined index.
What's the difference between my machine indexing them and the university students recently being hauled into court for indexing open shares? Why would I not be held liable for contributory copyright infringement?
No thanks.
1. Re:What about the RIAA? by Anonymous Coward · 2003-04-19 14:54 · Score: 1, Insightful
  
  Because this would call into question the future of all search engines, and you'd see the big players like Google, Yahoo, Overture, etc. head into court with their own high-priced lawyers. You think the RIAA wants a fight it doesn't think it can win?
A better use for my screensaver time by Call+Me+Black+Cloud · 2003-04-19 14:57 · Score: 5, Insightful

I prefer grid.org to grub.org. There the cycles are going to cancer or smallpox research. Currently over 2 million machines are participating.

Altruism has its place, but since I'm more likely to die of cancer than of not having the complete www indexed I think I'll be selfish and work towards a cure for something that may affect me.
Re:Hardly distributed crawling by myov · 2003-04-19 15:19 · Score: 2, Insightful

Not the greatest way of doing this. On one of the sites I maintain, the date shows up at the top of the page. The other content changes very infrequently in most cases (a few pages hit a news&events database but that's about it). But the new date would be enough to change the checksum (unless they're allowing for it somehow)
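
To illustrate the "allowing for it somehow" case: a crawler could strip date-like strings before hashing, so a page whose only change is today's date keeps the same checksum. This is a sketch under assumptions. The two patterns below are examples only, and nothing in the thread says Grub normalizes pages this way.

```python
import hashlib
import re

# Example patterns only; a real site would need its own list.
DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # e.g. 2003-04-19
    re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),  # e.g. April 19, 2003
]


def stable_digest(html: str) -> str:
    """Hash the page after removing dates and collapsing whitespace."""
    text = html
    for pattern in DATE_PATTERNS:
        text = pattern.sub("", text)
    text = " ".join(text.split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


monday = "<p>April 19, 2003</p><p>Welcome to our site</p>"
tuesday = "<p>April 20, 2003</p><p>Welcome to our site</p>"
same_page = stable_digest(monday) == stable_digest(tuesday)
```

The trade-off is that a page whose real content is a date (an events listing, say) would look unchanged when it is not.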

Grub hits us quite often. I've seen the same URL hit multiple times in one day by different hosts. It's ignoring the "revisit-after" meta tag (7 days), but then, so are most of the other search engines. While I haven't banned it, I am watching the amount of bandwidth it uses.
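
For reference, honoring a "revisit-after" tag is not hard for a crawler that chooses to. The sketch below reads the tag with Python's standard `html.parser` and returns the requested interval in days. The tag is a non-standard convention, and the names `RevisitAfterParser` and `revisit_after_days` are made up for this example.

```python
import re
from html.parser import HTMLParser


class RevisitAfterParser(HTMLParser):
    """Collect the value of <meta name="revisit-after" content="N days">."""

    def __init__(self):
        super().__init__()
        self.days = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "revisit-after":
            return
        match = re.match(r"\s*(\d+)\s*days?", attr.get("content", ""), re.IGNORECASE)
        if match:
            self.days = int(match.group(1))


def revisit_after_days(html):
    """Return the requested revisit interval in days, or None if absent."""
    parser = RevisitAfterParser()
    parser.feed(html)
    return parser.days


page = '<html><head><meta name="revisit-after" content="7 days"></head></html>'
interval = revisit_after_days(page)
```

A scheduler could then skip any URL whose last crawl is more recent than the returned interval.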

--
I use Macs to up my productivity, so up yours Microsoft!
Re:Search engine software and lack of A . I . by zymano · 2003-04-19 15:31 · Score: 3, Insightful

I didn't know that.
But it still kind of irks me that people think that a computerized 'dumb' search result could compete with a human rating system that filters spam, porn, and other garbage results. Google should hire some REAL PEOPLE who can build some sort of categorized, intelligent directory so we can have QUALITY at the beginning of a search result. Some sort of HUMAN RATING system is needed to sort. The software is not up to par.
Lame.. by Anonymous Coward · 2003-04-19 15:33 · Score: 1, Insightful

Grub has had problems forever. I remember when they first announced it. It sounded cool, so I went to check it out. Turns out the actual crawling was done by.. wait for it.. wget. How lame is a web crawler that uses wget?

Then people started to realize that grub didn't have a good set of AI back at the mothership--lots of pages got crawled way too often, grub didn't obey robots.txt, etc. Many webmasters just started banning grub altogether.
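
For comparison, checking robots.txt before fetching takes only a few lines with Python's standard `urllib.robotparser`. This is a sketch of what obeying the file looks like, not a claim about Grub's code; the user-agent token `grub-client` and the sample rules are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A webmaster banning one crawler outright and fencing off /private/
# for everyone else (sample rules, not taken from any real site).
ROBOTS = """\
User-agent: grub-client
Disallow: /

User-agent: *
Disallow: /private/
"""

grub_ok = allowed(ROBOTS, "grub-client", "http://example.com/index.html")
other_ok = allowed(ROBOTS, "otherbot", "http://example.com/index.html")
```

A crawler would fetch `/robots.txt` once per host, cache it, and run this check before every request to that host.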

Now we find out that LookSmart has bought grub and its three developers. LookSmart is the company that stabbed its customers in the back by starting to charge for every click from its directory instead of a one-time fee for inclusion.

These two groups deserve each other. Grub was supported by the community, but now that they've sold out to commercial interests, who wants to give up their bandwidth for free to LookSmart? The grub code was GPL--I wonder if grub will start to change the license to make the code closed source..
Web searching will only get harder... by Sancho · 2003-04-19 15:44 · Score: 2, Insightful

...as the web gets larger and more cluttered.

I've already discovered this with comic books turned into movies. Finding synopses of the comic book X-Men is nigh impossible. Finding synopses of the movies is much, much easier. Damn near every site online about X-Men, Spiderman, The Hulk, Batman, etc. deals with the movies, and sifting through the cruft is not easy. And that's just comic books. Other topics can be just as hard to find, and this doesn't even touch upon fake search results that only turn up porn or worse, a blank page (happens frequently).

Searching for MORE stuff isn't going to help. Searching better is the key. Google goes a long way towards this, but even it has the same problems of finding too much crud.
Good Idea, Bad Implementation by oaf357 · 2003-04-19 15:52 · Score: 3, Insightful

Yea. If you help Grub, Grub gives your web site a preferential listing. Building the biggest search engine, sure. Building good search results, not so sure.
1. Re:Good Idea, Bad Implementation by Anonymous Coward · 2003-04-19 15:55 · Score: 2, Insightful
  
  It doesn't give you a preference in listings, simply a preference in crawling. You offer some work to guarantee your site has fresh indexing. It's not much different than the search engines that sell frequent crawling for extra. A fresh non-relevant listing won't help you much more than an older listing.
Unlimited Use? Try Wishful Thinking. by NeoMoose · 2003-04-19 16:37 · Score: 3, Insightful

You can always use the Google API for more than 2,000 searches per day if you pay licensing fees for it. That's just Google ensuring that it can remain a viable company. Little text-box advertisements just don't cut it in this day and age where blatant pop-ups and colorful banner ads don't even have much turn-around. That's not the point though.

The point is that I wouldn't look anytime soon for LookSmart to allow unlimited usage of this API. It's too large of a project for them to just let people use it. It's simple economics. They may not be investing the computing resources into this project's web spidering software, but it's still using TONS of resources to keep this data catalogued and readily accessible.
Re:search.msn.com is the future by Anonymous Coward · 2003-04-19 16:50 · Score: 2, Insightful

It's not as bad as you make it out to be. They do point out (in fine print) that it is a "featured" site. They list the "featured" sites first, then the sponsored links, and then general web hits. And they mark each category. I guess that the only difference between featured and sponsored is in the price. All this was far from obvious to me when I saw the results at first (being used to Google), but I imagine that if you used them on a daily basis you would quickly become used to skipping down to the real results.
Read the fine print by anon*127.0.0.1 · 2003-04-19 16:52 · Score: 2, Insightful

It's a "featured site". Meaning it's a site from Microsoft, a Microsoft partner, or someone who paid some money to Microsoft for the privilege.

Nothing that other search sites don't do. They just mark their paid adverts a little more obviously.

--
I am NOT a man!
I am a free number!
Re:Altruistic? by eversunsoft · 2003-04-19 18:36 · Score: 4, Insightful

Well, because web searching, to this day, has been a free service. Supposing that the index is built as the result of donated searches, it would be ethically in very bad taste to act against this trend.
Of course, I am the first one to question this trend. Has anyone else considered the possibility that one day we'll wake up, and notice that google is charging for access to its basic searching services?
I for one, would probably pay. I have become so dependent on it. What price? That's a good question...
CPU cycles are NOT wasted or "available" by pe1chl · 2003-04-19 20:52 · Score: 2, Insightful

The common point made by these "distributed" software authors is that there are "wasted" CPU cycles in your computer that you could donate to a project for free.
However, that is not true at all! CPU cycles are not wasted. When the CPU has nothing to do, it sleeps. At least in a modern operating system (i.e. about everything after Windows 95).

By "donating your wasted CPU cycles" you will actually increase the power consumption of your computer. This will be very noticeable in a laptop, but when you watch the CPU temperature in your home system you will also see a noticeable increase in temperature between an idle system and a system running a computationally intensive background task.

Probably the effect will be worse for things like keysearches, prime number searches, SETI etc than for this GRUB bot, because that probably also spends time waiting for the network (and thus returns the CPU to idle).

So before you "donate your wasted CPU cycles", please realize that this will actually cost you money.
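
The cost is simple arithmetic: extra watts times hours, converted to kilowatt-hours, times the electricity price. The sketch below uses hypothetical figures (50 W of extra draw, a machine left on all year, $0.10 per kWh) purely to show the calculation; none of them come from the thread or from a measurement.

```python
def extra_cost(extra_watts, hours, price_per_kwh):
    """Cost of drawing `extra_watts` above idle for `hours` hours."""
    kwh = extra_watts * hours / 1000.0
    return kwh * price_per_kwh


# Hypothetical inputs, chosen only for illustration.
EXTRA_WATTS = 50          # busy-versus-idle difference (assumed)
HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.10      # dollars (assumed)

yearly = extra_cost(EXTRA_WATTS, HOURS_PER_YEAR, PRICE_PER_KWH)
```

To get a real number, measure the busy-versus-idle difference on your own machine and use the rate from your own electricity bill.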
Re:Haiku :-) by Anonymous Coward · 2003-04-19 23:34 · Score: 1, Insightful

I love being a Slashdot subscriber - it gives me fifteen minutes to figure out a good joke before anyone has a chance to post!

OK. 15 minutes are up, and we are STILL waiting for your "Good" joke.
Re:So THAT'S What It Is... by Anonymous Coward · 2003-04-20 04:42 · Score: 1, Insightful

Was it really so hard to go to the url in the user agent to see what it was?