Nutch: An Open Source Search Engine

Posted by ryuzaki0 on Wednesday August 13, 2003 @08:51AM from the but-will-it-matter dept.

Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch. In the age of weighted rankings on search engines for profit, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising? Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.

9 of 291 comments
Google? by devphaeton · 2003-08-13 08:54 · Score: 5, Informative

Last I heard, Google still doesn't accept bribes for page ranking.

Unobtrusive adverts in the right-hand column notwithstanding.

--

do() || do_not(); // try();
Re:Hook it up to slashdot! by Anonymous Coward · 2003-08-13 08:57 · Score: 3, Informative

Just use google. Search for "SEARCH-STRING site:slashdot.org"
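
The site-restricted search suggested above can also be built as a link. This is only a rough sketch: the `site:` operator comes from the comment, while the `https://www.google.com/search?q=` address and the helper name `site_search_url` are assumptions added for illustration.

```python
from urllib.parse import urlencode

def site_search_url(search_string, site):
    """Build a search URL that restricts results to a single site."""
    # Assumed endpoint; the comment itself only gives the query syntax.
    base = "https://www.google.com/search?"
    # urlencode takes care of escaping the space and the colon.
    return base + urlencode({"q": f"{search_string} site:{site}"})

print(site_search_url("nutch", "slashdot.org"))
```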
Re:Accuracy is relevance by binaryDigit · 2003-08-13 09:06 · Score: 3, Informative

A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.

This is a bit of a misrepresentation. Google will toss the words 'to', 'be', and 'or', so you effectively end up searching on 'not'. It does this to eliminate words that show up too frequently and to make searches faster (and because of the overloading of the word 'or'). If you really want that text, either quote the whole thing or place a '+' in front of those words, which will give you exactly what you're looking for. So there is no problem with its accuracy when you understand the proper way to ask it for something.
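
The behaviour described above can be sketched roughly as follows. This is a toy model, not Google's actual parser: the `STOP_WORDS` list and the function name `effective_terms` are made up for illustration, and only the three rules from the comment (drop common words, keep quoted phrases, keep '+'-prefixed words) are modelled.

```python
import re

# Hypothetical stop-word list; the real list is not given in the thread.
STOP_WORDS = {"to", "be", "or", "the", "a", "of"}

def effective_terms(query):
    """Return the terms a stop-word-dropping engine would actually search on."""
    terms = []
    # Match either a quoted phrase (optionally prefixed with '+') or a bare word.
    for phrase, word in re.findall(r'\+?"([^"]*)"|(\S+)', query):
        if phrase:
            terms.append(phrase)       # quoted phrase is kept whole
        elif word.startswith("+") and len(word) > 1:
            terms.append(word[1:])     # '+' forces the word to be kept
        elif word and word.lower() not in STOP_WORDS:
            terms.append(word)         # ordinary word that is not a stop word
    return terms

print(effective_terms('to be or not to be'))        # only 'not' survives
print(effective_terms('"to be or not to be"'))      # the whole phrase survives
print(effective_terms('+to +be +or not +to +be'))   # every word survives
```

Under these assumptions the unquoted query collapses to a single term, which is why its results look unrelated to the phrase.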
Re:Lucene (index and search engine) by cpeterso · 2003-08-13 09:45 · Score: 4, Informative

Lucene and Nutch are related:

http://scriptingnews.userland.com/2003/08/13#When: 12:20:53PM

Paul Nakada, via email: "It appears that the coding muscle for Nutch is Doug Cutting, the author of Lucene, an Apache Project open source search engine. We use it here at salesforce and have a huge amount of respect for Doug's coding."

--
cpeterso
Irrational fear of money by KalvinB · 2003-08-13 10:02 · Score: 3, Informative

That's nice that they want to open source the engine but that's the least of a search engine. They're going to need multiple high end servers to process the searches and plenty of bandwidth to get the results to the users.

How do they plan to pay for that? Apparently advertising is out. And we just had another monephobe complaining about lack of funds for his accounting software who expected people to donate because he couldn't figure out that maybe, just maybe he should find a way to sell his product in some form while also keeping one form free. I can get RedHat for free OR pay money to get a hard copy with some bonus stuff. Net result is that RedHat makes money and everyone is happy. Those who refuse to pay don't have to and those who are willing to pay have a reason to. Most people are not going to just give you money out of the goodness of their heart and accept nothing in return if they don't have to. Why do you think PBS gives you gifts with your donations?

I'd be more impressed with such undertakings if the owners weren't convinced the bandwidth fairy was real and that money will fall from the sky like manna.

When someone comes along who recognizes that the bandwidth fairy doesn't exist and that money needs to be acquired through marketing to get any real amount then I'll think twice before laughing it off.

Free is a pretty dream but free don't pay the bills.

Ben

--
Work Safe Porn
Re:Hook it up to slashdot! by randyest · 2003-08-13 10:13 · Score: 3, Informative

167 posts and no mention of ht://dig? It's a great open source search engine, and I've been using it daily (well, cron really uses it now, not me) to spider about 100 sites on my intranet, which has servers all over the world.

While not currently designed for massive whole-web spidering (it's aimed at single websites or intranets), ht://dig is a great starting point (and a lot further along than the Nutch 'nascent effort' mentioned in the story). Some database optimization to ht://dig seems easier than starting over with Nutch. Plus, the name 'Nutch' sucks.

--
everything in moderation
Re:Accuracy is relevance by randyest · 2003-08-13 10:42 · Score: 3, Informative

If you reach into the freezer without really looking, thinking that you are grabbing a freezer-pop, and get an 8 month old leg of lamb instead, are you going to shrug and eat the lamb anyway?

Of course not. I'd put it back and try more carefully to get what I want. I, what's the word I'm looking for, . . . wait for it . . . refine my search :)

Regarding your comments above about google inaccuracy: I searched for +"to be or not to be" and consider the first page of 10 hits to definitely be 100% "correct". In fact, all of the 104,00 results that I checked (about 50, hehe) are 100% correct in that the sites on the list, or the sites linking to the sites on the list, contain the phrase "to be or not to be". Check the '2bee or nottoobee' link in google's cache and where you normally see the search term highlight colors, you'll see

These terms only appear in links pointing to this page: to be or not to be

Just because you wanted "Shakespeare" doesn't mean that "Shakespeare" is any more correct as an "answer" to "to be or not to be". If it were more popular (on the web), I'm confident that it would be higher on the list. That is, whether we like it or not, on the current www there are exactly 3 things more relevant to that famous phrase than Shakespeare, and they are, in order: barium enemas, BeOS, and a kids' grammar game starring a bee. Or, more accurately and revealingly: an article about barium enemas titled "To BE or Not to BE?", an article about BeOS titled "TO Be OR NOT TO be?", and a kids' grammar game starring a bee called "2Bee or Nottoobee" which is linked to by sites containing the phrase "to be or not to be" in or near those links.

Lucky for us that ol' Bill is still in the top 10 at all, I'd say.
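
The "terms only appear in links pointing to this page" effect can be illustrated with a toy index that credits link text to the page being linked to. This is a minimal sketch of the general idea, not how Google is actually implemented; the page names (`game.example`, `fan.example`) and the functions `build_index` and `search` are invented for the example.

```python
from collections import defaultdict

def build_index(pages):
    """pages maps url -> {"text": str, "links": [(anchor_text, target_url), ...]}."""
    index = defaultdict(set)
    for url, page in pages.items():
        # Words in the page body count for the page itself.
        for word in page["text"].lower().split():
            index[word].add(url)
        # Words in link text count for the page the link points to.
        for anchor, target in page["links"]:
            for word in anchor.lower().split():
                index[word].add(target)
    return index

def search(index, query):
    """Return the pages credited with every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

pages = {
    "game.example": {"text": "2Bee or Nottoobee grammar game", "links": []},
    "fan.example": {"text": "fun links",
                    "links": [("to be or not to be", "game.example")]},
}
index = build_index(pages)
print(search(index, "to be or not to be"))
```

Here the game page matches the phrase even though its own text never contains it, because the words arrive through the link pointing at it.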

--
everything in moderation
Re:Hook it up to slashdot! by lvdrproject · 2003-08-13 12:34 · Score: 3, Informative

Interestingly enough, if i had read this story a few months ago, i would've said "Poppycock! Google should be good enough for anyone!". But lately i've been noticing that Google turns up a lot of garbage results. Like, if you search for something "generic" (like, no brand name or product name or anything like that), you're going to find a whole bunch of results that just lead to pop-up search sites.
For example, look at the results for the search 'convert wmv mpeg'. The first three results lead to the same exact search site. (Whether they have pop-ups or not, i can't tell, because i block them.) The fourth result is another search site. And then the last three are the same as the first three.
Of course, this obviously works with stuff you'd expect it to, like 'mp3s' and 'warez' and 'porn', but it works with legitimate stuff too. I wonder if there'll be anything to combat this trend, whether it be implemented by Google or by someone else....
Shameless plug for SWISH++ by pauljlucas · 2003-08-13 15:34 · Score: 3, Informative

I see this project as a competitor to shrink wrapped search engines. IE google appliance or maybe even Folio based products. Typically corporations have many documents that need to be indexed and searchable to their needs.
SWISH++ fills this niche nicely. It can index hundreds of thousands of documents very quickly, indexes not only HTML, but e-mail, news, man pages, LaTeX, RTF, and even the ID3 tags of MP3 files; can apply filters on-the-fly (convert PDF to text, then index that), can do incremental indexing, and can run as a multi-threaded search daemon.

--
If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.