Nutch: An Open Source Search Engine

Posted by ryuzaki0 on Wednesday August 13, 2003 @08:51AM from the but-will-it-matter dept.

Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch. In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising? Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.

Google? by devphaeton · 2003-08-13 08:54 · Score: 5, Informative

Last I heard, Google still doesn't accept bribes for page ranking.

Unobtrusive adverts in the right-hand column notwithstanding.

--

do() || do_not(); // try();
1. Re:Google? by fireboy1919 · 2003-08-13 09:15 · Score: 2, Informative
  
  Yeah, they've been known to do that when people build server farms to try to influence Google's rankings. It is in their best interest to ensure that the pages that people actually want to see come up first, not the advertisers' pages.
  
  That's why people use Google. If they stacked the deck in favor of places people don't care about (advertisers' pages, for instance), then we'd all jump ship and use another search engine.
  
  They're like the Swiss and Consumer Reports. Part of the reason they make money is neutrality, and they won't make as much if they're not.
  
  --
  Mod me down and I will become more powerful than you can possibly imagine!
2. Re:Google? by Anonymous Coward · 2003-08-13 09:48 · Score: 1, Informative
  
  Yes, but Google does delist pages when threatened with lawsuits.
  
  Remember the Scientologists?
3. Re:Google? by RedWizzard · 2003-08-13 13:05 · Score: 2, Informative
  
  See this article on Slate for some interesting ideas on why Google's page-ranking system is being undermined by the evolution of e-commerce and price-comparison portals.
  That article has already been dealt with on Slashdot (here). Using a bit of intelligence when searching will avoid the problems cited.
Accuracy is relevance by AtariAmarok · 2003-08-13 08:56 · Score: 2, Informative

To me, accuracy is the most important "Relevance".

The problem with Google is that there are errors in it: you ask for something and sometimes you get something else.

A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.

Since it seems like Google will never fix this problem, I'm looking forward to something with all of Google's great features, plus accuracy.

--
Don't blame Durga. I voted for Centauri.
1. Re:Accuracy is relevance by binaryDigit · 2003-08-13 09:06 · Score: 3, Informative
  
  A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.
  
  This is a bit of a misrepresentation. Google will toss the words 'to', 'be' and 'or', so you effectively end up searching on 'not'. It does this to eliminate words that show up too frequently and to make searches faster (and because the word 'or' is overloaded as an operator). If you really want that text, then either quote the whole thing or place a '+' in front of those words, which will give you exactly what you're looking for. So there is no problem with its accuracy when you understand the proper way to ask it for something.
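
A rough sketch of the behaviour described in that comment, for anyone who wants to see it spelled out. This is not Google's actual code: the stop-word list holds only the three words named above, and the names `STOP_WORDS`, `TOKEN` and `effective_terms` are made up for illustration.

```python
import re

# Hypothetical stop-word list: only the three words the comment names.
STOP_WORDS = {"to", "be", "or"}

# A token is either an optionally '+'-prefixed quoted phrase or a bare word.
TOKEN = re.compile(r'\+?"[^"]*"|\S+')

def effective_terms(query):
    """Return the terms that survive stop-word removal."""
    kept = []
    for token in TOKEN.findall(query):
        forced = token.startswith("+")
        body = token[1:] if forced else token
        if len(body) >= 2 and body.startswith('"') and body.endswith('"'):
            # A quoted phrase is kept whole, stop words included.
            kept.append(body.strip('"'))
        elif forced or body.lower() not in STOP_WORDS:
            # A '+' prefix forces a word to be kept even if it is a stop word.
            kept.append(body)
    return kept
```

Under those assumptions the bare query collapses to the single term 'not', while the quoted form survives as one phrase.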
2. Re:Accuracy is relevance by randyest · 2003-08-13 10:42 · Score: 3, Informative
  
  If you reach into the freezer without really looking, thinking that you are grabbing a freezer-pop, and get an 8 month old leg of lamb instead, are you going to shrug and eat the lamb anyway?
  
  Of course not. I'd put it back and try more carefully to get what I want. I, what's the word I'm looking for, . . . wait for it . . . refine my search :)
  
  Regarding your comments above about Google's inaccuracy: I searched for +"to be or not to be" and consider the first page of 10 hits to definitely be 100% "correct". In fact, all of the 104,00 results that I checked (about 50, hehe) are 100% correct in that the sites on the list, or the sites linking to the sites on the list, contain the phrase "to be or not to be". Check the '2bee or nottoobee' link in Google's cache and where you normally see the search term highlight colors, you'll see
  
  These terms only appear in links pointing to this page: to be or not to be
  
  Just because you wanted "Shakespeare" doesn't mean that "Shakespeare" is any more correct as an "answer" to "to be or not to be". If it were more popular (on the web), I'm confident that it would be higher on the list. That is, whether we like it or not, on the current www there are exactly 3 things more relevant to that famous phrase than Shakespeare, and they are, in order: barium enemas, BeOS, and a kids' grammar game starring a bee. Or, more accurately and revealingly: an article about barium enemas titled "To BE or Not to BE?", an article about BeOS titled "TO Be OR NOT TO be?", and a kids' grammar game starring a bee called "2Bee or Nottoobee" which is linked to by sites containing the phrase "to be or not to be" in or near those links.
  
  Lucky for us that ol' Bill is still in the top 10 at all, I'd say.
  
  --
  everything in moderation
Re:Hook it up to slashdot! by Anonymous Coward · 2003-08-13 08:57 · Score: 3, Informative

Just use Google. Search for "SEARCH-STRING site:slashdot.org"
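
If you'd rather script that than type it, here is a minimal sketch that builds the search URL. The `site:` operator and the `q` parameter on `https://www.google.com/search` are real; the helper name `site_search_url` is just an illustration.

```python
from urllib.parse import urlencode

def site_search_url(search_string, site="slashdot.org"):
    """Build a Google search URL restricted to one site via the site: operator."""
    query = f"{search_string} site:{site}"
    # urlencode percent-escapes the colon and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})

print(site_search_url("nutch search engine"))
```

Pass a different `site` argument to restrict the search to some other domain.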
Lucene (index and search engine) by Anonymous Coward · 2003-08-13 09:18 · Score: 1, Informative

Check out Lucene, the indexing and search engine used by Nutch. From what I've heard, Nutch is mainly the spider/crawler used to gather documents.
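
For anyone unsure what the "indexing" half of that split means, here is a toy inverted index, the general data structure behind index-and-search engines. It is only a sketch under my own assumptions and says nothing about Lucene's real API or internals; `build_index` and `search` are hypothetical names.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the first word's postings and intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

The crawler's job, in this picture, is just to produce the `docs` mapping; the index then answers queries without rereading the documents.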
1. Re:Lucene (index and search engine) by cpeterso · 2003-08-13 09:45 · Score: 4, Informative
  
  Lucene and Nutch are related:
  
  http://scriptingnews.userland.com/2003/08/13#When: 12:20:53PM
  
  Paul Nakada, via email: "It appears that the coding muscle for Nutch is Doug Cutting, the author of Lucene, an Apache Project open source search engine. We use it here at salesforce and have a huge amount of respect for Doug's coding."
  
  --
  cpeterso
Anyone ever heard of grub? by nadadogg · 2003-08-13 09:21 · Score: 2, Informative

Grub is another open-source search engine. I have the client running right now; it's nice and distributed, and I think this kind of idea is great.

--
i use linux and windows oh god how can i have an opinion
Re:Hardware? by AsparagusChallenge · 2003-08-13 09:32 · Score: 2, Informative

Don't worry too much. This is software, not a service. When available it may be implemented by someone and be the infrastructure of a company, which may then provide bugfixes and development to the original project. Or it may not. Who knows.
I wouldn't count on it by Wesley+Felter · 2003-08-13 09:36 · Score: 2, Informative

Nutch has four developers, one of whom is Doug Cutting who wrote several indexing engines. They count Alexa founder Brewster Kahle as a "friend" and are sponsored by Overture.
Re:Patents. by alwayslurking · 2003-08-13 09:50 · Score: 2, Informative

I still don't think you can describe Google's setup as distributed. They have multiple data centers, each running a very large cluster and containing a similar, but not identical, snapshot of the database, indices, etc. A truly distributed engine is likely to require an innovative step or three to emulate that with no centralised control, unknown hardware and bandwidth resources, and the real possibility that some "clients" may be corrupted by their owners to distort results. I haven't got any arguments about the real value of this effort though. Google has done nothing to lose my trust and seems to be run with retaining people's trust as an active ambition. The closest they came to worrying me was crippling for China, but that was really a no-win situation, IMHO.
Re:not a good idea.... by curunir · 2003-08-13 10:01 · Score: 2, Informative

You've entirely missed the point of this project.

I highly doubt that Nutch is going to offer an alternative to Google in the area of web search. What they seem to be doing is offering an alternative in the area of Enterprise search.

Currently, the company that I work for pays Verity (used to be Inktomi, before that Infoseek) tens of thousands of dollars a year for the use of their software. We use their software to make our own site searchable. If Nutch offered us a free alternative to our Ultraseek server, we'd definitely be interested.

We don't have to worry about anyone "googlebombing" our search collections because, well, we create all the content that goes into those collections. We'd love it if the algorithm that determined rankings was open-source. That way, we could change it to suit our specific needs if we thought it would help return more relevant results. There are currently a number of undesirable phenomena that we live with or work around because the mechanics of the problem are buried within proprietary Ultraseek code.

Google is the best of the best in web search and I don't think anyone short of MS is interested in challenging them for that. But 'search engine' in this case means something entirely different.

--
"Don't blame me, I voted for Kodos!"
Irrational fear of money by KalvinB · 2003-08-13 10:02 · Score: 3, Informative

That's nice that they want to open source the engine but that's the least of a search engine. They're going to need multiple high end servers to process the searches and plenty of bandwidth to get the results to the users.

How do they plan to pay for that? Apparently advertising is out. And we just had another monephobe complaining about lack of funds for his accounting software who expected people to donate because he couldn't figure out that maybe, just maybe he should find a way to sell his product in some form while also keeping one form free. I can get RedHat for free OR pay money to get a hard copy with some bonus stuff. Net result is that RedHat makes money and everyone is happy. Those who refuse to pay don't have to and those who are willing to pay have a reason to. Most people are not going to just give you money out of the goodness of their heart and accept nothing in return if they don't have to. Why do you think PBS gives you gifts with your donations?

I'd be more impressed with such undertakings if the owners weren't convinced the bandwidth fairy was real and that money will fall from the sky like manna.

When someone comes along who recognizes that the bandwidth fairy doesn't exist and that money needs to be acquired through marketing to get any real amount, then I'll think twice before laughing it off.

Free is a pretty dream but free don't pay the bills.

Ben

--
Work Safe Porn
Re:Hook it up to slashdot! by randyest · 2003-08-13 10:13 · Score: 3, Informative

167 posts and no mention of ht://dig? It's a great open source search engine, and I've been using it daily (well, cron really uses it now, not me) to spider about 100 sites on my intranet, which has servers all over the world.

While not currently designed for massive whole-web spidering (it's aimed at single websites or intranets), ht://dig is a great starting point (and a lot further along than the Nutch 'nascent effort' mentioned in the story). Some database optimization to ht://dig seems easier than starting over with Nutch. Plus, the name 'Nutch' sucks.

--
everything in moderation
Re:Hook it up to slashdot! by lvdrproject · 2003-08-13 12:34 · Score: 3, Informative

Interestingly enough, if i had read this story a few months ago, i would've said "Poppycock! Google should be good enough for anyone!". But lately i've been noticing that Google turns up a lot of garbage results. Like, if you search for something "generic" (like, no brand name or product name or anything like that), you're going to find a whole bunch of results that just lead to pop-up search sites.
For example, look at the results for the search 'convert wmv mpeg'. The first three results lead to the same exact search site. (Whether they have pop-ups or not, i can't tell, because i block them.) The fourth result is another search site. And then the last three are the same as the first three.
Of course, this obviously works with stuff you'd expect it to, like 'mp3s' and 'warez' and 'porn', but it works with legitimate stuff too. I wonder if there'll be anything to combat this trend, whether it be implemented by Google or by someone else....
Re:Hook it up to slashdot! by msgregory@earthlink. · 2003-08-13 12:45 · Score: 2, Informative

I've noticed that searching for Eric S. Raymond's home page brings up his actual home page third or fourth in the listing. I don't know if that means Google is on its way to going downhill or what. The first listing it brings up doesn't appear to have anything to do with ESR. I don't even think his name appears anywhere on the page.
Shameless plug for SWISH++ by pauljlucas · 2003-08-13 15:34 · Score: 3, Informative

I see this project as a competitor to shrink-wrapped search engines, i.e. the Google appliance or maybe even Folio-based products. Typically corporations have many documents that need to be indexed and searchable to their needs.
SWISH++ fills this niche nicely. It can index hundreds of thousands of documents very quickly, indexes not only HTML, but e-mail, news, man pages, LaTeX, RTF, and even the ID3 tags of MP3 files; can apply filters on-the-fly (convert PDF to text, then index that), can do incremental indexing, and can run as a multi-threaded search daemon.

--
If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.