Nutch: An Open Source Search Engine
Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch.
In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?
Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.
Last I heard, Google still doesn't accept bribes for page ranking.
Unobtrusive adverts in the right-hand column notwithstanding.
do() || do_not();
To me, accuracy is the most important "Relevance".
The problem with Google is that there are errors in it: you ask for something and sometimes you get something else.
A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.
Since it seems like Google will never fix this problem, I'm looking forward to something with all of Google's great features, plus accuracy.
Don't blame Durga. I voted for Centauri.
Just use Google. Search for "SEARCH-STRING site:slashdot.org"
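For anyone who wants to script that, here is a minimal sketch of building a site-restricted query URL. The function name `site_search_url` and the `https://www.google.com/search?q=` endpoint are my own assumptions for illustration, not anything from the post above; only the `site:` operator comes from the suggestion itself.

```python
from urllib.parse import urlencode


def site_search_url(search_string, site="slashdot.org"):
    """Build a search URL restricted to one site via the site: operator.

    The endpoint below is an assumption for illustration.
    """
    query = f"{search_string} site:{site}"
    # urlencode escapes spaces as '+' and ':' as '%3A'
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_search_url("nutch search engine"))
```

Paste the printed URL into a browser, or change the `site` argument to restrict the search to a different domain.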
Check out Lucene, the indexing and search engine used by Nutch. From what I've heard, Nutch is mainly the spider/crawler used to gather documents.
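To give a rough idea of what "indexing and search" means here, below is a toy inverted index. To be clear, this is not Lucene's API or how Nutch does it; `build_index`, `search`, and the sample documents are all made up for illustration, and a real engine adds tokenization, ranking, and on-disk storage on top of this.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents
docs = {
    1: "nutch is an open source search engine",
    2: "lucene is an indexing and search library",
    3: "grub is a distributed crawler",
}
index = build_index(docs)
print(search(index, "search engine"))  # ids of docs containing both terms
```

The crawler's job, in this picture, is just to produce the `docs` mapping; the index is what makes lookups fast afterwards.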
Grub is another open-source search engine. I have the client running right now; it's nice and distributed, and I think this kind of idea is great.
i use linux and windows oh god how can i have an opinion
Don't worry too much. This is software, not a service. When available it may be implemented by someone and be the infrastructure of a company, which may then provide bugfixes and development to the original project. Or it may not. Who knows.
Nutch has four developers, one of whom is Doug Cutting who wrote several indexing engines. They count Alexa founder Brewster Kahle as a "friend" and are sponsored by Overture.
I still don't think you can describe Google's setup as distributed. They have multiple data centers, each running a very large cluster and containing a similar, but not identical, snapshot of the database, indices, etc. A truly distributed engine is likely to require an innovative step or three to emulate that with no centralised control, unknown hardware and bandwidth resources, and the real possibility that some "clients" may be corrupted by their owners to distort results. I haven't got any arguments about the real value of this effort though. Google has done nothing to lose my trust and seems to be run with retaining people's trust as an active ambition. Closest they came to worrying me was crippling for China, but that was really a no-win situation, IMHO.
You've entirely missed the point of this project.
I highly doubt that Nutch is going to offer an alternative to Google in the area of web search. What they seem to be doing is offering an alternative in the area of Enterprise search.
Currently, the company that I work for pays Verity (used to be Inktomi, before that Infoseek) tens of thousands of dollars a year for the use of their software. We use their software to make our own site searchable. If Nutch offered us a free alternative to our Ultraseek server, we'd definitely be interested.
We don't have to worry about anyone "googlebombing" our search collections because, well, we create all the content that goes into those collections. We'd love it if the algorithm that determined rankings was open-source. That way, we could change it to suit our specific needs if we thought it would help return more relevant results. There are currently a number of undesirable phenomena that we live with or work around because the mechanics of the problem are buried within proprietary Ultraseek code.
Google is the best of the best in web search and I don't think anyone short of MS is interested in challenging them for that. But 'search engine' in this case means something entirely different.
"Don't blame me, I voted for Kodos!"
That's nice that they want to open source the engine but that's the least of a search engine. They're going to need multiple high end servers to process the searches and plenty of bandwidth to get the results to the users.
How do they plan to pay for that? Apparently advertising is out. And we just had another monephobe complaining about lack of funds for his accounting software who expected people to donate because he couldn't figure out that maybe, just maybe he should find a way to sell his product in some form while also keeping one form free. I can get RedHat for free OR pay money to get a hard copy with some bonus stuff. Net result is that RedHat makes money and everyone is happy. Those who refuse to pay don't have to and those who are willing to pay have a reason to. Most people are not going to just give you money out of the goodness of their heart and accept nothing in return if they don't have to. Why do you think PBS gives you gifts with your donations?
I'd be more impressed with such undertakings if the owners weren't convinced the bandwidth fairy was real and that money will fall from the sky like manna.
When someone comes along who recognizes that the bandwidth fairy doesn't exist and that money needs to be acquired through marketing to get any real amount, then I'll think twice before laughing it off.
Free is a pretty dream but free don't pay the bills.
Ben
Work Safe Porn
167 posts and no mention of ht://dig? It's a great open source search engine, and I've been using it daily (well, cron really uses it now, not me) to spider about 100 sites on my intranet, which has servers all over the world.
While not currently designed for massive whole-web spidering (it's aimed at single websites or intranets), ht://dig is a great starting point (and a lot further along than the Nutch 'nascent effort' mentioned in the story). Some database optimization to ht://dig seems easier than starting over with Nutch. Plus, the name 'Nutch' sucks.
everything in moderation
For example, look at the results for the search 'convert wmv mpeg'. The first three results lead to the exact same search site. (Whether they have pop-ups or not, I can't tell, because I block them.) The fourth result is another search site. And then the last three are the same as the first three.
Of course, this obviously works with stuff you'd expect it to, like 'mp3s' and 'warez' and 'porn', but it works with legitimate stuff too. I wonder if there'll be anything to combat this trend, whether it be implemented by Google or by someone else....
I've noticed that searching for Eric S. Raymond's home page brings up his actual home page third or fourth in the listing. I don't know if that means Google is on its way to going downhill or what. The first listing it brings up doesn't appear to have anything to do with ESR. I don't even think his name appears anywhere on the page.
If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.