Nutch: An Open Source Search Engine
Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch.
In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?
Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.
I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.
google is already ideal... the weight of search results is not sold, just text ads.
people are already 'googlebombing' to try and get better rankings by signing up tons of domains and cross linking them all with the keyword that they want to be #1...
if the algorithm that decides who gets #1 were public, then the best possible strategy to cheat the system could be devised... instead of paying the search engines for weight, you would be paying web developers to make the search engine think you were #1. and as a web developer i feel that.... oh... wait, proceed.
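To make the cross-linking trick concrete, here is a rough sketch of how a link farm can inflate a page's score under a link-based ranking. It uses a simplified PageRank-style power iteration, which is an assumption for illustration only and not Google's actual algorithm. The domain names, the farm size of 20, the damping factor of 0.85, and the hub-and-spoke farm layout are all made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over {page: [outlinked pages]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: nobody links to spam.example, so it ranks last.
base = {
    "a.example": ["b.example", "rival.example"],
    "b.example": ["a.example", "rival.example"],
    "rival.example": ["a.example"],
    "spam.example": ["a.example"],
}

# The same web after the spammer registers 20 domains that all link to
# spam.example, which in turn links only back to the farm.
farm = [f"farm{i}.example" for i in range(20)]
with_farm = dict(base)
with_farm["spam.example"] = farm
for page in farm:
    with_farm[page] = ["spam.example"]

before = pagerank(base)
after = pagerank(with_farm)

for name in ("rival.example", "spam.example"):
    print(name, round(before[name], 4), "->", round(after[name], 4))
```

In this toy setup spam.example goes from below rival.example to well above it without a single honest site linking to it. Real engines add defenses this sketch ignores.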
One of the biggest issues with running a search-engine, open-source or otherwise, is that you can't eliminate bias in the results. No matter what scheme you put in place to handle rankings, someone will find a way to take advantage of it. It's a fact of any major system - there's always a way to twist it. Part of the challenge that Google and similar sites face is that they have to work constantly to protect themselves from systems designed to take advantage of their algorithm. While a completely unbiased search service would be nice, I think it would require the impossible. It would require that no one out here took advantage of it to further their own interests, be they political, commercial, or otherwise. That's fairly unlikely.
Most of the major engines today, Google included, make an effort to prevent horribly unbalanced results (see the recent controversy over blogs outweighing professional sites in the rankings due to linking and other factors). Some even admit (again, Google does) to manually messing with the rankings a little. If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to. That's in their own public docs. It's also discussed in Wired.
I honestly don't know if open-source could do a better job. The algorithm might be better (likely, given the manpower), but would it really be that much fairer?
"Be proud to be a fighter" - Martial Arts Adage
Ooh, what's this?
Overture Research has donated hardware and helped to fund development.
So, even an "open source," "unbiased" search engine is funded by a commercial search organization.
Let's see where the funding is coming from. The project is funded by Overture, which is to be bought by Yahoo. More info is here. Hmm.. So I guess Yahoo needs a revival...
I think having an open source search engine that people can modify and deploy would be an excellent thing, and here is why. Currently, Google has the complete power to highlight or censor anything on the web. So far, they have used this power wisely, but that's no guarantee that it'll always be so. If they go public, you may find this power being used to increase the shareholders' wealth, rather than held to the highest standards of fairness as it is today.
With that in mind, how would this project help? It would allow webmasters to quickly & easily modify it for their needs, and deploy their own niche engines; in other words, Google would be supplemented by 10,000 niche search engines, each focusing on a specific field (microsoft propaganda, for instance). This would create a balance of power, ensuring that no single search engine accumulates an insane amount of control over the web as a whole.
I made a PHP/MySQL library that prevents SQL injection & makes coding easier!
See this article on Slate for some interesting ideas on why Google's page-ranking system is being undermined by the evolution of ecommerce and price-comparing portals.
I have found there are just two ways to go.
It all comes down to livin' fast or dyin' slow. -REK, Jr.
Why is it that when it comes to operating systems, everyone is bitching and screaming about how bad the monoculture created by Microsoft Windows is, but otherwise feels warm and fuzzy and swears to God that Google is and always will be the only search engine they use?
The point is, are you really comfortable having one, and only one, effective search engine? No matter how well it searches?
O'Reilly put it best
Actually, Nutch has no ambitions to dethrone Google. It's just trying to provide an open source reference implementation of search to help keep Google and other search engines honest, by letting people compare the results of an engine whose algorithms and methodologies are transparent and accessible. It also aims to give a platform for people outside of the search heavyweights to research new search algorithms.
I was looking over the site and a number of things concerned me.
Firstly, the choice of Java. Personally I have no gripe about this, and I read that a choice was made to use language-independent formats, which is a good idea. My main concern is scaling up and distributing over multiple machines.
At present my educated guess is that a project on this scale, in Java, would still be best run on a `hardware base as uniform as possible', like UltraSparc 450s with a fibre backplane.
My second concern is that there is so much choice of indexing and searching techniques that there are sure to be some problems due to patent restrictions.
Just browsing the US patent office gave me a couple of possible patent nasties:
6,463,428 and 6,278,992. (And about 10 others I glanced at...)
Lastly, the DB. In the short time I've been looking at the code, it seems to me that a choice was made to implement a DB built for the problem. Although this could be a good thing, it is usually better to reuse existing products. I found Sleepycat (DB4) to match the requirements. And if the choice is final, read this. [1]
I hope these comments are useful to somebody at least.
[1] http://www.xlnt-software.com/xml_dl.html
'I am become Shiva, destroyer of worlds'