Slashdot Mirror


Nutch: An Open Source Search Engine

Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch. In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising? Nutch is clearly in their initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.

14 of 291 comments

  1. Patents. by Christopher+Thomas · · Score: 5, Interesting

    I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.

    1. Re:Patents. by Feztaa · · Score: 4, Insightful

      I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.

      Hmmm, I just realized something... with patents, you end up stepping on people's toes. Without patents, you get to stand on their shoulders. Which do you think is the better vantage point?

    2. Re:Patents. by AstroDrabb · · Score: 5, Insightful

      Does it matter? There are no innovations. ALL knowledge is based on prior knowledge. Look in any field of study and you will soon learn that advancement is not possible without prior knowledge. What we know about computer science today is thanks to the knowledge gained by those before us. It is this way in EVERY field: Astronomy, Medical Science, Mathematics, etc. Humankind does not grow by leaps and bounds; we grow by incremental improvements. I have not heard of ONE discovery/innovation in which the discoverer/innovator was not educated in prior knowledge. Now the question we need to ask ourselves, and especially the government, is: do we really want the advancement of our society to be hindered by the monetary interests of the greedy?

      --
      If Tyranny and Oppression come to this land,
      it will be in the guise of fighting a foreign enemy. -James Madison
  2. Google? by devphaeton · · Score: 5, Informative

    Last I heard, Google still doesn't accept bribes for page ranking.

    Unobtrusive adverts in the right-hand column notwithstanding.

    --


    do() || do_not(); // try();
  3. Biased listings by Champaign · · Score: 4, Insightful
    I think many commercial search engines have learned that biasing themselves toward sites that have paid them is a good way to erode consumer confidence and damage their readership/userbase. Just as newspapers have to at least provide the image of objectivity, the same demands are on search engines.

    I'm quite comfortable with how Google does this (present commercial links clearly marked to the side), and am not convinced a non-commercial (open source) alternative is needed.

  4. Seems pretty pointless by cryptochrome · · Score: 4, Insightful

    Free and open code is good and all... but the one real cost of a search engine is RUNNING it. It requires a far from trivial amount of bandwidth and hardware, and somebody has to pay for all of it. Unless someone comes up with a novel P2P solution (and many are trying) it just won't happen.

    What they should be doing is pressuring the existing search engine companies for some integrity.

    --

    ---If you can't trust a nerd, who can you trust?

  5. Can this work? by jmkaza · · Score: 4, Insightful

    I think the idea is good in principle, but could it actually succeed? Google gets hit with millions of requests each day. They've got hardware that can support thousands of slashdottings a day and a fat pipe to feed all of that info out. That takes a lot of money. Financing an open source project is difficult enough, but financing an open source service such as that would seem next to impossible. Ideas?

    The other major problem would be that, with the ranking criteria being available for all to see, it would be relatively simple to manipulate page rankings.

  6. Search engine game is NOT over by AtariAmarok · · Score: 4, Insightful

    "Google has WON the search engine war, probably forever. Find some other mountain to climb, guys."

    At one time, Oldsmobile won the auto company wars. Where are they now?

    IBM ruled the PC roost. Hmmmm....

    Command-line OS's were king. But now???

    Altavista and infoseek and Lycos were search engine kings at one time. Whither this trio?

    The point is, it is not over.

    --
    Don't blame Durga. I voted for Centauri.
  7. A Tough Challenge by Cloudmark · · Score: 5, Interesting

    One of the biggest issues with running a search-engine, open-source or otherwise, is that you can't eliminate bias in the results. No matter what scheme you put in place to handle rankings, someone will find a way to take advantage of it. It's a fact of any major system - there's always a way to twist it. Part of the challenge that Google and similar sites face is that they have to work constantly to protect themselves from systems designed to take advantage of their algorithm. While a completely unbiased search service would be nice, I think it would require the impossible. It would require that no one out here took advantage of it to further their own interests, be they political, commercial, or otherwise. That's fairly unlikely.

    Most of the major engines today, including Google, make an effort to prevent horribly unbalanced results (see the recent controversy over blogs outweighing professional sites in the rankings due to linking and other factors). Some even admit (again, Google does) to manually messing with the rankings a little. If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to. That's in their own public docs. It's also discussed in Wired.

    I honestly don't know if open-source could do a better job. The algorithm might be better (likely, given the manpower), but would it really be that much fairer?

    --
    "Be proud to be a fighter" - Martial Arts Adage
  8. Re:just don't get it by cduffy · · Score: 4, Insightful

    Think about cryptosystems: The whole point about the really good ones is that you can know the algorithm, but still not break it. Granted, pulling that off for a search engine is prone to be much, much harder -- but I *do* believe it's well within the realm of possibility. Ambitious in the extreme? Certainly... but there's something to be said for high-risk-high-reward projects.

  9. Re:The purpose of a search engine by AVryhof · · Score: 4, Funny
  10. Re:Slimey adverts? by Anonymous Coward · · Score: 5, Insightful

    This project is the SOFTWARE to run a search engine. Not a corporation that needs to generate income to justify the resources required to run the search engine.

    Anyone could take this source code and with enough money, challenge Google.com as the top search engine.

    I see this project as a competitor to shrink-wrapped search engines, e.g. the Google appliance or maybe even Folio-based products. Typically corporations have many documents that need to be indexed and made searchable to suit their needs.

    I looked on the homepage, but it doesn't list what content it can index. I hope it can at least index PDFs and popular Office documents. Maybe even media files? And which XML fields get indexed? Or external metadata?

  11. Re:Lucene (index and search engine) by cpeterso · · Score: 4, Informative


    Lucene and Nutch are related:

    http://scriptingnews.userland.com/2003/08/13#When: 12:20:53PM

    Paul Nakada, via email: "It appears that the coding muscle for Nutch is Doug Cutting, the author of Lucene, an Apache Project open source search engine. We use it here at salesforce and have a huge amount of respect for Doug's coding."

  12. Search Engine Monoculture by peachawat · · Score: 5, Interesting

    Why is it that when it comes to OSes, everyone is bitching and screaming about how bad the monoculture created by Microsoft Windows is, but otherwise feels warm and fuzzy and swears to god that Google is and always will be the only search engine they use?

    The point is, are you really comfortable having one, and only one, effective search engine? No matter how well it searches?

    O'Reilly put it best:

    Actually, Nutch has no ambitions to dethrone Google. It's just trying to provide an open source reference implementation of search to help keep Google and other search engines honest, by letting people compare the results of an engine whose algorithms and methodologies are transparent and accessible. It also aims to give a platform for people outside of the search heavyweights to research new search algorithms.