Slashdot Mirror


Nutch: An Open Source Search Engine

Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch. In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising? Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.

27 of 291 comments

  1. Hook it up to slashdot! by FortKnox · · Score: 1, Insightful

    The slashdot search page could definitely use this kinda technology!

    --
    Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!
    1. Re:Hook it up to slashdot! by Steven+Blanchley · · Score: 2, Insightful

      No, many comments don't end up getting indexed by Google, and recent discussions aren't indexed at all. I've tried that method in the past with little success.

  2. Slimey adverts? by Acidic_Diarrhea · · Score: 3, Insightful
    Yes, having advertising affecting search results is not good for the end user but (and I'm just bringing this up as a discussion topic), in what other ways can a search engine make money? It's clear that running a search engine has costs associated with it. To offset these costs, it seems like advertising is the only way to go. Now I can see that some search engines handle this in a more "slimey" way than others (I am happy with Google) but this project seems to want to avoid advertising at all costs. Where does the money come from then?

    Also of note is that companies can still influence search engines in slimey ways - Google can be manipulated to make a page rank higher, although Google keeps an eye on this activity and works around it.

    --
    I hate liberals. If you are a liberal, do not reply.
    1. Re:Slimey adverts? by Anonymous Coward · · Score: 5, Insightful

      This project is the SOFTWARE to run a search engine. Not a corporation that needs to generate income to justify the resources required to run the search engine.

      Anyone could take this source code and with enough money, challenge Google.com as the top search engine.

      I see this project as a competitor to shrink-wrapped search engines, e.g. the Google appliance or maybe even Folio-based products. Typically corporations have many documents that need to be indexed and made searchable to suit their needs.

      I haven't found this on the homepage: it doesn't list what content it can index. I hope it can at least index PDFs and popular Office documents. Maybe even media files? And which XML fields get indexed? Or external metadata?

    2. Re:Slimey adverts? by Blue+Lozenge · · Score: 2, Insightful
      Yes, having advertising affecting search results is not good for the end user but (and I'm just bringing this up as a discussion topic), in what other ways can a search engine make money?

      Uhh... how about having advertising that does not affect search results? You see... ads on Google are relevant to your search criteria, yet are separate from the results.

  3. Seems like /. by darkstar949 · · Score: 1, Insightful

    This seems to me like the /. moderation system, with the pages being ranked based upon how the user feels about the site.
    However, I could see some disadvantages to the system depending upon how it is set up, because one person could keep dinging a site to get its score to drop down.
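
    One common guard against that repeat-dinging problem is to count only each user's latest vote per site. A minimal sketch, assuming every vote carries a user identity; the name `site_scores` and the vote tuples are made up for illustration and are not taken from Nutch or from the /. moderation code:

```python
from collections import defaultdict


def site_scores(votes):
    """Sum votes per site, keeping only each user's latest vote on that site."""
    latest = {}
    for user, site, delta in votes:
        # Clamp to -1/+1 and overwrite any earlier vote by the same user.
        latest[(user, site)] = max(-1, min(1, delta))
    scores = defaultdict(int)
    for (_user, site), delta in latest.items():
        scores[site] += delta
    return dict(scores)


votes = [
    ("alice", "example.org", +1),
    ("bob", "example.org", +1),
    ("mallory", "example.org", -1),
    ("mallory", "example.org", -1),  # repeat ding, replaces the earlier one
    ("mallory", "example.org", -1),
]
print(site_scores(votes))  # {'example.org': 1}
```

    This only helps if creating new accounts is costly; someone with many identities can still ding a site many times.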

  4. Biased listings by Champaign · · Score: 4, Insightful
    I think many commercial search engines have learned that biasing themselves toward sites that have paid them is a good way to erode consumer confidence, and damage their readership/userbase. Just as newspapers have to at least provide the image of objectivity, the same demands are on search engines.

    I'm quite comfortable with how Google does this (present commercial links clearly marked to the side), and am not convinced a non-commercial (open source) alternative is needed.

  5. just don't get it by Astrorunner · · Score: 3, Insightful

    I think that you absolutely have to have a closed source algorithm for ranking pages, because otherwise you'll get people who will simply tune their pages to be high on the list. I can see how making the majority of the search engine open source would be beneficial, but the algorithm itself? It's like saying "Here's the keys to my car" and thinking that, because everyone has access to the keys, no one's going to drive away with it. Sure, everyone has the opportunity to make your search engine better, but never underestimate the tenacity of a web-wanna-be-millionaire.

    1. Re:just don't get it by cduffy · · Score: 4, Insightful

      Think about cryptosystems: The whole point about the really good ones is that you can know the algorithm, but still not break it. Granted, pulling that off for a search engine is prone to be much, much harder -- but I *do* believe it's well within the realm of possibility. Ambitious in the extreme? Certainly... but there's something to be said for high-risk-high-reward projects.

  6. If it's like every other SourceForge project... by realmolo · · Score: 2, Insightful

    Here's what I expect to see on the webpage in a few months: "Currently Nutch is in the alpha stage: it doesn't index any web pages, doesn't return any results, and has no user interface. Programmers needed!" Google has WON the search engine war, probably forever. Find some other mountain to climb, guys.

    1. Re:If it's like every other SourceForge project... by AchmedHabib · · Score: 2, Insightful

      Google has WON
      You mean just like Altavista had? :)

  7. Seems pretty pointless by cryptochrome · · Score: 4, Insightful

    Free and open code is good and all... but the one real cost of a search engine is RUNNING it. It requires a far from trivial amount of bandwidth and hardware, and somebody has to pay for all of it. Unless someone comes up with a novel P2P solution (and many are trying) it just won't happen.

    What they should be doing is pressuring the existing search engine companies for some integrity.

    --

    ---If you can't trust a nerd, who can you trust?

  8. Can this work? by jmkaza · · Score: 4, Insightful

    I think the idea is good in principle, but could it actually succeed? Google gets hit with millions of requests each day. They've got hardware that can support thousands of slashdottings a day and a fat pipe to feed all of that info out. That takes a lot of money. Financing an open source project is difficult enough, but financing an open source service such as that would seem next to impossible. Ideas?

    The other major problem would be that, with the ranking criteria being available for all to see, it would be relatively simple to manipulate page rankings.

    1. Re:Can this work? by casio282 · · Score: 2, Insightful

      I think it's a fabulous idea, the kind of idea that makes me slap my head and say "why didn't I think of this?" You're right -- the biggest obstacle to producing a truly free (as in speech, natch) search engine solution is not in producing the software (patent minefield notwithstanding), but in the "physical" costs of hardware and bandwidth.

      I think the way to overcome this obstacle is to develop a distributed system...run a nutch node on your server, host a few GBs of index data. There could be master nodes that are able to route requests to the right nodes for a given set of keywords. It sounds far-fetched, and I can't work out the network topology off the top of my head, but I bet it's doable. Of course, you'd have to build redundancy into the system to make sure it's not exploited, and that a power outage (or a machine that's not up 24-7) somewhere doesn't cause failures. You'd also want to encrypt the locally stored data to further protect against exploits, and to perhaps (IANAL) indemnify the node-owner to some degree from whatever problems s/he might face "hosting" this material, kinda like Freenet.

      It's interesting. I hope they think about this sort of approach.
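
      To make the keyword-routing idea a bit more concrete, here is a rough sketch of what a master node could do. It swaps in rendezvous (highest-random-weight) hashing as the placement rule, which is my assumption and not anything the Nutch project has described; the node names and the functions `nodes_for_keyword` and `route` are hypothetical:

```python
import hashlib


def nodes_for_keyword(keyword, nodes, replicas=2):
    """Pick `replicas` distinct nodes for a keyword by rendezvous hashing."""
    def weight(node):
        # Stable per-(node, keyword) weight; the highest weights win.
        return hashlib.sha256(f"{node}|{keyword.lower()}".encode()).hexdigest()

    return sorted(nodes, key=weight, reverse=True)[:replicas]


def route(query, nodes, online, replicas=2):
    """Map each keyword in the query to its first online replica, or None."""
    plan = {}
    for keyword in query.split():
        candidates = nodes_for_keyword(keyword, nodes, replicas)
        plan[keyword] = next((n for n in candidates if n in online), None)
    return plan


nodes = ["node-a", "node-b", "node-c", "node-d"]
print(route("open source search", nodes, online=set(nodes)))
```

      Every master computes the same placement from the node list alone, so there is no central lookup table to keep in sync, and a node that drops out only affects the keywords it was holding. The redundancy is the `replicas` parameter: a keyword becomes unreachable only when all of its replicas are offline at once.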

      --

      :wq
  9. Search engine game is NOT over by AtariAmarok · · Score: 4, Insightful

    "Google has WON the search engine war, probably forever. Find some other mountain to climb, guys."

    At one time, Oldsmobile won the auto company wars. Where are they now?

    IBM ruled the PC roost. Hmmmm....

    Command-line OS's were king. But now???

    Altavista and Infoseek and Lycos were search engine kings at one time. Whither this trio?

    The point is, it is not over.

    --
    Don't blame Durga. I voted for Centauri.
  10. Nutch will never get out of alpha stage by xannik · · Score: 2, Insightful

    I fail to see the point of such an endeavor. Without advertising, Nutch cannot possibly hope to become a serious contender against search engines such as Google or Overture. Advertising provides the money that enables search engines to have lots of bandwidth to send those results quickly back to users, lots of computing power to quickly process each search, and even the ability to hire people to research new areas for better search results. Even if the search engine is selling its resources to other portals, as Google does with Yahoo, advertising would still be involved in the process: Yahoo would still need to advertise on its site to bring in revenue to pay for the service. I think Google's method, with small text-based ads that are discreet, is perfectly fine. Why do we need to fix this?

    --

    Go Illini!!!
  11. Are they thinking too big? by xanderwilson · · Score: 3, Insightful

    I think they're setting themselves up for something that will get too big and too expensive before it can get finished, and they'll have to figure out a way to (gasp) get some funding beyond donations.

    I don't see the solution in one great open-source, independent search engine. Many individual specialized search engines, each mastering its own niche, stand a chance to compete, especially if run by people who focus on their areas of expertise: alternative news search engines, music search engines, literary search engines, etc., each run by people who know what to filter in and out.

    If Nutch.org could create the technology that would allow each of these search engines to exist autonomously, it could also be the hub/portal/start-page/blahblahblah that links all these engines and databases together.

    Alex.

  12. Re:Google? by delcielo · · Score: 3, Insightful

    I have to agree. And I don't see my allegiance to Google as a sell-out. I see it as a reward for good work.

    --
    Hot Damn! It's the Soggy Bottom Boys!
  13. Re:Patents. by Feztaa · · Score: 4, Insightful

    I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.

    Hmmm, I just realized something... with patents, you end up stepping on people's toes. Without patents, you get to stand on their shoulders. Which do you think is the better vantage point?

  14. Re:Patents. by SpaceCadetTrav · · Score: 1, Insightful

    Depends... are you the one standing on the top or the bottom?

  15. Bias: Inevitable by handy_vandal · · Score: 2, Insightful


    "In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine."

    Bias is inevitable -- we're talking about ranking, which necessarily means bias.

    The question is: what bias do you want? What bias suits your purposes?

    My ideal search engine would offer a variety of biases from which to pick.

    --
    -kgj
  16. Re:Patents. by X · · Score: 3, Insightful

    In practice you may be right, but the intent of patents is the reverse. The key thing to think about is that without patents there is an incentive to keep ideas secret. So, you end up standing *beside* people until the idea comes out. If something gets patented, it is public knowledge, and you can stand on the person's shoulders so long as you pay them a "small" fee. Even without their consent you can do research that takes advantage of the knowledge in the patent.

    Of course, in practice patents are a mess. ;-)

    --
    sigs are a waste of space
  17. Nutch - Not Understanding The Capitalist Hegemony by EqualSlash · · Score: 2, Insightful


    Nutch - Not Understanding The Capitalist Hegemony (I am just making it up ;)

    Without a sound revenue model they can't operate for more than a month. Google has indexed billions of pages and to operate at that level they have to spend a lot of money (Google recently leased an entire campus from SGI). To meet the Infrastructure costs alone you need some form of commercial revenue stream.

  18. Re:Patents. by AstroDrabb · · Score: 5, Insightful

    Does it matter? There are no innovations. ALL knowledge is based on prior knowledge. Look in any field of study and you will soon learn that advancement is not possible without prior knowledge. What we know about computer science today is thanks to the knowledge gained by those before us. It is this way in EVERY field: Astronomy, Medical Science, Mathematics, etc. Humankind does not grow by leaps and bounds; we grow by incremental improvements. I have not heard of ONE discovery/innovation in which the discoverer/innovator was not educated in prior knowledge. Now the question we need to ask ourselves, and especially the government, is: do we really want the advancement of our society to be hindered by the monetary interests of the greedy?

    --
    If Tyranny and Oppression come to this land,
    it will be in the guise of fighting a foreign enemy. -James Madison
  19. Some commentary... by Colm+Buckley · · Score: 3, Insightful

    I have a few comments on this development:

    • The article as posted contains some pretty snide commentary, apparently designed to intimate that all current search engines deliberately weight their results in favour of their advertisers. This is demonstrably not the case; in fact, with Google providing a strong, well-publicised counterexample, to do so would be suicide for any search engine with pretensions to market leadership.
    • The principal difficulty with an open-source search engine algorithm is that it would definitively be open to abuse. Once the ranking algorithm was known, it would be fairly trivial to develop ways to subvert it. One of the reasons why this hasn't happened to Google is because the details of the ranking algorithm are closed. There is a largish industry devoted to figuring out how to influence Google (which is why Google keep tweaking their algorithm). A search engine using an open algorithm would very quickly become unusable as this industry figured out how to play the system.
    • The funding from Overture is very suspicious, to be honest. Overture, assuming the Yahoo! takeover is given the all-clear, will soon be part of one of the largest commercial search engines, and with a history of business practices which are, shall we say, perhaps less than totally congruent with the open-source ideals.
    • Running a large, successful search engine requires vast, dedicated resources. I don't know the exact scale of the Yahoo!, Google or MSN search operations, but I'll warrant that they're surprising to anyone who's expecting to run a search engine from a couple of thousand distributed nodes.

    An open search engine application is a nice idea, but unfortunately it's one of those applications which are essentially useless without an enormous ASP architecture behind it. An earlier poster indicated that it might be useful for searching and indexing intranets and the like, analogously to the Google Search Appliance. This is indeed a valid potential application, but then, HT://Dig exists already. Is this dramatically better?

  20. Comments and suggestions... by rice_burners_suck · · Score: 2, Insightful
    Suppose you have just finished developing a free software search engine. And suppose it has the best algorithms in the world and the ratings are weighted based on some sort of moderation system.

    This is exactly like the problem the mice had one day. They couldn't come out of their mouse hole because there was a dangerous cat prowling around. One day, as food was getting scarce and everyone was afraid to leave the hole, the mice called a meeting to discuss the problem. One excited young mouse came up with the most wonderful idea: Let's put a bell around the cat's neck, so that when the cat is nearby, the mice would have advance warning and could escape! All the mice got excited at this proposal, until a very old, very wise mouse came over and asked, "And who will tie the bell around the cat's neck?"

    What I'm trying to say is: If the search engine is free software and companies don't pay to increase their ranking... who will pay for the bandwidth to host the engine? I can tell you this much:

    • Individuals will not pay a fee to perform a search unless this search engine gives them some incredibly compelling reasons to do so. Open moderation will not likely fulfill that requirement.
    • Companies will not pay to increase their ranking because that is the definition of this project. They will not pay to search for the same reason that individuals won't pay.
    • The government probably won't pay because there are plenty of "free" (cost) search engines around. That is, unless someone can give them an incredibly compelling reason to do so.
    • Universities probably won't pay for the same reasons as everyone else.

    Proposed solution? Make it a distributed search engine, like SETI@home, or the DNS.

    This is much easier said than done because:

    1. RAID-like distributed storage technology would have to be developed, so that the indexing database could be distributed among all computers worldwide that donate bandwidth and storage. This would have to guarantee statistically that all the data will be available at any point in time even if people turn off their computers for extended periods of time. However, this technology could make reliable clustered storage a reality, and the resulting free software implementation could be licensed for corporate use for an exorbitant price, which would go to the EFF, FSF and other organizations that develop free software and/or support the development thereof.
    2. An efficient P2P-like protocol, along with a network topology of some sort (like the DNS system has), would have to be developed to support the searching. It would have to be damn fast and, like before, very resilient to computers being shut off, chunks of data becoming lost at any moment, etc. Furthermore, changes would need to propagate at blazing speeds so that new items on the Internet could be found shortly after appearing.
    3. Bandwidth and disk quota would need to be managed at each participating host, so that limits set by the user are not exceeded.

    Governments, companies, universities and individuals would likely support an effort like this by donating some bandwidth and storage, rather than money.
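
    To put a number on "guarantee statistically" in point 1: a back-of-the-envelope sketch, assuming each chunk of the index is simply copied to several hosts and that hosts go offline independently of each other (a big simplification; real outages are correlated, and a RAID-like scheme would use parity or erasure coding instead of whole copies). The function names are made up for illustration:

```python
import math


def availability(uptime, replicas):
    """Probability that at least one of `replicas` independent copies is online."""
    return 1.0 - (1.0 - uptime) ** replicas


def replicas_needed(uptime, target):
    """Smallest number of independent copies whose availability meets `target`."""
    if not 0.0 < uptime < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("uptime and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - uptime))


# Home machines that are switched on half the time, aiming for 99.9%:
print(replicas_needed(0.5, 0.999))  # 10
print(availability(0.5, 10))        # about 0.999
```

    So with hosts that are up only half the time, every chunk needs roughly ten copies, which means the donated storage has to be about ten times the size of the index.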

    In the spirit of worldwide computing on the Internet, I hope this makes some amount of sense.

  21. Re:Google? by Anonymous Coward · · Score: 1, Insightful

    When you're the gateway to information, you're in an extremely powerful position. People will be prepared to pay a lot to get access to that power.

    Left as a virtual monopoly on net searching, it will only be a matter of time before Google caves into the pressure of 'pay for placement'. That is why we need to maintain competition in the 'net search' industry to keep them honest.