Slashdot Mirror


Nutch: An Open Source Search Engine

Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch. In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising? Nutch is clearly in its initial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it.

291 comments

  1. Hook it up to slashdot! by FortKnox · · Score: 1, Insightful

    The slashdot search page could definitely use this kinda technology!

    --
    Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!
    1. Re:Hook it up to slashdot! by Gherald · · Score: 1

      The slashdot database is pretty huge.. I wonder if the servers could handle this kind of indexing?

    2. Re:Hook it up to slashdot! by Anonymous Coward · · Score: 3, Informative

      Just use google. Search for "SEARCH-STRING site:slashdot.org"
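The query form suggested above can be turned into a search URL with a few lines of Python. This is only a sketch: the helper name `build_site_query` is made up for illustration, and the `https://www.google.com/search` endpoint with a `q` parameter is an assumption about how the search page accepts queries.

```python
from urllib.parse import urlencode


def build_site_query(search_string, site):
    """Build a search URL restricted to one site (hypothetical helper)."""
    # Append the site: operator to the user's search terms.
    query = f"{search_string} site:{site}"
    # urlencode escapes spaces as '+' and ':' as '%3A'.
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_site_query("nutch search", "slashdot.org"))
```

As later replies point out, this only finds pages the engine has actually indexed, so recent discussions may not show up.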

    3. Re:Hook it up to slashdot! by SpaceCadetTrav · · Score: 1

      Piece of cake, but you'd probably have to make an investment in some proprietary software, which is forbidden here.

    4. Re:Hook it up to slashdot! by Steven+Blanchley · · Score: 2, Insightful

      No, many comments don't end up getting indexed by Google, and recent discussions aren't indexed at all. I've tried that method in the past with little success.

    5. Re:Hook it up to slashdot! by randyest · · Score: 3, Informative

      167 posts and no mention of ht://dig? It's a great open source search engine, and I've been using it daily (well, cron really uses it now, not me) to spider about 100 sites on my intranet, which has servers all over the world.

      While not currently designed for massive whole-web spidering (it's aimed at single websites or intranets), ht://dig is a great starting point (and a lot further along than the Nutch 'nascent effort' mentioned in the story). Some database optimization to ht://dig seems easier than starting over with Nutch. Plus, the name 'Nutch' sucks.

      --
      everything in moderation
    6. Re:Hook it up to slashdot! by lvdrproject · · Score: 3, Informative
      Interestingly enough, if i had read this story a few months ago, i would've said "Poppycock! Google should be good enough for anyone!". But lately i've been noticing that Google turns up a lot of garbage results. Like, if you search for something "generic" (like, no brand name or product name or anything like that), you're going to find a whole bunch of results that just lead to pop-up search sites.

      For example, look at the results for the search 'convert wmv mpeg'. The first three results lead to the same exact search site. (Whether they have pop-ups or not, i can't tell, because i block them.) The fourth result is another search site. And then the last three are the same as the first three.

      Of course, this obviously works with stuff you'd expect it to, like 'mp3s' and 'warez' and 'porn', but it works with legitimate stuff too. I wonder if there'll be anything to combat this trend, whether it be implemented by Google or by someone else....

    7. Re:Hook it up to slashdot! by msgregory@earthlink. · · Score: 2, Informative

      I've noticed that searching for Eric S. Raymond's home page brings up his actual home page third or fourth in the listing. I don't know if that means Google is on its way to going downhill or what. The first listing it brings up doesn't appear to have anything to do with ESR. I don't even think his name appears anywhere on the page.

    8. Re:Hook it up to slashdot! by WoTG · · Score: 1

      Yeah, I'd agree that Google does produce quite a few garbage results, but let's keep things in perspective. Before Google, I'd be used to 3 PAGES of garbage results on AltaVista and other engines.

      You've got to remember that Google is just code. It has a lot of tricks to produce good results, but there are bound to be little things that slip through. As for your example, it appears to be a bit of luck. The site that shows up at #1 lists all its recent queries as links, a neat little feature. One of those recent queries happens to exactly match your Google query of 'convert wmv mpeg'. So Google guesses that it's a specific page about some object called 'convert wmv mpeg' and it gets a good ranking, in this case the top ranking.

      It happens on my personal homepage too. For some reason (page design, link text, page header, whatever) this minuscule grep examples page on my personal web database comes up as number 3 for "grep examples" (quotes included). I get a few hits that way...

  2. Biutch by Anonymous Coward · · Score: 0

    open source pimp engine

  3. Patents. by Christopher+Thomas · · Score: 5, Interesting

    I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.

    1. Re:Patents. by socrates32 · · Score: 2, Interesting

      "most of the good search and indexing schemes have already been patented" Not at all... just the easy ones.
      If this is to be cheap to run, it will probably have to be distributed, and thus a very different architecture than most of what we've seen up to now.

      --

      -- "Quidquid latine dictum sit, altum sonatur."
      - Whatever is said in Latin sounds profound.
    2. Re:Patents. by Anonymous Coward · · Score: 0

      Dude. Read about google's network. Seriously.

    3. Re:Patents. by Feztaa · · Score: 4, Insightful

      I hope the authors of this project do their homework. My impression is that most of the good search and indexing schemes have already been patented, which will make it difficult to release such a project without stepping on someone's toes.

      Hmmm, I just realized something... with patents, you end up stepping on people's toes. Without patents, you get to stand on their shoulders. Which do you think is the better vantage point?

    4. Re:Patents. by alwayslurking · · Score: 1

      I think the parent probably meant distributed in a Folding-at-home or SETI sense. Google use a massive cluster, but it's all on-site and owned by them, AFAIK.

    5. Re:Patents. by SpaceCadetTrav · · Score: 1, Insightful

      Depends... are you the one standing on the top or the bottom?

    6. Re:Patents. by Anonymous Coward · · Score: 0

      I understood what the post you refer to was talking about. Google's network isn't onsite, it is distributed to a host of data centers much like Akamai's network. The only real difference between what the initial poster wants and google is centralized control. This is not a technical issue, but rather a political issue. I understand the sentiment of "don't control my information" but on the other hand, the way google runs things so far hasn't given any objective reason to worry. Quite the contrary.

    7. Re:Patents. by mblase · · Score: 1

      with patents, you end up stepping on people's toes. Without patents, you get to stand on their shoulders.

      On the other hand, if you're the one who didn't get the patent, you stand the risk of being crushed when too many people show up for a free piggyback ride.

    8. Re:Patents. by alwayslurking · · Score: 2, Informative

      I still don't think you can describe google's setup as distributed. They have multiple data centers each running a very large cluster and containing a similar, but not identical, snapshot of the database, indices, etc. A truly distributed engine is likely to require an innovative step or three to emulate that with no centralised control, unknown hardware and bandwidth resources and the real possibility that some "clients" may be corrupted by their owners to distort results. I haven't got any arguments about the real value of this effort though. Google has done nothing to lose my trust and seems to be run with retaining people's trust as an active ambition. Closest they came to worrying me was crippling for China, but that was really a no-win situation, IMHO.

    9. Re:Patents. by X · · Score: 3, Insightful

      In practice you may be right, but the intent of patents is the reverse. The key thing to think about is that without patents there is an incentive to keep ideas secret. So, you end up standing *beside* people until the idea comes out. If something gets patented, it is public knowledge, and you can stand on the person's shoulders so long as you pay them a "small" fee. Even without their consent you can do research that takes advantage of the knowledge in the patent.

      Of course, in practice patents are a mess. ;-)

      --
      sigs are a waste of space
    10. Re:Patents. by Anonymous Coward · · Score: 0

      Well, when Google was small and relatively unknown it was good. Now it is big, almost a household name. It's also successful and there is even talk of an IPO! Of COURSE it is bad now. Don't you know that anyone who makes money from technology is BAD?

      There is a definite need for a group of kids to get together on SourceForge and release something as 0.001.02.01b which promises to be BETTAR than Google some day. And of course there will be the obligatory whining about big bad Google owning all the patents and the usual ineffectual communists declaring that searches want to be free.

      In the end it will come to nothing. In the meantime be at peace and use Google.

    11. Re:Patents. by Anonymous Coward · · Score: 0

      Of course when there's no chance to exploit a patent there are far fewer shoulders because nobody spends a couple of billion forming a company, hiring a few hundred workers and developing the technology - everyone has to wait until some hick notices interesting patterns in the cow feed as he's trying to retrieve his gum, and then we get a new search algorithm... ...but nobody on Slashdot likes to think about that. Reality, who needs it!

      The preferred vantage point of Feztaa is obviously "head up ass".

    12. Re:Patents. by Anonymous Coward · · Score: 0

      No. You get to watch other people work with your ideas. If you're lucky, you might get to work with some of theirs as well. Exchanging ideas does not encumber you whatsoever.

    13. Re:Patents. by shepd · · Score: 1

      If I have seen further, it is by standing upon the shoulders of giants.
      --- Isaac Newton

      Just thought it would be worth mentioning. Then again, the laws of gravity have been against me lately...

      --
      If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
    14. Re:Patents. by _Sharp'r_ · · Score: 1

      Their search and indexing schemes for this don't matter. What really matters is that all my sites rank at the top of all the results.

      If they can accomplish that, I'll be happy with the search results!

      (Remove tongue from cheek)
      Sadly, this will probably be the attitude of most webmasters, leading open source search to be either totally bulletproof, or shamelessly exploited. Somehow I suspect it'll be the latter.

      --
      The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
    15. Re:Patents. by AstroDrabb · · Score: 5, Insightful

      Does it matter? There are no innovations. ALL knowledge is based on prior knowledge. Look in any field of study and you will soon learn that advancement is not possible without prior knowledge. What we know about computer science today is thanks to the knowledge gained by those before us. It is this way in EVERY field, Astronomy, Medical Science, Mathematics, etc. Humankind does not grow by leaps and bounds, we grow by incremental improvements. I have not heard of ONE discovery/innovation in which the discoverer/innovator was not educated in prior knowledge. Now the question we need to ask ourselves, and especially the government is do we really want the advancement of our society to be hindered by monetary interests of the greedy?

      --
      If Tyranny and Oppression come to this land,
      it will be in the guise of fighting a foreign enemy. -James Madison
    16. Re:Patents. by Anonymous Coward · · Score: 0

      Nonsense.

      If there were no innovations, then current knowledge would be the same as prior knowledge. And that would be true all the way back to, oh, let's say One Million B.C. Since current knowledge is clearly not the same as knowledge circa One Million B.C., innovations have in fact occurred.

      Patents cover some innovations. They do not cover the prior knowledge. "Prior art" does not mean you have no grounds for a patent; it merely limits what your patent covers. Building upon previous work is expected, and accommodated by the system.

    17. Re:Patents. by Anonymous Coward · · Score: 0

      Research takes money.
      Don't be cheap.
      You can't get everything for free.
      Stop being so cheap!!!!

    18. Re:Patents. by salesgeek · · Score: 1

      Without patents, you get to stand on their shoulders.

      The person being stood on has a bad view.

      --
      -- $G
    19. Re:Patents. by PMuse · · Score: 1

      Feztaa wrote: Without patents, you get to stand on their shoulders.

      AstroDrabb wrote: There are no innovations. ALL knowledge is based on prior knowledge. Look in any field of study and you will soon learn that advancement is not possible without prior knowledge. ... Now the question we need to ask ourselves, and especially the government is do we really want the advancement of our society to be hindered by monetary interests of the greedy?

      There is innovation. The guy may be an innovator who sees A and sees B and says, if I used A with B, I could do C. For instance, take Gutenberg's invention of the printing press. Many parts of it were known: alphabets, paper, ink, carving, screws, etc. We can argue about whether Gutenberg's idea was inventive enough, but that's just a matter of degrees.

      Now, we all agree that the world is best off if Gutenberg tells the world about his idea and how to do it, right? How do we get him to do that? After all, he could just hide his presses in a monastery somewhere and only show the public the books.

      Answer: we give him a patent in exchange for publicly revealing how to build and operate a printing press. Gutenberg gains a few years of monopoly and we gain his idea.

      We can haggle over details like how many years to give him or how strong a monopoly to grant. That, again, is a matter of degree. The point is this: the theory behind granting patents is that with patents, we get to stand on the shoulders of those who came before us. Without patents, there is less incentive for people who come up with good ideas to explain to the rest of us how to do them.

      --
      "We reject as false the choice between our safety and our ideals." --The American President (20.1.2009)
    20. Re:Patents. by Anonymous Coward · · Score: 0

      Now the question we need to ask ourselves, and especially the government is do we really want the advancement of our society to be hindered by monetary interests of the greedy?


      This sounds to me like saying, "Do we really want to let gravity force all of the water out of our oceans and into the atmosphere?". Get a clue, dude: "greed" is what creates innovation, not the government. It isn't always "greed" for money: sometimes it's for fame, sometimes it's for personal satisfaction. It's still "greed" in the sense that it's somebody going out of their way and taking a chance to do the extraordinary--and expecting a reward in return.

    21. Re:Patents. by Feztaa · · Score: 1

      The guy may be an innovator who sees A and sees B and says, if I used A with B, I could do C.

      And where is this guy if A and B are patented and he's not allowed to use them? He's shit out of luck, is where he is.

      the theory behind granting patents is that with patents, we get to stand on the shoulders of those who came before us.

      The theory, yes. Theory. That's not at all what's happening in practice, though. Software patents are the perfect example here.

    22. Re:Patents. by 5KVGhost · · Score: 1

      There are no innovations. ALL knowledge is based on prior knowledge

      All humans are genetically related to all other humans, but that doesn't mean that you're my mother. The real world isn't a game of Civilization where new discoveries magically appear in your technology tree at predefined moments.

      Possessing certain knowledge does nothing for you if you don't know how to apply that knowledge to do something useful. Knowledge just sits there - it takes people to innovate, and in order for people to do so they have to have the skills and the incentive to invest their time, money, and energy.

      Now the question we need to ask ourselves, and especially the government is do we really want the advancement of our society to be hindered by monetary interests of the greedy?

      The question you may want to ask yourself is why people and businesses would spend huge amounts of money to research, develop, and test new ideas and inventions to advance society when a lazier company or individual could simply steal the fruits of their labor and present them as their own without any consequences. Monetary interests are the reason for progress, not a hindrance.

    23. Re:Patents. by PMuse · · Score: 1

      PMuse: The guy may be an innovator who sees A and sees B and says, if I used A with B, I could do C.

      Feztaa: And where is this guy if A and B are patented and he's not allowed to use them? He's shit out of luck, is where he is.


      Yes, for 20 years. If Mr. C-Seer doesn't tell us about C, it could be considerably longer before the rest of us figure C out.

      PMuse: the theory behind granting patents is that with patents, we get to stand on the shoulders of those who came before us.

      Feztaa: The theory, yes. Theory. That's not at all what's happening in practice, though. Software patents are the perfect example here.


      How are software patents a perfect example of the practice not matching the theory? (I can think of a couple of ways, but I'm curious about which you have in mind.)

      More importantly, if we don't have a problem with the theory, then our problem is that the practice deviates from the theory too far -- so what can we do to make the practice conform to the theory?

      --
      "We reject as false the choice between our safety and our ideals." --The American President (20.1.2009)
    24. Re:Patents. by Feztaa · · Score: 1

      How are software patents a perfect example of the practice not matching the theory?

      Well, how long does it take for a piece of software to become obsolete? MS would probably say 3 years or less (Win95 to 98 was 3 years, Win2k was 2 years, etc).

      So, in the case of the rapidly evolving software world, what's the point of a patent that lasts 20+ years? If some operating system was competing with another operating system that used patented algorithms, the competing OS would be 20 or more years behind the patented OS, assuming they were honoring the patents. Just to give you some perspective on that -- Unix is 33 years old, GNU is 19, Linux is ~12, and Windows is ~10 (counting from Win3.1 in 1993).

      Basically, if you haven't been able to exploit your software patent in the first two years to make money, you don't really deserve the patent anymore, since it's not helping you any, and it's holding others back.

      It's the same deal with copyrights. The point of copyrights is to give authors some compensation for their trouble, encouraging them to contribute more works to society, for the greater benefit of mankind. But if works are locked up in copyright for life + 95 years (or whatever ridiculous amount of time it is), then the copyright is essentially permanent, and the work is stolen from humankind. Remember, copyright is the exception, not the rule.

      The point I'm trying to argue here is that patents and copyrights are basically good, but their terms are just farcically long in today's fast-paced world.

      The other problem with patents, of course, is the apparent corruption (or just plain ignorance) of the patent office granting painfully obvious patents, which gives rise to situations where you've got a small company who patented an obvious idea with no products to show for it suing real companies doing real innovation -- this hurts society as a whole, because some pissant leech of a corporation is stealing money from good companies while simultaneously NOT contributing anything worthwhile to the list of humankind's accomplishments. :)

    25. Re:Patents. by Feztaa · · Score: 1

      Oh, and another thing I wanted to add -- if a company has a large patent portfolio, then their initial innovation that got them the patents can stagnate, turning them into some royalty collecting agency that has no further incentive to innovate until their patents expire -- but remember, business men are slimy, repulsive creatures who focus on short-term profits at the expense of long-term viability. If it looks like slashing the R&D budget and supplementing the income with patent royalties is the best way to maximize shareholder profits, that's what they will do.

    26. Re:Patents. by PMuse · · Score: 1

      The point I'm trying to argue here is that patents and copyrights are basically good, but their terms are just farcically long in today's fast-paced world.

      Yep. That's a good point. I'd disagree on exactly how long is long enough, but that's just a matter of degrees.

      The other problem with patents, of course, is the apparent corruption (or just plain ignorance) of the patent office granting painfully obvious patents,

      Mostly the latter, I think. The PTO is overly-reliant on old patents to know what has already been done. Software didn't have any old patents when the 1990s began, and the PTO had no clue what was either old or obvious. And look at the results... Not that things have improved much today.

      But, take screws for example. There's an old art. You can't get a patent on "a screw" today. You probably can't get a patent on "a stainless steel screw with a slotted hexhead cap and a 3:1 thread pitch" today. If you want a patent on a screw today, it's got to be very, very specific, like "a stainless steel screw with a slotted hexhead cap with an outer diameter at least double the shaft diameter having a flare around the cap at least four times the shaft diameter, with a thread pitch of 3:1, and with a maximum thread diameter less than ten percent greater than the shaft diameter". (Not that I have any idea why someone would want such a screw.)

      By contrast, the PTO is still handing out patents for things like "a computing device with a means for sending signals to another dissimilar computing device to which it is not physically connected". Ridiculous -- that could be anything.

      ...which gives rise to situations where you've got a small company who patented an obvious idea with no products to show for it suing real companies doing real innovation -- this hurts society as a whole, because some pissant leech of a corporation is stealing money from good companies while simultaneously NOT contributing anything worthwhile to the list of humankind's accomplishments.

      Oh are you preaching to the choir there!

      --
      "We reject as false the choice between our safety and our ideals." --The American President (20.1.2009)
  4. The purpose of a search engine by Stalemate · · Score: 1
    After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?


    I'm pretty sure a search engine is supposed to be for whatever purpose the people making it want it to be.
    1. Re:The purpose of a search engine by yamcha666 · · Score: 2, Funny
      I'm pretty sure a search engine is supposed to be for whatever purpose the people making it want it to be.

      And I'm sure many Slashdotters would love a search engine dedicated to find pr0n and anti-Microsoft propaganda. Right?

    2. Re:The purpose of a search engine by AVryhof · · Score: 4, Funny
  5. Exploits by Greenisus · · Score: 0, Troll

    My biggest concern is that the developers will simply be in a scramble patching up exploits, instead of actually making their technology better.

  6. Google? by devphaeton · · Score: 5, Informative

    Last i heard google still doesn't accept bribes for page ranking.

    Unobtrusive adverts in the right-hand column notwithstanding.

    --


    do() || do_not(); // try();
    1. Re:Google? by billstr78 · · Score: 1

      Yeah, and the last I heard: this was the only search engine anyone used.

    2. Re:Google? by delcielo · · Score: 3, Insightful

      I have to agree. And I don't see my allegiance to Google as a sell-out. I see it as a reward for good work.

      --
      Hot Damn! It's the Soggy Bottom Boys!
    3. Re:Google? by capedgirardeau · · Score: 1

      Maybe they don't take money, I can't really say for sure, but they do adjust the rankings for some pages.

      They have been called on it before as I recall and refused to reveal what their criteria was for when they would manually adjust a page's rank.

      Read their support pages, nowhere do they say they do not manually adjust the page ranks.

      But they are still the best thing in town.

      --
      Wax on, wax off baby!
    4. Re:Google? by fireboy1919 · · Score: 2, Informative

      Yeah, they've been known to do that when people make server farms to attempt to influence the rankings of google. It is in their best interest to ensure that the pages that people actually want to see come up first, not the advertisers' pages.

      That's why people use google. If they stacked the deck supporting places people don't care about - advertisers' pages, for instance, then we'd all jump ship and use another search engine.

      They're like the Swiss and Consumer Reports. Part of the reason they make money is neutrality, and they won't make as much if they're not.

      --
      Mod me down and I will become more powerful than you can possibly imagine!
    5. Re:Google? by The+Clockwork+Troll · · Score: 1
      Last i heard google still doesn't accept bribes for page ranking.
      They aren't stupid, give them some credit.

      If they accepted bribes for anything, it would be for concrete information about their ranking algorithm.

      --

      There are no karma whores, only moderation johns
    6. Re:Google? by g1zmo · · Score: 3, Interesting

      See this article on slate for some interesting ideas on why Google's page-ranking system is being undermined due to the evolution of ecommerce and price-comparing portals.

      --
      I have found there are just two ways to go.
      It all comes down to livin' fast or dyin' slow.
      -REK, Jr.
    7. Re:Google? by Anonymous Coward · · Score: 1, Informative

      Yes, but google does delist pages when threatened with lawsuits.

      Remember the Scientologists?

    8. Re:Google? by Anonymous Coward · · Score: 0

      And nowhere do they say that they do not print out the HTML of the top search results, cover them in peanut butter and jelly and slide around on their bare asses in the resulting gooey, sensual mess.

      Do you communist idiots have to invent conspiracies everywhere?

    9. Re:Google? by ebyrob · · Score: 1

      *they* refused to reveal what their criteria was for when they would manually adjust a page's rank.

      Ya, because lord knows, a computer could never make a mistake without human intervention.

      Gimme a break, buoyant slimeballs will find a way to float on any popular search engine no matter how it is implemented. The only prayer an OSS search engine has of staying "pristine" would be making sure not many people ever use it.

    10. Re:Google? by RedWizzard · · Score: 2, Informative
      See this article on slate for some interesting ideas on why Google's page-ranking system is being undermined due to the evolution of ecommerce and price-comparing portals.
      That article has already been dealt with on Slashdot (here). Using a bit of intelligence when searching will avoid the problems cited.
    11. Re:Google? by shaitand · · Score: 1

      false, for $200 you can get a higher rank, the boss has already paid it.

    12. Re:Google? by Anonymous Coward · · Score: 0

      Do you McCarthyites have to respond to everything with homo-erotic fantasies?

    13. Re:Google? by Anonymous Coward · · Score: 1, Insightful

      When you're the gateway to information, you're in an extremely powerful position. People will be prepared to pay a lot to get access to that power.

      Left as a virtual monopoly on net searching, it will only be a matter of time before Google caves in to the pressure of 'pay for placement'. That is why we need to maintain competition in the 'net search' industry to keep them honest.

    14. Re:Google? by blibbleblobble · · Score: 1

      "Last i heard google still doesn't accept bribes for page ranking."

      Essentially, you can achieve the same effect by registering a thousand domains, and linking them all to each other.
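The link-farm trick described above can be illustrated with a toy ranking model. The sketch below is a simplified textbook-style link-rank iteration, not Google's actual algorithm; the function names, the damping value of 0.85 and the tiny example graph are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy link-based ranking; links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for outs in links.values() for t in outs})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            outs = links.get(page, [])
            # A page with no outgoing links spreads its score over all pages.
            receivers = outs if outs else pages
            share = damping * rank[page] / len(receivers)
            for receiver in receivers:
                new_rank[receiver] += share
        rank = new_rank
    return rank


def add_link_farm(links, target, size):
    """Return a copy of links plus `size` farm pages that all link to
    `target` and to each other (hypothetical page names farm0, farm1, ...)."""
    farm = [f"farm{i}" for i in range(size)]
    boosted = {page: list(outs) for page, outs in links.items()}
    for page in farm:
        boosted[page] = [target] + [other for other in farm if other != page]
    return boosted


if __name__ == "__main__":
    web = {"portal": ["rival", "target"], "rival": ["portal"], "target": ["portal"]}
    before = pagerank(web)
    after = pagerank(add_link_farm(web, "target", 5))
    print("before:", before["target"], "vs rival", before["rival"])
    print("after: ", after["target"], "vs rival", after["rival"])
```

In this toy graph "target" and "rival" start out tied, and the farm pushes "target" ahead without anyone else choosing to link to it. Real engines presumably watch for exactly this pattern, as other replies in the thread note.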

  7. Lost Cause by Anonymous Coward · · Score: 0

    Didn't google start out that way and then realize that it is very expensive to maintain a search engine? Google also clearly differentiates between its ads and its results.

    I guess the more the merrier but I wouldn't bank on this thing becoming more than a curiosity.

  8. Slimey adverts? by Acidic_Diarrhea · · Score: 3, Insightful
    Yes, having advertising affecting search results is not good for the end user but (and I'm just bringing this up as a discussion topic), in what other ways can a search engine make money? It's clear that running a search engine has costs associated with it. To offset these costs, it seems like advertising is the only way to go. Now I can see that some search engines handle this in a more "slimey" way than others (I am happy with Google) but this project seems to want to avoid advertising at all costs. Where does the money come from then?

    Also of note is that companies can still influence search engines in slimey ways - Google can be manipulated to make a page rank higher, although Google keeps an eye on this activity and works around it.

    --
    I hate liberals. If you are a liberal, do not reply.
    1. Re:Slimey adverts? by darkstar949 · · Score: 0

      I agree, a good search engine is going to face problems in paying for bandwidth, hosting, etc.; and unless you have paid advertising you would be hard pressed to keep it running for long. The only possibilities left would be to have a company pay to host it (which may have problems), or to have user donations keep it running, similar to PBS.

    2. Re:Slimey adverts? by Anonymous Coward · · Score: 0

      Why not give it a distributed architecture? Then people can donate some of their servers' time to contributing.

    3. Re:Slimey adverts? by M.C.+Hampster · · Score: 2, Funny

      To offset these costs, it seems like advertising is the only way to go. Now I can see that some search engines handle this in a more "slimey" way than others (I am happy with Google) but this project seems to want to avoid advertising at all costs. Where does the money come from then?

      You speak blasphemy! How dare you speak of such practical issues as money when talking about free software!

      --
      Forget the whales - save the babies.
    4. Re:Slimey adverts? by Anonymous Coward · · Score: 5, Insightful

      This project is the SOFTWARE to run a search engine. Not a corporation that needs to generate income to justify the resources required to run the search engine.

      Anyone could take this source code and with enough money, challenge Google.com as the top search engine.

      I see this project as a competitor to shrink wrapped search engines. IE google appliance or maybe even Folio based products. Typically corporations have many documents that need to be indexed and searchable to their needs.

      I haven't seen this on the homepage but it doesn't list what content it can index. I hope it can at least index PDF's and popular Office documents.. Maybe even Media files? And what XML indexed fields? Or external metadata?

    5. Re:Slimey adverts? by Blue+Lozenge · · Score: 2, Insightful
      Yes, having advertising affecting search results is not good for the end user but (and I'm just bringing this up as a discussion topic), in what other ways can a search engine make money?

      Uhh... how about having advertising that does not affect search results. You see... ads on google are relevant to your search criteria, yet are separate from the results.

    6. Re:Slimey adverts? by Steven+Blanchley · · Score: 1
      You see... ads on google are relevant to your search criteria, yet are separate from the results.
      More and more the first part of that is becoming untrue. I searched for "hiccups" about an hour ago and got an ad for eBay. I'm starting to tune out the ads completely, since they're no longer relevant.
    7. Re:Slimey adverts? by mblase · · Score: 1

      Yes, having advertising affecting search results is not good for the end user but (and I'm just bringing this up as a discussion topic), in what other ways can a search engine make money.

      Indeed. Open source is great when you're talking about just software. A web-based search engine, however, involves a LOT of hardware and bandwidth as well, all of which cost mucho bucks.

      The only other option I can see is for the search software to be open and for miscellaneous companies to take it (for free) and build their own public search engines, which will necessarily be supported by advertising--nobody less than a millionaire will be able to run a popular public search engine for donations only, at least not for very long.

      So where's the advantage? So what if the GPL states they must publish the algorithm they use to bias advertiser's results--the results are still biased.

    8. Re:Slimey adverts? by lostinchicago · · Score: 1

      if you actually looked at the project you would know that it's all coming from donations.

    9. Re:Slimey adverts? by Amit+J.+Patel · · Score: 1

      Weird. I'm not seeing any ads on Google when I search for hiccups. It's possible that Google already got to it -- its ad system looks out for ads that users aren't interested in (i.e., aren't getting clicks) and turns them off.

      - Amit
    10. Re:Slimey adverts? by Oscar_Wilde · · Score: 1

      They could sell search appliances to companies, or run a distributed network of computers (so you're not paying for all the machines).

      I'd be surprised if this became a huge and popular search engine, but if it works well and can index content inside popular file formats (pdf, doc, etc.) then I'd be happy to use it in large intranet applications.

    11. Re:Slimey adverts? by Flarelocke · · Score: 1

      This is just search engine software. First of all, no one needs to make money from this, they can just use it for their own websites or intranets. Second, a company could only spider the domains of companies that pay for it, or only index the, say, PDF's of domains that pay for it. Subscription based services could contract with the company to write a specific spider that indexes the pages behind the subscription (or pay-per-view documents).

      Or, Yahoo could use it. Yahoo's still a portal, and the popularity of their site brings in money. Hiring a full-time developer to work on the software is not out of the question if they think they can surpass google one day.

    12. Re:Slimey adverts? by FroMan · · Score: 1

      Hmmm, to tell you the truth, I don't think advertisements are slimey all the time.

      For instance, yesterday I was interested in finding how much it might take to buy some apple tree and sugar maples. So, the advertisements were valid answers to the questions I popped into google.

      I think google's method of highlighting advertisements is perfectly valid (and non-intrusive). It's at least much nicer than seeing banner ads on every page.

      I personally don't see what folks here on slashdot have to complain about advertising. Unless you are going to pay to have pages indexed for searching, where do you think the tangible costs such as bandwidth, machines, and electricity to run those machines come from? There is no such thing as a free lunch, just sometimes you are lucky enough to have someone else foot the bill.

      --
      Norris/Palin 2012
      Fact: We deserve leaders who can kick your ass and field dress your carcass.
  9. Seems like /. by darkstar949 · · Score: 1, Insightful

    This seems to me like the /. moderation system, with the pages being ranked based upon how the user feels about the site.
    However, I could see some disadvantages to the system depending upon how it is set up, because one person could keep dinging a site to get its score to drop down.

    1. Re:Seems like /. by qtp · · Score: 1

      You mean like the moderators are going to do to the rest of your posts?

      --
      Read, L
  10. Biased listings by Champaign · · Score: 4, Insightful
    I think many commercial search engines have learned that biasing themselves to sites who have paid them is a good way to erode consumer confidence, and damage their readership/userbase. Just as newspapers have to at least provide the image of objectivity, the same demands are on search engines.

    I'm quite comfortable with how Google does this (present commercial links clearly marked to the side), and am not convinced a non-commercial (open source) alternative is needed.

    1. Re:Biased listings by zangdesign · · Score: 1

      I think many commercial search engines have learned that biasing themselves to sites who have paid them is a good way to erode consumer confidence

      Where is your evidence to support this? Of five major search engines, only Google doesn't directly insert advertisements into your search results stream. Yahoo, MSN and others do and it doesn't seem to have affected their business.

      --
      To celebrate the occasion of my 1000th post, I will post no more forever on Slashdot. Goodbye.
    2. Re:Biased listings by shaitand · · Score: 1

      Those who use MSN use it because it's the default on windows systems, which is 90+% of the systems out there. Most of the others get traffic because ISPs set those engines as the default in browsers when their config apps are run. Yahoo used to be the #1 search engine, they got more and more ads, people started to stray, each engine in turn that people turned to became the bearers of reams of ads and finally google came along and (then) had none. It still has fewer.

  11. It's not "Businuess" by The+Bungi · · Score: 1
    Businuess 2.0 article on it.

    It's "Business". Hope that helps.

  12. just don't get it by Astrorunner · · Score: 3, Insightful

    I think that you absolutely have to have a closed source algorithm for ranking pages, because otherwise you'll get people who will simply tune their pages to be high on the list. I can see how making the majority of the search engine open source would be beneficial, but the algorithm itself? It's like saying "Here's the keys to my car" and thinking that, because everyone has access to the keys, no one's going to drive away with it. Sure, everyone has the opportunity to make your search engine better, but never underestimate the tenacity of a web-wanna-be-millionaire.

    1. Re:just don't get it by cduffy · · Score: 4, Insightful

      Think about cryptosystems: The whole point about the really good ones is that you can know the algorithm, but still not break it. Granted, pulling that off for a search engine is prone to be much, much harder -- but I *do* believe it's well within the realm of possibility. Ambitious in the extreme? Certainly... but there's something to be said for high-risk-high-reward projects.

    2. Re:just don't get it by Anonymous Coward · · Score: 0

      This sounds an awful lot like the closed-source security-by-obscurity-i-am-an-industry-official type of speak..

      Same arguments apply.

    3. Re:just don't get it by DerekLyons · · Score: 1
      Think about cryptosystems: The whole point about the really good ones is that you can know the algorithm, but still not break it. Granted, pulling that off for a search engine is prone to be much, much harder -- but I *do* believe it's well within the realm of possibility.

      Within the realm of the possible? Even better than that, it's being done every day. There are whole websites out there dedicated to doing just that, and not without success.

      Read the search engine watch forums for ongoing discussion of just these issues.
    4. Re:just don't get it by Eraser_ · · Score: 1

      The problem with that argument is that cryptosystems rely on secrets in other places. A private key to decode with, a password to run through, something unknown, and completely variable. With search engines, there's no salt, nothing to be random. Does this page look more like "redhat linux apache package upgrade" than that one? Yes or No.

    5. Re:just don't get it by shaitand · · Score: 1

      Yeah, and the same people are the ones who will pay to have these work arounds and holes fixed. Their competition finds a way to get ahead of the game by studying the source... studying their page and the source it's only a matter of time until they figure out how the competition is beating their ranking, they fix hole and get to be number 1 with their exploit... competition is pissed and does the same... there will always be a way to get ahead, but slowly they'll become less obvious and less effective. Only those who are most on top of their game will be able to take advantage of tricks in the system and most online stores won't... if I'm wrong and they are, it's just that many more contributors and that much less likely any "tricks" will give too much of an edge.

    6. Re:just don't get it by cduffy · · Score: 1

      But the question isn't just if it looks like "redhat linux apache package upgrade" -- the question is if it looks like useful information about "redhat linux apache package upgrade". If the usefulness algorithm can be tuned to the point where data has to genuinely look useful (appears to be not random words but actual (semi?) grammatical sentences, consistent in word usage with other pages on similar topics which are frequently linked to, coming up clean for hostile javascript, perhaps passing a heuristic or two tuned to detect the more naive methods of cheating the above, whatever) -- then it's good enough that the easiest way to "cheat" it is to put up the information the user's really looking for, and when you get to that point, you've won, no?

      Of course, what might be really fun (from an "evil" sort of perspective) is to try to see if you can come up with a genetic algorithm or such for generating pages that rank high within the system... but then, that can be used in reverse: it's pretty damned hard to, even given a full copy of the current net status, reverse-engineer a (sufficiently deep) neural network to decide on what it's making its decisions; a search engine using a neural network for page weighting (just general "usefulness" score, not necessarily with knowledge of the keywords in question) can be quite hard to cheat -- particularly if it's being taught as an ongoing process using input from the pages of effective cheaters.

      My general point stands, though: Humans can identify "useful" pages. Why not try to teach a search engine (through AI techniques, fancy heuristics, a combination of the same, whatever) to do just that same thing? That goal achieved, the only way to "cheat" the system is to create a genuinely useful page -- and at that point you haven't cheated at all.

    7. Re:just don't get it by blibbleblobble · · Score: 1

      "The whole point about the really good ones is that you can know the algorithm, but still not break it."

      Web of trust? "Your implicitly trusted website x reveals _these_ results within 5 spidering hops from the initial website"
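
      One way to read the "within 5 spidering hops" idea is as a breadth-first walk of the link graph that starts at a site the user already trusts and stops after a fixed number of hops. The sketch below is only an illustration under that assumption: the function name, the `links` dictionary and the example page names are all made up here, and nothing in it comes from Nutch.

```python
from collections import deque


def pages_within_hops(links, trusted_root, max_hops=5):
    """Return {page: hop_count} for pages reachable from trusted_root.

    links is a hypothetical link graph: {page: [pages it links to]}.
    Only pages at most max_hops link-follows away are included.
    """
    hops = {trusted_root: 0}
    queue = deque([trusted_root])
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            # Do not spider any further out from this page.
            continue
        for target in links.get(page, []):
            if target not in hops:
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops


# Hypothetical example graph: a simple chain of four pages.
example_links = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
trusted = pages_within_hops(example_links, "a", max_hops=2)
```

      A search front end could then restrict (or boost) results to the pages in `trusted`, so a spammer would have to get linked from inside the user's own trust neighbourhood before showing up at all.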

  13. I see two problems by Anonymous Coward · · Score: 1, Interesting

    Two problems:

    1. Bandwidth. Having to search through so much data is going to take so much bandwidth, how could you pay for it?
    2. Patents. Google has lots of patents in this area, I imagine other search engines do as well. This is one area where I think software patents are deserved, since Google's algorithms are actually innovative. I don't think they will be willing to let you use their patents in a GPLed app.
    1. Re:I see two problems by Anonymous Coward · · Score: 0

      2. Patents. Google has lots of patents in this area, I imagine other search engines do as well. This is one area where I think software patents are deserved, since Googles' algorithms are actually innovative.

      Man, gotta love the hero worship of Google on this site. First of all, in terms of publicly available patents, Google has very few in the search field. Many of the few they have are highly questionable (there is a fair bit of prior art). Most of the seminal work was at AltaVista.

      Most experts believe Google has built up a stockpile of provisional patents, taking advantage of a variety of loopholes in the patent process to obtain a patent but keep it secret for years (yup, this really helps with the scientific process).

      Eventually these patents will have to become public, at which point you'll all be screaming about how Google has manipulated the patent process, to patent trivial variants on existing ideas.

  14. If it's like every other SourceForge project... by realmolo · · Score: 2, Insightful

    Here's what I expect to see on the webpage in a few months: "Currently Nutch is in the alpha stage- it doesn't index any web pages, doesn't return any results, and has no user interface. Programmers needed!" Google has WON the search engine war, probably forever. Find some other mountain to climb, guys.

    1. Re:If it's like every other SourceForge project... by Anonymous Coward · · Score: 0

      So since MS has won the browser and OS war, we should abandon all open source projects relating to those as well, right? Mod parent down -1 dumbass.

    2. Re:If it's like every other SourceForge project... by PhoenixFlare · · Score: 1

      You do seem to have a point, though...After seeing the list of intended features on their webpage:

      * fetch several billion pages per month
      * maintain an index of these pages
      * search that index up to 1000 times per second
      * provide very high quality search results
      * operate at minimal cost


      My first thought was...Gee, is that all? Good luck, guys, but this is like playing catch-up to someone who's already crossed the finish line. Compare Google to IBM or Oldsmobile all you want, but the facts remain.

      And as someone says above, having the founder of Alexa as a "friend" and a (partially) paid-advertising driven search company among their sponsors...I wouldn't get your hopes up quite yet.

    3. Re:If it's like every other SourceForge project... by AchmedHabib · · Score: 2, Insightful

      Google has WON
      You mean just like Altavista had? :)

    4. Re:If it's like every other SourceForge project... by lostinchicago · · Score: 1

      google works way better than any other search engine out there.

      if i'm not right in this matter i'd like to know a better one (so i can use it)

    5. Re:If it's like every other SourceForge project... by DynamiteNeon · · Score: 1

      And at one time, ICQ was the best IM client. Then, AOL bought them.

      Don't try to predict the future.

  15. Accuracy is relevance by AtariAmarok · · Score: 2, Informative

    To me, accuracy is the most important "Relevance".

    The problem with Google is that there are errors in it: you ask for something and sometimes you get something else.

    A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.

    Since it seems like Google will never fix this problem, I'm looking forward to something with all of Google's great features, plus accuracy.

    --
    Don't blame Durga. I voted for Centauri.
    1. Re:Accuracy is relevance by binaryDigit · · Score: 3, Informative

      A search on "to be or not to be" produces an error (non-matching results) in three of the first ten results: a 30% search failure rate. It used to be worse, when most of the links were bad.

      This is a bit of a misrepresentation. Google will toss the words 'to' 'be' and 'or'. So you effectively end up searching on 'not'. It does this to eliminate words that show up too frequently and make the searches faster (and because of the overloading of the word 'or'). If you really want that text, then either quote the whole thing, or place a '+' in front of those words, which will give you exactly what you're looking for. So there is no problem with its accuracy when you understand the proper way to ask it for something.
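
      The behaviour described above (drop very common words unless the phrase is quoted or the word is prefixed with '+') can be sketched roughly like this. It is a toy illustration only: the stop word list and the function name are assumptions made up for the example, and Google's real query parser is not public.

```python
import re

# Hypothetical stop word list, chosen only to mirror the comment above.
STOP_WORDS = {"to", "be", "or", "the", "a", "of"}


def parse_query(query):
    """Split a query into (phrases, terms).

    Double-quoted phrases are kept whole, a leading '+' forces a word
    to be kept, and any other stop word is silently dropped.
    """
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    # Remove the quoted parts so they are not tokenized a second time.
    rest = re.sub(r'"[^"]*"', " ", query)
    terms = []
    for token in rest.split():
        forced = token.startswith("+")
        word = token.lstrip("+").lower()
        if word and (forced or word not in STOP_WORDS):
            terms.append(word)
    return phrases, terms
```

      With this sketch, an unquoted `to be or not to be` keeps only `not`, while the quoted form survives as a single phrase and `+to +be` keeps both words.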

    2. Re:Accuracy is relevance by antibryce · · Score: 1

      Thank you for pointing that out. It seems most people when pointing out problems with google are really just highlighting their lack of understanding of how it works.

      Imagine if I complained that Linux needed lots more work because when I'm at the command line I get an error from typing "move my email inbox to the floppy disk."

    3. Re:Accuracy is relevance by WTFmonkey · · Score: 1

      Well, if you slap some double-quotes around it (which I'm assuming is what was intended), you get accurate, but maybe not what you were [probably] looking for.

      The first link is about Barium Enemas, I shit you not.
      The second is about BeOS, and the third is some randomass link at funbrain.com.
      In the fourth we finally get some Shakespeare.
      Point is, these are all links that "capitalized" on the "to be or not to be" cliche and so are accurate results. Although, probably not what you were looking for. Next time try "Hamlet," "Shakespeare," or like that.

      If all you know is the "to be or not to be" part, and can't remember who said it, or where they said it, hitting it on the fourth link is pretty damn good for a search that blind.

    4. Re:Accuracy is relevance by bersl2 · · Score: 1

      Why is a non-zero failure rate such an abominable thing? At some times, maybe finding something you weren't expecting is a positive. Perhaps a search engine with a "fooling around" mode using a more heuristic search method (which still excludes keyword floods)?

    5. Re:Accuracy is relevance by AtariAmarok · · Score: 1

      " Why is a non-zero failure rate such an abominable thing? At some times, maybe finding something you weren't expecting is a positive."

      If you reach into the freezer without really looking, thinking that you are grabbing a freezer-pop, and get an 8 month old leg of lamb instead, are you going to shrug and eat the lamb anyway?

      " Why is a non-zero failure rate such an abominable thing? "

      Come to think of it, I have to ask. Which development team has Steve Ballmer assigned you to?

      --
      Don't blame Durga. I voted for Centauri.
    6. Re:Accuracy is relevance by ColdGrits · · Score: 1

      Funny, I just tried a search on "To be or not to be" and of the first 10 results, all 10 were related to the phrase "To be or not to be".

      You did put "" round the phrase, didn't you?

      --
      People should not be afraid of their governments - Governments should be afraid of their people.
    7. Re:Accuracy is relevance by bersl2 · · Score: 1

      If you reach into the freezer without really looking, thinking that you are grabbing a freezer-pop, and get an 8 month old leg of lamb instead, are you going to shrug and eat the lamb anyway?

      Coincidentally, I was just eating a freezer-pop.

    8. Re:Accuracy is relevance by ekidder · · Score: 1

      The reason you got the results you did is because Google also checks the links which point to the page it displays. For instance, if you look at the cache of the Amazon result, you get:

      These terms only appear in links pointing to this page: to be or not to be

      That is how Google has always claimed to work (well, sorta). Google has always used more than just the text on the page to judge how well a page matches the search terms. Just because you were expecting it to work differently doesn't make it inaccurate :)
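
      A minimal sketch of that idea, assuming a toy inverted index: words from a link's anchor text are credited to the page the link points to, not only to the page that contains the link. The data layout, the function names and the example URLs are all hypothetical, and this is not a description of Google's actual implementation.

```python
from collections import defaultdict


def build_index(pages):
    """Build a word -> set of URLs index.

    pages is a hypothetical structure:
    {url: {"text": str, "links": [(anchor_text, target_url), ...]}}
    """
    index = defaultdict(set)
    for url, page in pages.items():
        # Words in the page body count for the page itself.
        for word in page["text"].lower().split():
            index[word].add(url)
        # Words in a link's anchor text count for the link target.
        for anchor, target in page["links"]:
            for word in anchor.lower().split():
                index[word].add(target)
    return index


def search(index, query):
    """Return the URLs matching every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up example: the game page never contains the phrase itself.
example_pages = {
    "grammar-game.example": {"text": "2Bee or Nottoobee verb game", "links": []},
    "fan-page.example": {
        "text": "fun links",
        "links": [("to be or not to be", "grammar-game.example")],
    },
}
example_index = build_index(example_pages)
```

      Searching `example_index` for "to be or not to be" returns the game page even though those words only appear in a link pointing at it, which is the situation the cache notice describes.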

    9. Re:Accuracy is relevance by Anonymous Coward · · Score: 0

      The problem with Google is that there are errors in it: you ask for something and sometimes you get something else.

      I agree whole-heartedly.

      What we need is a fucking mind-reading search engine.

      Well maybe not fucking per se, unless it's a 5'10" red-head with huge mother-fucking knockers and an ass so tight you could bounce quarters off it and a...but I digress.

      Anyways, not only should the search engine be able to read your mind, it should return results based on what you really meant, not what you actually typed. I mean, what good is it if I type in "women who have sex for money to support their drug habit" and it returns a list of sites that contain documentary/interview style news articles on the tragedies of abused women hooked on crack, when all I'm really looking for is a list of names, phone numbers, and prices of "women who have sex for money to support their drug habit"?

    10. Re:Accuracy is relevance by randyest · · Score: 3, Informative

      If you reach into the freezer without really looking, thinking that you are grabbing a freezer-pop, and get an 8 month old leg of lamb instead, are you going to shrug and eat the lamb anyway?

      Of course not. I'd put it back and try more carefully to get what I want. I, what's the word I'm looking for, . . . wait for it . . . refine my search :)

      Regarding your comments above about google inaccuracy: I searched for +"to be or not to be" and consider the first page of 10 hits to definitely be 100% "correct". In fact, all of the 104,00 results that I checked (about 50, hehe) are 100% correct in that the sites on the list, or the sites linking to the sites on the list, contain the phrase "to be or not to be". Check the '2bee or nottoobee' link in google's cache and where you normally see the search term highlight colors, you'll see

      These terms only appear in links pointing to this page: to be or not to be

      Just because you wanted "Shakespeare" doesn't mean that "Shakespeare" is any more correct as an "answer" to "to be or not to be". If it were more popular (on the web), I'm confident that it would be higher on the list. That is, whether we like it or not, on the current www there are exactly 3 things more relevant to that famous phrase than Shakespeare, and they are, in order: barium enemas, beOS, and a kids' grammar game starring a bee. Or, more accurately and revealingly: an article about barium enemas titled "To BE or Not to BE?", an article about BeOS titled "TO Be OR NOT TO be?", and a kids' grammar game starring a bee called "2Bee or Nottoobee" which is linked to by sites containing the phrase "to be or not to be" in or near those links.

      Lucky for us that ol' Bill is still in the top 10 at all, I'd say.

      --
      everything in moderation
    11. Re:Accuracy is relevance by Anonymous Coward · · Score: 0

      That may be accurate, but certainly not relevant.

    12. Re:Accuracy is relevance by Pastis · · Score: 1

      it's up to us to influence these results.
      I guess if all slashdotters search for your query, and then click on the same 'real' Shakespeare reference, then the clicked reference will go up in the ranking.
      E.g. click on this result (once it recovers from the hard time it is having).

    13. Re:Accuracy is relevance by randyest · · Score: 1

      You might guess that, but I'm afraid you'd be wrong. As I understand it, google pagerank is in no way affected by how many people click a link from search results (though that might be an interesting addition to the pagerank system, especially if the user could tag links as 'good' or 'bad', but this seems pretty open to abuse by bot clicks). Rather, pagerank is simply this:

      The rank of a document is given by the rank of those documents which link to it.

      So, if you want to make Bill show up higher in the results from searches for "to be or not to be", clicking links won't help. Creating links will. And getting higher pageranks for your pages which contain the links helps more. Which means you need lots and lots of links (and not just from one site, preferably from many, highly-rated sites) and/or links from highly-ranked pages, to have any effect.
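
      For the curious, the "rank is given by the rank of the pages linking to you" rule can be sketched as a simple iterative calculation. This follows the textbook form of PageRank with a damping factor of 0.85; the function name, the handling of pages without outgoing links, and the example page names are assumptions made for illustration, and the production algorithm is not public.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page in a link graph.

    links is a hypothetical graph: {page: [pages it links to]}.
    Each page splits its current rank evenly among its outgoing links.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Made-up example: two blogs link to a Hamlet page, nobody links to the other.
example_links = {
    "blog_a": ["hamlet"],
    "blog_b": ["hamlet"],
    "hamlet": ["blog_a"],
    "enema": [],
}
example_ranks = pagerank(example_links)
```

      In this toy graph the page with two inbound links ends up ranked above the page with none, and `blog_a` (linked from the well-ranked Hamlet page) ends up above `blog_b`, which is the "links from highly-ranked pages help more" effect.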

      Or I suppose we could just bribe the pigeons and save time. But what do the pigeons want? :)

      --
      everything in moderation
    14. Re:Accuracy is relevance by Pastis · · Score: 1

      OK I can link that page from many weblogs, e.g. just adding it as my signature.

      But how do you define 'highly-rated sites' outside a search result context? Is it just the number of pages searched? Or the number of pages that end up high in the ranks for any kind of search?

    15. Re:Accuracy is relevance by randyest · · Score: 1

      I don't, google does, and their exact algorithm is a tightly-guarded secret. You can easily see the google pagerank of any webpage if you install the google toolbar (I use the beta 2.0, since it also has a nice popup stopper, form-filler, and other handy features), but changing a pagerank is a much harder task. In general, a higher pagerank means the page has more sites (each weighted by its pagerank, of course) linking to it.

      I see the potential confusion here, since it can appear to be a bit of a vicious loop (though it really isn't) -- you have to get highly-ranked sites to link to you in order to increase your pagerank. Or, less-effectively, get lots and lots of not-so-well-ranked pages to link to you. Either way, you have to get people to link to you (or cheat and make fake sites whose purpose is only to link to your other sites, but google catches on to that eventually). The way you do that is by having good information. Which is why the system works so well.

      --
      everything in moderation
  16. Who? by Chess_the_cat · · Score: 1
    After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?

    The only search engine I ever use is Google and it seems to find relevant data just fine. And the ads are small, discreet, and actually useful. What's the problem?

    --
    Support the First Amendment. Read at -1
    1. Re:Who? by Anonymous Coward · · Score: 0

      After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising?

      Why? Don't similar things happen in the yellow pages? Is it insidious that some companies pay for larger listings or for colour listings that attract a reader's attention quicker than the standard small black and yellow listing?

  17. schweet by Anonymous Coward · · Score: 0

    i'm scho exschited to usche thisch on my blog schite. it'sch scho exscheschible, that even a noobie can hammer out a schuper-schweet nutch-hack in a couple of hoursch.

    i proposche a new schite to catalog thesche hacksch called nutchhack.com.

  18. Seems pretty pointless by cryptochrome · · Score: 4, Insightful

    Free and open code is good and all... but the one real cost of a search engine is RUNNING it. It requires a far from trivial amount of bandwidth and hardware, and somebody has to pay for all of it. Unless someone comes up with a novel P2P solution (and many are trying) it just won't happen.

    What they should be doing is pressuring the existing search engine companies for some integrity.

    --

    ---If you can't trust a nerd, who can you trust?

    1. Re:Seems pretty pointless by jawtheshark · · Score: 2, Interesting

      Yes, that would hold true if you want to index the WWW. But what about indexing an intranet? Now businesses are paying Google for indexing servers (not that I think it is bad), but an Opensource searchengine could save costs for medium sized businesses. Just toss in another Quad Xeon with a few Gigs of RAM and it will do fine for a normal intranet.

      --
      Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
  19. Forget It. by Boss,+Pointy+Haired · · Score: 1
    In the commercial Internet, the mechanism by which you find commercial sites must be paid for by the sites which you find, otherwise basic economics breaks down and it will not work (abuse etc.).

    Thousands of companies provide $product - free search engines simply direct all users to one supplier of $product. That's not right.

    Searching for a supplier of $product is not like searching for information - it is not something that can be done outside of payment by the supplier of $product.
    1. Re:Forget It. by mblase · · Score: 1

      Thousands of companies provide $product - free search engines simply direct all users to one supplier of $product. That's not right.

      Neither is that sentence. One subject per verb, please?

  20. Nutch? by burgburgburg · · Score: 1
    Acronym, non-obvious pun, obscure reference?

    The FAQ doesn't explain the name.

    1. Re:Nutch? by qwerty823 · · Score: 3, Funny

      who knows... but as soon as they get it working, they can use it to search for a better name!

    2. Re:Nutch? by smithmc · · Score: 1

      Acronym, non-obvious pun, obscure reference?

      It's Jason Mewes' new gig, now that Kevin Smith has killed off Jay & Silent Bob. "Snootch to the Nutch! BONGGGG!"

      --
      Downmodding is the refuge of the weak. Don't downmod, make a better argument!
  21. that business2.0 article.. by joeldg · · Score: 1

    it reads more like some strange marketing propaganda than anything.

    That project has no releases, has nothing in cvs and very scant details on what it even "is" ..

    There are many many projects out there with so much more info available, why is this one that has not released anything getting so much attention?

    1. Re:that business2.0 article.. by Anonymous Coward · · Score: 0

      What strange marketing propaganda?

      CVS has a plethora of information. 1,216 commits, 285 adds

      And the board of directors has Tim O'Reilly. That's worth attention.

    2. Re:that business2.0 article.. by joeldg · · Score: 1

      well..

      then I stand corrected.

      sf cvs must have hiccuped when I went to look.

  22. not a good idea.... by edrugtrader · · Score: 3, Interesting

    google is already ideal... the weight of search results is not sold, just text ads.

    people are already 'googlebombing' to try and get better rankings by signing up tons of domains and cross linking them all with the keyword that they want to be #1...

    if the algorithm that determined how #1 is determined was public, then the best possible strategy to cheat the system could be devised... instead of paying for weight to the search engines you would be paying web developers to make the search engine think you were #1. and as a web developer i feel that.... oh... wait, proceed.

    --
    MARIJUANA, SHROOMS, X: ONLINE?! - E
    1. Re:not a good idea.... by Anonymous Coward · · Score: 0

      if the algorithm that determined how #1 is determined was public, then the best possible strategy to cheat the system could be devised...

      Security through obscurity doesn't work.

    2. Re:not a good idea.... by curunir · · Score: 2, Informative

      You've entirely missed the point of this project.

      I highly doubt that Nutch is going to offer an alternative to Google in the area of web search. What they seem to be doing is offering an alternative in the area of Enterprise search.

      Currently, the company that I work for pays Verity (used to be Inktomi, before that Infoseek) tens of thousands of dollars a year for the use of their software. We use their software to make our own site searchable. If Nutch offered us a free alternative to our Ultraseek server, we'd definitely be interested.

      We don't have to worry about anyone "googlebombing" our search collections because, well, we create all the content that goes into those collections. We'd love it if the algorithm that determined rankings was open-source. That way, we could change it to suit our specific needs if we thought it would help return more relevant results. There are currently a number of undesirable phenomena that we live with or work around because the mechanics of the problem are buried within proprietary Ultraseek code.

      Google is the best of the best in web search and I don't think anyone short of MS is interested in challenging them for that. But 'search engine' in this case means something entirely different.

      --
      "Don't blame me, I voted for Kodos!"
    3. Re:not a good idea.... by Molt · · Score: 1

      You may want to have a look at Xapian. It's an Open Source document indexing system, which in my experience scales very nicely. The Omega system, which is built on Xapian, is a nice not-so-little search indexing system. I've used it on multi-million (admittedly reasonably short) document indexing projects in the past and it's coped admirably.

      The search refinement system it supports may seem a little strange, but to me it seems powerful and you don't have to use it if you just want 'simple search'.

      --
      404 Not Found: No such file or resource as '.sig'
    4. Re:not a good idea.... by randyest · · Score: 1

      What they seem to be doing is offering an alternative in the area of Enterprise search.

      Oh, you mean like what ht://dig has been doing since 1995?

      --
      everything in moderation
    5. Re:not a good idea.... by shaitand · · Score: 1

      google DOES sell page rank, they have a disclaimer that it's subject to their review but money talks and bullshit walks.

      Hiding the algorithm does no good. If it were open, then the same people who are willing to pay to be #1 would make sure no obvious flaw survives... why would they do that? simple, they have competitors. This is another example of how open source, rather than destroying jobs for programmers, will create them. The companies in question would hire a coder or two both to find ways to cheat for them and to fix the ways their competitors use... of course their competitors will do the same, and in the end ways to cheat will be pretty obscure and likely not very effective.

      What could stop this from happening is the resources required to run a search engine... of course things like kernel.org tend to make the odds of that being an impossible-to-overcome obstacle smaller. The open source world has massive amounts of bandwidth donated to it, and in some pretty hefty chunks at that.

  23. looking forward to it by Anonymous Coward · · Score: 1, Interesting

    take a look at the developers and contributors. these guys are all top notch. doug cutting, one of the developers there, is the developer of lucene, one of the best libraries out there for developing application search engines in any language. not to mention overture, internet archive, and mitch kapor.. looks like an all-star team. can't wait to play with the software.

  24. Bandwidth Costs by NDPTAL85 · · Score: 1

    Who's going to pay for them if it's a non-profit open source project? Bandwidth doesn't grow on trees you know.

    And slimy adverts? Google has slimy adverts? I thought they only had relevant adverts? Oh well I guess we need another dot.com that will go bust in 6 months or so.

    --
    Mac OS X and Windows XP working side by side to fight back the night.
    1. Re:Bandwidth Costs by Anonymous Coward · · Score: 0

      Google has slimy adverts? I thought they only had relevant adverts?

      They do indeed. And Google goes to great pains to ensure that they stay relevant.

      Google's "Ad Words" campaigns require a given click-through-rate. If that isn't achieved then Google slows, then stops, the rate at which it displays your ads.

      Bottom line - if it's not relevant, then people don't click, and the ads don't get shown. If it is relevant then people click and the ads continue to appear.
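
      The slow-then-stop rule described above can be sketched roughly as follows. This is only a toy model: the function name `display_rate`, the required click-through rate, the minimum impression count, and the "slowed" rate are all made-up assumptions, not anything Google has published.

```python
def display_rate(clicks, impressions, required_ctr=0.005, min_impressions=1000):
    """Return the fraction of eligible searches on which an ad is shown.

    All thresholds here are hypothetical; they only illustrate the idea of
    slowing and then stopping an ad whose click-through rate is too low.
    """
    if impressions < min_impressions:
        return 1.0  # too little data to judge the ad yet
    ctr = clicks / impressions
    if ctr >= required_ctr:
        return 1.0  # relevant enough: keep showing at the full rate
    if ctr >= required_ctr / 2:
        return 0.25  # underperforming: slow the ad down
    return 0.0  # far below the bar: stop showing it


print(display_rate(clicks=10, impressions=1000))
print(display_rate(clicks=3, impressions=1000))
print(display_rate(clicks=1, impressions=1000))
```

      With these assumed numbers, an ad at 1% click-through keeps running, one at 0.3% is slowed, and one at 0.1% is dropped.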

      Google are, strangely, in the position they're in mostly because they seem to know what they're doing.

  25. Can this work? by jmkaza · · Score: 4, Insightful

    I think the idea is good in principle, but could it actually succeed? Google gets hit with millions of requests each day. They've got hardware that can support thousands of slashdottings a day and a fat pipe to feed all of that info out. That takes a lot of money. Financing an open source project is difficult enough, but financing an open source service such as that would seem next to impossible. Ideas?

    The other major problem would be that, with the ranking criteria being available for all to see, it would be relatively simple to manipulate page rankings.

    1. Re:Can this work? by casio282 · · Score: 2, Insightful

      I think it's a fabulous idea, the kind of idea that makes me slap my head and say "why didn't I think of this?" You're right -- the biggest obstacle to producing a truly free (as in speech, natch) search engine solution is not in producing the software (patent minefield notwithstanding), but in the "physical" costs of hardware and bandwidth.

      I think the way to overcome this obstacle is to develop a distributed system...run a nutch node on your server, host a few GBs of index data. There could be master nodes that are able to route requests to the right nodes for a given set of keywords. It sounds far-fetched, and I can't work out the network topology off the top of my head, but I bet it's doable. Of course, you'd have to build redundancy into the system to make sure it's not exploited, and that a power outage (or a machine that's not up 24-7) somewhere doesn't cause failures. You'd also want to encrypt the locally stored data to further protect against exploits, and to perhaps (IANAL) indemnify the node-owner to some degree from whatever problems s/he might face "hosting" this material, kinda like Freenet.
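
      One possible way a master node could route keywords to index nodes with some redundancy is rendezvous hashing, sketched below. This is a guess at one workable design, not anything Nutch has announced: the node names, the `replicas` parameter, and both function names are hypothetical.

```python
import hashlib


def nodes_for_keyword(keyword, nodes, replicas=2):
    """Pick `replicas` nodes to hold the index slice for one keyword.

    Uses rendezvous (highest-random-weight) hashing: every node gets a score
    for the keyword and the top scorers win, so any master node can compute
    the same placement without a shared lookup table.
    """
    def score(node):
        digest = hashlib.sha256(f"{node}|{keyword.lower()}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(nodes, key=score, reverse=True)[:replicas]


def route_query(query, nodes, alive, replicas=2):
    """Map each query word to the first live replica, or None if all are down."""
    plan = {}
    for word in query.split():
        candidates = nodes_for_keyword(word, nodes, replicas)
        live = [node for node in candidates if node in alive]
        plan[word] = live[0] if live else None
    return plan


# Hypothetical volunteer nodes, one of which is currently offline.
nodes = ["node-a", "node-b", "node-c", "node-d"]
print(route_query("open source search", nodes, alive={"node-a", "node-b", "node-c"}))
```

      Because each keyword lives on more than one node, a single machine that is not up 24-7 only causes a failure when every replica for that keyword is down at once.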

      It's interesting. I hope they think about this sort of approach.

      --

      :wq
    2. Re:Can this work? by Wesley+Felter · · Score: 1

      Nutch is not a service; it's just the software. Running it is up to you.

  26. Search engine game is NOT over by AtariAmarok · · Score: 4, Insightful

    "Google has WON the search engine war, probably forever. Find some other mountain to climb, guys."

    At one time, Oldsmobile won the auto company wars. Where are they now?

    IBM ruled the PC roost. Hmmmm....

    Command-line OS's were king. But now???

    Altavista and infoseek and Lycos were search engine kings at one time. Whither this trio?

    The point is, it is not over.

    --
    Don't blame Durga. I voted for Centauri.
    1. Re:Search engine game is NOT over by RedWizzard · · Score: 1
      The point is, it is not over.
      Unless Google gets worse I think it is. I don't think it is possible to produce the degree of improvement (to the quality of results) necessary to take significant market share from them. Google is good enough. I may be wrong but I don't see a revolutionary improvement to searching happening again.
    2. Re:Search engine game is NOT over by shaitand · · Score: 0, Troll

      The only thing I see anytime soon will be a search engine that doesn't take bribes and has some way to ensure they never will. Google does, I don't know what slashdotters are on, but it took my boss 2hrs on the web to find out how to buy a higher rank for $200 "subject to their review of the site". Maybe if I get bored I'll pull it up myself later and link it... don't count on it though, I'm a lazy SOB AND have a new Sword of Truth book to finish.

    3. Re:Search engine game is NOT over by alexo · · Score: 1

      > Altavista and infoseek and Lycos were search engine kings at one time. Whither this trio?

      I agree about infoseek and Lycos but AV, while no longer "king", is still a very useful secondary search engine that nicely complements Google when you run into Google's limitations:
      - It has better support for boolean searches (including wildcards).
      - It does not limit searches to 10 words.
      - It supports searching for audio or video files.

      There are other features which can be of interest to some people:
      - Incremental refinement of searches using "Prisma".
      - Shortcuts.

      By the way, they also recently added a toolbar (similar to Google's).

      So, while not endangering Google's #1 position, AltaVista (and AllTheWeb too) could still be a useful addition to your search toolbox.

  27. Not making nutch sense by Anonymous Coward · · Score: 0
    At this point Nutch is coded entirely in Java, however persistent data is written in language-independent formats so that, if needed, modules may be re-written in other languages (e.g., C++) as the project progresses.

    One of those coffee out the nose moments for me.

    1. Re:Not making nutch sense by AtariAmarok · · Score: 2, Funny

      Don't worry. It is just a stepping stone to full project maturity reached when it is fully coded in Borland Turbo Pascal.

      --
      Don't blame Durga. I voted for Centauri.
  28. A Tough Challenge by Cloudmark · · Score: 5, Interesting

    One of the biggest issues with running a search-engine, open-source or otherwise, is that you can't eliminate bias in the results. No matter what scheme you put in place to handle rankings, someone will find a way to take advantage of it. It's a fact of any major system - there's always a way to twist it. Part of the challenge that Google and similar sites face is that they have to work constantly to protect themselves from systems designed to take advantage of their algorithm. While a completely unbiased search service would be nice, I think it would require the impossible. It would require that no one out here took advantage of it to further their own interests, be they political, commercial, or otherwise. That's fairly unlikely.

    Most of the major engines today, including Google, make an effort to prevent horribly unbalanced results (see the recent controversy over blogs outweighing professional sites in the rankings due to linking and other factors). Some even admit (again, Google does) to manually messing with the rankings a little. If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to. That's in their own public docs. It's also discussed in Wired.

    I honestly don't know if open-source could do a better job. The algorithm might be better (likely, given the manpower), but would it really be that much fairer?

    --
    "Be proud to be a fighter" - Martial Arts Adage
    1. Re:A Tough Challenge by rmohr02 · · Score: 1

      In fact, since the algorithm would be completely open, it would probably be easier to subvert. I'm sure Google has enough trouble working against people who guess at their algorithms, so you could imagine the trouble when people know the algorithm. Then again, many of the people who attempt to subvert search engines are probably fans of open source, and, as you said, there might be more manpower to work against them. Merely comparing open and closed search engines, it's a hard sell either way, but in this case, Google wins because everyone knows Google.

    2. Re:A Tough Challenge by PMuse · · Score: 1

      there's an obvious need for an unbiased search engine

      Umm, all search engines are biased. That is, each must choose a way to present results. Not to mention a way to acquire data and a way to compare criteria to the data. Trying to "eliminate" bias is futile. What searchers need is to know what the bias of a search engine is. Then they can decide whether that engine will serve for their task. Then they can know what "the results" mean.

      A program that calculates "averages" might return median, mode, midrange, mean, etc. All are "accurate" in a sense and all are useful for different purposes. Similarly, users of "search engines" must have disclosure of the program's method before they can make use of its results.
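
      To make the "averages" point concrete, here is a small sketch that computes four of them over the same data. The sample numbers and the function name `summarize` are made up for illustration.

```python
from statistics import mean, median, mode


def summarize(values):
    """Return four different 'averages' of the same list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        # Midrange: halfway between the smallest and largest value.
        "midrange": (min(values) + max(values)) / 2,
    }


scores = [1, 2, 2, 3, 12]
print(summarize(scores))
```

      For this sample the mean is 4, the median and mode are both 2, and the midrange is 6.5. All four are "the average", which is why the method has to be disclosed before the number means anything.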

      --
      "We reject as false the choice between our safety and our ideals." --The American President (20.1.2009)
    3. Re:A Tough Challenge by PMuse · · Score: 1

      After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising

      True enough -- a search engine that gives you results based on how much entrants paid for placement is good only for finding companies who paid a lot for placement.

      Of course, sometimes that's what you're looking for -- ever notice that large, full-service businesses often have large, full-color ads in the print yellow pages, while use of a cheap basic listing correlates well to smaller company size? You can use the bias of a search engine to your advantage if you know what that bias is.

      --
      "We reject as false the choice between our safety and our ideals." --The American President (20.1.2009)
    4. Re:A Tough Challenge by Moeses · · Score: 1

      The answer to the biased rankings problem is so obvious. Use an algorithm to generate the result set and then random sort!

    5. Re:A Tough Challenge by SKPhoton · · Score: 1

      Yep, they definitely put effort into it. Search for 'search engine' on Google and you'll see someone other than Google on top interestingly enough.

    6. Re:A Tough Challenge by ihummel · · Score: 1

      If the search engine were open-sourced, then more people, probably many more, would know how to manipulate the rankings to the fullest.

    7. Re:A Tough Challenge by randyest · · Score: 1

      And more people, probably many more, would know the algorithm well enough to study the manipulations and contribute modifications to nullify them.

      --
      everything in moderation
    8. Re:A Tough Challenge by shaitand · · Score: 1

      whether they are fans of open source or not, the people who subvert engine algorithms would be the same people who make it harder to subvert... those people aren't lone islands trying to improve their own rankings; they will jump at the chance to take their competitors down, and that means fixing whatever tricks their competitors use... which of course their competitors will do back. Some will slip through the cracks, or manage to stay on top for various lengths of time using these tricks, but ultimately the tricks will become more obscure and less effective. Open source has proven this works; it's been done with security holes... the code is out there for anyone to see, but the best hackers can do (with VERY rare exception) is follow what bugs have already been found and fixed and trust in idiots not to patch.. course there are plenty of those idiots.

      The difference here? Nobody needs to patch.

  29. Dupe! by Anonymous Coward · · Score: 0

    It must be indexing this page again.

  30. Business 2.0 is paid access only by prostoalex · · Score: 1

    To read the second page of this article use subscriber code 079751240X.

    Go to "Magazine subscribers: Enter here", then "Sign in using the account number on your subscription label" and enter the account number above.

    Courtesy of TechDirt.com

  31. Nutch will never get out of alpha stage by xannik · · Score: 2, Insightful

    I fail to see the point of such an endeavor. Without advertising, Nutch cannot possibly hope to become a serious contender with search engines such as google or overture. Advertising provides the money that enables search engines to have lots of bandwidth to send those results quickly back to users, lots of computing power to quickly process each search, even the ability to hire people to research new areas for better search results. Even if the search engine is selling its resources to other portals, like google does with yahoo, advertising would still be involved in the process. Yahoo would still need to be advertising on their site to bring in revenue to pay for the service. I think google's method is perfectly fine, with small text-based ads that are discreet. Why do we need to fix this?

    --

    Go Illini!!!
    1. Re:Nutch will never get out of alpha stage by Anonymous Coward · · Score: 0

      Lots of folks need a search engine, and not necessarily for the entire internet, but only the small bit that they've fenced off.

      Google is already selling search engine appliances. Several sites are "powered by Google".

      So, a search engine that performs well for The Internet(tm), may work just as well for something smaller.

      There are a lot of bits and pieces available to craft your own, but this sounds like a whole package that can, ideally, be easily dropped into place on an existing infrastructure.

    2. Re:Nutch will never get out of alpha stage by babyrat · · Score: 1

      who said there would be no advertising?

      Who said they are even making a search site?

      The website says they are making search engine software. You could take this software and create a search site. This could be a public site, or a private one behind a corporate firewall.

      You could fund the public site with advertising. Or not.. it's up to you. You could use it as a search engine for your own public internet site, without indexing the rest of the world.

      Microsoft Windows and the various flavours of Unix are all pretty good operating systems - what do we need Linux or FreeBSD for?

    3. Re:Nutch will never get out of alpha stage by xannik · · Score: 1

      The original AC who submitted the article had said this.

      "Anonymous Coward writes "Someone forwarded me this site working to create an open source search engine called Nutch. In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine. After all, isn't a search engine supposed to be for finding relevant data, not as an indirect and sometimes slimy method of advertising? Nutch is clearly in their intial stages, but it would certainly get my vote." You can find the project on SF.net, and also read the Business 2.0 article on it."

      I was responding to this person's statements about "slimy advertising" and "a need for an unbiased search engine." However, since you bring it up, I do think that an open source search solution for smaller and more specialized applications would be something to consider. But an open source replacement for google, as the original poster seems to be suggesting, does not appear to me to be possible.

      --

      Go Illini!!!
  32. Are they thinking too big? by xanderwilson · · Score: 3, Insightful

    I think they're setting themselves up for something that will get too big and too expensive before it can get finished, and they'll have to figure out a way to (gasp) get some funding beyond donations.

    I don't see a solution in one great open-source, independent search engine, but many individual specialized search engines, each mastering their own niche area of specialty stands a chance to compete, especially if run by people who focus on their areas of expertise. Alternative news search engines, music search engines, literary search engines, etc. each run by people who know what to filter in and out.

    If Nutch.org could create the technology that would allow each of these search engines to exist autonomously, it could also be the hub/portal/start-page/blahblahblah that links all these engines and databases together.

    Alex.

    1. Re:Are they thinking too big? by ktorn · · Score: 1

      You hit the bull there mate.
      Many small sized directories, tied together by a common API.

      Hopefully you won't have to wait too long for that to happen. ;)

  33. Feasible? by ae0nflx · · Score: 1

    No offense to the open source community, but I'm not sure how feasible an unbiased search engine is. The open source community does not like any bias towards commercial interests and has no problem pointing it out, but by the same token, they do enjoy plugging their own programs. That is completely understandable and normal for any community; however, it does not make the result unbiased or 'publicly biased'. The merit of the site is very subjective, in my opinion. I am in favor of a project such as this, but I just want to see it for what it is.

    Most open source internet based projects (*cough* Slashdot *cough*) have tended to be rather biased towards themselves. It would be very difficult to remove all subjectivity from a project of this nature. How can the ratings be controlled? If it is done entirely by the 'public bias' what's to stop bots from altering the 'public bias'? Just a few questions that still need to be answered.

  34. A suggestion that Google adopted by afflatus_com · · Score: 1
    I wrote to Google some time back with an algorithm suggestion that was adopted by them. An open source search engine is certainly welcome to it. It is a minor improvement, but every bit helps.

    For citations of most websites, some of the citing people will link to http://www.someplace.com, and some will link to http://someplace.com.

    Therefore, include a comparison of the pages returned by each URL, and if the same page is returned, then sum the reverse citations to calculate their total rank.
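
    A minimal sketch of that idea, assuming the crawler already has each URL's fetched body and a count of inbound citations. The function names, the sample counts, and the use of a content hash to decide "same page" are my own illustration, not how Google actually does it.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def strip_www(url):
    """Return host plus path with any leading 'www.' removed from the host."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + (parts.path or "/")


def merge_citations(pages, inbound):
    """Sum citation counts for URLs that differ only by 'www.' AND serve the same body.

    pages:   url -> fetched page body
    inbound: url -> number of pages citing that exact url
    Returns url -> combined citation count for its group.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        fingerprint = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[(strip_www(url), fingerprint)].append(url)

    merged = {}
    for urls in groups.values():
        total = sum(inbound.get(url, 0) for url in urls)
        for url in urls:
            merged[url] = total
    return merged


pages = {
    "http://www.someplace.com": "<html>home</html>",
    "http://someplace.com": "<html>home</html>",
}
inbound = {"http://www.someplace.com": 30, "http://someplace.com": 12}
print(merge_citations(pages, inbound))
```

    Both spellings end up with the combined count of 42 here. If the two hosts served different pages, the fingerprints would differ and the counts would stay separate.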

    --

    -----
    Cast a Cold Eye
    On Life, on Death
    Horseman, pass by
    --W.B. Yeats' gravestone
  35. Distributed Open Search Network by Massacrifice · · Score: 2, Interesting

    It'd be nice if they could make it distributed. Kinda like P2P search engines, but for the web. That way, the main searching server farm wouldn't be tied to any company in particular. That would give Google a run for their money, and would keep Microsoft at bay for another while.

    Being an open search network, some peer servers could specialize in searching what they're hosting, making it possible to index otherwise dynamically generated content. These specialized hosts would act as "search plugins" for some otherwise hard-to-define content.

    An authentication method (a la Freenet) would be needed, though. Some form of authority to prevent rogue peers from injecting too much crap in the results.

    Overall, a good idea. If they make it, I'll run it.

    --
    -- Home is where you eat your heart out.
    1. Re:Distributed Open Search Network by Anonymous Coward · · Score: 0

      Mod parent up!

      Heavy dependence on a resource directed by a small group is never a good thing. We have become more dependent on Google than we are on DNS now. Nothing has to really happen to distort the rules of a game - the mere potential of influence takes it down a different path.

  36. HTDig by Anonymous Coward · · Score: 0

    HTDig is written in C, configurable, and flexible.

    Nutch.. written in java. No thank you, I'd rather not have my machine slow to a crawl.

    Fact is, why can't these "developers" working on Nutch work on HTDig to add the features they want?

    HTDig is really nice.. Search Engine, help index tool... Really nice... You can even configure the ranking system to fit your needs.

    1. Re:HTDig by Anonymous Coward · · Score: 0

      > HTDig is written in C

      C++. But modern C++ features that have portability problems (such as namespaces, STL etc...) are not used.

  37. may not fly by wo1verin3 · · Score: 1

    >> In the age of weighted rankings on search
    >> engines for profits, there's an obvious need
    >> for an unbiased search engine.

    If you tell everyone what your page rankings are based on... that doesn't make it hard for companies to modify their pages to fit what the search engine is looking for to increase rankings or hits.

    There are some companies that do this for Google, as complicated as it may be.

  38. Hardware and Bandwidth by metalhed77 · · Score: 1

    according to http://www.nutch.org/docs/credits.html the Internet Archive is hosting nutch, and Overture has given them hardware. Sounds pretty sweet. Probably not the 20,000 strong linux cluster google has going though.

    --
    Photos.
  39. Mnogosearch : a viable Free Software alternative by Anonymous Coward · · Score: 0

    http://www.mnogosearch.org/

    Mnogosearch is viable web search engine software. It supports caching (a la google), database clusters, and easy use of external parsers... Maybe that project should enhance and help this excellent Free Software.

    adulau

  40. The answer is "Nutch" by Gudlyf · · Score: 3, Funny
    *Blows open envelope*

    The answer: "What did Sean Connery say when he saw the reviews for 'League of Extraordinary Gentlemen?"

    --
    Trolls lurk everywhere. Mod them down.
    1. Re:The answer is "Nutch" by Anonymous Coward · · Score: 0

      Then what is the question?

    2. Re:The answer is "Nutch" by Gudlyf · · Score: 1

      Ugh, that went bad. Feel free to mod previous (and this) post down, since I guess posting it again corrected was being a "whore". Sigh.

      --
      Trolls lurk everywhere. Mod them down.
  41. Funding....an issue.... or not? by ReyTFox · · Score: 1

    One thing that'll help Nutch financially is that they can use their technology for more than a single page running on their own servers (and taking on the huge loads that implies). They can use open-source business models instead, offering licenses with tech support, custom versions, etc.

    I was always kind of worried that we might end up with an internet controlled by Google, anyway. But we'll have to wait and see if it actually works or not, anyway. I sure hope so.

  42. That's the problem by AtariAmarok · · Score: 1

    "Google will toss the words 'to' 'be' and 'or'."

    That is the problem. The reason I put such words in phrases is because I want an exact match.

    " It does this to eliminate words that show up too frequently and make the searches faster"

    I would hope that Google solves this by getting faster servers, instead of producing bad results. Besides, if I did not want the results to include all the words in the phrase, I would not have included them in the phrase in the first place.

    " If you really want that text, then either quote the whole thing, or place a '+' in front of those words"

    I did quote the whole thing, and got 70% accuracy. By putting plusses in front of the words, I still got 70% accuracy.

    "So there is no problem with its accuracy when you understand the proper way to ask it for something."

    Quotes around the phrase do not work. Plus in front of all the words fails too. What is the secret of "the proper way"? more importantly, why won't it do the most intuitive thing: try to match the phrase as it is typed?
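
    To show why dropping common words and matching the phrase as typed give different answers, here is a toy comparison. It is not Google's actual behavior: the stop-word list is just the three words quoted above, and both matching functions are simplified assumptions.

```python
STOP_WORDS = {"to", "be", "or"}  # hypothetical list, taken from the quote above


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def phrase_match(phrase, document):
    """True only if the words of `phrase` appear consecutively in `document`."""
    wanted, words = tokenize(phrase), tokenize(document)
    return any(
        words[i:i + len(wanted)] == wanted
        for i in range(len(words) - len(wanted) + 1)
    )


def stopword_match(phrase, document, stop_words=STOP_WORDS):
    """True if every non-stop word of `phrase` appears anywhere in `document`."""
    kept = [word for word in tokenize(phrase) if word not in stop_words]
    words = set(tokenize(document))
    return all(word in words for word in kept)


query = "to be or not to be"
hamlet = "to be or not to be that is the question"
unrelated = "it is not easy being green"

print(phrase_match(query, hamlet), stopword_match(query, hamlet))
print(phrase_match(query, unrelated), stopword_match(query, unrelated))
```

    Once the stop words are tossed, the query shrinks to the single word "not", so the unrelated document counts as a hit even though the phrase never appears in it.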

    --
    Don't blame Durga. I voted for Centauri.
    1. Re:That's the problem by Anonymous Coward · · Score: 0
      Quotes around the phrase do not work. Plus in front of all the words fails too. What is the secret of "the proper way"? more importantly, why won't it do the most intuitive thing: try to match the phrase as it is typed?


      I don't know what you're doing wrong, but when I type the phrase in quotes I get page after page of good hits. Same with putting +'s in front of each word (although the results were different because it was no longer looking for that particular string of words together.)

  43. Well by CausticWindow · · Score: 1

    It's not the technology that prevents thousands of google clones from popping up. It's the simple fact that to initially succeed, you need either a lot of cash or heavy backers.

    It's not like Google's pagerank is so unique that it's impossible to do better any other way. It's just that 1) you have to do better or equal, 2) people have to know about you.

    Point 2 equals a lot of cash.

    --
    How small a thought it takes to fill a whole life
  44. I'm no genius but... by MRsackler · · Score: 0

    the way I understand it is that in order to operate a search engine that sorts through millions upon millions of listings thousands of times every minute, someone is going to need a whole lot of bandwidth. Not to mention the cpu resources that such a task would require. CPU cycles and bandwidth cost money, and no matter how altruistic the person's intentions, they've got to earn that money. That is where advertising comes in. If I'm not mistaken, google is pretty cool about not having slimy advertising. However, I'm not sure if they pocket any of the money received from those advertisements, or if they simply use it to cover the costs of operating the search engine.

  45. That's great but.... by Anonymous Coward · · Score: 0

    This is great until it starts working and it is really good and someone offers a lot of money for it and it is sold.

  46. SIR by Anonymous Coward · · Score: 0

    While your first post (reply) was quite amusing, you lost a few points:
    1) You should have used "nutchs" instead of "nuts"
    2) You are a "FP Mastur", not a "FP Mastar"
    3) ???
    4) PROFIT!!1

  47. Unbiased Searching is Absurd or Useless by smack.addict · · Score: 1
    An unbiased search engine is completely useless. In short, an unbiased search engine would either list results randomly or according to useless biases such as alphabetical listings.

    Any useful search engine will have an algorithm for ranking page relevance. Because search engine placement is so important to business, there will always be people out there who attempt to optimize (and in some cases, abuse) their pages to boost search engine ranking.

    The most useful search engine is the one whose biases match your own biases.

  48. Umm... by Anonymous Coward · · Score: 0

    Could someone set up a mirror? I think they got slashdotted.

  49. Hardware? by shredwheat · · Score: 1

    But what about the hardware and bandwidth? I read about the kind of horsepower running behind the offices at Google and find it hard to believe a competitive offering can be made.

    Perhaps what is needed is a peer-to-peer style distributed search engine for the web?

    1. Re:Hardware? by AsparagusChallenge · · Score: 2, Informative

      Don't worry too much. This is software, not a service. When available it may be implemented by someone and be the infrastructure of a company, which may then provide bugfixes and development to the original project. Or it may not. Who knows.

  50. What??? by jawtheshark · · Score: 1
    Command-line OS's were king. But now???

    What??? And nobody sent me the memo.... (Posting from Lynx from a *BSD shell)

    --
    Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
    1. Re:What??? by Anonymous Coward · · Score: 0

      You are using *BSD huh? Where can I download that?
      Do you seriously not even know what OS you are using?

  51. Actually, it's "Bidness", white boy! by figlet · · Score: 1

    Actually, it's "Bidness".
    Yor momma! :-)

  52. The answer is "Nutch"... by Gudlyf · · Score: 1
    *Opens sealed envelope*

    The question is: "What did Sean Connery say when he saw the reviews for 'League of Extraordinary Gentlemen'?"

    --
    Trolls lurk everywhere. Mod them down.
    1. Re:The answer is "Nutch"... by Anonymous Coward · · Score: 0

      How about: You posted this 6 minutes ago, dolt.

    2. Re:The answer is "Nutch"... by Anonymous Coward · · Score: 0

      Yes. And, he got a few "+1 Funny" on the last one, and posted this one with his +1 bonus. This is what we would refer to in the industry as a [fingerquotes]whore.[/fingerquotes]

    3. Re:The answer is "Nutch"... by Gudlyf · · Score: 1

      Just correcting my mistake. Fuckwad.

      --
      Trolls lurk everywhere. Mod them down.
  53. Unbiased is good enough for me by AtariAmarok · · Score: 1

    " An unbiased search engine is completely useless."

    Unbiased is fine for me. When I search, I am just looking for matches. That is all. I don't care so much about ranking decisions as long as the search produces accurate results. (that is, words or phrases found in the resulting documents).

    --
    Don't blame Durga. I voted for Centauri.
    1. Re:Unbiased is good enough for me by smack.addict · · Score: 1
      OK, so what is a match?

      Because there are multiple valid definitions, the minute you select one you bias the engine unless you choose a lowest common denominator. In which case, you will end up with 99% useless "matches" and, buried in that list, 1% of useful matches.
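
      To make the point about competing definitions concrete, here is a minimal Python sketch of three plausible meanings of "match". The function names and sample documents are made up for illustration; this is not how any particular engine works.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_exact_phrase(query, doc):
    """True only if the query words appear consecutively, in order."""
    q, d = tokenize(query), tokenize(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

def matches_all_words(query, doc):
    """True if every query word appears somewhere in the document."""
    return set(tokenize(query)) <= set(tokenize(doc))

def matches_any_word(query, doc):
    """True if at least one query word appears in the document."""
    return bool(set(tokenize(query)) & set(tokenize(doc)))

# Hypothetical sample documents, loosely modeled on the examples in this thread.
docs = {
    "hamlet": "To be, or not to be, that is the question",
    "book": "Or Not To Be: a collection of essays",
    "bees": "2Bee or not 2Bee, cartoons about bees",
}
query = "to be or not to be"

for name, text in docs.items():
    print(name,
          matches_exact_phrase(query, text),
          matches_all_words(query, text),
          matches_any_word(query, text))
```

      Each definition accepts a different subset of the same three documents, so picking one is already a ranking decision.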

  54. What about the hardware? by foo+fighter · · Score: 1

    How are they going to afford the massive hardware and bandwidth costs associated with running a tier 1 search engine?

    --
    obviously no deficiencies vs. no obvious deficiencies
  55. Lucene (index and search engine) by Anonymous Coward · · Score: 1, Informative

    Check out Lucene, the indexing and search engine used by Nutch. From what I've heard, Nutch is mainly the spider/crawler used to gather documents.

    1. Re:Lucene (index and search engine) by cpeterso · · Score: 4, Informative


      Lucene and Nutch are related:

      http://scriptingnews.userland.com/2003/08/13#When: 12:20:53PM

      Paul Nakada, via email: "It appears that the coding muscle for Nutch is Doug Cutting, the author of Lucene, an Apache Project open source search engine. We use it here at salesforce and have a huge amount of respect for Doug's coding."

  56. This kind of thing costs money, though by venom600 · · Score: 1

    The project may start out as an un-biased ranking system. But, if it gets very popular, the cost of running and maintaining a search engine that gets much traffic at all will require some sort of funding. (case in point: Google)

    Maybe if the thing was intended for use only by educational institutions, then some education grants could be used to support the infrastructure required to run a popular search engine? Or maybe it could be a subscription-based service? I dunno...couple of thoughts on how to pay for it anyway.

    Bottom line: somebody's gotta pay for it and (usually) the easiest way to pay for it is through advertising.....which will (unfortunately) probably lead to money-biased rankings.

  57. Anyone ever heard of grub? by nadadogg · · Score: 2, Informative

    Grub is another open-source search engine. I have the client running right now; it's nice and distributed. I think this kind of idea is great.

    --
    i use linux and windows oh god how can i have an opinion
    1. Re:Anyone ever heard of grub? by Anonymous Coward · · Score: 0

      If you're talking about www.grub.org, that's hardly open source. There's no source available at all.

  58. Let's check out the credits page... by baggachipz · · Score: 3, Interesting

    Ooh, what's this?

    Overture Research has donated hardware and helped to fund development.

    So, even an "open source," "unbiased" search engine is funded by a commercial search organization.

    1. Re:Let's check out the credits page... by WWWWolf · · Score: 1

      "funding" doesn't necessarily mean "controlling". I mean, there are people still out there who actually have ideals and treat donations as donations without letting the donors influence their judgement.

      Or something. I'm a trusting kind of person. =)

  59. They were not accurate. by AtariAmarok · · Score: 1

    Three of these top 10 links were not accurate results. I searched on the phrase "to be or not to be", not variations or misspellings. Phrases that capitalize on it, but do not match it, are close (but not accurate matches).

    --
    Don't blame Durga. I voted for Centauri.
    1. Re:They were not accurate. by WTFmonkey · · Score: 1

      The first two were exact matches. Look at the source. That phrase appears no less than four times. In a Shakespeare play, it only appears once. By that criterion, the first two were actually more accurate. You're right about that "bee" thing, that was just weird. But the first Shakespeare link is just as inaccurate, because the only place it contains that phrase is in the URL--and even there, without spaces. BUT, you would still consider that a positive search result, if you were blindly looking for "the guy who wrote it."

  60. funding by bindaaas · · Score: 3, Interesting

    Let's see where the funding is coming from. The project is funded by Overture, which is to be bought by Yahoo. More info is here. Hmm.. So I guess Yahoo needs a revival...

    --
    bin
    look siG is kool
  61. SNL Celebrity Jeopardy Quote by burgburgburg · · Score: 1
    [Connery brogue]
    Hey Trebek, tell your mother I had a good time last night.

    You suck, Trebek. I hate you and your ass.
    [/Connery brogue]

  62. Distributed by verloren · · Score: 1

    I'll take the 'A's, hands up who wants to work on the 'B's...

    Cheers, Paul

  63. Details. by AtariAmarok · · Score: 1

    I use www.google.com (not www.goohle.com or gogle!).

    The third result is a site with bee cartoons. It contains "2Bee", etc. Close, but not a match. (The word referring to that insect was not in my search request).

    Link 9 goes to a book at Amazon called "Or Not To Be". That partial phrase appears throughout the link. However, the entire phrase that I asked for does not appear.

    Link 10 is to the papermsce site. It contains no funny but false variations on the phrase, nor any fragments larger than "to be" found here and there in the text.

    --
    Don't blame Durga. I voted for Centauri.
  64. Need your help by Anonymous Coward · · Score: 0

    Hello. CmdrTaco here posting as AC because I lost the password to log in with. It's silly really, but I haven't needed to log in for such a long time that I just can't remember what the danged thing was! Anyways, I am in need of some assistance. Slashdot has a special backdoor for me only. If I get an AC post modded up to +5 insightful or interesting, and it comes from my subnet, then my password will be reset and mailed to me. So if you could see your way to just modding me up it would be a HUGE favor! Honest injun! Thanks. And please keep reading Slashdot.

  65. How amazingly misguided! by Anonymous Coward · · Score: 0

    So you have an algorithm and software - so what? The hard part of any search engine is paying for the bandwidth and employees and hardware. Software does not make a web search engine - at least not unless the "web" is a very small bit of what's out there.

    Google uses software, but google's software on its own would be useless without the massive amount of funding that keeps the lights on and the pipes open.

  66. I wouldn't count on it by Wesley+Felter · · Score: 2, Informative

    Nutch has four developers, one of whom is Doug Cutting who wrote several indexing engines. They count Alexa founder Brewster Kahle as a "friend" and are sponsored by Overture.

  67. Can I contribute to the source code? by mblase · · Score: 1

    for (i=0; i<intMaxSearchResults; i++) {
      if (searchResultURL.host == "www.myfavoritedomain.com")
        intSearchRanking = 1;
      else
        intSearchRanking = 1000;
    }

  68. Distributing the Power by FsG · · Score: 3, Interesting

    I think having an open source search engine that people can modify and deploy would be an excellent thing, and here is why. Currently, google has the complete power to highlight or censor anything on the web. So far, they have used this power wisely, but that's no guarantee that it'll always be so. If they go public, you may find this power being used to increase the shareholders' wealth, rather than in the highest standards of fairness as it is today.

    With that in mind, how would this project help? It would allow webmasters to quickly & easily modify it for their needs, and deploy their own niche engines; in other words, Google would be supplemented by 10,000 niche search engines, each focusing on a specific field (microsoft propaganda, for instance). This would create a balance of power, ensuring that no single search engine accumulates an insane amount of control over the web as a whole.

    --
    I made a PHP/MySQL library that prevents SQL injection & makes coding easier!
    1. Re:Distributing the Power by Anonymous Coward · · Score: 0

      So far, they have used this power wisely, but that's no guarantee that it'll always be so.

      While I agree with your post, I'd have to nitpick here and disagree. I've seen Google cave in to the Chinese government and the Church of Scientology. And those are just the ones we know about; it's not like Google's going to publish a list or issue press releases about whoever has threatened them lately.

      Keeping the other engines honest is a great goal, keeping information free is a greater one.

      Cheers

      Hiro ... "is it a virus, a drug, or a religion?"
      Juanita shrugs. "What's the difference?"

    2. Re:Distributing the Power by Amit+J.+Patel · · Score: 1

      If there were 10,000 niche search engines, how would you figure out which search engine is the right one? Either you'd have a metasearch engine that searched them all or you'd have a search engine that told you which search engine to go to. Either way, there'd be some place where people started the search process, and that site would upset the balance of power.

      -- Amit
    3. Re:Distributing the Power by nacturation · · Score: 1

      If they go public, you may find this power being used to increase the shareholders' wealth, rather than in the highest standards of fairness as it is today.

      If this happens, don't you think the SEC would bitch slap their asses back to last year? It's basically insider trading, though I'm not a securities lawyer. Increasing shareholders' wealth means a long-term profitability strategy. As soon as Google drops the ball, you can bet other search engines will pick up the slack.

      --
      Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
    4. Re:Distributing the Power by daybyter · · Score: 1

      I don't think you can call those 10,000 engines 'niche' engines. One way to reduce costs would be to distribute the processes of one engine across 10,000 clients (a search engine as a grid). All those nodes would check some part of the net and keep a distributed index of the entire web. I would call this horizontal distribution. This concept could be used to reduce costs (I think Google does this already with its many cheap PCs). Another concept, one that could make money, would be vertical scaling, where you basically put another search engine on top of a search engine. That could be a niche search engine that uses the index of the first engine to search for specific things, like jobs, houses for sale, or anything else that could be used to make money. Such a niche engine could then help to finance the first engine by sharing some of the money it makes (it could buy some sort of user license from the first engine).
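
      The horizontal split described here could be sketched roughly as follows: each URL is assigned to one node by a stable hash, so every node crawls and indexes only its own slice of the web. The function names, the node count, and the example URLs are assumptions for illustration only.

```python
import hashlib

def node_for_url(url, num_nodes):
    """Map a URL to a node number using a stable hash (same answer on every machine)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def partition(urls, num_nodes):
    """Group URLs by the node responsible for crawling and indexing them."""
    shards = {n: [] for n in range(num_nodes)}
    for url in urls:
        shards[node_for_url(url, num_nodes)].append(url)
    return shards

# Hypothetical URLs and a deliberately tiny grid of 4 nodes.
urls = [
    "http://example.org/",
    "http://example.org/about",
    "http://example.com/jobs",
    "http://example.net/houses",
]
for node, assigned in partition(urls, 4).items():
    print(node, assigned)
```

      A real grid would also need replication and a way to rebalance when nodes join or leave, which this sketch leaves out.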

  69. You just don't understand how Google does this by melted · · Score: 1

    Commercial results are biased up even though they're not marked as paid. Try a search for anything whatsoever (except open source) and you'll get your first 3 pages filled with online stores.

  70. Here is the Google cache for it ;) by evil-osm · · Score: 2, Funny
    --


    E.

    Never rub another man's rhubarb - The Joker
  71. Nutch is the answer by Anonymous Coward · · Score: 0

    *Sealed envelope is opened*

    "When Sean Connery saw the reviews for `League of Extraordinary Gentlemen`, what did he say?" is the question.

    "Lost not are all who wander"

  72. Bias: Inevitable by handy_vandal · · Score: 2, Insightful


    "In the age of weighted rankings on search engines for profits, there's an obvious need for an unbiased search engine."

    Bias is inevitable -- we're talking about ranking, which necessarily means bias.

    The question is: what bias do you want? What bias suits your purposes?

    My ideal search engine would offer a variety of biases from which to pick.
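
    A selectable bias could be sketched as a set of named weight profiles applied to the same documents. Everything below (the feature names, the weights, the example URLs) is a made-up illustration, not a description of any real engine.

```python
def score(doc, bias):
    """Weighted sum of a document's features under one bias profile."""
    return sum(weight * doc.get(feature, 0.0) for feature, weight in bias.items())

def rank(docs, bias):
    """Return the documents ordered from highest to lowest score."""
    return sorted(docs, key=lambda d: score(d, bias), reverse=True)

# Hypothetical bias profiles the user could pick from.
BIASES = {
    "shopping": {"text_match": 1.0, "sells_products": 2.0},
    "research": {"text_match": 1.0, "sells_products": -2.0, "is_documentation": 1.5},
}

# Hypothetical results for the query "apple".
docs = [
    {"url": "store.example/apple", "text_match": 0.9, "sells_products": 1.0},
    {"url": "orchard.example/apple-growing", "text_match": 0.8, "is_documentation": 1.0},
]

for name, bias in BIASES.items():
    print(name, [d["url"] for d in rank(docs, bias)])
```

    The same two pages swap places depending on which profile is chosen, which is the whole idea: the bias is explicit and picked by the searcher.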

    --
    -kgj
    1. Re:Bias: Inevitable by bmcent1 · · Score: 1
      Moderators, mod parent up!

      I like this idea! I'm tired of getting all shopping websites in my search results when I'm actually after research, software, or product documentation or complaints or fixes (anything except e-commerce)...

      Maybe most folks on the web are searching for the purpose of buying online, but in spirit of the old-school Internet (remember when commercial entities weren't even allowed on? Then later, you could sell your goods, but you HAD to contribute back and it was a faux pas to just sell your products with no value add to the Internet community?!) ...

      I want a selectable bias so searches for "Amazon" turn up websites about the river, and a search for "apple" links to sites about the fruit (baking, growing, folklore...)

      --

      "Hey Albert, Good luck exploring the infinite abyss."

  73. Mister grammar nazi by Anonymous Coward · · Score: 0

    "Neither is that sentence. One subject per verb, please?"

    To boldly say that it is TWO sentences, dolt!

  74. Beowulf Cluster of Server Farms by Superfreaker · · Score: 1

    I can't see this OS project getting too much traction. One quickly realizes when setting out to build a search engine, that it takes a ton of computing power in the means of pipe, drive space, and database space. I found out the hard way.

    It may be fun for some small intranet stuff though....

  75. You are right: 40% error rate. by AtariAmarok · · Score: 1

    You are right: one more of these was a bad result. That's 40% error.

    --
    Don't blame Durga. I voted for Centauri.
    1. Re:You are right: 40% error rate. by WTFmonkey · · Score: 1

      I just told you that the first two were matches, by your own criteria. They contain more exact instances of "to be or not to be" than Hamlet does, making them more accurate results. That drops us to 20% error.

  76. god forbid someone makes a profit by Anonymous Coward · · Score: 0

    open source is for communists

  77. The funny thing is by Omkar · · Score: 1

    My /. page became the #1 result for "Omkar" before I posted a single journal article. Google is great, but as this illustrates, it's certainly not infallible.

  78. Tomcat? by SpamJunkie · · Score: 1

    While my own preference would be to use Python for the spider, just as Google does, I have no doubt that Java is up to the task itself, especially with actually skilled developers.

    However I question the decision to use Tomcat. My limited experience with Tomcat showed it to be a resource hog that doesn't scale well at all. I couldn't imagine the Tomcat I played with surviving traffic anywhere near the amount that google gets regularly.

    Has anyone used Tomcat in a high-load situation? How much RAM did you need? I wasn't convinced that the 512MB I had would be enough and ended up dropping Tomcat entirely. That was a year ago. Did I have a bad version? Is this normal for Java (doesn't seem to be to me)?

  79. Search Engine Monoculture by peachawat · · Score: 5, Interesting

    Why is it that when it comes to OS, everyone is bitching and screaming about how bad the monoculture created by Microsoft Windows is, but otherwise feels warm and fuzzy and swears to god Google is and always will be the only search engine they use?

    The point is, are you really comfortable to have one, and only one, effective search engine? No matter how well it searches?

    O'Reilly put it best :

    Actually, Nutch has no ambitions to dethrone Google. It's just trying to provide an open source reference implementation of search to help keep Google and other search engines honest, by letting people compare the results of an engine whose algorithms and methodologies are transparent and accessible. It also aims to give a platform for people outside of the search heavyweights to research new search algorithms.

    1. Re:Search Engine Monoculture by Amit+J.+Patel · · Score: 1

      We complain about the Microsoft monoculture because what OS other people use affects me too -- availability of software, interoperability of protocols, sharing of documents. There's a cost to being in the minority. I can use more than one OS, but it's a lot of work.

      With search engines, it doesn't really matter much what other people use. I can still use any search engine I choose without risking a penalty for choosing something different. I'm also able to easily use more than one.

      I think the big difference is that with search engines, I feel like I have a choice, and I'm not being pressured to use one particular search engine.

      -- Amit
    2. Re:Search Engine Monoculture by Xerithane · · Score: 1

      Why is it that when it comes to OS, everyone is bitching and screaming about how bad the monoculture created by Microsoft Windows is, but otherwise feels warm and fuzzy and swears to god Google is and always will be the only search engine they use?

      Simple. Google works, and works reliably. Even works great in several languages (English, Spanish, Japanese, that's all I've tested.)

      --
      Dacels Jewelers can't be trusted.
    3. Re:Search Engine Monoculture by be-fan · · Score: 1

      People cheerlead for Google because they're not forced to use it, but it's so good they do anyway.

      People bitch about Microsoft because they're forced to use it, even though it's so bad that they don't want to.

      --
      A deep unwavering belief is a sure sign you're missing something...
  80. Try Teoma by Anonymous Coward · · Score: 0

    In my experience, the Teoma search site's sponsor links (paid for linking) are easier to differentiate from search results (in a different part of the page). What's more, they are almost always directly related to what I am really looking for, and sometimes exactly what I am looking for...

  81. Hey if you wanna design a search engine by Anonymous Coward · · Score: 0

    I will give you an idea to start, it's something I've been thinking on lately.

    When people browse the internet, a plugin will fetch the pages (like a cache), parse its contents, and send to another computer. This computer is like a "mini-server", holding a couple hundred of clients.

    Searching is like a p2p search, where you send your query to this "mini-server", which we shall call a Hub from now on, and it will look up the words in its index and process the results.

    A Bayesian filter determines what Hubs you connect to, clustering you with people with similar interests.

    What's the benefit? Well, new pages are added automatically, you don't need a crawler and bazillions of bandwidth to keep an index, which is *the* biggest problem for any search engine. Disk space is decentralized, so storage isn't an issue.

    And you can make all sort of connections since it's a browser plugin, for example, what other pages people visit when they look for certain words, what are the entry and exit links of a page, time of the day (what's the most popular Linux news site at the morning?).

    Still brainstorming a lot, but hey, IAAAC.
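
    A very rough Python sketch of the Hub side of this idea: clients submit the pages they have visited, the Hub keeps a word-to-URL index, and a query returns the pages containing every query word. The class name, method names, and sample pages are hypothetical, and the Bayesian clustering step is not modeled here.

```python
from collections import defaultdict

class Hub:
    """Toy 'mini-server' holding an index built from pages its clients visited."""

    def __init__(self):
        self.index = defaultdict(set)  # word -> set of URLs containing it

    def submit(self, url, text):
        """Called by a client plugin after it fetches and parses a page."""
        for word in text.lower().split():
            self.index[word].add(url)

    def search(self, query):
        """Return the URLs that contain every word of the query."""
        words = query.lower().split()
        if not words:
            return set()
        results = set(self.index.get(words[0], set()))
        for word in words[1:]:
            results &= self.index.get(word, set())
        return results

hub = Hub()
hub.submit("http://example.org/linux-news", "daily linux kernel news")
hub.submit("http://example.org/bsd", "bsd kernel release notes")
print(hub.search("kernel"))
print(hub.search("linux kernel"))
```

    No crawler appears anywhere in the sketch; the index only ever grows from what the clients happen to browse.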

  82. Irrational fear of money by KalvinB · · Score: 3, Informative

    That's nice that they want to open source the engine but that's the least of a search engine. They're going to need multiple high end servers to process the searches and plenty of bandwidth to get the results to the users.

    How do they plan to pay for that? Apparently advertising is out. And we just had another monephobe complaining about lack of funds for his accounting software who expected people to donate because he couldn't figure out that maybe, just maybe he should find a way to sell his product in some form while also keeping one form free. I can get RedHat for free OR pay money to get a hard copy with some bonus stuff. Net result is that RedHat makes money and everyone is happy. Those who refuse to pay don't have to and those who are willing to pay have a reason to. Most people are not going to just give you money out of the goodness of their heart and accept nothing in return if they don't have to. Why do you think PBS gives you gifts with your donations?

    I'd be more impressed with such undertakings if the owners weren't convinced the bandwidth fairy was real and that money will fall from the sky like manna.

    When someone comes along who recognizes that the bandwidth fairy doesn't exist and that money needs to be acquired through marketing to get any real amount then I'll think twice before laughing it off.

    Free is a pretty dream but free don't pay the bills.

    Ben

  83. Nutch - Not Understanding The Capitalist Hegemony by EqualSlash · · Score: 2, Insightful


    Nutch - Not Understanding The Capitalist Hegemony (I am just making it up ;)

    Without a sound revenue model they can't operate for more than a month. Google has indexed billions of pages and to operate at that level they have to spend a lot of money (Google recently leased an entire campus from SGI). To meet the Infrastructure costs alone you need some form of commercial revenue stream.

  84. Java? by slimak · · Score: 2, Interesting

    It seems like there would be a better choice than Java for the language when speed/efficiency is a must. Isn't the added overhead of the JVM going to decrease performance significantly?

    Portability should be a moot point since the pages can be generated on the server, which could easily run an OS specific binary.

    1. Re:Java? by dirtydiaper · · Score: 1

      The JVM experiences that some have had on the net may make it seem slow and a memory eater. But Java has now been proven to be a very powerful tool in today's internet market. The Nutch website does boast that their bot searches 10,000 indexes per second, which is impressive. I think we all may be in for a surprise. Let's hope for the best!

  85. Suicide by Anonymous Coward · · Score: 0

    If you search for suicide methods, they will bend the engine to make sure you get reasons why you shouldn't commit suicide before you get the how-to.

    Yes but if you search for: suicide +"how to", the first hit comes up is: HOW TO KILL YOURSELF

  86. Much ado about Nutching by innerFire · · Score: 1

    I see no code.

  87. Categorical searches by phorm · · Score: 1

    Since google hasn't really done anything to warrant not using it (and we really shouldn't be so paranoid as to assume they will) I could see a project like this becoming useful in terms of specialized searches.

    How about a network of Linux or developer sites? Yes, there is google linux, but I have at times found it lacking (especially when I get a slew of German/Japanese sites and it doesn't always give me the language filter option).

    How about sites that index restaurants,etc? Perhaps they would benefit more from a searchable index without a visible initial ranking (so customers don't bitch). Live in Eastern LA and want to grab Greek? use blahblahfoodsearch.com and look up "+italian +greek +take-out"

    Eventually, specialized sites could cater to a niche, rather than taking on something the scope of google (with its no-doubt massive servers) straight away

  88. "written entirely in Java" by Anonymous Coward · · Score: 0

    ROTFL, they can't be serious! A few yahoos fresh from college talking about scalability and billions of pages and then they use the hype language with the worst performance track record of any non-scripting language?!

    Man, what a sad joke this is.

    Why don't they use Visual Basic? Or maybe perl? You know, for performance reasons. Muhahaha.

  89. Hey genius... by Anonymous Coward · · Score: 0

    Search: "to be or not to be" shakespeare

    10 good hits... amazing, huh.

  90. Obviously vapor by Anonymous Coward · · Score: 0

    I don't see jack shit on their site, anyone with a little HTML knowledge could produce what they have thus far.

  91. Advertising != Manipulating the rankings by alexhmit01 · · Score: 2

    On Google.com, it is VERY clear what are paid ads and what are "real" results. With MSN, for example, they list Featured Site (you pay MSN), followed by Overture (you pay per click), followed by the Looksmart Directory Listings (used to just pay for submission; for the past year, Looksmart has charged $0.15/click for those results).

    After the "paid" listings come the Inktomi listings. Those crawler based listings include PFI (pay for inclusion, you pay for daily spidering, but no "boost" in rankings) and the Partner Connect program, where you get free traffic for a week, then negotiate a PPC price for traffic.

    If you search MSN, you would get the impression that Featured is editorial (which it kind of is), Sponsored is paid, and Directory/Page results are "real" search results, where the Directory/Page are often actually paid results.

    The paid traffic from Inktomi involves an XML feed of terms and results, and your "fake" entries are treated as real entries with a boost for being a paid player.

    In addition, for various adult terms, MSN tells you to use a third party adult "search engine," which ISN'T a search engine. It is a big player in the adult space that pays MSN for all the traffic and lists their sites, but does it in a "search engine" look and feel.

    That is manipulating the rankings. If Google were to, say, charge for the XML entry (either PFI or PPC) into Froogle.com, and then show the Froogle results interspersed with Google results, that would be manipulating the rankings for money.

    That is the manipulation angle.

    Now, are paid results any better/worse from objective results if those are manipulated by SEO professionals (so you pay an SEO to get "free traffic" instead of paying the SE for the traffic)?

    It's certainly more manipulated.

    With a free engine, you could tweak the rankings and sell ads on the side, but not have manipulated editorial.

    It's about maintaining a wall between advertisements and editorial, and the only engine that appears to have that wall is Google, and even Google pushes the boundaries.

    Alex

  92. Obvious! by nacturation · · Score: 1

    Simple: every other domain name was taken. And czxvb.com just didn't quite have the same ring to it.

    --
    Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
  93. monoculture bad, diversity good by m0nk3ym1nd · · Score: 1

    M$FT wants to get into search techs. good enough reason right there to advocate an open-source alternative.

  94. Depend on the country developing it. by Anonymous Coward · · Score: 0

    Patents on particular parts of software do not apply in all countries around the world. That is, there are many places where a developer could be based, develop, and say "stuff the patents"; basically, they can be bypassed.

  95. Re:Here's "A Tough Challenge" for you, hombre... by Anonymous Coward · · Score: 0

    a close relative of "felch"? : To eat the semen you just ejaculated out of your partner's anus.

  96. Search algorithms by speck · · Score: 1

    Without wanting to beat the "security through obscurity" dead horse any further, I don't see what's to stop the Nurk developers from doing exactly what Google does: changing the algorithm every month or so, so that pages which are "nurkbombed" in one release don't count as high in the next one. Admittedly, nurkbombers might have a slight time advantage over googlebombers (since they'd have access to the source), but the principle that the algorithm evolves based on attempts to exploit it is independent of whether the algorithm is open or closed-source.

  97. Great for large company's intranets by FattMattP · · Score: 1

    This would be great for large company intranets. The company I work for (60,000+ employees) has probably more than 1000 web servers spread out all over the world yet we have no way to search the content of all of them. Something like this would be great.

    --
    Prevent email address forgery. Publish SPF records for y
  98. Large scale and DB by webhat · · Score: 3, Interesting

    I was looking over the site and a number of things concerned me.

    Firstly, the choice of Java: personally I have no gripe about this, and the choice to use language-independent formats is a good idea. My main concern is about scaling up and distributing over multiple machines.
    At present my educated guess is that a project on this scale, in Java, would still be best run on a `hardware base as uniform as possible', like UltraSparc 450's with a fibre backplane.

    My second concern is that there is so much choice of indexing and searching technique that there are sure to be some problem due to Patent restrictions.
    Just browsing the US patent office gave me a couple of possible Patent nasties;
    6,463,428 or 6,278,992. (And about 10 others I glanced at...)

    Lastly, the DB: in the short time I've been looking at the code it seems to me that a choice was made to implement a DB built for the problem. Although this could be a good thing, it is usually better to reuse existing products. I found SleepyCat (DB4) to match the requirements. And if the choice is final, read this. [1]

    I hope these comments are useful to somebody at least.

    [1] http://www.xlnt-software.com/xml_dl.html

    --
    'I am become Shiva, destroyer of worlds'
  99. Some commentary... by Colm+Buckley · · Score: 3, Insightful

    I have a few comments on this development:

    • The article as posted contains some pretty snide commentary, apparently designed to intimate that all current search engines deliberately weight their results in favour of their advertisers. This is demonstrably not the case; in fact, with Google providing a strong, well-publicised counterexample, to do so would be suicide for any search engine with pretensions to market leadership.
    • The principal difficulty with an open-source search engine algorithm is that it would definitively be open to abuse. Once the ranking algorithm was known, it would be fairly trivial to develop ways to subvert it. One of the reasons why this hasn't happened to Google is because the details of the ranking algorithm are closed. There is a largish industry devoted to figuring out how to influence Google (which is why Google keep tweaking their algorithm). A search engine using an open algorithm would very quickly become unusable as this industry figured out how to play the system.
    • The funding from Overture is very suspicious, to be honest. Overture, assuming the Yahoo! takeover is given the all-clear, will soon be part of one of the largest commercial search engines, and with a history of business practices which are, shall we say, perhaps less than totally congruent with the open-source ideals.
    • Running a large, successful search engine requires vast, dedicated resources. I don't know the exact scale of the Yahoo!, Google or MSN search operations, but I'll warrant that they're surprising to anyone who's expecting to run a search engine from a couple of thousand distributed nodes.
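
    As a toy illustration of the second point, assume a deliberately naive, fully public ranking rule: score a page by how often the query words occur in it. The function and the two sample pages are hypothetical; real ranking algorithms are far more involved.

```python
def tf_score(query, doc):
    """Naive public ranking rule: count occurrences of the query words in the page."""
    words = doc.lower().split()
    return sum(words.count(q) for q in query.lower().split())

# Two made-up pages: one honest, one written to exploit the known rule.
honest = "a guide to growing apple trees in a small garden"
stuffed = "apple apple apple apple buy cheap pills apple apple"

print(tf_score("apple", honest))
print(tf_score("apple", stuffed))
```

    Anyone who can read the rule can write the second page, which is why a published algorithm would need to keep changing in response to abuse.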

    An open search engine application is a nice idea, but unfortunately it's one of those applications which are essentially useless without an enormous ASP architecture behind them. An earlier poster indicated that it might be useful for searching and indexing intranets and the like, analogously to the Google Search Appliance. This is indeed a valid potential application, but then, HT://Dig exists already. Is this dramatically better?

  100. Cost cutting vs ?? by joshsnow · · Score: 1

    Hmmm, well "businesses" use one of the recent variants of MS operating systems (NT,2000,XP), the latter two of which ship with an "indexing server" specifically for indexing intranet sites.

    Yet said businesses are still willing to pay money for a google indexing server. Why?

    My point is that maybe a "free" (as in beer, to save money) indexing server isn't what businesses require.

    1. Re:Cost cutting vs ?? by shaitand · · Score: 1

      Businesses that pay for google's indexing server or need a good indexing server DON'T run ms operating systems on their servers ;) They have too heavy a load for that... windows doesn't scale up, hate to break it to you, those kinds of businesses are running some *nix variant or other stable operating system. However, even if they were, being willing to pay for google servers simply means they are more interested in how well it performs than how much it costs... open source solutions are inevitably superior in the end. So while free alone may not be enough, free, fast, and stable may just do the trick.

      Of course money may be the deciding factor after all, search engines eat huge amounts of resources (the bigger the intranet, the more it will use) that cost significant chunks of change... having one that is efficient (unlike the MS solution for anything I've encountered to date (even notepad is bloated... though it is stable, it stands alone among the crowd in that respect)) could save them a decent amount of cash. True they probably have 100mbit links throughout most of the network (faster links are still expensive enough they won't be tossed all over) but there are other things that will be running on that and it will be crowded enough... this also might have to index across wan/vpn links.

  101. Comments and suggestions... by rice_burners_suck · · Score: 2, Insightful
    Suppose you have just finished developing a free software search engine. And suppose it has the best algorithms in the world and the ratings are weighted based on some sort of moderation system.

    This is exactly like the problem the mice had one day. They couldn't come out of their mouse hole because there was a dangerous cat prowling around. One day, as food was getting scarce and everyone was afraid to leave the hole, the mice called a meeting to discuss the problem. One excited young mouse came up with the most wonderful idea: Let's put a bell around the cat's neck, so that when the cat is nearby, the mice would have advance warning and could escape! All the mice got excited at this proposal, until a very old, very wise mouse came over and asked, "And who will tie the bell around the cat's neck?"

    What I'm trying to say is: If the search engine is free software and companies don't pay to increase their ranking... who will pay for the bandwidth to host the engine? I can tell you this much:

    • Individuals will not pay a fee to perform a search unless this search engine gives them some incredibly compelling reasons to do so. Open moderation will not likely fulfill that requirement.
    • Companies will not pay to increase their ranking because that is the definition of this project. They will not pay to search for the same reason that individuals won't pay.
    • The government probably won't pay because there are plenty of "free" (cost) search engines around. That is, unless someone can give them an incredibly compelling reason to do so.
    • Universities probably won't pay for the same reasons as everyone else.

    Proposed solution? Make it a distributed search engine, like SETI@home, or the DNS.

    This is much easier said than done because:

    1. RAID-like distributed storage technology would have to be developed, so that the indexing database could be distributed among all computers worldwide that donate bandwidth and storage. This would have to guarantee statistically that all the data will be available at any point in time even if people turn off their computers for extended periods of time. However, this technology could make reliable clustered storage a reality, and the resulting free software implementation could be licensed for corporate use for an exorbitant price, which would go to the EFF, FSF and other organizations that develop free software and/or support the development thereof.
    2. An efficient P2P-like protocol, along with a network topology of some sort (like the DNS has) would have to be developed to support the searching. It would have to be damn fast and, like before, very resilient to computers being shut off, chunks of data becoming lost at any moment, etc. Furthermore, changes would need to propagate at blazing speeds so that new items on the Internet could be found shortly after appearing.
    3. Bandwidth and disk quota would need to be managed at each participating host, so that limits set by the user are not exceeded.
    Governments, companies, universities and individuals would likely support an effort like this by donating some bandwidth and storage, rather than money.
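
The "guarantee statistically" requirement in item 1 can be put into numbers with a back-of-the-envelope sketch. It assumes every volunteer host is online independently with the same probability, which is a simplification (real hosts go offline in correlated bursts), and the function names are hypothetical.

```python
def availability(p_online: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent hosts is online."""
    if not 0.0 < p_online <= 1.0:
        raise ValueError("p_online must be in (0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - p_online) ** replicas


def replicas_needed(p_online: float, target: float) -> int:
    """Smallest number of copies of an index chunk that reaches `target` availability."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    n = 1
    while availability(p_online, n) < target:
        n += 1
    return n


# Hosts that are online half the time, aiming for 99% chunk availability.
copies = replicas_needed(0.5, 0.99)
```

Under these assumptions, hosts that are up only half the time need seven copies of every chunk to reach 99%, so the storage donated has to be several times the size of the index itself.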

    In the spirit of worldwide computing on the Internet, I hope this makes some amount of sense.

  102. Failure by PickyH3D · · Score: 1
    At this point Nutch is coded entirely in Java, however persistent data is written in language-independent formats so that, if needed, modules may be re-written in other languages (e.g., C++) as the project progresses.
    Wow. It's already not going to beat Google.
  103. MOD THIS UP! by shaitand · · Score: 1

    This is a good idea, perhaps the good people of bittorrent and p2p networks could lend some insight into how to get this working.

    As for ranking manipulation, open source takes care of that... the one thing more important to a business than their own rank, is keeping the other guys down.

  104. Off-topic: new Google feature, calculator by harmonica · · Score: 1

    I just came across that new Google feature, the calculator.

    I'd only wish that there was better documentation, 'radius of earth' isn't exactly something you stumble upon by accident.

  105. Search based on the p2p model? by Anonymous Coward · · Score: 0

    Maybe with a touch of distributed net?

    Could be the next killer app...

    1. Re:Search based on the p2p model? by j.leidner · · Score: 1

      I totally agree, what we need is a distributed search service that operates on a distributed index. If you have 10000 nutch-style instances running, they could all be biased in their own ways. There really should be an all-encompassing index of the Web, but in a scenario where the index is spread across millions of machines redundantly so that no set of DDoS attacks could ever destroy it. I'd be quite happy to give up 10G of my local disk space to contribute to society (= for indexing). Many people do this already for FreeNet (but that serves a different purpose of course). Very possibly The Next Big Thing(R) indeed... Jochen Leidner

  106. ask slashdot ... 10yrs... nutch by flibbert · · Score: 1

    google is so 5 seconds ago... we'll all have our own nutch engines and have all the data we care about on our local storage every 2 microseconds or so.

    nutch will be just one of the oldest/basic utils on Linux2013.

  107. Shameless plug for SWISH++ by pauljlucas · · Score: 3, Informative
    I see this project as a competitor to shrink wrapped search engines. IE google appliance or maybe even Folio based products. Typically corporations have many documents that need to be indexed and searchable to their needs.
    SWISH++ fills this niche nicely. It can index hundreds of thousands of documents very quickly, indexes not only HTML, but e-mail, news, man pages, LaTeX, RTF, and even the ID3 tags of MP3 files; can apply filters on-the-fly (convert PDF to text, then index that), can do incremental indexing, and can run as a multi-threaded search daemon.
    --
    If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
    1. Re:Shameless plug for SWISH++ by bkw · · Score: 1

      Some folks (me included) think swish-e is even more impressive and easier to set up and maintain.

    2. Re:Shameless plug for SWISH++ by pauljlucas · · Score: 1

      SWISH-E is many times slower which is why I wrote SWISH++ in the first place. As for "easier to set up," I don't know what's so hard about changing a few things in a config file and typing "make". As for "easier to maintain," I don't see what's so hard about running a cron job.

      --
      If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
    3. Re:Shameless plug for SWISH++ by davebaker824 · · Score: 1

      I still haven't been able to find any front-ends for SWISH++ though -- anything written for Perl?

    4. Re:Shameless plug for SWISH++ by pauljlucas · · Score: 1

      IMHO, a front-end is either useless or trivially written by those who really want them. If you want one so bad, write it yourself.

      --
      If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
  108. Search engines are not always internet portals by sbszine · · Score: 1

    Something a lot of folks are missing here is that search engines are used in applications, intranets, individual sites etc as well as Google type whole-internet portals.

    When you click on 'Find Files' in Windows, or look for a song in your chosen P2P app, or look something up on your O'Reilly CD Bookshelf, or search /. for an old article, that's using a search engine just as much as Google is.

    If you're interested in something for your own project, Lucene is a great application-centric search engine. It's just a bunch of Java classes that you call from your application. Or you can use a website-centric engine such as htdig if you're dealing with an intranet or website rather than an app. htdig is GPLed; Lucene is under the Apache license, I think.
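
For anyone wondering what "a bunch of classes that you call from your application" amounts to, here is a toy inverted index in Python. It is only a sketch of the general idea and is not Lucene's API; the class `TinyIndex` and its methods are made up for illustration, and it does plain whitespace tokenizing with AND-only queries.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-process inverted index: maps each term to the documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        """Index a document under every lowercase, whitespace-separated term."""
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query: str) -> set:
        """Return the ids of documents that contain every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = TinyIndex()
index.add("a", "open source search engine")
index.add("b", "closed source search appliance")
hits = index.search("open search")
```

The application owns the index object and decides what gets added and when, which is the difference from a crawler-driven engine such as htdig.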

    --

    Vino, gyno, and techno -Bruce Sterling

  109. DMOZ by led_belly · · Score: 0

    I thought that's what dmoz was supposed to be??

  110. Worst. FAQ. Ever by CGP314 · · Score: 1

    Will Nutch ever be as good as other search engines?

    ...

    How can I stop Nutch from crawling my site?

    ...

    When will Nutch search images, pdf files, etc.?

    ...

  111. So? by Anonymous Coward · · Score: 0

    Search for "linux" and you'll get your first 3 pages filled with pages about Linux. What's your point?

  112. "written in Java" ?!? -- trashcan. Next ? by kinsoa · · Score: 1
    A programmer who chose Java to implement serious projects is probably not a serious programmer.

    I don't want a search engine that uses 80% of my CPU/RAM.

    1. Re:"written in Java" ?!? -- trashcan. Next ? by forkboy · · Score: 2, Interesting

      That's all they're teaching the kids in college these days. Seriously. At the school I go to (I'm not a CS major) you have to take C/C++ as an elective. The core CS curriculum is all Java. I don't think they even teach assembly there. Good schools are probably different of course, but who can afford a good school anymore?

      I met someone the other day who had an associate's in Computer Science from a community college and had never used anything but an AS/400 and a Mac. (Not even Windows! Seriously!) I think people saw the dollar signs from 4-8 years ago and went to school for something they really are only marginally interested in just because they thought they could make a few more bucks. It's a damn shame too, because these shitty little tech schools (or community colleges hawking their tech "degrees") are doing a disservice to these people by making them believe that 1) they'll get a good education, 2) they'll have a job waiting for them when they graduate, and 3) they'll be respected by their peers in the IT biz as professionals.

      And then there's my niece who got a nursing degree in 18 months, had 4 interviews her first week out, and now makes more than your average 4-year college graduate. Go figure.

      But anyway, yeah, I agree with you. Don't /. while high, it makes one ramble.

      --
      This message brought to you by the Council of People Who Are Sick of Seeing More People.
    2. Re:"written in Java" ?!? -- trashcan. Next ? by dubstop · · Score: 1

      Uses 80% of your CPU/RAM, as in Google only uses 20% of your CPU/RAM because it's written in C/C++? How does the language that a search engine is written in, if you are accessing it from a browser, affect in any way the CPU/RAM usage on your computer? Granted, it would be possible to run this locally, but my take on it is that it's being aimed at search-farms that are (or will be) the equivalent of Google, or Yahoo, or whatever.

      I consider myself to be a serious programmer, and I use Java. I started in the industry using C, progressed to C++, and have used professionally (amongst other languages) x86 assembler and 68k assembler. I made the transition to Java when I moved on to developing enterprise applications for one of the largest merchant banks in the world. For scalable enterprise solutions, Java makes a lot of sense.

      Also, I think that by any reasonable definition of the word, Mitch Kapor and Tim O'Reilly could be considered 'serious' programmers.

  113. xapian by Froggie · · Score: 1
    Xapian?

    It's the storage library that keeps an index of pages. You need a display front-end and a webcrawler to go with it (there's some code around). It's GPL and it has some clever features.

  114. Risk of Mixing paid and free search by anti.gladio · · Score: 1

    The success of an open source engine would be of crucial importance, especially after the latest developments in the market: Yahoo! bought Overture and Google started working in the field of paid search. The risk of the current trend is an indistinguishable mixture of paid and free search that no longer works as a reliable catalogue or classification for a growing, messy Web. The point is strengthened by the decline of Web directories (Yahoo! losing importance, and the small success of the Open Directory Project, which is exploited but not popular among users). If a new open source search engine flourishes, it is important to defend its integrity from birth. This can perhaps be achieved with copyleft protection.

  115. Nutch Johnson? by PimpNinjaWannaBee · · Score: 0

    Then it'd probably be good for finding babes in red swimsuits.

  116. Namazu by Anonymous Coward · · Score: 0

    http://www.namazu.org/

    Namazu is an open source full-text search engine. It has various document filters (HTML, Mail/News, PDF, MS Word, Excel, Powerpoint, man page, TeX, DVI, PostScript etc...) and the mharc web-based mail archiving system adopts Namazu as its search engine.

  117. could this work as a cluster, like SETI? by grikdog · · Score: 1

    What about implementing nutch as a distributed effort, with spiders patiently covering a thousandth of the web per install, and reporting to a consortium of active servers? You could cover the web in hours, not days, the way SETI covers its search space. Also, if each node agrees to mirror its logically closest neighbor's small dataset, the odds are a search launched from Tanzania could see results posted by the colloquium from anywhere on the 'net -- Thousand Oaks, Cedar Rapids, Mexico City, Oslo -- in seconds, regardless of who is on the web or when.
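
One way the "thousandth of the web per install" split could work is to hash each URL's hostname to a node number, so every spider knows which slice is its own without asking a coordinator. The sketch below is a guess at such a scheme, not anything Nutch does; `node_for_url` and `mirror_for` are hypothetical names, and "logically closest neighbor" is taken here to mean simply the next node number.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Pick the crawler node responsible for a URL by hashing its hostname.

    Hashing the host rather than the full URL keeps a whole site on one
    node, which makes per-site politeness limits easier to enforce.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def mirror_for(node: int, num_nodes: int) -> int:
    """Node that keeps a backup copy of `node`'s dataset (its next neighbor)."""
    return (node + 1) % num_nodes


NUM_NODES = 1000  # assumed install count, from "a thousandth of the web per install"
owner = node_for_url("http://slashdot.org/search.pl", NUM_NODES)
backup = mirror_for(owner, NUM_NODES)
```

A plain modulo like this reshuffles almost every site whenever the number of installs changes, so a real deployment would likely want consistent hashing or something similar instead.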

    --
    ``Tension, apprehension & dissension have begun!'' - Duffy Wyg&, in Alfred Bester's _The Demolished Man_
  118. Unbiased? by 42forty-two42 · · Score: 1
  119. uses Java? by Anonymous Coward · · Score: 0

    bleah. Something to ignore.

  120. LONG LIVE by Anonymous Coward · · Score: 0

    NUTCH! great idea- keep google on its toes!!!