Domain: nutch.org
Stories and comments across the archive that link to nutch.org.
Comments · 30
-
Woah
Wikia Search is open source, it's based off of Grub (which we have already talked about before). Here's the source code to the grub Windows client, and there's a dev site too. The current scoring algorithm is over here. If you want to talk with Jimbo and the developers, hop on to mailing list and let's talk.
Anyway, it looks like there's the opportunity here to *improve* this search engine -- programmers, I know you are reading, so at least check out the code. There's been talk about running some competitions for improving the search results (the scoring algorithms). How many of us would like to form a team? Maybe I'll do one. Who's with me?
(Btw, these guys need help. I just found all of this after the recent news articles.) Screw my mod points.
-
Re:Possible Solution
That's a bit cynical, don't you think?
If they really wanted to make the most money possible, they would have sold these logs (non-anonymized) to the scores of direct marketers that I'm sure would love to have this data. Instead, they packaged it up and tried to make it available to academic researchers. These researchers honestly just want to make better search engines that run faster and return better results. Furthermore, when academics come up with a great new idea, it gets published so that anyone can read it.
Every once in a while, someone suggests an open source search engine. Check out Nutch if you want to see work in this area. However, if open source search solutions are going to be any good at all, they'll have to rely on the decades of public, published information retrieval research that's already out there.
We are entering a time when companies are capable of totally outpacing academia because they have query log data, so they know exactly what users actually do. There is no way that an academic can get this kind of data unless a company releases it. Researchers at AOL, in good faith, tried to release data so researchers could have a chance at success. Ultimately, of course, that's good for AOL since they're not in the top three search engines out there. Public research can only help raise AOL's standing by helping to level the playing field. But, it's good for you too, because you can build your open source solution based on this research too.
Yes, the release was botched, and yes, the long term user identifiers were a mistake. But don't make AOL out to be some evil company that was only out to destroy your privacy. They made a mistake!
-
Re:A prediction
"The university can just go to another provider if they don't like Google's attitude -- that's why it's different with open source software. With closed source it would be a lock-in."
In fact that's just what Oregon State University did when Google's prices were too high. They replaced their Google box with the Nutch search engine and saved around $100,000 a year. Fortunately, Google apparently does not have any (or enough) bad blood about this to prevent them from taking the initiative to promote open source.
-
Wow
Another google story?
Cool as it is, it just ain't that cool.
Mod me down if you want, call me biased, but there are tons of other "news for nerds" besides some corporation who is after your dollar.
For some cool search news, Nutch .07 just came out - http://nutch.org/ - I'm loading it up on mozdex through next week :)
-
Nutch
There is already a fairly scalable complete FOSS search-engine called Nutch which can (in theory) scale from an 'in website' search engine to a full-blown google-style search site.
I wonder if Yahoo are offering as much source access and similar licensing terms to this? (It appears from the articles that the APIs are purely for interaction with the Yahoo site).
-
Search engine technology should be decentralized!!
Quite a while back I worked on a "personalization engine" for a software company that ran on top of Autonomy. It would display different content depending on what was inside someone's "profile". The internal code name for the technology was "Orwell". I am not even kidding, and I didn't find it the least bit funny. In fact, I quit because of it.
The bottom line is that the Web, which is made up of a lot of different technologies (html, http), is too centralized in its nature. Interacting with the "web" means that first and foremost, you have to start trusting a single website which you may or may not have a connection to. Even if it's your employer, there is no way *not* to have any kind of "big brother" as you interact with the web.
There are nascent efforts to provide an open source search engine (for one example see: Nutch), but I also think this needs to be combined with a much more decentralized transport than HTTP, ideally one that is fundamentally "peer to peer" *and* authenticated in its nature. That way, each time you interact with other organizations, there will be a face to it and you can get a sense of how people are interacting with your personal digital profile.
-
Re:[tt]:Encarta
the Google UI is just elegant.
google suck too.
This is the future...someday!!
-
Re:rephrase
F*** g00gle. Roll on http://www.nutch.org
An open source search engine is the ONLY way we can keep propaganda out of the internet (one of the last free mediums)
-
Re:How will this work?
Now, how is this going to work? First off, when I do a search on google there are dozens if not hundreds of PCs involved in various aspects of the search. I get my results in under a second. My computer - although fairly decent itself - is only a mid-tower. There is no way I can support even one PC to assist in searching.
Having played with and done some work on the open source Nutch search engine, I know from experience that you can return search results from ~10,000,000 pages in much less than 1 second on a mid-range desktop. It's all done with indexes, in much the same way as relational databases have been doing it for years.
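To make the index point concrete, here is a toy inverted index in Python. It is only a sketch of the general idea (look up each term's list of documents, then intersect), not Nutch's or Lucene's actual code; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One dictionary lookup per term, then a set intersection:
    # the documents themselves are never scanned at query time.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample data, for illustration only.
docs = {
    1: "open source search engine",
    2: "relational database index",
    3: "open source database",
}
index = build_index(docs)
```

With this data, search(index, "open source") finds documents 1 and 3. The cost of a query depends on the size of the term lists rather than on the number of pages, which is why lookups stay fast as the collection grows.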
-
GFS question
Can you answer me some GFS questions:
Do the SRPMs run with kernel 2.6?
Has anybody got the server/client running already? Can you tell me the distribution/kernel?
Can I have one subdir on my workstation which is the total amount of all harddiscs of my GFS machines together? (Or, if mirroring is used, only 50% of the harddisc storage.)
Does GFS need a master server?
The reason I ask is that I want a distributed filesystem to build a set-up for nutch. I am currently testing with OpenAFS on SUSE 9.0 and would also like to test other distributed filesystems.
-
Re:Naughty behaviour
I'm looking for a clean, fast, non-buggy alternative to the google giant. Preferably open source.
Any suggestions?
The only big one I know of right now is Nutch. It is an open source search engine that is in the later stages of development, but hasn't produced a large, usable site yet.
nutch.org
Since it will be open source, you will be able to read the ranking algorithms and change/abuse them as you see fit.
This one http://search.mnogo.ru/ is also available.
-
Re:Naughty behaviour
Nutch aims to create an open source search engine, though they don't have anything yet.
-
Re:how is this different?
This guy doesn't stop spamming; today he was asked not to post spam on the Nutch developer list.
This is just Nutch, taken from http://www.nutch.org/release/nightly/ - he claims to be the creator or something.
Please report abuse to the Nutch list at nutch-developers@lists.sourceforge.net
-
Re:As a webmaster
Yes, Nutch is designed to be a good bot and follow the normal rules, but just like any open source project, it could potentially be used badly by someone.
More information can be found on the Nutch Webmaster Information Page.
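For anyone wondering what "following the normal rules" looks like in practice, here is a small sketch using the robots.txt parser from the Python standard library. It is not Nutch's own code, and the "ExampleBot" agent name and the rules are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
rules = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def allowed(url, agent="ExampleBot"):
    """Return True if the parsed robots.txt rules let this agent fetch the URL."""
    return parser.can_fetch(agent, url)
```

A polite crawler runs a check like allowed(url) before every fetch and skips anything the site has disallowed. Nothing forces a crawler to do this, which is why a misconfigured or modified bot can still behave badly.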
-
Nutch
There is one open source search engine that seems to be up-and-coming. Nutch is now powering Mozdex, and it looks fairly impressive so far.
Now, instead of the previous free-will donations, you can support the project by purchasing very cheap sponsored listings that appear to the right of the results (similar to Google).
-
Re:Open Source Search Engine?
-
Re:Open Source Search Engine?
-
what timing for this /. article!
just as I'm pulling an all-nighter at this moment trying to embed a custom search engine into an app for use on an intranet.
Actually what is more interesting is Nutch and Mozdex, which seems to be based around Lucene (what I am using to build my own search engine embedded into a Horde framework app). Although probably a lot simpler than the industrial grade stuff, for someone who has been used to throwing a word at an input screen and magically getting back results, the insight into the inner workings of search engines is very interesting.
-
What about an open source search engine instead?
There seem to be too many social networking sites these days. How many can one person possibly belong to? What would be cool is an open source search engine, although I don't know if that project is still active. One thing to consider is that open source works well for "products" like GNU/Linux but does not work as well for services like a social networking site. Even a service like our beloved Slashdot may use open source software but it is a commercially-operated ad-sponsored business.
---------
Create your wireless web site
-
Re:I agree, good Sir
Here's to the next search engine king (whomever it turns out to be).
nutch.org perhaps?
-
Re:Oy.
I was just thinking the same thing. Google is a search engine company, so why not just concentrate on improving what they do best?
Maybe it's time to start looking elsewhere. I like the idea of Nutch, the effort to implement an open-source web search engine.
-
Re:Why isn't "someone" Tim Bray
-
Re:Before anybody gets too worked up...
Actually, the directory is simply a reformatted version of the Open Directory, so MS would have to fight yet another OSS project to get control of Google's directory. Nutch looks like it could be the open source candidate, and AllTheWeb is the closed source candidate for a new search engine (it can search for audio or video). AllTheWeb doesn't have a Usenet search, but they do have a file search (think of putting ANYTHING in the filetype: parameter). Maybe Deja.com could buy Google Groups back (if GG were bought by Deja.com, the domain would be available - deja.com points to groups.google.com)...
-
Re:Before anybody gets too worked up...
There's always Nutch... (an open-source Google-like engine)
-
Re:April Fools year round with Slashdot
As the parent post says, isn't the /. summary a bit premature?
But still, it would be wise to be cautious. Who knows what the next offer to Google might be? An offer they can't refuse?
Anyway, it's nice to know we do have some alternatives: the open source search engine Nutch is in its larval stage. Let's hope they get it up and running before some company with plans for WORLD DOMINATION takes over Google!
-
Mighty google?
Why are people so obsessed with google? They defend it like it's a member of their family. Google this, google that. Having one search engine that everyone relies on is not a good situation. I'm hoping nutch is making progress..
-
Re:Corporate entity
um...don't know if this has been mentioned... nutch- open source web search engine
-
Re:That's nice and all but the code isn't the prob
Yeah, bandwidth and hardware are rather limiting in building a large search service. There is Nutch, a project to start an open source search engine.
Until that gets off the ground, if you're worried about Google, you can use different searches as well. Someone like Hotbot lets you choose the engine from the standard search page.
Really, with all the different engines out there, it's not like you have to use Google, it's just been the best for relevant results for a while.
-
Nutch?
Has anyone seen nutch? It looks pretty interesting. "Nutch provides a transparent alternative to commercial web search engines. Only open source search results can be fully trusted to be without bias. (Or at least their bias is public.)"
Take a look here.
-
Re:And I'm just sure...
but what if (using your analogy) the cat actually gets the mouse? what if it becomes law that google can't index links that breach the dmca? they'd have to check not only each new site they are going to index to make sure it doesn't breach dmca, but all the sites they are currently indexing to make sure they haven't fallen out of compliance.
and that really would be a horrible state of affairs for the internet. maybe projects like nutch are the only way internet searching is going to advance in the future. or would even this service fall under the same restrictive laws?