Building a Search Engine Using Open Technology?
cybrthng asks: "Mozdex.com is my attempt at building a search engine capable of indexing the entire web. Our goal is to provide a completely transparent system utilizing open technologies such as Nutch, Lucene and other systems to provide a search facility that is more scientific and 'protocol' vs the current proprietary and almost 'faith based' search engine results and methods of getting listed. What do you look for out of a search engine? What would you look for out of this project? Should large commercial entities be the only way we find information and resources on the net? BTW, our beta index currently has about 50 million pages and we hope it shows what can be done using Open Source systems available today. We are seeking input on starting a developer & input community as well as getting concepts and ideas out and about, so we value your ideas and what you hope to see out of this project."
Yes, Nutch is designed to be a good bot and follow the normal rules, but just like any open source project, it could potentially be used badly by someone.
More information can be found on the Nutch Webmaster Information Page.
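For anyone wondering what "follow the normal rules" means in practice, here is a minimal sketch of a robots.txt check using Python's standard urllib.robotparser. This is not Nutch's code (Nutch is written in Java); the bot names, the example.com URLs and the robots.txt content are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from the site's /robots.txt before fetching anything else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A well-behaved bot asks before every fetch and skips disallowed URLs.
for agent, url in [
    ("ExampleBot", "http://example.com/private/page.html"),
    ("ExampleBot", "http://example.com/public/page.html"),
    ("OtherBot", "http://example.com/tmp/scratch.html"),
]:
    verdict = "fetch" if may_fetch(parser, agent, url) else "skip"
    print(agent, url, verdict)
```

The check is purely voluntary on the crawler's side, which is the point made above: nothing stops someone from removing it.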
Actually, there are a lot of open source porn-search projects. For instance, gnaughty, and the Porn Toolkit.
While browsing the Mozdex site, I learned they are using Nutch, an open source search engine. So I started browsing the Nutch site, where I found out that they are sponsored by Overture Research. The name seemed familiar. Clicking on the link, I arrived at http://labs.yahoo.com.
Apparently Yahoo is rather interested in this project. Browsing the Yahoo Labs site I found this page (which is also the third hit when googling for nutch): "Welcome to the Yahoo! Research Labs implementation of the Nutch open source search engine (www.nutch.org). This search engine is intended as a demonstration platform for a number of search related technologies that we are working on and is specifically not intended to provide a full and comprehensive search experience for the average user. If you do a search here, please do not be surprised or offended if your favorite site is not in the result set for your query.
With this in mind, please feel free to test drive the technology. Happy Nutch-ing."
A very quick test shows that Mozdex's index of 50 million pages (and counting) is indeed still far too small to really find anything. The ranking system will also need some tweaking, but this is clearly stated on the Nutch site: "Nutch has not yet been tuned for quality. There are ten or twenty knobs that we can twiddle to adjust the ranking formula. We are developing software to do this tuning automatically, but the current code just contains guesses. With a little tuning we should be able to get results that are competitive with those of major search engines."
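To give a feel for what "knobs" in a ranking formula might look like, here is a toy sketch. It is not Nutch's actual formula: the three signals (title matches, body matches, inbound link count), the weights and the sample pages are all assumptions made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Knobs:
    """Hypothetical tuning weights; real engines have many more."""
    title_weight: float = 2.0
    body_weight: float = 1.0
    link_weight: float = 0.5


def score(page: dict, query_terms: list, knobs: Knobs) -> float:
    """Weighted sum of term matches plus a damped inbound-link signal."""
    title_words = page["title"].lower().split()
    body_words = page["body"].lower().split()
    title_hits = sum(title_words.count(t) for t in query_terms)
    body_hits = sum(body_words.count(t) for t in query_terms)
    link_signal = math.log(1 + page["inlinks"])
    return (knobs.title_weight * title_hits
            + knobs.body_weight * body_hits
            + knobs.link_weight * link_signal)


def rank(pages: list, query_terms: list, knobs: Knobs) -> list:
    """Return page URLs ordered from highest to lowest score."""
    ordered = sorted(pages, key=lambda p: score(p, query_terms, knobs),
                     reverse=True)
    return [p["url"] for p in ordered]


# Made-up sample pages: "a" matches in the title, "b" is heavily linked.
PAGES = [
    {"url": "a", "title": "nutch search",
     "body": "open source search engine", "inlinks": 1},
    {"url": "b", "title": "home page",
     "body": "search search search tips", "inlinks": 100},
]

print(rank(PAGES, ["search"], Knobs()))
print(rank(PAGES, ["search"], Knobs(title_weight=5.0, link_weight=0.0)))
```

Changing the weights reorders the same two pages, which is why untuned guesses can make an otherwise sound engine look bad.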
Although it is currently not possible to do any real comparison due to the big difference in the number of indexed pages, it sure is nice to see both the Nutch project and the Mozdex project. I hope that both of these projects will receive enough funding (and hardware) to continue, and maybe we'll see another /. post when they hit the 5 billion page count and we will be able to do a massive comparison ... and all change from googling to nutching or mozdexing!
One to watch
Although the website mentions "open source" a lot, it only supplies a link to a SourceForge page which does not seem to offer anything downloadable.
Although Mozdex appears to be acting in good faith, note that the GPL does not force them to distribute changes to GPLed code as long as they are the only ones using the code. The GPL would only take effect if they distributed modified binaries, but they do not distribute anything other than HTML web content. This could become a major headache with the GPL.