Slashdot Mirror


Building a Search Engine Using Open Technology?

cybrthng asks: "Mozdex.com is my attempt at building a search engine capable of indexing the entire web. Our goal is to provide a completely transparent system, using open technologies such as Nutch and Lucene, whose search facility is more scientific and 'protocol'-driven than the current proprietary and almost 'faith based' search engine results and methods of getting listed. What do you look for out of a search engine? What would you look for out of this project? Should large commercial entities be the only way we find information and resources on the net? BTW, our beta index currently has about 50 million pages and we hope it shows what can be done using Open Source systems available today. We are seeking input on starting a developer & input community as well as getting concepts and ideas out and about, so we value your ideas and what you hope to see out of this project."

3 of 42 comments

  1. Open source search engine? by Chester K · Score: 4, Funny

    An open source search engine is a great idea! I'll know exactly how to exploit the ranking algorithms to position my pages as #1!

    --

    NO CARRIER
  2. Re:how is this different? by k4_pacific · Score: 4, Interesting

    Yes, Google already runs on OSS, even though the search software itself is proprietary. If you wanted to truly put the search engine in the hands of the people, consider this idea. You could use P2P technology to distribute the search index across millions of systems worldwide. If someone wants to use the search engine, they must download the client software and donate, say, 100 MB to the project. Of course, you would have to set the system up with massive redundancy to handle cases where individual nodes are offline. Also, the logistics of distributing the search across so many systems would need to be worked out. Furthermore, there is the possibility that users may attempt to tweak the client handling their node to increase the score for various pages or decrease the score for others. These issues would have to be worked out, but it could be feasible. Frankly, I'm too lazy to implement it, but you are welcome to credit me for the idea when it's all done.

    --
    Unknown host pong.
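The redundancy part of the P2P idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not anything Mozdex or Nutch is known to do: it uses rendezvous hashing (a technique chosen here for the example) to assign each index term to several peers, and the peer names, the replica count of 3, and the function names are all hypothetical.

```python
import hashlib

REPLICAS = 3  # assumed redundancy factor; not a figure from the thread


def _score(node_id: str, term: str) -> int:
    """Deterministic pseudo-random score for a (node, term) pair."""
    digest = hashlib.sha256(f"{node_id}:{term}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def nodes_for_term(term, nodes, replicas=REPLICAS):
    """Rendezvous hashing: the `replicas` highest-scoring nodes hold the term."""
    ranked = sorted(nodes, key=lambda n: _score(n, term), reverse=True)
    return ranked[:replicas]


def lookup(term, nodes, online, replicas=REPLICAS):
    """Return the first replica holder that is online, or None if all are down."""
    for node in nodes_for_term(term, nodes, replicas):
        if node in online:
            return node
    return None


if __name__ == "__main__":
    peers = [f"peer-{i}" for i in range(10)]  # hypothetical peer names
    holders = nodes_for_term("nutch", peers)
    print("replica holders:", holders)
    # Simulate the first holder going offline; the query falls through
    # to the next replica instead of failing.
    print("served by:", lookup("nutch", peers, set(peers) - {holders[0]}))
```

Because every client computes the same ranking, no central directory is needed, and a peer that joins or leaves only affects the terms it held. The sketch does nothing about the tampering problem raised above; that would need something extra, such as comparing answers from several replicas.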
  3. Mozdex using Nutch sponsored by Overture which is by christophe.vg · Score: 5, Informative

    While browsing the Mozdex site, I learned they are using Nutch, an open source search engine. So I started browsing the Nutch site. On their site I found out that they are sponsored by Overture Research ... The name seemed familiar. Clicking on the link I arrived at http://labs.yahoo.com.

    Apparently Yahoo is rather interested in this project. Browsing the Yahoo Labs site I found this page (which is also the third hit when googling for nutch): "Welcome to the Yahoo! Research Labs implementation of the Nutch open source search engine (www.nutch.org). This search engine is intended as a demonstration platform for a number of search related technologies that we are working on and is specifically not intended to provide a full and comprehensive search experience for the average user. If you do a search here, please do not be surprised or offended if your favorite site is not in the result set for your query.
    With this in mind, please feel free to test drive the technology. Happy Nutch-ing."

    A very quick test shows that mozdex's index of 50 million pages is indeed still far too small to really find something. The ranking system will also need some tweaking, but this is also clearly stated on the nutch site: "Nutch has not yet been tuned for quality. There are ten or twenty knobs that we can twiddle to adjust the ranking formula. We are developing software to do this tuning automatically, but the current code just contains guesses. With a little tuning we should be able to get results that are competitive with those of major search engines."

    Although it is currently not possible to do any real comparison due to the big difference in the number of indexed pages, it sure is nice to see both the Nutch project and the Mozdex project. I hope that both of these projects will receive enough funding (and hardware) to continue, and maybe we'll see another /. post when they hit the 5 billion page count and we will be able to do a massive comparison ... and all change from googling to nutching or mozdexing!

    One to watch