Building a Search Engine Using Open Technology?
cybrthng asks: "Mozdex.com is my attempt at building a search engine capable of indexing the entire web. Our goal is to provide a completely transparent system utilizing open technologies such as Nutch, Lucene and other systems to provide a search facility that is more scientific and 'protocol' vs the current proprietary and almost 'faith based' search engine results and methods of getting listed. What do you look for out of a search engine? What would you look for out of this project? Should large commercial entities be the only way we find information and resources on the net? BTW, our beta index currently has about 50 million pages and we hope it shows what can be done using Open Source systems available today. We are seeking input on starting a developer & input community as well as getting concepts and ideas out and about, so we value your ideas and what you hope to see out of this project."
Yes, Google already runs on OSS, even though the search software itself is proprietary. If you wanted to truly put the search engine in the hands of the people, consider this idea. You could use P2P technology to distribute the search index across millions of systems worldwide. If someone wants to use the search engine, they must download the client software and donate, say, 100 MB to the project. Of course, you would have to have the system set up so that it has massive redundancy to handle cases where individual nodes are offline. Also, the logistics of distributing the search across so many systems would need to be worked out. Furthermore, there is the possibility that users may attempt to tweak the client handling their node to increase the score for various pages or decrease the score for others. These issues would have to be worked out, but it could be feasible. Frankly, I'm too lazy to implement it, but you are welcome to credit me for the idea when it's all done.
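To make the redundancy part of that idea concrete, here is a rough Python sketch of one way to decide which nodes hold the index entries for a term. It uses rendezvous (highest-random-weight) hashing, which is my own choice of technique and not something Nutch or MozDex is known to do. The node names, the replica count k=3 and the function names are all hypothetical.

```python
import hashlib


def _weight(term: str, node: str) -> int:
    """Deterministic pseudo-random weight for a (node, term) pair."""
    digest = hashlib.sha256(f"{node}|{term}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(term: str, nodes: list, k: int = 3) -> list:
    """Return the k nodes that should store the index entries for `term`.

    Every client computes the same ranking from the same node list,
    so no central directory is needed (an assumption of this sketch).
    """
    ranked = sorted(nodes, key=lambda n: _weight(term, n), reverse=True)
    return ranked[:k]


def lookup(term: str, nodes: list, online: set, k: int = 3):
    """Pick the first replica that is currently online, or None."""
    for node in replicas_for(term, nodes, k):
        if node in online:
            return node
    return None


if __name__ == "__main__":
    # Hypothetical volunteer nodes, each donating some disk space.
    nodes = [f"node-{i}" for i in range(10)]
    holders = replicas_for("lucene", nodes)
    print("replicas:", holders)
    # Simulate the first replica going offline.
    online = set(nodes) - {holders[0]}
    print("served by:", lookup("lucene", nodes, online))
```

This only covers placement and failover. It does nothing about the score-tampering problem; one possible approach would be to query several replicas and compare their answers, but that is outside this sketch.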
Unknown host pong.
Yes, and a related topic is indexing files that are in some specialized format.
I run a search site that only indexes a few hundred other sites and around 170,000 files (today). What the files contain doesn't matter here. What's significant is that the data, while being (usually) plain ASCII text, is not in any human language. If you saw it and didn't know the subject area, you wouldn't be able to make sense of it. It's very useful to a few thousand users, and of no interest whatsoever to anyone else.
One thing that could be feasible with an open-source search project is to discuss ways in which specialized search engines like mine can be incorporated. The data that I index can be related to several other kinds of online data that are in turn indexed by others. But my code doesn't make the connection, and neither do the search engines for the related types of data.
This strikes me as a significant problem that the big guys can't much work on (yet). And, like "orphan" drugs, they probably won't ever find it worthwhile to work on most kinds of data that only exist in a few thousand files.
But if we could define a way to interface search engines so that they can recognize each other and refer queries to each other, then these specialized data formats could be usefully searched and indexed.
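As a rough illustration of what such an interface might look like, here is a minimal Python sketch in which each specialized engine announces the data formats it understands and a shared registry refers queries to the matching engines. No such protocol exists as far as I know; the class names, the format labels and the in-process function calls (standing in for real network requests) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass
class Engine:
    """A specialized search engine as seen by its peers (hypothetical)."""
    name: str
    formats: FrozenSet[str]              # data formats this engine indexes
    search: Callable[[str], List[str]]   # stand-in for a remote query call


class QueryRegistry:
    """Lets engines recognize each other and refer queries onward."""

    def __init__(self) -> None:
        self._engines: List[Engine] = []

    def register(self, engine: Engine) -> None:
        self._engines.append(engine)

    def refer(self, query: str, data_format: str) -> Dict[str, List[str]]:
        """Send the query to every engine that handles `data_format`."""
        results: Dict[str, List[str]] = {}
        for engine in self._engines:
            if data_format in engine.formats:
                results[engine.name] = engine.search(query)
        return results


if __name__ == "__main__":
    registry = QueryRegistry()
    # Two made-up engines with made-up format labels.
    registry.register(Engine("engine-a", frozenset({"format-x"}),
                             lambda q: [f"a-hit-for-{q}"]))
    registry.register(Engine("engine-b", frozenset({"format-x", "format-y"}),
                             lambda q: [f"b-hit-for-{q}"]))
    print(registry.refer("example", "format-x"))
```

A general-purpose engine that meets a query it cannot handle would call refer() instead of returning nothing, which is the "refer queries to each other" step described above.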
Sounds worthwhile to me. I wonder if I could find someone to pay me a salary while I worked on it?
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
What I have found REALLY interesting about MozDex is the "explain" button which I assume provides some insights into why MozDex decided to rank that web URL as whatever ... but the information as currently presented isn't understandable and/or explained.
For instance, I was interested where a Google Compute web page came up and was actually quite surprised that a MozDex Search shows it as #1. So I click on the explain button and I get a page with a buncha numbers ... but nowhere on this page (or anywhere on the MozDex site) can I find an explanation for what the heck they mean.
Since your claim-to-fame is open source/search, I think adding information on the internal algorithms would help you out. Keep up the good work - interesting stuff! ;-)
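To show the kind of labeled breakdown I have in mind, here is a small Python sketch that scores one term against one page with a textbook TF-IDF style formula and names each factor. This is not MozDex's or Nutch's actual ranking; the formula, the function names and the sample numbers are assumptions picked only to illustrate a readable "explain" page.

```python
import math


def explain_score(term, doc_tokens, doc_freq, num_docs):
    """Score one term against one document and keep every factor.

    doc_freq is the number of documents containing the term and
    num_docs is the size of the whole index (both supplied by the caller).
    """
    freq = doc_tokens.count(term)
    tf = math.sqrt(freq)                              # term frequency factor
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))   # rarity of the term
    length_norm = 1.0 / math.sqrt(len(doc_tokens)) if doc_tokens else 0.0
    return {
        "term occurrences in page": freq,
        "tf (sqrt of occurrences)": tf,
        "idf (rarer terms score higher)": idf,
        "length norm (shorter pages score higher)": length_norm,
        "score (tf * idf * length norm)": tf * idf * length_norm,
    }


def format_explanation(parts):
    """Render the factors one per line, each with a plain-English label."""
    return "\n".join(f"{label}: {value:.4f}" for label, value in parts.items())


if __name__ == "__main__":
    page = ["google", "compute", "google", "page"]
    print(format_explanation(explain_score("google", page, 9, 10)))
```

Even a one-line label next to each number would make the existing page far easier to read.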
alek
P.S. Minor typo in the Corporate Info link from your FAQ
Hulk SMASH Celiac Disease
You need a name that is as easy to pronounce as Google. One that sounds as friendly would be good as well.
You're "competing" with Google in a number of different areas, including the name of course.
The first thing that came to my mind when I read the name was: "Typical for geeks who are good at the technical side of things, but are bad at marketing and the human interface/psychology side".
-- Truth addict for life.