Building a Search Engine Using Open Technology?
cybrthng asks: "Mozdex.com is my attempt at building a search engine capable of indexing the entire web. Our goal is to provide a completely transparent system utilizing open technologies such as Nutch, Lucene and other systems to provide a search facility that is more scientific and 'protocol' vs the current proprietary and almost 'faith based' search engine results and methods of getting listed. What do you look for out of a search engine? What would you look for out of this project? Should large commercial entities be the only way we find information and resources on the net? BTW, our beta index currently has about 50 million pages and we hope it shows what can be done using Open Source systems available today. We are seeking input on starting a developer & input community as well as getting concepts and ideas out and about, so we value your ideas and what you hope to see out of this project."
Is it different only because it runs on open source software? Hell, Google does that successfully already.
The thing I look for is a polite bot. Does it follow robots.txt fully? Does it hammer the server? Does it honor page modification headers?
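For what it's worth, the three politeness checks above can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not how Nutch or Mozdex actually do it: the bot name `ExampleBot`, the `PoliteFetchPlanner` class, and the one-second default delay are all made up for the example, and nothing here touches the network.

```python
import time
import urllib.request
import urllib.robotparser
from email.utils import formatdate
from urllib.parse import urlparse

USER_AGENT = "ExampleBot"  # hypothetical crawler name, not a real bot


class PoliteFetchPlanner:
    """Decides whether, when, and how to request a URL politely."""

    def __init__(self, robots_txt, default_delay=1.0):
        # Parse robots.txt text that was fetched elsewhere.
        self.rules = urllib.robotparser.RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        self.default_delay = default_delay  # assumed fallback, in seconds
        self.last_hit = {}  # host -> timestamp of the last planned request

    def allowed(self, url):
        # 1. Follow robots.txt: skip anything the site disallows.
        return self.rules.can_fetch(USER_AGENT, url)

    def wait_seconds(self, url, now):
        # 2. Don't hammer the server: honor Crawl-delay if the site sets one.
        delay = self.rules.crawl_delay(USER_AGENT) or self.default_delay
        last = self.last_hit.get(urlparse(url).netloc)
        return 0.0 if last is None else max(0.0, last + delay - now)

    def build_request(self, url, last_fetched=None, now=None):
        # 3. Use modification headers: send If-Modified-Since so an
        #    unchanged page can be answered with a cheap 304 response.
        headers = {"User-Agent": USER_AGENT}
        if last_fetched is not None:
            headers["If-Modified-Since"] = formatdate(last_fetched, usegmt=True)
        self.last_hit[urlparse(url).netloc] = time.time() if now is None else now
        return urllib.request.Request(url, headers=headers)


robots = "User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n"
planner = PoliteFetchPlanner(robots)
request = planner.build_request("http://example.com/index.html", last_fetched=0, now=100.0)
```

A real crawler would also cache robots.txt per host and back off on errors; the sketch only shows where each of the three questions gets answered.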
An open source search engine is a great idea! I'll know exactly how to exploit the ranking algorithms to position my pages as #1!
NO CARRIER
Yes, and a related topic is indexing files that are in some specialized format.
I run a search site that only indexes a few hundred other sites and around 170,000 files (today). What the files contain doesn't matter here. What's significant is that the data, while being (usually) plain ASCII text, is not in any human language. If you saw it and didn't know the subject area, you wouldn't be able to make sense of it. It's very useful to a few thousand users, and of no interest whatsoever to anyone else.
One thing that could be feasible with an open-source search project is to discuss ways in which specialized search engines like mine can be incorporated. The data that I index can be related to several other kinds of online data that are in turn indexed by others. But my code doesn't make the connection, and neither do the search engines for the related types of data.
This strikes me as a significant problem that the big guys can't much work on (yet). And, like "orphan" drugs, they probably won't ever find it worthwhile to work on most kinds of data that only exist in a few thousand files.
But if we could define a way to interface search engines so that they can recognize each other and refer queries to each other, then these specialized data formats could be usefully searched and indexed.
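To make the idea a little more concrete, here is a toy sketch of engines that know about each other and refer a query onward when they don't handle the requested data format. Everything in it is an assumption for illustration: the `Engine` class, the `formats` and `peers` fields, and the format names `html` and `format-x` are invented, and a real system would refer queries over the network rather than through in-process calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Engine:
    name: str
    formats: Set[str]                       # data formats this engine indexes
    search: Callable[[str], List[str]]      # the engine's own local search
    peers: List["Engine"] = field(default_factory=list)

    def query(self, text: str, fmt: str,
              _seen: Optional[Set[str]] = None) -> List[Tuple[str, str]]:
        """Answer locally if we index `fmt`, otherwise refer to peers."""
        seen = set() if _seen is None else _seen
        if self.name in seen:
            return []                       # already asked; avoid referral loops
        seen.add(self.name)
        if fmt in self.formats:
            return [(self.name, hit) for hit in self.search(text)]
        results = []
        for peer in self.peers:
            results.extend(peer.query(text, fmt, seen))
        return results


# Two hypothetical engines that recognize each other.
general = Engine("general", {"html"}, lambda q: ["page about " + q])
special = Engine("special", {"format-x"}, lambda q: ["file matching " + q])
general.peers.append(special)
special.peers.append(general)

hits = general.query("abc", "format-x")     # referred to the specialized engine
```

The shared `seen` set is what keeps two engines that list each other as peers from bouncing a query back and forth forever.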
Sounds worthwhile to me. I wonder if I could find someone to pay me a salary while I worked on it?
Those who do study history are doomed to stand helplessly by while everyone else repeats it.
What would you look for out of this project?
The only thing that matters is results. Is the answer that I need in the first three or four results? If you can do that, you win. If you can't, don't bother.
I'm skeptical about how realistic it is to develop an open source search engine. Wikipedia, although cool, has large gaps in content, and only a few months ago was begging for donations to survive. I'm betting that a Google-sized operation would be even more resource-intensive.
Three Squirrels
What I have found REALLY interesting about MozDex is the "explain" button, which I assume provides some insight into why MozDex decided to rank that web URL where it did ... but the information as currently presented is neither understandable nor explained.
For instance, I was interested in where a Google Compute web page came up and was actually quite surprised that a MozDex search shows it as #1. So I click on the explain button and I get a page with a buncha numbers ... but nowhere on this page (or anywhere on the MozDex site) can I find an explanation of what the heck they mean.
Since your claim-to-fame is open source/search, I think adding information on the internal algorithms would help you out. Keep up the good work - interesting stuff! ;-)
alek
P.S. Minor typo in the Corporate Info link from your FAQ
Hulk SMASH Celiac Disease
You could do that by (a) putting in more keywords; or (b) letting the search engine suggest topics/extra search keywords for a given search; some search engines try to do this already. As to how, latent semantic indexing looks good (it's a matrix technique used to find relationships between bits of data, such as the ones you discuss).
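For anyone who hasn't met latent semantic indexing, here is a tiny pure-Python sketch of the idea: build a term-by-document matrix, keep only its top few singular directions, and compare queries and documents in that reduced space. The three toy documents and the vocabulary are made up, and the power-iteration routine is a stand-in chosen to avoid dependencies; a real system would use a proper SVD library and weighted term counts.

```python
import math


def term_doc_matrix(docs, vocab):
    """Rows are terms, columns are documents, entries are raw term counts."""
    return [[doc.split().count(term) for doc in docs] for term in vocab]


def mat_vec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def top_singular_triplets(a, k, iters=500):
    """Approximate the top-k (sigma, u, v) of `a` by power iteration with
    deflation on the Gram matrix A^T A. Assumes k <= rank(a)."""
    n_docs = len(a[0])
    a_t = [list(col) for col in zip(*a)]
    gram = [mat_vec(a_t, col) for col in a_t]        # A^T A, docs x docs
    triplets = []
    for _component in range(k):
        v = [1.0 + 0.1 * i for i in range(n_docs)]   # fixed start vector
        for _step in range(iters):
            w = mat_vec(gram, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, mat_vec(gram, v)))  # Rayleigh quotient
        sigma = math.sqrt(lam)
        u = [x / sigma for x in mat_vec(a, v)]       # matching left vector
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before the next pass.
        gram = [[gram[i][j] - lam * v[i] * v[j] for j in range(n_docs)]
                for i in range(n_docs)]
    return triplets


def fold_in(term_counts, triplets):
    """Project a term-count vector into the latent space (q^T U_k S_k^-1)."""
    return [sum(q * x for q, x in zip(term_counts, u)) / sigma
            for sigma, u, _v in triplets]


def doc_coords(j, triplets):
    """Latent coordinates of document j (row j of V_k)."""
    return [v[j] for _sigma, _u, v in triplets]


def cosine(x, y):
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 0.0 if nx == 0 or ny == 0 else sum(a * b for a, b in zip(x, y)) / (nx * ny)


vocab = ["car", "auto", "engine", "flower", "petal"]
docs = ["car engine", "auto engine", "flower petal"]
triplets = top_singular_triplets(term_doc_matrix(docs, vocab), k=2)
query = fold_in([1, 0, 0, 0, 0], triplets)           # the single word "car"
scores = [cosine(query, doc_coords(j, triplets)) for j in range(len(docs))]
```

The point of the toy data: the query "car" shares no word with "auto engine", yet in the two-dimensional latent space the two land in the same direction because both co-occur with "engine", while "flower petal" stays unrelated. That is the kind of relationship a suggestion feature could exploit.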
pb Reply or e-mail; don't vaguely moderate.
While browsing the Mozdex site, I learned they are using Nutch, an open source search engine. So I started browsing the Nutch site. On their site I found out that they are sponsored by Overture Research ... The name seemed familiar. Clicking on the link I arrived at http://labs.yahoo.com.
Apparently Yahoo is rather interested in this project. Browsing the Yahoo Labs site I found this page (which is also the third hit when googling for nutch): "Welcome to the Yahoo! Research Labs implementation of the Nutch open source search engine (www.nutch.org). This search engine is intended as a demonstration platform for a number of search related technologies that we are working on and is specifically not intended to provide a full and comprehensive search experience for the average user. If you do a search here, please do not be surprised or offended if your favorite site is not in the result set for your query.
With this in mind, please feel free to test drive the technology. Happy Nutch-ing."
A very quick test shows that mozdex's index of 50 million pages (and counting) is indeed still far too small to really find something. The ranking system will also need some tweaking, but this is also clearly stated on the Nutch site: "Nutch has not yet been tuned for quality. There are ten or twenty knobs that we can twiddle to adjust the ranking formula. We are developing software to do this tuning automatically, but the current code just contains guesses. With a little tuning we should be able to get results that are competitive with those of major search engines."
Although it is currently not possible to do any real comparison due to the big difference in the number of indexed pages, it sure is nice to see both the Nutch project and the Mozdex project. I hope that both of these projects will receive enough funding (and hardware) to continue, and maybe we'll see another /. post when they hit the 5 billion page count and we will be able to do a massive comparison ... and all change from googling to nutching or mozdexing!
One to watch
I'd love to be able to filter out all sites that are trying to sell something.
Searching on Google for things like reviews of mp3 players has become a nightmare these days. Any useful sites are drowned out in a noise of pricerunner/dealtime/kelkoo/shopping.yahoo/etc and other sites that are simply affiliate sites for Amazon etc.
An OSS search engine that actually indexes the entire web and is used by many people is at least a couple of orders of magnitude harder than the Mozilla project.
Writing the search code itself is not too hard (you still need a PhD in data structures and algorithms, but those can be found); the really hard part is the amount of bandwidth and CPU power that is required.
You need a name that is as easy to pronounce as Google. One that sounds as friendly would be good as well.
You're "competing" with Google in a number of different areas, including the name of course.
The first thing that came to my mind when I read the name was: "Typical for geeks who are good at the technical side of things, but are bad at marketing and the human interface/psychology side".
- -- Truth addict for life.
cat database | grep query
Completely Open Source!
how long until