Building a Search Engine Using Open Technology?

Posted by Cliff on Wednesday May 12, 2004 @02:47PM from the free-indexing dept.

cybrthng asks: "Mozdex.com is my attempt at building a search engine capable of indexing the entire web. Our goal is to provide a completely transparent system utilizing open technologies such as Nutch, Lucene and other systems to provide a search facility that is more scientific and 'protocol' vs the current proprietary and almost 'faith based' search engine results and methods of getting listed. What do you look for out of a search engine? What would you look for out of this project? Should large commercial entities be the only way we find information and resources on the net? BTW, our beta index currently has about 50 million pages and we hope it shows what can be done using Open Source systems available today. We are seeking input on starting a developer & input community as well as getting concepts and ideas out and about, so we value your ideas and what you hope to see out of this project."

how is this different? by shaitand · 2004-05-12 14:47 · Score: 2, Insightful

Support our index, sponsor mozAds keyword Advertising as low as 1/cent click

Is it different only because it runs on open source software? Hell, Google does that successfully already.
As a webmaster by Anonymous Coward · 2004-05-12 14:50 · Score: 2, Insightful

The thing I look for is a polite bot. Does it follow robots.txt fully? Does it hammer the server? Does it honor page modification headers?
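
The three questions above can be sketched in Python with the standard library's urllib.robotparser. This is only a minimal sketch of what "polite" might mean, not how Nutch or Mozdex actually behave; the bot name "ExampleBot", the helper names, and the one-second default delay are all assumptions made up for illustration.

```python
import urllib.robotparser

USER_AGENT = "ExampleBot"  # hypothetical crawler name, not a real bot


def build_policy(robots_lines):
    """Parse robots.txt content (a list of lines) into a policy object."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser


def crawl_delay_seconds(parser, default=1.0):
    """Use the site's Crawl-delay if it gives one, else an assumed default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def conditional_headers(last_modified=None, etag=None):
    """Build request headers so an unchanged page can answer 304 Not Modified."""
    headers = {"User-Agent": USER_AGENT}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


# Example robots.txt, supplied inline so the sketch needs no network access.
policy = build_policy([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])
allowed = policy.can_fetch(USER_AGENT, "http://example.com/index.html")
blocked = policy.can_fetch(USER_AGENT, "http://example.com/private/page.html")
```

A crawler built this way would skip any URL where can_fetch returns False, sleep for crawl_delay_seconds between requests to the same host, and resend the Last-Modified value it saw on the previous visit so the server can skip the page body when nothing changed.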
results of course by rueger · 2004-05-12 15:38 · Score: 3, Insightful

What would you look for out of this project?

The only thing that matters is results. Is the answer that I need in the first three or four results? If you can do that, you win. If you can't, don't bother.

I'm skeptical about how realistic it is to develop an open source search engine. Wikipedia, although cool, has large gaps in content, and only a few months ago was begging for donations to survive. I'm betting that a Google sized operation would be even more resource intensive.

--
Three Squirrels
1. Re:results of course by MikeCapone · 2004-05-12 17:36 · Score: 2, Insightful
  
  I'm skeptical about how realistic it is to develop an open source search engine. Wikipedia, although cool, has large gaps in content, and only a few months ago was begging for donations to survive.
  
  Well, Wikipedia did get almost $30K in donations that time and is still getting lots of donations from what I gather, and could easily get a lot more whenever it wanted because lots of people LOVE that project, so that part is successful.
  
  As for the large gaps in content, it is being worked on every day. That's the nature of a work in progress, and I'm sure that compared to Wikipedia many other "real" encyclopedias have "large gaps in content", shorter articles because of space/price restrictions and, especially, lots of terribly outdated content. And the good part is, if you see something bad, you can actually fix it!
  
  Open source/stuff often looks unfinished compared to proprietary stuff because, well, it is! If you could've seen the "betas" of proprietary encyclopedias and software you'd probably see most of the same failings.
  
  --
  Treehugger? Treehugger... Treehugger!
Re:Open source search engine? by addaon · 2004-05-12 17:51 · Score: 3, Insightful

No, it's a game where those who focus on the value of their content lose to those who focus on the marketing of their content.

--

I've had this sig for three days.
Re:Subject/Topic based filters by pb · 2004-05-12 18:04 · Score: 2, Insightful

You could do that by (a) putting in more keywords, or (b) letting the search engine suggest topics or extra search keywords for a given search; some search engines try to do this already. As to how, latent semantic indexing looks good (it's a matrix technique used to find relationships between bits of data, such as the ones you discuss).
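
To make option (b) concrete, here is a rough Python sketch of keyword suggestion. Note that this is not latent semantic indexing: it swaps in a much simpler co-occurrence count (which words show up in the same documents as the query terms), because real LSI needs a singular value decomposition of the term-document matrix. The function name and the toy documents are assumptions for illustration only.

```python
from collections import Counter


def suggest_terms(query, documents, limit=3):
    """Suggest extra keywords that co-occur with every query term.

    A plain co-occurrence count, used here as a simple stand-in for LSI.
    """
    query_terms = set(query.lower().split())
    counts = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        # Only documents containing every query term contribute suggestions.
        if query_terms <= words:
            counts.update(words - query_terms)
    # Most frequent first; ties broken alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]


# Toy corpus, made up for this example.
docs = [
    "jaguar car speed",
    "jaguar cat jungle",
    "jaguar car engine",
    "python snake jungle",
]
suggestions = suggest_terms("jaguar", docs, limit=2)
```

The user could then pick one of the suggested terms to narrow an ambiguous search by topic. LSI would go further and relate terms that never appear in the same document, which a raw count like this cannot do.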

--
pb Reply or e-mail; don't vaguely moderate.
Filter commercial sites by iangoldby · 2004-05-12 20:23 · Score: 3, Insightful

I'd love to be able to filter out all sites that are trying to sell something.

Searching on Google for things like reviews of mp3 players has become a nightmare these days. Any useful sites are drowned out in a noise of pricerunner/dealtime/kelkoo/shopping.yahoo/etc and other sites that are simply affiliate sites for Amazon etc.
In over your heads? by Alomex · 2004-05-13 00:11 · Score: 2, Insightful

An OSS search engine that actually indexes the entire web and is used by many people is at least a couple of orders of magnitude harder than the Mozilla project.

Writing the search code itself is not too hard (you still need a PhD in data structures and algorithms, but those can be found); the real hard part is the amount of bandwidth and CPU power that is required.
Re:Filter commercial sites -MOD UP IF YOU WILL by Anonymous Coward · 2004-05-13 10:05 · Score: 1, Insightful

Try this:

MP3 player review -buy -dealtime -pricegrabber -kelkoo -shopping.com -amazon.com -nextag.com -bizrate.com -moreover.com -celeb -porn -free -coupon -pimprig.com

I have it set to a hotkey for all the -'s and boom, relatively valid search results.

A gift from a marketroid to you techies.
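
For anyone building this kind of filter into their own index, the minus-term query above can be sketched in a few lines of Python. This is a simplified illustration and not how Google or any real engine handles the "-" operator; the function names are made up, and matching is by plain substring, so "-free" would also reject a page mentioning "freedom".

```python
def parse_query(query):
    """Split a query into required terms and '-' prefixed excluded terms."""
    required, excluded = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            required.append(token)
    return required, excluded


def matches(text, required, excluded):
    """True if text contains every required term and no excluded term."""
    haystack = text.lower()
    has_all = all(term in haystack for term in required)
    has_banned = any(term in haystack for term in excluded)
    return has_all and not has_banned


required, excluded = parse_query("MP3 player review -buy -amazon.com")
keep = matches("Hands-on MP3 player review from a hobbyist", required, excluded)
drop = matches("Buy this MP3 player now, review at amazon.com", required, excluded)
```

A saved list of excluded terms, like the hotkey mentioned above, could be appended to every query before parse_query runs.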