Building a Search Engine Using Open Technology?

Posted by Cliff on Wednesday May 12, 2004 @02:47PM from the free-indexing dept.

cybrthng asks: "Mozdex.com is my attempt at building a search engine capable of indexing the entire web. Our goal is to provide a completely transparent system utilizing open technologies such as Nutch, Lucene and other systems to provide a search facility that is more scientific and 'protocol' vs the current proprietary and almost 'faith based' search engine results and methods of getting listed. What do you look for out of a search engine? What would you look for out of this project? Should large commercial entities be the only way we find information and resources on the net? BTW, our beta index currently has about 50 million pages and we hope it shows what can be done using Open Source systems available today. We are seeking input on starting a developer & input community as well as getting concepts and ideas out and about, so we value your ideas and what you hope to see out of this project."

42 comments

how is this different? by shaitand · 2004-05-12 14:47 · Score: 2, Insightful

Support our index, sponsor mozAds keyword Advertising as low as 1/cent click

Is it different only because it runs on open source software? Hell, Google does that successfully already.
1. Re:how is this different? by k4_pacific · 2004-05-12 15:04 · Score: 4, Interesting
  
  Yes, Google already runs on OSS even though the search software itself is proprietary. If you wanted to truly put the search engine in the hands of the people, consider this idea. You could use P2P technology to distribute the search index across millions of systems worldwide. If someone wants to use the search engine, they must download the client software and donate, say, 100 MB to the project. Of course, you would have to have the system set up so that it has massive redundancy to handle cases where individual nodes are offline. Also, the logistics of distributing the search across so many systems would need to be worked out. Furthermore, there is the possibility that users may attempt to tweak the client handling their node to increase the score for various pages or decrease the score for others. These issues would have to be worked out, but it could be feasible. Frankly, I'm too lazy to implement it, but you are welcome to credit me for the idea when it's all done.
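A rough sketch of the redundancy idea above, for anyone less lazy. Everything here is an assumption for illustration: the shard count, the three copies per shard, the node names, and the use of a median to damp a tampered score are not from any real P2P search client.

```python
import hashlib
import statistics

def shard_for_term(term, num_shards):
    """Map a search term to an index shard with a stable hash."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def replicas_for_shard(shard, nodes, copies):
    """Pick `copies` distinct nodes for a shard by walking the node list."""
    copies = min(copies, len(nodes))
    start = shard % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]

def lookup(term, nodes, online, num_shards=64, copies=3):
    """Return the first online replica holding the term's shard, or None."""
    shard = shard_for_term(term, num_shards)
    for node in replicas_for_shard(shard, nodes, copies):
        if node in online:
            return node
    return None

def agreed_score(reported_scores):
    """Take the median of scores reported by replicas, so a single
    tweaked client cannot push a page up or down on its own."""
    return statistics.median(reported_scores)
```

With three copies, a query still succeeds while any one replica of the shard is online, and one dishonest node out of three is outvoted by the median. Collusion between nodes is not handled.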
  
  --
  Unknown host pong.
2. Re:how is this different? by Anonymous Coward · 2004-05-12 15:11 · Score: 0
  
  A better question is, "how is this different from this article from earlier today?" (http://developers.slashdot.org/article.pl?sid=04/05/12/149243&mode=nested&tid=126&tid=156&tid=95)
3. Re:how is this different? by Tune · 2004-05-12 22:56 · Score: 2, Informative
  
  Although the website mentions "open source" a lot, it only supplies a link to a SourceForge page which does not seem to supply anything downloadable.
  
  Although Mozdex appears to be of good will, notice that the GPL does not force them to distribute changes to GPLed code as long as they're the only ones using the code. The GPL would only be effective if they tried to distribute changed binaries, but they do not distribute anything other than HTML web content. This could become a major headache with the GPL.
4. Re:how is this different? by Anonymous Coward · 2004-05-13 04:02 · Score: 1, Funny
  
  Step 1: Create P2P search engine technology.
  Step 2:
  Also, the logistics of distributing the search across so many systems would need to be worked out.
  Step 3:
  Furthermore, there is the possibility that users may attempt to tweak the client handling their node to increase the score for various pages or decrease the score for others. These issues would have to be worked out, but it could be feasible. Frankly, I'm too lazy to implement it, but you are welcome to credit me for the idea when it's all done.
  Step 4: Profit!!
5. Re:how is this different? by Anonymous Coward · 2004-05-13 17:45 · Score: 0
  
  This guy doesn't stop spamming. Today he was asked not to post spam on the Nutch developer list.
  
  This is just Nutch, taken from http://www.nutch.org/release/nightly/ and he claims to be the creator or something.
  
  Please report abuse to Nutch list at nutch-developers@lists.sourceforge.net
As a webmaster by Anonymous Coward · 2004-05-12 14:50 · Score: 2, Insightful

The thing I look for is a polite bot. Does it follow robots.txt fully? Does it hammer the server? Does it respect page modification headers?
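For what those three checks look like from the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `examplebot`, and the example.com URLs are made up for illustration; this is not how Nutch or Mozdex does it.

```python
from email.utils import formatdate
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a webmaster might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def conditional_headers(last_fetch_epoch):
    """Headers that let the server answer 304 Not Modified
    instead of resending a page that has not changed."""
    return {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}

parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("examplebot", "http://example.com/index.html")
blocked = parser.can_fetch("examplebot", "http://example.com/private/a.html")
delay = parser.crawl_delay("examplebot")  # seconds to wait between requests
```

A polite bot would skip any URL where `can_fetch` is false, sleep for `delay` seconds between requests to the same host, and send the `If-Modified-Since` header on every revisit.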
1. Re:As a webmaster by idiotfromia · 2004-05-12 14:57 · Score: 2, Informative
  
  Yes, Nutch is designed to be a good bot and follow the normal rules, but just like any open source project, it could potentially be used badly by someone.
  
  More information can be found on the Nutch Webmaster Information Page.
Open source search engine? by Chester+K · 2004-05-12 15:00 · Score: 4, Funny

An open source search engine is a great idea! I'll know exactly how to exploit the ranking algorithms to position my pages as #1!

--

NO CARRIER
1. Re:Open source search engine? by dtfinch · 2004-05-12 16:58 · Score: 1
  
  Yeah, but so will everyone else. And it's a zero sum game.
2. Re:Open source search engine? by addaon · 2004-05-12 17:51 · Score: 3, Insightful
  
  No, it's a game where those who focus on the value of their content lose to those who focus on the marketing of their content.
  
  --
  
  I've had this sig for three days.
3. Re:Open source search engine? by kunudo · 2004-05-13 07:35 · Score: 1
  
  But then, assuming there is such a thing as a perfect algorithm, it would be achieved some day, by looking at what exploits people were using, getting patches submitted etc. Then they could go beyond beta. That's my guess, anyway...
-1 by Anonymous Coward · 2004-05-12 15:00 · Score: 0

buy an ad ;)
Subject/Topic based filters by jfdawes · 2004-05-12 15:04 · Score: 1

How about you work out some way to do this old saw:

Searching for "Jaguar" the fighter bomber as opposed to "Jaguar" the comic book character.

But then, I'd want you to get into natural language processing to determine what the real "topic" was that I meant. Of course I'm assuming a free form field. I'd like to just be able to put in "Jaguar the bomber" or "aeronautical: jaguar" or "plane jaguar" or even "plain jaguar" and have it do a Googlesque "Did you mean 'plane Jaguar'".

Hmm, and a fun API so you could build a search....
query = new Query();
query.setTerm("jaguar");
query.setTopicHint("plain");
or
query.setCanonicalTopic("aeronautical");
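A minimal Python sketch of how such a query object might turn "plain" into "plane". The `Query` class, its method names, and the `KNOWN_TOPICS` list are all hypothetical; the fuzzy match uses the standard `difflib` module, which is only string similarity and nowhere near real natural language processing.

```python
import difflib

# Hypothetical vocabulary of topics the engine knows about.
KNOWN_TOPICS = ["plane", "aeronautical", "comic", "car", "animal"]

class Query:
    def __init__(self):
        self.term = None
        self.topic = None
        self.suggestion = None  # set when the hint had to be corrected

    def set_term(self, term):
        self.term = term
        return self

    def set_topic_hint(self, hint):
        """Accept a free-form hint and map near-misses to a known topic."""
        if hint in KNOWN_TOPICS:
            self.topic = hint
        else:
            close = difflib.get_close_matches(hint, KNOWN_TOPICS, n=1)
            self.topic = close[0] if close else None
            self.suggestion = self.topic
        return self

query = Query().set_term("jaguar").set_topic_hint("plain")
if query.suggestion:
    print("Did you mean '%s %s'?" % (query.suggestion, query.term))
```

The hard part the comment asks for, working out the topic from the pages themselves, is untouched here; this only covers the "did you mean" step on the hint.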
1. Re:Subject/Topic based filters by jfdawes · 2004-05-12 15:13 · Score: 1
  
  Oh, and the search engine needs to have some understanding of the pages it's looking at so it can distinguish between pages that are about jaguar planes (or the comic book character) as opposed to pages that just mention them but might actually be about a related topic.
2. Re:Subject/Topic based filters by jc42 · 2004-05-12 15:20 · Score: 3, Interesting
  
  Yes, and a related topic is indexing files that are in some specialized format.
  
  I run a search site that only indexes a few hundred other sites and around 170,000 files (today). What the files contain doesn't matter here. What's significant is that the data, while being (usually) plain ascii text, is not in any human language. If you saw it and didn't know the subject area, you wouldn't be able to make sense of it. It's very useful to a few thousand users, and of no interest whatsoever to anyone else.
  
  One thing that could be feasible with an open-source search project is to discuss ways in which specialized search engine like mine can be incorporated. The data that I index can be related to several other kinds of online data that are in turn indexed by others. But my code doesn't make the connection, and neither do the search engines for the related types of data.
  
  This strikes me as a significant problem that the big guys can't much work on (yet). And, like "orphan" drugs, they probably won't ever find it worthwhile to work on most kinds of data that only exist in a few thousand files.
  
  But if we could define a way to interface search engines so that they can recognize each other and refer queries to each other, then these specialized data formats could be usefully searched and indexed.
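One way to picture that interface is a small registry that forwards a query to every specialized engine that has signed up. This is only a sketch: the `FederatedSearch` class, the format names, and the toy engines are hypothetical, and a real version would need a network protocol rather than in-process callables.

```python
class FederatedSearch:
    """Routes a query to registered specialized engines."""

    def __init__(self):
        self._engines = {}  # data format name -> search callable

    def register(self, data_format, search_fn):
        self._engines[data_format] = search_fn

    def search(self, query, formats=None):
        """Ask the named formats (or all of them) and tag each hit
        with the format of the engine that returned it."""
        targets = formats or list(self._engines)
        results = []
        for fmt in targets:
            engine = self._engines.get(fmt)
            if engine is None:
                continue  # nobody indexes this format; skip it
            for hit in engine(query):
                results.append((fmt, hit))
        return results

# Two toy engines standing in for independent specialized indexes.
hub = FederatedSearch()
hub.register("notation", lambda q: ["file-a", "file-b"] if q == "reel" else [])
hub.register("lyrics", lambda q: ["song-1"] if q == "reel" else [])
```

Tagging each hit with its source format is what would let a general engine show specialized results next to ordinary web pages.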
  
  Sounds worthwhile to me. I wonder if I could find someone to pay me a salary while I worked on it?
  
  --
  Those who do study history are doomed to stand helplessly by while everyone else repeats it.
3. Re:Subject/Topic based filters by trg83 · 2004-05-12 17:38 · Score: 1
  
  >...is not in any human language...
  So, it's like legal documents and stuff?
4. Re:Subject/Topic based filters by pb · 2004-05-12 18:04 · Score: 2, Insightful
  
  You could do that by (a) putting in more keywords; (b) letting the search engine suggest topics/extra search keywords for a given search; some search engines try to do this already. As to how, latent semantic indexing looks good (it's a matrix technique used to find relationships between bits of data, such as the ones you discuss)
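For option (b), a much simpler stand-in for latent semantic indexing is plain co-occurrence counting: suggest the words that most often share a document with the query term. To be clear, this is not LSI (there is no matrix factorization here), and the function name and sample documents are made up.

```python
from collections import Counter

def suggest_terms(query_term, documents, limit=3):
    """Suggest extra keywords: the words that most often appear
    in the same document as the query term."""
    counts = Counter()
    for text in documents:
        words = set(text.lower().split())
        if query_term in words:
            counts.update(words - {query_term})
    return [word for word, _ in counts.most_common(limit)]

docs = [
    "jaguar fighter bomber aircraft",
    "jaguar aircraft squadron",
    "jaguar comic character",
    "aircraft carrier",
]
top = suggest_terms("jaguar", docs, limit=1)
```

LSI goes further by finding terms that are related even when they never appear together, which counting like this cannot do.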
  
  --
  pb Reply or e-mail; don't vaguely moderate.
5. Re:Subject/Topic based filters by Anonymous Coward · 2004-05-13 02:53 · Score: 0
  
  plain ascii text, is not in any human language. If you saw it and didn't know the subject area, you wouldn't be able to make sense of it. It's very useful to a few thousand users, and of no interest whatsoever to anyone else.
  So is it ASCII pr0n?
6. Re:Subject/Topic based filters by bkocik · 2004-05-13 07:15 · Score: 1
  
  Our search engine does something like this. Go to http://search.aol.com and search for "eagles", for example.
  
  --
  -BK
  Chemical Blog
7. Re:Subject/Topic based filters by jc42 · 2004-05-15 10:07 · Score: 1
  
  Heh, no. But that could be used as a similar example. A search engine that is good at legal searches should be a lot pickier about how the language is used. I'd imagine that lawyers would really like something that is a lot more targeted and precise than Google.
  
  In general, I'd think that to solve this problem, you wouldn't want to look too closely at specific examples, other than to become convinced that each specialty really is going to need its own parser and syntax analyzer.
  
  A good traditional example is the phrase "dead beef", which is of course an encoding of four bytes in hexadecimal. In the data that my search bot analyzes, this would in fact be a legal and meaningful phrase, but it wouldn't be hexadecimal, and it wouldn't have anything to do with cattle (dead or alive). The encoding that I'm indexing in fact uses the letters A-G, and case is significant. Grouping of letters into word-like strings is meaningful, but has nothing to do with "words" in any spoken language.
  
  There's another well-known alphabetic encoding that is becoming very significant that uses only the letters ACGT (and sometimes U). There are people working on software that indexes such text. In this case, strings like CAT and TAG are very meaningful (though I couldn't tell you off the top of my head which amino acids they encode).
  
  It would be interesting if a search-engine project could find a way to incorporate text in such "languages" so that they could be combined in meaningful ways with searches of text in human languages. Or in computer languages, for that matter.
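As a small illustration of why each of these "languages" needs its own tokenizer, here is a sketch that indexes ACGT text by overlapping three-letter windows instead of whitespace-separated words. The function names and sample sequences are invented, and real sequence-search software is far more sophisticated.

```python
def kmers(sequence, k=3):
    """Overlapping k-letter windows: the 'words' of a text with no spaces."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_index(records, k=3):
    """Inverted index from each k-letter window to the records containing it."""
    index = {}
    for name, sequence in records.items():
        for token in kmers(sequence, k):
            index.setdefault(token, set()).add(name)
    return index

index = build_index({"seq1": "CATAG", "seq2": "GGTAG"})
```

Here `index["TAG"]` finds both records and `index["CAT"]` only the first, while a human-language tokenizer would have seen one unsearchable "word" per record. The case-sensitive A-G encoding would need yet another tokenizer, since this one folds case.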
  
  --
  Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Ummm.... by Anonymous Coward · 2004-05-12 15:15 · Score: 0

BTW, our beta index currently has about 50 million pages and we hope it shows what can be done using Open Source systems available today.

Uh, crawling & getting 50 million pages isn't the hard part. Searching them is.

Good luck. You'll need it.
Re:I look for... by Anonymous Coward · 2004-05-12 15:21 · Score: 1, Informative

Actually, there are a lot of open source porn-search projects. For instance, gnaughty, and the Porn Toolkit.
There may be parallels in encryption... by stienman · 2004-05-12 15:37 · Score: 1

It's something to really consider, because I can see how an open algorithm would be beneficial, but it's very easy to see how it can be spammed into uselessness.

I think of the Dow and other financial indices and believe that the proprietary model may be the only successful way to provide useful, reliable information.

Then I look at encryption, and I see how the algorithms, being public, can be vetted without compromising the security of the communication through a proprietary, secret key.

I suspect that a successful search engine, at this juncture in time, cannot have an open 'key' due to processing power limitations. Given infinite processing power and a reasonable language-understanding algorithm it could be possible to make the spammers work so hard at getting into the listings that they would actually have to provide useful information about a given subject just to show up in the listings.

However, a search engine cannot make enough money off of ads to even approach that power. We first have to create better natural language algorithms, because the web will grow as fast or faster than processing power.

At this time a good compromise would be to follow the encryption route. You use publicly vetted methods of rating pages and indexing pages, but you keep the key, the actual calculation that relates those ratings to a particular search, a secret and modify it to keep up with spammers who will brute force the key into the public.
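A toy sketch of that compromise: the rating signals are public code anyone can vet, while the weights that combine them play the role of the secret key. The signal names, the page fields, and the weight values below are all assumptions for illustration.

```python
# Public, vettable rating signals. Each maps (page, query) to a number in [0, 1].
PUBLIC_SIGNALS = {
    "term_match": lambda page, q: float(q in page["text"].lower().split()),
    "inbound_links": lambda page, q: min(page["inbound_links"], 100) / 100.0,
    "title_match": lambda page, q: float(q in page["title"].lower()),
}

def score(page, query, secret_weights):
    """Combine the public signals with weights only the operator knows."""
    return sum(
        secret_weights[name] * signal(page, query)
        for name, signal in PUBLIC_SIGNALS.items()
    )

# The "key": kept private and rotated when spammers start guessing it.
weights = {"term_match": 2.0, "inbound_links": 1.0, "title_match": 0.5}
page = {"text": "Jaguar aircraft history", "title": "SEPECAT Jaguar", "inbound_links": 50}
result = score(page, "jaguar", weights)
```

Changing the key means swapping the `weights` dictionary only; the signal code that the public has reviewed stays the same.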

Again, I disagree with the theory that it is possible to design
1) A fully public algorithm for page ranking and relating them to search queries
2) Keep spammers and keyword hijackers from ruining the index
3) Without infinite computer power and/or a very advanced natural language indexing system.

Don't let this dissuade you from trying. Search is not a solved problem, and there are many opportunities for small (and large) successes before we even come close to a complete solution. You may well find another important piece of the puzzle.

-Adam
results of course by rueger · 2004-05-12 15:38 · Score: 3, Insightful

What would you look for out of this project?

The only thing that matters is results. Is the answer that I need in the first three or four results? If you can do that, you win. If you can't, don't bother.

I'm skeptical about how realistic it is to develop an open source search engine. Wikipedia, although cool, has large gaps in content, and only a few months ago was begging for donations to survive. I'm betting that a Google sized operation would be even more resource intensive.

--
Three Squirrels
1. Re:results of course by MikeCapone · 2004-05-12 17:36 · Score: 2, Insightful
  
  I'm skeptical about how realistic it is to develop an open source search engine. Wikipedia, although cool, has large gaps in content, and only a few months ago was begging for donations to survive.
  
  Well, Wikipedia did get almost $30K in donations that time and is still getting lots of donations from what I gather, and could easily get a lot more whenever it wanted because lots of people LOVE that project, so that part is successful.
  
  As for the large gaps in content, it is being worked on every day. That's the nature of a work in progress, and I'm sure that compared to Wikipedia many other "real" encyclopedias have "large gaps in content", shorter articles because of space/price restrictions and, especially, lots of terribly outdated content. And the good part is, if you see something bad, you can actually fix it!
  
  Open source/stuff often looks unfinished compared to proprietary stuff because, well, it is! If you could've seen the "betas" of proprietary encyclopedias and software you'd probably see most of the same failings.
  
  --
  Treehugger? Treehugger... Treehugger!
What I like to see the focus by zipoh · 2004-05-12 15:55 · Score: 1

of such a search engine would be something akin to the philosophy of openness that is common to GNU etc, and free as in beer of course. It's OK to have open rankings, if the point of this is to index the more non-commercial side of stuff. And don't bother to cover the same base as google. What's the point?
More explaining on "explain" would be useful by xmas2003 · 2004-05-12 16:31 · Score: 3, Interesting

First, I've futzed around with MozDex for a little while, so congrats on having Slashdot "find" you and getting the word out.
What I have found REALLY interesting about MozDex is the "explain" button which I assume provides some insights into why MozDex decided to rank that web URL as whatever ... but the information as currently presented isn't understandable and/or explained.
For instance, I was interested where a Google Compute web page came up and was actually quite surprised that a MozDex Search shows it as #1. So I click on the explain button and I get a page with a buncha numbers ... but nowhere on this page (or anywhere on the MozDex site) can I find an explanation for what the heck they mean.
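For readers wondering what kind of numbers such a page might hold, here is a generic TF-IDF breakdown with each factor labeled. This is a guess at the general family of formula, not MozDex's actual calculation: the square-root term frequency, the logarithmic inverse document frequency, and the sample counts are all assumptions.

```python
import math

def explain(term, doc_words, doc_freq, num_docs):
    """Score one term against one document and keep every factor,
    so the result can be shown with labels instead of bare numbers."""
    term_count = doc_words.count(term)
    tf = math.sqrt(term_count)                        # more occurrences, diminishing returns
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))   # rarer terms count for more
    return {
        "term": term,
        "term_count": term_count,
        "tf": tf,
        "idf": idf,
        "score": tf * idf,
    }

words = ["nutch", "search", "nutch", "index", "nutch", "nutch"]
breakdown = explain("nutch", words, doc_freq=9, num_docs=10)
```

An explain page that printed something like `breakdown` with a sentence per field would answer most of the question above.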
Since your claim-to-fame is open source/search, I think adding information on the internal algorithms would help you out. Keep up the good work - interesting stuff! ;-)
alek
P.S. Minor typo in the Corporate Info link from your FAQ

--
Hulk SMASH Celiac Disease
Rejection penalty by cgenman · 2004-05-12 17:09 · Score: 1

Well, it could easily assess a penalty to any website which was ranked above that which the user clicked upon in any given search, thus ensuring that "xxX Hot Teens Slashdot Xxx" is punted down to the bottom of the Slashdot searching pile. Of course, you would need an uncrackable way of summarizing the pages...
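A minimal sketch of that penalty, assuming the engine logs which result the user clicked. The function names, the fixed penalty step, and the example scores are hypothetical, and nothing here stops a spammer from faking clicks.

```python
def apply_click(penalties, ranked_urls, clicked_url, step=1.0):
    """Penalize every result that was ranked above the one the user chose."""
    if clicked_url not in ranked_urls:
        return penalties
    for url in ranked_urls[:ranked_urls.index(clicked_url)]:
        penalties[url] = penalties.get(url, 0.0) + step
    return penalties

def rerank(scores, penalties):
    """Order URLs by base score minus accumulated penalty, best first."""
    return sorted(
        scores,
        key=lambda url: scores[url] - penalties.get(url, 0.0),
        reverse=True,
    )

scores = {"spam.example": 5.0, "slashdot.org": 4.5, "other.example": 1.0}
before = rerank(scores, {})
penalties = apply_click({}, before, "slashdot.org")
after = rerank(scores, penalties)
```

After one click on the second result, the skipped first result drops below it. Results below the click are left alone, since the user may never have looked at them.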

--
The ______ Agenda
what i look for in a search engine.. by pukerz · 2004-05-12 17:14 · Score: 1

- less importance given to commercial sites and blogs, more importance given to general information
- results not easily manipulated (eg. http://your-search-keywords.com/your-keywords/keywords.html and their ilk)
- fast discovery of new or updated sites
- features such as caching, view as X, spam reporting

--
the dead shall rise, from their graves, to destroy, geometry.
Mozdex using Nutch sponsored by Overture which is by christophe.vg · 2004-05-12 18:26 · Score: 5, Informative

While browsing the Mozdex site, I learned they are using Nutch, an open source search engine. So I started browsing the Nutch site. On their site I found out that they are sponsored by Overture Research ... The name seemed familiar. Clicking on the link I arrived at http://labs.yahoo.com.

Apparently Yahoo is rather interested in this project. Browsing the Yahoo Labs site I found this page (which is also the third hit when googling for nutch): "Welcome to the Yahoo! Research Labs implementation of the Nutch open source search engine (www.nutch.org). This search engine is intended as a demonstration platform for a number of search related technologies that we are working on and is specifically not intended to provide a full and comprehensive search experience for the average user. If you do a search here, please do not be surprised or offended if your favorite site is not in the result set for your query.
With this in mind, please feel free to test drive the technology. Happy Nutch-ing."

A very quick test shows that Mozdex's 50-million-page index is indeed still far too small to really find something. The ranking system will also need some tweaking, but this is also clearly stated on the Nutch site: "Nutch has not yet been tuned for quality. There are ten or twenty knobs that we can twiddle to adjust the ranking formula. We are developing software to do this tuning automatically, but the current code just contains guesses. With a little tuning we should be able to get results that are competitive with those of major search engines."

Although it is currently not possible to do any real comparison due to the big difference in the number of indexed pages, it sure is nice to see both the Nutch project and the Mozdex project. I hope that both of these projects will receive enough funding (and hardware) to continue, and maybe we'll see another /. post when they hit the 5 billion page count and we will be able to do a massive comparison ... and all change from googling to nutching or mozdexing!

One to watch
Filter commercial sites by iangoldby · 2004-05-12 20:23 · Score: 3, Insightful

I'd love to be able to filter out all sites that are trying to sell something.

Searching on Google for things like reviews of mp3 players has become a nightmare these days. Any useful sites are drowned out in a noise of pricerunner/dealtime/kelkoo/shopping.yahoo/etc and other sites that are simply affiliate sites for Amazon etc.
httrack and grep... by Anonymous Coward · 2004-05-12 23:42 · Score: 1, Funny

...and a *HUGE* hard drive.

Download the internet with httrack and search it with grep.
In over your heads? by Alomex · 2004-05-13 00:11 · Score: 2, Insightful

An OSS search engine that actually indexes the entire web and is used by many people is at least a couple of orders of magnitude harder than the Mozilla project.

Writing the search code itself is not too hard (you still need a PhD in data structures and algorithms, but those can be found), the real hard part is the amount of bandwidth and CPU power that is required.
A different name by doc+modulo · 2004-05-13 00:13 · Score: 3, Interesting

You need a name that is as easy to pronounce as google. As friendly sounding would be good as well.

You're "competing" with Google in a number of different areas, including the name of course.

The first thing that came to my mind when I read the name was: "Typical for geeks who are good at the technical side of things, but are bad at marketing and the human interface/psychology side".

--
- -- Truth addict for life.
Results Baby by Derkec · 2004-05-13 00:16 · Score: 1

Right now, there might be 50 million pages indexed, but right now it looks like I've got to go through 1 million of those to get to what I searched for.

My two tests were 1: "Hattrick" which is an online soccer management game, and is great. Google it and up it comes with some handy links to some sites about it. Using this engine, I got a bunch of crap. It may have been pages that linked to hattrick, but I didn't check.

2: "Buyer Agent Boulder Colorado" - Exclusive buyer agents are the preferred way of buying real estate. I know some in Boulder and wanted to see how they'd do. While the site I was rooting for didn't show up on the front page of either engine, a major EBA office did pop up as the #1 result on Google, no such luck on this other one.

Hell, with their bias towards open source, I would have expected decent results for a search on Linux, but no luck. This is a nice idea and all, but the interesting part of the search engine is the ranking algorithms. That's what they pay the PhDs to come up with in the proprietary world. That money seems to be well spent. Looks like you need to reach out to academia and beg for some algorithms.
Re:Filter commercial sites -MOD UP IF YOU WILL by Anonymous Coward · 2004-05-13 10:05 · Score: 1, Insightful

Try this:

MP3 player review -buy -dealtime -pricegrabber -kelkoo -shopping.com -amazon.com -nextag.com -bizrate.com -moreover.com -celeb -porn -free -coupon -pimprig.com

I have it set to a hotkey for all the -'s and boom relatively valid search results.

A gift from a marketroid to you techies.
Easy! by JamesP · 2004-05-13 11:56 · Score: 2, Funny

cat database | grep query

Completely Open Source!

--
how long until /. fixes commenting on Chrome?
It's broken by DerekLyons · 2004-05-13 11:57 · Score: 1

What do you look for out of a search engine?
One whose algorithms work. (In this instance, your first result is the same for "henry l stimson" and for "uss henry l stimson".)

Back to the drawing board.
google by Anonymous Coward · 2004-05-13 14:32 · Score: 0

When you search for google, the first page does not even link to www.google.com. I cannot see how a search engine that does not even mention the obvious results would do well.
Re:Filter commercial sites -MOD UP IF YOU WILL by tdvaughan · 2004-05-14 01:45 · Score: 1

Pretty cool, but much of it is redundant since you lose all but the first ten search terms.