How Hard Is It To Write Your Own Search Engine?

Posted by timothy on Wednesday May 12, 2004 @02:49AM from the can-barely-search-my-car dept.

kha0z writes "Anna Patterson, from Stanford University, gives an overview of the difficulties that have to be overcome when developing or implementing a search engine in this article in ACM Queue magazine. The article covers many issues, from data sources to indexing to ranking. How does Google make it look so easy?"

23 comments

It Isn't Easy by Goo.cc · 2004-05-12 02:50 · Score: 0, Insightful

Google just makes it look that way.
Well by Gr8Apes · 2004-05-12 02:53 · Score: 0

You start with a simple model, which they did years ago when the internet was a much much smaller and simpler thing, then work years and years to perfect your model, and voila, simple, no?

--
The cesspool just got a check and balance.
Search engine != entire web by pauljlucas · 2004-05-12 03:28 · Score: 4, Insightful

Not all search engines are designed or intended for indexing and searching the entire web, and not everybody needs such a search engine. Often, people want to search their stuff: their documents on their local disks, their e-mail, etc.
While writing a local search engine isn't trivial, it's a lot easier than writing a web search engine since all the scaling issues disappear -- I know: I wrote one.

--
If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
1. Re:Search engine != entire web by BigGerman · 2004-05-12 06:19 · Score: 1, Interesting
  
  Exactly. A buddy of mine developed and now sells a search appliance, EnterFind, that can search network Windows shares as well as web servers.
  It is not complicated at all.
How does Google make it look so easy? by stefanlasiewski · 2004-05-12 03:35 · Score: 3, Interesting

How does Google make it look so easy?

Google has hundreds of millions of dollars. Google treats their 2000 (2500+?) employees pretty well, so those employees work hard and smart and put in 40-60 hour work weeks. Google started earlier than the other modern engines, and had some very good ideas.

They focus on a small set of goals-- make it easy to search through a ton of information.

Compare this to Yahoo and MSN, where search is really just one part of their business model (there is no Google Singles! or Google Games).

--
"Can of worms? The can is open... the worms are everywhere."
1. Re:How does Google make it look so easy? by Gilk180 · 2004-05-12 04:11 · Score: 1
  
  They focus on a small set of goals-- make it easy to search through a ton of information.
  
  Originally this was the case, but recently even Google has started forays into other services. Just click the "more>>" link above the search field on their front page.
  
  Froogle (shopping), Groups (newsgroups), Blogger, and of course there is the ever-popular /. subject, Gmail.
  
  Unfortunately it seems they may be losing focus.
2. Re:How does Google make it look so easy? by stefanlasiewski · 2004-05-12 05:09 · Score: 3, Insightful
  
  Well, most of the additional services still focus on searching and information. Information is also one of their core goals (I should have said this earlier). There still is nothing like Yahoo Travel or MSNBC.
  
  Froogle is still a search product, but with a focus on shopping.
  
  Groups is mostly still a search product (You can post also, so it's also about creating information). The service has been around for years (I think it's their second big project after web search). If I have a technical question, I often find the answers in Google Groups. Blogger is new, but is similar to Groups in its goals.
  
  Gmail is also largely about search. With search they can place ads in your email.
  
  Actually, I guess you can really say that Google is about using a good search technology to place highly targeted ads with the information.
  
  --
  "Can of worms? The can is open... the worms are everywhere."
3. Re:How does Google make it look so easy? by armando_wall · 2004-05-14 13:47 · Score: 1
  
  Groups is mostly still a search product (You can post also, so it's also about creating information). The service has been around for years (I think it's their second big project after web search).
  Actually, Google Groups used to be DejaNews, and they bought the technology.
4. Re:How does Google make it look so easy? by merlin_jim · 2004-05-17 09:52 · Score: 1
  
  those employees work hard and smart and put in 40-60 hour work weeks
  
  This is an excellent opportunity to point out that 8 hours of those every week, mandatory, must be spent on a personal project not related to Google's line of business.
  
  That one perk is the absolute best.
  
  --
  I am disrespectful to dirt! Can you see that I am serious?!
Picture here. by mosel-saar-ruwer · 2004-05-12 03:41 · Score: 2, Offtopic

http://www-formal.stanford.edu/annap/www/annap-pic.gif
I mean, be for real - who gives a damn about the article?
1. Re:Picture here. by HyperCash · 2004-05-13 20:01 · Score: 1
  
  Anybody else notice the HARVARD binder in the picture?
  
  --HC
  
  --
  So I'm jump'n up and down screaming show me the money.
2. Re:Picture here. by Anonymous Coward · 2004-05-17 07:21 · Score: 0
  
  it's not like it's in big block letters very near the center of the photo where your eye is naturally drawn. thanks for pointing it out, I almost missed it. haha
Have you Read ast's Computer Networks Book? by Phoe6 · 2004-05-12 03:50 · Score: 2, Interesting

Andrew S. Tanenbaum's book "Computer Networks" deals with this topic in an elementary and very good way. Anyone who wonders how a search engine works should read that book. In the chapter dealing with the Application Layer, ast describes the basic data structures which constitute a web search engine. You could better look at the presentation here.

--
Senthil
Ask Tim Bray by grungeKid · 2004-05-12 04:25 · Score: 3, Informative

For an overview of the field that is full-text search (of which web search engines are an important, but not the only, part), you should read Tim Bray's essays on search. He's been working on full-text search for a long time, knows his stuff and explains it in a very readable manner.
Thanks for the link by christopherfinke · 2004-05-12 05:25 · Score: 2, Funny

Thanks for the link to Google, submitter. I never would have found it otherwise!
1. Re:Thanks for the link by Anonymous Coward · 2004-05-12 11:35 · Score: 2, Funny
  
  Here's the FreeCache version in case of slashdotting.
not too hard by TyrelHaveman · 2004-05-12 12:08 · Score: 1

I wrote a search engine once which was designed to index the entire web and provide some other basic Google-like features. It only took me a couple days (probably 20 total hours) to write the indexer and search engine. It searched quickly and used very little disk space, but it was really slow to index due to CPU usage problems. If I had millions of dollars like Google, I could have bought a large number of machines to do the indexing, but I only had 3 machines. So after 2 months or so of attempting to index the web, I gave up... having predicted it would take almost 10,000 years at the current rate for me to index the entire current web. Oh well, it was fun :)
The crawl is hard, too by blamanj · 2004-05-12 12:56 · Score: 4, Insightful

...harder than she implies.

You have to deal with 404s, robots.txt, politeness (don't bring down someone's site by crawling too fast), redirects, content you can't handle (Flash, Javascript).

The list goes on.
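
A minimal sketch of three of the items above (robots.txt, politeness, and responses you can't handle), using only the Python standard library. The names `load_robots`, `PolitenessGate`, and `should_index` are hypothetical, the 2-second delay is an arbitrary assumption, and actual fetching and redirect handling are left out so the example runs offline.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that the crawler has already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PolitenessGate:
    """Enforce a minimum delay between two requests to the same host."""

    def __init__(self, min_delay: float = 2.0, clock=time.monotonic):
        self.min_delay = min_delay      # assumed value, tune per site
        self.clock = clock              # injectable so it can be tested
        self._next_allowed = {}         # host -> earliest time of next request

    def wait_time(self, url: str) -> float:
        """Seconds the caller should still wait before fetching url."""
        host = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url: str) -> None:
        """Note that url's host was just hit."""
        host = urlsplit(url).netloc
        self._next_allowed[host] = self.clock() + self.min_delay


def should_index(status: int, content_type: str) -> bool:
    """Keep only successful responses whose content can be parsed as text."""
    media_type = content_type.split(";")[0].strip().lower()
    return status == 200 and media_type in ("text/html", "text/plain")
```

A crawl loop would call `robots.can_fetch(agent, url)` first, sleep for `gate.wait_time(url)`, fetch, call `gate.record_fetch(url)`, and drop anything `should_index` rejects (404s, Flash, and so on).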
1. Re:The crawl is hard, too by merlin_jim · 2004-05-17 09:57 · Score: 1
  
  politeness (don't bring down someone's site by crawling too fast), redirects, content you can't handle (Flash, Javascript).
  
  I wrote my own crawler once. A just-for-fun-how-do-the-spammers-do-it kind of thing...
  
  I got around redirects and content by going straight via the network socket and looking at the response as pure text. Anything that fit the regex pattern for an email address got harvested.
  
  As far as politeness... I kept a circular growable queue (technically a linked list) of sites to visit. Each queue entry is current site - links deep - sites deep (I would ordinarily start with a google search and do 3-4 sites deep, 5-6 links deep)... if the crawl engine saw a link with the same server it had just visited, it would randomly change its link index. This kept it from hammering on a particular server too hard...
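
A rough sketch of that kind of queue, under stated assumptions: the `Frontier` class and its method names are hypothetical, "links deep" is read as hops from the seed, "sites deep" as host changes along the way, and the random reshuffle is swapped for a simple deterministic rotation that moves same-host entries to the back.

```python
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Queue of (url, links_deep, sites_deep) entries with depth limits."""

    def __init__(self, max_links_deep=5, max_sites_deep=3):
        self.queue = deque()
        self.seen = set()               # URLs already queued at some point
        self.max_links_deep = max_links_deep
        self.max_sites_deep = max_sites_deep
        self.last_host = None           # host returned by the previous pop()

    def add(self, url, links_deep=0, sites_deep=0):
        """Queue url unless it was seen before or exceeds a depth limit."""
        if url in self.seen:
            return False
        if links_deep > self.max_links_deep or sites_deep > self.max_sites_deep:
            return False
        self.seen.add(url)
        self.queue.append((url, links_deep, sites_deep))
        return True

    def add_link(self, parent, url):
        """Queue a link found on parent, an entry returned by pop()."""
        parent_url, links_deep, sites_deep = parent
        same_site = urlsplit(url).netloc == urlsplit(parent_url).netloc
        return self.add(url, links_deep + 1, sites_deep + (0 if same_site else 1))

    def pop(self):
        """Return the next entry, avoiding the host just visited if possible."""
        for _ in range(len(self.queue)):
            entry = self.queue.popleft()
            host = urlsplit(entry[0]).netloc
            if host != self.last_host:
                self.last_host = host
                return entry
            self.queue.append(entry)    # same host: rotate to the back
        if not self.queue:
            return None
        entry = self.queue.popleft()    # only the last host is left
        self.last_host = urlsplit(entry[0]).netloc
        return entry
```

Rotation only spreads requests out when several hosts are queued; once a single host remains, a time-based delay is still needed to avoid hammering it.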
  
  --
  I am disrespectful to dirt! Can you see that I am serious?!
Nutch by idiotfromia · 2004-05-12 14:05 · Score: 1

There is one open source search engine that seems to be up-and-coming. Nutch is now powering Mozdex, and it looks fairly impressive so far.

Now, instead of the previous free-will donations, you can support the project through purchasing very cheap sponsored listings that appear to the right of the results (similar to Google).
Do you really mean a Finding engine? by Hocus+Locus · 2004-05-12 19:50 · Score: 0, Offtopic

Well hey... I used to write search engines all the time, usually perl one liners that spanned many lines, usually backgrounded, spewing an eclectic mix of status, progress result and debug to some poor inode in /tmp, 24-7. Every week or so, or when the swap or /tmp device fills, I 'killall perl' and see if I've accumulated any results...

Never found anything, just discovered where my mind had been wandering. Not too pretty.

Until one day a pair of algorithmic walker-seekers wandered into /tmp and started to parse each other's log files...! at first they just reacted to each other, like two dogs barking at each other while looking at a tree... I had spiced in a bit of adaptive restart queueing and pseudostack karmic bootstrap and some other odd bits. Megs of sequential ASCII pathology -- then suddenly they began to trade prime numbers. Weird. Then phonetic alphabet, clumps of strings ("language lessons?") and finally, a long correspondence of love letters between a "Romeo" and "Juliet". And other oddities. They'd take turns: one would roll itself into a tight greedy loop, slowing the other down and suddenly break out of it, the other would babble about some sort of orgasmic experience. Lots of this. Then they became aware of me puttering around somehow and were moments away from issuing a 'write' to make contact, but they argued about what to say... then I hit the end of the logs...abruptly...

Yep, that 'killall' had really done the trick.

When I came out of less, cron had decided to clear /tmp and immediately started up some mindless 'awk' thing to recover that precious deleted space.

I printed out the last bit of buffer on my screen and tearfully hung it on the wall.

I only code in COBOL now so nothing like that can ever happen again. I guess it means I don't have much in the way of useful advice for this thread, don't even know why I decided to post this.

Sorry.
PERFORM LOBOTOMY 3 TIMES.

--
Everything in this book may be wrong
--Messiah's Handbook // Richard Bach
Author @ Google? by sgauss · 2004-05-19 05:01 · Score: 1

Anna Patterson was quoted in the Newsweek article on Google as a Google employee.
4 parts to a search engine by DukeyToo · 2004-05-24 06:32 · Score: 1

I had a go at writing one, and can summarize as follows. There are 4 essential parts...

1) The crawler - goes out and retrieves pages
2) The parser - parses the pages, finding links and text.
3) The indexer - indexes the pages.
4) The searcher - interprets the index in the context of a user's search request.

None of these are especially hard to do simple versions of, but all of them are hard to do well.
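
To illustrate the "simple versions" point, here is a toy sketch of all four parts in Python. Everything in it is an assumption for illustration: the crawler is replaced by a hypothetical in-memory `PAGES` dict, the index is a plain inverted index held in memory, and the searcher does an unranked AND match.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# 1) Crawler stand-in: a made-up in-memory "web" instead of real fetching.
PAGES = {
    "http://example.com/a": '<html><body>Search engines index pages. '
                            '<a href="http://example.com/b">next</a></body></html>',
    "http://example.com/b": "<html><body>Ranking pages is the hard part.</body></html>",
}


class PageParser(HTMLParser):
    """2) Parser: collects link targets and text from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for name, v in attrs if name == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


def parse(html):
    """Return (links, text) for one HTML document."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """3) Indexer: map each term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, html in pages.items():
        _links, text = parse(html)
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """4) Searcher: URLs containing every query term, with no ranking."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)
```

With `index = build_index(PAGES)`, a call like `search(index, "ranking pages")` returns the matching URLs in alphabetical order. The hard parts mentioned above (scale, ranking, freshness) are exactly what this sketch leaves out.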

--
Most writers regard truth as their most valuable possession, and therefore are most economical in its use - Mark Twain