How Hard Is It To Write Your Own Search Engine?
kha0z writes "Anna Patterson, from Stanford University, gives an overview of the difficulties that have to be overcome when developing and/or implementing a search engine in this article in ACM Queue magazine. The article covers many issues, from data sources to indexing to ranking. How does Google make it look so easy?"
While writing a local search engine isn't trivial, it's a lot easier than writing a web search engine since all the scaling issues disappear -- I know: I wrote one.
If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
How does Google make it look so easy?
Google has hundreds of millions of dollars. Google treats its 2000 (2500+?) employees pretty well, so those employees work hard and smart and put in 40-60 hour work weeks. Google started earlier than the other modern engines, and had some very good ideas.
They focus on a small set of goals: make it easy to search through a ton of information.
Compare this to Yahoo and MSN, where search is really just one part of their business model (there is no Google Singles! or Google Games).
"Can of worms? The can is open... the worms are everywhere."
I mean, be for real - who gives a damn about the article?
Andrew S. Tanenbaum's book "Computer Networks" deals with this topic in a very elementary and very good way. Anyone who is curious about search engines should read that book. In the chapter dealing with the Application Layer, AST describes the basic data structures which constitute a web search engine. Better yet, look at the presentation here.
Senthil
For an overview of the field that is full-text search (of which web search engines are an important, but not the only, part), you should read Tim Bray's essays on search. He's been working on full-text search for a long time, knows his stuff and explains it in a very readable manner.
Thanks for the link to Google, submitter. I never would have found it otherwise!
...harder than she implies.
You have to deal with 404s, robots.txt, politeness (don't bring down someone's site by crawling too fast), redirects, content you can't handle (Flash, Javascript).
The list goes on.
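To make two of those items concrete, here is a rough Python sketch of just the robots.txt and politeness parts. It is only an illustration under my own assumptions, not how any real crawler does it: the class name PoliteFetchPolicy, its methods, and the one-second default delay are all made up for the example. It uses the standard library's urllib.robotparser and does no network I/O itself; you hand it the robots.txt text and the current time.

```python
import urllib.robotparser
from urllib.parse import urlsplit


class PoliteFetchPolicy:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, user_agent, default_delay=1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay  # assumed fallback, in seconds
        self.robots = {}    # host -> RobotFileParser
        self.next_ok = {}   # host -> earliest time the next fetch is allowed

    def load_robots(self, host, robots_txt):
        # Parse robots.txt text that the caller has already downloaded.
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        # With no robots.txt on record for the host, this sketch allows the fetch.
        parser = self.robots.get(urlsplit(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def wait_time(self, url, now):
        # Seconds the caller should still wait before hitting this host again.
        host = urlsplit(url).netloc
        return max(0.0, self.next_ok.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        # Honor Crawl-delay if the site declares one, else use the default.
        host = urlsplit(url).netloc
        parser = self.robots.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        if delay is None:
            delay = self.default_delay
        self.next_ok[host] = now + delay


policy = PoliteFetchPolicy("ExampleBot")
policy.load_robots(
    "example.com",
    "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n",
)
print(policy.allowed("http://example.com/index.html"))
print(policy.allowed("http://example.com/private/data.html"))
policy.record_fetch("http://example.com/index.html", now=100.0)
print(policy.wait_time("http://example.com/other.html", now=102.0))
```

Even this leaves out 404 handling, redirect loops, and content the parser can't read, which is rather the point.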