Understanding Search Engines?
An anonymous reader asks: "I guess by now we can be fairly certain that search engines are here to stay, and hence I'm trying to understand how the technology works. I'm not so much looking for a particular 'best' technology or implementation, but rather an overview of the different approaches and their trade-offs. Something that would teach me: which approach works in a distributed vs a centralized infrastructure; how different algorithms will perform on complete search words vs arbitrary sub-strings; or how mass storage (hard disk vs. solid state) affects implementation choices. For most mature technologies there is a host of 'overview' books and papers for my questions -- but I couldn't find anything on search engines. Where should I look? Are there any good books or papers?"
Same basic concepts apply today ... although they probably didn't anticipate the rise of Black Hat SEO which attempts to "beat" the algorithms.
One of the founders of Google still has links to various publications (in PostScript format) about search engines, if that helps.
Managing Gigabytes is a spectacular book on most of the underlying technologies. Although I've only read the first edition, I don't recall it talking about spidering/webcrawling. Instead it starts with building a simple index, and builds through all the refinements (i.e. stemming, etc.) until you've built a serious workhorse for mining text documents. It's definitely at the core of what a search engine does.
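To give a feel for what "a simple index plus stemming" means, here is a rough sketch in Python. It is not taken from the book: the function names, the toy suffix-stripping rule, and the sample documents are all my own assumptions for illustration, and a real engine would use a proper stemmer (Porter, for example) and compressed on-disk postings rather than in-memory sets.

```python
import re
from collections import defaultdict


def crude_stem(word):
    """Toy suffix stripper (an assumption for illustration, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(docs):
    """Build an inverted index: stemmed term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every (stemmed) query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "Search engines build indexes",
    2: "Indexing text documents",
    3: "Solid state disks",
}
index = build_index(docs)
print(search(index, "index"))
print(search(index, "indexing documents"))
```

Because "indexes" and "indexing" both reduce to the same stem here, a query for either form finds both documents. Note that this only handles whole words; arbitrary sub-string search needs a different structure (n-gram or suffix based), which is one of the trade-offs the question asks about.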
So, aside from reading books on Information Retrieval and Data Mining, the other easily available references are open source search engines. In particular, look at the Nutch project, which is actually a pretty high quality search engine implementation. Even better: start contributing to the project.