How to Build a Search Engine
CowboyRobot writes "Three years ago, former Infoseek developer Matt Wells decided to go solo and build his own search engine, Gigablast.
In this article, Infoseek founder Steve Kirsch interviews his former employee about the process and challenges of creating a modern, scalable search engine. From the article: 'Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.'"
And to the post above this: what do 2 trillion hits matter against 2 million if they can't get what you really need onto the first page?
Words are only yours until someone else uses them...
I've never heard of Gigablast either, but it seems to have some interesting features. It links to the Wayback Machine's page on the site so you can see past versions of the site, and it also shows the most common phrases in which the search term was found. It archives pages like Google does, and goes as far as linking to OTHER search engines to help out your search.
...have a lot in common. Different search engines allow sites to "vote" on which ones are the most authoritative, and the best methods in one field can give insight into the best method in the other.
For example, there is the Kemeny order (named after John G. Kemeny, the same guy who co-created BASIC). Using a version of ranked ballots and sorting websites by the mean Kemeny order gives you a method that is surprisingly good at putting authoritative sites at the top and spam sites at the bottom. For those of you who like in-depth analysis and don't mind math, the following is a good site:
http://www10.org/cdrom/papers/577/
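To make the Kemeny idea above a bit more concrete, here is a rough sketch of plain Kemeny aggregation: given several ranked ballots, pick the ordering that disagrees with them on the fewest pairs. This is not the code from the linked paper, and it is not the "mean" variant described above; the site names and ballots are made up for illustration. It is brute force, so it only works for a handful of items (computing an exact Kemeny order is NP-hard in general).

```python
from itertools import combinations, permutations


def kendall_tau(a, b):
    """Count the item pairs that rankings a and b put in opposite order."""
    pos_a = {item: i for i, item in enumerate(a)}
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
    )


def kemeny_order(rankings):
    """Try every possible ordering and keep the one with the least
    total pairwise disagreement against all ballots (brute force)."""
    items = sorted(rankings[0])
    return min(
        permutations(items),
        key=lambda cand: sum(kendall_tau(cand, r) for r in rankings),
    )


# Hypothetical ballots: each "voter" ranks the same three sites, best first.
ballots = [
    ["docs.example", "blog.example", "spam.example"],
    ["blog.example", "docs.example", "spam.example"],
    ["docs.example", "spam.example", "blog.example"],
]

best = kemeny_order(ballots)
print(best)
```

With these ballots the winning order is docs.example, blog.example, spam.example, which disagrees with the voters on only two pairs in total. Anything at web scale would need an approximation instead of trying every permutation.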
Yahoo used to use Google, but they bought Inktomi and have switched to its search engine. MSN also uses the Inktomi search engine, but tweaks the results.
just as I'm pulling an all-nighter at this moment trying to embed a custom search engine into an app for use on an intranet.
Actually, what is more interesting is Nutch and Mozdex, which seem to be built around Lucene (what I am using to build my own search engine embedded into a Horde framework app). Although probably a lot simpler than the industrial-grade stuff, for someone who has been used to throwing a word at an input screen and magically getting back results, the insight into the inner workings of search engines is very interesting.
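For anyone curious what those inner workings look like at the smallest possible scale, here is a toy inverted index, the basic data structure that libraries like Lucene are built around. This is only a sketch and not Lucene's API: the function names, the tokenizer, and the sample documents are all made up, and a real engine adds ranking, stemming, on-disk storage and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents keyed by id.
docs = {
    1: "Gigablast is a web search engine",
    2: "Lucene is a search library",
    3: "Nutch crawls the web",
}

index = build_index(docs)
hits = search(index, "web search")
print(hits)
```

The query "web search" only matches document 1 here, because it is the only one containing both terms. Results come back unordered; deciding which hit goes first is the hard part that the rest of this thread is arguing about.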
I don't believe it's actually being used in practice, but Nutch is developing rapidly. The largest test crawl they've completed has been about a hundred million pages. They're asking for donations to develop a larger demo system.