How to Build a Search Engine
CowboyRobot writes "Three years ago, former Infoseek developer Matt Wells decided to go solo and build his own search engine, Gigablast.
In this article, Infoseek founder Steve Kirsch interviews his former employee about the process and challenges of creating a modern, scalable search engine. From the article: 'Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.'"
And to the post above this: what does 2 trillion hits matter against 2 million if they can't get what you really need onto the first page?
Words are only yours until someone else uses them...
I've never heard of Gigablast either, but it seems to have some interesting features. It links to the Wayback Machine's page on the site so you can see past versions of the site, and it also shows the most common phrases in which the search term was found. It also archives pages like Google does, and goes as far as to link to OTHER search engines to help out your search.
We use Gigablast as a back fill for one of our search engines. His stuff is very speedy and he's a good guy to work with.
Thalasar
...have a lot in common. Different search engines allow sites to "vote" on which ones are the most authoritative, and the best methods in one field can give insight into the best method in the other.
For example, there is the Kemeny order (named after the same guy who came up with BASIC, John G. Kemeny). Using a version of ranked ballots and sorting websites by the mean Kemeny order gives you a method that is surprisingly good at putting authoritative sites at the top and spam sites at the bottom. For those of you who like in-depth analysis and don't mind math, the following is a good site:
http://www10.org/cdrom/papers/577/
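To make the "voting" idea concrete, here is a rough sketch of Kemeny-style rank aggregation: treat each ranked list as a ballot and pick the ordering that disagrees with the ballots on the fewest pairs. The site names and ballots are made up, the function names are just for illustration, and the brute-force search is only workable for a handful of items; it is not meant to reflect how any real engine does it.

```python
from itertools import combinations, permutations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count the pairs of items that the two rankings order differently."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(
        1
        for x, y in combinations(ranking_a, 2)
        if (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
    )


def kemeny_order(ballots):
    """Return the ordering with the smallest total distance to all ballots.

    Brute force over every permutation, so only usable for a few items.
    Assumes every ballot ranks exactly the same set of items.
    """
    items = sorted(ballots[0])
    return min(
        permutations(items),
        key=lambda candidate: sum(
            kendall_tau_distance(candidate, ballot) for ballot in ballots
        ),
    )


# Hypothetical ballots: each "voter" ranks the same four made-up sites.
ballots = [
    ["docs.example", "wiki.example", "blog.example", "spam.example"],
    ["wiki.example", "docs.example", "blog.example", "spam.example"],
    ["docs.example", "blog.example", "wiki.example", "spam.example"],
]
best = kemeny_order(ballots)
print(best)
```

With these ballots the winner follows the pairwise majority on every pair, so the site every voter ranked last stays last.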
There was one a while back. Everybody installed a program kinda like glimpse on their server and indexed their own web site and a few others. IIRC it would automatically work out by IP address any sites that were nearby and not already over-indexed. They all then kinda pooled the results.
One benefit of it is that you can keep the index of your website up to the minute if you really want. I guess they just never got enough people running the indexing software.
Yahoo used to use Google, but they bought Inktomi and have switched to their search engine. MSN also uses the Inktomi search engine, but tweaks the results.
Nah, they dropped google on Feb 17th. Get with the program :-D
I like AV because it's the only one (that I know of) that supports advanced embedded Boolean. Many a time Google fails to produce, and a well-built AV search will pop out what I'm looking for - albeit from a smaller selection.
If there is hope, it lies in the prowles.
Have you tried searching, though? Google pulls back more (quantity and accuracy) than Gigablast for the same terms. For example, search for "larry wall interview" and get 77,300 vs 9,759. I'm certainly not saying Google doesn't have its share of problems (seems to steadily be declining in quality). And I do like the categories/tags that Gigablast provides, but overall quality I'll give to Google.
Too big to fail? Does that make me too small to succeed?
just as I'm pulling an all-nighter at this moment trying to embed a custom search engine into an app for use on an intranet.
Actually what is more interesting is Nutch and Mozdex, which seem to be based around Lucene (what I am using to build my own search engine embedded into a Horde framework app). Although probably a lot simpler than the industrial grade stuff, for someone who has been used to throwing a word at an input screen and magically getting back results, the insight into the inner workings of search engines is very interesting.
"there doesn't seem to be a form you can fill in like Google's "advanced search" form."
except of course, for the advanced search form
I don't believe it's actually being used in practice, but Nutch is developing rapidly. The largest test crawl they've completed has been about a hundred million pages. They're asking for donations to develop a larger demo system.
"Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast."
You don't even have to RTFA. Read the summary.
And they're not search engines. They're just meta-search engines that compile the results of Google, Yahoo, etc.
LOAD "SIG",8,1
There has been work in this direction already from Lehigh University. Check it out here: http://wume.cse.lehigh.edu/
A9 serves Google results, so you can't quite call them a "search company". But I'm sure there are at least a dozen as big and famous as "Gigablast".
"You get all the fun of sitting still, being quiet, writing down numbers, paying attention...science has it all."
"Personally, I've wanted a Google toolbar that indexes the sites that you surf, and adds additional positive weight to the sites that you linger on. It may not know what you liked there, but it knows that you liked it."
You can get that service complete with a toolbar from http://www.vivisimo.com which is a great search engine.
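For what it's worth, the "positive weight for sites you linger on" idea quoted above can be sketched in a few lines. Everything here is an assumption for illustration: the `rerank` function, the `weight` value, the log damping, and the sample scores are invented, and nothing is implied about how any actual toolbar works.

```python
import math


def rerank(results, dwell_seconds, weight=0.1):
    """Re-sort (site, base_score) pairs, boosting sites the user lingered on."""

    def boosted(result):
        site, base_score = result
        # log1p damps the boost so one very long visit cannot dominate.
        return base_score + weight * math.log1p(dwell_seconds.get(site, 0.0))

    return sorted(results, key=boosted, reverse=True)


# Hypothetical engine scores and locally recorded dwell times (in seconds).
results = [("a.example", 1.00), ("b.example", 0.95), ("c.example", 0.50)]
dwell_seconds = {"b.example": 120.0}
reranked = rerank(results, dwell_seconds)
print([site for site, _ in reranked])
```

Here the two minutes spent on b.example are enough to lift it past a.example, while sites with no recorded visits keep their original score.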
Have a look at a9.com, which is Amazon's new search entry. Aside from a good web search engine, it provides a "history" of your previous searches and other innovative features.
If I click a link that is broken or vastly changed (e.g. the link to ancient Chinese pottery is now a porn site), that I could back up to the search results page and click a link to have them immediately re-crawl that page.
The index is usually updated only once every couple of weeks. Recomputing PageRank (or whatever everybody else uses) takes time. That's why more or less immediate updates are reserved for the best-known sites.
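To give a feel for why a recompute is a batch job, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not a claim about how Google or anyone else actually runs it; the three-page graph, the damping value of 0.85, and the iteration count are assumptions, and the code expects every linked page to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated passes over the whole link graph."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each pass with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: keys are pages, values are their outlinks.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
ranks = pagerank(links)
print(ranks)
```

Each pass touches every link in the graph, and many passes are needed before the scores settle, which is cheap for three pages and rather less so for a few billion.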
You can report 'false' results with the Dissatisfied? link at the bottom of the Google result page.