How to Build a Search Engine
CowboyRobot writes "Three years ago, former Infoseek developer Matt Wells decided to go solo and build his own search engine, Gigablast.
In this article, Infoseek founder Steve Kirsch interviews his former employee about the process and challenges of creating a modern, scalable search engine. From the article: 'Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.'"
Am I the only one who's never heard of Gigablast? But then, not too many years ago, there was a time when I'd never heard of Google. Kinda makes one wonder how secure a lead over its competition any search engine can ever hope to obtain, and what kind of chance Microsoft stands of usurping the search engine market.
Gigablast: "273,384,720 pages indexed"
Google: "Searching 4,285,199,774 web pages" That's quite a big difference.
Hotbot, Lycos, Mamma.com, Iwon.com, wisenut.com, looksmart.com, teoma.com, alltheweb.com, deja.com, directhit.com, excite.com, go.com, infoseek.com, invisibleweb.com, flipper.com, messageking.com, magellan.com, nbci.com, snap.com, northernlight.com, openfind.com, webcrawler.com
ahh the dotcomfallout
at least www.cowboynealsproncollection.com is doing well
That way, I could share the load with people with similar interests as myself.
For example, I would like a search engine that was more up-to-date crawling the PR of my competitors, but couldn't care less about most other companies. If I were running my own node of a P2P engine, I could set my node to focus on that, and anyone else who shared my interests could tap into it.
Lycos search no longer runs its own crawler. Matt's talking about people with their own crawler and algo.
Thalasar
By placing this on /. he got:
(("Slashdot serves 50 million pages per month"/(# users actually checking out this story))*number of searches tried) + a residual amount that might actually use this search engine more
And what they might be interested in.
What about hotbot? Lycos?
So how do you make a search engine and not get sued for infringement, or at least be able to win in the lawsuits?
The most interesting assertion in the article was that Pagerank was useless. He says Google's real win is its ability to cache a copy of the page and show you a summary including your search terms. I do use that a lot to quickly exclude irrelevant pages.
He said that his internal tests at Infoseek showed that pagerank didn't substantially improve the value of searches over simpler link analysis algorithms. I find that interesting, because I've worked with that algorithm and I know it's a stone bitch to compute.
He might well be right. I like Google over the other search engines because the interface is simple and clean, and I find it pleasant to use. I'm reminded of Donald Norman's book on Emotional Design, about how we can get really attached to things that work for us.
Google sells itself on pagerank, but at the very least it's insufficient against "search engine spam". If pagerank is less important than speed and utility, maybe I'll have something else programmed into my Firefox search bar. But not today.
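To make the comparison above concrete, here is a rough sketch of the two approaches side by side: a plain in-link count (one example of a "simpler link analysis algorithm") and the textbook power-iteration form of PageRank. This is not Google's production code or anything Infoseek ran; the function names, the toy graph format (a dict mapping each page to the pages it links to), the damping value of 0.85, and the fixed iteration count are all assumptions for illustration.

```python
def inlink_counts(graph):
    """Simple link analysis: count how many pages link to each page."""
    counts = {page: 0 for page in graph}
    for links in graph.values():
        for target in links:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(graph, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    Assumes every linked-to page also appears as a key in `graph`.
    Each iteration walks every link in the graph, which is why the
    computation gets expensive at web scale.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts the round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if links:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(inlink_counts(toy_web))
print(pagerank(toy_web))
```

The in-link count is a single pass and can be updated the moment a new page is crawled. The iterative version has to sweep the whole link graph many times before the numbers settle, which fits the "hard to compute" complaint.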
If it can survive /., that's a good sign.
Not really. I was impressed with the power of a good slashdotting until we made the slashdot frontpage a few weeks ago (we also made it to the frontpage a few years ago but at that time we were serving static htmls).
For each request, an article was pulled out of a MySQL database, XSL-transformed, and sent to the web server via SOAP, which finally sent about 150k of HTML and images to the user. Repeat 80,000 times over a 5 hour period.
This is hardly an impressive feat. I expected more, but it turns out that slashdot really only sends about 20-30k unique visitors to your site.
Yes, I used to be impressed with the power of a slashdotting, but now I realize that it's just the result of very crappy sites run on very crappy desktop machines pretending to be servers.
So, no, them withstanding a slashdot link isn't a good sign, it's the very least we can expect of a commercial entity.
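For anyone who wants to check the numbers quoted above (80,000 page views, 5 hours, about 150k each), here is a quick back-of-the-envelope calculation. It assumes "150k" means 150 kilobytes, uses decimal units, and spreads the load evenly over the 5 hours, so real peaks would be higher. The function name is made up for this sketch.

```python
def slashdot_load(page_views, hours, kb_per_view):
    """Return (views per second, total GB sent, average Mbit/s).

    Assumes the load is spread evenly and uses decimal units
    (1 GB = 1,000,000 KB, 1 Mbit = 1000 kbit).
    """
    seconds = hours * 3600
    views_per_second = page_views / seconds
    total_gb = page_views * kb_per_view / 1_000_000
    avg_mbit_per_second = views_per_second * kb_per_view * 8 / 1000
    return views_per_second, total_gb, avg_mbit_per_second


views, gigabytes, mbit = slashdot_load(80_000, 5, 150)
print(f"{views:.1f} views/s, {gigabytes:.0f} GB total, {mbit:.1f} Mbit/s average")
```

Under those assumptions it works out to roughly 4.4 page views per second and about 12 GB in total, which is consistent with the point that a properly provisioned server should cope.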
Fave quote from that article..
However, I think that search engines, if they index XML properly, will have a good shot at replacing SQL.
Discuss.
Web Hosting Reviews
Here's an example of a search that turned up a PDF link. It is very clearly labeled PDF on a red background:
http://www.gigablast.com/search?k3v=898090&s=10&q=%22preston+alexander%22+-%22victoria+ashley%22
Pretty nice if you ask me. I hate opening PDF links by accident. Sometimes in Google I accidentally click them before I realize they are going to be opened by some stupid browser plugin or (more often than not) Adobe's bloated Reader.
The unofficial
I've often wondered why Google doesn't put up an "unsafe" image search option (i.e. leave out all the images it deems "safe").
Then again, it hardly needs to most of the time...
I've noticed lately that Google seems to be filling up with websites wanting to sell you stuff (even if they don't use spamming techniques). Perhaps these little guys can put the pressure on Google to get some better algorithms. Or perhaps it's time for Google to fade into the past like Altavista did a couple of years ago and make way for the new.
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast.
I am a Chinese speaker, and East Asian writing is traditionally character-based, with no alphabet. That means we don't separate words with blank spaces, but rather dosomethinglikethis. This characteristic of the language causes many problems for search services, because you never know whether interpreting that as "dos ome thingli keth is" is right or not. We have to build a dictionary into the search engine, which is different from many Western languages. So I don't believe there are only five search engine providers in the world. At least, I know of a list of other search engines developed to support East Asian languages.
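A minimal sketch of the dictionary idea, using the English example above rather than real Chinese text. This is greedy "forward maximum matching", one common dictionary-based approach; it is not claimed to be what any particular search engine does, and the function name and the tiny word list are made up for illustration.

```python
def segment(text, dictionary):
    """Split unspaced text into words by greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there.
    If nothing matches, fall back to emitting a single character.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical word list; a real one would hold many thousands of entries.
word_list = {"do", "some", "thing", "something", "like", "this"}
print(segment("dosomethinglikethis", word_list))
```

Without the dictionary there is nothing to prefer "do something like this" over "dos ome thingli keth is". Greedy matching can still pick the wrong split when a longer word swallows the start of the next one, which is part of why segmentation is hard.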
http://www.ieaa.org/~adrian/
I have heard of Gigablast, but I've never been impressed by it. (I wrote a review back in 2002.) Most search engine optimizers love Gigablast, however, because it's such an easy engine to game.
It's a fairly old-school engine: indexes whatever it can and favors pages that are keyword-heavy. It's almost too easy to spam. I don't think there's anything PageRank-like in the algorithm, otherwise, it wouldn't be able to add pages to the index "instantly". (PageRank is too computationally intensive for that.) Gigablast still thinks meta-tags are a great idea! While the hardware setup might be innovative (I'll leave that to the hardware experts to decide), the engine software itself seems about ten years behind the times.
Like many posters here, I doubt a one-man outfit is going to take down Google (although many search engine optimizers would like it to). Gigablast has had two years to make an impression, and it hasn't. A company on an acquisition binge might be crazy enough to buy it, but I wouldn't hold my breath.
Proud to be / Smiley-free / Since Nineteen / Ninety-Three
Wonderful as Google is, I'm finding more and more searches don't produce useful results.
I keep getting high rankings from sites like bizrate and kelkoo, which don't have any content whatsoever, but have convinced Google to show pages that say "search for best prices on xxxx" where xxxx is my search term. Often the problem is so bad that I don't see any sites with content until page 2 of Google.
Another issue is with searches for song lyrics. There are dozens of identikit advert sites which drown a tiny (and often inaccurate) text payload in a swarm of adverts. Finding a site written by someone who cares about accuracy is getting impossible.
What I want is sites ranked by volume of relevant content, with a negative ranking element for duplicate sites and a stronger negative ranking for multiple adverts.
Oh, and what I would also find useful is a 'go (after blocking adservers)' button instead of a 'go' button.
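Roughly what the ranking wish above could look like as a scoring function. Everything here is hypothetical: the `Page` fields, the function names, and the penalty weights are invented for the sketch, and detecting duplicates or counting adverts is assumed to have been done elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    relevant_words: int   # volume of content relevant to the query
    is_duplicate: bool    # near-copy of another site
    ad_count: int         # number of adverts on the page


def score(page, duplicate_penalty=100.0, ad_penalty=150.0):
    """Content volume minus penalties; weights are arbitrary placeholders."""
    total = float(page.relevant_words)
    if page.is_duplicate:
        total -= duplicate_penalty
    if page.ad_count > 1:
        # Stronger penalty: charged for every advert beyond the first.
        total -= ad_penalty * (page.ad_count - 1)
    return total


def rank(pages):
    """Best-scoring pages first."""
    return sorted(pages, key=score, reverse=True)


results = rank([
    Page("http://adfarm.example/lyrics", 300, True, 6),
    Page("http://fan.example/lyrics", 800, False, 0),
])
print([p.url for p in results])
```

With these made-up weights the fan page scores 800 and the ad farm scores -550, so the page with actual content comes first.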
A pizza of radius z and thickness a has a volume of pi z z a