Slashdot Mirror


How to Build a Search Engine

CowboyRobot writes "Three years ago, former Infoseek developer Matt Wells decided to go solo and build his own search engine, Gigablast. In this article, Infoseek founder Steve Kirsch interviews his former employee about the process and challenges of creating a modern, scalable search engine. From the article: 'Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.'"

21 of 270 comments

  1. Gigablast... by vosbert · · Score: 4, Interesting

    Am I the only one who's never heard of Gigablast... but then, not too many years ago, I remember a time when I'd never heard of Google. Kinda makes one wonder how secure a lead over its competition any search engine can ever hope to obtain, and what kind of chance Microsoft stands of usurping the search engine market.

  2. Hmmm.... by elid · · Score: 5, Interesting

    Gigablast: "273,384,720 pages indexed"
    Google: "Searching 4,285,199,774 web pages"

    That's quite a big difference.

  3. Whatever happened to by nevek · · Score: 5, Interesting

    Hotbot, Lycos, Mamma.com, Iwon.com, wisenut.com, looksmart.com, teoma.com, alltheweb.com, deja.com, directhit.com, excite.com, go.com, infoseek.com, invisibleweb.com, flipper.com, messageking.com, magellan.com, nbci.com, snap.com, northernlight.com, openfind.com, webcrawler.com

    ahh the dotcom fallout

    at least www.cowboynealsproncollection.com is doing well

  4. P2P? by ron_ivi · · Score: 4, Interesting

    I always thought P2P would be a good infrastructure for a search engine.

    That way, I could share the load with people with similar interests as myself.

    For example, I would like a search engine that was more up-to-date at crawling the PR of my competitors, but couldn't care less about most other companies. If I were running my own node of a P2P engine, I could set my node to focus on that, and anyone else who shared my interests could tap into it.

    1. Re:P2P? by Anonymous Coward · · Score: 2, Interesting

      how about an "open" search engine? any takers? post below....

    2. Re:P2P? by cgenman · · Score: 4, Interesting

      The closest thing to what you're talking about is Grub, which is run by Looksmart as a dead-link checker and also feeds to WiseNut. While it doesn't allow you to crawl sites that you don't have control over, it does allow you to crawl your own site.

      Personally, I've wanted a Google toolbar that indexes the sites that you surf, and adds additional positive weight to the sites that you linger on. It may not know what you liked there, but it knows that you liked it.

      Completely offtopic, but does anyone know of a screensaver on Windows that displays random (or spidered) web pages? I've been looking for an equivalent to the XWindows version for years.

    3. Re:P2P? by cgenman · · Score: 4, Interesting

      ...Just answered my own question. Combining A+ Web Screensaver (nonfree) with a random web page URL (www.uroulette.com/visit.php) gets a random web page display on idle. Yay! Now I'll never know if I'm going to a Polynesian community church or a poorly written Raiders fansite.

      Now if there were only a way to open said site and continue reading in non-screensaver mode...

  5. Re:Lycos anyone by Thanatopsis · · Score: 4, Interesting

    Lycos search no longer runs its own crawler. Matt's talking about people with their own crawler and algo.

  6. I think the guy just expanded his database by Anonymous Coward · · Score: 2, Interesting

    By placing this on /. he got:

    (("Slashdot serves 50 million pages per month"/(# users actually checking out this story))*number of searches tried) + a residual amount that might actually use this search engine more

    And what they might be interested in.

  7. Lycos? HotBOT??? by Anonymous Coward · · Score: 1, Interesting

    What about hotbot? Lycos?

  8. What about patents? by enosys · · Score: 4, Interesting

    What about patents? A lot of the stuff that goes into a search engine must be patented by now. I'm sure that if you create a search engine you'll end up infringing a bunch of these patents. Yes, I'm sure that in many cases it's obvious, and there's probably prior art, but I expect that the patents are still there and it's like a minefield of patents.

    So how do you make a search engine and not get sued for infringement, or at least be able to win in the lawsuits?

  9. The value of pagerank by jfengel · · Score: 4, Interesting

    The most interesting assertion in the article was that Pagerank was useless. He says Google's real win is its ability to cache a copy of the page and show you a summary including your search terms. I do use that a lot to quickly exclude irrelevant pages.

    He said that his internal tests at Infoseek showed that pagerank didn't substantially improve the value of searches over simpler link analysis algorithms. I find that interesting, because I've worked with that algorithm and I know it's a stone bitch to compute.
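For readers who haven't worked with it, the comparison above can be sketched in a few lines. This is a textbook power iteration set beside a plain inbound-link count, not Google's or Infoseek's actual code. The function names, the damping value of 0.85, the fixed iteration count and the toy graph in the usage note are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def inlink_count(links):
    """Simpler link analysis: count distinct pages linking to each page."""
    counts = {}
    for page, targets in links.items():
        counts.setdefault(page, 0)
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return counts
```

On a toy graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}` both functions put page `c` on top. The cost difference is that the link count is one pass over the link table, while the iteration touches every link on every pass.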

    He might well be right. I like Google over the other search engines because the interface is simple and clean, and I find it pleasant to use. I'm reminded of Donald Norman's book on Emotional Design, about how we can get really attached to things that work for us.

    Google sells itself on pagerank, but at the very least it's insufficient against "search engine spam". If pagerank is less important than speed and utility, maybe I'll have something else programmed into my Firefox search bar. But not today.

  10. Re:Interesting by prockcore · · Score: 4, Interesting

    If it can survive /., that's a good sign.

    Not really. I was impressed with the power of a good slashdotting until we made the slashdot frontpage a few weeks ago (we also made it to the frontpage a few years ago, but at that time we were serving static HTML pages).

    Each article was pulled out of a MySQL database, XSL-transformed, sent to the webserver via SOAP, and finally sent to the user as about 150k of HTML and images. Repeat 80,000 times over a 5 hour period.
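As a rough check on those figures, the averages can be worked out directly. This sketch assumes "150k" means 150 kilobytes with 1 kB = 1000 bytes, and that the load was spread evenly over the five hours, which a real slashdotting would not be. The function and variable names are made up for the example.

```python
def average_load(requests=80_000, hours=5, page_kb=150):
    """Average request rate and bandwidth for the quoted figures.

    Assumes 1 kB = 1000 bytes and an even spread over the period.
    """
    seconds = hours * 3600
    requests_per_second = requests / seconds
    total_gb = requests * page_kb / 1_000_000          # kB -> GB
    mbit_per_second = requests * page_kb * 8 / 1000 / seconds  # kB -> Mbit
    return requests_per_second, total_gb, mbit_per_second


rps, total_gb, mbit = average_load()
print(f"{rps:.2f} requests/s, {total_gb:.1f} GB total, {mbit:.2f} Mbit/s average")
```

Under those assumptions the average works out to between four and five page views per second. Peaks would be higher than the average.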

    This is hardly an impressive feat. I expected more, but it turns out that slashdot really only sends about 20-30k unique visitors to your site.

    Yes, I used to be impressed with the power of a slashdotting, but now I realize that it's just the result of very crappy sites run on very crappy desktop machines pretending to be servers.

    So, no, them withstanding a slashdot link isn't a good sign, it's the very least we can expect of a commercial entity.

  11. Search engines could replace a query language? by JusTyler · · Score: 4, Interesting

    Fave quote from that article..

    However, I think that search engines, if they index XML properly, will have a good shot at replacing SQL.

    Discuss.

  12. Re:Interesting by Gherald · · Score: 2, Interesting

    Here's an example of a search that turned up a PDF link. It is very clearly labeled PDF on a red background:

    http://www.gigablast.com/search?k3v=898090&s=10&q=%22preston+alexander%22+-%22victoria+ashley%22

    Pretty nice if you ask me. I hate opening PDF links by accident. Sometimes in Google I accidentally click them before I realize they are going to be opened by some stupid browser plugin or (more often than not) Adobe's bloated Reader.

  13. Heh... by Xenographic · · Score: 4, Interesting

    I've often wondered why Google doesn't put up an "unsafe" image search option (e.g. leave out all the images it deems "safe").

    Then again, it hardly needs to most of the time...

  14. less commercialism by dj245 · · Score: 4, Interesting

    I did a quick search on Gigablast for "Radio control speed controler". Now normally, on Google, you would get a couple million pages of websites wanting to sell you a speed controller. On Gigablast, however, the first 10 results were pretty much information about speed controllers, and/or battlebot sites that explained what you would need them for.

    I've noticed lately that Google seems to be filling up with websites wanting to sell you stuff (even if they don't use spamming techniques). Perhaps these little guys can put the pressure on Google to get some better algorithms. Or perhaps it's time for Google to fade into the past like Altavista did a couple years ago and make way for the new.

    --
    Even those who arrange and design shrubberies are under considerable economic stress at this period in history.

    1. Re:less commercialism by a.ameri · · Score: 3, Interesting

      I actually was looking for some daily ISO snapshots of the Debian sid repository. Never heard of Gigablast before, so I gave it a shot and searched for 'daily sid snapshot iso'. Gigablast found no results, Google found 785, and looking at the first 10 results, I was easily able to find what I was looking for.

      C'mon, yes, Google's interface is cool and stuff, but Google's success isn't just its interface. Their search algorithms are rock solid, they are continually improving them, and Google returns the most relevant results of any search engine.

      I still think that Google's biggest advantages over others are their search algorithms and their method of indexing webpages. Everyone can copy the interface. But not everyone can build what Google has built: rock solid searching algorithms, a clustered scalable filesystem (GFS), their own webserver GWS (albeit a modified Apache), and they are reportedly making their own OS, on which anyone can have an account! Add to these numerous useful facilities like their Linux/BSD/MS/Mac search, their newsgroup search, their news section, etc., and you see why Google is successful, and well, others aren't that much.

      It's not 1996 anymore, when all you had to do was write a couple of Perl scripts and install NCSA on a *Nix with a mediocre DBMS, and voila, you had a search engine. These days, barriers to entry in the search engine field are very high. Google reportedly uses 100k servers. These days, it's a business that needs lots of capital, knowledge, technical know-how, and labour to start with.

      --
      -- /* Those who don't understand Unix, are condemned to reinvent it poorly */

  15. Only five? by adriantam · · Score: 5, Interesting

    Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast.

    I am a Chinese speaker, and East Asian writing is traditionally character-based, with no alphabet. That means we don't separate words with blank spaces but rather dosomethinglikethis. This characteristic causes many problems for search services, because you never know whether splitting that into "dos ome thingli keth is" is right or not. We have to build a dictionary into the search engine, which makes it different from search in many Western languages. So I don't believe there are only five search engine providers in this world. At least, I know of a list of other search engines developed to support East Asian languages.
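One common way to use such a dictionary is greedy "forward maximum matching": at each position, take the longest dictionary word that fits. The sketch below is only an illustration of that idea, using the English example from the comment. The function name and the tiny word list are assumptions, and real segmenters for Chinese or Japanese use far larger dictionaries and usually statistical models on top.

```python
def max_match(text, dictionary):
    """Split unspaced text by greedy forward maximum matching.

    At each position the longest dictionary word is taken; if nothing
    matches, a single character is emitted so the scan always advances.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical word list, just large enough for the example.
words = {"do", "some", "thing", "something", "like", "this"}
print(max_match("dosomethinglikethis", words))
```

Because the match is greedy, it prefers "something" over "some" + "thing". That is usually what you want, but it can pick the wrong split when a longer word hides a correct shorter one.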

    --
    http://www.ieaa.org/~adrian/

  16. Gigablast... 2 years old and nobody's heard of it! by mbauser2 · · Score: 4, Interesting

    I have heard of Gigablast, but I've never been impressed by it. (I wrote a review back in 2002.) Most search engine optimizers love Gigablast, however, because it's such an easy engine to game.

    It's a fairly old-school engine: indexes whatever it can and favors pages that are keyword-heavy. It's almost too easy to spam. I don't think there's anything PageRank-like in the algorithm, otherwise, it wouldn't be able to add pages to the index "instantly". (PageRank is too computationally intensive for that.) Gigablast still thinks meta-tags are a great idea! While the hardware setup might be innovative (I'll leave that to the hardware experts to decide), the engine software itself seems about ten years behind the times.

    Like many posters here, I doubt a one-man outfit is going to take down Google (although many search engine optimizers would like it to). Gigablast has had two years to make an impression, and it hasn't. A company on an acquisition binge might be crazy enough to buy it, but I wouldn't hold my breath.

    --
    Proud to be / Smiley-free / Since Nineteen / Ninety-Three

  17. I'm ready to change by Andy_R · · Score: 4, Interesting

    Wonderful as Google is, I'm finding more and more searches don't produce useful results.

    I keep getting high rankings from sites like bizrate and kelkoo, which don't have any content whatsoever, but have convinced Google to show pages that say "search for best prices on xxxx" where xxxx is my search term. Often the problem is so bad that I don't see any sites with content until page 2 of Google.

    Another issue is with searches for song lyrics. There are dozens of identikit advert sites which drown a tiny (and often inaccurate) text payload in a swarm of adverts. Finding a site written by someone who cares about accuracy is getting impossible.

    What I want is sites ranked by volume of relevant content, with a negative ranking element for duplicate sites and a stronger negative ranking for multiple adverts.
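A ranking like that could be sketched as a simple score: credit for relevant content, a penalty per duplicate, and a larger penalty per advert. Everything here is hypothetical, including the `Page` fields, the penalty weights and the example URLs. How a real engine would count relevant words, duplicates or adverts is left open.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    relevant_words: int   # volume of content matching the query
    duplicate_count: int  # near-identical copies seen elsewhere
    advert_count: int     # adverts on the page


def score(page, duplicate_penalty=50.0, advert_penalty=100.0):
    """Content volume minus penalties; adverts cost more than duplicates."""
    return (page.relevant_words
            - duplicate_penalty * page.duplicate_count
            - advert_penalty * page.advert_count)


def rank(pages):
    """Order pages from best to worst score."""
    return sorted(pages, key=score, reverse=True)


pages = [
    Page("http://lyrics-farm.example.com/song", 450, 6, 8),
    Page("http://fan-site.example.com/song", 400, 0, 1),
]
for page in rank(pages):
    print(page.url, score(page))
```

With these made-up weights the fan site scores 300.0 and the advert-heavy copy scores -650.0, so the page with slightly less text but far fewer adverts comes first.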

    Oh, and what I would also find useful is a 'go (after blocking adservers)' button instead of a 'go' button.

    --
    A pizza of radius z and thickness a has a volume of pi z z a