Slashdot Mirror


How to Build a Search Engine

CowboyRobot writes "Three years ago, former Infoseek developer Matt Wells decided to go solo and build his own search engine, Gigablast. In this article, Infoseek founder Steve Kirsch interviews his former employee about the process and challenges of creating a modern, scalable search engine. From the article: 'Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.'"

16 of 270 comments

  1. Lol by SugoiMonkey · · Score: 5, Funny

    "even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast " Gigawho? You silly goose.

    1. Re:Lol by SphericalCrusher · · Score: 5, Insightful

      That sounds a lot like self-advertisement to me. And there are A LOT more than just five companies! Take MetaCrawler and DogPile for instance -- they aren't on his list.

      --
      "Instant gratification takes too long." - Carrie Fisher
  2. Not *that* complicated by Anonymous Coward · · Score: 5, Funny
    This will cover about 50% of your job:
    select * from internet where keywords like '%asian sex free pics%';
  3. Hmmm.... by elid · · Score: 5, Interesting

    Gigablast: "273,384,720 pages indexed"
    Google: "Searching 4,285,199,774 web pages"

    That's quite a big difference: Google's index is more than 15 times the size.

    1. Re:Hmmm.... by ixplodestuff8 · · Score: 5, Informative

      I've never heard of Gigablast either, but it seems to have some interesting features. It links to the Wayback Machine's page on the site so you can see past versions of the site, and it also shows the most common phrases in which the search term was found. It also archives pages like Google does and goes as far as to link to OTHER search engines to help out your search.

  4. That list makes no sense by jonman_d · · Score: 5, Insightful

    I have to say, that list makes no sense. Maybe if you'd switch "Gigablast" with "MSN", you'd have a list of some of the major search engines, but it sounds like this guy is just tooting his own horn (and without the proper credentials).

  5. Whatever happened to by nevek · · Score: 5, Interesting

    Hotbot, Lycos, Mamma.com, Iwon.com, wisenut.com, looksmart.com, teoma.com, alltheweb.com, deja.com, directhit.com, excite.com, go.com, infoseek.com, invisibleweb.com, flipper.com, messageking.com, magellan.com, nbci.com, snap.com, northernlight.com, openfind.com, webcrawler.com

    ahh the dotcomfallout

    at least www.cowboynealsproncollection.com is doing well

  6. only 5? by micker · · Score: 5, Informative
    The poster left out vivisimo.... lately it's been all I use...

    and to the post above this.. what do 2 trillion hits matter against 2 million if they can't get what you really need up onto the first page

    --
    Words are only yours until someone else uses them...
  7. Voting methods and search engines.... by Anonymous Coward · · Score: 5, Informative

    ...have a lot in common. Different search engines allow sites to "vote" on which ones are the most authoritative, and the best methods in one field can give insight into the best method in the other.

    For example, there is the Kemeny order (named after John G. Kemeny, the same guy who co-created BASIC). Using a version of ranked ballots and sorting websites by the mean Kemeny order gives you a method that is surprisingly good at putting authoritative sites at the top and spam sites at the bottom. For those of you who like in-depth analysis and don't mind math, the following is a good site:

    http://www10.org/cdrom/papers/577/
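As a rough illustration of the idea, here is a minimal sketch of a Kemeny-style ordering: given several ranked ballots, it picks the ordering with the smallest total number of pairwise disagreements (Kendall tau distance) with all of them. The function names, the ballots, and the `.example` site names are assumptions made up for this example, and the brute-force search is only one simple way to compute it, not necessarily the method described at the link above.

```python
from itertools import combinations, permutations


def kendall_tau_distance(order_a, order_b):
    """Count pairs of items that the two orderings rank in opposite order."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kemeny_order(ballots):
    """Brute-force the ordering with the smallest total distance to all ballots.

    Tries every permutation, so it is only feasible for a handful of items.
    """
    items = sorted(ballots[0])
    return min(
        permutations(items),
        key=lambda candidate: sum(kendall_tau_distance(candidate, b) for b in ballots),
    )


# Hypothetical ballots: each "voter" (for example a search engine) ranks the same sites.
ballots = [
    ["a.example", "b.example", "c.example", "spam.example"],
    ["b.example", "a.example", "c.example", "spam.example"],
    ["a.example", "c.example", "b.example", "spam.example"],
]
best = kemeny_order(ballots)
print(best)
```

Finding an exact Kemeny ordering is computationally hard in general, so anything beyond a toy list of sites would need an approximation rather than this exhaustive search.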

  8. Re:Isn't yahoo powered by google? by levram2 · · Score: 5, Informative

    Yahoo used to use Google, but they bought Inktomi and have switched to their search engine. MSN also uses the Inktomi search engine, but tweaks the results.

  9. what timing for this /. article! by whowho · · Score: 5, Informative

    just as I'm pulling an all-nighter at this moment trying to embed a custom search engine into an app for use on an intranet.

    Actually, what is more interesting is Nutch and Mozdex, which seem to be built around Lucene (what I am using to build my own search engine embedded into a Horde framework app). Although it is probably a lot simpler than the industrial-grade stuff, for someone who has been used to throwing a word at an input screen and magically getting back results, the insight into the inner workings of search engines is very interesting.
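For a taste of those inner workings, here is a minimal sketch of an inverted index, the kind of data structure that full-text engines are generally built around. This is not Lucene's API: the function names and the sample documents are assumptions made up for illustration, and a real engine adds tokenizing, stemming, ranking, and on-disk storage.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical intranet documents keyed by id.
docs = {
    1: "how to build a search engine",
    2: "search the intranet",
    3: "build an intranet app",
}
index = build_index(docs)
print(search(index, "build search"))
print(search(index, "intranet"))
```

Looking up a term is a single dictionary access no matter how many documents exist, which is why the index is built ahead of time instead of scanning every page per query.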

  10. Searching from the server's perspective by no+longer+myself · · Score: 5, Insightful
    Having a webserver hobby, I see the search engines crawl through my site daily. In the beginning they hungrily tripped through the pages, taking in as much as could be found. As time went on, some of the search engines seemed to switch to a new method of grabbing just a page or two every hour or so. I imagine this was to avoid over-taxing my box, but at first glance it made my logs look artificially inflated, as if people were visiting the site instead of just a crawler working its way through... slowly and painfully.

    I'd just prefer it if search engines would have enhanced rules for the robots.txt file so a webmaster could tell them more specifically how they want to be searched.

    Yes, I know you can put in a delay between page searches, and you can deny access to parts or all of the site, and you can even tell some or all crawlers to take a flying leap, but I'd like to tell them at the front door, "Search on Wednesday, make it fast, do a thorough job, and don't come back for a week."

    Too much to ask, right?
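For reference, the controls that do exist (a delay between fetches, denying parts of the site, and turning a crawler away entirely) can be sketched with Python's standard `urllib.robotparser`. The bot names and paths below are made up for illustration, and `Crawl-delay` is a non-standard extension that only some crawlers honor. Nothing here covers the "search on Wednesday, don't come back for a week" wish.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the bot names and paths are invented for this example.
ROBOTS_TXT = """\
User-agent: SlowBot
Crawl-delay: 30
Disallow: /private/

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# SlowBot may fetch public pages, but not /private/, and should wait between requests.
print(parser.can_fetch("SlowBot", "/index.html"))
print(parser.can_fetch("SlowBot", "/private/notes.html"))
print(parser.crawl_delay("SlowBot"))

# BadBot is told to stay away from the whole site.
print(parser.can_fetch("BadBot", "/index.html"))

# Any other crawler falls under the "*" rules, which set no delay.
print(parser.can_fetch("OtherBot", "/cgi-bin/search"))
print(parser.crawl_delay("OtherBot"))
```

This is how a well-behaved crawler would read the file; compliance is voluntary, so a crawler that ignores robots.txt is not stopped by any of it.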

  11. Uhm No by Tedium+Unleased · · Score: 5, Insightful

    Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.

    Oh yeah real nervous. They're getting on the bandwagon late; too late to monopolize this particular free (as in shut the fuck up) service. If by some miracle they produce something 'threatening', it will be because it's good or because the others have slacked off.

  12. Re:Open Source Search Engine? by idiotfromia · · Score: 5, Informative

    I don't believe it's actually being used in practice, but Nutch is developing rapidly. The largest test crawl they've completed has been about a hundred million pages. They're asking for donations to develop a larger demo system.

  13. What I'd Like To See In A Search Engine by Nom+du+Keyboard · · Score: 5, Insightful
    What I'd like to see in a search engine is a page kill or broken link feature to keep it current. If I click a link that is broken or vastly changed (e.g. the link to ancient Chinese pottery is now a porn site), I'd like to be able to back up to the search results page and click a link to have them immediately re-crawl that page. I think it would make for better results, and I'm surprised that it's not already common.

    You may license my patent on this idea on reasonable terms in exchange for shares of your company's stock.

    --
    "It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
  14. Only five? by adriantam · · Score: 5, Interesting

    Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast.

    I am a Chinese speaker, and East Asian writing is traditionally character-based, with no alphabet. That means we don't separate words with blank spaces but rather dosomethinglikethis. This characteristic causes many problems for search services, because you never know whether splitting that text into "dos ome thingli keth is" is right or not. We have to build a dictionary into the search engine, which is different from many Western languages. So I don't believe there are only five search engine providers in this world. At least, I know of a list of more search engines developed to support East Asian languages.
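To make the dictionary idea concrete, here is a minimal sketch of greedy forward maximum matching, one simple dictionary-based way to split unspaced text into words. The function name and both dictionaries are assumptions made up for illustration. Real segmenters use much larger dictionaries and usually statistical models on top.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching: at each position take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            # No dictionary word starts here: fall back to a single character.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


# Hypothetical toy dictionaries.
english = {"do", "some", "thing", "something", "like", "this"}
chinese = {"我", "喜欢", "搜索", "引擎", "搜索引擎"}

print(segment("dosomethinglikethis", english))
print(segment("我喜欢搜索引擎", chinese))
```

The greedy choice is also where the ambiguity bites: if the dictionary contained "dos", the first example would start with "dos" and then fail to find a word for the following letters, which is the kind of wrong split described above.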

    --
    http://www.ieaa.org/~adrian/