Slashdot Mirror


How to Build a Search Engine

CowboyRobot writes "Three years ago, former Infoseek developer Matt Wells decided to go solo and build his own search engine, Gigablast. In this article, Infoseek founder Steve Kirsch interviews his former employee about the process and challenges of creating a modern, scalable search engine. From the article: 'Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.'"

37 of 270 comments

  1. Lol by SugoiMonkey · · Score: 5, Funny

    "even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast" Gigawho? You silly goose.

    1. Re:Lol by SphericalCrusher · · Score: 5, Insightful

      That sounds a lot like self-advertisement to me. And there are A LOT more than just five companies! Take MetaCrawler and DogPile for instance -- they aren't on his list.

      --
      "Instant gratification takes too long." - Carrie Fisher
  2. Gigablast... by vosbert · · Score: 4, Interesting

    Am I the only one who's never heard of Gigablast? But then, not too many years ago, I remember a time when I'd never heard of Google. Kinda makes one wonder how secure a lead over its competition any search engine can ever hope to obtain, and what kind of chance Microsoft stands of usurping the search engine market.

  3. Not *that* complicated by Anonymous Coward · · Score: 5, Funny
    This will cover about 50% of your job:
    select * from internet where keywords like '%asian sex free pics%';
  4. Hmmm.... by elid · · Score: 5, Interesting

    Gigablast: "273,384,720 pages indexed"
    Google: "Searching 4,285,199,774 web pages" That's quite a big difference.

    1. Re:Hmmm.... by ixplodestuff8 · · Score: 5, Informative

      I've never heard of Gigablast either, but it seems to have some interesting features. It links to the Wayback Machine's page on the site so you can see past versions of the site, and it also shows the most common phrases in which the search term was found. It also archives pages like Google does, and goes as far as to link to OTHER search engines to help out your search.

    2. Re:Hmmm.... by Waffle+Iron · · Score: 4, Funny
      Gigablast: "273,384,720 pages indexed"
      Google: "Searching 4,285,199,774 web pages" That's quite a big difference.

      At least this Gigablast name is closer to the truth. They are only exaggerating their page count by a factor of 3.7 : 1.

      By my math, Google comes up short by 2.3x10^90 : 1.
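      For anyone checking the math, here is a quick Python sketch. It assumes the joke reads "Giga" as 10^9 pages and "Google" as a googol (10^100) pages; the page counts are the ones quoted above.

```python
# Assumption: "Giga" implies 10**9 pages, "Google" implies a googol (10**100).
GIGA = 10**9
GOOGOL = 10**100

gigablast_pages = 273_384_720
google_pages = 4_285_199_774

# How far each name overstates the advertised index size.
giga_ratio = GIGA / gigablast_pages
googol_ratio = GOOGOL / google_pages

print(f"Gigablast: {giga_ratio:.1f} : 1")
print(f"Google:    {googol_ratio:.1e} : 1")
```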

  5. That list makes no sense by jonman_d · · Score: 5, Insightful

    I have to say, that list makes no sense. Maybe if you'd switch "Gigablast" with "MSN", you'd have a list of the some of the major search engines, but it sounds like this guy is just tooting his own horn (and without the proper credentials).

  6. Whatever happened to by nevek · · Score: 5, Interesting

    Hotbot, Lycos, Mamma.com, Iwon.com, wisenut.com, looksmart.com, teoma.com, alltheweb.com, deja.com, directhit.com, excite.com, go.com, infoseek.com, invisibleweb.com, flipper.com, messageking.com, magellan.com, nbci.com, snap.com, northernlight.com, openfind.com, webcrawler.com

    ahh the dotcomfallout

    at least www.cowboynealsproncollection.com is doing well

  7. only 5? by micker · · Score: 5, Informative
    The poster left out vivisimo.... lately it's been all I use...

    and to the post above this.. what does 2 trillion hits matter against 2 million if they can't get what you really need up onto the first page

    --
    Words are only yours until someone else uses them...
  8. Humph by SlamMan · · Score: 4, Funny

    "and everyone's a little bit nervous to see what it's bringing.'"

    Money. Lots and lots of money.

    --
    Mod point free since 2001
  9. P2P? by ron_ivi · · Score: 4, Interesting
    I always thought P2P would be a good infrastructure for a search engine.

    That way, I could share the load with people with similar interests as myself.

    For example, I would like a search engine that was more up-to-date crawling the PR of my competitors, but couldn't care less about most other companies. If I were running my own node of a P2P engine, I could set my node to focus on that, and anyone else who shared my interests could tap into it.

    1. Re:P2P? by cgenman · · Score: 4, Interesting

      The closest thing to what you're talking about is Grub, which is run by Looksmart as a dead-link checker and also feeds to WiseNut. While it doesn't allow you to crawl sites that you don't have control over, it does allow you to crawl your own site.

      Personally, I've wanted a Google toolbar that indexes the sites that you surf, and adds additional positive weight to the sites that you linger on. It may not know what you liked there, but it knows that you liked it.

      Completely offtopic, but does anyone know of a screensaver on Windows that displays random (or spidered) web pages? I've been looking for an equivalent to the XWindows version for years.

    2. Re:P2P? by cgenman · · Score: 4, Interesting

      ...Just answered my own question. Combining A+ Web Screensaver (nonfree) with a random web page URL (www.uroulette.com/visit.php) gets a random web page display on idle. Yay! Now I'll never know if I'm going to a polynesian community church or a poorly written Raiders fansite.

      Now if there were only a way to open said site and continue reading in non-screensaver mode...

  10. Voting methods and search engines.... by Anonymous Coward · · Score: 5, Informative

    ...have a lot in common. Different search engines allow sites to "vote" on which ones are the most authoritative, and the best methods in one field can give insight into the best method in the other.

    For example, there is the Kemeny order (named after the same guy who came up with BASIC, John G. Kemeny). Using a version of ranked ballots and sorting websites by the mean Kemeny order gives you a method that is surprisingly good at putting authoritative sites at the top and spam sites at the bottom. For those of you who like in-depth analysis and don't mind math, the following is a good site:

    http://www10.org/cdrom/papers/577/
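    To make the idea concrete, here is a minimal Python sketch of the basic Kemeny definition: pick the ordering that minimizes the total number of pairwise disagreements (Kendall tau distance) with the "ballots". The ballots and site names are made up for illustration, and this brute-force search is not necessarily the variant described above or in the linked paper. It tries every permutation, so it is only usable for a handful of items; computing a Kemeny order exactly is known to be NP-hard in general.

```python
from itertools import combinations, permutations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count the pairs of items that the two rankings order differently."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(
        1
        for x, y in combinations(ranking_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kemeny_order(rankings):
    """Brute-force the ordering with the smallest total distance to all ballots.

    Assumes every ballot ranks the same set of items.
    """
    items = rankings[0]
    return min(
        permutations(items),
        key=lambda candidate: sum(
            kendall_tau_distance(candidate, ballot) for ballot in rankings
        ),
    )


# Hypothetical ballots: each list is one voter's ranking, best first.
BALLOTS = [
    ["authority.example", "blog.example", "spam.example"],
    ["authority.example", "spam.example", "blog.example"],
    ["blog.example", "authority.example", "spam.example"],
]

best = kemeny_order(BALLOTS)
print(best)
```

    With these three ballots the aggregate puts authority.example first and spam.example last, because that ordering disagrees with the voters on the fewest pairs.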

  11. In my opinion..... by Kenja · · Score: 4, Funny
    In my opinion the best search engine is a Ford T-Block. Put that into a light weight steel frame and we can search them down and kill em in the street like wild animals.

    Whoa, hold on. Wrong site. Never mind.

    --

    "Have you ever thought about just turning off the TV, sitting down with your kids, and hitting them?"
  12. BOOBLE! by the+MaD+HuNGaRIaN · · Score: 4, Funny

    What about BOOBLE.

  13. Re:Lycos anyone by Thanatopsis · · Score: 4, Interesting

    Lycos search no longer runs its own crawler. Matt's talking about people with their own crawler and algo.

  14. Re:Isn't yahoo powered by google? by levram2 · · Score: 5, Informative

    Yahoo used to use Google, but they bought Inktomi and have switched to their search engine. MSN also uses the Inktomi search engine, but tweaks the results.

  15. Competition, in this case... a good thing by Jtoxification · · Score: 4, Insightful

    We all win. With the increasing # of sites, content, web services, spam, popup attacks, and "please allow us to rape your computer" certificates to download (that's the main reason I use Firefox when on Windows now: you can't tell I.E. not to accept those damned installation certificates, nor block requests to change the homepage), it becomes ever more difficult to find what you're looking for via Google's site ranking technology, especially when it's not something that everyone else looks for. Because they fight to be the best, we get cool things like ftp searches, grep and regexp searching of dmoz.org, video, image, and music searches, even Linux and BSD search-specific pages. Gmail, Microsoft's entry, and now Gigablast are all rewards we get to reap from each company attempting to set its roots deeper into the Internet, like weeds vying for the same piece of dirt. We are extremely lucky, but then I doubt more than a handful of search engines will ever hold top ranks at one time, due to the fact that they are so specialized in what they do. Just hope Gigablast and Google don't decide to create a new IM service, too.

    --
    --I gots 99 problems but a new machine ain't one!
    AMD! Asus! Whoot! 6 years!
  16. Re:Matt's a good guy by cybermace5 · · Score: 4, Funny

    I'm glad you told everyone he's a good guy, for a minute there I just assumed he was an evil, scheming villain.

    --
    ...
  17. What about patents? by enosys · · Score: 4, Interesting
    What about patents? A lot of the stuff that goes into a search engine must be patented by now. I'm sure that if you create a search engine you'll end up infringing a bunch of these patents. Yes, I'm sure that in many cases it's obvious, and there's probably prior art, but I expect that the patents are still there and it's like a minefield of patents.

    So how do you make a search engine and not get sued for infringement, or at least be able to win in the lawsuits?

  18. what timing for this /. article! by whowho · · Score: 5, Informative

    just as I'm pulling an all-nighter at this moment trying to embed a custom search engine into an app for use on an intranet.

    Actually, what is more interesting is Nutch and Mozdex, which seem to be based around Lucene (which I am using to build my own search engine embedded in a Horde framework app). Although probably a lot simpler than the industrial grade stuff, for someone who has been used to throwing a word at an input screen and magically getting back results, the insight into the inner workings of search engines is very interesting.
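    For a taste of those inner workings, the core data structure behind Lucene-style engines is an inverted index: a map from each term to the documents containing it. The toy Python sketch below is only an illustration of that idea, not Lucene's actual API; the documents, the whitespace tokenizer, and the function names are all made up, and it does no ranking at all.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical mini-corpus.
DOCUMENTS = {
    1: "how to build a search engine",
    2: "search engine spam",
    3: "build a crawler",
}

index = build_index(DOCUMENTS)
print(search(index, "search engine"))  # documents 1 and 2
print(search(index, "build engine"))   # document 1 only
```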

  19. Searching from the server's perspective by no+longer+myself · · Score: 5, Insightful
    Having a webserver hobby, I see the search engines crawl through my site daily. Of course in the beginning they hungrily tripped through the pages, taking in as much as could be found. Of course as time went on it seemed like some of the search engines had a new method of just grabbing a page or two every hour or so. I imagine this was to prevent over-taxing my box, but it made the first glance at my logs look artificially inflated as if people were visiting the site instead of just a crawler working its way through... slowly and painfully.

    I'd just prefer it if search engines would have enhanced rules for the robots.txt file so a webmaster could tell them more specifically how they want to be searched.

    Yes, I know you can put in a delay between page searches, and you can deny access to parts or all of the site, and you can even tell some or all crawlers to take a flying leap, but I'd like to tell them at the front door, "Search on Wednesday, make it fast, do a thorough job, and don't come back for a week."

    Too much to ask, right?
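    For reference, the controls that do exist look roughly like this. The sketch below uses Python's standard urllib.robotparser to read a made-up robots.txt; the bot names and paths are hypothetical, and Crawl-delay is a non-standard directive that only some crawlers honor. There is no "come back on Wednesday" directive in it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one crawler outright, slow everyone else
# down and keep them out of /private/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# BadBot is told to take a flying leap.
bad_allowed = parser.can_fetch("BadBot", "http://example.com/index.html")

# Everyone else may fetch public pages, but not /private/.
good_public = parser.can_fetch("GoodBot", "http://example.com/index.html")
good_private = parser.can_fetch("GoodBot", "http://example.com/private/x.html")

# Seconds a polite crawler should wait between requests.
good_delay = parser.crawl_delay("GoodBot")

print(bad_allowed, good_public, good_private, good_delay)
```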

  20. Uhm No by Tedium+Unleased · · Score: 5, Insightful

    Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.

    Oh yeah real nervous. They're getting on the bandwagon late; too late to monopolize this particular free (as in shut the fuck up) service. If by some miracle they produce something 'threatening', it will be because it's good or because the others have slacked off.

  21. The value of pagerank by jfengel · · Score: 4, Interesting

    The most interesting assertion in the article was that Pagerank was useless. He says Google's real win is its ability to cache a copy of the page and show you a summary including your search terms. I do use that a lot to quickly exclude irrelevant pages.

    He said that his internal tests at Infoseek showed that pagerank didn't substantially improve the value of searches over simpler link analysis algorithms. I find that interesting, because I've worked with that algorithm and I know it's a stone bitch to compute.

    He might well be right. I like Google over the other search engines because the interface is simple and clean, and I find it pleasant to use. I'm reminded of Donald Norman's book on Emotional Design, about how we can get really attached to things that work for us.

    Google sells itself on pagerank, but at the very least it's insufficient against "search engine spam". If pagerank is less important than speed and utility, maybe I'll have something else programmed into my Firefox search bar. But not today.
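    For comparison, here is a rough Python sketch of the textbook PageRank power iteration next to the simplest link analysis there is, a raw inbound-link count. It is not Google's or Infoseek's implementation; the four-link graph, the 0.85 damping factor, and the iteration count are assumptions for illustration. On a toy graph it is trivial; the pain comes from repeating these passes over billions of links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power iteration.

    `links` maps each page to the list of pages it links to. Every link
    target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def inlink_counts(links):
    """The simpler alternative: count inbound links per page."""
    counts = {page: 0 for page in links}
    for outlinks in links.values():
        for target in outlinks:
            counts[target] += 1
    return counts


# Hypothetical toy web: a links to b and c, b links to c, c links to a.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

ranks = pagerank(LINKS)
counts = inlink_counts(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
print(counts)
```

    Here both methods put page c on top, but only PageRank separates a from b: each has a single inbound link, yet a's link comes from the best-ranked page.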

  22. Re:Interesting by prockcore · · Score: 4, Interesting

    If it can survive ./, thats a good sign.

    Not really. I was impressed with the power of a good slashdotting until we made the slashdot frontpage a few weeks ago (we also made it to the frontpage a few years ago but at that time we were serving static htmls).

    An article was pulled out of a MySQL database, XSL-transformed, sent to the webserver via SOAP, and finally about 150k of HTML and images was sent to the user. Repeat 80,000 times over a 5 hour period.

    This is hardly an impressive feat. I expected more, but it turns out that slashdot really only sends about 20-30k unique visitors to your site.

    Yes, I used to be impressed with the power of a slashdotting, but now I realize that it's just the result of very crappy sites run on very crappy desktop machines pretending to be servers.

    So, no, them withstanding a slashdot link isn't a good sign, it's the very least we can expect of a commercial entity.
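    To put those numbers in perspective, here is the back-of-the-envelope arithmetic as a Python sketch. It assumes "150k" means 150 kilobytes per page view (decimal units) and that the 80,000 requests were spread evenly over the 5 hours, which real traffic never is.

```python
# Figures from the comment; the unit of "150k" is an assumption (kilobytes).
requests = 80_000
hours = 5
page_kb = 150

# Average request rate over the whole period.
requests_per_second = requests / (hours * 3600)

# Total data served, in decimal gigabytes.
total_gb = requests * page_kb / 1_000_000

# Average outbound bandwidth, in megabits per second.
avg_mbit_per_second = requests_per_second * page_kb * 8 / 1000

print(f"{requests_per_second:.1f} requests/s")
print(f"{total_gb:.0f} GB total")
print(f"{avg_mbit_per_second:.1f} Mbit/s average")
```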

  23. Search engines could replace a query language? by JusTyler · · Score: 4, Interesting

    Fave quote from that article..

    However, I think that search engines, if they index XML properly, will have a good shot at replacing SQL.

    Discuss.

  24. Heh... by Xenographic · · Score: 4, Interesting

    I've often wondered why Google doesn't put up an "unsafe" image search option? (e.g. leave out all the images it deems "safe").

    Then again, it hardly needs to most of the time...

  25. Re:Open Source Search Engine? by idiotfromia · · Score: 5, Informative

    I don't believe it's actually being used in practice, but Nutch is developing rapidly. The largest test crawl they've completed has been about a hundred million pages. They're asking for donations to develop a larger demo system.

  26. less commercialism by dj245 · · Score: 4, Interesting
    I did a quick search on Gigablast for "Radio control speed controler". Now normally, on Google, you would get a couple million pages of websites wanting to sell you a speed controller. On Gigablast, however, the first 10 results were pretty much information about speed controllers, and/or battlebot sites that explained what you would need them for.

    I've noticed lately that Google seems to be filling up with websites wanting to sell you stuff (even if they don't use spamming techniques). Perhaps these little guys can put the pressure on Google to get some better algorithms. Or perhaps it's time for Google to fade into the past like Altavista did a couple years ago and make way for the new.

    --
    Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
  27. What I'd Like To See In A Search Engine by Nom+du+Keyboard · · Score: 5, Insightful
    What I'd like to see in a search engine is a page kill or broken link feature to keep it current. If I click a link that is broken or vastly changed (e.g. the link to ancient Chinese pottery is now a porn site), I could back up to the search results page and click a link to have them immediately re-crawl that page. I think it would make for better results, and am surprised that it's not already common.

    You may license my patent on this idea for reasonable terms in exchange for shares of your company's stock.

    --
    "It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
  28. In other news... by sydbarrett74 · · Score: 4, Funny

    'In other news, Google announced the buy-out of Gigablast. The newly-formed company will be called Giggle.'

    --
    'He who has to break a thing to find out what it is, has left the path of wisdom.' -- Gandalf to Saruman
  29. Only five? by adriantam · · Score: 5, Interesting

    Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast.

    I am a Chinese speaker, and the East Asian writing tradition is character-based, with no alphabet. That means we don't separate words with blank spaces but rather dosomethinglikethis. This characteristic of our languages causes many problems for search services, because you never know whether interpreting that as "dos ome thingli keth is" is right or not. We have to build a dictionary into the search engine, which makes it different from engines for many Western languages. So I don't believe there are only five search engine providers in this world. At least, I know of a list of more search engines developed to support East Asian languages.
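    As a rough illustration of the dictionary approach, here is a Python sketch of forward maximum matching, one simple dictionary-based segmentation technique: at each position, take the longest dictionary word that fits. The tiny English word list stands in for a real Chinese dictionary, and the function name is made up; production segmenters are considerably more sophisticated.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that matches.
    A character that starts no known word is emitted on its own.
    """
    max_word_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical dictionary; note it contains "some" and "thing" as well
# as "something", so the greedy longest match matters.
DICTIONARY = {"do", "some", "thing", "something", "like", "this", "is"}

print(segment("dosomethinglikethis", DICTIONARY))
```

    With this dictionary the greedy match recovers "do something like this", but a different dictionary (or a different language) can just as easily lead it astray, which is exactly the ambiguity problem.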

    --
    http://www.ieaa.org/~adrian/
  30. Gigablast... 2 years old and nobody's heard of it! by mbauser2 · · Score: 4, Interesting

    I have heard of Gigablast, but I've never been impressed by it. (I wrote a review back in 2002.) Most search engine optimizers love Gigablast, however, because it's such an easy engine to game.

    It's a fairly old-school engine: indexes whatever it can and favors pages that are keyword-heavy. It's almost too easy to spam. I don't think there's anything PageRank-like in the algorithm, otherwise, it wouldn't be able to add pages to the index "instantly". (PageRank is too computationally intensive for that.) Gigablast still thinks meta-tags are a great idea! While the hardware setup might be innovative (I'll leave that to the hardware experts to decide), the engine software itself seems about ten years behind the times.

    Like many posters here, I doubt a one-man outfit is going to take down Google (although many search engine optimizers would like it to). Gigablast has had two years to make an impression, and it hasn't. A company on an acquisition binge might be crazy enough to buy it, but I wouldn't hold my breath.

    --
    Proud to be / Smiley-free / Since Nineteen / Ninety-Three
  31. I'm ready to change by Andy_R · · Score: 4, Interesting

    Wonderful as Google is, I'm finding more and more searches don't produce useful results.

    I keep getting high rankings from sites like bizrate and kelkoo, which don't have any content whatsoever, but have convinced google to show pages that say "search for best prices on xxxx" where xxxx is my search term. Often the problem is so bad that I don't see any sites with content until page 2 of google.

    Another issue is with searches for song lyrics. There are dozens of identikit advert sites which drown a tiny (and often inaccurate) text payload in a swarm of adverts. Finding a site written by someone who cares about accuracy is getting impossible.

    What I want is sites ranked by volume of relevant content, with a negative ranking element for duplicate sites and a stronger negative ranking for multiple adverts.

    Oh, and what I would also find useful is a 'go (after blocking adservers)' button instead of a 'go' button.

    --
    A pizza of radius z and thickness a has a volume of pi z z a
  32. Make it like a human brain... by Pedrito · · Score: 4, Funny

    I liked this quote: "Now that the Internet is very large, it makes for some well-developed memory. I would suppose that the amount of information stored on the Internet is around the level of the adult human brain. Now we just need some higher-order functionality to really take advantage of it. At one point we may even discover the protocol used in the brain and extend it with an interface to an Internet search engine."

    The protocol used in the brain? That can't be a good direction to go. I mean, if it's anything like my memory and, honestly, the memory of most people I know, it's definitely going to be a step backwards. Human brains can hold a lot of information, but retrieval is definitely not their specialty. I can see it now. Type in my search terms and the engine comes back with, "ummm, it's right on the tip of my tongue. Okay, I don't have a tongue, but I just about remember it. Give me just a minute to think about it. umm... umm... Nope, it's gone. Nevermind."