How to Build a Search Engine
CowboyRobot writes "Three years ago, former Infoseek developer Matt Wells decided to go solo and build his own search engine, Gigablast.
In this article, Infoseek founder Steve Kirsch interviews his former employee about the process and challenges of creating a modern, scalable search engine. From the article: 'Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast. It's a tight little community, and a lot of the people know and watch each other. Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.'"
"even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast " Gigawho? You silly goose.
Am I the only one who's never heard of Gigablast... but then not too many years ago, I remember a time when I'd never heard of Google. Kinda makes one wonder how secure a lead over its competition any search engine can ever hope to obtain, and what kind of chance Microsoft stands of usurping the search engine market.
Gigablast: "273,384,720 pages indexed"
Google: "Searching 4,285,199,774 web pages" That's quite a big difference.
I have to say, that list makes no sense. Maybe if you'd switch "Gigablast" with "MSN", you'd have a list of some of the major search engines, but it sounds like this guy is just tooting his own horn (and without the proper credentials).
--
http://nemilar.net - Not your grandmother's soup kitchen
Hotbot, Lycos, Mamma.com, Iwon.com, wisenut.com, looksmart.com, teoma.com, alltheweb.com, deja.com, directhit.com, excite.com, go.com, infoseek.com, invisibleweb.com, flipper.com, messageking.com, magellan.com, nbci.com, snap.com, northernlight.com, openfind.com, webcrawler.com
ahh the dotcomfallout
at least www.cowboynealsproncollection.com is doing well
First, corner the name "Google."
Second, work your fscking asses off for several years.
Third, become an over-night success!
and to the post above this.. what do 2 trillion hits matter against 2 million if they can't get what you really need up onto the first page
Words are only yours until someone else uses them...
"and everyone's a little bit nervous to see what it's bringing.'"
Money. Lots and lots of money.
Mod point free since 2001
That way, I could share the load with people with similar interests as myself.
For example, I would like a search engine that was more up-to-date crawling the PR of my competitors, but couldn't care less about most other companies. If I were running my own node of a P2P engine, I could set my node to focus on that, and anyone else who shared my interests could tap into it.
I wouldn't consider Gigablast a major contender, yet.... It looks nice, minimalist like google. The ads (Gigabits) are in a separate section. Also, Gigablast appears to be handling the Slash blast just fine. If it can survive /., that's a good sign. I haven't had much time to test out its search capabilities, and yes, it doesn't have that many pages indexed compared to google, but it has a chance. I could see it picking up.
Help Fight SPAM today!
Microsoft at the party would probably look something like this
"Pass the dip, guys!"
I mean, I know they're different sites and all, but isn't the yahoo site just the google search bar with all those category links added?
We use Gigablast as a back fill for one of our search engines. His stuff is very speedy and he's good guy to work with.
Thalasar
...have a lot in common. Different search engines allow sites to "vote" on which ones are the most authoritative, and the best methods in one field can give insight into the best method in the other.
For example, there is the Kemeny order (named after the same guy who came up with BASIC, John G. Kemeny). Using a version of ranked ballots and sorting websites by the mean Kemeny order gives you a method that is surprisingly good at putting authoritative sites at the top and spam sites at the bottom. For those of you who like in-depth analysis and don't mind math, the following is a good site:
http://www10.org/cdrom/papers/577/
I know that other people must use search engines other than google, but who? And why? I could see netscape, because it's the default homepage for many browsers, and maybe Ask Jeeves due to the easy syntax, but why would people go out of their way to Gigablast or Looksmart? Who's even heard of those two?
Apple has never claimed not to be evil, they're just very stylish about it.
Whoa, hold on. Wrong site. Never mind.
"Have you ever thought about just turning off the TV, sitting down with your kids, and hitting them?"
What about BOOBLE.
I found over one million hits for XXX and not even one hit, as far as I could tell, to do with the fucking Vin Diesel piece of shit movie.
What about Lycos you insensitive clod? They're still around.
In the UK around the year 2000, they advertised Lycos on the TV. The advert featured a bagpiper who had a kilt and no underpants and asked Lycos to find some underpants. A dog then went off at great speed, and came back with underpants in his jaws, and then, the bagpiper could safely play the bagpipes when there were sudden gusts of wind. Anyway, just for fun, I typed in 'underpants', on Lycos and the first item it came up with was a pornographic website. However, this was lycos.com, and not lycos.co.uk which is what was advertised.
Dude, it's called a magnifying glass.
We all win. With the increasing # of sites, content, web services, spam, popup attacks, and "please allow us to rape your computer" certificates to download, (that's the main reason I use Firefox when on Windows now: because you can't tell I.E. to not accept those damned installation certificates, nor block requests to change the homepage.) it becomes increasingly difficult to find what you're looking for, especially when it's not something that everyone else looks for, via Google's site ranking technology. Because they fight to be the best, we get cool things like ftp searches, grep and regexp searching of dmoz.org, video, image, and music searches, even linux and bsd search-specific pages. gMail, Microsoft's entry, and now Gigablast are all rewards we get to reap from each company attempting to set its roots deeper into the Internet like weeds vying for the same piece of dirt. We are extremely lucky, but then I doubt more than a handful of search engines will ever hold top ranks at one time, due to the fact that they are so specialized in what they do. Just hope Gigablast and Google don't decide to create a new IM service, too.
--I gots 99 problems but a new machine ain't one!
AMD! Asus! Whoot! 6 years!
By placing this on /. he got:
(("Slashdot serves 50 million pages per month"/(# users actually checking out this story))*number of searches tried) + a residual amount that might actually use this search engine more
And what they might be interested in.
I like AV because it's the only one (that I know of) that supports advanced embedded Boolean. Many a time Google fails to produce, and a well-built AV search will pop out what I'm looking for - albeit from a smaller selection.
If there is hope, it lies in the prowles.
What about hotbot? Lycos?
bastard made me shoot jello out my nose, fucking ow...
"Sic Semper Tyrannosaurus Rex."
What about Amazon/A9? We've seen enough hype about them this week to know they're in the search game too.
So how do you make a search engine and not get sued for infringement, or at least be able to win in the lawsuits?
here maybe?
"Sic Semper Tyrannosaurus Rex."
"Mall security???? You want HK's and Starlight scopes for mall security?????"
just as I'm pulling an all-nighter at this moment trying to embed a custom search engine into an app for use on an intranet.
Actually what is more interesting is Nutch and Mozdex, which seem to be based around Lucene (what I am using to build my own search engine embedded into a Horde framework app). Although probably a lot simpler than the industrial grade stuff, for someone who has been used to throwing a word at an input screen and magically getting back results, the insight into the inner workings of search engines is very interesting.
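For anyone curious about those inner workings, here is a minimal sketch of the core data structure, an inverted index mapping each term to the documents containing it. This is plain Python for illustration, not Lucene's API; the function names (tokenize, build_index, search) and the sample documents are made up for the example, and real engines add ranking, stemming, and on-disk storage on top.

```python
from collections import defaultdict

PUNCTUATION = ".,!?\"'()"

def tokenize(text):
    """Lowercase the text and split it into words, dropping edge punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "How to build a search engine",
    2: "Search engine spam and ranking",
    3: "Building an intranet app",
}
index = build_index(docs)
hits = search(index, "search engine")
```

Lookups touch only the posting sets for the query terms, which is why the "throw a word at it" experience can stay fast even as the document count grows.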
I'd just prefer it if search engines would have enhanced rules for the robots.txt file so a webmaster could tell them more specifically how they want to be searched.
Yes, I know you can put in a delay between page searches, and you can deny access to parts or all of the site, and you can even tell some or all crawlers to take a flying leap, but I'd like to tell them at the front door, "Search on Wednesday, make it fast, do a thorough job, and don't come back for a week."
Too much to ask, right?
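For reference, a rough sketch of what a polite crawler can already read from robots.txt, using Python's standard urllib.robotparser. The robots.txt content, the "ExampleBot" user agent, and the example.com URLs are invented for illustration, and Crawl-delay and Request-rate are non-standard extensions that not every crawler honors. Nothing here expresses "search on Wednesday"; that part of the wish has no directive I know of.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; Crawl-delay and Request-rate are
# extensions that only some crawlers respect.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Request-rate: 1/5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds the site asks crawlers to wait between requests.
delay = parser.crawl_delay("ExampleBot")

# Requested rate as (requests, seconds).
rate = parser.request_rate("ExampleBot")

# Whether a given URL may be fetched at all.
allowed_public = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "http://www.example.com/private/x.html")
```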
"Read your history, you dolt..."
The UN says this, but you probably imagine the UN is some sort of conspiracy.
Idiot.
Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.
Oh yeah real nervous. They're getting on the bandwagon late; too late to monopolize this particular free (as in shut the fuck up) service. If by some miracle they produce something 'threatening', it will be because it's good or because the others have slacked off.
Gigablast sucks : Proof - I entered my name and Gigablast says "no results". Did u mean "something thats not my name". No thanx I did not
Google : My site is the first !!!
And of course I refuse to believe that anyone in the world would be interested in anything but my home page.
are there any open source search engines out there that have wide spread use?
and if not, why hasn't one been tried yet? (not to be cynical, but) i mean, there's open source everything else, so why not a search engine?
actually that's standard SQL syntax.
it's the syntax you would use for PostgreSQL, MySQL and Oracle.
surprise, surprise: seems like SQL server is the odd one out.
The Iraqi invasion has happened, and there's not a lot anyone can do about it. A sudden withdrawal by Coalition troops would be a bad thing.
I don't know about your Clinton points - Democrats have tended to be pro-Jewish in the past, but the recent endorsement from Bush looks like a blatant attempt to win the less-intelligent Jewish vote to me.
Remember, the only way to resolve this sort of problem is through democracy. Look at Schleswig-Holstein for a situation where the victors (the Danes after WW2) took only the land that wanted to be theirs after a plebiscite. The Israeli problem can be resolved in a similar civilized way.
From the article:"There I designed and implemented the Artists' Den for Dr. Jean-Louis Lassez, the department chairman of computer science. It was an open Internet database of art. Anyone could add pictures and descriptions of their artwork into the searchable database. It was a big success..."
That is the worst HTML I have ever seen, there's not even a starting or tag. I've looked through the source and I haven't got a clue how the browser is getting a title.
The idealistic part of me hopes that what this newfound competition will bring are more accurate searches. The cynical side of my being believes that it will be a no-holds barred advertising onslaught that will cause us to see a resurgence in the "Search Engine Optimization" business that has helped clog search engines in the first place. At any rate, interesting times are ahead of us...and now that competition has heated up it will still be up to the user to separate the wheat from the chaff. And so it goes.
Requiem
I've found CometWay to be quite useful. (Or does this fall under some sort of meta search?)
I really only use google, but I've been able to find things here that I haven't on google, because of their categories. (Like wacky shell invocations people use as their sigs and what the hell they do.)
You give us a bad name. Shame.
The most interesting assertion in the article was that Pagerank was useless. He says Google's real win is its ability to cache a copy of the page and show you a summary including your search terms. I do use that a lot to quickly exclude irrelevant pages.
He said that his internal tests at Infoseek showed that pagerank didn't substantially improve the value of searches over simpler link analysis algorithms. I find that interesting, because I've worked with that algorithm and I know it's a stone bitch to compute.
He might well be right. I like Google over the other search engines because the interface is simple and clean, and I find it pleasant to use. I'm reminded of Donald Norman's book on Emotional Design, about how we can get really attached to things that work for us.
Google sells itself on pagerank, but at the very least it's insufficient against "search engine spam". If pagerank is less important than speed and utility, maybe I'll have something else programmed in to my Firefox search bar. But not today.
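To give a feel for why the computation is expensive, here is a toy power-iteration sketch of the published PageRank idea. It is not Google's or Infoseek's code; the four-page link graph, the 0.85 damping factor, and the 50-iteration cap are assumptions chosen for illustration. Every iteration touches every link, so on a web-sized graph each pass is a pass over billions of edges.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank along links.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: "c" is linked from three pages, "d" from none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

In this toy graph the most-linked page ends up on top and the page nobody links to ends up at the bottom, which is also exactly the property link spammers exploit by manufacturing inbound links.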
I am very impressed.
how about an open search engine? is it possible??? we could use p2p to index the web... would sure beat google...... it seems to be a good idea....
post below.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
I dunno. I better google it.
"and everyone's a little bit nervous to see what it's bringing."
Embrace, extend, ??, profit.
??=buy & close
why try and make the results pages look exactly like google's with the green URLs printed under site listing and general formatting of results?
they should make themselves stand out as something new and different, rather than try to imitate google.
otherwise it does look like a good search and the "gigabits" (the related searches that appear at top of results) are an interesting idea.
is posted on a weekday next week just before lunch hour eastern time.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
It scoured newsgroups for pr0n and presented it in an organized way. What's interesting is that Matt omitted "hornyporny.com" from his bio site. I wonder why?
Anyway, here's what Matt looked like circa 1998, when he used to be an infoseeker.
Fave quote from that article..
However, I think that search engines, if they index XML properly, will have a good shot at replacing SQL.
Discuss.
Web Hosting Reviews
I've often wondered why Google doesn't put up an "unsafe" image search option? (e.g. leave out all the images it deems "safe").
Then again, it hardly needs to most of the time...
Everybody knows what Microsoft is bringing. Well almost everybody. Okay, I'll spell it out:
1: Bring lots of money.
2: Buy out a competitor.
3: Rename it Microsoft Search.
4: Attempt to trademark the word "Search".
5: Bind it tightly into Windows as an essential service.
6: Don't get it right until version 3.0.
7: Profit!
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
I've noticed lately that Google seems to be filling up with websites wanting to sell you stuff (even if they don't use spamming techniques). Perhaps these little guys can put the pressure on Google to get some better algorithms. Or perhaps it's time for Google to fade into the past like Altavista did a couple years ago and make way for the new.
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
How long before other search engines start considering this stealing? I mean, I could have a search engine running tomorrow, if all it did was link to Google and return hits to my own bannered page.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Am I the only one who's never heard of Gigablast...
Nope, I'd never heard of them either. We BOTH have now.
but then not too many years ago, I remember a time when I'd never heard of Google.
Yeah, until they submitted that "infomercial" story to Slashdot and snuck their name in there with the big names of the time, as though they were one of just a few search engines. Oh wait....they didn't do that. Google succeeded by being a good search engine.
I'm surprised the title of this story wasn't "make money fast."
- Plucene
Port funded by Kasei, released Feb 2004.
You may license my patent on this idea for reasonable terms in exchange for shares of your company's stock.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
2) Okay, the joke's over. Bring back Clinton. He never would have [pulled UN support troops and watched while Hutus hacked 800,000 Tutsis to death with machetes, then dallied over the semantics of 'genocide'.] Whoops! I guess we should bring Clinton back after we arm the Sunnis and Shiites with machetes? Or maybe if we need to carpet bomb Iraq to draw attention away from his latest blowjob?
All joking aside, your wish for the US to get the hell out of the Mid East is smart, but it's far too late. Talk to your terrorist pals and let them know that America will be invading their favorite host countries for generations to come because of one fucking idiot: Osama Bin Laden. You can also thank OBL for giving Sharon a blank check signed by the good ol' US of A. Go ahead and smear a wad of credit on Yasser and Hamas concurrently.
If you really want the US the hell out of that mess, just convince radical Islamic elements to stop blowing up innocent civilians, and to carry out their disagreements via discussion, argumentation and/or treaty. Take some goddamned responsibility for stupid acts of terrorism, bring it to a halt, and US public opinion will force an immediate withdrawal. Otherwise, it doesn't matter which party is in power, we're there to stay.
Given that, plus the fact that he's spidered my worthless blog, I'm pretty impressed. Definitely something to watch.
Bleh!
you all think ur smarter than ur
you need an awesome combination of 2 things: ALGORITHM and STORAGE
Some ppl talk that a "killer" algo would get you there, but seriously, a search engine is supposed to give you relevant results in everything, and if you search for "Senior employees working at Sugarcane plant", you won't get that with just good scripting. Where the hell do u store it then
But algos are also important, as that's what's keeping us visiting Google rather than MSN
In all seriousness: Make the address and interface simple. For instance Google is very easy to type. It has become as easy to type as my name. You can keep your fingers on two keyboard buttons for 4/6th of the way through the name and move the two fingers to 'l' and 'e'. It doesn't take long to load up, either, which is the second best thing one can do. Vivisimo is a crazy address, and will never become popular. You first heard it from me.
-Xeon
Real programmers can write assembly code in any language. -- Larry Wall
If Microsoft comes to the table with their own search engine, you can be assured that there will be dozens of bugs that we can exploit to make our sites rank in the top 10, or that coveted "first page" of search.
My Doctor prescribed daily nasal saline irrigation, hehe
Most websites that review search engines have a high respect for Gigablast (e.g. webmasterworld.com). For a small operation he does an excellent job and has a very good product with potential...
Even SearchKing is better known than Gigablast... and SearchKing pretty much faded into obscurity after the Google/SearchKing problems a while back.
'In other news, Google announced the buy-out of Gigablast. The newly-formed company will be called Giggle.'
'He who has to break a thing to find out what it is, has left the path of wisdom.' -- Gandalf to Saruman
I could have a search engine running tomorrow, if all it did was link to Google and return hits to my own bannered page.
And if you did you would have a web site that didn't really provide its own content but rather generated it by retrieving data from elsewhere on the Internet. Is this not what google does? But they add value to it, just as you would have to do in order to see any significant traffic.
The Internet is us and we're a bunch that lives just about everywhere and does just about everything. If google, yahoo, Microsoft, SCO, IBM and a hundred like them all disappeared from the face of the planet tomorrow, the Internet would be just fine. What they bring to our table is of insignificant value compared to what we bring. And what value they add has value only because we've invited them to our table. These days too many companies benefiting from access to our Internet are like guests who, having brought passable wine, wish to claim credit for the success of the feast.
If you'd like to set up a site that pulls its content from google, go for it. Just be sure to allow them to take from you what you would take from them. That's how our feast works. We all bring a little something to it and share.
Making the world a better place, one psychotic episode at a time.
That's a poetic way to put it.
My way of looking at it is that for the first time, everyman has a microphone and a soapbox from which to speak to anyone in the entire world who wishes to listen. While I realize that absolutely has to upset many people entrenched in power, I feel it is the finest example of free, equal, and unfettered speech that has ever existed -- and is very much worth defending greatly.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Nope, SQL Server handles that syntax just fine. However, unlike C, the ; is unnecessary unless you're stringing multiple commands together on the same line. This is not SQL Server syntax, but ANSI SQL syntax. Most (all?) SQL developers don't bother with semicolons unless they're doing multiple commands on a single line. And since any good DB developer is not writing dynamic SQL (ie, "SELECT * from foo" from PHP, ASP, Perl, etc), but calling stored procedures through proper mechanisms (ie, not creating a dynamic query of "EXEC sp_foo param1, param2"), they typically don't bother.
You know what I want?
...
My own search engine, running on my POSIX-capable machine, indexing and organizing 'bookmarks', though I suppose at that point they won't be called bookmarks, and thank god for that.
RIP, bookmarks!
Anyway, my own search engine need not be for anyone but me, and it can search and index and process whatever web content I feed it, for later 'search' and organization.
That would be -far- more useful to me than another 'gotta be on the net' web-service
; -- the corruption of government starts with its secrets. a truly free people keep no secrets. --
Search is a fiercely competitive arena, even though there are really only five Web search companies today: Google, Yahoo (Altavista/AlltheWeb/Inktomi), Looksmart (Wisenut), AskJeeves (Teoma), and Gigablast.
I am a Chinese speaker, and East Asian writing is traditionally character-based, with no alphabet. That means we don't separate words with blank spaces but rather dosomethinglikethis. This characteristic of the language causes many problems for search services, because you never know whether splitting that string into "dos ome thingli keth is" is right or not. We have to build a dictionary into the search engine, which is different from many Western languages. So I don't believe there are only five search engine providers in this world. At least, I know of a list of more search engines developed to support East Asian languages.
http://www.ieaa.org/~adrian/
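The dictionary approach described above can be sketched with greedy forward maximum matching, one simple segmentation technique among several. This is an illustration only, reusing the poster's English stand-in string rather than real Chinese text; the function name and the tiny dictionary are assumptions, and production segmenters use far larger dictionaries plus statistical models.

```python
def segment(text, dictionary):
    """Split unspaced text into words by always taking the longest dictionary match.

    Falls back to a single character when nothing in the dictionary matches.
    """
    max_word_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical dictionary for the poster's example string.
dictionary = {"do", "some", "something", "thing", "like", "this", "is"}
good = segment("dosomethinglikethis", dictionary)
# good == ["do", "something", "like", "this"]

# Adding one extra word shows how easily greedy matching goes wrong.
bad = segment("dosomethinglikethis", dictionary | {"dos"})
# bad == ["dos", "o", "m", "e", "thing", "like", "this"]
```

The second call is the ambiguity problem in miniature: one plausible-looking dictionary entry at the start derails the rest of the split.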
I have heard of Gigablast, but I've never been impressed by it. (I wrote a review back in 2002.) Most search engine optimizers love Gigablast, however, because it's such an easy engine to game.
It's a fairly old-school engine: indexes whatever it can and favors pages that are keyword-heavy. It's almost too easy to spam. I don't think there's anything PageRank-like in the algorithm, otherwise, it wouldn't be able to add pages to the index "instantly". (PageRank is too computationally intensive for that.) Gigablast still thinks meta-tags are a great idea! While the hardware setup might be innovative (I'll leave that to the hardware experts to decide), the engine software itself seems about ten years behind the times.
Like many posters here, I doubt a one-man outfit is going to take down Google (although many search engine optimizers would like it to). Gigablast has had two years to make an impression, and it hasn't. A company on an acquisition binge might be crazy enough to buy it, but I wouldn't hold my breath.
Proud to be / Smiley-free / Since Nineteen / Ninety-Three
Wonderful as Google is, I'm finding more and more searches don't produce useful results.
I keep getting high rankings from sites like bizrate and kelkoo, which don't have any content whatsoever, but have convinced google to show pages that say "search for best prices on xxxx" where xxxx is my search term. Often the problem is so bad that I don't see any sites with content until page 2 of google.
Another issue is with searches for song lyrics. There are dozens of identikit advert sites which drown a tiny (and often inaccurate) text payload in a swarm of adverts. Finding a site written by someone who cares about accuracy is getting impossible.
What I want is sites ranked by volume of relevant content, with a negative ranking element for duplicate sites and a stronger negative ranking for multiple adverts.
Oh, and what I would also find useful is a 'go (after blocking adservers)' button instead of a 'go' button.
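The ranking wish above could be sketched roughly like this. Everything here is hypothetical: the page fields, the two weights, and the formula are invented to illustrate the idea of rewarding relevant content while penalizing duplicates and (more heavily) adverts; no real engine is claimed to score this way.

```python
def score(page, query_terms, duplicate_weight=0.5, advert_weight=1.0):
    """Score a page by relevant word count, divided down by penalties.

    advert_weight is larger than duplicate_weight so adverts hurt more.
    """
    words = page["text"].lower().split()
    relevant = sum(1 for word in words if word in query_terms)
    penalty = 1.0 + duplicate_weight * page["duplicates"] + advert_weight * page["adverts"]
    return relevant / penalty

def rank(pages, query_terms):
    """Return pages sorted from best to worst score."""
    return sorted(pages, key=lambda p: score(p, query_terms), reverse=True)

# Hypothetical pages: one fan site with real content, one advert farm.
pages = [
    {"name": "advert-farm", "text": "lyrics", "duplicates": 5, "adverts": 12},
    {"name": "fan-site", "text": "full lyrics of the song with verified lyrics notes",
     "duplicates": 0, "adverts": 1},
]
query_terms = {"lyrics", "song"}
ranked = rank(pages, query_terms)
```

The hard part in practice is not the arithmetic but measuring the inputs: detecting near-duplicate sites and counting adverts reliably are both open-ended problems.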
A pizza of radius z and thickness a has a volume of pi z z a
Did you mean: Radio control speed controller
I liked this quote: "Now that the Internet is very large, it makes for some well-developed memory. I would suppose that the amount of information stored on the Internet is around the level of the adult human brain. Now we just need some higher-order functionality to really take advantage of it. At one point we may even discover the protocol used in the brain and extend it with an interface to an Internet search engine."
The protocol used in the brain? That can't be a good direction to go. I mean, if it's anything like my memory and honestly, the memory of most people I know, it's definitely going to be a step backwards. Human brains can hold a lot of information, but retrieval is definitely not their specialty. I can see it now. Type in my search terms and the engine comes back with, "ummm, it's right on the tip of my tongue. Okay, I don't have a tongue, but I just about remember it. Give me just a minute to think about it. umm... umm... Nope, it's gone. Nevermind."
Microsoft is also coming to the party, and everyone's a little bit nervous to see what it's bringing.
More bloody potato salad, I expect.
"You get all the fun of sitting still, being quiet, writing down numbers, paying attention...science has it all."
I'd fire you in an instant with sloppy code like that.
You forgot to add "Order by nipple_size desc".
1. Buy license for existing web search engine.
2. ???
3. Profit!
+1 Insightful, -1 Troll. What can I say, I'm an Insightful Troll.
of creating a modern, scalable search engine.
Scalable? Please. One main reason I read /. is because it's real and cool. I cannot stand it if it starts using pop-IT lingo such as every well-polished vendor spills out to me on a daily basis.... If I see "interoperability" down further on the news page today, then I'm gunno not read /. for a week!...
"All great things are simple & expressed in a single word: freedom, justice, honor, duty, mercy, hope." --Churchill
Have a look at a9.com, which is Amazon's new search entry. Aside from a good web search engine, it provides a "history" of your previous searches and other innovative features.
Sorry - no stinkin' Java for massive internet web searching. I am Gary Google and I use C++.
funniest /. comment in months!
Yahoo (Altavista/AlltheWeb/Inktomi)
Altavista is DEC, innit?
well they should be! the titans will fall! muahahahahahahahahahahahahahahaha -cough-
dumbfind.com
Replying to myself to add: Matt Wells just called me at home to complain about my review of Gigablast. He apparently thinks I wrote a negative review of his site because I was abused as a child. It's an interesting theory, to say the least.
I wonder if he's planning to call everybody who criticized Gigablast in this thread?
Proud to be / Smiley-free / Since Nineteen / Ninety-Three