Search Beyond Google
An anonymous reader writes: "'Search Beyond Google', the cover story of the March issue of Technology Review, is one of the few current Google stories that discusses whether their technology can stay ahead of the competition in the months to come."
I'm a heavy google user, but I still miss altavista's ability to search for stems. For example, an altavista search for "slid* rul*" will get 'slide rules,' 'sliding rulers,' and plenty of other variations. Google does support whole word wildcards (try "miserable * failure") but stems are even more useful.
--- Often in error; never in doubt!
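For anyone who hasn't used stem wildcards, here is a rough sketch of what a query like "slid* rul*" asks for. It is only an illustration: `stem_search` and the sample documents are made up, the matching is done with Python's `fnmatch` rather than anything AltaVista actually ran, and it ignores word order and adjacency.

```python
from fnmatch import fnmatchcase

def stem_search(query, documents):
    """Return the documents in which every wildcard pattern matches some word."""
    patterns = query.lower().split()
    hits = []
    for doc in documents:
        words = doc.lower().split()
        # Each pattern (e.g. "slid*") must match at least one word in the document.
        if all(any(fnmatchcase(word, pattern) for word in words) for pattern in patterns):
            hits.append(doc)
    return hits

# Hypothetical sample documents, just to exercise the function.
docs = ["slide rules for sale", "sliding rulers", "slippery rules", "a slide show"]
matches = stem_search("slid* rul*", docs)
```

With these sample documents, only the first two match both stems.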
People seem to think Google is simply a place to find HTML pages. You type in your words, and poof, you get some relevant sites. Could this be replaced in 3 months? Google has a huge index, a very good search algorithm, and works for most people, but (in theory) someone might come up with a working alternative in that period. However:
And more. Babelfish translation? Caching like a billion pages? Simple design, with text ads that are actually relevant? In 3 months.
Yeah, right.
Don't think of it as a flame---it's more like an argument that does 3d6 fire damage
http://search.yahoo.com/
I quite like Vivisimo (after I figured out how to make it include Google in its query by adding 'google' to the 'sources=' part of the query URL).
dogpile is also quite good, when you've got it set to display results by relevance rather than by engine.
Remember, Amazon isn't the only online bookstore, ebay isn't the only online auction site and google isn't the only search engine...
Google works approximately by modding up the sites that get linked to the most. All the contributing links seem to carry equal weight. This allows scamming by forming webrings and similar circular linking schemes.
Another approach I heard being discussed is to give links from more popular sites a higher weight, i.e., if a site has a lot of pages linking to it, the sites it links to must also be good. Apparently, if done right, you can do a few iterations and get to a better algo.
Or perhaps assign a number (karma, if you will) to each site. Then divide this karma by the number of sites it links to and add that share to each of the linked sites. Eliminate the cycles in the graph and iterate.
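The karma idea above can be sketched in a few lines. This is a minimal illustration, not Google's actual code: `spread_karma`, the `damping` value of 0.85, and the tiny `web` graph are all assumptions. It also does not eliminate cycles as the comment suggests; instead it keeps them and uses a damping factor so the iteration settles down, which is a swapped-in technique.

```python
def spread_karma(links, iterations=50, damping=0.85):
    """links maps each site to the list of sites it links to."""
    sites = set(links) | {t for targets in links.values() for t in targets}
    n = len(sites)
    karma = {site: 1.0 / n for site in sites}  # everyone starts equal
    for _ in range(iterations):
        # Every site keeps a small baseline; the rest arrives via links.
        new_karma = {site: (1.0 - damping) / n for site in sites}
        for site in sites:
            targets = links.get(site, [])
            if targets:
                # Divide this site's karma by the number of sites it links to.
                share = damping * karma[site] / len(targets)
                for target in targets:
                    new_karma[target] += share
            else:
                # A site with no outgoing links spreads its karma evenly.
                share = damping * karma[site] / n
                for target in sites:
                    new_karma[target] += share
        karma = new_karma
    return karma

# Hypothetical four-site web: a, b and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
karma = spread_karma(web)
```

In this toy graph "c" ends up with the most karma, and "a" outranks "b" and "d" because its one inbound link comes from the most popular site.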
There are three very distinct elements involved in creating a powerhouse search engine:
- A large crawl: A search engine with nothing in its database isn't going to work very well. A search engine needs as big a crawl as possible in order to have any results at all. This takes huge resources in terms of bandwidth and computing power. Some of the early search engines met their demise when they couldn't afford to keep their crawlers growing as fast as new web content came out.
- The Sorter: Once the long list of results that match the keywords is pulled out of the crawl, a sort needs to be applied in order to locate the best results and present them first. Google got vaulted to the top because PageRank was better than anything anybody else had put out. However, PageRank isn't perfect, so there is still room for somebody to make something better than PageRank.
- Promotion: A web site just sits there unused if it isn't promoted. Google never spent much on advertising and just relied on word of mouth, since it was so strong in the other two areas. And now that everyone turns to them first without even checking other engines, they have the advantage of a strong brand image. However, we've seen plenty of cases where better technology has been beaten out by better marketing. If somebody's tech passes Google, nobody will know about it without marketing. Therefore, look for the challengers to launch major ad campaigns inviting people to at least try them before they assume Google is better.
Can anybody put it all together? We're about to find out...
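To make the "crawl" and "sorter" pieces concrete, here is a toy sketch. Everything in it is hypothetical: the `pages` dict stands in for a crawl, `scores` stands in for whatever quality score the engine precomputes (PageRank or otherwise), and `search` is just an illustrative name.

```python
def search(query, pages, scores):
    """pages maps url -> page text; scores maps url -> precomputed quality score."""
    terms = query.lower().split()
    # Step 1: pull out every crawled page that contains all the keywords.
    matches = [
        url for url, text in pages.items()
        if all(term in text.lower().split() for term in terms)
    ]
    # Step 2: the sorter puts the highest-scoring matches first.
    return sorted(matches, key=lambda url: scores.get(url, 0.0), reverse=True)

# Hypothetical crawl and scores.
pages = {
    "a.example": "slide rule history",
    "b.example": "buy slide rule cheap",
    "c.example": "cooking recipes",
}
scores = {"a.example": 0.7, "b.example": 0.2, "c.example": 0.9}
results = search("slide rule", pages, scores)
```

Note that "c.example" has the best score but never shows up, because it isn't a match in the first place; the sorter only orders what the crawl can supply.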
The problem with regexps is that they can be used to create very database-expensive queries. No search engine is ever going to allow a query that returns the entire database as the result set either.
But you wouldn't be able to run a regexp against the entire document base anyway, since Google does not store the entire document for the purposes of indexing (the Google cache is for a different purpose). What kind of computing power would one need to search all documents in the Google cache with a regexp in under one second? And for more than one user at a time?
You can't handle the truth.
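A small sketch of why the regexp case is so much more expensive than a keyword query. This assumes a plain word-to-documents inverted index, which is a guess at the general shape of a search index, not a description of Google's; `keyword_lookup`, `regex_scan`, and the sample `docs` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stored documents.
docs = {
    1: "slide rules for engineers",
    2: "sliding rulers and scales",
    3: "pocket calculators",
}

# Inverted index: word -> set of ids of the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def keyword_lookup(word):
    """One dictionary lookup, no matter how many documents exist."""
    return index.get(word, set())

def regex_scan(pattern):
    """A regexp has to be run against the full text of every document."""
    compiled = re.compile(pattern)
    hits, scanned = set(), 0
    for doc_id, text in docs.items():
        scanned += 1
        if compiled.search(text):
            hits.add(doc_id)
    return hits, scanned

hits, scanned = regex_scan(r"slid\w* rul\w*")
```

The keyword lookup touches one index entry; the regexp touches every stored document, so its cost grows with the size of the whole collection, and it needs the full text to be available at all.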
I thought the whole concept of google was that it ranked pages higher if lots of other pages linked to it.
And this is exactly one of the problems that is now coming to light. Spammers set up hundreds of tiny sites that do nothing but point to each other, thus inflating their PageRanks. They've saturated Google to the point that searching for information about commercial products usually returns 2/10 legitimate pages.
At least, that's been my experience.
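Here is a rough sketch of why a farm of tiny sites pointing at each other works against any ranking that leans on inbound links. It uses a plain inbound-link count as a stand-in for the ranking, which is a deliberate simplification and not how PageRank is actually computed; all the site names are made up.

```python
from collections import Counter

def inbound_counts(links):
    """Count, for each site, how many distinct other sites link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts

# A legitimate page with two genuine inbound links.
honest = {
    "blog.example": ["review.example"],
    "forum.example": ["review.example"],
}

# Ten hypothetical farm sites, each linking to all the others.
farm_sites = [f"farm{i}.example" for i in range(10)]
farm = {site: [t for t in farm_sites if t != site] for site in farm_sites}

counts = inbound_counts({**honest, **farm})
```

Every farm site ends up with nine inbound links against two for the legitimate page, without a single outside site ever linking to the farm.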
Google recently added stemming, as a search for {quit smoke} will reveal. You can read about it in their help section. Stemming can be disabled on specific words. The update came around November 15, 2003, but is probably still in flux, so there isn't much good info about it yet.
This was posted on /. a while ago under a similar story, but in case you missed it, there is a place to report spam on Google:
http://www.google.com/contact/spamreport.html
I now have it as a bookmark so I can hit it quickly.
I discovered Discount Watcher via what seemed like a spam link on Google but it turns out to be a very cool service that finds the latest discounts on almost anything you want and turns it into an RSS feed. Now my aggregator is filled with spam. But it is spam I want.
try using this
something interesting -site:example.com
At this point there's no way to save it as a pref, but you could always drop it in a text file to keep a big list
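If you keep that big list around, a few lines of script can paste it onto every query for you. A minimal sketch, assuming the list is already loaded into a Python list (reading it from your text file is left out); `build_query` and `search_url` are hypothetical helper names, and the only search syntax used is the `-site:` exclusion shown above.

```python
from urllib.parse import urlencode

def build_query(terms, blocked_sites):
    """Append a -site: exclusion for every blocked domain."""
    exclusions = " ".join(f"-site:{site}" for site in blocked_sites)
    return f"{terms} {exclusions}".strip()

def search_url(terms, blocked_sites):
    """Turn the query into a URL you can bookmark or open in a browser."""
    return "https://www.google.com/search?" + urlencode({"q": build_query(terms, blocked_sites)})

# Hypothetical blocklist; in practice this would come from your text file.
blocked = ["example.com", "example.net"]
query = build_query("something interesting", blocked)
url = search_url("something interesting", blocked)
```

The query string grows with the list, so a very long blocklist may run into whatever length limit the search engine puts on queries.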
that's because yahoo *offers* much more than google does.
if you want a simple search box, navigate to the yahoo! search page.
SIGUSR1
The google toolbar already has voting buttons. Not quite what you're talking about, but...
--
Why the hell not? Here's some SEO: Home Inspector
Actually, there essentially is a meta-moderate link tucked down at the bottom of the page:
It's not an automated system, but it does let you report "bad moderation".
Yes, Google has tweaked their algorithms and added filters to strip out some of the obvious abuses. But lately it seems like each time they remove a link, two more replace it.
Maybe they've got some super-sneaky solution they're working on right now to remedy this. It would certainly help prevent searches like:
+product +information -buy -deals -ReferralFarmName -otherRedirectTerms -...
Yeah, I'd read that, too. But Snopes claims it's not true.
You tell me how "whilst" differs from "while," and I'll stop calling you a pretentious jackass.
Some of these smaller natural language engines are beginning to look very promising, see: answerbus, brainboost, webqa
Interesting as to why the big boys are largely ignoring this domain. I suspect old man jeeves has turned people off to the possibility of reliable QA.
Yes, this is kinda how miserable failure points to where it does. A bit on the technique behind this here.
Too big to fail? Does that make me too small to succeed?
A Googlewhack is a two-word Google query that returns exactly one result.
The term you're looking for is probably Googlebombing, which refers to deliberately placing keywords and links on multiple domains to boost a site's PageRank. Originally, Googlebombs were pranks or in good fun, like a search for weapons of mass destruction.
Now "Googlebombing" is being expanded by some to include manipulating PageRanks for commercial ends. I'll leave it to the armchair etymologists of Slashdot to decide if that is a correct use of the term.
~Idarubicin