The Future of Google Search and Natural Language Queries
eldavojohn writes "You might know the name Peter Norvig from the classic big green book, 'AI: A Modern Approach.' He's been working for Google since 2001 as Director of Search Quality. An interview with Norvig at MIT's Technology Review has a few interesting insights into the 'search mindset' at the company. It's kind of surprising that he claims they have no intent to allow natural questions. Instead he posits, 'We think what's important about natural language is the mapping of words onto the concepts that users are looking for. But we don't think it's a big advance to be able to type something as a question as opposed to keywords ... understanding how words go together is important ... That's a natural-language aspect that we're focusing on. Most of what we do is at the word and phrase level; we're not concentrating on the sentence.'"
"I'm sorry Dave, I'm afraid I can't search that."
I tend to agree with Norvig's focus on keywords and less emphasis on natural language. Trying to even define a natural language on top of a query engine introduces a layer of complexity that is probably unnecessary. Natural language also introduces a level of noise that interferes with defining, as accurately as possible, what the user is asking for.
Google has done a good job, and they get better each iteration figuring out what the user is looking for. I find their suggestion an effective way to not only constrain a query, it actually provides a way to spell check in a pre-emptive way. If you've not used this, install the Firefox Google toolbar, or use the experimental Google "Suggest". Often Google will provide suggestions in the drop down menu that refine your search in ways you hadn't considered that drive to a more direct and accurate representation of your intended query. Of course if their suggestions don't satisfy, you get to continue typing your keywords to your heart's desire.
(I have to offer an example of suggestion's effectiveness. I often Google to get to the Chicago Tribune (I don't visit there often enough to have created a bookmark, plus it's easy to do this in anyone's browser). Simply typing the first four letters, "chic", I see the first suggestion is "Chicago Tribune". A simple TAB and RETURN, I'm on the Google page with the first link or so my link to the Tribune (with the added bonus of Google's breakout of sublinks).) Your mileage may vary (Google's ranking system may vary the order and options that appear in the drop-down over time), but I find it an amazingly effective research tool (suggestion, not the Trib).
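For the curious, the "chic" example can be approximated with a plain prefix lookup over past queries. This is only a minimal sketch: the query table and its popularity counts are made up for illustration, and nothing here is claimed to be how Google's suggestion feature actually works.

```python
from bisect import bisect_left

# Hypothetical query log: query -> popularity count (invented numbers).
QUERY_COUNTS = {
    "chicago tribune": 900,
    "chicago weather": 700,
    "chicken recipes": 650,
    "chicago bears": 500,
    "china map": 300,
}

# Keep the queries sorted so all entries sharing a prefix sit next to each other.
SORTED_QUERIES = sorted(QUERY_COUNTS)


def suggest(prefix, limit=3):
    """Return up to `limit` known queries starting with `prefix`, most popular first."""
    prefix = prefix.lower()
    start = bisect_left(SORTED_QUERIES, prefix)
    matches = []
    for query in SORTED_QUERIES[start:]:
        if not query.startswith(prefix):
            break  # sorted order means no later entry can match either
        matches.append(query)
    matches.sort(key=lambda q: -QUERY_COUNTS[q])
    return matches[:limit]


print(suggest("chic"))
```

With this made-up table, "chicago tribune" comes out on top for "chic", which is the TAB-and-RETURN case described above.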
Natural language is mostly trying to guess intent with structure and key words (as opposed to keywords), but at the end of the day, if you filter out the natural language, and focus on keywords you're going to end up in close to the same place.
I wonder if any of these types of translation or recognition engines use Lojban as an intermediary. The unambiguous yet rich grammar of Lojban is ideal for representing different languages. Eventually, it will be used directly.
The problem with natural language searches is that natural language itself is a moving target. Sure, ten years ago "How do you change the air filter in a Toyota Camry?" would have been a legitimate question to ask a search engine online, but these days it would probably be asked like "lol how do u chng filtr in my pos car? kthxbye :)". I don't know how Google is supposed to keep up with that.
You can't take the sky from me...
What is 'natural' about the English language?
Isaac Asimov's fictional Multivac was a huge computer with some near-universal knowledge database that answered natural-language questions, giving Asimov all sorts of opportunities to present philosophical conundrums as entertaining short stories.
In the 1960s and thereabouts, when I used to hack around on minicomputers, but personal computers weren't well known to the general public, I always found it difficult to explain what computers did. One of their commonest questions was "Well, how does it work, do you type in questions and does it answer them?" Programming in assembly language didn't really fit that description.
Many technological fantasies seem to remain surprisingly distant. I tried ViaVoice and gave up: it's not a "voice typewriter." Roomba is not a general-purpose housekeeping humanoid-form robot, and neither are the machines that weld automobile chassis.
However, it seems to me that Google is within striking distance of Asimov's "Multivac" fantasy.
Incidentally, if you type in queries as complete sentences Google seems to do any worse than if you don't. Sort of the converse of adventure games, where one begins by typing "Walk over to the table on the left and pick up the silver key with your left hand" and quickly learns to use telegraphic style: "Go table. Take key."
"How to Do Nothing," kids activities, back in print!
It would actually be a great advance, but the resources required would not offset its advantages since 99% of the time you can find what you're looking for using keywords and phrases.
I tell new users that they should just ask Google a question in plain English. That gives them a more natural context in which to embed their keywords. I know Google is just picking up on the keywords and ignoring the filler words, but it usually gets the correct results and it's a lot easier for people who are just starting out on the Internet.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
I meant that "if you type in queries as complete sentences, Google doesn't seem to do any worse than if you don't." That is, even though it's not an advertised feature, you can use natural language with Google if you like. It just doesn't help you; you might just as well use truncated phrases.
"How to Do Nothing," kids activities, back in print!
I suggest they focus their efforts on preventing websites from gaming the system.
How many times have you entered a search term that is a company's name, expecting to see that company's link on the first page, only to be shown a bunch of links to dumb ass search sites that have gamed the Google search engine?
text-to-speech or speech-to-text is also useless (unless you're blind/deaf/driving a car)
the idea of interacting with a computer like a human is an artificial hangover from being introduced to the computer the first time. after using it for a while, you realize that interacting with a computer, in small limited ways, like searching information, is easier NOT using natural language
for the very simple reason that it takes more thought, and more typing to interact naturally. it is easier to train a human to interact with a computer than it is to train a computer to interact with a human. and for the human, it is more rewarding, because the human realizes he doesn't need to exert so much effort
"what is the capital of france?"
versus
"france capital"
if you were to shout "france capital" at someone, it would be rude and confusing. but for a computer, it's actually superior
it is the conservation of communication effort at work here that wins out over natural language in computer interaction
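a rough sketch of why the two queries end up in the same place: strip the filler words and compare what is left. the filler list below is an assumption made up for this example, not anything a real search engine is known to use.

```python
import re

# Assumed filler-word list for illustration; a real engine would not hard-code this.
FILLER_WORDS = {"what", "is", "the", "of", "a", "an", "how", "do", "you", "in", "to"}


def to_keywords(query):
    """Lowercase the query, drop punctuation, and remove filler words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in FILLER_WORDS]


natural = "what is the capital of france?"
terse = "france capital"

print(to_keywords(natural))
print(to_keywords(terse))
# Same keyword set either way, but the natural form costs more keystrokes.
print(len(natural), "characters versus", len(terse))
```

both forms reduce to the same two keywords, so the extra typing buys nothing under this (simplified) model.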
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
Typing "What is the capital of France?" won't get you better results than typing "capital of France." ... Most of what we do is at the word and phrase level; we're not concentrating on the sentence. We think it's important to get the right results rather than change the interface.
This misses situations like searching for "That sf-short-story where the crew of the visiting spaceship is given a dog as a present" in which googling failed, at least for me, or, more technically, when you have absolutely no idea about what the relevant terms within the outcome might be. In short, if you have a real question.
CC.
TaijiQuan (Huang, 5 loosenings)
These days, hardly any user enters queries in the form of natural language questions, judging from log files. That was different a couple of years ago.
Just like "Click here to do X" isn't used as much on Web pages anymore. People now tend to know that they can click on underlined text to find out more.
Natural Language from a linguistics perspective incorporated into a search engine will be truly innovative technology. After reading the article and his wording, it seems clear that it isn't so much that pursuing search via natural language is fruitless, but that it is borderline unattainable at the moment. Using keywords allows the person performing the query to filter their own natural thought.
"Hm, I wonder how many moons Saturn has? I will Google 'Saturn+Moons.'"
This method is by far the most effective and least time consuming today, but the day we are able to think what we want and then search for what we want with no filtration necessary will coincide with the advent of true artificial intelligence. Linguistics (and thus, 'Natural Language') is one of the most complex studies in the world. The creation, evolution and implementation of different dialects within any given lexicon are very difficult to understand, let alone across different languages. 'Natural Language' search will be impossible to truly implement until we fully understand the way we communicate to one another. Simply extracting words or operators, clearly as we know, simply doesn't work. It is the complex relationship that matters. But once we figure that out- and we will- we will be at the next great step forward.
art is science made clear. -cocteau
Is that natural language stuff is hard. And even more so, AI, which was so promising to so many of us in the 80s turned out to be so hard that it is basically impossible. I think it caused a real shift in the natural language research, sending us to use statistics and probability, since basically AI never got going,
do people really type questions into search boxes? that always stumped me about the ask jeeves thing....who the crap really ASKED anything. I thought you just googled what you wanted to know about (or nowadays, hit the wikipedia page for it for starters).
Maybe I'm just not up on my search engine technology (or, rather, I don't know anything about it). I just don't know anybody who'd think to put a regular question into google.
"That's easy! The capital of France is 'F'."
----------
Any problem can be made unsolvable if there are enough meetings made to discuss it.
Esperanto, for one, makes a perfect study for researchers. Your brush is too broad. Your cynicism and jadedness are disappointing.
I wonder if MS or Yahoo are listening...
RS
Shoes for Industry. Shoes for the Dead.
If you have the opportunity to look at query logs, you see how dumb most search engine queries are.
First, a big fraction of queries are simply navigational. Many are just URLs. The major search providers recognize these in the front end machines and send back canned answers, without even passing them to the real search engine. If you type "myspace" into Google, very little work is expended returning the canned reply.
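As a rough illustration of that front-end shortcut, here is a minimal sketch. The canned-answer table, the URL pattern, and the `full_search` stand-in are all assumptions invented for the example; no search provider's actual front end is being described.

```python
import re

# Hypothetical table of navigational queries with precomputed destinations.
CANNED_ANSWERS = {"myspace": "http://www.myspace.com/"}

# Loose pattern for queries that are themselves URLs or bare host names.
URL_PATTERN = re.compile(r"^(https?://)?([a-z0-9-]+\.)+[a-z]{2,}(/\S*)?$")


def full_search(query):
    """Stand-in for the expensive trip to the real search engine."""
    return f"<results for {query!r}>"


def front_end(query):
    """Answer cheap navigational queries directly; pass everything else on."""
    normalized = query.strip().lower()
    if normalized in CANNED_ANSWERS:
        return ("canned", CANNED_ANSWERS[normalized])
    if URL_PATTERN.match(normalized):
        if normalized.startswith(("http://", "https://")):
            return ("url", normalized)
        return ("url", "http://" + normalized)
    return ("search", full_search(normalized))


for q in ["MySpace", "www.example.com", "saturn moons"]:
    print(q, "->", front_end(q)[0])
```

Only the last query in the loop reaches `full_search`; the first two are answered without touching the index.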
After that, most queries are one word. Phrase queries are less common.
Few people seem to have noticed, but Google started returning results based on synonyms and homonyms a few weeks ago. There have been some significant algorithm changes recently.
Less than 1% of queries use any operators, like '""' or '-'.
The real problem with natural language queries, though, is that "Ask Jeeves" was a flop. Remember Ask Jeeves? That was a system designed to process queries written as sentences. But it wasn't used that way, and didn't succeed commercially.
I think Norvig's lying. Google may not be pursuing linguistic structure above the phrase level in searches, but I'd bet a donut they're working their asses off trying to analyze crawled docs linguistically. To get relevance, they need to extract what a document is about. That implies sentence-level syntax analysis, which is input to sentence-level semantics, which is input to paragraph-level semantics, which is input to "pragmatic" analysis. I think what he's not saying is that the place the linguistic research dollars are going is elsewhere than parsing "Where is Paris?"
Answer: This. OK, programming joke aside, seriously...natural language should not be incorporated into search engines. What about generic questions, such as my subject line? What would Google return? What SHOULD Google return to that? Do a tracert on the user's IP, and answer with a map? Seriously, to implement natural language searching capability would be quite a feat. Especially in the age of, "ROFLMAO wtf iz 4 computa?!!1"
"Know but never fear the consequences of your actions."
For those who are speculating about where they are going, a possibility is in a recent (within 5 years) article by William A. Woods, one of the top natural language researchers. His work at Sun was about using noun phrases (turned into concepts) as search guides. No idea if this is relevant to Google, but the work seems very promising.
And sorry, I don't have the reference handy.
How much natural language do you really need for a search? Not much.
All you have to do is look at Yahoo answers' average question clarity to get a sense of why whole-sentence AI may not be the best strategy for a search engine.
For Natural Language Processing and Question Answering research activities, search for "AQUAINT (DTO OR ARDA OR IARPA)" and also the NIST TREC (Text Retrieval Conference) workshops and research competitions.
There is a lot of interesting work out there and some answers as to why more precise information finding through natural language input is useful.
As a commenter indicated, it's easier for us to adapt to computers than to adapt them to us. Long term question: as we adapt to our computers, using handfuls of keywords instead of sentences, how will it affect the language itself? Change in language comes from technology now, cf. "w00t" as word of the year or the most popular txtmsg acronyms.
Will we be reduced to the news people in that beer commercial who sum it all up in 10 seconds so they can go drink? It could have a positive effect in stripping language of fuzziness; if you were to Google 'initiating mobilizing synergistic dynamics to maximize total quality excellence,' you wouldn't get much, because it's b.s., whereas 'build better mousetrap' would give you hard data. Meetings would certainly get shorter if we were forced to communicate in searchable terms.
On the other hand, storytelling would suffer. "Boy girl meets gets loses" is ideal search terminology, but doesn't exactly pull the heartstrings.
I'm the queer the atheists sent here to take away your gun!
I agree with the other comments that it is much easier to get the user up to speed than to make search criteria easy for naive users. Remember Ask Jeeves? That implementation of natural language queries gave results that were not much better than random. Serious users quickly catch on to the tricks of word order, quotes, +/-, etc. Really, it's not much harder than typing a sentence and gives more predictable results.
then i won't be impressed until i can type "earl grey, hot" into google and find a nice cup of tea on my cd tray
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
What you are saying has nothing to do with the parent's point. Do you even know the first thing about natural language research and/or AI research? He's saying IT HASN'T DONE SHIT since ELIZA was written. All of the old fogey AI academics worshipped at the altar of Godel-Escher-Bach, only to find out that we were scratching granite with our fingernails.
Natural language processing is useful when it is well-done. Getting it well-done is the tough part. Don't let Google reps trick you into thinking otherwise just because their R&D in the field isn't where they'd probably like it to be.
Here are some situations where it's useful:
1) interpreting a question rather than just treating it as a "bag of words." For instance, one can type "how tall is Mt. Everest" in the search bar and Google, rather than searching for documents that contain those 5 (or so) tokens will interpret that as a query asking for height and also search for documents that contain "Mt.", "Everest", and "height". Take that a step further and it might look for strings that represent height such as a number followed by "ft" or "meters" or "m".
2) Condensing query chains. Suppose you want to know what sport our 4th president enjoyed playing most. You can ask "what sport did the fourth president of the US like playing?" and the system will give you an answer by first interpreting "fourth president of the US" as Madison, and then searching for what sports Madison enjoyed playing. If not for such interpretation you would either have to run 2 queries (first to find out who the 4th president was, then what sports he liked), or hope that there is a document out there that Google's indexed that contains the words in that initial query.
3) Speech recognition! If you want to run a Q/A session with a computer system that has a speech recognition front end, it is more natural (easier and faster) to ask it "how tall is mt. everest?" than to say "mount everest height" or whatever you would end up typing into Google today. People like to speak using *natural language,* after all. They would gladly do it with computers if the SR systems in them were good enough (some are).
4) More precise query results. What's better, getting back a document that is likely to contain the answer to your query, or getting back the sentence that contains it? Or better yet, getting back the answer and nothing else? The more robust an NLP system the more complicated queries it can interpret and the more elegant its result can be.
On that note, Google actually *does perform* NLP on queries despite what from the summary (I didn't RTFA) looks like claims to the contrary. If you ask Google "how tall is Mt. Everest?" it actually DOES interpret that particular sentence and gives you the answer -- 29000ft or thereabouts. And you only get such an elegant result if you type "how tall is Mt. Everest" (without quotes) or "Mt. Everest how tall". Other queries of this nature will not give you quite as precise a response.
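To make point 1 concrete, here is a minimal sketch of turning one question shape into keywords plus a pattern for a plausible answer. The pattern table, function names, and sample sentence are all invented for illustration; this is not a description of what Google does internally.

```python
import re

# Assumed table: question shape -> (attribute keyword, regex for a plausible answer).
QUESTION_PATTERNS = [
    (
        re.compile(r"^how tall is (?P<entity>.+?)\??$", re.I),
        "height",
        re.compile(r"\b\d[\d,]*(?:\.\d+)?\s*(?:ft|feet|meters|m)\b", re.I),
    ),
]


def interpret(query):
    """Map a recognized question to keywords and an answer pattern."""
    for shape, attribute, answer_re in QUESTION_PATTERNS:
        match = shape.match(query.strip())
        if match:
            return {"keywords": [match.group("entity"), attribute], "answer_re": answer_re}
    # Unrecognized shapes fall back to a plain bag of words.
    return {"keywords": query.split(), "answer_re": None}


def extract_answer(plan, text):
    """Pull the first string in `text` that looks like an answer, if any."""
    if plan["answer_re"] is None:
        return None
    found = plan["answer_re"].search(text)
    return found.group(0) if found else None


plan = interpret("how tall is Mt. Everest?")
print(plan["keywords"])
# Sample sentence written for this example, using the rough figure quoted above.
print(extract_answer(plan, "The summit is about 29000 ft above sea level."))
```

A query the table does not recognize, such as "mount everest height", simply falls through to ordinary keyword search.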
I like basketball!!1!
I phrase a majority of my searches as questions already and get back reasonable results. Like Norvig said, it's about the words in general and their meaning together in a phrase. In my experience I ask and I receive. What's the problem?
From my view, this is the classic debate in technology: emulating nature vs. reinventing nature.
When people first tried to fly, they copied birds, but the better solution was to understand the principles of aerodynamics and leverage the technology available.
The wheel was a better idea than trying to recreate feet.
In the key words vs natural language debate, Google has shown that key words is the better solution for now.
The real question is: how do you make searches more intuitive to the person making the search?
After all, usability is the only criterion that matters.
PowerSet.com claims to have a natural language search that's superior to the keywords searches. Let's see if PowerSet has the service to back up its boasts. PowerSet.com currently hides its service -- which is not a good sign.
> wii
Your query does not include a verb.
> find wii
Whose "wii" do you want me to find?
> find wii review
Unable to find any reviews authored by "wii".
> find review about wii
No reviews found concerning the common noun "wii".
> find review about Wii
Here is the most recent review about the proper noun "Wii": [url to a page full of keywords related to Wii]
> find review about Wii order by relevence
"relevence" is not an English word. Did you mean "relevance"?
> find review about Wii order by relevance
Here is the most relevant review about Wii: [url to a 2 year old pre-review of the Wii before it was launched]
> find review about Wii order by relevance then date
Here is the most recent and most relevant review about Wii: [url to a fanboy site]
> find all reviews about Wii order by relevance then date
Working...
> abort
Abort what?
> abort search
I am currently performing 1,231,415 searches. Which search do you want me to abort?
> abort last search
You do not have permission to abort others' searches.
> abort my last search
Last search aborted.
> find several reviews about Wii order by relevance then date
"Several" is not a quantifiable adjective. Do you mean "seven"?
> find seven reviews about Wii order by relevance then date
Here are your results. For better search results please capitalize the first word of sentences, and end sentences with proper punctuation.
Dan East
Better known as 318230.
Google should look to Karen, the computer wife of Plankton on SpongeBob SquarePants. Karen is so advanced her natural language responses even include sarcasm.
At least one startup is betting that natural language search will be the way to go. A number of ex-yahoo people there.
While natural language might seem like a good idea to people who are less technical, it's actually a really bad idea. It would slow a lot of things down in terms of search and would bring with it deep inefficiencies. Frankly, I think search engines would be improved if they offered advanced features with brief commands (kind of like how Unix abbreviates 'copy' as 'cp' or 'move' as 'mv'). For example, which do you think is better when you want to move quickly, a vehicle with wheels, or a bipedal vehicle with legs? The answer is obvious, wheels trump legs for speed. The same with language interfaces to computers. A middle language between machine and human language is the best approach. With a focus on efficiency and no ambiguity whatsoever. Loglan. There you go. move along...
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o
first step to building a NLP like search engine would be to map words to their respective subjects (or classification) - this has already been done with WordNet. then as you crawl the net you map the words found to your hierarchy, and you keep a running total of frequency of words on the document as well as the frequency on the net. Eventually, you can sift out the words that have little to no meaning (words that appear frequently typically have no meaning - the, a, and, but, etc...).
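a minimal sketch of the frequency bookkeeping, for what it's worth. it swaps in plain TF-IDF weighting (per-document count times log of inverse document frequency) and leaves out the WordNet mapping entirely; the three tiny documents are made up, and this is not claimed to be the method described above.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def informative_terms(doc, docs, df, top=3):
    """Rank a document's words by TF-IDF; words found in every document score zero."""
    tf = Counter(doc.lower().split())
    n = len(docs)
    scores = {word: tf[word] * math.log(n / df[word]) for word in tf}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, score in ranked if score > 0][:top]


# Tiny made-up corpus for illustration.
docs = [
    "the java language and the jvm",
    "the island of java and the sea",
    "the coffee and the cup",
]
df = document_frequencies(docs)
print(informative_terms(docs[0], docs, df))
```

"the" and "and" appear in every document, so they drop out without anyone maintaining a stop-word list by hand.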
Now combine this with pagerank and social ranking and you can refine search results down pretty quickly. During my undergrad I was able to get really good results with this method but I needed more sites in my index to really see if it would work.
Essentially what happens is your queries start off broad and you refine the results down by providing more terms to search by that are associated with the line of queries. (This is how search engines like ask.com (teoma.com was the company that focused on this) work).
There are several problems with using natural language as a query language. For example, my northern neighbor, from Jamaica, is understandable and my southern neighbor, from Colombia, is intelligible, but I tend to have to translate idioms between them. That illustrates that there are only about 6 billion natural languages to deal with. And you must use lots of short sentences with children, but longer, more complex phraseology with adults.
The other problem is that because most words get repurposed over time and fields of study, a lot of natural language is used to set the context. The word "affluent" means quite different things when talking about watersheds and neighborhoods. And the rules of grammar pretty much guarantee that the words "watershed" and "affluent" would be in separate phrases with all the intervening words the phrases need. Hence the natural language query would be much more voluminous and need much more processing.
Still, once computers can "read" a paper and "understand" what it says, a natural language query might be more efficient for constraining the search. (What rivers are affluent to the Blue Nile south of the 34th parallel?) But while the search engine scanners scan a document and create a key word distance measure on improbable words, the improbable key word set will still be the most efficient query language. (Returning the containing documents instead of the answer.)
I guess they'll just let Powerset become the next Google. Face it, "keywordese" language is often not adequate. Questions constitute a significant fraction of search engine traffic, and all search engines fail miserably on anything but "how to" queries. Just yesterday I was looking for a comparison between two products on the web. I've found it, eventually, but there's no real reason why it shouldn't be the first hit after I enter "comparison between X and Y". It's not a question in itself, yet it's a distinctly natural phrase. I bet people would use things like this quite a bit if they actually worked well. Looking further into the future, quite often I'm looking for an answer, not for a set of hits I have to read and summarize myself. In 20-30 years from now I won't have to waste my time. I want the computer to become my "secretary". I give it a task (find relevant information about topics X and Y, summarize, present) and off it goes. In a minute or so I have a page of concentrated information to digest.
The reason why Google won't focus on NL queries is because there are a lot of unsolved problems and those may take decades to solve. Disambiguation/polysemy, summarization, knowledge representation, reasoning - you need all of them to be anywhere close to a human in language understanding, and none of this is really "solved" yet. This even ignores purely technical issues (i.e. issues that can be solved today with a bit of elbow grease) such as extracting salient bits from the pages, storing linguistic data in index efficiently and retrieving it from there in a meaningful way in real time.
Is it hopeless, then? I don't think so, for two reasons. Reason one, it won't get done unless someone does it. Reason two, there are working implementations of language-aware search that for certain types of queries yield substantially better results. If Google doesn't do it, someone else will. And you can bet a billion bucks they'll patent the heck out of it.
That said, I don't see keyword search going away anytime soon. It works well for a lot of things and it'll live side by side with NL queries. But next time you click a link after link after link in Google's results page, think whether it'd be easier to just type a natural language phrase and have Google "understand what you mean".
Let's say I do a search for "java" and I get 501,000,000 hits. I would like to narrow this down.
I'd like the search engine to give me a list of topics to refine my search.
Programming language
Coffee
Island
Companies
Other
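A minimal sketch of that kind of topic list: bucket result snippets by overlap with a few cue words per topic. The cue-word sets and the sample snippets are hypothetical, and a real engine would need far better signals than word overlap.

```python
from collections import defaultdict

# Hypothetical cue words per topic, invented for this example.
TOPIC_CUES = {
    "Programming language": {"jvm", "class", "compiler", "code"},
    "Coffee": {"coffee", "roast", "brew", "espresso"},
    "Island": {"indonesia", "island", "jakarta"},
}


def group_by_topic(snippets):
    """Assign each snippet to the topic sharing the most cue words, else 'Other'."""
    groups = defaultdict(list)
    for snippet in snippets:
        words = set(snippet.lower().split())
        best_topic, best_overlap = "Other", 0
        for topic, cues in TOPIC_CUES.items():
            overlap = len(words & cues)
            if overlap > best_overlap:
                best_topic, best_overlap = topic, overlap
        groups[best_topic].append(snippet)
    return dict(groups)


# Made-up result snippets for the query "java".
results = [
    "Java compiler and JVM internals",
    "Dark roast Java coffee",
    "Java is an island of Indonesia",
    "Java the hutt fan page",
]
for topic, hits in group_by_topic(results).items():
    print(topic, "->", hits)
```

Anything that matches no cue word lands in "Other", which is where the refinement list would send the leftovers.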
And let's get rid of any links that are just lists of words. Read the web pages using natural language processing so that the computers understand what the page is about. Lists of words and random sentences should fail this natural language processing, and so not be in the context of anything.
Actually, one of the main challenges with natural language is that we humans perform so badly to begin with. Half the time we neither say what we mean, nor mean what we say. But it hardly matters: far more than half the time, the person (or people) listening hear either what they expected to hear, or what they wanted to hear, or they already knew they would disagree with whatever you were about to say before you even opened your mouth.
Sometimes it does matter. However, by the time you design a linguistic study to isolate the human gift for parsing grammar, the experimental task is about as "natural" as writing a law exam.
I think the contribution of grammar to early human language is way overstated. You don't need much grammar to handle everyday events, such as determining how to dress for dinner when the report from the field comes back "mammoth tusk hunter" or "hunter spear mammoth": in the former case (x3!) you'll be polishing your nose bone.
Where word order begins to matter is parsing the daily scuttlebutt. Did Adam tell Carol about Bob and Eve, or was it Eve telling Adam what she overheard between Bob and Carol? It's not easy keeping the cheaters distinct from the cheated upon. Plus Adam has to remember when to look surprised when Eve tells him something he learned from Carol just the other day. Not keeping your past/present/future and your cheatee/cheaters straight was a certain recipe for not sleeping on the warm side of the fire pit.
Later on, the grammar we acquired to parse who's zooming who became useful for digesting the BBQ assembly manual, but of course, that remains an evolutionary work in progress.
Maybe when children of the current MySpace generation reach the age to pop the big question ("What's an iPod?") and we've given up the fight to prevent our every indiscretion and peccadillo from being publicly archived for all posterity, we'll actually need a natural language interface to really drill down into the zettaflood of who said what to whom and who first posted it online and whether revenge was sweet.
Instead of trying to re-create or interpret the conventions of human speech, how about just a better way of representing the search results? I would like to see a visual representation of the search results so that I could spot the most promising semantic branches. There must be a way of grouping results that are closest in meaning, or refer to similar sources, or fall into broad categories of knowledge. Right now, Google just ranks them all in what it believes to be the order of significance, which is no help if the search results have gone in a direction not intended. Maybe the program should let humans resolve the ambiguity as much as possible; we're actually quite good at it. That's what makes Turing tests work.
or at least the option. That includes escaping "+" and "-". That would do sooooo much to improve searches.
I'm not repeating myself
I'm an X window user; I'm an ex-Windows user
How is the parent post insightful in any way? It's a fabricated example from nothing, a strawman post.
And even more so, AI, which was so promising to so many of us in the 80s turned out to be so hard that it is basically impossible.
You must have been asleep for the past 2 decades. AI is, to every generation, the stuff that we don't know how to do yet. In 1985, a chess computer being world champion seemed like AI. In 1995, a computer answering the telephone when you ask if your flight is on time seemed like AI. There are still things that seem like AI, but I doubt my children will believe it.
You might be thinking of Strong AI, but even that isn't completely lost yet.
basically AI never got going,
No, we just don't tend to call it "AI" much any more because it was hard to get funding for things labeled "AI" after the AI Winter. It's no coincidence that the guy who wrote the book on AI is now Director of Research at one of the top software companies.
Sometimes I like to idly type things into Google Suggest and see what comes up:
why is everything
can you eat
can you die from
where can I go to get
is it possible to
how would you
From playing with it for a few minutes, it seems that Google is mostly used by women in various stages of pregnancy, people worried that they might be arrested for using Limewire, and people looking for Wiis.
Unless a computer knows things the way a human does, it's not possible for natural language queries to ever work.
I think it is pretty clear that the manifestation of Google's ideology has alienated many who might have backed its progress originally. Is it too late for an open source, peer managed search network to form? Namely, in place of advertisements funding the service, how feasible would it be for the future's mainstream search to be managed by an academic network of global universities, catering to traffic via proximity, bolstering search features through open peer review and funded by mutually beneficial public sourcing?
Many services have proved they can be managed without a nanny looking after everything. Is search the same?
Ask Jeeves was a flop because it started returning stupid results like, "Would you like to buy a Subatomic Physics?".
A house divided against itself cannot stand.