New Algorithm for Learning Languages

just thought.. by thegoogler · 2005-08-31 16:06 · Score: 3, Interesting

what if this could be integrated into a small plugin for your browser (or any program) of choice, which would then generate its own dictionary in your language.

would probably help with the problem of either downloading a small, incomplete dictionary, a dictionary with errors, or a massive dictionary file.

Re:just thought.. by jaavaaguru · 2005-08-31 20:37 · Score: 4, Interesting

Perhaps the algorithm could be used to identify spam more accurately. If it can understand the text, then it's got a reasonable chance of knowing whether the text is junk.

--
Follow me

Didn't Google already do this? by powerline22 · 2005-08-31 16:08 · Score: 5, Interesting

Google apparently has a system like this in their labs, and entered it into some national competition, where it pwned everyone else. Apparently, the system learned how to translate to/from Chinese extremely well, without any of the people working on the project knowing the language.

Re:Didn't Google already do this? by spisska · 2005-08-31 17:19 · Score: 5, Interesting

IIRC, Google's translator works from a source of documents from the UN. By cross referencing the same set of documents in all kinds of different languages, it is able to do a pretty solid translation built on the work of goodness knows how many professional translators.
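As a rough illustration of what "cross referencing" aligned documents can mean, here is a minimal sketch that guesses word translations from co-occurrence counts. It is not Google's method; the toy sentence pairs, the `best_guesses` name, and the choice of the Dice score are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Toy aligned sentence pairs (invented for illustration, not real UN data).
PARALLEL = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]

def best_guesses(pairs):
    """For each source word, pick the target word with the highest Dice score."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        # Count every source/target word pair seen in the same aligned sentence.
        joint.update(product(s_words, t_words))
    guesses = {}
    for s in src_count:
        guesses[s] = max(
            tgt_count,
            key=lambda t: 2 * joint[(s, t)] / (src_count[s] + tgt_count[t]),
        )
    return guesses

print(best_guesses(PARALLEL))
```

On this toy data each English word pairs with the French word it always appears alongside. Real systems have to cope with word order, one-to-many mappings, and exactly the kind of ambiguity described in the rest of this comment.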

What is a little more confusing to me is how machine translation can deal with finer points in language, like different words in a target language where the source language has only one. English for example has the word "to know" but many languages use different words depending on whether it is a thing or a person that is known. Or words that relate to the same physical object but carry very different cultural connotations -- the word for female dog is not derogatory in every language, for example, but some other animals can be extremely profane depending on who you talk to.

Or situations where two entirely different real-world concepts mean similar things in their respective language -- in English, for example, you're up shit creek, but in Slavic languages you're in the pussy.

I've done translation work before (Slovak -> English), and there's much more going on than differences in words and grammar. There are whole conceptual frameworks in languages that just don't translate, and this is frustrating for anyone learning a language, let alone trying to translate. English is very precise (when used as directed) in matters of time and sequence -- we have more than 20 verb tenses where most languages get away with three.

Consider this:

I was having breakfast when my sister, whom I hadn't seen in five years, called and asked if I was going to the county fair this weekend. I told her I wasn't because I'm having the painters come on Saturday. They'll have finished by 5:00, I told her, so we can get together afterwards.

These three sentences use six different tenses: past continuous, past perfect, past simple, present continuous, future perfect, and present simple, and are further complicated by the fact that you have past tenses referring to the future, present tenses referring to the future, and the wonderful future perfect tense that refers to something that will be in the past from an arbitrary future perspective, but which hasn't actually happened yet. Still following?

On the other hand, English is much less precise in things like prepositions and objects, and utterly inexplicable when it comes to things like articles, phrasal verbs, and required word order -- try explaining why:

I'll pick you up after work

I'll pick the kids up after work

I'll pick up the kids after work

are all OK, but

I'll pick up you after work

is not.

Machine translation will be a wonderful thing for a lot of reasons, but because of these kinds of differences in languages, it will be limited to certain types of writing. You may be able to get a computer to translate the words of Shakespeare, but a rose, by whatever name, is not equally sweet in every language.

SCIgen by OverlordQ · 2005-08-31 16:09 · Score: 5, Interesting

SCIgen anyone?

--
Your hair look like poop, Bob! - Wanker.

Speaking as someone working on NLP by OO7david · 2005-08-31 16:12 · Score: 4, Interesting

IAALinguist doing computational things and my BA focused mainly on syntax and language acquisition, so here're my thoughts on the matter.

It's not going to be right. The algorithm is stated as being statistically based, which, while similar to the way children learn languages, is not exactly it. Children learn by hearing correct native language from their parents, teachers, friends, etc. The statistics come in when children produce utterances that do not conform to the speech they hear, or when people correct them. However, statistics does not come in at all with what they hear.

With respect to the algorithm learning the underlying grammar of a language, I am dubious enough to call it a grand, untrue claim. Basically all modern views of syntax are unscientific and we're not going to get anywhere until Chomsky dies. Think about the word "do" in English. No view of syntax describes where it comes from. Rather, languages are shoehorned into our constructs.

So, either they're using a flawed view of syntax or they have a new view of syntax and for some reason aren't releasing it in any linguistics journal as far as I know.

Re:Speaking as someone working on NLP by OO7david · 2005-08-31 16:29 · Score: 2, Interesting

Insofar as only utterance A is heard. A kid will always hear "Are you hungry" but never "Am you hungry" or "Are he hungry".

Native speakers by definition speak correctly, and that is all the child is hearing.

Re:Speaking as someone working on NLP by OO7david · 2005-08-31 16:55 · Score: 4, Interesting

It is in effect two parted:

Chomsky is to linguistics as Freud is to psych. He had great ideas for the time (many still stand), and the science would be nowhere close to where it is without him. However, A) he's backed off a lot from supporting his own theories and B) he's published papers contradicting his original ideas, so there is some question as to their veracity. Since so many linguistics undergrads hold him as the pinnacle of syntax, none are really deviating drastically from him.

WRT the unscientificness, to make his view fit English, there has to be "do-support" which basically is that when forming an interrogative "do" just comes in to make things work without any explanation. In other words, it is in our grammar, but our view of syntax does not account for it.

Re:Speaking as someone working on NLP by PurpleBob · 2005-08-31 17:08 · Score: 4, Interesting

You're right about Chomsky holding back linguistics. (There are all kinds of counterarguments against his Universal Grammar, but people defend it because Chomsky Is Always Right, and Chomsky himself defends it with vitriolic, circular arguments that sound alarmingly like he believes in intelligent design.)

And I agree that this algorithm doesn't seem that it would be entirely successful in learning grammar. But this is not because it's statistical. I don't understand how you can look at something as complicated as the human brain and say "statistics does not come in at all".

If this algorithm worked, then it could be statistical, symbolic, Chomskyan, or magic voodoo and I wouldn't care. There's no reason that computers have to do things the same way the brain does, and I doubt they'll have enough computational power to do so for a long time anyway.

No, the flaws in this algorithm are that it is greedy (so a grammar rule it discovers can never be falsified by new evidence), and it seems not to discover recursive rules, which are a critical part of grammar. Perhaps it's learning a better approximation to a grammar than we've seen before, but it's not really doing the amazing, adaptive, recursive thing we call language.

--
Win dain a lotica, en vai tu ri silota

Re:Speaking as someone working on NLP by Anonymous Coward · 2005-09-01 04:12 · Score: 1, Interesting

I won't put a detailed explanation why he is very wrong either. However his methodology and presentation are clearly unscientific. He fudges his data, he asserts without proof, his proofs are always circular. He assumes things that need no assuming, easily established to be true or false by simple experiments. Nevertheless he just assumes and goes on. Some are contradicted by experimental evidence but he doesn't care. He sweeps stuff under the rug when it doesn't fit his model as "outside of domain of this theory" for an unspecified "domain." So whatever the truth value of his claims, his ideas cannot be scientific truth.

This much you can easily prove yourself. But being unscientific doesn't mean wrong; you could reason the Earth must be a sphere because the most beautiful shape is a sphere and you would be right wrt shape of the Earth, even though your reasoning is unscientific junk. If you want proof that Chomsky is wrong, rather than just using a useless methodology, I'm afraid you won't find it without spending a lot of time on it.

It takes quite a bit of time to give background information on developmental psychology, psycholinguistics, neuroscience and biology in general to someone outside the field (not that I'm an expert, but I have a degree in Cogsci.) It takes at least as much time to establish which flavor of Chomskian linguistics is rubbish and why (Chomsky made so many contradicting models of language and mind that almost everything you say against him can be countered with a simple "ah, but you don't know his X theory"). So you very probably won't get any such response unless you claim a specific Chomskian theory is true and sound like you know what you are talking about.

Grammar depends on the input by Tsaac · 2005-08-31 16:14 · Score: 3, Interesting

If fed with a heap of decent grammar, what happens when it's fed with bad grammar and spelling? Will it learn, and incorporate, the tripe or reject it? That's the sort of problem with natural language apps: it's quite hard to sort the good from the bad when it's learning. Take the megahal library (http://megahal.alioth.debian.org/) for example. Although possibly not as complex, it does a decent job at learning, but when fed with rubbish it will output rubbish. I don't think it's the learning that will be the hard part, but rather the recognition of the good vs. the bad that will prove how good the system is.
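The "rubbish in, rubbish out" point can be seen in a few lines with a toy first-order Markov text model. This is only a sketch in the same general family as MegaHAL, not its actual code; the training sentences and the `train` and `generate` names are made up for the example.

```python
import random
from collections import defaultdict

def train(sentences):
    """Build a first-order word-transition table from training sentences."""
    table = defaultdict(list)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            table[prev].append(nxt)
    return table

def generate(table, rng, max_words=20):
    """Walk the table from the start marker until the end marker is drawn."""
    word, out = "<s>", []
    while len(out) < max_words:
        word = rng.choice(table[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

clean = ["the cat sat", "the dog sat"]
noisy = clean + ["teh cat sitted"]  # one misspelled, ungrammatical input

rng = random.Random(0)
print(generate(train(clean), rng))
print(generate(train(noisy), rng))
```

The model trained on `clean` can only ever produce the two clean sentences. The one trained on `noisy` can start with "teh" or end with "sitted", because nothing in the model distinguishes good input from bad.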

--
eXemplary Abstract

Re:Grammar depends on the input by jim_v2000 · 2005-08-31 17:16 · Score: 2, Interesting

The problem with this program is that you could input the most grammatically correct sentences you can into it, and it'll still spew out senseless garbage. For this to be of any worth, the computer will need to understand the meaning of each word, and how each meaning relates to what the other words in the sentence mean. And you can't program it into a computer what something is just by putting words into it. Like if I tell the machine that mice squeak, it has to know what a squeak sounds like and what a mouse is. How do you define a mouse to a computer? A small fuzzy rodent. Well, how do you define fuzzy? Or small? Or a rodent? You have to keep using more and more words...and still the computer will have no idea what you're talking about, other than just more word relationships.

I guess the missing thing is that a human can envision the meaning of the words as a concept or image, while the computer simply sees the words as, well, just words (or binary, to be specific).

--
Don't take life so seriously. No one makes it out alive.

Protein sequences? by Anonymous Coward · 2005-08-31 16:15 · Score: 1, Interesting

Let's see what human DNA really says and means!

Hieroglyphics? by Hamster+Of+Death · 2005-08-31 16:21 · Score: 2, Interesting

Can it decipher these things too?

Re:Isn't This the Universal Translator Idea by biryokumaru · 2005-08-31 16:25 · Score: 2, Interesting

In Star Trek 4, the universal translator was little help when the humpback whale armada arrived... No, seriously, that was one f**ked up movie.

But for this, I have one word: Dolphins.

--
When you're afraid to download music illegally in your own home, then the terrorists have won!

Incredible by Anonymous Coward · 2005-08-31 16:26 · Score: 1, Interesting

I hope the material is (or will be made) accessible to laypersons. I'd love to be able to use this algorithm for my own music experiments.

Programming Language by jmlsteele · 2005-08-31 16:37 · Score: 2, Interesting

How long until we see something like this applied to ?

No the didn't by Ogemaniac · 2005-08-31 16:53 · Score: 5, Interesting

I played around with the Google translator for a while. I work in Japan and am half-way fluent. Google couldn't even turn my most basic Japanese emails into comprehensible English. Same is true for the other translation programs I have seen.

I will believe this new program when I see it.

Translation, especially from extremely different languages, is absurdly difficult. For example, I was out with a Japanese woman the other night, and she said "aitakatta". Literally translated, this means "wanted to meet". Translated into native English, it means "I really wanted to see you tonight". It is going to take one hell of a computer program to figure that out from statistical BS. I barely could with my enormous meat-computer and a whole lot of knowledge of the language.

Re:No the didn't by lawpoop · 2005-08-31 17:12 · Score: 2, Interesting

The example you are using is from conversation, which contains a lot of mutually shared assumptions and information. Take this example from Steven Pinker:

"I'm leaving you."
"Who is she?"

However, in written text, where the author can assume that the reader brings no shared assumptions, nor can the author rely on any feedback, 'speakers' usually do a good job of including all necessary information in one way or another -- especially in texts meant to convince or promote a particular viewpoint. I'll bet these kinds of texts are more easily translatable than conversation.

--
Computers are useless. They can only give you answers.
-- Pablo Picasso

Re:No the didn't by a.different.perspect · 2005-08-31 17:33 · Score: 2, Interesting

Or was it "chinko wo nametakatta"? It's just as easy for me to believe, you hot Slashdot nerd, you.

Being more serious, how do you think humans learn the rudiments of language? It's pattern analysis, i.e. precisely the technique this algorithm tries to replicate. It is true that the algorithm won't then progress onto the next stage, which is using that rudimentary grasp of the language to be taught its finer points, but if you genuinely doubt the capacity of this method to produce an understanding of language you are contesting the experiences of every human on the planet.

Returning to your example, "I really wanted to see you tonight" is what you discerned that sentence meant from its context. You can hardly expect a machine translator to know that it was a woman you were out with at night who said it (which seems to be the basis for your insertion of "tonight", "really" and "you"); fortunately, this algorithm is intended to translate written, not spoken, language. Since writing would have to include that detail (in order to be independent of its context), the problem you identified is not even relevant.

Re:No the didn't by burns210 · 2005-08-31 17:49 · Score: 3, Interesting

There was a program that tried to use the language of Esperanto (a made-up language designed specifically to be very consistent and guessable with regards to how syntax and words are used, very easy to learn and understand quickly) to be a middleman for translation.

The idea being that you take any input language, Japanese for instance, and get a working Japanese-Esperanto translator. Being as Esperanto is so consistent and reliable in how it is designed, it should be easier to do than a straight Japanese-English translator.

To finish, you write an Esperanto-English translator. By leveraging the consistent language of Esperanto, researchers thought they could write a true universal translator of sorts.
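Part of the appeal of a middleman language is simple counting. A small sketch, assuming one translator per direction and a pivot that is not itself one of the n languages (the function names are just for illustration):

```python
def direct_translators(n):
    """One translator per ordered pair of distinct languages."""
    return n * (n - 1)

def pivot_translators(n):
    """One translator into the pivot and one out of it, per language."""
    return 2 * n

for n in (3, 10, 50):
    print(n, direct_translators(n), pivot_translators(n))
```

With 10 languages that is 90 direct translators against 20 through the pivot. The catch is that every translation then passes through two lossy steps instead of one.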

Don't know what ever came of it, but it was an interesting idea.

Re:No the didn't by Anonymous Coward · 2005-08-31 18:30 · Score: 1, Interesting

The word you're looking for to describe the intermediate is "interlingua", and it need not be real, just structure meaning somehow -- eg some weird XML ;-).

My tutor got his doctorate in machine translation, and that was erm mid-early 80s? His "not for a long time" prediction (as seems to apply in general to AI) likely remains correct --- I'll believe the techniques (as with AI in general) bring us more than extremely specialised uses when I see more than press releases and claims of software that isn't available for me to test.

In fact, fellow nerds, just give me a link to ONE impressive piece of AI software (that isn't a chess player) and I'll be bowled over. PS I'm posting this using Dragon NaturallySpeaking, which is one of the only examples of vaguely AI research reaching the home/office...

Re:No the didn't by krunk4ever · 2005-08-31 20:21 · Score: 2, Interesting

Being more serious, how do you think humans learn the rudiments of language? It's pattern analysis, i.e. precisely the technique this algorithm tries to replicate. It is true that the algorithm won't then progress onto the next stage, which is using that rudimentary grasp of the language to be taught its finer points, but if you genuinely doubt the capacity of this method to produce an understanding of language you are contesting the experiences of every human on the planet.

The one flaw in your analysis is that humans learn language/grammar faster when they're young, and it becomes a lot harder as they get older. There are many different speculations on why that happens, from children starting with a clean slate to children learning languages better as their brains develop. I mean, pattern analysis would definitely be an advantage for grown-ups, no? Why is children's pattern analysis better in this case, if what you're saying is true?

From what I've seen, to actually learn grammar and a foreign language, there's 2 requirements. One is you must have a passion for it. 2nd is that you must be constantly practicing. I've noticed if you attend classes but never use it in your real life, you'll never learn it. Find a group of people who are also learning and try communicating only with that language and you'll see how much faster you'll pick up. It also helps to have a friend who's fluent in the language to correct you (though it might not be that good for your pride). What I've noticed is that grammar nazis are the best for learning a new grammar. They pick on EVERY SINGLE MISTAKE YOU MAKE, so you'd think twice before making the same mistake again.

At college, I've actually seen flyers asking for help in English, and in return they'll help you with the language they're fluent in, be it French, German, Chinese, Japanese, etc. So those people would meet maybe 3x a week and spend an hour in each language each time, which I thought was a really neat idea. Here you're helping a foreigner with English and there they are helping you with a foreign language you want to learn.

--
HD Trailers

Finally by Trigulus · 2005-08-31 16:53 · Score: 2, Interesting

something that can make sense of the Voynich manuscript http://www.voynich.nu/. They should have tested their system on it.

--
If something exists that does not need a creator (god) then why must the cosmos need one?

Universal Translator? by mwilli · 2005-08-31 16:58 · Score: 2, Interesting

Could this be integrated into a handheld device to be used as a universal translator, much like a hearing aid?

Electronic babelfish anyone?

--
My sig beat up your sig.

Dolphins? by Stripsurge · 2005-08-31 17:29 · Score: 2, Interesting

Seems like that'd be a good place to test the system out. While talking with extraterrestrials would be pretty awesome, having a chat with a dolphin would be pretty cool too. Remember: "The second most intelligent [species] were of course dolphins"

Give it a real challenge by pugugly · 2005-08-31 17:40 · Score: 3, Interesting

Feed it the entries in the "obfuscated C" competition - if it works for that, it oughta work for anything.

Pug

--
An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media

Can it decipher ancient languages? by Anonymous Coward · 2005-08-31 17:57 · Score: 1, Interesting

For example the lost Iberian language, spoken in Spain before Latin. There are texts, but nobody understands them.

Spam filter? by goMac2500 · 2005-08-31 18:00 · Score: 2, Interesting

Could this be used to make a smarter spam filter?

O(n^n^n...)????? by mosel-saar-ruwer · 2005-08-31 18:34 · Score: 3, Interesting

From TFA: The algorithm discovers the patterns by repeatedly aligning sentences and looking for overlapping parts.

If you take just a single string [of length n] and rotate it against itself in a search for matches, then you've got to do n^2 byte comparisons just to find all singleton matches, and then gosh only knows how many comparisons thereafter to find all contiguous stretches of matches.

But if you were to take some set of embedded strings, and rotate them against a second set of global strings [where, in a worst case scenario, the set of embedded strings would consist of the set of all substrings of the set of global strings], then you would need to perform a staggeringly large [for all intents and purposes, infinite] number of byte comparisons.

What did they do to shorten the total number of comparisons? [I've got some ideas of my own in that regard, but I'm curious as to their approach.]

PS: Many languages are read backwards, and I assume they re-oriented those languages before feeding them to the algorithm [it would be damned impressive if the algorithm could learn the forwards grammar by reading backwards].
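The n^2 figure for a single string can be checked with a brute-force sketch. This is only the naive rotate-and-compare idea described above, not the algorithm from TFA, and `rotation_matches` is a name made up for the example.

```python
def rotation_matches(s):
    """Count position-wise matches between s and each of its rotations.

    Returns (matches_per_shift, total_comparisons).
    """
    n = len(s)
    comparisons = 0
    matches = {}
    for shift in range(1, n):  # shift 0 is the string against itself
        hits = 0
        for i in range(n):
            comparisons += 1
            if s[i] == s[(i + shift) % n]:
                hits += 1
        matches[shift] = hits
    return matches, comparisons

matches, comparisons = rotation_matches("abab")
print(matches, comparisons)
```

For a string of length n this does n * (n - 1) character comparisons, which is on the order of n^2, before any attempt to stitch single matches into contiguous runs.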

Re:O(n^n^n...)????? by volsung · 2005-09-01 03:22 · Score: 2, Interesting

Right-to-left languages (which I assume you mean as "backwards") are displayed that way to the user, but it does not affect their digital storage, which is still forwards (in the numerical offset sense).
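A quick way to see this, as a sketch using Python's standard `unicodedata` module and the Hebrew word "shalom" as an arbitrary example:

```python
import unicodedata

# Hebrew "shalom", given as escapes in storage (logical) order.
word = "\u05e9\u05dc\u05d5\u05dd"

first = word[0]
print(unicodedata.name(first))           # HEBREW LETTER SHIN
print(unicodedata.bidirectional(first))  # R
```

Index 0 holds shin, the first letter a reader would pronounce, even though a renderer draws it at the right-hand edge. The "R" direction class is what tells the display layer to lay the run out right to left.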

Is that really a big problem? by Anonymous Coward · 2005-08-31 19:16 · Score: 1, Interesting

I am not sure if that is really that big a problem. With mobile text messaging, people have started changing their sentences into a form that can be understood by the phone's dictionary.

Say, if I normally would have typed "stroll" to say "walk" and I would notice that when I press 787655 on my phone's keyboard, the T9 dictionary misunderstands me, I would just start typing 9255 for "walk" instead. I think the same would happen here. If somehow the person typing the messages would get instantaneous feedback from the system about a "commonly misunderstood" structure, he would quickly learn to avoid these structures while typing.
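The key sequences above follow from the standard phone keypad layout. Here is a small sketch of the mapping and of why a T9-style dictionary can "misunderstand" a sequence; the `t9_digits` and `candidates` names and the three-word dictionary are made up for the example.

```python
# Standard phone keypad letter groups.
KEYS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYS.items() for ch in letters}

def t9_digits(word):
    """Return the key sequence you would press to type the word."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())

def candidates(digits, dictionary):
    """Return every dictionary word that shares the given key sequence."""
    return [w for w in dictionary if t9_digits(w) == digits]

print(t9_digits("stroll"))  # 787655
print(t9_digits("walk"))    # 9255
print(candidates("4663", ["good", "home", "walk"]))  # ['good', 'home']
```

Several words can share one key sequence, so the phone has to guess, and a writer who keeps getting the wrong guess has a reason to switch to a word that the dictionary resolves correctly.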

On a related note, things like "fly like an arrow" are the most difficult thing to learn in a language, in my opinion, and thus foreign speakers do not use or know them. And still, "badly spoken English" can be comprehensible among the people speaking it. One thing I have noticed myself is that it is the British who have the most problems understanding a foreigner speaking English badly. Other foreigners would understand the same person just fine. Something to do with the way the brain is wired to wait for certain words after another, I guess.

Of course, the problem is that we would get rid of all the things that make language "alive". But here I am typing a message in a language other than my own and still many people can to some extent understand what I mean...

Re:Noam Chomsky by stephentyrone · 2005-08-31 19:20 · Score: 2, Interesting

Actually, this fits very tidily in a Chomskian context. The program has an internal, predetermined notion of "what a grammar looks like" (i.e. a class of allowable grammars sharing certain properties), and adapts that to the source text. The way all this is presented makes it seem like unsupervised learning that can find any pattern, but the best you can hope to do with a method like this is capture an arbitrary (possibly probabilistic) context free grammar (CFG).

Even then, Gold showed a long, long time ago (1967) that the task of inducing an arbitrary CFG using only generated strings from the language is basically hopeless [Gold, E. Mark. 1967. Language Identification in the Limit. Information and Control, 10:447-474].

That said, this doesn't even seem to be that novel (to me). Andreas Stolcke wrote a very nice PhD dissertation in 1994 on learning arbitrary PCFGs from language strings [Stolcke, Andreas. 1994. Bayesian Learning of Probabilistic Language Models. PhD Dissertation. University of California at Berkeley.]

This is probably a better, more efficient method than Stolcke produced back in '94, but I would be *very* surprised if it revolutionized the way computers interact with language, or anything else of the sort. People working in computational linguistics have a nasty habit of making grand pronouncements, only to fall far short of what they claimed.

For the record: IANAL, but i play one on TV, by which i mean i'm an applied mathematician with a couple published papers in computational linguistics.

Re:Random test ... by Godwin+O'Hitler · 2005-08-31 23:11 · Score: 3, Interesting

I AM a professional human translator, and believe me, if a machine translation did even a half decent job of producing intelligible, natural text, I would use it to get a jump start and save a lot of time.

But as things stand, I'd spend more time knocking the bad translation into shape than if I translated the whole thing from scratch.

Translators are often asked to copy edit other translators' work (customers tend to call this "proof reading", presumably to devalue it and get it done on the cheap, but it involves much more than hunting typos). That's fair enough if you want a quality check. But some smart-arse people try sending machine translations for copy editing. And you can bet they get sent straight back!

--
No, your children are not the special ones. Nor are your pets.

DNA Analysis by da5idnetlimit.com · 2005-08-31 23:28 · Score: 2, Interesting

Anyone else thinking about using the tech to learn something about "the grammar of DNA"?

If they can use it for analysing protein sequences, maybe they can tackle "the grammar of Life" and kickstart the whole bioengineering sector into a new life...

OTOH, the integrist christians will probably denounce this as an evil thing...

--
It takes 40+ muscles to frown, but only four to extend your arm and bitchslap the motherfucker

Re:grammar isn't enough by g2devi · 2005-09-01 00:05 · Score: 3, Interesting

Even better. The meaning of words can flip back and forth depending on the ever widening context.

* The clown threw a ball.

(Probably, a tennis or basket ball)

* The clown threw a ball,....for charity.

(Okay, sorry, a ball as in a party.)

* The clown threw a ball,....for charity...., and hit the target.

(Okay, sorry again, the tennis ball hit the dunking target and someone fell in the water. Got it. We're in a carnival.)

* The clown threw a ball,....for charity...., and hit the target....of 1 million dollars.

(Scratch that. It really is a charity party and we've collected 1 million in donations. There's no way the meaning can change again.)

* The clown threw a ball,....for charity...., and hit the target....of 1 million dollars....by striking out Babe Ruth.

(Oops again. The clown got 1 million dollars in pledges if he could strike out Babe Ruth, and he succeeded. We're talking about a baseball again. I give up.)

Re:Noam Chomsky by edibleplastic · 2005-09-01 01:29 · Score: 2, Interesting

This won't disprove Chomsky's theories, at most it will serve as evidence that language can be learned through statistical means. The reason it won't disprove anything is because we're ultimately interested in the way that *humans* learn language. Whether or not it's possible to learn a language solely through statistical means doesn't change the fact of the matter for humans, which may or may not have a genetic endowment for learning language. It's entirely possible that it's possible in principle to learn language this way, but we do it with some priors (the universal grammar).

There have been basically two prongs of arguments in favor of the existence of a Universal Grammar in the debate. The first is that the task of learning an infinite grammar from a finite subset of sentences (and then only from positive evidence) appears to be too difficult to accomplish solely through statistical means. The second is an effort to show that language learning is biologically- rather than experience-based. This is the effort to show that there is a critical period in language development, which would suggest that there is a strong biological (i.e., genetic) component to language learning.

In my opinion, the first prong isn't very strong, since it relies on assumptions about statistical learning to make its claims. Their claims to me seem to stem more from a lack of imagination than from anything we can pin down as logically necessary. Shimon Edelman's work would work against this prong, showing that yes, it is possible to learn a language via statistical means. (It would still have to be shown that the knowledge the computer possesses is qualitatively similar to that learned by humans... it may learn languages in a completely different way).

His findings wouldn't affect the second prong at all, though, which to my mind is the stronger of the two approaches. There have been lots of studies which suggest that there is a biological timecourse for language acquisition, suggesting that we do have an innate capacity for it.

So to sum up, while I find it a very exciting and important finding, I don't believe it by itself will disprove the theory of Universal Grammar.
