New Algorithm for Learning Languages
An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."
would probably help with the problem of either downloading a small, incomplete dictionary, a dictionary with errors, or a massive dictionary file.
Google apparently has a system like this in their labs, and entered it into some national competition, where it pwned everyone else. Apparently, the system learned how to translate to/from Chinese extremely well, without any of the people working on the project knowing the language.
SCIgen anyone?
Your hair look like poop, Bob! - Wanker.
IAALinguist doing computational things and my BA focused mainly on syntax and language acquisition, so here're my thoughts on the matter.
It's not going to be right. The algorithm is described as statistically based, which, while similar to the way children learn languages, is not exactly the same thing. Children learn by hearing correct native speech from their parents, teachers, friends, etc. The statistics come in when children produce utterances that do not conform to the speech they hear, or when people correct them. However, statistics do not come into what they hear at all.
With respect to the algorithm learning the underlying grammar of a language, I am dubious enough to call it a grand, untrue claim. Basically all modern views of syntax are unscientific and we're not going to get anywhere until Chomsky dies. Think about the word "do" in English. No view of syntax describes where it comes from. Rather, languages are shoehorned into our constructs.
So, either they're using a flawed view of syntax or they have a new view of syntax and for some reason aren't releasing it in any linguistics journal as far as I know.
If fed with a heap of decent grammar, what happens when it's fed with bad grammar and spelling? Will it learn, and incorporate, the tripe or reject it? That's the sort of problem with natural language apps: it's quite hard to sort the good from the bad when it's learning. Take the MegaHAL library (http://megahal.alioth.debian.org/) for example. Although possibly not as complex, it does a decent job at learning, but when fed with rubbish it will output rubbish. I don't think it's the learning that will be the hard part, but rather the recognition of the good vs. the bad that will prove how good the system is.
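The "rubbish in, rubbish out" point can be shown with a toy word-level Markov chain. This is only a minimal sketch, not MegaHAL's actual implementation; the function names and the two training sentences are made up for illustration. The model has no notion of "good" or "bad" input, so a misspelling becomes just another transition it can emit.

```python
import random
from collections import defaultdict

def train_bigrams(sentences):
    """Record, for every word, the list of words seen right after it."""
    model = defaultdict(list)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            model[prev].append(nxt)
    return model

def generate(model, rng, max_words=20):
    """Walk the chain from the start marker until the end marker."""
    word, out = "<s>", []
    for _ in range(max_words):
        word = rng.choice(model[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

# One clean sentence and one sloppy one (both hypothetical).
model = train_bigrams(["the cat sat", "teh cat sited"])
print(generate(model, random.Random(0)))
```

Because "cat" appears in both sentences, the chain can start from the clean sentence and still finish with "sited". Filtering that out needs some judgment the plain statistics do not supply.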
eXemplary Abstract
Let's see what human DNA really says and means!
Can it decipher these things too?
But for this, I have one word: Dolphins.
When you're afraid to download music illegally in your own home, then the terrorists have won!
I hope the material is (or will be made) accessible to laypersons. I'd love to be able to use this algorithm for my own music experiments.
How long until we see something like this applied to ?
I played around with the Google translator for a while. I work in Japan and am half-way fluent. Google couldn't even turn my most basic Japanese emails into comprehensible English. Same is true for the other translation programs I have seen.
I will believe this new program when I see it.
Translation, especially from extremely different languages, is absurdly difficult. For example, I was out with a Japanese woman the other night, and she said "aitakatta". Literally translated, this means "wanted to meet". Translated into native English, it means "I really wanted to see you tonight". It is going to take one hell of a computer program to figure that out from statistical BS. I barely could with my enormous meat-computer and a whole lot of knowledge of the language.
something that can make sense of the Voynich manuscript http://www.voynich.nu/. They should have tested their system on it.
If something exists that does not need a creator (god) then why must the cosmos need one?
Electronic babelfish anyone?
My sig beat up your sig.
Seems like that'd be a good place to test the system out. While talking with extraterrestrials would be pretty awesome, having a chat with a dolphin would be pretty cool too. Remember: "The second most intelligent [species] were of course dolphins"
Feed it the entries in the "obfuscated C" competition - if it works for that, it oughta work for anything.
Pug
An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media
For example, the lost Iberian language, spoken in Spain before Latin. There are texts, but nobody understands them.
Could this be used to make a smarter spam filter?
From TFA: The algorithm discovers the patterns by repeatedly aligning sentences and looking for overlapping parts.
If you take just a single string [of length n] and rotate it against itself in a search for matches, then you've got to do n^2 byte comparisons just to find all singleton matches, and then gosh only knows how many comparisons thereafter to find all contiguous stretches of matches.
But if you were to take some set of embedded strings, and rotate them against a second set of global strings [where, in a worst case scenario, the set of embedded strings would consist of the set of all substrings of the set of global strings], then you would need to perform a staggeringly large [for all intents and purposes, infinite] number of byte comparisons.
What did they do to shorten the total number of comparisons? [I've got some ideas of my own in that regard, but I'm curious as to their approach.]
PS: Many languages are read backwards, and I assume they re-oriented those languages before feeding them to the algorithm [it would be damned impressive if the algorithm could learn the forwards grammar by reading backwards].
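The n^2 figure for rotating a single string against itself is easy to check with a brute-force sketch. This is only the naive approach described in the comment above, not the researchers' method (TFA does not say how they cut down the comparisons); `rotation_matches` is a name made up for illustration.

```python
def rotation_matches(s):
    """Compare s against every cyclic rotation of itself, one character at a time."""
    n = len(s)
    comparisons = 0
    matches = []  # (shift, position) pairs where the characters agree
    for shift in range(n):
        for i in range(n):
            comparisons += 1
            if s[i] == s[(i + shift) % n]:
                matches.append((shift, i))
    return comparisons, matches

comparisons, matches = rotation_matches("abcab")
print(comparisons)  # 25, i.e. n^2 for n = 5
print(len(matches))  # 9 singleton matches, 5 of them from the trivial shift 0
```

Stitching the singleton matches into contiguous runs is a further pass on top of this, which is where the real cost question in the comment starts.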
I am not sure if that is really that big a problem. With mobile text messaging, people have started changing their sentences into a form that can be understood by the phone's dictionary.
Say, if I normally would have typed "stroll" to say "walk" and I would notice that when I press 787655 on my phone's keyboard, the T9 dictionary misunderstands me, I would just start typing 9255 for "walk" instead. I think the same would happen here. If somehow the person typing the messages would get instantaneous feedback from the system about a "commonly misunderstood" structure, he would quickly learn to avoid these structures while typing.
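The key sequences in that example follow from the standard phone keypad layout. Here is a small sketch of the mapping; the candidate word list is made up for illustration, and a real T9 dictionary is of course far larger.

```python
# Standard phone keypad letters per digit.
T9_KEYS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in T9_KEYS.items() for ch in letters}

def t9_digits(word):
    """Return the key presses needed to type a word in T9 mode."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())

def t9_candidates(digits, dictionary):
    """Return every dictionary word that shares the given key sequence."""
    return [w for w in dictionary if t9_digits(w) == digits]

print(t9_digits("stroll"))  # 787655
print(t9_digits("walk"))    # 9255
print(t9_candidates("9255", ["walk", "wall", "talk", "stroll"]))  # ['walk', 'wall']
```

Several words can share one key sequence, which is exactly the ambiguity that pushes people toward whichever word the phone guesses first.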
On a related note, things like "fly like an arrow" are in my opinion the most difficult things to learn in a language, and thus foreign speakers do not use or know them. And still, "badly spoken English" can be comprehensible among the people speaking it. One thing I have noticed myself is that it is the British who have the most problems understanding a foreigner speaking English badly. Other foreigners would understand the same person just fine. Something to do with the way the brain is wired to wait for certain words after another, I guess.
Of course, the problem is that we would get rid of all the things that make language "alive". But here I am typing a message in a language other than my own and still many people can to some extent understand what I mean...
Actually, this fits very tidily in a Chomskian context. The program has an internal, predetermined notion of "what a grammar looks like" (i.e. a class of allowable grammars sharing certain properties), and adapts that to the source text. The way all this is presented makes it seem like unsupervised learning that can find any pattern, but the best you can hope to do with a method like this is capture an arbitrary (possibly probabilistic) context free grammar (CFG).
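For readers who have not met the term, a probabilistic CFG is just a set of rewrite rules with probabilities attached. A minimal sketch follows; the grammar, its probabilities, and the `sample` helper are invented for illustration and have nothing to do with the rules the researchers' program actually infers.

```python
import random

# Hypothetical toy grammar: nonterminal -> list of (probability, right-hand side).
PCFG = {
    "S": [(1.0, ["NP", "VP"])],
    "NP": [(0.5, ["the", "clown"]), (0.5, ["the", "ball"])],
    "VP": [(0.7, ["threw", "NP"]), (0.3, ["fell"])],
}

def sample(symbol, rng):
    """Expand a symbol into a list of words by picking rules at random."""
    if symbol not in PCFG:
        return [symbol]  # terminal: an actual word
    probs, rhss = zip(*PCFG[symbol])
    rhs = rng.choices(rhss, weights=probs, k=1)[0]
    return [tok for part in rhs for tok in sample(part, rng)]

rng = random.Random(1)
print(" ".join(sample("S", rng)))
```

Generating from a known grammar like this is the easy direction. The induction problem under discussion is the reverse: seeing only the generated strings and recovering the rules.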
Even then, Gold showed a long, long time ago (1967) that the task of inducing an arbitrary CFG using only generated strings from the language is basically hopeless [Gold, E. Mark. 1967. Language Identification in the Limit. Information and Control, 10:447-474].
That said, this doesn't even seem to be that novel (to me). Andreas Stolcke wrote a very nice PhD dissertation in 1994 on learning arbitrary PCFGs from language strings [Stolcke, Andreas. 1994. Bayesian Learning of Probabilistic Language Models. PhD Dissertation. University of California at Berkeley.]
This is probably a better, more efficient method than Stolcke produced back in '94, but I would be *very* surprised if it revolutionized the way computers interact with language, or anything else of the sort. People working in computational linguistics have a nasty habit of making grand pronouncements, only to fall far short of what they claimed.
For the record: IANAL, but i play one on TV, by which i mean i'm an applied mathematician with a couple published papers in computational linguistics.
I AM a professional human translator, and believe me, if a machine translation did even a half decent job of producing intelligible, natural text, I would use it to get a jump start and save a lot of time.
But as things stand, I'd spend more time knocking the bad translation into shape than if I translated the whole thing from scratch.
Translators are often asked to copy edit other translators' work (customers tend to call this "proof reading", presumably to devalue it and get it done on the cheap, but it involves much more than hunting typos). That's fair enough if you want a quality check. But some smart-arse people try sending machine translations for copy editing. And you can bet they get sent straight back!
No, your children are not the special ones. Nor are your pets.
Anyone else thinking about using the tech to learn something about "the grammar of DNA"?
If they can use it for analysing protein sequences, maybe they can tackle "the grammar of Life" and kickstart the whole bioengineering sector into a new life...
OTOH, the integrist Christians will probably denounce this as an evil thing...
It takes 40+ muscles to frown, but only four to extend your arm and bitchslap the motherfucker
Even better. The meaning of words can flip back and forth depending on the ever widening context.
* The clown threw a ball.
(Probably, a tennis or basket ball)
* The clown threw a ball,....for charity.
(Okay, sorry, a ball as in a party.)
* The clown threw a ball,....for charity...., and hit the target.
(Okay, sorry again, the tennis ball hit the dunking target and someone fell in the water. Got it. We're in a carnival.)
* The clown threw a ball,....for charity...., and hit the target....of 1 million dollars.
(Scratch that. It really is a charity party and we've collected 1 million in donations. There's no way the meaning can change again.)
* The clown threw a ball,....for charity...., and hit the target....of 1 million dollars....by striking out Babe Ruth.
(Oops again. The clown got 1 million dollars in pledges if he could strike out Babe Ruth, and he succeeded. We're talking about a base ball again. I give up.)
This won't disprove Chomsky's theories; at most it will serve as evidence that language can be learned through statistical means. The reason it won't disprove anything is that we're ultimately interested in the way that *humans* learn language. Whether or not it's possible to learn a language solely through statistical means doesn't change the fact of the matter for humans, who may or may not have a genetic endowment for learning language. It may well be possible in principle to learn language this way, while we nonetheless do it with some priors (the universal grammar).
There have been basically two prongs of argument in favor of the existence of a Universal Grammar in the debate. The first is that the task of learning an infinite grammar from a finite subset of sentences (and then only from positive evidence) appears to be too difficult to accomplish solely through statistical means. The second is an effort to show that language learning is biologically- rather than experience-based. This is the effort to show that there is a critical period in language development, which would suggest that there is a strong biological (i.e., genetic) component to language learning.
In my opinion, the first prong isn't very strong, since it relies on assumptions about statistical learning to make its claims. Those claims seem to me to stem more from a lack of imagination than from anything we can pin down as logically necessary. Shimon Edelman's work would work against this prong, showing that yes, it is possible to learn a language via statistical means. (It would still have to be shown that the knowledge the computer possesses is qualitatively similar to that learned by humans... it may learn languages in a completely different way).
His findings wouldn't affect the second prong at all, though, which to my mind is the stronger of the two approaches. There have been lots of studies which suggest that there is a biological timecourse for language acquisition, suggesting that we do have an innate capacity for it.
So to sum up, while I find it a very exciting and important finding, I don't believe it by itself will disprove the theory of Universal Grammar.