New Algorithm for Learning Languages
An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."
However, statistics does not come in at all with what they hear.
Utterance in pattern A is heard more often than utterance in pattern B; utterances in patterns C and D are not heard at all. How is that not statistics?
I'm starting to get the feeling that there is nothing in sci fi that won't occur in reality. Except for the dorky guy getting to nail the hot busty alien babe, that is. heh.
Unlike all the ridiculous patents being granted lately to IT companies, the one these guys are filing for seems legitimate to me. It's a nice change in my mind.
That's just a Markov Model that "learned" from what looks like religious mumbo jumbo in the first place.
Markov models are perhaps the easiest language acquisition model to implement, but also one of the worst at coming up with valid speech or text.
Interestingly, they do much, much better as recommender systems.
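For anyone who hasn't played with one, here is roughly what a word-level Markov model of the kind mentioned above looks like. This is only a minimal sketch in Python, not the researchers' algorithm; the function names (build_bigram_model, generate) and the sample sentence are made up for illustration.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words that followed it in the sample."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def generate(model, start, max_words=10, seed=None):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < max_words:
        followers = model.get(output[-1])
        if not followers:
            # Dead end: this word was never followed by anything in the sample.
            break
        output.append(rng.choice(followers))
    return " ".join(output)


# Hypothetical toy corpus, just to show the mechanics.
sample = "the cat sat on the mat and the dog sat on the cat"
model = build_bigram_model(sample)
sentence = generate(model, "the", max_words=8, seed=0)
print(sentence)
```

Each word is chosen by looking only at the word before it, so every adjacent pair in the output appeared somewhere in the sample, but nothing forces the sentence as a whole to make sense. That is why the results read as locally plausible and globally garbled.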
This algorithm works with sample data. Where is the sample data going to come from? If you have to download it, then that negates the whole point of using it. If you use what you see online, well that's just ridiculous, for obvious reasons :).
Bogtha Bogtha Bogtha
And the "rules" of a language are NOT what children "learn". First of all, children acquire a language, they do not "learn" it. That acquisition accounts for much of the child's ability to speak it, not whether or not they understand gerunds and the pluperfect.
Second, in a language such as English, whose words for the most part need no particular order for their meaning to be understood and, even worse, lack declension forms to distinguish subject from object of the preposition, what success can a language recognition program have "learning" such a language when prepositions themselves can mostly be omitted? To teach a computer Latin is easy.
Third, what's the hope of the computer ever understanding something like Shakespeare, Joyce, or Dante, whose uses of language rely extensively on erudition for word placement as opposed to typical usage? While a computer might be able to learn Latin because of its rigorous rules, I doubt it could faithfully render a text from Ovid.
Perhaps a linguist could weigh in on this, but it seems to me that this kind of research is quite contrary to the Chomskian view of linguistics.
Instead of a language module with specialized abilities tuned to learn rule-based grammar, we have an unsupervised learning system that has surmised the grammar of the language merely from the patterns inherent in the data it is given. That a system can do this is evidence against the notion that an innate grammar module in the brain is necessary for language.
This algorithm works with sample data. Where is the sample data going to come from? If you have to download it, then that negates the whole point of using it. If you use what you see online, well that's just ridiculous, for obvious reasons :).
It's going to come from large bodies of text that exist in multiple languages. Things like the Bible, the Constitution, et cetera. The whole point of this technology is that by drawing conclusions from those texts, the program infers the underlying rules of the language and can therefore translate other things. Google was doing something similar. An online dictionary is completely different. First, it has to be compiled by someone. Second, it only helps for translating words verbatim. This technology would teach itself to translate languages, even if none of the researchers working on the project could speak those languages themselves. That's the beauty of it.
If there were no rules, I could write a post using random letters for random sounds in a random order, or just using a bunch of non-letters. That wouldn't convey anything. Saying "I'm writing on slashdot" is more effective than writing "(*&$@(&^$)(#*$&"
English is easier said than done.
but
I'll pick up you after work
is not.
It can be, depending on context or emphasis. "I'll pick up the kids after lunch. I'll pick up you after work."
-- Alastair
Just because the program can extract grammar, it doesn't mean it can extract meaning. If I give you this sentence:
Ov brug termat akti mak lejna trovterna.
And tell you that "termat" and "lejna" are nouns, "akti mak" is a 'composite' verb, "brug" and "trovterna" are adjectives... it still doesn't say anything about the actual meaning.
The filesystem is the package manager
Sorry about the rant, but like I said, my prof did *not* like the Chomskyan view of linguistics.
Oh, and as far as the notion of the "language module" goes, it might be premature to call it a module, but there *is* neurophysiological evidence to suggest that humans are physically predisposed towards learning language from birth, so that much at the very least is tenable.
Yes, pattern recognition is a major part of the process. However, there are other fundamental parts that are also extremely important, and lacking them you get nonsense. In particular, context matters. "aitakatta" in the middle of a business letter probably does mean "wanted to meet". By itself, said by one member of a couple to the other over drinks at a bar, it does not.
In order for a program to translate accurately, it needs to know who is speaking/writing, who the audience is, what their relationship is, and their location. Some of this may be given to the computer explicitly, or easily found in the text/speech (for a human at least), but some of it may not. This is not going to be an easy problem to solve.
Writing is never free from its context. I know before I even start whether I am reading a fiction novel, a satire, a scientific journal, an email from my boss, or a text message from my date this Saturday. The meaning of the words can change a lot in those cases.
Even Google translator, which was trained on multi-lingual UN reports, could not produce comprehensible English from simple Japanese business emails.
As for my chinko, that's a long story.
Called Pragmatics. It can be somewhat oversimplified as saying it's the study of how context affects meaning or as figuring out what we really mean, as opposed to what we say.
For example, a classical Pragmatics scenario:
John is interested in a coworker, Anna, but is shy and doesn't want to ask her out if she's taken. He asks his friend Dave if he knows whether Anna is available, to which Dave replies "Anna has two kids."
Now, taken literally, Dave did not answer John's question. What he literally said is that Anna has at least two children, and presumably exactly two children. That says nothing of her availability for dating. However, there's nobody who reads that scenario who doesn't get what Dave actually meant to communicate: That Anna is married, with children.
So that's a major problem computers hit when trying to really understand natural language. You can write a set of rules that completely describes all the syntax and grammar. However, that doesn't get you to meaning, because meaning occurs at a higher level than that. Even when we are speaking literally and directly, there's still a whole lot of context that comes into play. Since we are quite often speaking at least partially indirectly, it gets to be a real mess.
Your example is a great one of just how bad it gets between languages. The literal meaning in Japanese was not the same as the intended meaning. So first you need to decode that. However, even if you know that, a literal translation of the intended meaning may not come out right in another language. To really translate well you need to be able to decode the intended meaning of a literal phrase, translate that into an appropriate meaning in the other language, and then encode that in a phrase that conveys that intended meaning accurately, and in the appropriate way.
It's a bitch, and not something computers are even near capable of.
Yeah, and this didn't learn the language in any meaningful sense. It just found a statistical pattern, and then generates possible sentences from that pattern. That's a whole lot different from you and me understanding the language and generating intentional, meaningful sentences.
http://www.perthonline.net
Klingon has simple grammar.
How about Dolphinese? Research shows that they seem to be able to scout and transfer information from one individual to his/her pod. If there's some grammar, it would be a pretty good nut to crack.
It is no longer uncommon to be uncommon.
If you take just a single string [of length n] and rotate it against itself in a search for matches, then you've got to do n^2 byte comparisons just to find all singleton matches,...
No you don't :-)
If you want to find all singleton matches, it's enough to sort the string into ascending order (order n.log(n)), and then scan through for adjacent matches (order n). For example, sorting "the cat sat on the mat" gives "cat mat on sat the the", where the two "the"s are now adjacent and so easily discovered. For finding longer matches the sorting method still works, except that you sort fragments of the sentence rather than individual words. Clearly there is more work involved, but (depending on exactly what you're counting) there are still order n.log(n) comparisons to be performed.
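A quick Python sketch of the sort-then-scan idea, in case it helps. The function names (repeated_words, repeated_fragments) are my own, and taking fixed-length character fragments is just one assumed way to "sort fragments of the sentence"; copying every fragment like this costs extra memory that a serious implementation would avoid.

```python
def repeated_words(sentence):
    """Sort the words, then scan once for equal neighbours."""
    words = sorted(sentence.split())
    repeats = set()
    for left, right in zip(words, words[1:]):
        if left == right:
            repeats.add(left)
    return sorted(repeats)


def repeated_fragments(text, length):
    """Same trick on every character fragment of the given length."""
    fragments = sorted(
        text[i:i + length] for i in range(len(text) - length + 1)
    )
    repeats = set()
    for left, right in zip(fragments, fragments[1:]):
        if left == right:
            repeats.add(left)
    return sorted(repeats)


print(repeated_words("the cat sat on the mat"))         # ['the']
print(repeated_fragments("the cat sat on the mat", 3))  # ['at ', 'he ', 'the']
```

After sorting, identical items can only sit next to each other, so a single linear pass finds every repeat without comparing each position against every other position.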
This means that searching for substring matches can be performed relatively efficiently. I don't know about how the language-learning algorithm works, but you may be interested to know that the compression algorithm used by "bzip2" works in exactly this way (google for "Burrows-Wheeler transform" for more details!)
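For the curious, here is a naive Python sketch of the Burrows-Wheeler transform itself. It is a textbook toy, not bzip2's actual code: it builds every rotation in memory, and the "$" end marker and the function names (bwt, inverse_bwt) are assumptions for illustration. bzip2 follows the transform with further coding stages that are not shown here.

```python
def bwt(text, end_marker="$"):
    """Sort all rotations of text + marker and keep the last column."""
    if end_marker in text:
        raise ValueError("end marker must not occur in the text")
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, end_marker="$"):
    """Rebuild the sorted rotation table one column at a time."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    for row in table:
        if row.endswith(end_marker):
            return row[:-1]
    raise ValueError("end marker not found in input")


encoded = bwt("banana")
print(encoded)               # annb$aa
print(inverse_bwt(encoded))  # banana
```

Sorting the rotations puts similar contexts next to each other, so the last column tends to contain runs of the same character, which is what makes the output easier to compress.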
You over-estimate some speakers, me-thinks.
Tuus crepidae innexilis sunt.