Slashdot Mirror


New Algorithm for Learning Languages

An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."

6 of 454 comments

  1. Re:just thought.. by Bogtha · · Score: 4, Insightful

    This algorithm works with sample data. Where is the sample data going to come from? If you have to download it, then that negates the whole point of using it. If you use what you see online, well that's just ridiculous, for obvious reasons :).

    --
    Bogtha Bogtha Bogtha
  2. Re:Noam Chomsky by venicebeach · · Score: 5, Insightful

    Perhaps a linguist could weigh in on this, but it seems to me that this kind of research is quite contrary to the Chomskian view of linguistics.

    Instead of a language module with specialized abilities tuned to learn rule-based grammar, we have an unsupervised learning system that has surmised the grammar of the language merely from the patterns inherent in the data it is given. That a system can do this is evidence against the notion that an innate grammar module in the brain is necessary for language.

  3. Re:Noam Chomsky by hunterx11 · · Score: 4, Insightful
    Linguistics has nothing to do with prescriptive grammar, except perhaps studying what influence it has on language. Something like "don't split infinitives" is not a rule in linguistics. Something like "size descriptors come before color descriptors in English" is a rule, because it's how people actually speak. Incidentally, most people are not even aware of these rules in their native language, despite obviously having mastery over them.

    If there were no rules, I could write a post using random letters for random sounds in a random order, or just using a bunch of non-letters. That wouldn't convey anything. Saying "I'm writing on slashdot" is more effective than writing "(*&$@(&^$)(#*$&"

    --
    English is easier said than done.
  4. Re:Noam Chomsky by SparksMcGee · · Score: 4, Insightful
    I took a linguistics class this previous year with a professor who absolutely disagreed with the Chomskyan view of linguistics (though she did acknowledge that he had contributed a great deal to the field). Some of the arguments against Chomsky include objections to the Chomskyan view of "universal grammar"--that essentially a series of neural "switches" determine what language a person knows and that these in turn are purely grammatical in nature (the lexicon of different languages qualifying as "superficial"--in and of itself a somewhat tenable argument). While this holds reasonably well for English and closely related languages (English grammar in particular depends a tremendous amount upon word order and syntax, and thus lends itself well to this sort of computational model), in many languages the lines between nominally "superficial" categories--e.g. phonology, lexicon and syntax--become blurred, especially in, for instance, case languages. Whereas you can break down the grammatical elements of an English sentence fairly easily into "verb phrases", "noun phrases" and so on, this is largely because of English syntactical conventions. When a system of prefixes and suffixes can turn a base morpheme from a noun phrase to a verb phrase or any of various parts of speech, the kind of categories to which English morphemes and phrases lend themselves become much harder to apply. Add to this the fact that there exist languages (e.g. Chinese) in which grammatically superficial categories (in English) like phonology become syntactically and grammatically significant, and the sheer variety of linguistic grammars either seriously undermines the theory in general or forces upon one the Socratic assumption that everyone knows every language and every possible grammar from birth and simply needs to be exposed to the rules of whatever their native language is and to pick up superficialities like lexicon to become a fluent speaker.
    It's not all complete nonsense, but if it were truly correct then presumably computerized translation software (with the aid of large dictionary files for lexicons) would have been perfected some time ago.


    Sorry about the rant, but like I said, my prof did *not* like the Chomskyan view of linguistics.

    Oh, and as far as the notion of the "language module" goes, it might be premature to call it a module, but there *is* neurophysiological evidence to suggest that humans are physically predisposed towards learning language from birth, so that much at the very least is tenable.

  5. It's actually a new language study by Sycraft-fu · · Score: 3, Insightful

    Called Pragmatics. It can be somewhat oversimplified as saying it's the study of how context affects meaning or as figuring out what we really mean, as opposed to what we say.

    For example, a classical Pragmatics scenario:

    John is interested in a coworker, Anna, but is shy and doesn't want to ask her out if she's taken. He asks his friend Dave if he knows whether Anna is available, to which Dave replies "Anna has two kids."

    Now, taken literally, Dave did not answer John's question. What he literally said is that Anna has at least two children, and presumably exactly two children. That says nothing of her availability for dating. However, there's nobody who reads that scenario who doesn't get what Dave actually meant to communicate: That Anna is married, with children.

    So that's a major problem computers hit when trying to really understand natural language. You can write a set of rules that completely describes all the syntax and grammar. However that doesn't do it, that doesn't get you to meaning, because meaning occurs at a higher level than that. Even when we are speaking literally and directly, there's still a whole lot of context that comes into play. Since we are quite often at least speaking partially indirectly, it gets to be a real mess.

    Your example is a great one of just how bad it gets between languages. The literal meaning in Japanese was not the same as the intended meaning. So first you need to decode that, however even if you know that, a literal translation of the intended meaning may not come out right in another language. To really translate well you need to be able to decode the intended meaning of a literal phrase, translate that into an appropriate meaning in the other language, and then encode that in a phrase that conveys that intended meaning accurately, and in the appropriate way.

    It's a bitch, and not something computers are even near capable of.

  6. Re:O(n^n^n...)????? by psmears · · Score: 3, Insightful

    If you take just a single string [of length n] and rotate it against itself in a search for matches, then you've got to do n^2 byte comparisons just to find all singleton matches,...

    No you don't :-)

    If you want to find all singleton matches, it's enough to sort the string into ascending order (order n.log(n)), and then scan through for adjacent matches (order n). For example, sorting "the cat sat on the mat" gives "cat mat on sat the the"—where the two "the"s are now adjacent and so easily discovered.
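
    A minimal Python sketch of the sort-then-scan idea, using the same sentence. The function name adjacent_duplicates is made up for illustration and is not taken from the article or from any library.

```python
def adjacent_duplicates(tokens):
    """Return each token that occurs more than once, found by sort + adjacent scan."""
    ordered = sorted(tokens)                      # O(n log n) comparisons
    repeats = []
    for prev, cur in zip(ordered, ordered[1:]):   # single O(n) pass over neighbours
        # Equal neighbours mean a repeat; record each repeated token only once.
        if prev == cur and (not repeats or repeats[-1] != cur):
            repeats.append(cur)
    return repeats


words = "the cat sat on the mat".split()
print(sorted(words))               # ['cat', 'mat', 'on', 'sat', 'the', 'the']
print(adjacent_duplicates(words))  # ['the']
```

    The same function works on a list of bytes or characters; only the tokens being sorted change.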

    For finding longer matches the sorting method still works, except that you sort fragments of the sentence rather than individual words. Clearly there is more work involved, but (depending on exactly what you're counting) there are still order n.log(n) comparisons to be performed.
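
    One way to read "sort fragments" is to sort every suffix of the text and compare neighbours in the sorted order. The sketch below does that naively; the function names are hypothetical, and the cost caveat above applies, since each suffix comparison can itself touch many characters.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_repeated_substring(s):
    """Longest substring occurring at least twice in s (naive suffix sort)."""
    # Sorting the suffixes places any two occurrences of a repeated
    # fragment next to each other, just as sorting words did for "the".
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for prev, cur in zip(suffixes, suffixes[1:]):
        k = common_prefix_len(prev, cur)
        if k > len(best):
            best = cur[:k]
    return best


print(repr(longest_repeated_substring("the cat sat on the mat")))  # 'the '
```

    Building every suffix as a separate string costs quadratic memory, so this is only meant to show the idea on short inputs.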

    This means that searching for substring matches can be performed relatively efficiently. I don't know about how the language-learning algorithm works, but you may be interested to know that the compression algorithm used by "bzip2" works in exactly this way (google for "Burrows-Wheeler transform" for more details!)
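
    For the curious, here is a textbook-style sketch of the Burrows-Wheeler transform itself: sort all rotations of the text and keep the last column. This is not bzip2's code. The "$" end marker and the function name bwt are assumptions made for illustration, and real compressors sort far more efficiently than this.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel  # the end marker is an illustrative assumption
    # Sorting rotations groups positions that are followed by the same
    # context, so the characters preceding them tend to form runs.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


print(bwt("banana"))  # annb$aa
```

    The output is a permutation of the input in which repeated fragments show up as runs of equal characters, which is what makes the later compression stages effective.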