Slashdot Mirror


New Algorithm for Learning Languages

An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."

24 of 454 comments

  1. Sucks to be a support tech in India by HeLLFiRe1151 · · Score: 5, Funny

    Their jobs be outsourced to computers.

    --
    I've got 101 mod points and you can't have them!
  2. Didn't Google already do this? by powerline22 · · Score: 5, Interesting

    Google apparently has a system like this in their labs, and entered it into some national competition, where it pwned everyone else. Apparently, the system learned how to translate to/from Chinese extremely well, without any of the people working on the project knowing the language.

    1. Re:Didn't Google already do this? by spisska · · Score: 5, Interesting

      IIRC, Google's translator works from a source of documents from the UN. By cross-referencing the same set of documents in all kinds of different languages, it is able to do a pretty solid translation built on the work of goodness knows how many professional translators.

      What is a little more confusing to me is how machine translation can deal with finer points in language, like different words in a target language where the source language has only one. English for example has the word "to know" but many languages use different words depending on whether it is a thing or a person that is known. Or words that relate to the same physical object but carry very different cultural connotations -- the word for female dog is not derogatory in every language, for example, but some other animals can be extremely profane depending on who you talk to.

      Or situations where two entirely different real-world concepts mean similar things in their respective language -- in English, for example, you're up shit creek, but in Slavic languages you're in the pussy.

      I've done translation work before (Slovak -> English), and there's much more going on than differences in words and grammar. There are whole conceptual frameworks in languages that just don't translate, and this is frustrating for anyone learning a language, let alone trying to translate. English is very precise (when used as directed) in matters of time and sequence -- we have more than 20 verb tenses where most languages get away with three.

      Consider this:

      I was having breakfast when my sister, whom I hadn't seen in five years, called and asked if I was going to the county fair this weekend. I told her I wasn't because I'm having the painters come on Saturday. They'll have finished by 5:00, I told her, so we can get together afterwards.

      These three sentences use six different tenses: past continuous, past perfect, past simple, present continuous, future perfect, and present simple, and are further complicated by the fact that you have past tenses referring to the future, present tenses referring to the future, and the wonderful future perfect tense that refers to something that will be in the past from an arbitrary future perspective, but which hasn't actually happened yet. Still following?

      On the other hand, English is much less precise in things like prepositions and objects, and utterly inexplicable when it comes to things like articles, phrasal verbs, and required word order -- try explaining why:

      I'll pick you up after work

      I'll pick the kids up after work

      I'll pick up the kids after work

      are all OK, but

      I'll pick up you after work

      is not.

      Machine translation will be a wonderful thing for a lot of reasons, but because of these kinds of differences in languages, it will be limited to certain types of writing. You may be able to get a computer to translate the words of Shakespeare, but a rose, by whatever name, is not equally sweet in every language.
  3. SCIgen by OverlordQ · · Score: 5, Interesting

    SCIgen anyone?

    --
    Your hair look like poop, Bob! - Wanker.
  4. PDF of paper by mattjb0010 · · Score: 5, Informative

    Paper here for those who have PNAS access.

    1. Re:PDF of paper by ksw2 · · Score: 5, Funny
      Paper here for those who have PNAS access.

      HEH! funniest meant-to-be-serious acronym ever.

  5. Woah by SpartanVII · · Score: 4, Funny

    Imagine if the editors started using this, what would everyone have to bitch about on Slashdot?

  6. Speaking as someone working on NLP by OO7david · · Score: 4, Interesting

    IAALinguist doing computational things and my BA focused mainly on syntax and language acquisition, so here're my thoughts on the matter.

    It's not going to be right. The algorithm is stated as being statistically based, which, while similar to the way children learn languages, is not exactly it. Children learn by hearing correct native language from their parents, teachers, friends, etc. The statistics come in either when children produce utterances that do not conform to the speech they hear or when people correct them. However, statistics does not come in at all with what they hear.

    As for the algorithm learning the underlying grammar of a language, I am dubious enough to call it a grand, untrue claim. Basically all modern views of syntax are unscientific and we're not going to get anywhere until Chomsky dies. Think about the word "do" in English. No view of syntax describes where it comes from. Rather, languages are shoehorned into our constructs.

    So, either they're using a flawed view of syntax or they have a new view of syntax and for some reason aren't releasing it in any linguistics journal as far as I know.

    1. Re:Speaking as someone working on NLP by OO7david · · Score: 4, Interesting

      It is in effect two parted:

      Chomsky is to linguistics as Freud is to psych. He had great ideas for the time (many still stand), and the science would be nowhere close to where it is without him. However, A) he's backed off a lot from supporting his own theories and B) he's published papers contradicting his original ideas, so there is some question as to their veracity. Since so many linguistics undergrads hold him as the pinnacle of syntax, none are really deviating drastically from him.

      WRT the unscientificness, to make his view fit English, there has to be "do-support" which basically is that when forming an interrogative "do" just comes in to make things work without any explanation. In other words, it is in our grammar, but our view of syntax does not account for it.

    2. Re:Speaking as someone working on NLP by PurpleBob · · Score: 4, Interesting

      You're right about Chomsky holding back linguistics. (There are all kinds of counterarguments against his Universal Grammar, but people defend it because Chomsky Is Always Right, and Chomsky himself defends it with vitriolic, circular arguments that sound alarmingly like he believes in intelligent design.)

      And I agree that this algorithm doesn't seem that it would be entirely successful in learning grammar. But this is not because it's statistical. I don't understand how you can look at something as complicated as the human brain and say "statistics does not come in at all".

      If this algorithm worked, then it could be statistical, symbolic, Chomskyan, or magic voodoo and I wouldn't care. There's no reason that computers have to do things the same way the brain does, and I doubt they'll have enough computational power to do so for a long time anyway.

      No, the flaws in this algorithm are that it is greedy (so a grammar rule it discovers can never be falsified by new evidence), and it seems not to discover recursive rules, which are a critical part of grammar. Perhaps it's learning a better approximation to a grammar than we've seen before, but it's not really doing the amazing, adaptive, recursive thing we call language.

      --
      Win dain a lotica, en vai tu ri silota
  7. Wow! by the_skywise · · Score: 4, Funny

    They've rediscovered the Eliza program!

    Input: "For example, the sentences I would like to book a first-class flight to Chicago, I want to book a first-class flight to Boston and Book a first-class flight for me, please may give rise to the pattern book a first-class flight -- if this candidate pattern passes the novel statistical significance test that is the core of the algorithm."

    How does it feel to "book a first-class flight"?
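
    For the curious, the quoted example can be mimicked with a crude stand-in: count the word n-grams that show up in several sentences. To be clear, this is plain frequency counting and not the statistical significance test the article mentions (the quote doesn't say how that test works); the function name and thresholds below are made up for illustration.

```python
from collections import Counter

# The three example sentences from the quoted passage.
SENTENCES = [
    "I would like to book a first-class flight to Chicago",
    "I want to book a first-class flight to Boston",
    "Book a first-class flight for me, please",
]

def shared_ngrams(sentences, n, min_sentences=2):
    """Return word n-grams that occur in at least `min_sentences` sentences.

    Hypothetical helper: a frequency count only, not the ADIOS criterion.
    """
    counts = Counter()
    for sentence in sentences:
        words = [w.strip(".,!?") for w in sentence.lower().split()]
        # Use a set so each sentence contributes an n-gram at most once.
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return [" ".join(g) for g, c in counts.items() if c >= min_sentences]

print(shared_ngrams(SENTENCES, n=4, min_sentences=3))
# → ['book a first-class flight']
```

    Lowering min_sentences to 2 also lets through fragments like "to book a first-class", which is presumably the sort of noise a real significance test is there to filter out.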

  8. Markov Chains anyone? by ImaLamer · · Score: 5, Informative

    http://en.wikipedia.org/wiki/Markov_chain

    Used this (easy to compile) C program:

    http://www.eblong.com/zarf/markov/

    to create these:

    http://www.mintruth.com/mirror/texts/

    Mod points to whoever can tell us what texts they use. (No mod points can actually be given)
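
    If you don't feel like compiling C, the same idea fits in a few lines of Python. This is only a rough sketch of a word-level Markov chain, not a port of the program linked above; build_chain and generate are names I made up, and order=2 is an arbitrary choice.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Random-walk the chain from a randomly chosen starting state."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a fixed seed is reproducible
    out = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:  # dead end: this word run was never followed by anything
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)

sample = "the cat sat on the mat and the cat ate the rat on the mat"
print(generate(build_chain(sample), length=12, seed=0))
```

    Feed it a longer text and raise the order to get output that looks more like the source; with a tiny sample like the one above it mostly just replays the input.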

  9. Full article for non-PNAS subscribers by dmaduram · · Score: 4, Informative

    Unsupervised learning of natural languages

    Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman
    School of Physics and Astronomy and School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel; and Department of Psychology, Cornell University, Ithaca, NY 14853

    We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.

    Many types of sequential symbolic data possess structure that is (i) hierarchical and (ii) context-sensitive. Natural-language text and transcribed speech are prime examples of such data: a corpus of language consists of sentences defined over a finite lexicon of symbols such as words. Linguists traditionally analyze the sentences into recursively structured phrasal constituents (1); at the same time, a distributional analysis of partially aligned sentential contexts (2) reveals in the lexicon clusters that are said to correspond to various syntactic categories (such as nouns or verbs). Such structure, however, is not limited to the natural languages; recurring motifs are found, on a level of description that is common to all life on earth, in the base sequences of DNA that constitute the genome. We introduce an unsupervised algorithm that discovers hierarchical structure in any sequence data, on the basis of the minimal assumption that the corpus at hand contains partially overlapping strings at multiple levels of organization. In the linguistic domain, our algorithm has been successfully tested both on artificial-grammar output and on natural-language corpora such as ATIS (3), CHILDES (4), and the Bible (5). In bioinformatics, the algorithm has been shown to extract from protein sequences syntactic structures that are highly correlated with the functional properties of these proteins.

    The ADIOS Algorithm for Grammar-Like Rule Induction

    In a machine learning paradigm for grammar induction, a teacher produces a sequence of strings generated by a grammar G0, and a learner uses the resulting corpus to construct a grammar G, aiming to approximate G0 in some sense (6). Recent evidence suggests that natural language acquisition involves both statistical computation (e.g., in speech segmentation) and rule-like algebraic processes (e.g., in structured generalization) (7-11). Modern computational approaches to grammar induction integrate statistical and rule-based methods (12, 13). Statistical information that can be learned along with the rules may be Markov (14) or variable-order Markov (15) structure for finite state (16) grammars, in which case the EM algorithm can be used to maximize the likelihood of the observed data. Likewise, stochastic annotation for context-free grammars (CFGs) can be learned by using methods such as the Inside-Outside algorithm (14, 17).

    We have developed a method that, like some of those just mentioned, combines statistics and rules: our algorithm, ADIOS (for automatic distillation of structure) uses statistical information present in raw sequential data to identify significant segments and to distill rule-like regularities that support structured generalization. Unlike

  10. Re:just thought.. by Bogtha · · Score: 4, Insightful

    This algorithm works with sample data. Where is the sample data going to come from? If you have to download it, then that negates the whole point of using it. If you use what you see online, well that's just ridiculous, for obvious reasons :).

    --
    Bogtha Bogtha Bogtha
  11. No the didn't by Ogemaniac · · Score: 5, Interesting

    I played around with the Google translator for a while. I work in Japan and am half-way fluent. Google couldn't even turn my most basic Japanese emails into comprehensible English. Same is true for the other translation programs I have seen.

    I will believe this new program when I see it.

    Translation, especially from extremely different languages, is absurdly difficult. For example, I was out with a Japanese woman the other night, and she said "aitakatta". Literally translated, this means "wanted to meet". Translated into native English, it means "I really wanted to see you tonight". It is going to take one hell of a computer program to figure that out from statistical BS. I barely could with my enormous meat-computer and a whole lot of knowledge of the language.

    1. Re:No the didn't by superpulpsicle · · Score: 4, Informative

      Try this free website out. http://www.freetranslation.com/

      I know it is fairly accurate because I have fooled my spanish speaking friends once in an IM conversation. I told them I learned spanish via hypnosis and basically just copy/pasted everything spanish into IM. The conversation went on for like 15 minutes full spanish before I told them I was using the website. They were pissing their pants.

  12. Re:Noam Chomsky by venicebeach · · Score: 5, Insightful

    Perhaps a linguist could weigh in on this, but it seems to me that this kind of research is quite contrary to the Chomskian view of linguistics.

    Instead of a language module with specialized abilities tuned to learn rule-based grammar, we have an unsupervised learning system that has surmised the grammar of the language merely from the patterns inherent in the data it is given. That a system can do this is evidence against the notion that an innate grammar module in the brain is necessary for language.

  13. Re:Noam Chomsky by hunterx11 · · Score: 4, Insightful
    Linguistics has nothing to do with prescriptive grammar, except perhaps studying what influence it has on language. Something like "don't split infinitives" is not a rule in linguistics. Something like "size descriptors come before color descriptors in English" is a rule, because it's how people actually speak. Incidentally, most people are not even aware of these rules in their native language, despite obviously having mastery over them.

    If there were no rules, I could write a post using random letters for random sounds in a random order, or just using a bunch of non-letters. That wouldn't convey anything. Saying "I'm writing on slashdot" is more effective than writing "(*&$@(&^$)(#*$&"

    --
    English is easier said than done.
  14. Re:just thought.. by Mac+Degger · · Score: 5, Informative

    What they've developed is something which interprets grammar: the ruleset behind the organisation of building blocks, apparently building-block agnostic.

    A dictionary is just words. This algorithm can't assign meaning to the building blocks; it can only decide how and in what order the building blocks go together.

    --
    -- Waht? Tehr's a preveiw buottn?
  15. Re:Noam Chomsky by SparksMcGee · · Score: 4, Insightful
    I took a linguistics class this previous year with a professor who absolutely disagreed with the Chomskyan view of linguistics (though she did acknowledge that he had contributed a great deal to the field). Some of the arguments against Chomsky include objections to the Chomskyan view of "universal grammar"--that essentially a series of neural "switches" determine what language a person knows and that these in turn are purely grammatical in nature (the lexicon of different languages qualifying as "superficial"--in and of itself a somewhat tenable argument). While this holds reasonably well for English and closely related languages (English grammar in particular depends a tremendous amount upon word order and syntax, and thus lends itself well to this sort of computational model), in many languages the lines between nominally "superficial" categories--e.g. phonology, lexicon and syntax--become blurred, especially in, for instance, case languages. Whereas you can break down the grammatical elements of an English sentence fairly easily into "verb phrases", "noun phrases" and so on, this is largely because of English syntactical conventions. When a system of prefixes and suffixes can turn a base morpheme from a noun phrase to a verb phrase or any of various parts of speech, the kind of categories to which English morphemes and phrases lend themselves become much harder to apply. Add to this the fact that there exist languages (e.g. Chinese) in which grammatically superficial categories (in English) like phonology become syntactically and grammatically significant, and the sheer variety of linguistic grammars either seriously undermines the theory in general or forces upon one the Socratic assumption that everyone knows every language and every possible grammar from birth and simply needs to be exposed to the rules of whatever their native language is and to pick up superficialities like lexicon to become a fluent speaker.
    It's not all complete nonsense, but if it were truly correct then presumably computerized translation software (with the aid of large dictionary files for lexicons) would have been perfected some time ago.


    Sorry about the rant, but like I said, my prof did *not* like the Chomskyan view of linguistics.

    Oh, and as far as the notion of the "language module" goes, it might be premature to call it a module, but there *is* neurophysiological evidence to suggest that humans are physically predisposed towards learning language from birth, so that much at the very least is tenable.

  16. grammar isn't enough by JoeBuck · · Score: 4, Informative
    The classic problem example is:
    • Time flies like an arrow.
    • Fruit flies like a banana.
    There are other, similar examples. Computer systems tend to deduce either that there's a type of insect called "time flies", or that the latter sentence refers to the aerodynamic properties of fruit.
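
    To see why grammar alone can't settle it, here is a toy sketch: a hand-written grammar and a CKY-style chart that counts parse trees. The lexicon and rules are invented for this example and are far smaller than anything a real parser would use; the point is only that both sentences get the same two structures, and nothing in the grammar says which one is sensible.

```python
from collections import defaultdict

# Toy lexicon (an assumption for illustration): each word lists its possible categories.
LEXICON = {
    "time": {"N", "NP"}, "fruit": {"N", "NP"},
    "flies": {"N", "V"}, "like": {"V", "P"},
    "an": {"Det"}, "a": {"Det"},
    "arrow": {"N"}, "banana": {"N"},
}

# Toy binary rules, written as (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),   # compound noun, e.g. "fruit flies"
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
    ("PP", "P", "NP"),
]

def count_parses(sentence, root="S"):
    """Count the distinct parse trees the toy grammar assigns to `sentence`."""
    words = sentence.lower().strip(".").split()
    n = len(words)
    # chart[(i, j)][cat] = number of trees of category `cat` spanning words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            chart[(i, i + 1)][cat] = 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]

print(count_parses("Time flies like an arrow."))   # → 2
print(count_parses("Fruit flies like a banana."))  # → 2
```

    One tree treats "flies" as the verb and "like ..." as a comparison; the other treats "time flies" or "fruit flies" as a compound noun and "like" as the verb. Picking the right one for each sentence takes knowledge about the world, not more grammar.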
  17. English only has two tenses. by ericbg05 · · Score: 5, Informative
    I've done translation work before (Slovak -> English), and there's much more going on than differences in words and grammar. There are whole conceptual frameworks in languages that just don't translate, and this is frustrating for anyone learning a language, let alone trying to translate.

    Yes! I'd have thrown a mod point at you just for this paragraph if I could.

    English is very precise (when used as directed) in matters of time and sequence -- we have more than 20 verb tenses where most languages get away with three.

    Not really. Firstly, English only has two or three tenses. (Depending upon which linguist you ask, English either has a past/non-past distinction or past/present/future distinctions. See [1], [2]. The general consensus seems to be in favor of the former, although I humbly disagree with the general consensus.) It maintains a variety of aspect distinctions (perfective vs imperfective, habitual vs continuous, nonprogressive vs progressive). See [3]. Its verbs also interact with modality, albeit slightly less strongly.

    It's a very common mistake to count the combinations of tense, aspect, and modality in a language and arrive at some astronomical number of "tenses". It's an even more common mistake (for native English speakers, anyway) to think that English is special or different or strange compared to other languages. In most cases, it's not -- especially when compared with other Indo-European languages.

    Secondly, and more interestingly IMHO, most languages do not have three distinct tenses. The most common cases are either to have a future/non-future distinction or a past/non-past distinction. In any case, the future tense, if it exists, is normally derived from modal or aspectual markers and is diachronically weak (which is linguist-babble meaning "future tense forms don't stick around for very long"). See [3].

    English is a perfect example: will, of course, used to refer to the agent's desire (his or her will) to do something. Only recently has it shifted to have a more temporal sense, and it still maintains some of its modal flavor. In fact, the least marked way of making the future (in the US, at least) is to use either gonna or a present progressive form: I'm having dinner with my boss tonight. I'm gonna ask him for a raise. See Comrie [1] again.

    So as not to be anglo-centric, I'll give another example. Spanish has three widespread means of forming the future tense. Two of these are periphrastic and are exemplified by he de cantar 'I've gotta sing' and voy a cantar 'I'm gonna sing'. The last is the synthetic form, cantaré 'I'll sing'.

    Most high school or college Spanish teachers would tell you that the "pure" future is cantaré. Actually, it's historically derived from the phrase cantar he 'I have to sing' (from Latin cantáre habeo), and is being displaced by the other two forms all across the Spanish-speaking world. I'm told, for example, that cantaré has been largely lost in Argentina and southern Chile (see [4]).

    In any case, the parent's main point still holds. It's a b?tch to deal with cross-linguistic differences in major semantic systems computationally. But good lord, it's fun to try. :)

    References:

    1. Comrie, Bernard. Tense. Cambridge, UK: Cambridge University Press, 1985.
    2. Davidsen-Nielsen, Niels. "Has English a Future?" Acta Linguistica Hafniensia 21 (1987): 5-20.
    3. Frawley, William.
  18. Random test ... by Mostly+a+lurker · · Score: 5, Funny
    I know it is fairly accurate because I have fooled my spanish speaking friends once in an IM conversation. I told them I learned spanish via hypnosis and basically just copy/pasted everything spanish into IM. The conversation went on for like 15 minutes full spanish before I told them I was using the website. They were pissing their pants.
    English to German produces:
    Ich weiß, dass es ziemlich genau ist, weil ich mein Spanisch getäuscht habe, Freunde einmal in einer IM Konversation zu sprechen. Ich habe sie erzählt, dass ich Spanisch über Hypnose und im Grunde nur Kopie gelernt habe/hat eingefügt alles Spanisch in IM. Die Konversation ist weitergegangen für wie 15 Minuten volles Spanisch, bevor ich sie erzählt habe, dass ich die Website benutzte. Sie pissten ihre Hose
    Then, German to English:
    I know that it rather exactly is, because I deceived my Spanish to speak friends once in one IN THE conversation. I told it, learned would have inserted that I Spanish over hypnosis and in the reason only copy all Spanish in IN THAT. The conversation is gone on for Spanish full like 15 minutes before I told it, that I the websites used. You pissten its pair of pants
    My conclusion is that there is still a place for human translators.
  19. Re:just thought.. by jaavaaguru · · Score: 4, Interesting

    Perhaps the algorithm could be used to identify spam more accurately. If it can understand the text, then it's got a reasonable chance of knowing whether the text is junk.