Slashdot Mirror


New Algorithm for Learning Languages

An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."

15 of 454 comments

  1. just thought.. by thegoogler · · Score: 3, Interesting
    what if this could be integrated into a small plugin for your browser (or any program) of choice, that would then generate its own dictionary in your language.

    would probably help with the problem of either downloading a small, incomplete dictionary, a dictionary with errors, or a massive dictionary file.

    1. Re:just thought.. by jaavaaguru · · Score: 4, Interesting

      Perhaps the algorithm could be used to identify spam more accurately. If it can understand the text, then it's got a reasonable chance of knowing whether the text is junk.

  2. Didn't Google already do this? by powerline22 · · Score: 5, Interesting

    Google apparently has a system like this in their labs, and entered it into some national competition, where it pwned everyone else. Apparently, the system learned how to translate to/from Chinese extremely well, without any of the people working on the project knowing the language.

    1. Re:Didn't Google already do this? by spisska · · Score: 5, Interesting

      IIRC, Google's translator works from a source of documents from the UN. By cross referencing the same set of documents in all kinds of different languages, it is able to do a pretty solid translation built on the work of goodness knows how many professional translators.
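
      The cross-referencing idea can be sketched with a toy parallel corpus. This is not Google's actual system, just an illustration under simple assumptions: `best_matches` is a hypothetical helper that counts how often a source word and a target word turn up in the same aligned sentence pair, and scores each pairing with the Dice coefficient.

```python
from collections import Counter
from itertools import product


def best_matches(pairs):
    """Guess a translation for each source word from aligned sentence pairs.

    Hypothetical helper for illustration. Each (source, target) word pair is
    scored with the Dice coefficient:
    2 * co-occurrences / (source count + target count).
    """
    src_count, tgt_count, both = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        both.update(product(s_words, t_words))
    best = {}
    for (s, t), n in both.items():
        score = 2 * n / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


if __name__ == "__main__":
    # Tiny hand-written English/French corpus, word order ignored.
    corpus = [("black dog", "chien noir"),
              ("black cat", "chat noir"),
              ("white dog", "chien blanc")]
    print(best_matches(corpus))
```

      Nobody told the sketch what "dog" means; it only noticed that "dog" and "chien" always show up together. Counts like these pick one target word per source word, which is where the trouble described next comes from.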

      What is a little more confusing to me is how machine translation can deal with finer points in language, like different words in a target language where the source language has only one. English for example has the word "to know" but many languages use different words depending on whether it is a thing or a person that is known. Or words that relate to the same physical object but carry very different cultural connotations -- the word for female dog is not derogatory in every language, for example, but some other animals can be extremely profane depending on who you talk to.

      Or situations where two entirely different real-world concepts mean similar things in their respective language -- in English, for example, you're up shit creek, but in Slavic languages you're in the pussy.

      I've done translation work before (Slovak -> English), and there's much more going on than differences in words and grammar. There are whole conceptual frameworks in languages that just don't translate, and this is frustrating for anyone learning a language, let alone trying to translate. English is very precise (when used as directed) in matters of time and sequence -- we have more than 20 verb tenses where most languages get away with three.

      Consider this:

      I was having breakfast when my sister, whom I hadn't seen in five years, called and asked if I was going to the county fair this weekend. I told her I wasn't because I'm having the painters come on Saturday. They'll have finished by 5:00, I told her, so we can get together afterwards.

      These three sentences use six different tenses: past continuous, past perfect, past simple, present continuous, future perfect, and present simple, and are further complicated by the fact that you have past tenses referring to the future, present tenses referring to the future, and the wonderful future perfect tense that refers to something that will be in the past from an arbitrary future perspective, but which hasn't actually happened yet. Still following?

      On the other hand, English is much less precise in things like prepositions and objects, and utterly inexplicable when it comes to things like articles, phrasal verbs, and required word order -- try explaining why:

      I'll pick you up after work

      I'll pick the kids up after work

      I'll pick up the kids after work

      are all OK, but

      I'll pick up you after work

      is not.

      Machine translation will be a wonderful thing for a lot of reasons, but because of these kinds of differences in languages, it will be limited to certain types of writing. You may be able to get a computer to translate the words of Shakespeare, but a rose, by whatever name, is not equally sweet in every language.
  3. SCIgen by OverlordQ · · Score: 5, Interesting

    SCIgen anyone?

    --
    Your hair look like poop, Bob! - Wanker.
  4. Speaking as someone working on NLP by OO7david · · Score: 4, Interesting

    IAALinguist doing computational things and my BA focused mainly on syntax and language acquisition, so here're my thoughts on the matter.

    It's not going to be right. The algorithm is stated as being statistically based, which, while similar to the way children learn languages, is not exactly it. Children learn by hearing correct native languages from their parents, teachers, friends, etc. The statistics come in when children produce utterances that either do not conform to speech they hear or when people correct them. However, statistics does not come in at all with what they hear.

    With respect to the algorithm learning the underlying grammar of a language, I am dubious enough to call it a grand, untrue claim. Basically all modern views of syntax are unscientific and we're not going to get anywhere until Chomsky dies. Think about the word "do" in English. No view of syntax describes from where that comes. Rather languages are shoehorned into our constructs.

    So, either they're using a flawed view of syntax or they have a new view of syntax and for some reason aren't releasing it in any linguistics journal as far as I know.

    1. Re:Speaking as someone working on NLP by OO7david · · Score: 4, Interesting

      It is in effect two parted:

      Chomsky is to linguistics as Freud to psych. He had great ideas for the time (many still stand), and the science would be nowhere close to where it is without him. However, A) he's backed off a lot from supporting his own theories and B) he's published papers contradicting his original ideas, so there is some question about their veracity. Since so many linguistics undergrads hold him as the pinnacle of syntax, none are really deviating drastically from him.

      WRT the unscientificness, to make his view fit English, there has to be "do-support" which basically is that when forming an interrogative "do" just comes in to make things work without any explanation. In other words, it is in our grammar, but our view of syntax does not account for it.

    2. Re:Speaking as someone working on NLP by PurpleBob · · Score: 4, Interesting

      You're right about Chomsky holding back linguistics. (There are all kinds of counterarguments against his Universal Grammar, but people defend it because Chomsky Is Always Right, and Chomsky himself defends it with vitriolic, circular arguments that sound alarmingly like he believes in intelligent design.)

      And I agree that this algorithm doesn't seem that it would be entirely successful in learning grammar. But this is not because it's statistical. I don't understand how you can look at something as complicated as the human brain and say "statistics does not come in at all".

      If this algorithm worked, then it could be statistical, symbolic, Chomskyan, or magic voodoo and I wouldn't care. There's no reason that computers have to do things the same way the brain does, and I doubt they'll have enough computational power to do so for a long time anyway.

      No, the flaws in this algorithm are that it is greedy (so a grammar rule it discovers can never be falsified by new evidence), and it seems not to discover recursive rules, which are a critical part of grammar. Perhaps it's learning a better approximation to a grammar than we've seen before, but it's not really doing the amazing, adaptive, recursive thing we call language.
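
      To make "recursive rules" concrete: a rule is recursive when a phrase can contain another phrase of the same type, which is what lets a finite grammar produce unboundedly many sentences. A minimal sketch with a made-up one-rule grammar (`noun_phrase` and its vocabulary are illustrative only, not anything from the researchers' work):

```python
def noun_phrase(nouns):
    """Expand NP -> 'the' N | 'the' N 'that chased' NP, one noun per level.

    Expects a non-empty list of nouns; each extra noun adds one more level
    of embedding.
    """
    head, rest = nouns[0], nouns[1:]
    if not rest:
        # Base case: NP -> 'the' N
        return f"the {head}"
    # Recursive case: this NP contains another NP.
    return f"the {head} that chased {noun_phrase(rest)}"


if __name__ == "__main__":
    print(noun_phrase(["dog", "cat", "rat"]))
```

      One rule covers every depth of embedding here. A learner that only stores flat patterns would need a separate pattern for each depth it happened to see.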

      --
      Win dain a lotica, en vai tu ri silota
  5. Grammar depends on the input by Tsaac · · Score: 3, Interesting

    If fed with a heap of decent grammar, what happens when it's fed with bad grammar and spelling? Will it learn, and incorporate, the tripe or reject it? That's the sort of problem with natural language apps, it's quite hard to sort the good from the bad when it's learning. Take the megahal library (http://megahal.alioth.debian.org/) for example. Although possibly not as complex, it does a decent job at learning, but when fed with rubbish it will output rubbish. I don't think it's the learning that will be the hard part, but rather the recognition of the good vs. the bad that will prove how good the system is.
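
    The "rubbish in, rubbish out" point shows up in even the simplest statistical learner. Below is a toy bigram model, a sketch only and not megahal's actual code; `train` and `most_likely_next` are hypothetical names. It counts which word follows which, and has no notion of whether a sentence it was fed is good or bad.

```python
from collections import defaultdict


def train(sentences):
    """Count word -> next-word transitions over all training sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for cur, nxt in zip(words, words[1:]):
            counts[cur][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of word, or None if word is unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return max(followers, key=followers.get)


if __name__ == "__main__":
    good = ["the cat sat on the mat", "the cat ate"]
    rubbish = ["the teh teh", "the teh", "the teh lol"]
    print(most_likely_next(train(good), "the"))
    print(most_likely_next(train(good + rubbish), "the"))
```

    With clean input the model expects "cat" after "the"; once the rubbish outnumbers the good examples it expects "teh" instead. Nothing in the counting step can tell the two kinds of input apart.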

    --
    eXemplary Abstract
  6. No they didn't by Ogemaniac · · Score: 5, Interesting

    I played around with the Google translator for a while. I work in Japan and am half-way fluent. Google couldn't even turn my most basic Japanese emails into comprehensible English. Same is true for the other translation programs I have seen.

    I will believe this new program when I see it.

    Translation, especially from extremely different languages, is absurdly difficult. For example, I was out with a Japanese woman the other night, and she said "aitakatta". Literally translated, this means "wanted to meet". Translated into native English, it means "I really wanted to see you tonight". It is going to take one hell of a computer program to figure that out from statistical BS. I barely could with my enormous meat-computer and a whole lot of knowledge of the language.

    1. Re:No they didn't by burns210 · · Score: 3, Interesting

      There was a program that tried to use the language of Esperanto (a made-up language designed specifically to be very consistent and guessable with regards to how syntax and words are used, very easy to learn and understand quickly) to be a middleman for translation.

      The idea being that you take any input language, Japanese for instance, and get a working Japanese -> Esperanto translator. Being as Esperanto is so consistent and reliable in how it is designed, it should be easier to do than a straight Japanese -> English translator.

      To finish, you write an Esperanto -> English translator. By leveraging the consistent language of Esperanto, researchers thought they could write a true universal translator of sorts.

      Don't know what ever came of it, but it was an interesting idea.
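
      For what it's worth, the middleman idea is easy to sketch. This is only a toy under loud assumptions: the two word tables are tiny hand-written samples, all function names are hypothetical, and real translation is nothing like word-for-word lookup. It does show the usual argument for a pivot, though: with n languages around the pivot you need 2n one-way translators instead of one for every ordered pair.

```python
# Tiny hand-written word tables, for illustration only.
JA_TO_EO = {"inu": "hundo", "neko": "kato"}
EO_TO_EN = {"hundo": "dog", "kato": "cat"}


def translate(words, table):
    """Look each word up in one table, leaving unknown words unchanged."""
    return [table.get(w, w) for w in words]


def via_pivot(words, to_pivot, from_pivot):
    """Translate source -> pivot -> target by chaining two tables."""
    return translate(translate(words, to_pivot), from_pivot)


def translators_needed(n_languages, pivot=True):
    """Count one-way translators needed to connect n languages.

    With a pivot (not counted among the n), each language needs one
    translator into the pivot and one out of it. Without a pivot, every
    ordered pair of languages needs its own translator.
    """
    if pivot:
        return 2 * n_languages
    return n_languages * (n_languages - 1)


if __name__ == "__main__":
    print(via_pivot(["inu", "neko"], JA_TO_EO, EO_TO_EN))
    print(translators_needed(10), translators_needed(10, pivot=False))
```

      The catch is that anything the pivot language cannot express gets lost on the way through, so the two hops can only be as good as the weaker one.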

  7. Give it a real challenge by pugugly · · Score: 3, Interesting

    Feed it the entries in the "obfuscated C" competition - if it works for that, it oughta work for anything.

    Pug

    --
    An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media
  8. O(n^n^n...)????? by mosel-saar-ruwer · · Score: 3, Interesting

    From TFA: The algorithm discovers the patterns by repeatedly aligning sentences and looking for overlapping parts.

    If you take just a single string [of length n] and rotate it against itself in a search for matches, then you've got to do n^2 byte comparisons just to find all singleton matches, and then gosh only knows how many comparisons thereafter to find all contiguous stretches of matches.

    But if you were to take some set of embedded strings, and rotate them against a second set of global strings [where, in a worst case scenario, the set of embedded strings would consist of the set of all substrings of the set of global strings], then you would need to perform a staggeringly large [for all intents and purposes, infinite] number of byte comparisons.

    What did they do to shorten the total number of comparisons? [I've got some ideas of my own in that regard, but I'm curious as to their approach.]

    PS: Many languages are read backwards, and I assume they re-oriented those languages before feeding them to the algorithm [it would be damned impressive if the algorithm could learn the forwards grammar by reading backwards].
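
    One way to picture the "aligning sentences and looking for overlapping parts" step is a pairwise search for the longest shared run of words. The following is a minimal sketch, not the researchers' algorithm: `longest_shared_run` is a hypothetical helper built on the standard dynamic-programming table, so two sentences of n and m words cost n*m word comparisons to find the contiguous stretches, with no extra pass after the singleton matches.

```python
def longest_shared_run(a, b):
    """Return the longest contiguous run of words shared by sentences a and b.

    Hypothetical helper for illustration: longest-common-substring dynamic
    programming over words, O(len(a) * len(b)) comparisons. On ties, the
    first run found in a is returned.
    """
    x, y = a.split(), b.split()
    # prev[j] holds the length of the shared run ending at x[i-2] and y[j-1].
    prev = [0] * (len(y) + 1)
    best_len, best_end = 0, 0
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the run that ended one word earlier in both sentences.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]


if __name__ == "__main__":
    print(longest_shared_run("I'll pick the kids up after work",
                             "I'll pick you up after work"))
```

    That still leaves every pair of sentences in the corpus to compare, so the total is quadratic in the number of sentences unless something like an index over word sequences prunes the pairs first. Whether the researchers do anything like that is not stated in the summary.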

  9. Re:Random test ... by Godwin+O'Hitler · · Score: 3, Interesting

    I AM a professional human translator, and believe me, if a machine translation did even a half decent job of producing intelligible, natural text, I would use it to get a jump start and save a lot of time.

    But as things stand, I'd spend more time knocking the bad translation into shape than if I translated the whole thing from scratch.

    Translators are often asked to copy edit other translators' work (customers tend to call this "proof reading", presumably to devalue it and get it done on the cheap, but it involves much more than hunting typos). That's fair enough if you want a quality check. But some smart-arse people try sending machine translations for copy editing. And you can bet they get sent straight back!

    --
    No, your children are not the special ones. Nor are your pets.
  10. Re:grammar isn't enough by g2devi · · Score: 3, Interesting

    Even better. The meaning of words can flip back and forth depending on the ever widening context.

    * The clown threw a ball.

    (Probably, a tennis or basket ball)

    * The clown threw a ball,....for charity.

    (Okay, sorry, a ball as in a party.)

    * The clown threw a ball,....for charity...., and hit the target.

    (Okay, sorry again, the tennis ball hit the dunking target and someone fell in the water. Got it. We're in a carnival.)

    * The clown threw a ball,....for charity...., and hit the target....of 1 million dollars.

    (Scratch that. It really is a charity party and we've collected 1 million in donations. There's no way the meaning can change again.)

    * The clown threw a ball,....for charity...., and hit the target....of 1 million dollars....by striking out Babe Ruth.

    (Oops again. The clown got 1 million dollars in pledges if he could strike out Babe Ruth, and he succeeded. We're talking about a base ball again. I give up.)