New Algorithm for Learning Languages
An anonymous reader writes "U.S. and Israeli researchers have developed a method for enabling a computer program to scan text in any of a number of languages, including English and Chinese, and autonomously and without previous information infer the underlying rules of grammar. The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences."
would probably help with the problem of either downloading a small, incomplete dictionary, a dictionary with errors, or a massive dictionary file.
Their jobs be outsourced to computers.
I've got 101 mod points and you can't have them!
Google apparently has a system like this in their labs, and entered it into some national competition, where it pwned everyone else. Apparently, the system learned how to translate to/from Chinese extremely well, without any of the people working on the project knowing the language.
SCIgen anyone?
Your hair look like poop, Bob! - Wanker.
Paper here for those who have PNAS access.
Imagine if the editors started using this, what would everyone have to bitch about on Slashdot?
This is a perfect opportunity to remind everyone that it's Chomsky's contribution to linguistics which enabled this amazing (if true) achievement. For those of you who don't know Chomsky, he is the father of modern linguistics. Many would also know him as a political activist. Very amazing character. http://www.sk.com.br/sk-chom.html
"There is no flag large enough to cover the shame of killing innocent people."--Howard Zinn
IAALinguist doing computational things and my BA focused mainly on syntax and language acquisition, so here're my thoughts on the matter.
It's not going to be right. The algorithm is stated as being statistically based, which, while similar to the way children learn languages, is not exactly it. Children learn by hearing correct native language from their parents, teachers, friends, etc. The statistics come in when children produce utterances that either do not conform to speech they hear or when people correct them. However, statistics do not come in at all with what they hear.
As for the algorithm learning the underlying grammar of a language, I am dubious enough to call it a grand, untrue claim. Basically all modern views of syntax are unscientific and we're not going to get anywhere until Chomsky dies. Think about the word "do" in English. No view of syntax describes where that comes from. Rather, languages are shoehorned into our constructs.
So, either they're using a flawed view of syntax or they have a new view of syntax and for some reason aren't releasing it in any linguistics journal as far as I know.
They've rediscovered the Eliza program!
Input: "For example, the sentences I would like to book a first-class flight to Chicago, I want to book a first-class flight to Boston and Book a first-class flight for me, please may give rise to the pattern book a first-class flight -- if this candidate pattern passes the novel statistical significance test that is the core of the algorithm."
How does it feel to "book a first-class flight"?
If fed with a heap of decent grammar, what happens when it's fed with bad grammar and spelling? Will it learn, and incorporate, the tripe or reject it? That's the sort of problem with natural language apps: it's quite hard to sort the good from the bad when it's learning. Take the megahal library http://megahal.alioth.debian.org/ for example. Although possibly not as complex, it does a decent job at learning, but when fed with rubbish it will output rubbish. I don't think it's the learning that will be the hard part, but rather the recognition of the good vs. the bad that will prove how good the system is.
eXemplary Abstract
I know we all feel like we've been screwed by the conspicuous lack of flying cars around these days, but at least some progress is being made on the Universal Translator front...
If you're not part of the solution, you are part of the precipitate
Using this software, I can finally win the 'Summarize Proust Competition'!
I'm not a Troll, it's reverse psychology.
Can it decipher these things too?
But for this, I have one word: Dolphins.
When you're afraid to download music illegally in your own home, then the terrorists have won!
We just had an article on this. There was a shootout by NIST. At least I think so; the /. search engine blows, hard. Either way, here's a link to the tests.
This is one that wasn't covered by the tests, so I guess it's front page news.
Is there anything better than clicking through Microsoft ads on Slashdot?
That's just a Markov Model that "learned" from what looks like religious mumbo jumbo in the first place.
Markov models are perhaps the easiest language acquisition model to implement, but also one of the worst at coming up with valid speech or text.
Interestingly, they do much, much better as recommender systems.
http://en.wikipedia.org/wiki/Markov_chain
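For anyone who hasn't seen one, here is a minimal sketch of a word-level Markov text generator of the kind described above. It is not the C program linked in this thread; the function names `build_model` and `generate` and the toy corpus are made up for illustration.

```python
import random
from collections import defaultdict

def build_model(text, order=1):
    """Map each run of `order` words to the list of words seen right after it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        model[state].append(words[i + order])
    return dict(model)

def generate(model, length=10, seed=None):
    """Random walk over the model; stops early at a state with no successor."""
    rng = random.Random(seed)
    state = rng.choice(sorted(model))  # sorted so a fixed seed is reproducible
    out = list(state)
    while len(out) < length:
        followers = model.get(state)
        if not followers:
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)

# Toy corpus (an assumption, not one of the texts mentioned in the thread).
corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_model(corpus)
print(generate(model, length=8, seed=0))
```

Each word depends only on the previous `order` words, which is why the output is locally plausible but rarely a valid sentence as a whole.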
Used this (easy to compile) C program:
http://www.eblong.com/zarf/markov/
to create these:
http://www.mintruth.com/mirror/texts/
Mod points to whoever can tell us what texts they use. (No mod points can actually be given)
Get your Unix fortune now!
Unsupervised learning of natural languages
Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman
School of Physics and Astronomy and School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel; and Department of Psychology, Cornell University, Ithaca, NY 14853
We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.
Many types of sequential symbolic data possess structure that is (i) hierarchical and (ii) context-sensitive. Natural-language text and transcribed speech are prime examples of such data: a corpus of language consists of sentences defined over a finite lexicon of symbols such as words. Linguists traditionally analyze the sentences into recursively structured phrasal constituents (1); at the same time, a distributional analysis of partially aligned sentential contexts (2) reveals in the lexicon clusters that are said to correspond to various syntactic categories (such as nouns or verbs). Such structure, however, is not limited to the natural languages; recurring motifs are found, on a level of description that is common to all life on earth, in the base sequences of DNA that constitute the genome. We introduce an unsupervised algorithm that discovers hierarchical structure in any sequence data, on the basis of the minimal assumption that the corpus at hand contains partially overlapping strings at multiple levels of organization. In the linguistic domain, our algorithm has been successfully tested both on artificial-grammar output and on natural-language corpora such as ATIS (3), CHILDES (4), and the Bible (5). In bioinformatics, the algorithm has been shown to extract from protein sequences syntactic structures that are highly correlated with the functional properties of these proteins.
The ADIOS Algorithm for Grammar-Like Rule Induction
In a machine learning paradigm for grammar induction, a teacher produces a sequence of strings generated by a grammar G0, and a learner uses the resulting corpus to construct a grammar G, aiming to approximate G0 in some sense (6). Recent evidence suggests that natural language acquisition involves both statistical computation (e.g., in speech segmentation) and rule-like algebraic processes (e.g., in structured generalization) (7-11). Modern computational approaches to grammar induction integrate statistical and rule-based methods (12, 13). Statistical information that can be learned along with the rules may be Markov (14) or variable-order Markov (15) structure for finite state (16) grammars, in which case the EM algorithm can be used to maximize the likelihood of the observed data. Likewise, stochastic annotation for context-free grammars (CFGs) can be learned by using methods such as the Inside-Outside algorithm (14, 17).
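The teacher side of this setup can be sketched in a few lines. This is only an illustration under stated assumptions: the toy grammar `G0` below and the helper names `expand` and `teacher` are hypothetical and do not come from the paper, and the learner (the hard part) is not shown.

```python
import random

# Hypothetical toy grammar G0: each nonterminal maps to its alternative expansions.
# Any symbol that is not a key is treated as a terminal word.
G0 = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"], ["sleeps"]],
}

def expand(symbol, rng):
    """Recursively rewrite a symbol into a list of terminal words."""
    if symbol not in G0:
        return [symbol]
    production = rng.choice(G0[symbol])
    words = []
    for s in production:
        words.extend(expand(s, rng))
    return words

def teacher(n, seed=0):
    """Produce a corpus of n sentences sampled from G0."""
    rng = random.Random(seed)
    return [" ".join(expand("S", rng)) for _ in range(n)]

for sentence in teacher(5):
    print(sentence)
```

A learner would see only the printed strings and would have to recover something close to `G0` from them.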
We have developed a method that, like some of those just mentioned, combines statistics and rules: our algorithm, ADIOS (for automatic distillation of structure) uses statistical information present in raw sequential data to identify significant segments and to distill rule-like regularities that support structured generalization. Unlike
How long until we see something like this applied to ?
I played around with the Google translator for a while. I work in Japan and am half-way fluent. Google couldn't even turn my most basic Japanese emails into comprehensible English. Same is true for the other translation programs I have seen.
I will believe this new program when I see it.
Translation, especially from extremely different languages, is absurdly difficult. For example, I was out with a Japanese woman the other night, and she said "aitakatta". Literally translated, this means "wanted to meet". Translated into native English, it means "I really wanted to see you tonight". It is going to take one hell of a computer program to figure that out from statistical BS. I barely could with my enormous meat-computer and a whole lot of knowledge of the language.
something that can make sense of the voynich manuscript http://www.voynich.nu/. They should have tested their system on it.
If something exists that does not need a creator (god) then why must the cosmos need one?
Electronic babelfish anyone?
My sig beat up your sig.
God loves you. God will burn you in hell for all eternity. God wants more foreskins.
And the "rules" of a language are NOT what children "learn". First of all, children acquire a language, they do not "learn" it. That is a large attribute to the child's ability to speak it--not whether or not they understand gerunds and the pluperfect.
Second, in a language such as English, whose words for the most part lack any necessity to the order in which they're placed to understand their meaning and, even worse, lack declension forms to distinguish subject from object of the preposition, what success can a language recognition program have "learning" such a language when prepositions themselves mainly can be omitted? To teach a computer Latin is easy.
Third, what's the hope of the computer ever understanding something like Shakespeare, Joyce, or Dante, whose uses of language rely extensively on erudition for word placement as opposed to typical usage? While a computer might be able to learn Latin because of its rigorous rules, I doubt it could faithfully render a text from Ovid.
PNAS wants you to subscribe to download the PDF.
Or you could just go to the authors' page and download it for free: http://www.cs.tau.ac.il/~ruppin/pnas_adios.pdf
In analyzing proteins, for example, the algorithm was able to extract from amino acid sequences patterns that were highly correlated with the functional properties of the proteins.
NCBI BlastP already does this for proteins. Similarities and rules for things can be found but if the meaning of the sequence is not known then what good is it? In the end you need to do experiments involving biology/biochemistry/structural biology to determine the function of a protein or nucleotide sequence. Furthermore in language as well as in biology/chemistry things which have similar vocabulary (chemical formula) may in the end be structurally very different (enantiomers), which leads to vastly different functionality.
Seems like that'd be a good place to test the system out. While talking with extraterrestrials would be pretty awesome, having a chat with a dolphin would be pretty cool too. Remember: "The second most intelligent [species] were of course dolphins"
- translate some posts on /. into comprehensible contents
- figure out it is a dupe and kill it before it even appears
- RTFA for me and just give me a good summary (by the rate of articles posted here, there's probably not much to summarize either)
- translate "IANAL" into something else that does not make me think of ANAL thing
- figure out that articles on Google and Apple are just speculations by some dude living in his (can't be her, for sure) parents' basement, and not really news worth posting
- translate my suggestions into something acceptable to the (kernel) hackers that good hygiene is a good thing
- understand that I'm just ranting, and it should not take it personally.
Feed it the entries in the "obfuscated C" competition - if it works for that, it oughta work for anything.
Pug
An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media
Finally! Engrish for the masses!
~ slashdot.org - Where some of the world's greatest minds come together to scrutinize grammar.
Yes, pattern recognition is a major part of the process. However, there are other fundamental parts that are also extremely important, and lacking them you get nonsense. In particular, context matters. "aitakatta" in the middle of a business letter probably does mean "wanted to meet". By itself, said by one member of a couple to the other over drinks at a bar, it does not.
In order for a program to translate accurately, it needs to know who is speaking/writing, who is the audience, what their relationship is, and their location. Some of this may be given to the computer explicitly, or easily found in the text/speech (for a human at least) but some of it may not. This is not going to be an easy problem to solve.
Writing is never free from its context. I know before I even start whether I am reading a fiction novel, a satire, a scientific journal, an email from my boss, or a text message from my date this Saturday. The meaning of the words can change a lot in those cases.
Even Google translator, which was trained on multi-lingual UN reports, could not produce comprehensible English from simple Japanese business emails.
As for my chinko, that's a long story.
Could this be used to make a smarter spam filter?
Called Pragmatics. It can be somewhat oversimplified as saying it's the study of how context affects meaning or as figuring out what we really mean, as opposed to what we say.
For example, a classical Pragmatics scenario:
John is interested in a co-worker, Anna, but is shy and doesn't want to ask her out if she's taken. He asks his friend Dave if he knows if Anna is available, to which Dave replies "Anna has two kids."
Now, taken literally, Dave did not answer John's question. What he literally said is that Anna has at least two children, and presumably exactly two children. That says nothing of her availability for dating. However, there's nobody who reads that scenario who doesn't get what Dave actually meant to communicate: That Anna is married, with children.
So that's a major problem computers hit when trying to really understand natural language. You can write a set of rules that completely describes all the syntax and grammar. However, that doesn't do it; that doesn't get you to meaning, because meaning occurs at a higher level than that. Even when we are speaking literally and directly, there's still a whole lot of context that comes into play. Since we are quite often speaking at least partially indirectly, it gets to be a real mess.
Your example is a great one of just how bad it gets between languages. The literal meaning in Japanese was not the same as the intended meaning. So first you need to decode that; however, even if you know that, a literal translation of the intended meaning may not come out right in another language. To really translate well you need to be able to decode the intended meaning of a literal phrase, translate that into an appropriate meaning in the other language, and then encode that in a phrase that conveys that intended meaning accurately, and in the appropriate way.
It's a bitch, and not something computers are even near capable of.
- Time flies like an arrow.
- Fruit flies like a banana.
There are other, similar examples. Computer systems tend to deduce either that there's a type of insect called "time flies", or that the latter sentence refers to the aerodynamic properties of fruit.
From TFA: The algorithm discovers the patterns by repeatedly aligning sentences and looking for overlapping parts.
If you take just a single string [of length n] and rotate it against itself in a search for matches, then you've got to do n^2 byte comparisons just to find all singleton matches, and then gosh only knows how many comparisons thereafter to find all contiguous stretches of matches.
But if you were to take some set of embedded strings, and rotate them against a second set of global strings [where, in a worst case scenario, the set of embedded strings would consist of the set of all substrings of the set of global strings], then you would need to perform a staggeringly large [for all intents and purposes, infinite] number of byte comparisons.
What did they do to shorten the total number of comparisons? [I've got some ideas of my own in that regard, but I'm curious as to their approach.]
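One common way to avoid comparing every string against every other is to index word n-grams in a hash table, so that overlapping parts fall out of a single pass over the corpus. The sketch below is an assumption about how one might do it, not the approach in the paper; `shared_ngrams` is a made-up name, and the sentences are the ones quoted from TFA earlier in the thread.

```python
from collections import defaultdict

def shared_ngrams(sentences, min_len=2, min_count=2):
    """Return word n-grams that occur in at least `min_count` sentences."""
    index = defaultdict(set)  # n-gram -> ids of the sentences containing it
    for sid, sentence in enumerate(sentences):
        words = sentence.lower().split()
        for n in range(min_len, len(words) + 1):
            for i in range(len(words) - n + 1):
                index[tuple(words[i:i + n])].add(sid)
    return {" ".join(gram): len(ids)
            for gram, ids in index.items() if len(ids) >= min_count}

sentences = [
    "I would like to book a first-class flight to Chicago",
    "I want to book a first-class flight to Boston",
    "Book a first-class flight for me, please",
]
patterns = shared_ngrams(sentences)
in_all = [g for g, count in patterns.items() if count == len(sentences)]
print(max(in_all, key=lambda g: len(g.split())))
```

This still enumerates every n-gram of every sentence, so the work is quadratic in sentence length, but it grows linearly with the number of sentences instead of with the number of sentence pairs. Suffix trees or suffix arrays would cut the cost further.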
PS: Many languages are read backwards, and I assume they re-oriented those languages before feeding them to the algorithm [it would be damned impressive if the algorithm could learn the forwards grammar by reading backwards].
Yeah and this didn't learn the language in any meaningful sense. It just found a statistical pattern, and then generates possible sentences from that pattern. That's a whole lot different to you and I understanding the language and generating intentional, meaningful sentences.
http://www.perthonline.net
Yes! I'd have thrown a mod point at you just for this paragraph if I could.
English is very precise (when used as directed) in matters of time and sequence -- we have more than 20 verb tenses where most languages get away with three.
Not really. Firstly, English only has two or three tenses. (Depending upon which linguist you ask, English either has a past/non-past distinction or past/present/future distinctions. See [1], [2]. The general consensus seems to be in favor of the former, although I humbly disagree with the general consensus.) It maintains a variety of aspect distinctions (perfective vs imperfective, habitual vs continuous, nonprogressive vs progressive). See [3]. Its verbs also interact with modality, albeit slightly less strongly.
It's a very common mistake to count the combinations of tense, aspect, and modality in a language and arrive at some astronomical number of "tenses". It's an even more common mistake (for native English speakers, anyway) to think that English is special or different or strange compared to other languages. In most cases, it's not -- especially when compared with other Indo-European languages.
Secondly, and more interestingly IMHO, most languages do not have three distinct tenses. The most common cases are either to have a future/non-future distinction or a past/non-past distinction. In any case, the future tense, if it exists, is normally derived from modal or aspectual markers and is diachronically weak (which is linguist-babble meaning "future tense forms don't stick around for very long"). See [3].
English is a perfect example: will, of course, used to refer to the agent's desire (his or her will) to do something. Only recently has it shifted to have a more temporal sense, and it still maintains some of its modal flavor. In fact, the least marked way of making the future (in the US, at least) is to use either gonna or a present progressive form: I'm having dinner with my boss tonight. I'm gonna ask him for a raise. See Comrie [1] again.
So as not to be anglo-centric, I'll give another example. Spanish has three widespread means of forming the future tense. Two of these are periphrastic and are exemplified by he de cantar 'I've gotta sing' and voy a cantar 'I'm gonna sing'. The last is the synthetic form, cantaré 'I'll sing'.
Most high school or college Spanish teachers would tell you that the "pure" future is cantaré. Actually, it's historically derived from the phrase cantar he 'I have to sing' (from Latin cantáre habeo), and is being displaced by the other two forms all across the Spanish-speaking world. I'm told, for example, that cantaré has been largely lost in Argentina and southern Chile (see [4]).
In any case, the parent's main point still holds. It's a b?tch to deal with cross-linguistic differences in major semantic systems computationally. But good lord, it's fun to try. :)
References:
> We can say, Earlier you educated me. but not Earlier you teached me. Why?
We say 'earlier you taught me' instead. What is your point?
In terms of language evolution, the word 'taught' has the same relationship to 'teach' as 'wrought' has to 'wreak', and similar relationships to 'thought'-'think', 'brought'-'bring' and (less so) 'bought'-'buy'. The preterite form of each of these verbs is actually formed by a very similar linguistic rule to the one that forms 'educated' from 'educate' - the basic rule in Germanic languages being that you stick a dental plosive 't' or 'd' sound on the end of the verb (ignore how the words are spelled, as that's really an irrelevance to the evolution of the words in the first place - we're talking about sounds here). Once this form has been created, however, it can create an awkward sound at the end of the word - 'ct', 'ngd', 'nct', etc. Language users don't like awkward sounds, so they change them, preserving the distinctiveness, but losing some of the closeness to the original word. Also bear in mind that 'ch' was not always the sound at the end of the word 'teach' - it was once a much harder sound.
Add to this general rule the tendency in Germanic languages for certain verbs ('strong verbs') to change their vowel sound in the past tense (cf: 'run'-'ran', 'sing'-'sang', etc.), and you can see roughly where 'taught' came from. It's not really an 'exception', just a very old word that's had time to be moulded into a more comfortable shape through usage.
When trying to reduce a living language to a syntax, you miss out on the richness imparted to languages by the conventions that they gather through continual usage. English has simple syntax rules - I can coin a new verb and use it in grammatical sentences without anybody having any doubt about what syntactic role it is playing - look at the rise of 'google' as a verb - nobody had to teach you the words 'googles', 'googled' and 'googling', but you would happily use them. But once words are accepted into the language and used, they move over time, sometimes not in the same direction as their near relatives (as 'teach' and 'taught'). To explain where these words come from you need to look at the syntax rules prevailing at the time the derivative word was coined, and the pressures and modifications the words have been subjected to since. This is exactly what we mean by a 'living language'.
Klingon has simple grammar.
How about Dolphinese? Research shows that they seem to be able to scout and transfer information from one individual to his/her pod. If there's some grammar, it would be a pretty good nut to crack.
It is no longer uncommon to be uncommon.
Anyone else thinking about using the tech to learn something about "the grammar of DNA"?
If they can use it for analysing protein sequences, maybe they can tackle "the grammar of Life" and kickstart the whole bioengineering sector into a new life...
OTOH, the fundamentalist Christians will probably denounce this as an evil thing...
It takes 40+ muscles to frown, but only four to extend your arm and bitchslap the motherfucker
It works the other way too:
"I'm leaving you."
What?
"I'm leaving you, Alice."
I don't understand what you're trying to do.
"I've met someone."
What do you mean 'met'?
"Look...just read the pamphlet."
I don't have the pamphlet.
"I have to go."
Which way do you want to go?
"Uh...west."
You would need a machete to head further west.
I can't tell you how many of my break-ups have ended with needing a machete.
you can have my violent video games when you pry them from my cold, dead hands.
Prime UID Club
In the natural language processing business they call this "the same level of understanding as a two-year-old child".
Can you teach it to take a breath?
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.