Computer Program Learns Baby Talk in Any Language
athloi writes "Researchers have made a computer program that learns to decode sounds from different languages in the same way that a baby does. The program will help to shed new light on how people learn to talk. It has already raised questions as to how much specific information about language is hard-wired into the brain."
they have only tested with japanese and english. (see ars technica's coverage here). while they do present some intriguing results, the authors themselves admit that their methodology is flawed. btw, when did slashdot become ars redux?
IAAL (I am a linguist), and I believe you are correct. Language is a colligation of sound and meaning, but this technology merely distinguishes sounds: it is a vastly simplified model, not of how children acquire language, but of how children pick up phones. The phone is the most basic unit of the physical (sound) aspect of language, so if this technology is to have any use at all, it has a very long way to go.
From TFA:
Expanding on some existing ideas, he and a team of international researchers developed a computer model that resembles the brain processes a baby uses when learning about speech.
This sentence means nothing. How do they know their computer model resembles the brain processes? Because they got the same outcome? Is that enough to verify what goes on in the mind of a child?
How about this: as soon as their program can distinguish allophones, I will be impressed. Allophones are different sounds in a language that native speakers do not distinguish, but which nevertheless occur in certain environments. For instance, in English we do not distinguish the voiced th sound and the voiceless th sound, but we do distinguish f and v, even though the only difference in both pairs is voicing. The difference is that exchanging f and v can change the meaning of a word, but changing voiced th and voiceless th only makes the word sound funny.
Esoteric reference.
Actually, in English, we do distinguish voiced and unvoiced /th/. They aren't allophones at all - unless you think "thigh" and "thy" are the same word, of course.
While "thy" is somewhat archaic it's still part of the language. Voiced and unvoiced is an area where English distinguishes heavily; we're very light on aspiration, mind you.
Here's an audio clip of its learning progression.
And I recall seeing a TV broadcast showing an experiment where infants were incapable of even hearing certain sounds from one language (e.g. an inuit language with subtle throat-clicking sounds) if they were primarily exposed to another language (say French or English). A baby had to be repeatedly exposed to certain sounds before they could perceive them.
I'm not at a place where I can access the research article, so let me comment about what I know about McLelland's previous work with neural networks.
Rumelhart and McLelland worked on the groundbreaking "can a neural network learn how to pronounce words based on their spelling?" paper, which used back-propagation to train a neural net to do just that. That was in the 1980s. (Sejnowski at the Salk Institute followed up with a lot of neural net training studies too.)
Their little cheat was that there was no temporal component to the data. Words were represented as sets of triplet-letters: catalog is represented as "-ca", "cat", "ata", "tal", "alo", and "log". (Actually, I don't remember if they used special sequences to represent start and stop, so --c -ca og- and g-- may not have been part of the sets.}
And of course the neural net didn't really have audio output, though of course the rejoinder is that this would be trivial.
My key question is how they deal with the issue of time in this study, and if there is any actual audio output which would act as feed-back for the training system or whether the output is representational only, as an output set of phonemes.
Having real audio output and real audio input would let it correlate its output with real language examples. Having representational blobs would only mean that: given inputs of the hash that represents "hard TH" vs the hash that represents "soft TH" the system could yield a result of different outputs.
And you're saying that the key result would be if the system learned to conflate or ignore the two sounds of "TH", hard or soft, in trying to interpret words. Remember that the initial Rumelhart-McLelland model was "content/meaning free", and I suspect that this one is too. Learning to conflate "x" and "y" in a neural net would be trivially implemented and trainable: the links for "x" and "y" into the model would have similar weights in the right contexts (the context being the set of predecessor and successor phonemes).
It sounds like an agglomerator: given a large dataset of valid words in a given language, this system learns the rule for "predecessor" and "successor" probabilities of a particular phoneme vs another phoneme and then produces random output with the same Bayesian probability, producing gibberish nonsensical sounds which follow the probability distribution of the input training language.
or that's my guess at least trying to be the typical slashdotter commenting without reading the article.
I'll try to get at the article from the Uni with journal access tomorrow.
Kris