Automated Language Deciphering By Computer AI


Posted by samzenpus on Wednesday June 30, 2010 @03:42PM from the what-about-dwarvish? dept.

eldavojohn writes "Ugaritic has been deciphered by an unaided computer program that relied only on four basic assumptions present in many languages. The paper (PDF) may aid researchers in deciphering eight undecipherable languages (Ugaritic has already been deciphered and proved their system worked) as well as increase the number of languages automated translation sites offer. The researchers claim 'orders of magnitude' speedups in deciphering languages with their new system."

Re:Sweet by Fluffeh · 2010-06-30 15:51 · Score: 3, Funny

But will it go into your ear, or will it be injected via a syringe and live in your gut is the question?

--
Moved to http://soylentnews.org/. You are invited to join us too!
Answers to all TFA questions by cappp · 2010-06-30 15:53 · Score: 5, Informative

Just so we can keep the “didn’t read TFA” comments to a minimum: The four assumptions as laid out in the article are:

- The language being deciphered is closely related to some other language: In the case of Ugaritic, the researchers chose Hebrew.

- There’s a systematic way to map the alphabet of one language on to the alphabet of the other, and that correlated symbols will occur with similar frequencies in the two languages.

- The system makes a similar assumption at the level of the word: The languages should have at least some cognates, or words with shared roots, like main and mano in French and Spanish, or homme and hombre.

- The system assumes a similar mapping for parts of words. A word like “overloading,” for instance, has both a prefix — “over” — and a suffix — “ing.” The system would anticipate that other words in the language will feature the prefix “over” or the suffix “ing” or both, and that a cognate of “overloading” in another language — say, “surchargeant” in French — would have a similar three-part structure.
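
To make the frequency assumption concrete, here is a minimal sketch that pairs symbols purely by frequency rank. This is not the researchers' method, only the simplest baseline the second assumption suggests. The function names `rank_by_frequency` and `align_alphabets` and the toy texts are made up for illustration.

```python
from collections import Counter

def rank_by_frequency(text):
    """Return the distinct symbols of `text`, most frequent first."""
    counts = Counter(ch for ch in text if not ch.isspace())
    # Sort by descending count, then by symbol so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [sym for sym, _ in ordered]

def align_alphabets(unknown_text, known_text):
    """Pair the i-th most frequent unknown symbol with the i-th most frequent known one."""
    return dict(zip(rank_by_frequency(unknown_text), rank_by_frequency(known_text)))

# Toy data (hypothetical): the "unknown" text is the known text under a made-up substitution.
known = "aaaa bbb cc d"
unknown = known.translate(str.maketrans("abcd", "wxyz"))

mapping = align_alphabets(unknown, known)
print(mapping)  # {'w': 'a', 'x': 'b', 'y': 'c', 'z': 'd'}
```

On real text the ranks of two related languages will not line up this neatly, which is presumably why the system also leans on cognates and word parts.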

The article also notes the success rates, where it states that

Ugaritic has already been deciphered: Otherwise, the researchers would have had no way to gauge their system’s performance. The Ugaritic alphabet has 30 letters, and the system correctly mapped 29 of them to their Hebrew counterparts. Roughly one-third of the words in Ugaritic have Hebrew cognates, and of those, the system correctly identified 60 percent. “Of those that are incorrect, often they’re incorrect only by a single letter, so they’re often very good guesses,” Snyder says.
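
A quick back-of-the-envelope check of those figures, plus one way to test the "incorrect only by a single letter" claim. The edit-distance function is the standard Levenshtein recurrence; using it as the near-miss measure is an assumption, and the sample strings are made up, not real Ugaritic or Hebrew.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute, or match at no cost
        prev = curr
    return prev[-1]

letter_accuracy = 29 / 30                      # letters mapped correctly
cognate_share = 1 / 3                          # "roughly one-third" of words have cognates
cognate_accuracy = 0.60                        # share of those identified correctly
overall = cognate_share * cognate_accuracy     # share of ALL words identified via cognates

print(f"{letter_accuracy:.1%} of letters, about {overall:.0%} of all words")
# 96.7% of letters, about 20% of all words

# Hypothetical guess vs. reference: off by exactly one letter.
print(edit_distance("mlk", "mlh"))  # 1
```

So the headline "60 percent" covers only the cognate third of the vocabulary, which works out to roughly one word in five overall.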
Critics noted that

The researchers’ approach, he says, presupposes that the language to be deciphered has an alphabet that can be mapped onto the alphabet of a known language — “which is almost certainly not the case with any of the important remaining undeciphered scripts.” It also assumes, he argues, that it’s clear where one character or word ends and another begins, which is not the case with many deciphered and undeciphered scripts. The decipherment of Ugaritic took years and relied on some happy coincidences — such as the discovery of an axe that had the word “axe” written on it in Ugaritic.
1. Re:Answers to all TFA questions by MichaelSmith · 2010-06-30 15:59 · Score: 4, Insightful
  
  The decipherment of Ugaritic took years and relied on some happy coincidences — such as the discovery of an axe that had the word “axe” written on it in Ugaritic.
  Maybe I should go around and write "computer" in English on all my computers, as a service to future language researchers.
  
  --
  http://michaelsmith.id.au
2. Re:Answers to all TFA questions by DurendalMac · 2010-06-30 16:01 · Score: 2, Interesting
  
  Darn. So the Voynich Manuscript is probably not a prime candidate.
3. Re:Answers to all TFA questions by jd · 2010-06-30 17:43 · Score: 2, Insightful
  
  Neither is my great great grandmother's cookbook. Which really is a shame, as I strongly suspect the recipes make something more edible than what's served at the local coffee shop.
  
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
4. Re:Answers to all TFA questions by vlueboy · 2010-06-30 18:29 · Score: 3, Interesting
  
  The decipherment of Ugaritic took years and relied on some happy coincidences — such as the discovery of an axe that had the word “axe” written on it in Ugaritic.
  Maybe I should go around and write "computer" in English on all my computers, as a service to future language researchers.
  Extinct language researchers examining English would fail at this same task 3000 years from now. English has no nouns -- it has brand names: today's "computers" have big "Dell" logos but not "Computer."
  Also, how would researchers realize that [Apple Mac Glyph] isn't an integral part of our "ancient moon runes" if seen from their era? :)
5. Re:Answers to all TFA questions by mrsurb · 2010-06-30 19:46 · Score: 4, Funny
  
  Also, how would researchers realize that [Apple Mac Glyph] isn't an integral part of our "ancient moon runes" if seen from their era? :)
  They'd probably see it as having some sort of religious significance. And they'd be correct.
6. Re:Answers to all TFA questions by L4t3r4lu5 · 2010-06-30 21:14 · Score: 2, Funny
  
  So you're telling me he was at Woodstock '69?
  
  For those who don't know what it was like, clicky
  
  --
  Finally had enough. Come see us over at https://soylentnews.org/
Re:Sweet by doishmere · 2010-06-30 15:53 · Score: 3, Informative

Their method relies heavily on the unknown language being related to a known language by some degree. At the heart of their technique is Bayesian statistics applied to lexical and frequency analysis; for this approach to work, there must be some basis for comparison.
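
For a feel of what "Bayesian statistics applied to frequency analysis" can mean, here is a toy sketch: score two candidate symbol mappings by how well the unknown script's symbol counts fit the known language's letter probabilities, then normalize into a posterior. The paper's actual model is much richer than this. The probabilities, counts, and both candidate mappings below are invented for illustration.

```python
import math

def log_likelihood(counts, mapping, known_probs):
    """Log-probability of the observed counts if `mapping` were right.

    Multinomial model; the combinatorial constant is dropped because it is
    identical for every candidate mapping.
    """
    return sum(n * math.log(known_probs[mapping[sym]]) for sym, n in counts.items())

def posterior(counts, candidates, known_probs):
    """Posterior over candidate mappings under a uniform prior."""
    logs = [log_likelihood(counts, m, known_probs) for m in candidates]
    top = max(logs)                               # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical letter probabilities in the known language.
known_probs = {"a": 0.6, "b": 0.3, "c": 0.1}
# Hypothetical symbol counts observed in the unknown script.
counts = {"X": 6, "Y": 3, "Z": 1}

candidates = [
    {"X": "a", "Y": "b", "Z": "c"},   # frequent symbol -> frequent letter
    {"X": "c", "Y": "b", "Z": "a"},   # frequent symbol -> rare letter
]
post = posterior(counts, candidates, known_probs)
print([round(p, 4) for p in post])
```

The first mapping takes nearly all of the posterior mass, because the second one has to explain six sightings of `X` with a letter that should be rare. Without a related known language there is nothing to supply `known_probs`, which is the "basis for comparison" point.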
Re:Sweet by Anonymous Coward · 2010-06-30 15:58 · Score: 5, Funny

Good news, it's a suppository.
Pfft, why? by mdenham · 2010-06-30 16:01 · Score: 5, Funny

Label at least one computer "ham sandwich" to confuse future language researchers.
Alternatively, label each computer with a character's name from (insert show of your choice here).
1. Re:Pfft, why? by L4t3r4lu5 · 2010-06-30 21:11 · Score: 2, Insightful
  
  How idiotic. Name servers that way if you must, but workstations should be named by geographic location, building, room, station number. Nicknames don't count, but for sanity's sake name your equipment logically.
  
  --
  Finally had enough. Come see us over at https://soylentnews.org/
Linear A Implications by DowdyGoat · 2010-06-30 16:03 · Score: 5, Interesting

This is very cool for us undeciphered language fans.
In the article, the author Andrew Robinson correctly points out that this computer program won't work for languages that have no known close relative, such as Linear A from Crete, which is definitely not Greek like Linear B turned out to be. There is a lot of speculation that Linear A is a native Minoan (Cretan) script, largely unrelated to any other known script.
However, parallel with Linear A on Crete was a Cretan pictographic script, which may, or may not be related to Egyptian hieroglyphics. The Minoans had known trading ties to Egypt, which had written language long before them. If a relationship could be found (via this computer program) between the Minoan pictographic script and Egyptian hieroglyphs, then that might give insights into how the Linear A script was set up (which is a syllabary script).
The only difficulty is that there may not be enough of the pictographic script to work--I'd imagine you'd need a fair number of examples to really allow the computer to compare and contrast.
1. Re:Linear A Implications by KritonK · 2010-06-30 17:59 · Score: 3, Informative
  
  Actually, the program might be able to help: From what I understand, the Linear A alphabet is related to the Linear B alphabet, which has been deciphered, even though the languages may be different. We know a bit about context (what we have are mostly inventories), and we even know the meaning of one word: the one next to the total of the amounts in the inventory probably means "total". Furthermore, that word, ku-ro, is similar to a form of a Greek word for "total" ("houlon"), so it is very likely that the language is at least Indo-European in origin. One could try using various Indo-European languages as candidates for the related language, until the program comes up with something meaningful.
  Now, if only we had a larger sample of the language of the disk of Phaestos...
Next step: by BoppreH · 2010-06-30 16:08 · Score: 2, Insightful

Voynich manuscript!

If only we could find a language that is similar enough...
Re:Sweet by grcumb · 2010-06-30 16:18 · Score: 4, Funny

Universal translator, here we come!
Cool! Can I bring it into my next marketing meeting?

--
Crumb's Corollary: Never bring a knife to a bun fight.
Re:Sweet by Walt Dismal · 2010-06-30 16:32 · Score: 4, Funny

Only if the gross gains in closing juncture exceed the long-term sustainability goals of the viability imperative for all mass interoperability. We at Mega Industries believe this will move us forward to our cloud-based monetization of the human-media dynamic which is strategically important in an ever-evolving mobile continuum. We have directed our customer experience champions to ensure consumers realize this when they call in with emphatic expressions of dissatisfaction.
Screw the article.... by djupedal · 2010-06-30 18:02 · Score: 2, Informative

IBM, as one example, has been on this hard since 2002 ( http://news.cnet.com/2100-1008-998264.html ) when the prize was first announced... stop going all Lady Gaga over stuff that is so old it can't even be recycled properly.
Re:Sweet by jd · 2010-06-30 18:25 · Score: 2, Interesting

Well, Old Norse is technically based on Old Germanic rather than the other way round, and Old English not only had Old Germanic input but Old Norse input as well. Along with an uncertain amount of Anglic (amazingly little is known about the Angles), possibly some Jute. English uses Norman French, plus modern French (which itself is derived from Norman French). Norman French survives in the modern world in Guernsey, Jersey and maybe some other Channel Islands but became extinct on Alderney.
To bring this Back On Topic, if English were lost, it would be almost impossible to use this program to recover it. English has input from too many sources, resulting in way too many loan-words of incompatible structure and too much incompatible grammar. However, one very interesting test of the program would be to map each of the derived phonemes in Proto-Indo-European to a character, then compare this derived PIE script with each Indo-European language in turn. If the derivation is correct, the number of correct guesses for translations of PIE words into each known IE language ought to be above what would be expected by chance alone AND the translations should remain compatible with the derivations the PIE engineers used in the first place. By comparing across the translations for all languages, the program may discover other word-parts that had not been noticed before.
It may be possible to determine if a language is truly isolate or not, by analyzing against a language multiple times using slightly different data sets and seeing if the results remain about the same. If this test works, then languages of uncertain/unknown ancestry (such as Basque and Etruscan*) can be tested against all 7,200 known languages to see if any of them produce a moderately stable match. No match means no connection with any other existent linguistic family tree.
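
A rough sketch of that stability test, assuming you already have some similarity score between a word sample and a candidate language. The `match_score` below is a deliberately crude stand-in (share of sampled words found in a lexicon), and the word list and lexicons are made up. Resample the unknown language's words many times and see whether the same candidate keeps winning.

```python
import random
from collections import Counter

def match_score(sample, lexicon):
    """Toy similarity (an assumption): share of sampled words found in the lexicon."""
    return sum(w in lexicon for w in sample) / len(sample)

def stability(words, lexicons, rounds=200, seed=0):
    """Resample `words` with replacement and record which language wins each round."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(rounds):
        sample = rng.choices(words, k=len(words))     # one slightly different data set
        best = max(lexicons, key=lambda name: match_score(sample, lexicons[name]))
        wins[best] += 1
    # Share of rounds won by each candidate language.
    return {name: wins[name] / rounds for name in lexicons}

# Hypothetical data: most of the unknown words resemble lang_a.
words = ["aa", "bb", "cc", "dd", "ee", "ff"]
lexicons = {
    "lang_a": {"aa", "bb", "cc", "dd", "ee"},
    "lang_b": {"ff"},
}
result = stability(words, lexicons)
print(result)
```

A related language should win almost every round, as lang_a does here. For a true isolate you would expect the winner to jump around from one resample to the next, with no candidate taking a clear majority.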
*Etruscan is a bugbear. There is one book that is completely intact and undamaged. It's made of gold leaf. The academic who currently owns it has not published so much as a single line of the text, merely two of the illustrations. All other Etruscan texts are fragmentary (so you've very little context to work with and not many words that are definitely complete) or too short to be useful. We don't know what Etruscan is related to, but if the above hypothesis is correct, we could find out and then translate the book. But the damaged texts, such as a linen book used to wrap a mummy, are way too fragmentary. You'd never be sure if such a translation was correct. A complete book, on the other hand, would offer no possibility for mistake. It would work or it wouldn't.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
You want to impress me... by ngc5194 · 2010-06-30 18:44 · Score: 3, Funny

... see if it can decipher some of the perl code I've had to take over.