Amazon AI Researchers Release a Dataset of 400,000 Transliterated Names To Aid the Development of Natural-Language-Understanding Systems (amazon.com)

Posted by msmash on Thursday August 9, 2018 @03:21AM from the marching-forward,-together dept.

New submitter georgecarlyle76 writes: Amazon AI researchers have publicly released a dataset of almost 400,000 transliterated names, to aid the development of natural-language-understanding systems that can search across databases that use different scripts. They describe the dataset's creation in a paper [PDF] they're presenting at COLING, together with experiments using the dataset to train different types of machine learning models.

12 comments

Pretty amazing by 110010001000 · 2018-08-09 03:46 · Score: 2

It is really amazing all the research they are able to do there. I would have thought the humidity and rain would wreak havoc with computers. Maybe it helps there is no Internet access as well, so they aren't distracted by social media and can focus on AI research.
1. Re:Pretty amazing by Anonymous Coward · 2018-08-09 04:54 · Score: 0
  
  All of those things have gotten markedly better since they moved their headquarters from Seattle to the Amazon rainforest in Brazil.
2. Re:Pretty amazing by Anonymous Coward · 2018-08-09 05:20 · Score: 0
  
  It's very likely that tribesmen in the deepest reaches of the Amazon have better internet access than most of Seattle.
3. Re:Pretty amazing by Tablizer · 2018-08-09 08:16 · Score: 1
  
  All because they run Linux instead of Windows.
  
  --
  Table-ized A.I.
Not the usual NN/ML hype paper by isj · 2018-08-09 05:48 · Score: 2

The paper is informative. They point out the obvious problems (translation from scripts/orthography missing vowels), but also that many names are actually quite rare. In their dataset 73% of the names occur only once.
They also compare the results with traditional hardcoded rules, and find that neural networks may not be better. So kudos for including non-positive results in the paper.
1. Re: Not the usual NN/ML hype paper by Anonymous Coward · 2018-08-09 06:28 · Score: 0
  
  Agreed. All of tech would be better if the people involved were willing to admit that it might not be an infallible panacea.
What do these sentences mean? by dhaen · 2018-08-09 07:32 · Score: 1

"In most names, the pronunciation of the last name is independent of the first or middle names"
"So it makes sense to train a transliteration system on independent pairs of first names, last names, and so on."
I'm confused about the meaning of the sentences above. There seems to be an emphasis on last names. Now as an English speaker that sounds ok, but since this is about multiple languages, where often it's family name first, it doesn't seem to compute.
1. Re:What do these sentences mean? by isj · 2018-08-09 08:03 · Score: 1
  
  What they mean is that there is no or nearly no correlation between first name and last name.
  So John, Bob, Rob, Randy, Elizabeth, Maggie are all equally likely for surname X.
  Of course there will be a weak correlation: if the surname is Fleischer, then the first name has a slightly higher probability of being Jens, Uwe or Reichard.
2. Re:What do these sentences mean? by Anonymous Coward · 2018-08-09 21:57 · Score: 0
  
  I am much more concerned about how it will fare in tolerating superfluous apostrophe's...
Are they confusing transliteration &transcript by Anonymous Coward · 2018-08-10 00:08 · Score: 0

The example given (cannot quote here because Slashdot Unicode yada yada) is clearly not transliteration. Transliteration isn't based on pronunciation. An NLP project like this should have a linguist at least as an advisor so they can avoid using words wrong.