IBM Strives For 'Superhuman' Speech Tech

Posted by ScuttleMonkey on Tuesday January 24, 2006 @09:34PM from the fansubbing-in-jeopardy dept.

robyn217 writes "IBM unveiled new speech recognition technology today that can comprehend the nuances of spoken English, translate it on the fly, and even create on-the-fly subtitles for foreign-language television programs. One of the projects perpetually monitors Arabic television stations, dynamically transcribing and translating any words spoken into English subtitles. Videos can then be viewed via a web browser, with all transcriptions indexed and searchable."

Which ... by spiny · 2006-01-24 21:36 · Score: 3, Interesting

Which witch blew the blue candle out ?

--

Fry: heh, Yakov Smirnoff said it
Leela: No he didn't.
1. Re:Which ... by jcupitt65 · 2006-01-24 22:41 · Score: 5, Interesting
  
  Or I can wreck a nice beach versus I can recognise speech.
  Sometimes you need rather a large context to disambiguate: is this sentence part of a discussion on shore-front management, or spoken language understanding?
2. Re:Which ... by Squalish · 2006-01-25 04:18 · Score: 2, Interesting
  
  The computer is being programmed with the goal of understanding the user, not some arbitrarily defined 'perfect speech' dialect/accent.
  
  --
  People in Soviet Russia, however, appear to be afflicted with amusing juxtapositions of the aforementioned situation
3. Re:Which ... by Anonymous Coward · 2006-01-25 06:37 · Score: 1, Interesting
  
  Which witch blew the blue candle out
  
  Computer: In my probability based language model, "*empty* Which witch" occurs more than "*empty* witch witch", "*empty* witch which", or "*empty* which which". Therefore I will assume "which witch".
  
  My point is that computers, like human beings (shockingly enough), use contextual information. Assuming they don't is assuming dumb programmers, low computer resources, or not much interest in the problem. All 3 assumptions are wrong (given a relative definition of 'low')
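
A minimal sketch of the kind of probability-based choice described above, assuming a toy bigram model with add-one smoothing. The corpus, the `START` marker (standing in for the "*empty*" context), and all function names are made up for illustration; this is not IBM's method, and real recognizers use far larger models.

```python
from collections import Counter
from itertools import product

START = "<s>"  # hypothetical sentence-start marker, i.e. the "*empty*" context

# Tiny invented training text; a real system would count over a huge corpus.
CORPUS = [
    "which witch cast the spell",
    "which witch blew the candle out",
    "the witch blew the blue candle out",
    "which candle blew out",
    "the wind blew the blue flag",
]


def train(sentences):
    """Count how often each word follows each context word."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = [START] + sentence.split()
        vocab.update(words[1:])
        context_counts.update(words[:-1])
        bigram_counts.update(zip(words, words[1:]))
    return context_counts, bigram_counts, len(vocab)


def sequence_probability(words, model):
    """Bigram probability of a word sequence, with add-one smoothing."""
    context_counts, bigram_counts, vocab_size = model
    prob, prev = 1.0, START
    for word in words:
        prob *= (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)
        prev = word
    return prob


def disambiguate(slots, model):
    """Try every combination of sound-alike words and keep the likeliest."""
    return max(product(*slots), key=lambda cand: sequence_probability(cand, model))


MODEL = train(CORPUS)

# Each slot lists the spellings that sound identical to the recognizer.
HEARD = [
    ["which", "witch"], ["which", "witch"], ["blew", "blue"],
    ["the"], ["blew", "blue"], ["candle"], ["out"],
]

print(" ".join(disambiguate(HEARD, MODEL)))
# prints: which witch blew the blue candle out
```

Trying every combination is only workable for a short sentence like this one; the brute-force search is used here purely to keep the sketch small.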
Opensource? by Anonymous Coward · 2006-01-24 21:45 · Score: 1, Interesting

Will IBM make this technology public or will it be proprietary?
Re:Coherency? by Yahweh Doesn't Exist · 2006-01-24 21:48 · Score: 3, Interesting

yes, there will always be delay for the reason you state. but that's true even with human translators, yet no-one claims real-time meetings between people via translators is a waste of time.

since even "live" broadcasts are usually delayed several minutes for technical and legal reasons anyway, if this technology can get to the state where you're just one or two sentences behind real-life it will be effectively real-time anyway for almost all practical purposes.
IBM and Google cooperation to come? by Mostly a lurker · 2006-01-24 22:13 · Score: 2, Interesting

IBM has been one of the pioneers in speech recognition for a long time. However, indications are that Google (in the lab) has been making tremendous progress in translation. While the two companies are bound to be fierce competitors, it would seem they would both have much to gain from cooperation in the area of language recognition and translation.
Re:Foreign languages are complex... by Mushdot · 2006-01-24 22:16 · Score: 3, Interesting

I have a friend who works in Japan and he tells me the same. He often goes to watch English films that are subtitled in Japanese and tells me that they completely mistranslate most of the jokes and miss subtle nuances of speech. One example he gave was a scene from 'The Full Monty' (I'm doing this from distant memory so it might not be quite right - in fact, a bad translation :-)

One of the characters is shouting up to someone in their bedroom window. They don't respond to the shouting and the character says "He obviously can't hear me because of his triple glazing".

This is a sarcastic comment relating to the house owner's supposed wealth but in Japanese it was translated as:

"He has thick windows"

Perhaps in this case there was no easy way to translate - but I suspect films are probably translated in one pass and there is no time to understand the context of each sentence spoken so it's left to literal translation only.
This won't make speech recognition mainstream by thbb · 2006-01-24 22:16 · Score: 4, Interesting

As has been the case for the past thirty years, descriptions of the system's prowess are still written in the conditional form: "...IBM technology can be used to control computers and devices..." rather than the active form: "is being used"...

Ben Shneiderman is the person who, in my opinion, best articulates the limits of speech recognition.

One of my favorite phrases to explain this issue is: "You don't want to speak to a computer, because you can't speak and think at the same time". More precisely, producing speech makes use of some modules in our brain which are required for planning too. Hence, you can't plan what to do next as well when you speak, which is a big hurdle in the type of intellectual activities one carries out with a computer.
American or English? by squoozer · 2006-01-24 22:30 · Score: 2, Interesting

I realize that Americans and British (English at least ;o)) speak essentially the same language but I have yet to find any speech recognition software that can get more than roughly 85% of what I say correct. I have a fairly soft neutral English accent with pretty good enunciation so I would have expected to be getting a recognition rate in the high 90%s. I'm wondering if, as most of this software is developed in the US, it is tuned specifically to pick up on English with a US accent? I realize that you train the software for your voice but AIUI all you are doing is tuning a basic speech model. Has anyone else had this problem or is it just me?

--
I used to have a better sig but it broke.
Re:Coherency? by dancallaghan · 2006-01-24 22:40 · Score: 3, Interesting

but I personally don't think there could ever be real time translation for the following reason. [German]

You are going to have that problem whether it's a machine doing the translating or a human. As I understand it, interpreters of German get around this by some quick-thinking restructuring of the translated sentence, or they simply lag a half-sentence or so behind.

The real problem for machine translation is, and always has been, determining the sense of a word from context (indeed I recall a recent Slashdot article about some guy who suggests this is the separating factor between computers and animal intelligence). Most languages have a great many homonyms whose meaning a listener can determine only from the surrounding context and, often, general background knowledge of the language or topic at hand.
Not _that_ amazing by johndoe42 · 2006-01-24 22:42 · Score: 2, Interesting

It's been well-known among language researchers that both speech recognition and parsing/comprehension are much easier when applied to a small problem domain. SRI in Palo Alto and CSLI at Stanford, for example, have a number of very impressive speech recognition packages that understand, for example, medicine-related sentences. The dashboard controls just sound like a logical progression of this to faster computers and an even smaller problem domain. They're cool nonetheless.

The translation, on the other hand, sounds damned impressive. For unrestricted content, especially with an untrained voice (I imagine that IBM isn't individually training to each Al Jazeera talking head), 70% recognition sounds quite good. 70% accuracy post-translation ought to be quite a bit better than what's currently out there. The description of MASTOR, however, is useless -- it could easily describe anything that isn't word-for-word translation.
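
For readers wondering what a figure like "70% recognition" could mean in practice: the article does not say how it was measured, but a common yardstick is word error rate (WER), the word-level edit distance between what was said and what was recognized, divided by the length of what was said. A rough sketch, with invented sentences and a hypothetical function name:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word was dropped
            insertion = cur[j - 1] + 1  # extra word was recognized
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Invented example: 3 of 10 words are wrong, so WER is 0.3 and word
# accuracy is roughly 70%.
said = "the quick brown fox jumps over the lazy dog today"
recognized = "the quick brown box jumps over a lazy fog today"
print(word_error_rate(said, recognized))
```

Because inserted words count as errors too, WER can exceed 100%, which is one reason accuracy numbers from different vendors are hard to compare without knowing the method.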
Re:Foreign languages are complex... by anum · 2006-01-24 23:33 · Score: 2, Interesting

Ya, I got ya'.

I almost added "I just hope GWB doesn't decide to fire all his intell linguists based on this post" but it seemed kind of like bashing the Prez and I would never do that...

Cheers

--
I don't think, Therefore I'm not.
funny this subject should come up... by dafragsta · 2006-01-25 00:04 · Score: 2, Interesting

I've actually never used any speech recognition software before today. That said, today just happens to be the day. That said, I tried out Dragon NaturallySpeaking for the first time, and it is a complete coincidence that this topic should come up. I'm actually dictating this post with Dragon, as we speak. ha ha

the training process definitely has its ups and downs. The more you work with it however, the more it becomes attenuated to your own speech patterns and moreover, the quirky words we use every day. If you can get past the first two or three hours, you'll see that it is totally worth the effort, especially if this IBM tech isn't available to end-users for some time. There is also an aspect of the software training you, while you train the software. At the present time, I can dictate to slightly slower than I can probably type.

In the end, I can see where this would make a writing e-mails and other such time-consuming tasks, which involve spellchecking, grammar, and other proof reading significantly quicker. When you really hit your stride, it's easy to write at the speed of thought, which is really appealing. There are caveats, however. it's very easy to dictate several sentences worth of tax and taken for granted that it to everything down the way you attendedselect tax select select tax undo
Real-time eavesdropping by 0xC2 · 2006-01-25 00:30 · Score: 2, Interesting

Although most of the discussion so far has focused on foreign language translation, this technology is about *real-time-audio-to-text* conversion. The feds will be able to monitor, analyze, and record our conversations in real time:

Monitor all conversation.
Apply real-time text filters.
Assign live agents to priority eavesdropping.
Profit!

If you could apply a filter to listen in to any call what would it be?

--
Be heard || Be herd
Let's see it translate poems by roman_mir · 2006-01-25 01:57 · Score: 2, Interesting

When and if it can translate poems from language to language, while keeping the style, the nuances, the rhythm, the cultural references, the general idea and the details, then we will know - it is done. Until then, don't hold your breath.

--
You can't handle the truth.
1. Re:Let's see it translate poems by hunterx11 · 2006-01-25 02:23 · Score: 3, Interesting
  
  I'd be happy enough if humans could do this.
  
  --
  English is easier said than done.
S-to-T in hospitals by stardancer · 2006-01-25 02:20 · Score: 2, Interesting

I know that one hospital in Norway has been experimenting with/testing speech-to-text software for a while, and reports say it's been very successful! (this supports what was said about speech recognition within a tight context in an earlier comment). I believe the plan is to, at some point, eliminate the need for secretaries transcribing what the doctors dictate, so that ideally the doctors can just speak into a mic and the text automagically appears in the patient's (electronic/digital) journal!
This of course worries secretaries, since they might eventually lose their job/"career". On the other hand it would improve efficiency *a lot*.

--
There's nothing too profound behind this sig.
breakdown of the article by Anonymous Coward · 2006-01-25 04:32 · Score: 1, Interesting

The article is really saying two things:

1. IBM has updated their ViaVoice large vocabulary continuous speech recognition (LVCSR) engine.

2. IBM has paired ViaVoice with some clever apps to use the ViaVoice output in interesting ways (e.g. "on the fly" recognition, translation).

Things that are not obvious from the article:

1. ViaVoice has been around for ages and has always been pretty darn good at LVCSR. Without seeing numbers and knowing exactly how they were measured, it's impossible to know how much of an improvement 4.4 is over previous versions.

2. Speaker-dependent speech recognition can always achieve much higher accuracy rates than speaker-independent systems like ViaVoice. Dragon NaturallySpeaking is an example of speaker-dependent speech recognition.

3. Limited grammatical contexts (i.e. language models with low perplexity) always give better recognition than when you don't know what to expect next. For example, when your phone only has to tell "home" and "wife" apart, it's a lot less likely to make a mistake than if it has to figure out which word out of a list of 50,000 you just said. The more context, the better. The most interesting tech in the article seems to be the algorithms "that can determine this context on the fly."

4. No improvements in translation technology were noted in the article; it sounds like they might as well have fed ViaVoice through BabelFish, made it happen in real time, and slapped a UI on it. The app might be new, but the tech is not.
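
To put a number on point 3 above, here is a small sketch of perplexity as an "effective number of choices" for the next word. It assumes the recognizer's expectations can be written as a plain probability list; the function name and the example distributions are invented, and perplexity of a real language model is measured on held-out text rather than on a single distribution like this.

```python
import math


def perplexity(probs):
    """2 ** entropy: the effective number of equally likely next words."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy


# Voice dialing with two equally likely names: perplexity of about 2.
two_names = [0.5, 0.5]

# Open dictation with 50,000 equally likely words: perplexity of about 50,000.
open_vocabulary = [1 / 50_000] * 50_000

# Context helps: if the topic makes one word far more likely than the
# others, the effective number of choices shrinks below the list size.
with_context = [0.9, 0.05, 0.05]

for name, dist in [("two names", two_names),
                   ("open vocabulary", open_vocabulary),
                   ("with context", with_context)]:
    print(name, round(perplexity(dist), 2))
```

The fewer effective choices the recognizer has at each step, the less room there is to pick a wrong word that merely sounds similar.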