The conversational telephone speech (CTS) results I quoted above were achieved using a state-of-the-art research system running under 10 times real time (10xRT), i.e., using less than 10 hours to transcribe an hour of speech. The winning system in the 2004 DARPA EARS evaluation achieved 15.2% WER. For a system description, see this paper (requires a subscription to IEEE Xplore). In 2004, many EARS teams achieved the same level of performance in real time as their 10xRT systems had in 2003. Since the EARS program was killed after the 2004 evaluation and DARPA's focus has shifted to foreign languages (GALE), it is hard to say what the current state of the art in English CTS transcription is, or when that level of performance will be available in commercial products.
Just to correct my earlier post: Arabic broadcast news (BN) transcription word error rates are still around 20%. The Mandarin Chinese BN character error rate is close to 10%.
I beg to disagree with you on the relative difficulty of speech-to-text (STT) and machine translation (MT). The state of the art in broadcast news transcription is currently over 90% accurate in English and close to 90% in both Arabic and Chinese, where accuracy is 100 minus word error rate (WER). Also, English conversational telephone speech transcription reached over 85% accuracy during the DARPA EARS program. However, translation accuracy, measured as 100 minus human-mediated translation error rate (HTER), which is the official metric in DARPA GALE, is only around 80% for both Arabic-to-English and Chinese-to-English.
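For anyone unfamiliar with the metric, here is a minimal sketch of how "accuracy = 100 minus WER" can be computed. The function names are my own, chosen for illustration, and the sketch assumes whitespace-separated tokens with no text normalization; official scoring tools normalize the transcripts first, so their numbers will differ from this.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edits divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def accuracy(reference, hypothesis):
    """Accuracy as used in this post: 100 minus WER."""
    return 100.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"  # one substitution, one deletion
    print(f"WER: {word_error_rate(ref, hyp):.1f}%")
    print(f"Accuracy: {accuracy(ref, hyp):.1f}%")
```

Because insertions count as errors, WER can exceed 100%, so this "accuracy" can go negative on very poor output. Passing lists of characters instead of words to `edit_distance` gives the edit count behind a character error rate, which is how Mandarin is usually scored.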
To counter your last statement, experiments carried out before the GALE 2006 evaluation showed that the translation accuracy on STT output is pretty much the same as the translation accuracy on the reference transcripts. This is clearly due to the poor performance of current state-of-the-art MT. Most of the research in GALE is currently tackling MT, and only when MT is good enough will the STT errors begin to make a difference.