Recorded Speech to Text Software?

Posted by Cliff on Friday January 23, 2004 @01:23PM from the seeking-lil-help-from-the-processor dept.

shfted! asks: "Recently, I've been given the task of transcribing several dozen audio tapes of interviews into typed text, that is: listen for 10 seconds, write what was said, repeat. At around 4 hours per hour-long tape, I would like to automate the process somehow. Recording the tape into the computer is no problem, but I need some software that will do the speech recognition accurately rather than quickly -- several hours per tape is not an issue (I have access to several machines running 24/7). I will still have to go over the computer's work to correct any mistakes. A free solution for Linux would be best; non-free and Windows solutions are okay, but a working solution is the highest priority. Can anyone point me in the right direction(s)?"

Lo tek is the way to go in this instance by Txiasaeia · 2004-01-23 13:27 · Score: 4, Interesting

Several hours per tape is acceptable? Well, if you can do one tape in four hours, then two people can do one tape in two hours. In other words, hire a college student at minimum wage for a contract position (i.e. until the tapes are transcribed) and go to it.
It's cost effective, as fast as you need it to be, and best of all, more accurate than any software solution to date. Most software packages are still at only about 90% accuracy, so that's still 24 minutes per four-hour tape that you'll need to correct, and you'll still probably have to listen to the whole thing over again in order to verify the accuracy of any software program.

--
Condemnant quod non intellegunt.
1. Re:Lo tek is the way to go in this instance by bluGill · 2004-01-23 13:38 · Score: 2, Insightful
  
  Tapes can be copied on off time. If they are standard audio cassette tapes, then they are not more than 45 minutes per side anyway, so you are looking at several tapes anyway.
  Even assuming the worst case, 1 tape that is 4 hours long, you can feed the output of the player into the input of a computer, do an ogg (mp3) rip on the stream, and then fast forward to different places. There will be issues merging the copies, but still much less time per person than one person doing the entire thing. (But more work overall, if that matters.)
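The "fast forward to different places" idea can also be done on the ripped file itself. The sketch below is only an illustration, assuming the recording has been saved as an uncompressed WAV file; the function name `split_wav`, the output file naming, and the five-second overlap are all made up for this example. The overlap is there to ease the merging problem mentioned above, since each transcriber hears the start of the next person's section.

```python
import wave


def split_wav(path, chunk_seconds, overlap_seconds=5.0):
    """Split a WAV file into overlapping chunks and return the new file names.

    Hypothetical helper: each chunk covers `chunk_seconds` of audio plus
    `overlap_seconds` of the following chunk, so neighbours share a seam.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    out_names = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        chunk = int(chunk_seconds * rate)      # frames per chunk
        overlap = int(overlap_seconds * rate)  # extra frames shared with next chunk
        start = 0
        index = 0
        while start < total:
            src.setpos(start)
            frames = src.readframes(min(chunk + overlap, total - start))
            name = f"{path}.part{index:02d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)   # frame count in the header is fixed up on close
                dst.writeframes(frames)
            out_names.append(name)
            start += chunk
            index += 1
    return out_names
```

For a one-hour tape split between two people, something like `split_wav("interview.wav", 30 * 60)` would give two files of roughly half an hour each (the file name is a placeholder).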
2. Re:Lo tek is the way to go in this instance by shfted! · 2004-01-23 16:14 · Score: 3, Interesting
  
  Actually, I am a college student hired to transcribe these tapes at $40 CAN a tape, which at 4 hours a tape is just a little above minimum wage where I live. I want to make more than minimum wage, thus my desire to automate things somewhat :) Again, my intent was to have the machine do the first pass, then I could listen and correct errors as I went. Why? I can type continuously at about 70 wpm, but people speak at around 150 to 200 wpm. However, if I have a 90% accurate copy, that means I only need to type 15 to 20 wpm to keep up, correcting on a single pass, thus reducing my time per tape to the duration of the tape.
  
  --
  He who laughs last is stuck in a time dilation bubble.
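The arithmetic in the comment above can be checked with a few lines. This is a rough sketch under a simplifying assumption that is not in the original post: every misrecognized word costs exactly one retyped word, with no extra time for moving the cursor or replaying audio. The function name `correction_wpm` is made up for illustration.

```python
def correction_wpm(speech_wpm, accuracy):
    """Words per minute that must be retyped to keep pace with the speaker.

    Assumes each wrong word is fixed by typing one word, nothing more.
    """
    return speech_wpm * (1.0 - accuracy)


for speech in (150, 200):
    needed = correction_wpm(speech, 0.90)
    print(f"{speech} wpm speech at 90% accuracy: retype about {needed:.0f} wpm")
```

Both results sit well under 70 wpm, which is the point being made; in practice the overhead of finding and selecting each error would push the real figure higher.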
3. Re:Lo tek is the way to go in this instance by Directrix1 · 2004-01-24 07:44 · Score: 2, Insightful
  
  The tapes are 1 hour long. You didn't even need to read the article to see that. Four hours is how long it takes to use the start/stop method of transcription. Slowing the tape down to 70% speed and never starting and stopping would take 1 hour and 26 minutes.
  
  --
  Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
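The 1 hour 26 minute figure follows from dividing the tape length by the playback speed. A small sketch, with the function name `playback_minutes` invented for this example:

```python
def playback_minutes(tape_minutes, speed):
    """Listening time for a tape played at `speed` (1.0 = normal speed)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return tape_minutes / speed


total = playback_minutes(60, 0.70)          # about 85.7 minutes
hours, minutes = divmod(round(total), 60)
print(f"{hours} h {minutes} min")           # prints "1 h 26 min"
```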
Simple Suggestion by Anonymous Coward · 2004-01-23 13:28 · Score: 2, Insightful

Given that half-decent speech recognition is still struggling, might I suggest:

1) Give your neighbour's kid $10 to transcribe the tape one afternoon
2) ...
3) Text!
Not really, the technology isn't there by bluGill · 2004-01-23 13:31 · Score: 3, Insightful

The technology to do this isn't really there. If the machine can learn how you speak, it can do it. If you limit yourself to just a few words (1000 perhaps?) it is easier. To do it in general for random speakers, though?
The problem is that people are too varied. I have trouble understanding people from the "deep south". The accent is too thick for my ears. I'm sure they have the same problem with my accent.
That isn't to say don't try it, but don't get your hopes up. Voice recognition is hard, and isn't done well. Just be glad you only have a few to do; my sister's full-time job is typing things like that. (Most of it of less interest, as she describes it.)
1. Re:Not really, the technology isn't there by cognibrain · 2004-01-24 03:01 · Score: 2, Interesting
  
  To do it in general for random speakers though?
  
  The state of the art for arbitrary news broadcasts is about a 20% word error rate. While this isn't good enough for the poster's needs, it turns out to be almost good enough for indexing.
  
  Wonder when we'll start seeing Google return audio and video along with text documents? There's a research project demo of this happening here.
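For readers unfamiliar with the term, word error rate is usually computed as the word-level edit distance (substitutions, deletions and insertions) between the recognizer's output and a correct reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, not the scoring tool any particular research group used; the function name and the sample sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumped over the lazy dog"
print(word_error_rate(reference, hypothesis))  # one substitution + one deletion over 10 words
```

A 20% rate, as in this made-up sentence, means about one word in five needs fixing.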
Existing software by skinfitz · 2004-01-23 13:37 · Score: 2, Interesting

What about simply plugging the tape into a system running Dragon Naturally Speaking or IBM ViaVoice?

From the Dragon page:
True Continuous Speech - Speak to your computer naturally and at a normal pace--without pausing between words. Your spoken words swiftly appear on your computer screen.
1. Re:Existing software by lambent · 2004-01-23 13:43 · Score: 2, Interesting
  
  The problem is that you have to train the software for your voice before you can obtain any high degree of accuracy. Not possible with pre-recorded speech.
  
  Hell, the new voice-mail voice-activated menus that have been popping up when I dial customer service sometimes force me to say out my phone number. And even to do that accurately, I have to speak very slowly and quite loud. More often, I just press random buttons until I get dumped to a live operator. (Try it, it works!)
You must be joking by Radical+Rad · 2004-01-23 13:38 · Score: 3, Insightful

Just do the tapes. It will take longer to screw with software setup and cleanup than to just do it. But if you either buy or rig up a foot switch to play/rewind the tape I think it would help. Also I am assuming you are a touch typist. If not then get someone who is to do this job for you.
1. Re:You must be joking by splattertrousers · 2004-01-23 13:46 · Score: 4, Insightful
  
  If not then get someone who is to do this job for you.
  Court reporters do this kind of thing for a living and some (all?) are contract workers. They can do it in real time and would probably be quite happy to be able to do it all at home rather than in a deposition room or court room. Oh, and their accuracy would be a lot higher than if you did it yourself without checking or if you hired a student to do it.
  Though a tech solution would be cool...
2. Re:You must be joking by T-Ranger · 2004-01-23 16:04 · Score: 2, Informative
  
  Court reporters are not typing on a QWERTY keyboard; it's something Wikipedia calls a "syllabic chord keyboard".
  Basically, rather than typing in characters to form words, they are typing in syllables to form words. Sometime later they transcribe the shorthand into full text. So while recording speech in real time, they are not transcribing it into full text.
  And somewhere back in my brain ISTR that pretty much all US court proceedings have been recorded on audio tape for decades. I know for a fact that the local court houses (Halifax, Nova Scotia, Canada) have over the last decade or so invested huge amounts in real time, computer based, audio recording gear. So, in addition to having the shorthand version, when transcribing into full text, the reporter would have the ability to listen to it again.
Slow the playback down by billh · 2004-01-23 13:48 · Score: 4, Informative

Slow the playback down and type them as you listen. If you can't do this, hire someone who can. I know many people that can keep up with spoken conversations in real-time.
Years ago, I improved my own typing speed and accuracy by transcribing phone conversations with friends. It just takes some practice.
Of course, if you are listening to this guy, you can disregard my advice.
Sphinx by jcausey · 2004-01-23 14:03 · Score: 4, Informative

Give Sphinx a try. It's pretty accurate, especially Sphinx-3. I've used v2 before for a live test, and it works great -- even with different voices.
Hire a professional by rueger · 2004-01-23 14:57 · Score: 4, Insightful

If your hours of tape are something that has to be transcribed accurately, don't waste your time trying to do it with a computer.

A person who does transcription for a living will do it faster, probably cheaper, and will be able to handle all of the quirks of human speech that will gum up the works of a voice to text program.

There are still places where a machine cannot match the quality of a real live person.

--
Three Squirrels
SuSE 7.3 by Anthony+Boyd · 2004-01-23 20:03 · Score: 2, Informative

If you can get a copy of SuSE 7.3 Professional, it comes with IBM's ViaVoice for Linux. It can take audio and turn it into text. The trick is that 7.3 came out about 2 years ago, I think. Most stores would have the newer 9.0 version, which doesn't have ViaVoice.

I guess it is possible that IBM still sells ViaVoice for newer distros. I've never looked.

--
My Greasemonkey scripts for Digg &
Re:automatic transcription by lukew · 2004-01-23 23:39 · Score: 2, Funny

Ahh crap.

Way to make a dick of yourself on Slashdot #445:
NOT USE THE BLOODY PREVIEW BUTTON.
Some (slightly) OT Advice by travail_jgd · 2004-01-24 03:21 · Score: 2, Informative

A friend was in a similar situation -- she had recorded a phone interview [1], and needed to transcribe it. To make certain there were no technical glitches, the interview was recorded to cassette and as a WAV file on her PC.

When the time came to transcribe the interview, she found the version on her PC more helpful -- her hands never had to leave the keyboard in order to pause or "rewind" the audio.

If you go this route, remember that you'll need about 600 MB per hour of uncompressed audio. If space is an issue and you need to compress, don't max out the compression; saving a few megabytes here and there could result in hours of extra work due to artifacts.

[1] With explicit permission given.
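The 600 MB per hour figure matches uncompressed CD-quality audio. The original comment does not say which recording settings were used, so the defaults below (44.1 kHz, stereo, 16-bit) are an assumption, and the function name is made up for this sketch.

```python
def pcm_bytes_per_hour(sample_rate=44_100, channels=2, bytes_per_sample=2):
    """Size in bytes of one hour of uncompressed PCM audio."""
    return sample_rate * channels * bytes_per_sample * 3600


cd_quality = pcm_bytes_per_hour()                    # 635,040,000 bytes
speech_mono = pcm_bytes_per_hour(22_050, 1, 2)       # 158,760,000 bytes
print(f"CD quality:   {cd_quality / 1e6:.0f} MB per hour")
print(f"22 kHz mono:  {speech_mono / 1e6:.0f} MB per hour")
```

Recording in mono at a lower sample rate cuts the size to a quarter without any lossy compression, which may be enough for speech, though whether it stays intelligible for a given tape is something to check by ear.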