Open Source Transcription Software?
sshirley writes "I am beginning to do some interviews with family members and will do some audio journals for genealogy purposes. I would really love to be able to run the resulting MP3 or WAV files through some software a get a text file out. I know that software like this exists commercially. But does this exist in the open source world?"
Looks active.
let's set so double the killer delete select all.
Seriously, transcribe it manually... automatic speech recognition just doesn't work. And can never work, because much of the time the only reason humans can understand each other is by making informed guesses based on context, which a computer program cannot do.
Carnegie Mellon has an open source speech recognition project you might want to look into. Sphinx
Fight or flight its all the same
Live to die another day
--Ryan
I spent several months searching for something like this. Open-source voice recognition is still in its infancy, and there does not seem to be much interest in improving the few things we have.
Warning, knife is sharp. Please keep out of children.
It seems like there should be some way to "hack" the audio transcription that google offers through google voice or youtube. Unfortunately I haven't found a way to upload a file. With youtube, if you make a fake movie, it gives an error that it can't be transcribed. Getting google voice to work would require some sort of phone interface I suppose...
just upload it to youtube, its genius google transcription technology will make everything sense out of it.
I'm interested in an open version of a transcription app (I run a lab with a lot of this software/equipment), but this is a very vertical market - up until recently there wasn't any standard interface for the foot pedal (newer ones are USB HID devices now).
I had to throw away a bunch of Sony serial devices because they only worked with one app that I can't get to work on newer versions of Windows.
upload to youtube and let it create closed captions. the results won't be perfect, but it will be better than most software.
I've tried simon and julius, but couldn't get past the learning curve to do actual transcription. I will say that it looks like both could be better for recognizing "just your own voice" once you get past the learning curve enough to train. The commercial software is good at recognizing everybody's voice, which isn't that helpful for transcription.
Why don't you give XTrans a shot: XTrans
Have you looked into the Speech APIs baked into Vista and Windows 7? If you're familiar with .NET coding, version 4 of the framework provides easy-to-use hooks into the speech API. The only problem is that it is designed to be used with fairly specific (programmer-supplied) grammars/lexicons. It does come with a general speech recognizer, but you'll get some interesting results without training it first.
http://msdn.microsoft.com/en-us/magazine/cc163663.aspx
Another downside is that it natively supports only WAV files, but that can be addressed with some rolling-your-own goodness.
A black hole is where God divided by 0
You should look into using Amazon Mechanical Turk.
See this: Cheap, Easy Audio Transcription with Mechanical Turk
I've put a bunch of time into this for a project of my own. The short answer is, no, I have found no such program. I've experimented with a few older programs, but they're useless. Sorry.
You could record it, then call yourself and play it through the phone.
Most Windows Vista or Win7 machines come with a built in transcribing feature, that you can enable in the control panel (Win7, under ease of access, Speech recognition).
However - the only way it works properly is if you train it to understand you personally. You load your profile, and it'll run you through a whole bunch of test sentences. The FULL test takes about 20 minutes, I think (it's been a while since I've used it) - and actually works quite well. There is a cutoff point at about two and a half minutes if you want to stop and try it out. It can actually make the machine keyboard- and mouse-free if you want. When you open a browser, it highlights everything on the web page that's clickable and assigns it a number, and you simply say "Click 7" and it hits the reply button for you. Then you talk when the textbox has focus and it'll transcribe every word you say.
I did this for my girlfriend's paper once: I read it aloud (you have to mention things like comma, end paragraph, etc.) and put it into a Word document. Out of a 15-page single-spaced essay, it got 3 sentences wrong - and that's only because I was mentioning some of the more obscure Greek names (she's a history major). It managed to get full sentences regarding Octavia and her fondness for libraries without error, which I thought was odd since that's not a name you hear every day.
Anyways - if he wants to do this, he should record the test phrases (there will be a lot though) and have each of his interviewees read the test sentences so he can then relay those through the computer and train the computer for each person.
All in all - he may still run across a few errors, but it's not nearly as bad as, say, Google Voice Mail, which tries to figure out what you're saying without having any previous knowledge of how that person speaks. Windows Speech Recognition is something that will handle what he's after, though.
Buy a USB foot control (check out infinity or fortherecord), and download the free player from fortherecord.com. You can stop, start, rewind and fast-forward without having to take your eyes off the screen or leave your word processing app.
Here's a list. In my experience, only Dragon is worth trying, with the following caveats:
On the plus side, correction is easy -- read the document, and select words that look wrong to hear what they sounded like.
Most of the other programs are aimed at very small vocabularies (i.e. 100 words) for accessibility applications (controlling a computer).
I've worked on loooads of transcripts. I did most of these:
* http://wiki.fsfe.org/Transcripts
The best technique I've found is to have mplayer play the audio at 60% normal speed and have a text editor (emacs is my preference) in another window, flick between them with alt-TAB and hit Space to start and pause mplayer.
Expert in software patents or patent law? Contribute to the ESP wiki!
"I am beginning to do some interviews with family members and will do some audio journals for genealogy purposes. I would really love to be able to run the resulting MP3 or WAV files through some software a get a text file out. I know that software like this exists commercially. But does this exist in the open source world?"
I think you already have the software, and are testing it on this ask Slashdot question. Well played.
...you could always use RentACoder (er, Vworker.com now) and hire someone for pennies to do it.
Advice: on VPS providers
What are you a fucking hobo?
You can have a look here:
http://htk.eng.cam.ac.uk/
I've used it in the past. It's a bit hard to use, but the results are decent.
What you have to realize is that you will need to have _very_ clean recordings,
or else the recognition rate will suffer greatly.
To play an audio file at 60% normal speed:
mplayer -af scaletempo=scale=0.6 the_file.ogg
And then to check the transcript, change the 0.6 to 1.5 (or 2.0 for someone like Richard Stallman who speaks slowly and clearly).
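A minimal sketch of the two-pass routine above, assuming only the mplayer flags quoted in this comment. The helper name `mplayer_command` is made up for illustration; it just builds the argument list so you can switch between the slow typing pass and the fast checking pass.

```python
import shlex


def mplayer_command(path, scale):
    """Build the mplayer argument list for playing `path` at the given tempo scale.

    Uses the same flags as the command quoted in the comment; only the
    scale value changes. `mplayer_command` is a hypothetical helper name.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return ["mplayer", "-af", f"scaletempo=scale={scale}", path]


if __name__ == "__main__":
    # Slow pass for typing, fast pass for checking the transcript.
    for scale in (0.6, 1.5):
        print(shlex.join(mplayer_command("the_file.ogg", scale)))
```

To actually start playback, the list can be handed to `subprocess.run(...)`, provided mplayer is installed.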
I wish you luck in your quest as I'm also working on genealogy and would like to be able to do this as well. I'd be interested in hearing if you find something that works acceptably well for this purpose. In my experience (IBM Via Voice from OS/2 v.3 days to Dragon Naturally Speaking 10) the state of the art just isn't ready for general use. Even after training, I always got enough errors to discourage use. And I type relatively quickly, so it was just more effective for me to do it manually.
Little girls, like butterflies, need no excuse. -- L. Long
I would really love to be able to run the resulting MP3 or WAV files through some software a get a text file out. I know that software like this exists commercially.
No. Automated arbitrary speech recognition is an unsolved problem -- all voice recognition systems either require the speaker to make an effort to pronounce words clearly, or make so many mistakes that fixing them takes more effort than writing the text manually.
It will make more sense to write a transcription assistance software -- an equivalent to the tape player with a foot pedal commonly used for this purpose, except with capability to play and repeat short sequences of words or phrases, speed adjustment, etc.
Contrary to the popular belief, there indeed is no God.
Try XTrans from the Linguistic Data Consortium. It's GPL and specifically designed for doing speech transcription. Ask nicely for support, please; the main developer is quite busy.
Pay them a buck per page and they learn some family history along the way. Problem solved.
Google has been working on speech to text for years, and they've got Google Voice to where it transcribes your messages to text. Works great with the Android client, and they have a web page. But even with Google's experience and money, it's not very accurate. It might be better than most of what you'll find though, and it's free.
You could probably rig up Google Voice to where each thing you want to transcribe gets recorded as a "message" to you.
That said, here's a voicemail I got recently:
"Hey Jeff, Nate what you can still haven't been able talk to you in. X-rite is and see if you've been found. If off seems like just. I don't know if the E Z the phone software. This is not available 4. Slash number. I wanna malfunction or give us a call back to you now."
So it's not perfect... One funny thing is that my name isn't Jeff or Nate, and neither was the caller's.
-Taylor
Worldwide Military budgets: $2100 billion. Worldwide Space Exploration budgets: $38 billion. Really, world? Really?
I think the most important thing to keep in mind for a project like this is that you should do everything you can to ensure a high quality recording. Don't worry about transcription at this point - just focus on getting content. When algorithms (and computers) have improved in 5-10 years time you can do the transcription. It might even be useful to record the sessions with a video camera. Maybe speech recognition tech of the future will use lipreading in addition to the approaches that are used now.
If you are transcribing manually, you really want to consider using something with foot pedals, so you can control the playback with that instead of switching between typing and playback software all the time.
http://www.nch.com.au/scribe/pedals.html
Just make a call to your favorite terrorist-harboring nation, add in some carefully chosen phrases, and then file a FOIA request for them.
I've been a transcriptionist for over 5 years, and unless you want to have to retype most of it yourself anyway, don't offer pennies on a site like guru/vworker/elance. A decent transcriptionist is going to charge at least $45-50 per AUDIO hour (not hours it takes) if it's a good, clear recording & a single speaker. If there was a really great product out there, I'd be out of a job. If you want to do it on the cheap, get an inexpensive USB Infinity foot pedal (on ebay) as mentioned before & Express Scribe is a free download to playback & rewind the audio. Both are what I use. Good luck!
I just slice everything up into segments of 60 seconds and let Google Voice transcribe it for me. Sure, some nay-sayers might point out that it's slower than transcribing it all manually, but they don't get that I'm getting Google to do the work for me!
Nothing for 6-digit uids?
Where I work we have tons of recordings from engineering committees, and we tried lots of free and commercial programs, but at the end of the day, due to the vagueness of the English language, the best solution was to "hire a human". So that's what we did: we found a few people in India who were happy to transcribe our recordings for a fraction of the cost of hiring someone to fix the stuff-ups from the speech-to-text software (good speech-to-text software also costs a fortune and takes ages to train, especially when most of the engineers' sentences are full of acronyms). So save time, help those with less money, and hire someone; it's not like we have a global shortage of people.
Though there are interesting speech recognition products for other applications, for this task Dragon and IBM ViaVoice, both sold by ScanSoft, are pretty much the only software choices until someone qualified gets an NSF grant to beef up Sphinx.
I can second the recommendation of the LDC's XTrans if you're going to do this yourself.
If you want someone else to do it, there are a lot of podcasters who want transcripts, and a bunch of transcription services have sprung up to address the market. They've already implemented a lot of the quality-control mechanisms you'd have to address in order to get good results from something like the Mechanical Turk.
The Wall Street Journal ran a side-by-side comparison back in 2008 and recommended castingwords.com, but another provider may very well be better by now. Shop around.
Get it back in 3 hours
Transcriber is the tool that you are looking for. It plays the file and you type and annotate. It's in the Ubuntu repositories so I assume it's in Debian's as well.
Cory Doctorow talking about cloud computing makes as much sense as George W Bush talking about electrical engineering.
I hear Google has a great tool for this that they use for Google Voice...
Or... transcribed...
I'm here googoo, hi a grape too fur this that day fuse far google boys...
StarTrekPhase2 - The Five Year Mission Continues!
Coding Horror recently posted an article about the current voice recognition technology.
http://www.codinghorror.com/blog/2010/06/whatever-happened-to-voice-recognition.html
There is a poem which got transcribed, and the title became like this:
"a poem by Mike Bliss --> a poem by like myth"
The rest of the poem is equally funny. So basically you better transcribe it manually.
Your question was phrased wrong.
Just ask for what you mean: you want free software, not so much OSS. It's not like you're going to go editing and fixing bugs in the speech algorithm, so the openness here really is just a guise to get something for free.
You'll find plenty of no-cost ways to transcribe, but OSS options fall short.
Reality of it is, you'll save yourself a lot of effort if you just type it yourself. It'll be faster and far more accurate.
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Why don't you just use google voice to transcribe? Google Voice has a feature to transcribe your voice mails, not sure how long each message can be but maybe you can automate it somehow?
>>Sig under construction
I looked into automatic transcription software too. I think the consensus is that none of it works well unless it is trained, and trying to "train" software with regular recordings of conversations is not likely to work.
I wrote my own little application so that I could type the text in myself. It works with WinAmp, so it's tied to Windows (Sorry! Time constraints...). From my web page:
http://csclub.uwaterloo.ca/~jg3macka/GabbleFarb/index.html
What it is:
GabbleFarb is basically a glorified notepad application that works with WinAmp (a free audio and video player). A number of hotkey combinations exist to control WinAmp from inside GabbleFarb. As a transcriber, this allows you to easily pause, rewind, fast-forward and control volume levels without leaving the editor. Additionally, as a video or audio file is playing in WinAmp whenever the ENTER key is pressed GabbleFarb will begin the next line with a timestamp of the current playing time. Within the editor, you can then double-click on a line of text in your transcript and GabbleFarb will automatically tell WinAmp to start playback at that point in the file.
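The timestamp-per-line idea is easy to sketch on its own. The code below is not GabbleFarb's source and does not talk to WinAmp; it only illustrates, under assumed names (`format_timestamp`, `start_line`, `seek_position`) and an assumed `[HH:MM:SS]` format, how a line could be stamped with the current playing time and how that time could be read back when a line is double-clicked.

```python
import re

# Assumed stamp format for this sketch: [HH:MM:SS] at the start of a line.
_STAMP = re.compile(r"^\[(\d+):(\d{2}):(\d{2})\]")


def format_timestamp(seconds):
    """Format a playback position in seconds as [HH:MM:SS]."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def start_line(seconds, text=""):
    """Begin a new transcript line stamped with the current playing time."""
    return f"{format_timestamp(seconds)} {text}"


def seek_position(line):
    """Return the playback position (in seconds) stamped on a line, or None."""
    match = _STAMP.match(line)
    if match is None:
        return None
    hours, minutes, secs = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + secs
```

In a real tool, the value returned by `seek_position` would be passed to whatever player is in use as the position to resume from.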
The decision of automatic vs. manual depends on the accuracy you want. Automatic transcription can reach about 75% to 80%. The best way to use automatic transcription would be to train your PC's speech recognition, play the file through headphones, and speak it out loud yourself. Again, there is a lot of contextual information that a computer cannot transcribe accurately, so you'll have to manually edit these files if you want to take it to 100%.
You can also manually transcribe it yourself. If you type at around 80 wpm, an hour of audio will take around 4 hours to do. Have a look at NCH Express Scribe. It's free play/stop software that is almost the de facto standard in the transcription industry.
You can also use various transcription services which are out there. A professional transcription service will charge you around $1 to $2 per audio minute. Freelancers will charge around half of that. But then with freelancers you cannot guarantee the quality.
Shameless Plug: We provide a transcription service for $0.75 per minute of audio. http://callgraph.biz
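For a rough sense of the trade-off, here is a back-of-the-envelope sketch using two figures from this comment: about 4 hours of typing per hour of audio, and $0.75 per minute of audio for the service. The function names are made up for illustration, and both rates are just defaults you would replace with your own numbers.

```python
def service_cost(audio_minutes, rate_per_minute=0.75):
    """Cost in dollars of paying a per-audio-minute transcription rate."""
    return audio_minutes * rate_per_minute


def manual_hours(audio_minutes, typing_hours_per_audio_hour=4.0):
    """Estimated hours of typing needed to transcribe the audio yourself."""
    return audio_minutes / 60 * typing_hours_per_audio_hour


if __name__ == "__main__":
    minutes = 60  # one hour of interviews
    print(f"service: ${service_cost(minutes):.2f}")
    print(f"manual:  {manual_hours(minutes):.1f} hours of typing")
```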
Artificial Artificial Intelligence. IOW, Farm it out to piece workers on the net for pennies on the Amazon Turk project.
Informative?
Attention slashdotters, There is at least one retard on the loose. He may be calling himself and playing tapes into the phone. If you encounter him do not engage him as he is armed with modpoints and may use them erratically.
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation
It's been my job to work with speech recognition technology for the last 10 years. I've worked with speaker-independent grammar-based recognizers like Nuance Recognizer. I've worked with speaker-dependent training-based recognizers like Dragon Naturally Speaking. I've used open source recognizers like Sphinx. I've even dabbled with writing my own basic recognition engine. I can tell you with confidence: with the current state of commercial/open-source technology, you will not be able to get satisfactory results transcribing two speakers in the same recording. Accurate machine transcription requires training and single-speaker. I have heard people claim that speech recognition is a dead technology because it has stopped improving at appreciable speeds. While improvements have slowed down drastically, I do not believe speech recognition is dead by any means. We've really been making the same steady progress since the inception of speech recognition -- but previously we were riding the wave of geometric (sometimes exponential) growth in CPU clock rate. Now that the free lunch is gone, recognition algorithms need to be parallelized to once again ride improvements in CPU design.
record what they are saying into a voice mail on google voice...
King of kings and Lord of lords
The only good transcription software still runs on wetware.
Luckily, humans are cheap and easily available.
CastingWords is one of the cheapest ways to get humans to transcribe your content.
http://castingwords.com/
If you'd like to save a few bucks by cutting out the middleman, see an even cheaper way here:
http://waxy.org/2008/09/audio_transcription_with_mechanical_turk/
Blessed are the pessimists, for they have made backups.
post an ad on craigslist that you are paying $20/hour of recording to have it typed out. Pizza provided as well. ByoB. bet some college kid takes you up on it.
All of the above was encrypted with a Quad ROT-13 method. Unauthorized decryption is in violation of the DMCA.
Get the lazy bastards to type it out in the first place.
All that screwing around when you could just be typing it, I see doctors do it too...... blah..blah...blah... shut your trap.... here's a keyboard imbecile.
After reading the replies it appears that open source doesn't have anything worthwhile in this area too.
What the fuck does that bunch of 2nd rate fanbois have to offer?
When I did some medical transcription a couple of years ago it was up to me to do it myself, and I didn't find anything open source at the time.
So I loaded up Amarok, configured global hotkeys to pause and to jump forward and backward in the audio file in five-second steps, and then loaded up a word processor.
Sure, it's not automatic, but it helped me get the job done.
It took me 3 to 4 hours to transcribe each spoken hour of a group of strangers. When the subjects have familiar speech patterns or it's an individual I found progress was much faster.
Considering that Slashdot is now running ads for the BSA, I figure we need to go all buy overpriced crap.
Man, the other day the BSA -- they tried to kiss me, man. Then they turned around and tried to fuck me up the ass.
Why does everyone expect there to be a free software alternative for everything imaginable? Usually when you want something (other than digital goods) you have to actually pay money for it.
Here's someone who has already done it..
http://waxy.org/2008/09/audio_transcription_with_mechanical_turk/
Split up the audio into 5 min pieces.
Set up a template on Amazon Turk for 'workers' to grab the 5 min mp3 files, and pay them $2 for each file transcribed.
More info in the comments. http://www.audiobookcutter.com/ is capable of chopping up the file at the silences for you.
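If your recordings are WAV files, the splitting step can be sketched with Python's standard `wave` module. This is only an illustration: it cuts at fixed intervals rather than at silences, it does not handle MP3, and the function name `split_wav` and the output naming scheme are assumptions of this sketch.

```python
import wave


def split_wav(path, chunk_seconds=300, prefix="chunk"):
    """Split a WAV file into consecutive pieces of at most chunk_seconds each.

    Returns the list of files written, named <prefix>_000.wav, <prefix>_001.wav, ...
    Cuts fall at fixed intervals, so a word may be split across two pieces.
    """
    outputs = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the recording
            out_path = f"{prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channels, sample width and rate; the frame count
                # in the header is corrected when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            outputs.append(out_path)
            index += 1
    return outputs
```

The default of 300 seconds matches the 5 min pieces suggested above; a little overlap between pieces would help workers with words that land on a cut.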
I'm sure you roar get ding fan plastic results from goo gull boys, two eye find it variably hell full.
End of lesson. You may press the button.
As others have said, you can not get accurate speech recognition for multiple speakers. Even for the best of breed closed source software (Dragon) you also need to have good control over microphone quality and placement, and the technique in this instance is to shadow the speakers (put them on headphones and speak into the microphone). transcript.el will remove some of the pain points for transcribing for you if you're happy using emacs. It works out as cost/time effective - I reckon it takes transcription time from 5-8x the length of the recording to something like 2-3x the length, but at this point in time you're not going to find a satisfactory open source solution to machine transcription, either shadowed, or from live tapes.
Several years ago, I wrote a general-purpose media player in VB6. Would you like the source for that?
http://www.vsubhash.com/article.asp?id=15&info=Subhash_VCDPlayer#open_source_development
It has a special transcription mode.
http://www.vsubhash.com/article.asp?id=15&info=Subhash_VCDPlayer#transcription
Transcribe manually, using a transcription program like PRAAT.
Some years ago, I wrote a general media player in VB6. It is a shell around Windows Media Player. http://www.vsubhash.com/article.asp?id=15&info=Subhash_VCDPlayer#transcription As I was working as a transcriptionist at that time, I added a special transcription mode feature to the player. Thanks to hot keys, you will not need a foot pedal to control the playback. The player window also stays above all windows. On the site, I had offered to give the source code for OSS development but did not give it away as a download. I will go home today and upload the VB6 source code as a free download.
Suggest you look at pocketsphinx, it is a front end on Sphinx, and includes Sphinx. http://cmusphinx.sourceforge.net/2010/03/pocketsphinx-0-6-release/
It's not what you are asking for, but it sure will help you: Transana
open up vi, press i, (or a), and press play on the audio device.
Type out whatever you hear.
Problem solved. :wq
Seriously, transcribe it manually... automatic speech recognition just doesn't work.
That view is only ever held by people who have not used speech to text software in a very long time. Speech recognition software today works INCREDIBLY well. I would say that it was over 95% accurate (and I have a very cheap microphone). Here's a video of it in action http://www.youtube.com/watch?v=bsohqUgjqK0&feature=related It frustrates me that people think that speech recognition hasn't progressed since they last used it and therefore their opinion of it from when they used it years ago is still valid. Speech recognition works INCREDIBLY well, for me it's more accurate than typing.
You might want to read up on Jeff Atwood's post: http://www.codinghorror.com/blog/2010/06/whatever-happened-to-voice-recognition.html
I've Googled around a bit, but I only seem to find shallow babble. Could you point me in the right direction about the good stuff on NLP?
If there's anyone who should be concerned with creating an accurate text record of spoken words, it would be a court reporter. The ones I know tend to use a belt & suspenders approach; they keep a recorder running to capture the audio while they stenographically record the words as they hear them. The written transcript starts from the steno and gets proofread while listening to the tape. You would be surprised the number of places where what the reporter heard doesn't match what the proofreader hears on the tape. That being the case, if you were to run the audio through something like Dragon Naturally Speaking, you would still need to verify what is in the text.
A while back I was looking to get some dictation notes on a novel transcribed, and the best I found (after playing with Dragon Naturally Speaking and a few others) was to simply pay some broke college student a nominal fee per audio hour to transcribe them. It's not professional level, I'm sure, and I had to go back and do some paragraph formatting.
But the bottom line is that for getting the words reliably on the screen, the human brain is still the best solution out there.
I don't know of any good trained open-source speech recognisers. There are open-source back-ends like Sphinx or HTK (which I sort of work on) but you need massive transcribed training corpora to train a speech recogniser. This is expensive which I guess is why open-source speech recognition hasn't taken off. In the speech recognition group at my university, most people use Linux, and I don't think anyone actually uses a speech recogniser in their daily work.
eye find it variably hell full.
Indeed.
Really, I'm not out to destroy Microsoft. That will just be a completely unintentional side effect.--Linus Torvalds
He means Dragon NaturallySpeaking. It is claimed that it is "Up to 99% Accurate". "Up to" means "0% to".
Even if Dragon NaturallySpeaking is 99% accurate, that last 1% is a problem to correct. The software will never make a mistake in spelling. However, it will sometimes substitute similar words that change the meaning of what you intended to say, sometimes in subtle ways.
Dragon NaturallySpeaking has improved a lot since version 7. I don't know whether there were improvements in the recognition engine since version 8.
Sometimes Version 10 Standard is sold at Fry's with rebates that make the total cost $25. However, only the Preferred and more expensive versions allow you to dictate into a handheld recorder for later transcription.
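To put a number on claims like "99% accurate", one common approach is to count word-level edits (substitutions, insertions, deletions) between what was said and what the software produced. The sketch below is a generic word-level edit distance, not how Dragon or any vendor measures its own figure; the function names are made up for illustration.

```python
def word_errors(reference, hypothesis):
    """Return (errors, reference_word_count) using word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than the current row) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1], len(ref)


def word_accuracy(reference, hypothesis):
    """Fraction of reference words transcribed correctly, floored at zero."""
    errors, total = word_errors(reference, hypothesis)
    if total == 0:
        raise ValueError("reference transcript is empty")
    return max(0.0, 1 - errors / total)
```

For the mis-transcribed poem title quoted earlier in the thread, `word_errors("a poem by Mike Bliss", "a poem by like myth")` counts two wrong words out of five. Note that a per-word figure says nothing about how much the wrong words change the meaning, which is the point made above.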
Check out this application I wrote 10 years ago to transcribe some family history audio. It saved me many hours of time.
http://www.wiedenhof.nl/ul/dictplay.htm
Dragon is not open source. It is not even multi-platform.
What? Their technology is on multiple platforms and trivially confirmed with google in seconds with queries like: dragon speech mac
WINDOWS: http://www.nuance.com/naturallyspeaking/products/editions/default.asp
MAC: http://www.nuance.com/naturallyspeaking/products/macintosh/for-the-mac.asp
iPhone/iPad: time-limited note recording, but impressive accuracy : http://www.dragonmobileapps.com/
Phone via calling like, as a regular phone: http://jott.com/
1) Make a call to a random barber shop in Iran or Afghanistan.
2) Say "Al Qaeda", "terrorist", and "spy" very clearly.
3) Play the tapes of your family members' stories.
4) Get a copy of the transcripts from your lawyer at your espionage trial.
5) profit?
Tiller's Rule: Never use a word in written form that you've only heard and never read. You will end up looking foolish.
This might be cheap for actual human transcription: https://www.mturk.com/mturk/welcome
"I say we take off, nuke the site from orbit. It's the only way to be sure."
I believe it's an exact literal translation from the Chinese for the phrase, "If at first you don't succeed, try try again." frickin' Google Translator...
NCH Express Scribe is freely available to the public for commercial and residential use. I know many transcription companies and education systems use it. It works with most foot pedals as well. A good cheap foot pedal is the vPedal found at several online locations. Transcription software is pretty easy to write. I have written several transcription programs in the past with ease. For those of you who will ask, I do not work for NCH or any associated companies.
My Android phone transcribes search terms from voice input. I wonder if there's a way to use Google's voice transcription servers. No idea how, but it seems like Google has already done the hard work.
Dragon Naturally Speaking! Windows 7 has dictation built in, and the Mac does too, I believe. All transcription/dictation software will require training for anything close to 98% output.
I'm not sure about the OSS options, but Dragon is now pretty cheap, at just $80. You can't beat that with a stick for the quality you get. Win7 has it built in; try it out if you have a box with it. I was actually pleasantly surprised.
The fastest, cheapest option, however, would be to pay a college kid to transcribe it. You can do that for $8 per hour, and if you edit the waves, chopping silence, etc., you can get someone to pretty much break down an hour of speech per man-hour. Especially if you can rig up a way to speed up the waves a bit (people can talk fast, but usually don't talk as fast as a good transcriptionist can type). To pick up the words of someone mumbling, the transcriptionist will have to back up, listen, and try to work out the word. This will slow things down. If you can clarify those words, let them know (at 3:15 uncle jimmy says CHICKENFLUFFER, not CHICKENpuffer).
The thing is, by the time you get done with this article, reading everyone's solutions, etc... you will have been able to transcribe much of that data yourself. (depending on family size I guess...)
How much is your data worth? Back it up now.