Vincent's Blog: Speech to Text using Java

Is there a utility out there, preferably a command-line one, that takes an audio file as input and outputs a text file? Is there such a thing packaged as a library that developers can use? As a further technical requirement, is there something like it written in Java or another high-level language? And, as if that were not enough, I want it for free. More than three years ago I was looking for such a solution and could not find anything that worked out of the box. If you search for something like this you will probably find the Sphinx-4 project, a speech recognizer written entirely in Java. I spent a lot of time trying to understand the basics of speech recognition and how it works (the roles of language models, vocabularies, acoustic models, etc.) and quite a few hours trying to make the whole thing work. I succeeded in making it recognize isolated digits, but what I wanted was general speech recognition that could deal with continuous speech, a large vocabulary and complex grammar models. I tried to make it work for that scenario without success: with some of the larger-vocabulary models the system was very slow, and the documentation and the examples available on the web did not help me much. Rather than come up with more excuses for my failed attempt, I decided to take a different approach.

While looking for other possible solutions I found documentation for speech recognition engines from commercial vendors, big companies like IBM or Nuance, which usually ship their products with a developer API. They usually implement an interface known as SAPI (Speech Application Programming Interface), developed by Microsoft to provide speech recognition capabilities to Windows applications. Along the same lines there is a JSAPI specification for the Java programming language. Microsoft not only developed the SAPI specification but also includes its own speech recognizer with some versions of the Windows operating system. So I downloaded the SAPI Software Development Kit and wrote a simple command-line utility that reads a raw PCM WAV audio file and outputs a text file with the transcription of what was said in it. Results were not great, mainly because the recognition engine is not very speaker independent and some of the audio files I tried had noise or music in the background.

This hacking went beyond modifying one of the SDK code examples into this command-line utility: I also wrote a JNI (Java Native Interface) wrapper to use the recognition engine from Java, although I have to stress that I did it mostly as an exercise, because it is still limited to reading from a WAV file. Of course this only works on Windows, but portability is something you will have to give up this time if you want all the things I mentioned at the beginning of this post. The link below contains the command-line utility with its source code, the Java interface, and one example of using the interface from Java.
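
To give a rough idea of what the Java side of a JNI wrapper like this can look like, here is a minimal sketch. The class, method and library names (Wav2TextSketch, transcribe, "wav2text") are made up for illustration and are not taken from wav2text.zip; the native method would have to be implemented in C++ against SAPI and compiled into a DLL.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical Java-side binding; all names here are illustrative only. */
public class Wav2TextSketch {

    /** Callback that the native code would invoke for each recognized phrase. */
    public interface PhraseListener {
        void onPhrase(String text);
    }

    /** Declared in Java, implemented in native code (assumed: C++ calling SAPI). */
    public native void transcribe(String wavPath, PhraseListener listener);

    /** Loads the assumed native library (wav2text.dll on Windows). */
    public static void loadNativeLibrary() {
        System.loadLibrary("wav2text");
    }

    /** Pure-Java listener that joins the recognized phrases into one transcript. */
    public static class TranscriptCollector implements PhraseListener {
        private final List<String> phrases = new ArrayList<>();

        @Override
        public void onPhrase(String text) {
            phrases.add(text);
        }

        public String transcript() {
            return String.join(" ", phrases);
        }
    }

    public static void main(String[] args) {
        loadNativeLibrary(); // fails with UnsatisfiedLinkError if the DLL is missing
        TranscriptCollector collector = new TranscriptCollector();
        new Wav2TextSketch().transcribe(args[0], collector);
        System.out.println(collector.transcript());
    }
}
```

The listener keeps the native code simple: it only has to call back into Java once per recognized phrase, and everything else (collecting, printing, writing to a file) stays on the Java side.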

Download: wav2text.zip

One limitation of these tools is that the WAV file must be raw PCM audio with a 22 kHz sampling rate, 16 bits per sample and stereo sound. To get a file into that format I recommend SoX, an open source library and command-line tool that can change any of those parameters of a WAV file. It would also be great to be able to input mp3 files or even video files. The problem with mp3 is that, due to patents, there are not many out-of-the-box solutions for conversion. You will have to download and compile LAME and integrate this encoder with SoX in order to get SoX to convert raw PCM WAV files to mp3-encoded files. For video files the best solution I found is to use mplayer from the command line: mplayer -vo null -ao pcm:file=%FILE_PATH%. This will extract the audio from the video file.
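
The format requirement can be checked from Java before a file is handed to the recognizer. The following is a minimal sketch using the standard javax.sound.sampled package. It assumes that "22 kHz" means 22,050 Hz, and the class name WavFormatCheck is made up for this example.

```java
import java.io.File;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

public class WavFormatCheck {
    // Assumption: "22 kHz" in the post means the common 22,050 Hz rate.
    static final float SAMPLE_RATE_HZ = 22050f;
    static final int BITS_PER_SAMPLE = 16;
    static final int CHANNELS = 2; // stereo

    /** True if the format is signed PCM, 22,050 Hz, 16 bit, stereo. */
    public static boolean isSupported(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && Math.abs(format.getSampleRate() - SAMPLE_RATE_HZ) < 0.5f
                && format.getSampleSizeInBits() == BITS_PER_SAMPLE
                && format.getChannels() == CHANNELS;
    }

    public static void main(String[] args) throws Exception {
        // Reads only the header information needed to describe the audio format.
        try (AudioInputStream in = AudioSystem.getAudioInputStream(new File(args[0]))) {
            AudioFormat format = in.getFormat();
            System.out.println(format + (isSupported(format)
                    ? " : OK"
                    : " : needs conversion first, for example with SoX"));
        }
    }
}
```

Run it as java WavFormatCheck some.wav; a file that fails the check has to be converted before transcription.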

Hacking with audio and video formats and speech recognition was an interesting experience. It also exposed me to other higher-level technologies like VoiceXML, and later I had the chance to meet one of the developers of Firevox, a Firefox extension that focuses on accessibility. Coincidentally, a research project here at Stony Brook also works on accessibility using voice technologies: HearSay.

This book about the history of vocoders is titled "How to wreck a nice beach", a phrase commonly used as an example of the difficulty of speech recognition, because it sounds like "How to recognize speech".

6 comments:

Anonymous, June 20, 2011 at 11:35 AM
Did you also extract timestamps along with the text from the WAV file?
Let me know if you could. When I tried to get timestamps from SAPI last year (Dec 2010), I found a critical bug in SAPI: the timestamps it returned were gibberish and totally useless. A Microsoft engineer admitted at the time that there was a bug.
Vicente Ordonez, June 20, 2011 at 10:01 PM
Hi Thomas,

I did extract timestamps and used them in an audio retrieval demo application: when the user searched for a word, I showed the portion of the text where the word appeared together with its start and end timestamps. I wrote a JNI interface so that when some audio is transcribed, a callback function also returns the timestamp in Java. It used to work, although I didn't verify the confidence of the timestamp. But I tried this in 2008 on Windows Vista Home Edition. I think I also tried it on XP, but I don't remember if I needed to download something extra.
t_roher, July 7, 2011 at 12:13 PM
If you know the bit rate of the WAV files (you could probably find a way to force a specific bit rate), you can extract timestamps by getting the byte position of each recognition and dividing it by the bit rate (with some unit conversions for kbps and bytes, of course). I did this in C++, so you might not have access to the same methods in Java.
Vicente Ordonez, July 7, 2011 at 3:47 PM
Hi.

That sounds reasonable too. What I did was write a JNI interface so that I can access the functionality of SAPI from Java; I wrote C++ code to extract the timestamps directly from whatever the API was giving me. You could also invoke a listener from the C++ code with timestamps calculated in the way you describe and report them back to your Java program.
Raveesh, August 23, 2011 at 6:23 AM
Hi vincent, Primarily thanks for the zip. could you share ur jni header file as well.. i cant find it in the zip.
Raveesh, August 23, 2011 at 6:52 AM
sorry my bad.. got the file thanks

Sunday, February 20, 2011
