Sunday, February 20, 2011

Speech to Text using Java

Is there any utility out there, preferably a command-line one, that takes an audio file as input and outputs a text file? Moreover, is there such a thing packaged as a library that developers can use? And as yet another technical requirement, is there something like it written in Java or another high-level language? And as if that were not enough, I want it for free. More than three years ago I was looking for such a solution and could not find anything that works out of the box. If you search for something like this you will probably find the sphinx-4 project, a speech recognizer written entirely in Java. I spent a lot of time trying to understand the underlying basis of speech recognition and how it works, the roles of language models, vocabularies, acoustic models, etc., and quite a few hours trying to make the whole thing work. I managed to make it recognize isolated digits, but what I wanted was general speech recognition that could deal with continuous speech, a large vocabulary and complex grammar models. I tried to make it work for this scenario but was unsuccessful: with some of the larger-vocabulary models the system was very slow, and I couldn't get much help from the documentation and the examples available on the web. Without coming up with more excuses for my failed attempt, I decided to take a different approach.

Surfing for other possible solutions, I found documentation for speech recognition engines from private companies, big ones like IBM or Nuance, that usually ship their products with a developer's API. They usually implement an interface known as SAPI (Speech Application Programming Interface), developed by Microsoft to provide speech recognition capabilities to Windows applications. Along the same lines there is the JSAPI specification for the Java programming language. Microsoft not only developed the SAPI specification but also includes its own speech recognizer with some versions of the Windows operating system. So I downloaded the SAPI Software Development Kit and wrote a simple command-line utility that reads a PCM wav audio file and outputs a text file with the transcription of what was said in it. The results were not great, mainly because the recognition engine is not very speaker independent and some of the audio files I tried had noise or music in the background.

This hacking activity went beyond modifying one of their code examples into this command-line utility: I also wrote a JNI (Java Native Interface) wrapper to use the recognition engine from Java, although I have to stress that I did it more as a practice exercise, because it is still limited to reading from a wav file. Of course this will only work on Windows, but portability is something you will have to give up this time if you want all the things I mentioned at the beginning of this post. I'm including a link here to the command-line utility with its source code, the Java interface, and an example of using the interface from Java.
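
To give an idea of what the Java side of such a JNI wrapper looks like, here is a minimal sketch. It is not the code from the download: the class name SapiRecognizer, the method recognizeFile and the library name sapi_bridge are all made up for illustration, and the native half (the C++ code that talks to SAPI) is not shown.

```java
class SapiRecognizer {
    // Hypothetical library name; on Windows, System.loadLibrary would
    // look for a file called sapi_bridge.dll on java.library.path.
    static final String LIBRARY = "sapi_bridge";

    // Declared in Java, implemented in native code. The signature is an
    // assumption: wav file path in, recognized text out.
    native String recognizeFile(String wavPath);

    // Loads the native library and reports failure instead of crashing,
    // so the caller can print a useful message.
    static boolean tryLoad(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java SapiRecognizer <file.wav>");
            return;
        }
        if (!tryLoad(LIBRARY)) {
            System.err.println("native library '" + LIBRARY + "' not found");
            return;
        }
        System.out.println(new SapiRecognizer().recognizeFile(args[0]));
    }
}
```

The library is loaded from a helper method rather than a static initializer so that a missing DLL shows up as a readable message and not as a class initialization error.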


One limitation of these tools is that the wav file must be a PCM audio file with a 22 kHz sampling rate, 16 bits per sample and stereo sound. To get a file into that shape I recommend SoX, an open source library and command-line tool that can change any of those parameters of a wav file. It would also be great if you could input mp3 files or even video files. The problem with mp3 is that, due to patents, there are not many conversion solutions that work out of the box: you will have to download and compile an external mp3 library and integrate it with SoX. SoX uses LAME for encoding wav files to mp3; decoding mp3 files to wav, which is the direction needed here, goes through libmad. For video files the best solution is to use mplayer from the command line: mplayer -vo null -ao pcm:file=%FILE_PATH%, followed by the path of the video file. This will extract the audio from the video file into %FILE_PATH%.
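
The format check and the SoX call can be scripted from Java. The following is a minimal sketch under a few assumptions: that "22 kHz" means the standard 22050 Hz rate, that a reasonably recent sox binary is on the PATH, and that the class and method names (WavPrep, isExpectedFormat, soxCommand) are just illustrative. It uses the javax.sound.sampled API from the JDK to read the wav header and only calls SoX when the file does not already match.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import java.io.File;
import java.util.Arrays;
import java.util.List;

class WavPrep {
    // Assumption: "22 kHz" in the text means 22050 Hz.
    static final int RATE_HZ = 22050;

    // True if the format is signed PCM, 22050 Hz, 16 bit, stereo.
    static boolean isExpectedFormat(AudioFormat f) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
                && f.getSampleRate() == (float) RATE_HZ
                && f.getSampleSizeInBits() == 16
                && f.getChannels() == 2;
    }

    // Builds: sox <in> -r 22050 -b 16 -c 2 <out>
    // (-r sample rate, -b bits per sample, -c channels of the output file)
    static List<String> soxCommand(String in, String out) {
        return Arrays.asList("sox", in,
                "-r", String.valueOf(RATE_HZ), "-b", "16", "-c", "2", out);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java WavPrep <in.wav> <out.wav>");
            return;
        }
        AudioFormat f = AudioSystem.getAudioFileFormat(new File(args[0])).getFormat();
        if (isExpectedFormat(f)) {
            System.out.println("already in the expected format: " + f);
            return;
        }
        // Run SoX and forward its output to this process's console.
        Process p = new ProcessBuilder(soxCommand(args[0], args[1])).inheritIO().start();
        System.out.println("sox exited with code " + p.waitFor());
    }
}
```

Run it as java WavPrep input.wav output.wav; the converted file is what you would then feed to the recognizer.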

Hacking with audio and video formats and speech recognition was an interesting experience. It also exposed me to higher-level technologies like VoiceXML, and later I had the chance to meet one of the developers of Firevox, a Firefox extension that focuses on accessibility. Coincidentally, a research project here at Stony Brook also works on accessibility using voice technologies: HearSay.

This book about the history of vocoders is titled "How to wreck a nice beach", a phrase commonly used as an example of the difficulty of speech recognition, because it sounds like "How to recognize speech".