Saturday, February 26, 2011

Podcast search - Large scale media retrieval

This is a follow-up to my last tech post on Speech to Text using Java. Here I'm exploring an idea I envisioned some time back (almost 4 years ago now) when I was a college student, and it is the reason that pushed me to look for mature speech recognition technologies I could use. By the way, this reminds me of some online lectures by Prof. Stephen Boyd (Stanford) on Convex Optimization that I was listening to the other day. He commented that calling something a "technology" might feel derogatory to some people, because it translates to using something as a black box without really understanding what is inside; his usual reply is that most of us use TCP/IP to get a secure channel without a deep understanding of what's going on inside. In my case, I had decided to use speech recognition roughly as a black box, but I ended up learning a bit of the underlying foundations anyway (language models, acoustic models, HMMs, etc.).

My goal in these experiments with speech recognition technologies was to find a useful way to do video content retrieval using noisy, automatically extracted transcriptions. With this I might have tried to beat YouTube on search (insert smiley face here); I even sketched the web search interface included in this post. But I was aware that doing this at web scale would require more computing power than I could get my hands on, especially back in 2007 when I was working on this: Amazon EC2 was in its early stages, the Windows Azure Platform was non-existent, and even Google App Engine had not yet been released. I had to seriously narrow down my objective and build the kind of partially working prototype that people who want media attention usually justify as a "proof of concept"; in my case, this proof of concept would be my way out of college.

At the time there was still a lot of buzz about podcasts on the web, which are essentially audio versions of blogs, so some companies were developing search solutions that could potentially look (or in this case listen) inside the contents of a large collection of audio documents. Here is a timeline of events related to podcast search and the use of speech recognition for large-scale media search:

December 2004: Blinkx launches as the first audio search engine powered by speech-to-text technologies http://www2.prnewswire.com/cgi-bin/stories.pl?ACCT=LRTVN.story&STORY=/www/story/12-16-2004/0002636303&EDATE=THU+Dec+16+2004,+08:02+AM

April 2005: Podscope launches as the first audio search engine powered by speech-to-text technologies. (Hey! Wasn't Blinkx the first one?) http://en.wikipedia.org/wiki/Podscope

October 2005: Yahoo! Podcasts launches, although I haven't found whether it used speech-to-text.

January 2006: Podzinger launches, featuring a US-government-funded speech recognizer.

July 2006: AOL launches Podcast Search (powered by Podscope); this was a major internet company launching a service like this! I haven't found any information about when the deal ended, but AOL Podcast Search no longer seems to be available.

October 2006: Microsoft gets some help from Blinkx technologies for its video search. http://en.wikipedia.org/wiki/Blinkx

June 2007: Pluggd starts as a podcast search solution using speech-to-text technologies. http://mashable.com/2007/06/29/pluggd-launches-audio-search-player-on-cnet/

October 2007: Yahoo! closes Yahoo! Podcasts

July 2008: Google launches a test version of its speech-recognition-based video retrieval system. http://googleblog.blogspot.com/2008/07/in-their-own-words-political-videos.html

February 2009: TED adds captions and translated captions to its videos by using the power of the crowd http://blog.ted.com/2009/02/09/unveiling_teds/

November 2009: Google launches automatic captions for YouTube videos.

This list is not exhaustive, but I think it's enough to draw some conclusions. I included TED's user-generated captions from 2009 because I used some TED videos in my project to generate automatic transcripts, so that TED's video content could be searched. Since TED is all about interesting ideas, I thought it was a pity that you could not search it by content. But if I had put my idea/prototype online in 2007, it would have been useless by 2009, when automatic speech recognition was no longer necessary because most TED videos already had perfect transcriptions in several languages. Another thing I can see from 2007 up until now is that two big companies (Yahoo and AOL) decided to shut down their podcast search services. I was actually surprised that Google was not using speech recognition for audio search back then; they took their time and have since incorporated it into their existing products (YouTube, Google Voice). Podscope hasn't changed much since 2007, and Podzinger was rebranded twice (Everyzing, RAMP). Things move relatively slowly in this area, mostly because building automatic speech recognition software that is speaker-independent and handles a large vocabulary is very expensive, so only big companies, or companies already owning rights to speech recognition software, can compete. We have yet to see the final take on search based on audio content; I think the best example today is the automatic caption generation on YouTube videos.

Wednesday, February 23, 2011

Washington DC, Philadelphia and Baltimore

At the beginning of this year I was part of a two-day trip through five states on the East Coast of the United States: New York, New Jersey, Pennsylvania, Delaware, and Maryland. Following the same procedure as on my Niagara trip, we boarded a shuttle in Chinatown. Here are some pictures we took along the way:

US Capitol in Washington D.C.

Independence Hall - Philadelphia, Pennsylvania


Baltimore Inner Harbor, Maryland

Potomac River - Washington D.C.

Sunday, February 20, 2011

Speech to Text using Java

Is there any utility out there, preferably command line, that takes an audio file as input and outputs a text file? Moreover, is there such a thing packaged as a library that developers can use? And, yet another technical requirement: is there something like it written in Java or another high-level language? And as if that were not enough, I want it for free. More than three years ago I was looking for such a solution and could not find anything that worked out of the box. If you search for something like this you will probably find the Sphinx-4 project, a speech recognizer written entirely in Java. I spent a lot of time trying to understand the underlying basis of speech recognition and how it works, the roles of language models, vocabularies, acoustic models, etc., and not a few hours trying to make the whole thing work. I succeeded in making it recognize isolated digits, but what I wanted was general speech recognition that could deal with continuous speech, a large vocabulary, and complex grammar models. I tried making it work for this scenario but was unsuccessful: with some of the larger-vocabulary models the system was very slow, and the documentation and examples available on the web didn't help much. Without coming up with more excuses for my failed attempt, I decided to take a different approach.
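For the record, driving Sphinx-4 looked roughly like the sketch below, modeled on the project's Transcriber demo. I'm assuming a config.xml that defines components named "recognizer" and "audioFileDataSource"; that configuration file is where the acoustic model, dictionary, and language model get wired together, and it is exactly where most of my trouble was:

    import edu.cmu.sphinx.frontend.util.AudioFileDataSource;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    import java.net.URL;

    public class Transcribe {
        public static void main(String[] args) throws Exception {
            // config.xml wires together the acoustic model, dictionary,
            // language model and audio front end (the component names
            // here are assumptions and must match your configuration).
            ConfigurationManager cm = new ConfigurationManager(
                    Transcribe.class.getResource("config.xml"));
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();

            // Point the front end at the audio file, passed as a URL,
            // e.g. file:/path/to/audio.wav
            AudioFileDataSource dataSource =
                    (AudioFileDataSource) cm.lookup("audioFileDataSource");
            dataSource.setAudioFile(new URL(args[0]), null);

            // Each call to recognize() returns one utterance; null
            // means the end of the audio was reached.
            Result result;
            while ((result = recognizer.recognize()) != null) {
                System.out.println(result.getBestFinalResultNoFiller());
            }
            recognizer.deallocate();
        }
    }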

Surfing for other possible solutions, I found documentation for speech recognition engines from private companies that usually ship their products with a developer API, big companies like IBM or Nuance. They would usually implement an interface known as SAPI (Speech Application Programming Interface), developed by Microsoft to provide speech recognition capabilities to Windows applications. Along the same lines, there is a JSAPI specification for the Java programming language. Microsoft not only developed the SAPI specification but also ships its own speech recognizer with some versions of the Windows operating system. So I downloaded the SAPI Software Development Kit and wrote a simple command line utility that reads a raw wav audio file and outputs a text file with the transcription of what was said. Results were not great, especially because the recognition engine is not very speaker-independent and some of the audio files I tried had noise or music in the background.

This hacking activity went beyond turning one of their code examples into a command line utility: I also wrote a JNI (Java Native Interface) wrapper to use the recognition engine from Java, although I have to stress that I did it more as a practice exercise, since it is still limited to reading from a wav file. Of course, this will only work on Windows, but portability is something you'll have to give up this time if you want all the things I mentioned at the beginning of this post. I'm including a link here with the command line utility (with source code), the Java interface, and an example of using the interface from Java.

Download: wav2text.zip
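To give an idea of what the Java side of the interface looks like, here is a hypothetical sketch of the wrapper (class and method names are illustrative, not necessarily the ones in wav2text.zip). The heavy lifting happens in the native method, which is implemented in C++ against SAPI:

    // Hypothetical sketch of the JNI wrapper around the SAPI engine;
    // names are illustrative, not necessarily those in the download.
    public class Wav2Text {
        static {
            // Loads the native library (e.g. wav2text.dll) that
            // bridges Java to the SAPI recognition engine.
            System.loadLibrary("wav2text");
        }

        // Implemented natively in C++ against SAPI: takes the path of
        // a raw PCM wav file and returns the recognized text.
        public native String transcribe(String wavPath);

        public static void main(String[] args) {
            System.out.println(new Wav2Text().transcribe(args[0]));
        }
    }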

One more limitation of these tools is that the wav file must be raw PCM audio at 22 kHz, 16 bits per sample, stereo. To get files into that shape I recommend a nice command line utility: SoX, an open source library and command line tool that can change any of those parameters of a wav audio file. It would also be great to be able to input mp3 files or even video files. The problem with mp3 is that there are not many out-of-the-box conversion solutions due to patents: you have to download and compile LAME and integrate that encoder with SoX in order to convert between raw PCM wav files and mp3-encoded files. For video files the best solution is to use mplayer from the command line: mplayer -vo null -ao pcm:file=%FILE_PATH% %VIDEO_PATH% will extract the audio from the video file.
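If you'd rather script the conversion step from Java than run SoX by hand, something like the following sketch works, assuming sox is installed and on the PATH (the class name and error handling are mine):

    import java.io.IOException;

    // Minimal sketch: shells out to SoX to normalize any wav file to
    // the 22.05 kHz, 16 bits per sample, stereo PCM format that the
    // recognizer expects. Assumes sox is installed and on the PATH.
    public class WavNormalizer {
        public static void normalize(String in, String out)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                    "sox", in, "-r", "22050", "-b", "16", "-c", "2", out);
            pb.redirectErrorStream(true);
            Process p = pb.start();
            // Drain output so sox never blocks on a full pipe buffer.
            while (p.getInputStream().read() != -1) { }
            if (p.waitFor() != 0) {
                throw new IOException("sox failed on " + in);
            }
        }
    }

For example, WavNormalizer.normalize("podcast.wav", "podcast-22k.wav") would produce a file ready to feed to the recognizer.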

Getting to hack with audio and video formats and speech recognition development was an interesting experience. It also exposed me to other higher-level technologies like VoiceXML, and later I had the chance to meet one of the developers of Firevox, a Firefox extension that focuses on accessibility. Coincidentally, a research project here at Stony Brook also works on accessibility using voice technologies: HearSay.
This book about the history of vocoders is titled "How to wreck a nice beach", a phrase commonly used as an example of the difficulty of speech recognition, because it sounds like "how to recognize speech".