This is a follow-up to my last tech post on Speech to Text using Java. Here I'm exploring an idea I envisioned some time back (almost 4 years ago now) when I was a college student. That idea is what pushed me to look for mature speech recognition technologies that I could use. By the way, this reminds me of some online lectures on Convex Optimization by Prof. Stephen Boyd (Stanford) that I was listening to the other day. He commented that calling something a "technology" might sound derogatory to some people, because it implies using something as a black box without really understanding what is inside. He said he usually replies that most of us use TCP/IP to get a reliable channel without a deep understanding of what's going on inside. In my case I had decided to use speech recognition roughly as a black box, but I ended up learning a bit about the underlying foundations anyway (language models, acoustic models, HMMs, etc.).
My goal in experimenting with speech recognition technologies was to find a useful way to do video content retrieval using noisy, automatically extracted transcriptions. With this I might have tried to beat YouTube on search at the time (insert smiley face here); I even sketched the web search interface included in this post. But I was aware that doing this at web scale would require more computing power than I could possibly have available, especially back in 2007, when Amazon EC2 was in its early stages, the Windows Azure Platform did not exist and even Google App Engine had not yet been released. I had to seriously narrow down my objective and build the kind of thing that people who want media attention for a partially working prototype usually justify as a "proof of concept", and in my case I would use this proof of concept as my way out of college.
At the time there was still a lot of buzz about podcasts on the web, which are the audio versions of blogs. So some companies were developing search solutions that could potentially look (or in this case listen) inside the contents of a large collection of audio documents. I'm including here a timeline of events related to podcast search and the use of speech recognition for large media search:
December 2004: Blinkx launches as the first audio search engine powered by speech-to-text technologies http://www2.prnewswire.com/cgi-bin/stories.pl?ACCT=LRTVN.story&STORY=/www/story/12-16-2004/0002636303&EDATE=THU+Dec+16+2004,+08:02+AM
April 2005: Podscope launches as the first audio search engine powered by speech-to-text technologies. (Hey! Wasn't Blinkx the first one?) http://en.wikipedia.org/wiki/Podscope
October 2005: Yahoo! Podcasts launches, although I haven't found out whether it used speech-to-text
January 2006: Podzinger launches featuring a U.S. government-funded speech recognizer
July 2006: AOL launches Podcast Search (powered by Podscope). This was a major internet company launching a service like this! I haven't found any information about when the deal ended, but AOL Podcast Search doesn't seem to be available anymore.
October 2006: Microsoft uses some help from Blinkx technologies for their video search. http://en.wikipedia.org/wiki/Blinkx
June 2007: Pluggd starts as a podcast search solution using speech-to-text technologies. http://mashable.com/2007/06/29/pluggd-launches-audio-search-player-on-cnet/
October 2007: Yahoo! closes Yahoo! Podcasts
July 2008: Google launches a test version of its speech-recognition-based video retrieval system http://googleblog.blogspot.com/2008/07/in-their-own-words-political-videos.html
February 2009: TED adds captions and translated captions to its videos by using the power of the crowd http://blog.ted.com/2009/02/09/unveiling_teds/
November 2009: Google launches automatic captions for YouTube videos
This list is not exhaustive, but I think it's enough to draw some conclusions. I included the fact that TED added user-generated captions in 2009 because I used some TED videos in my project: I generated automatic transcripts so that TED video content could be searched. I thought that since TED is all about interesting ideas, it was a pity that you could not search its videos by content. If I had put my idea/prototype online in 2007 it would have been useless by 2009, since there is no need for automatic speech recognition when you already have perfect transcriptions in several languages for most TED videos. Another thing I can see looking from 2007 up until now is that two big companies (Yahoo! and AOL) decided to shut down their podcast search services. I was actually surprised that Google was not using speech recognition for audio search back then; they took their time and have since incorporated it into their existing products (YouTube, Google Voice). Podscope hasn't changed much since 2007, and Podzinger was rebranded twice (Everyzing, RAMP). Things move relatively slowly in this area, mostly because building automatic speech recognition software that is speaker-independent and handles a large vocabulary is very expensive, so only big companies or companies that already own rights to speech recognition software can compete. I think we have yet to see the final take on search based on audio content; the best example today is the automatic caption generation on YouTube videos.