Vincent's Blog: 2011

Saturday, November 5, 2011

Colorado Springs and Kinect

I have again neglected my blog for some time due to deadlines and other side work. I just realized I never found the time to write about my visit to Colorado Springs for CVPR 2011 during the summer. Actually, there is not much for me to say about Colorado Springs since I hardly visited any places beyond the hotel, so I will mainly talk about one particular paper presented at this conference.

CVPR 2011 might perhaps be remembered as the conference where the research paper Real-time Human Pose Recognition in Parts from Single Depth Images (the Kinect paper!) by Microsoft Research was presented. (OK, this is obviously an overstatement; the quality of research at this conference is really high.) Microsoft Kinect is a product whose impact goes beyond gaming. It is an iconic example of Computer Vision that works and is readily available to the world.

Wearing a cap and being Kinect captured

Kinect has inspired hackers (http://kinecthacks.net/), artists (http://artandcode.com/3d/) and general technology enthusiasts since its introduction some time ago. It has also inspired researchers to create new algorithms that clean the data captured by the Kinect sensors and make the most of it, or just to play with it (here http://acberg.com/kinect/ is some Kinect hacking in Matlab by Alex Berg, one of my vision professors at Stony Brook). The picture I included in this blog post was, ironically, captured in our fancy Motion Capture Lab using the inexpensive capturing device from Microsoft. People here have also been working on Kinect with applications to music performance and motion identification. (More links to be added later...). Update: Interactive Music using Kinect: http://tamaraberg.com/papers/kinect_music.pdf

Although I didn't stay for the whole week of the conference, I also presented a paper at CVPR 2011 about automatically estimating photo quality and user engagement for photographs, titled High Level Describable Attributes for Predicting Aesthetics and Interestingness. Our goal was to use Computer Vision to recognize which kinds of photographs users think are cool without explicitly having to ask them what is cool. (Note: users might not even realize which individual things make them judge something as cool or interesting.)

Saturday, September 10, 2011

Time for Mobile!

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Update 10/05/2011: Steve Jobs just passed away. My deep admiration for his work and legacy will always be alive. People like him have made working in Silicon Valley a dream for people like me.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Some years back there was a time when people thought it was time for mobile, but it wasn't. It was around 2002, when developers were being lured into creating WAP applications for the new generation of more capable phones. Java had already gained popularity and it was finally serving its purpose of being a language designed to run on top of anything. J2ME applications were a hot topic, at least among developers. But I was always more attracted to web application development and so deliberately chose not to go into the mobile arena. I never coded any J2ME or WAP application or anything that runs on a cellphone, even though most of my peers would say "wow!" at such developments. Of course this is my personal opinion and you should understand I'm biased on this topic, but I know some people will agree, and my own experience gives me good reasons to think that the time for mobile is now and that it wasn't quite ready before.

When I say time for mobile I mean time for developers to create applications, really start making a profit, and embrace capabilities that go beyond the typical desktop application. Technology has made a lot of progress, of course. Now a smartphone can run things on the GPU (Graphics Processing Unit) that go beyond drawing on the screen. You can take full advantage not only of the web but also of other multimodal sources of information like GPS and various sensors: proximity, orientation, etc. But the more important changes are the less technical ones: now you have a clearer model for distributing applications massively (Apple App Store, Android Market), and devices are more centered on third-party applications than before. More importantly, you have a clearer model for monetizing applications! You can create the next big-hit $1 app or you can go big with advertisements. You have been hearing in the news for some time about kids making $1 applications and going big, and you might think I'm writing this way too late. I think this is partially true, but while one-hit wonders happened in previous years, their success was in a lot of cases hard to predict. Today you can start a more principled entrepreneurial project and gain some reasonable success even if you don't happen to become a big hit.

The establishment of Social has powered the establishment of Mobile, and also the other way around. Social networking applications bring people even closer when they incorporate mobile. For instance, Foursquare lets you see which of your friends are around you at a given time by using your mobile device's location and the location of your friends (see my previous post mentioning Foursquare). You can engage users at a more personal level using mobile.

Finally, there are a lot of other things I would like to write about mobile, for instance Computer Vision for mobile, since I work in Computer Vision and I spent last summer working with images in the Multimedia Content Analysis team at Android. I gladly noticed how willing people from everywhere inside Google, especially research teams, were to contribute to Android, and what cool applications third-party developers have built that take advantage of Computer Vision, both for Android and iPhone. One thing I will not write about is the controversy over which one is better, or their legal problems over patents and the like. I really admire Steve Jobs and I think that without his vision and the iPhone coming onto the mobile scene things wouldn't have moved so fast; taking the idea of multi-touch to mainstream usage in a product for the first time was a big hit. But the technological advances have been the sum of the knowledge of so many people, and I think the world is a more colorful place with competition that fits the needs of several groups of users.

Wednesday, July 27, 2011

Learning Karate by Waxing Cars

I will introduce here an article by Peter Norvig that I recently read, which involves a discussion about Artificial Intelligence, Statistical Models and Machine Learning. Before posting the link I will introduce the discussion with an example:

When I was a child I learned how to write proper Spanish by following what I will refer to in this article as the karate-kid or Mr.-Miyagi approach. I present here two approaches to learning and improving your basic writing skills.
1. Take a class on grammatical and orthographic rules.
2. Read lots of grammatically and orthographically correct text, not bothering about rules.

By using the first approach you can get a sense of how to construct correct sentences early on, but the effects depend heavily on having the rules hard-coded in your brain. Practicing those rules with lots of examples can certainly reinforce them.

The second approach does not involve learning rules at all, just reading lots of text data, let's say books. After reading lots of correctly structured sentences and words you can develop a sense of what a correctly structured sentence or word feels like without being conscious of the rules. In other words, in some cases you will be using rules almost without thinking about them, thanks to the amazing ability of our brains to find patterns. This second approach is the data-driven approach or, as I prefer to call it, the karate-kid approach, because I suspect this is the path Mr. Miyagi would have chosen if he had to mentor a pupil on how to write properly.

Mr. Miyagi asking his pupil to wax cars over and over again.

The field of Artificial Intelligence used to follow the first approach: if you want a smart computer, then hard-code rules into it so that it behaves as desired. Hard-coding rules doesn't scale very well, so you might want to learn the rules from data or adapt the rules over time. But ultimately people realized that you might not really need to care about rules at all, as long as you just care about the system behaving as desired. This is the topic of discussion in the article by Peter Norvig, written in response to remarks by Noam Chomsky in which Chomsky apparently derided machine learning researchers. You can read it at the following link; I highly recommend it: http://norvig.com/chomsky.html

Wednesday, July 6, 2011

San Francisco Bay Area - Summer 2011

Golden Gate Bridge insider's view

I'm writing this after a long gap since my last post due to the burden of moving all the way to the west coast and signing contracts for every imaginable thing that comes with relocation. I'm working on multimedia content analysis for Android during the summer. Probably the couple of friends who read my blog already know what Android is, but I will write here anyway the explanation I gave to my parents: Android is an operating system that can be installed on a variety of smartphones manufactured by different companies. It's pretty much what Windows used to be for PCs. Note: maybe you already noticed, but obviously this doesn't represent the views of my employer in any way! The thing is, I didn't feel like explaining that Android is open source, which means you can download, read and analyze its source code, and that it has a very open model for writing and publishing new applications, including games!

Yoda fountain outside of Lucasfilm

Regarding the San Francisco Bay Area, there are a couple of new places I visited this time, from the Santa Cruz Beach Boardwalk to the south up to Sausalito to the north. Sausalito is on the other side of the Golden Gate Bridge and I include here a picture taken while crossing it. Another iconic place I had wanted to visit for so long was the Yoda fountain outside of Lucasfilm. I felt like the security guard outside the building gave me the "oh, here comes another freak" kind of look :) and then kindly pointed out that we could park anywhere for 20 minutes, so I guess lots of Star Wars fans go there just to take a picture of the fountain. Next to the place are the Palace of Fine Arts Theater and the Exploratorium. So if you visit any of those places, you might want to take a walk and take a picture of the iconic Yoda fountain.

Tuesday, May 17, 2011

Mapping the world - from housingmaps to foursquare

This is the map generated using part of the controversial consolidated.db from my iPhone.

Yes, the above picture was generated using the much-talked-about consolidated.db file from my iPhone. I'm not planning to write a post about that controversy; a lot has already been said. I would rather use this map to start talking about web mapping applications, applications that leverage geo-data, and two entrepreneur stories that have amazed me.

When I found out about this iPhone-tracking-your-every-move controversy, as a good geek I tried to get the map of my recent movements and downloaded an application that some good developers had already put online. From the map it seems that I have yet to explore Nassau County and the boroughs of New York, especially Brooklyn and Queens, which are closer. I have mainly been visiting places in Manhattan and around Stony Brook in Suffolk County like East Setauket, Smithtown, Selden, Centereach, Lake Grove and Port Jefferson. The consolidated.db file also successfully recorded my visit to Fire Island on the south shore of Long Island and another recent visit to Coney Island in the south of Brooklyn. Looking at your data on maps is a nice experience because even though you might have been to a lot of places, it is hard to have a picture of how much you have explored until you see your data in an actual picture!

An interesting related note I found is that there's a group of researchers at the New York Times Company Research and Development Lab asking people to donate their iPhone consolidated.db data for the benefit of all [see openpaths.cc]. More interesting to me than the applications in transportation, epidemiology or land use that they suggest is the fact that The New York Times Company has a full research lab. I like the idea that research is so important these days, even for a media company best known for distributing one of the most popular newspapers.

Now I will talk about the first entrepreneurship story. Undoubtedly, two mapping applications that changed things on the web were Google Maps and Google Earth, both cited as milestones in the history of web mapping compiled on Wikipedia. They both started in the minds of very keen engineers and entrepreneurs, but there's another story on top of that. They allowed people to start mapping anything without the effort of installing their own geographical information system. But most people might not know that this was not the case in the early days of Google Maps, when there was no nice API. One research engineer, working in the field of Computer Graphics and trying to make computer-generated images more realistic, spent some of his research time on a side project later known as housingmaps.com, where he merged the information from craigslist (housing advertisements) with Google Maps. This application didn't go unnoticed; it became so popular that people often cite it as one of the reasons Google released a full developer's API. The next year he was named one of the TR35 (Top innovators under 35) [link here] for creating this web application, often regarded as the first maps mash-up. What I like about this story is the fact that his big idea came from a side project and the fact that coding this application probably didn't take longer than a month.

Foursquare Badge rewarding people for visiting three times any place above 59th street in New York City.

The second story concerns another application that uses geo-data although it doesn't include maps itself: foursquare. It is a mobile application where you voluntarily reveal your location and movements by checking in at different venues. It pulls your latitude-longitude coordinates from your mobile device so that you're one click away from shouting your location to the world or to your friends. More recently you have the option to upload a picture of the places you visit. Chances to collect some fine-grained image dataset about places some day? [like the im2gps project]. Well, the story about foursquare is that it was developed by engineers who had previously worked at Google. But there are quite a few startups founded by former Google employees, you might say. The interesting part is that, before joining Google, these guys had already developed another location-based application called dodgeball, which was acquired by Google. Some time later they left Google and started again, but this time they came up with this thing called foursquare.

My friends often ask me how to do stuff with the Google Maps API because I did an internship at Google in one of the Google Earth/Maps teams. Although I did indeed work with one such team, my work was mostly concerned with server-side programming of image processing routines for aerial images. I did in fact use the Google Maps API a bit for displaying results, and also on my own time just for fun, but I don't have vast experience with it and I know they have kept adding lots of features to it. I will end this by just wondering what else is left to do with maps and how much other useful information can be nicely mapped.

Thursday, April 28, 2011

High School Algebra in Latin America

I'm grossly generalizing here in the title of this post; I will mostly discuss my own experience as a high school student in Ecuador. But as I have read from several sources, some of these things do apply to lots of places in Latin America. That being said, I will keep the title.

If you come from one of the Spanish-speaking countries in Latin America, you will with high probability recognize the book cover in this blog post. This is what used to be, and still is, synonymous with algebra for most high school students in those regions. If you don't recognize this image, be it because you're not from Latin America or because you really didn't know about it, then I would really like to know what the standard book used in your country for learning algebra is. This book is so pervasive in lots of places in Latin America that the name Baldor and the picture of the guy with the turban come to people's minds whenever the word 'algebra' is uttered.

What most people don't know is that Baldor was a Cuban mathematician who later emigrated to the United States, and not the guy with the turban depicted on the book cover. I couldn't find much information about his education in mathematics or other material published by him, but he held a teaching position in the United States later in life, although the book was already being distributed from Mexico at the time. I believe one of the reasons for the adoption of his book was the lack of mathematics textbooks in Spanish at the time. But mainly it was adopted because it is also a good book, and if it has any flaws I believe those are the same flaws that books in other languages might have; I will come back to this in the following paragraphs.

One thing that I particularly like about this book is that it's well organized and easy to follow, I would say even easy to follow on your own. One thing I don't like is that most of the exercises are repetitive and easy, sometimes even boring. Another thing that I like is that every few pages it has short biographies of major mathematicians in history. This is probably the first book where students get to know who Laplace, Euler, Descartes, Newton, Fermat, etc. are: guys you will keep on hearing about, especially if you go into the hard sciences later in college.

Finally I will quote something said by the great physicist Richard Feynman in his interview with the BBC and which I think illustrates a criticism that applies to this book as well as to books in other languages regarding teaching science:

"I learnt algebra fortunately by not going to school and knowing the whole idea was to find out what x was and it didn't make any difference how you did it, there's no such thing as, you know, you do it by arithmetic, you do it by algebra, that was a false thing that they had invented in schools so that the children who have to study algebra can all pass it. They had invented a set of rules which if you followed them without thinking could produce the answer: subtract 7 from both sides, if you have a multiplier divide both sides by the multiplier and so on, and a series of steps by which you could get the answer if you didn't understand what you were trying to do."

I think this applies to this textbook and I highly doubt Prof. Feynman was ever in touch with our legendary Baldor's Algebra. Sadly, this is just the way Algebra is being taught in schools in general: learn the rules and get me the results.

Related Links, Sources:
Richard Feynman's interview
Aurelio Baldor's wikipedia entry

Update (April 30, 2011):
I just want to clarify that even though I mention that this Baldor Algebra book was adopted maybe because it was a good book given the standards of the time and the available books in Spanish, I also state that it's too easy and boring, and even if it were more difficult it would still make the same mistakes other books make: just teaching calculations and rules and not the beauty of math. The idea is better pictured in the following video, a TED Talk by Conrad Wolfram on his view of teaching math.
http://www.ted.com/talks/conrad_wolfram_teaching_kids_real_math_with_computers.html
(Thanks @sergioroa from DFKI - Germany for sending the link)

Sunday, April 10, 2011

Minesweeper as an Introduction to Computer Science

The very first computer game that I remember playing is Minesweeper, in Spanish Buscaminas (which literally translates as Minefinder). It was probably 1997 when, one day while visiting my father's workplace, I started exploring a computer with Windows 95. I think my father had loosely explained to me some of the basics about computers, but I think it was mainly my experience with video games that made the transition smooth. I quickly navigated through the task bar looking for games and ended up clicking on the salient smiley face icon. I randomly clicked through the cells and ended up losing the game quickly and with disappointment. I had to wait some 3 years until my family could finally acquire a personal computer for our house.

I liked Minesweeper very much and I want to explain here how I embraced it in my education and what I believe we can learn from a game that, according to Wikipedia, has been around since the early mainframes of the 60's. The first thing I liked about the game is that it is very self-contained; it's more exciting if you figure out the game rules by yourself while trying it. You will quickly realize the meaning of the numbers when you start uncovering cells: the number of bombs around that cell. Then you will start realizing how to use this information and start finding patterns: the 1's in the corners, the 2's in the corners, several combinations of 2's and 3's and so on, the patterns that allow you to become faster and really master the game. Shortly after I started playing, my father and sister also came to like the game and started playing it, and we often challenged each other on our computer.

After some time, and with lots of spare time and a computer at home, I started exploring some basic programming. What I was actually most interested in was learning how to create webpages, but at some point I found myself programming some Javascript, mostly to open the infamous popup windows and to create some more stylish navigation menus. But I also started to wonder how to encode the rules of a game like Minesweeper and thought maybe it would be a good exercise to try. I couldn't do it at the time. In my naive attempts, without any guidance other than Yahoo Search, Altavista and the lesser-known newcomer Google, I tried to learn the Pascal programming language to do the job, but honestly I couldn't get very far on my own. I had to wait another 2 or 3 years until I was in my second year of college.

So I'm talking now about 2004, with an introduction to programming class and a data structures class under my belt. I was enrolled in the 'Object Oriented' programming class, still a hot topic at the time, at least in the local tech community back there. So I made up my mind to use this class as an opportunity to implement Minesweeper and play around with the rules of the game to create some variants of the basic game. The basic game requires an understanding of several things; there are two obvious things that you will gain from the experience:

1. Be good at manipulating arrays/matrices: Obvious! The game even looks like a matrix so this is the data structure you will need. You will have to traverse the matrix up and down, forward and backwards in every way possible. I implemented this in Java so I didn't need to think about dynamic allocation of arrays explicitly but if you want your game dimensions to be variable (beginner, medium, expert), then in a language like C you will want to go dynamic.

2. Be good at using recursion: This might not be too obvious, but uncovering a cell with no number and no mine requires propagating a recursive call in several directions until you find a numbered cell. There's always a way to do it without recursion but recursion just comes naturally.
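The two points above can be sketched together in Java. This is a minimal illustration, not the code from the class project: the name RevealSketch, the MINE marker value and the choice to spread in all eight directions are assumptions made for the example. The board is a matrix of neighbor counts, and uncovering an empty cell recurses outward until it reaches numbered cells.

```java
import java.util.Arrays;

class RevealSketch {
    static final int MINE = -1; // assumed marker for a mined cell

    // counts[r][c] holds MINE or the number of adjacent mines (0 to 8).
    // revealed[r][c] becomes true for every cell uncovered by this click.
    // Clicking directly on a mine (game over) is not handled in this sketch.
    static void reveal(int[][] counts, boolean[][] revealed, int r, int c) {
        if (r < 0 || r >= counts.length || c < 0 || c >= counts[0].length) return; // off the board
        if (revealed[r][c] || counts[r][c] == MINE) return; // already open, or a mine
        revealed[r][c] = true;
        if (counts[r][c] != 0) return; // numbered cell: uncover it but stop spreading
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                if (dr != 0 || dc != 0) reveal(counts, revealed, r + dr, c + dc);
            }
        }
    }

    public static void main(String[] args) {
        int[][] counts = {
            {0, 0, 0, 0},
            {0, 1, 1, 1},
            {0, 1, MINE, 1},
            {0, 1, 1, 1}
        };
        boolean[][] revealed = new boolean[4][4];
        reveal(counts, revealed, 0, 0);
        for (boolean[] row : revealed) System.out.println(Arrays.toString(row));
    }
}
```

On a very large board the recursion depth can grow with the size of the empty region, which is one reason an explicit stack or queue is the usual non-recursive alternative.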

Beyond these two things you can learn about the power of random number generation when you're writing the routine to place the mines and also a basic convolution-like operation when you're assigning the numbers to the cells after the placement of the mines. Also if you're really into it you will notice things from the Windows Minesweeper like the fact that you never hit a bomb in your first move.
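A hedged Java sketch of those two routines follows. The class and method names are made up for illustration, and keeping the first click safe by placing the mines only after that click is an assumption for the example; it is not a claim about how Windows Minesweeper does it. The counting step visits the 3x3 window around each cell, which is what makes it feel like a small convolution.

```java
import java.util.Random;

class BoardSetupSketch {
    static final int MINE = -1; // assumed marker for a mined cell

    // Scatter mineCount mines at random, never on the first clicked cell (safeR, safeC).
    static int[][] placeMines(int rows, int cols, int mineCount, int safeR, int safeC, Random rng) {
        if (mineCount < 0 || mineCount > rows * cols - 1) {
            throw new IllegalArgumentException("mine count does not fit the board");
        }
        int[][] board = new int[rows][cols];
        int placed = 0;
        while (placed < mineCount) {
            int r = rng.nextInt(rows);
            int c = rng.nextInt(cols);
            if (board[r][c] == MINE || (r == safeR && c == safeC)) continue; // taken or protected
            board[r][c] = MINE;
            placed++;
        }
        return board;
    }

    // For every non-mine cell, count the mines in the 3x3 window centered on it.
    static void fillCounts(int[][] board) {
        for (int r = 0; r < board.length; r++) {
            for (int c = 0; c < board[0].length; c++) {
                if (board[r][c] == MINE) continue;
                int n = 0;
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int rr = r + dr, cc = c + dc;
                        boolean inside = rr >= 0 && rr < board.length && cc >= 0 && cc < board[0].length;
                        if (inside && board[rr][cc] == MINE) n++;
                    }
                }
                board[r][c] = n;
            }
        }
    }

    public static void main(String[] args) {
        int[][] board = placeMines(5, 5, 6, 2, 2, new Random());
        fillCounts(board);
        for (int[] row : board) System.out.println(java.util.Arrays.toString(row));
    }
}
```

Rejection sampling like this is fine when mines are sparse; for boards that are almost full of mines, shuffling a list of all cell positions would avoid repeated retries.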

In the realm of Object Oriented programming itself, which was the excuse for getting into this project, you can also learn to encapsulate your objects so well that you can create a new game just by instantiating a Minesweeper class. Things like: new Minesweeper(), new Minesweeper("expert"), new Minesweeper(width, height, mineCount). You can go even further and generalize your game to add extra features this way: SuperMinesweeper extends Minesweeper. This effectively addresses the whole purpose of programming with objects in mind.
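As a rough sketch of that class design in Java: the three constructors and the SuperMinesweeper subclass come from the paragraph above, while everything else is assumed for illustration, namely the preset board sizes, the describe() method and the "lives" feature of the subclass.

```java
class Minesweeper {
    protected final int width;
    protected final int height;
    protected final int mineCount;

    // new Minesweeper(): default to the easiest level.
    Minesweeper() {
        this("beginner");
    }

    // new Minesweeper("expert"): look up an assumed preset for the named level.
    Minesweeper(String level) {
        this(preset(level)[0], preset(level)[1], preset(level)[2]);
    }

    // new Minesweeper(width, height, mineCount): fully custom board.
    Minesweeper(int width, int height, int mineCount) {
        if (width <= 0 || height <= 0 || mineCount < 0 || mineCount >= width * height) {
            throw new IllegalArgumentException("invalid board parameters");
        }
        this.width = width;
        this.height = height;
        this.mineCount = mineCount;
    }

    // Assumed presets as {width, height, mines}; adjust to taste.
    private static int[] preset(String level) {
        switch (level) {
            case "beginner": return new int[] {9, 9, 10};
            case "medium":   return new int[] {16, 16, 40};
            case "expert":   return new int[] {30, 16, 99};
            default: throw new IllegalArgumentException("unknown level: " + level);
        }
    }

    String describe() {
        return width + "x" + height + " board, " + mineCount + " mines";
    }
}

// Hypothetical variant: the player survives a few mine hits.
class SuperMinesweeper extends Minesweeper {
    private final int lives;

    SuperMinesweeper(String level, int lives) {
        super(level);
        this.lives = lives;
    }

    @Override
    String describe() {
        return super.describe() + ", " + lives + " lives";
    }
}
```

Chaining the shorter constructors into the three-argument one keeps validation in a single place, so a subclass only has to add its own state.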

Minesweeper is not the only game that I got to program when I was taking my first steps in the world of programming, but now that I'm usually writing code for image processing and computer vision I get to remember the first times I was traversing matrices and performing convolutions and recursive calls over rows and columns. I'm including in this post a screenshot of the game, which looks as close as it can get to the one distributed in older Windows versions. The code I wrote back then is still fully working, but nowadays you can find lots of Minesweeper implementations out there, even in Javascript, which requires you to install nothing; you can try this one. Indeed you can find so many variants of it online, from so many places around the world. I, for one, thank all the programmers who wrote the early versions of Minesweeper, and also Microsoft for its decision to include this nice game in its most popular software. This is a game that I believe has truly inspired many people, even to the point of somebody proving that Minesweeper is NP-complete! How cool is that?

Saturday, February 26, 2011

Podcast search - Large scale media retrieval

This is a follow-up to my last tech post on Speech to Text using Java. I'm exploring here one idea I envisioned some time back (almost 4 years ago now) when I was a college student. This is the reason that pushed me to look for mature speech recognition technologies that I could use. By the way, this reminds me of some online lectures on Convex Optimization by Prof. Stephen Boyd (Stanford) that I was listening to the other day. He commented that calling something a "technology" might feel derogatory to some people, because it implies using something as a black box without really understanding what is inside, but he said he usually replies that most of us use TCP/IP without a deep understanding of what's going on inside to get a secure channel. In my case I had decided to use speech recognition roughly as a black box but ended up learning a bit about the underlying foundations anyway (language models, acoustic models, HMMs, etc.).

My goal with my experiments with speech recognition technologies was to find a useful way to do video content retrieval using noisy, automatically extracted transcriptions. With this I might have tried at the time to beat YouTube on search (insert smiley face here); I even sketched the web search interface included in this post. But I was aware that doing this at web scale would require more computing power than I could possibly have available, especially back in 2007 when I was working on this: Amazon EC2 was in its early stages, the Windows Azure Platform was non-existent and even Google App Engine had not yet been released. I had to seriously narrow down my objective and make the kind of thing that people who want media attention for a partially working prototype usually justify as a "proof of concept", and in my case I would use this proof of concept as my way out of college.
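To make the retrieval idea concrete, here is a minimal Java sketch of keyword search over transcripts using an inverted index (a map from each word to the media items whose transcript contains it). This is a generic textbook structure, not the system described in this post; the class name, method names and sample transcripts are all invented for illustration, and it makes no attempt to cope with recognition errors beyond lowercasing and splitting on non-alphanumeric characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TranscriptIndexSketch {
    // word -> ids of the media items whose transcript mentions it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index one (possibly noisy) transcript under the given media id.
    void add(String mediaId, String transcript) {
        for (String token : transcript.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(mediaId);
        }
    }

    // Return the ids whose transcripts contain every word of the query.
    Set<String> search(String query) {
        Set<String> result = null;
        for (String token : query.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            Set<String> hits = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TranscriptIndexSketch index = new TranscriptIndexSketch();
        index.add("talk1", "the brain finds patterns in noisy data");
        index.add("talk2", "Speech recognition with noisy audio");
        System.out.println(index.search("noisy speech"));
    }
}
```

A real system would also store time offsets per word so a hit can jump to the right moment in the audio, and would need ranking and some tolerance for misrecognized words.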

At the time there was still a lot of buzz about podcasts on the web, which are the audio versions of blogs. So some companies were developing search solutions that could potentially look (or in this case listen) inside the contents of a large collection of audio documents. I'm including here a timeline of events related to podcast search and the use of speech recognition for large media search:

December 2004: Blinkx launches as the first audio search engine powered by speech-to-text technologies http://www2.prnewswire.com/cgi-bin/stories.pl?ACCT=LRTVN.story&STORY=/www/story/12-16-2004/0002636303&EDATE=THU+Dec+16+2004,+08:02+AM

April 2005: Podscope launches as the first audio search engine powered by speech-to-text technologies. (Hey! Wasn't Blinkx the first one?) http://en.wikipedia.org/wiki/Podscope

October 2005: Yahoo! Podcasts gets created, although I haven't found out whether they used speech-to-text

http://docs.yahoo.com/docs/pr/release1266.html

January 2006: Podzinger launches featuring a US government funded speech recognizer

http://www.internetnews.com/infra/article.php/3576941/PodZinger-Launches-Audio-Search-Service.htm

July 2006: AOL launches Podcast Search (powered by Podscope); this was a major internet company launching a service like this! I haven't found any information about when they ended the deal, but AOL Podcast Search doesn't seem to be available anymore.

http://news.ebrandz.com/ask-a-aol/2005/336-aol-launches-podcast-search.html

http://web.resourceshelf.com/go/resourceblog/42962

October 2006: Microsoft uses some help from Blinkx technologies for their video search. http://en.wikipedia.org/wiki/Blinkx

June 2007: Pluggd starts as a podcast search solution using speech-to-text technologies. http://mashable.com/2007/06/29/pluggd-launches-audio-search-player-on-cnet/

October 2007: Yahoo! closes Yahoo! Podcasts

https://searchengineland.com/yahoo-podcasts-to-close-the-sorry-state-of-podcast-search-12288

July 2008: Google Launches a test version of their Speech Recognition based Video Retrieval system http://googleblog.blogspot.com/2008/07/in-their-own-words-political-videos.html

February 2009: TED adds captions and translated captions to its videos by using the power of the crowd http://blog.ted.com/2009/02/09/unveiling_teds/

November 2009: Google launches automatic captions for YouTube videos

http://googleblog.blogspot.com/2009/11/automatic-captions-in-youtube.html

This list is not exhaustive but I think it's enough to draw some conclusions. I included the fact that TED added user-generated captions in 2009 because I used some TED videos in my project, generating automatic transcripts so that video content from TED could be searched. I thought that since TED is all about interesting ideas it was a pity that you could not search based on content. If I had put my idea/prototype online in 2007 it would have been useless by 2009, since automatic speech recognition is no longer necessary when you already have perfect transcriptions in several languages for most TED videos. Another thing that I can see from 2007 up until now is that two big companies decided to shut down their podcast search services (Yahoo and AOL). I was actually surprised that Google was not using speech recognition for audio search back then; they took their time and have since incorporated it into their existing products (YouTube, Google Voice). Podscope hasn't changed much since 2007 and Podzinger was rebranded twice (Everyzing, RAMP). Things move relatively slowly in this area, mostly because building automatic speech recognition software that is speaker independent and handles a large vocabulary is very expensive, so only big companies or companies already owning rights over speech recognition software can compete. I think we have yet to see the final take on search based on audio content; the best example today is the automatic caption generation on YouTube videos.

Wednesday, February 23, 2011

Washington DC, Philadelphia and Baltimore

At the beginning of this year I was part of a two-day trip through five states on the East Coast of the United States: New York, New Jersey, Pennsylvania, Delaware and Maryland. As on my Niagara trip, we boarded a shuttle in Chinatown. Here are some pictures we took along the way:

US Capitol in Washington D.C.

Independence Hall - Philadelphia, Pennsylvania

Baltimore Inner Harbor, Maryland

Potomac River - Washington D.C.

Sunday, February 20, 2011

Speech to Text using Java

Is there any utility out there, preferably a command line one, that takes an audio file as input and outputs a text file? Moreover, is there such a thing packaged as a library that developers can use? And, as yet another technical requirement, is there something like it written in Java or another high-level language? And as if that were not enough, I want it for free. More than three years ago I was looking for such a solution and could not find anything that worked out of the box. If you search for something like this you will probably find the sphinx-4 project, a speech recognizer written entirely in Java. I spent a lot of time trying to understand the basics of speech recognition and how it works, including the roles of language models, vocabularies, acoustic models and so on, and quite a few hours trying to make the whole thing work. I succeeded in making it recognize isolated digits, but what I wanted was general speech recognition that could deal with continuous speech, a large vocabulary and complex grammar models. I tried to make it work for this scenario but was unsuccessful: for some of the models with larger vocabularies the system was very slow, and I couldn't get much help from the documentation and the examples available on the web. Without coming up with more excuses for my failed attempt, I decided to take a different approach.

Searching for other possible solutions, I found documentation for speech recognition engines from private companies, big ones like IBM or Nuance, that usually ship their products with a developer's API. They would usually implement an interface known as SAPI (Speech Application Programming Interface), developed by Microsoft to provide speech recognition capabilities to Windows applications. Along the same lines there is a JSAPI specification for the Java programming language. Microsoft not only developed the SAPI specification but also includes its own speech recognizer with some versions of the Windows operating system. So I downloaded the SAPI Software Development Kit and wrote a simple command line utility that reads a raw audio wav file and outputs a text file with the transcription of what was said in it. The results were not great, especially because the recognition engine is not very speaker independent and some of the audio files I tried had noise or music in the background.

This hacking activity went beyond modifying one of the SDK's code examples to write the command line utility: I also wrote a JNI (Java Native Interface) wrapper to use the recognition engine from Java, although I have to stress that I did it more as a practice exercise, because I'm still limited to reading from a wav file. Of course this will only work on Windows, but portability is something you will have to give up this time if you want all the things I mentioned at the beginning of this post. I'm including a link here to the command line utility with its source code, the Java interface and one example of how to use the interface from Java.

Download: wav2text.zip

One limitation of these tools is that the wav file must be a raw PCM audio file with a 22 kHz sampling rate, 16 bits per sample and stereo sound. To get a file into this format I recommend SoX, an open source library and command line tool that can change any of those parameters of a wav audio file. It would also be great if you could input mp3 files or even video files. The problem with mp3 is that there are not many out-of-the-box solutions for conversion, due to patents. You will have to download and compile LAME and integrate this encoder with SoX in order to get SoX to convert between raw PCM wav files and mp3 encoded files. For video files the best solution is to use mplayer from the command line: mplayer -vo null -ao pcm:file=%FILE_PATH% followed by the path of the video file. This will extract the audio from the video file.
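To make the format requirement above concrete, here is a minimal sketch in Java. It is not the code shipped in wav2text.zip. It uses the standard javax.sound.sampled API to check whether a wav file is already 16-bit stereo PCM at the expected rate, and it builds a SoX argument list for converting a file that is not. The class name WavFormatCheck, its method names and the file names are made up for illustration, and I am assuming that "22 kHz" means the common 22050 Hz rate.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of wav2text.zip.
final class WavFormatCheck {
    // Assumption: "22 kHz" in the post means the common 22050 Hz rate.
    static final float EXPECTED_RATE = 22050f;

    // True if the format is signed PCM, 22050 Hz, 16 bits per sample, stereo.
    static boolean isSupported(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && format.getSampleRate() == EXPECTED_RATE
                && format.getSampleSizeInBits() == 16
                && format.getChannels() == 2;
    }

    // Reads only the header of the wav file and checks its format.
    static boolean isSupported(File wavFile) throws Exception {
        return isSupported(AudioSystem.getAudioFileFormat(wavFile).getFormat());
    }

    // SoX arguments that resample the input to 22050 Hz, 16 bit, 2 channels.
    // -r, -b and -c are SoX format options applied to the output file.
    static List<String> soxArgs(String input, String output) {
        return Arrays.asList("sox", input,
                "-r", "22050", "-b", "16", "-c", "2", output);
    }

    public static void main(String[] args) throws Exception {
        File wav = new File(args[0]);
        if (isSupported(wav)) {
            System.out.println(wav + " is ready for recognition");
        } else {
            // Print the conversion command instead of running it.
            System.out.println(String.join(" ",
                    soxArgs(wav.getPath(), "converted.wav")));
        }
    }
}
```

Running it with the path of a wav file either confirms that the file is in the expected format or prints a SoX command that should convert it. The command could be handed to a ProcessBuilder if SoX is installed and on the path.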

Getting to hack with audio and video formats and speech recognition development was an interesting experience. It also exposed me to other higher-level technologies like VoiceXML, and later I had the chance to meet one of the developers of Firevox, a Firefox extension that focuses on accessibility. Coincidentally, a research project here at Stony Brook also works on accessibility using voice technologies: HearSay.

This book about the history of vocoders is titled "How to wreck a nice beach", a phrase commonly used as an example of the difficulty of speech recognition, because it sounds like "How to recognize speech".

Wednesday, January 12, 2011

Google New York

Google invited the Computer Science graduate student community at Stony Brook to visit their offices in New York last December. Most people were busy with finals and wrapping up the Fall semester, but a group of us fellow graduate students still managed to go. The weather was very cold, but there was no snow yet at the time.

The tour was guided by both a Google software engineer and a recruiter. The company is moving from its Times Square location to two new buildings, and although it only occupies a couple of floors in those buildings, the place still looks huge. From the outside, the building we visited looks like any other building in Manhattan. This is in contrast to the Google-land look of the main offices in California, but once you're inside the feeling is the same: an open place full of fun and geek culture in every corner.

We attended a Tech Talk given by a software engineer and Ph.D. alumnus of Stony Brook working on Google Local Search. He explained a lot of the challenges he deals with in his daily work and how they collect, interpret, present and, more importantly, search through geo data. Before the talk we played a trivia game where we could win precious Google merchandise. I managed to get away with the two items shown in the picture of this post: a Google mug and an Android plush doll. I got the green Android by completing the phrase attributed to Edsger Dijkstra, "Computer Science is no more about computers than astronomy is about telescopes", and I won the mug by remembering the original name given to the Google search engine when it was first put online: BackRub.

After the talk, and with some Google goodies in hand, we had lunch with the employees in one of the nice restaurants that have made this company famous for its free and good food. Beyond Google we didn't do much that day. We went to Washington Square Park with a group of friends and did some shopping nearby. Google New York seems like an awesome place to work, although I came to realize that most of the work in my field of interest is happening on the West Coast.

Tuesday, January 11, 2011

The Museum of Modern Art - MoMA in New York City

A group of friends and I visited MoMA in New York last November. Being part of the State University of New York system grants you benefits at some of these venues, and a visit to this important museum in New York was one we did not want to pass up. The streets of the city were particularly crowded since it was Thanksgiving season and people were buying and selling stuff more than ever.

More than two years ago I had already visited the San Francisco Museum of Modern Art (SFMOMA) and admired some of the most relevant pieces in contemporary art. San Francisco and New York are very different in several respects, but they are definitely the two most important cities in the US when it comes to art. I used to paint and do charcoal drawing myself before going to college. I attended a couple of academies when I was a kid, but when deciding on my career I didn't pursue this further. Nevertheless, it helped me gain some knowledge of and appreciation for art.

I noticed that one of the pieces that attracted the most attention from the public was The Starry Night by Vincent van Gogh. But I also had the chance to see other works that I had only seen in the books from my art classes when I was a kid. Les Demoiselles d'Avignon and Three Musicians by Pablo Picasso are good examples. In the picture in this entry I am with a Jackson Pollock painting. Although his art of composing simple forms in complicated ways has received mixed reviews, one of his paintings is considered the world's most expensive painting, sold in 2006 for 140 million dollars. The art of Monet, Mondrian, Dalí, Miró and Warhol also made the trip to the city more than worth it. Next time I will try to visit the Metropolitan Museum, another big one for art in New York City.