Tuesday, January 21, 2014

Australia and Computer Vision

Note: I have been busier than ever with work, and the loss of reach my blog suffered when Google Buzz ended has reduced some of the motivation I once had for updating it. I won't say I'm making a resolution for a comeback this year, but I will try to write early and write often.
The Three Sisters, a rock formation in the Blue Mountains in New South Wales, Australia.
I was recently lucky enough to attend a big conference in computer vision, ICCV (the International Conference on Computer Vision). It took place in Australia last December and it was refreshing. Australia is a great place to visit, but more on that later. One thing that took me by surprise was that my collaborators and I were awarded the prestigious Marr Prize in computer vision. I'm very grateful for that. I won't go into details in this blog post about the work that led to this prize, but I will borrow an excerpt from Tomasz's vision blog to describe it:

"[...]. It is all about entry-level categories - the labels people will use to name an object - which were originally defined and studied by psychologists in the 1980s. In the ICCV paper, the authors study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. The authors learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval. NOTE: If you haven't read Eleanor Rosch's seminal 1978 paper, The Principles of Categorization, do yourself a favor: grab a tall coffee, read it and prepare to be rocked. [...]"

I recommend not only following his blog but also checking out and funding his new venture on Kickstarter.

The conference was a big success, and we saw some of the newer trends in the field becoming more established. I will briefly discuss two of those trends: 1) using geometry for high-level computer vision tasks like detection or total scene understanding, and 2) using neural networks, particularly deep architectures. These two separate trends share some history. Geometry was initially taken as the main tool for solving vision problems, but time seemed to prove that, at least for object recognition, representing images with 2D templates and applying clever machine learning was sufficient to recognize objects. This approach has practically solved the problem of detecting faces. Geometry was to an extent confined to a separate track of classic tasks like reconstruction from multiple images. But in this new trend there is a clear revival of geometry in recognition: people have realized that, in order to make recognition more general, we need to incorporate geometry back into the mix.

In the same light, neural networks were at some point more or less thought of as a general black box for learning. You show the computer some images and some labels, and the neural network learns how to label new images by adjusting weights and filters applied to the input representations. This system of weights and filters bears some analogy to how networks of neurons work together, hence the name. Neural networks were also used together with 2D templates to learn to detect things like digits or faces, and they were successful. Unfortunately, there was the problem of scaling to more general neural network architectures with more layers. This required adjusting an enormous number of parameters, which computers were just not able to handle at the time. Machine learning theorists also developed very solid theory for alternatives to neural networks, like max-margin classifiers and kernels, that had better theoretical grounding and performance comparable to the neural networks of the time. Soon people largely stopped using neural networks for computer vision problems, and even for problems in other fields. The recent revival of interest in neural networks in computer vision is due to key successes in training deep neural network architectures, which have shown superior performance on several important vision tasks. Today's computers, combined with large amounts of data and clever new techniques, have made this possible.
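To make the "show images and labels, adjust the weights" idea concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: a single logistic unit trained by gradient descent on made-up 2x2 "images", nowhere near a deep architecture and not the method of any paper mentioned here. All names (`train`, `predict`, `TRAIN_IMAGES`, and so on) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(images, labels, epochs=500, lr=0.5, seed=0):
    """Fit one weight per pixel plus a bias with stochastic gradient descent."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(len(images[0]))]
    bias = 0.0
    for _ in range(epochs):
        for pixels, label in zip(images, labels):
            score = sum(w * p for w, p in zip(weights, pixels)) + bias
            error = sigmoid(score) - label  # gradient of the log loss w.r.t. score
            # Nudge every weight against the error, scaled by its input pixel.
            weights = [w - lr * error * p for w, p in zip(weights, pixels)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, pixels):
    """Return 1 if the unit believes the label is 1, otherwise 0."""
    score = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy data (an assumption for illustration): flattened 2x2 images,
# label 1 when the top row is bright, label 0 when the bottom row is bright.
TRAIN_IMAGES = [
    [1.0, 1.0, 0.0, 0.0],
    [0.9, 0.8, 0.1, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.1, 0.2, 0.9, 0.8],
]
TRAIN_LABELS = [1, 1, 0, 0]

weights, bias = train(TRAIN_IMAGES, TRAIN_LABELS)

# Two images the model has never seen.
new_top_bright = [0.8, 0.9, 0.2, 0.1]
new_bottom_bright = [0.2, 0.0, 0.7, 1.0]
print(predict(weights, bias, new_top_bright), predict(weights, bias, new_bottom_bright))
```

A deep network stacks many such units in layers, which is exactly where the number of parameters to adjust explodes.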

Our paper on entry-level categories doesn't fall into either of the two trends above. Instead, it argues for yet another focus in vision: finding clever representations of what we should be learning about our visual world. Solving computer vision is not just about translating pixel content into labels, but rather about building a higher-level representation of the visual world that we can eventually apply to specific instances of it. We have advocated for some time that we can augment this model of the visual world using knowledge collected from text and by studying how people describe images in natural language.

I will take on the almost impossible task of using my last paragraph to highlight Australia as a destination. It is an awesome place! Because it is a huge island, Australia has fauna so different from anywhere else that it feels like the closest thing to a totally different planet. I can only imagine what the first encounter of non-Aboriginal people with those animals was like. The kangaroos look like deer as portrayed in a very creative sci-fi movie, except they are real. The koalas also have no parallels. The wombats and Tasmanian devils also stand apart from the rest of the fauna. I also had a chance to visit Cairns and the Great Barrier Reef, and while it offers an awesome view of the underwater fauna, I have to say that I enjoyed the above-ground fauna the most.