Sunday, May 20, 2012

Visual Attention and Visual Saliency

"Everyone knows what attention is ..."
- William James, 1890

This quote appears in a research paper from the Visual Computing group at Microsoft Research Asia (MSRA) titled "Learning to Detect a Salient Object". I don't know the exact context of the quote, but it is interesting that somebody could say this in 1890 when, even today, there is a lot we don't know about visual attention. Visual attention is particularly interesting in Computer Vision because the goal of the field is to teach computers to recognize things in the visual world, and humans seem to take advantage of visual attention in ways computers still don't.

To give a working definition, I would say that visual attention is the mechanism by which our vision focuses, to a greater or lesser degree, on some things out of all the information we perceive. In computer vision, several lines of research are closely related to ideas from visual attention. One of them is visual saliency, which can cover, among other things: a) class-independent object detection or proto-object detection (although proto-objects as defined in the visual perception literature might not be directly usable in a practical application), b) detecting salient objects in an image (under the assumption that we humans do not consider all objects equally important visually), or c) computing saliency maps that mark the important regions of an image without explicitly associating them with an object.

The paper from MSRA ("Learning to Detect a Salient Object", CVPR'07) detects a salient object under the assumption that we know a priori that a salient object exists in the image. I believe this assumption holds for a large number of images on the web, because that is simply how we take pictures: we usually focus on something. It is easy to imagine that not only Microsoft but also Google is already using some form of visual saliency to autocrop web images for display in search results or to generate thumbnails. Beyond this obvious application, there is room to use this kind of technique to improve object detection itself, or at least to avoid trying to detect objects at every possible location within an image.
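
As a rough illustration of the autocropping idea, here is a minimal Python sketch that crops an image to the bounding box of its most salient pixels. It assumes a saliency map is already available as a 2D list of non-negative numbers with the same size as the image. The function names and the relative threshold of 0.5 are hypothetical choices for illustration, and plain thresholding stands in for the CRF-based detection used in the paper.

```python
def salient_bounding_box(saliency, rel_threshold=0.5):
    """Bounding box of pixels with saliency >= rel_threshold * max.

    `saliency` is a 2D list (rows of equal length) of non-negative numbers.
    Returns (top, left, bottom, right) with bottom/right exclusive,
    or None if the map is empty or has no positive value.
    """
    if not saliency or not saliency[0]:
        return None
    peak = max(max(row) for row in saliency)
    if peak <= 0:
        return None
    cutoff = rel_threshold * peak
    width = len(saliency[0])
    # Rows and columns that contain at least one pixel above the cutoff.
    rows = [r for r, row in enumerate(saliency) if any(v >= cutoff for v in row)]
    cols = [c for c in range(width) if any(row[c] >= cutoff for row in saliency)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(image, box):
    """Crop a 2D list `image` to a (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x4 example: the salient blob sits in the middle of the map.
saliency_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 1.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
image = [[10 * r + c for c in range(4)] for r in range(4)]
box = salient_bounding_box(saliency_map)
thumbnail = crop(image, box)
```

A real thumbnailer would probably also pad the box and adjust it to a target aspect ratio; this sketch only shows the core step of turning a saliency map into a crop.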

Sample saliency maps for the top left image, used as features in the MSRA paper. These maps were generated using my own implementation of their method.

The MSRA paper also introduces the MSRA Salient Object Database, a large collection of images with manually annotated bounding boxes enclosing the salient object in each image. The only thing not included is source code, which is why in 2009, while I was starting graduate school, I decided to implement their method in MATLAB (the source code is linked at the end of this post). Although my CRF formulation is not exactly the same, I get performance similar to what is reported in the original paper (see the slides included at the end of this post). The paper has received considerable attention since it was first published, and although I don't keep track of how many people download my code, I see a lot of traffic coming from Google searches. I didn't run many experiments beyond what is explained in the original paper, but I found somebody using my code who did a more thorough evaluation: a student in the Computer Vision class at the University of Texas at Austin (http://vision.cs.utexas.edu/cv-fall2011/slides/larry-expt.pdf). As I had expected, this method does better than Itti & Koch (an earlier, much simpler approach), but only when it actually detects something. That is most likely to happen in images with a single, clear salient object, which is the kind of image the method was trained on.
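
Since the database provides one annotated rectangle per image, one simple way to score a detection is by area overlap between the predicted and the annotated box. The Python sketch below computes area-based precision, recall and a balanced F-measure. The function names and the (top, left, bottom, right) box convention are assumptions made for illustration, and this is not necessarily the exact protocol used in the paper or in the slides linked above.

```python
def box_area(box):
    """Area of a (top, left, bottom, right) box; zero if the box is empty."""
    top, left, bottom, right = box
    return max(0, bottom - top) * max(0, right - left)


def overlap_scores(predicted, annotated):
    """Area-based precision, recall and balanced F-measure of two boxes.

    precision = overlap area / predicted area
    recall    = overlap area / annotated area
    """
    # Intersection rectangle (empty when the boxes do not overlap).
    intersection = (
        max(predicted[0], annotated[0]),
        max(predicted[1], annotated[1]),
        min(predicted[2], annotated[2]),
        min(predicted[3], annotated[3]),
    )
    overlap = box_area(intersection)
    pred_area = box_area(predicted)
    true_area = box_area(annotated)
    precision = overlap / pred_area if pred_area else 0.0
    recall = overlap / true_area if true_area else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# The detection covers only the top half of the annotated object:
# everything it marks is correct, but half of the object is missed.
precision, recall, f_measure = overlap_scores((0, 0, 10, 10), (0, 0, 20, 10))
```

A detector that returns nothing scores zero on both measures, which matches the observation that the method only helps when it actually detects something.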



Links in this post:
MSRA: Learning to detect a salient object source code:
http://www.cs.stonybrook.edu/~vordonezroma/code.html
Saliency Experiments Slides from the University of Texas Austin:
http://vision.cs.utexas.edu/cv-fall2011/slides/larry-expt.pdf