Research
- Recognition and categorization of objects, scenes, and actions
- Statistical visual learning
- Biologically motivated vision
- Robust and adaptive methods for computer / cognitive vision
- Mobile robot localization using panoramic images
- Robust recovery of parametric models in range and intensity images (Recover-and-Select Paradigm)
- Color vision
Some recent work
Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation
We propose a framework for learning a hierarchical shape vocabulary for multi-class object representation. The vocabulary is compositional: each shape feature in the hierarchy is composed of simpler ones by means of spatial relations. Learning is statistical and is performed bottom-up. The approach takes simple oriented contour fragments and learns their frequent spatial configurations. These are recursively combined into increasingly complex and class-specific shape compositions, each exhibiting a high degree of shape variability. At the top level of the vocabulary, the compositions are sufficiently large and complex to represent the whole shapes of the objects. We learn the vocabulary layer by layer, gradually increasing the size of the window of analysis and the spatial resolution at which the shape configurations are learned. The lower layers are learned jointly on images of all classes, whereas the higher layers of the vocabulary are learned incrementally, by presenting the algorithm with one object class after another. We assume supervision in the form of a positive set and a validation set of class images; however, the hierarchical structure of each class is learned in a completely unsupervised way (no labels on object parts or smaller constituents are assumed).
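The bottom-up idea described above can be illustrated with a toy sketch. This is not the algorithm from our work: it assumes parts are given as hypothetical `(part_id, x, y)` detections, considers only pairs of parts with exact integer offsets, and uses plain frequency counting in place of statistical learning. The names `learn_layer`, `encode`, and `learn_hierarchy` are made up for this illustration.

```python
from collections import Counter
from itertools import combinations


def learn_layer(images, radius, min_support=2):
    """Return pairwise configurations (a, b, dx, dy) seen in >= min_support images.

    Each image is a list of (part_id, x, y) detections. Only pairs whose
    offset fits inside the current window of analysis (radius) are counted.
    """
    counts = Counter()
    for parts in images:
        seen = set()
        # Sorting gives every pair a canonical order that survives translation.
        for (a, ax, ay), (b, bx, by) in combinations(sorted(parts), 2):
            dx, dy = bx - ax, by - ay
            if max(abs(dx), abs(dy)) <= radius:
                seen.add((a, b, dx, dy))
        counts.update(seen)  # count each configuration once per image
    return sorted(c for c, n in counts.items() if n >= min_support)


def encode(parts, vocabulary):
    """Re-describe an image using the learned compositions as new parts."""
    index = {comp: i for i, comp in enumerate(vocabulary)}
    out = []
    for (a, ax, ay), (b, bx, by) in combinations(sorted(parts), 2):
        key = (a, b, bx - ax, by - ay)
        if key in index:
            # The new part is placed at the position of its first constituent.
            out.append((index[key], ax, ay))
    return out


def learn_hierarchy(images, radii, min_support=2):
    """Learn one layer per radius, feeding each layer's output into the next."""
    layers = []
    for radius in radii:
        vocabulary = learn_layer(images, radius, min_support)
        if not vocabulary:
            break
        layers.append(vocabulary)
        images = [encode(parts, vocabulary) for parts in images]
    return layers


# Toy data: part 0 = horizontal fragment, part 1 = vertical fragment.
# Images A and B each contain two "corners" (a 1 directly left of a 0).
image_a = [(1, 0, 0), (0, 1, 0), (1, 4, 0), (0, 5, 0)]
image_b = [(p, x + 10, y + 10) for p, x, y in image_a]
image_c = [(0, 0, 0), (1, 3, 3)]  # distractor with no frequent configuration

layers = learn_hierarchy([image_a, image_b, image_c], radii=(2, 4))
print(layers)
```

In this toy run the first layer finds the corner as a frequent pair of fragments, and the second layer, with a larger window, finds the pair of corners. The real vocabulary differs in the ways listed above; the sketch only shows how the output of one layer becomes the input of the next.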

Learning results
Training the representation for an individual class takes 20 to 25 minutes on average. When multiple classes are learned incrementally, the training time for each additional class decreases.
Examples of vocabulary shapes learned on 1500 natural images (all layers are learned except Layer 1, which is fixed). Only the means of the shape models are depicted:

Examples of shape models at layers 4, 5, and the final object layer learned for 15 classes for object detection:

Examples of the complete learned whole-object shape models (with the learned spatial relations also shown):

Object class detection
Matching a vocabulary learned for an individual class takes 2 to 4 seconds per image, depending on the size of the image (in our experiments the average size is roughly 700×500) and the amount of texture it contains. Matching the joint vocabulary of 15 object classes takes only 16 to 20 seconds per image.
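To put these timings in perspective, the short calculation below compares the joint vocabulary with running 15 single-class vocabularies one after another. It rests on an assumption not measured here: that separate matching would cost the sum of the individual per-class times. The variable names are illustrative only.

```python
num_classes = 15
single_class_seconds = (2.0, 4.0)   # reported range for one class vocabulary
joint_seconds = (16.0, 20.0)        # reported range for the joint vocabulary

# Assumed cost of matching 15 single-class vocabularies separately.
separate_total = tuple(num_classes * t for t in single_class_seconds)

# Effective cost per class when the joint vocabulary is matched once.
joint_per_class = tuple(round(t / num_classes, 2) for t in joint_seconds)

print(separate_total)   # (30.0, 60.0)
print(joint_per_class)  # (1.07, 1.33)
```

Under this assumption, the joint vocabulary costs roughly 1.1 to 1.3 seconds per class, compared with 2 to 4 seconds for a single-class vocabulary.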
Examples of detections of several object classes:

