 


 

International Workshop on Object Categorization (iWOC) @ ICCV07


October 15, 2007
Rio de Janeiro, Brazil


Email: Bernt Schiele

 

Part of ICCV07

 

Organizers

Sven Dickinson
University of Toronto
Toronto, Canada

Ales Leonardis
University of Ljubljana
Slovenia

Bernt Schiele
TU Darmstadt
Darmstadt, Germany

Michael Tarr
Brown University
Providence, RI, USA


Support

Toyota

euCognition


 


Note: This workshop will finish before the "Visual Recognition Challenge" Workshop starts -- they will not clash.

The workshop will begin promptly at 8:00am, and finish promptly at 4:15pm. The PASCAL Visual Recognition Challenge workshop will begin promptly at 4:30pm and run until 8:00pm.  Both workshops address the topic of object categorization from different perspectives, and can therefore be seen as complementary.  To maximize joint attendance at the two workshops, we have scheduled them back-to-back in the same room.


Topic and Motivation

The recognition of object categories has a rich history in both human and computer vision.  Beginning in the 1970s, generic object recognition systems sought to model and recognize objects based on their coarse, prototypical shape. These early systems employed complex 3-D models, which offered invariance to viewpoint (including image translation, rotation, and scale), articulation, occlusion, and minor within-class shape deformation.  Similarly, theories of human object recognition sought to exploit the power of 3-D models as a means for explaining how observers routinely achieve invariant recognition. However, despite their powerful modeling paradigms, these early systems and theories lacked the low- and intermediate-level segmentation, grouping, and abstraction machinery needed to recover such prototypical shape from real images of real objects.  Over the next 20 years, the recognition community began to back away from this "holy grail" of recognition, bringing the models closer to the image in an effort to reduce the representational gap between extractable image features and model features.  During this time, the community passed through the CAD-based vision era, where exact 3-D geometry was specified, to the appearance-based vision era, where exact 2-D photometry was specified (either globally, or locally at interest points).


Over this period, the recognition problem was reformulated from generic object recognition to exemplar recognition.  For the first time, real object exemplars, with full texture and complex shape, could be recognized. However, it became apparent that these techniques for exemplar recognition did not scale up to generic objects (alternatively called classes or categories).  Over the last 10 years, the mainstream object recognition pendulum has started to swing back toward object categorization.  Armed with new features, new segmentation techniques, new optimization and matching techniques, and new machine learning methods, the community is far better prepared this time around to tackle this important problem.  Unfortunately, since categorization was absent from the mainstream for so long, there is a tendency in the computer vision community not to look back at earlier problem formulations, challenges, and solutions. This historical disconnect has not served the community well, and we're now navigating today's challenges, including the quests for more categorical features (shape vs. appearance), viewpoint invariance, articulation invariance, occlusion invariance, and invariance to within-class structural change, without the clear hindsight of the community's earlier experience.


In an effort to foster greater communication between researchers from disparate camps, and to help bridge this historical disconnect, we held international workshops on generic object recognition at CVPR 97, ICCV 99, and CVPR 04.  The workshops all had an identical format: bring together 10-12 of the community's most prominent researchers, drawn from both the human and computer vision recognition communities and whose research spans the evolution of the field, to share their perspectives on the problem.  To stimulate discussion, we purposely chose researchers with differing viewpoints, in an effort to represent all perspectives on the problem, rather than favor a particular methodology.  Since many of the workshop attendees were graduate students, we felt that a broad treatment of the problem, with broad historical context, was critical.  Speakers were encouraged not to simply present their latest work, but rather to provide a perspective on their experience working on the problem, and to talk about the challenges, successes, and failures.  The workshops were a great success, with the most recent incarnation (at CVPR 04) being the most attended workshop of the conference.


On this, the tenth anniversary of the first workshop we held on the topic, there are many reasons why we believe the timing is right for a fourth such workshop.  As mentioned above, the historical disconnect continues to grow at the same time as more researchers enter the recognition community; an institutional memory refresh is especially important for today's researchers and students, if we are to maximally benefit from the community's prior work.  Perhaps a more compelling reason for holding such a workshop now is the increased interest within the fields of psychology and neurophysiology in studying the mechanisms of human visual object categorization.  We have recently seen several successful algorithms that are biologically inspired or motivated.  By bringing together researchers from different vision subcommunities, we hope to increase interdisciplinary awareness and collaboration, both of which will ultimately shed light on this important problem.


Organization and Workshop Format

Like our previous three workshops, the format of the one-day workshop will be 12 invited speakers (six human vision, six computer vision) plus a panel discussion.  We have a book contract with Cambridge University Press to assemble an edited volume whose contributions not only include those of our 12 invited speakers, but include speakers from the three previous workshops as well as others we would like to include but cannot invite due to various constraints.

Location of the Workshop

The workshop is part of ICCV07 and is held prior to the main conference.
For updated information about the location of the workshop, please refer to the webpage of the main conference, ICCV 2007.


List of Invited Speakers



Human Vision

Tai Sing Lee (abstract) (slides)
Carnegie Mellon University

Moshe Bar (abstract) (slides)
Harvard University

Kalanit Grill-Spector (abstract) (slides)
Stanford University

James J. DiCarlo (abstract) (slides)
Massachusetts Institute of Technology

Heinrich B. Bülthoff (abstract) (slides)
MPI for Biological Cybernetics

Irving Biederman (abstract) (slides)
University of Southern California


Computer Vision

Cordelia Schmid (abstract) (slides)
INRIA

Song Chun Zhu (abstract) (slides)
University of California

Pietro Perona (abstract) (slides)
Caltech

Shimon Ullman (abstract) (slides)
Weizmann Institute

Stuart Geman (abstract) (slides)
Brown University

David W. Jacobs (abstract) (slides)
University of Maryland


Workshop Participation

The workshop is open to all ICCV-participants.


Program of the Workshop 


8:00  -  8:15:     Welcome and Opening Remarks
8:15  -  8:45:     James J. DiCarlo
8:45  -  9:15:     Pietro Perona
9:15  -  9:45:     Stuart Geman

9:45  - 10:00:    Coffee

10:00 - 10:30:   Irving Biederman
10:30 - 11:00:   David W. Jacobs
11:00 - 11:30:   Kalanit Grill-Spector

11:30 - 12:30:   Lunch

12:30 -  1:00:    Tai Sing Lee
1:00  -  1:30:     Shimon Ullman
1:30  -  2:00:     Song Chun Zhu

2:00  -  2:15:     Coffee

2:15  -  2:45:     Moshe Bar
2:45  -  3:15:     Cordelia Schmid
3:15  -  3:45:     Heinrich B. Bülthoff
3:45  -  4:15:     Panel Discussion



David W. Jacobs: Comparing Images of 3D Objects

We can approach categorization as a process in which a new 2D image is assigned to a category by comparing it to previously seen 2D images of objects in that category.  At the same time, one of our main sources of knowledge when making such comparisons is that these are images of 3D objects.  This raises a key question: how do we compare a new image to one or more previously seen images, to determine whether these all come from the same 3D object?

I will describe a number of approaches to image comparison that rest on attempts to model the ways in which changes in viewpoint, lighting, or shape can alter the 2D appearance of a 3D object.  I will also describe a number of open problems raised by this research perspective. 


Song-Chun Zhu: Object Category Modeling, Learning, and Recognition by Stochastic Grammar

In this talk, I will review and discuss some recent progress in my group on object category recognition and parsing. I will introduce a stochastic context-sensitive grammar as a unified framework for object modeling, learning and recognition. This grammar is embodied in an And-Or graph representation integrating (i) a hierarchical decomposition to account for structural variations, and (ii) a set of horizontal relations for spatial and functional contexts. Each object category is then defined as the set of all valid configurations produced by its grammar, and a probabilistic model is defined on the And-Or graph to account for the natural frequency of object instances. This model can be learned from a small training set and simulated through MCMC sampling to generalize to a large number of novel configurations, so that they cover unforeseen instances in testing images. To make this grammar model scalable, we have constructed a large manually annotated image database at Lotus Hill Institute to support learning and evaluation. We have also developed a recursive algorithm for bottom-up / top-down inference. I will show some case studies on modeling clothes, man-made objects, and object categories.

This talk is prepared based on a long review paper by Zhu and Mumford, 2007 "A Stochastic Grammar of Images", downloadable from http://www.stat.ucla.edu/~sczhu/papers/Grammar_quest.pdf.
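As a rough illustration of the And-Or idea described in this abstract, the following minimal Python sketch enumerates the configurations of a toy grammar together with their probabilities. It is not the speaker's model: the grammar, the part names, and the branching probabilities are all hypothetical, and the sketch is context-free, so it omits the horizontal (spatial and functional) relations that make the actual grammar context-sensitive.

```python
from itertools import product

# Hypothetical toy grammar (all names and probabilities are illustrative).
# ("and", [children])            -> every child must be present (decomposition)
# ("or", [(prob, child), ...])   -> exactly one alternative is chosen
# Any symbol not in the grammar is treated as a terminal part.
GRAMMAR = {
    "chair": ("and", ["seat", "support"]),
    "seat": ("or", [(0.5, "flat_seat"), (0.5, "cushioned_seat")]),
    "support": ("or", [(0.7, "four_legs"), (0.3, "pedestal")]),
}

def configurations(symbol, grammar):
    """Return (probability, terminal_parts) for every valid configuration."""
    if symbol not in grammar:
        return [(1.0, (symbol,))]
    kind, children = grammar[symbol]
    if kind == "or":
        # An Or-node contributes the configurations of each alternative,
        # weighted by the probability of choosing that alternative.
        return [
            (p * q, parts)
            for p, child in children
            for q, parts in configurations(child, grammar)
        ]
    # An And-node combines one configuration from each of its children.
    results = []
    for combo in product(*(configurations(c, grammar) for c in children)):
        prob, parts = 1.0, ()
        for q, child_parts in combo:
            prob *= q
            parts += child_parts
        results.append((prob, parts))
    return results

chair_configs = configurations("chair", GRAMMAR)
for prob, parts in chair_configs:
    print(f"{prob:.2f}", parts)
```

In this toy setting the category "chair" is simply the set of four part combinations the grammar can produce, and the probabilities over that set sum to one.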


Shimon Ullman: Recognizing visual classes and individual objects by semantic hierarchies

We describe a general scheme for recognizing object classes at different levels of specificity, as well as individual objects across changes in viewing conditions.

In this approach, objects are represented by a hierarchy of semantic features that are extracted during learning from observed examples. Each semantic feature contains a set of fragments that depict different appearances of the same object part. The equivalent part appearances are identified by two distinct mechanisms: one is based on the tracking of object features over time, and the other on the use of common context. The semantic features are used for categorization, as well as for recognizing individual objects under large changes in the viewing conditions. Each object part is represented by a set of fragments depicting this part under different viewing conditions, and together they form a view-invariant representation of objects within a general class.

The hierarchical structure of the representation is used for recognizing not only complete objects, but also their parts and sub-parts at multiple levels. The parts at all levels are identified efficiently and unambiguously using a two-pass algorithm, composed of a bottom-up followed by a top-down sweep.

Finally, the hierarchical representation is used to combine figure-ground segmentation with the recognition process. The figure–ground labeling of the stored fragments is first learned from a collection of non-segmented image examples during a learning stage. Given a novel object from the learned class, the detected fragments produce a cover of the object in terms of stored fragments, which is used to guide the segmentation of the entire object.


Tai Sing Lee: Correlated structures in natural scenes and their implications on neural learning of prior models for objects and surfaces

Features associated with an object tend to vary coherently in our visual experience, and are highly correlated in their occurrence in space and time. These  correlations in turn are manifested in correlated neural activities and effective connectivities among neurons,  leading to the formation of hierarchical and compositional representations of shapes and objects  in the ventral  stream of the visual cortex. We will examine these correlations in natural scenes, and present neurophysiological evidence on how they manifest in the activities of neuronal ensembles. We will finally discuss a class of computational models for statistical visual inference and unsupervised object learning motivated by these principles.


Kalanit Grill-Spector: Functional Organization and Development of  the Human Ventral Stream

Research in the past decade has shown that the human ventral stream consists of a constellation of regions involved in object, face, and place perception.  However, there is much debate about the degree of segregation of ventral stream regions and about the computations performed in distinct subregions. I will describe results of my research contributing to the understanding of functional organization and computations in the human ventral stream. I will discuss results of experiments that show a hierarchy of computations along ventral stream regions in which the representation of objects becomes more abstract and high level, with posterior regions sensitive to objects’ viewpoint, size and position, and more anterior regions along this processing stream showing a higher degree of invariance. Nevertheless, even in anterior regions there is sensitivity to object view and position, suggesting a view-based representation of objects in the human ventral stream. Then I will describe results that examine the development of the ventral stream, showing evidence for differential maturation across the ventral stream, in that regions that prefer faces and places mature later than object-selective regions. Further, this cortical maturation is correlated with maturation of recognition memory for faces and places. Importantly, these data show that prolonged development (>11 years) is necessary for the maturation of these cortices, and they speak to the role of experience in shaping the ventral stream.  Finally, I will describe results of recent experiments using high-resolution fMRI showing that there is a fine-scale structure in the human ventral stream, which raises new questions about the degree of modularity and segregation of face and object computations.


James J. DiCarlo: Can we understand how the brain solves object recognition?

Many of us are united by a common goal of understanding how the primate brain solves the remarkably challenging problem of visual object recognition.  This problem will not be solved by any single approach, but by a tight coupling of experimental and computational approaches.  However, over the last ten to twenty years, both types of approaches have drifted away from the computational crux of the problem, each in their own way and for their own reasons.   For example, visual neurophysiologists have implicitly and explicitly taken their goal to be that of predicting the responses of single neurons to visual stimuli or the effects of behavioral state (task, attention, etc.), with much less effort devoted to understanding how populations of such neurons can, together, form good object representations.  Computational approaches have tended to sidestep the larger crux problems of object recognition, and to instead focus on limited problems because they are approachable and can show results.  In this talk, I will give my thoughts on the kinds of combined experimental and computational approaches that are likely to prove most fruitful, by outlining examples of recent work from our laboratory on the representation of objects in clutter and the role of natural visual experience in constructing powerful forms of object representation.


Stuart Geman: Hierarchy and Reusability in Image Analysis

Image analysis is the problem of parsing an image into an abstract, relational, description of its components.  Applications abound in industrial automation, target recognition, security, data mining, and content-based search.  Yet forty years of well-funded efforts by  engineers, cognitive scientists, computer scientists, and mathematicians have done little to close the “ROC gap.”  Biological vision is superior, especially when operating at high detection rates, where artificial systems suffer extremely low specificities.

In the cognitive sciences, compositionality refers to the apparent ability of humans to represent perceptions and thoughts in terms of a structured hierarchy of “parts,” meaning sub-representations that are themselves hierarchical and re-usable as components of other perceptions and thoughts.  Language representation is the prototypical example, but the dual principles of hierarchy and reusability may prove to be central to all of cognition.  I will discuss possible roles of hierarchy and reusability in computer vision, and I will develop a prototype compositional vision system.

By way of background and foundational material, I will (i) attempt to identify the main sources of poor performance of computer vision as compared to biological vision; (ii) discuss the probabilistic generative approach to complex inference problems, and emphasize the importance of proper normalization; (iii) review the basic relationships connecting probability distributions, graphical structures, and computational complexity, emphasizing tree-structured graphs and their relationship to grammars and parsing.

Probabilistic generative models are Bayesian: a probability (the “prior”) is placed on a suitably defined space of interpretations; given an interpretation, an observation model defines a conditional data likelihood, preferably at the level of pixel intensities.  In principle, and to a degree in practice, images sampled from the model validate assumptions or suggest improvements.
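The Bayesian setup in the preceding paragraph can be written out as a short formula. The notation here is an assumption made for illustration and is not taken from the talk: $I$ stands for an interpretation, $Y$ for the observed image data (for example, pixel intensities), $P(I)$ for the prior, and $P(Y \mid I)$ for the observation model.

```latex
P(I \mid Y) \;=\; \frac{P(Y \mid I)\, P(I)}{\sum_{I'} P(Y \mid I')\, P(I')}
```

The denominator sums over all candidate interpretations $I'$ and normalizes the posterior. Drawing $I \sim P(I)$ and then $Y \sim P(Y \mid I)$ is what "sampling images from the model" amounts to in this notation.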

I will develop a demonstration system, probabilistic and generative, built upon a “Bayesian net” (Markov random field) backbone.  I will argue, and attempt to demonstrate, that the Markov property, although computationally convenient, is untenable for image and language parsing.  Markov models are convenient computationally but too weak (too broad) in their coverage. I will introduce a non-Markovian perturbation on the Markov backbone that dramatically improves the fit, while preserving some of the computational advantages of Bayesian nets.  Experiments in license-plate reading and face detection illustrate the importance of hierarchy, reusability, and non-Markovian dependencies.


Heinrich Bülthoff: Multimodal categorization

The question of how humans learn to categorize objects and events has been at the heart of cognitive and neuroscience research for the last decades. In this talk, we are going to review some of the key challenges in categorization from a combined cognitive and computational perspective. We will focus on issues that so far have only started to be addressed but that we think are crucial for a deeper understanding of categorization in biological and artificial systems. Recent advances in computer graphics and computer vision have made it possible both to produce highly realistic stimulus material for controlled experiments in life-like environments and to enable highly detailed analyses of the physical properties of real-world stimuli. Based on evidence from several such experiments, we will discuss issues such as transformation costs of categorization processes, the need for spatio-temporal representations, perception of material properties, multi-modal/multi-sensory aspects of object processing, as well as the coupling of perception and action, that is, recognition and categorization by an active agent. For each of those issues, we will mention possible implications and applications for computational/cognitive modeling. The talk will conclude with several "challenges" with which we try to provide discussion material for perceptually and cognitively fundamental, but computationally difficult, categorization tasks.



Moshe Bar: The Proactive Brain:  Using Memory to Anticipate What's Next

Rather than passively “waiting” to be activated by sensations, I propose that the human brain is constantly busy generating predictions that approximate the directly relevant future.  Building on previous work, this proposal posits that rudimentary information is first extracted rapidly from the input, and then used to derive analogies linking that input with the most similar representations in memory. The linked stored representations in turn selectively activate the associations that are relevant in the specific context, which provides focused, testable predictions. These predictions facilitate perception and cognition by pre-sensitizing relevant representations. Predictions that involve more complex information, such as those required in social interactions, are a product of additional simulations that integrate across smaller analogies. Such an associations-based predictive framework can explain a variety of phenomena, ranging from recognition to first impressions, and from the brain's “default mode” to a host of mental disorders.


Cordelia Schmid: Designing image & object representations for recognition

Recent progress in object recognition has mainly been due to novel methods for image representation as well as to the use of recent machine learning techniques. In this talk, I will review the progress in image representation, starting with a presentation of invariant local features based on either appearance or shape. Given a set of these local features, a simple, yet efficient, image representation is a bag of features, which achieves surprisingly good results in retrieving images of the same object as well as in image-based category classification. The addition of spatial relations has been shown to further improve the results. Nevertheless, the design of object models must go a step further to support more complex tasks such as localizing objects, and determining their outline, pose or state. Existing methods range from simple multi-view classifiers over shape models to full 3D approaches. We will discuss examples of these techniques as well as potential directions for improvement.
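To make the bag-of-features representation mentioned in this abstract concrete, here is a minimal Python sketch under stated assumptions. Each local descriptor is assigned to its nearest codeword, and the image is summarized by the normalized histogram of those assignments. The two-dimensional descriptors and the three-word codebook are hypothetical placeholders; real systems use high-dimensional descriptors and a codebook learned by clustering, neither of which is shown here.

```python
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bag_of_features(descriptors, codebook):
    """Normalized histogram of nearest-codeword assignments.

    The result discards all spatial layout: only how often each
    codeword occurs in the image is retained.
    """
    counts = Counter(
        min(range(len(codebook)), key=lambda k: squared_distance(d, codebook[k]))
        for d in descriptors
    )
    total = len(descriptors)
    return [counts[k] / total if total else 0.0 for k in range(len(codebook))]

# Hypothetical toy data: three codewords and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (0.0, 0.1), (0.1, 0.9)]

histogram = bag_of_features(descriptors, codebook)
print(histogram)  # [0.5, 0.25, 0.25]
```

Two images can then be compared through their histograms, for example with a histogram distance or a classifier trained on such vectors. Because the histogram ignores where the features occur, adding spatial relations carries extra information.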



Pietro Perona: Visual Recognition: Where do we come from? What are we doing? Where should we go?

This will be a broad perspective on visual recognition. I will start by proposing a small set of 10-year goals. I will then survey and organize the  main ideas and techniques that have emerged in the past 10 years and perhaps mention a couple of red herrings that have delayed us somewhat. I will propose measures of the remarkable progress we have made, and discuss challenges that lie ahead of us.



Irving Biederman: Neural Evidence for Structural Descriptions

Despite their widespread incorporation in theoretical accounts of visual cognition and the apparent ease with which humans employ prepositions (e.g., “above”) or reason about spatial relations, no neural evidence has ever been reported for structural descriptions (SDs), which make explicit a distinction between the shape of the entities in a scene and the relations between those entities. A number of current theoretical accounts dispense with the distinction altogether, assuming that when object A is above object B different features will be defined than when object B is above object A, and that explicit relations are only defined at a later stage. In a series of fMRI-adaptation experiments, subjects viewed a sequence of two frames, each composed of a pair of spatially separated objects, and had to judge whether one of the objects had changed.  (A diminished BOLD response is expected to the extent that the same neurons are activated by the two frames.)  Switching the relative positions of the objects resulted in a greater release from fMRI-adaptation in the posterior fusiform, an area critical for object recognition, than when the objects were translated to the same extent, indicating explicit representation of relations.  Control conditions ruled out explanations involving global and between-object features, attentional shifts, and eye movements.

by Mario Fritz last modified 2007-12-01 23:12