Vision-based multimodal analysis of affective face and upper-body behaviour
- Publication Type: Thesis
- Issue Date: 2007
Closed Access
| Filename | Description | Size |
|---|---|---|
| 01Front.pdf | contents and abstract | 1.77 MB |
| 02Whole.pdf | thesis | 56.44 MB |
This item is closed access and not available; access is restricted indefinitely.

This thesis introduces the design and creation of the FABO system, a novel approach
to automatic human affect recognition using the combination of two visual modalities,
namely, facial expressions and upper-body gestures. The design and creation of the
FABO system has followed the steps typical of a pattern recognition investigation:
definition of the patterns, collection and annotation of a significant dataset, choice of
the feature set, and selection of the most appropriate pattern recognition techniques.
The classes for the patterns were chosen from the literature on nonverbal communication
of the emotions. To be able to develop and test a robust affective bimodal system,
there was a need to access databases containing representative samples of human multimodal
expressive behaviour. Thus, the second step for the FABO project was to
create a bimodal face and body gesture database for automatic analysis of human nonverbal
affective behaviour: the FABO database. The database consisted of recordings
of subjects who performed face and upper-body gestures simultaneously while looking at the face camera. The recordings were obtained using a scenario approach, in which subjects were provided with situation vignettes or short scenarios describing an emotion-eliciting situation. Once the FABO data were acquired, they needed to
be annotated and analysed to form the ground truth for machine understanding of
the human affective multimodal behaviour. Annotation of the affective data in the
FABO database consisted of labelling them in terms of emotions/affective states (e.g., happiness, anger), rating the expressiveness of the display, rating the speed, space, and amount of movement (for the body videos only), and annotating the temporal dynamics. The emotion labelling was achieved through annotation by independent human observers, while the temporal segment annotation was done by one human coder (i.e., the author of this thesis), who repeatedly viewed each face and body video in slowed and stopped motion to determine when (i.e., in which frame) the neutral-onset-apex-offset-neutral phases start and end.
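As an illustration of the annotation scheme described above, the following Python sketch shows how one FABO annotation record could be organised; the class and field names (e.g., `VideoAnnotation`, `TemporalSegments`, `expressiveness`) are hypothetical and do not reflect the database's actual schema.

```python
# Hypothetical sketch of one FABO annotation record; field names are
# assumptions for illustration, not the database's actual schema.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class TemporalSegments:
    """Frame indices where each phase of the display begins."""
    neutral_start: int
    onset_start: int
    apex_start: int
    offset_start: int
    neutral_return: int

@dataclass
class VideoAnnotation:
    subject_id: str
    video_file: str
    # Emotion labels assigned by independent human observers.
    observer_labels: List[str]
    # Rating of how expressive the display is (scale is an assumption).
    expressiveness: int
    # Speed/space/amount-of-movement ratings apply to body videos only.
    movement_ratings: Optional[Dict[str, int]]
    # Temporal dynamics coded frame by frame by a single human coder.
    segments: TemporalSegments

def majority_label(annotation: VideoAnnotation) -> str:
    """Derive a single ground-truth label from the observers' votes."""
    labels = annotation.observer_labels
    return max(set(labels), key=labels.count)
```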
After this step, the feature set and feature extraction techniques were chosen for the input videos. The feature extraction techniques employed enabled analysis of videos containing challenging visual displays, occlusion (e.g., occluded facial features or hands) and a substantial degree of head motion. The system employed several techniques to recover the head/face region: detecting the face in every frame and registering its relative location; detecting the eye region in every frame and using it to stabilise the remaining facial regions; and modelling the face and upper-body movement to handle region merging and occlusion. Using these techniques, the numeric/symbolic information from the observations was computed and the pattern representations (i.e., feature vectors) were obtained. In this way, the input data were transformed into feature vectors containing a number of features that are descriptive of the affective states.
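The following is a minimal sketch of the per-frame face and eye detection and registration described above, using OpenCV Haar cascades; the specific cascades and the simple largest-face heuristic are assumptions standing in for whichever detectors and stabilisation steps the thesis actually used.

```python
# Minimal per-frame face/eye detection sketch using OpenCV Haar cascades.
# The cascades and the simple registration below are assumptions; they stand
# in for the detectors and stabilisation techniques described in the thesis.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_and_eyes(frame):
    """Return the face box and eye boxes (within the face) for one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, []          # occlusion or detection failure in this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
    eyes = eye_cascade.detectMultiScale(gray[y:y + h, x:x + w])
    return (x, y, w, h), list(eyes)

def process_video(path):
    """Register the face and eye locations in every frame of a video."""
    cap = cv2.VideoCapture(path)
    face_track = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        face, eyes = detect_face_and_eyes(frame)
        face_track.append({"face": face, "eyes": eyes})
    cap.release()
    return face_track
```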
Finally, extensive experiments (covering 12 affective states from 10 subjects and more than 500 videos) were carried out to obtain the affective state recognition machine of the FABO system.
The first stage of this task was the recognition of the temporal segments (i.e., neutral-onset-
apex-offset-neutral) of face and body display. Experiments were conducted with
the FABO data in order to find out whether the true temporal dynamics of the face
and body gestures could be represented using features extracted with statistical and
image processing techniques. Detection of the temporal segments was achieved in two
ways: static (frame-based) and dynamic (sequence-based). Validation with both 10-fold cross-validation and holdout methods was explored.
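A minimal sketch of how static (frame-based) temporal-segment classification could be validated with both 10-fold cross-validation and a holdout split is given below; the SVM classifier and the randomly generated feature matrix are placeholders, not the features or classifiers used in the thesis.

```python
# Sketch of static (frame-based) temporal-segment classification, validated
# with 10-fold cross-validation and a holdout split. The SVM classifier and
# the placeholder feature matrix are assumptions, not the thesis's setup.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC

# X: one feature vector per frame; y: phase label per frame
# (neutral / onset / apex / offset), generated at random as a placeholder.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 40))
y = rng.integers(0, 4, size=2000)

clf = SVC(kernel="rbf", C=1.0)

# 10-fold cross-validation over all frames.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(clf, X, y, cv=cv)
print("10-fold CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Holdout validation: train on one part of the data, test on the rest.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
print("Holdout accuracy: %.3f" % clf.fit(X_tr, y_tr).score(X_te, y_te))
```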
Affective state recognition was obtained in two stages: monomodal and bimodal. The monomodal recognition stage was based on training separate classifiers for the face and body feature vectors.
Details of the training and testing procedures of various classifiers were provided and
compared. Feature selection was also applied in order to reduce the dimensionality
of the problem domain. For monomodal affective state recognition, a new two-stage approach was proposed and tested. Similarly, for bimodal affective state recognition, a three-stage approach with a synchronisation scheme was proposed and evaluated experimentally.
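The sketch below illustrates the monomodal stage under stated assumptions: feature selection to reduce dimensionality, followed by one classifier per modality. The SelectKBest selector, SVM classifier and data shapes are illustrative choices and do not reproduce the two-stage or three-stage schemes of the thesis.

```python
# Monomodal stage sketch: reduce dimensionality with feature selection, then
# train one classifier per modality. Selector, classifier and data shapes are
# illustrative assumptions, not the two-/three-stage schemes of the thesis.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_videos = 300
X_face = rng.normal(size=(n_videos, 120))   # placeholder face feature vectors
X_body = rng.normal(size=(n_videos, 200))   # placeholder body feature vectors
y = rng.integers(0, 12, size=n_videos)      # 12 affective states

def monomodal_classifier(k_features: int):
    """Feature selection followed by an SVM, wrapped in one pipeline."""
    return make_pipeline(SelectKBest(f_classif, k=k_features),
                         SVC(kernel="rbf", probability=True))

face_clf = monomodal_classifier(k_features=40).fit(X_face, y)
body_clf = monomodal_classifier(k_features=60).fit(X_body, y)
```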
Finally, details of the feature-level and decision-level fusion approaches were provided and the results compared. The results showed that emotion/affect recognition using the two modalities generally achieved better recognition accuracy, outperforming classification using the face or body modality alone. Early fusion appeared to achieve better recognition accuracy than late fusion, and the weighted sum rule proved to be the best way to fuse the two modalities at the decision level.
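To make the two fusion schemes concrete, the sketch below contrasts feature-level (early) fusion, which concatenates the face and body feature vectors before training a single classifier, with decision-level (late) fusion, which combines per-modality class posteriors using the weighted sum rule. The weights and the assumption that `face_clf` and `body_clf` are fitted classifiers exposing `predict_proba` (such as the monomodal pipelines in the previous sketch) are illustrative only.

```python
# Early vs. late fusion sketch. Concatenation for feature-level fusion and a
# weighted sum of class posteriors for decision-level fusion; the weights and
# classifiers are assumptions chosen only to illustrate the two schemes.
import numpy as np
from sklearn.svm import SVC

def early_fusion_fit(X_face, X_body, y):
    """Feature-level fusion: concatenate modalities, train one classifier."""
    X_joint = np.hstack([X_face, X_body])
    return SVC(kernel="rbf", probability=True).fit(X_joint, y)

def late_fusion_predict(face_clf, body_clf, X_face, X_body, w_face=0.6, w_body=0.4):
    """Decision-level fusion with the weighted sum rule over class posteriors."""
    p_face = face_clf.predict_proba(X_face)
    p_body = body_clf.predict_proba(X_body)
    p_fused = w_face * p_face + w_body * p_body
    return face_clf.classes_[np.argmax(p_fused, axis=1)]
```

In practice, the modality weights for the weighted sum rule would be tuned on validation data rather than fixed as they are in this sketch.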