Vision-based multimodal analysis of affective face and upper-body behaviour
NO FULL TEXT AVAILABLE. Access is restricted indefinitely.

This thesis introduces the design and creation of the FABO system, a novel approach to automatic human affect recognition that combines two visual modalities: facial expressions and upper-body gestures. The design and creation of the FABO system followed the steps typical of a pattern recognition investigation: definition of the patterns, collection and annotation of a significant dataset, choice of the feature set, and selection of the most appropriate pattern recognition techniques. The classes for the patterns were chosen from the literature on nonverbal communication of emotions. Developing and testing a robust affective bimodal system required access to databases containing representative samples of human multimodal expressive behaviour. Thus, the second step of the FABO project was to create a bimodal face and body gesture database for automatic analysis of human nonverbal affective behaviour: the FABO database. The database consisted of recordings of subjects who performed face and upper-body gestures simultaneously while looking at the facial camera. The recordings were obtained using a scenario approach, in which subjects are provided with situation vignettes or short scenarios describing an emotion-eliciting situation. Once the FABO data were acquired, they needed to be annotated and analysed to form the ground truth for machine understanding of human affective multimodal behaviour. Annotation of the affective data in the FABO database consisted of labelling them in terms of emotions/affective states (e.g., happiness, anger), and rating the expressiveness of the display, the speed, space, and amount of movement (only for the body videos), and the temporal dynamics.
The emotion labelling was achieved with annotation by independent human observers, and the temporal segment annotation was done by one human coder (i.e., the author of this thesis) by repeatedly viewing each face and body video in slowed and stopped motion to determine when (in which frame) the neutral-onset-apex-offset-neutral phases start and end. After this step, the feature set and feature extraction techniques were chosen for the input videos. The feature extraction techniques employed enabled analysis of videos containing challenging visual displays, occlusion (e.g., occluded facial features or hands), and a substantial degree of head motion. The system employed various techniques to recover the head/face region: detecting the face in every frame and registering its relative location; detecting the eye region in every frame and using this region to stabilise the rest of the facial regions; and modelling the face and upper-body movement to compensate for region merging and to handle occlusion. Using these techniques, the numeric/symbolic information from the observations was computed and the pattern representations (i.e., feature vectors) were obtained. In this way, the input data were transformed into feature vectors containing a number of features that are descriptive of the affective states. Finally, extensive experiments (i.e., 12 affective states from 10 subjects and more than 500 videos) were carried out in order to obtain the affective state recognition machine of the FABO system. The first stage of this task was the recognition of the temporal segments (i.e., neutral-onset-apex-offset-neutral) of the face and body display. Experiments were conducted with the FABO data to find out whether the true temporal dynamics of the face and body gestures could be represented using features extracted with statistical and image processing techniques.
Detection of the temporal segments was achieved in two ways: static (frame-based) and dynamic (sequence-based). Validation with 10-fold cross-validation and holdout methods was explored. Affective state recognition was obtained in two stages: monomodal and bimodal. The monomodal recognition stage was based on training separate classifiers for the face and body feature vectors. Details of the training and testing procedures of various classifiers were provided and compared. Feature selection was also applied in order to reduce the dimensionality of the problem domain. For monomodal affective state recognition, a new two-stage approach was proposed and tested. Similarly, for bimodal affective state recognition, a three-stage approach with a synchronisation scheme was proposed and tested. Finally, details of the feature-level and decision-level fusion approaches were provided and the results compared. Results showed that emotion/affect recognition using the two modalities achieved better recognition accuracy in general, outperforming classification using the face or body modality alone. Early fusion seemed to achieve better recognition accuracy than late fusion, and the weighted sum rule proved the best way to fuse the two modalities at the decision level.
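
As an illustration only, the two fusion strategies named in the abstract can be sketched in Python as follows. The abstract does not give the thesis's classifiers, scores, or weights, so the function names, the `face_weight` parameter, and the example scores are all hypothetical, and the sketch shows only the general idea of feature-level (early) fusion and the weighted sum rule for decision-level (late) fusion.

```python
def feature_level_fusion(face_features, body_features):
    """Early fusion: join the two modality feature vectors into one vector.

    A single classifier would then be trained on the joined vector.
    """
    return list(face_features) + list(body_features)


def weighted_sum_fusion(face_scores, body_scores, face_weight=0.5):
    """Late fusion: combine per-class scores with the weighted sum rule.

    face_scores and body_scores map each class label to the score given by
    the face classifier and the body classifier respectively.
    face_weight is a hypothetical parameter; the body weight is 1 - face_weight.
    Returns the winning label and the fused score of every class.
    """
    if face_scores.keys() != body_scores.keys():
        raise ValueError("both classifiers must score the same classes")
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    body_weight = 1.0 - face_weight
    fused = {
        label: face_weight * face_scores[label] + body_weight * body_scores[label]
        for label in face_scores
    }
    best_label = max(fused, key=fused.get)
    return best_label, fused


# Made-up scores for three affective states (not results from the thesis).
face = {"happiness": 0.6, "anger": 0.3, "fear": 0.1}
body = {"happiness": 0.2, "anger": 0.7, "fear": 0.1}

label_equal, _ = weighted_sum_fusion(face, body, face_weight=0.5)
label_face_heavy, _ = weighted_sum_fusion(face, body, face_weight=0.75)
```

With these made-up scores, equal weights favour the body classifier's choice, while a larger face weight favours the face classifier's choice. In practice the weights would have to be tuned on held-out data.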