Automatic temporal segment detection and affect recognition from face and body display

Publication Type:
Journal Article
Citation:
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2009, 39 (1), pp. 64-84
Issue Date:
2009-01-01
File:
2008001853OK.pdf (Adobe PDF, 1.52 MB)
Abstract:
Psychologists have long explored the mechanisms with which humans recognize other humans' affective states from modalities such as voice and face display. This exploration has led to the identification of the main mechanisms, including the important role played in the recognition process by the modalities' dynamics. Constrained by human physiology, the temporal evolution of a modality appears to be well approximated by a sequence of temporal segments called onset, apex, and offset. Stemming from these findings, computer scientists, over the past 15 years, have proposed various methodologies to automate the recognition process. We note, however, two main limitations to date. The first is that much of the past research has focused on affect recognition from single modalities. The second is that even the few multimodal systems have not paid sufficient attention to the modalities' dynamics: the automatic determination of their temporal segments, their synchronization for the purpose of modality fusion, and their role in affect recognition are yet to be adequately explored. To address these issues, this paper focuses on affective face and body display, proposes a method to automatically detect their temporal segments or phases, explores whether the detection of the temporal phases can effectively support recognition of affective states, and recognizes affective states based on phase synchronization/alignment. The experimental results obtained show the following: 1) affective face and body displays are simultaneous but not strictly synchronous; 2) explicit detection of the temporal phases can improve the accuracy of affect recognition; 3) recognition from fused face and body modalities performs better than that from the face or the body modality alone; and 4) synchronized feature-level fusion achieves better performance than decision-level fusion. © 2008 IEEE.
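The onset/apex/offset segmentation described in the abstract can be pictured with a minimal sketch. The snippet below labels each frame of a hypothetical per-frame motion-intensity signal (for example, the summed displacement of tracked face or body points) using a simple slope-threshold heuristic. The function name, the signal, and the thresholds are assumptions chosen for illustration only; this is not the detection method proposed in the paper, which the abstract characterizes only at a high level.

```python
import numpy as np

def detect_temporal_phases(intensity, rise_thresh=0.05, fall_thresh=-0.05):
    """Label each frame as 'neutral', 'onset', 'apex', or 'offset'.

    intensity   : 1-D array of motion intensity per frame (arbitrary units).
    rise_thresh : slope above which the display is treated as still building up.
    fall_thresh : slope below which the display is treated as relaxing.
    Both thresholds are assumed values for this toy illustration.
    """
    intensity = np.asarray(intensity, dtype=float)
    # Smooth with a short moving average to suppress frame-to-frame jitter.
    smoothed = np.convolve(intensity, np.ones(3) / 3.0, mode="same")
    slope = np.gradient(smoothed)

    labels = []
    for s, v in zip(slope, smoothed):
        if v < 0.1 * smoothed.max():   # near-rest intensity -> neutral phase
            labels.append("neutral")
        elif s > rise_thresh:          # intensity still increasing -> onset
            labels.append("onset")
        elif s < fall_thresh:          # intensity decreasing -> offset
            labels.append("offset")
        else:                          # high-intensity plateau -> apex
            labels.append("apex")
    return labels

if __name__ == "__main__":
    # Toy episode: rest, build-up, sustained peak, relaxation, rest.
    episode = [0.0, 0.0, 0.2, 0.5, 0.9, 1.0, 1.0, 1.0, 1.0,
               0.9, 0.5, 0.2, 0.0, 0.0]
    for frame, phase in enumerate(detect_temporal_phases(episode)):
        print(frame, phase)
```

Once both the face and the body channel carry such phase labels, the synchronized feature-level fusion mentioned in the abstract can roughly be understood as combining the two channels' features over phase-aligned frames, in contrast to decision-level fusion, which merges the outputs of separate per-channel classifiers.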