Recognition of Emotions in User-generated Videos with Transferred Emotion Intensity Learning

Zhang, H; Xu, M

Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type: Journal Article
Citation: IEEE Transactions on Multimedia, 2021, PP, (99), pp. 1-1
Issue Date: 2021-01-01

This item is open access.

The embargo period expires on 31 Dec 2023

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, H https://orcid.org/0000-0002-0021-3634
dc.contributor.author	Xu, M https://orcid.org/0000-0001-9581-8849
dc.date.accessioned	2022-01-26T23:59:02Z
dc.date.available	2022-01-26T23:59:02Z
dc.date.issued	2021-01-01
dc.identifier.citation	IEEE Transactions on Multimedia, 2021, PP, (99), pp. 1-1
dc.identifier.issn	1520-9210
dc.identifier.issn	1941-0077
dc.identifier.uri	http://hdl.handle.net/10453/153592
dc.language	en
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof	IEEE Transactions on Multimedia
dc.relation.isbasedon	10.1109/TMM.2021.3134167
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.subject	08 Information and Computing Sciences, 09 Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Recognition of Emotions in User-generated Videos with Transferred Emotion Intensity Learning
dc.type	Journal Article
utslib.citation.volume	PP
utslib.for	08 Information and Computing Sciences
utslib.for	09 Engineering
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - INEXT - Innovation in IT Services and Applications
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
utslib.copyright.status	open_access	*
utslib.copyright.embargo	2023-12-31T00:00:00+1000Z
dc.date.updated	2022-01-26T23:58:59Z
pubs.issue	99
pubs.publication-status	Published
pubs.volume	PP
utslib.citation.issue	99

Abstract:

Recognition of emotions in user-generated videos has attracted considerable research attention. Most existing approaches focus on learning frame-level features and fail to consider frame-level emotion intensities, which are critical for video representation. In this research, we aim to extract frame-level features and emotion intensities by transferring emotional information from an image emotion dataset. To achieve this goal, we propose an end-to-end network for joint emotion recognition and intensity learning with unsupervised adversarial adaptation. The proposed network consists of a classification stream, an intensity learning stream, and an adversarial adaptation module. The classification stream generates pseudo intensity maps with the class activation mapping method, and these maps are used to train the intensity learning subnetwork. The intensity learning stream is built upon an improved feature pyramid network in which features from different scales are cross-connected. The adversarial adaptation module reduces the domain difference between the source dataset and the target video frames. By aligning cross-domain features, we enable our network to learn on the source data while generalizing to video frames. Finally, we apply a weighted sum pooling method to the frame-level features and emotion intensities to generate video-level features. We evaluate the proposed method on two benchmark datasets, VideoEmotion-8 and Ekman-6. The experimental results show that the proposed method achieves improved performance compared to previous state-of-the-art methods.
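
The final pooling step in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the weights are derived, so the example assumes that each frame has already been reduced to a single non-negative intensity score and that the weights are those scores normalised to sum to 1. The function name `weighted_sum_pool` and the uniform fallback for all-zero intensities are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def weighted_sum_pool(
    frame_features: Sequence[Sequence[float]],
    frame_intensities: Sequence[float],
) -> List[float]:
    """Pool per-frame feature vectors into one video-level vector.

    Assumption (not stated in the abstract): each frame carries one scalar
    intensity, and the weights are the intensities normalised to sum to 1.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    if len(frame_features) != len(frame_intensities):
        raise ValueError("need exactly one intensity per frame")

    total = sum(frame_intensities)
    if total > 0:
        weights = [w / total for w in frame_intensities]
    else:
        # Hypothetical fallback: treat all frames equally if no frame
        # has a positive intensity.
        weights = [1.0 / len(frame_features)] * len(frame_features)

    dim = len(frame_features[0])
    video_feature = [0.0] * dim
    for feature, weight in zip(frame_features, weights):
        if len(feature) != dim:
            raise ValueError("all frame features must have the same length")
        for i, value in enumerate(feature):
            # Frames with higher emotion intensity contribute more.
            video_feature[i] += weight * value
    return video_feature


# Toy input: three frames with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
intensities = [0.5, 0.0, 1.5]
video_feature = weighted_sum_pool(features, intensities)
print(video_feature)
```

In this toy input the second frame has zero intensity, so it does not contribute to the video-level feature, while the third frame dominates. A real pipeline would presumably operate on learned feature tensors and intensity maps rather than Python lists.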

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/153592