Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos

Li, J; Xie, J; Zhu, L; Qian, L; Tang, S; Zhang, W; Shi, H; Zhang, S; Wei, L; Tian, Q; Zhuang, Y

Publisher: Association for Computing Machinery (ACM)
Publication Type: Conference Proceeding
Citation: MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 5083-5092
Issue Date: 2022-10-10

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Li, J
dc.contributor.author	Xie, J
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557
dc.contributor.author	Qian, L
dc.contributor.author	Tang, S
dc.contributor.author	Zhang, W
dc.contributor.author	Shi, H
dc.contributor.author	Zhang, S
dc.contributor.author	Wei, L
dc.contributor.author	Tian, Q
dc.contributor.author	Zhuang, Y
dc.date.accessioned	2023-05-19T05:44:19Z
dc.date.available	2023-05-19T05:44:19Z
dc.date.issued	2022-10-10
dc.identifier.citation	MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 5083-5092
dc.identifier.isbn	9781450392037
dc.identifier.uri	http://hdl.handle.net/10453/170376
dc.language	en
dc.publisher	Association for Computing Machinery (ACM)
dc.relation.ispartof	MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
dc.relation.ispartof	Proceedings of the 30th ACM International Conference on Multimedia
dc.relation.isbasedon	10.1145/3503161.3547886
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	open_access	*
dc.date.updated	2023-05-19T05:44:17Z
pubs.publication-status	Published

Abstract:

Understanding human emotions is a crucial ability for intelligent robots to provide better human-robot interactions. Existing works are limited to trimmed video-level emotion classification and fail to locate the temporal window corresponding to each emotion. In this paper, we introduce a new task, Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles. TEL presents three unique challenges compared to temporal action localization: 1) emotions have extremely varied temporal dynamics; 2) emotion cues are embedded in both appearances and complex plots; 3) fine-grained temporal annotations are complicated and labor-intensive. To address the first two challenges, we propose a novel dilated context integrated network with a coarse-fine two-stream architecture. The coarse stream captures varied temporal dynamics by modeling multi-granularity temporal contexts. The fine stream understands complex plots by reasoning about the dependencies among the multi-granularity temporal contexts from the coarse stream and adaptively integrating them into fine-grained video segment features. To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitles to achieve weakly supervised learning. We contribute a new test set with 3,000 manually annotated temporal boundaries so that future research on TEL can be quantitatively evaluated. Extensive experiments show the effectiveness of our approach on temporal emotion localization. The repository of this work is at https://github.com/YYJMJC/TemporalEmotion-Localization-in-Videos.
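
The coarse-fine idea described in the abstract can be illustrated with a toy sketch. It is not the authors' network; their implementation is in the repository linked above. The sketch assumes that each video segment is a small feature vector, that a "temporal context" at one granularity is the mean of neighbouring segments sampled with a given dilation rate, and that "adaptive integration" is a softmax-weighted sum of those contexts added back to the segment feature. The function names `dilated_context` and `integrate_contexts`, the dilation rates, and the dot-product scoring are all hypothetical choices made for illustration.

```python
import math


def dilated_context(features, dilation, radius=1):
    """Coarse stream (toy version): for each segment, average the features of
    neighbours spaced `dilation` steps apart. Out-of-range neighbours are skipped."""
    n = len(features)
    dim = len(features[0])
    contexts = []
    for t in range(n):
        idxs = [t + k * dilation for k in range(-radius, radius + 1)]
        idxs = [i for i in idxs if 0 <= i < n]
        contexts.append(
            [sum(features[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        )
    return contexts


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_contexts(features, dilations=(1, 2, 4)):
    """Fine stream (toy version): weight the multi-granularity contexts by their
    similarity to each segment and add the mixture to the segment feature."""
    contexts = [dilated_context(features, d) for d in dilations]
    dim = len(features[0])
    fused = []
    for t, segment in enumerate(features):
        # Scaled dot-product score between the segment and each context (an assumption).
        scores = [
            sum(a * b for a, b in zip(segment, ctx[t])) / math.sqrt(dim)
            for ctx in contexts
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * ctx[t][d] for w, ctx in zip(weights, contexts))
            for d in range(dim)
        ]
        # Residual connection keeps the fine-grained segment information.
        fused.append([s + m for s, m in zip(segment, mixed)])
    return fused


if __name__ == "__main__":
    # Eight hypothetical two-dimensional segment features.
    segments = [[float(t), 1.0] for t in range(8)]
    for row in integrate_contexts(segments):
        print([round(v, 3) for v in row])
```

Larger dilation rates let a segment see context further away in time without increasing the number of neighbours that are averaged, which is one simple way to cover emotions of very different durations.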

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/170376