Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 1326-1335
Issue Date: 2021-11-13
Abstract:
We investigate the weakly-supervised audio-visual video parsing task, which aims to parse a video into temporal event segments and predict the audible or visible event categories. The task is challenging because only video-level event labels are available for training, with no indication of temporal boundaries or modalities. Previous works use these overall event labels to supervise both the audio and visual model predictions. However, we argue that such overall labels harm model training because of audio-visual asynchrony: for example, commentators speak in a basketball video, yet the speakers are never visible. In this paper, we tackle this issue by leveraging the cross-modal correspondence of audio and visual signals. We generate reliable event labels individually for each modality by swapping audio and visual tracks with those of other, unrelated videos. If the original visual/audio data contain event clues, the event prediction on the newly assembled data will remain highly confident; in this way, we protect our models from being misled by ambiguous event labels. In addition, we propose a cross-modal audio-visual contrastive learning objective that induces temporal discrimination in the attention models within videos, i.e., it urges the model to pick out the current temporal segment from all context candidates. Experiments show that we outperform state-of-the-art methods by a large margin.
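To make the track-swapping idea concrete, below is a minimal PyTorch sketch of the modality-specific label refinement described in the abstract. It assumes a trained audio-visual model that maps (audio, visual) features to per-class event probabilities; the function name refine_modality_labels, the confidence threshold, the batch-flip used to obtain "unrelated" tracks, and the toy demo model are all illustrative assumptions rather than the authors' released implementation.

```python
import torch


def refine_modality_labels(model, audio, video, video_labels,
                           swap_audio, swap_video, threshold=0.5):
    """Illustrative label-denoising step (assumed interface, not the paper's code).

    Re-predict events after swapping one modality with an unrelated video.
    If the prediction stays confident, the remaining (original) modality
    likely carries the event, so the video-level label is kept for it.
    `model(audio, video)` is assumed to return per-class probabilities.
    """
    with torch.no_grad():
        # Replace the audio track with that of an unrelated video:
        # any surviving confidence must come from the visual stream.
        p_visual = model(swap_audio, video)
        # Replace the visual track: surviving confidence comes from audio.
        p_audio = model(audio, swap_video)

    # Keep a video-level label for a modality only if the prediction on the
    # swapped input is still confident for that event class.
    visual_labels = video_labels * (p_visual > threshold).float()
    audio_labels = video_labels * (p_audio > threshold).float()
    return audio_labels, visual_labels


if __name__ == "__main__":
    # Toy demonstration with random features and a dummy "model" that
    # averages temporally pooled audio/visual features through a linear head.
    B, T, D, C = 2, 10, 128, 25
    head = torch.nn.Linear(2 * D, C)

    def model(a, v):
        pooled = torch.cat([a.mean(1), v.mean(1)], dim=-1)
        return torch.sigmoid(head(pooled))

    audio, video = torch.randn(B, T, D), torch.randn(B, T, D)
    # "Unrelated" tracks: here we simply pair each video with another
    # item in the batch by flipping the batch dimension.
    swap_audio, swap_video = audio.flip(0), video.flip(0)
    video_labels = torch.randint(0, 2, (B, C)).float()

    audio_labels, visual_labels = refine_modality_labels(
        model, audio, video, video_labels, swap_audio, swap_video)
    print(audio_labels.shape, visual_labels.shape)
```

The point of the sketch is the gating logic: a modality keeps a video-level label only if the prediction remains confident after the other modality has been replaced, mirroring the cross-modal corroboration argument in the abstract.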