Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

Liu, C; Li, PP; Qi, X; Zhang, H; Li, L; Wang, D; Yu, X

Publisher: Association for Computing Machinery (ACM)
Publication Type: Conference Proceeding
Citation: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 7590-7598
Issue Date: 2023-10-26

Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Liu, C
dc.contributor.author	Li, PP
dc.contributor.author	Qi, X
dc.contributor.author	Zhang, H
dc.contributor.author	Li, L
dc.contributor.author	Wang, D
dc.contributor.author	Yu, X
dc.date.accessioned	2024-06-08T21:44:02Z
dc.date.available	2024-06-08T21:44:02Z
dc.date.issued	2023-10-26
dc.identifier.citation	MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 7590-7598
dc.identifier.uri	http://hdl.handle.net/10453/179460
dc.language	en
dc.publisher	Association for Computing Machinery (ACM)
dc.relation	http://purl.org/au-research/grants/arc/DP220100800
dc.relation	http://purl.org/au-research/grants/arc/DE230100477
dc.relation.ispartof	MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
dc.relation.ispartof	Proceedings of the 31st ACM International Conference on Multimedia
dc.relation.isbasedon	10.1145/3581783.3612373
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
dc.type	Conference Proceeding
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
dc.date.updated	2024-06-08T21:44:00Z
pubs.publication-status	Published

Abstract:

The audio-visual segmentation (AVS) task aims to segment sounding objects in a given video. Existing works mainly focus on fusing the audio and visual features of a video to obtain sounding-object masks. However, we observe that prior methods tend to segment a salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to this dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video with an object segmentation network, and then associates the sounding-object candidates with the given audio. We notice that an object can be a sounding object in one video but a silent one in another. This introduces ambiguity when training our object segmentation network, as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of the audio is unknown, especially when there are multiple sound sources, we propose to explore the audio-visual semantic correlation and then associate the audio with potential objects. Specifically, we apply the predicted audio category scores as attention over the potential instance masks, so that the scores highlight the corresponding sounding instances while suppressing inaudible ones. By enforcing the attended instance masks to resemble the ground-truth mask, we establish the audio-visual semantic correlation. Experimental results on the AVS benchmarks demonstrate that our method effectively segments sounding objects without being biased toward salient objects, and achieves state-of-the-art performance in both the single-source and multi-source scenarios.
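
The idea of attending audio category scores to candidate instance masks can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the dot-product weighting, the max-based fusion, and the binary cross-entropy loss are all assumptions chosen only to make the mechanism in the abstract concrete.

```python
import numpy as np

def attend_audio_to_instances(instance_masks, instance_class_probs, audio_class_scores):
    """Weight candidate instance masks by how well they match the audio.

    Hypothetical helper (assumed shapes, not from the paper):
      instance_masks:       (N, H, W) soft masks in [0, 1]
      instance_class_probs: (N, C) per-instance category probabilities
      audio_class_scores:   (C,) predicted audio category scores in [0, 1]
    """
    # Assumed rule: an instance is "sounding" to the degree that its
    # category distribution agrees with the audio category scores.
    weights = np.clip(instance_class_probs @ audio_class_scores, 0.0, 1.0)
    # Scale each mask by its weight: matching instances are highlighted,
    # inaudible ones are suppressed.
    attended = weights[:, None, None] * instance_masks
    # Assumed fusion: pixel-wise maximum over instances.
    fused = attended.max(axis=0)
    return fused, weights

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy between a fused mask and a ground-truth mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

# Toy scene: instance 0 (category 0) on the left, instance 1 (category 1) on the right.
masks = np.zeros((2, 4, 4))
masks[0, :, :2] = 1.0
masks[1, :, 2:] = 1.0
class_probs = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

# The audio mostly indicates category 1, so only instance 1 should be kept.
audio_scores = np.array([0.05, 0.9])
fused, weights = attend_audio_to_instances(masks, class_probs, audio_scores)

# Ground truth marks only the sounding object (instance 1).
gt = masks[1]
loss_match = bce_loss(fused, gt)

# With audio indicating the silent object, the loss against the same ground truth grows.
fused_wrong, _ = attend_audio_to_instances(masks, class_probs, np.array([0.9, 0.05]))
loss_mismatch = bce_loss(fused_wrong, gt)
```

In this toy setup, minimizing the loss between the fused mask and the ground-truth mask rewards audio scores that select the genuinely sounding instance, which is the sense in which the supervision ties audio categories to visual instances.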

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/179460