CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
- Publisher: ACM
- Publication Type: Conference Proceeding
- Citation: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1485-1494
- Issue Date: 2023-10-26
Closed Access
Filename | Description | Size
---|---|---
3581783.3611724.pdf | Published version | 1.9 MB
This item is closed access and not available.
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and to ensure that these maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features along their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block that, when stacked, captures fine-grained audio-visual combinatorial dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring that the decoded masks adhere to the given audio. Experimental results confirm our approach's effectiveness: our framework achieves new state-of-the-art performance on all three datasets using two backbones. The code is available at https://github.com/aspirinone/CATR.github.io.
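To make the decoupling idea concrete, the sketch below alternates temporal, spatial, and audio cross-attention passes so that no single attention step scales with the full T×HW token set. This is a minimal PyTorch sketch of the general technique, not the authors' CATR implementation; the class name `DecoupledAVBlock`, the tensor layout, and the single per-frame audio token are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class DecoupledAVBlock(nn.Module):
    """Illustrative decoupled spatio-temporal audio-visual block (not CATR itself).

    Attention runs over time and space in two separate passes, so the cost is
    O(T^2) per location plus O(HW^2) per frame instead of O((T*HW)^2) for
    joint attention; audio is injected afterwards via cross-attention.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)
        self.norm_a = nn.LayerNorm(d_model)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (B, T, HW, C) flattened frame features
        # audio: (B, T, C) one audio embedding per frame (an assumption here)
        B, T, HW, C = video.shape

        # 1) Temporal pass: each spatial location attends across frames.
        x = video.permute(0, 2, 1, 3).reshape(B * HW, T, C)
        q = self.norm_t(x)
        x = x + self.temporal_attn(q, q, q)[0]
        x = x.reshape(B, HW, T, C).permute(0, 2, 1, 3)

        # 2) Spatial pass: each frame attends over its own locations.
        y = x.reshape(B * T, HW, C)
        q = self.norm_s(y)
        y = y + self.spatial_attn(q, q, q)[0]

        # 3) Audio pass: visual tokens query the matching frame's audio token.
        a = audio.reshape(B * T, 1, C)
        y = y + self.audio_attn(self.norm_a(y), a, a)[0]
        return y.reshape(B, T, HW, C)


if __name__ == "__main__":
    block = DecoupledAVBlock()
    v = torch.randn(2, 5, 64, 256)  # 2 clips, 5 frames, 8x8 feature map
    a = torch.randn(2, 5, 256)      # one audio embedding per frame
    print(block(v, a).shape)        # torch.Size([2, 5, 64, 256])
```

Stacking several such blocks, as the abstract suggests, would let fine-grained audio-visual dependence accumulate while each individual pass stays memory-efficient.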