CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
- Publisher: ACM
- Publication Type: Conference Proceeding
- Citation: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1485-1494
- Issue Date: 2023-10-26
Closed Access
Filename | Description | Size
---|---|---
3581783.3611724.pdf | Published version | 1.9 MB
This item is closed access and not available.
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and to ensure that these maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features along their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block that, when stacked, captures fine-grained audio-visual combinatorial dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring that the decoded masks adhere to the given audio. Experimental results confirm our approach's effectiveness: our framework achieves new state-of-the-art performance on all three datasets using two backbones. The code is available at https://github.com/aspirinone/CATR.github.io.
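To make the decoupling idea concrete, the sketch below alternates temporal, spatial, and audio cross-attention passes so that no single attention step scales with the full T×HW token set. This is a minimal PyTorch sketch of the general technique, not the authors' CATR implementation; the class name `DecoupledAVBlock`, the tensor layout, and the single per-frame audio token are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class DecoupledAVBlock(nn.Module):
    """Illustrative decoupled spatio-temporal audio-visual block (not CATR itself).

    Attention runs over time and space in two separate passes, so the cost is
    O(T^2) per location plus O(HW^2) per frame instead of O((T*HW)^2) for
    joint attention; audio is injected afterwards via cross-attention.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)
        self.norm_a = nn.LayerNorm(d_model)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (B, T, HW, C) flattened frame features
        # audio: (B, T, C) one audio embedding per frame (an assumption here)
        B, T, HW, C = video.shape

        # 1) Temporal pass: each spatial location attends across frames.
        x = video.permute(0, 2, 1, 3).reshape(B * HW, T, C)
        q = self.norm_t(x)
        x = x + self.temporal_attn(q, q, q)[0]
        x = x.reshape(B, HW, T, C).permute(0, 2, 1, 3)

        # 2) Spatial pass: each frame attends over its own locations.
        y = x.reshape(B * T, HW, C)
        q = self.norm_s(y)
        y = y + self.spatial_attn(q, q, q)[0]

        # 3) Audio pass: visual tokens query the matching frame's audio token.
        a = audio.reshape(B * T, 1, C)
        y = y + self.audio_attn(self.norm_a(y), a, a)[0]
        return y.reshape(B, T, HW, C)


if __name__ == "__main__":
    block = DecoupledAVBlock()
    v = torch.randn(2, 5, 64, 256)  # 2 clips, 5 frames, 8x8 feature map
    a = torch.randn(2, 5, 256)      # one audio embedding per frame
    print(block(v, a).shape)        # torch.Size([2, 5, 64, 256])
```

Stacking several such blocks, as the abstract suggests, would let fine-grained audio-visual dependence accumulate while each individual pass stays memory-efficient.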