Toward a perceptive pretraining framework for Audio-Visual Video Parsing

Publisher:
Elsevier BV
Publication Type:
Journal Article
Citation:
Information Sciences, 2022, 609, pp. 897-912
Issue Date:
2022-09-01
Filename:
1-s2.0-S0020025522008404-main.pdf (Published version, Adobe PDF, 1.58 MB)
Abstract:
Audio-Visual Video Parsing (AVVP) is a multi-modal weakly supervised task that aims to detect and localize events by leveraging the partial alignment of audio and visual streams together with weak labels. We identify two significant challenges in AVVP: cross-modal semantic misalignment and contextual audio-visual dataset bias. For the first challenge, existing methods tend to rely on the temporal similarity of features. However, this is inappropriate for the AVVP task, because multi-modal features carrying the same label do not always share the same semantics. We therefore propose an instance-adaptive multi-modal time series max-margin loss (MTSM), which uses temporal information to align features adaptively. Furthermore, to restrict the unavoidable noise introduced during feature fusion, we reuse the MTSM formulation within each single modality. For the second challenge, we argue that bias mitigation should draw on model generalization. We therefore propose collocating pre-trained models, either exhaustively ("traverse") or based on domain adaptation. We first prove a hypothesis and then propose a method based on the Alternating Direction Method of Multipliers (ADMM) to decouple the search for the optimal pre-trained model collocation, which reduces the time cost. Experiments show that our method outperforms competing methods.
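To illustrate the kind of cross-modal max-margin alignment objective the abstract refers to, the snippet below sketches a generic temporal hinge loss between per-segment audio and visual features in PyTorch. This is a minimal sketch, not the paper's instance-adaptive MTSM: the feature shapes, the choice of same-time-step positives, and the margin value are assumptions made for illustration only.

    import torch
    import torch.nn.functional as F

    def temporal_max_margin_loss(audio_feats, visual_feats, margin=0.5):
        """Toy cross-modal max-margin alignment loss (illustrative only).

        audio_feats, visual_feats: (T, D) per-segment features of one video.
        The visual feature at the same time step is treated as the positive;
        features at all other time steps serve as negatives.
        """
        a = F.normalize(audio_feats, dim=-1)            # (T, D)
        v = F.normalize(visual_feats, dim=-1)           # (T, D)
        sim = a @ v.t()                                 # (T, T) cosine similarities
        pos = sim.diag().unsqueeze(1)                   # (T, 1) aligned pairs
        T = sim.size(0)
        neg_mask = ~torch.eye(T, dtype=torch.bool, device=sim.device)
        # hinge: each negative similarity should sit at least `margin` below the positive
        hinge = F.relu(margin - pos + sim)
        return hinge[neg_mask].mean()

    # usage with random per-segment features (10 segments, 512-dim)
    audio = torch.randn(10, 512)
    visual = torch.randn(10, 512)
    loss = temporal_max_margin_loss(audio, visual)

The same hinge expression can be reused within a single modality (e.g., audio segments against other audio segments) to penalize noisy fused features, which is the intra-modal reuse the abstract alludes to, though the exact formulation in the paper may differ.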