Toward a perceptive pretraining framework for Audio-Visual Video Parsing

Publisher:
Elsevier BV
Publication Type:
Journal Article
Citation:
Information Sciences, 2022, 609, pp. 897-912
Issue Date:
2022-09-01
Filename:
1-s2.0-S0020025522008404-main.pdf (Published version, Adobe PDF, 1.58 MB)
Abstract:
Audio-Visual Video Parsing (AVVP) is a multi-modal weakly supervised task that aims to detect and localize events by leveraging the partial alignment of audio and visual streams together with weak labels. We identify two significant challenges in AVVP: cross-modal semantic misalignment and contextual audio-visual dataset bias. For the first challenge, existing methods tend to rely on the temporal similarity of features. However, this is inappropriate for the AVVP task, because multi-modal features carrying the same label do not always share the same semantics. We therefore propose an instance-adaptive multi-modal time series max-margin loss (MTSM), which uses temporal information to align features adaptively. Furthermore, to restrict the unavoidable noise introduced during feature fusion, we reuse the MTSM formulation within each single modality. For the second challenge, we argue that bias mitigation should draw on model generalization. We therefore propose collocating pre-trained models, either exhaustively ("traverse") or based on domain adaptation. We first prove a hypothesis and then propose a method based on the Alternating Direction Method of Multipliers (ADMM) to decouple the search for the optimal pre-trained model collocation, which reduces the time cost. Experiments show that our method outperforms competing methods.
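To illustrate the kind of cross-modal max-margin alignment objective the abstract refers to, the snippet below sketches a generic temporal hinge loss between per-segment audio and visual features in PyTorch. This is a minimal sketch, not the paper's instance-adaptive MTSM: the feature shapes, the choice of same-time-step positives, and the margin value are assumptions made for illustration only.

    import torch
    import torch.nn.functional as F

    def temporal_max_margin_loss(audio_feats, visual_feats, margin=0.5):
        """Toy cross-modal max-margin alignment loss (illustrative only).

        audio_feats, visual_feats: (T, D) per-segment features of one video.
        The visual feature at the same time step is treated as the positive;
        features at all other time steps serve as negatives.
        """
        a = F.normalize(audio_feats, dim=-1)            # (T, D)
        v = F.normalize(visual_feats, dim=-1)           # (T, D)
        sim = a @ v.t()                                 # (T, T) cosine similarities
        pos = sim.diag().unsqueeze(1)                   # (T, 1) aligned pairs
        T = sim.size(0)
        neg_mask = ~torch.eye(T, dtype=torch.bool, device=sim.device)
        # hinge: each negative similarity should sit at least `margin` below the positive
        hinge = F.relu(margin - pos + sim)
        return hinge[neg_mask].mean()

    # usage with random per-segment features (10 segments, 512-dim)
    audio = torch.randn(10, 512)
    visual = torch.randn(10, 512)
    loss = temporal_max_margin_loss(audio, visual)

The same hinge expression can be reused within a single modality (e.g., audio segments against other audio segments) to penalize noisy fused features, which is the intra-modal reuse the abstract alludes to, though the exact formulation in the paper may differ.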