Weakly Supervised Moment Localization with Decoupled Consistent Concept Prediction
- Publisher: Springer
- Publication Type: Journal Article
- Citation: International Journal of Computer Vision, 2022, 130(5), pp. 1244-1258
- Issue Date: 2022-05-01
Closed Access
Filename | Description | Size
---|---|---
Weakly Supervised Moment Localization.pdf | Published version | 3.32 MB
This item is closed access and not available.
Localizing moments in a video via natural language queries is a challenging task in which a model must identify the start and end timestamps of the queried moment. However, temporal endpoint annotations are labor-intensive to obtain. In this paper, we focus on a weakly supervised setting, where the temporal endpoints of moments are not available during training. We develop a decoupled consistent concept prediction (DCCP) framework to learn the relations between videos and query texts. Specifically, atomic objects and actions are decoupled from the query text to facilitate recognizing these concepts in videos. We introduce a concept pairing module that temporally localizes the objects and actions in the video. A classification loss and a concept consistency loss are proposed to exploit the mutual benefits of object and action cues for building relations between language and video. Extensive experiments on DiDeMo, Charades-STA, and ActivityNet Captions demonstrate the effectiveness of our model.
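Since the full text is closed access here, the following is only a minimal PyTorch sketch of how a concept pairing module and the two abstract-level losses could be wired together. The cosine-similarity pairing, the max-pooling aggregation for the classification loss, the MSE-based consistency term, and all tensor shapes and function names (`concept_pairing_scores`, `dccp_losses`) are illustrative assumptions, not the authors' actual formulation.

```python
import torch
import torch.nn.functional as F


def concept_pairing_scores(video_feats: torch.Tensor,
                           concept_embs: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each video clip and each concept.

    video_feats: (T, d) per-clip features; concept_embs: (K, d) embeddings
    of the objects or actions decoupled from the query text.
    Returns a (T, K) matrix of per-clip concept scores.
    """
    v = F.normalize(video_feats, dim=-1)
    c = F.normalize(concept_embs, dim=-1)
    return v @ c.t()


def dccp_losses(video_feats, object_embs, action_embs,
                object_labels, action_labels):
    """Hypothetical classification + consistency losses.

    object_labels / action_labels: (K_o,) / (K_a,) multi-hot float targets
    marking which vocabulary concepts appear in the paired query.
    """
    obj_scores = concept_pairing_scores(video_feats, object_embs)  # (T, K_o)
    act_scores = concept_pairing_scores(video_feats, action_embs)  # (T, K_a)

    # Classification loss: max-pool scores over time so the video only has
    # to contain each queried concept somewhere, then treat recognition as
    # a multi-label problem. (A real model would likely scale the cosine
    # similarities by a learned temperature before using them as logits.)
    cls_loss = (
        F.binary_cross_entropy_with_logits(obj_scores.max(dim=0).values,
                                           object_labels)
        + F.binary_cross_entropy_with_logits(act_scores.max(dim=0).values,
                                             action_labels)
    )

    # Consistency loss: the temporal attention induced by object concepts
    # should agree with that induced by the actions they take part in,
    # since both describe the same moment.
    obj_attn = obj_scores.softmax(dim=0).mean(dim=1)  # (T,)
    act_attn = act_scores.softmax(dim=0).mean(dim=1)  # (T,)
    consistency_loss = F.mse_loss(obj_attn, act_attn)

    return cls_loss, consistency_loss


if __name__ == "__main__":
    # Toy shapes: 32 clips, 256-d features, 5 object and 3 action concepts.
    T, d = 32, 256
    cls_l, con_l = dccp_losses(
        torch.randn(T, d), torch.randn(5, d), torch.randn(3, d),
        torch.randint(0, 2, (5,)).float(), torch.randint(0, 2, (3,)).float())
    print(cls_l.item(), con_l.item())
```

Note the weak supervision in this sketch: both losses use only video-level query concepts, never ground-truth start/end timestamps, which matches the setting the abstract describes.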