Weakly Supervised Moment Localization with Decoupled Consistent Concept Prediction

Publisher:
SPRINGER
Publication Type:
Journal Article
Citation:
International Journal of Computer Vision, 2022, 130(5), pp. 1244–1258
Issue Date:
2022-05-01
Filename:
Weakly Supervised Moment Localization.pdf
Description:
Published version
Size:
3.32 MB
Format:
Adobe PDF
Abstract:
Localizing moments in a video via natural language queries is a challenging task in which models are trained to identify the start and end timestamps of the queried moment. However, obtaining temporal endpoint annotations is labor-intensive. In this paper, we focus on a weakly supervised setting, where the temporal endpoints of moments are not available during training. We develop a decoupled consistent concept prediction (DCCP) framework to learn the relations between videos and query texts. Specifically, atomic objects and actions are decoupled from the query text to facilitate the recognition of these concepts in videos. We introduce a concept pairing module to temporally localize the objects and actions in the video. We further propose a classification loss and a concept consistency loss to leverage the mutual benefits of object and action cues when building relations between language and video. Extensive experiments on DiDeMo, Charades-STA, and ActivityNet Captions demonstrate the effectiveness of our model.
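The abstract gives no implementation details, so the following is only a rough illustration of how a pairing-plus-consistency scheme of this kind might look in PyTorch. Everything here is an assumption for illustration: the names ConceptPairing and concept_consistency_loss, the dot-product attention formulation, the feature dimensions, and the symmetric-KL consistency term are invented for this sketch and are not the authors' actual architecture or loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptPairing(nn.Module):
    """Hypothetical pairing module: attends each decoupled text concept
    (an object or action word embedding) over per-frame video features,
    yielding a temporal attention map that localizes the concept."""
    def __init__(self, video_dim: int, text_dim: int, hidden_dim: int):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, frames: torch.Tensor, concepts: torch.Tensor):
        # frames:   (T, video_dim)  per-frame visual features
        # concepts: (K, text_dim)   embeddings of decoupled concept words
        v = self.video_proj(frames)             # (T, hidden_dim)
        c = self.text_proj(concepts)            # (K, hidden_dim)
        scores = c @ v.t() / v.size(-1) ** 0.5  # (K, T) concept-frame affinity
        attn = F.softmax(scores, dim=-1)        # temporal localization per concept
        paired = attn @ v                       # (K, hidden_dim) localized features
        return paired, attn

def concept_consistency_loss(obj_attn: torch.Tensor,
                             act_attn: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """Illustrative consistency term: an action and the objects it involves
    should be grounded in the same temporal region, so we penalize the
    divergence between the averaged object and action attention maps with
    a symmetric KL (an assumed formulation, not necessarily the paper's)."""
    p = obj_attn.mean(dim=0)  # (T,) mean temporal attention of object concepts
    q = act_attn.mean(dim=0)  # (T,) mean temporal attention of action concepts
    kl_pq = (p * ((p + eps) / (q + eps)).log()).sum()
    kl_qp = (q * ((q + eps) / (p + eps)).log()).sum()
    return 0.5 * (kl_pq + kl_qp)

# Toy usage with random features and made-up dimensions.
frames = torch.randn(64, 512)     # 64 video frames
obj_words = torch.randn(3, 300)   # e.g. embeddings of "dog", "ball", "park"
act_words = torch.randn(2, 300)   # e.g. embeddings of "running", "catching"
pairing = ConceptPairing(video_dim=512, text_dim=300, hidden_dim=256)
_, obj_attn = pairing(frames, obj_words)
_, act_attn = pairing(frames, act_words)
loss = concept_consistency_loss(obj_attn, act_attn)

The design intuition sketched here follows the abstract's claim of "mutual benefits of object and action cues": tying the two attention maps together lets a confidently localized action sharpen the grounding of its objects, and vice versa, without any temporal endpoint supervision.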