Weakly Supervised Moment Localization with Decoupled Consistent Concept Prediction
- Publisher: Springer
- Publication Type: Journal Article
- Citation: International Journal of Computer Vision, 2022, 130(5), pp. 1244-1258
- Issue Date: 2022-05-01
Closed Access
Filename | Description | Size
---|---|---
Weakly Supervised Moment Localization.pdf | Published version | 3.32 MB
This item is closed access and not available.
Localizing moments in a video via natural language queries is a challenging task in which a model must identify the start and end timestamps of the queried moment. However, temporal endpoint annotations are labor-intensive to obtain. In this paper, we focus on a weakly supervised setting, where the temporal endpoints of moments are not available during training. We develop a decoupled consistent concept prediction (DCCP) framework to learn the relations between videos and query texts. Specifically, atomic objects and actions are decoupled from the query text to facilitate recognizing these concepts in videos. We introduce a concept pairing module that temporally localizes the objects and actions in the video. A classification loss and a concept consistency loss are proposed to exploit the mutual benefits of object and action cues for building relations between language and video. Extensive experiments on DiDeMo, Charades-STA, and ActivityNet Captions demonstrate the effectiveness of our model.
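Since the full text is closed access here, the following is only a minimal PyTorch sketch of how a concept pairing module and the two abstract-level losses could be wired together. The cosine-similarity pairing, the max-pooling aggregation for the classification loss, the MSE-based consistency term, and all tensor shapes and function names (`concept_pairing_scores`, `dccp_losses`) are illustrative assumptions, not the authors' actual formulation.

```python
import torch
import torch.nn.functional as F


def concept_pairing_scores(video_feats: torch.Tensor,
                           concept_embs: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each video clip and each concept.

    video_feats: (T, d) per-clip features; concept_embs: (K, d) embeddings
    of the objects or actions decoupled from the query text.
    Returns a (T, K) matrix of per-clip concept scores.
    """
    v = F.normalize(video_feats, dim=-1)
    c = F.normalize(concept_embs, dim=-1)
    return v @ c.t()


def dccp_losses(video_feats, object_embs, action_embs,
                object_labels, action_labels):
    """Hypothetical classification + consistency losses.

    object_labels / action_labels: (K_o,) / (K_a,) multi-hot float targets
    marking which vocabulary concepts appear in the paired query.
    """
    obj_scores = concept_pairing_scores(video_feats, object_embs)  # (T, K_o)
    act_scores = concept_pairing_scores(video_feats, action_embs)  # (T, K_a)

    # Classification loss: max-pool scores over time so the video only has
    # to contain each queried concept somewhere, then treat recognition as
    # a multi-label problem. (A real model would likely scale the cosine
    # similarities by a learned temperature before using them as logits.)
    cls_loss = (
        F.binary_cross_entropy_with_logits(obj_scores.max(dim=0).values,
                                           object_labels)
        + F.binary_cross_entropy_with_logits(act_scores.max(dim=0).values,
                                             action_labels)
    )

    # Consistency loss: the temporal attention induced by object concepts
    # should agree with that induced by the actions they take part in,
    # since both describe the same moment.
    obj_attn = obj_scores.softmax(dim=0).mean(dim=1)  # (T,)
    act_attn = act_scores.softmax(dim=0).mean(dim=1)  # (T,)
    consistency_loss = F.mse_loss(obj_attn, act_attn)

    return cls_loss, consistency_loss


if __name__ == "__main__":
    # Toy shapes: 32 clips, 256-d features, 5 object and 3 action concepts.
    T, d = 32, 256
    cls_l, con_l = dccp_losses(
        torch.randn(T, d), torch.randn(5, d), torch.randn(3, d),
        torch.randint(0, 2, (5,)).float(), torch.randint(0, 2, (3,)).float())
    print(cls_l.item(), con_l.item())
```

Note the weak supervision in this sketch: both losses use only video-level query concepts, never ground-truth start/end timestamps, which matches the setting the abstract describes.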