Zero-Shot Video Grounding With Pseudo Query Lookup and Verification
- Publisher: Institute of Electrical and Electronics Engineers (IEEE)
- Publication Type: Journal Article
- Citation: IEEE Transactions on Image Processing, vol. 33, pp. 1643-1654, 2024
- Issue Date: 2024
Closed Access
| Filename | Description | Size |
|---|---|---|
| 1717015.pdf | Published version | 2.14 MB |
This item is closed access and not available.
Video grounding, the task of localizing the moment in an untrimmed video that matches a natural language query, has become a popular topic in video understanding. Fully supervised approaches, however, require large amounts of annotated data, which is expensive and time-consuming to collect. Recently, zero-shot video grounding (ZS-VG) methods have emerged that leverage pre-trained object detectors and language models to generate pseudo-supervision for training grounding models. These approaches nevertheless struggle to recognize diverse categories and to capture the specific dynamics and interactions in the video context. To tackle these challenges, we introduce a novel two-stage ZS-VG framework called Lookup-and-Verification (LoVe), which treats pseudo-query generation as a video-to-concept retrieval problem. LoVe extracts diverse concepts from an open-concept pool and employs a verification process to ensure that the retrieved concepts are relevant to the objects or events of interest in the video proposals. Comprehensive experimental results on the Charades-STA, ActivityNet-Captions, and DiDeMo datasets demonstrate the effectiveness of the LoVe framework.
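
The abstract describes the pipeline only at a high level. As a rough illustration, the Python sketch below shows one plausible reading of the lookup-and-verification idea for a single video proposal, assuming proposal and concept features live in a shared embedding space (e.g., CLIP-style features). The function name, cosine-similarity scoring, top-k lookup, and fixed verification threshold are all illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def lookup_and_verify(proposal_emb, concept_embs, concept_names,
                      top_k=5, verify_threshold=0.3):
    """Hypothetical sketch: retrieve candidate concepts for one video
    proposal (lookup), then keep only those whose relevance score passes
    a threshold (verification). The scoring and thresholding here are
    assumptions, not the paper's implementation."""
    # Lookup: cosine similarity between the proposal embedding and
    # every concept in the open-concept pool.
    p = proposal_emb / np.linalg.norm(proposal_emb)
    c = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)
    sims = c @ p
    top = np.argsort(sims)[::-1][:top_k]

    # Verification: discard retrieved concepts whose score falls below
    # the threshold, i.e. concepts unlikely to describe the proposal.
    return [(concept_names[i], float(sims[i]))
            for i in top if sims[i] >= verify_threshold]

# Example usage with random stand-in features (512-d, CLIP-like).
rng = np.random.default_rng(0)
proposal = rng.normal(size=512)
pool = rng.normal(size=(1000, 512))
names = [f"concept_{i}" for i in range(1000)]
print(lookup_and_verify(proposal, pool, names))
```

In the framework the abstract describes, the verified concepts would then feed the pseudo-query generation step used to train the grounding model.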