Zero-shot natural language-driven video analysis and synthesis
- Publication Type: Thesis
- Issue Date: 2024
This item is open access.
The rapid advancement of communication and multimedia technologies has established video as a key medium for information sharing, necessitating sophisticated tools for video understanding and production. This thesis examines the interplay between human interaction and video content, focusing on natural language-driven video analysis and synthesis in a zero-shot setting. This approach is pivotal because it addresses the twin challenges of limited video-text datasets and substantial computational requirements: by exploiting unlabeled videos or pretrained image models, it reduces the need for manual annotation.
In video analysis, the thesis introduces pseudo-supervised learning techniques that are inherently aligned with the zero-shot paradigm. The proposed Lookup-and-Verification (LoVe) framework dynamically generates pseudo queries for training video grounding models, significantly improving their performance while adhering to the zero-shot ethos of minimizing reliance on labeled data.
For video-text retrieval, the thesis presents the Pseudo-Supervised Selective Contrastive Learning (PS-SCL) framework, which improves the alignment between video content and text by generating pseudo-textual labels from unlabeled videos. Its selective contrastive learning strategy reinforces the zero-shot approach by making full use of the available unlabeled resources.
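To make the contrastive alignment idea concrete, the sketch below implements a standard symmetric InfoNCE-style objective over paired video and (pseudo-)text embeddings. This is a generic illustration only, not the thesis's PS-SCL method: PS-SCL additionally selects which pseudo-labelled pairs to trust, and the function names, embedding sizes, and temperature value here are illustrative assumptions.

```python
import numpy as np

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of matched (video, text) pairs.

    Generic sketch of contrastive video-text alignment; the selective
    pair-weighting of the thesis's PS-SCL framework is NOT modelled here.
    """
    # L2-normalise so the dot product becomes cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(v))          # matched pairs lie on the diagonal

    def cross_entropy(logits, labels):
        # Numerically stable softmax cross-entropy, averaged over the batch.
        z = logits - logits.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
loss_random = contrastive_alignment_loss(v, rng.normal(size=(4, 8)))
loss_matched = contrastive_alignment_loss(v, v)  # perfectly aligned pairs
```

Perfectly aligned pairs put the largest logits on the diagonal, so `loss_matched` is driven toward zero, whereas unrelated pairs give a loss near the random-guessing level; minimising this objective is what pulls video and pseudo-text embeddings into a shared space.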
In video synthesis, the thesis distinguishes between short and long video synthesis. For short videos, the proposed FlowZero framework leverages large language models to plan spatiotemporal dynamics, producing temporally coherent videos with vivid object motion without any additional video training.
For long video synthesis, a large-scale video-story dataset enables multi-sentence stories to be illustrated through sequential retrieval of video clips, exemplifying zero-shot learning by handling complex stories with minimal supervision.
This thesis significantly advances AI systems for video analysis and synthesis, offering novel methodologies that enhance video engagement and pave the way for innovative video content creation.