Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition
- Publisher: Association for Computing Machinery (ACM)
- Publication Type: Conference Proceeding
- Citation: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4640-4649
- Issue Date: 2024-10-28
Filename | Description | Size
---|---|---
Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition.pdf | Accepted version | 6.38 MB
This item is closed access and not available.
Exploring open-vocabulary video action recognition is a promising venture that aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet they settle for standard, less-informative action descriptions and thus falter when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we harness the knowledge in LLMs to produce a set of descriptive sentences that contain distinctive features for identifying given actions. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism to align video concepts with the textual knowledge encapsulated within the prompts. Extensive experiments on various video benchmarks, including zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new state-of-the-art performance but also possesses excellent interpretability.
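The abstract gives no implementation details, but the general recipe it outlines (LLM-generated, action-conditioned prompt sets matched against pooled video features with an image-text backbone) can be sketched as follows. This is a minimal illustration and not the authors' released code: the CLIP backbone, the hard-coded example prompts (standing in for LLM output), and the simple frame mean-pooling are all assumptions made for demonstration.

```python
# Illustrative sketch: score a video clip against LLM-generated,
# action-conditioned prompts using a CLIP-style backbone.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical prompts of the kind an LLM might produce when asked for
# distinctive visual cues of each action category.
action_prompts = {
    "archery": [
        "a person draws a bowstring back with one arm extended",
        "an arrow is aimed at a distant circular target",
    ],
    "juggling": [
        "several balls are tossed and caught in a repeating arc",
        "a performer keeps multiple objects airborne with both hands",
    ],
}

@torch.no_grad()
def class_text_embeddings(prompts_by_class):
    """Encode each class's prompt set and average it into one text embedding."""
    embeddings = []
    for prompts in prompts_by_class.values():
        tokens = clip.tokenize(prompts).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        embeddings.append(feats.mean(dim=0))
    embeddings = torch.stack(embeddings)
    return embeddings / embeddings.norm(dim=-1, keepdim=True)

@torch.no_grad()
def video_embedding(frames):
    """Encode sampled frames (N, 3, 224, 224) and mean-pool over time."""
    feats = model.encode_image(frames.to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    pooled = feats.mean(dim=0)
    return pooled / pooled.norm()

@torch.no_grad()
def classify(frames, prompts_by_class):
    """Return a class-probability dict for one clip via cosine similarity."""
    text_emb = class_text_embeddings(prompts_by_class)  # (C, D)
    vid_emb = video_embedding(frames)                    # (D,)
    probs = (100.0 * vid_emb @ text_emb.T).softmax(dim=-1)
    return dict(zip(prompts_by_class.keys(), probs.tolist()))

# Usage: `frames` would come from a video decoder plus `preprocess`, e.g.
# 8 uniformly sampled frames stacked into a (8, 3, 224, 224) tensor.
```

Note that this sketch only covers the prompt-matching side; the paper's multi-modal action knowledge alignment mechanism and any temporal modeling beyond mean-pooling are not represented here.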