Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

Jia, C; Luo, M; Chang, X; Dang, Z; Han, M; Wang, M; Dai, G; Dang, S; Wang, J

Publisher: Association for Computing Machinery (ACM)
Publication Type: Conference Proceeding
Citation: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4640-4649
Issue Date: 2024-10-28

Closed Access

Filename	Description	Size
Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition.pdf	Accepted version	6.38 MB

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Jia, C
dc.contributor.author	Luo, M
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Dang, Z
dc.contributor.author	Han, M
dc.contributor.author	Wang, M
dc.contributor.author	Dai, G
dc.contributor.author	Dang, S
dc.contributor.author	Wang, J
dc.date.accessioned	2025-01-24T05:30:42Z
dc.date.available	2025-01-24T05:30:42Z
dc.date.issued	2024-10-28
dc.identifier.citation	MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4640-4649
dc.identifier.uri	http://hdl.handle.net/10453/184198
dc.language	en
dc.publisher	Association for Computing Machinery (ACM)
dc.relation.ispartof	MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
dc.relation.ispartof	Proceedings of the 32nd ACM International Conference on Multimedia
dc.relation.isbasedon	10.1145/3664647.3680690
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition
dc.type	Conference Proceeding
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/UTS Groups
pubs.organisational-group	University of Technology Sydney/UTS Groups/Australian Artificial Intelligence Institute (AAII)
utslib.copyright.status	closed_access	*
dc.date.updated	2025-01-24T05:30:39Z
pubs.publication-status	Published

Abstract:

Exploring open-vocabulary video action recognition is a promising venture, which aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet, they compromise with standard less-informative action descriptions, thus faltering when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we harness the knowledge in LLMs to produce a set of descriptive sentences that contain distinctive features for identifying given actions. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism to align concepts in video and textual knowledge encapsulated within the prompts. Extensive experiments on various video benchmarks, including zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new SOTA performance but also possesses excellent interpretability.
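
The abstract's core idea, scoring a video against several descriptive sentences per action rather than against a bare class name, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `cosine`, `score_actions`, and `predict` are hypothetical, the embeddings are hand-written toy vectors standing in for the outputs of a video encoder and a text encoder, and averaging cosine similarity over the descriptors is an assumed aggregation rule chosen for simplicity. The multi-modal alignment mechanism mentioned in the abstract is not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_actions(video_emb, descriptor_embs):
    """Average the video-to-descriptor similarity for every action.

    descriptor_embs maps an action name to a list of embeddings, one per
    descriptive sentence (assumed to come from an LLM plus a text encoder).
    """
    return {
        action: sum(cosine(video_emb, d) for d in descs) / len(descs)
        for action, descs in descriptor_embs.items()
    }


def predict(video_emb, descriptor_embs):
    """Return the action whose descriptors best match the video on average."""
    scores = score_actions(video_emb, descriptor_embs)
    return max(scores, key=scores.get)


# Toy data (assumption): two descriptor embeddings per action, 3-D for brevity.
descriptor_embs = {
    "archery": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "juggling": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}
video_emb = [0.9, 0.1, 0.0]

if __name__ == "__main__":
    print(predict(video_emb, descriptor_embs))
    print(score_actions(video_emb, descriptor_embs))
```

Because each descriptor contributes its own similarity value, the per-sentence scores can be inspected to see which described features drove a prediction, which is one plausible reading of the interpretability claim in the abstract.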

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/184198