Few-shot activity recognition with cross-modal memory network

Zhang, L; Chang, X; Liu, J; Luo, M; Prakash, M; Hauptmann, AG

Publisher: ELSEVIER SCI LTD
Publication Type: Journal Article
Citation: Pattern Recognition, 2020, 108
Issue Date: 2020-12-01

Closed Access

Filename	Description	Size	Format
1-s2.0-S0031320320301515-main.pdf	Published version	1.71 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, L
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Liu, J
dc.contributor.author	Luo, M
dc.contributor.author	Prakash, M
dc.contributor.author	Hauptmann, AG
dc.date.accessioned	2023-03-31T10:14:17Z
dc.date.available	2023-03-31T10:14:17Z
dc.date.issued	2020-12-01
dc.identifier.citation	Pattern Recognition, 2020, 108
dc.identifier.issn	0031-3203
dc.identifier.issn	1873-5142
dc.identifier.uri	http://hdl.handle.net/10453/168976
dc.language	English
dc.publisher	ELSEVIER SCI LTD
dc.relation	http://purl.org/au-research/grants/arc/DE190100626
dc.relation.ispartof	Pattern Recognition
dc.relation.isbasedon	10.1016/j.patcog.2020.107348
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0806 Information Systems, 0906 Electrical and Electronic Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Few-shot activity recognition with cross-modal memory network
dc.type	Journal Article
utslib.citation.volume	108
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0806 Information Systems
utslib.for	0906 Electrical and Electronic Engineering
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	closed_access	*
dc.date.updated	2023-03-31T10:14:15Z
pubs.publication-status	Published
pubs.volume	108

Abstract:

Deep learning-based action recognition methods require large amounts of labelled training data. However, labelling large-scale video data is time-consuming and tedious. In this paper, we consider the more challenging problem of few-shot action recognition, where training samples are scarce. To address this problem, memory networks have been designed that use an external memory to retain the experience learned during training and then apply it to few-shot prediction at test time. However, existing memory-based methods update only the visual information in the memory and keep the label embeddings fixed, so they cannot adapt well to novel activities during testing. To alleviate this issue, we propose a novel end-to-end cross-modal memory network for few-shot activity recognition. Specifically, the proposed memory architecture stores dynamic visual and textual semantics for high-level attributes related to human activities. The learned memory can then provide effective multi-modal information for recognising new activities in the testing stage. Extensive experimental results on two video datasets, HMDB51 and UCF101, indicate that our method achieves significant improvements over previous methods.
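
The abstract describes the memory only at a high level, so the following is a minimal illustrative sketch of the general idea, not the authors' architecture. It assumes each memory slot stands for one attribute and holds a visual vector and a textual vector, that slots are addressed by dot-product similarity with a visual query, and that a write moves both vectors of the best-matching slot towards the new sample. The class name `CrossModalMemory`, the `rate` parameter, and the update rule are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class CrossModalMemory:
    """Toy memory (assumption): one slot per attribute, two modalities per slot."""

    def __init__(self, visual_slots, textual_slots):
        if len(visual_slots) != len(textual_slots):
            raise ValueError("each slot needs a visual and a textual vector")
        self.visual = [list(v) for v in visual_slots]
        self.textual = [list(t) for t in textual_slots]

    def read(self, query):
        """Attend over visual entries, return weights and a fused read-out."""
        weights = softmax([dot(query, v) for v in self.visual])
        vis = [sum(w * v[i] for w, v in zip(weights, self.visual))
               for i in range(len(self.visual[0]))]
        txt = [sum(w * t[i] for w, t in zip(weights, self.textual))
               for i in range(len(self.textual[0]))]
        # Concatenate the visual and textual read-outs into one vector.
        return weights, vis + txt

    def write(self, query, text_vec, rate=0.5):
        """Move BOTH modalities of the best-matching slot towards the sample.

        Hypothetical rule: a simple moving average, chosen only to show
        that the textual side is updated too rather than kept fixed.
        """
        scores = [dot(query, v) for v in self.visual]
        idx = scores.index(max(scores))
        self.visual[idx] = [(1 - rate) * m + rate * q
                            for m, q in zip(self.visual[idx], query)]
        self.textual[idx] = [(1 - rate) * m + rate * t
                             for m, t in zip(self.textual[idx], text_vec)]
        return idx
```

In this sketch the contrast drawn in the abstract shows up in `write`: a memory with fixed label embeddings would update only `self.visual`, whereas here the textual entry changes as well.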

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/168976