Learning to Anticipate Egocentric Actions by Imagination.

Wu, Y; Zhu, L; Wang, X; Yang, Y; Wu, F

Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Publication Type: Journal Article
Citation: IEEE Trans Image Process, 2021, 30, pp. 1143-1152
Issue Date: 2021

Closed Access

There are no files associated with this item.

Full metadata record

Field	Value	Language
dc.contributor.author	Wu, Y https://orcid.org/0000-0002-1680-8253
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557
dc.contributor.author	Wang, X
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.contributor.author	Wu, F
dc.date.accessioned	2022-04-12T01:48:37Z
dc.date.available	2022-04-12T01:48:37Z
dc.date.issued	2021
dc.identifier.citation	IEEE Trans Image Process, 2021, 30, pp. 1143-1152
dc.identifier.issn	1057-7149
dc.identifier.issn	1941-0042
dc.identifier.uri	http://hdl.handle.net/10453/156110
dc.description.abstract	Anticipating actions before they are executed is crucial for a wide range of practical applications, including autonomous driving and robotics. In this paper, we study the egocentric action anticipation task, which predicts a future action seconds before it is performed in egocentric videos. Previous approaches focus on summarizing the observed content and directly predicting the future action from past observations. We believe action anticipation would benefit from mining cues that compensate for the missing information in the unobserved frames. We therefore propose to decompose action anticipation into a series of future feature predictions. We imagine how the visual features change in the near future and then predict future action labels based on these imagined representations. Instead of feature regression, our ImagineRNN is optimized with contrastive learning. We use a proxy task to train the ImagineRNN, i.e., selecting the correct future states from distractors. We further improve ImagineRNN by residual anticipation, i.e., changing its target to predicting the feature difference between adjacent frames instead of the frame content. This encourages the network to focus on our target, i.e., the future action, as the difference between adjacent frame features is more important for forecasting the future. Extensive experiments on two large-scale egocentric action datasets validate the effectiveness of our method. Our method significantly outperforms previous methods on both the seen test set and the unseen test set of the EPIC Kitchens Action Anticipation Challenge.
dc.format	Print-Electronic
dc.language	eng
dc.publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
dc.relation	http://purl.org/au-research/grants/arc/DP200100938
dc.relation.ispartof	IEEE Trans Image Process
dc.relation.isbasedon	10.1109/TIP.2020.3040521
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0906 Electrical and Electronic Engineering, 1702 Cognitive Sciences
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Learning to Anticipate Egocentric Actions by Imagination.
dc.type	Journal Article
utslib.citation.volume	30
utslib.location.activity	United States
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0906 Electrical and Electronic Engineering
utslib.for	1702 Cognitive Sciences
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
dc.date.updated	2022-04-12T01:48:33Z
pubs.publication-status	Published
pubs.volume	30

Abstract:

Anticipating actions before they are executed is crucial for a wide range of practical applications, including autonomous driving and robotics. In this paper, we study the egocentric action anticipation task, which predicts a future action seconds before it is performed in egocentric videos. Previous approaches focus on summarizing the observed content and directly predicting the future action from past observations. We believe action anticipation would benefit from mining cues that compensate for the missing information in the unobserved frames. We therefore propose to decompose action anticipation into a series of future feature predictions. We imagine how the visual features change in the near future and then predict future action labels based on these imagined representations. Instead of feature regression, our ImagineRNN is optimized with contrastive learning. We use a proxy task to train the ImagineRNN, i.e., selecting the correct future states from distractors. We further improve ImagineRNN by residual anticipation, i.e., changing its target to predicting the feature difference between adjacent frames instead of the frame content. This encourages the network to focus on our target, i.e., the future action, as the difference between adjacent frame features is more important for forecasting the future. Extensive experiments on two large-scale egocentric action datasets validate the effectiveness of our method. Our method significantly outperforms previous methods on both the seen test set and the unseen test set of the EPIC Kitchens Action Anticipation Challenge.
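
Illustrative sketch (not from the paper): the abstract names two ideas, a proxy task that selects the correct future state from distractors, and a residual target that predicts the difference between adjacent frame features. The Python below is a minimal illustration of both on toy vectors. The abstract does not state the loss, the similarity measure, or the network, so the dot-product similarity, the softmax cross-entropy (InfoNCE-style) loss, and every function name here are assumptions made for illustration only.

```python
import math


def dot(u, v):
    """Dot-product similarity between two feature vectors (an assumed choice)."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_selection_loss(predicted, candidates, positive_index, temperature=1.0):
    """Softmax cross-entropy over similarities to all candidates.

    The loss is low when `predicted` is more similar to the true future
    state (at `positive_index`) than to the distractors.
    """
    logits = [dot(predicted, c) / temperature for c in candidates]
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[positive_index]


def residual_target(current_feature, next_feature):
    """Residual anticipation target: the change between adjacent frame features."""
    return [n - c for c, n in zip(current_feature, next_feature)]


def apply_residual(current_feature, predicted_residual):
    """Recover the imagined next feature from the current one plus a residual."""
    return [c + r for c, r in zip(current_feature, predicted_residual)]


# Toy example: two adjacent frame features (hypothetical values).
current = [1.0, 0.0]
following = [1.0, 1.0]
target = residual_target(current, following)

# Candidate set: the true residual first, followed by two distractors.
candidates = [target, [1.0, 0.0], [-1.0, 0.0]]

loss_good = contrastive_selection_loss([0.0, 0.9], candidates, positive_index=0)
loss_bad = contrastive_selection_loss([0.9, 0.0], candidates, positive_index=0)
imagined = apply_residual(current, target)
```

In this toy setup a prediction close to the true residual yields a lower loss than one close to a distractor, which is the behaviour the proxy task is meant to reward. A regression loss would instead penalize the distance to the target directly.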

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/156110