Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Wang, X; Wu, Y; Zhu, L; Yang, Y

Publication Type: Conference Proceeding
Citation: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI, 2020
Issue Date: 2020

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, X
dc.contributor.author	Wu, Y
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date.accessioned	2021-04-16T19:08:40Z
dc.date.available	2021-04-16T19:08:40Z
dc.date.issued	2020
dc.identifier.citation	The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI, 2020
dc.identifier.uri	http://hdl.handle.net/10453/148178
dc.description.abstract	Egocentric video recognition is a natural testbed for diverse interaction reasoning. Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, i.e., one branch for verb classification and the other branch for noun classification. However, the correlation between the verb and noun branches has been largely ignored. Besides, the two branches fail to exploit local features due to the absence of a position-aware attention mechanism. In this paper, we propose a novel Symbiotic Attention framework leveraging Privileged information (SAP) for egocentric video recognition. Finer position-aware object detection features can facilitate the understanding of the actor's interaction with the object. We introduce these features in action recognition and regard them as privileged information. Our framework enables mutual communication among the verb branch, the noun branch, and the privileged information. This communication process not only injects local details into global features but also exploits implicit guidance about the spatio-temporal position of an ongoing action. We introduce novel symbiotic attention (SA) to enable effective communication. It first normalizes the detection-guided features on one branch to underline the action-relevant information from the other branch. SA adaptively enhances the interactions among the three sources. To further catalyze this communication, spatial relations are uncovered for the selection of the most action-relevant information. It identifies the most valuable and discriminative feature for classification. We validate the effectiveness of our SAP quantitatively and qualitatively. Notably, it achieves state-of-the-art results on two large-scale egocentric video datasets.
dc.language	en
dc.relation.ispartof	The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Symbiotic Attention with Privileged Information for Egocentric Action Recognition
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	open_access	*
dc.date.updated	2021-04-16T19:08:27Z
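
The communication among the verb branch, the noun branch, and the detection features that the abstract describes can be illustrated with a small sketch. This is not the authors' SAP implementation: it assumes generic scaled dot-product attention, an element-wise sum as the way detection features are injected into a branch's global feature, and toy 4-dimensional vectors. The names `attend`, `inject`, `verb_global`, `noun_global`, and `detections` are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, candidates):
    # Generic scaled dot-product attention of one query over candidate vectors
    # (an assumption; the paper's exact normalization is not given in the abstract).
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in candidates])
    dim = len(candidates[0])
    pooled = [sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(dim)]
    return pooled, weights

def inject(global_feat, detections):
    # Assumed fusion: add each local detection feature to the branch's global feature.
    return [[g + d for g, d in zip(global_feat, det)] for det in detections]

# Toy inputs (hypothetical values).
verb_global = [1.0, 0.0, 0.0, 0.0]
noun_global = [0.0, 1.0, 0.0, 0.0]
detections = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Detection-guided candidates on each branch.
noun_candidates = inject(noun_global, detections)
verb_candidates = inject(verb_global, detections)

# Each branch queries the other branch's detection-guided candidates.
verb_out, verb_w = attend(verb_global, noun_candidates)
noun_out, noun_w = attend(noun_global, verb_candidates)

print("verb branch attention weights:", [round(w, 3) for w in verb_w])
print("noun branch attention weights:", [round(w, 3) for w in noun_w])
```

In this toy setup each branch's global feature acts as the query over the other branch's detection-guided candidates, so the attention weights indicate which object region each branch treats as most relevant. A trained model would presumably use learned projections rather than raw dot products.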

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/148178