Symbiotic Attention with Privileged Information for Egocentric Action Recognition
- Publication Type: Conference Proceeding
- Citation: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2020
- Issue Date: 2020
This item is open access.
Egocentric video recognition is a natural testbed for diverse interaction
reasoning. Because egocentric video datasets have large action vocabularies,
recent studies usually adopt a two-branch structure for action recognition,
i.e., one branch for verb classification and the other for noun
classification. However, the correlation between the verb and noun branches
has been largely ignored. Moreover, the two branches fail to exploit local
features because they lack a position-aware attention mechanism. In this
paper, we propose a novel Symbiotic Attention framework leveraging
Privileged information (SAP) for egocentric video recognition. Fine-grained,
position-aware object detection features can facilitate the understanding of
an actor's interaction with an object; we introduce these features into
action recognition and regard them as privileged information. Our framework
enables mutual communication among the verb branch, the noun branch, and the
privileged information. This communication not only injects local details
into global features but also exploits implicit guidance about the
spatio-temporal position of an ongoing action. To enable effective
communication, we introduce a novel symbiotic attention (SA) mechanism. It
first normalizes the detection-guided features of one branch to underline
the action-relevant information from the other branch, and then adaptively
enhances the interactions among the three sources. To further catalyze this
communication, SA uncovers spatial relations to select the most
action-relevant information, identifying the most valuable and
discriminative features for classification. We validate the effectiveness of
SAP quantitatively and qualitatively. Notably, it achieves state-of-the-art
performance on two large-scale egocentric video datasets.
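To make the communication scheme concrete, below is a minimal PyTorch sketch of one direction of the exchange: one branch's global feature scores the position-aware detection features, and the aggregated local detail is injected into the partner branch. This is an illustration under assumptions, not the authors' implementation; the module name (SymbioticAttentionSketch), argument names (guide_feat, partner_feat, det_feats), and the simple dot-product scoring are all hypothetical, and the paper's actual SA additionally normalizes the detection-guided features and exploits spatial relations when selecting features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymbioticAttentionSketch(nn.Module):
    """Hypothetical sketch of one direction of SA-style communication."""

    def __init__(self, dim):
        super().__init__()
        self.proj_branch = nn.Linear(dim, dim)  # embeds the guiding branch
        self.proj_det = nn.Linear(dim, dim)     # embeds detection features

    def forward(self, guide_feat, partner_feat, det_feats):
        # guide_feat:   (B, D) global feature of one branch (e.g. verb)
        # partner_feat: (B, D) global feature of the other branch (e.g. noun)
        # det_feats:    (B, N, D) position-aware object detection features
        # Score each detection against the guiding branch's feature.
        q = self.proj_branch(guide_feat).unsqueeze(1)           # (B, 1, D)
        k = self.proj_det(det_feats)                            # (B, N, D)
        scores = (q * k).sum(-1) / det_feats.size(-1) ** 0.5    # (B, N)
        # Normalize so the most action-relevant detections dominate.
        weights = F.softmax(scores, dim=1)                      # (B, N)
        # Aggregate local detail and inject it into the partner branch.
        local = torch.einsum('bn,bnd->bd', weights, det_feats)  # (B, D)
        return partner_feat + local


# Toy usage: the verb branch guides detection selection for the noun
# branch and vice versa, mirroring the mutual communication among the
# two branches and the privileged information described above.
B, N, D = 2, 8, 512
verb, noun = torch.randn(B, D), torch.randn(B, D)
dets = torch.randn(B, N, D)
sa = SymbioticAttentionSketch(D)
noun_enhanced = sa(verb, noun, dets)  # verb-guided refinement of noun
verb_enhanced = sa(noun, verb, dets)  # noun-guided refinement of verb
```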