Interactive Prototype Learning for Egocentric Action Recognition

Publisher:
IEEE
Publication Type:
Conference Proceeding
Citation:
2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2022, pp. 8148-8157
Issue Date:
2022-02-28
Filename: Interactive Prototype Learning for Egocentric Action Recognition.pdf
Description: Accepted version
Size: 1.27 MB
Format: Adobe PDF
Abstract:
Egocentric video recognition is a challenging task that requires identifying both the actor's motion and the active object that the actor interacts with. Recognizing the active object is particularly hard due to cluttered backgrounds with distracting objects, frequent field-of-view changes, severe occlusion, and so on. To improve active object classification, most existing methods use object detectors or human gaze information, which are computationally expensive or require labor-intensive annotations. To avoid these additional costs, we propose an end-to-end Interactive Prototype Learning (IPL) framework that learns better active object representations by leveraging the motion cues from the actor. First, we introduce a set of verb prototypes to disentangle active object features from distracting object features. Each prototype corresponds to a primary motion pattern of an egocentric action, offering a distinctive supervision signal for active object feature learning. Second, we design two interactive operations to enable the extraction of active object features, i.e., noun-to-verb assignment and verb-to-noun selection. These operations are parameter-efficient and can learn location-aware features on top of 3D CNN backbones. We demonstrate that the IPL framework generalizes to different backbones and outperforms the state of the art on three large-scale egocentric video datasets, i.e., EPIC-KITCHENS-55, EPIC-KITCHENS-100 and EGTEA.
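Illustrative note: the record contains no code, but the following minimal PyTorch sketch gives one plausible reading of the two interactive operations named in the abstract, assuming pooled clip features from a 3D CNN backbone. The module name, tensor shapes, the sigmoid gating, and the EPIC-KITCHENS-100 class counts in the usage snippet are assumptions for illustration, not the authors' implementation.

# Sketch only (not the authors' code): noun features are softly assigned to
# learnable verb prototypes (noun-to-verb assignment), and the assigned
# prototype mixture gates the noun feature (verb-to-noun selection) before
# classification. All names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractivePrototypeHead(nn.Module):
    def __init__(self, feat_dim, num_verbs, num_nouns):
        super().__init__()
        # One learnable prototype per verb, i.e., per primary motion pattern.
        self.verb_prototypes = nn.Parameter(torch.randn(num_verbs, feat_dim))
        self.verb_fc = nn.Linear(feat_dim, num_verbs)
        self.noun_fc = nn.Linear(feat_dim, num_nouns)

    def forward(self, verb_feat, noun_feat):
        # verb_feat, noun_feat: (B, D) pooled features from the backbone(s).
        # Noun-to-verb assignment: soft assignment of the object feature
        # to verb prototypes via similarity.
        assign = F.softmax(noun_feat @ self.verb_prototypes.t(), dim=-1)  # (B, V)
        # Verb-to-noun selection: the assigned prototype mixture re-weights
        # the object feature, emphasizing the active object.
        selected_proto = assign @ self.verb_prototypes                    # (B, D)
        gated_noun = noun_feat * torch.sigmoid(selected_proto)            # (B, D)
        return self.verb_fc(verb_feat), self.noun_fc(gated_noun)

# Usage with dummy tensors (97 verbs / 300 nouns as in EPIC-KITCHENS-100):
head = InteractivePrototypeHead(feat_dim=2048, num_verbs=97, num_nouns=300)
v = torch.randn(4, 2048)
n = torch.randn(4, 2048)
verb_logits, noun_logits = head(v, n)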