Marginalized average attentional network for weakly-supervised learning

Yuan, Y; Lyu, Y; Shen, X; Tsang, IW; Yeung, DY

Publication Type: Conference Proceeding
Citation: 7th International Conference on Learning Representations, ICLR 2019, 2019
Issue Date: 2019-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Yuan, Y
dc.contributor.author	Lyu, Y
dc.contributor.author	Shen, X
dc.contributor.author	Tsang, IW
dc.contributor.author	Yeung, DY
dc.date.accessioned	2020-06-06T01:58:12Z
dc.date.available	2020-06-06T01:58:12Z
dc.date.issued	2019-01-01
dc.identifier.citation	7th International Conference on Learning Representations, ICLR 2019, 2019
dc.identifier.uri	http://hdl.handle.net/10453/141172
dc.description.abstract	© 7th International Conference on Learning Representations, ICLR 2019. All Rights Reserved. In weakly-supervised temporal action localization, previous works have failed to locate dense and integral regions for each entire action due to the overestimation of the most salient regions. To alleviate this issue, we propose a marginalized average attentional network (MAAN) to suppress the dominant response of the most salient regions in a principled manner. The MAAN employs a novel marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities in an end-to-end fashion. MAA samples multiple subsets from the video snippet features according to a set of latent discriminative probabilities and takes the expectation over all the averaged subset features. Theoretically, we prove that the MAA module with learned latent discriminative probabilities successfully reduces the difference in responses between the most salient regions and the others. Therefore, MAAN is able to generate better class activation sequences and identify dense and integral action regions in the videos. Moreover, we propose a fast algorithm to reduce the complexity of constructing MAA from O(2^T) to O(T^2). Extensive experiments on two large-scale video datasets show that our MAAN achieves a superior performance on weakly-supervised temporal action localization.
dc.language	en
dc.relation	http://purl.org/au-research/grants/arc/DP180100106
dc.relation	http://purl.org/au-research/grants/arc/DP200101328
dc.relation.ispartof	7th International Conference on Learning Representations, ICLR 2019
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Marginalized average attentional network for weakly-supervised learning
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	open_access	*
dc.date.updated	2020-06-06T01:58:05Z
pubs.publication-status	Published

Abstract:

© 7th International Conference on Learning Representations, ICLR 2019. All Rights Reserved. In weakly-supervised temporal action localization, previous works have failed to locate dense and integral regions for each entire action due to the overestimation of the most salient regions. To alleviate this issue, we propose a marginalized average attentional network (MAAN) to suppress the dominant response of the most salient regions in a principled manner. The MAAN employs a novel marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities in an end-to-end fashion. MAA samples multiple subsets from the video snippet features according to a set of latent discriminative probabilities and takes the expectation over all the averaged subset features. Theoretically, we prove that the MAA module with learned latent discriminative probabilities successfully reduces the difference in responses between the most salient regions and the others. Therefore, MAAN is able to generate better class activation sequences and identify dense and integral action regions in the videos. Moreover, we propose a fast algorithm to reduce the complexity of constructing MAA from O(2^T) to O(T^2). Extensive experiments on two large-scale video datasets show that our MAAN achieves a superior performance on weakly-supervised temporal action localization.
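
The aggregation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each of the T snippets is kept independently with its latent probability, that each snippet carries a single scalar response rather than a feature vector, and that the empty subset contributes zero. The function names `maa_bruteforce` and `maa_dp` are hypothetical. The first function enumerates all 2^T subsets; the second uses a dynamic-programming recurrence over the number of selected snippets, which is one way to reach O(T^2) and may differ from the recurrence in the paper.

```python
from itertools import product


def maa_bruteforce(x, p):
    """Expected subset average by enumerating all 2^T subsets (exponential cost)."""
    total = 0.0
    for z in product((0, 1), repeat=len(x)):
        k = sum(z)
        if k == 0:
            continue  # assumption: the empty subset contributes 0
        prob = 1.0
        for zt, pt in zip(z, p):
            prob *= pt if zt else (1.0 - pt)
        total += prob * sum(zt * xt for zt, xt in zip(z, x)) / k
    return total


def maa_dp(x, p):
    """Same expectation in O(T^2) time with a recurrence over the subset size."""
    T = len(x)
    q = [1.0] + [0.0] * T  # q[i] = P(exactly i snippets selected so far)
    m = [0.0] * (T + 1)    # m[i] = E[1{i selected} * sum of selected responses]
    for xt, pt in zip(x, p):
        # Descending i so that q[i-1] and m[i-1] still hold the previous step's values.
        for i in range(T, 0, -1):
            m[i] = pt * (m[i - 1] + xt * q[i - 1]) + (1.0 - pt) * m[i]
            q[i] = pt * q[i - 1] + (1.0 - pt) * q[i]
        q[0] *= 1.0 - pt
    # Divide each partial sum by its subset size to obtain the expected average.
    return sum(m[i] / i for i in range(1, T + 1))


if __name__ == "__main__":
    responses = [1.0, 2.0]       # hypothetical per-snippet responses
    probabilities = [0.5, 0.5]   # hypothetical latent probabilities
    print(maa_bruteforce(responses, probabilities))
    print(maa_dp(responses, probabilities))
```

With every probability set to 1, both functions reduce to the plain average of the responses. For feature vectors, the same recurrence can be applied to each dimension separately.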

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/141172