Memory transformation networks for weakly supervised visual classification

Liu, H; Zheng, Q; Luo, M; Chang, X; Yan, C; Yao, L

Publisher: Elsevier
Publication Type: Journal Article
Citation: Knowledge-Based Systems, 2020, 210, pp. 1-13
Issue Date: 2020-12-27

Closed Access

Filename	Description	Size	Format
1-s2.0-S095070512030561X-main.pdf		1.27 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Liu, H
dc.contributor.author	Zheng, Q
dc.contributor.author	Luo, M
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Yan, C
dc.contributor.author	Yao, L
dc.date.accessioned	2022-10-31T02:30:19Z
dc.date.available	2022-10-31T02:30:19Z
dc.date.issued	2020-12-27
dc.identifier.citation	Knowledge-Based Systems, 2020, 210, pp. 1-13
dc.identifier.issn	0950-7051
dc.identifier.issn	1872-7409
dc.identifier.uri	http://hdl.handle.net/10453/163028
dc.language	English
dc.publisher	Elsevier
dc.relation.ispartof	Knowledge-Based Systems
dc.relation.isbasedon	10.1016/j.knosys.2020.106432
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	08 Information and Computing Sciences, 15 Commerce, Management, Tourism and Services, 17 Psychology and Cognitive Sciences
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Memory transformation networks for weakly supervised visual classification
dc.type	Journal Article
utslib.citation.volume	210
utslib.for	08 Information and Computing Sciences
utslib.for	15 Commerce, Management, Tourism and Services
utslib.for	17 Psychology and Cognitive Sciences
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
pubs.consider-herdc	false
dc.date.updated	2022-10-31T02:30:18Z
pubs.publication-status	Published
pubs.volume	210

Abstract:

The lack of labeled exemplars makes video classification with supervised neural networks challenging. Using external memory that contains task-related knowledge is an effective way to learn a category from a handful of samples; however, most existing memory-augmented neural networks still struggle to handle multi-modal external data because of its high dimensionality and massive volume. In light of this, we propose a Memory Transformation Network (MTN) that converts external knowledge into embedded and concentrated memories so that it can be used feasibly for video classification with weak supervision. Specifically, we employ a multi-modal deep autoencoder to project external visual and textual information onto a shared space, producing a joint embedded memory that captures the correlation among modalities and enhances expressive ability. The autoencoder's inherent dimension reduction also alleviates the curse of dimensionality. In addition, an attention-based compression mechanism generates a concentrated memory that records the information useful for a specific task. The resulting concentrated memory is relatively lightweight, which mitigates the cost of time-consuming content-based addressing over large-volume memory. Our model outperforms state-of-the-art methods by 5.44% and 1.81% on average in two metrics over three real-world video datasets, demonstrating its effectiveness for visual classification with limited labeled exemplars.
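
The abstract names two generic mechanisms, content-based addressing and attention-based compression, without giving their formulas, and the full text is closed access. The following is a minimal sketch of how such mechanisms are commonly built, not the authors' implementation. It assumes cosine similarity followed by a softmax as the addressing rule, and it assumes that compression produces one concentrated slot per task query. All function names (`content_read`, `compress`) and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_read(query, memory):
    """Content-based addressing (assumed form): weight every slot by the
    softmax of its cosine similarity to the query, then return the
    weighted sum of slots together with the weights."""
    weights = softmax([cosine(query, slot) for slot in memory])
    dim = len(memory[0])
    read = [sum(w * slot[d] for w, slot in zip(weights, memory))
            for d in range(dim)]
    return read, weights


def compress(task_queries, memory):
    """Attention-based compression (assumed form): each task query attends
    over the large memory and yields one concentrated slot."""
    return [content_read(q, memory)[0] for q in task_queries]


# Toy data: four 2-d embedded memory slots and two task queries.
memory = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
task_queries = [[1.0, 0.0], [0.0, 1.0]]

concentrated = compress(task_queries, memory)  # 2 slots instead of 4
read, weights = content_read([1.0, 0.0], concentrated)
```

In this sketch a read over the concentrated memory compares the query against two slots instead of four. That is the kind of saving the abstract attributes to a lightweight concentrated memory, although the sizes here are purely illustrative.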

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/163028