Vision-dialog navigation by exploring cross-modal memory

Zhu, Y; Zhu, F; Zhan, Z; Lin, B; Jiao, J; Chang, X; Liang, X

Vision-dialog navigation by exploring cross-modal memory

Zhu, Y Zhu, F Zhan, Z Lin, B Jiao, J Chang, X

Liang, X

Permalink

Publisher:: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:: Conference Proceeding
Citation:: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, 00, pp. 10727-10736
Issue Date:: 2020-01-01

Closed Access

	Filename	Description	Size
	Zhu_Vision-Dialog_Navigation_by_Exploring_Cross-Modal_Memory_CVPR_2020_paper.pdf	Published version	807.69 kB	Adobe PDF	View/Open

Copyright Clearance Process

Recently Added
In Progress
Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhu, Y
dc.contributor.author	Zhu, F
dc.contributor.author	Zhan, Z
dc.contributor.author	Lin, B
dc.contributor.author	Jiao, J
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Liang, X
dc.date	2020-06-13
dc.date.accessioned	2023-03-31T10:27:31Z
dc.date.available	2023-03-31T10:27:31Z
dc.date.issued	2020-01-01
dc.identifier.citation	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, 00, pp. 10727-10736
dc.identifier.issn	1063-6919
dc.identifier.uri	http://hdl.handle.net/10453/168987
dc.description.abstract	Vision-dialog navigation posed as a new holy-grail task in vision-language disciplinary targets at learning an agent endowed with the capability of constant conversation for help with natural language and navigating according to human responses. Besides the common challenges faced in visual language navigation, vision-dialog navigation also requires to handle well with the language intentions of a series of questions about the temporal context from dialogue history and co-reasoning both dialogs and visual scenes. In this paper, we propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions. Our CMN consists of two memory modules, the language memory module (L-mem) and the visual memory module (V-mem). Specifically, L-mem learns latent relationships between the current language interaction and a dialog history by employing a multi-head attention mechanism. V-mem learns to associate the current visual views and the cross-modal memory about the previous navigation actions. The cross-modal memory is generated via a vision-to-language attention and a language-to-vision attention. Benefiting from the collaborative learning of the L-mem and the V-mem, our CMN is able to explore the memory about the decision making of historical navigation actions which is for the current step. Experiments on the CVDN dataset show that our CMN outperforms the previous state-of-the-art model by a significant margin on both seen and unseen environments. 1
dc.language	en
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
dc.relation.ispartof	2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
dc.relation.isbasedon	10.1109/CVPR42600.2020.01074
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Vision-dialog navigation by exploring cross-modal memory
dc.type	Conference Proceeding
utslib.citation.volume	00
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	closed_access	*
dc.date.updated	2023-03-31T10:27:29Z
pubs.finish-date	2020-06-19
pubs.publication-status	Published
pubs.start-date	2020-06-13
pubs.volume	00

Abstract:

Vision-dialog navigation posed as a new holy-grail task in vision-language disciplinary targets at learning an agent endowed with the capability of constant conversation for help with natural language and navigating according to human responses. Besides the common challenges faced in visual language navigation, vision-dialog navigation also requires to handle well with the language intentions of a series of questions about the temporal context from dialogue history and co-reasoning both dialogs and visual scenes. In this paper, we propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions. Our CMN consists of two memory modules, the language memory module (L-mem) and the visual memory module (V-mem). Specifically, L-mem learns latent relationships between the current language interaction and a dialog history by employing a multi-head attention mechanism. V-mem learns to associate the current visual views and the cross-modal memory about the previous navigation actions. The cross-modal memory is generated via a vision-to-language attention and a language-to-vision attention. Benefiting from the collaborative learning of the L-mem and the V-mem, our CMN is able to explore the memory about the decision making of historical navigation actions which is for the current step. Experiments on the CVDN dataset show that our CMN outperforms the previous state-of-the-art model by a significant margin on both seen and unseen environments. 1

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/168987