Vision-language navigation with self-supervised auxiliary reasoning tasks

Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:
Conference Proceeding
Citation:
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, 00, pp. 10009-10019
Issue Date:
2020-01-01
Vision-Language Navigation (VLN) is a task in which an agent learns to navigate by following a natural language instruction. The key to this task is to perceive both the visual scene and the language instruction sequentially. Conventional approaches exploit vision and language features in cross-modal grounding. However, the VLN task remains challenging, since previous works have implicitly neglected the rich semantic information contained in environments (such as navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks that exploit the additional training signals derived from this semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, evaluating the trajectory consistency, estimating the progress, and predicting the next direction. These additional training signals help the agent acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of environments. Our experiments demonstrate that the auxiliary reasoning tasks improve both the performance on the main task and the generalizability of the model by a large margin. We further demonstrate empirically that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method, making it the best existing approach on the standard benchmark.
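
As a rough illustration of how self-supervised auxiliary objectives like those described above can be attached to a navigation agent, the following PyTorch-style sketch adds four weighted auxiliary losses to a main navigation loss. This is not the authors' released implementation; all module names, dimensions, target formats, and loss weights are hypothetical placeholders chosen for illustration.

```python
# Minimal sketch: four auxiliary reasoning heads on top of an agent's hidden
# state, with their losses summed into the main navigation objective.
# All names and shapes here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AuxiliaryHeads(nn.Module):
    def __init__(self, hidden_dim: int = 512, num_directions: int = 4, vocab_size: int = 1000):
        super().__init__()
        # 1) Explaining the previous action: predict a description token.
        self.explain_head = nn.Linear(hidden_dim, vocab_size)
        # 2) Trajectory consistency: score whether trajectory matches the instruction.
        self.consistency_head = nn.Linear(hidden_dim, 1)
        # 3) Progress estimation: regress normalized progress toward the goal in [0, 1].
        self.progress_head = nn.Linear(hidden_dim, 1)
        # 4) Next-direction prediction: classify the heading of the next step.
        self.direction_head = nn.Linear(hidden_dim, num_directions)

    def forward(self, state: torch.Tensor):
        return (
            self.explain_head(state),
            self.consistency_head(state).squeeze(-1),
            self.progress_head(state).squeeze(-1),
            self.direction_head(state),
        )


def total_loss(nav_loss,
               explain_logits, explain_tgt,      # (B, vocab), (B,) long
               consist_logits, consist_tgt,      # (B,), (B,) float in {0, 1}
               progress_pred, progress_tgt,      # (B,), (B,) float in [0, 1]
               dir_logits, dir_tgt,              # (B, num_directions), (B,) long
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the main navigation loss with the four weighted auxiliary losses."""
    l_explain = F.cross_entropy(explain_logits, explain_tgt)
    l_consist = F.binary_cross_entropy_with_logits(consist_logits, consist_tgt)
    l_progress = F.mse_loss(progress_pred, progress_tgt)
    l_dir = F.cross_entropy(dir_logits, dir_tgt)
    return (nav_loss
            + weights[0] * l_explain
            + weights[1] * l_consist
            + weights[2] * l_progress
            + weights[3] * l_dir)
```

In a setup of this kind, the auxiliary heads share the agent's hidden state during training and would typically be discarded at inference, so they serve only to shape the learned representation rather than to alter the navigation policy itself.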