Learning to Follow and Generate Instructions for Language-Capable Navigation

Wang, X; Wang, W; Shao, J; Yang, Y

Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type: Journal Article
Citation: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, PP, (99), pp. 1-17
Issue Date: 2023-01-01

Closed Access

Filename	Description	Size	Format
Learning_to_Follow_and_Generate_Instructions_for_Language-Capable_Navigation.pdf	Published version	7.5 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, X
dc.contributor.author	Wang, W
dc.contributor.author	Shao, J
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date.accessioned	2024-03-18T01:05:47Z
dc.date.available	2024-03-18T01:05:47Z
dc.date.issued	2023-01-01
dc.identifier.citation	IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, PP, (99), pp. 1-17
dc.identifier.issn	0162-8828
dc.identifier.issn	1939-3539
dc.identifier.uri	http://hdl.handle.net/10453/176833
dc.description.abstract	Visual-language navigation (VLN) is a challenging task that requires embodied agents to follow natural language instructions to navigate in previously unseen environments. However, the existing literature places most emphasis on interpreting instructions into actions, only delivering “dumb” wayfinding agents which cannot actively use natural language to communicate with humans. In this article, we devise Lana, a language-capable navigation agent which is able to not only execute human-written navigation commands, but also provide route descriptions to humans. This is achieved by simultaneously learning instruction following and generation with a single model. More specifically, two encoders, one for route encoding and one for language encoding, are built and shared by two decoders, one for action prediction and one for instruction generation, so as to exploit cross-task knowledge and capture task-specific characteristics. Throughout pretraining and fine-tuning, both instruction following and generation are set as optimization objectives. We further extend Lana by exploiting object semantics during route encoding. This leads to Lana+, a more powerful framework that simulates the way humans refer to landmarks for instruction composition and wayfinding. We empirically verify that, compared with recent advanced task-specific solutions, Lana attains better performance on both instruction following and generation, with nearly half the complexity. In addition, endowed with language generation capability, Lana can explain its behaviors to humans and assist humans' wayfinding. Benefiting from landmark information, Lana+ exhibits even more impressive performance. This work is expected to foster future efforts towards building more trustworthy and socially-intelligent navigation robots.
dc.language	en
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof	IEEE Transactions on Pattern Analysis and Machine Intelligence
dc.relation.isbasedon	10.1109/TPAMI.2023.3341828
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0806 Information Systems, 0906 Electrical and Electronic Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	4603 Computer vision and multimedia computation
dc.subject.classification	4611 Machine learning
dc.title	Learning to Follow and Generate Instructions for Language-Capable Navigation
dc.type	Journal Article
utslib.citation.volume	PP
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0806 Information Systems
utslib.for	0906 Electrical and Electronic Engineering
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	closed_access	*
dc.date.updated	2024-03-18T01:05:42Z
pubs.issue	99
pubs.publication-status	Published
pubs.volume	PP
utslib.citation.issue	99

Abstract:

Visual-language navigation (VLN) is a challenging task that requires embodied agents to follow natural language instructions to navigate in previously unseen environments. However, the existing literature places most emphasis on interpreting instructions into actions, only delivering “dumb” wayfinding agents which cannot actively use natural language to communicate with humans. In this article, we devise Lana, a language-capable navigation agent which is able to not only execute human-written navigation commands, but also provide route descriptions to humans. This is achieved by simultaneously learning instruction following and generation with a single model. More specifically, two encoders, one for route encoding and one for language encoding, are built and shared by two decoders, one for action prediction and one for instruction generation, so as to exploit cross-task knowledge and capture task-specific characteristics. Throughout pretraining and fine-tuning, both instruction following and generation are set as optimization objectives. We further extend Lana by exploiting object semantics during route encoding. This leads to Lana+, a more powerful framework that simulates the way humans refer to landmarks for instruction composition and wayfinding. We empirically verify that, compared with recent advanced task-specific solutions, Lana attains better performance on both instruction following and generation, with nearly half the complexity. In addition, endowed with language generation capability, Lana can explain its behaviors to humans and assist humans' wayfinding. Benefiting from landmark information, Lana+ exhibits even more impressive performance. This work is expected to foster future efforts towards building more trustworthy and socially-intelligent navigation robots.
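
The arrangement the abstract describes, two shared encoders feeding two task-specific decoders with both tasks optimized together, can be illustrated with a toy sketch. This is not the authors' implementation: the class name `SharedEncoderAgent`, the helper functions, all dimensions, the random linear layers, and the unweighted sum of the two losses are assumptions chosen only to make the structure concrete. In particular, it does not model how the real system encodes routes, instructions, or landmarks.

```python
import math
import random


def linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learned layer (hypothetical)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def matvec(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[target] / sum(exps))


class SharedEncoderAgent:
    """Toy agent: two shared encoders, two task-specific decoders."""

    def __init__(self, route_dim=4, text_dim=3, hidden=5,
                 n_actions=3, vocab=6, seed=0):
        rng = random.Random(seed)
        # Encoders used by both tasks.
        self.route_encoder = linear(route_dim, hidden, rng)
        self.language_encoder = linear(text_dim, hidden, rng)
        # One decoder per task, each reading the same encoded state.
        self.action_decoder = linear(2 * hidden, n_actions, rng)
        self.word_decoder = linear(2 * hidden, vocab, rng)

    def encode(self, route_feat, text_feat):
        # Concatenate the two encodings into one shared representation.
        return (matvec(self.route_encoder, route_feat)
                + matvec(self.language_encoder, text_feat))

    def joint_loss(self, route_feat, text_feat, action_target, word_target):
        shared = self.encode(route_feat, text_feat)
        follow = cross_entropy(matvec(self.action_decoder, shared), action_target)
        generate = cross_entropy(matvec(self.word_decoder, shared), word_target)
        # Assumption: the two objectives are simply summed with equal weight.
        return follow + generate, follow, generate


agent = SharedEncoderAgent()
total, follow, generate = agent.joint_loss(
    route_feat=[0.2, -0.1, 0.4, 0.0],
    text_feat=[1.0, 0.5, -0.3],
    action_target=1,
    word_target=2,
)
```

Because both loss terms are computed from the same encoded state, a gradient step on `total` would update the two encoders using signal from both instruction following and instruction generation, which is the cross-task sharing the abstract refers to.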

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/176833