Monocular Camera-Based Point-Goal Navigation by Learning Depth Channel and Cross-Modality Pyramid Fusion

Publication Type:
Conference Proceeding
Citation:
Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI 2022), vol. 36, 2022, pp. 5422-5430
Issue Date:
2022-06-30
Abstract:
For a monocular camera-based navigation system, if scene geometric cues can be effectively extracted from RGB images, this geometry information will significantly improve the efficiency of the navigation system. Motivated by this, we propose a highly efficient point-goal navigation framework, dubbed Geo-Nav. In a nutshell, Geo-Nav consists of two parts: a visual perception part and a navigation part. In the visual perception part, we first propose a Self-supervised Depth Estimation network (SDE) specially tailored for the monocular camera-based navigation agent. SDE learns a mapping from an RGB input image to its corresponding depth image by exploiting scene geometric constraints in a self-consistent manner. Then, to obtain a representative visual representation from the RGB inputs and the learned depth images, we propose a Cross-modality Pyramid Fusion module (CPF). Concretely, CPF computes a patch-wise cross-modality correlation between the features of the two modalities and exploits this correlation to fuse and enhance the features at each scale. Thanks to the patch-wise nature of CPF, feature maps can be fused at high resolution, allowing the visual network to perceive more image details. In the navigation part, the extracted visual representations are fed to a navigation policy network that learns to map them to agent actions effectively. Extensive experiments on the Gibson benchmark demonstrate that Geo-Nav outperforms the state of the art in terms of both efficiency and effectiveness.
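To make the patch-wise fusion idea concrete, the following is a minimal PyTorch-style sketch of what one pyramid level of such a cross-modality fusion block could look like. It is not the authors' implementation: the class name, the patch_size parameter, and the 1x1 projection layers are assumptions made for illustration. The sketch computes a scaled dot-product correlation between non-overlapping RGB and depth feature patches, uses it to reweight the depth patches, and merges the result back into the RGB map.

    # Illustrative sketch of patch-wise cross-modality fusion (not the authors' code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalityFusion(nn.Module):
        """Fuses RGB and depth feature maps at one pyramid scale via
        patch-wise cross-modality correlation (hypothetical sketch)."""
        def __init__(self, channels, patch_size=4):
            super().__init__()
            self.patch_size = patch_size
            self.proj_rgb = nn.Conv2d(channels, channels, kernel_size=1)
            self.proj_depth = nn.Conv2d(channels, channels, kernel_size=1)
            self.out = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, rgb_feat, depth_feat):
            # rgb_feat, depth_feat: (B, C, H, W); H and W assumed divisible by patch_size
            B, C, H, W = rgb_feat.shape
            p = self.patch_size
            q = self.proj_rgb(rgb_feat)
            k = self.proj_depth(depth_feat)

            # Split each map into non-overlapping p x p patches: (B, N, C*p*p)
            q_patches = F.unfold(q, kernel_size=p, stride=p).transpose(1, 2)
            k_patches = F.unfold(k, kernel_size=p, stride=p).transpose(1, 2)
            v_patches = F.unfold(depth_feat, kernel_size=p, stride=p).transpose(1, 2)

            # Patch-wise cross-modality correlation, normalized with a softmax
            scale = q_patches.shape[-1] ** 0.5
            attn = torch.softmax(q_patches @ k_patches.transpose(1, 2) / scale, dim=-1)

            # Aggregate depth patches weighted by their correlation with RGB patches
            fused = attn @ v_patches

            # Fold the fused patches back into a (B, C, H, W) map and combine with RGB
            fused = F.fold(fused.transpose(1, 2), output_size=(H, W),
                           kernel_size=p, stride=p)
            return self.out(torch.cat([rgb_feat, fused], dim=1))

In a full visual perception pipeline of this kind, one such block would be applied at every level of the feature pyramid, so that fusion is performed at high resolution as well as at coarse scales, consistent with the abstract's description of CPF.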