MEE Project

Multimodal Evolutionary Encoder for Continuous Vision-Language Navigation

Department of Control Science and Engineering, Tongji University

Abstract

Can a multimodal encoder evolve when facing increasingly challenging circumstances? Our work investigates this possibility in the context of continuous vision-language navigation (continuous VLN), which aims to navigate robots under linguistic supervision and visual feedback. We propose a multimodal evolutionary encoder (MEE) comprising a unified multimodal encoder architecture and an evolutionary pre-training strategy. The unified multimodal encoder fuses rich modalities, including depth and sub-instructions, to build a robust understanding of environments and tasks. It also makes effective use of monocular observations, reducing the reliance on panoramic vision. The evolutionary pre-training strategy exposes the encoder to increasingly unfamiliar data domains and more difficult objectives. This multi-stage adaptation helps the encoder establish robust inner- and inter-modality connections and improves its generalization to unfamiliar environments. To achieve such evolution, we collect a large-scale multi-stage dataset with specialized objectives, addressing the absence of suitable pre-training data for continuous VLN. Evaluation on VLN-CE demonstrates the superiority of MEE over other direct action-predicting methods. Furthermore, we deploy MEE in real scenes on self-developed service robots, showcasing its effectiveness and potential for real-world applications.
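
To make the idea of unifying modalities concrete, below is a minimal PyTorch sketch of a unified multimodal encoder that fuses monocular RGB, depth, and sub-instruction tokens in one transformer. All module names, dimensions, and the token-concatenation fusion scheme are our illustrative assumptions, not the released MEE implementation.

import torch
import torch.nn as nn

class UnifiedMultimodalEncoder(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=8, vocab_size=30522):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.rgb_proj = nn.Conv2d(3, d_model, kernel_size=16, stride=16)    # monocular RGB patches
        self.depth_proj = nn.Conv2d(1, d_model, kernel_size=16, stride=16)  # depth patches
        self.text_embed = nn.Embedding(vocab_size, d_model)                 # sub-instruction tokens
        # Learned type embeddings tell the transformer which modality a token came from.
        self.type_embed = nn.Embedding(3, d_model)  # 0: rgb, 1: depth, 2: text
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)

    def forward(self, rgb, depth, sub_instr_ids):
        # rgb: (B, 3, H, W); depth: (B, 1, H, W); sub_instr_ids: (B, L)
        rgb_tok = self.rgb_proj(rgb).flatten(2).transpose(1, 2)      # (B, Nr, D)
        dep_tok = self.depth_proj(depth).flatten(2).transpose(1, 2)  # (B, Nd, D)
        txt_tok = self.text_embed(sub_instr_ids)                     # (B, L, D)
        tokens = [rgb_tok, dep_tok, txt_tok]
        types = [torch.full(t.shape[:2], i, dtype=torch.long, device=rgb.device)
                 for i, t in enumerate(tokens)]
        x = torch.cat(tokens, dim=1) + self.type_embed(torch.cat(types, dim=1))
        # One transformer builds inner- and inter-modality connections jointly.
        return self.fusion(x)  # fused multimodal features, (B, Nr+Nd+L, D)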

Method

Main Contributions

  • The unified multimodal encoder enhances comprehensive perception with unified features while reducing the reliance on panoramas (see the encoder sketch above).
  • The evolutionary pre-training enables the encoder to evolve better feature representations and generalization ability across multiple stages (see the training sketch after this list).
  • Both simulated and real scene experiments validate the effectiveness of MEE. Code and datasets have been publicly released.
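
The multi-stage schedule can be sketched as a simple loop that carries the same encoder weights through a sequence of increasingly hard stages. Stage contents, datasets, and objectives below are placeholders; the paper's actual stages and losses may differ.

import torch

def evolutionary_pretrain(encoder, stages, epochs_per_stage=10):
    """Train one encoder through an ordered list of (dataloader, loss_fn)
    stages, where later stages use more unfamiliar data domains and
    harder, stage-specific objectives."""
    opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
    for stage_idx, (loader, loss_fn) in enumerate(stages):
        for epoch in range(epochs_per_stage):
            for batch in loader:
                features = encoder(batch["rgb"], batch["depth"], batch["sub_instr"])
                loss = loss_fn(features, batch)  # stage-specific objective
                opt.zero_grad()
                loss.backward()
                opt.step()
        # Weights carry over into the next, harder stage, so the encoder
        # adapts its inner- and inter-modality connections incrementally.
        print(f"finished stage {stage_idx}")

Because each stage starts from the previous stage's weights rather than from scratch, the encoder adapts gradually instead of being exposed to the hardest objective immediately.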

BibTeX

@INPROCEEDINGS{10802484,
  author={He, Zongtao and Wang, Liuyi and Chen, Lu and Li, Shu and Yan, Qingqing and Liu, Chengju and Chen, Qijun},
  booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, 
  title={Multimodal Evolutionary Encoder for Continuous Vision-Language Navigation}, 
  year={2024},
  volume={},
  number={},
  pages={1443-1450},
  keywords={Visualization;Costs;Codes;Navigation;Service robots;Linguistics;Feature extraction;Solids;Decoding;Intelligent robots},
  doi={10.1109/IROS58592.2024.10802484}
}