AwareVLN: Reasoning with Self-awareness for
Vision-Language Navigation

Wenxuan Guo   Xiuwei Xu*   Yichen Liu   Xiangyu Li   Hang Yin   Huangxing Chen
Wenzhao Zheng   Jianjiang Feng   Jie Zhou   Jiwen Lu
*Project leader  Corresponding author
Tsinghua University
Paper | Video | Code

CVPR 2026


Highlights

Demo

We visualize representative navigation episodes in simulation and in the real world: the model performs structured reasoning during navigation—for example, detecting a misinterpreted turn and issuing a corrective plan, or recognizing a completed subtask and planning the next phase aligned with the instruction.

Abstract


Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in the Habitat simulator show that our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods.

AwareVLN overview: self-aware reasoning at key navigation points

AwareVLN triggers structured, self-aware reasoning at key nodes (e.g., subtask boundaries) rather than relying solely on end-to-end action prediction.

Approach


Unified reason–act framework. A single VLM jointly predicts a mode token ([REASON] vs. [ACT]) and the corresponding text: in reasoning mode it produces sparse summaries of the scene context, task progress, and next-step plan; in action mode it emits movement commands that are parsed into low-level primitives. The most recent reasoning and the relative step distance since that reasoning step are fed back to the model for temporal grounding.

framework
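
To make the mode switching concrete, here is a minimal Python sketch of the reason–act loop described above. The vlm.generate and env.step interfaces, the prompt fields, and the primitive set are assumptions for illustration, not the paper's actual API.

# Minimal sketch of the unified reason-act loop. The `vlm.generate` and
# `env.step` interfaces, the prompt layout, and the primitive set are all
# assumptions for illustration, not the released implementation.

PRIMITIVES = {"forward", "turn_left", "turn_right", "stop"}

def build_prompt(instruction, observation, past_reasoning, steps_since_reason):
    # Hypothetical prompt layout: instruction, current frame, the most recent
    # reasoning text, and the relative step distance since that reasoning.
    return {
        "instruction": instruction,
        "observation": observation,
        "past_reasoning": past_reasoning,
        "steps_since_reason": steps_since_reason,
    }

def parse_action(text):
    # Hypothetical parser mapping a free-form movement command to a primitive.
    text = text.lower()
    if "stop" in text:
        return "stop"
    if "left" in text:
        return "turn_left"
    if "right" in text:
        return "turn_right"
    return "forward"

def navigate(vlm, env, instruction, max_steps=200):
    last_reasoning = ""       # most recent [REASON] text, fed back as context
    steps_since_reason = 0    # relative step distance from the last reasoning step

    obs = env.reset()
    for _ in range(max_steps):
        prompt = build_prompt(instruction, obs, last_reasoning, steps_since_reason)
        mode, text = vlm.generate(prompt)   # mode is "[REASON]" or "[ACT]"

        if mode == "[REASON]":
            # Sparse reasoning: summarize scene, progress, and the next-step
            # plan, then keep it as context for later steps.
            last_reasoning = text
            steps_since_reason = 0
            continue

        # Action mode: parse the command and execute a low-level primitive.
        action = parse_action(text)
        obs, done = env.step(action)
        steps_since_reason += 1
        if done or action == "stop":
            break

Because reasoning is triggered only at sparse key steps, the feedback of the last reasoning plus the step count since it keeps per-step action prediction cheap while preserving context.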

Automatic data engine. Key reasoning nodes (subtask completion, path deviation, stopping error) are detected automatically from simulator semantics and ground-truth waypoints; a general-purpose VLM then generates structured reasoning supervision at scale without manual annotation.

data engine
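
The node-detection step can be sketched as simple geometric checks against the ground-truth trajectory. The thresholds, the array-based trajectory format, and the function name detect_key_nodes below are assumptions; the full engine additionally uses simulator semantics, which this sketch omits.

import numpy as np

# Hypothetical sketch of key-node detection. Thresholds (in meters) and the
# Nx2 position-array format are assumptions, not the paper's exact rules.
def detect_key_nodes(agent_traj, gt_path, subgoals, goal,
                     dev_thresh=1.5, reach_thresh=0.5):
    """Return a list of (step, node_type) pairs marking reasoning nodes."""
    nodes = []
    next_subgoal = 0

    for t, pos in enumerate(agent_traj):
        # Subtask completion: the agent comes within reach of the next subgoal.
        if next_subgoal < len(subgoals) and \
                np.linalg.norm(pos - subgoals[next_subgoal]) < reach_thresh:
            nodes.append((t, "subtask_completion"))
            next_subgoal += 1

        # Path deviation: the agent drifts far from every ground-truth waypoint.
        if np.min(np.linalg.norm(gt_path - pos, axis=1)) > dev_thresh:
            nodes.append((t, "path_deviation"))

    # Stopping error: the episode ends away from the goal position.
    if np.linalg.norm(agent_traj[-1] - goal) > reach_thresh:
        nodes.append((len(agent_traj) - 1, "stopping_error"))

    return nodes

The flagged steps would then be paired with the corresponding observations and passed to the general VLM, which writes the structured reasoning text used as supervision.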

Experiments


Simulation. We evaluate on the R2R-CE and RxR-CE Val-Unseen splits in Habitat using only monocular RGB. AwareVLN achieves strong results without depth, panoramas, or odometry, compared with methods that may use richer sensing or simulator-pretrained waypoint predictors.

R2R-CE and RxR-CE results

Real-world evaluation. We report navigation error and success rate across corridor, home, and office settings with simple vs. complex instructions.

real-world results

Qualitative rollouts. Examples in Habitat and on a real quadruped show self-correction after misinterpreting an instruction and progress-aware re-planning at subtask boundaries.

simulator rollout

Habitat: deviation recovery and subtask-aware planning.


real-world rollout

Real world: long-horizon task with reasoning at subtask boundaries (sim-trained only).

BibTeX


@inproceedings{guo2026awarevln,
  title     = {AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation},
  author    = {Wenxuan Guo and Xiuwei Xu and Yichen Liu and Xiangyu Li and Hang Yin and Huangxing Chen and Wenzhao Zheng and Jianjiang Feng and Jie Zhou and Jiwen Lu},
  booktitle = {Proceedings of the {IEEE/CVF} Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}


© Wenxuan Guo | Last update: May 12, 2026