June 2026

Towards Generalizable Embodied Navigation with Vision-Language Models

Authors: Zongtai Li

Master's Thesis

Abstract:

Embodied navigation asks an autonomous agent to move through unknown environments and accomplish tasks such as finding objects or following instructions. Reliable performance in real-world settings, from household assistance to warehouse logistics, requires the agent to tightly integrate perception, semantic reasoning, and long-horizon planning under cluttered layouts, ambiguous appearances, and robot-specific constraints. Vision-language models (VLMs) offer rich semantic priors for this task, but directly inserting them into the navigation loop often leads to inefficient exploration, unstable behavior, and limited transfer across platforms. This thesis argues that these failures stem from a multi-level misalignment between how VLMs reason and what navigation demands, and presents four complementary contributions to address it. STRIVE shows that object navigation improves substantially when the environment is summarized as a structured graph of objects, viewpoints, and rooms, letting the VLM reason at a semantic level while classical algorithms handle local exploration. SysNav extends this into a deployable system by decoupling semantic reasoning, room-level planning, and embodiment-specific control for robust cross-platform deployment. IntentNav shifts from prompting to learning, showing that navigation decisions become more stable when trained with intent-aligned supervision from human demonstrations. Because object-goal search underutilizes VLM reasoning, Goal2Pixel moves to instruction-guided navigation, where longer, compositional instructions demand richer language grounding, and reformulates the task as pixel grounding so that the model directly connects instruction understanding to executable motion.
Together, these works trace a progression from structured representation through system integration and learned decision making to instruction-guided navigation, arguing that effective embodied navigation with VLMs requires aligning reasoning with the right representations, architectures, learning objectives, and task formulations.

Notes:

@mastersthesis{Li-2026-88312,
author = {Zongtai Li},
title = {Towards Generalizable Embodied Navigation with Vision-Language Models},
year = {2026},
month = {June},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-62},
}