Towards Practical Vision-and-Language Navigation Systems Through 3D Referential Grounding

July 2025

Authors:

Nader Zantout

Abstract:

As robots transition toward practical deployment as collaborative agents in human environments, it becomes essential to improve language-conditioned environmental understanding. A vision-and-language navigation (VLN) system must adapt to both the types of language used and the actions expected by a human collaborator. Often, a single sentence containing spatial relations and semantic attributes (e.g., “fetch the yellow bottle on the table”) is all that is provided to specify a target object in a complex scene. The task of identifying the correct object from such a statement is known as 3D referential grounding.
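To make the task concrete, the following is a minimal toy sketch of grounding the example statement above. It is not the thesis's method: the `SceneObject` class, the `is_on` heuristic, the tolerance value, and the hand-built scene are all assumptions made for illustration, and the sentence is parsed by hand rather than by a language model.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    # Hypothetical object record: axis-aligned box with a label and a colour.
    name: str
    label: str
    color: str
    center: tuple  # (x, y, z) in metres, z pointing up
    size: tuple    # (dx, dy, dz) in metres


def is_on(obj, support, tol=0.05):
    """Assumed view-independent 'on' test: the object's bottom face lies
    within `tol` of the support's top face, and the object's centre falls
    inside the support's horizontal footprint."""
    obj_bottom = obj.center[2] - obj.size[2] / 2
    support_top = support.center[2] + support.size[2] / 2
    within_x = abs(obj.center[0] - support.center[0]) <= support.size[0] / 2
    within_y = abs(obj.center[1] - support.center[1]) <= support.size[1] / 2
    return abs(obj_bottom - support_top) <= tol and within_x and within_y


def ground(objects, label, color, relation, anchor_label):
    """Return objects matching the label and colour that satisfy the
    spatial relation with at least one anchor object."""
    anchors = [o for o in objects if o.label == anchor_label]
    return [
        o for o in objects
        if o.label == label
        and o.color == color
        and any(relation(o, a) for a in anchors)
    ]


# Hand-built toy scene: one table, two yellow bottles, one green bottle.
scene = [
    SceneObject("table_0", "table", "brown", (0.0, 0.0, 0.4), (1.2, 0.8, 0.8)),
    SceneObject("bottle_0", "bottle", "yellow", (0.2, 0.1, 0.9), (0.08, 0.08, 0.2)),
    SceneObject("bottle_1", "bottle", "yellow", (2.0, 0.0, 0.1), (0.08, 0.08, 0.2)),  # on the floor
    SceneObject("bottle_2", "bottle", "green", (-0.3, 0.0, 0.9), (0.08, 0.08, 0.2)),  # wrong colour
]

# "the yellow bottle on the table", decomposed by hand into its parts.
matches = ground(scene, "bottle", "yellow", is_on, "table")
print([o.name for o in matches])
```

Even in this toy scene, the semantic attribute ("yellow") and the spatial relation ("on the table") are each needed to rule out a distractor, which hints at why the task becomes hard as the number of objects grows.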

This thesis develops and deploys a practical VLN system through the lens of 3D referential grounding, a particularly challenging task due to the large number of objects in typical scenes and the relative scarcity of 3D data compared to 2D. We pursue two complementary approaches: (1) scaling up the training of an end-to-end 3D referential grounding model, and (2) decomposing the task into a modular pipeline.

First, we introduce IRef-VLA, a large-scale benchmark for Interactive Referential Vision-and-Language-guided Action. IRef-VLA aims to improve generalization in 3D referential grounding using synthetic utterances generated from scene graphs with view-independent spatial relations. Baseline models trained on IRef-VLA show strong zero-shot transfer performance, and an LLM-based graph search baseline achieves high grounding accuracy, motivating a modular alternative to end-to-end approaches.

We then explore this modular approach in SORT3D, a Spatial Object-centric Reasoning Toolbox for 3D grounding with foundation models. SORT3D combines real-time semantic mapping, vision-language captioning, query-based object filtering, and structured spatial reasoning via LLMs into a deployable system. It demonstrates strong zero-shot performance across two benchmark datasets and on real-world robotic platforms operating in unseen environments.

Together, these systems establish a template for building effective collaborative embodied agents, in which the ideal model is a middle ground between fully end-to-end learning and a fully heuristics-based approach. They also act as a springboard towards general-purpose VLN systems deployable in all environments.

Notes:

@mastersthesis{Zantout-2025-148165,
author = {Nader Zantout},
title = {Towards Practical Vision-and-Language Navigation Systems Through 3D Referential Grounding},
year = {2025},
month = {July},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-60},
keywords = {Vision-and-Language Navigation, Embodied AI, 3D Computer Vision, Natural Language Processing},
}
Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.