Learning Geometric, Physical, and Semantic Priors for Embodied Planning and Control

May 2026

Authors:

Junyu Nan

Abstract:

Embodied intelligence requires perceiving, predicting, and acting in environments with an understanding of the geometric, physical, and semantic structure of the world. Recently, the dominant trend in robotics has been to acquire such world understanding implicitly in a data-driven manner using end-to-end models. While these approaches have achieved impressive milestones, they often rely on substantial amounts of in-domain data and may remain brittle when success depends on long-horizon reasoning, precise physical interaction, or generalization from limited task-specific data. This thesis studies a perspective complementary to end-to-end approaches: when aspects of geometric, physical, and semantic structure are known and reusable, explicitly learning priors over them can improve the data efficiency, fidelity, and robustness of embodied learning systems. We instantiate this perspective within a modular paradigm that separates a high-level task understanding module from a low-level task execution module, and use explicitly learned geometric, physical, and semantic priors as intermediate representations that bridge the two.
Concretely, this thesis explores four embodied settings: scene-level forecasting in autonomous driving, learning deformable object dynamics from robot interaction videos, relational reasoning and cross-instance manipulation transfer, and zero-shot long-horizon manipulation. For scene-level prediction, we learn a predictive geometric prior over the future evolution of the full 3D scene representation. By modeling future motion at the scene level, the predictive geometric representation preserves coherence across agents and the environment, improving downstream prediction and planning under multi-agent uncertainty. Moving from passive scene forecasting to robot interaction, physical priors are needed to model how the state evolves in response to actions. To learn contact-rich dynamics and topological change directly from RGB-D robot interaction videos, we represent deformable objects as adaptive sets of 3D Gaussians and frame state estimation as learned, differentiable approximate Bayesian filtering inspired by particle filtering, with physics-inspired interaction modeling and resampling mechanisms over this adaptive geometry. Extending to multi-object manipulation, we develop semantic priors based on correspondence as a reusable representation for relational reasoning and cross-instance manipulation transfer. These correspondence-based priors allow a robot to identify functionally meaningful object structure, reason about object alignment under geometric ambiguity, and transfer manipulation knowledge from demonstrated objects to novel instances. Finally, we integrate learned geometric, physical, and semantic priors into a zero-shot long-horizon manipulation system that connects high-level video and language planning with executable robot motion through geometric grounding of generated videos.
Together, these works show that explicitly learned geometric, physical, and semantic priors can improve the data efficiency, fidelity, and robustness of embodied prediction, planning, and manipulation systems.

Notes:

@phdthesis{Nan-2026-88296,
author = {Junyu Nan},
title = {Learning Geometric, Physical, and Semantic Priors for Embodied Planning and Control},
year = {2026},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-47},
}