Online Policy Improvement via Reliable Critics and Deployment-Aligned Data

May 2026

Authors:

Haotian Lin

Abstract:

Robotic foundation models have made broad, reusable robot competence increasingly plausible, but deployment still exposes a central limitation: policies trained primarily by imitation can lack the precision, recovery behavior, and task-specific adaptation needed in high-dimensional and contact-rich settings. This thesis studies online policy improvement as a mechanism for closing this gap. Its central premise is that scalable robotic adaptation requires reliable learning signals, including accurate critics for bootstrapping online experience and data collection procedures aligned with the states a deployed policy actually visits. The first part develops TD-MPC^2, a model-based reinforcement learning method for high-dimensional continuous control. In plan-based model-based reinforcement learning, an online MPC planner collects data while a nominal actor and critic are learned from the replay buffer. This creates a persistent mismatch between the planner policy used for exploration and the nominal policy evaluated by the critic. TD-MPC^2 identifies this mismatch as a source of value overestimation and mitigates it with a soft behavior-constrained policy update that preserves planning-based exploration while reducing unreliable critic queries, improving performance on DMControl and HumanoidBench. The second part develops PLD, a post-training framework for Vision-Language-Action (VLA) models. PLD freezes a pretrained VLA generalist, trains lightweight residual RL specialists, collects hybrid recovery trajectories from states induced by the base policy, and distills the resulting deployment-aligned data back into the generalist through supervised fine-tuning. Across LIBERO, SimplerEnv, and real-world Franka and YAM manipulation experiments, PLD shows that RL-generated data can improve VLA policies without requiring additional human teleoperation to generate post-training data.
Together, these two works argue that online policy improvement for robotics depends not only on stronger priors but also on mechanisms for reliably evaluating and curating the experience produced during deployment.
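
As an illustration of the soft behavior-constrained policy update described in the abstract, the following Python sketch penalizes the nominal policy for deviating from the planner's actions while it maximizes the critic. It is a minimal sketch under stated assumptions, not the thesis's implementation: the squared-distance penalty, the weight `beta`, the linear toy critic, and all function names are hypothetical choices made for this example.

```python
import numpy as np


def constrained_policy_loss(q_values, policy_actions, planner_actions, beta):
    """Batch mean of -Q(s, pi(s)) + beta * ||pi(s) - a_planner||^2.

    Hypothetical form of a soft behavior constraint: the first term asks the
    nominal policy to maximize the critic, the second keeps its actions close
    to the actions the planner actually took (the replay-buffer actions).
    """
    q_values = np.asarray(q_values, dtype=float)
    diff = np.asarray(policy_actions, dtype=float) - np.asarray(planner_actions, dtype=float)
    penalty = np.sum(diff ** 2, axis=-1)  # squared distance per sample
    return float(np.mean(-q_values + beta * penalty))


def constrained_action(critic_weights, planner_action, beta):
    """Minimizer of the loss above for a toy linear critic Q(a) = w . a.

    Setting the gradient -w + 2 * beta * (a - a_planner) to zero gives
    a = a_planner + w / (2 * beta). The linear critic is an assumption chosen
    because it extrapolates without bound, mimicking overestimation on
    actions far from the data.
    """
    w = np.asarray(critic_weights, dtype=float)
    a_planner = np.asarray(planner_action, dtype=float)
    return a_planner + w / (2.0 * beta)


if __name__ == "__main__":
    w = [1.0, 0.0]           # toy critic prefers larger first action coordinate
    a_planner = [0.0, 0.0]   # action chosen by the planner in this state
    for beta in (0.5, 5.0):
        print(beta, constrained_action(w, a_planner, beta))
```

With a larger `beta`, the optimized action stays closer to the planner's action, so the critic is queried less often on actions the replay buffer does not cover; with `beta` near zero, the update reduces to unconstrained critic maximization.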

Notes:

@mastersthesis{Lin-2026-88288,
author = {Haotian Lin},
title = {Online Policy Improvement via Reliable Critics and Deployment-Aligned Data},
year = {2026},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-40},
keywords = {reinforcement learning, robotics, model-based reinforcement learning, robotic foundation models},
}