May 2026
Online Policy Improvement via Reliable Critics and Deployment-Aligned Data
Authors: Haotian Lin
Abstract:
Robotic foundation models have made broad, reusable robot competence increasingly plausible, but deployment still exposes a central limitation: policies trained primarily by imitation can lack the precision, recovery behavior, and task-specific adaptation needed in high-dimensional and contact-rich settings. This thesis studies online policy improvement as a mechanism for closing this gap. Its central premise is that scalable robotic adaptation requires reliable learning signals, including accurate critics for bootstrapping online experience and data collection procedures aligned with the states a deployed policy actually visits. The first part develops TD-MPC^2, a model-based reinforcement learning method for high-dimensional continuous control. In plan-based model-based reinforcement learning, an online MPC planner collects data while a nominal actor and critic are learned from the replay buffer. This creates a persistent mismatch between the planner policy used for exploration and the nominal policy evaluated by the critic. TD-MPC^2 treats this mismatch as a source of value overestimation and mitigates it with a soft behavior-constrained policy update that preserves planning-based exploration while reducing unreliable critic queries, which improves performance on DMControl and HumanoidBench. The second part develops PLD, a post-training framework for Vision-Language-Action (VLA) models. PLD freezes a pretrained VLA generalist, trains lightweight residual RL specialists, collects hybrid recovery trajectories from states induced by the base policy, and distills the resulting deployment-aligned data back into the generalist through supervised fine-tuning. Across LIBERO, SimplerEnv, and real-world Franka and YAM manipulation experiments, PLD shows that RL-generated data can improve VLA policies without requiring additional human teleoperation to generate post-training data.
Together, these two works argue that online policy improvement for robotics depends not only on stronger priors but also on mechanisms for reliably evaluating and curating the experience produced during deployment.
Notes:
@mastersthesis{Lin-2026-88288,
author = {Haotian Lin},
title = {Online Policy Improvement via Reliable Critics and Deployment-Aligned Data},
year = {2026},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-40},
keywords = {reinforcement learning, robotics, model-based reinforcement learning, robotic foundation models},
}