February 2026
A Layered Foundation for Reliable Trajectory Forecasting: Data, Evaluation, and Methods
Authors: Erica Weng
Abstract:
Reliable trajectory forecasting is a foundational requirement for autonomous robotic systems operating in environments with humans, where reliability means producing predictions that are collision-free, socially consistent, and robust across both routine and safety-critical scenarios. Despite substantial progress in modeling techniques, existing forecasting systems often fail under distribution shift, exhibit socially implausible behaviors, or report misleading performance. The field has largely treated these as modeling problems and thus has invested heavily in ever more expressive architectures while under-investing in the infrastructure that models depend on. This thesis takes a different position: that reliable trajectory forecasting requires treating data curation, evaluation design, and modeling as co-equal engineering challenges, organized as a layered stack where each layer depends on the soundness of those below it. Good methods are only as useful as the benchmarks that evaluate them, and good benchmarks are only as meaningful as the data that underlies them.
Forecasting systems are only as reliable as the data they learn from, yet current datasets systematically under-represent the rare, safety-critical tail behaviors that matter most for deployment. We present JaywalkerVR, a Virtual Reality human-in-the-loop system, and the CARLA-VR dataset of safety-critical pedestrian-vehicle interactions collected with it. We show that this incomplete coverage significantly impairs forecasting reliability, and that augmenting training data with VR-collected interactions yields 10.7% lower displacement error and 4.9% lower collision rate on interactive scenarios, establishing the base layer upon which meaningful evaluation and modeling must rest.
Even with better data, progress is illusory if we measure it poorly. Widely used forecasting metrics obscure critical failure modes such as collisions and socially implausible interactions, giving a false sense of readiness for deployment. Building on the data foundation, we introduce joint evaluation metrics (JADE, JFDE) and collision rate, revealing a 2× gap between marginal and joint performance. Optimizing for joint metrics with no architectural changes yields a 16% collision rate reduction, confirming that evaluation design directly shapes the models the community builds. Without these metrics, improvements in model design cannot be trusted to reflect genuine progress.
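To make the marginal/joint distinction concrete, here is a minimal numpy sketch (not the thesis's implementation; function names and array conventions are illustrative). With K sampled futures for a scene of N agents, a marginal metric lets each agent independently pick its best sample, while a joint metric (in the spirit of JADE) must commit to a single sample index for the whole scene:

```python
import numpy as np

def min_ade(pred, gt):
    """Marginal minADE: each agent independently takes its best of K samples.
    pred: (K, N, T, 2) — K scene samples, N agents, T timesteps; gt: (N, T, 2)."""
    err = np.linalg.norm(pred - gt[None], axis=-1).mean(axis=-1)  # (K, N)
    return err.min(axis=0).mean()   # best sample per agent, then mean over agents

def min_jade(pred, gt):
    """Joint minJADE: one sample index k is chosen for the entire scene."""
    err = np.linalg.norm(pred - gt[None], axis=-1).mean(axis=-1)  # (K, N)
    return err.mean(axis=1).min()   # mean over agents first, then best scene sample
```

The gap appears whenever different agents' best predictions live in different samples: a model can score a perfect marginal minADE while no single joint scene prediction is accurate, which is exactly the failure mode marginal metrics hide.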
Only once data and evaluation are sound does it become productive to ask how we can improve these models. Building on these foundations, we present PECT (Pose and Environment-Contextualized Transformer), a three-stream architecture that incorporates human body pose and dense Bird's Eye View environmental semantics alongside trajectory history. We introduce the environment collision rate (ECR) metric and a gated curriculum fusion strategy that aligns trajectory, pose, and dense environment features so that the additional modalities improve collision avoidance rather than introducing noise. PECT improves agent-agent collision rate by 6–12% and environment collision rate by 8–10%, without sacrificing displacement accuracy. The value of these richer inputs is only legible because the underlying data coverage and evaluation criteria are equipped to surface the differences that matter.
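A metric like ECR can be sketched as follows (an illustrative reconstruction, not the thesis's code; the function name, grid convention, and the choice to treat off-grid waypoints as free space are assumptions). Each predicted trajectory is rasterized into a BEV occupancy grid, and a trajectory counts as an environment collision if any waypoint lands in a non-traversable cell:

```python
import numpy as np

def environment_collision_rate(pred, occ_grid, origin, resolution):
    """Fraction of predicted trajectories entering a non-traversable BEV cell.
    pred: (K, T, 2) world-frame (x, y) waypoints; occ_grid: (H, W) bool, True = obstacle;
    origin: world coords of cell (0, 0); resolution: meters per cell."""
    cells = np.floor((pred - origin) / resolution).astype(int)   # (K, T, 2)
    rows, cols = cells[..., 1], cells[..., 0]
    # Waypoints outside the grid are counted as free space (an assumption).
    inside = (rows >= 0) & (rows < occ_grid.shape[0]) & \
             (cols >= 0) & (cols < occ_grid.shape[1])
    hits = np.zeros(pred.shape[:2], dtype=bool)
    hits[inside] = occ_grid[rows[inside], cols[inside]]
    return hits.any(axis=1).mean()   # per-trajectory collision flag, averaged over K
```

Because the check is against dense map semantics rather than other agents, it surfaces a failure mode (predictions through walls, off sidewalks, into buildings) that neither displacement error nor agent-agent collision rate can detect.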
Taken together, this thesis argues that the trajectory forecasting community should approach deployment readiness not as a modeling problem but as a systems problem. Data, evaluation, and methods are deeply interdependent—neglecting any one undermines the others. By addressing all three as a unified stack, this work contributes a framework, concrete tools, and a philosophy for building forecasting systems genuinely aligned with the demands of real-world autonomous decision-making.
@phdthesis{Weng-2026-150446,
author = {Erica Weng},
title = {A Layered Foundation for Reliable Trajectory Forecasting: Data, Evaluation, and Methods},
year = {2026},
month = {February},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-10},
keywords = {trajectory forecasting, evaluation, benchmarks, autonomous systems, human–robot interaction, safety-critical, data curation, long-tail, virtual reality},
}