Abstract:
State estimation is a fundamental component of embodied perception. For safe navigation, we argue that robots, and autonomous vehicles (AVs) in particular, must detect, track, and forecast all object categories, not just those seen during training. In this thesis, we study open-world 3D perception along three complementary axes: (i) long-tailed recognition for offline data curation, (ii) rapid model adaptation to new concepts via few-shot multi-modal examples, and (iii) low-level 3D motion understanding for fast reactive control.
Contemporary AV benchmarks have advanced techniques for training 3D detectors on large-scale data. Although prior work has nearly solved 3D object detection for a few common classes (e.g., pedestrian and car), detecting the many rare classes in the tail (e.g., debris and stroller) remains challenging. This limitation is especially critical for offline scenario mining, where identifying rare but safety-critical events is essential. We show that fine-grained tail-class accuracy improves significantly with multi-modal fusion of RGB images and LiDAR; fine-grained classes are difficult to identify from sparse LiDAR geometry alone, suggesting that multi-modal cues are crucial for long-tailed 3D detection. To this end, we study a simple late-fusion framework that ensembles independently trained uni-modal LiDAR and RGB detectors. Importantly, this formulation allows us to leverage large-scale uni-modal datasets (with more examples of rare classes) to train stronger RGB detectors, unlike prevailing multi-modal approaches that require paired multi-modal training data. While such models improve the detection accuracy of rare categories, open-world perception also requires adapting to new and evolving concepts from limited supervision.
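To make the late-fusion formulation concrete, the sketch below ensembles an independently trained LiDAR detector and RGB detector at the score level by matching projected 3D boxes against 2D image detections. The function names, data layout, and fusion rule are illustrative assumptions for exposition, not the exact method studied in the thesis.

# A minimal sketch of score-level late fusion between uni-modal detectors.
# All names and the fusion rule are illustrative assumptions.

def iou_2d(a, b):
    """IoU between two axis-aligned 2D boxes [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def late_fuse(lidar_dets, rgb_dets, iou_thresh=0.5, alpha=0.5):
    """Fuse per-class confidences of 3D LiDAR detections with 2D RGB detections.

    lidar_dets: list of dicts with projected image box 'box2d', 'score', 'label'
    rgb_dets:   list of dicts with 'box2d', 'score', 'label'
    Returns the LiDAR detections with fused confidence scores.
    """
    fused = []
    for det in lidar_dets:
        # Find the best-matching RGB detection of the same class.
        matches = [r["score"] for r in rgb_dets
                   if r["label"] == det["label"]
                   and iou_2d(det["box2d"], r["box2d"]) > iou_thresh]
        rgb_score = max(matches) if matches else 0.0
        # Weighted average of uni-modal confidences (one simple fusion rule).
        fused.append({**det, "score": alpha * det["score"] + (1 - alpha) * rgb_score})
    return fused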
The emergence of vision-language models (VLMs) trained on web-scale datasets challenges conventional formulations of open-world perception. We revisit few-shot object detection (FSOD) in the context of such foundation models. Zero-shot predictions from models such as GroundingDINO already outperform state-of-the-art few-shot detectors on COCO (48 vs. 33 AP), yet remain misaligned with out-of-distribution target domains. For instance, trucks on the web (e.g., pickup trucks) may be defined differently from trucks in autonomous driving scenarios (e.g., semi-trucks). We therefore reformulate few-shot recognition as aligning foundation models to target concepts using a small number of examples. These examples are naturally multi-modal, combining text and visual cues, analogous to how human annotators learn to annotate new categories. Concretely, we propose Foundational FSOD, a benchmark protocol that evaluates detectors pre-trained on arbitrary external data and adapted using multi-modal K-shot examples per class. Together with long-tailed detection, Foundational FSOD enables scalable discovery of rare and ambiguously defined object categories for scenario mining.
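As a concrete illustration of the Foundational FSOD protocol, the sketch below adapts a detector pre-trained on arbitrary external data using K multi-modal shots per class before evaluation. The FewShotExample structure and the adapt/predict interface are assumptions for exposition, not the benchmark's actual API.

# A minimal sketch of the Foundational FSOD protocol: adapt a detector that was
# pre-trained on arbitrary external data using K multi-modal shots per class.
# The detector interface below is an assumption for illustration.
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol

@dataclass
class FewShotExample:
    """One multi-modal exemplar: a text description plus a labeled image."""
    class_name: str
    text_prompt: str            # e.g., "semi-truck with an attached trailer"
    image: Any                  # image array or path
    boxes: List[List[float]]    # ground-truth boxes [x1, y1, x2, y2]

class FoundationalDetector(Protocol):
    def adapt(self, examples: List[FewShotExample]) -> None: ...
    def predict(self, image: Any) -> List[dict]: ...

def run_foundational_fsod(detector: FoundationalDetector,
                          support: Dict[str, List[FewShotExample]],
                          test_images: List[Any],
                          k: int = 5) -> List[List[dict]]:
    """Align the detector to target-domain concepts, then run inference."""
    shots = [ex for examples in support.values() for ex in examples[:k]]
    detector.adapt(shots)       # e.g., prompt tuning or lightweight fine-tuning
    return [detector.predict(img) for img in test_images]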
Finally, beyond semantic recognition and offline discovery, on-robot open-world perception systems must support fast, reactive decision-making. In safety-critical scenarios, we argue that accurate 3D motion estimation is more important for evasive maneuvering than explicit categorization. We therefore study LiDAR scene flow, the task of estimating per-point 3D motion between consecutive point clouds. Prior methods achieve centimeter-level accuracy but are typically trained on a single sensor, limiting generalization. In contrast, we learn motion priors that transfer across diverse and unseen LiDAR sensors. While prior work in LiDAR segmentation and detection suggests that naive multi-dataset training degrades performance, we find that this conventional wisdom does not hold for motion estimation: scene flow models benefit substantially from cross-dataset training without architectural changes. Our analysis suggests that low-level motion cues are less sensitive to sensor configuration; indeed, models trained on fast-moving objects (e.g., from highway datasets) accurately estimate the motion of fast-moving objects in other datasets. Building on this insight, we propose UniFlow, a simple feedforward model trained jointly on multiple large-scale scene flow datasets with diverse sensor setups. UniFlow establishes a new state of the art on Waymo and nuScenes, improving over prior work by 5.1% and 35.2%, respectively, and generalizes to unseen datasets such as TruckScenes and AEVAScenes.
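For readers unfamiliar with the task, the sketch below shows the standard end-point-error metric for per-point scene flow and a naive mixed-dataset sampling loop of the kind motivated above. The dataset interface and all names are illustrative assumptions, not the UniFlow training or evaluation code.

# A minimal sketch of per-point scene flow evaluation (end-point error) and
# naive cross-dataset batch sampling; variable names are illustrative.
import numpy as np

def end_point_error(pred_flow: np.ndarray, gt_flow: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and true per-point 3D motion.

    pred_flow, gt_flow: (N, 3) arrays of per-point offsets in meters,
    estimated between two consecutive LiDAR point clouds.
    """
    return float(np.linalg.norm(pred_flow - gt_flow, axis=1).mean())

def mixed_dataset_batches(datasets, rng=np.random.default_rng(0)):
    """Cross-dataset training: draw each batch from a randomly chosen dataset
    (e.g., with different LiDAR sensors and speed profiles)."""
    while True:
        ds = datasets[rng.integers(len(datasets))]
        yield ds.sample_batch()   # assumed dataset interface, for illustration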
@phdthesis{Peri-2026-88259,
author = {Neehar Peri},
title = {Towards Scalable Open-World 3D Perception},
year = {2026},
month = {April},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-21},
keywords = {Open-World Perception, 3D Foundation Models, LiDAR Scene Flow, Vision-Language Models, Autonomous Vehicles},
}