Learned Metrics-Aware Covariance for Visual-Inertial Fusion

April 2026

Authors:

Xiang Fei

Abstract:

Visual-inertial (VI) state estimation fuses cameras and inertial measurement units (IMUs) to produce accurate, metric-scale estimates for autonomous systems. The covariance matrices associated with visual and inertial measurements govern how the estimator weights each sensing modality, making accurate covariance modeling critical for fusion quality and consistency. However, most existing VI estimators rely on constant or heuristically tuned covariance parameters that fail to capture the observation-dependent nature of real sensor uncertainty. These parameters require tedious recalibration for each platform and environment and often yield suboptimal performance.

This thesis develops learned metrics-aware covariance models for both visual and inertial modalities and integrates them into principled, tuning-free VI fusion. Here, metrics-aware means that the predicted covariance faithfully reflects the true error distribution in physical units, so that it can be used directly in probabilistic estimation without further scaling or tuning.
On the visual side, we estimate visual pose covariances from the learned feature-matching uncertainties provided by MAC-VO. On the inertial side, we extend the AirIMU framework with an improved encoder design, a learnable initial covariance, and a covariance fine-tuning stage, producing metrics-aware IMU integration covariances. Building on these two components, we present two VI systems (MAC-I$^2$ and MAC-VIO) that leverage the learned covariances for downstream estimation.
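One common way to test whether a predicted covariance "reflects the true error distribution in physical units" is the normalized estimation error squared (NEES): for a calibrated d-dimensional Gaussian error model, its average should be close to d. The sketch below is only an illustration of that idea and is not taken from the thesis. The function names, the diagonal-covariance simplification, and the standard deviations are all assumptions made for the example.

```python
import random


def average_nees(errors, variances):
    """Average normalized estimation error squared for diagonal covariances.

    errors:    list of error vectors (estimate minus ground truth)
    variances: list of predicted per-axis variances, one vector per error
    """
    total = 0.0
    for err, var in zip(errors, variances):
        total += sum(e * e / v for e, v in zip(err, var))
    return total / len(errors)


def simulate_errors(true_std, num_samples=20000, seed=0):
    """Draw zero-mean Gaussian errors with the given per-axis std (assumed values)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, s) for s in true_std] for _ in range(num_samples)]


# Hypothetical per-axis error standard deviations, in metres.
TRUE_STD = (0.02, 0.05, 0.10)
errors = simulate_errors(TRUE_STD)
true_var = [s * s for s in TRUE_STD]

# A calibrated predictor reports the true variances.
calibrated = average_nees(errors, [true_var] * len(errors))

# An overconfident predictor reports variances four times too small.
small_var = [0.25 * v for v in true_var]
overconfident = average_nees(errors, [small_var] * len(errors))

print(f"calibrated NEES:    {calibrated:.2f} (ideal is 3 for a 3-D error)")
print(f"overconfident NEES: {overconfident:.2f}")
```

In this toy setting the calibrated predictor scores close to 3, while the overconfident one scores about four times higher, which is the kind of mismatch that would otherwise be absorbed by manual scaling of the covariance.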

MAC-I$^2$ addresses VI initialization and extrinsic calibration in challenging environments involving illumination changes, dynamic objects, and occlusions. Empowered by metrics-aware visual pose and IMU integration covariances, we reformulate initialization and calibration as an uncertainty-aware joint optimization problem. Unlike loosely coupled methods (e.g., VINS-Mono) that fully trust visual poses, or tightly coupled methods that jointly optimize raw measurements but require good initial values for stability, our framework enables principled fusion by weighting each constraint according to its predicted covariance rather than directly trusting either sensor. On the EuRoC benchmark, MAC-I$^2$ achieves a 100% initialization success rate (compared to 74% for the best baseline) and a 59% improvement in gravity direction estimation, all without manual uncertainty tuning. We further demonstrate that the learned covariance models generalize across diverse scenes, sensor types, and motion regimes through evaluation on TUM-VI and TartanAir v2.
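The idea of "weighting each constraint according to its predicted covariance" can be illustrated with the simplest possible case: two scalar estimates of the same quantity, fused by inverse-variance weighting. This is a minimal sketch, not the MAC-I$^2$ formulation, which the abstract describes as a joint optimization over many constraints. The function name `fuse_scalar` and all numbers are hypothetical.

```python
def fuse_scalar(x_vis, var_vis, x_imu, var_imu):
    """Inverse-variance weighted fusion of two scalar estimates.

    Returns the x that minimizes
        (x - x_vis)^2 / var_vis + (x - x_imu)^2 / var_imu
    together with the variance of the fused estimate.
    """
    if var_vis <= 0.0 or var_imu <= 0.0:
        raise ValueError("variances must be positive")
    w_vis = 1.0 / var_vis  # information (inverse variance) of the visual estimate
    w_imu = 1.0 / var_imu  # information of the inertial estimate
    fused = (w_vis * x_vis + w_imu * x_imu) / (w_vis + w_imu)
    fused_var = 1.0 / (w_vis + w_imu)
    return fused, fused_var


# Hypothetical displacement along one axis, in metres.
# Case 1: vision is confident, so the fused value stays near the visual estimate.
good_vision = fuse_scalar(x_vis=1.00, var_vis=0.01, x_imu=2.00, var_imu=0.04)

# Case 2: vision is degraded (large predicted variance), so the IMU dominates.
poor_vision = fuse_scalar(x_vis=1.00, var_vis=4.00, x_imu=2.00, var_imu=0.04)

print(good_vision)
print(poor_vision)
```

With a confident visual estimate the result lands at about 1.2 m, closer to vision. When the predicted visual variance grows, the same formula shifts the result toward the IMU value without any hand-set switch, which is the behavior the paragraph above contrasts with fully trusting one sensor.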

MAC-VIO extends metrics-aware covariance to continuous visual-inertial odometry, performing two-frame optimization with feature-level visual constraints and IMU preintegration constraints. On the EuRoC benchmark, MAC-VIO reduces the absolute trajectory error by 33.4% compared to the visual-only MAC-VO baseline. Together, these results demonstrate that learned, metrics-aware covariance modeling enables robust and accurate VI fusion without hand-tuned uncertainty parameters.

Notes:

@mastersthesis{Fei-2026-88274,
author = {Xiang Fei},
title = {Learned Metrics-Aware Covariance for Visual-Inertial Fusion},
year = {2026},
month = {April},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-19},
}