Abstract:
Recovering three-dimensional structure from two-dimensional observations is a foundational problem in computer vision. Classical multi-view geometry showed that the problem becomes tractable when correspondence is known: given 2D points tracked across views, triangulation and factorization recover 3D. Modern learning systems took a different path: they resolve the ambiguity of single-view 2D observations by supervising on large-scale 3D labels. This requirement has become the dominant bottleneck in 3D learning. Capturing 3D ground-truth requires geodesic camera domes, controlled lab conditions, and millions of dollars of equipment; as a result, 3D data is three to five orders of magnitude scarcer than 2D data on the web, and most of what exists consists of synthetic CAD models rather than real-world capture.
This dissertation revisits a counterintuitive alternative: learning 3D lifting from 2D-only supervision. Prior work (2017-2019) demonstrated that neural networks can recover coherent 3D structure when trained with only a 2D reprojection loss, leveraging the implicit inductive bias of gradient descent toward smooth, plausible shapes. But these methods did not scale: each object category required a bespoke architecture with fixed keypoint schemas and category-specific bottlenecks. Transformers, with their permutation-equivariant attention, seemed to offer the scalability that was missing. Yet under 2D-only supervision they fail catastrophically, and the community responded by scaling 3D supervision instead - a trajectory that reinforces rather than removes the 3D-data bottleneck.
The central question of this dissertation is: what went wrong in the transformer era, and can we recover scalable 2D-only 3D lifting?
The main answer is that preserving correspondence, rather than adding supervision, is what unlocks scale. We formalize a non-identifiability result (Proposition 1): under permutation-equivariant architectures and a permutation-invariant 2D reprojection loss, token identity is unidentifiable from 2D supervision alone. The very property that makes transformers scale (permutation equivariance) is incompatible with the loss function required for 2D-only learning. We resolve this tension with a simple architectural change: injecting positional encoding at every attention layer, rather than only at the input. This preserves token identity throughout the forward pass without sacrificing the scalability advantages of transformer architectures. The empirical consequence is an 18x reduction in reconstruction error on Pascal3D+, from >150 mm to 8.1 mm, obtained from a single architectural change with no new parameters and no category-specific design.
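The symmetry behind this argument can be checked numerically. The sketch below is a toy illustration, not the dissertation's model: it assumes a single-head attention layer with a residual connection, an orthographic camera (reprojection drops z), random weights, and a hypothetical fixed table `pos` standing in for the positional encoding. All function names are illustrative. Without a positional term, shuffling the input keypoints shuffles the output identically and leaves the reprojection loss unchanged, so the loss cannot tell orderings apart; re-adding `pos` before every layer breaks that symmetry.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(h, wq, wk, wv):
    """Single-head self-attention with a residual connection and no positional term."""
    q, k, v = matmul(h, wq), matmul(h, wk), matmul(h, wv)
    scale = math.sqrt(len(h[0]))
    weights = [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]) for qi in q]
    mixed = matmul(weights, v)
    return [[a + b for a, b in zip(hr, mr)] for hr, mr in zip(h, mixed)]

def lift(obs2d, layers, pos=None):
    """Map 2D keypoints to 3D points; if pos is given, re-add it before every layer."""
    h = [[x, y, 0.0] for x, y in obs2d]
    for wq, wk, wv in layers:
        if pos is not None:
            h = [[a + b for a, b in zip(hr, pr)] for hr, pr in zip(h, pos)]
        h = attention(h, wq, wk, wv)
    return h

def reprojection_loss(pred3d, obs2d):
    """Sum of squared 2D errors under an assumed orthographic projection (drop z)."""
    return sum((p[0] - o[0]) ** 2 + (p[1] - o[1]) ** 2 for p, o in zip(pred3d, obs2d))

def permute(rows, perm):
    return [rows[i] for i in perm]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

n, d = 5, 3                      # 5 keypoints, 3-dimensional tokens
obs2d = rand_matrix(n, 2)        # toy 2D observations
layers = [(rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)) for _ in range(2)]
pos = rand_matrix(n, d)          # hypothetical positional table, one row per slot
perm = [2, 0, 4, 1, 3]
shuffled = permute(obs2d, perm)

# No positional term: f(Px) = P f(x), and the loss is identical for both orderings.
plain_gap = max_abs_diff(lift(shuffled, layers), permute(lift(obs2d, layers), perm))
plain_loss_gap = abs(reprojection_loss(lift(shuffled, layers), shuffled)
                     - reprojection_loss(lift(obs2d, layers), obs2d))
# Positional term at every layer: the network is no longer permutation-equivariant.
pe_gap = max_abs_diff(lift(shuffled, layers, pos), permute(lift(obs2d, layers, pos), perm))

print(f"no positional term: output gap {plain_gap:.2e}, loss gap {plain_loss_gap:.2e}")
print(f"per-layer positional term: output gap {pe_gap:.2e}")
```

In this toy setting the first two gaps sit at floating-point round-off while the third is clearly nonzero. The sketch only demonstrates the symmetry and how a per-layer positional term breaks it; it says nothing about reconstruction accuracy.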
The resulting 2D Lifting Foundation Model (2D-LFM) trains on 45+ heterogeneous object categories simultaneously using only 2D keypoint annotations, matches the accuracy of fully 3D-supervised counterparts, and exhibits strong cross-category transfer: data-poor categories (e.g., Drosophila, with 80 training samples) benefit from geometric patterns learned across the full taxonomy, with error falling from 23.4 mm in isolated training to 1.8 mm when co-trained, with zero 3D labels at any stage.
Beyond the core contribution, the dissertation presents three supporting results that establish the unifying theme of geometric structure as a substitute for supervision cost: (i) Multi-view Bootstrapping in the Wild (MBW), which reduces 2D annotation requirements by 98% through automatic geometric verification; (ii) 3D-LFM, which establishes transformer-based lifting as a foundation-model paradigm under 3D supervision and demonstrates strong cross-category transfer across 30+ categories, along with its temporal extension 3D-LFM-Time; and (iii) RAT4D, which extends sparse landmark lifting to dense, animatable reconstruction by coupling Gaussian splatting with rendering-pose joint optimization, without category-specific surface templates.
The thesis argues that Kanade's long-standing emphasis on "correspondence, correspondence, correspondence" remains a useful guide in the foundation-model era. Understanding why correspondence matters, and designing architectures that preserve it, enables a different scaling trajectory for 3D learning: one grounded in the widely available 2D observations of the internet, rather than in the expensive multi-camera rigs of the laboratory.
Notes:
@phdthesis{Dabhi-2026-88271,
author = {Mosam Dabhi},
title = {Correspondence-Preserving Transformers for Scalable 3D Lifting},
year = {2026},
month = {April},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-28},
keywords = {Self-supervised 3D labeling, 3D reconstruction, 2D to 3D Lifting, Geometry, Self-supervision, Transformers},
}