SPOT: Spectral Preconditioning of Text Embeddings

May 2026

Authors:

Runkai Zheng

Abstract:

Vision-language models such as CLIP enable zero-shot recognition by aligning images and natural-language prompts in a shared embedding space. Their appeal is strongest in settings where task-specific training data are scarce: a practitioner can describe classes with prompts, encode them as text axes, and classify images by image-text similarity. However, the same zero-shot pipeline can fail unevenly across subpopulations. Pretraining spans broad and heterogeneous data, and the resulting text axes may entangle intended class semantics with dataset-specific nuisances such as demographic attributes, background context, collection source, or other spurious factors. When these factors correlate with labels, worst-group accuracy can degrade even when average accuracy appears acceptable.

This thesis presents SPOT (Spectral Preconditioning of Text Embeddings), a closed-form calibration method for CLIP-like zero-shot classifiers. SPOT estimates the covariance structure of target-domain image embeddings, decomposes the prompt-derived text axes in the corresponding eigenspace, and applies a smooth spectral response that preserves class-relevant bands while suppressing nuisance bands. The resulting calibrated axes replace the original text axes in the standard cosine-scoring pipeline; the image encoder, text encoder, and prompting interface remain frozen. The method is lightweight, data-efficient, label-efficient, and deterministic, reducing adaptation to a small validation search over spectral parameters and class margins.

The thesis also introduces a Personal Calibration Protocol (PCP) for on-device, privacy-aware adaptation. PCP models a realistic human-sensing scenario in which a user has an unlabeled image album, task retrievals generated by a fixed reference model, and sparse oracle corrections of that model's own errors. Unlike standard group-robustness evaluations, PCP uses error corrections as the supervision signal and does not require explicit group labels during adaptation. The PCP-Face-v1 suite instantiates this protocol on face-attribute recognition tasks using CelebA, FairFace, UTKFace, and a shared unlabeled pool.

Experiments on standard group-robustness benchmarks show that SPOT improves worst-group accuracy and is competitive with, or better than, existing debiasing methods, particularly in multi-class settings. Under the PCP-Face-v1 protocol, common few-shot and adaptation baselines are often unstable or can fall below the zero-shot reference model, while SPOT improves worst-group accuracy across CelebA, FairFace, and UTKFace and scales with the number of oracle corrections. Finally, the thesis develops a differentially private version, DP-SPOT, by releasing a noisy calibrated weight matrix with sensitivity bounds derived from spectral Lipschitzness. Together, these results support SPOT as a practical calibration layer for fair, private, and resource-constrained zero-shot recognition.
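
As a rough illustration of the calibration pipeline described in the abstract (covariance estimation on image embeddings, decomposition of the text axes in the eigenbasis, a smooth spectral response, then cosine scoring), the following NumPy sketch may help. It is not the thesis implementation. The sigmoid band-pass over eigenvalue rank, the parameters `low`, `high`, and `sharpness`, the additive `margins`, and all function names are assumptions made for illustration; the actual response function and search procedure are defined in the thesis itself.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def covariance_eigenbasis(image_embs):
    """Eigen-decompose the covariance of unlabeled target-domain image embeddings.

    Returns eigenvalues in descending order and the matching eigenvectors
    as columns.
    """
    cov = np.cov(image_embs, rowvar=False)      # (d, d) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    return eigvals[::-1], eigvecs[:, ::-1]


def smooth_band_response(dim, low, high, sharpness):
    """Hypothetical smooth band-pass over normalized eigenvalue rank in [0, 1].

    Rank 0 is the highest-variance direction. Bands with rank between `low`
    and `high` get a gain near 1; bands outside decay smoothly toward 0.
    """
    rank = np.linspace(0.0, 1.0, dim)
    rise = 1.0 / (1.0 + np.exp(-sharpness * (rank - low)))
    fall = 1.0 / (1.0 + np.exp(-sharpness * (high - rank)))
    return rise * fall


def calibrate_text_axes(image_embs, text_axes, low, high, sharpness):
    """Filter prompt-derived text axes in the image-covariance eigenbasis."""
    _, eigvecs = covariance_eigenbasis(image_embs)
    response = smooth_band_response(eigvecs.shape[1], low, high, sharpness)
    coeffs = text_axes @ eigvecs                # text axes in the eigenbasis
    filtered = coeffs * response                # per-band gain
    return l2_normalize(filtered @ eigvecs.T)   # back to embedding space


def zero_shot_predict(image_embs, axes, margins=None):
    """Cosine scoring against class axes, with optional per-class offsets.

    Treating class margins as additive offsets is an assumption of this sketch.
    """
    scores = l2_normalize(image_embs) @ l2_normalize(axes).T
    if margins is not None:
        scores = scores + margins
    return scores.argmax(axis=1)
```

In this sketch, `low`, `high`, `sharpness`, and `margins` stand in for the spectral parameters and class margins that the abstract says are chosen by a small validation search. Both encoders stay untouched: only the class axes passed to `zero_shot_predict` change.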

Notes:

@mastersthesis{Zheng-2026-88291,
author = {Runkai Zheng},
title = {SPOT: Spectral Preconditioning of Text Embeddings},
year = {2026},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-41},
keywords = {vision-language models, zero-shot learning, fairness, group robustness, differential privacy, personal calibration},
}