Abstract:
Vision-language models such as CLIP enable zero-shot recognition by aligning images and natural-language prompts in a shared embedding space. Their appeal is strongest in settings where task-specific training data are scarce: a practitioner can describe classes with prompts, encode them as text axes, and classify images by image-text similarity. However, the same zero-shot pipeline can fail unevenly across subpopulations. Pretraining spans broad and heterogeneous data, and the resulting text axes may entangle intended class semantics with dataset-specific nuisances such as demographic attributes, background context, collection source, or other spurious factors. When these factors correlate with labels, worst-group accuracy can degrade even when average accuracy appears acceptable.

This thesis presents SPOT (Spectral Preconditioning of Text Embeddings), a closed-form calibration method for CLIP-like zero-shot classifiers. SPOT estimates the covariance structure of target-domain image embeddings, decomposes the prompt-derived text axes in the corresponding eigenspace, and applies a smooth spectral response that preserves class-relevant bands while suppressing nuisance bands. The resulting calibrated axes replace the original text axes in the standard cosine-scoring pipeline; the image encoder, text encoder, and prompting interface remain frozen. The method is lightweight, data-efficient, label-efficient, and deterministic, reducing adaptation to a small validation search over spectral parameters and class margins.

The thesis also introduces a Personal Calibration Protocol (PCP) for on-device, privacy-aware adaptation. PCP models a realistic human-sensing scenario in which a user has an unlabeled image album, task retrievals generated by a fixed reference model, and sparse oracle corrections of that model's own errors. Unlike standard group-robustness evaluations, PCP uses error corrections as the supervision signal and does not require explicit group labels during adaptation. The PCP-Face-v1 suite instantiates this protocol on face-attribute recognition tasks using CelebA, FairFace, UTKFace, and a shared unlabeled pool.

Experiments on standard group-robustness benchmarks show that SPOT improves worst-group accuracy and is competitive with, or better than, existing debiasing methods, particularly in multi-class settings. Under the PCP-Face-v1 protocol, common few-shot and adaptation baselines are often unstable or can fall below the zero-shot reference model, while SPOT improves worst-group accuracy across CelebA, FairFace, and UTKFace and scales with the number of oracle corrections. Finally, the thesis develops a differentially private version, DP-SPOT, by releasing a noisy calibrated weight matrix with sensitivity bounds derived from spectral Lipschitzness. Together, these results support SPOT as a practical calibration layer for fair, private, and resource-constrained zero-shot recognition.
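
The calibration pipeline summarized in the abstract (covariance of image embeddings, eigendecomposition, spectral filtering of the text axes, cosine scoring) can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the abstract does not specify the spectral response, so the sigmoid over normalized eigenvalue rank used here, the choice to damp the high-variance end of the spectrum, and the names `spectral_calibrate`, `zero_shot_predict`, `center`, `width`, and `floor` are all assumptions made for illustration. Class margins and the validation search over spectral parameters are omitted.

```python
import numpy as np


def spectral_calibrate(text_axes, image_embeds, center=0.5, width=0.1, floor=0.0):
    """Filter text axes in the eigenbasis of the image-embedding covariance.

    Hypothetical sketch: the response shape and its parameters are assumed,
    not taken from the thesis.
    """
    # Covariance of the unlabeled target-domain image embeddings.
    centered = image_embeds - image_embeds.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(centered) - 1, 1)

    # eigh returns ascending eigenvalues and orthonormal eigenvectors (columns).
    _, eigvecs = np.linalg.eigh(cov)

    # Assumed smooth response over normalized eigenvalue rank:
    # gain is near 1 for low-rank (small-variance) directions and decays
    # toward `floor` for the highest-variance directions.
    rank = np.linspace(0.0, 1.0, eigvecs.shape[1])
    gain = floor + (1.0 - floor) / (1.0 + np.exp((rank - center) / width))

    # Project text axes into the eigenbasis, rescale each band, project back.
    calibrated = (text_axes @ eigvecs * gain) @ eigvecs.T

    # Re-normalize rows so the axes can be used directly for cosine scoring.
    return calibrated / np.linalg.norm(calibrated, axis=1, keepdims=True)


def zero_shot_predict(image_embeds, axes):
    """Standard cosine scoring: pick the class axis most similar to each image."""
    imgs = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    return np.argmax(imgs @ axes.T, axis=1)
```

In this sketch both encoders stay untouched: only the matrix of text axes is replaced, and the procedure is deterministic given the image embeddings and the filter parameters. Setting `floor=1.0` makes the gain identically one, which recovers the original (normalized) text axes.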
Notes:
@mastersthesis{Zheng-2026-88291,
author = {Runkai Zheng},
title = {SPOT: Spectral Preconditioning of Text Embeddings},
year = {2026},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-41},
keywords = {vision-language models, zero-shot learning, fairness, group robustness, differential privacy, personal calibration},
}