Abstract:
How can we make video models faster? In fields such as vision and
language, the dominant trend has been large-scale end-to-end learning on
massive curated datasets. Scaling video training for both understanding
and generation has proved challenging, however, due to the drastically
larger size of the input. Current video transformers model a video the same
way as an image: as a single, very long sequence of tokens. In contrast, modern
video codecs achieve impressive compression by explicitly modeling motion
and redundancy, enabling efficient storage and transmission. Inspired by
these principles, this thesis explores how ideas from video compression,
such as motion estimation, residual modeling, and adaptive sampling, can
be used to accelerate video models.
We begin by applying these ideas to video understanding tasks. We
introduce a series of methods that reduce the number of redundant input
tokens without sacrificing performance. First, Run-Length Tokenization
(RLT) accelerates video transformers by collapsing temporally redundant
patches into a single token, inspired by run-length encoding. Second,
Flow-based Tokenization (FlowTok) extends this idea by using optical
flow to detect redundant visual content even under motion, outperforming
grid-based pruning methods and enabling substantial gains on dynamic,
egocentric video. Finally, Adaptive Patch Transformer (APT) generalizes
these ideas to images by allocating patch sizes adaptively—using
large patches for homogeneous regions and small patches for detailed
ones—achieving major speedups on high-resolution visual tasks. Together,
these methods demonstrate how structure and redundancy in visual data
can be exploited to scale up transformer models more efficiently, enabling
faster training and inference without compromising accuracy.
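To make the run-length idea concrete, here is a minimal sketch of temporal run-length tokenization. The function name, the L1-difference test, and the threshold value are illustrative assumptions, not the thesis's actual implementation: each spatial patch position is scanned across frames, and a run of near-identical patches is collapsed into a single token annotated with its run length.

```python
import numpy as np

def run_length_tokenize(patches, threshold=0.1):
    """Collapse temporally repeated patches into single tokens.

    patches: array of shape (T, N, D) -- T frames, N patch positions
    per frame, D-dim patch embeddings. A new token starts whenever a
    patch differs from the same spatial patch in the previous frame
    by more than `threshold` (mean absolute difference), mirroring
    run-length encoding along the time axis.
    Returns a list of (start_frame, patch_index, run_length) tuples.
    """
    T, N, _ = patches.shape
    tokens = []
    for n in range(N):
        start = 0  # first frame of the current run
        for t in range(1, T + 1):
            # a run ends at the last frame, or when the patch changes
            if t == T or np.abs(patches[t, n] - patches[t - 1, n]).mean() > threshold:
                tokens.append((start, n, t - start))
                start = t
    return tokens
```

On a fully static clip this yields one token per patch position regardless of clip length, while a clip where every patch changes each frame degenerates to the usual one-token-per-patch-per-frame grid, so the token count adapts to how much of the video actually moves.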
We next apply these principles to video generation. Specifically, we
propose SkipSR, a framework for fast video super-resolution built on
cascaded diffusion models. Instead of relying
on fixed heuristics to determine token importance, SkipSR learns which
tokens are critical for synthesis using end-to-end supervision. Finally, we
introduce a benchmark to systematically evaluate the impact of frame
rate and resolution on downstream video understanding tasks, offering
insights into which aspects of fidelity truly matter for model performance.
By unifying efficient video tokenization with scalable video synthesis and
principled evaluation, this thesis enables significantly faster visual models
in both understanding and generation tasks, unlocking further scaling.
@phdthesis{Choudhury-2026-150451,
author = {Rohan C. Choudhury},
title = {Efficient Visual Modeling with Adaptive Representations},
year = {2026},
month = {February},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-26-02},
keywords = {computer vision, efficient machine learning, generative modeling, adaptive computation},
}