EMNLP 2025

D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

A training-free framework that adapts image-pretrained VLMs to video understanding through dynamic compression and question decomposition.

Yiyang Huang, Yizhou Wang, Yun Fu

Northeastern University

Abstract

Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks.

Motivation: Challenges in Image-to-Video Adaptation

Adapting image-pretrained VLMs to video requires effective token compression and comprehensive understanding of the resulting representations. However, the perception bottleneck hinders efficient compression with minimal information loss, while token overload limits comprehensive interpretation of the compressed tokens, as their number still exceeds the capacity of image-pretrained VLMs.

Perception bottleneck and token overload challenges

Two Key Challenges

We identify two key challenges in extending image-based VLMs to the video domain:

1. Perception Bottleneck: static compression (uniform sampling, average pooling) treats all content uniformly, discarding informative cues that are dynamically distributed across temporal and spatial dimensions.
2. Token Overload: even after compression, the number of visual tokens still far exceeds image-model capacity, causing performance saturation as the model cannot effectively interpret the excess information.

Perception Bottleneck

Static compression strategies such as uniform frame sampling and spatial average pooling lack semantic adaptivity and tend to discard informative cues that are unevenly distributed across temporal and spatial dimensions. As shown below, static methods lead to notable performance degradation, while our dynamic compression alleviates this drop and even surpasses the uncompressed baseline.

Perception bottleneck experiment

Token Overload

Even after compression, video inputs contain substantially more visual tokens than static images, exceeding the capacity of image-pretrained VLMs. The vanilla model exhibits a performance plateau as the token count increases. In contrast, question decomposition consistently outperforms the baseline with a widening accuracy gap, demonstrating superior scalability.

Token overload experiment

Method

To tackle these challenges, we introduce D-CoDe, a training-free framework that scales image-pretrained VLMs to video by integrating dynamic compression and question decomposition. Dynamic compression augments uniform sampling with supplementary frames and filters uninformative spatial tokens, while question decomposition reformulates complex queries into focused sub-questions for comprehensive understanding.

D-CoDe Framework Pipeline

Overview of the D-CoDe framework. Dynamic compression adaptively selects representative frames and aggregates spatial tokens, while question decomposition reformulates the query into sub-questions for more comprehensive video understanding.
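The question-decomposition side of the framework amounts to prompting an LLM and parsing its numbered reply. The sketch below is illustrative only: the prompt wording, function name, and parsing are our assumptions, not the paper's exact implementation (the released `generate_subquestions` handles this in practice).

```python
from typing import Callable, List

# Illustrative prompt; the paper's actual decomposition prompt may differ.
DECOMPOSE_PROMPT = (
    "Decompose the following video question into {k} simpler sub-questions, "
    "each focusing on a distinct aspect of the video (objects, actions, "
    "temporal order). Return one sub-question per line, numbered.\n\n"
    "Question: {question}"
)

def decompose_question(question: str, llm: Callable[[str], str], k: int = 3) -> List[str]:
    """Ask an LLM to split a complex query into focused sub-questions."""
    reply = llm(DECOMPOSE_PROMPT.format(k=k, question=question))
    subqs = []
    for line in reply.splitlines():
        line = line.strip()
        if line:
            # Drop a leading "1." / "2)" style enumerator, if present.
            subqs.append(line.lstrip("0123456789.) ").strip())
    return subqs[:k]
```

Each sub-question is then answered against the compressed video tokens, steering the model toward a different aspect of the clip.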

Supplementary Frame Selection

Two-stage temporal compression: uniformly sample ⌊α·N⌋ frames, then iteratively select supplementary frames with maximum semantic diversity using CLIP global features.

Supplementary Frame Selection
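The two-stage selection above can be sketched in NumPy as follows, assuming per-frame global embeddings (e.g. from CLIP) have already been computed. The function name and the exact greedy criterion are illustrative simplifications of the released `supp_frame_selection`.

```python
import numpy as np

def select_frames(feats: np.ndarray, n_total: int, alpha: float = 0.85) -> list:
    """Two-stage temporal compression over per-frame feature vectors:
    uniformly sample floor(alpha * n_total) frames, then greedily add the
    frames least similar to the current selection (maximum semantic diversity)."""
    num_frames = feats.shape[0]
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize
    n_uniform = int(alpha * n_total)
    selected = list(np.linspace(0, num_frames - 1, n_uniform, dtype=int))
    while len(selected) < n_total:
        # Cosine similarity of every frame to the already-selected set.
        sims = feats @ feats[selected].T          # (num_frames, |selected|)
        max_sim = sims.max(axis=1)
        max_sim[selected] = np.inf                # never re-pick a chosen frame
        selected.append(int(max_sim.argmin()))    # most semantically novel frame
    return sorted(selected)
```

The greedy step is what recovers key moments that uniform sampling skips: a frame is added precisely because nothing similar to it has been selected yet.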

Spatial Token Pruning & Merging

Prune low-activation tokens by ℓ2 norm, then greedily merge semantically similar tokens (cosine similarity > τ) to reduce spatial redundancy while preserving informative content.

Spatial Token Pruning and Merging
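The prune-then-merge step can be sketched as below; this is a minimal NumPy illustration of the idea (top-k selection by ℓ2 norm, then greedy merging of tokens whose cosine similarity exceeds τ), with the function name and grouping details assumed rather than taken from the released `token_select_and_merge`.

```python
import numpy as np

def prune_and_merge(tokens: np.ndarray, keep: int, tau: float = 0.8) -> np.ndarray:
    """Keep the `keep` tokens with the largest L2 norm (high activation),
    then greedily merge kept tokens whose cosine similarity exceeds tau
    by averaging each similar group into a single token."""
    norms = np.linalg.norm(tokens, axis=1)
    kept = tokens[np.argsort(norms)[-keep:]]          # prune low-activation tokens
    unit = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    sims = unit @ unit.T                              # pairwise cosine similarity
    merged, used = [], np.zeros(len(kept), dtype=bool)
    for i in range(len(kept)):
        if used[i]:
            continue
        group = np.where((sims[i] > tau) & ~used)[0]  # token i plus its similar peers
        used[group] = True
        merged.append(kept[group].mean(axis=0))       # merge by averaging
    return np.stack(merged)
```

Averaging within each similar group keeps one representative token per region of redundant content, so the output length adapts to how repetitive the frame is.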

Results

Multiple-Choice VideoQA ↑

| Method | NExT-QA | EgoSchema | IntentQA |
| --- | --- | --- | --- |
| IG-VLM | 63.1 | 35.8 | 60.3 |
| SF-LLaVA | 64.2 | 47.2 | 60.1 |
| TS-LLaVA | 66.5 | 50.2 | 61.7 |
| D-CoDe | 68.3 | 58.0 | 64.2 |

Open-Ended VideoQA — Accuracy ↑

| Method | MSVD | MSRVTT | TGIF | ANet |
| --- | --- | --- | --- | --- |
| IG-VLM | 78.8 | 63.7 | 73.0 | 54.3 |
| SF-LLaVA | 79.1 | 65.8 | 78.7 | 55.5 |
| TS-LLaVA | 79.0 | 65.1 | 77.7 | 56.7 |
| D-CoDe | 80.0 | 64.2 | 79.1 | 56.4 |

Full Comparison — All Methods (7B LLM, CLIP-L Visual Encoder)

| Method | Type | NExT-QA | EgoSchema | IntentQA | MSVD | MSRVTT | TGIF | ANet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Video-LLaVA | trained | 60.5 | 37.0 | – | 70.7 | 59.2 | 70.0 | 45.3 |
| MovieChat+ | trained | 54.8 | 56.4 | – | 76.5 | 53.9 | – | 48.1 |
| PLLaVA | trained | – | – | – | 76.6 | 62.0 | 77.5 | 56.3 |
| DeepStack-L | free | 61.0 | 38.4 | – | 76.0 | – | – | 49.3 |
| | free | 63.1 | 36.8 | 58.8 | – | – | – | – |
| IG-VLM | free | 63.1 | 35.8 | 60.3 | 78.8 | 63.7 | 73.0 | 54.3 |
| SF-LLaVA | free | 64.2 | 47.2 | 60.1 | 79.1 | 65.8 | 78.7 | 55.5 |
| TS-LLaVA | free | 66.5 | 50.2 | 61.7 | 79.0 | 65.1 | 77.7 | 56.7 |
| D-CoDe (Ours) | free | 68.3 | 58.0 | 64.2 | 80.0 | 64.2 | 79.1 | 56.4 |

Module Ablation (EgoSchema) ↑

| Module | Acc. (%) |
| --- | --- |
| Baseline | 44.8 |
| + Spatial Compression | 50.6 |
| + Temporal Selection | 51.8 |
| + Question Decomposition | 58.0 |

Efficiency Analysis (EgoSchema)

| Module | Acc. (%) | s/sample |
| --- | --- | --- |
| Baseline | 44.8 | 3.93 |
| + Dynamic Compression | 51.8 | 6.12 |
| + Question Decomposition | 58.0 | 37.40 |

Key takeaway: On the challenging long-video benchmark EgoSchema, D-CoDe achieves 58.0% accuracy, a 7.8-point improvement over the previous best training-free method TS-LLaVA (50.2%), demonstrating its strength in complex video understanding.

Qualitative Examples

Temporal Compression Visualization

Yellow frames are uniformly sampled; green frames are supplementary selections that maximize semantic diversity, capturing key moments missed by uniform sampling.

Temporal compression visualization

Supplementary frame selection enhances temporal diversity by selecting frames with maximum semantic difference from uniformly sampled frames.

Spatial Compression Visualization

Low-activation tokens (background regions) are pruned, and semantically similar remaining tokens are merged into clusters.

Spatial compression example 1

Spatial token pruning & merging — Example 1

Spatial compression example 2

Spatial token pruning & merging — Example 2

Attention Visualization

Question decomposition guides the model to attend to different aspects of the video. Each sub-question shifts the attention distribution to focus on distinct temporal regions.

Attention: original question

Original question

Attention: sub-question 1

Sub-question 1

Attention: sub-question 2

Sub-question 2

Code

D-CoDe provides three modular functions that can be used independently or combined:

from Dcode import generate_subquestions, supp_frame_selection, token_select_and_merge, load_clip_model

# 1. Question Decomposition (requires OPENAI_API_KEY)
subquestions = generate_subquestions(
    question="What did the person do after picking up the cup?",
    prompt_variant="original"
)

# 2. Frame Selection (based on semantic diversity)
clip_processor, clip_model = load_clip_model()
selected_frames, frame_idxs = supp_frame_selection(
    video_frames,
    N=15,
    uniform_ratio=0.85,
    clip_model=clip_model,
    clip_processor=clip_processor
)

# 3. Token Selection and Merge
merged_features = token_select_and_merge(
    image_features,
    top_k=288,
    merge_strategy="mean",
    similarity_threshold=0.8
)

See the GitHub repository for full installation instructions, data preparation, and evaluation scripts.

BibTeX

If you find this work useful, please cite our paper:

EMNLP version:

@inproceedings{huang-etal-2025-code,
    title = "{D}-{C}o{D}e: Scaling Image-Pretrained {VLM}s to Video via Dynamic Compression and Question Decomposition",
    author = "Huang, Yiyang  and
      Wang, Yizhou  and
      Fu, Yun",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    pages = "11798--11811",
}

arXiv version:

@article{huang2025d,
    title={D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition},
    author={Huang, Yiyang and Wang, Yizhou and Fu, Yun},
    journal={arXiv preprint arXiv:2510.08818},
    year={2025}
}