HoPE

HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models


📢 News

  • [07/13/2025] All training and evaluation scripts are released. Check them out!
  • [06/29/2025] Our work was covered by JIQIZHIXIN (机器之心)!
  • [05/26/2025] Released our paper on arXiv.

🔭 Overview

Extending RoPE to multimodal scenarios typically involves allocating different frequencies to encode different positional components (t, x, y). In this paper:

1️⃣ We first investigate how different frequency allocation strategies impact the semantic modeling capabilities of VLMs. Our analysis reveals that current multimodal RoPEs, which keep all frequencies, are unreliable in long-term semantic modeling. HoPE tackles this issue with Hybrid Frequency Allocation (HFA), which integrates zero frequencies for reliable semantic modeling over extended contexts.
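
To make the zero-frequency idea concrete, below is a minimal NumPy sketch (not the implementation in modeling_hope.py): the lowest rotary frequencies are zeroed, so those dimensions apply no rotation and their contribution to attention is distance-independent. The head size, the zero ratio, and how the remaining frequencies would be split across (t, x, y) are illustrative assumptions.

```python
import numpy as np

def hybrid_frequencies(head_dim=64, base=10000.0, zero_ratio=0.25):
    """Illustrative hybrid frequency allocation (HFA): standard RoPE frequencies
    base**(-2i/d), with the lowest `zero_ratio` fraction replaced by zeros, so the
    corresponding dimensions behave like NoPE and carry pure semantic content."""
    half = head_dim // 2
    freqs = base ** (-2.0 * np.arange(half) / head_dim)  # high -> low frequency
    num_zero = int(half * zero_ratio)
    freqs[half - num_zero:] = 0.0                        # zero out the lowest frequencies
    return freqs

def rotate(x, pos, freqs):
    """Apply a RoPE-style rotation to the last dimension of `x` at position `pos`."""
    half = x.shape[-1] // 2
    angles = pos * freqs                                 # zero frequency -> identity rotation
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Zero-frequency dimensions are unchanged no matter how far apart the positions are:
freqs = hybrid_frequencies()
q = np.random.randn(64)
zero_dims = freqs == 0.0
assert np.allclose(rotate(q, 3, freqs)[:32][zero_dims],
                   rotate(q, 3000, freqs)[:32][zero_dims])
```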

2️⃣ Moreover, we point out that existing temporal index scaling of visual tokens lacks flexibility and robustness during inference, where videos exhibit varying speeds and information densities. To address this, HoPE introduces Dynamic Temporal Scaling (DTS), which enables VLMs to learn multi-scale temporal relationships during training and adaptively select temporal scaling during inference.
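
A similarly hedged sketch of the dynamic-scaling idea: a scaling factor is applied to the temporal position ids of visual tokens, sampled from a candidate set during training and selected per video at inference. The candidate set and the selection criterion below are placeholders, not the released configuration.

```python
import random

# Candidate temporal scaling factors; this particular set is an assumption for illustration.
CANDIDATE_SCALES = [0.5, 1.0, 1.5, 2.0]

def scale_temporal_ids(temporal_ids, scale):
    """Stretch or compress the temporal position ids assigned to visual tokens."""
    return [t * scale for t in temporal_ids]

def training_temporal_ids(temporal_ids):
    """Training: sample a scale per video so the model sees multi-scale temporal layouts."""
    return scale_temporal_ids(temporal_ids, random.choice(CANDIDATE_SCALES))

def inference_temporal_ids(temporal_ids, score_fn):
    """Inference: try each candidate scale and keep the highest-scoring one.
    `score_fn` stands in for whatever criterion ranks scalings for a given video
    (speed, information density); it is a placeholder, not HoPE's actual rule."""
    return max((scale_temporal_ids(temporal_ids, s) for s in CANDIDATE_SCALES), key=score_fn)
```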

[Figure 1: overview of HoPE]

🛠️ Requirements

  1. Clone this repository and install transformers==4.45.2 from source:

     git clone https://github.com/hrlics/HoPE.git
     cd HoPE
     wget https://github.com/huggingface/transformers/archive/refs/tags/v4.45.2.tar.gz
     tar -xzf v4.45.2.tar.gz

  2. Install the required packages:

     bash setup_env.sh

  3. Replace the code in HoPE/transformers-4.45.2/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py with HoPE/modeling_hope.py. The differences are marked with # MODIFIED.
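
For reference, step 3 amounts to overwriting a single file. A minimal sketch, run from the HoPE repo root and assuming the archive was extracted to transformers-4.45.2/:

```python
import shutil

# Keep a backup of the original Qwen2-VL modeling file, then overwrite it with the HoPE version.
shutil.copyfile(
    "transformers-4.45.2/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py",
    "transformers-4.45.2/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py.bak",
)
shutil.copyfile(
    "modeling_hope.py",
    "transformers-4.45.2/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py",
)
```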

🚀 Train

We use a subset of LLaVA-Video-178K as training data, comprising 30k videos under 2 minutes and 3k videos between 2 and 3 minutes in duration (~300k pairs).

Under LLaMA-Factory/, run the following script to start training:

bash train_hope.sh

Adjustments are made to LLaMA-Factory/src/llamafactory/data/mm_plugin.py to accommodate Qwen2-VL's training recipe.

🔍 Evaluation

Long Video Understanding

Evaluations on long video understanding are based on lmms-eval. The first step is to install relevant dependencies:

cd lmms-eval
pip install -e .

Then, run the following script to start evaluations on MLVU, LongVideoBench, and Video-MME:

bash eval_LVU.sh

Adjustments are made to lmms-eval/lmms_eval/models/qwen2_vl.py to accommodate our evaluation configs.

Long Video Retrieval

Under vision_niah/, run the following script to produce haystack and needle embeddings for long video retrieval:

bash produce_haystack_and_needle_embedding.sh

Now, we can run evaluations:

bash eval.sh

📖 Citation

If you find our work helpful, please consider citing 📝 and giving us a star ⭐

@article{li2025hope,
  title={HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models},
  author={Li, Haoran and Qin, Yingjie and Ou, Baoyuan and Xu, Lai and Xu, Ruiwen},
  journal={arXiv preprint arXiv:2505.20444},
  year={2025}
}
