This is the official repository for the paper InfiGUI-G1.
InfiGUI-G1 enhances GUI grounding with Adaptive Exploration Policy Optimization (AEPO) to overcome exploration bottlenecks.
A fundamental challenge for GUI agents is robustly grounding natural language instructions, which requires not only precise spatial alignment (locating elements accurately) but also correct semantic alignment (identifying the functionally appropriate element). While existing Reinforcement Learning with Verifiable Rewards (RLVR) methods have enhanced spatial precision, they often suffer from inefficient exploration. This "confidence trap" bottlenecks semantic alignment, preventing models from discovering correct actions for difficult semantic associations.
To address this critical exploration problem, we introduce InfiGUI-G1, a series of models trained with Adaptive Exploration Policy Optimization (AEPO). AEPO overcomes the exploration bottleneck by integrating a multi-answer generation strategy to explore a diverse set of candidate actions in a single forward pass. This exploration is guided by a theoretically-grounded Adaptive Exploration Reward (AER) function, derived from first principles of efficiency (

Comparison between a naive RL baseline and our AEPO framework. AEPO's multi-answer generation and adaptive reward mechanism break the exploration bottleneck, enabling robust semantic alignment by deriving an informative learning signal.
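To make the idea of multi-answer exploration concrete, the sketch below checks several candidate click points against a ground-truth bounding box. It is an illustration only: the function names, the `(x, y)` point format, and the `(left, top, right, bottom)` box format are assumptions, and it does not reproduce the paper's AER function.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed format


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def first_hit_rank(candidates: Sequence[Point], box: Box) -> Optional[int]:
    """Return the 1-based rank of the first candidate inside the box, or None."""
    for rank, point in enumerate(candidates, start=1):
        if point_in_box(point, box):
            return rank
    return None


# Hypothetical example values, not taken from any benchmark.
target = (100, 40, 180, 80)
single_answer = [(30, 30)]
multi_answer = [(30, 30), (150, 60), (400, 10)]

print(first_hit_rank(single_answer, target))  # None
print(first_hit_rank(multi_answer, target))   # 2
```

With a single confident but wrong answer there is no hit to learn from, while proposing several candidates in one response can surface a correct point at a lower rank.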
- 2025/08/11: Our paper "InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization" released.
- 2025/05/15: Our paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use" is accepted by ACL 2025.
- 2025/04/19: Our paper "InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners" released.
- 2025/01/09: Our paper "InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection" released.
- 2024/12/12: Our paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use" released.
- 2024/04/02: Our paper "InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks" is accepted by ICML 2024.
- 2025/08/18: The training script is now available. See the Training section for details.
- 2025/08/11: The evaluation script is now available. See the Evaluation section for details.
- 2025/08/11: The models InfiGUI-G1-3B and InfiGUI-G1-7B are now publicly available on Hugging Face.
- 2025/08/08: The official repository for InfiGUI-G1 is now public.
- ✅ InfiGUI-G1-3B Model Release
- ✅ InfiGUI-G1-7B Model Release
- ✅ Evaluation Code and Instructions
- ✅ Training Code and Scripts
Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks.
On the comprehensive MMBench-GUI benchmark, which evaluates performance across various platforms and instruction complexities, each InfiGUI-G1 model achieves the best results among open-source models in its size category.
On the challenging ScreenSpot-Pro benchmark, designed to test semantic understanding on high-resolution professional software, InfiGUI-G1 demonstrates significant improvements, particularly on icon-based grounding tasks. This highlights AEPO's effectiveness in enhancing semantic alignment by associating abstract visual symbols with their functions.
InfiGUI-G1 shows strong generalization capabilities on the UI-Vision benchmark, which is designed to test robustness across a wide variety of unseen desktop applications. Its high performance on applications outside the training data indicates that the AEPO framework fosters robust understanding rather than overfitting.
To further probe semantic reasoning, we evaluated on UI-I2E-Bench, a benchmark featuring a high proportion of implicit instructions that require reasoning beyond direct text matching. Our model's strong performance underscores AEPO's ability to handle complex, indirect commands.
On the widely-used ScreenSpot-V2 benchmark, which provides comprehensive coverage across mobile, desktop, and web platforms, InfiGUI-G1 consistently outperforms strong baselines, demonstrating the broad applicability and data efficiency of our approach.
This section provides instructions for reproducing the evaluation results reported in our paper.
Clone the repository and navigate to the project directory:
git clone https://github.com/InfiXAI/InfiGUI-G1.git
cd InfiGUI-G1
The evaluation pipeline is built upon the vLLM library for efficient inference. For detailed installation guidance, please refer to the official vLLM repository. The specific versions used to obtain the results reported in our paper are as follows:
- Python: 3.10.12
- PyTorch: 2.6.0
- Transformers: 4.50.1
- vLLM: 0.8.2
- CUDA: 12.6
The reported results were obtained on a server equipped with 4 x NVIDIA H800 GPUs.
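To compare a local environment with the versions listed above, a small check like the following may help. This is a sketch, not part of the repository: the helper name is hypothetical, and the package names `torch`, `transformers`, and `vllm` are assumed to be the installed distribution names.

```python
from importlib import metadata

# Versions reported in this README; distribution names are assumptions.
REPORTED = {"torch": "2.6.0", "transformers": "4.50.1", "vllm": "0.8.2"}


def compare_versions(reported: dict) -> dict:
    """Map each package to (installed_version_or_None, matches_reported)."""
    result = {}
    for name, wanted in reported.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu126" are ignored for the comparison.
        matches = installed is not None and installed.split("+")[0] == wanted
        result[name] = (installed, matches)
    return result


for name, (installed, ok) in compare_versions(REPORTED).items():
    print(f"{name}: installed={installed} reported={REPORTED[name]} match={ok}")
```

A mismatch does not necessarily mean the evaluation will fail; it only signals that results may differ from those reported.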
Download the InfiGUI-G1 models from the Hugging Face Hub into the ./models directory.
# Create a directory for models
mkdir -p ./models
# Download InfiGUI-G1-3B
huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-3B --local-dir ./models/InfiGUI-G1-3B
# Download InfiGUI-G1-7B
huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-7B --local-dir ./models/InfiGUI-G1-7B
Download the required evaluation benchmarks into the ./data directory.
# Create a directory for datasets
mkdir -p ./data
# Download benchmarks
huggingface-cli download --repo-type dataset --resume-download likaixin/ScreenSpot-Pro --local-dir ./data/ScreenSpot-Pro
huggingface-cli download --repo-type dataset --resume-download ServiceNow/ui-vision --local-dir ./data/ui-vision
huggingface-cli download --repo-type dataset --resume-download OS-Copilot/ScreenSpot-v2 --local-dir ./data/ScreenSpot-v2
huggingface-cli download --repo-type dataset --resume-download OpenGVLab/MMBench-GUI --local-dir ./data/MMBench-GUI
huggingface-cli download --repo-type dataset --resume-download vaundys/I2E-Bench --local-dir ./data/I2E-Bench
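The same downloads can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and relies on its `snapshot_download` function; the `DOWNLOADS` table and the `download_all` helper are hypothetical names introduced for illustration.

```python
from typing import List, Optional, Tuple

# (repo_id, repo_type, local_dir), copied from the CLI commands above.
# A repo_type of None means a model repository.
DOWNLOADS: List[Tuple[str, Optional[str], str]] = [
    ("InfiX-ai/InfiGUI-G1-3B", None, "./models/InfiGUI-G1-3B"),
    ("InfiX-ai/InfiGUI-G1-7B", None, "./models/InfiGUI-G1-7B"),
    ("likaixin/ScreenSpot-Pro", "dataset", "./data/ScreenSpot-Pro"),
    ("ServiceNow/ui-vision", "dataset", "./data/ui-vision"),
    ("OS-Copilot/ScreenSpot-v2", "dataset", "./data/ScreenSpot-v2"),
    ("OpenGVLab/MMBench-GUI", "dataset", "./data/MMBench-GUI"),
    ("vaundys/I2E-Bench", "dataset", "./data/I2E-Bench"),
]


def download_all(downloads: List[Tuple[str, Optional[str], str]] = DOWNLOADS) -> None:
    """Download every repository in the table into its local directory."""
    # Imported lazily so the table can be inspected without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id, repo_type, local_dir in downloads:
        print(f"Downloading {repo_id} -> {local_dir}")
        snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=local_dir)
```

Calling `download_all()` starts the transfers; passing a shorter list restricts the download to the models or benchmarks you need.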
After downloading, some datasets require unzipping compressed image files.
# Unzip images for ScreenSpot-v2
unzip ./data/ScreenSpot-v2/screenspotv2_image.zip -d ./data/ScreenSpot-v2/
# Unzip images for MMBench-GUI
unzip ./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip -d ./data/MMBench-GUI/
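Where the `unzip` tool is not available, the standard-library `zipfile` module can do the same job. This is a minimal sketch using the two archive paths from the commands above; the helper name `extract_next_to_archive` is hypothetical.

```python
import zipfile
from pathlib import Path

# Archive paths taken from the unzip commands above.
ARCHIVES = [
    Path("./data/ScreenSpot-v2/screenspotv2_image.zip"),
    Path("./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip"),
]


def extract_next_to_archive(archive: Path) -> bool:
    """Extract the archive into its own directory; return False if it is missing."""
    if not archive.is_file():
        return False
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(archive.parent)
    return True


for archive in ARCHIVES:
    status = "extracted" if extract_next_to_archive(archive) else "not found"
    print(f"{archive}: {status}")
```

Reporting a missing archive instead of raising makes it easy to spot a benchmark that was not downloaded.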
To run the evaluation, use the eval/eval.py script. You must specify the path to the model, the benchmark name, and the tensor parallel size.
Here is an example command to evaluate the InfiGUI-G1-3B model on the screenspot-pro benchmark using 4 GPUs:
python eval/eval.py \
./models/InfiGUI-G1-3B \
--benchmark screenspot-pro \
--tensor-parallel 4
- model_path: The first positional argument specifies the path to the downloaded model directory (e.g., ./models/InfiGUI-G1-3B).
- --benchmark: Specifies the benchmark to evaluate. Available options include screenspot-pro, screenspot-v2, ui-vision, mmbench-gui, and i2e-bench.
- --tensor-parallel: Sets the tensor parallelism size, which should typically match the number of available GPUs.
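To evaluate one model on every benchmark in sequence, the command above can be wrapped in a short driver script. The following is a sketch under the assumption that it is run from the repository root; `build_command` and `run_all` are hypothetical helpers, and only the arguments documented above are used.

```python
import subprocess
import sys
from typing import List

# Benchmark names as listed for the --benchmark option.
BENCHMARKS = ["screenspot-pro", "screenspot-v2", "ui-vision", "mmbench-gui", "i2e-bench"]


def build_command(model_path: str, benchmark: str, tensor_parallel: int) -> List[str]:
    """Build the eval/eval.py invocation for one model and one benchmark."""
    return [
        sys.executable,
        "eval/eval.py",
        model_path,
        "--benchmark",
        benchmark,
        "--tensor-parallel",
        str(tensor_parallel),
    ]


def run_all(model_path: str, tensor_parallel: int = 4) -> None:
    """Run every benchmark in turn, stopping at the first failure."""
    for benchmark in BENCHMARKS:
        subprocess.run(build_command(model_path, benchmark, tensor_parallel), check=True)
```

For example, `run_all("./models/InfiGUI-G1-3B", tensor_parallel=4)` would launch the five evaluations one after another.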
Evaluation results, including detailed logs and performance metrics, will be saved to the ./output/{model_name}/{benchmark}/ directory.
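To see what a run produced, the output directory can be listed with a few lines of Python. This sketch makes no assumption about the file names written by the evaluation script; the helper names are hypothetical, and the `model_name` value in the example is an assumption about how the script names its output folder.

```python
from pathlib import Path
from typing import List


def output_dir(model_name: str, benchmark: str, root: str = "./output") -> Path:
    """Return the directory following the ./output/{model_name}/{benchmark}/ layout."""
    return Path(root) / model_name / benchmark


def list_outputs(model_name: str, benchmark: str, root: str = "./output") -> List[Path]:
    """Return all files under the output directory, or an empty list if it is absent."""
    directory = output_dir(model_name, benchmark, root)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.rglob("*") if p.is_file())


# "InfiGUI-G1-3B" as the model_name is an assumed example value.
for path in list_outputs("InfiGUI-G1-3B", "screenspot-pro"):
    print(path)
```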
This section provides instructions for reproducing the training results of InfiGUI-G1. Our training framework is built upon the VERL repository.
Please follow the environment setup instructions for VERL with vLLM support. You can either follow their installation guide or use the official Docker images provided by VERL, which come pre-configured with the necessary dependencies, including vLLM.
We provide our code and training scripts in the recipe/infigui-g1 directory. A sample dataset is also included to help you get started.
For detailed instructions on how to launch the training for both the 3B and 7B models, please refer to the recipe/infigui-g1/README.md file.
If you find this work useful, citations to the following papers are welcome:
@article{liu2025infiguig1,
title={InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization},
author={Liu, Yuhang and Liu, Zeyu and Zhu, Shuanghe and Li, Pengxiang and Xie, Congkai and Wang, Jiasheng and Hu, Xueyu and Han, Xiaotian and Yuan, Jianbo and Wang, Xinyao and others},
journal={arXiv preprint arXiv:2508.05731},
year={2025}
}
@article{liu2025infiguir1,
title={InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners},
author={Liu, Yuhang and Li, Pengxiang and Xie, Congkai and Hu, Xavier and Han, Xiaotian and Zhang, Shengyu and Yang, Hongxia and Wu, Fei},
journal={arXiv preprint arXiv:2504.14239},
year={2025}
}
@article{liu2025infiguiagent,
title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
journal={arXiv preprint arXiv:2501.04575},
year={2025}
}
We would like to express our gratitude for the following open-source projects: VERL, Qwen2.5-VL and vLLM.