Official repository for InfiGUI-G1. We introduce Adaptive Exploration Policy Optimization (AEPO) to overcome semantic alignment bottlenecks in GUI agents through efficient, guided exploration.

InfiXAI/InfiGUI-G1

InfiGUI-G1
InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

arXiv Paper | Hugging Face Paper | InfiGUI-G1 3B Model | InfiGUI-G1 7B Model

This is the official repository for the paper InfiGUI-G1.
InfiGUI-G1 enhances GUI grounding with Adaptive Exploration Policy Optimization (AEPO) to overcome exploration bottlenecks.

🌟 Overview

A fundamental challenge for GUI agents is robustly grounding natural language instructions, which requires not only precise spatial alignment (locating elements accurately) but also correct semantic alignment (identifying the functionally appropriate element). While existing Reinforcement Learning with Verifiable Rewards (RLVR) methods have enhanced spatial precision, they often suffer from inefficient exploration. This "confidence trap" bottlenecks semantic alignment, preventing models from discovering correct actions for difficult semantic associations.

To address this exploration problem, we introduce InfiGUI-G1, a series of models trained with Adaptive Exploration Policy Optimization (AEPO). AEPO overcomes the exploration bottleneck with a multi-answer generation strategy that explores a diverse set of candidate actions in a single forward pass. This exploration is guided by a theoretically grounded Adaptive Exploration Reward (AER) function, derived from first principles of efficiency ($\eta=U/C$), which provides rich, informative learning signals to dynamically balance exploration and exploitation.

AEPO Framework

Comparison between a naive RL baseline and our AEPO framework. AEPO's multi-answer generation and adaptive reward mechanism break the exploration bottleneck, enabling robust semantic alignment by deriving an informative learning signal.
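
To make the efficiency ratio $\eta=U/C$ concrete, here is a minimal Python sketch of how such a ratio could score a multi-answer rollout. The function name `efficiency`, the binary utility, and the cost model (geometric mean of the number of candidates and the rank of the first correct one) are assumptions chosen for illustration only; refer to the paper for the actual AER definition.

```python
import math
from typing import Optional


def efficiency(num_candidates: int, first_correct_rank: Optional[int]) -> float:
    """Return eta = U / C for one multi-answer rollout.

    Assumptions (illustrative, not taken from this README):
      * U is 1.0 if any candidate is correct, else 0.0.
      * C is the geometric mean of the number of candidates proposed and
        the 1-based rank of the first correct candidate.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    if first_correct_rank is None:
        return 0.0  # U = 0: no candidate was correct
    if not 1 <= first_correct_rank <= num_candidates:
        raise ValueError("first_correct_rank must lie in [1, num_candidates]")
    utility = 1.0
    cost = math.sqrt(num_candidates * first_correct_rank)
    return utility / cost


# Compare a single confident answer with wider exploration.
for n, k in [(1, 1), (4, 1), (4, 4), (4, None)]:
    print(f"N={n}, k={k}: eta={efficiency(n, k):.3f}")
```

Under these assumptions, a single correct answer scores highest, while proposing more candidates or ranking the correct one lower reduces the score, which is the kind of trade-off between exploration and exploitation that the overview describes.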

🔥 News

🚀 Updates

  • 🚀 2025/08/18 The training script is now available. See the Training section for details.
  • 🚀 2025/08/11 The evaluation script is now available. See the Evaluation section for details.
  • 🚀 2025/08/11 The models InfiGUI-G1-3B and InfiGUI-G1-7B are now publicly available on Hugging Face.
  • 🚀 2025/08/08 The official repository for InfiGUI-G1 is now public.

πŸ—ΊοΈ Roadmap

  • βœ… InfiGUI-G1-3B Model Release
  • βœ… InfiGUI-G1-7B Model Release
  • βœ… Evaluation Code and Instructions
  • βœ… Training Code and Scripts

📊 Results

Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks.

MMBench-GUI (L2) Results

On the comprehensive MMBench-GUI benchmark, which evaluates performance across various platforms and instruction complexities, our InfiGUI-G1 models establish new state-of-the-art results for open-source models in their respective size categories.

MMBench-GUI Results

ScreenSpot-Pro Results

On the challenging ScreenSpot-Pro benchmark, designed to test semantic understanding on high-resolution professional software, InfiGUI-G1 demonstrates significant improvements, particularly on icon-based grounding tasks. This highlights AEPO's effectiveness in enhancing semantic alignment by associating abstract visual symbols with their functions.

ScreenSpot-Pro Results

UI-Vision (Element Grounding) Results

InfiGUI-G1 shows strong generalization capabilities on the UI-Vision benchmark, which is designed to test robustness across a wide variety of unseen desktop applications. Achieving high performance confirms that our AEPO framework fosters a robust understanding rather than overfitting to the training data.

UI-Vision Results

UI-I2E-Bench Results

To further probe semantic reasoning, we evaluated on UI-I2E-Bench, a benchmark featuring a high proportion of implicit instructions that require reasoning beyond direct text matching. Our model's strong performance underscores AEPO's ability to handle complex, indirect commands.

UI-I2E-Bench Results

ScreenSpot-V2 Results

On the widely used ScreenSpot-V2 benchmark, which provides comprehensive coverage across mobile, desktop, and web platforms, InfiGUI-G1 consistently outperforms strong baselines, demonstrating the broad applicability and data efficiency of our approach.

ScreenSpot-V2 Results

βš™οΈ Evaluation

This section provides instructions for reproducing the evaluation results reported in our paper.

1. Getting Started

Clone the repository and navigate to the project directory:

git clone https://github.com/InfiXAI/InfiGUI-G1.git
cd InfiGUI-G1

2. Environment Setup

The evaluation pipeline is built upon the vLLM library for efficient inference. For detailed installation guidance, please refer to the official vLLM repository. The specific versions used to obtain the results reported in our paper are as follows:

  • Python: 3.10.12
  • PyTorch: 2.6.0
  • Transformers: 4.50.1
  • vLLM: 0.8.2
  • CUDA: 12.6

The reported results were obtained on a server equipped with 4 x NVIDIA H800 GPUs.
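
Before running the evaluation, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch using only the standard library; the helper name `check_versions` is hypothetical, and the distribution names (`torch`, `transformers`, `vllm`) are assumed to be the PyPI names of the listed libraries.

```python
from importlib import metadata

# Versions listed in this README; keys are assumed PyPI distribution names.
EXPECTED = {"torch": "2.6.0", "transformers": "4.50.1", "vllm": "0.8.2"}


def check_versions(expected=EXPECTED):
    """Map each package to (installed_version_or_None, matches_expected)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu126" are ignored in the comparison.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


for name, (installed, ok) in check_versions().items():
    status = "ok" if ok else "differs"
    print(f"{name}: installed={installed}, expected={EXPECTED[name]} ({status})")
```

A mismatch does not necessarily prevent the evaluation from running, but it may explain small deviations from the reported numbers.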

3. Model Download

Download the InfiGUI-G1 models from the Hugging Face Hub into the ./models directory.

# Create a directory for models
mkdir -p ./models

# Download InfiGUI-G1-3B
huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-3B --local-dir ./models/InfiGUI-G1-3B

# Download InfiGUI-G1-7B
huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-7B --local-dir ./models/InfiGUI-G1-7B

4. Dataset Download and Preparation

Download the required evaluation benchmarks into the ./data directory.

# Create a directory for datasets
mkdir -p ./data

# Download benchmarks
huggingface-cli download --repo-type dataset --resume-download likaixin/ScreenSpot-Pro --local-dir ./data/ScreenSpot-Pro
huggingface-cli download --repo-type dataset --resume-download ServiceNow/ui-vision --local-dir ./data/ui-vision
huggingface-cli download --repo-type dataset --resume-download OS-Copilot/ScreenSpot-v2 --local-dir ./data/ScreenSpot-v2
huggingface-cli download --repo-type dataset --resume-download OpenGVLab/MMBench-GUI --local-dir ./data/MMBench-GUI
huggingface-cli download --repo-type dataset --resume-download vaundys/I2E-Bench --local-dir ./data/I2E-Bench
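
The model and dataset downloads above can also be scripted from Python. This is a sketch that assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the helper names `model_dir` and `download_all` are hypothetical, and the repository IDs and target directories are copied from the commands above.

```python
MODELS = ["InfiX-ai/InfiGUI-G1-3B", "InfiX-ai/InfiGUI-G1-7B"]

# Dataset repository ID -> local directory, as in the CLI commands above.
DATASETS = {
    "likaixin/ScreenSpot-Pro": "./data/ScreenSpot-Pro",
    "ServiceNow/ui-vision": "./data/ui-vision",
    "OS-Copilot/ScreenSpot-v2": "./data/ScreenSpot-v2",
    "OpenGVLab/MMBench-GUI": "./data/MMBench-GUI",
    "vaundys/I2E-Bench": "./data/I2E-Bench",
}


def model_dir(repo_id):
    """Return the local directory for a model, e.g. ./models/InfiGUI-G1-3B."""
    return f"./models/{repo_id.split('/')[-1]}"


def download_all(models=MODELS, datasets=DATASETS):
    """Download every model and benchmark (requires network access)."""
    # Imported lazily so this file can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id in models:
        snapshot_download(repo_id=repo_id, local_dir=model_dir(repo_id))
    for repo_id, local_dir in datasets.items():
        snapshot_download(
            repo_id=repo_id, repo_type="dataset", local_dir=local_dir
        )
```

Calling `download_all()` fetches everything in one go; pass a subset of `MODELS` or `DATASETS` to download only what a given benchmark needs.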

After downloading, some datasets require unzipping compressed image files.

# Unzip images for ScreenSpot-v2
unzip ./data/ScreenSpot-v2/screenspotv2_image.zip -d ./data/ScreenSpot-v2/

# Unzip images for MMBench-GUI
unzip ./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip -d ./data/MMBench-GUI/
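
On systems without the `unzip` utility, the same extraction can be done with Python's standard `zipfile` module. This is a minimal sketch; the helper names `extract_archive` and `extract_all` are hypothetical, and the archive paths are those used in the commands above.

```python
import zipfile
from pathlib import Path

# Archive -> destination directory, matching the unzip commands above.
ARCHIVES = {
    "./data/ScreenSpot-v2/screenspotv2_image.zip": "./data/ScreenSpot-v2/",
    "./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip": "./data/MMBench-GUI/",
}


def extract_archive(archive, dest):
    """Extract a zip archive into dest and return the number of entries."""
    archive, dest = Path(archive), Path(dest)
    if not archive.is_file():
        raise FileNotFoundError(f"archive not found: {archive}")
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return len(zf.namelist())


def extract_all(archives=ARCHIVES):
    """Extract every archive listed in ARCHIVES."""
    for archive, dest in archives.items():
        count = extract_archive(archive, dest)
        print(f"{archive}: extracted {count} entries into {dest}")
```

Run `extract_all()` after the downloads have finished; a missing archive raises `FileNotFoundError` rather than failing silently.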

5. Running the Evaluation

To run the evaluation, use the eval/eval.py script. You must specify the path to the model, the benchmark name, and the tensor parallel size.

Here is an example command to evaluate the InfiGUI-G1-3B model on the screenspot-pro benchmark using 4 GPUs:

python eval/eval.py \
    ./models/InfiGUI-G1-3B \
    --benchmark screenspot-pro \
    --tensor-parallel 4

  • model_path: The first positional argument specifies the path to the downloaded model directory (e.g., ./models/InfiGUI-G1-3B).
  • --benchmark: Specifies the benchmark to evaluate. Available options include screenspot-pro, screenspot-v2, ui-vision, mmbench-gui, and i2e-bench.
  • --tensor-parallel: Sets the tensor parallelism size, which should typically match the number of available GPUs.

Evaluation results, including detailed logs and performance metrics, will be saved to the ./output/{model_name}/{benchmark}/ directory.
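
To evaluate one model on every benchmark in sequence, the command above can be wrapped in a small driver. This is a sketch rather than part of the repository: the helper names are hypothetical, and `output_dir` assumes that `{model_name}` is the final component of the model path.

```python
import subprocess
import sys
from pathlib import Path

# Benchmark names accepted by --benchmark, as listed above.
BENCHMARKS = [
    "screenspot-pro", "screenspot-v2", "ui-vision", "mmbench-gui", "i2e-bench",
]


def build_eval_command(model_path, benchmark, tensor_parallel=4):
    """Build the argument list for one eval/eval.py invocation."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark}")
    return [
        sys.executable, "eval/eval.py", str(model_path),
        "--benchmark", benchmark,
        "--tensor-parallel", str(tensor_parallel),
    ]


def output_dir(model_path, benchmark):
    """Guess the results directory (assumes model_name is the path's basename)."""
    return Path("output") / Path(model_path).name / benchmark


def run_all(model_path, tensor_parallel=4):
    """Run every benchmark in turn, stopping at the first failure."""
    for benchmark in BENCHMARKS:
        cmd = build_eval_command(model_path, benchmark, tensor_parallel)
        subprocess.run(cmd, check=True)
        print(f"{benchmark}: results expected under {output_dir(model_path, benchmark)}")
```

For example, `run_all("./models/InfiGUI-G1-3B", tensor_parallel=4)` runs from the repository root and mirrors the single-benchmark command shown above.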

🚆 Training

This section provides instructions for reproducing the training results of InfiGUI-G1. Our training framework is built upon the VERL repository.

1. Environment Setup

Please follow the environment setup instructions for VERL with vLLM support. You can either follow their installation guide or use the official Docker images provided by VERL, which come pre-configured with the necessary dependencies, including vLLM.

2. Training Recipe

We provide our code and training scripts in the recipe/infigui-g1 directory. A sample dataset is also included to help you get started.

For detailed instructions on how to launch the training for both the 3B and 7B models, please refer to the recipe/infigui-g1/README.md file.

📚 Citation Information

If you find this work useful, citations to the following papers are welcome:

@article{liu2025infiguig1,
  title={InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization},
  author={Liu, Yuhang and Liu, Zeyu and Zhu, Shuanghe and Li, Pengxiang and Xie, Congkai and Wang, Jiasheng and Hu, Xueyu and Han, Xiaotian and Yuan, Jianbo and Wang, Xinyao and others},
  journal={arXiv preprint arXiv:2508.05731},
  year={2025}
}
@article{liu2025infiguir1,
  title={InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners},
  author={Liu, Yuhang and Li, Pengxiang and Xie, Congkai and Hu, Xavier and Han, Xiaotian and Zhang, Shengyu and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2504.14239},
  year={2025}
}
@article{liu2025infiguiagent,
  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2501.04575},
  year={2025}
}

πŸ™ Acknowledgements

We would like to express our gratitude for the following open-source projects: VERL, Qwen2.5-VL and vLLM.
