MikeWangWZHL/PAPO

PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning (ICLR 2026)

PAPO is a novel policy gradient algorithm that enhances multimodal reasoning through visually grounded optimization. It can serve as a direct drop-in replacement for GRPO or DAPO without any additional assumptions.

🔥 News

  • Jan 2026: PAPO is accepted to ICLR 2026
  • July 2025: Released PAPO_G (GRPO) models
  • July 2025: Released PAPO_G (GRPO) code
  • August 2025: Released PAPO_D (DAPO) models
  • August 2025: Released PAPO_D (DAPO) code

🌟 Key Highlights

  • 4.4%-17.5% overall improvement on diverse multimodal benchmarks
  • 8.0%-19.1% improvement on tasks with high vision dependency
  • 30.5% reduction in perception errors
  • No additional data or external reward models required
  • Serves as a direct drop-in replacement for GRPO and DAPO

📖 Methodology

Perception Bottleneck

We identified that 67% of errors in current multimodal reasoning models stem from poor perception rather than logical reasoning failures.

[Figure: PAPO Overview]

PAPO Algorithm

PAPO extends GRPO/DAPO by adding an Implicit Perception Loss that maximizes the KL divergence between model outputs on original vs. corrupted (masked) images:

[Figure: PAPO Method]
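As a rough illustration of what a "corrupted (masked)" image might look like, the sketch below zeroes out a random subset of square patches. The function name `mask_patches`, the patch size, and the mask ratio are hypothetical choices made for this example; they are not taken from the PAPO code, which may corrupt images differently.

```python
import random


def mask_patches(image, patch_size, mask_ratio, seed=None):
    """Return a copy of a 2D image (list of rows) with random patches set to 0.

    All parameter names and defaults are illustrative assumptions,
    not the settings used by the PAPO repository.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])

    # Top-left corner of every patch on a regular grid.
    patches = [
        (row, col)
        for row in range(0, height, patch_size)
        for col in range(0, width, patch_size)
    ]
    num_masked = round(mask_ratio * len(patches))

    masked = [row[:] for row in image]  # leave the input untouched
    for row, col in rng.sample(patches, num_masked):
        for i in range(row, min(row + patch_size, height)):
            for j in range(col, min(col + patch_size, width)):
                masked[i][j] = 0
    return masked


if __name__ == "__main__":
    toy_image = [[1] * 4 for _ in range(4)]
    for line in mask_patches(toy_image, patch_size=2, mask_ratio=0.5, seed=0):
        print(line)
```

The masked copy is what would be fed to the model alongside the original image so that the two output distributions can be compared.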

The core intuition is that a well-behaved multimodal model should produce significantly different outputs when visual information is corrupted, indicating reliance on meaningful visual content. To further enhance training stability, we introduce Double Entropy Loss, an effective regularizer that prevents model collapse while preserving performance.
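The quantities described above can be sketched in plain Python. This is a minimal illustration, assuming access to full next-token logits for both the original and the masked input; the repository may instead estimate the KL term from sampled-token log-probabilities. The function names are hypothetical, and the double entropy terms are assumed here to be the entropies of the original and the masked-input distributions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def perception_terms(logits_orig, logits_masked):
    """Average per-token KL and entropies over a response.

    logits_orig / logits_masked: one list of vocabulary logits per token,
    computed with the original and the masked image respectively
    (an assumption made for this sketch).
    """
    kls, h_orig, h_mask = [], [], []
    for tok_orig, tok_masked in zip(logits_orig, logits_masked):
        p, q = softmax(tok_orig), softmax(tok_masked)
        kls.append(kl_divergence(p, q))
        h_orig.append(entropy(p))
        h_mask.append(entropy(q))
    n = len(kls)
    return sum(kls) / n, sum(h_orig) / n, sum(h_mask) / n


if __name__ == "__main__":
    kl, h1, h2 = perception_terms([[2.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]])
    print(f"KL={kl:.4f}  H_orig={h1:.4f}  H_masked={h2:.4f}")
```

In this reading, a larger KL value means the model's predictions change more when the image is masked, which is the behavior the Implicit Perception Loss rewards, while the two entropy terms act as the regularizer.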

[Figure: PAPO Objective]

Main Results

PAPO consistently outperforms GRPO/DAPO across diverse benchmarks, with particularly pronounced improvements on vision-dependent tasks:

[Figure: Main Results]

📊 Data

We adapt multiple multimodal reasoning benchmarks to construct our training and evaluation datasets.

Training Data

Evaluation Data

We adapted 8 multimodal reasoning benchmarks to evaluate PAPO and divide them into two groups: General Multimodal Reasoning and Vision-Dependent Multimodal Reasoning. All evaluation benchmarks can be found at https://huggingface.co/datasets/PAPO-Galaxy/PAPO_eval. For MathVista and MathVerse, we filter out instances with free-form answers to ensure verifiable evaluation and to avoid relying on LLM-as-a-judge.

All results in the paper are average accuracy@8 (each evaluation is repeated 8 times), with the temperature set to 1.0.
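One plausible reading of average accuracy@8 is the mean of the per-run accuracies over 8 independent sampling runs. The sketch below follows that reading; the function name and the input layout (one list of correctness flags per run) are assumptions made for illustration and do not mirror the PAPO-Eval scripts.

```python
from statistics import mean


def average_accuracy_at_k(correct_by_run):
    """Mean accuracy over k repeated evaluation runs.

    correct_by_run: list of k runs; each run is a list of booleans,
    one per benchmark example (hypothetical layout for this sketch).
    """
    if not correct_by_run:
        raise ValueError("at least one run is required")
    per_run = [mean(1.0 if ok else 0.0 for ok in run) for run in correct_by_run]
    return mean(per_run)


if __name__ == "__main__":
    # Two toy runs over the same two examples; the paper setting uses 8 runs.
    runs = [[True, False], [True, True]]
    print(average_accuracy_at_k(runs))
```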

🚀 Quick Start (Qwen2.5-VL)

Update: Support for Qwen3-VL

Please refer to the main_qwen3 branch for instructions on running PAPO with Qwen3-VL.

Environment Setup

Option 1: All-in-one Installation Script

conda create -n papo python=3.10
conda activate papo

cd PAPO
bash scripts/install.sh

Option 2: Using pip

pip install -e .

Training

The main training pipeline is adapted from EasyR1. We support training with different configurations for both Qwen2.5-VL 3B and 7B models:

  • Qwen2.5-VL 3B: We typically use 2× 80GB H100 GPUs
  • Qwen2.5-VL 7B: We typically use 4× 80GB H100 GPUs

GRPO Baseline

# 3B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_3b_grpo.sh

# 7B model  
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_7b_grpo.sh

DAPO Baseline

# 3B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_3b_dapo.sh

# 7B model  
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_7b_dapo.sh

PAPO-G (Config for Table 1 Results)

# 3B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_3b_grpo_papo.sh

# 7B model  
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_7b_grpo_papo.sh

PAPO-D (Config for Table 1 Results)

# 3B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_3b_dapo_papo.sh

# 7B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_7b_dapo_papo.sh

PAPO-G + No Reference KL (Config for Table 7 Results)

# 3B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_3b_grpo_papo_no_kl_ref.sh

# 7B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_7b_grpo_papo_no_kl_ref.sh

Pretrained Checkpoints

A collection of 3B and 7B checkpoints trained on ViRL39K can be downloaded from here. The checkpoints follow the Qwen2.5-VL Hugging Face format and can be used for inference as drop-in replacements for https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct. All checkpoints correspond to the last training step.
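Because the checkpoints follow the Qwen2.5-VL Hugging Face format, loading one should look the same as loading the base model. The sketch below assumes a recent `transformers` release that ships `Qwen2_5_VLForConditionalGeneration` (plus `accelerate` for `device_map="auto"`). The helper name `load_checkpoint` is hypothetical, and the default ID is the base model named above; substitute the ID or local path of the PAPO checkpoint you downloaded.

```python
def load_checkpoint(model_id="Qwen/Qwen2.5-VL-7B-Instruct"):
    """Load a Qwen2.5-VL-format checkpoint and its processor.

    model_id: Hugging Face model ID or local directory. The default is the
    base model; replace it with a PAPO checkpoint (placeholder assumption).
    """
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype="auto",  # use the dtype stored in the checkpoint
        device_map="auto",   # spread weights across available devices
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```

Calling `load_checkpoint()` downloads the weights on first use, so it is best run on a machine with enough GPU memory for the chosen model size.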

Performance Evaluation

To run model inference and evaluation, we integrate the evaluation submodule located at PAPO/PAPO-Eval. Detailed instructions for running inference and evaluation can be found in PAPO-Eval.

# Navigate to PAPO evaluation submodule
cd PAPO-Eval

# Data preprocessing
bash papo_eval/preprocess/preprocess.sh

# Run model inference
bash papo_eval/run_infer.sh

# Run model evaluation
bash papo_eval/run_eval.sh

Additional Implementation Notes on Entropy Losses

In theory, enabling the double entropy loss (adding aug_entropy_loss in update_policy in workers/actor/dp_actor.py) requires an additional forward pass on the masked sequence to recompute aug_log_probs. In practice, we find that this additional forward pass does not significantly affect performance. The current implementation therefore skips the recomputation by default, which still empirically brings a slight improvement over single entropy. A detailed discussion can be found in #20. We also provide a switch, RECOMPUTE_AUG_LOG_PROBS, in workers/actor/dp_actor.py to turn this recomputation on or off if the explicit impact of aug_log_probs on the gradients is required (note that enabling it slows down training due to the additional forward pass).

🥰 Acknowledgements

We thank the EasyR1 team for providing the foundational codebase that we adapted to implement PAPO. Our implementation builds upon their efficient RLVR framework and extends it with perception-aware optimization methodologies. We also acknowledge the open-source community for providing the datasets and evaluation benchmarks that made this research possible.

📝 Citation

@article{wang2025perception,
  title={Perception-Aware Policy Optimization for Multimodal Reasoning},
  author={Wang, Zhenhailong and Guo, Xuehang and Stoica, Sofia and Xu, Haiyang and Wang, Hongru and Ha, Hyeonjeong and Chen, Xiusi and Chen, Yangyi and Yan, Ming and Huang, Fei and others},
  journal={arXiv preprint arXiv:2507.06448},
  year={2025}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Learning to perceive while learning to reason!

🌐 Project Page | πŸ“„ Paper | πŸ’» GitHub | πŸ€— Models | πŸ€— Data

About

Official repo for "PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning"
