PAPO, a novel policy gradient algorithm that enhances multimodal reasoning through visually grounded optimization. PAPO can serve as a direct drop-in replacement for GRPO or DAPO without any additional assumptions.
- Jan 2026: PAPO is accepted to ICLR 2026
- July 2025: Released PAPO_G (GRPO) models
- July 2025: Released PAPO_G (GRPO) code
- August 2025: Released PAPO_D (DAPO) models
- August 2025: Released PAPO_D (DAPO) code
- 4.4%-17.5% overall improvement on diverse multimodal benchmarks
- 8.0%-19.1% improvement on tasks high vision-dependentcy
- 30.5% reduction in perception errors
- No additional data or external reward models required
- Serves as a direct drop-in replacement for GRPO and DAPO
We identified that 67% of errors in current multimodal reasoning models stem from poor perception rather than logical reasoning failures.
PAPO extends GRPO/DAPO by adding an Implicit Perception Loss that maximizes the KL divergence between model outputs on original vs. corrupted (masked) images:
The core intuition is that a well-behaved multimodal model should produce significantly different outputs when visual information is corrupted, indicating reliance on meaningful visual content. To further enhance training stability, we introduce Double Entropy Loss, an effective regularizer that prevents model collapse while preserving performance.
PAPO consistently outperforms GRPO/DAPO across diverse benchmarks, with particularly pronounced improvements on vision-dependent tasks:
We adapt multiple multimodel reasoning benchmarks to construct our training and evaluation datasets.
- Training: We adapt TIGER-Lab/ViRL39K for training. The processed dataset can be found at: PAPOGalaxy/PAPO_ViRL39K_train.
- Validation (optional): We use the testset from MMK12 for validation during training. Note that this is solely for monitoring, we do not pick checkpoints based on this. The processed dataset can be found PAPOGalaxy/PAPO_MMK12_test.
We adapted 8 different multimodal reasoning benchmarks to evaluate PAPO, which are further identify two groups, including General Multimodal Reasoning and Vision-Dependent Multimodal Reasoning.
All evaluation benchmarks can be found in https://huggingface.co/datasets/PAPO-Galaxy/PAPO_eval.
For MathVista and MathVerse, we filter out instances with free-form answers to ensure verifiable evaluation and to avoid relying on LLM-as-a-judge.
All results in the paper are average accurarcy @ 8 (repeating 8 times), with a temperature set to 1.0.
Please refer to the main_qwen3 branch for instructions on running PAPO with Qwen3-VL.
conda create -n papo python=3.10
conda activate papo
cd PAPO
bash scripts/install.shpip install -e .The main training pipeline is adopted from EasyR1. We support training with different configurations for both Qwen2.5-VL 3B and 7B models:
- Qwen2.5-VL 3B: We typically use 2
80G H100GPUs - Qwen2.5-VL 7B: We typically use 4
80G H100GPUs
# 3B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_3b_grpo.sh
# 7B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_7b_grpo.sh# 3B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_3b_dapo.sh
# 7B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_7b_dapo.sh# 3B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_3b_grpo_papo.sh
# 7B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_7b_grpo_papo.sh# 3B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_3b_dapo_papo.sh
# 7B model
cd PAPO
bash examples/papo_dapo/qwen2_5_vl_7b_dapo_papo.sh# 3B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_3b_grpo_papo_no_kl_ref.sh
# 7B model
cd PAPO
bash examples/papo_grpo/qwen2_5_vl_7b_grpo_papo_no_kl_ref.shA collection of 7B/3B pretrained checkpoints on ViRL39K can be downloaded from here. The checkpoints follows Qwen2.5-VL Huggingface format, which can be inferenced as drop-in replacement to https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct. All checkpoints are corresponding to the last step.
- PAPO-GRPO model collection: PAPO-G
- PAPO-G 3B model in Table 1: https://huggingface.co/PAPOGalaxy/PAPO-G-H-Qwen2.5-VL-3B
- PAPO-G 7B model in Table 1: https://huggingface.co/PAPOGalaxy/PAPO-G-H-Qwen2.5-VL-7B
- PAPO-DAPO model collection: PAPO-D
- PAPO-D 3B model in Table 1: https://huggingface.co/PAPOGalaxy/PAPO-D-Qwen2.5-VL-3B
- PAPO-D 7B model in Table 1: https://huggingface.co/PAPOGalaxy/PAPO-D-Qwen2.5-VL-7B
To run model inference and evaluation, we integrate the evaluation submodule located at PAPO/PAPO-Eval.
Detailed instructions for running inference and evaluation can be found in PAPO-Eval.
# Navigate to PAPO evaluation submodule
cd PAPO-Eval
# Data preprocessing
bash papo_eval/preprocess/preprocess.sh
# Run model inference
bash papo_eval/run_infer.sh
# Run model evaluation
bash papo_eval/run_eval.shIn theory, when enabling double entropy loss (adding aug_entropy_loss during the workers/actor/dp_actor.py/update_policy) we need to do an additional forward pass on the masked sequence to recompute the aug_log_probs. In practice, we find that whether doing this additional forward pass does not signiticantly affect the performance.
Thus, by default in current implementation, we skipped the recomputation, which still empirically brings slight improvement over single entropy. Detailed discussion can be found in #20.
We also provide a switch RECOMPUTE_AUG_LOG_PROBS in workers/actor/dp_actor.py to turn on/off this recomputation if one requires the explicit impact on the graidents from the aug_log_probs (note that this will slow down training due to the additional forward pass).
We thank the EasyR1 team for providing the foundational codebase that we adapted to implement PAPO. Our implementation builds upon their efficient RLVR framework and extends it with perception-aware optimization methodologies. We also acknowledge the open-source community for providing the datasets and evaluation benchmarks that made this research possible.
@article{wang2025perception,
title={Perception-Aware Policy Optimization for Multimodal Reasoning},
author={Wang, Zhenhailong and Guo, Xuehang and Stoica, Sofia and Xu, Haiyang and Wang, Hongru and Ha, Hyeonjeong and Chen, Xiusi and Chen, Yangyi and Yan, Ming and Huang, Fei and others},
journal={arXiv preprint arXiv:2507.06448},
year={2025}
}This project is licensed under the MIT License - see the LICENSE file for details.
Learning to perceive while learning to reason!
π Project Page | π Paper | π» GitHub | π€ Models | π€ Data



