The code here is heavily inspired by the amazing repo: https://github.com/McGill-NLP/nano-aha-moment
Each baseline is implemented in a separate, hackable file.
Baselines implemented:
- Dr. GRPO
- VinePPO
- Reward Progress
- Best-of-N aware finetuning
For a quick summary of our results, refer to our Notion blog post: "What to do when you have zero rewards during RL?"
```bash
# create a new env using conda, uv, or venv
pip install -r requirements.txt
```

We provide data generation scripts for the star-graph task used in the blog post above. You can also add your own tasks; have a look at the tasks directory for inspiration.
Please follow the notebook create_star_graph_data.ipynb to generate a star-graph dataset and push to HF.
To create dataset mixtures follow instructions in combine_datasets.ipynb.
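For intuition, here is a rough sketch of what star-graph generation and a push to the Hub could look like. The exact fields and prompt formatting expected by the training scripts come from create_star_graph_data.ipynb; the `prompt`/`answer` column names and prompt string below are illustrative assumptions, not the notebook's actual format.

```python
import random
from datasets import Dataset

def make_star_graph(degree=3, path_len=3, num_nodes=300):
    # One star graph: `degree` chains of length `path_len` radiating from a shared center node.
    nodes = random.sample(range(num_nodes), degree * path_len + 1)
    center, rest = nodes[0], nodes[1:]
    chains = [[center] + rest[i * path_len:(i + 1) * path_len] for i in range(degree)]
    edges = [(a, b) for chain in chains for a, b in zip(chain, chain[1:])]
    random.shuffle(edges)                                  # hide the answer's edge order
    target = random.choice(chains)                         # the chain the model must recover
    prompt = ("Edges: " + " ".join(f"{a}->{b}" for a, b in edges)
              + f" | Find the path from {center} to {target[-1]}:")
    return {"prompt": prompt, "answer": ",".join(map(str, target))}

ds = Dataset.from_list([make_star_graph() for _ in range(1000)])
# ds.push_to_hub("hf_username/star-graph-deg-3-path-3-nodes-300")  # needs `huggingface-cli login`
```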
```bash
python grpo.py \
    --model_name Qwen/Qwen2.5-1.5B-Instruct \
    --task star-graph-deg-3-path-3-nodes-300 \
    --run_id Qwen2.5-1.5B-Instruct-Deg-3-Path-3
```
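For reference, below is a minimal sketch of the group-relative advantage idea behind GRPO/Dr. GRPO: each completion's advantage is its reward minus the mean reward of its group, with Dr. GRPO dropping the per-group std normalization. The actual computation lives in grpo.py and may differ in detail.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, normalize_std: bool = False) -> torch.Tensor:
    # rewards: (num_prompts, num_generations), one scalar reward per sampled completion.
    mean = rewards.mean(dim=1, keepdim=True)
    adv = rewards - mean                                   # reward relative to the group average
    if normalize_std:                                      # vanilla GRPO; Dr. GRPO skips this step
        adv = adv / (rewards.std(dim=1, keepdim=True) + 1e-6)
    return adv

# 2 prompts x 4 generations with 0/1 rewards; an all-zero-reward group yields zero advantages.
rewards = torch.tensor([[1., 0., 0., 1.], [0., 0., 0., 0.]])
print(group_relative_advantages(rewards))
```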
Chunk advantages are estimated using Monte Carlo rollouts from the top-3 high-entropy tokens in the response:

```bash
python vineppo_and_reward_progress.py \
    --prover_policy_model_name Qwen/Qwen2.5-1.5B-Instruct \
    --model_name Qwen/Qwen2.5-1.5B-Instruct \
    --run_id custom_run_id \
    --top_k_entropy_tokens 3 \
    --vineppo_k 3 \
    --prover_alpha 1.00 \
    --prover_policy_best_of_n 1 \
    --current_policy_as_prover 1 \
    --task hf_username/star-graph-deg-3-path-3-nodes-300
```
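A sketch of the chunking idea: pick the `--top_k_entropy_tokens` highest-entropy positions in a sampled response and estimate a Monte Carlo value at each one by rolling out `--vineppo_k` completions from that prefix. The function and argument names below are illustrative stand-ins, not the exact API of vineppo_and_reward_progress.py.

```python
import torch

def top_k_entropy_positions(logits: torch.Tensor, k: int = 3) -> torch.Tensor:
    # logits: (seq_len, vocab_size) for the tokens of one sampled response.
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)             # per-token entropy, shape (seq_len,)
    k = min(k, entropy.numel())
    return torch.topk(entropy, k=k).indices.sort().values  # positions where chunks start

def mc_value(prefix_ids, rollout_fn, reward_fn, num_rollouts: int = 3) -> float:
    # Average reward of completions sampled from `prefix_ids` (the --vineppo_k knob).
    # `rollout_fn` and `reward_fn` are placeholders for the generation and task-reward code.
    rewards = [reward_fn(rollout_fn(prefix_ids)) for _ in range(num_rollouts)]
    return sum(rewards) / len(rewards)
```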
Use the prover as Best-of-4 (Qwen/Qwen2.5-1.5B-Instruct); the advantage under the prover is estimated using rollouts from the top-3 high-entropy tokens:

```bash
python vineppo_and_reward_progress.py \
    --prover_policy_model_name Qwen/Qwen2.5-1.5B-Instruct \
    --model_name Qwen/Qwen2.5-1.5B-Instruct \
    --run_id custom_run_id \
    --top_k_entropy_tokens 3 \
    --vineppo_k 3 \
    --prover_alpha 0.83 \
    --prover_policy_best_of_n 4 \
    --current_policy_as_prover 0 \
    --task star-graph-deg-3-path-3-nodes-300
```
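A rough sketch of scoring a prefix with a separate prover under best-of-N, assuming best-of-N means keeping the highest-reward prover rollout; the name below is hypothetical, and how this signal is mixed into the advantage (e.g. via `--prover_alpha`) is handled inside vineppo_and_reward_progress.py.

```python
def prover_best_of_n_value(prefix_ids, prover_rollout_fn, reward_fn, n: int = 4) -> float:
    # Optimistic value of a prefix under the (frozen) prover policy:
    # sample n completions from the prover and keep the best task reward.
    rewards = [reward_fn(prover_rollout_fn(prefix_ids)) for _ in range(n)]
    return max(rewards)
```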
Best-of-8 finetuning, using a KL schedule from 0.1 to 0.001 over 1000 steps:

```bash
python best_of_n_aware_finetune.py \
    --model_name Qwen/Qwen2.5-1.5B-Instruct \
    --task star-graph-deg-10-path-10-nodes-300 \
    --run_id "10x10-bo8-kl-0.1-to-0.001-r2" \
    --loss_type "best_of_n" \
    --num_generations 8 \
    --kl_schedule linear --initial_kl_coeff 0.1 --final_kl_coeff 0.001
```
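A minimal sketch of a linear KL-coefficient schedule, assuming the coefficient is interpolated from `--initial_kl_coeff` to `--final_kl_coeff` over a fixed number of steps (1000 in the run above) and then held constant; the function name and step count here are assumptions for illustration.

```python
def linear_kl_coeff(step: int, total_steps: int = 1000,
                    initial: float = 0.1, final: float = 0.001) -> float:
    # Linearly interpolate the KL coefficient, then hold it at `final` after `total_steps`.
    frac = min(step, total_steps) / total_steps
    return initial + frac * (final - initial)

assert abs(linear_kl_coeff(0) - 0.1) < 1e-9
assert abs(linear_kl_coeff(1000) - 0.001) < 1e-9
```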
If you find this repo helpful, please consider citing it as:

```bibtex
@article{jpab2025rlzero,
  title={What Can You Do When You Have Zero Rewards During RL?},
  author={Prakash, Jatin and Buvanesh, Anirudh},
  journal={arXiv preprint arXiv:2510.03971},
  year={2025}
}
```