Context
We use verl-agent/VAGEN for multi-turn VLM GRPO training because TRL (HuggingFace) cannot handle multi-turn VLM rollouts — chat template flattening destroys multimodal data before it reaches rollout_func.
Decision doc: docs/verl_agent_decision.md (PR #84)
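A minimal sketch of the failure mode, assuming a Hugging Face processor with a chat template (the model name and placeholder tokens are illustrative):

```python
# Sketch: applying a chat template to a multi-turn VLM conversation yields a
# flat prompt string. The image content parts become text placeholders, and
# the pixel data is no longer attached to what flows downstream.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")  # illustrative model

screenshot_t0 = Image.new("RGB", (1280, 720))  # stand-ins for real desktop screenshots
screenshot_t1 = Image.new("RGB", (1280, 720))

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Open the Settings app."},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "click(612, 344)"}]},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Now enable dark mode."},
    ]},
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # text with image placeholder tokens, no pixel values attached
```

Anything downstream that receives only the templated string cannot recover the pixel data, which is why multi-turn VLM rollouts break before rollout_func ever runs.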
Upstream Dependency
Tracked issue: TRL #5120; its resolution is the primary revisit trigger below.
When to Revisit
Check quarterly (June, September, December 2026). If any of:
- TRL #5120 is resolved or has a merged fix
- TRL's GRPOTrainer passes multi-turn VLM E2E tests
- TRL release notes announce multi-turn VLM GRPO support
Then:
- Test TRL against our WAA RL environment (RLEnvironment / WAADesktopEnv)
- Benchmark verl-agent vs TRL on the same task: wall time, peak VRAM, convergence (see the measurement sketch after this list)
- If TRL matches verl-agent AND adds per-step credit assignment (GiGPO equivalent), consider switching
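A minimal measurement sketch, assuming a hypothetical train_one_run() wrapper around either backend (the function name and return shape are ours, not from either library):

```python
# Sketch: record wall time and peak VRAM for one training run per backend.
# `train_one_run` is a hypothetical callable wrapping the verl-agent or TRL
# training loop and returning an eval success rate for convergence tracking.
import time
import torch

def benchmark(train_one_run, label: str) -> dict:
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    success_rate = train_one_run()  # e.g. WAA task success after N updates
    wall_time_s = time.perf_counter() - start
    # Note: this only sees the current process. Ray/vLLM runs spread VRAM
    # across workers, so those need nvidia-smi / NVML sampling instead.
    peak_vram_gb = torch.cuda.max_memory_allocated() / 1e9
    return {"backend": label, "wall_time_s": wall_time_s,
            "peak_vram_gb": peak_vram_gb, "success_rate": success_rate}

# results = [benchmark(run_verl_agent, "verl-agent"), benchmark(run_trl, "trl")]
```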
Why We'd Want to Switch
verl-agent is excellent but adds Ray/vLLM operational complexity. TRL has broader adoption and simpler deployment, so switching would shrink our dependency footprint. We would switch only if TRL also gains per-step credit assignment: without GiGPO-equivalent step-level advantages, a single trajectory-level reward cannot distinguish which steps helped, so training on 15+ step desktop tasks is significantly less sample-efficient.
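To make the credit-assignment gap concrete, a sketch contrasting trajectory-level GRPO advantages with GiGPO-style step-level advantages grouped by shared anchor state (the data layout, gamma, and combination weight are illustrative, not the paper's exact formulation):

```python
# Sketch: GRPO normalizes the episode return within the rollout group, so
# every step of a 15+ step episode gets the same advantage. GiGPO adds a
# second grouping over steps that share an anchor state, crediting each
# action by its discounted return relative to siblings from that state.
from collections import defaultdict
import numpy as np

def grpo_advantages(episode_returns: np.ndarray) -> np.ndarray:
    # One scalar per trajectory, broadcast to all of its steps.
    return (episode_returns - episode_returns.mean()) / (episode_returns.std() + 1e-8)

def step_advantages(trajectories, gamma: float = 0.95):
    # trajectories: list of [(state_key, reward), ...] per episode.
    # 1) Discounted return-to-go for every step.
    togo = []
    for traj in trajectories:
        g, rets = 0.0, []
        for _, r in reversed(traj):
            g = r + gamma * g
            rets.append(g)
        togo.append(rets[::-1])
    # 2) Group steps across trajectories by identical anchor state and
    #    normalize returns within each group.
    groups = defaultdict(list)
    for ti, traj in enumerate(trajectories):
        for si, (state_key, _) in enumerate(traj):
            groups[state_key].append((ti, si))
    adv = [[0.0] * len(t) for t in trajectories]
    for members in groups.values():
        vals = np.array([togo[ti][si] for ti, si in members])
        normed = (vals - vals.mean()) / (vals.std() + 1e-8)
        for (ti, si), a in zip(members, normed):
            adv[ti][si] = float(a)
    return adv

# GiGPO combines both signals, e.g. A = A_episode + w * A_step (w illustrative).
```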
Related
- WAADesktopEnv adapter
- openadapt_ml/training/grpo/trainer.py: inline comment tracking this