Preserve batched env evaluation in async validation rollouts #2209

Draft
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-1798-async-validation
Conversation

@taivu1998

Summary

Addresses #1798 by preserving batched environment evaluation during async validation.

Today, when async vLLM generation is enabled, GRPO and distillation validation route through run_async_multi_turn_rollout(). That helper is optimized for sample-level pipelining, which is a good fit for training and async trajectory collection, but it invokes environment evaluation inside per-sample loops instead of the batched rollout loop used by synchronous validation. For validation, that means we lose task-level batching in reward/env evaluation and pay unnecessary latency.
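To make the latency cost concrete, here is a minimal standalone sketch of the two evaluation patterns. Everything here (ToyEnv, step_batch, the two evaluate functions) is a hypothetical stand-in, not the NeMo-RL API; the point is only that the per-sample path issues one env round trip per sample where the batched path issues one per batch:

```python
class ToyEnv:
    """Toy environment that counts round trips, so the two strategies can be compared."""

    def __init__(self):
        self.calls = 0

    def step_batch(self, samples):
        # One round trip evaluates every sample it receives.
        self.calls += 1
        return [len(s) for s in samples]


def evaluate_batched(env, samples):
    # Batched path (synchronous validation style): one env call for the whole batch.
    return env.step_batch(samples)


def evaluate_per_sample(env, samples):
    # Sample-pipelined path: one env call per sample.
    return [env.step_batch([s])[0] for s in samples]


batch = ["a", "bb", "ccc", "dddd"]
batched_env, pipelined_env = ToyEnv(), ToyEnv()

# Same rewards either way, but 1 env round trip vs 4.
assert evaluate_batched(batched_env, batch) == evaluate_per_sample(pipelined_env, batch)
assert (batched_env.calls, pipelined_env.calls) == (1, 4)
```

When env evaluation has per-call overhead (process boundaries, network hops, batched scoring models), the call-count difference dominates validation latency.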

This PR keeps the training path unchanged and introduces a validation-specific rollout helper that combines async generation with the existing batched multi-turn environment loop.

What Changed

  • added run_multi_turn_rollout_async_generation() to nemo_rl/experience/rollouts.py
    • mirrors run_multi_turn_rollout()'s batched multi-turn control flow
    • uses generate_responses_async() for async vLLM generation
    • preserves batched task grouping via calculate_rewards()
    • keeps stop string updates, env metadata propagation, truncation handling, reward accumulation, and rollout metrics aligned with the synchronous helper
  • updated GRPO validation to use the new helper when _should_use_async_rollouts(master_config) is true
  • updated distillation validation to use the new helper in the same async-validation case
  • left run_async_multi_turn_rollout() unchanged for training and async trajectory collection, where sample-level pipelining is still the intended behavior
  • added targeted unit coverage for:
    • the new rollout helper's batching behavior for same-task batches
    • task grouping behavior for mixed-task batches
    • GRPO validation selecting the new helper in async mode
    • distillation validation selecting the new helper in async mode
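The control flow described above can be sketched as follows. This is an illustrative simplification, not the actual helper: the real run_multi_turn_rollout_async_generation(), generate_responses_async(), and calculate_rewards() have different signatures and richer state (stop strings, env metadata, truncation, metrics); here the stand-ins only show the batched multi-turn loop with task-grouped reward evaluation:

```python
import asyncio
from collections import defaultdict


async def generate_responses_async(prompts):
    # Hypothetical stand-in for async vLLM generation over the whole batch.
    await asyncio.sleep(0)
    return [p + " <response>" for p in prompts]


def calculate_rewards(samples_by_task):
    # Hypothetical stand-in for batched, task-grouped env evaluation:
    # one env call per task group rather than one per sample.
    return {task: [1.0] * len(samples) for task, samples in samples_by_task.items()}


async def run_multi_turn_rollout_async_generation(prompts, tasks, max_turns=2):
    """Batched multi-turn loop: async generation, then batched env evaluation."""
    rewards = [0.0] * len(prompts)
    for _ in range(max_turns):
        # Generate for the entire batch at once (async).
        responses = await generate_responses_async(prompts)
        # Group sample indices by task so each environment sees one batched call.
        groups = defaultdict(list)
        for i, task in enumerate(tasks):
            groups[task].append(i)
        task_rewards = calculate_rewards(
            {t: [responses[i] for i in idx] for t, idx in groups.items()}
        )
        # Accumulate rewards back into per-sample order.
        for task, idx in groups.items():
            for j, i in enumerate(idx):
                rewards[i] += task_rewards[task][j]
        prompts = responses  # next turn continues from the responses
    return rewards
```

The key structural point is that env evaluation sits inside the batched turn loop (as in run_multi_turn_rollout()), while only the generation step is async.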

Root Cause

The regression comes from using the same async rollout helper for both:

  • validation, where preserving batched reward/environment calls is important, and
  • training/async collection, where per-sample pipelining is desirable

run_async_multi_turn_rollout() processes each sample independently across turns. That architecture improves overlap for some training scenarios, but it also changes where environment evaluation happens. Validation ended up on the pipelined path and lost the batching characteristics of run_multi_turn_rollout().

Why This Design

This change fixes the validation bottleneck without broadening the blast radius:

  • no scheduler rewrite
  • no changes to the async training path
  • no change to NeMo-Gym validation handling
  • minimal call-site changes in the two validation entry points that were affected

The new helper is intentionally narrow and reuses the proven synchronous rollout structure, which keeps the fix easier to reason about and reduces regression risk.

User / Developer Impact

  • async validation with batched environments now preserves task-level batching again
  • training and async rollout collection behavior remain unchanged
  • the code path is clearer: validation uses a batched helper, while sample-pipelined async rollout remains dedicated to the training-style path it was designed for

Validation

  • python3.12 -m py_compile on all changed source and test files
  • targeted unit tests added for rollout batching and validation branch selection
  • direct smoke test of the new helper against a stubbed environment/generation setup to confirm it matches run_multi_turn_rollout() semantics while preserving batched env grouping
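A hedged sketch of the branch-selection check: validation should dispatch to the batched async helper exactly when async rollouts are enabled. The config shape under master_config and the body of _should_use_async_rollouts below are assumptions for illustration only; in the repo they may key off different fields:

```python
def _should_use_async_rollouts(master_config):
    # Assumed (illustrative) predicate: keyed off an async-engine flag
    # somewhere in the generation config. The real field path may differ.
    return (
        master_config.get("policy", {})
        .get("generation", {})
        .get("vllm_cfg", {})
        .get("async_engine", False)
    )


def select_validation_rollout(master_config):
    # Mirrors the dispatch described in this PR, returning helper names
    # as strings for easy assertion.
    if _should_use_async_rollouts(master_config):
        return "run_multi_turn_rollout_async_generation"
    return "run_multi_turn_rollout"


async_cfg = {"policy": {"generation": {"vllm_cfg": {"async_engine": True}}}}
assert select_validation_rollout(async_cfg) == "run_multi_turn_rollout_async_generation"
assert select_validation_rollout({}) == "run_multi_turn_rollout"
```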

Notes

I could not run repo-native uv run pytest end-to-end in this environment because the project dependency resolution path pulls cuda-bindings==13.0.1, which is not available for the current macOS arm64 platform. The verification above was chosen to maximize signal despite that platform constraint.

@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
