Preserve batched env evaluation in async validation rollouts#2209
Draft
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from
Summary
Addresses #1798 by preserving batched environment evaluation during async validation.
Today, when async vLLM generation is enabled, GRPO and distillation validation route through `run_async_multi_turn_rollout()`. That helper is optimized for sample-level pipelining, which is a good fit for training and async trajectory collection, but it evaluates environments from per-sample loops instead of from the batched rollout loop used by synchronous validation. As a result, validation loses task-level batching in reward/env evaluation and pays unnecessary latency.

This PR keeps the training path unchanged and introduces a validation-specific rollout helper that combines async generation with the existing batched multi-turn environment loop.
What Changed
- Added `run_multi_turn_rollout_async_generation()` to `nemo_rl/experience/rollouts.py`
- The new helper reuses `run_multi_turn_rollout()`'s batched multi-turn control flow
- It calls `generate_responses_async()` for async vLLM generation
- It keeps env/reward evaluation batched via `calculate_rewards()`
- Validation uses the new helper when `_should_use_async_rollouts(master_config)` is true
- `run_async_multi_turn_rollout()` is left unchanged for training and async trajectory collection, where sample-level pipelining is still the intended behavior

Root Cause
The regression comes from using the same async rollout helper for both training and validation. `run_async_multi_turn_rollout()` processes each sample independently across turns. That architecture improves overlap for some training scenarios, but it also changes where environment evaluation happens: validation ended up on the pipelined path and lost the batching characteristics of `run_multi_turn_rollout()`.

Why This Design
This change fixes the validation bottleneck without broadening the blast radius:
The new helper is intentionally narrow and reuses the proven synchronous rollout structure, which keeps the fix easier to reason about and reduces regression risk.
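To make the batching difference concrete, here is a toy comparison of the two rollout shapes (the `CountingEnv` class is illustrative, not the NeMo-RL API; it only counts evaluation calls):

```python
# Per-sample vs batched environment evaluation, reduced to call counting.
class CountingEnv:
    def __init__(self):
        self.calls = 0

    def evaluate(self, samples):
        self.calls += 1                  # one invocation, regardless of batch size
        return [len(s) for s in samples]

samples = ["a", "bb", "ccc", "dddd"]

# Pipelined path (run_async_multi_turn_rollout() style): per-sample env calls.
pipelined = CountingEnv()
for s in samples:
    pipelined.evaluate([s])              # env sees batches of size 1

# Batched path (run_multi_turn_rollout() style): one env call per turn.
batched = CountingEnv()
batched.evaluate(samples)

print(pipelined.calls, batched.calls)   # 4 1
```

When environment evaluation amortizes well over a batch (e.g. vectorized reward computation), the single batched call is the cheaper shape, which is what validation loses on the pipelined path.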
User / Developer Impact
Validation
- Ran `python3.12 -m py_compile` on all changed source and test files
- The new helper follows `run_multi_turn_rollout()` semantics while preserving batched env grouping

Notes
I could not run the repo-native `uv run pytest` end-to-end in this environment because the project's dependency resolution pulls in `cuda-bindings==13.0.1`, which is not available for the current macOS arm64 platform. The verification above was chosen to maximize signal despite that platform constraint.
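For reference, a dependency-free syntax check of this kind can be run with `py_compile` from the standard library (the target file below is a throwaway example, not a repo path; any Python 3 interpreter works):

```shell
# Create a throwaway module and byte-compile it. py_compile needs no project
# dependencies, so it works even when the full dependency set cannot resolve.
printf 'x = 1\n' > /tmp/py_compile_demo.py
python3 -m py_compile /tmp/py_compile_demo.py && echo ok
```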