fix(evaluator): auto-init gloo process group for multi-rank launches #1306
Draft · Luodian wants to merge 1 commit into EvolvingLMMs-Lab:main
Summary

evaluator.evaluate() already calls torch.distributed collectives (gather_object, barrier) when WORLD_SIZE > 1. Whether those succeed today depends entirely on the model backend calling init_process_group itself. Most LLM backends do so via accelerate, but pure-diffusers generation backends (image/video T2X) often don't, and then the first collective in evaluate() fails with "default process group has not been initialized", even though WORLD_SIZE is set correctly.

Change

Right after reading WORLD_SIZE/RANK/LOCAL_RANK from the environment, if WORLD_SIZE > 1 and no process group has been initialized yet, initialize a gloo group. Gloo is CPU-only and universally available, so it works for backends that don't need NCCL. Backends that already initialize NCCL continue to work unchanged because is_initialized() short-circuits.

Non-goals / scope

No behavior change for WORLD_SIZE == 1: the not is_initialized() guard skips the branch entirely.

Test plan

Multi-rank launch with a generation backend that does not call init_process_group.
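A minimal sketch of the auto-init branch described above, assuming the evaluator reads the rank variables straight from the environment (the variable names here are illustrative, not necessarily the exact ones in evaluator.py):

```python
import os

import torch.distributed as dist

# Read the launch topology the same way the evaluator does.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
rank = int(os.environ.get("RANK", "0"))

if world_size > 1 and not dist.is_initialized():
    # Gloo is CPU-only and ships with every torch build, so this works for
    # diffusers-only backends that never need NCCL. Backends that already
    # initialized a group (e.g. NCCL via accelerate) skip this branch
    # because is_initialized() returns True.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
```

Under torchrun, WORLD_SIZE, RANK, MASTER_ADDR, and MASTER_PORT are all exported, so the default env:// rendezvous needs no extra arguments; with WORLD_SIZE == 1 (or unset) the branch is never entered.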