
fix(evaluator): auto-init gloo process group for multi-rank launches #1306

Draft

Luodian wants to merge 1 commit into EvolvingLMMs-Lab:main from Luodian:fix/auto-init-gloo-process-group

Conversation


Luodian (Contributor) commented Apr 23, 2026

Summary

evaluator.evaluate() already calls torch.distributed collectives (gather_object, barrier) when WORLD_SIZE > 1. Whether those succeed today depends entirely on the model backend calling init_process_group itself. Most LLM backends launched via accelerate do, but pure-diffusers generation backends (image/video T2X) often don't, and then the first collective in evaluate() fails with "default process group has not been initialized", even though WORLD_SIZE is set correctly.

Change

Right after reading WORLD_SIZE / RANK / LOCAL_RANK from env, if WORLD_SIZE > 1 and no group has been initialized yet, init a gloo group. Gloo is CPU-only and universally available, so it works for backends that don't need NCCL. Backends that already initialize NCCL continue to work unchanged because is_initialized() short-circuits.

world_size = int(os.environ.get("WORLD_SIZE", 1))
+
+ # Auto-init torch.distributed for multi-rank launches where the model
+ # backend did not call init_process_group itself (e.g. image/video
+ # generation models using diffusers). Without this, evaluator collectives
+ # like gather_object / barrier crash with "process group not initialized".
+ # gloo is used because it requires no GPU and works everywhere.
+ if world_size > 1 and dist.is_available() and not dist.is_initialized():
+     dist.init_process_group(backend="gloo")
+     eval_logger.info(f"evaluator: auto-initialized gloo process group (world={world_size})")
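For reference, here is a minimal standalone sketch of the same behavior. It is not code from this PR; the file name and torchrun invocation are illustrative only. It runs the evaluator-style collectives over an auto-initialized gloo group on CPU-only ranks:

# repro_gloo_autoinit.py: standalone sketch of the failure mode and the fix.
# Launch with: torchrun --nproc_per_node=2 repro_gloo_autoinit.py
# (torchrun sets WORLD_SIZE, RANK, MASTER_ADDR, and MASTER_PORT, so the
# default env:// rendezvous works with no extra configuration.)
import os

import torch.distributed as dist

world_size = int(os.environ.get("WORLD_SIZE", 1))

# The guard from the diff above: without it, the gather_object below raises
# "Default process group has not been initialized".
if world_size > 1 and dist.is_available() and not dist.is_initialized():
    dist.init_process_group(backend="gloo")  # CPU-only, no GPU required

if world_size > 1:
    rank = dist.get_rank()
    gathered = [None] * world_size if rank == 0 else None
    # The same collectives evaluate() relies on:
    dist.gather_object({"rank": rank}, gathered, dst=0)
    dist.barrier()
    if rank == 0:
        print(f"gathered from all {world_size} ranks: {gathered}")
    dist.destroy_process_group()

With the guard commented out, the same launch reproduces the "default process group has not been initialized" error described above.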

Non-goals / scope

  • No behavior change for WORLD_SIZE == 1.
  • Does not touch backends that already init NCCL — the not is_initialized() guard skips the branch entirely.
  • Does not change where NCCL backends are initialized.

Test plan

  • Multi-rank run of a diffusers-only backend — verify collectives succeed without the backend explicitly calling init_process_group (the standalone sketch after the diff above reproduces this path).
  • Multi-rank run of an existing NCCL-initialized backend — verify the log line does not appear and behavior is unchanged (see the no-op check sketched after this list).
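For the second bullet, a minimal sanity check of the no-op path (again a sketch, not PR code; gloo stands in for NCCL so the check runs without GPUs):

# check_guard_noop.py: simulate a backend that already initialized its own
# process group, then confirm the evaluator guard would not fire.
# Launch with: torchrun --nproc_per_node=2 check_guard_noop.py
import os

import torch.distributed as dist

# The "backend already called init_process_group" case (gloo here in place
# of NCCL so this runs on CPU-only machines).
dist.init_process_group(backend="gloo")

world_size = int(os.environ.get("WORLD_SIZE", 1))
# Re-evaluate the guard condition from the diff:
would_auto_init = world_size > 1 and dist.is_available() and not dist.is_initialized()
assert not would_auto_init, "guard should short-circuit on not is_initialized()"

dist.barrier()
dist.destroy_process_group()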

The evaluator already calls torch.distributed collectives like
gather_object and barrier when WORLD_SIZE > 1. Today, whether those
succeed depends on the model backend calling init_process_group
itself (most LLM backends launched via accelerate do, but generation
backends using only diffusers often don't). When a model skips that
init, the first collective in evaluate() fails with "default process
group has not been initialized", even though WORLD_SIZE is set
correctly.

If we are about to run collectives and no group exists yet, init a
gloo group ourselves. Gloo is CPU-only and universally available, so
it works for generation-only / diffusers-only backends that don't
need NCCL. Backends that already initialize NCCL continue to work
unchanged because the is_initialized() check short-circuits.

No behavior change for WORLD_SIZE == 1.