
fix(evaluator): auto-init gloo process group for multi-rank launches #1306

Draft

Luodian wants to merge 1 commit into EvolvingLMMs-Lab:main from Luodian:fix/auto-init-gloo-process-group

Conversation


Luodian (Contributor) commented Apr 23, 2026

Summary

evaluator.evaluate() already calls torch.distributed collectives (gather_object, barrier) when WORLD_SIZE > 1. Whether those succeed today depends entirely on the model backend calling init_process_group itself. Most LLM backends launched via accelerate do, but pure-diffusers generation backends (image/video T2X) often don't, and then the first collective in evaluate() fails with "default process group has not been initialized", even though WORLD_SIZE is set correctly.

Change

Right after reading WORLD_SIZE / RANK / LOCAL_RANK from env, if WORLD_SIZE > 1 and no group has been initialized yet, init a gloo group. Gloo is CPU-only and universally available, so it works for backends that don't need NCCL. Backends that already initialize NCCL continue to work unchanged because is_initialized() short-circuits.

world_size = int(os.environ.get("WORLD_SIZE", 1))
+
+ # Auto-init torch.distributed for multi-rank launches where the model
+ # backend did not call init_process_group itself (e.g. image/video
+ # generation models using diffusers). Without this, evaluator collectives
+ # like gather_object / barrier crash with "process group not initialized".
+ # gloo is used because it requires no GPU and works everywhere.
+ if world_size > 1 and dist.is_available() and not dist.is_initialized():
+     dist.init_process_group(backend="gloo")
+     eval_logger.info(f"evaluator: auto-initialized gloo process group (world={world_size})")
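For reference, here is a minimal standalone sketch of the same behavior. It is not code from this PR; the file name and torchrun invocation are illustrative only. It runs the evaluator-style collectives over an auto-initialized gloo group on CPU-only ranks:

# repro_gloo_autoinit.py: standalone sketch of the failure mode and the fix.
# Launch with: torchrun --nproc_per_node=2 repro_gloo_autoinit.py
# (torchrun sets WORLD_SIZE, RANK, MASTER_ADDR, and MASTER_PORT, so the
# default env:// rendezvous works with no extra configuration.)
import os

import torch.distributed as dist

world_size = int(os.environ.get("WORLD_SIZE", 1))

# The guard from the diff above: without it, the gather_object below raises
# "Default process group has not been initialized".
if world_size > 1 and dist.is_available() and not dist.is_initialized():
    dist.init_process_group(backend="gloo")  # CPU-only, no GPU required

if world_size > 1:
    rank = dist.get_rank()
    gathered = [None] * world_size if rank == 0 else None
    # The same collectives evaluate() relies on:
    dist.gather_object({"rank": rank}, gathered, dst=0)
    dist.barrier()
    if rank == 0:
        print(f"gathered from all {world_size} ranks: {gathered}")
    dist.destroy_process_group()

With the guard commented out, the same launch reproduces the "default process group has not been initialized" error described above.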

Non-goals / scope

  • No behavior change for WORLD_SIZE == 1.
  • Does not touch backends that already init NCCL — the not is_initialized() guard skips the branch entirely.
  • Does not change where NCCL backends are initialized.

Test plan

  • Multi-rank run of a diffusers-only backend — verify collectives succeed without the backend explicitly calling init_process_group (the standalone sketch after the diff above reproduces this path).
  • Multi-rank run of an existing NCCL-initialized backend — verify the log line does not appear and behavior is unchanged (see the no-op check sketched after this list).
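For the second bullet, a minimal sanity check of the no-op path (again a sketch, not PR code; gloo stands in for NCCL so the check runs without GPUs):

# check_guard_noop.py: simulate a backend that already initialized its own
# process group, then confirm the evaluator guard would not fire.
# Launch with: torchrun --nproc_per_node=2 check_guard_noop.py
import os

import torch.distributed as dist

# The "backend already called init_process_group" case (gloo here in place
# of NCCL so this runs on CPU-only machines).
dist.init_process_group(backend="gloo")

world_size = int(os.environ.get("WORLD_SIZE", 1))
# Re-evaluate the guard condition from the diff:
would_auto_init = world_size > 1 and dist.is_available() and not dist.is_initialized()
assert not would_auto_init, "guard should short-circuit on not is_initialized()"

dist.barrier()
dist.destroy_process_group()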

The evaluator already calls torch.distributed collectives like
gather_object and barrier when WORLD_SIZE > 1. Today, whether those
succeed depends on the model backend calling init_process_group
itself (most LLM backends launched via accelerate do, but generation
backends using only diffusers often don't). When a model skips that
init, the first collective in evaluate() fails with "default process
group has not been initialized", even though WORLD_SIZE is set
correctly.

If we are about to run collectives and no group exists yet, init a
gloo group ourselves. Gloo is CPU-only and universally available, so
it works for generation-only / diffusers-only backends that don't
need NCCL. Backends that already initialize NCCL continue to work
unchanged because the is_initialized() check short-circuits.

No behavior change for WORLD_SIZE == 1.