
Model eval recipe skeleton #791

Draft

pzharrington wants to merge 5 commits into NVIDIA:main from pzharrington:vnv-recipe-skeleton

Conversation

@pzharrington
Collaborator

Earth2Studio Pull Request

Description

Adds the basic skeleton of the model eval recipe: README, config system, main inference script, and unit tests. The structure is aimed at being a simplified, consolidated version of many of the existing features in the HENS and S2S recipes: multi-GPU distribution, perturbation/diagnostics, I/O handling, etc. Features will be added as needed after this initial PR is in place; the main item is the scoring system, which will be a separate script in the recipe.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.
  • Assess and address Greptile feedback (AI code review bot for guidance; use discretion, addressing all feedback is not required).

Dependencies

@pzharrington pzharrington self-assigned this Apr 3, 2026
@pzharrington
Collaborator Author

@greptileai

@greptile-apps
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR introduces the skeleton of a model-evaluation recipe for Earth2Studio: a Hydra-configured distributed inference pipeline (main.py), modular source files for work distribution, model loading, output management, and a full test suite including multi-GPU tests launched via torchrun. The architecture is a clean, simplified consolidation of patterns from the HENS/S2S recipes, with proper rank-ordered store creation, deterministic seed generation, and optional threaded zarr writes.

Key findings:

  • Missing pathlib.Path import in _multigpu_worker.py — Path is used at many call sites but is never imported; every multi-GPU test launched via torchrun will fail immediately with a NameError.
  • Threaded-writer exceptions deadlock distributed runs — in OutputManager.__exit__, a failed future raises via f.result() before the all_reduce/barrier collectives are reached, leaving all other ranks hanging indefinitely.
  • dx.to(device) discards return value — minor inconsistency with how the prognostic model's .to() is handled; works for standard nn.Module implementations but is fragile for models that return a new object from .to().
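The third finding can be illustrated with a toy example. The Diagnostic class below is hypothetical — it only mimics the convention (used by torch.Tensor.to, among others) that .to() may return a new object rather than mutating in place, which is why discarding the return value is fragile:

```python
class Diagnostic:
    """Hypothetical stand-in for a model whose .to() returns a NEW object
    (like torch.Tensor.to) instead of mutating in place (like nn.Module.to)."""

    def __init__(self, device: str = "cpu"):
        self.device = device

    def to(self, device: str) -> "Diagnostic":
        # Return a fresh instance on the target device; self is untouched.
        return Diagnostic(device)


dx = Diagnostic()
dx.to("cuda:0")       # fragile: return value discarded, dx stays on "cpu"
assert dx.device == "cpu"

dx = dx.to("cuda:0")  # robust: rebind, mirroring prognostic = prognostic.to(device)
assert dx.device == "cuda:0"
```

Rebinding works correctly for both conventions, which is why the review flags the inconsistency even though plain nn.Module implementations happen to work either way.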

Confidence Score: 4/5

Two P1 issues remain: a NameError that breaks all multi-GPU tests, and a deadlock path in OutputManager.__exit__ when threaded writes fail under distributed execution.

The single-GPU inference path is correct and well tested. However, _multigpu_worker.py is entirely broken (missing pathlib.Path import), making the multi-GPU test suite dead code until fixed. The threaded-writer deadlock in OutputManager.__exit__ is a real reliability problem for the distributed-plus-threaded-write configuration that is the default (thread_writers: 4). Both are straightforward fixes.

The files needing changes are recipes/eval/test/_multigpu_worker.py (missing import breaks all multi-GPU tests) and recipes/eval/src/output.py (future exception handling in __exit__ needs a try/except to avoid bypassing distributed collectives).

Important Files Changed

Filename Overview
recipes/eval/test/_multigpu_worker.py Worker script for multi-GPU pytest tests — pathlib.Path is used extensively but never imported, causing a NameError on every multi-GPU test run.
recipes/eval/src/output.py Distributed zarr output manager — correct lead-time coord construction (0…nsteps*stride), but threaded-writer exceptions in __exit__ bypass the all_reduce/barrier collectives, causing deadlocks in multi-GPU runs.
recipes/eval/src/inference.py Core inference loop — clean structure with correct step/break logic; minor inconsistency where dx.to(device) discards the return value unlike prognostic = prognostic.to(device).
recipes/eval/src/distributed.py Rank-ordered execution helper and logging setup — functionally correct; inline comments are admittedly confusing (noted in prior thread) but logic is sound.
recipes/eval/main.py Recipe entry point — model loading and OutputManager are now correctly placed before the idle-rank check so all ranks participate in barriers (prior-thread concern resolved).
recipes/eval/test/test_multigpu.py Multi-GPU pytest harness that launches _multigpu_worker.py via torchrun — tests are well-structured and correctly skip when GPUs are unavailable, but all tests will fail at runtime due to the missing Path import in the worker.
recipes/eval/src/models.py Prognostic and diagnostic model loading via Hydra — clean pattern supporting both _target_ and architecture styles with proper run_on_rank0_first wrapping.
recipes/eval/src/work.py Work item generation and rank-based partitioning — correct balanced distribution with remainder handling; deterministic FNV-1a seed derivation is a good choice.
recipes/eval/test/test_output.py Comprehensive unit tests for OutputManager — good coverage of overwrite, threading, ensemble, and lead-time length; all tests mock distributed correctly.
recipes/eval/test/test_inference.py Inference unit tests covering single IC, multiple ICs, empty work list, and ensemble writes — good use of fixtures and mocking.
recipes/eval/pyproject.toml nvidia-physicsnemo is now listed as a dependency — prior-thread concern resolved.
recipes/eval/cfg/default.yaml Well-documented default Hydra config — covers IC specification, forecast settings, data source, output, and placeholder scoring section.
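The deterministic FNV-1a seed derivation noted for work.py can be sketched as follows. The derive_seed name and its (base_seed, work-item key) inputs are assumptions — work.py is not shown in this thread — but the FNV-1a constants are the standard 64-bit parameters:

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3        # standard FNV-1a 64-bit prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Standard FNV-1a 64-bit hash: xor each byte in, then multiply."""
    h = FNV64_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV64_PRIME) & MASK64
    return h


def derive_seed(base_seed: int, work_item: str) -> int:
    """Hypothetical helper: fold a base seed and a work-item key into a
    per-item RNG seed that is stable regardless of rank assignment."""
    return fnv1a_64(f"{base_seed}:{work_item}".encode())


s0 = derive_seed(42, "ic=2020-01-01T00,member=0")
s1 = derive_seed(42, "ic=2020-01-01T00,member=1")
assert s0 == derive_seed(42, "ic=2020-01-01T00,member=0")  # deterministic
assert s0 != s1                                            # distinct per item
```

Deriving the seed from the work item rather than the rank is what makes results reproducible across different world sizes, since the same ensemble member always gets the same seed no matter which rank runs it.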

Reviews (5): Last reviewed commit: "Address early exit case"

@pzharrington
Collaborator Author

@greptileai

@pzharrington
Collaborator Author

@greptileai

@pzharrington
Collaborator Author

@greptileai

items = list(range(10))
my_items = distribute_work(items, dist.rank, dist.world_size)

result_file = Path(output_dir) / f"rank{dist.rank}_items.json"
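The worker snippet above calls distribute_work; since its body is not shown in this thread, here is a minimal sketch of the balanced partitioning with remainder handling that the review describes (signature assumed from the call site):

```python
def distribute_work(items: list, rank: int, world_size: int) -> list:
    """Contiguous balanced split: every rank gets len(items) // world_size
    items, and the first len(items) % world_size ranks each get one extra."""
    base, rem = divmod(len(items), world_size)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return items[start:end]


items = list(range(10))
parts = [distribute_work(items, r, 4) for r in range(4)]
assert [len(p) for p in parts] == [3, 3, 2, 2]  # remainder spread over low ranks
assert sum(parts, []) == items                  # disjoint cover of all items
```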
Contributor

P1 Path used but never imported — NameError at runtime

pathlib.Path is used throughout this worker script (Path(output_dir) / … at lines 61, 68, 71, 88–89, 97–99) but the module is never imported. Every multi-GPU test that launches this file via torchrun will fail immediately with:

NameError: name 'Path' is not defined

The fix is to add the missing import at the top of the file with the other stdlib imports:

from pathlib import Path

Comment on lines +104 to +108
if self._executor is not None:
    for f in self._futures:
        f.result()
    self._executor.shutdown(wait=True)
    self._futures.clear()
Contributor

P1 Threaded-writer exceptions bypass distributed all_reduce/barrier — deadlock

If any future in self._futures fails (e.g. a zarr write error in a thread), f.result() re-raises that exception and __exit__ exits early before reaching the all_reduce (line 115) or torch.distributed.barrier() (line 125). Both are collective operations — every rank must call them. Because this rank never reaches them, every other rank hangs indefinitely, resulting in a deadlock.

The has_error flag on line 110 (exc_type is not None) does not capture exceptions from futures, so the safety net built below this block is completely bypassed. This matters especially because the default config uses thread_writers: 4.

A robust fix is to collect future errors inside a try/except so they feed into the all_reduce:

if self._executor is not None:
    future_exc: BaseException | None = None
    for f in self._futures:
        try:
            f.result()
        except Exception as e:
            if future_exc is None:
                future_exc = e
    self._executor.shutdown(wait=True)
    self._futures.clear()
    if future_exc is not None:
        exc_type = type(future_exc)

has_error = exc_type is not None

This ensures the all_reduce and barrier are always reached regardless of write failures.
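The failure mode and the fix pattern can be exercised standalone, without a distributed backend. The sketch below uses plain concurrent.futures; the comment marks where the collective calls from output.py would run, which is the point every rank must reach:

```python
from concurrent.futures import ThreadPoolExecutor


def good_write():
    return None  # stands in for a successful threaded zarr write


def bad_write():
    raise IOError("zarr write failed")  # stands in for a failed write


executor = ThreadPoolExecutor(max_workers=2)
futures = [executor.submit(good_write), executor.submit(bad_write)]

# Collect failures instead of letting the first f.result() raise and
# skip the collective calls that every rank must reach.
future_exc = None
for f in futures:
    try:
        f.result()
    except Exception as e:
        if future_exc is None:
            future_exc = e
executor.shutdown(wait=True)

has_error = future_exc is not None
# ... all_reduce(has_error) / torch.distributed.barrier() would run here,
#     on every rank, whether or not a write failed ...
assert has_error
```

With the naive loop, the IOError would propagate out of the first failing f.result() and the code after the loop would never execute, which on a real run means the collectives are skipped and the other ranks block forever.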
