
Model eval recipe skeleton #791

Draft

pzharrington wants to merge 5 commits into NVIDIA:main from pzharrington:vnv-recipe-skeleton

Conversation

@pzharrington
Collaborator

Earth2Studio Pull Request

Description

Adds the basic skeleton of the model eval recipe: README, config system, main inference script, and unit tests. The structure is aimed at being a simplified, consolidated version of many of the existing features in the HENS and S2S recipes: multi-GPU distribution, perturbation/diagnostics, I/O handling, etc. Features will be added as needed after this initial PR is in place; the main item is the scoring system, which will be a separate script in the recipe.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.
  • Assess and address Greptile feedback (AI code review bot for guidance; use discretion, addressing all feedback is not required).

Dependencies

@pzharrington pzharrington self-assigned this Apr 3, 2026
@pzharrington
Collaborator Author

@greptileai

@greptile-apps
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR introduces the skeleton of a model-evaluation recipe for Earth2Studio: a Hydra-configured distributed inference pipeline (main.py), modular source files for work distribution, model loading, output management, and a full test suite including multi-GPU tests launched via torchrun. The architecture is a clean, simplified consolidation of patterns from the HENS/S2S recipes, with proper rank-ordered store creation, deterministic seed generation, and optional threaded zarr writes.

Key findings:

  • Missing pathlib.Path import in _multigpu_worker.py — Path is used at many call sites but is never imported; every multi-GPU test launched via torchrun will fail immediately with a NameError.
  • Threaded-writer exceptions deadlock distributed runs — in OutputManager.__exit__, a failed future raises via f.result() before the all_reduce/barrier collectives are reached, leaving all other ranks hanging indefinitely.
  • dx.to(device) discards return value — minor inconsistency with how the prognostic model's .to() is handled; works for standard nn.Module implementations but is fragile for models that return a new object from .to().
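The third finding can be illustrated with a toy example. The Diagnostic class below is hypothetical — it only mimics the convention (used by torch.Tensor.to, among others) that .to() may return a new object rather than mutating in place, which is why discarding the return value is fragile:

```python
class Diagnostic:
    """Hypothetical stand-in for a model whose .to() returns a NEW object
    (like torch.Tensor.to) instead of mutating in place (like nn.Module.to)."""

    def __init__(self, device: str = "cpu"):
        self.device = device

    def to(self, device: str) -> "Diagnostic":
        # Return a fresh instance on the target device; self is untouched.
        return Diagnostic(device)


dx = Diagnostic()
dx.to("cuda:0")       # fragile: return value discarded, dx stays on "cpu"
assert dx.device == "cpu"

dx = dx.to("cuda:0")  # robust: rebind, mirroring prognostic = prognostic.to(device)
assert dx.device == "cuda:0"
```

Rebinding works correctly for both conventions, which is why the review flags the inconsistency even though plain nn.Module implementations happen to work either way.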

Confidence Score: 4/5

Two P1 issues remain: a NameError that breaks all multi-GPU tests, and a deadlock path in OutputManager.__exit__ when threaded writes fail under distributed execution.

The single-GPU inference path is correct and well tested. However, _multigpu_worker.py is entirely broken (missing pathlib.Path import), making the multi-GPU test suite dead code until fixed. The threaded-writer deadlock in OutputManager.__exit__ is a real reliability problem for the distributed-plus-threaded-write configuration that is the default (thread_writers: 4). Both are straightforward fixes.

The files needing changes are recipes/eval/test/_multigpu_worker.py (missing import breaks all multi-GPU tests) and recipes/eval/src/output.py (future exception handling in __exit__ needs a try/except to avoid bypassing distributed collectives).

Important Files Changed

Filename Overview
recipes/eval/test/_multigpu_worker.py Worker script for multi-GPU pytest tests — pathlib.Path is used extensively but never imported, causing a NameError on every multi-GPU test run.
recipes/eval/src/output.py Distributed zarr output manager — correct lead-time coord construction (0…nsteps*stride), but threaded-writer exceptions in __exit__ bypass the all_reduce/barrier collectives, causing deadlocks in multi-GPU runs.
recipes/eval/src/inference.py Core inference loop — clean structure with correct step/break logic; minor inconsistency where dx.to(device) discards the return value unlike prognostic = prognostic.to(device).
recipes/eval/src/distributed.py Rank-ordered execution helper and logging setup — functionally correct; inline comments are admittedly confusing (noted in prior thread) but logic is sound.
recipes/eval/main.py Recipe entry point — model loading and OutputManager are now correctly placed before the idle-rank check so all ranks participate in barriers (prior-thread concern resolved).
recipes/eval/test/test_multigpu.py Multi-GPU pytest harness that launches _multigpu_worker.py via torchrun — tests are well-structured and correctly skip when GPUs are unavailable, but all tests will fail at runtime due to the missing Path import in the worker.
recipes/eval/src/models.py Prognostic and diagnostic model loading via Hydra — clean pattern supporting both _target_ and architecture styles with proper run_on_rank0_first wrapping.
recipes/eval/src/work.py Work item generation and rank-based partitioning — correct balanced distribution with remainder handling; deterministic FNV-1a seed derivation is a good choice.
recipes/eval/test/test_output.py Comprehensive unit tests for OutputManager — good coverage of overwrite, threading, ensemble, and lead-time length; all tests mock distributed correctly.
recipes/eval/test/test_inference.py Inference unit tests covering single IC, multiple ICs, empty work list, and ensemble writes — good use of fixtures and mocking.
recipes/eval/pyproject.toml nvidia-physicsnemo is now listed as a dependency — prior-thread concern resolved.
recipes/eval/cfg/default.yaml Well-documented default Hydra config — covers IC specification, forecast settings, data source, output, and placeholder scoring section.
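The deterministic FNV-1a seed derivation noted for work.py can be sketched as follows. The derive_seed name and its (base_seed, work-item key) inputs are assumptions — work.py is not shown in this thread — but the FNV-1a constants are the standard 64-bit parameters:

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3        # standard FNV-1a 64-bit prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Standard FNV-1a 64-bit hash: xor each byte in, then multiply."""
    h = FNV64_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV64_PRIME) & MASK64
    return h


def derive_seed(base_seed: int, work_item: str) -> int:
    """Hypothetical helper: fold a base seed and a work-item key into a
    per-item RNG seed that is stable regardless of rank assignment."""
    return fnv1a_64(f"{base_seed}:{work_item}".encode())


s0 = derive_seed(42, "ic=2020-01-01T00,member=0")
s1 = derive_seed(42, "ic=2020-01-01T00,member=1")
assert s0 == derive_seed(42, "ic=2020-01-01T00,member=0")  # deterministic
assert s0 != s1                                            # distinct per item
```

Deriving the seed from the work item rather than the rank is what makes results reproducible across different world sizes, since the same ensemble member always gets the same seed no matter which rank runs it.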

Reviews (5): Last reviewed commit: "Address early exit case"

@pzharrington
Collaborator Author

@greptileai

@pzharrington
Collaborator Author

@greptileai

@pzharrington
Collaborator Author

@greptileai

items = list(range(10))
my_items = distribute_work(items, dist.rank, dist.world_size)

result_file = Path(output_dir) / f"rank{dist.rank}_items.json"
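The worker snippet above calls distribute_work; since its body is not shown in this thread, here is a minimal sketch of the balanced partitioning with remainder handling that the review describes (signature assumed from the call site):

```python
def distribute_work(items: list, rank: int, world_size: int) -> list:
    """Contiguous balanced split: every rank gets len(items) // world_size
    items, and the first len(items) % world_size ranks each get one extra."""
    base, rem = divmod(len(items), world_size)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return items[start:end]


items = list(range(10))
parts = [distribute_work(items, r, 4) for r in range(4)]
assert [len(p) for p in parts] == [3, 3, 2, 2]  # remainder spread over low ranks
assert sum(parts, []) == items                  # disjoint cover of all items
```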
Contributor

P1 Path used but never imported — NameError at runtime

pathlib.Path is used throughout this worker script (Path(output_dir) / … at lines 61, 68, 71, 88–89, 97–99) but the module is never imported. Every multi-GPU test that launches this file via torchrun will fail immediately with:

NameError: name 'Path' is not defined

The fix is to add the missing import at the top of the file with the other stdlib imports:

from pathlib import Path

Comment on lines +104 to +108
if self._executor is not None:
    for f in self._futures:
        f.result()
    self._executor.shutdown(wait=True)
    self._futures.clear()
Contributor

P1 Threaded-writer exceptions bypass distributed all_reduce/barrier — deadlock

If any future in self._futures fails (e.g. a zarr write error in a thread), f.result() re-raises that exception and __exit__ exits early before reaching the all_reduce (line 115) or torch.distributed.barrier() (line 125). Both are collective operations — every rank must call them. Because this rank never reaches them, every other rank hangs indefinitely, resulting in a deadlock.

The has_error flag on line 110 (exc_type is not None) does not capture exceptions from futures, so the safety net built below this block is completely bypassed. This matters especially because the default config uses thread_writers: 4.

A robust fix is to collect future errors inside a try/except so they feed into the all_reduce:

if self._executor is not None:
    future_exc: BaseException | None = None
    for f in self._futures:
        try:
            f.result()
        except Exception as e:
            if future_exc is None:
                future_exc = e
    self._executor.shutdown(wait=True)
    self._futures.clear()
    if future_exc is not None:
        exc_type = type(future_exc)

has_error = exc_type is not None

This ensures the all_reduce and barrier are always reached regardless of write failures.
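The failure mode and the fix pattern can be exercised standalone, without a distributed backend. The sketch below uses plain concurrent.futures; the comment marks where the collective calls from output.py would run, which is the point every rank must reach:

```python
from concurrent.futures import ThreadPoolExecutor


def good_write():
    return None  # stands in for a successful threaded zarr write


def bad_write():
    raise IOError("zarr write failed")  # stands in for a failed write


executor = ThreadPoolExecutor(max_workers=2)
futures = [executor.submit(good_write), executor.submit(bad_write)]

# Collect failures instead of letting the first f.result() raise and
# skip the collective calls that every rank must reach.
future_exc = None
for f in futures:
    try:
        f.result()
    except Exception as e:
        if future_exc is None:
            future_exc = e
executor.shutdown(wait=True)

has_error = future_exc is not None
# ... all_reduce(has_error) / torch.distributed.barrier() would run here,
#     on every rank, whether or not a write failed ...
assert has_error
```

With the naive loop, the IOError would propagate out of the first failing f.result() and the code after the loop would never execute, which on a real run means the collectives are skipped and the other ranks block forever.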
