Greptile Summary

This PR introduces the skeleton of a model-evaluation recipe for Earth2Studio: a Hydra-configured distributed inference pipeline. Key findings:
| Filename | Overview |
|---|---|
| recipes/eval/test/_multigpu_worker.py | Worker script for multi-GPU pytest tests — pathlib.Path is used extensively but never imported, causing a NameError on every multi-GPU test run. |
| recipes/eval/src/output.py | Distributed zarr output manager — correct lead-time coord construction (0…nsteps*stride), but threaded-writer exceptions in __exit__ bypass the all_reduce/barrier collectives, causing deadlocks in multi-GPU runs. |
| recipes/eval/src/inference.py | Core inference loop — clean structure with correct step/break logic; minor inconsistency where dx.to(device) discards the return value unlike prognostic = prognostic.to(device). |
| recipes/eval/src/distributed.py | Rank-ordered execution helper and logging setup — functionally correct; inline comments are admittedly confusing (noted in prior thread) but logic is sound. |
| recipes/eval/main.py | Recipe entry point — model loading and OutputManager are now correctly placed before the idle-rank check so all ranks participate in barriers (prior-thread concern resolved). |
| recipes/eval/test/test_multigpu.py | Multi-GPU pytest harness that launches _multigpu_worker.py via torchrun — tests are well-structured and correctly skip when GPUs are unavailable, but all tests will fail at runtime due to the missing Path import in the worker. |
| recipes/eval/src/models.py | Prognostic and diagnostic model loading via Hydra — clean pattern supporting both _target_ and architecture styles with proper run_on_rank0_first wrapping. |
| recipes/eval/src/work.py | Work item generation and rank-based partitioning — correct balanced distribution with remainder handling; deterministic FNV-1a seed derivation is a good choice. |
| recipes/eval/test/test_output.py | Comprehensive unit tests for OutputManager — good coverage of overwrite, threading, ensemble, and lead-time length; all tests mock distributed correctly. |
| recipes/eval/test/test_inference.py | Inference unit tests covering single IC, multiple ICs, empty work list, and ensemble writes — good use of fixtures and mocking. |
| recipes/eval/pyproject.toml | nvidia-physicsnemo is now listed as a dependency — prior-thread concern resolved. |
| recipes/eval/cfg/default.yaml | Well-documented default Hydra config — covers IC specification, forecast settings, data source, output, and placeholder scoring section. |
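The balanced partitioning and FNV-1a seeding noted for `src/work.py` can be sketched as follows. This is a hedged illustration of the behavior the review describes, not the recipe's actual code; the function names and signatures are assumptions.

```python
# Hedged sketch of the partitioning described for src/work.py; the real
# function names and signatures in the recipe may differ.
def distribute_work(items, rank, world_size):
    """Split `items` across ranks; per-rank counts differ by at most one."""
    base, rem = divmod(len(items), world_size)
    # The first `rem` ranks each take one extra item (remainder handling).
    start = rank * base + min(rank, rem)
    count = base + (1 if rank < rem else 0)
    return items[start:start + count]


def fnv1a_seed(key: str) -> int:
    """Deterministic 64-bit FNV-1a hash of a work-item key, usable as an RNG seed."""
    h = 0xCBF29CE484222325  # FNV-1a 64-bit offset basis
    for b in key.encode():
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2^64
    return h


items = list(range(10))
print([distribute_work(items, r, 3) for r in range(3)])
# → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Deriving seeds from stable work-item keys (rather than from rank indices) keeps results reproducible even when the rank count changes between runs.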
Reviews (5): Last reviewed commit: "Address early exit case"
```python
items = list(range(10))
my_items = distribute_work(items, dist.rank, dist.world_size)
# …
result_file = Path(output_dir) / f"rank{dist.rank}_items.json"
```
Path used but never imported — NameError at runtime
pathlib.Path is used throughout this worker script (Path(output_dir) / … at lines 61, 68, 71, 88–89, 97–99) but the module is never imported. Every multi-GPU test that launches this file via torchrun will fail immediately with:

```
NameError: name 'Path' is not defined
```
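A minimal standalone reproduction of the failure mode (not the worker script itself; `build_result_path` is a hypothetical stand-in for the worker's path construction):

```python
# Minimal reproduction: referencing Path without importing it from pathlib
# raises NameError at the first call, before any test logic runs.
def build_result_path(output_dir: str, rank: int):
    return Path(output_dir) / f"rank{rank}_items.json"  # Path never imported

try:
    build_result_path("/tmp/out", 0)
except NameError as e:
    print(e)  # → name 'Path' is not defined
```

Because the name is only resolved when the function body executes, the module imports cleanly; the error surfaces only once a rank actually reaches the path-building code under torchrun.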
The fix is to add the missing import at the top of the file with the other stdlib imports:

```python
from pathlib import Path
```

```python
if self._executor is not None:
    for f in self._futures:
        f.result()
    self._executor.shutdown(wait=True)
    self._futures.clear()
```
Threaded-writer exceptions bypass distributed all_reduce/barrier — deadlock
If any future in self._futures fails (e.g. a zarr write error in a thread), f.result() re-raises that exception and __exit__ exits early before reaching the all_reduce (line 115) or torch.distributed.barrier() (line 125). Both are collective operations — every rank must call them. Because this rank never reaches them, every other rank hangs indefinitely, resulting in a deadlock.
The has_error flag on line 110 (exc_type is not None) does not capture exceptions from futures, so the safety net built below this block is completely bypassed. This matters especially because the default config uses thread_writers: 4.
A robust fix is to collect future errors inside a try/except so they feed into the all_reduce:
```python
if self._executor is not None:
    future_exc: BaseException | None = None
    for f in self._futures:
        try:
            f.result()
        except Exception as e:
            if future_exc is None:
                future_exc = e
    self._executor.shutdown(wait=True)
    self._futures.clear()
    if future_exc is not None:
        exc_type = type(future_exc)
has_error = exc_type is not None
```

This ensures the all_reduce and barrier are always reached regardless of write failures.
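The underlying mechanism can be demonstrated in isolation. This is an illustrative sketch, not the recipe's OutputManager: an exception raised inside a worker thread is stored on its Future, so a context manager's `__exit__` receives `exc_type=None` unless `.result()` re-raises it before the error check runs.

```python
from concurrent.futures import ThreadPoolExecutor

class NaiveWriter:
    """Hypothetical writer showing why `exc_type is not None` misses
    thread failures: the future's exception never propagates to __exit__."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=2)
        self._futures = []
        self.exit_saw_error = None

    def submit(self, fn):
        self._futures.append(self._executor.submit(fn))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Mirrors `has_error = exc_type is not None`: only with-body
        # exceptions are visible here, not ones stored on futures.
        self.exit_saw_error = exc_type is not None
        self._executor.shutdown(wait=True)  # waits for the failing task
        return False

def failing_write():
    raise OSError("zarr write failed")

w = NaiveWriter()
with w:
    w.submit(failing_write)

print(w.exit_saw_error)                            # → False
print(type(w._futures[0].exception()).__name__)    # → OSError
```

In a real multi-rank run, the rank whose `has_error` stays False would still reach the collectives, but the rank that re-raises via `f.result()` would not, which is exactly the asymmetry that deadlocks the job.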
Earth2Studio Pull Request
Description
Adds the basic skeleton of the model eval recipe: README, config system, main inference script, and unit testing. The structure aims to be a simplified, consolidated version of many of the existing features in the HENS + S2S recipe: multi-GPU distribution, perturbation/diagnostics, I/O handling, etc. Features will be added as needed after this initial PR is in place; the main item is the scoring system, which will be a separate script in the recipe.
Checklist
Dependencies