Fix FSDP issues when "random" forecast policies are used by SavvasMel · Pull Request #2128 · ecmwf/WeatherGenerator

SavvasMel · 2026-03-28T16:38:02Z

Description

This PR introduces two changes:

Add a new forecasting policy called "sequential_prob". This means that a specific sequence of forecast steps can be selected with specific probabilities. For example the forecast steps below have been chosen with some spike probabilities:

0: 000 : 00010/00125 : 000010 : loss = 3.0734E-01 (lr=1.64E-06, fsteps=[19, 5, 5, 3, 5, 5, 5, 5, 3, 3, 2], s/sec=0.167)
0: 	
0: LossPhysical.ERA5.mse.avg : 6.1012E-01 	
0: LossPhysical.loss_avg : 3.0734E-01 	
0: 
0: 
0: 000 : 00020/00125 : 000020 : loss = 3.0750E-01 (lr=3.34E-06, fsteps=[5, 1, 1, 5, 3, 5, 5, 1, 19, 5], s/sec=0.482)
0: 	
0: LossPhysical.ERA5.mse.avg : 6.1149E-01 	
0: LossPhysical.loss_avg : 3.0750E-01

When the forecast policy is the one described above or when it is "random", FSDP complains because the different number of forecast steps are not aligned well among the workers. This PR fixes this issue but introducing a rank-dependent random generator.

Issue Number

Closes #1984

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hegdedoc in the github issue with all the configurations and runs for this experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel

clessig · 2026-03-29T08:06:53Z

src/weathergen/datasets/multi_stream_data_sampler.py

+        worker_info = torch.utils.data.get_worker_info()
+        worker_id = worker_info.id if worker_info is not None else 0
+        forecast_seed = self._base_rng_seed * ((worker_id + 1) * 37) * (self.mini_epoch + 13) * 7
+        forecast_rng = np.random.default_rng(forecast_seed)


Shouldn't this be a class variable that is initialized in the constructor?

clessig · 2026-03-29T08:07:27Z

src/weathergen/datasets/multi_stream_data_sampler.py

+        # worker_workset seed formula is used to  but exclude the rank component.
+        worker_info = torch.utils.data.get_worker_info()
+        worker_id = worker_info.id if worker_info is not None else 0
+        forecast_seed = self._base_rng_seed * ((worker_id + 1) * 37) * (self.mini_epoch + 13) * 7


We use this formula twice (also at the bottom of the file). Please put into a small function.

Introduce random generator for random or seq_prob policies

d27ce09

github-project-automation bot added this to WeatherGen-dev Mar 28, 2026

Linting

718416e

SavvasMel changed the title ~~Introduce random generator for random or seq_prob policies~~ Fix FSDP issues when "random" forecast policies are used Mar 28, 2026

github-actions bot added the bug Something isn't working label Mar 28, 2026

clessig reviewed Mar 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix FSDP issues when "random" forecast policies are used#2128

Fix FSDP issues when "random" forecast policies are used#2128
SavvasMel wants to merge 2 commits intoecmwf:developfrom
SavvasMel:SavvasMel/develop/FSDP_policy_fix

SavvasMel commented Mar 28, 2026 •

edited

Loading

Uh oh!

clessig Mar 29, 2026

Uh oh!

clessig Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

SavvasMel commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Issue Number

Checklist before asking for review

Uh oh!

clessig Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

clessig Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SavvasMel commented Mar 28, 2026 •

edited

Loading