Skip to content

Fix FSDP issues when "random" forecast policies are used#2128

Open
SavvasMel wants to merge 2 commits intoecmwf:developfrom
SavvasMel:SavvasMel/develop/FSDP_policy_fix
Open

Fix FSDP issues when "random" forecast policies are used#2128
SavvasMel wants to merge 2 commits intoecmwf:developfrom
SavvasMel:SavvasMel/develop/FSDP_policy_fix

Conversation

@SavvasMel
Copy link
Copy Markdown
Contributor

@SavvasMel SavvasMel commented Mar 28, 2026

Description

This PR introduces two changes:

  1. Add a new forecasting policy called "sequential_prob". This means that a specific sequence of forecast steps can be selected with specific probabilities. For example the forecast steps below have been chosen with some spike probabilities:
0: 000 : 00010/00125 : 000010 : loss = 3.0734E-01 (lr=1.64E-06, fsteps=[19, 5, 5, 3, 5, 5, 5, 5, 3, 3, 2], s/sec=0.167)
0: 	
0: LossPhysical.ERA5.mse.avg : 6.1012E-01 	
0: LossPhysical.loss_avg : 3.0734E-01 	
0: 
0: 
0: 000 : 00020/00125 : 000020 : loss = 3.0750E-01 (lr=3.34E-06, fsteps=[5, 1, 1, 5, 3, 5, 5, 1, 19, 5], s/sec=0.482)
0: 	
0: LossPhysical.ERA5.mse.avg : 6.1149E-01 	
0: LossPhysical.loss_avg : 3.0750E-01 	

  1. When the forecast policy is the one described above or when it is "random", FSDP complains because the different number of forecast steps are not aligned well among the workers. This PR fixes this issue but introducing a rank-dependent random generator.

Issue Number

Closes #1984

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hegdedoc in the github issue with all the configurations and runs for this experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@SavvasMel SavvasMel changed the title Introduce random generator for random or seq_prob policies Fix FSDP issues when "random" forecast policies are used Mar 28, 2026
@github-actions github-actions bot added the bug Something isn't working label Mar 28, 2026
worker_info = torch.utils.data.get_worker_info()
worker_id = worker_info.id if worker_info is not None else 0
forecast_seed = self._base_rng_seed * ((worker_id + 1) * 37) * (self.mini_epoch + 13) * 7
forecast_rng = np.random.default_rng(forecast_seed)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a class variable that is initialized in the constructor?

# worker_workset seed formula is used to but exclude the rank component.
worker_info = torch.utils.data.get_worker_info()
worker_id = worker_info.id if worker_info is not None else 0
forecast_seed = self._base_rng_seed * ((worker_id + 1) * 37) * (self.mini_epoch + 13) * 7
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use this formula twice (also at the bottom of the file). Please put into a small function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Random forecast policy and related policies not functional

2 participants