
[2091][performance] Track throughput and utilization metrics (optional)#2124

Draft
florianscheidl wants to merge 37 commits into ecmwf:develop from florianscheidl:fscheidl/flo-85-first-iteration-of-performance-metric-profiling

Conversation

@florianscheidl (Contributor) commented Mar 27, 2026

Description

TBA

Issue Number

Closes #2091.

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and posted the run_id(s) in a comment: launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions bot added labels on Mar 27, 2026: data (Anything related to the datasets used in the project), eval (anything related to the model evaluation pipeline), infra (Issues related to infrastructure), performance (Work related to performance improvements)
self.batch_size_validation_per_gpu = -1
self.batch_size_test_per_gpu = -1
self.collapse_monitor: CollapseMonitor | None = None
self.throughput: Throughput | None = None
Collaborator

Please avoid adding many more variables and code in Trainer. It's a critical and complex part of the code that we need to keep as clean as possible.

These variables could go into a small separate class.
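The suggested extraction could look like the following minimal sketch. The class name `PerfTrackingState` and the `enabled` helper are hypothetical; the fields mirror the Trainer attributes in the snippet above, and `CollapseMonitor`/`Throughput` stand in for the classes referenced in the PR.

```python
from dataclasses import dataclass


# Hypothetical container for the tracking state currently attached to Trainer.
# Moving it out keeps Trainer down to a single `self.perf` attribute.
@dataclass
class PerfTrackingState:
    batch_size_validation_per_gpu: int = -1
    batch_size_test_per_gpu: int = -1
    collapse_monitor: "CollapseMonitor | None" = None
    throughput: "Throughput | None" = None

    def enabled(self) -> bool:
        """True once any tracking component has been attached."""
        return self.collapse_monitor is not None or self.throughput is not None
```

In Trainer this would reduce to `self.perf = PerfTrackingState()`, with the monitor and throughput objects attached later once the device is known.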

collapse_config = cf.train_logging.get("collapse_monitoring", {})
self.collapse_monitor = CollapseMonitor(collapse_config, None) # device set later in run()

if cf.train_logging.get("track_performance_metrics"):
Collaborator

Can this be implemented in the class above?

sample_batch = next(iter(self.data_loader))
sample_batch.to_device(self.device)
self.precompute_flops(sample_batch)
self.precompute_source_bytes(sample_batch)
Collaborator

Note this is not a constant but might fluctuate widely between samples.
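Since the per-batch FLOPs and source bytes fluctuate between samples, one option is to accumulate a running mean over real batches instead of precomputing from a single sample batch. A minimal sketch (the class name is hypothetical, not part of the PR):

```python
class RunningStat:
    """Running mean of a per-batch quantity (e.g. FLOPs or source bytes)
    that fluctuates between samples, rather than treating it as constant."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.total += float(value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0
```

Each logging step would then call something like `flops_stat.update(measured_flops)`, and the MFU computation would read `flops_stat.mean`.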


if self.cf.train_logging.get("track_performance_metrics"):
# precompute batch statistics for throughput/MFU tracking
sample_batch = next(iter(self.data_loader))
Collaborator

With this we change the training samples and logic. Can't we collect on the fly in train, perhaps sparsely if there are performance concerns? (See also the comment right below.)

else:
assert False, "validate_before_training must be integer or boolean."

def _compute_targets_and_auxs(self, sample_batch) -> dict:
Collaborator

Why has this changed/moved?

)
return targets_and_auxs

def precompute_flops(self, sample_batch) -> None:
Collaborator

I really have concerns that we run training before we start training. E.g., for model development any bug will be triggered here rather than in the actual training loop; this is not feasible.
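One way to address this concern is to defer the measurement to the first real batch inside the training loop instead of a separate pre-run. A sketch under the assumption of a hypothetical `estimate_flops` method (not the PR's API):

```python
def track_first_batch_flops(trainer, batch) -> None:
    """Hypothetical alternative to a pre-training dry run: estimate FLOPs
    lazily from the first *real* training batch, so any model bug still
    surfaces inside the actual training loop rather than before it."""
    if getattr(trainer, "_flops_per_batch", None) is None:
        # Runs exactly once, on the first batch the loop sees.
        trainer._flops_per_batch = trainer.estimate_flops(batch)
```

The training loop would call this at the top of its step; no extra forward pass happens before training starts.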

self._log_terminal(bidx, mini_epoch, TRAIN)
if bidx % self.train_logging.metrics == 0:
self._log(TRAIN)
# Log throughput metrics
Collaborator

Again, too much code here

if self.cf.general.istep == self._THROUGHPUT_WARMUP_STEPS - 1:
self._t0_throughput = time.time()
self._throughput_warmup_done = True
else:
Collaborator

What is happening here exactly?
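For context, the snippet above appears to restart the throughput timer once a fixed number of warmup steps has passed, so that one-off startup costs (data-loader spin-up, kernel compilation) do not skew the measured rate. A hedged reconstruction of that intent, with the constant's value assumed:

```python
import time

THROUGHPUT_WARMUP_STEPS = 10  # assumed value; the PR uses a class constant


def update_throughput_timer(state: dict, istep: int) -> None:
    """On the last warmup step, record t0 and mark warmup as done;
    all subsequent throughput numbers are measured from t0 onward."""
    if istep == THROUGHPUT_WARMUP_STEPS - 1:
        state["t0"] = time.time()
        state["warmup_done"] = True
```

If that reading is right, a brief comment in the PR code explaining why the timer starts late would answer this question for future readers.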


Labels

data, eval, infra, performance

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Performance metrics

2 participants