
[2091][performance] Track throughput and utilization metrics (optional)#2124

Draft
florianscheidl wants to merge 37 commits into ecmwf:develop from florianscheidl:fscheidl/flo-85-first-iteration-of-performance-metric-profiling

Conversation

@florianscheidl (Contributor) commented Mar 27, 2026

Description

TBA

Issue Number

Closes #2091.

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and posted the run_id(s) in a comment: launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions bot added labels on Mar 27, 2026: data (Anything related to the datasets used in the project), eval (anything related to the model evaluation pipeline), infra (Issues related to infrastructure), performance (Work related to performance improvements)
self.batch_size_validation_per_gpu = -1
self.batch_size_test_per_gpu = -1
self.collapse_monitor: CollapseMonitor | None = None
self.throughput: Throughput | None = None
Collaborator

Please avoid adding many more variables and code in Trainer. It's a critical and complex part of the code that we need to keep as clean as possible.

These variables could go into a small separate class.
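The suggested extraction could look like the following minimal sketch. The class name `PerfTrackingState` and the `enabled` helper are hypothetical; the fields mirror the Trainer attributes in the snippet above, and `CollapseMonitor`/`Throughput` stand in for the classes referenced in the PR.

```python
from dataclasses import dataclass


# Hypothetical container for the tracking state currently attached to Trainer.
# Moving it out keeps Trainer down to a single `self.perf` attribute.
@dataclass
class PerfTrackingState:
    batch_size_validation_per_gpu: int = -1
    batch_size_test_per_gpu: int = -1
    collapse_monitor: "CollapseMonitor | None" = None
    throughput: "Throughput | None" = None

    def enabled(self) -> bool:
        """True once any tracking component has been attached."""
        return self.collapse_monitor is not None or self.throughput is not None
```

In Trainer this would reduce to `self.perf = PerfTrackingState()`, with the monitor and throughput objects attached later once the device is known.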

collapse_config = cf.train_logging.get("collapse_monitoring", {})
self.collapse_monitor = CollapseMonitor(collapse_config, None) # device set later in run()

if cf.train_logging.get("track_performance_metrics"):
Collaborator

Can this be implemented in the class above?

sample_batch = next(iter(self.data_loader))
sample_batch.to_device(self.device)
self.precompute_flops(sample_batch)
self.precompute_source_bytes(sample_batch)
Collaborator

Note this is not a constant but might fluctuate widely between samples.
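Since the per-batch FLOPs and source bytes fluctuate between samples, one option is to accumulate a running mean over real batches instead of precomputing from a single sample batch. A minimal sketch (the class name is hypothetical, not part of the PR):

```python
class RunningStat:
    """Running mean of a per-batch quantity (e.g. FLOPs or source bytes)
    that fluctuates between samples, rather than treating it as constant."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.total += float(value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0
```

Each logging step would then call something like `flops_stat.update(measured_flops)`, and the MFU computation would read `flops_stat.mean`.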


if self.cf.train_logging.get("track_performance_metrics"):
# precompute batch statistics for throughput/MFU tracking
sample_batch = next(iter(self.data_loader))
Collaborator

With this we change the training samples and logic. Can't we collect on the fly in train, perhaps sparsely if there are performance concerns? (See also the comment right below.)

else:
assert False, "validate_before_training must be integer or boolean."

def _compute_targets_and_auxs(self, sample_batch) -> dict:
Collaborator

Why has this changed/moved?

)
return targets_and_auxs

def precompute_flops(self, sample_batch) -> None:
Collaborator

I really have concerns that we run training before we start training. E.g., for model development any bug will be triggered here rather than in the actual training loop; this is not feasible.
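One way to address this concern is to defer the measurement to the first real batch inside the training loop instead of a separate pre-run. A sketch under the assumption of a hypothetical `estimate_flops` method (not the PR's API):

```python
def track_first_batch_flops(trainer, batch) -> None:
    """Hypothetical alternative to a pre-training dry run: estimate FLOPs
    lazily from the first *real* training batch, so any model bug still
    surfaces inside the actual training loop rather than before it."""
    if getattr(trainer, "_flops_per_batch", None) is None:
        # Runs exactly once, on the first batch the loop sees.
        trainer._flops_per_batch = trainer.estimate_flops(batch)
```

The training loop would call this at the top of its step; no extra forward pass happens before training starts.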

self._log_terminal(bidx, mini_epoch, TRAIN)
if bidx % self.train_logging.metrics == 0:
self._log(TRAIN)
# Log throughput metrics
Collaborator

Again, too much code here

if self.cf.general.istep == self._THROUGHPUT_WARMUP_STEPS - 1:
self._t0_throughput = time.time()
self._throughput_warmup_done = True
else:
Collaborator

What is happening here exactly?
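For context, the snippet above appears to restart the throughput timer once a fixed number of warmup steps has passed, so that one-off startup costs (data-loader spin-up, kernel compilation) do not skew the measured rate. A hedged reconstruction of that intent, with the constant's value assumed:

```python
import time

THROUGHPUT_WARMUP_STEPS = 10  # assumed value; the PR uses a class constant


def update_throughput_timer(state: dict, istep: int) -> None:
    """On the last warmup step, record t0 and mark warmup as done;
    all subsequent throughput numbers are measured from t0 onward."""
    if istep == THROUGHPUT_WARMUP_STEPS - 1:
        state["t0"] = time.time()
        state["warmup_done"] = True
```

If that reading is right, a brief comment in the PR code explaining why the timer starts late would answer this question for future readers.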


Labels

data, eval, infra, performance

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Performance metrics

2 participants