[2091][performance] Track throughput and utilization metrics (optional) #2124
Conversation
…mance-metric-profiling
```python
self.batch_size_validation_per_gpu = -1
self.batch_size_test_per_gpu = -1
self.collapse_monitor: CollapseMonitor | None = None
self.throughput: Throughput | None = None
```
Please avoid adding many more variables and code to Trainer. It's a critical and complex part of the code that we need to keep as clean as possible.
These variables could go into a small separate class.
```python
collapse_config = cf.train_logging.get("collapse_monitoring", {})
self.collapse_monitor = CollapseMonitor(collapse_config, None)  # device set later in run()

if cf.train_logging.get("track_performance_metrics"):
```
Can this be implemented in the class above?
```python
sample_batch = next(iter(self.data_loader))
sample_batch.to_device(self.device)
self.precompute_flops(sample_batch)
self.precompute_source_bytes(sample_batch)
```
Note that this is not constant but may fluctuate widely between samples.
```python
if self.cf.train_logging.get("track_performance_metrics"):
    # precompute batch statistics for throughput/MFU tracking
    sample_batch = next(iter(self.data_loader))
```
With this we change the training samples and logic. Can't we collect these statistics on the fly during training, perhaps sparsely if there are performance concerns? (See also the comment right below.)
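A sketch of the on-the-fly, sparse collection the reviewer proposes: instead of drawing an extra `sample_batch` before training, record per-batch statistics every N steps inside the normal training loop. All names here (`ThroughputStats`, `maybe_update`, `every_n_steps`) are illustrative, not from the PR.

```python
# Hypothetical accumulator: called from the training loop, so statistics are
# measured on the real batches (which fluctuate) rather than on one sample
# batch drawn before training starts.
class ThroughputStats:
    def __init__(self, every_n_steps: int = 50):
        self.every_n_steps = every_n_steps
        self.flops_per_batch: list[float] = []

    def maybe_update(self, step: int, batch_flops: float) -> None:
        # Sample sparsely so the bookkeeping overhead stays negligible.
        if step % self.every_n_steps == 0:
            self.flops_per_batch.append(batch_flops)

    def mean_flops(self) -> float:
        # Running mean over the sampled batches; robust to per-sample variance.
        return sum(self.flops_per_batch) / max(len(self.flops_per_batch), 1)
```

This also addresses the fluctuation concern above, since the estimate averages over many real batches instead of trusting a single one.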
```python
else:
    assert False, "validate_before_training must be integer or boolean."

def _compute_targets_and_auxs(self, sample_batch) -> dict:
```
Why has this changed/moved?
```python
)
return targets_and_auxs

def precompute_flops(self, sample_batch) -> None:
```
I really have concerns that we run training before we start training. E.g., for model development, any bug will be triggered here rather than in the actual training loop; this is not feasible.
```python
self._log_terminal(bidx, mini_epoch, TRAIN)
if bidx % self.train_logging.metrics == 0:
    self._log(TRAIN)
    # Log throughput metrics
```
Again, too much code here.
```python
if self.cf.general.istep == self._THROUGHPUT_WARMUP_STEPS - 1:
    self._t0_throughput = time.time()
    self._throughput_warmup_done = True
else:
```
What is happening here exactly?
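To the question above: the snippet appears to start the throughput timer only once the first `_THROUGHPUT_WARMUP_STEPS` steps have run, so compilation, caching, and first-batch overhead don't skew the steps/sec estimate. A self-contained sketch of that warm-up-then-time pattern (function name, `WARMUP_STEPS`, and the value 5 are all assumptions for illustration):

```python
import time

WARMUP_STEPS = 5  # assumed value; the PR uses self._THROUGHPUT_WARMUP_STEPS

def run_steps(num_steps: int, step_fn) -> float:
    """Return steps/sec, timing only the steps after a warm-up period."""
    t0 = None
    timed_steps = 0
    for istep in range(num_steps):
        step_fn(istep)
        if istep == WARMUP_STEPS - 1:
            # Warm-up just finished: start the clock now, so the slow first
            # steps are excluded from the throughput measurement.
            t0 = time.perf_counter()
        elif istep >= WARMUP_STEPS:
            timed_steps += 1
    if t0 is None or timed_steps == 0:
        return 0.0
    return timed_steps / max(time.perf_counter() - t0, 1e-9)
```

If this is indeed the intent, a short comment in the PR explaining the warm-up would help.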
Description
TBA
Issue Number
Closes #2091.
Is this PR a draft? If so, mark it as a draft.
Checklist before asking for review
- `./scripts/actions.sh lint`
- `./scripts/actions.sh unit-test`
- `./scripts/actions.sh integration-test`
- `launch-slurm.py --time 60`