Distributed Evaluation for Acceleration #673

Open
wang-jinbo wants to merge 1 commit into karpathy:master from wang-jinbo:patch-1

@wang-jinbo wang-jinbo commented Dec 23, 2025

Adjust loss estimation to use DDP. estimate_loss now runs across all devices instead of on a single one, which saves time.
With torchrun --standalone --nproc_per_node=4 train.py and eval_interval=2, the time of an iteration that includes evaluation drops from about 21 s to about 8 s, i.e. to roughly 38% of the original time (about a 2.6x speedup).
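The idea can be illustrated with a minimal, torch-free sketch. It is not the code from this PR (the diff is not shown here); it only simulates the arithmetic under stated assumptions: each rank evaluates an equal share of the eval batches, and the per-rank mean losses are then averaged, which is what an all-reduce followed by a division by the world size would produce. The names shard_eval_iters, local_mean_loss, and simulated_all_reduce_mean are hypothetical, and the loss values are random stand-ins rather than real model outputs.

```python
import math
import random


def shard_eval_iters(eval_iters, world_size):
    """Eval batches per rank, rounded up so the total is never below eval_iters."""
    return math.ceil(eval_iters / world_size)


def local_mean_loss(rank, iters_per_rank, seed=1337):
    """Stand-in for one rank's eval loop; returns that rank's mean loss.

    A per-rank seed is assumed so that ranks draw different batches.
    The losses here are synthetic, not produced by a model.
    """
    rng = random.Random(seed + rank)
    losses = [10.9 + 0.1 * rng.random() for _ in range(iters_per_rank)]
    return sum(losses) / len(losses)


def simulated_all_reduce_mean(values):
    """Stand-in for an all-reduce: sum per-rank values, divide by world size."""
    return sum(values) / len(values)


# Assumed settings for illustration: 200 eval batches, 4 processes.
eval_iters, world_size = 200, 4
per_rank = shard_eval_iters(eval_iters, world_size)
local = [local_mean_loss(rank, per_rank) for rank in range(world_size)]
global_mean = simulated_all_reduce_mean(local)

print(f"batches per rank: {per_rank}")
print(f"global mean loss: {global_mean:.4f}")
```

Because every rank processes the same number of batches, the average of the per-rank means equals the mean over all batches, so the reported loss stays comparable to the single-device estimate while each rank does only a fraction of the work.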
Old branch

step 0: train loss 10.9886, val loss 10.9897
iter 0: loss 10.9766, time 26086.69ms, mfu -100.00%
iter 1: loss 10.9558, time 813.91ms, mfu -100.00%
step 2: train loss 10.9370, val loss 10.9377
saving checkpoint to out
iter 2: loss 10.9481, time 20176.69ms, mfu -100.00%
iter 3: loss 10.8799, time 923.17ms, mfu -100.00%
step 4: train loss 10.8181, val loss 10.8185
saving checkpoint to out
iter 4: loss 10.8340, time 20753.85ms, mfu -100.00%
iter 5: loss 10.7294, time 923.15ms, mfu 36.47%
step 6: train loss 10.6521, val loss 10.6532
saving checkpoint to out
iter 6: loss 10.6651, time 20960.05ms, mfu 32.98%
iter 7: loss 10.5586, time 923.50ms, mfu 33.33%
step 8: train loss 10.4676, val loss 10.4699
saving checkpoint to out
iter 8: loss 10.4841, time 20972.46ms, mfu 30.16%
iter 9: loss 10.4411, time 924.84ms, mfu 30.78%

New branch

step 0: train loss 10.9894, val loss 10.9906
iter 0: loss 10.9944, time 13836.17ms, mfu -100.00%
iter 1: loss 10.9413, time 813.67ms, mfu -100.00%
step 2: train loss 10.9373, val loss 10.9382
saving checkpoint to out
iter 2: loss 10.9396, time 8738.35ms, mfu -100.00%
iter 3: loss 10.8950, time 928.53ms, mfu -100.00%
step 4: train loss 10.8183, val loss 10.8194
saving checkpoint to out
iter 4: loss 10.8238, time 8496.28ms, mfu -100.00%
iter 5: loss 10.7193, time 924.64ms, mfu 36.41%
step 6: train loss 10.6518, val loss 10.6492
saving checkpoint to out
iter 6: loss 10.6451, time 7954.27ms, mfu 33.19%
iter 7: loss 10.5497, time 923.82ms, mfu 33.52%
step 8: train loss 10.4732, val loss 10.4695
saving checkpoint to out
iter 8: loss 10.4469, time 7863.56ms, mfu 30.59%
iter 9: loss 10.3595, time 923.27ms, mfu 31.18%
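
As a rough cross-check of the reported speedup, the iteration times from the two logs above can be averaged directly. This is a small sketch, assuming that iterations 2, 4, 6, and 8 are the ones that include evaluation and that iteration 0 is excluded because it also contains startup cost. Each of these timings also covers the training step and the checkpoint save, so the result describes the whole iteration rather than evaluation alone.

```python
# Times (ms) of iterations 2, 4, 6, 8, copied from the logs above.
old_ms = [20176.69, 20753.85, 20960.05, 20972.46]
new_ms = [8738.35, 8496.28, 7954.27, 7863.56]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


old_mean = mean(old_ms)
new_mean = mean(new_ms)

fraction = new_mean / old_mean  # new time as a share of the old time
speedup = old_mean / new_mean   # how many times faster the new branch is

print(f"old mean: {old_mean / 1000:.2f} s, new mean: {new_mean / 1000:.2f} s")
print(f"new time is {fraction:.0%} of old, speedup {speedup:.2f}x")
```

On these four samples the new branch takes roughly 40% of the old iteration time, in line with the 21 s to 8 s figure quoted in the description.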

Adjust loss estimation and logging for DDP support.