Commit 0b18cca

Add guidance on step/latency interpretation
Signed-off-by: Tingfeng Lan <[email protected]>
1 parent 0309313 commit 0b18cca

File tree

3 files changed (+14, -7 lines)


training/DeepSpeed-ZenFlow/finetuning/README.md

Lines changed: 12 additions & 5 deletions
````diff
@@ -45,18 +45,25 @@ Below is a sample log showing step time and loss values. You can see significant
 
 ```
 ZenFlowCPUAdam initialized with overlap step.
-Step 5, Loss: 1.2599, Time: 719.58ms
-Step 6, Loss: 0.9847, Time: 702.81ms
+Step 5, Loss: 1.2599, Time: 719.58ms
+Step 6, Loss: 0.9847, Time: 702.81ms <-- gradient accumulation with overlapped update
 Step 7, Loss: 0.6220, Time: 705.50ms
-Step 8, Loss: 0.5173, Time: 1912.92ms
+Step 8, Loss: 0.5173, Time: 1912.92ms <-- full optimizer step for the remaining gradients, with parameter update
 Step 9, Loss: 0.4557, Time: 890.60ms
 Step 10, Loss: 0.3882, Time: 740.11ms
 Step 11, Loss: 0.3627, Time: 731.95ms
 Step 12, Loss: 0.3341, Time: 2221.18ms
 Step 13, Loss: 0.2453, Time: 1061.80ms
 ```
 
-ZenFlow reduces optimizer-induced stalls by overlapping CPU computation and GPU execution.
+## Key Insight
+
+Steps 5, 6, and 7 are accumulation steps, during which ZenFlow overlaps part of the optimizer step in the background. These steps remain fast (~700 ms).
+
+Step 8 performs the remaining part of the optimizer step and copies the updated parameters back to the GPU (2–2.2 s).
+
+Without ZenFlow, a full update would take nearly 4 seconds; ZenFlow distributes half of this cost across the earlier accumulation steps via asynchronous overlap.
+
+This demonstrates how ZenFlow hides much of the CPU offload cost, enabling near stall-free training. Crucially, ZenFlow not only overlaps the CPU optimizer step but also maintains training progress on the GPU by immediately updating the most important gradients.
 
 ## Notes
 
@@ -70,7 +77,7 @@ To cite DeepSpeed Chat, please cite our [arxiv report](https://arxiv.org/abs/250
 ```bib
 @misc{lan2025zenflowenablingstallfreeoffloading,
       title={ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates},
-      author={Tingfeng Lan and Yusen Wu and Bin Ma and Zhaoyuan Su and Rui Yang and Tekin Bicer and Dong Li and Yue Cheng},
+      author={Tingfeng Lan and Yusen Wu and Bin Ma and Zhaoyuan Su and Rui Yang and Tekin Bicer and Masahiro Tanaka and Olatunji Ruwase and Dong Li and Yue Cheng},
       year={2025},
       eprint={2505.12242},
       archivePrefix={arXiv},
````
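The overlap pattern that the new "Key Insight" text describes can be illustrated with a minimal, hypothetical sketch using plain Python threading (this is NOT ZenFlow's actual implementation, just the general idea of hiding CPU optimizer work behind accumulation steps):

```python
# Illustrative sketch only: hide background CPU optimizer work behind
# gradient-accumulation steps, so only the final step pays a visible stall.
import threading
import time

def cpu_partial_optimizer_step():
    # Stand-in for an offloaded CPU update on a subset of gradients.
    time.sleep(0.05)

def gpu_accumulation_step():
    # Stand-in for a forward/backward accumulation pass on the GPU.
    time.sleep(0.05)

def train_cycle(accum_steps=3):
    """Run accumulation steps while a CPU update proceeds in the background."""
    worker = threading.Thread(target=cpu_partial_optimizer_step)
    worker.start()                 # CPU update begins in the background
    for _ in range(accum_steps):
        gpu_accumulation_step()    # GPU keeps making progress meanwhile
    worker.join()                  # the full step then finishes the remainder
    return "updated"

print(train_cycle())
```

Because the background thread runs concurrently with the accumulation loop, its cost is absorbed into steps that would happen anyway, mirroring the fast ~700 ms steps in the sample log.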

training/DeepSpeed-ZenFlow/finetuning/finetune_llama.sh

Lines changed: 1 addition & 1 deletion
````diff
@@ -17,7 +17,7 @@ DS_CONFIG_JSON="./zf_config.json"
 # Note: LR, batch_size, weight_decay are defined in the config file
 # These parameters are kept for fallback only
 LR=2e-5
-BATCH_SIZE=32
+BATCH_SIZE=8
 WARMUP=0.03
 WEIGHT_DECAY=0.01
````
2323

training/DeepSpeed-ZenFlow/finetuning/zf_config.json

Lines changed: 1 addition & 1 deletion
````diff
@@ -1,5 +1,5 @@
 {
-  "train_batch_size": 32,
+  "train_batch_size": 8,
   "bf16": { "enabled": true },
   "zero_optimization": {
     "stage": 2,
````
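For context on the `train_batch_size` change: in DeepSpeed, the global train batch size is the product of the per-GPU micro-batch size, the gradient accumulation steps, and the data-parallel world size. A small sketch of that identity (the factor values below are illustrative assumptions, not taken from this config):

```python
# DeepSpeed batch-size identity (values below are hypothetical examples):
# train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
def global_batch_size(micro_batch_per_gpu: int,
                      grad_accum_steps: int,
                      world_size: int) -> int:
    return micro_batch_per_gpu * grad_accum_steps * world_size

# One possible decomposition of the new train_batch_size of 8 on a single GPU:
print(global_batch_size(2, 4, 1))  # 8
```

Accumulation steps are exactly the fast "overlapped" steps in the sample log above, so the decomposition chosen in the config determines how many steps ZenFlow can hide CPU work behind.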
