Commit 0b18cca

Add guidance on step/latency interpretation
Signed-off-by: Tingfeng Lan <[email protected]>
1 parent 0309313 commit 0b18cca

File tree

3 files changed (+14, -7 lines)


training/DeepSpeed-ZenFlow/finetuning/README.md

Lines changed: 12 additions & 5 deletions
````diff
@@ -45,18 +45,25 @@ Below is a sample log showing step time and loss values. You can see significant
 
 ```
 ZenFlowCPUAdam initialized with overlap step.
-Step 5, Loss: 1.2599, Time: 719.58ms
-Step 6, Loss: 0.9847, Time: 702.81ms
+Step 5, Loss: 1.2599, Time: 719.58ms
+Step 6, Loss: 0.9847, Time: 702.81ms <-- gradient accumulation with overlapped update
 Step 7, Loss: 0.6220, Time: 705.50ms
-Step 8, Loss: 0.5173, Time: 1912.92ms
+Step 8, Loss: 0.5173, Time: 1912.92ms <-- full optimizer step for the remaining gradients, with parameter update
 Step 9, Loss: 0.4557, Time: 890.60ms
 Step 10, Loss: 0.3882, Time: 740.11ms
 Step 11, Loss: 0.3627, Time: 731.95ms
 Step 12, Loss: 0.3341, Time: 2221.18ms
 Step 13, Loss: 0.2453, Time: 1061.80ms
 ```
 
-ZenFlow reduces optimizer-induced stalls by overlapping CPU computation and GPU execution.
+## Key Insight
+
+Steps 5, 6, and 7 are accumulation steps, during which ZenFlow overlaps part of the optimizer step in the background. These steps remain fast (~700 ms).
+
+Step 8 performs the remaining part of the optimizer step and copies the updated parameters back to the GPU (2–2.2 s).
+
+Without ZenFlow, a full update would take nearly 4 seconds; ZenFlow distributes half of this cost across the earlier accumulation steps via asynchronous overlap.
+
+This demonstrates how ZenFlow hides much of the CPU offload cost, enabling near stall-free training. Crucially, ZenFlow not only overlaps the CPU optimizer step but also maintains training progress on the GPU by immediately updating the most important gradients.
 
 ## Notes
 
@@ -70,7 +77,7 @@ To cite DeepSpeed Chat, please cite our [arxiv report](https://arxiv.org/abs/250
 ```bib
 @misc{lan2025zenflowenablingstallfreeoffloading,
       title={ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates},
-      author={Tingfeng Lan and Yusen Wu and Bin Ma and Zhaoyuan Su and Rui Yang and Tekin Bicer and Dong Li and Yue Cheng},
+      author={Tingfeng Lan and Yusen Wu and Bin Ma and Zhaoyuan Su and Rui Yang and Tekin Bicer and Masahiro Tanaka and Olatunji Ruwase and Dong Li and Yue Cheng},
       year={2025},
       eprint={2505.12242},
       archivePrefix={arXiv},
````
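The overlap pattern that the new "Key Insight" text describes can be illustrated with a minimal, hypothetical sketch using plain Python threading (this is NOT ZenFlow's actual implementation, just the general idea of hiding CPU optimizer work behind accumulation steps):

```python
# Illustrative sketch only: hide background CPU optimizer work behind
# gradient-accumulation steps, so only the final step pays a visible stall.
import threading
import time

def cpu_partial_optimizer_step():
    # Stand-in for an offloaded CPU update on a subset of gradients.
    time.sleep(0.05)

def gpu_accumulation_step():
    # Stand-in for a forward/backward accumulation pass on the GPU.
    time.sleep(0.05)

def train_cycle(accum_steps=3):
    """Run accumulation steps while a CPU update proceeds in the background."""
    worker = threading.Thread(target=cpu_partial_optimizer_step)
    worker.start()                 # CPU update begins in the background
    for _ in range(accum_steps):
        gpu_accumulation_step()    # GPU keeps making progress meanwhile
    worker.join()                  # the full step then finishes the remainder
    return "updated"

print(train_cycle())
```

Because the background thread runs concurrently with the accumulation loop, its cost is absorbed into steps that would happen anyway, mirroring the fast ~700 ms steps in the sample log.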

training/DeepSpeed-ZenFlow/finetuning/finetune_llama.sh

Lines changed: 1 addition & 1 deletion
````diff
@@ -17,7 +17,7 @@ DS_CONFIG_JSON="./zf_config.json"
 # Note: LR, batch_size, weight_decay are defined in the config file
 # These parameters are kept for fallback only
 LR=2e-5
-BATCH_SIZE=32
+BATCH_SIZE=8
 WARMUP=0.03
 WEIGHT_DECAY=0.01
````
2323

training/DeepSpeed-ZenFlow/finetuning/zf_config.json

Lines changed: 1 addition & 1 deletion
````diff
@@ -1,5 +1,5 @@
 {
-  "train_batch_size": 32,
+  "train_batch_size": 8,
   "bf16": { "enabled": true },
   "zero_optimization": {
     "stage": 2,
````
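For context on the `train_batch_size` change: in DeepSpeed, the global train batch size is the product of the per-GPU micro-batch size, the gradient accumulation steps, and the data-parallel world size. A small sketch of that identity (the factor values below are illustrative assumptions, not taken from this config):

```python
# DeepSpeed batch-size identity (values below are hypothetical examples):
# train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
def global_batch_size(micro_batch_per_gpu: int,
                      grad_accum_steps: int,
                      world_size: int) -> int:
    return micro_batch_per_gpu * grad_accum_steps * world_size

# One possible decomposition of the new train_batch_size of 8 on a single GPU:
print(global_batch_size(2, 4, 1))  # 8
```

Accumulation steps are exactly the fast "overlapped" steps in the sample log above, so the decomposition chosen in the config determines how many steps ZenFlow can hide CPU work behind.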
