Step 8, Loss: 0.5173, Time: 1912.92ms <-- completes the remaining optimizer work and updates parameters
Step 9, Loss: 0.4557, Time: 890.60ms
Step 10, Loss: 0.3882, Time: 740.11ms
Step 11, Loss: 0.3627, Time: 731.95ms
Step 12, Loss: 0.3341, Time: 2221.18ms
Step 13, Loss: 0.2453, Time: 1061.80ms
```
ZenFlow reduces optimizer-induced stalls by overlapping CPU computation and GPU execution.
## Key Insight
Steps like 5, 6, and 7 are gradient-accumulation steps, during which ZenFlow overlaps part of the optimizer step in the background. These steps remain fast (~700 ms).
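The overlap pattern can be sketched with plain Python threads. This is a minimal illustration of the idea, not ZenFlow's actual implementation; the function names and timings are hypothetical stand-ins for real CPU optimizer work and GPU accumulation steps.

```python
import threading
import time

def cpu_optimizer_shard(state):
    # Stand-in for updating one shard of parameters on the CPU (slow).
    time.sleep(0.05)
    state["updated_shards"] += 1

def gpu_accumulation_step():
    # Stand-in for a fast forward/backward accumulation step on the GPU.
    time.sleep(0.01)

def run_accumulation_with_overlap(num_accum_steps):
    state = {"updated_shards": 0}
    workers = []
    for _ in range(num_accum_steps):
        # Kick off the CPU optimizer work in the background, then keep
        # the "GPU" busy with the next accumulation step.
        t = threading.Thread(target=cpu_optimizer_shard, args=(state,))
        t.start()
        workers.append(t)
        gpu_accumulation_step()
    # Only CPU work that has not finished by the boundary step stalls training.
    for t in workers:
        t.join()
    return state["updated_shards"]
```

Because each CPU shard update runs concurrently with an accumulation step, most of its cost is hidden; only the leftover work surfaces at the boundary step.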
Step 8 performs the remaining part of the optimizer step and transfers the updated parameters back to the GPU (2–2.2 s).
Without ZenFlow, a full update would take nearly 4 seconds; ZenFlow distributes roughly half of this cost across the earlier accumulation steps via asynchronous overlap.
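A quick back-of-envelope check ties this to the log above. The numbers here are the approximate figures quoted in the text, not exact measurements:

```python
# Approximate amortization of the optimizer step (values from the text above).
full_update_ms = 4000        # estimated full CPU optimizer step without overlap
overlapped_fraction = 0.5    # roughly half the work hidden in accumulation steps

# Cost remaining at the boundary step (step 8 in the log).
boundary_step_ms = full_update_ms * (1 - overlapped_fraction)
print(boundary_step_ms)  # prints 2000.0, matching the ~2-2.2 s seen at step 8
```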
This demonstrates how ZenFlow hides much of the CPU offload cost, enabling near stall-free training. Crucially, ZenFlow not only overlaps the CPU optimizer step but also maintains training progress on the GPU by immediately updating the most important gradients.
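The "most important gradients" idea can be illustrated with a small NumPy sketch. This is an assumption-laden simplification, not ZenFlow's real selection logic: it picks the top-k gradients by magnitude and applies a plain SGD update to just those coordinates, leaving the rest for the asynchronous CPU optimizer.

```python
import numpy as np

def split_by_importance(grad, k):
    # Boolean mask marking the k largest-magnitude gradient entries.
    topk_idx = np.argpartition(np.abs(grad), -k)[-k:]
    mask = np.zeros(grad.shape, dtype=bool)
    mask[topk_idx] = True
    return mask

def immediate_update(params, grad, lr, mask):
    # Apply plain SGD only to the "important" coordinates right away;
    # the remaining coordinates wait for the offloaded optimizer step.
    params = params.copy()
    params[mask] -= lr * grad[mask]
    return params
```

For example, with `grad = [0.1, -5.0, 0.2, 3.0]` and `k = 2`, the mask selects indices 1 and 3, so only those parameters advance immediately while the small gradients are deferred.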
title={ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates},
author={Tingfeng Lan and Yusen Wu and Bin Ma and Zhaoyuan Su and Rui Yang and Tekin Bicer and Masahiro Tanaka and Olatunji Ruwase and Dong Li and Yue Cheng},