Commit 9a1020c — update estimation

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
1 parent: fe10b7e

1 file changed: training/bf16_master_weight/README.md (7 additions, 5 deletions)
````diff
@@ -10,8 +10,7 @@ This example demonstrates DeepSpeed's [new low-precision training options](https
 
 The following commands run training for 1000 steps on the Wikitext-103 dataset using both the baseline and BF16 low-precision configurations, then generate a loss comparison plot.
 The model has approximately 6.86 billion parameters (hidden=4096, layers=32, heads=32, batch=1, seq=512).
-For BF16 low-precision training, we use `torch.autocast`.
-
+For BF16 low-precision training, we use `torch.autocast`. The ZeRO stage is set to 3 for both configurations.
 
 ```bash
 # Run 1000 steps with wikitext dataset
@@ -59,13 +58,16 @@ For a model with N parameters:
 | Component | Baseline | BF16 Low-Precision |
 |-----------|----------|-------------------|
 | Model params | 2N bytes (BF16) | 2N bytes (BF16) |
+| Gradients | 2N bytes (BF16) | 2N bytes (BF16) |
 | Master weights | 4N bytes (FP32) | 2N bytes (BF16) |
-| Gradients | 4N bytes (FP32) | 2N bytes (BF16) |
+| Master gradients | 4N bytes (FP32) | 2N bytes (BF16) |
 | Adam momentum | 4N bytes (FP32) | 2N bytes (BF16) |
 | Adam variance | 4N bytes (FP32) | 2N bytes (BF16) |
-| **Total** | **18N bytes** | **10N bytes** |
+| **Total** | **20N bytes** | **12N bytes** |
+
+Note that DeepSpeed ZeRO partitions model states across multiple GPUs. ZeRO Stage 1 partitions the optimizer states (master weights, master gradients, and Adam's momentum and variance). ZeRO Stage 2 additionally partitions gradients. With ZeRO Stage 3, all of these model states, including the model parameters, are partitioned.
 
-This gives a theoretical ~44% reduction in optimizer-related memory. The actual savings depend on activation memory and other factors, but our results show a very close match to the theoretical savings.
+With ZeRO-3, the BF16 low-precision configuration provides a theoretical ~40% reduction in model-state memory. Actual savings depend on activation memory and other factors, but our results show a close match to the theoretical estimate.
 
 ## Related Resources
 
````
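The per-parameter byte counts in the updated table can be sanity-checked with a short script. This is just arithmetic over the table's rows (the component labels below are descriptive, not DeepSpeed API names), confirming the 20N-vs-12N totals and the ~40% figure:

```python
# Back-of-envelope check of the README's memory table: bytes per parameter
# for each model-state component, as (baseline, bf16_low_precision).
BYTES_PER_PARAM = {
    "params":         (2, 2),  # BF16 in both configs
    "grads":          (2, 2),  # BF16 in both configs
    "master_weights": (4, 2),  # FP32 -> BF16
    "master_grads":   (4, 2),  # FP32 -> BF16
    "adam_momentum":  (4, 2),  # FP32 -> BF16
    "adam_variance":  (4, 2),  # FP32 -> BF16
}

def per_param_bytes():
    """Sum per-parameter bytes for the baseline and BF16 configurations."""
    baseline = sum(b for b, _ in BYTES_PER_PARAM.values())
    low_prec = sum(l for _, l in BYTES_PER_PARAM.values())
    return baseline, low_prec

baseline, low = per_param_bytes()   # 20 and 12 bytes per parameter
reduction = 1 - low / baseline      # the ~40% quoted above

n = 6.86e9  # the ~6.86B-parameter example model
print(f"baseline: {baseline * n / 2**30:.1f} GiB, "
      f"bf16: {low * n / 2**30:.1f} GiB, reduction: {reduction:.0%}")
```

(Per-GPU usage is lower under ZeRO partitioning; this estimates the aggregate model-state footprint.)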
