perf: Fix OOM for llama3_70b SFT H100 FP8 CS by increasing VPP to 10#3107
Walkthrough: A single configuration parameter has been added to the LLaMA 70B SFT preset for H100 FP8 compute-server training.
The llama3_70b_32gpu_h100_fp8_cs_50steps_perf_finetune_sft test is flaky, OOMing in ~30-60% of runs on H100 80GB. After bisecting ~50 CI pipelines, we confirmed this is a pre-existing borderline memory issue, not a commit regression. Adding recompute_num_layers=1 trades minimal compute overhead for memory headroom to stabilize the test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
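The compute-for-memory trade mentioned above can be sketched with toy arithmetic (illustrative units only, not Megatron's real memory accounting): recomputing activations for a layer means its activations are not stored between forward and backward, at the cost of re-running that layer's forward pass during backward.

```python
# Toy model of activation recompute: stored activation memory drops by one
# layer's worth, and the backward phase pays one extra forward pass.
# Unit costs of 1.0 per layer are placeholder assumptions.
def memory_and_compute(num_layers, recompute_num_layers,
                       act_mem_per_layer=1.0, fwd_cost_per_layer=1.0):
    stored = num_layers - recompute_num_layers   # activations kept resident
    extra_fwd = recompute_num_layers             # layers re-run in backward
    return stored * act_mem_per_layer, extra_fwd * fwd_cost_per_layer

baseline_mem, _ = memory_and_compute(4, 0)                 # no recompute
tuned_mem, extra_fwd_cost = memory_and_compute(4, 1)       # recompute 1 layer
```

With 4 layers per stage, recomputing one layer frees a quarter of the stored activation memory for one extra layer-forward of compute, which matches the "minimal compute overhead for memory headroom" framing.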
… compatibility

recompute_num_layers=1 sets recompute_granularity="full", which MCore rejects when TE-scoped CUDA graphs are enabled on mlp. Switch to recompute_modules=["core_attn"] (selective recompute), which is compatible with CUDA graphs and still provides enough memory headroom to avoid OOM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
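The constraint described in this commit can be paraphrased in a small sketch (this is not MCore's actual validation code; the function and dict shapes are illustrative): full-layer recompute conflicts with TE-scoped CUDA graphs on the MLP, so fall back to selective recompute of core_attn.

```python
# Illustrative compatibility check mirroring the constraint described above.
def validate_recompute(recompute_granularity, cuda_graph_scope):
    """Raise if full-layer recompute meets TE CUDA graphs scoped to mlp."""
    if recompute_granularity == "full" and "mlp" in cuda_graph_scope:
        raise ValueError(
            "recompute_granularity='full' is incompatible with "
            "TE-scoped CUDA graphs on mlp"
        )

def pick_recompute_config(cuda_graph_scope):
    # Prefer full recompute for maximum memory savings; fall back to
    # selective recompute of core_attn when CUDA graphs forbid it.
    try:
        validate_recompute("full", cuda_graph_scope)
        return {"recompute_granularity": "full", "recompute_num_layers": 1}
    except ValueError:
        return {"recompute_granularity": "selective",
                "recompute_modules": ["core_attn"]}

chosen = pick_recompute_config(cuda_graph_scope=["mlp"])
```

With CUDA graphs on mlp, the sketch selects the selective core_attn config, matching the fix in this PR.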
Force-pushed 7b6203f to 2232e89.
/ok to test 2232e89
CI Validation: ✅ Passed with VPP=10 (TP=4, PP=4, DP=2).
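The layout reported by CI checks out arithmetically. One assumption beyond this PR: Llama3 70B has 80 transformer layers (a published model fact, not stated here).

```python
# Sanity arithmetic for the validated parallelism layout.
tp, pp, dp, vpp = 4, 4, 2, 10
num_layers = 80  # Llama3 70B depth -- assumption from the public model card

world_size = tp * pp * dp                     # GPUs consumed by the layout
layers_per_chunk = num_layers // (pp * vpp)   # layers per virtual stage
```

The 32-GPU world size matches the test name (llama3_70b_32gpu_...), and raising VPP from 5 to 10 halves the layers per virtual-pipeline chunk from 4 to 2, shrinking the activations each stage holds in flight.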
Stability Validation: 4/4 passes with VPP=10

All 4 runs passed perf + memory + convergence validation; 0 OOMs.
Avg GPU utilization: ~697.9 TFLOP/s/GPU (-1.7% vs. golden 709.9), well within the 5% regression threshold.
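The quoted regression figure is straightforward to reproduce from the two throughput numbers in the comment:

```python
# Recompute the reported perf regression vs. the golden baseline.
golden, measured = 709.9, 697.9        # TFLOP/s/GPU
regression_pct = (golden - measured) / golden * 100
within_threshold = regression_pct < 5.0  # the stated 5% regression limit
```

This gives ~1.69%, consistent with the rounded -1.7% in the comment and comfortably under the threshold.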
🤖 Generated with Claude Code
yaoyu-33 added a commit that referenced this pull request on Apr 3, 2026:
…M experiment data

Add two new perf technique skills from the Llama3 70B SFT OOM fix experiment (PR #3107): activation-recompute (per-module cost/savings) and memory-tuning (VPP tuning, parallelism resizing, CPU offloading constraints). Fix docs hyperlinks to skills (../skills → ../../skills from docs/training/). Add the CUDA graph + layer-level recompute interaction to the cuda-graphs skill.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Summary
Fix OOM in llama3_70b_32gpu_h100_fp8_cs_50steps_perf_finetune_sft by increasing virtual pipeline model parallel size from 5 to 10.

What was tried

- recompute_modules: [core_attn] (baseline)
- recompute_modules: [mlp]
- recompute_modules: [mlp, core_attn]
- recompute_modules: [core_attn, layernorm]
- activation_offload_layers: 2/4/6

Validation

7b6203f2e7b77f62ec393ca2e764d738dbb50dde

Test plan
🤖 Generated with Claude Code