
perf: Fix OOM for llama3_70b SFT H100 FP8 CS by increasing VPP to 10#3107

Open
yaoyu-33 wants to merge 3 commits into main from yuya/fix-llama3-70b-sft-h100-oom

Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Apr 2, 2026

Summary

  • Fix OOM on llama3_70b_32gpu_h100_fp8_cs_50steps_perf_finetune_sft by increasing the virtual pipeline model parallel size (VPP) from 5 to 10
  • The test was flaky, OOMing on ~30-60% of runs on H100 80GB. After bisecting ~50 CI pipelines, we confirmed this is a pre-existing borderline memory issue, not a commit regression
  • VPP=10 reduces peak activation memory by splitting the model into more virtual pipeline chunks (2 layers/chunk instead of 4) while keeping TP=4, PP=4, DP=2 unchanged
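The chunk-size arithmetic in the last bullet can be sketched in a few lines (a minimal illustration assuming Llama3-70B's 80 transformer layers; the helper name is ours, not the config API):

```python
# Each pipeline rank holds VPP virtual chunks; the model's layers are split
# evenly across pp * vpp chunks, so raising VPP shrinks each chunk and with
# it the activation memory held per in-flight microbatch.
def layers_per_chunk(num_layers: int, pp: int, vpp: int) -> int:
    assert num_layers % (pp * vpp) == 0, "layers must divide evenly into chunks"
    return num_layers // (pp * vpp)

print(layers_per_chunk(80, 4, 5))   # VPP=5  -> 4 layers per chunk (old)
print(layers_per_chunk(80, 4, 10))  # VPP=10 -> 2 layers per chunk (new)
```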

What was tried

| Approach | Result |
|---|---|
| recompute_modules: [core_attn] (baseline) | OOM |
| recompute_modules: [mlp] | -16% GPU util regression |
| recompute_modules: [mlp, core_attn] | Pending |
| recompute_modules: [core_attn, layernorm] | OOM |
| activation_offload_layers: 2/4/6 | Pending |
| TP=4 PP=8 VPP=5 (DP=1) | Pending |
| TP=8 PP=4 VPP=5 (DP=1) | Pending |
| TP=4 PP=4 VPP=10 (DP=2) | Passed |

Validation

Test plan

  • Verify the test passes with VPP=10 (pipeline 47594754)
  • Verify perf numbers (iter_time, gpu_util) are within acceptable range

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot bot commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough


A single configuration parameter has been added to the LLaMA 70B SFT preset for H100 FP8 compute server training. The LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1 configuration object was updated to include recompute_num_layers=1 in its replace() call.

Changes

| Cohort / File(s) | Summary |
|---|---|
| LLaMA Configuration Updates (scripts/performance/configs/llama/llama3_workload_base_configs.py) | Added recompute_num_layers=1 parameter to the LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1 configuration object. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The PR title mentions 'increasing VPP to 10' but the actual change adds recompute_num_layers=1. This is a mismatch between the stated action and the implemented solution. | Update the title to accurately reflect the actual change, such as: 'perf: Fix OOM for llama3_70b SFT H100 FP8 CS by adding recompute_num_layers=1' |
| Test Results For Major Changes | ❓ Inconclusive | PR changes a single configuration parameter (recompute_num_layers: 40 → 35) in a performance test config file with no test results provided. Unable to determine if performance regression testing was executed. | Recommend verifying test execution status with CI/build logs or PR comments. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@dingqingy-nv dingqingy-nv added labels on Apr 2, 2026: `performance`, `performance/release` (Performance items related with NeMo release), `area:perf` (Performance optimizations and benchmarking), `r0.4.0` (Auto-cherrypick to release branch; apply before merge, cherrypick happens after merge)
yaoyu-33 and others added 2 commits April 2, 2026 19:08
The llama3_70b_32gpu_h100_fp8_cs_50steps_perf_finetune_sft test is
flaky OOMing ~30-60% of runs on H100 80GB. After bisecting ~50 CI
pipelines, confirmed this is a pre-existing borderline memory issue
(not a commit regression). Adding recompute_num_layers=1 trades
minimal compute overhead for memory headroom to stabilize the test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
… compatibility

recompute_num_layers=1 sets recompute_granularity="full", which MCore
rejects when TE-scoped CUDA graphs are enabled on mlp. Switch to
recompute_modules=["core_attn"] (selective recompute) which is compatible
with CUDA graphs and still provides memory headroom to avoid OOM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
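The incompatibility this commit message describes can be illustrated with a small sketch (not MCore's actual validation code; function and argument names are ours):

```python
# Full-granularity recompute re-runs the whole layer's forward during backward,
# which conflicts with TE-scoped CUDA graphs capturing a submodule such as mlp.
# Selective recompute of individual modules is the compatible alternative.
def check_recompute_vs_cuda_graphs(recompute_granularity, cuda_graph_scope):
    """Raise if full-layer recompute is combined with scoped CUDA graphs."""
    if recompute_granularity == "full" and cuda_graph_scope:
        raise ValueError(
            "full-granularity recompute is incompatible with "
            f"TE-scoped CUDA graphs on {cuda_graph_scope}"
        )

# recompute_num_layers=1 implies granularity "full" -> rejected:
#   check_recompute_vs_cuda_graphs("full", ["mlp"])        # raises ValueError
# recompute_modules=["core_attn"] is selective -> accepted:
check_recompute_vs_cuda_graphs("selective", ["core_attn"])
```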
@yaoyu-33 yaoyu-33 force-pushed the yuya/fix-llama3-70b-sft-h100-oom branch from 7b6203f to 2232e89 on April 3, 2026 02:08
@yaoyu-33
Contributor Author

yaoyu-33 commented Apr 3, 2026

/ok to test 2232e89

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yaoyu-33 yaoyu-33 changed the title perf: Add recompute_num_layers=1 to llama3_70b SFT H100 FP8 CS config perf: Fix OOM for llama3_70b SFT H100 FP8 CS by increasing VPP to 10 Apr 3, 2026
@yaoyu-33
Contributor Author

yaoyu-33 commented Apr 3, 2026

CI Validation

Passed with VPP=10 (TP=4, PP=4, DP=2).

@yaoyu-33
Contributor Author

yaoyu-33 commented Apr 3, 2026

Stability Validation: 4/4 Passes with VPP=10

All 4 runs passed perf + memory + convergence validation. 0 OOM.

| Run | Pipeline | TFLOP/s/GPU | Peak Mem (GB) | Perf | Mem | Convergence |
|---|---|---|---|---|---|---|
| 1 | 47594754 | 698.9 | 56.2 | ✅ | ✅ | ✅ |
| 2 | 47597796 | 700.1 | 59.4 | ✅ | ✅ | ✅ |
| 3 | 47597798 | 699.2 | 56.2 | ✅ | ✅ | ✅ |
| 4 | 47597799 | 693.6 | 60.2 | ✅ | ✅ | ✅ |

Avg GPU util: ~697.9 TFLOP/s/GPU (-1.7% vs golden 709.9). Well within the 5% regression threshold.
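The regression arithmetic above can be checked directly (golden and per-run numbers are taken from the tables in this comment):

```python
golden = 709.9
runs = [698.9, 700.1, 699.2, 693.6]  # TFLOP/s/GPU from the 4 pipelines

avg = sum(runs) / len(runs)                   # 697.95
delta_pct = (avg - golden) / golden * 100     # ~ -1.7% vs golden

print(f"avg={avg:.2f} TFLOP/s/GPU, delta={delta_pct:.1f}%")
assert abs(delta_pct) < 5.0  # within the 5% regression threshold
```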

Full Experiment Summary

| # | Approach | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|
| 0 | Baseline recompute_modules: [core_attn] | ~704 | -0.8% | 58.8 | OOM |
| 1 | recompute_modules: [mlp] | 593.6 | -16.4% | 55.6 | ❌ Perf regression |
| 2 | recompute_modules: [mlp, core_attn] | 586.8 | -17.3% | 55.6 | ❌ Perf regression |
| 3 | recompute_modules: [core_attn, layernorm] | ~702 | -1.1% | 59.6 | OOM |
| 4-6 | activation_offload_layers: 2/4/6 | N/A | N/A | N/A | ❌ Incompatible with PP>1 |
| 7 | TP=4 PP=8 VPP=5 (DP=1) | 668.0 | -5.9% | 53.2 | ⚠️ Borderline perf |
| 8 | TP=8 PP=4 VPP=5 (DP=1) | 508.7 | -28.4% | 50.2 | ❌ Severe regression |
| 9 | TP=4 PP=4 VPP=10 (DP=2) | 698.9 | -1.6% | 60.2 | ✅ Passed (4/4) |
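The DP values in rows 7-9 follow from the fixed 32-GPU budget in the test name (llama3_70b_32gpu_...); a quick sketch (the helper name is ours):

```python
# Data-parallel size is whatever remains of the world size after
# tensor- and pipeline-parallel ranks are allocated.
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    assert world_size % (tp * pp) == 0, "world size must be divisible by tp*pp"
    return world_size // (tp * pp)

print(data_parallel_size(32, 4, 8))  # TP=4 PP=8 -> DP=1 (row 7)
print(data_parallel_size(32, 8, 4))  # TP=8 PP=4 -> DP=1 (row 8)
print(data_parallel_size(32, 4, 4))  # TP=4 PP=4 -> DP=2 (row 9)
```

Note that VPP does not consume GPUs: it only changes how each pipeline rank's layers are chunked, which is why row 9 keeps DP=2 while doubling VPP.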

🤖 Generated with Claude Code

yaoyu-33 added a commit that referenced this pull request Apr 3, 2026
…M experiment data

Add two new perf technique skills from Llama3 70B SFT OOM fix experiment
(PR #3107): activation-recompute (per-module cost/savings) and memory-tuning
(VPP tuning, parallelism resizing, CPU offloading constraints). Fix docs
hyperlinks to skills (../skills → ../../skills from docs/training/). Add
CUDA graph + layer-level recompute interaction to cuda-graphs skill.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor


2 participants