
perf: Fix OOM for llama3_70b SFT H100 FP8 CS by increasing VPP to 10#3107

Open
yaoyu-33 wants to merge 3 commits into main from yuya/fix-llama3-70b-sft-h100-oom

Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Apr 2, 2026

Summary

  • Fix OOM on llama3_70b_32gpu_h100_fp8_cs_50steps_perf_finetune_sft by increasing the virtual pipeline model parallel size (VPP) from 5 to 10
  • The test was flaky, OOMing on ~30-60% of runs on H100 80GB. After bisecting ~50 CI pipelines, we confirmed this is a pre-existing borderline memory issue, not a commit regression
  • VPP=10 reduces peak activation memory by splitting the model into more virtual pipeline chunks (2 layers/chunk instead of 4) while keeping TP=4, PP=4, DP=2 unchanged
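The chunk-size arithmetic in the last bullet can be sketched in a few lines (a minimal illustration assuming Llama3-70B's 80 transformer layers; the helper name is ours, not the config API):

```python
# Each pipeline rank holds VPP virtual chunks; the model's layers are split
# evenly across pp * vpp chunks, so raising VPP shrinks each chunk and with
# it the activation memory held per in-flight microbatch.
def layers_per_chunk(num_layers: int, pp: int, vpp: int) -> int:
    assert num_layers % (pp * vpp) == 0, "layers must divide evenly into chunks"
    return num_layers // (pp * vpp)

print(layers_per_chunk(80, 4, 5))   # VPP=5  -> 4 layers per chunk (old)
print(layers_per_chunk(80, 4, 10))  # VPP=10 -> 2 layers per chunk (new)
```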

What was tried

| Approach | Result |
|---|---|
| recompute_modules: [core_attn] (baseline) | OOM |
| recompute_modules: [mlp] | -16% GPU util regression |
| recompute_modules: [mlp, core_attn] | Pending |
| recompute_modules: [core_attn, layernorm] | OOM |
| activation_offload_layers: 2/4/6 | Pending |
| TP=4 PP=8 VPP=5 (DP=1) | Pending |
| TP=8 PP=4 VPP=5 (DP=1) | Pending |
| TP=4 PP=4 VPP=10 (DP=2) | Passed |

Validation

Test plan

  • Verify the test passes with VPP=10 (pipeline 47594754)
  • Verify perf numbers (iter_time, gpu_util) are within acceptable range

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot bot commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough


A single configuration parameter has been added to the LLaMA 70B SFT preset for H100 FP8 compute server training. The LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1 configuration object was updated to include recompute_num_layers=1 in its replace() call.

Changes

| Cohort / File(s) | Summary |
|---|---|
| LLaMA Configuration Updates (scripts/performance/configs/llama/llama3_workload_base_configs.py) | Added recompute_num_layers=1 parameter to the LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1 configuration object. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The PR title mentions 'increasing VPP to 10' but the actual change adds recompute_num_layers=1. This is a mismatch between the stated action and the implemented solution. | Update the title to accurately reflect the actual change, such as: 'perf: Fix OOM for llama3_70b SFT H100 FP8 CS by adding recompute_num_layers=1' |
| Test Results For Major Changes | ❓ Inconclusive | PR changes a single configuration parameter (recompute_num_layers: 40 → 35) in a performance test config file with no test results provided. Unable to determine if performance regression testing was executed. | Recommend verifying test execution status with CI/build logs or PR comments. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@dingqingy-nv dingqingy-nv added labels on Apr 2, 2026: `performance`, `performance/release` (Performance items related with NeMo release), `area:perf` (Performance optimizations and benchmarking), `r0.4.0` (Auto-cherrypick to release branch; apply before merge, cherrypick happens after merge)
yaoyu-33 and others added 2 commits April 2, 2026 19:08
The llama3_70b_32gpu_h100_fp8_cs_50steps_perf_finetune_sft test is
flaky OOMing ~30-60% of runs on H100 80GB. After bisecting ~50 CI
pipelines, confirmed this is a pre-existing borderline memory issue
(not a commit regression). Adding recompute_num_layers=1 trades
minimal compute overhead for memory headroom to stabilize the test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
… compatibility

recompute_num_layers=1 sets recompute_granularity="full", which MCore
rejects when TE-scoped CUDA graphs are enabled on mlp. Switch to
recompute_modules=["core_attn"] (selective recompute) which is compatible
with CUDA graphs and still provides memory headroom to avoid OOM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
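The incompatibility this commit message describes can be illustrated with a small sketch (not MCore's actual validation code; function and argument names are ours):

```python
# Full-granularity recompute re-runs the whole layer's forward during backward,
# which conflicts with TE-scoped CUDA graphs capturing a submodule such as mlp.
# Selective recompute of individual modules is the compatible alternative.
def check_recompute_vs_cuda_graphs(recompute_granularity, cuda_graph_scope):
    """Raise if full-layer recompute is combined with scoped CUDA graphs."""
    if recompute_granularity == "full" and cuda_graph_scope:
        raise ValueError(
            "full-granularity recompute is incompatible with "
            f"TE-scoped CUDA graphs on {cuda_graph_scope}"
        )

# recompute_num_layers=1 implies granularity "full" -> rejected:
#   check_recompute_vs_cuda_graphs("full", ["mlp"])        # raises ValueError
# recompute_modules=["core_attn"] is selective -> accepted:
check_recompute_vs_cuda_graphs("selective", ["core_attn"])
```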
@yaoyu-33 yaoyu-33 force-pushed the yuya/fix-llama3-70b-sft-h100-oom branch from 7b6203f to 2232e89 on April 3, 2026 02:08
@yaoyu-33
Contributor Author

yaoyu-33 commented Apr 3, 2026

/ok to test 2232e89

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yaoyu-33 yaoyu-33 changed the title perf: Add recompute_num_layers=1 to llama3_70b SFT H100 FP8 CS config perf: Fix OOM for llama3_70b SFT H100 FP8 CS by increasing VPP to 10 Apr 3, 2026
@yaoyu-33
Contributor Author

yaoyu-33 commented Apr 3, 2026

CI Validation

Passed with VPP=10 (TP=4, PP=4, DP=2).

@yaoyu-33
Contributor Author

yaoyu-33 commented Apr 3, 2026

Stability Validation: 4/4 Passes with VPP=10

All 4 runs passed perf + memory + convergence validation. 0 OOM.

| Run | Pipeline | TFLOP/s/GPU | Peak Mem (GB) | Perf | Mem | Convergence |
|---|---|---|---|---|---|---|
| 1 | 47594754 | 698.9 | 56.2 | ✅ | ✅ | ✅ |
| 2 | 47597796 | 700.1 | 59.4 | ✅ | ✅ | ✅ |
| 3 | 47597798 | 699.2 | 56.2 | ✅ | ✅ | ✅ |
| 4 | 47597799 | 693.6 | 60.2 | ✅ | ✅ | ✅ |

Avg GPU util: ~697.9 TFLOP/s/GPU (-1.7% vs golden 709.9). Well within the 5% regression threshold.
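The regression arithmetic above can be checked directly (golden and per-run numbers are taken from the tables in this comment):

```python
golden = 709.9
runs = [698.9, 700.1, 699.2, 693.6]  # TFLOP/s/GPU from the 4 pipelines

avg = sum(runs) / len(runs)                   # 697.95
delta_pct = (avg - golden) / golden * 100     # ~ -1.7% vs golden

print(f"avg={avg:.2f} TFLOP/s/GPU, delta={delta_pct:.1f}%")
assert abs(delta_pct) < 5.0  # within the 5% regression threshold
```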

Full Experiment Summary

| # | Approach | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|
| 0 | Baseline recompute_modules: [core_attn] | ~704 | -0.8% | 58.8 | OOM |
| 1 | recompute_modules: [mlp] | 593.6 | -16.4% | 55.6 | ❌ Perf regression |
| 2 | recompute_modules: [mlp, core_attn] | 586.8 | -17.3% | 55.6 | ❌ Perf regression |
| 3 | recompute_modules: [core_attn, layernorm] | ~702 | -1.1% | 59.6 | OOM |
| 4-6 | activation_offload_layers: 2/4/6 | N/A | N/A | N/A | ❌ Incompatible with PP>1 |
| 7 | TP=4 PP=8 VPP=5 (DP=1) | 668.0 | -5.9% | 53.2 | ⚠️ Borderline perf |
| 8 | TP=8 PP=4 VPP=5 (DP=1) | 508.7 | -28.4% | 50.2 | ❌ Severe regression |
| 9 | TP=4 PP=4 VPP=10 (DP=2) | 698.9 | -1.6% | 60.2 | ✅ Passed (4/4) |
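The DP values in rows 7-9 follow from the fixed 32-GPU budget in the test name (llama3_70b_32gpu_...); a quick sketch (the helper name is ours):

```python
# Data-parallel size is whatever remains of the world size after
# tensor- and pipeline-parallel ranks are allocated.
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    assert world_size % (tp * pp) == 0, "world size must be divisible by tp*pp"
    return world_size // (tp * pp)

print(data_parallel_size(32, 4, 8))  # TP=4 PP=8 -> DP=1 (row 7)
print(data_parallel_size(32, 8, 4))  # TP=8 PP=4 -> DP=1 (row 8)
print(data_parallel_size(32, 4, 4))  # TP=4 PP=4 -> DP=2 (row 9)
```

Note that VPP does not consume GPUs: it only changes how each pipeline rank's layers are chunked, which is why row 9 keeps DP=2 while doubling VPP.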

🤖 Generated with Claude Code

yaoyu-33 added a commit that referenced this pull request Apr 3, 2026
…M experiment data

Add two new perf technique skills from Llama3 70B SFT OOM fix experiment
(PR #3107): activation-recompute (per-module cost/savings) and memory-tuning
(VPP tuning, parallelism resizing, CPU offloading constraints). Fix docs
hyperlinks to skills (../skills → ../../skills from docs/training/). Add
CUDA graph + layer-level recompute interaction to cuda-graphs skill.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor


2 participants