CPU offloading incorrectly enabled when activation_offload_layers=0 #2992

@rutayan-nv

Description

When activation_offload_layers=0 (or cpu_offloading_num_layers=0), CPU offloading is still enabled, because the guarding condition only checks that the value is not None instead of checking that it is greater than 0. A value of 0 therefore passes the check, even though it requests offloading for no layers.

This causes a ValueError when using pipeline parallelism:

ValueError: Currently there is no support for Pipeline parallelism with CPU offloading

Location

scripts/performance/utils/overrides.py around line 134

Current Code

if cpu_offloading_num_layers is not None:
    # enables CPU offloading even when value is 0

Proposed Fix

if cpu_offloading_num_layers is not None and cpu_offloading_num_layers > 0:
    # only enable CPU offloading when explicitly requested with layers > 0
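
To make the difference between the two conditions concrete, here is a minimal sketch that isolates them as predicates. The function names offloading_enabled_current and offloading_enabled_proposed are hypothetical and exist only for illustration; the actual code in overrides.py is not reproduced here.

```python
from typing import Optional


def offloading_enabled_current(cpu_offloading_num_layers: Optional[int]) -> bool:
    # Hypothetical helper mirroring the current check:
    # any non-None value, including 0, counts as "enabled".
    return cpu_offloading_num_layers is not None


def offloading_enabled_proposed(cpu_offloading_num_layers: Optional[int]) -> bool:
    # Hypothetical helper mirroring the proposed check:
    # only a positive layer count counts as "enabled".
    return cpu_offloading_num_layers is not None and cpu_offloading_num_layers > 0


if __name__ == "__main__":
    for value in (None, 0, 4):
        print(
            value,
            offloading_enabled_current(value),
            offloading_enabled_proposed(value),
        )
```

The two predicates agree for None and for positive values and differ only at 0, which is the case this issue describes.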

Reproduction

  1. Configure a model with activation_offload_layers = 0 and pipeline parallelism (pp > 1)
  2. Run training
  3. Error: ValueError: Currently there is no support for Pipeline parallelism with CPU offloading
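
The reproduction steps can be approximated without a training run. The sketch below is a self-contained stand-in, not Megatron-Bridge code: StandInConfig, apply_override, and the use_proposed_check flag are assumptions made for illustration, and only the error message is taken from this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInConfig:
    # Hypothetical stand-in for a model config; not the real class.
    pipeline_model_parallel_size: int = 1
    cpu_offloading: bool = False
    cpu_offloading_num_layers: int = 0

    def validate(self) -> None:
        # Reproduces the failure mode from the report: offloading plus pp > 1.
        if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
            raise ValueError(
                "Currently there is no support for Pipeline parallelism "
                "with CPU offloading"
            )


def apply_override(
    cfg: StandInConfig,
    activation_offload_layers: Optional[int],
    use_proposed_check: bool,
) -> StandInConfig:
    # Hypothetical override step that switches between the two conditions.
    enable = activation_offload_layers is not None
    if use_proposed_check:
        enable = enable and activation_offload_layers > 0
    if enable:
        cfg.cpu_offloading = True
        cfg.cpu_offloading_num_layers = activation_offload_layers
    return cfg


if __name__ == "__main__":
    for use_proposed_check in (False, True):
        cfg = apply_override(
            StandInConfig(pipeline_model_parallel_size=2), 0, use_proposed_check
        )
        try:
            cfg.validate()
            print("proposed check:", use_proposed_check, "-> validation passed")
        except ValueError as err:
            print("proposed check:", use_proposed_check, "-> ValueError:", err)
```

Under these assumptions, the current check raises the ValueError for activation_offload_layers = 0 with pp = 2, while the proposed check leaves offloading disabled and validation passes.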

Environment

  • Megatron-Bridge v0.3.1
  • Container: nvcr.io/nvidian/nemo:26.04.rc2 (which has Megatron-Bridge v0.4.0rc0 internally)

Metadata

Labels

  • area:training: Training loop, callbacks, and runtime integration
  • bug: Something isn't working
  • community-request
