CPU offloading incorrectly enabled when activation_offload_layers=0 #2992
Closed
Labels
area:training, bug, community-request
Description
When activation_offload_layers=0 (or cpu_offloading_num_layers=0), CPU offloading is still enabled, because the condition only checks that the value is not None instead of checking that it is greater than 0.
This causes a ValueError when using pipeline parallelism:
ValueError: Currently there is no support for Pipeline parallelism with CPU offloading
Location
scripts/performance/utils/overrides.py around line 134
Current Code
if cpu_offloading_num_layers is not None:
    # enables CPU offloading even when value is 0
Proposed Fix
if cpu_offloading_num_layers is not None and cpu_offloading_num_layers > 0:
    # only enable CPU offloading when explicitly requested with layers > 0
Reproduction
- Configure a model with activation_offload_layers = 0 and pipeline parallelism (pp > 1)
- Run training
- Error: ValueError: Currently there is no support for Pipeline parallelism with CPU offloading
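The difference between the two conditions can be checked in isolation. The sketch below is only an illustration: the function names `offloading_enabled_current` and `offloading_enabled_fixed` are hypothetical and do not exist in overrides.py; they simply wrap the two boolean expressions quoted above.

```python
from typing import Optional


def offloading_enabled_current(cpu_offloading_num_layers: Optional[int]) -> bool:
    # Hypothetical wrapper around the current check:
    # any non-None value, including 0, counts as "enabled".
    return cpu_offloading_num_layers is not None


def offloading_enabled_fixed(cpu_offloading_num_layers: Optional[int]) -> bool:
    # Hypothetical wrapper around the proposed check:
    # only a positive layer count counts as "enabled".
    return cpu_offloading_num_layers is not None and cpu_offloading_num_layers > 0


for value in (None, 0, 4):
    print(value, offloading_enabled_current(value), offloading_enabled_fixed(value))
# None False False
# 0 True False
# 4 True True
```

Only the value 0 behaves differently between the two checks, which matches the reported case: with the current condition, a setting of 0 is treated as a request for offloading and then conflicts with pipeline parallelism.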
Environment
- Megatron-Bridge v0.3.1
- Container: nvcr.io/nvidian/nemo:26.04.rc2 (which has Megatron-Bridge v0.4.0rc0 internally)