
AllReduce ctring AutoTune #745

Open
arttianezhu wants to merge 4 commits into meta-pytorch:main from arttianezhu:export-D93342742

Conversation

@arttianezhu
Contributor

Summary:
TODO for reviewers: I still need to remove small message size coverage (per-rank < 64KB), but otherwise this is ready for review.

tl;dr: adds an AutoTune module and integrates it with the allreduce ctring algorithm and the CtranAlgo tmpbuf allocations.


Logic

Pipeline auto-tuning (getAutoTunedPipeline):

  • The message size is rounded to the nearest power-of-2 and clamped to a per-architecture maximum bandwidth-delay product (BDP) (overridable via NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP)
  • 128MB for GB200/Blackwell
  • 32MB for H100/Hopper

A pipeline depth multiplier is chosen based on per-rank message size:

  • depth 2 for small (<1MB) and large (>=4MB) messages
  • depth 4 for medium (1-4MB) messages, where pushing smaller chunks through the pipeline faster helps both latency and throughput.

numChunks = pipelineDepth * nRanks
chunkSize = partitionMessageBytes / numChunks clamped to [256KB, 16MB].

A BDP-enforcement loop reduces chunkSize first and then numChunks to guarantee chunkSize * numChunks <= maxBDP.
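The pipeline-tuning steps above can be sketched as follows. The function name mirrors getAutoTunedPipeline from the diff, but the signature, the round-up direction, and the structure are simplifications of this description, not the actual implementation:

```cpp
#include <algorithm>
#include <cstddef>

constexpr size_t kKB = 1024, kMB = 1024 * 1024;

// Round up to a power of two (the diff describes rounding to the
// nearest pow2; rounding up is used here for simplicity).
static size_t roundPow2(size_t x) {
  size_t p = 1;
  while (p < x) p <<= 1;
  return p;
}

struct Pipeline { size_t numChunks; size_t chunkSize; };

// maxBDP: 128MB on GB200/Blackwell, 32MB on H100/Hopper, or the
// NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP override.
Pipeline getAutoTunedPipeline(size_t perRankBytes, int nRanks, size_t maxBDP) {
  size_t msg = std::min(roundPow2(perRankBytes), maxBDP);

  // Depth 4 only for medium (1-4MB) per-rank messages, depth 2 otherwise.
  int depth = (msg >= 1 * kMB && msg < 4 * kMB) ? 4 : 2;

  size_t numChunks = static_cast<size_t>(depth) * nRanks;
  size_t chunkSize = std::clamp(msg / numChunks, 256 * kKB, 16 * kMB);

  // BDP enforcement: shrink chunkSize first, then numChunks.
  while (chunkSize * numChunks > maxBDP && chunkSize > 256 * kKB) chunkSize /= 2;
  while (chunkSize * numChunks > maxBDP && numChunks > 1) numChunks /= 2;

  return {numChunks, chunkSize};
}
```

For example, under these assumptions an 8-rank, 32MB-per-rank call on Hopper (maxBDP 32MB) lands on 16 chunks of 2MB each, exactly filling the BDP budget.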

Block auto-tuning (getAutoTunedBlockParams):

  • A tiered lookup table maps chunkSize ranges to (numBlocks, blockSize) pairs with separate tiers per GPU architecture.
    • Default arch (GB200/Blackwell) scales from 1 block at <8KB chunks up to 8 blocks at >=64KB, using the CUDA occupancy-reported block size.
  • Hopper uses fewer blocks (max 4) with explicit block sizes (384 for tiny chunks, 512 otherwise). numBlocks is clamped by cudaOccupancyMaxPotentialBlockSize.
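The tiered lookup can be sketched as below. The <8KB and >=64KB boundaries and the Hopper block sizes (384/512) come from the description above; the intermediate tier values are illustrative guesses, not the actual table, and occBlockSize stands in for the value reported by cudaOccupancyMaxPotentialBlockSize:

```cpp
#include <cstddef>

struct BlockParams { int numBlocks; int blockSize; };

// Illustrative sketch of getAutoTunedBlockParams; tier interiors are
// assumptions, only the endpoints are taken from the summary.
BlockParams getAutoTunedBlockParams(size_t chunkSize, bool isHopper, int occBlockSize) {
  constexpr size_t kKB = 1024;
  if (isHopper) {
    // Hopper: at most 4 blocks, with explicit block sizes.
    int blockSize = (chunkSize < 8 * kKB) ? 384 : 512;
    int numBlocks = (chunkSize < 8 * kKB) ? 1 : (chunkSize < 64 * kKB) ? 2 : 4;
    return {numBlocks, blockSize};
  }
  // Default arch (GB200/Blackwell): scale from 1 block (<8KB chunks)
  // up to 8 blocks (>=64KB), using the occupancy-reported block size.
  int numBlocks = (chunkSize < 8 * kKB) ? 1 : (chunkSize < 64 * kKB) ? 4 : 8;
  return {numBlocks, occBlockSize};
}
```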

Buffer pre-allocation (CtranAlgo.cc):

  • Ring tmp buffers are now sized to the maximum BDP the auto-tuner could produce (from CVAR or constant), rather than
    the old fixed chunk_size * num_chunks product, ensuring the buffer is always large enough for any runtime decision.

Note: non-pow2 message sizes are rounded to the nearest pow2 before chunk tuning, so that the produced chunks are mostly 16B-aligned (fast path), except for the last chunk.
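A toy illustration of why this keeps chunks aligned: once the chunk size is tuned on the pow2-rounded size, every chunk of the real message except possibly the last has the full (pow2, >=256KB) chunk size and is therefore 16B-aligned. The helper below is illustrative, not code from the diff:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Cut a real, possibly non-pow2 message with a pow2-tuned chunk size.
// Only the final remainder chunk can be smaller (and unaligned).
std::vector<size_t> cutChunks(size_t totalBytes, size_t chunkSize) {
  std::vector<size_t> chunks;
  for (size_t off = 0; off < totalBytes; ) {
    size_t n = std::min(chunkSize, totalBytes - off);
    chunks.push_back(n);
    off += n;
  }
  return chunks;
}
```

E.g. cutting 3MB+100B with a 1MB chunk size yields three aligned 1MB chunks and one 100B tail.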

Existing CVARs (TMPBUF_CHUNK_SIZE, TMPBUF_NUM_CHUNKS, MAX_NUM_THREAD_BLOCKS, THREAD_BLOCK_SIZE) now default to 0; a value of 0 means the auto-tuned value is used.

New CVAR override order:

Highest:

  • NCCL_CTRAN_ALLREDUCE_RING_TMPBUF_NUM_CHUNKS > 0
  • NCCL_CTRAN_ALLREDUCE_RING_TMPBUF_CHUNK_SIZE > 0
  • NCCL_CTRAN_ALLREDUCE_RING_MAX_NUM_THREAD_BLOCKS > 0
  • NCCL_CTRAN_ALLREDUCE_RING_THREAD_BLOCK_SIZE > 0

Next:

  • NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP

Default:

  • use autotune
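The precedence can be read as a simple per-parameter resolver. resolveNumChunks below is a hypothetical helper illustrating the "explicit CVAR > 0 wins, otherwise auto-tune" rule for one of the four CVARs; the other three follow the same pattern:

```cpp
#include <cstddef>

// Hypothetical resolver for NCCL_CTRAN_ALLREDUCE_RING_TMPBUF_NUM_CHUNKS;
// a CVAR value of 0 means "unset".
size_t resolveNumChunks(size_t cvarNumChunks, size_t autoTunedNumChunks) {
  if (cvarNumChunks > 0) {
    return cvarNumChunks;  // highest priority: explicit override
  }
  // Default path: the auto-tuned value, which itself already honors
  // NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP.
  return autoTunedNumChunks;
}
```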

Notes:

I chose not to split this diff into smaller ones, because the autotune logic depends heavily on the buffer size, which is decided by the BDP and can be overridden via CVARs for experiments. The override logic must stay aligned between

  • AllReduce ctring AutoTune
  • CtranAlgo tmpbuf allocation.

CtranAlgoConsts.h is created to avoid adding per-algorithm dependencies into the main CtranAlgo.(h|cc).

Unfortunately, it's also difficult to justify the gains if we play with only one of the parameters at a time. Going forward, once we have this baseline, further optimizations can be more fine-grained.

Differential Revision: D93342742

Summary:

There is a bug in the algorithm that creates trailing 16B chunks in certain situations. I observed this from the profiles.

E.g., if TotalTmpNumel == remainNumel and both are already powers of 2, the old logic would cut remainNumel into two chunks: 16 and remainNumel - 16. The new logic avoids this issue.
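A hypothetical reconstruction of the failure mode; buggyCut/fixedCut are illustrative names, and the actual fix in the diff may differ:

```cpp
#include <cstddef>

struct Cut { size_t first; size_t second; };

// Reconstruction of the trailing-chunk bug: even when
// remainNumel == totalTmpNumel (both pow2), the old logic still split off
// a 16-element head, yielding chunks of 16 and remainNumel - 16.
Cut buggyCut(size_t totalTmpNumel, size_t remainNumel) {
  (void)totalTmpNumel;  // the buggy path ignored the exact-fit case
  return {16, remainNumel - 16};
}

// One way the fix can behave (illustrative): take the remainder as a
// single chunk when it exactly fills the tmp buffer.
Cut fixedCut(size_t totalTmpNumel, size_t remainNumel) {
  if (remainNumel == totalTmpNumel) {
    return {remainNumel, 0};
  }
  return {16, remainNumel - 16};
}
```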

Differential Revision: D93698089
Summary:

As title.

Differential Revision: D93751899
Summary:

Integrate D93751899 with the AllReduce ctring algorithm.

Differential Revision: D93342743
meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Feb 19, 2026.
@meta-codesync

meta-codesync bot commented Feb 19, 2026

@arttianezhu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93342742.

