
AllReduce ctring AutoTune #745

Open
arttianezhu wants to merge 4 commits into meta-pytorch:main from arttianezhu:export-D93342742

Conversation

@arttianezhu
Contributor

Summary:
TODO for reviewers: I still need to remove small message size coverage (per-rank < 64KB), but otherwise this is ready for review.

tl;dr: adds an AutoTune module and integrates it with the allreduce ctring algorithm and the CtranAlgo tmpbuf allocations.


Logic

Pipeline auto-tuning (getAutoTunedPipeline):

  • The message size is rounded to the nearest power-of-2 and clamped to a per-architecture maximum bandwidth-delay product (BDP) (overridable via NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP)
  • 128MB for GB200/Blackwell
  • 32MB for H100/Hopper

A pipeline depth multiplier is chosen based on per-rank message size:

  • depth 2 for small (<1MB) and large (>=4MB) messages
  • depth 4 for medium (1-4MB) messages, where pushing smaller chunks through the pipeline faster helps both latency and throughput.

numChunks = pipelineDepth * nRanks
chunkSize = partitionMessageBytes / numChunks clamped to [256KB, 16MB].

A BDP-enforcement loop reduces chunkSize first and then numChunks to guarantee chunkSize * numChunks <= maxBDP.
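The pipeline-tuning steps above can be sketched as follows. The function name mirrors getAutoTunedPipeline from the diff, but the signature, the round-up direction, and the structure are simplifications of this description, not the actual implementation:

```cpp
#include <algorithm>
#include <cstddef>

constexpr size_t kKB = 1024, kMB = 1024 * 1024;

// Round up to a power of two (the diff describes rounding to the
// nearest pow2; rounding up is used here for simplicity).
static size_t roundPow2(size_t x) {
  size_t p = 1;
  while (p < x) p <<= 1;
  return p;
}

struct Pipeline { size_t numChunks; size_t chunkSize; };

// maxBDP: 128MB on GB200/Blackwell, 32MB on H100/Hopper, or the
// NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP override.
Pipeline getAutoTunedPipeline(size_t perRankBytes, int nRanks, size_t maxBDP) {
  size_t msg = std::min(roundPow2(perRankBytes), maxBDP);

  // Depth 4 only for medium (1-4MB) per-rank messages, depth 2 otherwise.
  int depth = (msg >= 1 * kMB && msg < 4 * kMB) ? 4 : 2;

  size_t numChunks = static_cast<size_t>(depth) * nRanks;
  size_t chunkSize = std::clamp(msg / numChunks, 256 * kKB, 16 * kMB);

  // BDP enforcement: shrink chunkSize first, then numChunks.
  while (chunkSize * numChunks > maxBDP && chunkSize > 256 * kKB) chunkSize /= 2;
  while (chunkSize * numChunks > maxBDP && numChunks > 1) numChunks /= 2;

  return {numChunks, chunkSize};
}
```

For example, under these assumptions an 8-rank, 32MB-per-rank call on Hopper (maxBDP 32MB) lands on 16 chunks of 2MB each, exactly filling the BDP budget.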

Block auto-tuning (getAutoTunedBlockParams):

  • A tiered lookup table maps chunkSize ranges to (numBlocks, blockSize) pairs with separate tiers per GPU architecture.
    • Default arch (GB200/Blackwell) scales from 1 block at <8KB chunks up to 8 blocks at >=64KB, using the CUDA occupancy-reported block size.
  • Hopper uses fewer blocks (max 4) with explicit block sizes (384 for tiny chunks, 512 otherwise). numBlocks is clamped by cudaOccupancyMaxPotentialBlockSize.
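The tiered lookup can be sketched as below. The <8KB and >=64KB boundaries and the Hopper block sizes (384/512) come from the description above; the intermediate tier values are illustrative guesses, not the actual table, and occBlockSize stands in for the value reported by cudaOccupancyMaxPotentialBlockSize:

```cpp
#include <cstddef>

struct BlockParams { int numBlocks; int blockSize; };

// Illustrative sketch of getAutoTunedBlockParams; tier interiors are
// assumptions, only the endpoints are taken from the summary.
BlockParams getAutoTunedBlockParams(size_t chunkSize, bool isHopper, int occBlockSize) {
  constexpr size_t kKB = 1024;
  if (isHopper) {
    // Hopper: at most 4 blocks, with explicit block sizes.
    int blockSize = (chunkSize < 8 * kKB) ? 384 : 512;
    int numBlocks = (chunkSize < 8 * kKB) ? 1 : (chunkSize < 64 * kKB) ? 2 : 4;
    return {numBlocks, blockSize};
  }
  // Default arch (GB200/Blackwell): scale from 1 block (<8KB chunks)
  // up to 8 blocks (>=64KB), using the occupancy-reported block size.
  int numBlocks = (chunkSize < 8 * kKB) ? 1 : (chunkSize < 64 * kKB) ? 4 : 8;
  return {numBlocks, occBlockSize};
}
```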

Buffer pre-allocation (CtranAlgo.cc):

  • Ring tmp buffers are now sized to the maximum BDP the auto-tuner could produce (from CVAR or constant), rather than
    the old fixed chunk_size * num_chunks product, ensuring the buffer is always large enough for any runtime decision.

Note: non-pow2 message sizes are rounded to the nearest pow2 before chunk tuning, so that the produced chunks are mostly 16B-aligned (fast path), except for the last chunk.
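A toy illustration of why this keeps chunks aligned: once the chunk size is tuned on the pow2-rounded size, every chunk of the real message except possibly the last has the full (pow2, >=256KB) chunk size and is therefore 16B-aligned. The helper below is illustrative, not code from the diff:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Cut a real, possibly non-pow2 message with a pow2-tuned chunk size.
// Only the final remainder chunk can be smaller (and unaligned).
std::vector<size_t> cutChunks(size_t totalBytes, size_t chunkSize) {
  std::vector<size_t> chunks;
  for (size_t off = 0; off < totalBytes; ) {
    size_t n = std::min(chunkSize, totalBytes - off);
    chunks.push_back(n);
    off += n;
  }
  return chunks;
}
```

E.g. cutting 3MB+100B with a 1MB chunk size yields three aligned 1MB chunks and one 100B tail.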

Existing CVARs (TMPBUF_CHUNK_SIZE, TMPBUF_NUM_CHUNKS, MAX_NUM_THREAD_BLOCKS, THREAD_BLOCK_SIZE) now default to 0; a value of 0 means the auto-tuned value is used.

New CVAR override order:

Highest:

  • NCCL_CTRAN_ALLREDUCE_RING_TMPBUF_NUM_CHUNKS > 0
  • NCCL_CTRAN_ALLREDUCE_RING_TMPBUF_CHUNK_SIZE > 0
  • NCCL_CTRAN_ALLREDUCE_RING_MAX_NUM_THREAD_BLOCKS > 0
  • NCCL_CTRAN_ALLREDUCE_RING_THREAD_BLOCK_SIZE > 0

Next:

  • NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP

Default:

  • use autotune
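The precedence can be read as a simple per-parameter resolver. resolveNumChunks below is a hypothetical helper illustrating the "explicit CVAR > 0 wins, otherwise auto-tune" rule for one of the four CVARs; the other three follow the same pattern:

```cpp
#include <cstddef>

// Hypothetical resolver for NCCL_CTRAN_ALLREDUCE_RING_TMPBUF_NUM_CHUNKS;
// a CVAR value of 0 means "unset".
size_t resolveNumChunks(size_t cvarNumChunks, size_t autoTunedNumChunks) {
  if (cvarNumChunks > 0) {
    return cvarNumChunks;  // highest priority: explicit override
  }
  // Default path: the auto-tuned value, which itself already honors
  // NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP.
  return autoTunedNumChunks;
}
```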

Notes:

I chose not to split this diff into smaller ones, because the autotune logic depends heavily on the buffer size, which is decided by the BDP and can be overridden via CVARs for experiments. The override logic must stay aligned between

  • AllReduce ctring AutoTune
  • CtranAlgo tmpbuf allocation.

CtranAlgoConsts.h is created to avoid adding per-algorithm dependencies into the main CtranAlgo.(h|cc).

Unfortunately, it's also difficult to justify the gains if we play with only one of the parameters at a time. Going forward, once we have this baseline, further optimizations can be more fine-grained.

Differential Revision: D93342742

Summary:

There is a bug in the algorithm that creates trailing 16B chunks in certain situations. I observed this from the profiles.

E.g., if TotalTmpNumel == remainNumel and both are already powers of 2, the old logic would cut remainNumel into two chunks: 16 and remainNumel - 16. The new logic avoids this issue.
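A hypothetical reconstruction of the failure mode; buggyCut/fixedCut are illustrative names, and the actual fix in the diff may differ:

```cpp
#include <cstddef>

struct Cut { size_t first; size_t second; };

// Reconstruction of the trailing-chunk bug: even when
// remainNumel == totalTmpNumel (both pow2), the old logic still split off
// a 16-element head, yielding chunks of 16 and remainNumel - 16.
Cut buggyCut(size_t totalTmpNumel, size_t remainNumel) {
  (void)totalTmpNumel;  // the buggy path ignored the exact-fit case
  return {16, remainNumel - 16};
}

// One way the fix can behave (illustrative): take the remainder as a
// single chunk when it exactly fills the tmp buffer.
Cut fixedCut(size_t totalTmpNumel, size_t remainNumel) {
  if (remainNumel == totalTmpNumel) {
    return {remainNumel, 0};
  }
  return {16, remainNumel - 16};
}
```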

Differential Revision: D93698089
Summary:

As title.

Differential Revision: D93751899
Summary:

Integrate D93751899 with the AllReduce ctring algorithm.

Differential Revision: D93342743
meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Feb 19, 2026.
@meta-codesync

meta-codesync bot commented Feb 19, 2026

@arttianezhu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93342742.

