Summary: There is a bug in the algorithm that creates trailing 16B chunks in certain situations, which I observed in profiles. For example, if TotalTmpNumel == remainNumel and both are already a power of 2, the old logic would cut remainNumel into two chunks: 16 and remainNumel - 16. The new logic avoids this issue.

Differential Revision: D93698089
Summary: As title. Differential Revision: D93751899
Summary: Integrate D93751899 with AllReduce ctring algo Differential Revision: D93342743
Summary:
TODO to reviewers: I still need to remove small message size coverage (per-rank < 64KB), but otherwise this is ready for review.

**tl;dr** Adds an AutoTune module and integrates it with the allreduce ctring algorithm and CtranAlgo tmpbuf allocations.

--------------

**Logic**

**Pipeline auto-tuning** (`getAutoTunedPipeline`):
- The message size is rounded to the nearest power of 2 and clamped to a per-architecture maximum bandwidth-delay product (BDP), overridable via `NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP`:
  - 128MB for GB200/Blackwell
  - 32MB for H100/Hopper
- A pipeline depth multiplier is chosen based on per-rank message size:
  - depth 2 for small (<1MB) and large (>=4MB) messages
  - depth 4 for medium (1-4MB) messages, where pushing smaller chunks faster helps latency and throughput
- `numChunks = pipelineDepth * nRanks`; `chunkSize = partitionMessageBytes / numChunks`, clamped to [256KB, 16MB].
- A BDP-enforcement loop reduces chunkSize first and then numChunks to guarantee `chunkSize * numChunks <= maxBDP`.

**Block auto-tuning** (`getAutoTunedBlockParams`):
- A tiered lookup table maps chunkSize ranges to (numBlocks, blockSize) pairs, with separate tiers per GPU architecture.
- The default arch (GB200/Blackwell) scales from 1 block at <8KB chunks up to 8 blocks at >=64KB, using the CUDA occupancy-reported block size.
- Hopper uses fewer blocks (max 4) with explicit block sizes (384 for tiny chunks, 512 otherwise).
- numBlocks is clamped by `cudaOccupancyMaxPotentialBlockSize`.

**Buffer pre-allocation** (`CtranAlgo.cc`):
- Ring tmp buffers are now sized to the maximum BDP the auto-tuner could produce (from CVAR or constant), rather than the old fixed `chunk_size * num_chunks` product, ensuring the buffer is always large enough for any runtime decision.

**Note:** non-pow2 message bytes are rounded to the nearest pow2 message bytes for chunk tuning, so the produced chunks end up mostly 16B-aligned (fast path), except for the last chunk.
--------------

Existing CVARs (`TMPBUF_CHUNK_SIZE`, `TMPBUF_NUM_CHUNKS`, `MAX_NUM_THREAD_BLOCKS`, `THREAD_BLOCK_SIZE`) now default to 0.

**New CVAR override order:**

Highest:
- NCCL_CTRAN_ALLREDUCE_RING_TMPBUF_NUM_CHUNKS > 0
- NCCL_CTRAN_ALLREDUCE_RING_TMPBUF_CHUNK_SIZE > 0
- NCCL_CTRAN_ALLREDUCE_RING_MAX_NUM_THREAD_BLOCKS > 0
- NCCL_CTRAN_ALLREDUCE_RING_THREAD_BLOCK_SIZE > 0

Next:
- NCCL_CTRAN_ALLREDUCE_RING_AUTO_TUNE_MAX_BDP

Default:
- use autotune

------------

**Notes:**

I chose not to split this diff into smaller ones, because the autotune logic depends heavily on the buffer size, which is decided by the BDP and can be overridden by the CVARs for experiments. The override logic must stay aligned between:
- AllReduce ctring AutoTune
- CtranAlgo tmpbuf allocation

`CtranAlgoConsts.h` is created to avoid adding per-algorithm dependencies into the main CtranAlgo.(h|cc).

Unfortunately, it is also difficult to justify the gains if we play with only one of the parameters at a time. Going forward, once we have this baseline, the further optimizations will be more fine-grained.

Differential Revision: D93342742
@arttianezhu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93342742.