Skip to content

Conversation

@seagater
Copy link
Contributor

@seagater seagater commented Dec 1, 2025

Add padding to make the total size divisible by (nRanksPerNode * alignment).
Fix the issue "ALLREDUCE assert failed" #682

In test/torch/correctness_test.py, updated the device initialization to use device_id=torch.device(f"cuda:{local_rank}") in dist.init_process_group for compatibility with older versions of PyTorch, such as v2.7.0.

Copilot finished reviewing on behalf of seagater December 1, 2025 20:11
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses alignment issues in the NVLS (NVLink Switch) allreduce kernel by ensuring data element counts meet the 16-byte alignment requirement per rank. The PR also updates PyTorch distributed initialization in the test file to use a torch.device object instead of an integer for the device_id parameter, claiming compatibility with older PyTorch versions.

Key changes:

  • Added alignment logic to AllreduceNvlsWithCopy::allreduceKernelFunc to ensure (size / nRanksPerNode) is 16-byte aligned
  • Changed PyTorch test initialization from device_id=local_rank to device_id=torch.device(f"cuda:{local_rank}")

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
apps/nccl/src/allreduce.cu Adds 16-byte alignment calculation for NVLS kernels by computing alignedCount and passing it to the kernel instead of the original count
test/torch/correctness_test.py Changes device_id parameter in dist.init_process_group from integer to torch.device object

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@mahdiehghazim mahdiehghazim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend adding a few benchmark tests that use non-aligned input buffer sizes. This will help ensure we’re covering that case as we add more code in the future.

@seagater
Copy link
Contributor Author

seagater commented Dec 5, 2025

I recommend adding a few benchmark tests that use non-aligned input buffer sizes. This will help ensure we’re covering that case as we add more code in the future.

Pass the test with following nelem:
10556587,
10556576, 10556592,10556608, 1048576, 9999999, 12345678

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants