Align the data element counts for Allreduce nvls kernel #697
base: main
Conversation
Pull request overview
This PR addresses alignment issues in the NVLS (NVLink SHARP) allreduce kernel by ensuring data element counts meet the 16-byte alignment requirement per rank. It also updates the PyTorch distributed initialization in the test file to pass a `torch.device` object instead of an integer for the `device_id` parameter, for compatibility with older PyTorch versions.
Key changes:
- Added alignment logic to `AllreduceNvlsWithCopy::allreduceKernelFunc` to ensure `(size / nRanksPerNode)` is 16-byte aligned (see the sketch after this list)
- Changed PyTorch test initialization from `device_id=local_rank` to `device_id=torch.device(f"cuda:{local_rank}")`
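The core of the alignment fix is round-up arithmetic. Below is a minimal Python sketch of the idea, assuming the kernel pads the element count so that each rank's share is a multiple of 16 bytes; the names `aligned_count`, `n_ranks_per_node`, and `alignment` are illustrative, not the kernel's actual identifiers:

```python
def aligned_count(count: int, elem_size: int, n_ranks_per_node: int, alignment: int = 16) -> int:
    """Round count up so each rank's share of the buffer is alignment-byte aligned."""
    unit = (n_ranks_per_node * alignment) // elem_size  # elements per padding unit
    return ((count + unit - 1) // unit) * unit

# Example: 4097 float32 elements across 8 ranks are padded to 4128 elements,
# so each rank's 516-element share is 2064 bytes, a multiple of 16.
print(aligned_count(4097, elem_size=4, n_ranks_per_node=8))  # 4128
```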
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `apps/nccl/src/allreduce.cu` | Adds a 16-byte alignment calculation for NVLS kernels by computing `alignedCount` and passing it to the kernel instead of the original count |
| `test/torch/correctness_test.py` | Changes the `device_id` parameter in `dist.init_process_group` from an integer to a `torch.device` object |
mahdiehghazim left a comment:
I recommend adding a few benchmark tests that use non-aligned input buffer sizes. This will help ensure we’re covering that case as we add more code in the future.
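A hedged sketch of what such a check could look like, assuming a CUDA-capable `torch.distributed` process group is already initialized; the specific non-aligned `nelem` values are illustrative, not taken from the PR:

```python
import torch
import torch.distributed as dist

# Element counts deliberately NOT multiples of the per-rank 16-byte unit,
# so they exercise the new padding path in the NVLS kernel.
non_aligned_nelems = [1, 3, 1023, 4097, (1 << 20) + 5]

for nelem in non_aligned_nelems:
    x = torch.ones(nelem, dtype=torch.float16, device="cuda")
    dist.all_reduce(x)  # every element should now equal the world size
    expected = float(dist.get_world_size())
    assert torch.all(x == expected), f"allreduce mismatch at nelem={nelem}"
```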
Add padding to make the total size divisible by (nRanksPerNode * alignment) in allreduce11
- Pass the test with the following `nelem`:
- Add padding to make the total size divisible by `(nRanksPerNode * alignment)`.
- Fix the issue "ALLREDUCE assert failed" #682.
- In `test/torch/correctness_test.py`, updated the device initialization to use `device_id=torch.device(f"cuda:{local_rank}")` in `dist.init_process_group` for compatibility with older versions of PyTorch, such as v2.7.0 (sketched below).
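For reference, a minimal sketch of that setup, assuming the usual `env://` rendezvous variables (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`, `LOCAL_RANK`) are set by the launcher; everything beyond the `device_id` argument is illustrative:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Pass a torch.device object rather than a bare integer; older PyTorch
# releases (e.g., v2.7.0) expect a device object for the device_id parameter.
dist.init_process_group(backend="nccl", device_id=torch.device(f"cuda:{local_rank}"))
```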