[Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support #33517
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Related Documentation

No published documentation to review for changes on this repository.
Code Review
This pull request adds support for SM121 (DGX Spark) by introducing an enable_sm120_or_later kernel wrapper. The change is logical and consistent with the existing codebase structure. I have one suggestion to improve the long-term robustness of the new wrapper by adding an upper bound to the architecture check, which will prevent potential issues with future, incompatible GPU architectures.
csrc/cutlass_extensions/common.hpp
Outdated
```cpp
struct enable_sm120_or_later : Kernel {
  template <typename... Args>
  CUTLASS_DEVICE void operator()(Args&&... args) {
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ >= 1200
```
While using >= 1200 correctly enables this kernel for SM120 and SM121 as intended, it makes a strong assumption about forward compatibility with all future architectures. Highly tuned kernels like this can be sensitive to changes in future hardware generations. To make this safer and more explicit, I recommend adding an upper bound to the check to limit it to the Blackwell architecture series (presumably SM12x). This will prevent potential hard-to-debug issues on future, incompatible hardware.
```cpp
#if defined __CUDA_ARCH__ && (__CUDA_ARCH__ >= 1200 && __CUDA_ARCH__ < 1300)
```
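For reference, a minimal sketch of the wrapper with that bound applied, following the pattern of the surrounding enable_sm*-style wrappers in common.hpp (the forwarding body and includes are assumed from that pattern, not quoted from this diff):

```cpp
#include <utility>            // std::forward
#include "cutlass/cutlass.h"  // CUTLASS_DEVICE

// Guard restricted to the SM12x (Blackwell) family, per the suggestion
// above. For any other architecture the preprocessor removes the call,
// so no device code for the wrapped kernel body is emitted there.
template <typename Kernel>
struct enable_sm120_or_later : Kernel {
  template <typename... Args>
  CUTLASS_DEVICE void operator()(Args&&... args) {
#if defined __CUDA_ARCH__ && (__CUDA_ARCH__ >= 1200 && __CUDA_ARCH__ < 1300)
    Kernel::operator()(std::forward<Args>(args)...);
#endif
  }
};
```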
Add enable_sm120_or_later kernel wrapper to support SM121 (DGX Spark GB10) in addition to SM120 (RTX 5090/6000 Pro) for Blackwell CUTLASS kernels. The existing enable_sm120_only wrapper uses __CUDA_ARCH__ == 1200, which excludes SM121 (arch 1210). The new wrapper uses __CUDA_ARCH__ >= 1200 to include both SM120 and SM121+ architectures.

Changes:

- csrc/cutlass_extensions/common.hpp: Add enable_sm120_or_later template
- scaled_mm_blockwise_sm120_fp8_dispatch.cuh: Use enable_sm120_or_later for FP8 blockwise GEMM kernels

SM121 shares the same tensor core capabilities as SM120, so these kernels work correctly on both architectures.

Tested on DGX Spark GB10 (SM121) with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

Signed-off-by: code4me2 <[email protected]>
Force-pushed from 81450fd to 9179e0a (Compare)
csrc/cutlass_extensions/common.hpp
Outdated
```cpp
// SM12x family includes SM120 (RTX 5090) and SM121 (DGX Spark GB10)
template <typename Kernel>
struct enable_sm120_or_later : Kernel {
```
Let's update this to enable_sm120_family, since "later" sounds like >= sm120.
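Assuming the wrapper body stays as in the bounded sketch above (the final diff is not quoted in this thread), the rename amounts to:

```cpp
// Renamed so the name matches the semantics: this enables the SM12x
// family (SM120, SM121), not "SM120 or anything later".
template <typename Kernel>
struct enable_sm120_family : Kernel {
  template <typename... Args>
  CUTLASS_DEVICE void operator()(Args&&... args) {
#if defined __CUDA_ARCH__ && (__CUDA_ARCH__ >= 1200 && __CUDA_ARCH__ < 1300)
    Kernel::operator()(std::forward<Args>(args)...);
#endif
  }
};
```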
Address reviewer feedback: rename to enable_sm120_family, since "later" sounds like >= sm120, while this specifically targets the SM12x family (SM120, SM121).

Signed-off-by: code4me2 <[email protected]>
Force-pushed from 7e07daa to b21d37d (Compare)
@mgoin was there anything else to do for this one? I think this PR has all the changes you wanted.
Thanks @Code4me2! Enabled CI.
@mgoin is there anything else for me to do here? The checks that failed seem unrelated to the implementation.
Nope, just flaky CI at the moment, thanks for the ping!
Hi @mgoin, @Code4me2, thanks for the fix. I built it from source on a machine with a Blackwell 6000 Pro and on a machine with a 5090. On the Blackwell 6000 Pro it works fine, but on the machine with the 5090 it does not. Am I doing something wrong?

Error log: (collapsed details for the 5090 machine and the Blackwell 6000 Pro)

My steps: (collapsed details)
@shahizat let me check on my setup that I was testing with. Which model were you running when it failed?
…and NVFP4 MoE oracle checks

PR vllm-project#33417 added is_device_capability_family(120) to flashinfer_cutlass_moe.py and cutlass_moe.py but missed three other NVFP4 MoE backend files that still only check family(100). RTX 5090 (SM120) and SM110 GPUs are rejected by the oracle when using FlashInfer TRT-LLM, CuteDSL, or NVFP4 TRT-LLM weight prep backends. Add the same family(110) and family(120) checks to match the pattern established by vllm-project#33417. Fixes the issue reported in vllm-project#33517.

Signed-off-by: code4me2 <[email protected]>
…and NVFP4 MoE oracle checks

PR vllm-project#33417 added is_device_capability_family(120) to flashinfer_cutlass_moe.py and cutlass_moe.py but missed four other checks that still only match family(100). RTX 5090 (SM120) and SM110 GPUs are rejected by the oracle when using FlashInfer TRT-LLM, CuteDSL, or NVFP4 TRT-LLM weight prep backends. A fourth check in flashinfer_utils.py silently downgrades the TRT-LLM backend to CUTLASS for non-SM100 devices. Add the same family(110) and family(120) checks to match the pattern established by vllm-project#33417. Fixes the issue reported in vllm-project#33517.

Signed-off-by: code4me2 <[email protected]>
Summary

Add enable_sm120_or_later kernel wrapper to support SM121 (DGX Spark GB10) in addition to SM120 (RTX 5090) for Blackwell CUTLASS kernels.

Problem

DGX Spark GB10 (SM121) cannot use CUTLASS kernels because enable_sm120_only uses an exact architecture match.

Root Cause
The existing enable_sm120_only wrapper uses:

```cpp
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ == 1200
```

This excludes SM121 (arch 1210), which has identical tensor core capabilities.

Solution

Add a new enable_sm120_or_later wrapper:

```cpp
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ >= 1200
```

This includes both SM120 (RTX 5090) and SM121+ (DGX Spark) architectures.
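Note that __CUDA_ARCH__ is only defined when compiling device code, so a guard like this has no effect on host-side dispatch decisions. As a companion illustration (not part of this PR), a runtime check with the CUDA runtime API would test the same family; the helper name here is hypothetical:

```cpp
#include <cuda_runtime.h>

// Hypothetical host-side analogue of the device guard.
// SM120 reports major=12, minor=0; SM121 reports major=12, minor=1.
bool is_sm12x_family(int device_id) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) {
    return false;
  }
  int arch = prop.major * 10 + prop.minor;  // 120 for SM120, 121 for SM121
  return arch >= 120 && arch < 130;         // SM12x family, as reviewed above
}
```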
Files Changed

- csrc/cutlass_extensions/common.hpp: Add enable_sm120_or_later template
- csrc/quantization/w8a8/cutlass/c3x/scaled_mm_blockwise_sm120_fp8_dispatch.cuh: Use enable_sm120_or_later for FP8 blockwise GEMM

Testing

Tested on DGX Spark GB10 (SM121) with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

Related Issues
Fixes #28589