
Comprehensive fix for Windows MSVC build errors (C2872 std, LNK2001) and thread-safety #355

Open
munder-sa wants to merge 3 commits into thu-ml:main from munder-sa:main

@munder-sa

This PR provides a comprehensive and robust fix for compiling SageAttention and SageAttention3 on Windows with MSVC, addressing critical issues not fully resolved in previous attempts (such as PR #352).

Key Fixes:

1. Reliable fix for the small macro conflict (bool char error)
When <windows.h> is included (often implicitly), the Windows SDK header rpcndr.h that it pulls in defines #define small char. Any identifier spelled exactly small that is parsed afterwards, such as a bool small declaration in PyTorch's c10/cuda/CUDACachingAllocator.h, then expands to bool char and fails to compile. Simply passing -Usmall in nvcc_flags is insufficient: -U only removes a definition that exists when preprocessing starts, and rpcndr.h defines the macro again when it is included later.
Fix: Added an explicit #undef small in api.cu and fp4_quantization_4d.cu immediately after each <windows.h> include.
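To keep this fix from regressing, a source check could flag any <windows.h> include that is not immediately followed by #undef small. The sketch below is a hypothetical helper (windows_includes_without_undef is not part of the repository); it skips blank lines and // comments and requires the next meaningful line to be the #undef.

```python
import re

# Hypothetical lint helper, not part of SageAttention.
INCLUDE_WINDOWS = re.compile(r"^\s*#\s*include\s*<windows\.h>", re.IGNORECASE)
UNDEF_SMALL = re.compile(r"^\s*#\s*undef\s+small\b")


def windows_includes_without_undef(source: str) -> list[int]:
    """Return 1-based line numbers of '#include <windows.h>' lines whose next
    meaningful line (ignoring blanks and // comments) is not '#undef small'."""
    lines = source.splitlines()
    offending = []
    for i, line in enumerate(lines):
        if not INCLUDE_WINDOWS.match(line):
            continue
        ok = False
        for later in lines[i + 1:]:
            stripped = later.strip()
            if not stripped or stripped.startswith("//"):
                continue  # blank lines and line comments are skipped
            # The first meaningful line decides: it must be the #undef.
            ok = bool(UNDEF_SMALL.match(later))
            break
        if not ok:
            offending.append(i + 1)
    return offending
```

The rule is deliberately strict: any header parsed between the include and the #undef would still see small defined as char. It could be run over api.cu and fp4_quantization_4d.cu as part of CI.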

2. Fixed TORCH_EXTENSION_NAME overwrite bug during concurrent builds (LNK2001 error)
In setup.py and sageattention3_blackwell/setup.py, the CXX_FLAGS and NVCC_FLAGS lists were previously shared by every CUDAExtension definition. During parallel builds (e.g., pip install -e .), torch.utils.cpp_extension modifies extra_compile_args['cxx'] in place by appending -DTORCH_EXTENSION_NAME={name}. Because all extensions held the same list objects, this caused race conditions in which an earlier extension's name was overwritten (e.g., _qattn_sm80 received _fused instead). The expected module init function was then never generated, and linking failed with unresolved external symbols (LNK2001: PyInit__qattn_sm80).
Fix: Each CUDAExtension now receives its own shallow copy of the flag lists (e.g., CXX_FLAGS[:]), so flags appended for one extension can no longer leak into another.
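The aliasing problem and the copy-based fix can be reproduced without PyTorch. In this sketch, FakeExtension stands in for CUDAExtension and define_extension_name is a hypothetical helper that mimics the in-place append of -DTORCH_EXTENSION_NAME described above; the flag values -O3 and -std=c++17 are placeholders.

```python
from dataclasses import dataclass


@dataclass
class FakeExtension:
    """Hypothetical stand-in for CUDAExtension with only the fields used here."""
    name: str
    extra_compile_args: dict


def define_extension_name(ext: FakeExtension) -> None:
    # Hypothetical: mimics appending the name define to the 'cxx' list in place.
    ext.extra_compile_args["cxx"].append(f"-DTORCH_EXTENSION_NAME={ext.name}")


def build(names: list[str], share_lists: bool) -> list[FakeExtension]:
    CXX_FLAGS = ["-O3", "-std=c++17"]  # placeholder flags
    NVCC_FLAGS = ["-O3"]
    exts = []
    for name in names:
        # Shared: every extension aliases the same lists.
        # Copied: CXX_FLAGS[:] gives each extension its own list.
        cxx = CXX_FLAGS if share_lists else CXX_FLAGS[:]
        nvcc = NVCC_FLAGS if share_lists else NVCC_FLAGS[:]
        exts.append(FakeExtension(name, {"cxx": cxx, "nvcc": nvcc}))
    for ext in exts:
        define_extension_name(ext)
    return exts
```

With share_lists=True, both extensions hold one list object, so _qattn_sm80 also receives -DTORCH_EXTENSION_NAME=_fused. With slices, each extension ends up with exactly one define naming itself. A shallow copy is enough because the list elements are immutable strings.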

3. Reliably resolved error C2872: 'std': ambiguous symbol
MSVC's ambiguous std error originating in compiled_autograd.h is bypassed by appending -DUSE_CUDA, /Zc:preprocessor, and /DCCCL_IGNORE_MSVC_TRADITIONAL_PREPROCESSOR_WARNING whenever sys.platform == "win32". The flags are added unconditionally on Windows rather than only when DISTUTILS_USE_SDK=1 is set, so the build behaves the same in any standard Windows terminal environment.
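The platform check can be sketched as a small helper. The name windows_cxx_flags and its platform_name parameter are hypothetical, and the sketch assumes the flags go into the C++ (cxx) flag list; how they are forwarded through the nvcc list is not shown.

```python
import sys

# The three flags listed above, appended on Windows.
MSVC_STD_FIX_FLAGS = [
    "-DUSE_CUDA",
    "/Zc:preprocessor",
    "/DCCCL_IGNORE_MSVC_TRADITIONAL_PREPROCESSOR_WARNING",
]


def windows_cxx_flags(platform_name: str = sys.platform) -> list[str]:
    """Hypothetical helper: extra C++ flags for the given platform.

    Only the platform is checked. DISTUTILS_USE_SDK is deliberately not
    consulted, so the result is the same in any Windows terminal.
    """
    if platform_name == "win32":
        return list(MSVC_STD_FIX_FLAGS)  # fresh list, safe for callers to mutate
    return []
```

In a setup.py this might be used as CXX_FLAGS += windows_cxx_flags() before the lists are copied into each CUDAExtension.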

These changes were tested locally: all .whl binaries build successfully on Windows 11 with Python 3.12, PyTorch 2.10.0, and CUDA 13.1.
