Comprehensive fix for Windows MSVC build errors (C2872 std, LNK2001) and thread-safety#355
Open
munder-sa wants to merge 3 commits intothu-ml:mainfrom
Open
Comprehensive fix for Windows MSVC build errors (C2872 std, LNK2001) and thread-safety#355munder-sa wants to merge 3 commits intothu-ml:mainfrom
munder-sa wants to merge 3 commits intothu-ml:mainfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR provides a comprehensive and robust fix for compiling SageAttention and SageAttention3 on Windows with MSVC, addressing critical issues not fully resolved in previous attempts (such as PR #352).
Key Fixes:
1. Reliable fix for the
smallmacro conflict (bool charerror)When
<windows.h>is included (often implicitly via PyTorch'srpcndr.h), it incorrectly defines#define small char. This turnsbool is_smallorbool smallin PyTorch'sc10/cuda/CUDACachingAllocator.hintobool char, causing compilation failures. Simply passing-Usmalltonvcc_flagsis insufficient because the macro gets redefined after the command line arguments are processed.Fix: Explicitly added
#undef smallinapi.cuandfp4_quantization_4d.curight after<windows.h>inclusions.2. Fixed
TORCH_EXTENSION_NAMEoverwrite bug during concurrent builds (LNK2001error)In
setup.pyandsageattention3_blackwell/setup.py, theCXX_FLAGSandNVCC_FLAGSlists were previously shared across allCUDAExtensiondefinitions. During parallel compilations (e.g., usingpip install -e .),torch.utils.cpp_extensionmodifiesextra_compile_args['cxx']in-place by appending-DTORCH_EXTENSION_NAME={name}. This caused race conditions where the extension names were overwritten for earlier extensions (e.g.,_qattn_sm80receiving_fusedinstead), resulting in unresolved external symbols (LNK2001: PyInit__qattn_sm80) during linking.Fix: Passed shallow copies of the flag lists (e.g.,
CXX_FLAGS[:]) to eachCUDAExtensionto ensure thread-safety.3. Resolved
error C2872: 'std': ambiguous symbolreliablyThe ambiguous
stdresolution fromcompiled_autograd.hwith MSVC is safely bypassed by appending-DUSE_CUDA,/Zc:preprocessor, and/DCCCL_IGNORE_MSVC_TRADITIONAL_PREPROCESSOR_WARNINGunconditionally forsys.platform == "win32"(without relying onDISTUTILS_USE_SDK==1, ensuring it works uniformly across standard Windows terminal environments).These changes have been thoroughly tested locally and successfully build all
.whlbinaries on Windows 11 with Python 3.12, PyTorch 2.10.0, and CUDA 13.1.