Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations.#3005
Draft
carlobertolli wants to merge 1 commit intoROCm:developfrom
Draft
Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations.#3005carlobertolli wants to merge 1 commit intoROCm:developfrom
carlobertolli wants to merge 1 commit intoROCm:developfrom
Conversation
This patch was triggered but a failure observed in ROCm GPU asan support. It happens when we attempt an atomic operation on the last element of a tensor and the element address has the correct alignment. Normally, we would use a 4-byte read + atomic cas loop for 4-bytes. However, in this case, we cannot backtrack 2 bytes before the last element because that would mean a mis-aligned atomic, resulting in a gpu memory error. The solution is to read 2 bytes (change unsigned int to short) and then use atomicCAS loop over 2 byte data type of the last element in the tensor. On AMDGPUs, LLVM expands the 2-byte atomic to a 4-byte one during code gen as it is the only legal hw instruction available. Code generation attempts to align down the input address to a 4-byte boundary alignment. In the pytorch example, the address of element 14 (last one) is already aligned at 4-bytes so the result is the same and the atomic cmpswp in the generated code will actually read and write 2 bytes after the end of the array. There are two observations on why this works: 1. ASAN doesn't catch the atomic cmpswp out-of-bound error because instrumentation for asan happens before the 2-byte atomic is legalized to a 4-byte atomic in code gen (asan instrumentation runs before code generation). 2. We do not see the gpu memory error because we are effectively reading the red zone and rewriting the same 2 bytes into it. So there's still an issue here after atomic legalization, but the test passes with and without asan. Without asan, it's likely that hipMalloc pads the allocation to a larger size or that we are touching - without modifying - some other application data. As long as that data is not read-only, we won't see the issue.
|
Jenkins build for 8e34cc54569d0f3fcc53506c2de31f88306fa178 commit finished as FAILURE |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This patch was triggered but a failure observed in ROCm GPU asan support. It happens when we attempt an atomic operation on the last element of a tensor and the element address has the correct alignment. Normally, we would use a 4-byte read + atomic cas loop for 4-bytes. However, in this case, we cannot backtrack 2 bytes before the last element because that would mean a mis-aligned atomic, resulting in a gpu memory error. The solution is to read 2 bytes (change unsigned int to short) and then use atomicCAS loop over 2 byte data type of the last element in the tensor.
On AMDGPUs, LLVM expands the 2-byte atomic to a 4-byte one during code gen as it is the only legal hw instruction available. Code generation attempts to align down the input address to a 4-byte boundary alignment. In the pytorch example, the address of element 14 (last one) is already aligned at 4-bytes so the result is the same and the atomic cmpswp in the generated code will actually read and write 2 bytes after the end of the array.
There are two observations on why this works:
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist