Skip to content

Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations.#3005

Draft
carlobertolli wants to merge 1 commit intoROCm:developfrom
carlobertolli:fix_16bit_atomics.rocm
Draft

Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations.#3005
carlobertolli wants to merge 1 commit intoROCm:developfrom
carlobertolli:fix_16bit_atomics.rocm

Conversation

@carlobertolli
Copy link

This patch was triggered but a failure observed in ROCm GPU asan support. It happens when we attempt an atomic operation on the last element of a tensor and the element address has the correct alignment. Normally, we would use a 4-byte read + atomic cas loop for 4-bytes. However, in this case, we cannot backtrack 2 bytes before the last element because that would mean a mis-aligned atomic, resulting in a gpu memory error. The solution is to read 2 bytes (change unsigned int to short) and then use atomicCAS loop over 2 byte data type of the last element in the tensor.

On AMDGPUs, LLVM expands the 2-byte atomic to a 4-byte one during code gen as it is the only legal hw instruction available. Code generation attempts to align down the input address to a 4-byte boundary alignment. In the pytorch example, the address of element 14 (last one) is already aligned at 4-bytes so the result is the same and the atomic cmpswp in the generated code will actually read and write 2 bytes after the end of the array.

There are two observations on why this works:

  1. ASAN doesn't catch the atomic cmpswp out-of-bound error because instrumentation for asan happens before the 2-byte atomic is legalized to a 4-byte atomic in code gen (asan instrumentation runs before code generation).
  2. We do not see the gpu memory error because we are effectively reading the red zone and rewriting the same 2 bytes into it. So there's still an issue here after atomic legalization, but the test passes with and without asan. Without asan, it's likely that hipMalloc pads the allocation to a larger size or that we are touching - without modifying - some other application data. As long as that data is not read-only, we won't see the issue.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

This patch was triggered but a failure observed in ROCm GPU asan support.
It happens when we attempt an atomic operation on the last element
of a tensor and the element address has the correct alignment.
Normally, we would use a 4-byte read + atomic cas loop for 4-bytes.
However, in this case, we cannot backtrack 2 bytes before the last element
because that would mean a mis-aligned atomic, resulting in a gpu memory error.
The solution is to read 2 bytes (change unsigned int to short) and
then use atomicCAS loop over 2 byte data type of the last element in the tensor.

On AMDGPUs, LLVM expands the 2-byte atomic to a 4-byte one during code gen as it is
the only legal hw instruction available. Code generation attempts to align down the
input address to a 4-byte boundary alignment. In the pytorch example, the address of
element 14 (last one) is already aligned at 4-bytes so the result is the same and
the atomic cmpswp in the generated code will actually read and write 2 bytes after
the end of the array.

There are two observations on why this works:

1. ASAN doesn't catch the atomic cmpswp out-of-bound error because instrumentation for asan
happens before the 2-byte atomic is legalized to a 4-byte atomic in code gen
(asan instrumentation runs before code generation).
2. We do not see the gpu memory error because we are effectively reading the red zone and rewriting
the same 2 bytes into it. So there's still an issue here after atomic legalization,
but the test passes with and without asan. Without asan, it's likely that hipMalloc pads the allocation
to a larger size or that we are touching - without modifying - some other application data.
As long as that data is not read-only, we won't see the issue.
@carlobertolli carlobertolli marked this pull request as draft February 26, 2026 20:17
@rocm-repo-management-api
Copy link

rocm-repo-management-api bot commented Feb 26, 2026

Jenkins build for 8e34cc54569d0f3fcc53506c2de31f88306fa178 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant