Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations. by carlobertolli · Pull Request #3005 · ROCm/pytorch

carlobertolli · 2026-02-26T20:17:32Z

This patch was triggered but a failure observed in ROCm GPU asan support. It happens when we attempt an atomic operation on the last element of a tensor and the element address has the correct alignment. Normally, we would use a 4-byte read + atomic cas loop for 4-bytes. However, in this case, we cannot backtrack 2 bytes before the last element because that would mean a mis-aligned atomic, resulting in a gpu memory error. The solution is to read 2 bytes (change unsigned int to short) and then use atomicCAS loop over 2 byte data type of the last element in the tensor.

On AMDGPUs, LLVM expands the 2-byte atomic to a 4-byte one during code gen as it is the only legal hw instruction available. Code generation attempts to align down the input address to a 4-byte boundary alignment. In the pytorch example, the address of element 14 (last one) is already aligned at 4-bytes so the result is the same and the atomic cmpswp in the generated code will actually read and write 2 bytes after the end of the array.

There are two observations on why this works:

ASAN doesn't catch the atomic cmpswp out-of-bound error because instrumentation for asan happens before the 2-byte atomic is legalized to a 4-byte atomic in code gen (asan instrumentation runs before code generation).
We do not see the gpu memory error because we are effectively reading the red zone and rewriting the same 2 bytes into it. So there's still an issue here after atomic legalization, but the test passes with and without asan. Without asan, it's likely that hipMalloc pads the allocation to a larger size or that we are touching - without modifying - some other application data. As long as that data is not read-only, we won't see the issue.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

This patch was triggered but a failure observed in ROCm GPU asan support. It happens when we attempt an atomic operation on the last element of a tensor and the element address has the correct alignment. Normally, we would use a 4-byte read + atomic cas loop for 4-bytes. However, in this case, we cannot backtrack 2 bytes before the last element because that would mean a mis-aligned atomic, resulting in a gpu memory error. The solution is to read 2 bytes (change unsigned int to short) and then use atomicCAS loop over 2 byte data type of the last element in the tensor. On AMDGPUs, LLVM expands the 2-byte atomic to a 4-byte one during code gen as it is the only legal hw instruction available. Code generation attempts to align down the input address to a 4-byte boundary alignment. In the pytorch example, the address of element 14 (last one) is already aligned at 4-bytes so the result is the same and the atomic cmpswp in the generated code will actually read and write 2 bytes after the end of the array. There are two observations on why this works: 1. ASAN doesn't catch the atomic cmpswp out-of-bound error because instrumentation for asan happens before the 2-byte atomic is legalized to a 4-byte atomic in code gen (asan instrumentation runs before code generation). 2. We do not see the gpu memory error because we are effectively reading the red zone and rewriting the same 2 bytes into it. So there's still an issue here after atomic legalization, but the test passes with and without asan. Without asan, it's likely that hipMalloc pads the allocation to a larger size or that we are touching - without modifying - some other application data. As long as that data is not read-only, we won't see the issue.

rocm-repo-management-api · 2026-02-26T20:33:21Z

Jenkins build for 8e34cc54569d0f3fcc53506c2de31f88306fa178 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

carlobertolli requested a review from jeffdaily February 26, 2026 20:17

carlobertolli marked this pull request as draft February 26, 2026 20:17

carlobertolli requested a review from pruthvistony February 26, 2026 21:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations.#3005

Atomics on 16 bits: prevent reading 4 bytes for 2-byte locations.#3005
carlobertolli wants to merge 1 commit intoROCm:developfrom
carlobertolli:fix_16bit_atomics.rocm

carlobertolli commented Feb 26, 2026

Uh oh!

rocm-repo-management-api bot commented Feb 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

carlobertolli commented Feb 26, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

rocm-repo-management-api bot commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rocm-repo-management-api bot commented Feb 26, 2026 •

edited

Loading