This program showcases a simple matrix transpose kernel that uses a different codepath depending on the target architecture.
- A number of constants are defined to control the problem details and the kernel launch parameters.
- The input matrix is set up in host memory.
- The necessary amount of device memory is allocated and the input is copied to the device.
- The GPU transpose kernel is launched with the previously defined arguments.
- The kernel will have two different codepaths for its data movement, depending on the target architecture.
- The transposed matrix is copied back to the host and all device memory is freed.
- The elements of the result matrix are compared with the expected result. The result of the comparison is printed to the standard output.
This example showcases two different codepaths inside a GPU kernel, depending on the target architecture.
You may want to use architecture-specific inline assembly when compiling for a specific architecture, without losing compatibility with other architectures (see the inline_assembly example).
The architecture-specific compiler definitions (such as `__gfx1030__`) only exist within GPU kernels. If you would like to have GPU architecture-specific host-side code, you can query the stream/device information at runtime instead.
Device symbols:

- `threadIdx`, `blockIdx`, `blockDim`
- Architecture-specific compiler definitions: `__gfx1010__`, `__gfx1011__`, `__gfx1012__`, `__gfx1030__`, `__gfx1031__`, `__gfx1100__`, `__gfx1101__`, `__gfx1102__`
Host symbols:

- `hipMalloc`
- `hipMemcpy`
- `hipGetLastError`
- `hipFree`
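The host-side call sequence these APIs form can be sketched as follows. This is an illustrative outline, not the example's actual code: the kernel name, matrix size, and launch geometry are hypothetical, error checking is elided for brevity, and it requires a ROCm toolchain and GPU to build and run:

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Hypothetical kernel signature; the real example's differs.
__global__ void transpose_kernel(float* out, const float* in, unsigned width);

int main()
{
    const unsigned width      = 64;
    const size_t   size_bytes = width * width * sizeof(float);
    std::vector<float> h_in(width * width, 1.f), h_out(width * width);

    // Allocate device memory and copy the input matrix to the device.
    float *d_in = nullptr, *d_out = nullptr;
    hipMalloc(&d_in, size_bytes);
    hipMalloc(&d_out, size_bytes);
    hipMemcpy(d_in, h_in.data(), size_bytes, hipMemcpyHostToDevice);

    // Launch the kernel, then check for launch errors.
    transpose_kernel<<<dim3(width / 8, width / 8), dim3(8, 8)>>>(d_out, d_in, width);
    hipGetLastError();

    // Copy the transposed matrix back and free all device memory.
    hipMemcpy(h_out.data(), d_out, size_bytes, hipMemcpyDeviceToHost);
    hipFree(d_in);
    hipFree(d_out);
}
```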