A stream encapsulates a queue of tasks that are launched on the GPU device. This example showcases usage of multiple streams, each with their own tasks. These tasks include asynchronous memory copies using `hipMemcpyAsync` and asynchronous kernel launches using `myKernelName<<<...>>>`.
- Host-side input and output memory is allocated as pinned memory using `hipHostMalloc`. Pinned memory ensures that the copies issued with `hipMemcpyAsync` are performed asynchronously.
- Host input is instantiated.
- Device-side storage is allocated using `hipMalloc`.
- Two `hipStream_t` streams are created using `hipStreamCreate`. The example launches two different kernels, and each stream queues the tasks related to one of the kernel launches.
- Data is copied from host to device using `hipMemcpyAsync`.
- The two kernels, `matrix_transpose_static_shared` and `matrix_transpose_dynamic_shared`, are launched asynchronously on their respective streams.
- An asynchronous memory copy (using `hipMemcpyAsync`) is queued on each stream to transfer the results from device to host.
- The streams are destroyed using `hipStreamDestroy`.
- The host explicitly waits for all tasks to finish using `hipDeviceSynchronize`.
- Device-side memory is freed using `hipFree`.
- Host-side pinned memory is freed using `hipHostFree`.
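The steps above can be sketched end-to-end as follows. This is a minimal sketch, not the example's actual source: `scale_kernel` is a hypothetical stand-in for the two transpose kernels, and error checking is elided for brevity. It assumes a ROCm/HIP toolchain (compile with `hipcc`) and an available GPU.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Hypothetical stand-in for the example's transpose kernels.
__global__ void scale_kernel(float* out, const float* in, unsigned int size)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < size)
    {
        out[i] = 2.f * in[i];
    }
}

int main()
{
    constexpr unsigned int size       = 1 << 20;
    constexpr size_t       size_bytes = size * sizeof(float);

    // Pinned host memory, so hipMemcpyAsync can copy asynchronously.
    float *h_in, *h_out[2];
    hipHostMalloc(&h_in, size_bytes);
    hipHostMalloc(&h_out[0], size_bytes);
    hipHostMalloc(&h_out[1], size_bytes);
    for(unsigned int i = 0; i < size; ++i)
    {
        h_in[i] = static_cast<float>(i);
    }

    float*      d_in[2];
    float*      d_out[2];
    hipStream_t streams[2];
    for(int s = 0; s < 2; ++s)
    {
        hipMalloc(&d_in[s], size_bytes);
        hipMalloc(&d_out[s], size_bytes);
        hipStreamCreate(&streams[s]);

        // Copy-in, kernel launch and copy-out are queued on the stream and
        // return immediately; the two streams may execute concurrently.
        hipMemcpyAsync(d_in[s], h_in, size_bytes, hipMemcpyHostToDevice, streams[s]);
        scale_kernel<<<dim3(size / 256), dim3(256), 0, streams[s]>>>(d_out[s], d_in[s], size);
        hipMemcpyAsync(h_out[s], d_out[s], size_bytes, hipMemcpyDeviceToHost, streams[s]);
    }

    // Block the host until all tasks queued on the device have finished.
    hipDeviceSynchronize();

    for(int s = 0; s < 2; ++s)
    {
        hipStreamDestroy(streams[s]);
        hipFree(d_in[s]);
        hipFree(d_out[s]);
        printf("stream %d: h_out[1] = %f\n", s, h_out[s][1]);
        hipHostFree(h_out[s]);
    }
    hipHostFree(h_in);
    return 0;
}
```

Note that the queued copies and launches all return immediately on the host; only `hipDeviceSynchronize` blocks, after which the results are safe to read.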
A HIP stream allows device tasks to be grouped and launched asynchronously and independently from other tasks, which can be used to hide latencies and increase task completion throughput. When the results of tasks queued on a particular stream are needed, that stream can be explicitly synchronized without blocking work queued on other streams. Each HIP stream is tied to a particular device, which enables HIP streams to be used to schedule work across multiple devices simultaneously.
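The per-stream synchronization and multi-device scheduling mentioned above could be sketched like this. This fragment is not part of the example; it assumes at least one visible device and omits the actual per-device workload and error checking.

```cpp
#include <hip/hip_runtime.h>
#include <vector>

int main()
{
    int device_count = 0;
    hipGetDeviceCount(&device_count);

    // hipStreamCreate ties each new stream to the device that is
    // current at creation time, giving one work queue per device.
    std::vector<hipStream_t> streams(device_count);
    for(int d = 0; d < device_count; ++d)
    {
        hipSetDevice(d);
        hipStreamCreate(&streams[d]);
        // ... queue per-device copies and kernel launches on streams[d] ...
    }

    // Wait only for the tasks queued on the first stream, without
    // blocking work still running on the other devices' streams.
    hipStreamSynchronize(streams[0]);

    for(int d = 0; d < device_count; ++d)
    {
        hipSetDevice(d);
        hipStreamDestroy(streams[d]);
    }
    return 0;
}
```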
- `__shared__`
- `__syncthreads`
- `hipStream_t`
- `hipStreamCreate`
- `hipStreamDestroy`
- `hipMalloc`
- `hipHostMalloc`
- `hipMemcpyAsync`
- `hipDeviceSynchronize`
- `hipFree`
- `hipHostFree`