[DOC]: How does cuTile support multistage pipelining/warp specialization in GEMM? #12

@irasin

Description

How would you describe the priority of this documentation request?

High

Please provide a link or source to the relevant docs

https://github.com/NVIDIA/cutile-python/blob/main/samples/MatMul.py

Describe the problems in the documentation

Consider the main loop of the matmul sample:

    for k in range(num_tiles_k):
        # Load tile from matrix A.
        # `index=(bidx, k)` specifies which (M-tile, K-tile) to load
        # from global memory A. `shape=(tm, tk)` defines the size of this tile.
        a = ct.load(A, index=(bidx, k), shape=(tm, tk), padding_mode=zero_pad).astype(dtype)

        # Load tile from matrix B.
        # `index=(k, bidy)` specifies which (K-tile, N-tile) to load
        # from global memory B. `shape=(tk, tn)` defines the size of this tile.
        b = ct.load(B, index=(k, bidy), shape=(tk, tn), padding_mode=zero_pad).astype(dtype)

        # Perform matrix multiplication for the current tiles.
        # `ct.mma` multiplies the two loaded tiles and adds the product to `accumulator`.
        accumulator = ct.mma(a, b, accumulator)

  1. The loop does not appear to apply any multistage pipelining or warp specialization to the GEMM. Is that correct?
  2. If so, is the performance of cuTile comparable to CUTLASS/CuTe DSL or Triton? Are there compiler passes in cuTile that apply these optimizations?
  3. How can cp.async, TMA, and mbarrier be used in cuTile?
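
To make question 1 concrete, here is a minimal NumPy sketch of what "multistage" would mean for the loop above. It is not cuTile code. The helpers `load_tile`, `tile_matmul_simple`, and `tile_matmul_pipelined` and the `stages` parameter are hypothetical names introduced only for illustration, and the sketch makes no claim about what the cuTile compiler does internally. It only shows that issuing the loads for later K-tiles before the current tile is consumed leaves the result unchanged.

```python
from collections import deque

import numpy as np


def load_tile(mat, index, shape):
    """Hypothetical stand-in for a zero-padded tile load (not the cuTile API)."""
    (i, j), (h, w) = index, shape
    tile = np.zeros(shape, dtype=mat.dtype)
    block = mat[i * h:(i + 1) * h, j * w:(j + 1) * w]
    tile[:block.shape[0], :block.shape[1]] = block  # out-of-bounds part stays zero
    return tile


def tile_matmul_simple(A, B, bidx, bidy, tm, tn, tk):
    """Same structure as the sample loop: load A tile, load B tile, accumulate."""
    num_tiles_k = -(-A.shape[1] // tk)  # ceiling division
    acc = np.zeros((tm, tn), dtype=np.float32)
    for k in range(num_tiles_k):
        a = load_tile(A, (bidx, k), (tm, tk))
        b = load_tile(B, (k, bidy), (tk, tn))
        acc += a @ b
    return acc


def tile_matmul_pipelined(A, B, bidx, bidy, tm, tn, tk, stages=2):
    """Multistage variant: keep up to `stages` tile pairs loaded at once."""
    num_tiles_k = -(-A.shape[1] // tk)
    acc = np.zeros((tm, tn), dtype=np.float32)
    in_flight = deque()

    # Prologue: issue the loads for the first `stages - 1` K-tiles up front.
    for k in range(min(stages - 1, num_tiles_k)):
        in_flight.append((load_tile(A, (bidx, k), (tm, tk)),
                          load_tile(B, (k, bidy), (tk, tn))))

    for k in range(num_tiles_k):
        # Issue the load for a future K-tile before consuming the current one.
        # On a GPU this is where an asynchronous copy would overlap with math.
        nxt = k + stages - 1
        if nxt < num_tiles_k:
            in_flight.append((load_tile(A, (bidx, nxt), (tm, tk)),
                              load_tile(B, (nxt, bidy), (tk, tn))))
        a, b = in_flight.popleft()
        acc += a @ b
    return acc
```

In this sequential sketch the pipelined version gains nothing, since NumPy loads are not asynchronous. The point is only the loop shape: a prologue that fills the buffers, then a steady state that issues one load and consumes one tile pair per iteration. Whether cuTile produces this shape automatically is exactly what the question asks.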

(Optional) Propose a correction

No response

Contributing Guidelines

  • I agree to follow cuTile Python's contributing guidelines
  • I have searched the open documentation and have found no duplicates for this documentation request
