Labels: documentation, status: waiting-for-feedback
Description
How would you describe the priority of this documentation request?
High
Please provide a link or source to the relevant docs
https://github.com/NVIDIA/cutile-python/blob/main/samples/MatMul.py
Describe the problems in the documentation
Referring to the matmul sample code:

```python
for k in range(num_tiles_k):
    # Load a tile from matrix A.
    # `index=(bidx, k)` specifies which (M-tile, K-tile) to load from
    # global memory A; `shape=(tm, tk)` defines the size of this tile.
    a = ct.load(A, index=(bidx, k), shape=(tm, tk), padding_mode=zero_pad).astype(dtype)
    # Load a tile from matrix B.
    # `index=(k, bidy)` specifies which (K-tile, N-tile) to load from
    # global memory B; `shape=(tk, tn)` defines the size of this tile.
    b = ct.load(B, index=(k, bidy), shape=(tk, tn), padding_mode=zero_pad).astype(dtype)
    # Multiply the current tiles: `ct.mma` computes the product of the
    # two loaded tiles and accumulates the result.
    accumulator = ct.mma(a, b, accumulator)
```
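For context on the questions below: each iteration loads the A and B tiles immediately before the `ct.mma` that consumes them, so there is no visible prefetching. A hand-written two-stage (double-buffered) schedule of the same loop would look roughly like this sketch. It is hypothetical and reuses only the `ct.load`/`ct.mma` calls from the excerpt; whether cuTile's compiler already performs an equivalent transformation under the hood is exactly what I am asking.

```python
# Hypothetical two-stage schedule of the quoted K-loop (a sketch, not
# verified against the cuTile API; only ct.load/ct.mma come from the sample).
# Prologue: issue the loads for the first K-tile before the loop.
a = ct.load(A, index=(bidx, 0), shape=(tm, tk), padding_mode=zero_pad).astype(dtype)
b = ct.load(B, index=(0, bidy), shape=(tk, tn), padding_mode=zero_pad).astype(dtype)
for k in range(num_tiles_k):
    if k + 1 < num_tiles_k:
        # Prefetch the next K-tile so that, if loads are asynchronous
        # under the hood, they can overlap with the mma on the current tiles.
        a_next = ct.load(A, index=(bidx, k + 1), shape=(tm, tk), padding_mode=zero_pad).astype(dtype)
        b_next = ct.load(B, index=(k + 1, bidy), shape=(tk, tn), padding_mode=zero_pad).astype(dtype)
    accumulator = ct.mma(a, b, accumulator)
    if k + 1 < num_tiles_k:
        a, b = a_next, b_next
```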
- It seems no multistage/warp-specialization strategy is applied to this GEMM? (See the two-stage sketch above for what I mean by multistage.)
- If so, is cuTile's performance comparable to CUTLASS/CuTe DSL or Triton, or does cuTile have compiler passes that apply this kind of optimization automatically? (Triton's equivalent is shown below for comparison.)
- I would also like to know how to use cp.async/TMA/mbarrier in cuTile.
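For comparison on the second question: Triton does not require the pipelined schedule to be written by hand; its compiler software-pipelines the K-loop automatically (using cp.async on Ampere and TMA/mbarrier-based copies on Hopper where applicable), with the pipeline depth exposed as `num_stages`. A minimal sketch of the equivalent Triton kernel, assuming M and N are multiples of the block sizes:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(A, B, C, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = A + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = B + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Zero-pad out-of-range K elements, like padding_mode=zero_pad above.
        a = tl.load(a_ptrs, mask=(offs_k[None, :] + k) < K, other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k) < K, other=0.0)
        acc = tl.dot(a, b, acc)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = C + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)

# num_stages sets the software-pipeline depth the compiler builds around the
# loop; with num_stages > 1 it overlaps the async tile loads for iteration
# k+1 (and beyond) with tl.dot on iteration k.
# matmul_kernel[(triton.cdiv(M, 128), triton.cdiv(N, 128))](
#     a, b, c, M, N, K, *strides,
#     BLOCK_M=128, BLOCK_N=128, BLOCK_K=32,
#     num_stages=4, num_warps=8)
```

I am wondering whether cuTile has (or plans) a comparable knob or pass, and how it maps `ct.load` onto the asynchronous copy hardware.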
(Optional) Propose a correction
No response
Contributing Guidelines
- I agree to follow cuTile Python's contributing guidelines
- I have searched the open documentation and have found no duplicates for this documentation request