The current kernel uses the following FP4 Tensor Core instruction:
mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale.scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3
This instruction is only supported on SM_120+ (Blackwell architecture; Hopper is SM_90 and does not provide it). Our target platform is SM_101, which does not support this mma.sync block-scaled FP4 form, so the instruction cannot execute there.
Is there a recommended method to emulate this FP4 Tensor Core MMA on SM_101 while maintaining equivalent numerical results?
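For reference, this is my understanding of what the instruction computes for a single output element, written as a scalar host-side sketch rather than a real kernel. It assumes scale_vec::4X means K=64 is split into four blocks of 16 elements, each with one UE4M3 scale for the A row and one for the B column. The helper names (decode_e2m1, decode_ue4m3, block_scaled_dot) are hypothetical, and the inputs are stored one FP4 value per byte instead of in the packed register layout the real fragments use. The hardware's internal accumulation order and precision may differ from this loop, so bitwise-identical results are not guaranteed.

```cpp
#include <cmath>
#include <cstdint>

// Decode a 4-bit E2M1 value: 1 sign bit, 2 exponent bits, 1 mantissa bit.
// The eight representable magnitudes are listed explicitly.
inline float decode_e2m1(uint8_t nibble) {
    static const float kMagnitude[8] = {0.0f, 0.5f, 1.0f, 1.5f,
                                        2.0f, 3.0f, 4.0f, 6.0f};
    float mag = kMagnitude[nibble & 0x7];
    return (nibble & 0x8) ? -mag : mag;
}

// Decode an unsigned E4M3 scale: 4 exponent bits, 3 mantissa bits, bias 7.
// Assumption: the top bit of the byte is ignored and 0x7F encodes NaN.
inline float decode_ue4m3(uint8_t bits) {
    int exp = (bits >> 3) & 0xF;
    int man = bits & 0x7;
    if (exp == 0xF && man == 0x7) return NAN;
    if (exp == 0) return std::ldexp(static_cast<float>(man), -9);  // subnormal: (man/8) * 2^-6
    return std::ldexp(1.0f + man / 8.0f, exp - 7);
}

// Hypothetical reference for one output element D[i][j]:
//   D = C + sum over 4 blocks of (scaleA[blk] * scaleB[blk] * dot16(A_blk, B_blk))
// a and b each hold 64 unpacked E2M1 values (one per byte, low nibble used).
inline float block_scaled_dot(const uint8_t* a, const uint8_t* b,
                              const uint8_t* scale_a, const uint8_t* scale_b,
                              float c) {
    constexpr int kBlocks = 4;
    constexpr int kBlockSize = 16;
    float acc = c;
    for (int blk = 0; blk < kBlocks; ++blk) {
        float partial = 0.0f;
        for (int k = 0; k < kBlockSize; ++k) {
            int idx = blk * kBlockSize + k;
            partial += decode_e2m1(a[idx]) * decode_e2m1(b[idx]);
        }
        acc += decode_ue4m3(scale_a[blk]) * decode_ue4m3(scale_b[blk]) * partial;
    }
    return acc;
}
```

A device-side emulation would presumably do the same decode per thread, then either accumulate in FP32 on CUDA cores or convert the scaled values to a wider type and feed a Tensor Core MMA that the target does support. Whether either route matches the native instruction bit for bit is exactly what I am unsure about.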