
Commit 849bfe1

Optimize AVX2 Haswell DGEMM SUP Kernels for Improved FMA Throughput (#894)
Details:
- This commit enhances the performance of AVX2 DGEMM SUP edge kernels by addressing FMA instruction latency in low-computation scenarios, particularly the corner cases handled by edge kernels.
- Key Improvements:
  - Reduced FMA Latency: Previously, edge kernels reused a limited set of vector registers to hold FMA results, creating dependencies that forced the CPU to wait for prior FMA instructions to complete before issuing new ones. This bottleneck was especially pronounced in small matrix multiplications.
  - Register Set Expansion: The updated implementation uses two distinct sets of vector registers to hold intermediate FMA results, so subsequent FMA instructions can proceed without waiting for previous ones, improving instruction-level parallelism and throughput.
  - Final Accumulation Strategy: At the end of the unrolled K-loop, the two register sets are summed to produce the final result, preserving correctness while keeping the performance gains (a minimal sketch of this pattern follows below).
- Modified Kernels:
  - m_left edge kernels:
    bli_dgemmsup_rv_haswell_asm_6x8m
    bli_dgemmsup_rv_haswell_asm_6x6m
    bli_dgemmsup_rv_haswell_asm_6x4m
    bli_dgemmsup_rv_haswell_asm_6x2m
  - mn_left edge kernels:
    bli_dgemmsup_rv_haswell_asm_{1..6}x{2,4,6,8}
- Newly Added Kernels:
  - m_left kernels:
    bli_dgemmsup_rv_haswell_asm_6x{1,3,5,7}m
  - mn_left kernels:
    bli_dgemmsup_rv_haswell_asm_{1..6}x{1,3,5,7}

These additions ensure comprehensive coverage of all edge-case matrix sizes, improving robustness and performance consistency across the DGEMM SUP microkernel suite.
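To illustrate the register-splitting idea described above, here is a minimal, hypothetical C-intrinsics sketch. It is not taken from the BLIS kernels (which are written with the assembly macros in bli_x86_asm_macros.h); the function and variable names are illustrative. Two independent accumulators break the dependency chain between consecutive FMAs, and the two partial sums are combined once after the unrolled loop.

#include <immintrin.h>

/* Hypothetical sketch of the dual-accumulator FMA pattern. */
static double ddot_dual_accum( const double* a, const double* b, int k )
{
    __m256d acc0 = _mm256_setzero_pd();   /* first accumulator set  */
    __m256d acc1 = _mm256_setzero_pd();   /* second accumulator set */

    int i = 0;
    for ( ; i + 8 <= k; i += 8 )
    {
        /* The two FMAs target different registers, so the second can issue
           without waiting for the first to complete. */
        acc0 = _mm256_fmadd_pd( _mm256_loadu_pd( a + i     ),
                                _mm256_loadu_pd( b + i     ), acc0 );
        acc1 = _mm256_fmadd_pd( _mm256_loadu_pd( a + i + 4 ),
                                _mm256_loadu_pd( b + i + 4 ), acc1 );
    }

    /* Final accumulation: sum the two register sets once, after the loop. */
    __m256d acc = _mm256_add_pd( acc0, acc1 );

    double tmp[4];
    _mm256_storeu_pd( tmp, acc );
    double sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];

    for ( ; i < k; i++ )                  /* scalar tail */
        sum += a[i] * b[i];

    return sum;
}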
1 parent: 5c2b22d

11 files changed, +17217 −2347 lines

frame/include/bli_x86_asm_macros.h

Lines changed: 6 additions & 0 deletions
@@ -777,6 +777,7 @@
 #define VMOVDQA(_0, _1) INSTR_(vmovdqa, _0, _1)
 #define VMOVDQA32(_0, _1) INSTR_(vmovdqa32, _0, _1)
 #define VMOVDQA64(_0, _1) INSTR_(vmovdqa64, _0, _1)
+#define VMOVDQU(_0, _1) INSTR_(vmovdqu, _0, _1)
 #define VBROADCASTSS(_0, _1) INSTR_(vbroadcastss, _0, _1)
 #define VBROADCASTSD(_0, _1) INSTR_(vbroadcastsd, _0, _1)
 #define VPBROADCASTD(_0, _1) INSTR_(vpbroadcastd, _0, _1)
@@ -810,6 +811,7 @@
 #define vmovdqa(_0, _1) VMOVDQA(_0, _1)
 #define vmovdqa32(_0, _1) VMOVDQA32(_0, _1)
 #define vmovdqa64(_0, _1) VMOVDQA64(_0, _1)
+#define vmovdqu(_0, _1) VMOVDQU(_0, _1)
 #define vbroadcastss(_0, _1) VBROADCASTSS(_0, _1)
 #define vbroadcastsd(_0, _1) VBROADCASTSD(_0, _1)
 #define vpbroadcastd(_0, _1) VPBROADCASTD(_0, _1)
@@ -912,6 +914,8 @@
 #define VCOMISS(_0, _1) INSTR_(vcomiss, _0, _1)
 #define VCOMISD(_0, _1) INSTR_(vcomisd, _0, _1)

+#define VMASKMOVPD(_0, _1, _2) INSTR_(vmaskmovpd, _0, _1, _2)
+
 #define VFMADD132SS(_0, _1, _2) INSTR_(vfmadd132ss, _0, _1, _2)
 #define VFMADD213SS(_0, _1, _2) INSTR_(vfmadd213ss, _0, _1, _2)
 #define VFMADD231SS(_0, _1, _2) INSTR_(vfmadd231ss, _0, _1, _2)
@@ -1239,6 +1243,8 @@
 #define vblendmps(_0, _1, _2) VBLENDMSD(_0, _1, _2)
 #define vblendmpd(_0, _1, _2) VBLENDMPD(_0, _1, _2)

+#define vmaskmovpd(_0, _1, _2) VMASKMOVPD(_0, _1, _2)
+
 // Prefetches

 #define PREFETCH(_0, _1) INSTR_(prefetcht##_0, _1)
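The new VMASKMOVPD/vmaskmovpd wrappers support the odd-width (n = 1, 3, 5, 7) kernels added by this commit, where a full 256-bit load or store of a C micro-tile row would overrun the matrix edge. The fragment below is only an illustrative sketch in the style of these macros; the registers, mask contents, and memory operands are assumptions for illustration, not lifted from the actual kernels, and the operands are written assuming the file's source-first, destination-last ordering.

// Hypothetical fragment: ymm15 is assumed to hold a mask covering n_left
// doubles, ymm3 a broadcast of beta, and ymm4 the alpha*A*B accumulation.
vmaskmovpd(mem(rcx, 0*32), ymm15, ymm0)    // masked load of a partial C row into ymm0
vfmadd231pd(ymm0, ymm3, ymm4)              // ymm4 += beta * C
vmaskmovpd(ymm4, ymm15, mem(rcx, 0*32))    // masked store of the updated row back to C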
