Conversation

@Anndrey24
Contributor

Description

This commit introduces an f32 ASIMD softmax JIT implementation using the exp eltwise injector added in #4376. It also improves performance of the existing sve_* implementations, primarily by increasing the unrolling factor `unroll_regs_` and by skipping the multiplication with the default dequantization/requantization factors `src_scales` / `dst_scales`. For jit:asimd and jit:sve_128, the exp function is also effectively inlined by setting `preserve_vmm = false`; jit:sve_256 did not benefit from that change.

Because the previous softmax implementation relied heavily on predicated instructions, `jit_softmax_base_t` was refactored to contain only the logic common to SVE and non-SVE implementations. Two derived structs were added to handle the ISA-specific work: `jit_softmax_sve_t` and `jit_softmax_asimd_t`.

In addition, the JIT eltwise injector was changed to support storing/loading preserved vectors on non-SVE targets.

Performance improvements (f32)

c6g

| Shape | Threads | jit:asimd (ms) | acl (ms) | Speedup |
| --- | --- | --- | --- | --- |
| 1539x387 | 1 | 1.21689 | 1.5615 | 1.28 |
| 1539x387 | 4 | 0.306583 | 0.394197 | 1.29 |
| 1539x387 | 16 | 0.078976 | 0.103172 | 1.31 |
| 1539x387 | 64 | 0.02816 | 0.04522 | 1.61 |
| 1024x4096 | 1 | 8.12552 | 10.4083 | 1.28 |
| 1024x4096 | 4 | 2.05314 | 2.62449 | 1.28 |
| 1024x4096 | 16 | 0.526042 | 0.678114 | 1.29 |
| 1024x4096 | 64 | 0.13881 | 0.182793 | 1.32 |
| 4096x4096 | 1 | 32.5925 | 41.3373 | 1.27 |
| 4096x4096 | 4 | 8.19186 | 10.3651 | 1.27 |
| 4096x4096 | 16 | 2.0928 | 2.66398 | 1.27 |
| 4096x4096 | 64 | 0.734764 | 0.937735 | 1.28 |

c7g

| Shape | Threads | jit:sve_256 after (ms) | jit:sve_256 before (ms) | Speedup |
| --- | --- | --- | --- | --- |
| 1539x387 | 1 | 0.58647 | 0.748606 | 1.28 |
| 1539x387 | 4 | 0.150092 | 0.189787 | 1.26 |
| 1539x387 | 16 | 0.03906 | 0.049228 | 1.26 |
| 1539x387 | 64 | 0.018721 | 0.021218 | 1.13 |
| 1024x4096 | 1 | 3.94334 | 5.12185 | 1.30 |
| 1024x4096 | 4 | 0.991868 | 1.30929 | 1.32 |
| 1024x4096 | 16 | 0.24468 | 0.329952 | 1.35 |
| 1024x4096 | 64 | 0.084429 | 0.108232 | 1.28 |
| 4096x4096 | 1 | 15.9669 | 20.4236 | 1.28 |
| 4096x4096 | 4 | 4.08712 | 5.56156 | 1.36 |
| 4096x4096 | 16 | 1.08677 | 1.43602 | 1.32 |
| 4096x4096 | 64 | 0.369658 | 0.432615 | 1.17 |

c8g

| Shape | Threads | jit:sve_128 after (ms) | jit:sve_128 before (ms) | Speedup |
| --- | --- | --- | --- | --- |
| 1539x387 | 1 | 0.669235 | 0.863312 | 1.29 |
| 1539x387 | 4 | 0.168464 | 0.217245 | 1.29 |
| 1539x387 | 16 | 0.043956 | 0.055711 | 1.27 |
| 1539x387 | 64 | 0.018259 | 0.023519 | 1.29 |
| 1024x4096 | 1 | 4.95383 | 6.07039 | 1.23 |
| 1024x4096 | 4 | 1.17104 | 1.50691 | 1.29 |
| 1024x4096 | 16 | 0.295833 | 0.367653 | 1.24 |
| 1024x4096 | 64 | 0.09172 | 0.130347 | 1.42 |
| 4096x4096 | 1 | 20.0518 | 24.4886 | 1.22 |
| 4096x4096 | 4 | 5.11177 | 6.25783 | 1.22 |
| 4096x4096 | 16 | 1.3261 | 1.58102 | 1.19 |
| 4096x4096 | 64 | 0.341221 | 0.478697 | 1.40 |

@Anndrey24 Anndrey24 requested review from a team as code owners December 9, 2025 13:31
@github-actions github-actions bot added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 component:common labels Dec 9, 2025
@michalowski-arm
Contributor

As this change is pretty big, do you think it would be possible to split it neatly into two commits: one for the sve optimizations and one for the asimd impl? The sve changes could maybe even be a separate PR, e.g.:

- This commit moves all SVE-specific code into a new construct `jit_softmax_sve_t`.
- This commit introduces an f32 ASIMD `softmax` JIT implementation.
- This commit adapts some of the ASIMD softmax changes for the SVE kernels.

In particular, the `jit:sve_128` logic more closely resembles `jit:asimd` (e.g. its `exp` eltwise injector is inlined and uses `compute_vector_range()` instead of `compute_vector()`).
@Anndrey24
Contributor Author

Anndrey24 commented Dec 9, 2025

I've now split up the changes into 3 separate commits:

  1. cpu: aarch64: refactor jit_uni_softmax:
    • keeps ISA-agnostic logic in `jit_softmax_base_t`, while all SVE-specific code is moved into the new struct `jit_softmax_sve_t`.
    • most of the changes are due to indentation differences.
  2. cpu: aarch64: add ASIMD softmax JIT implementation:
    • adds the ASIMD kernel, but also improves the SVE kernels, as the unroll-factor change is made directly in the common base struct `jit_softmax_base_t`.
  3. cpu: aarch64: improve SVE JIT softmax performance:
    • adapts some of the ASIMD performance gains for the SVE kernels too, in particular sve_128, since it shares the same vector length as ASIMD.

I will move the final commit to a follow-up PR if you think that's best. I've kept all 3 together for now because, with the SVE improvements split between commits 2 and 3, the c7g/c8g speedups would be less noticeable at a glance than when presented together in a single table as above.
