(Per the conversation at the 2025-07-19 SIMD meeting, filing it here.)
Currently the LLVM toolchain generates shuffle operations that are hard to translate efficiently to native code. The shuffle operation itself is doing the right thing, but producing an efficient translation to native instructions is genuinely difficult. For example (1):
i8x16.shuffle 0 0 0 0 1 0 0 0 2 0 0 0 3 0 0 0
v128.const i32x4 0x00000001 0x00000001 0x00000001 0x00000001
v128.and
The shuffle operation will be encoded as multiple native instructions, including loads of additional constants. But it could be encoded more efficiently if it were known that some lanes are discarded in the final value: e.g. if any lane may be placed in the 'x' positions of 0 x x x 1 x x x 2 x x x 3 x x x, a couple of zero-extend instructions can be used instead. I guess this code was produced by an auto-vectorizer, which picked lane 0 as an arbitrary filler.
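To make the contrast concrete, here is a minimal sketch in C with x86 SSE4.1 intrinsics, assuming an x86-64 target. It only illustrates one possible lowering, not what any particular engine actually emits, and the function names are made up:

```c
#include <stdint.h>
#include <smmintrin.h>  // SSE4.1

// What the shuffle + and pair forces if the engine cannot assume anything
// about the masked-out lanes: a pshufb with a loaded control constant,
// followed by the pand. (Sketch only; real engines differ in details.)
static __m128i lower_exact(__m128i v) {
    const __m128i ctrl = _mm_setr_epi8(0, 0, 0, 0, 1, 0, 0, 0,
                                       2, 0, 0, 0, 3, 0, 0, 0);
    __m128i shuffled = _mm_shuffle_epi8(v, ctrl);        // pshufb + constant load
    return _mm_and_si128(shuffled, _mm_set1_epi32(1));   // pand + constant load
}

// What becomes possible if the 'x' lanes are known to be don't-care:
// a single byte->dword zero extend replaces the shuffle and its constant.
static __m128i lower_with_dont_care(__m128i v) {
    __m128i widened = _mm_cvtepu8_epi32(v);              // pmovzxbd
    return _mm_and_si128(widened, _mm_set1_epi32(1));    // pand
}
```

The first version needs a shuffle control constant in addition to the mask; the second replaces the shuffle and its constant with a single pmovzxbd and produces the same final i32x4 value.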
More interesting operations:
i8x16.shuffle 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0
v128.store16_lane align=1 0
i8x16.shuffle 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0
i16x8.extend_low_i8x16_u
i32x4.extend_low_i16x8_u
or
i8x16.shuffle 8 9 10 11 12 13 14 15 0 1 0 1 0 1 0 1
i32x4.extend_low_i16x8_u
local.get 3
local.get 3
local.get 3
i8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
i8x16.min_u
local.tee 3
local.get 3
local.get 3
i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
i8x16.min_u
local.tee 3
local.get 3
local.get 3
i8x16.shuffle 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i8x16.min_u
local.tee 3
local.get 3
local.get 3
i8x16.shuffle 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i8x16.min_u
i8x16.extract_lane_u 0
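For the shuffle + extend snippets above, the same observation holds: only the low lanes of the shuffle result survive the extends, so the filler lanes are dead. A minimal sketch, again assuming x86 SSE4.1 intrinsics as one possible lowering (not engine output):

```c
#include <smmintrin.h>  // SSE4.1

// i8x16.shuffle 12 13 14 15 ... / i16x8.extend_low_i8x16_u / i32x4.extend_low_i16x8_u:
// only bytes 12..15 of the shuffle result are observed, so with don't-care
// filler the shuffle can become a plain byte shift (no control constant)
// followed by a zero extend. Sketch only.
static __m128i widen_high4_bytes(__m128i v) {
    __m128i top = _mm_srli_si128(v, 12);   // psrldq: bytes 12..15 -> 0..3, zero fill
    return _mm_cvtepu8_epi32(top);         // pmovzxbd
}

// Same idea for the i16 variant (shuffle 8..15 into the low half, then
// i32x4.extend_low_i16x8_u): the high 8 lanes of the shuffle are never observed.
static __m128i widen_high4_words(__m128i v) {
    __m128i top = _mm_srli_si128(v, 8);    // psrldq: bytes 8..15 -> 0..7
    return _mm_cvtepu16_epi32(top);        // pmovzxwd
}
```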
The common thread is that if it is known that the lane-0 references used as filler are not important, it is easier to select more performant instructions when compiling i8x16.shuffle. If this burden falls on the toolchain/auto-vectorizer instead, the selected shuffle may "prefer" one CPU over another.
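As an illustration of this using the final snippet (a horizontal u8 minimum): if an engine knew the filler lanes of each i8x16.shuffle were don't-care, every shuffle in that chain could degenerate to a plain byte shift with no control constant. A minimal sketch, again assuming x86 SSE4.1; the filler lanes end up as zeros here instead of copies of lane 0, but the extracted lane 0 is identical:

```c
#include <stdint.h>
#include <smmintrin.h>  // SSE4.1

// Horizontal minimum of 16 unsigned bytes, the same reduction as the last
// snippet. With don't-care filler lanes, each i8x16.shuffle degenerates to a
// byte shift (psrldq, no constant), interleaved with pminub. Sketch only.
static uint8_t hmin_u8x16(__m128i v) {
    __m128i m = _mm_min_epu8(v, _mm_srli_si128(v, 8));  // fold 16 lanes -> 8
    m = _mm_min_epu8(m, _mm_srli_si128(m, 4));          // 8 -> 4
    m = _mm_min_epu8(m, _mm_srli_si128(m, 2));          // 4 -> 2
    m = _mm_min_epu8(m, _mm_srli_si128(m, 1));          // 2 -> 1
    return (uint8_t)_mm_extract_epi8(m, 0);             // i8x16.extract_lane_u 0
}
```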
The snippets are taken from https://cdn.jsdelivr.net/npm/[email protected]/dist/ort-wasm-simd.jsep.wasm