(Per the conversation at the 2025-07-19 SIMD meeting, filing it here.)
Currently the LLVM toolchain generates shuffle operations that are hard to translate efficiently to native code. The shuffle operation itself is doing the right thing, but producing an efficient translation to native instructions is genuinely difficult. For example (1):
i8x16.shuffle 0 0 0 0 1 0 0 0 2 0 0 0 3 0 0 0
v128.const i32x4 0x00000001 0x00000001 0x00000001 0x00000001
v128.and
The shuffle operation will be encoded as multiple native instructions, including loads of additional constants. But it could be encoded more efficiently if it were known that some lanes are discarded in the final value: e.g. if any lane may be placed in the 'x' positions of 0 x x x 1 x x x 2 x x x 3 x x x, a couple of zero-extend instructions can be used instead. I guess this code was produced by an auto-vectorizer, which picked lane 0 as an arbitrary filler.
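To make the contrast concrete, here is a minimal sketch in C with x86 SSE4.1 intrinsics, assuming an x86-64 target. It only illustrates one possible lowering, not what any particular engine actually emits, and the function names are made up:

```c
#include <stdint.h>
#include <smmintrin.h>  // SSE4.1

// What the shuffle + and pair forces if the engine cannot assume anything
// about the masked-out lanes: a pshufb with a loaded control constant,
// followed by the pand. (Sketch only; real engines differ in details.)
static __m128i lower_exact(__m128i v) {
    const __m128i ctrl = _mm_setr_epi8(0, 0, 0, 0, 1, 0, 0, 0,
                                       2, 0, 0, 0, 3, 0, 0, 0);
    __m128i shuffled = _mm_shuffle_epi8(v, ctrl);        // pshufb + constant load
    return _mm_and_si128(shuffled, _mm_set1_epi32(1));   // pand + constant load
}

// What becomes possible if the 'x' lanes are known to be don't-care:
// a single byte->dword zero extend replaces the shuffle and its constant.
static __m128i lower_with_dont_care(__m128i v) {
    __m128i widened = _mm_cvtepu8_epi32(v);              // pmovzxbd
    return _mm_and_si128(widened, _mm_set1_epi32(1));    // pand
}
```

The first version needs a shuffle control constant in addition to the mask; the second replaces the shuffle and its constant with a single pmovzxbd and produces the same final i32x4 value.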
More interesting operations:
i8x16.shuffle 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0
v128.store16_lane align=1 0
i8x16.shuffle 12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0
i16x8.extend_low_i8x16_u
i32x4.extend_low_i16x8_u
or
i8x16.shuffle 8 9 10 11 12 13 14 15 0 1 0 1 0 1 0 1
i32x4.extend_low_i16x8_u
local.get 3
local.get 3
local.get 3
i8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
i8x16.min_u
local.tee 3
local.get 3
local.get 3
i8x16.shuffle 4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
i8x16.min_u
local.tee 3
local.get 3
local.get 3
i8x16.shuffle 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i8x16.min_u
local.tee 3
local.get 3
local.get 3
i8x16.shuffle 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i8x16.min_u
i8x16.extract_lane_u 0
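For the shuffle + extend snippets above, the same observation holds: only the low lanes of the shuffle result survive the extends, so the filler lanes are dead. A minimal sketch, again assuming x86 SSE4.1 intrinsics as one possible lowering (not engine output):

```c
#include <smmintrin.h>  // SSE4.1

// i8x16.shuffle 12 13 14 15 ... / i16x8.extend_low_i8x16_u / i32x4.extend_low_i16x8_u:
// only bytes 12..15 of the shuffle result are observed, so with don't-care
// filler the shuffle can become a plain byte shift (no control constant)
// followed by a zero extend. Sketch only.
static __m128i widen_high4_bytes(__m128i v) {
    __m128i top = _mm_srli_si128(v, 12);   // psrldq: bytes 12..15 -> 0..3, zero fill
    return _mm_cvtepu8_epi32(top);         // pmovzxbd
}

// Same idea for the i16 variant (shuffle 8..15 into the low half, then
// i32x4.extend_low_i16x8_u): the high 8 lanes of the shuffle are never observed.
static __m128i widen_high4_words(__m128i v) {
    __m128i top = _mm_srli_si128(v, 8);    // psrldq: bytes 8..15 -> 0..7
    return _mm_cvtepu16_epi32(top);        // pmovzxwd
}
```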
The common thread is that if it is known that the lane-0 references used as filler are not important, it is easier to select more performant instructions when compiling i8x16.shuffle. If this burden falls on the toolchain/auto-vectorizer instead, the selected shuffle may "prefer" one CPU over another.
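As an illustration of this using the final snippet (a horizontal u8 minimum): if an engine knew the filler lanes of each i8x16.shuffle were don't-care, every shuffle in that chain could degenerate to a plain byte shift with no control constant. A minimal sketch, again assuming x86 SSE4.1; the filler lanes end up as zeros here instead of copies of lane 0, but the extracted lane 0 is identical:

```c
#include <stdint.h>
#include <smmintrin.h>  // SSE4.1

// Horizontal minimum of 16 unsigned bytes, the same reduction as the last
// snippet. With don't-care filler lanes, each i8x16.shuffle degenerates to a
// byte shift (psrldq, no constant), interleaved with pminub. Sketch only.
static uint8_t hmin_u8x16(__m128i v) {
    __m128i m = _mm_min_epu8(v, _mm_srli_si128(v, 8));  // fold 16 lanes -> 8
    m = _mm_min_epu8(m, _mm_srli_si128(m, 4));          // 8 -> 4
    m = _mm_min_epu8(m, _mm_srli_si128(m, 2));          // 4 -> 2
    m = _mm_min_epu8(m, _mm_srli_si128(m, 1));          // 2 -> 1
    return (uint8_t)_mm_extract_epi8(m, 0);             // i8x16.extract_lane_u 0
}
```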
The snippets are taken from https://cdn.jsdelivr.net/npm/[email protected]/dist/ort-wasm-simd.jsep.wasm