QNN HTP Graph Finalize failure on quantized ViT attention — q::linearclip unsupported 5D layout for [1, 1028, 384] tensor #4086
Description
Environment:
- AIMET-ONNX: 2.22.0+cu121
- QNN SDK: v2.41.0
- HTP target: v73 (soc_id 52, dsp_arch v73)
- Encoding format: 0.6.1
- Python 3.10, PyTorch 2.1.2, CUDA 12.1
Model:
RF-DETR Small — DINOv2-S/16 ViT backbone with windowed attention (from Roboflow RF-DETR)
Summary:
The model compiles and runs successfully on HTP in FP16 (60.4ms latency, mAP 0.6065 on COCO val). However, any quantization that touches the backbone encoder causes qnn-context-binary-generator to fail at graph finalization. This is not limited to specific layers — it affects the entire DINOv2 ViT encoder.
AIMET quantization simulation shows excellent accuracy:
- W8A16 full model: mAP -1.13% (0.5403 vs 0.5465 baseline) — but fails at QNN graph finalize
- W8A8 full model: mAP -34% — also fails at QNN graph finalize
- W8A8 with entire backbone excluded: compiles, but latency is worse than FP16 (67ms vs 60.4ms) and mAP drops 35% since only the small decoder/heads are quantized
Root Cause Analysis:
The DINOv2 backbone uses windowed attention. Windowed layers process tokens in 4 windows: [4, 257, 384]. Global attention layers (3, 6, 9) merge all windows: [1, 1028, 384]. The nn.Linear query projection + bias Add operates on these 3D tensors.
The QNN HTP compiler maps these 3D shapes to an internal 5D layout (QuantUint16_5D_TCMEE for W8A16, QuantUint8_5D_TCMEE for W8A8). The quantized Add kernel (q::linearclip) does not support this 5D layout, causing graph finalization to fail.
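For context, the tensor shapes that trigger the 5D layout can be reproduced with a short numpy sketch (illustrative only; the split/merge below mirrors the shapes reported in the log, not RF-DETR's actual window-attention code):

```python
import numpy as np

# DINOv2-S/16 windowed attention on a 1028-token sequence (384-dim embeddings).
# Windowed layers see 4 windows of 257 tokens each; the global attention
# layers (3, 6, 9) merge all windows back into a single sequence.
windowed = np.zeros((4, 257, 384), dtype=np.float32)  # shape in windowed layers
merged = windowed.reshape(1, 4 * 257, 384)            # shape in global layers

assert merged.shape == (1, 1028, 384)

# The nn.Linear query projection + bias Add operates on these 3D tensors;
# the HTP compiler internally promotes both to a 5D layout (..._5D_TCMEE),
# which the quantized q::linearclip Add kernel then rejects.
print(windowed.shape, merged.shape)
```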
Critically, this is not limited to the global attention layers. We attempted progressive exclusion:
- Exclude global attention only (layers 3, 6, 9): the `q::linearclip` failure moved to the windowed attention layers (0, 2, 7, 8); their `value_Add` ops now fail with the same `QuantUint16_5D_TCMEE` error.
- Exclude ALL 12 backbone attention layers: the failure cascaded to everything downstream: backbone `layer_scale` Mul, `norm` LayerNormalization, projector Conv/BN/SiLU, and the entire transformer decoder (Add, self_attn, cross_attn, LayerNorm). 129 ops failed in total.
The cascade happens because excluding attention creates FP16→quantized boundaries. The downstream quantized ops inherit the problematic 5D internal layout from the FP16 attention outputs, and q::linearclip fails on them too. Partial exclusion within the backbone encoder is not viable.
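The boundary effect can be illustrated on a toy graph: every quantized op that directly consumes the output of an excluded (FP16) op sits on a mixed-precision boundary, so excluding one layer only moves the failure one hop downstream. This is a hypothetical sketch with made-up node names, not the AIMET/QNN API:

```python
# Toy producer -> consumers edges modeling a slice of the graph
# (hypothetical names; not the real ONNX node list).
edges = {
    "attn_0/value_Add": ["layer_scale_0/Mul"],
    "layer_scale_0/Mul": ["norm_1/LayerNorm"],
    "norm_1/LayerNorm": ["decoder/Add"],
}
excluded_fp16 = {"attn_0/value_Add"}  # ops forced to FP16 via exclusion

def boundary_ops(edges, excluded):
    """Quantized ops that directly consume an FP16 (excluded) output."""
    return sorted(
        consumer
        for producer, consumers in edges.items()
        if producer in excluded
        for consumer in consumers
        if consumer not in excluded
    )

# Excluding the attention Add just moves the failure one hop downstream:
print(boundary_ops(edges, excluded_fp16))  # -> ['layer_scale_0/Mul']
```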
Error Log (from qnn-context-binary-generator):
W8A16 full model (first failure):
graph_prepare.cc:222::ERROR:could not create op: q::linearclip
graph_prepare.cc:224::ERROR:Op creation failure, op id=0x5a7b000000196 (q::linearclip) total_inputs=2
graph_prepare.cc:210: Input 0: id=[0x51be500000196] op=[[email protected]] output0=[14ConcreteTensorIN5Tdefs18QuantUint16_5D_TCMEE]
graph_prepare.cc:210: Input 1: id=[0x5a7af00000196] op=[Const] output0=[14ConcreteTensorIN5Tdefs5Int32EE]
graph_prepare.cc:1658::ERROR:Op 0x5a7b000000196 preparation failed with err:-1
W8A16 with layers 3,6,9 attention excluded (failure moves to windowed layers):
"_backbone_backbone_0_encoder_encoder_encoder_layer_0_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_2_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_7_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_8_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_Transpose_1" generated: could not create op
W8A16 with ALL backbone attention excluded (129 ops cascade):
"_backbone_backbone_0_encoder_encoder_encoder_layer_0_layer_scale2_Mul" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_10_norm2_LayerNormalization" generated: could not create op
"_backbone_backbone_0_projector_stages_0_stages_0_0_m_0_cv2_conv_Conv" generated: could not create op
"_transformer_decoder_layers_0_Add" generated: could not create op
"_transformer_decoder_layers_1_self_attn_MatMul" generated: could not create op
"_transformer_decoder_layers_2_cross_attn_Mul_4" generated: could not create op
... (129 total failing ops across backbone, projector, decoder, and heads)
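The per-attempt failure counts above were tallied by scanning the generator log for the failure pattern; a small hypothetical helper (the regex matches the log excerpts shown above):

```python
import re

# Matches lines like:
#   "_transformer_decoder_layers_0_Add" generated: could not create op
FAIL_RE = re.compile(r'"([^"]+)" generated: could not create op')

def failing_ops(log_text):
    """Return the list of op names that failed graph finalization."""
    return FAIL_RE.findall(log_text)

# Two sample lines from the W8A16 all-attention-excluded run:
log = '''
"_backbone_backbone_0_encoder_encoder_encoder_layer_0_layer_scale2_Mul" generated: could not create op
"_transformer_decoder_layers_0_Add" generated: could not create op
'''
ops = failing_ops(log)
print(len(ops), ops)  # 2 failing ops in this excerpt
```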
Failing ONNX nodes (progressive exclusion attempts):
| Attempt | Excluded | New failures | Root cause |
|---|---|---|---|
| W8A16 full model | nothing | layers 3,6,9 query/key/value Add, Transpose | `q::linearclip` on `QuantUint16_5D_TCMEE` |
| W8A16 exclude layers 3,6,9 attn | `layer.{3,6,9}/attention` | layers 0,2,7,8 value Add | Same 5D layout issue in windowed layers |
| W8A16 exclude ALL 12 attn layers | `attention/attention` | 129 ops: layer_scale, norm, projector, decoder, heads | FP16→quant boundary cascade |
| W8A8 full model | nothing | layers 3,5,6,8,9 query Add, Transpose | Same, also hits windowed layers |
| W8A8 exclude backbone | `backbone` | compiles | 67 ms latency, mAP 0.4518 (worse than FP16) |
Results summary:
| Configuration | AIMET Sim mAP | QNN Compile | On-target latency | On-target mAP |
|---|---|---|---|---|
| FP16 (no quant) | — | OK | 60.4 ms | 0.6065 |
| W8A16 full model | 0.5403 (-1.13%) | FAIL | — | — |
| W8A8 full model | 0.3594 (-34%) | FAIL | — | — |
| W8A8, backbone excluded | 0.3553 (-35%) | OK | 67 ms (worse) | 0.4518 |
| W8A16, layers 3,6,9 attn excluded | — | FAIL | — | — |
| W8A16, all 12 attn layers excluded | — | FAIL | — | — |
Questions:
- Is `q::linearclip` on the 5D internal layout (`QuantUint16_5D_TCMEE` / `QuantUint8_5D_TCMEE`) a known HTP v73 limitation? Is it fixed in newer QNN SDK versions?
- The 5D layout assignment appears to be an internal HTP compiler decision for 3D tensors like `[1, 1028, 384]` and `[4, 257, 384]`. Is there any way to influence this layout choice?
- Partial exclusion within the backbone is not viable due to FP16→quant boundary cascades. Is there a recommended approach for quantizing ViT/DINOv2 models for HTP? Has any ViT-based model been successfully quantized for HTP v73?
- Would a different encoding version or AIMET configuration help avoid this specific HTP compilation path?
QNN Backend / Graph Config:
backend_config.json:

```json
{
  "backend_extensions": {
    "shared_library_path": "libQnnHtpNetRunExtensions.so",
    "config_file_path": "graph_config.json"
  },
  "context_configs": {
    "context_priority": "normal"
  }
}
```

graph_config.json:

```json
{
  "graphs": [
    {
      "graph_names": ["model"],
      "fp16_relaxed_precision": 1,
      "vtcm_mb": 8,
      "O": 3,
      "finalize_config": {"P": 3}
    }
  ],
  "devices": [
    {
      "device_id": 0,
      "soc_id": 52,
      "dsp_arch": "v73"
    }
  ]
}
```

We reviewed the full set of `QnnHtpGraph_ConfigOption_t` options from QnnHtpGraph.h (SDK v2.41.0). None of the available graph config options (`vtcm_mb`, `O`, `dlbc`, `fold_relu_activation_into_conv_off`, `short_depth_conv_on_hmx_off`, `advanced_activation_fusion`, `use_high_precision_fp16_sigmoid`, `weights_packing`, `num_cores`) control internal tensor layout or provide a way to keep specific ops in FP16 within a quantized graph. The precision option (`QNN_HTP_GRAPH_CONFIG_OPTION_PRECISION`) has been deprecated since SDK 2.35 and only applies to FP32→FP16 math for float graphs.
Reproduction:
The model is RF-DETR Small from https://github.com/roboflow/rf-detr. Export with `model.export()`, then `torch.onnx.export(model, dummy, "model.onnx", opset_version=17)`. Quantize with AIMET-ONNX `QuantizationSimModel` using `config_file="htp_v73"`, W8A16 precision, the `tf_enhanced` scheme, and 1000 calibration samples from COCO train. Convert with `qnn-onnx-converter` (opset 17, `--quantization_overrides`, `--float_fallback`). Generate the context binary with `qnn-context-binary-generator --backend libQnnHtp.so --config_file backend_config.json`.
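As a command-line sketch of the conversion/compile steps (file names such as `model.encodings`, `libmodel.so`, and `model_ctx.bin` are placeholders; only the flags named above are taken from our actual run):

```shell
# Convert the exported ONNX model, applying the AIMET encodings and letting
# any op without an encoding fall back to float:
qnn-onnx-converter \
    --input_network model.onnx \
    --quantization_overrides model.encodings \
    --float_fallback \
    --output_path model.cpp

# Compile the converted model into a library (qnn-model-lib-generator step
# omitted here), then generate the HTP context binary; this is the step
# that fails at graph finalize with the q::linearclip error:
qnn-context-binary-generator \
    --backend libQnnHtp.so \
    --model libmodel.so \
    --binary_file model_ctx.bin \
    --config_file backend_config.json
```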