QNN HTP Graph Finalize failure on quantized ViT attention — q::linearclip unsupported 5D layout for [1, 1028, 384] tensor #4086
Description
Environment:
- AIMET-ONNX: 2.22.0+cu121
- QNN SDK: v2.41.0
- HTP target: v73 (soc_id 52, dsp_arch v73)
- Encoding format: 0.6.1
- Python 3.10, PyTorch 2.1.2, CUDA 12.1
Model:
RF-DETR Small — DINOv2-S/16 ViT backbone with windowed attention (from Roboflow RF-DETR)
Summary:
The model compiles and runs successfully on HTP in FP16 (60.4ms latency, mAP 0.6065 on COCO val). However, any quantization that touches the backbone encoder causes qnn-context-binary-generator to fail at graph finalization. This is not limited to specific layers — it affects the entire DINOv2 ViT encoder.
AIMET quantization simulation shows excellent accuracy:
- W8A16 full model: mAP -1.13% (0.5403 vs 0.5465 baseline) — but fails at QNN graph finalize
- W8A8 full model: mAP -34% — also fails at QNN graph finalize
- W8A8 with entire backbone excluded: compiles, but latency is worse than FP16 (67ms vs 60.4ms) and mAP drops 35% since only the small decoder/heads are quantized
Root Cause Analysis:
The DINOv2 backbone uses windowed attention. Windowed layers process tokens in 4 windows: [4, 257, 384]. Global attention layers (3, 6, 9) merge all windows: [1, 1028, 384]. The nn.Linear query projection + bias Add operates on these 3D tensors.
The QNN HTP compiler maps these 3D shapes to an internal 5D layout (QuantUint16_5D_TCMEE for W8A16, QuantUint8_5D_TCMEE for W8A8). The quantized Add kernel (q::linearclip) does not support this 5D layout, causing graph finalization to fail.
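For context, the tensor shapes that trigger the 5D layout can be reproduced with a short numpy sketch (illustrative only; the split/merge below mirrors the shapes reported in the log, not RF-DETR's actual window-attention code):

```python
import numpy as np

# DINOv2-S/16 windowed attention on a 1028-token sequence (384-dim embeddings).
# Windowed layers see 4 windows of 257 tokens each; the global attention
# layers (3, 6, 9) merge all windows back into a single sequence.
windowed = np.zeros((4, 257, 384), dtype=np.float32)  # shape in windowed layers
merged = windowed.reshape(1, 4 * 257, 384)            # shape in global layers

assert merged.shape == (1, 1028, 384)

# The nn.Linear query projection + bias Add operates on these 3D tensors;
# the HTP compiler internally promotes both to a 5D layout (..._5D_TCMEE),
# which the quantized q::linearclip Add kernel then rejects.
print(windowed.shape, merged.shape)
```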
Critically, this is not limited to the global attention layers. We attempted progressive exclusion:
- Exclude global attention only (layers 3, 6, 9): the `q::linearclip` failure moved to the windowed attention layers (0, 2, 7, 8); their `value_Add` ops now fail with the same `QuantUint16_5D_TCMEE` error.
- Exclude ALL 12 backbone attention layers: the failure cascaded to everything downstream: backbone `layer_scale` Mul, `norm` LayerNormalization, projector Conv/BN/SiLU, and the entire transformer decoder (Add, self_attn, cross_attn, LayerNorm). 129 ops failed in total.
The cascade happens because excluding attention creates FP16→quantized boundaries. The downstream quantized ops inherit the problematic 5D internal layout from the FP16 attention outputs, and q::linearclip fails on them too. Partial exclusion within the backbone encoder is not viable.
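The boundary effect can be illustrated on a toy graph: every quantized op that directly consumes the output of an excluded (FP16) op sits on a mixed-precision boundary, so excluding one layer only moves the failure one hop downstream. This is a hypothetical sketch with made-up node names, not the AIMET/QNN API:

```python
# Toy producer -> consumers edges modeling a slice of the graph
# (hypothetical names; not the real ONNX node list).
edges = {
    "attn_0/value_Add": ["layer_scale_0/Mul"],
    "layer_scale_0/Mul": ["norm_1/LayerNorm"],
    "norm_1/LayerNorm": ["decoder/Add"],
}
excluded_fp16 = {"attn_0/value_Add"}  # ops forced to FP16 via exclusion

def boundary_ops(edges, excluded):
    """Quantized ops that directly consume an FP16 (excluded) output."""
    return sorted(
        consumer
        for producer, consumers in edges.items()
        if producer in excluded
        for consumer in consumers
        if consumer not in excluded
    )

# Excluding the attention Add just moves the failure one hop downstream:
print(boundary_ops(edges, excluded_fp16))  # -> ['layer_scale_0/Mul']
```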
Error Log (from qnn-context-binary-generator):
W8A16 full model (first failure):
graph_prepare.cc:222::ERROR:could not create op: q::linearclip
graph_prepare.cc:224::ERROR:Op creation failure, op id=0x5a7b000000196 (q::linearclip) total_inputs=2
graph_prepare.cc:210: Input 0: id=[0x51be500000196] op=[[email protected]] output0=[14ConcreteTensorIN5Tdefs18QuantUint16_5D_TCMEE]
graph_prepare.cc:210: Input 1: id=[0x5a7af00000196] op=[Const] output0=[14ConcreteTensorIN5Tdefs5Int32EE]
graph_prepare.cc:1658::ERROR:Op 0x5a7b000000196 preparation failed with err:-1
W8A16 with layers 3,6,9 attention excluded (failure moves to windowed layers):
"_backbone_backbone_0_encoder_encoder_encoder_layer_0_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_2_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_7_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_8_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_Transpose_1" generated: could not create op
W8A16 with ALL backbone attention excluded (129 ops cascade):
"_backbone_backbone_0_encoder_encoder_encoder_layer_0_layer_scale2_Mul" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_10_norm2_LayerNormalization" generated: could not create op
"_backbone_backbone_0_projector_stages_0_stages_0_0_m_0_cv2_conv_Conv" generated: could not create op
"_transformer_decoder_layers_0_Add" generated: could not create op
"_transformer_decoder_layers_1_self_attn_MatMul" generated: could not create op
"_transformer_decoder_layers_2_cross_attn_Mul_4" generated: could not create op
... (129 total failing ops across backbone, projector, decoder, and heads)
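The per-attempt failure counts above were tallied by scanning the generator log for the failure pattern; a small hypothetical helper (the regex matches the log excerpts shown above):

```python
import re

# Matches lines like:
#   "_transformer_decoder_layers_0_Add" generated: could not create op
FAIL_RE = re.compile(r'"([^"]+)" generated: could not create op')

def failing_ops(log_text):
    """Return the list of op names that failed graph finalization."""
    return FAIL_RE.findall(log_text)

# Two sample lines from the W8A16 all-attention-excluded run:
log = '''
"_backbone_backbone_0_encoder_encoder_encoder_layer_0_layer_scale2_Mul" generated: could not create op
"_transformer_decoder_layers_0_Add" generated: could not create op
'''
ops = failing_ops(log)
print(len(ops), ops)  # 2 failing ops in this excerpt
```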
Failing ONNX nodes (progressive exclusion attempts):
| Attempt | Excluded | New failures | Root cause |
|---|---|---|---|
| W8A16 full model | nothing | layers 3,6,9 query/key/value Add, Transpose | `q::linearclip` on `QuantUint16_5D_TCMEE` |
| W8A16 exclude layers 3,6,9 attn | `layer.{3,6,9}/attention` | layers 0,2,7,8 value Add | Same 5D layout issue in windowed layers |
| W8A16 exclude ALL 12 attn layers | `attention/attention` | 129 ops: layer_scale, norm, projector, decoder, heads | FP16→quant boundary cascade |
| W8A8 full model | nothing | layers 3,5,6,8,9 query Add, Transpose | Same, also hits windowed layers |
| W8A8 exclude backbone | `backbone` | compiles | 67 ms latency, mAP 0.4518 (worse than FP16) |
Results summary:
| Configuration | AIMET Sim mAP | QNN Compile | On-target latency | On-target mAP |
|---|---|---|---|---|
| FP16 (no quant) | — | OK | 60.4 ms | 0.6065 |
| W8A16 full model | 0.5403 (-1.13%) | FAIL | — | — |
| W8A8 full model | 0.3594 (-34%) | FAIL | — | — |
| W8A8, backbone excluded | 0.3553 (-35%) | OK | 67 ms (worse) | 0.4518 |
| W8A16, layers 3,6,9 attn excluded | — | FAIL | — | — |
| W8A16, all 12 attn layers excluded | — | FAIL | — | — |
Questions:
- Is `q::linearclip` on the 5D internal layout (`QuantUint16_5D_TCMEE` / `QuantUint8_5D_TCMEE`) a known HTP v73 limitation? Is it fixed in newer QNN SDK versions?
- The 5D layout assignment appears to be an internal HTP compiler decision for 3D tensors like `[1, 1028, 384]` and `[4, 257, 384]`. Is there any way to influence this layout choice?
- Partial exclusion within the backbone is not viable due to FP16→quant boundary cascades. Is there a recommended approach for quantizing ViT/DINOv2 models for HTP? Has any ViT-based model been successfully quantized for HTP v73?
- Would a different encoding version or AIMET configuration help avoid this specific HTP compilation path?
QNN Backend / Graph Config:
backend_config.json:

```json
{
  "backend_extensions": {
    "shared_library_path": "libQnnHtpNetRunExtensions.so",
    "config_file_path": "graph_config.json"
  },
  "context_configs": {
    "context_priority": "normal"
  }
}
```

graph_config.json:

```json
{
  "graphs": [
    {
      "graph_names": ["model"],
      "fp16_relaxed_precision": 1,
      "vtcm_mb": 8,
      "O": 3,
      "finalize_config": {"P": 3}
    }
  ],
  "devices": [
    {
      "device_id": 0,
      "soc_id": 52,
      "dsp_arch": "v73"
    }
  ]
}
```

We reviewed the full set of `QnnHtpGraph_ConfigOption_t` options from QnnHtpGraph.h (SDK v2.41.0). None of the available graph config options (`vtcm_mb`, `O`, `dlbc`, `fold_relu_activation_into_conv_off`, `short_depth_conv_on_hmx_off`, `advanced_activation_fusion`, `use_high_precision_fp16_sigmoid`, `weights_packing`, `num_cores`) control internal tensor layout or provide a way to keep specific ops in FP16 within a quantized graph. The precision option (`QNN_HTP_GRAPH_CONFIG_OPTION_PRECISION`) has been deprecated since SDK 2.35 and only applies to FP32→FP16 math for float graphs.
Reproduction:
The model is RF-DETR Small from https://github.com/roboflow/rf-detr. Export with `model.export()`, then `torch.onnx.export(model, dummy, "model.onnx", opset_version=17)`. Quantize with AIMET-ONNX `QuantizationSimModel` using `config_file="htp_v73"`, W8A16 precision, the `tf_enhanced` scheme, and 1000 calibration samples from COCO train. Convert with `qnn-onnx-converter` (opset 17, `--quantization_overrides`, `--float_fallback`). Generate the context binary with `qnn-context-binary-generator --backend libQnnHtp.so --config_file backend_config.json`.
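As a command-line sketch of the conversion/compile steps (file names such as `model.encodings`, `libmodel.so`, and `model_ctx.bin` are placeholders; only the flags named above are taken from our actual run):

```shell
# Convert the exported ONNX model, applying the AIMET encodings and letting
# any op without an encoding fall back to float:
qnn-onnx-converter \
    --input_network model.onnx \
    --quantization_overrides model.encodings \
    --float_fallback \
    --output_path model.cpp

# Compile the converted model into a library (qnn-model-lib-generator step
# omitted here), then generate the HTP context binary; this is the step
# that fails at graph finalize with the q::linearclip error:
qnn-context-binary-generator \
    --backend libQnnHtp.so \
    --model libmodel.so \
    --binary_file model_ctx.bin \
    --config_file backend_config.json
```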