QNN HTP Graph Finalize failure on quantized ViT attention — q::linearclip unsupported 5D layout for [1, 1028, 384] tensor #4086

@korkland

Description

Environment:

  • AIMET-ONNX: 2.22.0+cu121
  • QNN SDK: v2.41.0
  • HTP target: v73 (soc_id 52, dsp_arch v73)
  • Encoding format: 0.6.1
  • Python 3.10, PyTorch 2.1.2, CUDA 12.1

Model:
RF-DETR Small — DINOv2-S/16 ViT backbone with windowed attention (from Roboflow RF-DETR)

Summary:

The model compiles and runs successfully on HTP in FP16 (60.4ms latency, mAP 0.6065 on COCO val). However, any quantization that touches the backbone encoder causes qnn-context-binary-generator to fail at graph finalization. This is not limited to specific layers — it affects the entire DINOv2 ViT encoder.

AIMET quantization simulation shows excellent accuracy:

  • W8A16 full model: mAP -1.13% (0.5403 vs 0.5465 baseline) — but fails at QNN graph finalize
  • W8A8 full model: mAP -34% — also fails at QNN graph finalize
  • W8A8 with entire backbone excluded: compiles, but latency is worse than FP16 (67ms vs 60.4ms) and mAP drops 35% since only the small decoder/heads are quantized

Root Cause Analysis:

The DINOv2 backbone uses windowed attention. Windowed layers process tokens in 4 windows of 257 tokens each, i.e. shape [4, 257, 384]. The global attention layers (3, 6, 9) merge all windows into a single [1, 1028, 384] tensor. The nn.Linear query/key/value projections (a MatMul followed by a bias Add) operate on these 3D tensors.
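The shapes involved can be sketched in plain NumPy (the projection weights are random placeholders, and a flat reshape stands in for the actual spatial window partition):

```python
import numpy as np

tokens = np.zeros((1, 1028, 384), dtype=np.float32)  # global-attention layout

# Windowed layers split the 1028 tokens into 4 windows of 257 tokens each
windowed = tokens.reshape(4, 257, 384)

# The query projection is an nn.Linear: MatMul + bias Add on a 3D tensor.
W_q = np.random.randn(384, 384).astype(np.float32)
b_q = np.random.randn(384).astype(np.float32)
q_global = tokens @ W_q + b_q       # bias Add on [1, 1028, 384]
q_windowed = windowed @ W_q + b_q   # bias Add on [4, 257, 384]

print(q_global.shape, q_windowed.shape)  # (1, 1028, 384) (4, 257, 384)
```

These 3D bias Adds are exactly the ops that the HTP compiler maps to the 5D internal layout described below.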

The QNN HTP compiler maps these 3D shapes to an internal 5D layout (QuantUint16_5D_TCMEE for W8A16, QuantUint8_5D_TCMEE for W8A8). The quantized Add kernel (q::linearclip) does not support this 5D layout, causing graph finalization to fail.

Critically, this is not limited to the global attention layers. We attempted progressive exclusion:

  1. Exclude global attention only (layers 3, 6, 9): The q::linearclip failure moved to the windowed attention layers (0, 2, 7, 8) — their value_Add ops now fail with the same QuantUint16_5D_TCMEE error.
  2. Exclude ALL 12 backbone attention layers: The failure cascaded to everything downstream — backbone layer_scale Mul, norm LayerNormalization, projector Conv/BN/SiLU, and the entire transformer decoder (Add, self_attn, cross_attn, LayerNorm). 129 ops failed total.

The cascade happens because excluding attention creates FP16→quantized boundaries. The downstream quantized ops inherit the problematic 5D internal layout from the FP16 attention outputs, and q::linearclip fails on them too. Partial exclusion within the backbone encoder is not viable.
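The boundary cascade can be illustrated with a small graph walk (the node names are shortened stand-ins, not the real ONNX node names): every quantized op reachable downstream of an excluded (FP16) op sits on an FP16→quantized boundary and inherits the problem.

```python
from collections import deque

# consumer edges: op -> downstream ops (toy subset of the model's graph)
consumers = {
    "attn_value_Add":  ["layer_scale_Mul"],
    "layer_scale_Mul": ["norm_LayerNorm"],
    "norm_LayerNorm":  ["projector_Conv"],
    "projector_Conv":  ["decoder_Add"],
    "decoder_Add":     [],
}

def downstream_of(excluded, consumers):
    """All ops reachable from the excluded set, i.e. every op that
    sits on an FP16 boundary when `excluded` stays in FP16."""
    seen, queue = set(), deque(excluded)
    while queue:
        for nxt in consumers.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(downstream_of({"attn_value_Add"}, consumers))
```

With attention excluded, every op after it is in the affected set, which matches the observed 129-op cascade across backbone, projector, decoder, and heads.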

Error Log (from qnn-context-binary-generator):

W8A16 full model (first failure):

graph_prepare.cc:222::ERROR:could not create op: q::linearclip
graph_prepare.cc:224::ERROR:Op creation failure, op id=0x5a7b000000196 (q::linearclip) total_inputs=2
graph_prepare.cc:210:  Input 0: id=[0x51be500000196] op=[[email protected]] output0=[14ConcreteTensorIN5Tdefs18QuantUint16_5D_TCMEE]
graph_prepare.cc:210:  Input 1: id=[0x5a7af00000196] op=[Const] output0=[14ConcreteTensorIN5Tdefs5Int32EE]
graph_prepare.cc:1658::ERROR:Op 0x5a7b000000196 preparation failed with err:-1

W8A16 with layers 3,6,9 attention excluded (failure moves to windowed layers):

"_backbone_backbone_0_encoder_encoder_encoder_layer_0_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_2_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_7_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_8_attention_attention_value_Add" generated: could not create op
"_backbone_backbone_0_encoder_encoder_Transpose_1" generated: could not create op

W8A16 with ALL backbone attention excluded (129 ops cascade):

"_backbone_backbone_0_encoder_encoder_encoder_layer_0_layer_scale2_Mul" generated: could not create op
"_backbone_backbone_0_encoder_encoder_encoder_layer_10_norm2_LayerNormalization" generated: could not create op
"_backbone_backbone_0_projector_stages_0_stages_0_0_m_0_cv2_conv_Conv" generated: could not create op
"_transformer_decoder_layers_0_Add" generated: could not create op
"_transformer_decoder_layers_1_self_attn_MatMul" generated: could not create op
"_transformer_decoder_layers_2_cross_attn_Mul_4" generated: could not create op
... (129 total failing ops across backbone, projector, decoder, and heads)

Failing ONNX nodes (progressive exclusion attempts):

| Attempt | Excluded | New failures | Root cause |
|---|---|---|---|
| W8A16 full model | nothing | layers 3,6,9 query/key/value Add, Transpose | q::linearclip on QuantUint16_5D_TCMEE |
| W8A16, layers 3,6,9 attn excluded | layer.{3,6,9}/attention | layers 0,2,7,8 value Add | same 5D layout issue in windowed layers |
| W8A16, all 12 attn layers excluded | attention/attention | 129 ops: layer_scale, norm, projector, decoder, heads | FP16→quant boundary cascade |
| W8A8 full model | nothing | layers 3,5,6,8,9 query Add, Transpose | same, also hits windowed layers |
| W8A8, backbone excluded | backbone | compiles | 67 ms latency, mAP 0.4518 (worse than FP16) |

Results summary:

| Configuration | AIMET Sim mAP | QNN Compile | On-target latency | On-target mAP |
|---|---|---|---|---|
| FP16 (no quant) | n/a | OK | 60.4 ms | 0.6065 |
| W8A16 full model | 0.5403 (-1.13%) | FAIL | n/a | n/a |
| W8A8 full model | 0.3594 (-34%) | FAIL | n/a | n/a |
| W8A8, backbone excluded | 0.3553 (-35%) | OK | 67 ms (worse than FP16) | 0.4518 |
| W8A16, layers 3,6,9 attn excluded | n/a | FAIL | n/a | n/a |
| W8A16, all 12 attn layers excluded | n/a | FAIL | n/a | n/a |

Questions:

  1. Is q::linearclip on 5D internal layout (QuantUint16_5D_TCMEE / QuantUint8_5D_TCMEE) a known HTP v73 limitation? Is it fixed in newer QNN SDK versions?
  2. The 5D layout assignment appears to be an internal HTP compiler decision for 3D tensors like [1, 1028, 384] and [4, 257, 384]. Is there any way to influence this layout choice?
  3. Partial exclusion within the backbone is not viable due to FP16→quant boundary cascades. Is there a recommended approach for quantizing ViT/DINOv2 models for HTP? Has any ViT-based model been successfully quantized for HTP v73?
  4. Would a different encoding version or AIMET configuration help avoid this specific HTP compilation path?
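On question 2, one thing we have not yet tried (purely an assumption on our side, not a documented HTP control) is exporting with an explicit leading unit dimension, on the theory that the 5D internal layout is only selected for rank-3 activations. The rewrite itself is numerically a no-op:

```python
import numpy as np

# Hypothetical pre-export rewrite: give each 3D activation an explicit
# leading unit axis so the converter sees 4D tensors instead of 3D.
x3d = np.zeros((1, 1028, 384), dtype=np.float32)  # global-attention activation
x4d = x3d.reshape(1, 1, 1028, 384)                # unit "height" axis added

# Only the declared rank changes, not the data.
assert np.array_equal(x4d.squeeze(1), x3d)
```

Whether the HTP compiler actually picks a different internal layout for 4D inputs is exactly what question 2 asks; this sketch only shows the rewrite is shape-safe.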

QNN Backend / Graph Config:

backend_config.json:

{
    "backend_extensions": {
        "shared_library_path": "libQnnHtpNetRunExtensions.so",
        "config_file_path": "graph_config.json"
    },
    "context_configs": {
        "context_priority": "normal"
    }
}

graph_config.json:

{
    "graphs": [
        {
            "graph_names": ["model"],
            "fp16_relaxed_precision": 1,
            "vtcm_mb": 8,
            "O": 3,
            "finalize_config": {"P": 3}
        }
    ],
    "devices": [
        {
            "device_id": 0,
            "soc_id": 52,
            "dsp_arch": "v73"
        }
    ]
}

We reviewed the full set of QnnHtpGraph_ConfigOption_t options from QnnHtpGraph.h (SDK v2.41.0). None of the available graph config options (vtcm_mb, O, dlbc, fold_relu_activation_into_conv_off, short_depth_conv_on_hmx_off, advanced_activation_fusion, use_high_precision_fp16_sigmoid, weights_packing, num_cores) control internal tensor layout or provide a way to keep specific ops in FP16 within a quantized graph. The precision option (QNN_HTP_GRAPH_CONFIG_OPTION_PRECISION) is deprecated since SDK 2.35 and only applies to FP32→FP16 math for float graphs.

Reproduction:
The model is RF-DETR Small from https://github.com/roboflow/rf-detr.

  1. Export: run model.export(), then torch.onnx.export(model, dummy, "model.onnx", opset_version=17).
  2. Quantize: AIMET-ONNX QuantizationSimModel with config_file="htp_v73", W8A16 precision, tf_enhanced scheme, 1000 calibration samples from COCO train.
  3. Convert: qnn-onnx-converter (opset 17, --quantization_overrides, --float_fallback).
  4. Generate the context binary: qnn-context-binary-generator --backend libQnnHtp.so --config_file backend_config.json.
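For reference, the file passed to --quantization_overrides follows the AIMET encodings format (version 0.6.1, per Environment above). A minimal illustrative fragment is shown below; the tensor name, min/max, scale, and offset are placeholders, not values from this model. Tensors not listed here take the --float_fallback path:

```json
{
    "version": "0.6.1",
    "activation_encodings": {
        "attention_value_add_out": [
            {
                "bitwidth": 16,
                "dtype": "int",
                "is_symmetric": "False",
                "min": -12.34,
                "max": 12.34,
                "scale": 0.000376,
                "offset": -32768
            }
        ]
    },
    "param_encodings": {}
}
```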

    Labels

    Medium (level of difficulty), aimet-onnx (new feature or bug fix for AIMET ONNX), enhancement (new feature or request)
