
fix: replace NaN from all-masked SDPA padding rows in Gemma 4 vision#1006

Open
fabiopili wants to merge 1 commit into Blaizzy:main from fabiopili:gemma4-nan-fix

Conversation

@fabiopili

I've been having problems with Gemma 4 models that I quantized locally with the full 1120 vision token budget enabled, serving from oMLX and accessing through Open WebUI. The model would respond with an endless loop of the <pad> token.

A code review is recommended before merging, even though the impact seems minimal. I'm not a proficient Python developer, and this change was created with Claude Code assistance for identifying the bug. That said, I've been using it locally without problems and all tests pass.


Summary

  • Replace NaN values produced by softmax over all-masked padding rows in Gemma 4 vision attention with 0, preventing NaN propagation through residual connections and layer norms
  • The existing ensure_fused_sdpa workaround pads head_dim to force the fused SDPA kernel, but it does not help at sequence lengths (e.g. L=10080 for the 1120-token budget) where MLX falls back to the non-fused path
  • Padding rows are zeroed out by VisionPooler before pooling anyway, so 0 is the correct replacement value

Problem

MLX's scaled_dot_product_attention computes softmax(QK^T) V. When an entire row of the attention mask is -inf (a padding token that should attend to nothing), every exp() term is 0 and the softmax normalization divides 0 by 0, producing NaN. This NaN then propagates through:

  1. The output projection (o_proj)
  2. Residual additions in each of the 24 transformer layers
  3. RMSNorm (any NaN input produces NaN output)
  4. The MLP branch
  5. All subsequent layers

By the time the VisionPooler zeros out padding positions, the NaN has already corrupted non-padding rows via residual connections.
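The failure mode can be reproduced outside MLX. A minimal NumPy sketch (an illustration, not MLX's actual kernel) of the 0/0 softmax and the residual-path poisoning:

```python
import numpy as np

# NumPy illustration (not MLX's kernel) of the failure mode: a fully
# masked row has every score at -inf, so exp() yields all zeros and the
# softmax normalization divides 0 by 0, producing NaN.
scores = np.array([
    [0.5, 1.2, -0.3],             # real content row
    [-np.inf, -np.inf, -np.inf],  # all-masked padding row
])
with np.errstate(invalid="ignore"):
    e = np.exp(scores)                           # padding row -> [0, 0, 0]
    weights = e / e.sum(axis=-1, keepdims=True)  # 0 / 0 -> NaN

# The NaN then poisons anything it touches, e.g. a residual addition:
residual = np.ones(3) + weights[1]  # all NaN
```

This is why zeroing padding only at the pooler is too late: the residual addition has already spread the NaN into content rows' compute path.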

Fix

A single line after the SDPA call in VisionAttention.__call__:

attn_output = mx.where(mx.isnan(attn_output), 0.0, attn_output)

This seems safe because:

  • Only padding rows produce NaN (real content rows have valid attention targets)
  • VisionPooler explicitly zeros padding positions before pooling (vision.py:364-367)
  • The replacement runs inside @mx.compile scope, so the isnan check fuses into the existing compute graph with minimal overhead
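As a NumPy stand-in for the mx.where/mx.isnan one-liner, the replacement behaves like:

```python
import numpy as np

# NumPy sketch of the one-line fix (mx.where / mx.isnan in the PR):
# NaN entries from all-masked padding rows become 0; real values pass
# through unchanged.
attn_output = np.array([
    [0.3, -1.1],       # real content row, untouched
    [np.nan, np.nan],  # padding row corrupted by the softmax NaN
])
attn_output = np.where(np.isnan(attn_output), 0.0, attn_output)
```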

Changed files

  • mlx_vlm/models/gemma4/vision.py (+5 lines)

Test plan

  • All 395 unit tests pass (0 failures, 1 skipped, 2 unrelated errors)
  • End-to-end inference with Gemma 4 on single image (fused SDPA path)
  • End-to-end inference with Gemma 4 on multiple different-sized images (non-fused path, L=10080)
  • Verify no NaN in output embeddings at both sequence lengths

MLX's scaled_dot_product_attention produces NaN for fully-masked rows
(softmax over all -inf = 0/0) at most sequence lengths. The existing
ensure_fused_sdpa workaround only helps when the fused kernel activates
(e.g. L=5040) but fails at other sizes like L=10080 (1120-token budget).

Replace NaN with 0 after SDPA — safe because the pooler already zeros
out padding rows before pooling.
nnorris7 added a commit to nnorris7/mlx-vlm that referenced this pull request Apr 13, 2026
…Blaizzy#924)

Two upstream-PR fixes applied to local copy:

vision.py: replace NaN with 0 after SDPA in VisionAttention. All-masked
padding rows produce NaN via softmax (0/0) at sequence lengths where the
non-fused SDPA fallback runs. The pooler zeros these rows anyway, so 0 is
the correct replacement. Adapted from Blaizzy#1006 by Fabio Pili.

processing_gemma4.py + chat_template.jinja: bundle the official Google
Gemma 4 chat template as a fallback. Many mlx-community Gemma 4 models
don't ship chat_template in tokenizer_config. Without this, the prompt
falls back to plain text without turn markers and the model produces
"No text generated". Adapted from Blaizzy#924 by jrp2014.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
