fix: replace NaN from all-masked SDPA padding rows in Gemma 4 vision #1006
Open
fabiopili wants to merge 1 commit into Blaizzy:main from
Conversation
MLX's `scaled_dot_product_attention` produces NaN for fully-masked rows (softmax over all -inf = 0/0) at most sequence lengths. The existing `ensure_fused_sdpa` workaround only helps when the fused kernel activates (e.g. L=5040) but fails at other sizes like L=10080 (1120-token budget). Replace NaN with 0 after SDPA; this is safe because the pooler already zeros out padding rows before pooling.
nnorris7 added a commit to nnorris7/mlx-vlm that referenced this pull request on Apr 13, 2026
…Blaizzy#924)

Two upstream-PR fixes applied to a local copy:

- vision.py: replace NaN with 0 after SDPA in VisionAttention. All-masked padding rows produce NaN via softmax (0/0) at sequence lengths where the non-fused SDPA fallback runs. The pooler zeros these rows anyway, so 0 is the correct replacement. Adapted from Blaizzy#1006 by Fabio Pili.
- processing_gemma4.py + chat_template.jinja: bundle the official Google Gemma 4 chat template as a fallback. Many mlx-community Gemma 4 models don't ship chat_template in tokenizer_config. Without this, the prompt falls back to plain text without turn markers and the model produces "No text generated". Adapted from Blaizzy#924 by jrp2014.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
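The second fix in that commit is only described in prose above; a hypothetical sketch of the fallback it describes (the helper name, path handling, and attribute access are my assumptions, and the real change lives in processing_gemma4.py):

```python
from pathlib import Path

# Bundled copy of the official Google Gemma 4 chat template, shipped next to
# the processor module (the path is an assumption for illustration).
BUNDLED_TEMPLATE = Path(__file__).parent / "chat_template.jinja"

def ensure_chat_template(processor):
    # Many mlx-community Gemma 4 checkpoints ship no chat_template in
    # tokenizer_config.json; without one, prompts lose their turn markers
    # and the model emits "No text generated". Fall back to the bundle.
    if getattr(processor, "chat_template", None) is None:
        processor.chat_template = BUNDLED_TEMPLATE.read_text()
    return processor
```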
I've been having problems with Gemma 4 models that I quantized locally with the full 1120 vision token budget enabled, serving from oMLX and accessing from Open WebUI. The model would respond with an endless loop of the `<pad>` token.

A code review is recommended before merging, even though the impact seems minimal. I'm not a proficient Python developer and this change was created with Claude Code assistance for the bug identification. That said, I've been using it locally without problems and the tests pass.
Summary
- The `ensure_fused_sdpa` workaround pads `head_dim` to force the fused SDPA kernel, but does not help at sequence lengths (e.g. L=10080 for the 1120-token budget) where the non-fused fallback is used
- Replace NaN with 0 after SDPA: padding rows are zeroed out by the `VisionPooler` before pooling anyway, so 0 is the correct replacement value

Problem
MLX's `scaled_dot_product_attention` computes `softmax(QK^T) V`. When an entire row of the attention mask is `-inf` (padding tokens that should attend to nothing), softmax produces `0/0 = NaN`. This NaN then propagates through the rest of the layer (including the output projection, `o_proj`). By the time the `VisionPooler` zeros out padding positions, the NaN has already corrupted non-padding rows via residual connections.
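A tiny repro of the 0/0 behaviour at the softmax level (illustrative only, not taken from the PR):

```python
import mlx.core as mx

# Every score in the row is -inf, so softmax normalizes by
# exp(-inf) / sum(exp(-inf)) = 0 / 0 and the whole row comes out NaN.
scores = mx.full((1, 4), float("-inf"))
print(mx.softmax(scores, axis=-1))  # -> all NaN
```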
Fix

A single line after the SDPA call in `VisionAttention.__call__`, shown in the sketch below.
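The patch's inline code was lost in extraction; here is a minimal sketch of what the description implies, wrapped as a standalone helper (the `sdpa_nan_safe` name is mine, not the PR's):

```python
import mlx.core as mx

def sdpa_nan_safe(q, k, v, scale, mask=None):
    # Standard SDPA call, as in VisionAttention.__call__.
    out = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale, mask=mask)
    # Fully masked rows come back as NaN (softmax over all -inf = 0/0);
    # replace them with 0. The VisionPooler zeros padding rows before
    # pooling, so 0 is the correct value and real rows are untouched.
    return mx.where(mx.isnan(out), mx.zeros_like(out), out)
```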
This seems safe because:

- The `VisionPooler` explicitly zeros padding positions before pooling (vision.py:364-367)
- The attention call runs inside an `@mx.compile` scope, so the `isnan` check fuses into the existing compute graph with minimal overhead

Changed files
- `mlx_vlm/models/gemma4/vision.py` (+5 lines)

Test plan
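The test plan details were not captured above; a minimal regression check in its spirit, reusing the hypothetical `sdpa_nan_safe` helper from the Fix sketch (all shapes and names are assumptions):

```python
import mlx.core as mx

def test_fully_masked_row_has_no_nan():
    # Mask whose last row is entirely -inf: a padding token that attends to
    # nothing, the case that used to produce NaN in the non-fused fallback.
    B, H, L, D = 1, 2, 8, 16
    q = mx.random.normal((B, H, L, D))
    k = mx.random.normal((B, H, L, D))
    v = mx.random.normal((B, H, L, D))
    mask = mx.concatenate(
        [mx.zeros((L - 1, L)), mx.full((1, L), float("-inf"))], axis=0
    )
    out = sdpa_nan_safe(q, k, v, scale=D**-0.5, mask=mask)
    # With the fix every entry is finite; without it, the last row is all NaN.
    assert not mx.any(mx.isnan(out)).item()
```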