
Add Gemma 4 (26B-A4B) LLM and VLM bridge#3148

Draft
yaoyu-33 wants to merge 1 commit into main from yuya/gemma4-vlm-bridge

Conversation

Contributor

@yaoyu-33 yaoyu-33 commented Apr 3, 2026

Summary

  • Adds full bridge, provider, and VLM model wrapper for Google's Gemma 4 MoE architecture (google/gemma-4-26B-A4B)
  • Extends the existing Gemma 3 bridge/VL infrastructure with Gemma 4-specific handling: dual local/global RoPE, MoE expert routing, fused router weights, shared expert pre-norm, and sliding window attention
  • VLM combines HF vision tower (SigLIP-based) + multimodal embedder with Megatron-Core GPT language model

New Files

| File | Description |
| --- | --- |
| `gemma/gemma4_bridge.py` | LLM weight mapping — fused router, shared expert pre-norm, QKV/GatedMLP |
| `gemma/gemma4_provider.py` | Megatron-Core GPT provider — proportional RoPE for global layers, MoE config, logit soft-capping |
| `gemma_vl/gemma4_vl_bridge.py` | VLM bridge — vision tower + embedder mappings with `model.*` prefix |
| `gemma_vl/gemma4_vl_provider.py` | VLM provider extending the Gemma 4 LLM provider |
| `gemma_vl/modeling_gemma4_vl.py` | VLM model — HF vision encoder + Megatron language decoder |
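For readers unfamiliar with this wiring, the VLM wrapper's embedding path (vision features projected into the language hidden size, then scattered into the token-embedding sequence at image-token slots) might look roughly like the sketch below. Class name, dimensions, and interface are illustrative assumptions, not the actual `modeling_gemma4_vl.py` code:

```python
import torch
import torch.nn as nn


class VisionEmbedderSketch(nn.Module):
    # Hypothetical sketch of the vision -> language path: project the
    # vision tower's hidden states into the language model's hidden size,
    # then overwrite the placeholder image-token embeddings with them.
    # vision_dim/lm_dim are illustrative, not the real Gemma 4 config.
    def __init__(self, vision_dim: int = 1152, lm_dim: int = 2560):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, inputs_embeds, image_features, image_token_mask):
        # inputs_embeds:    [batch, seq, lm_dim] text-token embeddings
        # image_features:   [num_image_tokens, vision_dim] from the tower
        # image_token_mask: [batch, seq] bool, True at image-token slots
        projected = self.proj(image_features)  # -> [num_image_tokens, lm_dim]
        # Fill image-token positions, in order, with the projected features.
        return inputs_embeds.masked_scatter(
            image_token_mask.unsqueeze(-1), projected
        )
```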

Validation Results (single-GPU, bf16)

| Metric | Value |
| --- | --- |
| Text-only cosine similarity | 0.9998 |
| VLM cosine similarity (causal) | 0.9977 |
| VLM cosine similarity (bidirectional) | 0.9966 |
| Top-1 token | Matches HF |
| Image understanding | ✅ ("Red square" correctly identified) |
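The parity metrics above compare final logits between the HF and Megatron models. A minimal version of such a check (function names are ours, not the repo's) could be:

```python
import torch
import torch.nn.functional as F


def logit_cosine_similarity(hf_logits: torch.Tensor,
                            mcore_logits: torch.Tensor) -> float:
    # Mean cosine similarity over sequence positions between two
    # [seq_len, vocab] logit tensors -- the kind of number reported
    # in the validation table.
    sims = F.cosine_similarity(hf_logits.float(), mcore_logits.float(), dim=-1)
    return sims.mean().item()


def same_top1(hf_logits: torch.Tensor, mcore_logits: torch.Tensor) -> bool:
    # True when both models pick the same argmax token at every position.
    return bool((hf_logits.argmax(dim=-1) == mcore_logits.argmax(dim=-1)).all())
```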

Key Design Decisions

  • Proportional RoPE: Gemma 4 global layers use inv_freq = 1/(base^(arange/head_dim)) with head_dim=512 (not the standard dim=128). Fixed via Gemma4RotaryEmbedding override.
  • Causal-only attention: Currently uses Megatron's default causal masking. HF applies bidirectional attention among image tokens when mm_token_type_ids is provided, which accounts for the gap between the causal (0.9977) and bidirectional (0.9966) comparisons.
  • Vision pipeline: vision_tower.forward() returns last_hidden_state (already pooled + standardized), then embed_vision projects to language hidden dim — matches HF's Gemma4Model.get_image_features.
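The proportional RoPE decision above can be sketched as follows. This is a hedged illustration of the stated formula, not the actual `Gemma4RotaryEmbedding` override; `base` is an assumed default and the real value lives in the checkpoint config:

```python
import torch


def proportional_inv_freq(base: float = 10_000.0,
                          head_dim: int = 512) -> torch.Tensor:
    # Inverse frequencies for the "proportional" RoPE described above:
    # the exponent is scaled by head_dim (512) rather than the usual
    # rotary dim (128), i.e. inv_freq = 1 / base^(arange / head_dim).
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (base ** exponents)
```

Scaling the exponent by 512 instead of 128 compresses the frequency spectrum, which is why reusing the standard embedding class produced mismatched rotations on global layers.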

Remaining Work

  • Bidirectional attention for image tokens: Implement an mm_token_type_ids-based attention mask so image tokens can attend to each other, matching HF's bidirectional behavior and closing the 0.9966 vs. 0.9977 gap
  • Unit tests: Add bridge parity tests in tests/unit_tests/models/gemma4/
  • Functional tests: Multi-GPU TP/PP validation
  • Recipes: Add pretrain/SFT recipe configs
  • Requires transformers >= 5.6.0.dev0 (Gemma4 not yet in stable release)
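The first remaining-work item (the mm_token_type_ids-based mask) could look roughly like this sketch. It assumes image tokens carry type id 1, and returns a boolean mask where True means "may attend"; Megatron's own (often inverted) mask convention would still need adapting:

```python
import torch


def build_vlm_attention_mask(mm_token_type_ids: torch.Tensor) -> torch.Tensor:
    # Causal mask everywhere, plus full bidirectional attention among
    # image tokens (type id 1). Input: [seq] token type ids.
    # Output: [seq, seq] bool, True = "query may attend to key".
    seq_len = mm_token_type_ids.shape[-1]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    is_image = mm_token_type_ids.bool()
    # Pairs where both query and key are image tokens may always attend.
    image_pairs = is_image.unsqueeze(-1) & is_image.unsqueeze(-2)
    return causal | image_pairs
```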

Test plan

  • Single-GPU logit parity test (HF vs Megatron)
  • VLM inference with image input
  • Unit tests for bridge weight mapping
  • Multi-GPU TP/PP/VPP tests
  • SFT training convergence test

🤖 Generated with Claude Code

Add bridge, provider, and VLM model wrapper for Google's Gemma 4
MoE architecture (gemma-4-26B-A4B).

Key components:
- gemma4_bridge.py: Weight mapping with fused router weights,
  shared expert pre-norm fusion, and QKV/GatedMLP mappings
- gemma4_provider.py: Megatron-Core GPT provider with dual
  local/global RoPE (proportional formula for global layers),
  MoE expert routing, sliding window attention, and logit
  soft-capping
- gemma4_vl_bridge.py: VLM bridge with vision tower + embedder
  weight mappings (model.* prefix for raw safetensors keys)
- gemma4_vl_provider.py: VLM provider extending Gemma4 LLM
- modeling_gemma4_vl.py: VLM model combining HF vision tower
  with Megatron language model

Validation (single-GPU, bf16):
- Text-only cosine similarity: 0.9998
- VLM cosine similarity: 0.9977 (causal-only, apples-to-apples)
- Same top-1 token prediction as HF
- Correct image understanding ("Red square" from test image)

Known limitations:
- Bidirectional attention for image tokens (mm_token_type_ids)
  not yet implemented — uses causal-only masking
- Requires transformers >= 5.6.0.dev0 for Gemma4 support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
