Add vision feature caching to all models #1028

Open
Blaizzy wants to merge 2 commits into pc/continous-batch from pc/vision-cache-all-models

Blaizzy commented Apr 16, 2026

Summary

Adds vision_cache kwarg support to the get_input_embeddings method of all 44 models. On a cache hit, the vision tower is skipped entirely, which saves both time and memory on repeated images (multi-turn conversations, batch requests with shared images).

Based on: pc/continous-batch (continuous batching PR)

How it works

Each model's get_input_embeddings now does two things:

  1. Checks vision_cache.get(_image_key) before calling vision_tower
  2. On a cache miss, stores the computed features via vision_cache.put() so later calls with the same key can reuse them

The server passes vision_cache and _image_key as kwargs. Models that don't support them simply ignore the extra arguments via **kwargs.
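The check-then-store flow described above can be sketched in plain Python. This is only an illustration under stated assumptions: `VisionFeatureCache`, `ToyModel`, and `LegacyModel` are hypothetical names, the LRU eviction policy is a guess (the PR only shows that the cache exposes `get` and `put`), and the fake `vision_tower` works on Python lists instead of MLX arrays.

```python
from collections import OrderedDict


class VisionFeatureCache:
    """Hypothetical LRU cache with the get/put interface the PR relies on."""

    def __init__(self, max_entries=8):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, key):
        # Return None on a miss, or when no key was supplied.
        if key is None or key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, features):
        self._store[key] = features
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used


class ToyModel:
    """Stand-in for a patched model; vision_tower is a fake, countable op."""

    def __init__(self):
        self.tower_calls = 0

    def vision_tower(self, pixel_values):
        self.tower_calls += 1
        return [2.0 * p for p in pixel_values]

    def get_input_embeddings(self, input_ids, pixel_values=None, **kwargs):
        vision_cache = kwargs.get("vision_cache", None)
        image_key = kwargs.get("_image_key", None)
        cached = kwargs.get("cached_image_features", None)
        if cached is None and vision_cache is not None:
            cached = vision_cache.get(image_key)

        if cached is not None:
            features = cached  # cache hit: the vision tower is skipped
        else:
            features = self.vision_tower(pixel_values)
            # The real pattern calls mx.eval(features) here before storing.
            if vision_cache is not None and image_key is not None:
                vision_cache.put(image_key, features)
        return list(input_ids) + features


class LegacyModel:
    """A model without cache support: the extra kwargs are simply unused."""

    def get_input_embeddings(self, input_ids, pixel_values=None, **kwargs):
        return list(input_ids)


cache = VisionFeatureCache()
model = ToyModel()
first = model.get_input_embeddings([1, 2], [0.5, 1.5], vision_cache=cache, _image_key="img-a")
second = model.get_input_embeddings([1, 2], [0.5, 1.5], vision_cache=cache, _image_key="img-a")
print(model.tower_calls)  # 1: the second call was served from the cache
```

Keeping the cache outside the model and passing it per call means one cache object can be shared across requests, which is what makes the multi-turn and shared-image cases benefit.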

Benchmarks (per-request, single image)

| Model | Cache miss | Cache hit | Speedup | Memory saved |
|---|---|---|---|---|
| gemma-4-26b | 244 ms | 1 ms | 228x | 1 GB |
| Qwen3.5-4B | 157 ms | 7 ms | 23x | |
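For readers who want to reproduce this kind of miss-versus-hit comparison, a minimal timing harness might look like the sketch below. It is not the benchmark script used for the numbers above; `time_ms` and `speedup` are hypothetical helpers, and the slow path is simulated with `time.sleep`. Because MLX evaluates lazily, a real measurement would need to force evaluation (for example with `mx.eval`) inside the timed function.

```python
import time


def time_ms(fn, repeats=3):
    """Return the best wall-clock time of fn() in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def speedup(miss_ms, hit_ms):
    """Speedup factor of the cached path over the uncached path."""
    return miss_ms / hit_ms


features = {}


def simulated_miss():
    time.sleep(0.01)  # stands in for running the vision tower
    features["img-a"] = [1.0, 2.0]


def simulated_hit():
    return features["img-a"]  # stands in for vision_cache.get(...)


miss_ms = time_ms(simulated_miss)
hit_ms = time_ms(simulated_hit)
print(f"miss {miss_ms:.2f} ms, hit {hit_ms:.4f} ms")
```

Taking the best of several repeats reduces noise from the scheduler, but the cache-miss path has to be measured with a fresh key each time in a real benchmark, otherwise later repeats would hit the cache.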

Models patched (42 + 2 already done)

All 44 models with cached_image_features support are now patched. Every file was syntax-verified and import-tested.

Test plan

  • Syntax check all 44 files
  • Import test all 44 modules
  • Pattern verification (vision_cache get + put)
  • Multi-turn cache hit/miss timing (gemma4 + qwen3.5)
  • Embeddings match between cached and uncached paths
  • Full test suite: 396 passed
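The "embeddings match between cached and uncached paths" item can be illustrated with a small equivalence check. This is a sketch only: `embed` is a toy function backed by a plain dict rather than the real models or cache, and `embeddings_match` is a hypothetical helper that compares element by element with a tolerance.

```python
import math


def embeddings_match(a, b, rel_tol=1e-6, abs_tol=1e-8):
    """True when both sequences have equal length and near-equal values."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


def embed(pixels, cache=None, key=None):
    """Toy embedding path: doubles each pixel, optionally via a dict cache."""
    if cache is not None and key in cache:
        return cache[key]
    features = [2.0 * p for p in pixels]
    if cache is not None and key is not None:
        cache[key] = features
    return features


pixels = [0.1, 0.2, 0.3]
uncached = embed(pixels)              # no cache at all
store = {}
miss = embed(pixels, store, "img-a")  # first call populates the cache
hit = embed(pixels, store, "img-a")   # second call is served from it

assert embeddings_match(uncached, miss)
assert embeddings_match(uncached, hit)
```

Comparing against a run with no cache at all, rather than only miss against hit, also catches the case where the caching code path changes the computed features themselves.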

🤖 Generated with Claude Code

Blaizzy and others added 2 commits April 16, 2026 20:26
Every model's get_input_embeddings now supports vision_cache and
_image_key kwargs. On cache miss, vision features are computed and
stored. On cache hit, the vision tower is skipped entirely.

Benchmarks (per-request, single image):
- gemma4: 244ms → 1ms (228x speedup), 1GB memory saved
- qwen3.5: 157ms → 7ms (23x speedup)

Pattern added to each model:
  vision_cache = kwargs.get("vision_cache", None)
  cached = kwargs.get("cached_image_features", None)
  if cached is None and vision_cache is not None:
      cached = vision_cache.get(kwargs.get("_image_key"))
  ...
  if vision_cache is not None and kwargs.get("_image_key") is not None:
      mx.eval(features)
      vision_cache.put(kwargs["_image_key"], features)

44 models patched, all syntax-verified and import-tested.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>