Add vision feature caching to all models#1028
Open
Blaizzy wants to merge 2 commits into pc/continous-batch from
Conversation
Every model's get_input_embeddings now supports vision_cache and
_image_key kwargs. On cache miss, vision features are computed and
stored. On cache hit, the vision tower is skipped entirely.
Benchmarks (per-request, single image):
- gemma4: 244ms → 1ms (228x speedup), 1GB memory saved
- qwen3.5: 157ms → 7ms (23x speedup)
Pattern added to each model:
```python
vision_cache = kwargs.get("vision_cache", None)
cached = kwargs.get("cached_image_features", None)
if cached is None and vision_cache is not None:
    cached = vision_cache.get(kwargs.get("_image_key"))
...
if vision_cache is not None and kwargs.get("_image_key") is not None:
    mx.eval(features)
    vision_cache.put(kwargs["_image_key"], features)
```
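The vision_cache object only needs get and put. A minimal dict-backed sketch of such a cache; the class name, LRU eviction policy, and max_entries parameter are assumptions for illustration, not the PR's actual implementation:

```python
from collections import OrderedDict

class VisionCache:
    """Minimal LRU cache for vision features, keyed by _image_key.

    Sketch only: the real cache class and eviction policy may differ.
    """

    def __init__(self, max_entries=16):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, key):
        # A miss (or key=None) returns None, so callers fall
        # through to the vision tower as in the pattern above.
        if key is None or key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as recently used
        return self._store[key]

    def put(self, key, features):
        self._store[key] = features
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

With something like this in place, the first request for an image pays for the vision tower, and every later request carrying the same _image_key returns the stored features.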
44 models patched, all syntax-verified and import-tested.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary

Adds vision_cache kwarg support to all 44 model get_input_embeddings methods. On cache hit, the vision tower is skipped entirely, saving both time and memory on repeated images (multi-turn conversations, batch requests with shared images).

Based on: pc/continous-batch (continuous batching PR)

How it works

Each model's get_input_embeddings now checks vision_cache.get(_image_key) before calling vision_tower, and calls vision_cache.put() after the first computation. The server passes vision_cache and _image_key as kwargs; models that don't support caching simply ignore the extra kwargs via **kwargs.

Benchmarks (per-request, single image)

- gemma4: 244ms → 1ms (228x speedup), 1GB memory saved
- qwen3.5: 157ms → 7ms (23x speedup)

Models patched (42 + 2 already done)

All 44 models with cached_image_features support. Syntax-verified and import-tested.

Test plan
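One way to exercise the cache-hit path is sketched below with a stand-in model. The fake tower, the call counter, and deriving _image_key as a SHA-256 of the image bytes are all illustrative assumptions, not the PR's actual tests or key scheme:

```python
import hashlib

calls = {"tower": 0}

def fake_vision_tower(image_bytes):
    # Stand-in for the real vision tower; counts invocations.
    calls["tower"] += 1
    return [float(b) for b in image_bytes[:4]]

def get_input_embeddings(image_bytes, **kwargs):
    # Mirrors the per-model pattern: check the cache, else compute and store.
    vision_cache = kwargs.get("vision_cache", None)
    cached = kwargs.get("cached_image_features", None)
    if cached is None and vision_cache is not None:
        cached = vision_cache.get(kwargs.get("_image_key"))
    if cached is not None:
        return cached
    features = fake_vision_tower(image_bytes)
    if vision_cache is not None and kwargs.get("_image_key") is not None:
        vision_cache.put(kwargs["_image_key"], features)
    return features

class DictCache:
    """Smallest object satisfying the get/put interface."""
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d.get(key)
    def put(self, key, features):
        self._d[key] = features

image = b"\x01\x02\x03\x04"
key = hashlib.sha256(image).hexdigest()  # one plausible _image_key derivation
vc = DictCache()

first = get_input_embeddings(image, vision_cache=vc, _image_key=key)
second = get_input_embeddings(image, vision_cache=vc, _image_key=key)
# The tower ran once; the second call was served from the cache.
```

Calling get_input_embeddings without the cache kwargs also works, since the extras are only ever read out of **kwargs; the tower then simply runs every time.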
🤖 Generated with Claude Code