
Commit b8cba45 (parent e8fda9b)

Update prefix cache stats release: v0.3.0 (#19)

Signed-off-by: Varun Gupta <varungup90@gmail.com>

1 file changed: +5 -5 lines

content/posts/2025-05-21-v0.3.0-release.md

Lines changed: 5 additions & 5 deletions
@@ -164,12 +164,12 @@ This release upgrades routing logic with intelligent, adaptive strategies for LL
 
 Inference engines such as vLLM provide prefix caching, where the KV cache of existing queries is retained so that a new query sharing a prefix with an earlier query can reuse that cache and skip recomputing the shared part. To take advantage of this feature, the AIBrix gateway introduces prefix-cache-aware routing. Some high-level design details:
 
-- To use [prefix-cache routing](https://aibrix.readthedocs.io/latest/features/gateway-plugins.html#routing-strategies), include header *`"routing-strategy": "prefix-cache"`*.
-
-- Prefix-cache routing load-balances so that no hot spots are created, i.e. all requests sharing the same prefix are intelligently balanced across pods. The goal is to increase prefix-cache sharing without creating a hot spot (more [implementation details](https://github.com/vllm-project/aibrix/blob/main/pkg/plugins/gateway/algorithms/README.md#prefix-cache-aware), [PR](https://github.com/vllm-project/aibrix/pull/933)). We observed a \~45% improvement in TTFT with prefix-cache routing compared to random routing.
+- Prefix-cache routing load-balances so that no hot spots are created, i.e. all requests sharing the same prefix are intelligently balanced across pods. The goal is to increase prefix-cache sharing without creating a hot spot (more [implementation details](https://github.com/vllm-project/aibrix/blob/main/pkg/plugins/gateway/algorithms/README.md#prefix-cache-aware), [PR](https://github.com/vllm-project/aibrix/pull/933)). We observed a \~45% improvement in TTFT (averaged across different request patterns) with prefix-cache routing compared to random routing.
 
 - Prefix-cache routing supports multi-turn conversations: the gateway's router identifies multi-turn conversations and routes such requests efficiently to ensure KV-cache sharing.
 
+> To use [prefix-cache routing](https://aibrix.readthedocs.io/latest/features/gateway-plugins.html#routing-strategies), include the header *`"routing-strategy": "prefix-cache"`*.
+
 <p align="center">
 <img src="/images/v0.3.0-release/aibrix-prefix-cache-aware.png" width="80%" style="display:inline-block; margin-right:1%" />
 </p>
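The release note above selects routing via a `"routing-strategy"` request header on the gateway. As a minimal client-side sketch (the gateway URL, port, and model name below are hypothetical placeholders, not values from the release), a caller of an OpenAI-compatible AIBrix gateway might assemble such a request like this:

```python
import json

# Hypothetical gateway endpoint -- adjust to your AIBrix deployment.
AIBRIX_GATEWAY = "http://localhost:8888/v1/chat/completions"


def build_request(messages, routing_strategy="prefix-cache", model="llama-3-8b"):
    """Build headers and body for an OpenAI-compatible request routed by AIBrix.

    The "routing-strategy" header (per the release notes) selects the gateway's
    routing algorithm. Requests in the same multi-turn conversation share a
    prefix, so prefix-cache routing can land them on a pod that already holds
    the corresponding KV cache.
    """
    headers = {
        "Content-Type": "application/json",
        "routing-strategy": routing_strategy,
    }
    body = json.dumps({"model": model, "messages": messages})
    return headers, body


headers, body = build_request([{"role": "user", "content": "Hello"}])
# The pair can then be sent with any HTTP client, e.g.
# urllib.request.Request(AIBRIX_GATEWAY, data=body.encode(), headers=headers)
```

The sketch stops short of making the network call; only the header shape matters here.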
@@ -186,7 +186,7 @@ Load Balancing: when the shared prefix portion is smaller than a configurable th
 
 By combining these three costs (L + M + P), Preble assigns each request to the GPU with the lowest total cost, effectively balancing immediate processing efficiency against long-term cluster performance. Our implementation of Preble in AIBrix is based on the original [Preble code](https://github.com/WukLab/preble).
 
-> To use the Preble-based prefix-cache solution, include the header *`"routing-strategy": "prefix-cache-preble"`*.
+> To use the Preble-based prefix-cache solution, include the header *`"routing-strategy": "prefix-cache-preble"`*. Current status is *experimental*.
 
 
 <p align="center">
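The L + M + P selection described in this hunk reduces to an argmin over per-GPU cost totals. A toy sketch, assuming the three cost terms are already computed per GPU (the numbers below are invented and this is not AIBrix's actual scorer):

```python
def pick_gpu(costs):
    """Pick the GPU whose total cost L + M + P is lowest.

    costs: dict mapping gpu name -> (L, M, P), the three per-GPU cost terms
    from the Preble-style scheme described above. Simplified illustration:
    the real Preble/AIBrix scorer computes these terms from live state.
    """
    return min(costs, key=lambda gpu: sum(costs[gpu]))


# Hypothetical numbers: gpu-1 already caches most of the prefix, so its
# prefix-recompute term is small and its total wins despite higher load.
choice = pick_gpu({
    "gpu-0": (0.2, 0.1, 0.9),  # total 1.2
    "gpu-1": (0.4, 0.1, 0.1),  # total 0.6
})
# choice == "gpu-1"
```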
@@ -202,7 +202,7 @@ The Virtual Token Counter (VTC) is a fair scheduling algorithm for LLM serving b
 
 The `vtc-basic` router implements the Windowed Adaptive Fairness Routing algorithm, which uses a windowed, adaptive, clamped-linear approach to ensure load fairness among users. It has four key components: (1) a sliding window that tracks token usage over configurable time periods, (2) adaptive bucket sizing that dynamically adjusts based on observed token patterns, (3) clamped token values to prevent extreme sensitivity and jitter, and (4) a linear mapping between tokens and pod assignments. Using these components, the router creates a hybrid scoring system that balances fairness (based on normalized user token counts) with utilization (based on current pod load) to select the pod with the lowest combined score, ensuring both fair resource allocation and efficient system utilization. Environment variables to override and configure the router's default values are listed [here](https://github.com/vllm-project/aibrix/tree/main/pkg/plugins/gateway/algorithms#environment-variables).
 
-> To use fairness-oriented routing, include the header *`"routing-strategy": "vtc-basic"`*.
+> To use fairness-oriented routing, include the header *`"routing-strategy": "vtc-basic"`*. Current status is *experimental*.
 
 ### 3. Synthetic Benchmarking & Load Generation Framework
 
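The four `vtc-basic` components fold into a single per-pod score. The sketch below is a loose illustration of that hybrid idea only, not the actual AIBrix router: the bucket size, the fairness weight, and the exact linear token-to-pod mapping are invented placeholders standing in for the configurable values the release describes.

```python
def pick_pod(user_tokens, pod_loads, bucket_size=1000, fairness_weight=0.5):
    """Sketch of a hybrid fairness/utilization pod choice (hypothetical params).

    Fairness term: the user's windowed, clamped token count is mapped linearly
    onto the pod list, so users at different usage levels prefer different pods.
    Utilization term: the current normalized load of each pod.
    The pod with the lowest combined score wins.
    """
    n = len(pod_loads)
    preferred = (user_tokens // bucket_size) % n  # linear token -> pod mapping

    def score(i):
        fairness = abs(i - preferred) / max(n - 1, 1)
        return fairness_weight * fairness + (1 - fairness_weight) * pod_loads[i]

    return min(range(n), key=score)


# Hypothetical: a user with 2500 windowed tokens, three pods with given loads.
idx = pick_pod(2500, [0.9, 0.2, 0.4])
# idx == 2: pod 2 is this user's mapped pod and is only moderately loaded.
```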