
Commit b8cba45 (parent e8fda9b)

Update prefix cache stats release: v0.3.0 (#19)

Signed-off-by: Varun Gupta <varungup90@gmail.com>

1 file changed: +5 -5 lines

content/posts/2025-05-21-v0.3.0-release.md

Lines changed: 5 additions & 5 deletions
@@ -164,12 +164,12 @@ This release upgrades routing logic with intelligent, adaptive strategies for LL
 
 Inference engines such as vLLM provide prefix caching, where the KV cache of existing queries is retained so that a new query sharing a prefix with an earlier query can reuse that cache and skip recomputing the shared part. To take advantage of this feature, the AIBrix gateway introduces prefix-cache-aware routing. Some high-level design details:
 
-- To use [prefix-cache routing](https://aibrix.readthedocs.io/latest/features/gateway-plugins.html#routing-strategies), include header *`"routing-strategy": "prefix-cache"`*.
-
-- Prefix-cache routing load-balances so that no hot spots are created, i.e. all requests sharing the same prefix are intelligently balanced across pods. The goal is to increase prefix-cache sharing without creating a hot spot (more [implementation details](https://github.com/vllm-project/aibrix/blob/main/pkg/plugins/gateway/algorithms/README.md#prefix-cache-aware), [PR](https://github.com/vllm-project/aibrix/pull/933)). We observed a \~45% improvement in TTFT with prefix-cache routing compared to random routing.
+- Prefix-cache routing load-balances so that no hot spots are created, i.e. all requests sharing the same prefix are intelligently balanced across pods. The goal is to increase prefix-cache sharing without creating a hot spot (more [implementation details](https://github.com/vllm-project/aibrix/blob/main/pkg/plugins/gateway/algorithms/README.md#prefix-cache-aware), [PR](https://github.com/vllm-project/aibrix/pull/933)). We observed a \~45% improvement in TTFT (averaged across different request patterns) with prefix-cache routing compared to random routing.
 
 - Prefix-cache routing supports multi-turn conversations: the gateway's router identifies multi-turn conversations and routes such requests efficiently to ensure KV-cache sharing.
 
+> To use [prefix-cache routing](https://aibrix.readthedocs.io/latest/features/gateway-plugins.html#routing-strategies), include the header *`"routing-strategy": "prefix-cache"`*.
+
 <p align="center">
 <img src="/images/v0.3.0-release/aibrix-prefix-cache-aware.png" width="80%" style="display:inline-block; margin-right:1%" />
 </p>
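The release note above selects routing via a `"routing-strategy"` request header on the gateway. As a minimal client-side sketch (the gateway URL, port, and model name below are hypothetical placeholders, not values from the release), a caller of an OpenAI-compatible AIBrix gateway might assemble such a request like this:

```python
import json

# Hypothetical gateway endpoint -- adjust to your AIBrix deployment.
AIBRIX_GATEWAY = "http://localhost:8888/v1/chat/completions"


def build_request(messages, routing_strategy="prefix-cache", model="llama-3-8b"):
    """Build headers and body for an OpenAI-compatible request routed by AIBrix.

    The "routing-strategy" header (per the release notes) selects the gateway's
    routing algorithm. Requests in the same multi-turn conversation share a
    prefix, so prefix-cache routing can land them on a pod that already holds
    the corresponding KV cache.
    """
    headers = {
        "Content-Type": "application/json",
        "routing-strategy": routing_strategy,
    }
    body = json.dumps({"model": model, "messages": messages})
    return headers, body


headers, body = build_request([{"role": "user", "content": "Hello"}])
# The pair can then be sent with any HTTP client, e.g.
# urllib.request.Request(AIBRIX_GATEWAY, data=body.encode(), headers=headers)
```

The sketch stops short of making the network call; only the header shape matters here.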
@@ -186,7 +186,7 @@ Load Balancing: when the shared prefix portion is smaller than a configurable th
 
 By combining these three costs (L + M + P), Preble assigns each request to the GPU with the lowest total cost, effectively balancing immediate processing efficiency against long-term cluster performance. Our implementation of Preble in AIBrix is based on the original [Preble code](https://github.com/WukLab/preble).
 
-> To use the Preble-based prefix-cache solution, include the header *`"routing-strategy": "prefix-cache-preble"`*.
+> To use the Preble-based prefix-cache solution, include the header *`"routing-strategy": "prefix-cache-preble"`*. Current status is *experimental*.
 
 
 <p align="center">
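The L + M + P selection described in this hunk reduces to an argmin over per-GPU cost totals. A toy sketch, assuming the three cost terms are already computed per GPU (the numbers below are invented and this is not AIBrix's actual scorer):

```python
def pick_gpu(costs):
    """Pick the GPU whose total cost L + M + P is lowest.

    costs: dict mapping gpu name -> (L, M, P), the three per-GPU cost terms
    from the Preble-style scheme described above. Simplified illustration:
    the real Preble/AIBrix scorer computes these terms from live state.
    """
    return min(costs, key=lambda gpu: sum(costs[gpu]))


# Hypothetical numbers: gpu-1 already caches most of the prefix, so its
# prefix-recompute term is small and its total wins despite higher load.
choice = pick_gpu({
    "gpu-0": (0.2, 0.1, 0.9),  # total 1.2
    "gpu-1": (0.4, 0.1, 0.1),  # total 0.6
})
# choice == "gpu-1"
```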
@@ -202,7 +202,7 @@ The Virtual Token Counter (VTC) is a fair scheduling algorithm for LLM serving b
 
 The `vtc-basic` router implements the Windowed Adaptive Fairness Routing algorithm, which uses a windowed, adaptive, clamped-linear approach to ensure load fairness among users. It has four key components: (1) a sliding window that tracks token usage over configurable time periods, (2) adaptive bucket sizing that dynamically adjusts based on observed token patterns, (3) clamped token values to prevent extreme sensitivity and jitter, and (4) a linear mapping between tokens and pod assignments. Using these components, the router creates a hybrid scoring system that balances fairness (based on normalized user token counts) with utilization (based on current pod load) to select the pod with the lowest combined score, ensuring both fair resource allocation and efficient system utilization. Environment variables to override and configure the router's default values are listed [here](https://github.com/vllm-project/aibrix/tree/main/pkg/plugins/gateway/algorithms#environment-variables).
 
-> To use fairness-oriented routing, include the header *`"routing-strategy": "vtc-basic"`*.
+> To use fairness-oriented routing, include the header *`"routing-strategy": "vtc-basic"`*. Current status is *experimental*.
 
 ### 3. Synthetic Benchmarking & Load Generation Framework
 
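The four `vtc-basic` components fold into a single per-pod score. The sketch below is a loose illustration of that hybrid idea only, not the actual AIBrix router: the bucket size, the fairness weight, and the exact linear token-to-pod mapping are invented placeholders standing in for the configurable values the release describes.

```python
def pick_pod(user_tokens, pod_loads, bucket_size=1000, fairness_weight=0.5):
    """Sketch of a hybrid fairness/utilization pod choice (hypothetical params).

    Fairness term: the user's windowed, clamped token count is mapped linearly
    onto the pod list, so users at different usage levels prefer different pods.
    Utilization term: the current normalized load of each pod.
    The pod with the lowest combined score wins.
    """
    n = len(pod_loads)
    preferred = (user_tokens // bucket_size) % n  # linear token -> pod mapping

    def score(i):
        fairness = abs(i - preferred) / max(n - 1, 1)
        return fairness_weight * fairness + (1 - fairness_weight) * pod_loads[i]

    return min(range(n), key=score)


# Hypothetical: a user with 2500 windowed tokens, three pods with given loads.
idx = pick_pod(2500, [0.9, 0.2, 0.4])
# idx == 2: pod 2 is this user's mapped pod and is only moderately loaded.
```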