content/posts/2025-05-21-v0.3.0-release.md: 5 additions & 5 deletions
@@ -164,12 +164,12 @@ This release upgrades routing logic with intelligent, adaptive strategies for LL

Inference engines such as vLLM provide prefix caching: the KV cache of existing queries is retained so that a new query sharing a prefix with an earlier one can reuse that cache and skip recomputing the shared portion. To take advantage of this feature, the AIBrix gateway introduces prefix-cache-aware routing. Some high-level design details:

-
- To use [prefix-cache routing](https://aibrix.readthedocs.io/latest/features/gateway-plugins.html#routing-strategies), include header *`"routing-strategy": "prefix-cache"`*.
-
-
- Prefix-cache routing does load balancing to ensure no hot spots are created, i.e. all requests sharing the same prefix are intelligently balanced across pods. The goal is to increase prefix-cache sharing without creating a hot spot (more [implementation details](https://github.com/vllm-project/aibrix/blob/main/pkg/plugins/gateway/algorithms/README.md#prefix-cache-aware), [PR](https://github.com/vllm-project/aibrix/pull/933)). Observed a ~45% improvement in TTFT with prefix-cache compared to random routing.
+
- Prefix-cache routing does load balancing to ensure no hot spots are created, i.e. all requests sharing the same prefix are intelligently balanced across pods. The goal is to increase prefix-cache sharing without creating a hot spot (more [implementation details](https://github.com/vllm-project/aibrix/blob/main/pkg/plugins/gateway/algorithms/README.md#prefix-cache-aware), [PR](https://github.com/vllm-project/aibrix/pull/933)). Observed a ~45% improvement in TTFT (averaged across different request patterns) with prefix-cache compared to random routing.

- Prefix-cache routing supports multi-turn conversations: the gateway's router identifies a multi-turn conversation and routes such requests efficiently to ensure KV-cache sharing.

+
> To use [prefix-cache routing](https://aibrix.readthedocs.io/latest/features/gateway-plugins.html#routing-strategies), include the header *`"routing-strategy": "prefix-cache"`*.
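As a sketch, a client opting into prefix-cache routing only needs to add that one header to an otherwise ordinary OpenAI-style chat request. The gateway URL and model name below are placeholders for your own deployment, not values from the release notes:

```python
import json

# Hypothetical endpoint and model name; substitute your AIBrix deployment's values.
GATEWAY_URL = "http://localhost:8888/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
    # The documented knob: ask the gateway to use prefix-cache-aware routing.
    "routing-strategy": "prefix-cache",
}
payload = {
    "model": "deepseek-llm-7b-chat",
    "messages": [{"role": "user", "content": "Summarize the document we discussed."}],
}
body = json.dumps(payload)
# Then send it with any HTTP client, e.g.:
# requests.post(GATEWAY_URL, headers=headers, data=body)
```

Requests carrying the same long prompt prefix can then be steered toward pods that already hold the matching KV cache.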
@@ -186,7 +186,7 @@ Load Balancing: when the shared prefix portion is smaller than a configurable th

By combining these three costs (L + M + P), Preble assigns each request to the GPU with the lowest total cost, effectively balancing immediate processing efficiency against long-term cluster performance. Our implementation in AIBrix is based on the original [Preble code](https://github.com/WukLab/preble).

-
> To use the Preble-based prefix-cache solution, include the header *`"routing-strategy": "prefix-cache-preble"`*.
+
> To use the Preble-based prefix-cache solution, include the header *`"routing-strategy": "prefix-cache-preble"`*. Current status is *experimental*.
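The "lowest total cost wins" selection step can be sketched as below. The numeric cost values are illustrative only, and `pick_gpu` is a hypothetical helper, not Preble's or AIBrix's actual code; it just shows how the three per-GPU cost terms (L, M, P) combine into a single placement decision:

```python
def pick_gpu(costs):
    """Return the GPU id whose summed cost terms (L, M, P) are lowest.

    costs maps a GPU id to a (L, M, P) tuple of the three cost terms
    described above for routing a given request to that GPU.
    """
    return min(costs, key=lambda gpu: sum(costs[gpu]))

# Illustrative numbers only: gpu-1 wins with total 1.7 vs gpu-0's 4.5.
costs = {
    "gpu-0": (3.0, 1.0, 0.5),
    "gpu-1": (1.0, 0.5, 0.2),
}
best = pick_gpu(costs)
```

Because the terms are summed, a GPU with a strong prefix match can still lose to a less-loaded one, which is exactly the efficiency-versus-balance trade-off described above.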

<p align="center">
@@ -202,7 +202,7 @@ The Virtual Token Counter (VTC) is a fair scheduling algorithm for LLM serving b

The `vtc-basic` router implements the Windowed Adaptive Fairness Routing algorithm, which uses a windowed, adaptive, clamped-linear approach to ensure load fairness among users. It has four key components: (1) a sliding window that tracks token usage over configurable time periods, (2) adaptive bucket sizing that adjusts dynamically based on observed token patterns, (3) clamped token values to prevent extreme sensitivity and jitter, and (4) a linear mapping between tokens and pod assignments. Using these components, the router creates a hybrid scoring system that balances fairness (based on normalized user token counts) with utilization (based on current pod load) and selects the pod with the lowest combined score, ensuring both fair resource allocation and efficient system utilization. Environment variables to override the router's default configuration are listed [here](https://github.com/vllm-project/aibrix/tree/main/pkg/plugins/gateway/algorithms#environment-variables).

-
> To use fairness-oriented routing, include the header *`"routing-strategy": "vtc-basic"`*.
+
> To use fairness-oriented routing, include the header *`"routing-strategy": "vtc-basic"`*. Current status is *experimental*.
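A minimal sketch of the hybrid scoring idea (token clamping plus a fairness/utilization blend) is shown below. The bounds, weights, and function names are made-up defaults for illustration, not the router's actual implementation or its environment-variable values:

```python
def hybrid_score(user_tokens, pod_load,
                 min_tokens=100.0, max_tokens=10_000.0,
                 fairness_weight=1.0, utilization_weight=1.0):
    # Clamp the user's token count to damp jitter from extreme values.
    clamped = max(min_tokens, min(user_tokens, max_tokens))
    # Linearly normalize the clamped count into [0, 1] as the fairness term.
    fairness = (clamped - min_tokens) / (max_tokens - min_tokens)
    # Blend fairness with current pod load; lower is better.
    return fairness_weight * fairness + utilization_weight * pod_load

def pick_pod(user_tokens, pod_loads):
    # Select the pod with the lowest combined score for this user.
    return min(pod_loads, key=lambda pod: hybrid_score(user_tokens, pod_loads[pod]))

chosen = pick_pod(user_tokens=5_000, pod_loads={"pod-a": 0.9, "pod-b": 0.1})
```

With equal weights, a heavy user is still served, but ties break toward lightly loaded pods, which is the fairness-plus-utilization balance the paragraph describes.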