Summary
Explore semantic-aware session affinity mechanisms to intelligently route conversation requests to optimal backend instances based on semantic understanding of the conversation context.
Background
Traditional session affinity (sticky sessions) routes requests based on:
- Client IP
- Session cookies
- Round-robin with hash
These approaches are semantically blind - they don't consider:
- What the conversation is about
- Which backend has relevant cached context
- Model specialization for certain topics
- KV cache locality for continued conversations
Research Areas
1. Semantic Fingerprinting
- Conversation embedding: Generate embeddings for conversation sessions
- Topic classification: Categorize conversations by domain/topic
- Intent detection: Identify conversation intent patterns
- Entity extraction: Track key entities across the conversation
2. KV Cache-Aware Routing
- Cache locality: Route to instances with warm KV cache for the session
- Prefix sharing: Group conversations with similar prefixes
- Cache pressure prediction: Predict cache eviction and route accordingly
- Speculative caching: Pre-warm caches on predicted backends
3. Model Specialization Routing
- Topic-model mapping: Route coding questions to code-tuned models
- Expertise scoring: Score backends by topic expertise
- Dynamic specialization: Learn routing patterns from feedback
- Ensemble routing: Route to multiple specialists and merge
4. Load-Aware Semantic Routing
- Semantic load balancing: Balance by conversation complexity, not just count
- Token budget awareness: Route based on expected token usage
- Latency prediction: Consider semantic complexity in latency estimates
Potential Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    Response API Request                     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Semantic Affinity Router                   │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │  Embedding  │  │    Topic    │  │      KV Cache       │  │
│  │  Generator  │  │ Classifier  │  │      Registry       │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
│                              │                              │
│                       Affinity Score                        │
└─────────────────────────────────────────────────────────────┘
                              │
            ┌──────────────────┼──────────────────┐
            ▼                  ▼                  ▼
     ┌────────────┐     ┌────────────┐     ┌────────────┐
     │ Backend A  │     │ Backend B  │     │ Backend C  │
     │   (Code)   │     │ (General)  │     │   (Math)   │
     └────────────┘     └────────────┘     └────────────┘
```
Potential Implementation

```go
type SemanticAffinityRouter interface {
	// ComputeAffinity computes affinity scores for all backends.
	ComputeAffinity(ctx context.Context, session *SessionContext, backends []Backend) ([]AffinityScore, error)

	// GetFingerprint returns the semantic fingerprint for a session.
	GetFingerprint(ctx context.Context, session *SessionContext) (*SemanticFingerprint, error)

	// RegisterCacheState registers KV cache state for a backend.
	RegisterCacheState(backendID string, sessionID string, cacheMetadata *CacheMetadata) error
}

type AffinityScore struct {
	BackendID     string
	SemanticScore float64 // Topic/intent match
	CacheScore    float64 // KV cache locality
	LoadScore     float64 // Current load factor
	CombinedScore float64 // Weighted combination
}

type SemanticFingerprint struct {
	Embedding  []float32
	Topics     []string
	Entities   []string
	Intent     string
	Complexity float64
}
```

Success Metrics
- Cache hit rate improvement
- Response latency reduction (especially for continued conversations)
- Token efficiency (via prefix sharing)
- Response quality improvement (via specialization)
References
- SGLang RadixAttention - Prefix sharing and KV cache management
- Orca - Iteration-level scheduling
- vLLM PagedAttention - Memory management for KV cache
Related
- Parent PR: [Feat][Memory] Add OpenAI Response API support #802
- Builds on existing semantic router capabilities