[Research] Explore Semantic-Aware Session Affinity in Response API #807

@Xunzhuo

Summary

Explore semantic-aware session affinity: mechanisms that route each conversation request to the backend instance best suited to serve it, based on the semantic content of the conversation context.

Background

Traditional session affinity (sticky sessions) routes requests based on:

  • Client IP
  • Session cookies
  • Round-robin with hash

These approaches are semantically blind. They do not consider:

  • What the conversation is about
  • Which backend has relevant cached context
  • Model specialization for certain topics
  • KV cache locality for continued conversations
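
For contrast, a minimal sketch of the hash-based baseline described above. The function name and the idea of keying on a cookie value or client IP string are illustrative assumptions, not part of any existing router. The choice depends only on the bytes of the key, which is exactly why it cannot account for topic, cached context, or specialization.

```go
package main

import "hash/fnv"

// pickByHash maps a session key (hypothetical: a cookie value or client IP)
// to one of the given backends. The same key always lands on the same
// backend as long as the backend list is unchanged.
func pickByHash(sessionKey string, backends []string) string {
	if len(backends) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(sessionKey)) // Write on a hash.Hash never returns an error
	return backends[h.Sum32()%uint32(len(backends))]
}
```

Calling `pickByHash("session-123", []string{"a", "b", "c"})` repeatedly returns the same backend, regardless of what the conversation contains.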

Research Areas

1. Semantic Fingerprinting

  • Conversation embedding: Generate embeddings for conversation sessions
  • Topic classification: Categorize conversations by domain/topic
  • Intent detection: Identify conversation intent patterns
  • Entity extraction: Track key entities across the conversation
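
One way the conversation embedding and topic classification ideas above could fit together is nearest-centroid matching: compare the session embedding against one reference embedding per topic and take the closest. This is a sketch under assumptions. The `centroids` map and both function names are hypothetical, and how embeddings are produced is left open.

```go
package main

import "math"

// cosine returns the cosine similarity of two vectors, or 0 if their
// lengths differ, they are empty, or either has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearestTopic returns the topic whose centroid is most similar to the
// conversation embedding, plus that similarity. Ties are broken by topic
// name so the result does not depend on map iteration order.
// With an empty centroids map it returns "" and -Inf.
func nearestTopic(conv []float32, centroids map[string][]float32) (string, float64) {
	best, bestScore := "", math.Inf(-1)
	for topic, c := range centroids {
		s := cosine(conv, c)
		if s > bestScore || (s == bestScore && topic < best) {
			best, bestScore = topic, s
		}
	}
	return best, bestScore
}
```

The returned similarity could feed a semantic score directly, or the topic label could be used as a lookup key for specialization routing.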

2. KV Cache-Aware Routing

  • Cache locality: Route to instances with warm KV cache for the session
  • Prefix sharing: Co-locate conversations that share a common prefix
  • Cache pressure prediction: Predict cache eviction and route accordingly
  • Speculative caching: Pre-warm caches on predicted backends
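
A possible starting point for cache locality scoring is a registry that remembers which token sequences each backend has served, and scores a new request by the longest prefix it shares with any of them. Everything named here (`cacheRegistry`, `Register`, `CacheScore`, token IDs as `int`) is a hypothetical sketch. Real inference engines typically cache in fixed-size blocks and evict entries, which this ignores, and the type is not safe for concurrent use.

```go
package main

// cacheRegistry tracks, per backend, token sequences believed to be
// resident in that backend's KV cache.
type cacheRegistry struct {
	cached map[string][][]int // backendID -> cached token sequences
}

func newCacheRegistry() *cacheRegistry {
	return &cacheRegistry{cached: make(map[string][][]int)}
}

// Register records that backendID holds KV cache for tokens.
// The slice is copied so later changes by the caller have no effect.
func (r *cacheRegistry) Register(backendID string, tokens []int) {
	r.cached[backendID] = append(r.cached[backendID], append([]int(nil), tokens...))
}

// sharedPrefixLen returns how many leading tokens a and b have in common.
func sharedPrefixLen(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// CacheScore returns the fraction (0 to 1) of the request's tokens covered
// by the longest cached prefix on backendID.
func (r *cacheRegistry) CacheScore(backendID string, request []int) float64 {
	if len(request) == 0 {
		return 0
	}
	best := 0
	for _, seq := range r.cached[backendID] {
		if n := sharedPrefixLen(seq, request); n > best {
			best = n
		}
	}
	return float64(best) / float64(len(request))
}
```

A linear scan is enough to show the idea. A trie or hashed prefix blocks would likely be needed at scale, and cache pressure prediction would have to remove or down-weight entries that are likely evicted.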

3. Model Specialization Routing

  • Topic-model mapping: Route coding questions to code-tuned models
  • Expertise scoring: Score backends by topic expertise
  • Dynamic specialization: Learn routing patterns from feedback
  • Ensemble routing: Route to multiple specialists and merge their responses
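
Expertise scoring and dynamic specialization could be prototyped with a simple table of per-topic scores that is nudged by feedback. The sketch below assumes scores and feedback both lie in [0, 1] and uses an exponential moving average for the update. The `expertise` type, its methods, and the learning rate `alpha` are illustrative choices, not a proposed design.

```go
package main

// expertise maps backendID -> topic -> score in [0, 1].
type expertise map[string]map[string]float64

// semanticScore averages a backend's expertise over the session's topics.
// Unknown backends and unknown topics contribute 0.
func (e expertise) semanticScore(backendID string, topics []string) float64 {
	if len(topics) == 0 {
		return 0
	}
	var sum float64
	for _, t := range topics {
		sum += e[backendID][t] // reading a missing key yields 0
	}
	return sum / float64(len(topics))
}

// update moves the stored score toward feedback using an exponential
// moving average with learning rate alpha (0 keeps the old score,
// 1 replaces it with the feedback).
func (e expertise) update(backendID, topic string, feedback, alpha float64) {
	if e[backendID] == nil {
		e[backendID] = make(map[string]float64)
	}
	old := e[backendID][topic]
	e[backendID][topic] = (1-alpha)*old + alpha*feedback
}
```

For example, starting from `expertise{"code": {"coding": 1.0, "math": 0.5}}`, a session tagged with both topics scores 0.75 on the `code` backend.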

4. Load-Aware Semantic Routing

  • Semantic load balancing: Balance by conversation complexity, not just count
  • Token budget awareness: Route based on expected token usage
  • Latency prediction: Consider semantic complexity in latency estimates
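
To illustrate balancing by expected token usage instead of request count, the sketch below scores a backend by the share of its token capacity that would remain free after accepting a request. The `backendLoad` fields and the notion of a fixed token capacity are assumptions made for the example. How expected tokens are estimated is out of scope here.

```go
package main

// backendLoad is a hypothetical load snapshot for one backend.
type backendLoad struct {
	ID             string
	InFlightTokens int // expected tokens of requests currently being served
	TokenCapacity  int // tokens the backend can handle concurrently
}

// loadScore returns the fraction of token capacity that would remain free
// after accepting a request of expectedTokens, clamped at 0. Higher is
// better. A backend with no capacity configured scores 0.
func loadScore(b backendLoad, expectedTokens int) float64 {
	if b.TokenCapacity <= 0 {
		return 0
	}
	free := b.TokenCapacity - b.InFlightTokens - expectedTokens
	if free <= 0 {
		return 0
	}
	return float64(free) / float64(b.TokenCapacity)
}
```

Two backends serving the same number of requests can score very differently under this measure if one of them is handling long, token-heavy conversations.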

Potential Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Response API Request                      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                 Semantic Affinity Router                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │  Embedding  │  │   Topic     │  │   KV Cache          │  │
│  │  Generator  │  │  Classifier │  │   Registry          │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
│                              │                               │
│                     Affinity Score                           │
└─────────────────────────────────────────────────────────────┘
                              │
           ┌──────────────────┼──────────────────┐
           ▼                  ▼                  ▼
    ┌────────────┐     ┌────────────┐     ┌────────────┐
    │  Backend A │     │  Backend B │     │  Backend C │
    │  (Code)    │     │  (General) │     │  (Math)    │
    └────────────┘     └────────────┘     └────────────┘

Potential Implementation

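package affinity // package name is a placeholder

import "context"

// Placeholder types: the proposal references these without defining them,
// so they are left empty here until their fields are designed.
type (
    SessionContext struct{} // conversation state for one session
    Backend        struct{} // a routable backend instance
    CacheMetadata  struct{} // KV cache state reported by a backend
)

// SemanticAffinityRouter scores backends for a session using semantic,
// cache, and load signals.
type SemanticAffinityRouter interface {
    // ComputeAffinity computes affinity scores for all backends.
    ComputeAffinity(ctx context.Context, session *SessionContext, backends []Backend) ([]AffinityScore, error)

    // GetFingerprint returns the semantic fingerprint for a session.
    GetFingerprint(ctx context.Context, session *SessionContext) (*SemanticFingerprint, error)

    // RegisterCacheState registers the KV cache state a backend holds for a session.
    RegisterCacheState(backendID string, sessionID string, cacheMetadata *CacheMetadata) error
}

// AffinityScore holds the per-signal scores for one backend.
type AffinityScore struct {
    BackendID     string
    SemanticScore float64 // Topic/intent match
    CacheScore    float64 // KV cache locality
    LoadScore     float64 // Current load factor
    CombinedScore float64 // Weighted combination
}

// SemanticFingerprint summarizes what a conversation is about.
type SemanticFingerprint struct {
    Embedding  []float32
    Topics     []string
    Entities   []string
    Intent     string
    Complexity float64
}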
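
The `CombinedScore` field is described as a weighted combination but the weighting is not specified. One simple reading is a normalized weighted average of the three component scores, followed by ranking. The sketch below assumes that reading. The `weights` type and `combine` function are hypothetical, `AffinityScore` is repeated so the block compiles on its own, and `LoadScore` is treated as higher-is-better.

```go
package main

import "sort"

// AffinityScore mirrors the struct from the proposal.
type AffinityScore struct {
	BackendID     string
	SemanticScore float64
	CacheScore    float64
	LoadScore     float64
	CombinedScore float64
}

// weights holds the relative importance of each signal (hypothetical).
type weights struct {
	Semantic, Cache, Load float64
}

// combine returns a copy of scores with CombinedScore set to the weighted
// average of the three component scores, sorted best first. If the weights
// sum to zero or less, every CombinedScore is 0.
func combine(scores []AffinityScore, w weights) []AffinityScore {
	total := w.Semantic + w.Cache + w.Load
	out := make([]AffinityScore, len(scores))
	copy(out, scores)
	for i := range out {
		out[i].CombinedScore = 0
		if total > 0 {
			s := w.Semantic*out[i].SemanticScore +
				w.Cache*out[i].CacheScore +
				w.Load*out[i].LoadScore
			out[i].CombinedScore = s / total
		}
	}
	// Stable sort keeps the input order among equal scores.
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].CombinedScore > out[j].CombinedScore
	})
	return out
}
```

With weights `{Semantic: 1, Cache: 2, Load: 1}`, a backend with a warm cache but no topic match can outrank a topic specialist with a cold cache, which is the kind of trade-off the weights would need to be tuned for.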

Success Metrics

  • Cache hit rate improvement
  • Response latency reduction (especially for continued conversations)
  • Token efficiency (via prefix sharing)
  • Response quality improvement (via specialization)
