This project explores System Architecture and Performance Engineering for Large Language Model (LLM) inference, addressing the high cost and limited capacity of GPU High Bandwidth Memory (HBM). It delivers a Hyper-Converged KV-Cache Offloading pipeline leveraging vLLM, LMCache, and KVRocks to redirect KV-cache data from GPU memory to high-throughput SSDs, establishing performance baselines and insights for scalable inference architectures.
Objective: Maximize throughput, reduce latency, and minimize GPU HBM pressure through intelligent KV-cache offloading to DRAM and SSD tiers, while identifying and resolving serialization and I/O bottlenecks.
The architecture comprises four core components operating in a hyper-converged configuration:
vLLM (inference engine):
- Executes token generation with HBM-resident KV-cache
- Implements prefix caching for inference efficiency
- Integrates with LMCache for seamless tier transitions (see the sketch below)
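A minimal sketch of running vLLM with prefix caching enabled; the model name, memory fraction, and prompt are illustrative assumptions rather than the configuration used in this PoC:

```python
# Minimal vLLM setup with prefix caching enabled (model and settings are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model, not the PoC's
    enable_prefix_caching=True,                # reuse KV-cache blocks across shared prefixes
    gpu_memory_utilization=0.90,               # fraction of HBM for weights + KV-cache
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain KV-cache offloading in one sentence."], params)
print(outputs[0].outputs[0].text)
```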
LMCache manages the multi-tier cache hierarchy:
- Tier 0: HBM (fastest, limited capacity)
- Tier 1: DRAM (high-speed intermediate buffer)
- Tier 2: SSD via KVRocks (high-capacity persistent storage)
Enables automatic spillover to lower tiers when GPU memory saturates, as illustrated in the sketch below.
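The spillover policy can be pictured as a tier-by-tier lookup with LRU demotion. The sketch below is a conceptual illustration of the HBM → DRAM → SSD flow; the class and method names are hypothetical and do not reflect LMCache internals:

```python
# Conceptual multi-tier KV-cache with LRU spillover (hypothetical names, not LMCache internals).
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity: int, dram_capacity: int):
        self.hbm = OrderedDict()   # Tier 0: fastest, smallest
        self.dram = OrderedDict()  # Tier 1: intermediate DRAM buffer
        self.ssd = {}              # Tier 2: KVRocks-backed, treated as unbounded here
        self.hbm_capacity = hbm_capacity
        self.dram_capacity = dram_capacity

    def get(self, key):
        for tier in (self.hbm, self.dram, self.ssd):
            if key in tier:
                value = tier[key] if tier is self.ssd else tier.pop(key)
                self._insert_hbm(key, value)   # promote hot entries back to HBM
                return value
        return None                            # cold miss: KV-cache must be recomputed

    def put(self, key, value):
        self._insert_hbm(key, value)

    def _insert_hbm(self, key, value):
        self.hbm[key] = value
        if len(self.hbm) > self.hbm_capacity:      # HBM pressure: demote LRU entry to DRAM
            k, v = self.hbm.popitem(last=False)
            self.dram[k] = v
        if len(self.dram) > self.dram_capacity:    # DRAM pressure: spill LRU entry to SSD
            k, v = self.dram.popitem(last=False)
            self.ssd[k] = v

cache = TieredKVCache(hbm_capacity=2, dram_capacity=4)
for chunk_id in range(8):
    cache.put(f"prefix-chunk-{chunk_id}", b"...serialized KV block...")
print(len(cache.hbm), len(cache.dram), len(cache.ssd))  # entries settle across the three tiers
```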
KVRocks provides a persistent key-value store tuned for large serialized tensors:
- Tuned block sizes and memtable configurations
- Handles random read patterns characteristic of LLM workloads
- Supports SSD and DRAM backends (a minimal access sketch follows)
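Because Kvrocks speaks the Redis protocol, a stock Redis client can move serialized KV blocks in and out of it. The sketch below is illustrative only; the host, port, key naming, and tensor shape are assumptions, not the PoC's actual integration path (which goes through LMCache):

```python
# Writing/reading one serialized KV-cache block through Kvrocks' Redis-compatible API.
# Host, port, key format, and tensor shape are illustrative assumptions.
import io
import redis
import torch

client = redis.Redis(host="localhost", port=6666)  # 6666 is Kvrocks' default port

def put_kv_block(key: str, tensor: torch.Tensor) -> None:
    buf = io.BytesIO()
    torch.save(tensor, buf)               # serialization cost paid on every offload
    client.set(key, buf.getvalue())

def get_kv_block(key: str):
    raw = client.get(key)
    if raw is None:
        return None                       # cold miss: block must be recomputed on GPU
    return torch.load(io.BytesIO(raw))    # deserialization cost paid on every reload

block = torch.randn(2, 256, 8, 128, dtype=torch.float16)  # one layer's K/V for a 256-token chunk
put_kv_block("kv:layer0:chunk42", block)
restored = get_kv_block("kv:layer0:chunk42")
```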
Key characteristics of the hyper-converged setup:
- Unified vLLM + LMCache deployment where inference and caching coexist on the same node
- Dynamic offloading policies across HBM → DRAM → SSD tiers
- Seamless integration with NVIDIA Dynamo for orchestration
KVRocks tuning and evaluation:
- Fine-tuned block sizes, memtables, and compaction strategies
- Evaluated random read amplification under LLM workload patterns (illustrated in the arithmetic sketch below)
- Compared SSD vs. DRAM backend performance
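As a back-of-the-envelope illustration of read amplification, the sketch below assumes hypothetical block and value sizes (not measured values from this PoC): when the RocksDB block is larger than the stored KV chunk, every cold random read pulls in more bytes than it returns.

```python
# Illustrative read-amplification arithmetic; all sizes are assumptions, not measurements.
def read_amplification(value_bytes: int, block_bytes: int) -> float:
    """Bytes read from storage per byte of useful KV data on a cold random read."""
    blocks_touched = -(-value_bytes // block_bytes)        # ceiling division
    return (blocks_touched * block_bytes) / value_bytes    # ignores index/filter blocks

kv_chunk = 16 * 1024                                       # hypothetical serialized KV chunk: 16 KiB
for block in (4 * 1024, 64 * 1024, 256 * 1024):
    amp = read_amplification(kv_chunk, block)
    print(f"block={block // 1024:>4} KiB  read amplification={amp:.1f}x")
```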
Benchmarking covered three dimensions:
- Throughput: tokens/sec across memory tier configurations
- Latency: P50/P95/P99 analysis for multi-tier scenarios (percentile computation sketched below)
- I/O Profiling: identified serialization overhead as the primary bottleneck
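For reference, tail-latency percentiles of the kind reported here can be computed directly from per-request timings; the sketch below uses synthetic latencies rather than the PoC's measurements:

```python
# Computing P50/P95/P99 latency from per-request timings (synthetic data for illustration).
import numpy as np

latencies_ms = np.random.lognormal(mean=4.0, sigma=0.5, size=10_000)  # stand-in for measured values
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.1f} ms  P95={p95:.1f} ms  P99={p99:.1f} ms")
```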
Profiling demonstrated that the performance degradation stems from:
- Tensor serialization/deserialization, not SSD speed alone (see the micro-benchmark sketch after this list)
- Round-trip overhead to RocksDB
- Chunk size and batch granularity
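A micro-benchmark that separates serialization from raw data movement makes this visible. The sketch below is illustrative; the tensor shape is an assumption and this is not the PoC's actual profiling harness:

```python
# Isolating serialization cost from raw byte movement (tensor shape is illustrative).
import io
import time
import torch

kv_block = torch.randn(2, 1024, 8, 128, dtype=torch.float16)  # one layer's K/V for a 1024-token chunk

# Serialization path: what every offload to KVRocks pays before any I/O happens.
t0 = time.perf_counter()
buf = io.BytesIO()
torch.save(kv_block, buf)
serialize_ms = (time.perf_counter() - t0) * 1e3

# Raw copy path: the same bytes moved with no (de)serialization overhead.
raw = kv_block.contiguous().view(torch.uint8)
t0 = time.perf_counter()
_ = raw.clone()
copy_ms = (time.perf_counter() - t0) * 1e3

print(f"serialize: {serialize_ms:.2f} ms   raw copy: {copy_ms:.2f} ms")
```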
Enabling cache persistence yielded a 2x throughput improvement over vanilla Dynamo execution:
- Dynamo (baseline): Standard inference without KV-cache persistence
- Dynamo + LMCache: Persistent cache across requests
- Result: Cache reuse eliminated redundant computations, doubling effective throughput
Distributed execution across two nodes with a shared cache demonstrated a 2x speedup:
- Workload partitioned across two inference nodes
- Shared LMCache tier enabled cross-node prefix reuse
- Load balancing improved GPU utilization and reduced per-node latency
Performance comparison across storage backends revealed critical trade-offs:
- KVRocks Performance Degradation: Database access proved significantly more expensive than direct DRAM access due to serialization overhead and I/O latency
- Cache-Aware Routing Experiment: locality-aware routing (a precursor to Locality-Preserving Routing) was disabled in subsequent tests
- Implication: Routing intelligence becomes critical at scale — eliminating naive routing strategies yields substantial throughput gains in multi-node deployments
Key insights:
- Serialization dominates latency, not raw storage speed
- DRAM tier provides substantial benefit over direct SSD offload
- Prefix-cache reuse improves when KV-cache persists across tiers
- SSD offloading viable when majority of accesses remain in HBM/DRAM
Without offloading (GPU-only baseline):
- KV-cache confined to GPU HBM
- Cache eviction under memory pressure
- High I/O regeneration cost
- Limited throughput scalability
With hyper-converged offloading:
- Multi-tier memory hierarchy (HBM → DRAM → SSD)
- Persistent KV-cache across tiers
- Intelligent routing with prefix-awareness
- Enhanced throughput and latency under sustained load
While not implemented in this PoC, the system is architected to support Locality-Preserving Routing (sketched after this list), which will:
- Maintain prefix-cache locality to minimize cold misses
- Route requests to nodes with warm KV-cache entries
- Further reduce cross-tier access overhead
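Though outside this PoC's scope, the intended behavior can be sketched as prefix-affinity routing: hash the shared prompt prefix and send requests with the same prefix to the same node. Everything below (node names, prefix length, hashing scheme) is a hypothetical illustration:

```python
# Hypothetical prefix-affinity router: requests sharing a prompt prefix land on the node
# most likely to hold that prefix's KV-cache warm. Not implemented in this PoC.
import hashlib

NODES = ["node-a", "node-b"]                  # assumed two-node deployment

def route(prompt: str, prefix_chars: int = 512) -> str:
    # Hash only the leading portion of the prompt so requests with a shared
    # prefix (e.g., a common system prompt) map to the same node.
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

system_prompt = "You are a helpful assistant. " * 20
print(route(system_prompt + "Summarize document A."))
print(route(system_prompt + "Summarize document B."))  # same node, so the prefix KV-cache stays warm
```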
This PoC establishes the feasibility of Hyper-Converged KV-Cache Offloading for LLM inference, demonstrating that:
- Multi-tier memory hierarchies can effectively extend GPU HBM capacity
- Serialization overhead is the critical bottleneck, not storage bandwidth
- DRAM acts as a highly effective intermediate cache layer
- System-level co-design (vLLM + LMCache + KVRocks) enables scalable, cost-efficient inference architectures
The insights gained provide a foundation for production-grade deployment of memory-tiered LLM serving systems.
Technology stack:
- vLLM: High-performance LLM inference engine
- LMCache: Multi-tier cache management framework
- KVRocks: RocksDB-based key-value store
- NVIDIA Dynamo: Orchestration and routing layer
- RocksDB: Persistent storage backend