This document describes the internal architecture of Llama Stack for contributors and AI agents working with the codebase. For user-facing documentation, see llamastack.github.io. For contribution guidelines, see CONTRIBUTING.md.
Llama Stack is a server that exposes a unified API for AI capabilities: inference, agents, safety, vector storage, evaluation, and more. It is provider-agnostic: the same API works whether the backend is Ollama, OpenAI, vLLM, Fireworks, or dozens of other services.
The codebase is split into three packages:

- `llama-stack-api` (`src/llama_stack_api/`) -- Lightweight package containing API protocol definitions (Python `Protocol` classes), Pydantic data types, and provider spec definitions. No server code, no provider implementations. Third-party providers depend only on this.
- `llama-stack` (`src/llama_stack/`) -- The server implementation: provider resolution, routing, storage, CLI, and all built-in providers.
- `llama-stack-ui` (`src/llama_stack_ui/`) -- Optional web UI for the chat playground and admin. Built with Next.js.
```
Client (llama-stack-client SDK or raw HTTP)
  |
  v
FastAPI Server (src/llama_stack/core/server/server.py)
  |
  |-- AuthenticationMiddleware (token validation, user extraction)
  |-- QuotaMiddleware (rate limiting per client)
  |
  v
Route Dispatch
  |
  |-- FastAPI Router routes (e.g. /v1/openai/* via fastapi_router_registry.py)
  |-- Legacy @webmethod routes (protocol methods with @webmethod decorator)
  |
  v
Router (src/llama_stack/core/routers/)
  |
  |-- Looks up the resource (model, shield, etc.) in the RoutingTable
  |-- Resolves which provider handles this resource
  |-- Enforces access control policies
  |
  v
Provider Implementation
  |
  |-- Inline provider (runs in-process, e.g. meta-reference, sqlite-vec)
  |-- Remote provider (calls external service, e.g. ollama, openai, fireworks)
  |
  v
External Service or Local Computation
```
1. Client sends `POST /v1/openai/chat/completions` with `model: "ollama/llama3.2:3b-instruct-fp16"`; `server.py` dispatches to the inference FastAPI router.
2. The `InferenceRouter` (`core/routers/inference.py`) calls `routing_table.get_provider_impl(model_id)`; `CommonRoutingTableImpl` looks up the model in the `DistributionRegistry` and finds it belongs to the `ollama` provider.
3. The router delegates to the `ollama` provider's `openai_chat_completion()` method.
4. The Ollama provider (which extends `OpenAIMixin`) creates an `AsyncOpenAI` client pointing at the Ollama server and forwards the request.
5. The response streams back through the router to the client as SSE events.
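The routing step above can be sketched in miniature. This is a simplified, hypothetical model (plain synchronous classes with invented names), not the real `CommonRoutingTableImpl`/`InferenceRouter`, which are async and persist registrations in the `DistributionRegistry`:

```python
# Toy sketch of model-to-provider routing: a table maps model IDs to
# provider IDs, and the router delegates each call to the owning provider.

class ToyRoutingTable:
    def __init__(self) -> None:
        self._models: dict[str, str] = {}  # model_id -> provider_id

    def register_model(self, model_id: str, provider_id: str) -> None:
        self._models[model_id] = provider_id

    def get_provider_id(self, model_id: str) -> str:
        if model_id not in self._models:
            raise KeyError(f"model {model_id!r} is not registered")
        return self._models[model_id]


class EchoProvider:
    """Stand-in for a real provider adapter such as the Ollama one."""

    def __init__(self, name: str) -> None:
        self.name = name

    def chat_completion(self, model_id: str, prompt: str) -> str:
        return f"[{self.name}] {model_id}: {prompt}"


class ToyInferenceRouter:
    def __init__(self, table: ToyRoutingTable, providers: dict[str, EchoProvider]) -> None:
        self._table = table
        self._providers = providers

    def chat_completion(self, model_id: str, prompt: str) -> str:
        # Resolve the owning provider via the routing table, then delegate.
        provider = self._providers[self._table.get_provider_id(model_id)]
        return provider.chat_completion(model_id, prompt)


table = ToyRoutingTable()
table.register_model("ollama/llama3.2:3b-instruct-fp16", "ollama")
router = ToyInferenceRouter(table, {"ollama": EchoProvider("ollama")})
print(router.chat_completion("ollama/llama3.2:3b-instruct-fp16", "hi"))
# -> [ollama] ollama/llama3.2:3b-instruct-fp16: hi
```

The point of the indirection is that the router never knows provider details; only the table knows which provider owns which model.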
```
Provider
  |
  |-- InlineProviderSpec (runs in-process)
  |     provider_type: "inline::builtin"
  |     module: "llama_stack.providers.inline.inference.builtin"
  |
  |-- RemoteProviderSpec (adapts an external service)
        provider_type: "remote::ollama"
        module: "llama_stack.providers.remote.inference.ollama"
```
Each provider spec declares:

- `api` -- which API it implements (e.g., `Api.inference`)
- `provider_type` -- unique identifier like `"remote::openai"`
- `module` -- Python module with a `get_adapter_impl()` or `get_provider_impl()` function
- `config_class` -- Pydantic config model for the provider
- `pip_packages` -- additional dependencies needed at runtime
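The shape of a spec can be illustrated with a plain dataclass. This is a hypothetical mirror for illustration only; the real types are `InlineProviderSpec`/`RemoteProviderSpec` in `llama_stack_api`, and the `config_class` path below is invented:

```python
# Hypothetical, simplified stand-in for a provider spec declaration.
from dataclasses import dataclass, field


@dataclass
class ToyProviderSpec:
    api: str                  # which API it implements, e.g. "inference"
    provider_type: str        # unique identifier, e.g. "remote::ollama"
    module: str               # module exposing the factory function
    config_class: str         # dotted path to the Pydantic config model
    pip_packages: list[str] = field(default_factory=list)


ollama_spec = ToyProviderSpec(
    api="inference",
    provider_type="remote::ollama",
    module="llama_stack.providers.remote.inference.ollama",
    config_class="example.config.OllamaConfig",  # illustrative path, not real
    pip_packages=["ollama"],
)

# The "remote::"/"inline::" prefix distinguishes the two spec kinds.
print(ollama_spec.provider_type.split("::"))
# -> ['remote', 'ollama']
```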
src/llama_stack/providers/registry/ contains one file per API (e.g., inference.py, safety.py). Each file defines an available_providers() function that returns all ProviderSpec objects for that API. The registry is loaded at startup by get_provider_registry() in core/distribution.py.
At startup, `resolve_impls()` in `core/resolver.py`:

1. Validates providers declared in the run config against the registry.
2. Sorts providers by dependency order (e.g., agents depends on inference).
3. Instantiates each provider by importing its module and calling its factory function.
4. Sets up auto-routing: for APIs like inference, creates a `RoutingTable` + `Router` pair so multiple providers can serve different models through the same API.
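Steps 2 and 3 amount to a topological sort followed by factory calls. A minimal sketch under that assumption (the dependency graph and factory here are invented, not the resolver's actual data structures):

```python
# Sketch of dependency-ordered provider instantiation using the stdlib
# graphlib. Each API is instantiated only after its dependencies exist.
from graphlib import TopologicalSorter

# Hypothetical api -> predecessor-APIs mapping.
deps = {
    "inference": [],
    "safety": ["inference"],            # e.g. a safety shield calls inference
    "agents": ["inference", "safety"],  # agents orchestrate both
}

order = list(TopologicalSorter(deps).static_order())

impls: dict[str, str] = {}
for api in order:
    # The real resolver imports the provider's module and calls its
    # get_provider_impl()/get_adapter_impl() factory, passing the
    # already-built dependency impls. Here we just record a placeholder.
    assert all(dep in impls for dep in deps[api])
    impls[api] = f"<impl:{api}>"

print(order)
# inference comes before safety, which comes before agents
```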
Many APIs use automatic routing. For example, Api.inference is paired with Api.models:
```
Api.models (RoutingTable)   <-->   Api.inference (Router)
  |                                  |
  |-- ModelsRoutingTable             |-- InferenceRouter
  |     tracks which provider        |     delegates to correct
  |     owns which model             |     provider per request
```
The full list of auto-routed pairs is defined in builtin_automatically_routed_apis() in core/distribution.py:
| Routing Table API | Router API |
|---|---|
| `Api.models` | `Api.inference` |
| `Api.shields` | `Api.safety` |
| `Api.datasets` | `Api.datasetio` |
| `Api.scoring_functions` | `Api.scoring` |
| `Api.benchmarks` | `Api.eval` |
| `Api.tool_groups` | `Api.tool_runtime` |
| `Api.vector_stores` | `Api.vector_io` |
The llama_stack_api package defines all public-facing types and protocols:
- **Protocols** -- Python `Protocol` classes like `Inference`, `Safety`, `Agents` that define the API contract. Methods are annotated with `@webmethod` to specify HTTP routes.
- **Data Types** -- Pydantic models for requests, responses, and resources (e.g., `Model`, `Shield`, `ChatCompletionRequest`).
- **Provider Specs** -- `InlineProviderSpec`, `RemoteProviderSpec`, and related types that define how providers are declared.
- **Internal utilities** -- `KVStore` and `SqlStore` abstract interfaces live here so third-party providers can use them without depending on the full server.
Provider implementations import from llama_stack_api for type definitions and from llama_stack.providers.utils for shared functionality.
Storage is configured in the storage section of the run config (StackConfig.storage). It defines backend references that providers and core services use:
```yaml
storage:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR}/registry.db
  stores:
    kvstore:
      type: kv_sqlite
      db_path: ${env.SQLITE_STORE_DIR}/kvstore.db
    inference:
      type: sql_sqlite
      db_path: ${env.SQLITE_STORE_DIR}/inference_store.db
```

`src/llama_stack/core/storage/kvstore/` provides a key-value store abstraction (`KVStore`) with backends:
| Backend | Config Class | Use Case |
|---|---|---|
| SQLite | `SqliteKVStoreConfig` | Default, single-node |
| Redis | `RedisKVStoreConfig` | Multi-node, caching |
| PostgreSQL | `PostgresKVStoreConfig` | Production deployments |
| MongoDB | `MongoDBKVStoreConfig` | Document-oriented |
Used by: distribution registry, quota tracking, provider state.
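The abstraction boils down to get/set over a backend. A minimal synchronous sketch with a SQLite backend (the real `KVStore` interface in `llama_stack_api` is async and richer; the class names here are invented):

```python
# Toy KVStore-style abstraction: a Protocol for the interface plus one
# SQLite-backed implementation using stdlib sqlite3.
import sqlite3
from typing import Optional, Protocol


class KVStoreLike(Protocol):
    def set(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class SqliteKV:
    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )

    def set(self, key: str, value: str) -> None:
        # UPSERT: overwrite the value if the key already exists.
        self._db.execute(
            "INSERT INTO kv (key, value) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
            (key, value),
        )
        self._db.commit()

    def get(self, key: str) -> Optional[str]:
        row = self._db.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None


store: KVStoreLike = SqliteKV()
store.set("registry:model:ollama/llama3.2", '{"provider_id": "ollama"}')
print(store.get("registry:model:ollama/llama3.2"))
# -> {"provider_id": "ollama"}
```

Because callers depend only on the interface, swapping SQLite for Redis or Postgres is a config change, which is exactly why the registry survives restarts regardless of backend.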
src/llama_stack/core/storage/sqlstore/ provides a SQL store abstraction (SqlStore) with SQLAlchemy backends:
| Backend | Config Class | Use Case |
|---|---|---|
| SQLite | `SqliteSqlStoreConfig` | Default, single-node |
| PostgreSQL | `PostgresSqlStoreConfig` | Production deployments |
Used by: inference store (chat completion logs), conversations, prompts.
src/llama_stack/core/store/ implements DistributionRegistry, which tracks all registered resources (models, shields, datasets, etc.) across providers. It persists to the configured KVStore so resources survive server restarts.
A YAML file that defines everything about a running Llama Stack instance:
```yaml
version: 2
distro_name: starter
apis:
- inference
- agents
- safety
# ...
providers:
  inference:
  - provider_id: ollama
    provider_type: remote::ollama
    config:
      base_url: ${env.OLLAMA_URL:=http://localhost:11434/v1}
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
storage:
  type: sqlite
  db_path: ...
```

Key features:
- Environment variable substitution: `${env.VAR_NAME:=default}` syntax for config values.
- Conditional providers: `${env.API_KEY:+provider_id}` syntax enables a provider only when a variable is set.
- Multiple providers per API: e.g., both `ollama` and `openai` can serve inference, each handling different models.
A distribution is a pre-built configuration that bundles specific providers for a target environment. Think of it like Kubernetes distributions (AKS, EKS, GKE): the core API stays the same, but each distribution wires different backends. src/llama_stack/distributions/ contains these configurations (e.g., starter, dell, nvidia). Each distribution directory has:
- `config.yaml` -- the run config
- Templates and codegen support via `template.py`
Used by llama stack build to create container images. Declares which providers to include and what packages to install. Versioned separately from the run config.
Integration tests use a record/replay system (src/llama_stack/testing/api_recorder.py) that intercepts OpenAI client calls to record real API responses, then replays them for fast, deterministic CI runs.
- **Recording:** Tests run against a real server. The `APIRecorder` monkey-patches `OpenAI` client methods to capture every request/response pair. Responses are stored as JSON files under `tests/integration/recordings/`.
- **Replay:** In CI, tests run in replay mode. The recorder matches incoming requests to stored responses by hashing the request parameters, returning cached responses instead of making real API calls.
- **Modes** (controlled by `--inference-mode` or `LLAMA_STACK_TEST_INFERENCE_MODE`):
  - `replay` (default) -- use cached responses
  - `record` -- force-record all interactions
  - `record-if-missing` -- record only when no cached response exists
  - `live` -- bypass recording entirely, make real calls
- **Deterministic IDs:** The recorder overrides ID generation (`set_id_override()`) during replay so that resource IDs (files, vector stores, etc.) are reproducible across runs.
Recordings live in tests/integration/recordings/ organized by provider and test. Each recording is a JSON file containing the serialized request parameters and response. An SQLite index maps requests to response files.
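The request-hash matching can be illustrated in a few lines. This is a toy model of the idea, not the `APIRecorder` implementation; the canonicalization details below (sorted-key JSON over method plus params) are assumptions:

```python
# Toy record/replay cache keyed by a hash of the canonicalized request.
import hashlib
import json


def request_key(method: str, params: dict) -> str:
    # Sorting keys makes the hash stable regardless of argument order.
    canonical = json.dumps({"method": method, "params": params}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


recordings: dict[str, dict] = {}

# "record" mode: store the response under the request's hash.
key = request_key("chat.completions.create", {"model": "m", "messages": []})
recordings[key] = {"choices": [{"message": {"content": "cached"}}]}

# "replay" mode: an identical request (even with params in a different
# order) produces the same key and hits the cache instead of the network.
same = request_key("chat.completions.create", {"messages": [], "model": "m"})
assert same == key
print(recordings[same]["choices"][0]["message"]["content"])
# -> cached
```

A miss in this scheme (no stored response for a hash) is what distinguishes `replay` (fail or error) from `record-if-missing` (fall through to a real call and record it).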
For more details, see tests/README.md and tests/integration/README.md.
| Component | Location | Purpose |
|---|---|---|
| `LlamaStack` | `core/stack.py` | Composite class implementing all API protocols |
| `Stack` | `core/stack.py` | Initialization, resource registration, lifecycle |
| `StackApp` | `core/server/server.py` | FastAPI app wrapper |
| `resolve_impls()` | `core/resolver.py` | Provider instantiation and dependency resolution |
| `CommonRoutingTableImpl` | `core/routing_tables/common.py` | Base routing table for all auto-routed APIs |
| `InferenceRouter` | `core/routers/inference.py` | Routes inference calls to correct provider |
| `OpenAIMixin` | `providers/utils/inference/openai_mixin.py` | Shared OpenAI-compatible client logic |
| `get_provider_registry()` | `core/distribution.py` | Loads all available provider specs |
| `APIRecorder` | `testing/api_recorder.py` | Record/replay test infrastructure |
```
src/
  llama_stack_api/        # API definitions package (separate pip package)
    inference.py          # Inference protocol
    agents.py             # Agents protocol
    datatypes.py          # Shared data types
    providers/            # Provider spec types
    internal/             # KVStore/SqlStore interfaces
  llama_stack/            # Server implementation
    core/
      server/             # FastAPI server, auth, routing
      routers/            # API-specific routers (inference, safety, etc.)
      routing_tables/     # Resource-to-provider mapping
      storage/            # KVStore and SqlStore backends
      store/              # Distribution registry
      resolver.py         # Provider resolution engine
      distribution.py     # Provider registry loading
      stack.py            # Stack initialization and lifecycle
    providers/
      inline/             # In-process provider implementations
      remote/             # Remote service adapters
      registry/           # Provider spec declarations
      utils/              # Shared provider utilities
    distributions/        # Pre-built distribution configs
    cli/                  # CLI commands (llama stack run, build, etc.)
    testing/              # Test infrastructure (api_recorder)
tests/
  unit/                   # Fast, isolated tests
  integration/            # End-to-end tests with record/replay
    recordings/           # Cached API responses
```