Automated end-to-end conformance tests for llm-d inference deployments on Kubernetes using the LLMInferenceService CRD.
The test framework creates LLMInferenceService CRs on your cluster and validates the full lifecycle — the llm-d/KServe operator handles everything (vLLM image, pods, routing). Each test case is a proper Ginkgo spec with its own description and labels.
Built with Go 1.24, Ginkgo v2, and driven entirely by YAML configs. Zero code changes needed to add new test cases.
Manifests are maintained in a separate repo: llm-d-conformance-manifests with branches per release (3.4-ea1, 3.4-ea2).
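For orientation, a minimal LLMInferenceService CR might look roughly like the sketch below. This is illustrative only: the `spec` field names and the exact `apiVersion` vary by release, so always use the manifests from the matching branch of llm-d-conformance-manifests rather than hand-writing CRs.

```yaml
# Illustrative sketch only — field names and apiVersion are assumptions;
# the authoritative manifests live in llm-d-conformance-manifests.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: single-gpu-smoke
  namespace: llm-conformance-test
spec:
  model:
    uri: hf://Qwen/Qwen3-0.6B
    name: Qwen/Qwen3-0.6B
  replicas: 1
```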
- Go 1.24+
- Access to a Kubernetes cluster with:
  - llm-d / KServe operator installed (`LLMInferenceService` CRD)
  - Gateway API configured
  - GPU nodes available
- `kubectl`
```bash
# 1. Clone and install dependencies
git clone https://github.com/aneeshkp/llm-d-conformance-test.git
cd llm-d-conformance-test
make deps

# 2. Clone manifests (must match your cluster version)
make setup MANIFEST_REF=3.4-ea1   # EA1 clusters (most common)
make setup MANIFEST_REF=3.4-ea2   # EA2 clusters

# 3. Verify the framework works (no cluster needed)
make test-smoke

# 4. Run a quick smoke test
export KUBECONFIG=~/.kube/my-cluster
make test TESTCASE=single-gpu-smoke

# 5. Run all conformance tests
make test-profile-all
```

Run `make testcases` to list the available test cases:

| Test Case | Default Model | Features |
|---|---|---|
| `single-gpu-smoke` | Qwen/Qwen3-0.6B | 1 GPU, fast CI/CD smoke test |
| `single-gpu` | Qwen/Qwen3-0.6B | 1 GPU with scheduler + metrics |
| `single-gpu-no-scheduler` | Qwen/Qwen3-0.6B | 1 GPU, K8s native routing |
| `cache-aware` | Qwen/Qwen3-0.6B | Prefix KV cache-aware routing, 2 replicas |
| `pd` | Qwen/Qwen3-0.6B | P/D disaggregation (2 prefill + 1 decode) |
| `moe` | DeepSeek-R1 | MoE DP/EP, 8 GPUs, RDMA/RoCE |
Override the default model:

```bash
make test TESTCASE=single-gpu MODEL=Qwen/Qwen2.5-7B-Instruct
```

Run `make profiles` to list the available profiles:

| Profile | Make target | Test cases |
|---|---|---|
| `smoke` | `make test-profile-smoke` | single-gpu-smoke |
| `all` | `make test-profile-all` | single-gpu, no-scheduler, cache-aware, pd |
| `cache-aware` | `make test-profile-cache` | cache-aware |
| `pd` | `make test-profile-pd` | pd |
| `moe` | `make test-profile-moe` | moe (requires 8 GPUs + RDMA) |
Manifests live in a separate repo: llm-d-conformance-manifests
You must match the branch to your cluster's CRD version — EA1 manifests will not work on EA2 clusters and vice versa. See the manifest repo README for details on EA1 vs EA2 differences.
```bash
make setup MANIFEST_REF=3.4-ea1   # EA1 cluster (most common)
make setup MANIFEST_REF=3.4-ea2   # EA2 cluster
make setup                        # clone main (latest)
make delete-manifests             # remove cloned manifests
```

Each test case runs through these phases:
- PREP — Download model to PVC via KServe storage initializer
- PREREQ — Verify the `LLMInferenceService` CRD exists
- DEPLOY — `kubectl apply` the manifest (URI patched based on MODEL_SOURCE)
- Sub-resources — Validate Service, HTTPRoute, Gateway programmed, InferencePool, Pods
- READY — Wait for `.status.conditions[Ready]=True` with live status
- HEALTH — `GET /health`
- INFERENCE — `POST /v1/chat/completions` with test prompts
- METRICS — Scrape vLLM + EPP `/metrics`; validate prefix cache / P/D / scheduler
- CLEANUP — Delete the LLMInferenceService
Each metric check is an individual Ginkgo spec:
| Test Type | Metrics Checked |
|---|---|
| Cache-aware | vllm:prefix_cache_queries > 0, vllm:prefix_cache_hits > 0, hit rate, gpu_cache_usage, EPP prefix_indexer_size |
| P/D | vllm:prompt_tokens_total, vllm:generation_tokens_total, request_success, NIXL transfers (warning if absent) |
| Scheduler | scheduler_e2e_duration, request_total, request_error_total = 0, ready_pods |
One manifest per test case — the framework patches the URI based on MODEL_SOURCE:
| Mode | How it works | When to use |
|---|---|---|
| `hf` (default) | Deploy with `hf://` URI; vLLM downloads at pod startup | No PVC needed, simplest |
| `pvc` | Pre-download model to PVC; deploy with `pvc://` URI | Fast startup, recommended for repeated runs |
```bash
# Run with HuggingFace (default)
make test TESTCASE=single-gpu

# Pre-cache a model to PVC (one-time)
make cache-model TESTCASE=single-gpu

# Run with PVC
make test TESTCASE=single-gpu MODEL_SOURCE=pvc

# Cache with custom storage class and size
make cache-model TESTCASE=single-gpu STORAGE_CLASS=azurefile-rwx STORAGE_SIZE=50Gi
```

Validate an already-running LLMInferenceService — skips deploy and cleanup:

```bash
make test TESTCASE=single-gpu DISCOVER=true NAMESPACE=my-ns
make test-profile-all DISCOVER=true NAMESPACE=my-ns
```

| Flag | Default | Description |
|---|---|---|
| `TESTCASE` | — | Test case name (e.g., `single-gpu`) |
| `MODEL` | — | Override model (e.g., `Qwen/Qwen2.5-7B-Instruct`) |
| `MODEL_SOURCE` | `hf` | `hf` (HuggingFace direct) or `pvc` (pre-cached) |
| `MANIFEST_REF` | `main` | Manifest repo branch (e.g., `3.4-ea1`, `3.4-ea2`) |
| `NO_CLEANUP` | — | Set to `1` to keep resources after test |
| `DISCOVER` | — | Set to `true` to validate existing deployment (skip deploy/cleanup) |
| `STORAGE_CLASS` | cluster default | StorageClass for PVCs |
| `STORAGE_SIZE` | from test case config | Override PVC storage size (e.g., `50Gi`) |
| `NAMESPACE` | `llm-conformance-test` | Target K8s namespace |
| `KUBECONFIG` | `$KUBECONFIG` | Path to kubeconfig |
```
├── framework/                 # Core framework code
│   ├── config/                # Config types, YAML loader, filtering
│   ├── deployer/              # K8s deployer with URI patching + status dashboard
│   ├── client/                # OpenAI-compatible API client
│   ├── metrics/               # Prometheus metrics scraper + validation
│   ├── model/                 # Model download via KServe storage initializer
│   ├── reporter/              # JSON + HTML report generator
│   ├── retry/                 # Retry utilities
│   └── cleanup/               # Resource cleanup
├── tests/
│   ├── conformance_test.go    # Ginkgo specs — one per test case
│   ├── suite_test.go          # Ginkgo suite + CLI flags
│   └── smoke/                 # Framework validation (no cluster)
├── deploy/manifests/          # Cloned from manifest repo (gitignored)
├── configs/
│   ├── testcases/             # Test case definitions (YAML)
│   └── profiles/              # Named test profiles
├── .github/workflows/         # CI pipeline (lint, vet, build, smoke tests)
├── docs/
│   └── adding-test-cases.md   # Guide for adding new test cases
└── reports/                   # JSON + HTML reports (generated)
```
Adding a new test case requires zero code changes — just two files:

- Manifest — add `<name>.yaml` to the manifest repo on the appropriate branch
- Test case config — add `configs/testcases/<name>.yaml` with timeouts, prompts, and the default model

The framework patches the manifest URI based on MODEL_SOURCE at runtime. See docs/adding-test-cases.md for details.
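A test case config covering those fields might look roughly like this. The field names below are illustrative assumptions, not the framework's actual schema — docs/adding-test-cases.md has the authoritative format:

```yaml
# configs/testcases/single-gpu.yaml — illustrative sketch; field names
# are assumptions, see docs/adding-test-cases.md for the real schema.
name: single-gpu
defaultModel: Qwen/Qwen3-0.6B
timeouts:
  ready: 20m
  inference: 2m
prompts:
  - "What is the capital of France?"
```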
```bash
make help             # Show all targets, flags, and examples
make setup            # Clone manifest repo
make testcases        # List test cases (shows manifest version)
make profiles         # List profiles with their test cases
make test-smoke       # Framework validation (no cluster needed)
make cache-models     # Pre-download all models to PVCs
make clear-manifests  # Remove cloned manifests
make clean            # Remove generated reports
```