---
layout: default
title: Getting Started
---
Complete guide to setting up and running your first vLLM performance tests.
📝 Note: This is a simplified quick start guide. For the complete Ansible documentation including all playbooks, roles, and advanced configuration, see the full Ansible automation guide in the repository.
The vLLM CPU Performance Evaluation framework uses Ansible to automate:
- Platform setup and configuration
- vLLM server deployment
- Test execution with GuideLLM
- Results collection and analysis
The control machine is your local laptop/workstation where you run Ansible commands.
Install Ansible:
```bash
# On macOS
brew install ansible

# On Ubuntu/Debian
sudo apt update && sudo apt install -y ansible

# On RHEL/Fedora
sudo dnf install -y ansible-core python3-pip
pip install podman-compose

# Verify installation
ansible --version  # Should be 2.14+
```
```bash
# Navigate to the ansible directory
cd automation/test-execution/ansible

# Install required Ansible collections
ansible-galaxy collection install -r requirements.yml

# Or install individually
ansible-galaxy collection install containers.podman ansible.posix
```

Requirements:
- OS: Ubuntu 22.04+, RHEL 9+, or Fedora 38+
- SSH Access: Password-less SSH from control machine (see setup below)
- Sudo privileges: Required for installation and setup
- Python: 3.8+ (usually pre-installed)
- Network: DUT port 8000 accessible from Load Generator
Note: Playbooks automatically install required software (Podman, tuned, numactl, vLLM, GuideLLM) on remote hosts. No manual installation needed.
Set up password-less SSH access from your control machine to both DUT and Load Generator:
```bash
# Generate SSH key (if you don't have one)
ssh-keygen -t ed25519 -C "[email protected]"

# Copy SSH key to DUT
ssh-copy-id -i ~/.ssh/id_ed25519.pub ec2-user@your-dut-hostname

# Copy SSH key to Load Generator
ssh-copy-id -i ~/.ssh/id_ed25519.pub ec2-user@your-loadgen-hostname

# Test connectivity (should not prompt for password)
ssh -i ~/.ssh/id_ed25519 ec2-user@your-dut-hostname 'echo "DUT: Connected"'
ssh -i ~/.ssh/id_ed25519 ec2-user@your-loadgen-hostname 'echo "LoadGen: Connected"'

# Test sudo access (required for playbooks)
ssh ec2-user@your-dut-hostname 'sudo whoami'      # Should return 'root'
ssh ec2-user@your-loadgen-hostname 'sudo whoami'  # Should return 'root'
```

For AWS EC2:
```bash
# Use your downloaded .pem key
chmod 400 ~/your-key.pem  # Set correct permissions
ssh -i ~/your-key.pem ec2-user@your-dut-hostname

# Or convert to standard SSH key format
ssh-keygen -p -m PEM -f ~/your-key.pem
```

Some models, like Llama, require a HuggingFace token and license acceptance.
Create a HuggingFace token:
- Sign up/Login: Visit huggingface.co
- Create token: Go to Settings → Access Tokens → New Token
- Set permissions: Select "Read" access
- Copy token: Save it as `hf_xxxxxxxxxxxxx`
- Accept model licenses: Visit the model page (e.g., meta-llama/Llama-3.2-1B-Instruct) and accept the license
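Hugging Face access tokens currently begin with `hf_`, so a quick format check can catch copy/paste mistakes before you save the token. The helper below is a hypothetical convenience, not part of the framework:

```shell
# Hypothetical helper: sanity-check a HuggingFace token's format.
# Assumes tokens start with "hf_" (true of current HF access tokens).
check_hf_token() {
  case "$1" in
    hf_?*) echo "format looks right" ;;
    *)     echo "expected a token starting with hf_" >&2; return 1 ;;
  esac
}

check_hf_token "hf_xxxxxxxxxxxxx"   # format looks right
```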
Save token locally:
```bash
# Save to file
echo "hf_xxxxxxxxxxxxx" > ~/hf-token

# Or export directly
export HF_TOKEN=hf_xxxxxxxxxxxxx

# Or load from file
export HF_TOKEN=$(cat ~/hf-token)
```

Clone the repository and change into the Ansible directory:

```bash
git clone https://github.com/redhat-et/vllm-cpu-perf-eval.git
cd vllm-cpu-perf-eval/automation/test-execution/ansible
```

Option A: Environment Variables (Recommended)
```bash
# Set hostnames (AWS example)
export DUT_HOSTNAME=ec2-18-117-90-80.us-east-2.compute.amazonaws.com
export LOADGEN_HOSTNAME=ec2-52-15-123-132.us-east-2.compute.amazonaws.com

# SSH credentials
export ANSIBLE_SSH_USER=ec2-user
export ANSIBLE_PRIVATE_KEY_FILE=~/your-key.pem  # Or ~/.ssh/id_ed25519

# Ensure SSH key has correct permissions
chmod 600 ~/your-key.pem

# HuggingFace token (for gated models like Llama)
export HF_TOKEN=$(cat ~/hf-token)

# Container images (optional - defaults are provided)
export VLLM_CONTAINER_IMAGE=docker.io/vllm/vllm-openai-cpu:v0.18.0
export GUIDELLM_CONTAINER_IMAGE=ghcr.io/vllm-project/guidellm:latest
```

The inventory file automatically uses these environment variables with sensible defaults.
Option B: Edit Inventory File
Alternatively, edit inventory/hosts.yml directly and update the hostname values (lines 63 and 73):
```yaml
dut:
  hosts:
    vllm-server:
      ansible_host: "192.168.1.10"  # Update this

load_generator:
  hosts:
    guidellm-client:
      ansible_host: "192.168.1.20"  # Update this
```

Verify connectivity:

```bash
ansible -i inventory/hosts.yml all -m ping
```

Expected output:
```
vllm-server | SUCCESS => {"ping": "pong"}
guidellm-client | SUCCESS => {"ping": "pong"}
```
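In a CI wrapper you might want to gate the rest of the run on this ping step. A sketch, with the expected output inlined so the logic is runnable as-is (the two-host count matches the inventory above):

```shell
# Sketch: fail fast if any inventory host is unreachable, by counting
# SUCCESS lines from "ansible ... -m ping". Sample output is inlined here;
# in a real wrapper you would capture the command's output instead.
ping_output='vllm-server | SUCCESS => {"ping": "pong"}
guidellm-client | SUCCESS => {"ping": "pong"}'

ok=$(printf '%s\n' "$ping_output" | grep -c "SUCCESS")
[ "$ok" -eq 2 ] && echo "all hosts reachable"
```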
Configure your DUT and Load Generator hosts with performance optimizations for deterministic benchmarking:
```bash
# Configure both DUT and Load Generator
ansible-playbook -i inventory/hosts.yml setup-platform.yml

# Or configure only specific hosts
ansible-playbook -i inventory/hosts.yml setup-platform.yml --limit dut
ansible-playbook -i inventory/hosts.yml setup-platform.yml --limit load_generator

# Reboot hosts for kernel parameters to take effect
ansible -i inventory/hosts.yml all -b -m reboot
```

What this configures (on DUT and Load Generator only):
- ✅ Installs: Podman, tuned, kernel-tools, numactl
- ✅ CPU Isolation: Sets isolcpus, nohz_full, rcu_nocbs
- ✅ Performance Governor: Locks CPU frequency
- ✅ NUMA Topology: Detects and optimizes for NUMA layout
- ✅ IRQ Balancing: Disables irqbalance
- ✅ Systemd Pinning: Pins system processes to housekeeping CPUs
What it does NOT configure:
- ❌ Your control machine (Ansible host) - no changes needed there
- ❌ vLLM or GuideLLM - those are installed during test execution
Note: You can skip this step if you're just trying out the framework. It's mainly for production-grade deterministic benchmarking. See Platform Setup Guide for details.
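After the reboot, it is worth confirming that the kernel actually picked up the isolation parameters. A sketch: the check inspects the kernel command line; the sample string below stands in for `$(cat /proc/cmdline)` on the DUT:

```shell
# Sketch: verify a boot parameter is present on the kernel command line.
# On the DUT you would pass "$(cat /proc/cmdline)" as the second argument;
# a sample command line is used here so the example runs anywhere.
has_param() {
  case " $2 " in
    *" $1="*) echo "present" ;;
    *)        echo "missing" ;;
  esac
}

cmdline="BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro isolcpus=2-15 nohz_full=2-15 rcu_nocbs=2-15"
has_param isolcpus "$cmdline"    # present
has_param rcu_nocbs "$cmdline"   # present
```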
Simple LLM test:
```bash
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  -e "workload_type=chat" \
  -e "requested_cores=16"
```

What this does:
- Deploys vLLM server on DUT with TinyLlama model
- Configures for 16 CPU cores
- Runs chat workload benchmark from Load Generator
- Collects results to local machine
Test takes: ~15-20 minutes (includes vLLM startup and 10-minute test)
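Each run writes into a timestamped directory per workload. If you script around the results, a lexicographic sort of the glob picks the newest run; the directory tree below is fabricated inside the block purely so the example is runnable:

```shell
# Sketch: select the newest run directory for a given workload.
# A throwaway tree is built under mktemp only for demonstration; on your
# machine, drop the "$tmp" prefix and glob the real results/ directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/results/llm/TinyLlama__TinyLlama-1.1B-Chat-v1.0/chat-20240101-120000"
mkdir -p "$tmp/results/llm/TinyLlama__TinyLlama-1.1B-Chat-v1.0/chat-20240102-090000"

# Timestamped names sort lexicographically, so tail -n 1 is the newest run.
latest=$(ls -d "$tmp"/results/llm/*/chat-*/ | sort | tail -n 1)
echo "$latest"
```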
Results are automatically collected to your local machine:
```bash
# View JSON results
cat results/llm/TinyLlama__TinyLlama-1.1B-Chat-v1.0/chat-*/benchmarks.json

# View CSV results (importable to spreadsheets)
cat results/llm/TinyLlama__TinyLlama-1.1B-Chat-v1.0/chat-*/benchmarks.csv
```

Test existing vLLM deployments (cloud, K8s, production) without managing containers:
```bash
# Configure external endpoint
export VLLM_ENDPOINT_MODE=external
export VLLM_ENDPOINT_URL=http://your-vllm-instance:8000

# Run concurrent load test (model auto-detected from endpoint)
ansible-playbook -i inventory/hosts.yml llm-benchmark-concurrent-load.yml \
  -e "base_workload=chat"
```

Features:
- ✅ Auto-detects model from the endpoint's `/v1/models`
- ✅ Skips vLLM container management
- ✅ Collects client metrics (GuideLLM)
- ✅ Collects server metrics if `/metrics` is exposed
- ✅ Works with cloud, K8s, or on-premise deployments
Environment Variables:
- `VLLM_ENDPOINT_MODE=external` - Enable external mode
- `VLLM_ENDPOINT_URL=http://...` - Full URL with protocol and port
- `LOADGEN_HOSTNAME=...` - Load generator hostname/IP
- `ANSIBLE_SSH_KEY=...` - SSH key for load generator access
Note: `DUT_HOSTNAME` and `requested_cores` are not required in external mode (the endpoint is reached directly over HTTP and manages its own CPU allocation).
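Because external mode takes the endpoint as a plain URL, a malformed value (missing scheme or port) only surfaces mid-playbook. A hypothetical pre-flight helper:

```shell
# Hypothetical pre-flight check: external mode expects a full URL with
# scheme and port, e.g. http://host:8000 (not part of the framework).
validate_endpoint() {
  case "$1" in
    http://*:*|https://*:*) echo "ok" ;;
    *) echo "expected http(s)://host:port" >&2; return 1 ;;
  esac
}

validate_endpoint "http://your-vllm-instance:8000"   # ok
```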
Test performance under increasing concurrent load:
```bash
ansible-playbook -i inventory/hosts.yml llm-benchmark-concurrent-load.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "base_workload=chat" \
  -e "requested_cores=32"
```

This runs all 3 testing phases:
- Phase 1: Baseline (fixed tokens, no caching)
- Phase 2: Realistic (variable tokens, no caching)
- Phase 3: Production (variable tokens, with caching)
See 3-Phase Testing Methodology for details.
Test performance across different CPU core counts:
```bash
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "workload_type=chat" \
  -e "core_sweep_enabled=true" \
  -e "core_sweep_counts=[8,16,32,64]"
```

Embedding model test:

```bash
ansible-playbook -i inventory/hosts.yml embedding-benchmark.yml \
  -e "test_model=ibm-granite/granite-embedding-278m-multilingual" \
  -e "scenario=baseline"
```

Available workload types:

| Workload | Input:Output | Use Case | Example |
|---|---|---|---|
| `chat` | 512:512 | Interactive chat | Customer support bot |
| `rag` | 8192:512 | Long context RAG | Document Q&A |
| `code` | 1024:1024 | Code generation | GitHub Copilot-style |
| `summarization` | 2048:256 | Summarization | Article summaries |
| `reasoning` | 256:2048 | Long reasoning | Complex analysis |
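The table above maps each workload to input:output token counts; if you wrap the playbooks in scripts, a small lookup keeps those shapes in one place. The helper name is illustrative, not part of the framework:

```shell
# Sketch: the input:output token shapes from the workload table, as a
# reusable lookup (helper name is illustrative).
workload_shape() {
  case "$1" in
    chat)          echo "512:512" ;;
    rag)           echo "8192:512" ;;
    code)          echo "1024:1024" ;;
    summarization) echo "2048:256" ;;
    reasoning)     echo "256:2048" ;;
    *) return 1 ;;
  esac
}

workload_shape rag   # 8192:512
```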
See Model Catalog for all supported models.
| Parameter | Description | Example |
|---|---|---|
| `test_model` | Model to test | `meta-llama/Llama-3.2-1B-Instruct` |
| `workload_type` | Workload pattern | `chat`, `rag`, `code`, `summarization` |
| `requested_cores` | CPU cores to use | `16`, `32`, `64` |
| `vllm_caching_mode` | Caching mode | `baseline` (off), `production` (on) |
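When driving these parameters from CI, composing the `-e` flags from shell variables keeps wrappers readable. A sketch, using the table's example values (the variable names are illustrative):

```shell
# Sketch: build the playbook invocation from variables instead of
# hand-editing flags (values are the table's examples; adapt as needed).
model="meta-llama/Llama-3.2-1B-Instruct"
workload="chat"
cores=32

cmd="ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e test_model=$model -e workload_type=$workload -e requested_cores=$cores"

echo "$cmd"
```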
| Parameter | Description | Example |
|---|---|---|
| `guidellm_profile` | Test profile | `concurrent`, `sweep`, `synchronous` |
| `guidellm_rate` | Concurrency levels | `[1,2,4,8,16,32]` |
| `guidellm_max_seconds` | Test duration | `600` (10 minutes) |
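A rough wall-clock budget for a `concurrent` profile is one pass of `guidellm_max_seconds` per concurrency level, plus vLLM startup. A sketch of the arithmetic:

```shell
# Sketch: estimate load-generation time for a concurrent profile.
# Startup and teardown overhead are not included.
levels=6        # e.g. guidellm_rate=[1,2,4,8,16,32]
max_seconds=600 # guidellm_max_seconds

echo "$(( levels * max_seconds / 60 )) minutes of load generation"
```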
SSH connection issues:

```bash
# Verify SSH key permissions
chmod 600 ~/.ssh/your-key.pem

# Test SSH manually
ssh -i ~/.ssh/your-key.pem user@hostname

# Check Ansible can connect
ansible -i inventory/hosts.yml all -m ping -vvv
```

vLLM server issues:

```bash
# Check vLLM logs on DUT
ssh user@dut-hostname "podman logs vllm-server"

# Check if port 8000 is accessible
nc -zv dut-hostname 8000
```

Playbook failures:

```bash
# Run with verbose output
ansible-playbook -i inventory/hosts.yml <playbook.yml> -vv

# Check disk space
ansible -i inventory/hosts.yml all -m shell -a "df -h"

# Check Docker/Podman status
ansible -i inventory/hosts.yml all -m shell -a "podman ps -a"
```

- Testing Methodology - Understand the testing approach
- 3-Phase Testing - Baseline, realistic, and production phases
- Metrics Guide - Understanding the metrics
- Test Suites - Available test suites
- Concurrent Load Tests - P95 latency scaling
- Scalability Tests - Maximum throughput
- Embedding Models - Embedding performance
For complete documentation on:
- All available playbooks
- Ansible roles and task structure
- Inventory configuration
- Filter plugins and custom modules
- Advanced usage patterns
See the full Ansible automation documentation.
- Repository: GitHub
- Issues: Report Issues
- Documentation: Browse this site for comprehensive guides