---
layout: default
title: Getting Started
---
Complete guide to setting up and running your first vLLM performance tests.
📝 Note: This is a simplified quick start guide. For the complete Ansible documentation including all playbooks, roles, and advanced configuration, see the full Ansible automation guide in the repository.
The vLLM CPU Performance Evaluation framework uses Ansible to automate:
- Platform setup and configuration
- vLLM server deployment
- Test execution with GuideLLM
- Results collection and analysis
The control machine is your local laptop/workstation where you run Ansible commands.
Install Ansible:
```bash
# On macOS
brew install ansible

# On Ubuntu/Debian
sudo apt update && sudo apt install -y ansible

# On RHEL/Fedora
sudo dnf install -y ansible-core python3-pip
pip install podman-compose

# Verify installation
ansible --version  # Should be 2.14+
```
```bash
# Navigate to the ansible directory
cd automation/test-execution/ansible

# Install required Ansible collections
ansible-galaxy collection install -r requirements.yml

# Or install individually
ansible-galaxy collection install containers.podman ansible.posix
```

Requirements:
- OS: Ubuntu 22.04+, RHEL 9+, or Fedora 38+
- SSH Access: Password-less SSH from control machine (see setup below)
- Sudo privileges: Required for installation and setup
- Python: 3.8+ (usually pre-installed)
- Network: DUT port 8000 accessible from Load Generator
Note: Playbooks automatically install required software (Podman, tuned, numactl, vLLM, GuideLLM) on remote hosts. No manual installation needed.
Set up password-less SSH access from your control machine to both DUT and Load Generator:
```bash
# Generate SSH key (if you don't have one)
ssh-keygen -t ed25519 -C "[email protected]"

# Copy SSH key to DUT
ssh-copy-id -i ~/.ssh/id_ed25519.pub ec2-user@your-dut-hostname

# Copy SSH key to Load Generator
ssh-copy-id -i ~/.ssh/id_ed25519.pub ec2-user@your-loadgen-hostname

# Test connectivity (should not prompt for password)
ssh -i ~/.ssh/id_ed25519 ec2-user@your-dut-hostname 'echo "DUT: Connected"'
ssh -i ~/.ssh/id_ed25519 ec2-user@your-loadgen-hostname 'echo "LoadGen: Connected"'

# Test sudo access (required for playbooks)
ssh ec2-user@your-dut-hostname 'sudo whoami'      # Should return 'root'
ssh ec2-user@your-loadgen-hostname 'sudo whoami'  # Should return 'root'
```

For AWS EC2:
```bash
# Use your downloaded .pem key
chmod 400 ~/your-key.pem  # Set correct permissions
ssh -i ~/your-key.pem ec2-user@your-dut-hostname

# Or convert to standard SSH key format
ssh-keygen -p -m PEM -f ~/your-key.pem
```

Some models, like Llama, require a HuggingFace token and license acceptance.
Create a HuggingFace token:
- Sign up/Login: Visit huggingface.co
- Create token: Go to Settings → Access Tokens → New Token
- Set permissions: Select "Read" access
- Copy token: Save it as `hf_xxxxxxxxxxxxx`
- Accept model licenses: Visit the model page (e.g., meta-llama/Llama-3.2-1B-Instruct) and accept the license
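Hugging Face access tokens currently begin with `hf_`, so a quick format check can catch copy/paste mistakes before you save the token. The helper below is a hypothetical convenience, not part of the framework:

```shell
# Hypothetical helper: sanity-check a HuggingFace token's format.
# Assumes tokens start with "hf_" (true of current HF access tokens).
check_hf_token() {
  case "$1" in
    hf_?*) echo "format looks right" ;;
    *)     echo "expected a token starting with hf_" >&2; return 1 ;;
  esac
}

check_hf_token "hf_xxxxxxxxxxxxx"   # format looks right
```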
Save token locally:
```bash
# Save to file
echo "hf_xxxxxxxxxxxxx" > ~/hf-token

# Or export directly
export HF_TOKEN=hf_xxxxxxxxxxxxx

# Or load from file
export HF_TOKEN=$(cat ~/hf-token)
```

Clone the repository and change into the Ansible directory:

```bash
git clone https://github.com/redhat-et/vllm-cpu-perf-eval.git
cd vllm-cpu-perf-eval/automation/test-execution/ansible
```

Option A: Environment Variables (Recommended)
```bash
# Set hostnames (AWS example)
export DUT_HOSTNAME=ec2-18-117-90-80.us-east-2.compute.amazonaws.com
export LOADGEN_HOSTNAME=ec2-52-15-123-132.us-east-2.compute.amazonaws.com

# SSH credentials
export ANSIBLE_SSH_USER=ec2-user
export ANSIBLE_PRIVATE_KEY_FILE=~/your-key.pem  # Or ~/.ssh/id_ed25519

# Ensure SSH key has correct permissions
chmod 600 ~/your-key.pem

# HuggingFace token (for gated models like Llama)
export HF_TOKEN=$(cat ~/hf-token)

# Container images (optional - defaults are provided)
export VLLM_CONTAINER_IMAGE=docker.io/vllm/vllm-openai-cpu:v0.18.0
export GUIDELLM_CONTAINER_IMAGE=ghcr.io/vllm-project/guidellm:latest
```

The inventory file automatically uses these environment variables with sensible defaults.
Option B: Edit Inventory File
Alternatively, edit inventory/hosts.yml directly and update the hostname values (lines 63 and 73):
```yaml
dut:
  hosts:
    vllm-server:
      ansible_host: "192.168.1.10"  # Update this

load_generator:
  hosts:
    guidellm-client:
      ansible_host: "192.168.1.20"  # Update this
```

Verify connectivity:

```bash
ansible -i inventory/hosts.yml all -m ping
```

Expected output:
```
vllm-server | SUCCESS => {"ping": "pong"}
guidellm-client | SUCCESS => {"ping": "pong"}
```
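In a CI wrapper you might want to gate the rest of the run on this ping step. A sketch, with the expected output inlined so the logic is runnable as-is (the two-host count matches the inventory above):

```shell
# Sketch: fail fast if any inventory host is unreachable, by counting
# SUCCESS lines from "ansible ... -m ping". Sample output is inlined here;
# in a real wrapper you would capture the command's output instead.
ping_output='vllm-server | SUCCESS => {"ping": "pong"}
guidellm-client | SUCCESS => {"ping": "pong"}'

ok=$(printf '%s\n' "$ping_output" | grep -c "SUCCESS")
[ "$ok" -eq 2 ] && echo "all hosts reachable"
```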
Configure your DUT and Load Generator hosts with performance optimizations for deterministic benchmarking:
```bash
# Configure both DUT and Load Generator
ansible-playbook -i inventory/hosts.yml setup-platform.yml

# Or configure only specific hosts
ansible-playbook -i inventory/hosts.yml setup-platform.yml --limit dut
ansible-playbook -i inventory/hosts.yml setup-platform.yml --limit load_generator

# Reboot hosts for kernel parameters to take effect
ansible -i inventory/hosts.yml all -b -m reboot
```

What this configures (on DUT and Load Generator only):
- ✅ Installs: Podman, tuned, kernel-tools, numactl
- ✅ CPU Isolation: Sets isolcpus, nohz_full, rcu_nocbs
- ✅ Performance Governor: Locks CPU frequency
- ✅ NUMA Topology: Detects and optimizes for NUMA layout
- ✅ IRQ Balancing: Disables irqbalance
- ✅ Systemd Pinning: Pins system processes to housekeeping CPUs
What it does NOT configure:
- ❌ Your control machine (Ansible host) - no changes needed there
- ❌ vLLM or GuideLLM - those are installed during test execution
Note: You can skip this step if you're just trying out the framework. It's mainly for production-grade deterministic benchmarking. See Platform Setup Guide for details.
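After the reboot, it is worth confirming that the kernel actually picked up the isolation parameters. A sketch: the check inspects the kernel command line; the sample string below stands in for `$(cat /proc/cmdline)` on the DUT:

```shell
# Sketch: verify a boot parameter is present on the kernel command line.
# On the DUT you would pass "$(cat /proc/cmdline)" as the second argument;
# a sample command line is used here so the example runs anywhere.
has_param() {
  case " $2 " in
    *" $1="*) echo "present" ;;
    *)        echo "missing" ;;
  esac
}

cmdline="BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro isolcpus=2-15 nohz_full=2-15 rcu_nocbs=2-15"
has_param isolcpus "$cmdline"    # present
has_param rcu_nocbs "$cmdline"   # present
```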
Simple LLM test:
```bash
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  -e "workload_type=chat" \
  -e "requested_cores=16"
```

What this does:
- Deploys vLLM server on DUT with TinyLlama model
- Configures for 16 CPU cores
- Runs chat workload benchmark from Load Generator
- Collects results to local machine
Test takes: ~15-20 minutes (includes vLLM startup and 10-minute test)
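Each run writes into a timestamped directory per workload. If you script around the results, a lexicographic sort of the glob picks the newest run; the directory tree below is fabricated inside the block purely so the example is runnable:

```shell
# Sketch: select the newest run directory for a given workload.
# A throwaway tree is built under mktemp only for demonstration; on your
# machine, drop the "$tmp" prefix and glob the real results/ directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/results/llm/TinyLlama__TinyLlama-1.1B-Chat-v1.0/chat-20240101-120000"
mkdir -p "$tmp/results/llm/TinyLlama__TinyLlama-1.1B-Chat-v1.0/chat-20240102-090000"

# Timestamped names sort lexicographically, so tail -n 1 is the newest run.
latest=$(ls -d "$tmp"/results/llm/*/chat-*/ | sort | tail -n 1)
echo "$latest"
```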
Results are automatically collected to your local machine:
```bash
# View JSON results
cat results/llm/TinyLlama__TinyLlama-1.1B-Chat-v1.0/chat-*/benchmarks.json

# View CSV results (importable to spreadsheets)
cat results/llm/TinyLlama__TinyLlama-1.1B-Chat-v1.0/chat-*/benchmarks.csv
```

Test existing vLLM deployments (cloud, K8s, production) without managing containers:
```bash
# Configure external endpoint
export VLLM_ENDPOINT_MODE=external
export VLLM_ENDPOINT_URL=http://your-vllm-instance:8000

# Run concurrent load test (model auto-detected from endpoint)
ansible-playbook -i inventory/hosts.yml llm-benchmark-concurrent-load.yml \
  -e "base_workload=chat"
```

Features:
- ✅ Auto-detects model from the endpoint's `/v1/models`
- ✅ Skips vLLM container management
- ✅ Collects client metrics (GuideLLM)
- ✅ Collects server metrics if `/metrics` is exposed
- ✅ Works with cloud, K8s, or on-premise deployments
Environment Variables:
- `VLLM_ENDPOINT_MODE=external` - Enable external mode
- `VLLM_ENDPOINT_URL=http://...` - Full URL with protocol and port
- `LOADGEN_HOSTNAME=...` - Load generator hostname/IP
- `ANSIBLE_SSH_KEY=...` - SSH key for load generator access
Note: `DUT_HOSTNAME` and `requested_cores` are not required in external mode (the endpoint is reached directly over HTTP and manages its own CPU allocation).
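Because external mode takes the endpoint as a plain URL, a malformed value (missing scheme or port) only surfaces mid-playbook. A hypothetical pre-flight helper:

```shell
# Hypothetical pre-flight check: external mode expects a full URL with
# scheme and port, e.g. http://host:8000 (not part of the framework).
validate_endpoint() {
  case "$1" in
    http://*:*|https://*:*) echo "ok" ;;
    *) echo "expected http(s)://host:port" >&2; return 1 ;;
  esac
}

validate_endpoint "http://your-vllm-instance:8000"   # ok
```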
Test performance under increasing concurrent load:
```bash
ansible-playbook -i inventory/hosts.yml llm-benchmark-concurrent-load.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "base_workload=chat" \
  -e "requested_cores=32"
```

This runs all 3 testing phases:
- Phase 1: Baseline (fixed tokens, no caching)
- Phase 2: Realistic (variable tokens, no caching)
- Phase 3: Production (variable tokens, with caching)
See 3-Phase Testing Methodology for details.
Test performance across different CPU core counts:
```bash
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
  -e "workload_type=chat" \
  -e "core_sweep_enabled=true" \
  -e "core_sweep_counts=[8,16,32,64]"
```

Embedding model test:

```bash
ansible-playbook -i inventory/hosts.yml embedding-benchmark.yml \
  -e "test_model=ibm-granite/granite-embedding-278m-multilingual" \
  -e "scenario=baseline"
```

Available workload types:

| Workload | Input:Output | Use Case | Example |
|---|---|---|---|
| `chat` | 512:512 | Interactive chat | Customer support bot |
| `rag` | 8192:512 | Long context RAG | Document Q&A |
| `code` | 1024:1024 | Code generation | GitHub Copilot-style |
| `summarization` | 2048:256 | Summarization | Article summaries |
| `reasoning` | 256:2048 | Long reasoning | Complex analysis |
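The table above maps each workload to input:output token counts; if you wrap the playbooks in scripts, a small lookup keeps those shapes in one place. The helper name is illustrative, not part of the framework:

```shell
# Sketch: the input:output token shapes from the workload table, as a
# reusable lookup (helper name is illustrative).
workload_shape() {
  case "$1" in
    chat)          echo "512:512" ;;
    rag)           echo "8192:512" ;;
    code)          echo "1024:1024" ;;
    summarization) echo "2048:256" ;;
    reasoning)     echo "256:2048" ;;
    *) return 1 ;;
  esac
}

workload_shape rag   # 8192:512
```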
See Model Catalog for all supported models.
| Parameter | Description | Example |
|---|---|---|
| `test_model` | Model to test | `meta-llama/Llama-3.2-1B-Instruct` |
| `workload_type` | Workload pattern | `chat`, `rag`, `code`, `summarization` |
| `requested_cores` | CPU cores to use | `16`, `32`, `64` |
| `vllm_caching_mode` | Caching mode | `baseline` (off), `production` (on) |
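When driving these parameters from CI, composing the `-e` flags from shell variables keeps wrappers readable. A sketch, using the table's example values (the variable names are illustrative):

```shell
# Sketch: build the playbook invocation from variables instead of
# hand-editing flags (values are the table's examples; adapt as needed).
model="meta-llama/Llama-3.2-1B-Instruct"
workload="chat"
cores=32

cmd="ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
  -e test_model=$model -e workload_type=$workload -e requested_cores=$cores"

echo "$cmd"
```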
| Parameter | Description | Example |
|---|---|---|
| `guidellm_profile` | Test profile | `concurrent`, `sweep`, `synchronous` |
| `guidellm_rate` | Concurrency levels | `[1,2,4,8,16,32]` |
| `guidellm_max_seconds` | Test duration | `600` (10 minutes) |
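A rough wall-clock budget for a `concurrent` profile is one pass of `guidellm_max_seconds` per concurrency level, plus vLLM startup. A sketch of the arithmetic:

```shell
# Sketch: estimate load-generation time for a concurrent profile.
# Startup and teardown overhead are not included.
levels=6        # e.g. guidellm_rate=[1,2,4,8,16,32]
max_seconds=600 # guidellm_max_seconds

echo "$(( levels * max_seconds / 60 )) minutes of load generation"
```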
SSH connection issues:

```bash
# Verify SSH key permissions
chmod 600 ~/.ssh/your-key.pem

# Test SSH manually
ssh -i ~/.ssh/your-key.pem user@hostname

# Check Ansible can connect
ansible -i inventory/hosts.yml all -m ping -vvv
```

vLLM server issues:

```bash
# Check vLLM logs on DUT
ssh user@dut-hostname "podman logs vllm-server"

# Check if port 8000 is accessible
nc -zv dut-hostname 8000
```

Playbook failures:

```bash
# Run with verbose output
ansible-playbook -i inventory/hosts.yml <playbook.yml> -vv

# Check disk space
ansible -i inventory/hosts.yml all -m shell -a "df -h"

# Check Docker/Podman status
ansible -i inventory/hosts.yml all -m shell -a "podman ps -a"
```

- Testing Methodology - Understand the testing approach
- 3-Phase Testing - Baseline, realistic, and production phases
- Metrics Guide - Understanding the metrics
- Test Suites - Available test suites
- Concurrent Load Tests - P95 latency scaling
- Scalability Tests - Maximum throughput
- Embedding Models - Embedding performance
For complete documentation on:
- All available playbooks
- Ansible roles and task structure
- Inventory configuration
- Filter plugins and custom modules
- Advanced usage patterns
See the full Ansible automation documentation.
- Repository: GitHub
- Issues: Report Issues
- Documentation: Browse this site for comprehensive guides