A lightweight benchmarking tool for measuring LLM inference performance through Ollama. Get detailed tokens-per-second metrics, load times, and generation speed for any model running on your local hardware.
- Prompt processing speed — measures tokens/sec for input evaluation
- Generation speed — measures tokens/sec for output generation
- Model load time — tracks cold-start overhead
- Multi-model comparison — benchmark several models in a single run
- Table output — side-by-side comparison across models
- Custom prompts — supply your own prompts or use the built-in test suite
- Zero dependencies on cloud — everything runs locally through Ollama
Dual 3090 Ti GPU, Epyc 7763 CPU — Ubuntu 22.04:
```
----------------------------------------------------
Model: deepseek-r1:70b

Performance Metrics:
  Prompt Processing: 336.73 tokens/sec
  Generation Speed:  17.65 tokens/sec
  Combined Speed:    18.01 tokens/sec

Workload Stats:
  Input Tokens:     165
  Generated Tokens: 7673
  Model Load Time:  6.11s
  Processing Time:  0.49s
  Generation Time:  434.70s
  Total Time:       441.31s
----------------------------------------------------
```
Single 3090 GPU, 13900KS CPU — WSL2 (Ubuntu 22.04) on Windows 11:
```
----------------------------------------------------
Model: deepseek-r1:32b

Performance Metrics:
  Prompt Processing: 399.05 tokens/sec
  Generation Speed:  27.18 tokens/sec
  Combined Speed:    27.58 tokens/sec

Workload Stats:
  Input Tokens:     168
  Generated Tokens: 10601
  Model Load Time:  15.44s
  Processing Time:  0.42s
  Generation Time:  390.00s
  Total Time:       405.87s
----------------------------------------------------
```
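The reported speeds follow directly from the raw counts and timings. As a quick arithmetic check against the deepseek-r1:70b run:

```python
# Reproduce the reported speeds from the raw counts and timings
# in the deepseek-r1:70b sample run.
prompt_tps = 165 / 0.49                        # prompt processing, ~336.73 tokens/sec
gen_tps = 7673 / 434.70                        # generation, ~17.65 tokens/sec
combined_tps = (165 + 7673) / (0.49 + 434.70)  # combined, ~18.01 tokens/sec
```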
- Python 3.11 or higher
- Ollama installed and running
Using pip (recommended):

```
pip install git+https://github.com/LarHope/ollama-benchmark.git
```

From source:

```
git clone https://github.com/LarHope/ollama-benchmark.git
cd ollama-benchmark
python3 -m venv venv && source venv/bin/activate  # or .\venv\Scripts\activate on Windows
pip install -e .
```

Make sure the Ollama server is running:

```
ollama serve
```

Run benchmarks:
```
# Benchmark all available models with default prompts
ollama-benchmark

# Benchmark specific models
ollama-benchmark --models deepseek-r1:70b llama3:8b

# Custom prompts
ollama-benchmark --models mistral --prompts "Write a hello world in Rust" "Explain quantum computing"

# Table comparison output
ollama-benchmark --table_output --models deepseek-r1:70b deepseek-r1:32b llama3:8b

# Verbose mode (shows streaming responses)
ollama-benchmark --verbose --models deepseek-r1:70b
```

| Flag | Description |
|---|---|
| `-v, --verbose` | Show streaming responses and per-prompt stats |
| `-m, --models` | Space-separated list of models to benchmark (default: all available) |
| `-p, --prompts` | Space-separated list of custom prompts |
| `-t, --table_output` | Display results as a comparison table |
When no custom prompts are provided, the tool runs a suite covering:
- Analytical reasoning
- Creative writing
- Complex analysis
- Technical knowledge
- Structured output generation
- Connects to your local Ollama instance
- Sends each prompt to each model
- Captures timing metrics from the Ollama API response (total duration, prompt eval, generation)
- Calculates tokens/sec for prompt processing, generation, and combined throughput
- Outputs per-model averages or a comparison table
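The steps above can be sketched in a few lines of Python. Ollama's non-streaming `/api/generate` response includes `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration` (durations in nanoseconds); the helpers below turn those into the three tokens/sec figures. The function names (`throughput`, `bench_once`) are illustrative, not the tool's internal API:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def throughput(resp: dict) -> dict:
    """Derive tokens/sec from the timing fields of an Ollama generate
    response. All *_duration fields are reported in nanoseconds."""
    ns = 1e9
    prompt_s = resp["prompt_eval_duration"] / ns
    gen_s = resp["eval_duration"] / ns
    return {
        "prompt_tps": resp["prompt_eval_count"] / prompt_s,
        "gen_tps": resp["eval_count"] / gen_s,
        "combined_tps": (resp["prompt_eval_count"] + resp["eval_count"])
                        / (prompt_s + gen_s),
    }

def bench_once(model: str, prompt: str) -> dict:
    """Send one non-streaming generate request and return its metrics.
    Requires a running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return throughput(json.load(r))
```

Feeding the timing fields from the deepseek-r1:70b sample run through `throughput` reproduces the speeds shown in the example output.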
Contributions are welcome! Feel free to submit a pull request; for major changes, open an issue first to discuss what you would like to change.
This project is licensed under the MIT License — see the LICENSE file for details.