Commit 342a1d2

test(performance): add cold start benchmarking infrastructure (#467)
* test(performance): add cold start benchmarking infrastructure

  Add comprehensive tooling to measure and compare cold start performance across different branches and code changes.

  Changes:
  - Add test_cold_start.py: measures import times, module counts, and lazy loading status with 10 iterations per measurement
  - Add benchmark_cold_start.sh: automates running benchmarks on different git branches with stash/restore logic
  - Add compare_benchmarks.py: analyzes and visualizes differences between two benchmark runs with colored output
  - Add benchmark_results/ to .gitignore: exclude generated JSON data

  The benchmark suite validates:
  - Import time for runpod, runpod.serverless, and runpod.endpoint
  - Total module count and runpod-specific module count
  - Whether paramiko and SSH CLI modules are eagerly or lazily loaded
  - Performance regression detection (fails if import > 1000ms)

  Usage:
  - Run on current branch: `uv run pytest tests/test_performance/test_cold_start.py`
  - Compare two branches: `./scripts/benchmark_cold_start.sh main feature-branch`

  Results are saved to benchmark_results/ as timestamped JSON files for historical comparison and CI/CD integration.

* docs(performance): add comprehensive benchmarking usage guide

  Add a detailed README for the cold start benchmarking tools covering:
  - Quick start examples for common use cases
  - Tool documentation with usage patterns and output examples
  - Result file structure and naming conventions
  - Performance targets and interpretation guidance
  - CI/CD integration examples
  - Troubleshooting common issues

  The guide enables developers to effectively measure, compare, and validate cold start performance improvements across code changes.

* fix(performance): address Copilot PR feedback

  Address code review feedback from PR #467:

  1. Fix median calculation for even-length lists
     - Previously only returned the single middle value
     - Now correctly averages the two middle values for even-length lists
     - Maintains correct behavior for odd-length lists

  2. Update usage message to match the documented pattern
     - Changed from "python" to "uv run python scripts/..."
     - Aligns with the project's uv-based tooling conventions
     - Matches usage examples in the README and throughout the codebase

  These fixes improve statistical accuracy and documentation consistency.
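The even-length median fix described in the commit message can be sketched as follows (an illustrative version, not the exact code from compare_benchmarks.py):

```python
# Minimal sketch of the median fix: for even-length lists, average the two
# middle values instead of returning only one of them.
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd length: single middle value
    # even length: average the two middle values (the fix)
    return (ordered[mid - 1] + ordered[mid]) / 2
```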
1 parent c719b19 commit 342a1d2

File tree

6 files changed: +839, -0 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -141,3 +141,4 @@ runpod/_version.py
 .runpod_jobs.pkl

 *.lock
+benchmark_results/
```
scripts/README.md

Lines changed: 285 additions & 0 deletions
# Cold Start Benchmarking

Performance benchmarking tools for measuring and comparing cold start times across different code changes.

## Quick Start

```bash
# Run benchmark on current branch
uv run pytest tests/test_performance/test_cold_start.py

# Compare two branches
./scripts/benchmark_cold_start.sh main my-feature-branch

# Compare two existing result files
uv run python scripts/compare_benchmarks.py benchmark_results/cold_start_baseline.json benchmark_results/cold_start_latest.json
```

## What Gets Measured

- **Import times**: `import runpod`, `import runpod.serverless`, `import runpod.endpoint`
- **Module counts**: Total modules loaded and runpod-specific modules
- **Lazy loading status**: Whether paramiko and the SSH CLI are eagerly or lazily loaded
- **Statistics**: Min, max, mean, median across 10 iterations per measurement
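The per-measurement statistics above can be collected with a fresh-subprocess timer along these lines (a simplified sketch; the actual test_cold_start.py may differ, and `measure_import` is a hypothetical name):

```python
import statistics
import subprocess
import sys


def measure_import(module, iterations=3):
    """Time `import <module>` in fresh subprocesses; return stats in ms."""
    code = (
        "import time; t0 = time.perf_counter(); "
        f"import {module}; "
        "print((time.perf_counter() - t0) * 1000)"
    )
    times = []
    for _ in range(iterations):
        # A new interpreter per run gives a true cold import (no sys.modules cache).
        out = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, check=True,
        )
        times.append(float(out.stdout.strip()))
    return {
        "min": min(times),
        "max": max(times),
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "iterations": iterations,
    }
```

For example, `measure_import("json")` returns min/max/mean/median import times for the stdlib `json` module.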
## Tools

### 1. test_cold_start.py

Core benchmark test that measures import performance in fresh Python subprocesses.

```bash
# Run as pytest test
uv run pytest tests/test_performance/test_cold_start.py -v

# Run as standalone script
uv run python tests/test_performance/test_cold_start.py

# Results saved to:
# - benchmark_results/cold_start_<timestamp>.json
# - benchmark_results/cold_start_latest.json (always latest)
```

**Output Example:**
```
Running cold start benchmarks...
------------------------------------------------------------
Measuring 'import runpod'...
Mean: 273.29ms
Measuring 'import runpod.serverless'...
Mean: 332.18ms
Counting loaded modules...
Total modules: 582
Runpod modules: 46
Checking if paramiko is eagerly loaded...
Paramiko loaded: False
```
### 2. benchmark_cold_start.sh

Automated benchmark runner that handles git branch switching, dependency installation, and result collection.

```bash
# Run on current branch (no git operations)
./scripts/benchmark_cold_start.sh

# Run on specific branch
./scripts/benchmark_cold_start.sh main

# Compare two branches (runs both, then compares)
./scripts/benchmark_cold_start.sh main feature/lazy-loading
```

**Features:**
- Automatic stash/unstash of uncommitted changes
- Dependency installation per branch
- Safe branch switching with restoration
- Timestamped result files
- Automatic comparison when comparing branches

**Safety:**
- Stashes uncommitted changes before switching branches
- Restores original branch after completion
- Handles errors gracefully

### 3. compare_benchmarks.py

Analyzes and visualizes differences between two benchmark runs with colored terminal output.

```bash
uv run python scripts/compare_benchmarks.py <baseline.json> <optimized.json>
```
**Output Example:**
```
======================================================================
COLD START BENCHMARK COMPARISON
======================================================================

IMPORT TIME COMPARISON
----------------------------------------------------------------------
Metric               Baseline     Optimized     Δ ms         Δ %
----------------------------------------------------------------------
runpod_total         285.64ms     273.29ms      ↓ 12.35ms     4.32%
runpod_serverless    376.33ms     395.14ms      ↑ -18.81ms   -5.00%
runpod_endpoint      378.61ms     399.36ms      ↑ -20.75ms   -5.48%

MODULE LOAD COMPARISON
----------------------------------------------------------------------
Total modules loaded:
  Baseline: 698    Optimized: 582    Δ: 116
Runpod modules loaded:
  Baseline: 48     Optimized: 46     Δ: 2

LAZY LOADING STATUS
----------------------------------------------------------------------
Paramiko    Baseline: LOADED    Optimized: NOT LOADED    ✓ NOW LAZY
SSH CLI     Baseline: LOADED    Optimized: NOT LOADED    ✓ NOW LAZY

======================================================================
SUMMARY
======================================================================
✓ Cold start improved by 12.35ms
✓ That's a 4.3% improvement over baseline
✓ Baseline: 285.64ms → Optimized: 273.29ms
======================================================================
```

**Color coding:**
- Green: Improvements (faster times, lazy loading achieved)
- Red: Regressions (slower times, eager loading introduced)
- Yellow: No change
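The Δ ms and Δ % columns above follow directly from the two timings; a minimal sketch of that calculation (the `compare_metric` helper is hypothetical, not the actual compare_benchmarks.py code):

```python
def compare_metric(baseline_ms, optimized_ms):
    """Return (delta_ms, delta_pct, arrow) for two timings in milliseconds.

    A positive delta means the optimized run is faster than baseline.
    """
    delta = baseline_ms - optimized_ms
    pct = (delta / baseline_ms) * 100 if baseline_ms else 0.0
    arrow = "↓" if delta > 0 else ("↑" if delta < 0 else "=")
    return delta, pct, arrow
```

Applied to the `runpod_total` row above, `compare_metric(285.64, 273.29)` yields a 12.35ms (4.32%) improvement.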
## Result Files

All benchmark results are saved to `benchmark_results/` (gitignored).

**File naming:**
- `cold_start_<timestamp>.json` - Timestamped result
- `cold_start_latest.json` - Always contains the most recent result
- `cold_start_baseline.json` - Manually saved baseline for comparison

**JSON structure:**
```json
{
  "timestamp": 1763179522.0437188,
  "python_version": "3.8.20 (default, Oct 2 2024, 16:12:59) [Clang 18.1.8 ]",
  "measurements": {
    "runpod_total": {
      "min": 375.97,
      "max": 527.9,
      "mean": 393.91,
      "median": 380.4,
      "iterations": 10
    }
  },
  "module_counts": {
    "total": 698,
    "filtered": 48
  },
  "paramiko_eagerly_loaded": true,
  "ssh_cli_loaded": true
}
```
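A result file with this structure could be produced by a small helper along these lines (the field names follow the JSON above, but `write_results` is a hypothetical function, not the actual test code):

```python
import json
import sys
import time


def write_results(measurements, module_counts, paramiko_loaded, ssh_cli_loaded,
                  path="benchmark_results/cold_start_latest.json"):
    """Assemble the benchmark result dict and write it as JSON."""
    result = {
        "timestamp": time.time(),
        "python_version": sys.version,
        "measurements": measurements,        # e.g. {"runpod_total": {...}}
        "module_counts": module_counts,      # e.g. {"total": 698, "filtered": 48}
        "paramiko_eagerly_loaded": paramiko_loaded,
        "ssh_cli_loaded": ssh_cli_loaded,
    }
    with open(path, "w") as f:
        json.dump(result, f, indent=2)
    return result
```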
## Common Workflows

### Testing a Performance Optimization

```bash
# 1. Save baseline on main branch
git checkout main
./scripts/benchmark_cold_start.sh
cp benchmark_results/cold_start_latest.json benchmark_results/cold_start_baseline.json

# 2. Switch to feature branch
git checkout feature/my-optimization

# 3. Run benchmark and compare
./scripts/benchmark_cold_start.sh
uv run python scripts/compare_benchmarks.py \
    benchmark_results/cold_start_baseline.json \
    benchmark_results/cold_start_latest.json
```

### Comparing Multiple Approaches

```bash
# Compare three different optimization branches
./scripts/benchmark_cold_start.sh main > results_main.txt
./scripts/benchmark_cold_start.sh feature/approach-1 > results_1.txt
./scripts/benchmark_cold_start.sh feature/approach-2 > results_2.txt

# Then compare each against baseline
uv run python scripts/compare_benchmarks.py \
    benchmark_results/cold_start_main_*.json \
    benchmark_results/cold_start_approach-1_*.json
```
### CI/CD Integration

Add to your GitHub Actions workflow:

```yaml
- name: Run cold start benchmark
  run: |
    uv run pytest tests/test_performance/test_cold_start.py --timeout=120

- name: Upload benchmark results
  uses: actions/upload-artifact@v3
  with:
    name: benchmark-results
    path: benchmark_results/cold_start_latest.json
```
## Performance Targets

Based on testing with Python 3.8:

- **Cold start (import runpod)**: < 300ms (mean)
- **Serverless import**: < 400ms (mean)
- **Module count**: < 600 total modules
- **Test assertion**: Fails if import > 1000ms
## Interpreting Results

### Import Time Variance

Subprocess-based measurements have inherent variance:
- First run in a sequence: Often 20-50ms slower (Python startup overhead)
- Subsequent runs: More stable
- **Use median or mean** for comparison, not single runs
### Module Count

- **Fewer modules = faster cold start**: Each module has import overhead
- **Runpod-specific modules**: Should be minimal (40-50)
- **Total modules**: Includes stdlib and dependencies
- **Target reduction**: Removing 100+ modules typically saves 10-30ms
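The total and filtered module counts can be taken in a fresh subprocess, for example (a generic sketch; `count_modules` is a hypothetical helper, shown here with stdlib modules rather than runpod):

```python
import subprocess
import sys


def count_modules(module, prefix):
    """Import `module` in a fresh subprocess; count total and prefix-matching modules."""
    code = (
        f"import {module}, sys; "
        "names = list(sys.modules); "
        f"print(len(names), sum(1 for n in names if n.startswith('{prefix}')))"
    )
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    total, filtered = (int(x) for x in out.stdout.split())
    return {"total": total, "filtered": filtered}
```

In the real suite the prefix would be `"runpod"`; here `count_modules("json", "json")` illustrates the mechanism.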
### Lazy Loading Validation

- `paramiko_eagerly_loaded: false` - Good for serverless workers
- `ssh_cli_loaded: false` - Good for SDK users
- These should only be `true` when CLI commands are invoked
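Checking whether a dependency is eagerly pulled in boils down to importing the package in a fresh interpreter and inspecting `sys.modules` (a generic sketch; `is_eagerly_loaded` is a hypothetical helper, demonstrated with stdlib modules since paramiko may not be installed):

```python
import subprocess
import sys


def is_eagerly_loaded(package, dependency):
    """Return True if importing `package` also loads `dependency` (fresh subprocess)."""
    code = f"import {package}, sys; print('{dependency}' in sys.modules)"
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip() == "True"
```

In the real suite this check would be `is_eagerly_loaded("runpod", "paramiko")`, which should return `False` once lazy loading is in place.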
## Troubleshooting

### High Variance in Results

If you see >100ms variance between runs, likely causes are:
- System under load
- Disk I/O contention
- Python bytecode cache issues

**Solution:** Run multiple times and use median values.

### benchmark_cold_start.sh Fails

```bash
# Check git status
git status

# Manually restore if script failed mid-execution
git checkout <original-branch>
git stash pop
```

### Import Errors During Benchmark

Ensure dependencies are installed:
```bash
uv sync --group test
```
## Benchmark Accuracy

- **Iterations**: 10 per measurement (configurable in test)
- **Process isolation**: Each measurement uses a fresh subprocess
- **Python cache**: Cleared by subprocess creation
- **System state**: Cannot control OS-level caching

For production performance testing, consider:
- Running on CI with a consistent environment
- Multiple runs at different times
- Comparing trends over multiple commits