Feature/optimize pyrdown f32 with separable convolution and rayon parallelism #766
Conversation
The `pyrdown_f32` optimization benchmark results:
3-Channel Images (pyrdown_f32_3c)
| Image Dimension | Previous Time | New Execution Time | Time Reduction | Throughput Increase |
|---|---|---|---|---|
| 256x224 | ~ 265 µs | 102 µs | -62.13% | +164.07% |
| 512x448 | ~ 1000 µs | 184 µs | -81.42% | +438.22% |
| 1024x896 | ~ 4000 µs | 504 µs | -87.28% | +686.34% |
1-Channel Images (pyrdown_f32)
| Image Dimension | Previous Time | New Execution Time | Time Reduction | Throughput Increase |
|---|---|---|---|---|
| 256x224 | ~ 83 µs | 82 µs | -2.14% | +2.18% |
| 512x448 | ~ 325 µs | 113 µs | -65.10% | +186.53% |
| 1024x896 | ~ 1300 µs | 226 µs | -82.63% | +475.78% |
@sidd-27 can you review?
Benchmarks, validation?
Validation
Benchmarks
We achieved massive multi-threaded performance enhancements by splitting the single-threaded fused 2D convolution into separable horizontal and vertical passes. Here are the local benchmark results:
3-Channel Images (pyrdown_f32_3c)
| Dimension | Throughput Increase | Execution Time Reduction |
|---|---|---|
| 256x224 | +164.07% | -62.13% |
| 512x448 | +438.22% | -81.42% |
| 1024x896 | +686.34% | -87.28% |
1-Channel Images (pyrdown_f32)
| Dimension | Throughput Increase | Execution Time Reduction |
|---|---|---|
| 256x224 | +2.18% | -2.14% |
| 512x448 | +186.53% | -65.10% |
| 1024x896 | +475.78% | -82.63% |
The algorithmic reduction ($25 \to 10$ multiply-adds per output pixel for the 5×5 kernel) paired with multi-threading provides accelerating returns as resolution increases!
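For illustration, here is a minimal single-channel sketch of the separable scheme (borders and threading omitted; the function name and layout are illustrative, not this PR's actual code):

```rust
/// 5-tap binomial kernel used by pyrDown: [1, 4, 6, 4, 1] / 16.
const K: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];

/// Illustrative separable pyrdown for one f32 channel. A fused 2D pass
/// costs 25 multiply-adds per output pixel; the two 1D passes below
/// cost 5 + 5 = 10. Border pixels are skipped for brevity.
fn pyrdown_separable(src: &[f32], w: usize, h: usize) -> Vec<f32> {
    let (dw, dh) = (w / 2, h / 2);
    // Horizontal pass: half width, full height.
    let mut tmp = vec![0.0f32; dw * h];
    for y in 0..h {
        for dx in 0..dw {
            let cx = dx * 2;
            if cx >= 2 && cx + 2 < w {
                let mut acc = 0.0f32;
                for k in 0..5 {
                    acc += K[k] * src[y * w + cx - 2 + k];
                }
                tmp[y * dw + dx] = acc;
            }
        }
    }
    // Vertical pass: half height, reading the intermediate buffer.
    let mut dst = vec![0.0f32; dw * dh];
    for dy in 0..dh {
        let cy = dy * 2;
        if cy >= 2 && cy + 2 < h {
            for dx in 0..dw {
                let mut acc = 0.0f32;
                for k in 0..5 {
                    acc += K[k] * tmp[(cy - 2 + k) * dw + dx];
                }
                dst[dy * dw + dx] = acc;
            }
        }
    }
    dst
}
```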
Better documentation of what the function does would be helpful; you seem to have removed some comments.
Sure, let me do that.
Could you also benchmark against OpenCV? We'd like to achieve speeds close to or faster than that.
Sure.
Benchmarks vs OpenCV (and Before/After Optimization)
I ran local benchmarks comparing the old fused kornia-rs implementation, this PR's separable implementation, and OpenCV.
1-Channel (pyrdown_f32)
| Dimension | OpenCV (ms) | kornia-rs Before (ms) | kornia-rs After (ms) | Speedup vs Before | Difference vs CV2 |
|---|---|---|---|---|---|
| 256x224 | 0.0277 | 0.123 | 0.081 | 1.52x | 2.92x |
| 512x448 | 0.0341 | 0.487 | 0.120 | 4.05x | 3.51x |
| 1024x896 | 0.0813 | 1.949 | 0.249 | 7.82x | 3.06x |
3-Channel (pyrdown_f32_3c)
| Dimension | OpenCV (ms) | kornia-rs Before (ms) | kornia-rs After (ms) | Speedup vs Before | Difference vs CV2 |
|---|---|---|---|---|---|
| 256x224 | 0.0297 | 0.257 | 0.110 | 2.33x | 3.70x |
| 512x448 | 0.0552 | 0.989 | 0.197 | 5.02x | 3.56x |
| 1024x896 | 0.1464 | 3.805 | 0.484 | 7.86x | 3.30x |
kornia-imgproc is now hovering around ~3x slower than OpenCV C++ across various sizes, which is an excellent starting point for pure Rust without unsafe SIMD intrinsics, and represents an almost 8x speedup over the previous nested loop implementation!
@sidd-27
@sidd-27 if everything looks fine, can we merge this?
sidd-27 left a comment
Review: PR #766 — Optimize pyrdown f32 with separable convolution and rayon parallelism
Good optimization direction — separating the 2D 5×5 convolution into horizontal + vertical 1D passes is textbook and should give meaningful speedups. The safe-region / boundary-region split to skip reflect_101 in the inner loop is also well-motivated. A few issues though:
🔴 Issues
1. No benchmarks included
Per project guidelines: "Benchmark before and after any change to a core algorithm using cargo bench". This is a performance PR — the entire justification is speed. Without benchmark numbers, reviewers can't verify the claimed improvement. Please include before/after results.
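As a shape for such a benchmark, a minimal Criterion sketch along these lines would do (`pyrdown_separable` is a placeholder name for whatever entry point is measured, not the crate's actual API):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_pyrdown(c: &mut Criterion) {
    for &(w, h) in &[(256, 224), (512, 448), (1024, 896)] {
        let src = vec![0.5f32; w * h];
        c.bench_function(&format!("pyrdown_f32_{w}x{h}"), |b| {
            // Placeholder call: substitute the actual function under test.
            b.iter(|| pyrdown_separable(black_box(&src), w, h))
        });
    }
}

criterion_group!(benches, bench_pyrdown);
criterion_main!(benches);
```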
2. Extra memory allocation — full intermediate buffer
```rust
let mut buffer = vec![0.0f32; buffer_width * buffer_height * C];
```
The original did a single fused pass with no temporary allocation. This PR allocates a full dst_width × src_height × C buffer. For a 4K RGB image that's ~50MB. While separable convolution is algorithmically faster, the allocation + extra memory pass could hurt for large images or memory-constrained environments. This tradeoff should be documented and benchmarked.
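For concreteness, taking 4K as a 3840×2160 source, the buffer is dst_width = 1920 by src_height = 2160 with 3 f32 channels:

$$1920 \times 2160 \times 3 \times 4\,\text{B} = 49{,}766{,}400\,\text{B} \approx 50\,\text{MB}$$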
3. Potential underflow in safe region calculation (horizontal pass)
```rust
let safe_end = if src_width > 2 {
    (src_width - 1) / 2
} else {
    safe_start
};
```
When src_width = 3, safe_end = 1, which equals safe_start = 1, so the fast path processes zero pixels. When src_width = 4, safe_end = 1: still no fast path. The fast path only kicks in at src_width >= 6. Is this intentional? For the kernel radius of 2, the safe region should start at dst_x = 1 (i.e., src_x = 2) and end before src_center_x + 2 >= src_width, so safe_end = (src_width - 2) / 2. The current formula (src_width - 1) / 2 may include one pixel too many on the right edge.
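A throwaway brute-force check settles the off-by-one question quickly; this sketch asserts the 5-tap window stays inside the image for both candidate formulas (as written, both pass, which matches the bounds-derivation comment the author later added to the code):

```rust
#[test]
fn safe_region_window_stays_in_bounds() {
    // For every dst_x in [1, safe_end), the 5-tap window around
    // src_center_x = dst_x * 2 must satisfy c - 2 >= 0 and c + 2 < src_width.
    for src_width in 5..1000usize {
        for safe_end in [(src_width - 1) / 2, (src_width - 2) / 2] {
            for dst_x in 1..safe_end {
                let c = dst_x * 2;
                assert!(c >= 2 && c + 2 < src_width, "w={src_width}, dst_x={dst_x}");
            }
        }
    }
}
```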
4. Vertical pass: y_c - 2 can underflow when dst_y = 0 enters safe path
```rust
if dst_y >= safe_start_y && dst_y < safe_end_y {
    let y_c = src_center_y as usize; // dst_y * 2
    let off_m2 = (y_c - 2) * stride; // when dst_y=1, y_c=2, so 0 — OK
```
With safe_start_y = 1, the minimum y_c = 2, so y_c - 2 = 0, which is fine. But the vertical safe_end_y has the same off-by-one risk as horizontal: when dst_y = safe_end_y - 1, src_center_y + 2 must be < src_height. Currently safe_end_y = (src_height - 1) / 2, which means src_center_y = (src_height - 1) / 2 * 2 - 2 at max. For src_height = 5: safe_end_y = 2, max src_center_y = 2, src_center_y + 2 = 4 < 5 ✓. For src_height = 4: safe_end_y = 1 = safe_start_y, so no fast path, OK. Seems correct but tight; add a comment explaining the bounds derivation.
5. Missing doc comments on new functions
pyrdown_horizontal_pass_f32, pyrdown_vertical_pass_f32, and process_pixel_pyrdown_checked_f32 have basic descriptions but are missing the required # Arguments, # Returns sections per project documentation guidelines. They're fn (private) not pub, so this is lower severity — but still good practice for complex image processing code.
🟡 Suggestions
6. Inner loop body is duplicated between safe and boundary paths (vertical pass)
The actual convolution computation is identical — only the row offset calculation differs. Extract the 5-point weighted sum into an #[inline(always)] helper to reduce code duplication:
```rust
#[inline(always)]
fn gaussian5(a: f32, b: f32, c: f32, d: f32, e: f32) -> f32 {
    0.0625 * a + 0.25 * b + 0.375 * c + 0.25 * d + 0.0625 * e
}
```
7. Magic numbers 0.0625, 0.25, 0.375 repeated throughout
Define kernel weights as named constants at the top:
```rust
const K_EDGE: f32 = 1.0 / 16.0; // 0.0625
const K_NEAR: f32 = 4.0 / 16.0; // 0.25
const K_CENTER: f32 = 6.0 / 16.0; // 0.375
```
8. #[inline(always)] on process_pixel_pyrdown_checked_f32
This is appropriate for a per-pixel helper. Good.
9. Consider processing the buffer in-place or using a smaller rolling buffer
Instead of allocating the full intermediate buffer, you could process rows in tiles or use a ring buffer of 5 rows for the vertical pass. This would reduce memory from O(W×H) to O(W×5). More complex but would address issue #2.
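A rough single-channel sketch of that ring-buffer idea, assuming borders are handled separately (`hpass_row` is a hypothetical stand-in for the horizontal pass applied to one source row):

```rust
/// Illustrative vertical pass using a 5-row ring buffer instead of a
/// full dst_width x src_height intermediate: memory drops from
/// O(W x H) to O(W x 5). Border rows are omitted for brevity.
fn pyrdown_vertical_ring(
    hpass_row: impl Fn(usize, &mut [f32]), // fills one horizontally-convolved row
    dw: usize,
    src_height: usize,
    dst: &mut [f32],
) {
    const K: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];
    let mut ring = vec![0.0f32; dw * 5];
    let mut filled = 0; // next source row to run the horizontal pass on

    for dy in 0..src_height / 2 {
        let cy = dy * 2;
        if cy < 2 || cy + 2 >= src_height {
            continue; // boundary rows: handled by the reflect_101 slow path
        }
        // Ensure rows cy-2 ..= cy+2 are resident; row r lives in slot r % 5.
        while filled <= cy + 2 {
            let slot = (filled % 5) * dw;
            hpass_row(filled, &mut ring[slot..slot + dw]);
            filled += 1;
        }
        for dx in 0..dw {
            let mut acc = 0.0f32;
            for k in 0..5 {
                acc += K[k] * ring[((cy - 2 + k) % 5) * dw + dx];
            }
            dst[dy * dw + dx] = acc;
        }
    }
}
```

The catch is that the vertical pass becomes sequential in `y` (each output row depends on the rolling window), so row-level rayon parallelism would need to move to tiles or channels.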
10. PR title says "separable convolution and rayon parallelism"
The rayon parallelism was already present conceptually (the original had sequential loops). The PR adds par_chunks_mut which is new. Good addition — but the title implies rayon was added when it was more "restructured to enable proper rayon usage."
✅ What's Good
- Separable convolution is the right architectural choice for Gaussian blur
- Safe-region optimization avoids the per-pixel branch for boundary checks
- `reflect_101` handling is correctly isolated to border pixels (sketched below)
- Rayon parallelization via `par_chunks_mut` on rows is clean
- Small image fallback (`src_height <= 2`) is handled gracefully
- Well-filled PR description and checklist
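For reference, a minimal sketch of what BORDER_REFLECT_101 mirroring does (illustrative helper, not the crate's actual implementation):

```rust
/// Mirror an out-of-range index without repeating the edge pixel.
/// For n = 5: -2 -> 2, -1 -> 1, 5 -> 3, 6 -> 2.
/// Valid when the kernel radius is smaller than the image size.
fn reflect_101(i: isize, n: isize) -> usize {
    let i = if i < 0 { -i } else { i };
    (if i >= n { 2 * n - 2 - i } else { i }) as usize
}
```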
Verdict: Promising optimization but needs benchmarks (required per guidelines), bounds verification, and minor cleanup. The intermediate buffer allocation tradeoff should be measured.
Will look into your comments and update the PR shortly.
Hi, I have addressed your comments and updated the docs as well. Please review @sidd-27.
Regarding Issue #2 and the memory allocation: I have documented the intermediate-buffer trade-off in the function's doc comment. The new approach yields massive ~5x to 7x improvements in throughput for larger images. Given the impact, I think accepting the temporary allocation is worth it for now until we invest in a complex tiling implementation.
Here are the benchmark results run locally via Criterion (cargo bench -p kornia-imgproc --bench bench_pyramid):
pyrdown_f32 (1 Channel)
| Image Size | Old Fused (Baseline) | New Separable (PR) | Improvement | Speedup |
|---|---|---|---|---|
| 256x224 | | | | |
| 512x448 | | | | |
| 1024x896 | | | | |
pyrdown_f32_3c (3 Channels)
| Image Size | Old Fused (Baseline) | New Separable (PR) | Improvement | Speedup |
|---|---|---|---|---|
| 256x224 | | | | |
| 512x448 | | | | |
| 1024x896 | | | | |
Here are the benchmark results run locally via Criterion (cargo bench -p kornia-imgproc --bench bench_pyramid) and Python OpenCV cv2.pyrDown:
pyrdown (1 Channel, f32)
| Image Size | Old Fused (Kornia) | New Separable (Kornia PR) | OpenCV (cv2.pyrDown) | Kornia PR vs Old |
|---|---|---|---|---|
| 256x224 | | | | |
| 512x448 | | | | |
| 1024x896 | | | | |
pyrdown_3c (3 Channels, f32)
| Image Size | Old Fused (Kornia) | New Separable (Kornia PR) | OpenCV (cv2.pyrDown) | Kornia PR vs Old |
|---|---|---|---|---|
| 256x224 | | | | |
| 512x448 | | | | |
| 1024x896 | | | | |
@sidd-27 can you check the PR once? I have addressed the issues.
@sidd-27 can you review this one?
@sidd-27 can you take a look at this one as well?
```rust
/// # Implementation Details & Trade-offs
///
/// This function uses a **separable 2D convolution** (horizontal pass followed by vertical pass)
/// accelerated by Rayon. This provides massive speedups (e.g., 5-7× faster on 1024×896 images)
/// by reducing per-pixel operations from \(k^2\) to \(2k\).
///
/// However, this optimization requires a **full intermediate buffer** of size `dst_width × src_height × channels`.
/// For very large images (e.g., 4K RGB requiring ~50MB) in memory-constrained environments,
/// this temporary peak memory allocation should be taken into account.
///
```
these comments feel very ai
Yes, I used AI to help generate those comments. I found it useful for creating clear, comprehensive documentation quickly. That said, if they feel too formal or off-brand for the project, I'm happy to revise them. Let me know what changes would help them feel more natural or better suited to the code.
@sidd-27
```rust
// For a 5-tap kernel (radius 2), we need `src_center_x + 2 < src_width`.
// Since `src_center_x = dst_x * 2`, we want `dst_x * 2 + 2 <= src_width - 1`.
// Solving for `dst_x` (safe range) gives an exclusive upper bound of `(src_width - 1) / 2`.
```
@sidd-27 Is there anything you would like me to update in this? Let's wrap this up; it's been pending for a long time now.
Edgar has been writing SIMD kernels for imgproc. I think we're moving towards those in general, as they are faster.
Check the latest. Any perf optimisation provided should come with benchmarks comparing relative improvement against other libs, ideally with SIMD and NEON.
Got it, will look into it.
📝 Description
This PR introduces a significant performance optimization to the `f32` implementation of the `pyrdown` image pyramid operation (`pyrdown_vertical_pass_f32`). It accelerates the vertical convolution pass by bypassing expensive boundary reflection checks for the vast majority of the image.
Fixes/Relates to: #763
Important:
🛠️ Changes Made
- Optimized `pyrdown_vertical_pass_f32` in `kornia-imgproc/src/pyramid.rs` by introducing a safe-row iteration fast path.
- Split rows into a safe interior region (no `reflect_101`) and an edge region that falls back to it.
- Kept `rayon` (`par_chunks_mut`) parallelism to ensure multi-threaded performance is maintained alongside the new inner loop optimization.
🧪 How Was This Tested?
- Ran the existing `pyrdown` and pyramid testing suites to guarantee output correctness and verify that edge padding/reflection still matches expectations.
- Checked that the safe-row bounds (`safe_start_y` to `safe_end_y`) never violate buffer dimensions.
- Handled small images (`src_height <= 2`) by falling back entirely to the `reflect_101` edge case logic, avoiding out-of-bounds index calculations.
🕵️ AI Usage Disclosure
Check one of the following:
🚦 Checklist
💭 Additional Context
By skipping `reflect_101` for the inner rows, this eliminates redundant boundary checks and modulo operations for the vast majority of pixels in the vertical pass, vastly improving CPU cache utilization and instruction throughput. A sketch of this structure follows.
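A minimal single-channel sketch of the safe/boundary row split with `par_chunks_mut` (names are illustrative, not this PR's actual code; assumes `src_height > 2`, matching the fallback mentioned above):

```rust
use rayon::prelude::*;

const K: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];

/// BORDER_REFLECT_101 index mirroring (illustrative helper).
fn reflect_101(i: isize, n: isize) -> usize {
    let i = if i < 0 { -i } else { i };
    (if i >= n { 2 * n - 2 - i } else { i }) as usize
}

/// Vertical pass over the horizontally-convolved buffer `buf`
/// (dw x src_height). Interior rows take an unchecked fast path;
/// only the outermost output rows pay for reflect_101.
fn vertical_pass(buf: &[f32], dw: usize, src_height: usize, dst: &mut [f32]) {
    let safe_start = 1;
    let safe_end = (src_height - 1) / 2; // see the bounds derivation above

    dst.par_chunks_mut(dw).enumerate().for_each(|(dy, row)| {
        let cy = dy * 2;
        if dy >= safe_start && dy < safe_end {
            // Fast path: the 5-row window is provably in bounds, no checks.
            for dx in 0..dw {
                let mut acc = 0.0f32;
                for k in 0..5 {
                    acc += K[k] * buf[(cy - 2 + k) * dw + dx];
                }
                row[dx] = acc;
            }
        } else {
            // Boundary rows: mirror out-of-range row indices.
            for dx in 0..dw {
                let mut acc = 0.0f32;
                for k in 0..5 {
                    let y = reflect_101(cy as isize - 2 + k, src_height as isize);
                    acc += K[k as usize] * buf[y * dw + dx];
                }
                row[dx] = acc;
            }
        }
    });
}
```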