
Feature/optimize pyrdown f32 with separable convolution and rayon parallelism#766

Open
namanguptagit wants to merge 6 commits into kornia:main from namanguptagit:feature/Optimize_pyrdown_f32_with_separable_convolution_and_Rayon_parallelism

Conversation

@namanguptagit
Contributor

📝 Description

This PR introduces a significant performance optimization to the f32 implementation of the pyrdown image pyramid operation (pyrdown_vertical_pass_f32). It accelerates the vertical convolution pass by bypassing expensive boundary reflection checks for the vast majority of the image.

⚠️ Issue Link Required: This PR must be linked to an approved and assigned issue. See Contributing Guide for details.
Fixes/Relates to: #763

Important:

  • Ensure you are assigned to the linked issue before submitting this PR
  • This PR should strictly implement what the linked issue describes
  • Do not include changes beyond the scope of the linked issue

🛠️ Changes Made

  • Optimized pyrdown_vertical_pass_f32 in kornia-imgproc/src/pyramid.rs by introducing a safe-row iteration fast path.
  • Refactored the vertical pass to split processing into two branches: a "safe" inner region that skips costly border reflection (reflect_101), and an edge region that falls back to it (a minimal sketch of this split follows after this list).
  • Retained rayon (par_chunks_mut) parallelism to ensure multi-threaded performance is maintained alongside the new inner loop optimization.
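For illustration, below is a minimal single-channel sketch of the safe/edge split. It is not the actual kornia-imgproc code: the `vertical_pass` signature is hypothetical, `reflect_101` follows OpenCV's BORDER_REFLECT_101 semantics, and only the [1, 4, 6, 4, 1]/16 kernel weights match the constants used in the PR.

```rust
use rayon::prelude::*;

// OpenCV-style BORDER_REFLECT_101 (gfedcb|abcdefgh|gfedcba), valid for any n >= 1.
fn reflect_101(i: isize, n: isize) -> usize {
    if n == 1 {
        return 0;
    }
    let period = 2 * (n - 1);
    let m = i.rem_euclid(period);
    (if m >= n { period - m } else { m }) as usize
}

// Vertical pass over a single-channel buffer of `width * src_height` f32s.
// Output rows whose 5-tap window is fully inside the source take the fast
// path with no reflection logic; edge rows fall back to `reflect_101`.
fn vertical_pass(src: &[f32], dst: &mut [f32], width: usize, src_height: usize) {
    const K: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];
    let (safe_start, safe_end) = if src_height > 2 {
        (1usize, (src_height - 1) / 2)
    } else {
        (0, 0) // tiny images: every row goes through the reflecting path
    };
    dst.par_chunks_mut(width).enumerate().for_each(|(dst_y, row)| {
        let y_c = dst_y * 2; // source center row for this output row
        let fast = dst_y >= safe_start && dst_y < safe_end;
        for (x, out) in row.iter_mut().enumerate() {
            let mut acc = 0.0f32;
            for (k, w) in K.iter().enumerate() {
                let y = if fast {
                    y_c + k - 2 // in bounds by construction of safe_start/safe_end
                } else {
                    reflect_101(y_c as isize + k as isize - 2, src_height as isize)
                };
                acc += w * src[y * width + x];
            }
            *out = acc;
        }
    });
}
```

In the real code the two regions are handled by separate loops; folding them into one `if fast` branch here just keeps the sketch short.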

🧪 How Was This Tested?

  • Unit Tests: Relying on existing pyrdown and pyramid testing suites to guarantee output correctness and verify that edge padding/reflection still matches expectations.
  • Manual Verification: Validated the optimization logic by ensuring the indices calculated in the inner "safe" boundary (safe_start_y to safe_end_y) never violate buffer dimensions.
  • Performance/Edge Cases: Handles very small images gracefully (e.g., src_height <= 2) by falling back entirely to the reflect_101 edge case logic, avoiding out-of-bounds index calculations.

🕵️ AI Usage Disclosure

Check one of the following:

  • 🟢 No AI used.
  • 🟡 AI-assisted: I used AI for boilerplate/refactoring but have manually reviewed and tested every line.
  • 🔴 AI-generated: (Note: These PRs may be subject to stricter scrutiny or immediate closure if the logic is not explained).

🚦 Checklist

  • I am assigned to the linked issue (required before PR submission)
  • The linked issue has been approved by a maintainer
  • This PR strictly implements what the linked issue describes (no scope creep)
  • I have performed a self-review of my code (no "ghost" variables or hallucinations).
  • My code follows the existing style guidelines of this project.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • (Optional) I have attached screenshots/recordings for UI changes.

💭 Additional Context

Skipping reflect_101 for the inner rows eliminates redundant boundary checks and modulo operations for the vast majority of pixels in the vertical pass, improving CPU cache utilization and instruction throughput.

@namanguptagit
Contributor Author


Benchmark Results: pyrdown_f32 Optimization

I've optimized the pyrdown_f32 (and pyrdown_f32_3c) functions in crates/kornia-imgproc/src/pyramid.rs by replacing the single-threaded $O(25)$ 5x5 convolution loop with separable horizontal and vertical passes using rayon data-parallelism.

(Note: The vertical pass was also updated to skip reflect_101 edge-checks for center rows for algorithmic consistency with the horizontal pass).
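For reference, this is the separability being exploited: the 5×5 pyrdown kernel (consistent with the 0.0625 / 0.25 / 0.375 weights in the code) factors as the outer product of a 1D binomial kernel, turning 25 multiply-adds per pixel into two passes of 5:

$$K = k\,k^{\top}, \qquad k = \tfrac{1}{16}\begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix}^{\top}$$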

Here are the results from running cargo bench -p kornia-imgproc --bench bench_pyramid:

3-Channel Images (pyrdown_f32_3c)

| Image Dimension | Previous Time | New Execution Time | Time Reduction | Throughput Increase |
|---|---|---|---|---|
| 256x224 | ~265 µs | 102 µs | -62.13% | +164.07% |
| 512x448 | ~1000 µs | 184 µs | -81.42% | +438.22% |
| 1024x896 | ~4000 µs | 504 µs | -87.28% | +686.34% |

1-Channel Images (pyrdown_f32)

| Image Dimension | Previous Time | New Execution Time | Time Reduction | Throughput Increase |
|---|---|---|---|---|
| 256x224 | ~83 µs | 82 µs | -2.14% | +2.18% |
| 512x448 | ~325 µs | 113 µs | -65.10% | +186.53% |
| 1024x896 | ~1300 µs | 226 µs | -82.63% | +475.78% |

@namanguptagit
Contributor Author

@sidd-27 can you review?

@sidd-27
Collaborator

sidd-27 commented Feb 26, 2026

Benchmarks, validation?

@namanguptagit
Contributor Author

Validation

  • Mathematical Equivalence: The optimization strictly preserves mathematical correctness by maintaining OpenCV's exact reflect_101 border mode near image boundaries.
  • Testing: Validated against the crate's comprehensive numerical unit tests (cargo test -p kornia-imgproc), which all pass successfully.

Benchmarks

We achieved massive multi-threaded performance enhancements by splitting the single-threaded $O(25)$ per-pixel Gaussian blur into two separable $O(5)$ passes (horizontal and vertical) and utilizing rayon data-parallelism across the image rows.

Here are the local benchmark results (cargo bench -p kornia-imgproc --bench bench_pyramid):

3-Channel Images (pyrdown_f32_3c)

| Dimension | Throughput Increase | Execution Time Reduction |
|---|---|---|
| 256x224 | +164.07% | -62.13% |
| 512x448 | +438.22% | -81.42% |
| 1024x896 | +686.34% | -87.28% |

1-Channel Images (pyrdown_f32)

| Dimension | Throughput Increase | Execution Time Reduction |
|---|---|---|
| 256x224 | +2.18% | -2.14% |
| 512x448 | +186.53% | -65.10% |
| 1024x896 | +475.78% | -82.63% |

The algorithmic reductions ($O(25) \to O(10)$) paired with multi-threading provide accelerating returns as resolution increases!

@sidd-27
Collaborator

sidd-27 commented Feb 26, 2026

Better documentation of what the function does would be helpful; you seem to have removed some comments.

@namanguptagit
Contributor Author

> Better documentation of what the function does would be helpful; you seem to have removed some comments.

sure let me do that

@sidd-27
Collaborator

sidd-27 commented Feb 26, 2026

Could you also benchmark against OpenCV? We'd like to achieve speeds close to or faster than that.

@namanguptagit
Contributor Author

> Could you also benchmark against OpenCV? We'd like to achieve speeds close to or faster than that.

sure

@namanguptagit
Contributor Author

namanguptagit commented Feb 26, 2026

Benchmarks vs OpenCV (and Before/After Optimization)

I ran local benchmarks comparing the pyrdown_f32 performance before the optimization, after the rayon separable convolution optimization, and against cv2.pyrDown.

1-Channel (pyrdown_f32)

| Dimension | OpenCV (ms) | kornia-rs Before (ms) | kornia-rs After (ms) | Speedup vs Before | Slowdown vs OpenCV |
|---|---|---|---|---|---|
| 256x224 | 0.0277 | 0.123 | 0.081 | 1.52x | 2.92x |
| 512x448 | 0.0341 | 0.487 | 0.120 | 4.05x | 3.51x |
| 1024x896 | 0.0813 | 1.949 | 0.249 | 7.82x | 3.06x |

3-Channel (pyrdown_f32_3c)

| Dimension | OpenCV (ms) | kornia-rs Before (ms) | kornia-rs After (ms) | Speedup vs Before | Slowdown vs OpenCV |
|---|---|---|---|---|---|
| 256x224 | 0.0297 | 0.257 | 0.110 | 2.33x | 3.70x |
| 512x448 | 0.0552 | 0.989 | 0.197 | 5.02x | 3.56x |
| 1024x896 | 0.1464 | 3.805 | 0.484 | 7.86x | 3.30x |

kornia-imgproc is now hovering around ~3x slower than OpenCV C++ across various sizes, which is an excellent starting point for pure Rust without unsafe SIMD intrinsics, and represents an almost 8x speedup over the previous nested loop implementation!
@sidd-27

@namanguptagit
Contributor Author

@sidd-27 if everything looks fine, can we merge this?

Collaborator

@sidd-27 left a comment


Review: PR #766 — Optimize pyrdown f32 with separable convolution and rayon parallelism

Good optimization direction — separating the 2D 5×5 convolution into horizontal + vertical 1D passes is textbook and should give meaningful speedups. The safe-region / boundary-region split to skip reflect_101 in the inner loop is also well-motivated. A few issues though:


🔴 Issues

1. No benchmarks included
Per project guidelines: "Benchmark before and after any change to a core algorithm using cargo bench". This is a performance PR — the entire justification is speed. Without benchmark numbers, reviewers can't verify the claimed improvement. Please include before/after results.

2. Extra memory allocation — full intermediate buffer

```rust
let mut buffer = vec![0.0f32; buffer_width * buffer_height * C];
```

The original did a single fused pass with no temporary allocation. This PR allocates a full dst_width × src_height × C buffer. For a 4K RGB image that's 1920 × 2160 × 3 × 4 bytes ≈ 50MB. While separable convolution is algorithmically faster, the allocation + extra memory pass could hurt for large images or memory-constrained environments. This tradeoff should be documented and benchmarked.

3. Potential underflow in safe region calculation (horizontal pass)

```rust
let safe_end = if src_width > 2 {
    (src_width - 1) / 2
} else {
    safe_start
};
```

When src_width = 3, safe_end = 1 which equals safe_start = 1, so the fast path processes zero pixels. When src_width = 4, safe_end = 1 — still no fast path. The fast path only kicks in at src_width >= 6. Is this intentional? For the kernel radius of 2, the safe region should start at dst_x = 1 (i.e., src_x = 2) and end before src_center_x + 2 >= src_width, so safe_end = (src_width - 2) / 2. The current formula (src_width - 1) / 2 may include one pixel too many on the right edge.

4. Vertical pass: y_c - 2 can underflow when dst_y = 0 enters safe path

```rust
if dst_y >= safe_start_y && dst_y < safe_end_y {
    let y_c = src_center_y as usize;  // dst_y * 2
    let off_m2 = (y_c - 2) * stride;  // when dst_y = 1, y_c = 2, so 0 — OK
```

With safe_start_y = 1, the minimum y_c = 2, so y_c - 2 = 0 — that's fine. But the vertical safe_end_y has the same off-by-one risk as horizontal: when dst_y = safe_end_y - 1, src_center_y + 2 must be < src_height. Currently safe_end_y = (src_height - 1) / 2, which means src_center_y = (src_height - 1) / 2 * 2 - 2 at max. For src_height = 5: safe_end_y = 2, max src_center_y = 2, src_center_y + 2 = 4 < 5 ✓. For src_height = 4: safe_end_y = 1 = safe_start_y — no fast path, OK. Seems correct but tight — add a comment explaining the bounds derivation. (A quick exhaustive check of these bounds is sketched after this list.)

5. Missing doc comments on new functions
pyrdown_horizontal_pass_f32, pyrdown_vertical_pass_f32, and process_pixel_pyrdown_checked_f32 have basic descriptions but are missing the required # Arguments, # Returns sections per project documentation guidelines. They're fn (private) not pub, so this is lower severity — but still good practice for complex image processing code.
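To make the bounds discussion in items 3 and 4 concrete, here is a hypothetical exhaustive check (not part of the PR) of the quoted exclusive-bound formula, under the convention that src_center = dst * 2. It passes for all small sizes, consistent with the "seems correct but tight" conclusion above:

```rust
#[test]
fn safe_range_never_reads_out_of_bounds() {
    for src_len in 3..=64usize {
        let safe_start = 1usize;
        let safe_end = (src_len - 1) / 2; // formula quoted above
        for d in safe_start..safe_end {
            let c = d * 2; // source center index
            // The 5-tap window [c - 2, c + 2] must stay inside [0, src_len).
            assert!(c >= 2 && c + 2 < src_len, "src_len={src_len}, dst={d}");
        }
    }
}
```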


🟡 Suggestions

6. Inner loop body is duplicated between safe and boundary paths (vertical pass)
The actual convolution computation is identical — only the row offset calculation differs. Extract the 5-point weighted sum into an #[inline(always)] helper to reduce code duplication:

```rust
#[inline(always)]
fn gaussian5(a: f32, b: f32, c: f32, d: f32, e: f32) -> f32 {
    0.0625 * a + 0.25 * b + 0.375 * c + 0.25 * d + 0.0625 * e
}
```

7. Magic numbers 0.0625, 0.25, 0.375 repeated throughout
Define kernel weights as named constants at the top:

```rust
const K_EDGE: f32 = 1.0 / 16.0;   // 0.0625
const K_NEAR: f32 = 4.0 / 16.0;   // 0.25
const K_CENTER: f32 = 6.0 / 16.0; // 0.375
```

8. #[inline(always)] on process_pixel_pyrdown_checked_f32
This is appropriate for a per-pixel helper. Good.

9. Consider processing the buffer in-place or using a smaller rolling buffer
Instead of allocating the full intermediate buffer, you could process rows in tiles or use a ring buffer of 5 rows for the vertical pass. This would reduce memory from O(W×H) to O(W×5). More complex but would address issue #2. (A rough sketch follows after this list.)

10. PR title says "separable convolution and rayon parallelism"
The rayon parallelism was already present conceptually (the original had sequential loops). The PR adds par_chunks_mut which is new. Good addition — but the title implies rayon was added when it was more "restructured to enable proper rayon usage."
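To make suggestion 9 concrete, here is a rough single-threaded sketch of the 5-row ring buffer; `pyrdown_ring` and `hpass` are hypothetical stand-ins, not kornia-imgproc APIs:

```rust
/// Sketch only: O(W × 5) intermediate storage instead of O(W × H).
/// `hpass(y)` stands in for the horizontal pass applied to source row `y`,
/// returning a row of `dst_width` filtered values.
fn pyrdown_ring<F: Fn(usize) -> Vec<f32>>(hpass: F, src_height: usize, dst_width: usize) -> Vec<f32> {
    const K: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];
    let reflect = |i: isize| -> usize {
        let n = src_height as isize;
        if n == 1 { return 0; }
        let p = 2 * (n - 1);
        let m = i.rem_euclid(p);
        (if m >= n { p - m } else { m }) as usize
    };
    let dst_height = src_height.div_ceil(2);
    let mut dst = vec![0.0f32; dst_height * dst_width];
    // Ring of up to 5 horizontally filtered rows, keyed by source row % 5.
    let mut ring: Vec<Option<(usize, Vec<f32>)>> = vec![None; 5];
    for dst_y in 0..dst_height {
        let y_c = (dst_y * 2) as isize;
        for (k, w) in K.iter().enumerate() {
            let y = reflect(y_c + k as isize - 2);
            let slot = &mut ring[y % 5];
            if slot.as_ref().map(|(cached, _)| *cached) != Some(y) {
                *slot = Some((y, hpass(y))); // filter this source row on demand
            }
            let row = &slot.as_ref().unwrap().1;
            let out = &mut dst[dst_y * dst_width..(dst_y + 1) * dst_width];
            for (o, v) in out.iter_mut().zip(row.iter()) {
                *o += w * v; // accumulate the 5-tap vertical sum
            }
        }
    }
    dst
}
```

Whether this beats the full intermediate buffer in practice would need benchmarking, since it trades the allocation for on-demand row filtering and weaker row-level parallelism.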


✅ What's Good

  • Separable convolution is the right architectural choice for Gaussian blur
  • Safe-region optimization avoids per-pixel branch for boundary checks
  • reflect_101 handling is correctly isolated to border pixels
  • Rayon parallelization via par_chunks_mut on rows is clean
  • Small image fallback (src_height <= 2) is handled gracefully
  • Well-filled PR description and checklist

Verdict: Promising optimization but needs benchmarks (required per guidelines), bounds verification, and minor cleanup. The intermediate buffer allocation tradeoff should be measured.

@namanguptagit
Contributor Author


Will look into your comments and update the PR shortly.

@namanguptagit
Contributor Author

Hi, I have addressed your comments and updated the docs as well. Please review, @sidd-27.

Regarding Issue #2 and the memory allocation in pyrdown_f32:

I have documented the $O(W \times H)$ memory tradeoff directly in the pyrdown_f32 docstrings. I benchmarked the performance of this new separable convolution approach vs. the old fused sequential approach to verify if the memory tradeoff was worth it.

The new approach yields massive ~5x to 7x improvements in throughput for larger images. Given the impact, I think accepting the temporary allocation is worth it for now until we invest in a complex tiling implementation.
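For scale, plugging the largest benchmarked size into the review's dst_width × src_height × C × 4-byte buffer formula gives a modest peak allocation:

$$512 \times 896 \times 1 \times 4\,\text{B} \approx 1.75\,\text{MB} \ \ (\text{1ch}), \qquad 512 \times 896 \times 3 \times 4\,\text{B} \approx 5.25\,\text{MB} \ \ (\text{3ch})$$

The ~50MB concern from the review only applies at 4K-class resolutions.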

Here are the benchmark results run locally via criterion (cargo bench -p kornia-imgproc --bench bench_pyramid):

pyrdown_f32 (1 Channel)

| Image Size | Old Fused (Baseline) | New Separable (PR) | Time Reduction | Speedup (thrpt) |
|---|---|---|---|---|
| 256x224 | 145.48 µs | 82.02 µs | -43.1% | 1.77× / +75.9% |
| 512x448 | 552.12 µs | 121.34 µs | -78.0% | 4.56× / +356.5% |
| 1024x896 | 2.189 ms | 316.85 µs | -85.5% | 6.90× / +590.8% |

pyrdown_f32_3c (3 Channels)

| Image Size | Old Fused (Baseline) | New Separable (PR) | Time Reduction | Speedup (thrpt) |
|---|---|---|---|---|
| 256x224 | 257.65 µs | 119.66 µs | -53.6% | 2.15× / +115.8% |
| 512x448 | 889.65 µs | 213.88 µs | -75.9% | 4.15× / +316.5% |
| 1024x896 | 3.951 ms | 972.64 µs | -75.2% | 4.06× / +303.6% |

Here are the benchmark results run locally via Criterion (cargo bench -p kornia-imgproc --bench bench_pyramid) and Python OpenCV cv2.pyrDown:

pyrdown (1 Channel, f32)

| Image Size | Old Fused (Kornia) | New Separable (Kornia PR) | OpenCV (cv2.pyrDown) | Kornia PR vs Old |
|---|---|---|---|---|
| 256x224 | 145.48 µs | **82.02 µs** | 25.34 µs | -43.1% time |
| 512x448 | 552.12 µs | **121.34 µs** | 37.43 µs | -78.0% time |
| 1024x896 | 2.189 ms | **316.85 µs** | 96.00 µs | -85.5% time |

pyrdown_3c (3 Channels, f32)

| Image Size | Old Fused (Kornia) | New Separable (Kornia PR) | OpenCV (cv2.pyrDown) | Kornia PR vs Old |
|---|---|---|---|---|
| 256x224 | 257.65 µs | **119.66 µs** | 33.31 µs | -53.6% time |
| 512x448 | 889.65 µs | **213.88 µs** | 65.49 µs | -75.9% time |
| 1024x896 | 3.951 ms | **972.64 µs** | 323.19 µs | -75.2% time |

@namanguptagit namanguptagit requested a review from sidd-27 March 8, 2026 01:44
@namanguptagit
Contributor Author

@sidd-27 can you check the PR? I have addressed the issues.

@namanguptagit
Contributor Author

@sidd-27 can you review this one?

@namanguptagit
Contributor Author

@sidd-27 can you take a look at this one as well?

Comment on lines +291 to +300
```rust
/// # Implementation Details & Trade-offs
///
/// This function uses a **separable 2D convolution** (horizontal pass followed by vertical pass)
/// accelerated by Rayon. This provides massive speedups (e.g., 5-7× faster on 1024×896 images)
/// by reducing per-pixel operations from \(k^2\) to \(2k\).
///
/// However, this optimization requires a **full intermediate buffer** of size `dst_width × src_height × channels`.
/// For very large images (e.g., 4K RGB requiring ~50MB) in memory-constrained environments,
/// this temporary peak memory allocation should be taken into account.
///
```
Copy link
Copy Markdown
Collaborator


these comments feel very ai

Copy link
Copy Markdown
Contributor Author


Yes, I used AI to help generate those comments. I found it useful for creating clear, comprehensive documentation quickly. That said, if they feel too formal or off-brand for the project, I'm happy to revise them. Let me know what changes would help them feel more natural or better suited to the code.
@sidd-27

Comment on lines +402 to +404
```rust
// For a 5-tap kernel (radius 2), we need `src_center_x + 2 < src_width`.
// Since `src_center_x = dst_x * 2`, we want `dst_x * 2 + 2 <= src_width - 1`.
// Solving for `dst_x` (safe range) gives an exclusive upper bound of `(src_width - 1) / 2`.
```
Copy link
Copy Markdown
Collaborator


these too

@namanguptagit
Contributor Author

@sidd-27 Is there anything else you would like me to update? Let's wrap this up; it's been pending for a long time now.

@sidd-27
Collaborator

sidd-27 commented Apr 22, 2026

Edgar has been writing SIMD kernels for imgproc. I think we're moving towards those in general, as they are faster.

@edgarriba
Member

Check the latest. Any perf optimisation should come with benchmarks comparing the relative improvement against other libs, ideally with SIMD and NEON.

@namanguptagit
Contributor Author

> Check the latest. Any perf optimisation should come with benchmarks comparing the relative improvement against other libs, ideally with SIMD and NEON.

Got it, will look into it.

