
Feature/optimize pyrdown f32 with separable convolution and rayon parallelism#766

Open
namanguptagit wants to merge 6 commits into kornia:main from namanguptagit:feature/Optimize_pyrdown_f32_with_separable_convolution_and_Rayon_parallelism

Conversation

@namanguptagit
Contributor

📝 Description

This PR introduces a significant performance optimization to the f32 implementation of the pyrdown image pyramid operation (pyrdown_vertical_pass_f32). It accelerates the vertical convolution pass by bypassing expensive boundary reflection checks for the vast majority of the image.

⚠️ Issue Link Required: This PR must be linked to an approved and assigned issue. See Contributing Guide for details.
Fixes/Relates to: #763

Important:

  • Ensure you are assigned to the linked issue before submitting this PR
  • This PR should strictly implement what the linked issue describes
  • Do not include changes beyond the scope of the linked issue

🛠️ Changes Made

  • Optimized pyrdown_vertical_pass_f32 in kornia-imgproc/src/pyramid.rs by introducing a safe-row iteration fast path.
  • Refactored the vertical pass to split processing into two branches: a "safe" inner region that skips costly border reflection (reflect_101), and an edge region that falls back to it (a minimal sketch of this split follows after this list).
  • Retained rayon (par_chunks_mut) parallelism to ensure multi-threaded performance is maintained alongside the new inner loop optimization.
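For illustration, below is a minimal single-channel sketch of the safe/edge split. It is not the actual kornia-imgproc code: the `vertical_pass` signature is hypothetical, `reflect_101` follows OpenCV's BORDER_REFLECT_101 semantics, and only the [1, 4, 6, 4, 1]/16 kernel weights match the constants used in the PR.

```rust
use rayon::prelude::*;

// OpenCV-style BORDER_REFLECT_101 (gfedcb|abcdefgh|gfedcba), valid for any n >= 1.
fn reflect_101(i: isize, n: isize) -> usize {
    if n == 1 {
        return 0;
    }
    let period = 2 * (n - 1);
    let m = i.rem_euclid(period);
    (if m >= n { period - m } else { m }) as usize
}

// Vertical pass over a single-channel buffer of `width * src_height` f32s.
// Output rows whose 5-tap window is fully inside the source take the fast
// path with no reflection logic; edge rows fall back to `reflect_101`.
fn vertical_pass(src: &[f32], dst: &mut [f32], width: usize, src_height: usize) {
    const K: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];
    let (safe_start, safe_end) = if src_height > 2 {
        (1usize, (src_height - 1) / 2)
    } else {
        (0, 0) // tiny images: every row goes through the reflecting path
    };
    dst.par_chunks_mut(width).enumerate().for_each(|(dst_y, row)| {
        let y_c = dst_y * 2; // source center row for this output row
        let fast = dst_y >= safe_start && dst_y < safe_end;
        for (x, out) in row.iter_mut().enumerate() {
            let mut acc = 0.0f32;
            for (k, w) in K.iter().enumerate() {
                let y = if fast {
                    y_c + k - 2 // in bounds by construction of safe_start/safe_end
                } else {
                    reflect_101(y_c as isize + k as isize - 2, src_height as isize)
                };
                acc += w * src[y * width + x];
            }
            *out = acc;
        }
    });
}
```

In the real code the two regions are handled by separate loops; folding them into one `if fast` branch here just keeps the sketch short.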

🧪 How Was This Tested?

  • Unit Tests: Relying on existing pyrdown and pyramid testing suites to guarantee output correctness and verify that edge padding/reflection still matches expectations.
  • Manual Verification: Validated the optimization logic by ensuring the indices calculated in the inner "safe" boundary (safe_start_y to safe_end_y) never violate buffer dimensions.
  • Performance/Edge Cases: Handles very small images gracefully (e.g., src_height <= 2) by falling back entirely to the reflect_101 edge case logic, avoiding out-of-bounds index calculations.

🕵️ AI Usage Disclosure

Check one of the following:

  • 🟢 No AI used.
  • 🟡 AI-assisted: I used AI for boilerplate/refactoring but have manually reviewed and tested every line.
  • 🔴 AI-generated: (Note: These PRs may be subject to stricter scrutiny or immediate closure if the logic is not explained).

🚦 Checklist

  • I am assigned to the linked issue (required before PR submission)
  • The linked issue has been approved by a maintainer
  • This PR strictly implements what the linked issue describes (no scope creep)
  • I have performed a self-review of my code (no "ghost" variables or hallucinations).
  • My code follows the existing style guidelines of this project.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • (Optional) I have attached screenshots/recordings for UI changes.

💭 Additional Context

Skipping reflect_101 for the inner rows eliminates redundant boundary checks and modulo operations for the vast majority of pixels in the vertical pass, improving CPU cache utilization and instruction throughput.

@namanguptagit
Contributor Author


Benchmark Results: pyrdown_f32 Optimization

I've optimized the pyrdown_f32 (and pyrdown_f32_3c) functions in crates/kornia-imgproc/src/pyramid.rs by replacing the single-threaded $O(25)$ 5x5 convolution loop with separable horizontal and vertical passes using rayon data-parallelism.

(Note: The vertical pass was also updated to skip reflect_101 edge-checks for center rows for algorithmic consistency with the horizontal pass).
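For reference, this is the separability being exploited: the 5×5 pyrdown kernel (consistent with the 0.0625 / 0.25 / 0.375 weights in the code) factors as the outer product of a 1D binomial kernel, turning 25 multiply-adds per pixel into two passes of 5:

$$K = k\,k^{\top}, \qquad k = \tfrac{1}{16}\begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix}^{\top}$$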

Here are the results from running cargo bench -p kornia-imgproc --bench bench_pyramid:

3-Channel Images (pyrdown_f32_3c)

| Image Dimension | Previous Time | New Execution Time | Time Reduction | Throughput Increase |
|---|---|---|---|---|
| 256x224 | ~265 µs | 102 µs | -62.13% | +164.07% |
| 512x448 | ~1000 µs | 184 µs | -81.42% | +438.22% |
| 1024x896 | ~4000 µs | 504 µs | -87.28% | +686.34% |

1-Channel Images (pyrdown_f32)

| Image Dimension | Previous Time | New Execution Time | Time Reduction | Throughput Increase |
|---|---|---|---|---|
| 256x224 | ~83 µs | 82 µs | -2.14% | +2.18% |
| 512x448 | ~325 µs | 113 µs | -65.10% | +186.53% |
| 1024x896 | ~1300 µs | 226 µs | -82.63% | +475.78% |

@namanguptagit
Contributor Author

@sidd-27 can you review?

@sidd-27
Collaborator

sidd-27 commented Feb 26, 2026

Benchmarks, validation?

@namanguptagit
Contributor Author

Validation

  • Mathematical Equivalence: The optimization strictly preserves mathematical correctness by maintaining OpenCV's exact reflect_101 border mode near image boundaries.
  • Testing: Validated against the crate's comprehensive numerical unit tests (cargo test -p kornia-imgproc), which all pass successfully.

Benchmarks

We achieved massive multi-threaded performance enhancements by splitting the single-threaded $O(25)$ per-pixel Gaussian blur into two separable $O(5)$ passes (horizontal and vertical) and utilizing rayon data-parallelism across the image rows.

Here are the local benchmark results (cargo bench -p kornia-imgproc --bench bench_pyramid):

3-Channel Images (pyrdown_f32_3c)

| Dimension | Throughput Increase | Execution Time Reduction |
|---|---|---|
| 256x224 | +164.07% | -62.13% |
| 512x448 | +438.22% | -81.42% |
| 1024x896 | +686.34% | -87.28% |

1-Channel Images (pyrdown_f32)

| Dimension | Throughput Increase | Execution Time Reduction |
|---|---|---|
| 256x224 | +2.18% | -2.14% |
| 512x448 | +186.53% | -65.10% |
| 1024x896 | +475.78% | -82.63% |

The algorithmic reductions ($O(25) \to O(10)$) paired with multi-threading provide accelerating returns as resolution increases!

@sidd-27
Collaborator

sidd-27 commented Feb 26, 2026

Better documentation of what the function does would be helpful; you seem to have removed some comments.

@namanguptagit
Contributor Author

> Better documentation of what the function does would be helpful; you seem to have removed some comments.

sure let me do that

@sidd-27
Collaborator

sidd-27 commented Feb 26, 2026

Could you also benchmark against OpenCV? We'd like to achieve speeds close to or faster than that.

@namanguptagit
Contributor Author

> Could you also benchmark against OpenCV? We'd like to achieve speeds close to or faster than that.

sure

@namanguptagit
Contributor Author

namanguptagit commented Feb 26, 2026

Benchmarks vs OpenCV (and Before/After Optimization)

I ran local benchmarks comparing the pyrdown_f32 performance before the optimization, after the rayon separable convolution optimization, and against cv2.pyrDown.

1-Channel (pyrdown_f32)

| Dimension | OpenCV (ms) | kornia-rs Before (ms) | kornia-rs After (ms) | Speedup vs Before | Slowdown vs OpenCV |
|---|---|---|---|---|---|
| 256x224 | 0.0277 | 0.123 | 0.081 | 1.52x | 2.92x |
| 512x448 | 0.0341 | 0.487 | 0.120 | 4.05x | 3.51x |
| 1024x896 | 0.0813 | 1.949 | 0.249 | 7.82x | 3.06x |

3-Channel (pyrdown_f32_3c)

| Dimension | OpenCV (ms) | kornia-rs Before (ms) | kornia-rs After (ms) | Speedup vs Before | Slowdown vs OpenCV |
|---|---|---|---|---|---|
| 256x224 | 0.0297 | 0.257 | 0.110 | 2.33x | 3.70x |
| 512x448 | 0.0552 | 0.989 | 0.197 | 5.02x | 3.56x |
| 1024x896 | 0.1464 | 3.805 | 0.484 | 7.86x | 3.30x |

kornia-imgproc is now hovering around ~3x slower than OpenCV C++ across various sizes, which is an excellent starting point for pure Rust without unsafe SIMD intrinsics, and represents an almost 8x speedup over the previous nested loop implementation!
@sidd-27

@namanguptagit
Contributor Author

@sidd-27 if everything looks fine, can we merge this?

Collaborator

@sidd-27 left a comment


Review: PR #766 — Optimize pyrdown f32 with separable convolution and rayon parallelism

Good optimization direction — separating the 2D 5×5 convolution into horizontal + vertical 1D passes is textbook and should give meaningful speedups. The safe-region / boundary-region split to skip reflect_101 in the inner loop is also well-motivated. A few issues though:


🔴 Issues

1. No benchmarks included
Per project guidelines: "Benchmark before and after any change to a core algorithm using cargo bench". This is a performance PR — the entire justification is speed. Without benchmark numbers, reviewers can't verify the claimed improvement. Please include before/after results.

2. Extra memory allocation — full intermediate buffer

```rust
let mut buffer = vec![0.0f32; buffer_width * buffer_height * C];
```

The original did a single fused pass with no temporary allocation. This PR allocates a full dst_width × src_height × C buffer. For a 4K RGB image that's 1920 × 2160 × 3 × 4 bytes ≈ 50MB. While separable convolution is algorithmically faster, the allocation + extra memory pass could hurt for large images or memory-constrained environments. This tradeoff should be documented and benchmarked.

3. Potential underflow in safe region calculation (horizontal pass)

```rust
let safe_end = if src_width > 2 {
    (src_width - 1) / 2
} else {
    safe_start
};
```

When src_width = 3, safe_end = 1 which equals safe_start = 1, so the fast path processes zero pixels. When src_width = 4, safe_end = 1 — still no fast path. The fast path only kicks in at src_width >= 6. Is this intentional? For the kernel radius of 2, the safe region should start at dst_x = 1 (i.e., src_x = 2) and end before src_center_x + 2 >= src_width, so safe_end = (src_width - 2) / 2. The current formula (src_width - 1) / 2 may include one pixel too many on the right edge.

4. Vertical pass: y_c - 2 can underflow when dst_y = 0 enters safe path

```rust
if dst_y >= safe_start_y && dst_y < safe_end_y {
    let y_c = src_center_y as usize;  // dst_y * 2
    let off_m2 = (y_c - 2) * stride;  // when dst_y = 1, y_c = 2, so 0 — OK
```

With safe_start_y = 1, the minimum y_c = 2, so y_c - 2 = 0 — that's fine. But the vertical safe_end_y has the same off-by-one risk as horizontal: when dst_y = safe_end_y - 1, src_center_y + 2 must be < src_height. Currently safe_end_y = (src_height - 1) / 2, which means src_center_y = (src_height - 1) / 2 * 2 - 2 at max. For src_height = 5: safe_end_y = 2, max src_center_y = 2, src_center_y + 2 = 4 < 5 ✓. For src_height = 4: safe_end_y = 1 = safe_start_y — no fast path, OK. Seems correct but tight — add a comment explaining the bounds derivation. (A quick exhaustive check of these bounds is sketched after this list.)

5. Missing doc comments on new functions
pyrdown_horizontal_pass_f32, pyrdown_vertical_pass_f32, and process_pixel_pyrdown_checked_f32 have basic descriptions but are missing the required # Arguments, # Returns sections per project documentation guidelines. They're fn (private) not pub, so this is lower severity — but still good practice for complex image processing code.
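To make the bounds discussion in items 3 and 4 concrete, here is a hypothetical exhaustive check (not part of the PR) of the quoted exclusive-bound formula, under the convention that src_center = dst * 2. It passes for all small sizes, consistent with the "seems correct but tight" conclusion above:

```rust
#[test]
fn safe_range_never_reads_out_of_bounds() {
    for src_len in 3..=64usize {
        let safe_start = 1usize;
        let safe_end = (src_len - 1) / 2; // formula quoted above
        for d in safe_start..safe_end {
            let c = d * 2; // source center index
            // The 5-tap window [c - 2, c + 2] must stay inside [0, src_len).
            assert!(c >= 2 && c + 2 < src_len, "src_len={src_len}, dst={d}");
        }
    }
}
```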


🟡 Suggestions

6. Inner loop body is duplicated between safe and boundary paths (vertical pass)
The actual convolution computation is identical — only the row offset calculation differs. Extract the 5-point weighted sum into an #[inline(always)] helper to reduce code duplication:

```rust
#[inline(always)]
fn gaussian5(a: f32, b: f32, c: f32, d: f32, e: f32) -> f32 {
    0.0625 * a + 0.25 * b + 0.375 * c + 0.25 * d + 0.0625 * e
}
```

7. Magic numbers 0.0625, 0.25, 0.375 repeated throughout
Define kernel weights as named constants at the top:

```rust
const K_EDGE: f32 = 1.0 / 16.0;   // 0.0625
const K_NEAR: f32 = 4.0 / 16.0;   // 0.25
const K_CENTER: f32 = 6.0 / 16.0; // 0.375
```

8. #[inline(always)] on process_pixel_pyrdown_checked_f32
This is appropriate for a per-pixel helper. Good.

9. Consider processing the buffer in-place or using a smaller rolling buffer
Instead of allocating the full intermediate buffer, you could process rows in tiles or use a ring buffer of 5 rows for the vertical pass. This would reduce memory from O(W×H) to O(W×5). More complex but would address issue #2. (A rough sketch follows after this list.)

10. PR title says "separable convolution and rayon parallelism"
The rayon parallelism was already present conceptually (the original had sequential loops). The PR adds par_chunks_mut which is new. Good addition — but the title implies rayon was added when it was more "restructured to enable proper rayon usage."
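To make suggestion 9 concrete, here is a rough single-threaded sketch of the 5-row ring buffer; `pyrdown_ring` and `hpass` are hypothetical stand-ins, not kornia-imgproc APIs:

```rust
/// Sketch only: O(W × 5) intermediate storage instead of O(W × H).
/// `hpass(y)` stands in for the horizontal pass applied to source row `y`,
/// returning a row of `dst_width` filtered values.
fn pyrdown_ring<F: Fn(usize) -> Vec<f32>>(hpass: F, src_height: usize, dst_width: usize) -> Vec<f32> {
    const K: [f32; 5] = [0.0625, 0.25, 0.375, 0.25, 0.0625];
    let reflect = |i: isize| -> usize {
        let n = src_height as isize;
        if n == 1 { return 0; }
        let p = 2 * (n - 1);
        let m = i.rem_euclid(p);
        (if m >= n { p - m } else { m }) as usize
    };
    let dst_height = src_height.div_ceil(2);
    let mut dst = vec![0.0f32; dst_height * dst_width];
    // Ring of up to 5 horizontally filtered rows, keyed by source row % 5.
    let mut ring: Vec<Option<(usize, Vec<f32>)>> = vec![None; 5];
    for dst_y in 0..dst_height {
        let y_c = (dst_y * 2) as isize;
        for (k, w) in K.iter().enumerate() {
            let y = reflect(y_c + k as isize - 2);
            let slot = &mut ring[y % 5];
            if slot.as_ref().map(|(cached, _)| *cached) != Some(y) {
                *slot = Some((y, hpass(y))); // filter this source row on demand
            }
            let row = &slot.as_ref().unwrap().1;
            let out = &mut dst[dst_y * dst_width..(dst_y + 1) * dst_width];
            for (o, v) in out.iter_mut().zip(row.iter()) {
                *o += w * v; // accumulate the 5-tap vertical sum
            }
        }
    }
    dst
}
```

Whether this beats the full intermediate buffer in practice would need benchmarking, since it trades the allocation for on-demand row filtering and weaker row-level parallelism.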


✅ What's Good

  • Separable convolution is the right architectural choice for Gaussian blur
  • Safe-region optimization avoids per-pixel branch for boundary checks
  • reflect_101 handling is correctly isolated to border pixels
  • Rayon parallelization via par_chunks_mut on rows is clean
  • Small image fallback (src_height <= 2) is handled gracefully
  • Well-filled PR description and checklist

Verdict: Promising optimization but needs benchmarks (required per guidelines), bounds verification, and minor cleanup. The intermediate buffer allocation tradeoff should be measured.

@namanguptagit
Contributor Author


Will look into your comments and update the PR shortly.

@namanguptagit
Contributor Author

Hi, I have addressed your comments and updated the docs as well. Please review, @sidd-27.

Regarding Issue #2 and the memory allocation in pyrdown_f32:

I have documented the $O(W \times H)$ memory tradeoff directly in the pyrdown_f32 docstrings. I benchmarked the performance of this new separable convolution approach vs. the old fused sequential approach to verify if the memory tradeoff was worth it.

The new approach yields massive ~5x to 7x improvements in throughput for larger images. Given the impact, I think accepting the temporary allocation is worth it for now until we invest in a complex tiling implementation.
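For scale, plugging the largest benchmarked size into the review's dst_width × src_height × C × 4-byte buffer formula gives a modest peak allocation:

$$512 \times 896 \times 1 \times 4\,\text{B} \approx 1.75\,\text{MB} \ \ (\text{1ch}), \qquad 512 \times 896 \times 3 \times 4\,\text{B} \approx 5.25\,\text{MB} \ \ (\text{3ch})$$

The ~50MB concern from the review only applies at 4K-class resolutions.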

Here are the benchmark results run locally via criterion (cargo bench -p kornia-imgproc --bench bench_pyramid):

pyrdown_f32 (1 Channel)

| Image Size | Old Fused (Baseline) | New Separable (PR) | Time Reduction | Speedup (thrpt) |
|---|---|---|---|---|
| 256x224 | 145.48 µs | 82.02 µs | -43.1% | 1.77× / +75.9% |
| 512x448 | 552.12 µs | 121.34 µs | -78.0% | 4.56× / +356.5% |
| 1024x896 | 2.189 ms | 316.85 µs | -85.5% | 6.90× / +590.8% |

pyrdown_f32_3c (3 Channels)

| Image Size | Old Fused (Baseline) | New Separable (PR) | Time Reduction | Speedup (thrpt) |
|---|---|---|---|---|
| 256x224 | 257.65 µs | 119.66 µs | -53.6% | 2.15× / +115.8% |
| 512x448 | 889.65 µs | 213.88 µs | -75.9% | 4.15× / +316.5% |
| 1024x896 | 3.951 ms | 972.64 µs | -75.2% | 4.06× / +303.6% |

Here are the benchmark results run locally via Criterion (cargo bench -p kornia-imgproc --bench bench_pyramid) and Python OpenCV cv2.pyrDown:

pyrdown (1 Channel, f32)

| Image Size | Old Fused (Kornia) | New Separable (Kornia PR) | OpenCV (cv2.pyrDown) | Kornia PR vs Old |
|---|---|---|---|---|
| 256x224 | 145.48 µs | **82.02 µs** | 25.34 µs | -43.1% time |
| 512x448 | 552.12 µs | **121.34 µs** | 37.43 µs | -78.0% time |
| 1024x896 | 2.189 ms | **316.85 µs** | 96.00 µs | -85.5% time |

pyrdown_3c (3 Channels, f32)

| Image Size | Old Fused (Kornia) | New Separable (Kornia PR) | OpenCV (cv2.pyrDown) | Kornia PR vs Old |
|---|---|---|---|---|
| 256x224 | 257.65 µs | **119.66 µs** | 33.31 µs | -53.6% time |
| 512x448 | 889.65 µs | **213.88 µs** | 65.49 µs | -75.9% time |
| 1024x896 | 3.951 ms | **972.64 µs** | 323.19 µs | -75.2% time |

@namanguptagit namanguptagit requested a review from sidd-27 March 8, 2026 01:44
@namanguptagit
Contributor Author

@sidd-27 can you check the PR? I have addressed the issues.

@namanguptagit
Contributor Author

@sidd-27 can you review this one?

@namanguptagit
Contributor Author

@sidd-27 can you take a look at this one as well?

Comment on lines +291 to +300
```rust
/// # Implementation Details & Trade-offs
///
/// This function uses a **separable 2D convolution** (horizontal pass followed by vertical pass)
/// accelerated by Rayon. This provides massive speedups (e.g., 5-7× faster on 1024×896 images)
/// by reducing per-pixel operations from \(k^2\) to \(2k\).
///
/// However, this optimization requires a **full intermediate buffer** of size `dst_width × src_height × channels`.
/// For very large images (e.g., 4K RGB requiring ~50MB) in memory-constrained environments,
/// this temporary peak memory allocation should be taken into account.
///
```
Copy link
Copy Markdown
Collaborator


these comments feel very ai

Copy link
Copy Markdown
Contributor Author


Yes, I used AI to help generate those comments. I found it useful for creating clear, comprehensive documentation quickly. That said, if they feel too formal or off-brand for the project, I'm happy to revise them. Let me know what changes would help them feel more natural or better suited to the code.
@sidd-27

Comment on lines +402 to +404
```rust
// For a 5-tap kernel (radius 2), we need `src_center_x + 2 < src_width`.
// Since `src_center_x = dst_x * 2`, we want `dst_x * 2 + 2 <= src_width - 1`.
// Solving for `dst_x` (safe range) gives an exclusive upper bound of `(src_width - 1) / 2`.
```
Copy link
Copy Markdown
Collaborator


these too

@namanguptagit
Contributor Author

@sidd-27 Is there anything else you would like me to update? Let's wrap this up; it's been pending for a long time now.

@sidd-27
Collaborator

sidd-27 commented Apr 22, 2026

Edgar has been writing SIMD kernels for imgproc. I think we're moving towards those in general, as they are faster.

@edgarriba
Member

Check the latest. Any perf optimisation should come with benchmarks comparing the relative improvement against other libs, ideally with SIMD and NEON.

@namanguptagit
Contributor Author

> Check the latest. Any perf optimisation should come with benchmarks comparing the relative improvement against other libs, ideally with SIMD and NEON.

Got it, will look into it.

