[rlsw] Micro-optimizations, tighter pipeline and cleanup by Bigfoot71 · Pull Request #5673 · raysan5/raylib

Bigfoot71 · 2026-03-18T22:50:00Z

This PR focuses on micro-optimizations and cleanup work made possible by the previous refactor.

Improvements

Blend function generation: all blend factor combinations are now auto-generated, reducing indirect calls from two to one per invocation
Alpha channel analysis: textures are analyzed at load time so that alpha blending can be skipped when it is unnecessary; vertex colors are also taken into account. This could be extended to binary transparency in the future
Tighter attribute layout: the screen member has been removed; coord now carries data through every stage of the pipeline
Reduced interpolation cost: unused vertex attributes are no longer interpolated during triangle row iteration, depending on active states
Affine block interpolation: the span rasterizer now subdivides spans into blocks, reducing the number of reciprocal-W computations by roughly 8x per triangle span (the default block size is 16 px, configurable at compile time)
Texture fetch dispatch at load time: the per-pixel format switch has been replaced by a function pointer resolved at texture alloc, eliminating per-pixel dispatch overhead
SIMD reuse for texture fetch: existing uint8_t <-> float SIMD conversion paths are now reused across more texture formats
Optional color conversion LUT: an optional (enabled by default) 1KB lookup table accelerates uint8_t -> float color conversion as a non-SIMD fallback
Simplified draw calls: glDrawArrays and glDrawElements have been cleaned up and simplified
Improved state safety: better guards against inconsistent global states and unexpected state changes, with fewer redundant checks in the hot path
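
To illustrate the affine block interpolation item above, here is a minimal sketch, not the actual rlsw code. It assumes a single attribute `u` and hypothetical names (`SPAN_BLOCK_SIZE`, `span_u_blocks`). The perspective division is done only at block boundaries, and `u` is stepped linearly inside each block.

```c
#define SPAN_BLOCK_SIZE 16  /* hypothetical name; 16 px is the default quoted above */

/* Writes perspective-corrected u for each of `count` pixels of a span.
 * uw0/uw1: u/w at the span start/end, iw0/iw1: 1/w at the span start/end.
 * Pixel i samples the span at i/count, so the end value is exclusive. */
static void span_u_blocks(float *out, int count,
                          float uw0, float uw1, float iw0, float iw1)
{
    if (count <= 0) return;

    /* u/w and 1/w are linear in screen space, so they step by constants. */
    const float duw = (uw1 - uw0) / (float)count;
    const float diw = (iw1 - iw0) / (float)count;

    float u_left = uw0 / iw0;  /* exact u at the first block boundary */

    for (int x = 0; x < count; x += SPAN_BLOCK_SIZE) {
        int len = count - x;
        if (len > SPAN_BLOCK_SIZE) len = SPAN_BLOCK_SIZE;

        /* One division per block: exact u at the right block boundary. */
        const float end = (float)(x + len);
        const float u_right = (uw0 + duw * end) / (iw0 + diw * end);

        /* Inside the block, u is stepped linearly (the affine approximation). */
        const float du = (u_right - u_left) / (float)len;
        float u = u_left;
        for (int i = 0; i < len; i++) {
            out[x + i] = u;
            u += du;
        }
        u_left = u_right;
    }
}
```

In this sketch a 32-pixel span costs 3 divisions for `u` instead of one per pixel; values at block boundaries are exact and values in between are approximated.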

Plus various other minor adjustments and some code reorganization.
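
For the color conversion LUT mentioned in the list above, the idea can be sketched as follows. This is an illustration under assumptions, not the rlsw implementation: the names `u8_to_float_lut`, `init_u8_to_float_lut` and `rgba8_to_float` are hypothetical, and the 1 KB figure assumes a 4-byte `float` (256 entries x 4 bytes).

```c
#include <stdint.h>

/* 256 entries x 4-byte float = 1 KB (assuming sizeof(float) == 4). */
static float u8_to_float_lut[256];

/* Hypothetical init step: must run once before any lookup. */
static void init_u8_to_float_lut(void)
{
    for (int i = 0; i < 256; i++) {
        u8_to_float_lut[i] = (float)i / 255.0f;
    }
}

/* Converts one RGBA8 pixel to normalized floats with four table lookups,
 * replacing four int-to-float conversions and four multiplies/divides. */
static void rgba8_to_float(const uint8_t src[4], float dst[4])
{
    dst[0] = u8_to_float_lut[src[0]];
    dst[1] = u8_to_float_lut[src[1]];
    dst[2] = u8_to_float_lut[src[2]];
    dst[3] = u8_to_float_lut[src[3]];
}
```

Whether the table beats direct arithmetic depends on the target; it is most plausible as a fallback where no SIMD conversion path is available.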

Profiling results

There is no longer a single dominant bottleneck; costs are now fairly evenly distributed across the pipeline.

With these changes I can finally hit (on my machine) a stable 60 FPS in O2 without manual SIMD in models_first_person_maze (including in high overdraw areas).

With O3 + SSE2 this goes up to ~200 FPS in models_first_person_maze and ~2800 bunnies in textures_bunnymark before hitting 30 FPS.

What's next

The next meaningful architectural improvement in the current state, while preserving current capabilities, would be to accumulate vertices and render per scanline rather than in fully immediate mode. This would also open the door to parallelization, which would likely be the biggest remaining gain.

Beyond that, it may be worth exploring alternative rasterization methods alongside the current scanline approach. This would require some design thought but could open up better paths depending on the target hardware. A tile-based rasterizer in particular would also make lazy clearing more natural than it is now (in addition to the other benefits it would bring).

This adds a macro system that generates a function for each possible combination of blending factors, resulting in 11*11 = 121 functions. This allows for only one indirection and function call instead of the previous two (assuming the first call was inlined).
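
A reduced sketch of this kind of generation, for illustration only: it uses 4 factors (16 functions) instead of 11, and every name here (`color_t`, `FOR_EACH_PAIR`, `blend_table`, the `F_*` enum) is hypothetical and not taken from rlsw. An X-macro expands once to define the functions and once to fill a 2D dispatch table, so a blend is a single indirect call.

```c
typedef struct { float r, g, b, a; } color_t;
typedef void (*blend_fn)(color_t *d, const color_t *s);

typedef enum { F_ZERO, F_ONE, F_SRC_ALPHA, F_ONE_MINUS_SRC_ALPHA, F_COUNT } factor_t;

/* For a fixed source factor (SN, SF), visit every destination factor. */
#define FOR_EACH_DST(X, SN, SF)                  \
    X(SN, SF, ZERO, 0.0f)                        \
    X(SN, SF, ONE, 1.0f)                         \
    X(SN, SF, SRC_ALPHA, s->a)                   \
    X(SN, SF, ONE_MINUS_SRC_ALPHA, 1.0f - s->a)

/* Visit every (source, destination) factor pair. */
#define FOR_EACH_PAIR(X)                             \
    FOR_EACH_DST(X, ZERO, 0.0f)                      \
    FOR_EACH_DST(X, ONE, 1.0f)                       \
    FOR_EACH_DST(X, SRC_ALPHA, s->a)                 \
    FOR_EACH_DST(X, ONE_MINUS_SRC_ALPHA, 1.0f - s->a)

/* First expansion: one specialized function per pair. */
#define DEFINE_BLEND(SN, SF, DN, DF)                              \
    static void blend_##SN##_##DN(color_t *d, const color_t *s)   \
    {                                                             \
        const float sf = (SF), df = (DF);                         \
        d->r = s->r * sf + d->r * df;                             \
        d->g = s->g * sf + d->g * df;                             \
        d->b = s->b * sf + d->b * df;                             \
        d->a = s->a * sf + d->a * df;                             \
    }
FOR_EACH_PAIR(DEFINE_BLEND)

/* Second expansion: table indexed by [src factor][dst factor]. */
#define TABLE_ENTRY(SN, SF, DN, DF) [F_##SN][F_##DN] = blend_##SN##_##DN,
static const blend_fn blend_table[F_COUNT][F_COUNT] = { FOR_EACH_PAIR(TABLE_ENTRY) };
```

Selecting `blend_table[src][dst]` once when the blend state changes leaves one indirect call per blend, and the constant factors can be folded by the compiler inside each generated function.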

Bigfoot71 added 20 commits March 17, 2026 22:56

rename dispatch tables for consistency

f8f1988

change blend funcs validity check

af67e11

Simplifies the validation of blend functions. Can allow `SW_SRC_ALPHA_SATURATE` as dst factor, but hey

disables blending when it requires alpha and there is none

c3f8e67

review immediate rendering functions and attribute layout

2f465bf

prevent state changes during immediate record

d6dfe6a

reduce number of op for each vertex push + review primitive struct

f9dc728

simplified draw functions

1117b13

review sw_vertex_t

4ae0058

removes `float screen[2]`; each step stores the transformed coordinates in `float coord[4]`. This also simplifies vertex interpolation during triangle rasterization.

reduces unnecessary interpolation costs during triangle rasterization…

7915f46

… + cleanup

extends the simd color conversion to more cases

23efe6d

affine interpolation per blocks

a31fecf

long side check for each triangle line

0c6de5a

My mistake in a previous commit

style tweaks

594bd10

select the read function on texture load

7f2c9a0

This removes the per-pixel switch; it's slightly more efficient on my hardware, but probably a poor prediction. Should remain profitable or at worst the same
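
As an illustration of the approach described in this commit (not the actual rlsw code), the sketch below resolves a fetch function once at texture setup and calls it through a pointer per pixel. The types and names (`pixel_format_t`, `texture_t`, `texture_init`, the two formats) are hypothetical.

```c
#include <stdint.h>

typedef enum { FMT_GRAY8, FMT_RGBA8, FMT_COUNT } pixel_format_t;

typedef struct texture texture_t;
typedef void (*fetch_fn)(const texture_t *tex, int x, int y, float out[4]);

struct texture {
    const uint8_t *pixels;
    int width, height;
    fetch_fn fetch;  /* resolved once at setup, used for every pixel */
};

static void fetch_gray8(const texture_t *tex, int x, int y, float out[4])
{
    const float v = tex->pixels[y * tex->width + x] / 255.0f;
    out[0] = out[1] = out[2] = v;
    out[3] = 1.0f;  /* no alpha channel: treat as opaque */
}

static void fetch_rgba8(const texture_t *tex, int x, int y, float out[4])
{
    const uint8_t *p = &tex->pixels[4 * (y * tex->width + x)];
    for (int i = 0; i < 4; i++) out[i] = p[i] / 255.0f;
}

/* The format is inspected here, once, instead of in a per-pixel switch. */
static void texture_init(texture_t *tex, const uint8_t *pixels,
                         int width, int height, pixel_format_t fmt)
{
    static const fetch_fn table[FMT_COUNT] = {
        [FMT_GRAY8] = fetch_gray8,
        [FMT_RGBA8] = fetch_rgba8,
    };
    tex->pixels = pixels;
    tex->width = width;
    tex->height = height;
    tex->fetch = table[fmt];
}
```

The rasterizer then calls `tex->fetch(tex, x, y, out)` without inspecting the format again.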

use optional LUT for uint8_t -> float conversion

bdc2a6e

sets internal the number of vertices post-clipping and the epsilon cl…

6bacc7d

…ipping + a little cleanup

moves color conversion to math part

2803a8f

Merge remote-tracking branch 'origin/master' into rlsw-blend-mode

073b94f

prevents sampling if it's a depth texture that is bound

57d4ff4

Bigfoot71 wants to merge 20 commits into raysan5:master from Bigfoot71:rlsw-blend-mode
