[rlsw] Micro-optimizations, tighter pipeline and cleanup #5673

Open
Bigfoot71 wants to merge 20 commits into raysan5:master from Bigfoot71:rlsw-blend-mode

Bigfoot71 commented Mar 18, 2026

This PR focuses on micro-optimizations and cleanup work made possible by the previous refactor.

Improvements

  • Blend function generation: all blend factor combinations are now auto-generated, reducing indirect calls from two to one per invocation
  • Alpha channel analysis: textures are analyzed at load time so that alpha blending can be skipped when it is unnecessary; vertex colors are also taken into account (extensible to binary transparency in the future)
  • Tighter attribute layout: the screen member has been removed; coord now carries data through every stage of the pipeline
  • Reduced interpolation cost: unused vertex attributes are no longer interpolated during triangle row iteration, depending on active states
  • Affine block interpolation: the span rasterizer now subdivides spans into blocks, reducing the number of reciprocal-W computations by roughly 8x per triangle span (the block size defaults to 16px and is configurable at compile time)
  • Texture fetch dispatch at load time: the per-pixel format switch has been replaced by a function pointer resolved at texture alloc, eliminating per-pixel dispatch overhead
  • SIMD reuse for texture fetch: existing uint8_t <-> float SIMD conversion paths are now reused across more texture formats
  • Optional color conversion LUT: a 1KB lookup table (optional, enabled by default) accelerates uint8_t -> float color conversion as a non-SIMD fallback
  • Simplified draw calls: glDrawArrays and glDrawElements have been cleaned up and simplified
  • Improved state safety: better guards against inconsistent global states and unexpected state changes, with fewer redundant checks in the hot path
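As an illustration of the affine block interpolation item above, here is a minimal sketch, not the actual rlsw code. The function name `span_interp_blocks` and the `BLOCK_SIZE` macro are hypothetical. It assumes an attribute `u` whose `u/w` and `1/w` vary linearly in screen space: the perspective divide is done only at block boundaries, and the attribute is interpolated linearly inside each block.

```c
#include <stddef.h>

#define BLOCK_SIZE 16  /* assumed value, mirroring the 16px default mentioned above */

/* Fill out[0..count-1] with an approximately perspective-correct attribute.
 * u_over_w and inv_w are the values at the first pixel of the span;
 * d_u_over_w and d_inv_w are their per-pixel steps.
 * Returns the number of perspective divides performed. */
size_t span_interp_blocks(float *out, size_t count,
                          float u_over_w, float d_u_over_w,
                          float inv_w, float d_inv_w)
{
    if (count == 0) return 0;

    size_t divides = 1;
    float u_start = u_over_w / inv_w;  /* exact value at the span start */

    for (size_t x = 0; x < count; ) {
        size_t n = count - x;
        if (n > BLOCK_SIZE) n = BLOCK_SIZE;

        /* Exact value at the end of the block: one divide per block */
        float uw_end = u_over_w + d_u_over_w * (float)n;
        float iw_end = inv_w + d_inv_w * (float)n;
        float u_end = uw_end / iw_end;
        divides++;

        /* Affine (linear) interpolation inside the block */
        float step = (u_end - u_start) / (float)n;
        float u = u_start;
        for (size_t i = 0; i < n; i++) {
            out[x + i] = u;
            u += step;
        }

        x += n;
        u_over_w = uw_end;
        inv_w = iw_end;
        u_start = u_end;
    }
    return divides;
}
```

In this sketch a 32-pixel span costs 3 perspective divides instead of 32. The values are exact at block boundaries and approximate in between, which is the usual trade-off of this technique.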

Plus various other minor adjustments and some code reorganization.
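For reference, the color conversion LUT from the list above could look roughly like the following. This is a sketch under assumptions, with hypothetical names (`u8_to_float_lut`, `init_u8_to_float_lut`, `color_u8_to_float`); it assumes 4-byte floats, so that 256 entries give the 1KB size quoted above.

```c
#include <stdint.h>

/* 256 entries x 4 bytes per float = 1 KB (assuming 4-byte floats) */
static float u8_to_float_lut[256];

/* Must be called once before color_u8_to_float() */
void init_u8_to_float_lut(void)
{
    for (int i = 0; i < 256; i++) {
        u8_to_float_lut[i] = (float)i / 255.0f;
    }
}

/* Convert an RGBA8 color to normalized floats using four table lookups
 * instead of four int-to-float conversions and divisions */
void color_u8_to_float(float dst[4], const uint8_t src[4])
{
    for (int i = 0; i < 4; i++) {
        dst[i] = u8_to_float_lut[src[i]];
    }
}
```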

Profiling results

There is no longer a single dominant bottleneck; costs are now fairly evenly distributed across the pipeline.

With these changes I can finally hit a stable 60 FPS on my machine in O2 without manual SIMD in models_first_person_maze, including in high-overdraw areas.

With O3 + SSE2 this goes up to ~200 FPS in models_first_person_maze and ~2800 bunnies in textures_bunnymark before dropping to 30 FPS.

What's next

The next meaningful architectural improvement that preserves current capabilities would be to accumulate vertices and render per scanline rather than in fully immediate mode. This would also open the door to parallelization, which would likely be the biggest remaining gain.

Beyond that, it may be worth exploring alternative rasterization methods alongside the current scanline approach. This would require some design thought but could open up better paths depending on the target hardware. A tile-based rasterizer in particular would also make lazy clearing more natural than it is now (in addition to the other benefits it would bring).

This adds a macro system that generates one function for each possible combination of blending factors: 11 x 11 = 121 functions.
This allows a single indirection and function call instead of the previous two (assuming the first call was inlined).
Simplifies the validation of blend functions.
This can allow `SW_SRC_ALPHA_SATURATE` as a dst factor, but hey
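To illustrate the idea of generating one blend function per factor pair, here is a reduced sketch using X-macros. It is not the macro system from this PR: it covers only 4 factors (16 functions) instead of 11, and every name in it (`factor_t`, `DEFINE_BLEND`, `select_blend`, and so on) is hypothetical.

```c
typedef enum { F_ZERO, F_ONE, F_SRC_ALPHA, F_INV_SRC_ALPHA, F_COUNT } factor_t;
typedef void (*blend_fn)(float dst[4], const float src[4]);

/* Per-factor expressions: s = source color, d = destination color, i = channel.
 * Color-based factors (not shown here) would make use of d and i. */
#define FACTOR_F_ZERO(s, d, i)          0.0f
#define FACTOR_F_ONE(s, d, i)           1.0f
#define FACTOR_F_SRC_ALPHA(s, d, i)     ((s)[3])
#define FACTOR_F_INV_SRC_ALPHA(s, d, i) (1.0f - (s)[3])

/* Expand X once for every (source factor, destination factor) pair */
#define FOR_EACH_DST(X, SF) \
    X(SF, F_ZERO) X(SF, F_ONE) X(SF, F_SRC_ALPHA) X(SF, F_INV_SRC_ALPHA)
#define FOR_EACH_PAIR(X) \
    FOR_EACH_DST(X, F_ZERO) FOR_EACH_DST(X, F_ONE) \
    FOR_EACH_DST(X, F_SRC_ALPHA) FOR_EACH_DST(X, F_INV_SRC_ALPHA)

/* Generate one specialized function per pair; the factors are compile-time
 * constants inside each function, so no inner dispatch remains */
#define DEFINE_BLEND(SF, DF)                                         \
    static void blend_##SF##_##DF(float dst[4], const float src[4])  \
    {                                                                \
        for (int i = 0; i < 4; i++) {                                \
            dst[i] = src[i] * FACTOR_##SF(src, dst, i)               \
                   + dst[i] * FACTOR_##DF(src, dst, i);              \
        }                                                            \
    }
FOR_EACH_PAIR(DEFINE_BLEND)

/* Table indexed by [source factor][destination factor] */
#define TABLE_ENTRY(SF, DF) [SF][DF] = blend_##SF##_##DF,
static const blend_fn blend_table[F_COUNT][F_COUNT] = {
    FOR_EACH_PAIR(TABLE_ENTRY)
};

/* Resolve the function once when the blend state changes; drawing then
 * costs a single indirect call per blend */
blend_fn select_blend(factor_t sf, factor_t df)
{
    return blend_table[sf][df];
}
```

With this layout, validating a factor pair reduces to a range check on the two indices, which is one plausible reading of the simplified validation mentioned above.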
Removes `float screen[2]`; each stage now stores the transformed coordinates in `float coord[4]`.
This also simplifies vertex interpolation during triangle rasterization.
My mistake in a previous commit
This removes the per-pixel switch; it's slightly more efficient on my hardware, but probably a poor prediction
Should remain profitable or at worst the same
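The removal of the per-pixel format switch described above can be sketched as follows. This is an assumption-laden illustration rather than the rlsw implementation: the types `texture_t` and `pixel_format_t`, the two formats, and the function `texture_init` are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { FMT_GRAY8, FMT_RGBA8 } pixel_format_t;  /* hypothetical formats */

typedef struct texture texture_t;
typedef void (*fetch_fn)(float out[4], const texture_t *tex, int x, int y);

struct texture {
    const uint8_t *pixels;
    int width, height;
    fetch_fn fetch;  /* resolved once when the texture is set up */
};

static void fetch_gray8(float out[4], const texture_t *tex, int x, int y)
{
    float v = (float)tex->pixels[y * tex->width + x] / 255.0f;
    out[0] = out[1] = out[2] = v;
    out[3] = 1.0f;
}

static void fetch_rgba8(float out[4], const texture_t *tex, int x, int y)
{
    const uint8_t *p = &tex->pixels[4 * (y * tex->width + x)];
    for (int i = 0; i < 4; i++) out[i] = (float)p[i] / 255.0f;
}

/* The format switch runs once here instead of once per fetched pixel */
void texture_init(texture_t *tex, const uint8_t *pixels,
                  int width, int height, pixel_format_t format)
{
    tex->pixels = pixels;
    tex->width = width;
    tex->height = height;
    tex->fetch = NULL;  /* stays NULL for an unknown format */

    switch (format) {
    case FMT_GRAY8: tex->fetch = fetch_gray8; break;
    case FMT_RGBA8: tex->fetch = fetch_rgba8; break;
    }
}
```

The rasterizer then calls `tex->fetch(out, tex, x, y)` in its inner loop. Whether an indirect call beats a well-predicted switch depends on the hardware, which matches the caution expressed in the commit message.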