Perf optimizations by parlar · Pull Request #566 · ksahlin/strobealign

parlar · 2026-03-03T21:07:18Z

Hi!

I have another project where I wanted to use strobealign (for improving short-read SV calling), but once I sat down with it I could not help making a few performance optimizations as well. The optimizations were added with the help of Claude Code and have been tested on E. coli data. They should work fine, and likely better, on human data as well (bigger, more chromosomes, more reads), but it is probably best to test them there just to be sure.

Performance Optimizations

Benchmarked on real E. coli K-12 paired-end reads (DRR217225, ~6.8M read pairs, 150bp, 4 threads).

Summary

Stage	Baseline	Optimized	Improvement
Creating strobemers	8.27s	5.31s	-35.8%
Finding hits	7.79s	7.55s	-3.1%
Chaining	2.70s	2.64s	-2.2%
Extending & pairing	16.22s	15.70s	-3.2%
Total mapping	37.23s	33.45s	-10.2%
Wall clock	37.56s	33.75s	-10.1%
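
For reference, the Improvement column is consistent with the relative change in time against the baseline. A minimal sketch of that arithmetic, where the helper name percent_change is made up for illustration:

```cpp
#include <cmath>

// Relative change of `optimized` versus `baseline`, in percent, rounded to
// one decimal place. Negative values mean the optimized run took less time.
// Hypothetical helper, only to show how the table's last column is derived.
double percent_change(double baseline, double optimized) {
    double change = (optimized - baseline) / baseline * 100.0;
    return std::round(change * 10.0) / 10.0;
}
```

For the "Creating strobemers" row, percent_change(8.27, 5.31) gives -35.8.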

Changes

1. AVX2 Smith-Waterman alignment (extension -5.9%)

Added 256-bit AVX2 implementation of the striped Smith-Waterman kernel alongside the existing SSE2 code. Enabled at compile time with -DENABLE_AVX=ON. Doubles the SIMD width from 16 to 32 byte-lanes per vector.

Files: cpp/ext/ssw/ssw_avx2.c, cpp/ext/ssw/ssw_avx2.h, cpp/ext/ssw/ssw.c, CMakeLists.txt
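
To illustrate what the wider vectors buy, here is a minimal sketch of one byte-lane primitive of the kind a striped Smith-Waterman kernel relies on (saturating unsigned add). It is not the code from ssw_avx2.c; the function name is hypothetical, and the AVX2 path is only compiled when the compiler defines __AVX2__ (for example with -mavx2), otherwise the scalar loop handles everything.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Saturating add of two byte arrays: out[i] = min(255, a[i] + b[i]).
// Hypothetical helper; with AVX2 it processes 32 byte-lanes per instruction,
// where an SSE2 version would process 16.
void saturating_add_u8(const uint8_t* a, const uint8_t* b, uint8_t* out, size_t n) {
    size_t i = 0;
#if defined(__AVX2__)
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_adds_epu8(va, vb));
    }
#endif
    // Scalar tail (and full fallback when AVX2 is not enabled at compile time).
    for (; i < n; ++i) {
        unsigned sum = static_cast<unsigned>(a[i]) + b[i];
        out[i] = sum > 255 ? 255 : static_cast<uint8_t>(sum);
    }
}
```

Both paths produce the same output, which matches the expectation that SSE2 and AVX2 builds give identical mapping results.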

2. Software prefetching in hit finding (hit finding -2.4%)

Prefetch the next query strobe's hash bucket while processing the current one, hiding memory access latency. Larger gains are expected on the human genome, where the index does not fit in cache.

Files: cpp/index.hpp, cpp/hits.cpp
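
The idea can be sketched as follows, assuming GCC or Clang (which provide __builtin_prefetch). ToyIndex and sum_lookups are hypothetical stand-ins; strobealign's real index layout and find_all_hits() are different.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat table: `slots` must be non-empty with a power-of-two
// size, and a hash maps to a slot by masking. Not strobealign's index.
struct ToyIndex {
    std::vector<uint32_t> slots;
    size_t slot_of(uint64_t hash) const { return hash & (slots.size() - 1); }
};

// Looks up every query hash, prefetching the slot of the NEXT hash while
// the current one is being processed.
uint64_t sum_lookups(const ToyIndex& index, const std::vector<uint64_t>& query_hashes) {
    uint64_t total = 0;
    for (size_t i = 0; i < query_hashes.size(); ++i) {
        if (i + 1 < query_hashes.size()) {
            // Read prefetch (rw = 0) with low temporal locality (1).
            // This is only a hint and never changes the result.
            __builtin_prefetch(&index.slots[index.slot_of(query_hashes[i + 1])], 0, 1);
        }
        total += index.slots[index.slot_of(query_hashes[i])];
    }
    return total;
}
```

Because the prefetch is purely a hint, the function returns the same value with or without it; only the memory latency seen by the following iteration changes.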

3. Hardware POPCNT instruction (strobemer creation -8.7%)

Replaced std::bitset<64>::count() with __builtin_popcountll() and added -mpopcnt compile flag. The compiler was emitting software __popcountdi2 calls despite the CPU having hardware POPCNT support.

Files: cpp/randstrobes.cpp, CMakeLists.txt
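
A minimal sketch of the two variants side by side, assuming GCC or Clang for the builtin. The wrapper names are made up for illustration. Whether the builtin compiles to a single POPCNT instruction depends on the target flags (for example -mpopcnt).

```cpp
#include <bitset>
#include <cstdint>

// Portable variant: std::bitset counts the set bits.
int popcount_bitset(uint64_t x) {
    return static_cast<int>(std::bitset<64>(x).count());
}

// GCC/Clang builtin variant. With -mpopcnt (or a -march that implies it)
// the compiler can emit the hardware POPCNT instruction here.
int popcount_builtin(uint64_t x) {
    return __builtin_popcountll(x);
}
```

Both functions return the same count for every input, so the swap does not change results.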

4. Eliminate allocations in has_shared_substring()

Replaced std::string::substr() (heap allocation per iteration) with std::string_view::substr() (zero-copy). A bigger impact is expected on the human genome, where rescue alignment is triggered more frequently.

Files: cpp/aln.cpp
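
A sketch of the pattern, requiring C++17. The function shares_window below is hypothetical and does not reproduce the logic or signature of has_shared_substring(); it only shows a window being sliced with std::string_view::substr() so that no temporary std::string is created per iteration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns true if any length-k window of `a` also occurs in `b`.
// Hypothetical helper for illustration only.
bool shares_window(const std::string& a, const std::string& b, size_t k) {
    if (k == 0 || a.size() < k || b.size() < k) {
        return false;
    }
    std::string_view a_view(a);
    std::string_view b_view(b);
    for (size_t i = 0; i + k <= a_view.size(); ++i) {
        // string_view::substr returns a view into `a`: no copy, no allocation.
        std::string_view window = a_view.substr(i, k);
        if (b_view.find(window) != std::string_view::npos) {
            return true;
        }
    }
    return false;
}
```

The views must not outlive the strings they point into, which holds here because both strings are parameters that live for the whole call.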

5. Circular buffer for SyncmerIterator (strobemer creation -25.6%)

Replaced std::deque<uint64_t> with an inline fixed-capacity circular buffer (SmerBuffer). The deque was heap-allocating for a 5-element, 40-byte sliding window, which is massive overhead relative to the data size. The inline buffer keeps everything inside the iterator object, with no heap allocation.

Files: cpp/randstrobes.hpp
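
A minimal sketch of such a buffer, offering the subset of the std::deque interface a sliding window needs. The class name RingBuffer and its exact interface are assumptions; the SmerBuffer in the PR may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-capacity FIFO stored inline in a std::array (no heap allocation).
// Hypothetical sketch, not the PR's SmerBuffer.
template <size_t Capacity>
class RingBuffer {
public:
    bool empty() const { return count_ == 0; }
    size_t size() const { return count_; }

    uint64_t front() const {
        assert(count_ > 0);
        return data_[head_];
    }
    uint64_t back() const {
        assert(count_ > 0);
        return data_[(head_ + count_ - 1) % Capacity];
    }
    // Appends at the logical end; the caller must not exceed Capacity.
    void push_back(uint64_t value) {
        assert(count_ < Capacity);
        data_[(head_ + count_) % Capacity] = value;
        ++count_;
    }
    // Drops the oldest element by advancing the head index.
    void pop_front() {
        assert(count_ > 0);
        head_ = (head_ + 1) % Capacity;
        --count_;
    }

private:
    std::array<uint64_t, Capacity> data_{};
    size_t head_ = 0;   // index of the oldest element
    size_t count_ = 0;  // number of stored elements
};
```

With a capacity of 5, the storage is 40 bytes plus two indices, and pushing or popping only updates those indices.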

6. Pre-allocate QueryRandstrobe vectors

Added reserve(syncmers.size()) to avoid ~7 reallocations per read per strand during randstrobe generation.

Files: cpp/randstrobes.cpp
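
The effect of reserve() can be observed with a small experiment. The helper below is hypothetical and uses a plain std::vector<int> rather than the QueryRandstrobe vectors; it counts how often the capacity changes while appending n elements.

```cpp
#include <cstddef>
#include <vector>

// Counts capacity changes (reallocations) while appending n elements,
// with or without an up-front reserve(n). Illustration only.
size_t count_reallocations(size_t n, bool reserve_first) {
    std::vector<int> values;
    if (reserve_first) {
        values.reserve(n);
    }
    size_t reallocations = 0;
    size_t last_capacity = values.capacity();
    for (size_t i = 0; i < n; ++i) {
        values.push_back(static_cast<int>(i));
        if (values.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = values.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set, the count is always zero because the capacity is already at least n. Without it, the count depends on the standard library's growth factor.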

Build

# With AVX2 (recommended on x86-64 with AVX2 support):
cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_AVX=ON

# Without AVX2 (default, SSE2 only):
cmake -B build -DCMAKE_BUILD_TYPE=Release

Correctness

All 50 unit tests pass in both SSE2-only and AVX2 configurations
Mapping results are bit-identical between SSE2 and AVX2 builds
All changes are expected to be safe for human genome mapping (so far tested on E. coli data only)

ksahlin · 2026-03-05T06:50:00Z

Hi Pär!

Very interesting, and thanks for the contribution! We recently switched to a Rust implementation of strobealign. Perhaps a couple of these optimizations can also be done in the Rust implementation. I will have to talk to the team (@marcelm, @NicolasBuchin, @Itolstoganov) about this PR.

parlar added 5 commits March 3, 2026 20:13

cc0b4be	start state
6ef161e	avx2 for sw
744a861	Prefetching in find_all_hits()
2808a5f	hardware popcnt instructions
7c222d9	Circular buffer

parlar wants to merge 5 commits into ksahlin:main from parlar:perf-optimizations
