Open
Conversation
Owner
|
Hi Pär! Very interesting, and thanks for the contribution! We recently switched to a Rust implementation of strobealign. Perhaps a couple of these optimizations can also be done in the Rust implementation. I will have to talk to the team (@marcelm, @NicolasBuchin, @Itolstoganov ) about this PR. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi!
I actually have another project where I wanted to use strobealign (for improving short read sv calling) but when I sat with it I could not help myself not also making a few performance optimizations. The optimizations where added with the help of Claude Code and have been tested on E.coli data. They should work fine, likely better, on human data as well (bigger, more chromosomes, more reads) but perhaps best that they are tested just to be sure.
Performance Optimizations
Benchmarked on real E. coli K-12 paired-end reads (DRR217225, ~6.8M read pairs, 150bp, 4 threads).
Summary
Changes
1. AVX2 Smith-Waterman alignment (extension -5.9%)
Added 256-bit AVX2 implementation of the striped Smith-Waterman kernel alongside the existing SSE2 code. Enabled at compile time with
-DENABLE_AVX=ON. Doubles the SIMD width from 16 to 32 byte-lanes per vector.Files:
cpp/ext/ssw/ssw_avx2.c,cpp/ext/ssw/ssw_avx2.h,cpp/ext/ssw/ssw.c,CMakeLists.txt2. Software prefetching in hit finding (hit finding -2.4%)
Prefetch the next query strobe's hash bucket while processing the current one, hiding memory access latency. Expect larger gains on human genome where the index doesn't fit in cache.
Files:
cpp/index.hpp,cpp/hits.cpp3. Hardware POPCNT instruction (strobemer creation -8.7%)
Replaced
std::bitset<64>::count()with__builtin_popcountll()and added-mpopcntcompile flag. The compiler was emitting software__popcountdi2calls despite the CPU having hardware POPCNT support.Files:
cpp/randstrobes.cpp,CMakeLists.txt4. Eliminate allocations in has_shared_substring()
Replaced
std::string::substr()(heap allocation per iteration) withstd::string_view::substr()(zero-copy). Bigger impact expected on human genome where rescue alignment is triggered more frequently.Files:
cpp/aln.cpp5. Circular buffer for SyncmerIterator (strobemer creation -25.6%)
Replaced
std::deque<uint64_t>with an inline fixed-capacity circular buffer (SmerBuffer). The deque was heap-allocating for a 5-element, 40-byte sliding window — massive overhead relative to data size. The inline buffer keeps everything on the stack.Files:
cpp/randstrobes.hpp6. Pre-allocate QueryRandstrobe vectors
Added
reserve(syncmers.size())to avoid ~7 reallocations per read per strand during randstrobe generation.Files:
cpp/randstrobes.cppBuild
Correctness