Replace the quadratic paired-mapping search with linear-time sweep #565
Open
NicolasBuchin wants to merge 1 commit into main
Instead of an O(N*M) pairing, we do a ~O(N*log(N) + M*log(M)) sort so that pairing takes ~O(N+M).
Since sorting will be different for paired and single ends, the
sorting logic is extracted out of get_nams_by_chaining().
The previous paired-end mapping logic attempted to form pairs by testing all combinations of chains from read1 and read2, ordered by score, with a hard cap on the number of trials. This resulted in a worst-case runtime close to O(MAX_PAIRS²) and could not guarantee that the best pair of chains would be found.
This PR proposes to find pairs by splitting the chains of each read into forward and revcomp sets, sorting each set by ref_id and then ref_start in O(N*log(N)), and doing a ~O(N+M) two-pointer sweep over each forward/revcomp pair of sets. Credit to @ksahlin for the idea.
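To make the sort-then-sweep idea concrete, here is a minimal Python sketch. It is not the code from this PR: `Chain`, `sweep_pairs` and `max_dist` are hypothetical names, and it assumes that a valid pair has the revcomp chain on the same reference, at or downstream of the forward chain, within `max_dist`. The pairing rule is a simple greedy one, so it shows the O(N+M) pointer movement but does not reproduce the best-pair guarantee described in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    # Hypothetical minimal chain record, for illustration only.
    ref_id: int
    ref_start: int
    score: float


def sweep_pairs(forward, revcomp, max_dist):
    """Pair forward chains of one read with revcomp chains of the other."""

    def key(c):
        return (c.ref_id, c.ref_start)

    fwd = sorted(forward, key=key)  # O(N log N)
    rev = sorted(revcomp, key=key)  # O(M log M)
    pairs = []
    i = j = 0
    # Each iteration advances at least one pointer, so the loop is O(N + M).
    while i < len(fwd) and j < len(rev):
        f, r = fwd[i], rev[j]
        if key(r) < key(f):
            # r lies before f, so it also lies before every later forward chain.
            j += 1
        elif r.ref_id != f.ref_id or r.ref_start - f.ref_start > max_dist:
            # r is too far downstream; every later revcomp chain is even farther.
            i += 1
        else:
            pairs.append((f, r))
            i += 1
            j += 1
    return pairs
```

Under these assumptions the sweep would be called twice per read pair: once with the forward chains of read1 and the revcomp chains of read2, and once with the roles swapped.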
We can guarantee that the best-scoring pair is found and that each chain is used in only one pair, but not that all valid pairs of unique chains will be returned.
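One way to see how these three properties can hold together is a greedy selection over scored candidate pairs. The sketch below is an illustration only, not necessarily how this PR enforces the guarantees; `select_unique_pairs` and the `(score, index1, index2)` tuple layout are hypothetical.

```python
def select_unique_pairs(candidates):
    """Greedily keep the best-scoring pairs so that no chain is reused.

    candidates: list of (score, chain1_index, chain2_index) tuples.
    """
    used1, used2 = set(), set()
    selected = []
    for score, i, j in sorted(candidates, key=lambda c: c[0], reverse=True):
        if i in used1 or j in used2:
            continue  # one of the two chains already belongs to a better pair
        used1.add(i)
        used2.add(j)
        selected.append((score, i, j))
    return selected
```

With candidates `[(10, 0, 0), (9, 0, 1), (8, 1, 0)]`, only `(10, 0, 0)` is kept: the best pair always survives and no chain is reused, but the two other candidates, which do not share a chain with each other, are both dropped.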
For scoring pairs, we also introduce a bonus, similar to the one used in paired extension, that favors pairs which fit well in the distribution of known paired mappings.
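As a rough illustration of such a bonus, the sketch below adds a Gaussian-shaped term that peaks when the distance between the two chains matches the mean of the observed insert-size distribution. The functional form, the function name `pair_score` and the parameters `mu`, `sigma` and `max_bonus` are all assumptions made for this example; the formula actually used in the PR may differ.

```python
import math


def pair_score(score1, score2, distance, mu, sigma, max_bonus=1.0):
    """Combined score of two chains plus a bonus for a typical insert size.

    mu and sigma are assumed to be the running mean and standard deviation
    of the distances seen in previously accepted paired mappings.
    """
    z = (distance - mu) / sigma
    # The bonus equals max_bonus at distance == mu and decays toward 0.
    bonus = max_bonus * math.exp(-0.5 * z * z)
    return score1 + score2 + bonus
```

With this shape, two candidate pairs with equal chain scores are ranked by how typical their insert size is, while a clearly higher chain score can still outweigh the bonus.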
Since chains returned by get_nams_by_chaining() don't need to be sorted by score, we move the sorting logic out of this function.
Later, I plan to apply this new approach to paired extension as well, which still has O(N²) complexity in this PR, but only after some evaluation has been done and we agree to use it for pairing chains.
TODO: