Non-canonical syncmers#546

Draft
marcelm wants to merge 3 commits into main from noncanonical3

marcelm (Collaborator) commented Jan 30, 2026

This PR is more of a status update to show the accuracy of the non-canonical syncmers branch, which I have adjusted to work with the Rust code.

It doesn’t actually look as bad as I remembered:
ends.pdf

chrY shows the accuracy problem most clearly, so I investigated what is going on in sim-chrY-500.

The first observation is that we seem to be picking the wrong strand in many cases. sim0 doesn’t contain any reads that map to the reverse complement. If I modify strobealign to only pick the best NAM on the forward strand, the accuracy goes up quite a bit, see
ends.pdf.

In many cases, the correct NAM is generated and is even the top NAM among those that map to the forward strand. However, the NAMs in the reverse orientation "overwhelm" the correct candidate. As might be expected, this is because of unequal filtering.

Here’s an example using query simulated.1.

This is the (shortened) list of hits on the forward strand:

Query: simulated.1
we have 67 + 73 randstrobes
Found 67 hits (2 rescued, 46 filtered):
querypos count (p=partial, F=filtered)
     1    2869 F
     8    1453 F
    15    3923 F
    19    4300 F
    24    4334 F
    29    3771 F
    39    2983 F
    45    3044 F
    56    1756 F
    67    1278 F
    76    1450 F
    86    1218 F
    94    3804 F
    99     836  
   107    1312 F
   113      98  
   128     114 F
   135     546 F
   140      22
...

And these are the hits on the reverse-complemented strand:

Found 61 hits (0 rescued, 0 filtered):
querypos count (p=partial, F=filtered)
     1      12  
     5 p     8  
     9       4  
    13       8  
    18       2  
    23      12  
    62      12  
    72      13  
    77      15  
    85       1  
    89       1  
    93      18  
   102      12  
   111       1  
   117      18  
   121      15  
   125 p    15  
   138       1  
   146       2
...

That is, a large fraction of the hits on the forward strand (46 of 67) are repetitive and therefore filtered, whereas none of the 61 hits on the revcomp strand are filtered.
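To make the imbalance easy to quantify for other reads, here is a minimal Python sketch that counts filtered hits in a listing like the ones above. The parser and its name `filtered_fraction` are hypothetical and not part of strobealign. The line format (querypos, optional `p`, count, optional `F`) is assumed from the excerpts, and the sample input is only a subset of the lines shown.

```python
import re

# Hypothetical parser for the debug listing shown above. The assumed line
# format is: querypos, optional "p" (partial), count, optional "F" (filtered).
HIT_LINE = re.compile(r"^\s*(\d+)\s+(p\s+)?(\d+)\s*(F)?\s*$")


def filtered_fraction(listing: str) -> tuple[int, int]:
    """Return (filtered, total) over the hit lines found in a listing."""
    filtered = total = 0
    for line in listing.splitlines():
        match = HIT_LINE.match(line)
        if match is None:
            continue  # skip headers, blank lines and "..."
        total += 1
        if match.group(4):
            filtered += 1
    return filtered, total


# Subsets of the forward and revcomp listings for simulated.1
forward_excerpt = """
     1    2869 F
     8    1453 F
    99     836
   113      98
   128     114 F
   140      22
"""
revcomp_excerpt = """
     1      12
     5 p     8
     9       4
"""

print(filtered_fraction(forward_excerpt))  # (3, 6)
print(filtered_fraction(revcomp_excerpt))  # (0, 3)
```

Running this over the full listings for many reads would show whether the forward/revcomp imbalance seen for simulated.1 is typical.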

This is the resulting list of NAMs:

Found 13 NAMs (rescue done: 0)
- Nam(ref_id=0, query: 1..489, ref: 61401778..61402268, rc=1, score=424.24753)
- Nam(ref_id=0, query: 23..494, ref: 29218610..29219081, rc=1, score=338.92)
[ 10 NAMs with rc=1 omitted]
- Nam(ref_id=0, query: 99..466, ref: 54588241..54588608, rc=0, score=281.05)

The last NAM in the list (the only one with rc=0) is the correct one.
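The forward-only experiment mentioned above can be illustrated with a small Python sketch. This is not the actual strobealign code: the `Nam` class and `best_nam` function are hypothetical stand-ins that only mirror the fields printed in the debug output, and only the three NAMs shown in the list are included.

```python
from dataclasses import dataclass


@dataclass
class Nam:
    # Hypothetical stand-in mirroring the fields in the debug output above.
    query_start: int
    query_end: int
    ref_start: int
    ref_end: int
    is_rc: bool
    score: float


# The three NAMs that are shown explicitly for simulated.1
nams = [
    Nam(1, 489, 61401778, 61402268, True, 424.24753),
    Nam(23, 494, 29218610, 29219081, True, 338.92),
    Nam(99, 466, 54588241, 54588608, False, 281.05),
]


def best_nam(nams, forward_only=False):
    """Pick the highest-scoring NAM, optionally ignoring rc=1 candidates."""
    candidates = [n for n in nams if not (forward_only and n.is_rc)]
    return max(candidates, key=lambda n: n.score, default=None)


print(best_nam(nams).ref_start)                     # 61401778 (rc=1, wrong)
print(best_nam(nams, forward_only=True).ref_start)  # 54588241 (rc=0, correct)
```

Restricting to the forward strand only works here because the simulated reads are known to be forward; it is a diagnostic, not a fix.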

ksahlin (Owner) commented Jan 31, 2026

The first observation is that we seem to be picking the wrong strand in many cases. sim0 doesn’t contain any reads that map to the reverse complement. If I modify strobealign to only pick the best NAM on the forward strand, the accuracy goes up quite a bit, see

Yes, I believe it's explained by unequal filtering, where locations in RC have fewer filtered reads and therefore appear better. If we were to go with a non-canonical solution, we would probably have to consider the X best solutions in both orientations separately. In extension mode, we could align, but I am not sure how to compare the best between strands in map-only mode.
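As a rough illustration of keeping the X best candidates per orientation, here is a minimal Python sketch. The tuple layout `(score, is_rc, label)` and the function name `top_per_strand` are assumptions made for this example only. It deliberately stops at producing two separate lists, since how to compare them in map-only mode is the open question.

```python
import heapq


def top_per_strand(nams, x):
    """Return the x best-scoring candidates for each orientation separately.

    nams: iterable of (score, is_rc, label) tuples (hypothetical layout).
    """
    forward = heapq.nlargest(x, (n for n in nams if not n[1]), key=lambda n: n[0])
    reverse = heapq.nlargest(x, (n for n in nams if n[1]), key=lambda n: n[0])
    return forward, reverse


# Scores taken from the NAM list for simulated.1 above
nams = [
    (424.24753, True, "ref 61401778"),
    (338.92, True, "ref 29218610"),
    (281.05, False, "ref 54588241"),
]

forward, reverse = top_per_strand(nams, x=2)
print(forward)  # [(281.05, False, 'ref 54588241')]
print(reverse)  # [(424.24753, True, 'ref 61401778'), (338.92, True, 'ref 29218610')]
```

With this split, the correct forward candidate survives even though its score is lower than every rc=1 candidate. In extension mode both lists could then be passed on to alignment.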

I thought one of the benefits of non-canonical was fewer hits and thus a shorter runtime. This is not at all the case in the runtime plots. Perhaps that is because the implementation is not efficient (getting seeds in both orientations at the same time), but still, it looks quite a bit slower.

Non-canonical is appealing in theory, but has quite a bit left to iron out to be good in practice.

marcelm closed this Mar 11, 2026
marcelm deleted the noncanonical3 branch March 11, 2026 11:18
marcelm restored the noncanonical3 branch March 11, 2026 13:23
marcelm reopened this Mar 11, 2026