Conversation
02f880d to
ec02d7e
Compare
Yes, I believe it's explained by unequal filtering, where locations in RC have fewer filtered reads and therefore appear better. If we were to go with a non-canonical solution, we would probably have consider the X best solutions in both orientations separately. In extension mode, we could align, but I am not sure how to compare the best between strands in map-only mode. I thought one of the benefits of non-canonical was fewer hits and thus positive for runtime. This is not at all the case in the runtime plots - perhaps because it is not an efficient implementation (getting seeds in both orientations at the same time), but still, it looks quite a bit slower. Non-canonical is appealing in theory, but has quite a bit left to iron out to be good in practice. |
…nce" This partially reverts commit 0082508.
This PR is more of a status update to show the accuracy of the non-canonical syncmers branch, which I have adjusted to work with the Rust code.
It doesn’t actually look as bad as I remembered:
ends.pdf
chrY shows the accuracy problem most clearly, so I investigated what is going on in sim-chrY-500.
The first observation is that we seem to be picking the wrong strand in many cases. sim0 was doesn’t contain any reads that map to the reverse complement. If I modify strobealign to only pick the best NAM on the forward strand, the accuracy goes up quite a bit, see
ends.pdf.
In many cases, the correct NAM is generated and is even the top NAM among those that map to the forward strand. However, the NAMs in the reverse orientation "overwhelm" the correct candidate. As maybe expected, this is because of unequal filtering.
Here’s an example using query
simulated.1.This is the (shortened) list of hits on the forward strand:
And these are the hits on the reverse-complemented strand:
That is, a large fraction of the hits on the forward strand are repetitive and therefore filtered, and almost no hits on the revcomp strand are filtered.
This is the resulting list of NAMs:
The last NAM in the list (the only one with rc=0) is the correct one.