
Determine anchor orientation prior to chaining #574

Open
Itolstoganov wants to merge 20 commits into main from canonical-bits

Conversation

@Itolstoganov
Collaborator

This stores a bool in Syncmer that is true iff the syncmer is canonical. These canonicity bits are then included in the randstrobe hash and used to filter out hits whose orientation differs from the reference. The new randstrobe hash layout is:

<strobe 1 hash><strobe 1 canonicity bit><strobe 2 hash><strobe 2 canonicity bit>
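As a rough illustration of the layout above, here is a minimal C++ sketch of packing two strobe hashes and their canonicity bits into one 64-bit value. The field widths (31 hash bits plus 1 canonicity bit per strobe) and the names `StrobeKey`, `pack_randstrobe`, `strobe1_canonical` and `strobe2_canonical` are assumptions made for this example and are not taken from the PR.

```cpp
#include <cstdint>

// Hypothetical widths: 31 hash bits followed by 1 canonicity bit per strobe.
constexpr int kHashBits = 31;
constexpr std::uint64_t kHashMask = (std::uint64_t{1} << kHashBits) - 1;

// Hypothetical helper type: one strobe's hash plus its canonicity bit.
struct StrobeKey {
    std::uint64_t hash;
    bool canonical;
};

// Layout: <strobe 1 hash><strobe 1 bit><strobe 2 hash><strobe 2 bit>
constexpr std::uint64_t pack_randstrobe(StrobeKey s1, StrobeKey s2) {
    const std::uint64_t hi = ((s1.hash & kHashMask) << 1) | (s1.canonical ? 1u : 0u);
    const std::uint64_t lo = ((s2.hash & kHashMask) << 1) | (s2.canonical ? 1u : 0u);
    return (hi << (kHashBits + 1)) | lo;
}

// Read the canonicity bits back out of a packed value.
constexpr bool strobe1_canonical(std::uint64_t packed) {
    return ((packed >> (kHashBits + 1)) & 1u) != 0;
}

constexpr bool strobe2_canonical(std::uint64_t packed) {
    return (packed & 1u) != 0;
}
```

With a layout like this, two randstrobes that share the same strobe hashes but differ in a canonicity bit produce different packed values, which is what would let a lookup distinguish orientations. How the PR itself compares the bits is not shown here.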

Accuracy is the same as in main, and the runtime is also mostly the same (slightly faster for sim6).

ends.pdf

The main benefit is the removal of spurious anchors and the resulting small chains.

chain-stats.pdf

@Itolstoganov Itolstoganov requested a review from marcelm March 22, 2026 23:21
@ksahlin
Owner

ksahlin commented Mar 24, 2026

Very nice. Looks like it could solve problems with inconsistent NAMs (chains) downstream, similar to the non-canonical syncmers idea. This approach still generates the same syncmers, so chain scoring is more consistent between directions than with non-canonical seeds, which can differ in abundance and in quantity between strands.

About the commit: The description/code should be changed so that "canonical" is replaced with "forward" in most places. All the syncmers are canonical, but some are forward and some reverse w.r.t. the sequence. So the bit is really keeping whether the seed is forward or not.

@Itolstoganov
Collaborator Author

> About the commit: The description/code should be changed so that "canonical" is replaced with "forward" in most places. All the syncmers are canonical, but some are forward and some reverse w.r.t. the sequence. So the bit is really keeping whether the seed is forward or not.

Depends on how you look at it? My logic was that we only store forward syncmers from the sequence and hashes of their canonical versions, so all syncmers are forward and some of them are canonical. This seems more intuitive to me since the position field of the Syncmer always refers to the forward version.

This only concerns the syncmer code; in the context of the index, "forward"/"unoriented" is used instead of "canonical".
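To make the terminology in this exchange concrete, here is a small C++ sketch of the "forward position, canonical hash, canonicity bit" view described above. It works on plain strings rather than encoded k-mers and leaves out syncmer selection entirely. `SeedSketch`, `make_seed` and `reverse_complement` are hypothetical names for this example, and treating a palindromic k-mer as canonical is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Complement of a single nucleotide; anything unexpected maps to 'N'.
inline char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

inline std::string reverse_complement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}

// Hypothetical stand-in for a seed record: the position always refers to
// the forward k-mer, while the stored k-mer is its canonical version.
struct SeedSketch {
    std::string canonical_kmer;
    std::size_t position;
    bool is_canonical;  // true iff the forward k-mer is the canonical one
};

inline SeedSketch make_seed(const std::string& seq, std::size_t pos, std::size_t k) {
    const std::string fwd = seq.substr(pos, k);
    const std::string rc = reverse_complement(fwd);
    const bool is_canonical = fwd <= rc;  // assumption: ties count as canonical
    return {is_canonical ? fwd : rc, pos, is_canonical};
}
```

In this sketch, the k-mer `TTT` at position 3 of `ACGTTT` and the k-mer `AAA` at position 0 of its reverse complement share the canonical k-mer `AAA` but carry opposite bits. Whether one calls that bit "canonical" or "forward" is the naming question under discussion.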
