Adds one-hot and k-mer encoder for DNA sequences #1845

Merged
pavanvidem merged 20 commits into bgruening:master from anuprulez:dna_encoder
Apr 22, 2026

Conversation

@anuprulez
Contributor

No description provided.

Comment thread tools/sklearn/to_categorical.xml Outdated
--num_classes '$encoder_type.num_classes'
#end if
#else
--fasta_path '$encoder_type.fasta_path'
Collaborator

Maybe symlinking the FASTA file into the working directory before running the script would resolve the issue.
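
For context, the symlink suggestion could look roughly like this in the tool's `<command>` block. This is only a sketch: the link name `input.fasta` and the script name `to_categorical.py` are assumptions made for illustration, and all other arguments are omitted. Only `--fasta_path` and `$encoder_type.fasta_path` come from the diff above.

```xml
<command detect_errors="exit_code"><![CDATA[
    ## Link the dataset into the job working directory under a fixed name
    ## ('input.fasta' is a hypothetical name chosen for this sketch).
    ln -s '$encoder_type.fasta_path' input.fasta &&
    ## 'to_categorical.py' is an assumed script name; other arguments omitted.
    python '$__tool_directory__/to_categorical.py'
        --fasta_path input.fasta
]]></command>
```

The script then reads a stable, extension-bearing path in the working directory rather than the dataset path managed by Galaxy.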

Contributor Author

Thanks @pavanvidem, this suggestion fixed the issue.

However, a test related to a different tool fails:

[Screenshot: restart]

Can you restart the CI? Thanks!

Comment thread tools/sklearn/to_categorical.xml Outdated
</conditional>
</conditional>
<output name="outfile_matrix" file="dna_ohe_output.h5" ftype="h5" compare="sim_size" delta="32" />
<output name="outfile" file="dna_ohe_output.tabular" ftype="tabular" />
Collaborator

Can you please use asserts here? Also for the h5 output.
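
As a rough illustration of what assert-based checks might look like for these two outputs, here is a sketch using Galaxy's `assert_contents` test elements. The HDF5 key `data` and the line and column counts are placeholders, not values taken from this PR's test data.

```xml
<output name="outfile_matrix" ftype="h5">
    <assert_contents>
        <!-- 'data' is a hypothetical HDF5 dataset key, not taken from the PR -->
        <has_h5_keys keys="data" />
    </assert_contents>
</output>
<output name="outfile" ftype="tabular">
    <assert_contents>
        <!-- placeholder counts; replace with the real table dimensions -->
        <has_n_lines n="10" />
        <has_n_columns n="4" />
    </assert_contents>
</output>
```

Asserts like these check properties of the output rather than comparing against a stored expected file, so the corresponding test-data files may no longer be needed.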

Comment thread tools/sklearn/to_categorical.xml Outdated
</conditional>
</conditional>
<output name="outfile_matrix" file="dna_ohe_output_same_len.h5" ftype="h5" compare="sim_size" delta="32" />
<output name="outfile" file="dna_ohe_output_same_len.tabular" ftype="tabular" />
Collaborator

Same here. Please use asserts.

Comment thread tools/sklearn/to_categorical.xml Outdated
</conditional>
</conditional>
<output name="outfile" file="dna_kmer_output_k3.tabular" ftype="tabular" />
<output name="outfile_vocab" file="dna_kmer_vocab_k3.json" ftype="json" />
Collaborator

There are JSON asserts too :)
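
A JSON assert for the vocabulary output might look like the sketch below. It assumes the vocabulary maps k-mer strings to integer indices. The property `AAA` and the value `1` are hypothetical and not taken from this PR's test data.

```xml
<output name="outfile_vocab" ftype="json">
    <assert_contents>
        <!-- hypothetical entry: assumes the vocab maps k-mers to integer indices -->
        <has_json_property_with_value property="AAA" value="1" />
    </assert_contents>
</output>
```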

Comment thread tools/sklearn/to_categorical.xml Outdated
</conditional>
</conditional>
<output name="outfile" file="dna_kmer_sanitized_output_k3.tabular" ftype="tabular" />
<output name="outfile_vocab" file="dna_kmer_sanitized_vocab_k3.json" ftype="json" />
Collaborator

Same here.

@anuprulez closed this on Apr 21, 2026
@anuprulez reopened this on Apr 21, 2026
@pavanvidem
Collaborator

Can you please clean up the other output files that are not used in any test? Then we are ready to merge.

@anuprulez
Contributor Author

Can you please clean up the other output files that are not used in any test? Then we are ready to merge.

Done, thanks!

Collaborator

@pavanvidem left a comment

LGTM

@pavanvidem merged commit 16cb5bf into bgruening:master on Apr 22, 2026
14 checks passed
@anuprulez deleted the dna_encoder branch on April 22, 2026 13:25
2 participants