Hi Authors,
Thank you for releasing the code implementation — it’s very helpful for my research.
While reproducing the promoter design experiment from your paper “Dirichlet Flow Matching with Applications to DNA Sequence Design”, I noticed a potential inconsistency in the number of training samples used for your model versus the baselines.
Specifically, the baseline “Dirichlet Diffusion Score Model for Biological Sequence Generation (DDSM)” appears to have been trained on only 40,000 of the available 100,000 samples when producing the reported results. For example:
• Reported in the original paper: DDSM (time dilation 1x) → 0.0363
• Reproduced using 40,000 samples: DDSM (time dilation 1x) → 0.0380
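For reference, the numbers above are SP-MSE values. Here is a minimal sketch of how I compute SP-MSE, as I understand it from the paper: the mean squared error between the predicted regulatory activity of generated sequences and that of the matched original sequences (the paper uses the Sei model for the activity predictions; the arrays below are placeholder stand-ins, not real model outputs).

```python
import numpy as np

def sp_mse(pred_generated: np.ndarray, pred_original: np.ndarray) -> float:
    """Mean squared error between activity predictions for generated and
    original sequences (my reading of the SP-MSE metric)."""
    return float(np.mean((pred_generated - pred_original) ** 2))

# Placeholder data standing in for Sei activity predictions.
rng = np.random.default_rng(0)
orig = rng.normal(size=(8, 1024))                      # "original" activities
gen = orig + rng.normal(scale=0.2, size=orig.shape)    # "generated" activities
print(round(sp_mse(gen, orig), 4))
```

If I have the metric wrong, please correct me, since that could also explain part of the gap.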
Meanwhile, the main model from your paper appears to use all 100,000 samples. With the default script, I reproduced SP-MSE = 0.292. However, when I reduce the number of training samples to 40,000 (by changing the n_tsses parameter on lines 42–43 of train_promo.py), the SP-MSE becomes 0.0454, which is worse than the DDSM baseline reproduced above.
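To be concrete about the change I made, here is a hedged, self-contained sketch of the subsampling: capping the promoter training set at the first 40,000 TSS windows instead of all 100,000. The array below is a stand-in for the repo's actual TSS index list; the only assumption is that n_tsses truncates the dataset in this prefix-style way.

```python
import numpy as np

# Value I substituted for the default 100,000 (the n_tsses parameter
# on lines 42-43 of train_promo.py, as I read the code).
N_TSSES = 40_000

# Placeholder for the full promoter dataset's TSS index list.
all_tss_indices = np.arange(100_000)

# The truncation I believe n_tsses performs.
train_indices = all_tss_indices[:N_TSSES]
print(len(train_indices))
```

If n_tsses is interpreted differently in the dataset class (e.g., a random subsample rather than a prefix), that could change the comparison, so please let me know if this is the wrong way to shrink the training set.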
I was wondering:
• Have you noticed this discrepancy during your experiments?
• Is there anything I may have misunderstood in the code or setup?
• Are there additional steps needed to reproduce the same trends as reported in the paper?
Thank you again for your work and support!
Best regards,
YC