
Clarification on Training Data Size in Promoter Design Reproduction #8

@i-am-your-brother

Hi Authors,

Thank you for releasing the code implementation — it’s very helpful for my research.

While reproducing the promoter design experiment from your paper “Dirichlet Flow Matching with Applications to DNA Sequence Design”, I noticed a potential inconsistency in the number of data samples used for training the model and baselines.

Specifically, the baseline "Dirichlet Diffusion Score Model for Biological Sequence Generation" (DDSM) appears to have been trained on only 40,000 of the available 100,000 samples when generating the reported results. For example:
• Reported in the original paper: DDSM (time dilation 1x) → SP-MSE = 0.0363
• Reproduced with 40,000 samples: DDSM (time dilation 1x) → SP-MSE = 0.0380

Meanwhile, the main model from your paper seems to use all 100,000 data samples. Using the default script, I reproduced a result of SP-MSE = 0.292. However, when I reduce the number of training samples to 40,000 (by changing the n_tsses parameter in lines 42–43 of train_promo.py), I instead obtain SP-MSE = 0.0454, which is worse than the reproduced DDSM baseline (0.0380).
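For clarity, the effect of my change is equivalent to the following sketch: capping the number of training sequences at n_tsses before building the training set. The function and variable names here are illustrative placeholders, not the actual code from train_promo.py:

```python
# Illustrative sketch only: mimics lowering the n_tsses parameter in
# train_promo.py so that training sees 40,000 instead of 100,000 sequences.
# subsample_training_set and the stand-in dataset below are hypothetical.

def subsample_training_set(sequences, n_tsses):
    """Keep only the first n_tsses sequences from the full dataset."""
    return sequences[:n_tsses]

full_dataset = list(range(100_000))  # stand-in for the 100,000 promoter sequences
reduced = subsample_training_set(full_dataset, 40_000)
print(len(reduced))  # 40000
```

In other words, the only difference between my two runs is this sample cap; everything else in the default script is unchanged.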

I was wondering:
• Have you noticed this discrepancy during your experiments?
• Is there anything I may have misunderstood in the code or setup?
• Are there additional steps needed to reproduce the same trends as reported in the paper?

Thank you again for your work and support!

Best regards,

YC
