This repository contains source code, model weights, a ready-to-use Docker image, and additional data for the 2-step approach to transferable retention time prediction, described in the publication *Times are changing but order matters: Transferable prediction of small molecule liquid chromatography retention times* (see citation below).
Use predict.py to first predict retention order indices and then map these to retention times using anchor compounds.
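The second step, mapping predicted retention order indices (ROI) to retention times via anchors, can be pictured as a monotone fit through the anchors' (ROI, measured RT) pairs. The following is a conceptual sketch only, with made-up ROI values and simple piecewise-linear interpolation; predict.py implements the actual procedure described in the publication:

```python
import numpy as np

# Hypothetical predicted retention order indices (ROI) for the anchors,
# paired with their measured retention times from the input TSV.
anchor_roi = np.array([0.12, 0.45, 0.71, 0.93])    # placeholder values
anchor_rt = np.array([1.3, 6.9, 11.005, 30.285])   # measured RTs

# Sort anchors by ROI so the mapping is monotone.
order = np.argsort(anchor_roi)
anchor_roi, anchor_rt = anchor_roi[order], anchor_rt[order]

def roi_to_rt(roi):
    """Map a predicted ROI to a retention time by interpolating
    between the surrounding anchors."""
    return np.interp(roi, anchor_roi, anchor_rt)

print(roi_to_rt(0.6))  # RT estimate between the 2nd and 3rd anchor
```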
Two input files are needed:
- Structures with retention times for anchor compounds in TSV format:
```
smiles  rt
C1=CC(=C(C=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O)OC4C(C(C(C(O4)CO)O)O)O
C1=CC(=CC=C1C=CC(=O)OC(C(C(=O)O)O)C(=O)O)O
CCC(C)CN
C1=C(OC(=C1)C=O)CO
C1=CC=C(C(=C1)C(=O)O)N
COC1=C(C=CC(=C1)C=CC=O)O
CC(C)CCN  6.9
C1=CC=C2C(=C1)C(=CN2)CC(=O)C(=O)O  30.285
C(CC(=O)N)C(C(=O)O)N  1.3
C1=CC=C(C(=C1)C(=O)CC(C(=O)O)N)N  11.005
```

Rows that include a retention time serve as anchors; rows without one are the compounds whose retention times will be predicted.
- Information on the chromatographic setup in YAML format, similar to what is used in RepoRT:

```yaml
column:
  name: Waters ACQUITY UPLC HSS T3
  t0: 0.735
eluent:
  A:
    pH: 3
```

The column should have HSM and Tanaka parameters available. They can also be specified manually:

```yaml
column:
  H: 0.3
  S*: 0.2
  A: 0.4564
  B: 0.1232
  'C (pH 2.8)': -0.1
  'C (pH 7.0)': -0.5
  kPB: -1.2
  αCH2: 0.3
  αT/O: 0.5
  αC/P: 0.6
  αB/P: 0.1
  αB/P.1: -0.2
```

If provided, twice `t0` is used as the void threshold; anchors eluting before it will not be considered during mapping. Both input files can also be generated programmatically, as sketched below.
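Both input files are plain text and can be written from a script. A minimal sketch using pandas and PyYAML (file names and values are taken from the examples above; this is illustrative, not a required workflow):

```python
import pandas as pd
import yaml

# Anchor compounds (with measured RT) and compounds to predict (RT left empty).
rows = [
    ("CC(C)CCN", 6.9),              # anchor
    ("C(CC(=O)N)C(C(=O)O)N", 1.3),  # anchor
    ("CCC(C)CN", None),             # to predict
    ("C1=C(OC(=C1)C=O)CO", None),   # to predict
]
pd.DataFrame(rows, columns=["smiles", "rt"]).to_csv(
    "test/test_input.tsv", sep="\t", index=False)

# Chromatographic setup, mirroring the RepoRT-style metadata layout shown above.
metadata = {
    "column": {"name": "Waters ACQUITY UPLC HSS T3", "t0": 0.735},
    "eluent": {"A": {"pH": 3}},
}
with open("test/test_metadata.yaml", "w") as f:
    yaml.safe_dump(metadata, f, allow_unicode=True)
```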
Consider standardizing compound structures beforehand, e.g., using standardizeUtils.
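As an illustration of what such a cleanup involves, here is a rough RDKit-based substitute using rdMolStandardize (not necessarily equivalent to standardizeUtils):

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles: str) -> str:
    """Basic RDKit cleanup: sanitize, normalize, reionize, keep largest fragment."""
    mol = Chem.MolFromSmiles(smiles)
    mol = rdMolStandardize.Cleanup(mol)         # normalize + reionize + sanitize
    mol = rdMolStandardize.FragmentParent(mol)  # keep the largest fragment
    return Chem.MolToSmiles(mol)

print(standardize("CC(C)CCN"))
```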
To access HSM and Tanaka parameters, a copy of RepoRT needs to be available.
Use predict.py like this:
```bash
python predict.py --model models/2-step0525.pt --repo_root_folder <path to RepoRT> \
    --input_compounds test/test_input.tsv --input_metadata test/test_metadata.yaml \
    --out test/test_output.tsv
```

This should take about 10 seconds on a normal laptop without a GPU. For GPU support, add the flag `--gpu`.
With docker:
```bash
docker run -v $(pwd)/test:/app/test -v <path to RepoRT>:/RepoRT -it --rm ghcr.io/boecker-lab/2-step:latest \
    python predict.py --model models/2-step0525.pt --repo_root_folder /RepoRT \
    --input_compounds test/test_input.tsv --input_metadata test/test_metadata.yaml
```

An output with predicted retention times will be generated (full example output: `test/test_output.tsv`):
```
  smiles                                                                  rt_pred
0 C1=CC(=C(C=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O)OC4C(C(C(C(O4)CO)O)O)O  26.192242
1 C1=CC(=CC=C1C=CC(=O)OC(C(C(=O)O)O)C(=O)O)O                              11.60789
2 CCC(C)CN                                                                19.052101
3 C1=C(OC(=C1)C=O)CO                                                      20.10303
4 C1=CC=C(C(=C1)C(=O)O)N                                                  16.751137
5 COC1=C(C=CC(=C1)C=CC=O)O                                                24.9707
```
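The output is a plain TSV and can be read back directly, e.g., with pandas (a small sketch, assuming the leading index column shown above):

```python
import pandas as pd

# Read predictions; drop index_col=0 if your output has no index column.
preds = pd.read_csv("test/test_output.tsv", sep="\t", index_col=0)

# Compounds sorted by predicted elution order.
print(preds.sort_values("rt_pred"))
```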
> [!WARNING]
> These alternative models are not recommended; predictions will be worse!
There are two alternative models which do not use (all) chromatographic parameters and consequently result in lower performance:
- `models/2-step0525_nocolumn.pt`: does not require `column.name`, only pH
- `models/2-step0525_setupagnostic.pt`: does not require any chromatographic parameters
These models can be selected with the `--model` argument. Leaving out required metadata for the normal model leads to an error. Again, this is not the intended usage mode of 2-step: retention behavior depends strongly on the chromatographic system employed, so expect a substantial drop in performance.
The following dependencies are required:
- python=3.12
- chemprop=1.6.1
- pulp
- pytorch
- rdkit
- statsmodels
- tensorboard
- tqdm
- yaml
- numpy<2
- setuptools<82
A conda/mamba environment is provided (typical install time: 5 minutes):
```bash
mamba env create -n 2-step -f env.yaml
mamba activate 2-step
```

For GPU support, the `pytorch-cuda` package has to be added with the appropriate version, e.g., `pytorch-cuda=11.8`; see `env_cuda.yaml`.
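After installing the CUDA packages, a quick sanity check (independent of the 2-step scripts) that PyTorch actually sees the GPU:

```python
import torch

# True only with a CUDA-enabled PyTorch build and a visible GPU.
print(torch.cuda.is_available())
```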
A Dockerfile and a prebuilt container are provided as well. A GPU (or any other special hardware) is not required.
The code should work on any operating system and was tested under Arch Linux (kernel 6.6.72-1-lts) with the following package versions:
- python=3.12.8
- chemprop=1.6.1
- pulp=2.8.0
- pytorch=2.5.1
- rdkit=2024.09.3
- statsmodels=0.14.4
- tensorboard=2.18.0
- tqdm=4.67.1
- yaml=0.2.5
- numpy=1.26.4
A model can be trained on RepoRT datasets like this:

```bash
python train.py --input <IDs of RepoRT datasets> --epsilon 10s \
    --run_name 2-step --save_data \
    --batch_size 512 --epochs 10 --sysinfo --columns_use_hsm --columns_use_tanaka --use_ph \
    --repo_root_folder <path to RepoRT> --clean_data \
    --encoder_size 512 --sizes 256 64 --sizes_sys 256 256 \
    --pair_step 1 --pair_stop None --sample --sampling_count 500_000 --no_group_weights \
    --no_train_acc_all --no_train_acc \
    --mpn_no_residual_connections_encoder --no_standardize --mpn_no_sigmoid_roi
```

Add `--gpu` to enable training on GPU.
Model training creates three files:

- The model itself, `2-step.pt` (with the option `--ep_save`, files for every epoch are created: `2-step_ep1.pt` etc.)
- Processed training data, `2-step_data.pkl`
- A JSON file detailing the training configuration, `2-step_config.json` (see the snippet below)
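The configuration file is ordinary JSON and can be inspected directly (the exact keys depend on the training arguments used):

```python
import json

with open("2-step_config.json") as f:
    config = json.load(f)

# Pretty-print the recorded training configuration.
print(json.dumps(config, indent=2, sort_keys=True))
```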
To make the trained model ready for prediction, use the `repackage_model.py` script:

```bash
python repackage_model.py 2-step.pt 2-step_predready.pt
```

This combines all information required for prediction into one file.
A model trained on 171 manually curated reversed-phase datasets from RepoRT (version 94f43c1b) is
provided in the models subdirectory (models/2-step0525.pt).
A trained model can be evaluated on RepoRT datasets like this:

```bash
python evaluate.py --model <path to trained model> --test_sets <IDs of RepoRT datasets> \
    --repo_root_folder <path to RepoRT> --epsilon 10s
```

Again, add `--gpu` for GPU mode.
Splits for the evaluation of six benchmark datasets on retention time prediction are provided in the folder benchmark_splits.
F. Kretschmer, E.-M. Harrieder, M. Witting, and S. Böcker
Times are changing but order matters: Transferable prediction of small molecule liquid chromatography retention times
Preprint, ChemRxiv 2024-wd5j8, 2024. Version 3 August 2025.
https://doi.org/10.26434/chemrxiv-2024-wd5j8-v3