|
1 | | -# TDiMS (Topological Distance of intraMolecular Substructures) |
2 | | - |
3 | | -1. [Overview](#overview) |
4 | | -2. [System Requirements](#system-requirements) |
5 | | -3. [Setting Up the Development Environment](#setting-up-the-development-environment) |
6 | | -4. [Example Code](#example-code) |
7 | | - - [Retrieving Full Embeddings](#retrieving-full-embeddings) |
8 | | - - [Retrieving Feature-Selected Embeddings](#retrieving-feature-selected-embeddings) |
9 | | - - [Parameters](#parameters) |
10 | | - |
11 | | -## Overview |
12 | | -TDiMS is a novel molecular descriptor designed to capture non-local interactions of molecules. Unlike conventional descriptors that either focus solely on local features or struggle to effectively learn long-distance intramolecular interactions, TDiMS overcomes these limitations by effectively summarizing enumerated pairwise topological distances between molecular substructures. |
13 | | - |
14 | | -The `tdims` Python package provides molecular embeddings using the TDiMS algorithm. The `tdims_ext` module extends this functionality by offering embeddings with or without feature selection, as well as hyperparameter optimization. |
15 | | - |
16 | | -## System Requirements |
17 | | -This project requires the following libraries: |
18 | | - |
19 | | -- numpy>=1.26.1, <2.0.0 |
20 | | -- pandas>=1.5.3, <2.2.0 |
21 | | -- rdkit-pypi>=2022.9.4, <2023.9.6 |
22 | | -- scikit-learn>=1.2.0, <1.5.0 |
23 | | -- shap>=0.42.1, <0.43.0 |
24 | | -- matplotlib>=3.10.1, <3.11.0 |
25 | | - |
26 | | -## Setting Up the Development Environment |
27 | | -Follow these steps to set up the development environment: |
28 | | - |
29 | | -1. Create a new conda virtual environment named `tdims` |
30 | | -*(Typical Install Time: ~15 seconds)* |
31 | | - ```sh |
32 | | - conda create -n tdims python=3.10 |
33 | | - ``` |
34 | | - |
35 | | -2. Activate the virtual environment |
36 | | - ```sh |
37 | | - conda activate tdims |
38 | | - ``` |
39 | | - |
40 | | -3. Navigate to the project directory |
41 | | - ```sh |
42 | | - cd TDiMS |
43 | | - ``` |
44 | | - |
45 | | -4. Install the required libraries from `requirements.txt` |
46 | | -*(Typical Install Time: ~30 seconds)* |
47 | | - ```sh |
48 | | - pip install -r requirements.txt |
49 | | - ``` |
50 | | - |
51 | | -## Example Code |
52 | | -For basic operations, please refer to `example_notebook.ipynb`. For SHAP-based analysis used in the paper, please refer to `SHAP_analitics.ipynb`. |
53 | | - |
54 | | -> *Note: To run the example notebooks, please ensure you have a Jupyter environment such as [Jupyter Notebook](https://jupyter.org/) or [JupyterLab](https://jupyterlab.readthedocs.io/). These are not included in the default requirements and should be installed separately if needed.* |
55 | | -
|
56 | | -### Retrieving Full Embeddings |
57 | | -To generate full embeddings for a given list of SMILES strings: |
| 1 | +# TDiMS |
| 2 | + |
| 3 | +TDiMS (Topological Distance of intraMolecular Substructures) is a molecular descriptor designed to capture non-local intramolecular interactions. It does so by enumerating the pairwise topological distances between molecular substructures and summarizing them into a fixed set of features.
| 4 | + |
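As a rough illustration of what "topological distance" means here, the sketch below counts bonds along the shortest path between atoms in a tiny hand-written molecular graph. It is not the package's implementation: the function name `topological_distances` and the adjacency-dict input are assumptions made for this example only, and real substructures span more than one atom.

```python
from collections import deque


def topological_distances(adjacency, source):
    """Return shortest-path bond counts from `source` to every reachable atom.

    `adjacency` maps an atom index to the indices of its bonded neighbors.
    This is a plain breadth-first search, used here only to illustrate the idea.
    """
    distances = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in distances:
                distances[neighbor] = distances[atom] + 1
                queue.append(neighbor)
    return distances


# Heavy-atom graph of ethanol (SMILES "CCO"): C0 - C1 - O2
ethanol = {0: [1], 1: [0, 2], 2: [1]}
dist_from_c0 = topological_distances(ethanol, 0)
print(dist_from_c0)  # {0: 0, 1: 1, 2: 2}
```

The terminal carbon and the oxygen are two bonds apart, so a descriptor built on such distances can relate groups that are not directly bonded.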
| 5 | +This repository includes: |
| 6 | + |
| 7 | +- the TDiMS descriptor implementation |
| 8 | +- a minimal Python example |
| 9 | +- a Jupyter notebook example |
| 10 | +- experiment code for nested cross-validation used in the study |
| 11 | + |
| 12 | +## Repository structure |
| 13 | + |
| 14 | +```text |
| 15 | +tdims/
| 16 | +├── README.md
| 17 | +├── requirements.txt
| 18 | +├── requirements-notebook.txt
| 19 | +├── data/
| 20 | +│   └── cmpCl3_200.csv
| 21 | +├── examples/
| 22 | +│   ├── example_basic.py
| 23 | +│   └── example_notebook.ipynb
| 24 | +├── experiments/
| 25 | +│   └── run_nested_cv_experiment.py
| 26 | +└── src/
| 27 | +    └── tdims/
| 28 | +        ├── __init__.py
| 29 | +        ├── tdims_ext.py
| 30 | +        ├── load.py
| 31 | +        ├── sparse_transformers.py
| 32 | +        └── ChemGenerator/
| 33 | +            ├── __init__.py
| 34 | +            └── ChemGraph.py
| 35 | +``` |
| 36 | + |
| 37 | +## Installation |
| 38 | + |
| 39 | +### Minimal setup |
| 40 | + |
| 41 | +For the minimal descriptor code and Python example, install the required packages with: |
| 42 | + |
| 43 | +```bash |
| 44 | +pip install -r requirements.txt |
| 45 | +``` |
| 46 | + |
| 47 | +The minimal dependency set is intended for descriptor generation and core functionality. |
| 48 | + |
| 49 | +### Notebook setup |
| 50 | + |
| 51 | +The notebook example uses SHAP. Depending on the platform, installing SHAP with `pip` may trigger `numba` / `llvmlite` build issues. A more stable approach is to install SHAP and its low-level dependencies with conda-forge first, and then install the remaining notebook dependencies. |
| 52 | + |
| 53 | +```bash |
| 54 | +conda create -n NCS python=3.10 -y |
| 55 | +conda activate NCS |
| 56 | +conda install -c conda-forge numba llvmlite shap |
| 57 | +pip install -r requirements-notebook.txt |
| 58 | +``` |
| 59 | + |
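After setting up the environment, a quick check that the SHAP-related packages are importable can save a failed notebook run. The snippet below is a minimal sketch using only the standard library. The helper name `check_packages` is hypothetical and not part of this repository.

```python
from importlib.util import find_spec


def check_packages(names):
    """Map each top-level package name to True if it can be located, else False."""
    return {name: find_spec(name) is not None for name in names}


# Packages mentioned in the notebook setup above.
status = check_packages(["numba", "llvmlite", "shap"])
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```

`find_spec` only locates a package without importing it, so the check stays fast and does not trigger any compilation.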
| 60 | +## Quick start |
| 61 | + |
| 62 | +Run the basic example: |
| 63 | + |
| 64 | +```bash |
| 65 | +python examples/example_basic.py |
| 66 | +``` |
| 67 | + |
| 68 | +This script generates TDiMS descriptors for a small set of SMILES strings and prints: |
| 69 | + |
| 70 | +- the descriptor matrix shape |
| 71 | +- example feature names |
| 72 | + |
| 73 | +## Usage |
| 74 | + |
| 75 | +The main entry point for descriptor generation is `tdims_ext.get_representation()`. |
| 76 | + |
58 | 77 | ```python |
59 | | -emb, key_all = tdims_ext.get_representation(sm_list, radius=2, func_dis=-2, func_merge=max, fragment_set=True, atom_set=True, fingerprint_set=True) |
| 78 | +from tdims import tdims_ext |
| 79 | + |
| 80 | +sm_list = ["CCO", "c1ccccc1", "CC(=O)O"] |
| 81 | + |
| 82 | +emb, key_all = tdims_ext.get_representation(
| 83 | +    sm_list,
| 84 | +    radius=1,
| 85 | +    func_dis=-2,
| 86 | +    func_merge=max,
| 87 | +    fragment_set=False,
| 88 | +    atom_set=True,
| 89 | +    fingerprint_set=True,
| 90 | +    display=True,
| 91 | +)
60 | 92 | ``` |
61 | 93 |
|
62 | | -### Retrieving Feature-Selected Embeddings |
63 | | -To apply feature selection and retrieve optimized embeddings: |
| 94 | +To generate descriptors with feature selection, also pass the target values `y` (one value per SMILES string in `sm_list`):
| 95 | + |
64 | 96 | ```python |
65 | | -x_slc, key_slc, key_all, optimized_param = tdims_ext.get_representation_with_fs_selection(sm_list, y, reg_model="Lasso", radius=1, func_dis=-1, func_merge=sum, fragment_set=False, atom_set=True, fingerprint_set=True) |
| 97 | +x_slc, key_slc, key_all = tdims_ext.get_representation_with_fs_selection(
| 98 | +    sm_list,
| 99 | +    y,
| 100 | +    radius=1,
| 101 | +    func_dis=-2,
| 102 | +    func_merge=max,
| 103 | +    fragment_set=False,
| 104 | +    atom_set=True,
| 105 | +    fingerprint_set=True,
| 106 | +    display=True,
| 107 | +)
66 | 108 | ``` |
67 | 109 |
|
68 | | -### Parameters |
69 | | -- `reg_model`: Choose from "Lasso", "Ridge", "ElasticNet", "RandomForest" |
70 | | -- `radius`: Integer (≥1), sets the radius for MorganFingerprint substructures |
71 | | -- `func_dis`: Method for calculating feature values from topological distances (e.g., `-1` results in `x^(-1)`) |
72 | | -- `func_merge`: Aggregation method for identical pair distances at different locations (e.g., `sum`, `max`, `min`) |
73 | | -- `fragment_set`: Boolean, whether to include CEP fragments |
74 | | -- `fingerprint_set`: Boolean, whether to include circular fingerprints from Morgan Fingerprints |
75 | | -- `atom_set`: Boolean, whether to include heteroatoms |
| 110 | +## Using this repository in Jupyter Notebook |
| 111 | + |
| 112 | +If you are using the repository directly in a notebook without package installation, add `src` to `sys.path` before importing `tdims`: |
| 113 | + |
| 114 | +```python |
| 115 | +import sys |
| 116 | +from pathlib import Path |
| 117 | + |
| 118 | +src_path = (Path.cwd().resolve().parent / "src").resolve() |
| 119 | +if str(src_path) not in sys.path: |
| 120 | +    sys.path.insert(0, str(src_path))
| 121 | + |
| 122 | +from tdims import tdims_ext |
| 123 | +``` |
| 124 | + |
| 125 | +For a notebook-based example, see: |
| 126 | + |
| 127 | +- `examples/example_notebook.ipynb` |
| 128 | + |
| 129 | +## Main parameters |
| 130 | + |
| 131 | +Key arguments of `get_representation()` include: |
| 132 | + |
| 133 | +- `radius`: radius used for substructure extraction |
| 134 | +- `func_dis`: transformation applied to topological distance values |
| 135 | +- `func_merge`: aggregation function for repeated substructure-pair distances |
| 136 | +- `fragment_set`: whether fragment-based substructures are included |
| 137 | +- `fingerprint_set`: whether fingerprint-derived substructures are included |
| 138 | +- `atom_set`: whether atom-based substructures are included |
| 139 | +- `display`: if `True`, prints descriptor shape and elapsed time |
| 140 | + |
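To make `func_dis` and `func_merge` more concrete, here is a toy sketch of how one feature value could be derived for a single substructure pair. It assumes, following the previous README text, that an integer `func_dis` such as `-1` means the distance `x` is transformed to `x^(-1)`, and that `func_merge` aggregates the transformed values when the same pair occurs at several locations. The function `toy_pair_feature` and the example distances are hypothetical and are not taken from the `tdims` source.

```python
def toy_pair_feature(distances, func_dis=-2, func_merge=max):
    """Transform each distance d to d ** func_dis, then aggregate with func_merge.

    Illustrative only: the real package may handle edge cases
    (for example zero distances) differently.
    """
    transformed = [d ** func_dis for d in distances]
    return func_merge(transformed)


# Hypothetical case: one substructure pair found at distances 1, 2 and 4.
pair_distances = [1, 2, 4]

feature_max = toy_pair_feature(pair_distances, func_dis=-2, func_merge=max)
feature_sum = toy_pair_feature(pair_distances, func_dis=-1, func_merge=sum)

print(feature_max)  # 1.0
print(feature_sum)  # 1.75
```

With a negative exponent, closer pairs contribute larger values. `max` then keeps only the closest occurrence, while `sum` accumulates every occurrence.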
| 141 | +## Experiment code |
| 142 | + |
| 143 | +The main experiment script is: |
| 144 | + |
| 145 | +```bash |
| 146 | +python experiments/run_nested_cv_experiment.py |
| 147 | +``` |
| 148 | + |
| 149 | +This script runs the nested cross-validation experiments used in the study. It is separate from the minimal examples above and is provided for experiment-level reproduction. It has two run modes:
| 150 | + |
| 151 | +- `quick`: recommended for a first test run or lightweight debugging |
| 152 | +- `full`: used for the main journal-paper experiments |
| 153 | + |
| 154 | +## Notes |
| 155 | + |
| 156 | +- Internal imports inside `src/tdims/` are package-relative. |
| 157 | +- External usage from notebooks or scripts should use `from tdims import ...`. |
| 158 | +- If the repository is not package-installed, add `src/` to `sys.path` before import. |
| 159 | +- `requirements.txt` is intended for minimal descriptor usage. |
| 160 | +- `requirements-notebook.txt` is intended for the notebook example and related plotting dependencies. |
| 161 | + |
| 162 | +## Citation |
| 163 | + |
| 164 | +If you use this repository in academic work, please cite the corresponding paper. |
| 165 | + |
| 166 | +## License |
76 | 167 |
|
| 168 | +This code is distributed under the Apache License 2.0 as part of the `IBM/materials` repository. See the `LICENSE` file in the root of the repository for details. |