Commit dc01519

Merge pull request #69 from IBM/tdims-reorganize
Reorganize file structure
2 parents 2e6fd95 + 9a666a2 commit dc01519

16 files changed

Lines changed: 2511 additions & 770 deletions

models/.DS_Store

4 KB
Binary file not shown.
-6 KB
Binary file not shown.

models/tdims/README.md

Lines changed: 161 additions & 69 deletions
@@ -1,76 +1,168 @@
1-
# TDiMS (Topological Distance of intraMolecular Substructures)
2-
3-
1. [Overview](#overview)
4-
2. [System Requirements](#system-requirements)
5-
3. [Setting Up the Development Environment](#setting-up-the-development-environment)
6-
4. [Example Code](#example-code)
7-
- [Retrieving Full Embeddings](#retrieving-full-embeddings)
8-
- [Retrieving Feature-Selected Embeddings](#retrieving-feature-selected-embeddings)
9-
- [Parameters](#parameters)
10-
11-
## Overview
12-
TDiMS is a novel molecular descriptor designed to capture non-local interactions of molecules. Unlike conventional descriptors that either focus solely on local features or struggle to effectively learn long-distance intramolecular interactions, TDiMS overcomes these limitations by effectively summarizing enumerated pairwise topological distances between molecular substructures.
13-
14-
The `tdims` Python package provides molecular embeddings using the TDiMS algorithm. The `tdims_ext` module extends this functionality by offering embeddings with or without feature selection, as well as hyperparameter optimization.
15-
16-
## System Requirements
17-
This project requires the following libraries:
18-
19-
- numpy>=1.26.1, <2.0.0
20-
- pandas>=1.5.3, <2.2.0
21-
- rdkit-pypi>=2022.9.4, <2023.9.6
22-
- scikit-learn>=1.2.0, <1.5.0
23-
- shap>=0.42.1, <0.43.0
24-
- matplotlib>=3.10.1, <3.11.0
25-
26-
## Setting Up the Development Environment
27-
Follow these steps to set up the development environment:
28-
29-
1. Create a new conda virtual environment named `tdims`
30-
*(Typical Install Time: ~15 seconds)*
31-
```sh
32-
conda create -n tdims python=3.10
33-
```
34-
35-
2. Activate the virtual environment
36-
```sh
37-
conda activate tdims
38-
```
39-
40-
3. Navigate to the project directory
41-
```sh
42-
cd TDiMS
43-
```
44-
45-
4. Install the required libraries from `requirements.txt`
46-
*(Typical Install Time: ~30 seconds)*
47-
```sh
48-
pip install -r requirements.txt
49-
```
50-
51-
## Example Code
52-
For basic operations, please refer to `example_notebook.ipynb`. For SHAP-based analysis used in the paper, please refer to `SHAP_analitics.ipynb`.
53-
54-
> *Note: To run the example notebooks, please ensure you have a Jupyter environment such as [Jupyter Notebook](https://jupyter.org/) or [JupyterLab](https://jupyterlab.readthedocs.io/). These are not included in the default requirements and should be installed separately if needed.*
55-
56-
### Retrieving Full Embeddings
57-
To generate full embeddings for a given list of SMILES strings:
1+
# TDiMS
2+
3+
TDiMS (Topological Distance of intraMolecular Substructures) is a molecular descriptor designed to capture non-local intramolecular interactions. It does so by summarizing the enumerated pairwise topological distances between molecular substructures.
4+
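As a minimal illustration of what "topological distance" means at the atom level, the sketch below counts the bonds on the shortest path between atoms with a breadth-first search. The `topological_distances` helper and the hand-written adjacency lists are hypothetical and are not part of the `tdims` package; how TDiMS measures the distance between whole substructures is not shown here.

```python
from collections import deque


def topological_distances(adjacency, source):
    """Return the shortest-path bond count from `source` to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                # First visit in BFS order gives the shortest path length.
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


# Hand-written heavy-atom graph for ethanol (SMILES "CCO"): C0-C1-O2.
ethanol = {0: [1], 1: [0, 2], 2: [1]}
print(topological_distances(ethanol, 0))  # {0: 0, 1: 1, 2: 2}
```

In practice a cheminformatics toolkit would build the molecular graph from the SMILES string; the adjacency dictionary here only keeps the example self-contained.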
5+
This repository includes:
6+
7+
- the TDiMS descriptor implementation
8+
- a minimal Python example
9+
- a Jupyter notebook example
10+
- experiment code for nested cross-validation used in the study
11+
12+
## Repository structure
13+
14+
```text
15+
tdims/
16+
├── README.md
17+
├── requirements.txt
18+
├── requirements-notebook.txt
19+
├── data/
20+
│   └── cmpCl3_200.csv
21+
├── examples/
22+
│   ├── example_basic.py
23+
│   └── example_notebook.ipynb
24+
├── experiments/
25+
│   └── run_nested_cv_experiment.py
26+
└── src/
27+
    └── tdims/
28+
        ├── __init__.py
29+
        ├── tdims_ext.py
30+
        ├── load.py
31+
        ├── sparse_transformers.py
32+
        └── ChemGenerator/
33+
            ├── __init__.py
34+
            └── ChemGraph.py
35+
```
36+
37+
## Installation
38+
39+
### Minimal setup
40+
41+
For the minimal descriptor code and Python example, install the required packages with:
42+
43+
```bash
44+
pip install -r requirements.txt
45+
```
46+
47+
The minimal dependency set is intended for descriptor generation and core functionality.
48+
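After installation, a quick import check can confirm that the environment is usable. The sketch below is only an assumption-laden convenience: the module names are taken from the package list in the earlier README revision, and the actual contents of `requirements.txt` are not shown in this diff.

```python
from importlib.util import find_spec

# Import names assumed from the earlier README's requirement list;
# adjust them to match the actual requirements.txt.
ASSUMED_MODULES = ["numpy", "pandas", "rdkit", "sklearn"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


print("Missing:", missing_modules(ASSUMED_MODULES) or "none")
```

`find_spec` only locates a module without importing it, so the check is fast and has no side effects.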
49+
### Notebook setup
50+
51+
The notebook example uses SHAP. Depending on the platform, installing SHAP with `pip` may trigger `numba` / `llvmlite` build issues. A more stable approach is to install SHAP and its low-level dependencies with conda-forge first, and then install the remaining notebook dependencies.
52+
53+
```bash
54+
conda create -n NCS python=3.10 -y
55+
conda activate NCS
56+
conda install -c conda-forge numba llvmlite shap
57+
pip install -r requirements-notebook.txt
58+
```
59+
60+
## Quick start
61+
62+
Run the basic example:
63+
64+
```bash
65+
python examples/example_basic.py
66+
```
67+
68+
This script generates TDiMS descriptors for a small set of SMILES strings and prints:
69+
70+
- the descriptor matrix shape
71+
- example feature names
72+
73+
## Usage
74+
75+
The main entry point for descriptor generation is `tdims_ext.get_representation()`.
76+
58 77
```python
59-
emb, key_all = tdims_ext.get_representation(sm_list, radius=2, func_dis=-2, func_merge=max, fragment_set=True, atom_set=True, fingerprint_set=True)
78+
from tdims import tdims_ext
79+
80+
sm_list = ["CCO", "c1ccccc1", "CC(=O)O"]
81+
82+
emb, key_all = tdims_ext.get_representation(
83+
    sm_list,
84+
    radius=1,
85+
    func_dis=-2,
86+
    func_merge=max,
87+
    fragment_set=False,
88+
    atom_set=True,
89+
    fingerprint_set=True,
90+
    display=True,
91+
)
60 92
```
61 93

62-
### Retrieving Feature-Selected Embeddings
63-
To apply feature selection and retrieve optimized embeddings:
94+
To generate descriptors with feature selection:
95+
64 96
```python
65-
x_slc, key_slc, key_all, optimized_param = tdims_ext.get_representation_with_fs_selection(sm_list, y, reg_model="Lasso", radius=1, func_dis=-1, func_merge=sum, fragment_set=False, atom_set=True, fingerprint_set=True)
97+
x_slc, key_slc, key_all = tdims_ext.get_representation_with_fs_selection(
98+
    sm_list,
99+
    y,
100+
    radius=1,
101+
    func_dis=-2,
102+
    func_merge=max,
103+
    fragment_set=False,
104+
    atom_set=True,
105+
    fingerprint_set=True,
106+
    display=True,
107+
)
66 108
```
67 109

68-
### Parameters
69-
- `reg_model`: Choose from "Lasso", "Ridge", "ElasticNet", "RandomForest"
70-
- `radius`: Integer (≥1), sets the radius for MorganFingerprint substructures
71-
- `func_dis`: Method for calculating feature values from topological distances (e.g., `-1` results in `x^(-1)`)
72-
- `func_merge`: Aggregation method for identical pair distances at different locations (e.g., `sum`, `max`, `min`)
73-
- `fragment_set`: Boolean, whether to include CEP fragments
74-
- `fingerprint_set`: Boolean, whether to include circular fingerprints from Morgan Fingerprints
75-
- `atom_set`: Boolean, whether to include heteroatoms
110+
## Using this repository in Jupyter Notebook
111+
112+
If you are using the repository directly in a notebook without package installation, add `src` to `sys.path` before importing `tdims`:
113+
114+
```python
115+
import sys
116+
from pathlib import Path
117+
118+
src_path = (Path.cwd().resolve().parent / "src").resolve()
119+
if str(src_path) not in sys.path:
120+
    sys.path.insert(0, str(src_path))
121+
122+
from tdims import tdims_ext
123+
```
124+
125+
For a notebook-based example, see:
126+
127+
- `examples/example_notebook.ipynb`
128+
129+
## Main parameters
130+
131+
Key arguments of `get_representation()` include:
132+
133+
- `radius`: radius used for substructure extraction
134+
- `func_dis`: transformation applied to topological distance values
135+
- `func_merge`: aggregation function for repeated substructure-pair distances
136+
- `fragment_set`: whether fragment-based substructures are included
137+
- `fingerprint_set`: whether fingerprint-derived substructures are included
138+
- `atom_set`: whether atom-based substructures are included
139+
- `display`: if `True`, prints descriptor shape and elapsed time
140+
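To make `func_dis` and `func_merge` more concrete, here is a minimal sketch based on the earlier README wording, which describes `func_dis=-1` as `x^(-1)` and `func_merge` as the aggregation over identical pairs found at different locations. The `pair_feature` helper is hypothetical; the actual implementation in `tdims_ext` may differ.

```python
def pair_feature(distances, func_dis=-2, func_merge=max):
    """Hypothetical helper: transform each distance as d**func_dis, then merge.

    `distances` holds the topological distances at which the same
    substructure pair occurs within one molecule.
    """
    transformed = [float(d) ** func_dis for d in distances]
    return func_merge(transformed)


# The same substructure pair found at topological distances 2 and 4.
print(pair_feature([2, 4], func_dis=-2, func_merge=max))  # 0.25
print(pair_feature([2, 4], func_dis=-1, func_merge=sum))  # 0.75
```

With a negative exponent, closer pairs contribute larger values, so `max` keeps the nearest occurrence while `sum` accumulates all occurrences.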
141+
## Experiment code
142+
143+
The main experiment script is:
144+
145+
```bash
146+
python experiments/run_nested_cv_experiment.py
147+
```
148+
149+
This script is intended for nested cross-validation experiments used in the study. It is separate from the minimal examples above and is provided for experiment-level reproduction. Two run settings are named:
150+
151+
- `quick`: recommended for a first test run or lightweight debugging
152+
- `full`: used for the main journal-paper experiments
153+
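For readers unfamiliar with nested cross-validation, the sketch below shows only the index bookkeeping: an outer loop holds out a test fold for evaluation, and an inner loop splits the remaining samples again for model or hyperparameter selection. The `k_fold` and `nested_cv_splits` helpers are hypothetical and are not taken from `run_nested_cv_experiment.py`; shuffling and stratification are deliberately omitted.

```python
def k_fold(indices, k):
    """Yield (train, test) index lists for k strided folds of `indices`."""
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def nested_cv_splits(n_samples, outer_k=5, inner_k=3):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation."""
    for outer_train, outer_test in k_fold(list(range(n_samples)), outer_k):
        # Inner splits are drawn only from the outer training indices,
        # so the outer test fold never influences model selection.
        inner_splits = list(k_fold(outer_train, inner_k))
        yield outer_train, outer_test, inner_splits


for outer_train, outer_test, inner in nested_cv_splits(10, outer_k=5, inner_k=2):
    print("outer test:", outer_test, "inner validation folds:", [val for _, val in inner])
```

The key property is that each outer test fold is disjoint from every inner training and validation fold, which keeps the outer performance estimate unbiased by the selection step.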
154+
## Notes
155+
156+
- Internal imports inside `src/tdims/` are package-relative.
157+
- External usage from notebooks or scripts should use `from tdims import ...`.
158+
- If the repository is not package-installed, add `src/` to `sys.path` before import.
159+
- `requirements.txt` is intended for minimal descriptor usage.
160+
- `requirements-notebook.txt` is intended for the notebook example and related plotting dependencies.
161+
162+
## Citation
163+
164+
If you use this repository in academic work, please cite the corresponding paper.
165+
166+
## License
76 167

168+
This code is distributed under the Apache License 2.0 as part of the `IBM/materials` repository. See the `LICENSE` file in the root of the repository for details.
