psma is an open-source Python implementation of the probabilistic
surface of molecular activity (PSMA) workflow described in the paper
"A visual approach for analysis and inference of molecular activity spaces".
The package builds molecular similarity matrices, embeds compounds into a 2D reference space, estimates class-conditional activity surfaces, and scores posterior probabilities for projected test compounds.
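As a rough illustration of the last two steps, the sketch below estimates a posterior probability at a 2D point from class-conditional densities using Bayes' rule. It assumes isotropic Gaussian kernel density estimates and class priors taken from the training class frequencies; the function names, the `bandwidth` parameter, and the toy coordinates are hypothetical, and the package's own estimator may differ.

```python
import math


def gaussian_kde_2d(points, query, bandwidth=0.5):
    """Mean of isotropic 2D Gaussian kernels centred on `points`, evaluated at `query`."""
    if not points:
        raise ValueError("need at least one reference point")
    qx, qy = query
    norm = 1.0 / (2.0 * math.pi * bandwidth**2)
    total = 0.0
    for px, py in points:
        sq_dist = (qx - px) ** 2 + (qy - py) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth**2))
    return norm * total / len(points)


def posterior_active(active_xy, inactive_xy, query, bandwidth=0.5):
    """P(active | query) from class-conditional densities and frequency priors."""
    n_act, n_inact = len(active_xy), len(inactive_xy)
    prior_act = n_act / (n_act + n_inact)
    like_act = gaussian_kde_2d(active_xy, query, bandwidth)
    like_inact = gaussian_kde_2d(inactive_xy, query, bandwidth)
    evidence = prior_act * like_act + (1.0 - prior_act) * like_inact
    if evidence == 0.0:
        # Far from all training points: fall back to the prior (a choice made here).
        return prior_act
    return prior_act * like_act / evidence


# Toy embedded training compounds (hypothetical coordinates).
active = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3)]
inactive = [(2.0, 2.0), (2.2, 1.9), (1.8, 2.1)]

p_near_active = posterior_active(active, inactive, (0.1, 0.1))
p_near_inactive = posterior_active(active, inactive, (2.0, 2.0))
print(p_near_active, p_near_inactive)
```

A projected test compound that lands among the active training compounds gets a posterior close to 1, and one that lands among the inactive compounds gets a posterior close to 0.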
- RDKit Morgan Tanimoto, embedding cosine, and imported triples similarity backends
- random and Butina train/test splitting
- pure Python API returning typed result objects
- CLI entrypoint for reproducible runs
- static Matplotlib plots and optional interactive Bokeh plots
- Sphinx documentation with a notebook-based plotting tutorial
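To make the Tanimoto similarity backend more concrete, here is a minimal pure-Python sketch of a pairwise Tanimoto matrix over fingerprints represented as sets of on-bit indices. It does not use RDKit or the psma API; `tanimoto`, `similarity_matrix`, and the toy fingerprints are illustrative names and data only.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        # Both fingerprints are empty; returning 0.0 is a convention chosen here.
        return 0.0
    return len(a & b) / union


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto similarities."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = tanimoto(fingerprints[i], fingerprints[j])
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix


# Toy fingerprints (hypothetical on-bit indices, not real Morgan bits).
fps = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}]
matrix = similarity_matrix(fps)
for row in matrix:
    print([round(value, 3) for value in row])
```

The first two fingerprints share two of six distinct bits, so their similarity is 1/3; the third shares no bits with either and scores 0.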
The package currently supports Python >=3.11,<3.13.
Core install:
pip install psma
Optional extras:
pip install "psma[rdkit]"
pip install "psma[plotting]"
pip install "psma[docs]"
For local development, use Pixi:
pixi install
Run the CLI on a CSV with a binary endpoint and SMILES column:
pixi run psma run docs/_data/solubility_NCATS-sol.csv \
  --output-dir .tmp/ncats_sol_cli \
  --y-col low_solubility \
  --label-threshold 0.5 \
  --label-direction ge \
  --similarity-method rdkit_morgan_tanimoto \
  --smiles-col canonical_smiles \
  --split-method random
For Python use, call the pure computation API:
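import pandas as pd
from psma import compute_psma_surface
# Assumption: df is a pandas DataFrame, loaded here from the tutorial dataset.
df = pd.read_csv("docs/_data/solubility_NCATS-sol.csv")
result = compute_psma_surface(
    df,
    y_col="low_solubility",
    smiles_col="canonical_smiles",
    similarity_method="rdkit_morgan_tanimoto",
    label_threshold=0.5,
    label_direction="ge",
)
print(result.metrics.mcc)
Build the searchable HTML documentation locally: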
pixi run docs
open docs/_build/html/index.html
The documentation includes tutorials, how-to guides, explanations, and generated API reference pages.
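The CLI and API examples above pass a label threshold of 0.5 with direction `ge` and read an `mcc` metric from the result. The sketch below shows one plausible reading of those options: `ge` marks values greater than or equal to the threshold as the positive class, and `mcc` is the Matthews correlation coefficient. Both readings are assumptions, the `le` branch is hypothetical, and none of these helper names come from the psma API.

```python
import math


def binarize(values, threshold, direction="ge"):
    """Turn numeric values into 0/1 labels (assumed meaning of the label options)."""
    if direction == "ge":
        return [1 if value >= threshold else 0 for value in values]
    if direction == "le":  # hypothetical counterpart, not confirmed by the source
        return [1 if value <= threshold else 0 for value in values]
    raise ValueError(f"unsupported direction: {direction!r}")


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy endpoint values and toy posterior scores (made up for illustration).
y_true = binarize([0.0, 1.0, 1.0, 0.0, 1.0, 0.0], threshold=0.5, direction="ge")
y_pred = binarize([0.2, 0.9, 0.4, 0.1, 0.8, 0.7], threshold=0.5, direction="ge")
mcc = matthews_corrcoef(y_true, y_pred)
print(y_true, y_pred, round(mcc, 3))
```

With two true positives, two true negatives, one false positive, and one false negative, the coefficient is (4 - 1) / 9, or about 0.333.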
Common tasks:
pixi run lint
pixi run typecheck
pixi run test
pixi run docs
The project uses:
- ruff for formatting and linting
- pyright for type checking
- pytest for tests
- sphinx and myst-nb for documentation
The documentation tutorial uses the NCATS-sol dataset stored at
docs/_data/solubility_NCATS-sol.csv.
Source repository:
The dataset was downloaded from the already-preprocessed NCATS-sol data published by that repository. The upstream authors describe the preprocessing as follows:
- start from an original dataset containing 2,532 records
- drop one compound, represented by two rows, with inconsistent outcomes
- drop one duplicated row
- drop 76 compounds with inconclusive outcomes
- generate the low_solubility column from Analysis Comment, mapping the Low phenotype to positive class 1 and the Moderate/High phenotypes to negative class 0
- use RDKit to transform SMILES into canonical forms
The upstream NCATS-sol description states that the resulting dataset has 2,453 compounds and binary labels indicating whether each compound has low solubility.
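The label mapping and the record counts in the steps above can be checked with a short sketch. It assumes the phenotype values appear as the literal strings `Low`, `Moderate`, and `High`, and that each of the 76 inconclusive compounds occupies exactly one row; `label_from_phenotype` is a hypothetical helper, not part of the upstream preprocessing code.

```python
# Assumed literal phenotype values; the real Analysis Comment strings may differ.
PHENOTYPE_TO_LABEL = {"Low": 1, "Moderate": 0, "High": 0}


def label_from_phenotype(phenotype):
    """Map a solubility phenotype to the binary low_solubility label."""
    return PHENOTYPE_TO_LABEL[phenotype.strip()]


# Row counts taken from the preprocessing description.
original_records = 2532
dropped_rows = {
    "inconsistent outcomes (one compound, two rows)": 2,
    "duplicated row": 1,
    "inconclusive outcomes": 76,  # assumes one row per compound
}
remaining = original_records - sum(dropped_rows.values())

labels = [label_from_phenotype(p) for p in ["Low", "Moderate", "High"]]
print(labels, remaining)
```

Under these assumptions, the remaining count matches the 2,453 compounds reported upstream.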
Dataset reference:
H. Sun, P. Shah, K. Nguyen, K. R. Yu, E. Kerns, M. Kabir, Y. Wang, and X. Xu, Predictive models of aqueous solubility of organic compounds built on a large dataset of high integrity, Bioorganic & Medicinal Chemistry 27, 3110 (2019).
This package is distributed under the MIT License. See LICENSE.