KalenJosifovski/psma


psma

psma is an open-source Python implementation of the probabilistic surface of molecular activity (PSMA) workflow described in the paper "A visual approach for analysis and inference of molecular activity spaces".

The package builds molecular similarity matrices, embeds compounds into a 2D reference space, estimates class-conditional activity surfaces, and scores posterior probabilities for projected test compounds.
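To make the last two steps concrete, here is a minimal standard-library sketch of how class-conditional densities over 2D coordinates can be turned into a posterior probability with Bayes' rule. It is not the package's implementation: the function names, the isotropic Gaussian kernel, the fixed bandwidth, and the class-frequency priors are all assumptions chosen for illustration.

```python
import math


def gaussian_kde_2d(points, query, bandwidth):
    """Mean of isotropic 2D Gaussian kernels centred on `points`, evaluated at `query`."""
    qx, qy = query
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for px, py in points:
        sq_dist = (qx - px) ** 2 + (qy - py) ** 2
        total += norm * math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(points)


def posterior_active(actives, inactives, query, bandwidth=0.5):
    """P(active | query) from class-conditional densities and class-frequency priors."""
    n_act, n_inact = len(actives), len(inactives)
    prior_act = n_act / (n_act + n_inact)
    prior_inact = 1.0 - prior_act
    joint_act = prior_act * gaussian_kde_2d(actives, query, bandwidth)
    joint_inact = prior_inact * gaussian_kde_2d(inactives, query, bandwidth)
    evidence = joint_act + joint_inact
    if evidence == 0.0:
        # Both densities underflowed to zero far from the data: fall back to the prior.
        return prior_act
    return joint_act / evidence


# Hypothetical 2D coordinates of training compounds (illustrative values only).
actives = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3)]
inactives = [(2.0, 2.0), (2.2, 1.9), (1.8, 2.1)]

p_near_actives = posterior_active(actives, inactives, (0.1, 0.1))
p_near_inactives = posterior_active(actives, inactives, (2.0, 2.0))
print(p_near_actives, p_near_inactives)
```

A projected test compound that lands among the active training compounds receives a posterior close to 1, and one that lands among the inactive compounds receives a posterior close to 0.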

Features

  • RDKit Morgan Tanimoto, embedding cosine, and imported triples similarity backends
  • random and Butina train/test splitting
  • pure Python API returning typed result objects
  • CLI entrypoint for reproducible runs
  • static Matplotlib plots and optional interactive Bokeh plots
  • Sphinx documentation with a notebook-based plotting tutorial
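The two similarity measures named in the first feature can be illustrated with a short standard-library sketch. This is not how psma or RDKit computes them internally: fingerprints are represented here as plain sets of "on" bit indices, embeddings as lists of floats, and the function names are hypothetical.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: returning 0.0 is an arbitrary convention for this sketch.
        return 0.0
    return len(a & b) / union


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Cosine similarity is undefined for a zero vector; this sketch returns 0.0.
        return 0.0
    return dot / (norm_u * norm_v)


# Two fingerprints sharing 2 of 4 distinct bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
# Two orthogonal embeddings.
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

Applying either function to every pair of compounds yields a symmetric similarity matrix of the kind the package uses as input to the embedding step.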

Installation

The package currently supports Python 3.11 and 3.12 (>=3.11,<3.13).

Core install:

pip install psma

Optional extras:

pip install "psma[rdkit]"
pip install "psma[plotting]"
pip install "psma[docs]"

For local development, use Pixi:

pixi install

Quickstart

Run the CLI on a CSV file that contains a binary endpoint column and a SMILES column:

pixi run psma run docs/_data/solubility_NCATS-sol.csv \
  --output-dir .tmp/ncats_sol_cli \
  --y-col low_solubility \
  --label-threshold 0.5 \
  --label-direction ge \
  --similarity-method rdkit_morgan_tanimoto \
  --smiles-col canonical_smiles \
  --split-method random
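The `--label-threshold` and `--label-direction` options suggest that the endpoint column is compared against a threshold to produce binary labels. The sketch below shows one plausible reading, in which `ge` marks values greater than or equal to the threshold as the positive class. The `binarize` helper is hypothetical, and only `ge` appears in this README: the other direction names simply mirror Python's `operator` module and are assumptions.

```python
import operator

# Only "ge" is confirmed by the README; the other keys are assumed for illustration.
_COMPARATORS = {
    "ge": operator.ge,
    "gt": operator.gt,
    "le": operator.le,
    "lt": operator.lt,
}


def binarize(values, threshold, direction="ge"):
    """Return 1 where `value <direction> threshold` holds, otherwise 0."""
    compare = _COMPARATORS[direction]
    return [1 if compare(value, threshold) else 0 for value in values]


# An already-binary 0/1 column passes through a 0.5 threshold unchanged.
labels = binarize([0, 1, 1, 0], threshold=0.5, direction="ge")
print(labels)
```

Under this reading, a threshold of 0.5 with `ge` leaves an already-binary column such as `low_solubility` unchanged.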

For Python use, call the pure computation API:

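import pandas as pd

from psma import compute_psma_surface

# Load the example dataset (assumes the API accepts a pandas DataFrame).
df = pd.read_csv("docs/_data/solubility_NCATS-sol.csv")

result = compute_psma_surface(
    df,
    y_col="low_solubility",
    smiles_col="canonical_smiles",
    similarity_method="rdkit_morgan_tanimoto",
    label_threshold=0.5,
    label_direction="ge",
)

# Read a classification metric from the typed result object.
print(result.metrics.mcc)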
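Assuming `mcc` stands for the Matthews correlation coefficient (the README does not spell this out), the following standard-library sketch shows how that metric is computed from binary labels and binary predictions. The function name is hypothetical, and this is not necessarily how psma computes the value.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0.0:
        # Conventionally reported as 0 when any confusion-matrix margin is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(matthews_corrcoef([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect agreement
print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 1, 0]))  # chance-level agreement
```

The coefficient ranges from -1 (complete disagreement) through 0 (no better than chance) to 1 (perfect agreement), which makes it a reasonable summary for imbalanced binary endpoints.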

Documentation

Build the searchable HTML documentation locally:

pixi run docs
open docs/_build/html/index.html  # macOS; on Linux, use xdg-open instead

The documentation includes tutorials, how-to guides, explanations, and generated API reference pages.

Development

Common tasks:

pixi run lint
pixi run typecheck
pixi run test
pixi run docs

The project uses:

  • ruff for formatting and linting
  • pyright for type checking
  • pytest for tests
  • sphinx and myst-nb for documentation

Example Dataset

The documentation tutorial uses the NCATS-sol dataset stored at docs/_data/solubility_NCATS-sol.csv.

Source repository:

The dataset was downloaded from the already-preprocessed NCATS-sol data published by that repository. The upstream authors describe the preprocessing as follows:

  • start from an original dataset containing 2,532 records
  • drop one compound, represented by two rows, with inconsistent outcomes
  • drop one duplicated row
  • drop 76 compounds with inconclusive outcomes
  • generate the low_solubility column from the Analysis Comment field, mapping the Low phenotype to the positive class (1) and the Moderate and High phenotypes to the negative class (0)
  • use RDKit to transform SMILES into canonical forms

The upstream NCATS-sol description states that the resulting dataset has 2,453 compounds and binary labels indicating whether each compound has low solubility.
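The reported total can be checked against the preprocessing steps with simple arithmetic. The sketch below assumes that each of the 76 inconclusive compounds occupies exactly one row and that, once the duplicate is removed, every remaining row is a distinct compound. The variable names are illustrative.

```python
original_records = 2532
inconsistent_rows = 2    # one compound represented by two rows
duplicated_rows = 1      # one duplicated row
inconclusive_rows = 76   # assumed: one row per inconclusive compound

remaining = original_records - inconsistent_rows - duplicated_rows - inconclusive_rows
print(remaining)
```

Under these assumptions the result matches the 2,453 compounds stated in the upstream description.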

Dataset reference:

H. Sun, P. Shah, K. Nguyen, K. R. Yu, E. Kerns, M. Kabir, Y. Wang, and X. Xu, Predictive models of aqueous solubility of organic compounds built on a large dataset of high integrity, Bioorganic & Medicinal Chemistry 27, 3110 (2019).

License

This package is distributed under the MIT License. See LICENSE.
