This directory implements a multi-stage pipeline for constructing a low-homology evaluation dataset from PDB/mmCIF data. The goal is to:
- Filter PDB structures under strict structure-quality and size constraints.
- Define train vs test by release date and compute sequence / ligand similarities.
- Select low-homology chains and interfaces (protein / nucleic acid / ligand).
- Cluster entities, tag biologically meaningful subsets (e.g. antibodies, monomers, peptides).
- Prepare ground-truth CIFs and dataset statistics for downstream benchmarking.
- How to Run
- Dataset Construction Steps
  - Step 1 - Filter for RecentPDB
  - Step 2 - Make low-homology mapping file
  - Step 3 - Filter to low-homology subset
  - Step 4 - Cluster low-homology entities
  - Step 5 - Find monomer / homomer entries
  - Step 6 - Add subset labels
  - Step 7 - Copy ground-truth CIFs
  - Step 8 - Homology for low-homology subset
  - Step 9 - Dataset analysis
The entire dataset construction process is orchestrated by a single entry point:
benchmark/dataset_pipeline/run_pipeline.py
This script runs all stages (Step 1-9) in sequence, from raw mmCIF structures to the final low-homology dataset, cluster assignments, subset tags, and analysis reports.
Before running the pipeline, ensure the following prerequisites are completed:
- Set the environment variable `PXM_EVAL_DATA_ROOT_PATH`. This path serves as the root directory for all pipeline outputs and must be writable.
- Install Python dependencies (Python 3.11+ required): `pip install -r requirements.txt`
- Prepare mmCIF data: download all required mmCIF structures from the PDB and place them under the directory specified by `--mmcif_dir`.
- Install external tools: MMseqs2 must be installed and accessible from your system `PATH`.
Minimal example:
```shell
export PXM_EVAL_DATA_ROOT_PATH="/path/to/output"
python -m benchmark.dataset_pipeline.run_pipeline \
    --mmcif_dir /path/to/mmcif \
    --after_date 2024-01-01 \
    --before_date 2025-01-01 \
    --n_cpu 16
```

Key arguments:

- `--mmcif_dir`: path to the directory with all mmCIF files.
- `--after_date`, `--before_date`: date cutoffs defining train vs test, formatted as `YYYY-MM-DD`. Entries with `release_date < after_date` are treated as train; entries with `after_date <= release_date <= before_date` form the candidate test set and are filtered into the low-homology subset.
- `--n_cpu` (optional, default: machine-dependent): number of CPU cores to use for parallelised steps (structure parsing, symmetry checks, MMseqs2 preprocessing, etc.).
The pipeline outputs the following directory structure under PXM_EVAL_DATA_ROOT_PATH:
```
PXM_EVAL_DATA_ROOT_PATH
├── src_data
│   ├── ccd_to_similar_ccds.json
│   ├── pdb_meta_info.csv
│   ├── pdb_seqs.csv
│   ├── RecentPDB_chain_interface.csv
│   ├── RecentPDB_low_homology_entity_types_count.csv
│   ├── sabdab_summary_all.tsv
│   └── test_to_train_entity_homo.parquet
└── supported_data
    ├── mmcif
    ├── RecentPDB_low_homology_cluster_info.csv
    ├── RecentPDB_low_homology.csv
    ├── RecentPDB_low_homology_entity_homo.parquet
    └── stat_data
        ├── figs
        │   ├── ... (other png files)
        │   └── token_num_distribution.png
        ├── pdb_ids
        │   ├── all.txt
        │   ├── ... (other txt files)
        │   └── subset
        │       ├── ... (other txt files)
        │       └── antibody-protein.txt
        └── stat.txt
```
For a detailed description of all generated files and their schemas, please refer to the Dataset Output Files Reference.
At a high level, the pipeline proceeds as follows:

1. Filter recent PDB structures → `RecentPDB_chain_interface.csv` + metadata
2. Train-test homology scan → `test_to_train_entity_homo.parquet`
3. Select low-homology chains & interfaces → `RecentPDB_low_homology.csv`
4. Cluster low-homology entities → `RecentPDB_low_homology_cluster_info.csv`
5. Detect monomer / homomer entries → `RecentPDB_low_homology_entity_types_count.csv`
6. Add subset labels → update `RecentPDB_low_homology.csv` (`subset` column)
7. Prepare ground-truth CIFs → `true_dir`
8. Fine-grained homology for low-homology entities → `RecentPDB_low_homology_entity_homo.parquet`
9. Dataset statistics and plots → `stat.txt`, histogram PNGs, PDB ID lists
Below is a detailed description of each step (Steps 1-9).
Goal: Start from all PDB/mmCIF entries, apply strict quality and size filters, and extract chain / interface metadata for assemblies in a given date window.
The step applies the following filtering rules:
- Filter entries by `release_date` within `[after_date, before_date]`.
- Exclude any structure whose experimental methods include `"SOLID-STATE NMR"` or `"SOLUTION NMR"`.
- Keep only entries with `resolution < 4.5 Å` (if available).
- Use biological Assembly 1 coordinates.
- Strip hydrogens and waters.
- For crystallographic methods, remove crystallization aids (buffer / precipitant etc.).
- Discard entries with token count (sum over all chains) > 2560.
- For polymers, keep only DNA, RNA, and protein chains; drop entries that contain no such chains.
- Remove entries where any polymer entity has more than 20 chains.
- Remove polymer chains where all residues are unknown.
- Remove protein chains where any neighboring Cα-Cα distance exceeds 5 Å (chain breaks).
- Remove polymer chains where:
  - the number of resolved residues is < 4, or
  - the resolved residue fraction is < 0.3.
- Ligand chain filters (excluding glycans/ions):
  - PDB experimental method must be uniquely `"X-RAY DIFFRACTION"`.
  - Resolution ≤ 2.0 Å.
  - All atom `occupancy == 1.0` on that chain.
  - Exactly one residue in that ligand chain.
  - CCD formula weight in [100, 900] Da.
  - CCD element set is a subset of `{H, C, O, N, P, S, F, Cl}`.
  - At least 3 heavy (non-H) atoms.
  - No covalent bonds to other chains.
- Record remaining chain-level information:
  - For ligand/glycan/ion chains, only record those in the asymmetric unit.
- Compute and record interfaces: if any atom pair between two chains is within 5 Å (excluding H atoms), that pair is treated as an interface. Only interfaces with at least one chain in the asymmetric unit are kept; for ligand/glycan/ion interfaces, only interfaces where both chains are in the asymmetric unit are kept.
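The 5 Å atom-pair criterion above can be sketched as a pairwise distance check between two chains' heavy-atom coordinates. This is an illustrative minimal version (the actual implementation likely uses a spatial index for speed); `is_interface` is a hypothetical name, not a function from the pipeline:

```python
import numpy as np

def is_interface(coords_a, coords_b, cutoff=5.0):
    """Return True if any heavy-atom pair between two chains is within `cutoff` Å.

    coords_a / coords_b: (N, 3) arrays of non-hydrogen atom coordinates.
    """
    a = np.asarray(coords_a, dtype=float)[:, None, :]   # (Na, 1, 3)
    b = np.asarray(coords_b, dtype=float)[None, :, :]   # (1, Nb, 3)
    dist2 = ((a - b) ** 2).sum(axis=-1)                 # squared pairwise distances
    return bool((dist2 <= cutoff ** 2).any())

# Two toy chains whose closest atoms are 2.9 Å apart -> counts as an interface
chain_1 = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
chain_2 = [[4.4, 0.0, 0.0], [20.0, 0.0, 0.0]]
print(is_interface(chain_1, chain_2))  # True
```

For real assemblies a KD-tree (as used in the Step 3 symmetry-mate check) avoids the O(Na·Nb) memory of the dense distance matrix.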
The counts and some of these filters are later re-computed/validated via stat_filtered_num() in step9_analysis_dataset.py when summarising statistics for a given date window.
Script: benchmark/dataset_pipeline/step2_make_lowh_file.py
Goal: Given all entity sequences and their release dates, compute train-test similarity mappings:
- Train = entities with `release_date < after_date`
- Test = entities with `after_date <= release_date <= before_date`
The script:
1. Loads a sequence table (`pdb_seq_csv`) with columns including `entry_id`, `entity_id`, `entity_type` (PROTEIN, DNA, RNA, LIGAND), `seq`, `release_date`.
2. Splits by entity type and runs MMseqs2 sequence search via `calc_mmseqs_seq_identity()` (defined in `step2_make_lowh_file.py`, wrapping `mmseqs easy-search`):
   - For proteins: `--min-seq-id 0.4`, `-e 0.1`, `--max-seqs 500000`, `-s 7.5`.
   - For DNA/RNA: `--min-seq-id 0.8`, `-e 0.1`, `--max-seqs 500000`, `-s 7.5`, plus `--search-type 3` for nucleotide search.
3. For short sequences (`len(seq) < min_seq_length`), the helper first does an exact-identity match on the raw string and treats those as identity-1.0 hits, avoiding misbehaviour of MMseqs2 on very short segments.
4. For ligands (CCD codes), it constructs Morgan fingerprints (`radius=2`, `fpSize=2048`) and computes Tanimoto similarity between CCDs:
   - `gen_ccd_fp()` builds an RDKit molecule and fingerprint for each CCD entry from the in-memory CCD CIF blocks.
   - `get_ccd_similarity()` precomputes CCD-CCD pairs with Tanimoto similarity ≥ 0.6 and saves them as a JSON mapping `{ccd_code: [[similar_ccd, similarity], ...]}`.
   - `get_lowh_by_ccd_similarity()` maps test ligand entities to all train ligand entities whose CCDs are similar above this threshold.
5. Concatenates all entity-type-specific results into a single table with columns `query_id`, `db_id`, `similarity`, `aligned_res_num` (for sequences) or `similarity` (for CCDs).
6. Down-casts dtypes using `shrink_dataframe()` and writes a Parquet file (e.g. `test_to_train_entity_homo.parquet`).
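The short-sequence fallback in step 3 above can be sketched in a few lines. This is a hypothetical, simplified stand-in for the behaviour of `calc_mmseqs_seq_identity()` on short queries, not the pipeline's actual code:

```python
def exact_match_short_seqs(test_seqs, train_seqs, min_seq_length=10):
    """For queries shorter than `min_seq_length`, skip MMseqs2 and report exact
    string matches against the train set as identity-1.0 hits.

    test_seqs / train_seqs: dicts mapping "{entry_id}_{entity_id}" -> sequence.
    Returns rows shaped like the MMseqs2 output:
    (query_id, db_id, similarity, aligned_res_num).
    """
    # Index train sequences by raw string for O(1) exact lookup.
    train_by_seq = {}
    for db_id, seq in train_seqs.items():
        train_by_seq.setdefault(seq, []).append(db_id)

    hits = []
    for query_id, seq in test_seqs.items():
        if len(seq) >= min_seq_length:
            continue  # long enough for a regular MMseqs2 search
        for db_id in train_by_seq.get(seq, []):
            hits.append((query_id, db_id, 1.0, len(seq)))
    return hits

hits = exact_match_short_seqs(
    {"9ABC_1": "MKV"}, {"1XYZ_2": "MKV", "2DEF_1": "GGG"}
)
print(hits)  # [('9ABC_1', '1XYZ_2', 1.0, 3)]
```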
This Parquet file is consumed by Step 3 to decide which chains/interfaces belong to the low-homology test subset.
Script: benchmark/dataset_pipeline/step3_filter_to_lowh.py
Inputs:

- `recentpdb_chain_interface_csv`: chain & interface table from Step 1, with per-side `entity_type`, `entity_id`, `seq_length`, etc.
- `test_to_train_entity_homo.parquet`: output from Step 2, mapping test entities to train entities.
Goal: Produce recentpdb_low_homology.csv, which contains only:
- chains/interfaces whose component entities have no overlapping train PDB IDs according to Step 2,
- and satisfy additional length rules for short chains and ligand interfaces.
The script reads the Parquet file into a DataFrame with columns `query_id`, `db_id`, `similarity`, `aligned_res_num`, and groups it into a dictionary:

```
test_to_train = {
    query_id: [db_id1, db_id2, ...],
    ...
}
```

where `query_id` and `db_id` have the form `{entry_id}_{entity_id}`.
For a given row in the chain/interface table:

- For a chain row:
  - Low-homology if `test_to_train[entry_id_entity_1]` is empty.
- For an interface row:
  - Let `train_pdb_ids_1` be the PDB IDs of all db entities mapped from entity 1, and likewise `train_pdb_ids_2`.
  - Low-homology if `train_pdb_ids_1 ∩ train_pdb_ids_2` is empty.
  - If the two chains map to the same `{PDB, entity}` in train, they are considered to have overlapping homology and are removed.
This logic is implemented in _check_for_lowh().
_is_short_chain() defines a “short” polymer as < 25 residues for protein/DNA/RNA.
_filter_short_polymer() implements the length-based exclusion rules:
- For chains: require `seq_length_1 >= 25` to keep (short polymer chains are dropped).
- For interfaces:
  - If both sides are long (≥ 25), keep.
  - If both sides are short, drop.
  - If one side is short and the other long, keep the interface only if the long chain has no train hits (i.e. is low-homology); otherwise drop.
In summary:

- Remove "short-short" interfaces.
- For "short-polymer" interfaces, keep only if the polymer side itself has no similar sequence in the training window.
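The interface branch of this length rule can be sketched as a small decision function. `keep_interface` is a hypothetical helper illustrating `_filter_short_polymer()`'s interface logic, with the per-side low-homology flags passed in:

```python
def keep_interface(len_1, len_2, lowh_1, lowh_2, short_threshold=25):
    """Length-based interface filter (illustrative sketch).

    len_1 / len_2: residue counts of the two sides.
    lowh_1 / lowh_2: whether each side individually has no train hits.
    """
    short_1, short_2 = len_1 < short_threshold, len_2 < short_threshold
    if not short_1 and not short_2:
        return True            # long-long: keep
    if short_1 and short_2:
        return False           # short-short: drop
    # short-long: keep only if the long side itself is low-homology
    return lowh_2 if short_1 else lowh_1

# A 10-residue peptide bound to a 300-residue protein that has train hits -> drop
print(keep_interface(10, 300, lowh_1=True, lowh_2=False))  # False
```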
The script then selects low-homology rows per entity type:
- Proteins:
  - Select chains and protein-protein interfaces (`entity_type_1 == PROTEIN` and, for interfaces, `entity_type_2 == PROTEIN`).
  - Apply `_check_for_lowh` + `_filter_short_polymer`.
  - Result: `lowh_protein_df`.
- Nucleic acids (DNA/RNA):
  - Include nucleic acid chains, nuc-protein interfaces, and nuc-nuc interfaces.
  - Apply `_check_for_lowh` and then `_filter_short_polymer` to remove short chains and short-short interfaces, plus short-polymer interfaces where the polymer is not low-homology.
  - Result: `lowh_nuc_df`.
- Ligands:
  - Keep ligand chains with `entity_type_1 == LIGAND`, and ligand-polymer interfaces where the polymer side is protein/DNA/RNA with `seq_length >= 25`.
  - Apply `_check_for_lowh` to require that the training set contains no same-PDB combination of a similar ligand and a similar polymer.
  - Result: `lowh_lig_df`.
The three subsets are concatenated into a single DataFrame merged_df.
For ligand chains and ligand-polymer interfaces, filter_lig_in_chain_interface_df() from dataset_pipeline/utils/select_ligand.py applies additional structural quality filters:
- RCSB validation report via GraphQL:
  - Fetch non-polymer instance validation metrics for `(entry_id, chain_id)` pairs.
  - Keep only instances satisfying: `intermolecular_clashes == 0`, `is_best_instance == "Y"`, `stereo_outliers == 0`, `completeness == 1.0`, `RSR <= 0.2`, `RSCC >= 0.95`.
- Symmetry-mate contact check:
  - Load the mmCIF, build a 3×3×3 unit cell using Biotite's `repeat_box`, and compute neighbors for ligand atoms via a KD-tree.
  - If any ligand atom has neighbors within 5 Å in chains that are not in (Assembly 1 ∩ asymmetric unit), the ligand is rejected as a potential crystallographic artifact.
  - Only ligands that do not contact symmetry mates are kept.
The final filtered DataFrame is written as recentpdb_low_homology.csv.
make_lig_info_df() constructs a DataFrame with columns `entry_id` and `label_asym_id`.
This function extracts all ligand chains present in the low-homology subset, and records them in a unified table.
This DataFrame is saved as RecentPDB_low_homology_lig_info.csv, which is used in the evaluation pipeline to identify ligand entities that require
pocket-aligned RMSD computation and PoseBusters ligand validity checks.
Only ligands listed in this file will be included in these ligand-specific quality assessments during benchmarking.
Script: benchmark/dataset_pipeline/step4_make_cluster_csv.py
Goal: Cluster the entities (protein/DNA/RNA/ligand) that appear in the low-homology subset and assign cluster IDs that will be used:
- for interface clustering, and
- in analysis / sampling logic downstream.
- Iterate over each row in `recentpdb_low_homology.csv`.
- For `type == "chain"`, record `entry_id`, `entity_id_1`, `entity_type_1`, `seq_length_1`.
- For `type == "interface"`, record both sides: `(entry_id, entity_id_1, entity_type_1, seq_length_1)` and `(entry_id, entity_id_2, entity_type_2, seq_length_2)`.
- Deduplicate by `(entry_id, entity_id)`.
The core helper is cluster_by_seq_identity():
- For sequences with length `< min_seq_length` (default 10), the sequence string itself is used as the `cluster_id` (degenerate cluster).
- Longer sequences are written to a FASTA file and clustered with:

```
mmseqs easy-cluster seq.fasta mm mmseqs_tmp \
    --min-seq-id {threshold} -c {coverage} -s 8 --max-seqs 1000 --cluster-mode 1
```

- For proteins: `threshold=0.4`, `coverage=0.8`.
- For DNA/RNA: `threshold=0.8`, `coverage=0.8`.
- `mm_cluster.tsv` is parsed to map `{entry_id, entity_id}` to `cluster_id` (the cluster center ID).
For ligands:
- No MMseqs2 clustering is performed; the CCD code itself is treated as the cluster ID.
- The code prepends `"CCD_"` to the CCD code in the `cluster_id` column to distinguish ligand clusters from polymer clusters.
The final cluster file has columns:
`entry_id`, `label_entity_id`, `cluster_id`, `entity_type`
and is saved as RecentPDB_low_homology_cluster_info.csv.
Interfaces in later analysis use cluster IDs formed as {chain_1_cluster_id}_{chain_2_cluster_id}, with special handling to collapse interfaces where one side is ligand/short chain onto the polymer side’s cluster ID.
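The two cluster-ID sources above (MMseqs2 output for polymers, CCD codes for ligands) can be sketched like this. The TSV parsing assumes the standard two-column `easy-cluster` output (representative, member); `parse_cluster_tsv` and `ligand_cluster_id` are illustrative names, not the pipeline's functions:

```python
import csv
import io

def parse_cluster_tsv(tsv_text):
    """Parse MMseqs2 `mm_cluster.tsv` (two tab-separated columns: cluster
    representative, member) into {member_id: cluster_id}, where IDs have the
    form "{entry_id}_{entity_id}"."""
    member_to_cluster = {}
    for rep, member in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        member_to_cluster[member] = rep  # the cluster center ID is the cluster_id
    return member_to_cluster

def ligand_cluster_id(ccd_code):
    """Ligands skip MMseqs2: the prefixed CCD code itself is the cluster ID."""
    return f"CCD_{ccd_code}"

tsv = "1ABC_1\t1ABC_1\n1ABC_1\t2DEF_3\n"
print(parse_cluster_tsv(tsv))    # {'1ABC_1': '1ABC_1', '2DEF_3': '1ABC_1'}
print(ligand_cluster_id("ATP"))  # CCD_ATP
```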
Script: benchmark/dataset_pipeline/step5_find_monomer.py
Goal: For PDB entries appearing in the low-homology set, detect:
- Protein monomers
- Protein homomers
- RNA monomers
based on Assembly 1 composition (counts of each entity type).
get_entity_counts_from_cif():
- Loads the mmCIF with `Structure.from_mmcif(..., assembly_id=assembly_id)` and cleans the structure.
- Iterates over `label_entity_id`s, determines the `entity_type` (protein, RNA, ligand, etc.), and counts:
  - the number of chains per entity type (e.g. `PROTEIN`, `RNA`, `LIGAND` columns), and
  - the number of entities per type (columns like `"protein entities"`, `"rna entities"`).
get_entity_counts_from_cif_batch() runs this in parallel over all PDB IDs in the low-homology CSV and returns a DataFrame with one row per entry_id.
find_monomer_and_homomer() attaches boolean flags:
- `is_protein_monomer`:
  - All non-ligand entity-type columns except `PROTEIN` are 0.
  - `PROTEIN == 1`.
- `is_protein_homomer`:
  - All `"{entity} entities"` columns except `"protein entities"` are 0.
  - `PROTEIN > 1` and `"protein entities" == 1` (multiple identical protein chains).
- `is_rna_monomer`:
  - All non-ligand entity-type columns except `RNA` are 0.
  - `RNA == 1`.
find_protein_monomer_for_recentpdb_lowh():
- Reads `recentpdb_low_homology.csv` to get the set of `entry_id`s,
- runs entity counting and classification on Assembly 1, and
- writes `recentpdb_low_homology_entity_type_count.csv`.
This CSV is consumed in Step 6 to tag monomer/homomer subsets.
Script: benchmark/dataset_pipeline/step6_add_subset_to_lowh.py
Goal: For each chain/interface row in recentpdb_low_homology.csv, assign one or more subset labels describing its biological and structural class, and write them into a subset column. A single row can have multiple labels separated by ";".
The subsets include (non-exhaustive):
- Antibody-related: `[antibody]`, `[antibody-protein]`, `[antibody_HL]`, `[antibody_HL-protein]`, etc.
- Monomer / homomer: `[protein_monomer]`, `[protein_homomer]`, `[rna_monomer]`.
- Peptide-related interfaces: `[peptide-interface]`, `[peptide-peptide]`, `[peptide-protein]`, etc.
- Cyclic peptide interfaces: `[cyclic_peptide-interface]`, `[cyclic_peptide-protein]`, etc.
identify_antibody_protein():
- Ensures a SAbDab summary TSV is available (downloads it via `wget` if not).
- `get_sabdab_ab_chain_to_type()` builds a mapping `{pdb_id_auth_chain_id -> antibody type}`, where the type is one of `antibody_scFv`, `antibody_HL`, `antibody_H`, `antibody_L`.
- `get_antibody_antigen_label()` inspects each low-homology row:
  - For chains, if `entry_id_auth_chain_id` exists in the map, label `[antibody];[ab_type]`.
  - For interfaces, it checks both chains and whether they are proteins, and returns:
    - `[antibody-antibody]` if both chains are antibodies;
    - `[antibody-protein];[{ab_type}-protein]` when one side is an antibody and the other is a protein.
identify_monomer_and_homomer() uses recentpdb_low_homology_entity_type_count.csv from Step 5:
- Builds sets of `entry_id`s that are protein monomers, protein homomers, or RNA monomers.
- For each low-homology row:
  - If the entry is a protein monomer and this row is a protein chain, label `[protein_monomer]`.
  - If the entry is a protein homomer, label `[protein_homomer]`.
  - If the entry is an RNA monomer and this row is an RNA chain, label `[rna_monomer]`.
identity_peptide():
- For interfaces only, classify based on `entity_type_{1,2}` and `seq_length_{1,2}` using a `peptide_threshold` (default 25):
  - Both chains short proteins → `[peptide-interface];[peptide-peptide]`.
  - Short protein vs long protein → `[peptide-interface];[peptide-protein]`.
  - Short protein vs DNA/RNA → `[peptide-interface];[peptide-dna]` / `[peptide-interface];[peptide-rna]`.
identity_cyclic_peptide():
- From rows labelled as peptide interfaces, collect the `entry_id`s.
- For each mmCIF file:
  - Identify peptide entity IDs (`polypeptide(L)`).
  - For each such chain, if the number of unique residues is ≤ `peptide_threshold` and at least one bond connects non-adjacent residue IDs (`|res_i - res_j| > 1`), treat the entity as a cyclic peptide.
- Label interfaces where one or both sides are cyclic-peptide entities as:
  - `[cyclic_peptide-interface];[cyclic_peptide-cyclic_peptide]`
  - `[cyclic_peptide-interface];[cyclic_peptide-protein]`
  - `[cyclic_peptide-interface];[cyclic_peptide-dna]` / `[cyclic_peptide-interface];[cyclic_peptide-rna]`
identify_subset():
- Reads the low-homology CSV.
- Computes antibody labels, monomer/homomer labels, peptide labels, and cyclic peptide labels.
- Concatenates them per row and joins non-NA entries with `";"` to form `subset`.
- Writes the updated CSV back to `recentpdb_low_homology.csv` (in-place).
Script: benchmark/dataset_pipeline/step7_copy_true_cif.py
Goal: Prepare a directory containing ground-truth mmCIFs for all entries in the low-homology dataset, to be used by downstream evaluation code.
copy_true_cif():
- Reads the low-homology CSV (or another CSV with an `entry_id` column).
- Builds `(input_path, output_path)` pairs for `<entry_id>.cif`.
- Two modes:
  - `copy_all_cif=True` (default):
    - If `symlink=True`, create a single symlink from `input_dir` to `output_dir`.
    - If `symlink=False`, copy the entire tree.
  - `copy_all_cif=False`:
    - Create `output_dir` and copy/symlink only the CIFs referenced in the CSV.
- Parallelised with `joblib.Parallel` when copying individual files.
This gives you a canonical true_dir that the evaluation pipeline can mount as ground truth.
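The per-file mode (`copy_all_cif=False`) can be sketched as follows. This is a dependency-free illustration: the real script parallelises with `joblib.Parallel`, while this sketch uses a stdlib thread pool, and `copy_true_cifs` here is a hypothetical stand-in, not the actual function:

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_true_cifs(entry_ids, input_dir, output_dir, symlink=False):
    """Copy (or symlink) only the CIFs referenced by the CSV's entry_ids."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    def _one(entry_id):
        src = input_dir / f"{entry_id}.cif"
        dst = output_dir / f"{entry_id}.cif"
        if symlink:
            dst.symlink_to(src)
        else:
            shutil.copy2(src, dst)  # preserves timestamps/metadata

    with ThreadPoolExecutor() as pool:
        list(pool.map(_one, entry_ids))  # drain to surface any exceptions

# Minimal self-contained demo on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "mmcif"
    src_dir.mkdir()
    (src_dir / "1abc.cif").write_text("data_1ABC\n")
    copy_true_cifs(["1abc"], src_dir, Path(tmp) / "true_dir")
    print((Path(tmp) / "true_dir" / "1abc.cif").exists())  # True
```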
Script: benchmark/dataset_pipeline/step8_get_homology_for_lowh.py
Goal: Although recentpdb_low_homology.csv already ensures low homology at certain sequence-identity thresholds, Step 8 recomputes and stores fine-grained homology scores between low-homology entities and the older training set, using a lower sequence-identity threshold (0.3) to support stratification of the evaluation set by homology level.
get_homology_for_lowh():
- Reads `pdb_seq_csv` into `seq_df` and filters `train_df = seq_df[release_date < after_date]`.
- Reads `recentpdb_low_homology.csv` and uses `get_lowh_sequences()` to extract only those sequences that appear in the low-homology dataset (both sides of interfaces).

For each entity type (protein/RNA/DNA), it calls `calc_mmseqs_seq_identity()` with:

- `threshold = 0.3`
- `e_value_cutoff = 0.1`
- `sensitivity = 7.5`
- `max_seqs = 500000`
- `cov_mode = 0`, `coverage = 0.0`
- `nuc=True` for nucleic acids.
The helper again:
- handles short sequences with identity-only matching,
- uses `mmseqs easy-search` for the rest, and
- returns a DataFrame with `query_id`, `db_id`, `similarity`, `aligned_res_num`.
All three entity types are concatenated into merged_df, down-cast with shrink_dataframe(), and saved as Parquet (e.g. recentpdb_low_homology_entity_homo.parquet).
This file can be used to bucket cases by “homology to pre-cutoff PDB” during evaluation.
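The bucketing could, for example, reduce the per-pair scores to a maximum similarity per low-homology entity and bin that into coarse levels. This is an illustrative sketch only; the bucket edges and both function names are assumptions, not part of the pipeline:

```python
def max_similarity_per_query(rows):
    """Reduce (query_id, db_id, similarity) rows to the maximum similarity each
    low-homology entity has against the pre-cutoff train set."""
    best = {}
    for query_id, _db_id, similarity in rows:
        best[query_id] = max(best.get(query_id, 0.0), similarity)
    return best

def homology_bucket(similarity, edges=(0.3, 0.6, 0.9)):
    """Assign a coarse homology level: 0 = below all edges, 3 = above all."""
    return sum(similarity >= edge for edge in edges)

best = max_similarity_per_query(
    [("9ABC_1", "1AAA_1", 0.35), ("9ABC_1", "1BBB_2", 0.62), ("9XYZ_1", "1CCC_1", 0.12)]
)
print(best)                             # {'9ABC_1': 0.62, '9XYZ_1': 0.12}
print(homology_bucket(best["9ABC_1"]))  # 2
```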
Script: benchmark/dataset_pipeline/step9_analysis_dataset.py
Goal: Summarise the dataset into human-readable statistics and plots:
- How many PDB complexes survive each filter step.
- Token count distributions.
- Counts per evaluation type / subset.
- PDB ID lists per eval-type / subset.
stat_filtered_num() re-applies the meta-level filters to a PDB meta info CSV (pdb_meta_info_csv):
- Date filter: `[after_date, before_date]`;
- drop NMR entries (same NMR method set);
- resolution threshold (default 4.5 Å);
- token threshold (default 2560);
- require standard polymers and limit the maximum number of polymer chain copies;
- remove entries where all chains are `unk` or all chains have breaks;
- require that at least one chain has sufficient resolved residues.
It returns a dict of counts by step; _log_filtered_num_stat() turns this into a formatted text block.
stat_lowh_num():
- Reads `recentpdb_low_homology_cluster_info.csv` and normalises `label_entity_id` to string.
- Joins cluster IDs into the low-homology DataFrame (`add_cluster_id_to_df()`), using the polymer cluster ID for interfaces where one side is a ligand or a short chain.
- Computes:
  - the number of complexes (unique `entry_id`s), chains, and interfaces;
  - for each evaluation type (configured in `eval_type_config`): the number of rows, unique clusters, and PDB IDs;
  - token number lists (per complex and per chain/interface).
- Explodes the `subset` column and counts occurrences per subset label, plus PDB ID and cluster counts per subset.
_log_lowh_num() formats these into human-readable text.
_draw_token_num_plot():
- Saves histograms of token counts for:
  - complex-level token counts (the `num_tokens` field from the meta info), and
  - chain/interface token counts per eval type.
All plots are written as PNG files under output_dir/figs.
run_data_analysis() writes:
- `stat.txt`:
  - a `<DATA FILTERING PIPELINE STATISTICS>` section (step-wise counts), and
  - a `<LOW-HOMOLOGY SUBSET STATISTICS>` section (per eval type / subset).
- PDB ID lists under `output_dir/pdb_ids`: `all.txt`, `lowh_polymer_only.txt`, plus one file per eval type and per subset label.