Skip to content

QC for duplicate labels/exact synonyms in EFO#2628

Merged
aleixpuigb merged 3 commits intomasterfrom
QC_label_duplication
Feb 19, 2026
Merged

QC for duplicate labels/exact synonyms in EFO#2628
aleixpuigb merged 3 commits intomasterfrom
QC_label_duplication

Conversation

@aleixpuigb
Copy link
Collaborator

@aleixpuigb aleixpuigb commented Feb 16, 2026

This pull request introduces a new quality control (QC) check to detect duplicate lexical strings across rdfs:label and oboInOwl:hasExactSynonym properties in ontology terms. The check is integrated into the build process and produces a report highlighting potential collisions, helping to improve ontology consistency and avoid ambiguous naming.

SPARQL extraction for label/synonym data:

  • Introduced label-synonym-dup-extract.sparql to extract all rdfs:label and oboInOwl:hasExactSynonym values from ontology classes, filtering out deprecated and blank nodes.
  • Added Makefile rules to generate the extracted data TSV and run the duplicate detection script, producing the final report.

New duplicate detection QC integration:

  • Added the label_synonym_dup_check target to the qc rule in the Makefile, ensuring the duplicate check runs as part of QC and GitHub Actions workflows.
  • Implemented a new Python script detect_label_synonym_duplicates.py that processes a ROBOT-generated TSV file, normalizes label/synonym values, identifies duplicate clusters across distinct terms, and writes a wide-format TSV report.

Add a new QC step to detect duplicate lexical strings across rdfs:label and oboInOwl:hasExactSynonym. Introduces a SPARQL extraction (label-synonym-dup-extract.sparql), a Python detector script (detect_label_synonym_duplicates.py) that normalizes values and writes a TSV report, and hooks the check into the Makefile (label_synonym_dup_check) so it's run as part of qc. Adds LABEL_DUP_EXPERIMENTAL Makefile flag to run in non-failing experimental mode and generates intermediate label-synonym-data.tsv via ROBOT.
Remove the LABEL_DUP_EXPERIMENTAL variable and the LABEL_DUP_ARGS conditional block, and hardcode the output path for the detect_label_synonym_duplicates.py invocation. The label_synonym_dup_check target now calls the script with an explicit -o reports/label-synonym-duplicates.tsv, simplifying the Makefile and removing the experimental flag handling.
@aleixpuigb aleixpuigb self-assigned this Feb 16, 2026
@aleixpuigb aleixpuigb merged commit b4c15a1 into master Feb 19, 2026
1 check passed
@aleixpuigb aleixpuigb deleted the QC_label_duplication branch February 19, 2026 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments