Collaborative workspace for curating METPO terms and experimenting with KG-Microbe ingest with METPO normalization of source data
This workspace provides a structured environment for curating definitions and metadata for the METPO (Microbial Environment and Traits Ontology) and KG-Microbe knowledge graph, following OBO Foundry principles and best practices.
Install these tools (one-time setup):
- Python >= 3.10: Required for all Python tooling
- uv: Python package manager (installation instructions)
    curl -LsSf https://astral.sh/uv/install.sh | sh
- just: Command runner (installation instructions)
    uv tool install rust-just
- Git: Version control (should already be installed)
- Clone this repository (or your fork):
    git clone https://github.com/berkeleybop/metpo-kgm-studio.git
    cd metpo-kgm-studio
- Run setup:
    just setup
- Fetch your assignments from Google Sheets:
    just fetch-assignments
- Read the workflow guide:
    cat CURATION_GUIDE.md
metpo-kgm-studio/
├── assignments/ # Curator-specific class assignments
│ ├── curator1.tsv # Classes assigned to curator 1
│ ├── curator2.tsv # Classes assigned to curator 2
│ └── curator3.tsv # Classes assigned to curator 3
├── prompts/
│ ├── templates/ # Approved LLM prompt templates
│ │ ├── definition-generation.md
│ │ └── definition-source-finding.md
│ └── executed/ # Record of prompts actually executed
│ └── YYYY-MM-DD_CLASSID_description.md
├── outputs/
│ ├── raw/ # Raw LLM outputs (before review)
│ │ └── curator1/
│ └── reviewed/ # Reviewed and approved outputs
│ └── curator1/
├── src/metpo_kgm_studio/ # Python curation tools
│ ├── splitter.py # Google Sheets assignment splitter
│ └── validators.py # Definition quality validators
├── tests/ # Unit tests
├── justfile # Workflow commands (run `just` to see all)
├── README.md # This file
└── CURATION_GUIDE.md # Detailed workflow documentation
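The `splitter.py` entry in the tree above turns a shared sheet into one TSV per curator under `assignments/`. The following is only a rough sketch of that idea, not the actual implementation: the function name `split_assignments`, the `curator` column name, and the choice to skip unassigned rows are all assumptions made for illustration, and the rows are assumed to have already been downloaded into a list of dictionaries.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_assignments(rows, out_dir, curator_column="curator"):
    """Group rows by curator and write one TSV per curator.

    `rows` is a list of dicts that all share the same keys.
    Returns a mapping of curator name to number of rows written.
    The column name "curator" is a hypothetical choice.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    grouped = defaultdict(list)
    for row in rows:
        curator = (row.get(curator_column) or "").strip()
        if curator:  # rows without a curator are left out (assumption)
            grouped[curator].append(row)

    written = {}
    for curator, curator_rows in grouped.items():
        path = out_dir / f"{curator}.tsv"
        fieldnames = list(curator_rows[0].keys())
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
            writer.writeheader()
            writer.writerows(curator_rows)
        written[curator] = len(curator_rows)
    return written
```

The returned counts are the same kind of per-curator summary that a command like `just assignment-stats` reports.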
Run `just` to see all available commands. Here are the most important ones:
just setup # Install dependencies and create directories
just fetch-assignments # Download and split Google Sheet into assignments
just assignment-stats # Show how many classes per curator
just validate-all # Validate all assignment files
just validate-file FILE # Validate a specific TSV file
just validate-reviewed # Validate reviewed outputs
just test # Run all tests (pytest, mypy, ruff)
just lint # Check code style
just format # Auto-format code
just spell-check # Check spelling
just workflow-help # Show workflow summary
just new-branch NAME BATCH # Create a new branch for curation work
just progress curator1 # Show progress for a curator
just status # Git status with helpful reminders
just sync-upstream # Sync fork with upstream (for forks)

The recommended workflow follows GitHub's issue → branch → commits → PR pattern:
- Get Assignment: `just fetch-assignments`
- Create Branch: `just new-branch curator1 1` or `git checkout -b curator1-batch1`
- Curate Definitions:
  - Use prompts from `prompts/templates/`
  - Save executed prompts to `prompts/executed/`
  - Save LLM outputs to `outputs/raw/your-name/`
- Validate: `just validate-file assignments/curator1.tsv`
- Commit: `git add . && git commit -m "Add definitions for METPO:XXX-YYY"`
- Push & PR: `git push origin curator1-batch1` → Create pull request on GitHub
See CURATION_GUIDE.md for detailed step-by-step instructions.
This project follows OBO Foundry principles, especially:
- Genus-Differentia Form: "A(n) [parent class] that [distinguishing characteristics]"
- Clear & Unambiguous: Intelligible to domain experts
- Avoid Circularity: Don't use the term in its own definition
- Sources Required: At least one PMID, DOI, ISBN, or authoritative URL
- Lowercase Labels: Except for proper nouns and acronyms
- No Capitalization: Avoid capitalizing common words
- Consistent Style: Follow established patterns in the ontology
Validators automatically check these principles. See src/metpo_kgm_studio/validators.py.
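To make the principles above concrete, here is a minimal sketch of how three of them (genus-differentia form, circularity, and required sources) might be checked mechanically. It is not the code in `validators.py`; the function name `check_definition`, the regular expressions, and the problem messages are assumptions for illustration. The checks are deliberately crude: they only look at surface form and cannot tell whether a source actually supports a definition.

```python
import re

# Rough shapes of acceptable source identifiers (assumed patterns, not exhaustive).
SOURCE_PATTERN = re.compile(
    r"(PMID:\s*\d+|doi:\s*10\.\d{4,9}/\S+|ISBN:\s*[\d\-Xx]{10,17}|https?://\S+)",
    re.IGNORECASE,
)


def check_definition(label, definition, sources):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    text = definition.strip()

    # Genus-differentia: "A/An <parent class> that <distinguishing characteristics>"
    if not re.match(r"^(A|An)\s+.+?\s+that\s+.+", text):
        problems.append("definition is not in genus-differentia form")

    # Circularity: the label should not appear inside its own definition.
    if re.search(rf"\b{re.escape(label)}\b", text, re.IGNORECASE):
        problems.append("definition uses the term being defined")

    # Sources: at least one PMID, DOI, ISBN, or URL must be present.
    if not any(SOURCE_PATTERN.search(source) for source in sources):
        problems.append("no PMID, DOI, ISBN, or URL source given")

    return problems


# Hypothetical example; the label and definition are illustrative only.
print(check_definition("thermophile", "A thermophile is heat-loving.", []))
```

A surface check like this can catch missing sources or a circular wording early, but a human reviewer still has to confirm the biology and read the cited source.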
This project encourages LLM use, but with guardrails.
Do:
- Start with approved prompt templates in `prompts/templates/`
- Save all prompts and outputs to git for reproducibility
- Critically review all LLM outputs (they can hallucinate!)
- Verify that all sources (PMIDs, DOIs) actually exist and support the definition
- Iterate on prompts if quality is poor
Don't:
- Trust LLM outputs blindly
- Use ad-hoc prompts without documenting them
- Skip the review step
- Generate definitions without understanding the biology
- Ignore validation errors
Recommended LLMs:
- Claude (3.5 Sonnet or Opus): Excellent for scientific text
- ChatGPT (GPT-4): Good general performance
- CBORG: Specialized for biomedical text (if available)
See Chris Mungall's best practices for more guidance.
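Saving every executed prompt is easier when the file name is built the same way each time. The sketch below follows the `YYYY-MM-DD_CLASSID_description.md` pattern shown under `prompts/executed/`. The helper name `executed_prompt_path`, the replacement of `:` with `_` in the class ID, and the slug rules for the description are assumptions for illustration, not an existing project utility.

```python
import re
from datetime import date
from pathlib import Path


def executed_prompt_path(class_id, description, run_date=None, base_dir="prompts/executed"):
    """Build a path like prompts/executed/YYYY-MM-DD_CLASSID_description.md."""
    run_date = run_date or date.today()
    # Colons are awkward in file names on some systems, so swap them out (assumption).
    safe_id = class_id.replace(":", "_")
    # Lowercase the description and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return Path(base_dir) / f"{run_date.isoformat()}_{safe_id}_{slug}.md"


# Hypothetical class ID and description, for illustration only.
print(executed_prompt_path("METPO:0000001", "Definition generation", date(2025, 10, 2)))
```

Keeping the date first means the directory sorts chronologically, which makes it simpler to trace which prompt produced a given output during review.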
- Read This File: You're doing it! ✓
- Read CURATION_GUIDE.md: Detailed workflow walkthrough
- Run Setup: `just setup`
- Get Your Assignment: `just fetch-assignments`
- Find Your File: `assignments/curator1.tsv` (or curator2, curator3)
- Read Prompt Templates: `prompts/templates/README.md`
- Try One Class: Start small and curate the definition for a single class
- Validate: `just validate-file assignments/curator1.tsv`
- Ask Questions: Don't hesitate to ask Montana, Mark, Sujay, or Chris!
Through this project, you will learn:
- ✅ Ontology curation: Following OBO Foundry principles
- ✅ Git workflow: Branch, commit, pull request
- ✅ Python development: Type hints, testing, linting
- ✅ LLM best practices: Prompt engineering, critical evaluation
- ✅ Scientific literature: Finding and citing authoritative sources
- ✅ Collaborative coding: Code review, issue tracking
These are all highly marketable skills for bioinformatics and data science careers!
- Ontology Questions: Ask Montana or Mark
- Python/Technical Questions: Ask Sujay
- LLM/Prompting Questions: Ask Chris
- General Questions: Ask anyone on the team!
See CURATION_GUIDE.md for the detailed workflow. In summary:
- Fork this repo (for interns) or clone directly (for team members)
- Create a branch for your work
- Make changes following OBO Foundry principles
- Validate your work: `just validate-all`
- Commit with clear messages
- Push and create a pull request
- Address review feedback
- Celebrate when merged! 🎉
BSD-3-Clause
This project was created from the metpo-kgm-copier template, based on monarch-project-copier.
Developed as part of the METPO/KG-Microbe project at Lawrence Berkeley National Laboratory.
Generated: 2025-10-02