Commit b61e86c

Dahlializi (Dahlia Li), thompsonmj, and egrace479 authored

Documentation site

---------

Co-authored-by: Dahlia Li <poker1005@outlook.com>
Co-authored-by: Matt Thompson <31709066+thompsonmj@users.noreply.github.com>
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
1 parent 6c563a0 commit b61e86c

24 files changed: +1067 -190 lines changed

.github/workflows/deploy-docs.yaml

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+name: Build & Deploy MkDocs (gh-pages with PR previews)
+
+on:
+  workflow_dispatch:
+  pull_request:
+    branches: [ main ]
+    types: [opened, synchronize, reopened, closed]
+  push:
+    branches: [ main ]
+
+permissions:
+  contents: write
+  pages: write
+
+jobs:
+  build:
+    # Run for push, workflow dispatch, PRs from SAME repo that are not closed
+    if: |
+      github.event_name == 'push' ||
+      github.event_name == 'workflow_dispatch' ||
+      (github.event_name == 'pull_request' &&
+       github.event.pull_request.head.repo.fork == false &&
+       github.event.action != 'closed')
+    runs-on: ubuntu-latest
+    concurrency:
+      group: ${{ github.workflow }}-${{ github.ref }}
+      cancel-in-progress: true
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Install deps
+        run: |
+          python -m pip install --upgrade pip
+          pip install '.[docs]'
+      - name: Build with MkDocs
+        run: mkdocs build
+      - name: Upload built site as artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: site
+          path: ./site
+
+  deploy:
+    needs: build
+    # Deploy on push to main (root) or PRs from SAME repo (not closed) -> pr-<N>/
+    if: |
+      github.event_name == 'push' ||
+      (github.event_name == 'pull_request' &&
+       github.event.pull_request.head.repo.fork == false &&
+       github.event.action != 'closed')
+    runs-on: ubuntu-latest
+    concurrency:
+      group: ${{ github.workflow }}-${{ github.ref }}
+      cancel-in-progress: true
+    steps:
+      - name: Download built site
+        uses: actions/download-artifact@v4
+        with:
+          name: site
+          path: ./site
+      - name: Deploy to gh-pages
+        uses: peaceiris/actions-gh-pages@v4
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_branch: gh-pages
+          publish_dir: ./site
+          keep_files: true
+          destination_dir: ${{ github.event_name == 'pull_request' && format('pr-{0}', github.event.number) || '' }}
+
+  cleanup:
+    # Only when a same-repo PR closes
+    if: >
+      github.event_name == 'pull_request' &&
+      github.event.pull_request.head.repo.fork == false &&
+      github.event.action == 'closed'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout gh-pages
+        uses: actions/checkout@v4
+        with:
+          ref: gh-pages
+          fetch-depth: 0
+      - name: Configure git author
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+      - name: Remove preview folder
+        shell: bash
+        run: |
+          set -euo pipefail
+          PR_DIR="pr-${{ github.event.number }}"
+          echo "Attempting to remove $PR_DIR"
+          if [ -d "$PR_DIR" ]; then
+            git rm -r "$PR_DIR"
+            git commit -m "Remove preview for PR #${{ github.event.number }}"
+            git push origin gh-pages
+          else
+            echo "No preview folder $PR_DIR found; nothing to do."
+          fi
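
The three `if:` conditions in this workflow decide which jobs run for a given event, and the `destination_dir` expression decides where the site is published. A minimal Python sketch of that routing may make it easier to check at a glance. The function names and the plain arguments standing in for the `github` context are hypothetical and are not part of the workflow itself.

```python
def jobs_to_run(event_name, action=None, from_fork=False):
    """Return the set of jobs whose `if:` condition holds for an event.

    Hypothetical helper: `action` stands in for github.event.action and
    `from_fork` for github.event.pull_request.head.repo.fork.
    """
    same_repo_pr = event_name == "pull_request" and not from_fork
    open_pr = same_repo_pr and action != "closed"

    jobs = set()
    # build: push, manual dispatch, or a same-repo PR that is not closed
    if event_name in ("push", "workflow_dispatch") or open_pr:
        jobs.add("build")
    # deploy: push or a same-repo PR that is not closed (it also needs build)
    if event_name == "push" or open_pr:
        jobs.add("deploy")
    # cleanup: only when a same-repo PR closes
    if same_repo_pr and action == "closed":
        jobs.add("cleanup")
    return jobs


def destination_dir(event_name, pr_number=None):
    """Mirror the destination_dir expression: pr-<N> for PRs, site root otherwise."""
    return f"pr-{pr_number}" if event_name == "pull_request" else ""
```

Under this reading, `jobs_to_run("workflow_dispatch")` yields only `build`: a manual dispatch builds the site but does not publish it, and pull requests from forks trigger none of the jobs.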

README.md

Lines changed: 7 additions & 183 deletions
@@ -1,194 +1,18 @@
-# TaxonoPy
+<h1 align="center">
+<img src="docs/_assets/taxonopy_banner.svg" alt="TaxonoPy banner">
+</h1>
 
 [![DOI](https://zenodo.org/badge/789041700.svg)](https://doi.org/10.5281/zenodo.15499454)
 
 [![PyPI - Version](https://img.shields.io/pypi/v/taxonopy.svg)](https://pypi.org/project/taxonopy)
 [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/taxonopy.svg)](https://pypi.org/project/taxonopy)
 
-`TaxonoPy` (taxon-o-py) is a command-line tool for creating an internally consistent taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier). See below for the structure of inputs and outputs.
+## TaxonoPy: Reproducible, Traceable, and Scalable Biological Taxonomy Alignment
 
-## Purpose
-The motivation for this package is to create an internally consistent and standardized classification set for organisms in a large biodiversity dataset composed from different data providers that may use very similar and overlapping but not identical taxonomic hierarchies.
+TaxonoPy (taxon-o-pie) is a command-line tool for harmonizing large biodiversity datasets into a consistent taxonomy ready for AI applications. Built on the [Global Names Verifier (GNVerifier)](https://github.com/gnames/gnverifier), it provides complete provenance tracking, flexible resolution strategies, and batch processing of 100M+ records to address challenges in reproducibility and scale in massive multi-source taxonomy alignment.
 

-Its development has been driven by its application in the TreeOfLife-200M (TOL) dataset. This dataset contains over 200 million samples of organisms from four core data providers:
-
-- [The Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/)
-- [BIOSCAN-5M](https://biodiversitygenomics.net/projects/5m-insects/)
-- [FathomNet](https://www.fathomnet.org/)
-- [The Encyclopedia of Life (EOL)](https://eol.org/)
-
-The names (and classification) of taxa may be (and often are) inconsistent across these resources. This package addresses this problem by creating an internally consistent classification set for such taxa.
-
-### Input
-
-A directory containing Parquet partitions of the seven-rank Linnaean taxonomic metadata for organisms in the dataset. Labels should include:
-- `uuid`: a unique identifier for each sample (required).
-- `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`: the taxonomic ranks of the organism (required, may have sparsity).
-- `scientific_name`: the scientific name of the organism, to the most specific rank available (optional).
-- `common_name`: the common (i.e. vernacular) name of the organism (optional).
-
-See the example data in
-- `examples/input/sample.parquet`
-- `examples/resolved/sample.resolved.parquet` (generated with [`taxonopy resolve`](#command-resolve))
-- `examples/resolved_with_common_names/sample.resolved.parquet` (generated with [`taxonopy common-names`](#command-common-names))
-
-### Challenges
-This taxonomy information is provided by each data provider and the original sources, but the classification can be...
-
-- **Inconsistent**: both between and within sources (e.g. kingdom Metazoa vs. Animalia).
-- **Incomplete**: many samples are missing one or more ranks. Some have 'holes' where higher and lower ranks are present, but intermediate ranks are missing.
-- **Incorrect**: some samples have incorrect classifications. This can come in the form of spelling errors, nonstandard idiosyncratic terms, or outdated classifications.
-- **Ambiguous**: homonyms, synonyms, and other terms that can be interpreted in multiple ways unless handled systematically.
-
-Taxonomic authorities exist to standardize classification, but ...
-- There are many authorities.
-- They may disagree.
-- A given organism may be missing from some.
-
-### Solution
-`TaxonoPy` uses the taxonomic hierarchies provided by the TOL core data providers to query GNVerifier and create a standardized classification for each sample in the TOL dataset. It prioritizes the [GBIF Backbone Taxonomy](https://verifier.globalnames.org/data_sources/11), since this represents the largest part of the TOL dataset. Where GBIF misses, backup sources such as the [Catalogue of Life](https://verifier.globalnames.org/data_sources/1) and [Open Tree of Life (OTOL) Reference Taxonomy](https://verifier.globalnames.org/data_sources/179) are used.
-
-## Installation
-
-`TaxonoPy` can be installed with `pip` after setting up a virtual environment.
-
-### User Installation with `pip`
-
-To install the latest version of `TaxonoPy`, run:
-```console
-pip install taxonopy
-```
-
-### Usage
-You may view the help for the command line interface by running:
-```console
-taxonopy --help
-```
-This will show you the available commands and options:
-```console
-usage: taxonopy [-h] [--cache-dir CACHE_DIR] [--cache-input CACHE_INPUT]
-                [--show-cache-path] [--cache-stats] [--clear-cache]
-                [--show-config] [--version]
-                {resolve,trace,common-names} ...
-
-TaxonoPy: Resolve taxonomic names using GNVerifier and trace data provenance.
-
-positional arguments:
-  {resolve,trace,common-names}
-    resolve             Run the taxonomic resolution workflow
-    trace               Trace data provenance of TaxonoPy objects
-    common-names        Merge vernacular names (post-process) into resolved outputs
-
-options:
-  -h, --help            show this help message and exit
-  --cache-dir CACHE_DIR
-                        Directory for TaxonoPy cache (can also be set with TAXONOPY_CACHE_DIR environment variable) (default: None)
-  --cache-input CACHE_INPUT
-                        Input dataset path to compute cache stats for when no command is provided (default: None)
-  --show-cache-path     Display the current cache directory path and exit (default: False)
-  --cache-stats         Display statistics about the cache and exit (default: False)
-  --clear-cache         Clear the TaxonoPy object cache. May be used in isolation. (default: False)
-  --show-config         Show current configuration and exit (default: False)
-  --version             Show version number and exit
-```
-
-### Cache behavior
-
-`taxonopy resolve` caches parsed entries, entry groups, and every resolution attempt chain using [`diskcache`](https://grantjenks.com/docs/diskcache/) as a stable provenance artifact tied to the TaxonoPy version and input dataset. By default the cache root is `~/.cache/taxonopy`, but you can override it by setting the environment variable `TAXONOPY_CACHE_DIR` or specifying `--cache-dir`. Its primary purpose is to support the `trace` command, which allows you to trace the provenance of any taxonomic entry resolved by TaxonoPy.
-
-- Each resolve run writes into `resolve_v<version>_<fingerprint>` where the fingerprint is a SHA-256 hash of the input files’ metadata, so namespaces stay stable per combination of dataset and package version.
-- Inspect a namespace without rerunning by invoking `taxonopy --cache-dir <root> --cache-input <input> --cache-stats`, which reports total size, entry counts, and key-prefix breakdowns. Passing `--cache-stats` after `resolve` or `trace` performs the same check and exits.
-- If both the namespace and the output directory already contain data, `taxonopy resolve` warns and exits unless you pass `--full-rerun`, which clears the cache namespace and output before proceeding. Use `--clear-cache` to wipe only the namespace.
-
-#### Command: `resolve`
-The `resolve` command is used to perform taxonomic resolution on a dataset. It takes a directory of Parquet partitions as input and outputs a directory of resolved Parquet partitions.
-```
-usage: taxonopy resolve [-h] -i INPUT -o OUTPUT_DIR
-                        [--output-format {csv,parquet}]
-                        [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
-                        [--log-file LOG_FILE] [--force-input] [--full-rerun]
-                        [--batch-size BATCH_SIZE] [--all-matches]
-                        [--capitalize] [--fuzzy-uninomial] [--fuzzy-relaxed]
-                        [--species-group] [--refresh-cache] [--cache-stats]
-
-options:
-  -h, --help            show this help message and exit
-  -i, --input INPUT     Path to input Parquet or CSV file/directory
-  -o, --output-dir OUTPUT_DIR
-                        Directory to save resolved and unsolved output files
-  --output-format {csv,parquet}
-                        Output file format
-  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
-                        Set logging level
-  --log-file LOG_FILE   Optional file to write logs to
-  --force-input         Force use of input metadata without resolution
-  --full-rerun          Replace existing cache/output if detected for this input
-
-GNVerifier Settings:
-  --batch-size BATCH_SIZE
-                        Max number of name queries per GNVerifier API/subprocess call
-  --all-matches         Return all matches instead of just the best one
-  --capitalize          Capitalize the first letter of each name
-  --fuzzy-uninomial     Enable fuzzy matching for uninomial names
-  --fuzzy-relaxed       Relax fuzzy matching criteria
-  --species-group       Enable group species matching
-
-Cache Management:
-  --refresh-cache       Force refresh of cached objects (input parsing, grouping) before running.
-  --cache-stats         Display cache statistics for this input and exit.
-```
-It is recommended to keep GNVerifier settings at their defaults.
-
-#### Command: `trace`
-The `trace` command is used to trace the provenance of a taxonomic entry. It takes a UUID and an input path as arguments and outputs the full path of the entry through TaxonoPy.
-```console
-usage: taxonopy trace [-h] {entry} ...
-
-positional arguments:
-  {entry}
-    entry               Trace an individual taxonomic entry by UUID
-
-options:
-  -h, --help            show this help message and exit
-
-usage: taxonopy trace entry [-h] --uuid UUID --from-input FROM_INPUT [--format {json,text}] [--verbose]
-
-options:
-  -h, --help            show this help message and exit
-  --uuid UUID           UUID of the taxonomic entry
-  --from-input FROM_INPUT
-                        Path to the original input dataset
-  --format {json,text}  Output format
-  --verbose             Show full details including all UUIDs in group
-```
-
-#### Command: `common-names`
-The `common-names` command is used to merge vernacular names into the resolved output. It takes a directory of resolved Parquet partitions as input and outputs a directory of resolved Parquet partitions with common names.
-```console
-usage: taxonopy common-names [-h] --resolved-dir ANNOTATION_DIR --output-dir OUTPUT_DIR
-
-options:
-  -h, --help            show this help message and exit
-  --resolved-dir ANNOTATION_DIR
-                        Directory containing your *.resolved.parquet files
-  --output-dir OUTPUT_DIR
-                        Directory to write annotated .parquet files
-```
-Note that the `common-names` command is a post-processing step and should be run after the `resolve` command.
-
-### Example Usage
-
-To perform taxonomic resolution on a dataset with subsequent common name annotation, run:
-```console
-taxonopy resolve \
-    --input /path/to/formatted/input \
-    --output-dir /path/to/resolved/output
-```
-```console
-taxonopy common-names \
-    --resolved-dir /path/to/resolved/output \
-    --output-dir /path/to/resolved_with_common-names/output
-```
+## Documentation
+See https://imageomics.github.io/TaxonoPy for documentation on installation, usage, and more.
 
 ## Development
 See the [Wiki Development Page](https://github.com/Imageomics/TaxonoPy/wiki/Development) for development instructions.
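
The cache section removed from the README describes namespaces of the form `resolve_v<version>_<fingerprint>`, where the fingerprint is a SHA-256 hash of the input files' metadata. As a rough illustration of that idea only, the sketch below hashes each file's relative path, size, and modification time. The diff does not say which metadata fields TaxonoPy hashes or whether the digest is truncated, so `fingerprint_inputs`, `cache_namespace`, the chosen fields, and the 16-character truncation are all assumptions rather than the package's actual implementation.

```python
import hashlib
from pathlib import Path


def fingerprint_inputs(input_path):
    """SHA-256 over (relative path, size, mtime) of every input file.

    Hypothetical: the metadata fields used here are an assumption.
    """
    root = Path(input_path)
    is_dir = root.is_dir()
    # Sort so the digest does not depend on directory listing order.
    files = sorted(p for p in root.rglob("*") if p.is_file()) if is_dir else [root]

    digest = hashlib.sha256()
    for f in files:
        stat = f.stat()
        rel = f.relative_to(root).as_posix() if is_dir else f.name
        digest.update(f"{rel}|{stat.st_size}|{stat.st_mtime_ns}\n".encode())
    return digest.hexdigest()


def cache_namespace(version, input_path, length=16):
    """Build a namespace name shaped like resolve_v<version>_<fingerprint>.

    Hypothetical: truncating the digest to `length` characters is an assumption.
    """
    return f"resolve_v{version}_{fingerprint_inputs(input_path)[:length]}"
```

Because only metadata is hashed, the namespace stays the same across runs on an unchanged dataset and changes when a file is added, resized, or rewritten, which is the stability property the removed text describes.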
