|
1 | | -# TaxonoPy |
| 1 | +<h1 align="center"> |
| 2 | + <img src="docs/_assets/taxonopy_banner.svg" alt="TaxonoPy banner"> |
| 3 | +</h1> |
2 | 4 |
|
3 | 5 | [](https://doi.org/10.5281/zenodo.15499454) |
4 | 6 |
|
5 | 7 | [](https://pypi.org/project/taxonopy) |
6 | 8 | [](https://pypi.org/project/taxonopy) |
7 | 9 |
|
8 | | -`TaxonoPy` (taxon-o-py) is a command-line tool for creating an internally consistent taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier). See below for the structure of inputs and outputs. |
| 10 | +## TaxonoPy: Reproducible, Traceable, and Scalable Biological Taxonomy Alignment |
9 | 11 |
|
10 | | -## Purpose |
11 | | -The motivation for this package is to create an internally consistent and standardized classification set for organisms in a large biodiversity dataset composed from different data providers that may use very similar and overlapping but not identical taxonomic hierarchies. |
| 12 | +TaxonoPy (taxon-o-pie) is a command-line tool for harmonizing large biodiversity datasets into a consistent taxonomy ready for AI applications. Built on the [Global Names Verifier (GNVerifier)](https://github.com/gnames/gnverifier), it provides complete provenance tracking, flexible resolution strategies, and batch processing of 100M+ records to address challenges in reproducibility and scale in massive multi-source taxonomy alignment. |
12 | 13 |
|
13 | | -Its development has been driven by its application in the TreeOfLife-200M (TOL) dataset. This dataset contains over 200 million samples of organisms from four core data providers: |
14 | | - |
15 | | -- [The Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/) |
16 | | -- [BIOSCAN-5M](https://biodiversitygenomics.net/projects/5m-insects/) |
17 | | -- [FathomNet](https://www.fathomnet.org/) |
18 | | -- [The Encyclopedia of Life (EOL)](https://eol.org/) |
19 | | - |
20 | | -The names (and classification) of taxa may be (and often are) inconsistent across these resources. This package addresses this problem by creating an internally consistent classification set for such taxa. |
21 | | - |
22 | | -### Input |
23 | | - |
24 | | -A directory containing Parquet partitions of the seven-rank Linnaean taxonomic metadata for organisms in the dataset. Labels should include: |
25 | | -- `uuid`: a unique identifier for each sample (required). |
26 | | -- `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`: the taxonomic ranks of the organism (required, may have sparsity). |
27 | | -- `scientific_name`: the scientific name of the organism, to the most specific rank available (optional). |
28 | | -- `common_name`: the common (i.e. vernacular) name of the organism (optional). |
29 | | - |
30 | | -See the example data in |
31 | | -- `examples/input/sample.parquet` |
32 | | -- `examples/resolved/sample.resolved.parquet` (generated with [`taxonopy resolve`](#command-resolve)) |
33 | | -- `examples/resolved_with_common_names/sample.resolved.parquet` (generated with [`taxonopy common-names`](#command-common-names)) |
34 | | - |
35 | | -### Challenges |
36 | | -This taxonomy information is provided by each data provider and the original sources, but the classification can be... |
37 | | - |
38 | | -- **Inconsistent**: both between and within sources (e.g. kingdom Metazoa vs. Animalia). |
39 | | -- **Incomplete**: many samples are missing one or more ranks. Some have 'holes' where higher and lower ranks are present, but intermediate ranks are missing. |
40 | | -- **Incorrect**: some samples have incorrect classifications. This can come in the form of spelling errors, nonstandard idiosyncratic terms, or outdated classifications. |
41 | | -- **Ambiguous**: homonyms, synonyms, and other terms that can be interpreted in multiple ways unless handled systematically. |
42 | | - |
43 | | -Taxonomic authorities exist to standardize classification, but ... |
44 | | -- There are many authorities. |
45 | | -- They may disagree. |
46 | | -- A given organism may be missing from some. |
47 | | - |
48 | | -### Solution |
49 | | -`TaxonoPy` uses the taxonomic hierarchies provided by the TOL core data providers to query GNVerifier and create a standardized classification for each sample in the TOL dataset. It prioritizes the [GBIF Backbone Taxonomy](https://verifier.globalnames.org/data_sources/11), since this represents the largest part of the TOL dataset. Where GBIF misses, backup sources such as the [Catalogue of Life](https://verifier.globalnames.org/data_sources/1) and [Open Tree of Life (OTOL) Reference Taxonomy](https://verifier.globalnames.org/data_sources/179) are used. |
50 | | - |
51 | | -## Installation |
52 | | - |
53 | | -`TaxonoPy` can be installed with `pip` after setting up a virtual environment. |
54 | | - |
55 | | -### User Installation with `pip` |
56 | | - |
57 | | -To install the latest version of `TaxonoPy`, run: |
58 | | -```console |
59 | | -pip install taxonopy |
60 | | -``` |
61 | | - |
62 | | -### Usage |
63 | | -You may view the help for the command line interface by running: |
64 | | -```console |
65 | | -taxonopy --help |
66 | | -``` |
67 | | -This will show you the available commands and options: |
68 | | -```console |
69 | | -usage: taxonopy [-h] [--cache-dir CACHE_DIR] [--cache-input CACHE_INPUT] |
70 | | - [--show-cache-path] [--cache-stats] [--clear-cache] |
71 | | - [--show-config] [--version] |
72 | | - {resolve,trace,common-names} ... |
73 | | - |
74 | | -TaxonoPy: Resolve taxonomic names using GNVerifier and trace data provenance. |
75 | | - |
76 | | -positional arguments: |
77 | | - {resolve,trace,common-names} |
78 | | - resolve Run the taxonomic resolution workflow |
79 | | - trace Trace data provenance of TaxonoPy objects |
80 | | - common-names Merge vernacular names (post-process) into resolved outputs |
81 | | - |
82 | | -options: |
83 | | - -h, --help show this help message and exit |
84 | | - --cache-dir CACHE_DIR |
85 | | - Directory for TaxonoPy cache (can also be set with TAXONOPY_CACHE_DIR environment variable) (default: None) |
86 | | - --cache-input CACHE_INPUT |
87 | | - Input dataset path to compute cache stats for when no command is provided (default: None) |
88 | | - --show-cache-path Display the current cache directory path and exit (default: False) |
89 | | - --cache-stats Display statistics about the cache and exit (default: False) |
90 | | - --clear-cache Clear the TaxonoPy object cache. May be used in isolation. (default: False) |
91 | | - --show-config Show current configuration and exit (default: False) |
92 | | - --version Show version number and exit |
93 | | -``` |
94 | | - |
95 | | -### Cache behavior |
96 | | - |
97 | | -`taxonopy resolve` caches parsed entries, entry groups, and every resolution attempt chain using [`diskcache`](https://grantjenks.com/docs/diskcache/) as a stable provenance artifact tied to the TaxonoPy version and input dataset. By default the cache root is `~/.cache/taxonopy`, but you can override it by setting the environment variable `TAXONOPY_CACHE_DIR` or specifying `--cache-dir`. Its primary purpose is to support the `trace` command, which allows you to trace the provenance of any taxonomic entry resolved by TaxonoPy. |
98 | | - |
99 | | -- Each resolve run writes into `resolve_v<version>_<fingerprint>` where the fingerprint is a SHA-256 hash of the input files’ metadata, so namespaces stay stable per combination of dataset and package version. |
100 | | -- Inspect a namespace without rerunning by invoking `taxonopy --cache-dir <root> --cache-input <input> --cache-stats`, which reports total size, entry counts, and key-prefix breakdowns. Passing `--cache-stats` after `resolve` or `trace` performs the same check and exits. |
101 | | -- If both the namespace and the output directory already contain data, `taxonopy resolve` warns and exits unless you pass `--full-rerun`, which clears the cache namespace and output before proceeding. Use `--clear-cache` to wipe only the namespace. |
102 | | - |
103 | | -#### Command: `resolve` |
104 | | -The `resolve` command is used to perform taxonomic resolution on a dataset. It takes a directory of Parquet partitions as input and outputs a directory of resolved Parquet partitions. |
105 | | -``` |
106 | | -usage: taxonopy resolve [-h] -i INPUT -o OUTPUT_DIR |
107 | | - [--output-format {csv,parquet}] |
108 | | - [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] |
109 | | - [--log-file LOG_FILE] [--force-input] [--full-rerun] |
110 | | - [--batch-size BATCH_SIZE] [--all-matches] |
111 | | - [--capitalize] [--fuzzy-uninomial] [--fuzzy-relaxed] |
112 | | - [--species-group] [--refresh-cache] [--cache-stats] |
113 | | - |
114 | | -options: |
115 | | - -h, --help show this help message and exit |
116 | | - -i, --input INPUT Path to input Parquet or CSV file/directory |
117 | | - -o, --output-dir OUTPUT_DIR |
118 | | - Directory to save resolved and unsolved output files |
119 | | - --output-format {csv,parquet} |
120 | | - Output file format |
121 | | - --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} |
122 | | - Set logging level |
123 | | - --log-file LOG_FILE Optional file to write logs to |
124 | | - --force-input Force use of input metadata without resolution |
125 | | - --full-rerun Replace existing cache/output if detected for this input |
126 | | - |
127 | | -GNVerifier Settings: |
128 | | - --batch-size BATCH_SIZE |
129 | | - Max number of name queries per GNVerifier API/subprocess call |
130 | | - --all-matches Return all matches instead of just the best one |
131 | | - --capitalize Capitalize the first letter of each name |
132 | | - --fuzzy-uninomial Enable fuzzy matching for uninomial names |
133 | | - --fuzzy-relaxed Relax fuzzy matching criteria |
134 | | - --species-group Enable group species matching |
135 | | - |
136 | | -Cache Management: |
137 | | - --refresh-cache Force refresh of cached objects (input parsing, grouping) before running. |
138 | | - --cache-stats Display cache statistics for this input and exit. |
139 | | - ``` |
140 | | -It is recommended to keep GNVerifier settings at their defaults. |
141 | | - |
142 | | -#### Command: `trace` |
143 | | -The `trace` command is used to trace the provenance of a taxonomic entry. It takes a UUID and an input path as arguments and outputs the full path of the entry through TaxonoPy. |
144 | | -```console |
145 | | -usage: taxonopy trace [-h] {entry} ... |
146 | | - |
147 | | -positional arguments: |
148 | | - {entry} |
149 | | - entry Trace an individual taxonomic entry by UUID |
150 | | - |
151 | | -options: |
152 | | - -h, --help show this help message and exit |
153 | | - |
154 | | -usage: taxonopy trace entry [-h] --uuid UUID --from-input FROM_INPUT [--format {json,text}] [--verbose] |
155 | | - |
156 | | -options: |
157 | | - -h, --help show this help message and exit |
158 | | - --uuid UUID UUID of the taxonomic entry |
159 | | - --from-input FROM_INPUT |
160 | | - Path to the original input dataset |
161 | | - --format {json,text} Output format |
162 | | - --verbose Show full details including all UUIDs in group |
163 | | -``` |
164 | | - |
165 | | -#### Command: `common-names` |
166 | | -The `common-names` command is used to merge vernacular names into the resolved output. It takes a directory of resolved Parquet partitions as input and outputs a directory of resolved Parquet partitions with common names. |
167 | | -```console |
168 | | -usage: taxonopy common-names [-h] --resolved-dir ANNOTATION_DIR --output-dir OUTPUT_DIR |
169 | | - |
170 | | -options: |
171 | | - -h, --help show this help message and exit |
172 | | - --resolved-dir ANNOTATION_DIR |
173 | | - Directory containing your *.resolved.parquet files |
174 | | - --output-dir OUTPUT_DIR |
175 | | - Directory to write annotated .parquet files |
176 | | -``` |
177 | | -Note that the `common-names` command is a post-processing step and should be run after the `resolve` command. |
178 | | - |
179 | | -### Example Usage |
180 | | - |
181 | | -To perform taxonomic resolution on a dataset with subsequent common name annotation, run: |
182 | | -```console |
183 | | -taxonopy resolve \ |
184 | | - --input /path/to/formatted/input \ |
185 | | - --output-dir /path/to/resolved/output |
186 | | -``` |
187 | | -```console |
188 | | -taxonopy common-names \ |
189 | | - --resolved-dir /path/to/resolved/output \ |
190 | | - --output-dir /path/to/resolved_with_common-names/output |
191 | | -``` |
| 14 | +## Documentation |
| 15 | +See https://imageomics.github.io/TaxonoPy for documentation on installation, usage, and more. |
192 | 16 |
|
193 | 17 | ## Development |
194 | 18 | See the [Wiki Development Page](https://github.com/Imageomics/TaxonoPy/wiki/Development) for development instructions. |