Commit 1dad440

author
Dahlia Li
committed
WIP contributing guide docs
1 parent b61e86c commit 1dad440

4 files changed

+4049
-3
lines changed

Lines changed: 26 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,30 @@
11
# Contributing
22

3-
We welcome contributions to TaxonoPy. More detailed guidance will be added here.
3+
We welcome contributions to TaxonoPy.
4+
5+
---
6+
7+
8+
## Contribution Opportunities
9+
10+
Documented failure cases are valuable inputs for improving TaxonoPy.
11+
Contributions may include:
12+
13+
* documenting additional failure patterns
14+
* proposing secondary tie-breaking heuristics
15+
* extending existing resolution profiles
16+
* adding dataset-specific disambiguation rules
17+
18+
Clear documentation of *why* a resolution fails is often as important as
19+
resolving it.
20+
21+
If you encounter recurring failure modes, consider opening an issue with:
22+
23+
* example UUIDs
24+
* trace output
25+
* GNVerifier results
26+
* proposed resolution logic
27+
28+
---
429

530
If you have suggestions or run into a bug, please open an issue at [https://github.com/Imageomics/TaxonoPy/issues](https://github.com/Imageomics/TaxonoPy/issues).
Lines changed: 143 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,143 @@
1+
# Failure Analysis Workflow
2+
3+
A significant portion of TaxonoPy development involves understanding *why* certain taxonomic resolutions fail and whether those failures are expected, data-driven, or indicative of missing strategy coverage.
4+
5+
This workflow was developed during large-scale resolution of the **EOL dataset**, but applies broadly to other sources.
6+
7+
---
8+
9+
## 1. Identify Failed Resolution Entries
10+
11+
Start by locating entries marked as failed in resolved Parquet outputs.
12+
A common failure status encountered during analysis is:
13+
14+
* `FAILED_FORCED_INPUT`
15+
16+
Example command:
17+
18+
```bash
19+
parquet cat <resolved_parquet_files> \
20+
| grep FAILED_FORCED_INPUT \
21+
| head \
22+
| jq
23+
```
24+
25+
This step yields candidate UUIDs for deeper inspection.
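
The same filtering can be sketched in Python when the records need further processing. This is a minimal sketch, not part of TaxonoPy: it assumes the records arrive as one JSON object per line (as the pipe into `jq` above suggests) and that each record carries a `uuid` field alongside `resolution_status`. The helper name `failed_uuids` and the sample records are hypothetical.

```python
import json


def failed_uuids(json_lines, status="FAILED_FORCED_INPUT"):
    """Collect the UUIDs of records whose resolution_status equals `status`.

    `json_lines` is any iterable of strings, one JSON object per line.
    The `uuid` field name is an assumption about the output schema.
    """
    uuids = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("resolution_status") == status:
            uuids.append(record.get("uuid"))
    return uuids


# Hypothetical sample records, for illustration only.
sample = [
    '{"uuid": "a1", "resolution_status": "FAILED_FORCED_INPUT"}',
    '{"uuid": "b2", "resolution_status": "SOME_OTHER_STATUS"}',
]
print(failed_uuids(sample))
```

Unlike a plain `grep`, this matches only on the status field, so a record that merely mentions the status string elsewhere is not picked up.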
26+
27+
---
28+
29+
## 2. Compare Raw Input vs. Final Resolution
30+
31+
For each failed UUID, compare the **raw input taxonomy** with the **final resolved output**.
32+
33+
Typical fields to inspect include:
34+
35+
* `scientific_name`
36+
* `kingdom` through `genus`
37+
* `source_dataset`
38+
* `resolution_status`
39+
* `resolution_strategy`
40+
41+
This comparison often reveals inconsistencies in the input taxonomy (e.g., genus assignments that differ from authoritative sources).
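
The comparison can be sketched as a small field-by-field diff. This is an illustrative sketch only: it assumes both the raw and the resolved record are available as dictionaries, and that the rank columns are the standard `kingdom`, `phylum`, `class`, `order`, `family`, `genus`. The helper name `differing_fields` and the placeholder values are hypothetical.

```python
# Assumed rank column names; adjust to the actual schema.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def differing_fields(raw, resolved, fields=None):
    """Return {field: (raw_value, resolved_value)} for every field that differs."""
    if fields is None:
        fields = RANKS + ["scientific_name"]
    diffs = {}
    for field in fields:
        raw_value = raw.get(field)
        resolved_value = resolved.get(field)
        if raw_value != resolved_value:
            diffs[field] = (raw_value, resolved_value)
    return diffs


# Placeholder records, not real taxa.
raw_entry = {"kingdom": "Animalia", "genus": "GenusA", "scientific_name": "GenusA example"}
resolved_entry = {"kingdom": "Animalia", "genus": "GenusB", "scientific_name": "GenusA example"}
print(differing_fields(raw_entry, resolved_entry))
```

Only the fields that disagree are reported, which makes a conflicting genus assignment easy to spot.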
42+
43+
---
44+
45+
## 3. Trace Resolution Decisions
46+
47+
Use the `trace` command to inspect how TaxonoPy attempted to resolve the entry and why it failed.
48+
49+
Example:
50+
51+
```bash
52+
taxonopy --cache-dir <cache_directory> \
53+
trace entry \
54+
--uuid "<UUID>" \
55+
--from-input <source_dataset_directory> \
56+
--verbose
57+
```
58+
59+
The trace output provides:
60+
61+
* grouping information
62+
* query plan (term, rank, source)
63+
* resolution strategies attempted
64+
* explicit failure reasons
65+
* metadata used for match selection
66+
67+
---
68+
69+
## 4. Verify Against External Authorities (GNVerifier)
70+
71+
To determine whether a failure is due to missing data or genuine ambiguity,
72+
independently verify the same taxonomic name using **Global Names Verifier**.
73+
74+
=== "CLI / Alias Usage"
75+
76+
```bash
77+
gnverifier -j 1 \
78+
--format compact \
79+
--capitalize \
80+
--all_matches \
81+
--sources 11 \
82+
"<scientific_name>" | jq
83+
```
84+
85+
This approach uses the GNVerifier command-line tool directly and is
86+
suitable for shell-based workflows and batch inspection.
87+
88+
=== "API Usage (Programmatic)"
89+
90+
```bash
91+
curl -X POST "https://verifier.globalnames.org/api/v1/verifications" \
92+
-H "Content-Type: application/json" \
93+
-d '{
94+
"names": ["<scientific_name>"],
95+
"capitalize": true,
96+
"sources": [11]
97+
}' | jq
98+
```
99+
100+
This method uses the GNVerifier HTTP API and is appropriate for
101+
integration into automated pipelines or custom applications.
102+
103+
This step confirms whether multiple accepted records exist in authoritative
104+
sources such as GBIF.
105+
106+
---
107+
108+
## 5. Common Failure Pattern: Multi-Accepted Match Tie
109+
110+
Across analyzed EOL cases, the most frequent failure pattern observed was:
111+
112+
> **Tie between multiple accepted results with equal taxonomic matches**
113+
114+
These failures are typically produced by the strategy:
115+
116+
* `ExactMatchPrimarySourceMultiAcceptedTaxonomicMatch`
117+
118+
Example failure reason from trace output:
119+
120+
```json
121+
{
122+
"failure_reason": "Tie between N results with equal taxonomic matches"
123+
}
124+
```
125+
126+
---
127+
128+
## 6. Why This Strategy Fails
129+
130+
This strategy is intentionally conservative:
131+
132+
* it prioritizes correctness over forced resolution
133+
* it fails when multiple equally valid “best” matches exist
134+
* it avoids arbitrary selection without clear disambiguation signals
135+
136+
However, analysis shows that many tied matches differ subtly in ways not currently used for secondary discrimination, such as:
137+
138+
* author or publication year suffixes
139+
* infra-specific placeholders (e.g., `spec`)
140+
* rank depth differences
141+
* minor spelling or canonical variations
142+
143+
---

mkdocs.yml

Lines changed: 14 additions & 2 deletions
@@ -8,7 +8,10 @@ nav:
88
- TaxonoPy:
99
- User guide: index.md
1010
- Quick reference: user-guide/quick-reference.md
11-
- CLI help reference: command_line_usage/help.md
11+
- CLI help reference:
12+
- command_line_usage/help.md
13+
- command_line_usage/tutorial.md
14+
1215
- Installation: user-guide/installation.md
1316
- IO:
1417
- user-guide/io/index.md
@@ -18,7 +21,13 @@ nav:
1821
- Development:
1922
- Contributing:
2023
- development/contributing/index.md
24+
<<<<<<< HEAD
2125
- Acknowledgments: acknowledgments.md
26+
=======
27+
- Failure Analysis Workflow:
28+
- development/failure_analysis_workflow/index.md
29+
- Acknowledgements: acknowledgements.md
30+
>>>>>>> 01fb7a8 (WIP contributing guide docs)
2231

2332
theme:
2433
name: material
@@ -46,7 +55,7 @@ theme:
4655
- content.code.copy
4756
- content.code.annotate
4857
- content.tooltips
49-
58+
- content.tabs.link
5059
extra_css:
5160
- stylesheets/extra.css
5261

@@ -79,3 +88,6 @@ markdown_extensions:
7988
- pymdownx.details
8089
- pymdownx.highlight
8190
- pymdownx.superfences
91+
- pymdownx.superfences
92+
- pymdownx.tabbed:
93+
alternate_style: true
