The incorporation of molecular data into biodiversity informatics has expanded rapidly in recent years, with more than five million sequence-derived records from the International Nucleotide Sequence Database Collaboration (INSDC) now accessible via GBIF. Despite this growth, the extent to which these data are used in scientific research has not been systematically evaluated. This project proposes the development of a data pipeline to trace the use and citation of sequence-based GBIF records in the scholarly literature, leveraging the GBIF Literature API and associated services.
We will begin by identifying GBIF datasets sourced from INSDC and linking them to peer-reviewed publications that cite these data through GBIF-mediated downloads, focusing on literature types categorised as ""journal"", ""working paper"", or ""book"". Metadata will be compiled for each citation, including DOIs, download keys, country of collection, collection year, and unique occurrence identifiers (gbifID). Additional publication metadata, such as author affiliations, topics, and open access status, will be retrieved via the API.
Analyses will explore temporal, geographic, and thematic patterns in the use of sequence data, with visual outputs including maps of author affiliations and topic co-occurrence networks. A further objective is to identify occurrences of IUCN Red List species within cited datasets, thereby assessing the conservation relevance of sequence-based research.
The resulting workflow will support reproducibility and policy integration, contributing to the aims of the BMD project and enhancing the visibility and utility of molecular biodiversity data.
Niels Billiet, Mathias Dillen, Rob Waterhouse