Soil health is fundamental to environmental sustainability and food security, yet relevant knowledge remains fragmented across diverse sources, hindering its effective application. Knowledge graphs (KGs) offer a robust solution by integrating disparate information into a structured, semantically rich format. Addressing this need, this paper presents an ontology-compliant soil health knowledge graph (SHKG) derived from domain literature, and the semi-automated, human-in-the-loop pipeline developed to construct it. Our pipeline leverages large language models (LLMs) to accelerate knowledge extraction, while incorporating expert oversight to ensure ontological compliance and accuracy. The resulting KG integrates unstructured knowledge into 11,719 RDF triples representing 2,017 entities, including 1,785 soil-related concepts. The KG's fidelity was confirmed by soil scientists through a validation process involving competency questions. We demonstrate the KG's primary use case as the backbone for a knowledge discovery system in soil science. The KG, supporting ontology, and the source code of the pipeline are available here.
This work draws on the following primary resources:
- EEA (2023). Soil monitoring in Europe – Indicators and thresholds for soil health assessments.
- EEA (2024). The state of soils in Europe – Fully evidenced, spatially organised assessment of the pressures driving soil degradation.
The high-level structure of the SHKG follows the conceptual model from the EEA 2023 report (Figure 1.1). We have RDF-ized this model into our top-level schema:
- RDF representation: see top_level_KG.ttl
Illustration of the high-level structure of the soil health KG:
We utilized a pipeline that incorporates LLMs for the extraction of relevant information from the source text, followed by post-processing and alignment with established ontologies:
To assess the quality of LLM-generated RDF triples, we compare them against human-labeled gold standard triples using a comprehensive set of graph-matching metrics. The evaluation compares:
- LLM-generated triples: eval_graphs/RDF_LLMs_raw.json
- Gold standard triples: benchmarks/text_RDF_gs.json
The metrics are implemented in eval_graphs/graph_matching.py and used in the pipeline notebook KGC_pipeline.ipynb:
| Metric | Description |
|---|---|
| Triple-Matching Precision | Fraction of predicted triples that exactly match gold standard triples (case-insensitive) |
| Triple-Matching Recall | Fraction of gold standard triples that are correctly predicted |
| Triple-Matching F1 | Harmonic mean of precision and recall for exact triple matches |
| G-ROUGE | Graph-level ROUGE score treating each edge as a sentence; measures n-gram overlap between predicted and gold graphs |
| G-BLEU | Graph-level BLEU score for evaluating the quality of generated triples using n-gram precision |
| G-BERTScore | Semantic similarity between predicted and gold edges using contextualized BERT embeddings with optimal bipartite matching |
| Graph Edit Distance (GED) | Minimum number of node/edge insertions, deletions, and substitutions needed to transform the predicted graph into the gold graph (normalized) |
These metrics provide complementary views: exact matching (Precision/Recall/F1), surface-level similarity (ROUGE/BLEU), semantic similarity (BERTScore), and structural similarity (GED).
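As an illustration of the exact-matching metrics in the table, here is a minimal sketch of case-insensitive triple-matching precision, recall, and F1. It is not the code in eval_graphs/graph_matching.py: the function names, the whitespace stripping, the use of sets (which collapses duplicate triples), and the toy triples are all assumptions made for this example.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lower-case and strip each component so matching is case-insensitive."""
    subject, predicate, obj = triple
    return (subject.strip().lower(), predicate.strip().lower(), obj.strip().lower())


def triple_matching_scores(predicted: Iterable[Triple], gold: Iterable[Triple]):
    """Return (precision, recall, F1) for exact triple matches.

    Assumption: duplicates are collapsed by comparing sets of triples.
    """
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    matched = len(pred_set & gold_set)

    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data invented for illustration only (not taken from the benchmark files).
predicted = [
    ("Soil erosion", "affects", "Soil health"),
    ("Compaction", "reduces", "Infiltration"),
]
gold = [
    ("soil erosion", "affects", "soil health"),
    ("Compaction", "decreases", "Infiltration"),
    ("Salinization", "affects", "Crop yield"),
]

p, r, f1 = triple_matching_scores(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case one of the two predicted triples matches one of the three gold triples, so precision is 1/2 and recall is 1/3. The second predicted triple differs only in its predicate, which exact matching penalizes fully; that gap is what the surface-level and semantic metrics are meant to cover.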
.
├── LICENSE
├── README.md
├── requirements.txt # Python dependencies
├── KGC_pipeline.ipynb # Jupyter notebook demonstrating the full KG-construction pipeline
├── uk2us.py # Utility script (UK → US spelling normalizer)
│
├── top_level_KG.ttl # High-level structure of the SHKG, derived from the conceptual model (RDF/Turtle)
├── soil_health_KG.ttl # Full Soil Health KG (RDF/Turtle)
├── soil_health_SKOS.ttl # SKOS version of SHKG for publishing on AgroPortal (RDF/Turtle)
├── shKG_metadata.ttl # Metadata describing the KG
├── example_SWR.trig # Example SoilWise knowledge repository (TriG)
│
├── CQs_sparql_queries/ # SPARQL queries translated from competency questions
├── ex_ontovocabs/ # Linked external vocabularies & thesauri
├── in_ontovocabs/ # Imported ontologies & schemas
├── kg_validation/ # Raw KG validation results
├── benchmarks/
│   ├── text_RDF_gs.json # Text-to-RDF gold standard benchmark
│   └── CQs_SPARQL_ea.json # Competency question, SPARQL query, and expected answer dataset for KG validation
├── eval_graphs/
│   ├── graph_matching.py # Graph matching metrics (P/R/F1, ROUGE, BLEU, BERTScore, GED)
│   └── RDF_LLMs_raw.json # Raw LLM-generated RDF triples for evaluation
└── imgs/
    └── …
- Clone this repository

      git clone https://github.com/soilwise-he/soil-health-knowledge-graph.git
      cd soil-health-knowledge-graph

- Install dependencies

      pip install -r requirements.txt

- Explore the KG
  - Load the main graph in Python or any RDF tool:

        from rdflib import Graph

        g = Graph().parse("soil_health_KG.ttl", format="turtle")
        print(len(g), "triples loaded")

  - Run example SPARQL queries in CQs_sparql_queries/ or via the public endpoint at: https://repository.soilwise-he.eu/sparql/
- Run the pipeline

  Open and run KGC_pipeline.ipynb to see:
  - LLM-driven triple generation (via GPT-5 prompts)
  - Turtle syntax check & repair
  - Ontology alignment, entity normalization & relation disambiguation
  - KG enrichment (invertible relations, external vocabularies)
  - KG evaluation (triple-matching P/R/F1, G-ROUGE, G-BLEU, G-BERTScore, GED)
  - KG validation (by competency questions)
  - Example SoilWise knowledge repository (interlink with harvested Zenodo metadata records)
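
To make the "invertible relations" enrichment step more concrete, here is a minimal sketch of one way such a step could work: for every triple whose predicate has a known inverse, the reversed triple is added. This is an assumption about the general technique, not the notebook's actual code. The `INVERSE_OF` mapping, the `add_inverse_triples` helper, and the `ex:` identifiers are hypothetical; the only fact relied on is that `skos:broader` and `skos:narrower` are declared inverses in SKOS.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical inverse-property table for illustration.
# skos:broader and skos:narrower are inverses of each other in SKOS.
INVERSE_OF: Dict[str, str] = {
    "skos:broader": "skos:narrower",
    "skos:narrower": "skos:broader",
}


def add_inverse_triples(
    triples: Iterable[Triple], inverse_of: Dict[str, str] = INVERSE_OF
) -> Set[Triple]:
    """Return the input triples plus the inverse of every invertible triple."""
    original = set(triples)
    enriched = set(original)
    for subject, predicate, obj in original:
        inverse = inverse_of.get(predicate)
        if inverse is not None:
            # Swap subject and object and use the inverse predicate.
            enriched.add((obj, inverse, subject))
    return enriched


# Toy triples with made-up ex: identifiers (not taken from the SHKG).
triples = {
    ("ex:SoilErosion", "skos:broader", "ex:SoilDegradation"),
    ("ex:SoilErosion", "skos:prefLabel", '"soil erosion"'),
}

enriched = add_inverse_triples(triples)
for triple in sorted(enriched):
    print(triple)
```

Only predicates listed in the mapping are inverted, so literal-valued triples such as the label above are left untouched.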
- Interactive Browser: https://soilwise-he.github.io/soil-health
- SPARQL Endpoint: https://repository.soilwise-he.eu/sparql/
- Searchable Vocabulary Browser: https://voc.soilwise-he.containers.wur.nl/
- AgroPortal Instance: https://agroportal.lirmm.fr/ontologies/SHKG
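
The SPARQL endpoint listed above can presumably be queried programmatically. The sketch below uses only the Python standard library and assumes the endpoint accepts the standard SPARQL 1.1 Protocol (a GET request with a `query` parameter) and can return JSON results; neither has been verified against this deployment. The helper names `build_request` and `run_select`, and the generic query, are illustrative.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://repository.soilwise-he.eu/sparql/"

# Generic query for illustration; replace it with one from CQs_sparql_queries/.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(endpoint: str, query: str, timeout: float = 30.0) -> list:
    """Send the query and return the list of result bindings.

    Assumption: the endpoint answers in the SPARQL JSON results format.
    """
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        results = json.load(response)
    return results["results"]["bindings"]


# Building the request needs no network access; call run_select() to send it.
request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Calling `run_select(ENDPOINT, QUERY)` sends the request and returns one dictionary per result row, keyed by variable name, provided the assumptions above hold.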
To ensure our soil health KG aligns with recognized standards, we incorporate a variety of well-established ontologies and schemes.
- SKOS Core
- Dublin Core
- RDF Schema
- Agrontology
- Semanticscience Integrated Ontology (SIO)
- Open Biological and Biomedical Ontology (OBO)
- QUDT
- Ontology of Units of Measure (OM)
- PROV-O
- Schema.org
- SWEET ontology
- Wikidata
- Biolink Model
- Allotrope Foundation Ontology
- REPRODUCE-ME Ontology
- BioAssay Ontology (BAO)
- Time Ontology
The KG uses 19 classes and 205 properties to formally define the types of entities and their relationships. All 19 classes come from the ontologies listed above. Of the 205 properties, 45 are defined by us and the remaining 160 come from existing ontologies.
The KG is enriched by interlinking to controlled vocabularies and thesauri in the field of soil science to align with standard terminologies.
- Semantic backbone for a broader SoilWise knowledge repository (an example of interlinking with harvested Zenodo metadata records is provided)
- Natural-language question answering over the KG via NL → SPARQL
- Benchmark for text2KG: converting scientific text → ontology-compliant RDF
- Concept-specific comments

  To leave comments on any individual concept, visit the Soilwise HE Data and Knowledge hub, search for your concept of interest, then scroll down to the Comments section (as shown in the screenshot below) and post your feedback directly there.
- Missing concepts

  If you believe a soil-health concept is missing from the SHKG, please open a new GitHub issue to let us know.
@inproceedings{wang2025soil,
author = {Beichen Wang and Luís Moreira de Sousa and Anna Fensel},
title = {Make soil healthy again: Construction of ontology-compliant soil health knowledge graph with large language models},
booktitle = {Proceedings of the 13th Knowledge Capture Conference 2025},
year = {2025},
doi = {10.1145/3731443.3771730}
}

This work was supported by the EU's Horizon Europe research and innovation programme within the SoilWise project (grant agreement ID: 101112838).
See Issues for planned tasks and enhancements.
- Code: MIT License (see LICENSE)
- Data & Ontologies: CC BY 4.0 (Creative Commons Attribution 4.0 International)


