Soil Health Knowledge Graph

✨ Abstract

Soil health is fundamental to environmental sustainability and food security, yet relevant knowledge remains fragmented across diverse sources, hindering its effective application. Knowledge graphs (KGs) offer a robust solution by integrating disparate information into a structured, semantically rich format. Addressing this need, this paper presents an ontology-compliant soil health knowledge graph (SHKG) derived from domain literature, and the semi-automated, human-in-the-loop pipeline developed to construct it. Our pipeline leverages large language models (LLMs) to accelerate knowledge extraction, while incorporating expert oversight to ensure ontological compliance and accuracy. The resulting KG integrates unstructured knowledge into 11,719 RDF triples representing 2,017 entities, including 1,785 soil-related concepts. The KG's fidelity was confirmed by soil scientists through a validation process involving competency questions. We demonstrate the KG's primary use case as the backbone for a knowledge discovery system in soil science. The KG, supporting ontology, and the source code of the pipeline are available here.


📚 Knowledge Sources

This work draws on the following primary resources:

  • EEA (2023). Soil monitoring in Europe – Indicators and thresholds for soil health assessments.
  • EEA (2024). The state of soils in Europe – Fully evidenced, spatially organised assessment of the pressures driving soil degradation.

🧩 Conceptual Model

The high-level structure of the SHKG follows the conceptual model from the EEA 2023 report (Figure 1.1). We have RDF-ized this model into our top-level schema:

Conceptual Model (EEA 2023 report Figure 1.1)


📈 Overview of the soil health KG

Illustration of the high-level structure of the soil health KG:

Soil Health KG overview

🛠️ Pipeline of KG Construction

We utilized a pipeline that incorporates LLMs for the extraction of relevant information from the source text, followed by post-processing and alignment with established ontologies:

Text2KG pipeline


📊 KG Evaluation

To assess the quality of LLM-generated RDF triples, we compare them against human-labeled gold-standard triples using a set of complementary graph-matching metrics.

Evaluation Metrics

The metrics are implemented in eval_graphs/graph_matching.py and used in the pipeline notebook KGC_pipeline.ipynb:

| Metric | Description |
|---|---|
| Triple-Matching Precision | Fraction of predicted triples that exactly match gold standard triples (case-insensitive) |
| Triple-Matching Recall | Fraction of gold standard triples that are correctly predicted |
| Triple-Matching F1 | Harmonic mean of precision and recall for exact triple matches |
| G-ROUGE | Graph-level ROUGE score treating each edge as a sentence; measures n-gram overlap between predicted and gold graphs |
| G-BLEU | Graph-level BLEU score for evaluating the quality of generated triples using n-gram precision |
| G-BERTScore | Semantic similarity between predicted and gold edges using contextualized BERT embeddings with optimal bipartite matching |
| Graph Edit Distance (GED) | Minimum number of node/edge insertions, deletions, and substitutions needed to transform the predicted graph into the gold graph (normalized) |

These metrics provide complementary views: exact matching (Precision/Recall/F1), surface-level similarity (ROUGE/BLEU), semantic similarity (BERTScore), and structural similarity (GED).
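As an illustration of the exact-matching metrics, the following is a minimal sketch, assuming triples are plain `(subject, predicate, object)` string tuples. It is not the code in eval_graphs/graph_matching.py; the function names and the example triples are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Strip and lower-case each term so that matching is case-insensitive."""
    s, p, o = triple
    return (s.strip().lower(), p.strip().lower(), o.strip().lower())


def triple_matching_scores(
    predicted: Iterable[Triple], gold: Iterable[Triple]
) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over sets of normalized triples."""
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    matched = len(pred_set & gold_set)

    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example triples (not taken from the benchmark files).
predicted = [
    ("Soil Erosion", "affects", "Soil Health"),
    ("Compaction", "affects", "Crop Yield"),
]
gold = [
    ("soil erosion", "affects", "soil health"),
    ("soil sealing", "affects", "soil health"),
]

scores = triple_matching_scores(predicted, gold)
print(scores)
```

Here one of the two predicted triples matches one of the two gold triples once case is ignored, so precision, recall and F1 are all 0.5.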


📦 Repository Contents

.
├── LICENSE
├── README.md
├── requirements.txt          # Python dependencies
├── KGC_pipeline.ipynb        # Jupyter notebook demonstrating the full KG-construction pipeline
├── uk2us.py                  # Utility script (UK ↔ US spelling normalizer)
│
├── top_level_KG.ttl          # High-level structure of the SHKG, derived from the conceptual model (RDF/Turtle)
├── soil_health_KG.ttl        # Full Soil Health KG (RDF/Turtle)
├── soil_health_SKOS.ttl      # SKOS version of SHKG for publishing on AgroPortal (RDF/Turtle)
├── shKG_metadata.ttl         # Metadata describing the KG
├── example_SWR.trig          # Example SoilWise knowledge repository (TriG)
│
├── CQs_sparql_queries/       # SPARQL queries translated from competency questions
├── ex_ontovocabs/            # Linked external vocabularies & thesauri
├── in_ontovocabs/            # Imported ontologies & schemas
├── kg_validation/            # Raw KG validation results
├── benchmarks/
│   ├── text_RDF_gs.json       # Text-to-RDF gold standard benchmark
│   └── CQs_SPARQL_ea.json     # Competency question, SPARQL query, and expected answer dataset for KG validation
├── eval_graphs/
│   ├── graph_matching.py      # Graph matching metrics (P/R/F1, ROUGE, BLEU, BERTScore, GED)
│   └── RDF_LLMs_raw.json      # Raw LLM-generated RDF triples for evaluation
├── imgs/
└── …

🚀 Quick Start

  1. Clone this repository

    git clone https://github.com/soilwise-he/soil-health-knowledge-graph.git
    cd soil-health-knowledge-graph
  2. Install dependencies

    pip install -r requirements.txt
  3. Explore the KG

    • Load the main graph in Python or any RDF tool:

      from rdflib import Graph
      g = Graph().parse("soil_health_KG.ttl", format="turtle")
      print(len(g), "triples loaded")
    • Run example SPARQL queries in CQs_sparql_queries/ or via the public endpoint at: https://repository.soilwise-he.eu/sparql/

  4. Run the pipeline

    Open and run KGC_pipeline.ipynb to see:

    • LLM‑driven triple generation (via GPT‑5 prompts)
    • Turtle syntax check & repair
    • Ontology alignment, entity normalization & relation disambiguation
    • KG enrichment (invertible relations, external vocabularies)
    • KG evaluation (triple-matching P/R/F1, G-ROUGE, G-BLEU, G-BERTScore, GED)
    • KG validation (by competency questions)
    • Example SoilWise knowledge repository (interlink with harvested Zenodo metadata records)
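To make the enrichment step more concrete, here is a minimal sketch of materializing inverse relations, assuming "invertible relations" means predicates that have a declared inverse. It works on plain string tuples rather than an RDF graph, and the `ex:` predicate names, the `INVERSES` mapping, and the function name are hypothetical rather than taken from the notebook.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def add_inverse_triples(
    triples: Iterable[Triple], inverse_of: Dict[str, str]
) -> Set[Triple]:
    """Return the input triples plus one inverse triple per invertible edge."""
    # Make the mapping usable in both directions (p -> q and q -> p).
    both_ways = dict(inverse_of)
    both_ways.update({inv: pred for pred, inv in inverse_of.items()})

    enriched = set(triples)
    for s, p, o in list(enriched):
        inverse = both_ways.get(p)
        if inverse is not None:
            # (s, p, o) implies (o, inverse(p), s).
            enriched.add((o, inverse, s))
    return enriched


# Hypothetical predicate pair and triples, for illustration only.
INVERSES = {"ex:affects": "ex:isAffectedBy"}
kg = {
    ("ex:SoilErosion", "ex:affects", "ex:SoilHealth"),
    ("ex:SoilHealth", "ex:hasIndicator", "ex:SoilOrganicCarbon"),
}

enriched = add_inverse_triples(kg, INVERSES)
for triple in sorted(enriched):
    print(triple)
```

Because the result is a set, running the function again on its own output adds nothing new, which keeps repeated enrichment passes safe.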

🔗 Resource Availability


🔗 Imported Ontologies & Schemas

To ensure our soil health KG aligns with recognized standards, we incorporate a variety of well-established ontologies and schemas.

The KG uses 19 classes and 205 properties to formally define the types of entities and their relationships. All 19 classes come from existing ontologies; of the 205 properties, 45 are defined by us and the remaining 160 come from existing ontologies.

🔗 Linked Vocabularies & Thesauri

The KG is enriched by interlinking to controlled vocabularies and thesauri in the field of soil science to align with standard terminologies.


💡 Use Cases

  1. Semantic backbone for a broader SoilWise knowledge repository; an example of interlinking with harvested Zenodo metadata records is provided.
  2. Natural-language question answering over the KG via NL → SPARQL
  3. Benchmark for text2KG: converting scientific text → ontology-compliant RDF
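For the question-answering use case, whatever query an NL → SPARQL step produces still has to be sent to a SPARQL endpoint. The sketch below only builds the request URL, following the SPARQL 1.1 Protocol convention of passing the query in a `query` parameter. That the public endpoint accepts GET requests of this form is an assumption, and the helper name and example query are hypothetical.

```python
from urllib.parse import urlencode

# Public endpoint named in the Quick Start section.
ENDPOINT = "https://repository.soilwise-he.eu/sparql/"


def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Build a GET URL that carries the query in the 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Generic example query: count all triples visible to the endpoint.
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

url = build_sparql_get_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with `urllib.request.urlopen`, typically with an `Accept: application/sparql-results+json` header, if the endpoint supports that result format.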

🗣️ Feedback

  • Concept-specific comments
    To leave comments on any individual concept, visit the Soilwise HE Data and Knowledge hub, search for your concept of interest, then scroll down to the Comments section (as shown in the screenshot below) and post your feedback directly there.

    VocView Comments Section

  • Missing concepts
    If you believe a soil‑health concept is missing from the SHKG, please open a new GitHub issue to let us know.


📝 How to Cite

@inproceedings{wang2025soil,
  author    = {Beichen Wang and Luís Moreira de Sousa and Anna Fensel},
  title     = {Make soil healthy again: Construction of ontology-compliant soil health knowledge graph with large language models},
  booktitle = {Proceedings of the 13th Knowledge Capture Conference 2025},
  year      = {2025},
  doi       = {10.1145/3731443.3771730}
}

🙏 Acknowledgements

This work was supported by the EU's Horizon Europe research and innovation programme within the SoilWise project (grant agreement ID: 101112838).

📝 To-do

See Issues for planned tasks and enhancements.


📄 License

  • Code: MIT License (see LICENSE)
  • Data & Ontologies: CC BY 4.0 (Creative Commons Attribution 4.0 International)
