odoma-ch/ssh-citation-index

SSH Citation Index Modules

⚠️ Work in Progress: This project is currently under active development. Features, APIs, and documentation are subject to change. Some components may be incomplete or experimental. Use in production environments is not recommended at this time.

Description

SSH Citation Index modules are a collection of AI modules for extracting, parsing, and disambiguating bibliographic references from publications in the Social Sciences and Humanities (SSH).

(Work on these modules is part of Odoma's contribution to deliverables D4.2 and D4.4 in WP4.)

Installation

Installation instructions are a work in progress. The project requires Python 3.10+ and the dependencies listed in requirements.txt.

Project Structure

citation_index/
├── src/citation_index/          # Core application code
│   ├── cli/                     # Command-line interface entry points
│   ├── core/                    # Domain logic and data models
│   │   ├── connectors/          # External API integrations for citation linking (OpenAlex, OpenCitations, Wikidata, Matilda)
│   │   ├── extractors/          # PDF extraction engines (Grobid, Marker, MinerU, PyMuPDF)
│   │   ├── models/              # Pydantic data models for references
│   │   ├── parsers/             # TEI-XML and bibliographic parsing
│   │   └── segmenters/          # Reference segmentation and localization
│   ├── llm/                     # LLM client bindings and prompt management
│   ├── pipelines/               # Extraction and parsing workflow orchestration
│   ├── evaluation/              # Metrics and evaluation scripts
│   └── utils/                   # Shared helper functions
├── tests/                       # Test suite mirroring src/ structure
├── benchmarks/                  # Evaluation datasets and scripts
│   ├── cex/                     # CEX benchmark dataset
│   ├── excite/                  # EXCITE dataset
│   ├── linkedbook/              # LinkedBooks dataset
│   ├── finetune/                # Fine-tuning datasets for LLMs
│   └── citation_linking/        # Citation linking scripts and test sets
└── prompts/                     # LLM prompt templates (YAML and Markdown)

Current Deployment Status

Text Extraction

  • Grobid integration
  • Marker PDF integration
  • MinerU integration
  • PyMuPDF integration
  • Extractor comparison and benchmarking

Reference Extraction and Parsing

  • TEI-XML parser (Grobid output)
  • LLM-based parser
  • Prompt templates and variants
  • Semantic reference locator/segmenter
  • Benchmarking: EXCITE, CEX gold standard, LinkedBooks
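
To illustrate what parsing Grobid's TEI-XML output involves, here is a minimal sketch using only the Python standard library. It is not the parser shipped in this repository: the function name `parse_references`, the returned dictionary keys, and the sample document are assumptions made for illustration. It relies only on the TEI namespace and the `biblStruct` layout that Grobid emits for bibliography entries.

```python
import xml.etree.ElementTree as ET

# Grobid writes its output in the TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written sample shaped like one Grobid bibliography entry.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0">
      <analytic>
        <title level="a" type="main">An example article title</title>
        <author>
          <persName>
            <forename type="first">Ada</forename>
            <surname>Example</surname>
          </persName>
        </author>
      </analytic>
      <monogr>
        <title level="j">Journal of Examples</title>
        <imprint><date type="published" when="2001" /></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def parse_references(tei_xml: str) -> list[dict]:
    """Return one dict (title, authors, year) per <biblStruct> element."""
    root = ET.fromstring(tei_xml)
    references = []
    for bibl in root.iterfind(".//tei:biblStruct", TEI_NS):
        # Prefer the article-level title; fall back to the container title.
        title_el = bibl.find("tei:analytic/tei:title", TEI_NS)
        if title_el is None:
            title_el = bibl.find("tei:monogr/tei:title", TEI_NS)
        title = (title_el.text or "").strip() if title_el is not None else ""

        authors = []
        for pers in bibl.iterfind(".//tei:author/tei:persName", TEI_NS):
            forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
            surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
            authors.append(" ".join(p for p in (forename, surname) if p))

        # The publication date is stored in the "when" attribute.
        date_el = bibl.find(".//tei:imprint/tei:date", TEI_NS)
        year = date_el.get("when", "")[:4] if date_el is not None else ""

        references.append({"title": title, "authors": authors, "year": year})
    return references


if __name__ == "__main__":
    for ref in parse_references(SAMPLE_TEI):
        print(ref)
```

A real parser would also need to handle missing fields, multiple titles, editors, page ranges, and identifiers such as DOIs, which this sketch leaves out.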

Citation Linking

  • OpenAlex API connector
  • OpenCitations API connector
  • Wikidata SPARQL connector
  • Matilda connector
  • Simple search and match pipeline
  • Advanced search and match pipeline
  • Benchmark datasets (CEX, EXCITE, LinkedBooks)
    • creation
    • annotation
    • evaluation
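
As a rough idea of what a search-and-match step can look like, the sketch below scores candidate records against a parsed reference by normalized title similarity and rejects candidates whose publication year disagrees. It is an assumption-laden illustration, not the pipeline implemented in this repository: the function names, the dictionary keys (`id`, `title`, `year`), the candidate identifiers, and the 0.85 threshold are all hypothetical, and the candidates are assumed to have been fetched already from one of the connectors.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return " ".join(cleaned.split())


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(reference: dict, candidates: list[dict], threshold: float = 0.85):
    """Return the best-scoring candidate at or above the threshold, else None."""
    best, best_score = None, 0.0
    for candidate in candidates:
        # Skip candidates whose year contradicts the reference (when both are known).
        ref_year, cand_year = reference.get("year"), candidate.get("year")
        if ref_year and cand_year and ref_year != cand_year:
            continue
        score = title_similarity(reference["title"], candidate["title"])
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


if __name__ == "__main__":
    # Hypothetical records; the ids are placeholders, not real identifiers.
    candidates = [
        {"id": "cand-1", "title": "Deep learning for citation parsing", "year": 2020},
        {"id": "cand-2", "title": "A history of printing", "year": 1999},
    ]
    reference = {"title": "Deep Learning for citation parsing.", "year": 2020}
    print(best_match(reference, candidates))
```

Title similarity alone is a weak signal for short or generic titles, so a more advanced pipeline would typically add author and venue evidence before accepting a match.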

Citation Intent Classification

  • TODO

Entity Extraction (software, dataset, funding, entity mentions)

  • TODO

Infrastructure

  • Core data models (Reference, Person, Organization)
  • LLM client with retry logic
  • CLI interface
  • Test suite
  • REST API module
  • API documentation
  • Deployment guides
  • Docker containerization
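
The "LLM client with retry logic" item can be pictured with the generic pattern below: retry a failing call a bounded number of times, with exponential backoff and a little jitter. This is a sketch of the general technique under stated assumptions, not the client in this repository; `call_with_retry` and all of its parameters are hypothetical names, and the wrapped callable stands in for whatever request the real client sends.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Add up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_request() -> str:
        # Stand-in for an LLM API call that fails twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(call_with_retry(flaky_request, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. In practice `retry_on` would be narrowed to transient errors such as timeouts and rate limits.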

Credits

The code contained in this repository is being developed by Yurui Zhu (Odoma). This work is carried out in the context of the EU-funded GRAPHIA project (grant ID: 101188018).
