SSH Citation Index Modules

⚠️ Work in Progress: This project is currently under active development. Features, APIs, and documentation are subject to change. Some components may be incomplete or experimental. Use in production environments is not recommended at this time.

Description

The SSH Citation Index modules are a collection of AI modules for extracting, parsing, and disambiguating bibliographic references from publications in the Social Sciences and Humanities (SSH).

(Work on these modules is part of Odoma's contribution to deliverables D4.2 and D4.4 in WP4.)

Installation

Installation instructions are a work in progress. The project requires Python 3.10+ and dependencies listed in requirements.txt.

Project Structure

citation_index/
├── src/citation_index/          # Core application code
│   ├── cli/                     # Command-line interface entry points
│   ├── core/                    # Domain logic and data models
│   │   ├── connectors/          # External API integrations for citation linking (OpenAlex, OpenCitations, Wikidata, Matilda)
│   │   ├── extractors/          # PDF extraction engines (Grobid, Marker, MinerU, PyMuPDF)
│   │   ├── models/              # Pydantic data models for references
│   │   ├── parsers/             # TEI-XML and bibliographic parsing
│   │   └── segmenters/          # Reference segmentation and localization
│   ├── llm/                     # LLM client bindings and prompt management
│   ├── pipelines/               # Extraction and parsing workflow orchestration
│   ├── evaluation/              # Metrics and evaluation scripts
│   └── utils/                   # Shared helper functions
├── tests/                       # Test suite mirroring src/ structure
├── benchmarks/                  # Evaluation datasets and scripts
│   ├── cex/                     # CEX benchmark dataset
│   ├── excite/                  # EXCITE dataset
│   ├── linkedbook/              # LinkedBooks dataset
│   ├── finetune/                # Fine-tuning datasets for LLMs
│   └── citation_linking/        # Citation linking scripts and test sets
├── prompts/                     # LLM prompt templates (YAML and Markdown)

Current Deployment Status

Text Extraction

Reference Extraction and Parsing

TEI-XML parser (Grobid output)
LLM-based parser
Prompt templates and variants
Semantic reference locator/segmenter
Benchmarking: EXCITE, CEXgoldstandard, LinkedBooks
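
As an illustration of the TEI-XML parsing step listed above, here is a minimal sketch using only the Python standard library. It is not the repository's parser: the `Reference` dataclass stands in for the project's Pydantic models, and the names `parse_tei_references` and `SAMPLE` are hypothetical. It assumes Grobid-style TEI output in which each reference is a `biblStruct` element in the TEI namespace.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# TEI namespace used by Grobid output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


@dataclass
class Reference:
    """Hypothetical stand-in for the project's Pydantic reference model."""
    title: str | None = None
    authors: list[str] = field(default_factory=list)
    year: str | None = None


def parse_tei_references(tei_xml: str) -> list[Reference]:
    """Extract title, authors, and year from every biblStruct element."""
    root = ET.fromstring(tei_xml)
    refs: list[Reference] = []
    for bibl in root.iterfind(".//tei:biblStruct", TEI_NS):
        # First title in document order (the article title when present).
        title_el = bibl.find(".//tei:title", TEI_NS)
        title = title_el.text.strip() if title_el is not None and title_el.text else None

        authors = []
        for pers in bibl.iterfind(".//tei:author/tei:persName", TEI_NS):
            forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
            surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
            name = " ".join(p for p in (forename.strip(), surname.strip()) if p)
            if name:
                authors.append(name)

        # Dates carry an ISO-like value in the "when" attribute; keep the year only.
        date_el = bibl.find(".//tei:date[@when]", TEI_NS)
        year = date_el.get("when")[:4] if date_el is not None else None

        refs.append(Reference(title=title, authors=authors, year=year))
    return refs


# Made-up sample input for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><back><div><listBibl>
<biblStruct>
  <analytic>
    <title level="a" type="main">An Example Article</title>
    <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
  </analytic>
  <monogr>
    <title level="j">Journal of Examples</title>
    <imprint><date type="published" when="1843-01" /></imprint>
  </monogr>
</biblStruct>
</listBibl></div></back></text></TEI>"""

if __name__ == "__main__":
    for ref in parse_tei_references(SAMPLE):
        print(ref)
```

A real parser would likely also read the journal or book title, volume, pages, and identifiers from the `monogr` element; the sketch keeps only three fields to stay short.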

Citation Linking

Citation Intent Classification

TODO

Entity Extraction (software, dataset, funding, entity mentions)

TODO

Infrastructure

Credits

The code contained in this repository is being developed by Yurui Zhu (Odoma). This work is carried out in the context of the EU-funded GRAPHIA project (grant ID: 101188018).

