⚠️ Work in Progress: This project is currently under active development. Features, APIs, and documentation are subject to change. Some components may be incomplete or experimental. Use in production environments is not recommended at this time.
SSH Citation Index modules are a collection of AI modules for extracting, parsing, and disambiguating bibliographic references from publications in the Social Sciences and Humanities (SSH).
(Work on these modules is part of Odoma's contribution to deliverables D4.2 and D4.4 in WP4.)
Installation instructions are a work in progress. The project requires Python 3.10+ and dependencies listed in requirements.txt.
citation_index/
├── src/citation_index/ # Core application code
│ ├── cli/ # Command-line interface entry points
│ ├── core/ # Domain logic and data models
│ │ ├── connectors/ # External API integrations for citation linking (OpenAlex, OpenCitations, Wikidata, Matilda)
│ │ ├── extractors/ # PDF extraction engines (Grobid, Marker, MinerU, PyMuPDF)
│ │ ├── models/ # Pydantic data models for references
│ │ ├── parsers/ # TEI-XML and bibliographic parsing
│ │ └── segmenters/ # Reference segmentation and localization
│ ├── llm/ # LLM client bindings and prompt management
│ ├── pipelines/ # Extraction and parsing workflow orchestration
│ ├── evaluation/ # Metrics and evaluation scripts
│ └── utils/ # Shared helper functions
├── tests/ # Test suite mirroring src/ structure
├── benchmarks/ # Evaluation datasets and scripts
│ ├── cex/ # CEX benchmark dataset
│ ├── excite/ # EXCITE dataset
│ ├── linkedbook/ # LinkedBooks dataset
│ ├── finetune/ # Fine-tuning datasets for LLMs
│ └── citation_linking/ # Citation linking scripts and test sets
├── prompts/ # LLM prompt templates (YAML and Markdown)
- Grobid integration
- Marker PDF integration
- MinerU integration
- PyMuPDF integration
- Extractor comparison and benchmarking
- TEI-XML parser (Grobid output)
- LLM-based parser
- Prompt templates and variants
- Semantic reference locator/segmenter
- Benchmarking: EXCITE, CEXgoldstandard, LinkedBooks
- OpenAlex API connector
- OpenCitations API connector
- Wikidata SPARQL connector
- Matilda connector
- Simple search and match pipeline
- Advanced search and match pipeline
- Benchmark datasets (CEX, EXCITE, LinkedBooks)
- Creation
- Annotation
- Evaluation
- TODO
- TODO
- Core data models (Reference, Person, Organization)
- LLM client with retry logic
- CLI interface
- Test suite
- REST API module
- API documentation
- Deployment guides
- Docker containerization
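
The "LLM client with retry logic" item listed above is not documented further here. As a rough illustration of the general technique only, the following is a minimal sketch of retrying a call with exponential backoff and jitter, using only the standard library. The function name `call_with_retry`, its parameters, and the choice of retryable exceptions are all assumptions made for this example; they are not taken from this repository's code.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on transient errors with exponential backoff.

    Hypothetical helper for illustration; not the project's actual client.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Add up to 10% random jitter so parallel callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")


# Usage with a hypothetical LLM call (client and prompt are placeholders):
# text = call_with_retry(lambda: client.complete(prompt), max_attempts=5)
```

Passing `sleep` as a parameter is a design choice that keeps the helper testable without real waiting. A production client would likely also distinguish rate-limit responses from other failures.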
The code contained in this repository is being developed by Yurui Zhu (Odoma). This work is carried out in the context of the EU-funded GRAPHIA project (grant ID: 101188018).