This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
on2vec is a toolkit for generating vector embeddings from OWL ontologies using Graph Neural Networks (GNNs), with comprehensive HuggingFace Sentence Transformers integration and MTEB benchmarking capabilities. The project creates production-ready models that combine ontological knowledge with semantic text understanding.
This is a Python project using UV for dependency management:
- Python >= 3.10 required
- Dependencies managed via `pyproject.toml` and `uv.lock`
- Key dependencies: PyTorch, torch-geometric, owlready2, UMAP, matplotlib, polars
```bash
uv sync
```

```bash
# Basic installation
pip install on2vec

# With benchmarking support
pip install on2vec[benchmark]

# All features
pip install on2vec[all]
```

The codebase follows a modern pipeline architecture with distinct stages:
- OWL Processing (`main.py`): Converts OWL ontologies to graphs and trains GNN models
- Text Integration: Combines structural embeddings with semantic text features
- HuggingFace Model Creation (`create_hf_model.py`): Creates sentence-transformers compatible models
- Model Documentation: Auto-generates comprehensive model cards and upload instructions
- MTEB Benchmarking (`mteb_benchmarks/`): Evaluation against standard benchmarks
- Visualization Pipeline: UMAP projections and analysis tools
- `create_hf_model.py`: Main CLI for end-to-end workflows
- `batch_hf_models.py`: Batch processing for multiple ontologies
- `on2vec/sentence_transformer_hub.py`: Core HuggingFace model creation
- `on2vec/model_card_generator.py`: Comprehensive documentation generation
- `on2vec/metadata_utils.py`: Smart metadata extraction and auto-detection
- `mteb_benchmarks/benchmark_runner.py`: Full MTEB evaluation framework
- `mteb_benchmarks/README.md`: Benchmarking documentation
- `test_edam_model.py`: Domain-specific model comparison
- `main.py`: Core GNN training and embedding generation
- `viz.py`: UMAP visualization of embeddings
- `process_dir.py`: Batch processing for multiple OWL files
- `on2vec/` package: Modular components for programmatic use
```bash
# End-to-end workflow
on2vec hf ONTOLOGY.owl model-name
```

```bash
# 1. Train ontology with text features
on2vec hf-train ONTOLOGY.owl --output embeddings.parquet

# 2. Create HuggingFace model (auto-detects base model)
on2vec hf-create embeddings.parquet model-name

# 3. Test model
on2vec hf-test ./hf_models/model-name

# 4. Inspect model details
on2vec inspect ./hf_models/model-name
```

```bash
# Batch processing of multiple ontologies
on2vec hf-batch owl_files/ ./output --max-workers 4

# Quick benchmark
on2vec benchmark ./hf_models/model-name --quick

# Full benchmark
on2vec benchmark ./hf_models/model-name

# Compare models
on2vec compare ./hf_models/model-name --detailed

# Core GNN training
on2vec train ONTOLOGY.owl --output embeddings.parquet --model-type gcn --hidden-dim 128 --out-dim 64 --epochs 100

# Generate embeddings from a trained model
on2vec embed model.pt ONTOLOGY.owl --output embeddings.parquet

# Visualize embeddings
on2vec visualize embeddings.parquet --neighbors 15 --min-dist 0.1 --output visualization.png

# Inspect embeddings
on2vec inspect embeddings.parquet
```
```bash
# Convert embeddings to CSV
on2vec convert embeddings.parquet embeddings.csv

# Force-directed layout
python force_layout.py ONTOLOGY.owl --output_image layout.png --output_parquet coordinates.parquet

# Animate transition between layouts
python dot_to_embed.py embeddings.parquet coordinates.parquet ONTOLOGY.owl output.parquet animation.gif
```

- OWL files typically stored in the `owl_files/` directory
- Generated embeddings saved as Parquet files
- Visualizations output as PNG images
- Animated transitions saved as GIF files
- Batch outputs organized in `output/` directories
The GNN models support:
- Architectures: GCN (Graph Convolutional Networks), GAT (Graph Attention Networks)
- Loss Functions: triplet, contrastive, cosine, cross_entropy
- Embedding Dimensions: Configurable `hidden_dim` and `output_dim`
- Training: Configurable epochs with Adam optimizer
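The pieces above can be sketched numerically. Below is a minimal NumPy illustration of one GCN propagation step and a triplet loss — toy dimensions and random data, not on2vec's actual PyTorch implementation:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN propagation step: ReLU(D^-1/2 (A+I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull anchor toward positive, push it away from negative."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy graph: 3 nodes in a chain (0-1, 1-2), 4-dim features -> 2-dim output
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.random.rand(3, 4)
W = np.random.rand(4, 2)
H = gcn_layer(A, X, W)
print(H.shape)  # (3, 2): one embedding row per node
```

GAT layers follow the same message-passing pattern but weight each neighbor with a learned attention coefficient instead of the fixed degree normalization shown here.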
- OWL → Graph (nodes=classes, edges=subclass relationships)
- Graph → GNN Training → Node Embeddings
- Embeddings → UMAP/Visualization → 2D Projections
- Multiple layouts can be interpolated for animations
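The first step of this flow — turning a class hierarchy into a graph — can be sketched in plain Python. The class names below are illustrative, not taken from a specific ontology:

```python
# OWL -> graph sketch: classes become integer node ids and
# subclass relationships become directed edges (child -> parent).
subclass_pairs = [
    ("ModelOrganism", "Organism"),
    ("Mouse", "ModelOrganism"),
    ("Zebrafish", "ModelOrganism"),
]

# Assign a stable integer id to every class seen in any pair
nodes = sorted({cls for pair in subclass_pairs for cls in pair})
node_id = {name: i for i, name in enumerate(nodes)}

# Edge list in the (child_id, parent_id) form a GNN edge_index expects
edges = [(node_id[child], node_id[parent]) for child, parent in subclass_pairs]
print(edges)
```

In the real pipeline the class list and subclass axioms come from parsing the OWL file with owlready2, and the resulting edge list feeds torch-geometric's training loop.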
- Embeddings: Parquet files with node_id and embedding columns
- Visualizations: PNG images
- Coordinates: Parquet files with node_id, x, y columns
- Animations: GIF files showing layout transitions
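The animated transitions reduce to interpolating between two coordinate layouts for the same set of nodes. A minimal NumPy sketch (toy layouts and frame count, with each row holding one node's x, y position):

```python
import numpy as np

def interpolate_layouts(start, end, n_frames=10):
    """Linearly interpolate per-node (x, y) coordinates between two layouts."""
    frames = []
    for t in np.linspace(0.0, 1.0, n_frames):
        frames.append((1.0 - t) * start + t * end)
    return frames

# Toy layouts: 3 nodes, rows ordered consistently by node_id
layout_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
layout_b = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

frames = interpolate_layouts(layout_a, layout_b, n_frames=5)
# first frame matches layout_a, last matches layout_b
```

Rendering each frame as a scatter plot and stitching the images together (e.g. with matplotlib plus an image library) yields the GIF transitions described above.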