This file contains project-specific instructions for Claude Code when working on the RAGDiff codebase.
RAGDiff v2.0 is a domain-based framework for comparing Retrieval-Augmented Generation (RAG) systems with LLM evaluation support. It provides both a CLI tool and a Python library API for systematic RAG provider comparison.
Key Features:
- Dual API: File-based (CLI) and object-based (Python library)
- Domain-based architecture: Organize comparisons by problem domain
- LLM evaluation: Automated quality assessment via LiteLLM
- Parallel execution: Configurable concurrency for runs and evaluations
- Immutable snapshots: Reproducible results with config preservation
RAGDiff v2.0 introduces a domain-based architecture that organizes RAG comparisons around problem domains:
- Domains: Separate workspaces for different problem areas (e.g., tafsir, legal, medical)
- Providers: RAG provider configurations (e.g., vectara-default, mongodb-local)
- Query Sets: Collections of test queries for evaluation
- Runs: Executions of query sets against providers
- Comparisons: LLM-based evaluations of multiple runs
This architecture replaces the v1.x adapter-based approach with a more structured, reproducible workflow.
IMPORTANT: The package must be installed in editable mode to work properly.
# Install in editable mode (do this once after cloning)
uv pip install -e .
# Run v2.0 CLI commands
uv run ragdiff run tafsir vectara-default test-queries
uv run ragdiff compare tafsir <run-id-1> <run-id-2>

DO NOT use the old hacky method with sys.path.insert(0, 'src') - the package is properly configured in pyproject.toml with the correct entry points.
RAGDiff v2.0 has two main commands:
Execute a query set against a provider and save results:
# Basic run
uv run ragdiff run <domain> <provider> <query-set>
# Examples
uv run ragdiff run tafsir vectara-default test-queries
uv run ragdiff run tafsir mongodb-local test-queries --concurrency 5
# With options
uv run ragdiff run tafsir vectara-default test-queries \
--domains-dir ./domains \
--concurrency 10 \
--timeout 30 \
--quiet

What it does:
- Loads provider configuration from domains/<domain>/providers/<provider>.yaml
- Loads queries from domains/<domain>/query-sets/<query-set>.txt
- Executes all queries against the provider
- Saves run results to domains/<domain>/runs/<run-id>.json
- Shows a progress bar and summary table
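Conceptually, a run just pushes each query through the provider's search interface and collects the results. A toy sketch (EchoProvider is a stand-in, not a real ragdiff class):

```python
# Conceptual sketch of what a run does; EchoProvider is a stand-in for a
# real provider implementing the search() interface described later.
class EchoProvider:
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        # A real provider would call its backend here
        return [{"content": f"result for {query!r}", "score": 1.0}]

provider = EchoProvider()
queries = ["What is RAG?", "What is Python?"]
results = {q: provider.search(q, top_k=3) for q in queries}
```

The real executor adds error capture, timing, and parallelism on top of this loop.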
Compare multiple runs using LLM evaluation:
# Basic comparison
uv run ragdiff compare <domain> <run-id-1> <run-id-2> [<run-id-3> ...]
# Examples
uv run ragdiff compare tafsir abc123 def456
uv run ragdiff compare tafsir abc123 def456 --concurrency 10
uv run ragdiff compare tafsir abc123 def456 --format json --output comparison.json
# With options
uv run ragdiff compare tafsir abc123 def456 \
--domains-dir ./domains \
--model gpt-4 \
--temperature 0.0 \
--concurrency 10 \
--format markdown \
--output report.md

Output formats:
- table: Rich table to console (default)
- json: JSON to file or console
- markdown: Markdown report to file or console
What it does:
- Loads runs from domains/<domain>/runs/
- Uses an LLM (via LiteLLM) to evaluate which provider performed better
- Executes evaluations in parallel (configurable with --concurrency, default: 5)
- Shows a real-time progress bar with evaluation status
- Saves the comparison to domains/<domain>/comparisons/<comparison-id>.json
- Outputs results in the specified format
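The per-query winner data in a saved comparison can be tallied in a few lines. The evaluations list below is hypothetical sample data; the real comparison JSON schema may differ:

```python
from collections import Counter

# Hypothetical per-query evaluation records; the actual comparison JSON
# schema may differ - this only illustrates the winner-tally idea.
evaluations = [
    {"query": "What is RAG?", "winner": "vectara-default"},
    {"query": "What is Python?", "winner": "mongodb-local"},
    {"query": "What is tafsir?", "winner": "vectara-default"},
]
tally = Counter(e["winner"] for e in evaluations)
print(tally.most_common(1))  # -> [('vectara-default', 2)]
```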
Performance tips:
- Use --concurrency 10-20 for large query sets (faster evaluation)
- Default concurrency is 5 (balanced for most cases)
- Higher concurrency may hit API rate limits - adjust based on your LLM provider
- Use --quiet to suppress progress output for automation
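The bounded-concurrency behavior behind --concurrency can be sketched with a thread pool. Here evaluate is a trivial placeholder for an I/O-bound LLM call, not the real evaluator:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(item: int) -> int:
    # Placeholder for a real (I/O-bound) LLM evaluation call
    return item * item

items = list(range(8))
# max_workers bounds the number of in-flight evaluations,
# analogous to --concurrency (default 5)
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(evaluate, items))
```

With real network calls, raising max_workers trades throughput against the risk of provider rate limits, as noted above.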
In addition to the CLI, RAGDiff can be used as a Python library with configuration objects instead of files. This is useful for web applications, automated workflows, or any scenario where configurations are stored in databases rather than files.
from ragdiff import execute_run, compare_runs
from ragdiff.core.models import Domain, ProviderConfig, QuerySet, Query, EvaluatorConfig
# Create configuration objects
domain = Domain(
name="my-domain",
description="Domain description",
evaluator=EvaluatorConfig(
model="gpt-4",
temperature=0.0,
prompt_template="Compare these results..."
)
)
provider = ProviderConfig(
name="vectara-prod",
tool="vectara",
config={
"api_key": "${VECTARA_API_KEY}",
"corpus_id": "${VECTARA_CORPUS_ID}"
}
)
query_set = QuerySet(
name="test-queries",
domain="my-domain",
queries=[
Query(text="What is Python?", reference=None),
Query(text="What is RAG?", reference=None)
]
)
# Execute run with objects (NO FILES NEEDED!)
run = execute_run(
domain=domain, # Object, not string
provider=provider, # Object, not string
query_set=query_set, # Object, not string
concurrency=10
)
# Compare runs
comparison = compare_runs(
domain=domain, # Object, not string
run_ids=[run1.id, run2.id],
concurrency=5
)

You can mix strings (file-based) and objects:
# Use domain from file, but objects for provider and queries
run = execute_run(
domain="my-domain", # String - loads from domains/my-domain/domain.yaml
provider=provider_obj, # Object
query_set=query_set_obj, # Object
domains_dir="/path/to/domains" # Only used for string parameters
)

Benefits of the object-based API:
- No filesystem operations: Perfect for web applications
- Database-friendly: Store configurations in PostgreSQL, MongoDB, etc.
- Thread-safe: No shared filesystem state
- Faster: No file I/O overhead
- Easier debugging: Direct Python stack traces
- Backward compatible: File-based API still works exactly the same
File-Based (CLI):
- Quick experiments and manual testing
- Configuration files are convenient
- Working with version-controlled configs
- One-off comparisons
Object-Based (Library):
- Web applications (like ragdiff-ui)
- Automated workflows and CI/CD
- Configurations in databases
- Programmatic usage from Python code
domains/
├── tafsir/ # Domain: Islamic tafsir
│ ├── domain.yaml # Domain config (evaluator settings)
│ ├── providers/ # Provider configurations
│ │ ├── vectara-default.yaml
│ │ ├── mongodb-local.yaml
│ │ └── agentset-prod.yaml
│ ├── query-sets/ # Query collections
│ │ ├── test-queries.txt
│ │ └── production-queries.txt
│ ├── runs/ # Run results (auto-created)
│ │ ├── <run-id-1>.json
│ │ └── <run-id-2>.json
│ └── comparisons/ # Comparison results (auto-created)
│ └── <comparison-id>.json
└── legal/ # Domain: Legal documents
├── domain.yaml
├── providers/
└── query-sets/
ragdiff/
├── src/ragdiff/ # Main package
│ ├── __init__.py # Public API exports
│ ├── cli.py # Main CLI entry point (imports cli_v2)
│ ├── cli_v2.py # v2.0 CLI implementation
│ ├── version.py # Version info
│ ├── core/ # Core v2.0 models
│ │ ├── models_v2.py # Domain-based models (Run, Comparison, etc.)
│ │ ├── loaders.py # File loading utilities
│ │ ├── storage.py # Persistence utilities
│ │ ├── errors.py # Custom exceptions
│ │ └── logging.py # Logging configuration
│ ├── providers/ # Provider implementations
│ │ ├── abc.py # Provider abstract base class
│ │ ├── registry.py # Provider registration
│ │ ├── factory.py # Provider factory
│ │ ├── vectara.py # Vectara provider
│ │ ├── mongodb.py # MongoDB provider
│ │ └── agentset.py # Agentset provider
│ ├── execution/ # Run execution
│ │ └── executor.py # Parallel query execution
│ ├── comparison/ # Comparison engine
│ │ └── evaluator.py # LLM-based evaluation
│ └── display/ # Output formatting (v1.x, kept for compatibility)
├── tests/ # Test suite
│ ├── test_core_v2.py # Core v2.0 tests
│ ├── test_providers.py # Provider tests
│ ├── test_execution.py # Execution engine tests
│ └── test_cli_v2.py # CLI tests
├── domains/ # Domain workspaces
│ └── example-domain/ # Example domain structure
└── pyproject.toml # Package configuration
RAGDiff v2.0 organizes everything around domains:
- Domain (domains/<domain>/domain.yaml):
  - Name and description
  - Evaluator configuration (LLM model, temperature, prompt template)
- Provider (domains/<domain>/providers/<provider>.yaml):
  - Name and tool type (vectara, mongodb, agentset)
  - Configuration (API keys, endpoints, etc.)
- Query Set (domains/<domain>/query-sets/<name>.txt):
  - Text file with one query per line
  - Used for consistent evaluation across providers
- Run (domains/<domain>/runs/<run-id>.json):
  - Results of executing a query set against a provider
  - Includes all query results, errors, and timing info
  - Snapshots the provider config and query set for reproducibility
- Comparison (domains/<domain>/comparisons/<comparison-id>.json):
  - LLM evaluation of multiple runs
  - Per-query winner determination
  - Quality scores and analysis
All RAG providers implement the Provider abstract base class:
class Provider(ABC):
@abstractmethod
def search(self, query: str, top_k: int = 5) -> list[RetrievedChunk]:
"""Execute search and return normalized results."""
pass

New providers are automatically registered via:
from .registry import register_tool
register_tool("mongodb", MongoDBSystem)

- YAML-based configuration in domain directories
- Environment variable substitution with ${VAR_NAME} (preserved in snapshots)
- LiteLLM integration for multi-provider LLM support
- Config snapshotting for reproducibility
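The ${VAR_NAME} substitution can be sketched as a small regex pass. This is an illustration only; ragdiff's real loader may differ, and since snapshots preserve the placeholder, substitution happens at load time rather than at save time:

```python
import os
import re

def substitute_env(value: str) -> str:
    """Replace ${VAR_NAME} placeholders with environment variable values."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            # Fail fast with a clear message instead of inserting an empty string
            raise KeyError(f"Environment variable not set: {name}")
        return os.environ[name]
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", repl, value)
```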
Version is defined in src/ragdiff/version.py:
__version__ = "2.0.0"  # Current version

Follow semantic versioning:
- MAJOR: Breaking changes to public API or provider interface
- MINOR: New features, backward compatible
- PATCH: Bug fixes
# Run all tests
uv run pytest tests/
# Run v2.0 tests only
uv run pytest tests/test_core_v2.py tests/test_providers.py tests/test_execution.py tests/test_cli_v2.py
# Run with coverage
uv run pytest tests/ --cov=src
# Run with verbose output
uv run pytest tests/ -v

All v2.0 tests must pass before committing. Current v2.0 test count: 78 tests.
The project uses pre-commit hooks:
- ruff for linting and formatting
- pytest for testing
- Whitespace and YAML validation
Pre-commit hooks will automatically:
- Format code with ruff
- Fix linting issues where possible
- Run all tests
- Reject commits if tests fail
- Make changes to source code in src/ragdiff/
- Add tests in tests/ for new functionality
- Run tests with uv run pytest tests/
- Test the CLI with uv run ragdiff <command>
- Update the version in src/ragdiff/version.py if needed
- Commit - pre-commit hooks will validate everything
# Create domain structure
mkdir -p domains/my-domain/{providers,query-sets,runs,comparisons}
# Create domain.yaml
cat > domains/my-domain/domain.yaml <<EOF
name: my-domain
description: Description of my domain
evaluator:
model: gpt-4
temperature: 0.0
prompt_template: |
Compare these RAG results...
EOF
# Create a provider config
cat > domains/my-domain/providers/vectara-test.yaml <<EOF
name: vectara-test
tool: vectara
config:
api_key: \${VECTARA_API_KEY}
corpus_id: \${VECTARA_CORPUS_ID}
timeout: 30
EOF
# Create a query set
cat > domains/my-domain/query-sets/test-queries.txt <<EOF
Query 1
Query 2
Query 3
EOF

- Create src/ragdiff/providers/myprovider.py:
from ..core.models_v2 import RetrievedChunk
from ..core.errors import ConfigError, RunError
from .abc import Provider
class MyProvider(Provider):
def __init__(self, config: dict):
super().__init__(config)
# Validate config
if "api_key" not in config:
raise ConfigError("Missing required field: api_key")
self.api_key = config["api_key"]
def search(self, query: str, top_k: int = 5) -> list[RetrievedChunk]:
# Implement search logic
results = self._call_api(query, top_k)
return [
RetrievedChunk(
content=r["text"],
score=r["score"],
metadata={"source": r["source"]}
)
for r in results
]
# Register the provider
from .registry import register_tool
register_tool("myprovider", MyProvider)

- Import in src/ragdiff/providers/__init__.py:
from . import myprovider  # noqa: F401

- Add tests in tests/test_providers.py
# Step 1: Run query sets against different providers
uv run ragdiff run tafsir vectara-default test-queries
uv run ragdiff run tafsir mongodb-local test-queries
uv run ragdiff run tafsir agentset-prod test-queries
# Note the run IDs from the output (or check domains/tafsir/runs/)
# Step 2: Compare the runs
uv run ragdiff compare tafsir <run-id-1> <run-id-2> <run-id-3>
# Step 3: Export to different formats
uv run ragdiff compare tafsir <run-id-1> <run-id-2> --format json --output comparison.json
uv run ragdiff compare tafsir <run-id-1> <run-id-2> --format markdown --output report.md

Required in .env file:
# Vectara
VECTARA_API_KEY=your_key
VECTARA_CORPUS_ID=your_corpus_id
# MongoDB Atlas
MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/
# Agentset
AGENTSET_API_TOKEN=your_token
AGENTSET_NAMESPACE_ID=your_namespace_id
# LLM Providers (for evaluation via LiteLLM)
OPENAI_API_KEY=your_key # For GPT models
ANTHROPIC_API_KEY=your_key # For Claude models
GEMINI_API_KEY=your_key # For Gemini models
OPENROUTER_API_KEY=your_key # For OpenRouter (optional)

- Domain-Driven: Organize work around problem domains
- Reproducibility: Snapshot configs and queries in runs
- Fail Fast: No fallbacks, clear error messages
- Type Safety: Pydantic models, type hints everywhere
- Testability: Every feature has tests
- Separation of Concerns: Clean boundaries between components
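The fail-fast principle shows up as early config validation with clear messages. An illustrative sketch (the real code raises ConfigError from ragdiff.core.errors, and the required fields here are assumptions):

```python
def validate_provider_config(config: dict) -> None:
    # Fail fast: reject incomplete configs up front with a clear message,
    # instead of failing mid-run. Required fields are illustrative only.
    required = {"api_key", "corpus_id"}
    missing = required - config.keys()
    if missing:
        raise ValueError(f"Missing required field(s): {', '.join(sorted(missing))}")
```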
Fix: Install package in editable mode
uv pip install -e .

Fix: Use uv run ragdiff (not just ragdiff)
Fix: Ensure domain directory exists at domains/<domain>/ with domain.yaml
Fix: Ensure provider config exists at domains/<domain>/providers/<provider>.yaml
Fix: Ensure query set exists at domains/<domain>/query-sets/<query-set>.txt
Fix: Ensure LiteLLM is installed (uv pip install litellm) and API keys are set
This project follows the SPIDER protocol for systematic development:
- Specification: Clear goals documented in codev/specs/
- Planning: Implementation plans in codev/plans/
- Implementation: Phased development with clear milestones (6 phases)
- Defense: Comprehensive test coverage (78 v2.0 tests)
- Evaluation: Code reviews in codev/reviews/
- Reflection: Architecture documentation in codev/resources/arch.md
For smaller feature additions and changes, use the TICK protocol (see example in codev/specs/0002-adapter-variants.md):
- T - Task/Specification: Problem statement, proposed solution, example use cases, success criteria
- I - Implementation: Detailed changes required, files to modify, code snippets
- C - Check/Testing: Test cases (unit, integration, manual), verification steps
- K - Knowledge/Documentation: Configuration format, migration guide, design rationale
TICK specs should be created in codev/specs/ with format: NNNN-feature-name.md
- ✅ Phase 1: Core data models, file loading, storage (29 tests)
- ✅ Phase 2: Provider interface, tool registry (29 tests)
- ✅ Phase 3: Run execution engine (12 tests)
- ✅ Phase 4: Comparison engine with LiteLLM (5 tests)
- ✅ Phase 5: CLI commands (8 tests)
- 🔄 Phase 6: Documentation & CI/CD (in progress)
- The CLI entry point is ragdiff (defined in pyproject.toml)
- Always use uv run ragdiff for CLI commands
- v2.0 uses domain-based architecture (not adapters)
- Source code is in src/ragdiff/ (note the nested structure)
- v2.0 models are in core/models_v2.py; providers are in providers/
- Tests are comprehensive - run them after any changes
- Pre-commit hooks enforce code quality - let them do their job
- v1.x code still exists but is not the primary interface