This file contains project-specific instructions for Claude Code when working on the RAGDiff codebase.
RAGDiff v2.0 is a domain-based framework for comparing Retrieval-Augmented Generation (RAG) systems with LLM evaluation support. It provides both a CLI tool and a Python library API for systematic RAG provider comparison.
Key Features:
- Dual API: File-based (CLI) and object-based (Python library)
- Domain-based architecture: Organize comparisons by problem domain
- LLM evaluation: Automated quality assessment via LiteLLM
- Parallel execution: Configurable concurrency for runs and evaluations
- Immutable snapshots: Reproducible results with config preservation
RAGDiff v2.0 introduces a domain-based architecture that organizes RAG comparisons around problem domains:
- Domains: Separate workspaces for different problem areas (e.g., tafsir, legal, medical)
- Providers: RAG provider configurations (e.g., vectara-default, mongodb-local)
- Query Sets: Collections of test queries for evaluation
- Runs: Executions of query sets against providers
- Comparisons: LLM-based evaluations of multiple runs
This architecture replaces the v1.x adapter-based approach with a more structured, reproducible workflow.
IMPORTANT: The package must be installed in editable mode to work properly.
# Install in editable mode (do this once after cloning)
uv pip install -e .
# Run v2.0 CLI commands
uv run ragdiff run tafsir vectara-default test-queries
uv run ragdiff compare tafsir <run-id-1> <run-id-2>

DO NOT use the old hacky method with sys.path.insert(0, 'src') - the package is properly configured in pyproject.toml with the correct entry points.
RAGDiff v2.0 has two main commands:
Execute a query set against a provider and save results:
# Basic run
uv run ragdiff run <domain> <provider> <query-set>
# Examples
uv run ragdiff run tafsir vectara-default test-queries
uv run ragdiff run tafsir mongodb-local test-queries --concurrency 5
# With options
uv run ragdiff run tafsir vectara-default test-queries \
--domains-dir ./domains \
--concurrency 10 \
--timeout 30 \
--quiet

What it does:
- Loads provider configuration from domains/<domain>/providers/<provider>.yaml
- Loads queries from domains/<domain>/query-sets/<query-set>.txt
- Executes all queries against the provider
- Saves run results to domains/<domain>/runs/<run-id>.json
- Shows a progress bar and summary table
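Conceptually, a run just pushes each query through the provider's search interface and collects the results. A toy sketch (EchoProvider is a stand-in, not a real ragdiff class):

```python
# Conceptual sketch of what a run does; EchoProvider is a stand-in for a
# real provider implementing the search() interface described later.
class EchoProvider:
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        # A real provider would call its backend here
        return [{"content": f"result for {query!r}", "score": 1.0}]

provider = EchoProvider()
queries = ["What is RAG?", "What is Python?"]
results = {q: provider.search(q, top_k=3) for q in queries}
```

The real executor adds error capture, timing, and parallelism on top of this loop.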
Compare multiple runs using LLM evaluation:
# Basic comparison
uv run ragdiff compare <domain> <run-id-1> <run-id-2> [<run-id-3> ...]
# Examples
uv run ragdiff compare tafsir abc123 def456
uv run ragdiff compare tafsir abc123 def456 --concurrency 10
uv run ragdiff compare tafsir abc123 def456 --format json --output comparison.json
# With options
uv run ragdiff compare tafsir abc123 def456 \
--domains-dir ./domains \
--model gpt-4 \
--temperature 0.0 \
--concurrency 10 \
--format markdown \
--output report.md

Output formats:
- table: Rich table to console (default)
- json: JSON to file or console
- markdown: Markdown report to file or console
What it does:
- Loads runs from domains/<domain>/runs/
- Uses an LLM (via LiteLLM) to evaluate which provider performed better
- Executes evaluations in parallel (configurable with --concurrency, default: 5)
- Shows a real-time progress bar with evaluation status
- Saves the comparison to domains/<domain>/comparisons/<comparison-id>.json
- Outputs results in the specified format
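The per-query winner data in a saved comparison can be tallied in a few lines. The evaluations list below is hypothetical sample data; the real comparison JSON schema may differ:

```python
from collections import Counter

# Hypothetical per-query evaluation records; the actual comparison JSON
# schema may differ - this only illustrates the winner-tally idea.
evaluations = [
    {"query": "What is RAG?", "winner": "vectara-default"},
    {"query": "What is Python?", "winner": "mongodb-local"},
    {"query": "What is tafsir?", "winner": "vectara-default"},
]
tally = Counter(e["winner"] for e in evaluations)
print(tally.most_common(1))  # -> [('vectara-default', 2)]
```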
Performance tips:
- Use --concurrency 10-20 for large query sets (faster evaluation)
- Default concurrency is 5 (balanced for most cases)
- Higher concurrency may hit API rate limits - adjust based on your LLM provider
- Use --quiet to suppress progress output for automation
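The bounded-concurrency behavior behind --concurrency can be sketched with a thread pool. Here evaluate is a trivial placeholder for an I/O-bound LLM call, not the real evaluator:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(item: int) -> int:
    # Placeholder for a real (I/O-bound) LLM evaluation call
    return item * item

items = list(range(8))
# max_workers bounds the number of in-flight evaluations,
# analogous to --concurrency (default 5)
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(evaluate, items))
```

With real network calls, raising max_workers trades throughput against the risk of provider rate limits, as noted above.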
In addition to the CLI, RAGDiff can be used as a Python library with configuration objects instead of files. This is useful for web applications, automated workflows, or any scenario where configurations are stored in databases rather than files.
from ragdiff import execute_run, compare_runs
from ragdiff.core.models import Domain, ProviderConfig, QuerySet, Query, EvaluatorConfig
# Create configuration objects
domain = Domain(
name="my-domain",
description="Domain description",
evaluator=EvaluatorConfig(
model="gpt-4",
temperature=0.0,
prompt_template="Compare these results..."
)
)
provider = ProviderConfig(
name="vectara-prod",
tool="vectara",
config={
"api_key": "${VECTARA_API_KEY}",
"corpus_id": "${VECTARA_CORPUS_ID}"
}
)
query_set = QuerySet(
name="test-queries",
domain="my-domain",
queries=[
Query(text="What is Python?", reference=None),
Query(text="What is RAG?", reference=None)
]
)
# Execute run with objects (NO FILES NEEDED!)
run = execute_run(
domain=domain, # Object, not string
provider=provider, # Object, not string
query_set=query_set, # Object, not string
concurrency=10
)
# Compare runs
comparison = compare_runs(
domain=domain, # Object, not string
run_ids=[run1.id, run2.id],
concurrency=5
)

You can mix strings (file-based) and objects:
# Use domain from file, but objects for provider and queries
run = execute_run(
domain="my-domain", # String - loads from domains/my-domain/domain.yaml
provider=provider_obj, # Object
query_set=query_set_obj, # Object
domains_dir="/path/to/domains" # Only used for string parameters
)

Benefits of the object-based API:
- No filesystem operations: Perfect for web applications
- Database-friendly: Store configurations in PostgreSQL, MongoDB, etc.
- Thread-safe: No shared filesystem state
- Faster: No file I/O overhead
- Easier debugging: Direct Python stack traces
- Backward compatible: File-based API still works exactly the same
File-Based (CLI):
- Quick experiments and manual testing
- Configuration files are convenient
- Working with version-controlled configs
- One-off comparisons
Object-Based (Library):
- Web applications (like ragdiff-ui)
- Automated workflows and CI/CD
- Configurations in databases
- Programmatic usage from Python code
domains/
├── tafsir/ # Domain: Islamic tafsir
│ ├── domain.yaml # Domain config (evaluator settings)
│ ├── providers/ # Provider configurations
│ │ ├── vectara-default.yaml
│ │ ├── mongodb-local.yaml
│ │ └── agentset-prod.yaml
│ ├── query-sets/ # Query collections
│ │ ├── test-queries.txt
│ │ └── production-queries.txt
│ ├── runs/ # Run results (auto-created)
│ │ ├── <run-id-1>.json
│ │ └── <run-id-2>.json
│ └── comparisons/ # Comparison results (auto-created)
│ └── <comparison-id>.json
└── legal/ # Domain: Legal documents
├── domain.yaml
├── providers/
└── query-sets/
ragdiff/
├── src/ragdiff/ # Main package
│ ├── __init__.py # Public API exports
│ ├── cli.py # Main CLI entry point (imports cli_v2)
│ ├── cli_v2.py # v2.0 CLI implementation
│ ├── version.py # Version info
│ ├── core/ # Core v2.0 models
│ │ ├── models_v2.py # Domain-based models (Run, Comparison, etc.)
│ │ ├── loaders.py # File loading utilities
│ │ ├── storage.py # Persistence utilities
│ │ ├── errors.py # Custom exceptions
│ │ └── logging.py # Logging configuration
│ ├── providers/ # Provider implementations
│ │ ├── abc.py # Provider abstract base class
│ │ ├── registry.py # Provider registration
│ │ ├── factory.py # Provider factory
│ │ ├── vectara.py # Vectara provider
│ │ ├── mongodb.py # MongoDB provider
│ │ └── agentset.py # Agentset provider
│ ├── execution/ # Run execution
│ │ └── executor.py # Parallel query execution
│ ├── comparison/ # Comparison engine
│ │ └── evaluator.py # LLM-based evaluation
│ └── display/ # Output formatting (v1.x, kept for compatibility)
├── tests/ # Test suite
│ ├── test_core_v2.py # Core v2.0 tests
│ ├── test_providers.py # Provider tests
│ ├── test_execution.py # Execution engine tests
│ └── test_cli_v2.py # CLI tests
├── domains/ # Domain workspaces
│ └── example-domain/ # Example domain structure
└── pyproject.toml # Package configuration
RAGDiff v2.0 organizes everything around domains:
- Domain (domains/<domain>/domain.yaml):
  - Name and description
  - Evaluator configuration (LLM model, temperature, prompt template)
- Provider (domains/<domain>/providers/<provider>.yaml):
  - Name and tool type (vectara, mongodb, agentset)
  - Configuration (API keys, endpoints, etc.)
- Query Set (domains/<domain>/query-sets/<name>.txt):
  - Text file with one query per line
  - Used for consistent evaluation across providers
- Run (domains/<domain>/runs/<run-id>.json):
  - Results of executing a query set against a provider
  - Includes all query results, errors, and timing info
  - Snapshots the provider config and query set for reproducibility
- Comparison (domains/<domain>/comparisons/<comparison-id>.json):
  - LLM evaluation of multiple runs
  - Per-query winner determination
  - Quality scores and analysis
All RAG providers implement the Provider abstract base class:
class Provider(ABC):
@abstractmethod
def search(self, query: str, top_k: int = 5) -> list[RetrievedChunk]:
"""Execute search and return normalized results."""
pass

New providers are automatically registered via:
from .registry import register_tool
register_tool("mongodb", MongoDBSystem)

- YAML-based configuration in domain directories
- Environment variable substitution with ${VAR_NAME} (preserved in snapshots)
- LiteLLM integration for multi-provider LLM support
- Config snapshotting for reproducibility
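The ${VAR_NAME} substitution can be sketched as a small regex pass. This is an illustration only; ragdiff's real loader may differ, and since snapshots preserve the placeholder, substitution happens at load time rather than at save time:

```python
import os
import re

def substitute_env(value: str) -> str:
    """Replace ${VAR_NAME} placeholders with environment variable values."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            # Fail fast with a clear message instead of inserting an empty string
            raise KeyError(f"Environment variable not set: {name}")
        return os.environ[name]
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", repl, value)
```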
Version is defined in src/ragdiff/version.py:
__version__ = "2.0.0"  # Current version

Follow semantic versioning:
- MAJOR: Breaking changes to public API or provider interface
- MINOR: New features, backward compatible
- PATCH: Bug fixes
# Run all tests
uv run pytest tests/
# Run v2.0 tests only
uv run pytest tests/test_core_v2.py tests/test_providers.py tests/test_execution.py tests/test_cli_v2.py
# Run with coverage
uv run pytest tests/ --cov=src
# Run with verbose output
uv run pytest tests/ -v

All v2.0 tests must pass before committing. Current v2.0 test count: 78 tests.
The project uses pre-commit hooks:
- ruff for linting and formatting
- pytest for testing
- Whitespace and YAML validation
Pre-commit hooks will automatically:
- Format code with ruff
- Fix linting issues where possible
- Run all tests
- Reject commits if tests fail
- Make changes to source code in src/ragdiff/
- Add tests in tests/ for new functionality
- Run tests with uv run pytest tests/
- Test the CLI with uv run ragdiff <command>
- Update the version in src/ragdiff/version.py if needed
- Commit - pre-commit hooks will validate everything
# Create domain structure
mkdir -p domains/my-domain/{providers,query-sets,runs,comparisons}
# Create domain.yaml
cat > domains/my-domain/domain.yaml <<EOF
name: my-domain
description: Description of my domain
evaluator:
model: gpt-4
temperature: 0.0
prompt_template: |
Compare these RAG results...
EOF
# Create a provider config
cat > domains/my-domain/providers/vectara-test.yaml <<EOF
name: vectara-test
tool: vectara
config:
api_key: \${VECTARA_API_KEY}
corpus_id: \${VECTARA_CORPUS_ID}
timeout: 30
EOF
# Create a query set
cat > domains/my-domain/query-sets/test-queries.txt <<EOF
Query 1
Query 2
Query 3
EOF

- Create src/ragdiff/providers/myprovider.py:
from ..core.models_v2 import RetrievedChunk
from ..core.errors import ConfigError, RunError
from .abc import Provider
class MyProvider(Provider):
def __init__(self, config: dict):
super().__init__(config)
# Validate config
if "api_key" not in config:
raise ConfigError("Missing required field: api_key")
self.api_key = config["api_key"]
def search(self, query: str, top_k: int = 5) -> list[RetrievedChunk]:
# Implement search logic
results = self._call_api(query, top_k)
return [
RetrievedChunk(
content=r["text"],
score=r["score"],
metadata={"source": r["source"]}
)
for r in results
]
# Register the provider
from .registry import register_tool
register_tool("myprovider", MyProvider)

- Import in src/ragdiff/providers/__init__.py:
from . import myprovider  # noqa: F401

- Add tests in tests/test_providers.py
# Step 1: Run query sets against different providers
uv run ragdiff run tafsir vectara-default test-queries
uv run ragdiff run tafsir mongodb-local test-queries
uv run ragdiff run tafsir agentset-prod test-queries
# Note the run IDs from the output (or check domains/tafsir/runs/)
# Step 2: Compare the runs
uv run ragdiff compare tafsir <run-id-1> <run-id-2> <run-id-3>
# Step 3: Export to different formats
uv run ragdiff compare tafsir <run-id-1> <run-id-2> --format json --output comparison.json
uv run ragdiff compare tafsir <run-id-1> <run-id-2> --format markdown --output report.md

Required in .env file:
# Vectara
VECTARA_API_KEY=your_key
VECTARA_CORPUS_ID=your_corpus_id
# MongoDB Atlas
MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/
# Agentset
AGENTSET_API_TOKEN=your_token
AGENTSET_NAMESPACE_ID=your_namespace_id
# LLM Providers (for evaluation via LiteLLM)
OPENAI_API_KEY=your_key # For GPT models
ANTHROPIC_API_KEY=your_key # For Claude models
GEMINI_API_KEY=your_key # For Gemini models
OPENROUTER_API_KEY=your_key # For OpenRouter (optional)

- Domain-Driven: Organize work around problem domains
- Reproducibility: Snapshot configs and queries in runs
- Fail Fast: No fallbacks, clear error messages
- Type Safety: Pydantic models, type hints everywhere
- Testability: Every feature has tests
- Separation of Concerns: Clean boundaries between components
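The fail-fast principle shows up as early config validation with clear messages. An illustrative sketch (the real code raises ConfigError from ragdiff.core.errors, and the required fields here are assumptions):

```python
def validate_provider_config(config: dict) -> None:
    # Fail fast: reject incomplete configs up front with a clear message,
    # instead of failing mid-run. Required fields are illustrative only.
    required = {"api_key", "corpus_id"}
    missing = required - config.keys()
    if missing:
        raise ValueError(f"Missing required field(s): {', '.join(sorted(missing))}")
```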
Fix: Install package in editable mode
uv pip install -e .

Fix: Use uv run ragdiff (not just ragdiff)
Fix: Ensure domain directory exists at domains/<domain>/ with domain.yaml
Fix: Ensure provider config exists at domains/<domain>/providers/<provider>.yaml
Fix: Ensure query set exists at domains/<domain>/query-sets/<query-set>.txt
Fix: Ensure LiteLLM is installed (uv pip install litellm) and API keys are set
This project follows the SPIDER protocol for systematic development:
- Specification: Clear goals documented in codev/specs/
- Planning: Implementation plans in codev/plans/
- Implementation: Phased development with clear milestones (6 phases)
- Defense: Comprehensive test coverage (78 v2.0 tests)
- Evaluation: Code reviews in codev/reviews/
- Reflection: Architecture documentation in codev/resources/arch.md
For smaller feature additions and changes, use the TICK protocol (see example in codev/specs/0002-adapter-variants.md):
- T - Task/Specification: Problem statement, proposed solution, example use cases, success criteria
- I - Implementation: Detailed changes required, files to modify, code snippets
- C - Check/Testing: Test cases (unit, integration, manual), verification steps
- K - Knowledge/Documentation: Configuration format, migration guide, design rationale
TICK specs should be created in codev/specs/ with format: NNNN-feature-name.md
- ✅ Phase 1: Core data models, file loading, storage (29 tests)
- ✅ Phase 2: Provider interface, tool registry (29 tests)
- ✅ Phase 3: Run execution engine (12 tests)
- ✅ Phase 4: Comparison engine with LiteLLM (5 tests)
- ✅ Phase 5: CLI commands (8 tests)
- 🔄 Phase 6: Documentation & CI/CD (in progress)
- The CLI entry point is ragdiff (defined in pyproject.toml)
- Always use uv run ragdiff for CLI commands
- v2.0 uses domain-based architecture (not adapters)
- Source code is in src/ragdiff/ (note the nested structure)
- v2.0 models are in core/models_v2.py; providers are in providers/
- Tests are comprehensive - run them after any changes
- Pre-commit hooks enforce code quality - let them do their job
- v1.x code still exists but is not the primary interface