Reliable LLM outputs start with clean context.
A reliability layer for LLM context. Deterministic deduplication that removes redundancy before it reaches your model.
Less redundant data. Lower costs. Faster responses. More efficient & deterministic results.
```
Context sources  →  Distill  →  LLM
(RAG, tools, memory, docs)       (reliable outputs)
```
LLM outputs are unreliable because context is polluted. "Garbage in, garbage out."
30-40% of context assembled from multiple sources is semantically redundant: the same information arrives from docs, code, memory, and tools and competes for attention. This leads to:
- Non-deterministic outputs — Same workflow, different results
- Confused reasoning — Signal diluted by repetition
- Production failures — Works in demos, breaks at scale
You can't fix unreliable outputs with better prompts. You need to fix the context that goes in.
Math, not magic. No LLM calls. Fully deterministic.
| Step | What it does | Benefit |
|---|---|---|
| Deduplicate | Remove redundant information across sources | More reliable outputs |
| Compress | Keep what matters, remove the noise | Lower token costs |
| Summarize | Condense older context intelligently | Longer sessions |
| Cache | Instant retrieval for repeated patterns | Faster responses |
Query → Over-fetch (50) → Cluster → Select → MMR Re-rank (8) → LLM
- Over-fetch - Retrieve 3-5x more chunks than needed
- Cluster - Group semantically similar chunks (agglomerative clustering)
- Select - Pick best representative from each cluster
- MMR Re-rank - Balance relevance and diversity
Result: Deterministic, diverse context in ~12ms. No LLM calls. Fully auditable.
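As a sketch of the underlying math (standard definitions, not necessarily Distill's exact scoring): two chunks with embeddings $e_i$ and $e_j$ fall into the same cluster when their cosine distance $1 - \cos(e_i, e_j)$ is below the clustering threshold, and the re-ranker greedily applies the classic Maximal Marginal Relevance criterion:

$$\text{MMR} = \underset{c_i \in R \setminus S}{\arg\max} \left[ \lambda \cdot \text{sim}(c_i, q) - (1 - \lambda) \cdot \max_{c_j \in S} \text{sim}(c_i, c_j) \right]$$

where $R$ is the over-fetched candidate set, $S$ the chunks already selected, $q$ the query, and $\lambda$ the relevance/diversity balance (the `--threshold` and `--lambda` parameters under Tuning below).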
Download from GitHub Releases:
```bash
# macOS (Apple Silicon)
curl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep "browser_download_url.*darwin_arm64.tar.gz" | cut -d '"' -f 4) | tar xz

# macOS (Intel)
curl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep "browser_download_url.*darwin_amd64.tar.gz" | cut -d '"' -f 4) | tar xz

# Linux (amd64)
curl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep "browser_download_url.*linux_amd64.tar.gz" | cut -d '"' -f 4) | tar xz

# Linux (arm64)
curl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep "browser_download_url.*linux_arm64.tar.gz" | cut -d '"' -f 4) | tar xz

# Move to PATH
sudo mv distill /usr/local/bin/
```
Or download directly from the releases page.
```bash
# Install with Go
go install github.com/Siddhant-K-code/distill@latest
```
```bash
# Or run with Docker
docker pull ghcr.io/siddhant-k-code/distill:latest
docker run -p 8080:8080 -e OPENAI_API_KEY=your-key ghcr.io/siddhant-k-code/distill
```
```bash
# Or build from source
git clone https://github.com/Siddhant-K-code/distill.git
cd distill
go build -o distill .
```
Start the API server and send chunks directly:
```bash
export OPENAI_API_KEY="your-key"  # For embeddings
distill api --port 8080
```
Deduplicate chunks:
```bash
curl -X POST http://localhost:8080/v1/dedupe \
  -H "Content-Type: application/json" \
  -d '{
    "chunks": [
      {"id": "1", "text": "React is a JavaScript library for building UIs."},
      {"id": "2", "text": "React.js is a JS library for building user interfaces."},
      {"id": "3", "text": "Vue is a progressive framework for building UIs."}
    ]
  }'
```
Response:
```json
{
  "chunks": [
    {"id": "1", "text": "React is a JavaScript library for building UIs.", "cluster_id": 0},
    {"id": "3", "text": "Vue is a progressive framework for building UIs.", "cluster_id": 1}
  ],
  "stats": {
    "input_count": 3,
    "output_count": 2,
    "reduction_pct": 33,
    "latency_ms": 12
  }
}
```
With pre-computed embeddings (no OpenAI key needed):
```bash
curl -X POST http://localhost:8080/v1/dedupe \
  -H "Content-Type: application/json" \
  -d '{
    "chunks": [
      {"id": "1", "text": "React is...", "embedding": [0.1, 0.2, ...]},
      {"id": "2", "text": "React.js is...", "embedding": [0.11, 0.21, ...]},
      {"id": "3", "text": "Vue is...", "embedding": [0.9, 0.8, ...]}
    ]
  }'
```
Connect to Pinecone or Qdrant for retrieval + deduplication:
```bash
export PINECONE_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
distill serve --index my-index --port 8080
```
Query with automatic deduplication:
```bash
curl -X POST http://localhost:8080/v1/retrieve \
  -H "Content-Type: application/json" \
  -d '{"query": "how do I reset my password?"}'
```
Works with Claude, Cursor, Amp, and other MCP-compatible assistants:
```bash
distill mcp
```
Add to Claude Desktop (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "distill": {
      "command": "/path/to/distill",
      "args": ["mcp"]
    }
  }
}
```
See mcp/README.md for more configuration options.
```bash
distill api      # Start standalone API server
distill serve    # Start server with vector DB connection
distill mcp      # Start MCP server for AI assistants
distill analyze  # Analyze a file for duplicates
distill sync     # Upload vectors to Pinecone with dedup
distill query    # Test a query from the command line
```
```bash
OPENAI_API_KEY    # For text → embedding conversion (see note below)
PINECONE_API_KEY  # For Pinecone backend
QDRANT_URL        # For Qdrant backend (default: localhost:6334)
DISTILL_API_KEYS  # Optional: protect your self-hosted instance (see below)
```
If you're exposing Distill publicly, set `DISTILL_API_KEYS` to require authentication:
```bash
# Generate a random API key
export DISTILL_API_KEYS="sk-$(openssl rand -hex 32)"

# Or multiple keys (comma-separated)
export DISTILL_API_KEYS="sk-key1,sk-key2,sk-key3"
```
Then include the key in requests:
```bash
curl -X POST http://your-server:8080/v1/dedupe \
  -H "Authorization: Bearer sk-your-key" \
  -H "Content-Type: application/json" \
  -d '{"chunks": [...]}'
```
If `DISTILL_API_KEYS` is not set, the API is open (suitable for local/internal use).
When you need it:
- Sending text chunks without pre-computed embeddings
- Using text queries with vector database retrieval
- Using the MCP server with text-based tools
When you DON'T need it:
- Sending chunks with pre-computed embeddings (include `"embedding": [...]` in your request)
- Using Distill purely for clustering/deduplication on existing vectors
What it's used for:
- Converts text to embeddings using the `text-embedding-3-small` model
- ~$0.00002 per 1K tokens (very cheap)
- Embeddings are used only for similarity comparison, never stored
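As a rough illustration: a request with 50 chunks of ~200 tokens each embeds ~10K tokens, which at that rate costs about $0.0002.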
Alternatives:
- Bring your own embeddings - include an `"embedding"` field in each chunk
- Self-host an embedding model - set `EMBEDDING_API_URL` to your endpoint
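For the self-hosted route, a minimal sketch (the endpoint URL below is a placeholder; point `EMBEDDING_API_URL` at whatever embedding service you actually run):

```bash
# Hypothetical local embedding server; substitute your own endpoint
export EMBEDDING_API_URL="http://localhost:8001/v1/embeddings"
distill api --port 8080   # per the alternatives above, no OPENAI_API_KEY required
```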
| Parameter | Description | Default |
|---|---|---|
| `--threshold` | Clustering distance (lower = stricter) | 0.15 |
| `--lambda` | MMR balance: 1.0 = relevance, 0.0 = diversity | 0.5 |
| `--over-fetch-k` | Chunks to retrieve initially | 50 |
| `--target-k` | Chunks to return after dedup | 8 |
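These flags can be combined; for example, a stricter clustering threshold with a diversity-leaning re-rank and a larger candidate pool (illustrative values, shown on `distill serve` as in the Quick Start):

```bash
# Illustrative tuning: tighter clusters, favor diversity, bigger candidate pool
distill serve --index my-index --port 8080 \
  --threshold 0.10 \
  --lambda 0.3 \
  --over-fetch-k 100 \
  --target-k 10
```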
Use the pre-built image from GitHub Container Registry:
```bash
# Pull and run
docker run -p 8080:8080 -e OPENAI_API_KEY=your-key ghcr.io/siddhant-k-code/distill:latest

# Or with a specific version
docker run -p 8080:8080 -e OPENAI_API_KEY=your-key ghcr.io/siddhant-k-code/distill:v0.1.0
```
```bash
# Start Distill + Qdrant (local vector DB)
docker-compose up
```
```bash
# Or build and run locally
docker build -t distill .
docker run -p 8080:8080 -e OPENAI_API_KEY=your-key distill api
```
```bash
# Deploy to Fly.io
fly launch
fly secrets set OPENAI_API_KEY=your-key
fly deploy
```
Or manually:
- Connect your GitHub repo
- Set environment variables (`OPENAI_API_KEY`)
- Deploy
Connect your repo and set `OPENAI_API_KEY` in environment variables.
```
┌─────────────────────────────────────────────────────────┐
│                        Your App                         │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                        Distill                          │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐     │
│  │ Fetch   │→ │ Cluster │→ │ Select  │→ │  MMR    │     │
│  │  50     │  │  12     │  │  12     │  │   8     │     │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘     │
│      2ms         6ms          <1ms         3ms          │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                          LLM                            │
└─────────────────────────────────────────────────────────┘
```
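To sanity-check those latency figures against your own deployment, curl's built-in timing is enough (assumes the standalone API server from the Quick Start is running; wall-clock time also includes the embedding round trip when you send raw text):

```bash
# Report end-to-end request time for a small dedupe call
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  -X POST http://localhost:8080/v1/dedupe \
  -H "Content-Type: application/json" \
  -d '{"chunks": [{"id": "1", "text": "React is a JavaScript library."}]}'
```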
- Pinecone - Fully supported
- Qdrant - Fully supported
- Weaviate - Coming soon
- Code Assistants - Dedupe context from multiple files/repos
- RAG Pipelines - Remove redundant chunks before LLM
- Agent Workflows - Clean up tool outputs + memory + docs
- Enterprise - Deterministic outputs for compliance
LLMs are non-deterministic. Reliability requires deterministic preprocessing.
| | LLM Compression | Distill |
|---|---|---|
| Latency | ~500ms | ~12ms |
| Cost per call | $0.01+ | $0.0001 |
| Deterministic | No | Yes |
| Lossless | No | Yes |
| Auditable | No | Yes |
Use LLMs for reasoning. Use deterministic algorithms for reliability.
Works with your existing AI stack:
- LLM Providers: OpenAI, Anthropic
- Frameworks: LangChain, LlamaIndex
- Vector DBs: Pinecone, Qdrant, Weaviate, Chroma, pgvector
- Tools: Cursor, Lovable, and more
Contributions welcome! Please read the contributing guidelines first.
```bash
# Run tests
go test ./...

# Build
go build -o distill .
```
AGPL-3.0 - see LICENSE
For commercial licensing, contact: [email protected]