Upload your documents, then chat with them. Answers are grounded in real source text and every claim is cited back to a specific chunk.
A from-scratch retrieval-augmented generation (RAG) application. The interesting parts are not the LLM call — they are the choices around chunking, embedding, retrieval scoring, prompt grounding, and citation UX.
Drop a screenshot or GIF named `demo.png` into the `screenshots/` folder to render here.
- End-to-end RAG pipeline built without LangChain — every step is visible and explainable
- Token-aware sentence-boundary chunking that respects the embedder's context window
- Local embeddings + local vector store — zero per-query cost on the retrieval side
- Grounded LLM prompting with explicit citation format and "I don't know" refusal behavior
- Real test coverage: 25 tests including 4 live calls to the Anthropic API
- Polished React UI with drag-and-drop upload, click-to-cite source pills, and color-coded relevance scores
- Deploy-ready architecture: stateless API, env-driven config, bring-your-own embeddings
| Layer | Choice | Why |
|---|---|---|
| Backend | FastAPI + Python 3.11+ (uv-managed) | Type-checked, async, automatic OpenAPI docs |
| Embeddings | `sentence-transformers/all-MiniLM-L6-v2` (384-dim) | Free, runs on CPU, well-known baseline |
| Vector store | ChromaDB persistent client | No external infrastructure, real cosine search |
| LLM | Anthropic Claude (`claude-haiku-4-5` default) | Fast, cheap, strong instruction-following |
| Frontend | Vite + React 19 + TypeScript + Tailwind v4 | Modern tooling, sub-second HMR |
| Tests | pytest (mocked + live) | Fast unit tests + verified real API behavior |
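
As a quick sanity check of the embedding row above: the model loads on CPU and produces 384-dimensional vectors. A minimal sketch, not project code; the sample query is arbitrary.

```python
# Verify the embedder choice from the table: CPU-friendly, 384-dimensional output.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = model.encode(
    ["Which Paris landmarks were completed in the 1800s?"],
    normalize_embeddings=True,
)
print(vectors.shape)  # (1, 384)
```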
```mermaid
flowchart LR
User[User] -->|"upload PDF"| UI[React UI]
User -->|"chat query"| UI
UI -->|"REST /api/*"| API[FastAPI Backend]
subgraph backend [Backend]
API --> Chunk[Sentence chunker]
Chunk --> Embed[sentence-transformers]
Embed --> Chroma[(ChromaDB)]
API -->|"top-k cosine"| Chroma
API -->|"grounded prompt"| Claude[Anthropic Claude]
end
Claude -->|"answer + citations"| API
API -->|"answer + sources"| UI
```
For the deeper write-up — chunking strategy, scoring math, prompt design, and known limitations — see docs/architecture.md.
You'll need:
- uv (handles Python install automatically)
- Node.js 20+ (`brew install node` on macOS)
- An Anthropic API key
```bash
# 1. Backend
cd backend
cp .env.example .env                   # edit .env: paste your ANTHROPIC_API_KEY
uv sync                                # installs Python 3.11 + all deps
uv run uvicorn app.main:app --reload   # serves http://127.0.0.1:8000

# 2. Frontend (in a second terminal)
cd frontend
npm install
npm run dev                            # opens http://localhost:5173
```

Drag a PDF or text file from `examples/` (or your own) into the left panel, then ask questions about it.
The examples/ folder contains a few short text files you can upload immediately to see the system in action:
- `paris_landmarks.txt` — tests retrieval and refusal across many similar facts
- `rag_intro.txt` — tests the system answering questions about RAG
- `grace_hopper.txt` — tests single-document factual recall
Sample question for `paris_landmarks.txt`:
"Which Paris landmarks were completed in the 1800s?"
Expected behaviour: Claude pulls only the Eiffel Tower (1889) and Arc de Triomphe (1836), correctly ignoring Notre-Dame (1345) and the Louvre (1793). Source chips at the bottom of the answer let you jump to the exact passages used.
The full discussion is in docs/architecture.md. Summary:
- Chunk size = 200 tokens, overlap = 30 tokens. The embedder's max input is 256 tokens; going over silently truncates. 200 leaves headroom and produces chunks that read coherently.
- Sentence-boundary chunking rather than fixed-width slicing. Avoids cutting mid-thought, which hurts both embedding quality and human readability of citations.
- L2-normalized embeddings + cosine distance. Cosine on normalized vectors is mathematically equivalent to dot product but lets ChromaDB use its standard cosine index.
- Bring-your-own embeddings to Chroma. A no-op embedding function is registered with the collection so we always pass embeddings explicitly. Keeps embedding choice fully under our control.
- Numbered citation contract in the system prompt. Each retrieved chunk is prefixed with `[N] (source: filename, p.N)`, and Claude is instructed to cite using the same `[N]` markers. Makes citations verifiable in the UI (a prompt sketch follows this list).
- Explicit refusal instruction — if the context doesn't answer the question, the model must say so verbatim. Tested with a live integration test.
- Singleton model + Chroma client via `lru_cache` to avoid reloading PyTorch weights on every request.
- Score normalization: cosine distance → similarity = `1 - distance`, displayed as a 0–1 score with green/amber/gray relevance bands in the UI (a retrieval sketch follows this list).
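
The chunking strategy in the first two bullets can be sketched in a few lines. This is an illustration under assumptions, not the code in `app/chunking.py`: the regex sentence splitter and the use of the HuggingFace tokenizer for token counting are simplifications.

```python
# Sketch of a sentence-boundary, token-aware chunker using the constants from the
# design notes: 200-token chunks, 30-token overlap (embedder limit is 256 tokens).
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text, add_special_tokens=False))

def chunk_text(text: str, max_tokens: int = 200, overlap_tokens: int = 30) -> list[str]:
    # Split on sentence boundaries so no chunk ends mid-thought.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sent in sentences:
        sent_tokens = count_tokens(sent)
        if current and current_tokens + sent_tokens > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with trailing sentences up to the overlap budget.
            overlap: list[str] = []
            budget = 0
            for prev in reversed(current):
                budget += count_tokens(prev)
                if budget > overlap_tokens:
                    break
                overlap.insert(0, prev)
            current = overlap
            current_tokens = sum(count_tokens(s) for s in current)
        # Simplification: a single sentence longer than max_tokens still becomes a chunk.
        current.append(sent)
        current_tokens += sent_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```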
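The embedding, Chroma, caching, and scoring bullets combine into a retrieval path roughly like the sketch below. It is illustrative only: the collection name and storage path are made up, and where the real code registers a no-op embedding function with the collection, this sketch simply always passes embeddings explicitly.

```python
# Sketch of retrieval scoring: a cached embedder, explicit embeddings passed to a
# cosine-space Chroma collection, and cosine distance converted to a relevance score.
from functools import lru_cache

import chromadb
from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=1)
def get_model() -> SentenceTransformer:
    # Loaded once per process so PyTorch weights are not reloaded on every request.
    return SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

@lru_cache(maxsize=1)
def get_collection():
    client = chromadb.PersistentClient(path="./chroma_data")  # assumed path
    # hnsw:space=cosine so query distances are cosine distances.
    return client.get_or_create_collection("documents", metadata={"hnsw:space": "cosine"})

def search(query: str, k: int = 5) -> list[dict]:
    query_vec = get_model().encode([query], normalize_embeddings=True)
    res = get_collection().query(query_embeddings=query_vec.tolist(), n_results=k)
    hits = []
    for text, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
        score = 1.0 - dist  # similarity = 1 - cosine distance; shown as the relevance score
        hits.append({"text": text, "metadata": meta, "score": round(score, 3)})
    return hits
```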
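And the citation and refusal contract can be sketched as below. Only the `[N] (source: filename, p.N)` prefix and the default model name come from the notes above; the prompt wording, the refusal phrasing, and the metadata field names are placeholders.

```python
# Sketch of the grounded prompt: numbered context blocks, a citation contract,
# and an explicit refusal instruction. Prompt wording here is illustrative.
import os
import anthropic

REFUSAL = "I don't know based on the provided documents."  # assumed refusal phrasing

SYSTEM_PROMPT = (
    "Answer using ONLY the numbered context passages below. "
    "Cite every claim with the matching [N] marker. "
    f"If the context does not answer the question, reply exactly: {REFUSAL}"
)

def build_context(chunks: list[dict]) -> str:
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]  # 'filename' and 'page' are assumed field names
        lines.append(f"[{i}] (source: {meta['filename']}, p.{meta['page']})\n{chunk['text']}")
    return "\n\n".join(lines)

def generate_answer(question: str, chunks: list[dict]) -> str:
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    message = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user",
                   "content": f"Context:\n{build_context(chunks)}\n\nQuestion: {question}"}],
    )
    return message.content[0].text
```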
```bash
cd backend
uv run pytest -v
```

The suite has 25 tests in 4 files:
| File | What it covers | Live API? |
|---|---|---|
| `test_ingest.py` | PDF + text parsing, chunking, embedding storage | No |
| `test_retrieve.py` | Semantic ranking, score range, metadata roundtrip, edge cases | No |
| `test_generate.py` | Prompt construction, citation behavior, grounded answers, refusal | 3 live |
| `test_api.py` | All HTTP endpoints, validation, error paths, full pipeline | 1 live |
Live tests automatically skip when `ANTHROPIC_API_KEY` is not set, so the suite stays green for anyone cloning the repo.
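
The guard is the standard pytest pattern; a sketch of it (the test name is illustrative):

```python
# Live tests that call the Anthropic API are skipped when ANTHROPIC_API_KEY is absent,
# so CI and fresh clones stay green without network access or secrets.
import os
import pytest

requires_api_key = pytest.mark.skipif(
    not os.environ.get("ANTHROPIC_API_KEY"),
    reason="ANTHROPIC_API_KEY not set; skipping live Anthropic API test",
)

@requires_api_key
def test_refusal_when_context_is_irrelevant():
    ...
```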
```
01-rag-document-chat/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app + routes + lifespan
│ │ ├── ingest.py # parse → chunk → embed → store
│ │ ├── chunking.py # sentence-aware token chunker
│ │ ├── embeddings.py # sentence-transformers singleton
│ │ ├── vector_store.py # ChromaDB persistent client
│ │ ├── retrieve.py # semantic search → scored chunks
│ │ ├── generate.py # grounded Claude prompting
│ │ ├── schemas.py # Pydantic request/response models
│ │ └── config.py # pydantic-settings env loader
│ ├── tests/ # 25 tests (4 live)
│ ├── pyproject.toml
│ └── .env.example
├── frontend/
│ ├── src/
│ │ ├── App.tsx # 3-pane shell + chat state
│ │ ├── api/client.ts # typed fetch wrappers
│ │ ├── types.ts # mirrors backend Pydantic models
│ │ └── components/
│ │ ├── UploadPanel.tsx # drag-and-drop + status banners
│ │ ├── ChatPanel.tsx # bubbles, input, source pills
│ │ └── SourcesPanel.tsx # ranked chunks, scores, jump-to highlight
│ ├── package.json
│ └── vite.config.ts # Tailwind v4 + /api proxy
├── docs/
│ └── architecture.md # deeper technical writeup
├── examples/ # sample docs you can upload
├── screenshots/ # README images
└── README.md
```
- Streaming token-by-token responses
- Multi-turn conversation memory
- Multi-tenant data isolation / auth
- Docker + Fly.io / Railway deployment
- Re-ranking step (cross-encoder over the top-k)
- Hybrid search (BM25 + dense)
- Larger embedding model with bigger context window
These are deliberate cuts to keep the project tight and reviewable. Several would make natural follow-up commits.
MIT — see source for details.
