Upload your documents, then chat with them. Answers are grounded in real source text and every claim is cited back to a specific chunk.
A from-scratch retrieval-augmented generation (RAG) application. The interesting parts are not the LLM call — they are the choices around chunking, embedding, retrieval scoring, prompt grounding, and citation UX.
- End-to-end RAG pipeline built without LangChain — every step is visible and explainable
- Token-aware sentence-boundary chunking that respects the embedder's context window
- Local embeddings + local vector store — zero per-query cost on the retrieval side
- Grounded LLM prompting with explicit citation format and "I don't know" refusal behavior
- Real test coverage: 25 tests including 4 live calls to the Anthropic API
- Polished React UI with drag-and-drop upload, click-to-cite source pills, and color-coded relevance scores
- Deploy-ready architecture: stateless API, env-driven config, bring-your-own embeddings
| Layer | Choice | Why |
|---|---|---|
| Backend | FastAPI + Python 3.11+ (uv-managed) | Type-checked, async, automatic OpenAPI docs |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384-dim) | Free, runs on CPU, well-known baseline |
| Vector store | ChromaDB persistent client | No external infrastructure, real cosine search |
| LLM | Anthropic Claude (claude-haiku-4-5 default) | Fast, cheap, strong instruction-following |
| Frontend | Vite + React 19 + TypeScript + Tailwind v4 | Modern tooling, sub-second HMR |
| Tests | pytest (mocked + live) | Fast unit tests + verified real API behavior |
```mermaid
flowchart LR
User[User] -->|"upload PDF"| UI[React UI]
User -->|"chat query"| UI
UI -->|"REST /api/*"| API[FastAPI Backend]
subgraph backend [Backend]
API --> Chunk[Sentence chunker]
Chunk --> Embed[sentence-transformers]
Embed --> Chroma[(ChromaDB)]
API -->|"top-k cosine"| Chroma
API -->|"grounded prompt"| Claude[Anthropic Claude]
end
Claude -->|"answer + citations"| API
API -->|"answer + sources"| UI
```
For the deeper write-up — chunking strategy, scoring math, prompt design, and known limitations — see docs/architecture.md.
You'll need:
- uv (handles Python install automatically)
- Node.js 20+ (`brew install node` on macOS)
- An Anthropic API key
```bash
# 1. Backend
cd backend
cp .env.example .env # edit .env: paste your ANTHROPIC_API_KEY
uv sync # installs Python 3.11 + all deps
uv run uvicorn app.main:app --reload # serves http://127.0.0.1:8000
# 2. Frontend (in a second terminal)
cd frontend
npm install
npm run dev                              # opens http://localhost:5173
```

Drag a PDF or text file from examples/ (or your own) into the left panel, then ask questions about it.
The examples/ folder contains a few short text files you can upload immediately to see the system in action:
- `paris_landmarks.txt` — tests retrieval and refusal across many similar facts
- `rag_intro.txt` — tests the system answering questions about RAG
- `grace_hopper.txt` — tests single-document factual recall
Sample question for paris_landmarks.txt:
"Which Paris landmarks were completed in the 1800s?"
Expected behaviour: Claude pulls only the Eiffel Tower (1889) and Arc de Triomphe (1836), correctly ignoring Notre-Dame (1345) and the Louvre (1793). Source chips at the bottom of the answer let you jump to the exact passages used.
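You can also drive the same flow over HTTP instead of the UI. A sketch of a minimal client — note that the paths `/api/upload` and `/api/chat` are illustrative guesses, not the documented contract; the real route names are defined in `backend/app/main.py`:

```python
# Hypothetical client for the two core routes. The /api/upload and /api/chat
# paths below are assumptions for illustration; check backend/app/main.py
# for the actual route names and request shapes.
import requests

BASE = "http://127.0.0.1:8000"

# Ingest a sample document.
with open("examples/paris_landmarks.txt", "rb") as f:
    r = requests.post(f"{BASE}/api/upload", files={"file": f})
r.raise_for_status()

# Ask a grounded question; the response carries the answer plus cited chunks.
r = requests.post(
    f"{BASE}/api/chat",
    json={"question": "Which Paris landmarks were completed in the 1800s?"},
)
print(r.json())
```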
The full discussion is in docs/architecture.md. Summary:
- Chunk size = 200 tokens, overlap = 30 tokens. The embedder's max input is 256 tokens; going over silently truncates. 200 leaves headroom and produces chunks that read coherently.
- Sentence-boundary chunking rather than fixed-width slicing. Avoids cutting mid-thought, which hurts both embedding quality and human readability of citations (sketched after this list).
- L2-normalized embeddings + cosine distance. Cosine on normalized vectors is mathematically equivalent to dot product but lets ChromaDB use its standard cosine index.
- Bring-your-own embeddings to Chroma. A no-op embedding function is registered with the collection so we always pass embeddings explicitly. Keeps embedding choice fully under our control.
- Numbered citation contract in the system prompt. Each retrieved chunk is prefixed with `[N] (source: filename, p.N)`, and Claude is instructed to cite using the same `[N]` markers. Makes citations verifiable in the UI (see the prompt sketch below).
- Explicit refusal instruction — if the context doesn't answer the question, the model must say so verbatim. Tested with a live integration test.
- Singleton model + Chroma client via `lru_cache` to avoid reloading PyTorch weights on every request.
- Score normalization: cosine distance → similarity = `1 - distance`, displayed as a 0–1 score with green/amber/gray relevance bands in the UI (see the scoring sketch below).
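The chunking mechanics are simple enough to sketch. A minimal version of the sentence-packing approach, assuming a token-counting callable backed by the embedder's tokenizer (the function name and naive sentence regex are illustrative, not the exact `chunking.py`):

```python
import re
from typing import Callable

def chunk_by_sentence(
    text: str,
    count_tokens: Callable[[str], int],   # e.g. len(tokenizer.encode(s))
    max_tokens: int = 200,
    overlap_tokens: int = 30,
) -> list[str]:
    # Naive boundary split; the real chunker may use a smarter detector.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with trailing sentences covering roughly
            # overlap_tokens, so no thought is cut off at a boundary.
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                carried.insert(0, prev)
                carried_tokens += count_tokens(prev)
                if carried_tokens >= overlap_tokens:
                    break
            current, current_tokens = carried, carried_tokens
        # A single sentence longer than max_tokens passes through unsplit here.
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With MiniLM's 256-token limit, `max_tokens=200` leaves headroom for the tokenizer's special tokens and any counting slack.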
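The citation contract is plain string assembly. A sketch, assuming retrieved chunks arrive as dicts with `text`, `source`, and `page` keys — those field names and the instruction wording are assumptions, not the exact `generate.py`:

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> tuple[str, str]:
    # Prefix each chunk with the [N] marker the model is told to cite with.
    context = "\n\n".join(
        f"[{i}] (source: {c['source']}, p.{c['page']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    system = (
        "Answer using ONLY the numbered context below, citing claims with the "
        "matching [N] markers. If the context does not answer the question, "
        "say that you don't know rather than guessing."
    )
    return system, f"Context:\n{context}\n\nQuestion: {question}"
```

Because the UI receives the same chunk list in the same order, a `[2]` in the answer resolves client-side to one exact passage.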
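The caching and scoring decisions each reduce to a few lines. A sketch under the same caveats (the real versions live in `embeddings.py` and `retrieve.py`):

```python
from functools import lru_cache

from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=1)
def get_embedder() -> SentenceTransformer:
    # Built once per process; repeat calls return the cached instance,
    # so PyTorch weights are not reloaded on every request.
    return SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed(texts: list[str]):
    # normalize_embeddings=True L2-normalizes each vector, which makes
    # cosine similarity coincide with the dot product.
    return get_embedder().encode(texts, normalize_embeddings=True)

def to_similarity(cosine_distance: float) -> float:
    # Chroma's cosine space reports distance = 1 - similarity; invert and
    # clamp to the 0-1 range the UI maps onto green/amber/gray bands.
    return max(0.0, min(1.0, 1.0 - cosine_distance))
```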
NotebookLM is the most familiar RAG product right now, but its design tradeoffs are nearly inverted from this project's:
- NotebookLM: big LLM (Gemini, 1M+ context) → big chunks → light retrieval. Mostly stuffs sources into the prompt directly.
- This project: small LLM (Claude Haiku, 200K context) → small chunks (200 tokens) → tight top-4 retrieval.
Both are valid — chunking strategy is downstream of context budget. Going small forces you to make every retrieval choice explicit and visible, which is also why the citations here are precise enough to highlight a single passage. See docs/architecture.md §10 for the full comparison and §11 for the concrete upgrade path (re-ranking, hybrid search, HyDE, parent-document retrieval, long-context swap, etc.).
```bash
cd backend
uv run pytest -v
```

The suite has 25 tests in 4 files:
| File | What it covers | Live API? |
|---|---|---|
| `test_ingest.py` | PDF + text parsing, chunking, embedding storage | No |
| `test_retrieve.py` | Semantic ranking, score range, metadata roundtrip, edge cases | No |
| `test_generate.py` | Prompt construction, citation behavior, grounded answers, refusal | 3 live |
| `test_api.py` | All HTTP endpoints, validation, error paths, full pipeline | 1 live |
Live tests automatically skip when ANTHROPIC_API_KEY is not set, so the suite stays green for anyone cloning the repo.
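One common way to express that skip in pytest — the suite's actual marker or fixture may differ:

```python
import os

import pytest

# Reusable marker: live tests run only when the key is present.
live = pytest.mark.skipif(
    not os.environ.get("ANTHROPIC_API_KEY"),
    reason="ANTHROPIC_API_KEY not set; skipping live Anthropic calls",
)

@live
def test_refuses_when_context_is_irrelevant():
    ...  # calls the real API and asserts the refusal behavior
```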
```
01-rag-document-chat/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app + routes + lifespan
│ │ ├── ingest.py # parse → chunk → embed → store
│ │ ├── chunking.py # sentence-aware token chunker
│ │ ├── embeddings.py # sentence-transformers singleton
│ │ ├── vector_store.py # ChromaDB persistent client
│ │ ├── retrieve.py # semantic search → scored chunks
│ │ ├── generate.py # grounded Claude prompting
│ │ ├── schemas.py # Pydantic request/response models
│ │ └── config.py # pydantic-settings env loader
│ ├── tests/ # 25 tests (4 live)
│ ├── pyproject.toml
│ └── .env.example
├── frontend/
│ ├── src/
│ │ ├── App.tsx # 3-pane shell + chat state
│ │ ├── api/client.ts # typed fetch wrappers
│ │ ├── types.ts # mirrors backend Pydantic models
│ │ └── components/
│ │ ├── UploadPanel.tsx # drag-and-drop + status banners
│ │ ├── ChatPanel.tsx # bubbles, input, source pills
│ │ └── SourcesPanel.tsx # ranked chunks, scores, jump-to highlight
│ ├── package.json
│ └── vite.config.ts # Tailwind v4 + /api proxy
├── docs/
│ └── architecture.md # deeper technical writeup
├── examples/ # sample docs you can upload
├── screenshots/ # README images
└── README.md
```
- Streaming token-by-token responses
- Multi-turn conversation memory
- Multi-tenant data isolation / auth
- Docker + Fly.io / Railway deployment
- Re-ranking step (cross-encoder over the top-k)
- Hybrid search (BM25 + dense)
- Larger embedding model with bigger context window
These are deliberate cuts to keep the project tight and reviewable. Several would make natural follow-up commits.
MIT — see source for details.

