Upload your documents, then chat with them. Answers are grounded in real source text and every claim is cited back to a specific chunk.
A from-scratch retrieval-augmented generation (RAG) application. The interesting parts are not the LLM call — they are the choices around chunking, embedding, retrieval scoring, prompt grounding, and citation UX.
- End-to-end RAG pipeline built without LangChain — every step is visible and explainable
- Token-aware sentence-boundary chunking that respects the embedder's context window
- Local embeddings + local vector store — zero per-query cost on the retrieval side
- Grounded LLM prompting with explicit citation format and "I don't know" refusal behavior
- Real test coverage: 25 tests including 4 live calls to the Anthropic API
- Polished React UI with drag-and-drop upload, click-to-cite source pills, and color-coded relevance scores
- Deploy-ready architecture: stateless API, env-driven config, bring-your-own embeddings
| Layer | Choice | Why |
|---|---|---|
| Backend | FastAPI + Python 3.11+ (uv-managed) | Type-checked, async, automatic OpenAPI docs |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384-dim) | Free, runs on CPU, well-known baseline |
| Vector store | ChromaDB persistent client | No external infrastructure, real cosine search |
| LLM | Anthropic Claude (claude-haiku-4-5 default) | Fast, cheap, strong instruction-following |
| Frontend | Vite + React 19 + TypeScript + Tailwind v4 | Modern tooling, sub-second HMR |
| Tests | pytest (mocked + live) | Fast unit tests + verified real API behavior |
```mermaid
flowchart LR
User[User] -->|"upload PDF"| UI[React UI]
User -->|"chat query"| UI
UI -->|"REST /api/*"| API[FastAPI Backend]
subgraph backend [Backend]
API --> Chunk[Sentence chunker]
Chunk --> Embed[sentence-transformers]
Embed --> Chroma[(ChromaDB)]
API -->|"top-k cosine"| Chroma
API -->|"grounded prompt"| Claude[Anthropic Claude]
end
Claude -->|"answer + citations"| API
API -->|"answer + sources"| UI
```
For the deeper write-up — chunking strategy, scoring math, prompt design, and known limitations — see docs/architecture.md.
You'll need:
- uv (handles Python install automatically)
- Node.js 20+ (`brew install node` on macOS)
- An Anthropic API key
```bash
# 1. Backend
cd backend
cp .env.example .env # edit .env: paste your ANTHROPIC_API_KEY
uv sync # installs Python 3.11 + all deps
uv run uvicorn app.main:app --reload # serves http://127.0.0.1:8000
# 2. Frontend (in a second terminal)
cd frontend
npm install
npm run dev                              # opens http://localhost:5173
```

Drag a PDF or text file from examples/ (or your own) into the left panel, then ask questions about it.
The examples/ folder contains a few short text files you can upload immediately to see the system in action:
- `paris_landmarks.txt` — tests retrieval and refusal across many similar facts
- `rag_intro.txt` — tests the system answering questions about RAG
- `grace_hopper.txt` — tests single-document factual recall
Sample question for paris_landmarks.txt:
"Which Paris landmarks were completed in the 1800s?"
Expected behaviour: Claude pulls only the Eiffel Tower (1889) and Arc de Triomphe (1836), correctly ignoring Notre-Dame (1345) and the Louvre (1793). Source chips at the bottom of the answer let you jump to the exact passages used.
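You can also drive the same flow over HTTP instead of the UI. A sketch of a minimal client — note that the paths `/api/upload` and `/api/chat` are illustrative guesses, not the documented contract; the real route names are defined in `backend/app/main.py`:

```python
# Hypothetical client for the two core routes. The /api/upload and /api/chat
# paths below are assumptions for illustration; check backend/app/main.py
# for the actual route names and request shapes.
import requests

BASE = "http://127.0.0.1:8000"

# Ingest a sample document.
with open("examples/paris_landmarks.txt", "rb") as f:
    r = requests.post(f"{BASE}/api/upload", files={"file": f})
r.raise_for_status()

# Ask a grounded question; the response carries the answer plus cited chunks.
r = requests.post(
    f"{BASE}/api/chat",
    json={"question": "Which Paris landmarks were completed in the 1800s?"},
)
print(r.json())
```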
The full discussion is in docs/architecture.md. Summary:
- Chunk size = 200 tokens, overlap = 30 tokens. The embedder's max input is 256 tokens; going over silently truncates. 200 leaves headroom and produces chunks that read coherently.
- Sentence-boundary chunking rather than fixed-width slicing. Avoids cutting mid-thought, which hurts both embedding quality and human readability of citations (sketched after this list).
- L2-normalized embeddings + cosine distance. Cosine on normalized vectors is mathematically equivalent to dot product but lets ChromaDB use its standard cosine index.
- Bring-your-own embeddings to Chroma. A no-op embedding function is registered with the collection so we always pass embeddings explicitly. Keeps embedding choice fully under our control.
- Numbered citation contract in the system prompt. Each retrieved chunk is prefixed with `[N] (source: filename, p.N)`, and Claude is instructed to cite using the same `[N]` markers. Makes citations verifiable in the UI (see the prompt sketch below).
- Explicit refusal instruction — if the context doesn't answer the question, the model must say so verbatim. Tested with a live integration test.
- Singleton model + Chroma client via `lru_cache` to avoid reloading PyTorch weights on every request.
- Score normalization: cosine distance → similarity = `1 - distance`, displayed as a 0–1 score with green/amber/gray relevance bands in the UI (see the scoring sketch below).
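The chunking mechanics are simple enough to sketch. A minimal version of the sentence-packing approach, assuming a token-counting callable backed by the embedder's tokenizer (the function name and naive sentence regex are illustrative, not the exact `chunking.py`):

```python
import re
from typing import Callable

def chunk_by_sentence(
    text: str,
    count_tokens: Callable[[str], int],   # e.g. len(tokenizer.encode(s))
    max_tokens: int = 200,
    overlap_tokens: int = 30,
) -> list[str]:
    # Naive boundary split; the real chunker may use a smarter detector.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with trailing sentences covering roughly
            # overlap_tokens, so no thought is cut off at a boundary.
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                carried.insert(0, prev)
                carried_tokens += count_tokens(prev)
                if carried_tokens >= overlap_tokens:
                    break
            current, current_tokens = carried, carried_tokens
        # A single sentence longer than max_tokens passes through unsplit here.
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With MiniLM's 256-token limit, `max_tokens=200` leaves headroom for the tokenizer's special tokens and any counting slack.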
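The citation contract is plain string assembly. A sketch, assuming retrieved chunks arrive as dicts with `text`, `source`, and `page` keys — those field names and the instruction wording are assumptions, not the exact `generate.py`:

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> tuple[str, str]:
    # Prefix each chunk with the [N] marker the model is told to cite with.
    context = "\n\n".join(
        f"[{i}] (source: {c['source']}, p.{c['page']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    system = (
        "Answer using ONLY the numbered context below, citing claims with the "
        "matching [N] markers. If the context does not answer the question, "
        "say that you don't know rather than guessing."
    )
    return system, f"Context:\n{context}\n\nQuestion: {question}"
```

Because the UI receives the same chunk list in the same order, a `[2]` in the answer resolves client-side to one exact passage.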
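The caching and scoring decisions each reduce to a few lines. A sketch under the same caveats (the real versions live in `embeddings.py` and `retrieve.py`):

```python
from functools import lru_cache

from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=1)
def get_embedder() -> SentenceTransformer:
    # Built once per process; repeat calls return the cached instance,
    # so PyTorch weights are not reloaded on every request.
    return SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed(texts: list[str]):
    # normalize_embeddings=True L2-normalizes each vector, which makes
    # cosine similarity coincide with the dot product.
    return get_embedder().encode(texts, normalize_embeddings=True)

def to_similarity(cosine_distance: float) -> float:
    # Chroma's cosine space reports distance = 1 - similarity; invert and
    # clamp to the 0-1 range the UI maps onto green/amber/gray bands.
    return max(0.0, min(1.0, 1.0 - cosine_distance))
```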
NotebookLM is the most familiar RAG product right now, but its design tradeoffs are nearly inverted from this project's:
- NotebookLM: big LLM (Gemini, 1M+ context) → big chunks → light retrieval. Mostly stuffs sources into the prompt directly.
- This project: small LLM (Claude Haiku, 200K context) → small chunks (200 tokens) → tight top-4 retrieval.
Both are valid — chunking strategy is downstream of context budget. Going small forces you to make every retrieval choice explicit and visible, which is also why the citations here are precise enough to highlight a single passage. See docs/architecture.md §10 for the full comparison and §11 for the concrete upgrade path (re-ranking, hybrid search, HyDE, parent-document retrieval, long-context swap, etc.).
```bash
cd backend
uv run pytest -v
```

The suite has 25 tests in 4 files:
| File | What it covers | Live API? |
|---|---|---|
| `test_ingest.py` | PDF + text parsing, chunking, embedding storage | No |
| `test_retrieve.py` | Semantic ranking, score range, metadata roundtrip, edge cases | No |
| `test_generate.py` | Prompt construction, citation behavior, grounded answers, refusal | 3 live |
| `test_api.py` | All HTTP endpoints, validation, error paths, full pipeline | 1 live |
Live tests automatically skip when ANTHROPIC_API_KEY is not set, so the suite stays green for anyone cloning the repo.
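One common way to express that skip in pytest — the suite's actual marker or fixture may differ:

```python
import os

import pytest

# Reusable marker: live tests run only when the key is present.
live = pytest.mark.skipif(
    not os.environ.get("ANTHROPIC_API_KEY"),
    reason="ANTHROPIC_API_KEY not set; skipping live Anthropic calls",
)

@live
def test_refuses_when_context_is_irrelevant():
    ...  # calls the real API and asserts the refusal behavior
```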
```
01-rag-document-chat/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app + routes + lifespan
│ │ ├── ingest.py # parse → chunk → embed → store
│ │ ├── chunking.py # sentence-aware token chunker
│ │ ├── embeddings.py # sentence-transformers singleton
│ │ ├── vector_store.py # ChromaDB persistent client
│ │ ├── retrieve.py # semantic search → scored chunks
│ │ ├── generate.py # grounded Claude prompting
│ │ ├── schemas.py # Pydantic request/response models
│ │ └── config.py # pydantic-settings env loader
│ ├── tests/ # 25 tests (4 live)
│ ├── pyproject.toml
│ └── .env.example
├── frontend/
│ ├── src/
│ │ ├── App.tsx # 3-pane shell + chat state
│ │ ├── api/client.ts # typed fetch wrappers
│ │ ├── types.ts # mirrors backend Pydantic models
│ │ └── components/
│ │ ├── UploadPanel.tsx # drag-and-drop + status banners
│ │ ├── ChatPanel.tsx # bubbles, input, source pills
│ │ └── SourcesPanel.tsx # ranked chunks, scores, jump-to highlight
│ ├── package.json
│ └── vite.config.ts # Tailwind v4 + /api proxy
├── docs/
│ └── architecture.md # deeper technical writeup
├── examples/ # sample docs you can upload
├── screenshots/ # README images
└── README.md
```
- Streaming token-by-token responses
- Multi-turn conversation memory
- Multi-tenant data isolation / auth
- Docker + Fly.io / Railway deployment
- Re-ranking step (cross-encoder over the top-k)
- Hybrid search (BM25 + dense)
- Larger embedding model with bigger context window
These are deliberate cuts to keep the project tight and reviewable. Several would make natural follow-up commits.
MIT — see source for details.

