This document explains why the system is built the way it is, and what the alternatives would have been. The top-level README covers what it does and how to run it.
A RAG system is really two cooperating pipelines.
```mermaid
flowchart LR
    File[PDF / TXT / MD] --> Parse[pypdf / file read]
    Parse --> Chunker[Sentence-aware<br/>token chunker]
    Chunker --> Embed[sentence-transformers<br/>all-MiniLM-L6-v2]
    Embed --> Norm[L2 normalize]
    Norm --> Chroma[(ChromaDB)]
```
```mermaid
flowchart LR
    Q[User question] --> EmbedQ[Embed query]
    EmbedQ --> Search[ChromaDB<br/>cosine top-k]
    Search --> Score["distance → similarity"]
    Score --> Prompt["Build system + user<br/>prompt with [N] markers"]
    Prompt --> Claude[Anthropic Claude]
    Claude --> Answer["Answer with [N] citations"]
    Answer --> UI["UI renders pills →<br/>click jumps to source"]
```
The interesting choices live at the boundaries: chunking, scoring, and the prompt contract.
Choice: sentence-boundary chunks of ~200 tokens with ~30 tokens of overlap.
The naive baseline is fixed-width slicing — e.g. every 500 characters. That's bad for two reasons:
- It cuts mid-sentence. Embeddings of half-sentences are noisier and citations look ugly in the UI ("...the protocol assumes that all messages a").
- It ignores the embedder's token budget. `all-MiniLM-L6-v2` has a hard 256-token max input; anything longer is silently truncated by the tokenizer, meaning the latter half of an oversized chunk contributes nothing to the embedding but is still returned at retrieval time.
The implementation in app/chunking.py:
- Splits the document into sentences using a lightweight regex (`(?<=[.!?])\s+`). Good enough for English prose; I'd swap in `pysbd` or a real sentence tokenizer for production.
- Greedily packs sentences into a chunk while measuring tokens with the embedder's own tokenizer (see the sketch after this list).
- When adding a sentence would exceed `chunk_size` (200), starts a new chunk.
- Maintains `chunk_overlap` (30 tokens) of trailing context from the previous chunk so a query that spans a chunk boundary still has a reasonable chance of matching both.
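A condensed sketch of that greedy packing loop. `chunk_text` and its signature are illustrative rather than the exact API of `app/chunking.py`, and `tokenizer` is assumed to be the embedder's Hugging Face tokenizer:

```python
import re

SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def chunk_text(text: str, tokenizer, chunk_size: int = 200, chunk_overlap: int = 30) -> list[str]:
    """Greedily pack whole sentences into ~chunk_size-token chunks with overlap."""
    sentences = SENTENCE_SPLIT.split(text)
    chunks: list[str] = []
    current: list[str] = []   # sentences in the chunk being built
    current_tokens = 0

    for sentence in sentences:
        n_tokens = len(tokenizer.encode(sentence, add_special_tokens=False))
        if current and current_tokens + n_tokens > chunk_size:
            chunks.append(" ".join(current))
            # Carry roughly chunk_overlap tokens of trailing context into the next chunk.
            overlap, overlap_tokens = [], 0
            for prev in reversed(current):
                overlap_tokens += len(tokenizer.encode(prev, add_special_tokens=False))
                overlap.insert(0, prev)
                if overlap_tokens >= chunk_overlap:
                    break
            current, current_tokens = overlap, overlap_tokens
        current.append(sentence)
        current_tokens += n_tokens

    if current:
        chunks.append(" ".join(current))
    return chunks
```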
What I'd add for v2: semantic chunking (cluster adjacent sentences by embedding similarity) or a sliding window over Markdown headings.
Choice: sentence-transformers/all-MiniLM-L6-v2, L2-normalized.
| Property | Value |
|---|---|
| Output dim | 384 |
| Max input | 256 tokens |
| Speed | ~14k sentences/sec on CPU |
| Cost | $0 (runs locally) |
This is the de facto baseline for "small, fast, decent quality." It's not state-of-the-art — OpenAI's `text-embedding-3-small` beats it on most benchmarks — but:
- It's free per query (the LLM call is the only paid hop).
- It runs on a laptop without a GPU.
- It's a recognizable choice for reviewers.
The embeddings are L2-normalized in app/embeddings.py so cosine distance equals 1 - dot product. This lets us use ChromaDB's standard cosine index without any custom math.
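In code, that amounts to something like the following (a minimal sketch; `app/embeddings.py` may wrap this differently):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# normalize_embeddings=True L2-normalizes each vector, so for two texts a and b:
#   cosine_similarity(a, b) == dot(a, b)   and   cosine_distance == 1 - dot(a, b)
vectors = model.encode(
    ["The Eiffel Tower was completed in 1889."],
    normalize_embeddings=True,
)
print(vectors.shape)  # (1, 384)
```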
Choice: ChromaDB persistent client, configured for cosine distance.
Why not Pinecone / Weaviate / Qdrant?
- Zero infrastructure for the demo — data lives in `./chroma_data/` as SQLite + parquet.
- Real cosine search with HNSW indexing, not a toy.
- The same Python API works against a hosted Chroma later if needed.
One nuance: Chroma wants to own the embedding function. We override that with a no-op embedder (app/vector_store.py):
```python
from chromadb import Documents, EmbeddingFunction, Embeddings

class _NoopEmbeddingFunction(EmbeddingFunction[Documents]):
    def __call__(self, _: Documents) -> Embeddings:
        # Never compute embeddings implicitly; force callers to pass them in.
        raise RuntimeError(
            "embeddings must be supplied explicitly via add(embeddings=...)"
        )
```

This guarantees we always pass embeddings explicitly and prevents Chroma from quietly downloading and using its default model behind our back.
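Wired together, the vector-store side looks roughly like this; the collection name and metadata fields are assumptions, and `vectors` refers to the normalized embeddings from the sketch above:

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(
    name="documents",                      # assumed collection name
    metadata={"hnsw:space": "cosine"},     # cosine distance instead of the L2 default
    embedding_function=_NoopEmbeddingFunction(),
)

# Embeddings are always computed by our own model and passed in explicitly; if a
# code path forgets, the no-op embedder raises instead of silently falling back
# to Chroma's bundled default model.
collection.add(
    ids=["paris_landmarks.txt::0"],
    documents=["The Eiffel Tower was completed in 1889 for the World's Fair..."],
    embeddings=vectors.tolist(),
    metadatas=[{"source": "paris_landmarks.txt", "page": 1, "chunk_index": 0}],
)
```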
app/retrieve.py does three things:
- Embed the query (a single forward pass through the same `all-MiniLM-L6-v2` model).
- Call `collection.query(query_embeddings=..., n_results=top_k)` with a default `top_k=4`.
- Convert ChromaDB's cosine distance back into a more intuitive similarity score: `similarity = 1.0 - distance`. That maps identical vectors to 1.0 and orthogonal ones to 0.0 (negative values are possible in principle but rare for text embeddings). The UI then bins this into:
  - `> 0.5`: green "strong match"
  - `0.35 – 0.5`: amber "partial match"
  - `< 0.35`: gray "weak match"
These thresholds are calibrated empirically against all-MiniLM-L6-v2. A different embedder would need recalibration.
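Condensed, the retrieval path looks roughly like this; `retrieve` and `similarity_label` are illustrative names, reusing the `model` and `collection` objects from the sketches above:

```python
def retrieve(question: str, top_k: int = 4) -> list[dict]:
    query_vec = model.encode([question], normalize_embeddings=True)
    result = collection.query(query_embeddings=query_vec.tolist(), n_results=top_k)

    chunks = []
    for doc, meta, distance in zip(
        result["documents"][0], result["metadatas"][0], result["distances"][0]
    ):
        similarity = 1.0 - distance  # cosine distance -> similarity
        chunks.append({"text": doc, "metadata": meta, "similarity": similarity})
    return chunks

def similarity_label(similarity: float) -> str:
    """Mirror of the UI's binning thresholds, calibrated for all-MiniLM-L6-v2."""
    if similarity > 0.5:
        return "strong match"
    if similarity >= 0.35:
        return "partial match"
    return "weak match"
```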
This is the part most RAG demos get wrong.
Two failure modes to design against:
- Hallucination. The model invents facts that "sound right" but aren't in the documents.
- Sycophancy. The model answers the question even when there's no supporting context, because it doesn't want to disappoint.
The system prompt in app/generate.py addresses both:
```
You are a careful assistant that answers questions using ONLY the
provided context. Rules:
1. Use only the information in the context. Never use outside knowledge.
2. If the context doesn't contain the answer, say exactly:
   "I don't have enough information in the provided documents to answer that."
3. Cite sources inline using the [N] markers from the context.
4. Be concise.
```
Each retrieved chunk is wrapped with a numbered marker:
```
[1] (source: paris_landmarks.txt, p.1)
The Eiffel Tower was completed in 1889 for the World's Fair...

[2] (source: paris_landmarks.txt, p.1)
Notre-Dame Cathedral was completed in 1345...
```
Claude then writes answers like "The Eiffel Tower was completed in 1889 [1]". The frontend parses [N] markers and renders them as clickable pills that highlight the corresponding source on the right.
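A sketch of how that contract could be assembled and sent to Claude; `generate_answer` and `SYSTEM_PROMPT` are illustrative names rather than the exact code in `app/generate.py`:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_answer(question: str, chunks: list[dict]) -> str:
    # Number each retrieved chunk so the model can cite it as [N].
    context = "\n\n".join(
        f"[{i}] (source: {c['metadata']['source']}, p.{c['metadata'].get('page', '-')})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"

    response = client.messages.create(
        model="claude-haiku-4-5",          # model name from the latency table below
        max_tokens=1024,
        system=SYSTEM_PROMPT,              # the rules shown above
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text
```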
There are 3 live tests in tests/test_generate.py that hit the real Claude API and verify:
- A grounded answer correctly uses `[N]` citations.
- The model refuses with the exact required string when the context is irrelevant.
- The model doesn't leak outside knowledge even when the question is about a topic it definitely knows.
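The refusal case, for example, could look roughly like this; the fixture text and the `generate_answer` helper (from the sketch above) are illustrative, not the actual test code:

```python
REFUSAL = "I don't have enough information in the provided documents to answer that."

def test_refuses_when_context_is_irrelevant():
    # Context about Paris landmarks, question about something else entirely.
    chunks = [{
        "text": "The Eiffel Tower was completed in 1889 for the World's Fair.",
        "metadata": {"source": "paris_landmarks.txt", "page": 1, "chunk_index": 0},
    }]
    answer = generate_answer("What is the boiling point of nitrogen?", chunks)
    assert answer.strip() == REFUSAL
```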
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/health` | Live status, current model, indexed chunk count |
| `POST` | `/upload` | Multipart file upload → ingest pipeline |
| `POST` | `/chat` | `{question}` → `{answer, sources[]}` |
| `GET` | `/documents` | Per-source chunk counts |
| `DELETE` | `/documents` | Clear the entire collection |
Validation: 10 MB upload cap, allow-listed extensions (.pdf, .txt, .md), Pydantic-enforced request bodies.
CORS is env-driven (ALLOWED_ORIGINS) so the same code works for local dev and any deployment target.
Startup/shutdown uses FastAPI's modern lifespan context manager to (a) ensure the persist dir exists, (b) warm the Chroma collection, and (c) tidy temp upload files on exit.
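Roughly what that wiring looks like; `get_collection`, `cleanup_temp_uploads`, and the default origin are placeholders:

```python
import os
from contextlib import asynccontextmanager
from pathlib import Path

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

@asynccontextmanager
async def lifespan(app: FastAPI):
    Path("./chroma_data").mkdir(parents=True, exist_ok=True)  # (a) ensure persist dir exists
    app.state.collection = get_collection()                   # (b) warm the Chroma collection
    yield
    cleanup_temp_uploads()                                     # (c) tidy temp upload files on exit

app = FastAPI(lifespan=lifespan)
app.add_middleware(
    CORSMiddleware,
    # ALLOWED_ORIGINS is a comma-separated env var; the fallback here is a placeholder.
    allow_origins=os.environ.get("ALLOWED_ORIGINS", "http://localhost:5173").split(","),
    allow_methods=["*"],
    allow_headers=["*"],
)
```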
Three-pane layout:
```
┌──────────────┬──────────────────────────┬──────────────┐
│ Upload /     │ Chat                     │ Sources      │
│ Documents    │ • bubbles                │ • [N] pills  │
│ • drag-drop  │ • input + send           │ • scores     │
│ • banners    │ • source citation pills  │ • flash hl   │
└──────────────┴──────────────────────────┴──────────────┘
```
Why a 3-pane layout instead of a single chat column? Because the value of citations only lands when you can see them next to the conversation. Hiding sources in a popover lets the user forget to verify; showing them inline forces engagement.
Source identity is computed once via:
```ts
function sourceKey(c: RetrievedChunk): string {
  return `${c.metadata.source}::${c.metadata.page ?? '-'}::${c.metadata.chunk_index}`
}
```

This same key is used for React `key` props, for deduplication when prepending newly cited chunks to the panel, and for the click-to-highlight handler: a single source of truth for chunk identity across components.
Measured on an M1 MacBook Air:
| Operation | Latency |
|---|---|
| Embed one query | ~30 ms |
| Top-4 retrieval over 1000 chunks | <10 ms |
| Claude `claude-haiku-4-5` round-trip | 600–1200 ms |
| Full `/chat` request, end-to-end | ~1 s |
| Ingest a 20-page PDF | 2–4 s (one-time, dominated by PDF parse + embed) |
The embedder is loaded once on first request via lru_cache — the first query after server start pays a ~2 s warmup cost.
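A minimal sketch of that lazy loading; `get_embedder` is an assumed name:

```python
from functools import lru_cache
from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=1)
def get_embedder() -> SentenceTransformer:
    # Loaded on the first call (the ~2 s warmup); later calls return the cached instance.
    return SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
```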
- Streaming: switch `/chat` to SSE so the answer appears token-by-token. The frontend is already structured to consume chunks.
- Re-ranking: pull `top_k=20` from the vector store, then run a small cross-encoder (e.g. `ms-marco-MiniLM`) to re-score and keep the top 4. Better recall for ambiguous queries (see the sketch after this list).
- Hybrid search: BM25 over the same chunks merged with dense retrieval. Helps on queries where the user uses the document's exact wording.
- Per-document filtering: let the user say "only search inside paris_landmarks.txt".
- Conversation memory: rewrite follow-up questions to be self-contained before retrieval (a small LLM call upfront).
- Eval harness: a fixed Q&A set per sample document, scored automatically against expected citations. Currently I verify by hand.
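For the re-ranking item, a sketch of what the cross-encoder step could look like (not implemented yet; names are illustrative):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, chunks: list[dict], keep: int = 4) -> list[dict]:
    # Score each (query, chunk text) pair with the cross-encoder, then keep the best few.
    scores = reranker.predict([(question, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```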
- LangChain / LlamaIndex — the whole pipeline fits in ~600 lines of readable Python and there's nothing to gain by hiding it.
- An auth layer — this is a single-user demo. Adding auth without a real multi-tenant data model would just be cosplay.
- Docker — `uv` and `npm` give a fast enough dev loop; containerization is a deployment concern and lives in the roadmap.