This document explains why the system is built the way it is, and what the alternatives would have been. The top-level README covers what it does and how to run it.
A RAG system is really two cooperating pipelines.
```mermaid
flowchart LR
    File[PDF / TXT / MD] --> Parse[pypdf / file read]
    Parse --> Chunker[Sentence-aware<br/>token chunker]
    Chunker --> Embed[sentence-transformers<br/>all-MiniLM-L6-v2]
    Embed --> Norm[L2 normalize]
    Norm --> Chroma[(ChromaDB)]
```
```mermaid
flowchart LR
    Q[User question] --> EmbedQ[Embed query]
    EmbedQ --> Search[ChromaDB<br/>cosine top-k]
    Search --> Score["distance → similarity"]
    Score --> Prompt["Build system + user<br/>prompt with [N] markers"]
    Prompt --> Claude[Anthropic Claude]
    Claude --> Answer["Answer with [N] citations"]
    Answer --> UI["UI renders pills →<br/>click jumps to source"]
```
The interesting choices live at the boundaries: chunking, scoring, and the prompt contract.
Choice: sentence-boundary chunks of ~200 tokens with ~30 tokens of overlap.
The naive baseline is fixed-width slicing — e.g. every 500 characters. That's bad for two reasons:
- It cuts mid-sentence. Embeddings of half-sentences are noisier and citations look ugly in the UI ("...the protocol assumes that all messages a").
- It ignores the embedder's token budget. `all-MiniLM-L6-v2` has a hard 256-token max input; anything longer is silently truncated by the tokenizer, meaning the latter half of an oversized chunk contributes nothing to the embedding but is still returned at retrieval time.
The implementation in app/chunking.py:
- Splits the document into sentences using a lightweight regex (`(?<=[.!?])\s+`). Good enough for English prose; I'd swap in `pysbd` or a real sentence tokenizer for production.
- Greedily packs sentences into a chunk while measuring tokens with the embedder's own tokenizer (see the sketch after this list).
- When adding a sentence would exceed `chunk_size` (200), starts a new chunk.
- Maintains `chunk_overlap` (30 tokens) of trailing context from the previous chunk so a query that spans a chunk boundary still has a reasonable chance of matching both.
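A condensed sketch of that greedy packing loop. `chunk_text` and its signature are illustrative rather than the exact API of `app/chunking.py`, and `tokenizer` is assumed to be the embedder's Hugging Face tokenizer:

```python
import re

SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def chunk_text(text: str, tokenizer, chunk_size: int = 200, chunk_overlap: int = 30) -> list[str]:
    """Greedily pack whole sentences into ~chunk_size-token chunks with overlap."""
    sentences = SENTENCE_SPLIT.split(text)
    chunks: list[str] = []
    current: list[str] = []   # sentences in the chunk being built
    current_tokens = 0

    for sentence in sentences:
        n_tokens = len(tokenizer.encode(sentence, add_special_tokens=False))
        if current and current_tokens + n_tokens > chunk_size:
            chunks.append(" ".join(current))
            # Carry roughly chunk_overlap tokens of trailing context into the next chunk.
            overlap, overlap_tokens = [], 0
            for prev in reversed(current):
                overlap_tokens += len(tokenizer.encode(prev, add_special_tokens=False))
                overlap.insert(0, prev)
                if overlap_tokens >= chunk_overlap:
                    break
            current, current_tokens = overlap, overlap_tokens
        current.append(sentence)
        current_tokens += n_tokens

    if current:
        chunks.append(" ".join(current))
    return chunks
```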
What I'd add for v2: semantic chunking (cluster adjacent sentences by embedding similarity) or a sliding window over Markdown headings.
Choice: sentence-transformers/all-MiniLM-L6-v2, L2-normalized.
| Property | Value |
|---|---|
| Output dim | 384 |
| Max input | 256 tokens |
| Speed | ~14k sentences/sec on CPU |
| Cost | $0 (runs locally) |
This is the de facto baseline for "small, fast, decent quality." It's not state-of-the-art — OpenAI's `text-embedding-3-small` beats it on most benchmarks — but:
- It's free per query (the LLM call is the only paid hop).
- It runs on a laptop without a GPU.
- It's a recognizable choice for reviewers.
The embeddings are L2-normalized in app/embeddings.py so cosine distance equals 1 - dot product. This lets us use ChromaDB's standard cosine index without any custom math.
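In code, that amounts to something like the following (a minimal sketch; `app/embeddings.py` may wrap this differently):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# normalize_embeddings=True L2-normalizes each vector, so for two texts a and b:
#   cosine_similarity(a, b) == dot(a, b)   and   cosine_distance == 1 - dot(a, b)
vectors = model.encode(
    ["The Eiffel Tower was completed in 1889."],
    normalize_embeddings=True,
)
print(vectors.shape)  # (1, 384)
```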
Choice: ChromaDB persistent client, configured for cosine distance.
Why not Pinecone / Weaviate / Qdrant?
- Zero infrastructure for the demo — data lives in `./chroma_data/` as SQLite + parquet.
- Real cosine search with HNSW indexing, not a toy.
- The same Python API works against a hosted Chroma later if needed.
One nuance: Chroma wants to own the embedding function. We override that with a no-op embedder (app/vector_store.py):
```python
from chromadb import Documents, EmbeddingFunction, Embeddings

class _NoopEmbeddingFunction(EmbeddingFunction[Documents]):
    def __call__(self, _: Documents) -> Embeddings:
        # Never compute embeddings implicitly; force callers to pass them in.
        raise RuntimeError(
            "embeddings must be supplied explicitly via add(embeddings=...)"
        )
```

This guarantees we always pass embeddings explicitly and prevents Chroma from quietly downloading and using its default model behind our back.
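Wired together, the vector-store side looks roughly like this; the collection name and metadata fields are assumptions, and `vectors` refers to the normalized embeddings from the sketch above:

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(
    name="documents",                      # assumed collection name
    metadata={"hnsw:space": "cosine"},     # cosine distance instead of the L2 default
    embedding_function=_NoopEmbeddingFunction(),
)

# Embeddings are always computed by our own model and passed in explicitly; if a
# code path forgets, the no-op embedder raises instead of silently falling back
# to Chroma's bundled default model.
collection.add(
    ids=["paris_landmarks.txt::0"],
    documents=["The Eiffel Tower was completed in 1889 for the World's Fair..."],
    embeddings=vectors.tolist(),
    metadatas=[{"source": "paris_landmarks.txt", "page": 1, "chunk_index": 0}],
)
```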
app/retrieve.py does three things:
- Embed the query (a single forward pass through the same `all-MiniLM-L6-v2` model).
- Call `collection.query(query_embeddings=..., n_results=top_k)` with a default `top_k=4`.
- Convert ChromaDB's cosine distance back into a more intuitive similarity score: `similarity = 1.0 - distance`. That maps identical vectors to 1.0 and orthogonal ones to 0.0 (negative values are possible in principle but rare for text embeddings). The UI then bins this into:
  - `> 0.5`: green "strong match"
  - `0.35 – 0.5`: amber "partial match"
  - `< 0.35`: gray "weak match"
These thresholds are calibrated empirically against all-MiniLM-L6-v2. A different embedder would need recalibration.
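Condensed, the retrieval path looks roughly like this; `retrieve` and `similarity_label` are illustrative names, reusing the `model` and `collection` objects from the sketches above:

```python
def retrieve(question: str, top_k: int = 4) -> list[dict]:
    query_vec = model.encode([question], normalize_embeddings=True)
    result = collection.query(query_embeddings=query_vec.tolist(), n_results=top_k)

    chunks = []
    for doc, meta, distance in zip(
        result["documents"][0], result["metadatas"][0], result["distances"][0]
    ):
        similarity = 1.0 - distance  # cosine distance -> similarity
        chunks.append({"text": doc, "metadata": meta, "similarity": similarity})
    return chunks

def similarity_label(similarity: float) -> str:
    """Mirror of the UI's binning thresholds, calibrated for all-MiniLM-L6-v2."""
    if similarity > 0.5:
        return "strong match"
    if similarity >= 0.35:
        return "partial match"
    return "weak match"
```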
This is the part most RAG demos get wrong.
Two failure modes to design against:
- Hallucination. The model invents facts that "sound right" but aren't in the documents.
- Sycophancy. The model answers the question even when there's no supporting context, because it doesn't want to disappoint.
The system prompt in app/generate.py addresses both:
```
You are a careful assistant that answers questions using ONLY the
provided context. Rules:
1. Use only the information in the context. Never use outside knowledge.
2. If the context doesn't contain the answer, say exactly:
   "I don't have enough information in the provided documents to answer that."
3. Cite sources inline using the [N] markers from the context.
4. Be concise.
```
Each retrieved chunk is wrapped with a numbered marker:
```
[1] (source: paris_landmarks.txt, p.1)
The Eiffel Tower was completed in 1889 for the World's Fair...

[2] (source: paris_landmarks.txt, p.1)
Notre-Dame Cathedral was completed in 1345...
```
Claude then writes answers like "The Eiffel Tower was completed in 1889 [1]". The frontend parses [N] markers and renders them as clickable pills that highlight the corresponding source on the right.
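A sketch of how that contract could be assembled and sent to Claude; `generate_answer` and `SYSTEM_PROMPT` are illustrative names rather than the exact code in `app/generate.py`:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_answer(question: str, chunks: list[dict]) -> str:
    # Number each retrieved chunk so the model can cite it as [N].
    context = "\n\n".join(
        f"[{i}] (source: {c['metadata']['source']}, p.{c['metadata'].get('page', '-')})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"

    response = client.messages.create(
        model="claude-haiku-4-5",          # model name from the latency table below
        max_tokens=1024,
        system=SYSTEM_PROMPT,              # the rules shown above
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text
```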
There are 3 live tests in tests/test_generate.py that hit the real Claude API and verify:
- A grounded answer correctly uses `[N]` citations.
- The model refuses with the exact required string when the context is irrelevant.
- The model doesn't leak outside knowledge even when the question is about a topic it definitely knows.
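The refusal case, for example, could look roughly like this; the fixture text and the `generate_answer` helper (from the sketch above) are illustrative, not the actual test code:

```python
REFUSAL = "I don't have enough information in the provided documents to answer that."

def test_refuses_when_context_is_irrelevant():
    # Context about Paris landmarks, question about something else entirely.
    chunks = [{
        "text": "The Eiffel Tower was completed in 1889 for the World's Fair.",
        "metadata": {"source": "paris_landmarks.txt", "page": 1, "chunk_index": 0},
    }]
    answer = generate_answer("What is the boiling point of nitrogen?", chunks)
    assert answer.strip() == REFUSAL
```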
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/health` | Live status, current model, indexed chunk count |
| `POST` | `/upload` | Multipart file upload → ingest pipeline |
| `POST` | `/chat` | `{question}` → `{answer, sources[]}` |
| `GET` | `/documents` | Per-source chunk counts |
| `DELETE` | `/documents` | Clear the entire collection |
Validation: 10 MB upload cap, allow-listed extensions (.pdf, .txt, .md), Pydantic-enforced request bodies.
CORS is env-driven (ALLOWED_ORIGINS) so the same code works for local dev and any deployment target.
Startup/shutdown uses FastAPI's modern lifespan context manager to (a) ensure the persist dir exists, (b) warm the Chroma collection, and (c) tidy temp upload files on exit.
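Roughly what that wiring looks like; `get_collection`, `cleanup_temp_uploads`, and the default origin are placeholders:

```python
import os
from contextlib import asynccontextmanager
from pathlib import Path

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

@asynccontextmanager
async def lifespan(app: FastAPI):
    Path("./chroma_data").mkdir(parents=True, exist_ok=True)  # (a) ensure persist dir exists
    app.state.collection = get_collection()                   # (b) warm the Chroma collection
    yield
    cleanup_temp_uploads()                                     # (c) tidy temp upload files on exit

app = FastAPI(lifespan=lifespan)
app.add_middleware(
    CORSMiddleware,
    # ALLOWED_ORIGINS is a comma-separated env var; the fallback here is a placeholder.
    allow_origins=os.environ.get("ALLOWED_ORIGINS", "http://localhost:5173").split(","),
    allow_methods=["*"],
    allow_headers=["*"],
)
```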
Three-pane layout:
```
┌──────────────┬──────────────────────────┬──────────────┐
│ Upload /     │ Chat                     │ Sources      │
│ Documents    │ • bubbles                │ • [N] pills  │
│ • drag-drop  │ • input + send           │ • scores     │
│ • banners    │ • source citation pills  │ • flash hl   │
└──────────────┴──────────────────────────┴──────────────┘
```
Why a 3-pane layout instead of a single chat column? Because the value of citations only lands when you can see them next to the conversation. Hiding sources in a popover lets the user forget to verify; showing them inline forces engagement.
Source identity is computed once via:
```ts
function sourceKey(c: RetrievedChunk): string {
  return `${c.metadata.source}::${c.metadata.page ?? '-'}::${c.metadata.chunk_index}`
}
```

This same key is used for React `key` props, for deduplication when prepending newly cited chunks to the panel, and for the click-to-highlight handler: a single source of truth for chunk identity across components.
Measured on an M1 MacBook Air:
| Operation | Latency |
|---|---|
| Embed one query | ~30 ms |
| Top-4 retrieval over 1000 chunks | <10 ms |
| Claude `claude-haiku-4-5` round-trip | 600–1200 ms |
| Full `/chat` request, end-to-end | ~1 s |
| Ingest a 20-page PDF | 2–4 s (one-time, dominated by PDF parse + embed) |
The embedder is loaded once on first request via lru_cache — the first query after server start pays a ~2 s warmup cost.
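A minimal sketch of that lazy loading; `get_embedder` is an assumed name:

```python
from functools import lru_cache
from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=1)
def get_embedder() -> SentenceTransformer:
    # Loaded on the first call (the ~2 s warmup); later calls return the cached instance.
    return SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
```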
- Streaming: switch `/chat` to SSE so the answer appears token-by-token. The frontend is already structured to consume chunks.
- Re-ranking: pull `top_k=20` from the vector store, then run a small cross-encoder (e.g. `ms-marco-MiniLM`) to re-score and keep the top 4. Better recall for ambiguous queries (see the sketch after this list).
- Hybrid search: BM25 over the same chunks merged with dense retrieval. Helps on queries where the user uses the document's exact wording.
- Per-document filtering: let the user say "only search inside paris_landmarks.txt".
- Conversation memory: rewrite follow-up questions to be self-contained before retrieval (a small LLM call upfront).
- Eval harness: a fixed Q&A set per sample document, scored automatically against expected citations. Currently I verify by hand.
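For the re-ranking item, a sketch of what the cross-encoder step could look like (not implemented yet; names are illustrative):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, chunks: list[dict], keep: int = 4) -> list[dict]:
    # Score each (query, chunk text) pair with the cross-encoder, then keep the best few.
    scores = reranker.predict([(question, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```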
- LangChain / LlamaIndex — the whole pipeline fits in ~600 lines of readable Python and there's nothing to gain by hiding it.
- An auth layer — this is a single-user demo. Adding auth without a real multi-tenant data model would just be cosplay.
- Docker — `uv` and `npm` give a fast enough dev loop; containerization is a deployment concern and lives in the roadmap.