feat: configurable FTS5 tokenizer via QMD_FTS_TOKENIZER #623

Open
kechol wants to merge 1 commit into tobi:main from kechol:feat/fts-tokenizer-env

@kechol kechol commented May 4, 2026

#617

Summary

Add a QMD_FTS_TOKENIZER env var that lets users swap the FTS5 tokenizer used for documents_fts.

The default (porter unicode61) is English-tuned. On CJK / mixed-language corpora, unicode61 splits only on characters that are not Unicode letters or digits (whitespace and punctuation), so an unbroken run of CJK text becomes a single token and substring queries return zero hits. That silently degrades hybrid retrieval (BM25 ⊕ vector ⊕ rerank) to vector-only, and Reciprocal Rank Fusion still mixes in the useless BM25 ranking, which can actively hurt the final ordering.
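To make the failure mode concrete, here is a minimal sketch that approximates how unicode61 breaks text into tokens. It is not SQLite's implementation (the real tokenizer also strips diacritics and has configurable separator sets), and `approxUnicode61Tokens` is a hypothetical name used only for illustration.

```typescript
// Rough approximation of unicode61: lower-case the input and split on any
// run of characters that are neither Unicode letters nor digits.
// Hypothetical helper, not part of qmd or SQLite.
function approxUnicode61Tokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((token) => token.length > 0);
}

// English text splits at spaces and punctuation.
const english = approxUnicode61Tokens("Hello, world"); // ["hello", "world"]

// CJK characters are all letters, so with no spaces the whole run is one token.
const japanese = approxUnicode61Tokens("東京都庁に行きました"); // ["東京都庁に行きました"]

// A query for 東京 is a different token, so an exact-token match finds nothing.
const hit = japanese.includes("東京"); // false
```

Because FTS5 matches whole tokens (or token prefixes), a term that appears only in the middle of that single long token is never found.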

Changes

  • getFtsTokenizer() resolves QMD_FTS_TOKENIZER against a whitelist of FTS5 built-in tokenizers and returns the value to interpolate.
  • documents_fts is created using that resolved tokenizer instead of the hardcoded literal.
  • Whitelist: porter unicode61 (default), porter ascii, unicode61, ascii, trigram. Anything else throws, which keeps the env var safe to interpolate into the CREATE VIRTUAL TABLE statement.
  • Documented in README env var table and CHANGELOG.
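The resolution and interpolation steps listed above could look roughly like the following. This is a sketch under assumptions, not the code in this PR: the `env` parameter, the trimming and empty-string handling, the `buildFtsDdl` helper, and the `title` / `body` column names are all hypothetical.

```typescript
// The five values named in the whitelist above.
const ALLOWED_TOKENIZERS = [
  "porter unicode61",
  "porter ascii",
  "unicode61",
  "ascii",
  "trigram",
] as const;

type FtsTokenizer = (typeof ALLOWED_TOKENIZERS)[number];

const DEFAULT_TOKENIZER: FtsTokenizer = "porter unicode61";

// Assumption: the environment is passed in (e.g. process.env) so the
// function is easy to test; the real getFtsTokenizer() may read it directly.
function getFtsTokenizer(env: Record<string, string | undefined>): FtsTokenizer {
  const raw = env["QMD_FTS_TOKENIZER"];
  // Assumption: unset or blank falls back to the default.
  if (raw === undefined || raw.trim() === "") return DEFAULT_TOKENIZER;

  const value = raw.trim();
  const match = ALLOWED_TOKENIZERS.find((t) => t === value);
  if (match === undefined) {
    throw new Error(
      `Unsupported QMD_FTS_TOKENIZER "${value}". Allowed: ${ALLOWED_TOKENIZERS.join(", ")}`
    );
  }
  return match;
}

// Hypothetical DDL builder; the column list is a placeholder.
// Interpolation is safe only because the value came from the whitelist.
function buildFtsDdl(tokenizer: FtsTokenizer): string {
  return `CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts USING fts5(title, body, tokenize='${tokenizer}')`;
}
```

Returning the matched whitelist entry rather than the raw input means the string that reaches the SQL statement is always one of the five literals, never user-controlled text.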

Recommended setup for CJK / multilingual users

export QMD_FTS_TOKENIZER=trigram
# delete the existing index, then rebuild it with the new tokenizer
rm ~/.cache/qmd/index.sqlite
qmd update
qmd embed

trigram indexes every overlapping 3-character sequence, so substring matching works for both CJK and Latin queries of at least 3 characters. Trade-offs: no Porter stemming, a minimum query length of 3 characters, and roughly 2-3× index growth.
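As an illustration of why the 3-character minimum exists, the sketch below generates overlapping trigrams the way the description above outlines. It is a simplified model, not SQLite's tokenizer, and `trigrams` is a hypothetical helper name.

```typescript
// Produce every overlapping 3-character window of the input.
// Array.from iterates by code point, so astral characters stay intact.
// Lower-casing mirrors the trigram tokenizer's default case-insensitivity.
function trigrams(text: string): string[] {
  const chars = Array.from(text.toLowerCase());
  const out: string[] = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    out.push(chars.slice(i, i + 3).join(""));
  }
  return out;
}

const indexed = trigrams("東京都庁"); // ["東京都", "京都庁"]
const shortQuery = trigrams("東京"); // [] (nothing to look up)
```

A 2-character query yields no trigrams at all, which is why it cannot be answered from the index.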

Notes

  • CREATE VIRTUAL TABLE IF NOT EXISTS means the env var only affects newly created indexes — existing DBs need a rebuild for the new tokenizer to take effect. README and CHANGELOG call this out.
  • sanitizeFTS5Term in src/store.ts is already Unicode-aware (\p{L}\p{N}'_), so CJK content survives the sanitizer end-to-end. No query-builder changes needed.
  • Default behavior is unchanged when QMD_FTS_TOKENIZER is unset — fully backward-compatible.
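For readers unfamiliar with the character class mentioned in the notes, here is a minimal sketch of a sanitizer that keeps only `\p{L}`, `\p{N}`, `'` and `_`. The real sanitizeFTS5Term in src/store.ts is not shown in this PR and may differ (for example in case handling); `sanitizeTermSketch` is a hypothetical stand-in.

```typescript
// Drop every character that is not a Unicode letter, a Unicode digit,
// an apostrophe, or an underscore. Hypothetical stand-in for illustration.
function sanitizeTermSketch(term: string): string {
  return term.replace(/[^\p{L}\p{N}'_]/gu, "");
}

const cjk = sanitizeTermSketch('東京"都*'); // "東京都"
const latin = sanitizeTermSketch("don't_stop!"); // "don't_stop"
```

Since CJK characters fall under `\p{L}`, they pass through untouched while FTS5 query syntax such as quotes and `*` is removed.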

Default `porter unicode61` is English-tuned. On CJK / mixed-language
corpora it splits only on whitespace and punctuation, so an unbroken run
of CJK text becomes a single token and substring queries return zero
hits, silently degrading the hybrid pipeline to vector-only.

Set `QMD_FTS_TOKENIZER=trigram` (or another whitelisted FTS5 built-in)
to fix this. The value is validated against a strict whitelist before
being interpolated into the CREATE VIRTUAL TABLE statement.

Existing indexes need a rebuild for the change to take effect — the
schema is created once via CREATE VIRTUAL TABLE IF NOT EXISTS.