feat: configurable FTS5 tokenizer via QMD_FTS_TOKENIZER by kechol · Pull Request #623 · tobi/qmd

kechol · 2026-05-04T02:50:39Z

Summary

Add a QMD_FTS_TOKENIZER env var that lets users swap the FTS5 tokenizer used for documents_fts.

The default (porter unicode61) is English-tuned. On CJK / mixed-language corpora, unicode61 splits only on separator characters (whitespace and punctuation), so an unbroken run of CJK text becomes a single token and substring queries return zero hits. That silently degrades hybrid retrieval (BM25 ⊕ vector ⊕ rerank) to vector-only, and Reciprocal Rank Fusion still mixes in the useless BM25 ranking, which can actively hurt the final ordering.
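To make the failure mode concrete, here is a rough sketch of separator-based tokenization. It is not SQLite's implementation: `splitLikeUnicode61` is a hypothetical helper that approximates unicode61 by treating letters and digits as token characters and everything else as a separator, and it ignores details such as diacritic removal and Porter stemming.

```typescript
// Hypothetical approximation of unicode61 tokenization, for illustration only.
// Letters (\p{L}) and digits (\p{N}) are token characters; anything else separates.
function splitLikeUnicode61(text: string): string[] {
  const separators = new RegExp("[^\\p{L}\\p{N}]+", "u");
  return text
    .split(separators)
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());
}

// English: whitespace yields one token per word.
const englishTokens = splitLikeUnicode61("Hybrid search works");

// Japanese: no separators inside the sentence, so it stays one token.
const japaneseTokens = splitLikeUnicode61("全文検索が動かない");

// A token-level lookup for the substring "検索" finds nothing,
// because the only indexed token is the whole sentence.
const substringHit = japaneseTokens.indexOf("検索") !== -1;
```

Under these assumptions the English input produces three tokens, the Japanese input produces one, and `substringHit` is false, which mirrors the zero-hit behavior described above.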

Changes

getFtsTokenizer() resolves QMD_FTS_TOKENIZER against a whitelist of FTS5 built-in tokenizers and returns the value to interpolate.
documents_fts is created using that resolved tokenizer instead of the hardcoded literal.
Whitelist: porter unicode61 (default), porter ascii, unicode61, ascii, trigram. Any other value throws, which keeps the env var safe to interpolate into the CREATE VIRTUAL TABLE statement.
Documented in README env var table and CHANGELOG.
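The whitelist check described above could look roughly like the following. This is a minimal sketch, not the code from the PR: it takes the environment as a parameter instead of reading `process.env`, it treats an empty value as unset and trims whitespace (both assumptions), and `buildCreateFtsSql` with its single `body` column is hypothetical.

```typescript
// Allowed values, taken from the whitelist in the PR description.
const ALLOWED_TOKENIZERS = [
  "porter unicode61",
  "porter ascii",
  "unicode61",
  "ascii",
  "trigram",
] as const;

type FtsTokenizer = (typeof ALLOWED_TOKENIZERS)[number];

const DEFAULT_TOKENIZER: FtsTokenizer = "porter unicode61";

// Resolve QMD_FTS_TOKENIZER against the whitelist; throw on anything else.
function getFtsTokenizer(env: Record<string, string | undefined>): FtsTokenizer {
  const raw = env["QMD_FTS_TOKENIZER"];
  // Assumption: unset or blank falls back to the default.
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_TOKENIZER;
  }
  const candidate = raw.trim();
  for (const allowed of ALLOWED_TOKENIZERS) {
    if (allowed === candidate) {
      return allowed;
    }
  }
  throw new Error(
    `Unsupported QMD_FTS_TOKENIZER "${candidate}". Allowed: ${ALLOWED_TOKENIZERS.join(", ")}`
  );
}

// Hypothetical helper: only a whitelisted value ever reaches the SQL text.
// The column list is a placeholder, not the real documents_fts schema.
function buildCreateFtsSql(tokenizer: FtsTokenizer): string {
  return (
    "CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts " +
    `USING fts5(body, tokenize='${tokenizer}')`
  );
}
```

Because the return type is a union of the whitelisted literals, a value such as `trigram'); DROP TABLE documents; --` is rejected before any SQL string is built.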

Recommended setup for CJK / multilingual users

export QMD_FTS_TOKENIZER=trigram
# delete the existing index, then rebuild it with the new tokenizer
rm ~/.cache/qmd/index.sqlite
qmd update
qmd embed

trigram indexes overlapping 3-character n-grams, which supports substring matching for both CJK and Latin queries of at least 3 characters. Trade-offs: it loses Porter stemming, queries shorter than 3 characters match nothing, and the index grows roughly 2-3×.
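As an illustration of why trigram indexing helps and where the 3-character minimum comes from, the sketch below generates overlapping trigrams and checks whether a query's trigrams appear consecutively in a document. `trigrams` and `trigramMatch` are hypothetical helpers that mimic the idea only; they are not how FTS5 stores or queries its index.

```typescript
// Split text into overlapping 3-character windows (by Unicode code point).
// Lowercasing approximates case-insensitive matching.
function trigrams(text: string): string[] {
  const chars = Array.from(text.toLowerCase());
  const out: string[] = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    out.push(chars.slice(i, i + 3).join(""));
  }
  return out;
}

// A query matches when its trigrams occur as a consecutive run in the document.
// Queries shorter than 3 characters produce no trigrams and therefore never match.
function trigramMatch(doc: string, query: string): boolean {
  const docGrams = trigrams(doc);
  const queryGrams = trigrams(query);
  if (queryGrams.length === 0) {
    return false;
  }
  for (let start = 0; start + queryGrams.length <= docGrams.length; start++) {
    let allEqual = true;
    for (let j = 0; j < queryGrams.length; j++) {
      if (docGrams[start + j] !== queryGrams[j]) {
        allEqual = false;
        break;
      }
    }
    if (allEqual) {
      return true;
    }
  }
  return false;
}

const doc = "全文検索が動かない";
const longQueryHit = trigramMatch(doc, "全文検索"); // 4 characters, 2 trigrams
const shortQueryHit = trigramMatch(doc, "検索"); // 2 characters, no trigrams
```

Here `longQueryHit` is true and `shortQueryHit` is false. A 9-character sentence also yields 7 index entries instead of 1, which gives a feel for the index growth mentioned above.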

Notes

CREATE VIRTUAL TABLE IF NOT EXISTS means the env var only affects newly created indexes — existing DBs need a rebuild for the new tokenizer to take effect. README and CHANGELOG call this out.
sanitizeFTS5Term in src/store.ts is already Unicode-aware (\p{L}\p{N}'_), so CJK content survives the sanitizer end-to-end. No query-builder changes needed.
Default behavior is unchanged when QMD_FTS_TOKENIZER is unset — fully backward-compatible.
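For readers who have not looked at the sanitizer, the sketch below shows why a Unicode-aware character class lets CJK terms through. It is an assumption-laden stand-in built only from the character class quoted in the note (`\p{L}\p{N}'_`); `sanitizeTermSketch` is a hypothetical name, and the real sanitizeFTS5Term in src/store.ts may do more.

```typescript
// Hypothetical stand-in for a Unicode-aware term sanitizer.
// Keeps letters, digits, apostrophes, and underscores; drops everything else.
function sanitizeTermSketch(term: string): string {
  const disallowed = new RegExp("[^\\p{L}\\p{N}'_]", "gu");
  return term.replace(disallowed, "");
}

// CJK characters are in \p{L}, so they pass through unchanged.
const cjkTerm = sanitizeTermSketch("検索");

// FTS5 query syntax characters such as quotes and asterisks are stripped.
const latinTerm = sanitizeTermSketch('foo"bar*');
```

With an ASCII-only class such as `[^a-zA-Z0-9'_]`, the CJK term would be reduced to an empty string, which is the case the note rules out.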

The default `porter unicode61` is English-tuned. On CJK / mixed-language corpora it splits only on whitespace and punctuation, so an unbroken run of CJK text becomes a single token and substring queries return zero hits, silently degrading the hybrid pipeline to vector-only. Set `QMD_FTS_TOKENIZER=trigram` to fix this; the other FTS5 built-in tokenizers on the whitelist are accepted as well. The value is validated against a strict whitelist before being interpolated into the CREATE VIRTUAL TABLE statement. Existing indexes need a rebuild for the change to take effect, because the schema is created once via CREATE VIRTUAL TABLE IF NOT EXISTS.

kechol mentioned this pull request May 4, 2026

BM25 / FTS5 returns zero hits for CJK queries (Chinese / Japanese / Korean) #617

Closed

kechol wants to merge 1 commit into tobi:main from kechol:feat/fts-tokenizer-env
