feat: configurable FTS5 tokenizer via QMD_FTS_TOKENIZER #623
Open
kechol wants to merge 1 commit into
The default `porter unicode61` tokenizer is English-tuned. On CJK / mixed-language corpora it splits only on whitespace and punctuation, so an unbroken run of CJK text (often a whole sentence) becomes a single token and substring queries return zero hits, silently degrading the hybrid pipeline to vector-only. Set `QMD_FTS_TOKENIZER=trigram` (or any FTS5 built-in) to fix this. The value is validated against a strict whitelist before being interpolated into the `CREATE VIRTUAL TABLE` statement. Existing indexes need a rebuild for the change to take effect, because the schema is created once via `CREATE VIRTUAL TABLE IF NOT EXISTS`.
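For readers who want to see the shape of the whitelist check, here is a minimal sketch. It is not the code from this PR: the names `resolveFtsTokenizer` and `buildCreateFtsSql` and the column list `title, body` are assumptions made for illustration. Only the five allowed values, the env var name, and the `documents_fts` table name come from the description.

```typescript
// Hypothetical sketch; the PR's actual getFtsTokenizer() may look different.
// The allowed values are the ones listed in the PR description.
const ALLOWED_FTS_TOKENIZERS = [
  "porter unicode61", // default
  "porter ascii",
  "unicode61",
  "ascii",
  "trigram",
] as const;

type FtsTokenizer = (typeof ALLOWED_FTS_TOKENIZERS)[number];

const DEFAULT_FTS_TOKENIZER: FtsTokenizer = "porter unicode61";

// Resolve the raw env value to a whitelisted tokenizer, or throw.
function resolveFtsTokenizer(raw: string | undefined): FtsTokenizer {
  const value = raw?.trim();
  if (!value) return DEFAULT_FTS_TOKENIZER; // unset or empty: keep the default
  const match = ALLOWED_FTS_TOKENIZERS.find((t) => t === value);
  if (match === undefined) {
    throw new Error(
      `Unsupported QMD_FTS_TOKENIZER ${JSON.stringify(value)}; ` +
        `allowed: ${ALLOWED_FTS_TOKENIZERS.join(", ")}`
    );
  }
  return match;
}

// Only a whitelisted literal ever reaches the SQL string.
// The column list (title, body) is an assumption, not the real schema.
function buildCreateFtsSql(tokenizer: FtsTokenizer): string {
  return (
    "CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts " +
    `USING fts5(title, body, tokenize='${tokenizer}')`
  );
}
```

A caller would presumably pass `process.env.QMD_FTS_TOKENIZER` as `raw`. An exact-match whitelist is used here rather than escaping because the tokenizer spec is interpolated into DDL text, so only known literals are allowed through.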
#617
Summary
Add a `QMD_FTS_TOKENIZER` env var that lets users swap the FTS5 tokenizer used for `documents_fts`.

The default (`porter unicode61`) is English-tuned. On CJK / mixed-language corpora `unicode61` only splits on Unicode whitespace and punctuation, so an unbroken run of CJK text (often a whole sentence) becomes one token and substring queries return zero hits. That silently degrades hybrid retrieval (BM25 ⊕ vector ⊕ rerank) to vector-only, and Reciprocal Rank Fusion still mixes in the useless BM25 ranking, which can actively hurt the final ordering.

Changes
- `getFtsTokenizer()` resolves `QMD_FTS_TOKENIZER` against a whitelist of FTS5 built-in tokenizers and returns the value to interpolate.
- `documents_fts` is created using that resolved tokenizer instead of the hardcoded literal.
- Allowed values: `porter unicode61` (default), `porter ascii`, `unicode61`, `ascii`, `trigram`. Anything else throws, which keeps the env var safe to interpolate into the `CREATE VIRTUAL TABLE` statement.

Recommended setup for CJK / multilingual users
`trigram` generates overlapping 3-character n-grams, which works for CJK and Latin queries of 3 or more characters. Trade-offs: loses Porter stemming, minimum query length of 3 characters, ~2-3× index growth.

Notes
- `CREATE VIRTUAL TABLE IF NOT EXISTS` means the env var only affects newly created indexes. Existing DBs need a rebuild for the new tokenizer to take effect. README and CHANGELOG call this out.
- `sanitizeFTS5Term` in `src/store.ts` is already Unicode-aware (`\p{L}\p{N}'_`), so CJK content survives the sanitizer end-to-end. No query-builder changes needed.
- Behavior is unchanged when `QMD_FTS_TOKENIZER` is unset, so the change is fully backward-compatible.
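
To make the trigram trade-offs mentioned under the recommended setup concrete, the sketch below approximates how overlapping 3-character n-grams are produced. It only illustrates the idea and is not SQLite's implementation, which among other things folds case by default. The helper name `trigrams` is hypothetical.

```typescript
// Illustrative approximation of trigram tokenization; not SQLite's code.
function trigrams(text: string): string[] {
  // Array.from iterates by code point, so each CJK character stays whole.
  const chars = Array.from(text);
  const grams: string[] = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    grams.push(chars.slice(i, i + 3).join(""));
  }
  return grams;
}

// A 4-character CJK string yields two overlapping trigrams.
console.log(trigrams("東京都庁")); // [ '東京都', '京都庁' ]

// A 2-character query yields none, which is why queries need at least 3 characters.
console.log(trigrams("東京")); // []
```

Because almost every character position starts its own gram, the index stores far more terms than a word-level tokenizer would, which is consistent with the index growth noted above.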