fix: allow mid-term hyphens in vec/hyde queries (DEC-0054, ui-kit) #601
Open
fxstein wants to merge 1 commit into
validateSemanticQuery uses /-\w/ to flag negation syntax, but that regex
also matches hyphens embedded inside identifiers like "DEC-0054" or
"ui-kit". Semantic queries containing these entirely reasonable tokens
are rejected with a confusing 'Negation (-term) is not supported' error.
Anchor the pattern to the start of the query or a whitespace boundary so
that only true negation tokens ("-word", '-"phrase"' at the start of a
word) trigger the validation, while mid-term hyphens pass through.
Adds test coverage for DEC-0054, scoped npm packages, compound adjectives,
and token-based identifiers.
Refs: tobi#418 (prior attempt, scope-reduced to the negation fix only)
Refs: tobi#305, tobi#417
Heads-up on the red check: there is a single failing test. I traced the root cause and opened a separate focused fix: #602. Once that lands, this PR's CI goes green automatically (verified: #602 itself is the first fully-green CI run on this codebase in two weeks). No action needed on this PR; just flagging that the red check here is pre-existing and orthogonal.
Problem
`validateSemanticQuery()` rejects semantic (vec/hyde) queries that contain mid-term hyphens inside compound words or identifiers, with a misleading 'Negation (-term) is not supported' error.
The regex `/-\w/` matches any hyphen followed by a word character, so perfectly reasonable tokens trigger false positives:
- DEC-0054, RFC-0011, CVE-2024-1234
- @scope/ui-kit, material-ui
- state-of-the-art, role-based, multi-agent, chain-of-thought
- token-based, context-aware, fine-tuned
Users see a confusing "Negation is not supported" error for queries that contain no intentional negation at all.
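The false positives are easy to reproduce in isolation. The snippet below is a minimal sketch that applies the unanchored pattern to a few sample queries built from the tokens listed above; the variable names are illustrative and not taken from the codebase.

```typescript
// The unanchored pattern quoted above: any hyphen followed by a word character.
const unanchoredNegation = /-\w/;

// Sample queries; none of them contains intentional negation.
const samples: string[] = [
  "DEC-0054 architecture decision",
  "how does @scope/ui-kit work",
  "state-of-the-art retrieval",
  "token-based chunking",
];

// Collect every query the pattern would flag as negation.
const flagged: string[] = samples.filter((q) => unanchoredNegation.test(q));

// All four samples are flagged, even though none uses negation syntax.
console.log(flagged.length === samples.length); // true
```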
Root Cause
The regex does not distinguish between true negation (`-word` at the start of a query or after whitespace, i.e. syntax borrowed from lex) and internal hyphens in compound words (`multi-agent`, `DEC-0054`). Both match `/-\w/`.
Fix
One-line change in `validateSemanticQuery()`: anchor the negation regex to the start of the query or a whitespace boundary. Now only true negation tokens (`-word` or `-"phrase"` at the start of a word) match. Mid-term hyphens pass through unchanged.
Testing
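To make the cases in this section concrete, the anchoring described under Fix can be sketched as follows. The exact pattern and the helper name `looksLikeNegation` are assumptions for illustration only; the PR's actual diff is not reproduced here and may differ in detail.

```typescript
// Hypothetical anchored pattern (an assumption, not the PR's actual diff):
// a hyphen counts as negation only at the start of the query or after
// whitespace, and only when followed by a word character or a double quote.
const anchoredNegation = /(?:^|\s)-["\w]/;

// Hypothetical helper wrapping the check.
function looksLikeNegation(query: string): boolean {
  return anchoredNegation.test(query);
}

// Queries with mid-term hyphens should pass validation.
const accepted: string[] = [
  "DEC-0054 architecture decision",
  "how does @scope/ui-kit work",
  "state-of-the-art retrieval",
  "token-based chunking",
];

// Queries with true negation tokens should still be rejected.
const rejected: string[] = [
  "performance -sports",
  "foo -bar baz",
  '-"exact phrase"',
];

console.log(accepted.every((q) => !looksLikeNegation(q))); // true
console.log(rejected.every((q) => looksLikeNegation(q))); // true
```

Anchoring on `(?:^|\s)` rather than a word boundary matters here: `\b` would still fire between `DEC` and `-0054`, whereas requiring start-of-string or whitespace before the hyphen does not.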
Added four new test cases in `test/structured-search.test.ts` covering common identifier patterns:
- "DEC-0054 architecture decision"
- "how does @scope/ui-kit work"
- "state-of-the-art retrieval"
- "token-based chunking"
- "performance -sports" (true negation)
- "foo -bar baz" (true negation)
- '-"exact phrase"' (true negation)
Also verified manually against a production index of 7,665 documents: `vec: "multi-agent orchestration"` returns results (88% top hit) instead of the negation error.
Relationship to #418
This is a scope-reduced follow-up to #418, which you closed on 2026-04-05 with:
#418 bundled two independent changes:
1. `sanitizeFTS5Term()`: preserve hyphens so the `unicode61` tokenizer splits symmetrically at query time.
2. `validateSemanticQuery()` negation regex fix: this PR.
Since #418 was closed, you took a different (and arguably cleaner) approach for (1): `sanitizeHyphenatedTerm()`, which splits on `-` into separate tokens. That change landed on `main` and already addresses the lex-side hyphen problem.
Piece (2), the `validateSemanticQuery()` false positive, is an independent bug still present on current `main` and is unaffected by `sanitizeHyphenatedTerm()`. This PR isolates only that fix, with no overlap with your hyphenated-term work.
Fixes #414
Environment
`main` (post-rebase, includes `sanitizeHyphenatedTerm` and the new `rerank` parameter)