fix: allow mid-term hyphens in vec/hyde queries (DEC-0054, ui-kit)#601

Open
fxstein wants to merge 1 commit into tobi:main from fxstein:fix/semantic-negation-regex

Conversation

Contributor

@fxstein fxstein commented Apr 23, 2026

Problem

validateSemanticQuery() rejects semantic (vec / hyde) queries that contain mid-term hyphens inside compound words or identifiers, with a misleading error message:

qmd query 'vec: how does DEC-0054 work'
# Error: Negation (-term) is not supported in vec/hyde queries. Use lex for exclusions.

The regex /-\w/ matches any hyphen-followed-by-word-char, so perfectly reasonable tokens trigger false positives:

  • Document / decision / CVE identifiers — DEC-0054, RFC-0011, CVE-2024-1234
  • Scoped npm packages and component names — @scope/ui-kit, material-ui
  • Compound adjectives — state-of-the-art, role-based, multi-agent, chain-of-thought
  • Hyphenated technical terms — token-based, context-aware, fine-tuned

Users see a confusing "Negation is not supported" error for queries that contain no intentional negation at all.

Root Cause

The regex does not distinguish between true negation (-word at the start of a query or after whitespace — i.e. syntax borrowed from lex) and internal hyphens in compound words (multi-agent, DEC-0054). Both match /-\w/.
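To make the false positive concrete, here is a minimal sketch that runs the original pattern against a few queries. Only the regex itself is taken from this PR; the variable names are illustrative and the snippet is not the actual validateSemanticQuery() code.

```typescript
// Pattern quoted from the "before" side of the diff in this PR.
const oldNegationPattern = /-\w/;

// Two queries with mid-term hyphens and one with intentional negation.
const sampleQueries: string[] = [
  "how does DEC-0054 work",
  "state-of-the-art retrieval",
  "performance -sports",
];

// The pattern has no notion of word boundaries, so every query is flagged.
const oldPatternHits: boolean[] = sampleQueries.map((q) => oldNegationPattern.test(q));
console.log(oldPatternHits); // [ true, true, true ]
```

All three queries are flagged, although only the last one contains a negation token.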

Fix

One-line change in validateSemanticQuery() — anchor the negation regex to the start of the query or a whitespace boundary:

- if (/-\w/.test(query) || /-"/.test(query)) {
+ if (/(?:^|\s)-[\w"]/.test(query)) {

Now only true negation tokens (-word or -"phrase" at the start of a word) match. Mid-term hyphens pass through unchanged.
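As a rough illustration of the anchored pattern, the sketch below wraps it in a small helper and checks it against identifier-style and negation-style queries. The helper name looksLikeNegation is hypothetical and exists only for this example; the regex is the one from the diff above.

```typescript
// Pattern quoted from the "after" side of the diff in this PR.
const anchoredNegationPattern = /(?:^|\s)-[\w"]/;

// Hypothetical helper for illustration; not the real validateSemanticQuery().
function looksLikeNegation(query: string): boolean {
  return anchoredNegationPattern.test(query);
}

// [query, expected result]: true means the query is treated as negation.
const negationCases: Array<[string, boolean]> = [
  ["DEC-0054 architecture decision", false],
  ["how does @scope/ui-kit work", false],
  ["state-of-the-art retrieval", false],
  ["performance -sports", true],
  ['-"exact phrase"', true],
];

for (const [query, expected] of negationCases) {
  const actual = looksLikeNegation(query);
  console.log(JSON.stringify(query), actual === expected ? "ok" : "MISMATCH");
}
```

A hyphen counts as negation only when it begins the query or follows whitespace and is immediately followed by a word character or a double quote, so a free-standing " - " between words is not flagged either.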

Testing

Added four new test cases in test/structured-search.test.ts covering common identifier patterns:

| Query | Before | After |
|---|---|---|
| "DEC-0054 architecture decision" | ❌ Negation error | ✅ Accepted |
| "how does @scope/ui-kit work" | ❌ Negation error | ✅ Accepted |
| "state-of-the-art retrieval" | ❌ Negation error | ✅ Accepted |
| "token-based chunking" | ❌ Negation error | ✅ Accepted |
| "performance -sports" (true negation) | ✅ Rejected | ✅ Rejected |
| "foo -bar baz" (true negation) | ✅ Rejected | ✅ Rejected |
| '-"exact phrase"' (true negation) | ✅ Rejected | ✅ Rejected |

Also verified manually against a 7,665-document production index, where vec: "multi-agent orchestration" now returns results (88% top hit) instead of the negation error.

Relationship to #418

This is a scope-reduced follow-up to #418, which you closed on 2026-04-05 with:

Closing — the underscore handling landed in #404. The remaining hyphen + validateSemanticQuery changes conflict with main and would need a rebase. Happy to revisit if you want to open a focused follow-up PR. Thanks!

#418 bundled two independent changes:

  1. FTS5 hyphen handling in sanitizeFTS5Term() — preserve hyphens so the unicode61 tokenizer splits symmetrically at query time.
  2. validateSemanticQuery() negation regex fix — this PR.

Since #418 was closed you took a different (and arguably cleaner) approach for (1) — sanitizeHyphenatedTerm() which splits on - into separate tokens. That change landed on main and already addresses the lex-side hyphen problem.

Piece (2) — the validateSemanticQuery() false positive — is an independent bug still present on current main and is unaffected by sanitizeHyphenatedTerm(). This PR isolates only that fix, with no overlap with your hyphenated-term work.

Fixes #414

Environment

  • QMD: main (post-rebase, includes sanitizeHyphenatedTerm and the new rerank parameter)
  • Platform: macOS (Apple Silicon), Linux (container)
  • Node: v24.14.0

validateSemanticQuery uses /-\w/ to flag negation syntax, but that regex
also matches hyphens embedded inside identifiers like "DEC-0054" or
"ui-kit". Semantic queries containing these entirely reasonable tokens
are rejected with a confusing 'Negation (-term) is not supported' error.

Anchor the pattern to the start of the query or a whitespace boundary so
that only true negation tokens ("-word", '-"phrase"' at the start of a
word) trigger the validation, while mid-term hyphens pass through.

Adds test coverage for DEC-0054, scoped npm packages, compound adjectives,
and token-based identifiers.

Refs: tobi#418 (prior attempt, scope-reduced to the negation fix only)
Refs: tobi#305, tobi#417
Contributor Author

fxstein commented Apr 23, 2026

Heads-up on the red Bun (ubuntu-latest) check: it's not caused by this PR.

The single failing test — Store Creation > createStore throws without explicit path in test mode — has been failing on upstream main itself for 14 consecutive runs since 2026-04-09 (the first red run was the merge of #537). Every PR opened against main since then has inherited the same red check.

I traced the root cause and opened a separate focused fix: #602. Once that lands, this PR's CI goes green automatically (verified — #602 itself is the first fully-green CI run on this codebase in two weeks).

No action needed on this PR; just flagging that the red check here is pre-existing and orthogonal.

Development

Successfully merging this pull request may close these issues.

vec/hyde queries reject hyphenated compound words as negation operators

1 participant