Implement W3C XQuery and XPath Full Text 3.0 (contains text expression) #6133
Open
joewiz wants to merge 7 commits into eXist-db:develop
Add full XQuery Full Text 3.0 grammar support to the ANTLR parser and tree walker, along with the complete set of AST expression classes.

Grammar (XQuery.g): FTContainsExpr, FTSelection, boolean operators, positional filter syntax, FTMatchOptions, FTTimes, FT option declaration, reserved keyword registration.
Tree walker (XQueryTree.g): Maps parsed FT constructs to expression class instances with full match option handling.
AST classes (org.exist.xquery.ft): FTAbstractExpr, FTAnd, FTOr, FTMildNot, FTUnaryNot, FTWords, FTPrimaryWithOptions, FTSelection, FTContent, FTDistance, FTOrder, FTRange, FTScope, FTTimes, FTUnit, FTWindow.
Error codes: FTST0008, FTST0009, FTST0013, FTST0018/0019, FTDY0016/0017, FTDY0020.

W3C spec: https://www.w3.org/TR/xpath-full-text-30/

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Implement the W3C XQFT 3.0 formal semantics evaluator using the AllMatches model. Sequential (non-indexed) evaluator that tokenizes document text at query time.

FTEvaluator: tokenizer, AllMatches model, boolean operators, FTWords evaluation, positional filters (window, distance, content, scope, order), sentence/paragraph boundary detection, wildcard matching, ignore option, case/diacritics/stemming modes.
FTContainsExpr: the 'contains text' expression integrating with the XQuery expression evaluation framework.

Case mode semantics (XQFTTS interpretation): LOWERCASE/UPPERCASE act as filters: only source tokens already in the specified case are eligible for matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
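The case-mode interpretation described in this commit message can be illustrated with a minimal sketch. It is not the PR's FTEvaluator code: the class `CaseModeSketch`, its `CaseMode` enum, and the `matches` method are hypothetical names, and the sketch assumes that a source token which passes the case filter is then compared to the query token case-insensitively.

```java
import java.util.Locale;

/** Hypothetical sketch of the "case mode as filter" reading; not taken from the PR. */
final class CaseModeSketch {

    /** Illustrative stand-in for the four full-text case options. */
    enum CaseMode { INSENSITIVE, SENSITIVE, LOWERCASE, UPPERCASE }

    /**
     * Decides whether one source token matches one query token.
     * Assumption: once a source token passes the LOWERCASE/UPPERCASE
     * filter, the comparison with the query token ignores case.
     */
    static boolean matches(String sourceToken, String queryToken, CaseMode mode) {
        switch (mode) {
            case SENSITIVE:
                return sourceToken.equals(queryToken);
            case LOWERCASE:
                // Filter step: a source token that is not already lowercase is never eligible.
                return sourceToken.equals(sourceToken.toLowerCase(Locale.ROOT))
                        && sourceToken.equalsIgnoreCase(queryToken);
            case UPPERCASE:
                // Filter step: a source token that is not already uppercase is never eligible.
                return sourceToken.equals(sourceToken.toUpperCase(Locale.ROOT))
                        && sourceToken.equalsIgnoreCase(queryToken);
            case INSENSITIVE:
            default:
                return sourceToken.equalsIgnoreCase(queryToken);
        }
    }

    public static void main(String[] args) {
        // Source token is already lowercase, so it is eligible and matches.
        System.out.println(matches("usability", "Usability", CaseMode.LOWERCASE)); // true
        // Source token has an uppercase letter, so the filter rejects it.
        System.out.println(matches("Usability", "usability", CaseMode.LOWERCASE)); // false
    }
}
```

Under this reading the option restricts which document tokens can participate at all, rather than normalizing both sides before comparison.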
FTMatchOptions: case, diacritics, stemming, stop words, wildcards, thesaurus, language options with inheritance.
FTThesaurus: WordNet-format thesaurus loading and synonym lookup.
XQueryContext: ft-option declaration support in query prolog.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
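One way to picture match option inheritance is a per-level options object whose unset fields fall back to the enclosing level, for example a prolog ft-option declaration. The sketch below is an assumption about the general shape only: `MatchOptionsSketch`, its three fields, and `resolveAgainst` are hypothetical names and do not reflect the actual FTMatchOptions class.

```java
/** Hypothetical sketch of option inheritance; not the PR's FTMatchOptions. */
final class MatchOptionsSketch {

    // A null field means "not specified at this level".
    final String caseMode;
    final Boolean stemming;
    final String language;

    MatchOptionsSketch(String caseMode, Boolean stemming, String language) {
        this.caseMode = caseMode;
        this.stemming = stemming;
        this.language = language;
    }

    /**
     * Returns the effective options: values set at this level win,
     * anything left unset is taken from the inherited (outer) options.
     */
    MatchOptionsSketch resolveAgainst(MatchOptionsSketch inherited) {
        return new MatchOptionsSketch(
                caseMode != null ? caseMode : inherited.caseMode,
                stemming != null ? stemming : inherited.stemming,
                language != null ? language : inherited.language);
    }

    public static void main(String[] args) {
        // Outer level (for example, a prolog declaration) sets all three options.
        MatchOptionsSketch outer = new MatchOptionsSketch("insensitive", Boolean.TRUE, "en");
        // Inner level overrides only the case mode.
        MatchOptionsSketch inner = new MatchOptionsSketch("sensitive", null, null);
        MatchOptionsSketch effective = inner.resolveAgainst(outer);
        System.out.println(effective.caseMode + " " + effective.stemming + " " + effective.language);
    }
}
```

Resolving into a new immutable object keeps the outer options untouched, so sibling sub-expressions can each apply their own overrides.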
ForExpr/LetExpr: score variable binding support.
StaticXQueryException/XQuery: preserve FTST* error codes in static analysis exceptions.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
162 tests: parser, evaluator unit tests, integration tests, and W3C-aligned conformance tests covering boolean operators, positional filters, match options, wildcards, thesaurus, and edge cases.

XQFTTS 1.0.4: 675/685 (98.5%)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
2428a96 to 02b4669
Add default cases to switches, fix parameter reassignment in FTContainsExpr.eval(), collapse nested if in FTEvaluator, move field declarations before inner classes, replace FQNs with imports in XQueryContext, and suppress NPathComplexity on FTEvaluator class. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary
Implements the W3C XQuery and XPath Full Text 3.0 specification, adding native support for the `contains text` expression, boolean full-text operators, positional filters, match options, and score variables.

This is a sequential (non-indexed) evaluator that tokenizes document text at query time. It is not backed by Lucene or any persistent index; all matching is performed in-memory during query evaluation. This makes it correct and complete for the W3C spec, but it is not suitable for large-scale full-text search on stored collections. For indexed full-text search, eXist-db's existing Lucene-based `ft:query()` remains the recommended approach. A future phase could integrate this W3C syntax with Lucene for indexed evaluation.

Related: #4584 (the sequential evaluator correctly handles matches spanning multiple text nodes).
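To make the idea of sequential, query-time matching across text nodes concrete, here is a minimal sketch. It is not the PR's tokenizer or AllMatches evaluator: `SequentialPhraseSketch`, `tokenize`, and `containsPhrase` are hypothetical names, the token pattern (runs of letters and digits) is an assumption, and matching is simplified to case-insensitive phrase search over the node's concatenated string value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of non-indexed phrase matching; not the PR's FTEvaluator. */
final class SequentialPhraseSketch {

    // Assumed token definition: a maximal run of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Tokenizes the concatenation of the given text nodes, lowercasing each token. */
    static List<String> tokenize(List<String> textNodes) {
        // Joining without a separator mirrors an element's string value,
        // so tokens are found on the whole text, not per text node.
        String stringValue = String.join("", textNodes);
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(stringValue);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Returns true if the phrase's tokens occur as consecutive source tokens. */
    static boolean containsPhrase(List<String> textNodes, String phrase) {
        List<String> source = tokenize(textNodes);
        List<String> query = tokenize(Collections.singletonList(phrase));
        if (query.isEmpty()) {
            return false;
        }
        for (int start = 0; start + query.size() <= source.size(); start++) {
            if (source.subList(start, start + query.size()).equals(query)) {
                return true;
            }
        }
        return false;
    }
}
```

For an element such as `<p>quick <b>brown</b> fox</p>`, the text nodes are `"quick "`, `"brown"`, and `" fox"`, and `containsPhrase` finds the phrase "brown fox" even though its two words sit in different text nodes. Every call re-tokenizes the text, which is the cost that makes this approach unsuitable for large stored collections.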
Co-Authored-By: Claude Opus 4.6 [email protected]
What Changed
New package: `org.exist.xquery.ft`

| File | Description |
|---|---|
| `FTContainsExpr.java` | `contains text` expression: evaluates FTSelection against context nodes, manages match option inheritance |
| `FTEvaluator.java` | |
| `FTAbstractExpr.java` | |
| `FTAnd.java` | ... (`ftand`) |
| `FTOr.java` | ... (`ftor`) |
| `FTMildNot.java` | ... (`not in`): excludes matches without full negation |
| `FTUnaryNot.java` | ... (`ftnot`) |
| `FTWords.java` | ... `any`/`all`/`phrase`/`any word`/`all words` modes |
| `FTPrimaryWithOptions.java` | |
| `FTSelection.java` | |
| `FTContent.java` | ... `at start`, `at end`, `entire content` |
| `FTDistance.java` | |
| `FTWindow.java` | |
| `FTScope.java` | ... `same sentence`/`same paragraph`/`different sentence`/`different paragraph` |
| `FTOrder.java` | |
| `FTRange.java` | |
| `FTTimes.java` | ... `occurs exactly`/`at least`/`at most`/`from N to M times` |
| `FTUnit.java` | ... `words`, `sentences`, `paragraphs` |
| `FTMatchOptions.java` | |
| `FTThesaurus.java` | |

Grammar changes
| File | Description |
|---|---|
| `XQuery.g` | `ftContainsExpr` production, full FT selection/option/filter syntax, FT prolog declarations, 27 new grammar rules, reserved keyword registration |
| `XQueryTree.g` | |

Modified existing files
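| File | Description |
|---|---|
| `ForExpr.java` | ... `for $x score $s in ...` binding (§3.1) |
| `LetExpr.java` | ... `let score $s := ...` binding (§3.1) |
| `XQueryContext.java` | |
| `XQuery.java` | ... `StaticXQueryException` for proper FTST error propagation |
| `StaticXQueryException.java` | |
| `ErrorCodes.java` | |

Tests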
- `FTParserTest.java`
- `FTEvaluatorTest.java`
- `FTContainsTest.java`
- `FTConformanceTest.java`

XQTS Results
W3C XQFTTS 1.0.4 (685 tests):
Remaining 10 failures — triage
- `FODC0005`: tests use `fn:doc()` with relative URIs; eXist's `fn:doc` cannot load from local filesystem. Not FT-specific.
- `FTST0018`: thesaurus lookup fails when combined with `using case uppercase`. Case option interaction with thesaurus.
- ... `true` is correct per W3C formal semantics (first-token-at-start + last-token-at-end). Test expects stricter contiguity.

Test suite results
Spec Reference
Limitations and Future Work
No index backing (in-memory only)
This implementation is a sequential evaluator: it tokenizes and matches document text at query time without using any persistent index. This means:

- ... (`ft:query()`) is a separate, proprietary API. This implementation does not use it. A future phase could route `contains text` expressions to the Lucene index when a suitable index configuration exists, falling back to sequential evaluation otherwise.

Features not implemented
- ... `using stemming` option is parsed and recognized but uses a basic suffix-stripping heuristic. Integration with a proper stemming library (e.g., Snowball via Lucene) would improve recall.
- ... `using thesaurus default` option (implementation-defined default thesaurus) is not supported.
- ... `using stop words default` option uses this list. Language-specific stop word lists are not bundled.

Relationship to eXist-db's existing full-text support
eXist-db has two existing full-text mechanisms:

- `ft:query()`: Lucene-backed, proprietary XQuery function. Indexed, fast, supports Lucene query syntax. Recommended for production full-text search.
- `near()`: Deprecated proprietary function. Minimal functionality.

This PR adds a third mechanism: W3C standard `contains text` syntax. It is complementary to `ft:query()`, not a replacement. Applications that need indexed full-text search should continue using `ft:query()`. Applications that need W3C-standard syntax (portability, XQFTTS compliance, or precise positional matching) can now use `contains text`.