|
| 1 | +# 7. Semantic Query Processing |
| 2 | + |
| 3 | +Date: 2026-02-27 |
| 4 | + |
| 5 | +## Status |
| 6 | + |
| 7 | +Accepted |
| 8 | + |
| 9 | +## Context |
| 10 | + |
| 11 | +We will use document-only embedding rather than bi-encoding (query and document). This distinction is summarized by the [OpenSearch blog](https://opensearch.org/blog/improving-document-retrieval-with-sparse-semantic-encoders/) as follows: |
| 12 | + |
| 13 | +> In bi-encoder mode, both documents and search queries are passed through deep encoders. |
| 14 | +> In document-only mode, documents are still passed through deep encoders, but search queries are instead tokenized. |
| 15 | +
|
| 16 | +Our OpenSearch cluster may run on AWS OpenSearch Serverless, which prevents us from installing our own models directly in OpenSearch. Therefore, we need to generate the query tokens and construct the semantic query outside of OpenSearch. |
| 17 | + |
| 18 | +The specific details of how this will be implemented are not a concern of this TIMDEX API codebase, but we do care about the information flow, which is described below. At a high level, we will call an external tool to transform a text string (a user query) into a query structure that we can use in our SemanticQueryBuilder and HybridQueryBuilder. |
| 19 | + |
| 20 | +### Example semantic query structure |
| 21 | + |
| 22 | +For an input of "hello world", we anticipate a response similar to: |
| 23 | + |
| 24 | +```json |
| 25 | +{ |
| 26 | + "query": { |
| 27 | + "bool": { |
| 28 | + "should": [ |
| 29 | + { |
| 30 | + "rank_feature": { |
| 31 | + "field": "embedding_full_record.[CLS]", |
| 32 | + "boost": 1.0 |
| 33 | + } |
| 34 | + }, |
| 35 | + { |
| 36 | + "rank_feature": { |
| 37 | + "field": "embedding_full_record.[SEP]", |
| 38 | + "boost": 1.0 |
| 39 | + } |
| 40 | + }, |
| 41 | + { |
| 42 | + "rank_feature": { |
| 43 | + "field": "embedding_full_record.world", |
| 44 | + "boost": 3.4208686351776123 |
| 45 | + } |
| 46 | + }, |
| 47 | + { |
| 48 | + "rank_feature": { |
| 49 | + "field": "embedding_full_record.hello", |
| 50 | + "boost": 6.937756538391113 |
| 51 | + } |
| 52 | + } |
| 53 | + ] |
| 54 | + } |
| 55 | + } |
| 56 | +} |
| 57 | +``` |
| 58 | + |
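To make the shape of this structure concrete, here is a minimal Ruby sketch that assembles the same `rank_feature` clauses from a hash of token weights. In the planned design this work happens in the external service rather than in this codebase; the method name `rank_feature_query` and the `field_prefix` parameter are hypothetical and exist only for illustration.

```ruby
# Illustrative only: in the planned design the external service builds this
# structure. The method name and keyword argument are hypothetical.
def rank_feature_query(token_weights, field_prefix: 'embedding_full_record')
  # One rank_feature clause per token, boosted by that token's weight.
  clauses = token_weights.map do |token, weight|
    { rank_feature: { field: "#{field_prefix}.#{token}", boost: weight } }
  end

  { query: { bool: { should: clauses } } }
end

# Token weights copied from the "hello world" example above.
weights = {
  '[CLS]' => 1.0,
  '[SEP]' => 1.0,
  'world' => 3.4208686351776123,
  'hello' => 6.937756538391113
}

semantic_query = rank_feature_query(weights)
```

Passing `semantic_query` to `JSON.pretty_generate` (after `require 'json'`) should produce a document with the same layout as the JSON example.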
| 59 | +### Semantic and Lexical query flows |
| 60 | + |
| 61 | +```mermaid |
| 62 | +flowchart LR |
| 63 | + sq(semantic) |
| 64 | + kq(keyword) |
| 65 | + tsb(timdex-semantic-builder external service) |
| 66 | +
|
| 67 | + subgraph s [Semantic Query] |
| 68 | + SemanticQueryBuilder <--> tsb |
| 69 | + end |
| 70 | +
|
| 71 | + subgraph l [Lexical Query] |
| 72 | + LexicalQueryBuilder |
| 73 | + end |
| 74 | +
|
| 75 | + kq --> l --> OpenSearch |
| 76 | + sq --> s --> OpenSearch |
| 77 | +``` |
| 78 | + |
| 79 | +Keyword, or lexical, queries will be handled entirely in this repository codebase. |
| 80 | + |
| 81 | +Semantic queries will be coordinated in this repository codebase, but constructed by a separate external service. The libraries needed to create the query structure are Python libraries, so the construction cannot be done entirely in this Ruby codebase. |
| 82 | + |
| 83 | +Hybrid queries are not in the diagram, but generally consist of sending a single combined query that contains both lexical and semantic parts. Hybrid can thus be expected to call both LexicalQueryBuilder and SemanticQueryBuilder to construct the single OpenSearch query. |
| 84 | + |
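One possible shape for such a combined query is sketched below. This ADR does not specify how the two parts are combined; the sketch assumes OpenSearch's `hybrid` query type, and both the `hybrid_query` helper and the placeholder lexical clause (including the `title` field) are hypothetical stand-ins for whatever LexicalQueryBuilder and SemanticQueryBuilder actually return.

```ruby
# Hypothetical helper: wraps a lexical clause and a semantic clause in a single
# OpenSearch query. Assumes the `hybrid` query type; not confirmed by this ADR.
def hybrid_query(lexical_clause, semantic_clause)
  { query: { hybrid: { queries: [lexical_clause, semantic_clause] } } }
end

# Placeholder for LexicalQueryBuilder output (the field name is made up).
lexical_clause = { match: { title: 'hello world' } }

# Placeholder for SemanticQueryBuilder output (trimmed to a single token).
semantic_clause = {
  bool: {
    should: [
      { rank_feature: { field: 'embedding_full_record.hello', boost: 6.937756538391113 } }
    ]
  }
}

combined = hybrid_query(lexical_clause, semantic_clause)
```

Keeping each builder responsible for only its own clause means the hybrid path adds just a thin wrapper around the two existing outputs.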
| 85 | +## Decision |
| 86 | + |
| 87 | +We will develop a separate tool (`timdex-semantic-builder` in the diagram) outside of TIMDEX API that will accept a user query and return a semantic query structure ready to be used in our Semantic and Hybrid query builders. |
| 88 | + |
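As a rough illustration of how this codebase might call that tool, the sketch below wraps the request in a small client with explicit timeouts. Everything here is an assumption: the class name `SemanticBuilderClient`, the endpoint URL, the `{"query": ...}` request payload, and the idea that the service is reached over HTTP at all have not been decided. The transport is injectable so the example runs without a network connection.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical client for the external semantic builder. The endpoint, payload
# shape, and response shape are assumptions, not a documented contract.
class SemanticBuilderClient
  def initialize(endpoint:, timeout: 2, transport: nil)
    @uri = URI(endpoint)
    @timeout = timeout
    # Injectable transport makes it possible to stub the call in tests.
    @transport = transport || method(:http_post)
  end

  # Sends the raw user query and returns the parsed query structure.
  def build(user_query)
    body = @transport.call(@uri, JSON.generate(query: user_query))
    JSON.parse(body)
  end

  private

  # Default transport: a JSON POST with short timeouts to bound added latency.
  def http_post(uri, payload)
    response = Net::HTTP.start(uri.host, uri.port,
                               use_ssl: uri.scheme == 'https',
                               open_timeout: @timeout,
                               read_timeout: @timeout) do |http|
      http.post(uri.request_uri, payload, 'Content-Type' => 'application/json')
    end
    raise "semantic builder returned #{response.code}" unless response.is_a?(Net::HTTPSuccess)

    response.body
  end
end

# Stubbed usage: the lambda stands in for the real service.
stub = ->(_uri, _payload) { JSON.generate(query: { bool: { should: [] } }) }
client = SemanticBuilderClient.new(endpoint: 'https://example.invalid/build', transport: stub)
structure = client.build('hello world')
```

With the default transport, a slow or failing service surfaces as a timeout or an exception, which the caller would need to handle.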
| 89 | +## Consequences |
| 90 | + |
| 91 | +By using an external tool rather than running the model directly in OpenSearch, we are able to consider moving to OpenSearch Serverless, which may be easier to manage long term. |
| 92 | + |
| 93 | +We also control query construction more directly, which may prove useful. |
| 94 | + |
| 95 | +This does mean we will need to maintain an additional tool to generate semantic queries. We will also need to make an additional external call (likely to an AWS Lambda, but not decided yet) during semantic query construction, which will introduce some latency. |