Commit 0e4cfa8

Merge pull request #947 from MITLibraries/use-424-semantic-query-builder-plan
Adds ADR for semantic query processing flow
2 parents 28ab1e6 + 682a9a8 commit 0e4cfa8

1 file changed, 95 additions and 0 deletions

@@ -0,0 +1,95 @@
# 7. Semantic Query Processing

Date: 2026-02-27

## Status

Accepted

## Context

We will use document-only embedding rather than bi-encoding (which encodes both query and document). The [OpenSearch blog](https://opensearch.org/blog/improving-document-retrieval-with-sparse-semantic-encoders/) summarizes the distinction as follows:

> In bi-encoder mode, both documents and search queries are passed through deep encoders.
> In document-only mode, documents are still passed through deep encoders, but search queries are instead tokenized.

Our OpenSearch deployment may run on AWS OpenSearch Serverless, which prevents us from installing our own models directly in OpenSearch. We therefore need to generate the query tokens, and construct the query from them, outside of OpenSearch.

The specific details of how this will be implemented are not a concern of this TIMDEX API codebase, but we do care about the information flow, which is described below. At a high level, we will call an external tool to transform a text string (a user query) into a query structure that we can use in our SemanticQueryBuilder and HybridQueryBuilder.

### Example semantic query structure

For an input of "hello world", we anticipate a response similar to:

```json
{
  "query": {
    "bool": {
      "should": [
        {
          "rank_feature": {
            "field": "embedding_full_record.[CLS]",
            "boost": 1.0
          }
        },
        {
          "rank_feature": {
            "field": "embedding_full_record.[SEP]",
            "boost": 1.0
          }
        },
        {
          "rank_feature": {
            "field": "embedding_full_record.world",
            "boost": 3.4208686351776123
          }
        },
        {
          "rank_feature": {
            "field": "embedding_full_record.hello",
            "boost": 6.937756538391113
          }
        }
      ]
    }
  }
}
```
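
To make the shape of that response concrete, here is a minimal Python sketch of how a mapping of token weights could be turned into the `bool`/`should` structure shown above. It is an illustration only, not the implementation of the external tool: the function name `build_semantic_query` and the `field` default are assumptions, and the weights are simply copied from the example rather than produced by a model or tokenizer.

```python
import json


def build_semantic_query(token_weights, field="embedding_full_record"):
    """Build a bool/should query with one rank_feature clause per token.

    `token_weights` maps each token to its weight. How those weights are
    computed is outside the scope of this sketch.
    """
    clauses = [
        {"rank_feature": {"field": f"{field}.{token}", "boost": weight}}
        for token, weight in token_weights.items()
    ]
    return {"query": {"bool": {"should": clauses}}}


# Hypothetical weights, copied from the "hello world" example above.
weights = {
    "[CLS]": 1.0,
    "[SEP]": 1.0,
    "world": 3.4208686351776123,
    "hello": 6.937756538391113,
}

query = build_semantic_query(weights)
print(json.dumps(query, indent=2))
```

Because each token becomes its own `rank_feature` clause, the query grows linearly with the number of tokens in the user's input.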

### Semantic and Lexical query flows

```mermaid
flowchart LR
  sq(semantic)
  kq(keyword)
  tsb(timdex-semantic-builder external service)

  subgraph s [Semantic Query]
    SemanticQueryBuilder <--> tsb
  end

  subgraph l [Lexical Query]
    LexicalQueryBuilder
  end

  kq --> l --> OpenSearch
  sq --> s --> OpenSearch
```

Keyword, or lexical, queries will be handled entirely in this repository codebase.

Semantic queries will be coordinated in this repository codebase, but constructed by a separate external service. The libraries needed to create the query structure are Python libraries, so the construction cannot be done entirely in this Ruby codebase.

Hybrid queries are not in the diagram, but generally consist of a single combined query with both lexical and semantic parts. Hybrid queries can thus be expected to call both LexicalQueryBuilder and SemanticQueryBuilder to construct the single OpenSearch query.
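
One way to picture the hybrid case is as a merge of the two builders' outputs into a single query body. The sketch below is written in Python purely for illustration (the TIMDEX API itself is a Ruby codebase), and everything in it is an assumption: the function name `build_hybrid_query`, the `multi_match` lexical clause and its field names, and the choice to combine the parts under an outer `bool`/`should`. The ADR does not specify how the parts will actually be combined or weighted.

```python
def build_hybrid_query(lexical_query, semantic_query):
    """Combine a lexical and a semantic query into one query body.

    Each argument is a full query body of the form {"query": {...}}.
    The inner clauses are placed side by side in an outer bool/should,
    so a document can match on either part.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    lexical_query["query"],
                    semantic_query["query"],
                ]
            }
        }
    }


# Hypothetical output of a lexical builder (field names are made up).
lexical = {
    "query": {
        "multi_match": {"query": "hello world", "fields": ["title", "summary"]}
    }
}

# Hypothetical output of the external semantic builder, shortened to one token.
semantic = {
    "query": {
        "bool": {
            "should": [
                {
                    "rank_feature": {
                        "field": "embedding_full_record.hello",
                        "boost": 6.937756538391113,
                    }
                }
            ]
        }
    }
}

hybrid = build_hybrid_query(lexical, semantic)
```

Keeping each builder's output as a self-contained clause means the hybrid path can reuse both builders unchanged.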

## Decision

We will develop a separate tool (`timdex-semantic-builder` in the diagram) outside of TIMDEX API that will accept a user query and return a semantic query structure ready to be used in our Semantic and Hybrid query builders.

## Consequences

By using an external tool rather than running the model directly in OpenSearch, we are able to consider moving to OpenSearch Serverless, which may be easier to manage long term.

We also get to control the query construction more directly, which might end up being useful.

This does mean we will need to maintain an additional tool to generate semantic queries, and that we will need to make an additional external call (likely to a Lambda, but not decided yet) during semantic query construction, which will introduce some amount of latency.
