2 changes: 1 addition & 1 deletion README.md
@@ -311,7 +311,7 @@ Please help us add more systems and run the benchmarks on more types of VMs:
- [ ] MS SQL Server with Column Store Index (without publishing)
- [ ] OceanBase
- [ ] Planetscale (without publishing)
-- [ ] Quickwit
+- [x] Quickwit
- [ ] Redshift Spectrum
- [ ] Seafowl
- [ ] ShitholeDB
49 changes: 49 additions & 0 deletions quickwit/README.md
@@ -0,0 +1,49 @@
# Quickwit

[Quickwit](https://quickwit.io) is a Rust-based search engine for log analytics, built on top of [Tantivy](https://github.com/quickwit-oss/tantivy). It exposes an Elasticsearch-compatible REST API for ingestion and search, but does not implement an SQL endpoint, so this benchmark uses the native Elasticsearch query DSL directly.

## Methodology

Infrastructure:
- Single-node Quickwit **v0.9.0-rc** (Docker `quickwit/quickwit:v0.9.0-rc`).

Stable **0.8.2** is missing `cardinality`, `wildcard`, and several other features the benchmark relies on, so we use the v0.9 release candidate. The v0.9 line is still unreleased; once a stable v0.9.x ships, bump `QW_IMAGE` in `benchmark.sh`.

Index configuration (`index_config.yaml`):
- All scalar fields declared with `fast: true` so they can participate in aggregations and sorts.
- Keyword-like text fields use the `raw` tokenizer with the `raw` fast-field normalizer to mimic Elasticsearch's `keyword` mapping.
- `EventTime` is set as the index's timestamp field, providing time-based pruning.
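
As a quick illustration of the two mapping shapes described above, the sketch below prints one entry of each kind. The helper names `numeric_field` and `keyword_field` are hypothetical and exist only for this example; the emitted lines follow the form used in `index_config.yaml`.

```bash
#!/bin/bash
# Illustrative only: print the two mapping shapes used in index_config.yaml.
# numeric_field and keyword_field are hypothetical helpers, not part of this PR.

# Numeric column: indexed and stored as a fast (columnar) field.
numeric_field() {
    printf -- '- {name: %s, type: i64, indexed: true, fast: true}\n' "$1"
}

# Keyword-like column: raw tokenizer plus raw fast-field normalizer.
keyword_field() {
    printf -- '- {name: %s, type: text, tokenizer: raw, fast: {normalizer: raw}}\n' "$1"
}

numeric_field CounterID
keyword_field SearchPhrase
```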

Ingestion (`benchmark.sh`):
- Streams the decompressed `hits.json.gz` into `quickwit tool local-ingest`, which builds splits directly on local storage. We do **not** use the Elasticsearch bulk endpoint: in our testing, v0.9's sharded ingest-v2 API caps single-node throughput at a few MB/s and stalls waiting for shards to scale. `local-ingest` bypasses the ingest pipeline entirely.
- The server picks up the new splits on its next metastore poll (default 30 s).
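
Because the server only sees new splits after a metastore poll, a script has to wait before querying. One possible approach, sketched below, is to poll until documents are reported instead of sleeping for a fixed interval. This is an assumption-laden sketch, not what `benchmark.sh` does: `wait_until` and `splits_visible` are hypothetical helpers, while the `describe` endpoint and the `num_published_docs` field are the ones `benchmark.sh` already queries.

```bash
#!/bin/bash
# Hypothetical alternative to a fixed sleep: retry a check until it succeeds.

# wait_until ATTEMPTS DELAY_SECS CMD [ARGS...]
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
wait_until() {
    local attempts="$1" delay="$2" i
    shift 2
    for ((i = 0; i < attempts; i++)); do
        if "$@"; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# True once the server reports at least one published document.
splits_visible() {
    local docs
    docs=$(curl -sS http://localhost:7280/api/v1/indexes/hits/describe \
        | jq -r '.num_published_docs // 0') || return 1
    [ "$docs" -gt 0 ]
}

# Example: poll every 5 s for up to 60 s.
# wait_until 12 5 splits_visible
```

A non-zero document count only shows that some splits are visible; a stricter check could compare the count against the expected number of rows.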

Queries (`queries.json`):
- Each query in `queries.sql` is hand-translated to the Elasticsearch DSL on the corresponding line of `queries.json`, and submitted to `/api/v1/_elastic/hits/_search`.
- Timing is taken from the `took` field returned by Quickwit (milliseconds, engine-internal).
- Queries that are not expressible in Quickwit's DSL are recorded as `null`.
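
The steps above can be sketched roughly as follows. `run.sh` is not shown here, so this is a guess at the general shape rather than the actual script: `run_query` and `took_to_seconds` are hypothetical names, and the sketch assumes each line of `queries.json` is either a complete DSL body or a literal `null`.

```bash
#!/bin/bash
# Hypothetical helpers, not the PR's run.sh: time one line of queries.json.
QW_URL="${QW_URL:-http://localhost:7280}"   # REST port used by benchmark.sh

# Convert the engine-reported `took` (milliseconds) to seconds.
# Non-numeric input (e.g. an error response without `took`) becomes "null".
took_to_seconds() {
    case "$1" in
        ''|*[!0-9]*) echo "null" ;;
        *) awk -v ms="$1" 'BEGIN { printf "%.3f\n", ms / 1000 }' ;;
    esac
}

# Submit one Elasticsearch DSL body and print the elapsed seconds.
# A literal `null` line marks a query that cannot be expressed.
run_query() {
    local body="$1" took
    if [ "$body" = "null" ]; then
        echo "null"
        return 0
    fi
    took=$(curl -sS -X POST "$QW_URL/api/v1/_elastic/hits/_search" \
        -H 'Content-Type: application/json' \
        --data-binary "$body" | jq -r '.took // empty')
    took_to_seconds "$took"
}

# Example: time every query in the file, one result per line.
# while IFS= read -r line; do run_query "$line"; done < queries.json
```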

## Unsupported queries

The following ClickBench queries cannot currently be expressed in Quickwit's Elasticsearch-compatible DSL and are reported as `null`:

| Query | Reason |
|-------|--------|
| 19 | `extract(minute FROM …)`: no scripted/runtime fields |
| 26 | `ORDER BY` on text field: `sort by field on type text is currently not supported` |
| 27 | `ORDER BY` on text field |
| 28 | `AVG(length(URL))`: no scripted/runtime fields |
| 29 | `REGEXP_REPLACE`: not supported |
| 30 | `SUM(col + N)`: no scripted aggregations |
| 36 | `ClientIP - N`: no scripted aggregations |
| 40 | `CASE WHEN …`: no scripted/runtime fields |

All other 35 queries run through the native Elasticsearch DSL, including `cardinality` (Q5/6/9/10/11/12/14) and `wildcard` (Q21/22/23/24).
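
A small sanity check can confirm these counts. The sketch below assumes unsupported queries are stored as a bare `null` line in `queries.json` with one query per line and a trailing newline; neither the function names nor that file convention are confirmed by this PR.

```bash
#!/bin/bash
# Hypothetical sanity check; assumes one query per line in queries.json and
# a bare `null` line for each unsupported query.

# Number of lines that consist solely of `null`.
count_unsupported() {
    # grep -c exits non-zero when nothing matches, so ignore its status.
    grep -c -x 'null' "$1" || true
}

# Number of remaining (supported) queries.
count_supported() {
    local total unsupported
    total=$(wc -l < "$1")
    unsupported=$(count_unsupported "$1")
    echo $((total - unsupported))
}

# Example:
# count_unsupported queries.json
# count_supported queries.json
```

If the file follows that convention, the two calls would be expected to report 8 and 35, matching the table above.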

## Running

```bash
bash benchmark.sh
```

The script installs Docker, pulls the Quickwit image, starts the server, creates the index, downloads `hits.json.gz`, runs `local-ingest`, and then calls `run.sh` to time each query three times with caches dropped between runs.
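
For readers unfamiliar with the three-try pattern, here is a minimal sketch of how such a loop could look. It is not the actual `run.sh`: `three_tries` and `drop_caches` are hypothetical names, and dropping the OS page cache before every try is an assumption based on the description above.

```bash
#!/bin/bash
# Hypothetical three-try loop; not the PR's run.sh.

# Flush dirty pages and drop the OS page cache (Linux only, needs sudo).
drop_caches() {
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
}

# three_tries CMD [ARGS...]
# Runs CMD three times, dropping caches before each try, and prints the
# three outputs as a bracketed, comma-separated list.
three_tries() {
    local results=() i
    for i in 1 2 3; do
        drop_caches
        results+=("$("$@")")
    done
    local IFS=,
    echo "[${results[*]}]"
}

# Example, with some command that prints one timing per call:
# three_tries my_timing_command
```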
90 changes: 90 additions & 0 deletions quickwit/benchmark.sh
@@ -0,0 +1,90 @@
#!/bin/bash
set -eo pipefail

export DEBIAN_FRONTEND=noninteractive

# Install prerequisites quietly
sudo apt-get update -qq >/dev/null
sudo apt-get install -y -qq wget curl jq bc docker.io >/dev/null
sudo systemctl start docker

# We use the Quickwit v0.9 release candidate. Stable v0.8.2 is missing
# `cardinality`, `wildcard`, and several other features the benchmark relies
# on; only the v0.9 line (still unreleased as of writing) provides them.
QW_IMAGE="quickwit/quickwit:v0.9.0-rc"
sudo docker pull -q "$QW_IMAGE" >/dev/null

# Quickwit's data directory (shared between the server and the local-ingest
# container).
QW_DATA="$(pwd)/qwdata"
sudo rm -rf "$QW_DATA"
mkdir -p "$QW_DATA"

# Start the server in the background. Quickwit defaults: REST on 7280, gRPC on 7281.
# Mount node-config.yaml on top of the image's default config to bump the
# searcher timeouts (defaults are 30s, which is too low for some of the
# nested high-cardinality aggregations on the full 100M-row dataset).
sudo docker run -d --name qw --network host \
    -v "$QW_DATA":/quickwit/qwdata \
    -v "$(pwd)/node-config.yaml":/quickwit/config/quickwit.yaml \
    "$QW_IMAGE" run >/dev/null
echo "Quickwit container started"

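# Wait up to 60 seconds for the server to come up; abort if it never does.
ready=0
for i in $(seq 1 60); do
    if curl -sS -f http://localhost:7280/api/v1/version >/dev/null 2>&1; then
        echo "Quickwit is ready"
        ready=1
        break
    fi
    sleep 1
done
if [ "$ready" -ne 1 ]; then
    echo "Quickwit did not become ready within 60 seconds" >&2
    exit 1
fi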

# Create the index from the YAML config.
curl -sS -X POST http://localhost:7280/api/v1/indexes \
    -H 'Content-Type: application/yaml' \
    --data-binary @index_config.yaml | jq -r '.index_uid // .message'

# Download the data quietly (the dataset is ~14 GB; full progress would
# dominate the captured benchmark log).
wget --continue -q 'https://datasets.clickhouse.com/hits_compatible/hits.json.gz'

START=$(date +%s)

# Use `quickwit tool local-ingest` instead of the Elasticsearch-compatible
# bulk endpoint. v0.9's sharded ingest-v2 API caps single-node throughput
# to a few MB/s and gets stuck waiting for shards to scale, while
# `local-ingest` builds splits directly and writes them to the index
# storage. The running server picks up new splits on its next metastore
# poll (default 30s).
#
# local-ingest emits a "Num docs ... Thrghput ... Time" progress line
# roughly once per second; we throttle that to once per ~30 seconds so
# the captured log stays compact, and pass the surrounding lines through
# unchanged.
zcat hits.json.gz | sudo docker run --rm -i --network host \
    -v "$QW_DATA":/quickwit/qwdata \
    "$QW_IMAGE" tool local-ingest --index hits -y 2>&1 \
    | awk '/Num docs/ { n = systime(); if (n - last >= 30) { print; fflush(); last = n } next }
           { print; fflush() }'

# Wait long enough for the server to refresh its metastore view.
sleep 35

# Show stats.
curl -sS "http://localhost:7280/api/v1/indexes/hits/describe" \
    | jq '{num_published_docs, num_published_splits, size_published_splits}' \
    | tee stats.json

END=$(date +%s)
echo "Load time: $((END - START))"

# Data size on disk.
echo -n "Data size: "
sudo du -sb "$QW_DATA" | awk '{print $1}'

# Run queries
chmod +x run.sh
./run.sh

sudo docker stop qw 2>/dev/null || true
sudo docker rm qw 2>/dev/null || true
149 changes: 149 additions & 0 deletions quickwit/index_config.yaml
@@ -0,0 +1,149 @@
version: 0.8

index_id: hits

doc_mapping:
  mode: strict
  timestamp_field: EventTime
  field_mappings:
    - {name: WatchID, type: i64, indexed: true, fast: true}
    - {name: JavaEnable, type: i64, indexed: true, fast: true}
    - {name: Title, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: GoodEvent, type: i64, indexed: true, fast: true}
    - name: EventTime
      type: datetime
      input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
      output_format: unix_timestamp_secs
      indexed: true
      fast: true
      fast_precision: seconds
    - name: EventDate
      type: datetime
      input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
      output_format: unix_timestamp_secs
      indexed: true
      fast: true
      fast_precision: seconds
    - {name: CounterID, type: i64, indexed: true, fast: true}
    - {name: ClientIP, type: i64, indexed: true, fast: true}
    - {name: RegionID, type: i64, indexed: true, fast: true}
    - {name: UserID, type: i64, indexed: true, fast: true}
    - {name: CounterClass, type: i64, indexed: true, fast: true}
    - {name: OS, type: i64, indexed: true, fast: true}
    - {name: UserAgent, type: i64, indexed: true, fast: true}
    - {name: URL, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: Referer, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: IsRefresh, type: i64, indexed: true, fast: true}
    - {name: RefererCategoryID, type: i64, indexed: true, fast: true}
    - {name: RefererRegionID, type: i64, indexed: true, fast: true}
    - {name: URLCategoryID, type: i64, indexed: true, fast: true}
    - {name: URLRegionID, type: i64, indexed: true, fast: true}
    - {name: ResolutionWidth, type: i64, indexed: true, fast: true}
    - {name: ResolutionHeight, type: i64, indexed: true, fast: true}
    - {name: ResolutionDepth, type: i64, indexed: true, fast: true}
    - {name: FlashMajor, type: i64, indexed: true, fast: true}
    - {name: FlashMinor, type: i64, indexed: true, fast: true}
    - {name: FlashMinor2, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: NetMajor, type: i64, indexed: true, fast: true}
    - {name: NetMinor, type: i64, indexed: true, fast: true}
    - {name: UserAgentMajor, type: i64, indexed: true, fast: true}
    - {name: UserAgentMinor, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: CookieEnable, type: i64, indexed: true, fast: true}
    - {name: JavascriptEnable, type: i64, indexed: true, fast: true}
    - {name: IsMobile, type: i64, indexed: true, fast: true}
    - {name: MobilePhone, type: i64, indexed: true, fast: true}
    - {name: MobilePhoneModel, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: Params, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: IPNetworkID, type: i64, indexed: true, fast: true}
    - {name: TraficSourceID, type: i64, indexed: true, fast: true}
    - {name: SearchEngineID, type: i64, indexed: true, fast: true}
    - {name: SearchPhrase, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: AdvEngineID, type: i64, indexed: true, fast: true}
    - {name: IsArtifical, type: i64, indexed: true, fast: true}
    - {name: WindowClientWidth, type: i64, indexed: true, fast: true}
    - {name: WindowClientHeight, type: i64, indexed: true, fast: true}
    - {name: ClientTimeZone, type: i64, indexed: true, fast: true}
    - name: ClientEventTime
      type: datetime
      input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
      output_format: unix_timestamp_secs
      indexed: true
      fast: true
      fast_precision: seconds
    - {name: SilverlightVersion1, type: i64, indexed: true, fast: true}
    - {name: SilverlightVersion2, type: i64, indexed: true, fast: true}
    - {name: SilverlightVersion3, type: i64, indexed: true, fast: true}
    - {name: SilverlightVersion4, type: i64, indexed: true, fast: true}
    - {name: PageCharset, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: CodeVersion, type: i64, indexed: true, fast: true}
    - {name: IsLink, type: i64, indexed: true, fast: true}
    - {name: IsDownload, type: i64, indexed: true, fast: true}
    - {name: IsNotBounce, type: i64, indexed: true, fast: true}
    - {name: FUniqID, type: i64, indexed: true, fast: true}
    - {name: OriginalURL, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: HID, type: i64, indexed: true, fast: true}
    - {name: IsOldCounter, type: i64, indexed: true, fast: true}
    - {name: IsEvent, type: i64, indexed: true, fast: true}
    - {name: IsParameter, type: i64, indexed: true, fast: true}
    - {name: DontCountHits, type: i64, indexed: true, fast: true}
    - {name: WithHash, type: i64, indexed: true, fast: true}
    - {name: HitColor, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - name: LocalEventTime
      type: datetime
      input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
      output_format: unix_timestamp_secs
      indexed: true
      fast: true
      fast_precision: seconds
    - {name: Age, type: i64, indexed: true, fast: true}
    - {name: Sex, type: i64, indexed: true, fast: true}
    - {name: Income, type: i64, indexed: true, fast: true}
    - {name: Interests, type: i64, indexed: true, fast: true}
    - {name: Robotness, type: i64, indexed: true, fast: true}
    - {name: RemoteIP, type: i64, indexed: true, fast: true}
    - {name: WindowName, type: i64, indexed: true, fast: true}
    - {name: OpenerName, type: i64, indexed: true, fast: true}
    - {name: HistoryLength, type: i64, indexed: true, fast: true}
    - {name: BrowserLanguage, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: BrowserCountry, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: SocialNetwork, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: SocialAction, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: HTTPError, type: i64, indexed: true, fast: true}
    - {name: SendTiming, type: i64, indexed: true, fast: true}
    - {name: DNSTiming, type: i64, indexed: true, fast: true}
    - {name: ConnectTiming, type: i64, indexed: true, fast: true}
    - {name: ResponseStartTiming, type: i64, indexed: true, fast: true}
    - {name: ResponseEndTiming, type: i64, indexed: true, fast: true}
    - {name: FetchTiming, type: i64, indexed: true, fast: true}
    - {name: SocialSourceNetworkID, type: i64, indexed: true, fast: true}
    - {name: SocialSourcePage, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: ParamPrice, type: i64, indexed: true, fast: true}
    - {name: ParamOrderID, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: ParamCurrency, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: ParamCurrencyID, type: i64, indexed: true, fast: true}
    - {name: OpenstatServiceName, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: OpenstatCampaignID, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: OpenstatAdID, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: OpenstatSourceID, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: UTMSource, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: UTMMedium, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: UTMCampaign, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: UTMContent, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: UTMTerm, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: FromTag, type: text, tokenizer: raw, fast: {normalizer: raw}}
    - {name: HasGCLID, type: i64, indexed: true, fast: true}
    - {name: RefererHash, type: i64, indexed: true, fast: true}
    - {name: URLHash, type: i64, indexed: true, fast: true}
    - {name: CLID, type: i64, indexed: true, fast: true}

  store_source: false

indexing_settings:
  commit_timeout_secs: 30
  merge_policy:
    type: stable_log
    merge_factor: 10
    max_merge_factor: 12

search_settings:
  default_search_fields: []
17 changes: 17 additions & 0 deletions quickwit/node-config.yaml
@@ -0,0 +1,17 @@
version: 0.8

searcher:
  # Bump the per-request and leaf-search timeouts above the 30s default:
  # a few of the high-cardinality aggregations on the full 100M-row ClickBench
  # dataset (e.g. WatchID + ClientIP nested terms) take longer than that.
  request_timeout_secs: 60
  leaf_request_timeout_secs: 60

  # Disable the per-split partial result cache so warm runs don't replay a
  # memoized answer. The other in-memory caches (fast_field_cache,
  # split_footer_cache, predicate_cache) are data-level caches (analogous to
  # ClickHouse's query condition cache) and are kept at their defaults;
  # run.sh restarts the container before each query so they also start cold
  # for the first run.
  partial_request_cache:
    capacity: 0