
docs: Updating spark documentation #5766

Open

ShimonSte wants to merge 7 commits into main from docs/spark-connector-query-settings

Conversation

@ShimonSte (Contributor)

New Section Issues Addressed

- Supported ClickHouse server versions #432 — adds a minimum version table (21.6+ for partition-id, 25.3+ for VariantType, etc.)
- Push-down operations #476 — lists all 5 push-down interfaces (column pruning, filters, limit, aggregations, runtime filters) with a table, notes on unsupported expressions, and how to enable runtime filtering
- Connector parallelism #435 — explains read parallelism (1 task per CH partition, 3 modes for Distributed), write parallelism (repartition/sort flow), and how to control task count
- Working with Distributed tables #402 — explicitly documents that inserts go to local tables, that all shard hostnames must be accessible, and how to fall back to coordinator-only mode
- Passing query settings and Java client options #430, #431 — documents the option.<key> / option.custom_http_params mechanism for both the Catalog and TableProvider APIs, with a common settings table
- Performance tuning #434 — read and write tuning tables plus a recommended bulk-load config snippet
- Troubleshooting #436, #365, #374 — covers Broken Pipe, WHERE 1=0 schema inference overhead, shard hostname resolution, too many tasks, partition expression errors, and stale schema
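The option.<key> passthrough described above can be illustrated with a small sketch. The splitting logic and key names here are assumptions based on this PR description, not the connector's actual implementation:

```python
# Hedged sketch of the option.<key> passthrough: options prefixed with
# "option." are assumed to be forwarded to the underlying Java client,
# while option.custom_http_params carries ClickHouse server settings as
# a comma-separated string. Key names are illustrative.
def split_connector_options(options):
    """Separate connector-level options from Java-client passthrough options."""
    connector_opts, client_opts = {}, {}
    for key, value in options.items():
        if key.startswith("option."):
            client_opts[key[len("option."):]] = value
        else:
            connector_opts[key] = value
    return connector_opts, client_opts

connector, client = split_connector_options({
    "host": "my-clickhouse-host",  # hypothetical host name
    "option.custom_http_params": "max_execution_time=300,session_timeout=60",
})
```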

@ShimonSte ShimonSte requested review from a team as code owners March 17, 2026 09:39
Contributor

@BentsiLeviav BentsiLeviav left a comment


Besides the comments I left, addressing these issues would make our docs even better:

  • Let's fix the UInt64 mapping. Right now it maps to LongType; after your change here (ClickHouse/spark-clickhouse-connector#477), it is DecimalType(20, 0)
  • For the option allowUnsupportedSharding, the docs say since 0.10.0 but the code says since 0.9.0
  • There is an SBT syntax issue (line 141) - the artifact name is unquoted
  • We wrongly use : in the jar path instead of - . --jars /path/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}:{{ stable_version }}.jar
    should be changed to --jars /path/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar
  • We have a typo in the Gradle snippet in line 131 (repositries)
  • In the compatibility matrix - let's update that main uses JDBC 0.9.5 and not 0.9.4
  • In version 0.9.0, we added the option to provide settings on read using spark.clickhouse.read.settings - let's document it in the configuration table.
  • spark.clickhouse.read.jsonAs is missing from the configuration table as well

| Feature | Minimum ClickHouse version |
|---|---|
| `spark.clickhouse.read.splitByPartitionId` (partition-id-based filtering) | 21.6+ |
| `VariantType` / `JSON` type support | 25.3+ |

For production deployments, we recommend running ClickHouse 23.x or later. The connector is tested against the latest stable ClickHouse releases.
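The minimum-version table translates naturally into a version gate. A minimal sketch of such a check, assuming dotted integer version strings (this is illustrative, not connector code):

```python
# Minimal version-gate sketch for the minimum-version table above.
# ClickHouse versions are compared as tuples of dotted integers;
# illustrative only, not connector code.
def supports(server_version: str, minimum: str) -> bool:
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(server_version) >= parse(minimum)

print(supports("25.3", "21.6"))  # partition-id filtering available
print(supports("21.5", "25.3"))  # VariantType/JSON not yet available
```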
Contributor

I think we should recommend using the latest, and not a specific version - WDYT?

Contributor

Can you add a link to the GitHub workflow ClickHouse version list? That way, users would be able to see exactly what versions we test with


When using ClickHouse `Distributed` tables with the connector, there is an important networking and architecture consideration:

**The connector bypasses the Distributed engine and connects directly to each shard node.**
Contributor

The current phrasing sounds a bit like it's "working around" something. I'll suggest writing something like:

Suggested change
**The connector bypasses the Distributed engine and connects directly to each shard node.**
**Following ClickHouse best practices for high-throughput ingestion, the connector writes data directly to the underlying shard nodes rather than using the Distributed engine.**

Feel free to suggest other versions if you find them fit.

This means:

1. **All shard hostnames must be reachable** from Spark executors. The connector reads cluster topology from `system.clusters` on the coordinator node, then opens direct connections to each shard. If the shard hostnames returned by `system.clusters` are internal cluster names not resolvable from outside, reads and writes will fail.
2. **Inserts go to local tables**: When writing, data is inserted directly into the local `MergeTree` table on each shard (not through the `Distributed` table). This is more efficient but requires that the local table exists on every shard and that Spark can connect to each shard directly.
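Since every shard hostname returned by `system.clusters` must resolve from the Spark executors, a quick preflight check can catch this failure mode early. A standalone diagnostic sketch (not connector code; how you obtain the hostnames from ClickHouse is up to you):

```python
import socket

# Preflight sketch: given the shard host names a `system.clusters` query
# returned (obtained however you query ClickHouse), report the ones that
# do not resolve from this machine. Purely diagnostic, not connector code.
def unresolvable_shards(hostnames):
    """Return the shard hostnames that fail DNS resolution locally."""
    bad = []
    for host in hostnames:
        try:
            socket.gethostbyname(host)
        except socket.gaierror:
            bad.append(host)
    return bad
```

Run a check like this from a Spark executor (for example via a small RDD job) rather than only on the driver, since it is the executors that open the direct shard connections.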
Contributor

| `max_execution_time=300` | `option.custom_http_params` | Extend query timeout (seconds) for large reads |
| `session_timeout=60` | `option.custom_http_params` | Extend HTTP session timeout |
| `ssl` | `option.ssl` | Enable TLS (`true`/`false`) |
| `client_name` | `option.client_name` | Tag requests with a client name visible in `system.query_log` |
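Since `option.custom_http_params` takes several server settings as a single string, it can help to assemble that string programmatically. A sketch, assuming the comma-separated `key=value` format shown in the table above:

```python
# Assemble the comma-separated string that option.custom_http_params
# expects from a dict of ClickHouse server settings. The exact format
# is assumed from the examples in the table above.
def to_custom_http_params(settings):
    return ",".join(f"{key}={value}" for key, value in sorted(settings.items()))

params = to_custom_http_params({"max_execution_time": 300, "session_timeout": 60})
```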
Contributor

We don't append multiple client_name values in the connector, and the Java Client doesn't do it automatically (in situations where there are multiple client_name sets, as in https://github.com/ClickHouse/spark-clickhouse-connector/blob/a1d8b7b32cae27e2fe38133c10ce10613b824013/clickhouse-core/src/main/scala/com/clickhouse/spark/client/NodeClient.scala#L97-L104), so I'm not sure that will work. Did you verify it?

### Via TableProvider API {#query-settings-tableprovider-api}

```python
df.read \
```
Contributor

Did you mean spark.read?


### Common settings {#query-settings-common}

| Setting | Example value | Purpose |
Contributor

The table header is not aligned with the data. The first two columns should be swapped



**Cause**: When aggregation push-down is triggered, the connector sends a probe query to ClickHouse to determine the output schema of the pushed-down aggregation. This is by design and only occurs when aggregation push-down is active.

**Fix**: If this causes unacceptable load, avoid triggering aggregation push-down by handling the aggregation in Spark instead:
Contributor

This "fix" reads the entire table into Spark before aggregating, which in Spark's use case (working with big data), would potentially be far worse than the probe query it is trying to avoid.

For most cases, the WHERE 1=0 probe query is just a schema-determination round-trip (near-zero cost, and if it is not, it might suggest other problems), and the SELECT * workaround trades a free probe query for a full table scan.

Do we have other alternatives to avoid it? I see the question in ClickHouse/spark-clickhouse-connector#374 was for a situation where a schema is being provided. Is there a feature flag disabling this once a schema is provided? If not, I would encourage us to keep the issue open, and once we develop such functionality, add it to the docs.


### Read parallelism {#read-parallelism}

The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table.
Contributor

Suggested change
The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table.
The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table. Please visit [Table Partitions](https://clickhouse.com/docs/partitions) for more information on partitioning.

| Table layout | Resulting read tasks |
|---|---|
| `Distributed` table with `spark.clickhouse.read.distributed.convertLocal=true` (default) | (number of shards) × (partitions per shard) tasks, each targeting a specific shard node directly |
| `Distributed` table with `spark.clickhouse.read.distributed.convertLocal=false` | 1 task, reading through the `Distributed` coordinator node |

For a table with no `PARTITION BY` clause, the connector creates one Spark partition per ClickHouse data part group, which may result in many small tasks. Use `PARTITION BY` in your ClickHouse table to control parallelism granularity.
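The task-count rules above reduce to a small formula. A sketch under the stated assumptions (one task per shard × partition with `convertLocal=true`, one coordinator task otherwise; illustrative only):

```python
# Task-count sketch for reads from a Distributed table, following the
# rules described above: convertLocal=true plans one task per
# (shard, partition) pair; convertLocal=false reads through the
# coordinator in a single task. Illustrative, not connector code.
def expected_read_tasks(num_shards, partitions_per_shard, convert_local=True):
    if convert_local:
        return num_shards * partitions_per_shard
    return 1
```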
Contributor

which may result in many small

Would love to hear your thoughts to see if my understanding here is right:

  1. planInputPartitions returns Array[InputPartition] and Spark creates one task per element.
  2. createReader(partition) is called once per element from that array. It takes a single ClickHouseInputPartition and creates one reader for it.

So, to summarize: the mapping is 1 to 1. One InputPartition object maps to one Spark task (which maps to one reader).

If that is correct, there is no fan out.

For the no-PARTITION BY case: queryPartitionSpec returns Array(NoPartitionSpec) (one element) → inputPartitions has length 1 → planInputPartitions returns an array of 1 → Spark schedules exactly 1 task.

…docs

- Add releases page link for tested ClickHouse versions
- Add Table Partitions link in read parallelism section
- Add distributed writing docs link in Distributed tables section
- Remove option.client_name from examples (connector sets its own client name)
- Clarify repartitionStrictly vs coalesce/repartition distinction
- Rename troubleshooting section to "Too many schema inference WHERE 1=0 queries"
- Spell out allowUnsupportedSharding consequences (data skew, wrong shard writes)
- Fix WHERE 1=0 troubleshooting: add pushAggregation.enabled config option,
  note that .schema() does not suppress the probe query
- Fix repartitionNum description: it IS used via requiredNumPartitions()
- Fix read.settings: must use spark.conf.set(), .option() has no effect;
  add set/collect/unset pattern for per-query scoping

Add a note block linking to ClientConfigProperties.java and the documentation pages for Java client configuration and ClickHouse server session settings.

Add tabbed code examples showing how to use spark.clickhouse.read.settings and spark.clickhouse.write.* options in Python, Scala, and Java.
…m Spark connector

Remove the Common timeout scenarios section and the Connection resets on
AWS PrivateLink or NLB troubleshooting entry. The NLB broken pipe scenario
could not be reproduced from EMR against the staging endpoint (public HTTPS
path, no NLB in the TCP path), so this content is not verified and should
not be published.

Labels

Don't Merge (don't merge yet)

2 participants