
docs: Updating spark documentation #5766

Open

ShimonSte wants to merge 7 commits into main from docs/spark-connector-query-settings

Conversation

@ShimonSte (Contributor)

New Section Issues Addressed

- Supported ClickHouse server versions #432 — adds a minimum version table (21.6+ for partition-id, 25.3+ for VariantType, etc.)
- Push-down operations #476 — lists all 5 push-down interfaces (column pruning, filters, limit, aggregations, runtime filters) with a table, notes on unsupported expressions, and how to enable runtime filtering
- Connector parallelism #435 — explains read parallelism (1 task per CH partition, 3 modes for Distributed), write parallelism (repartition/sort flow), and how to control task count
- Working with Distributed tables #402 — explicitly documents that inserts go to local tables, that all shard hostnames must be accessible, and how to fall back to coordinator-only mode
- Passing query settings and Java client options #430, #431 — documents the option.<key> / option.custom_http_params mechanism for both the Catalog and TableProvider APIs, with a common settings table
- Performance tuning #434 — read and write tuning tables plus a recommended bulk-load config snippet
- Troubleshooting #436, #365, #374 — covers Broken Pipe, WHERE 1=0 schema inference overhead, shard hostname resolution, too many tasks, partition expression errors, and stale schema
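The option.<key> passthrough described above can be illustrated with a small sketch. The splitting logic and key names here are assumptions based on this PR description, not the connector's actual implementation:

```python
# Hedged sketch of the option.<key> passthrough: options prefixed with
# "option." are assumed to be forwarded to the underlying Java client,
# while option.custom_http_params carries ClickHouse server settings as
# a comma-separated string. Key names are illustrative.
def split_connector_options(options):
    """Separate connector-level options from Java-client passthrough options."""
    connector_opts, client_opts = {}, {}
    for key, value in options.items():
        if key.startswith("option."):
            client_opts[key[len("option."):]] = value
        else:
            connector_opts[key] = value
    return connector_opts, client_opts

connector, client = split_connector_options({
    "host": "my-clickhouse-host",  # hypothetical host name
    "option.custom_http_params": "max_execution_time=300,session_timeout=60",
})
```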

@ShimonSte ShimonSte requested review from a team as code owners March 17, 2026 09:39
Contributor

@BentsiLeviav BentsiLeviav left a comment


Besides the comments I left, addressing these issues would make our docs even better:

  • Let's fix the UInt64 mapping. Right now it maps to LongType; after your change here (ClickHouse/spark-clickhouse-connector#477), it is DecimalType(20, 0)
  • For the option allowUnsupportedSharding, the docs say since 0.10.0 but the code says since 0.9.0
  • There is an SBT syntax issue (line 141) - the artifact name is unquoted
  • We wrongly use : in the jar path instead of - . --jars /path/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}:{{ stable_version }}.jar
    should be changed to --jars /path/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar
  • We have a typo in the Gradle snippet in line 131 (repositries)
  • In the compatibility matrix - let's update that main uses JDBC 0.9.5 and not 0.9.4
  • In version 0.9.0, we added the option to provide settings on read using spark.clickhouse.read.settings - let's document it in the configuration table.
  • spark.clickhouse.read.jsonAs is missing from the configuration table as well

| Feature | Minimum ClickHouse version |
|---|---|
| `spark.clickhouse.read.splitByPartitionId` (partition-id-based filtering) | 21.6+ |
| `VariantType` / `JSON` type support | 25.3+ |

For production deployments, we recommend running ClickHouse 23.x or later. The connector is tested against the latest stable ClickHouse releases.
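The minimum-version table translates naturally into a version gate. A minimal sketch of such a check, assuming dotted integer version strings (this is illustrative, not connector code):

```python
# Minimal version-gate sketch for the minimum-version table above.
# ClickHouse versions are compared as tuples of dotted integers;
# illustrative only, not connector code.
def supports(server_version: str, minimum: str) -> bool:
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(server_version) >= parse(minimum)

print(supports("25.3", "21.6"))  # partition-id filtering available
print(supports("21.5", "25.3"))  # VariantType/JSON not yet available
```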
Contributor

I think we should recommend using the latest, and not a specific version - WDYT?

Contributor

Can you add a link to the GitHub workflow ClickHouse version list? That way, users would be able to see exactly what versions we test with


When using ClickHouse `Distributed` tables with the connector, there is an important networking and architecture consideration:

**The connector bypasses the Distributed engine and connects directly to each shard node.**
Contributor

The current phrasing sounds a bit like it's "working around" something. I'll suggest writing something like:

Suggested change
**The connector bypasses the Distributed engine and connects directly to each shard node.**
**Following ClickHouse best practices for high-throughput ingestion, the connector writes data directly to the underlying shard nodes rather than using the Distributed engine.**

Feel free to suggest other versions if you find them fit.

This means:

1. **All shard hostnames must be reachable** from Spark executors. The connector reads cluster topology from `system.clusters` on the coordinator node, then opens direct connections to each shard. If the shard hostnames returned by `system.clusters` are internal cluster names not resolvable from outside, reads and writes will fail.
2. **Inserts go to local tables**: When writing, data is inserted directly into the local `MergeTree` table on each shard (not through the `Distributed` table). This is more efficient but requires that the local table exists on every shard and that Spark can connect to each shard directly.
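Since every shard hostname returned by `system.clusters` must resolve from the Spark executors, a quick preflight check can catch this failure mode early. A standalone diagnostic sketch (not connector code; how you obtain the hostnames from ClickHouse is up to you):

```python
import socket

# Preflight sketch: given the shard host names a `system.clusters` query
# returned (obtained however you query ClickHouse), report the ones that
# do not resolve from this machine. Purely diagnostic, not connector code.
def unresolvable_shards(hostnames):
    """Return the shard hostnames that fail DNS resolution locally."""
    bad = []
    for host in hostnames:
        try:
            socket.gethostbyname(host)
        except socket.gaierror:
            bad.append(host)
    return bad
```

Run a check like this from a Spark executor (for example via a small RDD job) rather than only on the driver, since it is the executors that open the direct shard connections.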
Contributor

| `max_execution_time=300` | `option.custom_http_params` | Extend query timeout (seconds) for large reads |
| `session_timeout=60` | `option.custom_http_params` | Extend HTTP session timeout |
| `ssl` | `option.ssl` | Enable TLS (`true`/`false`) |
| `client_name` | `option.client_name` | Tag requests with a client name visible in `system.query_log` |
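Since `option.custom_http_params` takes several server settings as a single string, it can help to assemble that string programmatically. A sketch, assuming the comma-separated `key=value` format shown in the table above:

```python
# Assemble the comma-separated string that option.custom_http_params
# expects from a dict of ClickHouse server settings. The exact format
# is assumed from the examples in the table above.
def to_custom_http_params(settings):
    return ",".join(f"{key}={value}" for key, value in sorted(settings.items()))

params = to_custom_http_params({"max_execution_time": 300, "session_timeout": 60})
```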
Contributor

We don't append multiple client_name values in the connector, and the Java Client doesn't do it automatically (in situations where there are multiple client_name sets, as in https://github.com/ClickHouse/spark-clickhouse-connector/blob/a1d8b7b32cae27e2fe38133c10ce10613b824013/clickhouse-core/src/main/scala/com/clickhouse/spark/client/NodeClient.scala#L97-L104), so I'm not sure that will work. Did you verify it?

### Via TableProvider API {#query-settings-tableprovider-api}

```python
df.read \
```
Contributor

Did you mean spark.read?


### Common settings {#query-settings-common}

| Setting | Example value | Purpose |
Contributor

The table header is not aligned with the data. The first two columns should be swapped



**Cause**: When aggregation push-down is triggered, the connector sends a probe query to ClickHouse to determine the output schema of the pushed-down aggregation. This is by design and only occurs when aggregation push-down is active.

**Fix**: If this causes unacceptable load, avoid triggering aggregation push-down by handling the aggregation in Spark instead:
Contributor

This "fix" reads the entire table into Spark before aggregating, which in Spark's use case (working with big data), would potentially be far worse than the probe query it is trying to avoid.

For most cases, the WHERE 1=0 probe query is just a schema-determination round-trip (near-zero cost, and if it is not, it might suggest other problems), and the SELECT * workaround trades a free probe query for a full table scan.

Do we have other alternatives to avoid it? I see the question in ClickHouse/spark-clickhouse-connector#374 was for a situation where a schema is being provided. Is there a feature flag disabling this once a schema is provided? If not, I would encourage us to keep the issue open, and once we develop such functionality, add it to the docs.


### Read parallelism {#read-parallelism}

The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table.
Contributor

Suggested change
The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table.
The connector creates one Spark input partition per ClickHouse physical partition (data part group). The number of Spark read tasks equals the number of distinct partition values in the ClickHouse table. Please visit [Table Partitions](https://clickhouse.com/docs/partitions) for more information on partitioning.

| Table layout | Resulting read tasks |
|---|---|
| `Distributed` table with `spark.clickhouse.read.distributed.convertLocal=true` (default) | (number of shards) × (partitions per shard) tasks, each targeting a specific shard node directly |
| `Distributed` table with `spark.clickhouse.read.distributed.convertLocal=false` | 1 task, reading through the `Distributed` coordinator node |

For a table with no `PARTITION BY` clause, the connector creates one Spark partition per ClickHouse data part group, which may result in many small tasks. Use `PARTITION BY` in your ClickHouse table to control parallelism granularity.
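The task-count rules above reduce to a small formula. A sketch under the stated assumptions (one task per shard × partition with `convertLocal=true`, one coordinator task otherwise; illustrative only):

```python
# Task-count sketch for reads from a Distributed table, following the
# rules described above: convertLocal=true plans one task per
# (shard, partition) pair; convertLocal=false reads through the
# coordinator in a single task. Illustrative, not connector code.
def expected_read_tasks(num_shards, partitions_per_shard, convert_local=True):
    if convert_local:
        return num_shards * partitions_per_shard
    return 1
```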
Contributor

which may result in many small

Would love to hear your thoughts to see if my understanding here is right:

  1. planInputPartitions returns Array[InputPartition] and Spark creates one task per element.
  2. createReader(partition) is called once per element from that array. It takes a single ClickHouseInputPartition and creates one reader for it.

So, to summarize: the mapping is 1 to 1. One InputPartition object maps to one Spark task (which maps to one reader).

If that is correct, there is no fan out.

For the no-PARTITION BY case: queryPartitionSpec returns Array(NoPartitionSpec) (one element) → inputPartitions has length 1 → planInputPartitions returns an array of 1 → Spark schedules exactly 1 task.

…docs

- Add releases page link for tested ClickHouse versions
- Add Table Partitions link in read parallelism section
- Add distributed writing docs link in Distributed tables section
- Remove option.client_name from examples (connector sets its own client name)
- Clarify repartitionStrictly vs coalesce/repartition distinction
- Rename troubleshooting section to "Too many schema inference WHERE 1=0 queries"
- Spell out allowUnsupportedSharding consequences (data skew, wrong shard writes)
- Fix WHERE 1=0 troubleshooting: add pushAggregation.enabled config option,
  note that .schema() does not suppress the probe query
- Fix repartitionNum description: it IS used via requiredNumPartitions()
- Fix read.settings: must use spark.conf.set(), .option() has no effect;
  add set/collect/unset pattern for per-query scoping

Add a note block linking to ClientConfigProperties.java and the documentation pages for Java client configuration and ClickHouse server session settings.

Add tabbed code examples showing how to use spark.clickhouse.read.settings and spark.clickhouse.write.* options in Python, Scala, and Java.
…m Spark connector

Remove the Common timeout scenarios section and the Connection resets on
AWS PrivateLink or NLB troubleshooting entry. The NLB broken pipe scenario
could not be reproduced from EMR against the staging endpoint (public HTTPS
path, no NLB in the TCP path), so this content is not verified and should
not be published.

Labels

Don't Merge (don't merge yet)

2 participants