[KYUUBI #7305] Fix JDBC engine returning tables from all databases when a database is specified in the URL #7419
Conversation
Fix JDBC engine returning tables from all databases when a database is specified in the URL

When DBeaver (or any Hive-JDBC client) connects to the Kyuubi JDBC engine with a database path in the URL, `DatabaseMetaData.getTables` should return only the tables from that database. Previously, the Hive JDBC driver converted a `null` schemaPattern to `"%"` before sending the request, which the engine forwarded straight into a `WHERE TABLE_SCHEMA LIKE '%'` clause on the backend, returning tables from every database.

- Introduce `JdbcDialect.setCurrentDatabase(connection, database)` as an extension point; the default is a no-op returning `false`. `MySQLDialect` overrides it with `` USE `db` `` so MySQL/StarRocks/Doris switch the backend connection on session open.
- `JdbcSessionImpl` records the successfully applied database in `effectiveDatabase` and only keeps it when `setCurrentDatabase` returns `true`. `"default"` (the Hive JDBC driver's fallback when no database is specified in the URL) is tolerated on `USE` failure, to stay compatible with backends that have no `default` database.
- `JdbcOperationManager.newGetTablesOperation` treats a blank schema or `"%"` as "use the session's effective database", routing through `effectiveDatabase.orNull`. Backends without an effective database (PostgreSQL, Oracle, etc.) keep the previous no-filter behavior.
- `MySQLDialect` now implements `getCatalogsOperation` and `getSchemasOperation` (previously `featureNotSupported`). `getSchemasOperation` aliases result columns to the JDBC-standard `TABLE_CATALOG` / `TABLE_SCHEM` so `DatabaseMetaData.getSchemas` works with clients such as DBeaver.

Closes apache#7305.
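For readers following along, here is a minimal sketch of the extension point this commit message describes; the trait and class shape is a reconstruction from the message above, not the actual Kyuubi source:

```scala
import java.sql.Connection

// Reconstruction from the commit message above, not the actual Kyuubi source.
trait JdbcDialect {
  // Default: no-op; false signals the backend connection was not switched.
  def setCurrentDatabase(connection: Connection, database: String): Boolean = false
}

class MySQLDialect extends JdbcDialect {
  // MySQL, StarRocks and Doris all accept USE, so one override covers the family.
  override def setCurrentDatabase(connection: Connection, database: String): Boolean = {
    val stmt = connection.createStatement()
    try {
      stmt.execute(s"USE `$database`")
      true
    } finally stmt.close()
  }
}
```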
```scala
} else {
  schemaName
}
val query = dialect.getTablesQuery(catalogName, effectiveSchema, tableName, tableTypes)
```
I don't recall the details of the initial code of this part, but the API looks problematic.
The Hive Thrift `GetTablesOperation` API is derived from the JDBC API, which is why they look almost identical. For the JDBC engine, such a call should map directly to the JDBC API instead of constructing SQL (if you take a look at JDBC driver implementations, you may find some impls are non-SQL-based); the current design complicates the whole thing.
Thanks for raising this — you're right, and I want to confirm I took it seriously rather than hand-wave it away. To make sure I understood, I ran the full set of DatabaseMetaData calls against three different drivers (mysql-connector-j 8.4.0, postgresql 42.7.2, clickhouse-jdbc 0.6.5) using a small Java probe and docker containers. Test fixtures were tailored to each backend (MySQL/PG had PK + FK + a function; ClickHouse used MergeTree which has no native PK/FK).
Cross-driver behavior of each API:

| API | MySQL (databaseTerm=CATALOG) | MySQL (databaseTerm=SCHEMA) | PostgreSQL | ClickHouse |
|---|---|---|---|---|
| `getCatalogs()` | ✓ implemented | ⚠ empty (driver `databaseTerm` selection) | ✓ implemented | ✓ implemented |
| `getSchemas()` | ⚠ empty (driver `databaseTerm` selection) | ✓ implemented | ✓ implemented | — model has no schema layer |
| `getTableTypes()` | ✓ | ✓ | ✓ | ✓ |
| `getTables(...)` | ✓ | ✓ | ✓ | ✓ |
| `getColumns(...)` | ✓ | ✓ | ✓ | ✓ |
| `getPrimaryKeys` | ✓ | ✓ | ✓ | — MergeTree has no PK |
| `getImported` / `Exported` / `CrossReference` | ✓ | ✓ | ✓ | — no FK in CK |
| `getFunctions(...)` | ✓ | ✓ | ✓ | (no fixture created) |
| `getTypeInfo()` | ✓ | ✓ | ✓ | ✓ |
Things that hold across all three drivers:

- Every JDBC-standard metadata method is implemented — none threw `SQLFeatureNotSupportedException`; concepts that don't apply just return empty result sets. (The JDBC contract requires this; clients differ in how strictly they enforce it.)
- Column names follow the JDBC spec (`TABLE_CAT`, `TABLE_SCHEM`, `TABLE_NAME`, `COLUMN_NAME`, `DATA_TYPE`, `KEY_SEQ`, `PK_NAME`, ...) — which is exactly what `GetTablesOperation` / `GetColumnsOperation` / etc. on the Hive Thrift side expect, since the Thrift API was derived from the JDBC API in the first place.
- Driver-specific quirks are minor and well-defined — the `databaseTerm` selection is unique to mysql-connector-j; PG returns column labels in lowercase while MySQL/CK use uppercase (the JDBC spec says case-insensitive, but engine-side string comparisons would need to match).
So the JDBC standard API really does cover every operation the dialect models today, across very different DBMS families. Concretely, this means the `JdbcDialect.getXxxQuery` family of methods could be deleted in favor of the engine calling `connection.getMetaData.getXxx(...)` directly — and dialects whose driver implements metadata via a non-SQL path (Avatica, etc.) would automatically be exercised the way they were designed to be.
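To make that concrete, a sketch of the direct-call shape (assumed signature; marshalling the `ResultSet` into the Thrift `GetTables` schema is elided):

```scala
import java.sql.{Connection, ResultSet}

// Sketch: the engine delegating to the driver's metadata implementation
// instead of asking the dialect for a SQL string.
def getTablesDirect(
    conn: Connection,
    catalog: String,
    schemaPattern: String,
    tableNamePattern: String,
    tableTypes: Array[String]): ResultSet =
  conn.getMetaData.getTables(catalog, schemaPattern, tableNamePattern, tableTypes)
```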
The few wrinkles I hit are all driver-side and resolvable in a few lines of engine-level adapter code, not by reaching for SQL:

- `getCatalogs` / `getSchemas` are selected by mysql-connector-j's `databaseTerm` property (only one returns rows in a given mode) — handled by checking the property and routing, or by calling both and taking whichever is non-empty.
- `DATA_TYPE` is an `int` in the JDBC spec but the current SQL returns a string — a `java.sql.Types`-to-name mapping is one helper (see the sketch after this list).
- The current SQL returns extra MySQL-specific columns (`ENGINE`, `TABLE_ROWS`, ...) but the Hive Thrift `GetTables` result schema is fixed to JDBC-standard columns, so those extras are already dropped during serialization today and aren't reaching clients — no real loss.
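On the `DATA_TYPE` wrinkle, a sketch of the helper, built by reflecting over the public static int fields of `java.sql.Types` (names are mine):

```scala
import java.sql.Types

object SqlTypeNames {
  // Map "VARCHAR" -> java.sql.Types.VARCHAR (12), etc., by reflecting
  // over the public static int fields of java.sql.Types.
  val nameToCode: Map[String, Int] =
    classOf[Types].getFields
      .filter(_.getType == java.lang.Integer.TYPE)
      .map(f => f.getName -> f.getInt(null))
      .toMap
}

// e.g. SqlTypeNames.nameToCode("VARCHAR") == java.sql.Types.VARCHAR
```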
I'd like to keep this PR scoped to the URL-database bug since that's what it set out to fix, and migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test — not something I'd want to fold into a bug fix. Happy to open a follow-up issue tracking the migration if you agree with that split. WDYT?
> migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test

if we do that first, do we still need this patch?
> No SQLFeatureNotSupportedException anywhere

Actually, the JDBC API Javadocs define the behavior very clearly: a method should not throw `SQLFeatureNotSupportedException` unless its Javadoc says it can. Some tools, like DBeaver, may tolerate it if the JDBC driver vendor does not respect the API contract, but tools like DataGrip may not.
BTW, we MUST avoid copying the code from mysql-connector-j, it's GPL-licensed ... try to read the docs or use black box testing to tackle the mysql-connector-j-related issue. (this will be a horrible experience ...)
> migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test
>
> if we do that first, do we still need this patch?
Yes — the bug is the Hive JDBC client converting a `null` schemaPattern to `"%"`, and the JDBC spec says `"%"` means "don't narrow by schema", so calling `DatabaseMetaData.getTables(...)` directly reproduces it. Verified on mysql-connector-j 8.4.0 (connection opened against `jdbc:mysql://.../dbA`):
```
conn.getCatalog()                    == "dbA"
getTables(null,  "%",  "%", null)  → tables from dbA AND dbB  ← bug reproduces
getTables("dbA", null, "%", null)  → tables from dbA only     ← only narrows when explicit
```
So the engine still has to substitute the URL-scoped database into the call when the client sends `null`/`"%"`. That routing logic (`effectiveDatabase` + the if-blank-then-route block in `JdbcOperationManager`) survives the refactor unchanged — only the line that builds a SQL string flips to `conn.getMetaData.getTables(...)`. The `setSchema`-on-session-open machinery is also independent.
What the refactor does subsume: the two `getCatalogsOperation` / `getSchemasOperation` SQL implementations I added to `MySQLDialect` — those go away once the engine calls `getCatalogs()` / `getSchemas()` directly.
Happy either way on ordering — bug fix first then refactor, or refactor first and a smaller patch on top.
> BTW, we MUST avoid copying the code from `mysql-connector-j`, it's GPL-licensed ... try to read the docs or use black box testing to tackle the `mysql-connector-j`-related issue. (this will be a horrible experience ...)
Agreed, and thanks for catching this early. I've rewritten all my prior PR comments and the in-tree comment in `MySQLDialect.setSchema` to remove any reference to or paraphrase of mysql-connector-j source — keeping only the documented `databaseTerm` property and black-box test observations. Also amended the FOLLOWUP commit message similarly. I'll follow this rule going forward.
…tandard setCatalog/setSchema

Replace the hand-rolled `` USE `db` `` with `Connection.setCatalog` plus `Connection.setSchema` in `MySQLDialect.setCurrentDatabase`. The MySQL Connector/J `databaseTerm` connection property determines which of the two calls actually switches the session database; invoking both ensures the switch happens regardless of how the user configured the URL, while staying on the standard JDBC API instead of hand-rolled SQL. `StarRocksDialect` / `DorisDialect` inherit this via `MySQLDialect`.

Also add three StarRocks regression tests covering: scoping when the URL specifies a database, no-filter behavior when the URL has no database, and session-open failure on a non-existent database.
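In code, the revised override amounts to roughly the following sketch (still on the pre-rename `setCurrentDatabase` signature at this point in the history; names follow the thread, not verified source):

```scala
import java.sql.Connection

// Sketch of the FOLLOWUP commit above, reusing the JdbcDialect trait
// sketched earlier in this thread.
class MySQLDialect extends JdbcDialect {
  override def setCurrentDatabase(connection: Connection, database: String): Boolean = {
    // With databaseTerm=CATALOG (the Connector/J default) setCatalog switches
    // the session database and setSchema is a no-op; with databaseTerm=SCHEMA
    // the roles are reversed. Calling both covers either configuration.
    connection.setCatalog(database)
    connection.setSchema(database)
    true
  }
}
```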
…Dialect

Per review feedback, replicate the JDBC `Connection#setSchema(String)` API on `JdbcDialect` with a leading `conn` parameter. The default forwards to the standard API; dialects whose driver does not support it can override to a no-op. Drop the boolean return value — callers treat the call as if it always takes effect. `JdbcSessionImpl` now records `effectiveDatabase` unconditionally after the call, decoupling "backend connection switched" from "engine metadata scope".
…ession open

The Hive JDBC driver sends `USE_DATABASE = "default"` when the user did not specify a database in the connection URL — a protocol-level stub indistinguishable from a genuine request for a database literally named "default". Forwarding that into `setSchema` plus recording it as `effectiveDatabase` caused PG and ClickHouse metadata operations to scope to a non-existent or unintended schema, breaking the existing `get tables` suites in CI.

Filter `"default"` at the session entry point: skip both the dialect `setSchema` call and the `effectiveDatabase` assignment for it. This drops the prior try/catch that only existed to tolerate the failure path. Also reword the in-tree `MySQLDialect.setSchema` comment to describe the externally observed behavior of the documented `databaseTerm` connection property, without paraphrasing driver internals.

URL paths that name a real non-`default` database keep the new scoping behavior; URL paths that explicitly name `default` continue to behave as before this PR (no metadata scope), which is the same as URL paths with no database — there is no observable regression.
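A sketch of that entry-point filter; the helper and constant names are mine, not the actual `JdbcSessionImpl` code:

```scala
import java.sql.Connection

object SessionOpenSketch {
  // Protocol-level stub the Hive JDBC driver sends when the URL names no database.
  val HiveDefaultStub = "default"

  // Returns the database to record as effectiveDatabase, if any.
  def applyUrlDatabase(
      setSchema: (Connection, String) => Unit, // the dialect hook
      conn: Connection,
      urlDatabase: Option[String]): Option[String] =
    urlDatabase.filterNot(_ == HiveDefaultStub) match {
      case Some(db) =>
        setSchema(conn, db) // may throw: session open fails fast on a bad database
        Some(db)            // recorded as effectiveDatabase
      case None => None     // no database in URL, or the "default" stub: no scope
    }
}
```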
Force-pushed from `1edb352` to `f10df51`.
Where the patch stands now

Posting a wrap-up. Want to flag one decision up-front since it's the least obvious from the diff and the most likely thing to want a sanity-check. Heads-up: how metadata scope behaves for each URL form:
| URL form | Before this PR | After this PR |
|---|---|---|
| no database in URL | no metadata scope | no metadata scope (unchanged) |
| URL specifies a non-`default` database | no metadata scope (the original bug) | scoped to that database ✓ |
| URL explicitly names `default` (and the backend has one) | no metadata scope | no metadata scope (no change) |
No observable regression — the third row was already in this state before this PR (the engine never URL-scoped at all). Users on backends with a real `default` schema can still target it after connecting via `USE default;`, which goes through a different code path (`newSetCurrentDatabaseOperation`). I considered alternatives (a try/catch around `setSchema`, or verifying the schema exists before recording it), but the first is brittle on PG/CH, and the second still does the wrong thing on ClickHouse since CH genuinely has a `default` database.
The rest of the patch, briefly
Three engine-side layers:
- Dialect API — `JdbcDialect.setSchema(conn, schema): Unit` mirrors `Connection#setSchema` with a leading `conn`; the default forwards to the standard API. Only `MySQLDialect` overrides — it calls both `setCatalog` + `setSchema` so the database switch happens regardless of the `databaseTerm` connection property the user configured.
- Session open — invokes the dialect's `setSchema` for the URL-scoped database (modulo the `"default"` stub above) and records it as `effectiveDatabase`.
- Operation routing — `JdbcOperationManager.newGetTablesOperation` substitutes `effectiveDatabase` when the client sends a blank or `"%"` schemaPattern, undoing the Hive driver's `null → "%"` translation (sketched below).
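The substitution rule in isolation, as a sketch with assumed names:

```scala
// Sketch of the routing in JdbcOperationManager.newGetTablesOperation
// (assumed shape): blank or "%" means the client didn't narrow, so fall
// back to the session's URL-scoped database when one was recorded.
def resolveSchemaPattern(schemaPattern: String, effectiveDatabase: Option[String]): String =
  if (schemaPattern == null || schemaPattern.trim.isEmpty || schemaPattern == "%")
    effectiveDatabase.orNull // None (PostgreSQL, Oracle, ...): keep the no-filter behavior
  else
    schemaPattern // an explicit pattern always wins
```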
Tracked separately
Migrating the dialect API from "return a SQL string" to "engine calls `DatabaseMetaData.getXxx(...)` directly" — your point on the `GetTablesOperation` thread, with cross-driver verification confirming feasibility on MySQL / PG / CH. This PR's URL-database routing logic survives that refactor unchanged. Will open a separate issue.
Can we do mapping
Happy to do the refactor first if that's the preferred order. One thing to confirm though, since I want to make sure we're aligned: as I noted above, the URL-database routing (`effectiveDatabase` plus the blank/`"%"` substitution in `JdbcOperationManager`) survives the refactor unchanged. Given that, do you still prefer refactor-first, or merge this as the minimal bug fix and follow up with the refactor? Either order works for me.
the JDBC API Javadocs say

I interpret it to

I don't understand why
Note, some JDBC clients (e.g., Cloudera HUE) use
You're right, and I want to walk back my earlier claim that the substitution is required. I tested this end-to-end against master (without this PR) using DBeaver and a real MySQL backend, and traced what actually triggers the bug:

On master, `getCatalogsOperation` / `getSchemasOperation` are still `featureNotSupported` in `MySQLDialect`. So the root cause of KYUUBI-7305 is that DBeaver fails to obtain a database list and falls back to an unscoped `getTables` call. That makes the `getCatalogsOperation` / `getSchemasOperation` implementations the actual fix, and the blank/`"%"` substitution mechanism unnecessary.
Yes, this meets my expectations exactly.
Going to close this PR — the discussion drifted from the actual root cause, and the substitution mechanism here turned out to be unnecessary, as you pointed out. Apologies for the noise on this thread; I should have traced the failure path empirically much earlier instead of arguing from incomplete reasoning, and I appreciate your patience walking me back to it. I'll open a new PR taking the direction you've been suggesting all along: align the dialect API with JDBC semantics by migrating the hand-written SQL methods (`getTablesQuery` and friends) to direct `DatabaseMetaData` calls.
Why are the changes needed?
This is a bug fix.
The Hive JDBC driver rewrites a `null` schemaPattern to `"%"` in `GetTables`, and the JDBC engine forwards it directly into `WHERE TABLE_SCHEMA LIKE '%'`, so DBeaver — which always passes `null` once a connection is established — sees tables from every database in the backend rather than the one in its connection URL. Compounding this, the JDBC engine never applies the URL-level database to the underlying JDBC connection, so there is no "current database" on the backend for a filter to fall back to.

The fix introduces a dialect-level `setCurrentDatabase` hook, tracks an `effectiveDatabase` on the session, and routes blank/`"%"` schema patterns through it in `GetTablesOperation`. `MySQLDialect` additionally implements `getCatalogsOperation` / `getSchemasOperation` with JDBC-standard column labels (`TABLE_CATALOG` / `TABLE_SCHEM`) so the schemas tree in DBeaver populates correctly.

Non-MySQL dialects inherit the default no-op `setCurrentDatabase` and `effectiveDatabase` stays `None`, so PostgreSQL / Oracle / etc. keep their existing behavior.

Note: `getSchemas()` continues to return all accessible schemas per the JDBC spec — the URL-level database only affects the default query context and the blank/`"%"` fallback in `getTables`. This matches the behavior of the native MySQL JDBC driver. Clients that want to restrict the visible schema list should do so via client-side filters (e.g. DBeaver's "Schemas" filter).

How was this patch tested?
Added unit / integration tests in `MySQLOperationSuite` covering:

- `getTables` with a database in the URL returns only that database's tables
- `getTables` with a `"%"` schema pattern is routed to the session's effective database
- `getSchemas` returns results with JDBC-standard `TABLE_CATALOG` / `TABLE_SCHEM` column labels
- `getCatalogs` behavior

Also verified end-to-end against a live StarRocks backend using Spark's Hive-JDBC beeline.
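For illustration, the first bullet as a test sketch; the engine URL, port, and suite wiring are assumptions, and the real coverage lives in `MySQLOperationSuite`:

```scala
import java.sql.DriverManager
import org.scalatest.funsuite.AnyFunSuite

// Sketch only: connection details and suite wiring are assumptions.
class UrlDatabaseScopingSketch extends AnyFunSuite {
  test("getTables scopes to the database named in the connection URL") {
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10009/dbA")
    try {
      val rs = conn.getMetaData.getTables(null, null, "%", null)
      val schemas = scala.collection.mutable.Set.empty[String]
      while (rs.next()) schemas += rs.getString("TABLE_SCHEM")
      assert(schemas == Set("dbA")) // only the URL-scoped database shows up
    } finally conn.close()
  }
}
```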
Was this patch authored or co-authored using generative AI tooling?
Human-authored core design and implementation. Tests, verification, and PR description drafting were assisted by Claude Code.
Generated-by: Claude Code (Opus 4.7)
Closes #7305. Supersedes the stale #7307 per @pan3793's suggestion there.
Co-authored-by: zhzhenqin <[email protected]>