Title: Query Cost Estimator in SQL Lab Before Execution #39650
This proposal addresses a real production pain point. The per-engine abstraction layer is the right approach — each database has fundamentally different cost semantics (BigQuery bills by bytes scanned, Snowflake by credit-seconds, Postgres by arbitrary planner cost units), so a unified cost model would paper over real differences. A few design considerations:

Threshold-based blocking vs. advisory warnings. Teams will eventually want configurable thresholds — warn above 10GB scanned, block above 100GB without admin override. The advisory-only mode is the right v1, but the architecture should anticipate enforcement; see the sketch at the end of this comment.

EXPLAIN plan caching for repeated queries. If analysts iterate on a query, caching the EXPLAIN result for the previous version lets you show a delta — "this change reduced estimated rows from 500M to 2M." That feedback loop is more valuable than a single absolute number.

Cost estimation for generated SQL. This becomes especially important as more SQL is generated by AI tools rather than written by hand. Tools in the NL-to-SQL space — ai2sql.io and others — produce correct queries that may not be cost-optimal. An EXPLAIN pre-check catches expensive patterns regardless of whether the query was hand-written or machine-generated.

Fallback strategy. The SELECT COUNT(*) fallback can be as expensive as running the query itself, so engines without a cheap estimator may be better served by reporting no estimate at all. The feature flag approach (ENABLE_QUERY_COST_ESTIMATE) keeps the feature opt-in per deployment while engine coverage matures.
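On the threshold point, here is a minimal sketch of what a configurable warn/block policy could look like, under assumed names (`CostThresholds`, `evaluate`) that are not part of Superset:

```python
# Hypothetical threshold policy layered on top of an advisory estimate;
# none of these names exist in Superset today.
from dataclasses import dataclass

GB = 1024 ** 3

@dataclass
class CostThresholds:
    warn_bytes: int = 10 * GB     # show an advisory banner above this
    block_bytes: int = 100 * GB   # refuse to run above this without admin override

def evaluate(estimated_bytes: int, thresholds: CostThresholds) -> str:
    """Map a byte-scan estimate to an advisory action."""
    if estimated_bytes >= thresholds.block_bytes:
        return "block"
    if estimated_bytes >= thresholds.warn_bytes:
        return "warn"
    return "ok"

assert evaluate(50 * GB, CostThresholds()) == "warn"
```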
Problem:
Right now, when a user types a query in SQL Lab and clicks Run, Superset fires the full query against the database immediately. There is zero warning if that query will scan 500 million rows, join four unindexed tables, or cost $40 on a cloud warehouse like BigQuery or Snowflake. Junior analysts frequently cause accidental full table scans that slow down the entire database for other users. There is no way to know the damage before it happens.
Proposed Solution:
Add a second button next to Run called "Estimate Cost". When clicked, Superset sends the query to the database using its native dry-run or EXPLAIN mechanism instead of actually executing it. The result is displayed as a banner directly below the SQL editor showing three values: estimated rows scanned, estimated data volume in MB or GB, and estimated cost in credits or dollars where the database supports it. The user then decides whether to run the query or refine it first.
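To make the banner contract concrete, here is a minimal sketch of the three-value result described above; the field names are illustrative, not an agreed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CostEstimate:
    # The three values the banner would display; every field is optional
    # because engines differ in what they can report.
    estimated_rows: Optional[int] = None
    estimated_bytes: Optional[int] = None   # data volume scanned
    estimated_cost: Optional[float] = None  # credits or dollars, where supported
    cost_unit: Optional[str] = None         # e.g. "USD" or "credits"
```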
How It Works Per Database:
BigQuery → uses jobs.query with the dryRun: true flag, which returns the bytes that would be processed
Snowflake → uses EXPLAIN, which reports partitions scanned and bytes assigned (a proxy for credit cost; EXPLAIN itself does not return credits)
Trino → uses EXPLAIN (TYPE DISTRIBUTED); EXPLAIN ANALYZE would actually execute the query, so it is not suitable here
Presto → uses EXPLAIN with TYPE DISTRIBUTED
PostgreSQL → uses EXPLAIN (FORMAT JSON) for cost nodes (see the sketch below this list)
Other → falls back to a row-count estimate, e.g. wrapping the query in SELECT COUNT(*), which can itself be expensive
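As a concrete example of one of these mechanisms, here is a minimal sketch of the PostgreSQL path. The EXPLAIN (FORMAT JSON) keys ("Plan", "Plan Rows", "Total Cost") are real Postgres plan output; the estimator function itself is hypothetical:

```python
# Illustrative per-engine estimator; not an existing Superset hook.
import json

def postgres_estimate(cursor, sql: str) -> dict:
    """Extract row and cost estimates from EXPLAIN (FORMAT JSON)."""
    cursor.execute(f"EXPLAIN (FORMAT JSON) {sql}")
    plan = cursor.fetchone()[0]
    if isinstance(plan, str):  # some drivers return the JSON plan as text
        plan = json.loads(plan)
    top_node = plan[0]["Plan"]
    return {
        "estimated_rows": top_node["Plan Rows"],  # planner row estimate
        "planner_cost": top_node["Total Cost"],   # arbitrary planner units
    }
```

Each engine spec would register a function with this shape, and the estimate endpoint would dispatch on the connected database's engine.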
What Changes:
Backend → new /api/v1/sqllab/estimate endpoint that dispatches to each engine's estimator
Frontend → new Estimate Cost button in SQL Lab toolbar
Frontend → cost banner component below the editor
Config → feature flag ENABLE_QUERY_COST_ESTIMATE to toggle per deployment
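Superset feature flags are toggled in superset_config.py, so enabling the proposed flag per deployment could look like this (the flag does not exist yet; only the FEATURE_FLAGS mechanism is current Superset behavior):

```python
# superset_config.py
FEATURE_FLAGS = {
    # Proposed flag from this discussion; not yet implemented.
    "ENABLE_QUERY_COST_ESTIMATE": True,
}
```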
Why It Is Not Implemented Yet:
Each database exposes cost estimation differently. A unified abstraction layer needs to be built in the DB engine spec so each connector can register its own estimate method. This is the main missing piece.
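A minimal sketch of what that abstraction could look like, assuming hypothetical engine-spec classes and hook name (the google-cloud-bigquery dry-run calls are real API; the surrounding structure is illustrative):

```python
from typing import Optional
from google.cloud import bigquery

class BaseEngineSpec:
    @classmethod
    def estimate_query_cost(cls, client, sql: str) -> Optional[dict]:
        # Default: no estimator registered; the frontend can hide the
        # Estimate Cost button when no estimate is available.
        return None

class BigQueryEngineSpec(BaseEngineSpec):
    @classmethod
    def estimate_query_cost(cls, client: bigquery.Client, sql: str) -> Optional[dict]:
        # A dry-run job validates the query and reports the bytes it would
        # scan, without executing anything or incurring cost.
        job = client.query(
            sql,
            job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
        )
        return {"estimated_bytes": job.total_bytes_processed}
```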