Open
Labels: P2 (Important issue, but not time-critical), community-backlog, data (Ray Data-related issues), enhancement (Request for new feature and/or capability), performance
Description
Problem
Ray Data currently performs no compile-time constant folding or basic algebraic simplification on expressions. As a result, constant sub-expressions are re-evaluated for every row at runtime, even when the result is trivially known during planning.
Examples of wasted computation:
- lit(3) + lit(5) → recomputed per row instead of folded to lit(8) once
- col("x") * lit(1) → unnecessary multiplication per row instead of simplified to col("x")
- lit(False) & expensive_udf(col("data")) → executes the expensive UDF unnecessarily, even though the result is always False
This leads to avoidable CPU overhead, especially in large-scale pipelines with many chained map, filter, or with_columns operations.
Proposed Solution
Add a ConstantFoldingRule to the logical optimizer that:
- Folds pure constant expressions (lit(3) + lit(5) → lit(8))
- Applies algebraic identities (x * 1 → x, x + 0 → x, x * 0 → 0 (with null handling))
- Performs boolean short-circuit & constant propagation: False & <expr> → False, True | <expr> → True, ~True → False
- Handles nested expressions and repeated applications until fixpoint
- Eliminates redundant operations (e.g. NOT(NOT(x)) → x)
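The rewrites listed above can be sketched on a toy expression tree. This is a minimal illustration, not Ray Data's implementation: the classes Lit, Col, BinOp and Not, the nullable flag, and the functions simplify_once and fold are all hypothetical names invented for this example, and the real ConstantFoldingRule would operate on Ray Data's own expression nodes. The sketch assumes integer columns, Kleene-style null semantics for booleans, and side-effect-free expressions (so dropping a UDF call does not change behavior).

```python
from __future__ import annotations

import operator
from dataclasses import dataclass
from typing import Any, Union


# Hypothetical toy nodes; they only mirror the shapes used in this issue.
@dataclass(frozen=True)
class Lit:
    value: Any


@dataclass(frozen=True)
class Col:
    name: str
    nullable: bool = True  # assumption: schema tells us whether nulls can occur


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "+", "*", "&", "|"
    left: Expr
    right: Expr


@dataclass(frozen=True)
class Not:
    operand: Expr


Expr = Union[Lit, Col, BinOp, Not]

_OPS = {"+": operator.add, "*": operator.mul,
        "&": operator.and_, "|": operator.or_}


def _is_lit(e: Expr, v: Any) -> bool:
    # Exact type match so that Lit(True) is never mistaken for Lit(1).
    return isinstance(e, Lit) and type(e.value) is type(v) and e.value == v


def simplify_once(e: Expr) -> Expr:
    """One bottom-up pass of folding and algebraic simplification."""
    if isinstance(e, Not):
        inner = simplify_once(e.operand)
        if isinstance(inner, Not):  # NOT(NOT(x)) -> x
            return inner.operand
        if isinstance(inner, Lit) and isinstance(inner.value, bool):
            return Lit(not inner.value)  # ~True -> False
        return Not(inner)
    if isinstance(e, BinOp):
        l, r = simplify_once(e.left), simplify_once(e.right)
        # Fold two non-null literals: lit(3) + lit(5) -> lit(8).
        if (isinstance(l, Lit) and isinstance(r, Lit)
                and l.value is not None and r.value is not None):
            return Lit(_OPS[e.op](l.value, r.value))
        for a, b in ((l, r), (r, l)):  # identities are symmetric here
            if e.op == "*" and _is_lit(a, 1):
                return b
            if e.op == "+" and _is_lit(a, 0):
                return b
            # x * 0 -> 0 only if x cannot be null (null * 0 is null).
            if (e.op == "*" and _is_lit(a, 0)
                    and isinstance(b, Col) and not b.nullable):
                return Lit(0)
            # Under Kleene logic, False & null is False and True | null is True.
            if e.op == "&" and _is_lit(a, False):
                return Lit(False)
            if e.op == "&" and _is_lit(a, True):
                return b
            if e.op == "|" and _is_lit(a, True):
                return Lit(True)
            if e.op == "|" and _is_lit(a, False):
                return b
        return BinOp(e.op, l, r)
    return e  # Lit and Col are already in simplest form


def fold(e: Expr) -> Expr:
    """Re-apply simplify_once until the tree stops changing (fixpoint)."""
    while True:
        nxt = simplify_once(e)
        if nxt == e:
            return e
        e = nxt
```

In this sketch the x * 0 rewrite is deliberately skipped for nullable columns, which is one possible reading of "with null handling"; a real rule would also have to account for floating-point NaN and infinity, and for UDFs that raise or have side effects.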
Benefits
- Performance: Eliminates redundant per-row computation → faster execution, especially for constant-heavy projections/filters
- Plan simplification: Produces cleaner expression trees → enables better downstream optimizations (predicate pushdown, column pruning, fusion, etc.)
- No per-row runtime cost: All folding happens once during logical planning
- Backward compatible: Transparent to existing user code