
Commit f49e4b6

chore: Mark expressions with known correctness issues as incompatible (#3675)
* fix: mark expressions with known correctness issues as incompatible

  Review all open correctness issues and mark affected expressions as Incompatible so they fall back to Spark by default. Update the compatibility guide with detailed documentation of each incompatibility and links to tracking issues.

  Expressions marked Incompatible:

  - ArrayContains (#3346), GetArrayItem (#3330, #3332), ArrayRemove (#3173)
  - Hour, Minute, Second for TimestampNTZ inputs (#3180)
  - TruncTimestamp for non-UTC timezones (#2649)
  - Ceil, Floor for Decimal inputs (#1729)
  - Tan (#1897), Corr (#2646), StructsToJson (#3016)

* fix: reformat expressions.md with prettier
* fix: enable allowIncompatible in tests for newly incompatible expressions
* fix: apply spotless formatting to updated test files
* fix: enable ArrayContains allowIncompatible for map_contains_key test

  map_contains_key is internally rewritten by Spark to use ArrayContains, so it needs the allowIncompatible config to run natively.

* fix: remove unused GetArrayItem import from CometArrayExpressionSuite
* fix: enable Corr allowIncompatible for covariance & correlation test
1 parent 2e07099 commit f49e4b6

19 files changed: +515 additions, -304 deletions

docs/source/user-guide/latest/compatibility.md

Lines changed: 45 additions & 0 deletions
```diff
@@ -58,6 +58,51 @@ Expressions that are not 100% Spark-compatible will fall back to Spark by defaul
 `spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See
 the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting.
 
+### Array Expressions
+
+- **ArrayContains**: Returns null instead of false for empty arrays with literal values.
+  [#3346](https://github.com/apache/datafusion-comet/issues/3346)
+- **ArrayRemove**: Returns null when the element to remove is null, instead of removing null elements from the array.
+  [#3173](https://github.com/apache/datafusion-comet/issues/3173)
+- **GetArrayItem**: Known correctness issues with index handling, including off-by-one errors and incorrect results
+  with dynamic (non-literal) index values.
+  [#3330](https://github.com/apache/datafusion-comet/issues/3330),
+  [#3332](https://github.com/apache/datafusion-comet/issues/3332)
+- **ArraysOverlap**: Inconsistent behavior when arrays contain NULL values.
+  [#3645](https://github.com/apache/datafusion-comet/issues/3645),
+  [#2036](https://github.com/apache/datafusion-comet/issues/2036)
+- **ArrayUnion**: Sorts input arrays before performing the union, while Spark preserves the order of the first array
+  and appends unique elements from the second.
+  [#3644](https://github.com/apache/datafusion-comet/issues/3644)
+
+### Date/Time Expressions
+
+- **Hour, Minute, Second**: Incorrectly apply timezone conversion to TimestampNTZ inputs. TimestampNTZ stores local
+  time without timezone, so no conversion should be applied. These expressions work correctly with Timestamp inputs.
+  [#3180](https://github.com/apache/datafusion-comet/issues/3180)
+- **TruncTimestamp (date_trunc)**: Produces incorrect results when used with non-UTC timezones. Compatible when
+  timezone is UTC.
+  [#2649](https://github.com/apache/datafusion-comet/issues/2649)
+
+### Math Expressions
+
+- **Ceil, Floor**: Incorrect results for Decimal type inputs.
+  [#1729](https://github.com/apache/datafusion-comet/issues/1729)
+- **Tan**: `tan(-0.0)` produces `0.0` instead of `-0.0`.
+  [#1897](https://github.com/apache/datafusion-comet/issues/1897)
+
+### Aggregate Expressions
+
+- **Corr**: Returns null instead of NaN in some edge cases.
+  [#2646](https://github.com/apache/datafusion-comet/issues/2646)
+- **First, Last**: These functions are not deterministic. When `ignoreNulls` is set, results may not match Spark.
+  [#1630](https://github.com/apache/datafusion-comet/issues/1630)
+
+### Struct Expressions
+
+- **StructsToJson (to_json)**: Does not support `+Infinity` and `-Infinity` for numeric types (float, double).
+  [#3016](https://github.com/apache/datafusion-comet/issues/3016)
+
 ## Regular Expressions
 
 Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's
```
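Users who accept the risk can still run one of these expressions natively via the per-expression override described above. For example, in a Spark SQL session (illustrative, using the ArrayContains key that the commit's test files also set):

```sql
-- Opt back in to native execution of an expression Comet marks Incompatible.
-- The config key uses the Spark expression class name.
SET spark.comet.expression.ArrayContains.allowIncompatible=true;
```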

docs/source/user-guide/latest/expressions.md

Lines changed: 98 additions & 98 deletions
Large diffs are not rendered by default.

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -116,7 +116,7 @@ object QueryPlanSerde extends Logging with CometExprShim {
     classOf[Sinh] -> CometScalarFunction("sinh"),
     classOf[Sqrt] -> CometScalarFunction("sqrt"),
     classOf[Subtract] -> CometSubtract,
-    classOf[Tan] -> CometScalarFunction("tan"),
+    classOf[Tan] -> CometTan,
     classOf[Tanh] -> CometScalarFunction("tanh"),
     classOf[Cot] -> CometScalarFunction("cot"),
     classOf[UnaryMinus] -> CometUnaryMinus,
```

spark/src/main/scala/org/apache/comet/serde/aggregates.scala

Lines changed: 7 additions & 0 deletions
```diff
@@ -584,6 +584,13 @@ object CometStddevPop extends CometAggregateExpressionSerde[StddevPop] with Come
 }
 
 object CometCorr extends CometAggregateExpressionSerde[Corr] {
+
+  override def getSupportLevel(expr: Corr): SupportLevel =
+    Incompatible(
+      Some(
+        "Returns null instead of NaN in some edge cases" +
+          " (https://github.com/apache/datafusion-comet/issues/2646)"))
+
   override def convert(
       aggExpr: AggregateExpression,
       corr: Corr,
```

spark/src/main/scala/org/apache/comet/serde/arrays.scala

Lines changed: 21 additions & 0 deletions
```diff
@@ -35,6 +35,12 @@ object CometArrayRemove
     with CometExprShim
     with ArraysBase {
 
+  override def getSupportLevel(expr: ArrayRemove): SupportLevel =
+    Incompatible(
+      Some(
+        "Returns null when element is null instead of removing null elements" +
+          " (https://github.com/apache/datafusion-comet/issues/3173)"))
+
   override def convert(
       expr: ArrayRemove,
       inputs: Seq[Attribute],
@@ -131,6 +137,13 @@ object CometArrayAppend extends CometExpressionSerde[ArrayAppend] {
 }
 
 object CometArrayContains extends CometExpressionSerde[ArrayContains] {
+
+  override def getSupportLevel(expr: ArrayContains): SupportLevel =
+    Incompatible(
+      Some(
+        "Returns null instead of false for empty arrays with literal values" +
+          " (https://github.com/apache/datafusion-comet/issues/3346)"))
+
   override def convert(
       expr: ArrayContains,
       inputs: Seq[Attribute],
@@ -472,6 +485,14 @@ object CometCreateArray extends CometExpressionSerde[CreateArray] {
 }
 
 object CometGetArrayItem extends CometExpressionSerde[GetArrayItem] {
+
+  override def getSupportLevel(expr: GetArrayItem): SupportLevel =
+    Incompatible(
+      Some(
+        "Known correctness issues with index handling" +
+          " (https://github.com/apache/datafusion-comet/issues/3330," +
+          " https://github.com/apache/datafusion-comet/issues/3332)"))
+
   override def convert(
       expr: GetArrayItem,
       inputs: Seq[Attribute],
```
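The ArrayContains issue flagged above concerns Spark's three-valued null semantics: an empty array must yield false, not null. A simplified Python model of the expected Spark behavior (an illustration of the semantics, not Comet code):

```python
def array_contains(arr, value):
    """Simplified model of Spark's array_contains null semantics."""
    if arr is None or value is None:
        return None                       # null input -> null
    if value in arr:
        return True
    # Not found: result is null if the array holds nulls (match is unknown),
    # otherwise false.
    return None if None in arr else False

# The case from issue #3346: empty array with a literal value must be False.
assert array_contains([], 1) is False
assert array_contains([1, None], 2) is None
assert array_contains([1, 2], 2) is True
```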

spark/src/main/scala/org/apache/comet/serde/datetime.scala

Lines changed: 46 additions & 1 deletion
```diff
@@ -177,6 +177,18 @@ object CometQuarter extends CometExpressionSerde[Quarter] with CometExprGetDateF
 }
 
 object CometHour extends CometExpressionSerde[Hour] {
+
+  override def getSupportLevel(expr: Hour): SupportLevel = {
+    if (expr.child.dataType.typeName == "timestamp_ntz") {
+      Incompatible(
+        Some(
+          "Incorrectly applies timezone conversion to TimestampNTZ inputs" +
+            " (https://github.com/apache/datafusion-comet/issues/3180)"))
+    } else {
+      Compatible()
+    }
+  }
+
   override def convert(
       expr: Hour,
       inputs: Seq[Attribute],
@@ -203,6 +215,18 @@ object CometHour extends CometExpressionSerde[Hour] {
 }
 
 object CometMinute extends CometExpressionSerde[Minute] {
+
+  override def getSupportLevel(expr: Minute): SupportLevel = {
+    if (expr.child.dataType.typeName == "timestamp_ntz") {
+      Incompatible(
+        Some(
+          "Incorrectly applies timezone conversion to TimestampNTZ inputs" +
+            " (https://github.com/apache/datafusion-comet/issues/3180)"))
+    } else {
+      Compatible()
+    }
+  }
+
   override def convert(
       expr: Minute,
       inputs: Seq[Attribute],
@@ -229,6 +253,18 @@ object CometMinute extends CometExpressionSerde[Minute] {
 }
 
 object CometSecond extends CometExpressionSerde[Second] {
+
+  override def getSupportLevel(expr: Second): SupportLevel = {
+    if (expr.child.dataType.typeName == "timestamp_ntz") {
+      Incompatible(
+        Some(
+          "Incorrectly applies timezone conversion to TimestampNTZ inputs" +
+            " (https://github.com/apache/datafusion-comet/issues/3180)"))
+    } else {
+      Compatible()
+    }
+  }
+
   override def convert(
       expr: Second,
       inputs: Seq[Attribute],
@@ -402,10 +438,19 @@ object CometTruncTimestamp extends CometExpressionSerde[TruncTimestamp] {
       "microsecond")
 
   override def getSupportLevel(expr: TruncTimestamp): SupportLevel = {
+    val timezone = expr.timeZoneId.getOrElse("UTC")
+    val isUtc = timezone == "UTC" || timezone == "Etc/UTC"
     expr.format match {
       case Literal(fmt: UTF8String, _) =>
         if (supportedFormats.contains(fmt.toString.toLowerCase(Locale.ROOT))) {
-          Compatible()
+          if (isUtc) {
+            Compatible()
+          } else {
+            Incompatible(
+              Some(
+                s"Incorrect results in non-UTC timezone '$timezone'" +
+                  " (https://github.com/apache/datafusion-comet/issues/2649)"))
+          }
         } else {
           Unsupported(Some(s"Format $fmt is not supported"))
         }
```
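The TimestampNTZ checks above guard against exactly the bug in #3180: a timestamp without timezone is a wall-clock value, so extracting hour/minute/second must not shift it. A small Python illustration of the intended semantics (not Comet code):

```python
from datetime import datetime, timezone, timedelta

# A "no time zone" timestamp is a wall-clock value: hour extraction reads it as-is.
ntz = datetime(2024, 1, 1, 23, 30)           # naive: no tzinfo attached
assert ntz.hour == 23                        # correct: no conversion applied

# Applying a timezone conversion first (the bug) changes the answer:
# interpret the value as UTC, then render it in UTC-8.
wrongly_converted = ntz.replace(tzinfo=timezone.utc).astimezone(
    timezone(timedelta(hours=-8)))
assert wrongly_converted.hour == 15          # 23:30 UTC -> 15:30 in UTC-8
```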

spark/src/main/scala/org/apache/comet/serde/math.scala

Lines changed: 43 additions & 1 deletion
```diff
@@ -19,7 +19,7 @@
 
 package org.apache.comet.serde
 
-import org.apache.spark.sql.catalyst.expressions.{Abs, Atan2, Attribute, Ceil, CheckOverflow, Expression, Floor, Hex, If, LessThanOrEqual, Literal, Log, Log10, Log2, Unhex}
+import org.apache.spark.sql.catalyst.expressions.{Abs, Atan2, Attribute, Ceil, CheckOverflow, Expression, Floor, Hex, If, LessThanOrEqual, Literal, Log, Log10, Log2, Tan, Unhex}
 import org.apache.spark.sql.types.{DecimalType, NumericType}
 
 import org.apache.comet.CometSparkSessionExtensions.withInfo
@@ -38,6 +38,18 @@ object CometAtan2 extends CometExpressionSerde[Atan2] {
 }
 
 object CometCeil extends CometExpressionSerde[Ceil] {
+
+  override def getSupportLevel(expr: Ceil): SupportLevel = {
+    expr.child.dataType match {
+      case _: DecimalType =>
+        Incompatible(
+          Some(
+            "Incorrect results for Decimal type inputs" +
+              " (https://github.com/apache/datafusion-comet/issues/1729)"))
+      case _ => Compatible()
+    }
+  }
+
   override def convert(
       expr: Ceil,
       inputs: Seq[Attribute],
@@ -58,6 +70,18 @@ object CometCeil extends CometExpressionSerde[Ceil] {
 }
 
 object CometFloor extends CometExpressionSerde[Floor] {
+
+  override def getSupportLevel(expr: Floor): SupportLevel = {
+    expr.child.dataType match {
+      case _: DecimalType =>
+        Incompatible(
+          Some(
+            "Incorrect results for Decimal type inputs" +
+              " (https://github.com/apache/datafusion-comet/issues/1729)"))
+      case _ => Compatible()
+    }
+  }
+
   override def convert(
       expr: Floor,
       inputs: Seq[Attribute],
@@ -174,6 +198,24 @@ object CometAbs extends CometExpressionSerde[Abs] with MathExprBase {
   }
 }
 
+object CometTan extends CometExpressionSerde[Tan] {
+
+  override def getSupportLevel(expr: Tan): SupportLevel =
+    Incompatible(
+      Some(
+        "tan(-0.0) produces incorrect result" +
+          " (https://github.com/apache/datafusion-comet/issues/1897)"))
+
+  override def convert(
+      expr: Tan,
+      inputs: Seq[Attribute],
+      binding: Boolean): Option[ExprOuterClass.Expr] = {
+    val childExpr = expr.children.map(exprToProtoInternal(_, inputs, binding))
+    val optExpr = scalarFunctionExprToProto("tan", childExpr: _*)
+    optExprWithInfo(optExpr, expr, expr.children: _*)
+  }
+}
+
 sealed trait MathExprBase {
   protected def nullIfNegative(expression: Expression): Expression = {
     val zero = Literal.default(expression.dataType)
```
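The CometTan incompatibility above comes down to the sign of zero: IEEE 754 tan preserves the sign of a signed zero, so Spark returns `-0.0` for `tan(-0.0)` while the issue reports Comet producing `0.0`. A quick Python check of the expected semantics:

```python
import math

# tan(-0.0) should be -0.0: the sign of zero is preserved.
result = math.tan(-0.0)
assert result == 0.0                        # -0.0 compares equal to 0.0 ...
assert math.copysign(1.0, result) == -1.0   # ... but its sign bit is negative
```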

spark/src/main/scala/org/apache/comet/serde/structs.scala

Lines changed: 6 additions & 0 deletions
```diff
@@ -105,6 +105,12 @@ object CometGetArrayStructFields extends CometExpressionSerde[GetArrayStructFiel
 
 object CometStructsToJson extends CometExpressionSerde[StructsToJson] {
 
+  override def getSupportLevel(expr: StructsToJson): SupportLevel =
+    Incompatible(
+      Some(
+        "Does not support Infinity/-Infinity for numeric types" +
+          " (https://github.com/apache/datafusion-comet/issues/3016)"))
+
   override def convert(
       expr: StructsToJson,
       inputs: Seq[Attribute],
```

spark/src/test/resources/sql-tests/expressions/aggregate/corr.sql

Lines changed: 1 addition & 0 deletions
```diff
@@ -15,6 +15,7 @@
 -- specific language governing permissions and limitations
 -- under the License.
 
+-- Config: spark.comet.expression.Corr.allowIncompatible=true
 -- ConfigMatrix: parquet.enable.dictionary=false,true
 
 statement
```

spark/src/test/resources/sql-tests/expressions/array/array_remove.sql

Lines changed: 1 addition & 0 deletions
```diff
@@ -15,6 +15,7 @@
 -- specific language governing permissions and limitations
 -- under the License.
 
+-- Config: spark.comet.expression.ArrayRemove.allowIncompatible=true
 -- ConfigMatrix: parquet.enable.dictionary=false,true
 
 statement
```
