
Conversation

@vigneshsiva11

Which issue does this PR close?

Closes #3255

Rationale for this change

When reading Iceberg tables, timestamp columns may be dictionary-encoded in the underlying Parquet files.
In the current implementation of datediff, a scalar argument and an array argument can be converted into arrays of different lengths, which leads to a runtime error:

Cannot perform binary operation on arrays of different length

This behavior differs from Spark, which correctly broadcasts scalar inputs and handles dictionary-backed columns without error.
This change ensures Comet’s datediff implementation aligns with Spark semantics and avoids execution failures.

What changes are included in this PR?

  • Ensure both datediff arguments are converted into arrays of the same length by correctly broadcasting scalars
  • Normalize dictionary-backed inputs to Date32Array before applying the binary operation
  • Prevent array length mismatches during vectorized execution
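The broadcasting fix described above can be sketched with a simplified, std-only model. This is an illustration, not the actual Comet code: arrow's `ColumnarValue` and `Date32Array` are modeled here as a hypothetical `Column` enum over plain `Vec<Option<i32>>` (days since the Unix epoch), and the helper names are invented for the sketch.

```rust
// Simplified stand-in for arrow's ColumnarValue: a column is either a
// scalar or an array of Date32-style values (days since the Unix epoch).
#[derive(Clone)]
enum Column {
    Scalar(Option<i32>),
    Array(Vec<Option<i32>>),
}

impl Column {
    // Broadcast a scalar to `num_rows` elements; arrays must already
    // match the batch length, otherwise report the length mismatch.
    fn to_array_of_size(&self, num_rows: usize) -> Result<Vec<Option<i32>>, String> {
        match self {
            Column::Scalar(v) => Ok(vec![*v; num_rows]),
            Column::Array(a) if a.len() == num_rows => Ok(a.clone()),
            Column::Array(a) => Err(format!(
                "cannot perform binary operation on arrays of different length: {} vs {}",
                a.len(),
                num_rows
            )),
        }
    }
}

// datediff(end, start): elementwise `end - start` in days, after both
// inputs have been broadcast to the same batch length.
fn datediff(end: &Column, start: &Column, num_rows: usize) -> Result<Vec<Option<i32>>, String> {
    let end = end.to_array_of_size(num_rows)?;
    let start = start.to_array_of_size(num_rows)?;
    Ok(end
        .iter()
        .zip(start.iter())
        .map(|(e, s)| match (e, s) {
            (Some(e), Some(s)) => Some(e - s),
            _ => None,
        })
        .collect())
}
```

In the real implementation the same invariant is enforced with arrow kernels (casting dictionary-backed inputs to `Date32` and broadcasting scalars to the batch length) rather than with plain vectors.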

How are these changes tested?

  • Existing unit tests in datafusion-comet-spark-expr were run locally
  • Manual verification using Iceberg timestamp columns with to_date and datediff
  • Verified that operations succeed when Comet is enabled and match Spark behavior

Copilot AI review requested due to automatic review settings January 26, 2026 05:01

Copilot AI left a comment


Pull request overview

Fixes Comet’s datediff execution on dictionary-encoded / Iceberg-backed timestamp inputs by ensuring scalar arguments are broadcast to the correct batch length and normalizing inputs before the binary op.

Changes:

  • Broadcast scalar datediff arguments to the columnar batch length to avoid array-length mismatches.
  • Cast inputs to Date32 prior to subtraction to handle dictionary-backed arrays.


// under the License.

use arrow::array::{Array, Date32Array, Int32Array};
use arrow::compute::cast;

Copilot AI Jan 26, 2026


use arrow::compute::cast; is unused (the code calls arrow::compute::cast(...) with a fully-qualified path). This will fail CI because clippy is run with -D warnings. Either remove this import or use cast(&end_arr, &DataType::Date32) / cast(&start_arr, &DataType::Date32) so the import is actually used.

Suggested change
use arrow::compute::cast;

Comment on lines 111 to 114

Ok(ColumnarValue::Array(Arc::new(result)))
}

fn aliases(&self) -> &[String] {
&self.aliases
}
}

Copilot AI Jan 26, 2026


The aliases field still includes "datediff", but the fn aliases(&self) -> &[String] implementation was removed. If ScalarUDFImpl's default aliases() returns an empty slice (as used by other UDF impls in this crate), this will drop the datediff alias and can break Spark SQL/function resolution that relies on the alias rather than the primary name date_diff. Re-introduce aliases() (returning &self.aliases) or remove aliases entirely and ensure the UDF is registered under the intended name(s).
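The hazard the reviewer describes can be shown with a minimal, std-only sketch. This is not DataFusion's actual `ScalarUDFImpl` trait; the `Udf` trait below is a hypothetical stand-in whose `aliases()` has a default implementation returning an empty slice, so dropping the override silently loses the "datediff" alias.

```rust
// Simplified stand-in for a UDF trait whose aliases() defaults to empty.
trait Udf {
    fn name(&self) -> &str;
    fn aliases(&self) -> &[String] {
        &[] // default: no aliases, so "datediff" would not resolve
    }
}

struct DateDiff {
    aliases: Vec<String>,
}

impl Udf for DateDiff {
    fn name(&self) -> &str {
        "date_diff"
    }
    // Overriding the default keeps the "datediff" alias resolvable.
    fn aliases(&self) -> &[String] {
        &self.aliases
    }
}
```

Without the override, function resolution that looks up "datediff" via the alias list would find nothing, even though the struct still stores the string.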

@codecov-commenter

codecov-commenter commented Jan 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.13%. Comparing base (f09f8af) to head (c4a39b0).
⚠️ Report is 896 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3278      +/-   ##
============================================
+ Coverage     56.12%   60.13%   +4.00%     
- Complexity      976     1456     +480     
============================================
  Files           119      175      +56     
  Lines         11743    16085    +4342     
  Branches       2251     2665     +414     
============================================
+ Hits           6591     9673    +3082     
- Misses         4012     5066    +1054     
- Partials       1140     1346     +206     

☔ View full report in Codecov by Sentry.


@vigneshsiva11 vigneshsiva11 changed the title Fix datediff array length mismatch for dictionary-backed timestamps fix(spark-expr): handle array length mismatch in datediff for dictionary-backed timestamps Jan 26, 2026


Development

Successfully merging this pull request may close these issues.

CometNativeException: "arrays of different length" when using to_date on Iceberg Timestamp column
