Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
61 changes: 60 additions & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,8 @@ directly -- always use the visitor pattern (`visit_rows` with typed `GetData` ac
whether a new test duplicates the flow of an existing nearby test and should be
merged into it as a new `#[case]`. A common pattern is toggling a feature (e.g.
column mapping on/off) and asserting success vs. error.
- Reuse helpers from `test_utils` instead of writing custom ones when possible.
- Reuse helpers from `test_utils` and the integration-test fixtures instead of writing
custom ones when possible. See **Common test helpers** below for a curated starter list.
- **Committing in tests:** Use `txn.commit(engine)?.unwrap_committed()` to assert a
successful commit and get the `CommittedTransaction`. Do NOT use `match` + `panic!`
for this -- `unwrap_committed()` provides a clear error message on failure. Available
Expand All @@ -140,6 +141,64 @@ directly -- always use the visitor pattern (`visit_rows` with typed `GetData` ac
both `add_commit` (writing log files) and `snapshot`/`Snapshot::try_new` (reading the
table). Always include a trailing slash in directory URLs to ensure correct path joining.

### Common test helpers
Comment thread
sanujbasu marked this conversation as resolved.

Before writing a custom helper, check this curated list and the locations below.
This list is non-exhaustive -- when in doubt, browse the source files directly
(`test-utils/src/lib.rs`, `kernel/tests/integration/common/`,
`kernel/tests/integration/<topic>/mod.rs`).

**Arrow construction (from `delta_kernel::arrow`)**

- `arrow::array::new_null_array(&arrow_type, n)` -- Arrow array of `n` nulls of any Arrow
type. Prefer this over per-type `Int32Array::from(vec![None as Option<i32>])` builders.
- `engine::arrow_conversion::TryIntoArrow`:
`(&kernel_data_type).try_into_arrow()` for `DataType`,
`(&kernel_struct_type).try_into_arrow()` for `StructType` -> Arrow `Schema`.

**Engine + table setup (from `test_utils`)**

- `test_table_setup()` / `test_table_setup_mt()` -- engine + temp table path. Use the `_mt`
variant under `#[tokio::test(flavor = "multi_thread")]`.
- `engine_store_setup(name, opts)` -- returns `(store, engine, table_location)` when a test
needs direct object-store access.
- `setup_test_tables(...)` -- multiple pre-built tables for read/scan tests.

**Table creation in tests**

- Prefer the kernel `create_table` builder
(`delta_kernel::transaction::create_table::create_table`). It exercises the same path
connectors use and auto-derives the protocol from the schema and feature flags.
- `test_utils::create_table` (a JSON helper that hand-rolls protocol + metadata) is older
but still needed when the kernel builder cannot enable a particular feature combination.

**Schema fixtures**

- `test_utils`: `nested_schema`, `schema_with_type`, `nested_schema_with_type`,
`multi_schema_with_type`, `top_level_ntz_schema` / `nested_ntz_schema` /
`multiple_ntz_schema`, `top_level_variant_schema` / `nested_variant_schema` /
`multiple_variant_schema`.
- `kernel/tests/integration/create_table/mod.rs`: `simple_schema`, `partition_test_schema`.

**Commit + read helpers (from `test_utils`)**

- `add_commit`, `add_staged_commit` -- write a JSON commit at a given version.
- `read_actions_from_commit` -- read raw JSON actions from a specific commit. Use this
instead of hand-rolled `serde_json` parsing.
- `test_read` -- full-scan read of a table; use for round-trip assertions.
- `into_record_batch` -- convert `Box<dyn EngineData>` to Arrow `RecordBatch`.

**Assertion helpers (from `test_utils`)**

- `assert_schema_has_field(schema, &["a", "b"])` -- assert a (possibly nested) field path.
- `assert_result_error_with_message(result, "needle")` -- assert an error contains a
substring.

**If a name here doesn't match what's in code:** the list may have drifted from a rename.
Run `rg '^pub (fn|async fn)' test-utils/src/lib.rs` to discover the current public surface,
and update this section in your PR. The same pattern works for
`kernel/tests/integration/common/write_utils.rs`.

## Protocol TLDR

The [Delta protocol spec](https://raw.githubusercontent.com/delta-io/delta/master/PROTOCOL.md)
Expand Down
75 changes: 63 additions & 12 deletions kernel/src/partition/serialization.rs
Original file line number Diff line number Diff line change
Expand Up @@ -44,21 +44,44 @@ const UNIX_EPOCH_CE_DAYS: i32 = 719_163;
// Binary Raw UTF-8: String::from_utf8 strict, error on invalid
// Struct/Array/Map Error not valid partition types

/// Whether a partition-value [`Scalar`] is equivalent to null under the Delta protocol's
/// partition value serialization rules.
///
/// The protocol collapses three Scalar shapes onto JSON null in `AddFile.partitionValues`:
/// `Scalar::Null(_)`, empty `Scalar::String`, and empty `Scalar::Binary`. Partition
/// values are serialized according to protocol rules on write and interpreted against
/// the table schema on read, so all three forms must be treated as null before enforcing
/// NOT NULL constraints.
///
/// Read paths do not need this predicate: the on-wire collapse to JSON null is undone at
/// JSON deserialization (null map entries are dropped before kernel sees a Scalar), so a
/// "null" partition value never appears in Scalar form on reads.
pub(crate) fn would_serialize_to_null(value: &Scalar) -> bool {
match value {
Scalar::Null(_) => true,
Scalar::String(s) if s.is_empty() => true,
Scalar::Binary(b) if b.is_empty() => true,
_ => false,
}
}

/// Serializes a [`Scalar`] partition value to a protocol-compliant string for the
/// `partitionValues` map in Add actions.
///
/// Returns `Ok(None)` for null values (regardless of the null's data type), empty strings,
/// and empty binary (the Delta protocol treats all three as null partition values). Returns
/// `Err` for non-null values of types that cannot be partition columns (Struct, Array, Map)
/// or for binary values that are not valid UTF-8.
/// Returns `Ok(None)` for any value the Delta protocol treats as null in `partitionValues`:
/// `Scalar::Null`, empty `Scalar::String`, and empty `Scalar::Binary`. Non-null partition
/// values are serialized according to protocol rules; readers interpret the stored value
/// against the table schema. Returns `Err` for non-null values of types that cannot be
/// partition columns (Struct, Array, Map) or for binary values that are not valid UTF-8.
///
/// The inverse of [`PrimitiveType::parse_scalar`].
///
/// [`PrimitiveType::parse_scalar`]: crate::schema::PrimitiveType::parse_scalar
pub fn serialize_partition_value(value: &Scalar) -> DeltaResult<Option<String>> {
match value {
Scalar::Null(_) => Ok(None),
Scalar::String(s) => Ok(if s.is_empty() { None } else { Some(s.clone()) }),
Scalar::String(s) if s.is_empty() => Ok(None),
Scalar::String(s) => Ok(Some(s.clone())),
Scalar::Boolean(v) => Ok(Some(v.to_string())),
Scalar::Byte(v) => Ok(Some(v.to_string())),
Scalar::Short(v) => Ok(Some(v.to_string())),
Expand All @@ -70,13 +93,8 @@ pub fn serialize_partition_value(value: &Scalar) -> DeltaResult<Option<String>>
Scalar::Timestamp(us) => Ok(Some(format_timestamp(*us)?)),
Scalar::TimestampNtz(us) => Ok(Some(format_timestamp_ntz(*us)?)),
Scalar::Decimal(d) => Ok(Some(format_decimal(d))),
Scalar::Binary(b) => {
if b.is_empty() {
Ok(None)
} else {
Ok(Some(format_binary(b)?))
}
}
Scalar::Binary(b) if b.is_empty() => Ok(None),
Scalar::Binary(b) => Ok(Some(format_binary(b)?)),
Scalar::Struct(_) | Scalar::Array(_) | Scalar::Map(_) => Err(Error::generic(format!(
"cannot serialize partition value: type {:?} is not a valid partition column type",
value.data_type()
Expand Down Expand Up @@ -254,6 +272,39 @@ mod tests {
assert_eq!(serialize_partition_value(&input).unwrap(), None);
}

/// Tests the canonical null-equivalence rule: `would_serialize_to_null` returns true for
/// exactly the three Scalar shapes that collapse to JSON null (`Scalar::Null`, empty
/// `Scalar::String`, empty `Scalar::Binary`) and false for everything else, including
/// non-empty strings/binary and primitive scalars.
#[rstest]
// Null-equivalent: all three shapes return true.
#[case::null_int(Scalar::Null(DataType::INTEGER), true)]
#[case::null_string(Scalar::Null(DataType::STRING), true)]
#[case::null_binary(Scalar::Null(DataType::BINARY), true)]
#[case::empty_string(Scalar::String(String::new()), true)]
#[case::empty_binary(Scalar::Binary(Vec::new()), true)]
// Non-empty strings/binary and primitive scalars return false.
#[case::nonempty_string(Scalar::String("a".into()), false)]
#[case::single_space(Scalar::String(" ".into()), false)]
#[case::nonempty_binary(Scalar::Binary(vec![0x00]), false)]
#[case::integer(Scalar::Integer(0), false)]
#[case::boolean_false(Scalar::Boolean(false), false)]
fn test_would_serialize_to_null(#[case] value: Scalar, #[case] expected: bool) {
assert_eq!(would_serialize_to_null(&value), expected);
}

/// Cross-check: any value where `would_serialize_to_null` is true must also serialize
/// to `Ok(None)`, and vice versa for the cases above. Locks the predicate to the actual
/// serialization outcome so the helper cannot drift from `serialize_partition_value`.
#[rstest]
#[case(Scalar::Null(DataType::INTEGER))]
#[case(Scalar::String(String::new()))]
#[case(Scalar::Binary(Vec::new()))]
fn test_would_serialize_to_null_matches_serialize_result(#[case] value: Scalar) {
assert!(would_serialize_to_null(&value));
assert_eq!(serialize_partition_value(&value).unwrap(), None);
}

// ============================================================================
// Spark reference: non-null serialization
// ============================================================================
Expand Down
92 changes: 76 additions & 16 deletions kernel/src/partition/validation.rs
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,12 @@
//! - [`validate_keys`]: checks key completeness (case-insensitive matching, normalizes to schema
//! case, detects post-normalization duplicates).
//! - [`validate_types`]: checks that each `Scalar`'s type matches the partition column's schema
//! type. Null scalars skip the type check.

//! type, and that non-null partition columns are never assigned a value that would serialize to a
//! null partition value.
use std::collections::HashMap;

use crate::expressions::Scalar;
use crate::partition::serialization::would_serialize_to_null;
use crate::schema::{DataType, StructType};
use crate::{DeltaResult, Error};

Expand Down Expand Up @@ -105,8 +106,15 @@ fn validate_keys(
}

/// Validates that each [`Scalar`] value's type is compatible with the corresponding
/// partition column's type in the table schema. Null scalars skip the type check
/// (null is valid for any partition column type).
/// partition column's type in the table schema. Values that would serialize to a null
/// partition value skip the value type check, but a
/// null partition value is only legal when the schema field is nullable.
///
/// Nullability is enforced before any non-null type check, so the rejection treats null
/// scalars and their serialization-equivalent forms (empty strings, empty binary)
/// uniformly -- otherwise an empty `Scalar::String` for a `nullable: false` column would
/// pass validation, then collapse to JSON null at serialization, landing a null value in
/// a NOT NULL column.
///
/// # Parameters
/// - `logical_schema`: the table's logical schema.
Expand All @@ -117,6 +125,8 @@ fn validate_keys(
/// # Errors
/// - A partition column is not found in the table schema
/// - A partition column's schema type is not a primitive (struct, array, map are rejected)
/// - A null-equivalent value (null scalar, empty string, or empty binary) is provided for a
/// non-nullable partition column
/// - A non-null value's type does not match the partition column's schema type
fn validate_types(
logical_schema: &StructType,
Expand All @@ -138,7 +148,13 @@ fn validate_types(
Partition columns must be primitive types."
)));
}
if value.is_null() {
if would_serialize_to_null(value) {
if !field.nullable {
return Err(Error::invalid_partition_values(format!(
"partition column '{col_name}' is not nullable but received a value that \
serializes to null (null scalar, empty string, or empty binary)"
)));
}
continue;
}
let actual_type = value.data_type();
Expand Down Expand Up @@ -172,6 +188,14 @@ mod tests {
validate_types(&schema, &values).unwrap_err().to_string()
}

/// Like [`assert_type_ok`] but uses a `nullable` schema field. Used by null-value cases,
/// since null scalars are only valid for nullable partition columns.
fn assert_type_ok_nullable(data_type: DataType, value: Scalar) {
let schema = StructType::new_unchecked(vec![StructField::nullable("p", data_type)]);
let values = HashMap::from([("p".to_string(), value)]);
validate_types(&schema, &values).unwrap();
}

// ============================================================================
// validate_keys
// ============================================================================
Expand Down Expand Up @@ -310,7 +334,9 @@ mod tests {
/// Every non-null row from the "String and binary encoding" table in
/// `partition/mod.rs`. Row numbers reference that table.
#[rstest]
// STRING rows
// STRING rows. Rows 49 (empty binary) and 65 (empty string) are null-equivalent under
// the protocol and are covered by the null-into-nullable / null-into-non-nullable tests
// further down -- they are intentionally absent from this "type-match passes" group.
#[case(DataType::STRING, Scalar::String("\x00".into()))] // row 47
#[case(DataType::STRING, Scalar::String("before\x00after".into()))] // row 48
#[case(DataType::STRING, Scalar::String("a{b".into()))] // row 54
Expand All @@ -324,11 +350,9 @@ mod tests {
#[case(DataType::STRING, Scalar::String("a&b+c$d;e,f".into()))] // row 62
#[case(DataType::STRING, Scalar::String("Serbia/srb%".into()))] // row 63
#[case(DataType::STRING, Scalar::String("100%25".into()))] // row 64
#[case(DataType::STRING, Scalar::String("".into()))] // row 65
#[case(DataType::STRING, Scalar::String(" ".into()))] // row 67
#[case(DataType::STRING, Scalar::String(" ".into()))] // row 68
// BINARY rows
#[case(DataType::BINARY, Scalar::Binary(vec![]))] // row 49
#[case(DataType::BINARY, Scalar::Binary(vec![0xDE, 0xAD, 0xBE, 0xEF]))] // row 50
#[case(DataType::BINARY, Scalar::Binary(vec![0x48, 0x45, 0x4C, 0x4C, 0x4F]))] // row 51
#[case(DataType::BINARY, Scalar::Binary(vec![0x00, 0xFF]))] // row 52
Expand All @@ -340,8 +364,10 @@ mod tests {
assert_type_ok(data_type, value);
}

/// NULL rows from the encoding table. Null is valid for every partition column type,
/// regardless of the Scalar::Null inner type.
/// Null-equivalent rows from the encoding table. Under the Delta protocol, three Scalar
/// shapes serialize to JSON null in `partitionValues`: `Scalar::Null`, empty `Scalar::String`,
/// and empty `Scalar::Binary`. All three are valid for every partition column type when the
/// schema field is nullable, regardless of the `Scalar::Null` inner type.
#[rstest]
#[case(DataType::INTEGER, Scalar::Null(DataType::INTEGER))] // row 5
#[case(DataType::LONG, Scalar::Null(DataType::LONG))] // row 8
Expand All @@ -356,16 +382,50 @@ mod tests {
#[case(DataType::TIMESTAMP_NTZ, Scalar::Null(DataType::TIMESTAMP_NTZ))] // row 46
#[case(DataType::STRING, Scalar::Null(DataType::STRING))] // row 66
#[case(DataType::BINARY, Scalar::Null(DataType::BINARY))] // (binary NULL)
fn test_validate_types_null_returns_ok(#[case] data_type: DataType, #[case] value: Scalar) {
assert_type_ok(data_type, value);
#[case(DataType::STRING, Scalar::String(String::new()))] // row 65: empty string
#[case(DataType::BINARY, Scalar::Binary(Vec::new()))] // row 49: empty binary
fn test_validate_types_null_into_nullable_returns_ok(
#[case] data_type: DataType,
#[case] value: Scalar,
) {
assert_type_ok_nullable(data_type, value);
}

/// Null skips the type check even when the Scalar::Null inner type does not match
/// the schema column type. This is intentional: null is valid for any partition
/// column regardless of what type the Null carries.
/// Null skips the value type check even when the `Scalar::Null` inner type does not match
/// the schema column type. This is intentional: null carries no concrete type to compare
/// against.
#[test]
fn test_validate_types_null_with_mismatched_inner_type_returns_ok() {
assert_type_ok(DataType::INTEGER, Scalar::Null(DataType::STRING));
assert_type_ok_nullable(DataType::INTEGER, Scalar::Null(DataType::STRING));
}

/// A null-equivalent partition value (null scalar, empty string, or empty binary) for
/// a non-nullable partition column is rejected
#[rstest]
#[case(DataType::INTEGER, Scalar::Null(DataType::INTEGER))]
#[case(DataType::LONG, Scalar::Null(DataType::LONG))]
#[case(DataType::STRING, Scalar::Null(DataType::STRING))]
#[case(DataType::BINARY, Scalar::Null(DataType::BINARY))]
#[case(DataType::BOOLEAN, Scalar::Null(DataType::BOOLEAN))]
#[case(DataType::DATE, Scalar::Null(DataType::DATE))]
#[case(DataType::TIMESTAMP, Scalar::Null(DataType::TIMESTAMP))]
#[case(DataType::TIMESTAMP_NTZ, Scalar::Null(DataType::TIMESTAMP_NTZ))]
#[case(DataType::decimal(10, 2).unwrap(), Scalar::Null(DataType::decimal(10, 2).unwrap()))]
// Null with mismatched inner type is also rejected the schema's nullability is the
// only thing checked for null scalars.
#[case(DataType::INTEGER, Scalar::Null(DataType::STRING))]
// Empty string and empty binary serialize to JSON null per the Delta protocol; they
// must be rejected for NOT NULL partition columns alongside `Scalar::Null` so a
// serialization-equivalent value cannot bypass the nullability check.
#[case(DataType::STRING, Scalar::String(String::new()))]
#[case(DataType::BINARY, Scalar::Binary(Vec::new()))]
fn test_validate_types_null_into_non_nullable_returns_error(
#[case] data_type: DataType,
#[case] value: Scalar,
) {
let err = assert_type_err(data_type, value);
assert!(err.contains("not nullable"), "{err}");
assert!(err.contains("'p'"), "{err}");
}

/// Type mismatch: every non-null Scalar variant against the wrong schema type.
Expand Down
5 changes: 4 additions & 1 deletion kernel/src/transaction/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -884,7 +884,10 @@ impl<S: SupportsDataFiles> Transaction<S> {
/// - **Type checking**: rejects non-primitive partition column types (struct, array, map) and
/// validates that each non-null `Scalar`'s type matches the partition column's schema type.
/// For example, passing `Scalar::String("2024")` for an `INTEGER` column returns an error.
/// Null scalars skip the value type check (null is valid for any primitive partition column).
/// Null-equivalent scalars (null scalars, empty strings, and empty binary) all of which
/// collapse to JSON null in `partitionValues`) skip the value type check, but they are only
/// legal when the partition column is nullable; passing any of these for a `nullable: false`
/// partition column returns an error.
///
/// - **Value serialization**: serializes each `Scalar` to a protocol-compliant string per the
/// Delta protocol's "Partition Value Serialization" rules. `Scalar::Null(...)` becomes `None`
Expand Down
Loading
Loading