feat: add column mapping support for add_column#2389
feat: add column mapping support for add_column#2389william-ch-databricks wants to merge 2 commits intodelta-io:mainfrom
Conversation
36cd0da to
d943ca2
Compare
…2385) ## Stacked PR Use this [link](https://github.com/delta-io/delta-kernel-rs/pull/2385/files) to review incremental changes. - [**stack/alter-table-1-refactor-state**](#2385) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2385/files)] - [stack/alter-table-2-supports-data-files](#2386) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2386/files/029d6672081d32caea314dca61f3688e841f3036..50130c30c4378997fbf3d0ef7426ce23992e5c85)] - [stack/alter-table-3-framework-add-column](#2387) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2387/files/50130c30c4378997fbf3d0ef7426ce23992e5c85..eaa5277d025434aa78b79beed1a7cedfe82aa621)] - [stack/alter-table-4-set-nullable](#2388) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2388/files/eaa5277d025434aa78b79beed1a7cedfe82aa621..7506e1273ef349fbf628e4a6db5d0e7c1b4cbd8b)] - [stack/alter-table-5-column-mapping-add](#2389) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2389/files/7506e1273ef349fbf628e4a6db5d0e7c1b4cbd8b..d943ca2945da1ec9c354db57c889d93456381dee)] - [stack/alter-table-6-drop-column](#2390) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2390/files/d943ca2945da1ec9c354db57c889d93456381dee..40e3a1ff9e2915b456d0f0a76a50eb67f27a1ab1)] - [stack/alter-table-7-rename-column](#2391) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2391/files/40e3a1ff9e2915b456d0f0a76a50eb67f27a1ab1..628af3c02767f008a4e3c80a146aec7e5a3ac0b3)] --------- ## What changes are proposed in this pull request? Splits Transaction's snapshot into two concerns: - `read_snapshot_opt: Option<SnapshotRef>` -- the pre-commit table state (None for CREATE TABLE) - `effective_table_config: TableConfiguration` -- the config this commit will produce This separates "what did I read?" (conflict detection, post-commit snapshots) from "what will this commit produce?" (schema, protocol, stats, write context). Write-path call sites read from `effective_table_config`; read-path call sites use `read_snapshot()`. Also adds `should_emit_protocol` / `should_emit_metadata` flags to replace the old `is_create_table()` checks for Protocol/Metadata action emission, and replaces the synthetic pre-commit snapshot in CREATE TABLE with direct `TableConfiguration` construction. This is a pure refactor with no behaviour change. ## How was this change tested? All existing tests pass. Added unit tests for `LogSegment::new_for_version_zero` (valid input, non-zero version rejection, non-commit file rejection).
19fbe4b to
ca02602
Compare
ca02602 to
38face1
Compare
38face1 to
13e5a34
Compare
74af2a0 to
2133e27
Compare
cb5ca4e to
28be97c
Compare
…a-io#2386) ## Stacked PR Use this [link](https://github.com/delta-io/delta-kernel-rs/pull/2386/files) to review incremental changes. - [stack/alter-table-1-refactor-state](delta-io#2385) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2385/files)] [MERGED] - [**stack/alter-table-2-supports-data-files**](delta-io#2386) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2386/files)] - [stack/alter-table-3-framework-add-column](delta-io#2387) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2387/files/a71c8b8c83d33b1ab882cae984973a76aafc3a31..6dd1b32e8c5163018f2c836f733b58fc4bdc2dc2)] - [stack/alter-table-4-set-nullable](delta-io#2388) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2388/files/6dd1b32e8c5163018f2c836f733b58fc4bdc2dc2..d121f77a73a37f12421c42dc5c4a5626ad2a1e10)] - [stack/alter-table-5-column-mapping-add](delta-io#2389) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2389/files/d121f77a73a37f12421c42dc5c4a5626ad2a1e10..cb5ca4e01024e50ca6b5b0576bbeb9bb6b184d17)] - [stack/alter-table-6-drop-column](delta-io#2390) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2390/files/cb5ca4e01024e50ca6b5b0576bbeb9bb6b184d17..6e91f960109a4d4db662e417780bb4beaac88f5f)] - [stack/alter-table-7-rename-column](delta-io#2391) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2391/files/6e91f960109a4d4db662e417780bb4beaac88f5f..300d676946d785eb195018e2a9ee696c1f6623a6)] - [stack/alter-table-8-operation-string](delta-io#2447) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2447/files/300d676946d785eb195018e2a9ee696c1f6623a6..fee7c029d62cfb55b809ced35c0cb0553ddaf3e7)] - [stack/alter-table-9-is-blind-append](delta-io#2448) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2448/files/fee7c029d62cfb55b809ced35c0cb0553ddaf3e7..3d44b51e0d037c8f33ffbece2d7deecd800d5544)] --------- ## What changes are proposed in this pull request? Introduce a `SupportsDataFiles` marker trait to gate data-file methods at compile time. This prevents future metadata-only transaction types (like `AlterTable`) from accessing methods that add, remove, or configure data files. **Gated behind `SupportsDataFiles` (only `ExistingTable` and `CreateTable`):** | Method | Reason | |--------|--------| | `stats_schema()` | Returns table-specific stats schema for writing files | | `stats_columns()` | Returns column names needing stats collection | | `generate_logical_to_physical()` | Builds partition column transform for file writing | | `get_write_context()` | Returns full write context (target dir, schemas, column mapping) | | `add_files()` | Queues file additions | **Kept on `impl<S> Transaction<S>` (all transaction types):** | Method | Reason | |--------|--------| | `commit()` | Shared commit path; handles empty data gracefully | | `with_data_change()` / `set_data_change()` | ALTER TABLE sets `data_change: false` | | `with_engine_info()` | Applicable to all commits | | `with_commit_info()` | Applicable to all commits | | `with_transaction_id()` | Applicable to all commits | | `with_domain_metadata()` | Schema changes may need domain metadata | | `add_files_schema()` | Read-only accessor; used by `generate_adds()` which handles empty state internally. Kept ungated so it remains available if it becomes dynamic in the future. | | `validate_add_files_stats()` | Internal; handles empty data with early return | | `generate_adds()` | Internal; returns early when no files queued | | `generate_remove_actions()` | Internal; returns early when no files queued | | `generate_dv_update_actions()` | Internal; returns early when no DV updates | | `generate_adds_for_dv_update()` | Internal; only called from `generate_dv_update_actions` | | `build_crc_delta()` | Internal; shared commit path | | `into_committed()` / `into_conflicted()` / `into_retryable()` | Internal; shared commit path | ## How was this change tested? All existing tests pass unchanged. The trait adds compile-time safety with no runtime behavior change. A `compile_fail` doctest will be added in the follow-up PR that introduces `AlterTable` to verify that data file methods are rejected at compile time.
28be97c to
80b72bb
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2389 +/- ##
==========================================
+ Coverage 88.40% 88.47% +0.06%
==========================================
Files 178 178
Lines 58235 58669 +434
Branches 58235 58669 +434
==========================================
+ Hits 51485 51905 +420
- Misses 4783 4790 +7
- Partials 1967 1974 +7 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
f396c64 to
ae41dcb
Compare
41651eb to
a2b5b8b
Compare
scottsand-db
left a comment
There was a problem hiding this comment.
Looks awesome! Left some comments.
| /// | ||
| /// Returns [`Error::InvalidProtocol`] if the property is present but not a non-negative | ||
| /// integer. | ||
| pub(crate) fn max_column_id(&self) -> DeltaResult<Option<i64>> { |
There was a problem hiding this comment.
Don't we have this already as a TableProperty? It should parse it for us already, though I'm not sure it does the non-negative check.
| if has_id || has_physical_name { | ||
| return Err(Error::generic(format!( | ||
| "Field '{}' already has column mapping metadata. \ | ||
| Pre-existing column mapping metadata is not supported for CREATE TABLE.", |
There was a problem hiding this comment.
Couldn't this just be a loop where we iterate over the two fields, look each one up, and then return a very specific error message for that exact field instead of referring to both fields in the error message?
| } | ||
|
|
||
| /// Returns the largest `delta.columnMapping.id` found anywhere in `schema`, including nested | ||
| /// data types. Returns `0` if no field carries a column mapping ID. |
There was a problem hiding this comment.
Should this function set the default, or should we return an Option and let the caller set the default to zero? I defer to you.
| pub(crate) struct SchemaEvolutionResult { | ||
| /// The evolved schema after all operations are applied. | ||
| pub schema: SchemaRef, | ||
| /// `Some(id)` if column-mapped columns were added and `delta.columnMapping.maxColumnId` |
There was a problem hiding this comment.
If column mapping is enabled, what's an example of a non-column-mapped column?
This sentence structure is a bit confusing. It's probably best to just say: if Some then X must be updated to this new value.
| /// | ||
| /// The field must not already exist in the schema (case-insensitive). The field must be | ||
| /// nullable because existing data files do not contain this column and will read NULL for it. | ||
| /// `field` must not carry `delta.columnMapping.id` or `delta.columnMapping.physicalName` |
There was a problem hiding this comment.
field and any of its nested fields must not carry
| let cm_enabled = column_mapping_mode != ColumnMappingMode::None; | ||
|
|
||
| // Take max of persisted `maxColumnId` and schema-walked max to guard against a | ||
| // non-conforming writer leaving the property stale. Matches delta-spark's `findMaxColumnId`. |
There was a problem hiding this comment.
Delete Matches delta-spark's findMaxColumnId.
| @@ -146,14 +161,6 @@ pub(crate) fn apply_schema_operations( | |||
| // `DELTA_FEATURES_REQUIRE_MANUAL_ENABLEMENT` and requires the user to enable the | |||
| // feature explicitly before adding such a column. | |||
| SchemaOperation::AddColumn { field } => { | |||
There was a problem hiding this comment.
From claude:
On a non-CM table, an AddColumn whose top-level field is clean but an inner struct field carries delta.columnMapping.id/physicalName will pass this check. The error is still caught later by make_physical/validate_schema_column_mapping, so this is not a soundness bug, but the error surfaces far from the source.
Either make reject_preexisting_cm_metadata recursive (preferred — symmetric with assign_field_column_mapping which is recursive), or document explicitly that callers rely on validate_schema_column_mapping for nested cases.
| // For CREATE TABLE, reject any pre-existing column mapping metadata. | ||
| // This avoids conflicts between user-provided IDs/physical names and the ones we assign. | ||
| // ALTER TABLE (adding columns) will need different handling in the future. | ||
| // TODO: Also check for nested column IDs (`delta.columnMapping.nested.ids`) once |
There was a problem hiding this comment.
Do we need to add this TODO back? cc @dengsh12 FYI
| #[rstest] | ||
| #[case::one_column(1)] | ||
| #[case::three_columns(3)] | ||
| #[case::no_cm_one_column(None, 1)] |
There was a problem hiding this comment.
You can do the cross-product more easily by just using rstest values instead of rstest case.
| } | ||
| } | ||
| assert!(max_column_id(&reloaded) > orig); | ||
| } |
There was a problem hiding this comment.
Else, if column mapping is not enabled, please add an assertion that metadata().configuration().get("delta.columnMapping.maxColumnId").is_none()
a2b5b8b to
957a912
Compare
Implement the SetNullable schema operation which changes a column's nullability from NOT NULL to nullable. If the column is already nullable, this is a no-op. Only the safe direction is allowed (NOT NULL -> nullable) per the Delta protocol.
When column mapping is enabled (mode = name or id), newly added columns via ALTER TABLE ADD COLUMN are now assigned column mapping metadata (unique ID and UUID-based physical name). The maxColumnId table property is updated in the evolved metadata configuration. Reuses the existing assign_field_column_mapping function (now pub(crate)) which handles recursive assignment for nested types.
957a912 to
6a7ccf9
Compare
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
Extends
ALTER TABLE ... ADD COLUMNto work on tables with column mapping enabled. New columns get a fresh column ID and a UUID-based physical name assigned, anddelta.columnMapping.maxColumnIdis updated in table properties.How was this change tested?