feat: Write parquet id for nested items in map/array according to metadata #2508
dengsh12 wants to merge 13 commits into delta-io:main
Conversation
Force-pushed from d9cde6e to f01f689, then from f01f689 to 3078f79.
PR title does not match the required pattern. Please ensure you follow the conventional commits spec.
Force-pushed from 64b3fcf to d058829.
// 3. Rebuild the element field. `apply_schema` applies kernel's schema, so the input arrow
//    field's metadata comes entirely from the kernel side. Parquet field ID is the only
//    metadata that should be set on the element field here. Since kernel's
//    `ArrayType::element_type` is a bare `DataType` (not a `StructField`), there is no
//    per-element metadata on the kernel side. The element's Parquet field ID comes from the
//    ancestor's metadata.
let transformed_arrow_element_field = ArrowField::new(
https://github.com/delta-io/delta-kernel-rs/pull/331/changes#r1807071289 has context here (we stripped the metadata in the past as well).
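To make the "element's Parquet field ID comes from the ancestor's metadata" point concrete, here is a minimal stdlib-only Rust sketch. The function name and the plain `HashMap` representation of the nested-ids metadata are illustrative assumptions, not the kernel's actual API:

```rust
use std::collections::HashMap;

/// Illustrative sketch: when rebuilding a list's element field, the element
/// carries no kernel-side metadata of its own (kernel's `ArrayType::element_type`
/// is a bare `DataType`), so its parquet field id is looked up in the *ancestor*
/// field's nested-ids map, keyed by a relative path such as "element".
fn element_field_id(ancestor_nested_ids: &HashMap<String, i64>) -> Option<i64> {
    ancestor_nested_ids.get("element").copied()
}

fn main() {
    let mut nested_ids = HashMap::new();
    nested_ids.insert("element".to_string(), 5);
    nested_ids.insert("element.element".to_string(), 6); // a list-of-list child
    assert_eq!(element_field_id(&nested_ids), Some(5));
    assert_eq!(element_field_id(&HashMap::new()), None);
}
```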
Codecov Report
❌ Patch coverage is … Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2508      +/-   ##
==========================================
+ Coverage   88.50%   88.57%   +0.07%
==========================================
  Files         179      179
  Lines       59031    59572     +541
  Branches    59031    59572     +541
==========================================
+ Hits        52245    52766     +521
- Misses       4785     4791       +6
- Partials     2001     2015      +14

☔ View full report in Codecov by Sentry.
/// Recursive Arrow-type builder that propagates nested field IDs from `ancestor`'s

- // deconstruct the array, then rebuild the mapped version
+ // Apply a kernel [`ArrayType`] to an Arrow [`ListArray`].
Can you please give an example, and clarify what "apply" means?
It's not intuitive to me what "apply a kernel ArrayType to an Arrow ListArray" means; to me that is just as confusing as "apply a kernel integer to an arrow integer". I don't know what "apply" means.
Added a detailed example and explanation (centralized in apply_schema_to_inner to avoid repeated comments).
- // deconstruct a map, and rebuild it with the specified target kernel type
+ // Apply a kernel [`MapType`] to an Arrow [`MapArray`].
  fn apply_schema_to_map(array: &dyn Array, kernel_map_type: &MapType) -> DeltaResult<MapArray> {
Added a detailed example and explanation (centralized in apply_schema_to_inner to avoid repeated comments).
Force-pushed from 34ce46d to 766ace4, then from 766ace4 to c648031.
…lta-io#2504)

## Stacked PR
Use this [link](https://github.com/delta-io/delta-kernel-rs/pull/2504/files) to review incremental changes.
- [stack/add-v3-feature](delta-io#2475) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2475/files)] [MERGED]
- [**stack/write-part-when-ice3**](delta-io#2504) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2504/files)]
- [stack/block-alter-table](delta-io#2505) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2505/files/15d151f8b271a72cb421ccfa81cda7d5887d1b37..562605f3911acd99854efae7940deb9344e7ee55)]
- [stack/write-nested-id](delta-io#2508) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2508/files/562605f3911acd99854efae7940deb9344e7ee55..766ace41c2176c546f533be4ff0d7a9aa24ae001)]

## What changes are proposed in this pull request?
Materialize partition values when icebergCompatV3 is enabled.

## How was this change tested?
1. When `icebergCompatV3` is enabled, partition columns are included in `physical_write_schema`.
2. When `icebergCompatV3` is supported but not enabled, partition columns are not included in `physical_write_schema`.

Note: other related cases (e.g. `MaterializePartitionColumns`) are already covered by existing tests, so we are not adding tests here.
let id: i64 = v.parse().map_err(|_| {
    ArrowError::SchemaError(format!(
        "'{k}' on field '{}' must be an integer, got '{v}'",
        arrow_field.name()
    ))
})?;
return Ok((
    ColumnMetadataKey::ParquetFieldId.as_ref().to_string(),
    MetadataValue::Number(id),
));
}
Ok((k.clone(), MetadataValue::from(v)))
Note: this is a fix. Previously we filled ColumnMetadataKey::ParquetFieldId with a string, but per delta-kernel-rs/kernel/src/engine/arrow_utils.rs, Lines 955 to 958 in 78670eb, it should be MetadataValue::Number.
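A stdlib-only sketch of the parse-to-integer fix described above. `MetadataValue` and `convert_entry` here are simplified stand-ins for the kernel's actual types, and the key string `"parquet.field.id"` is written out literally instead of going through `ColumnMetadataKey`:

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the kernel's metadata value enum.
#[derive(Debug, PartialEq)]
enum MetadataValue {
    Number(i64),
    String(String),
}

/// Convert one arrow metadata entry into a kernel metadata entry, parsing
/// `parquet.field.id` values into integers instead of keeping them as strings.
fn convert_entry(k: &str, v: &str, field_name: &str) -> Result<(String, MetadataValue), String> {
    if k == "parquet.field.id" {
        let id: i64 = v
            .parse()
            .map_err(|_| format!("'{k}' on field '{field_name}' must be an integer, got '{v}'"))?;
        return Ok((k.to_string(), MetadataValue::Number(id)));
    }
    // All other metadata entries pass through unchanged.
    Ok((k.to_string(), MetadataValue::String(v.to_string())))
}

fn main() {
    let mut meta = HashMap::new();
    meta.insert("parquet.field.id".to_string(), "7".to_string());
    for (k, v) in &meta {
        let (key, val) = convert_entry(k, v, "col_a").unwrap();
        println!("{key} -> {val:?}");
    }
    // A non-integer id must be rejected with a schema error.
    assert!(convert_entry("parquet.field.id", "abc", "col_a").is_err());
}
```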
if !nested_ids.is_empty() {
    kernel_metadata.insert(
        ColumnMetadataKey::ParquetFieldNestedIds
            .as_ref()
            .to_string(),
        MetadataValue::Other(serde_json::Value::Object(nested_ids)),
    );
}
Note: for simplicity I'm not validating whether arrow_metadata already contains a ColumnMetadataKey::ParquetFieldNestedIds that differs from what collect_nested_field_ids_from_arrow produces (that feels like defense in depth). If arrow_metadata contained a different value, the data would be invalid anyway. But let me know if you think it's worth defending here!
On second thought, we probably should not error: in the past we may have written to a column-mapping (but not icebergCompat) table that had delta.columnMapping.nested.ids set without writing parquet ids. If we errored here, those tables would become unreadable.
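For reference, a hedged sketch of the shape the nested-ids metadata takes: a JSON object mapping a relative path inside the column (e.g. "key", "value", "element", or deeper paths like "value.element") to a parquet field id. The kernel builds this with serde_json; the stdlib-only helper below only illustrates the shape, and the example paths and ids are made up:

```rust
use std::collections::BTreeMap;

/// Render a nested-ids map as a JSON object string. This is an illustrative,
/// assumption-laden stand-in for serde_json serialization, using only the
/// stdlib (a BTreeMap keeps the key order deterministic).
fn nested_ids_json(ids: &BTreeMap<String, i64>) -> String {
    let body: Vec<String> = ids.iter().map(|(k, v)| format!("\"{k}\":{v}")).collect();
    format!("{{{}}}", body.join(","))
}

fn main() {
    // For a hypothetical column `m: map<string, array<int>>` the nested ids
    // might look like this:
    let mut ids = BTreeMap::new();
    ids.insert("key".to_string(), 2);
    ids.insert("value".to_string(), 3);
    ids.insert("value.element".to_string(), 4);
    println!("{}", nested_ids_json(&ids));
    // → {"key":2,"value":3,"value.element":4}
}
```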
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
Added the functionality to attach a parquet field id to the key/value/element of a kernel map/array when we transform a kernel schema to an arrow schema in DefaultEngine. Also added comments on the engine trait stating that engines need to be aware of the nested parquet id metadata as well.

Recommended review start: ArrowField::try_from_kernel, apply_schema, and StructField::try_from_arrow. These three functions are the entry points for transforming a kernel schema to an arrow schema, or for applying a kernel schema onto arrow data.

How was this change tested?
Integration test:
- Build a RecordBatch and a kernel schema with nested parquet ids, write to a parquet file, then read the raw just-written parquet to validate the field ids are correctly written.

Unit tests:
For both ArrowField::try_from_kernel and apply_schema:
- A deeply nested schema struct<map<array, struct<map<array,int>>>> to validate parquet ids are attached properly.
- Both parquet.field.nested.ids and delta.columnMapping.nested.ids.
- Error cases (parquet.field.nested.ids is not valid JSON; parquet.field.nested.ids contains a non-integer parquet id).

For apply_schema

For StructField::try_from_arrow:
- A deeply nested schema struct<map<array, struct<map<array,int>>>> to validate the nested parquet id metadata is filled correctly.