feat: Write parquet id for nested items in map/array according to metadata #2508
dengsh12 wants to merge 13 commits into delta-io:main
Conversation
Force-pushed from d9cde6e to f01f689
Force-pushed from f01f689 to 3078f79
PR title does not match the required pattern. Please ensure you follow the conventional commits spec. Your title should start with Title:
Force-pushed from 64b3fcf to d058829
```rust
// 3. Rebuild the element field. `apply_schema` applies kernel's schema, so the input arrow
//    field's metadata comes entirely from the kernel side. Parquet field ID is the only
//    metadata that should be set on the element field here. Since kernel's
//    `ArrayType::element_type` is a bare `DataType` (not a `StructField`), there is no
//    per-element metadata on the kernel side. The element's Parquet field ID comes from the
//    ancestor's metadata.
let transformed_arrow_element_field = ArrowField::new(
```
Context here: https://github.com/delta-io/delta-kernel-rs/pull/331/changes#r1807071289 (we stripped the metadata in the past as well).
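For concreteness, here is a minimal pure-std sketch of the rule the comment above describes. `element_metadata`, the key name, and the `"element"` path are illustrative assumptions, not the kernel's actual API:

```rust
use std::collections::HashMap;

// Hypothetical key name, modeled on the column-mapping metadata keys.
const PARQUET_FIELD_ID: &str = "parquet.field.id";

// Given the ancestor's nested-id map (e.g. {"element": 5}), build the
// metadata for the rebuilt element field. The parquet field id is the
// ONLY entry, since kernel's `ArrayType::element_type` is a bare
// `DataType` and carries no per-element metadata of its own.
fn element_metadata(ancestor_nested_ids: &HashMap<String, i64>) -> HashMap<String, String> {
    let mut md = HashMap::new();
    if let Some(id) = ancestor_nested_ids.get("element") {
        md.insert(PARQUET_FIELD_ID.to_string(), id.to_string());
    }
    md
}
```

For example, an ancestor map of `{"element": 5}` yields exactly one metadata entry on the rebuilt element field, and an empty ancestor map yields none.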
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #2508      +/-   ##
==========================================
+ Coverage   88.50%   88.57%   +0.07%
==========================================
  Files         179      179
  Lines       59031    59572     +541
  Branches    59031    59572     +541
==========================================
+ Hits        52245    52766     +521
- Misses       4785     4791       +6
- Partials     2001     2015      +14
```

☔ View full report in Codecov by Sentry.
```rust
/// Recursive Arrow-type builder that propagates nested field IDs from `ancestor`'s
```

```rust
// deconstruct the array, then rebuild the mapped version
// Apply a kernel [`ArrayType`] to an Arrow [`ListArray`].
```
Can you please give an example, and clarify what "apply" means?
It's not intuitive to me what "apply a kernel ArrayType to an Arrow ListArray" means.
i.e. to me that is just as confusing as "apply a kernel integer to an arrow integer". I don't know what "apply" means.
Added a detailed example and explanation (centralized in apply_schema_to_inner to avoid repeated comments).
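As a rough illustration of what "apply" means here (a toy model, not the kernel's `apply_schema`): the arrow data's layout is untouched, but its type tree is rebuilt so that names and metadata come from the kernel side. `ToyType` and `apply` below are invented for illustration:

```rust
// Toy model: "applying" a kernel type to arrow data rebuilds the *type
// tree* so names come from the kernel side while the shape must match.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int,
    List(Box<ToyType>),
    Struct(Vec<(String, ToyType)>),
}

// "Apply" kernel onto arrow: on shape mismatch, error; otherwise take
// field names from the kernel side, recursing into nested types.
fn apply(kernel: &ToyType, arrow: &ToyType) -> Result<ToyType, String> {
    match (kernel, arrow) {
        (ToyType::Int, ToyType::Int) => Ok(ToyType::Int),
        (ToyType::List(k), ToyType::List(a)) => Ok(ToyType::List(Box::new(apply(k, a)?))),
        (ToyType::Struct(kf), ToyType::Struct(af)) if kf.len() == af.len() => {
            let mut fields = Vec::new();
            for (i, (kname, kt)) in kf.iter().enumerate() {
                let (_aname, at) = &af[i];
                // Kernel-side name wins; arrow-side name is discarded.
                fields.push((kname.clone(), apply(kt, at)?));
            }
            Ok(ToyType::Struct(fields))
        }
        _ => Err(format!("shape mismatch: {kernel:?} vs {arrow:?}")),
    }
}
```

So applying `struct<kernel_name: list<int>>` onto data typed `struct<arrow_name: list<int>>` succeeds and renames the field, while applying `int` onto `list<int>` fails.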
```rust
// deconstruct a map, and rebuild it with the specified target kernel type
fn apply_schema_to_map(array: &dyn Array, kernel_map_type: &MapType) -> DeltaResult<MapArray> {
// Apply a kernel [`MapType`] to an Arrow [`MapArray`].
```
Added a detailed example and explanation (centralized in apply_schema_to_inner to avoid repeated comments).
Force-pushed from 34ce46d to 766ace4
Force-pushed from 766ace4 to c648031
…lta-io#2504)

## Stacked PR
Use this [link](https://github.com/delta-io/delta-kernel-rs/pull/2504/files) to review incremental changes.
- [stack/add-v3-feature](delta-io#2475) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2475/files)] [MERGED]
- [**stack/write-part-when-ice3**](delta-io#2504) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2504/files)]
- [stack/block-alter-table](delta-io#2505) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2505/files/15d151f8b271a72cb421ccfa81cda7d5887d1b37..562605f3911acd99854efae7940deb9344e7ee55)]
- [stack/write-nested-id](delta-io#2508) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/2508/files/562605f3911acd99854efae7940deb9344e7ee55..766ace41c2176c546f533be4ff0d7a9aa24ae001)]

## What changes are proposed in this pull request?
Materialize partition values when icebergCompatV3 is enabled.

## How was this change tested?
1. When `icebergCompatV3` is enabled, partition columns are included in `physical_write_schema`.
2. When `icebergCompatV3` is supported but not enabled, partition columns are not included in `physical_write_schema`.

Note: other related cases (e.g. `MaterializePartitionColumns`) are already covered by existing tests, so we are not adding tests here.
```rust
let id: i64 = v.parse().map_err(|_| {
    ArrowError::SchemaError(format!(
        "'{k}' on field '{}' must be an integer, got '{v}'",
        arrow_field.name()
    ))
})?;
return Ok((
    ColumnMetadataKey::ParquetFieldId.as_ref().to_string(),
    MetadataValue::Number(id),
));
}
Ok((k.clone(), MetadataValue::from(v)))
```
Note: this is a fix. Previously we filled ColumnMetadataKey::ParquetFieldId with a string, but per
delta-kernel-rs/kernel/src/engine/arrow_utils.rs (lines 955 to 958 in 78670eb)
it should be MetadataValue::Number.
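A minimal pure-std sketch of the fix being described; `normalize_field_id` is an illustrative name, not the kernel's API. The point is that the string-valued arrow metadata must be parsed into a number before being stored:

```rust
// Parse a string-valued parquet field id from arrow metadata into an i64,
// mirroring the diff above: store a number (MetadataValue::Number in the
// kernel), not the original string.
fn normalize_field_id(key: &str, value: &str, field_name: &str) -> Result<(String, i64), String> {
    let id: i64 = value
        .parse()
        .map_err(|_| format!("'{key}' on field '{field_name}' must be an integer, got '{value}'"))?;
    Ok((key.to_string(), id))
}
```

A non-numeric value like `"not-a-number"` is rejected with a schema-style error instead of being silently stored as a string.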
```rust
if !nested_ids.is_empty() {
    kernel_metadata.insert(
        ColumnMetadataKey::ParquetFieldNestedIds
            .as_ref()
            .to_string(),
        MetadataValue::Other(serde_json::Value::Object(nested_ids)),
    );
}
```
Note: for simplicity I'm not validating whether arrow_metadata already contains a ColumnMetadataKey::ParquetFieldNestedIds that differs from what collect_nested_field_ids_from_arrow produces (it feels like defense in depth). If arrow_metadata contains a ColumnMetadataKey::ParquetFieldNestedIds that differs from collect_nested_field_ids_from_arrow, it would be invalid data. But let me know if you think it's worth defending against here!
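To sketch what a collector like `collect_nested_field_ids_from_arrow` gathers (a pure-std toy; `ToyField` and `collect` are illustrative, not the kernel's types): nested ids form a flat map from a dotted path to a field id, which is then stored under `parquet.field.nested.ids`:

```rust
use std::collections::BTreeMap;

// Toy field tree standing in for an arrow field with children.
struct ToyField {
    name: String,
    field_id: Option<i64>,
    children: Vec<ToyField>,
}

// Walk the tree and record each *descendant's* id keyed by its dotted
// path. The root's own id is excluded: it belongs in parquet.field.id,
// not in the nested-ids map.
fn collect(field: &ToyField, prefix: &str, out: &mut BTreeMap<String, i64>) {
    for child in &field.children {
        let path = if prefix.is_empty() {
            child.name.clone()
        } else {
            format!("{prefix}.{}", child.name)
        };
        if let Some(id) = child.field_id {
            out.insert(path.clone(), id);
        }
        collect(child, &path, out);
    }
}
```

For an `array<array<int>>`-like column with ids 2 and 3 on the two element levels, this yields `{"element": 2, "element.element": 3}`.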
🥞 Stacked PR
Use this link to review incremental changes.

What changes are proposed in this pull request?
Added the functionality to attach the parquet field id for key/value/element in a kernel map/array when we transform a kernel schema to an arrow schema in `DefaultEngine`. Added comments on the `Engine` trait stating that engines need to be aware of the nested parquet id metadata as well.

Recommended review start: `ArrowField::try_from_kernel`, `apply_schema`, and `StructField::try_from_arrow`. These three functions are the entry points for transforming a kernel schema to an arrow schema, or for applying a kernel schema onto arrow data.

How was this change tested?
Integration test:
- Build a `RecordBatch` and kernel schema with nested parquet ids, write to a parquet file, and read the raw just-written parquet to validate the field ids are correctly written.

Unit tests:
For both `ArrowField::try_from_kernel` and `apply_schema`:
- Deeply nested `struct<map<array, struct<map<array, int>>>>` to validate parquet ids are attached properly.
- `parquet.field.nested.ids` and `delta.columnMapping.nested.ids`
- Error cases (`parquet.field.nested.ids` is not valid json, `parquet.field.nested.ids` contains a non-integer parquet id)

For `apply_schema`:

For `StructField::try_from_arrow`:
- Deeply nested `struct<map<array, struct<map<array, int>>>>` to validate the parquet nested id metadata is filled correctly.
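As a rough illustration of the round-trip shape the tests above describe (attach ids, write, read back, compare), here is a pure-std toy; `write_ids` and `read_ids` are invented stand-ins, not the actual test code:

```rust
use std::collections::BTreeMap;

// Toy round-trip: "write" nested ids into string-valued metadata and
// "read" them back, checking they survive unchanged. Keys mirror
// parquet.field.nested.ids entries such as "col.element".
fn write_ids(ids: &BTreeMap<String, i64>) -> BTreeMap<String, String> {
    ids.iter().map(|(k, v)| (k.clone(), v.to_string())).collect()
}

fn read_ids(md: &BTreeMap<String, String>) -> Result<BTreeMap<String, i64>, String> {
    md.iter()
        .map(|(k, v)| v.parse::<i64>().map(|n| (k.clone(), n)).map_err(|e| e.to_string()))
        .collect()
}
```

Reading back what was written returns the original map, and a non-integer id surfaces as an error rather than a silent corruption, which is the property the integration test checks against the raw parquet file.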